Giovanna Nicora, Susanna Zucca, Ivan Limongelli, Riccardo Bellazzi, Paolo Magni.
Abstract
Genomic variant interpretation is a critical step of the diagnostic procedure, often supported by tools that predict the damaging impact of each variant or provide a guidelines-based classification. We propose applying Machine Learning methodologies, in particular Penalized Logistic Regression, to support variant classification and prioritization. Our approach combines the ACMG/AMP guidelines for germline variant interpretation with variant annotation features and provides a probabilistic score of pathogenicity, thus supporting the prioritization and classification of variants that the ACMG/AMP guidelines would interpret as uncertain. We compared different approaches in terms of variant prioritization and classification on different datasets, showing that our data-driven approach is able to resolve more variant of uncertain significance (VUS) cases than guidelines-based approaches and in silico prediction tools.
Year: 2022 PMID: 35169226 PMCID: PMC8847497 DOI: 10.1038/s41598-022-06547-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
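The abstract describes a penalized logistic regression that turns guideline criteria and annotation features into a probabilistic pathogenicity score. The following is a minimal sketch of that idea, not the authors' implementation: the L2 penalty, the gradient-descent fitting, the three feature columns and the toy data are all assumptions made for illustration, since the record above does not specify the penalty type or the feature encoding.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_l2_logistic(X, y, lam=0.01, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss plus an L2 penalty
    (lam / 2) * ||w||^2. The bias term is not penalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + lam * w[j])
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Probability that the variant is pathogenic under the fitted model.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical toy features per variant:
# [pathogenic-evidence flag, benign-evidence flag, scaled in silico score]
X = [
    [1, 0, 0.9], [1, 0, 0.7], [1, 0, 0.8], [0, 0, 0.9],
    [0, 1, 0.1], [0, 1, 0.3], [0, 1, 0.2], [0, 0, 0.1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = pathogenic, 0 = benign

w, b = fit_l2_logistic(X, y)
p_path = predict_proba(w, b, [1, 0, 0.85])
p_benign = predict_proba(w, b, [0, 1, 0.15])
```

The penalty keeps the weights finite even when the toy classes are perfectly separable, which is one common motivation for penalization when many binary evidence flags are used as features.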
Datasets collected and purpose.
| Group | Dataset name | Purpose | # of variants |
|---|---|---|---|
| Model building | Clinvitae training | Training | 8,496 |
| | Clinvitae probability threshold tuning (PTT) | Tuning the probability threshold for classification | 4,247 |
| Model validation | Clinvitae test | Comparison between different ML methods and the pathogenicity score in[ | 1,415 |
| | Clinvitae Validation | Testing classification of the selected ML method, in comparison with the pathogenicity score and the bayesian score | 161,744 |
| | ICR639 | Testing classification and prioritization of the selected ML method on a real dataset, in comparison with the pathogenicity score, the bayesian score, CADD and VVP | 18,046 |
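The PTT set is used to tune the probability threshold that converts a pathogenicity score into a class label. A rough sketch of such a tuning step follows. The selection criterion (F1 here), the candidate grid and the toy scores are assumptions; the table does not state which criterion was optimized.

```python
def f1_at_threshold(probs, labels, t):
    # A variant is called pathogenic when its probability is >= t.
    tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(probs, labels, candidates=None):
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    # max() keeps the first maximizer, so ties resolve to the lowest threshold.
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Hypothetical held-out scores and labels (1 = pathogenic, 0 = benign).
probs = [0.05, 0.20, 0.35, 0.40, 0.62, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

best_t = tune_threshold(probs, labels)
best_f1 = f1_at_threshold(probs, labels, best_t)
```

Tuning on a set disjoint from both training and test data, as the dataset split above does, avoids selecting a threshold that is optimistic for the data used to report performance.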
Figure 1. Proportion of benign and pathogenic variants in (A) Clinvitae training, PTT and test sets, (B) Clinvitae validation set, (C) ICR639 hereditary cancer dataset.
Results of logistic regression A approach (LR-A), logistic regression B approach (LR-B) and pathogenicity score (PS) on the Clinvitae test set.
| Metric | LR-A | LR-B | PS |
|---|---|---|---|
| Accuracy | 0.9752 | 0.9780 | 0.9597 |
| Precision | 0.9889 | 0.9926 | 0.9941 |
| AUC | 0.9708 | 0.9737 | 0.9505 |
| F1 | 0.9684 | 0.9720 | 0.9472 |
| Recall | 0.9487 | 0.9522 | 0.9045 |
| Balanced accuracy | 0.9708 | 0.9737 | 0.9505 |
| MCC | 0.9486 | 0.9546 | 0.9174 |
| PRC | 0.9587 | 0.9643 | 0.9374 |
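Most of the metrics in this table follow directly from the four confusion-matrix counts. The sketch below shows those definitions; the counts are hypothetical and are not derived from the Clinvitae test set. It assumes non-degenerate counts, so no denominator is zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from confusion-matrix counts.
    Assumes every denominator is non-zero."""
    recall = tp / (tp + fn)          # sensitivity, true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denom,
    }

# Hypothetical counts for illustration only.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

For hard class labels the ROC curve has a single operating point, and the area under it equals the balanced accuracy. That would be consistent with the identical AUC and balanced accuracy rows in the table, although the record does not say how AUC was computed.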
Performance of logistic regression A approach (LR-A), logistic regression B approach (LR-B), pathogenicity score (PS) and the Bayesian approach (BS) on the entire Clinvitae validation set (“All” columns) and on the subset of Clinvitae variants that are interpreted as VUS by the ACMG/AMP guidelines (“VUS” columns).
| Metric | LR-A (All) | LR-A (VUS) | LR-B (All) | LR-B (VUS) | PS (All) | PS (VUS) | BS (All) | BS (VUS) |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.9717 | 0.8630 | 0.9702 | 0.8507 | 0.9559 | 0.7793 | 0.9082 | 0.6161 |
| Precision | 0.9901 | 0.9582 | 0.9943 | 0.9703 | 0.9870 | 0.9435 | 0.9979 | 0.9679 |
| AUC | 0.9667 | 0.8429 | 0.9641 | 0.8270 | 0.9476 | 0.7443 | 0.8872 | 0.5489 |
| F1 | 0.9643 | 0.8147 | 0.9621 | 0.7921 | 0.9433 | 0.6633 | 0.8727 | 0.1817 |
| Recall | 0.9398 | 0.7087 | 0.9318 | 0.6692 | 0.9033 | 0.5114 | 0.7755 | 0.1003 |
| Balanced accuracy | 0.9667 | 0.8439 | 0.9641 | 0.8270 | 0.9476 | 0.7743 | 0.8872 | 0.5489 |
| MCC | 0.9419 | 0.7302 | 0.9542 | 0.7193 | 0.9098 | 0.5738 | 0.8184 | 0.2357 |
Figure 2. Precision and recall curves for LR-A, LR-B, BS and PS on all Clinvitae validation variants and on Clinvitae validation variants interpreted as VUS according to the ACMG/AMP guidelines.
Figure 3. Tie-aware normalized discounted cumulative gain (mean and standard deviation) computed on patients from the ICR639 hereditary cancer dataset.
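Tie-aware nDCG matters for variant prioritization because many variants can receive exactly the same score, and an arbitrary ordering within a tie would otherwise change the result. One common formulation, sketched below, gives each tied group its mean gain at every rank the group occupies, which equals the expected DCG over random orderings of the ties. Whether the paper uses this exact formulation is an assumption, as are the linear gain, the log2 discount and the example scores.

```python
import math
from itertools import groupby

def discount(rank):
    # Standard log2 discount; rank is 1-based.
    return 1.0 / math.log2(rank + 1)

def tie_aware_dcg(scores, gains):
    # Sort by descending score, then handle each group of equal scores at once.
    order = sorted(zip(scores, gains), key=lambda sg: sg[0], reverse=True)
    dcg, rank = 0.0, 1
    for _, group in groupby(order, key=lambda sg: sg[0]):
        group_gains = [g for _, g in group]
        n = len(group_gains)
        mean_gain = sum(group_gains) / n
        # The tied items share ranks rank .. rank + n - 1.
        dcg += mean_gain * sum(discount(r) for r in range(rank, rank + n))
        rank += n
    return dcg

def ideal_dcg(gains):
    ranked = sorted(gains, reverse=True)
    return sum(g * discount(r) for r, g in enumerate(ranked, start=1))

def tie_aware_ndcg(scores, gains):
    ideal = ideal_dcg(gains)
    return tie_aware_dcg(scores, gains) / ideal if ideal > 0 else 0.0

# Hypothetical patient with one causative variant (gain 1) among three.
perfect = tie_aware_ndcg([0.9, 0.5, 0.1], [1, 0, 0])
all_tied = tie_aware_ndcg([0.5, 0.5, 0.5], [1, 0, 0])
worst = tie_aware_ndcg([0.1, 0.5, 0.9], [1, 0, 0])
```

Computing this value per patient and then averaging would yield the kind of mean and standard deviation summarized in the figure caption.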