James M Holt, Brandon Wilk, Camille L Birch, Donna M Brown, Manavalan Gajapathy, Alexander C Moss, Nadiya Sosonkina, Melissa A Wilk, Julie A Anderson, Jeremy M Harris, Jacob M Kelly, Fariba Shaterferdosian, Angelina E Uno-Antonison, Arthur Weborg, Elizabeth A Worthey.
Abstract
BACKGROUND: When applying genomic medicine to a rare disease patient, the primary goal is to identify one or more genomic variants that may explain the patient's phenotypes. Typically, this is done through annotation, filtering, and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.
Keywords: Binary classification; Clinical genome sequencing; Variant prioritization
Year: 2019 PMID: 31615419 PMCID: PMC6792253 DOI: 10.1186/s12859-019-3026-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Feature selection
| Feature label | RF(sklearn) | BRF(imblearn) |
|---|---|---|
| HPO-cosine | 0.2895 | 0.2471 |
| PyxisMap | 0.2207 | 0.2079 |
| CADD Scaled | 0.1031 | 0.1007 |
| phylop100 conservation | 0.0712 | 0.0817 |
| phylop conservation | 0.0641 | 0.0810 |
| phastcon100 conservation | 0.0572 | 0.0628 |
| GERP rsScore | 0.0357 | 0.0416 |
| HGMD assessment type_DM | 0.0373 | 0.0344 |
| HGMD association confidence_High | 0.0309 | 0.0311 |
| Gnomad Genome total allele count | 0.0192 | 0.0322 |
| ClinVar Classification_Pathogenic | 0.0228 | 0.0200 |
| ADA Boost Splice Prediction | 0.0081 | 0.0109 |
| Random Forest Splice Prediction | 0.0077 | 0.0105 |
| Meta Svm Prediction_D | 0.0088 | 0.0092 |
| PolyPhen HV Prediction_D | 0.0075 | 0.0071 |
| Effects_Premature stop | 0.0049 | 0.0057 |
| SIFT Prediction_D | 0.0026 | 0.0056 |
| PolyPhen HD Prediction_D | 0.0025 | 0.0049 |
| Effects_Possible splicing modifier | 0.0029 | 0.0035 |
| ClinVar Classification_Likely Pathogenic | 0.0034 | 0.0020 |
This table shows the top 20 features that were used to train the classifiers ordered from most important to least important. After training, the two random forest classifiers report the importance of each feature in the classifier (total is 1.00 per classifier). We average the two importance values, and order them from most to least important. Feature labels with an ‘_’ represent a single category of a multi-category feature (i.e. “HGMD assessment type_DM” means the “DM” bin-count feature from the “HGMD assessment type” annotation in Codicem)
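The averaging step described in the caption can be sketched as follows. The data and feature indices are illustrative stand-ins for the annotated-variant feature matrix, and a second sklearn forest with a different seed stands in for imblearn's BalancedRandomForestClassifier so the sketch has no extra dependency:

```python
# Hedged sketch: averaging feature importances from two tree ensembles
# and ordering features from most to least important.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data standing in for the annotated-variant feature matrix.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# In the paper, one model is sklearn's RandomForest and the other is
# imblearn's BalancedRandomForest; two sklearn forests are used here
# purely so the sketch runs without the imbalanced-learn dependency.
rf_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Each classifier's importances sum to 1.0; average the two vectors and
# sort features from most to least important.
mean_imp = (rf_a.feature_importances_ + rf_b.feature_importances_) / 2.0
order = np.argsort(mean_imp)[::-1]
for idx in order:
    print(f"feature_{idx}: {mean_imp[idx]:.4f}")
```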
Classifier performance statistics
| Classifier | CV10 Acc. | AUROC | AUPRC |
|---|---|---|---|
| RandomForest(sklearn) | 0.84±0.13 | 0.9282 | 0.1961 |
| LogisticRegression(sklearn) | 0.84±0.13 | 0.9300 | 0.2458 |
| BalancedRandomForest(imblearn) | 0.86±0.11 | 0.9313 | 0.2015 |
| EasyEnsembleClassifier(imblearn) | 0.85±0.08 | 0.9303 | 0.1918 |
For each tuned classifier, we show performance measures commonly used for classifiers (from left to right): 10-fold cross-validation balanced accuracy (CV10 Acc.), area under the receiver operator curve (AUROC), and area under the precision-recall curve (AUPRC). The CV10 Acc. was gathered during hyperparameter tuning by calculating the average and standard deviation across the 10 cross-validation folds. AUROC and AUPRC were evaluated on the testing set after hyperparameter tuning and fitting to the full training set
Fig. 1 Receiver operator and precision-recall curves. These figures show the performance of the four classifiers on the testing set after hyperparameter tuning and fitting to the training set. On the left, we show the receiver operator curve (false positive rate against the true positive rate). On the right, we show the precision-recall curve. Area under the curve (AUROC or AUPRC) is reported beside each method in the legend
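A minimal sketch of how these three measures can be computed with scikit-learn, assuming synthetic stand-in data and using average precision as the AUPRC estimate (an assumption; the paper does not specify the exact estimator):

```python
# Hedged sketch: CV10 balanced accuracy on the training set, then
# AUROC/AUPRC on a held-out test set, mirroring the evaluation above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative imbalanced data; the real inputs are variant annotations.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000)

# CV10 Acc.: mean and standard deviation of 10-fold cross-validated
# balanced accuracy, as gathered during hyperparameter tuning.
cv = cross_val_score(clf, X_tr, y_tr, cv=10, scoring="balanced_accuracy")
print(f"CV10 Acc.: {cv.mean():.2f}±{cv.std():.2f}")

# AUROC / AUPRC on the test set after fitting the full training set.
probs = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(f"AUROC: {roc_auc_score(y_te, probs):.4f}")
print(f"AUPRC: {average_precision_score(y_te, probs):.4f}")
```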
Ranking performance statistics
Case rank, median (mean), by reported pathogenicity:

| Ranking System | All (n=189) | VUS (n=111) | LP (n=42) | Path. (n=36) |
|---|---|---|---|---|
| CADD Scaled | 57.0 (99.13) | 69.0 (107.78) | 39.5 (91.24) | 28.0 (81.67) |
| HPO-cosine | 22.0 (53.96) | 22.0 (56.05) | 26.0 (56.38) | 19.5 (44.69) |
| Exomiser(hiPhive) | 79.0 (105.34) | 85.0 (116.33) | 93.5 (101.10) | 34.0 (76.42) |
| Exomiser(hiPhive, human only) | 35.0 (53.60) | 37.0 (63.84) | 34.0 (45.60) | 24.5 (31.36) |
| Phen-Gen | 55.0 (48.66) | 65.0 (52.91) | 47.0 (47.48) | 24.0 (36.92) |
| DeepPVP | 15.0 (76.95) | 23.0 (79.68) | 19.5 (84.95) | 6.0 (59.19) |
| RandomForest(sklearn) | 10.0 (29.64) | 15.0 (39.27) | 8.0 (20.07) | 4.0 (11.11) |
| LogisticRegression(sklearn) | 6.0 (29.24) | 14.0 (39.87) | 3.0 (22.05) | 1.0 (4.83) |
| BalancedRandomForest(imblearn) | 8.0 (28.24) | 14.0 (38.64) | 5.0 (17.67) | 3.0 (8.50) |
| EasyEnsembleClassifier(imblearn) | 7.0 (28.72) | 15.0 (40.15) | 6.0 (18.40) | 2.0 (5.50) |
This table shows the ranking performance statistics for all methods evaluated on our test set. CADD Scaled and HPO-cosine are single-value measures that were used as inputs to the classifiers we tested. The middle four rows (two Exomiser runs, Phen-Gen, and DeepPVP) represent external tools that ranked the same set of variants as the classifier algorithms. Phen-Gen was the only external tool that did not rank every variant in the set, so we conservatively assumed its unranked variants occupied the next best positions. The bottom four rows are the tuned binary classification methods tested in this paper. Each method was used to rank (prioritize) the Codicem-filtered variants from each proband in the test set, and the position of reported variants was recorded such that lower values indicate better performance, with “1” indicating the first variant in the list. The “Case Rank” columns show the median and mean ranks for all reported variants, along with the variants split by their reported pathogenicity (variant of uncertain significance (VUS), likely pathogenic (LP), or pathogenic (Path.)) derived from the ACMG guidelines. All values in this table were generated using only the Codicem-filtered variants from the testing set
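The per-case ranking step described in the caption can be sketched as below; `case_ranks` is a hypothetical helper and the two cases are toy data, with candidate variants identified by their index in each case's list:

```python
# Hedged sketch: rank candidate variants per case by classifier
# probability and collect the ranks of the reported variants.
import numpy as np

def case_ranks(prob_by_case, reported_by_case):
    """For each case, rank candidates by descending probability
    (rank 1 = top of the list) and return the rank of each
    reported variant. Hypothetical helper for illustration."""
    ranks = []
    for probs, reported in zip(prob_by_case, reported_by_case):
        order = np.argsort(-np.asarray(probs))  # descending probability
        rank_of = {var: r + 1 for r, var in enumerate(order)}
        ranks.extend(rank_of[v] for v in reported)
    return ranks

# Toy example: two cases with five candidate variants each.
ranks = case_ranks(
    prob_by_case=[[0.1, 0.9, 0.4, 0.2, 0.05], [0.3, 0.2, 0.8, 0.1, 0.7]],
    reported_by_case=[[1], [2]],
)
print(np.median(ranks), np.mean(ranks))  # both reported variants rank 1st
```

The median and mean of these ranks, split by reported pathogenicity, give the columns in the table above.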
Top variant statistics. This table shows the ranking performance statistics for all methods evaluated on our test set (same order as Table 3)
Percentage of reported variants in the top X ranked variants, for X = 1, 10, 20:

| Ranking System | All (n=189) | VUS (n=111) | LP (n=42) | Path. (n=36) |
|---|---|---|---|---|
| CADD Scaled | 4, 17, 24 | 0, 9, 15 | 7, 21, 30 | 13, 41, 47 |
| HPO-cosine | 7, 32, 47 | 7, 31, 48 | 7, 28, 40 | 8, 38, 50 |
| Exomiser(hiPhive) | 7, 29, 36 | 6, 30, 36 | 2, 16, 28 | 16, 38, 44 |
| Exomiser(hiPhive, human only) | 7, 28, 37 | 6, 28, 36 | 2, 16, 30 | 16, 38, 50 |
| Phen-Gen | 4, 21, 30 | 5, 20, 27 | 4, 16, 26 | 2, 27, 44 |
| DeepPVP | 11, 42, 52 | 4, 36, 47 | 16, 42, 50 | 27, 61, 72 |
| RandomForest(sklearn) | 16, 53, 65 | 9, 45, 55 | 19, 61, 76 | 36, 69, 80 |
| LogisticRegression(sklearn) | 23, 58, 72 | 13, 44, 62 | 26, 71, 80 | 52, 88, 94 |
| BalancedRandomForest(imblearn) | 16, 55, 67 | 9, 44, 57 | 23, 66, 76 | 33, 77, 86 |
| EasyEnsembleClassifier(imblearn) | 17, 58, 70 | 12, 43, 60 | 14, 71, 78 | 36, 88, 94 |
The “Percentage in Top X Variants” columns show the percentage of reported variants that were found in the top 1, 10, and 20 variants in a case after ranking by the corresponding method
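The top-X percentage metric reduces to a simple count over the collected ranks; `pct_in_top` is a hypothetical helper and the ranks are toy data:

```python
# Hedged sketch: fraction of reported-variant ranks within the top X.
def pct_in_top(ranks, x):
    """Percentage of reported-variant ranks that fall within the top x
    positions, rounded to a whole percent as in the table above."""
    hits = sum(1 for r in ranks if r <= x)
    return round(100 * hits / len(ranks))

# Toy ranks for illustration; the table uses X = 1, 10, 20.
ranks = [1, 3, 12, 25, 2, 18, 40, 7, 1, 30]
print([pct_in_top(ranks, x) for x in (1, 10, 20)])  # -> [20, 50, 70]
```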