Christopher Douville1, David L Masica1, Peter D Stenson2, David N Cooper2, Derek M Gygax3, Rick Kim3, Michael Ryan3, Rachel Karchin1,4.
Abstract
Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features (DNA and protein sequence conservation, indel length, and occurrence in repeat regions) are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in-frame and frameshift indels (VEST-indel) as pathogenic or benign. We apply 24 features, including a new "PubMed" feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false-positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta-predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta-predictor with improved performance over any individual method.
Keywords: bioinformatics pathogenicity predictor; in-frame frameshift; indel; insertion deletion variant; meta-predictor
Year: 2015 PMID: 26442818 PMCID: PMC5057310 DOI: 10.1002/humu.22911
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
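The meta-predictor search described in the abstract can be illustrated with a minimal sketch. It assumes that each classifier's score has already been thresholded to a binary pathogenic/benign call, and it enumerates only pure conjunctions (AND) and pure disjunctions (OR) over subsets of classifiers, which may be narrower than the search the authors performed. The classifier names (`clf_a`, `clf_b`, `clf_c`), the labels, and the calls are invented toy data, not results from the paper.

```python
from itertools import combinations


def evaluate(calls, labels):
    """Return (sensitivity, specificity, balanced accuracy) for binary calls."""
    true_pos = sum(c and y for c, y in zip(calls, labels))
    true_neg = sum((not c) and (not y) for c, y in zip(calls, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    sensitivity = true_pos / n_pos
    specificity = true_neg / n_neg
    return sensitivity, specificity, (sensitivity + specificity) / 2


def boolean_meta_predictors(predictions):
    """Yield (name, calls) for every AND and OR over 2+ classifiers."""
    names = sorted(predictions)
    n_variants = len(predictions[names[0]])
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            for op_name, op in (("AND", all), ("OR", any)):
                calls = [op(predictions[m][i] for m in subset)
                         for i in range(n_variants)]
                yield f" {op_name} ".join(subset), calls


# Toy data (hypothetical): True = pathogenic, False = benign.
labels = [True, True, True, True, False, False, False, False]
predictions = {
    "clf_a": [True, True, True, True, True, True, False, False],
    "clf_b": [True, True, True, False, False, False, True, False],
    "clf_c": [True, False, True, True, False, True, False, False],
}

candidates = list(boolean_meta_predictors(predictions))
best_name, best_calls = max(candidates, key=lambda nc: evaluate(nc[1], labels)[2])
print(best_name, evaluate(best_calls, labels))
```

In this toy case a conjunction removes false positives that each member makes on its own, which is the kind of specificity gain a Boolean meta-predictor can offer; on real data the best combination has to be chosen on examples that are held out from every member's training set.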
Datasets Used in Development of the VEST‐Indel Method
| | Feature selection | Training | Testing | Null distribution | Multi‐method benchmark |
|---|---|---|---|---|---|
| In‐frame | |||||
| Pathogenic | 500a | 2,475a | 39b | N/A | 57f |
| Benign | 500c | 1,877c | 8,105d | 346e | 156e |
| Frameshift | |||||
| Pathogenic | 500a | 24,478a | 184b | N/A | 478f |
| Benign | 500c | 1,350c | 1,340d | 537e | 60e |
Superscript letters indicate the source of the examples for each type of insertion/deletion variant and each stage of VEST‐indel development (feature selection, classifier training, classifier validation, empirical null).
a: HGMD.
b: ClinVar.
c: ESP6500.
d: Inter‐species benigns from SIFT‐indel.
e: 1000G Phase 3.
f: HGMD2014.
There is no overlap between examples in any of the columns. N/A, not applicable because only benign examples were used to develop the empirical null distribution.
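A non-overlap condition of this kind can be checked mechanically. The sketch below is only an illustration: it assumes each variant is identified by a hypothetical `(chromosome, position, reference allele, alternate allele)` key, and the dataset names and variants are invented placeholders rather than entries from the sources listed above.

```python
from itertools import combinations


def find_overlaps(datasets):
    """Return {(name_a, name_b): shared_variants} for every overlapping pair."""
    overlaps = {}
    for (name_a, set_a), (name_b, set_b) in combinations(datasets.items(), 2):
        shared = set_a & set_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Invented variant keys: (chromosome, position, ref allele, alt allele).
datasets = {
    "feature_selection": {("1", 1000, "ACT", "A"), ("2", 2000, "G", "GTT")},
    "training": {("3", 3000, "CAG", "C"), ("4", 4000, "T", "TAAA")},
    "testing": {("5", 5000, "GA", "G"), ("3", 3000, "CAG", "C")},
}

overlaps = find_overlaps(datasets)
print(overlaps)  # non-empty here because the toy "testing" set reuses a training variant
```

An empty result from `find_overlaps` would correspond to the condition stated in the table note. Real indel data would also need the variants normalized to a common representation before comparison, which this sketch does not attempt.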
Training and Validation Sets Used by Current Prediction Methods
| | Training set (as published): pathogenic | Training set (as published): benign | Test set (as published): pathogenic | Test set (as published): benign | Non‐overlapping multi‐method benchmark set: pathogenic | Non‐overlapping multi‐method benchmark set: benign |
|---|---|---|---|---|---|---|
| In‐frame | ||||||
| PROVEAN | Uniprot | Uniprot | HGMD 2011 | 1000G P1 | HGMD2014.4 | 1000G P3 AA |
| DDIG‐in | HGMD 2012 | 1000G P1 | Uniprot | Uniprot | HGMD2014.4 | 1000G P3 AA |
| SIFT‐indel | HGMD 2010 | Interspecies | Uniprot | Uniprot | HGMD2014.4 | 1000G P3 AA |
| CADD | Simulated | Fixed Polymorphisms | ClinVar | ESP6500 | HGMD2014.4 | 1000G P3 AA |
| VEST‐indel | HGMD 2014.3 | ESP6500 AA | ClinVar | Interspecies | HGMD2014.4 | 1000G P3 AA |
| Frameshift | ||||||
| PROVEAN | N/A | N/A | N/A | N/A | N/A | N/A |
| DDIG‐in | HGMD 2012 | 1000G P1 | HGMD 2012 | Interspecies | HGMD2014.4 | 1000G P3 AA |
| SIFT‐indel | HGMD 2010 | Interspecies | N/A | N/A | HGMD2014.4 | 1000G P3 AA |
| CADD | Simulated | Fixed Polymorphisms | ClinVar | ESP6500 | HGMD2014.4 | 1000G P3 AA |
| VEST‐indel | HGMD 2014.3 | ESP6500 AA | ClinVar | Interspecies | HGMD2014.4 | 1000G P3 AA |
1000G P1 and 1000G P3 are variants from 1000 Genomes Phase 1 and Phase 3, respectively. Interspecies benign variants were derived from pairwise genome alignments of human with cow, dog, horse, chimp, rhesus macaque, and rat. Uniprot variants were obtained from the UniProtKB/Swiss‐Prot “Human Polymorphisms and Disease Mutations” dataset (Release 2011_09) and annotated as deleterious, neutral, or unknown based on keywords from the provided Uniprot descriptions. AA, African or African American ancestry; N/A, not applicable.
VEST‐Indel Performance Metrics
| | Sensitivity | Specificity | Balanced accuracy |
|---|---|---|---|
| In‐frame cross validation | 0.90 | 0.90 | 0.90 |
| In‐frame testing | 0.80 | 0.85 | 0.82 |
| Frameshift cross validation | 0.83 | 0.88 | 0.85 |
| Frameshift testing | 0.89 | 0.86 | 0.87 |
Training used 10‐fold cross validation, with pathogenic variants from the Human Gene Mutation Database 2014.3 and benign examples from the Exome Sequencing Project (minor allele frequency in African Ancestry ≥ 0.01). The test set consisted of pathogenic examples from ClinVar and benign examples derived from pairwise genome alignments of human with cow, dog, horse, chimp, rhesus macaque, and rat.
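Balanced accuracy is conventionally the mean of sensitivity and specificity. As a quick worked check, the sketch below recomputes it from the rounded sensitivity and specificity values in the table above and compares the result with the reported balanced accuracy. The 0.01 tolerance is an assumption made here, on the guess that the published figures were computed from unrounded values.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2


# (sensitivity, specificity, reported balanced accuracy), copied from the table.
reported = {
    "In-frame cross validation": (0.90, 0.90, 0.90),
    "In-frame testing": (0.80, 0.85, 0.82),
    "Frameshift cross validation": (0.83, 0.88, 0.85),
    "Frameshift testing": (0.89, 0.86, 0.87),
}

TOLERANCE = 0.01  # assumed allowance for rounding of the published inputs

for name, (sens, spec, reported_ba) in reported.items():
    recomputed = balanced_accuracy(sens, spec)
    consistent = abs(recomputed - reported_ba) < TOLERANCE
    print(f"{name}: recomputed {recomputed:.3f}, reported {reported_ba:.2f}, "
          f"consistent within tolerance: {consistent}")
```

Because it weights the two classes equally, balanced accuracy is less distorted than plain accuracy when pathogenic and benign examples are present in very different numbers, as they are in several of the datasets here.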
Comparing Performance with Previously Published Results and Testing all Methods with the New Multi‐Method Benchmark Dataset
| | Previously published: sensitivity | Previously published: specificity | Multi‐method benchmark: sensitivity | Multi‐method benchmark: specificity | Multi‐method benchmark: balanced accuracy |
|---|---|---|---|---|---|
| In‐frame | |||||
| VEST‐indel | 0.90a | 0.90a | 0.81 | 0.96 | 0.88 |
| SIFT‐indel | 0.81 | 0.82 | 0.86 | 0.76 | 0.81 |
| DDIG‐in | 0.89 | N/A | 0.78 | 0.91 | 0.84 |
| PROVEAN | 0.93/0.96 | 0.80/0.68 | 0.95 | 0.80 | 0.88 |
| CADD | N/A | N/A | 0.74 | 0.88 | 0.81 |
| Frameshift | |||||
| VEST‐indel | 0.83a | 0.88a | 0.85 | 0.95 | 0.90 |
| SIFT‐indel | 0.90 | 0.78 | 0.94 | 0.25 | 0.59 |
| DDIG‐in | 0.86 | 0.72 | 0.75 | 0.80 | 0.77 |
| CADD | N/A | N/A | 0.98 | 0.05 | 0.52 |
Previously published sensitivity and specificity are based on the authors' cross‐validation experiments. PROVEAN does not use cross validation, so the reported numbers are from validation set experiments done separately for insertion and deletion variants. N/A, not applicable. Published results for the DDIG‐in in‐frame classifier do not include specificity; their self‐reporting consists of an accuracy (not balanced accuracy) of 0.84 and precision of 0.81. The authors of CADD did not report the performance achieved with indels separately.
a: Results from Table 1 included here for comparison. The multi‐method benchmark set consisted of pathogenic examples from Human Gene Mutation Database 2014.4 and benign examples from 1000 Genomes Phase 3 (minor allele frequency in African Ancestry ≥ 0.1).
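The specificity column translates directly into a false-positive rate, since FPR = 1 − specificity. The sketch below applies this to the frameshift rows of the multi-method benchmark and estimates how many of the 60 benign frameshift benchmark variants (the count given in the datasets table) each method would call pathogenic. The counts are approximations derived from rounded specificities, not figures reported by the authors.

```python
# Frameshift specificities on the multi-method benchmark, copied from the table.
frameshift_specificity = {
    "VEST-indel": 0.95,
    "SIFT-indel": 0.25,
    "DDIG-in": 0.80,
    "CADD": 0.05,
}

N_BENIGN_FRAMESHIFT = 60  # benign frameshift benchmark variants (datasets table)


def false_positive_rate(specificity):
    """Fraction of benign variants that are called pathogenic."""
    return 1.0 - specificity


summary = {}
for method, spec in frameshift_specificity.items():
    fpr = false_positive_rate(spec)
    approx_false_positives = round(fpr * N_BENIGN_FRAMESHIFT)
    summary[method] = (fpr, approx_false_positives)
    print(f"{method}: FPR {fpr:.2f}, about {approx_false_positives} "
          f"of {N_BENIGN_FRAMESHIFT} benign variants misclassified")

# Largest specificity difference between any two methods in these rows.
specificity_gap = (max(frameshift_specificity.values())
                   - min(frameshift_specificity.values()))
print(f"largest specificity gap: {specificity_gap:.2f}")
```

The largest gap in these rows is 0.90 (0.95 against 0.05), which appears consistent with the abstract's statement that specificity improves "by as much as 90%", although the source does not say explicitly which comparison that figure refers to.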