| Literature DB >> 27468419 |
Eva König1, Johannes Rainer1, Francisco S Domingues1.
Abstract
BACKGROUND: Although several methods have been proposed for predicting the effects of genetic variants and their role in disease, it is still a challenge to identify and prioritize pathogenic variants within sequencing studies.Entities:
Keywords: Feature analysis; GO annotation bias; feature combination; gene features; pathogenic variant prediction; variant features
Year: 2016 PMID: 27468419 PMCID: PMC4947862 DOI: 10.1002/mgg3.214
Source DB: PubMed Journal: Mol Genet Genomic Med ISSN: 2324-9269 Impact factor: 2.183
The twenty‐eight features used in this study
| Feature | Class | Type 1 | Type 2 | Description | Reference |
|---|---|---|---|---|---|
| PON‐P2 | Numeric | Variant | TPS | PON‐P2 score | Niroula et al. ( |
| SIFT | Numeric | Variant | RPS | SIFT score | Ng and Henikoff ( |
| PROVEAN | Numeric | Variant | RPS | PROVEAN score | Choi et al. ( |
| Grantham | Numeric | Variant | RPS | Grantham score | Grantham ( |
| GERP | Numeric | Variant | RPS | GERP conservation score, computed on 37 eutherian mammals | Cooper et al. ( |
| evolutionary.rate | Numeric | Variant | Raw feature | Residue evolutionary rate computed with rate4site on PhylomeDB alignments | Pupko et al. ( |
| disordered.region | Numeric | Variant | Raw feature | Disordered region value between 0 and 1, computed with SPINE‐D. Ordered: val < 0.5, disordered: val ≥ 0.5 | Zhang et al. ( |
| accessibility | Numeric | Variant | Raw feature | Residue accessibility value between ‐5 and 95, computed with SCRATCH‐1D. Low values correspond to buried residues, high values to exposed residues | Cheng et al. ( |
| secondary.structure.3 | Categorical | Variant | Raw feature | Secondary structure prediction, 3 class, computed with SCRATCH‐1D | Cheng et al. ( |
| secondary.structure.8 | Categorical | Variant | Raw feature | Secondary structure prediction, 8 class, computed with SCRATCH‐1D | Cheng et al. ( |
| PfamA | Categorical | Variant | Raw feature | 1 if the variant is in a PfamA domain, 0 else | Finn et al. ( |
| AAindex.polarity | Numeric | Variant | Raw feature | AAindex GRAR740102 (Polarity) from section AAindex1 | Kawashima et al. ( |
| AAindex.hydropathy | Numeric | Variant | Raw feature | AAindex KYTJ820101 (Hydropathy) from section AAindex1 | Kawashima et al. ( |
| AAindex.volume | Numeric | Variant | Raw feature | AAindex GRAR740103 (Volume) from section AAindex1 | Kawashima et al. ( |
| AAindex.composition | Numeric | Variant | Raw feature | AAindex KH900101 (AA composition of total proteins) from section AAindex1 | Kawashima et al. ( |
| AAindex.net.charge | Numeric | Variant | Raw feature | AAindex KLEP840101 (Net charge) from section AAindex1 | Kawashima et al. ( |
| protein.age | Numeric | Gene | Raw feature | Protein age computed with ProteinHistorian | Capra et al. ( |
| paralog.id | Numeric | Gene | Raw feature | Maximum paralog identity value of gene, 0 if the gene has no paralog | Cunningham et al. ( |
| paralog.nr | Numeric | Gene | Raw feature | Number of human paralog genes | Cunningham et al. ( |
| mouse.orth.id | Numeric | Gene | Raw feature | Maximum mouse ortholog identity value of gene, 0 if the gene has no mouse ortholog | Cunningham et al. ( |
| mouse.orth.nr | Numeric | Gene | Raw feature | Number of mouse ortholog genes | Cunningham et al. ( |
| GO.BP | Numeric | Gene | Raw feature | Number of GO BP annotations of a gene with information content greater 2 considering all evidence codes and disregarding children nodes | The Gene Ontology Consortium ( |
| expression | Numeric | Gene | Raw feature | Fraction of tissues in which this gene is expressed at a threshold | Kolesnikov et al. ( |
| degree | Numeric | Gene | Raw feature | Degree of gene on the mentha network | Calderone et al. ( |
| centrality | Numeric | Gene | Raw feature | Alpha centrality of gene on the mentha network | Calderone et al. ( |
| betweenness | Numeric | Gene | Raw feature | Betweenness of gene on the mentha network | Calderone et al. ( |
| gene.length | Numeric | Gene | Raw feature | Gene length in base pairs | Cunningham et al. ( |
| protein.length | Numeric | Gene | Raw feature | Protein length in base pairs | Cunningham et al. ( |
| PolyPhen‐2 | Numeric | Variant | TPS | PolyPhen‐2 score | Adzhubei et al. ( |
| Condel | Numeric | Variant | TPS | Condel 2.0 score (weighted average of FatHMM and MutationAssessor) | González‐Pérez and López‐Bigas ( |
| CADD | Numeric | Variant | TPS | CADD phred score | Kircher et al. ( |
TPS, trained prediction score; RPS, rule prediction score.
Not used as features in the main analysis due to circularity.
Figure 1Hierarchical clustering of all features (except PfamA, secondary.structure.3, and secondary.structure.8) on their absolute Pearson correlation values. In each cell, the rounded absolute correlation value is given, the higher the value, the darker the corresponding cell. Black squares correspond to the 10 clusters that were found to maximize the mean silhouette values.
Figure 2Decision trees computed on the training set using different input features. Each tree node has three rows: the upper row contains the decision made in this node with 0 = benign and 1 = pathogenic; the second row shows the fraction of single amino acid polymorphisms (SAPs) classified at this node as benign (left) and pathogenic (right); the third row shows the percentage of all input SAPs that are classified at this node. Starting from the root node, at each node, the left child is traversed if the condition evaluates to true and the right child is traversed if the condition evaluates to false. (A) Tree computed on raw features, excluding rule prediction scores (RPSs) and PON‐P2. (B) Tree computed on raw features and RPSs, excluding PON‐P2. (C) Tree computed on all features.
Figure 3Cross‐validation (CV) test error of stepwise forward selection with random forest and linear regression. Points correspond to the mean test error from all CV iterations with error bars. The label corresponds to the feature that was added in this step and the number below indicates the percentage of CV iterations in which this feature was selected. Labels are printed above and below the points for random forest and linear regression, respectively. The vertical dashed line shows the cutoff for feature selection. Only the first 10 steps of the forward selection are shown. GO = GO.BP, evol.rate = evolutionary.rate, disord = disordered.region, paral.id = paralog.id, acc = accessibility, prot.age = protein.age, mouse.id = mouse.orth.id; expr = expression, betw = betweeness, PROV = PROVEAN, p.len = protein.length, Grant = Grantham. (A) Stepwise forward selection on raw scores, excluding rule prediction scores (RPS) and PON‐P2. (B) Stepwise forward selection on raw scores and RPSs, excluding PON‐P2. (C) Stepwise forward selection on all features.
Figure 4Box‐and‐whisker plots of the seven features constituting the predictive feature sets 1, 2, and 3 on the training set in pathogenic (P) and benign (B) single amino acid polymorphisms. The boxes show the first and third quartile of the distributions with their median; the whiskers extend to 1.5 times the interquartile range. For PROVEAN, SIFT, and PON‐P2, the classification thresholds estimated by the developers are shown as dashed horizontal lines. (A) GO.BP. (B) evolutionary.rate, (C) disordered.region. (D) PROVEAN. (E) SIFT. (F) GERP. (G) PON‐P2.
Figure 5Increase in Gene Ontology (GO) biological process (BP) annotations from 2008 to 2013 for 957 genes for which the first disease variant has been reported in HGMD in 2008 (left, first group), for 349 genes for which the first disease variant has been reported in HGMD in 2013 (middle, second group), and for 13,001 genes that have no pathogenic variant reported in HGMD (right, third group).
Figure 6Receiver operating characteristic (ROC) curves showing specificity versus sensitivity of the logistic regression models and prediction scores at different thresholds on the validation set. In the legend, models are ordered according to their AUC values. One line is dotted to improve visibility. (A) Full ROC curve. The vertical dashed line at 0.86 corresponds to the specificity of PON‐P2 as estimated by the developers. (B) Same data as in (A), zoomed into the region where the lines of the ROC curve intersect the specificity threshold of 0.86.
Performance of classifiers
| Classifier | Features | Cutoff | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC | AUC |
| Significant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PFS3 LR | PON‐P2, GO.BP, PROVEAN | 0.525 | 0.885 | 0.884 | 0.885 | 0.905 | 0.860 | 0.767 | 0.946 | 4.2 × 10−3 | Yes |
| PON‐P2 | – | 0.5 | 0.795 | 0.931 | 0.855 | 0.935 | 0.783 | 0.722 | 0.940 | 1.3 × 10−6 | Yes |
| PFS2 LR | GO.BP, PROVEAN, SIFT, GERP, disordered.region | 0.557 | 0.854 | 0.851 | 0.853 | 0.878 | 0.823 | 0.703 | 0.915 | 1.6 × 10−13 | Yes |
| PROVEAN | – | 2.282 | 0.826 | 0.802 | 0.815 | 0.840 | 0.786 | 0.627 | 0.879 | 1.4 × 10−2 | Yes |
| PFS1 LR | GO.BP, evolutionary.rate, disordered.region | 0.577 | 0.795 | 0.802 | 0.798 | 0.834 | 0.757 | 0.594 | 0.861 | 3.5 × 10−1 | No |
| SIFT | – | 0.05 | 0.798 | 0.776 | 0.788 | 0.817 | 0.754 | 0.572 | 0.859 | – | – |
PFS, predictive feature set; LR, logistic regression; PPV, positive predictive value; NPV, negative predictive value; MCC, Matthew's correlation coefficient; AUC, area under the curve.
Ranked by AUC.
Constituting features for the predictive feature sets.
Bootstrap test for difference in AUC to next ranking classifier.
Whether the difference in AUC is significant at a 0.05 threshold.
Cutoff that maximizes the MCC.
Cutoff as proposed by the program developers.
Figure 7Venn diagram showing the overlap of false predictions on the validation set by PON‐P2, PROVEAN, and the logistic regression models trained on the predictive feature sets (PFS) 1, 2, and 3. (A) False positives (single amino acid polymorphisms falsely predicted as pathogenic). (B) False negatives (single amino acid polymorphisms falsely predicted as benign).