Abhishek Niroula, Siddhaling Urolagin, Mauno Vihinen.
Abstract
More reliable and faster prediction methods are needed to interpret the enormous amounts of data generated by sequencing and genome projects. We have developed a new computational tool, PON-P2, for classification of amino acid substitutions in human proteins. The method is a machine learning-based classifier that groups variants into pathogenic, neutral and unknown classes on the basis of a random forest probability score. PON-P2 is trained using pathogenic and neutral variants obtained from VariBench, a database for benchmark variation datasets. PON-P2 utilizes information about evolutionary conservation of sequences, physical and biochemical properties of amino acids, GO annotations and, if available, functional annotations of variation sites. Extensive feature selection was performed to identify 8 informative features out of 622 in total. PON-P2 consistently showed superior performance in comparison to existing state-of-the-art tools. In the 10-fold cross-validation test, its accuracy and MCC are 0.90 and 0.80, respectively, and in the independent test, they are 0.86 and 0.71, respectively. The coverage of PON-P2 is 61.7% in the 10-fold cross-validation and 62.1% in the test dataset. PON-P2 is a powerful tool for screening harmful variants and for ranking and prioritizing experimental characterization. It is very fast, making it capable of analyzing large variant datasets. PON-P2 is freely available at http://structure.bmc.lu.se/PON-P2/.
Year: 2015 PMID: 25647319 PMCID: PMC4315405 DOI: 10.1371/journal.pone.0117380
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of PON-P2 architecture and implementation.
PON-P2 uses pre-calculated feature vectors and a bootstrap random forest for prediction. In addition, it makes use of information about functional and/or structural annotations, when available, identifies reliably predicted variations, and groups them as either pathogenic or neutral.
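The three-way grouping described in this caption can be illustrated with a small sketch. The record does not spell out the decision rule, so the rule below is an assumption: each variant receives one pathogenicity probability per bootstrap model, a percentile interval at the chosen confidence level is taken over those probabilities, and the variant is called only when the whole interval lies on one side of a 0.5 cutoff. The names `percentile` and `classify_variant`, the 0.5 cutoff, and the percentile method are all hypothetical.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, for q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def classify_variant(bootstrap_probs, confidence=0.95, cutoff=0.5):
    """Assumed rule: call a variant only if the bootstrap interval
    of its pathogenicity probability does not straddle the cutoff."""
    probs = sorted(bootstrap_probs)
    tail = (1.0 - confidence) / 2.0
    low = percentile(probs, tail)
    high = percentile(probs, 1.0 - tail)
    if low > cutoff:
        return "pathogenic"
    if high < cutoff:
        return "neutral"
    return "unknown"  # interval overlaps the cutoff: not reliably predicted


# Hypothetical probabilities from four bootstrap models per variant.
print(classify_variant([0.91, 0.88, 0.95, 0.83]))
print(classify_variant([0.10, 0.20, 0.15, 0.05]))
print(classify_variant([0.30, 0.70, 0.45, 0.60]))
```

Under this assumed rule, variants left as "unknown" are the ones excluded from the coverage figures, which is why a stricter confidence level would be expected to trade coverage for reliability.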
Fig 2. Distribution of variations at functional and structural sites.
The pathogenic variations are represented by white bars and neutral variations by grey bars. The functional and structural annotation sites were obtained from Swiss-Prot and PDB. Binding, binding site; Metal, metal binding site; Active, active site; IM, intra membrane region; Site, catalytic, co-factor, anti-codon, regulatory or other essential site surrounding ligands in the structure.
Prediction performance of feature subsets on test data.
| Features | PPV | NPV | Sens | Spec | Acc | MCC | OPM | Coverage |
|---|---|---|---|---|---|---|---|---|
| AA | 0.65 | 0.77 | 0.74 | 0.68 | 0.71 | 0.42 | 0.36 | 0.63 |
| SeqProf | 0.67 | 0.79 | 0.80 | 0.66 | 0.73 | 0.46 | 0.39 | 0.74 |
| SelPres | 0.72 | 0.83 | 0.84 | 0.71 | 0.77 | 0.55 | 0.46 | 0.53 |
| AA + GO | 0.73 | 0.82 | 0.71 | 0.84 | 0.79 | 0.55 | 0.47 | 0.37 |
| SeqProf + GO | 0.78 | 0.84 | 0.74 | 0.87 | 0.82 | 0.62 | 0.53 | 0.51 |
| AA + SelPres | 0.81 | 0.85 | 0.83 | 0.83 | 0.83 | 0.66 | 0.57 | 0.45 |
| AA + SeqProf | 0.82 | 0.83 | 0.80 | 0.85 | 0.83 | 0.65 | 0.56 | 0.49 |
| AA + SelPres + SeqProf | 0.82 | 0.85 | 0.81 | 0.85 | 0.83 | 0.67 | 0.58 | 0.52 |
| SelPres + SeqProf + GO | 0.78 | 0.87 | 0.79 | 0.86 | 0.83 | 0.65 | 0.56 | 0.53 |
| AA + SeqProf + GO | 0.82 | 0.87 | 0.81 | 0.88 | 0.85 | 0.69 | 0.61 | 0.57 |
| AA + SelPres + GO | 0.82 | 0.89 | 0.86 | 0.86 | 0.86 | 0.71 | 0.63 | 0.52 |
| PON-P2 | 0.82 | 0.89 | 0.85 | 0.86 | 0.86 | 0.71 | 0.63 | 0.62 |
a. All scores are calculated for the variations that were predicted at confidence level 0.95.
b. Sens, Sensitivity; Spec, Specificity; Acc, Accuracy; OPM, Overall performance measure; AA, Amino acid features; GO, GO annotation derived feature; SelPres, Selective pressure; SeqProf, Sequence profile features (proportion of reference amino acid, proportion of variant amino acid and number of sequences in the multiple sequence alignment).
c. Coverage is the proportion of the data that are predicted either pathogenic or neutral.
Fig 3. Performance cuboids for PON-P2 and other methods.
Six performance measures: PPV, NPV, sensitivity, specificity, acc (accuracy) and normalized MCC (nMCC = MCC×0.5+0.5) for each method are represented by the distances of the six faces of the cuboid from the origin. (A) Performance cuboids for different feature subsets used in PON-P2. Seq prof, Proportions of reference and altered amino acids and number of sequences in multiple sequence alignment; Sel pres + Seq prof, evolutionary features; Sel pres + Seq prof + GO, evolutionary features and GO annotations (B) Performance cuboids for PolyPhen-2, PON-P, PON-P2 and SIFT for all predicted variations by each method on independent test dataset. The performance scores for PON-P and PON-P2 are for predictions at 0.95 confidence level. OPMs for PolyPhen-2, PON-P, PON-P2 and SIFT are 0.41, 0.61, 0.63 and 0.40, respectively. (C) Performance cuboids for predictors using c95-test set. OPMs for PolyPhen-2, PON-P, PON-P2 and SIFT are 0.47, 0.61, 0.63 and 0.48, respectively.
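The normalized MCC given in this caption, and the way the six measures can be combined into a single OPM, can be sketched as follows. The nMCC formula is the one stated in the caption. The OPM formula is not stated in this record; the sketch assumes OPM = (PPV + NPV)(Sens + Spec)(Acc + nMCC) / 8, which matches the tabulated PON-P2 and SIFT values for the independent test set to two decimals when fed the rounded scores. The function names are hypothetical.

```python
def normalized_mcc(mcc):
    """Map MCC from [-1, 1] onto [0, 1], as in the caption: nMCC = MCC*0.5 + 0.5."""
    return mcc * 0.5 + 0.5


def overall_performance_measure(ppv, npv, sens, spec, acc, mcc):
    """Assumed OPM: product of three pairwise sums, scaled so a perfect
    predictor (all six measures equal to 1) scores exactly 1."""
    nmcc = normalized_mcc(mcc)
    return (ppv + npv) * (sens + spec) * (acc + nmcc) / 8.0


# Rounded scores on the independent test set, taken from the tables in this record.
pon_p2 = overall_performance_measure(0.82, 0.89, 0.85, 0.86, 0.86, 0.71)
sift = overall_performance_measure(0.67, 0.80, 0.77, 0.71, 0.74, 0.48)
print(round(pon_p2, 2), round(sift, 2))
```

Because the inputs here are already rounded to two decimals, small deviations from the published OPM are possible for other methods.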
Performance scores of different prediction methods.
| Measure | Condel | PPH2 | Provean | SIFT | SNAP | PON-P | PON-P2 |
|---|---|---|---|---|---|---|---|
| 10-fold cross-validation | |||||||
| TP | 8626 | 10387 | 10170 | 8928 | 10140 | 6432 | 6375 (10191) |
| TN | 7820 | 7960 | 9189 | 8577 | 8763 | 5787 | 7860 (10572) |
| FP | 2894 | 4042 | 3887 | 3708 | 4299 | 993 | 805 (2497) |
| FN | 2566 | 2182 | 2469 | 2451 | 3420 | 880 | 778 (2396) |
| PPV | 0.75 | 0.72 | 0.72 | 0.71 | 0.70 | 0.87 | 0.89 (0.80) |
| NPV | 0.75 | 0.79 | 0.79 | 0.78 | 0.78 | 0.87 | 0.91 (0.82) |
| Sens | 0.77 | 0.83 | 0.81 | 0.79 | 0.81 | 0.88 | 0.89 (0.81) |
| Spec | 0.73 | 0.66 | 0.70 | 0.70 | 0.67 | 0.85 | 0.91 (0.81) |
| Acc | 0.75 | 0.75 | 0.75 | 0.74 | 0.74 | 0.87 | 0.90 (0.81) |
| MCC | 0.50 | 0.50 | 0.51 | 0.48 | 0.48 | 0.73 | 0.80 (0.62) |
| OPM | 0.42 | 0.42 | 0.43 | 0.41 | 0.41 | 0.65 | 0.73 (0.53) |
| Independent test data set | |||||||
| TP | 852 | 952 | 870 | 869 | 1077 | 567 | 638 (969) |
| TN | 972 | 975 | 1135 | 1062 | 1092 | 722 | 909 (1255) |
| FP | 353 | 470 | 432 | 432 | 513 | 137 | 144 (350) |
| FN | 266 | 230 | 312 | 259 | 224 | 96 | 113 (332) |
| PPV | 0.71 | 0.67 | 0.67 | 0.67 | 0.68 | 0.81 | 0.82 (0.74) |
| NPV | 0.79 | 0.81 | 0.78 | 0.80 | 0.83 | 0.88 | 0.89 (0.79) |
| Sens | 0.76 | 0.81 | 0.74 | 0.77 | 0.83 | 0.86 | 0.85 (0.75) |
| Spec | 0.73 | 0.68 | 0.72 | 0.71 | 0.68 | 0.84 | 0.86 (0.78) |
| Acc | 0.75 | 0.73 | 0.73 | 0.74 | 0.75 | 0.85 | 0.86 (0.77) |
| MCC | 0.49 | 0.48 | 0.46 | 0.48 | 0.51 | 0.69 | 0.71 (0.53) |
| OPM | 0.42 | 0.41 | 0.39 | 0.40 | 0.43 | 0.61 | 0.63 (0.45) |
a. HumVar trained PolyPhen-2. The performance of this version was better than for HumDiv trained PolyPhen-2 (data not shown).
b. Performance scores are computed by using the predicted variants at 0.95 confidence level.
c. Performance scores inside parentheses are for the predictor when the unreliable cases are included.
d. Sens, Sensitivity; Spec, Specificity; Acc, Accuracy; OPM, Overall performance measure.
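The scores in this table follow from the confusion-matrix counts, and coverage follows from comparing the reliable counts with the counts in parentheses. The sketch below uses the standard definitions of PPV, NPV, sensitivity, specificity, accuracy and MCC, and treats coverage as the reliably predicted cases divided by all cases, in line with the coverage definition in the first table's footnotes. Function and variable names are hypothetical.

```python
from math import sqrt


def confusion_scores(tp, tn, fp, fn):
    """Standard binary-classification scores from one confusion matrix."""
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Sens": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


def coverage(reliable_counts, all_counts):
    """Fraction of all variants that received a reliable prediction."""
    return sum(reliable_counts) / sum(all_counts)


# PON-P2 counts as (TP, TN, FP, FN), copied from the table above.
cv_reliable, cv_all = (6375, 7860, 805, 778), (10191, 10572, 2497, 2396)
test_reliable, test_all = (638, 909, 144, 113), (969, 1255, 350, 332)

cv_scores = confusion_scores(*cv_reliable)
test_scores = confusion_scores(*test_reliable)
print({name: round(value, 2) for name, value in cv_scores.items()})
print({name: round(value, 2) for name, value in test_scores.items()})
print(round(coverage(cv_reliable, cv_all), 3), round(coverage(test_reliable, test_all), 3))
```

With these counts the coverage evaluates to 0.617 for cross-validation and 0.621 for the independent test set, matching the 61.7% and 62.1% quoted in the abstract.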
Performance scores of prediction methods on the MutationTaster2 dataset.
| Measure | CADD | Condel | PPH2 | Provean | SIFT | MT2 | PON-P2 |
|---|---|---|---|---|---|---|---|
| TP | 503 | 439 | 506 | 507 | 530 | 548 | 327 (501) |
| TN | 541 | 541 | 543 | 540 | 525 | 523 | 363 (571) |
| FP | 59 | 59 | 57 | 60 | 75 | 77 | 1 (29) |
| FN | 97 | 161 | 94 | 93 | 70 | 52 | 37 (99) |
| PPV | 0.90 | 0.88 | 0.90 | 0.89 | 0.88 | 0.88 | 0.98 (0.95) |
| NPV | 0.85 | 0.77 | 0.85 | 0.85 | 0.88 | 0.91 | 0.91 (0.85) |
| Sens | 0.84 | 0.73 | 0.84 | 0.85 | 0.88 | 0.91 | 0.90 (0.84) |
| Spec | 0.90 | 0.90 | 0.91 | 0.90 | 0.88 | 0.87 | 1.00 (0.95) |
| Acc | 0.87 | 0.82 | 0.87 | 0.87 | 0.88 | 0.89 | 0.95 (0.89) |
| MCC | 0.74 | 0.64 | 0.75 | 0.75 | 0.76 | 0.79 | 0.90 (0.79) |
| OPM | 0.66 | 0.55 | 0.67 | 0.66 | 0.68 | 0.71 | 0.85 (0.72) |
a. CADD, Combined Annotation Dependent Depletion; MT2, MutationTaster2; OPM, Overall performance measure; Sens, Sensitivity; Spec, Specificity; Acc, Accuracy.
b. Variants with C-score greater than 15 were considered as deleterious and lower than 15 were considered as neutral as suggested by the method developers.
c. HumVar trained PolyPhen-2. The performance of this version was better than for HumDiv trained PolyPhen-2 (data not shown).
d. Performance scores are computed by using the predicted variants at 0.95 confidence level. The scores in the parentheses are for the predictor when the unreliable cases are included.