Yang Yang, Aibin Shao, Mauno Vihinen.
Abstract
Genetic variations are investigated in humans and many other organisms for many purposes (e.g., to aid in clinical diagnosis). Interpreting the identified variations can be challenging. Although some dedicated prediction methods have been developed, and some tools for human variants can also be used for other organisms, their performance and species range have been limited. We developed a novel variant pathogenicity/tolerance predictor for amino acid substitutions in any organism. The method, PON-All, is a machine learning tool trained on human, animal, and plant variants. Two versions are provided, one that uses Gene Ontology (GO) annotations and one that does not, because GO annotations are unavailable or only partial for many organisms of interest. The methods provide predictions for three classes: pathogenic, benign, and variants of unknown significance. On the blind test, accuracy was 0.913 and the Matthews correlation coefficient (MCC) 0.827 when GO annotations were used; without GO features, accuracy was 0.856 and MCC 0.712. Performance is best for human and plant variants and somewhat lower for animal variants because the number of known disease-causing variants in animals is rather small. The method was compared to several other tools and was found to have superior performance. PON-All is freely available at http://structure.bmc.lu.se/PON-All and http://8.133.174.28:8999/.
Keywords: amino acid substitution; animal variants; machine learning; mutation; pathogenicity; plant variants; prediction; variation interpretation
Year: 2022 PMID: 35782867 PMCID: PMC9245922 DOI: 10.3389/fmolb.2022.867572
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
Division of cases into data sets for cross-validation and blind testing. In each cell, the first number is the count of proteins and the second the count of variants.
| Organism | 10-fold cross-validation: Pathogenic | 10-fold cross-validation: Neutral | 10-fold cross-validation: Total | Blind test: Pathogenic | Blind test: Neutral | Blind test: Total | Total: Pathogenic | Total: Neutral | Total: Total |
|---|---|---|---|---|---|---|---|---|---|
| Humans | 2,173/17,504 | 12,141/23,600 | 13,383/41,104 | 170/1,980 | 669/1,967 | 740/3,926 | 2,343/19,484 | 12,810/25,567 | 14,123/45,030 |
| Animals | 117/162 | 116/144 | 232/306 | 109/155 | 125/169 | 233/324 | 226/317 | 241/313 | 465/630 |
| Plants | 913/2,601 | 629/1,562 | 1,150/4,163 | 228/736 | 152/374 | 288/1,110 | 1,141/3,337 | 781/1,936 | 1,438/5,273 |
| Total | 3,203/20,267 | 12,886/25,306 | 14,765/45,573 | 507/2,871 | 946/2,510 | 1,261/5,360 | 3,710/23,138 | 13,832/27,816 | 16,026/50,933 |
Comparison of method performance in 10-fold cross-validation when using all the features for training. The numbers are averages over the folds.
| Measure | RF | XGBoost | LGBM |
|---|---|---|---|
| TP | 1,528 | 1,633.4 | 1,651.3 |
| TN | 2,307.9 | 2,364.3 | 2,343.1 |
| FP | 222.7 | 166.3 | 187.5 |
| FN | 498.7 | 393.3 | 375.4 |
| PPV | 0.87 | 0.91 | 0.90 |
| NPV | 0.82 | 0.86 | 0.86 |
| Sensitivity | 0.75 | 0.81 | 0.81 |
| Specificity | 0.91 | 0.93 | 0.93 |
| Accuracy | 0.84 | 0.88 | 0.88 |
| MCC | 0.68 | 0.75 | 0.75 |
| OPM | 0.59 | 0.67 | 0.67 |
| AUC | 0.83 | 0.87 | 0.87 |
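
The ratio measures in the table above can be recomputed from the four confusion-matrix counts. The sketch below uses the standard textbook definitions of PPV, NPV, sensitivity, specificity, accuracy, and MCC; the record itself does not state the formulas, so this is an assumption, and the function name `binary_measures` is hypothetical. It is applied to the RF column as a check.

```python
from math import sqrt


def binary_measures(tp, tn, fp, fn):
    """Return standard two-class performance measures from confusion-matrix counts.

    Assumption: textbook definitions; the source table does not spell them out.
    """
    ppv = tp / (tp + fp)                    # positive predictive value
    npv = tn / (tn + fn)                    # negative predictive value
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "PPV": ppv,
        "NPV": npv,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Accuracy": accuracy,
        "MCC": mcc,
    }


# Average per-fold counts for the RF column of the table.
rf = binary_measures(tp=1528, tn=2307.9, fp=222.7, fn=498.7)
for name, value in rf.items():
    print(f"{name}: {value:.2f}")
```

Under these definitions the RF counts give values that agree with the tabulated RF column to two decimals. The four counts sum to 4,557.3, one tenth of the 45,573 cross-validation variants, which is consistent with the numbers being per-fold averages.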
FIGURE 1. Flowchart for the PON-All predictor.
Performance assessment in the blind test set with and without the GO feature. Each cell gives the result with rejection, followed in brackets by the result without rejection.
| Measure | All variants, w GO | All variants, wo GO | Humans, w GO | Humans, wo GO | Animals, w GO | Animals, wo GO | Plants, w GO | Plants, wo GO |
|---|---|---|---|---|---|---|---|---|
| TP | 1,945 (2,278) | 1,201 (1,928) | 1,274 (1,552) | 789 (1,327) | 72 (102) | 64 (112) | 603 (624) | 341 (489) |
| TN | 1,855 (2,284) | 1,344 (2,109) | 1,421 (1,780) | 1,052 (1,659) | 118 (143) | 98 (140) | 318 (361) | 201 (310) |
| FP | 143 (365) | 177 (540) | 138 (326) | 148 (447) | 4 (26) | 14 (29) | 4 (13) | 15 (64) |
| FN | 217 (433) | 251 (783) | 94 (268) | 154 (493) | 35 (53) | 12 (43) | 88 (112) | 85 (247) |
| PPV | 0.932 (0.862) | 0.872 (0.781) | 0.902 (0.826) | 0.842 (0.748) | 0.947 (0.797) | 0.821 (0.794) | 0.993 (0.980) | 0.958 (0.884) |
| NPV | 0.895 (0.841) | 0.843 (0.729) | 0.938 (0.869) | 0.872 (0.771) | 0.771 (0.730) | 0.891 (0.765) | 0.783 (0.763) | 0.703 (0.557) |
| Sensitivity | 0.900 (0.840) | 0.827 (0.711) | 0.931 (0.853) | 0.837 (0.729) | 0.673 (0.658) | 0.842 (0.723) | 0.873 (0.848) | 0.800 (0.664) |
| Specificity | 0.928 (0.862) | 0.884 (0.796) | 0.911 (0.845) | 0.877 (0.788) | 0.967 (0.846) | 0.875 (0.828) | 0.988 (0.965) | 0.931 (0.829) |
| Accuracy | 0.913 (0.851) | 0.856 (0.753) | 0.921 (0.849) | 0.859 (0.761) | 0.830 (0.756) | 0.862 (0.778) | 0.909 (0.887) | 0.844 (0.720) |
| MCC | 0.827 (0.703) | 0.712 (0.509) | 0.841 (0.697) | 0.714 (0.518) | 0.678 (0.515) | 0.714 (0.555) | 0.817 (0.777) | 0.695 (0.466) |
| AUC | 0.913 (0.851) | 0.856 (0.753) | 0.921 (0.85) | 0.855 (0.758) | 0.818 (0.751) | 0.858 (0.775) | 0.929 (0.895) | 0.842 (0.747) |
| OPM | 0.763 (0.617) | 0.628 (0.429) | 0.781 (0.611) | 0.630 (0.438) | 0.588 (0.434) | 0.631 (0.470) | 0.751 (0.701) | 0.608 (0.391) |
| Coverage | 0.776 (1.000) | 0.555 (1.000) | 0.746 (1.000) | 0.546 (1.000) | 0.707 (1.000) | 0.580 (1.000) | 0.913 (1.000) | 0.578 (1.000) |
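
Coverage and OPM can be reproduced from the counts in the table above. The sketch below rests on two assumptions that the record does not state: coverage is the fraction of variants that receive a pathogenic or benign call instead of being rejected, and OPM combines the other measures as (PPV + NPV)(sensitivity + specificity)(accuracy + nMCC) / 8 with nMCC = (1 + MCC) / 2. The function names are hypothetical. With the "All variants, w GO" counts, both assumptions match the tabulated values.

```python
from math import sqrt


def opm(tp, tn, fp, fn):
    """Overall performance measure (assumed formula, see the lead-in)."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    n_mcc = (1 + mcc) / 2  # MCC rescaled from [-1, 1] to [0, 1]
    return (ppv + npv) * (sens + spec) * (acc + n_mcc) / 8


def coverage(n_predicted, n_total):
    """Fraction of variants that were not rejected (assumed definition)."""
    return n_predicted / n_total


# "All variants, w GO": counts with rejection ...
with_rejection = dict(tp=1945, tn=1855, fp=143, fn=217)
# ... and the bracketed counts without rejection, which cover every test variant.
without_rejection = dict(tp=2278, tn=2284, fp=365, fn=433)

n_predicted = sum(with_rejection.values())
n_total = sum(without_rejection.values())

opm_value = opm(**with_rejection)
coverage_value = coverage(n_predicted, n_total)
print(f"OPM: {opm_value:.3f}, coverage: {coverage_value:.3f}")
```

The comparison shows the trade-off behind rejection: leaving out the uncertain cases raises every ratio measure, at the cost of giving no call for roughly a quarter of the variants when GO is used.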
Blind test performance of PON-All compared to other predictors.
| Measure | PON-All wGO | PON-All woGO | PON-P2 | SIFT 4G | PolyPhen2 | MutationTaster | FATHMM | PROVEAN | MetaSVM | MetaLR | CADD_10 | CADD_15 | CADD_20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TP | 1,274 | 789 | 831 | 1,391 | 1,530 | 1,544 | 1,113 | 1,364 | 1,239 | 1,234 | 1,630 | 1,599 | 1,545 |
| TN | 1,421 | 1,052 | 1,032 | 1,197 | 1,003 | 1,044 | 1,472 | 1,320 | 1,651 | 1,639 | 498 | 710 | 1,020 |
| FP | 138 | 148 | 141 | 591 | 790 | 771 | 323 | 490 | 166 | 178 | 1,319 | 1,107 | 797 |
| FN | 94 | 154 | 132 | 288 | 149 | 135 | 563 | 313 | 440 | 445 | 49 | 80 | 134 |
| PPV | 0.902 | 0.842 | 0.855 | 0.702 | 0.659 | 0.667 | 0.775 | 0.736 | 0.882 | 0.874 | 0.553 | 0.591 | 0.660 |
| NPV | 0.938 | 0.872 | 0.887 | 0.806 | 0.871 | 0.885 | 0.723 | 0.808 | 0.790 | 0.786 | 0.910 | 0.899 | 0.884 |
| Sensitivity | 0.931 | 0.837 | 0.863 | 0.828 | 0.911 | 0.920 | 0.664 | 0.813 | 0.738 | 0.735 | 0.971 | 0.952 | 0.920 |
| Specificity | 0.911 | 0.877 | 0.880 | 0.669 | 0.559 | 0.575 | 0.820 | 0.729 | 0.909 | 0.902 | 0.274 | 0.391 | 0.561 |
| Accuracy | 0.921 | 0.859 | 0.872 | 0.746 | 0.730 | 0.741 | 0.745 | 0.770 | 0.827 | 0.822 | 0.609 | 0.660 | 0.734 |
| MCC | 0.841 | 0.714 | 0.742 | 0.503 | 0.500 | 0.523 | 0.491 | 0.543 | 0.659 | 0.649 | 0.337 | 0.410 | 0.512 |
| OPM | 0.781 | 0.630 | 0.661 | 0.423 | 0.416 | 0.436 | 0.414 | 0.459 | 0.570 | 0.559 | 0.291 | 0.341 | 0.426 |
| Coverage | 0.746 | 0.546 | 0.544 | 0.883 | 0.884 | 0.890 | 0.884 | 0.888 | 0.890 | 0.890 | 0.890 | 0.890 | 0.890 |
For CADD, 10, 15, and 20 are three commonly used thresholds.
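
Because CADD reports a continuous score, a cutoff is needed to turn it into a two-class call. The sketch below illustrates this for the three thresholds, assuming that a variant is called pathogenic when its score is at or above the cutoff; the record does not say whether the comparison is inclusive. The function name and the example score are hypothetical and not taken from the study.

```python
CADD_THRESHOLDS = (10, 15, 20)


def call_with_cadd(score, threshold):
    """Binary call from a CADD score.

    Assumption: scores at or above the threshold count as pathogenic.
    """
    return "pathogenic" if score >= threshold else "benign"


example_score = 17.3  # hypothetical score, for illustration only
calls = {t: call_with_cadd(example_score, t) for t in CADD_THRESHOLDS}
for threshold, call in calls.items():
    print(f"CADD_{threshold}: {call}")
```

A stricter cutoff moves borderline variants from pathogenic to benign, which matches the pattern in the table: from CADD_10 to CADD_20, sensitivity falls from 0.971 to 0.920 while specificity rises from 0.274 to 0.561.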
FIGURE 2. Visualization of PON-All predictions (without GO terms) for human, mouse, and Drosophila BTK PH domains. The sequences were adjusted based on a multiple sequence alignment so that corresponding amino acid positions in the three proteins are on the same line. Predicted pathogenic variants are shown in red, benign variants in blue, and UVs in gray; the original residue is white. The distribution of the number of pathogenic variants is shown to the right, mapped onto the human PH domain (PDB 6tt2). The colors for the numbers of pathogenic variants follow the scale shown in the figure.