Maxim S Kovalev, Anna A Igolkina, Maria G Samsonova, Sergey V Nuzhdin.
Abstract
The impact of deleterious variation on plant fitness and crop productivity is not completely understood and remains actively debated. Deleterious mutations in plants have so far been predicted solely with sequence conservation methods rather than function-based classifiers, owing to the lack of well-annotated mutational datasets for these organisms. Here, we developed a machine learning classifier based on a dataset of deleterious and neutral mutations in Arabidopsis thaliana by extracting 18 informative features that discriminate deleterious from neutral mutations, including 9 novel features not used in previous studies. We examined linear SVM, Gaussian SVM, and Random Forest classifiers, of which Random Forest performed best. Random Forest classifiers exhibited markedly higher accuracy than the popular PolyPhen-2 tool on the Arabidopsis dataset. Additionally, we tested whether the Random Forest trained on the Arabidopsis dataset accurately predicts deleterious mutations in Oryza sativa and Pisum sativum and observed satisfactory accuracy (87% and 93%, respectively), higher than that obtained with PolyPhen-2. Applying transfer learning to the classifiers did not improve their performance. To further test the performance of the Random Forest classifier across angiosperm species, we applied it to annotate deleterious mutations in Cicer arietinum and validated the predictions using population frequency data. Overall, we devised a classifier with the potential to improve the annotation of putative functional mutations in QTL and GWAS hit regions, as well as the evolutionary analysis of the proliferation of deleterious mutations during plant domestication, thus optimizing breeding improvement and the development of new cultivars.
Keywords: Cicer; Oryza; Pisum; deleterious mutation; random forest (bagging) and machine learning
Year: 2018 PMID: 30546376 PMCID: PMC6279870 DOI: 10.3389/fpls.2018.01734
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
FIGURE 1. Distribution of the features used to characterize the impact of amino acid substitutions in protein sequences, shown for the subsets of neutral and deleterious mutations in Arabidopsis thaliana. The first row of features (Grantham, Sneath, Epstein, Miyata, and Blo62 [BLOSUM62]) shows distributions of substitution scores based on the five corresponding distance matrices. The second row shows scores obtained with the PolyPhen-2 service: pph2_Score1 and pph2_dScore reflect PSIC scores; pph2_IdPmax, pph2_IdQmin, and pph2_Nobs are features based on the multiple protein alignments. The third row contains features of the protein secondary structure: two indicators of membership in a helix or strand (helix, strand) and three scores obtained with the PCI-SS service (E_dist, T_dist, H_dist). The last row includes two features of the amino acid context around the substitution of interest (Neighb1, Neighb2) and membership in known Pfam domains (PfamHit). A detailed explanation of the features is presented in Supplementary Table S1.
Performance of four classifiers (PolyPhen-2, Linear SVM, Gaussian SVM, and Random Forest) on the Arabidopsis thaliana dataset.
| | PolyPhen-2 (PPh2) | | Linear SVM (lSVM) | | Gaussian SVM (gSVM) | | Random Forest (RF) | |
|---|---|---|---|---|---|---|---|---|
| Predicted class | Neutral | Deleterious | Neutral | Deleterious | Neutral | Deleterious | Neutral | Deleterious |
| Actual neutral | 293 | 64 | 296 | 61 | 301 | 56 | 306 | 51 |
| Actual deleterious | 100 | 543 | 70 | 573 | 74 | 569 | 60 | 583 |
| Accuracy | 0.836 | | 0.869 | | 0.870 | | 0.889 | |
| False Positive Rate (FPR) | 0.179 | | 0.171 | | 0.157 | | 0.143 | |
| False Negative Rate (FNR) | 0.156 | | 0.109 | | 0.115 | | 0.093 | |
| Sensitivity | 0.844 | | 0.891 | | 0.885 | | 0.907 | |
| Specificity | 0.821 | | 0.829 | | 0.843 | | 0.857 | |
| AUC | 0.907 | | 0.937 | | 0.935 | | 0.952 | |
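
The accuracy, FPR, FNR, sensitivity, and specificity reported for each classifier follow directly from its confusion-matrix counts when "deleterious" is treated as the positive class. The following is a minimal Python sketch of that arithmetic, not the authors' code. It assumes that rows are actual classes and columns are predicted classes, and that the counts are 293/64/100/543 (PPh2), 296/61/70/573 (lSVM), 301/56/74/569 (gSVM), and 306/51/60/583 (RF). The names `Confusion` and `TABLE` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 confusion matrix; 'deleterious' is the positive class."""
    tn: int  # actual neutral, predicted neutral
    fp: int  # actual neutral, predicted deleterious
    fn: int  # actual deleterious, predicted neutral
    tp: int  # actual deleterious, predicted deleterious

    def accuracy(self) -> float:
        return (self.tn + self.tp) / (self.tn + self.fp + self.fn + self.tp)

    def fpr(self) -> float:
        # Fraction of actual neutral mutations called deleterious.
        return self.fp / (self.fp + self.tn)

    def fnr(self) -> float:
        # Fraction of actual deleterious mutations called neutral.
        return self.fn / (self.fn + self.tp)

    def sensitivity(self) -> float:
        return 1.0 - self.fnr()

    def specificity(self) -> float:
        return 1.0 - self.fpr()


# Assumed counts (actual neutral row first, then actual deleterious row).
TABLE = {
    "PPh2": Confusion(tn=293, fp=64, fn=100, tp=543),
    "lSVM": Confusion(tn=296, fp=61, fn=70, tp=573),
    "gSVM": Confusion(tn=301, fp=56, fn=74, tp=569),
    "RF": Confusion(tn=306, fp=51, fn=60, tp=583),
}

if __name__ == "__main__":
    for name, cm in TABLE.items():
        print(
            f"{name}: acc={cm.accuracy():.3f} "
            f"FPR={cm.fpr():.3f} FNR={cm.fnr():.3f} "
            f"sens={cm.sensitivity():.3f} spec={cm.specificity():.3f}"
        )
```

Every classifier is evaluated on the same 357 neutral and 643 deleterious mutations, so the four sets of rates are directly comparable. AUC cannot be recovered from these counts because it depends on the classifier scores rather than on a single decision threshold.
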
FIGURE 2. Classification accuracy of 300 Random Forest classifiers trained on the Arabidopsis thaliana dataset and applied to classify mutations in pea and rice. Some of the 300 classifiers yielded identical pairs of accuracy values for Oryza sativa and Pisum sativum; the size and color of each circle show how many classifiers share that performance. The accuracy of the best classifier is highlighted in red.
Performance of classifiers trained on the Arabidopsis dataset in discriminating deleterious from neutral mutations in rice and pea.
| | Rice (Oryza sativa) | | | | Pea (Pisum sativum) | | | |
|---|---|---|---|---|---|---|---|---|
| Classifier | Accuracy | FPR | FNR | AUC | Accuracy | FPR | FNR | AUC |
| PPh2 | 0.814 | 0.102 | 0.270 | 0.855 | 0.897 | 0.044 | 0.162 | 0.975 |
| lSVM | 0.848 | 0.144 | 0.160 | 0.918 | 0.912 | 0.103 | 0.074 | 0.971 |
| gSVM | 0.842 | 0.164 | 0.152 | 0.890 | 0.912 | 0.088 | 0.088 | 0.955 |
| RF | 0.873 | 0.115 | 0.139 | 0.928 | 0.926 | 0.074 | 0.074 | 0.981 |
| lSVM + TL | 0.848 | 0.144 | 0.160 | 0.918 | 0.912 | 0.103 | 0.074 | 0.971 |
| gSVM + TL | 0.803 | 0.285 | 0.110 | 0.902 | 0.904 | 0.147 | 0.044 | 0.960 |
| RF + TL | 0.861 | 0.128 | 0.149 | 0.926 | 0.919 | 0.088 | 0.074 | 0.979 |
Comparison of the numbers of deleterious and neutral mutations predicted by PolyPhen-2 and the Random Forest classifier in Cicer arietinum.
| | Random Forest: Neutral | Random Forest: Deleterious |
|---|---|---|
| PolyPhen-2: Neutral | 1923 | 239 |
| PolyPhen-2: Deleterious | 278 | 851 |
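
One way to summarize this comparison is the fraction of chickpea mutations on which the two tools agree, together with a chance-corrected agreement score. The following Python sketch computes both from the four counts. Cohen's kappa is used here purely as an illustration and is not a statistic reported in the source. The function name `agreement_stats` and its parameter names are hypothetical, and the code assumes that the rows of the table are PolyPhen-2 calls and the columns are Random Forest calls (observed agreement and kappa are unchanged if the two are swapped).

```python
def agreement_stats(both_neutral, rf_only_del, pph2_only_del, both_del):
    """Return (n, observed agreement, Cohen's kappa) for a 2x2 table.

    both_neutral  : neutral according to both tools
    rf_only_del   : neutral by PolyPhen-2, deleterious by Random Forest
    pph2_only_del : deleterious by PolyPhen-2, neutral by Random Forest
    both_del      : deleterious according to both tools
    """
    n = both_neutral + rf_only_del + pph2_only_del + both_del
    observed = (both_neutral + both_del) / n

    # Marginal fraction of mutations each tool calls neutral.
    pph2_neutral = (both_neutral + rf_only_del) / n
    rf_neutral = (both_neutral + pph2_only_del) / n

    # Agreement expected if the two tools labeled mutations independently.
    expected = (pph2_neutral * rf_neutral
                + (1 - pph2_neutral) * (1 - rf_neutral))
    kappa = (observed - expected) / (1 - expected)
    return n, observed, kappa


if __name__ == "__main__":
    # Counts taken from the table above.
    n, observed, kappa = agreement_stats(1923, 239, 278, 851)
    print(f"n={n} observed agreement={observed:.3f} kappa={kappa:.3f}")
```

The off-diagonal counts (239 and 278) are the mutations on which the two predictors disagree.
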
Mean frequencies of non-synonymous deleterious and neutral mutations, as well as synonymous mutations, in the chickpea dataset.
| Mutation type | Mean frequency |
|---|---|
| Deleterious | 0.050 |
| Neutral | 0.097 |
| Synonymous | 0.109 |
Results of the Wilcoxon rank-sum test comparing mutation frequencies (p-values).
| | Neutral | Synonymous |
|---|---|---|
| Deleterious | 0.036 (<0.05) | 0.003 (<0.05) |
| Neutral | | 0.279 (>0.05) |
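
For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. It is not the authors' analysis: the source does not state which implementation or variant was used, the function names `average_ranks` and `rank_sum_test` are illustrative, and the two frequency lists are hypothetical values invented for the example, not chickpea data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test via the normal approximation.

    No tie or continuity correction is applied to the variance, so this is
    a rough sketch suited to moderately large samples.
    Returns (z, p_value); z < 0 means x tends to be smaller than y.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical allele frequencies, for illustration only.
    deleterious = [0.01, 0.02, 0.02, 0.03, 0.05, 0.04, 0.08, 0.03]
    neutral = [0.05, 0.09, 0.12, 0.07, 0.15, 0.10, 0.06, 0.11]
    z, p = rank_sum_test(deleterious, neutral)
    print(f"z={z:.3f} p={p:.4f}")
```

A p-value below 0.05 for the deleterious class against both other classes, alongside a non-significant neutral versus synonymous comparison, is the pattern reported in the table above.
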