| Literature DB >> 28330163 |
Abhigyan Nath, Karthikeyan Subbiah.
Abstract
To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These proteins share very low sequence and structural similarity with one another, which hampers their annotation by sequence similarity-based search methods. Machine learning-based methods are a suitable alternative, but their performance is affected by factors such as class imbalance, incomplete learning, and selection of inappropriate features. In this paper, we propose a novel approach to the class imbalance problem: finding the optimal class distribution for enhancing prediction accuracy for RNA silencing suppressors. The optimal class distribution was sought by applying different resampling techniques over a range of class distributions, from the natural distribution to the ideal (equal) distribution. The experimental results support the view that optimal class distribution plays an important role in achieving near-perfect learning. The best prediction results were obtained with the Sequential Minimal Optimization (SMO) learning algorithm: a sensitivity of 98.5 %, a specificity of 92.6 %, and an overall accuracy of 95.3 % on tenfold cross validation, further confirmed by a leave-one-out cross validation test. Models trained on training sets oversampled with the synthetic minority oversampling technique (SMOTE) also performed better than those trained on randomly undersampled or imbalanced training sets. Finally, we characterize the discriminatory sequence features of RNA silencing suppressors that distinguish them from other protein families.
Keywords: Balanced training set; Class imbalance problem; Optimal class distribution; RNA silencing; Random undersampling; ReliefF; SMOTE; SVM
Year: 2016 PMID: 28330163 PMCID: PMC4801844 DOI: 10.1007/s13205-016-0410-1
Source DB: PubMed Journal: 3 Biotech ISSN: 2190-5738 Impact factor: 2.406
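The resampling sweep described in the abstract reduces to simple count arithmetic: random undersampling discards majority examples to reach a target minority:majority ratio, while SMOTE at k % adds k/100 × n_min synthetic minority examples. A minimal sketch, using hypothetical counts (the real class sizes are not given in this record):

```python
def undersample_counts(n_min, n_maj, ratio):
    """Class sizes after undersampling the majority to a 1:`ratio` distribution."""
    return n_min, min(n_maj, n_min * ratio)

def smote_counts(n_min, n_maj, percent):
    """Class sizes after SMOTE oversampling the minority by `percent` %."""
    return n_min + n_min * percent // 100, n_maj

# Hypothetical dataset: 100 minority vs. 694 majority examples (~1:6.94 imbalance).
n_min, n_maj = 100, 694
print(undersample_counts(n_min, n_maj, 1))   # (100, 100) -> fully balanced
print(smote_counts(n_min, n_maj, 500))       # (600, 694)
print(smote_counts(n_min, n_maj, 594))       # (694, 694) -> fully balanced
```

Note that SMOTE reaching full balance at exactly 594 %, as in the tables below, implies a natural imbalance of roughly 1:6.94.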
Physicochemical groupings of amino acids taken for the present study
| S. no. | Name of amino acid property group | Amino acids in the specific group |
|---|---|---|
| 1. | Tiny amino acids group | Ala, Cys, Gly, Ser, Thr |
| 2. | Small amino acids group | Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr and Val |
| 3. | Aliphatic amino acids group | Ile, Leu and Val |
| 4. | Nonpolar amino acid group | Ala, Cys, Phe, Gly, Ile, Leu, Met, Pro, Val, Trp and Tyr |
| 5. | Aromatic amino acid group | Phe, His, Trp and Tyr |
| 6. | Polar amino acid group | Asp, Glu, His, Lys, Asn, Gln, Arg, Ser, and Thr |
| 7. | Charged amino acid group | Asp, Glu, His, Arg, Lys |
| 8. | Basic amino acid group | His, Lys and Arg |
| 9. | Acidic amino acid group | Asp and Glu |
| 10. | Hydrophobic amino acid group | Ala, Cys, Phe, Ile, Leu, Met, Val, Trp, Tyr |
| 11. | Hydrophilic amino acid group | Asp, Glu, Lys, Asn, Gln |
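The groupings above translate into composition features by counting the fraction of residues that fall in each group. A minimal sketch, with group membership transcribed from the table into one-letter codes (function name illustrative):

```python
GROUPS = {
    "tiny":        set("ACGST"),
    "small":       set("ACDGNPSTV"),
    "aliphatic":   set("ILV"),
    "nonpolar":    set("ACFGILMPVWY"),
    "aromatic":    set("FHWY"),
    "polar":       set("DEHKNQRST"),
    "charged":     set("DEHRK"),
    "basic":       set("HKR"),
    "acidic":      set("DE"),
    "hydrophobic": set("ACFILMVWY"),
    "hydrophilic": set("DEKNQ"),
}

def group_composition(seq):
    """Fraction of residues of `seq` belonging to each physicochemical group."""
    n = len(seq)
    return {g: sum(aa in members for aa in seq) / n for g, members in GROUPS.items()}

feats = group_composition("MDEKLIV")   # toy heptapeptide
```

Because the groups overlap (e.g. Asp is small, polar, charged, acidic, and hydrophilic), the fractions do not sum to one; each group contributes an independent feature.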
Fig. 1 Schematic representation of the current pipeline
Performance evaluation metrics of the different learning algorithms trained on the imbalanced datasets
| Learning algorithms | Sensitivity | Specificity | Accuracy | AUC | Youden’s Index | Dominance | G-Mean |
|---|---|---|---|---|---|---|---|
| Imbalanced data set | |||||||
| NB | 90.8 | 29.2 | 36.9 | 0.678 | 0.200 | 0.616 | 51.49 |
| FLDA | 64.7 | 84.9 | 82.3 | 0.819 | 0.492 | −0.202 | 74.1 |
| SMO | 52.1 | 97.1 | 91.4 | 0.746 | 0.496 | −0.450 | 71.1 |
| IBK | 68.9 | 97.0 | 93.4 | 0.841 | 0.659 | −0.281 | 81.7 |
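The derived columns in these tables are consistent with the standard definitions: Youden's index = Se + Sp − 1, dominance = Se − Sp, and G-mean = √(Se·Sp). A quick sketch reproducing the NB row above:

```python
from math import sqrt

def derived_metrics(sens, spec):
    """Youden's index, dominance, and G-mean (%) from sensitivity/specificity in %."""
    se, sp = sens / 100, spec / 100
    return {
        "youden":    se + sp - 1,
        "dominance": se - sp,
        "gmean_pct": 100 * sqrt(se * sp),
    }

m = derived_metrics(90.8, 29.2)   # NB on the imbalanced data set
# youden ≈ 0.200, dominance ≈ 0.616, G-mean ≈ 51.5 %
```

The G-mean penalizes classifiers that trade one class off against the other, which is why NB's high sensitivity but poor specificity still yields a G-mean of only ~51 %.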
Performance evaluation metrics of the different machine learning algorithms trained on the different randomly undersampled training sets
| Learning algorithms | Sensitivity | Specificity | Accuracy | AUC | Youden’s Index | Dominance | G-Mean |
|---|---|---|---|---|---|---|---|
| Undersampling (1:1) (fully balanced) training set | |||||||
| NB | 91.6 | 23.5 | 57.6 | 0.631 | 0.151 | 0.681 | 46.3 |
| FLDA | 73.9 | 68.5 | 71.4 | 0.768 | 0.424 | 0.054 | 71.1 |
| SMO | 77.3 | 74.8 | 76.1 | 0.761 | 0.521 | 0.025 | 76.0 |
| IBK | 80.7 | 81.5 | 81.1 | 0.818 | 0.622 | −0.008 | 81.5 |
| Undersampling (1:2) training set | |||||||
| NB | 89.1 | 30.3 | 49.9 | 0.666 | 0.194 | 0.588 | 51.9 |
| FLDA | 63.0 | 63.0 | 63.0 | 0.661 | 0.26 | 0 | 63 |
| SMO | 72.3 | 88.7 | 83.2 | 0.805 | 0.61 | −0.164 | 80.08 |
| IBK | 72.3 | 90.8 | 84.6 | 0.809 | 0.631 | −0.185 | 81.0 |
| Undersampling (1:3) training set | |||||||
| NB | 90.8 | 28.9 | 44.3 | 0.664 | 0.197 | 0.619 | 51.2 |
| FLDA | 58.8 | 55.7 | 56.5 | 0.613 | 0.145 | 0.031 | 57.2 |
| SMO | 67.2 | 91.9 | 85.7 | 0.796 | 0.591 | −0.247 | 78.5 |
| IBK | 72.3 | 93.0 | 87.8 | 0.820 | 0.653 | −0.207 | 81.9 |
| Undersampling (1:4) training set | |||||||
| NB | 88.2 | 31.1 | 42.5 | 0.694 | 0.193 | 0.571 | 52.37 |
| FLDA | 64.7 | 73.5 | 71.8 | 0.731 | 0.382 | −0.088 | 68.9 |
| SMO | 63.0 | 92.4 | 86.6 | 0.777 | 0.554 | −0.294 | 76.2 |
| IBK | 68.9 | 94.7 | 89.6 | 0.823 | 0.636 | −0.258 | 80.7 |
| Undersampling (1:5) training set | |||||||
| NB | 89.1 | 31.1 | 40.8 | 0.692 | 0.202 | 0.58 | 52.6 |
| FLDA | 66.4 | 79.0 | 76.9 | 0.791 | 0.454 | −0.126 | 72.42 |
| SMO | 57.1 | 94.6 | 88.4 | 0.759 | 0.517 | −0.375 | 73.4 |
| IBK | 70.6 | 93.9 | 90.1 | 0.841 | 0.645 | −0.233 | 81.4 |
| Undersampling (1:6) training set | |||||||
| NB | 89.1 | 29.6 | 38.1 | 0.688 | 0.187 | 0.595 | 51.3 |
| FLDA | 68.1 | 80.4 | 78.6 | 0.805 | 0.485 | −0.123 | 73.9 |
| SMO | 56.3 | 95.0 | 89.4 | 0.756 | 0.513 | −0.387 | 73.13 |
| IBK | 71.4 | 95.2 | 91.8 | 0.824 | 0.666 | −0.238 | 82.4 |
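Random undersampling as used for these training sets can be sketched in a few lines: keep every minority example and draw a random subset of the majority to hit the target ratio (data and function names are illustrative):

```python
import random

def random_undersample(minority, majority, ratio, seed=0):
    """Keep all minority examples; sample `ratio` majority examples per minority one."""
    rng = random.Random(seed)
    k = min(len(majority), ratio * len(minority))
    return minority + rng.sample(majority, k)

# Toy labelled examples: 10 positives (suppressors) vs. 60 negatives.
pos = [("pos%d" % i, 1) for i in range(10)]
neg = [("neg%d" % i, 0) for i in range(60)]
train_1_2 = random_undersample(pos, neg, ratio=2)   # 10 positives + 20 negatives
```

The drawback visible in the tables is information loss: the discarded majority examples never contribute to the decision boundary, which is one reason the undersampled models trail the SMOTE-oversampled ones.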
Performance evaluation metrics of the different machine learning algorithms trained on the different SMOTE oversampled training sets
| Learning algorithms | Sensitivity | Specificity | Accuracy | AUC | Youden’s Index | Dominance | G-Mean |
|---|---|---|---|---|---|---|---|
| SMOTE 100 % training set | |||||||
| NB | 91.2 | 33.1 | 46.1 | 0.738 | 0.243 | 0.581 | 54.9 |
| FLDA | 81.5 | 84.5 | 83.8 | 0.896 | 0.660 | −0.030 | 82.9 |
| SMO | 81.1 | 94.7 | 91.6 | 0.879 | 0.758 | −0.136 | 87.6 |
| IBK | 97.9 | 85.1 | 88.0 | 0.912 | 0.830 | 0.128 | 91.2 |
| SMOTE 200 % training set | |||||||
| NB | 91.6 | 35.0 | 52.1 | 0.749 | 0.266 | 0.566 | 56.6 |
| FLDA | 91.3 | 85.4 | 87.2 | 0.934 | 0.767 | 0.005 | 88.3 |
| SMO | 92.4 | 93.9 | 93.5 | 0.932 | 0.863 | −0.015 | 93.1 |
| IBK | 98.9 | 79.7 | 85.5 | 0.894 | 0.786 | 0.192 | 88.7 |
| SMOTE 300 % training set | |||||||
| NB | 91.2 | 36.0 | 56.1 | 0.751 | 0.272 | 0.552 | 57.2 |
| FLDA | 95.2 | 84.4 | 88.3 | 0.946 | 0.796 | 0.108 | 89.6 |
| SMO | 96.2 | 92.3 | 93.7 | 0.942 | 0.885 | 0.003 | 94.2 |
| IBK | 99.4 | 79.1 | 86.5 | 0.890 | 0.785 | 0.203 | 88.6 |
| SMOTE 400 % training set | |||||||
| NB | 90.9 | 36.9 | 56.1 | 0.751 | 0.278 | 0.54 | 57.9 |
| FLDA | 95.8 | 84.9 | 89.4 | 0.952 | 0.807 | 0.109 | 90.1 |
| SMO | 96.5 | 91.8 | 93.7 | 0.941 | 0.883 | 0.047 | 94.1 |
| IBK | 99.3 | 74.6 | 84.9 | 0.870 | 0.733 | 0.247 | 86.0 |
| SMOTE 500 % training set | |||||||
| NB | 92.0 | 36.8 | 62.4 | 0.745 | 0.288 | 0.552 | 58.1 |
| FLDA | 97.3 | 83.7 | 90.0 | 0.962 | 0.810 | 0.136 | 90.2 |
| SMO | 98.5 | 92.6 | 95.3 | 0.955 | 0.911 | 0.059 | 95.5 |
| IBK | 99.6 | 73.8 | 85.8 | 0.867 | 0.734 | 0.258 | 85.7 |
| SMOTE 594 % (fully balanced) training set | |||||||
| NB | 92.4 | 36.4 | 64.4 | 0.742 | 0.288 | 0.56 | 57.9 |
| FLDA | 97.7 | 85.1 | 91.4 | 0.964 | 0.828 | 0.12 | 91.1 |
| SMO | 97.9 | 90.8 | 94.4 | 0.944 | 0.887 | 0.071 | 94.2 |
| IBK | 99.6 | 73.5 | 86.6 | 0.862 | 0.731 | 0.261 | 85.5 |
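SMOTE generates synthetic minority examples by interpolating between a minority point and one of its k nearest minority neighbours. A minimal pure-Python sketch of the idea, purely illustrative rather than the implementation used in the study:

```python
import random

def smote(X, percent, k=5, seed=0):
    """Return percent/100 synthetic samples per minority sample in X (feature tuples)."""
    rng = random.Random(seed)
    n_new = len(X) * percent // 100
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(X)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((p for p in X if p is not x),
                            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()                      # random point on the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # toy minority class
new = smote(X, 500, k=2)   # 20 synthetic points inside the minority region
```

Because the synthetic points lie on segments between genuine minority examples, SMOTE densifies the minority region instead of merely duplicating points, which is consistent with its advantage over undersampling in the tables above.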
Fig. 2 ROC curves of the four classifiers using the training set with optimal class distribution [SMOTE (500 %)]
Leave-one-out cross validation performance evaluation metrics on the best training set
| Learning algorithms | Sensitivity | Specificity | Accuracy | AUC | Youden’s Index | Dominance | G-Mean |
|---|---|---|---|---|---|---|---|
| LOOCV on SMOTE (500 %) | |||||||
| NB | 92.3 | 36.4 | 62.3 | 0.745 | 0.287 | 0.559 | 57.96 |
| FLDA | 97.2 | 85.1 | 90.7 | 0.966 | 0.823 | 0.121 | 90.90 |
| SMO | 98.9 | 92.3 | 95.3 | 0.956 | 0.912 | 0.066 | 95.50 |
| IBK | 99.4 | 75.8 | 86.8 | 0.876 | 0.752 | 0.236 | 86.80 |
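Leave-one-out cross validation trains on all examples except one and tests on the held-out example, cycling once through the whole set. A sketch with a 1-nearest-neighbour classifier standing in for the four learners evaluated above (toy data, illustrative only):

```python
def nn_predict(train, x):
    """1-nearest-neighbour label; an illustrative stand-in for NB/FLDA/SMO/IBK."""
    return min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

def loocv_accuracy(data):
    """Fraction of examples classified correctly when each is held out in turn."""
    correct = sum(nn_predict(data[:i] + data[i + 1:], x) == y
                  for i, (x, y) in enumerate(data))
    return correct / len(data)

data = [((0.0,), 0), ((0.1,), 0), ((0.9,), 1), ((1.0,), 1)]
acc = loocv_accuracy(data)   # 1.0 on this separable toy set
```

LOOCV is the extreme case of k-fold cross validation (k = n); its near-identical numbers to the tenfold results above indicate the SMO model's performance is stable under the resampling scheme.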
Comparison of the performance evaluation metrics of the current work with the previous methods
| Methods | Sensitivity | Specificity | Accuracy | AUC | Youden’s Index | Dominance | G-Mean |
|---|---|---|---|---|---|---|---|
| Jagga and Gupta | 80.90 | 80.57 | 80.61 | 0.910 | 0.614 | 0.003 | 80.70 |
| SMO [SMOTE (500 %)] | 98.50 | 92.60 | 95.30 | 0.955 | 0.911 | 0.059 | 95.50 |
Fig. 3 Heat map representation of ranking the sequence features (excluding dipeptides) according to their discriminative ability
Fig. 4 Heat map representation of ranking the dipeptides according to their discriminative ability
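The ReliefF keyword points at the ranking behind these heat maps: for a sampled instance, the weight of each feature is decreased by its distance to the nearest same-class neighbour (hit) and increased by its distance to the nearest other-class neighbour (miss). A simplified Relief sketch (one neighbour, binary classes; not the full ReliefF algorithm):

```python
import random

def relief_weights(X, y, n_iter=100, seed=0):
    """Simplified Relief (k = 1): reward features that differ across classes
    and agree within a class. X is a list of feature tuples, y the labels."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hits   = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        dist = lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
        h, m = min(hits, key=dist), min(misses, key=dist)
        for f in range(d):
            w[f] += abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])
    return [v / n_iter for v in w]

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 0.5), (0.1, 0.4), (0.9, 0.5), (1.0, 0.4)]
y = [0, 0, 1, 1]
w = relief_weights(X, y)   # w[0] > w[1]: feature 0 is the discriminative one
```

Sorting features by these weights yields exactly the kind of discriminative-ability ranking visualized in Figs. 3 and 4.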