Yu Huang, Ningning He, Yu Chen, Zhen Chen, Lei Li.
Abstract
N6-methyladenosine (m6A) is a prevalent RNA methylation modification involved in several biological processes. Hundreds or thousands of m6A sites identified in different species by high-throughput experiments provide a rich resource for constructing in-silico approaches to identify m6A sites. Existing m6A predictors were developed using conventional machine-learning (ML) algorithms, and most are species-centric. In this paper, we develop a novel cross-species deep-learning classifier based on the bidirectional Gated Recurrent Unit (BGRU) for the prediction of m6A sites. Compared with conventional ML approaches, BGRU achieves outstanding performance on the Mammalia dataset, which contains over fifty thousand m6A sites, but inferior performance on the Saccharomyces cerevisiae dataset, which covers around a thousand positives. The accuracy of BGRU is therefore sensitive to data size, and this sensitivity is compensated for by integrating a random forest classifier with a novel encoding of enhanced nucleic acid content. The integrated approach, dubbed the BGRU-based Ensemble RNA Methylation site Predictor (BERMP), shows competitive performance in both cross-validation and independent tests. BERMP also outperforms existing m6A predictors for different species. Therefore, BERMP is a novel multi-species tool for identifying m6A sites with high confidence. The classifier is freely available at http://www.bioinfogo.org/bermp.
Keywords: Deep learning; N6-methyladenosine; Random forest; Recurrent neural network; bidirectional Gated Recurrent Unit
Year: 2018 PMID: 30416381 PMCID: PMC6216033 DOI: 10.7150/ijbs.27819
Source DB: PubMed Journal: Int J Biol Sci ISSN: 1449-2288 Impact factor: 6.580
Figure 1. The framework of BERMP. BERMP covers three species (Mammalia, Saccharomyces cerevisiae and Arabidopsis thaliana) with two prediction modes (full transcript mode and mature mRNA mode). After a species and a mode are selected, the query sequences are analyzed, and the consensus motifs are extracted with their flanking nucleotides and submitted to the random forest (RF)-based classifier with ENAC encoding (left) and the bidirectional GRU-based deep-learning classifier with word embedding (right). The prediction scores from both classifiers are integrated through a logistic regression approach to output the final prediction score.
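The ENAC (enhanced nucleic acid content) encoding fed to the RF branch can be illustrated with a short sketch. The text here only names the encoding, so the definition used below (nucleotide frequencies inside a fixed-length window that slides along the sequence), the window size of 5, the `ACGU` alphabet, and the function name `enac_encode` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def enac_encode(seq, window=5, alphabet="ACGU"):
    """Sketch of an ENAC-style encoding (assumed definition).

    For every window position along the sequence, emit the frequency of
    each nucleotide inside that window, and concatenate the results.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    features = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        features.extend(counts[base] / window for base in alphabet)
    return features


# A 5-nt sequence with a 5-nt window yields one window, i.e. 4 features.
print(enac_encode("GGACU"))
# A longer flanking sequence yields (len - window + 1) * 4 features.
print(len(enac_encode("AUGGACUAAGC")))
```

Under this reading, the feature vector keeps positional information (one block of four frequencies per window) while smoothing over single-nucleotide noise, which is one plausible reason such an encoding could remain useful on small datasets.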
Figure 2. Performance comparison of the seven m… The AUC (A) and AUC01 values (B) for the mammalian mRNA mode were calculated via five-fold cross validation (Figure S1A). The AUC (C) and AUC01 values (D) for the Saccharomyces cerevisiae mRNA mode were calculated via ten-fold cross validation (Figure S1B). For each algorithm, the AUC or AUC01 values of adjacent data sets were statistically compared, and a horizontal line represents no statistical difference (P > 0.05). P values were calculated with a paired Student's t-test.
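AUC01 denotes the area under the ROC curve restricted to false positive rates below 10%. The following is a minimal sketch of how such a partial area could be computed with the trapezoidal rule. The function names `roc_points` and `partial_auc` are hypothetical, and the choice to leave the area unnormalized (maximum 0.1) is an assumption, although it is consistent with the reported AUC01 values, which are all below 0.1.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) from binary labels (1 = m6A site) and scores.

    Tied scores are processed together so that they add one diagonal segment.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
pts = roc_points(labels, scores)
print("AUC  :", partial_auc(pts))
print("AUC01:", partial_auc(pts, max_fpr=0.1))
```

Note that some libraries rescale partial AUC values to the range 0.5 to 1, so numbers from such tools are not directly comparable with an unnormalized area.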
Prediction results of different classifiers via cross validation.
| Species¹ | Classifiers² | Acc³ (%) | Sn³ (%) | Sp³ (%) | MCC³ | AUC³ | AUC01³ |
|---|---|---|---|---|---|---|---|
| | RF | 86.13 | 47.12 | 90.02 | 0.314 | 0.806 | 0.0340 |
| | RF | 85.39 | 37.33 | 90.18 | 0.241 | 0.769 | 0.0255 |
| | RF | 85.39 | 34.45 | 90.46 | 0.222 | 0.769 | 0.0219 |
| | RF | 85.17 | 30.70 | 90.60 | 0.193 | 0.727 | 0.0204 |
| | UGRU | 87.48 | 62.21 | 90.00 | 0.423 | 0.885 | 0.0403 |
| | BGRU | 87.57 | 63.15 | 90.00 | 0.430 | 0.889 | 0.0413 |
| | BERMP | 87.80 | 65.76 | 90.00 | 0.448 | 0.891 | 0.0456 |
| | RF | 85.74 | 38.80 | 90.42 | 0.256 | 0.761 | 0.0251 |
| | RF | 84.38 | 22.79 | 90.58 | 0.125 | 0.666 | 0.0143 |
| | RF | 83.75 | 20.08 | 90.09 | 0.094 | 0.623 | 0.0132 |
| | RF | 84.00 | 19.80 | 90.40 | 0.095 | 0.621 | 0.0124 |
| | UGRU | 85.90 | 43.73 | 90.10 | 0.289 | 0.813 | 0.0263 |
| | BGRU | 85.90 | 44.74 | 90.00 | 0.296 | 0.815 | 0.0272 |
| | BERMP | 86.14 | 46.58 | 90.08 | 0.311 | 0.817 | 0.0294 |
| | RF | 67.64 | 44.91 | 90.36 | 0.396 | 0.792 | 0.0285 |
| | RF | 61.41 | 32.27 | 90.55 | 0.281 | 0.724 | 0.0207 |
| | RF | 60.45 | 30.27 | 90.64 | 0.262 | 0.719 | 0.0209 |
| | RF | 58.77 | 27.36 | 90.18 | 0.226 | 0.693 | 0.0153 |
| | UGRU | 54.86 | 19.45 | 90.27 | 0.138 | 0.648 | 0.0101 |
| | BGRU | 56.86 | 23.64 | 90.09 | 0.184 | 0.679 | 0.0142 |
| | BERMP | 68.59 | 47.10 | 90.10 | 0.412 | 0.800 | 0.0280 |
| | RF | 81.02 | 71.71 | 90.33 | 0.632 | 0.898 | 0.0511 |
| | RF | 85.53 | 81.24 | 90.02 | 0.714 | 0.928 | 0.0612 |
| | RF | 84.33 | 78.67 | 90.00 | 0.691 | 0.919 | 0.0572 |
| | RF | 83.55 | 77.10 | 90.00 | 0.677 | 0.910 | 0.0514 |
| | UGRU | 84.95 | 79.71 | 90.20 | 0.703 | 0.923 | 0.0581 |
| | BGRU | 85.93 | 81.71 | 90.14 | 0.721 | 0.928 | 0.0583 |
| | BERMP | 85.95 | 81.81 | 90.10 | 0.722 | 0.927 | 0.0582 |
Note: ¹ The datasets and the number of folds for cross validation are depicted in Figure S1. ² RFENAC = RF classifier with the ENAC encoding; RFKSNPF = RF classifier with the encoding of K-spaced nucleotide pair frequencies; RFPseDNC = RF classifier with the encoding of pseudo dinucleotide composition; UGRU = the unidirectional GRU-based RNN classifier with word embedding; BGRU = the bidirectional GRU-based RNN classifier with word embedding; BERMP = BGRU-based Ensemble RNA Methylation site Predictor, which integrates BGRU and RFENAC. ³ Acc = accuracy; Sn = sensitivity; Sp = specificity; MCC = Matthews correlation coefficient; AUC = area under the receiver operating characteristic curve; AUC01 = AUC at a false positive rate below 10% (i.e., specificity > 90%).
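Accuracy and MCC follow from sensitivity, specificity and the ratio of negatives to positives, so individual rows of the table can be sanity-checked. The sketch below uses the standard confusion-matrix definitions; the helper name `metrics_from_rates` and the `neg_per_pos` parameter are hypothetical, and the class ratio of 1:1 used in the example is an assumption, since the dataset sizes are only given in Figure S1.

```python
import math


def metrics_from_rates(sn, sp, neg_per_pos=1.0):
    """Accuracy and MCC from sensitivity, specificity and class ratio.

    Counts are expressed per positive sample: one positive and
    `neg_per_pos` negatives (an assumed, user-supplied ratio).
    """
    tp, fn = sn, 1.0 - sn
    tn, fp = sp * neg_per_pos, (1.0 - sp) * neg_per_pos
    acc = (tp + tn) / (1.0 + neg_per_pos)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# BERMP row reporting Sn = 47.10% and Sp = 90.10%, assuming balanced classes.
acc, mcc = metrics_from_rates(0.4710, 0.9010, neg_per_pos=1.0)
print(f"Acc = {100 * acc:.2f}%, MCC = {mcc:.3f}")
```

With balanced classes this gives an accuracy of about 68.6% and an MCC of about 0.412, in line with the reported 68.59 and 0.412 for that row. Rows with much higher accuracy at similar sensitivity would imply a larger share of negatives.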
Figure 3. Relationship between data size and prediction performance of classifiers using the… The AUC values (A) and AUC01 values (B) were calculated using four different data sizes (all, one-fifth, one-tenth and one-fiftieth) via five-fold cross validation (Figure S1A).
Performance comparison of SRAMP and BERMP on the independent mammalian dataset at various stringency thresholds.
| Mode | Stringency (specificity) | SRAMP sensitivity | SRAMP MCC | BERMP sensitivity | BERMP MCC |
|---|---|---|---|---|---|
| Full transcript mode | Very high (98.7%) | 25.7% | 0.373 | 29.6% | 0.421 |
| Full transcript mode | High (93.7%) | 50.3% | 0.414 | 60.3% | 0.492 |
| Full transcript mode | Moderate (88.1%) | 64.5% | 0.405 | 74.9% | 0.475 |
| Full transcript mode | Low (83.0%) | 72.8% | 0.385 | 82.5% | 0.447 |
| Mature mRNA mode | Very high (99.1%) | 11.0% | 0.211 | 11.0% | 0.215 |
| Mature mRNA mode | High (95.0%) | 29.6% | 0.273 | 33.5% | 0.309 |
| Mature mRNA mode | Moderate (90.0%) | 44.0% | 0.293 | 48.7% | 0.325 |
| Mature mRNA mode | Low (85.3%) | 54.2% | 0.294 | 58.9% | 0.325 |
Note: The very high, high, moderate and low stringency thresholds correspond to approximately 99%, 95%, 90% and 85% specificity in five-fold cross-validation tests, respectively. The same datasets were used to develop and compare both classifiers (Figure S1). The results for SRAMP were excerpted from [15].
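Stringency thresholds of this kind are typically score cutoffs chosen so that a target fraction of cross-validation negatives falls at or below them. The following is a minimal sketch of that idea, not the procedure used by either tool; the function names, the toy scores and the rule "predict positive when score > cutoff" are assumptions.

```python
import math


def threshold_for_specificity(neg_scores, target_sp):
    """Smallest observed negative score that keeps specificity >= target_sp.

    A site is called positive when its score is strictly above the cutoff.
    """
    ranked = sorted(neg_scores)
    # Number of negatives that must score at or below the cutoff;
    # rounding first guards against floating-point noise in the product.
    k = math.ceil(round(target_sp * len(ranked), 9))
    k = min(max(k, 1), len(ranked))
    return ranked[k - 1]


def sensitivity_at(pos_scores, cutoff):
    """Fraction of true sites scoring strictly above the cutoff."""
    return sum(s > cutoff for s in pos_scores) / len(pos_scores)


# Toy cross-validation scores (hypothetical values for illustration only).
neg = [i / 100 for i in range(100)]
pos = [0.95, 0.50, 0.90, 0.30]

levels = [("very high", 0.99), ("high", 0.95), ("moderate", 0.90), ("low", 0.85)]
cutoffs = {name: threshold_for_specificity(neg, sp) for name, sp in levels}
for name, cutoff in cutoffs.items():
    print(name, cutoff, sensitivity_at(pos, cutoff))
```

Because the cutoffs are fixed on cross-validation data, the specificity observed on an independent set can drift from the nominal level, which is consistent with the table listing values such as 98.7% or 93.7% rather than exactly 99% or 95%.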
Comparison of BERMP and other predictors on identifying m6A sites from Saccharomyces cerevisiae.
| Predictor | Specificity (%) | Sensitivity (%) | Accuracy (%) | MCC | AUC |
|---|---|---|---|---|---|
| BERMP | 69.56 | 72.95 | 71.26 | 0.43 | 0.800 |
| pRNAm-PC | 69.75 | 69.72 | 69.74 | 0.40 | 0.762 |
| M6A-HPCS | 62.89 | 71.77 | 67.33 | 0.35 | 0.713 |
| RAM-NPPS | 69.08 | 72.46 | 70.77 | 0.42 | 0.780 |
Note: The classifiers were based on the same dataset [16]. The results for pRNAm-PC were excerpted from [22] and those for M6A-HPCS from [23]. RAM-NPPS was re-implemented and BERMP was developed using the same training dataset (Figure S1). The identical independent dataset was used for comparison, and the corresponding results are shown above (Figure S1).
Comparison between BERMP and RFAthM6A on identifying m6A sites from Arabidopsis thaliana.
| Predictor | Sensitivity at high specificity (90%) | MCC at high specificity (90%) | Sensitivity at moderate specificity (85%) | MCC at moderate specificity (85%) | Sensitivity at low specificity (80%) | MCC at low specificity (80%) |
|---|---|---|---|---|---|---|
| BERMP | 0.823 | 0.726 | 0.888 | 0.739 | 0.917 | 0.722 |
| RFAthM6A | 0.822 | 0.725 | 0.873 | 0.724 | 0.908 | 0.712 |
Note: The classifiers were developed and compared via five-fold cross validation based on the same dataset (Figure S1). Three specificity thresholds (high: 90%; moderate: 85%; low: 80%) were selected. The results for RFAthM6A were excerpted from [17].