Xiaoli Qiang1, Huangrong Chen2, Xiucai Ye3, Ran Su4, Leyi Wei2.
Abstract
As one of the well-studied RNA methylation modifications, N6-methyladenosine (m6A) plays important roles in various biological processes, such as RNA splicing and degradation. Identifying m6A sites is fundamentally important for a better understanding of their functional mechanisms. Recently, machine learning based prediction methods have emerged as an effective approach for fast and accurate identification of m6A sites. In this paper, we propose "M6AMRFS", a new machine learning based predictor for the identification of m6A sites. In this predictor, we use a new feature representation algorithm that encodes RNA sequences with two feature descriptors (dinucleotide binary encoding and local position-specific dinucleotide frequency), and apply the F-score algorithm combined with Sequential Forward Search (SFS) to enhance the feature representation ability. To predict m6A sites, we employ the eXtreme Gradient Boosting (XGBoost) algorithm to build a predictive model. Benchmarking results show that the proposed predictor is competitive with the state-of-the-art predictors. Importantly, the robust predictions of our predictor across multiple species demonstrate that our predictive models have strong generalization ability. To the best of our knowledge, M6AMRFS is the first tool that can be used for the identification of m6A sites in multiple species. To facilitate the use of our predictor, we have established a user-friendly webserver implementing M6AMRFS, which is currently available at http://server.malab.cn/M6AMRFS/. We anticipate that it will be a useful tool for research on m6A sites.
Keywords: N6-methyladenosine site; RNA methylation; eXtreme Gradient Boosting; feature representation; feature selection; machine learning
Year: 2018 PMID: 30410501 PMCID: PMC6209681 DOI: 10.3389/fgene.2018.00495
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of the benchmark datasets from four species.
| Datasets | Species | Positives | Negatives | Total | Sequence length | Reference |
|---|---|---|---|---|---|---|
| Dataset-S51 | | 1307 | 1307 | 2614 | 51 nt | |
| Dataset-H41 | | 1130 | 1130 | 2260 | 41 nt | |
| Dataset-M41 | | 725 | 725 | 1450 | 41 nt | |
| Dataset-A101 | | 1000 | 1000 | 2000 | 101 nt | |
FIGURE 1. Framework of the algorithms proposed in this study. There are two main steps. In the first step, irrelevant sequences are filtered out of the input RNA sequences, and the remaining sequences are fed into the proposed feature extraction algorithm for feature representation. In the second step, the resulting feature representations are optimized by feature selection, and the optimal feature representations are then fed into an XGBoost model for prediction.
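The two feature descriptors named in the abstract can be illustrated with a minimal Python sketch. The exact definitions are not given in this record, so everything here is an assumption: the 4-bit code book (`DINUC_CODE`, dinucleotides in lexicographic order), the reading of the local position-specific dinucleotide frequency as "occurrences of the dinucleotide ending at position i within the prefix of length i, divided by i", and all function names are hypothetical rather than taken from M6AMRFS.

```python
from itertools import product

NUCLEOTIDES = "ACGU"

# Assumed code book: the 16 dinucleotides, in lexicographic order,
# are mapped to the 4-bit binary codes 0000 .. 1111.
DINUC_CODE = {
    a + b: [int(bit) for bit in format(index, "04b")]
    for index, (a, b) in enumerate(product(NUCLEOTIDES, repeat=2))
}


def normalize(seq):
    """Upper-case the sequence and treat DNA-style T as U."""
    return seq.upper().replace("T", "U")


def dinucleotide_binary_encoding(seq):
    """Concatenate the 4-bit code of every overlapping dinucleotide."""
    seq = normalize(seq)
    features = []
    for i in range(len(seq) - 1):
        features.extend(DINUC_CODE[seq[i:i + 2]])
    return features


def local_position_specific_dinucleotide_frequency(seq):
    """For each dinucleotide, divide its running count in the prefix
    that ends with it by the length of that prefix (assumed definition)."""
    seq = normalize(seq)
    counts = {}
    features = []
    for i in range(1, len(seq)):
        dinuc = seq[i - 1:i + 1]
        counts[dinuc] = counts.get(dinuc, 0) + 1
        features.append(counts[dinuc] / (i + 1))  # prefix length is i + 1
    return features


def encode(seq):
    """Concatenate both descriptors into one feature vector."""
    return (dinucleotide_binary_encoding(seq)
            + local_position_specific_dinucleotide_frequency(seq))
```

Under these assumptions a sequence of length L yields 4(L - 1) binary features plus (L - 1) frequency features; the feature dimension actually used by the authors may differ.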
FIGURE 2. Procedure of feature selection.
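The abstract describes feature selection as the F-score algorithm combined with Sequential Forward Search. A rough sketch of one common way to combine them follows: rank features by F-score, then grow the subset one ranked feature at a time and keep the best-scoring prefix. Whether M6AMRFS follows exactly this scheme is an assumption. The F-score formula is the widely used two-class definition, and `centroid_accuracy` is a hypothetical stand-in evaluator on toy data, not the XGBoost model used in the paper.

```python
from statistics import mean, variance


def f_score(column, labels):
    """Two-class F-score of one feature: between-class separation of the
    class means divided by the summed within-class sample variances."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    overall = mean(column)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, labels):
    """Return feature indices sorted by decreasing F-score."""
    scores = [f_score([row[j] for row in X], labels) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def sequential_forward_search(X, labels, evaluate):
    """Add ranked features one at a time; keep the best-scoring prefix.
    `evaluate(X_subset, labels)` is any callable returning a score."""
    ranked = rank_features(X, labels)
    best_score, best_subset = float("-inf"), []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate([[row[j] for j in subset] for row in X], labels)
        if score > best_score:  # strict: ties keep the smaller subset
            best_score, best_subset = score, subset
    return best_subset, best_score


def centroid_accuracy(X, labels):
    """Hypothetical evaluator: training accuracy of a nearest-centroid rule."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    c1 = centroid([r for r, y in zip(X, labels) if y == 1])
    c0 = centroid([r for r, y in zip(X, labels) if y == 0])
    preds = [1 if sq_dist(r, c1) < sq_dist(r, c0) else 0 for r in X]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.9, 0.2], [0.8, 0.9], [1.0, 0.5], [0.1, 0.3], [0.2, 0.8], [0.0, 0.6]]
y = [1, 1, 1, 0, 0, 0]
subset, score = sequential_forward_search(X, y, centroid_accuracy)
```

In practice the evaluator would be a cross-validated score of the chosen classifier rather than training accuracy; the strict comparison simply prefers the smallest subset among equally good ones.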
FIGURE 3. Performance of XGBoost and other classifiers on four benchmark datasets.
Performances of XGBoost and other machine learning algorithms.
| Dataset-S51 | Acc | Sn | Sp | MCC | Dataset-H41 | Acc | Sn | Sp | MCC |
|---|---|---|---|---|---|---|---|---|---|
| GBDT | 0.7234 | 0.7200 | 0.7269 | 0.4468 | GBDT | 0.9089 | 0.8204 | 0.9973 | 0.8308 |
| KNN | 0.6167 | 0.7337 | 0.4996 | 0.2400 | KNN | 0.6566 | 0.4062 | 0.9071 | 0.3620 |
| LR | 0.7192 | 0.6924 | 0.7460 | 0.4390 | LR | 0.9066 | 0.8204 | 0.9929 | 0.8257 |
| NB | 0.7050 | 0.7100 | 0.7001 | 0.4101 | NB | 0.8155 | 0.6327 | 0.9982 | 0.6779 |
| RF | 0.7165 | 0.7192 | 0.7138 | 0.4331 | RF | 0.8982 | 0.7965 | 1.0000 | 0.8135 |
| SVM | 0.7257 | 0.7169 | 0.7345 | 0.4515 | SVM | 0.9018 | 0.8035 | 1.0000 | 0.8195 |
| XGBoost | 0.7314 | 0.7345 | 0.7284 | 0.4629 | XGBoost | 0.9089 | 0.8195 | 0.9982 | 0.8311 |
| Dataset-M41 | Acc | Sn | Sp | MCC | Dataset-A101 | Acc | Sn | Sp | MCC |
| GBDT | 0.8890 | 0.7779 | 1.0000 | 0.7979 | GBDT | 0.7795 | 0.7624 | 0.7967 | 0.5594 |
| KNN | 0.6448 | 0.4303 | 0.8593 | 0.3207 | KNN | 0.6638 | 0.7524 | 0.5752 | 0.3329 |
| LR | 0.8807 | 0.7793 | 0.9821 | 0.7775 | LR | 0.7914 | 0.7910 | 0.7919 | 0.5829 |
| NB | 0.7862 | 0.5793 | 0.9931 | 0.6288 | NB | 0.7517 | 0.8005 | 0.7029 | 0.5057 |
| RF | 0.8890 | 0.7779 | 1.0000 | 0.7979 | RF | 0.7260 | 0.7152 | 0.7367 | 0.4520 |
| SVM | 0.8848 | 0.7766 | 0.9931 | 0.7884 | SVM | 0.7971 | 0.7957 | 0.7986 | 0.5943 |
| XGBoost | 0.8890 | 0.7779 | 1.0000 | 0.7979 | XGBoost | 0.7890 | 0.7824 | 0.7957 | 0.5781 |
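The four metrics in the table (accuracy, sensitivity, specificity, and Matthews correlation coefficient) follow their standard confusion-matrix definitions, sketched below. The counts in the example are not stated in the source: they are inferred from the RF row of Dataset-H41 (Sn 0.7965, Sp 1.0000) together with the 1130 positives and 1130 negatives listed for that dataset, and the function name is hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Compute Acc, Sn, Sp and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)  # sensitivity: recall on positives
    sp = tn / (tn + fp)  # specificity: recall on negatives
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Inferred counts (assumption): 900 of 1130 positives and all 1130
# negatives classified correctly.
metrics = classification_metrics(tp=900, fn=230, tn=1130, fp=0)
rounded = {name: round(value, 4) for name, value in metrics.items()}
print(rounded)
# → {'Acc': 0.8982, 'Sn': 0.7965, 'Sp': 1.0, 'MCC': 0.8135}
```

These rounded values agree with the RF row reported for Dataset-H41, which suggests the table uses the same definitions.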
Performance of features before and after feature selection.
| Datasets | Methods | Acc | Sn | Sp | MCC |
|---|---|---|---|---|---|
FIGURE 4. Results of feature selection with a varying number of features on (A) Dataset-S51, (B) Dataset-H41, (C) Dataset-M41, and (D) Dataset-A101.
Comparison with other feature representation algorithms.
| Dataset-S51 | Acc | Sn | Sp | MCC | Dataset-H41 | Acc | Sn | Sp | MCC |
|---|---|---|---|---|---|---|---|---|---|
| RFH | 0.7295 | 0.7582 | 0.7008 | 0.4598 | RFH | 0.9097 | 0.8195 | 1 | 0.8332 |
| PseDNC | 0.64 | 0.6993 | 0.5807 | 0.282 | PseDNC | 0.6956 | 0.5973 | 0.7938 | 0.3989 |
| PCP | 0.627 | 0.6389 | 0.6151 | 0.2541 | PCP | 0.6447 | 0.6177 | 0.6717 | 0.2898 |
| KNN | 0.7131 | 0.6917 | 0.7345 | 0.4266 | KNN | 0.8235 | 0.7363 | 0.9106 | 0.657 |
| AthMethPre | 0.7536 | 0.7605 | 0.7467 | 0.5073 | AthMethPre | 0.9071 | 0.8142 | 1 | 0.8286 |
| Our features | 0.7425 | 0.7521 | 0.733 | 0.4852 | Our features | 0.9102 | 0.8204 | 1 | 0.8339 |
| Dataset-M41 | Acc | Sn | Sp | MCC | Dataset-A101 | Acc | Sn | Sp | MCC |
| RFH | 0.8903 | 0.7848 | 0.9959 | 0.7987 | RFH | 0.7993 | 0.7705 | 0.8281 | 0.5996 |
| PseDNC | 0.6228 | 0.6386 | 0.6069 | 0.2456 | PseDNC | 0.8138 | 0.8057 | 0.8219 | 0.6277 |
| PCP | 0.6166 | 0.5669 | 0.6662 | 0.2343 | PCP | 0.8257 | 0.8281 | 0.8233 | 0.6514 |
| KNN | 0.8283 | 0.7448 | 0.9117 | 0.6659 | KNN | 0.8238 | 0.8462 | 0.8014 | 0.6483 |
| AthMethPre | 0.8897 | 0.7793 | 1 | 0.799 | AthMethPre | 0.85 | 0.85 | 0.85 | 0.7 |
| Our features | 0.8924 | 0.789 | 0.9959 | 0.8022 | Our features | 0.8105 | 0.8067 | 0.8143 | 0.6210 |
Results of the proposed predictor and the state-of-the-art predictors on benchmark datasets from different species.
| Dataset-S51 | Acc | Sn | Sp | MCC | Dataset-H41 | Acc | Sn | Sp | MCC |
|---|---|---|---|---|---|---|---|---|---|
| pRNAm-PC | 0.6974 | 0.6972 | 0.6975 | 0.4000 | MethyRNA | 0.9038 | 0.8168 | 0.9911 | N.A. |
| M6AMRFS | 0.7425 | 0.7521 | 0.7330 | 0.4852 | M6AMRFS | 0.9102 | 0.8204 | 1.0000 | 0.8339 |
| Dataset-M41 | Acc | Sn | Sp | MCC | Dataset-A101 | Acc | Sn | Sp | MCC |
| MethyRNA | 0.8839 | 0.7779 | 1.0000 | N.A. | RFAthM6A | 0.8545 | 0.8738 | 0.8352 | 0.7095 |
| M6AMRFS | 0.7933 | 0.8281 | 0.7584 | 0.588 | M6AMRFS | 0.8105 | 0.8067 | 0.8143 | 0.6210 |