Lian Liu, Xiujuan Lei, Zengqiang Fang, Yujiao Tang, Jia Meng, Zhen Wei.
Abstract
N6-methyladenosine (m6A) is one of the most widely studied epigenetic modifications and plays an important role in many biological processes, such as splicing, RNA localization, and degradation. Studies have shown that m6A on lncRNA has important functions, including regulating the expression and function of lncRNA, regulating the synthesis of pre-mRNA, promoting the proliferation of cancer cells, and affecting cell differentiation. Although a number of methods have been proposed to predict m6A RNA methylation sites, most target general m6A site prediction and overlook what is distinctive about the lncRNA methylation prediction problem. Because many lncRNAs lack a polyA tail, they are not captured in the polyA selection step of the most widely adopted RNA-seq library preparation protocol. lncRNA methylation sites are therefore likely to be significantly underrepresented in existing experimental data, which affects the accuracy of existing predictors. In this paper, we propose a new computational framework, LITHOPHONE, which stands for long noncoding RNA methylation sites prediction from sequence characteristics and genomic information with an ensemble predictor. We show that the methylation sites of lncRNA and mRNA exhibit different patterns in the extracted features and should be handled differently when making predictions. Owing to the experimental protocols used, the number of known lncRNA m6A sites is limited and insufficient to train a reliable predictor on its own; performance can therefore be improved by combining lncRNA and mRNA data in an ensemble predictor. We show that the newly developed LITHOPHONE approach achieved reasonably good performance when tested on independent datasets (AUC: 0.966 and 0.835 under the full transcript and mature RNA modes, respectively), a substantial improvement over existing methods.
Additionally, LITHOPHONE was applied to scan the entire human lncRNAome for all possible lncRNA m6A sites, and the results are freely accessible at: http://180.208.58.19/lith/.
Keywords: ensemble model; epitranscriptome; lncRNA; m6A; site prediction
Year: 2020; PMID: 32582286; PMCID: PMC7297269; DOI: 10.3389/fgene.2020.00545
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
Single-base resolution m6A datasets used for lncRNA m6A site prediction.
| Cell line | Antibody | Source |
|---|---|---|
| HEK293T | Abcam antibody | Bastian et al., |
| HEK293T | Sysy antibody | Bastian et al., |
| MOLM13 | | Vu et al., |
| A549 | | Shengdong et al., |
| CD8T | | Shengdong et al., |
| HeLa | | Ke et al., |
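Performance under 10-fold cross-validation (Sn, sensitivity; Sp, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the ROC curve).
| Mode | Classifier | Sn | Sp | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Full transcript | RF | 0.923 | 0.938 | 0.930 | 0.861 | 0.971 |
| Full transcript | SVM | 0.884 | 0.942 | 0.913 | 0.828 | 0.964 |
| Full transcript | KNN | 0.500 | 0.501 | 0.500 | 0.001 | 0.945 |
| Full transcript | LR | 0.881 | 0.944 | 0.912 | 0.827 | 0.962 |
| Full transcript | XGBoost | 0.907 | 0.940 | 0.924 | 0.848 | 0.955 |
| Mature lncRNA | RF | 0.784 | 0.724 | 0.754 | 0.511 | 0.827 |
| Mature lncRNA | SVM | 0.738 | 0.713 | 0.725 | 0.451 | 0.796 |
| Mature lncRNA | KNN | 0.499 | 0.501 | 0.500 | 0.001 | 0.727 |
| Mature lncRNA | LR | 0.602 | 0.807 | 0.704 | 0.418 | 0.789 |
| Mature lncRNA | XGBoost | 0.645 | 0.697 | 0.671 | 0.345 | 0.722 |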
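The threshold-based columns in these tables can be related to one another with a short calculation. The sketch below assumes the columns are sensitivity, specificity, accuracy, and MCC computed from a confusion matrix, and that the evaluation sets are balanced (an assumption suggested by ACC being the mean of the first two columns, not stated in the source). The function name `binary_metrics` and the counts are hypothetical, chosen only so that the first two values match the full transcript RF row.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (true positive rate)
    sp = tn / (tn + fp)                      # specificity (true negative rate)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 1000 positives and 1000 negatives (assumed 1:1 balance).
metrics = binary_metrics(tp=923, fn=77, tn=938, fp=62)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the computed MCC rounds to the value reported in the RF row, which is consistent with (but does not prove) a balanced positive-to-negative ratio in the evaluation data.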
Performance under independent test.
| Mode | Training data | Testing data | Classifier | Sn | Sp | ACC | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Full transcript | lncRNA | lncRNA | RF | 0.922 | 0.930 | 0.926 | 0.853 | 0.966 |
| Full transcript | lncRNA | lncRNA | SVM | 0.903 | 0.934 | 0.919 | 0.838 | 0.963 |
| Full transcript | lncRNA | lncRNA | KNN | 0.500 | 0.500 | 0.500 | 0.000 | 0.942 |
| Full transcript | lncRNA | lncRNA | LR | 0.895 | 0.926 | 0.911 | 0.822 | 0.959 |
| Full transcript | lncRNA | lncRNA | XGBoost | 0.922 | 0.903 | 0.913 | 0.826 | 0.947 |
| Full transcript | lncRNA | mRNA | RF | 0.981 | 0.046 | 0.514 | 0.077 | 0.759 |
| Full transcript | lncRNA | mRNA | SVM | 0.984 | 0.051 | 0.518 | 0.098 | 0.678 |
| Full transcript | lncRNA | mRNA | KNN | 0.499 | 0.501 | 0.500 | 0.000 | 0.572 |
| Full transcript | lncRNA | mRNA | LR | 0.954 | 0.171 | 0.562 | 0.200 | 0.716 |
| Full transcript | lncRNA | mRNA | XGBoost | 0.908 | 0.250 | 0.579 | 0.209 | 0.697 |
| Full transcript | mRNA | lncRNA | RF | 0.752 | 0.934 | 0.843 | 0.698 | 0.936 |
| Full transcript | mRNA | lncRNA | SVM | 0.744 | 0.899 | 0.822 | 0.651 | 0.905 |
| Full transcript | mRNA | lncRNA | KNN | 0.492 | 0.508 | 0.500 | 0.000 | 0.703 |
| Full transcript | mRNA | lncRNA | LR | 0.539 | 0.953 | 0.746 | 0.541 | 0.872 |
| Full transcript | mRNA | lncRNA | XGBoost | 0.721 | 0.891 | 0.806 | 0.622 | 0.869 |
| Full transcript | mRNA | mRNA | RF | 0.846 | 0.833 | 0.839 | 0.679 | 0.913 |
| Full transcript | mRNA | mRNA | SVM | 0.829 | 0.839 | 0.834 | 0.669 | 0.908 |
| Full transcript | mRNA | mRNA | KNN | 0.499 | 0.501 | 0.500 | 0.001 | 0.798 |
| Full transcript | mRNA | mRNA | LR | 0.717 | 0.896 | 0.806 | 0.623 | 0.898 |
| Full transcript | mRNA | mRNA | XGBoost | 0.831 | 0.832 | 0.832 | 0.664 | 0.907 |
| Mature RNA | lncRNA | lncRNA | RF | 0.766 | 0.694 | 0.730 | 0.461 | 0.821 |
| Mature RNA | lncRNA | lncRNA | SVM | 0.712 | 0.689 | 0.700 | 0.401 | 0.789 |
| Mature RNA | lncRNA | lncRNA | KNN | 0.500 | 0.500 | 0.500 | 0.000 | 0.734 |
| Mature RNA | lncRNA | lncRNA | LR | 0.590 | 0.802 | 0.696 | 0.401 | 0.797 |
| Mature RNA | lncRNA | lncRNA | XGBoost | 0.757 | 0.703 | 0.730 | 0.460 | 0.784 |
| Mature RNA | lncRNA | mRNA | RF | 0.757 | 0.522 | 0.639 | 0.287 | 0.705 |
| Mature RNA | lncRNA | mRNA | SVM | 0.814 | 0.424 | 0.619 | 0.258 | 0.717 |
| Mature RNA | lncRNA | mRNA | KNN | 0.493 | 0.508 | 0.501 | 0.002 | 0.520 |
| Mature RNA | lncRNA | mRNA | LR | 0.804 | 0.472 | 0.638 | 0.292 | 0.660 |
| Mature RNA | lncRNA | mRNA | XGBoost | 0.652 | 0.527 | 0.590 | 0.181 | 0.615 |
| Mature RNA | mRNA | lncRNA | RF | 0.788 | 0.608 | 0.698 | 0.403 | 0.807 |
| Mature RNA | mRNA | lncRNA | SVM | 0.761 | 0.631 | 0.696 | 0.395 | 0.774 |
| Mature RNA | mRNA | lncRNA | KNN | 0.500 | 0.500 | 0.500 | 0.000 | 0.542 |
| Mature RNA | mRNA | lncRNA | LR | 0.419 | 0.838 | 0.628 | 0.283 | 0.653 |
| Mature RNA | mRNA | lncRNA | XGBoost | 0.694 | 0.694 | 0.694 | 0.387 | 0.749 |
| Mature RNA | mRNA | mRNA | RF | 0.858 | 0.825 | 0.841 | 0.683 | 0.916 |
| Mature RNA | mRNA | mRNA | SVM | 0.840 | 0.842 | 0.841 | 0.682 | 0.915 |
| Mature RNA | mRNA | mRNA | KNN | 0.499 | 0.501 | 0.500 | 0.001 | 0.800 |
| Mature RNA | mRNA | mRNA | LR | 0.742 | 0.895 | 0.819 | 0.645 | 0.908 |
| Mature RNA | mRNA | mRNA | XGBoost | 0.831 | 0.832 | 0.832 | 0.664 | 0.907 |
Figure 1. Search for the optimal parameter of the ensemble predictor. The optimal result was achieved when α = 0.3. When α = 0, only lncRNA sites were used for training; when α = 1, only mRNA sites were considered.
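The role of α can be illustrated with a small sketch. The source does not state here how the two data sources are combined, so the code below adopts one plausible reading as an explicit assumption: the ensemble score is a convex combination of the probabilities from an lncRNA-trained and an mRNA-trained predictor, with α as the weight on the mRNA-trained one. The function names, the accuracy-based selection criterion, and the toy probabilities are all hypothetical.

```python
def ensemble_scores(p_lnc, p_mrna, alpha):
    """Blend two predictors' probabilities (assumed form, not confirmed by the source).

    alpha = 0 keeps only the lncRNA-trained scores; alpha = 1 keeps only
    the mRNA-trained scores.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(p_lnc, p_mrna)]


def accuracy(labels, scores, threshold=0.5):
    """Fraction of sites classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def search_alpha(p_lnc, p_mrna, labels, metric=accuracy, steps=10):
    """Grid-search alpha over 0, 1/steps, ..., 1 and keep the best metric value."""
    best_alpha = 0.0
    best_value = metric(labels, ensemble_scores(p_lnc, p_mrna, 0.0))
    for i in range(1, steps + 1):
        alpha = i / steps
        value = metric(labels, ensemble_scores(p_lnc, p_mrna, alpha))
        if value > best_value:
            best_alpha, best_value = alpha, value
    return best_alpha, best_value


# Toy, made-up probabilities for four candidate sites (1 = methylated).
labels = [1, 1, 0, 0]
p_lnc = [0.9, 0.4, 0.2, 0.6]
p_mrna = [0.6, 0.8, 0.7, 0.1]
best_alpha, best_value = search_alpha(p_lnc, p_mrna, labels)
print(best_alpha, best_value)
```

In practice α would be tuned on held-out validation data rather than on the test set, and a threshold-free criterion such as AUC could replace the accuracy metric used in this sketch.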
Comparison of the ensemble model with the mRNA-trained and lncRNA-trained models (mature RNA mode).
| Model | Sn | Sp | ACC | MCC | AUC |
|---|---|---|---|---|---|
| mRNA trained | 0.788 | 0.608 | 0.698 | 0.403 | 0.807 |
| lncRNA trained | 0.766 | 0.694 | 0.730 | 0.461 | 0.821 |
| Ensemble (α = 0.3) | 0.797 | 0.689 | 0.743 | 0.489 | 0.835 |
Figure 2. Feature selection results. (A) The ranking of the features for full transcript m6A site prediction. (B) The ranking of the features for mature lncRNA m6A site prediction. (C) The top 134 features were selected for full transcript m6A site prediction. (D) The top 41 features were selected for mature lncRNA m6A site prediction.
Performance comparison for lncRNA m6A site prediction.
| Mode | Method | Sn | Sp | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Full transcript | SRAMP | 0.705 | 0.791 | 0.748 | 0.498 | 0.827 |
| Full transcript | MethyRNA | 0.717 | 0.752 | 0.734 | 0.469 | 0.801 |
| Full transcript | Gene2vec | 0.798 | 0.813 | 0.805 | 0.611 | 0.865 |
| Full transcript | LITHOPHONE | 0.922 | 0.930 | 0.926 | 0.853 | 0.966 |
| Mature RNA | SRAMP | 0.604 | 0.748 | 0.676 | 0.355 | 0.749 |
| Mature RNA | MethyRNA | 0.622 | 0.644 | 0.633 | 0.266 | 0.679 |
| Mature RNA | Gene2vec | 0.778 | 0.689 | 0.734 | 0.469 | 0.806 |
| Mature RNA | LITHOPHONE | 0.797 | 0.689 | 0.743 | 0.489 | 0.835 |
Figure 3. ROC curves for lncRNA methylation site prediction. The proposed approach substantially outperformed competing approaches. (A) The ROC curve for the full transcript mode. (B) The ROC curve for the mature RNA mode.
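For readers comparing the AUC columns with the threshold-based metrics, the following minimal sketch computes the area under the ROC curve directly from prediction scores, using the standard equivalence between AUC and the probability that a randomly chosen positive site scores higher than a randomly chosen negative one. The function name `roc_auc` and the toy scores are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up scores: every positive outranks every negative,
# yet no score reaches a 0.5 decision threshold.
labels = [1, 1, 0, 0]
scores = [0.4, 0.3, 0.2, 0.1]
print(roc_auc(labels, scores))
```

Because AUC depends only on the ranking of scores, it can remain high while Sn, Sp, and MCC at a fixed threshold look poor. This may be relevant to the KNN rows in the tables above, which pair near-chance thresholded metrics with considerably higher AUC values, although the source does not explain that pattern.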