Yiyou Song1, Qingru Xu1, Zhen Wei1,2, Di Zhen1, Jionglong Su2,3, Kunqi Chen1,4, Jia Meng3,5.
Abstract
Although many successful bioinformatics efforts have been reported in the epitranscriptomics field for N6-methyladenosine (m6A) site identification, none has focused on the substrate specificity of different m6A-related enzymes, ie, the methyltransferases (writers) and demethylases (erasers). In this work, to untangle the target specificity and the regulatory functions of different RNA m6A writers (METTL3-METTL14 and METTL16) and erasers (ALKBH5 and FTO), we extracted 49 genomic features along with the conventional sequence features and used a random forest classifier to predict their epitranscriptome substrates. Our method achieved reasonable performance on both the writer target prediction (as high as 0.918) and the eraser target prediction (as high as 0.888) in a 5-fold cross-validation, and gene ontology analysis of their preferential targets further revealed the functional relevance of different RNA methylation writers and erasers.
Keywords: N6-methyladenosine (m6A); RNA methylation; epitranscriptome; random forest; target prediction
Year: 2019 PMID: 31523126 PMCID: PMC6728658 DOI: 10.1177/1176934319871290
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
GEO data sets used to identify ground truth target sites.
| ID | Regulator | Cell type | GEO SRA study | Publication |
|---|---|---|---|---|
| 1 | METTL14 | A549 | SRP039397 | Schwartz et al |
| 2 | METTL14 | HeLa | SRP022152 | Liu et al |
| 3 | METTL14 | MonoMac6 | SRP103072 | Weng et al |
| 4 | METTL14 | NB4 | SRP103072 | Weng et al |
| 5 | METTL3 | A549 | SRP039397 | Schwartz et al |
| 6 | METTL3 | AML | SRP099081 | Barbieri et al |
| 7 | METTL3 | HEK293T | SRP039397 | Schwartz et al |
| 8 | METTL3 | HeLa | SRP022152 | Liu et al |
| 9 | METTL16 | HEK293A | SRP094637 | Pendleton et al |
| 10 | ALKBH5 | gsc11 | SRP067910 | Zhang et al |
| 11 | FTO | AML | SRP067910 | Li et al |
Abbreviations: ALKBH5, ALKB homolog 5; FTO, fat mass and obesity–associated protein; GEO, Gene Expression Omnibus; METTL3, methyltransferase-like 3; METTL14, methyltransferase-like 14; METTL16, methyltransferase-like 16; SRA, Sequence Read Archive.
Genomic features used in the analysis.
| ID | Name | Description | Note |
|---|---|---|---|
| 1 | UTR5 | 5′ UTR | Dummy variables indicating whether the site overlaps the topological region on the major RNA transcript |
| 2 | UTR3 | 3′ UTR | |
| 3 | cds | CDS | |
| 4 | Stop_codons | Stop codons flanked by 100 bp | |
| 5 | Start_codons | Start codons flanked by 100 bp | |
| 6 | TSS | Downstream 100 bp of TSS | |
| 7 | TSS_A | Downstream 100 bp of TSS on A | |
| 8 | Stop_codons | Stop codons | |
| 9 | exon_stop | Exons containing stop codons | |
| 10 | alternative_exon | Alternative exons | |
| 11 | constitutive_exon | Constitutive exons | |
| 12 | internal_exon | Internal exons | |
| 13 | long_exon | Long exons (exon length ⩾ 400 bp) | |
| 14 | last_exon | Last exons | |
| 15 | last_exon_400bp | 5′ 400 bp of the last exons | |
| 16 | last_exon_sc400 | 5′ 400 bp of the last exons containing stop codons | |
| 17 | intron | Introns | |
| 18 | pos_UTR5 | Relative position on 5′ UTR | Relative position on the region |
| 19 | pos_UTR3 | Relative position on 3′ UTR | |
| 20 | pos_CDS | Relative position on CDS | |
| 21 | pos_exons | Relative position on exon | |
| 22 | length_UTR5 | 5′ UTR length | The region length in base pairs |
| 23 | length_UTR3 | 3′ UTR length | |
| 24 | length_cds | CDS length | |
| 25 | length_gene_ex | Mature transcript length | |
| 26 | length_gene_full | Full transcript length | |
| 27 | PC_1bp | PhastCons scores of the nucleotide | Scores related to evolutionary conservation |
| 28 | PC_101bp | Average phastCons scores within the flanking 50 bp region | |
| 29 | FC_1bp | fitCons scores of the nucleotide | |
| 30 | FC_101bp | Average fitCons scores within the flanking 50 bp region | |
| 31 | struc_hybridize | Predicted RNA hybridized region | RNA secondary structure |
| 32 | struc_loop | Predicted RNA loop region | |
| 33 | isoform_num | Isoform number | Attributes of the genes or transcripts |
| 34 | exon_num | Exon number | |
| 35 | HK_genes | Housekeeping genes | |
| 36 | sncRNA | sncRNA | RNA annotations related to m6A biology |
| 37 | lncRNA | lncRNA | |
| 38 | miR_targeted_genes | miRNA-targeted genes | |
| 39 | Verified_miRtargets | miRNA-targeted sites verified by experiment | |
| 40 | TargetScan | Predicted miRNA targeted sites by TargetScan | |
| 41 | HNRNPC_eCLIP | eCLIP data of HNRNPC RNA binding sites | RNA-binding protein annotation from MeTDB database |
| 42 | METTL3_TREW | METTL3-binding region | |
| 43 | METTL14_TREW | METTL14-binding region | |
| 44 | WTAP_TREW | WTAP-binding region | |
| 45 | YTHDC1_TREW | YTHDC1-binding region | |
| 46 | YTHDF1_TREW | YTHDF1-binding region | |
| 47 | YTHDF2_TREW | YTHDF2-binding region | |
| 48 | ALKBH5_TREW | ALKBH5-binding region | |
| 49 | FTO_TREW | FTO-binding region | |
Features directly related to the prediction target were excluded to avoid overfitting. For example, features 42 and 43 were not used for writer target prediction, whereas features 48 and 49 were not used for eraser target prediction.
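The exclusion rule in this note can be sketched as a simple filter over feature IDs. This is only an illustration, assuming features are keyed by the IDs in the genomic feature table; the names `ALL_FEATURE_IDS`, `EXCLUDED`, and `select_features` are hypothetical and are not taken from the published pipeline.

```python
# Hypothetical sketch: drop features that directly encode the prediction target.
# IDs follow the genomic feature table; all names here are illustrative only.
ALL_FEATURE_IDS = list(range(1, 50))  # 49 genomic features, IDs 1..49

# Features excluded per task, as stated in the table note.
EXCLUDED = {
    "writer": {42, 43},  # METTL3_TREW, METTL14_TREW
    "eraser": {48, 49},  # ALKBH5_TREW, FTO_TREW
}


def select_features(task):
    """Return the feature IDs allowed for the given prediction task."""
    banned = EXCLUDED[task]
    return [fid for fid in ALL_FEATURE_IDS if fid not in banned]


writer_features = select_features("writer")
eraser_features = select_features("eraser")
print(len(writer_features), len(eraser_features))
```

Each task keeps 47 of the 49 genomic features under this rule, so the binding-site annotation of the enzyme being predicted never reaches the classifier.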
Figure 1. Feature selection for predictors. (A) The top 20 genomic features were used for prediction of the targets of erasers, including conservation score, METTL3 targets, etc. (B) The top 15 genomic features were used for prediction of the targets of writers, with the distance to known m6A site as the most important predictive feature, followed by gene length and conservation score.
Performance of predictors based on different features.
| Feature type | Erasers (FTO vs ALKBH5): Sensitivity | Erasers: Specificity | Erasers: AUROC | Writers (M3/M14 vs M16): Sensitivity | Writers: Specificity | Writers: AUROC |
|---|---|---|---|---|---|---|
| Sequence | 0.789 | 0.781 | 0.849 | 0.656 | 0.746 | 0.772 |
| Genome | 0.762 | 0.736 | 0.827 | 0.802 | 0.795 | 0.886 |
| Both | 0.814 | 0.813 | 0.887 | 0.802 | 0.795 | 0.889 |
Abbreviations: ALKBH5, ALKB homolog 5; AUROC, area under the receiver operating characteristics; FTO, fat mass and obesity–associated protein.
This result was achieved with 5-fold cross-validation on RNA methylation sites with probability greater than 0.6.
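For readers less familiar with the three metrics reported in this table, the following is a minimal sketch of how sensitivity, specificity, and AUROC can be computed from held-out labels and classifier scores, together with a simple 5-fold index split. The toy labels and scores are invented for illustration, the 0.5 decision threshold is an assumption, and the helper names are hypothetical; in a real analysis a classifier such as a random forest would be trained on four folds and scored on the fifth.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5):
    """Assign indices 0..n-1 to k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


# Toy data, invented for illustration only (1 = one enzyme class, 0 = the other).
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2, 0.55, 0.45]

sens, spec = sensitivity_specificity(labels, scores)
auc = auroc(labels, scores)
folds = k_fold_indices(len(labels), k=5)
print(sens, spec, auc, folds)
```

The pairwise AUROC shown here is quadratic in the number of sites; it is chosen for readability rather than speed.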
Prediction performance on different data sets (AUROC).
| Enzyme type | Data set 1 | Data set 2 | Data set 3 | Data set 4 |
|---|---|---|---|---|
| Erasers (FTO vs ALKBH5) | 0.873 | 0.873 | 0.872 | 0.888 |
| Writers (M3/M14 vs M16) | 0.889 | 0.888 | 0.911 | 0.877 |
Abbreviations: ALKBH5, ALKB homolog 5; AUROC, area under the receiver operating characteristics; FTO, fat mass and obesity–associated protein.
Four data sets were considered, corresponding to experimentally validated RNA methylation sites from RMBase that were also supported by WHISTLE prediction with probability greater than 0.6, 0.7, 0.8, and 0.9, respectively. The detailed performance of 5 different classifiers (RF, SVM, GLM, Naïve Bayes, and decision tree) is presented in Supplementary Table S2.
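The construction of the four data sets can be sketched as successive probability cutoffs applied to one pool of sites. This is a hedged illustration only: the site records, the `prob` field, and the `build_datasets` helper are hypothetical stand-ins, and the probabilities are made-up values rather than actual WHISTLE output.

```python
THRESHOLDS = (0.6, 0.7, 0.8, 0.9)  # cutoffs for data sets 1 to 4


def build_datasets(sites, thresholds=THRESHOLDS):
    """Map each cutoff to the sites whose predicted probability exceeds it."""
    return {t: [s for s in sites if s["prob"] > t] for t in thresholds}


# Invented example records; "prob" stands in for a predicted m6A probability.
sites = [
    {"id": "site_a", "prob": 0.95},
    {"id": "site_b", "prob": 0.85},
    {"id": "site_c", "prob": 0.75},
    {"id": "site_d", "prob": 0.65},
    {"id": "site_e", "prob": 0.55},
]

datasets = build_datasets(sites)
for t in THRESHOLDS:
    print(t, [s["id"] for s in datasets[t]])
```

Because the cutoffs are nested, each stricter data set is a subset of the previous one, trading sample size for confidence in the sites retained.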
Figure 2. Biological processes enriched in targets of m6A enzymes. Distinct biological processes are enriched in the predicted target sites of different enzymes. The figure shows the top 10 most statistically enriched biological processes associated with the targets of different m6A enzymes.