Zhizhou He1,2, Jing Xu1, Haoran Shi3, Shuxiang Wu1,4.
Abstract
5-methylcytosine (m5C) is a common post-transcriptional modification observed in a variety of RNAs. m5C has been demonstrated to be important in a variety of biological processes, including RNA structural stability and metabolism. Driven by the importance of m5C modification, many studies have focused on predicting m5C sites. To better understand the upstream and downstream regulation of m5C, we present a bioinformatics framework, m5CRegpred, which is the first to predict the substrates of the m5C writer NSUN2 and the m5C readers YBX1 and ALYREF. After comparing features, window lengths, and algorithms on the mature mRNA model, our model achieved AUROC scores of 0.869, 0.724, and 0.889 for NSUN2, YBX1, and ALYREF, respectively, in an independent test. Our work suggests that the substrates of m5C regulators can be distinguished, and it may aid research on m5C regulators under specific conditions, such as predicting the substrates of hyper- or hypo-expressed m5C regulators in human disease.
Keywords: 5-methylcytosine; machine learning; readers
Year: 2022 PMID: 35456483 PMCID: PMC9025882 DOI: 10.3390/genes13040677
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.141
Figure 1. The workflow for m5CRegpred. The methylation sites and RNA-binding protein (RBP) sites were obtained from four and two types of sequencing techniques, respectively. Eight encoding methods were considered in the project.
Base-resolution datasets of m5C sites.
| ID | Technique | Source | Cell Line | Ref. |
|---|---|---|---|---|
| 1 | RNA-BisSeq | GSE93751 | HeLa | [ |
| 2 | RNA-BisSeq | GSE133671 | T24 | [ |
| 3 | BS-seq with improved protocol | GSE122260 | HEK293T | [ |
| 4 | BS-seq with improved protocol | GSE122260 | HeLa | [ |
| 5 | RBS-Seq | GSE90963 | HeLa | [ |
| 6 | Aza-IP | GSE38957 | HeLa | [ |
Target sites of m5C regulators identified by PAR-CLIP or eCLIP.
| Role | Protein | Cell Line | Technique | Source | Ref. |
|---|---|---|---|---|---|
| Writer | NSUN2 | K562 | eCLIP | GENCODE | [ |
| Reader | YBX1 | T24 | PAR-CLIP | GSE133620 | [ |
| Reader | ALYREF | T24 | PAR-CLIP | GSE133620 | [ |
Independent test with different features.
| Feature | Mature mRNA Model: NSUN2 | Mature mRNA Model: YBX1 | Mature mRNA Model: ALYREF | Full Transcript Model: NSUN2 | Full Transcript Model: YBX1 | Full Transcript Model: ALYREF | Average |
|---|---|---|---|---|---|---|---|
| EIIP | 0.656 | 0.656 | 0.807 | 0.721 | 0.764 | 0.849 | 0.742 |
| autoCor | 0.567 | 0.546 | 0.584 | 0.523 | 0.617 | 0.710 | 0.591 |
| crossCor | 0.594 | 0.520 | 0.718 | 0.609 | 0.597 | 0.679 | 0.620 |
| PseKNC | 0.660 | 0.622 | 0.723 | 0.738 | 0.732 | 0.774 | 0.708 |
| ChemProper | 0.602 | 0.649 | 0.665 | 0.698 | 0.692 | 0.778 | 0.681 |
| ONE_HOT | 0.606 | 0.646 | 0.668 | 0.708 | 0.690 | 0.778 | 0.683 |
| CONPOSI | 0.656 | 0.656 | 0.807 | 0.721 | 0.764 | 0.849 | 0.742 |
| PSNP | 0.869 | 0.724 | 0.889 | 0.871 | 0.763 | 0.847 | 0.827 |
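
PSNP has the highest average score in the table above. The source does not spell out how the encoding is computed, so the sketch below assumes the common definition of position-specific nucleotide propensity: at each window position, the frequency of a nucleotide among positive samples minus its frequency among negative samples. The function names and the toy sequences are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
from collections import Counter

ALPHABET = "ACGU"


def position_frequencies(seqs):
    """Per-position nucleotide frequencies for equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({nt: counts.get(nt, 0) / len(seqs) for nt in ALPHABET})
    return freqs


def psnp_matrix(pos_seqs, neg_seqs):
    """Assumed PSNP definition: positive frequency minus negative frequency."""
    pos = position_frequencies(pos_seqs)
    neg = position_frequencies(neg_seqs)
    return [{nt: p[nt] - n[nt] for nt in ALPHABET} for p, n in zip(pos, neg)]


def psnp_encode(seq, matrix):
    """Encode one window as a vector with one propensity value per position."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the matrix length")
    # Nucleotides outside ALPHABET (e.g. padding) contribute 0.0.
    return [matrix[i].get(nt, 0.0) for i, nt in enumerate(seq)]


# Toy 5-nt windows centred on a cytosine (made-up sequences, not real data).
positives = ["GACUA", "AACUG", "GACGA"]
negatives = ["UUCAA", "CGCAU", "AUCCG"]

matrix = psnp_matrix(positives, negatives)
features = psnp_encode("GACUA", matrix)
```

Under this definition the matrix should be estimated from training sequences only, so that the independent test set does not leak into the features.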
Figure 2. Full transcript model and mature mRNA model. To select negative sites, the full transcript model considered unmodified sites and methylated sites not regulated by NSUN2/YBX1/ALYREF from both introns and exons, whereas the mature mRNA model considered only sites from exons. Because polyA selection during library preparation means that most captured sequences are exonic (mature mRNA), the performance of the full transcript model will be overestimated.
Figure 3. Performance of different window lengths with the PSNP encoding method.
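
Comparing window lengths requires cutting a fixed-length sequence around each candidate site. The sketch below shows one way this could be done, assuming a symmetric window of `flank` nucleotides on each side of a 0-based site index and `N` padding at transcript ends. The function name, the padding choice, and the example transcript are hypothetical; the source does not describe these details.

```python
def extract_window(transcript, site, flank, pad="N"):
    """Return the (2 * flank + 1)-nt window centred on 0-based index `site`."""
    if not 0 <= site < len(transcript):
        raise ValueError("site lies outside the transcript")
    left = max(0, site - flank)
    right = min(len(transcript), site + flank + 1)
    window = transcript[left:right]
    # Pad whichever side was truncated by a transcript boundary.
    left_pad = flank - (site - left)
    right_pad = flank - (right - site - 1)
    return pad * left_pad + window + pad * right_pad


# Made-up transcript with a candidate cytosine at index 5.
rna = "AUGGCCAUCGUA"
windows = {flank: extract_window(rna, 5, flank) for flank in (2, 4, 6)}
```

Each entry of `windows` keeps the candidate cytosine at the centre, so the same encoding can be applied to every window length being compared.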
Figure 4. Performance analysis of different machine learning algorithms. SVM denotes support vector machine, RF denotes random forest, and GLM denotes generalized linear model.
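
The abstract reports model performance as AUROC. As a minimal illustration of what that metric measures, the sketch below computes AUROC as the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as half. The function name, labels, and scores are made up for illustration; the source does not state which software the authors used to compute the metric.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three positive and three negative sites.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
value = auroc(labels, scores)
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; rank-based implementations are preferable for large test sets.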
Figure 5. Motif discovery for the training data and the false-negative sites of the independent test data. For the training data, only the positive sites were used for motif discovery. Motifs with widths of 4 bp to 8 bp were scanned with STREME. The different ALYREF motifs in the training data may be due to the different data sizes: the full transcript model contained 296 sequences, whereas the mature mRNA model contained only 137.