Identification and analysis of the N(6)-methyladenosine in the Saccharomyces cerevisiae transcriptome.

Wei Chen¹,², Hong Tran², Zhiyong Liang³, Hao Lin³, Liqing Zhang².

Abstract

Knowledge of the distribution of N(6)-methyladenosine (m(6)A) is invaluable for understanding RNA biological functions. However, limitations in experimental methods impede progress towards the identification of m(6)A sites. As a complement to experimental methods, a support vector machine-based method is proposed to identify m(6)A sites in the Saccharomyces cerevisiae genome. In this model, RNA sequences are encoded by their nucleotide chemical properties and accumulated nucleotide frequency information. In the jackknife test, the proposed model achieved an accuracy of 78.15% in identifying m(6)A sites. For the convenience of experimental scientists, a web-server for the proposed model is provided at http://lin.uestc.edu.cn/server/m6Apred.php.

Year: 2015 PMID: 26343792 PMCID: PMC4561376 DOI: 10.1038/srep13859

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

The methylation of the 6th nitrogen of adenosine (N6-methyladenosine, m6A) is the most prevalent form of RNA modification and is found in all three domains of life [1]. m6A is catalyzed by an evolutionarily conserved, multi-component enzyme [2]. Unlike adenosine-to-inosine editing, m6A does not alter the coding capacity of transcripts [3,4]. However, m6A has been shown to be associated with a number of key biological processes, including mRNA splicing, export, stability, and immune tolerance [5,6,7]. Moreover, m6A has been reported to be closely correlated with mammalian brain development [8], and its regulatory role in cell division has been reported in plants [9]. Using high-throughput techniques such as MeRIP-Seq [8] and m6A-seq [10], the distribution of m6A has been characterized in the human and mouse transcriptomes [8]. These experiments revealed that m6A sites tend to occur near the stop codon, in the 3' UTR, and within long internal exons [8,11]. The nonrandom distribution of m6A sites across the genome is highly conserved from yeasts to humans [11,12], suggesting that m6A modification is both fundamental and important for organisms. The experimental results also demonstrated that the m6A sites identified in yeast harbor the RGAC (R = A/G) consensus motif [12], reminiscent of the mammalian RRACU (R = A/G) motif [11]. Like epigenetic DNA and histone modifications, m6A modification is dynamic and reversible: m6A patterns change across cell types [10] and when cells are stressed [12]. These experimental methods yielded encouraging results and advanced the mapping of m6A in the transcriptome. However, the resolution of both m6A-seq and MeRIP-seq is low, only ~24 nt (nucleotides) around the methylated adenosine [11]. Therefore, these methods cannot pinpoint which adenosine residue is actually modified. In addition, current experimental methods are both costly and time consuming.
Therefore, new methods are needed for studying the distribution and function of m6A. As complements to experimental techniques, computational methods can speed up genome-wide m6A detection. However, to the best of our knowledge, no computational tool is available for the discovery of m6A sites. In the present study, we propose a support vector machine-based method to identify m6A sites in the Saccharomyces cerevisiae genome. By using the nucleotide chemical property and accumulated nucleotide frequency information, the proposed model integrates sequence-order effects with nucleotide physicochemical properties. In the jackknife test, an overall accuracy of 78.15% is achieved in identifying the m6A sites in the benchmark dataset. For the convenience of experimental scientists, a web-server for the proposed model is provided at http://lin.uestc.edu.cn/server/m6Apred.php.

Results

Nucleotide preference

To understand the nucleotide preference surrounding m6A sites in the benchmark dataset, we computed sequence logos of the 10 upstream and 10 downstream nucleotides using WebLogo [13]. As shown in Fig. 1, besides the well-known consensus motif RGAC (R = A/G) located at positions −2 to +1 relative to the m6A site (position 0) [12], strong nucleotide preferences were also observed in both the upstream and downstream sequences surrounding the m6A site. Adenines are favored at positions −4, −3, and −2, whereas uracils are favored at positions +2 to +4. In contrast, except for the RGAC (R = A/G) motif located at −2 to +1, no distinct nucleotide preference was observed surrounding the unmethylated adenosines.
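
The per-position nucleotide frequencies that underlie such a sequence logo can be sketched as follows. This is a minimal illustration, not the WebLogo procedure itself: the function name `position_frequencies` and the three toy windows are assumptions made for this example and are not taken from the benchmark dataset.

```python
from collections import Counter

def position_frequencies(sequences, flank=10):
    """Return {relative_position: {nucleotide: frequency}} for equal-length windows.

    Position 0 is the central adenosine; negative positions are upstream.
    """
    length = 2 * flank + 1
    if any(len(seq) != length for seq in sequences):
        raise ValueError("every window must be exactly 2 * flank + 1 nt long")
    freqs = {}
    for idx in range(length):
        counts = Counter(seq[idx] for seq in sequences)
        total = sum(counts.values())
        freqs[idx - flank] = {nt: counts.get(nt, 0) / total for nt in "ACGU"}
    return freqs

# Toy 21-nt windows (hypothetical), each with RGAC at positions -2 to +1.
toy_windows = [
    "AAUCGCAAAGACUUUGCAUCG",
    "CGAUUGAAAGACAUUCCGAUA",
    "GCUAACAAGGACUUUAGCGUC",
]
freqs = position_frequencies(toy_windows)
print(freqs[0])    # central position: only 'A'
print(freqs[-2])   # the R of RGAC: a mix of 'A' and 'G'
```

A logo additionally scales each column by its information content; the frequencies above are only the raw input to that step.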

Figure 1

Sequence logo of the 10 upstream and 10 downstream nucleotides surrounding m6A sites.

m6A sites identification

Three cross-validation methods, namely the sub-sampling (or K-fold cross-validation) test, the independent dataset test, and the jackknife test, are often used to evaluate the quality of a predictor. Among the three, the jackknife test is deemed the least arbitrary and most objective [14], and hence has been widely adopted by investigators to examine the quality of various predictors [15,16,17]. Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sample in the training dataset is in turn singled out as an independent test sample, and all the model parameters are calculated without including the one being identified. To compare the contribution of the features to m6A site identification, we first performed predictions using the individual nucleotide chemical properties and their combination. The predictive results are reported in Table 1. Among the three kinds of nucleotide chemical properties, the hydrogen bond yields the highest predictive accuracy (71.32%), indicating that it contributes the most to m6A site identification. However, the predictive accuracies obtained with each kind of nucleotide chemical property alone are all lower than that obtained with all three kinds together (Table 1).
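
The jackknife (leave-one-out) procedure described above can be sketched in a few lines. The classifier here is a nearest-centroid stand-in chosen only to keep the example self-contained; it is not the SVM used in this study, and the names `jackknife_accuracy`, `train_centroids`, and `predict_centroid` as well as the toy data are assumptions for illustration.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Stand-in classifier (nearest class centroid), NOT the SVM of the paper.
def train_centroids(xs, ys):
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(centroids, x):
    def squared_distance(centre):
        return sum((a - b) ** 2 for a, b in zip(x, centre))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Toy, well-separated feature vectors (hypothetical).
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
acc = jackknife_accuracy(toy_x, toy_y, train_centroids, predict_centroid)
print(acc)
```

Because the model is retrained once per sample, the cost grows with the dataset size; with 1,664 benchmark sequences this means 1,664 training runs per feature set.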

Table 1

The predictive results by using different features for m6A identification.

Features	Sn (%)	Sp (%)	Acc (%)
Ring Structure	69.27	63.43	66.34
Functional Group	70.70	69.90	70.31
Hydrogen Bond	74.18	68.46	71.32
Nucleotide chemical property	75.23	78.02	75.87
Nucleotide chemical property and accumulated nucleotide frequency	79.21	77.04	78.13

Considering the observed nucleotide preference surrounding the m6A sites (Fig. 1) and the above results, the accumulated nucleotide frequency and nucleotide chemical property were combined to encode the sequences in the training dataset. Hence, each 21-bp sequence in the dataset was represented by an 84 (4 × 21)-dimensional vector (see Methods) and used as the input to the SVM to train the model for identifying m6A sites. In the jackknife test, the proposed model obtained an accuracy of 78.15%, with a sensitivity of 79.21% and a specificity of 77.04% (Table 1). This accuracy is higher than that obtained using nucleotide chemical properties alone (75.87%), indicating that the accumulated nucleotide frequency makes a modest additional contribution to the identification of m6A sites. As the performance of the proposed model may depend on the threshold, three thresholds (high, medium, and low) were selected in the jackknife test, corresponding to specificity values of 95%, 90%, and 85%, respectively, similar to a recent work [18]. The predictive performance of the proposed model at these thresholds is reported in Table 2. To illustrate graphically how the performance of the model changes as its discrimination threshold varies, the ROC curve is plotted in Fig. 2; an AUROC of 0.84 was obtained.

Table 2

Performance of the proposed model at different thresholds on jackknife test.

Classifier	Sn (%)	Sp (%)	Acc (%)
High	38.22	94.95	66.59
Medium	55.05	90.02	72.54
Low	68.39	84.98	76.68
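
The trade-off in Table 2 follows from how a cut-off is picked for a target specificity. A minimal sketch, assuming the rule "predict m6A when the score is strictly above the cut-off"; the function names and the toy scores are hypothetical and do not reproduce the values in Table 2.

```python
import math

def threshold_for_specificity(negative_scores, target_sp):
    """Smallest cut-off at which at least target_sp of the negatives are rejected."""
    ordered = sorted(negative_scores)
    # Number of negatives that must score at or below the cut-off
    # (the small epsilon guards against floating-point round-up).
    needed = math.ceil(target_sp * len(ordered) - 1e-9)
    if needed == 0:
        return float("-inf")
    return ordered[needed - 1]

def sn_sp_at(threshold, positive_scores, negative_scores):
    """Sensitivity and specificity when 'score > threshold' means predicted m6A."""
    tp = sum(score > threshold for score in positive_scores)
    tn = sum(score <= threshold for score in negative_scores)
    return tp / len(positive_scores), tn / len(negative_scores)

# Toy probability scores (hypothetical).
neg = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.80]
pos = [0.30, 0.45, 0.55, 0.70, 0.90]

for target in (0.8, 0.9):
    cut = threshold_for_specificity(neg, target)
    print(target, cut, sn_sp_at(cut, pos, neg))
```

Raising the target specificity pushes the cut-off upward, so fewer true sites pass it, which is the pattern seen between the Low, Medium, and High rows.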

Figure 2

A graphical illustration to show the performance of the model by means of the ROC curve.

The vertical coordinate is the true positive rate (Sn) while horizontal coordinate is the false positive rate (1-Sp). The area under the ROC curve (AUROC) is 0.84.

To ensure that the predictive accuracy is not sensitive to the selection of negative data, we repeated the random sampling procedure ten times and obtained ten random negative datasets for downstream training and prediction. The jackknife results of these models for identifying m6A sites are reported in Supplementary Table S1. We found that the predictive accuracy is not affected by the selection of negative data. In addition, the proposed model was evaluated on the independent testing dataset (see Methods), which has a positive-to-negative ratio of 1:10. On this dataset, the model obtained an accuracy of 75.73%, with a sensitivity of 53.89% and a specificity of 79.07%. The precision-recall curve, which plots the precision-recall pairs over a range of thresholds, is shown in Supplementary Figure S1. These results demonstrate the reliability of the model developed in this study.

Comparison with other classifiers

To further demonstrate the power of the proposed method, we performed the comparative calculations described below. First, based on the sequence similarity principle, we used the classic sequence similarity search tool BLAST [19] to conduct the jackknife test on the same benchmark dataset. The results are given in Table 3, from which we can see that the accuracy obtained by BLAST (69.11%) is about 9 percentage points lower than that of the proposed model (78.15%) for m6A identification.

Table 3

Comparison of different classifiers for m6A identification.

Classifier	Sn (%)	Sp (%)	Acc (%)	AUROC
Blast	70.75	67.55	69.11	–
Naïve Bayes	78.72	70.91	74.81	0.82
Logistic Function	79.32	74.76	77.04	0.83
RBFNetwork	61.18	84.49	72.83	0.79
Random Forest	78.73	64.78	71.75	0.78
SVM	79.21	77.04	78.15	0.84

Second, we compared the predictive results of the proposed method with those of four other commonly used classifiers, i.e., Naïve Bayes [20], Logistic Function [21], RBFNetwork [22], and Random Forest [23], as implemented in WEKA [24]. The jackknife test results for identifying m6A sites in the benchmark dataset with the different classifiers are listed in Table 3. The specificity, accuracy, and AUROC of the proposed SVM model are all higher than those of Naïve Bayes, Logistic Function, and Random Forest. Its sensitivity is also higher than that of Naïve Bayes and Random Forest, and is nearly identical to that of Logistic Function (79.21% versus 79.32%). Although the specificity of the proposed method is lower than that of RBFNetwork, its sensitivity, accuracy, and AUROC are all higher. Hence, these results suggest that the proposed method is promising and has the potential to become a useful tool for m6A identification.

Web-server

To enable applications of the proposed model and for the convenience of experimental scientists, an online predictor was created. A step-by-step guide on how to use it is provided below.

Step 1. Open the web server at http://lin.uestc.edu.cn/server/m6Apred.php and you will see the top page, as shown in Fig. 3. Click on the Read Me button to see a brief introduction to the predictor and the caveats when using it.

Figure 3

A semi-screenshot for the top page of the web-server at http://lin.uestc.edu.cn/server/m6Apred.php.

Step 2. Click an open circle to select a threshold (All, High, Medium, or Low; see Table 2). Either type or copy/paste the query RNA sequences into the input box at the center of Fig. 3. The input sequences should be in FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol (“>”) in the first column, followed by lines of sequence data. The words right after the “>” symbol in the initial line are optional and are used only for identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. A sequence ends when another line starting with “>” appears, which indicates the start of the next sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box.

Step 3. Click on the Submit button to see the predicted result. For example, if you use the query RNA sequences in the Example window as the input, the following results will be shown on the screen. The outcome for the 1st query sequence is: the ‘A’ at position 11 is methylated with a probability of 0.92, and the ‘A’ at position 32 is also methylated with a probability of 0.92. The outcome for the 2nd query sequence is: the ‘A’ at position 11 is unmethylated with a probability of 0.96. All these results are consistent with the experimental observations.

Step 4. Click on the Data button to download the datasets used to train and test the model.

Step 5. Click on the Citation button to find the relevant paper that reports the detailed development and algorithm of the model.

Caveats

Each input query sequence must be 21 bp or longer and must contain only the valid characters ‘A’, ‘C’, ‘G’, and ‘U’.
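
Queries can be checked against these rules before submission. The sketch below is a client-side illustration only and does not reflect how the web-server itself parses input; the function names, the upper-casing of sequence lines, and the toy records are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Assumption: lower-case input is normalized to upper case.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def is_valid_query(sequence, min_length=21):
    """True if the sequence is long enough and uses only A, C, G, and U."""
    return len(sequence) >= min_length and set(sequence) <= set("ACGU")

# Toy input (hypothetical sequences).
example = """>query_1 valid toy sequence
AAUCGCAAAGACUUUGCAUCG
GCAU
>query_2 too short
AGACU
>query_3 contains T instead of U
AAUCGCAAAGACTTTGCAUCGAA
"""
records = parse_fasta(example)
for header, seq in records:
    print(header, len(seq), is_valid_query(seq))
```

DNA-style input containing ‘T’ is rejected here rather than silently converted, which matches the stated character set.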

Conclusions

Using the nucleotide chemical property and nucleotide density information, we proposed a support vector machine-based model to identify m6A sites in the Saccharomyces cerevisiae transcriptome. To identify the key features for m6A site identification, we compared the predictive results obtained with different kinds of features (Table 1). Compared with the accumulated nucleotide frequency, the nucleotide chemical property plays the more important role in m6A site identification. Among the three nucleotide chemical properties considered, the hydrogen bond contributes the most, consistent with the recent finding that hydrogen bonding is implicated in the formation of RNA secondary structure [25], which decreases m6A methylation [26]. In addition, we compared the predictive accuracy of SVM with that of four other commonly used classification methods for m6A site identification. We found that the predictive result of SVM is better than those of Naïve Bayes, Logistic Function, and Random Forest. This is likely due to the limited number of experimentally validated m6A sites that were used to train the models: Naïve Bayes, Logistic Function, and Random Forest require a large number of training samples, whereas SVM needs fewer. For the convenience of researchers in the scientific community, a web-server for the proposed model is provided. We hope that these results will provide further insights into the distribution and function of m6A modifications. As the current method is applicable only to Saccharomyces cerevisiae, future work will extend the model to other species.

Methods

Dataset

Using the m6A-seq technique, Schwartz et al. identified 1,307 methylated adenine (m6A) sites centered around RGAC motifs from 1,183 genes in Saccharomyces cerevisiae [12]. To obtain a high-quality training dataset and avoid experimental bias, the 832 m6A sites located less than 10 bp from the detected m6A-seq peaks were selected as the positive samples of the training dataset [12]. The pairwise sequence similarity among the positive training samples is less than 85%. The remaining 475 (1,307 − 832 = 475) m6A sites were used to construct the independent testing dataset. The negative samples were obtained as follows. By searching the Saccharomyces cerevisiae genome, we obtained 33,280 adenines centered around the RGAC consensus motif that were not detected by the m6A-seq technique; these 33,280 adenines were therefore deemed non-methylated. To balance the numbers of positive and negative samples in model training, we randomly picked 832 of the 33,280 non-methylated adenines and used them as negative samples. Following these procedures, we obtained a benchmark dataset comprising 832 m6A-site-containing sequences and 832 non-m6A-site-containing sequences. To examine whether the predictive accuracy is sensitive to the selection of negative data, we repeated the random sampling procedure ten times and obtained ten random negative datasets for downstream training and prediction. We also randomly drew 4,750 negative samples from the ten negative datasets and merged them with the above-mentioned 475 positive samples, yielding an independent testing dataset with a positive-to-negative ratio of 1:10 (475:4,750). Preliminary trials showed that the predictive results were most promising when the sequences in the benchmark dataset were 21 bp long with the m6A site in the center.
Accordingly, all the sequences in the training and testing datasets are 21 bp long and are available at http://lin.uestc.edu.cn/server/m6Apred.php.
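
The candidate-site search and the balanced negative sampling described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build the benchmark: the function names, the fixed random seed, and the toy transcript are hypothetical, and the sequence is treated as an RNA string over A, C, G, and U.

```python
import random
import re

def rgac_windows(transcript, flank=10):
    """Return (position, window) for each A in an RGAC motif that has full flanks.

    The candidate adenosine is the third base of RGAC (R = A or G).
    """
    windows = []
    for match in re.finditer(r"[AG]GAC", transcript):
        centre = match.start() + 2
        if centre - flank >= 0 and centre + flank < len(transcript):
            windows.append((centre, transcript[centre - flank:centre + flank + 1]))
    return windows

def sample_negatives(candidates, n, seed=0):
    """Draw n candidates without replacement; the seed makes the draw repeatable."""
    return random.Random(seed).sample(candidates, n)

# Toy transcript (hypothetical); the last GGAC is too close to the 3' end.
toy_transcript = "GCUUACGUAGACUUGCAUCCGUGGACAUUCGAUCGUACGUGGACU"
candidates = rgac_windows(toy_transcript)
for centre, window in candidates:
    print(centre, window)

picked = sample_negatives(candidates, 1)
print(picked)
```

Repeating `sample_negatives` with ten different seeds would mirror the ten random negative datasets mentioned above.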

Sequence encoding

One of the keys in developing a model for identifying genomic attributes is to encode the biological samples with effective expressions. In the present study, nucleotide chemical properties and density information of each nucleotide in RNA sequences were considered.

Chemical property of each nucleotide

There are four different kinds of nucleotides found in RNA, i.e., adenine (A), guanine (G), cytosine (C), and uracil (U). Each nucleotide has a different chemical structure and chemical binding. As shown in Fig. 4, adenine and guanine have two rings, while cytosine and uracil have only one ring. Although RNA is generally single stranded, its biological functions are correlated with its secondary structure. When secondary structures form, guanine and cytosine form strong hydrogen bonds, whereas adenine and uracil form weak hydrogen bonds. Additionally, in terms of chemical functionality, adenine and cytosine can be classified into the amino group, while guanine and uracil belong to the keto group. Therefore, the four kinds of nucleotides can be classified in three different ways according to these chemical properties (Table 4).

Figure 4

Chemical structure of each nucleotide.

Table 4

Chemical property of nucleotide in RNA sequence.

Chemical property	Class	Nucleotides
Ring Structure	Purine	A, G
Ring Structure	Pyrimidine	C, U
Functional Group	Amino	A, C
Functional Group	Keto	G, U
Hydrogen Bond	Strong	C, G
Hydrogen Bond	Weak	A, U

In order to include these chemical properties in the RNA encoding, we define three coordinates (x, y, z) to represent the three chemical properties and assign each a value of 1 or 0. Hence, the i-th nucleotide s_i = (x_i, y_i, z_i) in the sequence can be encoded by the following formula [27]:

$$
x_i = \begin{cases} 1 & \text{if } s_i \in \{A, G\} \\ 0 & \text{if } s_i \in \{C, U\} \end{cases}, \quad
y_i = \begin{cases} 1 & \text{if } s_i \in \{A, C\} \\ 0 & \text{if } s_i \in \{G, U\} \end{cases}, \quad
z_i = \begin{cases} 1 & \text{if } s_i \in \{A, U\} \\ 0 & \text{if } s_i \in \{C, G\} \end{cases}
$$

where the coordinate values of each nucleotide are determined by its chemical properties as shown in Table 4 (x for ring structure, y for functional group, z for hydrogen bond). Thus, A is represented by the coordinates (1, 1, 1), C by (0, 1, 0), G by (1, 0, 0), and U by (0, 0, 1).
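
The three-coordinate encoding can be written directly from the groupings in Table 4. A minimal sketch; the function names are chosen for this example, and the assignment of 1 to purine, amino, and weak hydrogen bond is inferred from the coordinates listed above.

```python
def chemical_property(nucleotide):
    """Map one RNA nucleotide to its (x, y, z) chemical-property coordinates."""
    if nucleotide not in "ACGU" or len(nucleotide) != 1:
        raise ValueError(f"unexpected nucleotide: {nucleotide!r}")
    x = 1 if nucleotide in "AG" else 0   # ring structure: purine
    y = 1 if nucleotide in "AC" else 0   # functional group: amino
    z = 1 if nucleotide in "AU" else 0   # hydrogen bond: weak
    return (x, y, z)

def encode_chemical(sequence):
    """Encode a whole sequence as a list of (x, y, z) tuples."""
    return [chemical_property(nt) for nt in sequence]

print(encode_chemical("ACGU"))
```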

Accumulated nucleotide frequency

In order to include the nucleotide frequency information and the distribution of each nucleotide in the RNA sequence, we define the density d_i of the nucleotide s_i at position i in the RNA sequence by the following formula [26]:

$$
d_i = \frac{1}{|S_i|} \sum_{j=1}^{i} f(s_j), \qquad
f(s_j) = \begin{cases} 1 & \text{if } s_j = q \\ 0 & \text{otherwise} \end{cases}, \qquad i = 1, 2, \ldots, l
$$

where l is the sequence length, |S_i| is the length of the i-th prefix string {s_1, s_2, …, s_i} in the sequence, and q ∈ {A, C, G, U} is the nucleotide at position i. Consider the example sequence “UCGUUCAUGG”. The density of ‘U’ is 1 (1/1), 0.5 (2/4), 0.6 (3/5), and 0.5 (4/8) at positions 1, 4, 5, and 8, respectively. The density of ‘C’ is 0.5 (1/2) and 0.33 (2/6) at positions 2 and 6, respectively. The density of ‘G’ is 0.33 (1/3), 0.22 (2/9), and 0.3 (3/10) at positions 3, 9, and 10, respectively. The density of ‘A’ is 0.14 (1/7) at position 7. By integrating both the nucleotide chemical property and the accumulated nucleotide frequency, the sample sequence “UCGUUCAUGG” can be represented by {(0, 0, 1, 1), (0, 1, 0, 0.5), (1, 0, 0, 0.33), (0, 0, 1, 0.5), (0, 0, 1, 0.6), (0, 1, 0, 0.33), (1, 1, 1, 0.14), (0, 0, 1, 0.5), (1, 0, 0, 0.22), (1, 0, 0, 0.3)}. In this way, not only the chemical property but also the long-range sequence-order information is incorporated. Therefore, the samples in the benchmark dataset were encoded in terms of both nucleotide chemical property and nucleotide density.
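
The worked example for “UCGUUCAUGG” can be reproduced with a short script. This is a sketch with hypothetical function names; densities are rounded to two decimals only so that they can be compared with the values quoted above.

```python
def accumulated_frequency(sequence):
    """Density of the nucleotide at each position within the prefix ending there."""
    counts = {}
    densities = []
    for i, nt in enumerate(sequence, start=1):
        counts[nt] = counts.get(nt, 0) + 1
        densities.append(counts[nt] / i)   # occurrences so far / prefix length
    return densities

def encode_sequence(sequence):
    """Combine (x, y, z) chemical coordinates with the density as a 4th value."""
    encoded = []
    for nt, density in zip(sequence, accumulated_frequency(sequence)):
        x = 1 if nt in "AG" else 0   # purine
        y = 1 if nt in "AC" else 0   # amino
        z = 1 if nt in "AU" else 0   # weak hydrogen bond
        encoded.append((x, y, z, density))
    return encoded

def flatten(encoded):
    """Flatten per-position tuples into one feature vector (4 values per position)."""
    return [value for row in encoded for value in row]

example = encode_sequence("UCGUUCAUGG")
rounded = [(x, y, z, round(d, 2)) for x, y, z, d in example]
print(rounded)
print(len(flatten(encode_sequence("AAUCGCAAAGACUUUGCAUCG"))))  # 21 nt -> 84 features
```

Applied to a 21-nt window, `flatten` yields the 84-dimensional vector (4 × 21) that serves as the SVM input.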

Support vector machine

The SVM classification algorithm has been widely used in bioinformatics [28,29,30]. Its basic principle is to transform the input vector into a high-dimensional Hilbert space and seek a separating hyperplane with the maximal margin in this space. In this study, the libsvm-3.18 package, which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used as the implementation of SVM. Because of its effectiveness and speed in nonlinear classification, the radial basis function (RBF) kernel was selected to perform the prediction. A grid search was used to optimize the regularization parameter C and the kernel parameter γ. The probability score obtained from the SVM was used to make predictions.
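
The RBF kernel and the grid search over (C, γ) can be sketched in plain Python. This does not call libsvm; the search ranges (powers of two) are an assumption, since the grid actually used is not stated here, and `score_fn` is a hypothetical placeholder for a cross-validated accuracy obtained by training an SVM with the given parameters.

```python
import math
from itertools import product

def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Assumed search ranges; the values used in the study are not given in the text.
C_GRID = [2.0 ** e for e in range(-5, 16, 2)]
GAMMA_GRID = [2.0 ** e for e in range(-15, 4, 2)]

def grid_search(score_fn):
    """Return the (C, gamma) pair with the highest score_fn(C, gamma)."""
    return max(product(C_GRID, GAMMA_GRID), key=lambda pair: score_fn(*pair))

# Toy score surface standing in for cross-validated accuracy (hypothetical),
# peaked at C = 2^3 and gamma = 2^-5.
def toy_score(c, gamma):
    return -((math.log2(c) - 3) ** 2) - (math.log2(gamma) + 5) ** 2

best_c, best_gamma = grid_search(toy_score)
print(best_c, best_gamma)
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

In practice each grid point requires a full cross-validation run, so a coarse grid is usually searched first and then refined around the best pair.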

Performance evaluations

The performance of the model was evaluated using the following metrics: sensitivity (Sn), also named recall, specificity (Sp), precision, and accuracy (Acc), which can be expressed as

$$
\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad
\mathrm{Sp} = \frac{TN}{TN + FP}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}
$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. To illustrate graphically how the performance of the model changes as its discrimination threshold varies, the ROC (receiver operating characteristic) curve was created, with the true positive rate on the vertical axis and the false positive rate on the horizontal axis. The best possible prediction method would yield the point (0, 1), representing a false positive rate of 0 (100% specificity) and a true positive rate of 100%. Therefore, the (0, 1) point is also considered a perfect classification. A completely random guess would give a point along the diagonal from (0, 0) to (1, 1). The AUROC (area under the ROC curve) is often used to indicate the performance of a binary classifier: an AUROC of 0.5 is equivalent to random prediction, while an AUROC of 1 represents a perfect classifier.
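
These metrics can be computed from the confusion counts and the prediction scores as sketched below. The function names and toy numbers are assumptions for illustration; the AUROC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which equals the area under the ROC curve.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, and accuracy from confusion counts."""
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Precision": tp / (tp + fp),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
    }

def auroc(positive_scores, negative_scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy counts and scores (hypothetical).
metrics = confusion_metrics(tp=8, tn=7, fp=3, fn=2)
print(metrics)
area = auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(area)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a benchmark of this size; a sort-based rank computation would be preferable for much larger datasets.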

Additional Information

How to cite this article: Chen, W. et al. Identification and analysis of the N6-methyladenosine in the Saccharomyces cerevisiae transcriptome. Sci. Rep. 5, 13859; doi: 10.1038/srep13859 (2015).

27 in total

1. Prediction of replication origins by calculating DNA structural properties.

Authors: Wei Chen; Pengmian Feng; Hao Lin
Journal: FEBS Lett Date: 2012-02-28 Impact factor: 4.124

2. Prediction of midbody, centrosome and kinetochore proteins based on gene ontology information.

Authors: Wei Chen; Hao Lin
Journal: Biochem Biophys Res Commun Date: 2010-09-18 Impact factor: 3.575

3. Transcriptome-wide mapping of N(6)-methyladenosine by m(6)A-seq based on immunocapturing and massively parallel sequencing.

Authors: Dan Dominissini; Sharon Moshitch-Moshkovitz; Mali Salmon-Divon; Ninette Amariglio; Gideon Rechavi
Journal: Nat Protoc Date: 2013-01-03 Impact factor: 13.491

4. Comprehensive analysis of mRNA methylation reveals enrichment in 3' UTRs and near stop codons.

Authors: Kate D Meyer; Yogesh Saletore; Paul Zumbo; Olivier Elemento; Christopher E Mason; Samie R Jaffrey
Journal: Cell Date: 2012-05-17 Impact factor: 41.582

5. Integrating local and global error statistics for multi-scale RBF network training: an assessment on remote sensing data.

Authors: Giorgos Mountrakis; Wei Zhuang
Journal: PLoS One Date: 2012-08-02 Impact factor: 3.240

6. The RNA Modification Database, RNAMDB: 2011 update.

Authors: William A Cantara; Pamela F Crain; Jef Rozenski; James A McCloskey; Kimberly A Harris; Xiaonong Zhang; Franck A P Vendeix; Daniele Fabris; Paul F Agris
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

7. N6-methyladenosine in nuclear RNA is a major substrate of the obesity-associated FTO.

Authors: Guifang Jia; Ye Fu; Xu Zhao; Qing Dai; Guanqun Zheng; Ying Yang; Chengqi Yi; Tomas Lindahl; Tao Pan; Yun-Gui Yang; Chuan He
Journal: Nat Chem Biol Date: 2011-10-16 Impact factor: 15.040

8. Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors: Kuo-Chen Chou
Journal: J Theor Biol Date: 2010-12-17 Impact factor: 2.691

9. Naïve Bayes classifier with feature selection to identify phage virion proteins.

Authors: Peng-Mian Feng; Hui Ding; Wei Chen; Hao Lin
Journal: Comput Math Methods Med Date: 2013-05-15 Impact factor: 2.238

10. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq.

Authors: Dan Dominissini; Sharon Moshitch-Moshkovitz; Schraga Schwartz; Mali Salmon-Divon; Lior Ungar; Sivan Osenberg; Karen Cesarkas; Jasmine Jacob-Hirsch; Ninette Amariglio; Martin Kupiec; Rotem Sorek; Gideon Rechavi
Journal: Nature Date: 2012-04-29 Impact factor: 49.962
