
im6A-TS-CNN: Identifying the N⁶-Methyladenine Site in Multiple Tissues by Using the Convolutional Neural Network.

Kewei Liu¹, Lei Cao¹, Pufeng Du², Wei Chen³.

Abstract

N6-methyladenosine (m6A) is the most abundant post-transcriptional modification and is involved in a series of important biological processes. Therefore, accurate detection of m6A sites is very important for revealing their biological functions and impacts on diseases. Although both experimental and computational methods have been proposed for identifying m6A sites, few of them can detect m6A sites in different tissues. Given the spatial specificity of m6A modification, methods that can detect m6A sites in different tissues are needed. In this work, we used a convolutional neural network (CNN) to build a new method, called im6A-TS-CNN, that can identify m6A sites in the brain, liver, kidney, heart, and testis of Homo sapiens, Mus musculus, and Rattus norvegicus. In im6A-TS-CNN, the samples are encoded with the one-hot encoding scheme. Results from both a 5-fold cross-validation test and an independent dataset test demonstrate that im6A-TS-CNN is better than the existing method for the same purpose. The command-line version of im6A-TS-CNN is available at https://github.com/liukeweiaway/DeepM6A_cnn.


Keywords: convolution neural network; m6A; one-hot encoding; spatial specificity of gene expression

Year: 2020 PMID: 32858457 PMCID: PMC7473875 DOI: 10.1016/j.omtn.2020.07.034

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

As a common and abundant RNA post-transcriptional modification (PTM), N6-methyladenosine (m6A) plays an important role in almost all processes of the cell cycle, affecting translation efficiency, cell development, cell viability, and more. m6A is catalyzed by a methyltransferase complex containing METTL3, METTL14, and WTAP. As a dynamic PTM, m6A can be erased by the demethylases FTO and ALKBH5. A growing number of studies have revealed that m6A is closely correlated with diseases such as obesity, thyroid tumor, prostate cancer, Zika virus infection, and acute myelogenous leukemia. However, our knowledge of the functions of m6A modifications is still incomplete. To deepen our understanding of these functions, the key step is to know the precise positions of m6A in transcriptomes. There are two main ways to identify m6A sites. One way is to use experimental methods, such as methylated RNA immunoprecipitation (MeRIP), m6A sequencing (m6A-seq), photo-crosslinking-assisted (PA)-m6A-seq, and m6A-crosslinking immunoprecipitation (CLIP). These experimental methods laid important foundations for the detection of m6A modification sites. Accordingly, some bioinformatics tools that detect m6A sites directly from the reads generated by these experiments were proposed. However, as the amount of sequencing data increases, an effective and efficient way to detect m6A in the transcriptome is needed. Accordingly, sequence information-based computational methods were proposed to identify m6A sites. These methods are summarized in a recent review. Research on the spatial specificity of gene expression has shown that the locations of m6A sites differ among tissues and species. Therefore, Dao et al. proposed a tool called iRNA-m6A, which uses a support vector machine (SVM) to identify m6A modification sites in different tissues of human, mouse, and rat, based on the data of Zhang et al.
This method greatly improves the accuracy of predicting m6A sites. However, its predictive performance still has room for improvement. In recent years, deep-learning algorithms have made great contributions to bioinformatics. A large number of computational methods based on deep learning, such as Gene2Vec, BERMP, DeepM6ASeq, and iPseU-CNN, have been proposed. Inspired by these successful applications of deep learning in identifying RNA modifications, in the present work we propose a convolutional neural network (CNN)-based method, called im6A-TS-CNN, to identify m6A sites in different tissues from human, mouse, and rat. Results from a 5-fold cross-validation test and an independent dataset test demonstrate that the performance of im6A-TS-CNN is better than or comparable with that of the existing method for the same aim. Moreover, the universality of im6A-TS-CNN was also demonstrated by a cross-species validation test. The framework of im6A-TS-CNN is illustrated in Figure 1.

Figure 1

The Framework of the im6A-TS-CNN

The first step is to collect tissue-specific m6A data from the human, mouse, and rat. The second step is encoding the sequences by using the one-hot scheme. The third step is model construction.


Results and Discussion

Model Performance

The model was implemented with Keras in TensorFlow 2.0 under Python 3.6. The results of the 5-fold cross-validation test and the independent dataset test for identifying tissue-specific m6A modification sites in human, mouse, and rat are shown in Table 1. For every dataset, the results of the 5-fold cross-validation test and the independent test are close to each other, indicating that the proposed method is stable for identifying m6A sites.
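As an illustration of the 5-fold cross-validation protocol mentioned above, the following is a minimal sketch of how a balanced dataset could be split into five folds. The function name `stratified_k_fold`, the use of stratification, and the random seed are assumptions made for this example; the source does not describe how the folds were drawn.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs, keeping the class ratio in each fold.

    Hypothetical helper: stratification and the seed are assumptions,
    not details taken from the paper.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin into the folds
    for f in range(k):
        valid = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, valid

# Toy data: 10 positive (1) and 10 negative (0) samples.
labels = [1] * 10 + [0] * 10
splits = list(stratified_k_fold(labels, k=5))
for train, valid in splits:
    print(len(train), len(valid))
```

Each sample is used for validation exactly once, and the model is trained on the remaining four folds each time; the reported cross-validation metrics would then be aggregated over the five validation folds.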

Table 1

The Performance of im6A-TS-CNN for Identifying m6A Sites

	5-Fold Cross Validation					Independent Test
	Sn (%)	Sp (%)	Acc (%)	MCC	AUC	Sn (%)	Sp (%)	Acc (%)	MCC	AUC
h_b	75.35	69.71	72.53	0.4523	0.8029	75.17	70.20	72.69	0.4543	0.8056
h_k	81.70	78.25	79.98	0.6006	0.8781	79.95	78.53	79.24	0.5848	0.8727
h_l	80.18	79.69	79.94	0.5992	0.8811	84.81	75.02	79.92	0.6012	0.8805
m_b	81.50	75.85	78.67	0.5749	0.8705	86.22	70.74	78.48	0.5765	0.8722
m_h	78.37	67.60	72.99	0.4633	0.8115	75.82	71.36	73.59	0.4723	0.8161
m_k	79.91	81.00	80.46	0.6094	0.8842	80.52	81.00	80.76	0.6151	0.8855
m_l	72.39	70.24	71.32	0.4288	0.7953	75.56	67.58	71.57	0.4328	0.7927
m_t	75.21	75.61	75.41	0.5090	0.8380	83.45	68.87	76.16	0.5288	0.8467
r_b	79.04	74.23	76.64	0.5379	0.8469	78.05	75.84	76.95	0.5391	0.8516
r_k	84.15	80.77	82.46	0.6500	0.9017	84.85	80.59	82.72	0.6550	0.9077
r_l	81.56	79.63	80.59	0.6126	0.8830	84.51	75.94	80.22	0.6067	0.8847

h, m, and r before the underscore stand for human, mouse, and rat, respectively; b, h, k, l, and t after the underscore stand for brain, heart, kidney, liver, and testis, respectively.

To measure the performance of the proposed method objectively, the receiver operating characteristic (ROC) curves from the 5-fold cross-validation test and the independent test were also plotted (Figure 2). The areas under the ROC curve (AUCs) are higher than 0.8 in both tests for every dataset except mouse liver (m_l; 0.7953 and 0.7927), demonstrating the reliability of the proposed method for identifying tissue-specific m6A sites.
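For readers who want to reproduce an AUC value from predicted scores, the sketch below computes it with the rank-sum (Mann-Whitney U) identity, which equals the area under the ROC curve. The function name `roc_auc` and the toy labels and scores are assumptions for illustration; the source does not state which implementation was used.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum identity; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example: label 1 marks an m6A-containing sequence, scores are predicted probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

The value can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, which is why 0.5 corresponds to random guessing.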

Figure 2

The ROC Curves for Identifying m6A in Different Tissues in the Three Species under the 5-Fold Cross-Validation Test and Independent Dataset Test

The value of AUC is given in the right corner of each graph.


Comparison with Existing Method

To further verify the superiority of im6A-TS-CNN, we compared its performance with that of the iRNA-m6A model of Dao et al., based on both the 5-fold cross-validation test and the independent test. The comparative results in terms of AUC are shown in Table 2. im6A-TS-CNN outperforms iRNA-m6A in every case except two: mouse brain (m_b) under 5-fold cross-validation, where the AUC is 0.0026 lower, and rat brain (r_b) under the independent test, where it is 0.0452 lower. These results demonstrate that im6A-TS-CNN is a powerful tool for identifying tissue-specific m6A sites from different species.

Table 2

Comparative Results between im6A-TS-CNN and iRNA-m6A under the 5-Fold Cross-Validation Test and Independent Test

	5-Fold Cross Validation (AUC)			Independent Test (AUC)
	im6A-TS-CNN	iRNA-m6A	Difference	im6A-TS-CNN	iRNA-m6A	Difference
h_b	0.8029	0.7756	0.0273∗	0.8056	0.7845	0.0211∗
h_k	0.8781	0.8634	0.0147∗	0.8727	0.8565	0.0162∗
h_l	0.8811	0.8738	0.0073∗	0.8805	0.8681	0.0124∗
m_b	0.8705	0.8731	−0.0026	0.8722	0.8613	0.0109∗
m_h	0.8115	0.7948	0.0167∗	0.8161	0.7878	0.0283∗
m_k	0.8842	0.8726	0.0116∗	0.8855	0.8697	0.0158∗
m_l	0.7953	0.7743	0.0210∗	0.7927	0.762	0.0307∗
m_t	0.8380	0.8156	0.0224∗	0.8467	0.8182	0.0285∗
r_b	0.8469	0.8282	0.0187∗	0.8516	0.8968	−0.0452
r_k	0.9017	0.8877	0.0140∗	0.9077	0.8761	0.0316∗
r_l	0.8830	0.8766	0.0064∗	0.8847	0.8265	0.0582∗

h, m, and r before the underscore stand for human, mouse, and rat, respectively; b, h, k, l, and t after the underscore stand for brain, heart, kidney, liver, and testis, respectively.

∗ indicates that im6A-TS-CNN performs better than iRNA-m6A for identifying m6A sites.


Cross-Species and Cross-Tissue Validation

Since the datasets are from different species and tissues, it is interesting to test whether the model, trained based on the data from a specific tissue in a species, is able to identify m6A from other tissues and species. Accordingly, the cross-species and cross-tissue validation was performed. The AUCs of im6A-TS-CNN for identifying m6A sites from other species and tissues are shown in Figure 3. As shown in Figure 3, it can be concluded that im6A-TS-CNN is also effective for the cross-species and cross-tissue identification of m6A sites, demonstrating the universality of the proposed method.

Figure 3

Heatmap Showing the AUC Values of Cross-Species and Cross-Tissue Validation

The abscissa represents the independent dataset, and the ordinate represents the model.


Conclusions

In this article, we proposed a CNN-based method, called im6A-TS-CNN, for identifying m6A sites in the brain, liver, kidney, heart, and testis of human, mouse, and rat. The results from a 5-fold cross-validation test and an independent test demonstrate that im6A-TS-CNN is better than the existing method for identifying tissue-specific m6A sites. For the convenience of the scientific community, the command-line version of im6A-TS-CNN, together with its source code and user manual, is provided at https://github.com/liukeweiaway/DeepM6A_cnn. In addition, high-, normal-, and low-threshold options are provided to control the false-positive rate. The corresponding performance with each option is listed in Table S1. Taken together, we hope that im6A-TS-CNN will become a useful tool for identifying m6A sites.

Materials and Methods

Datasets

A high-quality dataset is very important for the construction of a computational model. In 2019, Zhang et al. developed a high-throughput, antibody-independent m6A detection method based on the m6A-sensitive RNA endoribonuclease to identify the m6A site in different tissues, namely the brain, liver, kidney, heart, and testis from the human, mouse, and rat. Based on these data, Dao et al. built a high-quality benchmark dataset that can be used to train a computational method for identifying m6A sites, which contains both m6A site- and non-m6A site-containing sequences with the length of 41 nt. The CD-HIT program was used to make sure that the sequence similarity of the dataset was less than 80%. The detailed information of this dataset is provided in Table 3.

Table 3

The Information of Benchmark Datasets for Predicting RNA m6A Sites

	Training		Testing
Name	Positive	Negative	Positive	Negative
h_b	4,605	4,605	4,604	4,604
h_k	2,634	2,634	2,634	2,634
h_l	4,574	4,574	4,573	4,573
m_b	8,025	8,025	8,025	8,025
m_h	4,133	4,133	4,133	4,133
m_k	3,953	3,953	3,952	3,952
m_l	2,201	2,201	2,200	2,200
m_t	4,707	4,707	4,706	4,706
r_b	2,352	2,352	2,351	2,351
r_k	1,762	1,762	1,762	1,762
r_l	3,433	3,433	3,432	3,432

h, m, and r before the underscore stand for human, mouse, and rat, respectively; b, h, k, l, and t after the underscore stand for brain, heart, kidney, liver, and testis, respectively.


One-Hot Encoding

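One-hot encoding is a common and effective way to represent nucleotide sequences numerically. Under this scheme, each nucleotide in an RNA segment is mapped to a four-dimensional binary vector: A is represented as (1,0,0,0), U as (0,1,0,0), C as (0,0,1,0), and G as (0,0,0,1). Therefore, an RNA sequence of length l is converted into a 4l-dimensional vector (equivalently, an l × 4 matrix); a 41-nt sample becomes a 41 × 4 matrix.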
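The encoding described above can be sketched in a few lines of Python. The function name `one_hot_encode`, the conversion of T to U, and the error raised for unknown characters are assumptions made for this example; the source only specifies the mapping for A, U, C, and G.

```python
# Mapping taken from the text: A, U, C, G in that channel order.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "U": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
}

def one_hot_encode(seq):
    """Return an l x 4 list of lists for an RNA sequence of length l.

    Assumptions (not from the paper): input is upper-cased, T is read as U,
    and any other character raises ValueError.
    """
    encoded = []
    for base in seq.upper().replace("T", "U"):
        if base not in ONE_HOT:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        encoded.append(list(ONE_HOT[base]))
    return encoded

matrix = one_hot_encode("AUCG")
print(matrix)

# Flattening the l x 4 matrix gives the 4l-dimensional vector.
flat = [value for row in matrix for value in row]
print(len(flat))
```

For the 41-nt samples in the benchmark dataset, the same function would yield a 41 × 4 matrix, which is the shape a one-dimensional convolutional layer with four input channels expects.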

Convolutional Neural Network

In recent years, the convolutional neural network (CNN) has been widely used to solve biological problems. The structure of the CNN is shown in Figure 1. It contains a convolutional layer with 200 filters and a kernel size of 6, followed by a max-pooling layer with a pool size of 4. The convolutional layer is computed as follows:

$$\mathrm{conv}(R)_{jf} = \mathrm{ReLU}\left(\sum_{s=0}^{S-1}\sum_{n=0}^{N-1} W^{f}_{sn}\, R_{j+s,\,n}\right) \tag{1}$$

where R represents the RNA segment, f denotes the index of the kernel, and j denotes the index of the output position. In Equation 1, each filter W^f is an S × N weight matrix, where S is the filter size and N is the number of input channels. The rectified linear unit (ReLU) is expressed as follows:

$$\mathrm{ReLU}(x) = \max(0, x) \tag{2}$$

To prevent overfitting, dropout was applied with a rate of 0.16. The results were passed to a fully connected layer containing 164 neural units and then compressed to 32 neural units. Finally, the softmax function was used to predict whether the RNA segment contains an m6A site; it is expressed as follows:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}} \tag{3}$$

where z_i is the input to the ith output unit. When building the model, stochastic gradient descent (SGD) was used as the optimizer with a learning rate of 0.001, and categorical cross entropy was used as the loss function. Training ran for at most 2,000 epochs, with early stopping (patience of 50 and min_delta of 0.001).
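To make Equation 1 and the pooling step concrete, the following is a minimal pure-Python sketch of the forward pass through a one-dimensional convolution with ReLU, max pooling, and a softmax. It is not the authors' Keras implementation: the function names, the absence of padding, the stride of 1, the non-overlapping pooling, and the two hand-written toy filters (instead of the 200 learned filters) are all assumptions for illustration, and the dropout and fully connected layers are omitted.

```python
import math

def relu(x):
    return max(0.0, x)

def conv1d(R, W):
    """Apply Equation 1. R is an L x N input; W is a list of S x N filters.

    Assumed here: no padding and stride 1, so the output has L - S + 1 positions.
    """
    L, N = len(R), len(R[0])
    S = len(W[0])
    out = []
    for j in range(L - S + 1):
        row = []
        for filt in W:
            total = sum(filt[s][n] * R[j + s][n] for s in range(S) for n in range(N))
            row.append(relu(total))
        out.append(row)
    return out

def max_pool(X, size):
    """Non-overlapping max pooling along the position axis (assumed stride = size)."""
    return [[max(X[j + t][f] for t in range(size)) for f in range(len(X[0]))]
            for j in range(0, len(X) - size + 1, size)]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

# Toy input: a 41-nt poly-A segment, one-hot encoded as 41 x 4 (A = channel 0).
R = [[1, 0, 0, 0] for _ in range(41)]
# Two toy filters of size 6 x 4: one rewards A, one penalizes A.
W = [[[1, 0, 0, 0] for _ in range(6)],
     [[-1, 0, 0, 0] for _ in range(6)]]

conv_out = conv1d(R, W)       # 36 positions x 2 filters
pooled = max_pool(conv_out, 4)  # 9 positions x 2 filters
probs = softmax(pooled[0])    # softmax shown on one pooled row, for illustration only
print(len(conv_out), len(pooled), probs)
```

Under these assumptions, a 41 × 4 input with a kernel size of 6 gives 36 output positions per filter, and pooling with size 4 reduces them to 9; with the paper's 200 filters, the flattened feature vector would then have 9 × 200 = 1,800 entries.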

Evaluation Metrics

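To evaluate the performance of the model, we used the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC), which are defined as follows:

$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}} \\[2ex]
\mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}} \\[2ex]
\mathrm{Acc} = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\[2ex]
\mathrm{MCC} = \dfrac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}
\end{cases}
$$

where $N^{+}$ is the total number of RNA sequences containing a modification site, $N^{+}_{-}$ is the number of false-negative samples, $N^{-}$ is the total number of RNA sequences that do not contain any modification site, and $N^{-}_{+}$ is the number of false-positive samples. In addition, we also used the ROC curve and the area under the ROC curve (AUC) to evaluate the proposed model.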
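The four threshold-dependent metrics can be computed directly from the two class sizes and the two error counts. The sketch below assumes the widely used formulation in terms of N+, N−, false negatives, and false positives, and checks it against the familiar confusion-matrix form of the MCC; the function names and the toy counts are hypothetical and are not results from the paper.

```python
import math

def evaluate(n_pos, n_neg, false_neg, false_pos):
    """Return (Sn, Sp, Acc, MCC) from class sizes and error counts."""
    sn = 1 - false_neg / n_pos
    sp = 1 - false_pos / n_neg
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    numerator = 1 - (false_neg / n_pos + false_pos / n_neg)
    denominator = math.sqrt(
        (1 + (false_pos - false_neg) / n_pos)
        * (1 + (false_neg - false_pos) / n_neg)
    )
    return sn, sp, acc, numerator / denominator

def mcc_confusion(tp, tn, fp, fn):
    """Textbook MCC from the confusion matrix, used here only as a cross-check."""
    return (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

# Toy counts: 100 positives with 20 missed, 100 negatives with 30 false alarms.
sn, sp, acc, mcc = evaluate(n_pos=100, n_neg=100, false_neg=20, false_pos=30)
print(round(sn, 4), round(sp, 4), round(acc, 4), round(mcc, 4))
print(round(mcc_confusion(tp=80, tn=70, fp=30, fn=20), 4))
```

Both MCC expressions give the same value, because substituting TP = N+ − FN and TN = N− − FP into the confusion-matrix form and dividing through by N+ · N− yields the expression in terms of error rates.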

Author Contributions

W.C. conceived and designed the study. K.L. conducted the experiments and implemented the algorithms. W.C., L.C., P.D., and K.L. performed the analysis and wrote the paper. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no competing interests.
