im6A-TS-CNN: Identifying the N6-Methyladenine Site in Multiple Tissues by Using the Convolutional Neural Network.

Kewei Liu, Lei Cao, Pufeng Du, Wei Chen.

Abstract

N6-methyladenosine (m6A) is the most abundant post-transcriptional modification and is involved in a series of important biological processes. Therefore, accurate detection of m6A sites is very important for revealing its biological functions and impacts on diseases. Although both experimental and computational methods have been proposed for identifying m6A sites, few of them are able to detect m6A sites in different tissues. Given the spatial specificity of m6A modification, it is necessary to develop methods that can detect m6A sites in different tissues. In this work, using a convolutional neural network (CNN), we proposed a new method, called im6A-TS-CNN, that can identify m6A sites in the brain, liver, kidney, heart, and testis of Homo sapiens, Mus musculus, and Rattus norvegicus. In im6A-TS-CNN, the samples were encoded using the one-hot encoding scheme. The results from both a 5-fold cross-validation test and an independent dataset test demonstrate that im6A-TS-CNN is better than the existing method for the same purpose. The command-line version of im6A-TS-CNN is available at https://github.com/liukeweiaway/DeepM6A_cnn.
Copyright © 2020 The Author(s). Published by Elsevier Inc. All rights reserved.

Keywords:  convolution neural network; m6A; one-hot encoding; spatial specificity of gene expression

Year:  2020        PMID: 32858457      PMCID: PMC7473875          DOI: 10.1016/j.omtn.2020.07.034

Source DB:  PubMed          Journal:  Mol Ther Nucleic Acids        ISSN: 2162-2531            Impact factor:   8.886


Introduction

As a common and abundant RNA post-transcriptional modification (PTM), N6-methyladenosine (m6A) plays an important role in almost all processes of the cell life cycle, affecting translation efficiency, cell development, and cell viability, among others. m6A is catalyzed by a methyltransferase complex containing METTL3, METTL14, and WTAP. As a dynamic PTM, m6A can be erased by the demethylases FTO and ALKBH5. A growing number of studies have revealed that m6A is closely associated with diseases such as obesity, thyroid tumor, prostate cancer, Zika virus infection, and acute myelogenous leukemia. However, our knowledge of the functions of m6A modifications remains limited. To deepen our understanding of these functions, the key step is to know the precise positions of m6A in transcriptomes.
There are two main ways to identify m6A sites. One is to use experimental methods, such as methylated RNA immunoprecipitation (MeRIP), m6A sequencing (m6A-seq), photo-crosslinking-assisted m6A-seq (PA-m6A-seq), and m6A crosslinking immunoprecipitation (m6A-CLIP). These experimental methods laid important foundations for the detection of m6A modification sites, and some bioinformatics tools were proposed that detect m6A sites directly from the reads generated by these experiments. However, as the amount of sequencing data increases, an effective and efficient way to detect m6A in the transcriptome is needed. Accordingly, sequence information-based computational methods were proposed to identify m6A sites. These methods are summarized in a recent review.
Research on the spatial specificity of gene expression has shown that the locations of m6A sites differ among tissues and species. Therefore, Dao et al. proposed a tool, called iRNA-m6A, that identifies m6A modification sites in different tissues of the human, mouse, and rat by using a support vector machine (SVM), based on the data of Zhang et al. This method greatly improved the accuracy of predicting m6A sites. However, the prediction performance still has considerable room for improvement.
In recent years, deep-learning algorithms have made great contributions to bioinformatics. A large number of computational methods based on deep-learning algorithms, such as Gene2Vec, BERMP, DeepM6ASeq, and iPseU-CNN, have been proposed. Inspired by these successful applications of deep learning in identifying RNA modifications, in the present work, we proposed a convolutional neural network (CNN)-based method, called im6A-TS-CNN, to identify m6A sites in different tissues from the human, mouse, and rat. Results from a 5-fold cross-validation test and an independent dataset test demonstrated that the performance of im6A-TS-CNN is better than or comparable with that of the existing method for the same aim. Moreover, the universality of im6A-TS-CNN was also demonstrated by a cross-species validation test. The framework of im6A-TS-CNN is illustrated in Figure 1.
Figure 1

The Framework of the im6A-TS-CNN

The first step is to collect tissue-specific m6A data from the human, mouse, and rat. The second step is encoding the sequences by using the one-hot scheme. The third step is model construction.

Results and Discussion

Model Performance

In this work, the model was implemented with the Keras API of TensorFlow 2.0 under Python 3.6. The results from the 5-fold cross-validation test and the independent dataset test of the proposed method for identifying tissue-specific m6A modification sites in the human, mouse, and rat are shown in Table 1. The results of the two tests are close to each other for every dataset, indicating that the proposed method is stable for identifying m6A sites.
Table 1

The Performance of im6A-TS-CNN for Identifying m6A Sites

         5-Fold Cross Validation                   Independent Test
Dataset  Sn (%)  Sp (%)  Acc (%)  MCC     AUC      Sn (%)  Sp (%)  Acc (%)  MCC     AUC
h_b      75.35   69.71   72.53    0.4523  0.8029   75.17   70.20   72.69    0.4543  0.8056
h_k      81.70   78.25   79.98    0.6006  0.8781   79.95   78.53   79.24    0.5848  0.8727
h_l      80.18   79.69   79.94    0.5992  0.8811   84.81   75.02   79.92    0.6012  0.8805
m_b      81.50   75.85   78.67    0.5749  0.8705   86.22   70.74   78.48    0.5765  0.8722
m_h      78.37   67.60   72.99    0.4633  0.8115   75.82   71.36   73.59    0.4723  0.8161
m_k      79.91   81.00   80.46    0.6094  0.8842   80.52   81.00   80.76    0.6151  0.8855
m_l      72.39   70.24   71.32    0.4288  0.7953   75.56   67.58   71.57    0.4328  0.7927
m_t      75.21   75.61   75.41    0.5090  0.8380   83.45   68.87   76.16    0.5288  0.8467
r_b      79.04   74.23   76.64    0.5379  0.8469   78.05   75.84   76.95    0.5391  0.8516
r_k      84.15   80.77   82.46    0.6500  0.9017   84.85   80.59   82.72    0.6550  0.9077
r_l      81.56   79.63   80.59    0.6126  0.8830   84.51   75.94   80.22    0.6067  0.8847

h, m, and r before the underscore stand for human, mouse, and rat, respectively; b, h, k, l, and t after the underscore stand for brain, heart, kidney, liver, and testis, respectively.

To assess the performance of the proposed method objectively, the receiver operating characteristic (ROC) curves from the 5-fold cross-validation test and the independent test are plotted in Figure 2. The areas under the ROC curve (AUCs) are higher than 0.8 for every dataset except m_l in both the 5-fold cross-validation test and the independent test, demonstrating the reliability of the proposed method for identifying tissue-specific m6A sites.
Figure 2

The ROC Curves for Identifying m6A in Different Tissues in the Three Species under the 5-Fold Cross-Validation Test and Independent Dataset Test

The value of AUC is given in the right corner of each graph.
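As an illustration of how an AUC value such as those reported here can be obtained, the following minimal sketch computes the AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as one half). The function name roc_auc and the toy labels and scores are assumptions made for this example; they are not taken from the im6A-TS-CNN source code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = m6A site, 0 = non-m6A site).
    scores: predicted probabilities for the positive class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, for illustration only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

This pairwise form is quadratic in the number of samples, which is acceptable for a sketch; library implementations typically use a rank-based computation instead.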

Comparison with Existing Method

To further assess the performance of im6A-TS-CNN, we compared it with Dao et al.'s iRNA-m6A model, based on both the 5-fold cross-validation test and the independent test. The comparative results in terms of AUC are shown in Table 2. im6A-TS-CNN outperforms iRNA-m6A in all cases except two: the mouse brain dataset under 5-fold cross-validation and the rat brain dataset in the independent test. These results demonstrate that im6A-TS-CNN is a powerful tool for identifying tissue-specific m6A sites from different species.
Table 2

Comparative Results between im6A-TS-CNN and iRNA-m6A under the 5-Fold Cross-Validation Test and Independent Test

         5-Fold Cross Validation (AUC)         Independent Test (AUC)
Dataset  im6A-TS-CNN  iRNA-m6A  Difference     im6A-TS-CNN  iRNA-m6A  Difference
h_b      0.8029       0.7756    0.0273         0.8056       0.7845    0.0211
h_k      0.8781       0.8634    0.0147         0.8727       0.8565    0.0162
h_l      0.8811       0.8738    0.0073         0.8805       0.8681    0.0124
m_b      0.8705       0.8731    −0.0026        0.8722       0.8613    0.0109
m_h      0.8115       0.7948    0.0167         0.8161       0.7878    0.0283
m_k      0.8842       0.8726    0.0116         0.8855       0.8697    0.0158
m_l      0.7953       0.7743    0.0210         0.7927       0.7620    0.0307
m_t      0.8380       0.8156    0.0224         0.8467       0.8182    0.0285
r_b      0.8469       0.8282    0.0187         0.8516       0.8968    −0.0452
r_k      0.9017       0.8877    0.0140         0.9077       0.8761    0.0316
r_l      0.8830       0.8766    0.0064         0.8847       0.8265    0.0582

h, m, and r before the underscore stand for human, mouse, and rat, respectively; b, h, k, l, and t after the underscore stand for brain, heart, kidney, liver, and testis, respectively.

A positive difference indicates that the performance of im6A-TS-CNN is better than that of iRNA-m6A for identifying m6A sites.

Cross-Species and Cross-Tissue Validation

Since the datasets are from different species and tissues, it is interesting to test whether the model, trained based on the data from a specific tissue in a species, is able to identify m6A from other tissues and species. Accordingly, the cross-species and cross-tissue validation was performed. The AUCs of im6A-TS-CNN for identifying m6A sites from other species and tissues are shown in Figure 3. As shown in Figure 3, it can be concluded that im6A-TS-CNN is also effective for the cross-species and cross-tissue identification of m6A sites, demonstrating the universality of the proposed method.
Figure 3

Heatmap Showing the AUC Values of Cross-Species and Cross-Tissue Validation

The abscissa represents the independent dataset, and the ordinate represents the model.

Conclusions

In this article, we proposed a CNN-based method, called im6A-TS-CNN, for identifying m6A in the brain, liver, kidney, heart, and testis from the human, mouse, and rat. The results from a 5-fold cross-validation test and an independent test demonstrate that im6A-TS-CNN is better than the existing method for identifying tissue-specific m6A. For the convenience of the scientific community, the command-line version of im6A-TS-CNN, together with its source code and user manual, is provided at https://github.com/liukeweiaway/DeepM6A_cnn. In addition, high-, normal-, and low-threshold options are provided to control the false-positive rate. The corresponding performance with the different options is listed in Table S1. Taken together, we hope that im6A-TS-CNN will become a useful tool for identifying m6A sites.

Materials and Methods

Datasets

A high-quality dataset is very important for the construction of a computational model. In 2019, Zhang et al. developed a high-throughput, antibody-independent m6A detection method based on the m6A-sensitive RNA endoribonuclease to identify the m6A site in different tissues, namely the brain, liver, kidney, heart, and testis from the human, mouse, and rat. Based on these data, Dao et al. built a high-quality benchmark dataset that can be used to train a computational method for identifying m6A sites, which contains both m6A site- and non-m6A site-containing sequences with the length of 41 nt. The CD-HIT program was used to make sure that the sequence similarity of the dataset was less than 80%. The detailed information of this dataset is provided in Table 3.
Table 3

The Information of Benchmark Datasets for Predicting RNA m6A Sites

         Training              Testing
Name     Positive  Negative    Positive  Negative
h_b      4,605     4,605       4,604     4,604
h_k      2,634     2,634       2,634     2,634
h_l      4,574     4,574       4,573     4,573
m_b      8,025     8,025       8,025     8,025
m_h      4,133     4,133       4,133     4,133
m_k      3,953     3,953       3,952     3,952
m_l      2,201     2,201       2,200     2,200
m_t      4,707     4,707       4,706     4,706
r_b      2,352     2,352       2,351     2,351
r_k      1,762     1,762       1,762     1,762
r_l      3,433     3,433       3,432     3,432

h, m, and r before the underscore stand for human, mouse, and rat, respectively; b, h, k, l, and t after the underscore stand for brain, heart, kidney, liver, and testis, respectively.

One-Hot Encoding

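One-hot encoding is a common and effective method for representing nucleotide sequences numerically. Under this scheme, each nucleotide in an RNA segment is represented by a four-dimensional binary vector: A as (1,0,0,0), U as (0,1,0,0), C as (0,0,1,0), and G as (0,0,0,1). Therefore, an RNA sequence of length l is converted into a 4 × l matrix; for the 41-nt samples used here, this gives 4 × 41 = 164 values per sample.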
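The encoding described above can be sketched in a few lines of Python. The mapping follows the text (A, U, C, G in that order); the function name one_hot_encode, the conversion of T to U, and the rejection of any other character are assumptions made for this example and may differ from the released tool.

```python
# Nucleotide-to-vector mapping as given in the text.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "U": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Return a list of l four-dimensional vectors for an RNA sequence of length l."""
    # Assumption: sequences written with the DNA alphabet are converted to RNA.
    rna = sequence.upper().replace("T", "U")
    encoded = []
    for position, base in enumerate(rna):
        if base not in ONE_HOT:
            # Assumption: ambiguous symbols such as N are rejected, not zero-filled.
            raise ValueError(f"unsupported nucleotide {base!r} at position {position}")
        encoded.append(ONE_HOT[base])
    return encoded


# A short example; a real sample in this work is 41 nt long.
encoded = one_hot_encode("GGACU")
for base, vector in zip("GGACU", encoded):
    print(base, vector)
```

Each sample is kept as l rows of four values so that it can be passed directly to a one-dimensional convolution with four input channels.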

Convolutional Neural Network

In recent years, convolutional neural networks (CNNs) have been widely used to solve biological problems. The structure of the CNN is shown in Figure 1. It contains a convolutional layer with 200 filters and a kernel size of 6. After the convolution operation, a max-pooling layer with a pool size of 4 was added. The convolution layer is computed as follows:
$$\mathrm{Conv}(R)_{jf} = \mathrm{ReLU}\left(\sum_{s=0}^{S-1}\sum_{n=0}^{N-1} W^{f}_{sn}\, R_{j+s,n}\right) \qquad (1)$$
where R represents the RNA segment, f denotes the index of the kernel, and j denotes the index of the output position. In Equation 1, each filter $W^{f}$ is an S × N weight matrix, where S is the filter size and N is the number of input channels. The rectified linear unit (ReLU) is expressed as follows:
$$\mathrm{ReLU}(x) = \max(0, x) \qquad (2)$$
To prevent overfitting, dropout with a rate of 0.16 was applied. The output was then passed to a fully connected layer containing 164 units, followed by a fully connected layer containing 32 units. Finally, the softmax function was used to predict whether the RNA segment contains an m6A site; it is expressed as follows:
$$\mathrm{softmax}(z)_{i} = \frac{e^{z_{i}}}{\sum_{k} e^{z_{k}}} \qquad (3)$$
where $z_{i}$ is the i-th input to the output layer.
When building the model, stochastic gradient descent (SGD) was used as the optimizer with a learning rate of 0.001, and the categorical cross entropy was used as the loss function. Training ran for at most 2,000 epochs, with early stopping using a patience of 50 and a min_delta of 0.001.
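To make the layer operations concrete, the following pure-Python sketch applies a single filter to a one-hot encoded 41-nt sequence, followed by ReLU, max pooling, and a softmax over two output scores. It is an illustration only, not the authors' implementation: it assumes a stride of 1 with no padding, a pooling stride equal to the pool size, and no bias term; the filter weights, the example sequence, and the two scores passed to softmax are made up for the example.

```python
import math

ONE_HOT = {"A": [1, 0, 0, 0], "U": [0, 1, 0, 0], "C": [0, 0, 1, 0], "G": [0, 0, 0, 1]}


def conv1d_relu(R, W):
    """One filter: out[j] = ReLU(sum over s, n of W[s][n] * R[j + s][n]).

    R is an L x N input (L positions, N channels); W is an S x N filter.
    Assumes stride 1 and no padding, so the output has L - S + 1 positions.
    """
    S, N = len(W), len(W[0])
    out = []
    for j in range(len(R) - S + 1):
        total = sum(W[s][n] * R[j + s][n] for s in range(S) for n in range(N))
        out.append(max(0.0, total))  # ReLU
    return out


def max_pool1d(x, size):
    """Non-overlapping max pooling (assumed stride equal to the pool size)."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 41-nt sequence and a toy filter (+1 for A, -1 for U at each position).
seq = "GGACU" * 8 + "A"
R = [ONE_HOT[base] for base in seq]
W = [[1.0, -1.0, 0.0, 0.0] for _ in range(6)]  # kernel size 6, 4 input channels

feature_map = conv1d_relu(R, W)        # 41 - 6 + 1 = 36 positions
pooled = max_pool1d(feature_map, 4)    # 36 // 4 = 9 positions
probs = softmax([0.3, 1.2])            # made-up scores for (non-m6A, m6A)

print(len(feature_map), len(pooled), probs)
```

Under these assumptions, a 41-nt input yields 36 positions per filter after the convolution and 9 after pooling; with 200 filters, this would give 9 × 200 values entering the fully connected layers.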

Evaluation Metrics

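To evaluate the performance of the model, we used the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC), which are defined as follows:
$$\mathrm{Sn} = 1 - \frac{N^{+}_{-}}{N^{+}}$$
$$\mathrm{Sp} = 1 - \frac{N^{-}_{+}}{N^{-}}$$
$$\mathrm{Acc} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}$$
$$\mathrm{MCC} = \frac{1 - \left(\frac{N^{+}_{-}}{N^{+}} + \frac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \frac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \frac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}$$
where $N^{+}$ is the total number of RNA sequences containing a modification site, $N^{+}_{-}$ is the number of false-negative samples, $N^{-}$ is the total number of RNA sequences that do not contain any modification site, and $N^{-}_{+}$ is the number of false-positive samples. In addition, we used the ROC curve and the area under the ROC curve (AUC) to evaluate the proposed model.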
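The four metrics can be computed directly from the counts named in the text: the number of positive samples, the number of negative samples, the number of false negatives, and the number of false positives. The sketch below assumes the commonly used formulation in terms of these four counts; the function name evaluate and the example counts are hypothetical and do not come from the paper's tables.

```python
import math


def evaluate(n_pos, n_neg, false_neg, false_pos):
    """Sn, Sp, Acc, and MCC from sample counts and error counts.

    n_pos / n_neg: total numbers of positive / negative samples.
    false_neg: positives predicted as negative; false_pos: negatives predicted as positive.
    """
    sn = 1 - false_neg / n_pos
    sp = 1 - false_pos / n_neg
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    denominator = math.sqrt(
        (1 + (false_pos - false_neg) / n_pos) * (1 + (false_neg - false_pos) / n_neg)
    )
    # MCC is undefined when every sample is assigned to one class; report 0.0 then.
    mcc = (1 - (false_neg / n_pos + false_pos / n_neg)) / denominator if denominator > 0 else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts, for illustration only.
metrics = evaluate(n_pos=100, n_neg=100, false_neg=20, false_pos=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, the MCC agrees with the familiar confusion-matrix form (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) for TP = 80, TN = 70, FP = 30, and FN = 20.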

Author Contributions

W.C. conceived and designed the study. K.L. conducted the experiments and implemented the algorithms. W.C., L.C., P.D., and K.L. performed the analysis and wrote the paper. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no competing interests.