
mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization.

Anjali Garg, Neelja Singhal, Ravindra Kumar, Manish Kumar.

Abstract

Recent evidence suggests that localizing mRNAs near the subcellular compartment of their translated proteins is a robust cellular mechanism that optimizes protein expression post-transcriptionally. Retention of mRNA in the nucleus can regulate the amount of protein translated from each mRNA, thus allowing tight temporal regulation of translation or buffering of protein levels from bursty transcription. In addition, mRNA localization performs a variety of other roles, such as long-distance signaling, facilitating the assembly of protein complexes and coordinating developmental processes. Here, we describe a novel machine-learning based tool, mRNALoc, to predict five subcellular locations of eukaryotic mRNAs using cDNA/mRNA sequences. During five-fold cross-validation, the maximum overall accuracy was 65.19, 75.36, 67.10, 99.70 and 73.59% for the extracellular region, endoplasmic reticulum, cytoplasm, mitochondria and nucleus, respectively. Assessment on independent datasets gave prediction accuracies of 58.10, 69.23, 64.55, 96.88 and 69.35% for the extracellular region, endoplasmic reticulum, cytoplasm, mitochondria and nucleus, respectively. The corresponding AUC values were 0.76, 0.75, 0.70, 0.98 and 0.74. The mRNALoc standalone software and web-server are freely available for academic use under GNU GPL at http://proteininformatics.org/mkumar/mrnaloc.
© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2020        PMID: 32421834      PMCID: PMC7319581          DOI: 10.1093/nar/gkaa385

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Localization of mRNA is an evolutionarily conserved phenomenon that controls many important biological processes, such as cell-fate determination and polar cell growth (1). After post-transcriptional modifications such as 5′ capping, splicing and addition of the 3′ poly(A) tail, the newly transcribed mRNA is either retained within the nucleus or exported out of it. It has been suggested that mRNA localization has several advantages over protein localization (2–6): (a) localizing mRNA to a specific site lets the cell build a local repository of proteins at the site of function instead of transporting individual protein molecules there. This also compartmentalizes protein synthesis and forms a protein gradient within the cell, which ultimately results in local synthesis of the encoded proteins at the target site; (b) mRNA localization acts as a translational/co-translational regulator; (c) mRNA localization is more energy-efficient than protein targeting; and (d) mRNA localization promotes the formation of functional multi-protein complexes and helps avoid unnecessary protein-protein interactions that might be harmful to the cell (7,8). Not all protein synthesis occurs after mRNA localization; a large number of mRNAs are also transported co-translationally (9). Five mechanisms are considered important for mRNA localization: diffusion and localized entrapment, localized degradation, localized synthesis, active transport, and polarized nuclear export. However, transport within a ribonucleoprotein complex is the main mode by which the majority of RNA is transported. Assembly of the ribonucleoprotein complex is sequence-specific and is guided by a short cis-acting stretch of 20–200 nucleotides known as a ‘zipcode’. The zipcode is usually located in the 3′ untranslated region (UTR) of the mRNA, although in some cases it can also be present in the 5′ UTR or in the coding sequence (10,11).

Proteins present in a subcellular compartment are related to the physiological and metabolic functions associated with that compartment. Hence, predicting the subcellular location of an mRNA might suggest the biological function of the gene from which it was transcribed. A tool that can predict the correct intracellular location of transcripts may therefore also help in understanding how gene expression is regulated and how cells achieve polarity. To our knowledge, computational predictors of the subcellular localization of eukaryotic mRNA are unavailable to date. Hence, we developed a Support Vector Machine (SVM) based in-silico tool that predicts eukaryotic mRNA subcellular locations from the primary sequence of the mRNA/cDNA. Named mRNALoc (short for ‘mRNA Localization’), this tool is based on the experimentally validated mRNA localization data retrieved from ‘RNALocate’ (12).

ANALYSIS WORKFLOW

Data sources

In the present work, we collected mRNA sequences and their subcellular location information from the RNALocate database (version 2.0) (12). RNALocate is a manually curated database that provides complete subcellular location annotation of RNA with experimental support. Initially, a total of 28829 mRNA sequences with annotated subcellular localization were obtained. The downloaded mRNA sequences were annotated with either a single or multiple subcellular locations; in the present study, we considered only those with a single location. The mRNA dataset was divided into five subgroups on the basis of subcellular location, namely cytoplasm, endoplasmic reticulum, extracellular region, mitochondria and nucleus. The numbers of mRNA sequences in the five locations were as follows: 6964 in cytoplasm, 1998 in endoplasmic reticulum, 1131 in extracellular region, 442 in mitochondria and 6346 in nucleus. Because redundant mRNA sequences lead to overestimation of prediction capability, we reduced redundancy and avoided homology bias by using the NCBI BLASTCLUST program to retain only sequences showing alignment identity ≤40% over 70% or more of their full length (BLASTCLUST with the ‘-S 40’ and ‘-L 0.7’ options) (13). The final non-redundant mRNA dataset contained 6376 cytoplasm sequences, 1426 endoplasmic reticulum sequences, 855 extracellular region sequences, 421 mitochondria sequences and 5831 nucleus sequences. Five-sixths of this 40% non-redundant dataset was used for training the model, and the remaining one-sixth was used for independent evaluation of the trained model. For details on dataset collection, redundancy removal and construction of the training and independent datasets, please see the supplementary material. The NCBI gene accession numbers, mRNA sequences and subcellular locations are available in the download section of the mRNALoc webserver (http://proteininformatics.org/mkumar/mrnaloc/download.html).
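The 5/6 versus 1/6 partition described above can be illustrated with a short Python sketch. It assumes that the split is made separately within each location, by random shuffling, and with the independent set rounded down; the text does not state these details (they are deferred to the supplementary material), so the function name, the seed and the placeholder identifiers are hypothetical. Only the per-location counts are taken from the text.

```python
import random

# Non-redundant sequence counts per location, as reported in the text.
COUNTS = {
    "cytoplasm": 6376,
    "endoplasmic reticulum": 1426,
    "extracellular region": 855,
    "mitochondria": 421,
    "nucleus": 5831,
}


def split_train_independent(ids, seed=0):
    """Shuffle the ids and hold out one-sixth of them as an independent set.

    Assumption: the held-out size is rounded down; the paper does not say
    how the one-sixth fraction was rounded.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_independent = len(shuffled) // 6
    return shuffled[n_independent:], shuffled[:n_independent]


for location, n in COUNTS.items():
    ids = [f"{location}_{i}" for i in range(n)]  # placeholder identifiers
    train, independent = split_train_independent(ids)
    print(f"{location}: {len(train)} training, {len(independent)} independent")
```

Splitting within each location keeps the class proportions of the training and independent sets similar, which matters here because the five classes differ in size by more than tenfold.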

Overview of mRNALoc

mRNALoc is a web resource for predicting the subcellular localization of eukaryotic mRNA. The overall workflow of mRNALoc is shown in Figure 1. Users provide mRNA sequences in FASTA format. Each submitted mRNA sequence is converted into a numerical encoding using pseudo oligonucleotide composition, or pseudo K-tuple nucleotide composition (PseKNC) (14–17). On the basis of the SVM prediction scores, the mRNA is predicted to localize to one of five subcellular locations, namely cytoplasm, endoplasmic reticulum, extracellular region, mitochondria and nucleus. The prediction is based on five trained SVM models, each specific to one location. During prediction, each model provides a score for its corresponding location, and the location whose SVM model gives the maximum score is the predicted location. The final outcome of mRNALoc depends on the user-selected threshold: higher thresholds result in more specific predictions, while lower thresholds result in less specific predictions.
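The decision rule described above (five per-location scores, with the maximum score deciding the location) can be sketched in Python as follows. How the user-selected threshold interacts with the maximum score is not spelled out in the text; this sketch assumes that no location is reported when the best score falls below the threshold. The function name and the example scores are hypothetical and are not outputs of mRNALoc.

```python
def assign_location(scores, threshold=0.0):
    """Pick the location whose model gives the highest score.

    `scores` maps each location to the score of its one-vs-rest model.
    Assumption: if the best score is below `threshold`, no location is
    reported (None); the paper does not define this behaviour exactly.
    """
    best = max(scores, key=scores.get)
    best_score = scores[best]
    if best_score < threshold:
        return None, best_score
    return best, best_score


# Hypothetical scores for a single query sequence.
example_scores = {
    "cytoplasm": 0.31,
    "endoplasmic reticulum": -0.40,
    "extracellular region": -0.75,
    "mitochondria": 0.82,
    "nucleus": 0.10,
}

print(assign_location(example_scores))                 # default threshold
print(assign_location(example_scores, threshold=1.0))  # stricter threshold
```

Under this assumption, raising the threshold trades coverage for specificity: fewer sequences receive a location, but those that do were scored more confidently.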
Figure 1.

Overall schema of mRNALoc. mRNALoc predicts five subcellular locations, viz. mitochondria, cytoplasm, nucleus, endoplasmic reticulum and extracellular region. It first removes query sequences that contain non-standard nucleotides and then generates combined features from pseudo K-tuple nucleotide composition, which are used as input for Support Vector Machine (SVM) prediction.
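To make the feature-generation step concrete, the following Python sketch computes plain K-tuple (k-mer) frequencies for K = 2, 3, 4 and 5 and concatenates them. This is a simplification: PseKNC as defined in the cited works also includes pseudo-components derived from physicochemical properties, which are omitted here, so the sketch should not be read as the encoding used by mRNALoc. Function names are hypothetical.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return normalized frequencies of all 4**k k-mers, in lexicographic order."""
    seq = sequence.upper().replace("U", "T")  # treat RNA and cDNA input alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return [0.0] * len(kmers)
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:  # windows with non-standard letters are ignored
            counts[window] += 1
    return [counts[kmer] / n_windows for kmer in kmers]


def hybrid_features(sequence, ks=(2, 3, 4, 5)):
    """Concatenate k-mer frequency vectors for every k in `ks`."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(sequence, k))
    return features


vector = hybrid_features("AUGGCUACGUAGCUAGCUAGGCUAACGUUAGC")
print(len(vector))  # 16 + 64 + 256 + 1024 = 1360 features
```

With K = 2 to 5 the concatenated vector has 4^2 + 4^3 + 4^4 + 4^5 = 1360 frequency components per sequence, regardless of sequence length.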

TRAINING OF PREDICTIVE SVM MODELS

The performance of mRNALoc for all subcellular locations during five-fold cross-validation is shown in Table 1. Using the combined input of PseKNC (K = 2, 3, 4 and 5), we obtained prediction accuracies of 65.19, 75.36, 67.10, 99.70 and 73.59% for mRNAs located in the extracellular region, endoplasmic reticulum, cytoplasm, mitochondria and nucleus, respectively. When evaluated on an independent dataset, mRNALoc achieved sensitivity, specificity, accuracy and MCC values of 81.38, 56.67, 58.10 and 0.18 for extracellular region, 75.10, 68.60, 69.23 and 0.27 for endoplasmic reticulum, 73.26, 58.06, 64.55 and 0.31 for cytoplasm, 87.32, 97.16, 96.88 and 0.63 for mitochondria and 50.20, 81.62, 69.35 and 0.34 for nucleus, respectively (Supplementary Figures S1 and S2, Supplementary Table S1).
Table 1.

The performance metrics for mRNA subcellular localization under the hybrid K-mer feature (2+3+4+5), and performance of the SVM-based classifiers (mRNALoc) on independent data

Location                 Sen (%)   Spe (%)   ACC (%)   MCC     THR      AUC

Training dataset
Extracellular region     62.67     65.34     65.19     0.14    −0.20    0.69
Endoplasmic reticulum    74.09     75.49     75.36     0.32    0.40     0.81
Cytoplasm                66.69     67.41     67.10     0.34    0.40     0.69
Mitochondria             96.28     99.79     99.70     0.95    0.10     0.98
Nucleus                  74.17     73.22     73.59     0.47    0.40     0.76

Independent dataset
Extracellular region     81.38     56.67     58.10     0.18    −0.20    0.76
Endoplasmic reticulum    75.10     68.60     69.23     0.27    0.40     0.75
Cytoplasm                73.26     58.06     64.55     0.31    0.40     0.70
Mitochondria             87.32     97.16     96.88     0.63    0.10     0.98
Nucleus                  50.20     81.62     69.35     0.34    0.40     0.74

Sen: sensitivity, Spe: specificity, ACC: accuracy, MCC: Matthews correlation coefficient, THR: threshold, and AUC: area under ROC curve.
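For readers who want to relate the columns of Table 1 to one another, the sketch below computes sensitivity, specificity, accuracy and MCC from the four cells of a binary confusion matrix using the standard definitions. We assume these match the definitions used for mRNALoc, which are given in the supplementary material rather than in this text. The counts in the example are hypothetical and are not taken from the paper.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical one-vs-rest counts: 100 positives among 1000 sequences.
metrics = classification_metrics(tp=80, fn=20, tn=850, fp=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When negatives greatly outnumber positives, as in a one-vs-rest setting with a small class, accuracy is dominated by specificity. This is consistent with the independent-dataset row for the extracellular region in Table 1, where accuracy (58.10%) lies close to specificity (56.67%) despite a sensitivity of 81.38%.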

COMPARISON WITH EXISTING mRNA SUBCELLULAR LOCALIZATION PREDICTION METHODS

Although the role of mRNA localization in cellular physiology is unambiguously established, attempts to build in-silico tools to predict the subcellular localization of mRNA are few in comparison to those for protein subcellular localization. Recently, Yan et al. proposed a deep-learning based method, RNATracker (18), to predict the subcellular localization of mRNA using data from CeFra-Seq (19) and APEX-RIP (3). A human mRNA subcellular localization method, iLoc-mRNA, was also developed using data from RNALocate (20). Although RNATracker and iLoc-mRNA are based on two different mRNA subcellular localization datasets and were developed using two different approaches, mRNALoc has several advantages over both. (a) Localization data produced by CeFra-Seq/APEX-RIP are inherently noisy and sometimes inaccurate (18). mRNALoc was developed from datasets retrieved from RNALocate (12), which contains manually curated mRNA subcellular localization information with experimental evidence. (b) RNATracker considered only the longest of all isoforms, whereas mRNALoc made no such distinction. (c) Redundant mRNA sequences were not removed in RNATracker, and in iLoc-mRNA the redundancy threshold was 80%, whereas for mRNALoc we used 40% non-redundant mRNA sequences to train the predictor. This may be the reason underlying the high MCC and AUC of RNATracker and iLoc-mRNA. (d) Both RNATracker and iLoc-mRNA were developed using only localization data of human mRNA. In contrast, mRNALoc is a general-purpose eukaryotic mRNA subcellular localization prediction tool, which is applicable to all eukaryotes. (e) RNATracker also excluded genes with low expression, but mRNALoc made no such distinction (Supplementary Table S2).

We also conducted a one-to-one comparison of the performance of iLoc-mRNA and mRNALoc. As RNATracker requires gene expression and coordination files for prediction, it was not possible to include it in the evaluation. For the comparison we used the independent dataset of mRNALoc. Since iLoc-mRNA is specifically designed for human mRNA subcellular localization prediction, we used 50 human mRNA sequences from the independent dataset of mRNALoc. The number of human mRNAs in the different locations and the prediction results of mRNALoc and iLoc-mRNA are shown in Table 2. We did not find human mRNA sequences for the extracellular region and mitochondria in the mRNALoc independent dataset; hence, these locations were not included in the evaluation.
Table 2.

Comparative evaluation of mRNALoc and iLoc-mRNA. In extracellular region and mitochondria no human mRNA was present, hence these two locations were not included in the evaluation

Location                 Human mRNA    mRNALoc    mRNALoc    iLoc-mRNA    iLoc-mRNA
                         sequences     TP         FN         TP           FN
Cytoplasm                50            35         15         18           32
Endoplasmic reticulum    50            34         16         37           13
Extracellular region     0             0          0          0            0
Mitochondria             0             0          0          0            0
Nucleus                  50            33         17         13           37

TP: true positive, FN: false negative.

As shown in Table 2, mRNALoc performed better than iLoc-mRNA for cytoplasm and nucleus, whereas iLoc-mRNA performed better than mRNALoc for endoplasmic reticulum. It is also pertinent to mention that iLoc-mRNA assigns each mRNA to one of the following locations: cytosol/cytoplasm, ribosome, endoplasmic reticulum, and nucleus/exosome/dendrite/mitochondrion. We feel that combining nucleus, exosome, dendrite and mitochondria into a single category is not appropriate, as these are diverse subcellular locations.
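The comparison in Table 2 can be expressed as per-location sensitivity, TP / (TP + FN). The short Python sketch below does this arithmetic using the counts from Table 2; the variable and function names are illustrative only.

```python
# (true positives, false negatives) per location and tool, from Table 2.
TABLE2 = {
    "cytoplasm": {"mRNALoc": (35, 15), "iLoc-mRNA": (18, 32)},
    "endoplasmic reticulum": {"mRNALoc": (34, 16), "iLoc-mRNA": (37, 13)},
    "nucleus": {"mRNALoc": (33, 17), "iLoc-mRNA": (13, 37)},
}


def sensitivity(tp, fn):
    """Fraction of sequences from a location that were assigned to it."""
    return tp / (tp + fn)


for location, tools in TABLE2.items():
    row = ", ".join(
        f"{tool}: {sensitivity(*counts):.0%}" for tool, counts in tools.items()
    )
    print(f"{location}: {row}")
```

This gives 70% versus 36% for cytoplasm, 68% versus 74% for endoplasmic reticulum and 66% versus 26% for nucleus (mRNALoc versus iLoc-mRNA). Because Table 2 reports only true positives and false negatives, specificity and MCC cannot be derived from it.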

DESCRIPTION OF THE WEBSERVER

Implementation of mRNALoc

The web server is hosted on a Linux system, and the back-end pipeline is implemented in Perl. The webserver has an intuitive interface and a ‘how-to’ guide to help the user. Each mRNA query sequence must be at least 100 bp long and contain only valid characters, namely ‘A’, ‘C’, ‘G’ and ‘T/U’. Sequences containing non-standard nucleotides are omitted from the prediction pipeline (Figure 2).
Figure 2.

Screenshots of mRNALoc webserver.
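The input requirements stated above (FASTA input, at least 100 bp, only A, C, G and T/U) can be checked before submission with a small Python sketch such as the one below. It is not the server's Perl pipeline; the function names and the example records are hypothetical, and the sketch assumes that lower-case input is acceptable, which the text does not state.

```python
VALID_NUCLEOTIDES = set("ACGTU")
MIN_LENGTH = 100  # minimum query length stated in the text


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_valid_query(sequence):
    """True if the sequence is long enough and has only standard nucleotides."""
    seq = sequence.upper()  # assumption: case is ignored
    return len(seq) >= MIN_LENGTH and set(seq) <= VALID_NUCLEOTIDES


def filter_queries(records):
    """Split records into those kept for prediction and skipped headers."""
    kept, skipped = [], []
    for header, sequence in records:
        if is_valid_query(sequence):
            kept.append((header, sequence))
        else:
            skipped.append(header)
    return kept, skipped


example = ">ok\n" + "ACGU" * 30 + "\n>too_short\nACGT\n>has_N\n" + "ACGN" * 30 + "\n"
kept, skipped = filter_queries(read_fasta(example))
print([header for header, _ in kept], skipped)
```

Running such a check locally shows in advance which records would be dropped, so that an unexpectedly short result table is easier to interpret.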

The output of mRNALoc

The output of mRNALoc is presented in a tabular format. It contains the highest score obtained from the five SVM models and the location to which the mRNA is assigned. The mRNALoc webserver can process a maximum of fifty sequences per submission. Hence, for genome-scale prediction the standalone version is required (Figure 2 and Supplementary Figure S3).

CONCLUSIONS AND FUTURE PROSPECTS

The annotation of subcellular localization has been addressed mainly at the protein level, and many in silico tools have been developed to predict protein subcellular location using machine-learning techniques. It has been unequivocally established that both mRNA and protein localization play an equal role in protein translocation. In future versions of mRNALoc we would like to overcome some of the limitations of the present tool. First and foremost, our tool is currently limited by the accuracy of the RNALocate datasets. Although RNALocate contains data from 65 organisms, most of the data come from common model organisms such as Homo sapiens, Mus musculus and Saccharomyces cerevisiae. Moreover, at the biological level, locations such as axons, dendrites, dendritic spines, or anterior/posterior versus dorsal/ventral positions are more relevant than cytosol, mitochondria or extracellular locations. Another limitation is that, because less mRNA localization data are available for plants than for other domains of life, mRNALoc performance might be compromised (21). The performance of a machine-learning method depends on the data on which it is trained. We believe that with the development of new and better techniques for determining RNA localization, information about RNA localization in plants will become available in the near future, and future versions of mRNALoc will then also support prediction for plant mRNA sequences. We admit that mRNALoc is at an early stage of development and that training on additional datasets is needed to further improve our tool. Further prediction of mRNA localization will also help in predicting novel zipcodes, which may guide researchers to frame new hypotheses for unraveling the finer details of the mechanism of mRNA-protein complex formation that is actually responsible for mRNA localization. Although the current version of mRNALoc supports prediction of eukaryotic mRNA only, future versions of mRNALoc will include data from other organisms and locations.
  21 in total

Review 1.  Mechanisms of subcellular mRNA localization.

Authors:  Malgorzata Kloc; N Ruth Zearfoss; Laurence D Etkin
Journal:  Cell       Date:  2002-02-22       Impact factor: 41.582

2.  Detecting the undetectable: uncovering duplicated segments in Arabidopsis by comparison with rice.

Authors:  Klaas Vandepoele; Cedric Simillion; Yves Van de Peer
Journal:  Trends Genet       Date:  2002-12       Impact factor: 11.639

3.  iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition.

Authors:  Bin Liu; Longyun Fang; Ren Long; Xun Lan; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2015-10-17       Impact factor: 6.937

4.  Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function.

Authors:  Eric Lécuyer; Hideki Yoshida; Neela Parthasarathy; Christina Alm; Tomas Babak; Tanja Cerovina; Timothy R Hughes; Pavel Tomancak; Henry M Krause
Journal:  Cell       Date:  2007-10-05       Impact factor: 41.582

5.  Delocalization of Vg1 mRNA from the vegetal cortex in Xenopus oocytes after destruction of Xlsirt RNA.

Authors:  M Kloc; L D Etkin
Journal:  Science       Date:  1994-08-19       Impact factor: 47.728

Review 6.  Principles and roles of mRNA localization in animal development.

Authors:  Caroline Medioni; Kimberly Mowry; Florence Besse
Journal:  Development       Date:  2012-09       Impact factor: 6.868

Review 7.  Protein targeting to subcellular organelles via MRNA localization.

Authors:  Benjamin L Weis; Enrico Schleiff; William Zerges
Journal:  Biochim Biophys Acta       Date:  2013-02

8.  iRSpot-EL: identify recombination spots with an ensemble learning approach.

Authors:  Bin Liu; Shanyi Wang; Ren Long; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2016-08-16       Impact factor: 6.937

9.  CeFra-seq reveals broad asymmetric mRNA and noncoding RNA distribution profiles in Drosophila and human cells.

Authors:  Louis Philip Benoit Bouvrette; Neal A L Cody; Julie Bergalet; Fabio Alexis Lefebvre; Cédric Diot; Xiaofeng Wang; Mathieu Blanchette; Eric Lécuyer
Journal:  RNA       Date:  2017-10-27       Impact factor: 4.942

10.  Prediction of mRNA subcellular localization using deep recurrent neural networks.

Authors:  Zichao Yan; Eric Lécuyer; Mathieu Blanchette
Journal:  Bioinformatics       Date:  2019-07-15       Impact factor: 6.937
