CNIT: a fast and accurate web tool for identifying protein-coding and long non-coding transcripts based on intrinsic sequence composition.

Jin-Cheng Guo^1,2,3, Shuang-Sang Fang^3,4, Yang Wu^1,3, Jian-Hua Zhang^5, Yang Chen^2, Jing Liu^6, Bo Wu^3, Jia-Rui Wu^1, En-Min Li^2, Li-Yan Xu^2, Liang Sun^3, Yi Zhao^1.

Abstract

As next-generation sequencing produces ever more high-throughput data, classifying RNA transcripts as protein-coding or non-coding remains a challenge, especially for poorly annotated species. We upgraded our original coding potential calculator, CNCI (Coding-Non-Coding Index), to CNIT (Coding-Non-Coding Identifying Tool), which provides faster and more accurate evaluation of the coding ability of RNA transcripts. CNIT runs ∼200 times faster than CNCI and is more accurate (0.98 versus 0.94 for human, 0.95 versus 0.93 for mouse, 0.93 versus 0.92 for zebrafish, 0.93 versus 0.92 for fruit fly, 0.92 versus 0.88 for worm, and 0.98 versus 0.85 for Arabidopsis transcripts). Moreover, the AUC values for 11 animal species and 27 plant species showed that CNIT can obtain relatively accurate identification results for almost all eukaryotic transcripts. In addition, a mobile-friendly web server is now freely available at http://cnit.noncode.org/CNIT.

Year: 2019 PMID: 31147700 PMCID: PMC6602462 DOI: 10.1093/nar/gkz400

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Numerous studies show that non-coding RNAs (ncRNAs) have critical roles in diverse biological processes from plants to animals (1–4), such as acting as microRNA sponges (5), regulating cell development (6), acting as modular scaffolds (7) and regulating epigenetic inheritance (8). Despite the increasing amount of high-throughput data produced by next-generation sequencing, classifying transcripts as protein-coding or non-coding remains a challenge, especially for poorly annotated species. For instance, software for annotating plant transcripts is scarce and/or of low accuracy, even though plants are an important resource for novel drug leads (9). The study of plant long non-coding RNA (lncRNA) is still in its infancy, and work on the biological functions and mechanisms of plant non-coding RNAs has focused mainly on model plants such as rice and Arabidopsis. The first step in new research is to identify lncRNAs with effective identification software, so as to determine the research method and direction for the functional delineation of the newly discovered RNA. At present, few software programs can be used to identify plant non-coding RNA, and in general their accuracy has not been verified on a large number of datasets. To overcome these shortcomings and make it easier for users to distinguish transcripts, we updated our CNCI algorithm (10) to create CNIT. CNIT runs ∼200 times faster than CNCI and exhibits higher accuracy, especially for plants, when using human and Arabidopsis data as training sets. Because CNIT, like CNCI, classifies protein-coding and non-coding RNAs solely on the basis of intrinsic sequence composition, it is potentially applicable to a variety of species that lack a whole-genome sequence or are poorly annotated. In addition, we constructed a mobile-friendly web server for researchers; CNIT is now freely available at http://cnit.noncode.org/CNIT/ as both a web server and a downloadable stand-alone package.

MATERIALS AND METHODS

Dataset processing

To construct and validate the CNIT model, we downloaded and filtered protein-coding and non-coding sequence data of 11 animal and 26 plant species from RefSeq and Ensembl (Supplementary materials and methods and Supplementary Table S2). Animal protein-coding and non-coding transcripts were from the RefSeq database (11). For plants, coding transcripts were obtained from RefSeq, and non-coding transcripts were from the Ensembl Plants (v37) database (12). A total of 19752 coding RNAs and 19752 non-coding RNAs of human origin (GRCh38) were selected for training and testing. In addition, 2588 coding RNAs and 2588 non-coding RNAs of Arabidopsis thaliana (EnsemblPlants-v37) were used to build the plant model. Of these coding and non-coding transcripts, 70% were selected for training and 30% for testing. To evaluate the cross-species performance of CNIT, the remaining 10 animal species and 25 plant species were used for validation. The training and testing datasets collected for CNIT can be obtained from the download page (http://cnit.noncode.org/CNIT/download). In recent years, small open reading frames (sORFs; sequence length less than 300 nt) have been studied continuously, but no well-organized database of known sORFs has yet been formed (13–15). Therefore, the existence of sORFs was not considered in any of the above lncRNA datasets. Moreover, to compare performance in identifying mRNAs with sORFs, we extracted a dataset of human mRNAs that contain sORFs. To evaluate the performance of CNIT against other software, we further downloaded independent, high-quality test datasets that meet strict standards from the CPC2 website (http://cpc2.cbi.pku.edu.cn/help/data_set.php), including human, mouse, zebrafish, fruit fly, worm and Arabidopsis thaliana datasets, for validation and comparison (16).
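The 70%/30% partition described above can be illustrated with a minimal sketch. It assumes that each class is shuffled and split separately so that both subsets stay balanced; the text does not say whether the split was stratified in this way, and the function name, seed and placeholder transcript IDs are hypothetical.

```python
import random

def split_balanced(coding, noncoding, train_frac=0.7, seed=0):
    """Shuffle and split each class separately (assumed, not stated in the paper)."""
    rng = random.Random(seed)
    train, test = [], []
    for label, seqs in ((1, coding), (0, noncoding)):
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_frac))
        train += [(s, label) for s in shuffled[:cut]]
        test += [(s, label) for s in shuffled[cut:]]
    return train, test

# Placeholder IDs standing in for the 19752 coding and 19752 non-coding human transcripts.
coding = [f"mRNA_{i}" for i in range(19752)]
noncoding = [f"lncRNA_{i}" for i in range(19752)]
train, test = split_balanced(coding, noncoding)
print(len(train), len(test))
```

Splitting per class keeps the 1:1 coding to non-coding ratio in both subsets, which matters when accuracy is the reported metric.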

Model construction

Consistent with CNCI (10), we first constructed a comparison frequency matrix of adjoining nucleotide triplets (ANT) using the training dataset (lncRNA sequences and coding domain sequences (CDS)). Based on this matrix, the sub-sequence with the highest summed ANT frequency in each reading frame was identified as the most-like CDS (MLCDS). Six MLCDSs were thus obtained from the six reading frames. Among them, the MLCDS with the maximal score (summed ANT frequency) was termed the MMLCDS. From the six MLCDSs, four groups of features were derived: the MMLCDS score, the standard deviation of the six MLCDS scores, the standard deviation of the six MLCDS lengths, and the MMLCDS codon frequencies (4 × 4 × 4 = 64 dimensions). These 67 features in total were used to construct the XGBoost models (Supplementary materials and methods).
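The feature construction can be sketched as follows. This is an illustration under stated assumptions, not the CNIT implementation: the ANT comparison matrix is taken to be a log-ratio of pseudocount-smoothed pair frequencies, the MLCDS is found with a maximum-sum (Kadane) scan, and population standard deviations are used. All function names are hypothetical.

```python
import math
import statistics
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
COMP = str.maketrans("ACGT", "TGCA")

def to_codons(s):
    return [s[i:i + 3] for i in range(0, len(s) - 2, 3)]

def ant_counts(seqs):
    """Count adjoining nucleotide triplet (ANT) pairs, reading each sequence in frame 0."""
    counts = Counter()
    for s in seqs:
        codons = to_codons(s)
        counts.update(zip(codons, codons[1:]))
    return counts

def ant_score_matrix(cds_seqs, nc_seqs, pseudo=1.0):
    """Assumed comparison matrix: log-ratio of ANT frequency in CDS vs non-coding."""
    c, n = ant_counts(cds_seqs), ant_counts(nc_seqs)
    c_tot = sum(c.values()) + pseudo * 4096
    n_tot = sum(n.values()) + pseudo * 4096
    return {(a, b): math.log(((c[(a, b)] + pseudo) / c_tot) /
                             ((n[(a, b)] + pseudo) / n_tot))
            for a in CODONS for b in CODONS}

def six_frames(seq):
    rc = seq.translate(COMP)[::-1]  # reverse complement
    return [strand[off:] for strand in (seq, rc) for off in range(3)]

def mlcds(frame_seq, matrix):
    """Maximum-sum scan; returns (score, start, end) in codon units, end exclusive."""
    codons = to_codons(frame_seq)
    best, run, start = (0.0, 0, 0), 0.0, 0
    for i in range(len(codons) - 1):
        if run <= 0:
            run, start = 0.0, i  # restart the candidate region here
        run += matrix.get((codons[i], codons[i + 1]), 0.0)
        if run > best[0]:
            best = (run, start, i + 2)
    return best

def features(seq, matrix):
    """67 features: MMLCDS score, SD of scores, SD of lengths, 64 codon frequencies."""
    hits = [(mlcds(f, matrix), f) for f in six_frames(seq)]
    scores = [h[0][0] for h in hits]
    lengths = [h[0][2] - h[0][1] for h in hits]
    (top, s, e), frame = max(hits, key=lambda h: h[0][0])
    codon_counts = Counter(to_codons(frame)[s:e])
    total = max(e - s, 1)
    freq = [codon_counts[c] / total for c in CODONS]
    return [top, statistics.pstdev(scores), statistics.pstdev(lengths)] + freq

# Toy training data, for illustration only.
matrix = ant_score_matrix(["ATGGCCGCCGCCTAA"], ["TTTTTTTTTTTTTTT"])
vec = features("ATGGCCGCCGCCTAA", matrix)
print(len(vec))
```

A feature vector of this shape could then be passed to a gradient-boosted classifier such as XGBoost, as the text describes; model training is omitted here.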

RESULTS

CNIT identification performance and comparison with existing tools

Among existing tools, CPC2 and CNCI can be used to compare performance across a wide range of species. The test data were downloaded from the CPC2 website (http://cpc2.cbi.pku.edu.cn/help/data_set.php, Figure 1A). The data for six species were classified by five software programs, CNIT, CNCI, CPC2, CPAT (which can only identify four species) (17) and PLEK (18), and the accuracy was calculated (Figure 1B). CNIT has higher accuracy than CNCI. In terms of computing time, CNIT is ∼200 times faster than CNCI, as evaluated by the average ratio of the CNCI and CNIT running times across the six species when both were run in single-thread mode. Moreover, CNIT performs better than CPC2 in almost all species, the exception being worm sequences (accuracy: 0.915 for CNIT versus 0.975 for CPC2). CNIT was also more accurate than CPAT and PLEK in more species. We then used the five software programs to identify mRNAs with sORFs (Supplementary Table S3). According to the results, CNIT has higher accuracy than CNCI, CPAT and CPC2 in identifying mRNAs with sORFs, while PLEK has the highest accuracy.

Figure 1.

Evaluation of the accuracy of CNIT, CNCI, CPC2, CPAT and PLEK software. Overall comparison data (A) and detailed accuracy (B) in the six organisms from the CPC2 website.

For all downloaded animal and plant sequence data, CNIT classified each species separately, and ROC curves were drawn to assess performance by AUC. For animal species, CNIT achieved very high AUC values for mammals, amphibians, reptiles, birds, fish and invertebrates, indicating that it can distinguish coding from non-coding RNA. Similarly, for plants, CNIT obtained high AUCs for monocotyledons, dicotyledons, bryophytes, ferns, Chlorophyta and red algae, especially monocotyledons and dicotyledons. The plants and animals used for validation cover most major taxonomic groups. Although this validation is not fully rigorous, it suggests that CNIT can classify most eukaryotic RNAs as coding or non-coding. Here, we show the prediction of CNIT for 37 species (11 animal and 26 plant species) with the corresponding AUC values (Figure 2). We also compared the performance of CNIT and CPC2 on the above datasets and report the prediction accuracy for the 37 species in Supplementary Table S2; because some datasets are imbalanced, the macro-averaged F1 statistic was also calculated. The comparison showed that CNIT recognized coding transcripts better than CPC2 (Supplementary Figure S1A). Although CPC2 identified non-coding sequences better (Supplementary Figure S1B), CNIT has the advantage when coding and non-coding sequences are identified together, with a higher macro-averaged F1, especially for plant species (Supplementary Figure S1C).
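To make the macro-averaged F1 statistic concrete, the following sketch computes the F1 score for each class separately and then takes the unweighted mean, so that the minority class counts as much as the majority class. The labels and predictions are invented toy values, not results from the paper.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Imbalanced toy set: 8 coding (1) and 2 non-coding (0) transcripts.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 8 + [1, 0]  # one non-coding transcript is called coding

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 3))
```

In this toy case the accuracy is 0.9 while the macro-averaged F1 is about 0.804, because the single error halves the recall of the small non-coding class.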

Figure 2.

Global prediction by ROC analysis for CNIT across 37 species.

Web server introduction

Users can access the CNIT web portal at http://cnit.noncode.org/CNIT/. The web tool accepts RNA transcripts in FASTA format as input and outputs the assessed coding potential of the sequences. CNIT provides two submission methods. The simplest is to enter one or more RNA sequences in FASTA format on the website home page and click RUN to submit the identification task. Alternatively, users can upload RNA sequence files in FASTA format on the ‘Batch’ page for batch identification. However, if a sequence contains too many Ns or other non-standard characters (more than 10% of the sequence), it may not produce results. In addition to web-based identification, users can download the CNIT software package and install it on a Linux system. For installation and usage, see the ‘Download’ page. When the identification program finishes running, the results appear on the results page. CNIT results give an overview of the coding status of the input sequences. Each row corresponds to one input sequence. The columns show the transcript ID, the coding/non-coding classification label (Index) and the coding probability score (CNIT Score: greater than 0 indicates coding RNA, less than 0 indicates non-coding RNA; the larger the score, the greater the coding potential). Users can further click ‘View’ to enter the identification detail page. A unique job ID is assigned to each job by the web server. Users can use the job ID to track job progress and retrieve the results, which are saved on the server for one week.
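Before submitting sequences, it may help to check them against the 10% limit on Ns and other non-standard characters, and to read the sign of the CNIT score as a label. The sketch below is a local pre-check only and does not call the web server; the helper names are hypothetical, and treating a score of exactly 0 as non-coding is an assumption, since the text defines only the greater-than and less-than cases.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

def ambiguous_fraction(seq):
    """Fraction of characters that are not A, C, G, T or U."""
    seq = seq.upper()
    if not seq:
        return 1.0
    return sum(base not in "ACGTU" for base in seq) / len(seq)

def label_from_score(score):
    """Score > 0 is coding; a score of exactly 0 is treated as non-coding (assumption)."""
    return "coding" if score > 0 else "noncoding"

fasta = ">a\nACGT\nNNNN\n>b example\nACGTACGTAN\n"
for rec_id, seq in read_fasta(fasta).items():
    frac = ambiguous_fraction(seq)
    status = "likely rejected" if frac > 0.10 else "ok"
    print(rec_id, f"{frac:.2f}", status)

# Scores reported in the EXAMPLE section for L1CAM and HOTAIR.
print(label_from_score(0.88), label_from_score(-0.31))
```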

EXAMPLE

We took the human coding gene L1 cell adhesion molecule, transcript variant 1 (L1CAM: NM_000425.4) (19), as an example and used the online version of CNIT for identification. CNIT predicted it to be a coding transcript, with a CNIT score of 0.88 (Figure 3A).

Figure 3.

Screenshot of the CNIT web server. (A) Summary html view output with coding probability. (B, C) Graphical view of the ‘Details’ page.

‘View’ in the last column can be clicked to display more detailed information. The details page is divided into three parts. A description of L1CAM summarizing its coding probability and features is presented at the top (Figure 3B). In the middle of the page, an interactive visualization of three supporting features (sequence length, MLCDS start and MLCDS end) is provided. In addition, the sequence detail of L1CAM is shown in the middle of the page (Figure 3C). In the CNIT Score Detail Plot, the red line represents the correct transcriptional reading frame and the other colored lines represent the remaining reading frames, as in the identification result for the human coding gene L1CAM (Figure 4A). By contrast, CNIT analysis of the human non-coding transcript HOX transcript antisense RNA, transcript variant 2 (HOTAIR: NR_003716) (20), did not identify a coding sequence (CNIT Score = −0.31, Figure 4B). At the bottom of the ‘Details’ page, users can BLAST their sequence against the NONCODE database directly.

Figure 4.

Examples of CNIT analysis of transcripts for coding RNA L1CAM (A) and non-coding RNA HOTAIR (B). The CNIT score distribution of the six reading frames for each transcript is shown on the left y-axis, and sequence length, normalized to nucleotide triplets, is shown on the x-axis. The red line represents the correct transcription reading frame and the other five lines (blue) represent the other five reading frames.

SUMMARY

Non-coding RNAs have emerged as major components of the eukaryotic transcriptome. Genome-wide analyses have revealed the existence of thousands of long non-coding RNAs (lncRNAs) in several species, and a growing number of lncRNAs have been implicated in human disease (21–23) and in plant growth and breeding (4,24–26). Despite the increasing amount of high-throughput data produced by next-generation sequencing, classifying transcripts as protein-coding or non-coding remains a challenge, especially for poorly annotated species. In particular, software for annotating plant transcripts is scarce or of low accuracy. CNCI, published in 2013, is widely used by researchers worldwide and has been cited more than 200 times (Web of Science) in the past 5 years (10). To better serve researchers and make it easier for users to distinguish transcripts, we updated our CNCI algorithm to CNIT. Because CNIT, like CNCI, classifies protein-coding and non-coding RNAs solely on the basis of intrinsic sequence composition, it is particularly well suited to transcriptome analysis of poorly studied species, offering high accuracy, robustness and consistency, and it helps researchers validate coding or non-coding hypotheses for further functional studies. Moreover, CNIT runs ∼200 times faster than CNCI and exhibits higher accuracy, especially for plants (0.98 versus 0.94 in humans, 0.95 versus 0.93 in mice, 0.93 versus 0.92 in zebrafish, 0.93 versus 0.92 in the fruit fly, 0.92 versus 0.88 in worms, and 0.98 versus 0.85 in Arabidopsis). CNIT can be further applied to species with incomplete genome annotations, such as Artemisia annua (Qing Hao), Astragalus membranaceus (Huang Qi) and Ginseng (Ren Shen). Moreover, we constructed a user-friendly web server that is freely available at http://cnit.noncode.org/CNIT/. As a result, users can run this online tool in batches or single sessions rather than only on a Linux system. Thus, CNIT is a handy and useful tool, not only for predicting whether sequences generated by high-throughput sequencing are protein-coding or non-coding, but also for analyzing sequence features across species, as either an online or an offline tool.

26 in total

Review 1. Non-coding RNA genes and the modern RNA world.

Authors: S R Eddy
Journal: Nat Rev Genet Date: 2001-12 Impact factor: 53.242

Review 2. HOTAIR lifts noncoding RNAs to new levels.

Authors: Caroline J Woo; Robert E Kingston
Journal: Cell Date: 2007-06-29 Impact factor: 41.582

3. A ceRNA hypothesis: the Rosetta Stone of a hidden RNA language?

Authors: Leonardo Salmena; Laura Poliseno; Yvonne Tay; Lev Kats; Pier Paolo Pandolfi
Journal: Cell Date: 2011-07-28 Impact factor: 41.582

4. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes.

Authors: Nicholas T Ingolia; Gloria A Brar; Noam Stern-Ginossar; Michael S Harris; Gaëlle J S Talhouarne; Sarah E Jackson; Mark R Wills; Jonathan S Weissman
Journal: Cell Rep Date: 2014-08-21 Impact factor: 9.423

5. ncFANs: a web server for functional annotation of long non-coding RNAs.

Authors: Qi Liao; Hui Xiao; Dechao Bu; Chaoyong Xie; Ruoyu Miao; Haitao Luo; Guoguang Zhao; Kuntao Yu; Haitao Zhao; Geir Skogerbø; Runsheng Chen; Zhongdao Wu; Changning Liu; Yi Zhao
Journal: Nucleic Acids Res Date: 2011-07 Impact factor: 16.971

6. PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme.

Authors: Aimin Li; Junying Zhang; Zhongyin Zhou
Journal: BMC Bioinformatics Date: 2014-09-19 Impact factor: 3.169

7. Long non-coding RNAs function annotation: a global prediction method based on bi-colored networks.

Authors: Xingli Guo; Lin Gao; Qi Liao; Hui Xiao; Xiaoke Ma; Xiaofei Yang; Haitao Luo; Guoguang Zhao; Dechao Bu; Fei Jiao; Qixiang Shao; RunSheng Chen; Yi Zhao
Journal: Nucleic Acids Res Date: 2012-11-05 Impact factor: 16.971

8. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model.

Authors: Liguo Wang; Hyun Jung Park; Surendra Dasari; Shengqin Wang; Jean-Pierre Kocher; Wei Li
Journal: Nucleic Acids Res Date: 2013-01-17 Impact factor: 16.971

9. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts.

Authors: Liang Sun; Haitao Luo; Dechao Bu; Guoguang Zhao; Kuntao Yu; Changhai Zhang; Yuanning Liu; Runsheng Chen; Yi Zhao
Journal: Nucleic Acids Res Date: 2013-07-27 Impact factor: 16.971

Review 10. A perspective on mammalian upstream open reading frame function.

Authors: Joanna Somers; Tuija Pöyry; Anne E Willis
Journal: Int J Biochem Cell Biol Date: 2013-04-25 Impact factor: 5.085
