
CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine.

Lei Kong, Yong Zhang, Zhi-Qiang Ye, Xiao-Qiao Liu, Shu-Qi Zhao, Liping Wei, Ge Gao.

Abstract

Recent transcriptome studies have revealed that a large number of transcripts in mammals and other organisms do not encode proteins but function as noncoding RNAs (ncRNAs) instead. As millions of transcripts are generated by large-scale cDNA and EST sequencing projects every year, there is a need for automatic methods to distinguish protein-coding RNAs from noncoding RNAs accurately and quickly. We developed a support vector machine-based classifier, named Coding Potential Calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features. Tenfold cross-validation on the training dataset and further testing on several large datasets showed that CPC can discriminate coding from noncoding transcripts with high accuracy. Furthermore, CPC also runs an order-of-magnitude faster than a previous state-of-the-art tool and has higher accuracy. We developed a user-friendly web-based interface of CPC at http://cpc.cbi.pku.edu.cn. In addition to predicting the coding potential of the input transcripts, the CPC web server also graphically displays detailed sequence features and additional annotations of the transcript that may facilitate users' further investigation.


Year: 2007 PMID: 17631615 PMCID: PMC1933232 DOI: 10.1093/nar/gkm391

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Recent transcriptome studies have revealed that a large number of transcripts in mammals and other organisms do not encode proteins but function as noncoding RNAs (ncRNAs) instead. In vivo experiments have demonstrated important biological roles of noncoding RNAs, including regulation of transcription and translation, RNA modification and epigenetic modification of chromatin structure (1–3). There is immense interest within the biological community in identifying and studying new noncoding RNAs. As millions of transcripts are generated by large-scale cDNA and EST sequencing projects every year, there is a need for automatic methods to accurately and quickly distinguish protein-coding RNAs from noncoding RNAs. Since to date no web server and few standalone tools have been designed for this purpose, researchers have sometimes used tools developed for other purposes, such as cDNA annotation and functional domain identification (4–12). However, these methods showed varied performance on different datasets (12,13). Recently, a new algorithm and standalone software named CONC was published that classifies transcripts as ‘coding’ or ‘noncoding’ using machine learning methods (13). CONC showed improved performance over previous tools such as ESTScan (6). However, CONC is slow for large datasets and does not have a web-server interface, limiting its usefulness. It works well with high-quality transcripts but may suffer from errors such as frameshifts, which are common in ESTs and even occur occasionally in full-length cDNAs (11). Furthermore, CONC only outputs the ‘coding’/‘noncoding’ classification and does not provide an explanation or related information. New tools are desired that are more accurate, run faster, and have a more user-friendly web-based interface.

METHODS

To assess a transcript's coding potential, we extract six features from the transcript's nucleotide sequence. A true protein-coding transcript is more likely to have a long and high-quality Open Reading Frame (ORF) compared with a non-coding transcript. Thus, our first three features assess the extent and quality of the ORF in a transcript. We use the framefinder software (14) to identify the longest reading frame in the three forward frames. Known for its error tolerance, framefinder can identify most correct ORFs even when the input transcripts contain sequencing errors such as point mutations, indels and truncations (14,15). We extract the LOG-ODDS SCORE and the COVERAGE OF THE PREDICTED ORF as the first two features by parsing the framefinder raw output with Perl scripts (available for download from the web site). The LOG-ODDS SCORE is an indicator of the quality of a predicted ORF: the higher the score, the higher the quality. A large COVERAGE OF THE PREDICTED ORF is also an indicator of good ORF quality (14). We add a third, binary feature, the INTEGRITY OF THE PREDICTED ORF, which indicates whether an ORF begins with a start codon and ends with an in-frame stop codon.

The large and rapidly growing protein sequence databases provide a wealth of information for the identification of protein-coding transcripts. We derive another three features by parsing the output of a BLASTX (16) search (using the transcript as query, E-value cutoff 1e-10) against UniProt Reference Clusters (UniRef90), which was developed as a nonredundant protein database with a 90% sequence identity threshold (17). First, a true protein-coding transcript is likely to have more hits with known proteins than a non-coding transcript does. Thus we extract the NUMBER OF HITS as a feature. Second, for a true protein-coding transcript the hits are also likely to have higher quality; i.e. the HSPs (High-scoring Segment Pairs) overall tend to have lower E-values.
Thus we define the feature HIT SCORE as follows:

$$S_i = \mathrm{mean}_j \{-\log_{10} E_{ij}\}, \quad i \in \{0, 1, 2\}$$

$$\text{HIT SCORE} = \mathrm{mean}_{i \in \{0,1,2\}} \{S_i\} = \frac{1}{3} \sum_{i=0}^{2} S_i$$

where $E_{ij}$ is the E-value of the j-th HSP in frame i, $S_i$ measures the average quality of the HSPs in frame i and HIT SCORE is the average of $S_i$ across the three frames. The higher the HIT SCORE, the better the overall quality of the hits and the more likely the transcript is protein-coding. Third, for a true protein-coding transcript most of the hits are likely to reside within one frame, whereas for a true non-coding transcript, even if it matches certain known protein sequence segments by chance, these chance hits are likely to scatter across any of the three frames. Thus, we define the feature FRAME SCORE to measure the distribution of the HSPs among the three reading frames:

$$\text{FRAME SCORE} = \mathrm{variance}_{i \in \{0,1,2\}} \{S_i\} = \frac{1}{2} \sum_{i=0}^{2} (S_i - \bar{S})^2$$

where $\bar{S}$ is the mean of $S_i$ over the three frames. The higher the FRAME SCORE, the more concentrated the hits are and the more likely the transcript is protein-coding.

We incorporate these six features into a support vector machine (SVM) classifier (18). Mapping the input features onto a high-dimensional feature space via a proper kernel function, SVM constructs a classification hyper-plane (maximum margin hyper-plane) to separate the transformed data (18). Known for its high accuracy and good performance, SVM is a widely used classification tool in bioinformatics analyses such as microarray-based cancer classification (19,20), prediction of protein function (21,22) and prediction of subcellular localization (23,24). We employed the LIBSVM package (25) to train an SVM model using the standard radial basis function (RBF) kernel. The C and gamma parameters were determined by grid search on the training dataset. We trained the SVM model using the same training dataset as CONC (13), containing 5610 protein-coding cDNAs and 2670 noncoding RNAs.
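The HIT SCORE and FRAME SCORE features can be illustrated with a minimal Python sketch. It assumes that S for each frame is the mean of −log10 of the HSP E-values in that frame and that FRAME SCORE is the sample variance of the three per-frame values. The function name `blastx_scores`, the `E_FLOOR` constant and the convention that a frame without HSPs scores 0 are hypothetical choices made for this sketch, not part of the CPC code.

```python
import math
from statistics import mean, variance

# Hypothetical floor so that an E-value of exactly 0 does not give an infinite score.
E_FLOOR = 1e-250


def blastx_scores(hsps):
    """Compute HIT SCORE and FRAME SCORE from (frame, e_value) pairs.

    `frame` is 0, 1 or 2 (the three forward reading frames).
    """
    per_frame = {0: [], 1: [], 2: []}
    for frame, e_value in hsps:
        per_frame[frame].append(-math.log10(max(e_value, E_FLOOR)))
    # S_i: mean of -log10(E) in frame i; an empty frame scores 0 (assumption).
    s = [mean(values) if values else 0.0 for values in per_frame.values()]
    return {
        "hit_score": mean(s),        # average of S_i over the three frames
        "frame_score": variance(s),  # sample variance, divisor n - 1 = 2
    }


# Hits concentrated in one frame versus hits scattered over all three frames.
concentrated = blastx_scores([(0, 1e-50), (0, 1e-40), (0, 1e-30)])
scattered = blastx_scores([(0, 1e-12), (1, 1e-12), (2, 1e-12)])
print(concentrated)
print(scattered)
```

In this toy input the concentrated transcript gets a large FRAME SCORE while the scattered one gets a FRAME SCORE of essentially zero, which matches the intuition that chance hits spread across frames.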

EVALUATION

We evaluated our method, named Coding Potential Calculator (CPC), by 10-fold cross-validation on the training dataset. The accuracy was 95.77%. For further evaluation we tested CPC on three large datasets: two non-coding RNA datasets from the Rfam 7.0 database (26) and the RNAdb databank (27), respectively, and a protein-coding RNA dataset from the EMBL nucleotide databank based on cross-links to the UniProt/SwissProt protein knowledgebase (17,28). We recorded the accuracy and computation time of CPC in Table 1, and compared them with those of CONC (version 1.01 downloaded from the authors’ website http://cubic.bioc.columbia.edu/~liu/conc/ and installed locally). Both CPC and CONC were run on a Linux machine with a 3.0 GHz Intel Xeon CPU and 4 GB of RAM. Overall, CPC showed better accuracy on all three datasets and ran an order of magnitude faster (Table 1). For more stringent evaluations we removed those sequences in the three test datasets that were similar to one or more sequences in the training set (BLASTN E-value cutoff 1e-2) and tested CPC on the remaining sequences. We also tested CPC on new entries in the latest UniRef90 release (version 10.1) which were not included in the previous release used to train CPC (version 9.4). In both cases the accuracy of CPC remained high (see section ‘More Stringent Evaluation’ and Table S1 in Supplementary Data).
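The 10-fold cross-validation procedure can be sketched generically as follows. This is only an illustration of the evaluation scheme: the helper `k_fold_accuracy` and the toy `midpoint_classifier` are hypothetical stand-ins, and the classifier is a simple one-dimensional threshold rule rather than the LIBSVM model used by CPC.

```python
import random


def k_fold_accuracy(examples, labels, train_and_predict, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    `train_and_predict(train_x, train_y, test_x)` must return one predicted
    label per element of `test_x`.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint held-out sets
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        predictions = train_and_predict(
            [examples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [examples[i] for i in held_out],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(examples)


def midpoint_classifier(train_x, train_y, test_x):
    """Toy stand-in: threshold halfway between the two class means."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [1 if x > threshold else 0 for x in test_x]


# Two well-separated toy classes (0 = noncoding, 1 = coding).
xs = [float(v) for v in list(range(20)) + list(range(100, 120))]
ys = [0] * 20 + [1] * 20
accuracy = k_fold_accuracy(xs, ys, midpoint_classifier)
print(f"10-fold accuracy on toy data: {accuracy:.2%}")
```

Each example is held out exactly once, so the reported accuracy is the fraction of all examples classified correctly by a model that never saw them during training.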

Table 1.

Evaluation of accuracy and CPU time of CPC and CONC on three datasets

Dataset	Dataset type	Dataset size^a	Accuracy (CPC)	Accuracy (CONC)	Time in min (CPC)	Time in min (CONC)
Rfam	Noncoding	30 770	98.62%	97.12%	3513	46 376
RNADB	Noncoding	3996	91.50%	85.44%	598	7322
Embl cds	Coding	121 914	99.08%	98.70%	69 116	826 210^b

^a CONC focuses on sequences with at least 80 nucleotides and assumes shorter sequences are unlikely to have coding potential. CPC does not make this assumption and has similar performance on shorter sequences, but to make a direct comparison we show results here only for sequences with at least 80 nucleotides.

^b Because the required CPU time is long, the dataset was split and run on 24 nodes in parallel. The reported CPU time is the sum of the execution times on the individual nodes.
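As a quick check of the "order-of-magnitude" speed claim, the CONC/CPC time ratios can be computed directly from the CPU times in Table 1. The variable names are chosen for this sketch only.

```python
# (dataset, CPC minutes, CONC minutes) copied from Table 1.
table1_minutes = [
    ("Rfam", 3513, 46376),
    ("RNADB", 598, 7322),
    ("Embl cds", 69116, 826210),
]

# Ratio of CONC time to CPC time for each dataset.
speedups = {name: conc / cpc for name, cpc, conc in table1_minutes}

for name, ratio in speedups.items():
    print(f"{name}: CONC/CPC time ratio = {ratio:.1f}")
```

On all three datasets the ratio is roughly 12 to 13, i.e. CPC needs about one-twelfth of the CPU time of CONC.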

We then compared CPC with other prediction algorithms following the evaluation strategy proposed by Frith et al. (12). The results showed that CPC had the highest consistency with expert curation and performed well for the six challenging cases hand-picked by Frith et al. (12) (see section ‘Comparison with other protein-prediction algorithms following Frith et al.’ and Table S2 in Supplementary Data). CPC also correctly predicted 92% of the 2,849 short peptides with fewer than 100 amino acids (see section ‘Performance on Short Peptides’ in Supplementary Data).

WEB SERVER

We developed a user-friendly web interface for CPC (http://cpc.cbi.pku.edu.cn). The CPC web server accepts a set of nucleotide sequences in FASTA format as input (allowing the symbols ‘A’, ‘T’, ‘G’, ‘C’, and ‘U’). The sequences can be pasted directly into the input box or uploaded from a local sequence file. By default, the CPC server runs in ‘interactive mode’, in which results are shown in the browser once the computation is finished. For a large set of sequences the user can enter an email address to run the job in ‘batch mode’. The server will send a notice to the user's mailbox upon completion of the job. A unique ‘Task ID’ (TID) is assigned to each job by the web server. Users can use the TID to track the job progress and retrieve the results, which are saved on the server for 1 week. CPC summarizes the main output in a table (Figure 1a). Each row corresponds to one input sequence. The columns show the sequence ID, the coding/noncoding classification, the SVM score (the ‘distance’ to the SVM classification hyper-plane in the feature space), and a ‘Details’ link (as described later). In general, the farther away the score is from zero, the more reliable the prediction is. As a rule of thumb from our experience, transcripts with a score between −1 and 1 are marked as ‘weak noncoding’ or ‘weak coding’. Results in the summary table can be sorted interactively by sequence ID, coding/noncoding classification, and SVM score; they can also be filtered by coding/noncoding classification and SVM score.
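The rule of thumb for interpreting the SVM score can be written as a small helper. This is a sketch under two assumptions that the text does not state explicitly: positive scores correspond to ‘coding’ and negative scores to ‘noncoding’, and a score of exactly zero is treated as ‘weak noncoding’. The function name `label_svm_score` and the `weak_band` parameter are hypothetical.

```python
def label_svm_score(score, weak_band=1.0):
    """Map an SVM score to a coding/noncoding label.

    Assumes score > 0 means 'coding' (an assumption of this sketch); scores
    strictly between -weak_band and weak_band are flagged as 'weak'.
    """
    side = "coding" if score > 0 else "noncoding"
    return f"weak {side}" if abs(score) < weak_band else side


for score in (2.5, 0.4, -0.4, -3.0):
    print(f"{score:+.1f} -> {label_svm_score(score)}")
```

Keeping the band width as a parameter makes it easy to apply a stricter or looser reliability threshold when filtering a large result table.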

Figure 1.

Screenshots of CPC output. (a) Results are summarized in a ‘Table View’. (b) Sequence features and additional annotations of an input transcript are shown in an Evidence page. Users can mouse over or click to see more details.

The current version of CPC cannot accurately discriminate transcripts falling entirely within UTR regions from true non-coding transcripts, because neither of them produces amino acid sequences. To handle this limitation, CPC gives users the option to search a database of known UTR sequences, UTRdb (32), using BLAST (see section ‘Recognizing Potential UTR regions’ and Figure S1 in Supplementary Data).

To ‘explain’ why a transcript is classified as coding or noncoding, the CPC server provides detailed supporting evidence and other related sequence features of the input transcript in an Evidence page (Figure 1b). The Evidence page shows the six features of the transcript, color coded for better visualization. It shows graphically the putative ORF identified by framefinder and the BLASTX hits. By mousing over them, users can view details of each ORF and BLASTX hit. The Evidence page also provides options for querying the input transcript against well-annotated databases, such as the functional domain databases Pfam (29), SMART (30) and SuperFamily (31), the UTR database UTRdb (32) and the ncRNA database RNAdb (27). The Evidence page aims to facilitate the user's detailed investigation of the transcript.

We developed the CPC web server on a Java platform using JSP to render the dynamic HTML pages and Apache/Tomcat as the J2EE container. The web site complies with the W3C XHTML 1.0 Strict specification and works in both the Microsoft Internet Explorer and Mozilla Firefox browsers. A standalone version of the software is freely available for download on the web site, distributed under the GNU GPL. A parallel version with simple distributed computing support is available upon request.

DISCUSSION

With the rapidly increasing amount of data generated by large-scale transcriptome sequencing and intensifying attention on the study of noncoding RNAs, methods that can discriminate noncoding RNAs from protein-coding ones with high reliability and speed are important. Integrating multiple sequence features with biological significance, CPC is shown to have good accuracy in both cross-validation and several test datasets. It also runs an order of magnitude faster than the previous state-of-the-art tool, and thus is more suitable for high-throughput analysis. CPC uses far fewer features than CONC does (6 versus 180) but achieved comparable or better performance in the evaluation. The results demonstrate that the sequence features used by CPC have strong discriminating power and may reflect the intrinsic properties of coding transcripts. Using fewer, sequence-based features also significantly reduced the computing cost, thus removing a hurdle to developing a web server. Additional information such as potential functional domains and similarity to known UTR regions or ncRNAs is useful to users. This and other supplementary information is available in the Evidence pages of CPC, making the results of CPC more easily interpretable and biologically meaningful.

SUPPLEMENTARY DATA

Supplementary data are available at NAR Online.

28 in total

1. Knowledge-based analysis of microarray gene expression data by using support vector machines.

Authors: M P Brown; W N Grundy; D Lin; N Cristianini; C W Sugnet; T S Furey; M Ares; D Haussler
Journal: Proc Natl Acad Sci U S A Date: 2000-01-04 Impact factor: 11.205

2. DIANA-EST: a statistical analysis.

Authors: A G Hatzigeorgiou; P Fiziev; M Reczko
Journal: Bioinformatics Date: 2001-10 Impact factor: 6.937

Review 3. Non-coding RNA genes and the modern RNA world.

Authors: S R Eddy
Journal: Nat Rev Genet Date: 2001-12 Impact factor: 53.242

4. The SUPERFAMILY database in 2004: additions and improvements.

Authors: Martin Madera; Christine Vogel; Sarah K Kummerfeld; Cyrus Chothia; Julian Gough
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia.

Authors: Piero Carninci; Kazunori Waki; Toshiyuki Shiraki; Hideaki Konno; Kazuhiro Shibata; Masayoshi Itoh; Katsunori Aizawa; Takahiro Arakawa; Yoshiyuki Ishii; Daisuke Sasaki; Hidemasa Bono; Shinji Kondo; Yuichi Sugahara; Rintaro Saito; Naoki Osato; Shiro Fukuda; Kenjiro Sato; Akira Watahiki; Tomoko Hirozane-Kishikawa; Mari Nakamura; Yuko Shibata; Ayako Yasunishi; Noriko Kikuchi; Atsushi Yoshiki; Moriaki Kusakabe; Stefano Gustincich; Kirk Beisel; William Pavan; Vassilis Aidinis; Akira Nakagawara; William A Held; Hiroo Iwata; Tomohiro Kono; Hiromitsu Nakauchi; Paul Lyons; Christine Wells; David A Hume; Michela Fagiolini; Takao K Hensch; Michelle Brinkmeier; Sally Camper; Junji Hirota; Peter Mombaerts; Masami Muramatsu; Yasushi Okazaki; Jun Kawai; Yoshihide Hayashizaki
Journal: Genome Res Date: 2003-06 Impact factor: 9.043

6. CDS annotation in full-length cDNA sequence.

Authors: Masaaki Furuno; Takeya Kasukawa; Rintaro Saito; Jun Adachi; Harukazu Suzuki; Richard Baldarelli; Yoshihide Hayashizaki; Yasushi Okazaki
Journal: Genome Res Date: 2003-06 Impact factor: 9.043

7. Modeling sequencing errors by combining Hidden Markov models.

Authors: C Lottaz; C Iseli; C V Jongeneel; P Bucher
Journal: Bioinformatics Date: 2003-10 Impact factor: 6.937

Review 8. RNA regulation: a new genetics?

Authors: John S Mattick
Journal: Nat Rev Genet Date: 2004-04 Impact factor: 53.242

9. Discrimination of non-protein-coding transcripts from protein-coding mRNA.

Authors: Martin C Frith; Timothy L Bailey; Takeya Kasukawa; Flavio Mignone; Sarah K Kummerfeld; Martin Madera; Sirisha Sunkara; Masaaki Furuno; Carol J Bult; John Quackenbush; Chikatoshi Kai; Jun Kawai; Piero Carninci; Yoshihide Hayashizaki; Graziano Pesole; John S Mattick
Journal: RNA Biol Date: 2006-04-03 Impact factor: 4.652

10. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
