Liang Sun, Haitao Luo, Dechao Bu, Guoguang Zhao, Kuntao Yu, Changhai Zhang, Yuanning Liu, Runsheng Chen, Yi Zhao.
Abstract
Classifying transcripts as protein-coding or non-coding is a challenge, especially for transcripts reconstructed from high-throughput sequencing data of poorly annotated species. This study developed and evaluated a powerful signature tool, the Coding-Non-Coding Index (CNCI), which profiles adjoining nucleotide triplets to distinguish protein-coding from non-coding sequences independently of known annotations. CNCI is effective for classifying incomplete transcripts and sense-antisense pairs. Applied in a cross-species manner, CNCI offered highly accurate classification of transcripts assembled from whole-transcriptome sequencing data, demonstrated evolutionary divergence of genes between vertebrates and invertebrates, or among plants, and provided a long non-coding RNA catalog for orangutan. CNCI software is available at http://www.bioinfo.org/software/cnci.
Year: 2013 PMID: 23892401 PMCID: PMC3783192 DOI: 10.1093/nar/gkt646
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Illustration of the ANT score matrix and the CNCI framework. The score of each ANT is calculated for human (a) or mouse (b). The three black rows or columns represent the three stop codons UAA, UAG and UGA (corresponding to ATT, ATC and ACT in the cDNA sequence, respectively), which show low frequency in protein-coding sequences. (c) The framework of CNCI. The top panel shows how a sequence in a testing set is processed. For a given sequence, six MLCDS regions (represented by six lines) are identified from the six reading frames (represented by six colored arrows) using a sliding window and a dynamic programming algorithm. The MLCDS region with the maximal S-score is then selected and passed to an SVM. The bottom panel shows the training and classification process. Reliable protein-coding and non-coding sequences are used as a training set, and five features are extracted to train the SVM, which classifies the input sequence as protein-coding or non-coding.
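The top panel of Figure 1c can be illustrated with a minimal Python sketch. It is not the CNCI implementation: the function names (`six_frames`, `best_segment`, `most_like_cds`), the `ant_score` dictionary and its toy values are all hypothetical, and a plain maximum-sum-subarray scan (Kadane's algorithm) stands in for the sliding-window and dynamic-programming step that the caption mentions but does not specify. The SVM stage and its five features are not covered.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(seq: str) -> dict:
    """Split a sequence into triplets for 3 forward and 3 reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset}"] = [
                s[i:i + 3] for i in range(offset, len(s) - 2, 3)
            ]
    return frames


def best_segment(codons, ant_score, default=0.0):
    """Highest-scoring run of adjoining triplet pairs (Kadane's algorithm).

    ant_score maps (triplet, next_triplet) -> score; this lookup table is an
    assumption standing in for the ANT score matrix of Figure 1a/b.
    Returns (score, start_triplet_index, end_triplet_index_exclusive).
    """
    best = (0.0, 0, 0)
    run, run_start = 0.0, 0
    for i, pair in enumerate(zip(codons, codons[1:])):
        s = ant_score.get(pair, default)
        if run <= 0:
            run, run_start = s, i      # start a new candidate segment
        else:
            run += s                   # extend the current segment
        if run > best[0]:
            best = (run, run_start, i + 2)
    return best


def most_like_cds(seq, ant_score, default=0.0):
    """Pick the frame whose best segment has the maximal summed score."""
    results = {
        label: best_segment(codons, ant_score, default)
        for label, codons in six_frames(seq).items()
    }
    label = max(results, key=lambda k: results[k][0])
    return label, results[label]


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    toy = {("ATG", "GCC"): 2.0, ("GCC", "GCC"): 1.0, ("GCC", "TAA"): -5.0}
    print(most_like_cds("ATGGCCGCCTAA", toy, default=-1.0))
```

In this toy run the forward frame with offset 0 wins, and the selected segment stops before the pair that ends in the stop-codon triplet, because that pair carries a strongly negative score.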
Figure 2. CNCI performance. (a) The top panel shows the ANT score distribution (left y-axis) of the six reading frames for each protein-coding transcript, with transcript length normalized to 1100 nucleotide triplets on the x-axis. The red line represents the correct transcriptional reading frame, and the other five lines (blue or green) represent the other five reading frames. The green line indicates the distribution of the coverage (right y-axis) of the MLCDS region for each protein-coding transcript across the normalized length. The three regions marked in blue, yellow and green indicate the mean length of 3′UTR (6%), CDS (56.6%) and 5′UTR (37.4%), respectively, across the normalized length. The bottom panel shows an example gene, NM_021222. (b) ROC analyses of CNCI, CPC and phyloCSF. The MAE, denoted by solid squares, is 0.05, 0.11 and 0.28, respectively. (c) The accuracy of CNCI, CPC and phyloCSF in classifying lincRNAs of different lengths. (d) The ROC curves and taxonomic tree of 12 species. The minimum error rate is given after the name of each species.
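As a rough illustration of how an ROC curve and a minimum error rate can be derived from classifier scores, the following Python sketch sweeps every observed score as a threshold. It rests on assumptions: "minimum error rate" is taken to mean the lowest fraction of misclassified sequences over all thresholds, which the caption does not define, and the function names and toy data are hypothetical.

```python
def roc_points(scores, labels):
    """Sweep thresholds and return (threshold, FPR, TPR, error_rate) tuples.

    A sequence is called coding (label 1) when its score >= threshold.
    Requires at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = pos - tp
        points.append((t, fp / neg, tp / pos, (fp + fn) / len(labels)))
    return points


def minimum_error(scores, labels):
    """Return the ROC point with the lowest overall error rate."""
    return min(roc_points(scores, labels), key=lambda r: r[3])


if __name__ == "__main__":
    # Toy scores and labels, invented for illustration only.
    print(minimum_error([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))
```

Each tuple gives one point on the ROC curve (FPR on the x-axis, TPR on the y-axis), so the same sweep can be used both to draw the curve and to mark its minimum-error point.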