Aimin Li, Junying Zhang, Zhongyin Zhou.
Abstract
BACKGROUND: High-throughput transcriptome sequencing (RNA-seq) technology promises the discovery of novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that do not depend on prior gene annotations, genomic sequences, or high-quality sequencing.
Year: 2014 PMID: 25239089 PMCID: PMC4177586 DOI: 10.1186/1471-2105-15-311
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Data sources and performance of cross-species prediction
| Species | Data source | Number of transcripts | Accuracy of CNCI | Accuracy of PLEK |
|---|---|---|---|---|
| | RefSeq mRNA | 26062 | | 88.1% |
| | Ensembl ncRNA | 2963 | | 89.9% |
| | RefSeq mRNA | 14493 | | 91.3% |
| | Ensembl ncRNA | 419 | 89.3% | |
| | RefSeq mRNA | 8874 | 92.9% | |
| | Ensembl ncRNA | 279* | 99.7% | |
| | RefSeq mRNA | 13190 | 94.3% | |
| | Ensembl ncRNA | 182 | | 99.5% |
| | RefSeq mRNA | 1906 | | 87.1% |
| | Ensembl ncRNA | 1166 | | 99.9% |
| | RefSeq mRNA | 3978 | | 85.1% |
| | Ensembl ncRNA | 241 | 95.9% | |
| | RefSeq mRNA | 5709 | | 85.0% |
| | Ensembl ncRNA | 359 | 99.7% | |
| | RefSeq mRNA | 33025 | | 83.8% |
| | Ensembl ncRNA | 367 | | |
| | RefSeq mRNA | 3401 | 93.4% | |
| | Ensembl ncRNA | 392 | 99.8% | |
PLEK and CNCI were tested on the same data; better accuracies are shown in bold face type. For RefSeq mRNAs, those with ‘putative’, ‘predicted’ or ‘pseudogene’ annotations were excluded (except for Gorilla gorilla).
*279 non-coding transcripts with lengths of more than 150 nt.
Figure 1. Comparison of robustness towards indel sequencing errors. The x-axis is the number of indels per 100 bases (the indel sequencing error rate). Performance (accuracy) of CNCI declines significantly as the indel error rate increases.
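To make the x-axis of Figure 1 concrete, the sketch below injects indels into a sequence at a chosen rate per 100 bases. It is only an illustration: the record does not say how the errors were simulated, so the per-base model, the 50/50 split between insertions and deletions, and the function name `add_indels` are all assumptions.

```python
import random


def add_indels(seq, indels_per_100, rng):
    """Return a copy of seq with random single-base indels.

    Assumption: each base independently triggers an indel with
    probability indels_per_100 / 100, and an indel is an insertion
    or a deletion with equal probability.
    """
    p = indels_per_100 / 100.0
    out = []
    for base in seq:
        if rng.random() < p:
            if rng.random() < 0.5:
                # Insertion: keep the base and add one random base after it.
                out.append(base)
                out.append(rng.choice("ACGT"))
            # Otherwise deletion: the base is dropped.
        else:
            out.append(base)
    return "".join(out)


clean = "ACGT" * 25                       # 100-base toy transcript
noisy = add_indels(clean, 3, random.Random(0))
```

Because an indel shifts the reading frame downstream of the error, a sketch like this is a convenient way to probe how sensitive a frame-dependent classifier is to such errors.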
Performances on transcripts derived from PacBio and 454
| Dataset | Tool | Sensitivity | Specificity | PPV | NPV | Accuracy |
|---|---|---|---|---|---|---|
| MCF-7 (PacBio) | PLEK | 0.947 | | | 0.407 | 0.947 |
| | CPC | | | 0.970 | | |
| | CNCI | 0.918 | 0.787 | 0.991 | | 0.913 |
| HelaS3 (454) | PLEK | 0.955 | | | 0.262 | 0.954 |
| | CPC | | | 0.991 | | |
| | CNCI | 0.939 | 0.811 | 0.997 | | 0.937 |
Bold face type indicates the best performances (sensitivity, specificity, PPV, NPV, accuracy) among PLEK, CPC and CNCI. Italic face type indicates the worst specificity and NPV among these tools.
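The five measures in this table follow from a binary confusion matrix. The sketch below uses the standard definitions; which class counts as "positive" is not stated in this record, so treating protein-coding transcripts as the positive class is an assumption, and the counts are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # positives predicted positive (assumed: coding called coding)
    fn: int  # positives predicted negative
    tn: int  # negatives predicted negative
    fp: int  # negatives predicted positive


def metrics(c):
    """Standard binary classification measures from a confusion matrix."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp),
    }


# Made-up counts for an imbalanced set: 100 positives, 10 negatives.
example = metrics(Confusion(tp=90, fn=10, tn=8, fp=2))
```

With these invented counts the NPV is 8/18 even though accuracy is 98/110, which shows how a low NPV can coexist with high accuracy when negatives are rare.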
Figure 2. Results of PLEK, CPC, CNCI and PhyloCSF on mouse datasets. (A) The fraction of protein-coding transcripts classified as coding or non-coding. (B) The fraction of non-coding transcripts classified as coding or non-coding. Data were collected from RefSeq mouse protein-coding transcripts (release 60) and GENCODE mouse long non-coding transcripts (vM2). Shown is the fraction of transcripts classified as coding or non-coding by each tool. All these tools performed well on protein-coding transcripts. PLEK and CNCI outperformed CPC and PhyloCSF on long non-coding transcripts.
Comparison of computational performances of PLEK, CNCI, CPC and PhyloCSF
| Performance | PLEK | CNCI | CPC | PhyloCSF |
|---|---|---|---|---|
| Run time^a (seconds) | 128 | 1048 | 31247 | 181925^e |
| Multi-threading^b | Yes | Yes | No^d | No |
| Online running^c | No | No | Yes | No |
Computational time was tested on 1,000 human mRNA transcripts and 1,000 human lncRNA transcripts.
^a Computation time consumed when run in a single-threaded manner.
^b Can the software tool run in a multi-threaded manner?
^c Does the software tool provide a website for users to run online?
^d CPC improves its computational performance using a page-cache method on its website.
^e BED files were loaded onto the Galaxy webserver (http://galaxy.nbic.nl/) and the tool ‘Stitch Gene Blocks’ was used to retrieve multiple alignment files with sequence entries for the following genome builds based on the 10-way Multiz alignment to hg19: hg19, panTro2, tarSyr1, micMur1, otoGar1 and rheMac2. PhyloCSF was run using the options: --removeRefGaps.
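The run times in the table can be turned into relative speed factors with simple arithmetic. The snippet below does only that, using the four single-threaded timings and the 2,000-transcript test set reported above; the variable names are illustrative.

```python
# Single-threaded run times in seconds, copied from the table above.
run_time_s = {"PLEK": 128, "CNCI": 1048, "CPC": 31247, "PhyloCSF": 181925}

# How many times longer each tool took than PLEK on the same 2,000 transcripts.
slowdown_vs_plek = {tool: t / run_time_s["PLEK"] for tool, t in run_time_s.items()}

# Average seconds per transcript (1,000 mRNAs + 1,000 lncRNAs).
per_transcript_s = {tool: t / 2000 for tool, t in run_time_s.items()}
```

On these figures CNCI took roughly 8 times as long as PLEK, CPC roughly 244 times, and PhyloCSF roughly 1,421 times.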
Figure 3. Performance comparison of various ranges of k. On the x-axis, ‘5’ means that k ranged from 1 to 5. Training data comprised 22,389 human RefSeq mRNA transcripts and 22,389 GENCODE lncRNA transcripts. SVM classifiers were trained using 10-fold cross-validation on the training datasets. The figure indicates that both the computational load and the accuracy increase as k increases.
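As a rough illustration of why the computational load grows with k, the sketch below builds a k-mer frequency vector for k from 1 to 5 using a sliding window of step 1. It computes plain relative frequencies only; any weighting or calibration the tool itself applies is not described in this record, and the function name `kmer_frequencies` is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, max_k=5):
    """Relative k-mer frequencies for k = 1..max_k, concatenated.

    Assumption: plain frequencies over the alphabet ACGT; windows that
    contain any other character (e.g. N) are skipped.
    """
    seq = seq.upper()
    features = []
    for k in range(1, max_k + 1):
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            window = seq[i:i + k]
            if window in counts:
                counts[window] += 1
        total = sum(counts.values())
        features.extend(counts[m] / total if total else 0.0 for m in kmers)
    return features


vector = kmer_frequencies("ACGTACGTTAGC")
```

The vector length is 4 + 16 + 64 + 256 + 1024 = 1364 for k up to 5, and each extra value of k multiplies the number of new features by four, which is consistent with the rising cost shown in the figure.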