Jian Zhao, Xiaofeng Song, Kai Wang.
Abstract
RNA-Seq-based transcriptome assembly has been widely used to identify novel lncRNAs. However, the best-performing transcript reconstruction methods identified only 21% of full-length protein-coding transcripts from H. sapiens. The resulting partial-length protein-coding transcripts are more likely to be classified as lncRNAs because their CDS is incomplete, which raises the false-positive rate of lncRNA identification. Furthermore, sequencing or assembly errors that introduce or abolish stop codons also complicate ORF-based prediction of lncRNAs. It therefore remains a challenge to identify lncRNAs among assembled transcripts, particularly the partial-length ones. Here, we present a novel alignment-free tool, lncScore, which uses a logistic regression model with 11 carefully selected features. lncScore outperforms other state-of-the-art alignment-free tools (e.g. CPAT, CNCI, and PLEK) at distinguishing lncRNAs from mRNAs, especially partial-length mRNAs, in the human and mouse datasets. lncScore also performed well on transcripts from five other species (Zebrafish, Fly, C. elegans, Rat, and Sheep). To speed up prediction, lncScore is multithreaded: with 12 threads it took only 2 minutes to classify 64,756 transcripts and 54 seconds to train a new model on 21,000 transcripts, which is much faster than the other tools. lncScore is available at https://github.com/WGLab/lncScore.
Year: 2016 PMID: 27708423 PMCID: PMC5052565 DOI: 10.1038/srep34838
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Table 1. Features used in lncScore.
| Feature group | Exon | MCSS | ORF |
|---|---|---|---|
| Features (acronym) | Hexamer Score (HS) | Length (L) | Length (L) & Coverage (C) |
| | Hexamer Score Distance (HSD) | Coding Score (CS) | Fickett Score (FS) |
| | GC-content (GC-c) | Coding Score Percentage (CSP) | Hexamer Score (HS) |
| | | | Hexamer Score Distance (HSD) |
MCSS is the abbreviation of maximum coding subsequence.
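To make two of the simpler features in the table concrete, the following is a minimal sketch of GC-content and ORF length and coverage. The definitions are assumptions rather than lncScore's confirmed implementation: the ORF is taken as the longest forward-strand stretch from an ATG to an in-frame stop codon, and coverage is taken as ORF length divided by transcript length. All function names are hypothetical.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf(seq: str) -> Tuple[int, int]:
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Assumption: an ORF needs both a start and an in-frame stop codon.
    Returns (0, 0) when no such ORF exists.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon in the ORF
                if end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def orf_length_and_coverage(seq: str) -> Tuple[int, float]:
    """ORF length in nucleotides and (assumed) coverage = ORF length / transcript length."""
    start, end = longest_orf(seq)
    length = end - start
    coverage = length / len(seq) if seq else 0.0
    return length, coverage


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # made-up sequence, not from the paper
    print(gc_content(toy), orf_length_and_coverage(toy))
```

Because this sketch requires a stop codon, a transcript truncated before its stop yields no ORF at all, which illustrates why ORF-based features alone struggle with partial-length transcripts.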
Figure 1. ROC curves of different feature groups on the full- and partial-length testing datasets.
The area under ROC curve (%) of each single feature.
| Dataset | Exon HS | Exon HSD | Exon GC-c | MCSS CS | MCSS L | MCSS CSP | ORF L | ORF C | ORF FS | ORF HS | ORF HSD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HP | 90.19 | 87.44 | 73.91 | 88.61 | 87.30 | 89.22 | 83.41 | 88.59 | 79.67 | 87.16 | 80.85 |
| HF | 90.92 | 87.59 | 81.67 | 96.45 | 96.19 | 95.16 | 97.06 | 85.44 | 81.87 | 90.48 | 84.84 |
| MP | 91.13 | 89.15 | 75.67 | 89.63 | 89.00 | 91.73 | 83.70 | 92.67 | 79.23 | 89.08 | 80.79 |
| MF | 92.47 | 89.85 | 83.50 | 96.94 | 97.06 | 96.51 | 97.51 | 89.66 | 81.40 | 92.99 | 85.13 |
The performance of each single feature from the three feature groups (exon, MCSS, and ORF) was evaluated using AUC on the Partial Testing Datasets (HP & MP) and the Full Testing Datasets (HF & MF) of human and mouse. The full name of each feature abbreviation is given in Table 1.
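A per-feature AUC such as those tabulated above can be computed directly from the feature values of the two classes. The sketch below uses the rank-based (Mann-Whitney) formulation, in which AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the toy values are hypothetical, and which class counts as positive is a labeling convention that this sketch leaves to the caller.

```python
from typing import Sequence


def single_feature_auc(pos_values: Sequence[float], neg_values: Sequence[float]) -> float:
    """AUC of one feature used as a score: P(pos > neg) + 0.5 * P(pos == neg).

    Quadratic in the number of examples, which is fine for illustration.
    """
    if not pos_values or not neg_values:
        raise ValueError("both classes need at least one example")
    wins = 0.0
    for p in pos_values:
        for n in neg_values:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_values) * len(neg_values))


if __name__ == "__main__":
    # Made-up feature values for two classes, not data from the paper.
    print(single_feature_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

A feature whose AUC falls below 0.5 under one labeling would score above 0.5 with its sign flipped, so a value should be read together with the direction of the feature.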
Figure 2. ROC curves of different tools on the full- and partial-length testing datasets.
Performance (%) comparison on the partial- and full-length testing dataset.
| Species | Metric | Partial: CPAT | Partial: CNCI | Partial: PLEK | Partial: lncScore | Full: CPAT | Full: CNCI | Full: PLEK | Full: lncScore |
|---|---|---|---|---|---|---|---|---|---|
| Human | Accuracy | 84.03 | 80.51 | 63.14 | 89.12 | 94.41 | 92.20 | 90.61 | 95.21 |
| Human | Sensitivity | 76.19 | 65.40 | 31.76 | 84.15 | 94.97 | 89.00 | 85.96 | 95.56 |
| Human | PPV | 92.12 | 96.46 | 94.83 | 94.61 | 95.46 | 98.16 | 98.62 | 96.64 |
| Human | Specificity | 92.75 | 97.33 | 98.07 | 94.67 | 92.75 | 97.33 | 98.07 | 94.67 |
| Human | NPV | 77.78 | 71.65 | 56.36 | 84.29 | 92.00 | 84.64 | 81.31 | 92.99 |
| Human | MCC | 69.41 | 65.36 | 39.07 | 78.85 | 87.59 | 84.55 | 81.96 | 89.93 |
| Mouse | Accuracy | 79.04 | 76.47 | 50.07 | 89.92 | 94.65 | 92.83 | 83.67 | 96.46 |
| Mouse | Sensitivity | 72.88 | 69.24 | 35.34 | 88.39 | 94.19 | 91.56 | 81.17 | 97.35 |
| Mouse | PPV | 97.97 | 98.05 | 90.91 | 97.61 | 98.39 | 98.48 | 95.75 | 97.78 |
| Mouse | Specificity | 95.88 | 96.23 | 90.35 | 94.08 | 95.88 | 96.23 | 90.35 | 94.08 |
| Mouse | NPV | 56.40 | 63.37 | 33.82 | 74.78 | 86.05 | 80.98 | 64.18 | 92.99 |
| Mouse | MCC | 61.15 | 58.02 | 25.21 | 77.27 | 84.21 | 83.52 | 65.47 | 91.10 |
The default cutoffs of CPAT, PLEK, and lncScore are shown in Fig. 4; the default cutoff of CNCI is 0.
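The six metrics reported in the table above all follow from the four cells of a confusion matrix. As a reminder of the standard definitions, here is a minimal sketch; the function name and the example counts are hypothetical and are not taken from the paper's datasets.

```python
import math
from typing import Dict


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": _safe_div(tp, tp + fn),   # true positive rate (recall)
        "ppv": _safe_div(tp, tp + fp),           # positive predictive value (precision)
        "specificity": _safe_div(tn, tn + fp),   # true negative rate
        "npv": _safe_div(tn, tn + fn),           # negative predictive value
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    for name, value in classification_metrics(tp=80, fp=10, tn=90, fn=20).items():
        print(f"{name}: {100 * value:.2f}%")
```

Specificity depends only on the negative class, which is consistent with the identical specificity values in the partial- and full-length columns if the two testing datasets share the same negative examples.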
The overall ACC and AUC (%) of CPAT, CNCI, PLEK, lncScore, and 10-fold cross validation on 5 other species datasets.
| Tool | Zebrafish ACC | Zebrafish AUC | Fruitfly ACC | Fruitfly AUC | C. elegans ACC | C. elegans AUC | Rat ACC | Rat AUC | Sheep ACC | Sheep AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| CPAT* | 78.51 | 82.54 | 95.52 | 98.36 | * | * | * | * | * | * |
| CPATH | 78.66 | 83.17 | 92.80 | 98.36 | 97.55 | 99.66 | 89.25 | 94.22 | 77.58 | 88.03 |
| CPATM | 78.29 | 82.54 | 94.44 | 98.37 | 96.53 | 99.69 | 89.72 | 94.23 | 79.99 | 87.70 |
| CNCI | 69.49 | 77.66 | 86.08 | 95.00 | 64.33 | 83.89 | 81.32 | 88.70 | 84.77 | 85.33 |
| PLEK | 62.32 | 70.46 | 82.19 | 89.90 | 75.98 | 95.27 | 83.23 | 89.80 | 66.55 | 69.90 |
| lncScoreH | 79.25 | 84.95 | 95.54 | 98.67 | 96.41 | 99.33 | 89.28 | 94.41 | 84.77 | 94.28 |
| lncScoreM | 79.84 | 85.43 | 96.44 | 98.88 | 97.28 | 99.35 | 89.27 | 94.54 | 82.20 | 93.73 |
| 10_CV | 79.91 | 86.97 | 96.66 | 98.64 | 98.23 | 99.21 | 89.41 | 93.80 | 92.78 | 96.96 |
CPAT* denotes the CPAT models for zebrafish and fly. CPATH and CPATM denote the CPAT models for human and mouse, respectively; lncScoreH and lncScoreM denote the lncScore models for human and mouse, respectively. 10_CV stands for 10-fold cross-validation.
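The 10_CV row reports 10-fold cross-validation, in which a model is retrained on nine tenths of a species' data and scored on the held-out tenth, rotating through all ten folds. The following is a generic sketch of that procedure with a small logistic regression trained by batch gradient descent. It is not lncScore's training code: the learning rate, epoch count, shuffling seed, and the single toy feature are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                   lr: float = 0.1, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float], cutoff: float = 0.5) -> int:
    """Label 1 when the predicted probability reaches the cutoff."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= cutoff else 0


def k_fold_accuracy(X: Sequence[Sequence[float]], y: Sequence[int],
                    k: int = 10, seed: int = 0) -> float:
    """Overall accuracy pooled over k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)        # fixed seed keeps the split reproducible
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds of near-equal size
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w, b = train_logistic([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(w, b, X[i]) == y[i] for i in held_out)
    return correct / len(X)


if __name__ == "__main__":
    # Toy, well-separated one-feature data; not data from the paper.
    X = [[-0.5 - 0.1 * i] for i in range(20)] + [[0.5 + 0.1 * i] for i in range(20)]
    y = [0] * 20 + [1] * 20
    print(k_fold_accuracy(X, y))
```

Pooling correct predictions over all folds gives one overall accuracy per dataset, which is the kind of single ACC figure reported per species in the table.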
Figure 3. Workflow for identification of novel lncRNAs using RNA-seq data.
Figure 4. Accuracy versus cutoff score for testing datasets.
The highest point of each line is marked with an asterisk.
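An accuracy-versus-cutoff curve of this kind can be produced by sweeping the cutoff over the score range and recording accuracy at each step; the asterisk then corresponds to the cutoff with the highest accuracy. The sketch below assumes scores lie in [0, 1] and that a score at or above the cutoff predicts the positive class; both are assumptions, as are the function names and the toy scores.

```python
from typing import Sequence, Tuple


def accuracy_at_cutoff(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> float:
    """Accuracy when scores >= cutoff are called positive (label 1)."""
    correct = sum((s >= cutoff) == bool(l) for s, l in zip(scores, labels))
    return correct / len(scores)


def best_cutoff(scores: Sequence[float], labels: Sequence[int],
                steps: int = 100) -> Tuple[float, float]:
    """Scan cutoffs 0, 1/steps, ..., 1 and return (cutoff, accuracy) at the peak.

    When several cutoffs tie, the smallest one is returned.
    """
    cutoffs = [i / steps for i in range(steps + 1)]
    best = max(cutoffs, key=lambda c: accuracy_at_cutoff(scores, labels, c))
    return best, accuracy_at_cutoff(scores, labels, best)


if __name__ == "__main__":
    # Made-up scores and labels, not data from the paper.
    scores = [0.1, 0.2, 0.35, 0.6, 0.7, 0.9]
    labels = [0, 0, 0, 1, 1, 1]
    print(best_cutoff(scores, labels))
```

A cutoff chosen this way is tuned to the dataset it was scanned on, so it is best selected on data separate from the data used to report final performance.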