Siyu Han, Yanchun Liang, Ying Li, Wei Du.
Abstract
A growing number of studies have demonstrated that long noncoding RNAs (lncRNAs) play critical roles in a diversity of biological processes and are associated with various types of disease. Rapidly distinguishing lncRNAs from messenger RNAs is a fundamental step toward uncovering the functions of lncRNAs. Here, we present a novel method for rapid identification of lncRNAs, named Lncident (LncRNAs identification), which uses intrinsic sequence composition features and open reading frame information in a support vector machine model. The 10-fold cross-validation and ROC curve are used to evaluate the performance of Lncident. The main advantage of Lncident is high speed without loss of accuracy. Compared with existing popular tools, Lncident outperforms Coding-Potential Calculator, Coding-Potential Assessment Tool, Coding-Noncoding Index, and PLEK. Lncident is also much faster than Coding-Potential Calculator and Coding-Noncoding Index. Lncident performs particularly well on microorganisms, which makes it promising for the analysis of microorganisms. In addition, Lncident can be trained on users' own collected data. Furthermore, an R package and a web server were developed to maximize convenience for users. The R package "Lncident" can be easily installed on any operating system platform that supports R.
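As a rough illustration of the two feature families named in the abstract (sequence composition and open reading frame information), the sketch below computes k-mer frequencies and the longest forward-strand ORF for one transcript. It is not the Lncident implementation: the function names, the choice of k = 2, and the ORF definition (ATG to the first in-frame stop codon) are assumptions made for this example.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k=2):
    """Relative frequency of each overlapping k-mer over the ACGT alphabet."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest ATG...stop ORF
    found in the three forward reading frames. Assumed ORF definition."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def feature_vector(seq, k=2):
    """Hypothetical feature vector: k-mer frequencies, ORF length, ORF coverage."""
    freqs = kmer_frequencies(seq, k)
    orf_len = longest_orf_length(seq)
    coverage = orf_len / len(seq) if seq else 0.0
    return [freqs[m] for m in sorted(freqs)] + [orf_len, coverage]
```

A vector of this kind could be passed to any support vector machine implementation; the features and model settings actually used by Lncident are not specified in this abstract.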
Year: 2016 PMID: 28116287 PMCID: PMC5223071 DOI: 10.1155/2016/9185496
Source DB: PubMed Journal: Int J Genomics ISSN: 2314-436X Impact factor: 2.326
Figure 1. The framework of Lncident: data collection, feature extraction, classification model construction, and model evaluation.
Figure 2. The distribution of the adjoining base(s) on lncRNAs and coding sequences.
Figure 3. The Receiver Operating Characteristic (ROC) curve of several feature categories.
Figure 4. The performances of different tools on different species.
The performances on human dataset 2.
| Tools | Sensitivity | Specificity | Accuracy | F-measure | MCC | Kappa |
|---|---|---|---|---|---|---|
| CPC.local | 0.6613 | | 0.8306 | 0.7961 | 0.7028 | 0.6612 |
| CPC.web | 0.6625 | 0.9992 | 0.8309 | 0.7966 | 0.7028 | 0.6618 |
| CPAT.local | 0.9333 | 0.9802 | 0.9568 | 0.9557 | 0.9145 | 0.9135 |
| CPAT.web | 0.9535 | 0.9673 | 0.9604 | 0.9601 | 0.9208 | 0.9208 |
| CNCI | 0.9702 | 0.9157 | 0.9430 | 0.9445 | 0.8873 | 0.8860 |
| PLEK | | 0.8918 | 0.9435 | 0.9463 | 0.8918 | 0.8870 |
| CPAT.train | 0.9160 | 0.9848 | 0.9504 | 0.9486 | 0.9029 | 0.9008 |
| PLEK.train | 0.7622 | 0.9507 | 0.8565 | 0.8416 | 0.7260 | 0.7130 |
| Lncident | 0.9535 | 0.9795 | | | | |
Lncident displayed a satisfactory overall performance. CPC and CPAT were tested in both their stand-alone versions and their web servers. The suffix “train” denotes CPAT and PLEK run with a newly trained model.
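The six measures reported in these tables can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions, with Cohen's kappa and the F-measure computed for the positive class. Which class the paper treats as positive and the underlying counts are not given here, so the function name and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Standard binary metrics from confusion-matrix counts.

    tp/fn: positives predicted correctly/incorrectly;
    tn/fp: negatives predicted correctly/incorrectly.
    """
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pr_sum = precision + sensitivity
    f_measure = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts for a balanced set of 4,000 positives and 4,000 negatives.
example = classification_metrics(tp=2650, fn=1350, tn=3997, fp=3)
```

With these hypothetical counts the metrics come out close to the CPC.web row above, which shows how a tool with near-perfect specificity can still have moderate MCC and kappa when its sensitivity is low. The actual confusion matrices behind the table are not reported in this section.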
Comparisons on the C. elegans dataset.
| Tools | Sensitivity | Specificity | Accuracy | F-measure | MCC | Kappa |
|---|---|---|---|---|---|---|
| CPC.web | 0.9990 | | | | | |
| CPAT.web | 0.9910 | 0.9187 | 0.9548 | 0.9564 | 0.9120 | 0.9097 |
| CNCI | 0.9938 | 0.7744 | 0.8841 | 0.8955 | 0.7873 | 0.7681 |
| PLEK | 0.9987 | 0.4526 | 0.7256 | 0.7845 | 0.5387 | 0.4513 |
| Lncident | | 0.9545 | 0.9772 | 0.9778 | 0.9555 | 0.9545 |
| CPAT.train | | 0.9950 | | | | |
| PLEK.train | 0.9795 | 0.9950 | 0.9872 | 0.9872 | 0.9746 | 0.9745 |
| Lncident.train | 0.9950 | | 0.9962 | 0.9962 | 0.9925 | 0.9925 |
Among the tools run with their default models, Lncident gave the best result of the alignment-free methods. With newly trained models, both Lncident and CPAT outperformed CPC. Lncident without a suffix uses the model trained on human data. The suffix “train” denotes tools run with a model trained on C. elegans.
Comparisons on the S. cerevisiae dataset.
| Tools | Sensitivity | Specificity | Accuracy | F-measure | MCC | Kappa |
|---|---|---|---|---|---|---|
| CPC.web | 0.9734 | | | | | |
| CPAT.web | | 0.7980 | 0.8416 | 0.7316 | 0.6785 | 0.6304 |
| CNCI | 0.9758 | 0.6760 | 0.7407 | 0.6190 | 0.5377 | 0.4598 |
| PLEK | 0.9903 | 0.5507 | 0.6456 | 0.5468 | 0.4491 | 0.3407 |
| Lncident | | 0.8620 | 0.8918 | 0.7996 | 0.7578 | 0.7295 |
| CPAT.train | | 0.9487 | 0.9597 | 0.9147 | 0.8942 | 0.8886 |
| PLEK.train | 0.9758 | 0.8053 | 0.8421 | 0.7274 | 0.6682 | 0.6262 |
| Lncident.train | | | | | | |
Lncident presented the best result on the S. cerevisiae dataset. Lncident without a suffix uses the model trained on human data. The suffix “train” denotes tools run with a model trained on C. elegans.
Running time on human dataset.
| Tools | Running time (stand-alone) | Running time (web server) |
|---|---|---|
| CPC | 5 days 4 h 37 min 43 s | About 1 day |
| CPAT | 16 s | |
| CNCI | 37 min 53 s | |
| PLEK | 3 min 04 s | |
| Lncident | 3 min 01 s | About 12 min |
Tested on an Intel® Core™ i7-2600 CPU at 3.40 GHz with 8 GB RAM.
The test dataset contains 4,000 mRNAs and 4,000 lncRNAs. Precise running times are hard to measure on a web server, so the web server values are rough approximations. We omitted the result for the CPAT web server because it only accepts files smaller than 10 MB, and when we split our file, the average error was too large to accept.
Figure 5. The screenshots of Lncident input and output.