Siyu Han, Yanchun Liang, Ying Li, Wei Du.
Abstract
Long noncoding RNAs (lncRNAs) are noncoding RNAs longer than 200 nucleotides that have attracted growing interest in recent years. Many studies have confirmed that the human genome contains many thousands of lncRNAs, which exert great influence over some critical regulators of cellular processes. With the advent of high-throughput sequencing technologies, a large number of sequences await analysis. Many programs have therefore been developed to distinguish coding from long noncoding transcripts. These programs are generally designed for different circumstances, so it is sensible and practical to select a method appropriate to the situation at hand. In this review, several popular methods and their advantages, disadvantages, and application scopes are summarised to help readers choose a suitable method and obtain more reliable results.
Year: 2016 PMID: 28042575 PMCID: PMC5153550 DOI: 10.1155/2016/8496165
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Overview of the methods concerning lncRNA identification.
| Tool | Published year | Testing datasets | Training species | Model | Query file format | Web interface |
|---|---|---|---|---|---|---|
| CPC | 2007 | ncRNA | Eukaryotic | SVM | FASTA | Yes |
| CPAT | 2013 | lncRNA | Human; mouse; fly; zebrafish | LR | BED; FASTA | Yes |
| CNCI | 2013 | lncRNA | Human; plant | SVM | FASTA; GTF | No |
| PLEK | 2014 | lncRNA | Human; maize | SVM | FASTA | No |
| lncRNA-MFDL | 2015 | lncRNA | Human | DL | | |
| LncRNA-ID | 2015 | lncRNA | Human; mouse | RF | BED; FASTA | No |
| lncRScan-SVM | 2015 | lncRNA | Human; mouse | SVM | GTF | No |
| LncRNApred | 2016 | lncRNA | Human | RF | FASTA | Web only |
Testing datasets denotes whether a method was developed to discriminate ncRNAs or lncRNAs from protein-coding transcripts. CPC, CNCI, PLEK, and lncRScan-SVM use a support vector machine (SVM); CPAT employs logistic regression (LR); LncRNA-ID and LncRNApred use random forests (RF); and lncRNA-MFDL uses deep stacking networks (DSNs), a deep learning (DL) algorithm.
Note that the most popular tool, CPC, is trained and tested on datasets of ncRNAs and protein-coding transcripts. CPAT is also trained on ncRNAs and protein-coding transcripts, although it was additionally tested on lncRNAs, where it achieved a higher accuracy.
The access link of lncRNA-MFDL has expired; thus, we cannot verify the information that the original paper failed to mention.
Summary of the features of each method selected.
| Tool | ORF | Codon | Sequence structure | Ribosome interaction | Alignment | Protein conservation |
|---|---|---|---|---|---|---|
| CPC | Quality; coverage; | No | No | No | BLASTX | Number and |
| CPAT | Length; | Hexamer | Content of the bases | No | No | No |
| CNCI | No | ANT matrix; | MLCDS | No | No | No |
| PLEK | No | No | Improved | No | No | No |
| lncRNA-MFDL | Length; | No | | No | No | No |
| LncRNA-ID | Length; | No | Kozak motif | Ribosome release signal | Profile HMM | Score of HMMER |
| lncRScan-SVM | No | Distribution of stop codon | Score of txCdsPredict; | No | Phylo-HMM | Average PhastCons scores |
| LncRNApred | Length; | No | Length of the sequence; | No | No | No |
All features are categorized into six groups according to the similarity or basic principles. Thus, some items in the table might not be exactly in one-to-one correspondence with the feature names given in the corresponding published references.
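To make the ORF column more concrete, the following is a minimal sketch of how an ORF length and an ORF coverage value could be derived from a transcript. It is not the implementation of any tool in the table: the ORF definition (first `ATG` to the next in-frame stop codon, forward strand only) and the names `longest_orf` and `orf_features` are assumptions made for illustration, and each published tool defines its ORF features in its own way.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Assumption for this sketch: an ORF starts at ATG and ends at the first
    in-frame stop codon (stop included). Returns None if no ORF is found.
    """
    seq = seq.upper().replace("U", "T")
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


def orf_features(seq):
    """Compute ORF length (nt) and ORF coverage (ORF length / transcript length)."""
    orf = longest_orf(seq)
    length = 0 if orf is None else orf[1] - orf[0]
    coverage = length / len(seq) if seq else 0.0
    return {"orf_length": length, "orf_coverage": coverage}


if __name__ == "__main__":
    # Toy transcript: a 12-nt ORF (ATG AAA TTT TAG) inside a 16-nt sequence.
    print(orf_features("CCATGAAATTTTAGGG"))
```

Features of this kind are typically passed, together with the other feature groups in the table, to a classifier such as an SVM, logistic regression, or random forest.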
Figure 1. An overall procedure of eight tools. The features of each tool are sorted into several groups and only the categories of the features are listed in the figure.
Figure 2. The ROC curves of CPC, CPAT, CNCI, and PLEK. We assessed the models on the same datasets that LncRNA-ID used (selected from GENCODE). Both CPC and CPAT were evaluated with the latest versions.
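For readers unfamiliar with ROC curves such as those in Figure 2, the sketch below shows one way to compute ROC points and the area under the curve from per-transcript scores. It is an illustration only, not the evaluation code used in the review; the function names `roc_points` and `auc`, the label convention (1 for the positive class, 0 for the negative class), and the toy scores are all assumptions.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), one per distinct score threshold.

    labels: 1 for the positive class, 0 for the negative class (assumed convention).
    scores: higher means more likely positive.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Hypothetical scores for four transcripts.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.6]
    print(roc_points(labels, scores))
    print(auc(roc_points(labels, scores)))
```

A curve that rises towards the top-left corner, and hence an area closer to 1, indicates better separation of the two classes across thresholds.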
Overview of each tool's performance on different testing datasets.
| Testing dataset | CPC | CPAT | CNCI | PLEK | LncRNA-ID | lncRScan-SVM |
|---|---|---|---|---|---|---|
| MCF-7¹ | | | | | | |
| Specificity | | 91.80 | 94.70 | | | |
| Sensitivity | 19.00 | 78.70 | | | | |
| Accuracy | | 91.30 | 94.70 | | | |
| HelaS3² | | | | | | |
| Specificity | | 93.90 | 95.50 | | | |
| Sensitivity | 47.20 | 81.10 | | | | |
| Accuracy | | 93.70 | 95.40 | | | |
| H1_gencode³ | | | | | | |
| Specificity | | 99.55 | 89.18 | 95.28 | | |
| Sensitivity | 66.48 | 86.95 | | 96.28 | | |
| Accuracy | 83.22 | 93.25 | 94.32 | | | |
| M1_gencode⁴ | | | | | | |
| Specificity | 98.75 | | 70.94 | 92.10 | | |
| Sensitivity | 76.55 | 38.80 | 88.11 | | | |
| Accuracy | 87.65 | 68.88 | 79.49 | | | |
| Human⁵ | | | | | | |
| Specificity | | 85.28 | 89.20 | | | |
| Sensitivity | 67.23 | | 93.88 | | | |
| Accuracy | 82.43 | 89.94 | | | | |
| Mouse⁵ | | | | | | |
| Specificity | | 88.17 | 89.14 | | | |
| Sensitivity | 75.46 | | 95.29 | | | |
| Accuracy | 86.91 | 91.76 | | | | |
The results of the tools being tested on the same datasets are listed above. Bold numbers denote the highest value of the metrics.
¹MCF-7 is available at http://www.pacb.com/blog/data-release-human-mcf-7-transcriptome/; ²the HelaS3 dataset is available at https://www.ncbi.nlm.nih.gov/sra/SRX214365; ³,⁴datasets are available at https://www.dropbox.com/sh/7yvmqknartttm6k/AAAQHvLZPjgjf4dtmHM7GNCqa/H1_gencode?dl=0 and https://www.dropbox.com/sh/7yvmqknartttm6k/AACzaG-QJggvbXW6LA32oo7ba/M1_gencode?dl=0; ⁵the human and mouse dataset is available at http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0139654.
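The three metrics reported in the performance table follow from the counts of a confusion matrix. The sketch below uses the standard definitions; the function name `classification_metrics` and the choice of which class counts as positive are assumptions for illustration, and the counts in the example are hypothetical rather than taken from any dataset above.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy as percentages.

    tp/fn: positive-class transcripts predicted correctly/incorrectly.
    tn/fp: negative-class transcripts predicted correctly/incorrectly.
    Which class is "positive" is a convention chosen by each study.
    """
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    accuracy = 100 * (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": round(sensitivity, 2),
        "specificity": round(specificity, 2),
        "accuracy": round(accuracy, 2),
    }


if __name__ == "__main__":
    # Hypothetical, imbalanced counts: few positives, many negatives.
    print(classification_metrics(tp=19, fn=81, tn=990, fp=10))
```

With imbalanced counts like these, accuracy stays above 90% even though sensitivity is only 19%, which is why the three metrics are best read together rather than in isolation.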
Priority of employing different methods on different situations.
| Situation | CPC | CPAT | CNCI | PLEK | LncRNA-ID | lncRScan-SVM |
|---|---|---|---|---|---|---|
| Coding potential assessment | ✓ | ✓ | | | | |
| Human lncRNAs | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Mouse lncRNAs | | ✓ | | | ✓ | ✓ |
| Other species¹ | ✓ | ✓ | ✓ | ✓ | | |
| Testing data with sequencing errors² | ✓ | | ✓ | ✓ | | |
| Lack of annotation | ✓ | ✓ | ✓ | | | |
| Massive-scale data³ | ✓ | ✓ | ✓ | ✓ | | |
| Trained by users⁴ | | ✓ | | ✓ | ✓ | |
| Web interface | ✓ | ✓ | | | | |
This table only presents the preferences under different situations, which means a method with a tick can achieve a better performance under a certain circumstance.
¹Only CPAT, LncRNA-ID, and lncRScan-SVM provide a model for mouse. For other species, CPAT has models for fly and zebrafish; CNCI and PLEK can predict sequences from vertebrates and plants. CPAT, PLEK, and LncRNA-ID can build a new model from users' datasets. ²Users can choose CNCI for incomplete sequences and CPC or PLEK for transcripts with indel errors. ³CPAT is the most efficient method. Though lncRScan-SVM needs more time than CPAT and LncRNA-ID, its running time is also acceptable. ⁴LncRNA-ID can handle imbalanced training data. Training PLEK with users' own datasets may be time-consuming.
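The footnote statements can be read as a small lookup from requirement to candidate tools. The sketch below encodes only what the footnotes state explicitly; the trait names, the `TOOL_TRAITS` table, and the function `candidate_tools` are hypothetical and are not part of any of the reviewed tools.

```python
# Requirement -> tools, taken only from the footnotes above.
# The trait names are hypothetical labels chosen for this sketch.
TOOL_TRAITS = {
    "mouse_model": {"CPAT", "LncRNA-ID", "lncRScan-SVM"},
    "user_trainable": {"CPAT", "PLEK", "LncRNA-ID"},
    "incomplete_sequences": {"CNCI"},
    "indel_errors": {"CPC", "PLEK"},
}


def candidate_tools(requirements):
    """Return the tools that satisfy every requirement, sorted by name."""
    sets = [TOOL_TRAITS[name] for name in requirements]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


if __name__ == "__main__":
    # Tools with a mouse model that can also be retrained on the user's data.
    print(candidate_tools(["mouse_model", "user_trainable"]))
```

An empty result means no single tool meets all the stated requirements according to the footnotes, in which case the requirements need to be relaxed or several tools combined.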