Christopher Klapproth, Rituparno Sen, Peter F Stadler, Sven Findeiß, Jörg Fallmann.
Abstract
Long non-coding RNAs (lncRNAs) are widely recognized as important regulators of gene expression. Their molecular functions range from miRNA sponging to chromatin-associated mechanisms, leading to effects in disease progression and establishing them as diagnostic and therapeutic targets. Still, only a few representatives of this diverse class of RNAs are well studied, while the vast majority is poorly described beyond the existence of their transcripts. In this review we survey common in silico approaches for lncRNA annotation. We focus on the well-established sets of features used for classification and discuss their specific advantages and weaknesses. While the available tools perform very well for the task of distinguishing coding sequence from other RNAs, we find that current methods are not well suited to distinguish lncRNAs or parts thereof from other non-protein-coding input sequences. We conclude that the distinction of lncRNAs from intronic sequences and untranslated regions of coding mRNAs remains a pressing research gap.
Keywords: classification problems; coding sequence; feature extraction; lncRNA; machine learning
Year: 2021 PMID: 34940758 PMCID: PMC8708962 DOI: 10.3390/ncrna7040077
Source DB: PubMed Journal: Noncoding RNA ISSN: 2311-553X
Figure 1. A comparison of the frequencies of commonly used features (blue) and algorithms (orange) applied by different contemporary tools. A majority of the tools rely on open reading frame (ORF) information to make predictions. Other frequently used features include subsequence (k-mer) frequencies and GC content. SVMs and Random Forests dominate the field as the most commonly implemented algorithms. This is not surprising, as they are two of the most flexible approaches for nonlinear classification.
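The three feature families named in the caption (ORF information, k-mer frequencies, GC content) can be illustrated with a minimal Python sketch. The function names, the forward-strand-only ORF scan, and the ATG-to-stop ORF definition are assumptions made for illustration; they are not taken from any of the reviewed tools, each of which defines its features in its own way.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def kmer_frequencies(seq, k=3):
    """Relative frequency of every overlapping k-mer observed in the sequence."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def longest_orf_length(seq):
    """Length in nt of the longest ATG..stop ORF (stop codon included).

    Assumption: only the three forward-strand frames are scanned and an ORF
    must end in a stop codon; real tools may also accept partial ORFs.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_coverage(seq):
    """Longest ORF length divided by transcript length."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0
```

A feature vector for a classifier such as an SVM or a random forest would then be assembled from values like `gc_content(seq)`, `orf_coverage(seq)` and the entries of `kmer_frequencies(seq, k)`.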
LncRNA detection tools: Essential characteristics of the reviewed lncRNA detection tools are summarized. The tools are ordered by year of release. Furthermore, the average number of citations accumulated per year is given as an approximate impact metric. To minimize redundancy, only tools that were cited more than 10 times per year on average were included in this review, unless they contributed an explicitly unique approach in their feature design not encountered elsewhere, namely BASiNet and PLIT.
| Tool | Year | Algorithm | Species | Features | Performance | Mean Citations per Year |
|---|---|---|---|---|---|---|
| | 2006 | SVM | Eukaryotes (both protein-coding and non-coding genes) | peptide length, amino acid composition, predicted secondary structure content, mean hydrophobicity, percentage of residues exposed to solvent, sequence compositional entropy, number of homologues, alignment entropy | 10-fold CV on protein-coding: F1-score: 97.4%, Precision: 97.1%, Recall: 97.8%; on non-coding: F1-score: 94.5%, Precision: 95.2%, Recall: 93.8% | 12.4 |
| | 2007 | SVM | Eukaryotes (both protein-coding and non-coding genes) | ORF features (quality, coverage, integrity), number of BLASTX hits, hit score, frame score | 10-fold CV: 95.77%, Accuracy on Rfam database (non-coding): 98.62%, RNADB (non-coding): 91.5%, EMBL cds (protein-coding): 99.08%; Accuracy in lncRNA detection: 76.2% | 131.8 |
| | 2009 | SVM | Species neutral, case study on | ORF length, isoelectric point, hydropathy, compositional entropy | Accuracy: 91.9%, Specificity: 95%, Sensitivity: 86.4% | 10.2 |
| | 2013 | SVM | Vertebrates, plants, orangutan | adjacent nucleotide triplets, sequence score, codon-bias, most-like CDS (MLCDS), length-percentage, score-distance | 10-fold CV accuracy on human: 97.3%; Minimum average error for vertebrates < 0.1, Plants: 0.24 | 111.3 |
| | 2013 | Logistic regression | human | ORF length, ORF to transcript length ratio, Fickett score, hexamer usage bias | 10-fold accuracy: 99%, Precision: 96% | 135.6 |
| | 2013 | SVM | human, mouse | frequency of six | Accuracy in human lncRNA detection: 96.1%, Mouse: 94.2%; Accuracy in human protein-coding gene detection: 94.7%, Mouse: 92.7% | 19.5 |
| | 2014 | SVM | 11 vertebrates | | 10-fold CV accuracy: 95.6% | 50.5 |
| | 2015 | SVM | human, mouse | sum of lengths of exons, frequency of exons, mean exon length, standard deviation of stop codon frequency, txCdsPredict | Two test sets created based on (i) random protein-coding and lncRNA sequences and (ii) only dissimilar sequences. Accuracy on set A for human: 91.54%, Mouse: 92.21%; on set B for human: 91.45%, Mouse: 92.2%; MCC on set A for human: 83.17%, Mouse: 84.59%; on set B for human: 82.99%, Mouse: 84.69%; AUC on set A for human: 96.39%, Mouse: 96.62%; on set B for human: 96.39%, Mouse: 96.64% | 13.2 |
| | 2015 | Random forests | human, mouse | ORF related features, ribosomal interaction related features, protein conservation scores | Specificity on human: 95.28%, Mouse: 92.1%; Recall on human: 96.28%, Mouse: 94.45%; Accuracy on human: 95.78%, Mouse: 93.28% | 12.7 |
| | 2016 | Random forest | human, mouse, nematode, fruit fly, arabidopsis | GC content, DNA sequence conservation, protein conservation, polyA abundance, RNA secondary structure conservation, ORF score, expression specificity score | Accuracy: human (93.7%), arabidopsis (98.3%), mouse (89.8%), nematode (98.9%), fruit fly (98.4%) | 16.2 |
| | 2016 | Deep neural network | human | | 10-fold CV accuracy: 98.07%, MCC: 96%, Recall: 98.98%, Precision: 97.14%, AUC: 99.3% | 12.4 |
| | 2017 | Random forests | human, mouse | ORF features (coverage, length), sequence length, coding potential score, | Accuracy for human: 91.9%, Mouse: 93.9%; Sensitivity for human: 92.3%, Mouse: 93.8%; Specificity for human: 91.5%, Mouse: 94.1%; F score for human: 91.9%, Mouse: 95.6%; MCC for human: 83.8%, Mouse: 85.6% | 49.5 |
| | 2017 | Random forest | Species neutral, trained and tested on animals and plants (both protein-coding and non-coding genes) | ORF features (quality, coverage, integrity), Fickett score, isoelectric point | Accuracy: 96.1%, Specificity: 97%, Recall: 95.2%; Accuracy in lncRNA detection: 94.2% | 97.3 |
| | 2017 | Random forest | plants | 64 | | 13.8 |
| | 2018 | Convolutional neural network, recurrent neural network | human, mouse | sequence, ORF features (length, coverage, indicator) | 5-fold accuracy: 99%; Accuracy on human: 91.79%, Mouse: 91.83%; Specificity on human: 87.66%, Mouse: 89.03%; Sensitivity on human: 95.91%, Mouse: 94.63%; AUC on human: 96.72%, Mouse: 96.67% | 22 |
| | 2018 | Deep belief network | human, mouse | ORF features (length, coverage, hexamer score of longest ORF, entropy density profile), UTR coverage, GC content of UTRs, Fickett score, HMMER index | Precision for lncRNA detection from full-length mRNA transcripts: 97.2%, Recall: 98.1%, Average harmonic mean: 97.7%; Precision for lncRNA detection from both full and partial-length mRNA transcripts: 94.5%, Recall: 93.8%, Average harmonic mean: 94.2%; Precision for lncRNA detection from partial-length mRNA transcripts: 90.3%, Recall: 93.8%, Average harmonic mean: 92% | 22.6 |
| | 2018 | SVM | Trained on human, tested on human, mouse, wheat, zebrafish, chicken | genomic distance to lncRNA, genomic distance to protein-coding transcript, distance ratio, EIIP value | 10-fold CV accuracy: 96.87% | 17.6 |
| | 2018 | Decision tree on complex networks | datasets from PLEK and CPC2 | average shortest path, average betweenness centrality, average degree, assortativity, maximum degree, minimum degree, clustering coefficient, motif frequency | | 8.6 |
| | 2018 | Random forest | human, mouse, rice, arabidopsis | length, GC content, hexamer score, alignment identity, ratio of alignment length and mRNA length, ratio of alignment length and ORF length, transposable elements, sequence divergence from transposable element, ORF length, Fickett score | | 11 |
| | 2019 | SVM | 11 animal species, 26 plant species | max_score of MLCDS, standard deviation of MLCDS scores and MLCDS lengths, frequency of 64 codons | Accuracy on human: 98%, Mouse: 95%, Zebrafish: 93%, Fruit fly: 93%, Arabidopsis: 98% | 20.5 |
| | 2019 | Random forest | plants: arabidopsis, soybean, rice, tomato, sorghum, grapevine, maize | transcript length, GC content, Fickett score, hexamer score, maximum ORF length, ORF coverage, mean ORF coverage, codon bias | AUC: 93.3% for all species except S. bicolor (75%) and arabidopsis (85%) | 8.5 |
| | 2019 | Feature relationship | human, mouse, zebrafish, nematode, rice, tomato | GC content, ORF length, coding potential score | 10-fold CV accuracy: human (94.5%), mouse (93.6%), zebrafish (88.4%), nematode (93.3%), tomato (93.3%), rice (96.3%) | 12.5 |
Abbreviations used in the table: Matthews correlation coefficient (MCC), cross validation (CV), area under curve (AUC).
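For reference, the MCC is conventionally computed from the confusion-matrix counts. The expression below is the standard textbook definition, not a formula recovered from the source, and the symbols TP, TN, FP and FN (true positives, true negatives, false positives, false negatives) are introduced here for illustration:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

The value ranges from −1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction); the table reports it as a percentage.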
Tools with the highest impact on lncRNA annotation research ordered by year. A tool was considered to be of high impact if it had accumulated a significantly larger number of citations in the same time frame than a competing tool and could be shown to be involved in the identification of multiple bona fide non-coding RNAs that were previously undiscovered. Number of citations refers to the last access on 11 November 2021.
| Tool | Year | Algorithm | Citations | Species |
|---|---|---|---|---|
| CPC | 2007 | SVM | 1857 | Eukaryotes (protein-coding and non-coding transcripts) |
| CNCI | 2013 | SVM | 898 | Vertebrates, plants, orangutan |
| CPAT | 2013 | logistic regression | 1105 | human |
| PLEK | 2014 | SVM | 359 | 11 vertebrate species |
| CPC2 | 2017 | Random forest | 406 | Eukaryotes (protein-coding and non-coding transcripts) |
| FEELnc | 2017 | Random Forest | 198 | human, mouse |
Classification performance measures of benchmarked tools. We calculated sensitivity, specificity, precision and raw accuracy based on classification of 400 randomly chosen human transcripts, 200 of which were protein-coding and 200 were confirmed non-coding RNAs of 200 nt to 3500 nt length. The transcripts used in this benchmark were extracted from the RefSeq database [40] (ver. 38). Bold faced values highlight the maximum achieved for each measurement.
| Tool | Sensitivity | Specificity | Accuracy | Precision |
|---|---|---|---|---|
| CNCI | 0.82 | | 0.895 | |
| CPAT | | 0.92 | | 0.92 |
| PLEK | 0.93 | 0.79 | 0.86 | 0.82 |
| CPC2 | 0.96 | 0.91 | | 0.91 |
| FEELnc | 0.91 | 0.93 | 0.92 | 0.92 |
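The four measures in this table follow directly from the counts of a binary confusion matrix. The sketch below shows the usual definitions in Python. The function name and the example counts are hypothetical: the counts were chosen so that, on a balanced 200 + 200 set, they reproduce the rounded PLEK row, and the source does not state which class was treated as positive.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fn: positives classified correctly/incorrectly
    tn/fp: negatives classified correctly/incorrectly
    """
    return {
        "sensitivity": tp / (tp + fn),            # recall on the positive class
        "specificity": tn / (tn + fp),            # recall on the negative class
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": tp / (tp + fp),              # share of positive calls that are correct
    }


# Hypothetical counts for a 200 + 200 benchmark (assumption, see lead-in).
example = confusion_metrics(tp=186, fn=14, tn=158, fp=42)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

On a balanced set such as this one, accuracy equals the mean of sensitivity and specificity, which gives a quick consistency check on any fully reported row.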
Figure 2. Sankey diagrams for the input dataset consisting of 200 randomly chosen lncRNA, 200 coding transcripts, dinucleotide shuffled versions of the latter and 194 randomly chosen sequences from the human genome (hg38) and corresponding annotation (RefSeq database [40] v38) and their assignment to coding or non-coding classes by five high-impact classification tools. All tools were run with standard settings where applicable.