Giovanna M M Ventola, Teresa M R Noviello, Salvatore D'Aniello, Antonietta Spagnuolo, Michele Ceccarelli, Luigi Cerulo.
Abstract
BACKGROUND: The discovery that long non-coding RNAs are important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs among transcripts assembled from high-throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish coding from non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used successfully in the literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs.
Keywords: Classification; Feature selection; lncRNA
Year: 2017 PMID: 28335739 PMCID: PMC5364679 DOI: 10.1186/s12859-017-1594-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Distribution of different classes of transcripts among Human, Mouse, and Zebrafish in the Ensembl and Vega annotation databases
| Class | Ensembl Human | Ensembl Mouse | Ensembl Zebrafish | Vega (KNOWN status) Human | Vega (KNOWN status) Mouse | Vega (KNOWN status) Zebrafish |
|---|---|---|---|---|---|---|
| PCT | 79851 | 50607 | 41695 | 71030 | 41569 | 11051 |
| LincRNA | 13473 | 5362 | 1039 | 13365 | 4711 | 1004 |
| Intronic | 977 | 277 | 58 | 973 | 277 | 57 |
| Overlapping | 343 | 47 | 9 | 342 | 45 | 9 |
| Antisense | 11186 | 3208 | 711 | 11141 | 3122 | 699 |
| Pseudogene | 14537 | 9442 | 261 | 14491 | 9066 | 199 |
| Other ncRNA | 78167 | 43609 | 11664 | 68265 | 37421 | 5703 |
| IG/TR genes | 434 | 642 | | 413 | 496 | |
| Total | 198968 | 113194 | 55437 | 180020 | 96707 | 18722 |
Signatures detected in top 20 ranked features (Zebrafish)
| Signature # | Algorithm groups | BASIC | CONS | NUCLEO | ORF | REPS | AUPR (AUC) |
|---|---|---|---|---|---|---|---|
| 1 | IG | TxExLenAvg, TxLen, TxNex | ph8m, py8m, py8mn | AAT, ACG, ATT, CCG, CG, CGA, CGC, CGG, FickScore, GAG, GG, GGA, GGC, TA, TAA, TCG, TGG, TT, TTG | KOZAK, OrfProp | | 0.39 (0.90) |
| 2 | RFS | TxExLenAvg, TxLen, TxNex | ph8m, py8m, py8mn | FickScore | KOZAK, OrfProp | DNA.DNA, DNA.hAT, DNA.hAT.Ac, DNA.hAT.Charlie, DNA.hAT.Tip100, DNA.Kolobok, DNA.PiggyBac, DNA.TcMar.Tc1, LINE.L2, SINE.5S, SINE.V | 0.32 (0.87) |
| 3 | LR, EN, EFmn | TxExLenAvg, TxLen, TxNex | ph8m, py8m, py8mn | AA, AAT, ACA, ACT, AGT, CAT, CGC, CTA, CTC, FickScore, GAG, GC, GCC, GGA, TAA, TAC, TCC, TGA, TGG, TTG | KOZAK, OrfProp | | 0.41 (0.90) |
| 4 | WT | TxNex | | AAC, AAT, ACA, ACC, ACG, AG, AGC, AGG, AT, CAG, CCA, CCG, CG, CGA, GG, TC | | | 0.36 (0.89) |
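
The NUCLEO column above lists di- and tri-nucleotide names such as CG or TAA. As a minimal sketch of how such features could be derived from a transcript sequence, the block below computes relative k-mer frequencies for k = 2 and k = 3. The helper name `kmer_frequencies`, the use of overlapping windows, and the normalization by the number of valid windows are assumptions made for illustration; the record does not state how the authors computed these values.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer in seq (overlapping windows).

    Hypothetical helper: overlapping counting and normalization by the
    number of valid windows are assumptions, not taken from the paper.
    """
    seq = seq.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows that contain ambiguous bases such as N.
    valid = [w for w in windows if set(w) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }


# Toy sequence (not real data): 16 dinucleotide + 64 trinucleotide features.
features = {}
for k in (2, 3):
    features.update(kmer_frequencies("ATGCGCGTAA", k))

print(features["CG"], features["TAA"])
```

Each transcript then contributes one fixed-length vector of 80 values, which is the form a feature selection algorithm would need.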
Fig. 1 Analysis workflow. The analysis workflow adopted to obtain the signatures
Univariate ranked features according to their AUPR (AUC)
| Rank | Human feature | Human AUPR (AUC) | Mouse feature | Mouse AUPR (AUC) | Zebrafish feature | Zebrafish AUPR (AUC) |
|---|---|---|---|---|---|---|
| 1 | ph100m | 0.62 (0.92) | phm | 0.43 (0.90) | py8m | 0.27 (0.83) |
| 2 | ph20m | 0.54 (0.91) | py60m | 0.36 (0.90) | ph8m | 0.25 (0.79) |
| 3 | ph20mx | 0.52 (0.89) | phmx | 0.34 (0.87) | TxLen | 0.18 (0.72) |
| 4 | py100mx | 0.52 (0.91) | py60mx | 0.32 (0.88) | FickScore | 0.17 (0.73) |
| 5 | py100m | 0.48 (0.91) | phmn | 0.25 (0.81) | TxNex | 0.16 (0.77) |
| 6 | py20m | 0.43 (0.89) | CG | 0.16 (0.70) | GG | 0.15 (0.66) |
| 7 | TxNex | 0.26 (0.76) | GCG | 0.15 (0.68) | TAA | 0.15 (0.67) |
| 8 | ph20mn | 0.25 (0.77) | CGC | 0.14 (0.67) | AAT | 0.15 (0.65) |
| 9 | CG | 0.24 (0.69) | CGA | 0.14 (0.67) | GAG | 0.15 (0.65) |
| 10 | FickScore | 0.23 (0.76) | CCG | 0.13 (0.67) | GGA | 0.14 (0.65) |
| 11 | CGA | 0.22 (0.68) | CGG | 0.13 (0.68) | KOZAK | 0.14 (0.67) |
| 12 | TCG | 0.21 (0.66) | ACA | 0.13 (0.63) | GGC | 0.13 (0.65) |
| 13 | CCG | 0.21 (0.67) | FickScore | 0.13 (0.73) | TCG | 0.13 (0.63) |
| 14 | TxLen | 0.19 (0.66) | TCG | 0.13 (0.65) | ATT | 0.13 (0.63) |
| 15 | KOZAK | 0.17 (0.65) | CGT | 0.12 (0.63) | CG | 0.13 (0.62) |
| 16 | CGT | 0.17 (0.62) | GC | 0.12 (0.65) | TTG | 0.13 (0.59) |
| 17 | ACA | 0.17 (0.60) | CAT | 0.12 (0.59) | TGG | 0.13 (0.64) |
| 18 | ACG | 0.17 (0.63) | ACG | 0.12 (0.64) | CGG | 0.13 (0.63) |
| 19 | ACT | 0.16 (0.60) | ACT | 0.12 (0.61) | CGA | 0.13 (0.62) |
| 20 | TCT | 0.16 (0.61) | GGC | 0.11 (0.64) | CCG | 0.12 (0.62) |
| 21 | TGG | 0.15 (0.61) | TxNex | 0.11 (0.73) | TT | 0.12 (0.61) |
| 22 | AAT | 0.15 (0.63) | KOZAK | 0.10 (0.65) | TA | 0.12 (0.62) |
| 23 | GTG | 0.15 (0.60) | CTA | 0.10 (0.59) | AG | 0.12 (0.60) |
| 24 | GG | 0.15 (0.62) | TxLen | 0.10 (0.64) | AT | 0.12 (0.62) |
| 25 | ATA | 0.15 (0.61) | AC | 0.09 (0.59) | CAG | 0.12 (0.58) |
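
The table above ranks single features by AUPR, with AUC in parentheses. The following sketch shows one way such a univariate ranking could be produced, treating each raw feature value as a classifier score. It assumes that higher values indicate the positive class and estimates AUPR as average precision; the feature names `featA` and `featB` and all values are toy data, and none of this is confirmed as the authors' exact procedure.

```python
def auc_roc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Group tied scores and give them their average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(scores, labels):
    """AUPR estimated as average precision (ties broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / sum(labels)


# Toy data: 1 marks the positive class, 0 the negative class.
labels = [1, 0, 1, 1, 0, 0]
toy_features = {
    "featA": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    "featB": [0.1, 0.9, 0.2, 0.3, 0.8, 0.7],
}

# Rank features by AUPR, best first, as in the table.
ranking = sorted(
    toy_features,
    key=lambda name: average_precision(toy_features[name], labels),
    reverse=True,
)
for name in ranking:
    values = toy_features[name]
    print(name, round(average_precision(values, labels), 2),
          round(auc_roc(values, labels), 2))
```

Reporting both numbers is useful when classes are imbalanced, because AUPR reacts to false positives among the top-ranked transcripts more strongly than AUC does.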
Fig. 2 Signature stability. Stability of signatures averaged over 100 bootstraps for each feature selection algorithm (average stability on the y-axis)
Signatures detected in top 20 ranked features (Human)
| Signature # | Algorithm groups | BASIC | CONS | NUCLEO | ORF | REPS | AUPR (AUC) |
|---|---|---|---|---|---|---|---|
| 1 | IG, RFS, RF, EFmn | TxExLenAvg, TxLen, TxNex | ph100m, ph20m, ph20mn, ph20mx, py100m, py100mx, py20m | AA, AAT, AT, ATA, CA, CC, CCG, CG, CGA, CGT, FickScore, GC, GG, GT, GTG, TA, TAT, TCG, TT, TTA | KOZAK, OrfProp | DNA.TcMar.Tigger, LINE.L1, LTR.ERV1, LTR.ERVL, LTR.ERVL.MaLR, SINE.Alu, SINE.MIR | 0.69 (0.94) |
| 2 | GR | TxExLenAvg | ph100m, ph20m, ph20mx, py100m, py100mx, py20m | ATC, ATG, CA, CAC | | DNA.DNA, DNA.hAT.Blackjack, DNA.MULE.MuDR, DNA.PiggyBac, DNA.TcMar.Tc2, LINE.Penelope, LTR.LTR, RC.Helitron, SINE.MIR | 0.55 (0.92) |
| 3 | GFS | TxExLenAvg, TxLen, TxNex | ph100m, ph20mx, py100m, py20m | AA, ACC, CA, CAG, CTA, FickScore, GAT, GT, TAC, TAT, TGG | KOZAK | LINE.Penelope | 0.67 (0.94) |
| 4 | LR, EN | TxLen, TxNex | ph100m, ph20m, ph20mx, py100m, py100mx | AA, AAT, ACA, ACT, CA, CAA, CAC, CG, CGA, FickScore, GG, GT, GTG, TAC, TCT, TGA, TGG | KOZAK | | 0.66 (0.94) |
| 5 | WT | TxExLenAvg, TxNex | | AAC, AAG, AC, ACA, ACC, ACG, ACT, AGA, AGC, AGT, ATA, CA, CT, GA, GT, TA, TC, TG | | | 0.66 (0.94) |
Signatures detected in top 20 ranked features (Mouse)
| Signature # | Algorithm groups | BASIC | CONS | NUCLEO | ORF | REPS | AUPR (AUC) |
|---|---|---|---|---|---|---|---|
| 1 | IG | TxNex | phm, phmn, phmx, py60m, py60mx | ACA, ACG, CCG, CG, CGA, CGC, CGG, CGT, FickScore, GC, GCG, TAA, TCG | KOZAK | | 0.47 (0.92) |
| 2 | GR | | phm, phmn, phmx, py60m, py60mx | ACA, AGA, AT, CA, CAA, CAT, CG, TGA | | DNA.hAT.Charlie, LINE.RTE.BovB, LINE.RTE.X, LTR.ERVL.MaLR, SINE.ID, SINE.MIR, SINE.tRNA | 0.40 (0.91) |
| 3 | RFS | TxExLenAvg, TxLen, TxNex | phm, phmn, phmx, py60m, py60mx | AA, FickScore | KOZAK, OrfProp | LINE.L1, LTR.ERV1, LTR.ERVK, LTR.ERVL, LTR.ERVL.MaLR, SINE.Alu, SINE.B2, SINE.B4 | 0.44 (0.92) |
| 4 | GFS, LR, EN | TxExLenAvg, TxLen, TxNex | phm, phmn, phmx, py60m, py60mx | AAC, AAG, AC, ACA, ACT, AGT, CAC, CAG, CAT, CGT, CTT, FickScore, GAT, GT, GTA, GTC, GTG, TAA, TAC, TAT | KOZAK | | 0.51 (0.93) |
| 5 | RF, EFmn | TxExLenAvg, TxLen, TxNex | phm, phmn, phmx, py60m, py60mx | AA, AC, ACA, AGA, CAC, CAT, CCG, CG, CGC, CGG, FickScore, GC, GGC, GT, TAA, TAT, TT | KOZAK | | 0.51 (0.93) |
| 6 | WT | TxExLenAvg, TxNex | | AAC, AAG, AAT, AC, ACA, ACC, ACG, ACT, AGA, AGC, AT, CA, CG, CT, GT, TA, TC, TG | | | 0.46 (0.92) |
Performance of tested tools (coding/non-coding tools and SVM classifiers; average Precision/Recall/Accuracy with 95% CI)
| Species / metric | CPAT | CPC | PLEK | iSeeRNA | All features | Signature 1 | Signature 2 | Signature 3 | Signature 4 | Signature 5 | Signature 6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| human | |||||||||||
| Precision | 0.71 (± 0.02) | 0.81 (± 0.02) | 0.67 (± 0.01) | 0.97 (± 0.01) | 0.97 (± 0.01) | 0.97 (± 0.01) | 0.97 (± 0.01) | 0.96 (± 0.01) | 0.97 (± 0.01) | 0.98 (± 0.02) | − |
| Recall | 0.96 (± 0.01) | 0.66 (± 0.03) | 0.98 (± 0.01) | 0.91 (± 0.02) | 0.93 (± 0.01) | 0.94 (± 0.02) | 0.91 (± 0.02) | 0.94 (± 0.01) | 0.93 (± 0.01) | 0.82 (± 0.04) | − |
| Accuracy | 0.78 (± 0.01) | 0.75 (± 0.02) | 0.74 (± 0.01) | 0.94 (± 0.01) | 0.95 (± 0.01) | 0.95 (± 0.01) | 0.94 (± 0.01) | 0.95 (± 0.01) | 0.95 (± 0.01) | 0.90 (± 0.02) | − |
| mouse | |||||||||||
| Precision | 0.73 (± 0.01) | 0.84 (± 0.02) | 0.66 (± 0.01) | 0.99 (± 0.01) | 0.99 (± 0.01) | 0.99 (± 0.01) | 0.99 (± 0.01) | 0.99 (± 0.01) | 0.99 (± 0.01) | 0.99 (± 0.01) | 0.99 (± 0.01) |
| Recall | 0.96 (± 0.01) | 0.77 (± 0.02) | 0.91 (± 0.02) | 0.87 (± 0.02) | 0.91 (± 0.02) | 0.90 (± 0.01) | 0.88 (± 0.02) | 0.90 (± 0.02) | 0.91 (± 0.02) | 0.91 (± 0.01) | 0.82 (± 0.07) |
| Accuracy | 0.81 (± 0.01) | 0.81 (± 0.01) | 0.71 (± 0.02) | 0.93 (± 0.01) | 0.95 (± 0.02) | 0.94 (± 0.01) | 0.94 (± 0.01) | 0.94 (± 0.01) | 0.95 (± 0.01) | 0.95 (± 0.01) | 0.90 (± 0.04) |
| zebrafish | |||||||||||
| Precision | 0.83 (± 0.01) | 0.85 (± 0.01) | 0.72 (± 0.01) | 1.00 (± 0.01) | 1.00 (± 0.01) | 1.00 (± 0.01) | 1.00 (± 0.00) | 1.00 (± 0.00) | 1.00 (± 0.01) | − | − |
| Recall | 0.86 (± 0.03) | 0.74 (± 0.03) | 0.82 (± 0.04) | 0.86 (± 0.03) | 0.91 (± 0.01) | 0.91 (± 0.02) | 0.90 (± 0.02) | 0.96 (± 0.02) | 0.80 (± 0.06) | − | − |
| Accuracy | 0.84 (± 0.02) | 0.81 (± 0.02) | 0.74 (± 0.02) | 0.89 (± 0.02) | 0.94 (± 0.01) | 0.93 (± 0.01) | 0.93 (± 0.01) | 0.97 (± 0.02) | 0.85 (± 0.04) | − | − |
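
The three metrics in the table above follow directly from a confusion matrix. The sketch below computes them and then summarizes repeated runs as a mean with a 95% interval. The confusion-matrix counts and per-run values are toy numbers, and the normal-approximation interval (1.96 standard errors) is an assumption: the record does not say how the reported intervals were obtained.

```python
from math import sqrt
from statistics import mean, stdev


def precision_recall_accuracy(tp, fp, tn, fn):
    """Standard definitions from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy


def mean_with_ci95(values):
    """Mean and 95% half-width under a normal approximation (assumed method)."""
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return mean(values), half_width


# Toy confusion matrix for one evaluation run (not data from the paper).
precision, recall, accuracy = precision_recall_accuracy(tp=90, fp=3, tn=97, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.3f}")

# Toy recall values from three hypothetical repeated runs.
avg, half = mean_with_ci95([0.90, 0.92, 0.94])
print(f"recall = {avg:.2f} (± {half:.2f})")
```

A classifier with precision near 1.00 and lower recall, as in several columns above, makes few false positive calls but misses part of the positive class.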
Pauli et al. [21] novel transcripts predicted with different zebrafish signatures
| Training Features | Predicted lncRNAs | Pauli et al. lncRNAs | Intersection | Fraction | AUC |
|---|---|---|---|---|---|
| Signature 1 | 17154 | 1133 | 1035 | 0.91 | 0.87 |
| Signature 2 | 17305 | 1133 | 995 | 0.88 | 0.87 |
| Signature 3 | 17198 | 1133 | 1039 | 0.92 | 0.87 |
| Signature 4 | 18615 | 1133 | 962 | 0.85 | 0.81 |
| iSeeRNA | 17077 | 1133 | 951 | 0.84 | 0.82 |
| All | 9366 | 1133 | 738 | 0.65 | 0.78 |
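
The Fraction column in the table above is consistent with the Intersection count divided by the 1133 lncRNAs of Pauli et al., rounded to two decimals. A short check of that arithmetic, using only the numbers in the table (the variable names are illustrative):

```python
PAULI_LNCRNAS = 1133  # size of the Pauli et al. set, from the table

# Intersection counts copied from the table.
intersections = {
    "Signature 1": 1035,
    "Signature 2": 995,
    "Signature 3": 1039,
    "Signature 4": 962,
    "iSeeRNA": 951,
    "All": 738,
}

# Fraction of the Pauli et al. lncRNAs recovered by each feature set.
fractions = {
    name: round(count / PAULI_LNCRNAS, 2)
    for name, count in intersections.items()
}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2f}")
```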
Fig. 3 Co-expression with neighbor protein coding genes evaluated for transcripts classified with different zebrafish signatures. Co-expression with neighbor protein coding genes is evaluated with the absolute Spearman correlation for transcripts classified with different zebrafish signatures and at different kb windows. In parentheses, the p-value of a one-tailed Wilcoxon test between the lncRNA-PCT and PCT-PCT (gold standard) distributions
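
As a rough illustration of the co-expression measure named in the Fig. 3 caption, the sketch below computes the absolute Spearman correlation between two expression profiles as the Pearson correlation of their average ranks. The function names and expression values are hypothetical, and the selection of neighbors within a kb window and the Wilcoxon comparison are left out.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def abs_spearman(x, y):
    """Absolute Spearman correlation between two expression profiles."""
    return abs(pearson(average_ranks(x), average_ranks(y)))


# Toy expression profiles across five samples (not real data).
lncrna_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
neighbor_pct_expr = [10.0, 8.0, 6.5, 3.0, 1.0]
print(abs_spearman(lncrna_expr, neighbor_pct_expr))
```

Taking the absolute value treats strongly anti-correlated pairs the same as strongly correlated ones, so the toy pair above, which is perfectly monotone decreasing, scores as maximally co-expressed.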
Fig. 4 Ribosome Release Score evaluated for transcripts classified with different zebrafish signatures. The Ribosome Release Score (RRS), a relative measure of the abundance of ribosome reads in the ORF and 3’UTR regions, is evaluated for transcripts classified with different zebrafish signatures and for those belonging to the gold standard (Table 1). In parentheses, the p-value of a one-tailed Wilcoxon test between the PCT and lncRNA distributions
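
The Fig. 4 caption describes the RRS only as a relative measure of ribosome reads in the ORF versus the 3’UTR. One possible reading is sketched below: the ratio of ribosome footprint density in the ORF to that in the 3’UTR, divided by the same ratio for RNA-seq reads. The RNA-seq normalization, the length-normalized inputs, the pseudocount, and the function name are all assumptions and are not stated in the caption.

```python
def ribosome_release_score(ribo_orf, ribo_utr3, rna_orf, rna_utr3,
                           pseudocount=1e-9):
    """Sketch of an RRS-like ratio from read densities.

    Inputs are assumed to be length-normalized read densities. The
    RNA-seq normalization and the pseudocount are assumptions.
    """
    ribo_ratio = (ribo_orf + pseudocount) / (ribo_utr3 + pseudocount)
    rna_ratio = (rna_orf + pseudocount) / (rna_utr3 + pseudocount)
    return ribo_ratio / rna_ratio


# Toy densities (not real data). A translated transcript shows a sharp
# drop in ribosome footprints after the stop codon; a non-coding one
# does not.
coding_like = ribosome_release_score(50.0, 0.5, 20.0, 20.0)
noncoding_like = ribosome_release_score(5.0, 5.0, 20.0, 20.0)
print(round(coding_like, 2), round(noncoding_like, 2))
```

Under this reading, protein coding transcripts would be expected to score well above lncRNAs, which matches the direction of the comparison described in the caption.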