Shuai Liu1, Xiaohan Zhao1, Guangyan Zhang1, Weiyang Li1, Feng Liu2, Shichao Liu1, Wen Zhang3.
Abstract
Long non-coding RNAs (lncRNAs) are a class of RNAs longer than 200 base pairs (bp) that do not encode proteins; nevertheless, they have many vital biological functions. The development of high-throughput sequencing technology has led to the discovery of a large number of novel transcripts, so computational methods for lncRNA prediction are in great demand. In this paper, we consider global sequence features and propose a stacked ensemble learning-based method, abbreviated as PredLnc-GFStack, to predict lncRNAs from transcripts. We extract the critical features from the candidate feature list using a genetic algorithm (GA) and then employ stacked ensemble learning to construct the PredLnc-GFStack model. Computational experiments show that PredLnc-GFStack outperforms several state-of-the-art methods for lncRNA prediction. Furthermore, PredLnc-GFStack demonstrates an outstanding ability for cross-species ncRNA prediction.
Keywords: feature selection; genetic algorithm; global sequence features; lncRNA prediction; stacked ensemble learning
Year: 2019 PMID: 31484412 PMCID: PMC6770532 DOI: 10.3390/genes10090672
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Summary of datasets.
| Data Sources | Name | Coding RNAs | NcRNAs |
|---|---|---|---|
| GENCODE | Human-Main | 35760 | 20299 |
| | Human-Independent | 1500 | 1500 |
| | Mouse-Main | 23987 | 11746 |
| | Mouse-Independent | 1500 | 1500 |
| Preprocessed CPPred | Human-Testing | 8557 | 8241 |
| | Mouse-Testing | 31102 | 19930 |
| | Zebrafish-Testing | 15594 | 10662 |
| | Fruit-fly-Testing | 17400 | 4098 |
| | | 6713 | 413 |
| | Integrated-Testing | 13903 | 13903 |
NcRNA: Non-coding RNA.
All features considered in this paper and their dimensions.
| Types | Features (Dimension) |
|---|---|
| codon-related features | stop codon count (1), stop codon frequency (1), stop codon frame score (1), stop codon frequency frame score (1), nucleotide position frequencies (4), Fickett TESTCODE score (1) |
| Open reading frame (ORF)-related features | the first ORF length (1), the longest ORF length (1), the ORF coverage (2), the ORF integrity (1), ORF frame score (1), the entropy density profiles (EDP) of ORF (16) |
| GC-related features | GC (1), GC1 (1), GC2 (1), GC3 (1), GC frame score (1), UTR GC content (2) |
| coding sequence-related features | Coding sequence (CDS) length (1), CDS percentage (1), coding potential of the transcripts (CDS score) (1) |
| transcript-related features | transcript length (1) |
| structure-related features | Molecular weight (Mw) (1), isoelectric point (pI) (1), pI/Mw (1), pI/Mw frame score (1), Gravy (1), Instability index (1) |
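A few of the simpler features in the table can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation. The function names are hypothetical. It assumes GC1, GC2 and GC3 mean GC content at the first, second and third codon positions read in frame 0, and that an ORF runs from an ATG to the next in-frame stop codon on the forward strand. The table lists ORF coverage with dimension 2; only one plausible variant (longest ORF length divided by transcript length) is shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of G or C bases in seq (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def gc_at_codon_position(seq, position):
    """GC fraction at codon position 1, 2 or 3, reading in frame 0 (assumed)."""
    return gc_content(seq[position - 1::3])

def stop_codon_count(seq, frame=0):
    """Number of stop codons when seq is read in the given frame."""
    return sum(seq[i:i + 3] in STOP_CODONS
               for i in range(frame, len(seq) - 2, 3))

def longest_orf_length(seq):
    """Length (nt) of the longest ATG...stop ORF over the three forward frames."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # open an ORF at the first ATG
            elif codon in STOP_CODONS and start is not None:
                best = max(best, i + 3 - start)  # close it, stop codon included
                start = None
    return best

def basic_features(seq):
    """Return a small dictionary of sequence features for one transcript."""
    seq = seq.upper().replace("U", "T")
    orf = longest_orf_length(seq)
    return {
        "transcript_length": len(seq),
        "gc": gc_content(seq),
        "gc1": gc_at_codon_position(seq, 1),
        "gc2": gc_at_codon_position(seq, 2),
        "gc3": gc_at_codon_position(seq, 3),
        "stop_codon_count": stop_codon_count(seq),
        "longest_orf_length": orf,
        "orf_coverage": orf / len(seq) if seq else 0.0,
    }
```

Calling `basic_features("ggaugaaauagcc")` accepts RNA or DNA letters in either case, because the sequence is normalized to upper-case DNA first.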
Figure 1. The flowchart of the GA-RF (genetic algorithm-random forest) algorithm.
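The general shape of GA-based feature selection can be sketched as follows. This is a minimal, assumption-laden illustration rather than the GA-RF procedure itself. Each individual is a binary mask over the features. The fitness function here is a toy stand-in, where GA-RF would presumably score a mask by the performance of a random forest trained on the selected features. The tournament selection, one-point crossover, elitism and all parameter values are hypothetical choices.

```python
import random

def run_ga(n_features, fitness, pop_size=30, generations=40,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve binary feature masks; return (best_mask, best_fitness)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Pick two distinct individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)

# Toy fitness (hypothetical): reward a hidden set of "informative" features
# and charge a small cost per selected feature. A real run would replace
# this with a cross-validated score of a classifier on the masked features.
INFORMATIVE = {0, 3, 5, 8}

def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

best_mask, best_score = run_ga(n_features=12, fitness=toy_fitness)
```

Because the elite individual is copied unchanged into each new generation, the best fitness never decreases from one generation to the next.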
Figure 2. The architecture of the stacked ensemble learning method.
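The core idea of stacking is that out-of-fold predictions from several base learners become the input features of a meta-learner. It can be illustrated with a small standard-library sketch. Everything concrete here is an assumption made for illustration: the nearest-centroid base learners, the logistic-regression meta-learner, the fold scheme and the synthetic data are not taken from PredLnc-GFStack.

```python
import math
import random

def fit_centroids(X, y):
    """Toy base learner: store the mean feature vector of each class."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    neg = mean([x for x, label in zip(X, y) if label == 0])
    pos = mean([x for x, label in zip(X, y) if label == 1])
    return neg, pos

def centroid_score(model, x):
    """Score in [0, 1]; higher means x is closer to the positive centroid."""
    neg, pos = model
    d_neg, d_pos = math.dist(x, neg), math.dist(x, pos)
    return d_neg / (d_neg + d_pos + 1e-12)

def out_of_fold_scores(X, y, columns, k=5):
    """Score every sample with a base learner that never saw it in training."""
    view = [[row[c] for c in columns] for row in X]
    scores = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        model = fit_centroids([view[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            scores[i] = centroid_score(model, view[i])
    return scores

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-learner: logistic regression trained by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, label in zip(Z, y):
            logit = sum(wi * zi for wi, zi in zip(w, z)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            grad_w = [g + (p - label) * zi for g, zi in zip(grad_w, z)]
            grad_b += p - label
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return w, b

# Synthetic data (hypothetical): two features, labels alternate 0/1.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(2 * label - 1, 0.5), rng.gauss(2 * label - 1, 0.5)]
     for label in y]

# Level 0: one base learner per feature subset, scored out of fold.
base_columns = [[0], [1]]
Z = [list(pair) for pair in
     zip(*(out_of_fold_scores(X, y, cols) for cols in base_columns))]

# Level 1: the meta-learner is trained on the base learners' outputs.
w, b = fit_logistic(Z, y)
predictions = [int(sum(wi * zi for wi, zi in zip(w, z)) + b > 0) for z in Z]
accuracy = sum(p == label for p, label in zip(predictions, y)) / len(y)
```

Using out-of-fold scores rather than in-sample scores keeps the meta-learner from being trained on predictions that the base learners produced for their own training samples.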
Area under the curve (AUC) scores of models based on optimal feature subsets and sizes of optimal feature subsets for human and mouse.
| Optimal Feature Subset No. | Human AUC | Human Number of Features | Mouse AUC | Mouse Number of Features |
|---|---|---|---|---|
| 1 | 0.94979 | 134 | 0.96382 | 118 |
| 2 | 0.94946 | 137 | 0.96350 | 125 |
| 3 | 0.94940 | 131 | 0.96343 | 127 |
| 4 | 0.94934 | 136 | 0.96334 | 123 |
| 5 | 0.94929 | 138 | 0.96327 | 114 |
| 6 | 0.94929 | 134 | 0.96324 | 123 |
| 7 | 0.94923 | 129 | 0.96323 | 115 |
| 8 | 0.94916 | 127 | 0.96323 | 121 |
| 9 | 0.94913 | 137 | 0.96322 | 122 |
| 10 | 0.94910 | 128 | 0.96322 | 119 |
The performance of PredLnc-GFStack on Human-Main and Mouse-Main under 10-fold cross-validation (10-CV). ACC: accuracy; SN: sensitivity; SP: specificity; PRE: precision.
| Dataset | AUC | ACC | SN | SP | PRE | F1 |
|---|---|---|---|---|---|---|
| Human | 0.956 | 0.895 | 0.884 | 0.901 | 0.835 | 0.859 |
| Mouse | 0.969 | 0.914 | 0.875 | 0.933 | 0.865 | 0.870 |
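For reference, the threshold-based metrics in the table follow from a confusion matrix as sketched below. These are the standard definitions. The assumption that ncRNA is the positive class is not stated in the table, but it is consistent with the reported numbers: F1 recomputed from PRE and SN, and ACC recomputed from SN, SP and the Human-Main and Mouse-Main class sizes, both match the table to three decimals. The function names are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    pre = tp / (tp + fp)                     # precision
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    f1 = 2 * pre * sn / (pre + sn)           # harmonic mean of PRE and SN
    return {"ACC": acc, "SN": sn, "SP": sp, "PRE": pre, "F1": f1}

def f1_from(pre, sn):
    """F1 score from precision and sensitivity."""
    return 2 * pre * sn / (pre + sn)

def acc_from(sn, sp, n_pos, n_neg):
    """Accuracy implied by SN, SP and the class sizes."""
    return (sn * n_pos + sp * n_neg) / (n_pos + n_neg)

# Consistency checks against the reported rows, assuming ncRNA is positive.
human_f1 = round(f1_from(pre=0.835, sn=0.884), 3)
mouse_f1 = round(f1_from(pre=0.865, sn=0.875), 3)
human_acc = round(acc_from(sn=0.884, sp=0.901, n_pos=20299, n_neg=35760), 3)
mouse_acc = round(acc_from(sn=0.875, sp=0.933, n_pos=11746, n_neg=23987), 3)
```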
The performances of PredLnc-GFStack models on multi-species testing datasets.
| Training Dataset | Testing Dataset | AUC | ACC | SN | SP | PRE | F1 |
|---|---|---|---|---|---|---|---|
| Human-Main | Human-Testing | 0.995 | 0.968 | 0.962 | 0.974 | 0.973 | 0.967 |
| | Mouse-Testing | 0.987 | 0.941 | 0.879 | 0.981 | 0.968 | 0.921 |
| | Integrated-Testing | 0.985 | 0.907 | 0.831 | 0.982 | 0.979 | 0.899 |
| | Zebrafish-Testing | 0.971 | 0.901 | 0.772 | 0.989 | 0.980 | 0.863 |
| | Fruit-fly-Testing | 0.992 | 0.940 | 0.714 | 0.993 | 0.962 | 0.819 |
| | | 0.983 | 0.960 | 0.828 | 0.969 | 0.621 | 0.710 |
| Mouse-Main | Human-Testing | 0.977 | 0.887 | 0.807 | 0.964 | 0.955 | 0.875 |
| | Mouse-Testing | 0.995 | 0.944 | 0.869 | 0.992 | 0.985 | 0.924 |
| | Integrated-Testing | 0.984 | 0.871 | 0.757 | 0.985 | 0.981 | 0.855 |
| | Zebrafish-Testing | 0.971 | 0.843 | 0.626 | 0.991 | 0.979 | 0.764 |
| | Fruit-fly-Testing | 0.990 | 0.917 | 0.593 | 0.994 | 0.957 | 0.733 |
| | | 0.964 | 0.942 | 0.382 | 0.976 | 0.500 | 0.433 |
Figure 3. The performance comparison of PredLnc-GFStack and CPAT, CPC2, Longdist, CPPred on Human-Main and Mouse-Main. (a) The results of all models evaluated by 10-CV on Human-Main. (b) The results of all models evaluated by 10-CV on Mouse-Main.
Figure 4. The performances (AUC scores) of all models in cross-species prediction. (a) The results of all models trained on Human-Main. (b) The results of all models trained on Mouse-Main.