Junyi Li1, Huinian Li2, Xiao Ye2, Li Zhang2, Qingzhe Xu2, Yuan Ping2, Xiaozhu Jing2, Wei Jiang2, Qing Liao2, Bo Liu3, Yadong Wang4,5.
Abstract
BACKGROUND: The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as more and more evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of biomedical big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs.
Keywords: Generalized topological entropy; Information entropy; Long non-coding RNA; Machine learning
Year: 2021 PMID: 33980144 PMCID: PMC8117603 DOI: 10.1186/s12859-020-03884-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. (a) Feature importance of human GRCh37 data based on information entropy and ORF; (b) feature importance of human GRCh38 data based on information entropy and ORF
Fig. 2. Experimental results based on the GRCh37 version of the human genome: (a) ROC curve of the SVM algorithm; (b) ROC curve of the random forest algorithm; (c) ROC curve of the XGBoost algorithm; (d) PR curve of the SVM algorithm; (e) PR curve of the random forest algorithm; (f) PR curve of the XGBoost algorithm
Fig. 3. Experimental results based on the GRCh38 version of the human genome: (a) ROC curve of the SVM algorithm; (b) ROC curve of the random forest algorithm; (c) ROC curve of the XGBoost algorithm; (d) PR curve of the SVM algorithm; (e) PR curve of the random forest algorithm; (f) PR curve of the XGBoost algorithm
Fig. 4. (a) ROC curve of GRCh37; (b) PR curve of GRCh37; (c) ROC curve of GRCh38; (d) PR curve of GRCh38
Categorical original FASTA files of transcripts
| Transcript type | GRCh37 ncRNAs | GRCh37 PCTs | GRCh38 ncRNAs | GRCh38 PCTs |
|---|---|---|---|---|
| Number | 34,917 | 104,763 | 37,297 | 104,817 |
Fig. 5. Data processing flow chart
Categorical FASTA files of transcripts after data processing
| Processing step | GRCh37 ncRNAs | GRCh37 PCTs | GRCh38 ncRNAs | GRCh38 PCTs |
|---|---|---|---|---|
| After removing short | 24,513 | 94,830 | 28,628 | 94,527 |
| After deduplication | 21,965 | 41,134 | 24,863 | 41,200 |
| After data balancing | 21,965 | 21,965 | 24,863 | 24,863 |
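The three processing steps in the table above (removing short transcripts, deduplication, class balancing) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `preprocess`, the length cutoff `min_len`, and the use of random downsampling for balancing are all assumptions. The table only shows that the PCT count is reduced to match the ncRNA count.

```python
import random

def preprocess(ncrna, pct, min_len=200, seed=0):
    """Filter short sequences, drop exact duplicates, and balance two classes.

    ncrna, pct: lists of nucleotide sequence strings.
    min_len: hypothetical length cutoff (the source does not state its value).
    seed: fixes the random downsampling so the result is reproducible.
    """
    def clean(seqs):
        # Step 1: remove sequences shorter than the cutoff
        kept = [s.upper() for s in seqs if len(s) >= min_len]
        # Step 2: dict.fromkeys drops exact duplicates and preserves order
        return list(dict.fromkeys(kept))

    ncrna, pct = clean(ncrna), clean(pct)

    # Step 3: downsample the larger class to the size of the smaller one
    n = min(len(ncrna), len(pct))
    rng = random.Random(seed)
    if len(ncrna) > n:
        ncrna = rng.sample(ncrna, n)
    if len(pct) > n:
        pct = rng.sample(pct, n)
    return ncrna, pct
```

Calling `preprocess(ncrna_seqs, pct_seqs)` on two lists of sequences read from FASTA files would return two lists of equal length, mirroring the "After data balancing" row.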
Fig. 6. The flowchart of human lncRNA prediction based on the combination of information entropy and ORF features
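To make the two feature families named in this caption concrete, the sketch below computes one simple instance of each: the Shannon entropy of a sequence's k-mer distribution and the length of its longest forward-strand ORF. These are illustrative stand-ins under stated assumptions. The function names, the choice of k, and the ORF definition (ATG to the first in-frame stop codon) are not taken from the source, and the paper's own entropy features (for example generalized topological entropy) may be defined differently.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def kmer_entropy(seq, k=3):
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def longest_orf_length(seq):
    """Length (nt) of the longest ATG..stop ORF in the three forward frames."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best

def features(seq):
    """Assemble a small feature vector for one transcript (hypothetical layout)."""
    seq = seq.upper()
    return [kmer_entropy(seq, 1), kmer_entropy(seq, 3), longest_orf_length(seq)]
```

A vector like the one returned by `features` could then be passed to a classifier such as SVM, random forest, or XGBoost, the three algorithms compared in the figures.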