Yan Wang1,2, Xiaopeng Zhu1, Lili Yang1,3, Xuemei Hu1, Kai He1, Cuinan Yu1, Shaoqing Jiao1, Jiali Chen1, Rui Guo1, Sen Yang4.
Abstract
Long non-coding RNAs (lncRNAs) play a crucial role in many cellular processes, such as genetic markers, RNA splicing, signaling, and protein regulation. Because identifying the localization of an lncRNA in the cell through experimental methods is complicated, hard to reproduce, and expensive, we propose a novel method named IDDLncLoc, which adopts an ensemble model to predict lncRNA subcellular localization. In the proposed model, dinucleotide-based auto-cross covariance features, k-mer nucleotide composition features, and composition, transition, and distribution features are introduced to encode a raw RNA sequence as a vector. To screen out reliable features, feature selection based on the binomial distribution and recursive feature elimination is employed. Furthermore, oversampling in mini-batches, random sampling, and stacking ensemble strategies are customized to overcome the data imbalance in the benchmark dataset. Finally, IDDLncLoc achieves an accuracy of 94.96% on the benchmark dataset, which is 2.59% higher than the best of the latest methods, demonstrating that IDDLncLoc performs well on the subcellular localization of lncRNA. In addition, a user-friendly web server is available at http://lncloc.club.
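As an illustration of one of the sequence encodings named in the abstract, the following is a minimal sketch of k-mer nucleotide composition features in Python. It is not the authors' implementation: the function name `kmer_composition`, the `ACGU` alphabet, the conversion of T to U, the skipping of windows with ambiguous bases, and the normalization to frequencies are all assumptions made for this example.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed order (hypothetical helper)."""
    # Assumption: treat DNA-style input as RNA by mapping T to U.
    seq = seq.upper().replace("T", "U")
    # Enumerate all len(alphabet)**k possible k-mers in a fixed order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k over the sequence, one position at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy sequence, not taken from the benchmark dataset.
features = kmer_composition("AUGGCUAUG", k=2)
```

With `k=2` this yields a 16-dimensional vector per sequence. Such vectors could then be concatenated with other encodings before feature selection, although how IDDLncLoc combines them is not specified in the abstract.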
Keywords: Ensemble model; Imbalanced learning; Sequence feature; Subcellular localization of lncRNA
Year: 2022 PMID: 35192174 DOI: 10.1007/s12539-021-00497-6
Source DB: PubMed Journal: Interdiscip Sci ISSN: 1867-1462 Impact factor: 2.233