Shiyao Feng, Yanchun Liang, Wei Du, Wei Lv, Ying Li.
Abstract
Recent studies have shown that the subcellular location of long non-coding RNAs (lncRNAs) provides significant information about their function. Because experimental data are scarce, the number of lncRNAs with experimentally verified subcellular localization is very limited, and the numbers of lncRNAs located in different organelles are highly imbalanced. Predicting the subcellular location of lncRNAs is therefore a multi-class, small-sample, imbalanced classification problem. The data imbalance leads machine learning models to recognize small classes poorly, which remains a challenging problem in existing research. In this study, we integrate multi-source features to construct a sequence-based computational tool, lncLocation, to predict the subcellular location of lncRNAs. An autoencoder is used to enhance part of the features, and a binomial distribution-based filtering method and recursive feature elimination (RFE) are used to filter some of the features. This improves the representational ability of the data and mitigates the problem of imbalanced multi-class data. Through comprehensive experiments on different feature combinations and machine learning models, we select the optimal feature and classifier scheme to construct the subcellular location prediction tool, lncLocation. On the benchmark data, lncLocation achieves 87.78% accuracy using 5-fold cross-validation, which is higher than state-of-the-art tools, and classification performance, especially for small classes, is improved significantly.
Keywords: logarithm-distance of Hexamer; multi-source features; subcellular location; the binomial distribution-based filtering
Year: 2020 PMID: 33019721 PMCID: PMC7582431 DOI: 10.3390/ijms21197271
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
The comparison of basic features on different models.
| Feature | Method | Precision | Recall | F-Score | Accuracy |
|---|---|---|---|---|---|
| K-tuple features | | 0.3622 | 0.2709 | 0.2388 | |
| | Autoencoder(8-mer) + RF | 0.3558 | 0.2701 | 0.2379 | 0.6654 |
| | Autoencoder(8-mer) + LR | 0.2081 | 0.2506 | 0.2040 | 0.6460 |
| | Autoencoder(8-mer) + XGBoost | 0.3271 | 0.2741 | 0.2487 | 0.6559 |
| | Autoencoder(8-mer) + LightGBM | 0.3031 | 0.2649 | 0.2308 | 0.6573 |
| | Autoencoder(8-mer) + EDP + SVM | 0.3888 | 0.2682 | 0.2331 | 0.6647 |
| | | 0.2938 | 0.2712 | 0.2376 | |
| | Autoencoder(8-mer) + EDP + LR | 0.3787 | 0.2906 | 0.2790 | 0.6430 |
| | Autoencoder(8-mer) + EDP + XGBoost | 0.3315 | 0.2716 | 0.2464 | 0.6522 |
| | Autoencoder(8-mer) + EDP + LightGBM | 0.2946 | 0.2668 | 0.2325 | 0.6606 |
| Properties of open reading frame | SVM | 0.1622 | 0.2500 | 0.1967 | 0.6488 |
| | RF | 0.3596 | 0.2863 | 0.2748 | 0.6387 |
| | | 0.2641 | 0.2575 | 0.2120 | |
| | XGBoost | 0.3023 | 0.2644 | 0.2404 | 0.6265 |
| | LightGBM | 0.2477 | 0.2526 | 0.2098 | 0.6457 |
| Fickett nucleotide features | SVM | 0.2843 | 0.2560 | 0.2120 | 0.6497 |
| | | 0.3108 | 0.2814 | 0.2633 | |
| | LR | 0.1985 | 0.2633 | 0.2167 | 0.6539 |
| | XGBoost | 0.3874 | 0.2946 | 0.2910 | 0.6366 |
| | LightGBM | 0.3636 | 0.2904 | 0.2844 | 0.6338 |
| Physicochemical properties | SVM | 0.3232 | 0.2564 | 0.2098 | 0.6549 |
| | RF | 0.2740 | 0.2673 | 0.2495 | 0.6127 |
| | LR | 0.3449 | 0.2629 | 0.2229 | 0.6636 |
| | XGBoost | 0.2752 | 0.2649 | 0.2399 | 0.6268 |
| | | 0.4111 | 0.3913 | 0.3728 | |
| Multi-scale secondary structures | | 0.5076 | 0.4590 | 0.4356 | |
| | RF | 0.4204 | 0.4171 | 0.4000 | 0.6927 |
| | LR | 0.2648 | 0.2574 | 0.2133 | 0.6576 |
| | XGBoost | 0.4318 | 0.4122 | 0.4023 | 0.6928 |
| | LightGBM | 0.4248 | 0.4040 | 0.3870 | 0.7042 |
For testing purposes, the autoencoder converts the 65,536-dimensional 8-mer data into a 128-dimensional output. The encoder consists of a 65,536-dimensional input layer and three intermediate layers with 4096, 1024, and 256 nodes, respectively. The decoder mirrors the encoder. The 8-mer representation of each sequence is thus converted into a 128-dimensional real-valued vector. EDP denotes the combination of the EDP of the 2-mer and the EDP of the ORF.
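The 65,536-dimensional input mentioned in the note above follows from counting every possible 8-mer over a four-letter alphabet (4^8 = 65,536). The sketch below shows one way such a vector could be built in plain Python. The function names, the normalisation by window count, and the handling of ambiguous bases are assumptions made for illustration, not details taken from the paper; only the layer widths come from the note.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every possible k-mer over ACGT to a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_frequencies(seq, k, index=None):
    """Return a normalised k-mer frequency vector of length 4**k.

    Assumption: windows containing characters outside ACGT (e.g. N)
    are skipped, and counts are divided by the number of valid windows.
    """
    index = index or kmer_index(k)
    counts = [0.0] * len(index)
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    valid = 0
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:
            counts[pos] += 1
            valid += 1
    return [c / valid for c in counts] if valid else counts


# Layer widths as stated in the note: input, three hidden layers, bottleneck.
ENCODER_WIDTHS = [4 ** 8, 4096, 1024, 256, 128]

if __name__ == "__main__":
    vec = kmer_frequencies("ACGUACGGAUCCAUGCAUGC", 8)
    print(len(vec), ENCODER_WIDTHS)
```

For realistic lncRNA lengths, a vector of this size is very sparse, which is one plausible motivation for compressing it to 128 dimensions before classification.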
Figure 1. New fea.Tuple training results on each model.
Figure 2. New fea.Bio training results on each model.
Figure 3. Connecting new fea.Tuple and new fea.Bio training results on each model.
The comparison between lncLocation and state-of-the-art predictors.
| Location | lncLocation Precision | lncLocation Recall | lncLocation Overall Accuracy | iLoc-lncRNA Precision | iLoc-lncRNA Recall | iLoc-lncRNA Overall Accuracy | lncLocator Precision | lncLocator Recall | lncLocator Overall Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| Nucleus | 0.9583 | 0.7419 | 0.8778 | 0.9759 | 0.7756 | 0.8672 | 0.9217 | 0.3815 | 0.6650 |
| Cytoplasm | 0.8500 | 1.0000 | | 0.6768 | 0.9906 | | 0.3636 | 0.8801 | |
| Ribosome | 1.0000 | 0.5556 | | 0.9983 | 0.4651 | | 0.9753 | 0.0700 | |
| Exosome | 1.0000 | 0.3333 | | 1.0000 | 0.1667 | | 0.9727 | 0.0400 | |
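The comparison table above reports precision and recall per location but only one overall accuracy per tool. A minimal sketch of how these quantities relate to a confusion matrix is given below. The counts in `EXAMPLE` are invented purely for illustration and are not results from the paper, and the function name is hypothetical.

```python
LABELS = ["Nucleus", "Cytoplasm", "Ribosome", "Exosome"]


def per_class_metrics(confusion, labels):
    """Compute per-class precision/recall and overall accuracy.

    confusion[i][j] is the number of samples whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    report = {}
    for i, name in enumerate(labels):
        tp = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(n))  # column sum
        actual = sum(confusion[i])                          # row sum
        report[name] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report, correct / total


# Hypothetical counts, invented for illustration only (not from the paper).
EXAMPLE = [
    [20, 10, 0, 0],  # true Nucleus
    [1, 60, 0, 0],   # true Cytoplasm
    [0, 4, 4, 0],    # true Ribosome
    [0, 3, 0, 2],    # true Exosome
]

if __name__ == "__main__":
    report, accuracy = per_class_metrics(EXAMPLE, LABELS)
    for name, m in report.items():
        print(f"{name:<10} precision={m['precision']:.4f} recall={m['recall']:.4f}")
    print(f"overall accuracy={accuracy:.4f}")
```

In this made-up example the two small classes reach a precision of 1.0 while their recall stays at 0.5 or below, because every missed sample is absorbed by the large cytoplasm class. The same pattern of high precision with low recall is visible for the ribosome and exosome rows of the table.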
Figure 4. Input and output part of the screenshot of the lncLocation web server.
Figure 5. Pie chart of the distribution ratio of lncRNA in four organelles.
Figure 6. The flowchart of lncLocation. (A) Multi-source feature extraction; (B) Feature learning and model selection.
Benchmark lncRNA subcellular localization dataset.
| Subcellular Localizations | Support Number |
|---|---|
| Cytoplasm | 426 |
| Nucleus | 156 |
| Ribosome | 43 |
| Exosome | 30 |
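The support numbers above make the degree of imbalance easy to quantify. The short sketch below computes each class's share of the benchmark, the ratio between the largest and smallest class, and the accuracy of a trivial classifier that always predicts the largest class. The counts are taken from the table, while the function names are illustrative.

```python
SUPPORT = {"Cytoplasm": 426, "Nucleus": 156, "Ribosome": 43, "Exosome": 30}


def class_shares(support):
    """Fraction of the dataset that belongs to each class."""
    total = sum(support.values())
    return {name: count / total for name, count in support.items()}


def majority_baseline_accuracy(support):
    """Accuracy of a classifier that always predicts the largest class."""
    return max(support.values()) / sum(support.values())


def imbalance_ratio(support):
    """Size of the largest class divided by the size of the smallest."""
    return max(support.values()) / min(support.values())


if __name__ == "__main__":
    for name, share in class_shares(SUPPORT).items():
        print(f"{name:<10} {SUPPORT[name]:>4}  {share:.1%}")
    print(f"total samples:     {sum(SUPPORT.values())}")
    print(f"majority baseline: {majority_baseline_accuracy(SUPPORT):.4f}")
    print(f"imbalance ratio:   {imbalance_ratio(SUPPORT):.1f}")
```

With 426 of 655 samples in the cytoplasm class, the always-cytoplasm baseline already reaches an accuracy of about 0.65. Accuracy values near that level therefore say little about the minority classes, so per-class precision and recall are the more informative measures on this benchmark.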