Jael Sanyanda Wekesa1,2, Yushi Luan3, Ming Chen4, Jun Meng5.
Abstract
The identification and analysis of long non-protein-coding RNAs (lncRNAs) are pervasive in transcriptome studies because of their roles in biological processes. In particular, lncRNA-protein interaction is plausibly relevant to the regulation of gene expression and to cellular processes such as pathogen resistance in plants. While lncRNA-protein interaction has been studied in animals, it has yet to be studied extensively in plants. In this paper, we propose a novel plant lncRNA-protein interaction prediction method, PLRPIM, which combines deep learning and shallow machine learning methods. The selection of an optimal feature subset and its subsequent efficient compression are significant challenges for deep learning models. The proposed method adopts k-mer features and extracts high-level abstract sequence-based features using a stacked sparse autoencoder. Based on the extracted features, a fusion of random forest (RF) and light gradient boosting machine (LGBM) is used to build the prediction model. Performance is evaluated on Arabidopsis thaliana and Zea mays datasets. Experimental results demonstrate PLRPIM's superiority over other prediction tools on the two datasets. Based on 5-fold cross-validation, we obtain accuracies of 89.98% and 93.44% and AUCs of 0.954 and 0.982 for Arabidopsis thaliana and Zea mays, respectively. PLRPIM predicts potential lncRNA-protein interaction pairs effectively, which can facilitate lncRNA-related research, including function prediction.
Keywords: autoencoder; hybrid; light gradient boosting machine; lncRNA-protein interaction; plant; random forest
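The k-mer step mentioned in the abstract can be illustrated with a minimal sketch. The function name `rna_kmer_frequencies`, the choice of `k`, the frequency normalization, and the handling of ambiguous symbols are assumptions made for illustration; the abstract does not state the exact settings used by PLRPIM.

```python
from itertools import product


def rna_kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies of an RNA sequence.

    The output vector has 4**k entries in lexicographic order over "ACGU".
    Hypothetical helper: k and the normalization are illustrative choices.
    """
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input too
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing symbols such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Toy sequence, not taken from either dataset.
features = rna_kmer_frequencies("AUGGCUAUGG", k=2)
```

With `k=2` the vector has 16 entries that sum to 1. Fixed-length vectors of this kind are what an autoencoder could then compress into a lower-dimensional representation.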
Year: 2019 PMID: 31151273 PMCID: PMC6627874 DOI: 10.3390/cells8060521
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Details of lncRNA–protein interaction datasets for two species.
| Species | Dataset | Positive Samples | Negative Samples | Total |
|---|---|---|---|---|
| Arabidopsis thaliana | Training set | 758 | 758 | 1516 |
| Arabidopsis thaliana | Test set | 190 | 190 | 380 |
| Zea mays | Training set | 17,706 | 17,706 | 35,412 |
| Zea mays | Test set | 4427 | 4427 | 8854 |
Figure 1. The workflow of the PLRPIM model.
Figure 2. The network structure of constrained stacked autoencoders (AE).
Hyper parameter settings.
| Name | Settings |
|---|---|
| Learning rate | 0.5, 1, 2 |
| Parameter optimization | SGD, momentum, Adam |
| Batch size | 256, 128, 64 |
| Activation | ReLU |
| Loss | MSE, Cross Entropy |
| Dropout rate | 0.6 |
Figure 3. Experimental setup for testing the proposed method. The datasets are split into five folds and hyperparameters are tuned on the training set. The model is then trained with the tuned hyperparameters and applied to the test set. Performance metrics are calculated for all folds.
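The fold construction described in this caption can be sketched as follows. This is a plain shuffled, unstratified split written for illustration; the generator name `k_fold_indices` and the seed are assumptions, and the source does not say how the folds were actually drawn.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles once, then uses each fold as the test set
    in turn while the remaining folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        ]
        yield train_idx, test_idx


# 1516 + 380 = 1896 samples, the size of the smaller dataset in the table above.
splits = list(k_fold_indices(1896, n_folds=5))
```

With 1896 samples the first fold gives 1516 training and 380 test samples, which matches the training and test sizes listed for the smaller dataset.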
5-fold results of the proposed method.
| Dataset | Test Set | ACC | PRE | SEN | SPE | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | 1 | 0.9000 | 0.9250 | 0.8506 | 0.9417 | 0.7995 | 0.9527 |
| Arabidopsis thaliana | 2 | 0.8865 | 0.9368 | 0.8359 | 0.9402 | 0.7784 | 0.9488 |
| Arabidopsis thaliana | 3 | 0.8971 | 0.9344 | 0.8636 | 0.9337 | 0.7970 | 0.9590 |
| Arabidopsis thaliana | 4 | 0.9050 | 0.9353 | 0.8641 | 0.9436 | 0.8117 | 0.9490 |
| Arabidopsis thaliana | 5 | 0.9103 | 0.9223 | 0.9036 | 0.9176 | 0.8206 | 0.9615 |
| Arabidopsis thaliana | Average | 0.8998 ± 0.008 | 0.9308 ± 0.006 | 0.8636 ± 0.02 | 0.9354 ± 0.009 | 0.8015 ± 0.01 | 0.9542 ± 0.005 |
| Zea mays | 1 | 0.9372 | 0.9394 | 0.9338 | 0.9405 | 0.8744 | 0.9849 |
| Zea mays | 2 | 0.9340 | 0.9380 | 0.9313 | 0.9369 | 0.8681 | 0.9846 |
| Zea mays | 3 | 0.9312 | 0.9318 | 0.9303 | 0.9321 | 0.8624 | 0.9817 |
| Zea mays | 4 | 0.9310 | 0.9350 | 0.9259 | 0.9361 | 0.8620 | 0.9812 |
| Zea mays | 5 | 0.9387 | 0.9367 | 0.9408 | 0.9366 | 0.8773 | 0.9833 |
| Zea mays | Average | 0.9344 ± 0.003 | 0.9362 ± 0.003 | 0.9324 ± 0.005 | 0.9364 ± 0.003 | 0.8689 ± 0.006 | 0.9831 ± 0.001 |
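The "Average" rows can be checked from the per-fold values. The sketch below recomputes the ACC summary for both datasets; the use of the population standard deviation is an assumption that happens to reproduce the reported spread, since the source does not say which estimator was used.

```python
from statistics import fmean, pstdev

# Per-fold ACC values copied from the 5-fold results table.
arabidopsis_acc = [0.9000, 0.8865, 0.8971, 0.9050, 0.9103]
zea_mays_acc = [0.9372, 0.9340, 0.9312, 0.9310, 0.9387]


def summarize(values):
    """Return (mean rounded to 4 decimals, population std rounded to 3)."""
    return round(fmean(values), 4), round(pstdev(values), 3)


arabidopsis_summary = summarize(arabidopsis_acc)
zea_mays_summary = summarize(zea_mays_acc)
```

Both summaries agree with the table: 0.8998 ± 0.008 for Arabidopsis thaliana and 0.9344 ± 0.003 for Zea mays.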
Predictive performance of classifiers and the proposed method.
| Dataset | Method | ACC | PRE | SEN | SPE | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | PLRPIM | 0.8998 | 0.9308 | 0.8636 | 0.9354 | 0.8015 | 0.9542 |
| Arabidopsis thaliana | LGBM | 0.8950 | 0.9182 | 0.8668 | 0.9230 | 0.7912 | 0.8949 |
| Arabidopsis thaliana | XGB | 0.7452 | 0.7615 | 0.7661 | 0.7187 | 0.4993 | 0.8352 |
| Arabidopsis thaliana | RF | 0.7088 | 0.6851 | 0.8164 | 0.5951 | 0.4294 | 0.8171 |
| Arabidopsis thaliana | AdaBoost | 0.6962 | 0.6829 | 0.8234 | 0.5622 | 0.4214 | 0.8061 |
| Arabidopsis thaliana | DT | 0.6233 | 0.6331 | 0.6060 | 0.6359 | 0.2478 | 0.6284 |
| Zea mays | PLRPIM | 0.9344 | 0.9362 | 0.9324 | 0.9364 | 0.8688 | 0.9831 |
| Zea mays | LGBM | 0.9317 | 0.9331 | 0.9300 | 0.9333 | 0.8634 | 0.9317 |
| Zea mays | XGB | 0.7936 | 0.7689 | 0.8426 | 0.7446 | 0.5909 | 0.8862 |
| Zea mays | AdaBoost | 0.7849 | 0.7676 | 0.8182 | 0.7516 | 0.5725 | 0.8693 |
| Zea mays | RF | 0.7536 | 0.7407 | 0.7972 | 0.7103 | 0.5111 | 0.8641 |
| Zea mays | DT | 0.6523 | 0.6500 | 0.6676 | 0.6368 | 0.3049 | 0.6894 |
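For readers unfamiliar with the column abbreviations, the sketch below computes ACC, PRE, SEN, SPE, and MCC from a binary confusion matrix using their standard definitions. The source does not print these formulas, so this is an assumption about how the columns are defined, and the confusion counts are invented for illustration (they only mimic a balanced test set of 190 positive and 190 negative pairs).

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),  # accuracy
        "PRE": _ratio(tp, tp + fp),                 # precision
        "SEN": _ratio(tp, tp + fn),                 # sensitivity (recall)
        "SPE": _ratio(tn, tn + fp),                 # specificity
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=164, fp=12, tn=178, fn=26)
```

AUC is not included because it depends on the ranking of predicted scores rather than on a single confusion matrix.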
Figure 4. ROC curves for PLRPIM, light gradient boosting machine (LGBM), extreme gradient boost (XGB), random forest (RF), adaptive boosting (AdaBoost), and decision tree (DT) on (a) Arabidopsis thaliana dataset and (b) Zea mays dataset.
Figure 5. ROC curves for PLRPIM, IPMiner, RPISeq-RF, and RPI-SAN on (a) Arabidopsis thaliana and (b) Zea mays datasets.
Predictive performance of other methods and the proposed method.
| Dataset | Method | ACC | PRE | SEN | SPE | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | PLRPIM | 0.8998 | 0.9308 | 0.8636 | 0.9354 | 0.8015 | 0.9546 |
| Arabidopsis thaliana | IPMiner | 0.8275 | 0.8930 | 0.7448 | 0.9107 | 0.6646 | 0.8823 |
| Arabidopsis thaliana | RPISeq-RF | 0.8059 | 0.8144 | 0.7922 | 0.8200 | 0.6124 | 0.8761 |
| Arabidopsis thaliana | RPI-SAN | 0.7579 | 0.7955 | 0.6966 | 0.8199 | 0.5210 | 0.8164 |
| Zea mays | PLRPIM | 0.9344 | 0.9362 | 0.9324 | 0.9364 | 0.8688 | 0.9823 |
| Zea mays | IPMiner | 0.8127 | 0.8142 | 0.8106 | 0.8148 | 0.6258 | 0.9034 |
| Zea mays | RPISeq-RF | 0.8069 | 0.7993 | 0.8192 | 0.7945 | 0.6142 | 0.8980 |
| Zea mays | RPI-SAN | 0.7890 | 0.7909 | 0.7869 | 0.7911 | 0.5784 | 0.8792 |
Some selected Zea mays and Arabidopsis thaliana long non-coding RNAs (lncRNAs) and their biological functions.
| Species | lncRNAs | Biological Functions |
|---|---|---|
| Arabidopsis thaliana | TCONS_00011717 | GO:0006913 - Nucleocytoplasmic transport |
| Arabidopsis thaliana | TCONS_00008833 | GO:0006083 - Acetate metabolic process |
| Arabidopsis thaliana | TCONS_00012080 | GO:0009867 - Regulation of jasmonic acid mediated |
| Zea mays | GRMZM2G374777 | PO:0001052 - 2 leaf expansion stage |
| Zea mays | GRMZM2G097084 | GO:0005524 - ATP binding |
| Zea mays | GRMZM2G078523 | PO:0001052 - 2 leaf expansion stage |
| Zea mays | GRMZM2G543070 | PO:0001052 - 2 leaf expansion stage |
| Zea mays | GRMZM2G147020 | PO:0001052 - 2 leaf expansion stage |