| Literature DB >> 31890140 |
Hai-Cheng Yi1,2, Zhu-Hong You1, Li Cheng1, Xi Zhou1, Tong-Hai Jiang1, Xiao Li1, Yan-Bin Wang1.
Abstract
Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Identifying interacting lncRNA-protein pairs is therefore the first step toward understanding the function and mechanism of lncRNAs. Since determining lncRNA-protein interactions by high-throughput experiments is time-consuming and expensive, more robust and accurate computational methods need to be developed. In this study, we developed a new sequence distributed representation learning based method for potential lncRNA-protein interaction prediction, named LPI-Pred, inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were divided into k-mer segments, which can be regarded as "words" in natural language processing. We then trained the RNA2vec and Pro2vec models using word2vec on genome-wide human lncRNA and protein sequences to mine distributed representations of RNA and protein. Next, the dimensionality of the resulting features was reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets, including RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool that provides reliable guidance for biological research.
Keywords: Distribution representation; Natural language processing; RNA-protein interaction; Word2vec
Year: 2019 PMID: 31890140 PMCID: PMC6926125 DOI: 10.1016/j.csbj.2019.11.004
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Details of the two RNA-protein interaction datasets, RPI369 and RPI1807, and the lncRNA-protein interaction dataset RPI488.
| Datasets | # of RNAs | # of proteins | Positive samples | Negative samples | References |
|---|---|---|---|---|---|
| RPI369 | 332 | 338 | 369 | 369 | |
| RPI1807 | 1078 | 1807 | 1807 | 1807 | |
| RPI488 | 25 | 247 | 243 | 245 | |
Fig. 1. Procedure of splitting RNA nucleotide and protein amino acid sequences into smaller k-mers.
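As a sketch of the segmentation in Fig. 1, a fixed window can be slid over a sequence to produce overlapping k-mer "words"; the `k` value and step size below are illustrative assumptions, not the paper's exact settings.

```python
# Hypothetical sketch of the k-mer "word" segmentation in Fig. 1.
# k and step are illustrative assumptions, not the paper's settings.

def to_kmers(sequence, k=3, step=1):
    """Split a nucleotide or amino-acid sequence into overlapping k-mers."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]

# Example: an RNA fragment becomes a "sentence" of 3-mer "words".
print(to_kmers("AUGGCA", k=3))  # ['AUG', 'UGG', 'GGC', 'GCA']
```

The resulting list plays the role of a sentence, so a corpus of sequences can be fed directly to a word-embedding trainer.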
Fig. 2. The skip-gram word embedding model. The Lnc2vec and Pro2vec models were trained using this architecture and genome-wide human lncRNA and protein sequences. Skip-gram is trained by predicting the words surrounding a central word; after training, the hidden-layer weight matrix W is obtained, whose rows are the word vectors.
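The skip-gram objective described in Fig. 2 can be illustrated by generating the (center, context) pairs such a model is trained on; the window size here is an illustrative assumption (in practice the embeddings would be trained with a word2vec implementation, e.g. gensim's `Word2Vec` with `sg=1`).

```python
# Minimal sketch of skip-gram training-pair generation: for each central
# "word" (k-mer), the model is asked to predict the words inside a context
# window around it. The window size is an illustrative assumption.

def skipgram_pairs(sentence, window=2):
    """Yield (center, context) pairs as used to train a skip-gram model."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# A "sentence" of 3-mer words from an RNA sequence:
sentence = ["AUG", "UGG", "GGC"]
print(skipgram_pairs(sentence, window=1))
# [('AUG', 'UGG'), ('UGG', 'AUG'), ('UGG', 'GGC'), ('GGC', 'UGG')]
```

Training on such pairs is what yields the hidden-layer weight matrix W whose rows serve as the word vectors.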
Fig. 3. The procedure for training RNA2vec and Pro2vec. The corpus of RNA and protein sequences was obtained from the GENCODE project, and the model was implemented with word2vec.
Fig. 4. The workflow of LPI-Pred. The word embedding models RNA2vec and Pro2vec are trained to capture the sequence information of RNA and protein, and the features remaining after feature selection are used to train the Random Forest predictor.
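The last two stages of the workflow in Fig. 4 (Gini-impurity-based feature selection followed by a Random Forest predictor) can be sketched with scikit-learn on toy data; the feature dimensions, threshold, and hyperparameters below are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of Gini-based feature selection + Random Forest prediction.
# The data is synthetic; dimensions and hyperparameters are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))           # stand-in for RNA/protein embedding features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy interaction labels

# Gini importance is RandomForest's default feature_importances_;
# SelectFromModel keeps features whose importance is above the median.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
).fit(X, y)
X_sel = selector.transform(X)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_sel, y)
print(X_sel.shape)  # roughly half the features are kept
```

Selecting on Gini importance reuses the same impurity criterion the forest splits on, so no separate scoring function is needed before the final classifier is trained.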
Comparison of the five-fold cross-validation performance of k-mer and word embedding features, with and without feature selection, on three gold standard datasets.
| Datasets | feature | Acc (%) | Sens (%) | Spec (%) | Pre (%) | MCC (%) |
|---|---|---|---|---|---|---|
| RPI369 | k-mer | 68.71 | 67.29 | 70.30 | 69.88 | 37.74 |
| | embedding without feature selection | 71.97 | 70.27 | | | 44.24 |
| | embedding with feature selection | 71.14 | 72.64 | | | |
| RPI488 | k-mer | 89.29 | 95.17 | 94.33 | | 79.09 |
| | embedding without feature selection | 87.64 | 83.17 | 91.93 | 90.82 | 75.52 |
| | embedding with feature selection | 82.75 | | | | |
| RPI1807 | k-mer | 96.88 | 94.96 | 96.04 | | 93.72 |
| | embedding without feature selection | 96.73 | 97.90 | 95.28 | 96.28 | 93.37 |
| | embedding with feature selection | 97.89 | | | | |
Boldface indicates the best performance for a measure among the compared sequence feature encodings.
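The measures reported in these tables (Acc, Sens, Spec, Pre, MCC) can be computed from confusion-matrix counts as below; the counts used are illustrative, and in the paper each measure is averaged over the five cross-validation folds.

```python
# Standard binary-classification measures from confusion-matrix counts.
# The example counts are illustrative, not taken from the paper.
import math

def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    sens = tp / (tp + fn)                   # sensitivity / recall
    spec = tn / (tn + fp)                   # specificity
    pre = tp / (tp + fp)                    # precision
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                       # Matthews correlation coefficient
    return acc, sens, spec, pre, mcc

print([round(v, 4) for v in metrics(tp=90, tn=85, fp=15, fn=10)])
```

MCC is the most informative single number here because it stays balanced even when the positive and negative samples are not equally sized.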
Comparison of the five-fold cross-validation performance of LPI-Pred and other machine learning classifiers on three gold standard datasets.
| Datasets | Methods | Acc (%) | Sens (%) | Spec (%) | Pre (%) | MCC (%) |
|---|---|---|---|---|---|---|
| RPI369 | SVM | 65.17 | 66.20 | 64.34 | 65.48 | 30.61 |
| | LR | 58.37 | 44.06 | 62.51 | | 18.05 |
| | LPI-Pred | 71.14 | | | | |
| RPI488 | SVM | 88.68 | 81.97 | 95.17 | 94.26 | 77.95 |
| | LR | 88.68 | 81.97 | 95.17 | 94.26 | 77.95 |
| | LPI-Pred | | | | | |
| RPI1807 | SVM | 92.35 | 94.11 | 90.17 | 92.29 | 84.52 |
| | LR | 87.26 | 90.17 | 83.56 | 87.39 | 74.17 |
| | LPI-Pred | | | | | |
Boldface indicates the best performance for a measure among the compared methods on each dataset.
Comparison of the five-fold cross-validation performance of LPI-Pred and other state-of-the-art methods on three gold standard datasets.
| Datasets | Methods | Acc (%) | Sens (%) | Spec (%) | Pre (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|---|
| RPI369 | RPISeq | 70.4 | 70.5 | 70.2 | 70.7 | 40.9 | 0.767 |
| | lncPro | 70.4 | 70.8 | 69.6 | 71.3 | 40.9 | 0.740 |
| | LPI-Pred | | | | | | |
| RPI1807 | RPISeq | 96.8 | 98.4 | 96.0 | | | 0.996 |
| | lncPro | 96.9 | 96.5 | 98.1 | 95.5 | 93.8 | 0.994 |
| | RPI-SAN | 96.1 | 93.6 | 91.4 | 92.4 | | |
| | LPI-Pred | 97.10 | 96.14 | 94.13 | | | 0.994 |
| RPI488 | RPISeq | 88.0 | 92.6 | 82.2 | 93.2 | 76.2 | 0.903 |
| lncPro | 87.0 | 90.0 | 82.7 | 91.0 | 74.0 | 0.901 | |
| | RPI-SAN | 89.7 | 83.7 | 95.2 | | 79.3 | |
| | LPI-Pred | 82.75 | | | | | 0.911 |
Boldface indicates the best performance for a measure among the compared methods on each dataset.