Lan Huang, Shaoqing Jiao, Sen Yang, Shuangquan Zhang, Xiaopeng Zhu, Rui Guo, Yan Wang.
Abstract
Long noncoding RNAs (lncRNAs) play a crucial role in many critical biological processes and participate in complex human diseases through interactions with proteins. Because identifying lncRNA-protein interactions experimentally is expensive and time-consuming, we propose LGFC-CNN, a deep learning method that combines raw sequence composition features, hand-designed features and structure features to predict lncRNA-protein interactions. Two sequence preprocessing methods and two CNN modules (GloCNN and LocCNN) are used to extract global and local features from the raw sequences. Meanwhile, we select hand-designed features by comparing the predictive performance of different combinations of lncRNA and protein features. Furthermore, we obtain the structure features and unify their dimensions through Fourier transform. Finally, the four types of features are integrated to predict lncRNA-protein interactions. Compared with other state-of-the-art methods on three lncRNA-protein interaction datasets, LGFC-CNN achieves the best performance, with an accuracy of 94.14% on RPI21850, 92.94% on RPI7317 and 98.19% on RPI1847. The results show that LGFC-CNN can effectively predict lncRNA-protein interactions by combining raw sequence composition features, hand-designed features and structure features.
Keywords: convolutional neural network; hand-designed features; lncRNA-protein interactions; raw sequence features; structure features; two sequence preprocessing methods
Year: 2021 PMID: 34828296 PMCID: PMC8621699 DOI: 10.3390/genes12111689
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The flowchart of LPI-CNNCP. (a) Build the lncRNA-protein interaction datasets and obtain the lncRNA and protein sequences; (b) feed combinations of lncRNA and protein hand-designed features into an RF classifier to select the hand-designed features with the best predictive performance; (c) preprocess the lncRNA and protein sequences with two methods and encode them using one-hot encoding; (d) obtain the lncRNA and protein secondary structure, hydrogen bonding propensities and van der Waals interactions, and unify their dimensions through Fourier transform; (e) feed the global and local encoded sequences, hand-designed features and structure features into the CNN model to predict lncRNA-protein interactions.
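Step (d) relies on a Fourier transform to bring variable-length structure profiles to a common dimension. The record does not give the exact transform or the number of terms retained, so the following is only a minimal sketch under stated assumptions: it takes a plain discrete Fourier transform and keeps the magnitudes of the first `k` coefficients. The function name `fourier_fixed_length` and the default `k = 10` are hypothetical, not taken from the paper.

```python
import cmath
from typing import List, Sequence


def fourier_fixed_length(profile: Sequence[float], k: int = 10) -> List[float]:
    """Map a numeric profile of any length to k features.

    Assumption (not from the source): use the normalized magnitudes of the
    first k discrete Fourier transform coefficients as the fixed-length
    representation.
    """
    n = len(profile)
    if n == 0:
        raise ValueError("profile must not be empty")
    features = []
    for freq in range(k):
        # DFT coefficient for this frequency index
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * freq * t / n)
            for t, x in enumerate(profile)
        )
        features.append(abs(coeff) / n)
    return features


# Profiles of different lengths end up with the same dimension.
short = fourier_fixed_length([0.2, 0.9, 0.4], k=4)
long_ = fourier_fixed_length([0.1 * i for i in range(50)], k=4)
```

Both `short` and `long_` have four entries, so they can be stacked into one feature matrix regardless of the original sequence lengths.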
Figure 2. The flowchart of constructing reliable negative samples. The positive dataset is constructed by extracting interactions from NPInter v4.0. We calculated and sorted the similarity scores, S, between all proteins in the positive dataset, and divided the lncRNAs into two equal parts before applying a different strategy to each part to build the negative dataset.
Numeric description of the datasets.
| Dataset | lncRNAs | Proteins | Interaction Pairs | Non-Interaction Pairs |
|---|---|---|---|---|
| RPI21850 | 4221 | 701 | 21850 | 21850 |
| RPI7317 | 1874 | 118 | 7317 | 7317 |
| RPI1847 | 1939 | 60 | 1847 | 1847 |
Figure 3. The flowchart of lncRNA sequence encoding. Two preprocessing methods transform the lncRNA sequences into fixed-length sequences, which are then one-hot encoded and fed into GloCNN and LocCNN. Protein sequences are encoded in the same way.
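The encoding step can be illustrated with a short sketch. The record does not describe the two preprocessing methods, so this example assumes the simplest possible one (truncate or right-pad to a fixed length) purely for illustration; `to_fixed_length`, `one_hot`, the padding symbol `N` and the four-letter RNA alphabet are hypothetical choices, not the authors' implementation.

```python
from typing import List

RNA_ALPHABET = "ACGU"  # assumed alphabet; proteins would use the 20 amino acids


def to_fixed_length(seq: str, length: int, pad: str = "N") -> str:
    """Truncate or right-pad a sequence to exactly `length` symbols."""
    return seq[:length].ljust(length, pad)


def one_hot(seq: str, alphabet: str = RNA_ALPHABET) -> List[List[int]]:
    """Encode each symbol as a 0/1 vector over `alphabet`.

    Symbols outside the alphabet (including the padding symbol) are
    encoded as all-zero rows.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in seq.upper():
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows


# A 4-nucleotide sequence padded to length 6, then one-hot encoded.
encoded = one_hot(to_fixed_length("AUGC", 6))
```

The result is a 6 x 4 matrix whose last two rows are all zeros, which is the kind of fixed-shape input a CNN module expects.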
Results of LGFC-CNN and four other methods on the RPI21850 dataset.
| Methods | ACC | MCC | F1-Score | SN | SP | PPV |
|---|---|---|---|---|---|---|
| LGFC-CNN | 0.9414 |  |  | 0.979 | 0.9039 | 0.9106 |
| RPISeq-RF | 0.9234 | 0.8481 | 0.9255 | 0.9515 | 0.8954 | 0.9009 |
| RPISeq-SVM | 0.921 | 0.8425 | 0.9224 | 0.939 | 0.903 | 0.9063 |
| LPI-BLS | 0.9141 | 0.8286 | 0.9153 | 0.9283 | 0.8999 | 0.9027 |
| IPMiner | 0.923 | 0.8461 | 0.9238 | 0.9335 |  |  |
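The result tables report ACC, MCC, F1-score, SN, SP and PPV. The record does not spell out the formulas, so the sketch below assumes the standard confusion-matrix definitions (SN as sensitivity or recall, SP as specificity, PPV as precision). The function name `binary_metrics` and the example counts are hypothetical and are not results from the paper.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the six reported metrics from confusion-matrix counts.

    Assumes the standard definitions and that there is at least one actual
    positive, one actual negative, and one predicted positive.
    """
    sn = tp / (tp + fn)             # sensitivity (recall)
    sp = tn / (tn + fp)             # specificity
    ppv = tp / (tp + fp)            # positive predictive value (precision)
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sn / (ppv + sn)  # harmonic mean of PPV and SN
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "F1": f1, "SN": sn, "SP": sp, "PPV": ppv}


# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
m = binary_metrics(tp=90, fp=10, tn=90, fn=10)
```

On a dataset with equal numbers of positive and negative pairs, as in the dataset table, ACC under these definitions reduces to (SN + SP) / 2; for example, the RPISeq-RF row gives (0.9515 + 0.8954) / 2 ≈ 0.9234.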
Figure 4. (a) The ROC curves and (b) the PRC curves of LGFC-CNN and four other methods on the RPI21850 dataset.
Results of LGFC-CNN and four other methods on the RPI7317 and RPI1847 datasets.
| Dataset | Methods | ACC | MCC | F1-Score | SN | SP | PPV |
|---|---|---|---|---|---|---|---|
| RPI7317 | LGFC-CNN | 0.9294 |  |  |  |  |  |
| RPI7317 | RPISeq-RF | 0.9098 | 0.8202 | 0.9116 | 0.9299 | 0.8897 | 0.894 |
| RPI7317 | RPISeq-SVM | 0.9153 | 0.8311 | 0.9169 | 0.9344 | 0.8961 | 0.9 |
| RPI7317 | LPI-BLS | 0.9144 | 0.8288 | 0.9151 | 0.9226 | 0.9061 | 0.9077 |
| RPI7317 | IPMiner | 0.9134 | 0.8269 | 0.9139 | 0.918 | 0.9088 | 0.9097 |
| RPI1847 | LGFC-CNN | 0.9819 |  |  | 0.9747 | 0.9856 | 0.989 |
| RPI1847 | RPISeq-RF | 0.9621 | 0.9243 | 0.9617 | 0.9531 | 0.9711 | 0.9706 |
| RPI1847 | RPISeq-SVM | 0.9585 | 0.9191 | 0.957 | 0.9242 |  |  |
| RPI1847 | LPI-BLS | 0.9675 | 0.9352 | 0.9672 | 0.9567 | 0.9783 | 0.9779 |
| RPI1847 | IPMiner | 0.9639 | 0.9287 | 0.9631 | 0.9422 | 0.9856 | 0.9849 |
Figure 5. (a) The ROC curves and (b) the PRC curves of LGFC-CNN and four other methods on RPI7317. (c) The ROC curves and (d) the PRC curves of LGFC-CNN and four other methods on RPI1847.
Figure 6. The heat map generated from the feature combinations in RPI21850.
Figure 7. The performance of LGFC-CNN and four different basic modules.
Results of LGFC-CNN on six datasets generated by different negative sample generation strategies.
| Datasets | ACC | MCC | F1-Score | SN | SP | PPV |
|---|---|---|---|---|---|---|
| RPI21850 | 0.9414 |  |  | 0.979 | 0.9039 | 0.9106 |
| ranRPI21850 | 0.9155 | 0.8381 | 0.9207 |  | 0.8505 | 0.8677 |
| RPI7317 | 0.9294 |  |  |  |  |  |
| ranRPI7317 | 0.8998 | 0.7996 | 0.9003 | 0.9052 | 0.8944 | 0.8954 |
| RPI1847 | 0.9819 |  |  |  |  |  |
| ranRPI1847 | 0.9675 | 0.9359 | 0.9682 |  | 0.9458 | 0.9481 |