Yanbin Wang, Zhu-Hong You, Shan Yang, Xiao Li, Tong-Hai Jiang, Xi Zhou.
Abstract
Many life activities and key functions in organisms are maintained by different types of protein–protein interactions (PPIs). To accelerate the discovery of PPIs across species, many computational methods have been developed. Unfortunately, even though computational methods are constantly evolving, efficient methods for predicting PPIs from protein sequence information alone have remained elusive for many years, owing to limitations of both methodology and technology. Inspired by the similarity between biological sequences and natural languages, developing a biological language processing technology may provide a brand-new theoretical perspective and a feasible method for the study of biological sequences. In this paper, a pure biological language processing model is proposed for predicting protein–protein interactions using only protein sequences. The model was constructed based on a feature representation method for biological sequences called bio-to-vector (Bio2Vec) and a convolutional neural network (CNN). Bio2Vec obtains protein sequence features by using a "bio-word" segmentation system and a word representation model that learns a distributed representation for each "bio-word". Bio2Vec supplies a framework that allows researchers to consider the contextual and implicit semantic information of a biological sequence. A remarkable improvement in PPI prediction performance was observed with the proposed model compared with state-of-the-art methods. The presentation of this approach marks the start of "bio-language processing technology," which could cause a technological revolution and could be applied to improve the quality of predictions in other problems.
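The Bio2Vec idea described above can be illustrated with a minimal sketch: a protein sequence is segmented into "bio-words" and the learned vector of each word is accumulated into one fixed-length sequence feature. The greedy longest-match segmenter, the toy vocabulary, and the 4-dimensional vectors below are all illustrative assumptions, not the authors' actual segmentation system or embeddings.

```python
# Toy sketch (assumed details) of Bio2Vec: segment a sequence into
# "bio-words", then sum the word vectors into one sequence feature.

def segment(sequence, vocab):
    """Greedy longest-match segmentation into known bio-words."""
    words, i = [], 0
    while i < len(sequence):
        for size in range(min(4, len(sequence) - i), 0, -1):
            piece = sequence[i:i + size]
            if piece in vocab:
                words.append(piece)
                i += size
                break
        else:
            words.append(sequence[i])  # unknown residue kept as its own word
            i += 1
    return words

def sequence_vector(sequence, embeddings):
    """Accumulate the word vectors of all bio-words in the sequence."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for w in segment(sequence, embeddings):
        for k, x in enumerate(embeddings.get(w, [0.0] * dim)):
            vec[k] += x
    return vec

# Hypothetical 4-dimensional embeddings for three toy bio-words.
toy_embeddings = {
    "MKT": [0.1, 0.2, 0.0, 0.3],
    "AY":  [0.0, 0.1, 0.4, 0.0],
    "G":   [0.2, 0.0, 0.1, 0.1],
}
print(sequence_vector("MKTAYG", toy_embeddings))
```

Summing (rather than averaging or concatenating) word vectors is one simple way to obtain a fixed-length representation regardless of sequence length.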
Keywords: bio-language processing; convolution neural network; protein–protein interactions; sentencepiece; unigram language model
Year: 2019 PMID: 30717470 PMCID: PMC6406841 DOI: 10.3390/cells8020122
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Figure 1Analogy between natural language and “bio language”.
Figure 2. The two-stage workflow of our proposed biology language model for predicting protein–protein interactions (PPIs). Subfigure (a) shows the flow of generating a fixed-length feature representation for each protein sequence. Given a set of protein sequences, we first segmented them into protein words, and then each protein word was transformed into a vector by the Skip-gram model. The sequence vector was then obtained by accumulating all the protein word vectors of the sequence. Subfigure (b) shows a convolutional neural network with multiple convolution kernels for predicting PPIs. Given a pair of protein sequences, we represented them using Bio2Vec and then concatenated them to form a feature pair. Finally, the trained convolutional neural network was used to predict true or false.
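Stage two of the workflow can be sketched as follows: the two fixed-length protein features are concatenated into a feature pair, and a trained classifier maps the pair to true/false. The linear scorer below is a stand-in for the trained CNN, with made-up weights; it only illustrates the concatenate-then-classify step.

```python
# Toy illustration (not the authors' code) of stage two: concatenate
# the Bio2Vec features of two proteins and score the pair.

def feature_pair(vec_a, vec_b):
    """Concatenate two fixed-length protein features into one input."""
    return list(vec_a) + list(vec_b)

def predict_interaction(pair, weights, bias=0.0, threshold=0.0):
    """Linear stand-in for the trained CNN: score > threshold => interact."""
    score = sum(w * x for w, x in zip(weights, pair)) + bias
    return score > threshold

pair = feature_pair([0.3, 0.5], [0.1, 0.9])
print(predict_interaction(pair, weights=[1.0, -1.0, 0.5, 0.5], bias=-0.2))
```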
Figure 3. The Skip-gram word representation model. The model is trained by predicting the words surrounding the central word. After training, the weight matrix W of the hidden layer is obtained; these weights are in fact the "word vectors".
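The training signal described in the caption — each central word predicting its surrounding words — amounts to enumerating (center, context) pairs within a sliding window. A minimal sketch, with an arbitrary window size of 1:

```python
# Minimal sketch of Skip-gram training-pair generation: each central
# word is paired with every word inside its context window.

def skipgram_pairs(words, window=2):
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs(["MKT", "AY", "GLL", "VV"], window=1))
# -> [('MKT', 'AY'), ('AY', 'MKT'), ('AY', 'GLL'),
#     ('GLL', 'AY'), ('GLL', 'VV'), ('VV', 'GLL')]
```

Training the network on such pairs is what shapes the hidden-layer weight matrix W into the word vectors.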
Figure 4. The proposed convolutional neural network architecture. The network consists of three sub-networks: the first performs two convolutions, while the second and third each perform one convolution, and the four convolution operations use convolution kernels of different sizes. The penultimate layer concatenates the features generated by the three sub-networks, and the fully connected layer executes the prediction.
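The multi-kernel idea in Figure 4 can be sketched in miniature: 1-D convolutions with different kernel sizes run over the same feature vector, and their outputs are concatenated before the fully connected prediction layer. The kernel values and input below are toy assumptions, not the network's learned parameters.

```python
# Rough sketch of multi-kernel 1-D convolution with concatenation,
# as in the architecture of Figure 4 (toy kernels, no learning).

def conv1d(x, kernel):
    """Valid-mode 1-D convolution (no padding, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def multi_kernel_features(x, kernels):
    """Run each kernel over x and concatenate all outputs."""
    out = []
    for kernel in kernels:
        out.extend(conv1d(x, kernel))
    return out

x = [1.0, 2.0, 3.0, 4.0]
feats = multi_kernel_features(x, kernels=[[1.0, -1.0], [0.5, 0.5, 0.5]])
print(feats)  # -> [-1.0, -1.0, -1.0, 3.0, 4.5]
```

Kernels of different sizes capture patterns at different spans of the feature vector, which is the motivation for combining them before the final layer.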
The comparison of Bio2Vec-based method with 3-mers-based method.
| Model | Testing Set | Accu (%) | Sens (%) | Prec (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| Bio2Vec-based | Human | 97.31 | 96.28 | 98.48 | 94.76 | 0.9961 |
| Bio2Vec-based | S. cerevisiae | 93.30 | 92.70 | 93.55 | 87.49 | 0.9720 |
| Bio2Vec-based | H. pylori | 88.01 | 89.61 | 87.90 | 78.71 | 0.9394 |
| Bio2Vec-based | | 99.58 | 99.64 | 99.50 | 99.16 | 0.9995 |
| 3-mers-based | Human | 92.18 | 86.85 | 97.77 | 85.53 | 0.9647 |
| 3-mers-based | S. cerevisiae | 90.26 | 88.14 | 91.65 | 82.38 | 0.9621 |
| 3-mers-based | H. pylori | 83.22 | 89.61 | 80.70 | 82.38 | 0.8924 |
| 3-mers-based | | 98.47 | 100 | 96.98 | 96.99 | 0.9998 |
Figure 5. Receiver operating characteristic (ROC) curves comparing the Bio2Vec-based method with the 3-mers-based method. ROC curves achieved by the Bio2Vec-based method are shown in (a); ROC curves achieved by the 3-mers-based method are shown in (b).
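The AUC values reported alongside these ROC curves can be understood through their probabilistic interpretation: the AUC equals the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair (ties counting half). A minimal sketch with made-up scores:

```python
# AUC via its rank interpretation: fraction of (positive, negative)
# score pairs where the positive outranks the negative (ties = 0.5).

def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8 of 9 pairs -> 0.888...
```

This is equivalent to the area under the ROC curve traced by sweeping the decision threshold over all scores.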
Performance comparison of different methods on the Human dataset.
| Model | Accu (%) | Sens (%) | Prec (%) | MCC (%) |
|---|---|---|---|---|
| LDA + RF | 96.40 | 94.20 | N/A | 92.80 |
| LDA + RoF | 95.70 | 97.60 | N/A | 91.80 |
| LDA + SVM | 90.70 | 89.70 | N/A | 81.30 |
| AC + RF | 95.50 | 94.00 | N/A | 91.40 |
| AC + RoF | 95.10 | 93.30 | N/A | 91.10 |
| AC + SVM | 89.30 | 94.00 | N/A | 79.20 |
| Proposed Method | 97.31 | 96.28 | 98.48 | 94.76 |
Performance comparison of different methods on the S. cerevisiae dataset.
| Model | Accu (%) | Sens (%) | Prec (%) | MCC (%) |
|---|---|---|---|---|
| ACC | 89.33 | 89.93 | 88.87 | N/A |
| AC | 87.36 | 87.30 | 87.82 | N/A |
| Code1 | 75.08 | 75.81 | 74.75 | N/A |
| Code2 | 80.04 | 76.77 | 82.17 | N/A |
| Code3 | 80.41 | 78.14 | 81.66 | N/A |
| Code4 | 86.15 | 81.03 | 90.24 | N/A |
| PCA-EELM | 87.00 | 86.15 | 87.59 | 77.36 |
| Proposed Method | 93.30 | 92.70 | 93.55 | 87.49 |
Performance comparison of different methods on the H. pylori dataset.
| Model | Accu (%) | Sens (%) | Prec (%) | MCC (%) |
|---|---|---|---|---|
| Phylogenetic bootstrap | 75.80 | 69.80 | 80.20 | N/A |
| Boosting | 79.52 | 80.30 | 81.69 | 70.64 |
| Signature products | 83.40 | 79.90 | 85.70 | N/A |
| HKNN | 84.00 | 86.00 | 84.00 | N/A |
| Proposed Method | 88.01 | 89.61 | 80.70 | 78.71 |