Hang Li, Xiu-Jun Gong, Hua Yu, Chang Zhou.
Abstract
Machine learning-based predictions of protein–protein interactions (PPIs) could provide valuable insights into protein functions, disease occurrence, and therapy design on a large scale. The intensive feature engineering required by most of these methods makes the prediction task tedious. Deep learning, which enables automatic feature engineering, has been highly successful in various fields. However, the over-fitting and generalization of its models have not yet been well investigated in most scenarios. Here, we present a deep neural network framework (DNN-PPI) for predicting PPIs using features learned automatically from protein primary sequences alone. Within the framework, the sequences of two interacting proteins are fed sequentially into the encoding, embedding, convolutional neural network (CNN), and long short-term memory (LSTM) neural network layers. A concatenated vector of the two outputs from the previous layer is then passed as input to the fully connected neural network. Finally, the Adam optimizer is applied to learn the network weights by back-propagation. Different types of features, including semantic associations between amino acids, position-related sequence segments (motifs), and their long- and short-term dependencies, are captured in the embedding, CNN, and LSTM layers, respectively. When the model was trained on Pan's human PPI dataset, it achieved a prediction accuracy of 98.78% with a Matthews correlation coefficient (MCC) of 97.57%. The prediction accuracies for six external datasets ranged from 92.80% to 97.89%, making them superior to those achieved with previous methods. When applied to Escherichia coli, Drosophila, and Caenorhabditis elegans datasets, DNN-PPI obtained prediction accuracies of 95.949%, 98.389%, and 98.669%, respectively. The performances in cross-species testing among the four species above were consistent with their evolutionary distances. However, when Mus musculus was tested with the models trained on those species, all of them achieved prediction accuracies of over 92.43%, which is difficult to achieve and merits further study. These results suggest that DNN-PPI has remarkable generalization ability and is a promising tool for identifying protein interactions.
Keywords: convolution neural networks; long short-term memory neural networks; model generalization; protein–protein interaction
Year: 2018 PMID: 30071670 PMCID: PMC6222503 DOI: 10.3390/molecules23081923
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Benchmark dataset.
| Dataset | Positive Samples | Negative Samples | Total |
|---|---|---|---|
| Benchmark set | 29,071 | 31,496 | 60,567 |
| Training set | 26,128 | 28,439 | 54,567 |
| Hold-out test set | 2943 | 3057 | 6000 |
Validation datasets.
| Dataset | 2010 HPRD | DIP | HIPPIE HQ | HIPPIE LQ | inWeb_inbiomap HQ | inWeb_inbiomap LQ |
|---|---|---|---|---|---|---|
| Positive samples | 8008 | 4514 | 25,701 | 173,343 | 128,591 | 368,882 |
| Low-redundancy | 2413 | 1276 | 3035 | 5587 | 2546 | 5358 |
Protein–protein interaction (PPI) datasets for four species.
| Species | Dataset | Positive Samples | Negative Samples | Total |
|---|---|---|---|---|
| Escherichia coli | Original set | 6680 | 6881 | 13,561 |
| | Training set | 6012 | 6193 | 12,205 |
| | Testing set | 668 | 688 | 1356 |
| Drosophila | Original set | 19,133 | 18,449 | 37,582 |
| | Training set | 17,220 | 16,593 | 33,813 |
| | Testing set | 1913 | 1856 | 3769 |
| Caenorhabditis elegans | Original set | 3696 | 3763 | 7459 |
| | Training set | 17,220 | 16,593 | 33,813 |
| | Testing set | 1913 | 1856 | 3769 |
| Mus musculus | Original set | 22,870 | — | 22,870 |
Figure 1. Architecture of the deep learning model.
Figure 2. The architecture of the convolutional neural network (CNN) layer.
Figure 3. Long short-term memory cell.
The parameters and output sizes of each layer.
| Layer | Parameters | Output_Size of Protein A | Output_Size of Protein B |
|---|---|---|---|
| Input | Sentence_length = 1200 | (128,1200) | (128,1200) |
| | Batch_size = 128 | | |
| Embedding layer | Input_dim = 23 | (128,1200,128) | (128,1200,128) |
| | Output_dim = 128 | | |
| Convolution layer 1 | Filters = 10 | (128,1191,64) | (128,1191,64) |
| | Filter_length = 10 | | |
| | Activation = relu | | |
| MaxPooling | Pooling_length = 2 | (128,596,64) | (128,596,64) |
| Convolution layer 2 | Filters = 10 | (128,589,64) | (128,589,64) |
| | Filter_length = 8 | | |
| | Activation = relu | | |
| MaxPooling | Pooling_length = 2 | (128,295,64) | (128,295,64) |
| Convolution layer 3 | Filters = 10 | (128,291,64) | (128,291,64) |
| | Filter_length = 5 | | |
| | Activation = relu | | |
| MaxPooling | Pooling_length = 2 | (128,146,64) | (128,146,64) |
| LSTM layer | Output_size = 80 | (128,80) | (128,80) |
| Merge layer | Mode = concat | (128,160) | |
| Output | Activation = sigmoid | (128,1) | |
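
The output sizes in the table above can be reproduced with simple length arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes stride-1 convolutions without padding and pooling that rounds a partial window up, both inferred from the listed sizes rather than stated in the record. The channel count of 64 is read from the Output_Size columns, not from the Filters entries, and the function names are hypothetical.

```python
import math


def conv1d_valid_length(length, kernel_size):
    """Sequence length after a stride-1 convolution with no padding."""
    return length - kernel_size + 1


def pool_length(length, pool_size):
    """Sequence length after non-overlapping pooling, rounding up
    (assumption inferred from the table: 1191 -> 596)."""
    return math.ceil(length / pool_size)


def branch_shapes(batch=128, seq_len=1200, embed_dim=128, channels=64,
                  kernel_sizes=(10, 8, 5), pool_size=2, lstm_units=80):
    """Return (layer name, output shape) pairs for one protein branch."""
    shapes = [("Input", (batch, seq_len)),
              ("Embedding", (batch, seq_len, embed_dim))]
    length = seq_len
    for i, kernel_size in enumerate(kernel_sizes, start=1):
        length = conv1d_valid_length(length, kernel_size)
        shapes.append((f"Convolution {i}", (batch, length, channels)))
        length = pool_length(length, pool_size)
        shapes.append((f"MaxPooling {i}", (batch, length, channels)))
    # The LSTM layer keeps only its final state, so the time axis disappears.
    shapes.append(("LSTM", (batch, lstm_units)))
    return shapes


def merged_shape(batch=128, lstm_units=80):
    """Shape after concatenating the two protein branches."""
    return (batch, 2 * lstm_units)


if __name__ == "__main__":
    for name, shape in branch_shapes():
        print(f"{name:15s} {shape}")
    print(f"{'Merge':15s} {merged_shape()}")
```

Under these assumptions the computed lengths match the table: three convolution and pooling stages shrink each 1200-residue input to 146 positions before the LSTM layer, and concatenating the two 80-dimensional branch outputs gives the 160-dimensional merged vector.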
Performances of deep neural network/protein–protein interaction (DNN-PPI) framework on the benchmark dataset.
| Test Set | Accuracy | Recall | Precision | F-Score | MCC |
|---|---|---|---|---|---|
| 1 | 0.9853 | 0.9845 | 0.9849 | 0.9847 | 0.9706 |
| 2 | 0.9877 | 0.9876 | 0.9865 | 0.9871 | 0.9754 |
| 3 | 0.9916 | 0.9909 | 0.9913 | 0.9911 | 0.9831 |
| 4 | 0.9892 | 0.9911 | 0.9867 | 0.9889 | 0.9784 |
| 5 | 0.9941 | 0.9963 | 0.9915 | 0.9939 | 0.9883 |
| Hold-out | 0.9878 | 0.9891 | 0.9861 | 0.9876 | 0.9757 |
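
The five metrics reported above can all be derived from a confusion matrix. The sketch below assumes the standard definitions, which this record does not spell out. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, precision, F-score and MCC from
    confusion-matrix counts (standard definitions, assumed here)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in classification_metrics(tp=90, fp=5, tn=95, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the positive and negative classes are not perfectly balanced.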
Performances with different proportions of training and testing sets on the benchmark dataset.
| Training/Testing | Accuracy | Recall | Precision | F-Score | MCC |
|---|---|---|---|---|---|
| 0.70/0.30 | 0.9846 | 0.9796 | 0.9883 | 0.9839 | 0.9693 |
| 0.75/0.25 | 0.9870 | 0.9864 | 0.9867 | 0.9865 | 0.9741 |
| 0.80/0.20 | 0.9836 | 0.9768 | 0.9889 | 0.9828 | 0.9672 |
| 0.85/0.15 | 0.9849 | 0.9821 | 0.9864 | 0.9843 | 0.9698 |
Performance comparisons on the hold-out test set.
| Method | Classifier | Accuracy | Average |
|---|---|---|---|
| 188D | SVM | 0.9468 | 0.9645 |
| | RF | 0.9701 | |
| | GBDT | 0.9767 | |
| QLC | SVM | 0.9497 | 0.9658 |
| | RF | 0.9701 | |
| | GBDT | 0.9775 | |
| QNC | SVM | 0.9582 | 0.9686 |
| | RF | 0.9695 | |
| | GBDT | 0.9782 | |
| QNC + QLC | SVM | 0.9758 | 0.9751 |
| | RF | 0.9716 | |
| | GBDT | 0.9778 | |
| SAE | | 0.9538 | 0.9538 |
| DNN-PPI | | 0.9878 | 0.9878 |
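
The Average column appears to be the mean accuracy over the three classifiers (SVM, RF, GBDT) for each feature set. The short sketch below checks that reading against the accuracies listed in the table. The variable and function names are hypothetical.

```python
# Accuracies copied from the hold-out comparison table.
accuracies = {
    "188D": {"SVM": 0.9468, "RF": 0.9701, "GBDT": 0.9767},
    "QLC": {"SVM": 0.9497, "RF": 0.9701, "GBDT": 0.9775},
    "QNC": {"SVM": 0.9582, "RF": 0.9695, "GBDT": 0.9782},
    "QNC + QLC": {"SVM": 0.9758, "RF": 0.9716, "GBDT": 0.9778},
}


def mean_accuracy(scores):
    """Mean accuracy across classifiers, rounded to four decimals."""
    return round(sum(scores.values()) / len(scores), 4)


averages = {name: mean_accuracy(scores) for name, scores in accuracies.items()}

if __name__ == "__main__":
    for name, value in averages.items():
        print(f"{name}: {value}")
```

The recomputed means agree with the tabulated averages to four decimals. SAE and DNN-PPI are single models, so their average equals their accuracy.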
Accuracy comparisons on validation datasets.
| Dataset Name | Samples | DNN-PPI | SAE | GBDT | Pan et al. |
|---|---|---|---|---|---|
| 2010 HPRD | 8008 | 0.9789 | 0.9205 | 0.9663 | 0.8816 |
| DIP | 4514 | 0.9433 | 0.8773 | 0.9465 | 0.8872 |
| HIPPIE HQ | 25,701 | 0.9608 | 0.8623 | 0.9415 | 0.8301 |
| HIPPIE LQ | 173,343 | 0.9340 | 0.8105 | 0.9180 | — |
| inWeb_inbiomap HQ | 128,591 | 0.9307 | 0.8512 | 0.9284 | — |
| inWeb_inbiomap LQ | 368,882 | 0.9280 | 0.8187 | 0.9028 | — |
DNN-PPI’s accuracy on low-redundancy versions of validation datasets.
| Dataset | Samples | ACC |
|---|---|---|
| 2010 HPRD LR | 2413 | 0.9465 |
| DIP LR | 1276 | 0.9302 |
| HIPPIE HQ LR | 3035 | 0.9420 |
| HIPPIE LQ LR | 5587 | 0.9414 |
| inWeb_inbiomap HQ LR | 2546 | 0.9411 |
| inWeb_inbiomap LQ LR | 5358 | 0.9331 |
Performance comparisons on datasets for other species.
| Species | Recall | Precision | MCC | F-Score | Accuracy | SAE ACC | Guo et al. ACC |
|---|---|---|---|---|---|---|---|
| Escherichia coli | 0.9416 | 0.9752 | 0.9194 | 0.9581 | 0.9594 | 0.9323 | 0.9528 |
| Drosophila | 0.9686 | 0.9995 | 0.9681 | 0.9837 | 0.9838 | 0.9348 | 0.9623 |
| Caenorhabditis elegans | 0.9810 | 0.9918 | 0.9732 | 0.9864 | 0.9866 | 0.9786 | 0.9732 |
Performances on the cross-species validations.
| Training Set | Test Set | Accuracy |
|---|---|---|
| Benchmark dataset | Mus musculus | 0.9835 |
| | | 0.5267 |
| | | 0.5205 |
| | | 0.4754 |
| | Mus musculus | 0.9243 |
| | Benchmark dataset | 0.4886 |
| | | 0.5230 |
| | | 0.4812 |
| | Mus musculus | 0.9713 |
| | Benchmark dataset | 0.4803 |
| | | 0.5147 |
| | | 0.4924 |
| | Mus musculus | 0.9475 |
| | Benchmark dataset | 0.4563 |
| | | 0.4585 |
| | | 0.4871 |
Figure 4. Loss comparisons across different models.
Figure 5. Accuracy comparisons across different models.