Kanchan Jha, Sriparna Saha, Hiteshi Singh.
Abstract
Proteins are essential biological macromolecules required for nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation; they interact with other proteins in their surroundings (protein-protein interactions, PPIs) to complete biological activities. Knowledge of PPIs helps unravel cellular behavior and functionality. Computational methods automate PPI prediction and are less expensive than experimental methods in terms of resources and time. So far, most work on PPI prediction has focused on sequence information. Here, we use a graph convolutional network (GCN) and a graph attention network (GAT) to predict interactions between proteins using both structural information and sequence features. We build protein graphs from PDB files, which contain the 3D coordinates of atoms. Each protein graph represents the amino acid network, also known as the residue contact network, in which each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within a threshold distance. To extract node (residue) features, we use a protein language model: its input is the protein sequence, and its output is a feature vector for each amino acid of that sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. The results demonstrate the effectiveness of the proposed approach, which outperforms the previous leading methods. The source code and the data used to train the model are available at https://github.com/JhaKanchan15/PPI_GNN.git.
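The edge rule stated in the abstract (two residues are linked when any pair of their atoms lies within a threshold distance) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `residue_contact_edges`, the 6.0 Å default threshold, and the toy coordinates are assumptions chosen for the example, and parsing of the PDB file is left out.

```python
import numpy as np


def residue_contact_edges(residue_atoms, threshold=6.0):
    """Return undirected edges (i, j) with i < j between residues that have
    at least one atom pair within `threshold` angstroms.

    residue_atoms: list of (n_atoms, 3) arrays, one array per residue.
    threshold: cutoff distance; 6.0 is an assumed value, not taken from the source.
    """
    edges = []
    for i in range(len(residue_atoms)):
        for j in range(i + 1, len(residue_atoms)):
            # Distances between every atom of residue i and every atom of residue j.
            diff = residue_atoms[i][:, None, :] - residue_atoms[j][None, :, :]
            dist = np.sqrt((diff ** 2).sum(axis=-1))
            if dist.min() <= threshold:
                edges.append((i, j))
    return edges


# Toy "protein" with three residues; coordinates are made up for illustration.
toy = [
    np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    np.array([[4.0, 0.0, 0.0]]),
    np.array([[20.0, 0.0, 0.0]]),
]
edges = residue_contact_edges(toy, threshold=6.0)
print(edges)  # only residues 0 and 1 are close enough: [(0, 1)]
```

The double loop is quadratic in the number of residues, which is acceptable for a sketch; a spatial index would be a natural replacement for long chains.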
Year: 2022 PMID: 35589837 PMCID: PMC9120162 DOI: 10.1038/s41598-022-12201-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Characteristics of the PPI datasets.
| Dataset | # Samples | # Positive samples | # Negative samples |
|---|---|---|---|
| Human | 22,217 | 16,220 | 5,997 |
| S. cerevisiae | 7,274 | 2,847 | 4,427 |
Dimension of node features.
| S. no. | Method | Dimension |
|---|---|---|
| 1 | LSTM-based language model | 1024 |
| 2 | BERT-based language model | 1024 |
| 3 | One-hot encoding of amino acids | 20 |
| 4 | Physicochemical properties of amino acids | 7 |
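The 20-dimensional one-hot baseline in the table above is simple enough to sketch directly. The alphabetical ordering of the one-letter codes and the all-zero row for non-standard residues are assumptions made for this example; the source does not state how either is handled.

```python
import numpy as np

# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def one_hot_features(sequence):
    """Return a (len(sequence), 20) matrix with one row per residue."""
    feats = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        if aa in INDEX:  # non-standard residues keep an all-zero row (assumption)
            feats[pos, INDEX[aa]] = 1.0
    return feats


x = one_hot_features("MKV")
print(x.shape)  # (3, 20): three residues, 20 features each
```

Each row of `x` would serve as the feature vector of the corresponding node in the residue graph.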
Figure 1. Graph representation of a protein with node features.
Figure 2. Illustration of the proposed approach.
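As a rough illustration of the graph-convolution step named in the abstract, the sketch below applies one GCN layer in the commonly used normalized form ReLU(D^-1/2 (A + I) D^-1/2 H W) to a toy residue graph and mean-pools the result into one vector per protein. The function name `gcn_layer`, the random weights, the feature sizes, and the mean pooling are assumptions for illustration; the source does not specify these details here.

```python
import numpy as np


def gcn_layer(adj, h, w):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1)              # node degree including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ h @ w, 0.0)


rng = np.random.default_rng(0)
# Toy residue graph: a chain of three residues (0-1 and 1-2 are in contact).
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
h = rng.normal(size=(3, 4))   # 4 input features per residue (hypothetical size)
w = rng.normal(size=(4, 2))   # untrained weights, for shape illustration only
out = gcn_layer(adj, h, w)
graph_vec = out.mean(axis=0)  # mean pooling to one vector per protein (assumption)
print(out.shape, graph_vec.shape)
```

In a pairwise setting, the pooled vectors of two proteins would then be combined and passed to a classifier; that part is omitted here.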
Figure 3. Illustration of the feature extraction from protein sequence using language models.
Performance of GNN variants using different node features on the Human test set.
| GNN model | Node features | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| GCN | LSTM-based LM | 97.93 | 98.53 | **96.27** | **98.65** | 98.59 | 94.70 | **98.37** | **98.92** |
| | BERT-based LM | 96.32 | 97.33 | 93.65 | 97.58 | 97.46 | 90.80 | 97.59 | 98.39 |
| | One-hot encoding | 81.30 | 94.24 | 45.46 | 82.72 | 88.10 | 47.47 | 84.25 | 92.97 |
| | Physicochemical properties | 77.11 | 95.47 | 26.29 | 78.20 | 85.97 | 31.60 | 75.65 | 88.60 |
| GAT | LSTM-based LM | **98.13** | **98.84** | 96.18 | 98.62 | **98.73** | **95.20** | 98.28 | 98.86 |
| | BERT-based LM | 96.59 | 97.52 | 94.15 | 97.77 | 97.64 | 91.48 | 97.35 | 98.21 |
| | One-hot encoding | 79.84 | 92.37 | 45.12 | 82.33 | 87.07 | 43.50 | 82.24 | 91.48 |
| | Physicochemical properties | 75.36 | 94.21 | 23.15 | 77.25 | 84.89 | 25.12 | 71.18 | 87.16 |
Best values are in bold.
Performance of GNN variants using different node features on the S. cerevisiae test set.
| GNN model | Node features | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| GCN | LSTM-based LM | 91.42 | 90.62 | 92.20 | 91.88 | 91.24 | 82.84 | 95.26 | 94.97 |
| | BERT-based LM | 86.68 | 84.33 | 88.95 | 88.07 | 86.16 | 73.40 | 92.01 | 91.43 |
| | One-hot encoding | 71.09 | 75.51 | 66.78 | 68.89 | 72.05 | 42.43 | 79.23 | 78.58 |
| | Physicochemical properties | 68.15 | 73.68 | 62.76 | 65.85 | 69.55 | 36.65 | 74.13 | 72.92 |
| GAT | LSTM-based LM | **92.15** | **91.76** | **92.53** | **92.29** | **92.02** | **84.30** | **95.85** | **95.11** |
| | BERT-based LM | 86.74 | 82.62 | 90.72 | 89.59 | 85.97 | 73.65 | 92.23 | 91.88 |
| | One-hot encoding | 69.23 | 64.99 | 73.36 | 70.38 | 67.58 | 38.49 | 75.27 | 71.88 |
| | Physicochemical properties | 61.15 | 70.14 | 52.40 | 58.94 | 64.05 | 22.88 | 65.40 | 62.64 |
Best values are in bold.
Average results of 5-fold cross-validation of GNN variants using LSTM-based LM node features for the PPI datasets.
| Datasets | GNN model | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Human | GCN | 98.26 (0.39) | 95.08 (1.15) | 98.20 (0.47) | 98.82 (0.27) | 95.57 (0.76) | 98.06 (0.44) | 98.50 (0.42) | |
| | GAT | 99.37 (0.45) | | | | | | | |
| S. cerevisiae | GCN | 94.41 (0.94) | 94.19 (1.30) | 94.37 (0.98) | 88.82 (1.73) | 97.01 (1.37) | | | |
| | GAT | 94.49 (1.13) | 94.46 (1.15) | 96.50 (1.43) | | | | | |
Best values are in bold.
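For readers unfamiliar with how a "mean (spread)" entry such as those in the cross-validation table is produced, the sketch below splits sample indices into five folds and summarizes per-fold scores. It assumes the parenthesized figure is the sample standard deviation across folds, which the source does not state; the fold scores and the helper names `k_fold_indices` and `summarize_folds` are made up for illustration.

```python
import statistics


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) lists for k contiguous, non-shuffled folds."""
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def summarize_folds(scores):
    """Format per-fold scores as 'mean (std)' with two decimals."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumption)
    return f"{mean:.2f} ({std:.2f})"


folds = list(k_fold_indices(12, k=5))
fold_scores = [98.0, 98.5, 97.9, 98.6, 98.3]  # hypothetical per-fold accuracies
summary = summarize_folds(fold_scores)
print(len(folds), summary)
```

In practice the samples would be shuffled (and often stratified by class) before splitting; the contiguous split here only keeps the example short.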
Results of GNN variants using LSTM-based LM node features with different numbers of layers for the PPI datasets.
| Datasets | GNN model | # layers | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | GCN | 1 | 97.93 | 98.53 | 96.27 | 98.65 | 98.59 | 94.70 | 98.37 | 98.92 |
| | | 2 | 97.99 | 98.90 | 95.50 | 98.39 | 98.64 | 94.84 | 97.88 | 98.55 |
| | | 3 | 97.03 | 98.54 | 93.04 | 97.39 | 97.96 | 92.50 | 96.51 | 97.58 |
| | GAT | 1 | 98.13 | 98.84 | 96.18 | 98.62 | 98.73 | 95.20 | 98.28 | 98.86 |
| | | 2 | 97.73 | 98.13 | 96.61 | 98.77 | 98.45 | 94.21 | 98.24 | 98.92 |
| | | 3 | 97.66 | 98.60 | 95.17 | 98.18 | 98.39 | 94.11 | 97.50 | 98.23 |
| S. cerevisiae | GCN | 1 | 91.42 | 90.62 | 92.20 | 91.88 | 91.24 | 82.84 | 95.26 | 94.97 |
| | | 2 | 91.64 | 92.31 | 90.98 | 91.07 | 91.69 | 83.29 | 95.66 | 94.84 |
| | | 3 | 90.57 | 92.08 | 89.06 | 89.35 | 90.70 | 81.18 | 94.19 | 93.00 |
| | GAT | 1 | 92.15 | 91.76 | 92.53 | 92.29 | 92.02 | 84.30 | 95.85 | 95.11 |
| | | 2 | 91.76 | 90.39 | 93.09 | 92.72 | 91.54 | 83.53 | 96.30 | 96.06 |
| | | 3 | 91.08 | 92.99 | 89.18 | 89.54 | 91.23 | 82.22 | 95.02 | 93.58 |
Figure 4. Illustration of the designed baseline.
Results of the designed baselines using LSTM-based LM features for the PPI datasets.
| Datasets | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Human | 95.34 | 96.24 | 92.81 | 97.41 | 96.82 | 88.13 | 97.03 | 98.07 |
| S. cerevisiae | 89.83 | 90.36 | 89.34 | 88.91 | 89.63 | 79.67 | 94.04 | 93.86 |
Performance analysis of GNN variants using LSTM-based LM node features for different sample sizes of the PPI datasets.
| Datasets | # Samples | GNN model | | | | |
|---|---|---|---|---|---|---|
| Human | 4k | GCN | 93.59 | 94.20 | 92.90 | 87.13 |
| | | GAT | 91.11 | 90.61 | 91.67 | 82.20 |
| | 8k | GCN | 95.33 | 94.77 | 96.00 | 90.62 |
| | | GAT | 94.60 | 93.96 | 95.36 | 89.16 |
| | 12k | GCN | 96.72 | 96.83 | 96.60 | 93.42 |
| | | GAT | 96.28 | 97.02 | 95.46 | 92.54 |
| | all (22,217) | GCN | 97.93 | 96.83 | 96.27 | 94.70 |
| | | GAT | 98.13 | 98.84 | 96.18 | 95.20 |
| S. cerevisiae | 2k | GCN | 80.00 | 79.49 | 80.49 | 59.97 |
| | | GAT | 77.75 | 82.56 | 73.17 | 55.90 |
| | 4k | GCN | 85.13 | 85.06 | 85.19 | 70.25 |
| | | GAT | 85.00 | 85.32 | 84.69 | 70.00 |
| | 6k | GCN | 89.17 | 89.45 | 88.87 | 78.32 |
| | | GAT | 88.58 | 87.18 | 90.07 | 77.22 |
| | all (8,854) | GCN | 91.42 | 90.62 | 92.20 | 82.24 |
| | | GAT | 92.15 | 91.76 | 92.53 | 84.30 |
Comparative analysis of the proposed approach with existing methods for the Human dataset.
| Method | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Sun’s work | 96.82 | – | – | – | – | – | – | – |
| Jha’s work | 97.20 | 98.07 | 95.04 | 97.99 | 98.03 | 93.16 | 98.39 | 98.87 |
| Yang’s work | 96.91 | 97.90 | 93.73 | 98.06 | – | – | – | – |
| Jha’s work | 97.94 | 95.84 | 98.13 | 98.51 | 95.18 | | | |
| Proposed approach | 98.84 | 98.28 | 98.86 | | | | | |
Best values are in bold.
Comparative analysis of the proposed approach with existing methods for the S. cerevisiae dataset.
| Method | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Wong’s work | 93.92 | 91.10 | – | – | 88.60 | 94.00 | – | |
| Du’s work | 92.50 | 90.56 | 94.49 | 94.38 | – | 85.08 | 97.43 | – |
| Gonzalez’s work | 92.59 | 91.40 | 91.59 | 93.65 | 92.51 | 85.20 | 97.40 | – |
| Hashemifar’s work | 94.55 | 92.24 | – | 96.68 | – | – | – | – |
| Jha’s work | 94.49 | 93.19 | 93.41 | 94.58 | 89.01 | | | |
| Proposed approach | 95.15 | 94.46 | 97.24 | 96.50 | | | | |
Best values are in bold.