Olivér M Balogh, Bettina Benczik, András Horváth, Mátyás Pétervári, Péter Csermely, Péter Ferdinandy, Bence Ágg.
Abstract
BACKGROUND: Investigating possible interactions between two proteins in intracellular signaling is expensive and laborious in the wet lab; therefore, several in silico approaches have been implemented to narrow down the candidates for future experimental validation. Reformulated in terms of network theory, the set of proteins can be represented as the nodes of a network and the interactions between them as its edges. The resulting protein-protein interaction (PPI) network enables the use of link prediction techniques to discover new probable connections. Here, we therefore aimed to offer a novel approach to the link prediction task in PPI networks, utilizing a generative machine learning model.
Keywords: Conditional GAN; Edge prediction; Interactome; PPI prediction; Protein interaction prediction
Year: 2022 PMID: 35183129 PMCID: PMC8858570 DOI: 10.1186/s12859-022-04598-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Preprocessing of the input network. Schematic summary of the preprocessing module, which takes the provided protein–protein interaction (PPI) network and produces downscaled networks containing 90% (N90) and 90% of that 90% (thus 81%, N81) of the edges of the original network (N100) in the form of adjacency lists, and generates the induced subgraphs for each. These representation files are also created for the original network but are not required in the machine learning part, so the 5 listed files are fed into the conditional generative adversarial network (cGAN) model downstream
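The downscaling step in this caption can be illustrated with a minimal sketch. It assumes a plain random hold-out of 10% of the edges at each stage; the paper uses tenfold cross-validation folds, and the function name `downscale_edges`, the seeds, and the toy ring network are hypothetical, not taken from the source.

```python
import random

def downscale_edges(edges, keep_fraction=0.9, seed=0):
    """Randomly split an edge list into (kept, held_out) parts."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_keep = round(len(shuffled) * keep_fraction)
    return shuffled[:n_keep], shuffled[n_keep:]

# Toy N100: a ring of 100 proteins (hypothetical data, not from the paper).
n100 = [(i, (i + 1) % 100) for i in range(100)]

# N90 keeps 90% of N100; N81 keeps 90% of N90, i.e. 81% of N100.
n90, held_out_100 = downscale_edges(n100, seed=1)
n81, held_out_90 = downscale_edges(n90, seed=2)

print(len(n100), len(n90), len(n81))  # 100 90 81
```

In this reading, the held-out edges of each stage act as the known "missing" links that a model trained on the smaller network should recover.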
Fig. 2 Data preprocessing: downscaling and induced subgraph generation. Human protein–protein interaction (PPI) network (N100), with the edges to be deleted in the tenfold cross-validation stage colored red; deleting them yields the training dataset (N90) in one example fold (A). The networks constructed from the truncated datasets (N90) generated with tenfold cross-validation are traversed with a modified version of the classical breadth-first search (BFS) to extract equal-sized induced subgraphs, which serve as input for the conditional generative adversarial network (cGAN) (see Fig. 1). In the example induced subgraph, node color intensity and node labels represent the depth level of the modified BFS. In contrast to the classical BFS, the traversal was supplemented with size specifications, and on the last depth level of the modified BFS, nodes were randomly selected (possible nodes on last level: gray nodes; selected nodes on last level: pale green nodes with label) (B). Only network components covered by the modified BFS are shown in these representations (giant component and appropriately sized isolated components). See Additional file 2: Fig. S5 for a high-resolution image of the representative initial full network with node annotation
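One way to read the modified BFS described in this caption is sketched below: the traversal proceeds level by level until a target size is reached, and on the last level only as many nodes as still fit are drawn at random. This is an interpretation of the caption, not the authors' implementation; `sized_bfs_nodes`, `induced_edges`, and the toy graph are hypothetical.

```python
import random

def sized_bfs_nodes(adj, start, size, rng):
    """Collect exactly `size` nodes around `start`, level by level.

    Returns None when the component of `start` has fewer than `size` nodes.
    """
    visited = {start}
    frontier = [start]
    while frontier and len(visited) < size:
        next_level, seen = [], set()
        for u in frontier:
            for v in adj[u]:
                if v not in visited and v not in seen:
                    seen.add(v)
                    next_level.append(v)
        room = size - len(visited)
        if len(next_level) > room:
            # Last depth level: choose randomly among the candidate nodes.
            next_level = rng.sample(next_level, room)
        visited.update(next_level)
        frontier = next_level
    return visited if len(visited) == size else None

def induced_edges(adj, nodes):
    """Edges of the subgraph induced by `nodes` (self-loops are kept)."""
    return {(u, v) for u in nodes for v in adj[u] if v in nodes and u <= v}

# Toy undirected network as an adjacency list (hypothetical data).
toy_adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0], 4: [1], 5: [2]}
rng = random.Random(0)
nodes = sized_bfs_nodes(toy_adj, start=0, size=3, rng=rng)
edges = induced_edges(toy_adj, nodes)
print(sorted(nodes), sorted(edges))
```

Fixing the subgraph size gives every sample the same adjacency-matrix shape, which is convenient for a convolutional model.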
Fig. 3 Machine learning. Schematic diagram of the conditional generative adversarial network (cGAN) architecture, which uses the representation of the initial protein–protein interaction (PPI) network connectivity as the condition, with no input noise in the generator, and pairs of condition and real or generated connectivity representations in the discriminator (A), and simplified visualization of the prediction process (B–D). The input (sample condition) of the generator model is a representation of the initial connectivity via the adjacency matrix of the induced subgraph (B). The generator was implemented with layer-skipping concatenation between the convolutional (encoder block) and transposed convolution (decoder block) layers (C). The output (fake sample) of the generator model is a representation of the predicted connectivity in the form of a confidence matrix, approximating the expected adjacencies in the given induced subgraph (D)
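The input and output representations in panels B and D can be illustrated without the network itself. The sketch below builds the adjacency matrix of a small induced subgraph and ranks the node pairs that are absent from it by a confidence matrix. The confidence values are made up for illustration, and ranking absent pairs by confidence is an assumption about how predictions are read off, not a description of the authors' code.

```python
def adjacency_matrix(nodes, edges):
    """Symmetric 0/1 matrix of an undirected subgraph, nodes in sorted order."""
    index = {node: i for i, node in enumerate(sorted(nodes))}
    n = len(index)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def rank_candidates(condition, confidence):
    """List (score, i, j) for pairs absent from the condition, best first."""
    n = len(condition)
    pairs = [(confidence[i][j], i, j)
             for i in range(n) for j in range(i + 1, n)
             if condition[i][j] == 0]
    return sorted(pairs, reverse=True)

# Condition: induced subgraph on proteins A, B, C with one known interaction.
condition = adjacency_matrix({"A", "B", "C"}, [("A", "B")])

# Hypothetical generator output (confidence matrix), not real model output.
confidence = [[0.0, 0.9, 0.7],
              [0.9, 0.0, 0.2],
              [0.7, 0.2, 0.0]]

top = rank_candidates(condition, confidence)
print(top)  # [(0.7, 0, 2), (0.2, 1, 2)]
```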
Table 1 Mean results across tenfold cross-validation of prediction for each investigated species
| Species | AUROC | AUPRC | NDCG | Computing time (s) |
|---|---|---|---|---|
| | 0.913 | 0.169 | 0.761 | 1310 |
| | 0.931 | 0.202 | 0.781 | 899 |
| | 0.909 | 0.137 | 0.742 | 1334 |
| | 0.925 | 0.252 | 0.809 | 1525 |
| | 0.898 | 0.120 | 0.721 | 1429 |
| Mean | 0.915 | 0.176 | 0.763 | 1299 |
AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve; NDCG: normalized discounted cumulative gain
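For readers unfamiliar with the three metrics, the sketch below computes them for binary link labels in plain Python. It assumes at least one positive and one negative label, uses average precision as a common estimator of AUPRC, and uses binary relevance for NDCG; the paper's exact computation may differ, and the labels and scores are toy values.

```python
import math

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each positive."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    hits, total = 0, 0.0
    for rank, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def ndcg(labels, scores):
    """Discounted cumulative gain divided by its ideal-ordering value."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    ideal = sorted(labels, reverse=True)
    dcg = sum(y / math.log2(r + 1) for r, y in enumerate(ranked, start=1))
    idcg = sum(y / math.log2(r + 1) for r, y in enumerate(ideal, start=1))
    return dcg / idcg

# Toy example: 1 = held-out true link, 0 = non-link (hypothetical values).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(auroc(labels, scores), 3))              # 0.833
print(round(average_precision(labels, scores), 3))  # 0.833
print(round(ndcg(labels, scores), 3))               # 0.92
```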
Table 2 Mean results across tenfold cross-validation of prediction for each investigated species with the 3 different input data types: adjacency matrices only, adjacency matrices concatenated with embedding vector-based matrices, embedding vector-based matrices only
| Species | Adjacency only: AUROC | Adjacency only: AUPRC | Adjacency only: NDCG | Combined: AUROC | Combined: AUPRC | Combined: NDCG | Embedding only: AUROC | Embedding only: AUPRC | Embedding only: NDCG |
|---|---|---|---|---|---|---|---|---|---|
| | 0.913 | 0.169 | 0.761 | 0.915 | 0.179 | 0.767 | 0.761 | 0.022 | 0.606 |
| | 0.931 | 0.202 | 0.781 | 0.930 | 0.210 | 0.787 | 0.789 | 0.034 | 0.635 |
| | 0.909 | 0.137 | 0.742 | 0.904 | 0.135 | 0.739 | 0.739 | 0.020 | 0.599 |
| | 0.925 | 0.252 | 0.809 | 0.923 | 0.258 | 0.812 | 0.750 | 0.027 | 0.633 |
| | 0.898 | 0.120 | 0.721 | 0.895 | 0.126 | 0.729 | 0.745 | 0.017 | 0.583 |
| Mean | 0.915 | 0.176 | 0.763 | 0.913 | 0.182 | 0.767 | 0.757 | 0.024 | 0.612 |
AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve; NDCG: normalized discounted cumulative gain; Combined: adjacency matrices concatenated with embedding vector-based matrices as input
Table 3 Comparisons of the results from Table 2, presenting q-values from Mann–Whitney–Wilcoxon tests
| Species | Adjacency only–combined: AUROC | Adjacency only–combined: AUPRC | Adjacency only–combined: NDCG | Adjacency only–embedding only: AUROC | Adjacency only–embedding only: AUPRC | Adjacency only–embedding only: NDCG |
|---|---|---|---|---|---|---|
| | 0.297 | 0.577 | 0.579 | < 0.001 | < 0.001 | < 0.001 |
| | 0.971 | 0.631 | 0.631 | < 0.001 | < 0.001 | < 0.001 |
| | 0.523 | 0.796 | 0.688 | < 0.001 | < 0.001 | < 0.001 |
| | 0.228 | 0.481 | 0.475 | < 0.001 | < 0.001 | < 0.001 |
| | 0.429 | 0.429 | 0.290 | < 0.001 | < 0.001 | < 0.001 |
| Mean | 0.490 | 0.583 | 0.533 | < 0.001 | < 0.001 | < 0.001 |
AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve; NDCG: normalized discounted cumulative gain; Combined: adjacency matrices concatenated with embedding vector-based matrices as input
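A q-value is a p-value adjusted for multiple comparisons. The source does not state which adjustment was used, so the sketch below shows the Benjamini-Hochberg procedure purely as an assumed, commonly used choice. The p-values are hypothetical; in practice they could come from a Mann–Whitney test such as `scipy.stats.mannwhitneyu`.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical raw p-values, not taken from the paper.
p_values = [0.01, 0.04, 0.03, 0.5]
q_values = benjamini_hochberg(p_values)
print([round(q, 4) for q in q_values])  # [0.04, 0.0533, 0.0533, 0.5]
```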
Table 4 Basic network properties and the limiting effects of using induced subgraphs for novel link prediction, for each investigated species
| Species | Number of proteins (nodes) | Number of interactions (links) | Network sparsity with self-loops included (%) | Induced subgraph node coverage on N90 (%) | Induced subgraph node coverage on N100 (%) | Known missing links that could not be predicted in N100 from N90 (%) |
|---|---|---|---|---|---|---|
| | 3555 | 16,586 | 0.26 | 67.2 | 69.8 | 12.3 |
| | 2052 | 16,635 | 0.79 | 77.8 | 79.3 | 6.5 |
| | 3868 | 16,548 | 0.22 | 64.0 | 66.5 | 14.7 |
| | 4053 | 22,249 | 0.27 | 68.5 | 71.2 | 10.1 |
| | 3731 | 14,637 | 0.21 | 68.9 | 71.5 | 15.3 |
| Mean | 3452 | 17,331 | 0.28 | 69.3 | 71.7 | 11.8 |
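The per-species values in the sparsity column appear consistent with the share of realized links among all possible undirected node pairs, self-loops included, that is, m / (n(n + 1) / 2). The sketch below checks this against the five species rows of the table. That this is the authors' exact definition is an assumption, and `density_with_self_loops` is a hypothetical helper name.

```python
def density_with_self_loops(n_nodes, n_edges):
    """Percentage of realized links among n(n + 1) / 2 possible pairs."""
    possible = n_nodes * (n_nodes + 1) // 2
    return 100.0 * n_edges / possible

# (nodes, links) pairs copied from the five species rows of the table.
rows = [(3555, 16586), (2052, 16635), (3868, 16548),
        (4053, 22249), (3731, 14637)]
densities = [round(density_with_self_loops(n, m), 2) for n, m in rows]
print(densities)  # [0.26, 0.79, 0.22, 0.27, 0.21]
```

Values this low mean that true links are a small minority of all node pairs, which is worth keeping in mind when comparing AUROC with AUPRC.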
Fig. 4 Visualizations of results: modified breadth-first search (BFS) coverage and predictions. Coverage of the network representing an example truncated human protein–protein interaction (PPI) training dataset (N90) achieved by the modified BFS. Node size and color intensity represent the frequency of occurrence of the given node in the induced subgraphs assessed by the modified BFS (A). Human PPI network with green edges denoting the top predicted interactions that complete the truncated N90 network to an N100 network (B). Only network components covered by the modified BFS are shown in these representations (giant component and appropriately sized isolated components). See Additional file 2: Fig. S5 for a high-resolution image of the representative initial full network with node annotation