Rohit Singh, Kapil Devkota, Samuel Sledzieski, Bonnie Berger, Lenore Cowen.
Abstract
SUMMARY: Computational methods to predict protein-protein interaction (PPI) typically segregate into sequence-based 'bottom-up' methods that infer properties from the characteristics of the individual protein sequences, or global 'top-down' methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for PPI prediction. While Topsy-Turvy makes predictions using only sequence data, during the training phase it takes a transfer-learning approach by incorporating patterns from both global and molecular-level views of protein interaction. In a cross-species context, we show it achieves state-of-the-art performance, offering the ability to perform genome-scale, interpretable PPI prediction for non-model organisms with no existing experimental PPI data. In species with available experimental PPI data, we further present a Topsy-Turvy hybrid (TT-Hybrid) model which integrates Topsy-Turvy with a purely network-based model for link prediction that provides information about species-specific network rewiring. TT-Hybrid makes accurate predictions for both well- and sparsely-characterized proteins, outperforming both its constituent components as well as other state-of-the-art PPI prediction methods. Furthermore, running Topsy-Turvy and TT-Hybrid screens is feasible for whole genomes, and thus these methods scale to settings where other methods (e.g. AlphaFold-Multimer) might be infeasible. The generalizability, accuracy and genome-level scalability of Topsy-Turvy and TT-Hybrid unlock a more comprehensive map of protein interaction and organization in both model and non-model organisms.
Year: 2022 PMID: 35758793 PMCID: PMC9235477 DOI: 10.1093/bioinformatics/btac258
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Topsy-Turvy synthesizes sequence-to-structure-based prediction using D-SCRIPT with network-based prediction using GLIDE. (A) D-SCRIPT uses a protein language model to generate representative embeddings of protein sequences, which are combined with a convolutional neural network to predict protein interaction. It is supervised using binary interaction labels from the training network and regularized by a measure of contact map sparsity. (B) GLIDE scores all possible edges using a weighted combination of global and local network scores which are learned from the edges already in the training network. (C) Topsy-Turvy is supervised with both the binary interaction labels of the true (training) network and with the GLIDE predicted scores, thus integrating bottom-up and top-down approaches for PPI prediction into the learned Topsy-Turvy model.
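The dual supervision described in panel (C) can be illustrated with a minimal sketch. It assumes the two signals are mixed as a convex combination of two binary cross-entropy terms, and that GLIDE scores are binarized at a cutoff before use. Neither assumption is confirmed by the caption, and the names `glide_weight` and `glide_cutoff` are hypothetical.

```python
import math

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for one predicted probability and one 0/1 target."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1.0 - pred))

def combined_loss(preds, labels, glide_scores, glide_weight=0.2, glide_cutoff=0.5):
    """Mean over pairs of (1 - w) * BCE(pred, label) + w * BCE(pred, GLIDE target).

    Assumption: GLIDE scores are turned into 0/1 targets by thresholding at
    glide_cutoff; both parameters are illustrative, not the published ones.
    """
    total = 0.0
    for p, y, g in zip(preds, labels, glide_scores):
        glide_target = 1 if g >= glide_cutoff else 0
        total += (1 - glide_weight) * bce(p, y) + glide_weight * bce(p, glide_target)
    return total / len(preds)
```

Under this sketch, a confident prediction is penalized less when the network-derived target agrees with the experimental label than when the two disagree, which is one way a top-down signal could shape a sequence-only model during training.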
Hyperparameter search: cross-validation AUPR (area under precision–recall curve) scores on full human PPI network for (a) grid search for gp, with gt fixed to 90 (estimated from small-scale explorations), (b) grid search for gt, with gp fixed to 0.2 [i.e. the optimal value from (a)]
(a)
| gp (with gt = 90) | AUPR |
|---|---|
| 0.1 | 0.739 |
| 0.2 | |
| 0.4 | 0.759 |
| 0.8 | 0.760 |
(b)
| gt (with gp = 0.2) | AUPR |
|---|---|
| 90 | 0.697 |
| | |
| 95 | 0.691 |
| 97.5 | 0.690 |
Note: The metrics reported in the tables are the validation AUPR scores maximized over three epochs of training.
GLIDE and node2vec comparison: AUPR scores for PPI prediction on the Drosophila BioGRID network
| p | GLIDE | node2vec |
|---|---|---|
| 0.8 | | 0.681 |
| 0.6 | | 0.721 |
| 0.4 | | 0.664 |
| 0.2 | | 0.574 |
Note: Higher values of p correspond to a higher proportion of edges preserved in the training network. Bold entries represent best performance.
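The role of p can be made concrete with a short sketch. It assumes edges are retained uniformly at random, which the note does not state; `split_edges` is a hypothetical helper, not part of GLIDE or node2vec.

```python
import random

def split_edges(edges, p, seed=0):
    """Keep a fraction p of edges for training; the remainder is held out.

    Assumption: uniform random sampling without replacement.
    """
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(p * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Toy network of 10 edges: p = 0.8 keeps 8 for training and holds out 2.
edges = [(i, i + 1) for i in range(10)]
train, held_out = split_edges(edges, p=0.8)
```

Lower p leaves fewer training edges and more held-out positives, so the prediction task becomes harder as the training network gets sparser.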
Topsy-Turvy improves upon D-SCRIPT (Sledzieski et al.), PIPR (Chen et al.) and DeepPPI (Richoux et al.) for cross-species PPI prediction
| Species | Model | AUPR | AUROC | FPR at 0.1 recall | FPR at 0.5 recall |
|---|---|---|---|---|---|
| | PIPR | 0.526 | 0.839 | 0.002 | 0.057 |
| | DeepPPI | 0.518 | 0.816 | | 0.059 |
| | D-SCRIPT | 0.663 ± 0.05 | 0.901 ± 0.02 | 0.002 | 0.014 |
| | Topsy-Turvy | | | 0.001 | |
| | PIPR | 0.278 | 0.728 | 0.007 | 0.197 |
| | DeepPPI | 0.231 | 0.659 | 0.012 | 0.274 |
| | D-SCRIPT | 0.605 ± 0.06 | 0.890 ± 0.02 | 0.003 | 0.022 |
| | Topsy-Turvy | | | | |
| | PIPR | 0.346 | 0.757 | 0.002 | 0.148 |
| | DeepPPI | 0.252 | 0.671 | 0.007 | 0.252 |
| | D-SCRIPT | 0.550 ± 0.08 | 0.853 ± 0.04 | 0.003 | 0.032 |
| | Topsy-Turvy | | | | |
| | PIPR | 0.230 | 0.718 | 0.017 | 0.213 |
| | DeepPPI | 0.201 | 0.652 | 0.018 | 0.288 |
| | D-SCRIPT | 0.399 ± 0.09 | 0.790 ± 0.06 | 0.005 | 0.089 |
| | Topsy-Turvy | | | | |
| | PIPR | 0.271 | 0.675 | 0.005 | 0.246 |
| | DeepPPI | 0.271 | 0.688 | 0.004 | 0.243 |
| | D-SCRIPT | 0.513 ± 0.09 | 0.770 ± 0.03 | 0.002 | 0.040 |
| | Topsy-Turvy | | | | |
Note: All species were evaluated using models trained on a large corpus of human PPIs. For D-SCRIPT and Topsy-Turvy, we report the average and standard deviation of results from three random initializations. For PIPR and DeepPPI, we report the results from the study by Sledzieski et al., where the same evaluation scheme and data were used. For all datasets, there is a 1:10 ratio of positive to negative pairs, which means a random baseline would have an AUPR of 0.091 and an AUROC of 0.5. Bold entries represent best performance.
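The random-baseline AUPR quoted in the note follows from the class ratio: a classifier that ranks pairs at random has expected precision equal to the fraction of positives at every recall level. A short check, with `random_aupr_baseline` as an illustrative helper name:

```python
def random_aupr_baseline(n_pos, n_neg):
    """Expected AUPR of a random ranking equals the positive prevalence."""
    return n_pos / (n_pos + n_neg)

# 1:10 positive-to-negative ratio, as in the evaluation sets.
baseline = random_aupr_baseline(1, 10)
print(round(baseline, 3))  # 0.091
```

The same reasoning explains why the AUROC baseline stays at 0.5 regardless of class balance, while the AUPR baseline moves with it.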
Cross-species performance of D-SCRIPT and Topsy-Turvy, subdivided by node degree in target species
| Model | Overall AUPR | AUPR by maximum degree | | | |
|---|---|---|---|---|---|
| D-SCRIPT | 0.356 | 0.030 | 0.067 | 0.118 | 0.475 |
| Topsy-Turvy | | | | | |
Note: Both methods were trained on human PPI data and tested on fly (BioGRID). The analysis is limited to protein pairs where both proteins occur in the fly PPI graph. In addition to overall AUPR, we also group each protein pair by the maximum of the degrees of its nodes in the fly PPI network. Both methods improve as maximum degree increases, and Topsy-Turvy consistently outperforms D-SCRIPT across all subsets—especially so for putative interactions between low-degree nodes. Bold entries represent best performance.
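The grouping described in the note can be sketched as follows. The bin boundaries are hypothetical, since the actual degree ranges are not given here, and average precision is used as a step-wise estimate of AUPR, which may differ from the estimator used in the study.

```python
from collections import defaultdict

def average_precision(labels, scores):
    """Mean of the precision values at each positive, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")  # undefined without positives

def aupr_by_max_degree(pairs, labels, scores, degree, bin_edges):
    """Bucket each pair by max(degree[u], degree[v]) and score each bucket.

    bin_edges are inclusive upper bounds (hypothetical values); pairs above
    the last edge fall into a final open-ended bucket.
    """
    buckets = defaultdict(lambda: ([], []))
    for (u, v), y, s in zip(pairs, labels, scores):
        d = max(degree[u], degree[v])
        b = next((i for i, edge in enumerate(bin_edges) if d <= edge), len(bin_edges))
        buckets[b][0].append(y)
        buckets[b][1].append(s)
    return {b: average_precision(ys, ss) for b, (ys, ss) in sorted(buckets.items())}
```

Binning by the maximum rather than the minimum degree means a pair counts as low-degree only when both proteins are sparsely characterized, which is the hard case the note highlights.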
Fig. 2. Comparing Topsy-Turvy and GLIDE in situations when both can be used. GLIDE was trained on a subset of the fly PPI network (e.g. training on 80% of PPIs when p = 0.8); Topsy-Turvy was trained on human PPI data and had no access to fly data for training. Both methods were evaluated on held-out positives as well as a randomly sampled set of negative examples, where pairs containing proteins with degree 21 on the subset networks were removed from the held-out examples during testing; the analysis is limited to proteins in the fly PPI network. In addition to reporting overall AUPR, we also group each protein pair in the evaluation set by its shortest-path distance in the training network.
TT-Hybrid improves upon both of its constituent components on in-species prediction
| Sparsity | GLIDE | Topsy-Turvy | TT-Hybrid | Random |
|---|---|---|---|---|
| | 0.380 | 0.038 | | 0.004 |
| | 0.437 | 0.079 | | 0.009 |
| | 0.412 | 0.105 | | 0.014 |
| | 0.318 | 0.133 | | 0.019 |
Note: We generated partitions of the fly network of varying sparsity, using the sparsified networks as training data for GLIDE. Sparsity p corresponds to the proportion of edges retained in the training network (p = 0.8 is the least sparse). Topsy-Turvy was trained on human PPIs. TT-Hybrid combines the predictions from both GLIDE and Topsy-Turvy. Here, we report the AUPR of each method on the held-out edges removed from each network subset. We also show the AUPR of the random control; due to varying class imbalances, AUPR scores increase slightly with increasing sparsity. Bold entries represent best performance.