DISPOT: a simple knowledge-based protein domain interaction statistical potential.

Oleksandr Narykov¹, Dmytro Bogatov², Dmitry Korkin¹,³.

Abstract

MOTIVATION: The complexity of protein-protein interactions (PPIs) is further compounded by the fact that an average protein consists of two or more domains, structurally and evolutionarily independent subunits. Experimental studies have demonstrated that an interaction between a pair of proteins is not carried out by all domains constituting each protein, but rather by a select subset. However, determining which domains from each protein mediate the corresponding PPI is a challenging task.
RESULTS: Here, we present domain interaction statistical potential (DISPOT), a simple knowledge-based statistical potential that estimates the propensity of an interaction between a pair of protein domains, given their Structural Classification of Proteins (SCOP) family annotations. The statistical potential is derived from the analysis of >352 000 structurally resolved PPIs obtained from DOMMINO, a comprehensive database of structurally resolved macromolecular interactions.
AVAILABILITY AND IMPLEMENTATION: DISPOT is implemented in Python 2.7 and packaged as an open-source tool. It has two modes, basic and auto-extraction. The source code for both modes is available on GitHub: https://github.com/korkinlab/dispot and standalone Docker images are available on DockerHub: https://hub.docker.com/r/korkinlab/dispot. The web server is freely available at http://dispot.korkinlab.org/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 31350874 PMCID: PMC6954640 DOI: 10.1093/bioinformatics/btz587

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Large-scale characterization of protein–protein interactions (PPIs) using high-throughput interactomics approaches, such as yeast two-hybrid and tandem-affinity purification/mass spectrometry methods (Gavin et al.; Rolland et al.), has provided scientists with new insights into cell functioning at the systems level and has improved our understanding of the molecular machinery underlying complex genetic disorders (Barabasi and Oltvai, 2004; Cui et al.; Mitra et al.). Structural studies of PPIs have revealed that a PPI is often carried out by smaller structural protein subunits, the protein domains (Ekman et al.; Jin et al.; Vogel et al.). Roughly two-thirds of eukaryotic and more than one-third of prokaryotic proteins are estimated to be multi-domain proteins (Ekman et al.), and thus it is not surprising that ≈ 46% of structurally resolved interactions are domain–domain interactions (Kuang et al.). A high-throughput breakdown of the interactome at this domain-level resolution is a much more challenging experimental task; it is currently unfeasible at the whole-system level and requires computational methods to step in (Deng et al.; Finn et al.; Ohue et al.; Segura et al.).

Here, we present a simple knowledge-based domain interaction statistical potential (DISPOT), a tool that leverages the statistical information on interactions shared between homologous domains from structurally defined domain families. The knowledge-based potentials are extracted from our comprehensive database of structurally resolved macromolecular interactions, DOMMINO (Kuang et al.). Our statistical potential can be integrated into PPI prediction methods that deal with multi-domain proteins by ranking all possible pairwise combinations of domain interactions between two or more proteins. We want to stress that although DISPOT potentials provide some insight into a PPI, DISPOT is not a classification method, and the data it provides should be used in conjunction with additional information, e.g. a specific pathway (Fig. 1E).

Fig. 1.

DISPOT statistical potential and its application. (A) A crystal structure (left) of the protein complex between CNTO607 Fab human monoclonal antibody (yellow and red colors denote two different chains) and interleukin-13 (IL-13, shown in blue), and the corresponding domain–domain interaction network (right). Shown in italics are SCOP family IDs, and in bold are DISPOT values for the corresponding interactions. Nodes colored with the same color belong to the same chain. Solid lines connecting nodes correspond to physical interactions, while dashed lines connect nodes corresponding to protein domains that do not physically interact. (B) A heatmap showing DISPOT values calculated for each pair of SCOP families, where only potentials for pairs of SCOP families with five or more non-redundant interactions are plotted. The families are grouped based on the SCOP class (a–g) and are ordered within each fold based on their IDs. (C) A contact map showing the correlation between the experimentally obtained human interactome HI-I-05 and DISPOT-based PPI prediction. A prediction that calls a PPI correctly is shown in magenta, while PPIs that were missed are shown in cyan. (D) Correlation, measured by the R² coefficient, between the hu.MAP interaction probability score and the DISPOT statistical potential for KEGG pathways (bottom) and GO clusters (top). (E) Distribution of the protein-level DISPOT statistical potentials grouped by the number of SCOP domains in a protein, defined using SUPERFAMILY.

2 Methodology

The development of DISPOT is driven by several observations. First, an average interaction between a pair of proteins is not carried out by all domains constituting each protein, but only by a select subset. Indeed, each domain has its own structure and biological function and may not be designed to interact with a particular domain from another protein (Banappagari et al.; Shimizu et al.). Second, domain–domain interactions often share homology: when two homologous domains interact with their partners, these partners are frequently homologous to each other as well (Kuang et al.). Thus, one can define the domain–domain interaction propensity in terms of the frequency of domain–domain interactions between the two domain families. Lastly, the propensity of domains to interact is expected to vary across different families, which allows a finer resolution of the PPI network. The quantification of the odds for a domain from one domain family to interact with a domain from another family is defined in this work as a knowledge-based statistical potential.

Statistical potentials are widely used in biophysical applications, often for characterizing the residue contacts between protein chains (Huang and Zou, 2008; Krüger et al.; Lu et al.). One of the main applications of residue-level statistical potentials is in protein docking (Kozakov et al.). Our domain–domain statistical potential complements the residue-level potentials by considering structural units from a higher level of the protein structure hierarchy and by requiring no structural information about the protein domains.

Specifically, the input for DISPOT consists of the sequences of the two interacting proteins. First, the domain architecture of each protein is obtained. To do so, a region of the protein sequence is annotated to a family of homologous domains. For the definition of domain families, we leverage the structural classification of proteins (SCOP) family-level classification (Andreeva et al.).
SCOP represents a structure-based hierarchical classification of relationships between protein domains or single-domain proteins, with ‘family’ being the first level of SCOP classification and ‘superfamily’ being the second level. Protein domains from the same SCOP family are evolutionarily closely related and often share the same function. Since a protein with no structural information cannot be directly annotated by SCOP, we use SUPERFAMILY (Gough and Chothia, 2002), a Hidden Markov Model (HMM)-based approach that maps regions of a protein sequence to one or several SCOP families or superfamilies. SUPERFAMILY allows us to cover a substantial subset of known proteins: the HMM coverage of the UniProt database at the protein sequence and overall amino acid levels was reported at 64.73% and 58.78%, respectively, in 2014 (Oates et al.).

Second, for each pair of SCOP families we count the number of experimentally determined, non-redundant PPIs between the members of these families. Our source of data is DOMMINO (Kuang et al., 2016), a comprehensive database of structurally resolved macromolecular interactions. It contains information about interactions between protein domains, interdomain linkers, terminal sequences, and protein peptides. In this work, we use exclusively domain–domain interactions because data on this type of interaction are the most abundant. To remove redundancy in the data, we use the ASTRAL compendium (Brenner et al.), which is integrated into the SCOPe database (Fox et al.). From ASTRAL, we obtain a set of domains in which each domain shares <95% sequence identity with any other domain in the set. This set is then used to determine pairs of redundant domain–domain interactions in the original DOMMINO dataset. Two domain–domain interactions are considered redundant if both corresponding pairs of domains share 95% or more sequence identity. For each pair of redundant domain–domain interactions, one interaction is randomly removed.
The process continues until no pair of redundant interactions can be detected.

Third, for each domain family from each protein, a statistical potential is calculated (Fig. 1A). Two types of statistical potentials are introduced in this work: (i) one calculated for a domain from a specific domain family and (ii) one calculated for a pair of domains, one domain from each of the two interacting proteins. The statistical potential P_i for a single domain D_i is calculated based on the total number of interactions extracted from the non-redundant DOMMINO dataset for the specific SCOP family this domain belongs to. The statistical potential P_ij for a pair of domains, D_i and D_j, is calculated based on the total number of occurrences N_ij of the interactions between all domains from the same two SCOP families as D_i and D_j. Those numbers are then transformed into probabilities [equations not preserved in this copy], using as reference values the average number of interactions for a domain family and the average number of interactions for a pair of domain families, both calculated from the non-redundant DOMMINO set.

DISPOT potentials are derived following a standard strategy for calculating a statistical potential. The statistical potentials for atomic contact pairs are traditionally derived from the Boltzmann relation (Huang and Zou, 2008) [equation not preserved in this copy], in which k is the Boltzmann constant, T is the system’s temperature, p is the experimentally observed density of atom pairs from different partners in a complex at a given distance, and the reference term is the corresponding density in the reference state. Since we do not work with atomic-level physical interactions, we omit the Boltzmann constant from the DISPOT equations and substitute the temperature with the inverse of a normalization constant Z. In addition, p and its reference-state counterpart are substituted with the numbers of interactions between domains in the DOMMINO database.

DISPOT can also provide integrated protein-level statistics. There are multiple ways to combine the domain-level statistics into protein-level statistics.
Two simple approaches to integrating domain–domain interactions for a given PPI, in terms of standalone (single-protein) and interaction (protein-pair) potentials, are [equations not preserved in this copy], respectively, where i and j correspond to the domains from proteins u and v. The rationale behind these definitions lies in the assumption that the single strongest domain–domain interaction is one of the most important defining factors for the PPI.

These definitions of cumulative potentials were tested in terms of their ability to predict a PPI using several experimental sources. First, we obtained the coverage landscape of the cumulative potentials on two experimental protein–protein interactomes, one obtained using high-throughput yeast two-hybrid screening (HI-I-05) (Rual et al.) and another obtained using a curated literature-based search (LitBM-17, http://interactome.baderlab.org/data/LitBM-17.psi). As expected, while this naïve method was able to recover 2944 PPIs in HI-I-05, it missed 1188 PPIs even using a lenient threshold of −20 (Fig. 1C). Similarly, for LitBM-17 the cumulative potential was able to recover only 1718 PPIs, while 1453 PPIs were not recovered (Supplementary Fig. S1). We then applied the same pairwise cumulative potential to a large-scale mass spectrometry study (Drew et al.). Specifically, we studied the correlation between the hu.MAP probability score and the cumulative pairwise score among KEGG pathways (Kanehisa and Goto, 2000) and GO clusters produced by GeneSCF on 13 855 genes with SUPERFAMILY annotation (Subhash and Kanduri, 2016) (Fig. 1D). While the number of highly correlated pairs was substantial, pairs with very little correlation still prevailed. Finally, the analysis of the cumulative single potential for a protein showed that it can take a diverse range of values, and this property seems to be independent of how many domains the protein has (Fig. 1E). Similar behavior was observed for the other basic cumulative measures (Supplementary Fig. S3).
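The domain-level potentials described in this section can be sketched in LaTeX as follows. The first line is the standard inverse Boltzmann relation for an atomic contact potential. The other two lines are only an assumed reading of the text, obtained by dropping k, replacing T with 1/Z and replacing the densities with interaction counts. The symbols $p^{*}$, $\bar{N}_{\mathrm{fam}}$ and $\bar{N}_{\mathrm{pair}}$ and the sign convention are assumptions made for illustration, not the published definitions.

```latex
\begin{align}
  w(r)   &= -k\,T\,\ln\frac{p(r)}{p^{*}(r)}
         && \text{(inverse Boltzmann relation)}\\
  P_{i}  &= -\frac{1}{Z}\,\ln\frac{N_{i}}{\bar{N}_{\mathrm{fam}}}
         && \text{(single domain, assumed form)}\\
  P_{ij} &= -\frac{1}{Z}\,\ln\frac{N_{ij}}{\bar{N}_{\mathrm{pair}}}
         && \text{(domain pair, assumed form)}
\end{align}
```

Here $N_{i}$ is the number of non-redundant interactions recorded for the SCOP family of domain $D_i$, $N_{ij}$ is the number recorded between the families of $D_i$ and $D_j$, and the barred quantities are the corresponding averages over all families and all family pairs.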
Overall, we have analyzed and summarized interactions from 3619 SCOP family pairs that were extracted from 352 199 PPIs. In total, domains from 1384 SCOP families were characterized; they form domain–domain interactions in 1384 ‘homo-SCOP’ interaction pairs (i.e. both domains are annotated with the same SCOP family) and 2235 ‘hetero-SCOP’ pairs (Fig. 1B and Supplementary Fig. S1). The analysis of the calculated statistical potentials showed a wide diversity across different families.

Finally, we would like to add a cautionary note about using the developed tool. DISPOT was designed not as a PPI prediction tool, but rather as a tool that provides additional information on the likelihood of specific domain–domain interactions in a given physical PPI. The main reason is that structural coverage of the PPI space is still far from complete, which would lead to a high number of false negatives if one were to use DISPOT as a standalone predictor. This intuition is supported by our evaluation of DISPOT against the two interactomics gold standards. Thus, if a researcher wants to employ DISPOT in a PPI prediction method, we recommend adding the DISPOT potentials as features to the overall feature vector, which would include other parameters such as secondary structure, evolutionary conservation of the sequence, predicted residue hydrophobicity, etc.
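As a minimal sketch of the recommendation above, the Python snippet below reduces the domain–domain potentials of a protein pair to one cumulative value and appends it to a larger feature vector. The function names, the lookup-table layout, the toy values and the choice of `max` as the reducer are all assumptions made for illustration. They are not part of the DISPOT code base, and `min` should be passed instead if lower potentials turn out to mean stronger interactions.

```python
from itertools import product


def cumulative_pair_potential(domains_u, domains_v, pair_potential, reducer=max):
    """Combine domain-level potentials of two proteins into one value.

    domains_u, domains_v: SCOP family IDs annotated on each protein.
    pair_potential: dict mapping frozenset({fam_a, fam_b}) -> potential
                    (hypothetical layout, unordered family pairs).
    reducer: picks the "strongest" pair; max is an assumption.
    Returns None when no domain pair has a known potential.
    """
    values = [
        pair_potential[frozenset((a, b))]
        for a, b in product(domains_u, domains_v)
        if frozenset((a, b)) in pair_potential
    ]
    return reducer(values) if values else None


def dispot_features(domains_u, domains_v, pair_potential, missing=0.0):
    """Return DISPOT-derived entries to append to a PPI feature vector."""
    best = cumulative_pair_potential(domains_u, domains_v, pair_potential)
    return [
        missing if best is None else best,  # cumulative pair potential
        0.0 if best is None else 1.0,       # flag: some pair has structural support
    ]


# Toy table: IDs are illustrative sccs-style strings, values are made up.
toy = {
    frozenset(("b.1.1.1", "a.26.1.2")): -3.5,
    frozenset(("b.1.1.2", "a.26.1.2")): -1.0,
    frozenset(("b.1.1.1",)): -0.2,  # homo-family pair
}
features = dispot_features(["b.1.1.1", "b.1.1.2"], ["a.26.1.2"], toy)
other_features = [0.42, 0.10]  # placeholders for e.g. conservation, hydrophobicity
feature_vector = other_features + features
print(feature_vector)
```

Keeping an explicit "no structural support" flag next to the potential lets a downstream classifier tell a weak potential apart from a missing one, which matters given the incomplete structural coverage noted above.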

3 Implementation and usage

The basic mode is implemented in Python and depends on the packages pandas and numpy. It takes SCOP identifiers (IDs) for either the ‘family’ (fa) or the ‘superfamily’ (sf) hierarchy level as input and produces the statistical potential for the corresponding pair of domains. Switching between the SCOP levels is done with the command line option sf. One of the possible input options is the command line option domains, which provides a list of space-separated SCOP identifiers. Based on this list, the program produces all possible unique pairwise combinations of identifiers and the corresponding statistical potentials. Option max produces the highest value of the statistical potential for a selected domain and the SCOP ID of the corresponding interaction partner domain. Option output specifies the output file. If no file path is specified, the program opens a console prompt asking the user to input the data. A detailed description of all acceptable input formats and options is available in the README file and in the help menu of the main script dispot.py.

The auto-extraction version relies on the SUPERFAMILY models and scripts and on the HMMER program to extract the corresponding SCOP IDs at either the family or the superfamily level of the hierarchy. The Perl interpreter is an additional dependency. HMMER is compatible with the major Linux distributions (it has been tested on Ubuntu 16.04 and on Alpine 3.7 with an additional installation of alpine-glibc). Windows users are advised to use the Docker image. The main script is dispot.py, and it includes several options: fasta_folder, which specifies a path to the folder with FASTA files; output_folder, which specifies a path to the results; and max, which replaces the regular output of all pairwise statistical potentials with the highest statistical potential for a given domain family and the SCOP ID of the interaction partner for which this value is achieved.
An additional script, batch_process.py, provides almost the same functionality, except that it uses the default locations: ./data/ for the input and ./data/results/ for the output. For each FASTA sequence, we extract a SUPERFAMILY-derived SCOP ID and the location(s) of the corresponding domain on the protein sequence. This information is stored in the ./tmp/ folder and is available until the next run of any of the scripts mentioned in this section. The data are stored as Python dictionary objects serialized with the pickle package. DISPOT has also been implemented as a web server that carries the full functionality of the developed methods and comes with a tutorial. The web server is freely available at http://dispot.korkinlab.org/.
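The basic-mode behavior described in this section (enumerating unique ID pairs, and reporting the best partner for the max option) can be illustrated with the short Python sketch below. It is not the code of dispot.py: the function names, the table keyed by unordered family pairs, the numeric IDs and the potential values are all hypothetical. Including self-pairs is also an assumption, made because homo-SCOP pairs exist in the data.

```python
from itertools import combinations_with_replacement


def pairwise_potentials(scop_ids, pair_potential):
    """List every unique unordered pair of the given IDs with its potential.

    Pairs that are absent from the lookup table are reported with None.
    """
    unique_ids = sorted(set(scop_ids))
    return [
        (a, b, pair_potential.get(frozenset((a, b))))
        for a, b in combinations_with_replacement(unique_ids, 2)
    ]


def best_partner(scop_id, pair_potential):
    """Return (partner_id, potential) with the highest potential for scop_id,
    or None if the ID does not occur in the table."""
    best = None
    for pair, value in pair_potential.items():
        if scop_id not in pair:
            continue
        # A one-element frozenset is a homo-family pair: the partner is itself.
        partner = next(iter(pair - {scop_id}), scop_id)
        if best is None or value > best[1]:
            best = (partner, value)
    return best


# Hypothetical lookup table keyed by unordered family pairs.
table = {
    frozenset(("48726", "47266")): 2.1,
    frozenset(("48726",)): 0.7,
    frozenset(("48942", "47266")): 1.3,
}
rows = pairwise_potentials(["48726", "47266", "48726"], table)
for row in rows:
    print(row)
print(best_partner("48726", table))
```

Storing each pair as a `frozenset` makes the lookup independent of the order in which the two families are given.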

Funding

This work was supported by the National Science Foundation (1458267) and the National Institutes of Health (LM012772-01A1) to D.K. Conflict of Interest: none declared.

28 in total

1. The ASTRAL compendium for protein structure and sequence analysis.

Authors: S E Brenner; P Koehl; M Levitt
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

Review 2. Network biology: understanding the cell's functional organization.

Authors: Albert-László Barabási; Zoltán N Oltvai
Journal: Nat Rev Genet Date: 2004-02 Impact factor: 53.242

3. iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions.

Authors: Robert D Finn; Mhairi Marshall; Alex Bateman
Journal: Bioinformatics Date: 2004-09-07 Impact factor: 6.937

4. PIPER: an FFT-based protein docking program with pairwise potentials.

Authors: Dima Kozakov; Ryan Brenke; Stephen R Comeau; Sandor Vajda
Journal: Proteins Date: 2006-11-01

5. An iterative knowledge-based scoring function for protein-protein recognition.

Authors: Sheng-You Huang; Xiaoqin Zou
Journal: Proteins Date: 2008-08

6. Functional organization of the yeast proteome by systematic analysis of protein complexes.

Authors: Anne-Claude Gavin; Markus Bösche; Roland Krause; Paola Grandi; Martina Marzioch; Andreas Bauer; Jörg Schultz; Jens M Rick; Anne-Marie Michon; Cristina-Maria Cruciat; Marita Remor; Christian Höfert; Malgorzata Schelder; Miro Brajenovic; Heinz Ruffner; Alejandro Merino; Karin Klein; Manuela Hudak; David Dickson; Tatjana Rudi; Volker Gnau; Angela Bauch; Sonja Bastuck; Bettina Huhse; Christina Leutwein; Marie-Anne Heurtier; Richard R Copley; Angela Edelmann; Erich Querfurth; Vladimir Rybin; Gerard Drewes; Manfred Raida; Tewis Bouwmeester; Peer Bork; Bertrand Seraphin; Bernhard Kuster; Gitte Neubauer; Giulio Superti-Furga
Journal: Nature Date: 2002-01-10 Impact factor: 49.962

7. Eukaryotic protein domains as functional units of cellular evolution.

Authors: Jing Jin; Xueying Xie; Chen Chen; Jin Gyoon Park; Chris Stark; D Andrew James; Marina Olhovsky; Rune Linding; Yongyi Mao; Tony Pawson
Journal: Sci Signal Date: 2009-11-24 Impact factor: 8.192

Review 8. Integrative approaches for finding modular structure in biological networks.

Authors: Koyel Mitra; Anne-Ruxandra Carvunis; Sanath Kumar Ramesh; Trey Ideker
Journal: Nat Rev Genet Date: 2013-10 Impact factor: 53.242

9. A proteome-scale map of the human interactome network.

Authors: Thomas Rolland; Murat Taşan; Benoit Charloteaux; Samuel J Pevzner; Quan Zhong; Nidhi Sahni; Song Yi; Irma Lemmens; Celia Fontanillo; Roberto Mosca; Atanas Kamburov; Susan D Ghiassian; Xinping Yang; Lila Ghamsari; Dawit Balcha; Bridget E Begg; Pascal Braun; Marc Brehme; Martin P Broly; Anne-Ruxandra Carvunis; Dan Convery-Zupan; Roser Corominas; Jasmin Coulombe-Huntington; Elizabeth Dann; Matija Dreze; Amélie Dricot; Changyu Fan; Eric Franzosa; Fana Gebreab; Bryan J Gutierrez; Madeleine F Hardy; Mike Jin; Shuli Kang; Ruth Kiros; Guan Ning Lin; Katja Luck; Andrew MacWilliams; Jörg Menche; Ryan R Murray; Alexandre Palagi; Matthew M Poulin; Xavier Rambout; John Rasla; Patrick Reichert; Viviana Romero; Elien Ruyssinck; Julie M Sahalie; Annemarie Scholz; Akash A Shah; Amitabh Sharma; Yun Shen; Kerstin Spirohn; Stanley Tam; Alexander O Tejeda; Shelly A Trigg; Jean-Claude Twizere; Kerwin Vega; Jennifer Walsh; Michael E Cusick; Yu Xia; Albert-László Barabási; Lilia M Iakoucheva; Patrick Aloy; Javier De Las Rivas; Jan Tavernier; Michael A Calderwood; David E Hill; Tong Hao; Frederick P Roth; Marc Vidal
Journal: Cell Date: 2014-11-20 Impact factor: 41.582