
Statistical Approaches for the Construction and Interpretation of Human Protein-Protein Interaction Network.

Yang Hu¹, Ying Zhang², Jun Ren¹, Yadong Wang³, Zhenzhen Wang⁴, Jun Zhang².

Abstract

The overall goal is to establish a reliable human protein-protein interaction (PPI) network and to develop computational tools that characterize the network and the role of individual proteins in the context of the network topology and their expression status. A novel feature of our approach is that we assign a confidence measure to each derived interacting pair and account for this confidence in our network analysis. We integrated experimental data to infer the human PPI network. Our model treats the true interaction status (yes versus no) of any given pair of human proteins as a latent variable whose value is not observed. The experimental data are manifestations of the interaction status and provide evidence as to the likelihood of the interaction. The confidence of an interaction depends on the strength and consistency of this evidence.

Year: 2016 PMID: 27648447 PMCID: PMC5015007 DOI: 10.1155/2016/5313050

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

Individual proteins cannot perform their biological functions in isolation; they act in biological processes by interacting with other proteins [1]. An interaction between two proteins usually means either that they perform a biological function cooperatively or that they are in direct physical contact [2]. Most important molecular processes in the cell, such as DNA replication, are carried out by large numbers of protein complexes, and these complexes are formed through interactions between proteins. The study of PPIs is therefore considered a central problem in proteomics. Because interactions between proteins are dynamic, the influence of the surrounding environment should also be taken into account. Studying the human PPI network can both improve our understanding of disease and provide a theoretical foundation for finding new treatments. With the continuing development of high-throughput experimental technology, a growing number of interactions between human proteins have been confirmed by a variety of experimental methods, and many kinds of biological interaction networks have been investigated [3-7]. However, current high-throughput experimental techniques suffer from high error rates: different experimental methods may yield different results, and even different research groups using the same method cannot guarantee identical results. It is therefore necessary to integrate data from different biological experiments, and even from different species, to construct a highly credible network of PPIs. In this paper, we construct a Bayesian hierarchical model of the human PPI network from a variety of sources of protein interaction data and use a Monte Carlo expectation maximization algorithm to estimate the parameters of the model.

The confidence of each protein interaction is then calculated from the Bayesian model, which yields a high-confidence human PPI network. Thereafter, we investigate the role of intrinsically disordered proteins (IDPs) in the high-confidence PPI network. First, functional modules are obtained by clustering the high-confidence PPI network on the basis of its topology. We then identify the functional modules that are significantly associated with IDPs, analyse the effect of IDPs in these modules, and search for associations between these modules and diseases.

2. Materials and Methods

2.1. Data Collection

Table 1 lists the experimental data used to construct the human PPI network [8-20]. Note that the literature or text mining approach represents most of the low-throughput experimental studies of individual protein-protein interactions. The result of a single experiment may be recorded in multiple databases; we eliminate this type of redundancy. It should be emphasized that the MPC experiments report their results as protein complexes rather than as pairwise protein-protein interactions. Since proteins located in the same complex might not interact with one another directly, we account for this factor in our model.

Table 1

Data sets or databases used to construct the human protein-protein interaction network.

Method	Organism	Reference
Y2H	Human	Stelzl et al. [8]
Y2H	Human	Rual et al. [9]
MPC	Human	Ewing et al. [10]
Literature	Human	HPRD [11], http://www.hprd.org/
Y2H	Yeast	Ito et al. [12]
Y2H	Yeast	Uetz et al. [13]
MPC	Yeast	Gavin et al. [14]
MPC	Yeast	Ho et al. [15]
MPC	Yeast	Gavin et al. [16]
MPC	Yeast	Krogan et al. [17]
Literature	Multiple	IntAct [18], http://www.ebi.ac.uk/intact/
Literature	Multiple	MIPS [19], http://mips.gsf.de/proj/ppi/
Multiple	Multiple	DIP [20], http://dip.doe-mbi.ucla.edu

2.2. Statistical Modeling of Various Data Sources

The overall scheme of our approach is illustrated in Figure 1. We take an empirical Bayes approach to integrate the various sources of evidence. Let Z_ij be the binary indicator such that Z_ij = 1 if human proteins i and j have a direct physical interaction and Z_ij = 0 otherwise. Hence, Z_ij is the true interaction status, which is not observed. To infer Z_ij, we specify an individual model for each type of observed data and integrate the evidence to compute the probability that Z_ij = 1.

Figure 1

Overall scheme to construct the human protein-protein interaction network. The interaction status of a given pair of human proteins and of their homologs in other organisms is unobserved (dashed box), and the experimental data and genomic features are observed evidence (solid boxes). Solid arrows represent model hierarchy and dashed arrows represent inference steps.

2.2.1. Human Y2H Data

It has been found that a number of mechanisms can lead to expression of the reporter gene in a Y2H experiment, so an observed interaction does not necessarily indicate a true interaction. In our model, we consider the following mechanisms: (a) true interaction; (b) self-activation; and (c) unknown process. Let Y_ij be the binary indicator such that Y_ij = 1 if proteins i and j are observed to interact in a Y2H experiment and Y_ij = 0 otherwise. Then Y_ij = 1 only if at least one of the three mechanisms above is active. Let X_i = 1 if protein i is a self-activating protein and X_i = 0 otherwise. We define Then we have
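
One common way to formalise "observed only if at least one mechanism is active" is a noisy-OR model. The sketch below assumes that form. The function name `y2h_positive_prob`, the reading of α_I, α_S, and α_U as per-mechanism activation probabilities, and the rule that self-activation applies when either protein is a self-activator are illustrative assumptions, not the authors' stated equations.

```python
def y2h_positive_prob(z, x_i, x_j, alpha_i, alpha_s, alpha_u):
    """Noisy-OR probability that a Y2H assay reports an interaction.

    z        : 1 if the pair truly interacts, else 0
    x_i, x_j : 1 if the respective protein is a self-activator, else 0
    alpha_i, alpha_s, alpha_u : assumed probabilities that the true-interaction,
        self-activation, and unknown mechanisms switch the reporter on
    """
    # Probability that no mechanism fires; the unknown process is always possible.
    p_none = 1.0 - alpha_u
    if z:
        p_none *= 1.0 - alpha_i
    if x_i or x_j:  # assumption: one self-activator is enough
        p_none *= 1.0 - alpha_s
    return 1.0 - p_none


# Illustrative call with the "All PPI data" estimates reported in Table 2.
p_background = y2h_positive_prob(0, 0, 0, 0.933, 0.852, 0.007)
p_true_pair = y2h_positive_prob(1, 0, 0, 0.933, 0.852, 0.007)
```

Under this reading, a non-interacting pair with no self-activator is reported only through the unknown process, so its probability equals α_U.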

2.2.2. Human MPC Data

An MPC experiment reveals protein complexes rather than individual pairwise PPIs. We say that protein B is an n-step neighbour of protein A if the shortest path between A and B in the PPI network has length n. We conjecture that the bait mostly fishes out its 1-step neighbours, while 2-step neighbours and distant proteins (at least three steps away) are observed only occasionally. Hence, we define the following parameters for the bait proteins: Let C be the set of proteins in the complex corresponding to bait protein k. Denote by n(1) and n(2) the sets of 1-step and 2-step neighbours of bait protein k under a given value of Z. Then the probability of observing C can be written as follows: where |·| is the function that maps a set to its size.
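
The n-step neighbour sets used above can be computed with a breadth-first search. The following is a minimal sketch; the function name `step_neighbours`, the adjacency-dictionary representation, and the toy network are assumptions made for illustration.

```python
from collections import deque


def step_neighbours(adjacency, bait, max_steps=2):
    """Return {n: proteins at shortest-path distance n from bait}, n = 1..max_steps."""
    dist = {bait: 0}
    queue = deque([bait])
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue  # do not expand beyond the requested radius
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:  # first visit gives the shortest distance
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    shells = {n: set() for n in range(1, max_steps + 1)}
    for node, d in dist.items():
        if d >= 1:
            shells[d].add(node)
    return shells


# Hypothetical undirected toy network: E is three steps away from bait A.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
shells = step_neighbours(toy_network, "A")
```

Here `shells[1]` and `shells[2]` play the roles of the 1-step and 2-step neighbour sets of the bait; any protein in a complex that falls in neither set counts as distant.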

2.2.3. Literature Data on Human PPI

Let L be the interaction status of proteins i and j reported in the literature. We account for the false positive and false negative rates of these reports through the parameters γ₀ and γ₁:
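
To see how such error rates shape the confidence of a single pair, consider Bayes' rule applied to one literature report. The sketch below assumes that γ₀ is the false positive rate, Pr(L = 1 | Z = 0), that γ₁ is the false negative rate, Pr(L = 0 | Z = 1), and that ρ is the prior probability of interaction. The function name and the numerical values are hypothetical.

```python
def posterior_given_report(reported, rho, gamma0, gamma1):
    """Posterior Pr(Z = 1 | L) for one pair, using Bayes' rule.

    reported : True if the literature reports the interaction (L = 1)
    rho      : prior probability that the pair interacts
    gamma0   : assumed false positive rate, Pr(L = 1 | Z = 0)
    gamma1   : assumed false negative rate, Pr(L = 0 | Z = 1)
    """
    like_interacting = (1.0 - gamma1) if reported else gamma1
    like_not_interacting = gamma0 if reported else (1.0 - gamma0)
    numerator = rho * like_interacting
    return numerator / (numerator + (1.0 - rho) * like_not_interacting)


# Hypothetical values, chosen only to illustrate the update.
p_reported = posterior_given_report(True, rho=0.01, gamma0=0.001, gamma1=0.5)
p_unreported = posterior_given_report(False, rho=0.01, gamma0=0.001, gamma1=0.5)
```

With a low false positive rate, a single report raises the posterior far above the prior, whereas the absence of a report lowers it only slightly when the false negative rate is high.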

2.2.4. Data from Other Organisms

We also collect (Y′, C′) from other organisms, with the corresponding unobserved variables denoted by (Z′, X′). Models similar to those for (Y, C, L) can be used for inference of (Z′, X′). To connect (Z′, X′) to (Z, X), we consider the following models: where J is the joint sequence identity between i and i′ and between j and j′, and I is the sequence identity between i and i′; Δ₁, Δ₀, Ω₁, and Ω₀ are functions of the joint or individual sequence identities with parameters ϕ₁, ϕ₀, λ₁, and λ₀, which can be modeled by a parametric structure.

2.3. Construction of Hierarchical Bayesian Model

So far we have introduced the distribution models for the experimental data and genomic features, conditional on the values of Z and X. To complete the model, we also need to specify the distributions of Z and X, which can be modeled with independent Bernoulli distributions: With the observed data and the unobserved variables, we can infer the posterior probability of Z using the EM algorithm. Note that there are multiple organisms, and multiple data sets for some of the organisms; different parameters are used to account for differences in the data. As illustrated in (10), the complete log likelihood function of our model can be expanded as below, and the factors of (10) can be substituted using (3)-(9): where the parameter vector θ = {ρ, r, α_I, α_S, α_U, ψ₁, ψ₂, γ₁, γ₀, β₁, β₀, ϕ₁, ϕ₀, λ₁, λ₀}.
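
Under the independent Bernoulli assumption, the priors can be sketched as follows. Pairing ρ with Z and r with X is an assumption based on the order of θ and on the magnitudes in Table 2; the subscripts and the product form are written out here only for illustration.

```latex
\Pr(Z_{ij} = z) = \rho^{z} (1 - \rho)^{1 - z}, \qquad
\Pr(X_{i} = x) = r^{x} (1 - r)^{1 - x}, \qquad z, x \in \{0, 1\},

\Pr(Z, X) = \prod_{i < j} \rho^{Z_{ij}} (1 - \rho)^{1 - Z_{ij}}
            \prod_{i} r^{X_{i}} (1 - r)^{1 - X_{i}} .
```

Taking logarithms of the joint prior gives the contribution of (Z, X) to the complete log likelihood, which is linear in the latent indicators.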

2.4. Monte Carlo Expectation Maximization for Parameter Estimation

In this model, the true values of the latent variables and the model parameters cannot be estimated directly. To estimate them effectively, we used a Monte Carlo expectation maximization algorithm for parameter estimation from incomplete data, as illustrated in Algorithm 1.

Algorithm 1

Monte Carlo expectation maximization for parameter estimation.

In the E-step of Algorithm 1, we use Gibbs sampling to sample (Z, X, Z′, X′) from their conditional distributions in turn, repeating the sampling process until estimates of the missing data are obtained. In the M-step of Algorithm 1, the parameter vector θ = {γ₁, γ₀, α_I, α_S, α_U, β₁, β₀, ϕ₁, ϕ₀, λ₁, λ₀} is estimated by Greedy Hill Climbing. The iteration stops once diff falls below 0.01.
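
The alternation between a sampling-based E-step and an M-step can be illustrated on a deliberately small toy model. The sketch below is not Algorithm 1: it assumes a single unknown parameter ρ, treats the assay's true positive rate `a` and false positive rate `b` as known, and uses a closed-form M-step in place of Greedy Hill Climbing. Because the latent indicators are conditionally independent in this toy model, a Gibbs sweep reduces to independent draws. All names are hypothetical.

```python
import random


def simulate_reports(n_pairs, rho, a, b, rng):
    """Simulate one binary observation per pair from the toy model."""
    reports = []
    for _ in range(n_pairs):
        interacts = rng.random() < rho
        reports.append(1 if rng.random() < (a if interacts else b) else 0)
    return reports


def monte_carlo_em(reports, a, b, rho0=0.5, n_samples=20, tol=0.01,
                   max_iter=100, rng=None):
    """Estimate rho by Monte Carlo EM; stops when the update is below tol."""
    rng = rng or random.Random(0)
    rho = rho0
    for _ in range(max_iter):
        # E-step: draw each latent Z_k from Pr(Z_k | y_k, rho), n_samples times.
        sampled_ones = 0
        for y in reports:
            like1 = a if y else 1.0 - a
            like0 = b if y else 1.0 - b
            post = rho * like1 / (rho * like1 + (1.0 - rho) * like0)
            sampled_ones += sum(rng.random() < post for _ in range(n_samples))
        # M-step: the maximiser of the sampled complete-data likelihood in rho
        # is the fraction of sampled indicators equal to 1.
        new_rho = sampled_ones / (n_samples * len(reports))
        diff = abs(new_rho - rho)
        rho = new_rho
        if diff < tol:
            break
    return rho


rng = random.Random(1)
reports = simulate_reports(5000, rho=0.2, a=0.9, b=0.05, rng=rng)
rho_hat = monte_carlo_em(reports, a=0.9, b=0.05, rng=rng)
```

The stopping rule mirrors the convergence check on diff: iteration ends once successive estimates differ by less than the tolerance, so the Monte Carlo sample size must be large enough that sampling noise stays well below that tolerance.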

3. Results

All protein names were mapped to Entrez IDs. This yielded 32540 proteins and 144603 interactions among them.

3.1. Construction of the Human PPI Network with Reliable Confidence Measure

Four models were established separately using the high-throughput Y2H experimental data, the high-throughput MPC experimental data, the human PPI data, and all the PPI data. These four models are compared in Table 2.

Table 2

Comparison of parameters based on different data.

Parameters	High-throughput Y2H	High-throughput MPC	Human PPI data	All PPI data
ρ	6.8 × 10⁻³	1.9 × 10⁻³	6.1 × 10⁻³	1.4 × 10⁻²
r	7.7 × 10⁻⁵	—	5.3 × 10⁻⁵	8.9 × 10⁻⁵
α _I	0.658	—	0.543	0.933
α _S	0.426	—	0.496	0.852
α _U	4.5 × 10⁻³	—	9.7 × 10⁻⁴	0.007
ψ ₁	—	0.738	0.755	0.809
ψ ₂	—	0.623	0.764	0.788

After estimating the parameter vector θ by Monte Carlo EM, we recalculated the posterior probability of Z, that is, Pr[Z | H, Y, W, L, Y′, W′], using θ and the observed values H, Y, W, L, Y′, W′. We considered a protein pair to be a reliable, high-confidence interaction if Pr[Z = 1 | H, Y, W, L, Y′, W′] > 0.8. This yielded 48361 PPIs with reliable confidence measure among 23286 proteins.

3.2. Characterization of Network and Roles of IDPs Based on Network Topology

We analysed the role of IDPs in the human PPI network with reliable confidence measure. An IDP was defined as a protein with a continuous intrinsically disordered region longer than 40 amino acids. After prediction, 8735 IDPs were identified among the 23286 proteins. First, the human PPI network was cut into subnetworks, or modules, by SCAN, which obtains modules based on the similarity between common neighbors. We then used modularity and similarity-based modularity as metrics. Modularity is a statistical measure of the quality of network clustering, which is defined as follows: where N is the number of clusters, L is the number of edges, l is the number of edges within the s-th module, and d is the sum of the degrees of all the nodes in the s-th module. The best clustering can be obtained by optimizing Q. Similarity-based modularity supplements the modularity and is defined as follows: As shown in Figure 2, on the one hand, the modularity decreased monotonically from a position near zero and could not be maximized. On the other hand, the similarity-based modularity was maximized when the threshold ε equals 0.61. With ε = 0.61, the reliable human PPI network was cut into 241 modules. At the significance level α = 0.05, the p value of each module was calculated by the formula below: where N is the number of all the proteins and M is the number of all the IDPs. Of the 241 modules, 33 were significantly associated with IDPs.
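
The two quantities described above can be sketched in a few lines. The modularity function assumes the standard form Q = Σ_s [l_s/L − (d_s/(2L))²], which matches the symbol definitions given in the text. The enrichment test assumes a hypergeometric upper tail; the symbols `n` (module size) and `k` (number of IDPs in the module) and both function names are introduced here for illustration only.

```python
from math import comb


def modularity(edges, module_of):
    """Q = sum over modules s of l_s / L - (d_s / (2 L))**2.

    edges     : list of undirected edges (u, v)
    module_of : dict mapping each node to its module label
    """
    total_edges = len(edges)
    within = {}      # l_s: edges with both endpoints in module s
    degree_sum = {}  # d_s: sum of degrees of the nodes in module s
    for u, v in edges:
        mu, mv = module_of[u], module_of[v]
        degree_sum[mu] = degree_sum.get(mu, 0) + 1
        degree_sum[mv] = degree_sum.get(mv, 0) + 1
        if mu == mv:
            within[mu] = within.get(mu, 0) + 1
    return sum(within.get(s, 0) / total_edges
               - (degree_sum[s] / (2 * total_edges)) ** 2
               for s in degree_sum)


def enrichment_p_value(N, M, n, k):
    """Pr(at least k IDPs in a module of n proteins), hypergeometric.

    N : number of all proteins, M : number of all IDPs (as in the text)
    n : module size, k : IDPs observed in the module (names assumed here)
    """
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy network: two triangles joined by a single edge, one module per triangle.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
toy_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_toy = modularity(toy_edges, toy_modules)
p_toy = enrichment_p_value(N=10, M=4, n=3, k=2)
```

A module would be called significantly associated with IDPs when its p value falls below the chosen significance level, here α = 0.05.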

Figure 2

The optimization of Q and Q for different ε. The red line and the green line correspond to Q and Q, respectively.

However, because the functional modules are derived from the network topology alone, we also analysed the modules in relation to known diseases. The overlap between the PPIs in the HeLa cell and a functional module that was highly related to IDPs is shown in Figure 3. The weight of each edge is the posterior probability of the true value of Z. When a node with more than 5 neighbours was defined as a hub node in this subnetwork, 69% of the hub nodes were IDPs. This supports the view that IDPs readily become hub nodes of the protein interaction network owing to their structural flexibility, and it suggests an important role for IDPs in the regulation of the cervical cancer HeLa cell.
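
The hub statistic above amounts to a degree threshold followed by a membership count. A minimal sketch follows; the function name, the adjacency-dictionary representation, and the protein labels are hypothetical.

```python
def idp_hub_fraction(adjacency, idps, min_neighbours=5):
    """Fraction of hub nodes (more than min_neighbours neighbours) that are IDPs."""
    hubs = [protein for protein, neighbours in adjacency.items()
            if len(neighbours) > min_neighbours]
    if not hubs:
        return 0.0  # no hubs in this subnetwork
    return sum(protein in idps for protein in hubs) / len(hubs)


# Hypothetical subnetwork: two star-shaped hubs with six neighbours each.
toy_subnetwork = {}
for hub, prefix in (("H1", "a"), ("H2", "b")):
    leaves = {f"{prefix}{i}" for i in range(6)}
    toy_subnetwork[hub] = leaves
    for leaf in leaves:
        toy_subnetwork[leaf] = {hub}

fraction = idp_hub_fraction(toy_subnetwork, idps={"H1", "a0"})
```

Only nodes that pass the degree threshold enter the denominator, so IDPs that are not hubs (such as `a0` in the toy example) do not affect the fraction.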

Figure 3

A reliable subnetwork for the HeLa cell. Circles correspond to IDPs, and the shade of grey corresponds to the length of the intrinsically disordered region of each IDP.

4. Discussion

Our model is novel in the following respects. First, it integrates Y2H and MPC data in a cohesive, unified model that connects the two types of data through the unobserved true status of direct physical interaction, Z. Second, the model allows a natural calculation of the confidence of each interacting pair via the posterior probability. This is a critical measurement for downstream analysis and is accounted for there. To our knowledge, no previous study has considered uncertainty in PPI network analysis. The inference of the interaction probability involves a large number of latent variables, and their combinatorial effects make it impractical to compute the expectation of the missing variables analytically during the E-step. It is likely that the various data sets carry different amounts of information about the true interaction status. Hence, inference can be made by weighting data of various types appropriately instead of treating them equally, which can be achieved by setting parameter constraints.

20 in total

1. Towards a proteome-scale map of the human protein-protein interaction network.

Authors: Jean-François Rual; Kavitha Venkatesan; Tong Hao; Tomoko Hirozane-Kishikawa; Amélie Dricot; Ning Li; Gabriel F Berriz; Francis D Gibbons; Matija Dreze; Nono Ayivi-Guedehoussou; Niels Klitgord; Christophe Simon; Mike Boxem; Stuart Milstein; Jennifer Rosenberg; Debra S Goldberg; Lan V Zhang; Sharyl L Wong; Giovanni Franklin; Siming Li; Joanna S Albala; Janghoo Lim; Carlene Fraughton; Estelle Llamosas; Sebiha Cevik; Camille Bex; Philippe Lamesch; Robert S Sikorski; Jean Vandenhaute; Huda Y Zoghbi; Alex Smolyar; Stephanie Bosak; Reynaldo Sequerra; Lynn Doucette-Stamm; Michael E Cusick; David E Hill; Frederick P Roth; Marc Vidal
Journal: Nature Date: 2005-09-28 Impact factor: 49.962

2. Functional organization of the yeast proteome by systematic analysis of protein complexes.

Authors: Anne-Claude Gavin; Markus Bösche; Roland Krause; Paola Grandi; Martina Marzioch; Andreas Bauer; Jörg Schultz; Jens M Rick; Anne-Marie Michon; Cristina-Maria Cruciat; Marita Remor; Christian Höfert; Malgorzata Schelder; Miro Brajenovic; Heinz Ruffner; Alejandro Merino; Karin Klein; Manuela Hudak; David Dickson; Tatjana Rudi; Volker Gnau; Angela Bauch; Sonja Bastuck; Bettina Huhse; Christina Leutwein; Marie-Anne Heurtier; Richard R Copley; Angela Edelmann; Erich Querfurth; Vladimir Rybin; Gerard Drewes; Manfred Raida; Tewis Bouwmeester; Peer Bork; Bertrand Seraphin; Bernhard Kuster; Gitte Neubauer; Giulio Superti-Furga
Journal: Nature Date: 2002-01-10 Impact factor: 49.962

3. MIPS: a database for genomes and protein sequences.

Authors: H W Mewes; D Frishman; U Güldener; G Mannhaupt; K Mayer; M Mokrejs; B Morgenstern; M Münsterkötter; S Rudd; B Weil
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

Review 4. Similarity computation strategies in the microRNA-disease network: a survey.

Authors: Quan Zou; Jinjin Li; Li Song; Xiangxiang Zeng; Guohua Wang
Journal: Brief Funct Genomics Date: 2015-07-01 Impact factor: 4.241

Review 5. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks.

Authors: Xiangxiang Zeng; Xuan Zhang; Quan Zou
Journal: Brief Bioinform Date: 2015-06-09 Impact factor: 11.622

6. A comprehensive two-hybrid analysis to explore the yeast protein interactome.

Authors: T Ito; T Chiba; R Ozawa; M Yoshida; M Hattori; Y Sakaki
Journal: Proc Natl Acad Sci U S A Date: 2001-03-13 Impact factor: 11.205

7. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.

Authors: Nevan J Krogan; Gerard Cagney; Haiyuan Yu; Gouqing Zhong; Xinghua Guo; Alexandr Ignatchenko; Joyce Li; Shuye Pu; Nira Datta; Aaron P Tikuisis; Thanuja Punna; José M Peregrín-Alvarez; Michael Shales; Xin Zhang; Michael Davey; Mark D Robinson; Alberto Paccanaro; James E Bray; Anthony Sheung; Bryan Beattie; Dawn P Richards; Veronica Canadien; Atanas Lalev; Frank Mena; Peter Wong; Andrei Starostine; Myra M Canete; James Vlasblom; Samuel Wu; Chris Orsi; Sean R Collins; Shamanta Chandran; Robin Haw; Jennifer J Rilstone; Kiran Gandi; Natalie J Thompson; Gabe Musso; Peter St Onge; Shaun Ghanny; Mandy H Y Lam; Gareth Butland; Amin M Altaf-Ul; Shigehiko Kanaya; Ali Shilatifard; Erin O'Shea; Jonathan S Weissman; C James Ingles; Timothy R Hughes; John Parkinson; Mark Gerstein; Shoshana J Wodak; Andrew Emili; Jack F Greenblatt
Journal: Nature Date: 2006-03-22 Impact factor: 49.962

8. IntAct--open source resource for molecular interaction data.

Authors: S Kerrien; Y Alam-Faruque; B Aranda; I Bancarz; A Bridge; C Derow; E Dimmer; M Feuermann; A Friedrichsen; R Huntley; C Kohler; J Khadake; C Leroy; A Liban; C Lieftink; L Montecchi-Palazzi; S Orchard; J Risse; K Robbe; B Roechert; D Thorneycroft; Y Zhang; R Apweiler; H Hermjakob
Journal: Nucleic Acids Res Date: 2006-12-01 Impact factor: 16.971

9. Allele-specific behavior of molecular networks: understanding small-molecule drug response in yeast.

Authors: Fan Zhang; Bo Gao; Liangde Xu; Chunquan Li; Dapeng Hao; Shaojun Zhang; Meng Zhou; Fei Su; Xi Chen; Hui Zhi; Xia Li
Journal: PLoS One Date: 2013-01-04 Impact factor: 3.240

10. Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding.

Authors: Yu-An Huang; Zhu-Hong You; Xing Chen; Keith Chan; Xin Luo
Journal: BMC Bioinformatics Date: 2016-04-26 Impact factor: 3.169
