
Identifying essential proteins from active PPI networks constructed with dynamic gene expression.

Qianghua Xiao, Jianxin Wang, Xiaoqing Peng, Fang-xiang Wu, Yi Pan.

Abstract

Essential proteins are vital for cellular survival and development, and identifying them is an important research task in the post-genome era. The rapid increase in available protein-protein interaction (PPI) data has made it possible to detect protein essentiality at the network level, and a series of centrality measures have been proposed to discover essential proteins from PPI networks. However, PPI data obtained from large-scale, high-throughput experiments generally contain false positives, so the original PPI data alone are insufficient for identifying essential proteins, and improving accuracy has become the focus of this line of work. In this paper, we propose a framework for identifying essential proteins from active PPI networks constructed with dynamic gene expression. First, we process the dynamic gene expression profiles using a time-dependent model and a time-independent model. Second, we construct an active PPI network based on co-expressed genes. Last, we apply six classical centrality measures to the active PPI network. For comparison, other prediction methods are also applied to the active PPI network. The experimental results on the yeast network show that identifying essential proteins from the active PPI network considerably improves the performance of centrality measures, both in the number of essential proteins identified and in identification accuracy. The results also indicate that most essential proteins are active.

Year: 2015 PMID: 25707432 PMCID: PMC4331804 DOI: 10.1186/1471-2164-16-S3-S1

Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969

Introduction

Essential proteins play a decisive role in the survival and development of the cell. Identifying them is crucial for understanding the minimal requirements of cellular life and for practical purposes such as drug design [1]. Essential genes have been predicted and discovered by experimental procedures such as single-gene knockouts [2], RNA interference [3] and conditional knockouts [4], but these techniques require a large investment of time and resources and are not always feasible. Given these experimental constraints, a highly accurate computational approach for identifying essential proteins would be of great value. At present, there are many computational approaches for predicting essential proteins based on their properties. Most of them focus on topological properties in biological networks such as protein-protein interaction (PPI) networks. Many methods have been proposed for detecting essential proteins based on network topology, such as degree centrality (DC) [5], betweenness centrality (BC) [6], closeness centrality (CC) [7], subgraph centrality (SC) [8], eigenvector centrality (EC) [9], information centrality (IC) [10], edge clustering coefficient centrality (NC) [11] and local average connectivity centrality (LAC) [12]. Experimental results have shown that these centrality measures perform better than pseudorandom selection in detecting essential proteins. However, they have some limitations. The PPI data generated by high-throughput technologies are incomplete and contain many false positives and false negatives, which reduces the accuracy of essential protein prediction. He et al. illustrated that some PPIs are more important than others [13].
Some studies have shown that many essential proteins have low connectivity and are difficult to identify with centrality measures [13-16]. Many studies have therefore focused on identifying essential proteins by integrating PPI networks with other biological information, such as cellular localization, gene annotation and genome sequence [13,16,17]. Acencio et al. demonstrated that network topological features, cellular localization and biological process information are extremely useful for reliable prediction of essential genes [17]. Hart et al. pointed out that essentiality is a product of the protein complex rather than the individual protein [18]. Tew et al. [19] combined functional information with topological information to detect essential proteins. Li et al. [20] proposed a method that identifies essential proteins by integrating PPI network topology with protein complex information. Li et al. also proposed a method, named PeC, for predicting essential proteins based on the integration of the PPI network and gene expression profiles [21]. Peng et al. [22] proposed an iterative method for predicting essential proteins by integrating orthology with PPI networks. The current centrality measures are based on the topology of PPI networks. However, PPI networks are static and cannot reflect which interactions actually take place at a given time; in addition, as noted above, the underlying PPI data are incomplete and contain many false positives and false negatives. In this paper, we propose a new method for predicting essential proteins based on an active PPI network. We construct the active PPI network from a static PPI network and dynamic gene expression data. Then six centrality measures based on network topology (DC, LAC, NC, BC, CC and SC) are applied to the constructed active network to predict essential proteins.
The experimental results show that predicting essential proteins from the active PPI network is more effective than predicting them from the static PPI network.

Methods

In this section, we first construct an active PPI network from dynamic gene expression profiles and a static PPI network. Then, we identify essential proteins based on the constructed active PPI network.

Time-dependent model and Time-independent model

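Let $x = \{x_1, \ldots, x_m, \ldots, x_M\}$ be a time series of observation values at equally-spaced time points from a dynamic system. Wu et al. [23] have adopted an AR (autoregressive) model to analyze the time dependence of time-course (dynamic) gene expression profiles. In [26], the time-dependent relationships are modeled by an AR model of order p, denoted by AR(p), as follows:

$$x_m = \beta_0 + \beta_1 x_{m-1} + \beta_2 x_{m-2} + \cdots + \beta_p x_{m-p} + \varepsilon_m, \quad m = p+1, \ldots, M \tag{1}$$

where $\beta_i$ (i = 0, 1, ..., p) are the autoregressive coefficients, and $\varepsilon_m$ (m = p + 1, ..., M) represent random errors, which independently and identically follow a normal distribution with mean 0 and variance $\sigma^2$. The system of Model (1) can be rewritten in matrix form as

$$Y = X\beta + \varepsilon \tag{2}$$

where

$$Y = \begin{pmatrix} x_{p+1} \\ x_{p+2} \\ \vdots \\ x_M \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_p & x_{p-1} & \cdots & x_1 \\ 1 & x_{p+1} & x_p & \cdots & x_2 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{M-1} & x_{M-2} & \cdots & x_{M-p} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_{p+1} \\ \varepsilon_{p+2} \\ \vdots \\ \varepsilon_M \end{pmatrix}.$$

The likelihood function for Model (2) is

$$L(\beta, \sigma^2) = (2\pi\sigma^2)^{-(M-p)/2} \exp\left(-\frac{(Y - X\beta)^T (Y - X\beta)}{2\sigma^2}\right) \tag{3}$$

If rank(X) = p + 1 holds, the maximum likelihood estimates of $\beta$ and $\sigma^2$ are

$$\hat{\beta} = (X^T X)^{-1} X^T Y \tag{4}$$

and

$$\hat{\sigma}^2 = \frac{(Y - X\hat{\beta})^T (Y - X\hat{\beta})}{M - p} \tag{5}$$

The value of the maximum likelihood is given by

$$L(\hat{\beta}, \hat{\sigma}^2) = (2\pi\hat{\sigma}^2)^{-(M-p)/2} e^{-(M-p)/2} \tag{6}$$

In Model (2), the matrix X has p + 1 columns and M − p rows. Thus a necessary condition for rank(X) = p + 1 is M − p ≥ p + 1, or p ≤ (M − 1)/2.

On the other hand, the time-independent model is also an autoregressive model, with order zero. That is, a noisy profile can be modeled by

$$x_m = \beta_0 + \varepsilon_m, \quad m = p+1, \ldots, M \tag{7}$$

where $\beta_0$ is a constant and $\varepsilon_m$ (m = p + 1, ..., M) are random errors which follow a normal distribution, independent of time, with mean 0 and variance $\sigma_0^2$. The likelihood function for Model (7) is

$$L(\beta_0, \sigma_0^2) = (2\pi\sigma_0^2)^{-(M-p)/2} \exp\left(-\frac{\sum_{m=p+1}^{M} (x_m - \beta_0)^2}{2\sigma_0^2}\right) \tag{8}$$

The maximum likelihood estimates of $\beta_0$ and $\sigma_0^2$ are

$$\hat{\beta}_0 = \frac{1}{M - p} \sum_{m=p+1}^{M} x_m \tag{9}$$

and

$$\hat{\sigma}_0^2 = \frac{1}{M - p} \sum_{m=p+1}^{M} (x_m - \hat{\beta}_0)^2 = \frac{(Y - X\hat{\beta}_c)^T (Y - X\hat{\beta}_c)}{M - p} \tag{10}$$

respectively. The maximum value of the likelihood is given by

$$L(\hat{\beta}_0, \hat{\sigma}_0^2) = (2\pi\hat{\sigma}_0^2)^{-(M-p)/2} e^{-(M-p)/2} \tag{11}$$

where $\hat{\beta}_c$ is a (p + 1)-dimensional vector whose first component is $\hat{\beta}_0$ and whose other components are zeros. The likelihood ratio of Model (7) to Model (1) is given by

$$\Lambda = \frac{L(\hat{\beta}_0, \hat{\sigma}_0^2)}{L(\hat{\beta}, \hat{\sigma}^2)} = \left(\frac{\hat{\sigma}^2}{\hat{\sigma}_0^2}\right)^{(M-p)/2} \tag{12}$$

According to the likelihood principle, if $\Lambda$ in Formula (12) is too small, the series $x = \{x_1, \ldots, x_m, \ldots, x_M\}$ is more likely time-dependent than time-independent. The statistic

$$F = \frac{M - 2p - 1}{p} \left(\Lambda^{-2/(M-p)} - 1\right) = \frac{M - 2p - 1}{p} \cdot \frac{\hat{\sigma}_0^2 - \hat{\sigma}^2}{\hat{\sigma}^2} \tag{13}$$

follows an F distribution with (p, M − 2p − 1) degrees of freedom when Model (7) is true for a series of observations.

When F is very large, and thus the p-value is very small, Model (7) is rejected, i.e., the observation series $x = \{x_1, \ldots, x_m, \ldots, x_M\}$ is time-dependent. From Formula (13), one can calculate the probability (p-value) that a series of observations is not time-independent. As the regression order in Model (1) is unknown, the p-values are calculated by Formula (13) for all possible orders p (1 ≤ p ≤ (M − 1)/2). The proposed method calls a gene significantly expressed (time-dependent) if one of the p-values calculated from its expression profile is smaller than a user-preset threshold value.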
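The comparison between Model (1) and Model (7) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `solve_linear` and `ar_f_statistic` and the synthetic `profile` are hypothetical, the least-squares fit solves the normal equations directly, and the code requires M − 2p − 1 ≥ 1 so that the denominator degrees of freedom are positive.

```python
import math


def solve_linear(a, b):
    """Solve the square system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def ar_f_statistic(x, p):
    """F statistic of AR(p) (Model 1) against the constant model (Model 7)."""
    M = len(x)
    dof = M - 2 * p - 1
    if p < 1 or dof < 1:
        raise ValueError("need p >= 1 and M - 2p - 1 >= 1")
    y = x[p:]  # observations x_{p+1}, ..., x_M
    # Each row of X is (1, x_{m-1}, ..., x_{m-p}).
    rows = [[1.0] + [x[m - j] for j in range(1, p + 1)] for m in range(p, M)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p + 1)]
           for i in range(p + 1)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p + 1)]
    beta = solve_linear(xtx, xty)  # least-squares (maximum likelihood) estimate
    rss_ar = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, yi in zip(rows, y))
    mean = sum(y) / len(y)
    rss_const = sum((yi - mean) ** 2 for yi in y)
    return ((rss_const - rss_ar) / p) / (rss_ar / dof)


# Hypothetical profile: a sinusoid plus a small deterministic perturbation.
profile = [math.sin(0.5 * m) + 0.05 * ((2 * m) % 13 - 6) / 6 for m in range(36)]
f_value = ar_f_statistic(profile, p=2)
```

A large `f_value` corresponds to a small p-value and hence to a time-dependent profile. Converting the statistic to a p-value needs the survival function of the F distribution with (p, M − 2p − 1) degrees of freedom, which a statistics library such as SciPy provides.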

Construction of the active protein interaction network

Tang et al. [24] use a potential threshold to filter noisy gene expression data and then construct an active PPI network. In their method a single threshold value is applied to all genes and time points. Wang et al. [25] propose a method to identify the active time points of each protein in a cellular process or cycle: a 3-sigma principle is used to compute an active threshold for each gene according to the characteristics of its expression curve, and an active PPI network is then constructed. We first filter noisy genes using the time-dependent and time-independent models: the expression data of time-dependent genes are more likely dynamically deterministic than random, while the expression data of time-independent genes are more likely random than dynamically deterministic. Gene expression data are considered noise if they are time-independent and their mean is very small. We then use a threshold function to compute an active threshold for each gene from its expression data. Finally, we construct an active PPI network (NF-APIN) [26]. Our threshold function, Formula (14), is defined per gene in terms of u and σ, the mean and standard deviation of the gene's expression values, together with a coefficient k (0 ≤ k ≤ 3). In this paper the coefficient k is set to 2.5. If the expression level of a gene is above its active threshold at a time point, the corresponding protein is regarded as active at that time point. For each time point, if two proteins that interact in the static PPI network are both active at that time point, the proteins and their interaction form part of the NF-APIN at that time point. The process is repeated over all time points until the NF-APIN is complete.
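The construction steps can be sketched as follows. This is only an illustration under a loud assumption: Formula (14) is not reproduced in the text, so the threshold below uses the simple form u + kσ as a stand-in, with the population standard deviation. The names `active_threshold` and `build_active_network` and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def active_threshold(values, k=2.5):
    """ASSUMED form u + k * sigma; the paper's Formula (14) may differ."""
    return mean(values) + k * pstdev(values)


def build_active_network(expression, static_edges, k=2.5):
    """Return {time point: set of static edges whose two proteins are active}."""
    thresholds = {g: active_threshold(v, k) for g, v in expression.items()}
    n_points = len(next(iter(expression.values())))
    active_edges = {}
    for t in range(n_points):
        # A protein is active at t if its expression is above its own threshold.
        active = {g for g, v in expression.items() if v[t] > thresholds[g]}
        edges = {(u, v) for u, v in static_edges if u in active and v in active}
        if edges:
            active_edges[t] = edges
    return active_edges


# Toy data: A and B peak together at the last time point, C peaks at the first.
expression = {"A": [0, 0, 0, 10], "B": [0, 0, 0, 8], "C": [9, 0, 0, 0]}
static_edges = [("A", "B"), ("A", "C")]
network = build_active_network(expression, static_edges, k=1.0)
```

In the toy example k is lowered to 1.0 only because a series of four values cannot exceed its mean by 2.5 population standard deviations. The interaction A-C is dropped because the two proteins are never active at the same time point.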

Centrality measures

A PPI network is usually regarded as an undirected graph G = (V, E), where a node v ∈ V represents a protein and an edge e(u, v) ∈ E denotes an interaction between two proteins v and u. The active PPI network constructed by our strategy is denoted G' = (V', E'), where a node v ∈ V' represents a protein and an edge e(u, v) ∈ E' denotes an interaction between two proteins v and u. We denote by N the total number of nodes in the network. In graph theory and network analysis, the centrality of a vertex measures its relative importance within a graph. Six classical centrality measures based on network topology are defined as follows.

Degree Centrality (DC). The degree centrality of a vertex v is defined as

$$DC(v) = \deg(v)$$

where deg(v) is the degree of vertex v.

Betweenness Centrality (BC). The betweenness centrality of a vertex v is defined as the fraction of shortest paths that pass through the node v:

$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the total number of shortest paths from node s to node t, and $\sigma_{st}(v)$ is the number of those paths that pass through v.

Closeness Centrality (CC). The closeness centrality of a vertex v is the reciprocal of the sum of graph-theoretic distances from the node v to all other nodes in the graph G:

$$CC(v) = \frac{1}{\sum_{u \in V} d(u, v)}$$

where d(u, v) is the distance between nodes u and v, defined as the length of the shortest path between them.

Subgraph Centrality (SC). The subgraph centrality of a vertex i accounts for the closed walks in which i takes part, giving more weight to closed walks of short length:

$$SC(i) = \sum_{l=0}^{\infty} \frac{\mu_l(i)}{l!} = \sum_{j=1}^{N} \left(v_j^i\right)^2 e^{\lambda_j}$$

where $\mu_l(i)$ is the number of closed walks of length l starting and ending at protein i, $v_1, v_2, \ldots, v_N$ is an orthonormal basis of $R^N$ composed of eigenvectors of the adjacency matrix A of the network, $\lambda_1, \lambda_2, \ldots, \lambda_N$ are the corresponding eigenvalues, and $v_j^i$ denotes the ith component of $v_j$.

Local Average Connectivity Centrality (LAC). The local average connectivity of a node v, LAC(v), is defined as the average local connectivity of its neighbors:

$$LAC(v) = \frac{\sum_{w \in N_v} \deg^{C_v}(w)}{|N_v|}$$

where $N_v$ is the set of neighbors of node v, $C_v$ is the subgraph $G[N_v]$ induced by $N_v$, and, for a node w in $C_v$, $\deg^{C_v}(w)$ is its degree within $C_v$.

Edge Clustering Coefficient (NC) [11]. The edge clustering coefficient of an edge $E_{u,v}$ can be defined by the following expression:

$$ECC(u, v) = \frac{Z_{u,v}}{\min(d_u - 1, d_v - 1)}$$

where $Z_{u,v}$ denotes the number of triangles in the network that include the edge, and $d_u$ and $d_v$ are the degrees of nodes u and v, respectively. The NC score of a node is then the sum of the edge clustering coefficients of its incident edges [11].
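As a rough illustration of the neighborhood-based measures, the sketch below computes DC, CC, LAC and NC on a small undirected graph stored as a dictionary of neighbor sets. It is a hedged example, not the authors' code. The function names and the toy graph are hypothetical, closeness is taken as the reciprocal of the summed distances over reachable nodes, NC sums the edge clustering coefficients of a node's incident edges, and an edge with a degree-1 endpoint is assumed to contribute zero.

```python
from collections import deque


def degree_centrality(adj):
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """Reciprocal of the summed shortest-path distances (reachable nodes only)."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[source] = 1.0 / total if total else 0.0
    return scores


def local_average_connectivity(adj):
    scores = {}
    for v, nbrs in adj.items():
        if not nbrs:
            scores[v] = 0.0
            continue
        # Degree of each neighbour w inside the subgraph induced by N_v.
        inner = sum(len(adj[w] & nbrs) for w in nbrs)
        scores[v] = inner / len(nbrs)
    return scores


def edge_clustering_centrality(adj):
    scores = {}
    for u, nbrs in adj.items():
        total = 0.0
        for v in nbrs:
            denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
            if denom > 0:  # assumption: ECC = 0 when an endpoint has degree 1
                total += len(adj[u] & adj[v]) / denom  # triangles on edge (u, v)
        scores[u] = total
    return scores


# Toy graph: triangle A-B-C with a pendant node D attached to A.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
dc = degree_centrality(adj)
cc = closeness_centrality(adj)
lac = local_average_connectivity(adj)
nc = edge_clustering_centrality(adj)
```

On the toy graph the pendant node D scores zero under both LAC and NC, which shows how these two measures reward densely connected neighborhoods and not raw degree.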

Results

Experimental datasets

The yeast PPI network (version 20101010) was downloaded from DIP [27]. We removed self-interactions and repeated interactions from the original PPI network. As a result, the PPI network used in our experiment has 5093 proteins and 24743 interactions. The yeast dynamic gene expression data come from [28] and include 6,777 gene products at 36 different time points. The 6,777 gene products in the gene expression profile cover 95% of the proteins in the PPI network. The list of essential yeast proteins was collected from the following databases: MIPS [29], SGD [30], SGDP [31] and DEG [32]. It contains 1285 essential proteins, of which 1167 are present in the PPI network.
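The filtering of self-interactions and repeated interactions can be sketched as below. This is a minimal example, assuming the interactions are available as pairs of protein identifiers and that the network is undirected; the function name `clean_interactions` and the identifiers are hypothetical.

```python
def clean_interactions(pairs):
    """Drop self-interactions and duplicate pairs from an undirected edge list."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:  # self-interaction
            continue
        key = (a, b) if a <= b else (b, a)  # (a, b) and (b, a) are the same edge
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


raw = [("P1", "P2"), ("P2", "P1"), ("P3", "P3"), ("P2", "P3"), ("P1", "P2")]
edges = clean_interactions(raw)
```

Storing each edge with its endpoints in sorted order is one simple way to make the duplicate check independent of the order in which the two proteins are listed.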

Comparison of six typical centrality measures in different PPI networks

To validate the performance of the proposed strategy, we compare the two PPI networks by applying the six typical centrality measures defined in the previous section to predict essential proteins. Proteins are ranked in descending order of the scores computed by each centrality measure, and a given number of top-ranked proteins are regarded as essential protein candidates. We select the top 100, 200, 300, 400 and 500 proteins as candidates and count how many of them are true essential proteins. The numbers of essential proteins detected by the six centrality measures in the two networks are shown in Figure 1.
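The ranking-and-counting procedure can be sketched as follows. The function name `essential_in_top` and the toy scores are hypothetical, and ties are broken by protein name, which is an assumption because the text does not say how ties are handled.

```python
def essential_in_top(scores, essential, cutoffs=(100, 200, 300, 400, 500)):
    """Count true essential proteins among the top-k proteins for each cutoff k."""
    # Descending score; ties broken by name (assumption).
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return {k: sum(1 for p in ranked[:k] if p in essential) for k in cutoffs}


scores = {"p1": 5, "p2": 4, "p3": 3, "p4": 2}
essential = {"p1", "p3", "p4"}
counts = essential_in_top(scores, essential, cutoffs=(1, 2, 3))
```

Running the same function on the scores obtained from the static network and from the active network gives the pairs of counts that a comparison such as Figure 1 is built from.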

Figure 1

Number of essential proteins detected by each method in the two networks.

In Figure 1, PPIN denotes that a centrality measure is applied to the original yeast PPI network, and APPIN denotes that it is applied to the active PPI network [24]. As shown in Figure 1, every centrality measure identifies more essential proteins on APPIN than on PPIN. In particular, SC improves by more than 50% on APPIN when predicting 100 proteins, and the number of essential proteins identified by LAC and NC on APPIN reaches 80 or more. To further illustrate the efficiency of our strategy, we analyzed the results with a jackknife methodology [33]. In Figure 2, proteins are ordered in descending order of their scores. The curve plots the cumulative count of true essential proteins against the cumulative count of predicted essential proteins. The areas under the curve (AUC) of each centrality measure in the two networks are compared in Figure 2. The AUC of DC, BC, CC, SC, NC and LAC on APPIN is larger than on PPIN.
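The jackknife curve described here can be sketched as a running count of true essential proteins along the ranked list. The names `jackknife_curve` and `area_under_curve` are hypothetical, and the area is approximated by summing the curve with unit-width steps, which is an assumption since the text does not state how the AUC is computed.

```python
def jackknife_curve(ranked, essential):
    """Cumulative number of true essential proteins after each ranked protein."""
    curve, hits = [], 0
    for protein in ranked:
        if protein in essential:
            hits += 1
        curve.append(hits)
    return curve


def area_under_curve(curve):
    """Unit-width rectangle approximation of the area (assumption)."""
    return sum(curve)


ranked = ["a", "b", "c", "d"]  # proteins in descending order of score
curve = jackknife_curve(ranked, essential={"a", "c"})
area = area_under_curve(curve)
```

A ranking that places essential proteins earlier produces a curve that rises sooner and therefore a larger area, which is the basis for comparing two networks under the same centrality measure.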

Figure 2

DC, BC, CC, SC, LAC and NC compared in the two networks by a jackknife methodology.

In addition, we compare the overlap of true essential proteins predicted by each centrality measure in the two networks. The numbers of true essential proteins among the top 100 predicted proteins are shown in Table 1, where S1 and S2 are the numbers of essential proteins predicted on APPIN and PPIN, respectively, and S3 is the number of essential proteins predicted on both. From Table 1 we can see that the number of common essential proteins identified in the two networks is relatively low. This suggests that identifying essential proteins based on the active PPI network is a useful complement to the static network. In conclusion, identifying essential proteins based on the active PPI network is more effective than identifying them based on the original PPI network. This indicates that active proteins are more likely to be essential proteins.

Table 1

Overlap of essential proteins identified in the two networks when predicting 100 proteins

Centrality measures	S1/S2	S3	S1 − S3	S2 − S3
Degree Centrality (DC)	56/46	26	30	20
Betweenness Centrality (BC)	54/44	23	31	21
Closeness Centrality (CC)	55/41	12	43	29
Subgraph Centrality (SC)	57/37	10	47	27
Edge Clustering Coefficient (NC)	80/56	26	54	30
Local Average Connectivity Centrality (LAC)	82/59	35	47	24
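The columns of Table 1 can be derived from two top-100 lists with simple set operations. The sketch below is illustrative only: the function name `overlap_summary` and the toy protein names are hypothetical, and the two input lists stand for the top-ranked proteins of one centrality measure on each network.

```python
def overlap_summary(top_first, top_second, essential):
    """S1, S2: essential hits per network; S3: essential hits common to both."""
    hits_first = set(top_first) & essential
    hits_second = set(top_second) & essential
    s3 = len(hits_first & hits_second)
    return {
        "S1": len(hits_first),
        "S2": len(hits_second),
        "S3": s3,
        "S1-S3": len(hits_first) - s3,   # essential hits found only in the first
        "S2-S3": len(hits_second) - s3,  # essential hits found only in the second
    }


summary = overlap_summary(
    top_first=["a", "b", "c", "d"],
    top_second=["c", "d", "e", "f"],
    essential={"a", "c", "e", "x"},
)
```

The same arithmetic can be checked against any row of the table: for DC, S1 − S3 = 56 − 26 = 30 and S2 − S3 = 46 − 26 = 20.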

Conclusion

The prediction of essential proteins remains an active topic in the post-genome era. Many studies identify essential proteins from entire PPI networks. However, PPI data obtained with various experimental techniques and methods generally contain false positives, so the original PPI data alone are insufficient for identifying essential proteins. In this study, we first filtered noisy genes based on dynamic gene expression profiles and then constructed an active PPI network. After that, we predicted essential proteins from the constructed active PPI network using six typical centrality measures. The experimental results show that the precision of identifying essential proteins based on our active PPI network is higher than that based on the original PPI network. One direction for future work is to apply other prediction methods to active PPI networks and to confirm whether essential proteins have active characteristics.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

QX and JW obtained the protein-protein interaction data, gene expression data and essential proteins, generated the prediction model and drafted the manuscript. QX and XP performed the experimental comparison and evaluated the results. JW, QX, FW and YP initiated the study and wrote the manuscript. All authors have read and approved the final manuscript.

29 in total

1. Identification of essential proteins based on edge clustering coefficient.

Authors: Jianxin Wang; Min Li; Huan Wang; Yi Pan
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2012 Jul-Aug Impact factor: 3.710

2. Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes.

Authors: Benjamin P Tu; Andrzej Kudlicki; Maga Rowicka; Steven L McKnight
Journal: Science Date: 2005-10-27 Impact factor: 47.728

3. Subgraph centrality in complex networks.

Authors: Ernesto Estrada; Juan A Rodríguez-Velázquez
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2005-05-06

4. Functional centrality: detecting lethality of proteins in protein interaction networks.

Authors: Kar Leong Tew; Xiao-Li Li; Soon-Heng Tan
Journal: Genome Inform Date: 2007

5. TnAraOut, a transposon-based approach to identify and characterize essential bacterial genes.

Authors: N Judson; J J Mekalanos
Journal: Nat Biotechnol Date: 2000-07 Impact factor: 54.908

6. Why do hubs tend to be essential in protein networks?

Authors: Xionglei He; Jianzhi Zhang
Journal: PLoS Genet Date: 2006-04-26 Impact factor: 5.917

7. High-betweenness proteins in the yeast protein interaction network.

Authors: Maliackal Poulo Joy; Amy Brock; Donald E Ingber; Sui Huang
Journal: J Biomed Biotechnol Date: 2005-06-30

8. Detecting protein complexes from active protein interaction networks constructed with dynamic gene expression profiles.

Authors: Qianghua Xiao; Jianxin Wang; Xiaoqing Peng; Fang-Xiang Wu
Journal: Proteome Sci Date: 2013-11-07 Impact factor: 2.480

9. Computational prediction of essential genes in an unculturable endosymbiotic bacterium, Wolbachia of Brugia malayi.

Authors: Alexander G Holman; Paul J Davis; Jeremy M Foster; Clotilde K S Carlow; Sanjay Kumar
Journal: BMC Microbiol Date: 2009-11-28 Impact factor: 3.605

10. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes.

Authors: Ren Zhang; Yan Lin
Journal: Nucleic Acids Res Date: 2008-10-30 Impact factor: 16.971
