
Detecting Protein Complexes in Protein Interaction Networks Modeled as Gene Expression Biclusters.

Eileen Marie Hanna¹, Nazar Zaki¹, Amr Amin²,³.

Abstract

Developing suitable methods for the detection of protein complexes in protein interaction networks continues to be an intriguing area of research. The importance of this objective originates from the fact that protein complexes are key players in most cellular processes. The more complexes we identify, the better we can understand normal as well as abnormal molecular events. Up till now, various computational methods were designed for this purpose. However, despite their notable performance, questions arise regarding potential ways to improve them, in addition to ameliorative guidelines to introduce novel approaches. A close interpretation leads to the assent that the way in which protein interaction networks are initially viewed should be adjusted. These networks are dynamic in reality and it is necessary to consider this fact to enhance the detection of protein complexes. In this paper, we present "DyCluster", a framework to model the dynamic aspect of protein interaction networks by incorporating gene expression data, through biclustering techniques, prior to applying complex-detection algorithms. The experimental results show that DyCluster leads to higher numbers of correctly-detected complexes with better evaluation scores. The high accuracy achieved by DyCluster in detecting protein complexes is a valid argument in favor of the proposed method. DyCluster is also able to detect biologically meaningful protein groups. The code and datasets used in the study are downloadable from https://github.com/emhanna/DyCluster.

Substances:
Multiprotein Complexes

Year: 2015 PMID: 26641660 PMCID: PMC4671556 DOI: 10.1371/journal.pone.0144163

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Protein complexes are groups of interacting proteins associated to specific cellular functions [1] and they are fundamental players in almost all biological processes. The identification of the complexes incorporated in a protein-protein interaction (PPI) dataset is indeed highly beneficial. One of the ultimate goals of this scenario is to be able to associate protein complexes with normal molecular events, and subsequently, to link the occurrence of inconsistent processes with different diseases. Undoubtedly, such knowledge could lead to the development of more effective therapies. The experimental methods designed to study the PPI and incorporated complexes, such as yeast two-hybrid (Y2H) [2] and tandem affinity purification (TAP-MS) [3] approaches, are vulnerable to high error rates [4] and practically limited in terms of time and cost. As a result, various computational methods were developed to complement and to reduce the required experimental efforts. A graph G = (V, E) is conventionally used to represent proteins V and their interconnections E as nodes and edges, respectively. Such representation was, and still is, the basis of many computational methods seeking to accurately describe and to identify enclosed protein-complex structures. A large number of these methods are based on the assumption that protein complexes correspond to dense and highly-interconnected sub-graphs. 
Among those methods, we here point out: Markov Clustering (MCL) [5] which uses random walks in protein interaction networks; the molecular complex detection (MCODE) algorithm [6] which considers complexes as dense regions grown from highly-weighted vertices; the clustering based on maximal cliques (CMC) method [7]; the Affinity Propagation (AP) algorithm [8]; ClusterONE [9] which identifies protein complexes by clustering with overlapping neighborhood expansion; the restricted neighborhood search (RNSC) algorithm [10, 11]; and CFinder [12] which is based on the clique percolation method. Other approaches which are not centered on the density notion were also presented; namely: ProRank [13, 14] and ProRank+ [15] which mainly use a protein ranking algorithm to identify essential proteins in a PPI network and form complexes accordingly; and finally PEWCC [16] which assesses the reliability of PPI data based on the weighted clustering coefficient notion prior to detecting protein complexes. When compared to reference sets of biologically-identified protein complexes, most of the introduced computational approaches could achieve good complex-detection rates with adequate evaluation scores. Certainly, the higher their accuracy levels, the more they are reliable and the more likely they can be utilized by scientists and biologists. The improvements of protein-complex detection algorithms as well as the design of novel approaches seem to meet at the notion of reforming the way in which a PPI dataset is initially represented. PPI networks are in fact dynamic [17]. Hence, the shift from viewing PPI networks as static to modeling the dynamicity of these networks became fundamental [18]. This adaptation can currently be acquired thanks to the amounts and the diversity of biological information, whether temporal, spatial or contextual, generated by advanced experimental techniques such as ChIP-chip [19] and ChIP-seq [20]. 
In this paper, first we emphasize the advantages of shifting to dynamic PPI networks, specifically when it comes to the problem of detecting protein complexes; then we underline possible approaches to model the dynamic aspect of protein interactions and we highlight some of the existing methods. Second, we introduce “DyCluster”, a framework for the detection of protein complexes in dynamic PPI networks modeled using gene expression data, through biclustering techniques. Finally, we present our experimental study which shows that the results generated by applying complex-detection methods based on our framework are better than those corresponding to the methods applied on static PPI networks, in terms of the number of matched complexes, accuracy and other evaluation scores. An additional experiment on biological data is also presented.

Dynamic PPI Networks

As an inter-disciplinary research area, computational biology is expected to profit from the continuous growth and diversity of biological data collected using advanced experimental techniques. Such information includes, but is not limited to, gene expression data [21] which report quantitative measurement of RNA species in cellular compartments across various conditions; sub-cellular localization annotations [22] which provide spatial positions of elements in cellular components; and gene ontology annotations [23] which highlight genes that are present across different species. The enrichment of biological representations, and particularly PPI networks, using such data types indeed allows better replication of real cellular events through the modeling of temporal, spatial and contextual dynamics which describe and influence cellular processes [24-26]. When the dynamics controlling the occurrence of protein interactions are included in PPI networks, the analytical results, and namely the detected protein complexes in such network variants, would potentially be more accurate. Since PPI datasets are generated by experimental techniques that are liable to high error rates [4], the computational methods designed to explore them are also susceptible to those errors. Various filtering techniques were thus proposed to pre-process PPI data before analyzing them, such as FSWeight [27], AdjustCD [28] and PE-measure [16]. Nevertheless, issues also exist in other biological information, such as gene expression data, which have yet low protein coverage in contrast with PPI datasets that are typically very large. Despite that, the combination of different descriptive biological data may be considered as a search for evidence intersection. The higher the recurrence of information and/or inferences in experimental results, the better could be our confidence that they exist in reality. 
Consequently, dynamic PPI networks, modeled using various experimental data, could verify or possibly contradict known biological concepts and may as well uncover previously-unknown biological facts. Different kinds of information could be drawn when exploring a PPI data set. Nonetheless, the categorization of such data is generally not simple; as in the case of distinguishing between protein complexes and functional modules, for example. In fact, complexes are formed by proteins which interconnect at the same time and place, whereas the members of functional modules may interact at different times and places [29]. Accordingly, by incorporating spatiotemporal information drawn from gene expression and sub-cellular localization annotations datasets, for instance, such classification of network modules can be acquired. Similarly, the biological enrichment of a PPI network potentially allows the identification of protein sub-complexes. Many methods were developed to solve this important research problem, but they all apply to static PPI networks [30]. The inclusion of temporal, spatial and contextual attributes, which guide PPIs, can lower the rates of false positives and false negatives at the level of the detected complexes and their protein members as well. In other words, these attributes can be used to cluster the proteins and their interconnections based on the conditions which govern them. A protein complex-detection method shall be applied on the clusters, with a generalization capability indeed. Consequently, the overall accuracy of the produced results would be better. The former potentially applies to other exploratory approaches of PPI networks. Instead of a single and comprehensive representation of a PPI dataset, by incorporating conditionality features of PPI events, we would rather be looking at a series of snapshots of a PPI network modeled based on either one or a combination of temporal, spatial and contextual settings (Fig 1). 
The interpretation of a dynamic interaction network and its state transitions depends on the types of data which are used to biologically-condition PPI events.

Fig 1

Snapshots of a hypothetical PPI network, showing its dynamics through different temporal, spatial and/or contextual settings.

Nodes and edges of the same color belong to the same protein complex.

Gene expression data report quantities of RNA across different time points in cellular processes. It is believed that genes with correlated expressions across different conditions most likely interact. The combination of gene expression information with PPI data to model the dynamics of the corresponding PPI networks could potentially reveal the processes which underlie the formation of protein complexes. For instance, Wang et al. [31] showed that a just-in-time mechanism elapsing through continuous time points delineates the formation of most complexes. The statistical 3-sigma principle was then used by the works presented in [31] and [32] to define the active time points of proteins based on their gene expression levels and, consequently, to introduce approaches to detect and refine protein complexes. The core-attachment interpretation of complexes was recently adopted in [33]; based on the dynamics inferred from gene expression data, the identification of a protein complex is split into two main parts: a static core consisting of proteins expressed throughout the whole cell cycle and a short-lived dynamic attachment. The results achieved by these approaches were better than the ones based on static PPI networks. Kim et al. [24] highlighted some of the computational methods used to infer dynamic networks from expression data based on statistical dependence to classify nodes and edges as active or inactive. These methods include: Bayesian networks [34], relevance networks [35], Markov Random Fields [36], ordinary differential equations [37] and logic-based models [38]. Since it is favorable to incorporate the spatial dynamics towards improving complex-detection approaches, various methods were designed to study the spatial movements of proteins [25]. 
However, in addition to mathematical modeling techniques, further approaches to appropriately integrate spatial protein dynamics in PPI networks are still required. By providing information about genes that are shared across species, gene ontology annotations can also be used to model the dynamics of PPI networks [26]. As an indicator of interaction probability, various weighting schemes were introduced to assign PPI weights based on the similarity degrees of gene ontology terms between interacting partners. Among these approaches are SWEMODE [39], which detects communities within PPI networks based on weighted clustering coefficient and weighted average nearest-neighbors degree measures, and OIIP [26], which is a method to detect protein complexes in PPI networks by assigning node and edge weights based on the size of gene annotations. Modeling the dynamics of PPI networks through the integration of biological attributes particularly enhances the computational methods designed to detect protein complexes. It not only participates in uncovering the mechanisms of protein-complex formation but also points out useful details for the design of such methods. In addition, the former may help categorize protein complexes and could be informative regarding their building blocks as well.

Methods

We hereafter present DyCluster, a framework for detecting protein complexes in dynamic PPI networks modeled using gene expression data through biclustering techniques. Our framework requires a gene expression dataset and a PPI dataset. It consists of five main steps:

1. Biclustering the gene expression data
2. Extracting the biclusters’ PPIs from an assigned PPI dataset
3. Pruning the biclusters’ PPIs
4. Detecting the protein complexes
5. Merging and filtering the sets of detected protein complexes

An outline of the approach is presented in Fig 2.

Fig 2

An outline of the DyCluster framework developed for the detection of protein complexes in dynamic PPI networks modeled as gene expression biclusters.

Biclustering Gene Expression Data

A gene expression dataset reports the expression levels of a large number of genes across different environmental conditions, time points, organs, species, etc. It is conventionally represented as a matrix in which rows and columns correspond to genes and their expression levels at different conditions (samples), respectively. Various methods were developed to analyze gene expression data under the assumption that the ones which exhibit similar expression patterns across a set of conditions are more likely functionally-related [40]. The analysis of these datasets is challenging because they are usually unbalanced, i.e. the number of genes is quite larger than the number of conditions [41]. Many approaches were proposed to group genes according to their expression patterns; in particular, data mining approaches such as classification and clustering. Classification methods require knowing the label of the resulting classes in advance. Several research efforts were invested in studying the application of such supervised techniques on gene expression data [42]. However, the prior suggestion of classes somehow limits the process of data exploration. On the other hand, typical clustering techniques have two drawbacks when applied on gene expression data [43]: first, each gene must be grouped into a cluster even if its similarity with the cluster members is relatively low; and second, a gene can belong to one cluster only. Consequently, classical clustering methods cannot fully handle gene expression data since they do not account for the fact that a large number of genes can exhibit multiple biological functions [44], and thus can belong to more than one cluster. Besides, clustering spans the whole samples set whereas in reality, the expression patterns of a gene cluster may be correlated based on a subset of samples only. 
What is actually expected is a grouping of co-expressed elements under subsets of conditions, whose expression patterns are presumably independent across the rest of the conditions. Thanks to the simultaneous two-dimensional clustering capability which they provide, biclustering techniques offer better means to explore expression data [45, 46]. They allow the identification of subsets of co-regulated genes across subsets of samples. And in analogy to biological facts, a gene may belong to multiple clusters and a gene may not fit in any cluster in some cases. A formal problem formulation of biclustering gene expression data is as follows: Let A be an n × m data matrix representing a gene expression dataset consisting of n genes measured across m conditions, where $a_{ij}$ is a real value corresponding to the expression level of the gene at row i under the condition at column j. The goal is to find a set of biclusters BC(I, J), where I is a subset of genes which exhibit similar expression patterns across the subset of conditions J. We hereafter highlight some of the existing biclustering approaches which will also be used at later stages to evaluate DyCluster. The first application of biclustering on gene expression data was conducted by Cheng and Church [47]. They presented a method (CC) consisting of a greedy search heuristic to form the biclusters, namely the set covering algorithm, and relying on the Mean Square Residue (MSR) measure to assess their quality based on a specified threshold. The MSR of a bicluster BC, of |I| rows and |J| columns, reflects the degree of coherence of the genes and the conditions that it includes (as shown in Eq (1)):

$$\mathrm{MSR}(BC) = \frac{1}{|I|\,|J|} \sum_{i \in I} \sum_{j \in J} \left( bc_{ij} - bc_{iJ} - bc_{Ij} + bc_{IJ} \right)^2 \qquad (1)$$

where $bc_{ij}$, $bc_{iJ}$, $bc_{Ij}$ and $bc_{IJ}$ represent the element in row i and column j, the mean of row i, the mean of column j, and the mean of BC, respectively. The lower the MSR, the higher is the bicluster coherence. Correlations among genes can be expressed in terms of scaling and shifting patterns. 
One aspect of the robustness of a biclustering algorithm, when applied on expression data, is in its ability to capture both types of patterns. MSR can only detect shifting correspondences among the expression levels of genes [48]. Despite that, it has been adopted by several similar approaches and some variants of this measure were also introduced to identify scaling patterns [49]. Other methods, that do not use metrics to evaluate the formed groupings throughout their operations, were also developed. The Order Preserving Sub Matrix (OPSM) algorithm [50] searches for large sub-matrices in which genes have the same linear ordering of the samples. The Iterative Signature Algorithm (ISA) [51] uses the signature algorithm to identify self-consistent transcriptional modules consisting of co-expressed genes and the samples corresponding to them. A comprehensive survey of these methods and others can be found in [45]. Given a gene expression dataset, the first stage of our framework involves biclustering these data into subsets of genes which exhibit similar variations in their expression levels across subsets of conditions, as shown in Fig 2(a).
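As an illustration of the MSR measure referred to in Eq (1), the following minimal sketch computes the mean squared residue of a sub-matrix in plain Python. It assumes the standard Cheng and Church definition (the squared residue of each element with respect to its row mean, column mean and the overall bicluster mean, averaged over the bicluster); the function name and the toy expression matrix are hypothetical and are not taken from the DyCluster code.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster given by `rows` x `cols`.

    `matrix` is a list of lists (genes x conditions); `rows` and `cols`
    are the index lists I and J of the bicluster.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # residue of one element: value - row mean - column mean + overall mean
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy data: three genes measured under three conditions.
expr = [
    [1.0, 2.0, 3.0],  # gene 0
    [2.0, 3.0, 4.0],  # gene 1 = gene 0 shifted by +1
    [2.0, 4.0, 6.0],  # gene 2 = gene 0 scaled by 2
]
shift_msr = mean_squared_residue(expr, [0, 1], [0, 1, 2])
scale_msr = mean_squared_residue(expr, [0, 2], [0, 1, 2])
print(shift_msr, scale_msr)
```

On this toy input the shifted pair of genes yields an MSR of zero while the scaled pair does not, which is consistent with the remark above that MSR captures shifting but not scaling patterns.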

Extracting Biclusters’ PPI Data

Given the generated set of gene biclusters as shown in Eq (2):

$$BC = \{ BC_1(I_1, J_1), BC_2(I_2, J_2), \ldots, BC_k(I_k, J_k) \} \qquad (2)$$

the next step consists of finding the interconnections among the members of each bicluster based on a specified PPI dataset. The interactions in the PPI dataset which involve elements belonging to the set of proteins, $P = \{p_1, p_2, \ldots, p_n\}$, contained in $BC_i(I_i, J_i)$, are added to the sub-PPI dataset, $BC_i(I_i, J_i)$_PPI, corresponding to this bicluster. The sub-PPI dataset will then include the proteins initially existing in the bicluster in addition to their interaction partners drawn from the considered PPI dataset, as shown in Fig 2(b).
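The extraction step can be sketched as a simple filter over an edge list. This is a minimal illustration, assuming that the PPI dataset is available as pairs of protein identifiers and that an interaction is kept when at least one of its endpoints belongs to the bicluster (which is how interaction partners from outside the bicluster enter the sub-PPI dataset). The function name and the identifiers `P1` to `P5` are hypothetical.

```python
def extract_bicluster_ppis(bicluster_proteins, ppi_edges):
    """Return the interactions that involve at least one bicluster member."""
    members = set(bicluster_proteins)
    return [(a, b) for a, b in ppi_edges if a in members or b in members]


# Hypothetical PPI dataset given as an edge list.
ppi = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
bicluster = ["P1", "P2"]

sub_ppi = extract_bicluster_ppis(bicluster, ppi)
# Proteins in the sub-PPI dataset: bicluster members plus their partners.
sub_proteins = {p for edge in sub_ppi for p in edge}
print(sub_ppi, sorted(sub_proteins))
```

Here `P3` is not a member of the bicluster but appears in the sub-PPI dataset because it interacts with `P2`.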

Pruning Biclusters’ PPI Data

The biological approaches used to identify PPIs are very sensitive to experimental conditions and are thus susceptible to high error rates [4]. As a result, many methods were developed to filter PPI datasets in order to reduce the level of false positive and false negative interactions [16, 27, 28]. In our work, we use the PE method introduced by Zaki et al. in [16] to assess the reliability of protein interactions at the level of generated biclusters and prune the corresponding PPI subsets accordingly. Experiments show that PE-measure is efficient as it reduces the level of noise in protein interaction networks by looking for sub-graphs that are closest to maximal cliques, based on the weighted clustering coefficient measures, Fig 2(c).

Detecting Protein Complexes

Next, a protein-complex detection method is applied to the pruned PPIs of each bicluster separately. As a result, several sets of identified protein complexes are formed ($DC_1, DC_2, \ldots, DC_k$, one set per bicluster), as shown in Fig 2(d).

Merging and Filtering the Detected Sets of Protein Complexes

Merging and filtering the resultant sets of complexes is crucial to the overall accuracy of our approach. However, developing an appropriate post-processing method is challenging because it is subject to various considerations. For instance, in its simplest form, it may consist of matching the detected entities against each other and combining the ones which have an overlap greater than a certain threshold. In contrast, keeping the common members of highly-overlapping entities may also be explored and it might lead to better outcomes. Another approach may think through the core-attachment interpretation of complexes [1] and consider that a repeated subgroup of interacting proteins in several detected groupings may be a potentially correct core, which forms different complexes when linked with various protein attachments. Nonetheless, in our paper, we keep this task for later research stages and we hereby limit the formation of the combined set of complexes to merging based on an overlap threshold and a condition by which members of one complex interact with a certain percentage of members of the other complex; in addition to filtering duplicates. This step finalizes the complex-detection process outlined by our framework, Fig 2(e).

Experimental Study

Datasets

DyCluster requires a gene expression dataset to model the dynamic aspect of protein interactions and a PPI dataset from which the interconnections among those proteins are extracted. The higher the homogeneity of both sets, namely in terms of the species and the number of common genes that they cover, the better the expected outcomes. We referred to the Gene Expression Omnibus (GEO) repository [52], from which we selected the expression dataset of accession number GSE3431 [53], entitled “Logic of the yeast metabolic cycle”. It reports the expression levels of genes across twelve time intervals in three successive metabolic cycles. Our choice was primarily based on its wide coverage of yeast proteins and, potentially, a high number of participants in various cellular processes. The yeast PPI dataset was downloaded from the Database of Interacting Proteins (DIP) [54] catalogue of experimentally-determined protein interactions. Finally, the reference set of yeast protein complexes with which we compared our results is the CYC2008 catalogue [55], which contains 408 complexes.

Experimental Settings

For the gene expression biclustering step, we used three algorithms: OPSM [50], CC [47] and ISA [51]. Here, we note that although efforts are spent in the direction of finding suitable ways to evaluate biclustering approaches [56], comparing their performances is still a challenging task. In addition, in order to shed light on the advantage of using gene expression data, we also examined the results of applying the framework using the one-way clustering method k-means [57]. The parameter settings of these algorithms are presented in Table 1. For the CC algorithm, as mentioned earlier, the Mean Square Residue (MSR) of a bicluster reflects the degree of coherence of the genes and the conditions contained in it: the lower the MSR, the higher the coherence of the bicluster. Here, the upper limit of MSR is 0.5, by default. The threshold for multiple node deletion is used throughout the iterations of the algorithm to remove multiple nodes in the direction of lowering the MSR value of the generated biclusters. The number of output biclusters can also be specified for the CC method, here 10. While searching for large sub-matrices in which genes have the same linear ordering of the samples, the number of passed models at each iteration of the OPSM algorithm is set to 10, by default. The Iterative Signature Algorithm (ISA) identifies co-expressed genes across conditions based on thresholds for gene scores (t_g) as well as condition scores (t_c), both set to 0.5 by default. It also requires specifying the number of starting points for bicluster formation, here 100. The k-means clustering method takes as input parameters the number of clusters to be generated, set to 10, the number of iterations of the algorithm, set to 100, the number of replications, here 1, in addition to the distance measure used to calculate the level of expression similarity of genes, here Pearson’s correlation. 
We used the BicAT tool [58] to visualize and perform the biclustering of the gene expression dataset.

Table 1

Parameter settings of the applied biclustering algorithms.

Algorithm	Parameter settings
CC	upper limit of MSR: δ = 0.5
	threshold for multiple node deletion: α = 1.2
	number of output biclusters = 10
OPSM	number of passed models for each iteration: l = 10
ISA	threshold of genes: t_g = 0.5
	threshold of chips: t_c = 0.5
	number of starting points = 100
k-means	distance measure: Pearson’s correlation
	number of clusters = 10
	number of iterations = 100
	number of replications = 1

For the step consisting of pruning the PPI data at the bicluster level, we adopted the PE method [16] with default parameters, specifically with an edge reliability score threshold equal to 0.1. In terms of protein-complex detection methods, we used ProRank [13], ProRank+ [15], ClusterONE [9], CMC [7], MCODE [6] and CFinder [12]. ProRank, ProRank+, ClusterONE and CFinder were applied with default parameters. Given a protein interaction network, CMC generates maximal cliques which may overlap. The highly-overlapping ones, i.e. with overlap greater than a specified threshold, are examined for possible merging if their degree of inter-connectivity exceeds a merging threshold. The overlap and merging thresholds were set to 0.75 and 0.5, respectively. For MCODE: the degree cutoff for a node to be scored was set to 2; the node score cutoff was set to 0.2, i.e. a node can be added to a cluster (complex) only if its score is no more than 20% less than the score of the seed node of the cluster; the k-core parameter, here set to 2, filters out clusters that do not contain a maximally inter-connected sub-cluster of at least degree k; and the maximum depth parameter, which limits the distance from the seed node within which the algorithm can search for cluster members, was set to 3. In addition, the generated sets of detected complexes were examined and refined as follows: if two complexes have a number of overlapping members greater than 75% of the size of the smaller complex, and if the members of the first complex interact with at least 50% of the members of the second complex, then they are merged.
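The merging rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation, and it rests on two assumptions that the text does not spell out: the interaction condition is read as "at least 50% of the members of the second complex have an interaction partner in the first complex", and merging is repeated greedily until no pair qualifies. The function names and the toy proteins `A` to `F` are hypothetical.

```python
def should_merge(c1, c2, edges, overlap_thr=0.75, interact_thr=0.5):
    """Apply the two merging conditions to a pair of complexes."""
    c1, c2 = set(c1), set(c2)
    # Condition 1: shared members exceed 75% of the smaller complex.
    if len(c1 & c2) <= overlap_thr * min(len(c1), len(c2)):
        return False
    # Condition 2 (assumed reading): at least 50% of the members of c2
    # have an interaction partner inside c1.
    partners_of_c1 = ({b for a, b in edges if a in c1}
                      | {a for a, b in edges if b in c1})
    return len(c2 & partners_of_c1) >= interact_thr * len(c2)


def merge_complexes(complexes, edges):
    """Drop exact duplicates, then merge qualifying pairs until stable."""
    pool = list(dict.fromkeys(frozenset(c) for c in complexes))
    changed = True
    while changed:
        changed = False
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                if should_merge(pool[i], pool[j], edges):
                    union = pool[i] | pool[j]
                    pool = [c for k, c in enumerate(pool) if k not in (i, j)]
                    if union not in pool:
                        pool.append(union)
                    changed = True
                    break
            if changed:
                break
    return pool


# Hypothetical pruned interactions and detected complexes.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"), ("E", "F")]
detected = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"E", "F"}, {"E", "F"}]
merged = merge_complexes(detected, edges)
print(merged)
```

In this toy run the duplicate `{E, F}` is filtered out and the two highly-overlapping complexes are combined into `{A, B, C, D}`.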

Evaluation Scores

The quality scores used to evaluate our approach included: (a) the number of complexes in the reference catalogue that are matched with at least one of the predicted complexes with an overlap score, OS ≥ 0.2; (b) the clustering-wise sensitivity (Sn) and (c) the clustering-wise positive predictive value (PPV), used to calculate the matching quality, mainly in terms of the correctly-matched protein members among the detected complexes; (d) the geometric accuracy (Acc), which is the geometric mean of Sn and PPV; and (e) the maximum matching ratio (MMR), which measures the maximal one-to-one mapping between predicted and reference complexes by dividing the total weight of the maximum matching by the number of reference complexes. Given m predicted complexes and n reference complexes, the corresponding formulas are shown in Table 2, where $t_{ij}$ represents the number of proteins that are found in both reference complex i and predicted complex j, and $n_i$ is the number of proteins in reference complex i.

Table 2

The formulas of the quality scores used to evaluate our approach.

Evaluation Scores	Equations
Overlap score: between two protein complexes A and B	$OS(A,B)=\frac{|A \cap B|^2}{|A|\,|B|}$
Clustering-wise sensitivity	$S_n=\frac{\sum_{i=1}^{n}\max_{j=1}^{m} t_{ij}}{\sum_{i=1}^{n} n_i}$
Clustering-wise positive predictive value	$PPV=\frac{\sum_{j=1}^{m}\max_{i=1}^{n} t_{ij}}{\sum_{j=1}^{m}\sum_{i=1}^{n} t_{ij}}$
Accuracy	$Acc=\sqrt{S_n \times PPV}$
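The scores of Table 2 can be computed directly from two lists of protein sets. The sketch below is a minimal, hypothetical illustration (the function names and toy complexes are not from the paper). It follows the definitions given in the text: the overlap score is the squared size of the intersection divided by the product of the two complex sizes, Sn and PPV are built from the counts of proteins shared by reference complex i and predicted complex j, and Acc is their geometric mean. MMR is left out because it requires a maximum weighted bipartite matching.

```python
from math import sqrt


def overlap_score(a, b):
    """OS(A, B) = |A ∩ B|^2 / (|A| * |B|)."""
    a, b = set(a), set(b)
    return len(a & b) ** 2 / (len(a) * len(b))


def evaluate(reference, predicted, os_threshold=0.2):
    """Matched-complex count, Sn, PPV and Acc for a set of predictions."""
    ref = [set(c) for c in reference]
    pred = [set(c) for c in predicted]
    # t[i][j]: proteins shared by reference complex i and predicted complex j
    t = [[len(r & p) for p in pred] for r in ref]
    matched = sum(
        1 for r in ref
        if any(overlap_score(r, p) >= os_threshold for p in pred)
    )
    sn = sum(max(row) for row in t) / sum(len(r) for r in ref)
    col_max = [max(t[i][j] for i in range(len(ref))) for j in range(len(pred))]
    total = sum(sum(row) for row in t)
    ppv = sum(col_max) / total if total else 0.0
    return {"matched": matched, "Sn": sn, "PPV": ppv, "Acc": sqrt(sn * ppv)}


# Hypothetical reference and predicted complexes.
reference = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B"}, {"D", "E", "F"}, {"X", "Y"}]
scores = evaluate(reference, predicted)
print(scores)
```

Note that, under these formulas, a predicted complex sharing no protein with any reference complex (here `{X, Y}`) does not lower PPV, since it contributes zero to both the numerator and the denominator.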

Results

According to the presented framework, the gene expression dataset, GSE3431, was processed by the three biclustering algorithms, OPSM, CC and ISA, and by the k-means clustering algorithm, one at a time. The PPIs corresponding to the proteins contained in each of the resulting biclusters were extracted from the specified yeast PPI dataset and were pruned using the PE technique. The protein complex-detection methods listed above were applied on the generated biclusters. Finally, the detected sets of complexes were merged, filtered and matched against the CYC2008 reference catalogue. In order to observe the advantage of our approach, Table 3 presents the results of detecting protein complexes in static PPI networks using various methods, i.e. without incorporating gene expression data. In contrast, Table 4 shows the results corresponding to our proposed approach. Results in both tables are in terms of the number of matched protein complexes and the number of detected complexes along with the corresponding evaluation scores.

Table 3

Experimental results of matching the detected sets of protein complexes by various detection methods against the CYC2008 reference catalogue.

Method	No. of matched complexes	No. of detected complexes	Acc	Sn	MMR	PPV
ProRank	41	230	0.4715	0.3072	0.1032	0.7237
ProRank+	46	274	0.4788	0.3371	0.1161	0.6801
ClusterONE	76	365	0.6008	0.511	0.2349	0.7064
CMC	114	4292	0.6587	0.6517	0.347	0.6658
MCODE	62	168	0.55	0.4271	0.149	0.7082
CFinder	116	6381	0.6143	0.5641	0.3776	0.669

Table 4

Experimental results of matching the detected sets of protein complexes by our proposed framework against the CYC2008 reference catalogue in comparison to ProRank, ProRank+, ClusterONE, CMC, MCODE and CFinder.

Method	Biclustering Algorithm	No. of matched complexes	No. of detected complexes	Acc	Sn	MMR	PPV
ProRank	OPSM	78	335	0.5911	0.4627	0.2103	0.755
	CC	63	252	0.5658	0.4296	0.1804	0.7451
	ISA	71	320	0.564	0.4332	0.195	0.7342
	k-means	71	331	0.556	0.4222	0.1896	0.7322
ProRank+	OPSM	81	397	0.5982	0.5116	0.225	0.6995
	CC	65	305	0.5668	0.4724	0.1947	0.6802
	ISA	78	392	0.5677	0.4719	0.2231	0.683
	k-means	78	424	0.5687	0.4782	0.2196	0.6764
ClusterONE	OPSM	89	929	0.6426	0.5758	0.2469	0.7172
	CC	78	578	0.6267	0.5465	0.2036	0.7186
	ISA	87	890	0.6015	0.5506	0.2499	0.6571
	k-means	83	862	0.6153	0.533	0.2334	0.7102
CMC	OPSM	100	1207	0.6159	0.5566	0.2903	0.6816
	CC	95	1145	0.5983	0.5264	0.2844	0.6801
	ISA	100	1843	0.6041	0.5518	0.3071	0.6614
	k-means	94	1126	0.6088	0.5542	0.2913	0.6689
MCODE	OPSM	71	475	0.5695	0.4602	0.1835	0.7049
	CC	60	285	0.545	0.4058	0.1581	0.7321
	ISA	63	315	0.5529	0.4232	0.171	0.7222
	k-means	74	448	0.5658	0.4583	0.1947	0.6986
CFinder	OPSM	94	2079	0.6187	0.525	0.2925	0.7291
	CC	98	1236	0.5977	0.559	0.3005	0.6391
	ISA	99	2119	0.5738	0.5393	0.3021	0.6104
	k-means	99	1352	0.5988	0.5455	0.3098	0.6574

As the experimental results show, the incorporation of gene expression data in the process of detecting protein complexes in dynamic PPI networks is indeed beneficial, in contrast with the outcomes of detecting complexes in static networks. On one hand, it could notably increase the number of matched complexes, as is the case for ProRank, ProRank+ and ClusterONE. We note here that the total number of detected complexes increased. Nevertheless, the quality scores, which depend on this number and the number of matched complexes as well, were slightly better. This underlines the effectiveness and the potential of our framework in terms of increasing the number of matches while also improving the quality of the detected entities. Here, we recall the need to develop a more suitable approach for merging, filtering and refining the identified sets of complexes (the last step of the presented framework), which would potentially lead to enhanced evaluation scores. On the other hand, biclustering genes based on their expression patterns could significantly reduce the large number of complexes detected by some algorithms, such as CMC and CFinder, while not compromising the quality of the results. We also examine the statistical significance of the improvements in the evaluation metrics (Acc, Sn, MMR and PPV). To do that, we perform a paired t-test to compare the results of just applying each complex-detection method on the PPI data, i.e. the scores in each row of Table 3, with the scores corresponding to applying the framework with the same detection method and the various biclustering algorithms (scores in Table 4). The samples are considered related since they are based on the same PPI data and reference set of protein complexes. Fig 3 shows the resulting p-values that are less than or equal to 0.1; they correspond to significant improvements given by the proposed framework. It is important to note that p-values tend to be lower when the difference in the sample means is higher. 
Although the mean differences among the considered scores are not high in this case, we can still note the reflected statistical significance of the improvements.
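
As an illustration of the paired comparison described above, the following Python sketch computes a one-sided paired t-test using only the standard library. The score values, the function names `t_pdf` and `paired_t_test`, the one-sided alternative and the way scores are paired are assumptions made for this example; none of them are taken from Tables 3 and 4.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def paired_t_test(before, after, steps=2000):
    """Return (t, p) for the one-sided alternative mean(after - before) > 0.

    Assumes equal-length samples whose differences are not all identical.
    `steps` must be even because Simpson's rule is used for the integral.
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    df = n - 1

    # Simpson's rule for the integral of the density from 0 to |t|.
    h = abs(t) / steps
    area = t_pdf(0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3

    upper_tail = 0.5 - area  # P(T > |t|) by symmetry of the density
    p = upper_tail if t > 0 else 1 - upper_tail
    return t, p


# Hypothetical scores (e.g. Acc, Sn, MMR, PPV) for one detection method,
# first on the static network, then with the framework. Invented values.
static_scores = [0.30, 0.42, 0.25, 0.38]
framework_scores = [0.33, 0.44, 0.29, 0.40]

t_stat, p_value = paired_t_test(static_scores, framework_scores)
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")
```

With real data, a library routine such as `scipy.stats.ttest_rel` would normally replace the hand-written integration; the sketch only makes the arithmetic of the test explicit.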

Fig 3

Statistical significance of score differences between pairs of protein-complex detection methods without and with gene expression data based on the proposed framework.

The displayed p-values are the ones less than or equal to 0.1 reflecting improvements in the scores, i.e. the matching qualities of the detected protein complexes.

The conducted study validates the enhancement of protein complex-detection approaches by integrating gene expression data, particularly through biclustering techniques. The framework models the dynamic aspect of PPI networks by grouping proteins according to the similarities of their expression patterns across subsets of conditions. Moreover, our method is not restricted by the imposition of a single threshold on gene expression levels. As mentioned earlier, biclustering approaches are better than conventional clustering methods when it comes to expression data analysis [45, 46]. Nonetheless, the results attained by DyCluster using the k-means clustering algorithm also highlight the improvement that can be gained by incorporating gene expression information to model the dynamics of protein interactions and to detect protein complexes in PPI networks accordingly.
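
The grouping step can be pictured with a small Python sketch. It assumes, as one plausible reading of the framework, that each bicluster yields a sub-network made of the PPI edges whose two proteins both belong to that bicluster, and that a complex-detection method is then run on each sub-network. The function name `bicluster_subnetworks`, the protein identifiers and the bicluster names are hypothetical.

```python
def bicluster_subnetworks(edges, biclusters):
    """Map each bicluster name to the PPI edges with both endpoints in it."""
    return {
        name: [(u, v) for u, v in edges if u in genes and v in genes]
        for name, genes in biclusters.items()
    }


# Toy PPI network given as undirected edges (invented identifiers).
ppi_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5"), ("P1", "P5")]

# Toy biclusters: sets of genes/proteins co-expressed under some conditions.
# A protein may appear in more than one bicluster.
biclusters = {
    "bicluster_1": {"P1", "P2", "P3"},
    "bicluster_2": {"P3", "P4", "P5"},
}

subnetworks = bicluster_subnetworks(ppi_edges, biclusters)
for name, sub_edges in subnetworks.items():
    print(name, sub_edges)
```

Because biclusters may overlap, the same protein can take part in several sub-networks, which is one way of reflecting that interactions are active only under some conditions.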

Testing DyCluster on Biological Data

To further test the effectiveness of the presented framework in identifying biologically related groups of genes/proteins, we selected 140 pathway-focused genes implicated in apoptosis (programmed cell death) and inflammation in rat. The Rat Apoptosis RT2 Profiler PCR Array profiles the expression of 84 key genes (available at http://www.sabiosciences.com/rt_pcr_product/HTML/PARN-012Z.html) involved in programmed cell death. Apoptosis plays a critical role in normal biological processes requiring cell removal, including differentiation, development, and homeostasis. Similarly, the Rat Inflammatory Cytokines and Receptors RT2 Profiler PCR Array profiles the expression of another 84 key genes (available at http://www.sabiosciences.com/rt_pcr_product/HTML/PARN-011Z.html) mediating the inflammatory response. Acute inflammation occurs in response to cell damage due to infection or injury. During this process, cellular and plasma-derived factors encourage extravasation, the recruitment of circulating immune cells into the affected tissue. The two sets of genes, which are relevant to liver cancer, were then combined, and housekeeping genes and redundant genes were removed. Monitoring the expression of these genes helps to determine the mechanisms behind programmed cell death. The genes were then processed using String 9.1 [59] (Search Tool for the Retrieval of Interacting Genes/Proteins), a biological database and web resource of known and predicted protein-protein interactions. Genes with no records in String 9.1 were removed, leaving 140 genes. All proteins and their interactions were retrieved and the corresponding network was built. 
Once the PPI network (1,413 interactions and 140 proteins) was built, several enrichment features available in String 9.1 (features related to KEGG pathway, Reactome Pathway, Molecular function, Pfam domain, InterPro-Domains) were used to generate several sub-networks/groups, which were then treated as protein complexes. The idea here is to see whether DyCluster is capable of detecting such groups of biologically related proteins given only the PPI network information. In this experimental work, the gene expression dataset with accession number GSE17384 was downloaded from the GEO [52] repository. It is entitled: “Gene expression data from the LEC rat model with naturally occuring and oxidative stress induced liver tumorigenesis” [60]. It reports the variations of gene expression levels in a stepwise manner, from the normal liver condition to the chronic induced liver tumor, using time-series microarray analysis. In other words, the study involves a comparison between normal liver tissues and developed liver tumors at different time points. It could potentially reveal genes which participate in the progressive formation of the disease. The OPSM method [50] was used to bicluster the gene expression data since it showed a relatively good performance in our experimental study. The PPI dataset was deduced from the two sets of genes involved in apoptosis (RT2 Profiler PCR Array Rat Apoptosis, PARN-012A) and inflammation, as described above. The ProRank+ algorithm was employed to detect the corresponding protein entities/complexes. Then, we examined the generated results for potential matches with the reference sub-networks/groups generated using String. Table 5 shows the components detected by the DyCluster framework, listed by type, along with their matching percentages. The experimental results thus support the potential of our approach in detecting and understanding protein entities that play key roles in normal and abnormal cellular functions.
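
The text does not spell out how the matching percentages in Table 5 are computed. One plausible reading, assumed here purely for illustration, is the share of a reference group's proteins that are recovered by the best-overlapping detected complex. The function name `matching_percentage`, the group names and the protein identifiers in this Python sketch are hypothetical.

```python
def matching_percentage(reference, detected):
    """Best percentage of `reference` members found in any single detected complex."""
    if not reference:
        raise ValueError("reference group must not be empty")
    best = max((len(reference & complex_) for complex_ in detected), default=0)
    return 100.0 * best / len(reference)


# Toy reference groups (e.g. derived from enrichment features) and toy
# detected complexes. All identifiers are invented.
reference_groups = {
    "group_A": {"P1", "P2", "P3"},
    "group_B": {"P4", "P5", "P6", "P7"},
}
detected_complexes = [
    {"P1", "P2", "P3", "P8"},
    {"P4", "P5", "P9"},
]

for name, members in reference_groups.items():
    print(name, round(matching_percentage(members, detected_complexes), 1))
```

Under this assumed definition, a value of 100 means that some detected complex contains every protein of the reference group, possibly alongside additional proteins.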

Table 5

The biological components detected by our framework, listed by types, along with their matching percentages.

Type	Detected Component	Matching Percentage (%)
InterPro-Domains	Chemokine receptor family	100
	G protein-coupled receptor, rhodopsin-like	100
	GPCR, rhodopsin-like, 7TM	100
	BLC2 family	83.3
	BLC2-like	83.3
	Death effector domain	66.7
	Interleukin-6 receptor alpha, binding	50
	Death domain	100
	Apoptosis regulator, Bcl-2, BH2 motif, conserved site	75
	Chemokine interleukin-8-like domain	60
KEGG Pathway	Chemokine signaling pathway	40
	Cytokine-cytokine receptor interaction	32.8
	NOD-like receptor signaling pathway	31.3
	Apoptosis	34.4
	Autoimmune thyroid disease	71.4
	Huntington’s disease	66.7
	Systemic lupus erythematosus	40
	Asthma	50
	Intestinal immune network for IgA production	25
	Cell adhesion molecules	50
	Pathways in cancer	70
Molecular Function	Peptide receptor activity	58.3
	Receptor activity	52.2
	Growth factor activity	60
	C-C chemokine binding	66.7
	Tumor necrosis factor receptor superfamily binding	40
	Death effector domain binding	66.7
	Growth factor binding	50
	Nucleic acid binding transcription factor activity	75
	Chemokine activity	77.8
Pfam Domains	7 transmembrane receptor, rhodopsin family	100
	Apoptosis regulator proteins, Bcl-2 family	83.3
	Death effector domain	66.7
	Interleukin-6 receptor alpha chain, binding	50
	Small cytokines (intecrine/chemokine), interleukin-8 like	53.3
	Death domain	100
Reactome Pathway	Activation of DNA fragmentation factor	66.7
	Interleukin-1 family precursors are cleaved by caspase-1	100
	Downstream TCR signaling	100
	FasL/CD95L signaling	100
	Exocytosis of platelet alpha granule contents	100
	IRAK4 is activated by autophosphorylation	75
	Beta defensins	66.7
	TRAIL signaling	66.7
	Interleukin-1 processing	75
	FASL:FAS Receptor Trimer, FADD complex	100

Discussion

DyCluster was tested using several biclustering techniques and various protein complex detection methods. As the experimental results show, incorporating gene expression data when detecting protein complexes in dynamic PPI networks is beneficial compared with detecting complexes in static networks. Fig 4 shows the number of matched and detected complexes per detection method presented in Tables 3 and 4. On one hand, our framework can notably increase the correctness and the quality of the results, as is the case for ProRank, ProRank+ and ClusterONE, where the numbers of matched complexes, Acc, Sn, PPV and MMR are higher. On the other hand, biclustering genes based on their expression patterns can significantly reduce the large number of complexes detected by some algorithms, such as CMC and CFinder, without compromising the quality of the outcomes. The framework models the dynamic aspect of PPI networks by grouping proteins according to the similarities of their expression patterns across subsets of conditions. Moreover, it is not restricted by the imposition of a threshold on gene expression levels. As mentioned earlier, biclustering approaches are better than conventional clustering methods when it comes to expression data analysis. Nonetheless, the results attained by DyCluster using the k-means clustering algorithm also highlight the improvement that can be gained by incorporating gene expression information to model the dynamics of protein interactions and to detect protein complexes in PPI networks accordingly. Finally, the results of the case study shown in Table 5 are in favor of the DyCluster framework.

Fig 4

The number of matched (in green) and detected (in blue) complexes per detection method.

Conclusion

DyCluster is a framework for the detection of protein complexes in dynamic protein interaction networks, which are modeled by incorporating gene expression data through biclustering techniques. It responds to the important shift from interpreting PPI data as a single static network to modeling and exploring the dynamic nature of these networks. This is done by incorporating gene expression data, interpreted using biclustering techniques, into the interaction networks and detecting complexes accordingly. The experimental results favor our approach, which allows the correct identification of more protein complexes. Moreover, in cases where this is not attained, the overall number of detected complexes decreases, which leads to better evaluation scores. Hypothetically, the more biological information is added to PPI networks, the better the interaction dynamics are reflected. Therefore, based on our results, further extensions will consist of refining the modeling of PPI dynamics using additional biological data types.
