Literature DB >> 24564744

Prediction of heterotrimeric protein complexes by two-phase learning using neighboring kernels.

Peiying Ruan, Morihiro Hayashida, Osamu Maruyama, Tatsuya Akutsu.

Abstract

BACKGROUND: Protein complexes play important roles in biological systems such as gene regulatory networks and metabolic pathways. Most methods for predicting protein complexes try to find protein complexes with size more than three. It, however, is known that protein complexes with smaller sizes occupy a large part of whole complexes for several species. In our previous work, we developed a method with several feature space mappings and the domain composition kernel for prediction of heterodimeric protein complexes, which outperforms existing methods.
RESULTS: We propose methods for prediction of heterotrimeric protein complexes by extending techniques in the previous work on the basis of the idea that most heterotrimeric protein complexes are not likely to share the same protein with each other. We make use of the discriminant function in support vector machines (SVMs), and design novel feature space mappings for the second phase. As the second classifier, we examine SVMs and relevance vector machines (RVMs). We perform 10-fold cross-validation computational experiments. The results suggest that our proposed two-phase methods and SVM with the extended features outperform the existing method NWE, which was reported to outperform other existing methods such as MCL, MCODE, DPClus, CMC, COACH, RRW, and PPSampler for prediction of heterotrimeric protein complexes.
CONCLUSIONS: We propose two-phase prediction methods with the extended features, the domain composition kernel, SVMs and RVMs. The two-phase method with the extended features and the domain composition kernel using SVM as the second classifier is particularly useful for prediction of heterotrimeric protein complexes.

Entities: Chemical Disease Species

Mesh：

Substances：
Multiprotein Complexes

Year: 2014 PMID： 24564744 PMCID： PMC4016531 DOI： 10.1186/1471-2105-15-S2-S6

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

To identify a set of proteins as a functional protein complex is essential for understanding molecular systems in living cells. Several proteins form a complex and work as a transcription factor, whereas there exist another type of proteins that work as enzymes. Hence, to identify proteins that constitute such transcription factors is useful for uncovering gene regulatory networks and metabolic pathways. Many computational methods have been developed for predicting protein complexes from protein-protein interaction networks [1,2]. Enright et al. developed the Markov cluster (MCL) algorithm [3], which repeatedly executes two operators called expansion and inflation to a matrix whose element represents the transition probability from a protein to another. The expansion operation takes the power of the matrix, and the inflation operation takes the Hadamard power of the matrix. MCL is fast and efficient because of these operations. Macropol et al. developed the repeated random walks (RRW) method [4], which iteratively expands a cluster depending on the probabilities in steady states of random walks with restarts. Maruyama and Chihara improved the RRW method by weighting the restart probabilities and proposed the node-weighted expansion (NWE) method [5]. Bader and Hogue developed the molecular complex detection (MCODE) method [6], which uses a modified clustering coefficient defined by edge density in a subset of the original and adjacent vertices to find densely connected regions. King et al. developed the restricted neighborhood search clustering (RNSC) method [7], which selects clusters generated by a cost function according to the cluster size, density and functional homogeneity. Altaf-Ul-Amin et al. developed DPClus [8], which tries to find densely connected regions. Chua et al. developed the protein complex prediction (PCP) method [9], which finds maximal cliques using the functional similarity weight based on indirect interactions. Liu et al. developed the clustering based on maximal cliques (CMC) method [10], which generates all maximal cliques from the protein-protein interaction networks, and assembles highly overlapped clusters based on their interconnectivity. Wu et al. developed the core-attachment based (COACH) method [11]. Most methods basically focus on finding densely connected subgraph in protein-protein interaction networks. Hence, it is considered to be difficult that they detect small protein complexes because, for instance, the edge density of two interacting proteins is always 1.0 even if the proteins do not form a complex. However, protein complexes with small sizes occupy a large part of whole known protein complexes. CYC2008 is a comprehensive catalogue of 408 manually curated yeast protein complexes [12]. In the catalogue, 172 complexes (42%) are heterodimeric, and 87 complexes (21%) are heterotrimeric as reported also in [13]. In our previous study, hence, we developed a method using our proposed kernel for predicting heterodimeric protein complexes [14], which outperforms an existing method using the naive Bayes classifier [15]. In this paper, we propose prediction methods for heterotrimeric protein complexes by extending techniques in our previous method on the basis of the idea that heterotrimeric protein complexes are not likely to share the same protein with other heterotrimeric protein complexes. For that purpose, we apply supervised learning methods twice such as support vector machine (SVM) [16] and relevance vector machine (RVM) [17]. Tatsuke and Maruyama developed the proteins' partition sampler (PPSampler) method based on the Metropolis-Hastings algorithm, which generates clusters whose sizes follow a power-law distribution, and outperforms other existing methods in F-measure for whole protein complexes [13]. For prediction of heterotrimeric protein complexes, they reported that the F-measure of NWE was better than those of the existing methods, MCL, MCODE, DPClus, CMC, COACH, RRW, and PPSampler. We perform 10-fold cross-validation, and calculate the average F-measure. The results suggest that our proposed methods outperform the existing method NWE.

Methods

In this section, we propose prediction methods for heterotrimeric protein complexes. More accurately, we consider the following problem: Given a network of protein-protein interactions weighted by some reliability, determine whether or not three distinct proteins that are connected in the protein-protein interaction network form a protein complex. Let G(V, E) be an undirected graph with a set V of vertices and a set E of edges, representing the protein-protein interaction network. Here, a vertex represents a protein, an edge (i, j) represents an interaction between proteins Pand P, and the weight wrepresents reliability and strength of the interaction between Pand P. In this paper, we use the WI-PHI database [1] as edge weights, which has been calculated from heterogeneous biological experimental data. We call Pa neighboring protein to Pif (i, j) ∈ E. Then, our proposed methods use the support vector machine (SVM), its discriminant function, and the relevance vector machine (RVM).

Support and relevance vector machine

We briefly review the support and relevance vector machines [16,17]. Suppose that N training data {, t} with target t∈ {-1, 1} are given. For our purpose, corresponds to a set of three distinct proteins, t= 1 corresponds to the case that the set forms a heterotrimeric protein complex. Then, we consider linear models represented by the form where ϕdenotes a basis function, M denotes the number of basis functions, adenotes the coefficient, and b denotes the bias parameter. In the SVM, ϕ() is implicitly defined as K(, ) with a positive semidefinite kernel function K, M is equal to N, and aand b are determined by maximizing the margin. New sets of proteins are classified according to the sign of y(). We make use of this discriminant function y() in our proposed methods. The RVM is a Bayesian sparse kernel technique for classification and regression, and shares some characteristics of the SVM. As well as the SVM, the basis functions of the RVM are given by kernels, which are not required to be positive semidefinite. It, however, is known that training time of the RVM is in general longer than that of the SVM. In the RVM, a hyperparameter γfor each parameter aand a prior distribution over parameters aare introduced to obtain a sparse model. For the classification, the model in Eq. (1) is transformed as σ(y()), where σ(y) denotes the logistic sigmoid function 1/(1 + e), and aand b are determined by maximizing the marginal log-likelihood with respect to γ.

Extension of feature space mapping

In our previous study, we proposed seven feature space mappings for prediction of heterodimeric protein complexes [14]. These are based on the idea that the reliability of the interaction in a heterodimer should be high and conversely the reliability of the interaction between a protein in a heterodimer and a protein not in the heterodimer should be low. We extend the feature space mappings for two interacting proteins to mappings for three proteins. Table 1 shows detailed extended mappings for three distinct proteins P, P, and Pthat are connected in the protein-protein interaction network. Here the fifth mapping in the previous study is eliminated because more neighboring proteins increase the maximum of differences close to the maximum of neighboring weights denoted by (F3). (F1) and (F2) denote the maximum and minimum of the weights of interactions between P, P, and P, respectively. The first feature in the previous study is the weight of the interaction between two proteins. Since there are at least two interactions for three focused proteins and we cannot use all the weights as elements of our feature vector without changes, we take the maximum and minimum of the weights (see Figure 1). In addition, the proteins in a heterotrimer should interact with each other, and (F2), which is the minimum of the weights, is expected to be high. (F3) and (F4) denote the maximum and minimum of the weights of interactions between either of P, P, Pand a neighboring protein P, respectively, where r ≠ i, j, k and (i, r) ∈ E, (j, r) ∈ E, or (k, r) ∈ E. It is considered that (F3), which is the maximum of the neighboring weights of a heterotrimer, should be lower than the weights of interactions in the heterotrimer. Consider the case that a protein Pinteracts with two of proteins P, P, and P, where Pis not any of P, P, and P(see Figure 1). If the weights of both interactions are large, these proteins including Pmay form a complex. We introduce the maximum of smaller weights of interactions with neighboring proteins Pdenoted by (F5). (F6) and (F7) denote the maximum and the minimum of the numbers of domains contained in P, P, and P, respectively. The number of domains in a protein complex is expected to be large because domains are considered as mediators of protein-protein interactions.

Table 1

Feature space mapping from three distinct proteins P, P, P.

(F1)	max{(p,q)∈E\|p,q∈{i,j,k}}wpq
(F2)	min{(p,q)∈E\|p,q∈{i,j,k}}wpq
(F3)	max{(p,r)∈E\|p∈{i,j,k},r∉{i,j,k}}wpr
(F4)	min{(p,r)∈E\|p∈{i,j,k},r∉{i,j,k}}wpr
(F5)	max{(p,r),(q,r)∈E\|p,q∈{i,j,k},p≠q,r∉{i,j,k}}min{wpr,wqr}
(F6)	max{# domains of P_i, # domains of P_j, # domains of P_k}
(F7)	min{# domains of P_i, # domains of P_j, # domains of P_k}

Figure 1

Example of a subgraph including three focused proteins . In this example, protein Pis neighboring to both of Pand P.

Feature space mapping from three distinct proteins P, P, P. Example of a subgraph including three focused proteins . In this example, protein Pis neighboring to both of Pand P. In addition to the extended features, we examine the domain composition kernel developed in our previous study [14]. We defined equivalence =between two proteins Pand Pas the condition that Pconsists of the same domains of P, and defined equivalence =between two sets and that consist of and , respectively, as , where denotes the symmetric group of degree n on the set {1, ⋯, n}. Then, the domain composition kernel Kwas defined by

Two-phase learning approach

Our proposed methods take two-phase learning approach. The basic idea for designing our methods is that heterotrimeric protein complexes are not likely to share the same protein with other heterotrimeric protein complexes. We estimate model parameters of SVM using training data in the first phase, and predict whether or not the training data and the neighboring sets sharing at least one protein with the training data are heterotrimeric protein complexes, respectively. Then, the second phase predictor makes use of the discriminant values obtained by the first phase predictor. It is expected that the discriminant values for a target set of proteins and its neighboring set do not become large together if heterotrimeric protein complexes do not share the same protein. Suppose that the training data set comprises N sets of three distinct proteins with the corresponding label t∈ {-1, 1}. For each , we calculate 7-dimensional feature vector (1)() using (F1),…,(F7) shown in Table 1 and the kernel matrix whose (i, j)-th element is 〈(1)(), (1)()〉 + αK(, ), where α is a constant and denotes the inner product. Then, we obtain the model parameters in Eq. (1) by applying the SVM to the training data set. Let be all sets of three distinct proteins that are neighboring to and connected in the protein-protein interaction network, where we call a neighboring set to if and share the same protein and is not (see Figure 2). For each , we calculate the discriminant values y() and y() for all . Since the discriminant values may include outliers, by taking the averages of positive and negative discriminant values separately, we define four feature space mappings for ,

Figure 2

Example of a subgraph including a focused set of proteins and neighboring sets of proteins. Each neighboring set of three proteins shares at least one protein with the focused set (black circle). In this example, set S1 of three proteins shares two proteins with the focused set, and S2, S3 share one protein, respectively.

where |S| denotes the number of elements in the set S. Here, we define f (2() = 0 (f (2() = 0, f (2() = 0) if . We compose 11-dimensional feature vector (2)() using (1), f (2, f (2, f (2and f (2, calculate the kernel matrix with the (i, j)-th element 〈(2)(), (2)()〉 + αK(, ), and we apply some supervised learning method. It should be noted that our methods use only training data to estimate model parameters. For test data , we calculate 〈(2)(), (2)()〉 + αK(, ) for training data i, and determine whether or not is a heterotrimeric protein complex according to the second classifier. Example of a subgraph including a focused set of proteins and neighboring sets of proteins. Each neighboring set of three proteins shares at least one protein with the focused set (black circle). In this example, set S1 of three proteins shares two proteins with the focused set, and S2, S3 share one protein, respectively.

Computational experiments

Data and implementation

To evaluate our proposed methods, we performed computational experiments and compared them with the existing method NWE [5]. We used the WI-PHI database [1] containing 49607 interacting protein pairs except self interactions as input weights of interactions, which is available at the supporting information web page of the paper. The weights were obtained from high-throughput yeast two-hybrid data [18,19] and several biological databases such as BioGRID [2] and BIND [20] by using a log-likelihood score (LLS) to each dataset and the socioaffinity (SA) index [21] that measures the log-odds score of the number of times that two proteins are observed to interact to the expectation value from the dataset. We prepared datasets using heterotrimeric protein complexes in CYC2008 protein complex catalogue [12], which contains 87 heterotrimeric protein complexes, and is available at http://wodaklab.org/cyc2008/. We restricted positive and negative examples to sets of three distinct proteins that form a single connected component in the input protein-protein interaction network. Thus, 7 heterotrimers were eliminated, and we used 80 heterotrimers as positive examples. For negative examples, we extracted 32647 sets of three proteins included in protein complexes with size more than three of CYC2008, and we selected uniquely at random 100 examples from the sets because our methods require many neighboring sets of three proteins for an example in the second phase. It is considered that negative examples selected from such sets are more difficult to be classified than those selected from all sets of three proteins except heterotrimers. For NWE, we set some options related with the size of complexes so that NWE output protein complexes with size two or more from the WI-PHI protein-protein interaction network in the same way as [13], and extracted only protein complexes with size three from the result. For measuring the performance, we used accuracy, precision, recall, and F-measure defined by where TP, FP, and FN mean the number of true positive, false positive, false negative examples, respectively. We used 'libsvm' (version 3.11) [22] and 'SparseBayes' package (version 2.0) [23] as implementations of SVM and RVM, respectively.

Results

We performed 10-fold cross-validation, and took the average of accuracy, precision, recall, and F-measure. Furthermore, we repeated this procedure 10 times for other datasets with randomly selected negative examples, and took the average. Table 2 shows the results on the average of accuracy, precision, recall, and F-measure by our proposed methods and NWE. 'SVM+SVM' and 'SVM+RVM' denote two-phase methods using SVM and RVM as the second classifier, respectively. 'SVM' denotes usual SVM using only features (1). α denotes the coefficient of the domain composition kernel K. We examined α = 0.5 because the case was best for prediction of heterodimeric protein complexes in our previous study [14]. NWE predicted 54 protein complexes with size three from the WI-PHI protein-protein interaction network, and 19 of them were actual heterotrimeric protein complexes in the CYC2008 protein complex catalogue. We can see from the table that the F-measure by SVM+SVM, SVM+RVM, SVM for both α = 0, and 0.5 were higher than those by NWE, respectively. Furthermore, the accuracy and F-measure by the two-phase method SVM+SVM were higher than those by usual SVM with (1), respectively. The accuracy and F-measure by SVM+RVM, however, were lower than those by SVM, respectively. It implies that RVMs may be less useful than SVMs for these problems that SVMs can be applied. Thus, the results suggest that our proposed methods SVM+SVM, SVM+RVM, and SVM outperform the existing method NWE. The results also suggest the usefulness of the second phase.

Table 2

Results on the average of accuracy, precision, recall, and F-measure by our proposed methods and NWE.

	SVM+SVM		SVM+RVM		SVM		NWE

α	0	0.5	0	0.5	0	0.5
accuracy	0.885	0.907	0.810	0.853	0.861	0.876	-
precision	0.936	0.869	0.847	0.899	0.909	0.873	0.352
recall	0.840	0.926	0.770	0.766	0.819	0.862	0.218
F-measure	0.880	0.891	0.767	0.810	0.854	0.862	0.270

'SVM+SVM' and 'SVM+RVM' denote two-phase methods using SVM and RVM as the second classifier, respectively. 'SVM' denotes usual SVM using only features (1). α denotes the coefficient of the domain composition kernel K. Note that the accuracy is not defined for NWE because it is unsupervised, and predict protein complexes of various sizes. The precision and recall for NWE were calculated as TP divided by the numbers of predicted and known heterotrimers, respectively.

Results on the average of accuracy, precision, recall, and F-measure by our proposed methods and NWE. 'SVM+SVM' and 'SVM+RVM' denote two-phase methods using SVM and RVM as the second classifier, respectively. 'SVM' denotes usual SVM using only features (1). α denotes the coefficient of the domain composition kernel K. Note that the accuracy is not defined for NWE because it is unsupervised, and predict protein complexes of various sizes. The precision and recall for NWE were calculated as TP divided by the numbers of predicted and known heterotrimers, respectively.

Conclusions

We proposed prediction methods by two-phase learning for heterotrimeric protein complexes. In the methods, we extended the feature space mappings in our previous study for prediction of heterodimeric protein complexes, and made use of the discriminant function for neighboring sets of three proteins. To validate our proposed methods, we performed 10-fold cross-validation computational experiments. The results suggest that our two-phase prediction methods and SVM with the extended features outperform the existing method NWE, which was reported to outperform many other existing methods such as MCL, MCODE, DPClus, CMC, COACH, RRW, and PPSampler, although our methods are limited to prediction of heterotrimeric protein complexes. For further evaluation, we would like to perform computational experiments for other datasets if such data become available. We have some possibility to further improve the prediction accuracy. For instance, we can use sequence information for designing feature space mappings as well as domains contained in proteins. In addition, we can introduce some probabilistic model such as conditional and Markov random fields to neighboring sets of three proteins although in this paper we considered kernels between neighboring sets.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PR and MH developed and implemented the methods. MH drafted the manuscript. OM and TA participated in the discussions during the development of the methods and helped draft the manuscript. All authors read and approved the final manuscript.

18 in total

1. Proteome survey reveals modularity of the yeast cell machinery.

Authors: Anne-Claude Gavin; Patrick Aloy; Paola Grandi; Roland Krause; Markus Boesche; Martina Marzioch; Christina Rau; Lars Juhl Jensen; Sonja Bastuck; Birgit Dümpelfeld; Angela Edelmann; Marie-Anne Heurtier; Verena Hoffman; Christian Hoefert; Karin Klein; Manuela Hudak; Anne-Marie Michon; Malgorzata Schelder; Markus Schirle; Marita Remor; Tatjana Rudi; Sean Hooper; Andreas Bauer; Tewis Bouwmeester; Georg Casari; Gerard Drewes; Gitte Neubauer; Jens M Rick; Bernhard Kuster; Peer Bork; Robert B Russell; Giulio Superti-Furga
Journal: Nature Date: 2006-01-22 Impact factor: 49.962

2. WI-PHI: a weighted yeast interactome enriched for direct physical interactions.

Authors: Lars Kiemer; Stefano Costa; Marius Ueffing; Gianni Cesareni
Journal: Proteomics Date: 2007-03 Impact factor: 3.984

3. Using indirect protein-protein interactions for protein complex prediction.

Authors: Hon Nian Chua; Kang Ning; Wing-Kin Sung; Hon Wai Leong; Limsoon Wong
Journal: J Bioinform Comput Biol Date: 2008-06 Impact factor: 1.122

4. A comprehensive two-hybrid analysis to explore the yeast protein interactome.

Authors: T Ito; T Chiba; R Ozawa; M Yoshida; M Hattori; Y Sakaki
Journal: Proc Natl Acad Sci U S A Date: 2001-03-13 Impact factor: 11.205

5. An automated method for finding molecular complexes in large protein interaction networks.

Authors: Gary D Bader; Christopher W V Hogue
Journal: BMC Bioinformatics Date: 2003-01-13 Impact factor: 3.169

6. Development and implementation of an algorithm for detection of protein complexes in large interaction networks.

Authors: Md Altaf-Ul-Amin; Yoko Shinbo; Kenji Mihara; Ken Kurokawa; Shigehiko Kanaya
Journal: BMC Bioinformatics Date: 2006-04-14 Impact factor: 3.169

7. BioGRID: a general repository for interaction datasets.

Authors: Chris Stark; Bobby-Joe Breitkreutz; Teresa Reguly; Lorrie Boucher; Ashton Breitkreutz; Mike Tyers
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. A core-attachment based method to detect protein complexes in PPI networks.

Authors: Min Wu; Xiaoli Li; Chee-Keong Kwoh; See-Kiong Ng
Journal: BMC Bioinformatics Date: 2009-06-02 Impact factor: 3.169

9. Prediction of heterodimeric protein complexes from weighted protein-protein interaction networks using novel features and kernel functions.

Authors: Peiying Ruan; Morihiro Hayashida; Osamu Maruyama; Tatsuya Akutsu
Journal: PLoS One Date: 2013-06-11 Impact factor: 3.240