Literature DB >> 24564855

Collective prediction of protein functions from protein-protein interaction networks.

Qingyao Wu, Yunming Ye, Michael K Ng, Shen-Shyang Ho, Ruichao Shi.

Abstract

BACKGROUND: Automated assignment of functions to unknown proteins is one of the most important task in computational biology. The development of experimental methods for genome scale analysis of molecular interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data. Existing techniques for collective classification (CC) usually increase accuracy for network data, wherein instances are interlinked with each other, using a large amount of labeled data for training. However, the labeled data are time-consuming and expensive to obtain. On the other hand, one can easily obtain large amount of unlabeled data. Thus, more sophisticated methods are needed to exploit the unlabeled data to increase prediction accuracy for protein function prediction.
RESULTS: In this paper, we propose an effective Markov chain based CC algorithm (ICAM) to tackle the label deficiency problem in CC for interrelated proteins from PPI networks. Our idea is to model the problem using two distinct Markov chain classifiers to make separate predictions with regard to attribute features from protein data and relational features from relational information. The ICAM learning algorithm combines the results of the two classifiers to compute the ranks of labels to indicate the importance of a set of labels to an instance, and uses an ICA framework to iteratively refine the learning models for improving performance of protein function prediction from PPI networks in the paucity of labeled data.
CONCLUSION: Experimental results on the real-world Yeast protein-protein interaction datasets show that our proposed ICAM method is better than the other ICA-type methods given limited labeled training data. This approach can serve as a valuable tool for the study of protein function prediction from PPI networks.

Entities: Chemical Disease Species

Mesh：

Substances：
Proteins

Year: 2014 PMID： 24564855 PMCID： PMC4015526 DOI： 10.1186/1471-2105-15-S2-S9

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

We have witnessed a revolution in sequencing technologies in last decade. The biological sciences are undergoing an explosion in the amount of genome sequences. There are increasing interests about using computational methods to identify the biological functions of the protein sequences [1], as experimentally determining protein functions is time-consuming and it cannot catch up with the fast growth of newly found proteins [2]. Various studies have applied machine learning methods to protein data from biological experiments to predict the functions for unknown proteins. (e.g. [3,4]). Classical computational approaches for protein function prediction represent each protein as a set of features, and employ machine learning algorithms to automatically predict the protein function based on these features. The most well-established methods [5] are the BLAST [6] approach based on sequence, PROSITE [7] based on sequence motifs, and PFAM [8] based on profile methods. In recent years, the development of experimental methods for genome scale analysis of molecular interaction networks offers new ways to infer protein function in the context of protein-protein interaction (PPI) network, wherein proteins and detected PPIs are represented by nodes and edges, respectively. The basic idea is that the direct interaction partners of a protein are likely to share similar biological functions [9]. Assignment of protein functions using PPI data has also been extensively studied, such as neighborhood counting based method [10], graph theoretic methods [11], hierarchical clustering-based methods [12] and graph clustering methods [13]. Although many efforts have been made in protein function prediction, most of them were based on either sequence similarity that ignores the protein interactions, or PPI information without using attributes derived from the content of protein sequence. The former method often fails to work if a query protein has no or very little sequence similarity to any proteins of known labels, the latter method has similar problem if there are insufficient relevant PPI information. To explicitly use the information of the content of the data and the links information of the PPI network to improve the prediction performance, collective classification (CC) is proposed. It received considerable attentions in the last decade. Various CC algorithms has been proposed in the literature [14], such as the iterative classification algorithm (ICA) [15], Gibbs sampling (Gibbs) [16], and variants of the weighted-vote relational neighbor algorithm (wvRN) [17]. Here, we focus on ICA-type approaches, which consist of a local classifier, such as kNN, to infer the class labels of related instances. The key idea is to construct new relational feature vectors by summarizing the label information from neighborhood nodes, and then use the relational features together with the attribute features derived from the content of data to learn local classifiers for prediction. Figure 1 is an illustration of how ICA proceeds. In Figure 1(a), an attribute-only classifier Minduced from using only the attribute features is first learned to estimate the classes of unlabeled instances. The algorithm then employs an aggregation function to compute the relational features by counting the number of neighbors with respect to each label. Once the features are constructed, a collective classifier, M, is learned using both the attribute features and relational features (Figure 1(b)); The algorithm repeats step c and step d to make new prediction for unlabeled instances (Figure 1(c)), and to update the relational features based on the new generated predictions (Figure 1(d)). The ICA-type of algorithms usually assume a separate training graph with abundant labeled data. However, in many applications such as protein function prediction problems, the number of labeled protein data is actually very limited and very expensive to obtain. In this situation, most data have no connection to labeled data, and supervision knowledge cannot be obtained from the local connections (as illustrated in Figure 1(a)). As a result, the collective classifier Mlearned from these networks may suffer a reduction in the classification performance.

Figure 1

An example of ICA algorithm learning with limited labeled data. (a) initial state, train classifier Mand classify V; (b) Compute relational features X, train classifier M; (c) re-predict V(use M); (d) re-compute relational features X. ICA repeats step c and step d until a fixed iteration number. This paper describes an effective Markov chain based CC algorithm (ICAM) to tackle the label deficiency problem in CC for protein function prediction from PPI networks. Our idea is to model the classifier Mvia the Markov chain with restart. The Markov chain model computes the ranks of labels to indicate the importance of a set of labels to an instance by propagating the label information in a graph constructed from labeled and unlabeled data. The ICAM algorithm further refines the Markov chain model using an ICA framework to generate the possible labels for a given instance. By these techniques, Mcan be learned more effectively. Experiments on the realworld Yeast PPI datasets have demonstrated that our proposed ICAM method improves the classification performance when compared with the ICA-type CC methods. The main contributions of this paper are as follows. • We study the label deficiency problem of collective classification (CC) and show that the protein function prediction problem from PPI networks can be formulated as a CC task. • We extend the ICA-type CC algorithm and propose the ICAM algorithm to leverage the unlabeled portion of the data to improve the classification performance of CC via the Markov chain with restart. • We demonstrate the effectiveness of our proposed ICAM algorithm using the Yeast benchmark datasets. We find that ICAM leads to significant accuracy gains compared to other ICA-type methods when there are limited numbers of labeled data available.

Methods

Preliminaries

Assume that the PPI network data are represented as a graph G(V, E, X, Y, c), where V is a set of nodes, E is a set of edges representing the interactions between the instances. Each instance/node v∈ V is described by an attribute vector x∈ X. Each Y∈ Y is a set of labels for v, and c is the number of possible labels. Assume that we have a set of labeled nodes V⊂ V with known labels Y= {Y∈ V}, and the task is to predict the labels Yfor unlabeled nodes V= V - V. In this paper, we are primarily interested in generating a ranking of possible labels for a given protein such that its correct functions receive higher ranking than the less unlikely one.

The ICAM algorithm

Inspired by the ICA approach, we introduce the ICAM algorithm for collective classification. The algorithm is summarized in Algorithm 1. Similar to the ICA framework, the ICAM algorithm has two parts as follows: bootstrap and iterative inference. The bootstrap part learns an attribute-only classifier Mfrom the known nodes, and uses Mto predict labels for the unknown nodes V(step 1-2). In the iterative inference part, the relational features Xare updated based on the estimated class labels of data (step 4). Specifically, Xof the (i + 1)-th iteration is based on the known and predicted labels from the i-th iteration. Next, the algorithm trains a collective classifier Musing both attribute features Xand relational features Xto compute the labels for unlabeled data. The iterative process stops when the predictions of Mare stabilized or a fixed number of iteration is reached. An important component of the ICA algorithm is to build the relational features that summarizes the relational information, and to construct new feature vectors to train the classifier M. For instance, Neville et al. [15] summarize the labels of neighboring nodes as relational features as illustrated in Figure 1(b), where node "B" has two positive neighboring nodes and two negative neighboring nodes. Here, the relational features is "¡2, 2¿", and then "¡2, 2¿" is appended onto the original feature vector, , as new features, " ". ICA-type CC methods usually increase accuracy for network data using a large amount of labeled data to train M. In this scenario, the supervision knowledge can be effectively propagated in the network and improve the learning accuracy [18]. However, the labeled data are time-consuming to obtain and the number of labeled data is very limited. Most of the nodes may not link to the labeled nodes, as illustrated in Figure 1(a). As a result, the prediction accuracy of the collective classifier Mwill be decreased greatly. Algorithm 1 ICAM (V, E, X, Y, n) Input: V = nodes, E = edges, X= attribute feature vectors, Y= labels of known nodes, n = # of iterations, Procedure: 1: M= learnClassifer(X, Y); 2: 3: for t = 0 to n do 4: X= aggregation(V, E, Y∪ Y); 5: Re-train M= learnClassifer(X, X, Y); 6: 7: end for 8: return Y In our ICAM algorithm, we assume that the attribute features xand the relational features rare conditionally independent given the class label Y[19]. We then use two distinct classifiers to make two separate predictions for attribute features Xand relational features X. The prediction is given as follows: where γ is a constant independent of Y. The attribute classifier to estimate p(Y) is referred to as M, and the relational classifier to estimate p(Y) is referred to as M. There are two main advantages of this prediction method. First, this method allows us to train classifiers Mand Mfor attribute features Xand relational features Xin parallel. Second, in the collective inference process, the classifier Mcan be re-trained in each iteration based on the re-constructed relational features Xto improve the prediction accuracy of the collective classifier M. Various traditional supervised learning methods can be used to train Mand Mwhere the classifier, such as kNN, naive bayes and logistic regression [16,20], is learned from a separate training data with a large amount of labeled data. However, when dealing with label deficiency problem in PPI networks, we propose to use transductive learning method for acquiring additional information from unlabeled data to improve the classification performance. Specifically, we set up Markov chain based learning models to estimate p(Y) and p(Y).

Markov chain based learning

The Markov chain based learning model is based on the idea of random walk with restart. We note that there are many learning tasks using random walk techniques such as protein network cluster discovery [21], community discovery [22], multi-instance multi-label learning [23], and transfer learning [24]. The idea of random walk with restart is to consider a random walker that starts from labeled nodes, and iteratively transmits to its neighborhood with probabilities proportional to their edge weights. At each step, it has a probability α(0 < α <1) to return back to labeled node. The steady-state probability that the walker will finally stay at node j is the relevance score of node j with respect to the labeled nodes [25]: where u = [u] is the steady-state probability of relevance scores of different nodes, P is the affinity matrix associated with the instances in Markov chain transition probability graph, and q is the label distribution vector containing the elements of labeled instances being 1 and 0 for others. Here, the steady-state probability (relevance score of the instances) captures the global structure of the graph and relationship between the nodes. The advantage of this random walk procedure is that it converges to a unique solution for any initial u(0). The process converges fast, needing just a few iterations. The random walk and related methods have been shown to have good performance on the learning tasks mentioned above. In the following, we introduce the learning of Mvia the Markov chain with restart using all the instances (both labeled and unlabeled). The process of learning Mis similar. Given the constructed attribute feature vector x∈ Xfor a node v∈ V, pairwise affinity between the nodes based on relational information are computed using the Gaussian kernel function as follows where ||xis the Euclidean distance between the i-th feature vector and the j-th feature vector in X. The parameter σ is a positive number to control the linkage in the manifold [26]. The m-by-m matrix A, with its (i, j)-th entry given by a, is always nonnegative. Similar to (1), using the Gaussian kernel to r∈ Xleads to the affinity matrix R for relational features. We then set up Markov chain models for classifiers Mand Mbased on A and R, respectively. For the classifier M, we construct an m-by-m Markov transition probability matrix P by normalizing the entries of A with respect to each column, i.e., each column sum of P is equal to one, . For such P, we model the probabilities of visiting the other instances from the current instance in a Markov chain transition probability graph. We construct a transition probability graph, all the labeled and unlabeled instances are linked together. Intuitively, a random walker starts from nodes with known label to propagate labels among labeled instances to the other unlabeled instances. The walker iteratively visits its neighborhood of nodes with the transition probability graph based on A. Next we use the idea in topic-sensitive PageRank [27] as a Markov chain with restart [25] to solve the learning problem. The random walker has a probability of α to return to labeled instances at each step. It can be interpreted that during each iteration each instance receives the label information from its neighbors via the random walk, and also retains its initial label information. The parameter αspecifies the relative amount of the information from its neighbors and its initial label information. Using this approach, we compute the steady state probabilities that the random walker finally stay at different instances. These steady state probabilities give ranking of labels to indicate the importance of a set of labels to an unlabeled instance. More formally, we adopt the following equation: to compute the steady probabilities U = [u, u, ⋯, u] (m-by-c matrix) according to P and Q = [q, q, ⋯ , q] (m-by-c matrix) which is the assigned probability distribution vector of the class labels that are constructed from the labeled data. The restart parameter 0 < α <1 controls the importance of the assigned probability distributions in the labeled data to the resulting label ranking scores of instances. Given the training data, one simple way to construct qis using a uniform distribution on the instances with the label class d. The summations of the entries of qis equal to 1. More precisely, where lis the number of instances with the label class d in the training data. The steady probability distribution vector U is solved by the iteration method with an initial matrix U0 where each column is a probability distribution vector. The overall algorithm is summarized in Algorithm 2. Algorithm 2 Markov Chain based Classifer Input: P, Q and U0, α, and the tolerance ϵ Output:the steady probability distribution matrix U Procedure: 1: Set t = 1; 2: Compute U= (1 - α)PU+ αQ; 3: If ||U- U|| < ϵ, then stop, set U = U; otherwise set t = t + 1 and goto Step 2.

Experimental results

In this section, we compare the performance of ICAM algorithm with other ICA-type collective classification algorithms: ICA, Gibbs and ICML. We show that the proposed algorithm outperforms these algorithms given limited number of labeled training data.

KDD Cup 2001 data and baselines

The first experiment is conducted for Yeast gene function prediction from KDD Cup 2001 [28]. The dataset includes 1,243 genes and 1,806 interactions among the pair of genes encoding the proteins physically interact with one another. These interaction relationships are symmetric. The protein functions are autocorrelated in this dataset and a subset of the data have been withheld for testing. The task is to predict the functions of the proteins encoded by the genes. There are 14 functions and a protein can have one (or several) function(s). We compare our proposed method with the following three baseline learners: 1. ICA. The Iterative Classification Algorithm (ICA) algorithm proposed by Neville et al. [15] is one of the simplest and most popular CC methods that is frequently used as baseline for CC evaluation in previous studies. For multi-label problem, we transform it into multiple single-label prediction problems using one-against-all strategy and employ ICA to make prediction for each single-label problem. 2. Gibbs. This baseline is another ICA-type CC algorithm using the ICA iterative classification framework. In each iteration, Gibbs re-samples the label of each node based on the estimated label distribution [16]. We also use one against-all strategy to convert the multi-label problem into multiple single-label problems for the Gibbs algorithm. 3. ICML. This baseline is a multi-label CC algorithm proposed by Kongetal. [29]. ICML extends the ICA algorithm to multi-label problems by considering dependencies among the label set in the iteration process. In the experiments, we use kNN as node classifier for ICA, Gibbs and ICML. The parameter k was automatically selected in the range of 10 to 30 at an increment of 5 using 3-fold cross validation on the training set. For the proposed ICAM method, we learn the classifiers Mand Musing Markov chain based models to perform separate predictions. We set the value of α in the Markov chain model to be 0.95 as suggested in [23].

Evaluation criteria

We evaluate the performance of our proposed method with four multi-label evaluation measures: average precision, coverage, ranking loss, and one-error. They are commonly used for multi-label learning algorithm evaluation. Given a multi-label dataset D = {(x, Y)|1 ≤ i ≤ m}, where is an instance and is the true labels of x, and Y= (Y, Y, ..., Y) ∈ {0, 1}. Here xbelongs to the j-th label when Y= 1, otherwise Y= 0, and c is the number of possible labels. The evaluation measures are defined using the following two outputs provided by the learning algorithms: s(x, l) returns a real-value that indicates the confidence for the class label l to be a proper label of x; rank(x, l) returns the ranks of class label l derived from s(x, l). Coverage [30] evaluates how far we need, on the average, to go down on the list of labels in order to cover all the true labels of an instance: Ranking loss [30] evaluates the average fraction of label pairs that are reverse ordered for the instance: where . One-error [30] evaluates how many times the top-ranked label is not in the set of true labels of the instance. Define a classifier H that assigns a single label to an instance xby , then the one-error is Average precision [30] evaluates the average fraction of labels ranked above a particular label l ∈ Y in Y: where . The smaller the value of coverage ranking loss and one-error, the better the performance. As for average precision, the bigger the value the better the performance, we report the results of 1-average precision. Thus, for all evaluation metrics, the smaller the value the better the performance.

Results on KDD Cup 2001 data

In this experiment, we test the performance of our proposed ICAM algorithm on the KDD Cup 2001 dataset. We randomly select 50% of data as training set, and use the remaining 50% of data as test set. The experiment is conducted 10 times by randomly selected training/test split (each with a different random seed), and we report the results of mean as well as standard deviation of each compared algorithm. The mean as well as standard deviation of each compared method over the same 10 trails are reported. Table 1 shows the performance of each compared method on the Yeast protein dataset. The best performance achieved among different compared algorithms is marked in bolded. The results show the competitiveness of the ICAM method against other learning methods. Difference evaluation measures for the learning performance are used in the experiments. One algorithm rarely outperforms another algorithm in all criteria. For example, a method that is optimal for instance ranking loss usually does not perform well in coverage or one-error [31]. In the experiments, we find that ICAM is able to produce better results across all evaluation metrics. These results are impressive and imply that the ICAM algorithm is a good collective classification method for protein function prediction.

Table 1

The performance (mean ± standard deviation) of compared algorithms on the Yeast protein dataset.

Methods	Coverage	Ranking Loss	One-error	1-Average Precision
ICA	4.217 ± 0.273	0.140 ± 0.013	0.042 ± 0.005	0.155 ± 0.005
Gibbs	4.319 ± 0.195	0.148 ± 0.008	0.043 ± 0.005	0.154 ± 0.006
ICML	4.409 ± 0.091	0.153 ± 0.006	0.043 ± 0.007	0.162 ± 0.006
ICAM	3.748 ± 0.164	0.100 ± 0.008	0.041 ± 0.005	0.151 ± 0.005

The performance (mean ± standard deviation) of compared algorithms on the Yeast protein dataset. We also test the performance of different comparable algorithms with different number of labeled instances ranging from 200 to 1000 with an increment of 200. For example, we randomly pick up 200 instances as training data. The remaining data is used for testing. The experiment is conducted 10 times by randomly selecting training/test split. We report the results of mean as well as standard deviation of each compared algorithms. Figure 2 shows the performance of ICAM and other learning methods with respect to different number of labeled instances.

Figure 2

The performance of different algorithms on the Yeast protein dataset with varying number of labeled instances.

The performance of different algorithms on the Yeast protein dataset with varying number of labeled instances. We can see from the figure that ICAM (the black line) has the best performance in general. ICAM outperforms other algorithms using different number of training data, especially when the size of training data is small. Specifically, ICAM achieves coverage improvement of 0.4916 over the second best method Gibbs (ICAM:4.2213 versus Gibbs:4.7129) and achieves 0.0439 improvement on ranking loss (ICAM:0.1184 versus ICML:0.1623) when the number of training instance is 200. As the size of training data increases, ICAM consistently achieves better performance than other learning algorithms across all evaluation criteria. We find that ICAM outperforms the other ICA-type methods substantially in terms of coverage. On the other hand, ICAM only slightly outperforms other methods in terms of one-error. We note that one-error and coverage are two different quantitative measures. One-error evaluates how many times the top-ranked label is not in the set of possible labels. Thus, if the goal of a prediction system is to assign a single function to a protein (single-label classification), the one-error is identical to test error. Whilst coverage measures how far we need, on the average, to go down on the list of the labels in order to cover all the possible labels assigned to a protein. Coverage is loosely related to precision at the level of perfect recall [30]. The experimental results indicate that the top-rank label predictions from other ICA-type methods are as accurate as those from ICAM, but the predictions from ICAM are more complete than other ICA-type methods. A reasonable explanation for this finding is that the ICA-type methods focused on the single-label setting. In this case, the multi-label problem is first transformed into multiple single-label prediction problems, and then the ICA-type methods use independent classifiers induced from labeled training data for each problem. Nevertheless, ICA-type approaches ignore the effect of unlabeled data and the interdependence of the protein functions. On the other hand, our proposed ICAM approach is based on Markov chain based transductive learning method that uses both label and unlabeled data for label propagation. The Markov chain based method takes the correlation of the classes into account to effectively compute ranking of labels to an instance. Therefore, ICAM provides an opportunity to leverage the individual ICA-type classifiers to achieve higher coverage of predictions.

Results on KDD Cup 2002 data

To validate the effectiveness of the proposed method when there are only a limited number of positive labeled training data, we conduct additional experiments on a relatively large scale Yeast dataset from KDD Cup 2002. It consists of 4507 instances (i.e., genes) from experiments with a set of cerevisiae (Yeast) strains. Each instance is described by various types of information that characterize the gene associated with the instance. The data sources for describing the instances include abstracts from the scientific literature (MEDLINE), gene localization and functions. We represent each instance by a feature vector with 20545 dimensions. The pairs of genes whose encoded proteins physically interact with one another. Such protein-protein interaction network consists of 1218 links. Each instance is labeled with one of three class labels "nc", "control" and "change". The "change" label indicates instances in which the activity of the hidden system was significantly changed, but the activity of the control system was not significantly changed. The goal of the KDD Cup 2002 task is to learn a model that can accurately predict the genes that affect the hidden system but not the control system. In this case, the positive class consists of those genes with "change" labels and the negative class consists of those genes with either "nc" or the "control" label. This partition is highly imbalanced. The rate of positive instances is only 1.2%. Therefore, we base our evaluation analysis on Receiver Operating Characteristic (ROC) curves, which reflect the true positive rate of a classifier as a function of its false positive rate. ROC curves are commonly used for evaluating highly skewed binary classification problems. Recent study has shown that ROC curves have a deep connection to the precision-recall (PR) curves [32]. To evaluate the performance of our ICAM algorithm, we compare it with the linear kernel SVM method that implemented by LIBSVM [33]. Figure 3 shows the results of ROC curves on the KDD Cup 2002 task for ICAM and SVM. The x-axis and y-axis of the figure refer to the false positive rate and true positive rate respectively. We see from the figure that our ICAM (the red curve), outperforms the SVM method (the blue curve) in general. ICAM achieves improvement of 10.0% (0.713 versus 0.613) on area under the ROC curves. The experimental results imply that the proposed ICAM method is able to deliver better performance in the situation of small positive labeled data size.

Figure 3

ROC curve of baseline SVM and our ICAM method.

Experiments on collaboration networks

In this section, we compare the performance of the proposed ICAM algorithm with other collective classification algorithms on 2 multi-label collaboration networks datasets to validate the effectiveness of the proposed method more thoroughly. These collaboration networks datasets are collected from the DBLP computer science bibliography website, and used in prior work to study the multi-label collective classification problems [29]. Their characteristics are listed in Table 2. Specifically, we extract DBLP coauthorship networks that contain authors who have published papers during the years 2000-2010 as the nodes of the networks, and link any two authors who have collaborated with each other. At each node, we extract a bag-of-words representation of all the paper titles published by the author, and used it as the attributes of the node. Each author has one (or multiple) research topic(s) of interests from 6 research areas. The representative conferences from each area are selected as class labels. If an author has published papers in any of these conferences, we assume the author is interested in the corresponding research class. The task is to classify each author with a set of multiple research classes of interest. The conferences corresponding to the class labels of two datasets (DBLP-A and DBLP-B) are given as follows.

Table 2

The description of experimental datasets used in the experiments on collaboration networks.

Datasets	Number of Instances	Number of Attributes	Number of Links	Number of Classes
DBLP-A	23,806	12,588	150,042	6
DBLP-B	16,020	8,595	95,108	6

The classes of DBLP-A are as follows: 1 Database: ICDE, VLDB, SIGMOD, PODS, EDBT 2 Data Mining: KDD, ICDM, SDM, PKDD, PAKDD 3 Artificial Intelligence: IJCAI, AAAI 4 Information Retrieval: SIGIR, ECIR 5 Computer Vision: CVPR 6 Machine Learning: ICML, ECML The classes of DBLP-B are as follows: 1 Algorithms & Theory: STOC, FOCS, SODA, COLT 2 Natural Language Processing: ACL, ANLP, COLING 3 Bioinformatics: ISMB, RECOMB 4 Networking: SIGCOMM, MOBICOM, INFOCOM 5 Operating Systems: SOSP, OSDI 6 Parallel Computing: POD, ICS We test ICAM and other ICA-type algorithms with different number of labeled instances from 1000 to 5000 with an increment of 500. The average results as well as standard deviation of a 10-time data split are given in Figure 4. The experimental results are in concordant with our previous study. We observe that ICAM consistently outperforms the other ICA-type methods on these two datasets, especially when there are only limited number of labeled instances, i.e. larger(smaller) improvement is obtained with less(more) labeled data.

Figure 4

The coverage performance of different algorithms with varying number of labeled instances: (a) DBLP-A dataset; (b) DBLP-B dataset.

The description of experimental datasets used in the experiments on collaboration networks. The coverage performance of different algorithms with varying number of labeled instances: (a) DBLP-A dataset; (b) DBLP-B dataset.

Conclusion

In this paper, we studied the label deficiency problem in collective classification (CC). We showed the protein function prediction problem from PPI networks can be formulated as a problem, and proposed an effective and novel Markov chain based CC learning algorithm, namely ICAM. It focuses on how to use labeled and unlabeled data to enhance the classification performance of PPI network data. Experimental results on two real-world Yeast PPI network datasets and two collaboration network datasets showed that our proposed ICAM method is effective in learning CC tasks in the paucity of labeled data. In future, we will consider other semi-supervised learning techniques for collective classification in PPI network data and we will also research on other complex biological networks, such as heterogeneous network classification.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Q. Wu, M.K. Ng. and Y. Ye participated in designing the algorithm and drafted the manuscript. Q. Wu and R. Shi performed the implementations. S.S. Ho revised and finalized the paper. All authors read and approved the final manuscript.

12 in total

Review 1. Protein function in the post-genomic era.

Authors: D Eisenberg; E M Marcotte; I Xenarios; T O Yeates
Journal: Nature Date: 2000-06-15 Impact factor: 49.962

2. Protein function prediction via graph kernels.

Authors: Karsten M Borgwardt; Cheng Soon Ong; Stefan Schönauer; S V N Vishwanathan; Alex J Smola; Hans-Peter Kriegel
Journal: Bioinformatics Date: 2005-06 Impact factor: 6.937

3. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps.

Authors: Elena Nabieva; Kam Jim; Amit Agarwal; Bernard Chazelle; Mona Singh
Journal: Bioinformatics Date: 2005-06 Impact factor: 6.937

4. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions.

Authors: Hon Nian Chua; Wing-Kin Sung; Limsoon Wong
Journal: Bioinformatics Date: 2006-04-21 Impact factor: 6.937

5. CFinder: locating cliques and overlapping modules in biological networks.

Authors: Balázs Adamcsek; Gergely Palla; Illés J Farkas; Imre Derényi; Tamás Vicsek
Journal: Bioinformatics Date: 2006-02-10 Impact factor: 6.937

Review 6. An overview of in silico protein function prediction.

Authors: Roy D Sleator; Paul Walsh
Journal: Arch Microbiol Date: 2010-02-03 Impact factor: 2.552

7. Predicting gene function in Saccharomyces cerevisiae.

Authors: A Clare; R D King
Journal: Bioinformatics Date: 2003-10 Impact factor: 6.937

Review 8. Network-based prediction of protein function.

Authors: Roded Sharan; Igor Ulitsky; Ron Shamir
Journal: Mol Syst Biol Date: 2007-03-13 Impact factor: 11.429

9. RRW: repeated random walks on genome-scale protein networks for local cluster discovery.

Authors: Kathy Macropol; Tolga Can; Ambuj K Singh
Journal: BMC Bioinformatics Date: 2009-09-09 Impact factor: 3.169

10. Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network.

Authors: Christine Brun; François Chevenet; David Martin; Jérôme Wojcik; Alain Guénoche; Bernard Jacq
Journal: Genome Biol Date: 2003-12-15 Impact factor: 13.583

6 in total

1. Protein functional properties prediction in sparsely-label PPI networks through regularized non-negative matrix factorization.

Authors: Qingyao Wu; Zhenyu Wang; Chunshan Li; Yunming Ye; Yueping Li; Ning Sun
Journal: BMC Syst Biol Date: 2015-01-21

2. Semi-supervised multi-label collective classification ensemble for functional genomics.

Authors: Qingyao Wu; Yunming Ye; Shen-Shyang Ho; Shuigeng Zhou
Journal: BMC Genomics Date: 2014-12-08 Impact factor: 3.969

3. Construction and analysis of gene-gene dynamics influence networks based on a Boolean model.

Authors: Maulida Mazaya; Hung-Cuong Trinh; Yung-Keun Kwon
Journal: BMC Syst Biol Date: 2017-12-21

4. An improved discriminative filter bank selection approach for motor imagery EEG signal classification using mutual information.

Authors: Shiu Kumar; Alok Sharma; Tatsuhiko Tsunoda
Journal: BMC Bioinformatics Date: 2017-12-28 Impact factor: 3.169

5. Protein Function Prediction Based on PPI Networks: Network Reconstruction vs Edge Enrichment.

Authors: Jiaogen Zhou; Wei Xiong; Yang Wang; Jihong Guan
Journal: Front Genet Date: 2021-12-14 Impact factor: 4.599

6. Functional Annotations of Paralogs: A Blessing and a Curse.

Authors: Rémi Zallot; Katherine J Harrison; Bryan Kolaczkowski; Valérie de Crécy-Lagard
Journal: Life (Basel) Date: 2016-09-08

6 in total