Incorporating topological information for predicting robust cancer subnetwork markers in human protein-protein interaction network.

Navadon Khunlertgit, Byung-Jun Yoon.

Abstract

BACKGROUND: Discovering robust markers for cancer prognosis based on gene expression data is an important yet challenging problem in translational bioinformatics. By integrating additional information from biological pathways or a protein-protein interaction (PPI) network, we can find better biomarkers that lead to more accurate and reproducible prognostic predictions. In fact, recent studies have shown that "modular markers" that integrate multiple genes with potential interactions can improve disease classification and also provide a better understanding of the disease mechanisms.
RESULTS: In this work, we propose a novel algorithm for finding robust and effective subnetwork markers that can accurately predict cancer prognosis. To simultaneously discover multiple synergistic subnetwork markers in a human PPI network, we build on our previous work that uses affinity propagation, an efficient clustering algorithm based on a message-passing scheme. Using affinity propagation, we identify potential subnetwork markers that consist of discriminative genes that display coherent expression patterns and whose protein products are closely located on the PPI network. Furthermore, we incorporate the topological information from the PPI network to evaluate the potential of a given set of proteins to be involved in a functional module. Specifically, we adopt two widely made assumptions: that densely connected subnetworks are likely to be potential functional modules, and that proteins that are not directly connected but interact with similar sets of other proteins may share similar functionalities.
CONCLUSIONS: Incorporating topological attributes based on these assumptions can enhance the prediction of potential subnetwork markers. We evaluate the performance of the proposed subnetwork marker identification method by performing classification experiments using multiple independent breast cancer gene expression datasets and PPI networks. We show that our method leads to the discovery of robust subnetwork markers that can improve cancer classification.

Keywords:  Cancer classification; Message passing algorithm; Protein-protein interaction network; Subnetwork marker identification; Topological information

Year:  2016        PMID: 27766944      PMCID: PMC5073942          DOI: 10.1186/s12859-016-1224-1

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Introduction

In this work, we focus on one of the problems in translational genomics: the identification of biomarkers from microarray gene expression data to classify the type or state of a complex disease. This problem is challenging in practice because it typically involves 1) a small sample size of clinical data, 2) a large number of potential markers, and 3) heterogeneity across patients and samples. Several studies have identified gene markers selected solely on the basis of gene expression data, and these markers have been shown to be useful for building classifiers for disease prediction. However, such gene-based markers have limitations. For example, two large-scale studies of breast cancer metastasis [1, 2] each sought gene markers for estimating the risk of metastasis. Each identified around 70 gene markers with 60–70 % accuracy, yet the two marker sets shared only 3 genes out of the 55 genes that could have been shared across the two platforms [3]. These gene-based markers also yielded low performance in cross-dataset experiments. Subsequently, many methods have been proposed to improve the prediction accuracy and reproducibility of the identified biomarkers. Because cancer is a complex disease whose progression involves dysregulation of multiple genetic processes, an alternative approach assumes that genes known to be in common pathways [4-8], or genes whose protein products are functionally related in protein-protein interaction (PPI) networks [9-11], should be interpreted together as a single feature. This approach analyzes gene expression data at the “modular” level by integrating biological information, such as known molecular pathways or PPI networks. Many studies have shown that this “integrative approach” tends to be more robust than single gene markers and may improve classification accuracy.
This approach has prompted several studies on how to effectively integrate the expression of genes that belong to the same module. Proposed ideas include using the mean, median, sum, or difference of the expression levels of the genes in a module as the modular activity. PPI networks have been shown to overcome the limited amount of known pathway information. Chuang et al. [9], in one of the first studies in this field, proposed a greedy search algorithm for finding discriminative subnetwork markers. Su et al. [10] proposed a dynamic programming method that identifies paths containing differentially expressed and coexpressed genes and greedily combines them to obtain subnetwork markers for predicting breast cancer metastasis. More recently, in our previous work [11], we used a message-passing clustering algorithm to identify subnetwork markers that yield high-accuracy disease prediction. The method can simultaneously predict multiple non-overlapping subnetwork markers, which may cover more genes at a lower computational cost than existing methods. Given these advantages, we adopt our previous message-passing based approach while incorporating the topological information from the PPI network to identify potential functional modules, or subnetworks. We start from two widely made assumptions: that densely connected subnetworks are likely to be potential functional modules, and that proteins that are not directly connected but interact with similar sets of other proteins may share similar functionalities. We employ association indices to estimate the topological information. Association indices have been shown to be powerful tools for measuring similarity between genes [12]. For example, the Jaccard index has been successfully used to measure neighborhood similarity for clustering and constructing Power Graphs in the work of Royer et al. [13].
In this paper, we propose a novel method for incorporating PPI network topological information to enhance the identification of subnetwork markers for predicting cancer prognosis. We use various association coefficients to estimate the topological similarity and apply different approaches to integrate it into our previous message-passing based method. We assess the identified subnetwork markers and evaluate their discriminative power and classification performance through experiments using publicly available independent breast cancer gene expression datasets and PPI networks.

Materials and methods

Datasets

In this study, we obtained two independent breast cancer microarray gene expression datasets from the public domain, which we refer to as GSE2034 [2] and NKI295 [14]. GSE2034 was profiled on the Affymetrix U133a platform (GPL96) and downloaded from the Gene Expression Omnibus (GEO) database [15]. NKI295 was profiled on the Agilent Hu25K platform and downloaded from the supplementary information of Chang et al. [16]. We used both datasets as published by their original studies. GSE2034 contains expression profiles of 286 breast cancer patients, and NKI295 contains expression profiles of 295 patients. For 108 patients in GSE2034 and 78 patients in NKI295, metastasis had been detected within 5 years of surgery. We labeled these patients as “metastatic” and the remainder as “non-metastatic”. Four publicly available human PPI networks were used in this study, which we refer to as Chuang, HPRD, GASOLINE, and BioGRID. Chuang was obtained from a previous study by Chuang et al. [9]. HPRD was downloaded from the Human Protein Reference Database Release 9 [17]. GASOLINE was obtained from the work of Micale et al. [18]; it was derived from the STRING database [19], considering only experimentally verified protein interactions. BioGRID was downloaded from the Biological General Repository for Interaction Datasets version 3.4.134 (Homo sapiens) [20]. We did not combine the PPI networks because they were compiled based on different criteria and domains of interest. Table 1 shows the number of unique proteins and interactions for each PPI network. BioGRID contains the largest number of interactions, while HPRD contains the largest number of proteins.
Table 1

The number of proteins and interactions for each PPI network

PPI network    Number of unique proteins    Number of interactions
Chuang         11,203                       57,235
HPRD           30,047                       41,327
GASOLINE       9556                         53,859
BioGRID        20,364                       315,507

We overlaid the gene expression datasets with each PPI network by mapping each gene to its corresponding protein in the network. After removing the proteins that do not have corresponding genes in both gene expression datasets, we obtained induced networks with the statistics shown in Table 2. After data integration, the numbers of proteins are similar across the networks. BioGRID still contains the largest number of interactions, while the other three networks contain roughly similar numbers (about 19,000 to 27,000).
Table 2

The number of proteins and interactions for each induced PPI network

PPI network    Number of unique proteins    Number of interactions
Chuang         5293                         26,773
HPRD           4762                         18,684
GASOLINE       4277                         22,253
BioGRID        5697                         99,426
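
The overlay step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that proteins and genes share one identifier space (for example, gene symbols) and that interactions are undirected. The function name `induce_network` and the toy identifiers are hypothetical.

```python
def induce_network(edges, genes_a, genes_b):
    """Restrict a PPI network to proteins whose genes appear in both datasets.

    edges: iterable of (protein_u, protein_v) pairs (assumed undirected).
    genes_a, genes_b: gene identifiers measured in each expression dataset.
    Returns (proteins, interactions) of the induced network.
    """
    edges = list(edges)
    shared = set(genes_a) & set(genes_b)
    network_proteins = {p for edge in edges for p in edge}
    proteins = network_proteins & shared
    # Keep an interaction only if both endpoints survive; frozenset collapses
    # duplicates such as (A, B) and (B, A) into one undirected interaction.
    interactions = {frozenset((u, v)) for u, v in edges
                    if u in proteins and v in proteins}
    return proteins, interactions


toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")]
proteins, interactions = induce_network(
    toy_edges, {"A", "B", "C"}, {"A", "B", "C", "D"})
print(len(proteins), len(interactions))  # 3 proteins, 2 interactions
```

In the toy example, protein D is dropped because its gene is missing from the first dataset, which also removes the C-D interaction.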

Affinity propagation-based subnetwork identification

We adopt the subnetwork identification procedure from our previous study [11], where we used a message-passing clustering algorithm, called affinity propagation, to cluster genes whose protein products interact with each other or are closely located in the PPI network. The input of this clustering algorithm is a measure of similarity between genes. We originally defined the similarity between genes i and k based entirely on their discriminative power to distinguish between the two class labels (Eq. 1). This definition is built from three statistics: $t_i$ and $t_k$ are the t-test statistics of the log-likelihood ratio (LLR) between metastatic and non-metastatic samples for genes i and k, respectively, and $t_{ik}$ is the t-test score of the sum of the LLRs of genes i and k. The LLR of gene i, $\lambda_i(x_i)$, is based on the probabilistic inference strategy proposed in [7] and is computed by

$$\lambda_i(x_i) = \log \frac{f_i^{1}(x_i)}{f_i^{2}(x_i)} \tag{2}$$

where $x_i$ is the expression level of gene i and $f_i^{j}(x_i)$ is the conditional Gaussian probability density function of $x_i$ under phenotype j (j = 1, 2). The last term of Eq. 1 is a penalty term measured by the difference between the discriminative power of the two genes under consideration. The parameter α, defined on [0,1], controls this term. Our previous work [11] shows that the size of the identified subnetworks decreases as α gets larger, because a larger α tends to cluster only genes with similar discriminative power and therefore yields small subnetworks with fewer genes. Eq. 1 is based on the original assumptions that, when considering the similarity between two genes, each gene itself should have high discriminative power, combining both genes as a subnetwork should increase the overall discriminative power, and both genes should have similar discriminative power.
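
To make the LLR and its t-test score concrete, the following sketch fits one Gaussian per phenotype to a gene's expression values, evaluates the log-likelihood ratio, and scores how well the LLR separates the two classes. It rests on assumptions not stated in the text: phenotype 1 is taken to be the metastatic class, the standard deviation is the sample (n-1) estimate, and the t-test is Welch's two-sample statistic. The names `fit_llr` and `welch_t` and the toy values are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of a Gaussian N(mu, sigma^2) evaluated at x."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))


def fit_llr(class1_values, class2_values):
    """Return a function x -> log f1(x) - log f2(x) for one gene."""
    mu1, sd1 = mean(class1_values), stdev(class1_values)
    mu2, sd2 = mean(class2_values), stdev(class2_values)

    def llr(x):
        return gaussian_logpdf(x, mu1, sd1) - gaussian_logpdf(x, mu2, sd2)

    return llr


def welch_t(a, b):
    """Welch's two-sample t statistic (an assumed choice of t-test)."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))


# Toy expression values of one gene in each class.
metastatic = [2.0, 3.0, 4.0]
non_metastatic = [-1.0, 0.0, 1.0]

llr = fit_llr(metastatic, non_metastatic)
t_i = welch_t([llr(x) for x in metastatic],
              [llr(x) for x in non_metastatic])
print(round(t_i, 3))
```

A positive LLR means the observed expression level is more likely under phenotype 1 than under phenotype 2; the t-test score then summarizes how consistently the LLR differs between the two groups of samples.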

Incorporating topological information for computing the similarity between genes

Under the assumption that the proteins corresponding to the genes in the same subnetwork should have common topological attributes, we consider the following two points: 1) Densely connected subnetworks are likely to be potential functional modules. 2) Proteins that are not directly connected but interact with similar sets of other proteins may share similar functionalities. Based on these considerations, we incorporate the topological information of proteins in the PPI network by measuring their association coefficient, or topological similarity. We measure this topological attribute using different types of association coefficients. Let $N_i$ and $N_k$ be the neighborhood binary vectors of proteins i and k. We define the topological similarity between proteins i and k, $s_t(i,k)$, based on different similarity indices as follows:

Jaccard index: We define the topological similarity, $s_t^{J}(i,k)$, as

$$s_t^{J}(i,k) = \frac{|N_i \cap N_k|}{|N_i \cup N_k|} \tag{3}$$

The Jaccard index is widely used to quantify similarity.

Kulczyński index: This measure, $s_t^{K}(i,k)$, represents the average proportion of the number of common neighbors to the total number of neighbors of each protein. It is given by

$$s_t^{K}(i,k) = \frac{1}{2}\left(\frac{|N_i \cap N_k|}{|N_i|} + \frac{|N_i \cap N_k|}{|N_k|}\right) \tag{4}$$

Tversky index: We define the topological similarity based on the Tversky index, $s_t^{T}(i,k)$, as

$$s_t^{T}(i,k) = \frac{|N_i \cap N_k|}{|N_i \cap N_k| + \gamma\,|N_i \setminus N_k| + \delta\,|N_k \setminus N_i|} \tag{5}$$

In order to indicate the direction of similarity (asymmetric similarity), we let $\gamma = 1$ and $\delta = 0$. This asymmetric definition lets the exemplars of the identified clusters be more densely connected than the non-exemplars. We can rewrite the equation as follows:

$$s_t^{T}(i,k) = \frac{|N_i \cap N_k|}{|N_i|} \tag{6}$$

The Tversky index can be viewed as a general form of the Tanimoto coefficient (Jaccard index) when $\gamma = 1$ and $\delta = 1$, and of the Dice coefficient when $\gamma = 0.5$ and $\delta = 0.5$. We do not include other similarity indices whose results are in the same order (no alteration in the ranks) because they give the same output when applying affinity propagation. For example, the Dice coefficient, $2|N_i \cap N_k| / (|N_i| + |N_k|)$, and the Jaccard index share similar results in terms of ranking. The Ochiai index (or cosine index), $|N_i \cap N_k| / \sqrt{|N_i|\,|N_k|}$, and the Geometric index, $|N_i \cap N_k|^2 / (|N_i|\,|N_k|)$, provide the same ranks as the Kulczyński index.
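
The three association indices can be illustrated with a short sketch that represents each protein's neighborhood as a Python set instead of a binary vector. The function names and the parameter names `gamma` and `delta` for the two Tversky weights are hypothetical, and returning 0 for an empty neighborhood is an assumption made only to avoid division by zero.

```python
def jaccard(n_i, n_k):
    """Shared neighbors divided by all neighbors of either protein."""
    union = n_i | n_k
    return len(n_i & n_k) / len(union) if union else 0.0


def kulczynski(n_i, n_k):
    """Average of the shared-neighbor proportions of the two proteins."""
    if not n_i or not n_k:
        return 0.0
    common = len(n_i & n_k)
    return 0.5 * (common / len(n_i) + common / len(n_k))


def tversky(n_i, n_k, gamma=1.0, delta=0.0):
    """Tversky index; gamma=1, delta=0 gives the asymmetric form."""
    common = len(n_i & n_k)
    denom = common + gamma * len(n_i - n_k) + delta * len(n_k - n_i)
    return common / denom if denom else 0.0


# Protein i shares all of its neighbors with the better-connected protein k.
n_i = {"a", "b"}
n_k = {"a", "b", "c", "d"}

print(jaccard(n_i, n_k))     # 0.5
print(kulczynski(n_i, n_k))  # 0.75
print(tversky(n_i, n_k))     # 1.0
print(tversky(n_k, n_i))     # 0.5
```

The last two lines show the asymmetry: the similarity from i to k is maximal because every neighbor of i is also a neighbor of k, whereas the similarity from k to i is lower. With both weights set to 1 the function reduces to the Jaccard index, and with both set to 0.5 it reduces to the Dice coefficient.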
As we focus on retrieving topological information from the PPI network, we do not make use of the number of common non-neighbor proteins $|\neg N_i \cap \neg N_k|$ in this study. Finally, we add the topological similarity (Eqs. 3, 4 and 6) to the computation of the similarity between genes i and k, $s(i,k)$, in two different ways. In the first, $s(i,k)$ is the product of the topological similarity $s_t(i,k)$ and the discriminative power based similarity $s_d(i,k)$:

$$s(i,k) = s_t(i,k) \cdot s_d(i,k) \tag{7}$$

In the second, $s(i,k)$ is a combination of the topological similarity $s_t(i,k)$ and the discriminative power based similarity $s_d(i,k)$. We first scale $s_d(i,k)$ into the range [0,1], the same range as the topological similarity, by

$$s_d'(i,k) = \frac{s_d(i,k) - \min(S_d)}{\max(S_d) - \min(S_d)} \tag{8}$$

where $S_d$ is the set of the discriminative power based similarities of all gene pairs. Then, we combine them as follows:

$$s(i,k) = \beta\, s_t(i,k) + (1 - \beta)\, s_d'(i,k) \tag{9}$$

where $\beta \in [0,1]$ controls the relative weight of the two similarities. The topological similarity $s_t(i,k)$ has more effect as β increases. Note that when β = 0.5, $s(i,k)$ is proportional to the sum of the topological similarity and the discriminative power based similarity. We use the same setting for the preference as in [11]. The self-similarity is set to $s(k,k) = c$ for all k, where $s(i,k) \le c$ for only 1 % of all gene pairs $(g_i, g_k)$, to guarantee that every gene has an equal chance of being an exemplar at the initial stage of the clustering process.
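
The two integration strategies can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' implementation: similarities are stored in dictionaries keyed by gene pair, the scaling to [0,1] is assumed to be min-max scaling over all pairs, and the names `s_t`, `s_d`, `combine_product`, and `combine_linear` are hypothetical.

```python
def min_max_scale(sim):
    """Scale all similarity values into [0, 1] (assumed min-max scaling)."""
    lo, hi = min(sim.values()), max(sim.values())
    if hi == lo:
        return {pair: 0.0 for pair in sim}
    return {pair: (value - lo) / (hi - lo) for pair, value in sim.items()}


def combine_product(s_t, s_d):
    """Product-based approach: topological times discriminative similarity."""
    return {pair: s_t[pair] * s_d[pair] for pair in s_d}


def combine_linear(s_t, s_d, beta):
    """Linear-combination-based approach with weight beta on topology."""
    scaled = min_max_scale(s_d)
    return {pair: beta * s_t[pair] + (1 - beta) * scaled[pair]
            for pair in s_d}


# Toy similarities for three gene pairs.
s_t = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "c"): 0.0}
s_d = {("a", "b"): 2.0, ("a", "c"): 4.0, ("b", "c"): 6.0}

product = combine_product(s_t, s_d)
linear = combine_linear(s_t, s_d, beta=0.75)
print(product[("a", "b")], linear[("a", "b")])  # 2.0 0.75
```

In the toy data, the pair (a, b) has the lowest discriminative similarity but the highest topological similarity, so with β = 0.75 it receives the largest combined similarity.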

Probabilistic inference of subnetwork activity

To estimate the modular, or subnetwork, activity of an identified subnetwork, we employ the probabilistic inference method proposed in [7], which aggregates the LLRs of all member genes to represent the activity level of the subnetwork marker. For a subnetwork $\mathcal{M}$, the activity level $A_{\mathcal{M}}$ is computed by

$$A_{\mathcal{M}} = \sum_{g_i \in \mathcal{M}} \lambda_i(x_i) \tag{10}$$

where $x_i$ is the expression level of gene $g_i$ in the subnetwork $\mathcal{M}$. This inference method can be viewed as the aggregation of the probabilistic evidence provided by the expression levels of the genes in the subnetwork.
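
A minimal sketch of this aggregation, assuming that a per-gene LLR function is already available for every member gene. The function name `subnetwork_activity`, the gene names, and the simple linear LLR functions are placeholders for illustration only.

```python
def subnetwork_activity(sample, llr_by_gene, members):
    """Sum the LLRs of the member genes for one sample.

    sample: dict gene -> expression level in this sample.
    llr_by_gene: dict gene -> function mapping expression level to LLR.
    members: genes that belong to the subnetwork.
    """
    return sum(llr_by_gene[gene](sample[gene]) for gene in members)


# Placeholder LLR functions; real ones would come from fitted densities.
llr_by_gene = {
    "g1": lambda x: 2.0 * x,
    "g2": lambda x: x - 1.0,
}
sample = {"g1": 1.0, "g2": 3.0, "g3": 9.0}

activity = subnetwork_activity(sample, llr_by_gene, ["g1", "g2"])
print(activity)  # 4.0
```

Gene g3 is measured in the sample but is not a member of the subnetwork, so it does not contribute to the activity level.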

Experimental set-up

We identified subnetwork markers using three different strategies to measure topological similarity, which we refer to as Jaccard-based, Kulczyński-based, and Tversky-based. As mentioned previously, we used two different approaches to integrate the topological similarity into the similarity between genes: 1) the product of the topological and discriminative power based similarities, namely the “product-based approach”, and 2) the linear combination of the topological and discriminative power based similarities, namely the “linear-combination-based approach”. In the latter approach, we used three different values of β (0.25, 0.5, and 0.75) to investigate the impact of the topological similarity on subnetwork identification. Alternatively, the experiments could be set up the other way around to find the optimal value of β for each dataset. After computing the similarity between genes and applying affinity propagation-based subnetwork identification, all output clusters were ranked based on the t-test statistic of their activity level. We then selected the top 50 clusters with the highest discriminative power as the potential subnetwork markers for assessing their classification performance. We repeated this process for both gene expression datasets and all four PPI networks.
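
The ranking and selection step can be sketched as follows, assuming that the activity level of every cluster has already been computed for every sample. The use of Welch's t statistic is an assumption (the text only says "t-test"), and the names `welch_t`, `rank_clusters`, and the toy cluster identifiers are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (an assumed choice of t-test)."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))


def rank_clusters(activity, labels, top_n=50):
    """Return the ids of the top_n clusters by absolute t-score.

    activity: dict cluster_id -> list of activity levels, one per sample.
    labels: list of 1 (metastatic) or 0 (non-metastatic), one per sample.
    """
    scored = []
    for cluster_id, values in activity.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scored.append((abs(welch_t(pos, neg)), cluster_id))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [cluster_id for _, cluster_id in scored[:top_n]]


labels = [1, 1, 1, 0, 0, 0]
activity = {
    "c1": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "c2": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    "c3": [0.0, 1.0, 2.0, 9.0, 10.0, 11.0],
}
print(rank_clusters(activity, labels, top_n=2))  # ['c3', 'c1']
```

Cluster c3 ranks first even though its activity is lower in the metastatic samples, because the ranking uses the absolute value of the t-score.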

Results

For comparison, we also evaluated the methods proposed in [9] and [11], which we refer to as the ‘greedy’ method and the ‘AP-based’ method, respectively. We applied the greedy method with a 5 % minimum required improvement, which is the same setting as originally published in [9]. In the AP-based method, we set the magnitude of the penalty term, α, to 0.5 for the reason shown in [11]: it yields classification performance as high and consistent as that of smaller α, with smaller identified subnetworks. For simplicity in displaying the tables and figures in this section, we abbreviate Jaccard-based, Kulczyński-based, and Tversky-based to jac, kul, and tve, respectively. The suffixes _p and _lc are appended to indicate the product-based approach and the linear-combination-based approach, respectively.

Statistics of the subnetwork markers

Table 3 shows the average size of the top 50 highly discriminative subnetwork markers identified by each method on GSE2034 and NKI295. Each column shows the results for one PPI network. The average size of the markers identified by the product-based and linear-combination-based approaches is of the same order as that of the original AP-based method. The average size of the top markers identified by the proposed method and the AP-based method is clearly larger than that of the greedy method.
Table 3

The average size of top 50 highly discriminative subnetwork markers from GSE2034 and NKI295

Gene expression dataset = GSE2034
Method            Chuang    HPRD     GASOLINE    BioGRID
Greedy            3.1       3.26     3.54        3.66
AP-based          36.28     35.78    34.18       38.78
jac_p             18.06     19.94    19.58       29
kul_p             21.16     25.32    22.48       36.28
tve_p             34.48     45.26    45.98       61.8
jac_lc, β=0.25    18.3      21.36    23.14       34
jac_lc, β=0.5     15.08     15.38    16.44       24.24
jac_lc, β=0.75    13.28     16.34    13.44       19.18
kul_lc, β=0.25    24        30.14    28.68       39.02
kul_lc, β=0.5     18.98     22.86    24.18       38.32
kul_lc, β=0.75    16.06     19.12    20.84       30.98
tve_lc, β=0.25    34.1      46.58    43.44       53.54
tve_lc, β=0.5     28.98     43.8     45.5        71.24
tve_lc, β=0.75    22.92     44.78    46.32       82.66

Gene expression dataset = NKI295
Method            Chuang    HPRD     GASOLINE    BioGRID
Greedy            4.12      3.68     4.46        4.42
AP-based          31.34     30.32    28.78       34.66
jac_p             14.62     16       18.94       27.72
kul_p             12.3      22.5     26.9        33.34
tve_p             28.22     42.24    49.9        57.1
jac_lc, β=0.25    15.14     16.8     19.66       30.06
jac_lc, β=0.5     13.38     12.44    13.68       22.66
jac_lc, β=0.75    11.54     12.88    10.78       17.98
kul_lc, β=0.25    14.8      24.6     27.06       39.26
kul_lc, β=0.5     15.9      18.5     23.28       33.96
kul_lc, β=0.75    13.7      17.12    17.22       27.14
tve_lc, β=0.25    30.76     41.78    48.66       52.44
tve_lc, β=0.5     27.26     41.62    50.7        72.88
tve_lc, β=0.75    18.52     43.22    48.24       81.42

As we can see from Table 3, the average size of the top 50 highly discriminative subnetwork markers increases when a PPI network with a larger number of interactions and unique proteins is used. This trend can be clearly seen when BioGRID is employed. Within the product-based group, the Tversky-based similarity, tve_p, yields larger subnetworks. In the linear-combination-based approach, the average size decreases as β increases in most cases. However, this trend is not distinct for the Tversky-based similarity, tve_lc. The main reason is that the Tversky-based similarity mostly gives a higher similarity value than the other indices, as it is designed to indicate the direction of the similarity. For instance, when a gene shares all of its neighbors with another gene ($|N_i \cap N_k| = |N_i|$), it returns the maximum similarity of 1, whereas the other topological similarities yield lower values because they depend on the number of neighbors of both genes. As defined in Eq. 9, the clustering process relies more on topological information as β gets larger. Therefore, in this case, more genes tend to be clustered into the same subnetwork. We can see similar trends for the number of unique genes in the top 50 discriminative subnetwork markers, as shown in Table 4. The top markers identified by the proposed method and the AP-based method also clearly cover more genes than those of the greedy method. The larger number of unique genes covered suggests that the proposed method may increase the chance of discovering genes that are not yet known to be related to the disease, and hence the probability of identifying new subnetworks and pathways.
Table 4

The number of unique genes in top 50 highly discriminative subnetwork markers from GSE2034 and NKI295

Gene expression dataset = GSE2034
Method            Chuang    HPRD    GASOLINE    BioGRID
Greedy            130       121     140         139
AP-based          1814      1789    1709        1939
jac_p             903       997     979         1450
kul_p             1058      1266    1124        1814
tve_p             1724      2263    2299        3090
jac_lc, β=0.25    915       1068    1157        1700
jac_lc, β=0.5     754       769     822         1212
jac_lc, β=0.75    664       817     672         959
kul_lc, β=0.25    1200      1507    1434        1951
kul_lc, β=0.5     949       1143    1209        1916
kul_lc, β=0.75    803       956     1042        1549
tve_lc, β=0.25    1705      2329    2172        2677
tve_lc, β=0.5     1449      2190    2275        3562
tve_lc, β=0.75    1146      2239    2316        4133

Gene expression dataset = NKI295
Method            Chuang    HPRD    GASOLINE    BioGRID
Greedy            114       110     118         150
AP-based          1567      1516    1439        1733
jac_p             731       800     947         1386
kul_p             615       1125    1345        1667
tve_p             1411      2112    2495        2855
jac_lc, β=0.25    757       840     983         1503
jac_lc, β=0.5     669       622     684         1133
jac_lc, β=0.75    577       644     539         899
kul_lc, β=0.25    740       1230    1353        1963
kul_lc, β=0.5     795       925     1164        1698
kul_lc, β=0.75    685       856     861         1357
tve_lc, β=0.25    1538      2089    2433        2622
tve_lc, β=0.5     1363      2081    2535        3644
tve_lc, β=0.75    926       2161    2412        4071

Next, we studied the overlap between the top 50 highly discriminative subnetwork markers identified on different gene expression datasets. As shown in Table 5, the proposed method yields a larger overlap than the greedy method in every setting and than the AP-based method in most settings. Again, trends similar to those in Table 3 can be observed here. The larger overlaps show that more common genes are covered and shared among the subnetworks identified from independent datasets profiled on different platforms. This may lead to more robust classifiers. We demonstrate this robustness below with classification performance charts showing that the experimental results of the proposed method are consistent.
Table 5

Overlap between the top subnetwork markers identified on different gene expression datasets

Method            Chuang     HPRD       GASOLINE    BioGRID
Greedy            5.63 %     4.05 %     4.88 %      3.96 %
AP-based          24.90 %    28.70 %    27.71 %     23.89 %
jac_p             37.89 %    29.28 %    32.01 %     31.97 %
kul_p             15.38 %    27.52 %    26.49 %     28.26 %
tve_p             25.80 %    44.15 %    50.57 %     42.33 %
jac_lc, β=0.25    39.10 %    22.54 %    26.55 %     30.20 %
jac_lc, β=0.5     53.51 %    26.68 %    26.87 %     37.94 %
jac_lc, β=0.75    54.55 %    31.74 %    26.67 %     40.12 %
kul_lc, β=0.25    12.73 %    24.47 %    27.90 %     28.50 %
kul_lc, β=0.5     39.86 %    28.29 %    31.18 %     33.26 %
kul_lc, β=0.75    50.61 %    35.53 %    31.42 %     40.73 %
tve_lc, β=0.25    27.53 %    44.47 %    46.75 %     36.57 %
tve_lc, β=0.5     32.14 %    43.47 %    52.41 %     54.90 %
tve_lc, β=0.75    32.99 %    50.94 %    57.71 %     69.05 %

Additionally, we analyzed the enriched functions of the genes in the subnetwork markers using Panther [21], a web-based system designed to facilitate the analysis of large numbers of genes and to provide comprehensive function information, including up-to-date Gene Ontology (GO) annotations (GO database version 1.2, released 2016-05-20, with 44,588 total annotations). An example of the enrichment analysis of the top 50 highly discriminative subnetworks identified using the tve_p method on GASOLINE is shown in Table 6. The genes in the subnetworks identified from different gene expression datasets also share common GO terms.
Table 6

The number of genes in top 50 highly discriminative subnetwork markers from tve_p method on GASOLINE categorized by their GO terms

Ontology: Molecular function
GO term                                               GO id         GSE2034    NKI295
transporter activity                                  GO:0005215    240        251
translation regulator activity                        GO:0045182    37         41
protein binding transcription factor activity         GO:0000988    35         42
enzyme regulator activity                             GO:0030234    193        205
catalytic activity                                    GO:0003824    1146       1221
channel regulator activity                            GO:0016247    5          6
receptor activity                                     GO:0004872    346        370
nucleic acid binding transcription factor activity    GO:0001071    307        316
antioxidant activity                                  GO:0016209    8          6
structural molecule activity                          GO:0005198    226        260
binding                                               GO:0005488    1237       1330

Ontology: Cellular component
GO term                                               GO id         GSE2034    NKI295
synapse                                               GO:0045202    15         15
cell junction                                         GO:0030054    13         11
membrane                                              GO:0016020    288        290
macromolecular complex                                GO:0032991    213        214
extracellular matrix                                  GO:0031012    50         58
cell part                                             GO:0044464    765        794
organelle                                             GO:0043226    411        441
extracellular region                                  GO:0005576    151        153

Ontology: Biological process
GO term                                               GO id         GSE2034    NKI295
cellular component organization or biogenesis         GO:0071840    278        309
cellular process                                      GO:0009987    1559       1679
localization                                          GO:0051179    536        577
apoptotic process                                     GO:0006915    174        194
reproduction                                          GO:0000003    104        118
biological regulation                                 GO:0065007    886        933
response to stimulus                                  GO:0050896    547        593
developmental process                                 GO:0032502    634        692
rhythmic process                                      GO:0048511    3          1
multicellular organismal process                      GO:0032501    393        413
locomotion                                            GO:0040011    20         24
biological adhesion                                   GO:0022610    127        147
metabolic process                                     GO:0008152    1773       1876
growth                                                GO:0040007    1          3
immune system process                                 GO:0002376    314        342

Discriminative power of the subnetwork markers

We evaluated the discriminative power of the subnetwork markers following the procedure used in previous studies [6–8, 10]. We computed the t-test score of the inferred subnetwork activity level and sorted the markers by the absolute value of this score in descending order. The average absolute t-test score of the top K=10,20,30,40,50 subnetwork markers is shown in Fig. 1. The discriminative power of the subnetwork markers identified by the product-based and linear-combination-based approaches is considerably higher than that of the greedy method. Within the product-based group, the Tversky-based similarity yields the highest discriminative power in most of the results.
Fig. 1

Discriminative power of subnetwork markers identified on GSE2034 by different methods. We computed the average absolute t-test score of the top K=10, 20, 30, 40, and 50 subnetwork markers identified on GSE2034 by various methods for the following PPI datasets: (a) Chuang, (b) HPRD, (c) GASOLINE, and (d) BioGRID

We also assessed how the subnetwork markers identified on one gene expression dataset perform on another, independent dataset. We sorted the subnetwork markers based on the t-test score of their inferred subnetwork activity level on one dataset and re-evaluated their discriminative power on the other dataset. As shown in Fig. 2, the trends in the discriminative power of subnetwork markers across different gene expression datasets are similar to those observed in Fig. 1. The analysis of the discriminative power of the subnetwork markers identified on the NKI295 data also shows a similar trend (Figures S1 and S2 in Additional file 1).
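
The within-dataset and cross-dataset evaluations described above differ only in which dataset supplies the ranking and which supplies the scores being averaged. The sketch below assumes the t-test scores of all markers have already been computed on each dataset; the function name `mean_abs_t_top_k`, the marker names, and the toy scores are hypothetical.

```python
def mean_abs_t_top_k(rank_scores, eval_scores, k):
    """Average |t| on the evaluation dataset of the top-k ranked markers.

    rank_scores: dict marker -> t-score on the dataset used for ranking.
    eval_scores: dict marker -> t-score on the dataset used for evaluation.
    Passing the same dict twice gives the within-dataset average.
    """
    ranked = sorted(rank_scores, key=lambda m: abs(rank_scores[m]),
                    reverse=True)
    top = ranked[:k]
    return sum(abs(eval_scores[m]) for m in top) / len(top)


# Toy t-scores of three markers on two datasets.
scores_a = {"m1": 5.0, "m2": -4.0, "m3": 1.0}
scores_b = {"m1": 2.0, "m2": -3.0, "m3": 0.5}

print(mean_abs_t_top_k(scores_a, scores_a, k=2))  # 4.5, within dataset A
print(mean_abs_t_top_k(scores_a, scores_b, k=2))  # 2.5, ranked on A, scored on B
```

A lower cross-dataset average than within-dataset average, as in the toy numbers, is what one would expect when markers are selected on the same data used to score them.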
Fig. 2

Discriminative power of subnetwork markers across independent gene expression datasets. The markers were identified and ranked on GSE2034 and their discriminative power was evaluated on NKI295. We computed the mean absolute t-score of the top K=10, 20, 30, 40, and 50 markers by different methods for the following PPI datasets: (a) Chuang, (b) HPRD, (c) GASOLINE, and (d) BioGRID

Regarding the impact of different PPI networks, a PPI network with a larger number of interactions tends to yield higher discriminative power. One of the reasons may be that it contains more topological information, which may help to measure the similarity between genes. As intuitively expected, BioGRID has an advantage over the other PPI networks because it contains the largest number of interactions (as shown in Fig. 1(d) and Additional file 1: Figure S1(d)).

Evaluating the reproducibility of the identified subnetwork markers

In order to evaluate the reproducibility of the subnetwork markers, we performed five-fold cross-validation experiments based on a set-up similar to the one commonly used in previous studies [6-11], where the entire process was repeated for 100 random partitions. We identified potential subnetwork markers on one gene expression dataset and selected the top 50 subnetworks as the feature set for the classifier. After that, we built linear discriminant analysis (LDA) classifiers based on the selected features and evaluated their accuracy on the other dataset. The classification performance, assessed by the area under the ROC curve (AUC), is shown in Fig. 3. Both the product-based approach and the linear-combination-based approach yield consistently high performance across different gene expression datasets and PPI networks.
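
Two ingredients of this evaluation, the random five-fold partition and the AUC, can be sketched with the standard library alone. The sketch does not include the LDA classifier itself, and it makes assumptions the text does not specify: folds are drawn by plain (unstratified) shuffling, and the AUC is computed as the fraction of metastatic/non-metastatic sample pairs that the classifier score orders correctly, with ties counted as one half. The function names are hypothetical.

```python
import random


def k_fold_partitions(n_samples, k=5, seed=0):
    """Randomly split sample indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def auc(scores, labels):
    """Area under the ROC curve from classifier scores and 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


folds = k_fold_partitions(10, k=5, seed=1)
print([len(fold) for fold in folds])  # [2, 2, 2, 2, 2]

# Toy classifier scores: one metastatic sample is ranked below a
# non-metastatic one, so 3 of the 4 pairs are ordered correctly.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

Repeating the partition with 100 different seeds and averaging the resulting AUC values would mirror the repetition over 100 random partitions described above.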
Fig. 3

Reproducibility of subnetwork markers identified by various methods. The bars show the cross-dataset classification performance (average AUC) of different methods. (a) GSE2034 was used for identifying the potential markers and NKI295 was used for training and evaluating the classifier. (b) The experiment was repeated with NKI295 used for identifying the markers and GSE2034 used for training and evaluating the classifier

In this work, we use the term ‘reproducibility’ in the sense of the ability to identify common discriminative genes or subnetworks across different independent datasets. Therefore, using these subnetworks as biomarkers for disease classification may lead to consistent performance. Furthermore, in terms of practical usage, the AP-based methods, including our proposed methods, require less computation time than the greedy algorithm, as shown in [11].

Conclusion

In this paper, we propose a novel method that incorporates topological information to identify subnetwork markers that can be used in cancer prognosis prediction. We demonstrate how widely used association coefficients, such as the Jaccard index, Kulczyński index, and Tversky index, can be used to measure topological similarity. We also show how to integrate these measures using two different approaches, product-based and linear-combination-based. Based on our experimental results, the Tversky-based strategy is most suitable for measuring similarity between genes when the direction of interaction is involved. It yields consistently high discriminative power across different datasets. Furthermore, using a larger PPI network with a larger number of unique proteins and interactions, such as BioGRID, may lead to better subnetwork identification with higher classification performance. The proposed method considerably increases the coverage of genes and also the overlap of genes identified across different independent datasets. Through extensive evaluations using independent breast cancer gene expression datasets and PPI networks, the experimental results show that our method leads to the identification of robust and reproducible subnetwork markers that may lead to better cancer classification.

Supplementary materials. Figure S1: Discriminative power of subnetwork markers identified on NKI295 by different methods. Figure S2: Discriminative power of subnetwork markers across independent gene expression datasets. (PDF 1260 kb)
  21 in total

1.  Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer.

Authors:  Liat Ein-Dor; Or Zuk; Eytan Domany
Journal:  Proc Natl Acad Sci U S A       Date:  2006-04-03       Impact factor: 11.205

2.  Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.

Authors:  Yixin Wang; Jan G M Klijn; Yi Zhang; Anieta M Sieuwerts; Maxime P Look; Fei Yang; Dmitri Talantov; Mieke Timmermans; Marion E Meijer-van Gelder; Jack Yu; Tim Jatkoe; Els M J J Berns; David Atkins; John A Foekens
Journal:  Lancet       Date:  2005 Feb 19-25       Impact factor: 79.321

3.  Gene expression profiling predicts clinical outcome of breast cancer.

Authors:  Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal:  Nature       Date:  2002-01-31       Impact factor: 49.962

4.  Discovering statistically significant pathways in expression profiling studies.

Authors:  Lu Tian; Steven A Greenberg; Sek Won Kong; Josiah Altschuler; Isaac S Kohane; Peter J Park
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-08       Impact factor: 11.205

5.  Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival.

Authors:  Howard Y Chang; Dimitry S A Nuyten; Julie B Sneddon; Trevor Hastie; Robert Tibshirani; Therese Sørlie; Hongyue Dai; Yudong D He; Laura J van't Veer; Harry Bartelink; Matt van de Rijn; Patrick O Brown; Marc J van de Vijver
Journal:  Proc Natl Acad Sci U S A       Date:  2005-02-08       Impact factor: 11.205

6.  A gene-expression signature as a predictor of survival in breast cancer.

Authors:  Marc J van de Vijver; Yudong D He; Laura J van't Veer; Hongyue Dai; Augustinus A M Hart; Dorien W Voskuil; George J Schreiber; Johannes L Peterse; Chris Roberts; Matthew J Marton; Mark Parrish; Douwe Atsma; Anke Witteveen; Annuska Glas; Leonie Delahaye; Tony van der Velde; Harry Bartelink; Sjoerd Rodenhuis; Emiel T Rutgers; Stephen H Friend; René Bernards
Journal:  N Engl J Med       Date:  2002-12-19       Impact factor: 91.245

7.  Towards precise classification of cancers based on robust gene functional expression profiles.

Authors:  Zheng Guo; Tianwen Zhang; Xia Li; Qi Wang; Jianzhen Xu; Hui Yu; Jing Zhu; Haiyun Wang; Chenguang Wang; Eric J Topol; Qing Wang; Shaoqi Rao
Journal:  BMC Bioinformatics       Date:  2005-03-17       Impact factor: 3.169

8.  Human Protein Reference Database--2009 update.

Authors:  T S Keshava Prasad; Renu Goel; Kumaran Kandasamy; Shivakumar Keerthikumar; Sameer Kumar; Suresh Mathivanan; Deepthi Telikicherla; Rajesh Raju; Beema Shafreen; Abhilash Venugopal; Lavanya Balakrishnan; Arivusudar Marimuthu; Sutopa Banerjee; Devi S Somanathan; Aimy Sebastian; Sandhya Rani; Somak Ray; C J Harrys Kishore; Sashi Kanth; Mukhtar Ahmed; Manoj K Kashyap; Riaz Mohmood; Y L Ramachandra; V Krishna; B Abdul Rahiman; Sujatha Mohan; Prathibha Ranganathan; Subhashri Ramabadran; Raghothama Chaerkady; Akhilesh Pandey
Journal:  Nucleic Acids Res       Date:  2008-11-06       Impact factor: 16.971

9.  Network-based classification of breast cancer metastasis.

Authors:  Han-Yu Chuang; Eunjung Lee; Yu-Tsueng Liu; Doheon Lee; Trey Ideker
Journal:  Mol Syst Biol       Date:  2007-10-16       Impact factor: 11.429

10.  Unraveling protein networks with power graph analysis.

Authors:  Loïc Royer; Matthias Reimann; Bill Andreopoulos; Michael Schroeder
Journal:  PLoS Comput Biol       Date:  2008-07-11       Impact factor: 4.475
