Literature DB >> 26733076

New Method for Joint Network Analysis Reveals Common and Different Coexpression Patterns among Genes and Proteins in Breast Cancer.

Francesca Petralia¹, Won-Min Song¹, Zhidong Tu¹, Pei Wang¹.

Abstract

We focus on characterizing common and different coexpression patterns among RNAs and proteins in breast cancer tumors. To address this problem, we introduce Joint Random Forest (JRF), a novel nonparametric algorithm to simultaneously estimate multiple coexpression networks by effectively borrowing information across protein and gene expression data. The performance of JRF was evaluated through extensive simulation studies using different network topologies and data distribution functions. Advantages of JRF over other algorithms that estimate class-specific networks separately were observed across all simulation settings. JRF also outperformed a competing method based on Gaussian graphic models. We then applied JRF to simultaneously construct gene and protein coexpression networks based on protein and RNAseq data from CPTAC-TCGA breast cancer study. We identified interesting common and differential coexpression patterns among genes and proteins. This information can help to cast light on the potential disease mechanisms of breast cancer.

Entities: Disease

Mesh：

Substances：

Year: 2016 PMID： 26733076 PMCID： PMC4782177 DOI： 10.1021/acs.jproteome.5b00925

Source DB: PubMed Journal: J Proteome Res ISSN： 1535-3893 Impact factor: 4.466

Introduction

In a recent breast cancer study conducted by the Clinical Proteomic Tumor Analysis Consortium (CPTAC),[1,2] extensive global- and phospho- protein profiling were obtained for a subset of breast cancer samples that have been extensively characterized in the The Cancer Genome Atlas (TCGA).[3] This is so far the first attempt to study protein activities in breast cancer samples using sophisticated protein experiments on a large scale. Then, by leveraging genomic analytical outputs from TCGA on these samples,[3] we have the unique opportunity to characterize and compare gene coexpression networks and protein coexpression networks based on the same set of samples. This is of great interest because knowledge about the common and different coexpression patterns among RNAs and proteins could improve our understanding of complicated gene regulatory mechanisms in tumor samples and could also facilitate the detection of important disease genes and therapeutical targets. In the past decade, numerous statistical and machine learning methods have been proposed to construct gene–gene regulatory networks and protein–protein regulatory networks, including coexpression network methods,[4] Bayesian network methods,[5,6] and Gaussian graphical models.[7−10] Review of these and other methods are available in Hecker et al.[11] and Lee et al.;[12] however, all of these proposed models are designed to construct one network at a time. Presumably, we can use RNA expression data and protein expression data to construct two networks separately and then compare, but such an approach is certainly less optimal, as gene expressions and protein expressions in one tumor sample are closely related. On one hand, protein levels are regulated by their RNA levels, so we expect the two coexpression networks share common structures. On the other hand, protein activities are subject to a large amount of post-transcriptional modifications, and thus we also expect to observe unique interaction patterns in each network. This motivates us to perform joint learning of both RNA-seq and proteomic networks, so we are able to borrow information across different data sets and better capture the common correlation structure, which shall then improve the overall accuracy of network estimation. Various methods have been proposed to jointly estimate different networks.[13−17] Some of them[13,14,17] have been specifically designed to estimate time-varying graphical models in the context of time-series data. Guo et al.[16] and Danaher et al.[15] proposed likelihood-based methods for the joint estimation of multiple related Gaussian graphical models. Their approaches rely on two different regularization schemes penalizing differences across partial correlation structures of different classes. While both papers successfully demonstrate the advantage of jointly modeling multiple networks over estimating individual networks separately, the performance of the methods heavily depends on the Multivariate Gaussian assumptions of gene expression distributions, which may not hold in real biological systems. Recently, random-forest-based methods have been utilized for the construction of GRN.[18−20] Random forest[21] is a decision-tree-based nonparametric method for building powerful prediction models. It has been extensively utilized for classification and regression problems.[22] Its ensemble structure combined with its nonparametric nature delivers excellent performance even when the sample size is moderate. Recently, Huynh-Thu et al.[18] proposed GENIE3, a random-forest-based method for estimating networks. The key idea is to model the expression of each target gene as a function of the expression of all other genes via random forest. The superior performance of GENIE3 was demonstrated in both DREAM 4 in-silico multifactorial challenge[23] and DREAM 5 network inference challenge.[24] In both challenges, GENIE3 and its extensions scored the best. In Maduranga et al.,[20] GENIE3 was further modified to estimate GRN based on time-series data. Recently, Petralia et al.[19] proposed iRafNet, another random-forest-based method that combines different types of heterogeneous data, such as protein–protein interactions, transcription factor-DNA binding, and gene knock-down to estimate GRN. While these random-forest-based methods greatly advance the field of GRN construction, none of them is designed to estimate simultaneously multiple related networks. We propose JRF (Joint Random Forest), a new algorithm for the simultaneous estimation of protein coexpression and gene coexpression networks. The key idea is to borrow information across different data by forcing the class-specific tree ensembles to use the same genes for the splitting rules. In this way, regulatory relationships playing important roles in multiple classes will be detected with better power. After testing the performance of our algorithm on several in silico experiments and comparing JRF with different methods including GENIE3[18] the Joint Graphical Lasso (JGL),[15] we applied JRF for the simultaneous construction of gene coexpression networks and protein coexpression networks. For our analysis, we considered gene expression from TCGA and protein expression from CPTAC data for 62 breast cancer patients. We derived the two networks and identified protein-specific hub genes and gene modules, showing the unique contribution of proteomic data to breast cancer research.

Materials and Methods

Random Forest for Network Construction

Random forest is a nonparametric algorithm that models a response variable via a collection of decision trees. Specifically, each decision tree is constructed based on a random subset of training samples, and the splitting variable at each node of a decision tree is chosen from a randomly sampled subset of predictors upon maximizing a certain utility function (e.g., decrease in node impurity). Recently, Huynh-Thu et al.[18] introduced GENIE3, a random-forest-based model for inferring gene regulatory networks (GRNs). In GENIE3, first, for each target gene k, its expression is modeled as a function of the expression of all other genes via random forest. Then, regulatory event (j → k) is measured by I, the importance score of gene j in the random forest model of gene k. Specifically, the importance score I is defined as the summation of node impurities across all nodes that utilize predictor j for the splitting rule divided by the total number of trees in the random forest model of gene k.

Joint Random Forest

We start with some notations. Denote G as the total number of different classes. For each class g ∈ {1, ..., G}, denote the gene expression data for p genes and n individuals as X, and the ith observation of gene j under class g as x. An overview of JRF is shown in Figure , where, for simplicity, we consider only two classes. The key step of JRF is to construct G random forest models simultaneously using data from G classes when predicting the expression of a target gene k based on the expression of all other genes. We propose to use the same predictor variables for splitting rules in different trees corresponding to different classes. The goal is to borrow information across different classes, so that regulatory relationships can be better detected if there are coherent signals across different classes. Specifically, when we grow G decision trees in parallel for G classes, at one node τ, we decide the splitting variable based on the following procedure:

Figure 1

JRF schematic. For simplicity, let us assume that there are only two classes and that each data contains the same number of samples. For each target gene, g, we run random forest for each class and the two tree ensembles are forced to share some structure as follows. First, for each node τ, the same set of genes Sτ is randomly sampled from the entire set of genes (a). Then, the same variable in set Sτ is chosen for the splitting rule. This is achieved by deriving the splitting rule for any gene contained in set Sτ (b) and choosing the best variable to maximize the sum of decrease in node impurity over different classes (c).

(1) Randomly sample a set containing N genes from the entire set of genes except gene k. (2) For each class, the best splitting rule based on each candidate predictor in set is derived. Let be the subset of samples allocated to node τ under class g. The splitting rule based on the jth predictor is of the formwhere and are the subsets of samples in class g allocated to the left and right children of node τ with a splitting rule based on the jth predictor. For each predictor in set , the best threshold a*(τ) is derived as followswhereand with representing the mean of set . In other words, Cτ is the decrease in node impurity after splitting node τ in the gth tree according to gene j. (3) Finally, gene gτ* is chosen as the splitting variable of node τ for all classes based on the following criteria:Because JRF utilizes the same splitting variable for corresponding nodes across all classes, predictors associated with gene k in multiple classes are more likely to be chosen for the splitting rules. For each step in the tree construction, (1–3) only apply to classes for which τ is not a final node. As in the original random forest model,[21] each tree grows until either the total number of observations allocated to the final leaves falls below a certain prespecified threshold or the maximum number of possible nodes is reached. It is worth mentioning that when the number of classes is one, JRF reduces to GENIE3, the original random forest model for network inference. It is worth noting that to implement JRF, data sets need to be standardized to mean zero and unit variance. Importance scores depend on the scale of the data and therefore variables (genes) need to be standardized before the random forest models are fitted. JRF schematic. For simplicity, let us assume that there are only two classes and that each data contains the same number of samples. For each target gene, g, we run random forest for each class and the two tree ensembles are forced to share some structure as follows. First, for each node τ, the same set of genes Sτ is randomly sampled from the entire set of genes (a). Then, the same variable in set Sτ is chosen for the splitting rule. This is achieved by deriving the splitting rule for any gene contained in set Sτ (b) and choosing the best variable to maximize the sum of decrease in node impurity over different classes (c).

Assessing JRF Performance

On the basis of the previously described procedure, JRF constructs G random forest models simultaneously for each target gene k and returns a ranking of gene–gene interactions based on importance scores. To assess the performance of JRF in predicting the true interactions, receiver operating characteristic curves (ROC) and precision-recall curves can be computed by setting different thresholds on importance scores. In this paper, JRF is compared with JGL,[15] GENIE3-Sep, and GENIE3-Comb on several in silico experiments. GENIE3-Sep is GENIE3 used to estimate networks based on data from G classes separately, while GENIE3-Comb is GENIE3 used to estimate a unique network based on the union of data from all classes. All random-forest-based algorithms (JRF, GENIE3-Comb, and GENIE3-Sep) provide a ranking of regulatory relationships based on importance scores, and ROC curves were computed by setting different thresholds on these scores. Instead, for JGL, ROC curves were constructed by considering different values for the two parameters controlling the level of sparsity (as shown by Figure S1 in the Supporting Information) . Another approach to evaluate the performance of JRF is choosing a proper cutoff value for importance scores using permutation techniques. Let I be the importance score associated with the regulatory event (j → k) in the gth class-specific tree ensemble. Specifically, this importance score is defined as where T is the number of trees and is the set of nodes which utilize gene j for the splitting rule in the tree ensemble used to predict gene k based on the gth class-specific data. Because in this paper we are interested in estimating undirected networks, we derive importance score I for every edge (j–k) as the average between importance scores I and I, and the final undirected networks can therefore be derived by applying a cutoff on the edge importance scores. To derive a proper cutoff value for importance scores, we utilize the following permutation-based procedure: (a) For b ∈ {1, ···, B}, with B being the number of permutations: (a.1) For any target gene k, we first permute its sample order within each class and fit G random forest models via JRF to predict the expression of gene k based on the expression of all other genes in G classes, respectively. (a.2) Repeat (a.1) for all genes and compute the final importance scores for each edge in each class, which are denoted as {I}. (b) For each threshold ι, we computewhere 1(•) is the indicator function, equal to one if event • occurs and zero otherwise. f(ι) can serve as an approximation of the false discovery rate (FDR).[25] In the following numerical studies, we use ι0 = min{ι: f(ι) ≤ 0.001} and declare an edge between j and k in class g if I > ι0.

Computational Complexity

The computational complexity of JRF is O(pTN∑log(n)n), where p is the number of genes, T is the number of trees in each forest, N is the number of variables sampled at each node, G is the number of classes, and n is the number of samples in each class. This complexity is in the same order as the complexity of applying GENIE3 to estimate G networks for G classes separately. In the contrast, the time complexity of GGM-based approaches, such as JGL, often has the order of p3.[15] So when p is large, JRF and GENIE3 enjoy better computational efficiency, which is supported by our numerical investigations (see Supporting Information). In practice, the number of trees (T) and the number of potential regulators to be sampled at each node (N) are user-specified parameters. T is usually chosen sufficiently large because the tree ensemble provides more accurate results as T increases. In our numerical studies, we reported results for T = 1000. We also evaluated larger T values, such as 2000 and 5000, but observed similar results (data not shown). For N, a common choice is where (p – 1) is the number of predictors for each target gene. In general, large value of N results in predictions with high bias; while low value results in predictions with high variance. We implemented an R package JRF using C routines wrapped with an R interface. The R package is publicly available through CRAN. Codes and a tutorial for JRFs implementation can also be downloaded from http://research.mssm.edu/tulab/software/JRF.html. Similarly to other tree-based algorithms for network construction, JRF can be easily parallelized because the estimation process consists of p-independent subproblems. A comparison between JRF, JGL, GENIE3-Sep, and GENIE3-Comb in terms of computational time is presented in Table S1.

Data

Synthetic Data

For data generation, we first considered two networks containing 500 nodes connected by 498 and 249 edges, respectively. In particular, Network 1 is a network with two disjoint components and Network 2 consists of only the first component of Network 1. In this example, we focused on two network topologies: power law[10,26] and star topology (Figure S2 in Supporting Information). For each network topology, we simulated 20 replicates involving n = 50 samples from a Gaussian graphical model. Network structures and covariance matrices were simulated in the same way as Danaher et al. (2014)[15] (further details can be found in the Supporting Information). Besides this example, we also consider a series of other different simulation settings, including (1) two non-nested networks; (2) five non-nested networks; (3) two nonrelated networks; and (4) data from non-Gaussian distributions. Please see Supporting Information sections 1.3–1.7 for more details. For each data set, variables were standardized to mean zero and unit variance.

Breast Cancer Data

The expressions of 5864 proteins for 62 breast cancer tumors[27] (proteome-ratio-norm-noNA-normal.gct(04-Jun-2014 version)) were downloaded from the CPTAC Data Coordinating Center (http://proteomics.cancer.gov/programs/cptacnetwork) sponsored by the National Cancer Institute. For three samples having technical duplicates, we took the average abundance of each protein across the duplicates. We further standardized protein expression so that each sample had median 0 and median absolute deviation (MAD) 1. Finally, we excluded proteins with interquartile range <70% quantile. The resulting data matrix consisted of 1759 proteins. Level-three RNAseq data of breast tumor samples were obtained from the TCGA Web site (http://tcga-data.nci.nih.gov/tcga/). First, we log-transformed the data and then replaced all missing values with 0. Then, we standardized each sample to have median 0 and MAD 1. We focused on the 62 samples contained in the proteomic data and excluded genes with more than 10% missing values. Then, we selected the top 10% genes with largest interquartile range across samples. The resulting data matrix based on RNAseq data contained 1464 genes. As final subset, we considered 528 genes obtained by overlapping the set of 1464 selected genes and the set of 1759 selected proteins. For each data set, genes and proteins were standardized to mean zero and unit variance.

Results and Discussion

Synthetic Data

In this section, JRF was compared with JGL,[15] GENIE3-Sep, and GENIE3-Comb on several in-silico experiments. In all implementations, we used T = 1000 and . We evaluated the performances of different algorithms based on different network structures, sample sizes, and number of genes.

Network Structure

Figure shows the average ROC curves of each method over 20 replicates for Network 1 and Network 2 separately. JRF outperforms all other methods under all different settings, while GENIE3-Sep is the second best performer. Figure S3 shows the boxplot of the area under the precision-recall curve (AUPR) for JRF, GENIE3-Sep, and GENIE3-Comb. As shown, JRF outperforms competitors also in terms of AUPR. The advantage of JRF over GENIE3-Sep is more evident for Network 2, which contains only shared edges. This is expected because JRF should be able to detect common edges with better power.

Figure 2

Mean of ROC curves for Network 1 (first column) and Network 2 (second column) over 20 replicates for JRF (red), GENIE3-Comb (green), GENIE3-Sep (blue), and JGL (black). For each replicate, we sampled 50 samples from a Gaussian graphical model based on the power law topology (first row) and star topology (second row). To further assess the performance of the proposed model, we included more simulation scenarios in the Supporting Information. In particular, in Supporting Information section 1.3, we considered the case of non-nested networks sharing some structure, while, in section 1.4, we considered the case where five different networks were estimated simultaneously. As shown by Figures S4 and S5, JRF outperformed competitors under all of these simulation scenarios. In addition, it would be interesting to evaluate the performance of JRF when class-specific networks do not share any structure. For this purpose, we simulated two different networks with no common edge and applied different methods on the corresponding data (Supporting Information section 1.5). As shown in Figure S6, performance of JRF is very comparable to that of GENIE3-Sep on this simulation example. This suggests that JRF is sufficiently robust and works in the absence of common structure. Moreover, to assess the performance of JRF in the presence of nonlinearities in the data, we further generated data examples using GeneNetWeaver,[28] an open-source software for the generation of in-silico data from a set of ordinary and stochastic differential equations (Supporting Information section 1.6). As shown in the supplemental Figure S7 and S8, JRF again outperformed other algorithms in terms of both area under the ROC curve (AUC) and area under precision-recall curve (AUPR). Finally, Figure S9 shows results for different algorithms in estimating networks with 1000 nodes (further details can be found in Supporting Information section 1.7), in which we can see that JRF outperformed competitors in terms of both AUC and AUPR. It is worth mentioning that the increased dimensionality did not negatively affect results, suggesting that JRF can handle problem with high dimensionality.

Effect of Sample Size

In this section, we assess the performance of JRF and other algorithms on data sets involving different number of samples. Figure shows boxplots of AUC over 20 replicates resulting from JRF, GENIE3-Sep, and GENIE3-Comb for different sample sizes, that is, n = 50, 100, 200. The AUCs of JRF are significantly larger than that of GENIE3-Sep based on 20 replicates under all sample sizes. As shown in the Supplemental Figure S10, JRF outperforms competitors also in terms of AUPR for different samples sizes. As shown, the smaller the sample size the bigger the improvement of JRF over GENIE3-Sep is. This result is expected because as the sample size increases GENIE3-Sep should have sufficient information to estimate the networks separately.

Figure 3

Boxplot of AUC over 20 replicates for JRF (red), GENIE3-Comb (green), and GENIE3-Sep (blue) for different sample sizes, that is, n = 50, 100, and 200. Data: samples simulated from a Gaussian graphical model on a power law topology.

Edge Specificity

In this section, we compare JRF and GENIE3-Sep regarding their abilities to estimate class-specific edges. We considered the 20 replicate data sets (n = 100), which was simulated based on the power law network topology in the previous section. For both JRF and GENIE3-Sep, edges were declared using the permutation procedure illustrated above with B = 500 and an FDR threshold of 0.001. Table a shows the performance of JRF and GENIE3-Sep in estimating Network 1 and Network 2. For each network, we reported the minimum and the maximum value of true positive rate (TPR), false-positive rate (FPR), and false discovery rate (FDR) across 20 replicates. In particular, these quantities are defined as TPR = TP/(TP + FN), FPR = FP/(FP + TN), and FDR = FP/(FP + TP), where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. For both networks, JRF achieved better power (TPR) than GENIE3-Sep, while the resulting FPRs were quite comparable. Table b shows a comparison of the two algorithms regarding the detection of class-specific (differential) edges. Again, we computed the rate of true differential edges and false differential edges and showed the minimum and the maximum value of these quantities over 20 replicates. While the TPRs for detecting differential edges were similar for the two methods, the FPRs of GENIE3-Sep were significantly larger than that of JRF. In addition, JRF leads to substantially lower FDRs for detecting differential edges than GENIE3-Sep. This again suggests the merit of joint learning in JRF.

Table 1

(a) TPR, FPR, and FDR of Network 1 and Network 2 and (b) TPR, FPR and FDR of Differential Edges (Diff.Net)a

			TPR		FPR		FDR
	model	network	min	max	min	max	min	max
(a)	JRF	Net 1	0.64	0.69	4e-4	7e-4	0.13	0.20
		Net 2	0.68	0.78	2e-4	6e-4	0.15	0.26
	GENIE3-Sep	Net 1	0.62	0.66	4e-4	6e-4	0.12	0.20
		Net 2	0.57	0.67	2e-4	6e-4	0.11	0.27
(b)	JRF	Diff.Net	0.56	0.65	1e-4	2e-4	0.09	0.18
	GENIE3-Sep	Diff.Net	0.59	0.67	3e-4	4e-4	0.19	0.29

For each quantity (TPR, FPR, and FDR) we show the minimum value (Min) and the maximum value (Max) across 20 replicates. Note that the true discovery rate (TDR) is defined as TDR = (1 – FDR).

Coexpression RNA and Protein Networks in Breast Cancer

We applied JRF to construct coexpression networks based on the CPTAC global proteomics data set and TCGA RNAseq data of 62 breast cancer samples.

Networks via JRF

For simplicity, we will refer to the coexpression networks constructed by JRF from global proteomics data and RNAseq data as protein-network and RNA-network. At FDR cutoff of 0.001, we detected 687 common edges shared by both networks, 502 edges unique to protein-network, and 382 edges unique to RNA-network. The topologies of the two networks are shown in Figure , while the full list of interactions can be found in Additional File 1 in the Supporting Information. Interestingly, a few network modules in protein-network have a high percentage of edges unique to protein-network. No such structures is observed in RNA-network. Figure highlights genes whose degrees (i.e., total number of edges connecting to one gene) in protein-network are much higher than that in RNA-network. Three out of the top five genes with more protein-unique edges, C3, CFB, and MPO, are related to immune response functions. It is well known that immune response plays a major role in the patient response to chemotherapy treatments.[29] Some of these genes have been shown to be predictive of survival outcomes of breast cancer patients. For example, C3 and CP are predictive of chemoresistance in breast cancer patients,[30] and high activity of MPO genotypes can enhance efficacy of chemotherapy for early-stage breast cancer.[31] Other remarkable proteins are MMP8 and MMP9. Genes belonging to the MMP family have been linked to cancer progression for their ability to degrade the extracellular matrix and may be implicated in the formation of metastasis.[32,33]

Figure 4

Figure 5

Number of connecting edges for top connected genes in the network based on proteomic data. For each gene, the green bar corresponds to the number of connecting edges shared between proteomic and RNAseq data; the red bar indicates the number of Protein-specific edges; while the blue bar indicates the number of RNA-specific edges. For each gene, we list its biological functions.

Network based on proteomic data (a) and RNAseq data (b) resulting from JRF. Green edges (687) are shared between proteomic and RNAseq data, red edges (502) are unique to proteomic data, while blue edges are unique to RNAseq data (382). Only names of genes with at least ten connecting edges are shown in the plot. For both networks, the full list of interactions can be found in additional file 1. Number of connecting edges for top connected genes in the network based on proteomic data. For each gene, the green bar corresponds to the number of connecting edges shared between proteomic and RNAseq data; the red bar indicates the number of Protein-specific edges; while the blue bar indicates the number of RNA-specific edges. For each gene, we list its biological functions. We then focused on network modules unique to protein-network. We derived network modules based on edge betweenness using function “edge.betweenness.community” available in the R package igraph.[34] In Figure , we show the protein-network with gene modules enriched of at least one GO category. The full list of genes contained in each module can be found in additional file 1. Figure shows, for each module, the most enriched GO category and the corresponding Benjamini adjusted p-value.[35,36] Two interesting protein-specific modules (with more protein-specific edges than shared edges) are C1 and C2. C1 is enriched of “fibroblast growth factor receptor activity”, as shown in Figure , genes are more tightly correlated in the proteomic data compared with the RNAseq data. Various studies have investigated the role of the FGFR pathway as a predictive marker for breast cancer.[37,38] As shown in Figure , FGFR2 and FGFR3 show a correlation close to one. Recently, Cerliani et al.[39] reported a strong correlation between FGFR2 and FGFR3 based on protein expression mentioning that there is no previous evidence of correlation between these two proteins for breast cancer patients. In addition, the interaction between FGFR2 and FGFR3 is contained in the STRING database[40,41] (see Table S2). Another interesting protein-specific module is C2. This module is enriched of “extracellular region”. Similarly to module C1, genes in module C2 are higher correlated in the proteomic data compared with the RNAseq data (Figure ). This cluster contains genes such as APOB, C3, ORM1, and CP that have been recently shown to be predictive of chemoresistance in breast cancer patients.[30] As shown by Figure , protein expressions of these genes are highly correlated. Another important gene in this module with a high number of connecting edges is APOA1 (the fifth highest connected node according to Figure ). This gene has been recently identified as a potential target for disease progression using whole-genome trancriptomic and whole proteomic.[42] As shown in Table S2, 11 of the relationships in Figure are contained in the STRING database. These results suggest that proteomic data can provide complementary information to genomics data and help to reveal important biological mechanisms.

Figure 6

Figure 7

Heatmap of the absolute value of correlation between genes based on proteomic and RNAseq data for module C1 (a) and C2 (b) in Figure . For each module, we include a pie plot showing the number of protein-specific interactions (red) and interactions shared across proteomic and RNAseq data (green).

Network based on proteomic data resulting from JRF with corresponding gene modules. Modules were detected using function “edge.betweenness.community” available in the R package igraph. For each module, we show the most enriched GO category with corresponding Benjamini adjusted p-value (P). In particular, only modules reporting a significant enriched Benjamini adjusted p-value (P ≤ 0.01) are highlighted. Heatmap of the absolute value of correlation between genes based on proteomic and RNAseq data for module C1 (a) and C2 (b) in Figure . For each module, we include a pie plot showing the number of protein-specific interactions (red) and interactions shared across proteomic and RNAseq data (green).

Comparison with GENIE3-Sep

Because we are interested in assessing the utility of a joint learning, JRF was compared with the standard random-forest algorithm that constructs the two networks separately. For both methodologies, we derived undirected networks using the same FDR cutoff (0.001). The protein-network resulting from GENIE3-Sep can be found in Figure S11 in the Supporting Information. As shown, 57% of the edges are shared across protein- and RNA-networks under JRF, while only 26% are shared under GENIE3-Sep. This finding can be explained by the fact that a joint learning can facilitate the detection of common edges.

Validation Using GO Terms

In this section, we validate our findings using GO categories. In particular, we consider 596 GO terms containing fewer than 200 genes (the list of GO terms can be found in Tables S3–S6 in the Supporting Information). Figure a shows the number of edges contained in at least one GO term for different FDR cutoff for both protein-networks. Figure b shows the enrichment p-value for different FDR cutoff. The enrichment p-value was calculated as follows. First, we constructed a network based on GO terms where an edge (a – b) was drawn between every pair of genes (a, b) sharing at least one GO term. Then, a contingency table was constructed considering variables “network based on GO terms” and “network based on proteomic data” with categories “number of edges contained” and “number of edges non-contained”, and a Fisher’s exact test was performed to derive enrichment p-values. Figure a,b shows that JRF results in a network more overlapping with GO terms than the standard random-forest algorithm GENIE3-Sep.

Figure 8

We validate protein-networks in Figure using GO terms and knockout signatures. The first row contains network validation based on GO terms. For our analysis, we considered 596 GO terms containing fewer than 200 genes. (a) Number of edges contained in at least one GO term for different FDR cutoff. (b) p value of enrichment for different FDR cutoff. The second row contains validation based on Esr1 (c) and Gata3 (d) signatures. For a particular neighborhood size of either Esr1 or Gata3, we show the total number of knockout signatures contained under JRF (○−) and GENIE3-Sep (★−).

Validation Using Esr1 and Gata3 Knockdown Signatures

To further validate our findings, we overlapped both networks with Esr1 and Gata3 knock-down signatures. We identified siRNA knock-down signatures of key transcription factors (TFs) of breast carcinoma in MCF7 cells1 by accessing gene expression data available under GSE31912 in Gene Expression Omnibus[43] (further details about these signatures can be found in Section 2 of Supporting Information). For each network, a neighborhood was defined by defining a threshold k for the number of edges connecting a particular gene to the target gene of interest (either Esr1 or Gata3). For each network in Figure , we considered different values of k and counted the number of knockout signatures contained in the neighborhood. Figure c,d shows the total number of signatures contained in k-size neighborhoods for Esr1 and Gata3, respectively. As shown, also in this case, JRF results in a more enriched network. Figure S12 contains the same comparison for RNA networks.

Discussion

In this paper, we developed JRF, a random-forest-based algorithm for the simultaneous construction of gene coexpression and protein coexpression networks. JRF is designed to borrow information across different expression data by selecting the same set of predictors (genes) as splitting variables in the random forest model corresponding to each data. In this way, JRF is able to detect common relationships (edges) with better power and detect differential edges-specific to individual data with fewer false positives. Because in this study we were mainly interested in assessing the unique contribution of proteomic data to breast cancer research, we focused our attention on protein-specific hub genes and modules. We identified two interesting protein-specific modules containing potential targets for breast cancer treatment. In addition, we compared our algorithm to the original random-forest algorithm, which constructs the two networks separately, and showed that our algorithm leads to networks more overlapping with available knockout signatures and existing GO terms. The current version of JRF requires the feature (protein) space to be the same across different classes. Extension of JRF for handling data involving different sets of variables remains as future work. This problem arises in the protein domain because protein abundance can be measured for different phosphorylation-sites that map to a unique protein. As future work, we will design a model able to jointly estimate networks from phosphorylation-site abundance and RNAseq data, simultaneously. Besides estimating gene-regulatory network and protein-regulatory network, as shown in the paper, JRF can be used in many other applications. For example, we can apply JRF to estimate GRNs for the three breast cancer subcategories (luminal, basal, and her2), GRN for different tissues, and GRN for cancer and normal tissues. Another interesting direction would be assessing the association between miRNAs-genes and miRNAs-proteins simultaneously. miRNAs are molecules that control the growth and proliferation of a cell. It is well known that the downregulation of some miRNAs may play an important role in the progression of cancer.[44] Therefore, a better understanding of the interactions between miRNA, mRNA, and protein expression is crucial to cast light on the potential disease mechanisms of cancer. In this context, JRF can be easily utilized to jointly detect which miRNAs regulate genes and proteins. Another advantage of JRF relies on its computational efficiency. JRF can be easily parallelized because the computation can be divided into p-independent subproblems. On the contrary, for many existing methods such as Bayesian networks and GGM-based algorithms, it is difficult to parallelize their computation. For this reason, JRF could be more preferred when handling large data sets. The merit of JRF depends on the assumption that networks of different classes share some common structures. To assess the performance of JRF in the absence of common structure, we conducted numerical studies to investigate such cases and concluded that JRF is sufficiently flexible to guarantee good performance even when nonrelated networks are considered. In JRF, importance scores derived from random forest models are used to rank edge strengths. We propose a permutation-based procedure to derive proper cutoffs on importance scores for selecting confident edges. Specifically, we introduce f(ι) as an approximation of the false discovery rate (FDR) of the selected edge set based on cutoff ι; however, in the numerical studies, we observe that f(ι) tends to underestimate FDR. Thus, a conservative threshold on f(ι) is recommended in practice. New methods to obtain more accurate estimate of FDR for JRF warrant future research.

37 in total

1. MicroRNA-203 Regulates Growth and Metastasis of Breast Cancer.

Authors: Shan Zhao; Jinzhu Han; Likang Zheng; Zixin Yang; Li Zhao; Yingqian Lv
Journal: Cell Physiol Biochem Date: 2015-08-12

2. Increased expression of fibroblastic growth factor receptor 2 is correlated with poor prognosis in patients with breast cancer.

Authors: Shanshan Sun; Yongdong Jiang; Guoqiang Zhang; Hongtao Song; Xianyu Zhang; Youxue Zhang; Xiaoshuan Liang; Qian Sun; Da Pang
Journal: J Surg Oncol Date: 2011-10-17 Impact factor: 3.454

3. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.

Authors: Juliane Schäfer; Korbinian Strimmer
Journal: Stat Appl Genet Mol Biol Date: 2005-11-14

Review 4. Gene expression omnibus: microarray data storage, submission, retrieval, and analysis.

Authors: Tanya Barrett; Ron Edgar
Journal: Methods Enzymol Date: 2006 Impact factor: 1.600

Review 5. Gene regulatory network inference: data integration in dynamic models-a review.

Authors: Michael Hecker; Sandro Lambeck; Susanne Toepfer; Eugene van Someren; Reinhard Guthke
Journal: Biosystems Date: 2008-12-27 Impact factor: 1.973

6. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks.

Authors: Jun Zhu; Bin Zhang; Erin N Smith; Becky Drees; Rachel B Brem; Leonid Kruglyak; Roger E Bumgarner; Eric E Schadt
Journal: Nat Genet Date: 2008-06-15 Impact factor: 38.330

Review 7. Emergence of FGFR family gene fusions as therapeutic targets in a wide spectrum of solid tumours.

Authors: Brittany C Parker; Manon Engels; Matti Annala; Wei Zhang
Journal: J Pathol Date: 2014-01 Impact factor: 7.996

8. Inferring regulatory networks from expression data using tree-based methods.

Authors: Vân Anh Huynh-Thu; Alexandre Irrthum; Louis Wehenkel; Pierre Geurts
Journal: PLoS One Date: 2010-09-28 Impact factor: 3.240

9. Identification of ApoA1, HPX and POTEE genes by omic analysis in breast cancer.

Authors: Naci Cine; Ahmet Tarik Baykal; Deniz Sunnetci; Zafer Canturk; Muge Serhatli; Hakan Savli
Journal: Oncol Rep Date: 2014-06-23 Impact factor: 3.906

10. KELLER: estimating time-varying interactions between genes.

Authors: Le Song; Mladen Kolar; Eric P Xing
Journal: Bioinformatics Date: 2009-06-15 Impact factor: 6.937

5 in total

1. Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma.

Authors: David J Clark; Saravana M Dhanasekaran; Francesca Petralia; Jianbo Pan; Xiaoyu Song; Yingwei Hu; Felipe da Veiga Leprevost; Boris Reva; Tung-Shing M Lih; Hui-Yin Chang; Weiping Ma; Chen Huang; Christopher J Ricketts; Lijun Chen; Azra Krek; Yize Li; Dmitry Rykunov; Qing Kay Li; Lin S Chen; Umut Ozbek; Suhas Vasaikar; Yige Wu; Seungyeul Yoo; Shrabanti Chowdhury; Matthew A Wyczalkowski; Jiayi Ji; Michael Schnaubelt; Andy Kong; Sunantha Sethuraman; Dmitry M Avtonomov; Minghui Ao; Antonio Colaprico; Song Cao; Kyung-Cho Cho; Selim Kalayci; Shiyong Ma; Wenke Liu; Kelly Ruggles; Anna Calinawan; Zeynep H Gümüş; Daniel Geiszler; Emily Kawaler; Guo Ci Teo; Bo Wen; Yuping Zhang; Sarah Keegan; Kai Li; Feng Chen; Nathan Edwards; Phillip M Pierorazio; Xi Steven Chen; Christian P Pavlovich; A Ari Hakimi; Gabriel Brominski; James J Hsieh; Andrzej Antczak; Tatiana Omelchenko; Jan Lubinski; Maciej Wiznerowicz; W Marston Linehan; Christopher R Kinsinger; Mathangi Thiagarajan; Emily S Boja; Mehdi Mesri; Tara Hiltke; Ana I Robles; Henry Rodriguez; Jiang Qian; David Fenyö; Bing Zhang; Li Ding; Eric Schadt; Arul M Chinnaiyan; Zhen Zhang; Gilbert S Omenn; Marcin Cieslik; Daniel W Chan; Alexey I Nesvizhskii; Pei Wang; Hui Zhang
Journal: Cell Date: 2019-10-31 Impact factor: 41.582

2. Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships.

Authors: Gail Gong; Wei Wang; Chih-Lin Hsieh; David J Van Den Berg; Christopher Haiman; Ingrid Oakley-Girvan; Alice S Whittemore
Journal: Stat Appl Genet Mol Biol Date: 2019-04-08

3. Differential Integration of Transcriptome and Proteome Identifies Pan-Cancer Prognostic Biomarkers.

Authors: Gregory W Schwartz; Jelena Petrovic; Yeqiao Zhou; Robert B Faryabi
Journal: Front Genet Date: 2018-06-15 Impact factor: 4.599

4. Proteogenomics connects somatic mutations to signalling in breast cancer.

Authors: Philipp Mertins; D R Mani; Kelly V Ruggles; Michael A Gillette; Karl R Clauser; Pei Wang; Xianlong Wang; Jana W Qiao; Song Cao; Francesca Petralia; Emily Kawaler; Filip Mundt; Karsten Krug; Zhidong Tu; Jonathan T Lei; Michael L Gatza; Matthew Wilkerson; Charles M Perou; Venkata Yellapantula; Kuan-lin Huang; Chenwei Lin; Michael D McLellan; Ping Yan; Sherri R Davies; R Reid Townsend; Steven J Skates; Jing Wang; Bing Zhang; Christopher R Kinsinger; Mehdi Mesri; Henry Rodriguez; Li Ding; Amanda G Paulovich; David Fenyö; Matthew J Ellis; Steven A Carr
Journal: Nature Date: 2016-05-25 Impact factor: 49.962

5. A new method to study the change of miRNA-mRNA interactions due to environmental exposures.

Authors: Francesca Petralia; Vasily N Aushev; Kalpana Gopalakrishnan; Maya Kappil; Nyan W Khin; Jia Chen; Susan L Teitelbaum; Pei Wang
Journal: Bioinformatics Date: 2017-07-15 Impact factor: 6.937

5 in total