Kai Zhang1, Joe W Gray, Bahram Parvin. 1. Life Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.
Abstract
MOTIVATION: Molecular association of phenotypic responses is an important step in hypothesis generation and for initiating design of new experiments. Current practices for associating gene expression data with multidimensional phenotypic data are typically (i) performed one-to-one, i.e. each gene is examined independently with a phenotypic index and (ii) tested with one stress condition at a time, i.e. different perturbations are analyzed separately. As a result, the complex coordination among the genes responsible for a phenotypic profile is potentially lost. More importantly, univariate analysis can potentially hide new insights into common mechanism of response. RESULTS: In this article, we propose a sparse, multitask regression model together with co-clustering analysis to explore the intrinsic grouping in associating the gene expression with phenotypic signatures. The global structure of association is captured by learning an intrinsic template that is shared among experimental conditions, with local perturbations introduced to integrate effects of therapeutic agents. We demonstrate the performance of our approach on both synthetic and experimental data. Synthetic data reveal that the multi-task regression has a superior reduction in the regression error when compared with traditional L(1)-and L(2)-regularized regression. On the other hand, experiments with cell cycle inhibitors over a panel of 14 breast cancer cell lines demonstrate the relevance of the computed molecular predictors with the cell cycle machinery, as well as the identification of hidden variables that are not captured by the baseline regression analysis. Accordingly, the system has identified CLCA2 as a hidden transcript and as a common mechanism of response for two therapeutic agents of CI-1040 and Iressa, which are currently in clinical use.
MOTIVATION: Molecular association of phenotypic responses is an important step in hypothesis generation and for initiating design of new experiments. Current practices for associating gene expression data with multidimensional phenotypic data are typically (i) performed one-to-one, i.e. each gene is examined independently with a phenotypic index and (ii) tested with one stress condition at a time, i.e. different perturbations are analyzed separately. As a result, the complex coordination among the genes responsible for a phenotypic profile is potentially lost. More importantly, univariate analysis can potentially hide new insights into common mechanism of response. RESULTS: In this article, we propose a sparse, multitask regression model together with co-clustering analysis to explore the intrinsic grouping in associating the gene expression with phenotypic signatures. The global structure of association is captured by learning an intrinsic template that is shared among experimental conditions, with local perturbations introduced to integrate effects of therapeutic agents. We demonstrate the performance of our approach on both synthetic and experimental data. Synthetic data reveal that the multi-task regression has a superior reduction in the regression error when compared with traditional L(1)-and L(2)-regularized regression. On the other hand, experiments with cell cycle inhibitors over a panel of 14 breast cancer cell lines demonstrate the relevance of the computed molecular predictors with the cell cycle machinery, as well as the identification of hidden variables that are not captured by the baseline regression analysis. Accordingly, the system has identified CLCA2 as a hidden transcript and as a common mechanism of response for two therapeutic agents of CI-1040 and Iressa, which are currently in clinical use.
Genome-wide association studies of expression and phenotypic data are becoming a routine methodology for identifying potential biomarkers. While the literature is rich with supervised or unsupervised clustering of genomic information, methods for studying the relationships between genomic and phenotypic data remain relatively limited. Existing association methods are typically based on the univariate correlation analysis, which either correlates a single gene to the resultant phenotype(s) or vice versa. This is known as the gene- and phenotype-based approaches, respectively (Dryja, 1997). More recently, (Yi et al., 2008) quantized large number of transcript data through clustering, and associated them with physiological responses or clinical metadata. In contrast, another group of researchers have taken a new direction by first clustering morphometric data and then associating with the transcript data (Han et al., 2010). However, in both cases, correlation is based on the independent, pairwise univariate analysis.Pairwise univariate correlation analysis can quickly provide important association information, as well as candidates for further screening. However, it treats the genes and the phenotypes as independent and isolated units, therefore the underlying interacting relationships between the units might be lost. It is well-known that some transcripts act as regulatory nodes, driving other transcripts in a coordinated manner to determine the phenotypic profile. Additionally, incubation with each therapeutic reagent simultaneously interferes with a subset of genes. Here, we hypothesized that simultaneous incorporation of genome-wide expression data coupled with phenotypic data computed from multiple perturbation conditions, each targeting a different molecular region, can elucidate a common mechanism of response that may be hidden otherwise. In fact, perturbation and molecular diversity of the model system have shown to be capable of reducing the samples needed for biological inference, thus enhancing robustness of biological conclusion (Ideker et al., 2001; Sachs et al., 2005; Tegnér et al., 2003). Thus, we ask the following questions. How can traditional univariate associations be modeled simultaneously and in the absence of a correlation threshold? How can the inherent sparsity of association be formalized within an optimization framework? How can one compensate for the lack of replicates due to the high experimental cost associated with gene expression profiling? To address these issues, we have developed an integrated platform that simultaneously and systematically takes into account an ensemble of gene and phenotypic signatures. Such an enterprise must incorporate an experimental design with sufficient degree of molecular diversity for increased computational robustness. In this context, molecular diversity is achieved by using a panel of breast cancer cell lines that are well-characterized and readily available through American Type Culture Collection.Our computational framework consists of two major steps. First, a vector-valued, multitask regression formulation is adopted to model the relationships between transcripts and phenotypes under multiple experimental conditions. In particular, the regression coefficients are factorized into two parts. One part is a shared template that suggests a common mechanism of action under various treatments. The second part is related to the perturbation that is induced locally in the transcript network under individual perturbation. The regression has to be sparse, because only a subset of genes is typically involved in a specific phenotypic response. Sparsity is enforced through L1-norm regularization, which inherently removes outliers and irrelevant associations. The end result is a sparse regression matrix that captures intrinsic properties of gene–phenotype association. This matrix is reordered for improved visualization of the gene–phenotype grouping, where the reordering aims at an optimum permutation of rows and columns of the regression matrix such that the underlying saliency becomes apparent. In this context, reordering reveals dominant association between subsets of genes (with the similar expression profile) and subset of phenotypic indices (with the similar measurements).We have demonstrated the efficacy of our method with synthetic and experimental data, where the main purpose of synthetic data is to profile the robustness and precision of the proposed method. Experimental data consist of baseline gene expression data for a panel of breast cancer cell lines, which are associated with cell-cycle inhibitor data. The proposed method can be used as a complementary tool besides baseline regression techniques, to provide a richer and a more promising list of candidate molecular predictors for further biological verifications.Section 2 presents our computational model and detailed optimization procedures. Section 3 provides results on synthetic and experimental data. Section 4 concludes with a discussion on the molecular predictors and system performance.
2 MODELS
2.1 Description of basic computational models
In this section, we introduce our basic computational models for exploring the associations between genes and phenotypic responses. To reduce excessive costs associated with the collection of gene expression data, we assumed that the gene expression were collected under a baseline (unperturbed) condition, as denoted by X0 ∈ ℝ. Here, C is the number of cell lines and N is the number of genes. On the phenotypic side, assume that we obtained measurements Y ∈ R's for d = 0, 1, 2,…., D, where M is the number of phenotypic features, d = 0 denotes the controlled, baseline condition and d = 1, 2,…, D corresponds to the drug-perturbed conditions. We used the linear regression model to measure the dependency between genes and phenotypes, as illustrated in Figure 1. The design matrix X0 was mapped to the phenotype responses Y ∈ ℝ via a regressing matrix T∈ℝ, as
The coefficient matrices T's reflect the dependency (or correlation) between the genes and the phenotypes of interest, i.e. its ij-th entry is the weight associated with the i-th gene in reconstructing the j-th feature in the phenotypic profile under the d-th condition.
Fig. 1.
The linear regression model used to compute the sparse association between baseline gene expression data and phenotypic responses.
The linear regression model used to compute the sparse association between baseline gene expression data and phenotypic responses.There are a number of complexities in estimating T. These complexities originate from low sample size, high dimensionality of the data and coupling between different perturbation conditions. However, majority of the transcript data can be considered as noisy background, as it believed that only a subset of genes are involved in each specific cellular process. To address these issues, we propose a sparse, regularized multitask regression framework with co-clustering. The novelty of our method involves: (i) leveraging the locality of the molecular interactions as a result of treatment with therapeutic agents, and modeling multiple treatments simultaneously; (ii) coupling it with a L1-regularized solution that enforces sparsity and simultaneously compensates for small sample size; and (iii) grouping associations with co-clustering.First, a multitask regression framework is used to model the molecular interactions under multiple conditions in a systematic way. The Multitask learning (Caruana, 1997; Lee et al., 2007; Xiong et al., 2007) is aimed at information sharing among learners from a set of different but related tasks, with the hope to boost the overall performance. In this context, regression (1) under each experimental condition is deemed as a task. As phenotypic profiles arise from the original gene regulatory network and its local perturbation, we can assume that phenotypic responses are triggered by different experimental conditions are lying on the same low-dimensional space, i.e.
In other words, task relatedness is enforced by requiring that T's associated with each task are local perturbations of a shared subspace T. Here, T ∈ ℝ represents the shared structure (related to the gene regulatory network), P∈ℝ compensates for the perturbation of different experimental conditions and K is the dimension of the latent space in which the phenotypic responses are supposed to reside. In our formulation, K is set to be equal to M for practical reasons, and P's are diagonal matrices. The actual structure of P is an open problem at this point, and it is possible that a non-diagonal matrix can produce a better reconstruction result. The structure of P and the choice of K is one of the topics for our continued research. Nevertheless, the shared template matrix T has the potential to summarize association descriptor between N genes and M phenotypes. An advantage of decomposing the T matrices is a significant reduction in the number of variables for estimation.Second, the L1 regularization technique is used to mathematically guarantee the robustness of the system against irrelevant genes. The L1 regularization typically leads to sparse learning models, and has been independently discovered in several research areas such as regression shrinkage and variable selection (Tibshirani, 1996), basis pursuit (Donoho et al., 2001), compressive sensing (Donoho, 2006) and feature vector machine (Li et al., 2005). By penalizing the L1-norm of the variables, part of the regression coefficients will be driven to zero with the level of sparsity controlled by the strength of regularization. This is a desirable property considering the highly localized functionalities of genes as they relate to specific phenotypic signatures.By combining the multitask learning frame with the L1 regularization, we established sparse multitask regression as follows:
Here, ‖ · ‖ is the matrix Frobenius norm and ‖ · ‖1 is the matrix L1-norm. The first term enforces a fit between the gene expression and the phenotypic signature under each condition, while the second term enforces sparsity on the shared template T. The constraints ‖P‖ = 1 are used to prevent trivial solutions (i.e. T approaches zero and P's approach infinity). Alternatively, this can be achieved by penalizing ‖P‖ with a extra regularization parameter. More recently, a heterogeneous multitask learning framework that considers both continuous (regression) and discrete (classification) variables was successfully used to discover genetic markers that jointly influence multiple correlated traits (Yang et al., 2009). In comparison, our method considers pure regression setting only, where the phenotypic measurements are continuous.Formulation (3) allows us to obtain condition-specific regression matrices T's based on a common template T. Note that for each T, its non-zero rows signify important genes under the d-th condition. Therefore, template T, which is shared among multiple T's, defines a combined list of genes that are important to the phenotypes studied under these conditions. In other words, T is an integrated association descriptor that summarizes correlating relations between genes and phenotypes under multiple conditions; and we want to read out useful structures (such as the grouped correlation between subsets of genes and subsets of phenotypes) encoded in T. To achieve this goal, we performed co-clustering analysis (Hartigan, 1972) on T. Co-clustering analysis has been used to find clusters in various tabulated data such as the co-occurrence of documents/words (Dhillon, 2001), or the expression of genes under various conditions (Ding, 2003; Kluger et al., 2003; Tanay et al., 2002), by simultaneously grouping rows and columns of the association table. However, it has rarely been applied to interpret associations between genes and phenotypes, where the association table is not directly available from raw data but instead has to be learned. In fact, co-clustering can reorganize regression coefficients in a perceptually meaningful manner to bring more insights into our analysis. This is illustrated by synthetic data, as shown in Figure 2. For example, assume we have learned an association table of 50 rows (e.g. genes) and 40 columns (e.g. phenotypes) where it is difficult to observe any meaningful structures. However, if we permute the rows and columns of the table by co-clustering (Dhillon, 2001), we will discover four dominant correlation groups, as shown in the Figure 2B. Such a grouping can be regarded as a distinctive ‘watermark’ of the gene–phenotypic association. Furthermore, rows (genes) grouped into the same block are more likely to participate together in affecting corresponding columns (phenotype responses).
Fig. 2.
The co-clustering procedure transforms a randomly displayed association table (a) of 50 genes and 40 phenotypes to a organized partition (b).
The co-clustering procedure transforms a randomly displayed association table (a) of 50 genes and 40 phenotypes to a organized partition (b).In summary, the sparse multitask regression has three advantages: (i) it allows us to reduce the number of variables from O(MND) to O(NM+DM2); (ii) the sparsity of T easily transfers to those of T's due to the simple linear relation T = T · P; and (iii) as we shall see, the template matrix T is a platform from which explorative analysis can be carried out in identifying important, grouped correspondences between genes and phenotypic signatures.
2.2 Optimization procedures
Formulation (3) is a vector-valued regression with intrinsic T and perturbation-specific P's. It can be solved by an alternating optimization strategy, i.e. iteratively fixing P's and solving T, and then fixing T and solving P's. We will show that both T and P's subproblems are convex. Thus a locally optimal solution of the problem (3) can always be guaranteed. In the following, we present details on the alternating optimization (Parts I, II and III) and the co-clustering procedure (Part IV).(I) Fix {P} and solve T: We will show that when P's are fixed, T can be solved through quadratic programming. First, use the operator vec(·):ℝ→ℝ to denote the mapping that transforms a p × q matrix into a pq × 1 vector via concatenating the columns in the matrix, and let ivec(·) be the inverse mapping. Let t = vec(T) ∈ ℝ. Then define a 3D matrix A ∈ ℝ for d = 0, 1, 2,…, D, such that
Here, X0(i, :) is the i-th row in X0, P(:, j) the j-th column in P and each (i, j)-pair locates an MN × 1 vector denoted by A(i, j, :). Now, computing T is equivalent to the following quadratic program
It can be easily verified that the residual term ∑ ‖X0TP − Y‖2 in (3) is identical to t⊤ Qt − 2b⊤ t up to a constant that is independent of the optimization variables. Note that the Hessian of the above quadratic programming problem is positive semi-definite: for any x ∈ ℝ we have
On the other hand, the L1 regularization term λ‖t‖1 is a convex function. Therefore, the problem is convex, and there exists a unique, globally optimal solution for the subproblem (5).The main computational barrier is that the Hessian matrix Q is MN-by-MN, which can be very large and does not fit in a modern desktop computer. However, this matrix is symmetric, positive-definite Hessian matrix Q and has very low rank in practice, i.e. its eigen-spectrum decays very quickly to zero. This is shown in Figure 3, where we chose N = 1210 genes and M = 3 phenotypes to construct the matrix Q (6) with size 3630 × 3630. It is clear that the spectrum of Q decays rapidly, with only the top 48 eigenvalues being strictly non-zero, thus substantiating the low-rank nature of the Q matrix. As a result, the Hessian matrix can be represented by the ‘low-rank approximation’ to alleviate prohibitive computational requirements. To do this, we searched for a rank-R matrix L that best represents the Q matrix in a least square sense, min‖Q−LL⊤‖2, where R ≪ NM, L∈ℝ is a rectangular matrix with low row-rank and LL⊤ is called the rank-R approximation of Q. This approximation Q ≈ LL′ dramatically reduces memory usage from O(N2M2) to O(NMR).
Fig. 3.
Spectrum of a 3630 × 3630 matrix Q, computed from our experimental data, indicates that only the largest 48 eigenvalues are strictly positive and the rest are insignificant. The spectrum clearly reflects the low-rank nature of the matrix Q and the feasibility of low-rank approximation.
Mathematically, the optimal rank-R matrix L is given by the eigenvectors of Q (Golub and Loan, 1996), which is computationally expensive. We therefore pursued an approximate solution by adopting the sampling-based low-rank approximation scheme, known as the Nyström method, which originated from the numerical treatment of integral equations of the second type (Baker, 1997). The basic idea of the Nyström method is to randomly sample R columns from the Q matrix, which, due to its symmetry, also corresponds to R rows. Let E and E′ denote the sampled columns and its transpose, respectively, where E ∈ ℝ. Let W ∈ ℝ be the intersection of the selected rows and columns. Then Q can be decomposed as Q ≈ EW−1E′. In our specific context, Q is represented as the sum of multiple outer products (6). By utilizing this property, E and W can be computed efficiently as follows:
where I = {1, 2,…, MN} is the index of selected columns. Given W and E, the low-rank approximation of Q is then expressed as
As W is a positive semi-definite (PSD) matrix, there exists theoretically a real square root of W. In practice, we could encounter diminishing eigenvalues. A robust way is to first perform the eigenvalue decomposition W = U Λ U⊤, remove those diminishing eigenvalues and then let .The low-rank decomposition (8) allows us to rewrite the L1-regularized quadratic programming problem (5) into a standard least square problem (with L1 regularization),
Here, q ∈ ℝ can be determined by expanding the quadratic term in (9), comparing it with (3) and requiring L′q = b. Formulation of (9) is a good approximation to the original problem (5) and it has been widely examined in statistics, optimization and machine learning. We use the l1-ls solver (Kim et al., 2007) for large-scale L1-regularized least square problems, which are based on the truncated Newton interior-point method. Empirically, it can solve large sparse problems with a million variables with high accuracy in a few tens of minutes on a modern desktop computer.(II) Fix T and solve {P}: By fixing T, entries of P's can be computed using simple scalar equations. Let the i-th column of the matrix X·T be denoted by XT(:, i) and the i-th column in Y be Y(:, i). It's easy to verify that the i-th diagonal entry in P can be solved easily as P(i, i) = XT(:, i)⊤XT(:, i)/‖Y(:, i)‖22. To guarantee that P's all have Norm 1, we will normalize them by P = P/‖P‖. This can be deemed as iteratively projecting the solutions on the feasible region ‖P‖ = 1.Note that rescaling both T and P's with −1 does not affect the prediction performance of the multitask regression, but will reverse the signs of associations learned in T. To solve this problem, we require that the signs of the resultant matrix T should be maximally correlated with those of the standard correlation coefficients on the same set of genes. From a practical standpoint, because P's are initialized with identity matrices, we have always observed that they continue to be PSD during the optimization procedure. Empirically, our method converges rapidly in about 5 to 10 iterations on our current datasets.(III) Initialization and parameter selection: By fixing one of the two groups of variables, T or P's (d = 1, 2,…, D), the other can be computed. Here, we choose to initialize P's as identity matrices for d = 1, 2,…, D. Note that initialization of the T's is usually much easier than that of T, where degrees of freedom are M2D and MN, respectively. We used leave-one-out cross-validation to choose the hyperparameter λ since the sample size is very small. This involves selecting one sample as a testing sample and the rest as training. We repeated this process for each sample and computed the averaged predictor error on the testing sample at each grid point λ ∈ {10−3, 10−2, 10−1, 1, 10}.(IV) Co-clustering: Template T is an intrinsic regression coefficient matrix linking the gene expression and phenotypic signature under the multiple conditions studied: the ij-th entry signifies the strength of the relationship between the i-th gene and the j-th phenotype. To reveal the clustered structure in these associations, we used co-clustering to permute the rows and columns of T, so that the underlying saliency becomes apparent and can be visualized. We have adopted the bipartite spectral clustering (Dhillon, 2001) for simultaneously clustering the genes and phenotypes. Bipartite spectral clustering uses a bipartite graph where vertices are divided into two types, each from one dimension of the given contingency table (T). In our case they are genes and phenotypes, denoted by 𝒢 and 𝒫, respectively, and the number of vertices will be M + N. The edge weights are determined by In other words, edges only exist between a gene vertex and a phenotype vertex. By applying spectral clustering on this bipartite graph, simultaneous groupings on gene and phenotype vertices can be computed. Mathematically, we need to compute the singular value decomposition of the degree-normalized association matrix, , where D is an N × N diagonal degree matrix whose i-th entry is the summation of the i-th row in T, and D is a M × M diagonal degree matrix whose i-th diagonal entry is the summation of the i-th column of T. Interestingly, the left and right singular vectors of S (corresponding to the second largest singular value) not only provide a partitioning of the rows and columns of T, but also provide a natural ordering (embedding) of the required row and column permutations.Spectrum of a 3630 × 3630 matrix Q, computed from our experimental data, indicates that only the largest 48 eigenvalues are strictly positive and the rest are insignificant. The spectrum clearly reflects the low-rank nature of the matrix Q and the feasibility of low-rank approximation.
3 RESULTS
Our proposed method has been tested with both synthetic and experimental data. The synthetic data is used for method validation and profiling against other known techniques. Our studies with experimental data identified molecular predictors of cell cycle data from baseline gene expression data.
3.1 Evaluation with synthetic data
In the synthetic case: (i) a data matrix X0 ∈ ℝ50×300 was created from the Gaussian distribution; (ii) a sparse intrinsic template T ∈ ℝ300×5 with 50 non-zero rows and a small set of randomly generated perturbation matrices P ∈ ℝ5×5 were created for each d = 1, 2,…, D task; and (iii) the responses (e.g. target values) were then determined by Y = X0TP + ϵ, where ϵ is the noise term. We examined how well the system recovers T's, and compared the proposed method with (i) independent L1-regularized regression, and (ii) independent L2-regularized regression, also known as regularized least squares (RLS). First, we set D = 10 and selected one of the tasks to visualize the regression qualities against the competing methods. Reconstruction results are shown in Figure 4. Notice that the L1 and L2 regressions (Fig. 4c and d) ‘contaminated’ the true regression coefficients. In practical association analysis, this can lead to a number of false predictions. In contrast, multitask regression (Fig. 4b) reliably recovered the regression coefficients. Second, we varied D from 1 to 50 and quantified the average per-task-error for each of the three methods, as shown in Figure 5. It is clear that the error in multitask regression decreases monotonically with the number of tasks, while the errors in pure L1 and L2 regressions remain stationary. Although this experiment demonstrates an improved error profile for multitask learning, we have not yet designed a synthetic experiment that maintains a correlation between transcripts.
Fig. 4.
Reconstruction of the regression coefficient matrix indicates that multitask learning is more accurate when compared with L1- and L2-regularized regressions. T is a 300-by-5 matrix and each column is represented by a unique color. (a) Ground-truth solution, (b) Multitask regression, (c) standard L1 regression and (d) regularized least square regression.
Fig. 5.
Multitask learning has an improved error rate profile as the number of tasks is increased.
Reconstruction of the regression coefficient matrix indicates that multitask learning is more accurate when compared with L1- and L2-regularized regressions. T is a 300-by-5 matrix and each column is represented by a unique color. (a) Ground-truth solution, (b) Multitask regression, (c) standard L1 regression and (d) regularized least square regression.Multitask learning has an improved error rate profile as the number of tasks is increased.
3.2 Experimental design and quantification of biological endpoints
We applied our method to a set of publicly available gene expression data for a panel of breast cancer cell lines collected with Affymetrix HG-U133A (Neve et al., 2006). We used the following 14 cell lines: MCF12A, HCC38, HCC1428, AU5650, MDAMB415, SUM185PE, ZR75B, MCF7, MDAMB361, LY2, T47D, MDAMB436, MDAMB468 and ZR751. From the original N = 22 215 probe sets, we chose 5706 by removing those with a variance of <0.3. This is slightly above the noise level of the Affymetrix U133 platform. Notice that the gene expression data were collected under baseline (e.g. unperturbed) condition. Our main hurdle has been the prohibitive cost of collecting necessary data (e.g. three conditions, 14 lines, and at least three biological replicates). Thus, we assumed that perturbed expression data would be linearly predictable from the control data.Cell cycle data where collected for cells exposed to three conditions: control condition (e.g. DMSO solvent alone), the MEK inhibitor CI1040 and the tyrosine kinase inhibitor Iressa. Both these inhibitors induce cell cycle arrest, but through different mechanisms. Each cell line was plated in triplicate and incubated for 48 h with CI1040 and Iressa at 5.6 and 4.0 μM, respectively. Subsequently, samples were fixed and stained with Hoechst and BrdU, and 25 fields of view were imaged using the Celomics high-throughput system. These images were uploaded into the BioSig imaging bioinformatics system (Parvin et al., 2003), and then analyzed for their morphometric and BrdU incorporation on a cell-by-cell basis (Raman et al., 2007; Wen et al., 2009). Figure 6 shows a sample of images that have been registered with the BioSig and one segmented image. Each segmented nucleus is represented using a multidimensional feature (Han et al., 2010) and stored in the database. In our experiment, the pertinent features are total BrdU and DNA content on a cell-by-cell basis. By aggregating these features, within each well, percentages of cells being G1, S and G2 Phase can be quantified as a function of their treatment, as shown in Figure 7. The main advantage of microscopy for evaluating cell cycle arrest is a significant reduction in the number of required cells. Finally, outliers were removed. Summary results are shown in Figure 8.
Fig. 6.
(a) Biological images are registered with BioSig and (b) each nucleus is segmented to quantify total DNA and BrdU incorporation on a cell-by-cell basis.
Fig. 7.
By aggregating total DNA and BrdU, on a cell-by-cell basis for all images in each well, the percentages of cells in G1, S, and G2 phase are quantified.
Fig. 8.
Percentage of each cell line being arrested in G1 phase with DMSO, CI1040, and Iressa treatment conditions.
(a) Biological images are registered with BioSig and (b) each nucleus is segmented to quantify total DNA and BrdU incorporation on a cell-by-cell basis.By aggregating total DNA and BrdU, on a cell-by-cell basis for all images in each well, the percentages of cells in G1, S, and G2 phase are quantified.Percentage of each cell line being arrested in G1 phase with DMSO, CI1040, and Iressa treatment conditions.
3.3 Evaluation with therapeutic agents
First, we examined associations of gene expression and cell cycle data using independent L1-regularized regression that learns the regressing coefficients T's separately for each experimental condition. The results enabled us to contrast traditional L1 regression with multitask learning. Predicted results are shown in Figure 9, where each subfigure corresponds to the regression matrix T under one condition. Here, zero rows in the regression matrix were removed, and the rows and columns of T's have been reordered by the co-clustering procedure. The positive and negative association between each gene–phenotype pair is encoded by green and red blocks, respectively. Second, we applied the proposed multitask regression to learn a common template of correlation between genes and cell cycle data for the two inhibitors (e.g. CI1040 and Iressa), as shown in Figure 10. Again, we assumed that each therapeutic reagent would perturb a small molecular region in the cell cycle progression. In this experiment, both CI1040 and Iressa induced cell cycle arrest by targeting different molecular moieties. However, if there is a common mechanism of action, then we would like to infer that. We observed that the genes identified by multitask regression (Fig. 10) contained subset of genes that were identified separately by independent L1 regression, shown in Figures 9b and c. However, there are certain genes that can only be predicted through the multitask regression. These are hidden markers that are relevant to the effect of the therapeutic reagent and provide potential new hypothesis for further studies. The total computation time on a modern desktop computer is approximately 6500 s.
Fig. 9.
The regression matrices (a) T0, (DMSO) (b)T1 (C11040), and (c)T2 (Iressa) learned by the independent regression using 14 cell lines and reordered by co-clustering.
Fig. 10.
The intrinsic template T learned by the multitask regression using 14 cell lines and the two drug conditions (CI1040 and Iressa) and reordered by co-clustering.
The regression matrices (a) T0, (DMSO) (b)T1 (C11040), and (c)T2 (Iressa) learned by the independent regression using 14 cell lines and reordered by co-clustering.The intrinsic template T learned by the multitask regression using 14 cell lines and the two drug conditions (CI1040 and Iressa) and reordered by co-clustering.
4 DISCUSSION
Our experiments with synthetic data have clearly demonstrated that multitask learning offers the following advantages over independent L1 regression: (i) regression is less noisy; (ii) regression error is reduced as a function of the number of tasks; and (iii) hidden variables are revealed since traditional L1 regression can push non-zero coefficients to zero and vice versa. Therefore, the bulk of the discussion in this section is devoted to the experimental data by focusing on a few important genes and their independent analysis through Ingenuity Pathway Analysis (IPA) and Pathway Studio.(I) CLCA2 is a hidden variable that has been identified through multitask regression and is shown to be negatively associated with the S phase. We hypothesized that CLCA2 is a common mechanism of response for inhibitors CI1040 and Iressa. This gene is known to be downregulated in breast cancer cell lines. In addition to being a p53 client (Gruber and Pauli, 1999), its knockdown leads to increased invasiveness (Walia et al., 2009), and it is epigenetically regulated (Li et al., 2004). It is also a tumor suppressor gene that may be a potential target for therapy. It is likely that CLCA2 acts as a common molecular switch to inhibit DNA synthesis and initiate apoptosis as a result of treatment with either therapeutic agent. Therefore, it not only serves as a therapeutic target, but can also be used in combination with other therapeutic targets used today for improved lethality.(II) NLRP2 is regulated by NFκB and is shown to be expressed in MDA-MB-436 and MCF-7 (Bruey et al., 2004) breast cancer cell lines. This particular gene appears in both independent and multitask regression. Furthermore, the Gene Ontology annotation indicates that NLRP2 is in involved in caspase activities and apoptosis. We hypothesized that strong G1 arrest and complementary negative correlation with cells being in S is the result of treatment with the therapeutic agent. This particular gene is reflected in multitask regression and independent regression analysis corresponding to CI1040 and Iressa. It is also a potential common mechanism of response for further analysis.(III) CDKN2A (also known as p16) expression is positively associated with G1 arrest in normal cells and tissues, but is negatively associated with the S phase in our analysis of the humanbreast tumor cell lines (in both the independent regression of Fig. 9b and the multitask regression of Fig. 10). This discrepancy is likely explained by the fact that most of the malignant cell lines in the panel have aberrations in downstream effectors of the product of this gene. The aberrations result in continued proliferation in the presence of p16 expression that ordinarily would yield cell cycle arrest and senescence (Gauthier et al., 2007).(IV) CSTA is involved in apoptosis and differentiation, and is normally regulated by JUN and FOS (Takahashi et al., 1998), whose gene products together constitute the AP1 transcription factor. AP1 drives the expression of a number of genes that are necessary for cell cycle progression. The relationships between these protein–protein interactions are shown in Figure 11. This gene appears in multitask and one of the independent regression analysis.
Fig. 11.
Interaction of CSTA with JUN and FOS curated through IPA.
CA2 is an example of the gene that is reported by both independent association of gene expression data with CI1040 (Fig. 9b) and the multitask regression analysis (Fig. 10). CA2 is ordinarily involved in differentiation and apoptosis, overexpressed in MCF7 and MDA-MB-231 and negatively correlated with the S phase in the drug-treated cells. SiRNA-mediated interference with humanCA2 gene expression has been shown to decrease survival of MDA-MB-231 cell lines (Mallory et al. , 2005).Interaction of CSTA with JUN and FOS curated through IPA.Finally, we performed an independent analysis by using Ingenuity Pathway Analysis and Pathway Studio, scientific software that helps researchers more effectively search, explore, visualize, and analyze biological and chemical findings related to genes, proteins and small molecules. We selected the set of genes that was correlated with the S phase, and uploaded them into IPA and Pathway Studio. The IPA analysis indicated that this group of genes is largely involved in (i) cell cycle and signaling networks and (ii) cancer. The net result is a more substantial support for gene-by-gene analysis. Similar results have been obtained from Pathway Studio, which provides gene set enrichment analysis (GSEA) and identifies common regulators with the user-defined number of neighbors. Gene enrichment analysis revealed that predicted gene groups are involved in response to toxin, drug, negative regulation of cell proliferation, negative regulation of peptidase activity where S phase is one of them and apoptosis among top-ranked groups. Furthermore, a number of common regulators with high P-values were also inferred that are associated with the cell cycle machinery. Figure 12 shows three regulators of MAPK, Jun/Fos, and GF, and their target entities.
Fig. 12.
Three common regulators that have been inferred from a subset of genes associated with the S phase.
Three common regulators that have been inferred from a subset of genes associated with the S phase.In summary, multitask learning has the potential to summarize a vast amount of data, compute biologically relevant markers and identify hidden variables that traditional regressors may fail to capture. Although the technique is currently applied for integration of gene expression data with cell cycle data, it can also be used for other integrative biology applications.
Authors: Mona L Gauthier; Hal K Berman; Caroline Miller; Krystyna Kozakeiwicz; Karen Chew; Dan Moore; Joseph Rabban; Yunn Yi Chen; Karla Kerlikowske; Thea D Tlsty Journal: Cancer Cell Date: 2007-11 Impact factor: 31.743
Authors: Ju Han; Hang Chang; Orsi Giricz; Genee Y Lee; Frederick L Baehner; Joe W Gray; Mina J Bissell; Paraic A Kenny; Bahram Parvin Journal: PLoS Comput Biol Date: 2010-02-26 Impact factor: 4.475
Authors: Jean Marie Bruey; Nathalie Bruey-Sedano; Ruchi Newman; Sharon Chandler; Christian Stehlik; John C Reed Journal: J Biol Chem Date: 2004-09-28 Impact factor: 5.157