Fenghai Duan1, Ye Xu2. 1. Department of Biostatistics and Center for Statistical Sciences, School of Public Health, Brown University, Providence, RI, USA. 2. StubHub, San Francisco, CA, USA.
Abstract
PURPOSE: To analyze a microarray experiment to identify the genes with expressions varying after the diagnosis of breast cancer. METHODS: A total of 44 928 probe sets in an Affymetrix microarray data publicly available on Gene Expression Omnibus from 249 patients with breast cancer were analyzed by the nonparametric multivariate adaptive splines. Then, the identified genes with turning points were grouped by K-means clustering, and their network relationship was subsequently analyzed by the Ingenuity Pathway Analysis. RESULTS: In total, 1640 probe sets (genes) were reliably identified to have turning points along with the age at diagnosis in their expression profiling, of which 927 expressed lower after turning points and 713 expressed higher after the turning points. K-means clustered them into 3 groups with turning points centering at 54, 62.5, and 72, respectively. The pathway analysis showed that the identified genes were actively involved in various cancer-related functions or networks. CONCLUSIONS: In this article, we applied the nonparametric multivariate adaptive splines method to a publicly available gene expression data and successfully identified genes with expressions varying before and after breast cancer diagnosis.
PURPOSE: To analyze a microarray experiment to identify the genes with expressions varying after the diagnosis of breast cancer. METHODS: A total of 44 928 probe sets in an Affymetrix microarray data publicly available on Gene Expression Omnibus from 249 patients with breast cancer were analyzed by the nonparametric multivariate adaptive splines. Then, the identified genes with turning points were grouped by K-means clustering, and their network relationship was subsequently analyzed by the Ingenuity Pathway Analysis. RESULTS: In total, 1640 probe sets (genes) were reliably identified to have turning points along with the age at diagnosis in their expression profiling, of which 927 expressed lower after turning points and 713 expressed higher after the turning points. K-means clustered them into 3 groups with turning points centering at 54, 62.5, and 72, respectively. The pathway analysis showed that the identified genes were actively involved in various cancer-related functions or networks. CONCLUSIONS: In this article, we applied the nonparametric multivariate adaptive splines method to a publicly available gene expression data and successfully identified genes with expressions varying before and after breast cancer diagnosis.
Entities:
Keywords:
Ingenuity; MASAL; breast cancer; microarray; turning point
Since the development of microarray for high-throughput analysis of gene expressions,1 this technology has been widely used in many biomedical studies.2–8 Cancer research is one of the main areas that apply this technology.2,9–14 Among this, breast cancer research is the most common one.15–19Although the development of microarrays and the ability to perform massively parallel gene expression analysis of humantumors has been shown great contribution to breast cancer classification, prognostication, and prediction during the last decade,20 the heterogeneous nature of breast cancer and the lack of reliable pathological or molecular markers still reflects the complexity of the molecular alterations that underlie the development and progression of this disease and poses serious problems to clinical management.21 Some studies proposed methods to identify genes consistently expressed at different levels in diseased and normal cases, which is useful to elucidate pathways in breast cancer progression.21,22 Other studies have explored the gene expression changes that are associated with the various stages of breast cancer progression.23 However, the genes with changing expression associated with diagnostic age are still poorly understood, although the risk of breast cancer is highly age related.Multivariate adaptive regression splines (MARS) is a nonparametric regression procedure that fits the model with a number of piecewise linear functions (ie, truncated functions with knots).24 Zhang25 further extended this method to be capable of analyzing the data under longitudinal settings. By applying this method to a publicly available microarray data set,26 this study intends to identify a novel set of genes that vary expressions along with the diagnosis of breast cancer. One of our aims is to improve our understanding of biological mechanism in disease progression at different stages of age and provide novel potential drug targets for breast cancer. The other aim is to demonstrate that the method developed in this study can be generalized to studies with the similar problems for other diseases (eg, lung cancer). The outline of the article is as follows.Section “Method” describes the methods used in the study. Section “Results” presents the results. In section “Discussion,” we discussed the findings identified by our method and study limitations, as well as the future work. The last section presents the conclusions.
Method
Microarray data set
We studied 44 928 probe sets in an Affymetrix microarray data publicly available on Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/) (series number: GSE4922) from 249 patients with breast cancer, who were aged between 28 and 93 years old and enrolled in a cohort at Uppsala.26 The tumor specimens were assessed on Affymetrix U133 A and B arrays. Of 249 samples, 68 were grade 1 tumors, 126 were grade 2 tumors, and 55 were grade 3 tumors.
Processing microarray data through Robust Multichip Average
Robust Multichip Average (RMA) is an algorithm developed to extract the expression matrix from Affymetrix data.27 Through RMA, the raw probe-level intensity values from the Affymetrix data are background corrected, log2 transformed, quantile normalized, and then summarized via a linear model to obtain an expression measure for each probe set. In this step, the raw Affymetrix data (GSE4922) are transformed to a normalized expression value matrix (44 928 probe sets × 249 patients) via RMA in R 2.13.0 (www.bioconductor.org).
Identifying genes with expressions varying after diagnosis via multivariate adaptive splines
Compared with MARS,24 Multivariate Adaptive Splines of Analysis for longitudinal data not only can analyze data under the longitudinal settings but also has the advantage of defining several interesting phases when the velocity of increased (or decreased) outcome changes.25 Below is a brief description of the model and algorithm in MASAL.Assume that the outcomes are repeatedly measured at q different time points for each of the N units. The outcome of unit i at jth observation (i = 1 … N, j = 1 … q), Y, is equal to the following:
where f is a smooth function, t is the time of measurements, x, (k = 1 … p) is the kth covariate, e is an error term, and x* is some covariates which the error term depends on.Here, MASAL is used to regress gene expressions on age at diagnosis through R-package MASAL25,28–30 and in so doing identify changing points for genes with age-varying expressions.In this study, we mainly focus on the changes of gene expressions along with age (t), and only single measurement for each unit is available. Hence, in the absence of covariates and multiple measurements, the function (1) can be deduced as follows:
where β0 is an intercept parameter, M is some unknown number of nonconstant terms, h(t) is a basis function in a function set Γ = {(t − τ)+, t}
∈(−∞,+∞) ((t − τ)+ = max(0, t − τ)),25 or a product of 2 or more such functions. For each gene, β0, β, M, and h(t) are estimated from the data using R-package MASAL.Another advantage of using MASAL is that MARS24 searched for the knot τ* over all observed values of t only, and this restriction has been removed without computational cost by MASAL.25
Grouping the genes with the similar turning points by K-means clustering
K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data (ie, unsupervised machine learning).31 Given a number of cluster centers of genes’ turning points, the K-means procedure iteratively moves the center to minimize the total within-cluster variance. The desired number of cluster centers is a minimum cluster number which satisfies that the ratio of within-cluster variance to the total variance is larger than 90%. The goal of clustering analysis in this study is to group the genes with the similar expression turning points and then conduct the subsequent pathway analysis for their associations.
Modeling the biological systems for the clustered genes with Ingenuity Pathway Analysis
Ingenuity Pathway Analysis (IPA) is applied to understand the molecular and chemical interactions, cellular phenotypes, and disease processes within a system from RNA expression microarrays or single-nucleotide polymorphism microarrays. Ingenuity Pathway Analysis also provides insight into the causes of observed gene expression changes and into the predicted downstream biological effects of those changes. The main transcription factors and biological functions for each cluster will be analyzed by IPA (Ingenuity Systems, www.ingenuity.com).
Results
Out of 44 928 probe sets, 5765 (12.8%) were identified by MASAL to change expressions after diagnosis. However, due to the high variation at the tails, we only focused on the 1640 probe sets with age ranging from 49 to 75, which are close to the 20% and 80% quartiles of the diagnostic age. Out of these 1640 probe sets, 927 expressed lower after turning points and 713 expressed higher after the turning points. Table 1 listed the first 20 probe sets in the order of Affymetrix GeneChip with turning points at diagnosis. The full list can be found in Appendix 1. Figure 1 shows the example of 2 probe sets selected from Table 1.
Table 1
Top 20 probe sets with expressions changing after diagnosis.
Figure 1
Examples of genes (probe sets) with expressions changing after diagnosis.
K-means clustered the identified turning points into 3 major groups, and the corresponding centers are 54 (group 1), 62.5 (group 2), and 72 (group 3) (unit: year) (Figure 2).
Figure 2
Distribution of the identified turning points and the grouping results from K-means clustering.
The pathway analysis was first applied to all probe sets identified by MASAL (named “group all”) and then followed by the 3 groups clustered by K-means (named “group 1,” “group 2,” and “group 3”).For the top canonical pathways, IPA shows that gonadotropin-releasing hormone (GnRH) signaling, inhibition of matrix metalloproteinases, and T-cell receptor signaling are the top-ranked pathways for these 4 groups (including group all, group 1, group 2, and group 3) (Table 2).
Table 2
Top 5 canonical pathways identified by Ingenuity Pathway Analysis.
For the disease and disorder function, the cancer-related function is among the top 5 functions for group all, group 1, and group 3 (Table 3). The inflammatory response function is the top disease and disorder function for group 2.
Table 3
Top 5 disease and disorder functions identified by Ingenuity Pathway Analysis.
For the molecular and cellular function, cellular growth and proliferation function is the no. 1 function for group all, group 1, group 3 and no. 2 for group 2. The function related to cell death and survival appears among the top 5 for 3 out of 4 groups (Table 4).
Table 4
Top 5 molecular and cellular functions identified by Ingenuity Pathway Analysis.
Interestingly, for the top 5 networks identified by IPA (Table 5), the overlaps among these 4 groups are very rare. In particular, RNA posttranscriptional modification, protein synthesis, and gene expression is the no. 1 network identified for group all. Developmental disorder, hereditary disorder, and ophthalmic disease is the no. 1 network for group 1. For group 2, hereditary disorder, respiratory disease, and cell cycle is the no. 1 network, and for group 3, the no. 1 network is amino acid metabolism, drug metabolism, and molecular transport.
Table 5
Top 5 networks identified by Ingenuity Pathway Analysis.
For the upstream regulators analysis (Table 6), IPA found that the tumor protein p53 (TP53), which is a tumor suppressor protein encoded by the TP53 genes in humans, is the no. 1 transcription regulator in group 1. TGFB1, a secreted protein that performs many cellular functions (eg, control of cell growth, cell proliferation, and apoptosis), appears as one of the top 5 regulators among all 4 groups.
Table 6
Identified top 5 upstream regulators via Ingenuity Pathway Analysis for the 3 clusters of genes with expressions after diagnosis.
Discussion
In cancer research, it is natural to envision that there are genes that change expressions before and after the onset of cancer. Developing methods to identify these genes can assist us to further understand the process of carcinogenesis and provide potential drug targets for cancer treatment. In this article, we implemented a method developed by Zhang et al29 to a publicly available gene expression data collected using the microarray technology, which has been widely used to study the biomedical problems, particularly in cancer research in the past 20 years.Our approach successfully identified genes that vary expressions before and after the diagnosis of breast cancer. The networks analysis by IPA shows that many of the identified or associated genes play important roles in the process of cancer development and progression. For instance, the tumor suppressor gene p53 and the transforming growth factor TGFB1 have been demonstrated to be associated with breast cancer risk in many studies.32–36
Figures 3 and 4 show the regulator networks of these 2 factors from Ingenuity.
Figure 3
P53-mediated regulatory network via Ingenuity pathway analysis.
Figure 4
TGFB1-mediated regulatory network via Ingenuity pathway analysis.
In addition to the genes, our approach can also identify networks or functions that are associated with breast cancer risk. For instance, we found that “Cancer” is among the top 5 disease and disorder functions, and GnRH signaling is among the top 5 canonical pathways for many clustered groups (Tables 2 and 3), whereas there is evidence showing that GnRH signaling is associated with risks of many types of cancers.37–39For comparisons, we also separated the genes according to the sign of their turning points after clustering analysis and then re-ran the IPA. The results are uploaded as supplemental data.The ideal data set to study this kind of problems is the one that collects longitudinal measurements for each gene from the same patient and the time spectrum should cover disease diagnosis. In addition, to have enough power, the study must have adequate sample size (eg, in the minimal range of several hundreds). However, this kind of data set is difficult to be obtained even though the cost for each array (or gene chip) has been greatly reduced nowadays. Our study findings show that the measurements of a gene expression from different patients can be pooled together to study the change of expression before and after disease diagnosis.One of the study limitations is the lack of modeling covariates in the discovery of turning points, which will be one of our future research areas. In addition, we plan to apply the approaches demonstrated in this study to the data sets of other cancer types, aiming at the search of common genes that vary expressions before and after diagnosis.
Conclusions
In this article, we implemented the nonparametric MASAL method to a publicly available gene expression data and successfully identified genes with expressions varying before and after breast cancer diagnosis.
Authors: Leming Shi; Laura H Reid; Wendell D Jones; Richard Shippy; Janet A Warrington; Shawn C Baker; Patrick J Collins; Francoise de Longueville; Ernest S Kawasaki; Kathleen Y Lee; Yuling Luo; Yongming Andrew Sun; James C Willey; Robert A Setterquist; Gavin M Fischer; Weida Tong; Yvonne P Dragan; David J Dix; Felix W Frueh; Frederico M Goodsaid; Damir Herman; Roderick V Jensen; Charles D Johnson; Edward K Lobenhofer; Raj K Puri; Uwe Schrf; Jean Thierry-Mieg; Charles Wang; Mike Wilson; Paul K Wolber; Lu Zhang; Shashi Amur; Wenjun Bao; Catalin C Barbacioru; Anne Bergstrom Lucas; Vincent Bertholet; Cecilie Boysen; Bud Bromley; Donna Brown; Alan Brunner; Roger Canales; Xiaoxi Megan Cao; Thomas A Cebula; James J Chen; Jing Cheng; Tzu-Ming Chu; Eugene Chudin; John Corson; J Christopher Corton; Lisa J Croner; Christopher Davies; Timothy S Davison; Glenda Delenstarr; Xutao Deng; David Dorris; Aron C Eklund; Xiao-hui Fan; Hong Fang; Stephanie Fulmer-Smentek; James C Fuscoe; Kathryn Gallagher; Weigong Ge; Lei Guo; Xu Guo; Janet Hager; Paul K Haje; Jing Han; Tao Han; Heather C Harbottle; Stephen C Harris; Eli Hatchwell; Craig A Hauser; Susan Hester; Huixiao Hong; Patrick Hurban; Scott A Jackson; Hanlee Ji; Charles R Knight; Winston P Kuo; J Eugene LeClerc; Shawn Levy; Quan-Zhen Li; Chunmei Liu; Ying Liu; Michael J Lombardi; Yunqing Ma; Scott R Magnuson; Botoul Maqsodi; Tim McDaniel; Nan Mei; Ola Myklebost; Baitang Ning; Natalia Novoradovskaya; Michael S Orr; Terry W Osborn; Adam Papallo; Tucker A Patterson; Roger G Perkins; Elizabeth H Peters; Ron Peterson; Kenneth L Philips; P Scott Pine; Lajos Pusztai; Feng Qian; Hongzu Ren; Mitch Rosen; Barry A Rosenzweig; Raymond R Samaha; Mark Schena; Gary P Schroth; Svetlana Shchegrova; Dave D Smith; Frank Staedtler; Zhenqiang Su; Hongmei Sun; Zoltan Szallasi; Zivana Tezak; Danielle Thierry-Mieg; Karol L Thompson; Irina Tikhonova; Yaron Turpaz; Beena Vallanat; Christophe Van; Stephen J Walker; Sue Jane Wang; Yonghong Wang; Russ Wolfinger; Alex Wong; Jie Wu; Chunlin Xiao; Qian Xie; Jun Xu; Wen Yang; Liang Zhang; Sheng Zhong; Yaping Zong; William Slikker Journal: Nat Biotechnol Date: 2006-09 Impact factor: 54.908
Authors: Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend Journal: Nature Date: 2002-01-31 Impact factor: 49.962
Authors: A Mertzanidou; L Wilton; J Cheng; C Spits; E Vanneste; Y Moreau; J R Vermeesch; K Sermon Journal: Hum Reprod Date: 2012-10-09 Impact factor: 6.918
Authors: Lance D Miller; Johanna Smeds; Joshy George; Vinsensius B Vega; Liza Vergara; Alexander Ploner; Yudi Pawitan; Per Hall; Sigrid Klaar; Edison T Liu; Jonas Bergh Journal: Proc Natl Acad Sci U S A Date: 2005-09-02 Impact factor: 11.205
Authors: Lee P Lim; Nelson C Lau; Philip Garrett-Engele; Andrew Grimson; Janell M Schelter; John Castle; David P Bartel; Peter S Linsley; Jason M Johnson Journal: Nature Date: 2005-01-30 Impact factor: 49.962
Authors: David T Miller; Margaret P Adam; Swaroop Aradhya; Leslie G Biesecker; Arthur R Brothman; Nigel P Carter; Deanna M Church; John A Crolla; Evan E Eichler; Charles J Epstein; W Andrew Faucett; Lars Feuk; Jan M Friedman; Ada Hamosh; Laird Jackson; Erin B Kaminsky; Klaas Kok; Ian D Krantz; Robert M Kuhn; Charles Lee; James M Ostell; Carla Rosenberg; Stephen W Scherer; Nancy B Spinner; Dimitri J Stavropoulos; James H Tepperberg; Erik C Thorland; Joris R Vermeesch; Darrel J Waggoner; Michael S Watson; Christa Lese Martin; David H Ledbetter Journal: Am J Hum Genet Date: 2010-05-14 Impact factor: 11.025
Authors: P T Spellman; G Sherlock; M Q Zhang; V R Iyer; K Anders; M B Eisen; P O Brown; D Botstein; B Futcher Journal: Mol Biol Cell Date: 1998-12 Impact factor: 4.138