
Cancer Bioinformatic Methods to Infer Meaningful Data From Small-Size Cohorts.

Nabila Bennani-Baiti¹, Idriss M Bennani-Baiti².

Abstract

Whole-genome analyses have uncovered that most cancer-relevant genes cluster into 12 signaling pathways. Knowledge of the signaling pathways and associated gene signatures not only allows us to understand the mechanisms of oncogenesis inherent to specific cancers but also provides us with drug targets, molecular diagnostic and prognosis factors, as well as biomarkers for patient risk stratification and treatment. Publicly available genomic data sets constitute a wealth of gene mining opportunities for hypothesis generation and testing. However, the increasingly recognized genetic and epigenetic inter- and intratumor heterogeneity, combined with the preponderance of small-size cohorts, hamper reliable analysis and discovery. Here, we review two methods that are used to infer meaningful biological events from small-size data sets and discuss some of their applications and limitations.


Keywords: cohort size; expression profiling; gene data set; intertumor heterogeneity; intratumor heterogeneity; low-incidence cancers

Year: 2015 PMID: 26568679 PMCID: PMC4631160 DOI: 10.4137/CIN.S32696

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

Next-generation sequencing and microarray technologies have generated massive amounts of data that can be mined for disease–gene expression correlates in search of molecular mechanisms, biomarkers, or drug targets. As of August 15, 2015, slightly fewer than 4,000 publicly available Gene Expression Omnibus (GEO) data sets (GDSs) could be retrieved from GEO alone (the NIH gene expression data set repository at the National Center for Biotechnology Information; http://www.ncbi.nlm.nih.gov/gds/), several hundred of which are dedicated to human cancers. Current gene expression arrays encompass some 45k and 22k probesets for protein-encoding and noncoding genes, respectively (eg, Affymetrix’s GeneChip® Human Transcriptome Array 2.0; Illumina’s HumanHT-12 v4 Expression BeadChip), allowing investigators to probe gene expression variation in clinical samples or cell lines at an unprecedented depth. The analytical power of whole-genome analyses, however, remains limited, mostly owing to two practical constraints: (i) most cancers are relatively low-incidence diseases (eg, Ewing’s sarcoma affects 1–2 children/year/million1 and subcutaneous panniculitis-like T-cell lymphoma afflicts about 1 person/year/10 million2,3), and most laboratories or even institutions therefore have access to only a limited number of tumor samples and (ii) the cost of the technology remains too high for most low- to mid-budget laboratories, thus forcing investigators to limit the number of tested samples and biological replicates, which in turn yields mostly underpowered studies.
In 2010, McClellan and King highlighted the complex interplay linking genetic diversity to disease heterogeneity.4 Accordingly, discovery of many disease-associated genetic risk variants requires exceedingly large cohorts in genome-wide association studies, as recently exemplified in a large-size cohort analysis of lung adenocarcinoma, wherein 26 research departments from several countries pooled their resources to conduct the study.5 The problem posed by the high interindividual allelic variability can be further exacerbated by that of epigenetic diversity (eg, in follicular lymphoma and diffuse large B-cell lymphomas),6 whereby stochastic and/or environmental factors can lead to different epigenetic (and gene expression) landscapes, even in presumably otherwise genetically identical monozygotic twins.7 It is now becoming increasingly appreciated that several cancers exhibit high intratumor variability, including those of the breast,8–10 colon,11,12 head and neck,13 ovary,14 prostate,15 and stomach,16 as well as glioblastoma.17–19 In fact, somatic mutation frequency analysis of more than 3,000 tumor samples encompassing 27 cancer types showed mutation rate variability of three or more orders of magnitude between tumors (eg, in lung adenocarcinoma and melanoma),20 underscoring the scale of heterogeneity.
Furthermore, tumor heterogeneity can be driven by chemotherapeutic intervention, adding to the complexity of the analysis.21,22 Since heterogeneity can increase over time and/or when tumors are exposed to different microenvironments, it can be high when comparing metastases to primary tumors, particularly when metastases take up to several decades to evolve, allowing time for stochastic genotypic or epigenetic changes.23 Thus, for instance, 3%–24% of breast cancer metastases display a different estrogen, progesterone, or HER2 (erb-b2 receptor tyrosine kinase 2) receptor status from the primary tumors,24 either due to a receptor switch or because the tested metastases arose from sections of the primary tumor not included in the analyses. These intra- and interindividual differences notwithstanding, recurrent alterations in key biological processes often underlie a given disease,4 and, for example, only a dozen or so core signaling pathways appear to drive the tumorigenic phenotype of most cancers.25–27 Whereas cancer genetic and epigenetic diversities offer opportunities for biomarker discovery and risk stratification,28 uncovering genes and pathways associated with specific disease states remains challenging, owing to the sample-size requirements. Fortunately, bioinformatic methods have begun to address this problem, and we briefly summarize below those that have proved useful in the analysis of small-size cohorts.

How Small are Small-Size Cohorts?

It is common knowledge that most childhood cancer cohorts are relatively small in size. This is not only because these diseases are relatively rare but also because less funding is devoted to research on these neoplasms than on their adult counterparts. Thus, for example, the combined NIH budget for all types of pediatric sarcomas is only about 1/15th of the budget allocated to breast cancer alone.29 But what about the cohort size of gene data sets of more frequent childhood cancers (eg, leukemia with 88 cases/year/million in 1–4-year-old children30) or adult cancers (80–690 cases/year/million in men and 73–724 cases/year/million in women for the top 10 cancer types; based on our estimates taking into account the current US population projection31 and cancer frequencies in adults in the US population in 201532)? To address this question, we ran a meta-analysis of all cancer gene data sets deposited to date in GEO. As shown in Figure 1, the majority of data sets contained 50 or fewer samples and qualified as small-size cohorts (median ; range: 4–192). As may be expected, the largest data sets were mostly those of cancers with high incidence and research funding, which accounted for seven of the 10 largest data sets (not shown). Interestingly, however, probability density distribution analyses of individual cancers with the highest incidences and research funding, such as those of the breast (representing 29% of all new annual cases in women32; ; range: 4–116), prostate (26% new cases in men32; ; range: 4–171), lung (13%–14% new cases in both genders32; ; range: 4–192), and colorectum (8% new cases in both genders32; ; range: 4–104), show that these too are mostly made up of small-size cohorts (Fig. 2). The problem posed by small-size cohorts therefore affects the majority of data sets across the board in both childhood and adult cancers.
Larger data sets can of course be found in the collections of The Cancer Genome Atlas of the National Cancer Institute (NCI)/National Human Genome Research Institute (NHGRI) and of the Wellcome Trust Sanger Institute. These are, however, also subject to two overriding limitations. The first relates to the aforementioned frequently observed tumor heterogeneity, from which one can presume that many large-size cohort data sets are essentially heterogeneous collections of varying numbers of relatively homogeneous smaller size cohorts. Second, although these initiatives have made significant strides in increasing sample size in high-incidence diseases, they still lag somewhat behind in low-incidence or so-called orphan diseases, to which many cancers belong.

Figure 1

Probability density distribution of all cancer gene data sets in Gene Expression Omnibus (GEO). All cancer data sets were retrieved from GEO (query performed on August 15, 2015) and plotted against sample size (x axis). Gene data set size refers to the number of tumor samples per data set. The analysis included 368 data sets and 9,845 tumor samples. Only data sets limited to tumor samples were retrieved; those solely listing data on tumor stroma or normal peripheral blood lymphocytes in cancer patients or those that combined several cancer types were omitted from the analysis. There were no other exclusion criteria.
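
The kind of summary reported for these distributions (median, range, share of data sets with 50 or fewer samples, and a density curve) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `sizes` list is made-up placeholder data rather than the 368 GEO data sets, the helper names `summarize_sizes` and `gaussian_kde` are hypothetical, and the bandwidth uses a common normal-reference rule of thumb that the source does not specify.

```python
import math
import statistics

def summarize_sizes(sizes, small_cutoff=50):
    """Median, range, and fraction of data sets at or below the cutoff."""
    small = sum(1 for s in sizes if s <= small_cutoff)
    return {
        "median": statistics.median(sizes),
        "range": (min(sizes), max(sizes)),
        "fraction_small": small / len(sizes),
    }

def gaussian_kde(sizes, grid):
    """Gaussian kernel density estimate evaluated on each grid point.

    Bandwidth: normal-reference rule of thumb, h = 1.06 * sd * n^(-1/5)
    (an assumption; the source does not state how bandwidth was chosen).
    """
    n = len(sizes)
    h = 1.06 * statistics.stdev(sizes) * n ** (-1 / 5)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sizes)
        for x in grid
    ]

# Placeholder sample sizes for illustration only (NOT the GEO data).
sizes = [4, 8, 12, 16, 20, 24, 30, 36, 48, 60, 96, 192]
summary = summarize_sizes(sizes)
density = gaussian_kde(sizes, grid=range(0, 201, 10))
```

With real data, `sizes` would hold the number of tumor samples per retrieved data set, and `density` could be plotted against the grid to obtain a curve comparable in spirit to Figure 1.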

Figure 2

Probability density distribution of cancer gene data sets of high-incidence adult cancers. All cancer data sets were retrieved from GEO (query performed on August 15, 2015) and plotted against sample size (x axis). These included (A) breast cancer (number of data sets n = 110), (B) prostate cancer (n = 43), (C) lung cancer (n = 25), and (D) colorectal cancer (n = 37).

To illustrate some of the limitations imposed by small-size cohorts, we ran two simple tests comparing gene expression in two publicly available Ewing’s sarcoma gene data sets.33,34 In the first test, we looked at two genes that encode epigenetic modifiers with important roles in tumorigenesis. The first gene, lysine-specific demethylase 1 (LSD1; GeneID: 23028; to avoid gene and species ambiguity,35,36 NCBI gene IDs are given herein along with the gene symbol), was shown to be overexpressed and to serve as a drug target in Ewing’s sarcoma in vitro37 as well as in other neoplasms, such as breast cancer.38 The second gene, enhancer of zeste homolog 2 (EZH2; GeneID: 2146), was also shown to be overexpressed in Ewing’s sarcoma and to be a drug target both in vitro and in vivo.39,40 As proposed by others,41 we computed the bivariate kernel density estimates and ran regression analyses in the R environment and compared the probability density distributions for either gene in two equal-size small Ewing’s sarcoma cohorts. As shown in Figure 3A, LSD1 shows consistently high gene expression in both cohorts, whereas there are a few outliers for EZH2 (Fig. 3B), indicating that the sample size and number of cohorts utilized, while sufficient to analyze LSD1, were borderline in the case of EZH2.

Figure 3

Bivariate kernel density estimates of gene expression consistency across small-size cohorts. Genes tested were LSD1 (A), EZH2 (B), and CXCR4 (C). x and y axes represent Log2 expression values of given genes (x or y) across Ewing’s sarcoma data sets either in GDS# GSE7007 (x axis; n = 27) or in ArrayExpress data set E-MEXP-1142 (y axis; n = 27). Lines across graphs depict regression curves as computed in a multiple regression model based on the ordinary least squares method as defined by the equation .
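
For readers who want to experiment with this type of comparison, the sketch below shows a product-Gaussian bivariate kernel density estimate and a one-predictor ordinary least squares fit in plain Python. It is a hedged illustration only: the authors worked in R, their exact regression equation and the way samples from the two cohorts were paired are not given here, and the expression values, bandwidths `hx` and `hy`, and function names are all assumptions made for the example.

```python
import math

def bivariate_kde(xs, ys, x0, y0, hx, hy):
    """Product-Gaussian kernel density estimate at the point (x0, y0)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        total += math.exp(-0.5 * (((x0 - x) / hx) ** 2 + ((y0 - y) / hy) ** 2))
    return total / (n * 2 * math.pi * hx * hy)

def ols_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Made-up log2 expression values for illustration only; the pairing of
# values across the two cohorts is assumed, not taken from the source.
cohort_x = [9.1, 9.4, 9.8, 10.0, 10.3, 10.7]
cohort_y = [9.0, 9.5, 9.7, 10.2, 10.2, 10.8]

intercept, slope = ols_line(cohort_x, cohort_y)
peak = bivariate_kde(cohort_x, cohort_y, 10.0, 10.0, hx=0.5, hy=0.5)
edge = bivariate_kde(cohort_x, cohort_y, 12.0, 8.0, hx=0.5, hy=0.5)
```

A gene expressed consistently across cohorts would be expected to give a single dense region near the fitted line, whereas outliers show up as density far from it.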

We next ran a second test, this time looking at chemokine (C-X-C motif) receptor 4 (CXCR4; GeneID: 7852), a gene encoding a chemokine receptor previously shown to mark metastatic Ewing’s sarcoma and that is therefore associated with about one-third of all samples, representing the fraction of metastatic tumors.42 In contrast to the earlier tests, here we find the size and number of cohorts to be limiting, as the bivariate kernel density estimates did not fully reproduce the predicted distribution for this gene (Fig. 3C). Although these examples show that analyses in two small-size cohorts may be sufficient for some genes, it is important to note that these were tests for which we already knew the answers. For hypothesis generation through bioinformatics, which is usually one of the main applications of gene data set mining analyses, one would need a method to infer meaningfulness with a much higher degree of confidence. In statistics, one way of boosting confidence is to increase sample size. One example is so-called sequential analysis, used particularly in prospective studies, whereby one adds samples (or recruits patients) until statistical significance is reached or until there are indications that significance is unlikely to be achieved.43 Although the sample size in sequential analysis is unknown prior to the end of the investigation, it tends to be smaller than in other methods wherein the sample size is predetermined, making it particularly suitable for cancers with very low incidences. Such a method, however, is impractical in the analysis of publicly available gene data sets, as the size of these is already fixed. In this case, an alternative would be to increase the number of cohorts. Two bioinformatic methods have taken advantage of this option to infer meaningful correlates from small-size cohorts.
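
The stopping logic of sequential analysis mentioned above can be made concrete with Wald's sequential probability ratio test, one classical form of the approach. The sketch below is an assumption-laden illustration for a binary outcome (for example, whether each newly added sample is positive for a marker); the hypothesized rates `p0` and `p1` and the error rates `alpha` and `beta` are arbitrary example values, and the source does not state which sequential design it has in mind.

```python
import math

def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald's sequential probability ratio test for a binary outcome.

    Returns (decision, n_used), where decision is "accept_h1", "accept_h0",
    or "continue" if the data ran out before a boundary was crossed.
    """
    upper = math.log((1 - beta) / alpha)   # crossing above favors H1 (p = p1)
    lower = math.log(beta / (1 - alpha))   # crossing below favors H0 (p = p0)
    llr = 0.0                              # cumulative log-likelihood ratio
    for n, hit in enumerate(outcomes, start=1):
        if hit:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(outcomes)

# Samples are examined one at a time; sampling stops at the first boundary.
decision, n_used = sprt_bernoulli([1, 1, 0, 1, 1, 1, 1], p0=0.2, p1=0.5)
```

The sample count at stopping is not known in advance, which is exactly why this design cannot be applied to a deposited data set whose size is already fixed.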

Ican and Affiliated Cancer Informatics Methods to Probe Small-Size Cohorts

To address the problem posed by the small size of cancer cohorts, one of the authors developed the first method to reliably infer gene expression significance, and its association with specific patient subsets, from publicly available small-size cohort data sets irrespective of the expression profiling platform.37,42,44 This method, named Intercohort Co-ANalysis or Ican, relies on several innovative tools. First, it utilizes published gene expression levels known to be biologically active in experimentally validated tissues as a benchmark for gene expression significance, thus extracting biological significance from gene expression profiles. This eliminates the variability across studies that results from different investigators customarily using different arbitrary cutoffs for gene expression significance. Second, instead of combining small-size cohorts into a larger meta-cohort, each small-size cohort is analyzed individually. This helps avoid conormalization and the averaging out of sample quantiles across cohorts of different variances and distribution functions. The cohort-specific distribution probabilities are in fact used to highlight the high intercohort variability inherent to small-size cohorts. Next, quantile fits of sample size to specific disease states, say chemoresistant or metastatic tumors, are mined for consistent molecular correlates within individual cohorts. Finally, a subtractive overlay of cohort-restricted associations is carried out to uncover genes whose expression is consistently associated with select sample subsets in all cohorts. In our case, four small-size publicly available data sets, in addition to a fifth nonpublicly available cohort that served for wet laboratory validation, were sufficient to infer gene expression significance.
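
The overall shape of the workflow described above (per-cohort analysis against a benchmark, matching to the expected size of a disease subset, then an overlay across cohorts) can be sketched as follows. This is a loose illustration and not the published Ican implementation: the per-gene `benchmark` values, the `tolerance` parameter, the gene names, and the use of a simple set intersection to stand in for the overlay step are all assumptions introduced for the example, as is the premise that values are on a comparable scale across cohorts.

```python
def genes_matching_subset(cohort, benchmark, expected_fraction, tolerance=0.15):
    """Within one cohort, return genes whose fraction of samples at or above
    the per-gene benchmark is close to the expected disease-subset fraction."""
    hits = set()
    for gene, values in cohort.items():
        above = sum(1 for v in values if v >= benchmark[gene])
        if abs(above / len(values) - expected_fraction) <= tolerance:
            hits.add(gene)
    return hits

def overlay(cohorts, benchmark, expected_fraction, tolerance=0.15):
    """Keep only genes flagged in every cohort; cohorts are never pooled."""
    per_cohort = [
        genes_matching_subset(c, benchmark, expected_fraction, tolerance)
        for c in cohorts
    ]
    return set.intersection(*per_cohort) if per_cohort else set()

# Toy values for illustration only (hypothetical genes and thresholds).
benchmark = {"GENE_A": 8.0, "GENE_B": 8.0, "GENE_C": 8.0}
cohort_1 = {
    "GENE_A": [9.0, 9.2, 6.0, 6.1, 5.9, 6.3],  # 2 of 6 above benchmark
    "GENE_B": [9.0, 9.1, 9.3, 9.2, 9.4, 9.0],  # 6 of 6 above benchmark
    "GENE_C": [9.5, 6.0, 6.2, 9.1, 6.1, 5.8],  # 2 of 6 above benchmark
}
cohort_2 = {
    "GENE_A": [6.2, 9.4, 6.0, 9.1, 6.1, 6.0],  # 2 of 6 above benchmark
    "GENE_B": [9.2, 9.0, 9.1, 9.5, 9.3, 9.1],  # 6 of 6 above benchmark
    "GENE_C": [6.0, 6.1, 5.9, 6.2, 6.0, 6.3],  # 0 of 6 above benchmark
}
consistent = overlay([cohort_1, cohort_2], benchmark, expected_fraction=1 / 3)
```

In this toy case only the gene that tracks a one-third subset in both cohorts survives the overlay; a gene that matches in one cohort but not the other is discarded, which mirrors the requirement of consistency across all cohorts.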
As a case in point, an Ican investigation of Ewing’s sarcoma (a cancer wherein metastasis is the major poor prognosis indicator) yielded several cosegregated chemokine ligand/receptor pairs in Ewing’s sarcoma patient subsets and helped uncover the first two chemokine receptors associated with either metastases or poor prognosis in Ewing’s sarcoma.42 To increase stringency, one may filter the patient-derived data sets through cell line-derived data sets. This can help eliminate genes not necessarily associated with the tumor cells but rather with the tumor stroma or tumor-infiltrating lymphocytes.42 Using such a strategy, we could, for example, zero in on two druggable receptors that represent viable therapeutic targets for the corresponding Ican patient subsets. Using Ican, another study uncovered a micro-RNA, miR-34a, as a major molecular determinant of chemosensitivity and patient survival.45 The analytical power of Ican can therefore help uncover genes and pathways with clinical significance from underpowered small-size cohorts, as well as from larger cohorts of highly heterogeneous diseases, which represent most diseases.4,28 We surmise that the growing number of gene expression profiling investigations, compounded by the mandatory submission of gene expression data sets to public repositories now required by most journals, will lead to a large field of Ican applications and ensuing prognostic factor and biomarker discoveries.
A similar bioinformatics methodology dubbed Integrative Transcriptome Analysis (Itan) was independently developed by research groups at Harvard University and Massachusetts Institute of Technology (MIT).46 In this case, a coanalysis of nine hepatocellular carcinoma (HCC) gene data sets derived from different populations and microarray platforms was sufficient to uncover a novel mechanism of TGF-dependent WNT signaling activation in a subset of HCC patients.46 In contrast to Ican, which uses publicly available gene data sets for hypothesis formulation and an additional cohort to experimentally test the hypothesis, Itan uses the larger publicly available gene data sets for training purposes (to avoid data overfitting to any given cohort) and uses the smaller publicly available data sets for testing. The latter was accomplished by subclass mapping, which utilizes hierarchical clustering, k-means clustering, and nonnegative matrix factorization as unsupervised clustering methods to identify tumor subclasses.47 As in Ican, the overriding principle here relies on molecular events consistently associated with particular tumor populations across all tested data sets. Based on this principle, the accuracy of both methods is dependent on the quality and number of data sets included in the analysis.

Limitations of Small-Size Cohort Bioinformatic Methods

Although Ican and Itan can be useful in inferring meaningfulness for any given gene (and corresponding pathway), they remain of limited value when assessing covariance of two or more genes across data sets, for example, to uncover gene networks associated with particular tumor subsets. This is because such analyses rely on Bayesian networks, Boolean networks, or the mathematics of product moment correlations, and, even assuming all samples were added to the data set randomly (ie, patients were recruited consecutively without any prior knowledge of their clustering into one or another tumor subset, and no patients were removed from the cohort based on criteria that relate to the query at hand), these analyses are highly dependent on sample size.48 Thus, despite the constraints imposed by data set conormalization procedures, analysis of meta-genes remains the method of choice here. For example, using the same data sets analyzed by Ican, product moment correlation analyses of meta-genes can determine whether a signaling pathway is on or off directly in tumor samples or whether signaling molecules are active within specific pathways.49 Similarly, these methods would be ineffectual in inferring the significance of tumor drivers that harbor activating mutations but whose gene expression remains unchanged.
In these cases, however, Ican and affiliated methods can be used in the analysis of the associated transcriptomes, provided that the gene in question, itself impervious to Ican analysis, imparts a characteristic downstream gene expression signature, as shown, for instance, for several tumor drivers.50–53 While future studies should give us a better feel for the usefulness of Ican in such cases, this and affiliated methods will certainly find ample application in biomarker discovery, in the search for markers of diagnosis, prognosis, patient risk stratification, and treatment response.37,42 Finally, though Ican is useful in the analysis of small-size cohorts, it requires multiple cohort data sets to infer differential gene expression significance. Unfortunately, many childhood cancers have very few (if any) gene expression data sets deposited in the public repositories, thus critically limiting the scope of Ican for these cancers. In this regard, an initiative of the NCI’s Office of Cancer Genomics and Cancer Therapy Evaluation Program, dubbed Therapeutically Applicable Research to Generate Effective Treatments (TARGET), which aims at characterizing the transcriptomes and genomes of hard-to-treat childhood cancers, is most welcome. TARGET has already generated data sets for childhood acute lymphoblastic leukemia and for neuroblastoma, and efforts are underway to generate genomic and expression profiling data sets for childhood acute myeloid leukemia, osteosarcoma, and renal tumors.
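
The sample-size dependence of product moment correlations noted in this section can be illustrated with the standard Fisher z-transformation, which gives an approximate confidence interval for a Pearson correlation coefficient. The sketch below is a generic statistical illustration rather than part of Ican or Itan; the correlation value of 0.5 and the sample sizes are arbitrary example inputs.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation r
    estimated from n samples, via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Interval width for the same observed correlation at different cohort sizes.
widths = {}
for n in (10, 27, 100, 1000):
    lo, hi = fisher_ci(0.5, n)
    widths[n] = hi - lo
    print(n, round(lo, 3), round(hi, 3))
```

The interval narrows as the number of samples grows, so a gene-to-gene correlation observed in a cohort of a few dozen samples is compatible with a wide range of true values, consistent with the limitation described above.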

Conclusions

The majority of gene data sets, including those of high-incidence adult cancers, are represented by small-size cohorts. Bioinformatic methods, such as Ican or Itan, can help analyze underpowered studies, provided that several data sets of the same disease type are available. Although it may still be necessary to experimentally validate findings in additional data sets, particularly when novel or little-known pathways are uncovered, these methods have proven sufficient to uncover, with high confidence, genes meaningful for a particular biological or pathological state from small-size cohorts. As most cancers are genetically and epigenetically heterogeneous and/or of low incidence, the cancer informatics of small-size cohorts will remain a tool of choice for grasping the brass ring of meaningful cancer-associated events in genomic and epigenomic data sets.

49 in total

Review 1. Gene symbol precision.

Authors: Barbara Bennani-Baiti; Idriss M Bennani-Baiti
Journal: Gene Date: 2011-10-12 Impact factor: 3.688

2. Genetic heterogeneity in human disease.

Authors: Jon McClellan; Mary-Claire King
Journal: Cell Date: 2010-04-16 Impact factor: 41.582

3. Inter- and intra-tumor profiling of multi-regional colon cancer and metastasis.

Authors: Akihiro Kogita; Yasumasa Yoshioka; Kazuko Sakai; Yosuke Togashi; Shunsuke Sogabe; Takuya Nakai; Kiyotaka Okuno; Kazuto Nishio
Journal: Biochem Biophys Res Commun Date: 2015-01-24 Impact factor: 3.575

4. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma.

Authors: Anoop P Patel; Itay Tirosh; John J Trombetta; Alex K Shalek; Shawn M Gillespie; Hiroaki Wakimoto; Daniel P Cahill; Brian V Nahed; William T Curry; Robert L Martuza; David N Louis; Orit Rozenblatt-Rosen; Mario L Suvà; Aviv Regev; Bradley E Bernstein
Journal: Science Date: 2014-06-12 Impact factor: 47.728

5. Intratumoral heterogeneity of receptor tyrosine kinases EGFR and PDGFRA amplification in glioblastoma defines subpopulations with distinct growth factor response.

Authors: Nicholas J Szerlip; Alicia Pedraza; Debyani Chakravarty; Mohammad Azim; Jeremy McGuire; Yuqiang Fang; Tatsuya Ozawa; Eric C Holland; Jason T Huse; Suresh Jhanwar; Margaret A Leversha; Tom Mikkelsen; Cameron W Brennan
Journal: Proc Natl Acad Sci U S A Date: 2012-02-08 Impact factor: 11.205

6. Microarray analysis of Ewing's sarcoma family of tumours reveals characteristic gene expression signatures associated with metastasis and resistance to chemotherapy.

Authors: Karl-Ludwig Schaefer; Martin Eisenacher; Yvonne Braun; Kristin Brachwitz; Daniel H Wai; Uta Dirksen; Claudia Lanvers-Kaminsky; Heribert Juergens; David Herrero; Sabine Stegmaier; Ewa Koscielniak; Angelika Eggert; Michaela Nathrath; Georg Gosheger; Dominik T Schneider; Carsten Bury; Raihanatou Diallo-Danebrock; Laura Ottaviano; Helmut E Gabbert; Christopher Poremba
Journal: Eur J Cancer Date: 2008-02-21 Impact factor: 9.162

7. Differential protein expression and oncogenic gene network link tyrosine kinase ephrin B4 receptor to aggressive gastric and gastroesophageal junction cancers.

Authors: Britta Liersch-Löhn; Nadia Slavova; Heinz J Buhr; Idriss M Bennani-Baiti
Journal: Int J Cancer Date: 2015-10-15 Impact factor: 7.396

8. Global gene expression in Ha-ras and B-raf mutated mouse liver tumors.

Authors: Maike Jaworski; Carina Ittrich; Stephan Hailfinger; Michael Bonin; Albrecht Buchmann; Michael Schwarz; Christoph Köhle
Journal: Int J Cancer Date: 2007-09-15 Impact factor: 7.396

9. Subclonal diversification of primary breast cancer revealed by multiregion sequencing.

Authors: Lucy R Yates; Moritz Gerstung; Stian Knappskog; Christine Desmedt; Gunes Gundem; Peter Van Loo; Turid Aas; Ludmil B Alexandrov; Denis Larsimont; Helen Davies; Yilong Li; Young Seok Ju; Manasa Ramakrishna; Hans Kristian Haugland; Peer Kaare Lilleng; Serena Nik-Zainal; Stuart McLaren; Adam Butler; Sancha Martin; Dominic Glodzik; Andrew Menzies; Keiran Raine; Jonathan Hinton; David Jones; Laura J Mudie; Bing Jiang; Delphine Vincent; April Greene-Colozzi; Pierre-Yves Adnet; Aquila Fatima; Marion Maetens; Michail Ignatiadis; Michael R Stratton; Christos Sotiriou; Andrea L Richardson; Per Eystein Lønning; David C Wedge; Peter J Campbell
Journal: Nat Med Date: 2015-06-22 Impact factor: 53.440

10. Subclass mapping: identifying common subtypes in independent disease data sets.

Authors: Yujin Hoshida; Jean-Philippe Brunet; Pablo Tamayo; Todd R Golub; Jill P Mesirov
Journal: PLoS One Date: 2007-11-21 Impact factor: 3.240
