
Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review.

Leo Lahti, Martin Schäfer, Hans-Ulrich Klein, Silvio Bicciato, Martin Dugas.

Abstract

A variety of genome-wide profiling techniques are available to investigate complementary aspects of genome structure and function. Integrative analysis of heterogeneous data sources can reveal higher level interactions that cannot be detected based on individual observations. A standard integration task in cancer studies is to identify altered genomic regions that induce changes in the expression of the associated genes based on joint analysis of genome-wide gene expression and copy number profiling measurements. In this review, we highlight common approaches to genomic data integration and provide a transparent benchmarking procedure to quantitatively compare method performances in cancer gene prioritization. Algorithms, data sets and benchmarking results are available at http://intcomp.r-forge.r-project.org.

Year: 2012 PMID： 22441573 PMCID： PMC3548603 DOI： 10.1093/bib/bbs005

Source DB: PubMed Journal: Brief Bioinform ISSN： 1467-5463 Impact factor: 11.622

INTRODUCTION

Genome-wide profiling technologies, in particular microarrays and next-generation sequencing, are used to characterize disease-associated changes at various levels of genome function. Identification of the key players (genes, chromosomal regions or biological processes) is a fundamental step toward mechanistic characterization of the disease and revealing molecular targets for potential therapeutic intervention. Genomic, transcriptomic, epigenomic and proteomic measurements characterize different aspects of genome regulation and function that are particularly relevant for cancer research [1, 2]. Integrative analysis has been used to prioritize disease genes or chromosomal regions for experimental testing, to discover disease subtypes [3, 4] or to predict patient survival or other clinical variables [5]. Co-occurring genomic observations are increasingly available in private and public repositories, such as the Cancer Genome Atlas database [6] and the Leukemia Gene Atlas [7], promoting wide access to data resources. However, the lack of algorithmic implementations forms a bottleneck hampering integrative approaches. The integration of gene expression (GE) and copy number (CN) data to identify DNA CN alterations that induce changes in the expression levels of the associated genes is a common task in cancer studies [8]. The detection of chromosomal regions with exceptionally high statistical association between CN and GE can pinpoint disease genes and potential cancer mechanisms [9, 10]. The first high-throughput analyses were reported about a decade ago [11-13], evidencing a clear cis-dosage effect of CN alterations on GE levels [14-16]. Although the downstream effect of CN alteration on GE is still a focus of ongoing research [17, 18], a systematic quantitative comparison of alternative approaches for integrating GE/CN data has been missing, as recently highlighted by Huang et al. [8].
Hence, we designed a quantitative benchmarking procedure to compare 12 publicly available methods for cancer gene prioritization based on integrative analysis of CN/GE profiling data on two simulated and three real case studies. In the following sections, we give a methodological overview, introduce the analysis pipeline and discuss the benchmarking results.

QUANTIFYING ASSOCIATIONS BETWEEN GE AND CN

The available implementations for the integrative analysis of GE and CN can be roughly divided into four main categories. In this section, we provide a general overview of these approaches with further references to individual algorithms.

Two-step approaches

A comparison of GE levels between groups of samples with distinct CN status aims at revealing CN-induced transcriptional responses. Several approaches either first assess the alterations in each data set separately and then compare the results, or assess alterations in GE in genes or genomic regions previously identified by an assessment of CN alterations, in order to model changes in GE based on the CN signals [16, 19]. This corresponds to the biological intuition concerning the cis-regulatory effect of CN alterations. In the first step, samples and genes are grouped based on estimated CN levels, estimated probabilities of CN alterations [20] or quantiles [21]. In the second step, differential GE is quantified either between such groups or independently (with respect to a reference sample) using standard approaches for GE analysis such as the t-test, which assesses the difference between two sample groups based on Gaussian assumptions [13]. Nonparametric [20, 22] and permutation-based alternatives [23, 24, 36] have also been suggested to relax the normality assumptions of the t-test. Cancer-associated changes often affect chromosomal regions with varying sizes, which potentially contain multiple genes. Therefore, some methods have been designed to specifically detect large regions affected by CN alteration rather than prioritize individual genes [19, 24]. Nevertheless, the regional modeling of GE and CN data can help to pinpoint individual driver genes whose expression is most notably affected by a larger chromosomal alteration.
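The two steps described above can be illustrated with a minimal sketch for a single gene. It is not the procedure of any of the cited tools: the function names, the log-ratio cutoff `gain_threshold` and the toy values are hypothetical, and Welch's variant of the t statistic is used here for simplicity.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two samples."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))


def two_step_score(cn_values, ge_values, gain_threshold=0.3):
    """Step 1: split samples by CN status. Step 2: compare GE between the groups.

    gain_threshold is a hypothetical log-ratio cutoff for calling a CN gain.
    Returns None when either group has fewer than two samples.
    """
    gained = [ge for cn, ge in zip(cn_values, ge_values) if cn > gain_threshold]
    normal = [ge for cn, ge in zip(cn_values, ge_values) if cn <= gain_threshold]
    if len(gained) < 2 or len(normal) < 2:
        return None
    return welch_t(gained, normal)


# Toy log ratios for one gene across eight samples (invented for illustration).
cn = [0.9, 1.1, 0.8, 1.0, 0.0, -0.1, 0.1, 0.05]
ge = [2.1, 2.4, 1.9, 2.2, 0.3, 0.1, 0.4, 0.2]
t_stat = two_step_score(cn, ge)
```

A large positive statistic suggests that GE is higher in the samples with a CN gain; in a genome-wide setting the statistic would be computed per gene and used for ranking.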

Regression approaches

Another class of tools uses regression models, generally with CN as the predictor and GE as the response variable, again exploiting the biological intuition concerning the cis-regulatory effect of CN alterations. Both linear [12] and nonlinear regression models [25] have been proposed. Univariate linear regression models have been designed to model the associations between individual CN and GE probes [26], and multiple and/or multivariate linear regression models have been designed to combine statistical power across multiple probes targeting adjacent genes or chromosomal positions [14, 26–28]. Regression models are theoretically related to correlation analysis. For instance, the square of Pearson’s correlation coefficient estimates the proportion of variance in the response variable that is explained by the predictor in a univariate linear regression. If the variables are standardized beforehand, the regression coefficient of the predictor variable equals Pearson’s correlation coefficient.
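The stated relation between univariate linear regression and Pearson correlation can be checked numerically. The following sketch uses ordinary least squares on invented toy values; the function names are illustrative and not taken from any of the cited packages.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_slope_and_r2(x, y):
    """Least-squares slope and explained-variance fraction for y ~ x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, 1.0 - ss_res / ss_tot


def standardize(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    m, s = statistics.fmean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


# Toy CN (predictor) and GE (response) values, invented for illustration.
cn = [0.1, 0.4, 0.35, 0.8, 1.2, 0.9]
ge = [0.3, 0.5, 0.9, 1.1, 2.0, 1.4]

r = pearson_r(cn, ge)
_, r_squared = ols_slope_and_r2(cn, ge)
slope_standardized, _ = ols_slope_and_r2(standardize(cn), standardize(ge))
```

Here `r_squared` agrees with `r ** 2`, and `slope_standardized` agrees with `r`, up to floating-point error.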

Correlation-based approaches

DR-Correlate [21] and a modified version of the Ortiz-Estevez algorithm [16] use correlation-based analysis to scan over the genome and detect loci with exceptionally high associations between CN and GE. To address potential shortcomings with respect to a biologically inadequate reflection of CN and GE abnormalities by ordinary correlation analysis, Schäfer et al. [29] substitute sample means by the reference medians, and Lipson et al. [30] use quantile-based analysis to obtain improved correlation coefficients. Furthermore, canonical correlation analysis (CCA) has been suggested to identify general linear associations between CN and GE data through flexible detection of weighted combinations of probes, which reveal maximal correlations between the two data sources. This is expected to more efficiently distinguish the relevant shared variation of the GE/CN data from the data set-specific effects [34]. Various modifications for dimensionality reduction and model regularization have also been proposed based on principal component analysis [31] and penalized approaches based on LASSO, elastic net or other constraints to obtain sparse or regularized versions of CCA [5, 32–34]. Although regularization may reduce overfitting and sparsity can simplify interpretation of the results, setting the appropriate regularization parameters may be a challenging task.
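As a simple illustration of correlation-based scoring, the sketch below computes the Pearson correlation between CN and GE for one gene and attaches an empirical P-value by permuting the sample labels. This is a generic permutation scheme chosen for illustration, not the procedure implemented in DR-Correlate, edira or the other cited tools; the function names, the seed and the toy values are assumptions.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(cn, ge, n_perm=1000, seed=0):
    """Empirical two-sided P-value for the CN/GE correlation of one gene.

    The GE values are shuffled across samples to break the pairing with CN.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(cn, ge))
    shuffled = list(ge)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(cn, shuffled)) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy log ratios for one gene across eight samples (invented for illustration).
cn = [0.9, 1.1, 0.8, 1.0, 0.0, -0.1, 0.1, 0.05]
ge = [2.1, 2.4, 1.9, 2.2, 0.3, 0.1, 0.4, 0.2]
p_value = permutation_p_value(cn, ge)
```

Genes would then be ranked by their P-values or by the correlation itself.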

Latent variable models

Latent variable approaches are used to directly model the data-generating processes. For instance, the pint/simcca algorithm [34] decomposes GE and CN data sets into shared and independent Gaussian components based on regularized probabilistic CCA. A comparison of the shared and data set-specific signals is used to pinpoint chromosomal regions with exceptionally high levels of dependence between the GE/CN observations. Related matrix decomposition models and iterative, dependence-seeking projections have been suggested based on generalized singular value decomposition [3] and independent component analysis [35]. The advantage of latent variable models in comparison with the two-step-, correlation- or regression-based approaches is that they explicitly model both the signal and noise in the data, and take into account the uncertainty in the model by integrating over the unknown latent variables. These properties help distinguish signal from noise in a robust manner, but often come at an increased computational cost.

BENCHMARKING THE ALGORITHMS

Manual literature search in PubMed and Google Scholar using combinations of the keywords ‘gene expression’, ‘copy number’, ‘integration’ and inspection of the Bioconductor repository (http://www.bioconductor.org) were performed to identify available implementations, yielding 12 algorithms that were applicable for cancer gene prioritization based on integrative analysis of GE/CN data (Table 1). The source code for Ortiz-Estevez [16] was obtained from the authors. An automated benchmarking pipeline was created to compare method performance on two simulated data sets and three real case studies (http://intcomp.r-forge.r-project.org).

Table 1:

Summary of the comparison algorithms

Implementation	CN preprocessing	Methodology	Significance scoring	Reference
CNAmet (R)	Called	Custom statistic; Two step	PPT; aberrant regions	[24, 36]
DR-Correlate/t-test (BC)	Raw/segmented	Two step	PPT; P-values	[21]
DR-Correlate (BC)	Raw/segmented	COR	PPT; P-values	[21]
edira (R)	Raw/segmented	Custom statistic; COR	NT; P-values	[29]
intCNGEan (R)	cghCall object	Custom statistic; Two step	PNT; P-values	[20]
Ortiz-Estevez (R)	Raw/segmented	Two step	PNT; P-values	[16]
PMA (CRAN)	Raw/segmented	LV; COR	PLV; P-values	[56]
PREDA/SODEGIR (BC)	Raw/segmented	Custom statistic; Two step	PPT; aberrant regions/q-values	[19, 48]
pint/simcca	Raw/segmented	LV; COR	PLV; P-values	[34]
SIM (BC)	Raw/segmented	REG	PT; P-values	[26]

The implementations are available through Bioconductor (BC); CRAN or R source code (R). The CN preprocessing methods required by each algorithm are listed. COR, correlation analysis; REG, regression analysis; LV, latent variables analysis; PT, parametric test; NT, nonparametric test; PNT, permutation test based on statistic of nonparametric test; PPT, permutation test based on statistic of parametric test; PLV, permutation test based on latent variable score.

Each method was used to prioritize candidate cancer genes, followed by a comparison with a gold standard list of known cancer genes, and ranking of the methods based on receiver operating characteristic (ROC) analysis of the prioritized gene lists and running times. Investigating the true positive rate among the top findings complemented the standard area under curve (ROC/AUC) analysis, which considers the overall prioritized gene list. Default parameters for each method were used where possible. The following exceptions were made to apply the algorithms to cancer gene prioritization. In DR-Correlate [21], empirical P-values from 1000 random gene permutations were used to rank the genes. The DR-Correlate t-test option was not applicable on the Ferrari simulations due to the low number of replicate samples. CNAmet [24, 36] requires called CN values and provides separate lists for amplifications and deletions; thus, the two lists were pooled and ranked based on the P-values. Moreover, to enable an unbiased AUC comparison of CNAmet with all other methods (that prioritize all genes), random ranks were assigned to genes labeled by CNAmet with no P-value (nonsignificant genes). With intCNGEan [20], the weighted Mann–Whitney test with univariate analysis was used with an effective P-value threshold of 0.1. In pint/simcca [34], segmented CN data were used only when the resolution of the CN platform was higher than the resolution of the GE microarray.
In PREDA/SODEGIR, we used ‘spline’ for smoothing, 1000 random gene orderings of the output regions and the median AUC as an unbiased output for gene prioritization. For all methods, GE and CN probes were matched by selecting for each GE probe the closest CN probe within the same chromosomal arm. One-to-one matching between the GE and CN data was required in the real case studies [34, 37]; in simulation experiments, the original simulation procedures [19, 29] were followed as described below. The preprocessing of CN data depends partially on the platform resolution. On the latest high-density SNP arrays, for instance, segmentation strategies are essential for estimating the CN for individual genes [8]. Various approaches investigate only certain genomic regions at a time, e.g. to avoid bias, and propose different strategies to select the size of the chromosomal region, including fixed windows in terms of consecutive probes or base pairs [28, 30, 34], chromosome arms or minimal common regions [26] or performing kernel regression [19], where the probe signals are modeled with a smoothing function which accounts for the nonuniform distribution of the genes along the genome.
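The probe matching rule described above (for each GE probe, the closest CN probe within the same chromosomal arm) can be sketched as follows. The tuple layout `(probe_id, arm, position)`, the function name and the toy probes are assumptions made for illustration; the sketch does not reproduce the code of the benchmarking pipeline, and it does not enforce one-to-one matching.

```python
from bisect import bisect_left


def match_probes(ge_probes, cn_probes):
    """Map each GE probe to the nearest CN probe on the same chromosomal arm.

    Both inputs are lists of (probe_id, arm, position) tuples.
    GE probes on arms without any CN probe are left unmatched.
    """
    by_arm = {}
    for probe_id, arm, position in cn_probes:
        by_arm.setdefault(arm, []).append((position, probe_id))
    for candidates in by_arm.values():
        candidates.sort()

    matches = {}
    for probe_id, arm, position in ge_probes:
        candidates = by_arm.get(arm)
        if not candidates:
            continue
        # Index of the first CN probe at or beyond the GE probe position.
        i = bisect_left(candidates, (position,))
        # Only the neighbours on either side of that index can be closest.
        neighbours = candidates[max(i - 1, 0):i + 1]
        best = min(neighbours, key=lambda c: abs(c[0] - position))
        matches[probe_id] = best[1]
    return matches


# Toy probes (identifiers and coordinates invented for illustration).
cn_probes = [("cn1", "1p", 100), ("cn2", "1p", 500), ("cn3", "1q", 300)]
ge_probes = [("g1", "1p", 180), ("g2", "1p", 450), ("g3", "1q", 10), ("g4", "2p", 50)]
matches = match_probes(ge_probes, cn_probes)
```

In this toy example `g4` stays unmatched because no CN probe lies on arm 2p.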

Simulated data

Two simulated data sets were generated by roughly following Schäfer et al. ([29]; ‘Schäfer’ data) and Bicciato et al. ([19]; ‘Ferrari’ data). The simulations are based on general assumptions regarding the associations between the (altered) CN and GE signals in genome-wide profiling studies, as detailed in the original publications. For the ‘Schäfer’ data set, CN and GE values are drawn from a normal mixture where two components represent aberrations of different extent for each locus; 100 samples were created for each input with mixing proportions of either 10% or 90% for the affected and normal regions. Varying noise levels were imposed using multiple variance parameters (0.25, 0.5, 1, 2 and 4 times an adjusted median absolute deviation of the data). The data points are organized in 16 equally sized blocks to mimic affected regions. The ‘Ferrari’ data with six samples was created by manipulating a renal cell carcinoma data set through permutation of loci and adding or subtracting constants to both CN and GE values within 10 blocks of 10 Mbp. Normal control data was generated by subtracting the median across the samples [19].
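To give a flavor of block-structured simulation, the following sketch generates matched CN and GE matrices in which a few blocks of loci carry a shared mean shift in a fraction of the samples. It is only loosely inspired by the ‘Schäfer’ setup described above and does not reproduce the original procedure: the choice of affected blocks, the shift, the noise level, the block size and the seed are all hypothetical.

```python
import random


def simulate_blocks(n_samples=100, n_blocks=16, block_size=10,
                    affected_blocks=(3, 8, 12), mix_prop=0.1,
                    shift=1.0, noise_sd=0.5, seed=0):
    """Simulate CN and GE matrices (loci x samples) with affected blocks.

    Within an affected block, each sample is aberrant with probability
    mix_prop; aberrant samples get a mean shift in both CN and GE.
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    cn, ge, is_affected = [], [], []
    for block in range(n_blocks):
        affected = block in affected_blocks
        # One aberration status per sample and block, shared by CN and GE.
        aberrant = [affected and rng.random() < mix_prop for _ in range(n_samples)]
        for _ in range(block_size):
            cn.append([rng.gauss(shift if a else 0.0, noise_sd) for a in aberrant])
            ge.append([rng.gauss(shift if a else 0.0, noise_sd) for a in aberrant])
            is_affected.append(affected)
    return cn, ge, is_affected


cn, ge, is_affected = simulate_blocks()
```

The returned `is_affected` flags serve as the ground truth against which a prioritized list of loci can be evaluated.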

Real case studies

We investigated two publicly available breast cancer data sets [12, 13] and a leukemia study [38]. Expert-curated lists of known breast cancer genes [39] and leukemia genes from the Cancer Gene Census [40] were used as the ground truth for the benchmarking experiments, respectively. The preprocessed ‘Hyman’ data set [13] contains 14 breast cancer cell lines, 7489 genes and 48 known breast cancer genes. The preprocessed ‘Pollack’ data set [12] contains 41 breast cancer samples, 4287 genes and 38 known breast cancer genes. The preprocessed ‘Mullighan’ data set consists of 171 acute lymphoblastic leukemia (ALL) samples divided into 9 subtypes [38, 41], 2162 genes in the matched CN/GE data and 39 known leukemia genes. A combination of standard algorithms was used to preprocess the 500 K Affymetrix CN data [42-44] and the Affymetrix GE data [45-47] for the Mullighan data set. The CN data (Affymetrix Human Mapping 500 K) was downloaded from ftp://ftp.studje.org and normalized with CRMA v2 [42]. The log-additive model from the CRMA v1 algorithm [43] was used for probe summarization. Data values from the Nsp and Sty arrays of the 500 K set were combined and segmented with CBS [44]. GE profiles of the same ALL specimens, measured with the Affymetrix HG-U133A platform, were obtained from GEO (GSE12995; [45]) and preprocessed with the RPA algorithm [46] and EntrezID-based custom chip definition file (v13; [47]). The reference for GE and CN data was defined as the median normalized log ratios across all samples. In all data sets, probes with no EntrezID or location information and probes mapping to multiple locations or to sex chromosomes were excluded. Missing values were imputed by Gaussian random samples using the mean and variance of the data.
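The imputation step mentioned above can be sketched as follows. The text does not state whether the mean and variance were computed over the whole data matrix or per gene; this sketch assumes the whole matrix, and the function name, the representation of missing values (`None` or NaN) and the seed are illustrative choices.

```python
import math
import random
import statistics


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or math.isnan(value)


def impute_gaussian(matrix, seed=0):
    """Replace missing entries by draws from N(mean, sd) of the observed data.

    The mean and standard deviation are taken over all observed entries of
    the matrix (an assumption; a per-gene variant is equally conceivable).
    """
    rng = random.Random(seed)
    observed = [v for row in matrix for v in row if not is_missing(v)]
    mu = statistics.fmean(observed)
    sd = statistics.stdev(observed)
    return [[rng.gauss(mu, sd) if is_missing(v) else v for v in row]
            for row in matrix]


# Toy matrix with two missing entries (values invented for illustration).
data = [[1.0, None, 3.0], [float("nan"), 5.0, 6.0]]
filled = impute_gaussian(data)
```

Observed values are returned unchanged; only the missing entries are replaced.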

RESULTS

The cancer gene prioritization performance of the comparison methods as quantified by the AUC analysis is summarized in Figure 1 (for the ROC curves, see Supplementary Figure S1). The highest median ranking across the five benchmarking data sets was obtained by edira (1), followed by Ortiz-Estevez (4) and pint/simcca (4). Each of these three methods outperformed the others on at least one data set. Note that the performance of edira with the ‘Schäfer’ data set and of PREDA/SODEGIR with the ‘Ferrari’ data set needs to be carefully interpreted, since these simulations were originally constructed to follow the particular modeling assumptions of these algorithms in the original publications [19, 29]. The complete benchmarking results are available at the project website.

Figure 1:

AUC values in ROC analysis quantify cancer gene prioritization performance of the methods for the five benchmarking data sets. High values indicate high true-positive versus false-positive ratio among the top findings; the dashed line indicates the expected AUC value for a random gene list (AUC = 0.5). The methods have been ordered by their median rank across all data sets. For the ROC curves, see Supplementary Figure S1.

Considering the true-positive rate among the top 200 genes of each algorithm, pint/simcca had the highest median ranking (1), followed by edira, Ortiz-Estevez and PREDA/SODEGIR (3; Supplementary Figure S2). These methods had systematically the highest median rankings with multiple thresholds (20, 50 and 100 top genes). Notably, although edira and PREDA/SODEGIR had the highest AUC scores on the Schäfer data, most of the other algorithms outperformed these methods with respect to known true positives among the top findings in this data set. Differences regarding the running times were considerable (Supplementary Table S1). Specifically, edira and PMA were the fastest methods with less than 1 min running time in all data sets, closely followed by Ortiz-Estevez with a maximum running time of <3 min. The number of permutations in significance testing markedly affects the running times of CNAmet, DR-Correlate, intCNGEan and PREDA/SODEGIR, although in the latest version of PREDA/SODEGIR a parallelized version has been implemented to reduce computation time [48].
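The two evaluation measures used in this section, the AUC of a prioritized gene list against a list of known cancer genes and the number of known genes among the top findings, can be sketched as follows. The AUC is computed through its rank-based (Mann-Whitney) formulation and assumes a strict ranking without ties; the function names and the toy gene labels are illustrative and not taken from the benchmarking package.

```python
def roc_auc(ranked_genes, known_genes):
    """AUC of a ranked gene list (best first) against a set of known genes.

    Equals the fraction of (known, other) gene pairs in which the known
    gene is ranked above the other gene. Ties are not handled.
    """
    known = set(known_genes)
    n_pos = sum(1 for gene in ranked_genes if gene in known)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one known and one other gene")
    negatives_seen = 0
    pairs_won = 0
    for gene in ranked_genes:
        if gene in known:
            # All negatives not yet seen are ranked below this known gene.
            pairs_won += n_neg - negatives_seen
        else:
            negatives_seen += 1
    return pairs_won / (n_pos * n_neg)


def true_positives_in_top(ranked_genes, known_genes, k=200):
    """Number of known genes among the first k genes of the ranked list."""
    known = set(known_genes)
    return sum(1 for gene in ranked_genes[:k] if gene in known)


# Toy ranking of six genes, two of which are known cancer genes.
ranking = ["A", "B", "C", "D", "E", "F"]
known = {"A", "C"}
auc = roc_auc(ranking, known)
top2 = true_positives_in_top(ranking, known, k=2)
```

A random ranking has an expected AUC of 0.5, which corresponds to the dashed line in Figure 1.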

DISCUSSION

Prioritization of disease genes is a key modeling task in functional genomics [49-52]. This review provides an overview and quantitative benchmarking of publicly available algorithms for detecting associations between GE and CN alterations. Our work complements the recent review by Huang et al. [8], who pointed out the lack of quantitative comparisons of the available methods. The ‘intcomp’ benchmarking package applied in this review is freely available at R-forge (http://intcomp.r-forge.r-project.org) to facilitate transparent comparisons and the addition of new algorithms, benchmarking procedures and validation data sets. The comparison of 12 algorithms with respect to their cancer gene prioritization performance revealed systematic differences across independent data sets, preprocessing scenarios and sample sizes. Interestingly, while no systematic differences between the four main categories of GE/CN integration approaches were seen, systematic differences between individual methods were evident. In particular, edira, Ortiz-Estevez and pint/simcca consistently outperformed the other methods. Considering both relative performance and running time, edira and Ortiz-Estevez seem to offer an optimal trade-off, although all methods have acceptable running times for practical applications. While none of the methods outperformed the others in all data sets, identification of the few best-performing implementations provides quantitative guidance for the selection of analysis tools and therefore has direct practical relevance for cancer studies. Benchmarking the algorithms on real data is crucial since simulation studies are unlikely to capture all complexities present in real data. However, the availability of suitable benchmarking data sets is limited. We selected publicly available data sets in which both GE and CN data from the same samples are available and independent lists of known cancer genes obtained from the literature.
The model performance is in general better in the simulation studies, compared to the real cancer data sets, suggesting that manually curated cancer gene lists may be only coarse approximations of the ground truth in the real case studies and that simulations may have lower noise levels. On the other hand, simulation procedures are only rough approximations of the biological reality and the simulation schema can remarkably affect model performance. For instance, variants of DR-Correlate and CNAmet performed well with ‘Schäfer’ simulated data, but their performance dropped close to random expectation in the ‘Ferrari’ data set. The ‘Ferrari’ simulations assume that the CN effect is visible in all tumor samples, which can be particularly disadvantageous for DR-Correlate and other methods that rely on variations between the aberration profiles across the samples. The ‘Ferrari’ and ‘Schäfer’ simulated data sets were originally designed to evaluate the performances of PREDA/SODEGIR and edira methods, and this aspect potentially causes positive bias on these methods in the respective data sets. Moreover, certain methods, such as CNAmet [36], Ortiz-Estevez [16] or PREDA/SODEGIR [19], have originally been designed to prioritize altered chromosomal regions rather than individual genes. Our benchmarking procedure is based on the prioritization of individual genes since this is the most prevalent objective shared by the available GE/CN integration algorithms. Since chromosomal CN alterations represent a key feature of cancer, well-performing GE/CN analysis methods are expected to have a good prioritization performance of known cancer genes. However, certain cancer genes may be overlooked by integrative approaches that focus only on simultaneous changes in both GE and CN levels since gene activity is also affected by cellular mechanisms other than GE/CN alterations. 
For this reason, it was not unexpected that 33–73% of the known cancer genes were not included among the first 200 prioritized genes by any comparison method in the five benchmarking data sets. The relatively low number (0–8) of the known cancer genes among the first 200 findings in the real case studies highlights the need for efficient approaches to identify key mutations and genes that drive cancer development and progression [23]. Moreover, although every algorithm detected certain cancer genes, none of the known cancer genes was detected by all methods in any benchmarking data set among the first 200 findings. Since different methods emphasize different aspects of the GE/CN data, efficient joint analysis of the results from multiple independent methodologies might outperform individual methods. One could, for instance, consider mean or median ranks across the prioritized lists, or weight the different lists according to certain criteria. Related approaches have been suggested elsewhere [49], but have not been investigated in the context of GE/CN analysis yet. In our experiments, straightforward ranking of the genes based on their mean or median rank across the different methods did not outperform the best-performing methods in any benchmarking data set. The choice of preprocessing and model parameters can have a marked effect on the results. The key decisions in the context of GE/CN data are associated with selecting the CN preprocessing approach [53], size of the investigated chromosomal regions and the matching approach for the integrated data sets. These and related issues are extensively discussed in the recent review by Huang et al. [8]. It is also possible to utilize class information of the samples, for instance, by including both tumor and reference samples [21]. However, in many cases, the references are included as a pooled control for two-color microarray experiments but not as a separate group, as with the Hyman and Pollack data sets.
Moreover, genomic aberrations often affect only a subset of the cancer patients, and multiple cancer subtypes may be present, as in the Mullighan data set. The matching approach for GE/CN data may also affect the results. In the current pipeline, each GE probe is matched to the closest CN probe or segment. Requiring one-to-one matching of the GE/CN data may lead to exclusion of many GE probes in particular on high-density arrays such as in the Mullighan data set. The publicly available benchmarking pipeline will allow further experimentation with alternative preprocessing scenarios. All data presented in this study come from microarray studies, where several matched GE/CN data sets are available from public sources, but the approach should be in principle applicable also to high-throughput sequencing data. Since the underlying biological phenomena remain unaltered, and methodological approaches proposed for GE/CN integration are based on relatively general modeling assumptions, it can be expected that the proposed methods are applicable also in the context of next-generation sequencing after appropriate data preprocessing. Further integrative tasks in GE/CN analysis would include modeling of trans-regulatory effects of CN aberrations on genes outside the affected region [54, 55], disease subtype discovery [4], prediction of patient survival or of clinical covariates [56] and integrative analysis of other data sources, such as methylation [57], microRNA [58-59] or protein expression [60]. However, fewer implementations for such tasks are currently available. Availability of reference implementations would facilitate benchmarking and optimizing new algorithms. The benchmarking pipeline introduced in this review can be adjusted to incorporate additional algorithms and data sets as they become available.
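The straightforward rank aggregation mentioned in the discussion (ordering genes by their mean or median rank across methods) can be sketched as follows. The function name, the input layout and the alphabetical tie-breaking are assumptions made for illustration, and the sketch requires that every method ranks the same set of genes.

```python
import statistics


def aggregate_ranks(rankings, how="median"):
    """Combine several ranked gene lists into one consensus ranking.

    rankings maps a method name to a list of genes (best first); all lists
    must contain the same genes. Genes are ordered by their median (or mean)
    rank across methods; ties are broken alphabetically.
    """
    combine = statistics.median if how == "median" else statistics.fmean
    positions = {
        method: {gene: rank for rank, gene in enumerate(genes, start=1)}
        for method, genes in rankings.items()
    }
    genes = next(iter(rankings.values()))
    scores = {gene: combine([positions[m][gene] for m in positions])
              for gene in genes}
    return sorted(genes, key=lambda gene: (scores[gene], gene))


# Toy rankings from three hypothetical methods.
rankings = {
    "method1": ["A", "B", "C"],
    "method2": ["B", "A", "C"],
    "method3": ["A", "C", "B"],
}
consensus_median = aggregate_ranks(rankings, how="median")
consensus_mean = aggregate_ranks(rankings, how="mean")
```

As noted above, such simple aggregation did not outperform the best individual methods in the benchmarking experiments; weighted schemes would require additional criteria for the weights.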

CONCLUSION

A variety of methods is available for the integrative analysis of GE and CN data. The algorithms can be classified as two-step, regression, correlation-based and latent variable approaches. Implementation quality, running time and accuracy of the algorithm, as well as preprocessing, sample size and availability of control samples need to be considered when selecting the appropriate method. The benchmarking pipeline reveals systematic differences in cancer gene prioritization performance of available implementations across five case studies.

SUPPLEMENTARY DATA

Supplementary Data are available online at http://bib.oxfordjournals.org/. Integrative analysis algorithms for GE and CN data include two-step, regression, correlation-based and latent variable approaches. The benchmarking pipeline reveals systematic differences in cancer gene prioritization performance of currently available implementations. Implementation quality, running time and accuracy of the algorithm, as well as data preprocessing, sample size and availability of control samples need to be considered when selecting the analysis approach.

FUNDING

This work was supported by EuGESMA COST Action BM0801 (European Genomics and Epigenomics Study on MDS and AML). L.L. has been supported by Helsinki Institute for Information Technology HIIT and Finnish Center of Excellence on Adaptive Informatics Research (AIRC). M.D. is supported by the European Leukemia Network of Excellence (LSHC-CT-2004); Deutsche Kinderkrebsstiftung (grant number DKS 2010.21) and Carreras Foundation (grant number DJCLS 09/04). M.S. is supported by the Deutsche Forschungsgemeinschaft (Research Training Group Statistical Modeling). S.B. is supported from AIRC Special Program Molecular Clinical Oncology ‘5 per mille’.

56 in total

1. Nonparametric testing for DNA copy number induced differential mRNA gene expression.

Authors: Wessel N van Wieringen; Mark A van de Wiel
Journal: Biometrics Date: 2008-05-13 Impact factor: 2.571

2. PREDA: an R-package to identify regional variations in genomic data.

Authors: Francesco Ferrari; Aldo Solari; Cristina Battaglia; Silvio Bicciato
Journal: Bioinformatics Date: 2011-07-07 Impact factor: 6.937

3. Regularized Multivariate Regression for Identifying Master Predictors with Application to Integrative Genomics Study of Breast Cancer.

Authors: Jie Peng; Ji Zhu; Anna Bergamaschi; Wonshik Han; Dong-Young Noh; Jonathan R Pollack; Pei Wang
Journal: Ann Appl Stat Date: 2010-03 Impact factor: 2.083

4. Identification of novel cluster groups in pediatric high-risk B-precursor acute lymphoblastic leukemia with gene expression profiling: correlation with genome-wide DNA copy number alterations, clinical characteristics, and outcome.

Authors: Richard C Harvey; Charles G Mullighan; Xuefei Wang; Kevin K Dobbin; George S Davidson; Edward J Bedrick; I-Ming Chen; Susan R Atlas; Huining Kang; Kerem Ar; Carla S Wilson; Walker Wharton; Maurice Murphy; Meenakshi Devidas; Andrew J Carroll; Michael J Borowitz; W Paul Bowman; James R Downing; Mary Relling; Jun Yang; Deepa Bhojwani; William L Carroll; Bruce Camitta; Gregory H Reaman; Malcolm Smith; Stephen P Hunger; Cheryl L Willman
Journal: Blood Date: 2010-08-10 Impact factor: 22.113

5. integrOmics: an R package to unravel relationships between two omics datasets.

Authors: Kim-Anh Lê Cao; Ignacio González; Sébastien Déjean
Journal: Bioinformatics Date: 2009-08-25 Impact factor: 6.937

6. Integrative analysis of gene expression and copy number alterations using canonical correlation analysis.

Authors: Charlotte Soneson; Henrik Lilljebjörn; Thoas Fioretos; Magnus Fontes
Journal: BMC Bioinformatics Date: 2010-04-15 Impact factor: 3.169

7. Integrated gene copy number and expression microarray analysis of gastric cancer highlights potential target genes.

Authors: Samuel Myllykangas; Siina Junnila; Arto Kokkola; Reija Autio; Ilari Scheinin; Tuula Kiviluoto; Marja-Liisa Karjalainen-Lindsberg; Jaakko Hollmén; Sakari Knuutila; Pauli Puolakkainen; Outi Monni
Journal: Int J Cancer Date: 2008-08-15 Impact factor: 7.396

8. Global associations between copy number and transcript mRNA microarray data: an empirical study.

Authors: Wenjuan Gu; Hyungwon Choi; Debashis Ghosh
Journal: Cancer Inform Date: 2008-02-09

9. DR-Integrator: a new analytic tool for integrating DNA copy number and gene expression data.

Authors: Keyan Salari; Robert Tibshirani; Jonathan R Pollack
Journal: Bioinformatics Date: 2009-12-22 Impact factor: 6.937

10. A comparative study of genome-wide SNP, CGH microarray and protein expression analysis to explore genotypic and phenotypic mechanisms of acquired antiestrogen resistance in breast cancer.

Authors: Neil Johnson; Valerie Speirs; Nicola J Curtin; Andrew G Hall
Journal: Breast Cancer Res Treat Date: 2007-09-28 Impact factor: 4.872
