Jonatan Taminau, Cosmin Lazar, Stijn Meganck, Ann Nowé.
Abstract
An increasing number of microarray gene expression data sets are available through public repositories. Their huge potential for new findings has yet to be unlocked by making them available for large-scale analysis. To do so, it is essential that independent studies designed for similar biological problems can be integrated, so that new insights can be obtained. These insights would remain undiscovered when analyzing the individual data sets, because the small number of biological samples used per experiment is a well-known bottleneck in genomic analysis. Increasing the number of samples increases the statistical power, so that more general and reliable conclusions can be drawn. In this work, two approaches for conducting large-scale analysis of microarray gene expression data, meta-analysis and data merging, are compared in the context of the identification of cancer-related biomarkers by analyzing six independent lung cancer studies. We investigate the hypothesis that analyzing a large cohort of samples, obtained by merging independent data sets designed to study the same biological problem, results in lower false discovery rates than analyzing the same data sets within a more conservative meta-analysis approach.
Year: 2014 PMID: 25937953 PMCID: PMC4393058 DOI: 10.1155/2014/345106
Source DB: PubMed Journal: ISRN Bioinform ISSN: 2090-7338
Figure 1: Schematic overview of the two main approaches to integrative microarray analysis in the context of the identification of differentially expressed genes (DEGs). (a) Meta-analysis first derives results from each individual study and then combines the results. (b) Merging first combines the data and then derives a result from this large data set.
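The two workflows in Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the Welch t-statistic with a fixed cutoff stands in for a proper DEG test with multiple-testing correction, all function names are hypothetical, and the toy expression values are invented for the example.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values."""
    return (mean(a) - mean(b)) / sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))


def find_degs(study, cutoff=2.0):
    """Genes whose |t| between cancer and control exceeds `cutoff`.

    `study` maps gene -> {"control": [...], "cancer": [...]}.
    The fixed cutoff is an assumption made only for this sketch.
    """
    return {g for g, v in study.items()
            if abs(welch_t(v["cancer"], v["control"])) > cutoff}


def meta_analysis(studies):
    """(a) Analyze every study on its own, then intersect the DEG lists."""
    return set.intersection(*(find_degs(s) for s in studies))


def merge_then_analyze(studies):
    """(b) Pool the samples of the common genes, then analyze once."""
    common = set.intersection(*(set(s) for s in studies))
    merged = {
        g: {label: [x for s in studies for x in s[g][label]]
            for label in ("control", "cancer")}
        for g in common
    }
    return find_degs(merged)


# Invented toy data: gene A differs strongly, gene B only weakly.
study_1 = {
    "A": {"control": [1.0, 1.1, 0.9], "cancer": [3.0, 3.1, 2.9]},
    "B": {"control": [1.0, 1.4, 0.6], "cancer": [1.5, 1.9, 1.1]},
}
study_2 = {
    "A": {"control": [1.2, 1.0, 1.1], "cancer": [2.8, 3.0, 3.2]},
    "B": {"control": [1.0, 1.4, 0.6], "cancer": [1.5, 1.9, 1.1]},
}

print("meta-analysis:", sorted(meta_analysis([study_1, study_2])))
print("merging:      ", sorted(merge_then_analyze([study_1, study_2])))
```

In this toy data, gene B stays below the cutoff in each three-versus-three study but passes it once the samples are pooled, which mirrors the statistical power argument in the abstract. The sketch ignores batch effects between studies, which real merged data would need to address.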
List of six publicly available lung cancer microarray data sets used in this application.
| Data set | Platform | No. of genes | No. of samples (control/cancer) | Reference |
|---|---|---|---|---|
| GSE10072 | GPL96 | 12718 | | Landi et al. |
| GSE7670 | GPL96 | 12718 | | Su et al. |
| GSE31547 | GPL96 | 12718 | | - |
| GSE19804 | GPL570 | 19798 | | Lu et al. |
| GSE19188 | GPL570 | 19798 | | Hou et al. |
| GSE18842 | GPL570 | 19798 | | Sanchez-Palencia et al. |
| Total | | | | |
Number of differentially expressed genes (DEGs) for all individual data sets. The final result of this meta-analysis case is the intersection of the different lists in the last column.
| Data set | No. of DEGs (i) | No. of DEGs (resamp.) | No. of DEGs (intersection) (ii) |
|---|---|---|---|
| GSE10072 | 90 | 74 | 25 |
| GSE7670 | 79 | 52 | |
| GSE31547 | 67 | 43 | |
| GSE19804 | 158 | 109 | |
| GSE19188 | 351 | 284 | |
| GSE18842 | 499 | 398 | |
(i) Number of DEGs found on the complete data set without resampling. (ii) Number of DEGs in the intersection of the DEG lists of all single data sets after using resampling.
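The combination of resampling and intersection behind the table above can be illustrated with a short Python sketch. The resampling scheme is not specified here, so the leave-one-out (jackknife) loop below is an assumption chosen only because it is deterministic; the selector `select_high`, the other function names, and the toy values are likewise hypothetical.

```python
def select_high(samples, cutoff=1.0):
    """Toy selector (hypothetical): genes whose mean value exceeds `cutoff`.

    `samples` is a list of dicts mapping gene -> value, one dict per sample.
    """
    genes = samples[0].keys()
    return {g for g in genes
            if sum(s[g] for s in samples) / len(samples) > cutoff}


def jackknife_stable(samples, select):
    """Keep only genes that are selected in every leave-one-out subsample."""
    rounds = [select(samples[:i] + samples[i + 1:]) for i in range(len(samples))]
    return set.intersection(*rounds)


def intersect_across_studies(gene_sets):
    """Final meta-analysis result: genes present in every per-study list."""
    return set.intersection(*gene_sets)


# Invented data: in study_1, gene B is carried by a single outlier sample.
study_1 = [{"A": 2.0, "B": 0.0}] * 4 + [{"A": 2.0, "B": 10.0}]
study_2 = [{"A": 3.0, "B": 3.0}] * 5

full_1 = select_high(study_1)                      # selection without resampling
stable_1 = jackknife_stable(study_1, select_high)  # selection with resampling
stable_2 = jackknife_stable(study_2, select_high)
final = intersect_across_studies([stable_1, stable_2])
print(sorted(full_1), sorted(stable_1), sorted(final))
```

A gene that depends on one sample drops out once resampling is applied, which is one plausible reason the resampled counts in the table are smaller than the counts on the complete data sets.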
Number of differentially expressed genes (DEGs) for all merged data sets.
| BERM (i) | No. of DEGs (ii) | No. of DEGs (resamp.) | No. of DEGs (intersection) (iii) |
|---|---|---|---|
| NONE | 131 | 112 | 102 |
| BMC | 124 | 109 | |
| COMBAT | 125 | 110 | |
| DWD | 125 | 111 | |
| XPN | 143 | 123 | |
(i) BERM: batch effect removal method. (ii) Number of DEGs found on the complete data set without resampling. (iii) Number of DEGs in the intersection of the DEG lists for all batch effect removal methods after using resampling.
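To make the idea of a batch effect removal method concrete, the simplest entry in the table, BMC, can be sketched in Python. The abbreviation is not expanded here; the sketch assumes it means batch mean-centering, where each gene is shifted to zero mean within each study before the studies are concatenated. The function name and the toy values are hypothetical, and this is not the inSilicoMerging implementation.

```python
from statistics import mean


def batch_mean_center(batches):
    """Merge batches after centering every gene to zero mean per batch.

    `batches` is a list of dicts mapping gene -> list of expression values
    (one value per sample). Only genes present in all batches are kept.
    """
    common = sorted(set.intersection(*(set(b) for b in batches)))
    merged = {g: [] for g in common}
    for batch in batches:
        for g in common:
            m = mean(batch[g])  # per-batch, per-gene mean
            merged[g].extend(x - m for x in batch[g])
    return merged


# Invented data: the second study is shifted by a constant offset of 10.
batch_1 = {"G": [1.0, 2.0, 3.0], "only_in_1": [5.0, 5.0, 5.0]}
batch_2 = {"G": [11.0, 12.0, 13.0]}

merged = batch_mean_center([batch_1, batch_2])
print(merged)
```

Centering removes a constant per-study offset, but it also removes any genuine difference in mean expression between studies, for example when the studies have different proportions of control and cancer samples.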
Figure 2: Multidimensional scaling (MDS) plot of the merged data set with no batch effect removal. Samples are colored by the target biological variable of interest, and the different symbols correspond to the individual studies. The figure is generated using the plotMDS function from the inSilicoMerging R/Bioconductor package [25].
Figure 3: Boxplots for the ADRB1 gene. On the left are two boxplots for the merged data set without batch effect removal (NONE), and on the right two boxplots for the merged data set with batch effect removal (COMBAT). All boxplots are grouped and colored by the target biological variable of interest; the boxplots on top are further grouped per original data set. The figure is generated using the plotGeneWiseBoxPlot function from the inSilicoMerging R/Bioconductor package [25].
Figure 4: Boxplots for the LRRN3 gene. On the left are two boxplots for the merged data set without batch effect removal (NONE), and on the right two boxplots for the merged data set with batch effect removal (COMBAT). All boxplots are grouped and colored by the target biological variable of interest; the boxplots on top are further grouped per original data set. The figure is generated using the plotGeneWiseBoxPlot function from the inSilicoMerging R/Bioconductor package [25].