Claudia Schurmann, Katharina Heim, Arne Schillert, Stefan Blankenberg, Maren Carstensen, Marcus Dörr, Karlhans Endlich, Stephan B Felix, Christian Gieger, Harald Grallert, Christian Herder, Wolfgang Hoffmann, Georg Homuth, Thomas Illig, Jochen Kruppa, Thomas Meitinger, Christian Müller, Matthias Nauck, Annette Peters, Rainer Rettig, Michael Roden, Konstantin Strauch, Uwe Völker, Henry Völzke, Simone Wahl, Henri Wallaschofski, Philipp S Wild, Tanja Zeller, Alexander Teumer, Holger Prokisch, Andreas Ziegler.
Abstract
Microarray profiling of gene expression is widely applied in molecular biology and functional genomics. Experimental and technical variations make meta-analysis of different studies challenging. In a total of 3358 samples, all from German population-based cohorts, we investigated the effect of data preprocessing and the variability due to sample processing in whole blood cell and blood monocyte gene expression data, measured on the Illumina HumanHT-12 v3 BeadChip array. Gene expression signal intensities were similar after applying the log2 or the variance-stabilizing transformation. In all cohorts, the first principal component (PC) explained more than 95% of the total variation. Technical factors substantially influenced signal intensity values, especially the Illumina chip assignment (33-48% of the variance), the RNA amplification batch (12-24%), the RNA isolation batch (16%), and the sample storage time, in particular the time between blood donation and RNA isolation for the whole blood cell samples (2-3%), and the time between RNA isolation and amplification for the monocyte samples (2%). White blood cell composition parameters were the strongest biological factors influencing the expression signal intensities in the whole blood cell samples (3%), followed by sex (1-2%) in both sample types. Known single nucleotide polymorphisms (SNPs) were located in 38% of the analyzed probe sequences and 4% of them included common SNPs (minor allele frequency >5%). Out of the tested SNPs, 1.4% significantly modified the probe-specific expression signals (Bonferroni corrected p-value<0.05), but in almost half of these events the signal intensities were even increased despite the occurrence of the mismatch. Thus, the vast majority of SNPs within probes had no significant effect on hybridization efficiency. In summary, adjustment for a few selected technical factors greatly improved reliability of gene expression analyses. Such adjustments are particularly required for meta-analyses.
Year: 2012 PMID: 23236413 PMCID: PMC3517598 DOI: 10.1371/journal.pone.0050938
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Cohort characteristics.
| Variable (mean/SD) | SHIP-TREND | KORA F4 | GHS |
| Sample size | 991 | 993 | 1374 |
| Storage time [days] | 204.0±153.8 | 855.5±179.4 | 314.4±91.6 |
| RNA integrity number (RIN) | 8.56±0.50 | 8.68±0.61 | 9.36±0.43 |
| Females (%) | 555 (56.0) | 493 (49.6) | 622 (48.4) |
| Age [years] | 50.1±13.7 | 70.4±5.4 | 54.7±11.0 |
| Body height [cm] | 169.8±9.0 | 165.3±8.8 | 171.0±9.3 |
| Body weight [kg] | 79.0±15.1 | 78.9±13.7 | 79.1±15.5 |
| Body mass index [kg/m2] | 27.3±4.6 | 28.9±4.5 | 27.0±4.6 |
| Hip circumference [cm] | 101.3±9.6 | 107.8±9.3 | 100.5±9.6 |
| Waist circumference [cm] | 88.0±12.9 | 98.6±12.1 | 93.5±13.4 |
| Waist-to-hip ratio | 0.87±0.09 | 0.91±0.08 | 0.93±0.09 |
| White blood cell count [Gpt/l] | 5.72±1.48 | 6.00±1.80 | 7.04±3.81 |
| Red blood cell count [Tpt/l] | 4.63±0.39 | 4.50±0.40 | 4.69±0.41 |
| Hematocrit | 0.42±0.03 | 0.41±0.03 | 0.42±0.03 |
| Hemoglobin [mmol/l] | 8.62±0.74 | 8.69±0.75 | 9.10±0.74 |
| Platelets [Gpt/l] | 225.7±50.3 | 244.7±65.1 | 271.5±67.9 |
| Serum C-reactive protein [mg/l] | - | 3.05±6.27 | 3.78±4.92 |
| High density lipoprotein [mmol/l] | 1.48±0.37 | 1.43±0.36 | 1.47±0.40 |
| Serum triglycerides [mmol/l] | 1.42±0.85 | 1.50±0.84 | 1.46±0.91 |
| Active smokers [%] | 214 (22.0) | 66 (6.7) | 239 (18.6) |
| Systolic blood pressure [mmHg] | 124.4±16.9 | 128.7±20.0 | 132.2±17.8 |
| Diastolic blood pressure [mmHg] | 76.6±9.8 | 74.0±10.1 | 83.5±9.68 |
Storage time: Time between blood donation and RNA isolation (SHIP-TREND and KORA F4) or time between RNA isolation and RNA amplification (GHS).
A dash indicates that the variable was not available in the cohort.
Figure 1. Log2 transformation (L2T) versus variance-stabilizing transformation (VST).
The panels show the association results for the random phenotype (A–C) and for body mass index (BMI) (D–F) on each mRNA probe adjusted for sex, age, RNA amplification batch, RNA integrity number (RIN) and the sample storage time based on L2T expression values (x-axis) and on VST values (y-axis) in the SHIP-TREND cohort. The upper panels (A, D) show the betas, the middle panels (B, E) show the standard errors (SEs) and the lower panels (C, F) show the negative log10 association p-values. The corresponding squared Pearson product-moment correlation coefficient between the plotted values is given in the upper right corner of each plot. Each spot represents a probe and is colored according to its mean L2T expression value from all samples. The color code is given in the legend located in the lower right corner of each plot. Although betas and SEs differ between both transformations, the association p-values are highly correlated.
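The per-probe association described in this caption can be sketched as an ordinary least-squares fit that returns the beta and standard error of the phenotype. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names `solve_linear` and `probe_association`, the toy intensities, and the single covariate (age) are all assumptions made for the example, and the real model also adjusted for sex, RNA amplification batch, RIN and storage time.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes a is square and non-singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def probe_association(expression, phenotype, covariates):
    """Regress one probe's expression on phenotype plus covariates.
    Returns (beta, se) for the phenotype term."""
    n = len(expression)
    # Design matrix: intercept, phenotype, then covariates.
    X = [[1.0, phenotype[i]] + list(covariates[i]) for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * expression[i] for i in range(n)) for j in range(p)]
    coef = solve_linear(xtx, xty)
    resid = [expression[i] - sum(c * x for c, x in zip(coef, X[i]))
             for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - p)
    # Var(beta_phenotype) = sigma2 * [(X'X)^-1] at the phenotype's diagonal entry.
    unit = [0.0] * p
    unit[1] = 1.0
    inv_col = solve_linear(xtx, unit)
    return coef[1], math.sqrt(sigma2 * inv_col[1])

# Hypothetical raw intensities for one probe in six samples.
raw = [120.0, 150.0, 210.0, 260.0, 330.0, 420.0]
l2t = [math.log2(v) for v in raw]            # log2 transformation (L2T)
bmi = [21.0, 23.0, 25.0, 27.0, 29.0, 31.0]   # phenotype
age = [[40.0], [55.0], [45.0], [60.0], [50.0], [65.0]]  # one covariate

beta_l2t, se_l2t = probe_association(l2t, bmi, age)
```

Running the same function on VST-transformed values of the same probe and comparing betas, SEs and p-values across all probes would give the kind of comparison plotted in the panels.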
Eigen-R results for SHIP-TREND, KORA F4 and GHS.
| Parameter | SHIP-TREND | KORA F4 | GHS |
| | 33.75% | 48.18% | 26.55% |
| | 20.18% | 24.30% | 12.44% |
| | 2.86% | 1.60% | 1.70% |
| | 18.72% | 3.31% | 8.11% |
| | 0.20% | 0.41% | 0.61% |
| | 1.36% | 0.77% | 0.29% |
| Sex | 0.95% | 0.87% | 1.51% |
| Age [years] | 0.58% | 0.45% | 0.30% |
| Body height [cm] | 0.54% | 0.48% | 0.82% |
| Body weight [kg] | 0.59% | 0.60% | 0.51% |
| Body mass index [kg/m2] | 0.68% | 0.54% | 0.35% |
| Hip circumference [cm] | 0.60% | 0.41% | 0.27% |
| Waist circumference [cm] | 0.77% | 0.67% | 0.52% |
| Waist-to-hip ratio | 0.65% | 0.70% | 0.82% |
| White blood cell count [Gpt/l] | 0.89% | 0.74% | 0.23% |
| Red blood cell count [Tpt/l] | 0.38% | 0.35% | 0.65% |
| Hematocrit | 0.47% | 0.46% | 0.83% |
| Hemoglobin [mmol/l] | 0.50% | 0.42% | 1.03% |
| Platelets [Gpt/l] | 0.32% | 0.27% | 0.63% |
| High density lipoprotein [mmol/l] | 0.49% | 0.48% | 0.48% |
| Serum triglycerides [mmol/l] | 0.68% | 0.87% | 0.23% |
| Active smokers [%] | 0.36% | 0.23% | 0.26% |
| Systolic blood pressure [mmHg] | 0.41% | 0.15% | 0.26% |
| Diastolic blood pressure [mmHg] | 0.37% | 0.14% | 0.19% |
| Serum C-reactive protein [mg/l] | - | 0.30% | 0.26% |
Storage time: Time between blood donation and RNA isolation (SHIP-TREND and KORA F4) or time between RNA isolation and RNA amplification (GHS).
The first six lines of the Table represent technical parameters. A dash indicates that the parameter was not available in the cohort.
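The source does not spell out how the Eigen-R values are computed. One plausible reading, assumed here, is an eigen-R² style summary: the R² of a variable against each principal component of the expression data, averaged with weights proportional to each component's eigenvalue. The sketch below follows that assumption for a single continuous variable; the function names and the toy numbers are hypothetical, and a categorical factor such as a batch would need a multi-degree-of-freedom R² instead of a squared correlation.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def eigen_r2(pc_scores, eigenvalues, variable):
    """Eigenvalue-weighted average of the variable's R^2 on each PC.
    pc_scores[k] holds the per-sample scores of component k."""
    total = sum(eigenvalues)
    return sum(lam / total * r_squared(variable, scores)
               for scores, lam in zip(pc_scores, eigenvalues))

# Toy example: the variable is perfectly linear in PC1 and
# uncorrelated with PC2; PC1 carries 75% of the variance.
pc1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
pc2 = [1.0, -1.0, 0.0, -1.0, 1.0]
variable = [10.0, 11.0, 12.0, 13.0, 14.0]
share = eigen_r2([pc1, pc2], [3.0, 1.0], variable)
```

Under this reading, a value such as 33.75% would mean that the parameter accounts for about a third of the eigenvalue-weighted variation in the expression data.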
Figure 2. Unexplained variance after adjustment for principal components (PCs).
The panels show the percentage of adjusted unexplained variance (y-axis) of the regression on the log2 transformed (L2T) gene expression levels and body mass index (BMI) (A) or the random phenotype (B) over the first 100 PCs (x-axis). With both phenotypes the unexplained variance decreases continuously with the addition of further PCs to the regression model. Results are given separately for the SHIP-TREND, KORA F4 and GHS cohorts.
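The caption does not define "adjusted unexplained variance". A common interpretation, assumed here, is 1 minus the adjusted R², which penalises the raw unexplained variance for the number of predictors in the model. The helper name and the example R² values below are hypothetical; only the sample size of 991 (SHIP-TREND) comes from the cohort table.

```python
def adjusted_unexplained_variance(r2, n_samples, n_predictors):
    """Return 100 * (1 - adjusted R^2) for a linear model with an intercept."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 100.0 * (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical R^2 values as more PCs are added to the model (n = 991).
for n_pcs, r2 in [(1, 0.10), (10, 0.25), (50, 0.40), (100, 0.50)]:
    pct = adjusted_unexplained_variance(r2, 991, n_pcs)
    print(f"{n_pcs:3d} PCs: {pct:.1f}% unexplained (adjusted)")
```

Because of the penalty term, the adjusted value falls only when an added PC explains more variance than would be expected by chance, which is why a continuous decrease over 100 PCs is informative.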
Mean standard errors (SEs) for SHIP-TREND, KORA F4 and GHS after different covariate adjustments for the random phenotype and body mass index (BMI).
| Phenotype | Additional covariates (besides phenotype) | SHIP-TREND (mean SE) | KORA F4 (mean SE) | GHS (mean SE) |
| Random phenotype | none | 0.00602560 | 0.00705074 | 0.00555893 |
| | age, sex | 0.00600400 | 0.00692849 | 0.00554164 |
| | age, sex, technical | 0.00549340 | 0.00640187 | 0.00528846 |
| | technical | 0.00551280 | 0.00641387 | 0.00530522 |
| | technical, PC1 | 0.00548790 | 0.00637871 | 0.00500432 |
| | technical, detected genes | 0.00544510 | 0.00627055 | - |
| | technical, | 0.00544820 | 0.00629034 | - |
| | 50 PCs | 0.00474210 | 0.00512433 | 0.00419344 |
| | age, sex, technical, cell types | 0.00542430 | - | - |
| | technical, non-technical | 0.00557310 | - | - |
| BMI | none | 0.00130350 | 0.00154734 | 0.00114923 |
| | age, sex | 0.00135000 | 0.00154774 | 0.00117234 |
| | age, sex, technical | 0.00123420 | 0.00142589 | 0.00112182 |
| | technical | 0.00119320 | 0.00142516 | 0.00109915 |
| | technical, PC1 | 0.00119210 | 0.00141686 | 0.00109480 |
| | technical, detected genes | 0.00118540 | 0.00140998 | - |
| | technical, | 0.00119490 | 0.00141395 | - |
| | 50 PCs | 0.00125360 | 0.00126477 | 0.00105583 |
| | age, sex, technical, cell types | 0.00123254 | - | - |
| | technical, non-technical | 0.01305295 | - | - |
50 PCs: the first 50 principal components (PCs) of the principal component analysis (PCA) over the gene expression levels; BMI: body mass index in [kg/m2]; cell types: percentage of lymphocytes, neutrophils, monocytes, eosinophils and basophils; detected genes: number of detected genes (detection p-value<0.01); Mean SE: mean standard error of the phenotype's beta from all probes of the corresponding association analysis; non-technical: all non-technical parameters having an Eigen-R value>0.3% in SHIP-TREND; PC1: the first PC of the PCA; random phenotype: the random phenotype ∼N(0,1); technical: RNA amplification batch, RNA integrity number (RIN), storage time.
A dash indicates that the parameter was not available in the cohort.
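A back-of-the-envelope check, not taken from the paper, helps relate the two blocks of rows in this table. It assumes that the phenotype enters each model as the predictor of expression, that the first block of rows belongs to the random phenotype and the second to BMI (the order used in the caption), and that the residual standard deviation is similar in both analyses. Under those assumptions the standard error of a slope scales inversely with the predictor's standard deviation, so the ratio of the two unadjusted ("none") mean SEs should be close to the standard deviation of BMI from the cohort table, since the random phenotype has standard deviation 1.

```latex
\mathrm{SE}(\hat\beta) = \frac{\hat\sigma}{s_x \sqrt{n-1}}
\quad\Longrightarrow\quad
\frac{\mathrm{SE}_{\text{random}}}{\mathrm{SE}_{\text{BMI}}}
\approx \frac{s_{\text{BMI}}}{s_{\text{random}}} = s_{\text{BMI}}

\text{SHIP-TREND: } \frac{0.00602560}{0.00130350} \approx 4.62 \quad (s_{\text{BMI}} = 4.6)

\text{KORA F4: } \frac{0.00705074}{0.00154734} \approx 4.56 \quad (s_{\text{BMI}} = 4.5)

\text{GHS: } \frac{0.00555893}{0.00114923} \approx 4.84 \quad (s_{\text{BMI}} = 4.6)
```

The ratios agree reasonably well with the reported BMI standard deviations, which is consistent with the difference in magnitude between the two blocks being largely a matter of phenotype scale.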
Figure 3. Effects of SNPs within probes on signal intensities.
The effects on measured log2 transformed (L2T) gene expression levels per mismatch allele of SNPs located within probes (y-axis) are plotted against the mean L2T expression level of the samples for each probe (x-axis). Each spot represents a SNP-probe combination; associations with significant p-values after Bonferroni correction (p<2.3×10−5) are colored in red and p-values below 0.05 are colored in orange. To increase legibility the y-axis was limited from −3 to 3 excluding 176 non-significant results out of 1237 successful association results (minimum and maximum effect sizes were −174.1 and 188.7, respectively). Surprisingly, in almost 45% of the associations a positive effect per mismatch allele on expression signal intensity was observed.
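The colouring rule in this caption follows from a Bonferroni correction, in which the nominal significance level is divided by the number of tests. A minimal sketch is given below; the function names, the example test count of 1000 and the "grey" label for non-significant spots are assumptions for illustration, since the caption reports only the resulting threshold (p<2.3×10−5) and the red and orange colours.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def spot_color(p_value, n_tests, alpha=0.05):
    """Colour a SNP-probe association by significance level."""
    if p_value < bonferroni_threshold(alpha, n_tests):
        return "red"      # significant after Bonferroni correction
    if p_value < alpha:
        return "orange"   # nominally significant only
    return "grey"         # assumed label for everything else

# Hypothetical p-values with an assumed 1000 tests (threshold 5e-5).
colors = [spot_color(p, 1000) for p in (1e-6, 0.01, 0.2)]
```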
Figure 4. Workflow – from blood sampling to measured mRNA intensities.
From left to right: Whole blood was collected and stored in PAXgene tubes until isolation of RNA from whole blood cells in both SHIP-TREND and KORA F4. In GHS, monocytes were separated from whole blood and RNA was isolated from monocytes within 24 hours after blood sampling, subsequently storing the isolated RNA until amplification. The sample storage time refers to the duration the whole blood (SHIP-TREND and KORA F4) or isolated RNA (GHS) was stored before further processing, shown as mean ± standard deviation in days. The samples were processed in 96 well plates both after isolation and amplification of the RNA. The corresponding plate layouts were called RNA isolation batch and RNA amplification batch, respectively. Finally, the RNA was hybridized and the arrays were scanned, quality controlled and analyzed.