Shaopu Qin, Jinhee Kim, Dalia Arafat, Greg Gibson.
Abstract
An under-appreciated aspect of the genetic analysis of gene expression is the impact of post-probe level normalization on biological inference. Here we contrast nine different methods for normalization of an Illumina bead-array gene expression profiling dataset consisting of peripheral blood samples from 189 individual participants in the Center for Health Discovery and Well Being study in Atlanta, quantifying differences in the inference of global variance components and covariance of gene expression, as well as the detection of variants that affect transcript abundance (eSNPs). The normalization strategies, all relative to raw log2 measures, include simple mean centering, two modes of transcript-level linear adjustment for technical factors, and for differential immune cell counts, variance normalization by interquartile range and by quantile, fitting the first 16 Principal Components, and supervised normalization using the SNM procedure with adjustment for cell counts. Robustness of genetic associations as a consequence of Pearson and Spearman rank correlation is also reported for each method, and it is shown that the normalization strategy has a far greater impact than correlation method. We describe similarities among methods, discuss the impact on biological interpretation, and make recommendations regarding appropriate strategies.
Keywords: eSNP; microarray analysis; normalization; variance component analysis
Year: 2013 PMID: 23755061 PMCID: PMC3668151 DOI: 10.3389/fgene.2012.00160
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Profile distributions after nine modes of normalization. Each plot shows the frequency distribution of transcripts at increasing levels of expression along the x-axis (the units are removed, since these are not comparable between methods). Colors represent normal weight (blue), heavy (green), or obese (red) individuals.
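Three of the simpler strategies named in the abstract (mean centering, interquartile-range scaling, and quantile normalization) can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: it assumes each array is a plain list of log2 values, that centering and scaling are applied per array, and the function names are hypothetical.

```python
from statistics import mean, quantiles


def mean_center(arrays):
    """Subtract each array's own mean (assumed: centering is per array)."""
    out = []
    for a in arrays:
        m = mean(a)
        out.append([x - m for x in a])
    return out


def iqr_scale(arrays):
    """Center each array on its median and divide by its interquartile range."""
    out = []
    for a in arrays:
        q1, q2, q3 = quantiles(a, n=4, method="inclusive")
        out.append([(x - q2) / (q3 - q1) for x in a])
    return out


def quantile_normalize(arrays):
    """Map every array onto one reference distribution: the mean of the
    sorted arrays at each rank. Ties are broken by position, a simplification."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(s[i] for s in sorted_arrays) for i in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = reference[rank]
        out.append(normalized)
    return out


# Toy data: two arrays, four probes each (values are invented).
toy = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0]]
centered = mean_center(toy)
scaled = iqr_scale(toy)
qnormed = quantile_normalize(toy)
```

After quantile normalization every array contains exactly the same set of values, which is why the x-axis units in the profile plots are not comparable between methods.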
Figure 2. Heatmaps showing pair-wise similarity of arrays. Each plot shows, for every pair of arrays, the correlation coefficient between the gene expression values in one array and those in the paired array. Values range from −1 (dark blue) to +1 (dark red). Blocks of color indicate that arrays in those sectors are less or more similar to one another. Each plot is symmetrical about the diagonal.
Figure 3. Similarity of principal components (A) and immuno-informative axis scores (B). The heat maps show the correlation coefficient across all 189 samples for each PC axis, where the order of the rows is the same as the order of the columns. (A) Comparison of the first five PCs shows that PC1 is generally highly correlated across normalization strategies, as is PC2, but that the lower PCs fall into different clusters. (B) By contrast, the primary axes of covariance of genes representing seven common axes of immunologically informative variation (Risso et al., 2011) are generally well conserved across all eight normalization strategies (excepting PCA).
Variance component analyses.
| Normalization | Date | RIN | Age | BMI | Gender | Ethnicity | Residual |
|---|---|---|---|---|---|---|---|
| RAW | 43.7 | 5.1 | 1.5 | 0.1 | 3.3 | 2.8 | 43.5 |
| MEA | 43.7 | 5.1 | 1.5 | 0.1 | 3.3 | 2.8 | 43.5 |
| dr3 | 0 | 0 | 2.0 | 0.5 | 2.7 | 4.8 | 90.0 |
| DRM | 0 | 0 | 2.0 | 0.5 | 2.7 | 4.8 | 90.0 |
| IQR | 43.3 | 4.9 | 1.6 | 0.1 | 3.1 | 3.0 | 44.2 |
| LMN | 0 | 0 | 1.7 | 0.3 | 0.1 | 3.3 | 94.5 |
| QNM | 38.5 | 7.9 | 1.7 | 0.2 | 5.0 | 3.5 | 45.2 |
| SNM | 0 | 0.2 | 1.8 | 0.9 | 5.9 | 7.4 | 83.8 |
| PCA | 2.5 | 0.7 | 0.5 | 0.1 | 1.2 | 4.3 | 90.7 |
The table reports the weighted average of the percentage of variation explained by the first five principal components of gene expression, for the indicated variables. Samples were hybridized on five different days in July and August 2010, and RIN refers to three categorical levels of RNA integrity number (<7, 7–8, >8). Age was modeled as a categorical variable with four levels (<40, 40–50, 50–60, >60); BMI as a categorical variable with three levels (<25, 25–30, >30); and Ethnicity has three levels (Caucasian, African American, Asian).
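One plausible reading of "weighted average of the percentage of variation explained" is sketched below. It is an assumption, not the authors' documented procedure: each PC's scores are regressed on one categorical variable at a time (a one-way ANOVA R²), and the R² values are averaged with weights equal to the share of expression variance each PC captures. The published table appears to partition variance jointly across all factors plus a residual, which this single-factor sketch does not reproduce. All function and variable names are hypothetical.

```python
from statistics import mean


def r_squared_categorical(scores, labels):
    """Between-group sum of squares over total sum of squares for one PC."""
    grand = mean(scores)
    total_ss = sum((s - grand) ** 2 for s in scores)
    groups = {}
    for s, g in zip(scores, labels):
        groups.setdefault(g, []).append(s)
    between_ss = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    return between_ss / total_ss


def weighted_percent_explained(pc_scores, pc_weights, labels):
    """Average the per-PC R² values, weighting each PC by its variance share
    (assumed weighting), and express the result as a percentage."""
    total_weight = sum(pc_weights)
    weighted = sum(
        w * r_squared_categorical(s, labels)
        for s, w in zip(pc_scores, pc_weights)
    )
    return 100.0 * weighted / total_weight


# Toy example: two PCs, four samples, one two-level factor (values invented).
toy_scores = [[1.0, 1.0, 5.0, 5.0], [1.0, 2.0, 1.0, 2.0]]
toy_weights = [0.6, 0.2]           # share of variance captured by each PC
toy_labels = ["a", "a", "b", "b"]  # e.g. hybridization date
pct = weighted_percent_explained(toy_scores, toy_weights, toy_labels)
```

In the toy data the factor explains all of PC1 and none of PC2, so the weighted figure is dominated by PC1.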
Trait associations.
| Normalization | Age | BMI | Gender | Ethnicity | Total |
|---|---|---|---|---|---|
| RAW | 4 | 0 | 40 | 59 | 103 |
| MEA | 4 | 0 | 75 | 159 | 238 |
| dr3 | 16 | 0 | 34 | 89 | 139 |
| DRM | 7 | 0 | 64 | 198 | 269 |
| IQR | 3 | 0 | 58 | 151 | 212 |
| LMN | 2 | 2 | 15 | 101 | 120 |
| QNM | 3 | 0 | 90 | 201 | 294 |
| SNM | 13 | 5 | 38 | 140 | 196 |
| PCA | 0 | 0 | 3 | 2 | 5 |
The table reports the total number of associations detected between probe-level expression and the indicated traits. Age was modeled as a categorical variable with four levels (<40, 40–50, 50–60, >60); BMI as a categorical variable with three levels (<25, 25–30, >30); and Ethnicity has three levels (Caucasian, African American, Asian).
Figure 4. Volcano plots of significance and comparison of thresholds. Volcano plots contrast significance as the negative logarithm of the p-value against differential expression, in this case for all genes with NLP > 1.3 (nominal p < 0.05) in the contrast of African American and Caucasian samples in the CHDWB study. Red circles are genes that are significant at NLP > 4 in the SNM normalization, and the horizontal dashed red line shows this threshold for each method. In general, highly significant contrasts are significant in all methods, but this is not necessarily the case for the Gender comparison where Quantile normalization (QNM) over-represents the gender effect relative to all other methods. The heat map at the bottom left shows the pair-wise correlation of estimated effect sizes for all 14,343 probes for each normalization comparison, for ethnicity above the diagonal, and gender below it.
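The NLP thresholds used in the caption and in the eSNP table translate to p-values as follows. The sketch assumes NLP means the negative base-10 logarithm of the p-value, which matches the stated pairing of NLP 1.3 with nominal p < 0.05; the helper names are hypothetical.

```python
import math


def nlp(p_value):
    """Negative log10 of a p-value (assumed definition of NLP)."""
    return -math.log10(p_value)


def p_from_nlp(nlp_value):
    """Invert the transform: the p-value that corresponds to a given NLP."""
    return 10.0 ** (-nlp_value)


# p-values implied by the thresholds quoted in the figure and tables.
thresholds = {level: p_from_nlp(level) for level in (1.3, 4, 5, 8)}
```

NLP 4 therefore corresponds to p = 0.0001 and NLP 8 to p = 0.00000001, while NLP 1.3 is a rounded stand-in for p = 0.05.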
eSNP analyses.
| Normalization | Pearson: Total (NLP 8) | Pearson: Cis (NLP 5) | Pearson: Cis (NLP 8) | Pearson: Probes (NLP 8) | Spearman: Cis (NLP 8) | Spearman: Probes (NLP 8) |
|---|---|---|---|---|---|---|
| RAW | 552 | 1183 | 411 | 39 | 324 | 36 |
| MEA | 1082 | 2009 | 743 | 77 | 703 | 71 |
| dr3 | 627 | 1362 | 455 | 44 | 407 | 46 |
| DRM | 959 | 2150 | 761 | 87 | 747 | 77 |
| IQR | 935 | 1708 | 603 | 71 | 565 | 73 |
| LMN | 484 | 1281 | 439 | 44 | 394 | 44 |
| QNM | 1211 | 2288 | 842 | 88 | 791 | 81 |
| SNM | 969 | 2084 | 825 | 86 | 821 | 81 |
| PCA | 602 | 1563 | 585 | 73 | 505 | 74 |
The table reports the number of associations detected between 34,548 Chromosome 6 SNPs and 732 Chromosome 6 probes. The columns under Pearson correlation give the total (trans and cis) associations at NLP 8, the cis-associations at NLP 5 and at NLP 8 (defining cis as eSNPs within 250 kb of the probe), and the number of independent probes with eSNPs at NLP 8, all using Pearson correlation with transcript abundance. The columns under Spearman rank correlation give the cis-associations and the number of independent probes at NLP 8.
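The two ingredients of this table, the 250 kb cis window and the choice between Pearson and Spearman correlation, can be illustrated with a small sketch. It assumes genotypes are coded as minor-allele counts (0, 1, 2), that SNP and probe positions are base-pair coordinates on the same chromosome, and that a SNP exactly 250 kb away counts as cis. The function names and toy values are hypothetical, and no significance test is included.

```python
from statistics import mean

CIS_WINDOW_BP = 250_000  # from the table note: cis = within 250 kb of the probe


def is_cis(snp_pos, probe_pos, window=CIS_WINDOW_BP):
    """True when the SNP lies within the cis window around the probe."""
    return abs(snp_pos - probe_pos) <= window


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy eSNP: six samples, with one outlying expression value (invented data).
genotypes = [0, 0, 1, 1, 2, 2]
expression = [7.1, 7.0, 7.6, 7.4, 8.0, 12.0]
r_pearson = pearson(genotypes, expression)
r_spearman = spearman(genotypes, expression)
```

In this toy case the single outlier pulls the Pearson coefficient well below the Spearman coefficient, which is the kind of sensitivity the Pearson versus Spearman comparison is meant to probe.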