Mario Fasold, Hans Binder.
Abstract
The great utility of microarrays for genome-scale expression analysis is challenged by the widespread presence of batch effects, which bias expression measurements, particularly within large data sets. These unwanted technical artifacts can obscure biological variation and thus significantly reduce the reliability of the analysis results. Which technical sources predominantly cause batch effects is largely unknown. Here we quantitatively assess the prevalence and impact of several known technical effects on microarray expression results. In particular, we focus on important factors such as RNA degradation, RNA quantity, and sequence biases including multiple guanine effects. We find that common variation in RNA quality and RNA quantity can yield low-quality expression results and that both factors also correlate with batch effects and biological characteristics of the samples.
Keywords: RNA; batch effects; expression analysis; microarray; quality control
Year: 2014 PMID: 27600351 PMCID: PMC4979052 DOI: 10.3390/microarrays3040322
Source DB: PubMed Journal: Microarrays (Basel) ISSN: 2076-3905
Figure 1. Variation of RNA quality among a large set of microarray samples and its impact on expression results. Panel (a) shows the density distribution of dk values measuring the degradation for samples either included in or excluded from the HumanExpressionAtlas data set by independent quality control. The red line indicates the low-quality threshold corresponding to RIN ≤ 7. The inset shows RNA degradation plots for two selected samples. The corresponding dk values (red and blue dots) are shown in both the inset and the density distribution. Panel (b) shows the first two principal components of the HumanExpressionAtlas expression space, where sample points are colored according to their RNA quality measure dk.
Figure 2. Chip-specific summary parameters λ and β as a function of the amount of hybridized RNA. We computed both parameters for the samples of Gene Logic’s dilution data set, in which two types of RNA (liver and SNB-19) were hybridized at varying concentrations with 5 replicate samples for each concentration. In Panel (a), λ increases roughly linearly with increasing RNA mass between 1 and 10 µg (Pearson correlations of r > 0.7) but saturates at 20 µg. Panel (b) shows how β decreases with increasing RNA mass. An amount of 10 µg aRNA is recommended for the employed HG-U95A platform.
Figure 3. Distribution of the λ and β parameters characterizing the amount of hybridized RNA for qc-included/qc-excluded samples (panels (a) and (b)). The inset shows the hook curves for two selected samples and the corresponding λ and β values (red and blue dots in the same color as the curves). Panel (c) shows principal components two and four of the HumanExpressionAtlas expression space, where points are colored according to the value of λ.
Figure 4. Distribution of summary parameters related to sequence effects for qc-included/qc-excluded samples of the HumanExpressionAtlas. Panel (a) shows the parameter log(Kdiff) as a measure of the total strength of the sequence effect. The inset shows the corresponding sequence profiles for two selected samples (red and blue dots, in the same color as the profiles). Panel (b) shows the intensity increase due to the (GGG)1 motif, δI(GGG)1.
Prevalence and impact of technical factors that constitute potential sources of batch effects in gene expression experiments. Prevalence is the percentage of samples that are critically affected by the respective technical artifact; the choice of thresholds is justified in the respective subsections. Column two gives the prevalence among all samples in the HumanArraySet, and column three the prevalence among the subset of samples selected by the quality control independently performed by [7]. Column four shows the correlation of the technical variable with the principal components of the expression space, and the last column shows how it changes along with known batches.
| Parameter | Prevalence (all) | Prevalence (qc-included) | Correlation with principal component (PC) | Adjusted R² with batch (lm) |
|---|---|---|---|---|
| Degradation index dk | 10.6% | 3.0% | −0.44 (2) | 0.67 |
| Specific transcript level λ | 1.6% | (0.1%) | −0.61 (4) | 0.62 |
| Measuring range β | 4.2% | (0.1%) | 0.04 (1) | 0.24 |
| Sequence effect size | 4.1% | (0.1%) | −0.17 (3) | 0.76 |
| Guanine effects | 3.1% | 1.1% | 0.19 (5) | 0.76 |
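The two impact measures in the last columns of the table can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that the number in parentheses in column four is the index of the principal component most strongly correlated with the technical variable, and that the last column is the adjusted R² of a one-way linear model of the technical variable on batch. The function names `pc_scores`, `strongest_pc_correlation`, and `adjusted_r2_with_batch` are hypothetical.

```python
import numpy as np


def pc_scores(expr, n_components=5):
    """Principal component scores of a samples x genes matrix (via SVD)."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)  # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def strongest_pc_correlation(covariate, scores):
    """Pearson r and 1-based index of the PC most correlated with covariate.

    Assumption: this mirrors entries such as "-0.44 (2)" in the table.
    """
    r = np.array([np.corrcoef(covariate, scores[:, k])[0, 1]
                  for k in range(scores.shape[1])])
    best = int(np.argmax(np.abs(r)))
    return float(r[best]), best + 1


def adjusted_r2_with_batch(covariate, batch):
    """Adjusted R^2 of the one-way linear model covariate ~ batch.

    Assumption: batch is treated as a categorical factor, so the fitted
    value of each sample is the mean of its batch.
    """
    covariate = np.asarray(covariate, dtype=float)
    levels, codes = np.unique(np.asarray(batch), return_inverse=True)
    n, n_levels = covariate.size, levels.size
    sums = np.bincount(codes, weights=covariate, minlength=n_levels)
    counts = np.bincount(codes, minlength=n_levels)
    fitted = (sums / counts)[codes]
    ss_res = np.sum((covariate - fitted) ** 2)
    ss_tot = np.sum((covariate - covariate.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # n_levels - 1 dummy predictors plus intercept leave n - n_levels
    # residual degrees of freedom
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_levels)
```

Under these assumptions, passing per-sample values of a technical variable such as dk or λ together with the expression matrix and the batch labels would give numbers comparable in kind to the last two columns of the table.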