Andrew E Jaffe, Thomas Hyde, Joel Kleinman, Daniel R Weinberger, Joshua G Chenoweth, Ronald D McKay, Jeffrey T Leek, Carlo Colantuoni.
Abstract
BACKGROUND: Genomic data production is at its highest level and continues to increase, making both novel primary data and existing public data available to researchers for exploration. Here we explore the consequences of "batch" correction for biological discovery in two publicly available expression datasets. We take "batch" correction to include the estimation of, and adjustment for, widespread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether that heterogeneity is technical or biological in nature.
Year: 2015 PMID: 26545828 PMCID: PMC4636836 DOI: 10.1186/s12859-015-0808-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Global transcriptional landscape in differentiating pluripotent cells. PCA of expression data from differentiating pluripotent cells prior to SVA, colored by differentiation treatment (a) and microarray scan date (b, left panel). The first PC shows a strong effect of treatment, while the second PC is related to “batch”: boxplots of the second PC indicate strong association with scan date (b, right panel). PCA following estimation and removal of SVs, again colored by differentiation treatment (c) and microarray scan date (d, left panel). Both the first and second PC now show systematic association with differentiation, and the second PC no longer shows systematic change with scan date (d, right panel). Letters are also used to distinguish individual scan dates in (b) and (d).
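The "estimation and removal of SVs" step in Fig. 1 can be illustrated with a minimal sketch. It does not implement SVA itself: it assumes the latent factors are already available (here a simulated, hypothetical scan-date variable stands in for estimated SVs) and shows one common way to remove them, namely fitting each gene on the biological model plus the factors and subtracting only the factor component. The function names `remove_latent_factors` and `top_pcs`, and all simulated values, are illustrative assumptions rather than anything from the paper.

```python
import numpy as np

def remove_latent_factors(expr, design, factors):
    """Regress `factors` out of `expr` while keeping the effects in `design`.

    expr:    genes x samples matrix of log2 expression
    design:  samples x p matrix for the biological model (with intercept)
    factors: samples x k matrix of latent factors (stand-ins for SVs)
    """
    full = np.hstack([design, factors])
    # Least-squares fit of every gene on the full model at once.
    coef, *_ = np.linalg.lstsq(full, expr.T, rcond=None)
    p = design.shape[1]
    # Subtract only the part explained by the latent factors.
    return expr - (factors @ coef[p:]).T

def top_pcs(expr, n=2):
    """Sample scores on the first n principal components."""
    centered = expr - expr.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return (vt[:n] * s[:n, None]).T  # samples x n

# Simulated data (hypothetical): a treatment effect plus a scan-date effect.
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 30
treatment = np.repeat([0.0, 1.0], n_samples // 2)
batch = np.tile([0.0, 1.0, 2.0], n_samples // 3)
design = np.column_stack([np.ones(n_samples), treatment])
factors = batch[:, None]
expr = (2.0 * np.outer(rng.normal(size=n_genes), treatment)
        + np.outer(rng.normal(size=n_genes), batch)
        + rng.normal(scale=0.1, size=(n_genes, n_samples)))

cleaned = remove_latent_factors(expr, design, factors)
pcs_before = top_pcs(expr)     # PCs of the original data
pcs_after = top_pcs(cleaned)   # PCs after removing the latent factor
```

Plotting `pcs_before` and `pcs_after` colored by `treatment` and by `batch` would give panels analogous to (a) through (d). Because the treatment columns stay in the fitted model, their coefficients are unchanged by the cleaning step.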
Fig. 2 SVA improves power to identify differentially expressed genes. a PAX6 shows significant differential expression between mesendodermal and neurectodermal differentiation before SVA (p = 1.77 × 10^−33) and b this effect becomes more significant following SVA (p = 4.82 × 10^−53). c Prior to SVA, OLFML1 is not identified as being differentially expressed between differentiation conditions (p = 8.23 × 10^−5), but d is highly significant after properly controlling for unwanted latent heterogeneity with SVA (p = 4.49 × 10^−29). Statistical significance was derived from a moderated t-statistic comparing expression in the mesendodermal differentiation condition versus that in the neurectodermal differentiation condition, while also allowing variability to be explained by the undifferentiated condition (i.e. condition was categorical with 3 groups). Individual cell lines are represented on the X-axis. Gene expression on the Y-axis is depicted in quantile normalized, log2-scale intensities.
Fig. 3 The biological model limits the scope of biological questions that can be asked. Defining a biological model that only preserves the effect of treatment obscures other true biological effects. The RPS4Y1 gene is differentially expressed by sex (a). However, when the biological model passed to SVA does not include sex (i.e. using the treatment-only model used in Figs. 1 and 2), the effect of sex at this gene is not apparent (b). When the effects defined in SVA include sex, the difference by sex is preserved in the data (c). Similarly, with GSTT1, copy number variation has a large impact on gene expression (d), which is removed by SVA under a treatment-only biological model (e). Including a term for GSTT1 copy number in the biological model passed to SVA preserves the effect (f). Individual cell lines are represented on the X-axis. Gene expression on the Y-axis is depicted in quantile normalized, log2-scale intensities.
Fig. 4 Global view of the impact of differing models in SVA. PCA performed on the original normalized data (“No SVA”), and on data “cleaned” with SVA using the 3 different models described in the text (a, b and c). Values of individual samples in the first 4 PCs are shown for each analysis, along with a smoothing spline fit to these PCs across age (red). SVA Model A: linear spline with a knot at birth [2 degrees of freedom]; SVA Model B: 2nd degree basis spline with a knot at birth [3 degrees of freedom]; SVA Model C: 2nd degree basis spline with knots at birth, 1, 10, 20 and 50 years [8 degrees of freedom]. Each model also incorporated an offset at birth.
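To make the simplest of these models concrete, the following is a rough sketch of a design matrix for something like SVA Model A: a linear spline in age with one knot at birth, plus an offset (step) at birth. It assumes age is coded in years with birth at 0 and fetal ages negative, and it uses a truncated-power parametrization; both are assumptions for illustration, and the authors' actual basis may differ. The helper name `linear_spline_design` is hypothetical.

```python
import numpy as np

def linear_spline_design(age, knot=0.0):
    """Design columns for a linear spline with one knot, plus a step at the knot.

    age:  ages in years, with birth at 0 and fetal ages negative (assumed coding)
    knot: location of the knot; 0.0 corresponds to birth under that coding
    """
    age = np.asarray(age, dtype=float)
    return np.column_stack([
        np.ones_like(age),            # intercept
        age,                          # slope before the knot
        np.maximum(age - knot, 0.0),  # change in slope after the knot
        (age >= knot).astype(float),  # offset (step) at birth
    ])

# Example: one fetal sample, one sample at birth, one 2-year-old.
X = linear_spline_design([-0.5, 0.0, 2.0])
```

The two age columns account for the 2 degrees of freedom of the linear spline; the intercept and the step at birth are separate terms. Models B and C would replace the two age columns with a 2nd degree basis spline using the knots listed in the caption.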
Fig. 5 Individual gene view of the impact of differing models in SVA. CNDP1 gene expression is shown in the original normalized data (“Raw”) in the top panels. The fit from each of the 3 different models used for SVA is shown overlaid on this original data in the top panels (red). This fit represents the effect that SVA will preserve for this particular gene in each of the 3 different scenarios. Bottom panels show CNDP1 gene expression adjusted with SVs generated using each of the 3 different models. Smoothing splines for expression across age (green) are shown for each to depict the difference in expression patterns present when using SVA under the different models.