Literature DB >> 24092958

Monitoring of technical variation in quantitative high-throughput datasets.

Martin Lauss¹, Ilhami Visne, Albert Kriegner, Markus Ringnér, Göran Jönsson, Mattias Höglund.

Abstract

High-dimensional datasets can be confounded by variation from technical sources, such as batches. Undetected batch effects can have severe consequences for the validity of a study's conclusion(s). We evaluate high-throughput RNAseq and miRNAseq as well as DNA methylation and gene expression microarray datasets, mainly from the Cancer Genome Atlas (TCGA) project, in respect to technical and biological annotations. We observe technical bias in these datasets and discuss corrective interventions. We then suggest a general procedure to control study design, detect technical bias using linear regression of principal components, correct for batch effects, and re-evaluate principal components. This procedure is implemented in the R package swamp, and as graphical user interface software. In conclusion, high-throughput platforms that generate continuous measurements are sensitive to various forms of technical bias. For such data, monitoring of technical variation is an important analysis step.

Entities: Disease Gene Species

Keywords: RNAseq; batch effect; bias; data adjustment; high-throughput analysis; sample annotation

Year: 2013 PMID： 24092958 PMCID： PMC3785384 DOI： 10.4137/CIN.S12862

Source DB: PubMed Journal: Cancer Inform ISSN： 1176-9351

Introduction

In high-throughput datasets, sophisticated biostatistical analyses are required to see how technical and biological information of samples is reflected in the data. The variation inherent to the samples (biological variation) coexists with variation added by technical sources, such as batch effects, and random noise.1 A landmark study has highlighted that technical biases in the form of batch effects are found in several high-dimensional data such as gene expression, protein data, and data from next-generation sequencing.2 Data on epigenetic profiling3 and copy number changes4 may also be influenced by batch effects. The underlying cause of an observed batch effect is often unclear and may be linked to a variety of experimental conditions, such as reagent lot, date of experiment, or laboratory personnel. In a broader sense, the merging of several datasets into one single dataset also constitutes a batch effect problem.5,6 Undetected batch effects can have major impact on subsequent conclusions in both unsupervised and supervised analysis.2 Several methods that remove or adjust batch variation have been developed. These methods range from simple batchwise centering of probes to more sophisticated methods, eg, ComBat,7 DWD,8 SVA,9 and XPN;10 however no clear best-performing method has emerged.11,12 It is worth noting that batch correction changes the data substantially and may be incomplete or may introduce new bias to the data. Therefore it is important to evaluate the quality of batch correction. Herein we identify technical bias in datasets from commonly used high-throughput platforms and delineate problems of batch correction in such data. We then suggest a simple procedure to validate batch correction that can be conveniently applied using the R package swamp.

Methods

Detection, correction and re-evaluation of technical bias

We have developed the R package swamp to provide algorithms and supportive plots for the analyses of high-throughput data in respect to sample annotations. The basic elements of the suggested framework presented in Figure 4 are:

Figure 4

Framework to monitor technical variation.

Study design: Heatmap of the square matrix of log10P-values from pair-wise tests of sample annotations using Fisher/Chi-square test or linear models. Function: confounding. Principal component analysis, using univariate linear regression models to determine the association between principal components and sample annotations. A heatmap of log10P-values or R2 values is plotted. Functions: prince, prince.plot. Hierarchical clustering analysis and a quantification of batch effects across the dendrogram clusters using Fisher/Chi-square test or linear models for factors and numeric vectors respectively. Functions: hca.plot, hca.test. We include two data correction methods that so far have not been implemented in R packages: The function kill.pc removes principal components from the data, as described by Alter et al.13 Principal components are deleted from the data by setting the corresponding singular values to zero and recalculating the data matrix. The function adjust.linearmodel uses the lm() function to obtain for each probe the residuals of a linear regression model with the technical variable as regressor. The adjust.linearmodel function makes it possible to correct the data for continuous technical variables. Furthermore we implemented the popular Com-Bat algorithm,7 taken from the webpage http://www.bu.edu/jlab/wp-assets/ComBat, in the function combat. The functions batchadjust.ref and bachadjust.zero perform simple median-centering of each probe and batch. Additional batch correction methods, which may be more appropriate for certain data types, can be found in dedicated R packages for dwd,14 poe,15 sva,16 isva,17 pls-sva,18 and xpn.12

Data processing

For RNAseq data, we downloaded level3 RNAseqv2 data from the Cancer Genome Atlas (TCGA) data portal. We used the files that contain ‘reads per kilo-base per million mapped reads’ (RPKM) values for each gene. There is currently no standard processing pipeline of RPKM values. In the case of colon cancer, we used quantile-normalization as it removed substantial amounts of unexplained variation. For all RNAseq datasets from TCGA, we added an offset of 32, capped the data at 65,000, log2 transformed the data, and mean-centered the genes. This simple pre-processing method makes the data similar to the microarray gene expression format. It has been proposed that heteroscedastic RNAseq counts may influence downstream homoscedastic-based methods, such as principal component analysis.19 We therefore compared the simple pre-processing method with ‘variance stabilizing transformation’ (vst).19 However, we find that principal component analysis of TCGA data produces highly similar results for the simple pre-processing method and the vst-correction (Supplementary Fig. 1). In order to limit unnecessary data transformation, we did not continue to use vst-corrected data. For miRNAseq data, we downloaded level3 data from the TCGA data portal and performed the same data pre-processing (offset 32, cap 65,000, log2 transformation). For methylation data we downloaded level2 data and calculated beta-values as M/(M+U), where M is methylated and U is unmethylated signal. For microarray gene expression data we used processed data from Leek et al.2 The batch variable in this dataset refers to date of hybridization, as derived from the headers of the cel files.

Availability and requirements

The R package swamp is freely available at CRAN. The package runs in R 2.15 or higher, and requires amap, gplots, and impute20 packages. A Windows software for swamp is freely available at http://co.bmc.lu.se/swamp/. The software is implemented in RGG language,21 and requires R 2.15 or higher and Java.

Results

Batch effects are found in a variety of datasets

TCGA resource is remarkable as it reports technical variables in detail, notably by the MD Anderson Batch Effects Tool that implements the MBatch package. To visualize technical bias, such as batch effects, in high-throughput datasets, we introduce a plot which we call prince plot (Fig. 1). For the prince plot we perform a univariate linear regression of each principal component of a data matrix using the sample annotations as regressors. The heatmap of P-values shows the strength of associations of each sample annotation with the top principal components of a dataset. Biological variables are well associated to the top principal components, as shown in Figure 1, for several types of data, including RNAseq, miRNAseq, Methylation27K, and Affymetrix gene expression data. For instance, estrogen receptor status is strongly associated with top components of all three breast cancer datasets. In the kidney cancer RNAseq and miRNAseq data sets, histological grade and tumor stage are well associated with principal component 1. However, technical variables can also be associated to high-ranking principal components. In particular, the ‘batch’ annotations from TCGA data were associated to the top components in all types of investigated data (Fig. 1). The ‘batch’ variable is not further defined in the TCGA project; however, in most datasets correlates well with ‘tissue source site’ (hospital), ‘shipment date,’ and ‘plate-id’ variables. Notably, the ‘tissue source site’ variable can introduce slightly different bias when compared to the ‘batch’ variable, as observed in the breast cancer RNAseq data (Fig. 1). Sources of technical bias can also stem from continuous variables such as amount of DNA or RIN-value (RNA Integrity Number, ranges from 1 to 10) as observed in the kidney cancer RNAseq dataset. Furthermore, technical bias can occur in more than one principal component. For example, the batch variable of the lung cancer RNAseq data is associated with principal components 2, 3 and 4; and the strength of these associations differs across principal components. Biological variation may overlap technical variation and therefore can influence the same principal component. For example, principal component 2 of the bladder cancer expression data indicates that the technical ‘batch’ and the biological ‘CIS’ variables are interrelated. In summary, technical bias is present in all investigated high-throughput technologies, varies in effect size, and may overlap with biological variation in the data.

Figure 1

Technical and biological variation in cancer high-throughput data.

Notes: The prince plots show the log10P-values from univariate linear regression of the top 10 principal components with sample annotations as regressors. The P-values are color-coded from red (P < 10−8) to white (P = 1). Sample annotations are named as in the TCGA biotab files or patient information tables of the respective TCGA portal publications.

Abbreviations: TCGA, The Cancer Genome Atlas; CIS, carcinoma in situ.

Monitoring of batch correction

We took RNAseq data from TCGA’s colon cancer project generated using Illumina Hiseq2000 technology as an example.22 To make the analysis straightforward, we considered only the four largest sample batches. A plot of the interrelation of some important biological and technical variables revealed information on the study design of the TCGA colon project (Fig. 2A). The ‘batch’ variable overlaps with ‘date-of-shipment’ and ‘plate-id,’ as is the case for most TCGA projects. On the other hand, ‘batch’ is largely independent from biological variables such as ‘MSI status’ or ‘MLH1 silencing,’ indicating that each batch contains roughly the same biological composition. The prince plot shows that batch, together with other technical variables, confounds the first and the third component (Fig. 2B). The biological variables are highly associated with the second principal component and show moderate association with the fourth component. As biological and technical variables are associated to different sets of principal components, they contribute to uncorrelated variation patterns. No conclusions can be made a priori on the unexplained variance in the principal component analysis. For example, the variance in component 5 could be caused by either unknown technical or unknown biological processes. Batch effects have an immediate impact on unsupervised clustering analysis, with samples from the same batch clustering together (Fig. 2C). The two main colon cancer clusters are driven significantly by batch assignment (P = 2.5 × 10−6, Fisher test). We removed the first and third principal components from the dataset, as they are dominated by technical confounders and confirm the removal using a prince plot (Fig. 2D). This resulted in the batch variable being equally distributed across the two main clusters (Fig. 2E, P = 0.47). The TCGA publication concluded that MYC is a key regulator in colon cancer. We compared the associations of all genes on the platform to MYC expression, before and after correction (Fig. 2F). Before correction, 2879 genes were correlated to MYC at fdr = 0.05 while 4757 genes were after correction. Furthermore, there were 2406 genes before correction and 3924 genes after correction correlated to micro-satellite instability, respectively. It should be emphasized that successful batch correction can also decrease associations in supervised analysis. This is the case when the technical variable and the biological variable of interest are correlated. In such cases, the initial biological effect had been inflated by the batch effect.

Figure 2

Adjustment of the RNAseq data from the TCGA colorectal cancer project.

Notes: The 4 largest batches of the colon cancer data are analyzed before and after data correction. (A) Confounding plot shows the association of sample annotations with P-values color-coded from purple (P < 10−8) to white (P = 1). (B) Prince plot before correction. Legend as in Figure 1, and percentage of variation for each principal component in brackets. (C) Hierarchical cluster analysis (HCA) using correlation as distance and ward algorithm as linkage method. MSI, microsatellite instability: green, stable microsatellites; red, MSI-low; black, MSI-high. (D) Prince plot after removal of principal components 1 and 3. (E) HCA after correction. (F) Correlation of the expression of all genes on the platform to MYC expression before (black) and after (green) correction.

Monitoring of dataset merging

In another example, two RNAseq datasets taken from the ReCount database were assessed.23 Here, two labs used independently generated B-lymphocyte cell lines from the same 29 HapMap samples. A large association of the study variable was found with the first principal component of the merged data (Fig. 3A), and corrected the data by setting the median of each probe to be the same in both studies. Gender remained associated to the data after correction (Fig. 3B). The gender association is weak, however, as it relies on only 3 Y-chromosome genes, expressed in males. The highest expressed gene, RPS4Y1, shows increased male-specific expression after study correction (before: P = 6 × 10−9; after: P = 3 × 10−15). For a single HapMap individual, RNAseq data across the two studies should be similar. Before study correction sample pairs, however, are anti-correlated. This is due to the strong effect of the study variable in combination with low biological variation, as all cell lines are derived from lymphocytic cells of healthy individuals. Mean correlation values of HapMap cell line pairs increase after study correction, from −0.38 to 0.29 (Fig. 3C).

Figure 3

Merging of two HapMap RNAseq datasets.

Notes: RNAseq data of 29 HapMap cell lines from two independent studies. (A) Prince plot before study correction and (B) after correction. (C) Density plot of correlations of HapMap cell line pairs before (black) and after (red) study correction.

Discussion

We find that large RNAseq datasets, such as those generated by the TCGA,22,24–26 can be burdened with technical variation and that this bias distorts downstream analysis. For expression microarray data, it has been shown that standard normalization pipelines may perform poorly to remove batch effects.2 In the colon RNA-seq dataset from TCGA, we applied quantile-normalization to RPKM values, thereby removing unexplained variances. However, batch effects remained after quantile-normalization. Applying a vst, as suggested for RNAseq data,19 did not make a notable impact on the TCGA datasets. For sequencing technologies, it may be possible that platform-tailored alignment and pre-processing algorithms can reduce batch effects, and we would argue that the prince plot is an adequate tool to monitor improvements. The high prevalence of batch effects has important implications on study design. As it is unclear how to avoid batch effects, it may be wise to consider a study design that allows for repairing batch effects. In the worst case, biological variables and technical variables are highly correlated, eg, Phenotype A was assayed preferentially on Date A, and Phenotype B was preferentially assayed on Date B. The removal of a batch effect in such a case will be problematic as it reduces the biological variation in the data. To be able to correct for batch effects, each batch should be a good representation of the overall cohort. If there remains a risk that a strong biological variable is unknown, randomization of samples across batches may be a good strategy. A sufficiently large number of samples per batch are needed to statistically secure biological independence of batches. When the data is a compendium of small batches, batch correction is likely to shift biological variation, and hence outweighs the benefits. Unfortunately, many TCGA batches show biological selection and are small-sized, which makes it problematic to provide batch-corrected data. To control batch effects, we propose a simple framework (Fig. 4). During the experiment all possible sources of technical variation are recorded and put into relation with biological annotations. The data is then screened by a prince plot and hierarchical clustering analysis. This may lead to detection of batch effects and subsequent batch correction efforts. If the correction algorithms provided by the swamp package are not suitable for a certain dataset, we encourage the use of an alternative algorithm from the literature. The choice of batch correction algorithm is likely to be dependent on study design,12 pre-processing steps, sample type, and platform. Regardless of the chosen algorithm, it is important to monitor the success of batch correction. Therefore, the outcome of data manipulation is controlled again by a prince plot/HCA. The elements of this framework are implemented in the R package swamp.

Conclusions

In summary, we find that high-throughput datasets which result in continuous measurements are potentially subject to technical biases. In such data, we argue that it is essential to monitor batch effects, eg, using the prince plot. Furthermore, we recommend a study design that keeps technical and biological variables independent, to be able to take rescue actions. To increase transparency and reproducibility of scientific findings, data submissions to public repositories should include technical variables. Simple data processing vs. variance stabilizing transformation. Notes: The prince plots show the log10P-values from univariate linear regression of the top 10 principal components with sample annotations as regressors. The P-values are color-coded from red (P < 10−8) to white (P = 1). For sample annotations see Figure 1.

26 in total

1. Statistical modeling and visualization of molecular profiles in cancer.

Authors: Robert Scharpf; Elizabeth S Garrett; Jiang Hu; Giovanni Parmigiani
Journal: Biotechniques Date: 2003-03 Impact factor: 1.993

2. Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies.

Authors: Sutirtha Chakraborty; Somnath Datta; Susmita Datta
Journal: Bioinformatics Date: 2012-01-11 Impact factor: 6.937

3. The sva package for removing batch effects and other unwanted variation in high-throughput experiments.

Authors: Jeffrey T Leek; W Evan Johnson; Hilary S Parker; Andrew E Jaffe; John D Storey
Journal: Bioinformatics Date: 2012-01-17 Impact factor: 6.937

4. Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies.

Authors: Andrew E Teschendorff; Joanna Zhuang; Martin Widschwendter
Journal: Bioinformatics Date: 2011-04-06 Impact factor: 6.937

5. LTR: Linear Cross-Platform Integration of Microarray Data.

Authors: Paul C Boutros
Journal: Cancer Inform Date: 2010-08-27

6. Comprehensive molecular characterization of human colon and rectal cancer.

Authors:
Journal: Nature Date: 2012-07-18 Impact factor: 49.962

7. Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages.

Authors: Jonatan Taminau; Stijn Meganck; Cosmin Lazar; David Steenhoff; Alain Coletta; Colin Molter; Robin Duque; Virginie de Schaetzen; David Y Weiss Solís; Hugues Bersini; Ann Nowé
Journal: BMC Bioinformatics Date: 2012-12-24 Impact factor: 3.169

8. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods.

Authors: Chao Chen; Kay Grennan; Judith Badner; Dandan Zhang; Elliot Gershon; Li Jin; Chunyu Liu
Journal: PLoS One Date: 2011-02-28 Impact factor: 3.240

9. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

10. An epigenetic signature in peripheral blood predicts active ovarian cancer.

Authors: Andrew E Teschendorff; Usha Menon; Aleksandra Gentry-Maharaj; Susan J Ramus; Simon A Gayther; Sophia Apostolidou; Allison Jones; Matthias Lechner; Stephan Beck; Ian J Jacobs; Martin Widschwendter
Journal: PLoS One Date: 2009-12-18 Impact factor: 3.240

26 in total

Review 1. Library construction for next-generation sequencing: overviews and challenges.

Authors: Steven R Head; H Kiyomi Komori; Sarah A LaMere; Thomas Whisenant; Filip Van Nieuwerburgh; Daniel R Salomon; Phillip Ordoukhanian
Journal: Biotechniques Date: 2014-02-01 Impact factor: 1.993

2. Multi-perspective quality control of Illumina RNA sequencing data analysis.

Authors: Quanhu Sheng; Kasey Vickers; Shilin Zhao; Jing Wang; David C Samuels; Olivia Koues; Yu Shyr; Yan Guo
Journal: Brief Funct Genomics Date: 2017-07-01 Impact factor: 4.241

3. Possible Human Papillomavirus 38 Contamination of Endometrial Cancer RNA Sequencing Samples in The Cancer Genome Atlas Database.

Authors: Majid Kazemian; Min Ren; Jian-Xin Lin; Wei Liao; Rosanne Spolski; Warren J Leonard
Journal: J Virol Date: 2015-06-17 Impact factor: 5.103

4. Comprehensive competitive endogenous RNA network analysis reveals EZH2-related axes and prognostic biomarkers in hepatocellular carcinoma.

Authors: Mohammad Hossein Donyavi; Sadra Salehi-Mazandarani; Parvaneh Nikpour
Journal: Iran J Basic Med Sci Date: 2022-03 Impact factor: 2.532

5. Analysis of multispectral imaging with the AstroPath platform informs efficacy of PD-1 blockade.

Authors: Sneha Berry; Nicolas A Giraldo; Benjamin F Green; Alexander S Szalay; Janis M Taube; Tricia R Cottrell; Julie E Stein; Elizabeth L Engle; Haiying Xu; Aleksandra Ogurtsova; Charles Roberts; Daphne Wang; Peter Nguyen; Qingfeng Zhu; Sigfredo Soto-Diaz; Jose Loyola; Inbal B Sander; Pok Fai Wong; Shlomit Jessel; Joshua Doyle; Danielle Signer; Richard Wilton; Jeffrey S Roskes; Margaret Eminizer; Seyoun Park; Joel C Sunshine; Elizabeth M Jaffee; Alexander Baras; Angelo M De Marzo; Suzanne L Topalian; Harriet Kluger; Leslie Cope; Evan J Lipson; Ludmila Danilova; Robert A Anders; David L Rimm; Drew M Pardoll
Journal: Science Date: 2021-06-11 Impact factor: 47.728

6. Molecular stratification of metastatic melanoma using gene expression profiling: Prediction of survival outcome and benefit from molecular targeted therapy.

Authors: Helena Cirenajwis; Henrik Ekedahl; Martin Lauss; Katja Harbst; Ana Carneiro; Jens Enoksson; Frida Rosengren; Linda Werner-Hartman; Therese Törngren; Anders Kvist; Erik Fredlund; Pär-Ola Bendahl; Karin Jirström; Lotta Lundgren; Jillian Howlin; Åke Borg; Sofia K Gruvberger-Saal; Lao H Saal; Kari Nielsen; Markus Ringnér; Hensin Tsao; Håkan Olsson; Christian Ingvar; Johan Staaf; Göran Jönsson
Journal: Oncotarget Date: 2015-05-20

7. The Built Environment Is a Microbial Wasteland.

Authors: Sean M Gibbons
Journal: mSystems Date: 2016-04-19 Impact factor: 6.496

8. Microbiome analyses of blood and tissues suggest cancer diagnostic approach.

Authors: Gregory D Poore; Evguenia Kopylova; Qiyun Zhu; Carolina Carpenter; Serena Fraraccio; Stephen Wandro; Tomasz Kosciolek; Stefan Janssen; Jessica Metcalf; Se Jin Song; Jad Kanbar; Sandrine Miller-Montgomery; Robert Heaton; Rana Mckay; Sandip Pravin Patel; Austin D Swafford; Rob Knight
Journal: Nature Date: 2020-03-11 Impact factor: 49.962

9. DNA methylation subgroups in melanoma are associated with proliferative and immunological processes.

Authors: Martin Lauss; Markus Ringnér; Anna Karlsson; Katja Harbst; Christian Busch; Jürgen Geisler; Per Eystein Lønning; Johan Staaf; Göran Jönsson
Journal: BMC Med Genomics Date: 2015-11-06 Impact factor: 3.063

10. Endocrine regulation of predator-induced phenotypic plasticity.

Authors: Stuart R Dennis; Gerald A LeBlanc; Andrew P Beckerman
Journal: Oecologia Date: 2014-10-05 Impact factor: 3.225