
Overcoming bias and systematic errors in next generation sequencing data.

Margaret A Taub, Hector Corrada Bravo, Rafael A Irizarry.

Abstract

Considerable time and effort have been spent in developing analysis and quality assessment methods to allow the use of microarrays in a clinical setting. As is the case for microarrays and other high-throughput technologies, data from new high-throughput sequencing technologies are subject to technological and biological biases and systematic errors that can impact downstream analyses. Only when these issues can be readily identified and reliably adjusted for will clinical applications of these new technologies be feasible. Although much work remains to be done in this area, we describe consistently observed biases that should be taken into account when analyzing high-throughput sequencing data. In this article, we review current knowledge about these biases, discuss their impact on analysis results, and propose solutions.


Year: 2010 PMID: 21144010 PMCID: PMC3025429 DOI: 10.1186/gm208

Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117

Background: clinical applications of microarrays

While microarrays were rapidly accepted in research applications, incorporating them into clinical settings has required over a decade of benchmarking, standardization and the development of appropriate analysis methods. Extensive cross-platform and cross-laboratory analyses demonstrated that low-level processing choices [1-3], including data summarization, normalization, and adjustment for laboratory or 'batch' effects [4], have an important effect on outcome accuracy. Some of this work was done under the auspices of the Food and Drug Administration (FDA), most notably the Microarray Quality Control (MAQC) studies, which were developed specifically to determine the utility of microarray technologies in a clinical setting [5,6]. Microarray-measured gene expression signatures now form the basis of several FDA-approved clinical diagnostic tests, including MammaPrint and Pathwork's Tissue of Origin test [7,8]. With high-throughput sequencing still in its infancy, many questions must be addressed before approval for clinical applications can realistically be expected. A study on the scale of the MAQC analyses for microarrays has yet to be carried out for sequencing, although one is in the works. Nevertheless, there is already evidence that similar technical biases are present in sequencing data, and these will need to be understood and adjusted for before these new technologies can be used in a clinical setting. In this commentary, we present some of these known biases and discuss the current state of solutions aimed at addressing them. Looking ahead to the application of this new technology in the clinical setting, we see both hurdles and promise.

Bias and batch effects in high-throughput assays

Biases arise when an observed measurement does not reflect the quantity to be measured because of a systematic distorting effect. For a concrete example from microarrays, non-specific hybridization at microarray probes produces an observed intensity that is not an unbiased measure of the presence of the target sequence in the population being studied. Thorough investigation has revealed that the chemical composition of microarray probes influences this effect, and analysis methods have been developed to alleviate it [9]. Similarly, batch effects, whereby external factors (for example, time or technician) have a systematic influence on experimental outcomes across a condition, have been seen in many high-throughput technologies, and can cause confounding without proper study design and analysis techniques [4,10]. There is already evidence that these issues are present in experiments employing high-throughput sequencing data, indicating that similar precautions and methodological developments will be necessary before sequencing data can be used with confidence in the clinic.

Bias in base-call error rates

High-throughput sequencing involves the parallel sequencing of millions of DNA fragments. Generally, these fragments are sequenced one base at a time, and, at each step or cycle, the current base is determined through fluorescent detection (for a review, see Holt and Jones [11]). Although sequencing platform chemistries differ, in all cases care must be taken to avoid introducing bias at this early stage. On the Illumina Genome Analyzer platform, for example, base-call errors are not randomly distributed across the cycle positions in sequenced reads [12]. Although not as extensively studied, similar biases have been observed, and low-level signal correction methods have been developed, for other sequencing platforms [13]. Incorrect base calls can have a deleterious impact downstream, both in aligning reads to the reference genome (resulting in fewer or incorrect alignments) and in variant detection (contributing to false-positive variant calls). In experiments aimed at detecting variants in genomic DNA, concern about false positives may lead researchers to employ stringent filtering criteria. Many researchers hypothesize that the discovery of rare variants will be a crucial next step in understanding the genetic causes of complex diseases [14], and overly strict filtering criteria may eliminate exactly the variants of most interest and impact. Improving the quality of nucleotide calls, either through better base calling or through error correction, will make more accurate variant calls possible. Alternative base-calling methods that reduce the cycle-related bias in error rates have been developed (Figure 1) [15,16]. Numerous error correction methods have also been developed to remove errors from reads after base calls have been made [17-20]. Because base calling requires the raw intensity files, which many laboratories never receive from sequencing centers, re-calling bases is logistically burdensome, and error correction provides a potential alternative.
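One simple way to check a data set for the cycle-dependent error pattern described above is to tabulate mismatches against the reference by read cycle. The sketch below is illustrative only. The function name `per_cycle_mismatch_rate` and the input format (pairs of read and reference strings of equal length, written in sequencing order, with no indels) are assumptions made for this example and are not taken from any of the cited methods. Mismatches also include true variants and alignment errors, so the resulting rates are only a proxy for base-call error.

```python
from collections import Counter


def per_cycle_mismatch_rate(alignments):
    """Return {cycle: mismatch rate} for gap-free read/reference pairs.

    `alignments` is an iterable of (read, reference) string pairs of equal
    length, both in sequencing order, so index i corresponds to cycle i + 1.
    This input format is a simplifying assumption for illustration.
    """
    mismatches = Counter()
    totals = Counter()
    for read, ref in alignments:
        for cycle, (called, expected) in enumerate(zip(read, ref), start=1):
            if called == "N" or expected == "N":
                continue  # skip no-calls and unknown reference bases
            totals[cycle] += 1
            if called != expected:
                mismatches[cycle] += 1
    return {cycle: mismatches[cycle] / totals[cycle] for cycle in sorted(totals)}


# Toy data: mismatches are concentrated in the last two cycles.
toy_alignments = [
    ("ACGTA", "ACGTA"),
    ("ACGTT", "ACGTA"),
    ("ACGAT", "ACGTA"),
    ("ACGTA", "ACGTA"),
]
rates = per_cycle_mismatch_rate(toy_alignments)
for cycle, rate in rates.items():
    print(f"cycle {cycle}: {rate:.2f}")
```

A mismatch rate that rises steadily with cycle number, as in the toy data, is the kind of non-random pattern that motivates improved base calling or read-level error correction.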

Figure 1

Effect of base-calling improvements on error bias. This figure is based on figures from Bravo and Irizarry [15]. For a site that MAQ [28] had falsely called as a variant, the authors examined the pattern of nucleotide calls by the read cycle at which each call occurred. (a) Results with the default base-calling software; (b) results after application of the base-calling method of Bravo and Irizarry. The x-axis shows read cycle and the colored points indicate the percentage of calls at each cycle that were made for a particular nucleotide. In (a), T is called much more frequently at the putative SNP site only when the site falls in later sequencing cycles, indicating a technical bias in base calls at this position, while the plot in (b) shows a strong reduction in this bias. In addition, the location is no longer called as a variant by MAQ after the improved base calling.

Coverage biases

Another long-observed phenomenon of high-throughput sequencing data is the strong, reproducible effect of local sequence content on the coverage of a genomic region by sequencing reads [12]. This phenomenon is analogous to probe effects for microarray platforms. For sequencing projects where coverage levels are compared across regions, such as RNA-Seq, chromatin immunoprecipitation-sequencing (ChIP-Seq) or copy number detection, this phenomenon can be particularly problematic. Researchers carrying out ChIP-Seq experiments have observed a systematic relationship between coverage and GC content (Figure 2) [21], and researchers using sequencing to measure copy number have found that adjusting for GC content improves precision [22]. In both applications, adjusting the signal for GC content leads to improved results [21,22].
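To make the idea of a GC adjustment concrete, the following sketch rescales binned read counts so that bins with different GC content have comparable typical coverage. It is a deliberately simple stratified-median scaling and is not the model of Kuan et al. [21] or the procedure of [22]. The function names, the number of strata and the toy counts are all assumptions chosen for illustration.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else None


def gc_adjust(counts, gc, n_strata=10):
    """Scale each bin's count by (global median) / (median of its GC stratum).

    `counts` and `gc` are parallel lists with one entry per genomic bin.
    Bins with undefined GC, or in a stratum whose median is zero, yield None.
    """
    def stratum(g):
        return min(int(g * n_strata), n_strata - 1)

    usable = [(c, g) for c, g in zip(counts, gc) if g is not None]
    global_median = median(c for c, _ in usable)

    by_stratum = {}
    for c, g in usable:
        by_stratum.setdefault(stratum(g), []).append(c)
    stratum_median = {k: median(v) for k, v in by_stratum.items()}

    adjusted = []
    for c, g in zip(counts, gc):
        if g is None or stratum_median[stratum(g)] == 0:
            adjusted.append(None)
        else:
            adjusted.append(c * global_median / stratum_median[stratum(g)])
    return adjusted


# Toy example: bins with higher GC content have roughly twice the coverage.
toy_counts = [10, 12, 20, 24]
toy_gc = [0.25, 0.25, 0.55, 0.55]
toy_adjusted = gc_adjust(toy_counts, toy_gc)
print(toy_adjusted)
```

After adjustment, the low-GC and high-GC bins in the toy example are on the same scale, so remaining differences between bins are less likely to reflect sequence composition alone. In practice a smooth fit (for example, a regression of count on GC content) is often preferred to coarse strata.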

Figure 2

Effect of mappability and GC content on coverage. (a) Mean tag counts in 50-bp bins, with error bars, from a naked DNA sample from a ChIP-Seq experiment, showing that they depend on mappability and GC content. (b) 97.4% of bins have GC content between 0.2 and 0.56 (expressed as a fraction), as marked by the vertical dashed lines. This figure is reproduced with permission from Kuan et al. [21].

Genomic regions that are identical or highly similar to one another create ambiguity in alignment to the genome, and ambiguous reads are generally discarded. The low coverage in these regions can produce biased measurements or remove the regions from consideration in downstream analysis, potentially eliminating important signals from the data. Methods have been developed for taking this mappability property into account to adjust the observed signal in these regions [21].

Some spatial biases seem to be unique to the sample preparation protocol being used. Hansen et al. [23] have shown that random hexamer priming can lead to coverage bias in RNA-Seq analyses, and Li et al. [24] present a model for the non-uniformity of RNA-Seq read coverage. Both papers provide solutions to adjust for these biases and achieve more uniform coverage.

Batch effects

Batch effects arise when variability in the data correlates with a technical variable, such as processing date, location or technician. Such effects have been observed in many different high-throughput experiments. Leek et al. [10] investigated batch effects in genomic DNA sequencing carried out as part of the 1000 Genomes Project [25]. To investigate whether batch effects were present in a subset of this sequencing data, Leek et al. compiled a set of aligned sequencing data sets that were produced in the same location at different dates. After summarization and normalization of the data, clear spatial patterns can be seen in several of the samples, and the patterns are correlated with the technical variable of processing date (Figure 3). Patterns like these could lead to false conclusions in experiments where the sequencing coverage is related to the condition of interest, such as copy-number or peak height.
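The kind of check described above can be sketched in a few lines: standardize binned coverage across samples, then average the standardized profiles within each processing batch. This is only an illustration and is not the analysis pipeline of Leek et al. [10]. The function names, the batch labels and the toy coverage values are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardize_bins(coverage):
    """Z-score each genomic bin across samples.

    `coverage` is a list of samples, each a list of per-bin coverage values
    (all samples must cover the same bins; at least two samples are needed).
    """
    n_bins = len(coverage[0])
    z = [[0.0] * n_bins for _ in coverage]
    for j in range(n_bins):
        column = [sample[j] for sample in coverage]
        mu, sd = mean(column), stdev(column)
        for i, value in enumerate(column):
            z[i][j] = (value - mu) / sd if sd > 0 else 0.0
    return z


def batch_profiles(z, batch_labels):
    """Average the standardized profile of the samples within each batch."""
    groups = defaultdict(list)
    for row, label in zip(z, batch_labels):
        groups[label].append(row)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in groups.items()
    }


# Toy data: two bins, four samples, two hypothetical processing days.
toy_coverage = [[10, 20], [12, 18], [20, 10], [22, 8]]
toy_days = ["day1", "day1", "day2", "day2"]
profiles = batch_profiles(standardize_bins(toy_coverage), toy_days)
for day, profile in profiles.items():
    print(day, [round(value, 2) for value in profile])
```

If samples were exchangeable across processing days, the per-batch average profiles would hover near zero in every bin. Profiles that swing in opposite directions between batches, as in the toy data, suggest that coverage varies with the technical variable.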

Figure 3

Batch effect for second-generation sequencing data from the 1000 Genomes Project. This figure is similar to one from Leek et al. [10]. Each row in the heat-map shows data from a different HapMap sample processed in the same facility with the same platform (see Leek et al. [10] for a description of the data), for a 3-Mb region on chromosome 16, with data summarized in 10-kb bins. Data from each bin were standardized across samples, with blue representing 3 standard deviations below average and orange representing 3 standard deviations above average. The rows are ordered by date, with black lines separating different processing days. The largest batch effect can be seen in the alternating pattern of blue and orange on days 223 to 241 and days 244 to 251.

The primary way of avoiding batch effects is through careful experimental design. Randomization of all experimental variables across treatment conditions should be employed to avoid systematic effects within a condition. To correct for batch effects after the fact, they must first be detected and then adjusted for, whether through the use of covariates in linear models or through more involved procedures such as surrogate variable analysis [26]. These methods work best when confounding between the technical variable and the outcome of interest is avoided; thus, careful experimental design is essential.

One challenge of using sequencing technologies in clinical applications is that conclusions are likely to be drawn by comparing newly acquired data with genome profiles derived from previously collected data. Interpreting findings derived from this type of comparison is made difficult by the batch effect. Better understanding of batch-to-batch variation and development of single-sample methods such as fRMA [27] will be important steps forward in addressing this challenge.
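The covariate-based adjustment mentioned above, and the reason confounding undermines it, can be illustrated with the simplest possible case: a linear model whose only covariate is a batch indicator, which amounts to subtracting each batch's mean. The sketch below is a toy illustration under that assumption, not surrogate variable analysis [26] and not a substitute for fitting batch and condition jointly. All names and numbers are hypothetical, with a simulated condition effect of 2 and a batch effect of 5.

```python
from collections import defaultdict
from statistics import mean


def remove_batch_means(values, batches):
    """Subtract each batch's mean and add back the grand mean.

    Equivalent to taking residuals from a linear model with batch indicators
    as the only covariates, then re-centering on the overall mean.
    """
    grand_mean = mean(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_mean = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


def group_difference(values, condition):
    """Mean of 'case' samples minus mean of 'control' samples."""
    case = [v for v, c in zip(values, condition) if c == "case"]
    control = [v for v, c in zip(values, condition) if c == "control"]
    return mean(case) - mean(control)


batches = ["A", "A", "B", "B"]

# Balanced design: each batch contains one control and one case.
balanced_condition = ["control", "case", "control", "case"]
balanced_values = [10, 12, 15, 17]
balanced_adjusted = remove_batch_means(balanced_values, batches)

# Confounded design: all controls in batch A, all cases in batch B.
confounded_condition = ["control", "control", "case", "case"]
confounded_values = [10, 10, 17, 17]
confounded_adjusted = remove_batch_means(confounded_values, batches)

print(group_difference(balanced_adjusted, balanced_condition))
print(group_difference(confounded_adjusted, confounded_condition))
```

In the balanced design the adjustment leaves the simulated condition effect intact, whereas in the confounded design it removes the condition effect together with the batch effect. No after-the-fact correction can separate the two when batch and condition coincide, which is why randomization at the design stage matters.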

Conclusion

As with other high-throughput biological assays, high-throughput sequencing presents many challenges when it comes to avoiding bias and batch effects. Promising solutions to these problems are already in development, including low-level improvements in base calling and error correction, improved per-position data quality metrics, adjustments to coverage estimates to alleviate context-specific or protocol-specific effects, and experimental designs that minimize potential confounding effects of batch. The lessons learned through the development of clinical applications of microarrays, including the need for benchmark studies like those conducted by the MAQC project, should help accelerate the process of incorporating high-throughput sequencing into the clinic.

Abbreviations

bp: base pair; ChIP: chromatin immunoprecipitation; FDA: Food and Drug Administration; MAQC: Microarray Quality Control.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RI conceived of the paper and contributed ideas. HCB performed analyses and contributed ideas. MT performed analyses, contributed ideas and drafted the manuscript. All authors read and approved the final manuscript.

References (25 in total; the 10 listed here are not numbered to match the in-text citations)

1. Reptile: representative tiling for short read error correction.

Authors: Xiao Yang; Karin S Dorman; Srinivas Aluru
Journal: Bioinformatics Date: 2010-08-16 Impact factor: 6.937

2. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements.

Authors: Leming Shi; Laura H Reid; Wendell D Jones; Richard Shippy; Janet A Warrington; Shawn C Baker; Patrick J Collins; Francoise de Longueville; Ernest S Kawasaki; Kathleen Y Lee; Yuling Luo; Yongming Andrew Sun; James C Willey; Robert A Setterquist; Gavin M Fischer; Weida Tong; Yvonne P Dragan; David J Dix; Felix W Frueh; Frederico M Goodsaid; Damir Herman; Roderick V Jensen; Charles D Johnson; Edward K Lobenhofer; Raj K Puri; Uwe Scherf; Jean Thierry-Mieg; Charles Wang; Mike Wilson; Paul K Wolber; Lu Zhang; Shashi Amur; Wenjun Bao; Catalin C Barbacioru; Anne Bergstrom Lucas; Vincent Bertholet; Cecilie Boysen; Bud Bromley; Donna Brown; Alan Brunner; Roger Canales; Xiaoxi Megan Cao; Thomas A Cebula; James J Chen; Jing Cheng; Tzu-Ming Chu; Eugene Chudin; John Corson; J Christopher Corton; Lisa J Croner; Christopher Davies; Timothy S Davison; Glenda Delenstarr; Xutao Deng; David Dorris; Aron C Eklund; Xiao-hui Fan; Hong Fang; Stephanie Fulmer-Smentek; James C Fuscoe; Kathryn Gallagher; Weigong Ge; Lei Guo; Xu Guo; Janet Hager; Paul K Haje; Jing Han; Tao Han; Heather C Harbottle; Stephen C Harris; Eli Hatchwell; Craig A Hauser; Susan Hester; Huixiao Hong; Patrick Hurban; Scott A Jackson; Hanlee Ji; Charles R Knight; Winston P Kuo; J Eugene LeClerc; Shawn Levy; Quan-Zhen Li; Chunmei Liu; Ying Liu; Michael J Lombardi; Yunqing Ma; Scott R Magnuson; Botoul Maqsodi; Tim McDaniel; Nan Mei; Ola Myklebost; Baitang Ning; Natalia Novoradovskaya; Michael S Orr; Terry W Osborn; Adam Papallo; Tucker A Patterson; Roger G Perkins; Elizabeth H Peters; Ron Peterson; Kenneth L Philips; P Scott Pine; Lajos Pusztai; Feng Qian; Hongzu Ren; Mitch Rosen; Barry A Rosenzweig; Raymond R Samaha; Mark Schena; Gary P Schroth; Svetlana Shchegrova; Dave D Smith; Frank Staedtler; Zhenqiang Su; Hongmei Sun; Zoltan Szallasi; Zivana Tezak; Danielle Thierry-Mieg; Karol L Thompson; Irina Tikhonova; Yaron Turpaz; Beena Vallanat; Christophe Van; Stephen J Walker; Sue Jane Wang; Yonghong Wang; Russ Wolfinger; Alex Wong; Jie Wu; Chunlin Xiao; Qian Xie; Jun Xu; Wen Yang; Liang Zhang; Sheng Zhong; Yaping Zong; William Slikker
Journal: Nat Biotechnol Date: 2006-09 Impact factor: 54.908

3. Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors: Heng Li; Jue Ruan; Richard Durbin
Journal: Genome Res Date: 2008-08-19 Impact factor: 9.043

4. Frozen robust multiarray analysis (fRMA).

Authors: Matthew N McCall; Benjamin M Bolstad; Rafael A Irizarry
Journal: Biostatistics Date: 2010-01-22 Impact factor: 5.899

5. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing.

Authors: Wei-Chun Kao; Kristian Stevens; Yun S Song
Journal: Genome Res Date: 2009-08-06 Impact factor: 9.043

6. The new paradigm of flow cell sequencing. (Review)

Authors: Robert A Holt; Steven J M Jones
Journal: Genome Res Date: 2008-06 Impact factor: 9.043

7. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.

Authors: Leming Shi; Gregory Campbell; Wendell D Jones; Fabien Campagne; Zhining Wen; Stephen J Walker; Zhenqiang Su; Tzu-Ming Chu; Federico M Goodsaid; Lajos Pusztai; John D Shaughnessy; André Oberthuer; Russell S Thomas; Richard S Paules; Mark Fielden; Bart Barlogie; Weijie Chen; Pan Du; Matthias Fischer; Cesare Furlanello; Brandon D Gallas; Xijin Ge; Dalila B Megherbi; W Fraser Symmans; May D Wang; John Zhang; Hans Bitter; Benedikt Brors; Pierre R Bushel; Max Bylesjo; Minjun Chen; Jie Cheng; Jing Cheng; Jeff Chou; Timothy S Davison; Mauro Delorenzi; Youping Deng; Viswanath Devanarayan; David J Dix; Joaquin Dopazo; Kevin C Dorff; Fathi Elloumi; Jianqing Fan; Shicai Fan; Xiaohui Fan; Hong Fang; Nina Gonzaludo; Kenneth R Hess; Huixiao Hong; Jun Huan; Rafael A Irizarry; Richard Judson; Dilafruz Juraeva; Samir Lababidi; Christophe G Lambert; Li Li; Yanen Li; Zhen Li; Simon M Lin; Guozhen Liu; Edward K Lobenhofer; Jun Luo; Wen Luo; Matthew N McCall; Yuri Nikolsky; Gene A Pennello; Roger G Perkins; Reena Philip; Vlad Popovici; Nathan D Price; Feng Qian; Andreas Scherer; Tieliu Shi; Weiwei Shi; Jaeyun Sung; Danielle Thierry-Mieg; Jean Thierry-Mieg; Venkata Thodima; Johan Trygg; Lakshmi Vishnuvajjala; Sue Jane Wang; Jianping Wu; Yichao Wu; Qian Xie; Waleed A Yousef; Liang Zhang; Xuegong Zhang; Sheng Zhong; Yiming Zhou; Sheng Zhu; Dhivya Arasappan; Wenjun Bao; Anne Bergstrom Lucas; Frank Berthold; Richard J Brennan; Andreas Buness; Jennifer G Catalano; Chang Chang; Rong Chen; Yiyu Cheng; Jian Cui; Wendy Czika; Francesca Demichelis; Xutao Deng; Damir Dosymbekov; Roland Eils; Yang Feng; Jennifer Fostel; Stephanie Fulmer-Smentek; James C Fuscoe; Laurent Gatto; Weigong Ge; Darlene R Goldstein; Li Guo; Donald N Halbert; Jing Han; Stephen C Harris; Christos Hatzis; Damir Herman; Jianping Huang; Roderick V Jensen; Rui Jiang; Charles D Johnson; Giuseppe Jurman; Yvonne Kahlert; Sadik A Khuder; Matthias Kohl; Jianying Li; Li Li; Menglong Li; Quan-Zhen Li; Shao Li; Zhiguang Li; Jie Liu; Ying Liu; Zhichao Liu; Lu Meng; Manuel Madera; Francisco Martinez-Murillo; Ignacio Medina; Joseph Meehan; Kelci Miclaus; Richard A Moffitt; David Montaner; Piali Mukherjee; George J Mulligan; Padraic Neville; Tatiana Nikolskaya; Baitang Ning; Grier P Page; Joel Parker; R Mitchell Parry; Xuejun Peng; Ron L Peterson; John H Phan; Brian Quanz; Yi Ren; Samantha Riccadonna; Alan H Roter; Frank W Samuelson; Martin M Schumacher; Joseph D Shambaugh; Qiang Shi; Richard Shippy; Shengzhu Si; Aaron Smalter; Christos Sotiriou; Mat Soukup; Frank Staedtler; Guido Steiner; Todd H Stokes; Qinglan Sun; Pei-Yi Tan; Rong Tang; Zivana Tezak; Brett Thorn; Marina Tsyganova; Yaron Turpaz; Silvia C Vega; Roberto Visintainer; Juergen von Frese; Charles Wang; Eric Wang; Junwei Wang; Wei Wang; Frank Westermann; James C Willey; Matthew Woods; Shujian Wu; Nianqing Xiao; Joshua Xu; Lei Xu; Lun Yang; Xiao Zeng; Jialu Zhang; Li Zhang; Min Zhang; Chen Zhao; Raj K Puri; Uwe Scherf; Weida Tong; Russell D Wolfinger
Journal: Nat Biotechnol Date: 2010-07-30 Impact factor: 54.908

8. Biases in Illumina transcriptome sequencing caused by random hexamer priming.

Authors: Kasper D Hansen; Steven E Brenner; Sandrine Dudoit
Journal: Nucleic Acids Res Date: 2010-04-14 Impact factor: 16.971

9. Quake: quality-aware detection and correction of sequencing errors.

Authors: David R Kelley; Michael C Schatz; Steven L Salzberg
Journal: Genome Biol Date: 2010-11-29 Impact factor: 13.583

10. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.

Authors: Juliane C Dohm; Claudio Lottaz; Tatiana Borodina; Heinz Himmelbauer
Journal: Nucleic Acids Res Date: 2008-07-26 Impact factor: 16.971
