Stephan Weißbach (1,2), Stanislav Sys (1), Charlotte Hewel (1), Hristo Todorov (1), Susann Schweiger (1,3), Jennifer Winter (1,3), Markus Pfenninger (4,5,6), Ali Torkamani (7), Doug Evans (7), Joachim Burger (8), Karin Everschor-Sitte (9), Helen Louise May-Simera (10), Susanne Gerber (11).
1. Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany.
2. Institute of Developmental Biology and Neurobiology, Johannes Gutenberg-University Mainz, Mainz, Germany.
3. Leibniz Institute for Resilience Research, Mainz, Germany.
4. Department of Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre, Senckenberganlage 25, 60325, Frankfurt am Main, Germany.
5. Institute for Molecular and Organismic Evolution, Johannes Gutenberg-University Mainz, Johann-Joachim-Becher-Weg 7, 55128, Mainz, Germany.
6. LOEWE Centre for Translational Biodiversity Genomics, Senckenberg Biodiversity and Climate Research Centre, Senckenberganlage 25, 60325, Frankfurt am Main, Germany.
7. Department of Integrative Structural and Computational Biology, Scripps Research Translational Institute, California Campus, San Diego, USA.
8. Institute of Anthropology, Johannes Gutenberg-University Mainz, Mainz, Germany.
9. Institute of Physics, Johannes Gutenberg-University Mainz, Mainz, Germany.
10. Institute of Molecular Physiology, Johannes Gutenberg-University Mainz, Mainz, Germany.
11. Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany. sugerber@uni-mainz.de.
Abstract
BACKGROUND: Next Generation Sequencing (NGS) is the foundation of many studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. To methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects, each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatics pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the impact of the sequencing platform.

RESULTS: The number of detected variants and variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) showed lower concordance than single nucleotide polymorphisms (SNPs). The discrepancies in absolute indel counts were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the GATK best practice recommendations considerably improved concordance between the respective setups.

CONCLUSION: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
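To make the notions of "concordance" and "uniquely called variants" concrete, the following is a minimal Python sketch. It is not the authors' method: the abstract does not state which concordance measure was used, so the Jaccard index here is an assumption, and the setup names, function names, and toy variants are all hypothetical. Each variant is represented as a `(chromosome, position, ref, alt)` tuple.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two variant sets: shared calls / all calls (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_concordance(calls):
    """Jaccard concordance for every pair of setups, keyed by sorted setup names."""
    return {
        (x, y): jaccard(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }

def unique_calls(calls):
    """Variants reported by exactly one setup and by none of the others."""
    result = {}
    for name, variants in calls.items():
        others = set().union(*(v for n, v in calls.items() if n != name))
        result[name] = variants - others
    return result

# Hypothetical toy call sets for one individual; not data from the study.
calls = {
    "hiseq_x": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")},
    "hiseq": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "complete_genomics": {("chr1", 100, "A", "G"), ("chr3", 10, "T", "C")},
}

concordance = pairwise_concordance(calls)
unique = unique_calls(calls)
```

In this toy data the two Illumina-style sets agree more with each other than either does with the third set, and the only indel appears as a call unique to one setup, which mirrors the kind of pattern the abstract describes without reproducing its results.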