Nicholas Stoler (1), Barbara Arbeithuber (2), Gundula Povysil (3, 4), Monika Heinzl (3), Renato Salazar (3), Kateryna D Makova (5), Irene Tiemann-Boege (6), Anton Nekrutenko (7).
1. Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, PA, USA.
2. Department of Biology, The Pennsylvania State University, University Park, PA, USA.
3. Institut für Biophysik, Johannes Kepler Universität, Linz, Austria.
4. Present Address: Institute for Genomic Medicine, Columbia University Irving Medical Center, New York, NY, USA.
5. Department of Biology, The Pennsylvania State University, University Park, PA, USA. kdm16@psu.edu.
6. Institut für Biophysik, Johannes Kepler Universität, Linz, Austria. irene.tiemann@jku.at.
7. Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, PA, USA. aun1@psu.edu.
Abstract
BACKGROUND: Duplex sequencing is the most accurate approach for identifying sequence variants present at very low frequencies. Its power comes from pooling multiple descendants of both strands of each original DNA molecule, which makes it possible to distinguish true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole-genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are thrown away. RESULTS: In this paper we demonstrate that a significant fraction of these reads contain PCR or sequencing errors within their duplex tags. Correcting such errors allows "reuniting" these reads with their respective families, increasing the output of the method and making it more cost-effective. CONCLUSIONS: We combine an error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0. It is written in Python, C, AWK, and Bash. It is open source and readily available through Galaxy, Bioconda, and GitHub: https://github.com/galaxyproject/dunovo.
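The idea of reuniting singleton reads with their families can be illustrated with a minimal Python sketch. This is not Du Novo's actual algorithm, which the abstract does not describe in detail. It assumes, purely for illustration, that tags are fixed-length strings, that a tag seen only once is a candidate for correction, and that it is reassigned only when exactly one multi-read family lies within a small Hamming distance. The names `hamming`, `correct_barcodes`, and `max_dist` are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcodes(barcodes, max_dist=1):
    """Reassign singleton tags to a nearby multi-read family.

    Illustrative assumption: a singleton is corrected only if exactly
    one family tag (a tag seen more than once) is within `max_dist`
    mismatches; ambiguous or unmatched singletons are left unchanged.
    """
    counts = Counter(barcodes)
    # Tags observed more than once are treated as established families.
    families = [bc for bc, n in counts.items() if n > 1]
    corrected = []
    for bc in barcodes:
        if counts[bc] > 1:
            corrected.append(bc)
            continue
        matches = [
            fam for fam in families
            if len(fam) == len(bc) and hamming(bc, fam) <= max_dist
        ]
        # Correct only when the match is unambiguous.
        corrected.append(matches[0] if len(matches) == 1 else bc)
    return corrected


if __name__ == "__main__":
    tags = ["AAAA", "AAAA", "AAAT", "CCCC", "CCCC", "GGGG"]
    print(correct_barcodes(tags))
```

In this toy input, the singleton `AAAT` differs from the `AAAA` family at a single position and is merged into it, whereas `GGGG` has no close family and stays a singleton. A real pipeline would also have to handle very large tag sets efficiently, which this quadratic comparison does not attempt.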
Keywords:
Barcodes; Duplex sequence; Error correction; Low frequency variants