Wanding Zhou1, Tenghui Chen1, Hao Zhao1, Agda Karina Eterovic2, Funda Meric-Bernstam2, Gordon B Mills2, Ken Chen1. 1. Department of Bioinformatics and Computational Biology, Department of Systems Biology, Institute of Personalized Cancer Therapy and Department of Investigational Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston TX 77030, USA. 2. Department of Bioinformatics and Computational Biology, Department of Systems Biology, Institute of Personalized Cancer Therapy and Department of Investigational Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston TX 77030, USA.
Abstract
MOTIVATION: Identifying subclonal mutations and their implications requires accurate estimation of mutant allele fractions from possibly duplicated sequencing reads. Removing duplicate reads assumes that polymerase chain reaction amplification during library construction is the primary source of duplication. The alternative source, sampling coincidence from DNA fragmentation, has not been systematically investigated.

RESULTS: At sufficiently high sequencing depth, sampling-induced read duplication is non-negligible, and removing duplicate reads can overcorrect read counts, causing systemic biases in the estimation of variant allele fraction and copy number variation. Overcorrection is minimal when duplicate reads are identified by taking their mate reads into account, when inserts vary in length and when samples are sequenced in separate batches. We investigate sampling-induced read duplication in deep sequencing data with 500× to 2000× duplicates-removed sequence coverage. We provide a quantitative solution to overcorrection and guidance for designing deep sequencing platforms that facilitate accurate estimation of variant allele fraction and copy number variation.

AVAILABILITY AND IMPLEMENTATION: A Python implementation is freely available at https://bitbucket.org/wanding/duprecover/overview

CONTACT: wzhou1@mdanderson.org, kchen3@mdanderson.org

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
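To make the idea of sampling-induced duplication concrete, the following is a minimal sketch of a toy occupancy model, not the authors' duprecover method. It assumes, purely for illustration, that each read covering a site is drawn uniformly and independently from a fixed number of distinguishable fragment configurations (for example, start position and strand for single-end reads, or start position, strand and insert length when mate reads are taken into account). The function names and all example numbers are hypothetical.

```python
def expected_unique(n_reads: int, n_configs: int) -> float:
    """Expected number of distinct configurations observed when n_reads
    reads are drawn uniformly at random from n_configs possibilities
    (classical occupancy result; uniform sampling is an assumption)."""
    return n_configs * (1.0 - (1.0 - 1.0 / n_configs) ** n_reads)


def sampling_duplicate_fraction(n_reads: int, n_configs: int) -> float:
    """Expected fraction of reads that a coordinate-based duplicate remover
    would discard even though no PCR duplication occurred."""
    if n_reads == 0:
        return 0.0
    return 1.0 - expected_unique(n_reads, n_configs) / n_reads


if __name__ == "__main__":
    depth = 1000          # hypothetical number of reads covering one site
    read_length = 100     # hypothetical read length in bases

    # Single-end: a read covering the site is identified by start and strand.
    single_end_configs = read_length * 2
    # Paired-end with 50 hypothetical insert lengths: many more configurations.
    paired_end_configs = read_length * 2 * 50

    for label, k in [("single-end", single_end_configs),
                     ("paired-end, variable inserts", paired_end_configs)]:
        frac = sampling_duplicate_fraction(depth, k)
        print(f"{label}: expected sampling-induced duplicate fraction = {frac:.3f}")
```

Under these assumptions, the fraction of reads removed as coincidental duplicates grows with depth and shrinks as the number of distinguishable configurations grows, which is consistent with the abstract's observation that overcorrection is smaller when mate reads and variable insert lengths are used to identify duplicates.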