Xiaotu Ma(1), Ying Shao(2), Liqing Tian(2), Diane A Flasch(2), Heather L Mulder(2), Michael N Edmonson(2), Yu Liu(2), Xiang Chen(2), Scott Newman(2), Joy Nakitandwe(3), Yongjin Li(2), Benshang Li(4), Shuhong Shen(4), Zhaoming Wang(2,5), Sheila Shurtleff(3), Leslie L Robison(5), Shawn Levy(6), John Easton(2), Jinghui Zhang(7).
1. Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA. Xiaotu.Ma@stjude.org.
2. Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA.
3. Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA.
4. Key Laboratory of Pediatric Hematology and Oncology Ministry of Health, Department of Hematology and Oncology, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, 200127, China.
5. Department of Epidemiology and Cancer Control, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA.
6. HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA.
7. Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA. Jinghui.Zhang@stjude.org.
Abstract
BACKGROUND: Sequencing errors are key confounding factors when deep next-generation sequencing (NGS) is used to detect low-frequency genetic variants, which are important for cancer molecular diagnosis, treatment, and surveillance. However, errors introduced at the various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing, are not comprehensively understood. In this study, we use current NGS technology to systematically investigate these error sources.
RESULTS: By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10^-5 to 10^-4, which is 10- to 100-fold lower than the rate generally considered achievable (10^-3) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10^-5 for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10^-4 for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR led to an approximately 6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at 0.1% to 0.01% frequency with current NGS technology by applying in silico error suppression.
CONCLUSIONS: We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.
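To illustrate why the background substitution error rate bounds the lowest detectable variant frequency, the following is a minimal sketch, not the method used in the study. It assumes a single genomic site, independent per-read errors modeled as a binomial distribution, a hypothetical read depth of 100,000, and a hypothetical significance threshold `alpha`; the function names `binom_pmf` and `min_alt_reads` are introduced here for illustration only. It finds the smallest number of variant-supporting reads that would be unlikely to arise from errors alone, and converts that count into a frequency.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k error reads among n reads, each erring with probability p."""
    log_pmf = (
        math.lgamma(n + 1)
        - math.lgamma(k + 1)
        - math.lgamma(n - k + 1)
        + k * math.log(p)
        + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def min_alt_reads(depth, error_rate, alpha=1e-3):
    """Smallest k such that P(X >= k) < alpha when X ~ Binomial(depth, error_rate).

    alpha is an assumed single-site threshold with no multiple-testing correction.
    """
    cdf = 0.0  # running value of P(X <= k - 1)
    for k in range(depth + 1):
        if 1.0 - cdf < alpha:
            return k
        cdf += binom_pmf(k, depth, error_rate)
    return depth + 1


if __name__ == "__main__":
    depth = 100_000  # assumed depth, not taken from the study
    for error_rate in (1e-3, 1e-4, 1e-5):
        k = min_alt_reads(depth, error_rate)
        print(f"error rate {error_rate:g}: need >= {k} reads, "
              f"i.e. frequency >= {100 * k / depth:.4f}%")
```

Under these assumptions, a background rate of 10^-3 requires more than 100 supporting reads at this depth, which places variants at 0.01% frequency (about 10 expected reads) below the noise floor, whereas a rate of 10^-5 brings the required count down to a handful of reads.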
Keywords:
Deep sequencing; Detection; Error rate; Hotspot mutation; Subclonal; Substitution