Maximillian Marin (1,2), Roger Vargas (1,2), Michael Harris (3), Brendan Jeffrey (3), L Elaine Epperson (4), David Durbin (5), Michael Strong (4), Max Salfinger (6), Zamin Iqbal (7), Irada Akhundova (8), Sergo Vashakidze (9,10), Valeriu Crudu (11), Alex Rosenthal (3), Maha Reda Farhat (1,12).
1. Department of Biomedical Informatics, Harvard Medical School, Boston, USA.
2. Department of Systems Biology, Harvard Medical School, Boston, USA.
3. Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, USA.
4. Center for Genes, Environment, and Health, National Jewish Health, Denver, Colorado, USA.
5. Mycobacteriology Reference Laboratory, Advanced Diagnostic Laboratories, National Jewish Health, Denver, Colorado, USA.
6. College of Public Health & Morsani College of Medicine, University of South Florida, Tampa, Florida, USA.
7. EMBL-EBI, Wellcome Genome Campus, Hinxton, UK.
8. Scientific Research Institute of Lung Diseases, Ministry of Health, Baku, Azerbaijan.
9. The University of Georgia, Tbilisi, Georgia.
10. National Center for Tuberculosis and Lung Diseases, Ministry of Health, Tbilisi, Georgia.
11. Phthisiopneumology Institute, Chisinau, Moldova.
12. Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, USA.
Abstract
MOTIVATION: Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences, and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short reads and PacBio long reads. We systematically studied short-read variant calling accuracy and the influence of sequence uniqueness, reference bias, and GC content.
RESULTS: Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and a minimum precision of 98.5% across the parameters evaluated. The approach that maximized variant recall while still maintaining high precision (>99%) was tuning the mapping quality (MQ) filtering threshold, i.e., the confidence of the read mapping (recall = 85.8%, precision = 99.1%, MQ ≥ 40). Additional masking of repetitive sequence content is an alternative, more conservative approach to variant calling that increases precision at a cost to recall (recall = 70.2%, precision = 99.6%, MQ ≥ 40). Of the genomic positions typically excluded for Mtb, 68% are accurately called using Illumina WGS, including 52/168 PE/PPE genes (34.5%). From these results we present a refined list of low-confidence regions across the Mtb genome, which we found to frequently overlap with regions of structural variation, low sequence uniqueness, and low sequencing coverage. Our benchmarking results have broad implications for the use of WGS in the study of Mtb biology, for inference of transmission in public health surveillance systems, and more generally for WGS applications in other organisms.
AVAILABILITY: All relevant code is available at https://github.com/farhat-lab/mtb-illumina-wgs-evaluation.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
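The precision/recall trade-off reported for the MQ filter can be illustrated with a minimal Python sketch. It is not taken from the authors' repository: the `Call` record, the `precision_recall` helper, and the toy variants are all hypothetical, and it assumes each short-read call carries a single MQ value and that the long-read assemblies supply a truth set of (position, alternate allele) pairs.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Call(NamedTuple):
    """Hypothetical short-read variant call (not the authors' data model)."""
    pos: int    # reference position of the variant
    alt: str    # alternate allele
    mq: float   # mapping quality assigned to the call


def precision_recall(
    calls: Iterable[Call],
    truth: Set[Tuple[int, str]],
    min_mq: float,
) -> Tuple[float, float]:
    """Keep calls with MQ >= min_mq, then score them against the truth set."""
    kept = {(c.pos, c.alt) for c in calls if c.mq >= min_mq}
    tp = len(kept & truth)   # called and present in the truth set
    fp = len(kept - truth)   # called but absent from the truth set
    fn = len(truth - kept)   # in the truth set but filtered out or never called
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / (tp + fn) if truth else 1.0
    return precision, recall


# Toy data, invented purely for illustration.
truth = {(100, "A"), (250, "T"), (400, "G"), (900, "C")}
calls = [
    Call(100, "A", 60),
    Call(250, "T", 45),
    Call(400, "G", 20),   # true variant in a poorly mapped region
    Call(700, "C", 10),   # false call in a poorly mapped region
]

for threshold in (0, 40):
    p, r = precision_recall(calls, truth, threshold)
    print(f"MQ >= {threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy set, raising the threshold from 0 to 40 removes the one false call but also discards a true variant, which is the same direction of trade-off the abstract describes for stricter filtering and masking.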