Anthony Youzhi Cheng1, Yik-Ying Teo2, Rick Twee-Hee Ong1.
1. Saw Swee Hock School of Public Health, National University of Singapore, Singapore 117597, Life Sciences Institute, National University of Singapore, Singapore 117456, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, NUS Graduate School for Integrative Science and Engineering, National University of Singapore, Singapore 117456 and Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore 138672.
2. Saw Swee Hock School of Public Health, National University of Singapore, Singapore 117597, Life Sciences Institute, National University of Singapore, Singapore 117456, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, NUS Graduate School for Integrative Science and Engineering, National University of Singapore, Singapore 117456 and Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore 138672.
Abstract
MOTIVATION: Whole-genome sequencing (WGS) is now routinely used for the detection and identification of genetic variants, particularly single nucleotide polymorphisms (SNPs) in humans, and this has provided valuable new insights into human diversity, population histories and genetic association studies of traits and diseases. However, these insights rely on accurate detection and genotype calling of the polymorphisms present in the sequenced samples. To minimize cost, the majority of current WGS studies, including the 1000 Genomes Project (1 KGP), have adopted low-coverage sequencing of a large number of samples, and such designs have inadvertently influenced the development of variant calling methods for WGS data. Assessments of variant accuracy are usually performed on the same set of low-coverage individuals or on a smaller number of deeply sequenced individuals. It is thus unclear how these variant calling methods would fare for a dataset of ∼100 samples from a population not part of the 1 KGP that have been sequenced at various coverage depths. RESULTS: Using down-sampling of the sequencing reads obtained from the Singapore Sequencing Malay Project (SSMP), and a set of SNP calls from the same individuals genotyped on the Illumina Omni1-Quad array, we assessed the sensitivity of SNP detection, the accuracy of genotype calls and the variant accuracy of six commonly used variant calling methods: GATK, SAMtools, Consensus Assessment of Sequence and Variation (CASAVA), VarScan, glfTools and SOAPsnp. The results indicate that at 5× coverage depth, the multi-sample callers GATK and SAMtools yield the best accuracy, particularly if the study samples are called together with a large number of individuals such as those from the 1000 Genomes Project. If study samples are sequenced at a high coverage depth such as 30×, CASAVA has the highest variant accuracy of the variant callers assessed.
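As an illustration of the kind of comparison described above, the following minimal Python sketch scores one individual's sequencing-based calls against array genotypes for the same individual. It is not the authors' pipeline. The function names, the representation of genotypes as alternate-allele counts keyed by (chromosome, position), and the exact metric definitions (sensitivity as the fraction of array variant sites also called as variant, concordance as the fraction of shared sites with matching genotypes) are assumptions made for this example; the abstract does not specify how the metrics were defined.

```python
from typing import Dict, Tuple

# Hypothetical representation: a site is (chromosome, position) and a
# genotype is the alternate-allele count (0 = hom-ref, 1 = het, 2 = hom-alt).
Site = Tuple[str, int]
Genotypes = Dict[Site, int]


def snp_sensitivity(array: Genotypes, calls: Genotypes) -> float:
    """Fraction of array sites carrying an alternate allele that the
    sequencing caller also reports as variant (assumed definition)."""
    variant_sites = [site for site, gt in array.items() if gt > 0]
    if not variant_sites:
        return float("nan")
    # A site missing from the call set counts as not detected.
    detected = sum(1 for site in variant_sites if calls.get(site, 0) > 0)
    return detected / len(variant_sites)


def genotype_concordance(array: Genotypes, calls: Genotypes) -> float:
    """Fraction of sites present in both sets whose genotypes agree
    (assumed definition)."""
    shared = [site for site in array if site in calls]
    if not shared:
        return float("nan")
    matches = sum(1 for site in shared if array[site] == calls[site])
    return matches / len(shared)


# Toy data, invented purely for illustration.
array_genotypes = {("chr1", 100): 1, ("chr1", 200): 2,
                   ("chr1", 300): 0, ("chr2", 50): 1}
sequencing_calls = {("chr1", 100): 1, ("chr1", 200): 1, ("chr1", 300): 0}

sensitivity = snp_sensitivity(array_genotypes, sequencing_calls)
concordance = genotype_concordance(array_genotypes, sequencing_calls)
print(f"sensitivity={sensitivity:.3f} concordance={concordance:.3f}")
```

In a study like the one summarized here, such metrics would presumably be computed per caller and per down-sampled coverage depth, so that callers can be compared at, for example, 5× and 30×.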