Hongseok Tae, Dong-Yun Kim, John McCormick, Robert E Settlage, Harold R Garner. Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061 and Office of Biostatistics Research, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Abstract
MOTIVATION: Inferring the lengths of inherited microsatellite alleles at single-base-pair resolution from short sequence reads is challenging because of several sources of noise, which arise from the repetitive nature of microsatellites and from the technologies used to generate raw sequence data.

RESULTS: We have developed a program, GenoTan, that uses a discretized Gaussian mixture model combined with a rules-based approach to identify inherited variation at microsatellite loci from short sequence reads without paired-end information. It effectively distinguishes length variants from noise, including insertion/deletion errors in homopolymer runs, by addressing the bidirectional aspect of insertion and deletion errors in sequence reads. Here we introduce for the first time a homopolymer decomposition method that estimates the error bias toward insertion or deletion in homopolymer sequence runs. Combining these approaches, GenoTan accurately genotyped 94.9% of microsatellite loci from simulated data with 40× sequence coverage, whereas the other programs made <90% correct calls on the same data and required 5 to 30 times more computational time than GenoTan. It also showed the highest true-positive rate on real data, using mixed sequence data from two Drosophila inbred lines, a novel validation approach for genotyping.

AVAILABILITY: GenoTan is open-source software available at http://genotan.sourceforge.net.
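To make the idea of a discretized Gaussian mixture over allele lengths concrete, the following is a minimal illustrative sketch, not GenoTan's actual implementation. It assumes that each allele contributes a Gaussian centered on its length, integrated over unit-width bins so that it yields a probability for each integer read length, and that a diploid genotype is an equal-weight mixture of two such components. The function names, the fixed `sigma`, the equal mixture weights, and the exhaustive search over observed lengths are all assumptions made for illustration; the rules-based filtering and the homopolymer decomposition described in the abstract are not modeled.

```python
import math
from itertools import combinations_with_replacement


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def discretized_gaussian_pmf(k, mu, sigma):
    """Mass that N(mu, sigma^2) assigns to the integer bin [k - 0.5, k + 0.5)."""
    return normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)


def genotype_log_likelihood(length_counts, alleles, sigma, floor=1e-12):
    """Log-likelihood of observed repeat lengths under an equal-weight mixture.

    length_counts maps an observed microsatellite length (bp) to the number
    of reads supporting it; alleles is a tuple of candidate allele lengths.
    The floor (an assumption) keeps far-tail probabilities from becoming log(0).
    """
    total = 0.0
    for length, count in length_counts.items():
        p = sum(discretized_gaussian_pmf(length, a, sigma) for a in alleles)
        p /= len(alleles)
        total += count * math.log(max(p, floor))
    return total


def call_genotype(length_counts, sigma=0.5):
    """Return the (allele1, allele2) pair with the highest log-likelihood.

    Candidates are restricted to lengths actually observed in the reads,
    which is a simplification chosen to keep the search small.
    """
    candidates = sorted(length_counts)
    return max(
        combinations_with_replacement(candidates, 2),
        key=lambda g: genotype_log_likelihood(length_counts, g, sigma),
    )


if __name__ == "__main__":
    # Hypothetical read-length counts at one locus (not data from the paper).
    heterozygous = {20: 18, 21: 2, 23: 3, 24: 17}
    homozygous = {14: 3, 15: 30, 16: 2}
    print(call_genotype(heterozygous))
    print(call_genotype(homozygous))
```

In this toy setting the reads that are one base off the true allele length are absorbed by the tails of the discretized Gaussians rather than being called as separate alleles. A realistic caller would also need to estimate the noise width per locus and handle asymmetric insertion/deletion bias, which the sketch ignores.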