Ajay Ummat1, Ali Bashir1. 1. Department of Genetics and Genomic Science and Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Abstract
MOTIVATION: Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders like Huntington's diseases. Accurately resolving repeats, and in particular TRs, remains a challenging task in genome alignment, assembly and variation calling. Though tools have been developed for detecting microsatellites in short-read sequencing data, these are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may potentially resolve a broader spectrum of TRs given their increased length, but require new approaches given their significantly higher raw error profiles. However, due to inherent error profiles of the single-molecule technologies, these reads presents a unique challenge in terms of accurately identifying and estimating the TRs. RESULTS: Here we present PacmonSTR, a reference-based probabilistic approach, to identify the TR region and estimate the number of these TR elements in long DNA reads. We present a multistep approach that requires as input, a reference region and the reference TR element. Initially, the TR region is identified from the long DNA reads via a 3-stage modified Smith-Waterman approach and then, expected number of TR elements is calculated using a pair-Hidden Markov Models-based method. Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criteria, and coverage expectations.
MOTIVATION: Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders like Huntington's diseases. Accurately resolving repeats, and in particular TRs, remains a challenging task in genome alignment, assembly and variation calling. Though tools have been developed for detecting microsatellites in short-read sequencing data, these are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may potentially resolve a broader spectrum of TRs given their increased length, but require new approaches given their significantly higher raw error profiles. However, due to inherent error profiles of the single-molecule technologies, these reads presents a unique challenge in terms of accurately identifying and estimating the TRs. RESULTS: Here we present PacmonSTR, a reference-based probabilistic approach, to identify the TR region and estimate the number of these TR elements in long DNA reads. We present a multistep approach that requires as input, a reference region and the reference TR element. Initially, the TR region is identified from the long DNA reads via a 3-stage modified Smith-Waterman approach and then, expected number of TR elements is calculated using a pair-Hidden Markov Models-based method. Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criteria, and coverage expectations.
Authors: Yu Lin; Jeffrey Yuan; Mikhail Kolmogorov; Max W Shen; Mark Chaisson; Pavel A Pevzner Journal: Proc Natl Acad Sci U S A Date: 2016-12-12 Impact factor: 11.205
Authors: Matthew Pendleton; Robert Sebra; Andy Wing Chun Pang; Ajay Ummat; Oscar Franzen; Tobias Rausch; Adrian M Stütz; William Stedman; Thomas Anantharaman; Alex Hastie; Heng Dai; Markus Hsi-Yang Fritz; Han Cao; Ariella Cohain; Gintaras Deikus; Russell E Durrett; Scott C Blanchard; Roger Altman; Chen-Shan Chin; Yan Guo; Ellen E Paxinos; Jan O Korbel; Robert B Darnell; W Richard McCombie; Pui-Yan Kwok; Christopher E Mason; Eric E Schadt; Ali Bashir Journal: Nat Methods Date: 2015-06-29 Impact factor: 28.547
Authors: Paras Garg; Bharati Jadhav; Oscar L Rodriguez; Nihir Patel; Alejandro Martin-Trujillo; Miten Jain; Sofie Metsu; Hugh Olsen; Benedict Paten; Beate Ritz; R Frank Kooy; Jozef Gecz; Andrew J Sharp Journal: Am J Hum Genet Date: 2020-09-15 Impact factor: 11.025