Qian Liu1, Yao Tong1, Kai Wang2,3. 1. Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA. 2. Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA. wangk@email.chop.edu. 3. Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA. wangk@email.chop.edu.
Abstract
BACKGROUND: A short tandem repeat (STR), or "microsatellite", is a tract of DNA in which a specific motif (typically < 10 base pairs) is repeated multiple times. STRs are abundant throughout the human genome, and specific repeat expansions may be associated with human diseases. Long-read sequencing coupled with bioinformatics tools enables the estimation of repeat counts for STRs. However, with the exception of a few well-known disease-relevant STRs, the normal ranges of repeat counts for most STRs in human populations are not well known, which prevents the prioritization of STRs that may be associated with human diseases. RESULTS: In this study, we extend the computational tool RepeatHMM to infer normal ranges of 432,604 STRs using 21 long-read sequencing datasets on human genomes, and build a genome-scale database called RepeatHMM-DB with normal repeat ranges for these STRs. Evaluation on 13 well-known repeats shows that the inferred repeat ranges are good estimates of the repeat ranges reported in the literature from population-scale studies. This database, together with a repeat expansion estimation tool such as RepeatHMM, enables genome-scale scanning of repeat regions in newly sequenced genomes to identify disease-relevant repeat expansions. As a case study of using RepeatHMM-DB, we evaluate the CAG repeats of ATXN3 for 20 patients with spinocerebellar ataxia type 3 (SCA3) and 5 unaffected individuals, and correctly classify each individual. CONCLUSIONS: In summary, RepeatHMM-DB can facilitate prioritization and identification of disease-relevant STRs from whole-genome long-read sequencing data on patients with undiagnosed diseases. RepeatHMM-DB is incorporated into RepeatHMM and is available at https://github.com/WGLab/RepeatHMM.
Keywords:
Microsatellite; Repeat database; Repeat expansion; RepeatHMM; Short tandem repeats