
ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions.

Egor Dolzhenko, Viraj Deshpande, Felix Schlesinger, Peter Krusche, Roman Petrovski, Sai Chen, Dorothea Emig-Agius, Andrew Gross, Giuseppe Narzisi, Brett Bowman, Konrad Scheffler, Joke J F A van Vugt, Courtney French, Alba Sanchis-Juan, Kristina Ibáñez, Arianna Tucci, Bryan R Lajoie, Jan H Veldink, F Lucy Raymond, Ryan J Taft, David R Bentley, Michael A Eberle.

Abstract

SUMMARY: We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci.
AVAILABILITY AND IMPLEMENTATION: ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2019. Published by Oxford University Press.

Year:  2019        PMID: 31134279      PMCID: PMC6853681          DOI: 10.1093/bioinformatics/btz431

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Short tandem repeats (STRs) are ubiquitous throughout the human genome. Although our understanding of STR biology is far from complete, emerging evidence suggests that STRs play an important role in basic cellular processes (Gymrek et al.; Hannan, 2018). In addition, STR expansions are a major cause of over 20 severe neurological disorders including amyotrophic lateral sclerosis, Friedreich ataxia (FRDA) and Huntington’s disease (HD). ExpansionHunter was the first computational method for genotyping STRs from short-read sequencing data capable of consistently genotyping repeats longer than the read length and, hence, detecting pathogenic repeat expansions (Dolzhenko et al., 2017). Since the initial release of ExpansionHunter, several other methods have been developed and were shown to accurately identify long (greater than read length) repeat expansions (Dashnow et al.; Mousavi et al., 2019; Tang et al., 2017; Tankard et al.).
Current methods are not designed to handle complex loci that harbor multiple repeats. Important examples of such loci include the CAG repeat in the HTT gene that causes HD flanked by a CCG repeat, the GAA repeat in FXN that causes FRDA flanked by an adenine homopolymer and the CAG repeat in ATXN8 that causes spinocerebellar ataxia type 8 (SCA8) flanked by an ACT repeat. An even more extreme example is the CAGG repeat in the CNBP gene whose expansions cause myotonic dystrophy type 2 (DM2). This repeat is adjacent to polymorphic CA and CAGA repeats (Liquori et al.), making it particularly difficult to accurately align reads to this locus. Another type of complex repeat is the polyalanine repeat, which has been associated with at least nine disorders to date (Shoubridge and Gecz, 2012). Polyalanine repeats consist of repetitions of the alanine codons GCA, GCC, GCG or GCT (i.e. GCN). Clusters of variants can affect alignment and genotyping accuracy (Lincoln et al., 2019).
Variants adjacent to low-complexity polymorphic sequences can be additionally problematic because methods for variant discovery can output clusters of inconsistently represented or spurious variant calls in such genomic regions. This is due, in part, to the elevated error rates of such regions in sequencing data (Benjamini and Speed, 2012; Dolzhenko et al., 2017). One example is a single-nucleotide variant (SNV) adjacent to an adenine homopolymer in MSH2 that causes Lynch syndrome I (Froggatt et al.). Here we present a new version (v3.0.0) of ExpansionHunter that was reimplemented to handle complex loci such as those described above. The implementation uses sequence graphs (Dilthey et al., 2015; Garrison et al., 2018; Paten et al., 2017) as a general and flexible model of each target locus.

2 Implementation

ExpansionHunter works on a predefined variant catalog containing the genomic locations and the structure of a series of targeted loci. For each locus, the program extracts relevant reads (Dolzhenko et al., 2017) from a binary alignment/map file (Li et al.) and realigns them using a graph-based model representing the locus structure. The realigned reads are then used to genotype each variant at the locus (Fig. 1).
Fig. 1. Overview of ExpansionHunter. (a) A locus definition is read from the variant catalog file. (b) Sequence graph is constructed according to its specification in the variant catalog. (c) Relevant reads are extracted from the input binary alignment/map file. (d) Reads are aligned to the graph. (e) Alignments are pieced together to genotype each variant.

The locus structure is specified using a restricted subset of the regular expression syntax. For example, the HTT repeat region linked to HD can be defined by the expression (CAG)*CAACAG(CCG)*, which signifies that it harbors variable numbers of the CAG and CCG repeats separated by a CAACAG interruption (see Supplementary Materials); the FXN repeat region linked to FRDA corresponds to the expression (A)*(GAA)*; the ATXN8 repeat region linked to SCA8 corresponds to (CTA)*(CTG)*; the CNBP repeat region linked to DM2 consists of three adjacent repeats defined by (CAGG)*(CAGA)*(CA)*; the MSH2 SNV adjacent to an adenine homopolymer that causes Lynch syndrome I corresponds to (A|T)(A)*. Additionally, the regular expressions are allowed to contain multi-allelic or ‘degenerate’ base symbols that can be specified using the International Union of Pure and Applied Chemistry notation (Cornish-Bowden, 1985). Degenerate bases make it possible to represent certain classes of imperfect DNA repeats where, e.g. different bases may occur at the same position. Using this notation, polyalanine repeats can be encoded by the expression (GCN)* and polyglutamine repeats can be encoded by the expression (CAR)*.
ExpansionHunter translates each regular expression into a sequence graph. Informally, a sequence graph consists of nodes that correspond to sequences and directed edges that define how these sequences can be connected together to assemble different alleles. We implemented the basic sequence graph functionality used by ExpansionHunter in the GraphTools C++ library (Supplementary Materials).
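To make the locus notation concrete, the following is a minimal Python sketch of how a locus expression such as (CAG)*CAACAG(CCG)* could be broken into graph nodes, with a self-loop flag for each starred group and with degenerate bases expanded. It is not the GraphTools implementation: the names Node, parse_locus, to_python_regex and count_units are hypothetical, and the sketch only checks a complete allele sequence against the locus rather than aligning reads to a graph.

```python
import re
from dataclasses import dataclass

# IUPAC degenerate base symbols mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

@dataclass(frozen=True)
class Node:
    """Hypothetical graph node: alternative sequences plus a self-loop flag."""
    sequences: tuple  # one sequence, or several alternatives such as (A|T)
    loop: bool        # True when the node came from a starred group (...)*

# A token is either a parenthesized group with an optional star, or plain bases.
TOKEN = re.compile(r"\(([A-Z|]+)\)(\*?)|([A-Z]+)")

def parse_locus(structure):
    """Split a locus expression into an ordered list of nodes."""
    nodes, pos = [], 0
    while pos < len(structure):
        m = TOKEN.match(structure, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        if m.group(1) is not None:
            alts, loop = tuple(m.group(1).split("|")), m.group(2) == "*"
        else:
            alts, loop = (m.group(3),), False
        for seq in alts:
            if not seq or any(base not in IUPAC for base in seq):
                raise ValueError(f"invalid sequence {seq!r}")
        nodes.append(Node(alts, loop))
        pos = m.end()
    return nodes

def to_python_regex(nodes):
    """Build a Python regex that matches every allele the nodes can spell."""
    def expand(seq):
        return "".join(b if len(IUPAC[b]) == 1 else f"[{IUPAC[b]}]" for b in seq)
    parts = []
    for node in nodes:
        alt = "|".join(expand(s) for s in node.sequences)
        # Loop nodes get a capturing group so their total length can be read back.
        parts.append(f"((?:{alt})*)" if node.loop else f"(?:{alt})")
    return "".join(parts)

def count_units(structure, allele):
    """Return the repeat-unit count for each loop node, or None if no match.

    Assumes each loop node holds a single sequence (no alternatives).
    """
    nodes = parse_locus(structure)
    m = re.fullmatch(to_python_regex(nodes), allele)
    if m is None:
        return None
    loops = [n for n in nodes if n.loop]
    return [len(g) // len(n.sequences[0]) for n, g in zip(loops, m.groups())]
```

For instance, count_units("(GCN)*", "GCAGCCGCGGCT") treats the four different alanine codons as four copies of the same degenerate unit, which is the behavior the (GCN)* notation is meant to capture.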
One of the key features of the library is its support for single-node loops, in contrast to the traditional approaches that use fully acyclic graphs (Lee et al.). Single-node loops are the key to representing STRs and other sequences that can appear in any number of copies. Genotyping is performed by analyzing the alignment paths associated with the presence or absence of each constituent allele. The repeats are genotyped as before (Dolzhenko et al., 2017) and SNVs/indels are genotyped using a straightforward Poisson-based model (Supplementary Materials).
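The Poisson-based model for SNVs/indels is specified only in the Supplementary Materials, so the following Python sketch is a generic illustration of the idea rather than the model used by ExpansionHunter. It assumes a diploid sample, read counts that are Poisson-distributed around the haploid depth, and a small per-read error rate; the function names and the default error_rate value are hypothetical.

```python
import math

def poisson_logpmf(k, lam):
    """Log-probability of observing k events under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def genotype_allele(ref_reads, alt_reads, haploid_depth, error_rate=0.01):
    """Pick the most likely number of alternate-allele copies (0, 1 or 2).

    ref_reads / alt_reads: counts of reads whose graph alignment supports
    the reference or the alternate allele (assumed inputs).
    haploid_depth: expected number of reads per allele copy (assumed known).
    error_rate: assumed fraction of reads that support the wrong allele.
    """
    loglik = {}
    for alt_copies in (0, 1, 2):
        ref_copies = 2 - alt_copies
        # Expected counts: reads from true copies plus reads misassigned by error.
        lam_alt = haploid_depth * (alt_copies * (1 - error_rate) + ref_copies * error_rate)
        lam_ref = haploid_depth * (ref_copies * (1 - error_rate) + alt_copies * error_rate)
        loglik[alt_copies] = (poisson_logpmf(alt_reads, lam_alt)
                              + poisson_logpmf(ref_reads, lam_ref))
    best = max(loglik, key=loglik.get)
    return best, loglik
```

Under these assumptions, a site with roughly equal support for both alleles at a haploid depth of 15 would be called heterozygous, while a site with no alternate-supporting reads would be called homozygous reference.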

3 Results and discussion

To demonstrate the performance of ExpansionHunter we analyzed multiple complex STR regions. First, we analyzed a simulated dataset containing a wide range of CAG and CCG repeat sizes at the HTT locus. As expected, the accuracy of ExpansionHunter was substantially higher when the reads were aligned to a sequence graph that included both repeats compared to when the repeats were analyzed independently (Supplementary Fig. S2). ExpansionHunter also produced more accurate genotypes compared to GangSTR and TREDPARSE, tools that were not designed to handle loci harboring multiple nearby STRs (Supplementary Fig. S2). A recent study used ExpansionHunter to investigate mutations in the short sequence interrupting two repeats in the HTT locus across 1600 samples (Wright et al., 2019), demonstrating the usefulness of the program for analysis of complex loci in real data. ExpansionHunter also correctly detected the pathogenic SNV adjacent to an adenine homopolymer in the MSH2 gene in three WGS replicates of a sample obtained from SeraCare Life Sciences (Supplementary Materials).
To demonstrate the utility of ExpansionHunter across both short and long repeats, we compared calls from ExpansionHunter, GangSTR and TREDPARSE on sequence data from samples with experimentally confirmed repeat expansions (Supplementary Materials and Fig. S3). ExpansionHunter had better accuracy (precision = 0.91, recall = 0.99) in detecting the expanded repeats in this dataset compared to GangSTR (precision = 0.88, recall = 0.83) and TREDPARSE (precision = 0.84, recall = 0.46).
Finally, we used ExpansionHunter to genotype degenerate DNA repeats by analyzing a polyalanine repeat in the PHOX2B gene in 150 healthy controls and one sample harboring a known pathogenic expansion. PHOX2B contains a polyalanine repeat of 20 codons that can expand to cause congenital central hypoventilation syndrome. Consistent with what is known about this repeat (Amiel et al., 2003), all but a few controls were genotyped 20/20. ExpansionHunter accurately genotyped the sole sample with the expansion as 20/27; the correctness of this genotype was confirmed by Sanger sequencing.
In summary, we have developed a novel method that addresses the need for more accurate genotyping of complex loci. This method can genotype polyalanine repeats and resolve difficult regions containing repeats in close proximity to small variants and other repeats. A catalog of difficult regions is supplied with the software and can be extended by the user. We expect that the flexibility of the sequence graph framework now adopted in ExpansionHunter will enable a variety of novel variant calling applications.
Conflict of Interest: none declared.
  20 in total

Review 1.  Tandem repeats mediating genetic plasticity in health and disease.

Authors:  Anthony J Hannan
Journal:  Nat Rev Genet       Date:  2018-02-05       Impact factor: 53.242

2.  Profiling the genome-wide landscape of tandem repeat expansions.

Authors:  Nima Mousavi; Sharona Shleizer-Burko; Richard Yanicky; Melissa Gymrek
Journal:  Nucleic Acids Res       Date:  2019-09-05       Impact factor: 16.971

3.  Polyalanine expansion and frameshift mutations of the paired-like homeobox gene PHOX2B in congenital central hypoventilation syndrome.

Authors:  Jeanne Amiel; Béatrice Laudier; Tania Attié-Bitach; Ha Trang; Loïc de Pontual; Blanca Gener; Delphine Trochet; Heather Etchevers; Pierre Ray; Michel Simonneau; Michel Vekemans; Arnold Munnich; Claude Gaultier; Stanislas Lyonnet
Journal:  Nat Genet       Date:  2003-03-17       Impact factor: 38.330

4.  Summarizing and correcting the GC content bias in high-throughput sequencing.

Authors:  Yuval Benjamini; Terence P Speed
Journal:  Nucleic Acids Res       Date:  2012-02-09       Impact factor: 16.971

Review 5.  Genome graphs and the evolution of genome inference.

Authors:  Benedict Paten; Adam M Novak; Jordan M Eizenga; Erik Garrison
Journal:  Genome Res       Date:  2017-03-30       Impact factor: 9.043

6.  Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes.

Authors:  Haibao Tang; Ewen F Kirkness; Christoph Lippert; William H Biggs; Martin Fabani; Ernesto Guzman; Smriti Ramakrishnan; Victor Lavrenko; Boyko Kakaradov; Claire Hou; Barry Hicks; David Heckerman; Franz J Och; C Thomas Caskey; J Craig Venter; Amalio Telenti
Journal:  Am J Hum Genet       Date:  2017-11-02       Impact factor: 11.025

7.  Detection of long repeat expansions from PCR-free whole-genome sequence data.

Authors:  Egor Dolzhenko; Joke J F A van Vugt; Richard J Shaw; Mitchell A Bekritsky; Marka van Blitterswijk; Giuseppe Narzisi; Subramanian S Ajay; Vani Rajan; Bryan R Lajoie; Nathan H Johnson; Zoya Kingsbury; Sean J Humphray; Raymond D Schellevis; William J Brands; Matt Baker; Rosa Rademakers; Maarten Kooyman; Gijs H P Tazelaar; Michael A van Es; Russell McLaughlin; William Sproviero; Aleksey Shatunov; Ashley Jones; Ahmad Al Khleifat; Alan Pittman; Sarah Morgan; Orla Hardiman; Ammar Al-Chalabi; Chris Shaw; Bradley Smith; Edmund J Neo; Karen Morrison; Pamela J Shaw; Catherine Reeves; Lara Winterkorn; Nancy S Wexler; David E Housman; Christopher W Ng; Alina L Li; Ryan J Taft; Leonard H van den Berg; David R Bentley; Jan H Veldink; Michael A Eberle
Journal:  Genome Res       Date:  2017-09-08       Impact factor: 9.438

8.  Variation graph toolkit improves read mapping by representing genetic variation in the reference.

Authors:  Erik Garrison; Jouni Sirén; Adam M Novak; Glenn Hickey; Jordan M Eizenga; Eric T Dawson; William Jones; Shilpa Garg; Charles Markello; Michael F Lin; Benedict Paten; Richard Durbin
Journal:  Nat Biotechnol       Date:  2018-08-20       Impact factor: 54.908

9.  A Rigorous Interlaboratory Examination of the Need to Confirm Next-Generation Sequencing-Detected Variants with an Orthogonal Method in Clinical Genetic Testing.

Authors:  Stephen E Lincoln; Rebecca Truty; Chiao-Feng Lin; Justin M Zook; Joshua Paul; Vincent H Ramey; Marc Salit; Heidi L Rehm; Robert L Nussbaum; Matthew S Lebo
Journal:  J Mol Diagn       Date:  2019-01-03       Impact factor: 5.568

10.  Improved genome inference in the MHC using a population reference graph.

Authors:  Alexander Dilthey; Charles Cox; Zamin Iqbal; Matthew R Nelson; Gil McVean
Journal:  Nat Genet       Date:  2015-04-27       Impact factor: 38.330

  46 in total

1.  Accuracy of short tandem repeats genotyping tools in whole exome sequencing data.

Authors:  Andreas Halman; Alicia Oshlack
Journal:  F1000Res       Date:  2020-03-23

2.  Length of Uninterrupted CAG, Independent of Polyglutamine Size, Results in Increased Somatic Instability, Hastening Onset of Huntington Disease.

Authors:  Galen E B Wright; Jennifer A Collins; Chris Kay; Cassandra McDonald; Egor Dolzhenko; Qingwen Xia; Kristina Bečanović; Britt I Drögemöller; Alicia Semaka; Charlotte M Nguyen; Brett Trost; Fiona Richards; Emilia K Bijlsma; Ferdinando Squitieri; Colin J D Ross; Stephen W Scherer; Michael A Eberle; Ryan K C Yuen; Michael R Hayden
Journal:  Am J Hum Genet       Date:  2019-05-16       Impact factor: 11.025

Review 3.  Pangenome Graphs.

Authors:  Jordan M Eizenga; Adam M Novak; Jonas A Sibbesen; Simon Heumos; Ali Ghaffaari; Glenn Hickey; Xian Chang; Josiah D Seaman; Robin Rounthwaite; Jana Ebler; Mikko Rautiainen; Shilpa Garg; Benedict Paten; Tobias Marschall; Jouni Sirén; Erik Garrison
Journal:  Annu Rev Genomics Hum Genet       Date:  2020-05-26       Impact factor: 8.929

4.  Genome-wide tandem repeat expansions contribute to schizophrenia risk.

Authors:  Anne S Bassett; Ryan K C Yuen; Bahareh A Mojarad; Worrawat Engchuan; Brett Trost; Ian Backstrom; Yue Yin; Bhooma Thiruvahindrapuram; Linda Pallotto; Aleksandra Mitina; Mahreen Khan; Giovanna Pellecchia; Bushra Haque; Keyi Guo; Tracy Heung; Gregory Costain; Stephen W Scherer; Christian R Marshall; Christopher E Pearson
Journal:  Mol Psychiatry       Date:  2022-05-12       Impact factor: 15.992

Review 5.  Evaluation of copy number variants for genetic hearing loss: a review of current approaches and recent findings.

Authors:  Wafaa Abbasi; Courtney E French; Shira Rockowitz; Margaret A Kenna; A Eliot Shearer
Journal:  Hum Genet       Date:  2021-11-22       Impact factor: 4.132

Review 6.  Long-read sequencing for molecular diagnostics in constitutional genetic disorders.

Authors:  Laura K Conlin; Erfan Aref-Eshghi; Deborah A McEldrew; Minjie Luo; Ramakrishnan Rajagopalan
Journal:  Hum Mutat       Date:  2022-09-18       Impact factor: 4.700

7.  Exploring the Genetic Architecture of Spontaneous Coronary Artery Dissection Using Whole-Genome Sequencing.

Authors:  Ingrid Tarr; Stephanie Hesselson; Siiri E Iismaa; Emma Rath; Steven Monger; Michael Troup; Ketan Mishra; Claire M Y Wong; Pei-Chen Hsu; Keerat Junday; David T Humphreys; David Adlam; Tom R Webb; Anna A Baranowska-Clarke; Stephen E Hamby; Keren J Carss; Nilesh J Samani; Monique Bax; Lucy McGrath-Cadell; Jason C Kovacic; Sally L Dunwoodie; Diane Fatkin; David W M Muller; Robert M Graham; Eleni Giannoulatou
Journal:  Circ Genom Precis Med       Date:  2022-05-18

8.  Targeted long-read sequencing identifies missing disease-causing variation.

Authors:  Danny E Miller; Arvis Sulovari; Tianyun Wang; Hailey Loucks; Kendra Hoekzema; Katherine M Munson; Alexandra P Lewis; Edith P Almanza Fuerte; Catherine R Paschal; Tom Walsh; Jenny Thies; James T Bennett; Ian Glass; Katrina M Dipple; Karynne Patterson; Emily S Bonkowski; Zoe Nelson; Audrey Squire; Megan Sikes; Erika Beckman; Robin L Bennett; Dawn Earl; Winston Lee; Rando Allikmets; Seth J Perlman; Penny Chow; Anne V Hing; Tara L Wenger; Margaret P Adam; Angela Sun; Christina Lam; Irene Chang; Xue Zou; Stephanie L Austin; Erin Huggins; Alexias Safi; Apoorva K Iyengar; Timothy E Reddy; William H Majoros; Andrew S Allen; Gregory E Crawford; Priya S Kishnani; Mary-Claire King; Tim Cherry; Jessica X Chong; Michael J Bamshad; Deborah A Nickerson; Heather C Mefford; Dan Doherty; Evan E Eichler
Journal:  Am J Hum Genet       Date:  2021-07-02       Impact factor: 11.025

Review 9.  Applying genomic and transcriptomic advances to mitochondrial medicine.

Authors:  William L Macken; Jana Vandrovcova; Michael G Hanna; Robert D S Pitceathly
Journal:  Nat Rev Neurol       Date:  2021-02-23       Impact factor: 42.937

10.  Randomized prospective evaluation of genome sequencing versus standard-of-care as a first molecular diagnostic test.

Authors:  Deanna G Brockman; Christina A Austin-Tse; Renée C Pelletier; Caroline Harley; Candace Patterson; Holly Head; Courtney Elizabeth Leonard; Kimberly O'Brien; Lisa M Mahanta; Matthew S Lebo; Christine Y Lu; Pradeep Natarajan; Amit V Khera; Krishna G Aragam; Sekar Kathiresan; Heidi L Rehm; Miriam S Udler
Journal:  Genet Med       Date:  2021-05-11       Impact factor: 8.822

