Literature DB >> 32345345

ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data.

Egor Dolzhenko¹, Mark F Bennett^2,3,4, Phillip A Richmond⁵, Brett Trost^6,7, Sai Chen¹, Joke J F A van Vugt⁸, Charlotte Nguyen^6,7,9, Giuseppe Narzisi¹⁰, Vladimir G Gainullin¹, Andrew M Gross¹, Bryan R Lajoie¹, Ryan J Taft¹, Wyeth W Wasserman⁵, Stephen W Scherer^6,7,9,11, Jan H Veldink⁸, David R Bentley¹², Ryan K C Yuen^6,7,9, Melanie Bahlo^2,3, Michael A Eberle¹³.

Abstract

Repeat expansions are responsible for over 40 monogenic disorders, and undoubtedly more pathogenic repeat expansions remain to be discovered. Existing methods for detecting repeat expansions in short-read sequencing data require predefined repeat catalogs. Recent discoveries emphasize the need for methods that do not require pre-specified candidate repeats. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide repeat expansion detection. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference repeat expansions not discoverable via existing methods.

Entities: CellLine Chemical Disease Gene Species

Keywords: Fragile X syndrome; Friedreich ataxia; Genome-wide analysis; Huntington disease; Myotonic dystrophy type 1; Repeat expansions; Short tandem repeats; Whole-genome sequencing data

Mesh：

Year: 2020 PMID： 32345345 PMCID： PMC7187524 DOI： 10.1186/s13059-020-02017-z

Source DB: PubMed Journal: Genome Biol ISSN： 1474-7596 Impact factor: 13.583

Background

High-throughput whole-genome sequencing (WGS) has experienced rapid reductions in per-genome costs over the past 10 years [1] driving population-level sequencing projects and precision medicine initiatives at an unprecedented scale [2-7]. The availability of large sequencing datasets now allows researchers to perform comprehensive genome-wide searches for disease-associated variants. The primary limitations of these studies are the completeness of the reference genome and the ability to identify putative causal variations against the reference background. A wide variety of software tools can identify variations relative to the reference genome such as single nucleotide variants and short (1–50 bp) insertions and deletions [8-13], copy number variants [14, 15], and structural variants [15-17]. A common feature of these variant callers is their reliance on sequence reads that at least partially align to the reference genome. However, because some variants include large amounts of inserted sequence relative to the reference, methods that can analyze reads that do not align to the reference are also needed. A particularly important category of variants that involve long insertions relative to the reference genome are repeat expansions (REs). An example of which is the expansion in C9orf72 associated with amyotrophic lateral sclerosis (ALS). This repeat consists of three copies of CCGGGG motif in the reference (18 bp total) whereas the pathogenic mutations are comprised of at least 30 copies of the motif (180 bp total) and may encompass thousands of bases [18, 19]. REs are known to be responsible for dozens of monogenic disorders [20, 21]. Several recently developed tools can detect REs longer than the standard short-read sequencing read length of 150 bp [22-27]. These tools have all been demonstrated to be capable of accurately detecting pathogenic expansions of simple short tandem repeats (STRs). However, recent discoveries have shown that many pathogenic repeats have complex structures and hence require more flexible methods. For instance, (a) REs causing spinocerebellar ataxia types 31 and 37; familial adult myoclonic epilepsy types 1, 2, 3, 4, 6, and 7; and Baratela-Scott syndrome [28-34] occur within an inserted sequence relative to the reference; (b) expanded repeats recently shown to cause spinocerebellar ataxia, familial adult myoclonic epilepsy, and cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome have different composition relative to the reference STR [28-33]; (c) Unverricht-Lundborg disease, a type of progressive myoclonus epilepsy, is caused by an expansion of a larger, dodecamer (12-mer) motif repeat [35]. None of the existing variant calling methods is capable of discovering all of these REs. We have developed ExpansionHunter Denovo (EHdn), a novel method for performing a genome-wide search for expanded repeats, to address the limitations of the existing approaches. EHdn scans the existing alignments of short reads from one or many sequencing libraries, including the unaligned and misaligned reads, to identify approximate locations of long repeats and their nucleotide composition. Unlike other methods designed to identify REs [22-27], EHdn (a) does not require prior knowledge of the genomic coordinates of the REs, (b) can detect nucleotide composition changes within the expanded repeats, and (c) is applicable to both short and long motifs. EHdn is computationally efficient because it does not re-align reads and, depending on the sensitivity settings, can analyze a single 30–40x WGS sample in about 30 min to 2 h using a single CPU thread. In this study, we demonstrate that EHdn can be used to rediscover the REs associated with fragile X syndrome (FXS), Friedreich ataxia (FRDA), myotonic dystrophy type 1 (DM1), and Huntington disease (HD) using case-control analysis to compare a small number of affected individuals (N = 14–35) to control samples (N = 150). We also show that REs in individual samples can be identified using outlier analysis. We then characterize large (longer than the read length) repeats in our control cohort to investigate baseline variability of these long repeats. Finally, we demonstrate the capabilities of our method by analyzing simulated expansions of various classes of tandem repeats known to play an important role in human disease. Taken together, our findings demonstrate that EHdn is a robust tool for identifying novel pathogenic repeat expansions in both cohort and single-sample outlier analysis, capable of identifying a new, previously inaccessible class of REs.

Results

ExpansionHunter Denovo

Overview

The length of disease-causing REs tends to exceed the read length of modern short-read sequencing technologies [36]. Thus, pathogenic expansions of many repeats can be detected by locating reads that are completely contained inside the repeats. As in our previous work [24, 26], we call these reads in-repeat reads (IRRs). We implemented a method, ExpansionHunter Denovo (EHdn), for performing a genome-wide search for IRRs in BAM/CRAM files containing read alignments. EHdn computes genome-wide STR profiles containing locations and counts of all identified IRRs. Subsequent comparisons of STR profiles across multiple samples can reveal the locations of the pathogenic repeat expansions.

Genome-wide STR profiles

Genome-wide STR profiles computed by EHdn contain information about two types of IRRs: anchored IRRs and paired IRRs. Anchored IRRs are IRRs whose mates are confidently aligned to the genomic sequence adjacent to the repeat (the “Methods” section). Paired IRRs are read pairs where both mates are IRRs with the same repeat motif. Repeats exceeding the read length generate anchored IRRs (Fig. 1, middle panel). Repeats that are longer than the fragment length of the DNA library produce paired IRRs in addition to anchored IRRs (Fig. 1, right panel). The genomic coordinates where the anchored reads align correspond to the approximate locations of loci harboring REs and the number of IRRs is indicative of the overall RE length.

Fig. 1

Diagram illustrating the types and counts of reads generated by simulating repeats of different lengths. When the repeat is shorter than the read length (left panels), there are no IRRs associated with the repeat. When a repeat is longer than the read length but shorter than the fragment length (middle panels), anchored IRRs but no paired IRRs are present. As the repeat length approaches and exceeds the fragment length (right panels), paired IRRs are generated in addition to anchored IRRs The information about anchored IRRs is summarized in an STR profile for each repeat motif (e.g., CCG) by listing regions containing anchored IRRs in close proximity to each other together with the total number of anchored IRRs identified (Fig. 2, middle). Note that the mapping positions of anchored IRRs correspond to the positions of anchor reads; mapping positions of IRRs themselves are not used because their alignments are often unreliable. Contrary to anchored IRRs, the origin of paired IRRs cannot be determined if a genome contains multiple long repeats with the same motif. Due to this, STR profiles only contain the overall count of paired IRRs for each repeat motif.

Fig. 2

(Left) A search for anchored IRRs is performed across all aligned reads. (Middle) The IRR counts are summarized into STR profiles. (Right) The resulting STR profiles are merged across all samples. If the dataset can be partitioned into cases and controls, IRR counts in these groups are compared for each locus. Alternatively, if no such partition is possible, an outlier analysis is performed

Comparing STR profiles across multiple samples

To compare STR profiles across multiple samples, the profiles must first be merged together across samples. During this process, nearby anchored IRR regions are merged across multiple samples and the associated counts are depth-normalized and tabulated for each sample (Fig. 2, right; the “Methods” section). The total counts of paired IRRs are also normalized and tabulated for each sample. The resulting per-sample counts can be compared in two ways: If the samples can be partitioned into cases and controls where a significant subset of cases is hypothesized to contain expansions of the same repeat, then a case-control analysis can be performed using a Wilcoxon rank-sum test (the “Methods” section). Alternatively, if no enrichment for any specific expansion is expected, an outlier analysis (the “Methods” section) can be used to flag repeats that are expanded in a small subgroup of cases compared to the rest of the dataset. Case-control and outlier analyses can be performed on either anchored IRRs or paired IRRs, which we call locus and motif methods, respectively (the “Methods” section). Thus, the locus method can identify locations of repeat expansions while the motif method can reveal the overall enrichment for long repeats with a given motif.

Baseline simulations

To demonstrate the baseline expectation of how the numbers of anchored and paired IRRs vary with repeat length, we simulated 2 × 150 bp reads at 20x coverage with 450-bp mean fragment length for the repeat associated with Huntington disease and varied the repeat length from 0 to 340 CAG repeats (0 to 1020 bp; Additional file 1). No IRRs occur when the repeat is shorter than the read length (Fig. 1, left panel). When the repeat is longer than the read length, but shorter than the fragment length (Fig. 1, middle panel), the number of anchored IRRs increases proportionally to the length of the repeat. As the length of the repeat approaches and exceeds the mean fragment length (Fig. 1, right panel), the number of paired IRRs increases linearly with the length of the repeat. Because anchored IRRs require one of the reads to “anchor” outside of the repeat region, the number of anchored IRRs is limited by the fragment length and remains constant as the repeat grows beyond the mean fragment length. It is important to note that real sequence data may introduce additional challenges compared to the simulated data. For example, sequence quality in low complexity regions or interruptions in the repeat may impact the ability to identify some IRRs.

Analysis of sequencing data

Detection of expanded repeats in case-control studies

Given a sufficient number of samples with the same phenotype, pathogenic REs may be identified by searching for regions with significantly longer repeats in cases compared to controls (see Fig. 2). To demonstrate the feasibility of such analyses, we analyzed 91 Coriell samples with experimentally confirmed expansions in repeats associated with Friedreich ataxia (FRDA; N = 25), myotonic dystrophy type 1 (DM1; N = 17), Huntington disease (HD; N = 14), and fragile X syndrome (FXS; N = 35). This dataset has been previously used to benchmark the performance of existing targeted methods [23, 24, 27]. The pathogenic cutoffs for FRDA, DM1, and FXS repeats are greater than the read length, so our analysis of simulated data suggests that anchored IRRs are likely to be present in each sample with one of these expansions (Fig. 1). The pathogenic cutoff for the HD repeat (120 bp) is less than the read length (150 bp) used in this study, so a subset of samples with Huntington disease may not contain relevant IRRs making this expansion harder to detect de novo even though it is detectable with existing methods [22, 24, 26, 27]. We separately compared samples with expansions in FXN (FRDA), DMPK (DM1), HTT (HD), or FMR1 (FXS) genes (cases) against a control cohort of 150 unrelated Coriell samples of African, European, and East Asian ancestry [37]. Each case-control comparison revealed a clear enrichment of anchored IRRs at the corresponding repeat region (Fig. 3). This analysis demonstrated that ExpansionHunter Denovo (EHdn) can re-identify known pathogenic repeat expansions without prior knowledge of the location or repeat motif when the pathogenic repeat length is equal to or longer than the read length, also assuming that the repeat is highly penetrant.

Fig. 3

Genome-wide analysis of anchored IRRs comparing cases with known pathogenic expansions in DMPK, FXN, FMR1, and HTT genes (top to bottom) to 150 controls

Detection of expanded repeats in mixed sample cohorts

In many discovery projects, it can be difficult to isolate patients that harbor the same repeat expansion based on the phenotype alone. For instance, the repeat expansion in the C9orf72 gene is present in fewer than 10% of ALS patients and many ataxias can be caused by expansions of a variety of repeats. Such problems call for analysis methods that are suitable for heterogeneous disease cohorts. To solve this problem, we follow the approach taken previously by others and compare each case sample against the control cohort to identify outliers [23, 25] (the “Methods” section). To demonstrate the efficacy of this approach, we combined each sample from the pool of samples with expansions in FXN, DMPK, HTT, and FMR1 genes with 150 controls to generate a total of 91 datasets, each containing 151 samples. We then performed an outlier analysis on the counts of anchored IRRs (the “Methods” section) in each dataset. In 81% of the datasets, the expanded repeat ranked within the top 10 repeats based on the outlier score (Fig. 4a). This number increased to 84% when the analysis was restricted to short motifs between 2 and 6 bp (Fig. 4b). EHdn performed well for DMPK and FXN repeats, identifying these REs within the top 10 ranks for 41 out of 42 cases. The FMR1 expansion was only ranked in the top 10 for 24 out of 35 cases known to have the expansion. This result is consistent with a previous comparison, which found this locus had the poorest performance across all RE detection tools [23]. The performance for the HTT repeat is surprisingly good considering that EHdn was not designed to detect REs shorter than the read length. The rankings improved further when the analysis was restricted to repeats located close to exons of brain-expressed genes (Fig. 4c, Additional file 1) or when multiple cases (five in this example) were included in the analysis (Fig. 4d, Additional file 1).

Fig. 4

Ranking of known expansions based on the outlier score computed for anchored IRRs. Each rank originates from a genome-wide analysis of a dataset consisting of one (a–c) or five (d) samples with a known expansion and 150 controls. a Ranks for all identified repeats. b Ranks for repeats with 2–6-bp motifs. c Ranks for repeats located in the 5-kbp region around exons of brain-expressed genes. d Ranks for datasets with five case samples

The landscape of long repeats within a control population

To explore the landscape of large repeats in the general population, we applied EHdn to 150 unrelated Coriell samples of African, European, and East Asian ancestry [37]. To limit this analysis to higher confidence repeats, we considered loci where EHdn identified at least five anchored IRRs, corresponding to repeats spanning about 150–200 bp and longer, and motifs supported by at least five paired IRRs in a single sample. Altogether, EHdn identified 1574 unique motifs spanning between 2 and 20 bp, 94% of which were longer than 6 bp. Of these, 19% were found in at least half of the samples and 23% were found in just one sample. On average, each person had 660 loci with long repeats. As expected, the telomeric hexamer motif AACCCT is particularly abundant and was found in about ~ 23,000 IRRs per sample. Similarly, the centromeric pentamer motif AATGG was found in ~ 5000 IRRs per sample. To estimate the number of repeats located outside of the telomeric and centromeric regions [38, 39], we stratified the repeats by their distance to the closest telomere/centromere (Additional file 1: Figure S5). We found that, on average, about 170 of the identified repeats are located closer than 2 Mbp from the nearest telomere or centromere and about 200 are located further than 15 Mbp. We also showed that EHdn can accurately detect long repeats in a control sample by validating 77% of repeats supported by two or more reads in Pacific Biosciences long-read data (Additional file 1: Figure S4).

Exploring the limitations of catalog-based RE detection methods

To evaluate the limitations of catalog-based approaches [22–25, 27], we curated a set of 53 pathogenic or potentially pathogenic repeats (Additional file 1 and Additional file 2: Table S1) and checked if they were present in two commonly used catalogs: (a) STRs with up to 6-bp motifs from the UCSC genome browser simple repeats track [38, 40] utilized by STRetch and exSTRa and (b) the GangSTR catalog. Nine of the known pathogenic repeats are not present in the reference genome and hence are absent from both catalogs. Out of the remaining 44 loci, 22 loci are present in both catalogs, 12 are missing from the GangSTR catalog and present in the UCSC catalog, five are missing from the UCSC catalog and present in the GangSTR catalog, and five are missing from both catalogs (Additional file 1: Figure S1). While it is possible to update the catalogs to include these known pathogenic repeats, the number of missing potentially pathogenic REs remains unknown. To demonstrate that EHdn offers similar performance to catalog-based methods on expansions exceeding the read length, we simulated expansions of the 35 non-degenerate STRs present in the reference (Additional file 1 and Additional file 2: Table S1). We focused our comparisons on STRetch because this method was specifically designed to search for novel expansions using a genome-wide catalog and because it was shown to have similar performance to other existing methods [23]. Our simulations show that EHdn ranks 33 out of 35 pathogenic repeats in the top 10 at sufficiently long lengths (Additional file 1 and Additional files 4, 5, 6, 7: Tables S3-S6). STRetch prioritizes 26 out of 29 repeats in the top 10, and the six remaining repeats are missing from its catalog. One of the REs detected by EHdn and missed by STRetch is the pathogenic CSTB repeat with a motif length of 12 bp. This is because STRetch is limited to the detection of motifs with length up to 6 bp. To further highlight this strength of EHdn, we confirmed that it can detect other REs with long motifs (Additional file 1). Some recently discovered REs are composed of motifs that are not present in the reference genome. One such example is the recently discovered repeat expansion of non-reference motif AAGGG causing cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (CANVAS) [41, 42]. Rafehi et al. [42] demonstrated that EHdn is the only computational method capable of discovering this expansion. To further benchmark EHdn’s ability to detect REs with complex structure, we simulated nine complex REs with non-reference motifs known to cause disease (Additional file 1: Figure S3). For eight out of nine REs, including a simulated version of the CANVAS expansion, EHdn was able to detect one or both of the expanded repeats in each locus (Additional file 1 and Additional file 8: Table S7).

Discussion

Here, we introduced a new software tool, ExpansionHunter Denovo (EHdn), that can identify novel REs using high-throughput WGS data. We tested EHdn by comparing samples with known REs against a control group of 150 diverse individuals and performed simulation studies across a range of pathogenic or potentially pathogenic REs. These analyses show that EHdn offers comparable performance to targeted methods on known pathogenic repeats while also being able to detect repeats absent from existing catalogs. In particular, EHdn can be used for discovery of novel repeat expansions not detectable by the current methods because it (a) does not require prior knowledge of the genomic coordinates of the REs, (b) can detect nucleotide composition changes within the expanded repeats, and (c) is applicable to both short and long motifs. Recent discoveries have highlighted the importance of complex pathogenic repeat expansions involving non-reference insertions [28-33]. EHdn is currently the only method capable of discovering these expansions from BAM or CRAM files without the need for re-alignment of the supporting reads. Additionally, we anticipate that EHdn can replace existing more manual and less computationally efficient discovery pipelines, such as the TRhist-based pipeline [43], where identification of enriched repeat motifs is followed by ad hoc re-alignment of relevant reads to the reference genome and manual evaluation of loci where these reads align. EHdn has some limitations and areas for further improvement. It is limited to the detection of repetitive sequences longer than the read length and cannot, in general, detect shorter expansions. However, detection of these shorter expansions is feasible with the existing catalog-based methods, or structural variant detection methods. It may be possible to extend the detection limit to shorter repeat expansions; however, increasing the search space will lead to increased runtime and reduced power to detect outlier expansions. It is also important to note that while EHdn can analyze reads produced by a variety of read aligners (Additional file 1: Figure S6), the same aligner should be used for all samples involved in comparative analyses to eliminate false signals due to aligner differences. All parameters of the sequencing assay (sequencing platform and library preparation kits) should also match as closely as possible to avoid coverage biases and other technical artifacts. In many previous studies, identification of pathogenic REs required years of work and involved linkage studies to isolate the region of interest followed by targeted sequencing to identify the likely causative mutations. EHdn can be used as a front-line tool in such studies to rapidly identify candidate REs. Once identified, these novel REs can be genotyped using targeted methods [22, 24, 26, 27] or molecular assays. The benefits of this approach were demonstrated in a recent study, where EHdn successfully identified a novel complex pathogenic RE [42]. Hundreds of thousands of individuals’ genomes have now been sequenced using short-read sequencing from many large disease cohorts, awaiting additional analyses such as RE detection. Additionally, while it is generally easier to analyze the structure of the expanded repeats in long-read data [44, 45], combining short-read sequencing datasets and methods with long-read data can offer a cost-effective way to conduct large-scale repeat expansion discovery projects.

Conclusions

We presented ExpansionHunter Denovo, a new genome-wide and catalog-free method to search for REs in WGS data. We demonstrated that EHdn consistently detects REs in real and simulated data. Given the widespread adoption of WGS for rare disease diagnosis, we expect that EHdn will enable further RE discoveries that will likely resolve the genetic cause of disease in many individuals.

Methods

Identification of IRRs

To determine if a read r is an in-repeat read, we first check the read for periodicity. We define I(i) = 1 if r = r and 0 otherwise, where r and r are the ith and (i + k)th bases of the read r. We then let where L is the read length. Note that if a read consists of a perfect stretch of repeat units of length k, then S(k) = 1. We search across of range of motif lengths (by default k ∈ {2, 3, …, 20}) for the smallest k such that S(k) ≥ t where t is a set threshold (we use t = 0.8 in all our analyses). If such a value of k is found, we extract the putative repeat unit using the most frequent bases at each offset 0 ≤ i ≤ k − 1. Since the orientation of the repeat where a given IRR originated is unknown in general, the unit of the repeat is ambiguous. To remove this ambiguity, we select the smallest repeat unit in lexicographical order under circular permutation and reverse complement operations. We then use this putative repeat unit to calculate a weighted-purity (WP) score of a read [24]. We assume that a read is an IRR if it achieves a WP score of at least 0.9. The WP score lowers the penalty for low-quality mismatches in order to account for the possibility of an increased base-call error rate that may occur in highly repetitive regions of the genome. EHdn searches for IRRs among unaligned reads and reads whose mapping quality (MAPQ) is below a set threshold which in the analysis presented here was set to 40. For this study, we limited our analysis to motif lengths between 2 and 20 bp. Motif lengths equal to 1 were excluded to eliminate the large number of homopolymer repeats from the downstream analyses since we identified over 30 times as many homopolymer IRRs as IRRs with longer repeat motifs. EHdn designates a read pair as a paired IRR if both mates are IRRs with the same repeat motif. A read is designated as an anchored IRR if it is an IRR and its mate is not an IRR and has MAPQ above a set threshold which was set to 50 for this study. Parameters such as the maximum allowed MAPQ for an IRR, the minimum allowed MAPQ for an anchor, and the range of repeat unit lengths for which to search are all tunable with EHdn. For example, setting the anchor read MAPQ threshold to 0 and the IRR MAPQ threshold to 60 ensures that every read pair in the alignment file is analyzed (assuming that the MAPQ values range from 0 to 60) at the cost of a corresponding increase in runtime.

Merging IRRs

Because an anchored IRR is assigned to the location of the aligned anchor read and not the position of the actual repeat (whose exact location may be unknown), a single repeat may produce anchored IRRs at a variety of locations centered around the repeat. To account for this, anchored IRRs with the same repeat motif are merged if their anchors are aligned within 500 bp of one another. When multiple samples are analyzed, the anchor regions are also merged across all samples and the counts of anchored IRRs (normalized to 40x read depth) are tabulated for each merged region and sample. Additionally, the depth-normalized counts of paired IRRs are tabulated for each repeat motif and sample.

Prioritization of expanded repeats

EHdn supports case-control and outlier analyses of the underlying dataset. The case-control analysis is based on a one-sided Wilcoxon rank-sum test. It is appropriate for situations where a significant subset of cases is expected to contain expansions of the same repeat. The outlier analysis is appropriate for heterogeneous cohorts where enrichment for any specific expansion is not expected. The outlier analysis bootstraps the sampling distribution of the 95% quantile and then calculates the z-scores for cases that exceed the mean of this distribution. The z-scores are used for ranking the repeat regions. Similar outlier-detection frameworks were also developed for exSTRa [23] and STRetch [25]. Both the case-control and the outlier analyses can be applied either to the counts of anchored IRRs or to the counts of paired IRRs. We refer to these as locus or motif methods, respectively. The high-ranking regions flagged by the analysis of anchored IRRs correspond to approximate locations of putative repeat expansions. The high-ranking motifs flagged by the analysis of paired IRRs correspond to the overall enrichment for repeats with that motif.

Defining relevant repeat expansions

A catalog of pathogenic or potentially pathogenic repeat expansions was collated from the literature. We supplemented this catalog with recently reported STRs linked with gene expression [46], and repeats with longer motifs overlapping with disease genes (Additional file 1).

Simulated repeat expansions

Expanded repeats were simulated using a strategy similar to that taken by BamSurgeon [47]. Briefly, we simulated reads in a 2-kb region around an expanded repeat and then aligned the reads to the reference genome. We then removed reads in the same region from a WGS control sample and merged alignments of real and simulated data together (Additional file 1: Figure S2). Additional file 1. Supplementary methods, supplementary results, Figure S1-S6, and captions of Tables S1-S8. Additional file 2: Table S1. Definitions of repeat expansions. Additional file 3: Table S2. Repeats with long motifs and expression-linked repeats which were used for simulation. Additional file 4: Table S3. Simulation results for 13 small pathogenic repeat expansions. Additional file 5: Table S4. Simulation results for 22 large pathogenic repeat expansions. Additional file 6: Table S5. Simulation results for 27 repeat expansions with long motifs. Additional file 7: Table S6. Simulation results for expression-linked repeat expansions with short motifs. Additional file 8: Table S7. Simulation results for nine complex pathogenic repeat expansions. Additional file 9: Table S8. Overview of existing methods for detecting repeat expansions based on short-read data. Additional file 10. Review history.

44 in total

1. The UCSC Table Browser data retrieval tool.

Authors: Donna Karolchik; Angela S Hinrichs; Terrence S Furey; Krishna M Roskin; Charles W Sugnet; David Haussler; W James Kent
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. Tandem repeats finder: a program to analyze DNA sequences.

Authors: G Benson
Journal: Nucleic Acids Res Date: 1999-01-15 Impact factor: 16.971

3. TTTCA repeat insertions in an intron of YEATS2 in benign adult familial myoclonic epilepsy type 4.

Authors: Patra Yeetong; Monnat Pongpanich; Chalurmpon Srichomthong; Adjima Assawapitaksakul; Varote Shotelersuk; Nithiphut Tantirukdham; Chaipat Chunharas; Kanya Suphapeetiporn; Vorasuk Shotelersuk
Journal: Brain Date: 2019-11-01 Impact factor: 13.501

4. Profiling the genome-wide landscape of tandem repeat expansions.

Authors: Nima Mousavi; Sharona Shleizer-Burko; Richard Yanicky; Melissa Gymrek
Journal: Nucleic Acids Res Date: 2019-09-05 Impact factor: 16.971

5. Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy.

Authors: Hiroyuki Ishiura; Koichiro Doi; Jun Mitsui; Jun Yoshimura; Miho Kawabe Matsukawa; Asao Fujiyama; Yasuko Toyoshima; Akiyoshi Kakita; Hitoshi Takahashi; Yutaka Suzuki; Sumio Sugano; Wei Qu; Kazuki Ichikawa; Hideaki Yurino; Koichiro Higasa; Shota Shibata; Aki Mitsue; Masaki Tanaka; Yaeko Ichikawa; Yuji Takahashi; Hidetoshi Date; Takashi Matsukawa; Junko Kanda; Fumiko Kusunoki Nakamoto; Mana Higashihara; Koji Abe; Ryoko Koike; Mutsuo Sasagawa; Yasuko Kuroha; Naoya Hasegawa; Norio Kanesawa; Takayuki Kondo; Takefumi Hitomi; Masayoshi Tada; Hiroki Takano; Yutaka Saito; Kazuhiro Sanpei; Osamu Onodera; Masatoyo Nishizawa; Masayuki Nakamura; Takeshi Yasuda; Yoshio Sakiyama; Mieko Otsuka; Akira Ueki; Ken-Ichi Kaida; Jun Shimizu; Ritsuko Hanajima; Toshihiro Hayashi; Yasuo Terao; Satomi Inomata-Terada; Masashi Hamada; Yuichiro Shirota; Akatsuki Kubota; Yoshikazu Ugawa; Kishin Koh; Yoshihisa Takiyama; Natsumi Ohsawa-Yoshida; Shoichi Ishiura; Ryo Yamasaki; Akira Tamaoka; Hiroshi Akiyama; Taisuke Otsuki; Akira Sano; Akio Ikeda; Jun Goto; Shinichi Morishita; Shoji Tsuji
Journal: Nat Genet Date: 2018-03-05 Impact factor: 38.330

6. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications.

Authors: Xiaoyu Chen; Ole Schulz-Trieglaff; Richard Shaw; Bret Barnes; Felix Schlesinger; Morten Källberg; Anthony J Cox; Semyon Kruglyak; Christopher T Saunders
Journal: Bioinformatics Date: 2015-12-08 Impact factor: 6.937

7. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia.

Authors: Roisin Sullivan; Jana Vandrovcova; Mary M Reilly; Andrea Cortese; Roberto Simone; Huma Tariq; Wai Yan Yau; Jack Humphrey; Zane Jaunmuktane; Prasanth Sivakumar; James Polke; Muhammad Ilyas; Eloise Tribollet; Pedro J Tomaselli; Grazia Devigili; Ilaria Callegari; Maurizio Versino; Vincenzo Salpietro; Stephanie Efthymiou; Diego Kaski; Nick W Wood; Nadja S Andrade; Elena Buglo; Adriana Rebelo; Alexander M Rossor; Adolfo Bronstein; Pietro Fratta; Wilson J Marques; Stephan Züchner; Henry Houlden
Journal: Nat Genet Date: 2019-03-29 Impact factor: 38.330

8. A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD.

Authors: Alan E Renton; Elisa Majounie; Adrian Waite; Javier Simón-Sánchez; Sara Rollinson; J Raphael Gibbs; Jennifer C Schymick; Hannu Laaksovirta; John C van Swieten; Liisa Myllykangas; Hannu Kalimo; Anders Paetau; Yevgeniya Abramzon; Anne M Remes; Alice Kaganovich; Sonja W Scholz; Jamie Duckworth; Jinhui Ding; Daniel W Harmer; Dena G Hernandez; Janel O Johnson; Kin Mok; Mina Ryten; Danyah Trabzuni; Rita J Guerreiro; Richard W Orrell; James Neal; Alex Murray; Justin Pearson; Iris E Jansen; David Sondervan; Harro Seelaar; Derek Blake; Kate Young; Nicola Halliwell; Janis Bennion Callister; Greg Toulson; Anna Richardson; Alex Gerhard; Julie Snowden; David Mann; David Neary; Michael A Nalls; Terhi Peuralinna; Lilja Jansson; Veli-Matti Isoviita; Anna-Lotta Kaivorinne; Maarit Hölttä-Vuori; Elina Ikonen; Raimo Sulkava; Michael Benatar; Joanne Wuu; Adriano Chiò; Gabriella Restagno; Giuseppe Borghero; Mario Sabatelli; David Heckerman; Ekaterina Rogaeva; Lorne Zinman; Jeffrey D Rothstein; Michael Sendtner; Carsten Drepper; Evan E Eichler; Can Alkan; Ziedulla Abdullaev; Svetlana D Pack; Amalia Dutra; Evgenia Pak; John Hardy; Andrew Singleton; Nigel M Williams; Peter Heutink; Stuart Pickering-Brown; Huw R Morris; Pentti J Tienari; Bryan J Traynor
Journal: Neuron Date: 2011-09-21 Impact factor: 17.173

9. Detection of long repeat expansions from PCR-free whole-genome sequence data.

Authors: Egor Dolzhenko; Joke J F A van Vugt; Richard J Shaw; Mitchell A Bekritsky; Marka van Blitterswijk; Giuseppe Narzisi; Subramanian S Ajay; Vani Rajan; Bryan R Lajoie; Nathan H Johnson; Zoya Kingsbury; Sean J Humphray; Raymond D Schellevis; William J Brands; Matt Baker; Rosa Rademakers; Maarten Kooyman; Gijs H P Tazelaar; Michael A van Es; Russell McLaughlin; William Sproviero; Aleksey Shatunov; Ashley Jones; Ahmad Al Khleifat; Alan Pittman; Sarah Morgan; Orla Hardiman; Ammar Al-Chalabi; Chris Shaw; Bradley Smith; Edmund J Neo; Karen Morrison; Pamela J Shaw; Catherine Reeves; Lara Winterkorn; Nancy S Wexler; David E Housman; Christopher W Ng; Alina L Li; Ryan J Taft; Leonard H van den Berg; David R Bentley; Jan H Veldink; Michael A Eberle
Journal: Genome Res Date: 2017-09-08 Impact factor: 9.438

10. Intronic ATTTC repeat expansions in STARD7 in familial adult myoclonic epilepsy linked to chromosome 2.

Authors: Mark A Corbett; Thessa Kroes; Liana Veneziano; Mark F Bennett; Rahel Florian; Amy L Schneider; Antonietta Coppola; Laura Licchetta; Silvana Franceschetti; Antonio Suppa; Aaron Wenger; Davide Mei; Manuela Pendziwiat; Sabine Kaya; Massimo Delledonne; Rachel Straussberg; Luciano Xumerle; Brigid Regan; Douglas Crompton; Anne-Fleur van Rootselaar; Anthony Correll; Rachael Catford; Francesca Bisulli; Shreyasee Chakraborty; Sara Baldassari; Paolo Tinuper; Kirston Barton; Shaun Carswell; Martin Smith; Alfredo Berardelli; Renee Carroll; Alison Gardner; Kathryn L Friend; Ilan Blatt; Michele Iacomino; Carlo Di Bonaventura; Salvatore Striano; Julien Buratti; Boris Keren; Caroline Nava; Sylvie Forlani; Gabrielle Rudolf; Edouard Hirsch; Eric Leguern; Pierre Labauge; Simona Balestrini; Josemir W Sander; Zaid Afawi; Ingo Helbig; Hiroyuki Ishiura; Shoji Tsuji; Sanjay M Sisodiya; Giorgio Casari; Lynette G Sadleir; Riaan van Coller; Marina A J Tijssen; Karl Martin Klein; Arn M J M van den Maagdenberg; Federico Zara; Renzo Guerrini; Samuel F Berkovic; Tommaso Pippucci; Laura Canafoglia; Melanie Bahlo; Pasquale Striano; Ingrid E Scheffer; Francesco Brancati; Christel Depienne; Jozef Gecz
Journal: Nat Commun Date: 2019-10-29 Impact factor: 14.919

18 in total

1. A novel RFC1 repeat motif (ACAGG) in two Asia-Pacific CANVAS families.

Authors: Carolin K Scriba; Sarah J Beecroft; Joshua S Clayton; Andrea Cortese; Roisin Sullivan; Wai Yan Yau; Natalia Dominik; Miriam Rodrigues; Elizabeth Walker; Zoe Dyer; Teddy Y Wu; Mark R Davis; David C Chandler; Ben Weisburd; Henry Houlden; Mary M Reilly; Nigel G Laing; Phillipa J Lamont; Richard H Roxburgh; Gianina Ravenscroft
Journal: Brain Date: 2020-10-01 Impact factor: 13.501

2. Genome-wide tandem repeat expansions contribute to schizophrenia risk.

Authors: Anne S Bassett; Ryan K C Yuen; Bahareh A Mojarad; Worrawat Engchuan; Brett Trost; Ian Backstrom; Yue Yin; Bhooma Thiruvahindrapuram; Linda Pallotto; Aleksandra Mitina; Mahreen Khan; Giovanna Pellecchia; Bushra Haque; Keyi Guo; Tracy Heung; Gregory Costain; Stephen W Scherer; Christian R Marshall; Christopher E Pearson
Journal: Mol Psychiatry Date: 2022-05-12 Impact factor: 15.992

3. Familial long-read sequencing increases yield of de novo mutations.

Authors: Michelle D Noyes; William T Harvey; David Porubsky; Arvis Sulovari; Ruiyang Li; Nicholas R Rose; Peter A Audano; Katherine M Munson; Alexandra P Lewis; Kendra Hoekzema; Tuomo Mantere; Tina A Graves-Lindsay; Ashley D Sanders; Sara Goodwin; Melissa Kramer; Younes Mokrab; Michael C Zody; Alexander Hoischen; Jan O Korbel; W Richard McCombie; Evan E Eichler
Journal: Am J Hum Genet Date: 2022-03-14 Impact factor: 11.043

Review 4. Neurodegenerative diseases associated with non-coding CGG tandem repeat expansions.

Authors: Zhi-Dong Zhou; Joseph Jankovic; Tetsuo Ashizawa; Eng-King Tan
Journal: Nat Rev Neurol Date: 2022-01-12 Impact factor: 44.711

5. Repeat DNA expands our understanding of autism spectrum disorder.

Authors: Anthony J Hannan
Journal: Nature Date: 2021-01 Impact factor: 49.962

6. The investigation of genetic and clinical features in patients with hereditary spastic paraplegia in central-Southern China.

Authors: Chen Wang; Yun-Jian Zhang; Ci-Hao Xu; Zhi-Jun Liu; Yan Wu
Journal: Mol Genet Genomic Med Date: 2021-02-27 Impact factor: 2.183

Review 7. Advancing genomic technologies and clinical awareness accelerates discovery of disease-associated tandem repeat sequences.

Authors: Terence Gall-Duncan; Nozomu Sato; Ryan K C Yuen; Christopher E Pearson
Journal: Genome Res Date: 2021-12-29 Impact factor: 9.438

8. GeneBreaker: Variant simulation to improve the diagnosis of Mendelian rare genetic diseases.

Authors: Phillip A Richmond; Tamar V Av-Shalom; Oriol Fornes; Bhavi Modi; Alison M Elliott; Wyeth W Wasserman
Journal: Hum Mutat Date: 2021-02-10 Impact factor: 4.878

9. Genome-wide detection of tandem DNA repeats that are expanded in autism.

Authors: Brett Trost; Worrawat Engchuan; Charlotte M Nguyen; Bhooma Thiruvahindrapuram; Egor Dolzhenko; Ian Backstrom; Mila Mirceta; Bahareh A Mojarad; Yue Yin; Alona Dov; Induja Chandrakumar; Tanya Prasolava; Natalie Shum; Omar Hamdan; Giovanna Pellecchia; Jennifer L Howe; Joseph Whitney; Eric W Klee; Saurabh Baheti; David G Amaral; Evdokia Anagnostou; Mayada Elsabbagh; Bridget A Fernandez; Ny Hoang; M E Suzanne Lewis; Xudong Liu; Calvin Sjaarda; Isabel M Smith; Peter Szatmari; Lonnie Zwaigenbaum; David Glazer; Dean Hartley; A Keith Stewart; Michael A Eberle; Nozomu Sato; Christopher E Pearson; Stephen W Scherer; Ryan K C Yuen
Journal: Nature Date: 2020-07-27 Impact factor: 69.504

10. Common and rare variant association analyses in amyotrophic lateral sclerosis identify 15 risk loci with distinct genetic architectures and neuron-specific biology.

Authors: Wouter van Rheenen; Rick A A van der Spek; Mark K Bakker; Joke J F A van Vugt; Paul J Hop; Ramona A J Zwamborn; Niek de Klein; Harm-Jan Westra; Olivier B Bakker; Patrick Deelen; Gemma Shireby; Eilis Hannon; Matthieu Moisse; Denis Baird; Restuadi Restuadi; Egor Dolzhenko; Annelot M Dekker; Klara Gawor; Henk-Jan Westeneng; Gijs H P Tazelaar; Kristel R van Eijk; Maarten Kooyman; Ross P Byrne; Mark Doherty; Mark Heverin; Ahmad Al Khleifat; Alfredo Iacoangeli; Aleksey Shatunov; Nicola Ticozzi; Johnathan Cooper-Knock; Bradley N Smith; Marta Gromicho; Siddharthan Chandran; Suvankar Pal; Karen E Morrison; Pamela J Shaw; John Hardy; Richard W Orrell; Michael Sendtner; Thomas Meyer; Nazli Başak; Anneke J van der Kooi; Antonia Ratti; Isabella Fogh; Cinzia Gellera; Giuseppe Lauria; Stefania Corti; Cristina Cereda; Daisy Sproviero; Sandra D'Alfonso; Gianni Sorarù; Gabriele Siciliano; Massimiliano Filosto; Alessandro Padovani; Adriano Chiò; Andrea Calvo; Cristina Moglia; Maura Brunetti; Antonio Canosa; Maurizio Grassano; Ettore Beghi; Elisabetta Pupillo; Giancarlo Logroscino; Beatrice Nefussy; Alma Osmanovic; Angelica Nordin; Yossef Lerner; Michal Zabari; Marc Gotkine; Robert H Baloh; Shaughn Bell; Patrick Vourc'h; Philippe Corcia; Philippe Couratier; Stéphanie Millecamps; Vincent Meininger; François Salachas; Jesus S Mora Pardina; Abdelilah Assialioui; Ricardo Rojas-García; Patrick A Dion; Jay P Ross; Albert C Ludolph; Jochen H Weishaupt; David Brenner; Axel Freischmidt; Gilbert Bensimon; Alexis Brice; Alexandra Durr; Christine A M Payan; Safa Saker-Delye; Nicholas W Wood; Simon Topp; Rosa Rademakers; Lukas Tittmann; Wolfgang Lieb; Andre Franke; Stephan Ripke; Alice Braun; Julia Kraft; David C Whiteman; Catherine M Olsen; Andre G Uitterlinden; Albert Hofman; Marcella Rietschel; Sven Cichon; Markus M Nöthen; Philippe Amouyel; Bryan J Traynor; Andrew B Singleton; Miguel Mitne Neto; Ruben J Cauchi; Roel A Ophoff; Martina Wiedau-Pazos; Catherine Lomen-Hoerth; Vivianna M van Deerlin; Julian Grosskreutz; Annekathrin Roediger; Nayana Gaur; Alexander Jörk; Tabea Barthel; Erik Theele; Benjamin Ilse; Beatrice Stubendorff; Otto W Witte; Robert Steinbach; Christian A Hübner; Caroline Graff; Lev Brylev; Vera Fominykh; Vera Demeshonok; Anastasia Ataulina; Boris Rogelj; Blaž Koritnik; Janez Zidar; Metka Ravnik-Glavač; Damjan Glavač; Zorica Stević; Vivian Drory; Monica Povedano; Ian P Blair; Matthew C Kiernan; Beben Benyamin; Robert D Henderson; Sarah Furlong; Susan Mathers; Pamela A McCombe; Merrilee Needham; Shyuan T Ngo; Garth A Nicholson; Roger Pamphlett; Dominic B Rowe; Frederik J Steyn; Kelly L Williams; Karen A Mather; Perminder S Sachdev; Anjali K Henders; Leanne Wallace; Mamede de Carvalho; Susana Pinto; Susanne Petri; Markus Weber; Guy A Rouleau; Vincenzo Silani; Charles J Curtis; Gerome Breen; Jonathan D Glass; Robert H Brown; John E Landers; Christopher E Shaw; Peter M Andersen; Ewout J N Groen; Michael A van Es; R Jeroen Pasterkamp; Dongsheng Fan; Fleur C Garton; Allan F McRae; George Davey Smith; Tom R Gaunt; Michael A Eberle; Jonathan Mill; Russell L McLaughlin; Orla Hardiman; Kevin P Kenna; Naomi R Wray; Ellen Tsai; Heiko Runz; Lude Franke; Ammar Al-Chalabi; Philip Van Damme; Leonard H van den Berg; Jan H Veldink
Journal: Nat Genet Date: 2021-12-06 Impact factor: 38.330