Melanie Bahlo, Mark F Bennett, Peter Degorski, Rick M Tankard, Martin B Delatycki, Paul J Lockhart.
Abstract
Short tandem repeats (STRs), also known as microsatellites, are commonly defined as consisting of tandemly repeated nucleotide motifs of 2-6 base pairs in length. STRs appear throughout the human genome, and about 239,000 are documented in the Simple Repeats Track available from the UCSC (University of California, Santa Cruz) genome browser. STRs vary in size, making them highly polymorphic and commonly used as genetic markers. A small fraction of STRs (about 30 loci) have been associated with human disease whereby one or both alleles exceed an STR-specific threshold in size, leading to disease. Detection of repeat expansions is currently performed with polymerase chain reaction-based assays or with Southern blots for large expansions. The tests are expensive and time-consuming and are not always conclusive, leading to lengthy diagnostic journeys for patients, potentially including missed diagnoses. The advent of whole exome and whole genome sequencing has enabled identification of the genetic cause of many genetic disorders; however, analysis pipelines are focused primarily on the detection of short nucleotide variations and short insertions and deletions (indels). Until recently, repeat expansions, with the exception of the smallest expansion (SCA6), were not detectable in next-generation short-read sequencing datasets and would have been ignored in most analyses. In the last two years, four analysis methods with accompanying software (ExpansionHunter, exSTRa, STRetch, and TREDPARSE) have been released. Although a comprehensive comparative analysis of the performance of these methods across all known repeat expansions is still lacking, it is clear that these methods are a valuable addition to any existing analysis pipeline. Here, we detail how to assess short-read data for evidence of expansions, reviewing all four methods and outlining their strengths and weaknesses.
Implementation of these methods should lead to increased diagnostic yield of repeat expansion disorders for known STR loci and has the potential to detect novel repeat expansions.
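The abstract's working definition of an STR (a 2-6 bp motif repeated in tandem) can be illustrated with a small sketch. This is not how the UCSC Simple Repeats Track or any of the four reviewed tools detect repeats; the function name `find_strs`, the `min_copies` cut-off, and the exact-match requirement are assumptions made purely for illustration.

```python
import re

def find_strs(seq, min_copies=3):
    """Return (start, motif, copies) for perfect tandem repeats of 2-6 bp motifs.

    Illustrative only: real STR callers tolerate mismatches and indels,
    whereas this sketch requires exact copies of the motif.
    """
    # Lazy {2,6}? prefers the shortest motif, so CAGCAG... is reported as CAG.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            # Skip homopolymer runs (e.g. "TT" repeated), which are not 2-6 bp motifs.
            continue
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits

if __name__ == "__main__":
    # Five CAG copies flanked by non-repetitive sequence.
    print(find_strs("TTGA" + "CAG" * 5 + "TTA"))
```

Requiring perfect copies keeps the example short; the TRF match and TRF indel columns in the table below show that real loci are not always pure repeats.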
Keywords: bioinformatics; repeat expansion disorders; short tandem repeats; short-read sequencing
Year: 2018 PMID: 29946432 PMCID: PMC6008857 DOI: 10.12688/f1000research.13980.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Detailed short tandem repeat loci information for neurological disorders.
| Disease | Symbol | OMIM | Inheritance | Gene | Cytogenetic location | Type | Repeat motif | Normal (repeats) | Expansion (repeats) | Strand | Start hg19 | Reference (copies) | TRF match | TRF indel | Reference (bp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Huntington | HD | 143100 | AD | | 4p16.3 | Coding | CAG | 6–34 | 36–100+ | + | 3,076,604 | 21.3 | 96 | 0 | 64 |
| Kennedy disease | SBMA | 313200 | X | | Xq12 | Coding | CAG | 9–35 | 38–62 | + | 66,765,159 | 33.3 | 86 | 9 | 103 |
| Spinocerebellar | SCA1 | 164400 | AD | | 6p23 | Coding | CAG | 6–38 | 39–82 | − | 16,327,865 | 30.3 | 95 | 0 | 91 |
| Spinocerebellar | SCA2 | 183090 | AD | | 12q24 | Coding | CAG | 15–24 | 32–200 | − | 112,036,754 | 23.3 | 97 | 0 | 70 |
| Machado-Joseph | SCA3 | 109150 | AD | | 14q32.1 | Coding | CAG | 13–36 | 61–84 | − | 92,537,355 | 14 | 84 | 0 | 42 |
| Spinocerebellar | SCA6 | 183086 | AD | | 19p13 | Coding | CAG | 4–7 | 21–33 | − | 13,318,673 | 13.3 | 100 | 0 | 40 |
| Spinocerebellar | SCA7 | 164500 | AD | | 3p14.1 | Coding | CAG | 4–35 | 37–306 | + | 63,898,361 | 10.7 | 100 | 0 | 32 |
| Spinocerebellar | SCA17 | 607136 | AD | | 6q27 | Coding | CAG | 25–42 | 47–63 | + | 170,870,995 | 37 | 94 | 0 | 111 |
| Dentatorubral- | DRPLA | 125370 | AD | | 12p13.31 | Coding | CAG | 7–34 | 49–88 | + | 7,045,880 | 19.7 | 92 | 0 | 59 |
| Huntington | HDL2 | 606438 | AD | | 16q24.3 | Exon | CTG | 7–28 | 66–78 | + | 87,637,889 | 15.3 | 95 | 4 | 47 |
| Fragile-X site A | FRAXA | 300624 | X | | Xq27.3 | 5′ UTR | CGG | 6–54 | 200–1,000+ | + | 146,993,555 | 25 | 90 | 5 | 75 |
| Fragile-X site E | FRAXE | 309548 | X | | Xq28 | 5′ UTR | CCG | 4–39 | 200–900 | + | 147,582,159 | 15.3 | 100 | 0 | 46 |
| Myotonic | DM1 | 160900 | AD | | 19q13 | 3′ UTR | CTG | 5–37 | 50–10,000 | − | 46,273,463 | 20.7 | 100 | 0 | 62 |
| Friedreich ataxia | FRDA | 229300 | AR | | 9q13 | Intron | GAA | 6–32 | 200–1,700 | + | 71,652,201 | 6.7 | 100 | 0 | 20 |
| Myotonic | DM2 | 602668 | AD | | 3q21.3 | Intron | CCTG | 10–26 | 75–11,000 | − | 128,891,420 | 20.8 | 92 | 0 | 83 |
| Frontotemporal | FTDALS1 | 105550 | AD | | 9p21 | Intron | GGGGCC | 2–19 | 250–1,600 | − | 27,573,483 | 10.8 | 74 | 8 | 62 |
| Spinocerebellar | SCA36 | 614153 | AD | | 20p13 | Intron | GGCCTG | 3–8 | 1,500–2,500 | + | 2,633,379 | 7.2 | 97 | 0 | 43 |
| Spinocerebellar | SCA10 | 603516 | AD | | 22q13.31 | Intron | ATTCT | 10–20 | 500–4,500 | + | 46,191,235 | 14 | 100 | 0 | 70 |
| Myoclonic | EPM1 | 254800 | AR | | 21q22.3 | Promoter | CCCCGCCCCGCG | 2–3 | 40–80 | − | 45,196,324 | 3.1 | 100 | 0 | 37 |
| Spinocerebellar | SCA12 | 604326 | AD | | 5q32 | Promoter | CAG | 7–45 | 55–78 | − | 146,258,291 | 10.7 | 100 | 0 | 32 |
| Spinocerebellar | SCA8 | 608768 | AD | | 13q21 | utRNA | CTG | 16–34 | 74+ | + | 70,713,516 | 15.3 | 100 | 0 | 46 |
| Spinocerebellar | SCA31 | 117210 | AD | | 16q21 | Intron | TGGAA (a) | N/A | 2.5–3.8 kb (b) | + | 66,524,302 | 0 | N/A | N/A | N/A |
| Spinocerebellar | SCA37 | 615945 | AD | | 1p32.2 | Intron | ATTTC (a) | 0 | 31–75 | − | 57,832,716 (c) | 0 | N/A | N/A | N/A |
| Familial adult | FAME1 | 601068 | AD | | 8q24 | Intron | TTTCA (a) | 0 | 440–3,680 (e) | − | 119,379,055 (c) | 0 | N/A | N/A | N/A |
| Fuchs endothelial | FECD3 | 613267 | AD | | 18q21.2 | Intron | CTG | 10–40 | 50–150+ | − | 53,253,385 | 25.3 | 100 | 0 | 76 |
| Oculopharyngeal | OPMD | 164300 | AD | | 14q11.2 | Coding | GCG | 6–7 | 8–13 | + | 23,790,682 | 6.7 | 100 | 0 | 20 |
| Early infantile | EIEE1 | 308350 | X | | Xp21.3 | Coding | GCG | 7–12 (f) | 17–20 (f) | − | 25,031,771 | 14.7 | 90 | 0 | 44 |
Detailed short tandem repeat (STR) loci information for disorders associated with repeat expansions. Tandem Repeats Finder (TRF) (Benson, 1999 [16]) match and TRF indel describe the purity of the repeat. AD, autosomal dominant; AR, autosomal recessive; N/A, not applicable; UTR, untranslated region; X, X-linked.
(a) As these repeats are insertions, the motifs do not appear in the reference at the respective locus.
(b) SCA31 is caused by the insertion of a complex repeat containing (TGGAA)n; thus, the base-pair length of expanded repeats is given instead of repeat number.
(c) The SCA37 position is given at the reference (ATTTT)n repeat, of which affected individuals have (ATTTC)n inserted. The FAME1 position is given at the reference (TTTTA)n repeat, of which affected individuals have (TTTCA)n inserted.
(d) Ishiura et al. [3] identified similar expansions associated with FAME6 and FAME7 but only in single families. The same TTTCA repeat insertion was observed in the intronic region of TNRC6A and RAPGEF2, respectively.
(e) The size of the FAME1 repeat is the estimated combined size of the expanded (TTTCA)n insertion and (TTTTA)n reference repeat.
(f) Different polyalanine tracts in the gene can be expanded.
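The normal and expansion ranges in the table lend themselves to a simple lookup. The sketch below encodes four loci from the table and labels an observed repeat count; the dictionary `STR_RANGES`, the function `classify_allele`, and the three labels are hypothetical names chosen for illustration, and the logic deliberately ignores the upper bounds of the expansion ranges.

```python
# Largest normal repeat count and smallest expansion repeat count,
# copied from the table above for a handful of loci.
STR_RANGES = {
    "HD":   {"motif": "CAG", "normal_max": 34, "expansion_min": 36},
    "SCA6": {"motif": "CAG", "normal_max": 7,  "expansion_min": 21},
    "DM1":  {"motif": "CTG", "normal_max": 37, "expansion_min": 50},
    "FRDA": {"motif": "GAA", "normal_max": 32, "expansion_min": 200},
}

def classify_allele(symbol, repeats):
    """Label a repeat count relative to the tabulated ranges for one locus.

    Assumption: any count at or below the largest normal size is called
    "normal", and any count at or above the smallest expansion size is
    called "expanded"; counts in the gap are "intermediate".
    """
    locus = STR_RANGES[symbol]
    if repeats <= locus["normal_max"]:
        return "normal"
    if repeats >= locus["expansion_min"]:
        return "expanded"
    return "intermediate"

if __name__ == "__main__":
    for symbol, repeats in [("HD", 20), ("HD", 35), ("HD", 40), ("SCA6", 22)]:
        print(symbol, repeats, classify_allele(symbol, repeats))
```

The gap between the two ranges differs widely between loci (one repeat for HD, more than 160 for FRDA), which is one reason a single genome-wide threshold would not work.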
Figure 1. Detecting repeat expansions with short-read sequencing data.
Depicted are three scenarios: (I) a short repeat expansion, where the expansion is less than 150 base pairs (bp), or smaller than a read; (II) a medium-size repeat expansion, where the expansion is between 150 and 350 bp; and (III) a large repeat expansion, where the expansion is greater than 350 bp. For each of the three panels, I–III, the top line of DNA sequence depicts the reference sequence, and the bottom line depicts the sequence carrying the repeat expansion, whose size is not known. Red segments in reads signify repeat sequence. Evidence from reads varies according to the repeat size. For all three scenarios, there is information in reads that map into the repeat (A), but for scenario I occasional reads also span the expanded allele (B), giving information about its size. In scenario II, some read fragments can span the expanded allele and can also be used for inference. For large expansions, some read fragments stem entirely from the expanded alleles. These may not be unambiguously mapped and are exploited only by ExpansionHunter (large motifs only) and TREDPARSE (based on fragment size information).
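The three scenarios in the caption reduce to comparing the expanded allele's length in base pairs against two cut-offs. A minimal sketch follows, assuming the 150 bp and 350 bp boundaries quoted in the caption; the function name `expansion_scenario`, its parameter names, and the choice to place exactly 350 bp in scenario II are illustrative assumptions.

```python
def expansion_scenario(repeat_units, motif_length, short_max_bp=150, medium_max_bp=350):
    """Return "I", "II" or "III" for an expanded allele, following the caption.

    I:   expansion shorter than short_max_bp (smaller than a read)
    II:  expansion between short_max_bp and medium_max_bp
    III: expansion longer than medium_max_bp
    """
    length_bp = repeat_units * motif_length
    if length_bp < short_max_bp:
        return "I"
    if length_bp <= medium_max_bp:
        return "II"
    return "III"

if __name__ == "__main__":
    # 40 CAG repeats = 120 bp; 80 CAG repeats = 240 bp; 200 GAA repeats = 600 bp.
    for units, motif in [(40, "CAG"), (80, "CAG"), (200, "GAA")]:
        print(units, motif, expansion_scenario(units, len(motif)))
```

In practice the cut-offs would follow the read length and fragment size of the sequencing run, so they are exposed as parameters rather than fixed.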
Summary of computational methods, evaluation framework, and limitations for ExpansionHunter, exSTRa, STRetch, and TREDPARSE.
| Software | Publication | Computational burden | Statistical test | Reported | Software | Ability to | Graphical | Length of STR |
|---|---|---|---|---|---|---|---|---|
| ExpansionHunter | Dolzhenko | Low/Low | None – estimates | WGS | High | Possible | No | Repeats with |
| exSTRa | Tankard | Low/Medium | Permutation | WGS and | Medium | Possible | Yes | No known bias |
| STRetch | Dashnow | High/Medium | Likelihood ratio | WGS | Low | Easy | No | Short expansions |
| TREDPARSE | Tang | Low/Unknown | Likelihood of | WGS | High | Possible | Yes | Does not detect |
(a) Computational burden has been split into two components: known loci (a small subset of all short tandem repeat (STR) loci) and genome-wide, representing thousands of STR loci.
(b) Requires prior information for STR in terms of allele size to aid statistical test.
(c) The C9orf72 repeat expansion is a hexamer repeat.
(d) SCA6 is the smallest repeat expansion currently known.
WES, whole exome sequencing; WGS, whole genome sequencing.