Haibao Tang1, Ewen F Kirkness2, Christoph Lippert1, William H Biggs2, Martin Fabani2, Ernesto Guzman2, Smriti Ramakrishnan1, Victor Lavrenko1, Boyko Kakaradov2, Claire Hou2, Barry Hicks1, David Heckerman1, Franz J Och1, C Thomas Caskey3, J Craig Venter4, Amalio Telenti5.
Abstract
Short tandem repeats (STRs) are hyper-mutable sequences in the human genome. They are often used in forensics and population genetics and are also the underlying cause of many genetic diseases. There are challenges associated with accurately determining the length polymorphism of STR loci in the genome by next-generation sequencing (NGS). In particular, accurate detection of pathological STR expansion is limited by the sequence read length during whole-genome analysis. We developed TREDPARSE, a software package that incorporates various cues from read alignment and paired-end distance distribution, as well as a sequence stutter model, in a probabilistic framework to infer repeat sizes for genetic loci, and we used this software to infer repeat sizes for 30 known disease loci. Using simulated data, we show that TREDPARSE outperforms other available software. We sampled the full genome sequences of 12,632 individuals to an average read depth of approximately 30× to 40× with Illumina HiSeq X. We identified 138 individuals with risk alleles at 15 STR disease loci. We validated a representative subset of the samples (n = 19) by Sanger and by Oxford Nanopore sequencing. Additionally, we validated the STR calls against known allele sizes in a set of GeT-RM reference cell-line materials (n = 6). Several STR loci that are composed entirely of guanines or cytosines (G or C) have insufficient read evidence for inference and therefore could not be assayed precisely by TREDPARSE. TREDPARSE extends the limit of STR size detection beyond the physical sequence read length. This extension is critical because many of the disease risk cutoffs are close to or beyond the short sequence read length of 100 to 150 bases.
Keywords: genetic disorder; genome sequencing; genotyping; microsatellites; population genetics; short tandem repeats; trinucleotide repeat expansion
Year: 2017 PMID: 29100084 PMCID: PMC5673627 DOI: 10.1016/j.ajhg.2017.09.013
Source DB: PubMed Journal: Am J Hum Genet ISSN: 0002-9297 Impact factor: 11.025
Figure 1. TREDPARSE Workflow for Calling STR-Related Genetic Disease
The workflow includes ploidy inference, read realignment, and integration of various types of evidence in a probabilistic model.
A List of Trinucleotide Repeat Diseases (TREDs) That We Compiled for This Study
| Abbreviation | Disease | Repeat motif | Location | Inheritance | Risk cutoff (repeat units) | At-risk individuals (families) | |
|---|---|---|---|---|---|---|---|
| DM1 | Myotonic dystrophy 1 (MIM: | CTG | chr19: 45770205–45770264 | AD | 50 | 15 (9) | |
| DM2 | Myotonic dystrophy 2 (MIM: | CCTG | chr3: 129172577–129172656 | AD | 75 | 0 | |
| DRPLA | Dentatorubro-pallidoluysian atrophy (MIM: | CAG | chr12: 6936729–6936773 | AD | 48 | 0 | |
| FXTAS | Fragile X-associated tremor/ataxia syndrome (MIM: | CGG | chrX: 147912051–147912110 | XLD | 55 | 2 (1) | |
| FXS | Fragile X syndrome (MIM: | CGG | chrX: 147912051–147912110 | XLD | 200 | 0 | |
| FRAXE | Mental retardation, FRAXE type (MIM: | GCC | chrX: 148500638–148500682 | XLR | 200 | 0 | |
| FRDA | Friedreich ataxia (MIM: | GAA | chr9: 69037287–69037304 | AR | 66 | 0 | |
| HD | Huntington disease (MIM: | CAG | chr4: 3074877–3074933 | AD | 40 | 5 (4) | |
| HDL | Huntington disease-like 2 (MIM: | CTG | chr16: 87604288–87604329 | AD | 40 | 0 | |
| ULD | Unverricht-Lundborg Disease (MIM: | CCCCGCCCCGCG | chr21: 43776444–43776479 | AR | 30 | 0 | |
| OPMD | Oculopharyngeal muscular dystrophy (MIM: | GCN | chr14: 23321473–23321502 | AD | 12 | 8 (7) | |
| SBMA | Spinal and bulbar muscular atrophy (MIM: | CAG | chrX: 67545318–67545383 | XLR | 36 | 1 (1) | |
| SCA1 | Spinocerebellar ataxia 1 (MIM: | CAG | chr6: 16327636–16327722 | AD | 39 | 26 (23) | |
| SCA2 | Spinocerebellar ataxia 2 (MIM: | CAG | chr12: 111598951–111599019 | AD | 33 | 4 (4) | |
| SCA3 | Spinocerebellar ataxia 3 (MIM: | CAG | chr14: 92071011–92071034 | AD | 60 | 0 | |
| SCA6 | Spinocerebellar ataxia 6 (MIM: | CAG | chr19: 13207859–13207897 | AD | 20 | 2 (2) | |
| SCA7 | Spinocerebellar ataxia 7 (MIM: | CAG | chr3: 63912686–63912715 | AD | 34 | 0 | |
| SCA8 | Spinocerebellar ataxia 8 (MIM: | CTG/CAG | chr13: 70139384–70139428 | AD | 80 | 3 (3) | |
| SCA10 | Spinocerebellar ataxia 10 (MIM: | ATTCT | chr22: 45795355–45795424 | AD | 800 | 0 | |
| SCA12 | Spinocerebellar ataxia 12 (MIM: | CAG | chr5: 146878729–146878758 | AD | 51 | 0 | |
| SCA17 | Spinocerebellar ataxia 17 (MIM: | CAG | chr6: 170561908–170562021 | AD | 43 | 52 (48) | |
| SCA36 | Spinocerebellar ataxia 36 (MIM: | GGCCTG | chr20: 2652734–2652757 | AD | 650 | 0 | |
| EIEE1 | Epileptic encephalopathy, early infantile, 1 (MIM: | GCG | chrX: 25013662–25013691 | XLR | 20 | 0 | |
| BPES | Blepharophimosis, epicanthus inversus, and ptosis (MIM: | GCN | chr3: 138946021–138946062 | AD | 19 | 1 (1) | |
| CCD | Cleidocranial dysplasia (MIM: | GCN | chr6: 45422751–45422801 | AD | 27 | 5 (5) | |
| CCHS | Central hypoventilation syndrome (MIM: | GCN | chr4: 41745972–41746031 | AD | 24 | 11 (11) | |
| HFG | Hand-foot-uterus syndrome (MIM: | GCN | chr7: 27199925–27199966 | AD | 22 | 2 (2) | |
| HPE5 | Holoprosencephaly-5 (MIM: | GCN | chr13: 99985449–99985493 | AD | 25 | 0 | |
| SD5 | Syndactyly (MIM: | GCN | chr2: 176093059–176093103 | AD | 22 | 1 (1) | |
| XLMR | Mental retardation, X-linked (MIM: | GCN | chrX: 140504317–140504361 | XLR | 22 | 0 | |
| ALS | Amyotrophic lateral sclerosis (MIM: | GGGGCC | chr9: 27573529–27573546 | AD | 31 | 0 |
Inheritance modes are AD (autosomal dominant), AR (autosomal recessive), XLD (X-linked dominant), and XLR (X-linked recessive). Individuals were inferred to be “at risk” by TREDPARSE according to the probability that a sample is pathological given the risk cutoff.
Total number of individuals assessed in this study: 12,632. Total number of independent families plus unrelated individuals: 8,784.
Same genetic locus for FXTAS and FXS but with different risk cutoffs in repeat counts.
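The risk cutoffs in the table are given in repeat units. Multiplying a cutoff by the motif length gives the tract length in bases, which can be compared with the 100 to 150 base read length mentioned in the abstract. The following is a minimal sketch using four rows copied from the table; the function names and the 150-base default are illustrative assumptions, and the flanking sequence a read needs in order to be anchored is ignored.

```python
# Illustrative sketch only, not part of TREDPARSE.
# (motif, risk cutoff in repeat units) copied from the table above.
LOCI = {
    "HD": ("CAG", 40),
    "DM1": ("CTG", 50),
    "SCA10": ("ATTCT", 800),
    "OPMD": ("GCN", 12),
}


def tract_length_bp(motif: str, cutoff_units: int) -> int:
    """Length in bases of a repeat tract exactly at the risk cutoff."""
    return len(motif) * cutoff_units


def exceeds_read(motif: str, cutoff_units: int, read_length: int = 150) -> bool:
    """True if a tract at the cutoff is at least as long as one read.

    Hypothetical helper: a real read also needs flanking bases on both
    sides to span the repeat, so this is an optimistic bound.
    """
    return tract_length_bp(motif, cutoff_units) >= read_length


for name, (motif, cutoff) in LOCI.items():
    bp = tract_length_bp(motif, cutoff)
    print(f"{name}: {bp} bases at cutoff, "
          f"at or beyond a 150-base read: {exceeds_read(motif, cutoff)}")
```

Under these assumptions the HD cutoff corresponds to 120 bases and the DM1 cutoff to 150 bases, while the SCA10 cutoff corresponds to 4,000 bases, far beyond what a single short read can span.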
Figure 2. Integrated Probabilistic Model for Calling STRs with Four Types of Evidence
(A) Model based on spanning reads.
(B) Model based on partial reads.
(C) Model based on repeat-only reads.
(D) Model based on paired-end reads.
(E) Predictive power for each of the four evidence types on the range of STR repeat lengths. Darker shades of green represent higher confidence of inference.
Figure 3. Examples of Posterior Probability Density Function Based on the Integrated Model for Calling STRs
(A) Simulated diploid; there is no uncertainty around the size of one allele and some uncertainty around the size of the other.
(B) Simulated diploid, showing a slight negative dependence between the two allele size estimates.
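To illustrate how a posterior over the two allele sizes can be turned into a probability that a sample is pathological given a risk cutoff, here is a minimal sketch. It is not TREDPARSE's implementation: the function name, the toy posterior values, the use of "at or above the cutoff" as the pathological condition, and the simple dominant/recessive switch are all assumptions made for illustration. Hemizygous X-linked loci in males are not handled.

```python
from typing import Dict, Tuple

Genotype = Tuple[int, int]  # (h1, h2): repeat counts of the two alleles


def prob_pathological(
    posterior: Dict[Genotype, float],
    cutoff: int,
    recessive: bool = False,
) -> float:
    """Posterior mass on genotypes that meet the risk cutoff.

    Assumption: a dominant locus is pathological when the longer allele
    is at or above the cutoff; a recessive locus requires both alleles.
    """
    total = sum(posterior.values())
    if total <= 0:
        raise ValueError("posterior must have positive total mass")
    risky_mass = 0.0
    for (h1, h2), p in posterior.items():
        relevant = min(h1, h2) if recessive else max(h1, h2)
        if relevant >= cutoff:
            risky_mass += p
    return risky_mass / total  # normalise in case the input is unnormalised


# Toy posterior for a hypothetical sample near the HD cutoff of 40 CAGs.
toy = {(17, 38): 0.10, (17, 39): 0.25, (17, 40): 0.40, (17, 41): 0.25}
print(round(prob_pathological(toy, cutoff=40), 2))
print(prob_pathological(toy, cutoff=40, recessive=True))
```

In this toy case the short allele is certain (17 repeats) while the long allele is spread over 38 to 41 repeats, so the pathological probability is the mass on 40 and 41 repeats.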
Figure 4. Simulations with Synthetic Datasets of Implanted STR Alleles at the Huntington Locus
(A) Performance comparison of TREDPARSE and lobSTR on a simulated haploid with a single allele whose number of CAG repeats varies from 1 to 300.
(B) Performance comparison of TREDPARSE and lobSTR on a simulated diploid with two alleles, one allele fixed at 20 CAGs and the other with a varying number of CAG units.
(C) Performance of TREDPARSE on a simulated diploid with a low haploid depth of 5×.
(D) Performance of TREDPARSE on a simulated diploid with a high haploid depth of 80×. Shaded regions represent the 95% credible interval for the TREDPARSE estimates of the repeat number. The reported error is the root-mean-square deviation between the estimated and the simulated repeat numbers.
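For reference, the root-mean-square deviation used to summarize the simulations can be computed as below. This is the standard definition, written as a small sketch; the function name and the toy numbers are illustrative, and the authors' exact notation is not reproduced here.

```python
import math
from typing import Sequence


def rmsd(estimated: Sequence[float], simulated: Sequence[float]) -> float:
    """Root-mean-square deviation between estimated and simulated repeat counts."""
    if len(estimated) != len(simulated) or len(estimated) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - s) ** 2 for e, s in zip(estimated, simulated)]
    return math.sqrt(sum(squared) / len(squared))


# Toy example: three simulated allele sizes and hypothetical estimates.
print(rmsd([20, 41, 98], [20, 40, 100]))
```

A lower value means the estimated repeat counts track the implanted allele sizes more closely across the simulated range.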
Figure 5. Testing and Validation of TREDPARSE on 12,632 Whole-Genome Sequences
We ran TREDPARSE on sequence data from 12,632 individuals and identified 138 individuals with risk alleles at a total of 15 disease loci. A subset of the inferred at-risk samples was validated by complementary sequencing experiments.
Figure 6. Individuals with Risk Alleles at the Huntington Disease Locus in Whole-Genome Samples
(A) A family with the putative HD risk allele transmitted between generations.
(B) A second family with the putative DM1 risk allele transmitted between generations.
(C) A third family with the putative SCA17 allele transmitted between generations.
The expanded risk alleles are highlighted in red. For both alleles, the 95% credible intervals are provided below the estimates. Age refers to the biological age of the individual at the time when the DNA sample was taken.
Confidence about STR Calls for Each Disease Locus through Experimental Validations, Simulation Support, and Mendelian Error Analysis
| YES | YES | YES | HD, SBMA, SCA1, SCA2, SCA8, SCA17, DM1, FXTAS |
| NO | YES | YES | OPMD, SCA6, BPES, CCD, CCHS, HFG, SD5, FRDA ( |
| NO | NO | YES | DM2, DRPLA, HDL, ULD, SCA3, SCA7, SCA12, EIEE1, HPE5, XLMR, ALS |
| NO | NO | NO | FXS, FRAXE, SCA10, SCA36 ( |
| Imprecise size estimates (Mendelian error rate > 10%) | FXTAS/FXS, FRAXE, ALS, SCA7, CCHS, SCA17, EIEE1 | ||
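The Mendelian error rate referred to in the table can be illustrated with a small sketch for parent-child trios. This is an assumption-laden illustration rather than the procedure used in the study: it covers autosomal loci only, treats a call as consistent when one child allele matches an allele of each parent, and uses a hypothetical `tolerance` parameter (in repeat units) that defaults to exact matching.

```python
from typing import List, Tuple

Call = Tuple[int, int]          # two allele sizes in repeat units
Trio = Tuple[Call, Call, Call]  # (child, father, mother)


def is_mendelian_consistent(child: Call, father: Call, mother: Call,
                            tolerance: int = 0) -> bool:
    """True if one child allele can come from each parent (autosomal loci)."""
    def from_parent(allele: int, parent: Call) -> bool:
        return any(abs(allele - p) <= tolerance for p in parent)

    c1, c2 = child
    return ((from_parent(c1, father) and from_parent(c2, mother)) or
            (from_parent(c2, father) and from_parent(c1, mother)))


def mendelian_error_rate(trios: List[Trio], tolerance: int = 0) -> float:
    """Fraction of trios whose calls violate Mendelian inheritance."""
    if not trios:
        raise ValueError("at least one trio is required")
    errors = sum(
        not is_mendelian_consistent(c, f, m, tolerance) for c, f, m in trios
    )
    return errors / len(trios)


# Toy trios with hypothetical calls: the second child allele of 30 repeats
# matches neither parent, so it counts as one error out of two trios.
toy_trios = [
    ((17, 42), (17, 20), (19, 42)),
    ((17, 30), (17, 20), (19, 42)),
]
print(mendelian_error_rate(toy_trios))
```

X-linked loci would need the child's and parents' sex to be taken into account, which this sketch does not attempt.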