L G Fearnley, M F Bennett, M Bahlo.
Abstract
Bioinformatic methods for detecting short tandem repeat (STR) expansions in short-read sequencing have identified new repeat expansions in humans, but require alignment information to identify repetitive motif enrichment at genomic locations. We present superSTR, an ultrafast method that does not require alignment. superSTR is used to process whole-genome and whole-exome sequencing data, and to perform the first STR analysis of the UK Biobank, efficiently screening and identifying known and potential disease-associated STRs in the exomes of 49,953 biobank participants. We demonstrate the first bioinformatic screening of RNA sequencing data to detect repeat expansions in humans and mouse models of ataxia and dystrophy.
Year: 2022 PMID: 35907931 PMCID: PMC9338934 DOI: 10.1038/s41598-022-17267-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. An overview of superSTR, its compression heuristic, and the heuristic's performance in simulated reads. (a) superSTR analysis involves a per-sample processing step, in which repeats are identified, and a cohort-level analysis, in which samples are analysed, ultimately leading to post-superSTR analysis or experimental confirmation of findings. (b) superSTR relies on relative compressibility to distinguish repeat-containing reads from other reads. Compression with zlib removes duplication, so read A (which does not contain significant repetition) compresses less than read B (which does), and the ratio of compressed size to uncompressed size is greater for A than for B. (c) Distribution of C compression ratios in 150 nt pseudorandom reads and repeat-containing reads drawn from a distribution where nucleotides are equiprobable and no errors are present. A more complete characterization across different distributions, read lengths and error rates is contained in Supplementary Figs. S1–S5.
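The compressibility idea in panel (b) can be illustrated with a minimal Python sketch. It is not superSTR's implementation: the function name `compression_ratio`, the example reads, and the use of zlib's default settings are assumptions made for illustration, and no decision threshold is implied.

```python
import random
import zlib


def compression_ratio(read: str) -> float:
    """Compressed size divided by uncompressed size, both in bytes.

    Hypothetical helper for illustration; superSTR's actual heuristic
    and parameters may differ.
    """
    raw = read.encode("ascii")
    return len(zlib.compress(raw)) / len(raw)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Read A: 150 nt pseudorandom read with equiprobable nucleotides.
read_a = "".join(rng.choice("ACGT") for _ in range(150))
# Read B: 150 nt read made entirely of the AGC motif.
read_b = "AGC" * 50

ratio_a = compression_ratio(read_a)
ratio_b = compression_ratio(read_b)

# The repetitive read compresses further, so its ratio is smaller.
print(f"read A ratio: {ratio_a:.2f}")
print(f"read B ratio: {ratio_b:.2f}")
```

Because the ratio is computed per read and needs no reference genome, a check of this kind can run before, or entirely without, alignment.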
Figure 2. superSTR analysis of WGS and RNA-seq RE data. The distribution of information scores in controls is shown in grey (lower part of each graphic) and in affected individuals in color (upper part of each graphic). A right shift in the distribution or the presence of a tail indicates an increased quantity of repeats of that motif in the sequencing data. (a–d) show comparisons of disease groups within the Illumina RE cohort to the Illumina Diversity cohort. (e, f) show RNA-seq analysis. (a) AGC profile of DM1-bearing individuals, with the long-tailed distribution characteristic of a relatively large RE; (b) AGC profile of HD-bearing individuals, with a much shorter RE; (c) CCG profile of FXS individuals; (d) AAG profile of FRDA individuals. (e) RNA-seq AGC profile of peripheral blood mononuclear cells from 12 individuals with SCA3 RE against 12 matched non-SCA3 controls. (f) RNA-seq AGC profile of eight patients with confirmed FECD3 expansions and six controls without FECD (of any type).
superSTR analysis of UK Biobank data: significant trimers in motif screening of ICD10 codes, and post-screening localization by EHDN of reads in the samples and codes identified by superSTR.
| ICD | ICD description | # samples in cohort with ICD | Motif | MW statistic | FDR-corrected p-value | # samples with EHDN paired IRRs | EHDN localised genes (n, mean estimated het. allele lengths) |
|---|---|---|---|---|---|---|---|
| H185 | Hereditary corneal dystrophies | 93 | AGC | 3113774 | 0.0064 | 88 | TCF4 (63, 351.0), AGBL1 (10, 78.4), PDK3 (23, 100.0), POLG (2, 64.0), RAB28 (1, 74.0) & 127 other genes |
| G911 | Obstructive hydrocephalus | 14 | AAC | 430504 | 0.035 | 1 | CHID1 (1, 97.0), TNFRSF19 (1, 62.0) |
| G711 | Myotonic disorders | 13 | AGC | 530258 | 0.0034 | 13 | DMPK (10, 1736.7), 44 additional genes (including MLLT3, TCF4, CA10, TBP) |
| G568 | Other mononeuropathies of upper limb | 15 | AGG | 5332345 | 0.038 | 0 | DCAF8L2 (7, 86.4), FGFR4 (2, 68.5), CHD3 (1, 61.0), EPHB3 (1, 61.0), GDNF (1, 45.0), WIZ (1, 117.0), NAA38 (1, 61.0), ERICH6 (1, 48.0) |
| G439 | Migraine, unspecified | 2352 | ACG | 56157179 | 0.036 | 5 | CASZ1 (3, 62.0) |
| H654 | Other chronic nonsuppurative otitis media | 95 | ACG | 2414382 | 0.023 | 1 | No localisation |
| G440 | Cluster headache syndrome | 2082 | CCG | 51810209 | 0.045 | 113 | FMR1 (1044, 94), AR (532, 74) and 113 additional genes |
| G440 | Cluster headache syndrome | 2082 | ACG | 50023184 | 0.020 | 5 | CASZ1 (3, 62.0) |
| H269 | Cataract, unspecified | 2680 | ATC | 64490674 | 0.017 | 27 | CENPP (10, 58.8), ASPN (10, 58.8) and 41 other genes not previously linked to cataract disorders |
| G048 | Other encephalitis, myelitis and encephalomyelitis | 17 | ACG | 448775 | 0.038 | 1 | No localisation |
| H210 | Hyphaema | 29 | ATC | 849421 | 0.033 | 1 | ABCA8 (1, 168) |
Repeat sizes are as reported by EHDN, and are listed for qualitative purposes only because of the high uncertainty that results from low coverage and low numbers of supporting reads in some areas of the WES data. A single individual may have expansions of the same motif at multiple loci. IRRs refer to paired reads detected by EHDN where both reads are repetitive, share the same motif, and more than 90% of each read consists of the specified motif. Highlighted genes have associations with related disorders. An extended version of the above table with complete gene lists incorporating data from tetramers and pentamers is included in Supplementary Data S2.
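The table's "MW statistic" and "FDR-corrected p-value" columns correspond to a Mann-Whitney comparison of per-motif scores, followed by a multiple-testing correction. The sketch below shows one way such a screen could be computed in plain Python. It makes several assumptions that the text above does not confirm: a two-sided test with a normal approximation, Benjamini-Hochberg as the FDR procedure, and entirely hypothetical score values and function names.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value).

    Uses midranks for ties and a normal approximation with continuity
    correction; an illustrative choice, not necessarily the paper's.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    z = max(abs(u - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(var)
    return u, math.erfc(z / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-motif scores: (samples with the ICD code, samples without).
scores = {
    "AGC": ([5.1, 6.3, 7.8, 9.2, 6.9], [1.2, 1.9, 2.4, 2.0, 1.5, 2.2]),
    "AAG": ([2.1, 1.8, 2.5, 2.0, 1.7], [1.9, 2.2, 2.4, 1.6, 2.0, 2.3]),
}
motifs = list(scores)
results = [mann_whitney_u(*scores[m]) for m in motifs]
fdr = benjamini_hochberg([p for _, p in results])
for motif, (u, p), q in zip(motifs, results, fdr):
    print(f"{motif}: U={u:.1f}, p={p:.4f}, FDR-adjusted p={q:.4f}")
```

The correction matters in a screen of this kind because every ICD code is tested against many motifs, so an unadjusted threshold would be expected to produce many chance findings.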