Indhu-Shree Rajan-Babu, Junran J Peng, Readman Chiu, Chenkai Li, Arezoo Mohajeri, Egor Dolzhenko, Michael A Eberle, Inanc Birol, Jan M Friedman.
Abstract
BACKGROUND: Screening for short tandem repeat (STR) expansions in next-generation sequencing data can enable diagnosis, optimal clinical management/treatment, and accurate genetic counseling of patients with repeat expansion disorders. We aimed to develop an efficient computational workflow for reliable detection of STR expansions in next-generation sequencing data and demonstrate its clinical utility.
Keywords: Clinical bioinformatics; Machine learning; Next-generation sequencing; Repeat expansion; Short tandem repeats
Year: 2021 PMID: 34372915 PMCID: PMC8351082 DOI: 10.1186/s13073-021-00932-9
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Features of some publicly available short tandem repeat analysis algorithms
| Features | lobSTR | RepeatSeq | HipSTR | TRED | EH | STRetch | exSTRa | GangSTR |
|---|---|---|---|---|---|---|---|---|
| Outputs repeat length? | Y | Y | Y | Y | Y | Y | N | Y |
| Sequencing reads | Single- and paired-end | Single- and paired-end | Single- and paired-end | Paired-end | Paired-end | Paired-end | Paired-end | Paired-end |
| Sequencing platforms supported | Illumina, Sanger, 454, and IonTorrent | Illumina | Illumina | Illumina | Illumina | Illumina | Illumina | Illumina |
| Library prep. supported | PCR and PCR-free | n.a. | PCR and PCR-free | PCR and PCR-free | PCR and PCR-free | PCR and PCR-free | PCR and PCR-free | PCR and PCR-free |
| Library prep. (rcmd) | None | None | None | None | PCR-free | PCR-free | None | None |
| Aligners (rcmd) | lobSTR and BWA-MEM | Novoalign and Bowtie 2 | Indel-sensitive aligner | None | None | None | Bowtie 2 | None |
| Analysis approach | Targeted and GW | Targeted and GW | Targeted and GW | Targeted | Targeted | GW | Targeted and GW | Targeted and GW |
| NGS data type supported | WGS | WGS | WGS | WGS | WGS and ES | WGS and ES | WGS and ES | WGS and ES |
| NGS data format | .bam, .fastq, or .fasta | .bam | .bam | .bam | .bam or .cram | .bam or .fastq | .bam | .bam |
| Built-in stutter correction modelᵃ | Y | Y | Y | Y | n.a. | n.a. | n.a. | Y |
| Test of significance | N | N | N | N | N | Y | Y | N |
| Read types used | Spanning | Spanning | Spanning | Spanning, flanking or partial, paired-end reads, and IRR | Spanning, flanking, and IRR/IRR pairs | Anchored IRR | Flanking and anchored IRR | Spanning, flanking, and IRR/IRR pairs |
| Phasingᵇ | n.a. | n.a. | Y | n.a. | n.a. | n.a. | n.a. | n.a. |
| PL | C++ | C++ | C++ | Python | C++ | Java | Perl and R | C++ |
| Sizing limitation | RL | RL | RL | FL | Not limited | FL | n.a. | Not limited |
| Control dataset | Not required | Not required | Not required | Not required | Not required | Required | Not required | Not required |
| Complex repeats | n.a. | n.a. | n.a. | n.a. | Y | n.a. | n.a. | N |
| Output files | .vcf and .allelotype.stats | .repeatseq, .calls, and .vcf | .vcf | .vcf and .json | .vcf, .json, and .log | .tsv | p-values, ECDF, and tsum plots | .vcf |
| Customized regions file | Possible | Possible | Possible | Possible | Possible | Possible, but not rcmd | Possible | Possible |
EH, ExpansionHunter; TRED, TREDPARSE; Y, feature included; N, feature not included; Library prep., library preparation protocol; rcmd, recommended; PL, programming language used; n.a., not applicable/information not available; GW, genome-wide; WGS, whole-genome sequencing; ES, exome sequencing; IRR, in-repeat reads; RL, read length; FL, fragment length; Not limited, not limited by either RL or FL; ECDF, empirical cumulative distribution function; t-sum, aggregated T statistic
ᵃ Corrects the noise (stutters) introduced during PCR amplification-based library preparation
ᵇ Utilizes phased single nucleotide variant haplotypes
Full-mutation samples detected in the EGA and simulated genomes using the default implementation of STR analysis tools
| Aligner | Gene | | | | | | | | | | | FPs | Total FM detected | Sensitivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | FM threshold (rpts) | 37 | 47 | 38 | 59 | 60 | 50 | 200 | 200 | 65 | 39 | | | |
| | Allelic classification | FM | FM | FM | FM | FM | FM | FM | FM | NL/FM or FM/FM | NL/FM or FM/FM | | | |
| Isaac | EH_v2 | 1 | 2 | 2 | 1 | 3 | 17 | 1 | 0 | 25 | 13 | 6 | 65 | 0.75581395 |
| Isaac | EH_v3 | 1 | 2 | 3 | 0 | 3 | 17 | 0 | 0 | 25 | 13 | 5 | 64 | 0.74418605 |
| Isaac | GangSTR | 0 | 2 | 2 | 0 | 0 | 16 | 0 | 0 | 16 | 11 | 8 | 47 | 0.54651163 |
| Isaac | TRED | 1 | 2 | 1 | 0 | 3 | 17 | 0 | 0 | 25 | 13 | 3 | 62 | 0.72093023 |
| Isaac | STRetch | 1 | 2 | 3 | 1 | 3 | 17 | 2 | 3 | 20 | 13 | 26 | 65 | 0.75581395 |
| Isaac | exSTRa | 1 | 2 | 3 | 0 | 3 | 17 | 1 | 3 | 5 | 13 | 33 | 48 | 0.55813953 |
| BWA | EH_v2 | 1 | 2 | 2 | 1 | 3 | 17 | 0 | 0 | 25 | 13 | 6 | 64 | 0.74418605 |
| BWA | EH_v3 | 1 | 2 | 3 | 0 | 3 | 17 | 0 | 0 | 25 | 13 | 5 | 64 | 0.74418605 |
| BWA | GangSTR | 1 | 2 | 2 | 1 | 1 | 16 | 0 | 0 | 0 | 10 | 8 | 33 | 0.38372093 |
| BWA | TRED | 1 | 2 | 1 | 0 | 3 | 17 | 0 | 0 | 25 | 13 | 10 | 62 | 0.72093023 |
| BWA | STRetch | 1 | 2 | 3 | 1 | 3 | 17 | 2 | 3 | 20 | 13 | 26 | 65 | 0.75581395 |
| BWA | exSTRa | 1 | 2 | 3 | 1 | 3 | 16 | 9 | 3 | 25 | 13 | 35 | 76 | 0.88372093 |
The analyzed dataset had 86 samples with at least one known full-mutation allele. The number of true-positives detected by the tools, sensitivity, and the number of false positives identified in our default analysis of the Isaac- (top panel) and BWA-aligned (bottom panel) genomes are shown. NL, normal; FM, full-mutation; FPs, false-positives; rpts, repeats; EH_v2, ExpansionHunter version 2; EH_v3, ExpansionHunter version 3; TRED, TREDPARSE
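The sensitivity column is consistent with dividing each tool's total number of detected full-mutation samples by the 86 samples known to carry at least one full-mutation allele. The following is a minimal sketch of that arithmetic, not the authors' code; the function and variable names are illustrative, and the counts are copied from the Isaac-aligned rows of the table.

```python
# Number of samples with at least one known full-mutation (FM) allele,
# as stated in the table caption.
KNOWN_FM_SAMPLES = 86

# "Total FM detected" values copied from the Isaac-aligned rows of the table.
total_fm_detected = {
    "EH_v2": 65,
    "EH_v3": 64,
    "GangSTR": 47,
    "TRED": 62,
    "STRetch": 65,
    "exSTRa": 48,
}


def sensitivity(true_positives: int, known_positives: int = KNOWN_FM_SAMPLES) -> float:
    """Return the fraction of known FM samples that were detected."""
    if known_positives <= 0:
        raise ValueError("known_positives must be positive")
    return true_positives / known_positives


for tool, detected in total_fm_detected.items():
    print(f"{tool}: {sensitivity(detected):.8f}")
```

Rounded to eight decimal places, these ratios agree with the tabulated sensitivities. False positives do not enter this calculation, which is why the table reports them in a separate column.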
FMR1 and FMR2 full-mutations detected by ExpansionHunter, GangSTR, and TREDPARSE with lowered repeat length threshold
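| Aligner | Isaac | Isaac | Isaac | Isaac | Isaac | Isaac | Isaac | BWA | BWA | BWA | BWA | BWA | BWA | BWA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Locus | FMR1 | FMR1 | FMR1 | FMR1 | FMR2 | FMR2 | FMR2 | FMR1 | FMR1 | FMR1 | FMR1 | FMR2 | FMR2 | FMR2 |
| FM threshold | 54 repeats | 54 repeats | 54 repeats | 54 repeats | 60 repeats | 60 repeats | 60 repeats | 54 repeats | 54 repeats | 54 repeats | 54 repeats | 60 repeats | 60 repeats | 60 repeats |
| Allelic classification | FM | IM | NL | FP | FM | NL | FP | FM | IM | NL | FP | FM | NL | FP |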
| EH_v2 | 18 | . | . | 20 | 3 | . | 0 | 18 | . | . | 16 | 3 | . | 0 |
| EH_v3 | 18 | . | . | 22 | 3 | . | 0 | 18 | . | . | 22 | 3 | . | 0 |
| GangSTR | 4 | . | 14 | 7 | 0 | 3 | 0 | 3 | . | 15 | 0 | 0 | 3 | 0 |
| TREDPARSE | 15 | 1 | 2 | 8 | 0 | 3 | 0 | 16 | . | 2 | 13 | 0 | 3 | 0 |
The numbers of full-mutation (FM) alleles misclassified as normal (NL; < 45 repeats for FMR1 and < 31 repeats for FMR2) or intermediate (IM; 45–54 repeats for FMR1) alleles are shown. The true number (n) of known FM alleles in the FMR1 and FMR2 genes is indicated in parentheses. False-positive (FP) calls made by the tools are also reported
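The FMR1 size classes in this caption can be written as a small lookup. This is only a sketch: the function name is hypothetical, the NL and IM boundaries come from the caption, and treating every call above 54 repeats as an FM call at the lowered threshold is an assumption about how the "54 repeats" threshold was applied.

```python
# FMR1 boundaries taken from the caption: NL < 45 repeats, IM 45-54 repeats.
FMR1_NL_UPPER_EXCLUSIVE = 45
FMR1_IM_UPPER_INCLUSIVE = 54


def classify_fmr1_lowered_threshold(repeats: int) -> str:
    """Classify an FMR1 repeat count under the lowered threshold.

    Assumption: any call above 54 repeats is reported as "FM", which
    mirrors the lowered-threshold analysis rather than clinical cut-offs.
    """
    if repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if repeats < FMR1_NL_UPPER_EXCLUSIVE:
        return "NL"
    if repeats <= FMR1_IM_UPPER_INCLUSIVE:
        return "IM"
    return "FM"


for size in (30, 45, 54, 55, 250):
    print(size, classify_fmr1_lowered_threshold(size))
```

Under this reading, a known full-mutation allele that a tool sizes below 45 repeats is counted in the NL column, and one sized at 45 to 54 repeats is counted in the IM column.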
Fig. 1 Decision tree model and its performance metrics on modified analysis of BWA-aligned EGA genomes. a Decision tree generated on the training dataset (n = 940). Node #0 at the top of the tree is the root node. Each node lists an STR tool (feature). The “samples” number represents the total number of genotype calls in a particular node, and “value” shows the number of expanded (or full-mutation, FM) and non-expanded (non-FM) genotypes. Gini index shows the impurity at each node. The terminal nodes or leaves with a Gini value of 0 have genotypes belonging entirely to either the expanded or non-expanded class. EHv3, ExpansionHunter version 3; wCtrls, analysis performed with controls. b Classification report summarizing the performance metrics of the model on test data (n = 236). Macro and weighted average (avg) show the unweighted and weighted mean of performance metrics calculated for Expanded and Not_Expanded class labels, respectively. c Receiver operating characteristics and precision-recall curves. d Confusion matrix showing the number of predicted and true labels on the x- and y-axes, respectively. e Feature importance plot showing the STR tool on the x-axis and the tool’s normalized (Gini) importance on the y-axis
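Two quantities named in this caption have simple closed forms: the Gini impurity of a node, and the macro versus weighted average of a per-class metric. The sketch below uses the standard definitions with made-up numbers; it does not reproduce any value from the figure, and the function names are illustrative.

```python
from typing import Sequence, Tuple


def gini_impurity(class_counts: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("node must contain at least one sample")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def macro_and_weighted_average(
    scores: Sequence[float], supports: Sequence[int]
) -> Tuple[float, float]:
    """Return (macro, weighted) averages of a per-class metric.

    The macro average is the plain mean over classes; the weighted average
    weights each class score by its support (number of true samples).
    """
    if len(scores) != len(supports) or not scores:
        raise ValueError("scores and supports must be non-empty and equal in length")
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted


# Hypothetical nodes: a pure leaf and a perfectly mixed two-class node.
print(gini_impurity([10, 0]))  # pure node, impurity 0.0
print(gini_impurity([5, 5]))   # evenly mixed node, impurity 0.5

# Hypothetical per-class recall for two classes with unequal support.
print(macro_and_weighted_average([1.0, 0.5], [30, 10]))
```

A leaf with a Gini value of 0 therefore contains genotypes from a single class, and the macro and weighted averages diverge whenever the two classes differ in both score and support.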
Short tandem repeat candidates identified in our patient cohort
| Sample ID | Gene | Inheritance | Sequencing | Pathogenic SNV/indel/SV finding | Phenotype | STR finding | Molecular validation |
|---|---|---|---|---|---|---|---|
| 1901-P | | Inherited | WGS | No | Short stature, delayed gross motor, speech and language development, spasticity, cerebral palsy, and hypertonia | FM (full-penetrance) | FM (reduced/full-penetrance) |
| 1901-F | | . | WGS | . | . | FM (full-penetrance) | FM (reduced/full-penetrance) |
| 532-M | | . | WGS | . | . | FM (full-penetrance) | n.a. |
| 821-P | | Inherited | ES | No | Mild intellectual disabilities, systemic hypertension, cutis aplasia, congenital heart defect, and limb anomalies | Borderlineᵃ | n.a. |
| 821-M | | . | ES | . | . | Borderlineᵃ | n.a. |
| 1099-P | | Inherited | ES | No | Hearing loss, cataract, myopia, visceral (kidney and spleen) cysts, proteinuria, and dysmorphic facial features | FM (higher penetrance) | n.a. |
| 1099-M | | . | ES | . | . | FM (higher penetrance) | n.a. |
| 235-P | | Inherited | WGS | No | Mild to moderate intellectual disability, and psychosis | FM (higher penetrance) | n.a. |
| 235-M | | . | WGS | . | . | FM (higher penetrance) | n.a. |
| 2010-P | | Inherited | ES | Definite | Myotonic dystrophy type 1, inguinal hernias, joint hypermobility, strabismus, mild intellectual disability, and dysmorphic facial features | FM (full-penetrance) | FM (full-penetrance) |
| 2010-M | | . | ES | . | Myotonic dystrophy type 1 | FM (full-penetrance) | FM (full-penetrance) |
| 148-M | | . | WGS | . | . | PM | n.a. (proband is negative for |
| 800-F | | . | WGS | . | . | IM | n.a. |
| 480-P | | Inherited | WGS | Probable | Moderate intellectual disability, language delay, autism, borderline macrocephaly, low set ears, down slanting palpebral fissures, high palate, and soft skin | PM | n.a. |
| 712-M | | . | WGS | . | . | PM | n.a. (proband is negative for |
| 925-P | | Inherited | WGS | No | Intellectual disability, developmental delay including speech delay, dysmorphic features, and behavioral challenges | PM | Negative for FM |
| 925-S | | Inherited | WGS | No | Intellectual disability, autism, developmental delay, and dysmorphic features | IM | n.a. |
| 925-M | | . | WGS | . | . | PM | n.a. |
| 1987-F | | . | WGS | . | . | NL/FM | Heterozygous NL/FM carrier |
| 1530-P | | Inherited | WGS | Uncertain | Global developmental delay, seizures, gliosis, developmental regression, encephalomalacia, hirsutism, nystagmus, optic atrophy, cyanosis, abnormal muscle tone, scoliosis, hearing impairment, and otitis media | FM (reduced penetrance) | FM (reduced penetrance) |
| 1530-F | | . | WGS | . | . | FM (reduced penetrance) | FM (reduced penetrance) |
Probands with an identified STR candidate are given a “-P” suffix in the “Sample ID” column; sibling of the proband, “-S”; mother, “-M”; and father, “-F”. The genes harboring the STR candidate identified by our bioinformatics workflow and the inheritance pattern deciphered by comparing the proband’s STR call with that of the parents are reported. The “Sequencing” column shows the technology used: whole-genome sequencing (WGS) or exome sequencing (ES). The “Pathogenic SNV/indel/SV finding” column indicates whether the proband has had a definite, probable, uncertain, or no diagnosis of a single-nucleotide variant (SNV), indel, or structural variant (SV). Phenotypic presentations reported in the probands, STR finding from our bioinformatics analysis, and the results from the molecular validation (if available) are also presented. NL, normal; IM, intermediate; PM, premutation; FM, full-mutation; n.a., not available
ᵃ Reduced-penetrance alleles have 33–34 repeats and full-penetrance alleles have ≥ 37 repeats