
popSTR2 enables clinical and population-scale genotyping of microsatellites.

Snædis Kristmundsdottir^1,2, Hannes P Eggertsson^1, Gudny A Arnadottir^2, Bjarni V Halldorsson^1,2.

Abstract

SUMMARY: popSTR2 is an update and augmentation of our previous work 'popSTR: a population-based microsatellite genotyper'. To make genotyping sensitive to inter-sample differences, we supply a kernel to estimate sample-specific slippage rates. For clinical sequencing purposes, a panel of known pathogenic repeat expansions is provided along with a script that scans and flags for manual inspection markers indicative of a pathogenic expansion. Like its predecessor, popSTR2 allows for joint genotyping of samples at a population scale. We now provide a binning method that makes the microsatellite genotypes more amenable to analysis within standard association pipelines and can increase association power.
AVAILABILITY AND IMPLEMENTATION: https://github.com/DecodeGenetics/popSTR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2020 PMID: 31804671 PMCID: PMC7141861 DOI: 10.1093/bioinformatics/btz913

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Microsatellites, also known as short tandem repeats (STRs), are tandem repeats with repeat motif lengths between one and six base pairs. They are one of the most frequent types of variation in the human genome, surpassed only by single nucleotide polymorphisms (SNPs) and indels, and have a mutation rate estimated to be three to five orders of magnitude higher than that of other types of genetic variation (Jónsson et al.; Sun et al.). Genotyping microsatellites from whole-genome sequence (WGS) data is challenging since they are highly polymorphic and library preparation methods may modify the true number of repeats in the sequence (Gymrek et al.). WGS-based association and clinical analyses commonly do not consider microsatellites, partially due to a lack of tools capable of analyzing them.

Tandem repeat expansions occur when microsatellites expand beyond a certain length threshold, making them unstable and thus more likely to expand further. A number of repeat expansions are known to be disease-causing (Gatchel and Zoghbi, 2005), and the increasing use of WGS technologies for genetic diagnostics has created a need for fast estimation of the repeat number at disease-associated loci.

Here, we present extensions to our previously published software popSTR and improvements to its previous implementation, with respect to both runtime and accuracy. We increased our expansion detection sensitivity, updated our sample-specific slippage estimation kernel, reduced the dimensions of our logistic regression model and updated external libraries to decrease I/O time and handle both BAM and CRAM files. We further created a panel of known repeat expansion markers and a pipeline to determine at each locus whether read support for a pathogenic expansion is present. Last, we provide a method to bin genotypes into user-specified bins to increase the power of downstream association analysis. By combining this set of functionalities, we hope to make popSTR2 applicable in a wide range of situations, both when analyzing large cohorts to make population inferences and disease associations and when analyzing small sets or single samples in a clinical context.

2 Materials and methods

Figure 1 gives a high-level description of the algorithm's workflow; a more detailed description is given in Supplementary Section S1.1 and a full description is given in Kristmundsdóttir et al. To summarize, we start by computing various quality-indicating attributes for all reads encompassing each of the microsatellites being considered, i.e. reads overlapping its coordinates and containing repeats of the relevant motif. We also look for repeats in unaligned reads with mates aligned close to the repeat region. As an update to our read selection step, we now also look for repeats of the relevant motif in reads that are aligned to longer repeats of the same motif elsewhere in the genome and that have mates aligned close to the repeat region. This can happen when a repeat has expanded considerably and the read reporting it is thus highly divergent from the reference sequence. After the set of informative reads has been created, the algorithm iterates between genotyping and assigning to each read a probability of reporting a true allele. Since this type of iterative parameter estimation is time and resource intensive, we supply a kernel of reliable markers to efficiently estimate these parameters. For details on kernel construction see Supplementary Section S1.2. We replaced the SeqAn BAM I/O module (Reinert et al.) with the one from htslib (Li et al.; https://github.com/DecodeGenetics/SeqAnHTS). The update adds CRAM file support and decreases both I/O demands and runtime. Algorithmic improvements reduced runtime from 11.25 to 2.17 CPU hours/million markers per sample, a roughly fivefold reduction. See Supplementary Table S1 for a breakdown of our runtime analysis.

Fig. 1. Results of read selection are passed into the genotyping model along with sample- and marker-specific parameters

2.1 Application to population-based genotyping

Useful reads and their attributes are used along with marker- and sample-specific parameters to perform genotyping. The marker-specific parameters can be estimated by popSTR2, but we also provide a default set of parameters. By default we require 20 samples for the parameters to be estimated, since estimation with fewer samples would not yield reliable results. The sample-specific slippage parameter is estimated using the kernel of reliable markers described above and supplied with the software.

Our genotyping model (Supplementary Equation S1) computes the likelihood of observing a read, r, given genotypes A and B and selects the genotype pair that maximizes this likelihood over the set of reads being considered. The model previously assumed constant probabilities of adding and removing repeats across all markers, fixing the slippage-direction probability in Supplementary Equation S1 to 0.85 if whole repeats were removed and consequently to 0.15 if whole repeats were added. It has, however, been shown that microsatellites have very different mutation profiles depending on their various properties, e.g. repeat motif, repeat purity and reference allele length (Brinkmann et al.). To reflect this we have replaced the hard-coded values with marker-specific estimates, computed as follows. Assuming that we know which reads result from whole-motif slippage events, we can estimate the fraction of slippage events that added whole repeats at microsatellite i:

$$p_i^{+} = \frac{|R_i^{+}|}{|R_i|}$$

where $R_i^{+}$ is the set of reads at microsatellite i considered to be results of slippage events that add whole motifs and $R_i$ is the set of all reads at microsatellite i reporting whole-motif slippage events, regardless of their direction. The probability of removing repeats is then trivially computed as $1 - p_i^{+}$.

Our previous version created one output file per sample and computed nine attributes from each read used for genotyping. Due to increased data quality and consistency we were able to reduce the number of attributes to six, which simplified and sped up the logistic regression analysis. To make population-scale inferences and genotyping easier we now write one output file per marker, i.e. all alleles discovered in a population are accessible in the same file.

Association pipelines commonly assume biallelic variants, or multiallelic variants where only a single allele is tested for association with a phenotype, rather than associating a subset of the alleles with it (Gudbjartsson et al.; Purcell et al.). This is not optimal for microsatellites, where alleles above or below a certain length threshold may be pathogenic (Lee and McMurray, 2014). In an effort to increase association power we provide binSTR, a software for grouping alleles as a preprocessing step for association analysis. To allow for various patterns of allele groups, binSTR enables not only binarizing but also binning into a user-determined number of groups, where each group is defined by a list of allele indices passed as a parameter.
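Two steps of this section can be illustrated with a minimal Python sketch: the marker-specific estimate of the fraction of whole-motif slippage events that add repeats, and the grouping of allele indices into user-defined bins. This is not the popSTR2 or binSTR code. The function names, the representation of a slippage read as a signed number of whole motifs relative to its assigned allele, and the representation of bins as lists of allele indices are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence, Tuple


def addition_fraction(slippage_offsets: Iterable[int]) -> float:
    """Fraction of whole-motif slippage reads at one marker that added repeats.

    Each offset is a hypothetical encoding: the signed number of whole motifs
    by which a slippage read differs from the allele it is assigned to
    (positive = repeats added, negative = repeats removed).
    """
    offsets = [d for d in slippage_offsets if d != 0]
    if not offsets:
        raise ValueError("no whole-motif slippage reads at this marker")
    added = sum(1 for d in offsets if d > 0)
    return added / len(offsets)


def bin_genotype(
    genotype: Tuple[int, int], bins: Sequence[Sequence[int]]
) -> Tuple[int, ...]:
    """Replace each allele index in a genotype by the index of its bin.

    `bins` is a list of groups, each given as a list of allele indices,
    mirroring the description of user-defined groups in the text.
    """
    group_of: Dict[int, int] = {}
    for group_id, allele_indices in enumerate(bins):
        for allele in allele_indices:
            if allele in group_of:
                raise ValueError(f"allele {allele} appears in more than one bin")
            group_of[allele] = group_id
    # Sort so that (0, 1) and (1, 0) are reported as the same binned genotype.
    return tuple(sorted(group_of[a] for a in genotype))


if __name__ == "__main__":
    # Three reads lost repeats, one gained a repeat.
    add = addition_fraction([-1, -1, -2, 1])
    print(add, 1 - add)  # 0.25 0.75

    # Binarize five alleles into "short" (0, 1) and "long" (2, 3, 4).
    bins = [[0, 1], [2, 3, 4]]
    print(bin_genotype((0, 3), bins))  # (0, 1)
```

With two bins the binned genotype behaves like a biallelic variant, which is the form standard association pipelines expect.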

2.2 Application to clinical genetics

We have, through literature review, assembled a panel of 31 STR markers, each associated with a disease or syndrome when the number of repeats passes a certain threshold, hereafter referred to as the pathogenicity threshold. We provide a script that reports which of these markers, if any, show evidence of a repeat expansion. The script runs the read selection step described above to scan a given BAM file at all panel locations and extracts for each of them all reads containing information on the number of repeats present. Expanded alleles have often undergone a dramatic increase in length, decreasing the odds of finding informative reads supporting them. Genotyping models that assume equal probabilities of drawing reads from each haplotype are thus not reliable in these cases. To account for this, our script scans the informative reads for any repeat tracts longer than the given threshold for each marker and flags locations harboring such reads for further manual inspection. Since many of the pathogenicity thresholds exceed current read lengths by a considerable number of base pairs, the script also counts and reports all fully repetitive reads, i.e. reads containing only repeats of the relevant motif. See Supplementary Table S4 for a summary of the markers included in the panel along with a pathogenicity threshold for each of them. As the set of pathogenic variants and our understanding of them grows, the panel can easily be extended and the thresholds for existing markers updated.
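The flagging logic described above can be sketched as follows. This is a simplified illustration rather than the script shipped with popSTR2: it works on plain read strings instead of a BAM file, counts only perfect motif copies, and the names `scan_marker` and `ExpansionEvidence` are hypothetical. The CTG motif and the threshold of 50 copies follow the DMPK example used in the Experiments section.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ExpansionEvidence:
    longest_tract: int           # most motif copies seen in a single read
    reads_over_threshold: int    # reads with a tract longer than the threshold
    fully_repetitive_reads: int  # reads consisting only of the motif
    flagged: bool                # True if the marker needs manual inspection


def longest_tract(read: str, motif: str) -> int:
    """Largest number of consecutive perfect motif copies found in the read."""
    # The lookahead tries every start position, so no reading frame is missed.
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    runs = (len(m.group(1)) // len(motif) for m in pattern.finditer(read))
    return max(runs, default=0)


def is_fully_repetitive(read: str, motif: str) -> bool:
    """True if the read is nothing but the motif, starting in any phase."""
    if len(read) < len(motif):
        return False
    # A fully repetitive read is a substring of a long enough motif repeat.
    return read in motif * (len(read) // len(motif) + 2)


def scan_marker(reads: Iterable[str], motif: str, threshold: int) -> ExpansionEvidence:
    """Summarize read support for an expansion at one panel marker."""
    longest = over = repetitive = 0
    for read in reads:
        copies = longest_tract(read, motif)
        longest = max(longest, copies)
        if copies > threshold:
            over += 1
        if is_fully_repetitive(read, motif):
            repetitive += 1
    return ExpansionEvidence(longest, over, repetitive, flagged=over > 0)


if __name__ == "__main__":
    reads = [
        "ACGT" + "CTG" * 5 + "TTGA",  # short allele with both flanks
        "TG" + "CTG" * 20 + "C",      # fully repetitive, out of phase
        "GA" + "CTG" * 51 + "AT",     # tract longer than the threshold
    ]
    print(scan_marker(reads, motif="CTG", threshold=50))
```

The sketch reports the count of fully repetitive reads separately and does not use it for flagging. The text describes these reads only as counted and reported, so how to weigh them is left to manual inspection.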

3 Experiments

We compared popSTR2 to HipSTR (Willems et al.), a commonly used microsatellite genotyper, on chr21 of the CEU trio consisting of NA12878, NA12891 and NA12892 and on chr21 of 10 trios sequenced at deCODE genetics. The runtime reduction was 40% for the CEU trio and 26% for the deCODE trios. To compare the accuracy of the two methods we extracted markers where both methods had high-confidence genotypes for all members of at least one trio and at least one trio member had a non-homozygous-reference genotype, and recorded the number of trios where the offspring genotype did not match the parental ones. The deCODE trios had slightly more accurate genotypes from popSTR2 than from HipSTR (99.8% versus 99.6%), but for the CEU trio HipSTR had a single trio inconsistency in 250 markers while popSTR2 had 2. For a more detailed comparison of these runs see Supplementary Table S3.

To examine the sensitivity of our expansion detection script we ran it on ten samples with a known expanded allele in the 3′-flanking region of the DMPK gene, which causes myotonic dystrophy 1 when exceeding 50 copies (Musova et al.), and on ten healthy control samples. The expanded samples were sequenced for clinical sequencing analysis at deCODE genetics and the healthy ones as parts of various other projects, also at deCODE genetics. The script flagged the DMPK locus in all expanded individuals and in none of the control samples. Last, we genotyped 49 962 Icelandic samples to examine the allelic spectrum of this repeat in the Icelandic population. The resulting distribution was in concordance with those previously published for European populations, with a bimodal distribution consisting of a peak at 5 repeats and another between 11 and 13 repeats (Dean et al.; Magaña et al.) (see Supplementary Fig. S1).
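The trio-based accuracy measure can be illustrated with a short sketch of a Mendelian consistency check. It assumes genotypes are given as pairs of repeat counts and that an offspring genotype is consistent when one allele can come from each parent. The function names and the example genotypes are hypothetical, and the sketch omits the confidence and non-reference filters described above.

```python
from typing import Iterable, Tuple

Genotype = Tuple[int, int]  # two alleles, given here as repeat counts


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if one child allele can come from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def consistency_rate(trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, father, mother) genotype triples that are consistent."""
    results = [is_mendelian_consistent(c, f, m) for c, f, m in trios]
    if not results:
        raise ValueError("no trios to evaluate")
    return sum(results) / len(results)


if __name__ == "__main__":
    trios = [
        ((12, 13), (12, 12), (13, 14)),  # consistent: 12 from father, 13 from mother
        ((12, 15), (12, 12), (13, 14)),  # inconsistent: neither parent carries 15
        ((12, 12), (12, 13), (13, 14)),  # inconsistent: mother carries no 12
    ]
    for child, father, mother in trios:
        print(child, is_mendelian_consistent(child, father, mother))
```

An inconsistency counted this way can reflect a genotyping error in any of the three samples or a genuine de novo mutation, so at a highly mutable marker it gives an upper bound on the error count.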

4 Conclusion

We updated the microsatellite genotyper popSTR to decrease runtime and increase genotype quality and accuracy. This was done by replacing external libraries, re-training the data provided with the software and decreasing the number of variables in our logistic regression analysis. To expand the application range we extended the software to provide both a clinical sequencing analysis script for quickly estimating expansion status at known disease loci and a binning software for grouping genotypes by allele length range before performing disease association on them. It is our hope that these updates and extensions will make popSTR2 applicable in a broader spectrum of situations, i.e. for single-sample clinical sequencing analysis as well as large-scale association efforts.

Analysis methods sensitive to detecting expanded repeats (Dashnow et al.; Dolzhenko et al.; Tang et al.; Tankard et al.) are not explicitly intended for population-scale analysis of STRs at a genome-wide scale. Conversely, other methods that aim at population- and genome-scale analysis (Gymrek et al.; Willems et al.) do not focus on detecting and reporting expanded repeats. GangSTR (Mousavi et al.) is, to our knowledge, the only method intended to perform accurate genotyping of both short and expanded microsatellites. It does not, however, mark known pathogenic variants in its output, nor does it flag expansions passing pathogenicity thresholds. By supplying a panel of known expansions along with an easily executable and fast script to flag potentially expanded repeats for further manual inspection, we aim to direct users to the correct putative expansion as quickly as possible.

Conflict of Interest: none declared.


1. Transmission ratio distortion in the myotonic dystrophy locus in human preimplantation embryos.

Authors: Nicola L Dean; J Concepción Loredo-Osti; T Mary Fujiwara; Kenneth Morgan; Seang Lin Tan; Anna K Naumova; Asangla Ao
Journal: Eur J Hum Genet Date: 2006-03 Impact factor: 4.246

2. Diseases of unstable repeat expansion: mechanisms and common principles.

Authors: Jennifer R Gatchel; Huda Y Zoghbi
Journal: Nat Rev Genet Date: 2005-10 Impact factor: 53.242

3. Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat.

Authors: B Brinkmann; M Klintschar; F Neuhuber; J Hühne; B Rolf
Journal: Am J Hum Genet Date: 1998-06 Impact factor: 11.025

4. Parental influence on human germline de novo mutations in 1,548 trios from Iceland.

Authors: Hákon Jónsson; Patrick Sulem; Birte Kehr; Snaedis Kristmundsdottir; Florian Zink; Eirikur Hjartarson; Marteinn T Hardarson; Kristjan E Hjorleifsson; Hannes P Eggertsson; Sigurjon Axel Gudjonsson; Lucas D Ward; Gudny A Arnadottir; Einar A Helgason; Hannes Helgason; Arnaldur Gylfason; Adalbjorg Jonasdottir; Aslaug Jonasdottir; Thorunn Rafnar; Mike Frigge; Simon N Stacey; Olafur Th Magnusson; Unnur Thorsteinsdottir; Gisli Masson; Augustine Kong; Bjarni V Halldorsson; Agnar Helgason; Daniel F Gudbjartsson; Kari Stefansson
Journal: Nature Date: 2017-09-20 Impact factor: 49.962

5. Large-scale whole-genome sequencing of the Icelandic population.

Authors: Daniel F Gudbjartsson; Hannes Helgason; Sigurjon A Gudjonsson; Florian Zink; Asmundur Oddson; Arnaldur Gylfason; Soren Besenbacher; Gisli Magnusson; Bjarni V Halldorsson; Eirikur Hjartarson; Gunnar Th Sigurdsson; Simon N Stacey; Michael L Frigge; Hilma Holm; Jona Saemundsdottir; Hafdis Th Helgadottir; Hrefna Johannsdottir; Gunnlaugur Sigfusson; Gudmundur Thorgeirsson; Jon Th Sverrisson; Solveig Gretarsdottir; G Bragi Walters; Thorunn Rafnar; Bjarni Thjodleifsson; Einar S Bjornsson; Sigurdur Olafsson; Hildur Thorarinsdottir; Thora Steingrimsdottir; Thora S Gudmundsdottir; Asgeir Theodors; Jon G Jonasson; Asgeir Sigurdsson; Gyda Bjornsdottir; Jon J Jonsson; Olafur Thorarensen; Petur Ludvigsson; Hakon Gudbjartsson; Gudmundur I Eyjolfsson; Olof Sigurdardottir; Isleifur Olafsson; David O Arnar; Olafur Th Magnusson; Augustine Kong; Gisli Masson; Unnur Thorsteinsdottir; Agnar Helgason; Patrick Sulem; Kari Stefansson
Journal: Nat Genet Date: 2015-03-25 Impact factor: 38.330

6. Profiling the genome-wide landscape of tandem repeat expansions.

Authors: Nima Mousavi; Sharona Shleizer-Burko; Richard Yanicky; Melissa Gymrek
Journal: Nucleic Acids Res Date: 2019-09-05 Impact factor: 16.971

7. Distribution of CTG repeats at the DMPK gene in myotonic dystrophy patients and healthy individuals from the Mexican population.

Authors: J J Magaña; P Cortés-Reynosa; R Escobar-Cedillo; R Gómez; N Leyva-García; B Cisneros
Journal: Mol Biol Rep Date: 2010-07-16 Impact factor: 2.316

8. Highly unstable sequence interruptions of the CTG repeat in the myotonic dystrophy gene.

Authors: Zuzana Musova; Radim Mazanec; Anna Krepelova; Edvard Ehler; Jiri Vales; Radka Jaklova; Tomas Prochazka; Petr Koukal; Tatana Marikova; Josef Kraus; Marketa Havlovicova; Zdenek Sedlacek
Journal: Am J Med Genet A Date: 2009-07 Impact factor: 2.802

9. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes.

Authors: Haibao Tang; Ewen F Kirkness; Christoph Lippert; William H Biggs; Martin Fabani; Ernesto Guzman; Smriti Ramakrishnan; Victor Lavrenko; Boyko Kakaradov; Claire Hou; Barry Hicks; David Heckerman; Franz J Och; C Thomas Caskey; J Craig Venter; Amalio Telenti
Journal: Am J Hum Genet Date: 2017-11-02 Impact factor: 11.025

10. Detection of long repeat expansions from PCR-free whole-genome sequence data.

Authors: Egor Dolzhenko; Joke J F A van Vugt; Richard J Shaw; Mitchell A Bekritsky; Marka van Blitterswijk; Giuseppe Narzisi; Subramanian S Ajay; Vani Rajan; Bryan R Lajoie; Nathan H Johnson; Zoya Kingsbury; Sean J Humphray; Raymond D Schellevis; William J Brands; Matt Baker; Rosa Rademakers; Maarten Kooyman; Gijs H P Tazelaar; Michael A van Es; Russell McLaughlin; William Sproviero; Aleksey Shatunov; Ashley Jones; Ahmad Al Khleifat; Alan Pittman; Sarah Morgan; Orla Hardiman; Ammar Al-Chalabi; Chris Shaw; Bradley Smith; Edmund J Neo; Karen Morrison; Pamela J Shaw; Catherine Reeves; Lara Winterkorn; Nancy S Wexler; David E Housman; Christopher W Ng; Alina L Li; Ryan J Taft; Leonard H van den Berg; David R Bentley; Jan H Veldink; Michael A Eberle
Journal: Genome Res Date: 2017-09-08 Impact factor: 9.438
