Arkarachai Fungtammasan, Guruprasad Ananda, Suzanne E Hile, Marcia Shu-Wei Su, Chen Sun, Robert Harris, Paul Medvedev, Kristin Eckert, Kateryna D Makova.
Abstract
Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.
Year: 2015 PMID: 25823460 PMCID: PMC4417121 DOI: 10.1101/gr.185892.114
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Erroneous call rates for Illumina data. The dotted lines represent the 95% confidence intervals of the multinomial sampling. Only repeat lengths with ≥100 read support (all loci combined) were plotted. (A) Human male X Chromosome for PCR-containing (PCR+) and PCR-free (PCR−; down-sampled data) library preparation protocols. (B) PCR− mononucleotide STR error rates by error categories. See text for the explanation of the categories. (C) PCR− dinucleotide STR error rates by error categories. (D) Error rates for ultradeep sequencing of plasmids with PCR+ and PCR− protocols.
Evaluation of the STR genotyping model by pseudodiploid genotyping
Figure 2. Factors that affect the accuracy of STR heterozygote genotyping, using the error correction model. (A) STR length. (B) The length difference in a heterozygote. (C) The ratio of read depths supporting each allele. (D) Read depth. All read profiles are generated from (A)m (A)n heterozygotes. The x-axis shows different STR length arrays, with the number of reads indicated for each genotype. The y-axis (confidence of correct prediction) shows the ratio of the probability that a locus is a heterozygote to the probability that it is a homozygote, given the read length profile. A taller bar indicates higher confidence in the genotype call; a negative value indicates incorrect genotyping. PCR− error profiles from human X Chromosome data were used to calculate the probabilities that the read profiles correspond to homozygous or heterozygous loci. Because the true genotype is a heterozygote, this ratio reflects the accuracy of genotyping.
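The heterozygote-versus-homozygote confidence described in this caption can be illustrated with a minimal sketch. It is not the STR-FM implementation: the function names, the simple off-by-one error model, and the 5% error rate are all assumptions made for illustration, whereas the paper derives its error profiles empirically from PCR− X Chromosome data. The sketch scores a read length profile under both hypotheses and returns a log10 ratio, so a positive value favors the heterozygote and a negative value favors a homozygote.

```python
import math

FLOOR = 1e-6  # hypothetical probability for errors larger than one repeat unit


def read_prob(observed, true_len, error_rate):
    """Assumed toy error model: a read reports the true repeat number with
    probability 1 - error_rate, is off by one unit in either direction with
    probability error_rate / 2, and gets a small floor otherwise.
    (Not exactly normalized because of the floor; fine for a sketch.)"""
    diff = abs(observed - true_len)
    if diff == 0:
        return 1.0 - error_rate
    if diff == 1:
        return error_rate / 2.0
    return FLOOR


def log_lik_hom(profile, allele, error_rate):
    """log10 likelihood of the read profile if the locus is homozygous."""
    return sum(n * math.log10(read_prob(obs, allele, error_rate))
               for obs, n in profile.items())


def log_lik_het(profile, a1, a2, error_rate):
    """log10 likelihood if each read comes from either allele with prob 0.5."""
    return sum(
        n * math.log10(0.5 * read_prob(obs, a1, error_rate)
                       + 0.5 * read_prob(obs, a2, error_rate))
        for obs, n in profile.items())


def het_vs_hom_confidence(profile, a1, a2, error_rate=0.05):
    """log10 ratio of heterozygote likelihood to the best homozygote one."""
    best_hom = max(log_lik_hom(profile, a, error_rate) for a in (a1, a2))
    return log_lik_het(profile, a1, a2, error_rate) - best_hom


# profile maps observed repeat number -> number of supporting reads
balanced = {10: 10, 11: 10}
skewed = {10: 19, 11: 1}
print(het_vs_hom_confidence(balanced, 10, 11))  # positive: heterozygote favored
print(het_vs_hom_confidence(skewed, 10, 11))    # negative: homozygote favored
```

Under these assumptions, a balanced profile supports the heterozygote call, while a strongly skewed profile is better explained as a homozygote with one erroneous read, which mirrors the effect of the allele read-depth ratio shown in panel C.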
Figure 3. Germline mutation rate analyses using STR-FM. (A) Cartoon of the pedigree used, with arrows showing branches where mutations were detected. (B) The numbers of de novo mutations arising in the male and female germ lines. (C) The germline mutation rates for STRs with different length and motif size. The dotted lines represent the 95% confidence interval of the multinomial sampling. Only repeat lengths with ≥2000 loci support were plotted. (D) The frequencies of different mutation categories for mononucleotide STRs. The dotted lines represent 95% confidence intervals from multinomial sampling. Only repeat numbers with ≥2000 loci support were plotted.
Figure 4. The minimal informative read depth required for correct genotyping of STR heterozygous alleles of consecutive repeat number with a success rate of 90%. (A)n was used for mononucleotide STRs, and (AC)n for dinucleotide STRs. All tri- and tetranucleotide STRs were used. (A) PCR-containing library preparation protocol. (B) PCR-free library preparation protocol.
Figure 5. Genome-wide sequencing depth required for 10× informative read depth of 10–50 bp repeats with a success rate of 90% using different read lengths.
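One way to see why the required genome-wide depth depends on both repeat length and read length is a simple geometric argument, sketched below. This is an illustrative approximation and not the tool described in the paper: it assumes that a read is informative only if it contains the entire repeat plus a fixed flank on each side, it assumes uniformly placed reads, and it ignores the 90% success-rate criterion and mapping effects. The function name and the 20-bp flank are hypothetical.

```python
import math


def required_genome_depth(target_informative_depth, read_len, repeat_len, flank=20):
    """Approximate genome-wide depth needed to reach a target informative depth.

    Assumption: a read is informative only if it fully spans the repeat plus
    `flank` bases on each side (the 20-bp default is a made-up value).
    With uniform read starts, genome-wide depth D puts D / read_len read
    starts at each position, and read_len - repeat_len - 2 * flank + 1 start
    positions yield a spanning read.
    """
    spanning_starts = read_len - repeat_len - 2 * flank + 1
    if spanning_starts <= 0:
        return math.inf  # reads are too short to span this repeat
    return target_informative_depth * read_len / spanning_starts


# Depth needed for 10x informative depth at several repeat lengths
for repeat_len in (10, 30, 50):
    for read_len in (100, 150, 250):
        depth = required_genome_depth(10, read_len, repeat_len)
        print(repeat_len, read_len, round(depth, 1))
```

Under these assumptions, longer repeats leave fewer read placements that span the locus, so the required genome-wide depth rises with repeat length and falls as reads get longer.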