
Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles.

Gareth Highnam, Christopher Franck, Andy Martin, Calvin Stephens, Ashwin Puthige, David Mittelman.

Abstract

Repetitive sequences are biologically and clinically important because they can influence traits and disease, but repeats are challenging to analyse using short-read sequencing technology. We present a tool for genotyping microsatellite repeats called RepeatSeq, which uses Bayesian model selection guided by an empirically derived error model that incorporates sequence and read properties. Next, we apply RepeatSeq to high-coverage genomes from the 1000 Genomes Project to evaluate performance and accuracy. The software uses common formats, such as VCF, for compatibility with existing genome analysis pipelines. Source code and binaries are available at http://github.com/adaptivegenome/repeatseq.

Year:  2012        PMID: 23090981      PMCID: PMC3592458          DOI: 10.1093/nar/gks981

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Tandem repeats are dispersed throughout the genome, in and around gene regions. They were first identified as agents of disease ∼20 years ago (1), and since then, several microsatellite repeats (not all of which are triplets) have been identified as the underlying basis for a wide range of neurological and morphological disorders in humans and other mammals (2–4). In addition to causing disease, microsatellites can exert subtle effects on gene function and quantitative traits [reviewed in (5)]. Repeats are also mutational hotspots in that their instability can be triggered by nearly any aspect of DNA metabolism, and even transcription or stress [reviewed in (6)]. This sensitivity to defects in repair and cellular insults makes repeats useful markers for genome instability and cancer (7,8). Further, analysing repeats in personal genomes promises benefit not just to medical genetics and the diagnosis of repeat-related disorders but also to forensics and genealogy, where shorter and more stable tandem repeats can serve as DNA fingerprints to uniquely identify individuals (9,10). The utility of accurately and globally measuring tandem repeats therefore spans medicine, genetics and biotechnology: repeats influence clinical and subclinical phenotypes, are signatures for genomic instability and cancer and are important markers for forensics and genealogy.

Despite their utility and biological importance, some repetitive sequences (particularly microsatellites) are challenging to study with short-read sequencing technology. Genotyping microsatellite repeats from reference-mapped reads is fundamentally distinct from calling SNPs or indels in non-repetitive sequence because there is no sound basis for inferring homology between pairs of aligned repeat units. Microsatellite genotypes must be assigned in terms of allele length, that is, the number of sequenced bases within a read separating the non-repetitive flanking boundaries aligned to the reference, irrespective of intervening alignment gaps.
Furthermore, reads must span an entire repeat tract to confidently support an allele length. We recently demonstrated the advantage of this repeat-aware genotyping approach in a population-scale analysis of >100 inbred isolates from the Drosophila Genetics Reference Panel (DGRP) (11,12). Here, we describe a comprehensive genotyping software package, which we have termed RepeatSeq, for calling microsatellite repeat genotypes from whole genome resequencing data. RepeatSeq combines a repeat-aware method for repeat allele determination (11) with a Bayesian genotyping approach that uses a novel error model informed by properties of a repeat sequence and the reads that map to it.

Current approaches for identifying repeat mutations include indel genotyping methods implemented in popular software suites, such as GATK (13) and ATLAS2 (14), that can reveal indels within repeat regions, and the recently reported lobSTR method (10), which, much like our approach, considers repeats in their entirety. Indel callers are ill-suited for identifying repeat mutations: they do not report repeat genotypes, they can base indel identification on reads that do not fully span the repeat and they fail to account for the error rates of different repeat types. The mutation rate of microsatellite repeats is influenced not only by the length of the repeat tract but also by other intrinsic properties, such as the size of the repeated unit and the purity (lack of interruptions) of the repeated sequence (15). A genotyping method that incorporates the mutational properties of repeat sequences will be better able to distinguish false alleles from true heterozygosity. However, the success of a genotyping approach relies on more than just the accurate identification of true alleles: the method must be applicable to the greatest number of loci genome-wide. The lobSTR method, for example, makes microsatellite calls genome-wide (10); however, it is blind to homopolymer runs (i.e. mononucleotide repeats, which are a common and important source of genetic variation).

In the following sections, we describe the RepeatSeq algorithm and error model and its application to high-coverage genomes from the 1000 Genomes Project (16). RepeatSeq assigned genotypes to ∼90% of the repeat loci, including homopolymers. In addition, comparing RepeatSeq to lobSTR revealed that ∼90% of repeats genotyped by both methods were assigned the same call. The software takes high-throughput resequencing data in BAM format (17) as input and outputs genotypes in multiple formats, including VCF format (18), for compatibility and easy integration with existing genotyping pipelines.

METHODS

Program outline

The genotyping process is summarized in Figure 1. Reads are first mapped to a reference sequence and then sorted, with duplicate reads marked. Next, reads mapping to reference repeat regions are locally realigned using the GATK IndelRealigner tool. Then, for each repeat, the RepeatSeq module discards reads that do not completely span the repeat, as these reads cannot unambiguously support a repeat allele length. The extent to which a read must overlap a repeat (and match the reference) is an adjustable parameter, although we have previously optimized this value to maximize the number of repeat regions that could be genotyped while minimizing the inclusion of improperly mapped reads (11). Finally, RepeatSeq assigns the most probable genotype using a fully Bayesian approach that considers the reference length of the repeat, the repeat unit size and the average base quality of the mapped reads as prior information. RepeatSeq implements a diploid genotyping model by default, but it can be set at runtime to implement a haploid model.
Figure 1. An outline of the RepeatSeq method. Reads are mapped and realigned, and a set of reads spanning reference repeats are retained. Genotypes are assigned with consideration of the a priori error rate, which comes from the appropriate error profile and is used in the prior distribution of allele and error probabilities. The probability of each genotype suggested by the data is estimated in a Bayesian fashion, and the most probable genotype among these is called.

Several other read filters can be adjusted. For example, reads can be excluded from consideration on the basis of their length, mapping quality and on whether they are properly paired. By default, RepeatSeq returns microsatellite calls in VCF format for compatibility and interoperability with other indel callers. However, RepeatSeq can also output a detailed report that returns annotated alignments of reads in addition to just the assigned genotypes.
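The spanning-read filter and the length-based allele definition described above can be illustrated with a minimal Python sketch. It is not the RepeatSeq implementation: the function names, the 0-based half-open coordinates, the `flank` parameter (a stand-in for the adjustable overlap requirement) and the rule for counting insertions at the repeat boundaries are all assumptions made for illustration.

```python
import re

# Splits a SAM CIGAR string such as "15M2I15M" into (length, operation) pairs.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def allele_length(read_start, cigar, repeat_start, repeat_end, flank=3):
    """Count read bases lying between the flanks of a reference repeat.

    read_start is the 0-based leftmost aligned reference position;
    the repeat occupies [repeat_start, repeat_end) on the reference.
    Returns None when the read does not span the repeat plus `flank`
    bases on each side (hypothetical parameter, assumed for this sketch).
    """
    ref_pos = read_start
    bases = 0
    for n, op in parse_cigar(cigar):
        if op in "M=X":
            # Aligned bases: count the part that overlaps the repeat.
            lo = max(ref_pos, repeat_start)
            hi = min(ref_pos + n, repeat_end)
            if hi > lo:
                bases += hi - lo
            ref_pos += n
        elif op == "I":
            # Inserted bases count when they fall inside the repeat
            # (boundaries included; an assumption of this sketch).
            if repeat_start <= ref_pos <= repeat_end:
                bases += n
        elif op in "DN":
            # Deleted reference bases advance the reference only.
            ref_pos += n
        # S, H and P do not advance the reference position.
    spans = read_start <= repeat_start - flank and ref_pos >= repeat_end + flank
    return bases if spans else None


if __name__ == "__main__":
    # A 10 bp reference repeat at [100, 110).
    print(allele_length(90, "30M", 100, 110))       # reference-length allele
    print(allele_length(90, "15M2I15M", 100, 110))  # 2 bp expansion
    print(allele_length(90, "15M2D13M", 100, 110))  # 2 bp contraction
    print(allele_length(95, "10M", 100, 110))       # does not span the repeat
```

Counting bases between the flanks, rather than interpreting individual gaps, means that the same allele length is reported wherever the aligner happens to place an indel within the repeat.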

Probabilistic determination of repeat genotypes

The probability that a read is representative of a true allele can be inferred from the properties of the read and the reference repeat sequence that it maps to (11). We explored many such properties in a previous DGRP population study (12). The DGRP is composed of >100 fly genomes, each derived from a single female founder of a natural fly population and bred to near-isogeny by 20 generations of full-sibling matings. Therefore, in the absence of mapping, alignment or sequence errors, all reads from a single inbred line mapped to a specific microsatellite locus should possess the same repeat allele length, and this homozygosity permits the assumption that deviant reads represent error. To strengthen the validity of this assumption, we minimized the effects of any residual heterozygosity by excluding genomic regions with >5% heterozygosity (based on the SNP calls). For loci to which at least 10 reads had been mapped, reads that supported the majority allele were assumed to support the true allele at the locus, whereas reads that supported a non-majority allele were assumed to be errors. The proportions of reads supporting majority alleles were computed within five bins of equal size for the following properties: reference repeat length, repeat unit size and average base quality for the read. The resulting 5 × 5 × 5 array reflects the frequency of discordant and concordant reads and is the source of error profiles for RepeatSeq.

Once reads are mapped, realigned and then pruned of those that do not span the repeat, RepeatSeq calls the most probable genotype using Bayesian model selection. The error rate is populated with prior information by associating the reference repeat length, unit size and average base quality of the read data with the appropriate bin of the error profiles. The reads at a given locus support $k$ length variants, where $k \geq 1$. RepeatSeq considers each homozygous and heterozygous genotype suggested by the read data, of which there are $k(k+1)/2$.
For example, if two length alleles are present, then the heterozygous genotype of the two variants and each homozygous genotype are considered. If three variants are present, then the three possible homozygous and three possible heterozygous genotypes are considered.

Let $X$ denote the reads at a given locus. Denote the possible genotypes $g_1, g_2, \ldots, g_m$. For heterozygous and homozygous genotypes, we assume the likelihood $X \mid \theta, g$ is multinomial with probability vector $\theta$, and the prior on $\theta$ is Dirichlet with parameter vector $\alpha$, where the elements of $\theta$ sum to unity. For the purpose of elucidating the statistical model, let $g_{AA}$ generically denote a homozygous genotype and let $g_{AB}$ denote a heterozygous genotype, where $A$ and $B$ each represent one of the length variants present in the data. In the heterozygous case, the values $X = (x_A, x_B, x_E)$ and $\theta = (\theta_A, \theta_B, \theta_E)$ are 3D vectors. The values $x_A$ and $x_B$ represent the number of reads of variants $A$ and $B$, respectively, and $x_E$ is the number of reads of any other variant. Values $\theta_A$ and $\theta_B$ represent the probability that a read at this locus is of variant $A$ and $B$, respectively, and $\theta_E$ represents the probability that a read does not represent a true variant (e.g. a sequencing or mapping artifact relative to the genotype under consideration). The homozygous genotypes are modelled similarly with 2D vectors: position one refers to the variant supported by the genotype and the second position refers to all other alleles.

The vector $\alpha$ characterizes prior information about the probability vector $\theta$, and is formed based on derived error profiles. In the heterozygous case, let $c_A$ represent the number of majority allele supporting reads of variant $A$ from the error profiles, and let $e_A$ represent the number of non-majority allele supporting reads from the error profiles. Let $c_B$ and $e_B$ be similarly defined for variant $B$. The hyperparameters in $\alpha$ are formed from $c_A$, $e_A$, $c_B$ and $e_B$. In the homozygous case, the hyperparameters are formed from $c_A$ and $e_A$, where $c_A$ and $e_A$ are as defined previously.

The Bayesian model describing the probability of read error given genotype is as follows:

$$X \mid \theta, g \sim \mathrm{Multinomial}(n, \theta), \qquad \theta \mid g \sim \mathrm{Dirichlet}(\alpha),$$

where $n = \sum_j x_j$ is the number of reads at the locus. This model specification ensures that the posterior distribution weighs the error profile reads equally to reads from the data. We consider the aforementioned model specification for all genotypes $g_1, \ldots, g_m$. The marginal distribution of the data given each genotype is estimated, and Bayes' rule is used to compute the probability of each genotype given the data. This strategy is a fully Bayesian model selection algorithm, for which each model under consideration corresponds to one of the genotypes suggested by the data. The marginal distribution of the data for a given genotype is

$$P(X \mid g) = \frac{n!}{\prod_j x_j!} \, \frac{B(\alpha + X)}{B(\alpha)},$$

where $B(\alpha) = \prod_j \Gamma(\alpha_j) / \Gamma\left(\sum_j \alpha_j\right)$ represents the multinomial beta function. This analytical solution is based on conjugacy results. Bayes' rule is then implemented to invert the above probabilities:

$$P(g_i \mid X) = \frac{P(X \mid g_i)\, P(g_i)}{\sum_{l=1}^{m} P(X \mid g_l)\, P(g_l)},$$

where $P(g_i) = 1/m$ for all $i$. The most probable genotype is then selected, provided that the probability of the genotype is >50%.
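The model selection step described in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the RepeatSeq source code: the function names are hypothetical, the genotype prior is taken to be uniform, and the way the error-profile counts (`concordant`, `discordant`) are turned into Dirichlet hyperparameters is an assumption of this example (the concordant count is split evenly between the two alleles in the heterozygous case), since the exact hyperparameter expressions are not reproduced here.

```python
from collections import Counter
from itertools import combinations
from math import exp, lgamma


def log_multinomial_beta(alpha):
    """log B(alpha) = sum(log Gamma(a_j)) - log Gamma(sum(a_j))."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def log_marginal(counts, alpha):
    """Log Dirichlet-multinomial marginal likelihood of the read counts."""
    n = sum(counts)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    updated = [a + x for a, x in zip(alpha, counts)]
    return log_coef + log_multinomial_beta(updated) - log_multinomial_beta(alpha)


def genotype_posteriors(read_lengths, concordant, discordant):
    """Posterior probability of every genotype suggested by the reads.

    concordant/discordant are pseudo-counts taken from one error-profile
    bin (assumed inputs; how they map to hyperparameters is an assumption).
    """
    tally = Counter(read_lengths)
    n = len(read_lengths)
    variants = sorted(tally)
    log_ml = {}
    for a in variants:  # homozygous models: (allele, everything else)
        counts = [tally[a], n - tally[a]]
        log_ml[(a, a)] = log_marginal(counts, [concordant, discordant])
    for a, b in combinations(variants, 2):  # heterozygous models
        counts = [tally[a], tally[b], n - tally[a] - tally[b]]
        alpha = [concordant / 2, concordant / 2, discordant]
        log_ml[(a, b)] = log_marginal(counts, alpha)
    # With a uniform genotype prior, the prior cancels in Bayes' rule.
    top = max(log_ml.values())
    weights = {g: exp(v - top) for g, v in log_ml.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(read_lengths, concordant, discordant, threshold=0.5):
    """Return (genotype, probability); genotype is None below the threshold."""
    post = genotype_posteriors(read_lengths, concordant, discordant)
    best = max(post, key=post.get)
    return (best if post[best] > threshold else None, post[best])


if __name__ == "__main__":
    print(call_genotype([20] * 6 + [22] * 6, concordant=95, discordant=5))
    print(call_genotype([20] * 11 + [21], concordant=95, discordant=5))
```

Working in log space with `lgamma` avoids overflow in the factorial and gamma terms at high coverage, and subtracting the largest log marginal before exponentiating keeps the normalization numerically stable.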

RESULTS

Optimization of read mapping for microsatellite repeat genotyping

The accuracy of repeat genotypes is contingent on the proper mapping of reads to repeat loci. There are many short-read mapping algorithms, and their performance has been widely evaluated (19). However, mapping algorithms have not yet been evaluated on their ability to accurately map reads composed of low-complexity sequence. We compared the accuracy of several popular tools, including Bowtie2 (20), BWA (21), Novoalign (Novocraft Technologies), Stampy (22) and SMALT (Wellcome Trust Sanger Institute). First, reference repeats were identified from the hg19 reference sequence using a previously described approach (11). Next, DWGSIM (http://github.com/nh13/DWGSIM) was used to simulate 100 bp single-end Illumina reads from reference repeats from chromosomes 1 through 7 in the human reference sequence, with a simulated coverage of 15×. We specified a 2% uniform sequencing error rate and a 0.1% mutation rate, of which 10% of the mutations were small indels (1–10 bp long) and the remainder were point mutations. Simulated reads were then mapped to the entire human reference sequence. The parameters, runtimes and resource usage for all tools are documented in Supplementary Table S1.

The proportion of incorrectly mapped reads is plotted as a function of the proportion of correctly mapped reads in Figure 2 under different mapping quality thresholds. For each mapping quality threshold, the accuracy rate is the number of correctly mapped reads divided by the total number of reads simulated. Likewise, the inaccuracy rate is the number of incorrectly mapped reads divided by the total number of mapped reads. Assessment of read alignment is based on the leftmost coordinate of the read. Novoalign maps repeat-containing reads with the lowest ratio of inaccurate calls to accurate calls. Among the open source tools, Bowtie2-high (Bowtie2 using the highest sensitivity option) is the best mapping method by this metric. Other settings of Bowtie2, along with Stampy and BWA-SW, performed similarly to one another, but not as well as Novoalign or Bowtie2-high.
Figure 2. The performance of various methods for mapping reads to reference repeats. Mapping accuracy is determined using simulated 100 bp Illumina reads (with a coverage of 15×) and is assessed by measuring the proportion of incorrectly mapped reads as a function of the proportion of correctly mapped reads under different mapping quality thresholds. Variations of Bowtie2 are fully described as follows: bowtie2 (Bowtie2 with default settings), bowtie2-high (Bowtie2 using the highest sensitivity setting), bowtie2-local (Bowtie2 with default sensitivity and soft-clipping) and bowtie2-local-high (Bowtie2 using the highest sensitivity and soft-clipping).

Table 1 summarizes the performance of the mapping programs and includes a comparison with the lobSTR method, which eschews conventional short-read mappers in favour of its own method for directly assigning reads to a proper location on the human reference sequence. The table indicates performance cumulatively for all mapping qualities, to enable fair comparison with lobSTR, which does not report mapping quality. Novoalign mapped the greatest number of correctly mapped reads (40 547 527; 93.9%), although SMALT mapped the greatest number of reads overall (41 180 368; 95.4%). LobSTR mapped the fewest number of correctly mapped reads (1 117 142; 2.59%) but also the fewest reads overall (1 118 902; 2.59%). Maximizing read mapping might result in increased numbers of improperly mapped reads; therefore, we also compared mapping methods on the basis of their inaccuracy rate. We found that lobSTR features the lowest inaccuracy rate of 0.16%, followed by Novoalign with an inaccuracy rate of 1.14%. The open-source Bowtie2-high method also performs nearly as well with an inaccuracy rate of 1.18%.
Table 1. Performance of mappers for microsatellite repeat regions

Method              Total mapped n (%)    Correctly mapped n (%)  Incorrectly mapped n (%)
lobSTR              1 118 902 (2.59)      1 117 142 (2.59)        1760 (0.16)
Novoalign           41 014 531 (95.0)     40 547 527 (93.9)       467 004 (1.14)
Bowtie2             40 678 703 (94.2)     40 196 603 (93.0)       482 100 (1.19)
Bowtie2-high        40 946 152 (94.8)     40 464 488 (93.7)       481 664 (1.18)
Bowtie2-local       40 961 622 (94.9)     40 448 448 (93.6)       513 174 (1.25)
Bowtie2-local-high  40 975 438 (94.9)     40 472 990 (93.7)       502 421 (1.23)
BWA                 39 390 695 (91.2)     38 941 969 (90.2)       448 726 (1.14)
BWA-SW              40 611 633 (94.1)     40 120 872 (92.9)       490 761 (1.21)
SMALT               41 180 368 (95.4)     40 491 179 (93.7)       689 189 (1.67)
Stampy              41 004 163 (95.0)     40 478 030 (93.8)       526 133 (1.28)

Number (%) of total, correctly and incorrectly mapped reads by each mapping method from 43 176 537 simulated 100 bp single-end reads that overlap a repetitive region in the hg19 reference sequence. Percentages for incorrectly mapped reads are from total mapped reads and not the total simulated reads.

Comparing performance cumulatively for all mapping qualities likely underestimates the accuracy of conventional mappers because even minimal filtering by mapping quality can exclude inaccurately mapped reads. For example, filtering reads based on a minimum mapping quality of 4 lowers the Novoalign inaccuracy rate to 0.007%, at the cost of reducing the number of correctly mapped reads to 40 297 714 (93.3%). At a minimum mapping quality of 3, Novoalign features an inaccuracy rate of 0.33% with 40 436 553 (93.6%) correctly mapped reads. Bowtie2-high performs similarly with an inaccuracy rate of 0.03% (40 213 461 or 93.1% correctly mapped reads) with a minimum mapping quality of 2. Filtering reads by mapping quality allows for optimizing the trade-off of maximizing correctly mapped reads while minimizing the incorrectly mapped reads.
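The accuracy and inaccuracy rates defined above use different denominators (simulated reads versus mapped reads), which is easy to overlook when reading Table 1. The short Python sketch below makes the two definitions explicit and checks them against the Novoalign row; the function name and the dictionary keys are hypothetical.

```python
SIMULATED_READS = 43_176_537  # simulated reads overlapping a repeat (Table 1)


def mapping_rates(correct, incorrect, simulated=SIMULATED_READS):
    """Return the percentages reported in Table 1 for one mapper.

    accuracy   = correctly mapped / all simulated reads
    inaccuracy = incorrectly mapped / all mapped reads
    """
    mapped = correct + incorrect
    return {
        "mapped_pct": 100.0 * mapped / simulated,
        "accuracy_pct": 100.0 * correct / simulated,
        "inaccuracy_pct": 100.0 * incorrect / mapped,
    }


if __name__ == "__main__":
    novoalign = mapping_rates(correct=40_547_527, incorrect=467_004)
    for name, value in novoalign.items():
        print(f"{name}: {value:.2f}")
```

Because the inaccuracy rate is conditioned on mapped reads, a mapper that places very few reads (such as lobSTR in this comparison) can show a low inaccuracy rate while still having a low accuracy rate.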

Application of RepeatSeq to whole genomes from a parent–offspring trio

A subset of genotyping error can be inferred by evaluating whether called genotypes from a parent–offspring dataset are consistent with Mendelian inheritance (23). We applied RepeatSeq to whole genome data generated using the Illumina HiSeq and 100 bp reads, from the CEU trio, a family from Utah with European ancestry (CEU genomes: NA12878, NA12891, NA12892). The average coverage of the mapped and post-processed genomes was 42.7×, and this coverage dropped to 33.3× at repeat regions. In testing for consistency with Mendelian inheritance, we considered repeats for which calls were made from all genomes in the trio and for which there was at least a single mapped discordant read at the locus in each genome. RepeatSeq requires at least two reads to make a call, and for this threshold, 92.1% of evaluated repeat calls are consistent with Mendelian inheritance (Supplementary Table S2). At a minimum coverage of 9×, the percentage of consistent calls increases to 95.3%, and at 17×, the percentage increases to 98.0%.

This consistency test also enables the assessment of the impact of choosing a less informative prior for the genotyping model. We replaced our error profile with a constant rate prior in which we define a single observed concordant read and no discordant reads for every element of the error profile. This strategy eschews the experimentally derived error profile and instead incorporates weak prior information into our genotyping algorithm. Not surprisingly, the percentage of consistent calls drops substantially. At a minimum of 2, 9 and 17× coverage, the constant rate prior produces calls that are consistent 72.9, 73.7 and 74.5% of the time, respectively (Supplementary Table S2).
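A Mendelian consistency test of the kind described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used for the trio: genotypes are assumed to be pairs of allele lengths, and the function names, the record layout and the `min_coverage` filter are hypothetical.

```python
def is_mendelian_consistent(child, mother, father):
    """True if one child allele can come from each parent.

    Each genotype is a pair of allele lengths, e.g. (20, 22).
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def consistency_rate(trio_calls, min_coverage=2):
    """Percentage of loci consistent with Mendelian inheritance.

    trio_calls is a list of dicts with keys 'child', 'mother', 'father'
    (genotype pairs) and 'coverage' (read depth in each of the three
    genomes). Loci below min_coverage in any genome are skipped.
    Returns None when no locus passes the coverage filter.
    """
    kept = [t for t in trio_calls if min(t["coverage"]) >= min_coverage]
    if not kept:
        return None
    consistent = sum(
        is_mendelian_consistent(t["child"], t["mother"], t["father"])
        for t in kept
    )
    return 100.0 * consistent / len(kept)


if __name__ == "__main__":
    calls = [
        {"child": (20, 22), "mother": (20, 20), "father": (22, 24),
         "coverage": (30, 28, 25)},
        {"child": (20, 26), "mother": (20, 20), "father": (22, 24),
         "coverage": (4, 6, 3)},
    ]
    print(consistency_rate(calls, min_coverage=2))
    print(consistency_rate(calls, min_coverage=9))
```

A check of this kind detects only a subset of errors: a miscalled genotype that still happens to be compatible with the parental genotypes passes unnoticed.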

Application of RepeatSeq to the HG01140 whole genome

We compared the performance of our genotyper with lobSTR, a recently reported tool that is also the only other genotyper tuned for microsatellite repeats. To perform the comparison, a 16× sequenced human genome (HG01140) from the 1000 Genomes Project was mapped against the hg19 reference sequence using Novoalign, post-processed and passed to both genotyping methods. The FASTQ files for this genome were mapped using Novoalign and realigned using the GATK IndelRealigner. RepeatSeq was then executed using default parameters to generate a list of repeat genotypes. To genotype repeats with lobSTR, we used the available binaries (lobSTR version 1.0.6) and usage guidelines. First, lobSTR built an index for the hg19 reference genome. The same list of repeat regions used by RepeatSeq was indexed for use by lobSTR. Next, lobSTR constructed a new BAM file (based on the same BAM file used by RepeatSeq) containing its alignments for genotype calling. We used default parameters except that we specified genotyping repeats with a unit length of 1–5. The lobSTR-generated BAM file was then passed to the lobSTR allelotyper tool to produce a list of repeat genotypes.

Table 2 summarizes the comparison of RepeatSeq and lobSTR genotypes. The percentages indicate the fraction of genotypes made from a list of 4 030 980 reference repeat regions. Unlike the earlier comparison using simulated data from chromosomes 1 through 7, this comparison includes all identified hg19 reference repeat regions. RepeatSeq assigned genotypes to 3 604 790 (89.4%) repeats, whereas lobSTR assigned genotypes to 106 043 (2.63%) repeats. When the comparison is made by repeat unit size, RepeatSeq assigns genotypes to a similar proportion of repeats for all unit sizes, whereas lobSTR assigned the most genotypes to dinucleotide repeats (64 670; 10.3%).

For repeat regions that were genotyped by both RepeatSeq and lobSTR, we measured the concordance of the methods, that is, the cases in which both methods assigned the same genotype. Genotypes were assigned by both methods for 96 950 repeats. Of these, 85 569 (88.3%) repeat genotypes were the same. The concordance between the methods is lowest for 2mers (83.4%) and increases with unit size, which is expected given that the variability of microsatellite repeats is dependent on the size of the repeated unit sequence. From the 11 381 (11.7%) discordant genotypes, we randomly selected repeat regions and used Sanger sequencing to confirm which method assigned the correct genotype. After sequencing 40 regions, we found that 25 (62.5%) regions were correctly genotyped by RepeatSeq, 4 (10.0%) regions were correctly genotyped by lobSTR and 11 (27.5%) regions were incorrectly genotyped by both methods (Supplementary Table S3).
Table 2. Comparison of RepeatSeq and lobSTR microsatellite calls

Comparison                1 n (%)           2 n (%)           3 n (%)           4 n (%)           5 n (%)           Total n (%)
RepeatSeq calls           1 014 806 (88.0)  556 727 (89.0)    680 939 (89.7)    766 010 (90.6)    586 308 (90.7)    3 604 790 (89.4)
lobSTR calls              N                 64 670 (10.3)     15 722 (2.07)     17 336 (2.05)     8315 (1.29)       106 043 (2.63)
Concordant call           N                 47 987 (7.67)     14 482 (1.91)     15 430 (1.82)     7670 (1.19)       85 569 (2.12)
Discordant call           N                 9538 (1.52)       624 (0.08)        946 (0.11)        273 (0.04)        11 381 (0.28)
RepeatSeq call, lobSTR N  1 014 806 (88.0)  499 202 (79.8)    665 833 (87.7)    749 634 (88.6)    578 365 (89.5)    3 507 840 (87.0)
lobSTR call, RepeatSeq N  N                 7145 (1.14)       616 (0.08)        960 (0.11)        372 (0.06)        9093 (0.23)
RepeatSeq N, lobSTR N     138 769 (12.0)    61 800 (9.88)     77 758 (10.2)     78 922 (9.33)     59 848 (9.26)     417 097 (10.3)

Number (%) of total, concordant and discordant microsatellite calls are provided by repeat unit length, indicated by column values 1–5. Comparisons are made for microsatellites in which both, one or neither method makes a call. N indicates no call.

From the 3 604 790 repeat genotypes assigned by RepeatSeq, lobSTR failed to genotype 3 507 840 (97.3%) regions. This is consistent with lobSTR's greatly reduced number of total and correctly mapped reads for the simulated data (Table 1). However, from the 106 043 repeat genotypes assigned by lobSTR, 9093 repeat genotypes (8.57%) were missed by RepeatSeq. In 5090 (56.0%) of these repeats, lobSTR was able to map more reads to the locus than RepeatSeq had access to, and this likely enabled the assignment of a genotype by lobSTR. In nearly half of these cases (2436 repeats), lobSTR added more than one mapped read to the locus. We also found that lobSTR assigns genotypes if at least a single read is mapped to the locus. RepeatSeq, however, requires at least two mapped reads, as a single read is not sufficient to confidently distinguish a heterozygous locus from a homozygous locus. There were 3819 (42.0%) cases in which lobSTR exclusively assigned a genotype using only a single read. For the remaining repeats genotyped by lobSTR and not RepeatSeq, manual inspection revealed that RepeatSeq failed to genotype many of these repeats because it could not assign a high enough confidence (>50%) to the genotype. Finally, there were 417 097 (10.3%) repeats not genotyped by either method, primarily because no reads could be mapped to these regions.

In all, RepeatSeq assigns genotypes to ∼90% of repeats in the HG01140 genome, and in the cases where both methods make a call, the concordance in genotypes is ∼90%. Although we did not exhaustively validate the discordant genotypes, we do not see any evidence to suggest that lobSTR generally assigns genotypes more accurately than RepeatSeq.
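The categories used in Table 2 (concordant, discordant, called by one method only, called by neither) can be derived from two sets of genotype calls with a short Python sketch. The function names and the representation of calls as dictionaries keyed by locus are assumptions made for illustration; neither tool exposes such an interface.

```python
from collections import Counter


def compare_calls(loci, calls_a, calls_b):
    """Classify every locus by which method(s) called it.

    calls_a / calls_b map a locus identifier to a genotype (a pair of
    allele lengths); a locus missing from a dict counts as no call.
    """
    summary = Counter()
    for locus in loci:
        a, b = calls_a.get(locus), calls_b.get(locus)
        if a is None and b is None:
            summary["neither"] += 1
        elif b is None:
            summary["a_only"] += 1
        elif a is None:
            summary["b_only"] += 1
        elif sorted(a) == sorted(b):  # allele order is irrelevant
            summary["concordant"] += 1
        else:
            summary["discordant"] += 1
    return summary


def concordance_pct(concordant, discordant):
    """Concordance among loci called by both methods, in percent."""
    return 100.0 * concordant / (concordant + discordant)


if __name__ == "__main__":
    loci = ["chr1:100", "chr1:500", "chr2:42", "chr3:7"]
    a = {"chr1:100": (20, 22), "chr1:500": (15, 15), "chr2:42": (9, 9)}
    b = {"chr1:100": (22, 20), "chr1:500": (15, 16)}
    print(dict(compare_calls(loci, a, b)))
    # Totals from Table 2: 85 569 concordant and 11 381 discordant calls.
    print(round(concordance_pct(85_569, 11_381), 1))
```

Note that the percentages in Table 2 are expressed relative to all 4 030 980 reference repeats, whereas the concordance quoted in the text is relative to the 96 950 repeats called by both methods.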

DISCUSSION

Repetitive DNA sequences pose unique challenges to next-generation sequencing technologies. However, these sequences are too important to ignore, in terms of their frequent occurrence in the genome and their biological relevance. RepeatSeq uses properties of reference repeat sequences and the reads that map to them to best identify reads that contribute to true alleles. To enable accurate RepeatSeq calls, we first evaluated several methods for mapping repeat-containing reads to the human reference sequence. We found that when we filtered reads by mapping quality, Novoalign and Bowtie2 correctly map >93% of simulated repeat-containing reads to the human reference (compared with 3% mapped by lobSTR) while maintaining an inaccuracy rate that is lower than that of lobSTR. Filtering reads by mapping quality allowed for optimizing the trade-off of maximizing correctly mapped reads while minimizing the incorrectly mapped reads.

When we analysed the HG01140 genome from the 1000 Genomes Project, we found that RepeatSeq assigned genotypes to ∼90% of the repeats, whereas lobSTR genotyped <3%. The disparity between the methods likely results from the difference in how many reads are mapped by lobSTR and Novoalign. LobSTR did make calls that RepeatSeq missed for 9093 (8.57%) repeats. Of these calls that RepeatSeq missed, 42% were made with the support of a single read. RepeatSeq requires the support of two reads to make calls in diploid genomes. In the remaining cases, RepeatSeq sometimes failed to reach high enough genotype probabilities (>50%), and in other cases, lobSTR was able to exclusively make a call, likely because it was able to map more reads to the locus. There is no doubt that there are cases in which lobSTR will excel at assigning genotypes, particularly those that exhibit extreme allelic variation from the reference (such as in the case of expanded repeats at the HTT locus in Huntington's patients); however, based on the number of calls, the ∼90% concordance in genotypes for repeats in which both methods made a call and our Sanger validation data, RepeatSeq offers the best comprehensive approach for exhaustively genotyping repeats in human genomes.

RepeatSeq is multithreaded and scales well on multi-core processors. An analysis of ∼4 million reference repeats in the NA12878 genome (40× post-processed coverage, 100 bp reads) takes 11 h using a single core of an AMD Opteron 6174 CPU and a maximum of 1.65 GB of memory. With 48 cores, the runtime is reduced to 20 min. Runtimes can be further reduced by opting to generate only the VCF output. In addition to outputting genotypes using VCF, RepeatSeq optionally produces a detailed report that includes an annotated alignment of all the reads to the reference sequence for each microsatellite locus. We envision this extended report will be valuable for manually confirming calls and for further refining the method. The use of VCF output ensures the compatibility of RepeatSeq with other variant callers and analysis pipelines. For example, the VCF output of RepeatSeq can be used to augment indel calls from GATK's UnifiedGenotyper to provide a comprehensive and more accurate set of variant calls.
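As a rough reading of the runtimes quoted above, the speedup $S$ and parallel efficiency $E$ on 48 cores can be worked out directly. The calculation treats the reported 11 h and 20 min as exact, which is an assumption, so the result is only indicative.

```latex
\[
S = \frac{T_{1}}{T_{48}} = \frac{11 \times 60\ \text{min}}{20\ \text{min}} = 33,
\qquad
E = \frac{S}{48} = \frac{33}{48} \approx 0.69 .
\]
```

Under this reading, the 48-core run is about 33 times faster than the single-core run, or roughly 69% of ideal linear scaling.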

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–3.

FUNDING

National Institutes of Health [NS079926 to D.M.]; NVIDIA Foundation's ‘Compute the Cure’ program (to D.M.). Funding for open access charge: NIH. Conflict of interest statement. None declared.