Gareth Highnam, Christopher Franck, Andy Martin, Calvin Stephens, Ashwin Puthige, David Mittelman.
Abstract
Repetitive sequences are biologically and clinically important because they can influence traits and disease, but repeats are challenging to analyse using short-read sequencing technology. We present a tool for genotyping microsatellite repeats called RepeatSeq, which uses Bayesian model selection guided by an empirically derived error model that incorporates sequence and read properties. Next, we apply RepeatSeq to high-coverage genomes from the 1000 Genomes Project to evaluate performance and accuracy. The software uses common formats, such as VCF, for compatibility with existing genome analysis pipelines. Source code and binaries are available at http://github.com/adaptivegenome/repeatseq.
Year: 2012 PMID: 23090981 PMCID: PMC3592458 DOI: 10.1093/nar/gks981
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. An outline of the RepeatSeq method. Reads are mapped and realigned, and the set of reads spanning reference repeats is retained. Genotypes are assigned with consideration of the a priori error rate, which comes from the appropriate error profile and is used in the prior distribution of allele and error probabilities. The probability of each genotype suggested by the data is estimated in a Bayesian fashion, and the most probable genotype among these is called.
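The final step in the Figure 1 caption (scoring each candidate genotype and calling the most probable one) can be illustrated with a deliberately simplified sketch. This is not RepeatSeq's actual model: the function name `call_genotype`, the uniform prior over genotypes, and the assumption that an erroneous read is equally likely to show any other observed allele length are all assumptions made here for illustration. The single `error_rate` argument stands in for the a priori error rate that the paper derives from its error profiles.

```python
from itertools import combinations_with_replacement
from math import exp, log


def call_genotype(read_lengths, error_rate):
    """Call a diploid repeat genotype from the allele lengths seen in spanning reads.

    Simplified illustration only (assumed model, not RepeatSeq's):
    - candidate genotypes are all unordered pairs of observed allele lengths;
    - each read is drawn from either allele with probability 1/2;
    - a read shows the true allele with probability 1 - error_rate, otherwise
      one of the other observed lengths, chosen uniformly;
    - the prior over candidate genotypes is uniform.
    """
    if not read_lengths:
        raise ValueError("at least one spanning read is required")
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")

    alleles = sorted(set(read_lengths))
    n_other = max(len(alleles) - 1, 1)

    log_post = {}
    for genotype in combinations_with_replacement(alleles, 2):
        log_lik = 0.0
        for obs in read_lengths:
            p = 0.0
            for allele in genotype:
                match = 1.0 - error_rate if obs == allele else error_rate / n_other
                p += 0.5 * match
            log_lik += log(p)
        log_post[genotype] = log_lik  # uniform prior adds only a constant

    # Normalise in a numerically stable way to get posterior probabilities.
    top = max(log_post.values())
    weights = {g: exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    posterior = {g: w / total for g, w in weights.items()}

    best = max(posterior, key=posterior.get)
    return best, posterior[best]


# Nine reads support 12 repeat units and one supports 11.
genotype, prob = call_genotype([12] * 9 + [11], error_rate=0.05)
print(genotype, round(prob, 3))
```

With a low error rate the single discordant read is explained as an error and the homozygous genotype wins; an even split of reads between two lengths favours the heterozygous genotype instead.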
Figure 2. The performance of various methods for mapping reads to reference repeats. Mapping accuracy is determined using simulated 100 bp Illumina reads (with a coverage of 15×) and is assessed by measuring the proportion of incorrectly mapped reads as a function of the proportion of correctly mapped reads under different mapping quality thresholds. Variations of Bowtie2 are fully described as follows: bowtie2 (Bowtie2 with default settings), bowtie2-high (Bowtie2 using the highest sensitivity setting), bowtie2-local (Bowtie2 with default sensitivity and soft-clipping) and bowtie2-local-high (Bowtie2 using the highest sensitivity and soft-clipping).
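The assessment described in the Figure 2 caption can be sketched as a sweep over mapping quality thresholds. The helper name `accuracy_curve` and the input layout (one `(mapq, is_correct)` pair per mapped simulated read) are hypothetical, and the choice to express both proportions relative to the total number of simulated reads is an assumption, since the caption does not state the denominators.

```python
def accuracy_curve(alignments, total_reads, thresholds):
    """Return (threshold, correct_fraction, incorrect_fraction) per MAPQ cutoff.

    alignments  : iterable of (mapq, is_correct) for each mapped simulated read
    total_reads : number of simulated reads (assumed denominator for both fractions)
    thresholds  : MAPQ cutoffs; a read is kept when mapq >= cutoff
    """
    alignments = list(alignments)
    points = []
    for cutoff in thresholds:
        kept = [is_correct for mapq, is_correct in alignments if mapq >= cutoff]
        correct = sum(kept)
        incorrect = len(kept) - correct
        points.append((cutoff, correct / total_reads, incorrect / total_reads))
    return points


# Toy data: six simulated reads, five of which were mapped.
toy = [(60, True), (60, True), (30, True), (10, False), (0, False)]
for cutoff, correct, incorrect in accuracy_curve(toy, total_reads=6, thresholds=[0, 20]):
    print(cutoff, correct, incorrect)
```

Plotting the incorrect fraction against the correct fraction for each mapper gives one curve per method; a curve that stays low while extending far to the right indicates a mapper that places many reads correctly with few errors.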
Performance of mappers for microsatellite repeat regions
| Method | Total mapped | Correctly mapped | Incorrectly mapped |
|---|---|---|---|
| lobSTR | 1 118 902 (2.59) | 1 117 142 (2.59) | 1760 (0.16) |
| Novoalign | 41 014 531 (95.0) | 40 547 527 (93.9) | 467 004 (1.14) |
| Bowtie2 | 40 678 703 (94.2) | 40 196 603 (93.0) | 482 100 (1.19) |
| Bowtie2-high | 40 946 152 (94.8) | 40 464 488 (93.7) | 481 664 (1.18) |
| Bowtie2-local | 40 961 622 (94.9) | 40 448 448 (93.6) | 513 174 (1.25) |
| Bowtie2-local-high | 40 975 438 (94.9) | 40 472 990 (93.7) | 502 421 (1.23) |
| BWA | 39 390 695 (91.2) | 38 941 969 (90.2) | 448 726 (1.14) |
| BWASW | 40 611 633 (94.1) | 40 120 872 (92.9) | 490 761 (1.21) |
| SMALT | 41 180 368 (95.4) | 40 491 179 (93.7) | 689 189 (1.67) |
| Stampy | 41 004 163 (95.0) | 40 478 030 (93.8) | 526 133 (1.28) |
Number (%) of total, correctly and incorrectly mapped reads by each mapping method from 43 176 537 simulated 100 bp single-end reads that overlap a repetitive region in the hg19 reference sequence. Percentages for incorrectly mapped reads are from total mapped reads and not the total simulated reads.
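Because the table mixes two denominators, a short worked check may help. The sketch below recomputes the percentages for the Novoalign row using the counts from the table; the function name `mapping_percentages` is introduced here for illustration only.

```python
TOTAL_SIMULATED = 43_176_537  # simulated reads overlapping a repeat, from the table note


def mapping_percentages(total_mapped, correct, incorrect, total_simulated=TOTAL_SIMULATED):
    """Return (% mapped, % correct, % incorrect) using the table's denominators.

    Mapped and correctly mapped reads are divided by the simulated total;
    incorrectly mapped reads are divided by the reads that were mapped.
    """
    return (
        100 * total_mapped / total_simulated,
        100 * correct / total_simulated,
        100 * incorrect / total_mapped,
    )


# Novoalign row of the table.
pct_mapped, pct_correct, pct_incorrect = mapping_percentages(41_014_531, 40_547_527, 467_004)
print(f"{pct_mapped:.1f} {pct_correct:.1f} {pct_incorrect:.2f}")
```

The three values round to the figures reported for Novoalign (95.0, 93.9 and 1.14), which confirms that the incorrectly mapped percentage is taken over mapped reads rather than over all simulated reads.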
Comparison of RepeatSeq and lobSTR microsatellite calls
| Comparison | 1 bp | 2 bp | 3 bp | 4 bp | 5 bp | Total |
|---|---|---|---|---|---|---|
| RepeatSeq calls | 1 014 806 (88.0) | 556 727 (89.0) | 680 939 (89.7) | 766 010 (90.6) | 586 308 (90.7) | 3 604 790 (89.4) |
| lobSTR calls | N | 64 670 (10.3) | 15 722 (2.07) | 17 336 (2.05) | 8315 (1.29) | 106 043 (2.63) |
| Concordant call | N | 47 987 (7.67) | 14 482 (1.91) | 15 430 (1.82) | 7670 (1.19) | 85 569 (2.12) |
| Discordant call | N | 9538 (1.52) | 624 (0.08) | 946 (0.11) | 273 (0.04) | 11 381 (0.28) |
| RepeatSeq call, lobSTR N | 1 014 806 (88.0) | 499 202 (79.8) | 665 833 (87.7) | 749 634 (88.6) | 578 365 (89.5) | 3 507 840 (87.0) |
| lobSTR call, RepeatSeq N | N | 7145 (1.14) | 616 (0.08) | 960 (0.11) | 372 (0.06) | 9093 (0.23) |
| RepeatSeq N, lobSTR N | 138 769 (12.0) | 61 800 (9.88) | 77 758 (10.2) | 78 922 (9.33) | 59 848 (9.26) | 417 097 (10.3) |
Number (%) of total, concordant and discordant microsatellite calls are provided by repeat unit length, indicated by column values 1–5. Comparisons are made for microsatellites in which both, one or neither method makes a call. N indicates no call.
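The row categories in this comparison can be reproduced with a small tally over per-locus calls. The sketch below is an assumption-laden illustration: the function name `compare_calls`, the locus keys, and the representation of a genotype as a pair of allele lengths (with `None` for no call) are hypothetical and do not reflect the output format of either tool.

```python
from collections import Counter


def compare_calls(repeatseq, lobstr):
    """Tally agreement between two call sets keyed by locus.

    Each value is a genotype given as a pair of allele lengths, or None when
    the method made no call. A locus missing from one dictionary is treated
    as a no call for that method.
    """
    counts = Counter()
    for locus in repeatseq.keys() | lobstr.keys():
        a = repeatseq.get(locus)
        b = lobstr.get(locus)
        if a is None and b is None:
            counts["both_no_call"] += 1
        elif b is None:
            counts["repeatseq_only"] += 1
        elif a is None:
            counts["lobstr_only"] += 1
        elif sorted(a) == sorted(b):  # allele order within a genotype is ignored
            counts["concordant"] += 1
        else:
            counts["discordant"] += 1
    return counts


# Toy loci; the names and genotypes are made up for the example.
repeatseq_calls = {"locus1": (12, 12), "locus2": (10, 11), "locus3": None, "locus4": (8, 8)}
lobstr_calls = {"locus1": (12, 12), "locus2": (11, 11), "locus3": None, "locus5": (9, 9)}
print(compare_calls(repeatseq_calls, lobstr_calls))
```

Grouping the loci by repeat unit length before calling the function would yield one tally per column of the table.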