Abstract
BACKGROUND: Genetic studies are increasingly based on short, noisy sequences produced by next generation scanners. Typically, complete DNA sequences are assembled by matching short NextGen sequences against reference genomes. Despite considerable algorithmic gains since the turn of the millennium, matching both single ended and paired end strings to a reference remains computationally demanding. Further, tailoring bioinformatics tools to each new task or scanner remains highly skilled and labour intensive. With this in mind, we recently demonstrated a genetic programming based automated technique which generated a version of the state-of-the-art alignment tool Bowtie2 that was considerably faster on short sequences produced by a scanner at the Broad Institute and released as part of The Thousand Genome Project.
Keywords: Double-ended DNA sequence; High throughput Solexa 454 nextgen NGS sequence query; Homo sapiens genome reference consortium HG19; Rapid fuzzy string matching
Year: 2015 PMID: 25621011 PMCID: PMC4304608 DOI: 10.1186/s13040-014-0034-0
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. Speed up of genetic programming versions of Bowtie2 compared to hand produced code on single ended small-indel GCAT benchmarks. The horizontal axis gives the fraction of sequences correctly mapped (given by GCAT itself). The near vertical plots, for all but the longest DNA sequences, emphasise that the speed up (vertical axis) comes at little reduction in quality.
Figure 2. Speed up of genetic programming versions of Bowtie2 compared to hand produced code on paired end small-indel GCAT benchmarks. As with Table 1 and Figure 1, the percentage of correctly mapped sequences is calculated by GCAT.
Table 1. Speed and percentage speed up of each Bowtie2 variant on GCAT benchmarks
| Length (bases) | Sequences | Bowtie2 speed (Mbases/CPU hour) | Bowtie2 correct | Bowtie2 mapped | GP Bowtie2 speed (Mbases/CPU hour) | GP Bowtie2 correct | GP Bowtie2 mapped | Speed up (%) |
|---|---|---|---|---|---|---|---|---|
| 100 | 11945249 | 684 ± 6 | 89.23% | 98.88% | 782 ± 7 | 89.05% | 98.25% | 14 |
| 150 | 7963499 | 556 ± 2 | 93.58% | 99.48% | 696 ± 5 | 93.20% | 98.58% | 25 |
| 250 | 4778100 | 413 ± 9 | 97.14% | 99.74% | 509 ± 11 | 96.04% | 98.33% | 23 |
| 400 | 2986312 | 342 ± 19 | 98.77% | 99.86% | 371 ± 29 | 96.50% | 97.69% | 8 |
| | | | | | | | | |
| 100 | 11945249 | 486 ± 6 | 93.54% | 98.81% | 640 ± 16 | 92.98% | 98.16% | 32 |
| 150 | 7963499 | 481 ± 3 | 96.33% | 99.48% | 701 ± 3 | 95.46% | 98.55% | 46 |
| 250 | 4778100 | 462 ± 12 | 98.54% | 99.82% | 656 ± 23 | 97.05% | 98.40% | 42 |
| 400 | 2986312 | 425 ± 41 | 99.36% | 99.94% | 524 ± 65 | 96.87% | 97.76% | 23 |

| Length (bases) | Sequences | Bowtie2 speed (Mbases/CPU hour) | Bowtie2 correct | Bowtie2 mapped | GP Bowtie2 speed (Mbases/CPU hour) | GP Bowtie2 correct | GP Bowtie2 mapped | Speed up (%) |
|---|---|---|---|---|---|---|---|---|
| 100 | 5972625 | 674 | 94.47% | 99.41% | 827 | 94.03% | 98.70% | 23 |
| 150 | 3981750 | 736 | 91.99% | 98.82% | 956 | 91.61% | 97.62% | 30 |
| 250 | 2389050 | 826 | 95.46% | 98.96% | 1041 | 94.38% | 97.33% | 26 |
| 400 | 1493156 | 658 | 97.79% | 99.24% | 822 | 95.65% | 97.08% | 25 |
| | | | | | | | | |
| 100 | 5972625 | 702 | 95.19% | 98.91% | 921 | 94.74% | 98.20% | 31 |
| 150 | 3981750 | 717 | 93.53% | 98.93% | 999 | 93.07% | 97.66% | 39 |
| 250 | 2389050 | 763 | 95.49% | 98.61% | 1044 | 94.21% | 96.83% | 37 |
| 400 | 1493156 | 461 | 98.29% | 99.39% | 616 | 95.97% | 97.24% | 34 |
To normalise for different sequence lengths, we report millions of DNA bases processed per CPU hour. (In paired end tests both ends are counted.) Where available ± gives estimated standard deviation. The percentage of correctly assigned sequences and the percentage mapped are both reported by GCAT.
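The normalisation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' measurement code: the function names, the `cpu_seconds` input and the assumption that a paired end count refers to pairs (so each contributes two ends) are all hypothetical. The speed up helper assumes the reported percentage is the relative gain of the faster variant over the slower one, which is consistent with the tabulated speeds.

```python
def mbases_per_cpu_hour(n_sequences, read_length, cpu_seconds, paired_end=False):
    """Millions of DNA bases processed per CPU hour.

    Assumption: for paired end runs n_sequences counts pairs,
    so both ends of each pair are included in the base total.
    """
    ends = 2 if paired_end else 1
    total_bases = n_sequences * read_length * ends
    cpu_hours = cpu_seconds / 3600.0
    return total_bases / 1e6 / cpu_hours


def percent_speed_up(baseline_speed, new_speed):
    """Relative gain of new_speed over baseline_speed, as a percentage."""
    return 100.0 * (new_speed - baseline_speed) / baseline_speed


# Hypothetical timing: one million 100-base sequences in one CPU hour.
print(mbases_per_cpu_hour(1_000_000, 100, 3600))
# Speeds taken from the first data row of the table (684 vs 782).
print(round(percent_speed_up(684, 782)))
```

Applying `percent_speed_up` to the two speed columns of a row gives a quick consistency check against the final column.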