Abstract
MOTIVATION: Over the last few years, methods based on suffix arrays using the Burrows-Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which, while powerful, are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale to much larger datasets than those currently handled by genotype algorithms.
Year: 2014 PMID: 24413527 PMCID: PMC3998136 DOI: 10.1093/bioinformatics/btu014
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A set of haplotype sequences sorted in order of reversed prefixes at position k, showing the set of values at k isolated from those before and after and, on the right-hand side, how the order at position (k + 1) is derived from that at k as in Algorithm 1. Maximal substrings shared with the preceding sequence ending at k are shown in bold and underlined; these start at position d[i] as calculated in Algorithm 2.
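The update described in the caption can be illustrated with a short Python sketch. It is not the paper's pseudocode, which is not reproduced here; it is one common way to write the step, assuming biallelic haplotypes coded as 0/1. The names `pbwt_step` and `build_pbwt` are hypothetical and do not come from the pbwt software. The sketch derives the order at position k + 1 from the order at k by a stable partition on the allele at k, and tracks for each sequence the start d[i] of its longest match with the preceding sequence in the sorted order.

```python
def pbwt_step(a, d, column, k):
    """One update step (illustrative; names are assumptions, not the pbwt API).

    a      -- haplotype indices sorted by reversed prefix ending at site k
    d      -- d[i] is the start of the longest match between a[i] and a[i-1]
              ending at site k
    column -- column[h] is the 0/1 allele of haplotype h at site k
    Returns the order and match starts for site k + 1.
    """
    a0, a1, d0, d1 = [], [], [], []
    p = q = k + 1  # k + 1 means "no match with the predecessor"
    for hap, start in zip(a, d):
        # Track the latest match start seen since the last 0 (p) or 1 (q).
        p = max(p, start)
        q = max(q, start)
        if column[hap] == 0:
            a0.append(hap)
            d0.append(p)
            p = 0
        else:
            a1.append(hap)
            d1.append(q)
            q = 0
    # Stable partition: all sequences with allele 0 precede those with allele 1.
    return a0 + a1, d0 + d1


def build_pbwt(haps):
    """Apply pbwt_step across every site of a list of equal-length 0/1 lists."""
    n_haps = len(haps)
    a = list(range(n_haps))
    d = [0] * n_haps
    for k in range(len(haps[0])):
        column = [h[k] for h in haps]
        a, d = pbwt_step(a, d, column, k)
    return a, d


haps = [[0, 1, 0], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
order, starts = build_pbwt(haps)
print(order, starts)  # [3, 0, 1, 2] [3, 2, 1, 3]
```

In this toy example, the final order lists the haplotypes by their reversed full sequences. A start value equal to the number of sites (3) marks a sequence that has no match with its predecessor ending at the last site.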
Compression performance of pbwt on datasets of increasing size
| Number of sequences | 1000 | 10 000 | 100 000 |
|---|---|---|---|
| Sequences .gz size (KB) | 10 515 | 105 559 | 1 024 614 |
| PBWT size (KB) | 1686 | 3372 | 7698 |
| Ratio .gz/PBWT | 6.2 | 31.3 | 133.1 |
| PBWT bytes/site | 4.6 | 9.1 | 20.8 |
Set-maximal match performance of pbwt on datasets of increasing size
| Number of sequences | 1000 | 10 000 | 100 000 |
|---|---|---|---|
| Set-maximal time (s) | 12.1 | 120.3 | 1213.7 |
| Set-maximal average length (Mb) | 0.27 | 1.48 | 3.98 |
Time to match 1000 new sequences in seconds, split into user (u) and system (s) contributions for the indexed and batch approaches
| Number of sequences | 1000 | 5000 | 10 000 | 50 000 |
|---|---|---|---|---|
| Naïve | 52.1 | 258.9 | 519.2 | 2582.6 |
| Indexed | 0.9u + 0.1s | 0.9u + 0.1s | 0.9u + 0.2s | 1.7u + 15s |
| Batch | 2.3u + 0.1s | 3.5u + 0.1s | 4.8u + 0.1s | 12.1u + 0.1s |