Amy L Williams, David E Housman, Martin C Rinard, David K Gifford.
Abstract
Hapi is a new dynamic programming algorithm that ignores uninformative states and state transitions in order to efficiently compute minimum-recombinant and maximum likelihood haplotypes. When applied to a dataset containing 103 families, Hapi performs 3.8 and 320 times faster than state-of-the-art algorithms. Because Hapi infers both minimum-recombinant and maximum likelihood haplotypes and applies to related individuals, the haplotypes it infers are highly accurate over extended genomic distances.
Year: 2010 PMID: 21034477 PMCID: PMC3218664 DOI: 10.1186/gb-2010-11-10-r108
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Runtime results comparing Hapi to other family-based haplotyping algorithms
| | | All families | | ≤3 Children | |
|---|---|---|---|---|---|
| Machine | Program | Runtime | Speedup | Runtime | Speedup |
| 2.30 GHz AMD Opteron | Hapi | 3.112 s | - | 2.225 s | - |
| | Merlin | 1005 s | 323× | 8.662 s | 3.84× |
| | Allegro v2 | 7661 s | 2,462× | 14.50 s | 6.43× |
| | Superlink | 1393 s* | 448× | 38.75 s | 17.2× |
| 1.40 GHz Pentium M | Hapi | 4.732 s | - | 3.451 s | - |
| | PedPhase 2.0 | >21,600 s (6 h)† | >4,500× | >21,600 s (6 h)† | >6,000× |
Runtimes for maximum likelihood haplotyping, using Hapi, Merlin, Allegro, and Superlink, of nuclear families from the Huntington's Disease Venezuela Collaborative Study [32]. We list times for haplotyping all nuclear families and for haplotyping those with three or fewer children. *Superlink failed to haplotype the family with 11 children; we therefore used only 8 of the children from the 11-child family to time it. Times are averages from running Hapi eight times and Merlin, Allegro, and Superlink three times each. The table also lists runtimes on a different machine for minimum-recombinant haplotyping using Hapi (averaged from eight runs) and PedPhase. †For chromosome 1 only.
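The speedup figures appear to be each competing program's runtime divided by Hapi's runtime on the same machine. The following is a minimal sketch of that arithmetic, using only the "All families" runtimes on the AMD Opteron rows; the function name `speedup` and the constant names are illustrative, not part of Hapi.

```python
# Runtimes in seconds from the "All families" column (2.30 GHz AMD Opteron rows).
HAPI_SECONDS = 3.112
OTHER_SECONDS = {"Merlin": 1005.0, "Allegro v2": 7661.0, "Superlink": 1393.0}


def speedup(other_seconds: float, hapi_seconds: float = HAPI_SECONDS) -> float:
    """Return how many times faster Hapi ran than the other program.

    Assumption: speedup is the simple ratio of the two runtimes.
    """
    return other_seconds / hapi_seconds


if __name__ == "__main__":
    for program, seconds in OTHER_SECONDS.items():
        print(f"{program}: {speedup(seconds):.0f}x")
```

Rounded to the nearest integer, these ratios agree with the 323×, 2,462×, and 448× entries listed for all families.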
Timing results from simulations of extreme amounts of missing data
| Total % missing | Simulation probability | Runtime | Slowdown | Speedup vs. Merlin |
|---|---|---|---|---|
| 5% | 3.83% | 3.274 s | 5.21% | 306× |
| 10% | 8.83% | 3.564 s | 14.5% | 281× |
| 20% | 18.8% | 4.567 s | 46.8% | 220× |
| 30% | 28.8% | 6.897 s | 122% | 145× |
| 40% | 38.8% | 11.36 s | 265% | 88.5× |
| 50% | 48.8% | 36.38 s | 1070% | 27.6× |
Hapi's runtime performance for haplotyping the dataset discussed in Results in the presence of various total proportions of missing data. Because this dataset contains 1.17% missing data already, we dropped genotypes according to the indicated probabilities in order to obtain the total overall proportions of missing data. The table lists the runtime, percentage slowdown compared to running Hapi on the unmodified dataset, and the speedup compared to running Merlin on the unmodified dataset.
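The slowdown column is consistent with the relative increase over Hapi's 3.112 s runtime on the unmodified dataset (the all-families figure from the runtime table), and the simulation probabilities are consistent with subtracting the 1.17% of genotypes already missing from the target total. The sketch below works under those two assumptions; the function names are illustrative.

```python
BASELINE_SECONDS = 3.112     # Hapi on the unmodified dataset (assumed baseline)
ALREADY_MISSING_PCT = 1.17   # missing data already present in the dataset


def slowdown_pct(runtime_seconds: float,
                 baseline_seconds: float = BASELINE_SECONDS) -> float:
    """Percentage increase in runtime relative to the baseline."""
    return (runtime_seconds / baseline_seconds - 1.0) * 100.0


def drop_probability_pct(total_missing_pct: float,
                         already_pct: float = ALREADY_MISSING_PCT) -> float:
    """Probability (in percent) of dropping a genotype to reach the target total.

    Assumption: simple subtraction, which is what the table's values suggest.
    """
    return total_missing_pct - already_pct


if __name__ == "__main__":
    print(f"{slowdown_pct(3.274):.2f}%")          # 5% total missing row
    print(f"{drop_probability_pct(5.0):.2f}%")    # 5% total missing row
```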
Figure 1. Sample inheritance vector output from Hapi imported into a spreadsheet. Output from Hapi showing the inherited homologs on chromosome 1 for a family with 11 children from the Huntington's Disease Venezuela Collaborative Study [32]. Hapi produces CSV format output, which we imported into a spreadsheet. To color the cells, we used conditional formatting based on the homolog value transmitted. The output of inheritance vector values uses letters A and B. Lower-case letters indicate that the transmitting parent is homozygous and that the presence of recombination is unknown. Each column is labeled with the child's numerical id with either a 'P' or an 'M' preceding it to indicate either paternal or maternal-derived homologs. The leftmost column gives the SNP rs numbers, and the rightmost column lists the number of recombinations across all children at the given locus.
Figure 2. Example graph of states across several loci. A pictorial representation showing the relationship between states at different loci. Each row of boxes corresponds to a locus; each box represents a state and indicates the number of recombinations the state incurs; arrows point to previous state(s). Once the system deduces a single state at some locus - shown here as the bottom box - it back-traces by traversing the pointers and assigns the haplotype values from the states it encounters. The numbers are not from real data.
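The back trace described in the caption can be sketched as a walk over previous-state pointers. This is an illustration only, not Hapi's implementation: the `State` class, its fields, and the placeholder haplotype strings are hypothetical, and each state keeps a single pointer for simplicity even though the figure allows several.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class State:
    locus: int
    recombinations: int              # total recombinations the state incurs
    haplotype: str                   # placeholder for the haplotype assignment
    prev: Optional["State"] = None   # previous state; None at the first locus


def back_trace(final_state: State) -> List[State]:
    """Follow prev pointers from the final state back to the first locus."""
    path = []
    state: Optional[State] = final_state
    while state is not None:
        path.append(state)
        state = state.prev
    path.reverse()                   # report states in locus order
    return path


# Made-up example: two candidate states at locus 1, one surviving at locus 2.
s0 = State(0, 0, "h0")
s1a = State(1, 1, "h1a", prev=s0)
s1b = State(1, 2, "h1b", prev=s0)
s2 = State(2, 2, "h2", prev=s1b)

if __name__ == "__main__":
    print([s.haplotype for s in back_trace(s2)])
```

Only the states reachable from the single deduced state contribute haplotype values; `s1a` is never visited.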
Two states at a fully informative for one parent locus built from the previous state
| | | Parents | | Children | | | | | # Rec |
|---|---|---|---|---|---|---|---|---|---|
| Prev | iv | | | 〈0, 1〉 | 〈1, 1〉 | 〈1, 1〉 | 〈0, 0〉 | 〈1, 1〉 | 0 |
| State a | hap | 〈a, g〉 | 〈a, a〉 | 〈g, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | |
| | iv | | | 〈**1**, 1〉 | 〈**0**, 1〉 | 〈**0**, 1〉 | 〈0, 0〉 | 〈**0**, 1〉 | 4 |
| State b | hap | 〈g, a〉 | 〈a, a〉 | 〈g, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | |
| | iv | | | 〈0, 1〉 | 〈1, 1〉 | 〈1, 1〉 | 〈**1**, 0〉 | 〈1, 1〉 | 1 |
An example locus with one heterozygous and one homozygous parent that shows one state at the previous locus and the two states Hapi builds based on this previous state. This example is from the real dataset discussed in Results. Each state has a row of inheritance vectors (iv) and a row of haplotype assignments of the alleles (hap). Hapi copies the inheritance vector values corresponding to the homozygous parent from the previous state to states a and b. Recombinations result from inheritance vector values that differ from the previous state; these differences appear in bold, and each state's total number of recombinations appears in the rightmost column. Note that the heterozygous parent's inheritance vector values in the two states are exactly opposite each other and are therefore equivalently labeled.
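The construction described in the caption can be sketched in a few lines. This is a hedged illustration rather than Hapi's code: the function `build_state` is hypothetical, and it assumes that the first element of each inheritance vector value belongs to the heterozygous parent, with 0 and 1 indexing that parent's first and second haplotype allele.

```python
from typing import List, Tuple

IV = Tuple[int, int]  # (heterozygous parent's bit, homozygous parent's bit)


def build_state(prev: List[IV],
                het_phase: Tuple[str, str],
                child_alleles: List[str]) -> Tuple[List[IV], int]:
    """Build one state at a locus that is fully informative for one parent.

    child_alleles[i] is the allele child i received from the heterozygous
    parent; het_phase is that parent's phased pair of alleles.
    """
    state: List[IV] = []
    recombinations = 0
    for (prev_het, prev_hom), allele in zip(prev, child_alleles):
        het_bit = het_phase.index(allele)       # homolog carrying the allele
        recombinations += int(het_bit != prev_het)
        state.append((het_bit, prev_hom))       # homozygous bit is copied
    return state, recombinations


# Values read from the table: previous state and alleles from the a/g parent.
prev_state = [(0, 1), (1, 1), (1, 1), (0, 0), (1, 1)]
alleles = ["g", "a", "a", "a", "a"]

state_a, rec_a = build_state(prev_state, ("a", "g"), alleles)
state_b, rec_b = build_state(prev_state, ("g", "a"), alleles)

if __name__ == "__main__":
    print(rec_a, rec_b)
```

Under these assumptions the two phases of the heterozygous parent yield 4 and 1 recombinations, matching the table's # Rec column, and the heterozygous parent's bits in the two states are exact complements of each other.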
Four types of loci Hapi distinguishes
| Locus type | Parent p | Parent q | Number of states if s states at previous locus | Average number of states |
|---|---|---|---|---|
| Fully informative for both parents | | | 1 | N/A |
| Fully informative for one parent | | | After informative for parent | 1.87 |
| Partly informative | | | ≤4 | 6.31 |
| Uninformative | | | 0 | N/A |
The four types of loci our algorithm handles separately with the names we use to refer to them. The table lists the number of states that Hapi produces for each type if there are s states at the previous locus, and gives the average number of states produced for haplotyping the dataset we evaluate in Results. Note that either parent may have the genotypes listed for parents p and q.
States at a fully informative for one parent locus built from a state with ambiguous values
| | | Parents | | Children | | | | | # Rec |
|---|---|---|---|---|---|---|---|---|---|
| Prev | iv | | | 〈0, 1〉 | 〈1, 1〉? | 〈1, 1〉 | 〈0, 0〉? | 〈1, 1〉 | 0 |
| State a | hap | 〈a, g〉 | 〈a, a〉 | 〈g, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | 4 |
| | iv | | | 〈 | 〈 | 〈 | 〈0, | 〈 | |
| State b | hap | 〈g, a〉 | 〈a, a〉 | 〈g, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | 〈a, a〉 | 1 |
| | iv | | | 〈0, 1〉 | 〈1, | 〈1, 1〉 | 〈 | 〈1, 1〉 | |
An example, modified from Table 3 and not from real data, showing a state with ambiguous inheritance values (marked by ?) at the previous locus, and the two states Hapi builds based on it. For unambiguous children's inheritance vector values, the system copies the bits corresponding to the homozygous parent from the previous state. For ambiguous children, two opposite inheritance values are valid for the previous state, and the system uses the homozygous parent bit from the inheritance value that matches the heterozygous parent's bit in the state being built. Both of the two inheritance values are necessarily represented, one in each of the resulting states. As the underlined values show, the inheritance values for the homozygous parent differ across the two outputs. As such, the states are not equivalent, and Hapi cannot eliminate either. Bold values indicate recombinations.
Example haplotype inference across a series of loci from real data
| 8 | 〈a, a〉 | 〈a, c〉 | 〈a, a〉 | 〈a, c〉 | 〈a, c〉 | 〈a, c〉 | 〈a, a〉 | 0 | |||||||||||
| 〈-, 0〉 | 〈-, 1〉 | 〈-, 1〉 | 〈-, 1〉 | 〈-, 0〉 | |||||||||||||||
| 12 | 〈g, t〉 | 〈t, t〉 | 〈t, t〉 | 〈t, t〉 | 〈g, t〉 | 〈t, t〉 | 〈t, t〉 | 0 | |||||||||||
| 〈1, 0〉 | 〈1, 1〉 | 〈0, 1〉 | 〈1, 1〉 | 〈1, 0〉 | |||||||||||||||
| 14 | 〈c, a〉 | 〈c, a〉 | 〈a, c〉 | 〈a, a〉 | 〈c, c〉 | 〈a, c〉 | 〈a, c〉 | 2 | 14 | 〈c, a〉 | 〈a, c〉 | 〈a, c〉 | 〈a, a〉 | 〈c, c〉 | 〈a, c〉 | 〈c, a〉 | 3 | ||
| 〈1, 0〉 | 〈1, 1〉 | 〈0, | 〈1, | 〈1, 0〉 | 〈1, | 〈1, | 〈0, 1〉 | 〈1, 1〉 | 〈 | ||||||||||
| 16 | 〈a, a〉 | 〈g, a〉 | 〈a, g〉 | 〈a, a〉 | 〈a, g〉 | 〈a, g〉 | 〈a, a〉 | 3 | 16 | 〈a, a〉 | 〈a, g〉 | 〈a, g〉 | 〈a, a〉 | 〈a, g〉 | 〈a, g〉 | 〈a, a〉 | 3 | ||
| 〈1, 0〉 | 〈1, 1〉 | 〈0, 0〉 | 〈1, 0〉 | 〈1, | 〈1, 1〉 | 〈1, 0〉 | 〈0, 1〉 | 〈1, 1〉 | 〈0, 0〉 | ||||||||||
| 17 | 〈t, c〉 | 〈c, c〉 | 〈c, c〉 | 〈c, c〉 | 〈t, c〉 | 〈c, c〉 | 〈t, c〉 | 4 | 17 | 〈t, c〉 | 〈c, c〉 | 〈c, c〉 | 〈c, c〉 | 〈t, c〉 | 〈c, c〉 | 〈t, c〉 | 3 | ||
| 〈1, 0〉 | 〈1, 1〉 | 〈0, 0〉 | 〈1, 0〉 | 〈 | 〈1, 1〉 | 〈1, 0〉 | 〈0, 1〉 | 〈1, 1〉 | 〈0, 0〉 | ||||||||||
An example from the real dataset described in Results. The loci are from chromosome 3, and we number them sequentially in the order they occur physically. For simplicity and conciseness, we omit uninformative loci and one non-recombinant fully informative locus between loci 8 and 12. Bold inheritance vector values indicate recombinations. Each state lists its total number of recombinations. Note that the state at locus 14 with minimum recombinations is ultimately not minimum-recombinant globally. See the Example subsection for a detailed description of this table.
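The closing note about locus 14 can be checked directly from the recombination totals in the table. The sketch below uses only those totals; the labels "first" and "second" for the two states listed side by side, and the function `best_state`, are illustrative names.

```python
# Total recombinations per state at loci 14, 16, and 17, read from the table.
TOTALS = {
    "first": {14: 2, 16: 3, 17: 4},
    "second": {14: 3, 16: 3, 17: 3},
}


def best_state(totals: dict, locus: int) -> str:
    """Return the label of the state with the fewest recombinations at a locus."""
    return min(totals, key=lambda label: totals[label][locus])


if __name__ == "__main__":
    print(best_state(TOTALS, 14), best_state(TOTALS, 17))
```

Choosing greedily at locus 14 would keep the first state, yet by locus 17 the second state has fewer recombinations, which is why both states have to be carried forward until one can be eliminated.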