Olivier Delaneau, Cédric Coulonges, Jean-François Zagury.
Abstract
BACKGROUND: We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al. in Phase v2.1. It runs much faster than Phase v2.1 while exhibiting the same accuracy. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations (1) speed up the computation of the posterior probabilities of the haplotypes by avoiding the redundant operations made in Phase v2.1, and (2) overcome the exponential aspect of the haplotype inference problem by the smart exploration of the most plausible pathways (i.e., haplotypes) in the binary trees.
Year: 2008 PMID: 19087329 PMCID: PMC2647951 DOI: 10.1186/1471-2105-9-540
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 2. Representation of the execution trellis of the hidden Markov model used to compute the probability of a haplotype. The haplotypes h_1, ..., h_{2n-2} denote the previously sampled haplotypes which are used to compute the probability of the observed haplotype h. The sets {o_1, ..., o_s} and {q_1(k), ..., q_s(k)} correspond respectively to the observed state sequence of haplotype h and to the hidden state sequence of haplotype h_k. The transition probability a_j(k,l) corresponds to the probability of jumping from hidden state q_j(k) of haplotype h_k to hidden state q_{j+1}(l) of haplotype h_l, and the emission probability b_j(k) corresponds to the probability of observing o_j given the hidden state q_j(k). To compute the probability of observing the sequence h = {o_1, ..., o_s} in this HMM, one must sum the probabilities of observing h over all (2n - 2)^s possible sequences of s hidden states, which is done efficiently by the forward algorithm.
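The Figure 2 caption states that the sum over all hidden state sequences is computed efficiently by the forward algorithm. The following is a minimal sketch of that idea on a generic HMM, not the authors' implementation: the function names, the argument layout (`init`, `trans`, `emit`) and the toy numbers are all assumptions made for illustration. A brute-force enumeration is included only to check that both approaches agree on a small case.

```python
from itertools import product


def forward_probability(init, trans, emit):
    """Probability of the observed sequence, summed over all hidden paths.

    init[k]        : probability of starting in hidden state k (assumed input)
    trans[j][k][l] : probability of moving from state k at position j
                     to state l at position j + 1
    emit[j][k]     : probability of the symbol observed at position j
                     given hidden state k
    """
    n_states = len(init)
    n_positions = len(emit)
    # alpha[k] = probability of the observations so far, ending in state k
    alpha = [init[k] * emit[0][k] for k in range(n_states)]
    for j in range(1, n_positions):
        alpha = [
            emit[j][l] * sum(alpha[k] * trans[j - 1][k][l] for k in range(n_states))
            for l in range(n_states)
        ]
    return sum(alpha)


def brute_force_probability(init, trans, emit):
    """Same quantity, obtained by enumerating every hidden state sequence."""
    n_states = len(init)
    n_positions = len(emit)
    total = 0.0
    for path in product(range(n_states), repeat=n_positions):
        p = init[path[0]] * emit[0][path[0]]
        for j in range(1, n_positions):
            p *= trans[j - 1][path[j - 1]][path[j]] * emit[j][path[j]]
        total += p
    return total


# Toy model (hypothetical numbers): 2 hidden states, 3 positions.
T = [[0.9, 0.1], [0.2, 0.8]]
init = [0.5, 0.5]
trans = [T, T]
emit = [[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]]
print(forward_probability(init, trans, emit))
print(brute_force_probability(init, trans, emit))
```

With K hidden states and s positions, the enumeration visits K^s paths while the forward recursion needs on the order of s × K^2 operations, which is why the caption describes it as efficient.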
Figure 1. Schematic representation of a sample of . In this example, the space of possible haplotypes S_i for individual i contains 4 haplotype pairs with 8 distinct haplotypes. The possible phases between heterozygous markers are shown in bold.
Figure 3. Different representations of the space of possible haplotype pairs S_i. The left panel (A) shows the list representation commonly used by haplotyping software such as Phase v2.1. The lower right panel (C) shows the representation used by Shape-IT. White and black circles indicate the phases between the heterozygous SNPs. This example uses the same genotype G_i described in Figure 1. With the list representation iterated over by Phase v2.1 (A), 20 nodes must be explored (4 haplotype pairs × 5 SNPs). With the complete tree representation (B), 10 nodes need to be explored, and with the incomplete tree representation used by Shape-IT (C), only 7 nodes need to be explored. The difference between (B) and (C) results from the pruning strategy, which avoids the exploration of nodes with probability ≤ 0.01.
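The node counts quoted in the Figure 3 caption (20, 10 and 7) can be reproduced with a small counting sketch. This is an illustration under stated assumptions, not the Shape-IT code: the genotype below is hypothetical (1 marks a heterozygous SNP, 0 a homozygous one) and was chosen only because it has 5 SNPs and 4 haplotype pairs; the actual genotype of Figure 1 is not given in this record. The `node_prob` function and its probabilities are likewise invented to show the effect of a 0.01 threshold.

```python
def count_list_nodes(genotype):
    """Nodes visited when every haplotype pair is stored as its own list."""
    n_het = sum(1 for g in genotype if g == 1)
    n_pairs = 2 ** max(n_het - 1, 0)  # first heterozygous SNP fixes the orientation
    return n_pairs * len(genotype)


def count_tree_nodes(genotype, node_prob=None, threshold=0.01):
    """Nodes visited when pairs sharing a prefix share nodes of a binary tree.

    node_prob(path) is a hypothetical callback returning the probability of a
    partial phasing; nodes at or below the threshold are counted as explored
    but their children are not visited. With node_prob=None nothing is pruned.
    """
    frontier = [()]          # partial phase choices made so far
    explored = 0
    seen_first_het = False
    for g in genotype:
        if g == 1 and seen_first_het:
            # A further heterozygous SNP doubles each surviving branch.
            level = [path + (phase,) for path in frontier for phase in (0, 1)]
        else:
            level = list(frontier)
        if g == 1:
            seen_first_het = True
        explored += len(level)
        frontier = [p for p in level if node_prob is None or node_prob(p) > threshold]
    return explored


# Hypothetical genotype: heterozygous at SNPs 1, 3 and 5.
genotype = [1, 0, 1, 0, 1]

# Hypothetical probabilities: the second phase at the third SNP is implausible.
def toy_prob(path):
    return 0.005 if path[:1] == (1,) else 0.5

print(count_list_nodes(genotype))            # 20
print(count_tree_nodes(genotype))            # 10
print(count_tree_nodes(genotype, toy_prob))  # 7
```

The saving of the tree over the list comes from shared prefixes; the further saving from pruning depends entirely on how many branches fall under the threshold.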
Figure 4. Algorithm 1 to compute the FDSL distribution on the complete haplotype tree.
Figure 5. Algorithm 2 to compute the FDSL distribution on the incomplete haplotype tree.
HapMap trio datasets description
| Dataset | Chromosomes | # datasets | # SNPs | # genotypes | Description |
| CEU Size | 1 to 5 | 250 | 10 to 160 | 60 | 50 datasets each of 10, 20, 40, 80 and 160 adjacent SNPs with MAF above 5% |
| CEU Density | 1 to 5 | 300 | 40 | 60 | 50 datasets each with spanned distance between SNPs above 0, 0.5, 1, 2, 4 and 8 kb (MAF 5%) |
| CEU MAF | 1 to 5 | 150 | 40 | 60 | 50 datasets each with MAF above 1%, 5% and 10% |
| YRI Size | 1 to 5 | 250 | 10 to 160 | 60 | 50 datasets each of 10, 20, 40, 80 and 160 adjacent SNPs with MAF above 5% |
| YRI Density | 1 to 5 | 300 | 40 | 60 | 50 datasets each with spanned distance between SNPs above 0, 0.5, 1, 2, 4 and 8 kb (MAF 5%) |
| YRI MAF | 1 to 5 | 150 | 40 | 60 | 50 datasets each with MAF above 1%, 5% and 10% |
| CEU Illumina 50 | 12 | 300 | 50 | 60 | 15,000 Illumina SNPs grouped into datasets of 50 SNPs |
| CEU Illumina 100 | 12 | 150 | 100 | 60 | 15,000 Illumina SNPs grouped into datasets of 100 SNPs |
| CEU Illumina 200 | 12 | 75 | 200 | 60 | 15,000 Illumina SNPs grouped into datasets of 200 SNPs |
| GRIV | 1 | 90 | 50 to 200 | 100 to 300 | 3,500 Illumina SNPs grouped into datasets of 50, 100 and 200 SNPs |
Description of the benchmarks derived from the HapMap trios datasets that we used to compare accuracy and runtimes of the various algorithms in Table 4. For each parameter (size, density, and MAF) 10 samples were chosen in each of the chromosomes 1 to 5, i.e. a total of 50 tests per parameter.
HapMap trio datasets results
| CEU Size | SER (%) | 1.5 | 2.2 | 2.3 | 2.0 |
| | Time (s) | 53 | 832 | 113 | 93 | < 1 | 50 | 10 |
| YRI Size | SER (%) | 2.3 | 1.8 | 4.5 | 3.9 | 4.2 |
| | Time (s) | 64 | 1,209 | 125 | 138 | < 1 | 131 | 10 |
| CEU Density | SER (%) | 2.7 | 2.4 | 4.2 | 4.0 | 4.1 |
| | Time (s) | 26 | 214 | 64 | 43 | < 1 | 5 | 6 |
| YRI Density | SER (%) | 4.9 | 3.9 | 8.5 | 7.5 | 8.8 |
| | Time (s) | 35 | 490 | 71 | 80 | < 1 | 9 | 5 |
| CEU MAF | SER (%) | 1.2 | 1.2 | 2.0 | 2.1 | 1.7 |
| | Time (s) | 19 | 104 | 71 | 22 | < 1 | 2 | 4 |
| YRI MAF | SER (%) | 2.0 | 4.5 | 3.8 | 3.2 |
| | Time (s) | 26 | 173 | 80 | 38 | < 1 | 4 | 4 |
| CEU 50 Illumina SNPs | SER (%) | 7.2 | 6.6 | 10.7 | 9.2 | 12.2 |
| | Time (s) | 51 | 1,214 | 60 | 161 | < 1 | 22 | 5 |
| CEU 100 Illumina SNPs | SER (%) | 6.8 | 7.7 | 9.2 | 11.3 | 9.7 | N/A |
| | Time (s) | 143 | 11,678 | 144 | 461 | < 1 | 254 | N/A |
| CEU 200 Illumina SNPs | SER (%) | N/A | 8.0 | N/A | 11.5 | 9.9 | N/A |
| | Time (s) | 372 | N/A | 198 | N/A | < 1 | 2,038 | N/A |
N/A: the software was unable to handle some of these datasets (errors or intractable running times). Results of the tested software on the HapMap trio datasets described in Table 1. For each software tested, the mean percentage of heterozygous markers incorrectly inferred (SER) is shown in the upper-left corner, and the mean running time in seconds is shown in the lower-right corner.
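The caption above reports SER without spelling out how it is computed. As a rough sketch, and assuming SER denotes the commonly used switch error rate (the share of consecutive heterozygous markers whose relative phase is wrong), it could be computed as below. The function name and the input encoding (haplotypes as lists of 0/1 alleles) are assumptions for illustration, and the inferred pair is assumed to be consistent with the true genotype.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Percentage of switches between consecutive heterozygous markers.

    true_pair, inferred_pair: two haplotypes each, given as equal-length
    sequences of 0/1 alleles (hypothetical encoding).
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    # Only heterozygous markers carry phase information.
    het = [j for j in range(len(t1)) if t1[j] != t2[j]]
    if len(het) < 2:
        return 0.0
    # For each heterozygous marker, does the first inferred haplotype
    # carry the allele of the first true haplotype?
    match = [i1[j] == t1[j] for j in het]
    # A switch occurs whenever that agreement flips between neighbours.
    switches = sum(1 for a, b in zip(match, match[1:]) if a != b)
    return 100.0 * switches / (len(het) - 1)


truth = ([0, 0, 1, 0, 1], [1, 0, 0, 0, 0])
guess = ([0, 0, 1, 0, 0], [1, 0, 0, 0, 1])
print(switch_error_rate(truth, guess))  # 50.0: one switch out of two intervals
```

Because only changes in agreement are counted, swapping the order of the two inferred haplotypes leaves the result unchanged.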
Comparison of the estimated running times of various software on 300 K Illumina genotyping chip datasets.
| 50 | 100 | 10 | 29 | 10 | 151 |
| 100 | 100 | 6 | 37 | 12 | 519 |
| 200 | 100 | 6 | 41 | 19 | 3,137 |
| 50 | 200 | 21 | 34 | 13 | 443 |
| 100 | 200 | 21 | 119 | 29 | 2,739 |
| 200 | 200 | 21 | 124 | 37 | 7,601 |
| 50 | 300 | 37 | 113 | 28 | 1,372 |
| 100 | 300 | 41 | 268 | 52 | 6,514 |
| 200 | 300 | 42 | 261 | 81 | 12,757 |
Estimated running times, in days, of the 4 most accurate software (Phase v2.1, Ishape, Fastphase and Shape-IT) to infer the haplotypes for 100, 200, or 300 genotypes derived from Illumina 300 K chips partitioned into segments of 50, 100, or 200 SNPs. For each combination of number of SNPs and number of genotypes, the running time estimates were extrapolated from measurements performed on 10 datasets extracted from the GRIV cohort 300 K Illumina chip genomic data.
Results obtained by various haplotyping software on the experimentally determined ApoE dataset.
| 2snp | 20.0 | 83.8 | 22.7 | 7.3 | 83.9 |
| Fastphase | 11.3 | 89.4 | 17.4 | 6.1 | 87.5 |
| Gerbil | 20.0 | 81.3 | 20.3 | 6.6 | 84.6 |
| Ishape | 5.9 | ||||
| Shape-IT | 10.5 | 6.2 | 92.4 | ||
| Phase v2.1 | 5.8 | 94.0 | 92.4 | ||
| PLEM | 12.5 | 89.8 | 16.0 | 6.5 | 88.7 |
For the various software tested, we measured the percentage of individuals incorrectly reconstructed (IER), the percentage of missing data incorrectly inferred (MER), and the distance between real and inferred haplotype frequencies (IF) on the ApoE dataset, both with complete genotypes and with 5% of the genotypes randomly set to missing.
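The two phase-related measures named in this caption can be sketched as follows. IER follows directly from the caption's wording. For IF the caption gives no formula, so the code assumes the similarity index 100 × (1 − ½ Σ |p_true − p_inferred|) over all haplotypes, which is consistent with the high-is-better values in the table but remains an assumption. Function names and the string encoding of haplotypes are hypothetical.

```python
from collections import Counter


def individual_error_rate(true_pairs, inferred_pairs):
    """IER: percentage of individuals whose haplotype pair is not recovered."""
    wrong = sum(
        1 for t, i in zip(true_pairs, inferred_pairs) if sorted(t) != sorted(i)
    )
    return 100.0 * wrong / len(true_pairs)


def haplotype_frequencies(pairs):
    """Relative frequency of each haplotype across all individuals."""
    counts = Counter(h for pair in pairs for h in pair)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


def frequency_similarity(true_pairs, inferred_pairs):
    """IF under the assumed formula: 100 * (1 - 0.5 * sum |p_true - p_inf|)."""
    p_true = haplotype_frequencies(true_pairs)
    p_inf = haplotype_frequencies(inferred_pairs)
    haplotypes = set(p_true) | set(p_inf)
    distance = sum(
        abs(p_true.get(h, 0.0) - p_inf.get(h, 0.0)) for h in haplotypes
    )
    return 100.0 * (1.0 - 0.5 * distance)


# Three hypothetical individuals; the second one is phased incorrectly.
truth = [("00", "11"), ("01", "10"), ("00", "00")]
guess = [("11", "00"), ("00", "11"), ("00", "00")]
print(individual_error_rate(truth, guess))
print(frequency_similarity(truth, guess))
```

Sorting each pair before comparison makes the IER check independent of the order in which the two haplotypes of an individual are reported.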
Results obtained by various haplotyping software on the experimentally determined GH1 dataset.
| 2snp | 15.7 | 88.2 | 22.0 | 7.5 | 88.3 |
| Fastphase | 10.5 | 92.5 | 17.3 | 4.5 | 90.7 |
| Gerbil | 11.8 | 92.8 | 16.7 | 91.6 | |
| Ishape | 15.0 | 4.5 | |||
| Shape-IT | 10.3 | 93.6 | 4.5 | 92.5 | |
| Phase v2.1 | 10.3 | 93.7 | 15.2 | 4.5 | 92.5 |
| PLEM | 12.4 | 90.3 | 17.2 | 4.8 | 89.4 |
For the various software tested, we measured the percentage of individuals incorrectly reconstructed (IER), the percentage of missing data incorrectly inferred (MER), and the distance between real and inferred haplotype frequencies (IF) on the GH1 dataset, both with complete genotypes and with 5% of the genotypes randomly set to missing.