Olivier Delaneau, Cédric Coulonges, Pierre-Yves Boelle, George Nelson, Jean-Louis Spadoni, Jean-François Zagury.
Abstract
BACKGROUND: We have developed a new haplotyping program that combines an iterative multiallelic EM algorithm (IEM), bootstrap resampling, and a pseudo Gibbs sampler. The IEM-bootstrap procedure considerably reduces the space of possible haplotype configurations to be explored, which greatly reduces computation time, while adapting the Gibbs sampler with a recombination model to this restricted space maintains high accuracy. On large SNP datasets (>30 SNPs), we used a segmented approach based on a specific partition-ligation strategy. We compared this software, Ishape (Iterative Segmented HAPlotyping by Em), with reference programs such as Phase, Fastphase, and PL-EM. As with Phase, there are two versions of Ishape: Ishape1, which uses a simple coalescence model for the pseudo Gibbs sampler step, and Ishape2, which uses a recombination model instead.
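The idea of using EM on bootstrap resamples to restrict the haplotype space can be illustrated with a minimal sketch. It is not the Ishape implementation: it uses a plain biallelic EM under Hardy-Weinberg equilibrium rather than the multiallelic IEM, and the function names, the 0/1/2 genotype coding, the number of resamples, and the frequency threshold are all assumptions made for illustration.

```python
import itertools
import random
from collections import Counter

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype coded 0/1/2 per SNP."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites; het sites are set below
    pairs = set()
    for bits in itertools.product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies with a basic EM (illustrative, not the IEM)."""
    pairs_per_geno = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pairs_per_geno for p in pairs for h in p})
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = Counter()
        for pairs in pairs_per_geno:
            # E-step: weight each compatible pair by its Hardy-Weinberg probability
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by the number of chromosomes
        n = 2 * len(genotypes)
        freq = {h: counts[h] / n for h in haps}
    return freq

def bootstrap_candidates(genotypes, n_boot=20, threshold=1e-3, seed=0):
    """Union of haplotypes with non-negligible frequency over bootstrap resamples.

    n_boot and threshold are hypothetical tuning values, not taken from the paper.
    """
    rng = random.Random(seed)
    kept = set()
    for _ in range(n_boot):
        sample = [rng.choice(genotypes) for _ in genotypes]
        freq = em_frequencies(sample)
        kept.update(h for h, f in freq.items() if f > threshold)
    return kept

# Toy data: two SNPs, five individuals, one double heterozygote
genotypes = [(0, 0), (2, 2), (1, 1), (0, 0), (2, 2)]
freq = em_frequencies(genotypes)
candidates = bootstrap_candidates(genotypes)
```

On this toy dataset the EM resolves the double heterozygote in favour of the two haplotypes already seen in homozygous individuals, so a later sampling step would only need to explore the much smaller candidate set.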
Year: 2007 PMID: 17573965 PMCID: PMC1919397 DOI: 10.1186/1471-2105-8-205
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A schematic representation of the algorithm: (I) Partition of the SNPs into segments using the multiallelic IEM, with a new segment created at each orphan haplotype (see text). (II) IEM-bootstrap-GS algorithm to obtain reliable haplotypes for each segment. (III) Ligation of the haplotyped segments with the same multiallelic IEM-bootstrap-GS to obtain reliable results on the whole genotype dataset.
Figure 2. Impact of the number of bootstrap samples on the size and relevance of the candidate haplotype space for the APOE gene. The black line and left scale show the ICR (capture rate of true haplotype configurations). The dashed line and right scale show the ANCR (average number of candidates per genotype).
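The two quantities plotted in Figure 2 can be computed as in the sketch below. It assumes that ICR is the fraction of genotypes whose true haplotype pair appears among the retained candidates and that ANCR is the mean candidate count per genotype; the function names and the toy data are hypothetical.

```python
def pair(h1, h2):
    """Order-insensitive representation of a haplotype pair."""
    return tuple(sorted((h1, h2)))

def icr_ancr(candidates, truth):
    """Return (ICR, ANCR).

    candidates: one set of candidate haplotype pairs per genotype
    truth:      the true haplotype pair of each genotype, in the same order
    """
    captured = sum(1 for c, t in zip(candidates, truth) if t in c)
    icr = captured / len(truth)
    ancr = sum(len(c) for c in candidates) / len(candidates)
    return icr, ancr

# Toy example with three genotypes over two SNPs (haplotypes written as strings)
candidates = [
    {pair("00", "00")},
    {pair("00", "11"), pair("01", "10")},
    {pair("01", "10")},
]
truth = [pair("00", "00"), pair("11", "00"), pair("00", "11")]
icr, ancr = icr_ancr(candidates, truth)
```

A good candidate space keeps the first value close to 1 while keeping the second as small as possible.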
Capture rate and number of haplotypes detected by the various algorithms
| 1A : | |||||||||
| 0% MD | 2% MD | 5% MD | 10% MD | ||||||
| Ishape | APOE | 32.7 | 43.6 | 53.4 | 70.4 | ||||
| Phase2.1 | 29.0 | 32.5 | 0.99 | 35.4 | 40.1 | ||||
| Phase1.0 | 0.99 | 29.0 | 0.99 | 27.8 | 0.99 | 29.6 | 0.98 | 32.5 | |
| FastPhase | 0.89 | 29.0 | 0.86 | 27.2 | 0.82 | 27.2 | 0.77 | 27.7 | |
| PL-EM | 0.89 | 20.0 | 0.90 | 21.1 | 0.89 | 21.3 | 0.88 | 21.1 | |
| Ishape | GH1 | 101.5 | 148.2 | 229.6 | 365 | ||||
| Phase2.1 | 0.98 | 71.0 | 0.97 | 70.5 | 0.97 | 74.3 | 0.96 | 83.2 | |
| Phase1.0 | 0.97 | 46.0 | 0.96 | 50.2 | 0.95 | 54.1 | 0.93 | 59.7 | |
| FastPhase | 0.88 | 55.0 | 0.87 | 52.4 | 0.82 | 53.1 | 0.76 | 54.2 | |
| PL-EM | 0.91 | 41.0 | 0.90 | 42.4 | 0.89 | 43.4 | 0.86 | 42.5 | |
| 1B : | |||||||||
| APOE (9 SNPs) | GH1 (14 SNPs) | ||||||||
| 0% | 2% | 5% | 10% | 0% | 2% | 5% | 10% | ||
| Average number of | 3.25 | 4.48 | 8.21 | 22.89 | 9.62 | 18.69 | 48.72 | 244.13 | |
| Average number of | 1.59 (1.0) | 2.3 (1.0) | 3.0 (1.0) | 4.8 (0.99) | 2.3 (0.99) | 3.31 (0.99) | 5.4 (0.98) | 10.2 (0.97) | |
A) Comparison of the algorithms' performance on the APOE and GH1 datasets regarding the average ICR (left column) and the average number of detected haplotypes (right column). For each level of missing data (0%, 2%, 5%, 10%), 100 experiments were performed. Best performances are highlighted in bold.
B) Estimation of the reduction of the space of possible haplotypes achieved by the Bootstrap-IEM for the GH1 and APOE genes. The values are averages over 100 experiments. In the lower line, the values in brackets give the corresponding average ICR.
Performance of the various algorithms on the GH1 and APOE datasets
| Soft | MD | IF | IER | Time (sec.) | MD | IF | IER | Time (sec.) |
| Ishape1 | 0% | 0.927 +/- 0.001 | 0.119 +/- 0.001 | 0.9 | 5% | 0.915 +/- 0.002 | 0.164 +/- 0.004 | 1.7 |
| Ishape2 | 9.2 | 11.5 | ||||||
| Phase2.1 | 62.9 | 0.924 +/- 0.002 | 0.148 +/- 0.004 | 71.4 | ||||
| Phase1.0 | 0.926 +/- 0.002 | 0.119 +/- 0.002 | 15.6 | 0.915 +/- 0.003 | 0.164 +/- 0.005 | 26.0 | ||
| FastPhase | 0.928 +/- 0.001 | 0.105 +/- 0.001 | 139.1 | 0.920 +/- 0.002 | 0.170 +/- 0.004 | 138.9 | ||
| PL-EM | 0.915 +/- 0.001 | 0.116 +/- 0.000 | 0.3 | 0.890 +/- 0.003 | 0.171 +/- 0.003 | 3.2 | ||
| 2snp | NA | 0.157 +/- 0.000 | NA | 0.214 +/- 0.002 | ||||
| Ishape1 | 2% | 0.922 +/- 0.001 | 0.137 +/- 0.003 | 1.2 | 10% | 0.905 +/- 0.002 | 0.208 +/- 0.004 | 2.8 |
| Ishape2 | 10.6 | 14.6 | ||||||
| Phase2.1 | 0.931 +/- 0.001 | 0.122 +/- 0.003 | 64.6 | 0.914 +/- 0.002 | 0.196 +/- 0.005 | 82.5 | ||
| Phase1.0 | 0.921 +/- 0.002 | 0.138 +/- 0.003 | 20.7 | 0.903 +/- 0.003 | 0.211 +/- 0.005 | 33.9 | ||
| fastPhase | 0.924 +/- 0.001 | 0.134 +/- 0.004 | 147.5 | 0.907 +/- 0.002 | 0.241 +/- 0.006 | 134.6 | ||
| PL-EM | 0.913 +/- 0.003 | 0.140 +/- 0.003 | 1.0 | 0.854 +/- 0.004 | 0.225 +/- 0.005 | 12.6 | ||
| 2snp | NA | 0.176 +/- 0.002 | NA | 0.283 +/- 0.004 | ||||
| Soft | MD | IF | IER | Time (sec.) | MD | IF | IER | Time (sec.) |
| Ishape1 | 0% | 0.946 +/- 0.001 | 0.062 +/- 0.001 | 0.2 | 5% | 0.109 +/- 0.005 | 0.4 | |
| Ishape2 | 0.941 +/- 0.001 | 0.057 +/- 0.001 | 3.5 | 0.926 +/- 0.003 | 4.1 | |||
| Phase2.1 | 0.940 +/- 0.001 | 14.0 | 0.923 +/- 0.003 | 15.8 | ||||
| Phase1.0 | 0.062 +/- 0.000 | 2.7 | 0.108 +/- 0.005 | 3.9 | ||||
| fastPhase | 0.876 +/- 0.001 | 0.118 +/- 0.002 | 49.1 | 0.870 +/- 0.003 | 0.181 +/- 0.005 | 44.2 | ||
| PL-EM | 0.897 +/- 0.000 | 0.125 +/- 0.000 | 0.1 | 0.883 +/- 0.004 | 0.159 +/- 0.005 | 0.4 | ||
| 2snp | NA | 0.200 +/- 0.000 | NA | 0.227 +/- 0.004 | ||||
| Ishape1 | 2% | 0.078 +/- 0.003 | 0.3 | 10% | 0.149 +/- 0.007 | 0.6 | ||
| Ishape2 | 0.935 +/- 0.002 | 3.9 | 0.910 +/- 0.004 | 4.6 | ||||
| Phase2.1 | 0.933 +/- 0.002 | 0.072 +/- 0.003 | 14.8 | 0.907 +/- 0.004 | 0.146 +/- 0.007 | 17.4 | ||
| Phase1.0 | 0.941 +/- 0.002 | 0.078 +/- 0.003 | 3.2 | 0.150 +/- 0.007 | 5.1 | |||
| fastPhase | 0.875 +/- 0.002 | 0.140 +/- 0.003 | 47.0 | 0.864 +/- 0.004 | 0.225 +/- 0.007 | 45.4 | ||
| PL-EM | 0.894 +/- 0.003 | 0.137 +/- 0.003 | 0.2 | 0.854 +/- 0.005 | 0.191 +/- 0.006 | 1.3 | ||
| 2snp | NA | 0.208 +/- 0.002 | NA | 0.259 +/- 0.004 | ||||
Different missing data levels are tested, each with 100 experiments. The mean accuracy (IF and SER) and runtime of the haplotyping algorithms are compared on (A) the GH1 dataset and (B) the APOE dataset. The 95% confidence intervals are also given. Best performances are highlighted in bold. For 2-SNP, the software does not provide haplotype frequency estimation; thus, the IF is not available (NA).
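The "mean +/- half-width" entries in the table can be read as a mean with a 95% confidence interval over repeated experiments. The source does not state how the intervals were computed; the sketch below assumes the usual normal approximation (1.96 standard errors), and the input values are made up for illustration.

```python
import math
import statistics

def mean_ci95(values):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    m = statistics.fmean(values)
    half = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return m, half

# Hypothetical error rates from three repeated experiments
m, half = mean_ci95([0.10, 0.12, 0.14])
print(f"{m:.3f} +/- {half:.3f}")
```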
Accuracy and time comparison of the algorithms on the HapMap data.
| Contiguous SNPs | Spaced by 5 kb | |||||||
| Software | Average | Median | Average | Average | Average | Median | Average | Average |
| FastPhase | 1.31 +/- 0.16 | 0.68 | 2.81 +/- 0.14 | 100.4 | 3.98 +/- 0.30 | 2.99 | 2.79 +/- 0.12 | 88.8 |
| Ishape1 | 1.40 +/- 0.16 | 0.63 | 2.87 +/- 0.15 | 5.0 | 4.88 +/- 0.36 | 3.51 | 3.99 +/- 0.15 | 12.3 |
| Ishape2 | 34.9 | 3.60 +/- 0.29 | 66.1 | |||||
| Phase1.0 | 1.39 +/- 0.16 | 0.68 | 2.80 +/- 0.15 | 52.2 | 4.92 +/- 0.36 | 3.53 | 4.04 +/- 0.15 | 142.5 |
| Phase2.1 | 1.17 +/- 0.14 | 0.58 | 2.21 +/- 0.13 | 215.0 | 2.53 | 2.11 +/- 0.10 | 702.0 | |
| PL-EM | 1.81 +/- 0.22 | 0.85 | 3.87 +/- 0.18 | 6.7 | 5.88 +/- 0.42 | 4.27 | 5.02 +/- 0.16 | 5.8 |
| 2snp | 1.77 +/- 0.15 | 1.20 | 4.31 +/- 0.19 | 4.71 +/- 0.28 | 4.01 | 4.24 +/- 0.17 | ||
Different sizes of SNP datasets are tested under two assumptions for the choice of the SNPs retained: adjacent SNPs and SNPs spaced by 5 kb on average. All the SNPs have a MAF above 1%. For each size (10, 20, 30, 40, 50, 60, and 80 SNPs), one hundred different SNP datasets were tested. The table provides a summary showing the average performance over all the HapMap segments tested. The 3rd column presents the average rank, i.e. the mean of the ranks given to each software according to the SER it obtained in each experiment. The last column gives the average time over the experiments. 95% confidence intervals are provided for the SER and for the ranking (total of 700 experiments). Best performances are highlighted in bold.
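The average rank described in this caption can be sketched as follows: within each experiment the programs are ranked by SER (lowest SER gets rank 1), and the ranks are then averaged over experiments. Giving tied programs the mean of their positions is an assumption, since the source does not say how ties were handled; the program names and SER values below are placeholders.

```python
def ranks_with_ties(values):
    """1-based ranks in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def average_rank(ser_by_experiment):
    """ser_by_experiment: list of dicts mapping program name to its SER."""
    names = sorted(ser_by_experiment[0])
    totals = dict.fromkeys(names, 0.0)
    for exp in ser_by_experiment:
        for name, r in zip(names, ranks_with_ties([exp[n] for n in names])):
            totals[name] += r
    return {n: totals[n] / len(ser_by_experiment) for n in names}

# Two hypothetical experiments with three hypothetical programs
experiments = [
    {"A": 0.01, "B": 0.02, "C": 0.03},
    {"A": 0.02, "B": 0.02, "C": 0.01},
]
avg = average_rank(experiments)
```

Unlike the mean SER, the average rank is insensitive to a few experiments with unusually large error rates, which is why the two summaries can order the programs differently.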
Accuracy and time comparison of the algorithms on four real datasets involving different numbers of genotypes
| Prog | SER | Time (sec.) | SER | Time (sec.) | SER | Time (sec.) | SER | Time (sec.) |
| Ishape1 | 0.0190 +/- 0.002 | 0.8 | 0.055 +/- 0.001 | 0.2 | 0.065 +/- 0.003 | 0.9 | 0.0473 +/- 0.001 | 512 |
| Ishape2 | 0.0184 +/- 0.0006 | 4.96 | 0.050 +/- 0.005 | 3.5 | 9.2 | 5744 | ||
| Phase1.0 | 0.0186 +/- 0.001 | 4.82 | 0.055 +/- 0.001 | 2.7 | 0.065 +/- 0.004 | 15.6 | 0.0657 +/- 0.002 | 21536 |
| Phase2.1 | 0.0175 +/- 0.000 | 23.25 | 14 | 62.9 | 0.0501 +/- 0.001 | 61789 | ||
| fastPhase | 0.0182 +/- 0.001 | 37.61 | 0.103 +/- 0.009 | 49.1 | 0.056 +/- 0.003 | 139.1 | 0.0452 +/- 0.001 | 986 |
| PL-EM | 0.0573 +/- 0.005 | 0.51 | 0.165 +/- 0.000 | 0.11 | 0.060 +/- 0.004 | 0.31 | 0.0601 +/- 0.001 | 6507 |
| 2snp | 0.230 +/- 0.000 | 0.074 +/- 0.000 | 0.0513 +/- 0.000 | |||||
Accuracy and time comparison of the algorithms on the four real datasets: the ACE [29], APOE [36], and GH1 [20,35] genes, and data from Chr. 5q31 [30]. The 95% confidence interval corresponds to 100 runs of each program for ACE, APOE and GH1, and to 10 runs for the Chr. 5q31 data. Best performances are highlighted in bold.