Yun Xu, Wenhua Cheng, Pengyu Nie, Fengfeng Zhou.
Abstract
Haplotype phasing is an essential step in studying the association of genomic polymorphisms with complex genetic diseases and in identifying targets for drug design. In recent years, rapidly evolving high-throughput sequencing technologies have produced huge amounts of genotype data, and this data volume challenges the community to develop more efficient haplotype phasing algorithms in terms of both running time and overall accuracy. 2SNP is one of the fastest haplotype phasing algorithms, with error rates comparably low to those of the other algorithms. The most time-consuming step of 2SNP is the construction of a maximum spanning tree (MST) among all the heterozygous SNP pairs. We simplified this step by replacing the MST with the initial haplotypes of adjacent heterozygous SNP pairs. The multi-SNP haplotypes were then estimated within a sliding window along the chromosomes. Comparative studies on four genotype datasets of different scales suggest that our algorithm, WinHAP, outperforms 2SNP and most of the other haplotype phasing algorithms in terms of both running speed and overall accuracy. To facilitate the application of WinHAP to more practical biological datasets, we released the software for free at http://staff.ustc.edu.cn/~xuyun/winhap/index.htm.
Year: 2012 PMID: 22905221 PMCID: PMC3419172 DOI: 10.1371/journal.pone.0043163
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The uneven distribution of haplotypes in a block.
| Haplotype | 0/1 representation | Frequency |
| --- | --- | --- |
| | 00000 | 66 |
| | 01001 | 24 |
| | 11110 | 10 |
| | 01000 | 6 |
| | 11000 | 1 |
| | 11011 | 1 |
| | 00001 | 1 |
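To make the unevenness concrete, the short sketch below computes each haplotype's share of the block. It assumes the Frequency column holds raw counts (an assumption; the column's unit is not stated here), and the variable names are illustrative only.

```python
# Counts copied from the table above; treating "Frequency" as raw counts
# is an assumption, since the unit is not stated.
counts = {
    "00000": 66,
    "01001": 24,
    "11110": 10,
    "01000": 6,
    "11000": 1,
    "11011": 1,
    "00001": 1,
}

total = sum(counts.values())

# Share of each haplotype in the block.
shares = {hap: n / total for hap, n in counts.items()}

# Combined share of the three most common haplotypes.
top3_share = sum(sorted(counts.values(), reverse=True)[:3]) / total

for hap, share in shares.items():
    print(f"{hap}: {share:.3f}")
print(f"top three haplotypes: {top3_share:.3f}")
```

Under that assumption, the three most common haplotypes account for 100 of the 109 observations, while three haplotypes appear only once each.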
Figure 1. The initial phasing process in the first step.
Figure 2. The recovery process of a missing allele on haplotype h1.
Figure 3. A recombination process with the 4th SNP as the switched site.
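Since the figure itself is not reproduced here, the following minimal sketch illustrates one reading of a recombination at a switched site: the two haplotypes exchange their tails starting at that SNP. Whether the exchange includes the switched site itself is an assumption, and `recombine` is a hypothetical helper, not part of WinHAP.

```python
from typing import Tuple


def recombine(h1: str, h2: str, site: int) -> Tuple[str, str]:
    """Exchange the tails of two haplotypes starting at 1-based SNP `site`.

    Assumption: the switched site itself belongs to the exchanged tail.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have equal length")
    if not 1 <= site <= len(h1):
        raise ValueError("site out of range")
    cut = site - 1  # convert to a 0-based index
    return h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]


# Two haplotypes in 0/1 representation, recombined at the 4th SNP.
r1, r2 = recombine("00000", "11110", 4)
print(r1, r2)  # 00010 11100
```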
Mean IER, SER and runtime of various phasing algorithms on the ACE dataset.
| Software | IER | SER | Time (s) |
| --- | --- | --- | --- |
| | 0.092 | 0.011 | 4.237 |
| | 0.232 | 0.020 | 3.031 |
| | 0.091* | 0.005* | 1.523 |
| | 0.178 | 0.020 | 0.039 |
| | 0.182 | 0.030 | 0.368 |
| | 0.091* | 0.005* | |
| | | | 0.019* |
The best result in each column is in bold; the second-best result in each column is marked with an asterisk.
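The tables report IER and SER without defining them in this excerpt. Assuming the standard meanings (individual error rate: the fraction of individuals whose inferred haplotype pair differs from the true pair; switch error rate: the fraction of adjacent heterozygous site pairs whose relative phase is inferred incorrectly), the two measures could be computed roughly as follows. The function names and input format are illustrative assumptions, not the evaluation code used for the tables.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # the two haplotypes of one individual, as 0/1 strings


def individual_error_rate(true_pairs: List[Pair], inferred_pairs: List[Pair]) -> float:
    """Fraction of individuals whose unordered haplotype pair is wrong."""
    wrong = sum(
        sorted(t) != sorted(p) for t, p in zip(true_pairs, inferred_pairs)
    )
    return wrong / len(true_pairs)


def switch_error_rate(true_pairs: List[Pair], inferred_pairs: List[Pair]) -> float:
    """Fraction of adjacent heterozygous site pairs with a phase switch.

    Assumes each inferred pair is consistent with the true genotype.
    """
    switches = 0
    opportunities = 0
    for (t1, t2), (p1, _) in zip(true_pairs, inferred_pairs):
        het = [i for i in range(len(t1)) if t1[i] != t2[i]]
        # For each heterozygous site: does inferred haplotype 1 follow true haplotype 1?
        follows_t1 = [p1[i] == t1[i] for i in het]
        opportunities += max(len(het) - 1, 0)
        switches += sum(a != b for a, b in zip(follows_t1, follows_t1[1:]))
    return switches / opportunities if opportunities else 0.0


truth = [("00000", "11110"), ("01001", "00000")]
guess = [("00010", "11100"), ("01001", "00000")]
ier = individual_error_rate(truth, guess)
ser = switch_error_rate(truth, guess)
print(ier, ser)  # 0.5 0.25
```

In this toy input the first individual has one switch among three adjacent heterozygous pairs and the second has none among one, so one of four phase relations is wrong while one of two individuals is wrong.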
Mean IER, SER and runtime of various phasing algorithms on the 5q31 dataset with missing data.
| Software | IER1 | IER2 | SER1 | SER2 | Time (s) |
| --- | --- | --- | --- | --- | --- |
| | 0.331* | 0.391 | 0.030 | 0.047 | 489.0 |
| | 0.337 | | 0.026* | 0.041* | 74.7 |
| | 0.380 | 0.434 | 0.030 | 0.045 | 21.3 |
| | 0.341 | 0.388* | 0.030 | 0.043 | 1.1 |
| | 0.343 | 0.400 | 0.028 | 0.043 | 1.8 |
| | 0.395 | 0.465 | 0.031 | 0.046 | |
| | | 0.388* | | | 0.5* |
The best result in each column is in bold; the second-best result in each column is marked with an asterisk.
Mean IER, SER and runtime of various phasing algorithms on the CFTR dataset.
| Software | IER (N = 28) | SER (N = 28) | Time (s) (N = 28) | IER (N = 30) | SER (N = 30) | Time (s) (N = 30) | IER (N = 35) | SER (N = 35) | Time (s) (N = 35) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | 3.90 | | | 4.14 | 0.191* | 0.038* | 4.73 |
| | 0.252* | 0.049* | 3.73 | 0.246 | 0.049* | 4.03 | 0.231 | 0.045 | 4.70 |
| | 0.387 | 0.085 | 0.36 | 0.367 | 0.085 | 0.39 | 0.349 | 0.078 | 0.55 |
| | 0.261 | 0.054 | 0.02* | 0.243* | 0.051 | 0.02* | | | 0.02* |
| | 0.381 | 0.074 | 0.38 | 0.378 | 0.075 | 0.40 | 0.333 | 0.064 | 0.43 |
| | 0.423 | 0.089 | | 0.414 | 0.087 | | 0.417 | 0.083 | |
| | 0.316 | 0.065 | | 0.312 | 0.066 | | 0.301 | 0.061 | |

| Software | IER | SER | Time (s) | IER | SER | Time (s) | IER | SER | Time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | 5.67 | | | 6.34 | | | 7.31 |
| | 0.214 | 0.040 | 5.38 | 0.214 | 0.039 | 6.00 | 0.208 | 0.039 | 6.68 |
| | 0.324 | 0.070 | 0.70 | 0.334 | 0.072 | 0.86 | 0.325 | 0.069 | 0.95 |
| | 0.182* | 0.037* | 0.02* | 0.176* | 0.036* | 0.02* | 0.178* | 0.036* | 0.02* |
| | 0.304 | 0.055 | 0.05 | 0.276 | 0.050 | 0.48 | 0.272 | 0.048 | 0.50 |
| | 0.396 | 0.079 | | 0.388 | 0.077 | | 0.405 | 0.080 | |
| | 0.285 | 0.056 | | 0.260 | 0.053 | 0.02* | 0.275 | 0.056 | 0.02* |
The best result in each column is in bold; the second-best result in each column is marked with an asterisk.
Mean IER, SER and runtime of various phasing algorithms on the HapMap dataset with no missing data.
| Software | IER | SER | Time (s) |
| --- | --- | --- | --- |
| | – | – | No solution |
| | – | – | No solution |
| | – | – | No solution |
| | 0.992* | 0.027 | 126.1 |
| | 0.999 | 0.128 | 55.5 |
| | 0.999 | 0.024* | |
| | | | 41.1* |
The best result in each column is in bold; the second-best result in each column is marked with an asterisk.