Fei Deng, Wenjuan Cui, Lusheng Wang.
Abstract
BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation in human DNA. The sequence of SNPs on each of the two copies of a given chromosome in a diploid organism is referred to as a haplotype. Haplotype information has many applications, such as genetic disease diagnosis and drug design. The haplotype assembly problem is defined as follows: given a set of fragments sequenced from the two copies of a chromosome of a single individual, together with their locations on the chromosome (which can be pre-determined by aligning the fragments to a reference DNA sequence), reconstruct the two haplotypes (h1, h2) from the input fragments. Existing algorithms do not work well when the error rate of the fragments is high. Here we design an algorithm that gives accurate solutions even when the error rate of the fragments is high.
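To make the problem statement concrete, the following minimal sketch shows one possible encoding of the input and output. It is not the algorithm proposed in the paper: the string encoding of fragments (`'0'`/`'1'` alleles, `'-'` for uncovered sites), the helper names `consensus` and `haplotypes_from_partition`, and the assumption that a partition of the fragments into two groups is already known are all illustrative choices. Given such a partition, each haplotype is taken as the column-wise majority vote of its group.

```python
from collections import Counter

GAP = "-"  # assumed marker for a SNP site that a fragment does not cover


def consensus(fragments, n_sites):
    """Column-wise majority vote over one group of aligned fragments."""
    hap = []
    for j in range(n_sites):
        votes = Counter(f[j] for f in fragments if f[j] != GAP)
        if not votes:
            hap.append(GAP)  # no fragment in this group covers site j
        else:
            # Ties are broken in favour of '0' so the result is deterministic.
            hap.append("1" if votes.get("1", 0) > votes.get("0", 0) else "0")
    return "".join(hap)


def haplotypes_from_partition(fragments, assignment):
    """Build (h1, h2) from fragments and a 0/1 group label per fragment."""
    n_sites = len(fragments[0])
    group0 = [f for f, g in zip(fragments, assignment) if g == 0]
    group1 = [f for f, g in zip(fragments, assignment) if g == 1]
    return consensus(group0, n_sites), consensus(group1, n_sites)


# Toy input: six aligned fragments over five SNP sites.
# The third fragment carries one sequencing error (site 2).
fragments = ["010--", "0101-", "-1110", "101--", "1010-", "--101"]
assignment = [0, 0, 0, 1, 1, 1]  # hypothetical, assumed-known partition
h1, h2 = haplotypes_from_partition(fragments, assignment)
```

In practice the partition is unknown, and finding it under a high fragment error rate is the hard part that the paper addresses.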
Year: 2013 PMID: 23445458 PMCID: PMC3582451 DOI: 10.1186/1471-2164-14-S2-S2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Illustration of the preprocessing on the input fragment matrix. (a) The original fragment matrix M. (b) The matrix obtained from (a) by removing possible errors. (c) The obtained matrix.
Evaluation of how the size of boundOfCoverage affects the initial solution.
| boundOfCoverage | c = 3 | c = 5 | c = 8 | c = 10 |
|---|---|---|---|---|
| 10 | 0.708( | 0.753( | 0.764(0.14) | 0.774(0.18) |
| 12 | 0.728( | 0.785( | 0.794(0.15) | 0.797(0.21) |
| 15 | 0.776(0.30) | 0.837(0.33) | 0.841(0.36) | 0.857(0.45) |
There are 3 different sizes of boundOfCoverage, i.e. 10, 12 and 15, and 4 different coverage rates c, i.e. 3, 5, 8 and 10. The number outside each bracket refers to the average reconstruction rate under the corresponding parameter setting, while the one enclosed in the bracket refers to the average running time in seconds.
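This excerpt does not define the reconstruction rate. The sketch below assumes a definition commonly used in the haplotype assembly literature: one minus the fraction of SNP sites that differ between the true and reconstructed haplotype pairs, taking the better of the two possible pairings. The function names are illustrative, and the paper's own definition may differ in detail.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def reconstruction_rate(true_pair, est_pair):
    """Assumed definition: 1 - min(pairing distances) / (2 * n)."""
    (h1, h2), (g1, g2) = true_pair, est_pair
    n = len(h1)
    direct = hamming(h1, g1) + hamming(h2, g2)
    swapped = hamming(h1, g2) + hamming(h2, g1)
    # The two reconstructed haplotypes are unordered, so take the better match.
    return 1.0 - min(direct, swapped) / (2 * n)


truth = ("01010", "10101")
perfect_but_swapped = reconstruction_rate(truth, ("10101", "01010"))
one_site_wrong = reconstruction_rate(truth, ("01011", "10101"))
```

Under this definition a value of 1.000 means both haplotypes were recovered exactly, regardless of the order in which they are reported.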
Comparisons of the algorithms when l = 100.
| e | c | SpeedHap | Fast Hare | 2d-mec | HapCUT | MLF | SHR-three | DGS | Ours |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 3 | 0.999 | 0.999 | 0.990 | 0.973 | 0.816 | 1.000 | ||
| 5 | 0.999 | 0.997 | 0.992 | 0.861 | 1.000 | ||||
| 8 | 0.997 | 0.912 | 1.000 | ||||||
| 10 | 0.998 | 0.944 | 1.000 | ||||||
| 0.1 | 3 | 0.895 | 0.919 | 0.912 | 0.929 | 0.889 | 0.696 | 0.973 | |
| 5 | 0.967 | 0.965 | 0.951 | 0.920 | 0.970 | 0.738 | 0.996 | ||
| 8 | 0.989 | 0.983 | 0.901 | 0.985 | 0.758 | 0.989 | 0.999 | ||
| 10 | 0.990 | 0.988 | 0.892 | 0.995 | 0.762 | 0.997 | 1.000 | ||
| 0.2 | 3 | 0.623 | 0.715 | 0.738 | 0.725 | 0.615 | 0.725 | 0.903 | |
| 5 | 0.799 | 0.797 | 0.793 | 0.836 | 0.655 | 0.813 | 0.963 | ||
| 8 | 0.852 | 0.881 | 0.873 | 0.864 | 0.681 | 0.878 | 0.990 | ||
| 10 | 0.865 | 0.915 | 0.894 | 0.871 | 0.699 | 0.917 | 0.996 | ||
| 0.3 | 3 | 0.480 | 0.617 | 0.602 | 0.618 | 0.557 | 0.611 | 0.776 | |
| 5 | 0.637 | 0.639 | 0.640 | 0.629 | 0.599 | 0.647 | 0.874 | ||
| 8 | 0.667 | 0.661 | 0.675 | 0.673 | 0.632 | 0.663 | 0.950 | ||
| 10 | 0.676 | 0.675 | 0.678 | 0.709 | 0.632 | 0.688 | 0.972 |
The columns e and c refer to the error rate and coverage rate, respectively. Columns 3-9 represent the reconstruction rate of the seven algorithms, i.e. SpeedHap, Fast Hare, 2d-mec, HapCUT, MLF, SHR-three and DGS. For each combination of e and c, the best among the seven algorithms is highlighted in bold. The last column lists the reconstruction rate of our algorithm.
Comparisons of the algorithms when l = 350.
| e | c | SpeedHap | Fast Hare | 2d-mec | HapCUT | MLF | SHR-three | DGS | Ours |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 3 | 0.999 | 0.990 | 0.965 | 0.864 | 0.830 | 0.999 | 1.000 | |
| 5 | 0.999 | 0.993 | 0.929 | 0.829 | 1.000 | ||||
| 8 | 0.998 | 0.969 | 0.895 | 1.000 | |||||
| 10 | 0.999 | 0.999 | 0.981 | 0.878 | 1.000 | ||||
| 0.1 | 3 | 0.819 | 0.871 | 0.837 | 0.752 | 0.682 | 0.926 | 0.970 | |
| 5 | 0.959 | 0.945 | 0.913 | 0.913 | 0.858 | 0.724 | 0.993 | ||
| 8 | 0.984 | 0.985 | 0.964 | 0.896 | 0.933 | 0.742 | 0.999 | ||
| 10 | 0.984 | 0.995 | 0.978 | 0.888 | 0.962 | 0.728 | 1.000 | ||
| 0.2 | 3 | 0.439 | 0.684 | 0.675 | 0.642 | 0.591 | 0.691 | 0.877 | |
| 5 | 0.729 | 0.746 | 0.728 | 0.728 | 0.632 | 0.769 | 0.953 | ||
| 8 | 0.825 | 0.853 | 0.791 | 0.798 | 0.670 | 0.842 | 0.988 | ||
| 10 | 0.855 | 0.877 | 0.817 | 0.867 | 0.831 | 0.668 | 0.994 | ||
| 0.3 | 3 | 0.251 | 0.590 | 0.565 | 0.581 | 0.548 | 0.578 | 0.725 | |
| 5 | 0.578 | 0.602 | 0.606 | 0.582 | 0.606 | 0.557 | 0.833 | ||
| 8 | 0.629 | 0.626 | 0.623 | 0.621 | 0.604 | 0.628 | 0.922 | ||
| 10 | 0.638 | 0.644 | 0.634 | 0.641 | 0.619 | 0.641 | 0.951 |
The columns e and c refer to the error rate and coverage rate, respectively. Columns 3-9 represent the reconstruction rate of the seven algorithms, i.e. SpeedHap, Fast Hare, 2d-mec, HapCUT, MLF, SHR-three and DGS. For each combination of e and c, the best among the seven algorithms is highlighted in bold. The last column lists the reconstruction rate of our algorithm.
Comparisons of the algorithms when l = 700.
| e | c | SpeedHap | Fast Hare | 2d-mec | HapCUT | MLF | SHR-three | DGS | Ours |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 3 | 0.999 | 0.988 | 0.946 | 0.787 | 0.781 | 0.999 | 0.997 | |
| 5 | 0.999 | 0.976 | 0.854 | 0.832 | 0.999 | ||||
| 8 | 0.992 | 0.919 | 0.868 | 1.000 | |||||
| 10 | 0.999 | 0.997 | 0.933 | 0.898 | 1.000 | ||||
| 0.1 | 3 | 0.705 | 0.829 | 0.786 | 0.927 | 0.698 | 0.668 | 0.951 | |
| 5 | 0.947 | 0.949 | 0.880 | 0.916 | 0.809 | 0.716 | 0.989 | ||
| 8 | 0.985 | 0.986 | 0.948 | 0.896 | 0.863 | 0.743 | 0.997 | ||
| 10 | 0.986 | 0.995 | 0.965 | 0.889 | 0.884 | 0.726 | 0.998 | ||
| 0.2 | 3 | 0.199 | 0.652 | 0.647 | 0.624 | 0.591 | 0.669 | 0.837 | |
| 5 | 0.681 | 0.712 | 0.697 | 0.682 | 0.617 | 0.741 | 0.927 | ||
| 8 | 0.801 | 0.808 | 0.751 | 0.747 | 0.653 | 0.818 | 0.974 | ||
| 10 | 0.813 | 0.778 | 0.861 | 0.765 | 0.675 | 0.861 | 0.982 | ||
| 0.3 | 3 | 0.095 | 0.581 | 0.552 | 0.570 | 0.536 | 0.573 | 0.676 | |
| 5 | 0.523 | 0.591 | 0.555 | 0.594 | 0.562 | 0.595 | 0.777 | ||
| 8 | 0.615 | 0.613 | 0.597 | 0.614 | 0.611 | 0.614 | 0.876 | ||
| 10 | 0.627 | 0.616 | 0.622 | 0.625 | 0.625 | 0.622 | 0.909 |
The columns e and c refer to the error rate and coverage rate, respectively. Columns 3-9 represent the reconstruction rate of the seven algorithms, i.e. SpeedHap, Fast Hare, 2d-mec, HapCUT, MLF, SHR-three and DGS. For each combination of e and c, the best among the seven algorithms is highlighted in bold. The last column lists the reconstruction rate of our algorithm.
Figure 2. Illustration of the effect of the voting procedure. The reconstruction rates for the final version of our algorithm and for the version without the voting procedure are shown as black and gray bars, respectively. The error rate of the benchmark is 0.2 in (a) and 0.3 in (b).
Figure 3. Evaluation of how the size of x affects the reconstruction rate. The reconstruction rates for x = 25, 50 and 100 are shown as white, gray and black bars, respectively. The error rate of the benchmark is 0.2 in (a) and 0.3 in (b).