Shreepriya Das, Haris Vikalo.
Abstract
BACKGROUND: The goal of haplotype assembly is to infer the haplotypes of an individual from a mixture of sequenced chromosome fragments. The limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. The dimensions (and, therefore, the difficulty) of haplotype assembly problems keep increasing as sequencing technology advances and the lengths of reads and inserts grow. The computational challenges are even more pronounced for polyploid haplotypes, whose assembly is considerably more difficult than that of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed.
Year: 2015 PMID: 25885901 PMCID: PMC4422552 DOI: 10.1186/s12864-015-1408-5
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Comparison of MEC and runtimes for different schemes applied to HuRef data
| Chr | CPLEX MEC | CPLEX time | SDhaP MEC | SDhaP time | RefHap MEC | RefHap time | HapCUT MEC | HapCUT time | HapTree MEC | HapTree time |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16853 | 1.72×10^5 | 17192 | 190 | 19777 | 251 | 18739 | 8354 | 21112 | 8499 |
| 2 | 12618 | 1260 | 12713 | 220 | 14698 | 185 | 13762 | 3576 | 15720 | 8210 |
| 3 | 9296 | 960 | 9444 | 153 | 10714 | 257 | 10096 | 4866 | 11424 | 12183 |
| 4 | 9958 | 1140 | 10106 | 115 | 11567 | 274 | 10936 | 5399 | 12479 | 9534 |
| 5 | 9195 | 1080 | 9333 | 141 | 10590 | 196 | 10045 | 4398 | 11391 | 8972 |
| 6 | 8637 | 900 | 8696 | 105 | 9922 | 247 | 9318 | 4225 | 10912 | 8462 |
| 7 | 9782 | 1020 | 9954 | 102 | 11279 | 152 | 10540 | 7030 | 12196 | 6876 |
| 8 | 8480 | 3960 | 8604 | 90 | 9832 | 226 | 9352 | 4640 | 10552 | 9162 |
| 9 | 8051 | 780 | 8134 | 88 | 9290 | 111 | 8850 | 3898 | 9905 | 5515 |
| 10 | 8550 | 1200 | 8680 | 87 | 9877 | 209 | 9323 | 5203 | 10598 | 8782 |
| 11 | 7027 | 840 | 7186 | 92 | 8210 | 156 | 7744 | 4514 | 8856 | 7093 |
| 12 | 7136 | 720 | 7256 | 100 | 8240 | 152 | 7725 | 2669 | 9003 | 5494 |
| 13 | 5090 | 600 | 5142 | 62 | 5844 | 125 | 5511 | 3363 | 6285 | 7680 |
| 14 | 5086 | 480 | 5173 | 52 | 5861 | 72 | 5537 | 2313 | 6273 | 3533 |
| 15 | 8088 | 1.67×10^5 | 8216 | 71 | 9364 | 108 | 9031 | 2014 | 10218 | 4018 |
| 16 | 7176 | 1500 | 7231 | 51 | 8287 | 138 | 7830 | 1896 | 8769 | 4589 |
| 17 | 5739 | 480 | 5819 | 47 | 6570 | 86 | 6238 | 2362 | 7106 | 5626 |
| 18 | 4403 | 540 | 4467 | 56 | 5041 | 108 | 4814 | 1542 | 5500 | 3544 |
| 19 | 4628 | 480 | 4670 | 32 | 5335 | 65 | 5052 | 1132 | 5660 | 2804 |
| 20 | 3243 | 300 | 3316 | 37 | 3753 | 63 | 3458 | 1472 | 4068 | 3602 |
| 21 | 3360 | 2400 | 3369 | 31 | 3914 | 63 | 3752 | 739 | 4154 | 1692 |
| 22 | 3908 | 7.16×10^5 | 3973 | 32 | 4539 | 43 | 4384 | 1786 | 4780 | 1683 |
MEC and running times for the CPLEX, SDhaP, HapCUT, RefHap and HapTree algorithms applied to HuRef data. SDhaP is more accurate than every scheme except CPLEX, which is the only one that requires proprietary software. For longer blocks, however, the running time of the CPLEX scheme can become very high, as is evident from chromosomes 1, 15 and 22.
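The MEC (minimum error correction) score reported in these tables counts how many allele calls in the reads would have to be flipped so that every read agrees with one of the assembled haplotypes. As a minimal sketch in Python, assuming reads are stored as hypothetical `{SNP index: allele}` dictionaries and alleles are coded 0/1, the score of a given candidate assembly could be evaluated as follows. This only scores a fixed set of haplotypes; it is not the SDhaP optimization, and the toy data is made up for illustration.

```python
from typing import Dict, List, Sequence


def mec_score(reads: List[Dict[int, int]],
              haplotypes: Sequence[Sequence[int]]) -> int:
    """Sum, over all reads, of the mismatches against the closest haplotype.

    Each read maps the SNP positions it covers to the observed allele.
    Works for any ploidy: `haplotypes` may hold two or more sequences.
    """
    total = 0
    for read in reads:
        # Hamming distance restricted to the positions this read covers,
        # minimised over the candidate haplotypes.
        total += min(
            sum(1 for pos, allele in read.items() if hap[pos] != allele)
            for hap in haplotypes
        )
    return total


# Toy diploid example (hypothetical data): complementary haplotypes, 4 SNPs.
haplotypes = [[0, 1, 1, 0], [1, 0, 0, 1]]
reads = [
    {0: 0, 1: 1},        # agrees with haplotype 0
    {1: 0, 2: 0, 3: 1},  # agrees with haplotype 1
    {2: 1, 3: 1},        # one mismatch against either haplotype
    {0: 1, 1: 0, 2: 1},  # one mismatch against haplotype 1
]
print(mec_score(reads, haplotypes))  # prints 2
```

A lower score means the assembly explains the reads with fewer corrections, which is why the tables treat a smaller MEC as more accurate.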
Comparison of MEC and runtimes for different schemes applied to Fosmid data
| Chr | CPLEX MEC | CPLEX time | SDhaP MEC | SDhaP time | RefHap MEC | RefHap time | HapCUT MEC | HapCUT time | HapTree MEC | HapTree time |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6889 | 480 | 7297 | 8.07 | 8051 | 2.61 | 9550 | 600 | 9676 | 6501 |
| 2 | 6700 | 451 | 7214 | 7.84 | 7910 | 2.65 | 9661 | 660 | 9802 | 7196 |
| 3 | 5122 | 420 | 5588 | 6.49 | 6111 | 2.969 | 7557 | 360 | 7705 | 4847 |
| 4 | 4072 | 360 | 4510 | 5.41 | 4880 | 1.81 | 6265 | 540 | 6500 | 8392 |
| 5 | 4637 | 762 | 5029 | 6.2 | 5558 | 2.15 | 6919 | 480 | 7094 | 5670 |
| 6 | 5248 | 471 | 5674 | 7.26 | 6341 | 2.093 | 7958 | 2700 | – | – |
| 7 | 4174 | 464 | 4509 | 5.02 | 4961 | 2.07 | 6062 | 480 | 6169 | 5589 |
| 8 | 4301 | 347 | 4785 | 5.09 | 5092 | 2.0 | 6255 | 615 | 6379 | 8316 |
| 9 | 3974 | 191 | 4200 | 4.7 | 4591 | 1.76 | 5463 | 376 | 5513 | 4465 |
| 10 | 4508 | 270 | 4765 | 5.04 | 5357 | 2.52 | 6445 | 454 | 6553 | 4838 |
| 11 | 3903 | 150 | 4165 | 4.63 | 4620 | 2.23 | 5558 | 457 | 5625 | 5183 |
| 12 | 3907 | 159 | 4174 | 5 | 4686 | 2.18 | 5654 | 360 | 5770 | 5654 |
| 13 | 2669 | 137 | 2946 | 3.0 | 3155 | 1.1 | 3967 | 291 | 4029 | 5367 |
| 14 | 2814 | 413 | 2971 | 3.35 | 3244 | 1.89 | 3978 | 302 | 4038 | 4103 |
| 15 | 2903 | 138 | 3029 | 3.09 | 3341 | 1.54 | 4007 | 250 | 4116 | 3357 |
| 16 | 3844 | 221 | 4022 | 4.84 | 4438 | 1.66 | 5086 | 570 | 5142 | 9683 |
| 17 | 3448 | 295 | 3586 | 3.41 | 4159 | 1.86 | 4743 | 251 | 4806 | 3003 |
| 18 | 2337 | 288 | 2555 | 2.69 | 2801 | 1.39 | 3445 | 240 | 3493 | 2303 |
| 19 | 2707 | 70 | 2857 | 2.78 | 3406 | 1.35 | 3898 | 180 | 3953 | 1984 |
| 20 | 2783 | 305 | 2943 | 3.08 | 3295 | 1.72 | 3810 | 203 | 3886 | 1529 |
| 21 | 1367 | 72 | 1452 | 1.44 | 1601 | 1.05 | 1951 | 134 | 1979 | 1410 |
| 22 | 2422 | 175 | 2508 | 3.21 | 2876 | 1.69 | 3260 | 118 | 3307 | 1351 |
MEC and running times for the CPLEX, SDhaP, HapCUT, RefHap and HapTree haplotype assembly strategies applied to Fosmid data. SDhaP is more accurate than every other scheme except CPLEX and significantly faster than HapCUT or HapTree. For chromosome 6, HapTree did not complete its run within 48 hours, so the corresponding entry is missing.
Comparison of SWER, MEC and runtimes for different schemes on simulated diploid data
| Block length l, coverage c | MEC | SWER | t (s) | MEC | SWER | t (s) | MEC | SWER | t (s) | MEC | SWER | t (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| l = 10^3, c = 10 | 86 | 0.002 | 4 | 123 | 0.009 | 8 | 123 | 0.009 | 6 | 123 | 0.009 | 31 |
| l = 10^3, c = 20 | 212 | 0.001 | 5 | 293 | 0.010 | 169 | 303 | 0.011 | 8 | 305 | 0.006 | 14 |
| l = 10^3, c = 30 | 300 | 0.001 | 7 | 378 | 0.007 | 567 | 377 | 0.002 | 7 | 378 | 0.001 | 14 |
| l = 10^4, c = 10 | 1112 | 0.003 | 28 | 1257 | 0.008 | 2341 | 1354 | 0.011 | 282 | 1354 | 0.010 | 34905 |
| l = 10^4, c = 20 | 2088 | 0.003 | 36 | 2659 | 0.008 | 36392 | 2774 | 0.009 | 680 | 2774 | 0.009 | 35443 |
| l = 10^4, c = 30 | 3482 | 0.004 | 81 | 4164 | 0.009 | 39184 | 4277 | 0.010 | 604 | 4283 | 0.009 | 17002 |
MEC, SWER and running times (in seconds) for the SDhaP, RefHap, HapCUT and HapTree algorithms on simulated data of different lengths (l) and with different coverages (c). The data contains a fixed 1% fraction of genotyping errors. SDhaP is more accurate in terms of MEC and SWER and, for longer blocks, faster than the other schemes by almost an order of magnitude.
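The switch error rate (SWER) used in these comparisons measures how often the inferred phase flips relative to the true phase between consecutive SNP positions. The sketch below, in Python, is one common way to compute it for a diploid block of heterozygous sites; the function name, the 0/1 allele coding, and the normalisation by the number of adjacent position pairs are assumptions made for illustration, and the exact normalisation used for the tables may differ.

```python
from typing import Sequence


def switch_error_rate(truth: Sequence[int], inferred: Sequence[int]) -> float:
    """Fraction of adjacent SNP pairs at which the inferred phase switches.

    Both arguments describe ONE haplotype of a diploid block over the same
    heterozygous sites; the second haplotype is implied as the complement.
    """
    if len(truth) != len(inferred):
        raise ValueError("haplotypes must cover the same positions")
    if len(truth) < 2:
        return 0.0
    # True where the inferred allele equals the true allele at that site.
    agree = [t == h for t, h in zip(truth, inferred)]
    # A switch occurs wherever agreement changes between neighbouring sites.
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(truth) - 1)


# Toy example (hypothetical data): the phase flips after site 1 and
# flips back before site 5, giving 2 switches over 5 adjacent pairs.
truth = [0, 1, 1, 0, 1, 0]
inferred = [0, 1, 0, 1, 0, 0]
print(switch_error_rate(truth, inferred))  # prints 0.4
```

Because only changes in agreement are counted, an inferred haplotype that is the exact complement of the truth scores zero, which reflects that the labelling of the two haplotypes is arbitrary.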
Comparison of SWER, MEC and runtimes for SDhaP and CPLEX on simulated diploid data with 1% error rate
| Block length l, coverage c | MEC | SWER | t (s) | MEC | SWER | t (s) | MEC | SWER | t (s) |
|---|---|---|---|---|---|---|---|---|---|
| l = 10^3, c = 10 | 100 | 0.001 | 1 | 100 | 0.001 | 192 | 100 | 0.001 | 57 |
| l = 10^3, c = 20 | 215 | 0.001 | 4 | 215 | 0.001 | 1320 | 215 | 0.001 | 373 |
| l = 10^3, c = 30 | 291 | 0.001 | 6 | 291 | 0.001 | 1241 | 291 | 0.001 | 910 |
| l = 10^4, c = 10 | 978 | 0.008 | 14 | - | - | - | 972 | 0.008 | 4505 |
| l = 10^4, c = 20 | 2039 | 0.004 | 33 | - | - | - | 2039 | 0.004 | 89811 |
| l = 10^4, c = 30 | 2988 | 0.004 | 68 | - | - | - | - | - | - |
| l = 10^5, c = 10 | 10356 | 0.008 | 324 | - | - | - | - | - | - |
| l = 10^5, c = 20 | 19975 | 0.007 | 713 | - | - | - | - | - | - |
| l = 10^5, c = 30 | 29967 | 0.005 | 1810 | - | - | - | - | - | - |
MEC, SWER and running times (in seconds) for SDhaP and CPLEX on simulated data of different lengths (l) and with different coverages (c). The data contains a fixed 1% fraction of genotyping errors. Where CPLEX results are available, SDhaP attains nearly the same MEC and SWER and is faster by more than an order of magnitude.
Comparison of SWER, MEC and runtimes for SDhaP and CPLEX on simulated diploid data with 5% error rate
| Block length l, coverage c | MEC | SWER | t (s) | MEC | SWER | t (s) | MEC | SWER | t (s) |
|---|---|---|---|---|---|---|---|---|---|
| l = 10^3, c = 10 | 535 | 0.04 | 1 | 518 | 0.036 | 811 | - | - | - |
| l = 10^3, c = 20 | 1042 | 0.007 | 4 | - | - | - | 1041 | 0.007 | 11393 |
| l = 10^3, c = 30 | 1583 | 0.003 | 4 | - | - | - | - | - | - |
| l = 10^4, c = 10 | 4971 | 0.099 | 14 | 4945 | 0.093 | 13800 | 7024 | 0.12 | 961 |
| l = 10^4, c = 20 | 9839 | 0.0370 | 41 | - | - | - | - | - | - |
| l = 10^4, c = 30 | 15310 | 0.0150 | 85 | - | - | - | - | - | - |
| l = 10^5, c = 10 | 51342 | 0.210 | 375 | - | - | - | - | - | - |
| l = 10^5, c = 20 | 102234 | 0.120 | 772 | - | - | - | - | - | - |
| l = 10^5, c = 30 | 157234 | 0.030 | 1517 | - | - | - | - | - | - |
MEC, SWER and running times (in seconds) for SDhaP and CPLEX on simulated data of different lengths (l) and with different coverages (c). The data contains a fixed 5% fraction of genotyping errors. Where CPLEX results are available, SDhaP attains comparable or better MEC and SWER and is faster by more than an order of magnitude.
Figure 1. SWER for diploids. Switch error rates of SDhaP applied to diploid data as a function of coverage for different block lengths and error rates. To achieve the same SWER, higher coverages are needed for longer blocks and higher error rates.
Figure 2. Runtimes for diploids. Runtimes of SDhaP applied to diploid data as a function of coverage for different block lengths and error rates. The runtimes are nearly independent of error rates and scale approximately linearly with block lengths.
Figure 3. SWER for polyploids. Switch error rates of SDhaP applied to polyploid data as a function of coverage for different block lengths and error rates. To achieve the same SWER, higher coverages are needed for longer blocks and higher ploidy.
Figure 4. Runtimes for polyploids. Runtimes of SDhaP applied to polyploid data as a function of coverage for different block lengths and error rates. The runtimes scale approximately linearly with block lengths and quadratically with the ploidy.
Comparison of SWER, MEC and runtimes for different schemes on simulated biallelic triploid data
| Block length, coverage | | SDhaP MEC | SDhaP SWER | SDhaP time | HapTree MEC | HapTree SWER | HapTree time |
|---|---|---|---|---|---|---|---|
| length 10^3, cov 15 | 0.0348 | 170 | 0.0130 | 18 | 401 | 0.0430 | 1860 |
| length 10^3, cov 30 | 0.0135 | 331 | 0.0027 | 90 | 582 | 0.0220 | 8 |
| length 10^3, cov 45 | 0.0064 | 488 | 0.0013 | 183 | 614 | 0.0053 | 5 |
| length 10^4, cov 15 | 0.0348 | 1848 | 0.0143 | 388 | - | - | - |
| length 10^4, cov 30 | 0.0135 | 4091 | 0.0038 | 1289 | 4744 | 0.0191 | 680 |
| length 10^4, cov 45 | 0.0064 | 6169 | 0.0025 | 2048 | 5492 | 0.0060 | 2424 |
MEC, SWER and running times for the SDhaP and HapTree algorithms on biallelic triploid simulated data. For l = 10^4 and c = 15, HapTree did not complete the task within 48 hours.
Comparison of SWER, MEC and runtimes for different schemes on simulated biallelic tetraploid data
| Block length, coverage | | SDhaP MEC | SDhaP SWER | SDhaP time | HapTree MEC | HapTree SWER | HapTree time |
|---|---|---|---|---|---|---|---|
| length 10^3, cov 20 | 0.0487 | 193 | 0.0105 | 40 | 626 | 0.0891 | 580 |
| length 10^3, cov 40 | 0.0217 | 385 | 0.0050 | 124 | 974 | 0.0380 | 780 |
| length 10^3, cov 80 | 0.0081 | 836 | 0.0015 | 560 | 2174 | 0.0290 | 15 |
| length 10^4, cov 20 | 0.0487 | 4676 | 0.0233 | 383 | - | - | - |
| length 10^4, cov 40 | 0.0217 | 6966 | 0.0096 | 2901 | - | - | - |
| length 10^4, cov 80 | 0.0081 | 14146 | 0.0072 | 8784 | - | - | - |
MEC, SWER and running times for the SDhaP and HapTree algorithms on biallelic tetraploid simulated data. For l = 10^4, HapTree did not complete the task within 48 hours.
SWER, MEC and runtimes of SDhaP for simulated hexaploid data
| Block length, coverage | | MEC | SWER | t (s) |
|---|---|---|---|---|
| length 10^3, cov 30 | 0.0480 | 1270 | 0.1338 | 278 |
| length 10^3, cov 60 | 0.0283 | 1653 | 0.0215 | 943 |
| length 10^3, cov 120 | 0.0177 | 2246 | 0.0170 | 1178 |
| length 10^3, cov 180 | 0.0087 | 1767 | 0.0017 | 8341 |
| length 10^4, cov 30 | 0.0480 | 14127 | 0.3370 | 1665 |
| length 10^4, cov 60 | 0.0283 | 16014 | 0.1100 | 5240 |
| length 10^4, cov 120 | 0.0177 | 21102 | 0.0353 | 19940 |
| length 10^4, cov 180 | 0.0087 | 72203 | 0.0210 | 30911 |
MEC, SWER and running times of SDhaP for biallelic hexaploid simulated data. HapTree completed the task within 48 hours in only one case (l = 10^3, cov = 60), where it achieved MEC = 2832, SWER = 0.1114 and t = 3441 s, all inferior to the corresponding SDhaP results in the table.
Figure 5. Block lengths histogram for HuRef data. Histogram of block lengths for HuRef data.
Figure 6. Block lengths histogram for Fosmid data. Histogram of block lengths for Fosmid data.
Figure 7. Histogram of homozygous positions. Histogram of the fraction of homozygous positions as a function of chromosome number for HuRef data. On average, around 1% of positions are falsely called heterozygous.