Dan He, Arthur Choi, Knot Pipatsrisawat, Adnan Darwiche, Eleazar Eskin.
Abstract
MOTIVATION: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy: obtaining haplotypes by combining sequence fragments. 'Haplotype assembly' is the problem of assembling the two haplotypes for a chromosome given a collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem, and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem.
Year: 2010 PMID: 20529904 PMCID: PMC2881399 DOI: 10.1093/bioinformatics/btq215
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
An example of a read matrix consisting of 10 reads spanning 13 positions
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Read1 | 0 | 0 | - | - | - | - | - | - | - | - | - | - | - |
| Read2 | - | - | 1 | 1 | - | - | - | - | - | - | - | - | - |
| Read3 | 0 | 0 | 0 | 0 | - | - | - | - | - | - | - | - | - |
| Read4 | - | - | 1 | 0 | 1 | - | - | - | - | - | - | - | - |
| Read5 | - | - | 0 | - | - | 0 | - | - | - | - | - | - | - |
| Read6 | - | - | - | 0 | - | - | - | - | - | - | 1 | 1 | - |
| Read7 | - | - | - | - | 0 | 0 | 0 | - | - | - | - | - | - |
| Read8 | - | - | - | - | 0 | 1 | 1 | 0 | - | - | - | - | - |
| Read9 | - | - | - | - | - | - | - | - | 1 | 1 | - | - | - |
| Read10 | - | - | - | - | - | - | - | 1 | 1 | 0 | - | - | 0 |
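To make the objective concrete, the following is a minimal sketch of how a minimum error correction (MEC) score could be evaluated on the read matrix above. It assumes that every position is heterozygous, so the second haplotype is the bitwise complement of the first, and that each read is charged the number of mismatches against whichever haplotype it fits better. The names `READS`, `mismatches`, `mec_score` and `brute_force_mec` are hypothetical and chosen for illustration. The exhaustive search is not the method of the paper; it only illustrates the objective on a 13-position example and would not scale to real blocks.

```python
from itertools import product

# Read matrix from the example table: one string per read,
# one character per position, '-' where the read does not cover the site.
READS = [
    "00-----------",  # Read1
    "--11---------",  # Read2
    "0000---------",  # Read3
    "--101--------",  # Read4
    "--0--0-------",  # Read5
    "---0------11-",  # Read6
    "----000------",  # Read7
    "----0110-----",  # Read8
    "--------11---",  # Read9
    "-------110--0",  # Read10
]


def mismatches(read, hap):
    """Count covered positions where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != "-" and r != h)


def mec_score(reads, hap):
    """MEC cost of the pair (hap, complement of hap).

    Assumption: all sites are heterozygous, so the second haplotype
    is the complement of the first.
    """
    comp = hap.translate(str.maketrans("01", "10"))
    return sum(min(mismatches(r, hap), mismatches(r, comp)) for r in reads)


def brute_force_mec(reads):
    """Exhaustively search all haplotypes (illustration only, exponential time).

    The first bit is fixed to '0' because a haplotype and its complement
    describe the same pair.
    """
    n = len(reads[0])
    best = None
    for bits in product("01", repeat=n - 1):
        hap = "0" + "".join(bits)
        score = mec_score(reads, hap)
        if best is None or score < best[0]:
            best = (score, hap)
    return best


if __name__ == "__main__":
    print(mec_score(READS, "0" * 13))  # all-zero candidate: prints 6
    print(brute_force_mec(READS))      # (best score, one optimal haplotype)
```

Fixing the first bit halves the search space without losing any solution, but the search remains exponential in the number of positions, which is why exact methods for realistic inputs need more structure than enumeration.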
Fig. 1. (a) The number of short reads and of all reads, and (b) the length of haplotypes, for each chromosome. The threshold for short reads is 15. The length of haplotypes is the number of heterozygous sites in each chromosome.
Fig. 2. Graphical representation of the read matrix for the first block of Chromosome 22, where the reads are sorted by their starting positions. The rows are the reads and the columns are the haplotype positions. The black dots are the non-‘−’ cells for the short reads and the red dots are the non-‘−’ cells for the long reads. The red lines are the gap cells of the paired-end reads.
The MEC scores for each chromosome, computed by Greedy, HapCut and dynamic programming (DP) on short reads only, and by Greedy, HapCut and the MaxSAT conversion on all reads
| | Short reads: Greedy | Short reads: HapCut | Short reads: DP | All reads: Greedy | All reads: HapCut | All reads: MaxSAT |
|---|---|---|---|---|---|---|
| Chromosome 1 | 21 355 | 15 312 | 15 292 | 29 518 | 19 687 | 19 584 |
| Chromosome 2 | 16 067 | 11 251 | 11 107 | 22 706 | 14 615 | 14 576 |
| Chromosome 3 | 11 909 | 8223 | 8181 | 16 696 | 10 702 | 10 647 |
| Chromosome 4 | 12 518 | 8820 | 8775 | 17 509 | 11 525 | 11 304 |
| Chromosome 5 | 11 621 | 8017 | 7944 | 16 432 | 10 536 | 10 528 |
| Chromosome 6 | 10 624 | 7487 | 7369 | 15 295 | 9842 | 9826 |
| Chromosome 7 | 11 668 | 8531 | 8423 | 17 188 | 11 244 | 11 187 |
| Chromosome 8 | 10 501 | 7343 | 7311 | 14 535 | 9741 | 9025 |
| Chromosome 9 | 10 199 | 7350 | 7312 | 13 512 | 9222 | 9201 |
| Chromosome 10 | 10 263 | 7313 | 7236 | 15 076 | 9846 | 9778 |
| Chromosome 11 | 8825 | 6224 | 6196 | 12 667 | 8200 | 8183 |
| Chromosome 12 | 8641 | 6337 | 6155 | 12 453 | 8218 | 8176 |
| Chromosome 13 | 6412 | 4396 | 4341 | 8848 | 5822 | 5761 |
| Chromosome 14 | 6634 | 4567 | 4532 | 9070 | 5879 | 5845 |
| Chromosome 15 | 9289 | 6653 | 6623 | 13 291 | 9311 | 9285 |
| Chromosome 16 | 8574 | 6160 | 6093 | 12 365 | 8259 | 8207 |
| Chromosome 17 | 7088 | 5034 | 4955 | 10 195 | 6525 | 6459 |
| Chromosome 18 | 4973 | 3526 | 3398 | 8324 | 4991 | 4943 |
| Chromosome 19 | 5549 | 3996 | 3907 | 7939 | 5319 | 5288 |
| Chromosome 20 | 4136 | 2909 | 2891 | 5563 | 3739 | 3723 |
| Chromosome 21 | 3877 | 2903 | 2796 | 5607 | 3888 | 3881 |
| Chromosome 22 | 4424 | 3267 | 3250 | 6685 | 4495 | 4479 |
| Sum | 205 147 | 145 619 | 144 087 | 291 474 | 191 606 | 189 886 |
Average number of connected components contained in each block and average size of the connected components whose size is greater than 1 for different (coverage ratio, SD) settings
| | (10, 5) | (10, 50) | (10, 500) | (20, 5) | (20, 50) | (20, 500) | (30, 5) | (30, 50) | (30, 500) | (40, 5) | (40, 50) | (40, 500) | (100, 500) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Num | 8 | 7 | 10 | 8 | 6 | 6 | 8 | 6 | 4 | 8 | 5 | 3 | 1 |
| Size | 2 | 2 | 4 | 2 | 3 | 8 | 2 | 3 | 13 | 2 | 4 | 17 | 31 |
The probability (%) of an SNP being attached to other SNPs at least once, twice, three times or four times, for different (coverage ratio, SD) settings
| | (10, 5) | (10, 50) | (10, 500) | (20, 5) | (20, 50) | (20, 500) | (30, 5) | (30, 50) | (30, 500) | (40, 5) | (40, 50) | (40, 500) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ≥1 | 41 | 56 | 65 | 45 | 66 | 82 | 46 | 70 | 89 | 47 | 73 | 92 |
| ≥2 | 35 | 38 | 39 | 41 | 54 | 62 | 43 | 62 | 75 | 44 | 66 | 83 |
| ≥3 | 29 | 26 | 23 | 38 | 44 | 44 | 41 | 54 | 61 | 43 | 60 | 71 |
| ≥4 | 23 | 19 | 16 | 35 | 35 | 32 | 39 | 46 | 48 | 41 | 54 | 60 |