Davide Verzotto1, Audrey S M Teo2, Axel M Hillmer2, Niranjan Nagarajan1.
Abstract
BACKGROUND: Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilobase pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences.
Keywords: Genomic mapping; Glocal alignment; Map-to-sequence alignment; Optical mapping; Overlap alignment
Year: 2016 PMID: 26793302 PMCID: PMC4719737 DOI: 10.1186/s13742-016-0110-0
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Fig. 1 Example of a genomic map and strategies for glocal and overlap map alignment. a Example of an experimental or in silico map with ordered fragment sizes. b Feasible match within dashed bars (Definition 1). c Composite seeds with c=2 (Definition 4), where Composite (iv) represents the final composition of seeds with errors used here; the case with one false cut allowed is not directly indexed from the in silico maps, but is explored during the seed search process. d Seed extension in glocal alignment with dynamic programming (straight lines delimit feasible matches found, dashed lines mark truncated end matches and dashed circles show potentially missing fragments). e Sliding-window approach in overlap alignment: for a particular window of fixed size (dashed black border) we first compute a glocal alignment (solid yellow border) from one of its seeds (multicolored box), statistically evaluate it and subsequently extend it until the end of one of the maps is reached on both sides of the seed
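Panel a refers to an in silico map, that is, an ordered list of fragment sizes obtained by cutting a reference sequence at every occurrence of a restriction site. A minimal sketch of that idea follows. The function name `in_silico_map`, the toy sequence and the toy recognition site are illustrative assumptions, and the cut is placed at the start of each site for simplicity (real enzymes cut at a fixed offset within the site).

```python
from typing import List


def in_silico_map(sequence: str, site: str) -> List[int]:
    """Return ordered fragment sizes (in bp) for a sequence cut at each site.

    Simplifying assumption: the cut falls at the first base of every
    occurrence of `site`; overlapping occurrences are all counted.
    """
    seq = sequence.upper()
    motif = site.upper()
    if not motif:
        raise ValueError("recognition site must not be empty")

    # Collect the 0-based start position of every occurrence of the site.
    cuts = []
    pos = seq.find(motif)
    while pos != -1:
        cuts.append(pos)
        pos = seq.find(motif, pos + 1)

    # Fragment sizes are the distances between consecutive boundaries.
    boundaries = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Toy example (hypothetical sequence and site, not taken from the paper).
fragments = in_silico_map("AAGGATCCTTTTGGATCCA", "GGATCC")
print(fragments)
```

The fragment sizes always sum to the sequence length, which gives a quick sanity check on any map produced this way.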
Fig. 2 Comparison of sensitivity between different seeding approaches for the human genome. a The easier scenario (a). b The harder scenario (b). For each corresponding length in fragments, we report the percentage of maps with at least one correct seed detected (out of 100 maps). Note that the approach used in OPTIMA, Composite seeds (iv), was able to find the correct location for more than 99 % and 88 % of maps with at least ten fragments in scenarios (a) and (b), respectively
Fig. 3 Representation of candidate alignments as a function of alignment features. The results shown are based on aligning a 26-fragment simulated experimental map on the human reference genome. The green comet represents the true solution, and also the best solution π∗ found by OPTIMA (p-value p∗=2.16e−9), while the blue comet belongs to a false alignment with the lowest number of cut errors (p=7.35e−6). Note here that despite having many near-optimal solutions, OPTIMA unambiguously identifies the correct solution based on its statistical analysis
Comparison of all methods and their variants on glocal map-to-sequence alignment
| Algorithm | Drosophila (A) | | Drosophila (B) | | Human (A) | | Human (B) | |
|---|---|---|---|---|---|---|---|---|
| | S | P | S | P | S | P | S | P |
| OPTIMA | | | | | | | | |
| Gentig v.2 (d) | 59 | | 24 | | 53 | 96 | 20 | 80 |
| Gentig v.2 (tp) | 59 | | 24 | 98 | 54 | 95 | 20 | 88 |
| SOMA v.2 (v) | 72 | 73 | 31 | 39 | 50 | 50 | 17 | 20 |
| Likelihood (d+a) | 49 | 49 | 29 | 30 | 24 | 24 | 14 | 14 |
| Likelihood (d+a+t) | 64 | 65 | 38 | 39 | 33 | 34 | 18 | 19 |
| Likelihood (p+a+t) | 75 | 75 | 39 | 39 | 62 | 62 | 19 | 20 |
Sensitivity (S) and precision (P) are percentages and the best values across all methods are highlighted in bold. Results are based on the alignment of a subset of 2100 maps, as used in Fig. 4
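The S and P columns can be illustrated with a short sketch. The definitions used here are assumptions, since this section does not spell them out: sensitivity is taken as correctly aligned maps over all input maps, and precision as correctly aligned maps over all maps for which an alignment was reported. The counts in the example are hypothetical.

```python
import math
from typing import Tuple


def sensitivity_precision(n_maps: int, n_reported: int, n_correct: int) -> Tuple[float, float]:
    """Return (sensitivity, precision) as percentages.

    Assumed definitions (not confirmed by the source):
      sensitivity = correct alignments / all input maps
      precision   = correct alignments / reported alignments
    """
    if n_maps <= 0:
        raise ValueError("n_maps must be positive")
    if not 0 <= n_correct <= n_reported <= n_maps:
        raise ValueError("expected 0 <= n_correct <= n_reported <= n_maps")

    sensitivity = 100.0 * n_correct / n_maps
    # Precision is undefined when nothing was reported.
    precision = 100.0 * n_correct / n_reported if n_reported else math.nan
    return sensitivity, precision


# Hypothetical counts for a 2100-map subset.
s, p = sensitivity_precision(n_maps=2100, n_reported=1000, n_correct=900)
print(f"S = {s:.1f} %, P = {p:.1f} %")
```

Under these definitions a method can have high precision and low sensitivity at the same time, which is the pattern the Gentig rows show.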
Fig. 4 Glocal alignment as a function of the number of fragments in the experimental maps. Gentig results are plotted for setting (d) and likelihood-based fit alignment results are for setting (d+a+t). Results are reported for 100 maps for each bin of simulated datasets for Drosophila and human scenarios (a) and (b)
Running time and worst-case complexity for various glocal map-to-sequence aligners
| Algorithm | Complexity (time) | Complexity (space) | Running time (Drosophila) | Running time (human) |
|---|---|---|---|---|
| OPTIMA | | | | |
| Gentig v.2 (d) | | | 1.32 h | 75 days |
| Gentig v.2 (tp) | | | 1.85 h | 174 days |
| SOMA v.2 (v) | | | 1.28 years | 1,067 years |
| Likelihood (d+a) | | | 22.22 h | 2.72 years |
| Likelihood (d+a+t) | | | 19.62 h | 2.38 years |
| Likelihood (p+a+t) | | | 41.73 h | 5.53 years |
Running times reported are estimated from 2100 maps and extrapolated for the full datasets (82,000 Drosophila maps and 2.1 million human maps, for 100× coverage; single-core computation on Intel x86 64-bit Linux workstations with 16 GB RAM). The best column-wise running times are reported in bold. Note that including the permutation-based statistical tests for SOMA and the likelihood method would increase their runtime by a factor of more than 100. The complexity analysis refers to map-to-sequence glocal alignment per map, where n is the total length of the in silico maps (500,000 fragments for the human genome), m≪n is the length of the experimental map in fragments (17 fragments on average), #seeds, c (default of two) and δ are as defined in the “Methods” section and #it (number of iterations), #hashes (geometric hashes found to match) and |HashTable| are as specified in [17, 24]
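The extrapolation from a timed subset to a full dataset can be sketched as a simple proportional scaling. Linear scaling in the number of maps is an assumption here (it is consistent with a per-map complexity analysis, but the exact procedure is not given in this section), and the measured time in the example is hypothetical.

```python
def extrapolate_runtime(measured_hours: float, n_sample: int, n_total: int) -> float:
    """Scale a runtime measured on n_sample maps to n_total maps.

    Assumes the cost per map is constant, so total time grows linearly
    with the number of maps.
    """
    if n_sample <= 0 or n_total <= 0:
        raise ValueError("map counts must be positive")
    if measured_hours < 0:
        raise ValueError("measured_hours must be non-negative")
    return measured_hours * (n_total / n_sample)


# Hypothetical measurement: 0.5 h for a 2100-map sample, scaled to the
# 2.1 million human maps mentioned in the footnote.
hours = extrapolate_runtime(0.5, n_sample=2100, n_total=2_100_000)
print(f"{hours:.0f} h, about {hours / 24:.1f} days")
```

With a sample of 2100 maps, the scaling factor is 1000 for the human dataset and roughly 39 for the 82,000 Drosophila maps.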
Statistics for glocal alignment of real human optical maps from GM12878 HapMap cell line
| Map card | F | Input maps | Details | OPTIMA | Gentig v.2 | Increase w.r.t. Gentig v.2 | Yield (genome coverage) | Avg. length | Avg. size | Avg. digestion rate | Avg. false/extra cut rate | Avg. WHT chi square sizing error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21157LB | (r) | 73,365 (7.2X) | Avg. quality 0.50; size 295 kbp, 18 f; AFS 16.5 kbp | 25 % | 9 % | 3X | 2X | 21 f | 324 kbp | 66 % | 0.74 | –0.69 |
| | (s) | 38,483 (4.7X) | Avg. quality 0.53; size 368 kbp, 22 f; AFS 17 kbp | 36 % | 14 % | 2.6X | 1.7X | 23 f | 361 kbp | 65 % | 0.73 | –0.58 |
| 21159LB | (r) | 75,761 (7.6X) | Avg. quality 0.47; size 300 kbp, 17 f; AFS 17.4 kbp | 19 % | 5 % | 4X | 1.6X | 19 f | 325 kbp | 63 % | 0.72 | –1.07 |
| | (s) | 41,236 (5.1X) | Avg. quality 0.50; size 370 kbp, 21 f; AFS 17.8 kbp | 27 % | 8 % | 3.4X | 1.3X | 21 f | 359 kbp | 62 % | 0.72 | –0.97 |
| 21431LB | (r) | 93,896 (8.6X) | Avg. quality 0.52; size 274 kbp, 17 f; AFS 15.8 kbp | 20 % | 8 % | 2.6X | 1.9X | 21 f | 305 kbp | 68 % | 0.77 | –0.42 |
| | (s) | 43,667 (5.1X) | Avg. quality 0.54; size 348 kbp, 21 f; AFS 16.3 kbp | 30 % | 13 % | 2.4X | 1.5X | 23 f | 343 kbp | 67 % | 0.77 | –0.29 |
| 21443LB | (r) | 66,857 (6X) | Avg. quality 0.51; size 271 kbp, 17 f; AFS 15.8 kbp | 19 % | 7 % | 2.7X | 1.3X | 20 f | 299 kbp | 67 % | 0.77 | –0.50 |
| | (s) | 29,991 (3.5X) | Avg. quality 0.53; size 346 kbp, 21 f; AFS 16.3 kbp | 29 % | 12 % | 2.5X | 1X | 23 f | 340 kbp | 66 % | 0.77 | –0.35 |
| TOTAL | (r) | 309,879 (29.4X) | Avg. quality 0.50; size 285 kbp, 17 f; AFS 16.4 kbp | 21 % | 7 % | 2.9X | 6.8X | 21 f | 314 kbp | 66 % | 0.75 | –0.66 |
| | (s) | 153,377 (18.3X) | Avg. quality 0.52; size 359 kbp, 21 f; AFS 16.9 kbp | 31 % | 11 % | 2.7X | 5.5X | 23 f | 352 kbp | 65 % | 0.75 | –0.55 |
Statistics are reported independently for each map card of GM12878 cell line, using: (r) relaxed filtering: ≥ 10 fragments and 150 kbp; and (s) stringent filtering: ≥ 12 fragments and 250 kbp (as shown in column F). From left to right are reported: the total number of input maps and their coverage in bases of the human genome; further details such as average map quality (provided by the Argus machine), average map size in bases and length in fragments, and average fragment size (AFS); aligned maps by OPTIMA and Gentig v.2; OPTIMA alignment rate increase with respect to Gentig v.2; other OPTIMA alignment statistics
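The two filtering settings can be written as a small predicate over a map's fragment sizes. This is a sketch under stated assumptions: both thresholds are read as lower bounds that must hold together, the map size is taken as the sum of its fragment sizes, and the names `FILTERS` and `passes_filter` are illustrative rather than taken from OPTIMA.

```python
from typing import Dict, List, Tuple

# (minimum number of fragments, minimum map size in bp) per setting,
# as given in the footnote: (r) relaxed and (s) stringent.
FILTERS: Dict[str, Tuple[int, int]] = {
    "relaxed": (10, 150_000),
    "stringent": (12, 250_000),
}


def passes_filter(fragments_bp: List[int], mode: str) -> bool:
    """Return True if a map meets both thresholds of the chosen setting.

    Assumption: map size is the sum of its fragment sizes, and the two
    conditions are combined with a logical AND.
    """
    min_fragments, min_size_bp = FILTERS[mode]
    return len(fragments_bp) >= min_fragments and sum(fragments_bp) >= min_size_bp


# Hypothetical map: 11 fragments of 20 kbp each (220 kbp in total).
example_map = [20_000] * 11
print(passes_filter(example_map, "relaxed"))    # meets >= 10 f and >= 150 kbp
print(passes_filter(example_map, "stringent"))  # fails both >= 12 f and >= 250 kbp
```

Stringent filtering keeps roughly half of the input maps in the table above (153,377 of 309,879 in the TOTAL rows), in exchange for a higher alignment rate.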
Statistics for glocal alignment of real human optical maps from HCT116 colorectal cancer cell line
| Map card | F | Input maps | Details | OPTIMA | Gentig v.2 | Increase w.r.t. Gentig v.2 | Yield (genome coverage) | Avg. length | Avg. size | Avg. digestion rate | Avg. false/extra cut rate | Avg. WHT chi square sizing error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17182LA | (r) | 10,911 (0.9X) | Avg. quality 0.33; size 257 kbp, 16 f; AFS 15.7 kbp | 4 % | 0.5 % | 8.1X | 0.04X | 19 f | 245 kbp | 66 % | 1.29 | –1.15 |
| | (s) | 3,744 (0.4X) | Avg. quality 0.33; size 351 kbp, 20 f; AFS 17.7 kbp | 4 % | 0.9 % | 4.5X | 0.02X | 22 f | 326 kbp | 63 % | 1.23 | –0.83 |
| 17184LA-2 | (r) | 55,719 (5.7X) | Avg. quality 0.43; size 305 kbp, 19 f; AFS 16.3 kbp | 18 % | 9 % | 1.9X | 1.1X | 23 f | 332 kbp | 68 % | 0.76 | –0.65 |
| | (s) | 28,658 (3.7X) | Avg. quality 0.45; size 390 kbp, 23 f; AFS 17.2 kbp | 25 % | 15 % | 1.6X | 0.9X | 25 f | 378 kbp | 67 % | 0.74 | –0.51 |
| 17185LA | (r) | 56,879 (5.4X) | Avg. quality 0.55; size 285 kbp, 19 f; AFS 14.7 kbp | 24 % | 18 % | 1.4X | 1.5X | 23 f | 325 kbp | 70 % | 0.76 | –0.17 |
| | (s) | 28,003 (3.4X) | Avg. quality 0.59; size 365 kbp, 24 f; AFS 15.1 kbp | 35 % | 28 % | 1.2X | 1.2X | 26 f | 367 kbp | 70 % | 0.74 | –0.04 |
| 17186LA-3 | (r) | 52,984 (5.8X) | Avg. quality 0.54; size 328 kbp, 20 f; AFS 16.0 kbp | 33 % | 19 % | 1.7X | 2X | 24 f | 342 kbp | 70 % | 0.68 | –0.35 |
| | (s) | 31,588 (4.3X) | Avg. quality 0.56; size 404 kbp, 25 f; AFS 16.4 kbp | 42 % | 28 % | 1.5X | 1.7X | 26 f | 380 kbp | 69 % | 0.67 | –0.26 |
| 17187LA | (r) | 88,730 (7.8X) | Avg. quality 0.45; size 264 kbp, 18 f; AFS 14.8 kbp | 12 % | 7 % | 1.7X | 1X | 21 f | 285 kbp | 69 % | 0.94 | –0.56 |
| | (s) | 36,018 (4.2X) | Avg. quality 0.46; size 349 kbp, 22 f; AFS 15.8 kbp | 17 % | 11 % | 1.6X | 0.7X | 24 f | 338 kbp | 68 % | 0.92 | –0.35 |
| 14593LB | (r) | 30,994 (2.7X) | Avg. quality 0.39; size 261 kbp, 14 f; AFS 18.9 kbp | 6 % | 0.6 % | 9.9X | 0.2X | 16 f | 269 kbp | 63 % | 0.85 | –1.23 |
| | (s) | 10,944 (1.2X) | Avg. quality 0.39; size 337 kbp, 17 f; AFS 20.2 kbp | 9 % | 0.7 % | 12.3X | 0.1X | 18 f | 320 kbp | 60 % | 0.87 | –0.97 |
| TOTAL | (r) | 296,217 (28.3X) | Avg. quality 0.47; size 287 kbp, 18 f; AFS 15.7 kbp | 18 % | 11 % | 1.7X | 5.7X | 23 f | 322 kbp | 69 % | 0.77 | –0.44 |
| | (s) | 138,955 (17.2X) | Avg. quality 0.50; size 372 kbp, 23 f; AFS 16.5 kbp | 27 % | 18 % | 1.5X | 4.6X | 25 f | 368 kbp | 68 % | 0.75 | –0.28 |
Statistics are reported for each map card of HCT116 cell line using the relaxed filtering (r) and the stringent filtering (s), as in Table 3. These results further suggest a mean yield of 1.25× and 1× for (r) and (s), respectively, in terms of aligned coverage of the human genome per map card using OPTIMA
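The quoted mean yields can be reproduced from the TOTAL rows of the GM12878 and HCT116 tables. The sketch below assumes the mean is taken over all ten map cards of both cell lines (four GM12878 cards and six HCT116 cards); that reading is an inference from the numbers, not something the footnote states.

```python
from typing import Dict, List, Tuple

# (total yield in genome coverage, number of map cards), copied from the
# TOTAL rows of the two tables for each filtering setting.
TOTAL_YIELD: Dict[str, List[Tuple[float, int]]] = {
    "relaxed": [(6.8, 4), (5.7, 6)],    # GM12878, HCT116
    "stringent": [(5.5, 4), (4.6, 6)],  # GM12878, HCT116
}


def mean_yield_per_card(entries: List[Tuple[float, int]]) -> float:
    """Return total aligned coverage divided by the total number of map cards."""
    total_yield = sum(y for y, _ in entries)
    total_cards = sum(n for _, n in entries)
    return total_yield / total_cards


for setting, entries in TOTAL_YIELD.items():
    print(f"{setting}: {mean_yield_per_card(entries):.2f}X per map card")
```

This gives 12.5X over ten cards for (r) and 10.1X over ten cards for (s), which round to the 1.25× and 1× quoted above.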
Fig. 5 Trade-off for partial overlap detection. Number of (correct) partial overlaps found for each sliding-window size using OPTIMA-Overlap, for both simulated (Drosophila and human scenarios (a) and (b)) and real maps over simulated and real scaffolds (K562 human cancer cell line), respectively
Comparison of methods for overlap map-to-sequence alignment
| Algorithm | Drosophila (A) | | Drosophila (B) | | Human (A) | | Human (B) | | Human real data |
|---|---|---|---|---|---|---|---|---|---|
| | E | P | E | P | E | P | E | P | E |
| OPTIMA-overlap | | | | | | | | | |
| Gentig v.2 (d) | 69 | 100 | 29 | 93 | 51 | 93 | 19 | 83 | 14 |
| Likelihood-overlap (d + a) | 59 | 74 | 36 | 52 | 21 | 41 | 9 | 26 | 12 |
The precision of overlap alignments (P, in percentages) and the number of overlap alignments that lead to (correct) extensions (E, absolute values) as a measure of sensitivity (correctness is only known for simulated datasets) are shown. The best values across methods are highlighted in bold