Abstract
Pairwise whole-genome homology mapping is the problem of finding all pairs of homologous intervals between a pair of genomes. As the number of available whole genomes has been rising dramatically in the last few years, there has been a need for more scalable homology mappers. In this paper, we develop an algorithm (BubbZ) for computing whole-genome pairwise homology mappings, especially in the context of all-to-all comparison for multiple genomes. BubbZ is based on an algorithm for computing chains in compacted de Bruijn graphs. We evaluate BubbZ on simulated datasets, a dataset composed of 16 long mouse genomes, and a large dataset of 1,600 Salmonella genomes. We show up to approximately an order of magnitude speed improvement, compared with MashMap2 and Minimap2, while retaining similar accuracy.
Keywords: Algorithms; Bioinformatics; Genomics
Year: 2020 PMID: 32563153 PMCID: PMC7303978 DOI: 10.1016/j.isci.2020.101224
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Running Time (Minutes) and Memory Usage (Gigabytes, in Parentheses) on the Mouse Data
| Dataset | TwoPaCo | BubbZ | TwoPaCo + BubbZ (Total) | Minimap2 | MashMap2 |
|---|---|---|---|---|---|
| 1–2 | 15 (9.3) | 6 (35.2) | 21 (35.2) | 73 (46.5) | 233 (22.3) |
| 1–4 | 22 (9.4) | 14 (66.5) | 36 (66.5) | 75 (105.4) | 240 (39.7) |
| 1–8 | 40 (9.3) | 26 (94.9) | 66 (94.9) | 104 (119.2) | 464 (44.7) |
| 1–16 | 83 (17.8) | 42 (164.2) | 125 (164.2) | 411 (119.6) | 1,530 (45.6) |
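The "Total" column appears to combine the two pipeline stages by adding their running times and taking the larger of their memory figures. That rule is an assumption inferred from the numbers, not something the table states; the short Python sketch below checks it against the rows transcribed from the table above. The names `MOUSE_ROWS`, `combine`, and `check_totals` are illustrative only.

```python
# Rows transcribed from the mouse table: (dataset, TwoPaCo, BubbZ, reported total).
# Each measurement is (minutes, gigabytes).
MOUSE_ROWS = [
    ("1–2", (15, 9.3), (6, 35.2), (21, 35.2)),
    ("1–4", (22, 9.4), (14, 66.5), (36, 66.5)),
    ("1–8", (40, 9.3), (26, 94.9), (66, 94.9)),
    ("1–16", (83, 17.8), (42, 164.2), (125, 164.2)),
]


def combine(stage_a, stage_b):
    """Assumed rule: running times add, memory is the larger of the two stages."""
    return (stage_a[0] + stage_b[0], max(stage_a[1], stage_b[1]))


def check_totals(rows):
    """Return True if every reported total matches the assumed rule."""
    return all(combine(a, b) == total for _, a, b, total in rows)


print(check_totals(MOUSE_ROWS))  # prints True
```

Under this reading, the memory figure of the combined pipeline is set by BubbZ on every mouse dataset, since its usage exceeds that of TwoPaCo in each row.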
Figure 1. Results on the Mouse Data
Recall of the position pairs belonging to pairs of protein-coding genes by BubbZ (blue), Minimap2 (green), and MashMap2 (red). (A) corresponds to orthologs and (B) to paralogs. MashMap2 recall on paralogs could not be computed.
Running Time (Minutes) and Memory Usage (Gigabytes, in Parentheses) on the Bacterial Data
| Dataset | TwoPaCo | BubbZ | TwoPaCo + BubbZ (Total) | Minimap2 | MashMap2 |
|---|---|---|---|---|---|
| 1–200 | 4 (17.5) | 2 (7.8) | 6 (17.5) | 35 (3.5) | 26 (1.6) |
| 1–400 | 6 (17.5) | 6 (16.7) | 12 (17.5) | 132 (3.5) | 101 (1.8) |
| 1–800 | 10 (17.6) | 33 (44.3) | 43 (44.3) | 510 (4.3) | 390 (2.3) |
| 1–1,600 | 19 (17.8) | 257 (149.2) | 276 (149.2) | 2,250 (7.0) | 1,876 (2.3) |
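To relate these running times to the speed improvement claimed in the abstract, the following Python sketch divides each competitor's running time by the combined TwoPaCo + BubbZ time. The values are transcribed from the bacterial table above; the names `BACTERIAL_MINUTES` and `speedups` are illustrative, and the ratio is only one possible way to summarize the comparison.

```python
# Running times in minutes from the bacterial table:
# dataset -> (TwoPaCo + BubbZ total, Minimap2, MashMap2)
BACTERIAL_MINUTES = {
    "1–200": (6, 35, 26),
    "1–400": (12, 132, 101),
    "1–800": (43, 510, 390),
    "1–1,600": (276, 2250, 1876),
}


def speedups(table):
    """Map each dataset to (Minimap2 / total, MashMap2 / total) time ratios."""
    return {
        name: (minimap2 / total, mashmap2 / total)
        for name, (total, minimap2, mashmap2) in table.items()
    }


for name, (vs_minimap2, vs_mashmap2) in speedups(BACTERIAL_MINUTES).items():
    print(f"{name}: {vs_minimap2:.1f}x vs Minimap2, {vs_mashmap2:.1f}x vs MashMap2")
```

On these four datasets the ratios fall between roughly 4x and 12x, with the largest values on the 1–400 and 1–800 subsets.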
Figure 2. Results on the Simulated Data: Accuracy as a Function of the Genomic Distance
(A) shows recall, and (B) displays precision.
Running Time (Seconds) and Memory Usage (Megabytes, in Parentheses) on the Simulated Data
| Dataset | TwoPaCo | BubbZ | TwoPaCo + BubbZ (Total) | Minimap2 | MashMap2 |
|---|---|---|---|---|---|
| 0.03 | 7 (1,240) | 1 (36) | 8 (1,240) | 6 (904) | 3 (147) |
| 0.06 | 6 (1,291) | 1 (51) | 7 (1,291) | 8 (820) | 3 (154) |
| 0.09 | 6 (1,246) | 1 (74) | 7 (1,246) | 10 (824) | 3 (168) |
| 0.11 | 6 (1,292) | 1 (77) | 7 (1,292) | 10 (634) | 3 (172) |
| 0.14 | 6 (1,250) | 2 (80) | 8 (1,250) | 15 (1,341) | 3 (171) |
| 0.17 | 6 (1,277) | 2 (80) | 8 (1,277) | 15 (1,340) | 3 (157) |
| 0.20 | 6 (1,238) | 2 (82) | 8 (1,238) | 16 (1,113) | 3 (165) |
| 0.22 | 5 (1,237) | 2 (71) | 7 (1,237) | 16 (614) | 4 (164) |
| 0.25 | 5 (1,207) | 2 (82) | 7 (1,207) | 16 (1,204) | 3 (168) |
Each dataset is labeled by its corresponding divergence.