Jouni Sirén1,2, Erik Garrison2, Adam M Novak1, Benedict Paten1, Richard Durbin2,3.
Abstract
MOTIVATION: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.
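To make the claim about non-biological paths concrete, the following minimal sketch counts paths in a toy graph. It assumes a hypothetical chain of four biallelic sites and three made-up haplotypes; none of these values come from the paper.

```python
from itertools import product

# Hypothetical toy data: 4 biallelic sites, allele 0 (reference) or 1 (alternate).
# Each tuple is one observed haplotype; these values are illustrative only.
haplotypes = [
    (0, 0, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
]
num_sites = 4

# In a chain of bubbles, every combination of alleles is a valid graph path.
all_paths = set(product((0, 1), repeat=num_sites))
observed = set(haplotypes)

# Paths that exist in the graph but match no observed haplotype.
recombinant_only = all_paths - observed

print(len(all_paths), len(observed), len(recombinant_only))
```

The number of graph paths doubles with each added site in this toy model, while the number of observed haplotypes is bounded by the number of samples.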
Year: 2020 PMID: 31406990 PMCID: PMC7223266 DOI: 10.1093/bioinformatics/btz575
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Top: A graph with three paths (lines above the nodes). Bottom: The GBWT of the paths, with lines connecting the paths' entries in each node's record
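The idea of per-node records can be sketched naively as follows. This is an illustration, not the toolkit's implementation: it assumes that the visits to a node are ordered by the reversed sequence of nodes preceding them (ties broken by path id), and that each entry stores the next node on the path. The three paths and the `ENDMARKER` value are hypothetical and are not the paths drawn in the figure.

```python
from collections import defaultdict

ENDMARKER = 0  # assumed id marking the end of a path; real node ids are > 0


def build_records(paths):
    """Naively build one record per node: the successors of all visits to it."""
    visits = defaultdict(list)
    for path_id, path in enumerate(paths):
        for i, node in enumerate(path):
            # Sort key: the nodes before this visit, most recent first.
            reverse_prefix = tuple(reversed(path[:i]))
            successor = path[i + 1] if i + 1 < len(path) else ENDMARKER
            visits[node].append((reverse_prefix, path_id, successor))
    # Within each node, order visits by (reverse prefix, path id) and keep
    # only the successor of each visit.
    return {
        node: [succ for _, _, succ in sorted(entries)]
        for node, entries in sorted(visits.items())
    }


# Hypothetical paths over node ids 1..7 (not taken from the figure).
paths = [
    [1, 2, 4, 5, 7],
    [1, 3, 4, 5, 7],
    [1, 2, 4, 6, 7],
]
records = build_records(paths)
for node, record in records.items():
    print(node, record)
```

Sorting full reverse prefixes takes time and space quadratic in path length, so this sketch is only suitable for toy inputs.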
Fig. 2. Unfolding the paths in the graph in Figure 1. Border nodes have been highlighted. Left: The graph after removing nodes 4, 5 and 6. Center: Complement graph. The maximal paths are and , with the bar splitting a path into a prefix and a suffix. Right: Unfolded graph. Dashed edges cross from prefixes to suffixes. Duplicated nodes have the original ids below the node
Datasets and direct GBWT construction
| Dataset name | Samples | Variants | Graph nodes | Out-degree (avg/max) | Paths | Path length | Batch (samples) | Buffer (nodes) | Interval | Construction time | GBWT size | Text IDs size | Total index size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 GP-all-S | 2504 | 84.7 M | 612 M | 1.3/13 | 50.6 M | 2.19 T | 200 | 100 M | 1024 | 12 h | 8.74 GiB | 9.90 GiB | 18.6 GiB |
| 1000 GP-all-L | 2504 | 84.7 M | 612 M | 1.3/13 | 240 232 | 2.19 T | 100 | 200 M | 1024 | 17 h | 8.43 GiB | 8.17 GiB | 16.6 GiB |
| 1000 GP-17-S | 2504 | 2.33 M | 16.6 M | 1.3/7 | 1.67 M | 60.1 G | 200 | 100 M | 1024 | 3.7 h | 258 MiB | 242 MiB | 500 MiB |
| 1000 GP-17-L | 2504 | 2.33 M | 16.6 M | 1.3/7 | 10 016 | 60.1 G | 100 | 200 M | 1024 | 4.2 h | 252 MiB | 193 MiB | 444 MiB |
| TOPMed-17-L | 54 035 | 12.9 M | 67.6 M | 1.3/11 | 216 140 | 4.66 T | 200 | 1000 M | 16 384 | 25 d | 1.13 GiB | 1.03 GiB | 2.16 GiB |
Note: The name of each dataset is a combination of source, chromosome and path length (S for short paths with phase breaks, L for long chromosome-length paths). For each dataset, we give the number of samples and variants, as well as the number of nodes, average/maximum out-degree, number of paths (including reverse paths) and the total length of the paths in the graph. For construction, we give batch size in samples, buffer size in millions of nodes, interval for stored text identifiers and wall-clock time for index construction in hours or days. We also report GBWT size, space used by text identifiers and total index size. M, G and T suffixes indicate millions, billions and trillions, respectively.
GCSA2 indexes for simplified 1000 GP graphs
| Graph | Pruning time | Construction time | Index size | Shared 64-mers | Haplotype 64-mers | Recombination 64-mers |
|---|---|---|---|---|---|---|
| Pruned-128 | 3.1 h | 25.6 h | 36.3 GiB | 27.0 G | 3.11 G | 11.4 G |
| Pruned-256 | 3.3 h | 25.5 h | 30.0 GiB | 27.0 G | — | — |
| Unfolded-256 | 3.7 h | 29.0 h | 34.0 GiB | 27.0 G | 3.46 G | — |
Note: For each graph, we give the pruning time in hours, construction time in hours and index size in GiB. We also show a comparison of unique 64-mer content versus pruned-256, including the number of 64-mers shared with pruned-256, the number of additional (real) haplotype 64-mers over pruned-256 and the number of additional (spurious) recombination 64-mers over pruned-256. The G suffix indicates billions.
Estimated resource usage of whole-genome TOPMed GBWT construction
| Phase | Jobs | Cores | Chr 17 time | Chr 17 memory | Chr 2 time | Chr 2 memory | CPU hours (whole genome) |
|---|---|---|---|---|---|---|---|
| Parsing | 1 | 1 | 42 h | 39 GiB | 5–6 d | 128 GiB | 1500 |
| Construction | 22 | 2 | 15 h | 33 GiB | 2 d | 100 GiB | 23 000 |
| Merging | 1 | 32 | 46 h | 102 GiB | 5–6 d | 320 GiB | 51 500 |
Note: For each phase (VCF parsing, GBWT construction for super-batches, GBWT merging), we give the number of jobs per chromosome, number of CPU cores per job, time (in hours or days) and memory usage (in GiB) per job for chromosome 17 (measured) and chromosome 2 (estimated) and estimated CPU hours for building a whole-genome index.
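As a rough reading aid for the chromosome 17 columns, the sketch below multiplies jobs, cores per job and wall-clock hours to get an upper bound on core-hours per phase. This is an assumption about how to combine the columns, not the method behind the whole-genome CPU-hour estimates, and it treats every core as busy for the full duration.

```python
# (phase, jobs, cores per job, wall-clock hours per job) for chromosome 17,
# copied from the table above.
CHR17_PHASES = [
    ("Parsing", 1, 1, 42),
    ("Construction", 22, 2, 15),
    ("Merging", 1, 32, 46),
]


def core_hours(jobs, cores, hours):
    """Upper bound on core-hours, assuming all cores are busy throughout."""
    return jobs * cores * hours


chr17_core_hours = {
    name: core_hours(jobs, cores, hours)
    for name, jobs, cores, hours in CHR17_PHASES
}
print(chr17_core_hours)
```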