Adam M Novak, Erik Garrison, Benedict Paten.
Abstract
We present a generalization of the positional Burrows-Wheeler transform, or PBWT, to genome graphs, which we call the gPBWT. A genome graph is a collapsed representation of a set of genomes described as a graph. In a genome graph, a haplotype corresponds to a restricted form of walk. The gPBWT is a compressible representation of a set of these graph-encoded haplotypes that allows for efficient subhaplotype match queries. We give efficient algorithms for gPBWT construction and query operations. As a demonstration, we use the gPBWT to quickly count the number of haplotypes consistent with random walks in a genome graph, and with the paths taken by mapped reads; results suggest that haplotype consistency information can be practically incorporated into graph-based read mappers. We estimate that, with the gPBWT, on the order of 100,000 diploid genomes, including all forms of structural variation, could be stored and made searchable for haplotype queries using a single large compute node.
Keywords: Genome graph; Haplotype; PBWT
Year: 2017 PMID: 28702075 PMCID: PMC5505026 DOI: 10.1186/s13015-017-0109-9
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1. An illustration of the B[] array for a single side numbered 1. (Note that a similar, reverse view could be constructed for the B[] array of the node's other side and the opposite orientations of all the thread orientations shown here, but it is omitted for clarity.) The central rectangle represents a node, and the pairs of solid lines on either side delimit edges attached to either the left or right side of the node, respectively. These edges connect the node to other parts of the graph, which have been elided for clarity. The dashed lines within the edges represent thread orientations traveling along each edge in a conserved order, while the solid lines with triangles at the ends within the displayed node represent thread orientations as they cross over one another within the node. The triangles themselves represent “terminals”, which connect to the corresponding dashed lines within the edges, and which are wired together within the node in a configuration determined by the B[] array. Thread orientations entering this node by visiting side 1 may enter their next nodes on sides 3, 5, or 7, and these labels are displayed near the edges leaving the right side of the diagram. (Note that we are following a convention where nodes’ left sides are assigned odd numbers, and nodes’ right sides are assigned even numbers.) The B[] array records, for each thread orientation entering through side 1, the side on which it enters its next node. This determines through which of the available edges it should leave the current node. Because threads tend to be similar to each other, their orientations are likely to run in “ribbons” of multiple thread orientations that both enter and leave together. These ribbons cause the B[] arrays to contain runs of identical values, which may be compressed.
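To illustrate why runs of identical values compress well, here is a minimal sketch of plain run-length encoding applied to a B[] array. The caption does not say which compression scheme is used, so run-length encoding is an assumption here, and the example array (nine thread orientations leaving side 1 toward sides 3, 5, and 7) is hypothetical data, not taken from the figure.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in runs for _ in range(length)]


# Hypothetical B[] array for side 1: three "ribbons" of thread orientations
# heading to sides 3, 5, and 7 respectively.
b_side_1 = [3, 3, 3, 5, 5, 7, 7, 7, 7]

runs = run_length_encode(b_side_1)
print(runs)
assert run_length_decode(runs) == b_side_1
```

Nine entries collapse to three pairs here; the saving grows with the width of each ribbon, which is the effect the caption describes.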
Fig. 2. A diagram of a graph containing two embedded threads. The graph consists of nodes with sides {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, connected by edges {2, 5}, {4, 5}, {6, 7}, {6, 9}, {8, 8}, and {10, 9}. Note that, once again, odd numbers are used for left sides and even numbers are used for right sides. As in Fig. 1, nodes are represented by rectangles, and thread orientations running from node to node are represented by dashed lines. The actual edges connecting the nodes are omitted for clarity; only the thread orientations are shown. Because each side’s B[] array defines a separate permutation, each node is divided into two parts by a central double yellow line (like on a road). The top half of each node shows visits to the node’s right side, while the bottom half shows visits to the node’s left side. Within the appropriate half of each node, the B[] array entries for the entry side are shown. The special 0 value is used to indicate that a thread stops and does not continue on to another node. When moving from the entry side to the exit side of a node, threads cross over each other so that they become sorted, stably, by the side of their next visit. Threads’ order of arrival at a node is determined by the relative order of the edges incident on the side they arrive at, which is in turn determined by the ordering of the sides on the other ends of the edges. The threads shown here are [1, 2, 5, 6, 9, 10, 9, 10] and [3, 4, 5, 6, 7, 8, 8, 7]. See Table 1 for a tabular representation of this example.
Table 1. B[] and c() values for the embedding of threads illustrated in Fig. 2.

| Side | B[] |
|---|---|
| 1 | [5] |
| 2 | [0] |
| 3 | [5] |
| 4 | [0] |
| 5 | [9, 7] |
| 6 | [4, 2] |
| 7 | [8, 8] |
| 8 | [6, 0] |
| 9 | [9, 0] |
| 10 | [10, 6] |
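
The B[] entries in Table 1 can be cross-checked against the two threads listed in the Fig. 2 caption. The sketch below assumes each thread is a list of sides that alternates entry side and exit side, and that each thread contributes both a forward and a reverse orientation. It only recovers which values belong in each side's array, not their order within the array, since reproducing the order would require the full arrival-order sorting rule. The function name `next_side_entries` is made up for this illustration.

```python
from collections import defaultdict

# Threads from the Fig. 2 caption, as alternating (entry side, exit side) lists.
THREADS = [
    [1, 2, 5, 6, 9, 10, 9, 10],
    [3, 4, 5, 6, 7, 8, 8, 7],
]

# B[] arrays as printed in Table 1.
TABLE_1 = {
    1: [5], 2: [0], 3: [5], 4: [0], 5: [9, 7],
    6: [4, 2], 7: [8, 8], 8: [6, 0], 9: [9, 0], 10: [10, 6],
}


def next_side_entries(threads):
    """For each side, collect the next entry side of every visit (0 = thread stops).

    Order within each list follows processing order, NOT the gPBWT order.
    """
    entries_by_side = defaultdict(list)
    for thread in threads:
        for orientation in (thread, thread[::-1]):
            entry_sides = orientation[0::2]  # every other side is an entry side
            following = entry_sides[1:] + [0]  # 0 marks the end of the thread
            for side, next_side in zip(entry_sides, following):
                entries_by_side[side].append(next_side)
    return dict(entries_by_side)


derived = next_side_entries(THREADS)
for side in sorted(TABLE_1):
    # Compare as multisets, because this sketch does not reproduce the ordering.
    assert sorted(derived[side]) == sorted(TABLE_1[side]), side
```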
Fig. 3. Distribution (top) and cumulative distribution (bottom) of the number of 1000 Genomes Phase 3 haplotypes consistent with short paths in the GRCh37 chromosome 22 graph. Primary mappings of 101 bp reads with scores of 90 out of 101 or above () are the solid blue line. Secondary mappings meeting the same score criteria () are the dashed green line. Simulated 100 bp random walks in the graph without consecutive N characters () are the dotted red line. Consistent haplotypes were counted using the gPBWT support added to vg [18].
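As a rough picture of what counting haplotypes consistent with a path means, here is a brute-force sketch. It is not the gPBWT query and not the method used in vg; it simply scans every thread. It assumes a thread is consistent with a query path when the query's list of sides appears as a contiguous run in either orientation of the thread, and it reuses the two small threads from Fig. 2 as example data. The names `contains_run` and `count_consistent` are made up for this illustration.

```python
def contains_run(sequence, query):
    """Return True if `query` occurs as a contiguous run inside `sequence`."""
    n, m = len(sequence), len(query)
    return any(sequence[i:i + m] == query for i in range(n - m + 1))


def count_consistent(threads, query):
    """Count threads that contain `query` in forward or reverse orientation."""
    count = 0
    for thread in threads:
        if contains_run(thread, query) or contains_run(thread[::-1], query):
            count += 1  # each thread is counted at most once
    return count


# Threads from the Fig. 2 caption.
threads = [
    [1, 2, 5, 6, 9, 10, 9, 10],
    [3, 4, 5, 6, 7, 8, 8, 7],
]

print(count_consistent(threads, [5, 6]))         # shared node: both threads
print(count_consistent(threads, [5, 6, 9, 10]))  # only the first thread
print(count_consistent(threads, [2, 3]))         # no thread takes this step
```

This scan costs time proportional to the total length of all threads for every query, which is the cost an index such as the gPBWT is meant to avoid.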
Fig. 4. Disk space usage for the gPBWT versus sample count for GRCh38 chromosome 22. Points are sampled at powers of two up to 128, and intervals of 128 thereafter up to 1024. The trend line shown corresponds to the function .