| Literature DB >> 30125266 |
Erik Garrison1, Jouni Sirén1, Adam M Novak2, Glenn Hickey2, Jordan M Eizenga2, Eric T Dawson1,3,4, William Jones1, Shilpa Garg5, Charles Markello2, Michael F Lin6, Benedict Paten2, Richard Durbin1,4.
Abstract
Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.Entities:
Mesh:
Substances:
Year: 2018 PMID: 30125266 PMCID: PMC6126949 DOI: 10.1038/nbt.4227
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1A region of a yeast genome variation graph. This displays the start of the subtelomeric region on the left arm of chromosome 9 in a multiple alignment of the strains sequenced by Yue et al.22, built using vg from a full genome multiple alignment generated with the Cactus alignment package6. The inset shows a subregion of the alignment at single base level. The colored paths correspond to separate contiguous chromosomal segments of these strains. This illustrates the ability of vg to represent paths corresponding to both colinear (inset) and structurally rearranged (main figure) regions of genomic variation.
Figure 2Mapping accuracy for vg against the human genome. (a) ROC curves parameterised by mapping quality for 10M read pairs simulated from NA24385 as mapped by bwa mem, vg with the 1000GP 1% allele frequency threshold pangenome reference, and vg with a linear reference, using single end (se) or paired end (pe) mapping. Left: all reads, middle: reads simulated from segments matching the linear reference, Right: reads simulated from segments different from the linear reference. (b) the mean alternate allele fraction at heterozygous variants previously called19 in NA24385 as a function of deletion or insertion size (SNPs at 0). Error bars are +/- one standard error.
Figure 3Mapping short and long reads with vg to yeast genome references. (a) ROC curves obtained by mapping 100,000 simulated SK1 yeast strain 150bp paired reads against a variety of references described in the text; (b) a density plot of identity fraction when mapping 43,337 Pacific Biosciences long reads from the SK1 strain to the drop.SK1 reference or the S288c reference.