Shoshana Marcus1, Hayan Lee2, Michael C Schatz2. 1. Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA. 2. Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA.
Abstract
MOTIVATION: Genomics is expanding from a single-reference-per-species paradigm into a more comprehensive pan-genome approach that analyzes multiple individuals together. A compressed de Bruijn graph is a sophisticated data structure for representing the genomes of entire populations. It robustly encodes shared segments, simple single-nucleotide polymorphisms and complex structural variations far beyond what can be represented in a collection of linear sequences alone.
RESULTS: We explore deep topological relationships between suffix trees and compressed de Bruijn graphs and introduce an algorithm, splitMEM, that directly constructs the compressed de Bruijn graph in time and space linear in the total number of genomes for a given maximum genome size. We introduce suffix skips to traverse several suffix links simultaneously and use them to efficiently decompose maximal exact matches into graph nodes. We demonstrate the utility of splitMEM by analyzing the nine-strain pan-genome of Bacillus anthracis and up to 62 strains of Escherichia coli, revealing their core-genome properties.
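To make the data structure concrete, the following is a minimal sketch of a compressed de Bruijn graph built the naive way: enumerate every k-mer, link consecutive k-mers, and collapse non-branching paths into single nodes. It is not splitMEM, which the abstract describes as working from suffix trees and maximal exact matches. The function name `compressed_de_bruijn`, the toy sequences and the choice k = 3 are assumptions made for illustration, and reverse complements are ignored.

```python
def compressed_de_bruijn(sequences, k):
    """Naive compressed de Bruijn graph (illustrative only, not splitMEM).

    Returns (nodes, edges): nodes is a sorted list of node labels, edges is
    a set of (source_label, target_label) pairs.
    """
    # Collect all k-mers and the successor/predecessor relation between
    # k-mers that are adjacent in at least one input sequence.
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    succ = {km: set() for km in kmers}
    pred = {km: set() for km in kmers}
    for seq in sequences:
        for i in range(len(seq) - k):
            a, b = seq[i:i + k], seq[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)

    def mergeable(a, b):
        # Edge a -> b can be collapsed when it is the only way out of a
        # and the only way into b (self-loops are never collapsed).
        return a != b and len(succ[a]) == 1 and len(pred[b]) == 1

    def starts_path(v):
        # v begins a compressed node unless it can be merged into its
        # unique predecessor.
        if len(pred[v]) != 1:
            return True
        (p,) = pred[v]
        return not mergeable(p, v)

    ordered = sorted(kmers)
    # Second pass over all k-mers picks up isolated cycles, which have
    # no natural starting k-mer.
    candidates = [v for v in ordered if starts_path(v)] + ordered

    nodes, seen = [], set()
    for v in candidates:
        if v in seen:
            continue
        label, cur = v, v
        seen.add(v)
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if nxt in seen or not mergeable(cur, nxt):
                break
            label += nxt[-1]  # extend the node label by one base
            seen.add(nxt)
            cur = nxt
        nodes.append(label)

    nodes.sort()
    # Two compressed nodes are linked when the last k-mer of one is
    # followed by the first k-mer of the other.
    edges = {(u, w) for u in nodes for w in nodes if w[:k] in succ[u[-k:]]}
    return nodes, edges


# Two toy "genomes" that differ by a single base.
nodes, edges = compressed_de_bruijn(["AACGTTGCA", "AACGATGCA"], 3)
print(nodes)
print(sorted(edges))
```

In this toy input the shared prefix and suffix each collapse to one node, and the single-base difference appears as two parallel nodes between them, which is the kind of shared-segment and polymorphism structure the abstract refers to. This sketch stores every k-mer explicitly, so it does not reflect the linear time and space behaviour claimed for splitMEM.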