
PFP Compressed Suffix Trees.

Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, Massimiliano Rossi.

Abstract

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to support full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.
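
To make the primitives named in the abstract concrete, the following is a minimal sketch, not the paper's PFP-based implementation: it builds SA, ISA, BWT and LCP naively from the full text, and shows how an LCE query reduces to a range minimum over the LCP array. All function names here are hypothetical and chosen for illustration, and the text is assumed to end with a unique smallest sentinel `$`.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; for illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])

def inverse(sa):
    # ISA[p] = rank of the suffix starting at position p.
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa

def bwt(s, sa):
    # BWT[r] = character preceding the r-th smallest suffix (cyclically).
    return "".join(s[i - 1] for i in sa)

def lce_naive(s, i, j):
    # Longest common extension by direct character comparison.
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k

def lcp_array(s, sa):
    # LCP[r] = common prefix length of suffixes SA[r-1] and SA[r]; LCP[0] = 0.
    return [0] + [lce_naive(s, sa[r - 1], sa[r]) for r in range(1, len(sa))]

def lce_via_rmq(s, isa, lcp, i, j):
    # LCE(i, j) equals the minimum LCP value strictly after the lower rank
    # and up to the higher rank of the two suffixes.
    if i == j:
        return len(s) - i
    lo, hi = sorted((isa[i], isa[j]))
    return min(lcp[lo + 1 : hi + 1])

text = "banana$"
sa = suffix_array(text)
isa = inverse(sa)
lcp = lcp_array(text, sa)
print(sa, bwt(text, sa), lcp, lce_via_rmq(text, isa, lcp, 1, 3))
```

The linear scan in `min(...)` stands in for the range-minimum structure; a compressed suffix tree would answer the same query without materialising these arrays over the whole text.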


Year:  2021        PMID: 35355938      PMCID: PMC8963198          DOI: 10.1137/1.9781611976472.5

Source DB:  PubMed          Journal:  Proc Worksh Algorithm Eng Exp        ISSN: 2164-0300


  6 in total

1.  Storage and retrieval of highly repetitive sequence collections.

Authors:  Veli Mäkinen; Gonzalo Navarro; Jouni Sirén; Niko Välimäki
Journal:  J Comput Biol       Date:  2010-03       Impact factor: 1.479

2.  A global reference for human genetic variation.

Authors:  Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal:  Nature       Date:  2015-10-01       Impact factor: 49.962

3.  The Public Health Impact of a Publically Available, Environmental Database of Microbial Genomes.

Authors:  Eric L Stevens; Ruth Timme; Eric W Brown; Marc W Allard; Errol Strain; Kelly Bunning; Steven Musser
Journal:  Front Microbiol       Date:  2017-05-09       Impact factor: 5.640

4.  Computational pan-genomics: status, promises and challenges. (Review)

Authors: 
Journal:  Brief Bioinform       Date:  2018-01-01       Impact factor: 11.622

5.  Prefix-free parsing for building big BWTs.

Authors:  Christina Boucher; Travis Gagie; Alan Kuhnle; Ben Langmead; Giovanni Manzini; Taher Mun
Journal:  Algorithms Mol Biol       Date:  2019-05-24       Impact factor: 1.405

6.  Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.

Authors:  Alan Kuhnle; Taher Mun; Christina Boucher; Travis Gagie; Ben Langmead; Giovanni Manzini
Journal:  J Comput Biol       Date:  2020-03-16       Impact factor: 1.479
