Literature DB >> 29389989

Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.

Meznah Almutairy1,2, Eric Torng1.   

Abstract

Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.

Entities:  

Mesh:

Year:  2018        PMID: 29389989      PMCID: PMC5794061          DOI: 10.1371/journal.pone.0189960

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


  40 in total

1.  A greedy algorithm for aligning DNA sequences.

Authors:  Z Zhang; S Schwartz; L Wagner; W Miller
Journal:  J Comput Biol       Date:  2000 Feb-Apr       Impact factor: 1.479

2.  BLAT--the BLAST-like alignment tool.

Authors:  W James Kent
Journal:  Genome Res       Date:  2002-04       Impact factor: 9.043

3.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors:  Daniel R Zerbino; Ewan Birney
Journal:  Genome Res       Date:  2008-03-18       Impact factor: 9.043

4.  A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.

Authors:  Zia Khan; Joshua S Bloom; Leonid Kruglyak; Mona Singh
Journal:  Bioinformatics       Date:  2009-04-23       Impact factor: 6.937

5.  RazerS--fast read mapping with sensitivity control.

Authors:  David Weese; Anne-Katrin Emde; Tobias Rausch; Andreas Döring; Knut Reinert
Journal:  Genome Res       Date:  2009-07-10       Impact factor: 9.043

6.  Improved tools for biological sequence comparison.

Authors:  W R Pearson; D J Lipman
Journal:  Proc Natl Acad Sci U S A       Date:  1988-04       Impact factor: 11.205

7.  Efficient construction of an assembly string graph using the FM-index.

Authors:  Jared T Simpson; Richard Durbin
Journal:  Bioinformatics       Date:  2010-06-15       Impact factor: 6.937

8.  DNACLUST: accurate and efficient clustering of phylogenetic marker genes.

Authors:  Mohammadreza Ghodsi; Bo Liu; Mihai Pop
Journal:  BMC Bioinformatics       Date:  2011-06-30       Impact factor: 3.169

9.  Scalable metagenomic taxonomy classification using a reference genome database.

Authors:  Sasha K Ames; David A Hysom; Shea N Gardner; G Scott Lloyd; Maya B Gokhale; Jonathan E Allen
Journal:  Bioinformatics       Date:  2013-07-04       Impact factor: 6.937

10.  Database indexing for production MegaBLAST searches.

Authors:  Aleksandr Morgulis; George Coulouris; Yan Raytselis; Thomas L Madden; Richa Agarwala; Alejandro A Schäffer
Journal:  Bioinformatics       Date:  2008-06-21       Impact factor: 6.937

View more
  2 in total

1.  Minimally-overlapping words for sequence similarity search.

Authors:  Martin C Frith; Laurent Noé; Gregory Kucherov
Journal:  Bioinformatics       Date:  2020-12-21       Impact factor: 6.937

2.  Sequence-specific minimizers via polar sets.

Authors:  Hongyu Zheng; Carl Kingsford; Guillaume Marçais
Journal:  Bioinformatics       Date:  2021-07-12       Impact factor: 6.937

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.