
Reducing storage requirements for biological sequence comparison.

Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, James A Yorke.

Abstract

MOTIVATION: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process.
RESULTS: We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
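The abstract does not say how minimizers are chosen, so the following is only a minimal sketch under the commonly used (w, k) definition: in every window of w consecutive k-mers, keep the smallest k-mer. The function names, the parameters `k` and `w`, and the use of plain lexicographic order are assumptions made for illustration, not details taken from this record.

```python
def kmers(seq, k):
    """Return every length-k substring of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Return the sorted (position, k-mer) pairs chosen as minimizers.

    Assumption: a minimizer is the lexicographically smallest k-mer in a
    window of w consecutive k-mers; ties go to the leftmost occurrence.
    """
    km = kmers(seq, k)
    selected = set()
    # Slide a window over the list of k-mers; too-short input selects nothing.
    for start in range(len(km) - w + 1):
        window = km[start:start + w]
        # min() returns the first minimal index, i.e. the leftmost on ties.
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


if __name__ == "__main__":
    # GATTACA has five 3-mers; with w = 2 only three of them are kept.
    print(minimizers("GATTACA", 3, 2))
    # prints [(1, 'ATT'), (3, 'TAC'), (4, 'ACA')]
```

Under this definition, two sequences that share an exact substring of length w + k - 1 select the same minimizer inside that substring, so storing only the minimizers still lets an index detect such a shared region while holding far fewer seeds than a full k-mer catalogue.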

Year:  2004        PMID: 15256412     DOI: 10.1093/bioinformatics/bth408

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  70 in total

1.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Authors:  Konstantin Berlin; Sergey Koren; Chen-Shan Chin; James P Drake; Jane M Landolin; Adam M Phillippy
Journal:  Nat Biotechnol       Date:  2015-05-25       Impact factor: 54.908

2.  Minimap2: pairwise alignment for nucleotide sequences.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2018-09-15       Impact factor: 6.937

3.  lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data.

Authors:  Ehsan Haghshenas; S Cenk Sahinalp; Faraz Hach
Journal:  Bioinformatics       Date:  2019-01-01       Impact factor: 6.937

4.  Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2016-03-19       Impact factor: 6.937

5.  Weighted minimizer sampling improves long read mapping.

Authors:  Chirag Jain; Arang Rhie; Haowen Zhang; Claudia Chu; Brian P Walenz; Sergey Koren; Adam M Phillippy
Journal:  Bioinformatics       Date:  2020-07-01       Impact factor: 6.937

6.  Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Authors:  Shubham Chandak; Kedar Tatwawadi; Tsachy Weissman
Journal:  Bioinformatics       Date:  2018-02-15       Impact factor: 6.937

7.  A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.

Authors:  Chirag Jain; Alexander Dilthey; Sergey Koren; Srinivas Aluru; Adam M Phillippy
Journal:  J Comput Biol       Date:  2018-04-30       Impact factor: 1.479

8.  Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.

Authors:  Meznah Almutairy; Eric Torng
Journal:  PLoS One       Date:  2018-02-01       Impact factor: 3.240

9.  BayesHammer: Bayesian clustering for error correction in single-cell sequencing.

Authors:  Sergey I Nikolenko; Anton I Korobeynikov; Max A Alekseyev
Journal:  BMC Genomics       Date:  2013-01-21       Impact factor: 3.969

10.  To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics.

Authors:  R A Leo Elworth; Qi Wang; Pavan K Kota; C J Barberan; Benjamin Coleman; Advait Balaji; Gaurav Gupta; Richard G Baraniuk; Anshumali Shrivastava; Todd J Treangen
Journal:  Nucleic Acids Res       Date:  2020-06-04       Impact factor: 16.971
