Reducing storage requirements for biological sequence comparison.

Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, James A Yorke.

Abstract

MOTIVATION: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process.
RESULTS: We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
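
The abstract does not spell out how minimizers are chosen, so the following is only a minimal sketch under stated assumptions: it takes the commonly used (w, k) definition, in which the smallest k-mer in every window of w consecutive k-mers is kept, and it assumes plain lexicographic ordering with the leftmost k-mer winning ties. The function names and parameters are hypothetical and are not taken from the paper.

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq.

    Assumption: a minimizer is the lexicographically smallest k-mer in a
    window of w consecutive k-mers; ties go to the leftmost k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Compare by k-mer first, then by position to break ties.
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return selected


def shared_minimizers(a, b, k, w):
    """Return the k-mers that are minimizers in both a and b."""
    in_a = {kmer for _, kmer in minimizers(a, k, w)}
    in_b = {kmer for _, kmer in minimizers(b, k, w)}
    return in_a & in_b
```

Under this definition, two strings that share an exact substring of length w + k - 1 also share the minimizer of that substring, so such a match can still be found when only minimizers are stored. For example, `minimizers("ACGTACGT", 3, 4)` keeps 2 of the 6 k-mer positions, and `shared_minimizers("TTTTGATTACAGG", "CCGATTACACC", 3, 5)` contains "ACA", the smallest 3-mer of the shared substring "GATTACA".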

Year: 2004 PMID: 15256412 DOI: 10.1093/bioinformatics/bth408

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Cited

70 in total

1. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Authors: Konstantin Berlin; Sergey Koren; Chen-Shan Chin; James P Drake; Jane M Landolin; Adam M Phillippy
Journal: Nat Biotechnol Date: 2015-05-25 Impact factor: 54.908

2. Minimap2: pairwise alignment for nucleotide sequences.

Authors: Heng Li
Journal: Bioinformatics Date: 2018-09-15 Impact factor: 6.937

3. lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data.

Authors: Ehsan Haghshenas; S Cenk Sahinalp; Faraz Hach
Journal: Bioinformatics Date: 2019-01-01 Impact factor: 6.937

4. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.

Authors: Heng Li
Journal: Bioinformatics Date: 2016-03-19 Impact factor: 6.937

5. Weighted minimizer sampling improves long read mapping.

Authors: Chirag Jain; Arang Rhie; Haowen Zhang; Claudia Chu; Brian P Walenz; Sergey Koren; Adam M Phillippy
Journal: Bioinformatics Date: 2020-07-01 Impact factor: 6.937

6. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Authors: Shubham Chandak; Kedar Tatwawadi; Tsachy Weissman
Journal: Bioinformatics Date: 2018-02-15 Impact factor: 6.937

7. A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.

Authors: Chirag Jain; Alexander Dilthey; Sergey Koren; Srinivas Aluru; Adam M Phillippy
Journal: J Comput Biol Date: 2018-04-30 Impact factor: 1.479

8. Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.

Authors: Meznah Almutairy; Eric Torng
Journal: PLoS One Date: 2018-02-01 Impact factor: 3.240

9. BayesHammer: Bayesian clustering for error correction in single-cell sequencing.

Authors: Sergey I Nikolenko; Anton I Korobeynikov; Max A Alekseyev
Journal: BMC Genomics Date: 2013-01-21 Impact factor: 3.969

10. To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics.

Authors: R A Leo Elworth; Qi Wang; Pavan K Kota; C J Barberan; Benjamin Coleman; Advait Balaji; Gaurav Gupta; Richard G Baraniuk; Anshumali Shrivastava; Todd J Treangen
Journal: Nucleic Acids Res Date: 2020-06-04 Impact factor: 16.971