Guillaume Marçais (1), Carl Kingsford. (1) Department of Computer Science, University of Maryland, College Park, MD 20742, USA. gmarcais@umd.edu
Abstract
MOTIVATION: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities, allowing for a new parallel computational paradigm.

RESULTS: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

AVAILABILITY: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
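To make the k-mer counting task concrete, the following is a minimal single-threaded sketch in Python. It is not the Jellyfish implementation: it swaps the multithreaded, lock-free hash table for an ordinary `collections.Counter`, and the names `count_kmers`, `decode_kmer` and `BASE_CODE`, as well as the particular 2-bit base encoding, are assumptions made for illustration. It does show one reason a limit such as 31 bases is convenient: at 2 bits per base, a 31-mer occupies 62 bits and therefore fits in a single 64-bit word.

```python
from collections import Counter

# 2-bit code per base; this particular mapping is an assumption for illustration.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(sequence, k):
    """Count every k-mer of `sequence`, keyed by its 2-bit packed integer."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    counts = Counter()
    packed = 0  # rolling 2-bit encoding of the last (up to) k bases
    valid = 0   # number of consecutive valid bases seen so far
    for base in sequence.upper():
        code = BASE_CODE.get(base)
        if code is None:  # ambiguous base such as 'N': restart the window
            packed, valid = 0, 0
            continue
        packed = ((packed << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[packed] += 1
    return counts


def decode_kmer(packed, k):
    """Turn a packed integer back into its k-letter string."""
    letters = "ACGT"
    return "".join(letters[(packed >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

For example, `count_kmers("ACGTACGT", 3)` yields counts for the four distinct 3-mers ACG, CGT, GTA and TAC, which `decode_kmer` can map back to strings. A parallel counter along the lines the abstract describes would instead have several threads update a shared table concurrently; that part is not attempted here.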