MOTIVATION: In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, they are guaranteed to share a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) discard the most frequently occurring minimizers from the genome to avoid excessive false positives. Doing so forfeits the underlying guarantee and reduces accuracy in repetitive genomic regions.

RESULTS: We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long-read mapper, Minimap2. Our results demonstrate a reduction in the mapping error rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.

AVAILABILITY AND IMPLEMENTATION: Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.

Published by Oxford University Press 2020.
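The idea of weighted minimizer sampling described in the abstract can be illustrated with a minimal Python sketch. It is not the Winnowmap implementation: the function names (`kmer_hash01`, `weighted_minimizers`), the BLAKE2b-based hash, and the scoring rule (raise a hash value u in (0, 1] to the power 1/weight and keep the highest score per window) are assumptions chosen for illustration. Under an idealized uniform hash, this rule selects a k-mer within a window with probability proportional to its weight, so down-weighted k-mers are picked less often but are never excluded outright.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def kmer_hash01(kmer: str) -> float:
    """Map a k-mer to a deterministic pseudo-random value in (0, 1]."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)


def weighted_minimizers(
    seq: str,
    k: int,
    w: int,
    weights: Optional[Dict[str, float]] = None,
    default_weight: float = 1.0,
) -> List[Tuple[int, str]]:
    """Select one k-mer per window of w consecutive k-mers.

    Hypothetical scoring rule (an assumption, not taken from the source):
    score = u ** (1 / weight), with u the hashed value of the k-mer.
    Weights must be positive; a weight below 1 shrinks the score, so the
    k-mer wins its window less often.
    """
    weights = weights or {}
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    scores = [
        kmer_hash01(km) ** (1.0 / weights.get(km, default_weight))
        for km in kmers
    ]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Highest score wins; ties are broken in favor of the leftmost k-mer.
        best = max(range(start, start + w), key=lambda i: (scores[i], -i))
        selected.add(best)
    return sorted((i, kmers[i]) for i in selected)
```

Because a score depends only on the k-mer and its weight, two sequences that share a substring of length w + k - 1 contain an identical window and therefore select the same k-mer from it, which is the match guarantee the abstract refers to. A call such as `weighted_minimizers(seq, k=5, w=10, weights={"AAAAA": 0.125})` would down-weight a single frequent k-mer while leaving all others at the default weight of 1.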