Weighted minimizer sampling improves long read mapping.

Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Brian P Walenz, Sergey Koren, Adam M Phillippy

Abstract

MOTIVATION: In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.
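
The match guarantee described above can be illustrated with a minimal sketch of plain (unweighted) minimizer sampling. The hash function, the k-mer size k = 5, the window size w = 4 and both example sequences are assumptions chosen for illustration; they are not taken from Minimap2, Mashmap or Winnowmap. A window is w consecutive k-mers, so any shared substring of at least w + k - 1 bases contains a full window and therefore yields a common minimizer.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Return the set of (position, k-mer) pairs chosen as window minimizers.

    A window is w consecutive k-mers; the k-mer with the smallest hash wins,
    with ties broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (kmer_hash(kmers[i]), i))
        picked.add((best, kmers[best]))
    return picked


# Two hypothetical sequences sharing a 20-base substring (>= w + k - 1 = 8).
shared = "TTGACCATGCAAGTCCGATT"
seq_a = "ACGTACGGA" + shared + "CCTAGGA"
seq_b = "GGCATTC" + shared + "ATATCGC"
common = ({m for _, m in minimizers(seq_a, 5, 4)}
          & {m for _, m in minimizers(seq_b, 5, 4)})
print(sorted(common))
```

If a tool drops the most frequent minimizers after this step, a window whose only minimizer was dropped no longer contributes a match, which is how the guarantee is lost.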
RESULTS: We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long-read mapper, Minimap2. Our results demonstrate a reduction in the mapping error rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.
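
One way to picture weighted sampling is the sketch below. It is not Winnowmap's implementation: the ordering key -ln(u)/weight is a standard "exponential race" trick, under which a k-mer's chance of being the window minimum is proportional to its weight if the hash values behave like independent uniform draws, and it is swapped in here as an assumption. The helper names (unit_hash, make_weight, weighted_minimizers), the count threshold and the low weight 0.125 are all hypothetical. Because every window still selects a k-mer and nothing is discarded afterwards, a shared substring of at least w + k - 1 bases still yields a common minimizer.

```python
import hashlib
import math
from collections import Counter


def unit_hash(kmer):
    """Map a k-mer to a deterministic pseudo-random value in (0, 1]."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)


def make_weight(reference, k, max_count=2, low=0.125):
    """Hypothetical weighting: k-mers seen more than max_count times get `low`."""
    counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    return lambda kmer: low if counts[kmer] > max_count else 1.0


def weighted_minimizers(seq, k, w, weight):
    """Pick, in each window of w k-mers, the k-mer minimizing -ln(u) / weight."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def order(i):
        # Lower weight inflates the key, so the k-mer is chosen less often.
        return (-math.log(unit_hash(kmers[i])) / weight(kmers[i]), i)

    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=order)
        picked.add((best, kmers[best]))
    return picked


# Hypothetical reference with a short tandem repeat, and a read drawn from it.
ref = "ACGTACGTACGTACGTTTGACCATGCAAGTCCGATTAGC"
read = "GGG" + ref[10:30] + "CCC"
weight = make_weight(ref, 5)
ref_mins = weighted_minimizers(ref, 5, 4, weight)
read_mins = weighted_minimizers(read, 5, 4, weight)
print(sorted({m for _, m in ref_mins} & {m for _, m in read_mins}))
```

The weights must depend only on the k-mer itself (here, its count in the reference) so that both sequences order the k-mers of a shared window identically.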
AVAILABILITY AND IMPLEMENTATION: Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.

Published by Oxford University Press 2020.

Year: 2020; PMID: 32657365; PMCID: PMC7355284; DOI: 10.1093/bioinformatics/btaa435

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937


