Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis.

Huiguang Yi, Yanling Lin, Chengqi Lin, Wenfei Jin.

Abstract

Here, we develop k-mer substring space decomposition (Kssd), a sketching technique that is significantly faster and more accurate than current sketching methods. We show, on simulated and real data, that it is the only method that can be used for large-scale dataset comparisons at population resolution. Using Kssd, we prioritize references for all 1,019,179 bacterial whole-genome sequencing (WGS) runs from the NCBI Sequence Read Archive and find misidentification or contamination in 6,164 of them. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
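
As a rough illustration of what "sketching" by sampling the k-mer space means in general, the following minimal Python sketch keeps only the k-mers whose hash falls in a fixed slice of hash space and compares two sequences on that reduced set. It is not the Kssd algorithm, whose details the abstract does not give. The hash-modulo sampling rule, the function names, and the parameters `k` and `rate` are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(kmer):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=8, rate=4):
    """Keep only k-mers whose hash lands in a fixed 1/rate slice of hash space.

    Assumption for illustration: a hash-modulo rule stands in for whatever
    sampling scheme a real tool uses. Because the rule depends only on the
    k-mer itself, every sequence is sampled from the same subspace.
    """
    return {km for km in kmers(seq, k) if stable_hash(km) % rate == 0}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    s1 = "ACGTACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGCTAGGATCCGATCGAT"
    s2 = "ACGTACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGCTAGGTTCCGATCGAT"
    print(jaccard(sketch(s1), sketch(s2)))
```

Because every sequence is reduced with the same rule, sketches stay directly comparable: a k-mer shared by two sequences is either kept in both sketches or dropped from both.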

Keywords:  Distance estimation; K-mer; Mislabeling detection; Sequence comparison; Sketching method

Year: 2021; PMID: 33726811; PMCID: PMC7962209; DOI: 10.1186/s13059-021-02303-4

Source DB: PubMed; Journal: Genome Biol; ISSN: 1474-7596; Impact factor: 13.583


