Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis.

Huiguang Yi¹,², Yanling Lin¹, Chengqi Lin², Wenfei Jin³.

Abstract

Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacterial whole-genome sequencing (WGS) runs from the NCBI Sequence Read Archive and find misidentification or contamination in 6,164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
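
As a rough illustration of the general idea behind k-mer sketching (not Kssd's own algorithm, which this record does not describe), the following minimal Python sketch keeps only the k-mers whose hash lands in a fixed fraction of hash space and compares two sequences by the Jaccard index of the retained sets. The function names, the use of BLAKE2b as the hash, and the parameters `k` and `keep_one_in` are assumptions made for illustration only.

```python
import hashlib


def kmers(seq, k):
    """Yield every length-k substring of seq (uppercased; no reverse-complement handling)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=8, keep_one_in=4):
    """Return the subset of k-mers whose hash is divisible by keep_one_in.

    Because the keep/drop decision depends only on the k-mer itself,
    the same k-mer is kept (or dropped) in every sequence, so the
    reduced sets remain directly comparable.
    """
    kept = set()
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        if int.from_bytes(digest, "big") % keep_one_in == 0:
            kept.add(kmer)
    return kept


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real genomes.
    seq_a = "ACGTTAGCCGATAGCTAGGCTAACGTTGCATGCCGTA"
    seq_b = "ACGTTAGCCGATAGCTAGGCTTACGTTGCATGCCGTA"
    print(jaccard(sketch(seq_a), sketch(seq_b)))
```

The retained sets are smaller than the full k-mer sets by roughly the factor `keep_one_in`, which is what makes comparisons across many datasets cheap; a production tool would also need to handle reverse complements and ambiguous bases.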

Keywords: Distance estimation; K-mer; Mislabeling detection; Sequence comparison; Sketching method

Year: 2021 PMID: 33726811 PMCID: PMC7962209 DOI: 10.1186/s13059-021-02303-4

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

17 in total

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Authors: Konstantin Berlin; Sergey Koren; Chen-Shan Chin; James P Drake; Jane M Landolin; Adam M Phillippy
Journal: Nat Biotechnol Date: 2015-05-25 Impact factor: 54.908

3. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.

Authors: Heng Li
Journal: Bioinformatics Date: 2016-03-19 Impact factor: 6.937

4. Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities.

Authors: Migun Shakya; Christopher Quince; James H Campbell; Zamin K Yang; Christopher W Schadt; Mircea Podar
Journal: Environ Microbiol Date: 2013-02-06 Impact factor: 5.491

5. Large-scale sequence comparisons with sourmash.

Authors: N Tessa Pierce; Luiz Irber; Taylor Reiter; Phillip Brooks; C Titus Brown
Journal: F1000Res Date: 2019-07-04

6. MBV: a method to solve sample mislabeling and detect technical bias in large combined genotype and sequencing assay datasets.

Authors: Alexandre Fort; Nikolaos I Panousis; Marco Garieri; Stylianos E Antonarakis; Tuuli Lappalainen; Emmanouil T Dermitzakis; Olivier Delaneau
Journal: Bioinformatics Date: 2017-06-15 Impact factor: 6.937

7. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries.

8. A fast adaptive algorithm for computing whole-genome homology maps.

9. Dashing: fast and accurate genomic distances with HyperLogLog.

10. Fast and accurate short read alignment with Burrows-Wheeler transform.