Huiguang Yi, Yanling Lin, Chengqi Lin, Wenfei Jin.
Abstract
Here, we develop k-mer substring space decomposition (Kssd), a sketching technique that is significantly faster and more accurate than current sketching methods. We show, on simulated and real data, that it is the only method that can be used for large-scale dataset comparisons at population resolution. Using Kssd, we prioritize references for all 1,019,179 bacterial whole-genome sequencing (WGS) runs from the NCBI Sequence Read Archive and find misidentification or contamination in 6,164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
Keywords: Distance estimation; K-mer; Mislabeling detection; Sequence comparison; Sketching method
Year: 2021 PMID: 33726811 PMCID: PMC7962209 DOI: 10.1186/s13059-021-02303-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583