MOTIVATION: Metagenomic shotgun sequencing studies are becoming increasingly popular with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes corresponding software tools suffer from either long runtimes, large memory requirements or low accuracy. RESULTS: We introduce MetaCache-a novel software for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both, probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory compared to the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves when increasing the amount of utilized reference genome data. AVAILABILITY AND IMPLEMENTATION: MetaCache is open source software written in C ++ and can be downloaded at http://github.com/muellan/metacache. CONTACT: bertil.schmidt@uni-mainz.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Metagenomic shotgun sequencing studies are becoming increasingly popular with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes corresponding software tools suffer from either long runtimes, large memory requirements or low accuracy. RESULTS: We introduce MetaCache-a novel software for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both, probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory compared to the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves when increasing the amount of utilized reference genome data. AVAILABILITY AND IMPLEMENTATION: MetaCache is open source software written in C ++ and can be downloaded at http://github.com/muellan/metacache. CONTACT: bertil.schmidt@uni-mainz.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Authors: Konstantin Bob; David Teschner; Thomas Kemmer; David Gomez-Zepeda; Stefan Tenzer; Bertil Schmidt; Andreas Hildebrandt Journal: BMC Bioinformatics Date: 2022-07-20 Impact factor: 3.307