Theo R Allnutt, Alexandra J Roth-Schulze, Leonard C Harrison.
Abstract
BACKGROUND: Except for bacteria, the taxonomic diversity of the human fecal metagenome has not been widely studied, despite the potential importance of viruses and eukaryotes. Widely used bioinformatic tools contain limited numbers of non-bacterial species in their databases compared to available genomic sequences, and their methodologies do not favour classification of rare sequences, which may represent only a small fraction of their parent genome. In seeking to optimise identification of non-bacterial species, we evaluated five widely used metagenome classifier programs (BURST, Kraken2, Centrifuge, MetaPhlAn2 and CCMetagen) for their ability to correctly assign and count simulated bacterial, viral and eukaryotic DNA sequence reads, including the effect of the taxonomic order in which bacteria, viruses and eukaryotes are analysed and the effect of sequencing depth.
Keywords: Benchmarking; Classifier; Eukaryotes; Metagenomics; Viruses
Year: 2021 PMID: 34107881 PMCID: PMC8188691 DOI: 10.1186/s12859-021-04212-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Time required by each program to classify 10 million 150 bp reads into each taxonomic group (hours:minutes:seconds)
| Program | Bacteria | Viruses | Eukaryotes | Total |
|---|---|---|---|---|
| BURST | 0:08:54 | 0:02:28 | 0:16:58 | 0:28:20 |
| Kraken2 | 0:09:56 | 0:00:24 | 0:10:04 | 0:20:24 |
| Centrifuge | 0:02:01 | 0:02:45 | 0:14:43 | 0:19:29 |
| Ublast | 10:33:50 | 1:47:55 | 18:01:24 | 30:23:09 |
| CCMetagen | 0:04:47 | 0:00:28 | 0:50:29 | 0:55:44 |
| MetaPhlAn2 | 0:05:12 | – | – | – |
Hardware: 24 CPUs (Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz)
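The Total column of the run-time table can be checked by converting each h:mm:ss entry to seconds and summing across the three taxonomic groups. The following is a minimal sketch of that arithmetic; the helper names (`to_seconds`, `to_hms`, `total_time`) are illustrative and not part of any of the benchmarked programs. MetaPhlAn2 is left out because the table reports a time for bacteria only.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'h:mm:ss'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Per-group run times copied from the table (bacteria, viruses, eukaryotes).
RUN_TIMES = {
    "BURST": ("0:08:54", "0:02:28", "0:16:58"),
    "Kraken2": ("0:09:56", "0:00:24", "0:10:04"),
    "Centrifuge": ("0:02:01", "0:02:45", "0:14:43"),
    "Ublast": ("10:33:50", "1:47:55", "18:01:24"),
    "CCMetagen": ("0:04:47", "0:00:28", "0:50:29"),
}


def total_time(program: str) -> str:
    """Sum the three per-group run times for one program."""
    return to_hms(sum(to_seconds(t) for t in RUN_TIMES[program]))


for program in RUN_TIMES:
    print(program, total_time(program))
```

Each computed sum can then be compared against the Total column entry for the same program.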
Database size (GB) required for each program and taxonomic group
| Program | Bacteria | Viruses | Eukaryotes |
|---|---|---|---|
| BURST | 48.2 | 4.9 | 119 |
| Kraken2 | 22 | 0.17 | 218 |
| Centrifuge | 6.3 | 0.057 | 115 |
| Ublast | 15 | 0.1 | 200 |
| CCMetagen | 17.3 | 0.19 | 163 |
| MetaPhlAn2 | 0.65 | – | – |
Fig. 1 Recall determined by classifier programs using an unordered analysis approach
Fig. 2 Precision determined by classifier programs using an unordered analysis approach
Mean (standard deviation) of recall of each classifier for bacteria, viruses and eukaryotes using the unordered approach
| Classifier | Bacteria | Virus | Eukaryotes |
|---|---|---|---|
| BURST | 0.86 (0.017) | 0.664 (0.045) | 0.514 (0.04) |
| Kraken2 | 0.852 (0.015) | 0.697 (0.057) | 0.748 (0.049) |
| Centrifuge | 0.571 (0.02) | 0.626 (0.045) | 0.69 (0.026) |
| CCMetagen | 0.347 (0.022) | 0.566 (0.043) | 0.07 (0.022) |
| MetaPhlAn2 | 0.396 (0.039) | 0.237 (0.042) | – |
Mean (standard deviation) of precision of each classifier for bacteria, viruses and eukaryotes using the unordered approach
| Classifier | Bacteria | Virus | Eukaryotes |
|---|---|---|---|
| BURST | 0.684 (0.029) | 0.623 (0.038) | 0.716 (0.05) |
| Kraken2 | 0.586 (0.033) | 0.698 (0.036) | 0.589 (0.035) |
| Centrifuge | 0.467 (0.032) | 0.673 (0.029) | 0.452 (0.019) |
| CCMetagen | 0.692 (0.038) | 0.638 (0.05) | 0.328 (0.098) |
| MetaPhlAn2 | 0.372 (0.04) | 0.536 (0.076) | – |
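This record does not state how recall and precision were calculated, so the sketch below assumes the standard species-level definitions: recall is the fraction of expected (simulated) species that a classifier reports, and precision is the fraction of reported species that were expected. The species names and the function names `precision_recall` and `summarise` are hypothetical. The use of the sample standard deviation across replicates is also an assumption.

```python
from statistics import mean, stdev
from typing import List, Set, Tuple


def precision_recall(expected: Set[str], predicted: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one simulated sample.

    Assumes the standard definitions; the study's exact formulae are
    not given in this record.
    """
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def summarise(values: List[float]) -> str:
    """Format replicate values as 'mean (standard deviation)', as in the tables."""
    return f"{mean(values):.3f} ({stdev(values):.3f})"


# Hypothetical example: four species were simulated, three were reported,
# and one of the reported species is a false positive.
expected = {"species_A", "species_B", "species_C", "species_D"}
predicted = {"species_A", "species_B", "species_E"}
precision, recall = precision_recall(expected, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Applying `summarise` to the per-replicate values for one classifier and taxonomic group gives a string in the same "mean (standard deviation)" layout used in the tables.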
Fig. 3 Precision determined with BURST classifier for each taxonomic group, using seven different orders of analysis: a = viruses > bacteria > eukaryotes; b = viruses > eukaryotes > bacteria; c = bacteria > eukaryotes > viruses; d = bacteria > viruses > eukaryotes; e = eukaryotes > bacteria > viruses; f = eukaryotes > viruses > bacteria; g = unordered
Fig. 4 Precision determined by each classifier program after ordered analysis ‘d’ (bacteria > viruses > eukaryotes) and unordered analysis
Precision mean (standard deviation) determined with each classifier program after ordered analysis ‘d’ (bacteria > viruses > eukaryotes)
| Classifier | Bacteria | Viruses | Eukaryotes |
|---|---|---|---|
| BURST | 0.684 (0.029) | 0.65 (0.041) | 0.854 (0.06) |
| Kraken2 | 0.586 (0.033) | 0.667 (0.053) | 0.828 (0.052) |
| Centrifuge | 0.467 (0.032) | 0.767 (0.035) | 0.696 (0.033) |
| CCMetagen | 0.692 (0.038) | 0.642 (0.051) | 0.413 (0.1) |
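This record does not describe how an ordered analysis is carried out. One plausible reading, assumed here, is a cascade: reads are classified against the first taxonomic group's database, and only the reads left unassigned are passed to the next group. The sketch below illustrates that assumption with toy lookup tables standing in for real classifiers. The read identifiers, taxon names and the `ordered_analysis` function are all hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, List, Optional, Sequence, Tuple

GROUPS = ("bacteria", "viruses", "eukaryotes")
Classifier = Callable[[str], Optional[str]]

# The six ordered analyses (a to f in Fig. 3) are the permutations of the groups.
ALL_ORDERS = list(permutations(GROUPS))


def ordered_analysis(
    reads: Sequence[str],
    classifiers: Dict[str, Classifier],
    order: Sequence[str],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Cascade reads through per-group classifiers in the given order.

    Assumption: a read assigned at one stage is not offered to later stages.
    Returns (assignments, reads left unassigned after the last stage).
    """
    assignments: Dict[str, Tuple[str, str]] = {}
    remaining = list(reads)
    for group in order:
        unassigned = []
        for read in remaining:
            taxon = classifiers[group](read)
            if taxon is None:
                unassigned.append(read)
            else:
                assignments[read] = (group, taxon)
        remaining = unassigned
    return assignments, remaining


# Toy stand-ins for real classifiers: dictionary lookups that return None on a miss.
toy_classifiers: Dict[str, Classifier] = {
    "bacteria": {"read_1": "hypothetical_bacterium", "read_2": "hypothetical_bacterium"}.get,
    "viruses": {"read_3": "hypothetical_phage"}.get,
    "eukaryotes": {"read_2": "hypothetical_yeast"}.get,
}
reads = ["read_1", "read_2", "read_3", "read_4"]

order_d = ("bacteria", "viruses", "eukaryotes")
order_e = ("eukaryotes", "bacteria", "viruses")
assigned_d, left_d = ordered_analysis(reads, toy_classifiers, order_d)
assigned_e, left_e = ordered_analysis(reads, toy_classifiers, order_e)
print(assigned_d["read_2"], assigned_e["read_2"])
```

In this toy setup `read_2` matches both the bacterial and the eukaryotic table, so its assignment depends on which group is analysed first. That ambiguity is the kind of effect an order-of-analysis comparison is designed to expose.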
Fig. 5 Effect of sequencing depth on precision determined by BURST classifier for each taxonomic group, ordered using method ‘d’ (bacteria > viruses > eukaryotes)
Fig. 6 Effect of increasing sequencing depth (grouped on x-axis) on the probability of detecting and classifying species within six expected relative abundance bins (key denotes lower limit of bin)