Vanessa R Marcelino, Philip T L C Clausen, Jan P Buchmann, Michelle Wille, Jonathan R Iredell, Wieland Meyer, Ole Lund, Tania C Sorrell, Edward C Holmes.
Abstract
There is an increasing demand for accurate and fast metagenome classifiers that can identify not only bacteria but all members of a microbial community. We used a recently developed concept in read mapping to build a highly accurate metagenomic classification pipeline named CCMetagen. The pipeline substantially outperforms other commonly used software in identifying bacteria and fungi, and it can efficiently use the entire NCBI nucleotide collection as a reference to detect species with incomplete genome data from all biological kingdoms. CCMetagen is user-friendly, and its results can be easily integrated into microbial community analysis software for streamlined and automated microbiome studies.
Keywords: ConClave sorting; Fungi; Metagenomic classifier; Microbiome
Year: 2020 PMID: 32345331 PMCID: PMC7189439 DOI: 10.1186/s13059-020-02014-2
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of the ConClave sorting scheme applied to species identification in metagenomic data sets. The figure represents a data set containing 5 sequence reads (4 bp) and two closely related reference sequences (templates), including a true positive (Ref. 1) and a potential false positive (Ref. 2). a Commonly used read mappers yield a high number of false positives because reads can be randomly assigned to closely related reference sequences sharing identical fragments spanning the whole sequence read (represented by the ATATT region). b The KMA aligner minimizes this problem by scoring reference sequences based on all possible mappings of all reads and then choosing the templates with the highest scores. Coupled with KMA, CCMetagen produces highly accurate taxonomic assignments of reads in metagenomic data sets in user-friendly formats
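The two-pass idea described in the caption (score every template using all possible read mappings, then resolve ambiguous reads in favour of the best-scoring template) can be illustrated with a toy sketch. This is not KMA's implementation: the exact-substring matching, the function names, and the example sequences are all assumptions made for illustration only.

```python
from collections import Counter


def candidate_templates(read, templates):
    """Return names of all templates that contain the read as an exact substring.

    Exact matching is a simplification chosen for this toy example.
    """
    return [name for name, seq in templates.items() if read in seq]


def conclave_like_assign(reads, templates):
    """Toy two-pass assignment inspired by the scheme in Fig. 1b."""
    # Pass 1: every read contributes to the score of every template it could map to.
    scores = Counter()
    candidates = []
    for read in reads:
        hits = candidate_templates(read, templates)
        candidates.append(hits)
        scores.update(hits)

    # Pass 2: an ambiguous read goes to its highest-scoring candidate template.
    # Ties are broken alphabetically (max keeps the first maximal element).
    assignments = []
    for read, hits in zip(reads, candidates):
        if hits:
            best = max(sorted(hits), key=lambda name: scores[name])
            assignments.append((read, best))
    return scores, assignments


# Hypothetical data: both templates share the fragment ATATT.
templates = {
    "ref1": "ACGTATATTGGC",  # true positive
    "ref2": "TTCAATATTCCA",  # potential false positive
}
reads = ["ACGTA", "GTATA", "ATATT", "TATTG", "ATTGG"]

scores, assignments = conclave_like_assign(reads, templates)
for read, template in assignments:
    print(read, "->", template)
```

In this toy data set only the read ATATT is ambiguous. Because ref1 collects support from all five reads and ref2 from one, the ambiguous read is assigned to ref1 instead of being placed at random.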
Fig. 2 The CCMetagen pipeline has a higher F1 score than other metagenomic classification methods for all taxonomic ranks. The two points for each program and taxonomic rank represent the results using a simulated metagenome and a metatranscriptome sample of a fungal community. a Results using the whole NCBI nt collection as a reference database. b Results using the RefSeq-bf (bacteria and fungi) database, containing all bacterial and fungal genomes available. c Partial RefSeq database containing only some of the fungal species currently present in the RefSeq-bf database, mimicking the effects of dealing with species without representatives in reference data sets. In this case, Kraken2, Centrifuge, and KrakenUniq have overlapping results. Refer to Additional file 1: Figures S1 and S2 and Additional file 2 for more information, including precision and recall
CPU time (in minutes) required to analyze a simulated fungal metatranscriptome (mtt, ~ 9M PE reads) and a fungal metagenome (mtg, ~ 6.7M PE reads)
| Program | nt (mtt) | nt (mtg) | RefSeq-bf (mtt) | RefSeq-bf (mtg) | RefSeq-f-Partial (mtt) | RefSeq-f-Partial (mtg) |
|---|---|---|---|---|---|---|
| Kraken2 | 10.92 | 7.05 | 5.29 | 3.98 | 4.48 | 3.50 |
| CCMetagen* | 17.24 | 13.54 | 85.74 | 67.00 | 69.29 | 20.58 |
| Centrifuge | 40.11 | 27.54 | 23.70 | 19.41 | 16.67 | 16.10 |
| KrakenUniq | 74.11 | 74.94 | 43.33 | 40.85 | 29.65 | 21.04 |
*The CCMetagen time was calculated as the sum of the CPU time used by KMA and CCMetagen
Fig. 3 CCMetagen pipeline performance for bacterial classifications, compared with Kraken2, Centrifuge, and KrakenUniq. Precision (% of true positives), recall (% of taxa identified), and F1 scores represent averages across 10 simulated metagenome samples. Shaded areas indicate 75% confidence intervals
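For readers unfamiliar with the metrics named in the caption, the sketch below computes precision, recall, and F1 using their standard definitions. Treating each taxon as simply present or absent, and the taxon names used, are assumptions for illustration; the exact counting rules used in the study are not stated here.

```python
def precision_recall_f1(predicted, truth):
    """Standard precision, recall, and F1 over sets of taxon names.

    predicted: taxa reported by a classifier
    truth: taxa actually present in the sample
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # reported and present
    fp = len(predicted - truth)   # reported but absent
    fn = len(truth - predicted)   # present but not reported

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical taxa: 3 correct calls, 1 false positive, 2 missed taxa.
predicted = {"taxonA", "taxonB", "taxonC", "taxonD"}
truth = {"taxonA", "taxonB", "taxonC", "taxonE", "taxonF"}

precision, recall, f1 = precision_recall_f1(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Averaging these values across samples, as the caption describes for the 10 simulated metagenomes, would then be a plain mean of the per-sample results.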
Fig. 4 Snapshot of CCMetagen results for a spiked fungal community. This Krona graph shows the relative abundance of taxa at various taxonomic levels, color-coded according to their taxonomic classification at lower ranks. Here, fungal taxa appear in shades of red and bacterial taxa in shades of green. The Krona html file can be opened and interactively inspected in a web browser. Each circle represents a taxonomic level, and the user can click on it for a representation of the relative abundance at that taxonomic rank. For a detailed list of taxa, refer to Additional file 5
Fig. 5 Microbial families in the microbiome of wild birds. The 20 most abundant families are shown, with fungal families indicated in bold. For a full list of taxa, refer to Additional file 6. A tutorial and R scripts to reproduce these analyses are available on the CCMetagen website