Sylvie Buffet-Bataillon, Guillaume Rizk, Vincent Cattoir, Mohamed Sassi, Vincent Thibault, Jennifer Del Giudice, Jean-Pierre Gangneux.
Abstract
Metagenomics analysis is now routinely used for clinical diagnosis in several diseases, and we need confidence in interpreting metagenomics analysis of microbiota. Particularly from the perspective of clinical microbiology, we consider that a major milestone for microbiota studies would be an approach consisting of processing steps and quality assessment for interpreting metagenomics data used for diagnosis. Here, we propose a methodology for taxon identification and abundance assessment of microbial shotgun sequencing data that is well suited to the clinical setting. Quality-control processing steps have been developed in order (i) to exclude low-quality reads and sequences, (ii) to optimize abundance thresholds and profiles, (iii) to combine classifiers and reference databases for the best classification of species and abundance profiles for both prokaryotic and eukaryotic sequences, and (iv) to introduce an external positive control. We find that the best strategy is to use a pipeline that combines different but complementary classifiers, such as Kraken2/Bracken and Kaiju. Such improved quality assessment will have a major impact on the robustness of biological and clinical conclusions drawn from metagenomic studies.
Keywords: Bracken; Kaiju; Kraken2; clinical microbiology; metagenomics; microbiome; mycobiome; quality assessment; shotgun; virome
Year: 2022 PMID: 35456762 PMCID: PMC9026403 DOI: 10.3390/microorganisms10040711
Source DB: PubMed Journal: Microorganisms ISSN: 2076-2607
Figure 1. Processing steps and quality assessment for metagenomics data.
Figure 2. Precision/recall curves for the classifiers Kraken2 and Kaiju with the simBA525 dataset according to Ye et al. in 2019 [3]. Each point in the curves represents the precision and recall score for a specific read abundance threshold, calculated on a simulated dataset. We observed a sharp decrease in precision when the threshold was below 500 reads per species, indicating many false-positive species with low abundance. The figure also shows the cutoff values for recall of 0.8 and 0.95 at 2500 and 10 reads, respectively.
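To make the idea behind the curves concrete, here is a minimal sketch of how precision and recall could be computed for a series of read abundance thresholds. It assumes a table of reads assigned per species by a classifier and a known set of truly present species, as in a simulated dataset. The function name `precision_recall_at_thresholds` and the toy counts are hypothetical illustrations, not the authors' code or data.

```python
from typing import Dict, Iterable, List, Set, Tuple


def precision_recall_at_thresholds(
    read_counts: Dict[str, int],
    true_species: Set[str],
    thresholds: Iterable[int],
) -> List[Tuple[int, float, float]]:
    """Return (threshold, precision, recall) for each read abundance threshold.

    A species is "called" when its assigned read count is >= the threshold.
    """
    results = []
    for t in thresholds:
        called = {sp for sp, n in read_counts.items() if n >= t}
        tp = len(called & true_species)  # true positives
        # Convention (an assumption): precision is 1.0 when nothing is called.
        precision = tp / len(called) if called else 1.0
        recall = tp / len(true_species) if true_species else 0.0
        results.append((t, precision, recall))
    return results


# Hypothetical toy data: reads assigned per species by a classifier.
read_counts = {
    "Escherichia coli": 12000,
    "Staphylococcus aureus": 3000,
    "Candida albicans": 600,
    "Klebsiella pneumoniae": 40,  # truly present, low abundance
    "Shigella flexneri": 150,     # false positive
    "Salmonella enterica": 20,    # false positive
}
true_species = {
    "Escherichia coli",
    "Staphylococcus aureus",
    "Candida albicans",
    "Klebsiella pneumoniae",
}

curve = precision_recall_at_thresholds(read_counts, true_species, [10, 100, 500, 2500])
for threshold, precision, recall in curve:
    print(f"threshold={threshold:>5}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy example, raising the threshold removes the low-abundance false positives (precision rises) but also drops the truly present low-abundance species (recall falls), which is the trade-off the curves illustrate.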