Johan Bengtsson-Palme, Rodney T Richardson, Marco Meola, Christian Wurzbacher, Émilie D Tremblay, Kaisa Thorell, Kärt Kanger, K Martin Eriksson, Guillaume J Bilodeau, Reed M Johnson, Martin Hartmann, R Henrik Nilsson.
Abstract
Motivation: Correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. However, no genetic marker gives sufficient performance across all the biological kingdoms, hampering studies of taxonomic diversity in many groups of organisms. This has led to the adoption of a range of genetic markers for DNA metabarcoding. While many taxonomic classification software tools can be re-trained on these genetic markers, they are often designed with assumptions that impair their utility on genes other than the SSU and LSU rRNA. Here, we present an update to Metaxa2 that enables the use of any genetic marker for taxonomic classification of metagenome and amplicon sequence data.
Year: 2018 PMID: 29912385 PMCID: PMC6247927 DOI: 10.1093/bioinformatics/bty482
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Self-evaluated performance of the Metaxa2 Database Builder. Evaluation was performed in all operating modes (conserved, divergent and hybrid) on 10 different DNA barcoding regions. (A) Proportion of assigned sequences classified to the correct order (circles), family (diamonds) and genus (triangles). (B) Accuracy, i.e. the proportion of correctly assigned sequences multiplied by the proportion of sequences included in the final classification databases (see Supplementary Fig. S2). The ATP9-NAD9 genetic marker is not shown, because it only had relevant taxonomic differences at the species level.
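The accuracy measure described for panel B of Fig. 1 can be written as a short calculation. The following is a minimal sketch, assuming both proportions are expressed as fractions between 0 and 1. The function name `accuracy`, its parameter names and the example inputs are hypothetical illustrations, not part of Metaxa2 and not values from the paper.

```python
def accuracy(prop_correct: float, prop_included: float) -> float:
    """Accuracy as worded in the Fig. 1B caption.

    prop_correct:  proportion of assigned sequences that were correctly assigned
    prop_included: proportion of sequences included in the final classification database
    Both names are hypothetical; both values are assumed to lie in [0, 1].
    """
    for name, value in (("prop_correct", prop_correct), ("prop_included", prop_included)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return prop_correct * prop_included


# Illustrative inputs only (not results from the paper).
print(accuracy(0.5, 0.5))
```

Under this definition, a database that classifies every retained sequence correctly but retains only half of the input sequences scores no higher than one that retains everything and classifies half of it correctly.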
Fig. 2. Performance of the Metaxa2 Database Builder on sequence fragments. Family-level Metaxa2 performance on randomly generated 150-nucleotide fragments originating from the sequence datasets used to build the respective databases in the three different modes (conserved, divergent and hybrid). (A) Proportions of fragments assigned to the correct taxonomic family. (B) Proportions of fragments assigned to an incorrect family even though sequences from the correct family were present in the database. (C) Proportions of fragments not assigned, or not recognized as belonging to the investigated barcoding region, at the family level. (D) Family-level overpredictions, i.e. the proportions of sequence fragments belonging to a family not present in the final database, which were still assigned to a (different) family by Metaxa2. The total proportion of erroneous assignments (regardless of type) can be obtained by summing the numbers of incorrect assignments (B) and overpredictions (D). Note that the ATP9-NAD9 dataset is only used for species identification and thus this marker would be expected to show perfect performance on the family level. Note also that the Y-axis scales are different for B and D compared to A and C.
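The Fig. 2 caption states that the total proportion of erroneous assignments is the sum of panels B and D. A minimal sketch of that arithmetic follows, assuming both inputs are fractions of the same set of fragments. The function name `total_erroneous` and the example inputs are hypothetical and are not values from the paper.

```python
def total_erroneous(prop_incorrect: float, prop_overpredicted: float) -> float:
    """Total erroneous assignments as described in the Fig. 2 caption.

    prop_incorrect:     panel B, fragments assigned to an incorrect family
    prop_overpredicted: panel D, fragments from a family absent from the
                        database that were still assigned to some family
    Assumes both are fractions (0 to 1) of the same fragment set.
    """
    for name, value in (("prop_incorrect", prop_incorrect),
                        ("prop_overpredicted", prop_overpredicted)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return prop_incorrect + prop_overpredicted


# Illustrative inputs only (not results from the paper).
print(total_erroneous(0.25, 0.125))
```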