Zachary S L Foster, Scott Chamberlain, Niklaus J Grünwald.
Abstract
The taxa R package provides a set of tools for defining and manipulating taxonomic data. The recent and widespread application of DNA sequencing to community composition studies is making large data sets with taxonomic information commonplace. However, compared to typical tabular data, this information is encoded in many different ways and the hierarchical nature of taxonomic classifications makes it difficult to work with. There are many R packages that use taxonomic data to varying degrees but there is currently no cross-package standard for how this information is encoded and manipulated. We developed the R package taxa to provide a robust and flexible solution to storing and manipulating taxonomic data in R and any application-specific information associated with it. Taxa provides parsers that can read common sources of taxonomic information (taxon IDs, sequence IDs, taxon names, and classifications) from nearly any format while preserving associated data. Once parsed, the taxonomic data and any associated data can be manipulated using a cohesive set of functions modeled after the popular R package dplyr. These functions take into account the hierarchical nature of taxa and can modify the taxonomy or associated data in such a way that both are kept in sync. Taxa is currently being used by the metacoder and taxize packages, which provide broadly useful functionality that we hope will speed adoption by users and developers.
Keywords: R language; R package; metacoder; rOpenSci; taxa; taxize; taxonomy
Year: 2018 PMID: 29707201 PMCID: PMC5887078 DOI: 10.12688/f1000research.14013.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. A class diagram representing the relationship between classes implemented in the taxa package.
Diamond-tipped arrows indicate that objects of a lower class are used in a higher class. For example, a database object can be stored in the taxon_rank, taxon_name, or taxon_id objects. A standard arrow indicates that the lower class is inherited by the higher class. For example, the taxmap class inherits the taxonomy class. An asterisk indicates that an object (e.g. a database object) can be replaced by a simple character vector. A question mark indicates that the information is optional.
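The two kinds of relationship in the diagram, composition and inheritance, can be illustrated with a small language-neutral sketch. The Python below is not the taxa package's R implementation; the field names, and the choice of which fields are optional, are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Database:
    name: str
    url: Optional[str] = None  # hypothetical optional field


# The asterisk in the diagram: a database may be a full object or plain text.
DatabaseRef = Union[Database, str, None]


@dataclass
class TaxonName:
    name: str
    database: DatabaseRef = None


@dataclass
class TaxonRank:
    name: str
    database: DatabaseRef = None


@dataclass
class TaxonId:
    id: str
    database: DatabaseRef = None


@dataclass
class Taxon:
    # Composition: a taxon is built from name, rank, and ID objects.
    name: TaxonName
    rank: Optional[TaxonRank] = None  # assumed optional for this sketch
    id: Optional[TaxonId] = None      # assumed optional for this sketch


@dataclass
class Taxonomy:
    taxa: dict = field(default_factory=dict)   # key -> unique Taxon
    edges: list = field(default_factory=list)  # (parent key or None, child key)


@dataclass
class Taxmap(Taxonomy):
    # Inheritance: a taxmap is a taxonomy plus user-defined data sets.
    data: dict = field(default_factory=dict)


pinus = Taxon(TaxonName("Pinus", database="ncbi"), rank=TaxonRank("genus"))
tm = Taxmap(taxa={"a": pinus}, edges=[(None, "a")], data={"counts": {"a": 3}})
```

Because `Taxmap` subclasses `Taxonomy`, anything written against the taxonomy structure also accepts a taxmap, which is the property the standard arrow in the diagram expresses.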
Figure 2. A table for determining how to parse different sources of taxonomic information using the taxa package.
The rows correspond to the common sources of taxonomic information: full taxonomic classifications encoded in text, taxon IDs from a database, taxon names (a single rank), and NCBI sequence IDs. The columns correspond to the different formats the information can be encoded in: as a simple vector, as columns in a table, and as a piece of a complex string (e.g. a FASTA header). In the case of tables and complex strings, other information associated with the taxa can be preserved in the parsed result, as is done in the “use cases” example below. Each cell in the table shows how to parse a given taxonomic information source in a given format using one of the three parsing functions: parse_tax_data, lookup_tax_data, or extract_tax_data.
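To make the simplest case concrete (full classifications encoded as text, with associated data preserved), here is a minimal Python sketch of the general idea. It is not the implementation of parse_tax_data or any other taxa function; the function name `parse_classifications`, the `;` separator, and the example records are assumptions for illustration.

```python
from collections import defaultdict


def parse_classifications(records, sep=";"):
    """Build a tree of unique taxa from classification strings.

    records: iterable of (classification_string, payload) pairs.
    Returns (parent_of, observations):
      parent_of maps each taxon path (a tuple of names) to its parent
      path, or None for a root;
      observations maps the most specific taxon path of each record to
      the payloads attached to it.
    """
    parent_of = {}
    observations = defaultdict(list)
    for classification, payload in records:
        names = [n.strip() for n in classification.split(sep) if n.strip()]
        if not names:
            continue  # nothing to parse in this record
        # Register every prefix of the classification as a taxon.
        for depth in range(1, len(names) + 1):
            path = tuple(names[:depth])
            parent_of.setdefault(path, path[:-1] or None)
        # Keep the associated data linked to the most specific taxon.
        observations[tuple(names)].append(payload)
    return parent_of, dict(observations)


# Made-up example records.
records = [
    ("Plantae;Tracheophyta;Pinales", {"record": 1}),
    ("Plantae;Tracheophyta;Poales", {"record": 2}),
    ("Plantae;Bryophyta", {"record": 3}),
]
parent_of, observations = parse_classifications(records)
```

Keying taxa by their full path rather than by name alone keeps taxa with the same name but different parents distinct, and shared prefixes such as `Plantae` are stored only once.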
Primary classes and functions found in taxa.
| Function | Description |
|---|---|
| • | A class that combines the classes containing the name, rank, and ID for a taxon. |
| • | A simple list of taxon objects in an arbitrary order. |
| • | A class that stores a list of nested taxa constituting a classification. |
| • | A simple list of hierarchy objects in an arbitrary order. |
| • | A class that stores a list of unique taxon objects and a tree structure. |
| • | A class that combines a taxonomy with user-defined tables, lists, or vectors |
| • | A "supertaxon" is a taxon of a coarser rank that encompasses the taxon of interest |
| • | Roots are taxa that lack a supertaxon. Likewise, leaves are taxa that lack |
| • | Returns the information about every observation from a user-defined data set for |
| • | Subset taxa or associated data in |
| • | Order taxon or observation data in |
| • | Randomly sample taxa or observation data in |
Figure 3. The result of the example analysis shown in the text.
Records of plant species occurrences in Oregon are downloaded from the Global Biodiversity Information Facility (GBIF) using the rgbif package (Chamberlain, 2017). A taxa parser is then used to parse the table of GBIF data into a taxmap object, and a series of filters is applied. First, all occurrences that are not from preserved specimens, as well as any taxa that have no occurrences from preserved specimens, are removed. Then, all taxa at the species level are removed, but their occurrences are reassigned to the genus level. All taxa without names are then removed. In the final two filters, only orders within Tracheophyta with more than 10 subtaxa are preserved. The metacoder package is then used to create a heat tree (i.e. taxonomic tree) with color and size used to display the number of occurrences associated with each taxon at each level of the hierarchy.
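The step in which species are removed but their occurrences are reassigned to the genus level shows what keeping the taxonomy and its data in sync involves. The Python sketch below illustrates that idea only; it is not the taxa package's filtering code, and the function name `remove_taxa`, the rank lookup, and the occurrence labels are hypothetical.

```python
def remove_taxa(parent_of, observations, should_remove):
    """Drop taxa matching should_remove and move their observations upward.

    parent_of: dict mapping taxon -> parent taxon (None for a root).
    observations: dict mapping taxon -> list of observation records.
    should_remove: predicate called with a taxon.
    Observations of a removed taxon go to its nearest surviving ancestor
    and are dropped only when no ancestor survives.
    """
    def surviving_ancestor(taxon):
        current = parent_of[taxon]
        while current is not None and should_remove(current):
            current = parent_of[current]
        return current

    # Surviving taxa are re-attached to their nearest surviving ancestor.
    new_parent_of = {
        taxon: surviving_ancestor(taxon)
        for taxon in parent_of
        if not should_remove(taxon)
    }
    new_observations = {}
    for taxon, records in observations.items():
        target = surviving_ancestor(taxon) if should_remove(taxon) else taxon
        if target is not None:
            new_observations.setdefault(target, []).extend(records)
    return new_parent_of, new_observations


# Made-up example: two species of one genus, with placeholder occurrences.
ranks = {
    "Pinales": "order",
    "Pinus": "genus",
    "Pinus ponderosa": "species",
    "Pinus contorta": "species",
}
parent_of = {
    "Pinales": None,
    "Pinus": "Pinales",
    "Pinus ponderosa": "Pinus",
    "Pinus contorta": "Pinus",
}
observations = {
    "Pinus ponderosa": ["occ1", "occ2"],
    "Pinus contorta": ["occ3"],
}
genus_tree, genus_obs = remove_taxa(
    parent_of, observations, lambda taxon: ranks[taxon] == "species"
)
```

After the call, the species are gone from the tree and all three placeholder occurrences are attached to `Pinus`, so per-taxon occurrence counts at the genus level and above are unchanged by the filter.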