Colin J Carlson, Rory J Gibb, Gregory F Albery, Liam Brierley, Ryan P Connor, Tad A Dallas, Evan A Eskew, Anna C Fagre, Maxwell J Farrell, Hannah K Frank, Renata L Muylaert, Timothée Poisot, Angela L Rasmussen, Sadie J Ryan, Stephanie N Seifert.
Abstract
Data that catalogue viral diversity on Earth have been fragmented across sources, disciplines, formats, and various degrees of open sharing, posing challenges for research on macroecology, evolution, and public health. Here, we solve this problem by establishing a dynamically maintained database of vertebrate-virus associations, called The Global Virome in One Network (VIRION). The VIRION database has been assembled through both reconciliation of static data sets and integration of dynamically updated databases. These data sources are all harmonized against one taxonomic backbone, including metadata on host and virus taxonomic validity and higher classification; additional metadata on sampling methodology and evidence strength are also available in a harmonized format. In total, the VIRION database is the largest open-source, open-access database of its kind, with roughly half a million unique records that include 9,521 resolved virus "species" (of which 1,661 are ICTV ratified), 3,692 resolved vertebrate host species, and 23,147 unique interactions between taxonomically valid organisms. Together, these data cover roughly a quarter of mammal diversity, a 10th of bird diversity, and ∼6% of the estimated total diversity of vertebrates, and a much larger proportion of their virome than any previous database. We show how these data can be used to test hypotheses about microbiology, ecology, and evolution and make suggestions for best practices that address the unique mix of evidence that coexists in these data.
IMPORTANCE Animals and their viruses are connected by a sprawling, tangled network of species interactions. Data on the host-virus network are available from several sources, which use different naming conventions and often report metadata in different levels of detail. VIRION is a new database that combines several of these existing data sources, reconciles taxonomy to a single consistent backbone, and reports metadata in a format designed by and for virologists. Researchers can use VIRION to easily answer questions like "Can any fish viruses infect humans?" or "Which bats host coronaviruses?" or to build more advanced predictive models, making it an unprecedented step toward a full inventory of the global virome.
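
A question such as "Which bats host coronaviruses?" reduces to a filter on the taxonomy columns listed in the data field table (Host, HostOrder, VirusFamily). The following is a minimal sketch, not the authors' code: the rows are invented placeholders rather than VIRION records, and `hosts_of` is a hypothetical helper name. It assumes taxonomy values are stored in lowercase, as the table states.

```python
import csv
import io

# Toy stand-in for the VIRION master file. Column names follow the data field
# table; every row below is invented for illustration only.
TOY_CSV = """Host,HostOrder,HostClass,Virus,VirusFamily,DetectionMethod
bat species a,chiroptera,mammalia,virus x,coronaviridae,PCR/Sequencing
bat species b,chiroptera,mammalia,virus y,rhabdoviridae,Antibodies
rodent species c,rodentia,mammalia,virus x,coronaviridae,Isolation/Observation
bat species a,chiroptera,mammalia,virus x,coronaviridae,Antibodies
"""

def hosts_of(rows, host_order, virus_family):
    """Return sorted, de-duplicated hosts in `host_order` recorded with a virus in `virus_family`."""
    return sorted({
        r["Host"] for r in rows
        if r["HostOrder"] == host_order and r["VirusFamily"] == virus_family
    })

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
bat_cov_hosts = hosts_of(rows, "chiroptera", "coronaviridae")
print(bat_cov_hosts)
```

With the real data, the same filter would be applied to the downloaded master file instead of the in-memory toy table; duplicate rows for one host-virus pair are expected because several sources and detection methods can report the same association.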
Keywords: data synthesis; ecological networks; global virome; host-virus interactions
Year: 2022 PMID: 35229639 PMCID: PMC8941870 DOI: 10.1128/mbio.02985-21
Source DB: PubMed Journal: mBio Impact factor: 7.786
FIG 1. The VIRION pipeline. Data are integrated from a total of seven sources into one master file and a set of six disaggregated files that constitute the VIRION database. Data sources marked with a delta can be dynamically updated as new data are submitted to the source databases.
FIG 2. Comparative scope of data. Networks show all unique NCBI-recognized host-virus species pairs (viruses are red, hosts are blue) in the Host-Pathogen Phylogeny Project database (HP3, published in 2017) and VIRION. The information stored in VIRION is more extensive (including all vertebrates, not just mammals) but also far more information dense, describing a network with many more nodes and many more connections.
FIG 3. Taxonomic coverage across hosts. Each tree tip represents one host family, with the total number of viruses recorded in VIRION, the number that are NCBI resolved, and the number that are ICTV ratified. Note that the color scale varies across panels.
FIG 4. The geographic distribution of hosts and host-virus associations based on IUCN geographic range maps. Species are matched to the IUCN database using verbatim Latin names, without any manual correction. Names match for most mammals (91.1%) and birds (93%) but for fewer reptiles/amphibians (79%) and fish (47.6%), in part because some species may not yet be mapped. Particularly when working with the latter groups, users will likely need to manually cross-reference species names from the VIRION database to other sources.
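
The name-matching step described in the FIG 4 caption can be approximated as an exact string comparison that also reports which names need manual cross-referencing. This is a sketch under stated assumptions: `match_rate` is a hypothetical helper, the names are placeholders, and lowercasing both sides before comparison is an assumption (the caption says only that names are matched verbatim).

```python
def match_rate(virion_hosts, iucn_names):
    """Return (fraction matched, unmatched names) for distinct VIRION host names."""
    # Assumption: normalize case and surrounding whitespace only, no fuzzy matching.
    hosts = {h.strip().lower() for h in virion_hosts}
    iucn = {n.strip().lower() for n in iucn_names}
    if not hosts:
        return 0.0, set()
    unmatched = hosts - iucn
    return 1 - len(unmatched) / len(hosts), unmatched

# Placeholder names, not real species.
virion_hosts = ["genus one", "genus two", "genus three", "genus four"]
iucn_names = ["Genus one", "Genus two", "Genus three", "Genus five"]

rate, unmatched = match_rate(virion_hosts, iucn_names)
print(f"{rate:.1%} matched; check manually: {sorted(unmatched)}")
```

Returning the unmatched set alongside the rate makes it straightforward to hand the remaining names to a synonym lookup or a manual review.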
Data field descriptors for the VIRION database
| Data field (column) | Data type | Descriptor |
|---|---|---|
| Host, HostGenus, HostFamily, HostOrder, HostClass | Character string | Host taxonomy, including higher taxonomy (all lowercase). |
| Virus, VirusGenus, VirusFamily, VirusOrder, VirusClass | Character string | Virus taxonomy, including higher taxonomy (all lowercase). |
| ICTVRatified | Boolean | Is the virus species given in the field “Virus” considered a valid species name in the latest ICTV taxonomy? |
| HostNCBIResolved, VirusNCBIResolved | Boolean | Is the lowest nonmissing taxonomic value (usually species level but sometimes higher) matched to the NCBI taxonomy? |
| HostTaxID, VirusTaxID | Numeric character string | The “TaxID” unique identifier to the lowest possible taxonomic match in the NCBI database; in some cases this may be below the lowest taxonomic resolution (e.g., some virus species may have an NCBI identifier below the species level). |
| HostOriginal, VirusOriginal | Character string | Original entry for host and virus taxonomy as provided in source database (verbatim and not necessarily lowercase). |
| HostFlagID | Boolean | Values are given as TRUE if source metadata reports any uncertainty in host identification (e.g., “cf.” in a species name or the flags in the PREDICT data). |
| DetectionMethod, DetectionOriginal | Character string | DetectionMethod harmonizes four categories (in descending order of strength of evidence: “Isolation/Observation,” “PCR/Sequencing,” “Antibodies,” and “Not specified”) from the raw information provided in DetectionOriginal. Harmonized values are given to the highest evidence level possible based on a source record (e.g., the plaintext value “Isolation and antibodies” is harmonized to “Isolation/Observation”). In some cases where detection method is not available via metadata, source information is used as DetectionOriginal (e.g., “NCBI Nucleotide”). |
| Database, DatabaseVersion | Character string | Data provenance traces back to seven source data sets (EID2, Shaw, HP3, GMPD2, PREDICT, GenBank, and GLOBI), with linked information about either the citation of the relevant copy (e.g., “Shaw et al. 2020 Mol Ecol”) or version information for dynamic sources (e.g., “Aug2021FlatFile” for GenBank). |
| ReferenceText, PMID, PublicationYear | Character string, character string, numeric | Bibliographic information is sourced from the CLOVER database for records derived from Shaw, GMPD2, HP3, or EID2. ReferenceText provides a text description of literature sources (where provided in source data sets); PMID provides PubMed identifiers for literature sources (where provided); PublicationYear provides the year the literature source was published, accessed either from the original database’s reference description or from scraping the PubMed database. |
| NCBIAccession | Character string | NCBI accession information is given for records that originate directly from GenBank or have other linked metadata in other sources, including both the CLOVER data and the PREDICT data. These can be readily used in combination with tools like ‘rentrez’ to query source information on viral samples. |
| CollectionDay, CollectionMonth, CollectionYear | Numeric | Reports the date of actual sample collection (not the release of data or a published paper) as provided for samples from PREDICT or GenBank. |
| ReleaseDay, ReleaseMonth, ReleaseYear | Numeric | Reports the year a given association was “released” as public information (EID2 and PREDICT) or deposited as a public sample on GenBank. For PREDICT, all values are given as 2021, reflecting the release of a static file at that time, even though some findings may have been published or deposited in GenBank earlier. (This redundancy should be captured in overlap with GenBank and EID2.) |
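
The DetectionMethod rule in the table (assign the highest evidence level that a raw record supports) can be illustrated with a short sketch. The four category names and their order come from the table; the keyword lists, the function names `harmonize` and `at_least`, and the example records are assumptions for illustration, and the mapping actually used to build VIRION may differ.

```python
# Harmonized categories from the table, strongest evidence first.
EVIDENCE_ORDER = ["Isolation/Observation", "PCR/Sequencing", "Antibodies", "Not specified"]

# Hypothetical keyword lists, not the authors' mapping.
KEYWORDS = {
    "Isolation/Observation": ("isolation", "observation"),
    "PCR/Sequencing": ("pcr", "sequenc"),
    "Antibodies": ("antibod", "serolog"),
}

def harmonize(detection_original):
    """Map raw detection text to the strongest category whose keywords it mentions."""
    text = (detection_original or "").lower()
    for category in EVIDENCE_ORDER[:-1]:  # scan from strongest to weakest
        if any(keyword in text for keyword in KEYWORDS[category]):
            return category
    return "Not specified"

def at_least(rows, minimum):
    """Keep rows whose DetectionMethod is at least as strong as `minimum`."""
    cutoff = EVIDENCE_ORDER.index(minimum)
    return [r for r in rows if EVIDENCE_ORDER.index(r["DetectionMethod"]) <= cutoff]

# Invented example records.
records = [
    {"Virus": "virus x", "DetectionOriginal": "Isolation and antibodies"},
    {"Virus": "virus y", "DetectionOriginal": "Serology"},
    {"Virus": "virus z", "DetectionOriginal": "PCR"},
    {"Virus": "virus w", "DetectionOriginal": ""},
]
for record in records:
    record["DetectionMethod"] = harmonize(record["DetectionOriginal"])

strong = at_least(records, "PCR/Sequencing")
print([r["Virus"] for r in strong])
```

Filtering on an ordered evidence scale like this lets an analysis be repeated at different stringency levels, for example excluding antibody-only records.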