Gloria I Giraldo-Calderón 1, Omar S Harb 2, Sarah A Kelly 3, Samuel SC Rund 4, David S Roos 2, Mary Ann McDowell 5.
1. Department of Biological Sciences, Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA; Dept. Ciencias Biológicas & Dept. Ciencias Básicas Médicas, Universidad Icesi, Calle 18 No 122-135, Cali, Colombia.
2. Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA.
3. Department of Life Sciences, Imperial College London, South Kensington Campus, London SW7 2AZ, UK.
4. Department of Biological Sciences, Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA.
5. Department of Biological Sciences, Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA. Electronic address: mmcdowe1@nd.edu.
Abstract
VectorBase (VectorBase.org) is part of the VEuPathDB Bioinformatics Resource Center, providing free online access to multi-omics and population biology data, focusing on arthropod vectors and invertebrates of importance to human health. VectorBase includes genomics and functional genomics data from bed bugs, biting midges, body lice, kissing bugs, mites, mosquitoes, sand flies, ticks, tsetse flies, stable flies, house flies, fruit flies, and a snail intermediate host. Tools include the Search Strategy system and MapVEu, enabling users to interrogate and visualize diverse 'omics and population-level data using a graphical interface (no programming experience required). Users can also analyze their own private data, such as transcriptomic sequences, exploring their results in the context of other publicly-available information in the database. Help Desk: help@vectorbase.org.
As part of the Eukaryotic Pathogen, Vector and Host Informatics Resource (VEuPathDB.org) Bioinformatics Resource Center (BRC), VectorBase (VectorBase.org) is supported by the US National Institute of Allergy and Infectious Diseases (NIAID) [1]. In addition to VectorBase [2], VEuPathDB [3] supports eukaryotic pathogens (protists, fungi) and selected mammalian host data, and provides resources for orthology determination and phylogenetic inference (OrthoMCL.org) [4]. Additional resources using the VEuPathDB model and infrastructure accommodate epidemiological (ClinEpiDB.org) [5] and microbiome data (MicrobiomeDB.org) [6].
Release 54 of VectorBase supports 53 vector genomes and integrates a wide range of other data types, including functional genomics and genetic variation data. The MapVEu geo-visualization tool displays different types of population data, including vector abundance, pathogen infection status, genetic variation, host blood meal source, and insecticide resistance phenotypes and genotypes, for ~470 species worldwide. Data are integrated from public repositories or directly from providers and analyzed with standard workflows using an ontology-driven framework to ensure data comparability. Expert knowledge from the community is also incorporated to improve genome annotation, both through an Apollo interface and in the form of User Comments. Here we present a general overview of the new VectorBase resource, including site use, data types, and tools, and finish with our future plans.
The new VectorBase: a merged BRC infrastructure
The rapid growth of genomic-scale datasets, increasing integration of scientific research, and funder mandates for improved efficiency have driven the development of VEuPathDB, coupling the Ensembl bioinformatic pipelines [7,8] long used by VectorBase, with the Genomics Unified Schema [9] and highly flexible Search Strategies [10] of EuPathDB. The net result offers improved scalability, flexibility, data flow, and overall user experience.
Web interface improvements
A redesigned common user interface provides convenient, consistent access to data, searches, and help information for all supported species. The home page (Figure 1) features a header (present on all pages), a main panel, an expandable ‘News & Tweets’ section (Figure 1d), and a footer with clickable icons to access other VEuPathDB resources (Figure 1f). The Site Search (Figure 1b) allows free text searches and returns categorized results, with filters that let users restrict results to categories or organisms of interest. Results (genes, SNPs, etc.) can then be exported to the ‘My Strategies’ system for further data mining, visualization, or download (see below). Educational materials, FAQs, virtual events, workshops, and methods are available under the ‘Help’ menu (Figure 1, arrow), and links to tutorials and exercises are at the bottom of the main panel (Figure 1e). Additional help is also available from the ‘Contact Us’ link (Figure 1b).
Figure 1
Redesigned VectorBase home page. (a) Left-hand panel provides access to all available searches, categorized by datatype. (b) Header on all site pages, including Site Search box and access to My Strategies, Searches, Tools, My Workspace, Data, About, Help (educational materials, arrow) and Contact Us sections. Release date and version adjacent to the logo at left; social media, login, registration, and user profile links at right. (c) Central section provides an overview of resources and tools, including vignettes to help users get started on specific topics of likely interest. (d) News & Tweets section is an expandable tab (collapsed by default), providing access to recent announcements. (e) Links to more detailed step-by-step exercises. (f) Hyperlinked logos to other VEuPathDB components and affiliated sites, and community chat button enabling users to ask questions and share information.
The left sidebar provides access to all searches (Figure 1a; also accessible from the ‘Searches’ menu). Searches are organized into expandable categories containing configurable queries against the underlying data. Search results are returned as an expandable Search Strategy and are displayed in a dynamic table that can be configured by adding, removing, or moving columns. The central section of the main panel provides an overview of available resources and tools (Figure 1c).
Omics and population data sets
VectorBase release 54 includes 492 datasets relating to vector species. Bimonthly releases incorporate new data and functionality into the site; the latest data can be found on the datasets page under the ‘Data’ menu (located in the header) (https://vectorbase.org/vectorbase/app/search/dataset/AllDatasets/result).
Forty-one vector genomes represent ‘reference’ strains for distinct species, while 12 are additional strains or resequencing of already available strains. Gene set predictions are available for all reference species (and most additional strains), including 12 with chromosomal mappings. Other datasets, including transcriptomes, proteomes, genetic variation, and orthology profiles, are aligned or cross-referenced to reference genomes, and genomes are also cross-referenced with ~20 external databases, including Chemical Entities of Biological Interest (ChEBI) [11], Kyoto Encyclopedia of Genes and Genomes (KEGG) [12,13], and Gene Ontology (GO) [14,15]. All omics datasets are also available for download or for use with the site tools accessible under the ‘Tools’ menu in the header.
Population datasets include records for ~470 taxonomic groups from field-collected samples, divided into different map ‘views’ and/or data types. These include >21 000 insecticide resistance phenotype assays and >17 000 insecticide resistance genotype assays, >187 000 pathogen infection status assays, >12 000 blood meal source assays, >15 000 chromosomal inversions, >15 000 microsatellites, >2600 barcodes, and >25 million population abundance records, among others. The MapVEu tool (Figure 2) is used for visualization, search, analysis, and raw data download. Specialized representations are also available; the bar graph in Figure 2b indicates species abundance counts for the geographic region shown.
Figure 2.
Population data in MapVEu, a tool for visualizing, analyzing, and downloading geographic data. (a) Select MapVEu from the Tools menu on the home page (arrow). (b) Select the Abundance ‘view’ from the dropdown menu below the map search bar (violet arrow). Select a sampling location on the map, or enter it into the search bar and select from the autocomplete menu (Manatee County is used in this example). Date Search is set to July 2018 to July 2019. Open/collapse the blue arrowhead in the legend panel (green box) to set Collection Protocol = CDC light trap and Attractant = carbon dioxide; in the same panel, select the Species and Optimize Colors options. A point was selected (indicated in color, center of the map) to explore it in more detail. The icon at the left (orange arrow) defines the graph type (bars in this example); ‘EpiWeekly’ is set as the temporal resolution. The left panel size can be adjusted (red box). The orange box indicates other ‘view’-specific data visualizations, metadata details, raw data download, and so on. Log in to VectorBase and follow this link to recreate the image shown: https://tinyurl.com/InsectGx2021VectorBaseFig2.
New and improved tools and resources
Genome and protein browsers
Genome browsing is facilitated by the JBrowse genome browser [16], an open-source platform allowing users to select tracks displaying aligned transcriptomic, proteomic, epigenomic, and variation data. Variation data sets (SNP calls) are available via Variant Call Format (VCF) files aligned to reference genomes. Protein Browser tracks include transmembrane domains (TMHMM predictions) [17], protein domains (InterPro predictions) [18], and synteny views across multiple genomes.
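For readers who download the variation files rather than browsing them, the following is a minimal sketch of reading SNP calls from VCF text in Python. It assumes only the eight standard fixed VCF columns; the contig names and records are invented placeholders, not VectorBase data, and the helper names parse_vcf and is_snp are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# The eight fixed columns defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per VCF data line, skipping header and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])        # 1-based position
        record["ALT"] = record["ALT"].split(",")  # one entry per alternate allele
        yield record


def is_snp(record: Dict) -> bool:
    """True when REF and every ALT allele are single nucleotides."""
    bases = {"A", "C", "G", "T"}
    return record["REF"] in bases and all(alt in bases for alt in record["ALT"])


# Placeholder records for illustration only.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig_1\t1042\t.\tA\tG\t50\tPASS\t.",
    "contig_1\t2210\t.\tAT\tA\t60\tPASS\t.",
    "contig_2\t77\t.\tC\tT,G\t45\tPASS\t.",
]

snps = [r for r in parse_vcf(sample) if is_snp(r)]
for r in snps:
    print(r["CHROM"], r["POS"], r["REF"], ",".join(r["ALT"]))
```

For real files, an established VCF library is usually a better choice than hand-rolled parsing; the sketch only illustrates the record layout.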
Gene pages
Gene pages, now with a new design, compile all the available data about a particular gene into a single webpage. Orthologs and paralogs are identified using OrthoMCL [4], and Clustal Omega [19] can be launched for multiple sequence alignments. New representations facilitate exploration of transcriptomics data, protein features and properties, use of functional prediction tools, and assessment of metabolic pathways.
My strategies
Searches in VectorBase can be integrated into a Search Strategy, allowing users to combine diverse results into a multistep in silico experiment (Figure 3). Multistep strategies (for example, find Aedes kinases expressed at a particular time or place and conserved in species of interest) are built one step at a time, bringing together several searches by union (Figure 3c, step 2), intersection (Figure 3c, step 3), or subtraction operations. Strategies are extended by clicking ‘Add a step’ in the graphic panel (Figure 3c). Options for extending a strategy include ‘Combine’ with similar records, ‘Transform’ to related records, and Genomic Colocation. Results can be transformed into orthologs, metabolic pathways, or compounds. Additionally, the genomic location can be exploited to search for additional features. Search Strategies can be saved, copied, revised, or shared with others using a private link. The Strategy System replaces BioMart functionalities, including the ability to download genome-wide information available from gene pages (e.g. homologs, expression values, GO terms, etc.).
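The way steps combine can be pictured as set operations on record identifiers. The sketch below is an illustration in Python, not the VectorBase implementation: the combine function, the operator names, and the gene IDs are all hypothetical placeholders.

```python
from typing import Set


def combine(left: Set[str], right: Set[str], operator: str) -> Set[str]:
    """Combine two step results with a Boolean operator (hypothetical names)."""
    if operator == "union":
        return left | right
    if operator == "intersect":
        return left & right
    if operator == "minus":
        return left - right
    raise ValueError(f"unknown operator: {operator}")


# Placeholder gene IDs standing in for the results of three searches.
step1_kinases = {"gene_01", "gene_02", "gene_03"}
step2_phosphatases = {"gene_04", "gene_05"}
step3_midgut_expressed = {"gene_02", "gene_04", "gene_09"}

# Step 2 is added by union, step 3 by intersection, one step at a time.
strategy = combine(step1_kinases, step2_phosphatases, "union")
strategy = combine(strategy, step3_midgut_expressed, "intersect")
print(sorted(strategy))  # ['gene_02', 'gene_04']
```

Because each step takes the running result as its left-hand input, the order in which steps are added matters whenever subtraction is involved.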
Figure 3.
Search strategies as in silico experiments. (a) Site Search, a box that can be accessed from the header, returns site-wide hits; filters can be applied on the results page. Search strategies (for Genes, Organisms, Pathways, etc.) can be run from the Searches pull-down menu or the Search for . . . panel at the left. (b) The search filter can help to locate searches of interest, for example, identifying ‘text’ searches of Genes or Compounds. (c) Search results constitute one Step in a Search Strategy, which can be combined with other searches (+ Add a step; green arrow) using Boolean operators (e.g. union, intersection, subtraction). Results may be downloaded (black arrow), and searches edited, saved, shared, or published (icons at right; blue arrow). To retrieve this sample search for A. gambiae proteases expressed in the midgut with a specific promoter region DNA motif, see: https://vectorbase.org/vectorbase/app/workspace/strategies/import/bc4d101022805435. Publicly shared strategies are available from the menu at the top (orange arrow; https://vectorbase.org/vectorbase/app/workspace/strategies/public).
Enrichment analysis
Functional enrichment of gene results includes statistically valid gene ontology (GO) [14,15] and metabolic pathway enrichment results (also available as a word cloud). GO enrichment data can also be exported to REVIGO [20], facilitating data visualization using a variety of interactive tools.
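As an illustration of what an enrichment statistic measures, the sketch below computes a one-sided hypergeometric p-value in Python. This is one common approach and is shown as an assumption; the text above does not state which test VectorBase applies, and the function name and example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in the sample.

    population: genes in the background set
    annotated:  background genes carrying the GO term
    sample:     genes in the search result
    hits:       result genes carrying the GO term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Invented counts for illustration only.
p = enrichment_p_value(population=12000, annotated=150, sample=200, hits=12)
print(f"p = {p:.3g}")
```

In practice, p-values computed across many GO terms would also need a multiple-testing correction before being interpreted.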
Community annotation
VectorBase continues to support manual gene annotation with Apollo [21], which allows users to create and edit structural annotations, update product names, descriptions, and symbols, and so on. For some species, VEuPathDB staff may integrate these annotations as part of the official gene set once several annotations have been submitted. Users can request that a specific genome be made available in Apollo for annotation by contacting the help desk. The ‘User Comments’ tool available on Gene Pages is new to VectorBase, allowing users to submit comments about specific genes, which are immediately integrated into the database and become searchable.
Homology predictions
VectorBase has historically been used to predict putative gene function, resolve evolutionary questions, and provide comparative genomic analyses using the Ensembl Compara pipeline. This functionality is now provided by OrthoMCL [4], but Compara can still be accessed via Ensembl Metazoa [7,8] (https://metazoa.ensembl.org/index.html) using the genome browser gene pages and BioMart [22,23].
Galaxy
Computationally intensive analysis of user-provided data (e.g. RNA-seq datasets, SNP calling, etc.) continues to be provided via a user-friendly front end to a cloud-based Galaxy pipeline [24], allowing users to privately analyze their own data. Output files can be exported for interrogation in the context of all other data in VectorBase.
Registration and citation
VectorBase does not require registration for use, but an account provides additional features, including email alerts about new data sets and the ability to save and share BLAST jobs, Search Strategies, output results from Galaxy, gene annotations in Apollo, and more.
Much of the data in VectorBase is provided by independent researchers, and citation information is included for each VectorBase record, including publications or other attribution details for unpublished datasets, allowing users to cite primary data sources when relevant. A FAQ (https://vectorbase.org/vectorbase/app/static-content/faq.html) and the ‘About’ section provide information on how to cite VectorBase, and when appropriate, users are encouraged to include the VectorBase logo, tables, figures, and images in their original research presentations and publications.
Recent science enabled by VectorBase data and tools
VectorBase data, tools, and analyses can demonstrably expedite basic discovery and translational research. For example, VectorBase genomes have been used to resolve questions involving individual genes [25-27], characterize gene families [28,29], and perform genome-wide analyses [30,31]. Genome resources have also been used to develop wet lab techniques, for example, primer design [32] or a multilocus amplicon sequencing approach for simultaneous mosquito species identification and detection of parasite infection status [33]. BLAST [34,35], gene enrichment [36], and comparative genomic analyses within or between species have been used for phylogenetic and homolog gene predictions [37•,38•,39•]. Using VectorBase files and tools, genome assemblies have been improved by creating physical maps [40] and karyotypes [41], and genome elements have been identified [42-44]. Researchers have also used VectorBase genome assembly and gene set files to perform analyses such as transcript differential expression [45,46] and peptide expression [47].
Transcriptomics and proteomics data sets deposited in VectorBase allow research groups to pose or test new hypotheses, as described in this review paper on mosquito ‘omics [7•], now also possible using the Search Strategy system [12•]. Phenotype experiments, for example, insecticide susceptibility [50], have been analyzed using VectorBase genomes and the Ensembl Variant Effect Predictor (VEP) to interpret the genotypes obtained from variant calling. The MapVEu tool has been used to generate meta-analyses, for example, with the population abundance view [51•], and to facilitate reviews, for example, with the blood meal view [52•].
Summary and future perspectives
VEuPathDB provides consistent representation, interrogation, and visualization of data types and tools for host, vector, parasite, and fungal species. Vector data can be accessed directly through VectorBase or through the VEuPathDB homepage. Infrastructure improvements resulting from the merger of VectorBase and EuPathDB allow for increased scalability, efficiency, and interoperability to incorporate the increasing quantities of data and new data types.
Future VectorBase releases are expected to provide organism preference parameters enabling customization of the user experience, including the ability to select organisms across taxonomic groups (e.g. exploration of both Plasmodium parasites and Anopheles mosquito vectors). Development plans include tools for analysis and visualization of vector-pathogen interactions and systems biology research, resources for integrated exploration of VectorBase and the bacterial/viral BRC, improved visualizations and analyses for the MapVEu tool, an improved variant calling pipeline and associated searches, improved mechanisms for portability of data to other applications, and additional workflows using the VEuPathDB Galaxy instance.