
VectorBase: improvements to a bioinformatics resource for invertebrate vector genomics.

Karine Megy, Scott J Emrich, Daniel Lawson, David Campbell, Emmanuel Dialynas, Daniel S T Hughes, Gautier Koscielny, Christos Louis, Robert M Maccallum, Seth N Redmond, Andrew Sheehan, Pantelis Topalis, Derek Wilson.

Abstract

VectorBase (http://www.vectorbase.org) is a NIAID-supported bioinformatics resource for invertebrate vectors of human pathogens. It hosts data for nine genomes: mosquitoes (three Anopheles gambiae genomes, Aedes aegypti and Culex quinquefasciatus), tick (Ixodes scapularis), body louse (Pediculus humanus), kissing bug (Rhodnius prolixus) and tsetse fly (Glossina morsitans). Hosted data range from genomic features and expression data to population genetics and ontologies. We describe improvements and integration of new data that expand our taxonomic coverage. Releases are bi-monthly and include the delivery of preliminary data for emerging genomes. Frequent updates of the genome browser provide VectorBase users with increasing options for visualizing their own high-throughput data. One major development is a new population biology resource for storing genomic variations, insecticide resistance data and their associated metadata. It takes advantage of improved ontologies and controlled vocabularies. Combined, these new features ensure timely release of multiple types of data in the public domain while helping overcome the bottlenecks of bioinformatics and annotation by engaging with our user community.

Year:  2011        PMID: 22135296      PMCID: PMC3245112          DOI: 10.1093/nar/gkr1089

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

VectorBase is a NIAID-funded Bioinformatics Resource Center (BRC) (1), which focuses on arthropod vectors of human pathogens. Our mission is to support the vector research community by providing access to genome assemblies, genome annotations and high-throughput data. VectorBase is involved in capturing community gene annotations, storing microarray expression studies and more recently population biology data. The collection of experimental and sample-related metadata has been aided through our development of ontologies and controlled vocabularies for vector-specific data, such as field-associated samples, pathogen transmission and insecticide resistance. VectorBase currently hosts nine genomes of which the majority are mosquitoes, reflecting their importance in disease agent transmission. The seven corresponding species are: Anopheles gambiae (three genomes, for the PEST, Mali-NIH and Pimperena colonies), Aedes aegypti, Culex quinquefasciatus, Glossina morsitans, Ixodes scapularis, Pediculus humanus and Rhodnius prolixus. We anticipate hosting genome clusters for a broader group of Anopheline mosquitoes, ticks and other important vector genera such as Glossina and Simulium. Full details about current and future genomes to be hosted by VectorBase can be found at http://www.vectorbase.org/organisms. Here, we highlight improvements and new features, and discuss genomes integrated since the last update (2). All information and data are available from our website at http://www.vectorbase.org.

NEW FEATURES

Release cycles and early release of emerging genomes

VectorBase now releases data and software updates, such as genome browser improvements via the Ensembl project (3), on a bi-monthly release cycle. Recent browser additions include tools for the visualization of user data sources: read coverage plots from high-throughput mRNA-sequencing experiments [BAM (4), WIG (http://genome.ucsc.edu/FAQ/FAQformat.html)], gene models [GFF3 (http://www.sequenceontology.org/gff3.shtml)] and population resequencing/variation data sets [VCF (5)] (Figure 1). Searching and selection of evidence tracks have been simplified, with a greater level of customization of genome-based views.
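As an illustration of the kind of user data a coverage track carries, the following minimal sketch converts read alignment intervals into BedGraph lines (0-based, half-open coordinates). It is not the procedure used by VectorBase or Ensembl; the function name, the contig name and the read intervals are assumptions made for the example.

```python
from collections import Counter


def coverage_to_bedgraph(chrom, reads):
    """Turn read intervals into BedGraph lines (0-based, half-open).

    `reads` is a list of (start, end) alignment intervals on `chrom`.
    Consecutive positions with the same depth are reported as one line;
    regions with zero depth are omitted.
    """
    # Sweep line: +1 where a read starts, -1 where it ends.
    delta = Counter()
    for start, end in reads:
        delta[start] += 1
        delta[end] -= 1

    lines = []
    depth = 0
    prev = None
    for pos in sorted(delta):
        if delta[pos] == 0:
            continue  # depth does not change here, keep extending the run
        if depth > 0:
            lines.append(f"{chrom}\t{prev}\t{pos}\t{depth}")
        depth += delta[pos]
        prev = pos
    return lines


if __name__ == "__main__":
    # Hypothetical contig name and read intervals.
    for line in coverage_to_bedgraph("contig_1", [(0, 10), (5, 15)]):
        print(line)
```

A file of such lines, preceded by a track definition line, is the general shape of data that a BedGraph-aware browser can display as a coverage plot.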
Figure 1.

Visualization of user data in the genome browser (image exported directly from the browser). The dark red boxes at the top represent the exons on transcript AAEL012734-RA, and the blue bar represents the contig sequence. The top track in dark grey represents the coverage plot of short-read alignments of an RNA-Seq experiment on the Aedes aegypti transcriptome. Individual read alignments extracted from a BAM-formatted file (4) are displayed underneath in light grey. The four orange tracks represent reconstructed transcript models in four replicates of the same experiment, using a custom RNA-Seq analysis procedure. The BedGraph format was used to generate the data to keep the coverage information base by base.

To make emerging genome sequences rapidly available to our communities, we have recently introduced preliminary sites, called pre-sites, for newly assembled genomes. These contain temporary, unarchived automated gene predictions and transcriptome and proteome alignments. These pre-sites improve vector community involvement during initial analysis, including highly valued community-aided annotation. Once an annotation is finalized, additional analyses are performed such as our standard orthology/paralogy relationship predictions (6) and cross-referencing to other resources. This system was trialled for the R. prolixus and G. morsitans genomes.

Integration of community data

VectorBase has a mandate to capture community annotations. Community appraisal of the reference genome annotations has been important to assess automatic gene predictions and ensure correct models for many gene families as part of the initial genome publication (7) and subsequent analyses (8). Most current annotation data correspond to specific genes and/or gene families and are provided by community members through a simple spreadsheet submitted to our Community Annotation Pipeline. Integration of these data with existing gene sets has greatly improved reference gene sets (e.g. An. gambiae) and has led to a new ‘patch’ build system that uses heuristics to merge manual and automated gene predictions, allowing more frequent gene set updates. Patch builds for three species (Ae. aegypti, C. quinquefasciatus and I. scapularis) were performed in 2011. To ensure timely release of community-sourced annotations, all community manual annotation data are made available as a Distributed Annotation System track within the genome browser (9). These data include corrections of gene structures and relevant metadata such as gene symbols and citations. Community-generated transcriptome data from newer sequencing technologies, known as RNA-Seq, are also increasingly being produced for VectorBase species. We have been using these data to validate existing gene models and predict new ones. Alignment programs such as TopHat (10) and GSNAP (11) for short reads, or GMAP (12) for long reads, were used to map reads to the assembly and identify splice junctions. Gene models were then reconstructed using Cufflinks (13) and a custom pipeline.

Accessing data

VectorBase has improved its text-based search facility by increasing the speed and the scope of the underlying engine. Search terms now include gene identifiers and descriptions, microarray experiments and expression data. Indices are regenerated for each release using the open source Apache Lucene technology (http://lucene.apache.org) and served using a web service. Information can be retrieved from the search box on the main site or from the genome browser; results contain hyperlinks to genes, their locations and, where appropriate, their paralogs/orthologs. A custom interface, CVSearch, has been developed to search (by keyword or identifier) and browse ontologies and controlled vocabularies. More recently, we have used our GDAV open source tool (http://www.vectorbase.org/Help/GDAV) to provide access to available RNA-Seq data. For example, assembled RNA-Seq data for eight Anopheline species whose genome sequencing is in progress are already available for download or BLAST searches, and are searchable using keywords, gene identifiers or InterPro domains.
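Text search engines such as Lucene are built around an inverted index that maps each term to the records containing it. The sketch below is a toy version of that data structure in plain Python, not the Lucene API and not the VectorBase configuration; the gene identifiers and descriptions are made up for the example.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(docs):
    """Map each lower-cased token to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of records that contain every token of the query."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)


if __name__ == "__main__":
    # Hypothetical records: identifier -> description.
    docs = {
        "GENE0001": "voltage-gated sodium channel",
        "GENE0002": "glutamate-gated chloride channel",
        "GENE0003": "cytochrome P450",
    }
    index = build_index(docs)
    print(sorted(search(index, "gated channel")))
```

Rebuilding such an index for every release, as described above, keeps the searchable terms in step with the released gene sets.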

NEW DATA

Ontologies

VectorBase continues to develop and maintain ontologies relating to control of disease vectors (14). Specifically, we host anatomy ontologies [TGMA for mosquitoes and TADS for ticks (15)] and a BFO compliant ontology of insecticide resistance [MIRO (16)]. Our most recent ontology is an extension of the Infectious Disease Ontology (IDO) called IDOMAL (17), which is a comprehensive malaria-focused ontology with more than 2300 unique terms including most related to the disease vector (e.g. vector control). All VectorBase ontologies strictly follow the rules established by the OBO Foundry (18), and can be browsed either at VectorBase or the NCBO Bioportal (http://bioportal.bioontology.org). These ontologies have also been deposited into the publicly accessible OBO Foundry (http://www.obofoundry.org).

Insecticide resistance data

IRbase is a dedicated section of VectorBase that hosts data from both published studies and recently analyzed data for field populations. It used to depend on our MIRO ontology but now relies on the newer IDOMAL ontology described above. We are in the process of incorporating these data into the population biology resource described in the next section.

Variation data

As anticipated in our previous update (2), analyses of populations and variations at the genomic level have increased significantly. To accommodate these data sets, VectorBase has continued to improve its Ensembl-based genome browser for visualizing genomic variation data. As of 2011, the current resource contains data from the dbSNP database (19), variations derived from the An. gambiae Mali-NIH (M molecular form) and Pimperena (S molecular form) sequencing project (20), and genotypes obtained with the AgSNP01 SNP-array (21). We expect to increasingly use this functionality with the completion of a number of planned large-scale population sampling projects.

POPULATION GENOMICS RESOURCE

Integral to handling both genomic variations and insecticide resistance data is the capture of metadata, such as field collection locations and methods. The original IRbase (16) and more recent AgPopGenBase data from UC Davis/UCLA (http://www.vectorbase.org/PopulationData) were highly valuable but were not designed to store more diverse data types. To allow more flexibility, we developed a unified population biology resource that can store all of these data while linking to the genome browser when useful, e.g. high-throughput genotyping data from stored AgSNP01 chip hybridizations (21). This new resource currently contains just over 15 000 mosquito samples originating from over 1600 field collections and more than 34 000 phenotype/genotype assay results.

Population genomics database

We participated in the development of a Chado Natural Diversity Module (22) in collaboration with the GMOD consortium (http://gmod.org) and specific members (23–25). This module is an extension to the Chado database schema that stores population and variation data. The module has a simple, ontology-centred design, which allows the processing of data from a wide range of experiments by extending existing ontologies or adopting new ones. Data storage and access are simplified through Perl and Ruby Application Programming Interfaces (APIs). The Ruby API has been used to write a ‘RESTful’ web service that enables programs, both within VectorBase and from third parties, to retrieve data from the database in a structured format (JSON). The web service code is available under an open source license (http://www.vectorbase.org/Tools). For display of these data, we have developed a lightweight browser and JavaScript library; the library queries the main data server and formats the returned data using a set of standard display methods (Figure 2). Display code is available under a GPLv3 license from the same URL as the web service code.
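To give a sense of how a third-party program might consume JSON returned by such a web service, the sketch below summarizes insecticide bioassay records from a JSON payload. The payload structure, field names and values are assumptions invented for the example; the actual response format of the VectorBase service is not described here, and no network request is made.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical payload; the field names are illustrative, not the real schema.
PAYLOAD = """
{"assays": [
  {"sample": "S1", "insecticide": "DDT", "percent_mortality": 40.0},
  {"sample": "S2", "insecticide": "DDT", "percent_mortality": 60.0},
  {"sample": "S3", "insecticide": "deltamethrin", "percent_mortality": 98.0}
]}
"""


def mean_mortality(payload):
    """Return the mean per cent mortality for each insecticide in the payload."""
    records = json.loads(payload)["assays"]
    by_insecticide = defaultdict(list)
    for record in records:
        by_insecticide[record["insecticide"]].append(record["percent_mortality"])
    return {name: mean(values) for name, values in by_insecticide.items()}


if __name__ == "__main__":
    for name, value in sorted(mean_mortality(PAYLOAD).items()):
        print(f"{name}\t{value:.1f}")
```

Because the data arrive in a structured format, the same few lines work whether the payload comes from a file or from an HTTP response body.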
Figure 2.

Examples of customizable displays from the Phenovis JavaScript library. (A) Susceptibility status of Anopheles fluviatilis, An. annularis and An. culicifacies to insecticides in Koraput District, Orissa: [insecticide x per cent mortality]. (B) Anopheles gambiae M, S and Bamako populations [location x population] (21).

Community-led development

The standard display methods provide a wide variety of options that can be customized by a submitter to best suit their data. By using an open web service and providing the visualization code under an open source license, we hope to encourage the development of third-party displays, and we will support these efforts through outreach and through VectorBase-hosted development mailing lists. As a concrete example, we have tested a number of visualizations that retrieve data from our resource and from the web service at EuPathDB (26). Other examples of this approach include the display of climatic, economic or human disease data. This functionality could enable co-analysis of vector and pathogen data of this kind.

Data submission

Data can be submitted to the VectorBase Population Biology Resource via spreadsheet forms using open source tools to assist with formatting and ontology term selection (ISA-Tab (27) and Phenote, http://www.phenote.org). Genotypes are submitted to the variation resource in standard VCF format (5).
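As a sketch of what genotype data in VCF look like to a program, the example below reads a small in-memory VCF and computes the non-reference allele frequency at each site from the GT field. The contig name, sample names and genotypes are invented for the example, and a real pipeline would more likely rely on a dedicated VCF library than on hand-written parsing.

```python
# Made-up example data in VCF layout (tab-separated columns).
VCF_TEXT = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n"
    "2L\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n"
    "2L\t2000\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t./.\n"
)


def alt_allele_frequency(vcf_text):
    """Return {(chrom, pos): fraction of called alleles that are non-reference}."""
    freqs = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        gt_index = fields[8].split(":").index("GT")
        alleles = []
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            # Alleles are separated by "/" (unphased) or "|" (phased); "." is missing.
            alleles += [a for a in genotype.replace("|", "/").split("/") if a != "."]
        if alleles:
            freqs[(chrom, pos)] = sum(a != "0" for a in alleles) / len(alleles)
    return freqs


if __name__ == "__main__":
    for site, freq in alt_allele_frequency(VCF_TEXT).items():
        print(site, freq)
```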

EXPANDING THE TAXONOMIC COVERAGE OF VECTORBASE

The decreasing cost of genome sequencing has had radical effects on the scope of genome projects. Previously, VectorBase partnered with large-scale sequencing centres to generate annotation and support single representatives from important vector genera, e.g. An. gambiae for Anopheles and Ae. aegypti for Aedes. Projects using newer generation sequencing methodologies can deliver assemblies at a fraction of the cost and have expanded to encompass multiple species from each genus. NIAID/NHGRI has approved several of these genome clusters including 15 Anopheline genomes, 11 Simulium genomes, 5 Glossina genomes, 2 tick genomes (including the improvement of the I. scapularis assembly) and a mite genome. In total, these represent a 4-fold increase in the number of genomes stored in VectorBase. VectorBase will support these expanded genome clusters using many of the features described in this update. Each project will produce other data types such as RNA-Seq and variation data through population sampling. VectorBase has also developed a new genome annotation pipeline to infer gene structures from closely related orthologs via whole-genome alignment techniques. Thus a single, high-quality reference annotation set can be used to rapidly predict genes in the other members of a genome cluster. The improvements in the storage and visualization of RNA-Seq and variation data will be invaluable for supporting and augmenting these new genomes for our users.
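The annotation pipeline itself is not described in detail here. As a rough illustration of the underlying idea, projecting a reference gene structure onto a related genome through a whole-genome alignment, the sketch below maps exon coordinates through ungapped, same-strand alignment blocks. The block representation, names and coordinates are assumptions made for the example; real alignments contain gaps, inversions and rearrangements that this sketch ignores.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """An ungapped, same-strand alignment block (0-based coordinates)."""
    ref_start: int
    target_start: int
    length: int


def project_exon(start: int, end: int, blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Project a half-open reference exon [start, end) onto the target genome.

    Returns None unless the whole exon lies inside a single block.
    """
    for block in blocks:
        if block.ref_start <= start and end <= block.ref_start + block.length:
            offset = block.target_start - block.ref_start
            return (start + offset, end + offset)
    return None


def project_gene(exons, blocks):
    """Project every exon; the gene is projected only if all exons map."""
    projected = [project_exon(start, end, blocks) for start, end in exons]
    return None if None in projected else projected


if __name__ == "__main__":
    blocks = [Block(100, 500, 50), Block(200, 580, 100)]
    print(project_gene([(110, 140), (250, 300)], blocks))
    print(project_gene([(110, 140), (140, 210)], blocks))
```

Requiring every exon to map is a deliberately conservative choice for the sketch; a practical pipeline would presumably also check splice sites and reading frames in the target sequence.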

FUTURE DEVELOPMENTS

In this update, we described improvements to existing features and integration of new data. Two significant advancements are the introduction of a bi-monthly release cycle and of pre-sites, which provide the latest data at an early stage of their analysis, thus ensuring high community involvement. VectorBase also assists the community with a helpdesk system, on-line help (FAQs, forum, tutorials) and outreach at conferences. Decreasing sequencing costs are producing a wealth of vector-focused genomics data and expanding the taxonomic coverage far beyond mosquitoes. Although a first cluster of 15 Anopheline genomes is being sequenced, three clusters of related non-mosquito vectors are next in line. Re-sequencing, or the sequencing of individuals from the same species for population genetics studies, is also becoming more common. The future of vector genomics appears to be an expansion of both taxonomic coverage (breadth) and within-species re-sequencing (depth). By continuously improving its resources, as has been done in the past years, VectorBase is in a good position to meet this exciting challenge.

FUNDING

National Institutes of Health/National Institute for Allergy and Infectious Diseases (grant numbers HHSN266200400039C, HHSN272200900039C); partial support from: the Evimalar network of excellence (grant number 242095); INFRAVEC from the FP7 program of the European Commission (grant number 228421); Transmalariabloc from the FP7 program of the European Commission (grant number HEALTH-F3-2008-223736). Funding for open access charge: National Institutes of Health/National Institute for Allergy and Infectious Diseases [grant number HHSN272200900039C].

Conflict of interest statement. None declared.
REFERENCES (27 in total)

1.  The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration.

Authors:  Barry Smith; Michael Ashburner; Cornelius Rosse; Jonathan Bard; William Bug; Werner Ceusters; Louis J Goldberg; Karen Eilbeck; Amelia Ireland; Christopher J Mungall; Neocles Leontis; Philippe Rocca-Serra; Alan Ruttenberg; Susanna-Assunta Sansone; Richard H Scheuermann; Nigam Shah; Patricia L Whetzel; Suzanna Lewis
Journal:  Nat Biotechnol       Date:  2007-11       Impact factor: 54.908

2.  Anatomical ontologies of mosquitoes and ticks, and their web browsers in VectorBase.

Authors:  P Topalis; C Tzavlaki; K Vestaki; E Dialynas; D E Sonenshine; R Butler; R V Bruggner; E O Stinson; F H Collins; C Louis
Journal:  Insect Mol Biol       Date:  2008-02       Impact factor: 3.585

3.  EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates.

Authors:  Albert J Vilella; Jessica Severin; Abel Ureta-Vidal; Li Heng; Richard Durbin; Ewan Birney
Journal:  Genome Res       Date:  2008-11-24       Impact factor: 9.043

4.  How can ontologies help vector biology?

Authors:  Pantelis Topalis; Daniel Lawson; Frank H Collins; Christos Louis
Journal:  Trends Parasitol       Date:  2008-04-24

5.  GMAP: a genomic mapping and alignment program for mRNA and EST sequences.

Authors:  Thomas D Wu; Colin K Watanabe
Journal:  Bioinformatics       Date:  2005-02-22       Impact factor: 6.937

6.  The Chado Natural Diversity module: a new generic database schema for large-scale phenotyping and genotyping data.

Authors:  Sook Jung; Naama Menda; Seth Redmond; Robert M Buels; Maren Friesen; Yuri Bendana; Lacey-Anne Sanderson; Hilmar Lapp; Taein Lee; Bob MacCallum; Kirstin E Bett; Scott Cain; Dave Clements; Lukas A Mueller; Dorrie Main
Journal:  Database (Oxford)       Date:  2011-11-26       Impact factor: 3.451

7.  TopHat: discovering splice junctions with RNA-Seq.

Authors:  Cole Trapnell; Lior Pachter; Steven L Salzberg
Journal:  Bioinformatics       Date:  2009-03-16       Impact factor: 6.937

8.  MIRO and IRbase: IT tools for the epidemiological monitoring of insecticide resistance in mosquito disease vectors.

Authors:  Emmanuel Dialynas; Pantelis Topalis; John Vontas; Christos Louis
Journal:  PLoS Negl Trop Dis       Date:  2009-06-23

9.  VectorBase: a data resource for invertebrate vector genomics.

Authors:  Daniel Lawson; Peter Arensburger; Peter Atkinson; Nora J Besansky; Robert V Bruggner; Ryan Butler; Kathryn S Campbell; George K Christophides; Scott Christley; Emmanuel Dialynas; Martin Hammond; Catherine A Hill; Nathan Konopinski; Neil F Lobo; Robert M MacCallum; Greg Madey; Karine Megy; Jason Meyer; Seth Redmond; David W Severson; Eric O Stinson; Pantelis Topalis; Ewan Birney; William M Gelbart; Fotis C Kafatos; Christos Louis; Frank H Collins
Journal:  Nucleic Acids Res       Date:  2008-11-21       Impact factor: 16.971

10.  GDR (Genome Database for Rosaceae): integrated web-database for Rosaceae genomics and genetics data.

Authors:  Sook Jung; Margaret Staton; Taein Lee; Anna Blenda; Randall Svancara; Albert Abbott; Dorrie Main
Journal:  Nucleic Acids Res       Date:  2007-10-11       Impact factor: 16.971
