
GeneWeaver: data driven alignment of cross-species genomics in biology and disease.

Erich Baker¹, Jason A Bubier², Timothy Reynolds³, Michael A Langston⁴, Elissa J Chesler⁵.

Abstract

The GeneWeaver data and analytics website (www.geneweaver.org) is a publicly available resource for storing, curating and analyzing sets of genes from heterogeneous data sources. The system enables discovery of relationships among genes, variants, traits, drugs, environments, anatomical structures and diseases implicitly found through gene set intersections. Since the previous review in the 2012 Nucleic Acids Research Database issue, GeneWeaver's underlying analytics platform has been enhanced, the number and variety of its publicly available gene set data sources have been increased, and its advanced search mechanisms have been expanded. In addition, its interface has been redesigned to take advantage of flexible web services, programmatic data access, and a refined data model for handling gene network data in addition to its original emphasis on gene set data. By enumerating the common and distinct biological molecules associated with all subsets of curated or user submitted groups of gene sets and gene networks, GeneWeaver empowers users with the ability to construct data driven descriptions of shared and unique biological processes, diseases and traits within and across species.

Year: 2015 PMID: 26656951 PMCID: PMC4702926 DOI: 10.1093/nar/gkv1329

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

There are many circumstances that benefit from the rapid and detailed one-to-many or many-to-many comparison of sets of genes and variants. These types of analytics arise in personal genomics, experimental functional genomics, genetic mapping and other analyses in which collections of diverse associations of genes and genomes to biological concepts, patients, diseases or samples must be compared and interpreted. GeneWeaver.org is a data repository and analytics platform that meets these needs through the storage, curation and analysis of publicly sourced and user-defined sets of genes across species (1,2). Initially referred to as the Ontological Discovery Environment, this system enables users to apply biclique centric analyses to infer the relations among any biological concept that can be represented by a set of associated genes. A computationally tractable bipartite analysis tool (3) makes it possible for GeneWeaver users to analyze collections of gene-centric data to describe the emergent gene-to-disease, -ontology or -phenotype relationships hidden among biological concepts and molecular components by making use of labeled gene sets that retain the contextual information about the conditions under which co-occurring genes are present. This approach is akin to more recent work in data driven ontology using clique enumeration and intersection to identify relations among co-occurring genes to find the underlying biological components observed across systems (4). By taking advantage of a bipartite data structure, our suite of tools dynamically enumerates maximal bicliques which can be arranged into hierarchical associations for the identification of common and unique components shared between biological processes (HiSim Graph) or highly connected genes (GeneSet Graph) (1,2). 
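
The biclique idea can be illustrated with a minimal sketch. It is not GeneWeaver's algorithm (3), which is designed to scale; it is a brute-force enumeration suitable only for a handful of sets. The function name `maximal_bicliques` and the placeholder identifiers (`set_A`, `g1`, and so on) are assumptions made for illustration. Each result pairs a group of gene sets with the genes they all share, and neither side can be extended without breaking the pairing.

```python
from itertools import combinations


def maximal_bicliques(gene_sets):
    """Return all maximal bicliques as (gene-set names, shared genes) pairs.

    Brute force over subsets of gene sets: exponential in the number of
    sets, so suitable only for tiny illustrative inputs.
    """
    names = sorted(gene_sets)
    found = set()
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Genes common to every gene set in this combination.
            shared = frozenset.intersection(
                *(frozenset(gene_sets[name]) for name in combo)
            )
            if not shared:
                continue
            # Closure: every gene set that contains all of the shared genes.
            members = frozenset(
                name for name in names if shared <= frozenset(gene_sets[name])
            )
            found.add((members, shared))
    return found


# Placeholder data, not taken from GeneWeaver.
example = {
    "set_A": {"g1", "g2", "g3"},
    "set_B": {"g1", "g3", "g4"},
    "set_C": {"g3", "g5"},
}

# Larger groups of gene sets first, mirroring a top-down hierarchy.
for members, shared in sorted(
    maximal_bicliques(example), key=lambda pair: (-len(pair[0]), sorted(pair[0]))
):
    print(sorted(members), "share", sorted(shared))
```

Ordering the bicliques by the number of participating gene sets gives a rough picture of how a hierarchy of intersections can be read from them: broad groups sharing few genes sit above narrow groups sharing many.
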
In addition to graph-based approaches, GeneWeaver allows statistical interrogation of data, including quantitative assessment of gene set overlap through Jaccard similarity analysis and clustering. A suite of Boolean tools permits users to perform set combination or enumeration, allowing the creation of novel gene sets based on shared components, and thus refining their derived ontological structure over time. Importantly, GeneWeaver's ability to integrate sets of genes and gene identifiers across species through identifier mapping enables heterogeneous data integration. In addition, genes may be associated with biological processes, disease states or semantic descriptions through Gene Ontology (GO) (5), Human Phenotype (HP) Ontology (6), Mammalian Phenotype (MP) Ontology (7) or the Ontology for Biomedical Investigations (OBI) (8). Aggregating sparse sets of genes by including multiple model organisms across multiple disease spaces unleashes the potential promised by convergent functional genomics. Effectively, this allows users to align disease phenotypes across model organisms, isolate sets of genes shared by biological processes or discover shared biological substrates within data driven disease hierarchies, as described in a recent review using GeneWeaver to find consilience among diverse data sets (9).
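
As a small illustration of the overlap statistics and Boolean operations described above, the following sketch computes Jaccard similarity (the size of the intersection divided by the size of the union) and derives a new gene set from the genes found in at least k input sets. The function names and the placeholder gene identifiers are assumptions for illustration and do not reflect GeneWeaver's internal implementation.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two gene sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def genes_in_at_least(gene_sets, k):
    """Boolean-style combination: genes present in at least k of the sets."""
    counts = Counter(gene for genes in gene_sets for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= k}


# Placeholder data, not taken from GeneWeaver.
set_a = {"g1", "g2", "g3"}
set_b = {"g1", "g3", "g4"}
set_c = {"g3", "g5"}

print(jaccard(set_a, set_b))                              # 2 shared of 4 total
print(sorted(genes_in_at_least([set_a, set_b, set_c], 2)))
```

With k equal to the number of sets the second function reduces to intersection, and with k equal to one it reduces to union.
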

DATA CONTENT

Since its initial description in the Nucleic Acids Research Database issue in 2012, GeneWeaver has undergone more than twofold growth in publicly available and private gene sets (1). A primary objective of GeneWeaver is to allow users to load, store and curate ad hoc sets of genes, derived based on user defined metrics, such as microarray results, genetic mapping and association studies, and semantic or publication associations, among others. A historical problem with genomic approaches is that in isolation, a set of genes, variants or other molecular entities may contain many false positive or false negative associations to a disease state or biological concept. The same set may reveal highly relevant information when analyzed in the context of aggregate data consisting of thousands of gene sets derived from multiple species, tissue types, physiological states or genetic backgrounds. This approach effectively reduces noise through high degree convergence among multiple experimentally derived data types. To this end, we have curated a large collection of gene sets, each assigned to a tier based on source and status. Publicly available sets of genes annotated to structured vocabularies and ontologies are assigned Tier I, or public resource data. In the case of semantic ontologies, each GeneWeaver gene set represents the closure of genes associated with a particular term. Other sets of genes, such as MeSH term-to-gene annotations, are derived from the processing of public sources and attributed to Tier II. In the case of MeSH, we take advantage of NCBI's gene-to-PubMed and PubMed-to-MeSH files to produce sets of genes annotated through their transitive associations. Tiers III (reviewed) and IV (user submitted and pending review) data are manually curated and publicly available, while Tier V designates private user submitted data that has not been subject to curatorial review or released to the public.
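
The transitive MeSH association can be sketched as a join on the PubMed identifier: a gene is annotated to a MeSH term when at least one publication links both. The sketch below assumes each source has already been reduced to simple pairs; the actual NCBI files carry additional columns, and the function name, identifiers and term labels here are placeholders rather than real records.

```python
from collections import defaultdict


def mesh_gene_sets(gene_to_pubmed, pubmed_to_mesh):
    """Build MeSH term -> gene set by joining the two inputs on PubMed ID.

    gene_to_pubmed: iterable of (gene_id, pmid) pairs
    pubmed_to_mesh: iterable of (pmid, mesh_term) pairs
    """
    terms_by_pmid = defaultdict(set)
    for pmid, term in pubmed_to_mesh:
        terms_by_pmid[pmid].add(term)

    genes_by_term = defaultdict(set)
    for gene, pmid in gene_to_pubmed:
        # .get avoids creating empty entries for publications without terms.
        for term in terms_by_pmid.get(pmid, ()):
            genes_by_term[term].add(gene)
    return dict(genes_by_term)


# Placeholder pairs, not real NCBI records.
gene_to_pubmed = [("g1", "pmid_A"), ("g2", "pmid_A"), ("g2", "pmid_B"), ("g3", "pmid_C")]
pubmed_to_mesh = [("pmid_A", "term_X"), ("pmid_B", "term_X"), ("pmid_B", "term_Y")]

for term, genes in sorted(mesh_gene_sets(gene_to_pubmed, pubmed_to_mesh).items()):
    print(term, sorted(genes))
```

Genes whose publications carry no MeSH term, such as `g3` here, simply do not appear in any derived set.
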
To date, GeneWeaver houses 100 069 active sets of genes, including 64 639 Tier I, 17 482 Tier II, 1070 Tier III and 14 386 Tier V gene sets. These numbers represent a 225% increase during the last three years and include the addition of gene sets associated with the Kyoto Encyclopedia of Genes and Genomes (KEGG) (10), MeSH, Molecular Signatures Database (MSigDB) (11), Mouse Genome Informatics (MGI), Online Mendelian Inheritance in Man (OMIM) (12), Pathway Commons (13), and rat QTLs from the Rat Genome Database (RGD) (14). Gene sets included in the initial 2012 publication have been augmented or refined based on changes in the data sources. A complete list of included data sources is available in Supplementary Table S1. Species are added based on criteria that include user requests, position of the organism as a model disease platform, existence of adequate functional genomic data sources and availability of a stable genome build, with preference given to species represented by an active annotation consortium. GeneWeaver currently houses data on ten species: Macaca mulatta, Canis familiaris, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Danio rerio, Caenorhabditis elegans, Gallus gallus, Saccharomyces cerevisiae and Homo sapiens. Genes associated with these species are derived from a variety of sources, including primary model organism databases (Rat Genome Database (RGD) (14), FlyBase (15), WormBase (16), ZFIN (17), Saccharomyces Genome Database (SGD) (18)), NCBI, Ensembl, Mouse Genome Informatics (MGI), and UniProt. Collectively, GeneWeaver contains a total of more than four million external reference identifiers, which translates to over three million unique GeneWeaver identifiers mapped onto 29 266 homolog clusters based on HomoloGene-based alignments (19).
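
The role of homolog clusters in cross-species comparison can be shown with a short sketch: species-specific identifiers are first translated to a shared cluster identifier, and set operations are then carried out on the clusters. The mapping table, the cluster labels (`cluster_1` and so on) and the function name are hypothetical; only the general approach is taken from the description above.

```python
def to_homolog_clusters(genes, species, cluster_of):
    """Translate species-specific gene identifiers to homolog cluster IDs.

    Identifiers without a known cluster are dropped from the result.
    """
    return {
        cluster_of[(species, gene)]
        for gene in genes
        if (species, gene) in cluster_of
    }


# Hypothetical mapping table; cluster labels are placeholders.
cluster_of = {
    ("Mus musculus", "Drd2"): "cluster_1",
    ("Homo sapiens", "DRD2"): "cluster_1",
    ("Mus musculus", "Bdnf"): "cluster_2",
    ("Homo sapiens", "BDNF"): "cluster_2",
    ("Homo sapiens", "CRH"): "cluster_3",
}

mouse_set = {"Drd2", "Bdnf", "unmapped_gene"}
human_set = {"DRD2", "CRH"}

shared = to_homolog_clusters(mouse_set, "Mus musculus", cluster_of) & to_homolog_clusters(
    human_set, "Homo sapiens", cluster_of
)
print(sorted(shared))
```

Once both sets are expressed as clusters, the same intersection, union and similarity tools apply regardless of the species of origin.
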

NEW DEVELOPMENTS

Analysis tools

Tools for the identification of potentially informative biological entities among sets of intersecting heterogeneous data are continually evaluated, upgraded and made more efficient, with preference given to scalable but exhaustive solutions over mere heuristic approaches, and to the ability of new tools to provide interpretable results through an intuitive user interface. Notably, complete bipartite Hierarchical Similarity (HiSim) graphs can be constructed from hundreds or thousands of sets of genes, which produces enormous graphs. These graphs now include bootstrap algorithms that sample edges within result sets to remove underrepresented edges and nodes, greatly reducing complexity. Visualizations have been enhanced to color nodes based on pre-selected emphasis genes, dynamically selected genes, or set similarity to existing user-defined sets (Figure 1).

Figure 1.

The GeneWeaver HiSim graph represents hierarchical intersections of maximal bicliques based on shared genes. This structure relates a data driven ontology. Here, intersecting sets of genes are colored by inclusion of a gene of interest (orange), or by sets of user-defined genes (green opacity).

Multi-partite relationships

As the number and variety of data sources continue to expand, it is evident that relevant biological associations may only be apparent through the intersection of multiple partitions. For example, one may wish to identify genes with a maximal association to MeSH derived gene sets and gene sets annotated to empirical cocaine experiments. In order to account for high order partition associations, GeneWeaver has recently adapted the bipartite gene association graph to include multi-partite sets of genes (20). Users are able to select each partite set based on sets of genes associated with a project. Edges between partite sets can be created by setting a minimum threshold of Jaccard overlap between individual sets of genes contained within each set. Alternatively, edges between sets can be created based on shared genes within each. Results highlight common and unique genes and gene sets associated with the underlying partite sets. This provides a means through which a prospective analysis of multi-way set intersections can be performed, potentially aligning semantic and data driven ontologies through mediating sets of genes.
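
The two edge-construction rules described above can be sketched for a pair of partite sets. This is an illustrative simplification rather than the published multi-partite method (20): the function names, the threshold values and the placeholder gene sets (`mesh_1`, `exp_1`, `g1`, and so on) are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_edges(part_a, part_b, min_jaccard):
    """Connect gene sets across two partite sets when overlap meets the threshold."""
    return [
        (name_a, name_b, jaccard(genes_a, genes_b))
        for name_a, genes_a in sorted(part_a.items())
        for name_b, genes_b in sorted(part_b.items())
        if jaccard(genes_a, genes_b) >= min_jaccard
    ]


def shared_gene_edges(part_a, part_b, min_shared=1):
    """Alternative rule: connect gene sets that share at least min_shared genes."""
    return [
        (name_a, name_b, sorted(set(genes_a) & set(genes_b)))
        for name_a, genes_a in sorted(part_a.items())
        for name_b, genes_b in sorted(part_b.items())
        if len(set(genes_a) & set(genes_b)) >= min_shared
    ]


# Placeholder partite sets, e.g. MeSH-derived sets and experimental sets.
mesh_sets = {"mesh_1": {"g1", "g2", "g3", "g4"}, "mesh_2": {"g3", "g5"}}
experiment_sets = {"exp_1": {"g1", "g2", "g4"}, "exp_2": {"g3", "g6"}}

for edge in jaccard_edges(mesh_sets, experiment_sets, min_jaccard=0.3):
    print(edge)
for edge in shared_gene_edges(mesh_sets, experiment_sets):
    print(edge)
```

The Jaccard rule favors sets of comparable size with substantial overlap, whereas the shared-gene rule also connects a large set to a small one through a single common gene, so the two rules can yield different graphs from the same data.
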

User interface

To take advantage of the benefits of increased interoperability, stability, flexibility and modularity in modern web-based platforms, the GeneWeaver interface has been wholly redesigned based on Python and the Flask microframework, leveraging its native Jinja2 template engine, RESTful request dispatching, Web Server Gateway Interface (WSGI) 1.0 compliance and secure session settings. The overall look and feel has been intentionally streamlined to highlight analysis functionality and user operations around the Gene Set metaphor. Thus, each operation functions on a set of genes.

Data sharing

Expanded capabilities for user sharing, user-driven group administration and project sharing allow users to share access to data and to analysis results, so that a team of investigators can collaborate on pre-publication data, ultimately releasing the data for public access. Improved flexibility in this system allows users to work with specific collaborators within a session, and to transfer work to another group member as the project and team evolve.

Search

Because GeneWeaver is designed to present real time data query and analysis, the web interface has been optimized to search sets of genes rapidly, based on metadata, size, ontological associations, related publications and curatorial text. We have adopted Sphinx, an open source full-text search engine (http://www.sphinxsearch.com). Search results can be organized by species, curation tier or attribution type, and filtered by set size, status or other attributes (Figure 2).

Figure 2.

Search results can be organized by species, curation tiers and attribution metadata, and filtered by gene set size, status and group privileges. In this example, a search for ‘cocaine addiction’ returns 100 sets of genes, predominantly from rat, mouse and human. These are mostly Tier I and III data from published sources or experimentally derived, respectively. Numerous sets are also identified as being from the drug-related gene database (DRG).

Documentation

GeneWeaver has reconfigured its documentation within a wiki (GeneWeaver.org/wiki). Here, users can find tutorials, a quick start guide (Supplement Figure S1) and details about each tool and data set. The wiki also contains updated curation standards and instructions for connecting GeneWeaver tools to other community resources.

Data access and web services

GeneWeaver has adopted a REpresentational State Transfer (REST)-ful web services model that allows programmatic interaction with underlying data sets, such as data export (21). These dynamic facets are designed to support direct query of all analysis tools, including job status, stored results and other data. Data are returned as a result image or as JavaScript Object Notation (JSON).
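
A client for a JSON-returning REST service of this kind might look like the following sketch. The base URL, the route layout and the `name` and `genes` fields are hypothetical and are not GeneWeaver's documented interface; consult the GeneWeaver documentation for the actual endpoints and response schema. Only the Python standard library is used.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical base URL for illustration; not the GeneWeaver endpoint.
BASE_URL = "https://example.org/api"


def geneset_url(geneset_id):
    """Build the (hypothetical) URL for one gene set resource."""
    return f"{BASE_URL}/genesets/{quote(str(geneset_id))}"


def parse_geneset(payload):
    """Parse a JSON document with assumed 'name' and 'genes' fields."""
    record = json.loads(payload)
    return record["name"], set(record["genes"])


def fetch_geneset(geneset_id, timeout=30):
    """Retrieve and parse one gene set; performs a network request."""
    with urlopen(geneset_url(geneset_id), timeout=timeout) as response:
        return parse_geneset(response.read().decode("utf-8"))


# Offline demonstration with a placeholder payload.
name, genes = parse_geneset('{"name": "demo set", "genes": ["g1", "g2", "g1"]}')
print(name, sorted(genes))
```

Keeping URL construction and payload parsing separate from the network call lets the parsing logic be exercised without access to the service.
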

Annotation to OBI and other widely used ontologies

GeneWeaver gene sets and networks are each annotated with appropriate ontologies. Data curated from individual published studies are among the hardest to annotate, but have been previously supported with extensive free text documentation. We have also recently initiated the formal annotation of these sets to terms in the OBI ontology (8).

CONCLUSIONS

With a greatly increased repository of background gene sets, GeneWeaver enables its users to perform a tremendous variety of applications directed at emerging questions in the comparison and prioritization of genes and variants and their role in disease (9). GeneWeaver is maintained under active development and continues to move towards web services, big data storage and computation paradigms, and intentionally maintained curatorial groups. The addition of updated user interfaces, new search features and analysis tools positions GeneWeaver for continued growth and use within the community. The foundational approach supported by the GeneWeaver model, namely, that of finding consilience among cross-species heterogeneous data, has produced numerous success stories, where data analysis explicitly informs hypothesis creation in vitro and in vivo.

19 in total

1. A description of the Molecular Signatures Database (MSigDB) Web site.

Authors: Arthur Liberzon
Journal: Methods Mol Biol Date: 2014

2. ZFIN, The zebrafish model organism database: Updates and new directions.

Authors: Leyla Ruzicka; Yvonne M Bradford; Ken Frazer; Douglas G Howe; Holly Paddock; Sridhar Ramachandran; Amy Singer; Sabrina Toro; Ceri E Van Slyke; Anne E Eagle; David Fashena; Patrick Kalita; Jonathan Knight; Prita Mani; Ryan Martin; Sierra A T Moxon; Christian Pich; Kevin Schaper; Xiang Shao; Monte Westerfield
Journal: Genesis Date: 2015-07-08 Impact factor: 2.487

3. The mammalian phenotype ontology: enabling robust annotation and comparative analysis.

Authors: Cynthia L Smith; Janan T Eppig
Journal: Wiley Interdiscip Rev Syst Biol Med Date: 2009 Nov-Dec

4. Modeling biomedical experimental processes with OBI.

Authors: Ryan R Brinkman; Mélanie Courtot; Dirk Derom; Jennifer M Fostel; Yongqun He; Phillip Lord; James Malone; Helen Parkinson; Bjoern Peters; Philippe Rocca-Serra; Alan Ruttenberg; Susanna-Assunta Sansone; Larisa N Soldatova; Christian J Stoeckert; Jessica A Turner; Jie Zheng
Journal: J Biomed Semantics Date: 2010-06-22

5. Ontological Discovery Environment: a system for integrating gene-phenotype associations.

Authors: Erich J Baker; Jeremy J Jay; Vivek M Philip; Yun Zhang; Zuopan Li; Roumyana Kirova; Michael A Langston; Elissa J Chesler
Journal: Genomics Date: 2009-09-03 Impact factor: 5.736

6. The Rat Genome Database 2015: genomic, phenotypic and environmental variations and disease.

Authors: Mary Shimoyama; Jeff De Pons; G Thomas Hayman; Stanley J F Laulederkind; Weisong Liu; Rajni Nigam; Victoria Petri; Jennifer R Smith; Marek Tutaj; Shur-Jen Wang; Elizabeth Worthey; Melinda Dwinell; Howard Jacob
Journal: Nucleic Acids Res Date: 2014-10-29 Impact factor: 19.160

7. Saccharomyces genome database provides new regulation data.

Authors: Maria C Costanzo; Stacia R Engel; Edith D Wong; Paul Lloyd; Kalpana Karra; Esther T Chan; Shuai Weng; Kelley M Paskov; Greg R Roe; Gail Binkley; Benjamin C Hitz; J Michael Cherry
Journal: Nucleic Acids Res Date: 2013-11-21 Impact factor: 16.971

8. On finding bicliques in bipartite graphs: a novel algorithm and its application to the integration of diverse biological data types.

Authors: Yun Zhang; Charles A Phillips; Gary L Rogers; Erich J Baker; Elissa J Chesler; Michael A Langston
Journal: BMC Bioinformatics Date: 2014-04-15 Impact factor: 3.169

9. Inferring gene ontologies from pairwise similarity data.

Authors: Michael Kramer; Janusz Dutkowski; Michael Yu; Vineet Bafna; Trey Ideker
Journal: Bioinformatics Date: 2014-06-15 Impact factor: 6.937

10. WormBase 2014: new views of curated biology.

Authors: Todd W Harris; Joachim Baran; Tamberlyn Bieri; Abigail Cabunoc; Juancarlos Chan; Wen J Chen; Paul Davis; James Done; Christian Grove; Kevin Howe; Ranjana Kishore; Raymond Lee; Yuling Li; Hans-Michael Muller; Cecilia Nakamura; Philip Ozersky; Michael Paulini; Daniela Raciti; Gary Schindelman; Mary Ann Tuli; Kimberly Van Auken; Daniel Wang; Xiaodong Wang; Gary Williams; J D Wong; Karen Yook; Tim Schedl; Jonathan Hodgkin; Matthew Berriman; Paul Kersey; John Spieth; Lincoln Stein; Paul W Sternberg
Journal: Nucleic Acids Res Date: 2013-11-04 Impact factor: 16.971
