Lynx: a database and knowledge extraction engine for integrative medicine.

Dinanath Sulakhe, Sandhya Balasubramanian, Bingqing Xie, Bo Feng, Andrew Taylor, Sheng Wang, Eduardo Berrocal, Utpal Dave, Jinbo Xu, Daniela Börnigen, T Conrad Gilliam, Natalia Maltsev.

Abstract

We have developed Lynx (http://lynx.ci.uchicago.edu), a web-based database and knowledge extraction engine that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Its underlying knowledge base (LynxKB) integrates various classes of information from >35 public databases and private collections, as well as manually curated data from our group and collaborators. Lynx provides advanced search capabilities and a variety of algorithms for enrichment analysis and network-based gene prioritization to assist the user in extracting meaningful knowledge from LynxKB and experimental data. Its service-oriented architecture provides public access to LynxKB and its analytical tools via user-friendly web services and interfaces.

Year: 2013 PMID: 24270788 PMCID: PMC3965040 DOI: 10.1093/nar/gkt1166

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Technological advances in genomics now allow us to produce biological data at unprecedented tera- and petabyte scales. The extraction of useful knowledge from these voluminous data sets critically depends on seamless integration of clinical, genomic and experimental information with prior knowledge about genotype–phenotype relationships accumulated in a plethora of databases. Furthermore, these large and complex integrated knowledge bases should be accessible to search engines and algorithms that drive efficient knowledge extraction, advancing scientific insight and the development of biomedical applications. To meet these challenges, we developed Lynx (http://lynx.ci.uchicago.edu), a web-based database and knowledge extraction engine for annotation and analysis of high-throughput biomedical data. The Lynx database was designed specifically to support both discovery-based and hypothesis-based approaches to prediction of genetic factors and networks contributing to phenotypes of interest. This support is provided by integration of vast amounts of information (e.g. genomic data, pathways, molecular interactions and others) from public and private repositories, as well as the targeted acquisition of phenotypic information and data describing association of genetic factors with diseases, clinical symptoms and phenotypic features. Lynx's advanced search engines and a variety of algorithms for enrichment analysis and network-based gene prioritization support the extraction of meaningful knowledge from LynxKB and from experimental data provided by the users. Lynx also enables formulation of weighted hypotheses regarding molecular mechanisms contributing to human phenotypes and disorders of interest.

LYNX DESIGN AND COMPONENTS

The Lynx database system has the following major components: (i) Integrated Lynx knowledge base (LynxKB); (ii) Knowledge extraction services currently available for LynxKB, including advanced search capabilities, features-based gene enrichment analysis and network-based gene prioritization, which may be invoked via the Lynx REST interface; and (iii) ‘Web Interface’, a user-friendly web interface for accessing the annotations and analytical tools.

Lynx integrated knowledgebase

LynxKB is a database integrating modeled data from >35 databases and manually curated private collections (Table 1). These data are used for annotation and for extraction of knowledge, either from LynxKB via database queries or from experimental data provided by the user. An XML schema-driven annotation service exposes LynxKB annotations as RESTful web services. Additionally, LynxKB contains a number of manually curated in-house data collections, including customized ontologies for early brain development and brain connectivity (developed in collaboration with Dr. Paciorkowski, University of Rochester) and weighted collections of candidate genes provided by our clinical collaborators or extracted from the Developmental Brain Disorders Database (DBDB) and other disease-related data sources such as AutDB (19), Schizophrenia Gene Resource (20), LisDB (https://lisdb.ci.uchicago.edu) and Cancer Gene Index (https://wiki.nci.nih.gov/display/cageneindex/caBIO). Lynx also provides exclusive analytical access to the text-mining data describing molecular interactions from GeneWays (26). Integration of the data describing clusters of transcription factor binding sites (28) and enhancers (29), as provided by the Vista project, allows information about non-coding genomic signals to be factored into the Lynx predictions of genetic factors involved in disorders of interest. Integrated structured data from LynxKB are available for download in multiple formats (e.g. XML, CSV, TXT, JSON) via a web-based user interface and web services.

Table 1.

Data types and resources integrated in LynxKB

Type of data	Source
Genomic	NCBI (1), Ensembl (2), UniGene (3), TRANSFAC^b (4), RefSeq (5)
Proteomic	BIND (6), BioGRID (7), HPRD (8), MINT (9), UniProt (10), InterPro (11)
Pathways-related	KEGG (12), Reactome (13), NCI (14), BioCarta, STRING^b (15), TRANSPATH^b (16), Pathway Commons (17)
Disease-specific	OMIM, Disease ontology (18), AutDB (19), SZGR (20), Cancer gene index, AGRE, DBDB^a, LisDB^a
Phenotypic	OMIM, Human phenotype ontology (21), customized ontologies^a
Variations	Genomic association database (22), Database of genomic variants (23), Human Gene Mutation Database^b (24), SLEP (25), NHGRI
Text-mining	GeneWays^a (26), Diseases (University of Copenhagen)
Pharmacogenomics	Comparative toxicogenomics database (CTD) (27)

^a Customized and manually curated sources of information.

^b The resources are not displayed on the annotations page due to proprietary license restrictions and/or are used exclusively in the analytical pipelines.

Lynx data are available for download in a number of ways: (i) ‘LynxKB database dumps’. Because public data are available for download at the respective sources and the complete integrated LynxKB is prohibitively large, downloading its full content may be impractical. However, any part of the public data integrated into LynxKB is available on request in the form of tab-delimited tables and database dumps; (ii) all annotations and results of analysis in Lynx are available for download in CSV format via the ‘download’ button displayed on every page; and (iii) any Lynx object or set of objects, as well as the results of annotation and analysis, may be downloaded using web services in JSON and XML formats.

Lynx knowledge extraction engine

Seamless integration of data, knowledge-extraction services and integrative analysis in Lynx provides a one-stop solution for generating weighted hypotheses regarding the molecular mechanisms contributing to the phenotypes of interest (Figure 1). Lynx supports multiple entry points for annotation and analysis of individual objects (e.g. genes, pathways, disorders) and batch queries. The user can submit search-based queries to LynxKB or experimental results to be analyzed by Lynx, such as the results of next-generation sequencing (NGS), copy number variation-based analyses or gene expression data. These can be supplied in the form of SNPs, genomic coordinates or gene lists, which are then annotated or analyzed downstream via the web user interface and its integrated services. Lynx provides the following knowledge extraction tools for the downstream analyses and annotations:

Figure 1.

A workflow of knowledge extraction in the Lynx database, in which initial query genes are filtered interactively using annotations or based on the results of enrichment analysis. Resulting gene sets are ranked by the user according to his/her preferences and further prioritized using network-based prioritization, assisting in the prediction of molecular mechanisms contributing to the phenotype or biological process of interest to the user.

Advanced search

The large-scale integration of biomedical data in Lynx makes it possible to mine these data from a systems perspective. Its search capabilities, based on Apache Lucene (http://lucene.apache.org/), allow users to generate highly selective data sets by filtering queries to LynxKB on multiple parameters of interest (e.g. phenotypes, pathways, keywords), and support phrase queries, wildcard queries and Boolean operators for deeper refinement of search results. As illustrated below in the case study, users can start with broader searches based on diseases, pathways or symptoms of interest and then refine and narrow down the results according to the parameters of interest. Another important feature of the advanced search functionality is that query results are presented in association with other relevant annotations, such as genes, pathways, tissues and phenotypes, to provide a comprehensive overview of an object of interest.
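The broad-then-refine search pattern described above can be sketched in a few lines. This is only an illustration in Python, not the Lucene-based Lynx search engine: the records, the placeholder gene names (GENE_A to GENE_D) and the `search` helper are all hypothetical, and wildcard matching is delegated to the standard-library `fnmatch` module.

```python
from fnmatch import fnmatch

# Toy annotation records; gene names and term assignments are placeholders,
# not real LynxKB content.
RECORDS = [
    {"gene": "GENE_A", "disease": ["autism"], "phenotype": ["seizures", "macrocephaly"]},
    {"gene": "GENE_B", "disease": ["autism"], "phenotype": ["hypotonia"]},
    {"gene": "GENE_C", "disease": ["schizophrenia"], "phenotype": ["seizures"]},
    {"gene": "GENE_D", "disease": ["autism spectrum disorder"], "phenotype": ["febrile seizure"]},
]


def matches(record, field, pattern):
    """Return True if any value of record[field] matches the wildcard pattern."""
    return any(fnmatch(value.lower(), pattern.lower()) for value in record[field])


def search(records, **filters):
    """Boolean AND over fields: keep records that match every field pattern."""
    return [
        record
        for record in records
        if all(matches(record, field, pattern) for field, pattern in filters.items())
    ]


# Start broad (disease), then narrow the result set by phenotype.
broad = search(RECORDS, disease="autism*")
refined = search(broad, phenotype="*seizure*")

broad_genes = [record["gene"] for record in broad]
refined_genes = [record["gene"] for record in refined]
print(broad_genes)
print(refined_genes)
```

Each call narrows the previous result set, which mirrors the interactive filtering used in the case study (a disease-level query followed by a symptom-level filter).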

Annotation services

Lynx’s XML schema-driven annotation service provides annotations from the integrated database as RESTful web services. Every query to LynxKB for an individual object (e.g. a gene) or a batch query (e.g. a list of genes or genomic coordinates) extracts all information relevant to the query from LynxKB for the growing list of annotations [e.g. gene function description (RefSeq), associated pathways, diseases, clinical symptoms, molecular interactions, toxicogenomic information, Gene Ontology categories, tissues and other related annotations] and displays it to the user according to his/her preferences. Lynx provides detailed web interfaces for single-gene or multiple-gene annotations that let users examine the function of the genes of interest from multiple perspectives. All information related to the objects is easily accessible via the user interface and available for download in tab-delimited, XML or JSON formats (web services).

Statistical enrichment analysis

Lynx assists the user in formulating hypotheses regarding the molecular mechanisms involved in the phenomena under study by providing tools for enrichment analysis and identification of functional categories over-represented in the query data sets. Two singular enrichment analysis algorithms, based on Bayes factor and P-value estimates, are used in our pipeline for this purpose [see Xie et al. (30) for a fuller description and results of analysis]. Enrichment analysis in Lynx is based on a large variety of features obtained from multiple sources [e.g. associated pathways and diseases (Table 1), various levels of resolution of Gene Ontology terms], as well as customized brain development and brain connectivity ontologies that are unique to the system, symptoms-level phenotypes and associated non-coding signals (e.g. enhancers and clusters of transcription factor binding sites). The results of the enrichment analyses based on multiple categories of interest to the user may be used for formulating a working hypothesis regarding molecular mechanisms involved in phenomena of interest. Lynx also supports contextual enrichment analysis (e.g. against genes expressed in a particular tissue or at a particular developmental stage) that may substantially increase the accuracy of the results.
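To make the idea of a P-value estimate for over-representation concrete, the following Python sketch computes a one-sided hypergeometric tail probability for a single functional category. The text above does not specify which test Lynx uses, so the hypergeometric model is an assumption chosen because it is a common choice for singular enrichment analysis; the Bayes factor estimate is not shown. The function name and the background, category and overlap counts in the usage lines are hypothetical; only the query size of 59 genes is taken from the case study.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background set
    annotated:  background genes carrying the functional category
    sample:     number of genes in the query set
    hits:       query genes carrying the functional category
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Small case that can be checked by hand: (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120
small_p = hypergeom_enrichment_p(population=10, annotated=4, sample=3, hits=2)

# Hypothetical counts for a 59-gene query against a 20 000-gene background.
query_p = hypergeom_enrichment_p(population=20000, annotated=300, sample=59, hits=8)
print(small_p, query_p)
```

Contextual enrichment, as described above, would correspond to shrinking `population` and `annotated` to the genes expressed in the tissue or developmental stage of interest before calling the function.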

Network-based gene prioritization

Gene prioritization proposes promising candidate genes from a large set of genes, or even from the entire genome, for a disease or phenotype of interest. For network-based gene prioritization, Lynx integrates five network propagation algorithms [simple random walk, heat kernel diffusion (31), PageRank with priors (32), HITS with priors (33) and K-step Markov (33)] and uses STRING version 9.0 (15) as the underlying protein interaction network, as initially suggested in PINTA (34,35). To use known disease genes as input, the algorithms were modified for Lynx by replacing the continuous microarray expression data required by the original PINTA implementation with binary data derived from seed genes associated with a disease or phenotype of interest: a ‘1’ is fed as an input for each seed gene, whereas a ‘0’ is assigned to all non-seed genes (36). Additionally, these algorithms were modified to accommodate a variety of weighted data types to be used for gene prioritization, including ranked gene-to-phenotype associations, weighted canonical pathways, gene expression, NGS data and others. Consequently, the propagation algorithms provide a ranked list of novel and promising candidate genes based on the signal propagated through the network, starting from the binary or weighted values assigned to disease-related genes.
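The effect of propagating a binary seed vector through a network can be illustrated with a generic PageRank-with-priors (random walk with restart) iteration. The Python sketch below is not the Lynx or PINTA code: the five-node toy network, the node names, the restart probability of 0.3 and the fixed iteration count are all assumptions made for illustration, and edge weights (which STRING provides) are ignored.

```python
def pagerank_with_priors(adjacency, seeds, restart=0.3, iterations=100):
    """Propagate a binary seed signal over an undirected, unweighted network.

    adjacency: dict mapping each node to a list of its neighbours
    seeds:     set of seed nodes (input '1'); all other nodes get input '0'
    restart:   probability of jumping back to the seed set at each step
               (the value 0.3 is an arbitrary choice for this sketch)
    """
    nodes = sorted(adjacency)
    # Binary input vector, normalised so that the scores sum to 1.
    prior = {node: (1.0 if node in seeds else 0.0) for node in nodes}
    norm = sum(prior.values())
    prior = {node: value / norm for node, value in prior.items()}

    score = dict(prior)
    for _ in range(iterations):
        updated = {node: restart * prior[node] for node in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                continue  # an isolated node has nowhere to send its score
            share = (1.0 - restart) * score[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        score = updated
    return score


# Toy network: two seed genes share a neighbour (CAND1); CAND2 and CAND3 lie further away.
NETWORK = {
    "SEED1": ["CAND1"],
    "SEED2": ["CAND1"],
    "CAND1": ["SEED1", "SEED2", "CAND2"],
    "CAND2": ["CAND1", "CAND3"],
    "CAND3": ["CAND2"],
}
SEEDS = {"SEED1", "SEED2"}

scores = pagerank_with_priors(NETWORK, SEEDS)
ranked_candidates = sorted(
    (node for node in scores if node not in SEEDS), key=scores.get, reverse=True
)
print(ranked_candidates)
```

Non-seed nodes closer to the seed genes accumulate more of the propagated signal and therefore rank higher. Replacing the binary `prior` with weighted values would correspond to the weighted inputs mentioned above.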

CASE STUDY: IDENTIFICATION OF MOLECULAR MECHANISMS ASSOCIATED WITH SEIZURES IN AUTISM

This case study illustrates the functionality of Lynx by predicting genes and molecular mechanisms associated with a particular symptom of autism (seizures) using several Lynx analyses: annotation, gene set enrichment analysis and gene prioritization. Autism spectrum disorders (ASD) are known to be associated with an increased incidence of epilepsy and of epileptiform discharges on electroencephalograms. However, it is unknown whether epileptiform discharges correlate with symptoms of ASD or which molecular mechanisms contribute (37,38). To formulate a weighted hypothesis regarding genes and molecular mechanisms potentially contributing to epilepsy in patients with autism, we performed the following steps:

Step 1: Lynx advanced search was used to perform a ‘fuzzy’ search for autism candidate genes against the ‘disease’ object. The search returned 483 genes associated with autism by OMIM, AutDB and the Diseases database from the University of Copenhagen. These genes were further filtered using ‘seizures’ as a fuzzy search term. The resulting query returned 59 genes positively associated with both autism and the seizures phenotype (Supplementary Table S1).

Step 2: Enrichment analysis of these 59 genes showed over-representation of functional categories associated with synaptic transmission, ionotropic glutamate receptor binding and voltage-gated sodium channel activity, which are already known to be associated with ASD and epileptic phenotypes.

Step 3: The 59 genes obtained in Step 1 can be ranked according to the strength of their association with autism, as suggested by AutDB or expert curation, or can be assigned a default score of ‘1’, as in this use case.

Step 4: The ranked set of genes from Step 3 was used as input to the gene prioritization tool, based on the heat kernel-ranking algorithm (Supplementary Table S2). Default parameters were used to run the algorithm.

Gene prioritization predicted an additional 31 high-scoring genes (P < 0.02) potentially contributing to the epileptic phenotype in autistic patients (Supplementary Table S3). A number of these genes predicted by the network were recently found to be associated with ASD and epileptic phenotypes, but were not yet included in the AutDB and OMIM databases (and consequently in LynxKB) as markers for ASD. These include: DLG3, discs, large homolog 3 (39), GAD1, glutamate decarboxylase 1, brain type (40), DOCK8, dedicator of cytokinesis 8 (41), GABRB3, GABA A receptor, beta 3 (42), GLUD2, glutamate dehydrogenase 2 (43) and others (see Supplementary Table S3 for more details). All results of analyses are available for download in various formats via the user interface or web services. A video and tutorial describing this and other examples of using Lynx for data annotation and analyses are available on the Lynx website at http://lynx.ci.uchicago.edu/usecase.html.

SYSTEMS ARCHITECTURE

Lynx is designed using a service-oriented architecture and is implemented using JAX-RS and the Spring framework (44) to provide the integrated data and analytical tools as RESTful services (45). The integrated data are modeled and represented as XML schemas, which JAXB (46) automatically translates into Java objects that are then used to encapsulate data from the MySQL database. The resulting annotations and results of analysis are delivered in XML, JSON or TXT format, depending on the request. The project is developed using the Maven (http://maven.apache.org) multi-module architecture, so that the various data access object (DAO) modules, service modules and REST-resource modules are independently implemented and reused where necessary using Spring’s dependency injection. The algorithms involved in the analytical steps are implemented in Java with the required statistical packages (such as MATLAB, which is used in network-based prioritization) and integrated within the project as Maven modules. The modular design allows us to maintain ‘separation of concerns’ within the complete project without introducing any design or architecture-based dependencies.
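The idea of delivering one modeled object in XML, JSON or TXT depending on the request can be sketched independently of the Java stack. The Python example below is an analogy only: Lynx itself uses JAXB-generated Java objects behind JAX-RS resources, whereas here a plain dictionary stands in for the model object, and the `render` function, the element name `gene` and the record fields are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def render(record, fmt):
    """Serialize one annotation record in the requested representation."""
    if fmt == "json":
        return json.dumps(record, sort_keys=True)
    if fmt == "xml":
        root = ET.Element("gene")  # hypothetical element name
        for key in sorted(record):
            child = ET.SubElement(root, key)
            child.text = str(record[key])
        return ET.tostring(root, encoding="unicode")
    if fmt == "txt":
        return "\t".join(f"{key}={record[key]}" for key in sorted(record))
    raise ValueError(f"unsupported format: {fmt}")


# Placeholder record; field names and values are illustrative only.
record = {"symbol": "GENE_A", "pathway": "example pathway"}

as_json = render(record, "json")
as_xml = render(record, "xml")
as_txt = render(record, "txt")
print(as_json)
print(as_xml)
print(as_txt)
```

Keeping the data model separate from its representations is what lets the same service layer feed both the web interface and external clients.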

Data and analytical web services

Although the integrated data, annotations and analytical tools are presented to users via the web interface, the service-oriented architecture enables other users and groups to leverage our work and integrate it within their own research tools and platforms. For example, the Globus Genomics project (47) at the University of Chicago Computation Institute is working to integrate the LynxKB annotation services and analytical workflows (via web services) for analysis and annotation of NGS results. The Developmental Brain Disorders Database (https://www.dbdb.urmc.rochester.edu/home) at the University of Rochester and RViewer (48) are also using the Lynx RESTful web service interface for annotation of genomic data. End users can download the data sets of interest and results of analysis from the web interface.

CONCLUSIONS

We present the Lynx database and knowledge extraction suite of tools, designed specifically to support discovery-based and hypothesis-based approaches to identification of genetic factors contributing to phenotypes or disorders of interest. Lynx integrates the main downstream analyses, such as gene annotation, gene set enrichment analysis and gene prioritization, within one engine. It is based on a large knowledge base of public and private data and a powerful search engine, both of which are accessible to the user through a user-friendly web interface.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

40 in total

1. GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data.

Authors: Andrey Rzhetsky; Ivan Iossifov; Tomohiro Koike; Michael Krauthammer; Pauline Kra; Mitzi Morris; Hong Yu; Pablo Ariel Duboué; Wubin Weng; W John Wilbur; Vasileios Hatzivassiloglou; Carol Friedman
Journal: J Biomed Inform Date: 2004-02 Impact factor: 6.317

2. Fine mapping of Xq11.1-q21.33 and mutation screening of RPS6KA6, ZNF711, ACSL4, DLG3, and IL1RAPL2 for autism spectrum disorders (ASD).

Authors: Katri Kantojärvi; Ilona Kotala; Karola Rehnström; Tero Ylisaukko-Oja; Raija Vanhala; Taina Nieminen von Wendt; Lennart von Wendt; Irma Järvelä
Journal: Autism Res Date: 2011-02-22 Impact factor: 5.216

3. SZGR: a comprehensive schizophrenia gene resource.

Authors: P Jia; J Sun; A Y Guo; Z Zhao
Journal: Mol Psychiatry Date: 2010-05 Impact factor: 15.992

4. Candidate gene prioritization by network analysis of differential expression using machine learning approaches.

Authors: Daniela Nitsch; Joana P Gonçalves; Fabian Ojeda; Bart de Moor; Yves Moreau
Journal: BMC Bioinformatics Date: 2010-09-14 Impact factor: 3.169

5. Reactome: a database of reactions, pathways and biological processes.

Authors: David Croft; Gavin O'Kelly; Guanming Wu; Robin Haw; Marc Gillespie; Lisa Matthews; Michael Caudy; Phani Garapati; Gopal Gopinath; Bijay Jassal; Steven Jupe; Irina Kalatskaya; Shahana Mahajan; Bruce May; Nelson Ndegwa; Esther Schmidt; Veronica Shamovsky; Christina Yung; Ewan Birney; Henning Hermjakob; Peter D'Eustachio; Lincoln Stein
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

6. Variation in the autism candidate gene GABRB3 modulates tactile sensitivity in typically developing children.

Authors: Teresa Tavassoli; Bonnie Auyeung; Laura C Murphy; Simon Baron-Cohen; Bhismadev Chakrabarti
Journal: Mol Autism Date: 2012-07-06 Impact factor: 7.509

7. Disease Ontology: a backbone for disease semantic integration.

Authors: Lynn Marie Schriml; Cesar Arze; Suvarna Nadendla; Yu-Wei Wayne Chang; Mark Mazaitis; Victor Felix; Gang Feng; Warren Alden Kibbe
Journal: Nucleic Acids Res Date: 2011-11-12 Impact factor: 16.971

8. InterProScan: protein domains identifier.

Authors: E Quevillon; V Silventoinen; S Pillai; N Harte; N Mulder; R Apweiler; R Lopez
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

Review 9. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine.

Authors: Peter D Stenson; Matthew Mort; Edward V Ball; Katy Shaw; Andrew Phillips; David N Cooper
Journal: Hum Genet Date: 2014-01 Impact factor: 4.132

10. PID: the Pathway Interaction Database.

Authors: Carl F Schaefer; Kira Anthony; Shiva Krupa; Jeffrey Buchoff; Matthew Day; Timo Hannay; Kenneth H Buetow
Journal: Nucleic Acids Res Date: 2008-10-02 Impact factor: 16.971
