
Lynx: a knowledge base and an analytical workbench for integrative medicine.

Dinanath Sulakhe¹, Bingqing Xie², Andrew Taylor³, Mark D'Souza³, Sandhya Balasubramanian³, Somaye Hashemifar⁴, Steven White⁵, Utpal J Dave⁶, Gady Agam⁷, Jinbo Xu⁴, Sheng Wang⁸, T Conrad Gilliam⁹, Natalia Maltsev¹⁰.

Abstract

Lynx (http://lynx.ci.uchicago.edu) is a web-based database and a knowledge extraction engine. It supports annotation and analysis of high-throughput experimental data and generation of weighted hypotheses regarding genes and molecular mechanisms contributing to human phenotypes or conditions of interest. Since the last release, the Lynx knowledge base (LynxKB) has been periodically updated with the latest versions of the existing databases and supplemented with additional information from public databases. These additions have enriched the data annotations provided by Lynx and improved the performance of Lynx analytical tools. Moreover, the Lynx analytical workbench has been supplemented with new tools for reconstruction of co-expression networks and feature-and-network-based prioritization of genetic factors and molecular mechanisms. These developments facilitate the extraction of meaningful knowledge from experimental data and LynxKB. The Service Oriented Architecture provides public access to LynxKB and its analytical tools via user-friendly web services and interfaces.

Year: 2015 PMID: 26590263 PMCID: PMC4702889 DOI: 10.1093/nar/gkv1257

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The extraction of useful knowledge from voluminous datasets generated by functional genomics critically depends on the seamless integration of clinical, genomic and experimental information with the knowledge about genotype–phenotype relationships, which has been accumulated in a variety of disparate databases. The large-scale integration of all this data enables efficient data mining for advancing scientific insight and supports the development of new biomedical applications. To address these challenges, we further developed Lynx, a bioinformatics platform offering a large compendium of biomedical information (LynxKB) and a collection of analytical tools (http://lynx.ci.uchicago.edu) (1,2). Lynx supports the annotation and analysis of various types of high-throughput experimental data. It offers both discovery- and hypothesis-based approaches for the prediction of the genetic factors and molecular mechanisms contributing to the phenotypes or conditions of interest to the users. The current release of LynxKB includes additional information, as shown in Table 1. We have integrated these new datasets with both the existing analytical tools (e.g. the enrichment analysis tool) and the newly developed ones (e.g. the Cheetoh algorithm) (3,4). Integration of this information also enhances data annotation in Lynx.

Table 1.

Data types and resources integrated in LynxKB

Type of data	Source
Genomic	NCBI (10), Ensembl (11), UniGene (12), Transfac (13), RefSeq (14)
Proteomic	BIND (15), BioGRID (16), HPRD (17), MINT (18), UniProt^b (6), InterPro (19), IEDB^b (7), ProteomicsDB^b (20), Human Protein Atlas^b (5)
Pathways-related	KEGG (21), Reactome (22), NCI (23), BioCarta, STRING (24), TRANSPATH (25), Pathway Commons (26), WikiPathways^b (8)
Disease-specific	OMIM (27), Disease Ontology (28), AutDB (29), SZGR (30), Cancer Gene Index, AGRE, DBDB^a (31), LisDB^a, GeneCards (32)
Phenotypic	OMIM, Human Phenotype Ontology (33), Customized Ontologies^a
Variations	Genetic Association Database (34), Database of Genomic Variants (35), Human Gene Mutation Database (36), SLEP (37)
Text-mining	GeneWays^a (38), DISEASES (39)
Pharmacogenomics	Comparative Toxicogenomics Database (CTD) (40)

^a Customized and manually curated sources of information.

^b New databases added to LynxKB.

Since the last release, the Lynx workbench has been supplemented with a number of new tools. These include Cheetoh (3), a unique feature-and-network-based gene-prioritization tool, and NetLynx (in press), a tool for the reconstruction of co-expression networks. Lynx's usage has been increasing steadily, with thousands of users each month accessing the platform for annotation and analysis of high-throughput biomedical data.

LYNX DESIGN AND COMPONENTS

Lynx provides a one-stop solution for generating weighted hypotheses regarding the genes or molecular mechanisms contributing to the phenotypes of interest (Figure 1). It supports annotations and analyses of the following data types: (i) various types of experimental results, such as gene expression, NGS, GWAS, CNV data, etc.; (ii) data extracted from LynxKB via search and annotation engines and (iii) lists of genes provided by the user.

Figure 1.

Lynx knowledge extraction engine: major components and general workflow.

Lynx contains the following major components: (i) the Lynx annotation engine, consisting of the integrated Lynx knowledge base (LynxKB) and knowledge extraction services; (ii) the Lynx analytical workbench, which includes tools for feature-based gene enrichment analysis, feature-and-network-based gene prioritization and reconstruction of co-expression networks; and (iii) a user-friendly web interface for accessing the annotations and analytical tools.

Updates to Lynx annotation engine

Lynx Integrated knowledge base

A number of resources were added to LynxKB in the past year. These include information from the Human Protein Atlas (5), UniProt feature data (6), IEDB (7), WikiPathways (8), GeneRifs (9) and others. Table 1 shows the resources currently integrated into LynxKB. To keep LynxKB up to date, we have performed periodic updates of the integrated databases. The LynxKB data is accessible to users via advanced searches, annotation interfaces and analytical tools. Lynx also provides exclusive access to the text-mining data describing molecular interactions from GeneWays, as well as to data describing clusters of transcription factor binding sites (41) and enhancers (42) provided by the VISTA project. Integrated structured data from LynxKB is available for download in multiple formats (e.g. XML, CSV, TXT, JSON) via a web-based user interface and via web services.

Lynx knowledge extraction engine

The Lynx knowledge extraction engine was further enhanced to provide multiple entry points for the extraction of information describing individual objects (e.g. genes, pathways, disorders), as well as batch queries. Lynx uses Apache Lucene to index the knowledge base and offers advanced search capabilities. It allows users to generate highly selective datasets by filtering on multiple parameters (e.g. phenotypes, pathways, functional associations and more). We have rebuilt the Lucene indexes to cover the latest versions of the existing databases and the newly added databases. The annotation service in Lynx provides annotation data as RESTful web services, which are also consumed by the Lynx web applications.

Updates to Lynx analytical workbench

Updates to statistical enrichment analysis

Lynx enrichment analysis allows identification of functional categories over-represented in the query datasets, thus assisting users in formulating hypotheses regarding the molecular mechanisms involved in the phenomena under study. Two singular enrichment analysis algorithms, Bayes factor and P-value estimates, are used in our pipeline for this purpose (see (43) for more details). Enrichment analysis in Lynx is based on a large variety of features obtained from multiple sources, as well as symptoms-level phenotypes and associated non-coding signals, as described in our previous publication (1). Several new feature categories, including PubMed (UniProt and NCBI GeneRifs), UniProt Keywords and InterPro Domains, are introduced in the current release to enable literature-oriented and protein-function-oriented discovery. The results of the Lynx enrichment analysis can now be filtered and passed to our new prioritization tool, Cheetoh, to perform feature-and-network-based gene prioritization.
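To make the notion of an over-representation P-value concrete, the following minimal sketch computes a one-sided hypergeometric tail probability for a single feature category. This is a generic textbook test, offered under the assumption that it is close in spirit to the P-value estimate mentioned above; the exact estimators used in the Lynx pipeline, including the Bayes factor, are described in (43) and are not reproduced here. The function name and its parameters are illustrative and not part of Lynx.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int, annotated: int,
                                query: int, overlap: int) -> float:
    """One-sided hypergeometric P-value, P(X >= overlap).

    population: number of genes in the background set
    annotated:  background genes that carry the feature
    query:      number of genes in the query list
    overlap:    query genes that carry the feature
    (All names are illustrative assumptions, not Lynx identifiers.)
    """
    max_overlap = min(annotated, query)
    total = comb(population, query)
    # Sum the probabilities of observing `overlap` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, query - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy case: 10 background genes, 5 annotated, and a query of 5 genes
# that are all annotated. The only favourable draw is one out of
# comb(10, 5) = 252, so the P-value is exactly 1/252.
p_toy = hypergeom_enrichment_pvalue(population=10, annotated=5, query=5, overlap=5)
print(p_toy)
```

A small P-value indicates that the category occurs in the query list more often than expected by chance; in practice the value would be computed for every category and then corrected for multiple testing.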

Updates to Lynx gene prioritization and prediction of molecular mechanisms

Gene prioritization identifies promising candidate genes and sets of genes relevant to molecular mechanisms contributing to a phenotype or a condition of interest extracted from a large set of genes or even from the entire genome. It can also serve as a preliminary step for network reconstruction. In addition to the previously described PINTA network-based gene prioritization (44–46), Lynx now contains Cheetoh, a network-and-feature-based gene prioritization tool. These prioritization tools perform distinct but complementary analyses suitable for the scientific goals of an investigation, as outlined below.

Cheetoh

A list of genes submitted to the Cheetoh algorithm first undergoes enrichment analysis to identify and score over-represented functional categories. The results of the enrichment analysis are passed to the Cheetoh algorithm as node features. Cheetoh integrates these enrichment analysis results with the underlying network structure as edge features through the Conditional Random Field (CRF) model. It then ranks the genes in the whole genome by global inference scores on the CRF model. Please refer to Xie et al. (3,4) for a detailed description of the Cheetoh algorithm and its performance evaluation and validation procedures. The output of the tool consists of the 1000 top-ranked genes ordered by ascending Bonferroni-corrected P-values (multiple testing correction) based on all user-selected categories, as well as the rankings and corrected P-values for each individual category. The results are available both for viewing via the Lynx interactive interface and for download. The resulting top-ranked genes can be used in both hypothesis-based and discovery-based approaches: to identify a small set of high-confidence candidate genes relevant to the user's interests, or to explore larger sets of high-ranking genes to identify molecular mechanisms associated with the conditions under investigation. Moreover, the user can increase the resolution of the analyses by choosing particular categories of interest from among the enrichment analysis categories to enable customized prioritization. For general-purpose gene prioritization, the combination of Gene Ontology (Molecular Function/Biological Process/Cellular Component), phenotype and pathway categories is recommended.
Users are advised to use Cheetoh in cases when (i) pre-existing knowledge is available, such as a list of validated genes or highly differentially expressed (DE) genes associated with the phenotype or condition of interest, and (ii) the network associated with the input list of genes is sparse or the input genes are poorly annotated.
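The ordering of the Cheetoh output by ascending Bonferroni-corrected P-values can be illustrated with a short sketch. It covers only the correction and ranking step; how Cheetoh derives per-gene P-values from the CRF inference scores is described in (3,4) and is not modelled here. The function name, the toy gene labels and the choice of the number of tests (taken here as the number of scored genes) are assumptions made for illustration.

```python
def bonferroni_rank(pvalues: dict, top_n: int = 1000) -> list:
    """Return up to `top_n` (gene, corrected P-value) pairs, best first.

    pvalues maps a gene identifier to its raw P-value. The Bonferroni
    correction multiplies each P-value by the number of tests and caps
    the result at 1.0. Using len(pvalues) as the number of tests is an
    assumption of this sketch.
    """
    n_tests = len(pvalues)
    corrected = {gene: min(1.0, p * n_tests) for gene, p in pvalues.items()}
    # Ascending order: the smallest corrected P-value ranks first.
    ranked = sorted(corrected.items(), key=lambda item: item[1])
    return ranked[:top_n]


# Toy input with made-up gene labels and P-values.
toy = {"GENE_A": 0.01, "GENE_B": 0.001, "GENE_C": 0.5}
for gene, p_corrected in bonferroni_rank(toy, top_n=2):
    print(gene, p_corrected)
```

With three tests, the raw values are multiplied by three, so GENE_B ranks ahead of GENE_A and GENE_C falls outside the top two.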

PINTA

In contrast to Cheetoh, PINTA is an unsupervised gene prioritization tool, which propagates the input information, in the form of genes and associated scores or gene expression values, through gene–gene interaction networks. It accepts gene lists annotated with experimental values (e.g. gene expression results, differential expression values, scored sets of candidate genes, etc.) that are factored into the analytical procedure. Users are encouraged to use PINTA when scores for the input genes are available, such as reliability scores, differential expression values or the strength of association to the phenotypes. Since the information propagated through the network can indicate whether a gene's neighborhood is functionally related to the input gene set, PINTA can identify promising candidate genes and subnetworks even if no knowledge is available about the disease or phenotype under consideration. Please refer to (44–46) for a detailed description of PINTA, its comparison with other similar tools and rigorous validation procedures.
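The idea of propagating input scores through an interaction network can be sketched with a random walk with restart. This is a generic propagation scheme chosen for illustration and is not PINTA's own algorithm, which is described in (44–46). The sketch assumes an undirected network given as a dictionary of neighbour sets, non-negative input scores (e.g. absolute differential expression values) with a positive sum, and hypothetical gene labels.

```python
def propagate_scores(adjacency: dict, seed_scores: dict,
                     restart: float = 0.5, iterations: int = 50) -> dict:
    """Random walk with restart on an undirected network.

    adjacency:   gene -> set of neighbouring genes (symmetric)
    seed_scores: gene -> non-negative input score; the sum must be > 0
    restart:     probability of jumping back to the seed distribution
    """
    genes = list(adjacency)
    total = sum(seed_scores.values())
    # Normalise the input scores into a starting distribution.
    prior = {g: seed_scores.get(g, 0.0) / total for g in genes}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for g in genes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[n] / len(adjacency[n]) for n in adjacency[g])
            updated[g] = restart * prior[g] + (1.0 - restart) * incoming
        scores = updated
    return scores


# Toy path network G1 - G2 - G3 with all input signal on G1.
network = {"G1": {"G2"}, "G2": {"G1", "G3"}, "G3": {"G2"}}
result = propagate_scores(network, {"G1": 1.0})
for gene in sorted(result, key=result.get, reverse=True):
    print(gene, round(result[gene], 4))
```

Genes close to highly scored inputs receive more of the propagated signal than distant ones, which is the intuition behind ranking candidates by their network neighbourhood.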

NetLynx

Reconstruction of co-expression networks has proved to be one of the promising approaches for investigation of system-level properties. Lynx now contains NetLynx, a co-expression-based network prediction tool to rank the interactions between each pair of genes with respect to their gene expression profiles. NetLynx uses a well-established method for modeling the gene expression correlations as a multivariate Gaussian distribution with an L1 norm penalty. A comparison of NetLynx with the Pearson-correlation-based and mutual-information-based methods demonstrated its good performance (manuscript in press). NetLynx may be used for the reconstruction of co-expression networks utilizing a user-input threshold to infer the final gene co-expression network. The resulting co-expression networks can be annotated through Lynx annotation resources and then further analyzed by Lynx workbench tools for enrichment analysis and gene prioritization.
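A multivariate Gaussian model with an L1 norm penalty is commonly written as the graphical lasso objective shown below. It is given here as the standard textbook formulation, on the assumption that it matches the description above; the exact formulation used by NetLynx is the one in the in-press manuscript. The symbols are introduced for this illustration: $S$ is the sample covariance matrix of the gene expression profiles, $\Theta$ is the precision (inverse covariance) matrix, $\lambda$ is the penalty weight and $\tau$ is the user-supplied threshold.

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0}\;
  \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \lambda\,\lVert\Theta\rVert_{1}

% Partial correlation between genes i and j implied by the estimate:
\rho_{ij} \;=\; -\,\frac{\hat{\Theta}_{ij}}{\sqrt{\hat{\Theta}_{ii}\,\hat{\Theta}_{jj}}}

% One possible thresholding rule (an assumption of this sketch):
% keep the edge (i, j) when |\rho_{ij}| \ge \tau
```

The L1 penalty drives many off-diagonal entries of $\hat{\Theta}$ to zero, so that only gene pairs with direct conditional dependence remain as candidate edges before the threshold is applied.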

Lynx customized workflows

Lynx aims to support various scientific scenarios by offering flexible analytical workflows containing complementary tools. Lynx workflows allow users to explore biological data accessible via the search engine as well as specialized gene pages. The Lynx user interface allows easy navigation between Lynx tools (see Figure 1) as well as external tools, such as RaptorX (47) and VISTA RViewer (48). This flexibility enables users to create workflows suited to their research goals. An iterative application of Lynx analytical tools can also help users validate hypotheses or discover new mechanisms hidden in the data.

Data and analytical web services

The integrated data and annotations, as well as the various analytical tools, are presented to the users via the web interface. The service-oriented architecture enables other users and groups to leverage our work and integrate it within their own research tools and platforms. Other public systems, such as the UCSC Genome Browser (49) and RViewer, provide external links to Lynx annotation pages. Databases such as DBDB use the Lynx RESTful web service interface for annotation of genomic data. End users can download the datasets of interest and the results of analyses from the web interface.

CASE STUDY: identification of genes and molecular mechanisms involved in the transcriptional response to LPS (Lipopolysaccharide) in airway epithelia

We use the analysis of gene expression profiling of airway epithelial cells involved in environmental asthma to illustrate the use of existing and newly added Lynx tools. The data used in this case study is accessible at the NCBI GEO database (50) under accession GSE8190. According to the GEO metadata and the corresponding article by Yang et al. (51), the airway epithelial cells were obtained via bronchial brush and bronchoalveolar lavage from 39 subjects comprising three phenotypic groups (non-atopic non-asthmatic, atopic non-asthmatic and atopic asthmatic) 4 h after instillation of lipopolysaccharide (LPS) in three distinct sub-segmental bronchi. RNA transcript levels were assessed using whole genome microarrays. To formulate a weighted hypothesis about the LPS response in airway epithelial cells, we performed the following steps.

Step 1. The 388 genes DE in all phenotypic conditions under investigation (control, atopy+/asthma−, atopy+/asthma+) for the contrast of LPS and saline exposure were extracted from the article's supplementary materials. By removing duplicates and correcting obsolete synonyms, we obtained a clean set of 283 genes that were used in the Lynx analysis.

Step 2. Enrichment analysis of the 283 DE genes obtained in Step 1 was performed against sixteen feature categories. The results reveal a highly significant over-representation of genes involved in cytokine and chemokine response pathways in the set of genes under consideration, such as: Cytokine Signaling in Immune system (P-value 1.65e-27, Bayes factor 56.113, Reactome, 75790); Interferon Signaling (P-value 3.99e-23, Bayes factor 46.004, Reactome, 25_229) and NF-κB signaling pathway (P-value 2.63e-16, Bayes factor 30.286, KEGG, hsa04064) (please see the online example for more details, http://lynx.ci.uchicago.edu/usecase.html). These results are consistent with the finding presented in the source article (51) that LPS stimulation resulted in a pronounced transcriptional response across all subjects in airway epithelia, with strong association to nuclear factor-κB and IFN-inducible genes.

Step 3. In order to predict additional genes and sub-networks potentially involved in the inflammatory response to LPS in airway epithelia, the 283 DE genes from Step 1 were analyzed by Cheetoh. This algorithm uses both features and a network as input for gene prioritization. In this case, the enriched categories from GO and phenotype were used as features and STRING 9 was used as the underlying global network. The top-ranked 100 genes (P-value < 0.004), containing 23 genes from the input, were resubmitted for enrichment analysis. The results of the enrichment analysis against pathway databases (not used in the gene prioritization process) demonstrated a significant boost for the categories of interest. For example, Cheetoh was able to identify 20 out of 100 genes in the Chemokine category [GO:0008009] versus 9 out of 283 genes before the prioritization. We were also able to identify the toll-like receptor signaling pathway with this prioritized gene list (see Table 2).
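The size of the boost in the Chemokine category can be read directly from the counts quoted above (9 of 283 genes before prioritization, 20 of 100 after). The short calculation below is only an arithmetic illustration of those two proportions; the helper name is an assumption of this sketch and the calculation is not part of the Lynx analysis itself.

```python
def category_fraction(hits: int, list_size: int) -> float:
    """Fraction of a gene list that falls into a given category."""
    return hits / list_size


# Counts for the Chemokine category [GO:0008009] quoted in the text.
before = category_fraction(9, 283)    # DE genes before prioritization
after = category_fraction(20, 100)    # top-ranked genes from Cheetoh
fold_change = after / before

print(f"before: {before:.3f}, after: {after:.3f}, fold: {fold_change:.2f}")
# prints "before: 0.032, after: 0.200, fold: 6.29"
```

In other words, the share of genes in this category rises from about 3% of the input list to 20% of the prioritized list, roughly a six-fold increase.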

Table 2.

Results of gene prioritization using Cheetoh

Feature ID	Description	Differentially expressed genes (283)			Cheetoh prioritized genes (100)
		In query	P-value	Bayes factor	In query	P-value	Bayes factor
REACTOME Pathway 75790	Cytokine signaling in immune system	40	1.65E-27	56.113	37	7.92E-35	73.551
KEGG hsa04064	NF-kappa B signaling pathway	18	2.63E-16	30.286	30	1.37E-42	91.396
KEGG hsa04062	Chemokine signaling pathway	21	1.41E-13	24.051	43	2.4E-53	116.168
REACTOME Pathway 6894	Toll-like Receptor 4 (TLR4) Cascade	12	1.48E-07	10.21	80	5E-149	336.461
REACTOME Pathway 9047	Toll-like Receptor 9 (TLR9) Cascade	N/A	N/A	N/A	56	8.3E-101	225.433

The analyses performed with Lynx tools in this example allowed us to reproduce the results described in the original paper by Yang et al. (51) and to suggest some additional avenues for further investigation (e.g. identification of genes involved in the Toll-like receptor response). A tutorial describing this and other examples of using Lynx for data annotation and analysis is available at the Lynx Web site at http://lynx.ci.uchicago.edu/usecase.html.

CONCLUSIONS

We present an updated Lynx database and analytical workbench designed to support discovery and hypothesis-based approaches. Lynx integrates the main downstream analyses, such as gene annotations, gene set enrichment analysis, various algorithms for gene prioritization and network reconstruction within one engine, based on a large knowledge base. Two newly added tools, Cheetoh and NetLynx, further expand our platform's analytical repertoire.

46 in total

1. Proteomics. Tissue-based map of the human proteome.

Authors: Mathias Uhlén; Linn Fagerberg; Björn M Hallström; Cecilia Lindskog; Per Oksvold; Adil Mardinoglu; Åsa Sivertsson; Caroline Kampf; Evelina Sjöstedt; Anna Asplund; IngMarie Olsson; Karolina Edlund; Emma Lundberg; Sanjay Navani; Cristina Al-Khalili Szigyarto; Jacob Odeberg; Dijana Djureinovic; Jenny Ottosson Takanen; Sophia Hober; Tove Alm; Per-Henrik Edqvist; Holger Berling; Hanna Tegel; Jan Mulder; Johan Rockberg; Peter Nilsson; Jochen M Schwenk; Marica Hamsten; Kalle von Feilitzen; Mattias Forsberg; Lukas Persson; Fredric Johansson; Martin Zwahlen; Gunnar von Heijne; Jens Nielsen; Fredrik Pontén
Journal: Science Date: 2015-01-23 Impact factor: 47.728

2. Candidate gene prioritization by network analysis of differential expression using machine learning approaches.

Authors: Daniela Nitsch; Joana P Gonçalves; Fabian Ojeda; Bart de Moor; Yves Moreau
Journal: BMC Bioinformatics Date: 2010-09-14 Impact factor: 3.169

3. The BioGRID interaction database: 2015 update.

Authors: Andrew Chatr-Aryamontri; Bobby-Joe Breitkreutz; Rose Oughtred; Lorrie Boucher; Sven Heinicke; Daici Chen; Chris Stark; Ashton Breitkreutz; Nadine Kolas; Lara O'Donnell; Teresa Reguly; Julie Nixon; Lindsay Ramage; Andrew Winter; Adnane Sellam; Christie Chang; Jodi Hirschman; Chandra Theesfeld; Jennifer Rust; Michael S Livstone; Kara Dolinski; Mike Tyers
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 19.160

Review 4. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine.

Authors: Peter D Stenson; Matthew Mort; Edward V Ball; Katy Shaw; Andrew Phillips; David N Cooper
Journal: Hum Genet Date: 2014-01 Impact factor: 4.132

5. The Comparative Toxicogenomics Database's 10th year anniversary: update 2015.

Authors: Allan Peter Davis; Cynthia J Grondin; Kelley Lennon-Hopkins; Cynthia Saraceni-Richards; Daniela Sciaky; Benjamin L King; Thomas C Wiegers; Carolyn J Mattingly
Journal: Nucleic Acids Res Date: 2014-10-17 Impact factor: 16.971

6. PID: the Pathway Interaction Database.

Authors: Carl F Schaefer; Kira Anthony; Shiva Krupa; Jeffrey Buchoff; Matthew Day; Timo Hannay; Kenneth H Buetow
Journal: Nucleic Acids Res Date: 2008-10-02 Impact factor: 16.971

7. The Reactome pathway knowledgebase.

Authors: David Croft; Antonio Fabregat Mundo; Robin Haw; Marija Milacic; Joel Weiser; Guanming Wu; Michael Caudy; Phani Garapati; Marc Gillespie; Maulik R Kamdar; Bijay Jassal; Steven Jupe; Lisa Matthews; Bruce May; Stanislav Palatnik; Karen Rothfels; Veronica Shamovsky; Heeyeon Song; Mark Williams; Ewan Birney; Henning Hermjakob; Lincoln Stein; Peter D'Eustachio
Journal: Nucleic Acids Res Date: 2013-11-15 Impact factor: 16.971

8. The UCSC Genome Browser database: 2014 update.

Authors: Donna Karolchik; Galt P Barber; Jonathan Casper; Hiram Clawson; Melissa S Cline; Mark Diekhans; Timothy R Dreszer; Pauline A Fujita; Luvina Guruvadoo; Maximilian Haeussler; Rachel A Harte; Steve Heitner; Angie S Hinrichs; Katrina Learned; Brian T Lee; Chin H Li; Brian J Raney; Brooke Rhead; Kate R Rosenbloom; Cricket A Sloan; Matthew L Speir; Ann S Zweig; David Haussler; Robert M Kuhn; W James Kent
Journal: Nucleic Acids Res Date: 2013-11-21 Impact factor: 16.971

9. The immune epitope database (IEDB) 3.0.

Authors: Randi Vita; James A Overton; Jason A Greenbaum; Julia Ponomarenko; Jason D Clark; Jason R Cantrell; Daniel K Wheeler; Joseph L Gabbard; Deborah Hix; Alessandro Sette; Bjoern Peters
Journal: Nucleic Acids Res Date: 2014-10-09 Impact factor: 16.971

10. The InterPro protein families database: the classification resource after 15 years.

Authors: Alex Mitchell; Hsin-Yu Chang; Louise Daugherty; Matthew Fraser; Sarah Hunter; Rodrigo Lopez; Craig McAnulla; Conor McMenamin; Gift Nuka; Sebastien Pesseat; Amaia Sangrador-Vegas; Maxim Scheremetjew; Claudia Rato; Siew-Yit Yong; Alex Bateman; Marco Punta; Teresa K Attwood; Christian J A Sigrist; Nicole Redaschi; Catherine Rivoire; Ioannis Xenarios; Daniel Kahn; Dominique Guyot; Peer Bork; Ivica Letunic; Julian Gough; Matt Oates; Daniel Haft; Hongzhan Huang; Darren A Natale; Cathy H Wu; Christine Orengo; Ian Sillitoe; Huaiyu Mi; Paul D Thomas; Robert D Finn
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 16.971
