
GNOMICS: A one-stop shop for biomedical and genomic data.

Abstract

The World Wide Web is an indispensable tool for biomedical researchers who are striving to understand the molecular basis of phenotype. However, it presents challenges in the form of proliferation of data resources, with heterogeneity ranging from their content to functionality to interfaces. This often frustrates researchers who must visit multiple sites, become familiar with their interfaces, and learn how to use them to extract knowledge. Even then, one may never feel sure that they have tracked down all needed information. We envision addressing this challenge with GNOMICS (Genomic Nomenclature Omnibus and Multifaceted Informatics and Computational Suite), a suite with both a programmatic interface and a GUI. GNOMICS allows for extensible biomedical functionality, including identifier conversion, pathway enrichment, sequence alignment, and reference gathering, among others. It combines usage of other biological and chemical database application programming interfaces (APIs) to deliver uniform data which can be further manipulated and parsed.

Entities: Chemical Disease Gene Species

Year: 2018 PMID: 29888054 PMCID: PMC5961829

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction

Discussions of “big data” are at the forefront of researchers’ minds in almost every discipline. In the biomedical sciences in particular, the growth of biological data is unprecedented, with new data being created more quickly than it can be efficiently organized[1]. Automating large-scale, cross-domain aggregation of heterogeneous data across many different sites continues to pose a major challenge for bioinformaticians. Several groups, including the U.S. National Center for Biomedical Ontology (NCBO), have aimed to standardize data via the usage of normalized ontologies and nomenclatures[2]. It is important to note that these controlled vocabularies are sometimes difficult to map to or from, due, in part, to a looming “digital dark age,” that is, the difficulty of understanding file formats and identifier types after their deprecation[3]. Having a “universal” identifier for biological concepts continues to be a challenge. This problem is further compounded as new databases or meta-databases (e.g. comprehensive databases) are constructed, since these efforts often end up introducing new identifiers for existing concepts. Further, the status quo for advanced biological databases seems to be downloading databases locally or simply copying them to other servers. This results not only in duplicate data, but also in a data source which can quickly become out-of-date. A potential solution to these concerns, discussed herein, is the usage of application programming interfaces (APIs). These systems abstract database implementations, making input and output data easier to query, consume, and analyze in a standardized manner (Fig. 1).

Figure 1.

Example outlining data flow to/from a client to an application via an API.

Many biological databases have open APIs, such as Ensembl, UniProt, PubMed, and KEGG, while others are semi-open, requiring registration and some kind of security key (such as UMLS or Open PHACTS)[4,5,6,7]. Since APIs are tied directly to their component databases, they are as up to date as possible. In addition, minimal data are pulled to the local machine when queried, which avoids copying large databases. Our system, labeled GNOMICS (Genomic Nomenclature Omnibus and Multifaceted Informatics and Computational Suite), is a grouping of API-based Python programs organized into objects which can be extended or scaled as needed. In addition to the command-line interface, a beta front-end is also available, built primarily using Electron and AngularJS.
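To make the client-to-API data flow concrete, the following is a minimal sketch of querying one open API (the Ensembl REST symbol lookup) and reducing the response to a small uniform record. It is not taken from the GNOMICS code base: the function names `build_lookup_url`, `parse_gene`, and `lookup_gene`, and the choice of returned fields, are assumptions made for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def build_lookup_url(symbol, species="homo_sapiens"):
    """Build the URL for Ensembl's gene-symbol lookup endpoint."""
    return f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"


def parse_gene(payload):
    """Reduce a JSON response body to a small, uniform record.

    The selection of fields here is an illustrative assumption.
    """
    data = json.loads(payload)
    return {
        "id": data.get("id"),
        "symbol": data.get("display_name"),
        "biotype": data.get("biotype"),
    }


def lookup_gene(symbol, species="homo_sapiens", fetch=None):
    """Query the API and return a uniform record.

    `fetch` is an injectable function (URL -> response text), so the
    parsing logic can be exercised without network access.
    """
    url = build_lookup_url(symbol, species)
    if fetch is None:
        def fetch(u):
            # Ask the server for a JSON response.
            req = Request(u, headers={"Content-Type": "application/json"})
            with urlopen(req, timeout=30) as resp:
                return resp.read().decode("utf-8")
    return parse_gene(fetch(url))
```

Calling `lookup_gene("SLC4A1")` would issue a single small request rather than downloading a database, which is the property the text attributes to API-based access. Only the requested record crosses the network, and it reflects whatever the source currently holds.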

Methods

The GNOMICS system was built upon user-friendly and understandable objects, of which there are currently 25 (Table 1). While these modules and others are still under development, they allow for a diverse array of functions, including arterial supply for anatomical structures, effective time for drugs, metabolic rate for taxa, and effective rotor count for compounds, among others.

Table 1.

GNOMICS objects breakdown. Showcased are the approximate number of available records, number of identifiers currently available, and the number of properties included in the object. Please note that not all sources have published exact numbers of available objects.

Object	# Records	# Identifiers	# Properties
Adverse Event	30,126	7	0
Anatomical Structure	526,706	39	8
Assay	1,267,241	4	14
Biological Process	26,678	1	0
Cell Line	125,369	4	0
Cellular Component	4,150	1	0
Clinical Trial	619,701	5	24
Compound	332,518,372	23	55
Disease	382,830	28	11
Drug	1,541,646	34	21
Gene	45,569,934	21	16
Genotype	1,069	1	0
Molecular Function	11,134	2	0
Patent	4,000,000	3	0
Pathway	2,010,623	5	10
People	26,759,399	1	0
Phenotype	358,575	19	0
Procedure	397,186	8	0
Protein	586,444,112	15	9
Reference	477,357,444	20	13
Symptom	3,129	3	0
Taxa	6,739,136	11	98
Tissue	17,520	6	0
Transcript	202,338	2	0
Variation	801,208,836	5	0
Total:	2,288,123,254	268	279

Additionally, a twenty-sixth User object is available. This object stores information about the eleven API systems that require a separate user API key for usage (ChemSpider, DPLA, Elsevier, EOL, FDA, ISBNdb, NCBO, OMIM, OpenPHACTS, Springer, and UMLS)[4,5,6,7,8,9]. Currently 35 API systems are available, ranging from UniProt to Ensembl to Wikipedia. Lastly, 51 Python 3 packages were adapted for GNOMICS use, including BeautifulSoup, Bio, BioPython, Bioservices, the ChEMBL Webresource Client, ChemSpiPy, GEOparse, Intermine, LibChEBIpy, MyGene, MyVariant, NLTK, PubChemPy, and PyTaxize[10,11,12,13,14,15,16]. The inclusion of these packages prevented unnecessary code duplication. Each component object has several subfolders whose functions parse specific materials returned from component APIs for command-line usage, allowing for simpler functions such as identifier conversion. Interaction objects, which relate an object to another object or series of objects, are paramount to the success of GNOMICS (Fig. 2). For example, a gene can be related to several other genes via orthology; a disease can be related to a particular anatomical structure if it occurs there; or a drug can be related to adverse events (AEs) through drug-compound interactions. 63 such interaction objects are available or under development. In addition, there are areas for locally downloading data to facilitate faster querying (as well as to allow integration of databases without programmatic access) and an area for documentation. Supporting different import and export formats is crucial as well, with CSV, TSV, XML, HTML, JSON, JSONP, and YAML being the most commonly parsed formats, alongside biological formatting standards such as BED and FASTA. Also available are preformatted reports summarizing data in PDF, Microsoft Word, Microsoft Excel, CSV, and TSV files.
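The object and interaction-object design described above can be pictured with a small sketch. The class names `Entity` and `Interaction`, the `convert` method, the export helpers, and all identifier values are hypothetical placeholders and are not taken from the GNOMICS code base. The sketch only illustrates the idea of objects that carry several identifiers, interactions that relate one object to others, and export to two of the listed formats.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A biomedical object (gene, drug, disease, ...) with its identifiers."""
    kind: str
    identifiers: dict = field(default_factory=dict)  # identifier type -> value

    def convert(self, target_type):
        """Return the identifier of the requested type, or None if unmapped."""
        return self.identifiers.get(target_type)


@dataclass
class Interaction:
    """Relates one source object to one or more target objects."""
    relation: str
    source: Entity
    targets: list


def to_rows(interaction):
    """Flatten an interaction into one row per (source, target) pair."""
    return [
        {
            "relation": interaction.relation,
            "source": interaction.source.convert("label"),
            "target": target.convert("label"),
        }
        for target in interaction.targets
    ]


def to_json(interaction):
    """Serialize the flattened rows as JSON."""
    return json.dumps(to_rows(interaction), indent=2)


def to_csv(interaction):
    """Serialize the flattened rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["relation", "source", "target"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(to_rows(interaction))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    drug = Entity("drug", {"label": "amiodarone", "example_id": "DRUG:0001"})
    events = [
        Entity("adverse_event", {"label": "bradycardia"}),
        Entity("adverse_event", {"label": "cardiac arrest"}),
    ]
    print(to_csv(Interaction("has_adverse_event", drug, events)))
```

Keeping identifiers in a per-object mapping makes identifier conversion a dictionary lookup once the mappings have been fetched, and flattening interactions into rows lets the same data be written to any tabular or structured export format.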

Results

268 object-specific identifiers were included in GNOMICS with various mappings originating from the Ontology Lookup Service (OLS) at EMBL-EBI, the National Center for Biomedical Ontology (NCBO) BioPortal, the UMLS Metathesaurus, Wikidata, and others[2,4,17]. These mappings may be one-to-one or one-to-many. The Human Disease Ontology (DOID) term for heart disease, for example, maps to 6 SNOMED-CT U.S. IDs in OLS and NCBO, but only to one UMLS CUI in both sources[2,4,17]. Non-object-specific identifiers (such as MedDRA, SNOMED-CT, or MeSH) must be mapped to an object-specific identifier before inclusion with that specific ID[2,4,17,19,20]. Including non-object-specific identifiers, 533 identifier types are available among the 25 object types. Many identifiers appear in multiple sources, allowing for mapping if a single source is under maintenance or briefly unavailable for other reasons (on average, each identifier is available from three sources). Included with identifiers are terms and labels provided by the various sources, as well as accepted synonyms and translations. 76 languages are available due to integration of translatable terms from MeSH, MedDRA, CPT, LOINC, and Wikipedia. First, a few test searches were run for all objects with available search functionality (“breast cancer,” “SLC4A1,” “ibuprofen,” etc.). Each test was run five times across the 17 such objects, with all component databases and ontologies returning results in an average of 0.160864599 seconds (s = 0.01142). Next, we timed 45 randomly chosen interaction object search functions using the same methodology. These returned results in an average of 0.007318 seconds (s = 0.00525). Artificially crippling single APIs with repeated data (such as disallowing access to the OLS domain) was met with no significant change in speed, as most identifiers and search results could be found on multiple platforms. This, however, is not currently the case for PubMed or Ensembl, among others.
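The redundancy across sources and the timing methodology described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the GNOMICS implementation: `map_identifier`, `time_calls`, and `SourceUnavailable` are hypothetical names, each source is modeled as a plain callable, and "s" is assumed to denote the sample standard deviation.

```python
import statistics
import time


class SourceUnavailable(Exception):
    """Raised by a source that cannot be reached (hypothetical)."""


def map_identifier(identifier, sources):
    """Try each source in order and return (source name, mapped IDs).

    `sources` is a list of (name, callable) pairs. Each callable takes an
    identifier and returns a list of mapped identifiers (one-to-one or
    one-to-many), or raises SourceUnavailable.
    """
    for name, lookup in sources:
        try:
            mapped = lookup(identifier)
        except SourceUnavailable:
            continue  # this source is down; fall through to the next one
        if mapped:
            return name, mapped
    return None, []  # no available source knew the identifier


def time_calls(func, runs=5):
    """Time `func` over `runs` calls (runs >= 2).

    Returns (mean, sample standard deviation) in seconds.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


if __name__ == "__main__":
    # In-memory stand-ins for remote sources; identifiers are placeholders.
    def unavailable_source(identifier):
        raise SourceUnavailable("simulated outage")

    def backup_source(identifier):
        table = {"EXAMPLE:1": ["OTHER:10", "OTHER:11"]}  # one-to-many
        return table.get(identifier, [])

    sources = [("primary", unavailable_source), ("backup", backup_source)]
    print(map_identifier("EXAMPLE:1", sources))
    print(time_calls(lambda: map_identifier("EXAMPLE:1", sources)))
```

Trying sources in a fixed order keeps the behavior predictable when one source is blocked, which matches the observation that disabling a single domain did not significantly change speed for identifiers held by several sources.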
Two additional, more complex tests were then run. The first synthesized phenotypes related to adverse events caused by a drug with phenotypes related to the disease that drug is used to treat. The second compared normal gene expression, in certain tissues, for genes known to interact with a given drug against AE counts endemic to those same tissues[8,21,22,23]. The first test found 200 phenotypes associated with amiodarone usage and 20 phenotypes observed in the adverse event (AE) cardiac arrhythmia. There were 14 overlapping phenotypes discovered, including bradycardia, congestive heart failure, cardiac arrest, and ventricular tachycardia, all of which are common side effects of amiodarone. The second test involved looking at amiodarone AE counts and mapping those events to tissues; next we found drug/gene interactions for amiodarone and found normal expression levels for those genes in the same tissues (Figure 3). Following this, extensibility was tested by adapting PubTator and NOBLE Coder into GNOMICS, allowing for programmatic annotation of free text. The NLTK package was added to supplement these programs as well[26,27].
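The phenotype overlap in the first test amounts to a set intersection, which might be sketched as below. The function name `overlapping_phenotypes` and the matching on normalized labels are assumptions made for illustration; in practice such a comparison would more plausibly be made on mapped identifiers than on free-text labels. The toy lists are not the study data.

```python
def overlapping_phenotypes(drug_phenotypes, event_phenotypes):
    """Return the phenotypes present in both collections, sorted.

    Labels are compared after trimming whitespace and lowercasing.
    """
    def normalize(term):
        return term.strip().lower()

    drug_set = {normalize(term) for term in drug_phenotypes}
    event_set = {normalize(term) for term in event_phenotypes}
    return sorted(drug_set & event_set)


if __name__ == "__main__":
    # Toy inputs for illustration only; not the 200 and 20 phenotypes reported.
    drug_side = ["Bradycardia", "cardiac arrest", "placeholder phenotype A"]
    event_side = ["bradycardia ", "Cardiac arrest", "placeholder phenotype B"]
    print(overlapping_phenotypes(drug_side, event_side))
```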

Discussion

Cross-domain integration and comparison of large volumes of information across clinical and translational research applications continues to be a major challenge in the biomedical domain. With both a programmatic and visual interface, as well as a focus on terminology translation, we anticipate that GNOMICS and its continued development could accelerate the process of bringing together the two disparate fields. Additionally, GNOMICS has an intuitive object-based usage that can be understood quickly, keeping the relative startup cost smaller than that of a dedicated database download. The system manages to deliver uniform data which can be easily manipulated and parsed as the user finds necessary. GNOMICS further remains continually up-to-date given its connectedness with its component sources and continues to be a portable and scalable solution in the biomedical data landscape. However, GNOMICS has certain limitations. Because it is based on web services, GNOMICS requires a dedicated internet connection and more substantial API limits. While several biomedical and genomics databases have open API systems, others require registration and complex layers of security tokens and request tickets. While GNOMICS does some of the heavy lifting in these situations, users still have to navigate some of these processes manually. Further, for what GNOMICS gains in portability, it loses in speed, as certain databases may take longer to respond to queries. However, it has been shown that given increases in processing power, this tradeoff is often negligible to the end-user.

Conclusion

GNOMICS is under continuous development, including the addition of more object types, functions, and identifier mappings. Currently, the GNOMICS system is biased towards well-established and/or heavily trafficked databases, especially those with APIs. Additional multi-system testing is ongoing, with the front-end interface still in active development. This includes taking into account portability- and scalability-related issues alongside those components of the command-line interface. To address these points and others, a public GitHub source is available with a README and Wiki. Documentation is provided which offers insight into the identifier types available, database coverage, and language coverage, among other things. Additions to and recommendations for the system are highly encouraged, with plugins written in Python or other programming languages easily adaptable into the existing framework.

23 in total

1. The Unified Medical Language System (UMLS): integrating biomedical terminology.

Authors: Olivier Bodenreider
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. The Ontology Lookup Service: bigger and better.

Authors: Richard Côté; Florian Reisinger; Lennart Martens; Harald Barsnes; Juan Antonio Vizcaino; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2010-05-11 Impact factor: 16.971

3. Linking MedDRA(®)-Coded Clinical Phenotypes to Biological Mechanisms by the Ontology of Adverse Events: A Pilot Study on Tyrosine Kinase Inhibitors.

Authors: Sirarat Sarntivijai; Shelley Zhang; Desikan G Jagannathan; Shadia Zaman; Keith K Burkhart; Gilbert S Omenn; Yongqun He; Brian D Athey; Darrell R Abernethy
Journal: Drug Saf Date: 2016-07 Impact factor: 5.606

4. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data.

Authors: Warren A Kibbe; Cesar Arze; Victor Felix; Elvira Mitraka; Evan Bolton; Gang Fu; Christopher J Mungall; Janos X Binder; James Malone; Drashtti Vasant; Helen Parkinson; Lynn M Schriml
Journal: Nucleic Acids Res Date: 2014-10-27 Impact factor: 16.971

5. Mouse Genome Database (MGD)-2017: community knowledge resource for the laboratory mouse.

Authors: Judith A Blake; Janan T Eppig; James A Kadin; Joel E Richardson; Cynthia L Smith; Carol J Bult
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

6. Ensembl core software resources: storage and programmatic access for DNA sequence and genome annotation.

Authors: Magali Ruffier; Andreas Kähäri; Monika Komorowska; Stephen Keenan; Matthew Laird; Ian Longden; Glenn Proctor; Steve Searle; Daniel Staines; Kieron Taylor; Alessandro Vullo; Andrew Yates; Daniel Zerbino; Paul Flicek
Journal: Database (Oxford) Date: 2017-01-01 Impact factor: 3.451

7. The Human Phenotype Ontology in 2017.

Authors: Sebastian Köhler; Nicole A Vasilevsky; Mark Engelstad; Erin Foster; Julie McMurry; Ségolène Aymé; Gareth Baynam; Susan M Bello; Cornelius F Boerkoel; Kym M Boycott; Michael Brudno; Orion J Buske; Patrick F Chinnery; Valentina Cipriani; Laureen E Connell; Hugh J S Dawkins; Laura E DeMare; Andrew D Devereau; Bert B A de Vries; Helen V Firth; Kathleen Freson; Daniel Greene; Ada Hamosh; Ingo Helbig; Courtney Hum; Johanna A Jähn; Roger James; Roland Krause; Stanley J F Laulederkind; Hanns Lochmüller; Gholson J Lyon; Soichi Ogishima; Annie Olry; Willem H Ouwehand; Nikolas Pontikos; Ana Rath; Franz Schaefer; Richard H Scott; Michael Segal; Panagiotis I Sergouniotis; Richard Sever; Cynthia L Smith; Volker Straub; Rachel Thompson; Catherine Turner; Ernest Turro; Marijcke W M Veltman; Tom Vulliamy; Jing Yu; Julie von Ziegenweidt; Andreas Zankl; Stephan Züchner; Tomasz Zemojtel; Julius O B Jacobsen; Tudor Groza; Damian Smedley; Christopher J Mungall; Melissa Haendel; Peter N Robinson
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971