Literature DB >> 32810207

MELODI Presto: a fast and agile tool to explore semantic triples derived from biomedical literature.

Abstract

SUMMARY: The field of literature-based discovery is growing in step with the volume of literature being produced. From modern natural language processing algorithms to high quality entity tagging, the methods and their impact are developing rapidly. One annotation object that arises from these approaches, the subject-predicate-object triple, is proving to be very useful in representing knowledge. We have implemented efficient search methods and an application programming interface, to create fast and convenient functions to utilize triples extracted from the biomedical literature by SemMedDB. By refining these data, we have identified a set of triples that focus on the mechanistic aspects of the literature, and provide simple methods to explore both enriched triples from single queries, and overlapping triples across two query lists.
AVAILABILITY AND IMPLEMENTATION: https://melodi-presto.mrcieu.ac.uk/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Species

Year: 2021 PMID： 32810207 PMCID： PMC8088324 DOI： 10.1093/bioinformatics/btaa726

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

The biomedical literature contains a wealth of knowledge on disease mechanisms that link risk factors to disease outcomes. Literature reviews are an important part of developing new mechanistic hypotheses, but are time-consuming when exploring many risk factors or disease outcomes in parallel. We propose utilizing the entire biomedical literature corpus to generate a metric of evidence based on the number of times a particular statement has been documented. This approach can be used to produce an overview of the main topics and terms for any given biomedical query, e.g. a risk factor, or identify overlapping elements between two queries, e.g. a risk factor (exposure) and disease outcome. Previously, we created MELODI, a web application that derives overlapping enriched literature elements connecting a risk factor and a disease (Elsworth ), thus, identifying potential intermediate mechanisms. MELODI utilizes semantic triples (‘subject-predicate-object’) derived from the titles and abstracts of nearly 30 million biomedical articles using SemRep (Rindflesch and Fiszman, 2003) and provided by SemMedDB (Kilicoglu ). Whilst effective, the scale and complexity of data utilized by MELODI, its implementation of a graph database and web application, and its focus on single risk factor/outcome combinations limits its application to many queries in parallel. To address these challenges, we developed MELODI Presto, a quicker and more agile tool to identify overlapping elements between any two query lists using millions of semantic triples from the scientific literature (https://melodi-presto.mrcieu.ac.uk/).

2 Features

The main features and innovations in MELODI Presto are described below: Filter by UMLS semantic type: SemMedDB triples were filtered to include only those matching particular ‘term types’. These types are defined by the UMLS semantic type abbreviations (https://metamap.nlm.nih.gov/SemanticTypesAndGroups.shtml). We focus on terms most relevant to mechanistic inference. Supplementary Table S1 lists selected terms. For a triple to be included both the subject and object semantic types needed to be in this list. Filter by predicate type: To improve the interpretability of the data, we removed the more ambiguous predicates. Supplementary Table S2 lists excluded predicates. We retained predicates that implied direction or causality, excluding predicates, such as PROCESS_OF, PART_OF and ISA. Combined, these two filtering criteria reduced the number of PREDICATION triples from 103 284 300 to 8 295 443. Improve search performance: MELODI Presto implements a simpler approach than MELODI, which does not require a graph database architecture. We therefore selected Elasticsearch (https://www.elastic.co/elasticsearch) for performance based on our previous experience applying it to large biomedical datasets (Elsworth ).

3 Implementation

3.1 Data sources and pre-processing

MELODI Presto incorporates semantic triples from the SemMedDB database (Kilicoglu ), which is built by running the SemRep semantic knowledge representation tool (Rindflesch and Fiszman, 2003) on the MEDLINE database. The SemMedDB resource (version semmedVER42_R, 2020) is updated periodically and provides data downloads in SQL format. For MELODI Presto, we extracted the PREDICATION, SENTENCE and CITATION tables. For enrichment analysis, frequency counts of triples were pre-calculated using Elasticsearch aggregation calls and added to a separate index. This will be updated with new SemMedDB releases. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

3.2 Functions

MELODI Presto provides three functions: Enrich: The enrichment method follows the same principle as MELODI, using a standard 2 × 2 Fisher’s exact test. For example, if a query ‘Sleep duration’ returned a set of triples ‘Sleep Apnea, Obstructive: PREDISPOSES: Hypertensive disease’, then, we can count the number of this specific triple returned by the query (localCount), the total number of triples returned by the query (localTotal), the number of this specific triple in the database (globalCount) and the total number of triples in the database (globalTotal). These values are then provided as a 2 × 2 contingency table using a two-sided alternative hypothesis, producing a prior odds ratio and P-value (a=localCount, b=localTotal-localCount, c=globalCount, d=globalTotal-globalCount). Overlap: This is an extension of the Enrich function described above. By providing two lists of search terms, e.g. multiple risk factors and multiple diseases, all terms are first tested for enrichment, then, overlapping enriched elements are identified. An overlap is taken to be cases where the object of a triple from the set of ‘x’ queries overlaps with a subject from the set of ‘y’ queries (Fig. 1).

Fig. 1.

Data flow of the MELODI Presto overlap function. Two lists of queries (Q1 and Q2) are first checked for previous enrichment analysis. For queries, which have not been previously analysed the text of each missing term is run as a PubMed query and returned IDs are matched to the MELODI Presto database for enrichment. The results of previous analysis are loaded from a local store. Overlapping elements between each pair of enriched triple sets (one from each query list, Q1 and Q2) are then identified and returned

Sentence: This function enables the user to check the source literature for SemMedDB triples. To enable rapid evaluation of the literature underpinning any potential mechanism, MELODI Presto provides a function that takes a PubMed ID and returns triples from the refined PREDICATE data, and the sentences from which they were derived. Data flow of the MELODI Presto overlap function. Two lists of queries (Q1 and Q2) are first checked for previous enrichment analysis. For queries, which have not been previously analysed the text of each missing term is run as a PubMed query and returned IDs are matched to the MELODI Presto database for enrichment. The results of previous analysis are loaded from a local store. Overlapping elements between each pair of enriched triple sets (one from each query list, Q1 and Q2) are then identified and returned

3.3 Performance

The first time a query is run MELODI Presto creates local copies of the enrichment data, as illustrated above. For this reason, if a query has not been run previously, it may take a few seconds (generally <30 s depending on the number of articles returned from the query). However, if an existing variable, or list of variables, are queried, the Overlap function generally runs in a few seconds.

3.4 Access and source code

MELODI Presto is available via an application programming interface [API, using either the Swagger interface or an appropriate interface library (e.g. Python requests)]. Python examples of which can be found in the Jupyter notebooks linked below. Each function returns JSON objects, which can be easily incorporated into standard workflows. We also provide a web application to enable some of the functionality to be explored. All code used to process the raw data, create the Elasticsearch indexes, API and web application are publicly available (https://github.com/MRCIEU/MELODI-Presto) with Jupyter notebooks providing a demonstration of basic API usage, use cases and details of specific methods and performance.

3.5 URLs

MELODI Presto—https://melodi-presto.mrcieu.ac.uk/. MELODI Presto Web—https://melodi-presto.mrcieu.ac.uk/app/. MELODI Presto API—https://melodi-presto.mrcieu.ac.uk/docs/. MELODI Presto GitHub—https://github.com/MRCIEU/MELODI-Presto. MELODI Presto Notebooks—https://github.com/MRCIEU/MELODI-Presto/blob/master/notebooks/.

4 Conclusion

MELODI Presto provides a fast and efficient method to systematically profile semantic triples derived from the literature. This can be used to explore the enriched literature data for a given search term and identify potential intermediate disease mechanisms between lists of terms. Using a refined literature dataset and a high-performance search architecture, it is possible to query millions of articles in seconds. The agility of the method and its construction mean that as and when alternative or improved triples of data are produced, we can add them (Koroleva ; Sybrandt ). There is also the scope to expand this approach to include preprints, full text and any other source of data.

Funding

This work was supported by the UK Medical Research Council Integrative Epidemiology Unit [MC_UU_00011/4]; the Cancer Research UK Integrative Cancer Epidemiology Programme [C18281/A19169]; and the University of Bristol. T.R.G. holds a fellowship from the Alan Turing Institute. T.R.G. receives funding from GlaxoSmithKline and Biogen for unrelated research. Click here for additional data file.

3 in total

1. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text.

Authors: Thomas C Rindflesch; Marcelo Fiszman
Journal: J Biomed Inform Date: 2003-12 Impact factor: 6.317

2. SemMedDB: a PubMed-scale repository of biomedical semantic predications.

Authors: Halil Kilicoglu; Dongwook Shin; Marcelo Fiszman; Graciela Rosemblat; Thomas C Rindflesch
Journal: Bioinformatics Date: 2012-10-08 Impact factor: 6.937

3. MELODI: Mining Enriched Literature Objects to Derive Intermediates.

Authors: Benjamin Elsworth; Karen Dawe; Emma E Vincent; Ryan Langdon; Brigid M Lynch; Richard M Martin; Caroline Relton; Julian P T Higgins; Tom R Gaunt
Journal: Int J Epidemiol Date: 2018-01-12 Impact factor: 7.196

3 in total

2 in total

1. Trans-ethnic Mendelian-randomization study reveals causal relationships between cardiometabolic factors and chronic kidney disease.

Authors: Jie Zheng; Yuemiao Zhang; Humaira Rasheed; Venexia Walker; Yuka Sugawara; Jiachen Li; Yue Leng; Benjamin Elsworth; Robyn E Wootton; Si Fang; Qian Yang; Stephen Burgess; Philip C Haycock; Maria Carolina Borges; Yoonsu Cho; Rebecca Carnegie; Amy Howell; Jamie Robinson; Laurent F Thomas; Ben Michael Brumpton; Kristian Hveem; Stein Hallan; Nora Franceschini; Andrew P Morris; Anna Köttgen; Cristian Pattaro; Matthias Wuttke; Masayuki Yamamoto; Naoki Kashihara; Masato Akiyama; Masahiro Kanai; Koichi Matsuda; Yoichiro Kamatani; Yukinori Okada; Robin Walters; Iona Y Millwood; Zhengming Chen; George Davey Smith; Sean Barbour; Canqing Yu; Bjørn Olav Åsvold; Hong Zhang; Tom R Gaunt
Journal: Int J Epidemiol Date: 2021-10-20 Impact factor: 7.196

2. EpiGraphDB: a database and data mining platform for health data science.

Authors: Yi Liu; Benjamin Elsworth; Pau Erola; Valeriia Haberland; Gibran Hemani; Matt Lyon; Jie Zheng; Oliver Lloyd; Marina Vabistsevits; Tom R Gaunt
Journal: Bioinformatics Date: 2021-06-09 Impact factor: 6.937

2 in total