
Thalia: semantic search engine for biomedical abstracts.

Axel J Soto, Piotr Przybyła, Sophia Ananiadou.

Abstract

SUMMARY: Although the publication rate of the biomedical literature has grown steadily over the last decades, the accessibility of pertinent research publications for biologists and medical practitioners remains a challenge. This article describes Thalia, a semantic search engine that can recognize eight different types of concepts occurring in biomedical abstracts. Thalia is available via a web-based interface or a RESTful API. A key aspect of our search engine is that it is updated from PubMed on a daily basis. We describe here the main building blocks of our tool as well as an evaluation of the retrieval capabilities of Thalia in the context of a precision medicine dataset.
AVAILABILITY AND IMPLEMENTATION: Thalia is available at http://nactem.ac.uk/Thalia_BI/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press.

Year:  2019        PMID: 30329013      PMCID: PMC6513154          DOI: 10.1093/bioinformatics/bty871

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


The volume, variety and rate of publication of the biomedical scientific literature make it an exemplary case of Big Data and of its inherent challenges. In this information overload scenario, the accurate retrieval of relevant information from such a large volume of written knowledge becomes a necessity for biologists and medical practitioners alike (Gonzalez ). In this article, we present Thalia (Text mining for Highlighting, Aggregating and Linking Information in Articles), a semantic search tool for the biomedical literature. Its semantic capacity originates from the automatic mining of concepts occurring in articles indexed in PubMed (NCBI Resource Coordinators, 2017) and their normalization to specialized ontologies. In this way, it is possible to search and retrieve all documents containing any mention of a given concept regardless of the textual variant used to express it. Similarly, polysemy, i.e. the same term having multiple meanings, is resolved based on the context in which a term occurs. Thalia currently recognizes eight types of concepts, namely: chemicals, diseases, drugs, genes, metabolites, proteins, species and anatomical entities.
Although similar search systems have been made available before (Hoehndorf ; Lee ; Lu, 2011; Müller ; Thomas ; Wei ), Thalia has several distinctive aspects: (i) it is updated daily by automatically downloading updates from PubMed, mining concepts and adding them to the search index, a crucial feature, as systems lacking it quickly become outdated after deployment; (ii) its named entity recognition (NER) methods have been customized for biomedical entity mining as a result of years of research and participation in shared tasks; (iii) it uses context-sensitive acronym resolution to improve concept recognition; and (iv) it provides a visual interface, which allows autocompletion and concept aggregation, as well as a RESTful API that enables programmatic access to the search system.
To recognize named entities in the literature, Thalia uses components of Argo (Rak ), a text mining workflow system. These include NER modules for chemicals, drugs and metabolites (Kolluru ; Nobata ), genes, diseases and proteins (Rak ), species (Wang ), and anatomical entities (Pyysalo and Ananiadou, 2014). These modules are based on dictionary matching as well as conditional random field models trained on human-annotated data. The recognition step is followed by normalization (Batista-Navarro ) to concepts from the following ontologies: ChEBI (chemicals), DrugBank (drugs), HMDB (metabolites), HGNC (genes), UMLS Metathesaurus (diseases), UniProt (proteins), NCBI (species) and CARO (anatomical).
We leverage our acronym disambiguation module (Okazaki ) to improve NER precision and recall. If the long (i.e. spelled-out) form of an acronym is recognized as a concept by the NER but its short form is not, we extend the concept to the short form as well. Similarly, if the NER recognizes a concept in the short form that differs from the one recognized in the long form, we correct the short form to match the long form. This follows the observation that long forms are less ambiguous, whereas NER models can be deceived by ad hoc abbreviations.
The search system was implemented using Elasticsearch (https://www.elastic.co/products/elasticsearch) and can be accessed from a web-based interface written in JavaScript (Fig. 1). Semantic search is enabled by expanding the query area or by interacting with the different entity facets, which suggest the most frequent entities to narrow down the list of retrieved documents. Thalia also allows inspecting the full text of each abstract with its entities highlighted and linked to the corresponding concepts in the ontology. Alternatively, the API allows users to bypass the visual interface and interact with Thalia’s search engine programmatically.
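The two acronym rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Mention` class, the `reconcile_acronym` function and the concept identifiers are hypothetical names introduced here, not part of Thalia or Argo, and the sketch assumes that long-form/short-form pairs have already been detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    """A text span plus the concept ID assigned by NER (None if unrecognized).

    Hypothetical structure used for illustration only.
    """
    text: str
    concept_id: Optional[str] = None


def reconcile_acronym(long_form: Mention, short_form: Mention) -> Mention:
    """Return the short-form mention after applying the two rules.

    Rule 1: long form recognized, short form not -> extend the concept
            to the short form.
    Rule 2: both recognized but with different concepts -> trust the
            long form and correct the short form.
    If the long form itself is unrecognized, the short form is left as-is.
    """
    if long_form.concept_id is None:
        return short_form
    if short_form.concept_id != long_form.concept_id:
        return Mention(short_form.text, long_form.concept_id)
    return short_form


# Made-up identifiers, for illustration only.
long_mention = Mention("androgen receptor", "GENE:0001")
extended = reconcile_acronym(long_mention, Mention("AR"))
corrected = reconcile_acronym(long_mention, Mention("AR", "CHEM:0042"))
```

In both calls the short form "AR" ends up carrying the concept of its long form, which reflects the assumption that spelled-out forms are the more reliable signal.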
The Supplementary Material contains documentation for the web-based interface and the API, as well as a video that shows how users can benefit from the semantic search capacity of Thalia.
Fig. 1. The user interface of Thalia is divided into a search area (top), a main search results pane (middle), and faceted results for publication metadata (left) and entities (right)

We evaluated the search capacity of Thalia in a precision medicine (PM) scenario. In PM, medical practitioners need to find the best treatment given a patient's disease and genetic features. We use data from the TREC 2017 PM shared task (Roberts ). The challenge involved a set of patient cases, each described by the patient's disease, genetic variants and demographic information. The goal was to find documents (PubMed entries, conference abstracts and clinical trials) that are relevant, i.e. that relate to a potential treatment for the patient. We consider only the PubMed part of the task, since this is the corpus that the openly available version of Thalia operates on. More information about the multi-source version can be found in a separate publication (Przybyła ). We experimented with two main search strategies on the TREC PM dataset. The first strategy employed a purely textual search over the disease, gene and demographic data of the patients. The second strategy incorporated the semantic search capacity of Thalia, which involves textual as well as concept matching. Concept matching retrieves documents based on whether the same concept is present in the query and the document, regardless of whether the same string occurs in both. In this way, vocabulary mismatch between query and documents is addressed, which improves retrieval performance. The concepts in the query are obtained using a feature of Thalia that, given a term, returns the most likely concept associated with it. The results are shown in Table 1. As in the shared task evaluation, performance was measured using infNDCG, precision at 10 (P@10) and R-prec (Roberts ).
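The difference between the two search strategies can be illustrated with a toy in-memory example. This is a minimal sketch, not Thalia's Elasticsearch implementation: the corpus, the concept identifiers and the term-to-concept lookup are all hypothetical, and substring matching stands in for a real full-text query.

```python
from typing import Dict, List, Optional, Set, TypedDict


class Doc(TypedDict):
    text: str
    concepts: Set[str]


# Hypothetical toy corpus: each abstract carries its text and the concept IDs
# that an NER + normalization step would have attached to it.
DOCS: Dict[str, Doc] = {
    "d1": {"text": "Acetylsalicylic acid reduces fever.", "concepts": {"DRUG:1"}},
    "d2": {"text": "Aspirin was given daily.", "concepts": {"DRUG:1"}},
    "d3": {"text": "Ibuprofen was given daily.", "concepts": {"DRUG:2"}},
}

# Hypothetical stand-in for "given a term, return the most likely concept".
TERM_TO_CONCEPT: Dict[str, str] = {
    "aspirin": "DRUG:1",
    "acetylsalicylic acid": "DRUG:1",
    "ibuprofen": "DRUG:2",
}


def textual_search(term: str) -> List[str]:
    """Return IDs of documents whose text contains the query string."""
    needle = term.lower()
    return sorted(doc_id for doc_id, doc in DOCS.items()
                  if needle in doc["text"].lower())


def semantic_search(term: str) -> List[str]:
    """Return IDs matched by string OR by the concept the term maps to."""
    hits = set(textual_search(term))
    concept: Optional[str] = TERM_TO_CONCEPT.get(term.lower())
    if concept is not None:
        hits |= {doc_id for doc_id, doc in DOCS.items()
                 if concept in doc["concepts"]}
    return sorted(hits)


print(textual_search("aspirin"))   # ['d2']
print(semantic_search("aspirin"))  # ['d1', 'd2']
```

The textual query misses the abstract that uses the synonym "acetylsalicylic acid", while the concept-aware query retrieves it. This is the vocabulary mismatch that concept matching is meant to address.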
Note that some of the retrieved documents may not have been assessed by the shared task evaluators; taking a conservative approach, we treated those documents as not relevant in this post hoc evaluation. This implies that the results in Table 1 represent a lower bound on the actual performance. Additionally, we report the average time for query processing and retrieval via the API. The results indicate that Thalia’s semantic capacity leads to improved retrieval performance with little increase in processing time.
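As an illustration of the conservative treatment of unjudged documents, the following sketch computes precision at k and R-prec from a ranked list and a set of relevance judgments, counting any document without a judgment as not relevant. The function names and the example data are hypothetical. infNDCG is not covered here because it is an inferred measure that depends on how the judgment pool was sampled.

```python
from typing import Dict, List


def precision_at_k(ranked: List[str], judgments: Dict[str, bool], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant.

    Documents missing from `judgments` (unjudged) count as not relevant.
    """
    top = ranked[:k]
    return sum(1 for doc_id in top if judgments.get(doc_id, False)) / k


def r_precision(ranked: List[str], judgments: Dict[str, bool]) -> float:
    """Precision at rank R, where R is the number of known relevant documents."""
    r = sum(1 for relevant in judgments.values() if relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked[:r] if judgments.get(doc_id, False)) / r


# Made-up example: "x" was retrieved but never assessed.
judgments = {"a": True, "b": False, "c": True, "d": True}
ranked = ["a", "x", "c", "b"]

p_at_2 = precision_at_k(ranked, judgments, k=2)  # "a" relevant, "x" unjudged
r_prec = r_precision(ranked, judgments)          # R = 3, top 3 = a, x, c
```

Because an unjudged document can only lower these scores, the values computed this way are a lower bound on what a complete assessment would give.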
Table 1. System performance in terms of infNDCG, precision at 10 (P@10), R-prec and retrieval time per query in seconds, depending on whether the semantic concepts are used for retrieval or not

System    infNDCG   P@10    R-prec   Query time (s)
Textual   0.338     0.403   0.213    1.22
Thalia    0.383     0.427   0.230    1.86
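To put the figures in Table 1 in proportion, the short sketch below computes the relative change between the two configurations. The numbers are copied from the table; the `relative_change` helper is a hypothetical name introduced for this example.

```python
# Values copied from Table 1.
TABLE_1 = {
    "Textual": {"infNDCG": 0.338, "P@10": 0.403, "R-prec": 0.213, "Query time (s)": 1.22},
    "Thalia": {"infNDCG": 0.383, "P@10": 0.427, "R-prec": 0.230, "Query time (s)": 1.86},
}


def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline * 100.0


changes = {
    metric: relative_change(baseline, TABLE_1["Thalia"][metric])
    for metric, baseline in TABLE_1["Textual"].items()
}

for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
```

On these values, infNDCG, P@10 and R-prec improve by roughly 13.3%, 6.0% and 8.0% respectively, while the average query time grows by about 0.64 s (around 52%).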

Funding

This work was supported by BBSRC, Enriching Metabolic PATHwaY models with evidence from the literature (EMPATHY) [Grant ID: BB/M006891/1] and The Manchester Molecular Pathology Innovation Centre (MMPathIC) [Grant ID: MR/N00583X/1].

Conflict of Interest: none declared.
  16 in total

1.  Mining metabolites: extracting the yeast metabolome from the literature.

Authors:  Chikashi Nobata; Paul D Dobson; Syed A Iqbal; Pedro Mendes; Jun'ichi Tsujii; Douglas B Kell; Sophia Ananiadou
Journal:  Metabolomics       Date:  2010-10-31       Impact factor: 4.290

2.  Building a high-quality sense inventory for improved abbreviation disambiguation.

Authors:  Naoaki Okazaki; Sophia Ananiadou; Jun'ichi Tsujii
Journal:  Bioinformatics       Date:  2010-03-25       Impact factor: 6.937

Review 3.  PubMed and beyond: a survey of web tools for searching biomedical literature.

Authors:  Zhiyong Lu
Journal:  Database (Oxford)       Date:  2011-01-18       Impact factor: 3.451

4.  Using workflows to explore and optimise named entity recognition for chemistry.

Authors:  Balakrishna Kolluru; Lezan Hawizy; Peter Murray-Rust; Junichi Tsujii; Sophia Ananiadou
Journal:  PLoS One       Date:  2011-05-25       Impact factor: 3.240

5.  Argo: an integrative, interactive, text mining-based workbench supporting curation.

Authors:  Rafal Rak; Andrew Rowley; William Black; Sophia Ananiadou
Journal:  Database (Oxford)       Date:  2012-03-20       Impact factor: 3.451

6.  GeneView: a comprehensive semantic search engine for PubMed.

Authors:  Philippe Thomas; Johannes Starlinger; Alexander Vowinkel; Sebastian Arzt; Ulf Leser
Journal:  Nucleic Acids Res       Date:  2012-06-12       Impact factor: 16.971

7.  Disambiguating the species of biomedical named entities using natural language parsers.

Authors:  Xinglong Wang; Jun'ichi Tsujii; Sophia Ananiadou
Journal:  Bioinformatics       Date:  2010-01-06       Impact factor: 6.937

8.  PubTator: a web-based text mining tool for assisting biocuration.

Authors:  Chih-Hsuan Wei; Hung-Yu Kao; Zhiyong Lu
Journal:  Nucleic Acids Res       Date:  2013-05-22       Impact factor: 16.971

9.  Anatomical entity mention recognition at literature scale.

Authors:  Sampo Pyysalo; Sophia Ananiadou
Journal:  Bioinformatics       Date:  2013-10-25       Impact factor: 6.937

10.  Processing biological literature with customizable Web services supporting interoperable formats.

Authors:  Rafal Rak; Riza Theresa Batista-Navarro; Jacob Carter; Andrew Rowley; Sophia Ananiadou
Journal:  Database (Oxford)       Date:  2014-07-08       Impact factor: 3.451
