
SimText: A text mining framework for interactive analysis and visualization of similarities among biomedical entities.

Marie Macnee¹, Eduardo Pérez-Palma², Sarah Schumacher-Bass³, Jarrod Dalton⁴, Costin Leu⁵, Daniel Blankenberg⁵, Dennis Lal¹,⁵,⁶,⁷.

Abstract

Literature exploration in PubMed on a large number of biomedical entities (e.g., genes, diseases or experiments) can be time-consuming and challenging, especially when assessing associations between entities. Here, we describe SimText, a user-friendly toolset that provides customizable and systematic workflows for the analysis of similarities among a set of entities based on text. SimText can be used for (i) text collection from PubMed and extraction of words with different text mining approaches, and (ii) interactive analysis and visualization of data using unsupervised learning techniques in an interactive app.
AVAILABILITY AND IMPLEMENTATION: We developed SimText as open-source R software and integrated it into Galaxy (https://usegalaxy.eu), an online data analysis platform, with supporting self-learning training material available at https://training.galaxyproject.org. A command-line version of the toolset is available for download from GitHub (https://github.com/dlal-group/simtext) or as a Docker image (https://hub.docker.com/r/dlalgroup/simtext/tags).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Year: 2021 PMID: 34037702 PMCID: PMC9502138 DOI: 10.1093/bioinformatics/btab365

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

Researchers rely on time-intensive manual literature surveys to compare biomedical entities (e.g. genes, authors or disorders) with one another and to learn about the overall research landscape. Various tools and packages have been developed to extract higher-level information from the literature in a systematic way. Without the need for programming, different web tools and databases provide summary statistics and annotations for the results of a single search term (Garcia-Pelaez et al., 2019; Wei et al., 2019) or associations and relationships among biomedical entities in the literature (Ren et al.; Szklarczyk et al., 2019). However, such web tools and databases cannot be customized, do not visualize the results or are focused on specific applications (e.g. relationships among proteins). To compute similarities between entities, many recently published methods use word and concept embedding techniques, e.g. BioBERT, as opposed to comparing raw text words (Junge and Jensen, 2020; Lee et al.; Szklarczyk et al., 2019). Here, we focus on a different approach that can be applied to any kind of string. Frequent words or scientific terms are extracted from text and compared among the biomedical entities of interest, under the assumption that more similar or related entities share more frequently co-occurring words and scientific terms in their text sources than unrelated entities. Our semi-automatic framework for literature research, SimText, allows users to collect text from PubMed for any given set of biomedical entities, extract the associated vocabulary and visually inspect similarities among the entities and their key characteristics in an interactive app. To make large-scale literature analyses accessible to everyone, including people who do not code, we provide the SimText toolset in the online data analysis platform Galaxy, where it requires no installation.
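The assumption that related entities share more vocabulary can be illustrated with a simple set-overlap score. The following is a minimal Python sketch, not SimText's own R implementation: the Jaccard index is used here only as one possible overlap measure, and the entity names and word sets are invented for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Fraction of terms shared by both sets, out of all terms in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy vocabularies (assumed, not taken from the paper's data).
vocab = {
    "gene_A": {"potassium", "channel", "neuron", "epilepsy"},
    "gene_B": {"potassium", "channel", "cochlea", "deafness"},
    "gene_C": {"collagen", "skin", "fibril"},
}

# gene_A and gene_B share two of six distinct terms; gene_A and gene_C share none.
sim_ab = jaccard(vocab["gene_A"], vocab["gene_B"])
sim_ac = jaccard(vocab["gene_A"], vocab["gene_C"])
```

Under this toy measure, the two entities with overlapping vocabulary score higher than the unrelated pair, which is the intuition the text-based grouping relies on.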

2 SimText description and workflow

SimText tools can be used individually or combined for a complete analysis, as detailed in Figure 1A. All tools are available as command-line tools or, alongside many other text manipulation tools, as part of the online data analysis platform Galaxy (Afgan et al., 2018). Detailed descriptions of the tools can be found in Supplementary Material. SimText consists of the following three modules:

Fig. 1.

Schematic presentation of the SimText toolset. (A) Tools are shown in dark blue boxes. Top left: For text collection of a set of entities (e.g. gene names), the entities are provided as search queries to retrieve abstracts or PMIDs from PubMed (‘pubmed_by_queries’). Alternatively, the user can provide manually curated PMIDs for each entity that are used to fetch the corresponding abstracts (‘abstracts_by_pmids’). Bottom left: From the collected abstracts and/or manually curated text, the corresponding vocabulary associated with each entity is extracted while providing various optional text-mining techniques (‘text_to_wordmatrix’). Alternatively, using PMIDs as input, scientific terms of specific categories can be extracted for each entity using PubTator (‘pmids_to_pubtator_matrix’). In both approaches, the output represents a binary matrix with all extracted words and entities. Right: Analysis of the generated matrix is enabled by an interactive app (‘simtext_app’). The key characteristics of the entities can be explored, and different dimension reduction and clustering techniques can be applied to the matrix to visualize similarities among the entities. Custom grouping variables (e.g. associated diseases or pathways of genes) can be compared with the grouping of the entities based on their associated vocabulary. (B) Dimensionality reduction plot and hierarchical clustering of monogenic disorder genes (use-case example 1) in the SimText app.

Text collection: SimText provides two tools to collect text from abstracts related to the entities of interest. Abstracts or PubMed identifiers (PMIDs) related to the entities of interest (e.g. gene names) can be retrieved automatically based on PubMed’s keyword search rules and syntax using the ‘pubmed_by_queries’ tool. Using the ‘abstracts_by_pmids’ tool, pre-populated PMIDs for each entity can be used to fetch the corresponding abstracts. Alternatively, custom text can be provided to be analyzed instead of or in addition to the collected text.

Text mining: Two different vocabularies can be generated for each biomedical entity. The ‘text_to_wordmatrix’ tool identifies the most frequently occurring words, after optional word quality control, from all collected text for each biomedical entity. Alternatively, the ‘pmids_to_pubtator_matrix’ tool extracts scientific terms using PubTator (Wei et al., 2019) annotations of biomedical concepts. For both tools, the output is a high-dimensional binary matrix of all terms and biomedical entities.

Analysis and exploration: In an interactive app (‘simtext_app’, online in Galaxy or offline as a command-line tool), groups of related biomedical entities can be analyzed and visualized by applying different unsupervised learning techniques to the matrix from the previous step (detailed in Supplementary Material).
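To make the text mining step more concrete, the sketch below builds a small binary entity-by-word matrix from the most frequent words of each entity's text. It is a simplified Python illustration of the general idea, not the ‘text_to_wordmatrix’ tool itself: the stop-word list, the minimum word length, the function names and the toy texts are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumed; real word quality control is richer).
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}


def top_words(text: str, n: int) -> set[str]:
    """Return the n most frequent words in text, ignoring stop words and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return {word for word, _ in counts.most_common(n)}


def binary_matrix(texts: dict[str, str], n: int = 2):
    """Build a 0/1 matrix: one row per entity, one column per extracted word."""
    per_entity = {entity: top_words(text, n) for entity, text in texts.items()}
    columns = sorted(set().union(*per_entity.values()))
    rows = {
        entity: [1 if word in words else 0 for word in columns]
        for entity, words in per_entity.items()
    }
    return columns, rows


# Hypothetical toy texts standing in for collected abstracts.
texts = {
    "gene_A": "potassium channel potassium channel neuron potassium",
    "gene_B": "collagen fibril collagen skin fibril collagen",
}
columns, rows = binary_matrix(texts, n=2)
```

Each row of the resulting matrix marks which of the extracted words belong to an entity's top-n vocabulary; a matrix of this shape is what dimension reduction and clustering would then operate on.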

3 Use-case examples

Our use-case examples are described in detail in Supplementary Material and can be reproduced using commands and data from our GitHub repository or by following our Galaxy training material. In one use-case example, we validated the approach by performing a SimText analysis on 95 monogenic disorder genes, hypothesizing that large-scale gene-level information extraction from abstracts could be used to replicate expert-curated disorder categories. In the downstream analysis in the interactive SimText app, we found that the gene grouping based on associated vocabulary is concordant with the expert-curated disorder categories (Fig. 1B). To quantify this, we calculated the Adjusted Rand Index (ARI) and found a moderate (ARI = 0.6, using the ‘text_to_wordmatrix’ tool to extract frequent words) to good (ARI = 0.84, using the ‘pmids_to_pubtator_matrix’ tool to extract frequent scientific terms) agreement between the expert-curated disorder categories and the text-based gene grouping. Notably, several genes do not cluster or group with genes of their pre-existing disorder category. For example, KCNQ4 was found in a cluster of genes associated with neurodevelopmental disorders, particularly with KCNB1, but was pre-defined to be associated with non-syndromic genetic deafness. A literature search shows that SimText can help to identify similar genes beyond clinical phenotype grouping, as both genes share similarities in biology: KCNQ4 and KCNB1 both encode potassium channels, and KCNQ4 is thought to play a critical role in the regulation of neuronal excitability, particularly in sensory cells of the cochlea (Kubisch et al., 1999).

In another example, we systematically assessed shared (or distinct) interests among 185 researchers from 12 different departments based on the word content of their published abstracts. By visualizing the results in the SimText app, we could identify groups of researchers with similar interests beyond department borders, which is valuable knowledge that could lead to fruitful collaborations. As shown, SimText can extract valuable information from very different types of entities, such as genes and researchers. For illustrative purposes, the results can be found at https://simtext.shinyapps.io/genes and https://simtext.shinyapps.io/researcher.
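For readers unfamiliar with the Adjusted Rand Index used above, the following Python sketch computes it from two labelings using the standard pair-counting formula. It is a generic illustration and not the code used for the reported values; the label lists are invented and do not reproduce the 95-gene analysis.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b) -> float:
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree
    # Pairs of items placed together in both labelings (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items placed together within each labeling on its own.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in one group in both labelings
    return (together_both - expected) / (maximum - expected)


# Hypothetical example: curated categories versus a text-based clustering.
curated = ["neuro", "neuro", "neuro", "deaf", "deaf", "deaf"]
text_based = ["c1", "c1", "c2", "c2", "c2", "c2"]
ari = adjusted_rand_index(curated, text_based)
```

An ARI of 1 indicates identical groupings regardless of how the groups are named, while values near 0 indicate agreement no better than expected by chance.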

4 Conclusion

SimText enables the extraction of knowledge for large lists of biomedical entities and the visual exploration of any existing relationships among them. In this way, overlooked implicit similarities and connections in the literature can be found, and hypotheses for scientific research can be generated. We demonstrate the versatility of SimText in our use-case examples, e.g. by identifying investigators with similar interests in a large multi-disciplinary research institute. The SimText tools require neither programming knowledge nor installation and can be used individually or in different workflows for a large number of possible use-cases.

Financial Support: none declared.

Conflict of Interest: D.B. has a significant financial interest in GalaxyWorks, a company that may have a commercial interest in the results of this research and technology. This potential conflict of interest has been reviewed and is managed by the Cleveland Clinic.

References

1. PubTator central: automated concept annotation for biomedical full text articles.

Authors: Chih-Hsuan Wei; Alexis Allot; Robert Leaman; Zhiyong Lu
Journal: Nucleic Acids Res Date: 2019-07-02 Impact factor: 16.971

2. KCNQ4, a novel potassium channel expressed in sensory outer hair cells, is mutated in dominant deafness.

Authors: C Kubisch; B C Schroeder; T Friedrich; B Lütjohann; A El-Amraoui; S Marlin; C Petit; T J Jentsch
Journal: Cell Date: 1999-02-05 Impact factor: 41.582

3. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.

Authors: Enis Afgan; Dannon Baker; Bérénice Batut; Marius van den Beek; Dave Bouvier; Martin Cech; John Chilton; Dave Clements; Nate Coraor; Björn A Grüning; Aysam Guerler; Jennifer Hillman-Jackson; Saskia Hiltemann; Vahid Jalili; Helena Rasche; Nicola Soranzo; Jeremy Goecks; James Taylor; Anton Nekrutenko; Daniel Blankenberg
Journal: Nucleic Acids Res Date: 2018-07-02 Impact factor: 16.971

4. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets.

Authors: Damian Szklarczyk; Annika L Gable; David Lyon; Alexander Junge; Stefan Wyder; Jaime Huerta-Cepas; Milan Simonovic; Nadezhda T Doncheva; John H Morris; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

5. PubTerm: a web tool for organizing, annotating and curating genes, diseases, molecules and other concepts from PubMed records.

Authors: José Garcia-Pelaez; David Rodriguez; Roberto Medina-Molina; Gerardo Garcia-Rivas; Carlos Jerjes-Sánchez; Victor Trevino
Journal: Database (Oxford) Date: 2019-01-01 Impact factor: 3.451