Ricardo A Dorr, Juan J Casal, Roxana Toriano.
Abstract
OBJECTIVES: Automated systems for information extraction are increasingly useful because of the enormous scale of the existing literature and the growing number of scientific articles published worldwide in the field of medicine. We aimed to develop an accessible method using the open-source platform KNIME to perform text mining (TM) on indexed publications. Material from scientific publications in the field of life sciences was obtained and integrated by mining information on hemolytic uremic syndrome (HUS) as a case study.
Keywords: Bibliography; Data Mining; Hemolytic Uremic Syndrome; Information Storage and Retrieval; Tutorial
Year: 2022 PMID: 35982602 PMCID: PMC9388920 DOI: 10.4258/hir.2022.28.3.276
Source DB: PubMed Journal: Healthc Inform Res ISSN: 2093-3681
Figure 1. Example of starting a workflow. (A) From the KNIME File menu, select “Install KNIME Extensions.” (B) To select the required text mining extension (KNIME Textprocessing), “text” can be typed in the text box. Similarly, the Vernalis KNIME Nodes and KNIME Indexing and Searching extensions must be installed. (C) The workflow starts with a search on the European PubMed Central (ePMC) site. Specific query terms should be typed in the General Query text box. In our example, the query terms were (“Haemolytic uraemic syndrome” OR “Hemolytic uremic syndrome”), two spellings of the same clinical syndrome. The years of publication were limited to 2020–2021. The Test Query button is used to check the number of hits. The node returns an XML document. (D) The XPath node selects the fields of interest (see Column Name) from the XML document. (E) All the fields are indexed with the Table Indexer node. (F) The Index Query node creates a filtered data table, which is the input corpus for the following nodes. In the figure, only the articles published in 2020 are selected. Configuration windows C, D, E, and F are opened with a double left click on the node icon (blue arrow).
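Steps C and D of the figure can be approximated outside KNIME. The following is a minimal Python sketch, not the authors' workflow: it builds the query string from the example terms and pulls selected fields out of one XML record. The `PUB_YEAR` range syntax, the helper names `build_query` and `extract_fields`, and the hand-written sample record are assumptions made for illustration; no request is sent to ePMC.

```python
import xml.etree.ElementTree as ET

# Query terms and year range taken from the Figure 1 example.
TERMS = ["Haemolytic uraemic syndrome", "Hemolytic uremic syndrome"]


def build_query(terms, first_year, last_year):
    """Join quoted phrases with OR and restrict the publication years."""
    phrases = " OR ".join(f'"{t}"' for t in terms)
    # Assumption: PUB_YEAR with a [first TO last] range is the year filter.
    return f"({phrases}) AND PUB_YEAR:[{first_year} TO {last_year}]"


# Hypothetical, hand-written record standing in for one XML result row.
SAMPLE_XML = """
<result>
  <pmid>00000000</pmid>
  <title>Placeholder title</title>
  <authorString>Author A, Author B.</authorString>
  <pubYear>2020</pubYear>
  <abstractText>Placeholder abstract about HUS.</abstractText>
</result>
"""

# Field names as they appear in the Index Query examples of this article.
FIELDS = ["pmid", "title", "authorString", "pubYear", "abstractText"]


def extract_fields(xml_text, fields=FIELDS):
    """Return one value per field, similar in spirit to the XPath node."""
    root = ET.fromstring(xml_text)
    # ".//name" is part of the XPath subset that ElementTree supports.
    return {name: root.findtext(f".//{name}") for name in fields}


query = build_query(TERMS, 2020, 2021)
record = extract_fields(SAMPLE_XML)
print(query)
print(record)
```

Each key of `record` corresponds to one output column of the XPath node; a field missing from the XML yields `None` rather than an error.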
Examples of syntax of query text in the KNIME Index Query node
| Information to obtain | Syntax in query text |
|---|---|
| All publications in year 2000 | pubYear:2000 |
| All publications from 1986 to 2021 inclusive | pubYear:[1986 TO 2021] |
| All publications from 1900 to 2022 excluding 2021 | pubYear:(19* OR 2*) AND NOT pubYear:2021 |
| All publications with token HUS in abstracts | abstractText:HUS |
| All publications with authors with surname Davis | authorString:Davis |
| All publications by the author JE Davis | authorString:"Davis JE" |
| All published works in the journal Healthc Inform Res | medlineAbbreviation:"Healthc Inform Res" |
| All published works in 2020 in the journal Healthc Inform Res | medlineAbbreviation:"Healthc Inform Res" AND pubYear:2020 |
The query syntax is based on Apache Lucene (https://lucene.apache.org/), so the logical operators AND, AND NOT, OR, and OR NOT can be used. Lucene expects these operators in uppercase.
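To make the meaning of these queries concrete, the sketch below re-expresses four of them as plain Python filters over a small list of records. It is not Lucene and does not reproduce its analyzers or scoring; the records, the helper names `year_range` and `has_token`, and the case-insensitive token match are assumptions made for illustration.

```python
# Hypothetical records that reuse the field names from the table above.
records = [
    {"pubYear": 1985, "medlineAbbreviation": "J Example",
     "abstractText": "Early placeholder report"},
    {"pubYear": 2020, "medlineAbbreviation": "Healthc Inform Res",
     "abstractText": "Placeholder text on HUS mining"},
    {"pubYear": 2021, "medlineAbbreviation": "Healthc Inform Res",
     "abstractText": "Placeholder KNIME tutorial"},
]


def year_range(rec, first, last):
    """pubYear:[first TO last]: square brackets include both end points."""
    return first <= rec["pubYear"] <= last


def has_token(rec, field, token):
    """field:token: match a whole token (case-insensitive by assumption)."""
    return token.lower() in rec[field].lower().split()


# pubYear:[1986 TO 2021]
in_range = [r for r in records if year_range(r, 1986, 2021)]

# pubYear:(19* OR 2*) AND NOT pubYear:2021
not_2021 = [r for r in records
            if str(r["pubYear"]).startswith(("19", "2"))
            and r["pubYear"] != 2021]

# medlineAbbreviation:"Healthc Inform Res" AND pubYear:2020
journal_2020 = [r for r in records
                if r["medlineAbbreviation"] == "Healthc Inform Res"
                and r["pubYear"] == 2020]

# abstractText:HUS
with_hus = [r for r in records if has_token(r, "abstractText", "HUS")]

print(len(in_range), len(not_2021), len(journal_2020), len(with_hus))
```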
Figure 2. Example for obtaining a list of authors. The authorString table lists the authors signing the publication after running the Index Query node. All authors of a publication are in the same row (see entry detail). The transposition performed with the Transpose node and the splitting of a cell into its constituent components (Cell Splitter node) are used to obtain the individual list of authors (see output detail). The results of a node action can be viewed by opening a window with a right click on the node icon (green arrow). The orange arrow indicates a brief node description.
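The effect of splitting each authorString cell into individual names can be sketched in a few lines of Python. This is an illustration only: it assumes that names are separated by commas and that the string may end with a period, and the function name `split_authors` and the sample rows are hypothetical.

```python
def split_authors(author_strings):
    """Flatten rows of comma-separated author names into one list."""
    authors = []
    for row in author_strings:
        # Drop a trailing period or space, then split on the comma delimiter.
        for name in row.rstrip(". ").split(","):
            name = name.strip()
            if name:
                authors.append(name)
    return authors


# Hypothetical authorString cells, one per publication.
rows = ["Dorr RA, Casal JJ, Toriano R.", "Davis JE."]
authors = split_authors(rows)
print(authors)
```

Each element of `authors` corresponds to one row of the individual author list shown in the output detail.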
Figure 3. Example of automated and unsupervised detection of topics in abstracts about hemolytic uremic syndrome (HUS) and quantification of their characteristic words. The example shows the topics detected in publications on HUS from 2020 to 2021 inclusive. A proposed text preprocessing method that facilitates subsequent analysis is also exemplified: it eliminates characters and words without semantic importance, groups words by lemmatization, and labels the tokens. The result of topic detection (fork 1) is shown in tabular form but could also be presented graphically. The word cloud (result of fork 2) represents the most abundant words in a bag of words; the larger a word appears, the higher its frequency of use. Words in a topic have the same color. Green arrow: output with right click; orange arrow: brief node description.
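Part of the preprocessing chain and the word counts behind the word cloud (fork 2) can be mimicked with the Python standard library. The sketch below covers only punctuation removal, case conversion, stop word filtering, and frequency counting; lemmatization, tagging, and LDA topic extraction are not reproduced. The stop word list and the two sentences are illustrative placeholders, not data from the study.

```python
import string
from collections import Counter

# Illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "with", "by"}


def preprocess(text):
    """Erase punctuation, lower-case the text, and drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


# Placeholder sentences standing in for abstracts of the corpus.
abstracts = [
    "Hemolytic uremic syndrome is associated with Shiga toxin.",
    "Shiga toxin and the kidney in hemolytic uremic syndrome.",
]

# Bag of words: term frequencies summed over all documents.
bag = Counter(token for text in abstracts for token in preprocess(text))
print(bag.most_common(5))
```

In a word cloud, the counts in `bag` would determine the size of each word.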
Figure 4. Workflow with cross-referencing. The input table (see details at the top) contains the terms to be identified in the corpus by means of the Cross Joiner node. Unrecognized terms are excluded by applying a filter in the Row Filter node. The output shows the number of times that each FDA-approved drug was found in abstracts (see details at the bottom). Green arrow: output with right click; orange arrow: brief node description; FDA: Food and Drug Administration.
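The cross-referencing logic can be sketched as a cross join followed by a filter and a count. The Python example below pairs every term with every abstract, keeps the pairs in which the term occurs, and counts the abstracts per term. The term list, the placeholder abstracts, and the case-insensitive substring match are assumptions made for illustration and are not taken from the study.

```python
from collections import Counter
from itertools import product

# Illustrative input table of terms to look for.
terms = ["eculizumab", "rituximab", "aspirin"]

# Placeholder abstracts standing in for the corpus.
abstracts = [
    "Placeholder abstract mentioning eculizumab.",
    "Placeholder abstract mentioning Eculizumab and rituximab.",
    "Placeholder abstract without any listed drug.",
]

# Cross join: every term paired with every abstract.
pairs = list(product(terms, abstracts))

# Row filter: keep only the pairs where the term occurs in the abstract.
matches = [(term, text) for term, text in pairs
           if term.lower() in text.lower()]

# Count the number of abstracts in which each term was found.
counts = Counter(term for term, _ in matches)
print(counts)
```

Terms that never match, such as the third entry here, simply do not appear in `counts`, which parallels the exclusion of unrecognized terms by the filter.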
Description of the text mining nodes used in the tutorials (based on the KNIME node descriptions)
| Node | Description |
|---|---|
| European PubMed Central Advanced Search | It recreates the advanced query interface of European PubMed Central. It returns a single column of query results as XML cells (one row per result). |
| XPath | It takes the XML documents and performs XPath queries on them. |
| Table Indexer | It creates an index from the input table. |
| Index Query | It queries a given index. |
| String Replacer | It replaces values in string cells if they match a certain pattern. |
| Topic Extractor (Parallel LDA) | Simple parallel threaded implementation of the generative statistical model known as latent Dirichlet allocation (LDA). |
| Bag of Words Creator | It creates a bag of words from a set of documents. |
| Tags to Strings | It converts the term’s tag values of the specified tag types to strings. |
| Strings to Document | It converts strings to documents. |
| Punctuation Erasure | It removes all punctuation characters from the input documents. |
| Stop Word Filter | It filters out all terms of the input documents that are contained in a specified stop word list. |
| Abner Tagger | It recognizes biomedical named entities and assigns tags to the corresponding terms. |
| POS Tagger | It assigns a part of speech (POS) tag to each term of a document, using the Penn Treebank tag set. |
| Stanford Lemmatizer | It lemmatizes terms contained in the input documents with the Stanford CoreNLP library. |
| Case Converter | It converts all terms contained in the input documents to lower or upper case. |