Marie Macnee1, Eduardo Pérez-Palma2, Sarah Schumacher-Bass3, Jarrod Dalton4, Costin Leu5, Daniel Blankenberg5, Dennis Lal1,5,6,7.
Abstract
Literature exploration in PubMed on a large number of biomedical entities (e.g., genes, diseases or experiments) can be time-consuming and challenging, especially when assessing associations between entities. Here, we describe SimText, a user-friendly toolset that provides customizable and systematic workflows for the analysis of similarities among a set of entities based on text. SimText can be used for (i) text collection from PubMed and extraction of words with different text mining approaches, and (ii) analysis and visualization of the data using unsupervised learning techniques in an interactive app.
Year: 2021 PMID: 34037702 PMCID: PMC9502138 DOI: 10.1093/bioinformatics/btab365
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Schematic presentation of the SimText toolset. (A) Tools are shown in dark blue boxes. Top left: For text collection of a set of entities (e.g. gene names), the entities are provided as search queries to retrieve abstracts or PMIDs from PubMed (‘pubmed_by_queries’). Instead, the user can provide manually curated PMIDs for each entity, which are used to fetch the corresponding abstracts (‘abstracts_by_pmids’). Bottom left: From the collected abstracts and/or manually curated text, the vocabulary associated with each entity is extracted, with various optional text-mining techniques available (‘text_to_wordmatrix’). Alternatively, using PMIDs as input, scientific terms of specific categories can be extracted for each entity using PubTator (‘pmids_to_pubtator_matrix’). In both approaches, the output is a binary matrix of all extracted words and entities. Right: The generated matrix is analyzed in an interactive app (‘simtext_app’). The key characteristics of the entities can be explored, and different dimension reduction and clustering techniques can be applied to the matrix to visualize similarities among the entities. Custom grouping variables (e.g. associated diseases or pathways of genes) can be compared with the grouping of the entities based on their associated vocabulary. (B) Dimensionality reduction plot and hierarchical clustering of monogenic disorder genes (use-case example 1) in the SimText app.
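The binary entity-by-word matrix described in the caption can be illustrated with a minimal Python sketch. This is not the SimText implementation: the toy texts, the entity names (GENE_A, GENE_B, GENE_C), the helper functions and the choice of Jaccard distance are all assumptions made for illustration. It only shows how a binary matrix of extracted words could be built and how pairwise distances between entities, the usual input to hierarchical clustering, could be derived from it.

```python
import re
from itertools import combinations

# Toy text per entity (hypothetical, not real PubMed output).
texts = {
    "GENE_A": "epilepsy seizure channel sodium",
    "GENE_B": "seizure epilepsy sodium neuron",
    "GENE_C": "cardiac muscle heart channel",
}

def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def binary_matrix(texts):
    """Return (entities, vocabulary, rows); rows[i][j] is 1 if
    entity i is associated with word j, else 0."""
    words = {entity: tokenize(text) for entity, text in texts.items()}
    vocab = sorted(set().union(*words.values()))
    entities = sorted(words)
    rows = [[1 if w in words[e] else 0 for w in vocab] for e in entities]
    return entities, vocab, rows

def jaccard_distance(a, b):
    """1 - |shared words| / |words in either row| for two binary rows."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return 1.0 - both / either if either else 0.0

entities, vocab, rows = binary_matrix(texts)

# Pairwise distances between entities, keyed by entity-name pairs.
distances = {
    (entities[i], entities[j]): jaccard_distance(rows[i], rows[j])
    for i, j in combinations(range(len(entities)), 2)
}

for pair, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
```

In this toy example, GENE_A and GENE_B share most of their vocabulary and end up closest, so a clustering step applied to these distances would group them first.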