Amy L Olex, Evan French, Peter Burdette, Srilakshmi Sagiraju, Thomas Neumann, Tamas S Gal, Bridget T McInnes.
Abstract
TopEx is a natural language processing application developed to facilitate the exploration of topics and key words in a set of texts through a user interface that requires no programming or natural language processing knowledge, thus enhancing the ability of nontechnical researchers to explore and analyze textual data. The underlying algorithm groups semantically similar sentences together and then runs a topic analysis on each group to identify the key topics discussed in a collection of texts. Implementation is achieved via a Python library back end and a web application front end built with React and D3.js for visualizations. TopEx has been successfully used to identify themes, topics and key words in a variety of corpora, including coronavirus disease 2019 (COVID-19) discharge summaries and tweets. Feedback from the BioCreative VII Challenge Track 4 concludes that TopEx is a useful tool for text exploration for a variety of users and tasks. Database URL: http://topex.cctr.vcu.edu
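The two-stage idea described in the abstract (group similar sentences, then summarize each group by its key terms) can be illustrated with a minimal sketch. This is not TopEx's implementation or API: the function names `tokenize`, `cluster_sentences` and `top_terms` are hypothetical, bag-of-words cosine similarity stands in for whatever semantic similarity measure TopEx uses, and raw term frequency stands in for its topic analysis.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not TopEx's list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def tokenize(sentence):
    """Lowercase a sentence and return its non-stopword tokens."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of sentence indices."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def linkage(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    # Repeatedly merge the two most similar clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        pairs = (
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        _, i, j = max(pairs, key=lambda p: p[0])
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def top_terms(sentences, cluster, k=2):
    """Return the k most frequent terms among the sentences in one cluster."""
    counts = Counter()
    for i in cluster:
        counts.update(tokenize(sentences[i]))
    return [term for term, _ in counts.most_common(k)]


# Made-up example sentences, for illustration only.
sentences = [
    "Patient reports fever and cough",
    "Cough and fever persisted for three days",
    "Vaccine appointment scheduled next week",
    "Second vaccine dose appointment confirmed",
]
clusters = cluster_sentences(sentences, n_clusters=2)
for members in clusters:
    print(members, top_terms(sentences, members))
```

Changing `n_clusters` and calling `cluster_sentences` again regroups the same sentences without re-tokenizing by hand, which loosely mirrors the kind of exploration the interface is meant to support.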
Year: 2022 PMID: 35951425 PMCID: PMC9369716 DOI: 10.1093/database/baac063
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 4.462
Figure 1. TopEx NLP pipeline.
Figure 2. Screenshots of TopEx tabs and menu items. The first panel shows the ‘Load Data’ tab with options for importing text data into TopEx. The second panel shows the ‘Parameters’ (top), ‘Re-Cluster’ (middle) and ‘Import/Export’ (bottom) tabs. The ‘Parameters’ tab allows customization of analysis and algorithm settings. ‘Re-Cluster’ enables quick adjustment of the cluster number without re-running the NLP pipeline from scratch. ‘Import/Export’ enables saving TopEx results or importing previous TopEx analyses. The third panel expands the ‘Advanced Parameters’ section of the ‘Parameters’ tab.
Figure 3. Screenshot of the TopEx interface showing results presented as a t-SNE scatter plot, with sentence information displayed on the right when hovering over a point. The corpus is a randomly sampled set of tweets from March 2020 in the COVID-19 Twitter Chatter data set discussed in the ‘Use Case’ section (the same set of tweets that produced the UMAP visualization of Figure 4A).
Figure 4. UMAP scatter plots and example word clouds from TopEx results for tweets from (A) March 2020 and (B) December 2020. Scatter plots were generated in R from the coordinate text file output by TopEx.
Current TopEx use cases
| Text type | Use case |
|---|---|
| Reflective medical writings | Identify common challenges experienced by medical students (Olex |
| COVID-19 discharge summaries | Identify key phrases and terms associated with COVID-19 patients to develop better rule-based queries using an in-house NLP system at VCU Massey Cancer Center. |
| Government COVID-19 communications | Identify how mitigation strategies implemented in South Korea changed over time during the first wave of the COVID-19 pandemic [poster presented at 43rd Annual Meeting and Scientific Sessions of the Society of Behavioral Medicine ( |
| COVID-19 tweets | Assess changing topics of community interest during the pandemic (this manuscript). |
| Medical student narrative assessments | Identify common themes in positive and negative feedback from formal assessments for medical students (two posters accepted to AAMC 2022 Annual Meeting and manuscript in preparation). |
User survey questions with numerical ratings and the average score for TopEx during the first (BioCreative, n = 7) and second (post-BioCreative, n = 6) rounds of evaluation with and without one outlier each (n = 6 and n = 5, respectively)
| Question | Rating rubric | TopEx Score from BioCreative (with outlier) | TopEx Score Post-BioCreative (with outlier) |
|---|---|---|---|
| I think that I would like to use this system frequently. | 1 = strongly disagree, 5 = strongly agree | 3.2 (2.9) | 3 (2.8) |
| I found the system unnecessarily complex. | 1 = strongly disagree, 5 = strongly agree | 2.3 (2.4) | 2 (2.3) |
| I thought the system was easy to use. | 1 = strongly disagree, 5 = strongly agree | 3 (3) | 4.2 (3.8) |
| I think I would need support from the developer to be able to use this system. | 1 = strongly disagree, 5 = strongly agree | 3.3 (3.1) | 2.8 (3.2) |
| I found the various functions of the system well integrated. | 1 = strongly disagree, 5 = strongly agree | 3.8 (3.7) | 3.8 (3.5) |
| I thought there was too much inconsistency in this system. | 1 = strongly disagree, 5 = strongly agree | 2.3 (2.4) | 1.6 (2) |
| I would imagine that most people would learn to use this system very quickly. | 1 = strongly disagree, 5 = strongly agree | 3.2 (3.1) | Not asked |
| I found the system very cumbersome to use. | 1 = strongly disagree, 5 = strongly agree | 2.8 (2.9) | 1.8 (2.2) |
| The system has met my expectations. | 1 = strongly disagree, 5 = strongly agree | 3.2 (3.1) | 4 (3.5) |
| I felt very confident using the system. | 1 = strongly disagree, 5 = strongly agree | 3.5 (3.1) | 3.6 (3.2) |
| I needed to learn a lot of things before I could get going with the system. | 1 = strongly disagree, 5 = strongly agree | 3.3 (3.4) | 2.8 (3) |
| How easy was it to format and input data into this tool? | 1 = not at all easy | 3 (3) | 3.2 (3.2) |
| Please rate your overall impression with the system. | 1 = very negative | 3.3 (3.3) | 3.6 (3.2) |
| How likely is it that you would recommend this system to a colleague performing COVID-19 related research? | 1 = not at all likely | 6 (6) | 7.8 (6.8) |
Suggested use cases for TopEx from the BioCreative feedback
| Text type | Use case |
|---|---|
| PubMed abstracts | Identify main themes in a set of queried abstracts from PubMed. |
| Grant summaries | Identify topics addressed in a set of grants that need to be assigned to reviewers. |
| Publications | Identify thematic gene lists for manual curation from a collection of publications. |
| Interview transcripts | Analysis of transcripts of interviews in social behavioral work for common themes. |
| Open-ended survey/blog responses | Assessing themes or topics addressed in open-ended survey responses or topic-focused blog posts. |