Ileana Scarpino, Chiara Zucco, Rosarina Vallelunga, Francesco Luzza, Mario Cannataro.
Abstract
Through an adequate survey of the history of the disease, Narrative Medicine (NM) aims to support the definition and implementation of an effective, appropriate, and shared treatment path. The present study compares different topic modeling techniques, namely Latent Dirichlet Allocation (LDA) and topic modeling based on the BERT transformer (BERTopic), to extract meaningful insights from Italian narratives of the COVID-19 pandemic. In particular, the main focus was to characterize writings on Post-acute Sequelae of COVID-19 (PASC) as opposed to writings by health professionals and general reflections on COVID-19 (non-PASC), modeled as a semi-supervised task. The results show that the BERTopic-based approach outperforms the LDA-based approach, grouping 97.26% of the analyzed documents in the same cluster and reaching an overall accuracy of 91.97%.
Keywords: BERTopic; LDA; narrative medicine; text mining; topic modeling
Year: 2022 PMID: 36134915 PMCID: PMC9496775 DOI: 10.3390/biotech11030041
Source DB: PubMed Journal: BioTech (Basel) ISSN: 2673-6284
Figure 1. Histogram of word counts per document for the entire dataset.
Figure 2. Histogram of word counts per document, by class.
Figure 3. Word cloud showing the most frequent tokens in the documents after preprocessing. Tokens with the largest font size are the most frequent.
Performance of LDA with respect to the number of topics. The best performance is reached with 3 topics.
| Number of Topics | Coherence Score | Perplexity |
|---|---|---|
| 3 | | |
| 5 | 0.332 | −7.337 |
| 7 | 0.299 | −7.398 |
| 10 | 0.346 | −7.5239 |
Figure 4. Bar plot of keywords and weights for each topic modeled through the LDA approach.
Figure 5. Bar chart of the distribution of topics modeled through the LDA approach with respect to the PASC and non-PASC classes.
Figure 6. Bar plot of keywords and weights for each topic modeled through the BERTopic approach. Topic −1 is a virtual topic that collects documents not assigned to any other topic.
Figure 7. Bar chart of the distribution of topics modeled through the BERTopic approach with respect to the PASC and non-PASC classes.
Distribution of LDA topics with respect to the considered classes.
| Assigned LDA Topic | Class | Document Count |
|---|---|---|
| 1 | non-PASC | 104 |
| 1 | PASC | 2 |
| 2 | non-PASC | 6 |
| 2 | PASC | 57 |
| 3 | non-PASC | 4 |
| 3 | PASC | 14 |
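
The counts in the table above can be summarized per topic. The following is a minimal sketch, not the authors' code: it copies the counts from the table and reports, for each LDA topic, the majority class and the share of the topic's documents that belong to it. The names `lda_counts` and `topic_purity` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Counts copied from the LDA table above: (topic, class) -> document count.
lda_counts = {
    (1, "non-PASC"): 104, (1, "PASC"): 2,
    (2, "non-PASC"): 6,   (2, "PASC"): 57,
    (3, "non-PASC"): 4,   (3, "PASC"): 14,
}

def topic_purity(counts):
    """Return {topic: (majority_class, share of the topic's documents)}."""
    per_topic = defaultdict(dict)
    for (topic, label), n in counts.items():
        per_topic[topic][label] = n
    result = {}
    for topic, by_class in per_topic.items():
        majority = max(by_class, key=by_class.get)
        result[topic] = (majority, by_class[majority] / sum(by_class.values()))
    return result

for topic, (label, share) in sorted(topic_purity(lda_counts).items()):
    print(f"topic {topic}: mostly {label} ({share:.2%})")
```

Under this reading, PASC writings are split across topics 2 and 3 (57 and 14 documents), rather than being concentrated in a single topic.
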
Distribution of BERTopic topics with respect to the considered classes.
| Assigned BERTopic | Class | Document Count |
|---|---|---|
| −1 | non-PASC | 5 |
| −1 | PASC | 1 |
| 1 | non-PASC | 101 |
| 1 | PASC | 1 |
| 2 | non-PASC | 8 |
| 2 | PASC | 71 |
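
The two figures quoted in the abstract appear to follow from the table above. The sketch below is one plausible reconstruction, not the authors' stated procedure. It assumes that topic 1 is read as non-PASC, topic 2 as PASC, and that documents in the outlier topic −1 count as errors. The helper names `class_concentration` and `accuracy` are hypothetical. Under these assumptions, 71 of the 73 PASC documents (97.26%) fall into topic 2, and 172 of 187 documents (about 91.98%) are assigned correctly, which agrees with the reported 91.97% up to rounding.

```python
# Counts copied from the BERTopic table above: (topic, class) -> document count.
bertopic_counts = {
    (-1, "non-PASC"): 5,  (-1, "PASC"): 1,
    (1, "non-PASC"): 101, (1, "PASC"): 1,
    (2, "non-PASC"): 8,   (2, "PASC"): 71,
}

def class_concentration(counts, label):
    """Largest share of `label` documents that falls into a single topic."""
    by_topic = {t: n for (t, c), n in counts.items() if c == label}
    return max(by_topic.values()) / sum(by_topic.values())

def accuracy(counts, topic_to_class):
    """Fraction of documents whose topic maps to their own class.

    Topics missing from the mapping (here the outlier topic -1)
    are counted as errors; this is an assumption of the sketch.
    """
    correct = sum(n for (t, c), n in counts.items()
                  if topic_to_class.get(t) == c)
    return correct / sum(counts.values())

pasc_share = class_concentration(bertopic_counts, "PASC")
acc = accuracy(bertopic_counts, {1: "non-PASC", 2: "PASC"})
print(f"PASC documents in one topic: {pasc_share:.2%}")
print(f"Overall accuracy: {acc:.2%}")
```
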