
Evaluation of clustering and topic modeling methods over health-related tweets and emails.

Juan Antonio Lossio-Ventura, Sergio Gonzales, Juandiego Morzan, Hugo Alatrista-Salas, Tina Hernandez-Boussard, Jiang Bian.

Abstract

BACKGROUND: The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources that shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails are often composed of short texts, and existing topic modeling and clustering approaches perform suboptimally on such texts. Moreover, research on health-related short texts using these methods has become difficult to reproduce and benchmark, partly because a detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts is lacking.
METHODS: We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external indices (i.e., comparing the results of a cluster analysis to externally provided class labels).
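To make the idea of an external index concrete, the following is a minimal sketch of two such measures, purity and the Rand index, computed on toy data. The abstract does not name the seven indices it used, so the choice of these two is an assumption for illustration, as are the function names, the labels, and the cluster assignments.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for label, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def rand_index(true_labels, cluster_ids):
    """Fraction of document pairs on which the clustering and the labels agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_ids[i] == cluster_ids[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical toy data: six short texts with known classes, clustered with k = 2.
true_labels = ["hpv", "hpv", "hpv", "hpv", "lynch", "lynch"]
cluster_ids = [0, 0, 0, 1, 1, 1]

print(purity(true_labels, cluster_ids))
print(rand_index(true_labels, cluster_ids))
```

On this toy input, purity is 5/6 (about 0.83) and the Rand index is 10/15 (about 0.67). Both measures need the true labels, which is what distinguishes external indices from internal ones.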
RESULTS: Overall, for numbers of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, across all tweets (N = 286,971; HPV represents 94.6% of tweets and Lynch syndrome represents 5.4%), for k = 2, most of the methods could reproduce this initial class distribution. However, we found that model performance varies with the source of data and with hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets.
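For single-label data, the Hamming loss reduces to the fraction of documents whose predicted label differs from the true one, so lower values are better. The sketch below shows one way to compute it for a clustering. The abstract does not say how clusters were mapped to class labels, so the majority-vote mapping here is an assumption, and the data and function names are hypothetical.

```python
from collections import Counter


def majority_label_map(true_labels, cluster_ids):
    """Map each cluster id to the most frequent true label among its members."""
    counts = {}
    for label, cluster in zip(true_labels, cluster_ids):
        counts.setdefault(cluster, Counter())[label] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}


def hamming_loss(true_labels, predicted_labels):
    """Fraction of documents whose predicted label differs from the true label."""
    mismatches = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return mismatches / len(true_labels)


# Hypothetical imbalanced toy data: 8 "hpv" texts and 2 "lynch" texts, k = 2.
true_labels = ["hpv"] * 8 + ["lynch"] * 2
cluster_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

mapping = majority_label_map(true_labels, cluster_ids)
predicted = [mapping[c] for c in cluster_ids]
loss = hamming_loss(true_labels, predicted)
print(mapping, loss)
```

Here one "hpv" text lands in the cluster dominated by "lynch" texts, so the loss is 1/10. With a strongly imbalanced dataset such as the tweets described above, a low loss can be reached by assigning almost everything to the dominant class, which is one reason to read it alongside other indices.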
CONCLUSIONS: Researchers hoping to group or classify health-related short-text data need to select the topic modeling and clustering methods best suited to their specific research questions. Therefore, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggests that researchers can improve their analysis of model performance by using a variety of metrics, since there is no single best metric.
Copyright © 2021 Elsevier B.V. All rights reserved.

Keywords:  Clustering; External cluster indices; Internal cluster indices; Natural language processing; Topic modeling

Year:  2021        PMID: 34127235      PMCID: PMC9040385          DOI: 10.1016/j.artmed.2021.102096

Source DB:  PubMed          Journal:  Artif Intell Med        ISSN: 0933-3657            Impact factor:   7.011


