Shalani Athukorala, Wathsala Mohotti.
Abstract
Social media platforms such as Twitter connect billions of people by letting them exchange thoughts through short-text communication. Topic modelling is a widely used technique for analysing short texts. However, discovering topic clusters in short-text collections is difficult for distance-based, density-based and dimensionality reduction-based methods, because the high dimensionality and short length of the texts produce extremely sparse text representation matrices. We propose a 'neighbourhood-based assistance'-driven non-negative matrix factorization (NMF) method that handles high-dimensional, sparse short-text representations effectively through lower-dimensional projection. We use NMF, which aligns with the natural non-negativity of text data, coupled with symmetric document affinity information to identify the topic distribution in short texts. Neighbourhood information among documents is captured using Jaccard similarity to compensate for the information loss incurred in the higher-to-lower-dimensional projection. Experimental results on Twitter data sets show that the proposed approach attains higher accuracy than state-of-the-art methods in quantitative evaluation, while qualitative analysis with case studies validates its ability to generate meaningful topic clusters.
Keywords: Neighbourhood; Non-negative matrix factorization; Short texts; Text mining; Topic modelling; Twitter
Year: 2022 PMID: 35911485 PMCID: PMC9309003 DOI: 10.1007/s13278-022-00898-5
Source DB: PubMed Journal: Soc Netw Anal Min
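The abstract's core idea, pairing a document-term matrix with a symmetric Jaccard-based document affinity matrix and factorizing with NMF, can be illustrated with a minimal sketch. This is not the paper's NaNMF objective, which is not given here. As an assumption for illustration only, the two matrices are concatenated column-wise so that a single document factor `W` is shared, and plain multiplicative-update NMF is applied. The toy matrix `X`, the weight `alpha`, and the function names are all hypothetical.

```python
import numpy as np

def jaccard_affinity(X):
    """Pairwise Jaccard similarity between the rows (documents) of X."""
    B = (X > 0).astype(float)                 # binary term-presence matrix
    inter = B @ B.T                           # number of shared terms per pair
    sizes = B.sum(axis=1)                     # distinct terms per document
    union = sizes[:, None] + sizes[None, :] - inter
    # Guard against empty documents (union == 0) by leaving those entries at 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Plain NMF, A ~ W @ H, via multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term counts: documents 0-2 and 3-5 use disjoint vocabularies.
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

S = jaccard_affinity(X)          # symmetric document affinity matrix
alpha = 1.0                      # hypothetical weight on the affinity term
A = np.hstack([X, alpha * S])    # assumption: share W across X and S
W, H = nmf(A, k=2)
labels = W.argmax(axis=1)        # dominant topic per document
```

Each row of `W` gives a document's loading on the `k` topics, and `labels` takes the strongest one as a cluster assignment. On real short-text data a sparse matrix representation would be preferable to the dense arrays used here.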
Fig. 1 Overview of the proposed NaNMF algorithm
Data set properties
| Data set | Size | Vocabulary size | Average length | Clusters | Density |
|---|---|---|---|---|---|
| Cancer | 13,002 | 4552 | 15.598 | 5 | 0.002105 |
| Health | 12,101 | 4683 | 15.731 | 4 | 0.002101 |
| Sports | 13,946 | 5091 | 14.992 | 8 | 0.001859 |
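The Density column is not defined in this record. One common reading, assumed here, is the fraction of non-zero entries in the document-term matrix, with a term counted once per document regardless of how often it repeats. The sketch below computes that quantity for a hypothetical tokenized corpus; the function name and the sample documents are illustrative only.

```python
def matrix_density(docs):
    """Fraction of non-zero cells in the implied document-term matrix.

    Assumption: a cell is non-zero when the term occurs at least once
    in the document, so repeated tokens are counted once.
    """
    vocab = {token for doc in docs for token in doc}
    nonzero = sum(len(set(doc)) for doc in docs)
    return nonzero / (len(docs) * len(vocab))

# Hypothetical tokenized tweets (not taken from the paper's data sets).
docs = [
    ["flu", "vaccine", "flu"],
    ["cancer", "screening"],
    ["vaccine", "trial"],
]
density = matrix_density(docs)
```

Densities of around 0.002, as in the table, mean that roughly two cells in every thousand are non-zero, which is the sparsity the abstract refers to.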
Results against existing baselines
| Baseline | Metric | Cancer | Health | Sports | Average |
|---|---|---|---|---|---|
| LDA | F1 | 0.26 | 0.28 | 0.23 | 0.26 |
| LDA | NMI | 0.07 | 0.04 | 0.16 | 0.09 |
| LDA | Topic coherence | 0.56 | 0.55 | 0.61 | 0.57 |
| NMF | F1 | 0.79 | 0.84 | 0.96 | 0.86 |
| NMF | NMI | 0.74 | 0.80 | 0.95 | 0.83 |
| NMF | Topic coherence | 0.64 | 0.63 | 0.69 | 0.65 |
| Biterm | F1 | 0.66 | 0.84 | 0.75 | 0.75 |
| Biterm | NMI | 0.57 | 0.75 | 0.74 | 0.69 |
| Biterm | Topic coherence | 0.45 | 0.47 | 0.53 | 0.48 |
| SeaNMF | F1 | 0.97 | 0.68 | 0.88 | 0.84 |
| SeaNMF | NMI | 0.95 | 0.60 | 0.88 | 0.81 |
| SeaNMF | Topic coherence | 0.75 | 0.83 | 0.70 | 0.76 |
| AVITM | F1 | 0.20 | 0.26 | 0.13 | 0.20 |
| AVITM | NMI | 0.00 | 0.00 | 0.13 | 0.04 |
| AVITM | Topic coherence | 0.92 | 0.93 | 0.94 | 0.93 |
| NaNMF | F1 | | | | |
| NaNMF | NMI | | | | |
| NaNMF | Topic coherence | | | | |
The performance of NaNMF (our proposed method) is given in bold
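For readers unfamiliar with the NMI metric reported above, the following is a minimal sketch of normalized mutual information between ground-truth classes and predicted clusters. The normalization used in the paper is not stated in this record; the arithmetic mean of the two entropies is assumed here, and the function names are illustrative.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(true, pred):
    """Mutual information divided by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants exist.
    """
    n = len(true)
    joint = Counter(zip(true, pred))
    ct, cp = Counter(true), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (ct[t] * cp[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(true) + entropy(pred)) / 2
    # Both assignments are a single cluster: treat them as identical.
    return mi / denom if denom > 0 else 1.0

perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, labels renamed
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # statistically independent
```

Because NMI compares partitions rather than label names, a clustering that recovers the true groups under different cluster identifiers still scores at the top of the range.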
Results against different similarity measures
| Jaccard(NaNMF) | Cosine | Sigmoid | Word2Vec | |||||
|---|---|---|---|---|---|---|---|---|
| F1 | NMI | F1 | NMI | F1 | NMI | F1 | NMI | |
| Cancer | 0.97 | 0.95 | 0.98 | 0.95 | 0.98 | 0.96 | ||
| Health | 0.97 | 0.94 | 0.98 | 0.95 | 0.99 | 0.97 | ||
| Sports | 0.89 | 0.90 | 0.97 | 0.96 | 0.97 | 0.96 | ||
| Average | 0.95 | 0.93 | 0.97 | 0.95 | 0.98 | 0.96 | ||
The performance of NaNMF (our proposed method) is given in bold
Fig. 2 F1-score and NMI for different document representation methods
Experimental results with special pre-processing and standard text pre-processing
| Data set | Standard F1 | Standard NMI | Special (NaNMF) F1 | Special (NaNMF) NMI |
|---|---|---|---|---|
| Cancer | 0.69 | 0.63 | 0.98 | 0.96 |
| Health | 0.83 | 0.77 | 0.99 | 0.97 |
| Sports | 0.89 | 0.88 | 0.97 | 0.96 |
| Average | 0.80 | 0.76 | 0.98 | 0.96 |
Fig. 3 Convergence of the NaNMF algorithm
Fig. 4 Scalability of the NaNMF algorithm
Fig. 5 Word clouds of case study I
Fig. 6 Word clouds of case study II