Muzamil Malik1, Waqar Aslam1, Zahid Aslam1, Abdullah Alharbi2, Bader Alouffi3, Hafiz Tayyab Rauf4.
Abstract
Social media influences people's lives. It is an essential channel for sharing news, raising awareness, detecting events, and gauging people's interests, and it covers a wide range of topics and events. Extensive work has been published on capturing interesting events and insights from datasets, and many techniques have been presented to detect events from social media networks such as Twitter. In text mining, most work is done on a specific dataset, so new datasets are needed to analyse the performance and generic nature of Topic Detection and Tracking methods. Therefore, this paper publishes a dataset of a real-life event, the Oscars 2018, gathered from Twitter, and compares soft frequent pattern mining (SFPM), singular value decomposition and k-means (K-SVD), feature-pivot (Feat-p), document-pivot (Doc-p), and latent Dirichlet allocation (LDA). The dataset contains 2,160,738 tweets collected using a set of seed words. Only English tweets are considered. All of the methods applied in this paper are unsupervised. This area needs to be explored on different datasets. The methods are evaluated on the Oscars 2018 dataset using keyword precision (K-Prec), keyword recall (K-Rec), and topic recall (T-Rec) for detecting events of greater interest. The highest K-Prec, K-Rec, and T-Rec were achieved by SFPM, but they started to decrease as the number of clusters increased. The lowest performance was achieved by Feat-p in terms of all three metrics. Experiments on the Oscars 2018 dataset demonstrated that all the methods are generic in nature and produce meaningful clusters.
Year: 2022 PMID: 35655515 PMCID: PMC9155953 DOI: 10.1155/2022/5980043
Source DB: PubMed Journal: Comput Intell Neurosci
Description of frequently used abbreviations.
| Abbreviations | Description |
|---|---|
| Doc-p | Document-pivot |
| Feat-p | Feature-pivot |
| K-SVD | Singular value decomposition and k-means |
| LDA | Latent Dirichlet allocation |
| SFPM | Soft frequent pattern mining |
| K-Prec | Keyword precision |
| K-Rec | Keyword recall |
| T-Rec | Topic recall |
| Tf-idf | Term frequency-inverse document frequency |
Figure 1. LDA graph model.
List of keywords that were manually removed because they were not useful in generating meaningful clusters. Insightful observations can aid in the extraction of useful data.
| People | Guarding | Red | Pay | Much |
| Black | Time | Carpet | Oh | Movies |
| Night | Things | Need | List | Well |
| Hey | Watch | Years | See | Still |
| Awards | Full | Tonight | Will | Let |
| Great | Man | Check | Live | Sign |
| Make | Yes | Double | Many | Room |
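The manual keyword removal described above can be sketched as a simple filtering step. This is a minimal illustration, not the authors' pre-processing code: the function name `clean_tweet`, the letters-only tokenizer, and the sample tweet are assumptions, and only a handful of the listed keywords are included.

```python
import re

# A few of the manually removed keywords from the table above (lowercased).
# The full list would be added the same way.
REMOVED_KEYWORDS = {
    "people", "red", "carpet", "night", "awards",
    "tonight", "watch", "live",
}


def clean_tweet(text, removed=REMOVED_KEYWORDS):
    """Lowercase the tweet, keep alphabetic tokens, drop removed keywords.

    The regex tokenizer is an assumption made for this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in removed]


# Hypothetical tweet, used only to show the effect of the filter.
print(clean_tweet("Watch the red carpet live tonight: Coco wins!"))
```

Filtering before clustering keeps very frequent but uninformative words from dominating the generated topics.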
Figure 2. Word cloud after pre-processing. It illustrates the nature of the dataset used.
Figure 3. Frequency graph of the Oscars 2018 dataset by dates. The real-life event was held on 4th March 2018.
Figure 4. Twitter activity of the Oscars 2018 event.
A list of ground truths collected from news headlines for the event, the Oscars 2018.
| Story | Keywords |
|---|---|
| Shape of water won the best picture award | Shape, water, best, picture, award |
| Frances McDormand won Oscar for best actress | Frances, McDormand, won, oscar, best, actress |
| Guillermo Toro accepts best director for Shape of water | Guillermo, Toro, best, director, Shape, water |
| Michael Keegan gave a happy reaction to Jordan Peele's Oscar win | Michael, Keegan, happy, reaction, Jordan, Peele |
| Meryl Streep dressed fairy godmother Shrek | Meryl, Streep, dress, fairy, godmother, Shrek |
| Gary Oldman best actor for darkest hour | Gary, Oldman, best, actor, darkest, hour |
| Coco won award for best animated feature film | Coco, won, best, animated, feature, film |
| Jimmy Kimmel talked on women harassment | Jimmy, Kimmel, talked, women, harassment |
| McDormand speech powerful words “inclusion rider” | McDormand, speech, powerful, words, inclusion, rider |
| Dunkirk won three sound and editing Oscars | Dunkirk, won, three, sound, editing, Oscars |
| Kobe Bryant acceptance speech | Kobe, Bryant, acceptance, speech |
| Police investigate theft of $150000 Oscars dress won by Lupita Nyongo | Police, investigate, theft, Oscars, dress, won, Lupita, Nyongo |
Figure 5. A graphical representation of the Oscars dataset using K-SVD for 5, 10, 15, and 20 clusters. When k = 5 and k = 10, the generated clusters are well separated and distinct; however, when k = 15 and k = 20, the clusters overlap. (a) Clusters = 5. (b) Clusters = 10. (c) Clusters = 15. (d) Clusters = 20.
Topic recall, keyword precision, and keyword recall of all the methods.
| Methods | Number of topics | Topic recall | Keyword precision | Keyword recall |
|---|---|---|---|---|
| LDA | 5 | 0.6 | 1 | 0.6 |
| K-SVD | 5 | 0.8 | 1 | 1 |
| Doc-p | 5 | 1 | 1 | 1 |
| Feat-p | 5 | 0.6 | 0.8 | 0.6 |
| SFPM | 5 | 1 | 1 | 1 |
| LDA | 10 | 0.8 | 0.7 | 0.7 |
| K-SVD | 10 | 0.8 | 0.8 | 0.8 |
| Doc-p | 10 | 0.8 | 1 | 0.8 |
| Feat-p | 10 | 0.7 | 0.7 | 0.6 |
| SFPM | 10 | 0.8 | 0.9 | 0.8 |
| LDA | 15 | 0.5 | 0.7 | 0.5 |
| K-SVD | 15 | 0.7 | 0.8 | 0.7 |
| Doc-p | 15 | 0.5 | 0.9 | 0.5 |
| Feat-p | 15 | 0.5 | 0.7 | 0.5 |
| SFPM | 15 | 0.8 | 0.9 | 0.8 |
| LDA | 20 | 0.6 | 0.6 | 0.5 |
| K-SVD | 20 | 0.7 | 0.8 | 0.6 |
| Doc-p | 20 | 0.6 | 0.8 | 0.5 |
| Feat-p | 20 | 0.6 | 0.6 | 0.5 |
| SFPM | 20 | 0.8 | 0.9 | 0.8 |
| LDA | 25 | 0.5 | 0.5 | 0.4 |
| K-SVD | 25 | 0.6 | 0.7 | 0.5 |
| Doc-p | 25 | 0.5 | 0.5 | 0.5 |
| Feat-p | 25 | 0.5 | 0.5 | 0.4 |
| SFPM | 25 | 0.7 | 0.7 | 0.6 |
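To make the three metrics in the table above concrete, the following sketch computes topic recall, keyword precision, and keyword recall for a set of detected topics. The exact matching rule is not stated in this text, so the sketch assumes one: a ground-truth topic counts as detected when its best-overlapping detected topic shares at least `min_shared` keywords with it. The function name `evaluate`, the `min_shared` threshold, and the detected topics are all hypothetical; the two ground-truth keyword sets are taken from the ground-truth table (lowercased).

```python
def evaluate(detected, ground_truth, min_shared=3):
    """Return (topic recall, keyword precision, keyword recall).

    Assumed matching rule (not taken from the paper): a ground-truth
    topic is detected if its best-overlapping detected topic shares at
    least `min_shared` keywords with it.
    """
    matched = 0          # ground-truth topics that were detected
    correct = 0          # shared keywords over all matched pairs
    detected_total = 0   # keywords in the matched detected topics
    truth_total = 0      # keywords in the matched ground-truth topics

    for truth in ground_truth:
        best = max(detected, key=lambda d: len(d & truth), default=set())
        shared = best & truth
        if len(shared) >= min_shared:
            matched += 1
            correct += len(shared)
            detected_total += len(best)
            truth_total += len(truth)

    t_rec = matched / len(ground_truth) if ground_truth else 0.0
    k_prec = correct / detected_total if detected_total else 0.0
    k_rec = correct / truth_total if truth_total else 0.0
    return t_rec, k_prec, k_rec


# Two ground-truth topics from the ground-truth table.
ground_truth = [
    {"shape", "water", "best", "picture", "award"},
    {"coco", "won", "best", "animated", "feature", "film"},
]
# Hypothetical detector output, for illustration only.
detected = [
    {"shape", "water", "best", "picture", "oscar"},
    {"kimmel", "monologue", "host"},
]

print(evaluate(detected, ground_truth))
```

Under this rule only the first ground-truth topic is matched, so topic recall is 1 of 2, while keyword precision and recall are both 4 of 5 for the matched pair.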
Figure 6. Topic recall of LDA, K-SVD, Doc-p, Feat-p, and SFPM.
Figure 7. K-prec of LDA, K-SVD, Doc-p, Feat-p, and SFPM.
Figure 8. K-rec of LDA, K-SVD, Doc-p, Feat-p, and SFPM.
Figure 9. Cluster 1 (best picture award Shape of water, as mentioned in Table 3) present in the ground truth topics. It demonstrates the effectiveness of the document-pivot strategy.
Figure 10. Cluster 2 (Meryl Streep dressed fairy godmother Shrek) present in the ground truth topics.
Figure 11. Cluster 3 (best animated feature film Coco) present in the ground truth topics.
Figure 12. Cluster 4 (Meryl Streep, Tiffany Haddish, Margot Robbie, and Jennifer Lawrence beautifully dressed) present in the ground truth topics.
Figure 13. k-means elbow method where k = 10. The elbow approach chooses an optimal value of k from the sum of squared errors (SSE), that is, the sum of squared distances between data points and their assigned cluster centres. We picked the k value at which the SSE curve began to flatten out and an inflection point appeared.
Figure 14. k-means elbow method where k = 15.
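The SSE curve that the elbow method inspects can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' pipeline: the function names, the random initialisation, the fixed iteration count, and the 2-D sample points are all assumptions, and the real experiments operate on tweet vectors rather than hand-made points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, iters=20, seed=0):
    """Run a basic k-means and return the final SSE.

    Centres are initialised by sampling k distinct points (an assumption
    made for this sketch); an empty cluster keeps its previous centre.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[idx].append(p)
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    # SSE: squared distance of every point to its nearest centre.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

# Print SSE for each candidate k; the elbow is where the drop levels off.
for k in range(1, 7):
    print(k, round(kmeans_sse(points, k), 2))
```

Reading the printed values from top to bottom mirrors reading the elbow plot from left to right: the chosen k is the one after which adding clusters yields only small reductions in SSE.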