| Literature DB >> 35224484 |
Yang Ren, Dezhi Wu, Avineet Singh, Erin Kasson, Ming Huang, Patricia Cavazos-Rehg.
Abstract
There are increasingly strict regulations surrounding the purchase and use of combustible tobacco products (i.e., cigarettes); simultaneously, the use of other tobacco products, including e-cigarettes (i.e., vaping products), has dramatically increased. However, public attitudes toward vaping vary widely, and the health effects of vaping are still largely unknown. As a popular social media platform, Twitter contains rich information shared by users about their behaviors and experiences, including opinions on vaping. Manually identifying vaping-related tweets to source useful information is very challenging. In the current study, we developed a detection model to accurately identify vaping-related tweets using machine learning and deep learning methods. Specifically, we applied seven popular machine learning and deep learning algorithms, including Naïve Bayes, Support Vector Machine, Random Forest, XGBoost, Multilayer Perceptron, Transformer Neural Network, and stacking and voting ensemble models, to build our customized classification model. We extracted a set of sample tweets during an outbreak of e-cigarette or vaping-related lung injury (EVALI) in 2019 and created an annotated corpus to train and evaluate these models. After comparing the performance of each model, we found that the stacking ensemble model achieved the highest performance, with an F1-score of 0.97. All models achieved an F1-score of 0.90 or higher after hyperparameter tuning, and the ensemble learning model had the best average performance. Our study findings provide informative guidelines and practical implications for the automated detection of themed social media data for public opinion and health surveillance purposes.
Keywords: EVALI; Twitter; classification; deep learning; detection; e-cigarette; machine learning; vaping
Year: 2022 PMID: 35224484 PMCID: PMC8866955 DOI: 10.3389/fdata.2022.770585
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
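The stacking ensemble described in the abstract can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration, not the paper's exact configuration: the TF-IDF features, the choice of base learners (XGBoost, the MLP, and the Transformer are omitted to keep the sketch dependency-free), the logistic-regression meta-learner, and the toy tweets are all assumptions.

```python
# Hedged sketch of a stacking ensemble for vaping-tweet detection,
# assuming TF-IDF features and a subset of the classifiers named in the abstract.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Base learners whose out-of-fold predictions feed the meta-learner.
base_learners = [
    ("nb", MultinomialNB()),
    ("svm", LinearSVC()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
]

# Stacking: a logistic-regression meta-learner combines the base predictions.
# cv=2 only because the toy dataset below is tiny.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=2)

model = make_pipeline(TfidfVectorizer(), stack)

# Illustrative toy corpus: 1 = vaping-related, 0 = not.
tweets = ["quit vaping last month", "great weather today",
          "my new vape juice flavor", "watching the game tonight"]
labels = [1, 0, 1, 0]
model.fit(tweets, labels)
print(model.predict(["trying a new vape pen"]))
```

In practice the meta-learner is trained on cross-validated predictions of the base learners, which is what lets the stack outperform any single classifier, as the paper's F1 of 0.97 suggests.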
Summary and cross-comparison of vaping-related Twitter studies.

| Study | Research focus | Data collection | Annotation | Classifiers | Parameter settings and validation |
|---|---|---|---|---|---|
| Adhikari et al. | Public opinion analysis about cannabis and JUUL on tweets | Dj: 597,000 tweets from 2016 to 2018; Dc: 3.28M tweets from 2014 to 2018 | 500 tweets annotated from Dj and 500 tweets annotated from Dc | Logistic Regression (LR), Support Vector Machine (SVM), LSTM-based Deep Neural Network (DNN) | Hyperparameters were tuned for each classifier |
| Myslín et al. | Detection of tobacco-relevant tweets and positive/negative sentiment | 7,362 tweets at 15-day intervals from December 2011 to July 2012 by keywords | Each of the 7,362 tweets was manually classified | Naïve Bayes (NB), K-Nearest Neighbors (K-NN), SVM | Rainbow toolkit; 10-fold cross-validation |
| Martinez et al. | Public opinion about vaping investigated using sentiment analysis | 973 tweets selected from 193,051 geocoded tweets within the U.S., collected between October 28, 2015 and February 6, 2016 by keywords | 100 tweets were manually coded by two coders; other tweets were single-coded according to the codebook classifications | | |
| Aphinyanaphongs et al. | Detection of vaping use and of vaping-for-smoking-cessation tweets | 13,146 tweets were selected from 228,145 tweets collected between January 2010 and January 2015 by keywords | Each of the 13,146 selected tweets was labeled by the classifiers | NB, SVM, LR, Random Forests (RF) | Parameters Settings: |
| Han and Kavuluru | Detection of e-cigarette marketing tweets and theme analysis | 1,000 tweets were selected from 1,166,494 tweets obtained from April 2015 to June 2016 by keywords | Both authors independently annotated the 1,000 tweets | SVM, LR, Convolutional Neural Network (CNN) | Ten such models were run for each classifier on 10 different 80-20% train-test splits of the dataset |
| Benson et al. | Detection and sentiment analysis of JUUL tweets by adolescents and young adults | 4,000 tweets were selected from 11,556 unique tweets containing a JUUL-related keyword | 4,000 tweets manually annotated for JUUL-related themes of use and sentiment | LR, NB, RF | Grid search was applied to optimize hyperparameters; 10-fold cross-validation |
| Visweswaran et al. | Detection of relevant and commercial vaping-related tweets, and sentiment analysis | 4,000 tweets were selected from 810,600 tweets extracted from August 2018 to October 2018 by vaping-related keywords | Each of the 4,000 tweets was manually annotated | LR, RF, SVM, NB, CNN, LSTM, LSTM-CNN, BiLSTM | Default settings were used for the parameters in LR, RF, and SVM |
Monthly distribution of tweets in the annotated corpus.
| Month | Vaping-related | Non-vaping-related | Total |
|---|---|---|---|
| July | 498 | 495 | 993 |
| August | 499 | 502 | 1,001 |
| September | 509 | 467 | 976 |
| Total | 1,506 | 1,464 | 2,970 |
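The month-based hold-out splits reported in the performance tables below (e.g., train on months 7 and 8, test on month 9) can be sketched with a pandas DataFrame. The column names and toy rows here are illustrative assumptions, not the paper's data.

```python
# Minimal sketch of a month-based hold-out split over an annotated corpus,
# assuming hypothetical "text", "label", and "month" columns.
import pandas as pd

corpus = pd.DataFrame({
    "text":  ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e", "tweet f"],
    "label": [1, 0, 1, 0, 1, 0],
    "month": [7, 7, 8, 8, 9, 9],   # July, August, September
})

# Train on two months, test on the held-out third month.
train_months, test_month = [7, 8], 9
train = corpus[corpus["month"].isin(train_months)]
test = corpus[corpus["month"] == test_month]
print(len(train), len(test))  # 4 2
```

Percentage-based splits (e.g., 90%/10%) would instead use a random partition of the whole corpus, such as scikit-learn's `train_test_split`.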
Figure 1. The structure of our ensemble learning model.
Best performance (F1-score) achieved for each classifier.
| Classifier | F1-score | Precision | Recall | Accuracy | Training set | Testing set |
|---|---|---|---|---|---|---|
| Naïve Bayes | 0.94 | 0.93 | 0.96 | 0.95 | 90% | 10% |
| SVM | 0.95 | 0.98 | 0.92 | 0.95 | 7, 8 (Month) | 9 (Month) |
| | 0.95 | 0.96 | 0.94 | 0.95 | 90% | 10% |
| | 0.94 | 0.96 | 0.95 | 0.95 | 70% | 30% |
| Random Forest | 0.95/0.96 | 0.96/0.95/0.94/0.93 | 0.97/0.96 | 0.96 | 90%/70% | 10%/30% |
| XGBoost | 0.91/0.92 | 0.94/0.93/0.91 | 0.91/0.92 | 0.92 | All month- and percentage-based training sets except 7, 8 (Month) | All month- and percentage-based testing sets except 9 (Month) |
| Ensemble - Stacking | 0.97 | 0.97 | 0.97 | 0.97 | All month- and percentage-based training sets | All month- and percentage-based testing sets |
| MLP | 0.94 | 0.94 | 0.94 | 0.94 | 7, 9 (Month) | 8 (Month) |
| | 0.94 | 0.94 | 0.94 | 0.94 | 50% | 50% |
| Transformer | 0.96 | 0.96 | 0.96 | 0.96 | 7, 8 (Month) | 9 (Month) |
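The metric columns in the table above can be computed with scikit-learn's standard evaluation functions. The labels and predictions below are illustrative, not the paper's results.

```python
# Sketch of computing the accuracy/precision/recall/F1 metrics reported above,
# on made-up binary labels (1 = vaping-related, 0 = not).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"acc={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# acc=0.75 precision=0.75 recall=0.75 f1=0.75
```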
Figure 2. Performance comparison of all classifiers across month-based training-testing combinations.
Figure 3. Performance comparison of all classifiers across percentage-based training-testing combinations.
Figure 4. Top 20 most important features from the random forest classifier.
Best performance achieved for each classifier on the evaluation dataset.
| Classifier | F1-score | Precision | Recall | Accuracy | Training set | Testing set |
|---|---|---|---|---|---|---|
| Naïve Bayes | 0.90 | 0.95 | 0.86 | 0.90 | 7, 8 (Month) | 9 (Month) |
| | 0.90 | 0.92 | 0.88 | 0.90 | 8, 9 (Month) | 7 (Month) |
| SVM | 0.95 | 0.96 | 0.94 | 0.95 | 7, 8 (Month) | 9 (Month) |
| | 0.95 | 0.96 | 0.94 | 0.95 | 90% | 10% |
| Random Forest | 0.94 | 0.94 | 0.94 | 0.94 | 7, 8 (Month) | 9 (Month) |
| | 0.94 | 0.94 | 0.93 | 0.94 | 8, 9 (Month) | 7 (Month) |
| XGBoost | 0.94 | 0.94 | 0.94 | 0.94 | 7, 8 (Month) | 9 (Month) |
| Ensemble - Stacking | 0.95 | 0.97 | 0.97 | 0.97 | 7, 9 (Month) | 8 (Month) |
| MLP | 0.95 | 0.95 | 0.95 | 0.95 | 7, 8 (Month) | 9 (Month) |
| | 0.95 | 0.95 | 0.95 | 0.95 | 90% | 10% |