Shyam Visweswaran, Jason B Colditz, Patrick O'Halloran, Na-Rae Han, Sanya B Taneja, Joel Welling, Kar-Hai Chu, Jaime E Sidani, Brian A Primack.
Abstract
BACKGROUND: Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets.
Keywords: deep learning; infodemiology; infoveillance; machine learning; social media; vaping
Year: 2020 PMID: 32784184 PMCID: PMC7450367 DOI: 10.2196/17478
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. A hierarchical annotation scheme for vaping-related tweets.
Descriptions of labels used for annotating vaping-related tweets.
| Labels | Descriptions |
| Relevant | Is the tweet in English and related to the vaping topic at hand (eg, vape use or users, vaping devices, or products)? |
| Not relevant | Tweets categorized as not relevant were typically not in English or referred specifically to vaping cannabis products, such as “Teens are smoking, vaping and eating cannabis” and “What if I vape weed?” |
| Commercial | Is the tweet selling, marketing, or advertising vaping products? |
| Noncommercial | Includes tweets that demonstrate favorability toward a product but do not directly advocate for purchasing it. |
| Provape | Is vaping associated with positive emotions or contexts? Examples: (1) the tweet author is currently using, has recently used, or intends to use a vape product; (2) the tweet author indicates acceptance of others’ vaping or favorability toward others’ positive perspectives of vaping; (3) the tweet author mentions vaping in association with other positive aspects of society or popular culture (eg, partying, sexuality, popularity, and attractiveness). |
| Not provape | Includes tweets that are antivape, neutral or fact based, or without subjective judgment about positive or acceptable aspects of vaping. |
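The labels above form a cascade: a tweet is first judged for relevance, relevant tweets are then judged as commercial or noncommercial, and only noncommercial tweets receive a sentiment label. The following is a minimal Python sketch of that cascade. The function name `annotate` and its three boolean inputs are hypothetical and introduced only for illustration; they are not part of the original annotation tooling.

```python
from typing import Optional, Tuple


def annotate(relevant: bool,
             commercial: Optional[bool] = None,
             provape: Optional[bool] = None) -> Tuple[str, ...]:
    """Return the labels a tweet receives under the hierarchical scheme.

    Hypothetical helper: each argument stands for one annotator judgment.
    Lower levels are consulted only when the level above allows it.
    """
    if not relevant:
        return ("Not relevant",)
    if commercial is None:
        raise ValueError("relevant tweets need a commercial judgment")
    if commercial:
        return ("Relevant", "Commercial")
    if provape is None:
        raise ValueError("noncommercial tweets need a sentiment judgment")
    sentiment = "Provape" if provape else "Not provape"
    return ("Relevant", "Noncommercial", sentiment)
```

Under this reading, a commercial tweet never carries a sentiment label, which is why each target in the later tables is defined on a smaller subset of tweets than the one before it.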
Description of preprocessing steps and options used in traditional classifiers.
| Preprocessing steps | Descriptions | Options^a |
| placeholder_remove | Remove textual placeholders such as _mention_, _hashtag_, _unicode_, and _url_ | True, false |
| emoji_remove | Remove textual descriptions that denote emojis | True, false |
| negation_expand | Expand negative contractions, for example, “don’t” is expanded to “do not” and “can’t” is expanded to “cannot” | True, false |
| punctuation_remove | Remove all punctuation symbols | True, false |
| digits_remove | Remove all numeric digits (0-9) | True, false |
| negation_mark | Mark words that occur between a negation trigger and a punctuation mark with the NEG prefix | True, false |
| normalize | Reduce to 2 characters all consecutive characters that appear more than twice, for example, “happppy” is reduced to “happy” | True, false |
| stemming | Reduce inflection in words (eg, troubled, troubles) to their root form (eg, trouble) using the Porter Stemmer | True, false |
| stopwords_remove | Remove common words such as “the,” “a,” “on,” “is,” and “all” that are listed in the Natural Language Toolkit English stop words list | True, false |
| lowercase | Change the case of all characters to lowercase | True, false |
^a If the option for a step is set to true, the corresponding preprocessing step will be applied in the preprocessing pipeline; if the option is set to false, the corresponding preprocessing step will be skipped in the pipeline.
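A few of the steps above can be sketched as a toggleable pipeline. This is an illustration only, not the authors' code: the function name `preprocess`, the small contraction table, the order of the steps (taken here from the table order), and the final whitespace cleanup are all assumptions. Steps that depend on external resources (stemming, stop word removal) are left out.

```python
import re
import string

# Illustrative subset of negative contractions; a real pipeline would need more.
NEGATIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not"}


def preprocess(text, *, negation_expand=True, punctuation_remove=True,
               digits_remove=True, normalize=True, lowercase=True):
    """Apply the enabled steps in table order (the order is an assumption)."""
    if negation_expand:
        text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
        for short, full in NEGATIONS.items():
            text = re.sub(r"\b" + re.escape(short) + r"\b", full, text,
                          flags=re.IGNORECASE)
    if punctuation_remove:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if digits_remove:
        text = re.sub(r"[0-9]", "", text)
    if normalize:
        # Any character repeated 3+ times in a row is cut down to 2.
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    if lowercase:
        text = text.lower()
    # Convenience cleanup (not one of the listed steps): collapse whitespace.
    return " ".join(text.split())
```

Because every step is a keyword flag, each true/false combination in the options column corresponds to one call signature, which makes it straightforward to search over preprocessing settings alongside classifier settings.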
Description of training and test data sets.
| Targets | Total number of tweets, n (%) | Number of tweets with positive target, n (%) | Number of tweets with negative target, n (%) |
| Relevance | Total: 4000 (100); Training: 3600 (100); Test: 400 (100) | Relevant. Total: 3011 (75.28); Training: 2709 (75.25); Test: 302 (75.5) | Nonrelevant. Total: 989 (24.72); Training: 891 (24.75); Test: 98 (24.5) |
| Commercial | Total: 3011 (100); Training: 2709 (100); Test: 302 (100) | Noncommercial. Total: 2175 (72.24); Training: 1957 (72.24); Test: 218 (72.2) | Commercial. Total: 836 (27.76); Training: 752 (27.76); Test: 84 (27.8) |
| Sentiment | Total: 2175 (100); Training: 1957 (100); Test: 218 (100) | Provape. Total: 1357 (62.39); Training: 1221 (62.39); Test: 136 (62.4) | Not provape. Total: 818 (37.61); Training: 736 (37.61); Test: 82 (37.6) |
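The counts in this table can be checked mechanically. The short Python sketch below copies the counts from the table and verifies that training and test counts add up to the totals, and that each target is defined on the positive class of the previous one. The dictionary keys and the helper names `is_consistent` and `pct` are hypothetical.

```python
# Counts copied from the table above: (total, training, test).
splits = {
    "relevance": (4000, 3600, 400),
    "relevant": (3011, 2709, 302),
    "nonrelevant": (989, 891, 98),
    "noncommercial": (2175, 1957, 218),
    "commercial": (836, 752, 84),
    "provape": (1357, 1221, 136),
    "not_provape": (818, 736, 82),
}


def is_consistent(total, training, test):
    """Training and test counts should partition the total."""
    return training + test == total


def pct(part, whole):
    """Percentage rounded to 2 decimals, as reported in the table."""
    return round(100 * part / whole, 2)


assert all(is_consistent(*row) for row in splits.values())

# Each level of the hierarchy is built on the positive class of the level above.
assert splits["relevant"][0] + splits["nonrelevant"][0] == splits["relevance"][0]
assert splits["noncommercial"][0] + splits["commercial"][0] == splits["relevant"][0]
assert splits["provape"][0] + splits["not_provape"][0] == splits["noncommercial"][0]
```

For example, `pct(400, 4000)` gives 10.0, so the test set holds one tenth of the annotated tweets, and the class proportions in training and test are nearly identical at every level.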
Description of traditional classifiers and parameter settings used in the experiments (the same parameter settings were used for the following 3 targets: relevance, commercial, and sentiment).
| Classifiers | Scikit-learn functions (version) | Parameter values |
| Logistic regression | sklearn.linear_model.LogisticRegression (0.20.3) | All default values except C=0.001 |
| Random forest | sklearn.ensemble.RandomForestClassifier (0.20.3) | All default values except max_features="sqrt" |
| Support vector machine | sklearn.linear_model.SGDClassifier (0.20.3) | All default values except alpha=0.01 |
| Naive Bayes | sklearn.naive_bayes.MultinomialNB (0.20.3) | All default values |
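The settings in this table translate into scikit-learn constructor calls as sketched below. The dictionary name `classifiers` is hypothetical, and the feature extraction step that would precede these models is not shown. Note that library defaults have changed since version 0.20.3 (for example, the default number of trees in `RandomForestClassifier` and the default solver in `LogisticRegression`), so "all default values" only reproduces the reported setup when that version is pinned.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# One entry per table row; only the non-default parameters are passed.
classifiers = {
    "Logistic regression": LogisticRegression(C=0.001),
    "Random forest": RandomForestClassifier(max_features="sqrt"),
    # SGDClassifier's default loss is "hinge", which gives a linear SVM
    # trained with stochastic gradient descent.
    "Support vector machine": SGDClassifier(alpha=0.01),
    "Naive Bayes": MultinomialNB(),
}
```

Each model exposes the same `fit` and `predict` interface, so the same training loop can be reused for the relevance, commercial, and sentiment targets.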
Description of deep learning classifiers, target, and parameter settings used in the experiments.
| Deep learning classifiers | Targets | Parameter values |
| | | |
| CNN^a | Relevance | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: rmsprop, filters: 100, kernel_size: 1, epochs: 5, batch_size: 16 |
| LSTM^b | Relevance | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: adam, epochs: 10, batch_size: 16 |
| LSTM-CNN | Relevance | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: adam, filters: 50, kernel_size: 2, epochs: 10, batch_size: 16 |
| BiLSTM^c | Relevance | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: adam, epochs: 10, batch_size: 16 |
| CNN | Commercial | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: adam, filters: 100, kernel_size: 2, epochs: 10, batch_size: 16 |
| LSTM | Commercial | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: rmsprop, epochs: 5, batch_size: 32 |
| LSTM-CNN | Commercial | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: rmsprop, filters: 75, kernel_size: 2, epochs: 5, batch_size: 16 |
| BiLSTM | Commercial | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: adam, epochs: 5, batch_size: 64 |
| CNN | Sentiment | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: rmsprop, filters: 100, kernel_size: 2, epochs: 10, batch_size: 32 |
| LSTM | Sentiment | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: adam, epochs: 5, batch_size: 64 |
| LSTM-CNN | Sentiment | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: adam, filters: 75, kernel_size: 3, epochs: 5, batch_size: 64 |
| BiLSTM | Sentiment | max_features: 166,395, embed_size: 300, max_len: 75, optimizer: rmsprop, epochs: 5, batch_size: 32 |
| | | |
| CNN | Relevance | max_features: 15,890, embed_size: 200, max_len: 75, optimizer: adam, filters: 100, kernel_size: 2, epochs: 10, batch_size: 16 |
| LSTM | Relevance | max_features: 15,890, embed_size: 200, max_len: 75, optimizer: adam, epochs: 5, batch_size: 32 |
| LSTM-CNN | Relevance | max_features: 15,890, embed_size: 200, max_len: 75, optimizer: adam, filters: 50, kernel_size: 2, epochs: 10, batch_size: 16 |
| BiLSTM | Relevance | max_features: 15,890, embed_size: 200, max_len: 75, optimizer: adam, epochs: 5, batch_size: 64 |
| CNN | Commercial | max_features: 10,842, embed_size: 200, max_len: 75, optimizer: rmsprop, filters: 50, kernel_size: 2, epochs: 5, batch_size: 16 |
| LSTM | Commercial | max_features: 10,842, embed_size: 200, max_len: 75, optimizer: adam, epochs: 5, batch_size: 16 |
| LSTM-CNN | Commercial | max_features: 10,842, embed_size: 200, max_len: 75, optimizer: adam, filters: 75, kernel_size: 2, epochs: 5, batch_size: 32 |
| BiLSTM | Commercial | max_features: 10,842, embed_size: 200, max_len: 75, optimizer: adam, epochs: 5, batch_size: 64 |
| CNN | Sentiment | max_features: 7979, embed_size: 200, max_len: 75, optimizer: rmsprop, filters: 100, kernel_size: 3, epochs: 5, batch_size: 64 |
| LSTM | Sentiment | max_features: 7979, embed_size: 200, max_len: 75, optimizer: adam, epochs: 5, batch_size: 32 |
| LSTM-CNN | Sentiment | max_features: 7979, embed_size: 200, max_len: 75, optimizer: rmsprop, filters: 75, kernel_size: 1, epochs: 10, batch_size: 64 |
| BiLSTM | Sentiment | max_features: 7979, embed_size: 200, max_len: 75, optimizer: adam, epochs: 5, batch_size: 32 |
^a CNN: convolutional neural network.
^b LSTM: long short-term memory.
^c BiLSTM: bidirectional long short-term memory.
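To make the parameter names concrete, the sketch below holds one table row in a small configuration object and derives a few quantities from it. It assumes the common Keras-style reading of the names (`max_features` as vocabulary size, `embed_size` as word vector dimension, `max_len` as the fixed input length in tokens); the source does not define them, so this reading is an assumption. The class `DeepConfig` and the helpers `pad_or_truncate` and `steps_per_epoch` are hypothetical, and the choice to pad at the end of the sequence is arbitrary.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DeepConfig:
    max_features: int   # assumed: vocabulary size (rows of the embedding matrix)
    embed_size: int     # assumed: word vector dimension (columns)
    max_len: int        # assumed: tokens per tweet after padding or truncation
    optimizer: str
    epochs: int
    batch_size: int
    filters: Optional[int] = None      # convolutional models only
    kernel_size: Optional[int] = None  # convolutional models only

    def embedding_shape(self) -> Tuple[int, int]:
        return (self.max_features, self.embed_size)


def pad_or_truncate(token_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Force a token id sequence to exactly max_len entries."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def steps_per_epoch(n_train: int, batch_size: int) -> int:
    """Number of mini-batches needed to see every training tweet once."""
    return math.ceil(n_train / batch_size)


# First CNN row for the relevance target, copied from the table.
cnn_relevance = DeepConfig(max_features=166_395, embed_size=300, max_len=75,
                           optimizer="rmsprop", epochs=5, batch_size=16,
                           filters=100, kernel_size=1)
```

Under these assumptions, the embedding matrix for this row has 166,395 × 300 = 49,918,500 entries, and with 3600 training tweets and a batch size of 16, one epoch takes 225 mini-batches.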
Performance of relevance classifiers.
| Classifiers | Area under the receiver operating characteristic curve | Precision | Recall | F1 |
| Logistic regression | 0.84 (0.78-0.89) | 0.80 | 1.00 | 0.92 |
| Random forest | 0.95 (0.93-0.98) | 0.93 | 0.97 | |
| Support vector machine | 0.92 (0.88-0.96) | 0.91 | 0.97 | 0.95 |
| Naive Bayes | 0.88 (0.83-0.93) | 0.88 | 0.99 | 0.93 |
| CNN^a (vaping-related word vectors) | 0.94 (0.91-0.97) | 0.90 | 0.97 | 0.98 |
| LSTM^b (vaping-related word vectors) | 0.91 (0.88-0.95) | 0.89 | 0.98 | 0.96 |
| LSTM-CNN (vaping-related word vectors) | 0.89 (0.85-0.93) | 0.93 | 0.87 | 0.95 |
| BiLSTM^c (vaping-related word vectors) | 0.89 (0.85-0.94) | 0.90 | 0.96 | 0.94 |
| CNN (GloVe^d word vectors) | 0.95 (0.92-0.97) | 0.93 | 0.95 | 0.98 |
| LSTM (GloVe word vectors) | 0.95 (0.92-0.98) | 0.95 | 0.95 | 0.98 |
| LSTM-CNN (GloVe word vectors) | 0.96 (0.93-0.98) | 0.96 | 0.93 | 0.98 |
| BiLSTM (GloVe word vectors) | 0.95 (0.93-0.98) | 0.92 | 0.96 | 0.98 |
^a CNN: convolutional neural network.
^b LSTM: long short-term memory.
^c BiLSTM: bidirectional long short-term memory.
^d GloVe: Global Vectors for Word Representation.
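The first metric column in these performance tables is the area under the receiver operating characteristic curve. One way to read it is as the probability that a randomly chosen positive tweet receives a higher classifier score than a randomly chosen negative tweet, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustration of the metric, not the evaluation code used in the study, and the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels.
    scores: iterable of classifier scores, higher meaning "more positive".
    Quadratic in the number of examples, which is fine for small test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every positive above every negative scores 1.0, and one that assigns the same score to everything scores 0.5, which is the chance level against which the tabulated values can be read.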
Performance of commercial classifiers.
| Classifiers | Area under the receiver operating characteristic curve | Precision | Recall | F1 |
| Logistic regression | 0.98 (0.95-0.99) | 0.93 | 0.83 | 0.96 |
| Random forest | 0.97 (0.96-0.99) | 0.95 | 0.82 | 0.97 |
| Support vector machine | 0.98 (0.91-0.99) | 0.92 | 0.86 | 0.92 |
| Naive Bayes | 0.96 (0.94-0.99) | 0.83 | 0.89 | 0.92 |
| CNN^a (vaping-related word vectors) | 0.98 (0.96-0.99) | 0.93 | 0.75 | 0.94 |
| LSTM^b (vaping-related word vectors) | 0.97 (0.95-0.99) | 0.88 | 0.81 | 0.94 |
| LSTM-CNN (vaping-related word vectors) | 0.97 (0.94-0.99) | 0.92 | 0.85 | 0.94 |
| BiLSTM^c (vaping-related word vectors) | 0.98 (0.96-0.99) | 0.84 | 0.87 | 0.95 |
| CNN (GloVe^d word vectors) | 0.99 (0.98-0.99) | 0.93 | 0.89 | 0.98 |
| LSTM (GloVe word vectors) | 0.99 (0.98-0.99) | 0.89 | 0.94 | 0.98 |
| LSTM-CNN (GloVe word vectors) | 0.99 (0.98-0.99) | 0.86 | 0.96 | 0.99 |
| BiLSTM (GloVe word vectors) | 0.99 (0.98-0.99) | 0.97 | 0.88 | 0.98 |
^a CNN: convolutional neural network.
^b LSTM: long short-term memory.
^c BiLSTM: bidirectional long short-term memory.
^d GloVe: Global Vectors for Word Representation.
Performance of sentiment classifiers.
| Classifiers | Area under the receiver operating characteristic curve | Precision | Recall | F1 |
| Logistic regression | 0.78 (0.71-0.84) | 0.73 | 0.88 | 0.82 |
| Random forest | 0.78 (0.70-0.83) | 0.78 | 0.79 | 0.82 |
| Support vector machine | 0.69 (0.64-0.78) | 0.66 | 0.98 | 0.75 |
| Naive Bayes | 0.75 (0.66-0.82) | 0.75 | 0.79 | 0.80 |
| CNN^a (vaping-related word vectors) | 0.74 (0.66-0.81) | 0.73 | 0.85 | 0.80 |
| LSTM^b (vaping-related word vectors) | 0.74 (0.69-0.82) | 0.75 | 0.81 | 0.81 |
| LSTM-CNN (vaping-related word vectors) | 0.75 (0.71-0.84) | 0.74 | 0.91 | 0.83 |
| BiLSTM^c (vaping-related word vectors) | 0.74 (0.68-0.81) | 0.72 | 0.91 | 0.82 |
| CNN (GloVe^d word vectors) | 0.81 (0.75-0.87) | 0.72 | 0.96 | 0.86 |
| LSTM (GloVe word vectors) | 0.78 (0.71-0.84) | 0.76 | 0.82 | 0.84 |
| LSTM-CNN (GloVe word vectors) | 0.80 (0.74-0.86) | 0.83 | 0.84 | 0.84 |
| BiLSTM (GloVe word vectors) | 0.83 (0.78-0.89) | 0.79 | 0.79 | 0.88 |
^a CNN: convolutional neural network.
^b LSTM: long short-term memory.
^c BiLSTM: bidirectional long short-term memory.
^d GloVe: Global Vectors for Word Representation.