Keyuan Jiang, Shichao Feng, Qunhao Song, Ricardo A Calix, Matrika Gupta, Gordon R Bernard.
Abstract
BACKGROUND: As Twitter has become an active data source for health surveillance research, it is important that efficient and effective methods be developed to identify tweets related to personal health experience. Conventional classification algorithms rely on features engineered by human domain experts; engineering such features is a challenging task that requires considerable human effort, and the resulting features may not be optimal for the classification problem. The many different ways of expressing and describing personal experience in tweets also make it difficult for conventional classifiers to correctly predict personal experience tweets (PETs). In this study, we developed a method that combines word embedding and a long short-term memory (LSTM) model without the need to engineer any task-specific features. Through word embedding, tweet texts were represented as dense vectors, which in turn were fed to the LSTM neural network as sequences.
Keywords: Deep learning; Health surveillance; LSTM neural network; Pharmacovigilance; Social media; Twitter; Unsupervised feature learning
Year: 2018 PMID: 29897323 PMCID: PMC5998756 DOI: 10.1186/s12859-018-2198-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
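The abstract describes tweets being mapped to term indices and then to dense vectors before entering the LSTM. The record contains no code, so the following is a minimal illustrative sketch; the `vocab` mapping and `embedding_matrix` are hypothetical stand-ins for the models the paper builds from a large unlabeled tweet corpus.

```python
import numpy as np

# Hypothetical vocabulary (term -> index) and pretrained embedding matrix
# (one 128-dimensional row per term); both are placeholders, not the
# paper's actual models.
vocab = {"thank": 5918, "you": 1012, "aspirin": 720,
         "no": 3973, "more": 241, "headache": 2354}
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(6000, 128)).astype("float32")

def tweet_to_vectors(text):
    """Map a tweet to a sequence of dense term vectors for the LSTM."""
    indices = [vocab[t] for t in text.lower().split() if t in vocab]
    return embedding_matrix[indices]              # shape: (n_terms, 128)

print(tweet_to_vectors("Thank you aspirin No more headache").shape)  # (6, 128)
```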
Representation of an example tweet
| Tweet Text | Thank | you | aspirin | No | more | headache |
|---|---|---|---|---|---|---|
| Symbolic Term Index |  |  |  |  |  |  |
| Actual Term Index | 5918 | 1012 | 720 | 3973 | 241 | 2354 |
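Assuming the vocabulary behaves as a plain term-to-index mapping, the table's "Actual Term Index" row reduces to a dictionary lookup; the `vocab` mapping below is hypothetical and only echoes the indices shown above.

```python
# Hypothetical vocabulary echoing the indices in the table above.
vocab = {"Thank": 5918, "you": 1012, "aspirin": 720,
         "No": 3973, "more": 241, "headache": 2354}

tweet = "Thank you aspirin No more headache"
print([vocab[term] for term in tweet.split()])  # [5918, 1012, 720, 3973, 241, 2354]
```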
Fig. 1 The pipeline to generate the vocabulary and vector space model. A corpus of 22 million unlabeled tweets was collected and pre-processed to remove certain punctuation marks, duplicates, non-English tweets, and tweets with URLs. A collection of unique terms was compiled to generate a vocabulary, and a vector space model was created from the preprocessed tweets
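Fig. 1's pipeline can be sketched as follows. The record does not name the embedding tool, so gensim's Word2Vec is used purely as an illustrative stand-in; the filtering steps follow the caption, and `raw_tweets` is a toy placeholder for the 22-million-tweet corpus.

```python
import re
from gensim.models import Word2Vec  # illustrative tool choice only

raw_tweets = [
    "Thank you aspirin! No more headache",
    "Check this out http://example.com",      # has a URL: removed
    "Thank you aspirin! No more headache",    # duplicate: removed
]

def preprocess(tweets):
    """Remove tweets with URLs, strip punctuation, drop duplicates (per Fig. 1).
    Non-English filtering is omitted for brevity."""
    seen, cleaned = set(), []
    for t in tweets:
        if "http" in t:
            continue
        t = re.sub(r"[^\w\s]", "", t).lower().strip()
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t.split())
    return cleaned

corpus = preprocess(raw_tweets)
w2v = Word2Vec(corpus, vector_size=128, window=5, min_count=1)
vocab = w2v.wv.key_to_index        # the vocabulary: term -> index
embedding_matrix = w2v.wv.vectors  # the vector space model: index -> 128-d vector
```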
Fig. 2 The pipeline to represent and classify the study tweets. A total of 12,331 annotated tweets for training and testing were preprocessed first. The index of each term in the preprocessed tweets was retrieved from the vocabulary, and the text of each tweet was converted to a sequence of vectors of the corresponding term indices (see Fig. 3). Sequences of term-index vectors were fed to the LSTM network for classification
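Fig. 2's conversion step might look like the sketch below, assuming Keras-style padding; `texts`, `labels`, and `vocab` are hypothetical stand-ins for the 12,331 annotated tweets and the learned vocabulary.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical stand-ins for the annotated training/test tweets.
texts  = ["thank you aspirin no more headache", "aspirin prices keep rising"]
labels = [1, 0]                               # 1 = PET, 0 = non-PET
vocab  = {"thank": 5918, "you": 1012, "aspirin": 720, "no": 3973,
          "more": 241, "headache": 2354, "prices": 77, "keep": 15, "rising": 903}

# Retrieve each term's vocabulary index (skipping out-of-vocabulary terms),
# then pad so every sequence has the same length before entering the LSTM.
sequences = [[vocab[t] for t in s.split() if t in vocab] for s in texts]
X = pad_sequences(sequences, maxlen=30)       # shape: (n_tweets, 30)
```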
Fig. 3 A high-level overview of the LSTM model
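Since Fig. 3 is only a block diagram here, the following Keras sketch reconstructs a plausible model from the parameters reported below (128-dimensional input/output, L2 coefficient 0.01, 30% validation split, the stated class weights); layer choices beyond those parameters are assumptions, and `vocab`, `X`, and `labels` carry over from the Fig. 2 sketch above.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.regularizers import l2

y = np.array(labels)  # labels from the Fig. 2 sketch

model = Sequential([
    # Embedding layer holding one 128-d vector per vocabulary entry.
    Embedding(input_dim=max(vocab.values()) + 1, output_dim=128),
    # 128-unit LSTM with an L2 kernel regularizer (coefficient 0.01).
    LSTM(128, kernel_regularizer=l2(0.01)),
    # One sigmoid unit for the binary PET / non-PET decision.
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 30% of the training data held out for validation; class weights offset the
# PET / non-PET imbalance (6547/2650 and 2650/6547, as in the table below).
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.3,
          class_weight={1: 6547 / 2650, 0: 2650 / 6547})
```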
Statistics of the corpus of annotated tweets
| # of Tweets | # of PETs | # of Non-PETs |
|---|---|---|
| 12,331 | 2962 | 9369 |
Parameter settings of classifiers
| Classifier | Parameter Settings |
|---|---|
| Logistic Regression | penalty = 'l2', tol = 1e-4, C = 1.0, solver = 'liblinear', max_iter = 100 |
| Decision Tree (J48) | criterion = 'entropy', max_depth = 30, min_samples_split = 2, min_samples_leaf = 1 |
| KNN | n_neighbors = 1 |
| SVM | C = 1.0, kernel = 'rbf', tol = 1e-4, gamma = 0.001 |
| BoW + Logistic Regr. | C = 1000, random_state = 0 |
| Word Embedding + LSTM | LSTM layer input and output dimensions: 128; L2 regularization with coefficient 0.01; 30% of the training dataset used as the validation dataset; class weight 6547/2650 for the PET class and 2650/6547 for the non-PET class |
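The baseline parameter names match scikit-learn's API; under that assumption, the settings in the table translate directly into the following instantiations.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

classifiers = {
    "Logistic Regression": LogisticRegression(penalty="l2", tol=1e-4, C=1.0,
                                              solver="liblinear", max_iter=100),
    "Decision Tree (J48)": DecisionTreeClassifier(criterion="entropy", max_depth=30,
                                                  min_samples_split=2,
                                                  min_samples_leaf=1),
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "SVM": SVC(C=1.0, kernel="rbf", tol=1e-4, gamma=0.001),
    # Bag-of-words features feeding a second logistic regression.
    "BoW + Logistic Regr.": make_pipeline(CountVectorizer(),
                                          LogisticRegression(C=1000, random_state=0)),
}
```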
Classification performance
| Classifier | Accuracy | Precision (PET) | Recall (PET) | F1 (PET) | ROC/AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.637 | 0.356 | 0.471 | 0.405 | 0.598 |
| Decision Tree | 0.602 | 0.329 | 0.442 | 0.357 | 0.547 |
| KNN | 0.669 | 0.383 | 0.481 | 0.411 | 0.604 |
| SVM | 0.635 | 0.339 | 0.478 | 0.393 | 0.580 |
| BoW + Logistic Regr. | 0.757 | 0.498 | 0.567 | 0.530 | 0.698 |
| Word Embedding + LSTM |  |  |  |  |  |
The highest values are in boldface
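The table's metrics correspond to standard scikit-learn scorers; a sketch of how they could be computed for any classifier above, treating PET as the positive class (inputs `y_true`, `y_pred`, and `y_score` are hypothetical test-set labels, predictions, and scores).

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Compute the five metrics reported in the table (PET = positive class)."""
    return {
        "Accuracy":        accuracy_score(y_true, y_pred),
        "Precision (PET)": precision_score(y_true, y_pred, pos_label=1),
        "Recall (PET)":    recall_score(y_true, y_pred, pos_label=1),
        "F1 (PET)":        f1_score(y_true, y_pred, pos_label=1),
        "ROC/AUC":         roc_auc_score(y_true, y_score),
    }
```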
Results of statistical analysis (p-values)
| Classifier | Accuracy | Precision | Recall | F1 | ROC/AUC |
|---|---|---|---|---|---|
| Logistic Regression | 2.52 × 10⁻⁸ | 1.85 × 10⁻⁹ | 5.48 × 10⁻⁹ | 5.87 × 10⁻¹⁰ | 1.46 × 10⁻⁹ |
| Decision Tree | 1.80 × 10⁻⁴ | 1.51 × 10⁻⁴ | 6.99 × 10⁻⁶ | 1.92 × 10⁻⁶ | 1.16 × 10⁻⁵ |
| KNN | 8.08 × 10⁻⁵ | 6.22 × 10⁻⁵ |  | 8.50 × 10⁻⁵ |  |
| SVM | 1.17 × 10⁻⁸ | 4.61 × 10⁻⁸ | 1.74 × 10⁻⁴ | 7.89 × 10⁻⁷ | 5.02 × 10⁻⁶ |
| BoW + Logistic Regr. |  |  | 1.79 × 10⁻⁴ |  | 2.12 × 10⁻⁵ |
The highest values are in boldface
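The record does not state which statistical test produced these p-values; one common choice for comparing classifiers across repeated runs is a paired t-test, sketched below with hypothetical per-run scores.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Hypothetical per-run F1 scores for two classifiers (placeholders only;
# the record does not state the actual test, run counts, or scores).
lstm_f1     = rng.normal(0.70, 0.02, size=10)
baseline_f1 = rng.normal(0.53, 0.02, size=10)

t_stat, p_value = ttest_rel(lstm_f1, baseline_f1)  # paired two-sample t-test
print(f"p = {p_value:.2e}")  # a small p-value suggests a non-chance difference
```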