| Literature DB >> 34804810 |
Sanjana Garg, Jordan Taylor, Mai El Sherief, Erin Kasson, Talayeh Aledavood, Raven Riordan, Nina Kaiser, Patricia Cavazos-Rehg, Munmun De Choudhury.
Abstract
INTRODUCTION: Opioid misuse is a public health crisis in the US, and misuse of synthetic opioids such as fentanyl has driven the most recent waves of opioid-related deaths. Because those who misuse fentanyl are often a hidden and high-risk group, innovative methods for identifying individuals at risk for fentanyl misuse are needed. Machine learning has been used in the past to investigate discussions surrounding substance use on Reddit, and this study leverages similar techniques to identify risky content from discussions of fentanyl on this platform.
Keywords: Detection; Fentanyl; Machine learning; Opioids; Overdose; Social media
Year: 2021 PMID: 34804810 PMCID: PMC8581502 DOI: 10.1016/j.invent.2021.100467
Source DB: PubMed Journal: Internet Interv ISSN: 2214-7829
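The abstract describes classifying Reddit discussions of fentanyl. As a rough, hypothetical sketch of how such data could be gathered (the record does not specify the authors' collection tooling; the PRAW credentials, limit, and field names below are placeholders, and the study may have used a different source such as an archive service):

```python
# Hypothetical sketch of collecting r/fentanyl submissions and comments with PRAW.
# Credentials and the limit are placeholders; this is not the authors' pipeline.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="fentanyl-risk-research",
)

records = []
for submission in reddit.subreddit("fentanyl").new(limit=1000):
    records.append({"id": submission.id, "author": str(submission.author),
                    "text": submission.selftext, "kind": "post"})
    submission.comments.replace_more(limit=None)  # expand collapsed comment trees
    for comment in submission.comments.list():
        records.append({"id": comment.id, "author": str(comment.author),
                        "text": comment.body, "kind": "comment"})
```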
Data description from r/fentanyl.
| | # | # users | # avg. words | # intake | # no-intake |
|---|---|---|---|---|---|
| Posts | 804 | 422 | 88 | 207 | 597 |
| Comments | 5655 | 980 | 54 | 1421 | 4234 |
| Total | 6459 | 1124 | 59 | 1628 | 4831 |
Post and comment distribution from users who deleted their account.
| | | Num from deleted author | Total | Percent from deleted author |
|---|---|---|---|---|
| All data | All | 481 | 6459 | 7.45% |
| | Posts | 120 | 804 | 14.93% |
| | Comments | 361 | 5655 | 6.38% |
| All data with text | All | 83 | 5744 | 1.44% |
| | Posts | 13 | 387 | 3.36% |
| | Comments | 70 | 5357 | 1.31% |
| Annotated data | All | 5 | 391 | 1.28% |
| | Posts | 1 | 42 | 2.38% |
| | Comments | 4 | 349 | 1.15% |
Some types of posts don't have a text body, and some comment and post texts in our dataset simply say they were removed or deleted.
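Given that some items have no text body or carry only Reddit's removal markers, a filtering step along the following lines would recover the "all data with text" subset; the exact marker strings and column names are assumptions, not the authors' code:

```python
import pandas as pd

df = pd.DataFrame(records)  # records from the collection sketch above (assumed schema)

# Drop rows whose text is empty or is just Reddit's removal/deletion marker.
placeholder = {"[removed]", "[deleted]", ""}
has_text = ~df["text"].fillna("").str.strip().str.lower().isin(placeholder)
df_with_text = df[has_text]

# Flag content from deleted accounts (PRAW yields None -> "None" for such authors).
from_deleted = df["author"].isin(["None", "[deleted]"])
print(f"{from_deleted.mean():.2%} of items come from deleted accounts")
```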
Description of the number of posts and comments per author in our entire dataset and in our annotated dataset.
| | | Mean per author in entire dataset | Median per author in entire dataset |
|---|---|---|---|
| All data authors | All | 5.32 (±15.14) | 2.00 |
| | Posts | 0.61 (±1.39) | 0.00 |
| | Comments | 4.71 (±14.47) | 1.00 |
| Annotated data authors | All | 16.41 (±31.49) | 7.00 |
| | Posts | 1.26 (±1.96) | 1.00 |
| | Comments | 15.15 (±30.19) | 7.00 |
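The per-author means and medians above amount to a groupby over authors; a minimal sketch, assuming the DataFrame built in the earlier snippets:

```python
# Count posts/comments per author, then summarize (column names assumed as before).
per_author = df_with_text.groupby("author")["kind"].value_counts().unstack(fill_value=0)
per_author["all"] = per_author.sum(axis=1)

for col in ["all", "post", "comment"]:
    print(f"{col}: mean {per_author[col].mean():.2f}, median {per_author[col].median():.2f}")
```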
Annotated data statistics.
| | Low risk: # | # users | # avg. words | Intake | No-intake | Elevated risk: # | # users | # avg. words | Intake | No-intake |
|---|---|---|---|---|---|---|---|---|---|---|
| Posts | 2 | 2 | 177 | 0 | 2 | 40 | 37 | 255 | 31 | 9 |
| Comments | 144 | 99 | 26 | 55 | 89 | 205 | 135 | 103 | 156 | 49 |
| Total | 146 | 101 | 28 | 55 | 91 | 245 | 162 | 128 | 187 | 58 |
Fig. 1. Confusion matrix for each classifier (LR — low risk, ER — elevated risk).
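Per-classifier confusion matrices like Fig. 1's can be drawn with scikit-learn's display utilities; a sketch assuming a fitted classifier `clf` and a held-out split (`X_test`, `y_test`), with the paper's LR/ER labels:

```python
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# clf, X_test, y_test are assumed to exist; LR = low risk, ER = elevated risk.
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test,
                                      display_labels=["LR", "ER"])
plt.show()
```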
Macro-averaged model performance under 5-fold cross validation on 80% of our annotated data, alongside the performance of models trained on that training set (80% of the annotated data) and evaluated on our test set (the remaining 20%).
| Features | Precision | Recall | Macro-F1 | Accuracy | AUC |
|---|---|---|---|---|---|
| N-Gram L + D | 0.82 (±0.11) | 0.81 (±0.09) | 0.81 (±0.09) | 0.81 (±0.09) | 0.91 (±0.12) |
| N-Gram L + D | 0.82 (±0.10) | 0.81 (±0.09) | 0.81 (±0.08) | 0.81 (±0.09) | 0.89 (±0.12) |
| TFIDF L + D | 0.84 (±0.04) | 0.83 (±0.04) | 0.82 (±0.05) | 0.83 (±0.04) | 0.92 (±0.06) |
| BERT | 0.82 (±0.05) | 0.78 (±0.06) | 0.78 (±0.06) | 0.81 (±0.05) | 0.87 (±0.04) |
| Baseline | 0.75 (±0.08) | 0.75 (±0.09) | 0.71 (±0.10) | 0.72 (±0.10) | 0.86 (±0.09) |
LR — logistic regression, SVM — linear support vector machine, RF — random forest, LSTM NN — long short-term memory neural network.
L + D — lemmatized and debiased (see Supplement section “Debiasing” for more information about debiasing).
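The cross-validation numbers above (mean ± standard deviation of macro-averaged metrics over 5 folds) match what scikit-learn's `cross_validate` reports; a minimal sketch with TF-IDF features and logistic regression, where `texts` and `labels` stand in for the annotated data and the paper's lemmatization/debiasing preprocessing is omitted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# texts/labels: annotated posts and comments with low/elevated-risk labels (assumed).
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_validate(
    pipe, texts, labels, cv=5,
    scoring=["precision_macro", "recall_macro", "f1_macro", "accuracy", "roc_auc"],
)
for name, vals in scores.items():
    if name.startswith("test_"):
        print(f"{name}: {vals.mean():.2f} (±{vals.std():.2f})")
```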
Fig. 2. ROC (receiver operating characteristic) curves for each classifier.
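Overlaid ROC curves like Fig. 2's can be produced per fitted model; a sketch assuming a dict `models` of fitted classifiers and the held-out split from before:

```python
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

# Overlay one ROC curve per fitted classifier on a shared axis.
ax = plt.gca()
for name, clf in models.items():
    RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)
plt.show()
```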
Comparison of top features across the three top-performing classifiers. Weights denote feature importance.
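For linear models the feature weights are the learned coefficients, and for random forests the impurity-based importances; a sketch of pulling the top-weighted features, assuming a fitted TF-IDF vectorizer `vec` and fitted models as above:

```python
import numpy as np

feature_names = np.asarray(vec.get_feature_names_out())

def top_features(model, k=10):
    # Linear models (LR, linear SVM) expose coef_; random forests expose
    # feature_importances_. Rank by absolute weight and return the top k.
    weights = model.coef_[0] if hasattr(model, "coef_") else model.feature_importances_
    idx = np.argsort(np.abs(weights))[::-1][:k]
    return list(zip(feature_names[idx], weights[idx].round(3)))
```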
Correctly classified examples with associated risk factors and top features highlighted.
Drug-related words among the 200 word-embedding tokens most similar to the seed word(s). The numbers in parentheses give the cosine similarity between the drug-related word on the right and the seed word(s) on the left.
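Nearest-neighbor lists like this are typically read off a trained embedding model via cosine similarity; a sketch with gensim's Word2Vec, where `tokenized_texts`, the seed word, and `drug_lexicon` (a set of drug-related terms) are illustrative assumptions:

```python
from gensim.models import Word2Vec

# Train embeddings on the tokenized corpus (hyperparameters are illustrative).
model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5, min_count=2)

# Take the 200 tokens nearest the seed word by cosine similarity,
# then keep those appearing in an assumed drug-term lexicon.
neighbors = model.wv.most_similar("fentanyl", topn=200)  # list of (token, cosine sim)
drug_related = [(w, round(s, 2)) for w, s in neighbors if w in drug_lexicon]
```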
Misclassified examples.