Abeed Sarker, Graciela Gonzalez-Hernandez, Yucheng Ruan, Jeanmarie Perrone.
Abstract
Importance: Automatic curation of consumer-generated, opioid-related social media big data may enable real-time monitoring of the opioid epidemic in the United States.
Objective: To develop and validate an automatic text-processing pipeline for geospatial and temporal analysis of opioid-mentioning social media chatter.
Design, Setting, and Participants: This cross-sectional, population-based study was conducted from December 1, 2017, to August 31, 2019, and used more than 3 years of publicly available social media posts on Twitter, dated from January 1, 2012, to October 31, 2015, that were geolocated in Pennsylvania. Opioid-mentioning tweets were extracted using prescription and illicit opioid names, including street names and misspellings. Social media posts (tweets) (n = 9006) were manually categorized into 4 classes, and several machine learning algorithms were trained and evaluated. Temporal and geospatial patterns were analyzed with the best-performing classifier on unlabeled data.
Main Outcomes and Measures: Pearson and Spearman correlations of county- and substate-level abuse-indicating tweet rates with opioid overdose death rates from the Centers for Disease Control and Prevention WONDER database and with 4 metrics from the National Survey on Drug Use and Health for 3 years were calculated. Classifier performances were measured through microaveraged F1 scores (harmonic mean of precision and recall) or accuracies and 95% CIs.
Year: 2019 PMID: 31693125 PMCID: PMC6865282 DOI: 10.1001/jamanetworkopen.2019.14672
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Performances of Different Classifiers on the Testing Set
| Classifier | Precision, Class A | Precision, Class I | Precision, Class U | Precision, Class N | Recall, Class A | Recall, Class I | Recall, Class U | Recall, Class N | Microaveraged F1 or Accuracy Score (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| Random classifier | 0.166 | 0.235 | 0.535 | 0.052 | 0.189 | 0.224 | 0.530 | 0.044 | 0.375 (0.360-0.394) |
| NB | 0.307 | 0.501 | 0.788 | 0.737 | 0.670 | 0.504 | 0.463 | 0.811 | 0.539 (0.518-0.558) |
| NB Random oversampling | 0.297 | 0.502 | 0.806 | 0.745 | 0.695 | 0.495 | 0.456 | 0.778 | 0.523 (0.505-0.542) |
| NB Undersampling | 0.293 | 0.620 | 0.820 | 0.735 | 0.733 | 0.454 | 0.499 | 0.867 | 0.548 (0.529-0.568) |
| NB SMOTE | 0.319 | 0.509 | 0.793 | 0.737 | 0.651 | 0.498 | 0.526 | 0.811 | 0.555 (0.536-0.574) |
| DT | 0.389 | 0.540 | 0.725 | 0.816 | 0.371 | 0.447 | 0.783 | 0.889 | 0.638 (0.618-0.655) |
| DT Random oversampling | 0.388 | 0.510 | 0.752 | 0.818 | 0.455 | 0.476 | 0.724 | 0.900 | 0.617 (0.599-0.644) |
| DT Undersampling | 0.341 | 0.481 | 0.797 | 0.802 | 0.487 | 0.548 | 0.630 | 0.900 | 0.599 (0.579-0.617) |
| DT SMOTE | 0.307 | 0.437 | 0.723 | 0.833 | 0.365 | 0.488 | 0.638 | 0.889 | 0.568 (0.549-0.587) |
| k-NN | 0.314 | 0.791 | 0.589 | 0.852 | 0.101 | 0.081 | 0.942 | 0.876 | 0.593 (0.574-0.612) |
| k-NN Random oversampling | 0.287 | 0.629 | 0.627 | 0.861 | 0.248 | 0.159 | 0.852 | 0.900 | 0.587 (0.567-0.607) |
| k-NN Undersampling | 0.355 | 0.474 | 0.815 | 0.781 | 0.522 | 0.572 | 0.606 | 0.911 | 0.599 (0.580-0.618) |
| k-NN SMOTE | 0.317 | 0.446 | 0.724 | 0.868 | 0.380 | 0.493 | 0.643 | 0.878 | 0.574 (0.549-0.587) |
| SVM | 0.476 | 0.717 | 0.728 | 0.895 | 0.374 | 0.529 | 0.856 | 0.944 | 0.700 (0.681-0.718) |
| SVM Random oversampling | 0.446 | 0.657 | 0.821 | 0.895 | 0.560 | 0.756 | 0.644 | 0.944 | 0.704 (0.683-0.720) |
| SVM Undersampling | 0.409 | 0.611 | 0.862 | 0.843 | 0.629 | 0.668 | 0.667 | 0.956 | 0.675 (0.656-0.693) |
| SVM Oversampling SMOTE | 0.330 | 0.598 | 0.764 | 0.920 | 0.566 | 0.548 | 0.616 | 0.900 | 0.605 (0.587-0.624) |
| RF | 0.493 | 0.762 | 0.713 | 0.835 | 0.330 | 0.469 | 0.897 | 0.956 | 0.701 (0.683-0.718) |
| RF Random oversampling | 0.447 | 0.679 | 0.775 | 0.835 | 0.462 | 0.569 | 0.809 | 0.956 | 0.700 (0.684-0.719) |
| RF Undersampling | 0.414 | 0.561 | 0.883 | 0.791 | 0.616 | 0.688 | 0.639 | 0.967 | 0.663 (0.645-0.682) |
| RF Oversampling SMOTE | 0.379 | 0.539 | 0.771 | 0.843 | 0.465 | 0.565 | 0.688 | 0.956 | 0.634 (0.616-0.652) |
| CNN | 0.532 | 0.676 | 0.759 | 0.902 | 0.386 | 0.608 | 0.858 | 0.922 | 0.720 (0.699-0.735) |
| CNN Random oversampling | 0.532 | 0.677 | 0.758 | 0.902 | 0.386 | 0.602 | 0.860 | 0.922 | 0.720 (0.699-0.734) |
| CNN Undersampling | 0.414 | 0.551 | 0.866 | 0.902 | 0.400 | 0.565 | 0.639 | 0.922 | 0.638 (0.618-0.658) |
| CNN SMOTE | 0.493 | 0.598 | 0.800 | 0.902 | 0.414 | 0.548 | 0.688 | 0.922 | 0.658 (0.640-0.677) |
| Ensemble_1 (CNN, RF, SVM, NB) | 0.517 | 0.721 | 0.758 | 0.887 | 0.425 | 0.565 | 0.866 | 0.956 | 0.726 (0.708-0.743) |
| Ensemble_biased_1 (CNN, RF, SVM, NB) | 0.489 | 0.716 | 0.780 | 0.887 | 0.506 | 0.563 | 0.836 | 0.956 | 0.721 (0.703-0.739) |
| Ensemble_2 (CNN, RF, SVM, NB, DT) | 0.482 | 0.707 | 0.743 | 0.878 | 0.377 | 0.517 | 0.875 | 0.956 | 0.709 (0.692-0.726) |
| Ensemble_biased_2 (CNN, RF, SVM, NB, DT) | 0.456 | 0.708 | 0.810 | 0.878 | 0.597 | 0.577 | 0.786 | 0.956 | 0.713 (0.696-0.730) |
Abbreviations: A, self-reported abuse or misuse; CNN, convolutional neural network; DT, decision tree; I, information sharing; k-NN, k-nearest neighbors; N, non-English; NB, naive Bayes; RF, random forest; SMOTE, synthetic minority oversampling technique; SVM, support vector machine; U, unrelated.
The random classifier randomly assigns 1 of the 4 classes to a tweet.
Best performance.
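The per-class precision and recall columns and the microaveraged F1 column in the table above can be derived from paired gold-standard and predicted labels. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the function names and the toy labels are hypothetical. It also illustrates a general property of single-label classification, namely that microaveraged F1 pooled over all classes equals accuracy; whether the study pooled all 4 classes is not stated in this excerpt.

```python
# Hypothetical toy data: gold labels and classifier output for 10 tweets.
# A = abuse or misuse, I = information sharing, U = unrelated, N = non-English.
LABELS = ["A", "I", "U", "N"]
y_true = ["A", "A", "I", "I", "U", "U", "U", "U", "N", "N"]
y_pred = ["A", "U", "I", "I", "U", "U", "U", "A", "N", "U"]


def confusion_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} for single-label predictions."""
    counts = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        counts[label] = (tp, fp, fn)
    return counts


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_f1(y_true, y_pred, labels):
    """Pool tp/fp/fn over all labels, then take the harmonic mean."""
    counts = confusion_counts(y_true, y_pred, labels)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision, recall = precision_recall(tp, fp, fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (tp, fp, fn) in confusion_counts(y_true, y_pred, LABELS).items():
    print(label, precision_recall(tp, fp, fn))
print("micro F1:", micro_f1(y_true, y_pred, LABELS), "accuracy:", accuracy)
```

Pooling counts before averaging lets frequent classes (here, class U) dominate the score, which is why the per-class columns are reported alongside it.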
Figure 1. Monthly Distributions of the Frequencies and Proportions of Social Media Posts Classified as Abuse and Information in the Unlabeled Data Set Over 3 Years
Figure 2. Comparison of County-Level Heat Maps of Opioid-Related Death Rates and Abuse-Related Social Media Post Rates in Pennsylvania, 2012-2014, and Scatterplot of the Association Between the 2 Variables
Figure 3. Substate-Level Heat Maps and Scatterplots Comparing Frequencies of Abuse-Indicating Social Media Posts With 4 Survey Metrics, 2012-2014
The computed correlations and their statistical significance are summarized in Table 2. Pennsylvania substate information is found in eTable 6 in the Supplement. NSDUH indicates National Survey on Drug Use and Health.
Pearson and Spearman Correlations for Geolocation-Specific Abuse-Indicating Social Media Post Rates With County-Level Opioid Overdose Death Rates and 4 Metrics From the National Survey on Drug Use and Health
| Measure | Pearson Correlation | P Value | Spearman Correlation | P Value | No. of Data Points |
|---|---|---|---|---|---|
| Opioid overdose death rate | 0.451 | <.001 | 0.331 | .004 | 75 |
| Illicit drug use, no marijuana, past mo | 0.850 | <.001 | 0.341 | .25 | 13 |
| Nonmedical use of pain relievers, past y | 0.683 | .01 | 0.346 | .25 | 13 |
| Illicit drug dependence or abuse, past y | 0.935 | <.001 | 0.401 | .17 | 13 |
| Illicit drug dependence, past y | 0.937 | <.001 | 0.495 | .09 | 13 |
Indicates statistical significance.
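The two coefficients in the table above measure different things: Pearson captures linear association between the raw values, whereas Spearman is the Pearson correlation of their ranks and so captures monotonic association. The sketch below shows both computations in plain Python under stated assumptions: the function names and the 5 paired values are hypothetical and are not taken from the study, and P values are not computed.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical regional rates (not study data).
tweet_rate = [1.0, 2.0, 3.0, 4.0, 5.0]
death_rate = [2.0, 1.0, 4.0, 3.0, 50.0]

print("Pearson:", pearson(tweet_rate, death_rate))
print("Spearman:", spearman(tweet_rate, death_rate))
```

Because ranks discard magnitudes, a single extreme region affects the two coefficients differently, so they need not agree, particularly with few data points such as the 13 substate regions.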