David Goodman-Meza, Chelsea L Shover, Jesus A Medina, Amber B Tang, Steven Shoptaw, Alex A T Bui.
Abstract
Importance: Overdose is one of the leading causes of death in the US; however, there is a considerable lag between medical examiner determination of a death and its appearance in national surveillance reports.

Objective: To automate the classification of deaths related to substances in medical examiner data using natural language processing (NLP) and machine learning (ML).

Design, Setting, and Participants: Diagnostic study comparing different NLP and ML algorithms to identify substances related to overdose in 10 health jurisdictions in the US from January 1, 2020, to December 31, 2020. Unstructured text from 35 433 medical examiner and coroner death records was examined.

Exposures: Text from each case was manually classified by the substance related to the death. Three feature representation methods were used and compared: term frequency-inverse document frequency (TF-IDF), global vectors for word representations (GloVe), and concept unique identifier (CUI) embeddings. Several ML algorithms were trained, and the best models were selected based on F-scores. The best models were tested on a hold-out test set, and results were reported with 95% CIs.

Main Outcomes and Measures: Text data from death certificates were classified as any opioid, fentanyl, alcohol, cocaine, methamphetamine, heroin, prescription opioid, and an aggregate of other substances. Diagnostic metrics and 95% CIs were calculated for each combination of feature extraction method and machine learning classifier.
Year: 2022 PMID: 35939303 PMCID: PMC9361079 DOI: 10.1001/jamanetworkopen.2022.25593
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Figure 1. Natural Language Processing Pipeline
CUI indicates concept unique identifier; GloVe, global vectors for word representations; KNN, k-nearest neighbors; SVM, support vector machine; XGBoost, eXtreme Gradient Boosting.
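As a rough illustration of the TF-IDF branch of such a pipeline, the following minimal sketch turns free-text cause-of-death strings into TF-IDF vectors and labels a new record with a 1-nearest-neighbor rule. It is not the study's implementation: the toy records, the smoothed IDF formula, the tokenizer, and the function names (`fit_idf`, `tfidf_vector`, `predict_1nn`) are all assumptions made for the example, and the 1-nearest-neighbor rule merely stands in for the classifiers named in the caption.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on any non-alphanumeric character.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1 (an assumed variant).
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def tfidf_vector(text, idf):
    # Term counts weighted by IDF, then L2-normalised; tokens unseen in training are dropped.
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())

def predict_1nn(text, train_vecs, train_labels, idf):
    # Return the label of the most similar training record.
    query = tfidf_vector(text, idf)
    best = max(range(len(train_vecs)), key=lambda i: cosine(query, train_vecs[i]))
    return train_labels[best]

# Invented toy records (not from the study); label 1 = fentanyl involved.
train_texts = [
    "acute fentanyl toxicity",
    "combined effects of fentanyl and cocaine",
    "acute ethanol intoxication",
    "methamphetamine toxicity",
]
train_labels = [1, 1, 0, 0]

idf = fit_idf(train_texts)
train_vecs = [tfidf_vector(t, idf) for t in train_texts]
prediction = predict_1nn("fentanyl and heroin toxicity", train_vecs, train_labels, idf)
```

In practice one binary classifier of this kind would be trained per substance, which matches the per-substance layout of the results tables.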
Figure 2. Substances Identified in Overdoses From Medical Examiner Data
MDA indicates 3,4-methylenedioxyamphetamine; MDMA, 3,4-methylenedioxymethamphetamine.
Top 3 Models by Substance in 10-Fold Cross-Validation of Training Data Set

| Substance | TF-IDF model | Mean F-score | SE | Word embeddings (GloVe)<sup>a</sup> model | Mean F-score | SE | CUI2Vec embeddings<sup>b</sup> model | Mean F-score | SE |
|---|---|---|---|---|---|---|---|---|---|
| Any opioid | XGBoost<sup>c</sup> | 0.969 | 0.002 | SVM<sup>c</sup> | 0.970 | 0.002 | SVM<sup>c</sup> | 0.992 | 0.001 |
| | Random forest<sup>c</sup> | 0.969 | 0.001 | Neural network | 0.967 | 0.003 | XGBoost | 0.989 | 0.001 |
| | Neural network | 0.968 | 0.002 | XGBoost | 0.965 | 0.003 | Random forest | 0.987 | 0.001 |
| Heroin | Logistic regression<sup>c</sup> | 1.000 | 0.000 | Logistic regression<sup>c</sup> | 1.000 | 0.000 | Logistic regression<sup>c</sup> | 1.000 | 0.000 |
| | Random forest<sup>c</sup> | 1.000 | 0.000 | SVM<sup>c</sup> | 1.000 | 0.000 | SVM<sup>c</sup> | 1.000 | 0.000 |
| | XGBoost<sup>c</sup> | 1.000 | 0.000 | Neural network | 0.999 | 0.000 | XGBoost | 0.996 | 0.002 |
| Fentanyl | Random forest<sup>c</sup> | 1.000 | 0.000 | SVM<sup>c</sup> | 1.000 | 0.000 | SVM<sup>c</sup> | 1.000 | 0.000 |
| | XGBoost<sup>c</sup> | 1.000 | 0.000 | Neural network<sup>c</sup> | 1.000 | 0.000 | Neural network<sup>c</sup> | 1.000 | 0.000 |
| | Logistic regression | 0.999 | 0.000 | Logistic regression<sup>c</sup> | 1.000 | 0.000 | XGBoost | 0.999 | 0.000 |
| Prescription opioid | XGBoost<sup>c</sup> | 0.561 | 0.015 | XGBoost<sup>c</sup> | 0.554 | 0.015 | SVM<sup>c</sup> | 0.996 | 0.002 |
| | Random forest | 0.558 | 0.015 | Random forest | 0.514 | 0.016 | Neural network | 0.989 | 0.002 |
| | Logistic regression | 0.545 | 0.015 | SVM | 0.510 | 0.012 | Logistic regression | 0.985 | 0.002 |
| Methamphetamine | Logistic regression<sup>c</sup> | 1.000 | 0.000 | SVM<sup>c</sup> | 0.999 | 0.000 | SVM<sup>c</sup> | 0.998 | 0.001 |
| | Random forest<sup>c</sup> | 1.000 | 0.000 | Neural network | 0.997 | 0.002 | Logistic regression | 0.987 | 0.005 |
| | XGBoost<sup>c</sup> | 1.000 | 0.000 | Logistic regression | 0.997 | 0.001 | XGBoost | 0.986 | 0.001 |
| Cocaine | Logistic regression<sup>c</sup> | 1.000 | 0.000 | Logistic regression<sup>c</sup> | 1.000 | 0.000 | Logistic regression<sup>c</sup> | 1.000 | 0.000 |
| | Random forest<sup>c</sup> | 1.000 | 0.000 | SVM<sup>c</sup> | 1.000 | 0.000 | SVM<sup>c</sup> | 1.000 | 0.000 |
| | XGBoost<sup>c</sup> | 1.000 | 0.000 | SVM | 0.999 | 0.000 | Neural network | 0.998 | 0.001 |
| Benzodiazepine | Random forest<sup>c</sup> | 0.671 | 0.013 | Neural network<sup>c</sup> | 0.662 | 0.011 | Neural network<sup>c</sup> | 0.902 | 0.009 |
| | XGBoost | 0.666 | 0.015 | SVM | 0.645 | 0.016 | SVM<sup>c</sup> | 0.902 | 0.009 |
| | Neural network | 0.657 | 0.013 | XGBoost | 0.637 | 0.014 | XGBoost | 0.867 | 0.01 |
| Alcohol | Random forest<sup>c</sup> | 0.974 | 0.003 | SVM<sup>c</sup> | 0.956 | 0.002 | XGBoost<sup>c</sup> | 0.852 | 0.005 |
| | XGBoost | 0.974 | 0.003 | Neural network | 0.951 | 0.003 | Random forest<sup>c</sup> | 0.852 | 0.005 |
| | Neural network | 0.973 | 0.003 | XGBoost | 0.948 | 0.002 | SVM | 0.851 | 0.005 |
| Other | XGBoost | 0.812 | 0.005 | Neural network<sup>c</sup> | 0.806 | 0.005 | SVM<sup>c</sup> | 0.968 | 0.003 |
| | Random forest | 0.811 | 0.004 | XGBoost | 0.804 | 0.003 | Logistic regression | 0.953 | 0.006 |
| | Neural network | 0.807 | 0.004 | Random forest | 0.772 | 0.004 | XGBoost | 0.941 | 0.004 |

Abbreviations: CUI, concept unique identifier; GloVe, global vectors for word representations; SVM, support vector machine; TF-IDF, term frequency–inverse document frequency.

<sup>a</sup> GloVe with 6 billion tokens and 100 dimensions was used in this analysis.

<sup>b</sup> CUI2vec with 109 053 tokens and 500 dimensions was used in this analysis.

<sup>c</sup> The best-performing models based on the mean F-score of 10-fold cross-validation.
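The mean F-score and SE columns summarize per-fold scores from 10-fold cross-validation. The sketch below shows one way such a summary could be computed. The source does not state how the SE was derived, so the sample standard deviation divided by the square root of the number of folds is an assumption here, as are the fold scores (invented) and the helper names `f_score`, `kfold_splits`, and `summarize_folds`.

```python
import math
import statistics

def f_score(y_true, y_pred):
    # F1 = 2TP / (2TP + FP + FN) for binary labels coded 0/1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def kfold_splits(n_samples, k=10):
    # Sample i is assigned to fold i % k; each fold serves once as the validation set.
    return [
        (
            [i for i in range(n_samples) if i % k != fold],
            [i for i in range(n_samples) if i % k == fold],
        )
        for fold in range(k)
    ]

def summarize_folds(fold_scores):
    # Mean across folds and standard error (assumed: sample SD / sqrt(number of folds)).
    mean = statistics.mean(fold_scores)
    se = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, se

# Invented per-fold F-scores for one substance and one model (not study data).
fold_scores = [0.97, 0.96, 0.98, 0.97, 0.95, 0.97, 0.98, 0.96, 0.97, 0.99]
mean_f, se_f = summarize_folds(fold_scores)

splits = kfold_splits(25, k=10)
example_f = f_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In a full run, a model would be fit on each training index list, scored with `f_score` on the matching validation list, and the 10 scores passed to `summarize_folds`.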
Bootstrapped Diagnostic Metrics and Best Performing Models in Test Data Set (N = 7087) Using CUI2Vec as Feature Representations<sup>a</sup>

| Metric, mean (95% CI)<sup>b</sup> | Any opioid | Heroin | Fentanyl | Prescription opioid | Methamphetamine | Cocaine | Benzodiazepine | Alcohol | Other |
|---|---|---|---|---|---|---|---|---|---|
| F-score | 0.989 (0.982-0.994) | 1.00 (1.00-1.00) | 0.999 (0.997-1.00) | 0.977 (0.941-1.00) | 0.995 (0.989-1.00) | 1.00 (1.00-1.00) | 0.840 (0.788-0.889) | 0.854 (0.828-0.880) | 0.950 (0.933-0.965) |
| Accuracy | 0.996 (0.994-0.998) | 1.00 (1.00-1.00) | 1.00 (0.999-1.00) | 0.998 (0.996-1.00) | 0.999 (0.999-1.00) | 1.00 (1.00-1.00) | 0.988 (0.967-0.993) | 0.979 (0.975-0.983) | 0.992 (0.990-0.995) |
| κ | 0.986 (0.979-0.993) | 1.00 (1.00-1.00) | 0.999 (0.997-1.00) | 0.977 (0.939-1.00) | 0.995 (0.988-1.00) | 1.00 (1.00-1.00) | 0.722 (0-0.885) | 0.843 (0.815-0.871) | 0.945 (0.928-0.962) |
| Sensitivity (recall) | 0.98 (0.970-0.989) | 1.00 (1.00-1.00) | 0.999 (0.997-1.00) | 0.971 (0.931-1.00) | 0.993 (0.98-1.00) | 1.00 (1.00-1.00) | 0.658 (0-0.829) | 0.749 (0.709-0.787) | 0.912 (0.885-0.938) |
| Specificity | 1 (0.998-1.00) | 1.00 (1.00-1.00) | 1.00 (1.00-1.00) | 0.999 (0.998-1.00) | 1.00 (1.00-1.00) | 1.00 (1.00-1.00) | 0.999 (0.996-1.00) | 1.00 (0.997-1.00) | 0.999 (0.998-1.00) |
| Positive predictive value (precision) | 0.998 (0.988-1.00) | 1.00 (1.00-1.00) | 1.00 (0.997-1.00) | 0.984 (0.931-1.00) | 0.997 (0.992-1.00) | 1.00 (1.00-1.00) | 0.940 (0.873-0.986) | 0.994 (0.955-1.00) | 0.99 (0.973-1.00) |
| Negative predictive value | 0.996 (0.994-0.998) | 1.00 (1.00-1.00) | 1.00 (1.00-1.00) | 0.999 (0.998-1.00) | 1.00 (0.999-1.00) | 1.00 (1.00-1.00) | 0.989 (0.967-0.995) | 0.978 (0.974-0.982) | 0.992 (0.99-0.995) |
| AUROC | 0.994 (0.988-0.999) | 1.00 (1.00-1.00) | 1.00 (0.997-1.00) | 0.987 (0.965-1.00) | 0.997 (0.986-1.00) | 1.00 (1.00-1.00) | 0.940 (0.895-0.978) | 0.901 (0.883-0.918) | 0.981 (0.956-0.995) |

Abbreviation: AUROC, area under the receiver operating characteristic curve.

<sup>a</sup> CUI2vec with 109 053 tokens and 500 dimensions was used in this analysis.

<sup>b</sup> Values are means from a bootstrapping procedure with 1000 resamples; values in parentheses are the lower and upper bounds of the 95% percentile interval from that procedure.
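To make the bootstrapped metrics concrete, the sketch below computes the confusion-matrix metrics reported in the table and a percentile bootstrap interval over 1000 resamples. It is an illustration under stated assumptions, not the authors' code: the labels are invented, the function names (`metrics`, `bootstrap_ci`) are hypothetical, undefined ratios are set to 0 by convention, the percentile bounds use a simple index into the sorted resampled values, and AUROC is omitted because it requires predicted scores rather than hard labels.

```python
import random

def safe_div(num, den):
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
    return num / den if den else 0.0

def metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n = tp + fp + fn + tn
    accuracy = safe_div(tp + tn, n)
    # Chance agreement for Cohen's kappa, from the marginal totals.
    p_chance = safe_div((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    return {
        "f_score": safe_div(2 * tp, 2 * tp + fp + fn),
        "accuracy": accuracy,
        "kappa": safe_div(accuracy - p_chance, 1 - p_chance),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
    }

def bootstrap_ci(y_true, y_pred, n_resamples=1000, seed=0):
    # Resample cases with replacement, recompute every metric, then take the
    # mean and the 2.5th/97.5th percentile positions of the sorted values.
    rng = random.Random(seed)
    n = len(y_true)
    samples = {}
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
        for name, value in resampled.items():
            samples.setdefault(name, []).append(value)
    summary = {}
    for name, values in samples.items():
        values.sort()
        lower = values[int(0.025 * (n_resamples - 1))]
        upper = values[int(0.975 * (n_resamples - 1))]
        summary[name] = (sum(values) / len(values), lower, upper)
    return summary

# Invented hold-out labels (not study data): 20 positives, 80 negatives.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 18 + [0] * 2 + [1] * 1 + [0] * 79

point = metrics(y_true, y_pred)
ci = bootstrap_ci(y_true, y_pred)
```

Each entry of `ci` is a `(mean, lower, upper)` tuple, mirroring the "mean (95% CI)" cells in the table.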