Brihat Sharma, Dmitriy Dligach, Kristin Swope, Elizabeth Salisbury-Afshar, Niranjan S Karnik, Cara Joyce, Majid Afshar.
Abstract
BACKGROUND: Automated de-identification methods for removing protected health information (PHI) from the source notes of the electronic health record (EHR) rely on building systems to recognize mentions of PHI in text, but they remain inadequate at ensuring perfect PHI removal. As an alternative to relying on de-identification systems, we propose the following solutions: (1) mapping the corpus of documents to a standardized medical vocabulary (concept unique identifier [CUI] codes mapped from the Unified Medical Language System), thus eliminating PHI as inputs to a machine learning model; and (2) training character-based machine learning models that obviate the need for a dictionary containing input words/n-grams. We aim to test the performance of models with and without PHI in a use case for an opioid misuse classifier.
Keywords: Computable phenotype; Heroin; Machine learning; Natural language processing; Opioid misuse; Opioid use disorder
Year: 2020 PMID: 32349766 PMCID: PMC7191715 DOI: 10.1186/s12911-020-1099-y
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 3.298
Fig. 1 PHI-free and PHI-laden inputs to a machine learning model, with an example of a convolutional neural network using an embedding with Concept Unique Identifiers (CUIs)
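The two input strategies named in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical three-entry lexicon (`TOY_LEXICON`) in place of a real UMLS concept extractor, which this record does not name; the three CUI codes are taken from the coefficient table in this record. The function names and the alphabet are illustrative assumptions, not the authors' implementation.

```python
import re
from typing import List

# Hypothetical toy lexicon (phrase -> CUI). A real pipeline would run a
# clinical concept extractor over the full UMLS; this is only a stand-in.
TOY_LEXICON = {
    "heroin": "C0011892",
    "methadone": "C0025605",
    "chronic pain": "C0150055",
}

def text_to_cuis(note: str) -> List[str]:
    """Return CUIs for lexicon phrases found in the note, in order of appearance."""
    lowered = note.lower()
    hits = []
    for phrase, cui in TOY_LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(phrase) + r"\b", lowered):
            hits.append((match.start(), cui))
    # Only concept codes survive; words outside the lexicon (names, dates) are dropped.
    return [cui for _, cui in sorted(hits)]

def text_to_char_ids(note: str,
                     alphabet: str = "abcdefghijklmnopqrstuvwxyz0123456789 ") -> List[int]:
    """Encode each character as an integer id; 0 marks characters outside the alphabet."""
    index = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return [index.get(ch, 0) for ch in note.lower()]

# Invented example sentence, not taken from any patient record.
note = "John Doe reports chronic pain and prior heroin use; now on methadone."
print(text_to_cuis(note))          # concept codes only, no patient name
print(text_to_char_ids("ab!"))     # fixed alphabet, no word dictionary needed
```

In the CUI path the model never sees the raw words, so a name in the note cannot become a feature. In the character path the vocabulary is a small fixed alphabet rather than a dictionary of words or n-grams learned from the notes.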
Machine learning models with hyperparameters
| Model | Hyper-parameters |
|---|---|
| Logistic Regression-CUIs | C = 1, penalty = L1, class_weight = balanced |
| Logistic Regression-Words | C = 1, penalty = L1, class_weight = balanced |
| Convolutional Neural Network-CUIs | Filters = 1024, Filter Size = 1, Dropout = 0.5, Units = 1024, Learning Rate = 0.0001 |
| Convolutional Neural Network-Words | Filters = 1024, Filter Size = 3, Dropout = 0.25, Units = 128, Learning Rate = 0.0001 |
| Convolutional Neural Network-Character | Filters = 1024, Filter Size = 11, Dropout = 0.25, Units = 1024, Learning Rate = 0.0001 |
| Deep Averaging Network-CUIs | Dropout = 0.25, Units in layer 1 = 2048, Units in layer 2 = 512, Learning Rate = 0.001 |
| Deep Averaging Network-Words | Dropout = 0.75, Units = 128, Learning Rate = 0.001 |
| Max Pooling Network-CUIs | Dropout = 0.5, Units = 128, Learning Rate = 0.001 |
| Max Pooling Network-Words | Dropout = 0.5, Units = 128, Learning Rate = 0.001 |
| Deep Averaging + Max Pooling Network-CUIs | Dropout = 0.5, Units = 1024, Learning Rate = 0.001 |
| Deep Averaging + Max Pooling Network-Words | Dropout = 0.25, Units = 512, Learning Rate = 0.001 |
For logistic regression, C is the inverse of the regularization strength, and penalty specifies the regularization term added to the loss function (here L1). The Adam optimizer was used for all neural networks. Units is the number of neurons in the dense layer of the neural network
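To make the logistic regression settings concrete, here is a small sketch of what `class_weight = balanced` and an L1 penalty with `C = 1` amount to. The parameter names match those of scikit-learn's `LogisticRegression`, although this record does not name the library, so treat that as an assumption. The label vector is hypothetical, and the penalty is written as (1/C)·Σ|β|, which matches the usual formulation only up to how the overall objective is scaled.

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' heuristic: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}

def l1_penalty(coefs, C=1.0):
    """L1 regularization term, with C as the inverse regularization strength."""
    return sum(abs(b) for b in coefs) / C

# Hypothetical labels: 3 positive notes among 12 (illustration only).
labels = [1] * 3 + [0] * 9
weights = balanced_class_weights(labels)
print(weights)                            # the rarer class gets the larger weight
print(l1_penalty([1.0, -2.0, 0.0], C=1.0))
```

A smaller C makes the penalty term larger, which pushes more coefficients to exactly zero under L1.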
Comparison of classifiers for opioid misuse
| Classifier | ROC AUC (95% CI) | F1 | Precision/PPV (95% CI) | Recall/Sensitivity (95% CI) | Specificity (95% CI) | NPV (95% CI) | Hosmer-Lemeshow p-value* |
|---|---|---|---|---|---|---|---|
| Rule-based | NA^a | 0.76 | 0.68 (0.57, 0.78) | 0.87 (0.76, 0.94) | 0.79 (0.71, 0.86) | 0.92 (0.85, 0.96) | < 0.01 |
| Logistic Regression CUI | 0.91 (0.86, 0.95) | 0.79 | 0.89 (0.77, 0.96) | 0.71 (0.58, 0.81) | 0.95 (0.90, 0.98) | 0.86 (0.80, 0.91) | 0.06 |
| Logistic Regression Word | 0.91 (0.86, 0.95) | 0.72 | 0.86 (0.73, 0.94) | 0.62 (0.49, 0.73) | 0.95 (0.89, 0.98) | 0.83 (0.76, 0.88) | < 0.01 |
| Convolutional Neural Network CUI | 0.93 (0.90, 0.97) | 0.81 | 0.82 (0.70, 0.90) | 0.79 (0.68, 0.88) | 0.91 (0.85, 0.95) | 0.89 (0.83, 0.94) | 0.51 |
| Convolutional Neural Network Word | 0.94 (0.91, 0.98) | 0.84 | 0.94 (0.85, 0.99) | 0.75 (0.63, 0.85) | 0.98 (0.93, 1.00) | 0.88 (0.82, 0.93) | 0.42 |
| Convolutional Neural Network Character | 0.93 (0.90, 0.97) | 0.79 | 0.88 (0.76, 0.95) | 0.72 (0.60, 0.82) | 0.95 (0.89, 0.98) | 0.87 (0.80, 0.92) | < 0.01 |
| Deep Averaging Network CUI | 0.83 (0.78, 0.88) | 0.74 | 0.68 (0.57, 0.78) | 0.87 (0.76, 0.94) | 0.79 (0.71, 0.86) | 0.92 (0.85, 0.96) | < 0.01 |
| Deep Averaging Network Word | 0.80 (0.74, 0.86) | 0.49 | 0.74 (0.56, 0.87) | 0.37 (0.25, 0.49) | 0.93 (0.87, 0.97) | 0.74 (0.67, 0.80) | < 0.01 |
| Max Pooling Network CUI | 0.93 (0.89, 0.96) | 0.79 | 0.85 (0.73, 0.93) | 0.74 (0.61, 0.83) | 0.93 (0.87, 0.97) | 0.87 (0.80, 0.92) | 0.60 |
| Max Pooling Network Word | 0.91 (0.86, 0.96) | 0.78 | 0.87 (0.76, 0.95) | 0.71 (0.58, 0.81) | 0.95 (0.89, 0.98) | 0.86 (0.79, 0.91) | 0.36 |
| Deep Averaging + Max Pooling Network CUI | 0.94 (0.91, 0.97) | 0.81 | 0.92 (0.82, 0.98) | 0.72 (0.60, 0.82) | 0.97 (0.92, 0.99) | 0.87 (0.80, 0.92) | < 0.01 |
| Deep Averaging + Max Pooling Network Word | 0.94 (0.91, 0.97) | 0.78 | 0.86 (0.74, 0.94) | 0.72 (0.60, 0.82) | 0.94 (0.88, 0.97) | 0.87 (0.80, 0.92) | 0.09 |
Logistic regression with a combination of unigrams and bigrams. PPV = positive predictive value; NPV = negative predictive value; ROC AUC = area under the receiver operating characteristic curve; CUI = concept unique identifier; CI = confidence interval
* Model fit was assessed by the Hosmer-Lemeshow goodness-of-fit test; p > 0.05 indicates that the model fits the data well
a NA: not applicable, because the rule-based classifier gives binary (0/1) predictions without predicted probabilities, so no ROC curve can be plotted
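The threshold metrics in the table all follow from a 2×2 confusion matrix, and F1 is the harmonic mean of precision and recall. The sketch below assumes hypothetical counts for the first function (the record does not report the confusion matrices) and then recomputes F1 from the rounded precision and recall reported in two rows of the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision/PPV, recall/sensitivity, specificity, NPV and F1 from 2x2 counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "npv": npv, "f1": f1}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, tn=18, fn=4)
print({name: round(value, 2) for name, value in example.items()})

# Consistency check against reported (rounded) table values.
print(round(f1_from(0.68, 0.87), 2))  # rule-based row, reported F1 = 0.76
print(round(f1_from(0.89, 0.71), 2))  # logistic regression CUI row, reported F1 = 0.79
```

Because the table values are rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit for some rows.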
Fig. 2 Receiver operating characteristic area under the curve for the convolutional neural network model using concept unique identifiers (CUIs) for classification of opioid misuse. CNN = convolutional neural network; AUC = area under the curve
Fig. 3 Calibration plot for the top-performing machine learning classifiers for opioid misuse. The diagonal line represents perfect calibration between observed (y-axis) and predicted (x-axis) probabilities. CNN = convolutional neural network; CUIs = concept unique identifiers; LR = logistic regression; MPN = max pooling network
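The points of a calibration plot like the one described in Fig. 3 can be computed by binning predicted probabilities and comparing each bin's mean prediction with the observed fraction of positives. This is a minimal sketch using equal-width bins and hypothetical predictions; the binning scheme used for the figure is not stated in this record.

```python
def calibration_bins(probs, outcomes, n_bins=5):
    """Return (mean predicted, observed fraction, count) for each non-empty bin."""
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, positives, count]
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)     # p == 1.0 goes in the last bin
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in sums if n > 0]

# Hypothetical predicted probabilities and true labels (illustration only).
probs = [0.1, 0.1, 0.3, 0.3, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 1]
bins = calibration_bins(probs, outcomes)
for mean_pred, observed, n in bins:
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} n={n}")
```

Points on the diagonal (predicted equal to observed) indicate good calibration; points below it mean the model overestimates risk in that bin.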
Concept Unique Identifiers (CUIs) for opioid misuse from logistic regression classifier and their β coefficients
| CUI | Related text | β coefficients |
|---|---|---|
| C0011892 | Heroin | 16.57 |
| C0344198 | Victim of abuse (finding) | 12.70 |
| C0562381 | Cocaine | 4.39 |
| C0025605 | Methadone | 4.19 |
| C0376196 | Opiates | 4.09 |
| C0001927 | Albuterol | 2.40 |
| C0728755 | Dilaudid | 1.73 |
| C0029944 | Drug Overdose | 1.34 |
| C0030049 | Oxycodone | 1.12 |
| C0150055 | Chronic pain | 0.47 |
| C0040861 | Triage | 0.47 |
| C1299583 | Independently able | 0.19 |
| C0022742 | Knee | 0.02 |
| C0002903 | Anesthesia procedures | −2.08 |
| C0003483 | Aorta | −1.51 |
| C0006826 | Malignant Neoplasms | −1.50 |
| C1272883 | Injection | −1.36 |
| C0006434 | Burn injury | −0.71 |
| C0020538 | Hypertensive disease | −0.42 |
| C0021641 | Insulin | −0.09 |
| C0004604 | Back Pain | −0.01 |
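In a logistic regression, each β coefficient is the change in the log-odds of the positive class per unit increase in that feature, so exp(β) is the corresponding odds ratio. The sketch below uses a subset of the coefficients from the table. It assumes 0/1 presence indicators as feature values and a placeholder intercept of 0.0, neither of which is reported in this record, so the resulting scores are illustrative rather than the model's actual predictions.

```python
import math

# Subset of beta coefficients copied from the table above.
BETAS = {
    "C0011892": 16.57,   # Heroin
    "C0025605": 4.19,    # Methadone
    "C0150055": 0.47,    # Chronic pain
    "C0006826": -1.50,   # Malignant Neoplasms
}

def odds_ratio(beta):
    """Multiplicative change in odds per unit increase of the feature."""
    return math.exp(beta)

def log_odds(features, betas, intercept=0.0):
    """Intercept plus sum of beta * x; the intercept here is a hypothetical placeholder."""
    return intercept + sum(betas.get(cui, 0.0) * x for cui, x in features.items())

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical note containing methadone and a malignant neoplasm mention.
features = {"C0025605": 1, "C0006826": 1}
score = log_odds(features, BETAS)
print(f"log-odds={score:.2f} probability={sigmoid(score):.3f}")
print(f"odds ratio for chronic pain: {odds_ratio(BETAS['C0150055']):.2f}")
```

Positive coefficients such as those for heroin or methadone raise the predicted odds of opioid misuse, while negative ones such as malignant neoplasms lower them.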