Jihad S Obeid, Paul M Heider, Erin R Weeda, Andrew J Matuskowitz, Christine M Carr, Kevin Gagnon, Tami Crawford, Stephane M Meystre.
Abstract
Clinical text de-identification enables collaborative research while protecting patient privacy and confidentiality; however, concerns persist that it reduces the utility of the text for information extraction and machine learning tasks. In the context of a deep learning experiment to detect altered mental status in emergency department provider notes, we tested several classifiers on clinical notes in their original form and on their automatically de-identified counterparts. We tested both traditional bag-of-words based machine learning models and word-embedding based deep learning models. We evaluated the models on 1,113 history of present illness notes. A total of 1,795 protected health information tokens were replaced in the de-identification process across all notes. The deep learning models had the best performance, with accuracies of 95% on both original and de-identified notes. However, there was no significant difference in the performance of any of the models on the original vs. the de-identified notes.
Keywords: Data Anonymization; Machine Learning; Natural Language Processing
Year: 2019 PMID: 31437930 PMCID: PMC6779034 DOI: 10.3233/SHTI190228
Source DB: PubMed Journal: Stud Health Technol Inform ISSN: 0926-9630
Breakdown of the Numbers of PHI Tokens Replaced by the De-identification System.
| PHI Token Class | n | % |
|---|---|---|
| Health care unit names | 558 | 31.1% |
| Ages greater than 89 | 512 | 28.5% |
| Dates | 360 | 20.1% |
| Provider names | 128 | 7.1% |
| Patient names | 122 | 6.8% |
| Street or City | 73 | 4.1% |
| State or Country | 16 | 0.9% |
| Phone numbers | 15 | 0.8% |
| Other organization names | 10 | 0.6% |
| Other IDs | 1 | 0.1% |
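
As a quick consistency check on the table above, the following minimal Python sketch recomputes each class's share from the reported counts. The dictionary and function names are illustrative assumptions and are not part of the original study; only the counts come from the table.

```python
# PHI token counts copied from the table above (the names used here are
# illustrative; the source reports only the counts and percentages).
phi_counts = {
    "Health care unit names": 558,
    "Ages greater than 89": 512,
    "Dates": 360,
    "Provider names": 128,
    "Patient names": 122,
    "Street or City": 73,
    "State or Country": 16,
    "Phone numbers": 15,
    "Other organization names": 10,
    "Other IDs": 1,
}


def percentage_breakdown(counts):
    """Return the total count and each class's share, rounded to one decimal."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


total, shares = percentage_breakdown(phi_counts)
print(f"Total PHI tokens replaced: {total}")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The computed total matches the 1,795 replaced tokens stated in the abstract. Because each share is rounded independently, the percentages need not sum to exactly 100.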
List of ICD-9 and ICD-10 Codes Considered to be AMS in the Context of Pulmonary Embolism.
| Code Set | ICD Code | Diagnosis Name |
|---|---|---|
| ICD9 | 780.0x | Alteration of consciousness |
| ICD9 | 780.2 | Syncope and collapse |
| ICD9 | 780.97 | Altered mental status |
| ICD9 | 799.5x | Signs and symptoms involving cognition |
| ICD10 | R40.x | Somnolence, stupor and coma |
| ICD10 | R41.0 | Disorientation, unspecified |
| ICD10 | R41.8x | Other symptoms and signs involving cognitive functions and awareness |
| ICD10 | R41.9 | Unspecified symptoms and signs involving cognitive functions and awareness |
| ICD10 | R55 | Syncope and collapse |
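
A code list like the one above could be applied to coded diagnoses roughly as follows. This is a minimal sketch, not the authors' labeling procedure: it assumes that a trailing `x` in the table stands for any further characters (so `780.0x` covers `780.01`, `780.09`, and so on), and the names `AMS_CODES` and `is_ams_code` are hypothetical.

```python
# Code patterns copied from the table above. ASSUMPTION: a trailing "x"
# means "any suffix", so such entries are treated as prefix matches.
AMS_CODES = {
    "ICD9": ["780.0x", "780.2", "780.97", "799.5x"],
    "ICD10": ["R40.x", "R41.0", "R41.8x", "R41.9", "R55"],
}


def is_ams_code(code_set, code):
    """Return True if `code` matches an AMS pattern in the given code set."""
    code = code.strip().upper()
    for pattern in AMS_CODES.get(code_set, []):
        if pattern.endswith("x"):
            # Wildcard entry: compare against the part before the "x".
            if code.startswith(pattern[:-1]):
                return True
        elif code == pattern:
            # Entry without a wildcard: require an exact match.
            return True
    return False


print(is_ams_code("ICD9", "780.09"))   # wildcard entry 780.0x
print(is_ams_code("ICD10", "R41.82"))  # wildcard entry R41.8x
print(is_ams_code("ICD10", "R41.1"))   # not in the list
```

Entries without a trailing `x` are matched exactly here; whether the study also counted subcodes of those entries is not stated in the source.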
Figure 1. AUC values and 95% confidence intervals for all the models for both original and de-identified (Deid) data.
Performance of models (Acc = accuracy; Prec = precision; Δ = De-identified − Original; 95% CI = 95% confidence interval; other abbreviations are in the text).
| Model | Original AUC (95% CI) | Original Acc | Original Prec | Original Recall | De-identified AUC (95% CI) | De-identified Acc | De-identified Prec | De-identified Recall | Δ AUC | Δ Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | 0.978 (0.974–0.981) | 0.924 | 0.938 | 0.885 | 0.978 (0.975–0.981) | 0.928 | 0.943 | 0.890 | 0.001 | 0.004 |
| LASS | 0.970 (0.966–0.974) | 0.906 | 0.955 | 0.824 | 0.971 (0.967–0.975) | 0.907 | 0.958 | 0.823 | 0.001 | 0.001 |
| SVM | 0.967 (0.962–0.971) | 0.907 | 0.905 | 0.879 | 0.967 (0.963–0.972) | 0.908 | 0.907 | 0.880 | 0.001 | 0.001 |
| MLP | 0.946 (0.940–0.952) | 0.885 | 0.875 | 0.860 | 0.946 (0.940–0.952) | 0.883 | 0.869 | 0.863 | 0.000 | −0.002 |
| NBC | 0.929 (0.922–0.936) | 0.842 | 0.776 | 0.898 | 0.930 (0.923–0.937) | 0.848 | 0.782 | 0.903 | 0.001 | 0.006 |
| SDT | 0.916 (0.907–0.924) | 0.911 | 0.921 | 0.870 | 0.920 (0.911–0.928) | 0.913 | 0.927 | 0.870 | 0.004 | 0.002 |
| CNN D200 | 0.946 | 0.929 | 0.001 | 0.004 | ||||||
| CNN D50 | 0.985 (0.982–0.988) | 0.945 | 0.986 (0.984–0.989) | 0.948 | 0.947 | 0.001 | 0.001 | |||
| CNN W2V | 0.982 (0.979–0.985) | 0.939 | 0.941 | 0.918 | 0.982 (0.979–0.985) | 0.937 | 0.936 | 0.919 | 0.000 | −0.001 |