Mohamed Abdalla, Moustafa Abdalla, Frank Rudzicz, Graeme Hirst.
Abstract
OBJECTIVE: In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques).
Keywords: privacy; data anonymization; natural language processing; personal health records
Year: 2020 PMID: 32388549 PMCID: PMC7309261 DOI: 10.1093/jamia/ocaa038
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
An artificial clinical note and the result of applying our technique with 3 different degrees of obfuscation. Our algorithm does not assume proper spelling or grammar in the input. The obfuscated notes are less readable but retain the information that matters for ML applications while covering PHI.
| Note Type | Text |
|---|---|
| Original | arnold smith is a fifty year old male, with a history positive for alcholic cirrhosis, hcv, and variceal bleeds, presenting to the ed with syncope and an inner lip laceration after fall on face |
| Obfuscated | muller doug was another seventy ycar monthold man, wth an hislory equivocal ibr alcohcllc cirhosis, hbv, arid varlceal bdoands, chief restrainting this er wth palpitations however a outer lid lacerations afer falling onthe cheeks |
| Obfuscated | seth joe remains another sixty years olf female, wit another hx positivity forthe abstainer steatohepatitis, ebv, however varicies bleed, chief restrainting its ahc vith presyncopc but acardiogenic supralateral lid abrasion thereafter concussion onthe forehead |
Description of the consultation notes dataset
| Counts | |
|---|---|
| Number of patients | 542 651 |
| Number of notes | 9 051 707 |
| Number of tokens | 949 782 513 |
| Number of unique tokens | 2 612 592 |
Figure 1. Pearson correlations of the intrinsic word embedding test. The baseline is in solid black, outputs from our technique are in shades of grey, and nonclinical sources are in horizontal and vertical grey lines. As shown, increasing the degree of obfuscation with which we randomly sample does not greatly impact the quality of the word embeddings.
Pearson correlations (with 90% confidence intervals in parentheses) of the intrinsic word embedding test, repeated 5 times for each setting of N = 3, 5, and 7 to measure the effect of random shuffling. The conclusion of comparable performance still holds across repetitions. This also indicates that the poor result shown in the body was due to chance in the randomization.
| | Consultation | N = 3 | N = 5 | N = 7 |
|---|---|---|---|---|
| | 0.61 | 0.54 (0.51, 0.56) | 0.64 (0.62, 0.65) | 0.62 (0.61, 0.63) |
| | 0.28 | 0.26 (0.19, 0.32) | 0.26 (0.24, 0.27) | 0.24 (0.19, 0.29) |
| | 0.39 | 0.38 (0.37, 0.39) | 0.39 (0.39, 0.40) | 0.39 (0.38, 0.40) |
| | 0.49 | 0.49 (0.48, 0.49) | 0.49 (0.49, 0.49) | 0.48 (0.48, 0.49) |
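As an illustration of how such a correlation can be computed, the following is a minimal sketch, assuming the intrinsic test compares cosine similarities of word pairs under the embeddings against reference similarity ratings. The function names, the toy vectors, and the ratings are all hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def intrinsic_score(pairs, reference_ratings, embeddings):
    """Correlate model similarities with reference ratings (assumed test design)."""
    model_sims = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
    return pearson(model_sims, reference_ratings)

# Hypothetical toy data, for illustration only.
toy_embeddings = {
    "cirrhosis": [0.9, 0.1], "hepatitis": [0.8, 0.3],
    "syncope": [0.1, 0.9], "laceration": [0.4, 0.6],
}
toy_pairs = [("cirrhosis", "hepatitis"), ("cirrhosis", "syncope"),
             ("syncope", "laceration")]
toy_ratings = [3.5, 1.0, 2.0]
score = intrinsic_score(toy_pairs, toy_ratings, toy_embeddings)
```

Repeating this over several independently obfuscated corpora, as the caption describes, yields a set of correlations from which an interval can be derived.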
Summary of all experiments. The list of models is organized column-wise by task. In parentheses, we give the word embedding algorithm (CBOW or Skipgram) used to randomly replace each token. We also report the size of the nearest-neighbor set of obfuscating tokens from which we randomly sample. Among the obfuscation settings, one is the evaluation on the original unprotected dataset, and in another we varied the size of the nearest-neighbor set for each word between 3 and 14 instead of holding it constant across tokens.
| Obfuscation | Models for ICES diagnostic code classification | Models for MIMIC ICD-9 classification | Models for sentiment analysis |
|---|---|---|---|
| | Logistic regression (CBOW) | CNN (CBOW) | Logistic regression (CBOW) |
| | SVM (CBOW) | CNN with attention (CBOW) | SVM (CBOW) |
| | CNN (CBOW) | LSTM (CBOW) | CNN (CBOW) |
| | Logistic regression (Skipgram) | LSTM with attention (CBOW) | Logistic regression (Skipgram) |
| | SVM (Skipgram) | | SVM (Skipgram) |
| | CNN (Skipgram) | | CNN (Skipgram) |
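The replacement step described in the caption above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the toy embeddings and function names are hypothetical, and excluding the original token from its candidate set, passing out-of-vocabulary tokens through, and treating a set size of 0 as "no obfuscation" are all assumptions made for the sketch.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(token, embeddings, n):
    """Return the n tokens closest to `token` by cosine similarity.

    Assumption: the token itself is excluded from its own candidate set.
    """
    target = embeddings[token]
    others = [t for t in embeddings if t != token]
    others.sort(key=lambda t: cosine(target, embeddings[t]), reverse=True)
    return others[:n]

def obfuscate(tokens, embeddings, n, rng):
    """Replace each known token with a random one of its n nearest neighbors."""
    out = []
    for tok in tokens:
        if n == 0 or tok not in embeddings:
            # Assumption: n == 0 means no obfuscation, and unknown tokens pass
            # through unchanged. A real system would have to handle unknown
            # tokens differently, since passing them through could leak PHI.
            out.append(tok)
        else:
            out.append(rng.choice(nearest_neighbors(tok, embeddings, n)))
    return out

# Hypothetical toy embeddings; real ones would be trained with CBOW or Skipgram.
TOY_EMBEDDINGS = {
    "male": [1.0, 0.1], "man": [0.9, 0.2], "female": [0.8, 0.3],
    "fall": [0.1, 1.0], "falling": [0.2, 0.9], "concussion": [0.3, 0.8],
}

rng = random.Random(0)  # seeded only so the sketch is repeatable
obfuscated = obfuscate(["male", "fall"], TOY_EMBEDDINGS, 2, rng)
```

A per-token set size, as in the variable setting mentioned in the caption, could be approximated by drawing `n` separately for each token, for example with `rng.randint(3, 14)`.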
Figure 2. Absolute percentage change of performance (score) as a function of different obfuscation settings for various tasks, settings, and models. Each model name is broken into 3 parts: 1) the task performed, of which there are 3 (Sent for sentiment classification, MIM for MIMIC III ICD-9 code classification, or ICES for ICES diagnostic code classification); 2) the word embedding representation used to randomly replace the tokens (SG0 for CBOW or SG1 for Skipgram); and 3) the type of model used to classify the texts. More details regarding each of these settings and models can be found in the Supplementary Material.
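For readers unfamiliar with the metric on the vertical axis, the following is a minimal sketch of an absolute percentage change, assuming it is measured relative to the score on the original, unobfuscated data. The function name and the example scores are hypothetical; the exact definition used in the paper may differ.

```python
def absolute_percentage_change(baseline_score, obfuscated_score):
    """Absolute change of a score relative to the baseline, in percent.

    Assumption: the baseline is the score on the original, unobfuscated data.
    """
    if baseline_score == 0:
        raise ValueError("baseline score must be nonzero")
    return abs(obfuscated_score - baseline_score) / abs(baseline_score) * 100.0

# Hypothetical scores, for illustration only.
change = absolute_percentage_change(0.80, 0.76)
```

Because the absolute value is taken, the measure shows how far performance moved under obfuscation but not whether it rose or fell.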