| Literature DB >> 32673230 |
Mohamed Abdalla, Moustafa Abdalla, Graeme Hirst, Frank Rudzicz.
Abstract
BACKGROUND: Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models.
Keywords: data anonymization; natural language processing; personal health records; privacy
Year: 2020 PMID: 32673230 PMCID: PMC7391163 DOI: 10.2196/18055
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. Process flow for gathering and preparing the clinical notes for embedding generation and experimentation.
Figure 2. Process flow for generating word embeddings and performing the name reconstruction experiment.
Figure 3. Relationship between frequency of name occurrence and the average difference between the in-group and out-group for patients. This graph was generated from an experiment run on a GloVe model with a dimension of 100, window of 10, learning rate of 0.05, minimum occurrence of 1, and alpha of 0.75.
Figure 4. Process flow for generating word embeddings and performing statistical testing. For population-level statistical testing, we performed a Wilcoxon signed-rank test, and for patient-level statistical testing, we calculated empirical P values using 1000 randomly generated permutations.
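The patient-level permutation test described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and the two-sided test statistic (difference of group means under random relabeling) are assumptions.

```python
import numpy as np

def empirical_p_value(in_group, out_group, n_permutations=1000, seed=0):
    """Empirical P value for the difference between a patient's in-group
    and out-group distances, via random permutations of group labels.
    Illustrative sketch; names and statistic are assumptions."""
    rng = np.random.default_rng(seed)
    observed = np.mean(in_group) - np.mean(out_group)
    pooled = np.concatenate([in_group, out_group])
    k = len(in_group)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled distances
        perm_diff = np.mean(pooled[:k]) - np.mean(pooled[k:])
        if abs(perm_diff) >= abs(observed):
            count += 1
    return count / n_permutations
```

With 1000 permutations, the smallest resolvable P value is 0.001, which matches the resolution implied by the description above.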
The number and percentage of paired tokens that are part of true names as a function of context window size, using the cosine distance metric, for the first 600 paired tokens sorted in ascending order of distance.
| Context window size | Skipgram names, n (%) | CBOWa names, n (%) | GloVeb names, n (%) |
| 1 | 51 (8.5) | 17 (2.8) | 8 (1.3)c |
| 3 | 369 (61.5) | 265 (44.2) | 158 (26.3) |
| 5 | 393 (65.6) | 323 (53.8) | 278 (46.3) |
| 7 | 410 (68.3) | 331 (55.2) | 317 (52.8) |
| 9 | 411 (68.5) | 340 (56.7) | 323 (53.8) |
aCBOW: Continuous Bag of Words.
bGloVe: Global Vectors.
cResult not significant after correcting for multiple comparisons using the Holm-Bonferroni correction.
Figure 5. Visual representation of the percentage of paired names belonging to true names from the first 600 paired tokens when sorted in ascending order.
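The name reconstruction step summarized in the table above ranks candidate token pairs by cosine distance and keeps the closest 600. A minimal sketch of that ranking, assuming two embedding matrices (one row per candidate first-name token and last-name token; all names here are illustrative):

```python
import numpy as np

def closest_token_pairs(first_vecs, last_vecs, top_k=600):
    """Rank (first-name, last-name) token pairs by cosine distance,
    ascending, and return the row/column indices of the top_k closest.
    Illustrative sketch of the pairing step, not the authors' code."""
    # Normalize rows so a dot product equals cosine similarity.
    f = first_vecs / np.linalg.norm(first_vecs, axis=1, keepdims=True)
    l = last_vecs / np.linalg.norm(last_vecs, axis=1, keepdims=True)
    dist = 1.0 - f @ l.T  # cosine distance matrix, one entry per pair
    flat = np.argsort(dist, axis=None)[:top_k]  # ascending order
    return [tuple(map(int, divmod(i, dist.shape[1]))) for i in flat]
```

Each returned pair could then be checked against the list of true patient names to produce the counts reported in the table.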
Difference between the in-group and out-group as a function of context window size for various word embedding algorithms using the cityblock distance metric. The differences are relative distances between word embedding vectors in an n-dimensional space.
| Context window sizea | Skipgram difference | CBOWb difference | GloVec difference |
| 1 | 3.91 | 7.59 | 4.85 |
| 3 | 2.88 | 28.53 | 5.69 |
| 5 | 2.33 | 39.55 | 5.45 |
| 7 | 1.84 | 47.10 | 5.12 |
| 9 | 1.51 | 51.61 | 5.54 |
aAll differences were statistically significant after correcting for multiple comparisons.
bCBOW: Continuous Bag of Words.
cGloVe: Global Vectors.
Figure 6. Visualization of the difference between the in-group and the out-group as a function of context window size for various word embedding algorithms using the cityblock distance metric.
The number and percentage of patients whose diagnoses are identifiable due to a statistically significant difference between the in-group and out-group as a function of context window size for various word embedding algorithms using the cityblock distance metric.
| Context window size | Skipgram patients, n (%) | CBOWa patients, n (%) | GloVeb patients, n (%) |
| 1 | 49 (7.7) | 77 (12.1) | 400 (62.7) |
| 3 | 41 (6.4) | 149 (23.4) | 401 (62.8) |
| 5 | 33 (5.2) | 152 (23.8) | 403 (63.2) |
| 7 | 16 (2.5) | 153 (24.0) | 380 (59.6) |
| 9 | 12 (1.9) | 153 (24.0) | 449 (70.4) |
aCBOW: Continuous Bag of Words.
bGloVe: Global Vectors.
Figure 7. Visualization of the percentage of patients who have a significant difference between their in- and out-groups as a function of context window size for multiple word embedding algorithms using the cityblock distance metric.
The percentage of times a word embedding–based attack beats the majority baseline for A@1 and A@5, for various context window sizes, over 1000 random diagnosis selections.
| Context window sizea | Skipgram A@1, A@5 | CBOWb A@1, A@5 | GloVec A@1, A@5 |
| 1 | 55.8, 56.7 | 61.8, 61.8 | 55.4, 56.9 |
| 3 | 55.6, 53.1 | 51.2, 52.6 | 60.5, 59.5 |
| 5 | 57.4, 55.6 | 53.6, 54.5 | 59.4, 57.2 |
| 7 | 57.4, 53.5 | 54.6, 53.9 | 55.9, 54.0 |
| 9 | 57.2, 53.2 | 53.7, 51.2 | 60.6, 56.7 |
aWe observed that the majority baseline is surpassed consistently and up to 60% of the time.
bCBOW: Continuous Bag of Words.
cGloVe: Global Vectors.
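The A@k metric in the table above credits the attack when the true diagnosis falls among the k candidates closest to the patient in embedding space. A minimal sketch of that check (function and parameter names are assumptions, not the authors' code):

```python
import numpy as np

def accuracy_at_k(distances, true_index, k):
    """A@k: whether the candidate at true_index is among the k
    candidates with the smallest distance. Illustrative sketch."""
    top_k = np.argsort(distances)[:k]  # indices of the k closest candidates
    return bool(true_index in top_k)
```

Averaging this indicator over 1000 random diagnosis selections, and comparing it against always guessing the majority diagnosis, would yield the percentages reported above.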