Kurt Schmidlin, Kerri M Clough-Gorr, Adrian Spoerri.
Abstract
BACKGROUND: Record linkage of existing individual health care data is an efficient way to answer important epidemiological research questions. Reuse of individual health-related data faces several problems: either a unique personal identifier, such as a social security number, is not available, or non-unique person-identifiable information, such as names, is privacy protected and cannot be accessed. One way to protect privacy in probabilistic record linkage is to encrypt this sensitive information. Unfortunately, the encrypted hash codes of two names differ completely even if the plain names differ by only a single character. Standard encryption methods therefore cannot be applied. To overcome these challenges, we developed the Privacy Preserving Probabilistic Record Linkage (P3RL) method.
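The point about hash codes can be illustrated with a minimal sketch. The abstract does not name a hash function, so SHA-256 is an assumption here, and the helper names `name_digest` and `differing_bits` are hypothetical. The sketch hashes two spellings of a name that differ by one character and counts how many digest bits differ.

```python
import hashlib


def name_digest(name: str) -> str:
    """Return the SHA-256 hex digest of a trimmed, lower-cased name.

    SHA-256 is an illustrative choice, not the function used in the paper.
    """
    return hashlib.sha256(name.strip().lower().encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


a = name_digest("Schmidlin")
b = name_digest("Schmidlim")  # same name with a one-character typo

print(a)
print(b)
print(differing_bits(a, b), "of 256 bits differ")
```

Because the two digests look unrelated, a similarity score computed on them carries no information about how close the plain names were, which is why exact-match hashing does not support probabilistic linkage on its own.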
Year: 2015 PMID: 26024886 PMCID: PMC4460842 DOI: 10.1186/s12874-015-0038-6
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1 Basic steps of Privacy Preserving Probabilistic Record Linkage (P3RL)
Fig. 2 Flowchart of Privacy Preserving Probabilistic Record Linkage (P3RL) methods
Fig. 3 Example of Privacy Preserving Probabilistic Record Linkage (P3RL) masking and shuffling procedures
Fig. 4 Example of Privacy Preserving Probabilistic Record Linkage (P3RL) pre-processing data cleaning rules
Fig. 5 Example of Bloom filter encryption for surname (bigrams, two hash functions, Bloom filter length 28 bits)
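The Bloom filter encoding named in the Fig. 5 caption can be sketched as follows, using the caption's parameters (bigrams, two hash functions, 28 bits). Everything else is an assumption made for illustration: the figure itself is not reproduced here, so the padding character, the double-hashing scheme built from SHA-1 and SHA-256, and the function names `ngrams` and `bloom_encode` are hypothetical and may differ from the authors' in-house software.

```python
import hashlib


def ngrams(text: str, n: int = 2) -> list[str]:
    """Split a name into overlapping n-grams, padded with '_' (assumed padding)."""
    padded = "_" * (n - 1) + text.strip().lower() + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bloom_encode(text: str, n: int = 2, num_hashes: int = 2,
                 num_bits: int = 28) -> list[int]:
    """Encode a name as a Bloom filter bit array.

    Each n-gram sets num_hashes bits. The i-th bit position is derived by
    double hashing, (h1 + i * h2) mod num_bits, which is an assumed scheme.
    """
    bits = [0] * num_bits
    for gram in ngrams(text, n):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(num_hashes):
            bits[(h1 + i * h2) % num_bits] = 1
    return bits


print(ngrams("Smith"))
print("".join(str(bit) for bit in bloom_encode("Smith")))
```

Two names that share most of their bigrams set mostly the same bits, so their bit arrays stay similar even when the names differ by a character.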
P3RL - Computational requirements of Masking and Shuffling, Pre-processing (100,000 records) and Linkage (100,000 records table A and 50,000 records table B)
| Step | Plain linkage | P3RL linkage - Encrypted names | P3RL linkage - Encrypted dates |
|---|---|---|---|
| Masking and shuffling | - | 11 variables to mask and shuffle | 11 variables to mask and shuffle |
| Pre-processing | - | 3 name variables to pre-process | 2 date variables to pre-process |
| Encryption | - | 4 name variables to encrypt (trigrams, 10 hash functions, bit array size 800) | 2 date variables to encrypt |
| Linkage | 13 plain variables, 63 min | 9 plain variables, 8 Bloom filter array comparisons | 13 plain variables, 72 min |
Tests were performed on a desktop computer with an Intel® Xeon® CPU (4 cores, 64-bit, 3 GHz), 12 GB RAM, and the Windows 7 Professional 64-bit operating system. These estimates were derived using in-house software for masking and encryption, KNIME for pre-processing, and G-LINK for linkage. G-LINK is the latest desktop version of the linkage software developed by Statistics Canada (formerly GRLS).
Estimates may vary widely with other programs and/or hardware.
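A Bloom filter array comparison of the kind counted in the table can be sketched with the encrypted-names parameters listed there (trigrams, 10 hash functions, bit array size 800). The similarity measure is not stated in this record, so the Dice coefficient is an assumption, as are the hashing scheme and the names `bloom_positions` and `dice`. The filter is represented as the set of bit positions that are set to 1.

```python
import hashlib


def bloom_positions(text: str, n: int = 3, num_hashes: int = 10,
                    num_bits: int = 800) -> set[int]:
    """Return the set of bit positions set to 1 for a name (assumed scheme)."""
    padded = "_" * (n - 1) + text.strip().lower() + "_" * (n - 1)
    positions: set[int] = set()
    for i in range(len(padded) - n + 1):
        data = padded[i:i + n].encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        positions.update((h1 + k * h2) % num_bits for k in range(num_hashes))
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit-position sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


typo = dice(bloom_positions("Schmidlin"), bloom_positions("Schmidlim"))
other = dice(bloom_positions("Schmidlin"), bloom_positions("Spoerri"))
print(f"one-character typo: {typo:.2f}")
print(f"different surname:  {other:.2f}")
```

A score of this kind can be passed to a probabilistic linkage step as a partial-agreement weight in place of a plain-text string comparison.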