Murat Sariyar, Irene Schlünder.
Abstract
Sharing data in biomedical contexts has become increasingly relevant, but privacy concerns set constraints for free sharing of individual-level data. Data protection law protects only data relating to an identifiable individual, whereas "anonymous" data are free to be used by everybody. Usage of many terms related to anonymization is often not consistent among different domains such as statistics and law. The crucial term "identification" seems especially hard to define, since its definition presupposes the existence of identifying characteristics, leading to some circularity. In this article, we present a discussion of important terms based on a legal perspective, which is outlined before we present issues related to the usage of terms such as unique "identifiers," "quasi-identifiers," and "sensitive attributes." Based on these terms, we have tried to circumvent a circular definition for the term "identification" by making two decisions: first, deciding which (natural) identifier should stand for the individual; second, deciding how to recognize the individual. In addition, we provide an overview of anonymization techniques/methods for preventing re-identification. The discussion of basic notions related to anonymization shows that there is some work to be done in order to achieve a mutual understanding between legal and technical experts concerning some of these notions. Using a dialectical definition process in order to merge technical and legal perspectives on terms seems important for enhancing mutual understanding.
Keywords: anonymization; data protection; identity; re-identification
Year: 2016 PMID: 27104620 PMCID: PMC5073223 DOI: 10.1089/bio.2015.0100
Source DB: PubMed Journal: Biopreserv Biobank ISSN: 1947-5543 Impact factor: 2.300
Transformed Data Set
| ID | Sex | Year of birth (2-decade range) | Zip code | ICD-10 code |
|---|---|---|---|---|
| 6 | M | 1970–1989 | 1011* | Q90.1 |
| 8 | F | 1950–1969 | 10117 | F31.1 |
| 1 | M | 1970–1989 | 1011* | F31.0 |
| 9 | M | 1970–1989 | 11067 | F31.9 |
| 11 | F | 1950–1969 | 1191* | G50.1 |
| 4 | F | 1970–1989 | 1193* | F34.8 |
| 10 | M | 1970–1989 | 12* | F34.8 |
| 3 | F | 1950–1969 | 12* | F31.9 |
| 2 | M | 1970–1989 | 12* | F31.1 |
| 5 | F | 1950–1969 | 12* | G50.1 |
| 12 | M | 1970–1989 | 13* | Q90.1 |
| 7 | M | 1970–1989 | 13* | Q90.0 |
Records 8 and 9 are suppressed because they would lead to a coarsening of four other records in their zip code. For “year of birth,” a full-domain generalization to “2-decade-range of birth” is made, whereas local generalizations (or recodings) for the zip code are made to achieve a two-anonymous data set with minimal distortion.
An asterisk (*) indicates that digits were omitted.
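The two generalization steps described in the caption can be sketched in Python as follows. This is only an illustrative sketch, not the authors' procedure: the function names `generalize_year`, `truncate_zip`, and `generalize` are hypothetical, anchoring the 20-year ranges at 1950 is an assumption chosen to match the ranges shown in the table, and marking omitted digits with a single `*` is a notation chosen for this example. The two input records are records 6 and 1 of the original data set.

```python
def generalize_year(year: int, width: int = 20, origin: int = 1950) -> str:
    """Full-domain generalization: map a year of birth to a fixed-width range.

    The origin (1950) is an assumption that reproduces the ranges in the table.
    """
    start = origin + ((year - origin) // width) * width
    return f"{start}–{start + width - 1}"


def truncate_zip(zip_code: str, keep: int) -> str:
    """Local generalization: keep the first `keep` digits, mark the rest with '*'."""
    if keep >= len(zip_code):
        return zip_code  # nothing omitted, so no marker is added
    return zip_code[:keep] + "*"


def generalize(record, keep):
    """Apply both generalizations to a (sex, year of birth, zip code) triple."""
    sex, year, zip_code = record
    return (sex, generalize_year(year), truncate_zip(zip_code, keep))


# Records 6 and 1 of the original data set: (sex, year of birth, zip code).
record_6 = ("M", 1980, "10117")
record_1 = ("M", 1979, "10118")

g6 = generalize(record_6, keep=4)
g1 = generalize(record_1, keep=4)

print(g6)
print(g1 == g6)  # True when both records fall into the same equivalence class
```

The year range is computed the same way for every record (full-domain), whereas the `keep` argument can differ from record to record, which is what makes the zip code recoding local.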
Original Data Set Without Any Key Identifiers
| ID | Sex (QID) | Year of birth (QID) | Zip code (QID) | ICD-10 code (SensAttr) |
|---|---|---|---|---|
| 6 | M | 1980 | 10117 | Q90.1 |
| 8 | F | 1966 | 10117 | F31.1 |
| 1 | M | 1979 | 10118 | F31.0 |
| 9 | M | 1988 | 11067 | F31.9 |
| 11 | F | 1965 | 11910 | G50.1 |
| 4 | F | 1983 | 11934 | F34.8 |
| 10 | M | 1973 | 12002 | F34.8 |
| 3 | F | 1967 | 12033 | F31.9 |
| 2 | M | 1989 | 12200 | F31.1 |
| 5 | F | 1959 | 12200 | G50.1 |
| 12 | M | 1976 | 13011 | Q90.1 |
| 7 | M | 1975 | 13135 | Q90.0 |
ID is an irrelevant number for record identification. QIDs are sex, year of birth, and zip code. The sensitive attribute is the ICD-10 code.
F, female; M, male; SensAttr, sensitive attribute; QIDs, quasi-identifiers.
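To make the role of the quasi-identifiers concrete, the following Python sketch groups the records of the original data set by their QID combination (sex, year of birth, zip code) and reports the size of the smallest group. The helper names `equivalence_classes` and `k_anonymity` are hypothetical and introduced only for illustration; the rows are copied from the table above.

```python
from collections import Counter

# (ID, sex, year of birth, zip code, ICD-10 code), copied from the original data set.
records = [
    (6, "M", 1980, "10117", "Q90.1"),
    (8, "F", 1966, "10117", "F31.1"),
    (1, "M", 1979, "10118", "F31.0"),
    (9, "M", 1988, "11067", "F31.9"),
    (11, "F", 1965, "11910", "G50.1"),
    (4, "F", 1983, "11934", "F34.8"),
    (10, "M", 1973, "12002", "F34.8"),
    (3, "F", 1967, "12033", "F31.9"),
    (2, "M", 1989, "12200", "F31.1"),
    (5, "F", 1959, "12200", "G50.1"),
    (12, "M", 1976, "13011", "Q90.1"),
    (7, "M", 1975, "13135", "Q90.0"),
]


def equivalence_classes(rows):
    """Count how many records share each combination of QID values."""
    return Counter((sex, year, zip_code) for _id, sex, year, zip_code, _icd in rows)


def k_anonymity(rows):
    """Return the size of the smallest equivalence class (the k of the data set)."""
    return min(equivalence_classes(rows).values())


classes = equivalence_classes(records)
print(len(classes))          # number of distinct QID combinations
print(k_anonymity(records))  # size of the smallest equivalence class
```

In this data set every QID combination occurs exactly once, so each record forms its own equivalence class. This is why coarsening of the kind applied in the transformed data set is needed before a two-anonymous result can be reached.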