Khaled El Emam, Elizabeth Jonker, Luk Arbuckle, Bradley Malin.
Abstract
BACKGROUND: Privacy legislation in most jurisdictions allows the disclosure of health data for secondary purposes without patient consent if it is de-identified. Some recent articles in the medical, legal, and computer science literature have argued that de-identification methods do not provide sufficient protection because they are easy to reverse. Should this be the case, it would have significant implications for how health information is disclosed, including: (a) potentially limiting its availability for secondary purposes such as research, and (b) resulting in more identifiable health information being disclosed. Our objectives in this systematic review were to: (a) characterize known re-identification attacks on health data and contrast them with re-identification attacks on other kinds of data, (b) compute the overall proportion of records that have been correctly re-identified in these attacks, and (c) assess whether these demonstrate weaknesses in current de-identification methods. METHODS AND
Year: 2011 PMID: 22164229 PMCID: PMC3229505 DOI: 10.1371/journal.pone.0028071
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The 18 elements in the HIPAA Privacy Rule Safe Harbor standard that must be removed or generalized for a data set to be considered de-identified (see 45 CFR 164.514(b)(2)(i)).
| The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed: |
| (A) Names; |
| (B) All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000. |
| (C) All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older; |
| (D) Telephone numbers; |
| (E) Fax numbers; |
| (F) Electronic mail addresses; |
| (G) Social security numbers; |
| (H) Medical record numbers; |
| (I) Health plan beneficiary numbers; |
| (J) Account numbers; |
| (K) Certificate/license numbers; |
| (L) Vehicle identifiers and serial numbers, including license plate numbers; |
| (M) Device identifiers and serial numbers; |
| (N) Web Universal Resource Locators (URLs); |
| (O) Internet Protocol (IP) address numbers; |
| (P) Biometric identifiers, including finger and voice prints; |
| (Q) Full face photographic images and any comparable images; and |
| (R) Any other unique identifying number, characteristic, or code. |
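
Rules (B) and (C) above are the two that call for generalization rather than outright removal, so a small example may help. The following is a minimal sketch, not a compliant implementation: the function names, the record layout, and the population counts are all hypothetical, and a real system would need the current Census figures that rule (B) refers to.

```python
from datetime import date


def generalize_zip(zip_code: str, population_by_zip3: dict) -> str:
    """Rule (B): keep the first three digits, or use 000 for sparse areas."""
    zip3 = zip_code[:3]
    # Prefixes missing from the lookup are treated as sparse (an assumption).
    if population_by_zip3.get(zip3, 0) > 20_000:
        return zip3
    return "000"


def generalize_date(d: date) -> int:
    """Rule (C): drop every element of a date except the year."""
    return d.year


def generalize_age(age: int) -> str:
    """Rule (C): aggregate all ages over 89 into a single category."""
    return "90+" if age > 89 else str(age)


# Hypothetical population counts per three-digit zip prefix (illustrative only).
population = {"021": 1_200_000, "036": 15_000}

# Hypothetical input record.
record = {"zip": "02139", "admitted": date(2010, 3, 14), "age": 93}

deidentified = {
    "zip3": generalize_zip(record["zip"], population),
    "admission_year": generalize_date(record["admitted"]),
    "age": generalize_age(record["age"]),
}
print(deidentified)
```

The sketch does not cover the part of rule (C) that also suppresses years indicative of an age over 89 (for example, a birth year), nor any of the identifiers in (A) and (D) through (R), which are removed rather than generalized.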
Figure 1. PRISMA diagram.
PRISMA diagram summarizing the steps involved in the systematic review of the re-identification attack literature.
A summary of successful re-identification attacks on the evaluation criteria.
| ID | Study | Pub Year § | Health data included? | Profession of adversary | Number of individuals re-identified | Country of adversary | Proper de-identification of attacked data? | Re-identification verified? |
|  |  | 2001 | No | Researchers | 29 of 273 | Germany | “Factually anonymous” | Yes (records containing insurance numbers only) |
|  |  | 2001 | No | Researchers | 75% of 11,000 | USA | Direct identifiers removed | No |
|  |  | 2002 | Yes | Researcher | 1 of 135,000 | USA | Removal of names and addresses | Yes |
|  |  | 2003 | No | Researchers | 219 unique matches, 112 with 2 possibilities, 8 confirmed | UK | Yes | Verified matches, but not identities |
|  |  | 2006 | No | Journalist | 1 of 657,000 | USA | No | Yes (with individual) |
|  |  | 2006 | Yes | Researchers | 79% of 550 | USA | No | Verified (with original data set) |
|  |  | 2006 | No | Researchers | Of 133 users, 60% of those who mention at least 8 movies | USA | Direct identifiers removed | No |
|  |  | 2006 | Yes | Expert Witness | 18 of 20 | USA | Only type of cancer, zip code and date of diagnosis included in request | Yes (verified by the Department of Health) |
|  |  | 2007 | No | Researchers | 2,400 of 4.4 million | USA | Identifying information removed | Verified using original data |
|  |  | 2007 | Yes | Broadcaster | 1 | Canada | Direct identifiers removed & possibly other unknown de-id methods used | Yes |
|  |  | 2008 | No | Researchers | 2 of 50 | USA | Direct identifiers removed + maybe perturbation | No |
|  |  | 2009 | Yes | Researcher | 1 of 3,510 | Canada | Direct identifiers removed | Yes |
|  |  | 2009 | No | Researchers | 30.8% of 150 pairs of nodes | USA | Identifying information removed | Verified using ground-truth mapping of the 2 networks |
|  |  | 2010 | Yes | Researchers | 2 of 15,000 | USA | Yes - HIPAA Safe Harbor | Yes |
(§ This is the first year in which the report or article appeared. Some of the reports we cite have been updated at later dates, and some describe re-identification attacks that may have occurred in earlier years. Since the original results appeared in 2010, a second article has been published.)
Figure 2. Caterpillar plot (all studies).
Caterpillar plot of the individual mean and confidence intervals for all studies with overall mean proportion.
Figure 3. Caterpillar plot (health studies).
Caterpillar plot of the individual mean and confidence intervals for health studies with overall mean proportion.
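
To make the quantities in the caterpillar plots concrete, the sketch below computes a per-study proportion and a 95% confidence interval for a few rows of the attack summary table that report explicit counts. The Wilson score interval is an assumption chosen for illustration; the text here does not say which interval method the authors used, and the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Return (proportion, lower, upper) using the Wilson score interval.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, centre - half_width, centre + half_width


# (re-identified, total) pairs copied from rows that give explicit counts.
studies = [(29, 273), (1, 135_000), (18, 20), (2, 15_000)]

results = [wilson_interval(k, n) for k, n in studies]
for (k, n), (p, lower, upper) in zip(studies, results):
    print(f"{k} of {n}: p={p:.5f}, 95% CI=({lower:.5f}, {upper:.5f})")
```

Sorting `results` by proportion and drawing each interval as a horizontal bar would give a plot of the same general shape, although the pooled mean shown in the figures needs a meta-analytic model that this sketch does not attempt.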
Figure 4. Sensitivity (all studies).
The number of new studies with success rates below/above the current mean that would need to be performed to significantly change the current mean for all studies.
Figure 5. Sensitivity (health studies).
The number of new studies with success rates below/above the current mean that would need to be performed to significantly change the current mean for health studies.
Figure 6. Funnel plot (all studies).
Funnel plot showing the proportion of records re-identified in all studies against standard error. The points were slightly jittered to reveal overlap.
Figure 7. Funnel plot (health studies).
Funnel plot showing the proportion of records re-identified in health studies against standard error. The points were slightly jittered to reveal overlap.
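
As a rough illustration of the coordinates behind such a funnel plot, the sketch below pairs each study's proportion with a standard error and applies a small horizontal jitter. The binomial standard error, the jitter width of 0.005, and the fixed seed are assumptions made for this example; the captions do not state how the authors computed either quantity.

```python
import math
import random


def standard_error(successes: int, total: int) -> float:
    """Binomial standard error of a proportion: sqrt(p * (1 - p) / n)."""
    p = successes / total
    return math.sqrt(p * (1 - p) / total)


def jitter(value: float, scale: float, rng: random.Random) -> float:
    """Shift a value by a small uniform offset so overlapping points separate."""
    return value + rng.uniform(-scale, scale)


rng = random.Random(0)  # fixed seed so the jitter is reproducible

# Health-data rows from the summary table that report explicit counts.
health_studies = [(1, 135_000), (18, 20), (1, 3_510), (2, 15_000)]

# Each point is (jittered proportion, standard error).
points = [
    (jitter(k / n, 0.005, rng), standard_error(k, n))
    for k, n in health_studies
]
```

Because three of these studies have proportions very close to zero, their unjittered points would sit almost on top of one another, which is the overlap the jitter is meant to reveal.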