Ishna Neamatullah, Margaret M Douglass, Li-wei H Lehman, Andrew Reisner, Mauricio Villarroel, William J Long, Peter Szolovits, George B Moody, Roger G Mark, Gari D Clifford.
Abstract
BACKGROUND: Text-based patient medical records are a vital resource in medical research. To preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and error-prone, necessitating automatic methods for large-scale de-identification.
Year: 2008 PMID: 18652655 PMCID: PMC2526997 DOI: 10.1186/1472-6947-8-32
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
PHI types defined by HIPAA
| PHI Type | Notes |
| Names | |
| Locations | |
| Dates | |
| Ages > 89 years | |
| Telephone numbers | |
| Fax numbers | |
| Electronic mail addresses | |
| Social security numbers | |
| Medical record numbers | |
| Health plan beneficiary numbers | |
| Account numbers | |
| Certificate/license numbers | |
| Vehicle identifiers | |
| Device identifiers and serial numbers | |
| Web Universal Resource Locators (URLs) | |
| Internet Protocol (IP) address numbers | |
| Biometric identifiers | |
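Several of the PHI types listed above (telephone numbers, e-mail addresses, social security numbers, URLs, IP addresses) have a fairly regular surface form, so they lend themselves to pattern matching. The following is a minimal sketch under that assumption; the patterns, the `PHI_PATTERNS` table, the `scrub` function and the bracketed tag format are all hypothetical illustrations and are not taken from the published algorithm. Free-text types such as names, locations and dates generally need dictionaries and context, which this sketch does not attempt.

```python
import re

# Hypothetical patterns for a few structured PHI types from the table above.
# Illustrative only; these are NOT the patterns used by the published algorithm.
PHI_PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def scrub(text: str) -> str:
    """Replace every pattern match with a bracketed tag naming its PHI type."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(scrub("Call 617-555-0123 or write to jdoe@example.org"))
```

The patterns are applied in sequence, so the order matters when two patterns could match the same span; here URLs and e-mail addresses are replaced first so that digits inside them are not later mistaken for phone numbers.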
Figure 1. Example of a discharge note after de-identification by the algorithm.
Clinician de-identification performance.
| Clinicians | Metric | Min | Max | Mean |
| 1 person | Recall | 0.63 | 0.94 | 0.81 |
| | Precision | 0.95 | 1.00 | 0.98 |
| 2 people | Recall | 0.89 | 0.98 | 0.94 |
| | Precision | 0.95 | 0.99 | 0.97 |
| 3 people | Recall | 0.98 | 0.99 | 0.98 |
| | Precision | 0.95 | 0.99 | 0.97 |
PHI category breakdown in gold standard corpus.
| PHI Type | Original Count/Distribution | Added PHI (Enrichment) | Total Count/Distribution After Enrichment |
| Patient Name | 34 (2.17%) | 20 | 54 (3.04%) |
| Patient Name Initial | 2 (0.13%) | 0 | 2 (0.11%) |
| Relative/Proxy Name | 125 (7.97%) | 50 | 175 (9.84%) |
| Clinician Name | 518 (33.04%) | 75 | 593 (33.33%) |
| Date (not year) | 475 (30.29%) | 6 | 482 (27.09%) |
| Year | 42 (2.68%) | 4 | 46 (2.59%) |
| Location | 328 (20.92%) | 40 | 367 (20.63%) |
| Phone | 37 (2.36%) | 16 | 53 (2.98%) |
| Age over 89 | 4 (0.26%) | 0 | 4 (0.22%) |
| Undefined | 3 (0.19%) | 0 | 3 (0.17%) |
| Total | 1,568 | 211 | 1,779 |
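The percentages in the last column appear to be each category's share of the enriched total. As a quick check, the sketch below recomputes them from the counts in the table; the `totals` dictionary and the `share` helper are names introduced here for illustration only.

```python
# Total counts after enrichment, copied from the last column of the table above.
totals = {
    "Patient Name": 54,
    "Patient Name Initial": 2,
    "Relative/Proxy Name": 175,
    "Clinician Name": 593,
    "Date (not year)": 482,
    "Year": 46,
    "Location": 367,
    "Phone": 53,
    "Age over 89": 4,
    "Undefined": 3,
}

grand_total = sum(totals.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


if __name__ == "__main__":
    print(f"Total PHI instances: {grand_total}")
    for phi_type, count in totals.items():
        print(f"{phi_type}: {count} ({share(count, grand_total)}%)")
```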
Performance on gold standard corpus.
| PHI Type | PHI sub-type | Count | # FNs | # FNs per 100,000 words | Per Category Recall | Per Category Precision |
| Name | Patient Name | 54 | 0 | 0 | 1.00 | |
| | Patient Name Initial | 2 | 2 | 0.598 | 0.00 | |
| | Relative/Proxy Name | 175 | 4 | 1.195 | 0.977 | |
| | Clinician Name | 593 | 3 | 1.494 | 0.995 | 0.725 |
| Date | Date (not year) | 482 | 26 | 7.769 | 0.946 | |
| | Year | 46 | 11 | 3.287 | 0.761 | 0.713 |
| Location | | 367 | 10 | 4.482 | 0.973 | 0.922 |
| Phone | | 53 | 0 | 0 | 1.00 | 0.898 |
| Age over 89 | | 4 | 1 | 0.299 | 0.750 | 0.600 |
| Undefined | | 3 | 2 | 0.598 | 0.333 | N/A |
| Overall | | 1779 | 59 | 19.720 | 0.967 | 0.749 |
(FNs are false negatives and N/A indicates not applicable)
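The recall column follows from the Count and # FNs columns, assuming the standard definition recall = (count − FNs) / count. The sketch below recomputes a few rows under that assumption; `recall_from_fns` and `rows` are illustrative names. Precision cannot be recomputed the same way because the table does not report false positives.

```python
def recall_from_fns(count: int, false_negatives: int) -> float:
    """Recall = true positives / all PHI instances, with TP = count - FN."""
    return (count - false_negatives) / count


# (count, false negatives) pairs copied from the gold standard table above.
rows = {
    "Relative/Proxy Name": (175, 4),
    "Date (not year)": (482, 26),
    "Year": (46, 11),
    "Location": (367, 10),
    "Overall": (1779, 59),
}

if __name__ == "__main__":
    for phi_type, (count, fns) in rows.items():
        print(f"{phi_type}: recall = {recall_from_fns(count, fns):.3f}")
```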
Categorization of algorithm false negatives by PHI type on test corpus.
| PHI Type | # False negatives in 296,400 words/1,836 nursing notes | # False negatives per 100,000 words | Recall |
| Full name | 4 † | 1 | |
| Last name | 14 † | 5 | |
| First name | 31 † | 11 | |
| Location (not street address) | 7 | 2 | |
| Full date | 2 | 1 | Unknown |
| Partial date | 9 | 3 | |
| Year | 8 | 3 | |
| Age over 89 | 3 | 1 | |
| Overall | 78 | 27 | 0.94 (estimated) |
† None of these names were actually patient names, and therefore were non-critical PHI.
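The per-100,000-word column normalizes raw false-negative counts by corpus size, which makes error rates comparable across corpora of different lengths. A minimal sketch of that normalization, using the 296,400-word test corpus size from the table header, is given below; `fns_per_100k` is a hypothetical helper name, and whole-number rounding is an assumption about how the table was prepared.

```python
# Word count of the test corpus, taken from the table header above.
WORDS_IN_TEST_CORPUS = 296_400


def fns_per_100k(false_negatives: int, words: int = WORDS_IN_TEST_CORPUS) -> float:
    """Scale a raw false-negative count to a rate per 100,000 words."""
    return false_negatives * 100_000 / words


if __name__ == "__main__":
    # Raw false-negative counts copied from the table above.
    for phi_type, fns in [("Location", 7), ("Partial date", 9), ("Year", 8)]:
        print(f"{phi_type}: {fns_per_100k(fns):.2f} per 100,000 words")
```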
Performance without customized dictionary on gold standard corpus.
| PHI Type | PHI sub-type | Count | # FNs | Per Category Recall | Per Category Precision |
| Name | Patient Name | 54 | 1 | 0.981 | |
| | Patient Name Initial | 2 | 2 | 0.00 | |
| | Relative/Proxy Name | 175 | 5 | 0.971 | |
| | Clinician Name | 593 | 24 | 0.973 | 0.731 |
| Date | Date (not year) | 482 | 26 | 0.946 | |
| | Year | 46 | 11 | 0.761 | 0.712 |
| Location | | 367 | 231 | 0.371 | 0.840 |
| Phone | | 53 | 0 | 1.00 | 0.898 |
| Age over 89 | | 4 | 1 | 0.750 | 0.600 |
| Undefined | | 3 | 2 | 0.333 | N/A |
| Overall | | 1779 | 295 | 0.834 | 0.725 |
(FNs are false negatives and N/A indicates not applicable.)