Oscar Ferrández, Brett R South, Shuying Shen, F Jeffrey Friedlin, Matthew H Samore, Stéphane M Meystre.
Abstract
BACKGROUND: The growing adoption of Electronic Health Records (EHR) has caused tremendous growth in digital information useful to clinicians, researchers, and many operational purposes. However, this information is rich in Protected Health Information (PHI), which severely restricts its access and possible uses. A number of investigators have developed methods for automatically de-identifying EHR documents by removing PHI, as specified in the Health Insurance Portability and Accountability Act (HIPAA) "Safe Harbor" method. This study focuses on the evaluation of existing automated text de-identification methods and tools, as applied to Veterans Health Administration (VHA) clinical documents, to assess which methods perform better with each category of PHI found in our clinical notes, and when new methods are needed to improve performance.
Year: 2012 PMID: 22839356 PMCID: PMC3445850 DOI: 10.1186/1471-2288-12-109
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Main characteristics of the de-identification tools
| Characteristic | Detail | HMS Scrubber | MeDS | MIT deid | MIST | HIDE |
| Main technique | Rule-based | X | X | X | n/a | n/a |
| | ML-based | n/a | n/a | n/a | X | X |
| Programming language | | Java | Java | Perl | Python | Python |
| ML algorithm | | n/a | n/a | n/a | CRF (Carafe) | CRF (CRFsuite) |
| Input documents | | XML/txt | HL7/txt | txt | txt/XML-inline/json | XML/txt/HL7 |
| HIPAA compliant | | X | X | X | note 1 | note 1 |
| Regular Expressions (#) | | ~50 | ~40 | ~90 | note 2 | note 2 |
| PHI markers (e.g., Mr.) | | X | X | X | note 3 | -- |
| Part-of-speech information | | -- | X | -- | -- | -- |
| String similarity techniques (e.g., edit distance, fuzzy matching) | | -- | X | -- | -- | -- |
| Dictionaries* (size) | Person names | ~101K | ~280K | ~96K (note 4) | -- | -- |
| | Geographic places | | ~167K | ~4K | -- | -- |
| | US area code | -- | -- | ~380 | -- | -- |
| | Medical phrases | -- | ~50 | ~28 | -- | -- |
| | Medical terms | -- | ~80K | ~175K | -- | -- |
| | Companies | -- | ~200 | ~500 | -- | -- |
| | Ethnicities | -- | ~120 | ~195 | -- | -- |
| | Common words | -- | ~220K | ~50K | -- | -- |
| Machine Learning features | Contextual window | n/a | n/a | n/a | 3-words | 4-words |
| | Morphological (#) | n/a | n/a | n/a | 22 | 34 |
| | Syntactic | n/a | n/a | n/a | -- | -- |
| | Semantic | n/a | n/a | n/a | -- | -- |
| | From dictionaries | n/a | n/a | n/a | note 5 | note 5 |
*HMS Scrubber’s dictionary sources: 1990 US Census (person and place names).
*MeDS’ dictionary sources: Ispell, 2005 SS Death Index, Regenstrief Medical Record System, UMLS, MESH.
*MIT deid’s dictionary sources: 1990 US Census, MIMIC II Database, Atkinson’s Spell Checking Oriented Word Lists, UMLS.
1 It will depend on the types of the PHI instances used for training.
2 Both MIST and HIDE use regular expressions to derive the morphological features from the tokens (e.g., all_caps_token ‘^[A-Z]+$’).
3 Within MIST, PHI markers are used only for detecting companies (e.g., “Ltd.”).
4 Person name dictionaries comprise lists of first names, last names and name prefixes.
5 These systems are designed to derive features from dictionaries; however, they are not distributed with any default dictionary.
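Note 2 above says that MIST and HIDE derive morphological features from tokens with regular expressions. The following is a minimal sketch of that idea, not the code of either tool. Only the `all_caps_token` pattern comes from the note; the other feature names and patterns are hypothetical and are included purely for illustration.

```python
import re

# Only "all_caps_token" is taken from note 2; the remaining names and
# patterns are hypothetical examples of morphological features.
MORPH_PATTERNS = {
    "all_caps_token": re.compile(r"^[A-Z]+$"),
    "capitalized_token": re.compile(r"^[A-Z][a-z]+$"),
    "all_digits_token": re.compile(r"^[0-9]+$"),
    "has_digit": re.compile(r"[0-9]"),
}


def morphological_features(token):
    """Return the names of all patterns that match the token, in definition order."""
    return [name for name, pattern in MORPH_PATTERNS.items() if pattern.search(token)]
```

In a CRF-based tagger, each returned name would typically become a binary feature attached to the token.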
Figure 1 Example of BIO annotations.
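The figure itself is not reproduced here. As a rough illustration of the BIO scheme it refers to, the sketch below tags the first token of each annotated span with `B-`, the remaining tokens of the span with `I-`, and all other tokens with `O`. The function name, the span format, and the example sentence are assumptions made for this sketch and do not come from the source.

```python
def bio_tags(tokens, spans):
    """Convert token-level spans into BIO tags.

    tokens: list of token strings.
    spans: list of (start, end, label) with token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the span
        for i in range(start + 1, end):     # remaining tokens of the span
            tags[i] = "I-" + label
    return tags


# Made-up example sentence and labels, for illustration only.
example_tokens = ["Mr.", "John", "Smith", "was", "seen", "on", "May", "5"]
example_spans = [(1, 3, "PatientName"), (6, 8, "Date")]
example_tags = bio_tags(example_tokens, example_spans)
```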
PHI category distribution and mapping for the VHA, i2b2 and Swedish Stockholm EPR corpora
| VHA PHI type | VHA instances (%) | i2b2 PHI type | i2b2 instances (%) | Stockholm EPR PHI type | Stockholm EPR instances (%) |
| Patient Name | 206 (3.88%) | Patients | 929 (4.76%) | Person Name: First Name | 923 (20.87%) |
| Relative Name | 30 (0.55%) | | | | |
| Other Person Name | 20 (0.37%) | | | Person Name: Last Name | 929 (21%) |
| Healthcare Provider Name | 492 (9.08%) | Doctors | 3751 (19.24%) | | |
| Street City | 137 (2.53%) | Locations | 263 (1.35%) | Location | 148 (3.35%) |
| State Country | 161 (2.97%) | | | | |
| Zip code | 4 (0.07%) | | | | |
| Deployment | 43 (0.79%) | - | - | - | - |
| Healthcare Unit Name | 1453 (26.83%) | Hospitals | 2400 (12.31%) | Health_Care_Unit | 1021 (23.08%) |
| Other Organization | 86 (1.59%) | - | - | - | - |
| Date | 2547 (47.03%) | Dates | 7098 (36.40%) | Date_Part | 710 (16.05%) |
| | | | | Full_Date | 500 (11.30%) |
| Age > 89 | 4 (0.07%) | Ages | 16 (0.08%) | Age | 56 (1.27%) |
| Phone Number | 90 (1.66%) | Phone Numbers | 232 (1.19%) | Phone Number | 136 (3.07%) |
| Electronic Address | 4 (0.07%) | - | - | - | - |
| SSN | 16 (0.30%) | IDs | 4809 (24.66%) | - | - |
| Other ID Number | 123 (2.27%) | - | - | | |
Figure 2 Frequency distribution of the 100 most frequent VHA document types.
Figure 3 Example of exact, partial and fully-contained matches.
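The three match types can be illustrated with character offsets. The sketch below is one plausible reading, since this excerpt does not give the formal definitions: an exact match has identical offsets, a fully-contained match is a system annotation that covers the whole reference annotation, and a partial match is any character overlap. The function name and the (start, end) span format with an exclusive end are assumptions.

```python
def match_type(ref, pred):
    """Classify how a predicted span relates to a reference span.

    ref, pred: (start, end) character offsets, end exclusive.
    Returns the strictest category that applies.
    """
    ref_start, ref_end = ref
    pred_start, pred_end = pred
    if (ref_start, ref_end) == (pred_start, pred_end):
        return "exact"
    # Assumed definition: the prediction covers the entire reference span.
    if pred_start <= ref_start and pred_end >= ref_end:
        return "fully-contained"
    # Any character overlap at all.
    if pred_start < ref_end and ref_start < pred_end:
        return "partial"
    return "none"
```

Under this reading every exact match is also fully contained, and every fully-contained match is also partial, so a lenient evaluation would count all three categories as hits.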
“Out-of-the-box” overall results using the VHA evaluation corpus for exact, partial and fully-contained matches with one category, and with each PHI category separately
| One PHI | P (CI) | 0.01 (0.005-0.015) | 0.10 (0.085-0.115) | 0 | 0.32 (0.31-0.33) | 0.45 (0.435-0.465) | 0.16 (0.15-0.17) | 0.14 (0.125-0.155) | 0.42 (0.40-0.44) | |
| | R (CI) | 0.02 (0.015-0.025) | 0.21 (0.20-0.22) | 0 | 0.65 (0.64-0.66) | 0.64 (0.625-0.655) | 0.34 (0.325-0.355) | 0.32 (0.305-0.335) | 0.36 (0.345-0.375) | |
| | F2 (CI) | 0.02 (0.012-0.025) | 0.17 (0.16-0.18) | 0 | 0.54 (0.53-0.55) | 0.68 (0.665-0.695) | 0.28 (0.27-0.29) | 0.25 (0.24-0.26) | 0.37 (0.355-0.385) | |
| All PHI types | P (CI) | 0.01 (0.005-0.015) | 0.05 (0.045-0.055) | 0 | 0.23 (0.22-0.24) | 0.34 (0.325-0.365) | 0.76 (0.745-0.775) | 0.12 (0.115-0.125) | 0.10 (0.09-0.11) | 0.40 (0.335-0.465) |
| | R (CI) | 0.02 (0.0195-0.0215) | 0.14 (0.13-0.15) | 0 | 0.47 (0.455-0.485) | 0.60 (0.585-0.615) | 0.60 (0.585-0.615) | 0.26 (0.225-0.295) | 0.22 (0.205-0.235) | 0.34 (0.325-0.355) |
| | F2 (CI) | 0.02 (0.018-0.022) | 0.10 (0.09-0.11) | 0 | 0.39 (0.38-0.40) | 0.52 (0.505-0.535) | 0.63 (0.615-0.645) | 0.21 (0.195-0.225) | 0.18 (0.17-0.19) | 0.35 (0.315-0.385) |
| Overall results | ||||||||||
| | | | MIST | HIDE | MIST | HIDE | MIST | HIDE | ||
| One PHI | P (CI) | 0.54 | 0.50 | 0.89 | 0.58 | 0.56 | ||||
| | | | (0.52-0.56) | (0.48-0.52) | (0.935-0.965) | (0.875-0.905) | (0.56-0.60) | (0.54-0.58) | ||
| | R (CI) | 0.25 | 0.27 | 0.46 | 0.28 | 0.30 | ||||
| | | | (0.24-0.26) | (0.26-0.28) | (0.445-0.475) | (0.475-0.505) | (0.27-0.29) | (0.29-0.31) | ||
| | F2 (CI) | 0.28 | 0.30 | 0.51 | 0.31 | 0.33 | ||||
| | | | (0.265-0.295) | (0.285-0.315) | (0.495-0.525) | (0.525-0.555) | (0.295-0.325) | (0.315-0.345) | ||
| All PHI types | P (CI) | 0.52 | 0.48 | 0.90 | 0.84 | 0.55 | 0.52 | |||
| | | | (0.495-0.545) | (0.46-0.50) | (0.885-0.915) | (0.825-0.855) | (0.525-0.575) | (0.50-0.54) | ||
| | R (CI) | 0.24 | 0.25 | 0.44 | 0.46 | 0.27 | 0.28 | |||
| | | | (0.225-0.255) | (0.24-0.26) | (0.425-0.455) | (0.445-0.475) | (0.255-0.285) | (0.265-0.295) | ||
| | F2 (CI) | 0.27 | 0.28 | 0.49 | 0.50 | 0.30 | 0.31 | |||
| | | | (0.255-0.285) | (0.265-0.295) | (0.475-0.505) | (0.485-0.515) | (0.285-0.315) | (0.295-0.325) | ||
CI: Confidence Interval obtained with a confidence level of 95%.
One PHI = one overall PHI category considered.
All PHI types = each PHI type evaluated separately.
P = Precision; R = Recall; F2 = F2-measure.
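The F2-measure in the legend is the standard F-beta score with beta = 2, which weights recall more heavily than precision: F2 = 5PR / (4P + R). The short sketch below computes it; the function name is ours. As a check, the MIST exact-match row for all PHI types in the table above (P = 0.52, R = 0.24) gives about 0.27, which agrees with the reported F2.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta=2 gives the F2-measure, which favors recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# MIST, exact matches, all PHI types: P = 0.52, R = 0.24 (from the table above).
mist_exact_f2 = round(f_beta(0.52, 0.24), 2)
```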
“Out-of-the-box” individual PHI recall results for partial and fully-contained matches using the VHA evaluation corpus
| | | ||||||
| Patient Name | 206 | 0.83 | 0.98 | 0.57 | 0.69 | 0.69 | |
| Relative Name | 30 | 0.76 | 0.95 | 1 | 0.57 | 0.67 | 0.77 |
| Healthcare Provider Name | 492 | 0.74 | 0.96 | 0.94 | 0.43 | 0.47 | 0.38 |
| Other Person Name | 20 | 0.66 | 0.81 | 0.74 | 0.30 | 0.25 | 0.35 |
| Street City | 137 | 0.90 | 0.96 | 0.81 | 0.70 | 0.78 | 0.78 |
| State Country | 161 | 0.45 | 0.49 | 0.85 | 0.43 | 0.45 | 0.84 |
| Deployment | 43 | 0.34 | 0.49 | 0.27 | 0.07 | 0.02 | 0.05 |
| ZIP code | 4 | 1 | 1 | 1 | 1 | 1 | 1 |
| Healthcare Unit Name | 1453 | 0.45 | 0.51 | 0.12 | 0.24 | 0.23 | 0.03 |
| Other Org Name | 86 | 0.33 | 0.50 | 0.27 | 0.03 | 0.20 | 0.03 |
| Date | 2547 | 0.74 | 0.87 | 0.80 | 0.34 | 0.27 | 0.46 |
| Age > 89 | 4 | 0 | 0 | 1 | 0 | 0 | 1 |
| Phone Number | 90 | 0.73 | 0.79 | 0.80 | 0.42 | 0.5 | 0.48 |
| Electronic Address | 4 | 0 | 0.86 | 0.75 | 0 | 0 | 0.75 |
| SSN | 16 | 1 | 1 | 1 | 1 | 1 | 1 |
| Other ID Number | 123 | 0.66 | 0.82 | 0.41 | 0.43 | 0.61 | 0.27 |
| PHI type | |||||||
| | | MIST | HIDE | MIST | HIDE | ||
| Patient Name | 206 | 0.51 | 0.54 | 0.42 | 0.50 | ||
| Relative Name | 30 | 0.13 | 0.13 | 0.13 | 0.13 | ||
| Healthcare Provider Name | 492 | 0.53 | 0.59 | 0.44 | 0.53 | ||
| Other Person Name | 20 | 0 | 0.20 | 0 | 0.15 | ||
| Street City | 137 | 0.26 | 0.29 | 0.26 | 0.27 | ||
| State Country | 161 | 0.14 | 0.22 | 0.14 | 0.21 | ||
| Deployment | 43 | 0.07 | 0.05 | 0.07 | 0.02 | ||
| ZIP code | 4 | 0 | 0.75 | 0 | 0.75 | ||
| Healthcare Unit Name | 1453 | 0.09 | 0.09 | 0.06 | 0.05 | ||
| Other Org Name | 86 | 0.09 | 0.07 | 0.06 | 0.06 | ||
| Date | 2547 | 0.72 | 0.73 | 0.39 | 0.38 | ||
| Age > 89 | 4 | 0 | 0 | 0 | 0 | ||
| Phone Number | 90 | 0.34 | 0.61 | 0.24 | 0.51 | ||
| Electronic Address | 4 | 0 | 0 | 0 | 0 | ||
| SSN | 16 | 0.56 | 0.87 | 0.56 | 0.87 | ||
| Other ID Number | 123 | 0.32 | 0.69 | 0.20 | 0.63 | ||
10-fold cross-validation overall results using the VHA evaluation corpus for exact, partial and fully-contained matches with one category, and with each PHI type separately
| One PHI | P (CI) | 0.89 | 0.88 | 0.95 | 0.91 | 0.91 | |
| | | (0.88-0.90) | (0.87-0.89) | (0.95-0.97) | (0.94-0.96) | (0.90-0.92) | (0.90-0.92) |
| | R (CI) | 0.64 | 0.70 | 0.70 | 0.67 | 0.73 | |
| | | (0.625-0.655) | (0.685-0.715) | (0.685-0.715) | (0.75-0.77) | (0.655-0.685) | (0.72-0.74) |
| | F2 (CI) | 0.68 | 0.73 | 0.74 | 0.71 | 0.76 | |
| | | (0.665-0.695) | (0.72-0.74) | (0.725-0.755) | (0.775-0.805) | (0.70-0.72) | (0.75-0.77) |
| All PHI types | P (CI) | 0.87 | 0.87 | 0.95 | 0.92 | 0.90 | 0.89 |
| | | (0.855-0.885) | (0.86-0.88) | (0.94-0.96) | (0.905-0.935) | (0.885-0.915) | (0.88-0.90) |
| | R (CI) | 0.63 | 0.69 | 0.69 | 0.74 | 0.66 | 0.71 |
| | | (0.615-0.655) | (0.675-0.705) | (0.675-0.705) | (0.725-0.755) | (0.645-0.675) | (0.695-0.725) |
| | F2 (CI) | 0.67 | 0.72 | 0.73 | 0.77 | 0.70 | 0.74 |
| | | (0.655-0.685) | (0.71-0.73) | (0.713-0.745) | (0.76-0.78) | (0.685-0.715) | (0.725-0.755) |
CI: Confidence Interval obtained with a confidence level of 95%.
One PHI = one overall PHI category considered.
All PHI types = each PHI type evaluated separately.
P = Precision; R = Recall; F2 = F2-measure.
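The legend reports 95% confidence intervals, but this excerpt does not say how they were computed. One common choice for a proportion such as precision or recall is the normal-approximation (Wald) interval, p ± 1.96 · sqrt(p(1 − p)/n). The sketch below shows that method only as an assumption, and the function name is ours.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z=1.96 corresponds to a 95% confidence level. This is an assumed
    method, not necessarily the one used for the tables above.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

For recall, `successes` would be the number of reference PHI annotations found and `total` the number of reference annotations; for precision, `total` would be the number of system annotations.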
10-fold cross-validation recall results for partial and fully-contained matches by PHI type, using the VHA evaluation corpus
| Patient Name | 206 | 0.51 | 0.51 | 0.49 | 0.51 |
| Relative Name | 30 | 0 | 0.13 | 0 | 0.10 |
| Healthcare Provider Name | 492 | 0.58 | 0.61 | 0.54 | 0.59 |
| Other Person Name | 20 | 0 | 0.20 | 0 | 0.20 |
| Street City | 137 | 0.28 | 0.48 | 0.28 | 0.43 |
| State Country | 161 | 0.58 | 0.71 | 0.58 | 0.70 |
| Deployment | 43 | 0.19 | 0.28 | 0.16 | 0.21 |
| ZIP code | 4 | 0 | 0 | 0 | 0 |
| Healthcare Unit Name | 1453 | ||||
| Other Org Name | 86 | 0.10 | 0.29 | 0.09 | 0.25 |
| Date | 2547 | ||||
| Age > 89 | 4 | 0 | 0 | 0 | 0 |
| Phone Number | 90 | 0.27 | 0.88 | 0.23 | 0.78 |
| Electronic Address | 4 | 0.75 | 0.75 | 0 | 0.75 |
| SSN | 16 | 0.37 | 0.62 | 0.37 | 0.56 |
| Other ID Number | 123 | 0.37 | 0.72 | 0.34 | 0.65 |
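The 10-fold cross-validation behind the results above splits the corpus into ten parts, trains on nine, and tests on the remaining one, rotating until every part has served as the test set. A minimal sketch follows; splitting at the document level, shuffling with a fixed seed, and the function name are all assumptions not stated in this excerpt.

```python
import random


def k_fold_splits(doc_ids, k=10, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    folds = [ids[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test
```

Each document appears in exactly one test fold, so per-fold scores can be pooled or averaged to obtain overall figures.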
Figure 4 Partial matches character overlap analysis.