Brett R South, Danielle Mowery, Ying Suo, Jianwei Leng, Óscar Ferrández, Stephane M Meystre, Wendy W Chapman.
Abstract
The Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor method requires removal of 18 types of protected health information (PHI) from clinical documents before they can be considered "de-identified" and used for research purposes. Human review of PHI elements in a large corpus of clinical documents can be tedious and error-prone; indeed, multiple annotators may be required to consistently redact the information that represents each PHI class. Automated de-identification has the potential to improve annotation quality and reduce annotation time. One approach is machine-assisted annotation, which combines de-identification system outputs, used as pre-annotations, with an interactive annotation interface, so that annotators receive PHI annotations for "curation" rather than annotating raw clinical documents from "scratch". To assess whether machine-assisted annotation improves the reliability and accuracy of the reference standard and reduces annotation effort, we conducted an annotation experiment. In this annotation study, we assessed the generalizability of the VA Consortium for Healthcare Informatics Research (CHIR) annotation schema and guidelines applied to MTSamples, a corpus of publicly available clinical documents. Specifically, our goals were to (1) characterize a heterogeneous corpus of clinical documents manually annotated for risk-ranked PHI and other annotation types (clinical eponyms and person relations), (2) evaluate how well annotators apply the CHIR schema to the heterogeneous corpus, (3) compare whether machine-assisted annotation (experiment) improves annotation quality and reduces annotation time compared to manual annotation (control), and (4) assess the change in reference standard coverage with each added annotator's annotations. Published by Elsevier Inc.
Keywords: Anonymization; Clinical corpora; Confidentiality; De-identification; Electronic health records; Medical informatics; Natural language processing; Patient data privacy
Year: 2014 PMID: 24859155 PMCID: PMC5627768 DOI: 10.1016/j.jbi.2014.05.002
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317
Annotation type definitions between i2b2 and extended CHIR Schema. Annotation types having co-referring relationships.
| i2b2 PHI type | i2b2 definition | Extended CHIR type | Extended CHIR definition |
|---|---|---|---|
| | All elements of a date except for the year | | Date, |
| | First and last names of patients, their health proxies, and family members | | Patient’s first name, last name, middle name, and initials excluding salutations. Ex. “Mr. |
| | | | Proper name of relatives. Ex. “patient’s daughter |
| | | | Other persons mentioned or patient proxy. Ex. “lived in his friend |
| | Medical doctors and other practitioners as well as transcriber’s name and initials | | Health care worker’s first name, last name, middle name, and initials excluding salutations. Ex. “ |
| | Ages above 89 | | Expanded to include all mentions of age. Ex. “ |
| | Any combination of numbers, letters, and special characters identifying medical records, patients, doctors, or hospitals | | All combinations of numbers and letters that could represent a medical record number, lab test number, or other patient or provider identifier such as driver’s license number. Ex. “Driver’s license: |
| | | | Electronic mail addresses and references to personal Websites, Facebook pages, Twitter. Ex. “CC: |
| | | | Numbers and/or characters, that could represent a social security reference. Ex. “SSN is |
| | Geographic locations such as cities, states, street names, zip codes, building names, and numbers | | Street or city names excluding name as part of organization name. Ex. “lived on |
| | | | State or country. Ex. “lived in |
| | | | All digits acting as a zipcode. Ex. “works in |
| | | | A specific geographic location, or mention of unit, battalion, regiment, brigade etc. Ex. “deployed with the |
| Hospitals | Names of medical organizations and of nursing homes where patients are treated and may also reside including room numbers of patients, and buildings and floors related to doctors’ affiliations | | Affiliation with companies such as employment that are not related to health care. Ex. “employed by |
| | | | Sub-specialty clinics, consults or referral to services, or recommendations from services where health care was or will be provided to a patient. Ex. “Care provided at |
| Phone Numbers | Telephone, pager, and fax numbers | | Phone/fax/pager numbers including phone number extensions. Ex. “Fax No: |
| Non-PHI | Not annotated as part of i2b2 | | Medical procedures that contain proper names of persons, places, or locations. Ex. “ |
| Non-PHI | Not annotated as part of i2b2 | | Medical devices that contain proper names of persons, places, or locations excluding brand names. Ex. “ |
| Non-PHI | Not annotated as part of i2b2 | | Diseases that contain proper names of persons, places, or locations. Ex. “history of |
| | Not annotated as part of i2b2 | | Anatomic structures contain proper names of persons, places, or locations. Ex. “ |
Fig. 1. Logical representation of the annotation schema. Annotation types color-coded by PHI privacy risk ranking: red (high risk), orange (medium risk), yellow (low risk), and gray (non-PHI). Co-referring paired relationships were created between annotations for person names. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 2. Annotation experimental conditions.
Prevalence of annotation types and PHI risk category by annotator training and experiment for the final reference standard.
| Annotation type | Annotator training: N | Annotator training: % | Annotator experiment: N | Annotator experiment: % |
|---|---|---|---|---|
| Social Security Numbers | – | – | – | – |
| Patient Names | 86 | 3.5 | 248 | 2.5 |
| | 204 | 8.4 | 860 | 8.5 |
| Relative Names | 17 | <1.0 | 12 | <1.0 |
| Other Person Names | 4 | <1.0 | 15 | <1.0 |
| Dates | 630 | 26.0 | 2,305 | 22.8 |
| Street City | 24 | 1.0 | 119 | 1.2 |
| State Country | 33 | 1.4 | 95 | 1.0 |
| Zip codes | – | – | – | – |
| Phone Number | 2 | <1.0 | 6 | <1.0 |
| Deployments | 2 | <1.0 | 1 | <1.0 |
| Other Organization Names | 49 | 2.0 | 109 | 1.1 |
| Electronic Addresses | – | – | – | – |
| Other ID Numbers | 4 | <1.0 | 178 | 1.8 |
| Ages | 476 | 19.6 | 1,544 | 15.3 |
| | 110 | 9.0 | 469 | 4.6 |
| Anatomic Structures | 44 | 1.8 | 164 | 1.6 |
| Devices | 412 | 16.9 | 1,622 | 16.1 |
| Diseases | 48 | 2.0 | 263 | 2.6 |
| Procedures | 157 | 6.5 | 713 | 7.1 |
| | 66 | 2.7 | 287 | 2.8 |
| Patient Names relations | 61 | 2.5 | 167 | 1.66 |
| Relative Names relations | 2 | <1.0 | 2 | <1.0 |
| Total Annotations | 2,431 | 19.4 | 10,091 | 80.6 |
Bold is provided for super categories of annotation classes only and overall numbers within these tables.
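The percentage columns can be read as simple shares of a total. The sketch below assumes (this is not stated in the source) that each "%" is the type's count divided by that column's "Total Annotations" value, and that the percentages in the total row split all annotations between the two phases. The function name `prevalence` is illustrative only, and individual rows may differ in the last digit because of rounding in the source.

```python
def prevalence(counts: dict[str, int], total: int) -> dict[str, float]:
    """Share of `total` for each annotation type, as a percentage with one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Counts copied from the "Annotator experiment" column of the table above.
experiment = {"Patient Names": 248, "Dates": 2305, "Ages": 1544}
shares = prevalence(experiment, total=10091)

# Assumption: the total row reports each phase's share of all annotations.
phases = prevalence({"training": 2431, "experiment": 10091}, total=2431 + 10091)

print(shares)
print(phases)
```

Under these assumptions the three sampled rows and the total row reproduce the tabulated percentages.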
Inter-annotator agreement for the experiment.
| Annotation type | Exact IAA, control: raw annotation | Exact IAA, experiment: BoB + eHOST Oracle | Partial IAA, control: raw annotation | Partial IAA, experiment: BoB + eHOST Oracle |
|---|---|---|---|---|
| Social Security Numbers | – | – | – | – |
| Patient Names | 0.87 | 0.40 | 0.91 | 0.80 |
| | 0.90 | 0.91 | 0.95 | 0.92 |
| Relative Names | 0.8 | 0 | 0.8 | 0 |
| Other Person Names | 0.33 | 0.10 | 0.33 | 0.11 |
| Dates | 0.84 | 0.75 | 0.86 | 0.76 |
| Street City | 0.82 | 0.44 | 0.84 | 0.44 |
| State Country | 0.78 | 0.35 | 0.79 | 0.46 |
| Zip codes | – | – | – | – |
| Phone Numbers | 0.50 | 0 | 0.50 | 0 |
| Deployments | 0.33 | – | 0.33 | – |
| Other Organization Names | 0.61 | 0.30 | 0.64 | 0.39 |
| Electronic Addresses | – | – | – | – |
| Other ID Numbers | 0.07 | 0.60 | 0.15 | 0.60 |
| Ages | 0.84 | 0.83 | 0.92 | 0.89 |
| | 0.50 | 0.50 | 0.54 | 0.55 |
| Anatomic Structures | 0.67 | 0.55 | 0.68 | 0.59 |
| Devices | 0.68 | 0.76 | 0.72 | 0.77 |
| Diseases | 0.62 | 0.67 | 0.65 | 0.67 |
| Procedures | 0.55 | 0.40 | 0.56 | 0.45 |
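The difference between the exact and partial columns can be illustrated with a small span-matching sketch. Everything here is an assumption made for illustration, not the study's procedure: annotations are modeled as character-offset spans, an exact match requires identical offsets and type, a partial match requires only overlapping offsets and the same type, and pairwise agreement is scored as twice the number of matches divided by the total number of annotations from both annotators. The names `Span`, `matches`, and `agreement` are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # annotation type, e.g. "Dates"


def matches(a: Span, b: Span, partial: bool) -> bool:
    """Exact: same offsets and type. Partial: overlapping offsets and same type."""
    if a.label != b.label:
        return False
    if partial:
        return a.start < b.end and b.start < a.end
    return a.start == b.start and a.end == b.end


def agreement(first: list[Span], second: list[Span], partial: bool = False) -> float:
    """Pairwise agreement as 2 * matches / (annotations by both annotators)."""
    unmatched = list(second)
    hits = 0
    for a in first:
        for b in unmatched:
            if matches(a, b, partial):
                unmatched.remove(b)  # each annotation is matched at most once
                hits += 1
                break
    total = len(first) + len(second)
    return 2 * hits / total if total else 1.0


# Hypothetical annotations of one document by two annotators.
annotator_1 = [Span(0, 10, "Patient Names"), Span(20, 30, "Dates")]
annotator_2 = [Span(0, 10, "Patient Names"), Span(22, 30, "Dates"), Span(40, 45, "Ages")]

exact_iaa = agreement(annotator_1, annotator_2)
partial_iaa = agreement(annotator_1, annotator_2, partial=True)
print(exact_iaa, partial_iaa)
```

In this toy case the two "Dates" spans overlap without sharing a start offset, so they count only under partial matching, which is why partial agreement is never lower than exact agreement under this scoring.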
Performance metrics for control (raw annotation) and experimental (BoB + eHOST Oracle) conditions.
| Annotation type | Exact (recall, precision, F1-measure), control: raw annotation | Exact (recall, precision, F1-measure), experiment: BoB + eHOST Oracle | Partial (recall, precision, F1-measure), control: raw annotation | Partial (recall, precision, F1-measure), experiment: BoB + eHOST Oracle |
|---|---|---|---|---|
| Social Security Numbers | – | – | – | – |
| Patient Names | 0.95, 0.98, 0.96 | 0.78, 0.85, 0.81 | 0.96, 0.99, 0.98 | 0.91, 0.99, 0.95 |
| | 0.94, 0.96, 0.95 | 0.90, 0.96, 0.93 | 0.97, 0.98, 0.97 | 0.93, 0.99, 0.96 |
| Relative Names | 0.82, 0.93, 0.88 | 0.50, 0.50, 0.50 | 0.88, 1.0, 0.94 | 1, 1, 1 |
| Other Person Names | 0.50, 0.80, 0.62 | 0.69, 0.06, 0.11 | 0.50, 0.80, 0.62 | 0.81, 0.07, 0.13 |
| Dates | 0.86, 0.95, 0.90 | 0.84, 0.93, 0.88 | 0.88, 0.97, 0.92 | 0.86, 0.94, 0.90 |
| Street City | 0.88, 0.92, 0.90 | 0.92, 0.50, 0.65 | 0.89, 0.93, 0.91 | 0.93, 0.51, 0.66 |
| State Country | 0.80, 0.94, 0.86 | 0.83, 0.50, 0.62 | 0.80, 0.95, 0.87 | 0.96, 0.57, 0.72 |
| Zip codes | – | – | – | – |
| Phone Numbers | 0.50, 0.71, 0.59 | 1, 1, 1 | 0.70, 1.0, 0.82 | 1, 1, 1 |
| Deployments | 0.67, 0.67, 0.67 | – | 0.67, 0.67, 0.67 | – |
| Other Organization Names | 0.69, 0.81, 0.74 | 0.61, 0.53, 0.57 | 0.72, 0.84, 0.77 | 0.67, 0.58, 0.62 |
| Electronic Addresses | – | – | – | – |
| Other ID Numbers | 0.37, 0.46, 0.41 | 0.36, 0.54, 0.44 | 0.54, 0.69, 0.61 | 0.53, 0.80, 0.64 |
| Ages | 0.90, 0.93, 0.91 | 0.89, 0.93, 0.91 | 0.94, 0.98, 0.96 | 0.93, 0.98, 0.95 |
| | 0.69, 0.75, 0.72 | 0.76, 0.54, 0.63 | 0.73, 0.80, 0.76 | 0.83, 0.59, 0.69 |
| Anatomic Structures | 0.77, 0.83, 0.80 | 0.64, 0.82, 0.72 | 0.78, 0.84, 0.81 | 0.65, 0.83, 0.73 |
| Devices | 0.77, 0.91, 0.83 | 0.79, 0.88, 0.83 | 0.79, 0.94, 0.86 | 0.81, 0.91, 0.85 |
| Diseases | 0.76, 0.87, 0.81 | 0.81, 0.79, 0.80 | 0.79, 0.91, 0.84 | 0.83, 0.81, 0.82 |
| Procedures | 0.69, 0.85, 0.76 | 0.62, 0.73, 0.67 | 0.69, 0.85, 0.76 | 0.63, 0.75, 0.68 |
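Each cell of the performance table lists recall, precision, and F1-measure, where the F1-measure is the harmonic mean of the first two. As a quick check, the sketch below recomputes F1 from the rounded recall and precision values of three cells in the "Exact" columns. The helper name `f1_measure` is illustrative, and because the inputs are already rounded to two decimals, a recomputed value could differ from a reported one in the last digit for other rows.

```python
def f1_measure(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; defined as 0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from the table above.
patient_names_control = round(f1_measure(0.95, 0.98), 2)      # exact, control
dates_control = round(f1_measure(0.86, 0.95), 2)              # exact, control
other_persons_experiment = round(f1_measure(0.69, 0.06), 2)   # exact, experiment

print(patient_names_control, dates_control, other_persons_experiment)
```

The last case shows why a high recall (0.69) does not rescue the score when precision is very low (0.06): the harmonic mean is pulled toward the smaller of the two values.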
Experimental effects estimated using the Wilcoxon rank sum test.
| Annotation type | Control (raw annotation): median F1-measure | Control (raw annotation): N | Experiment (BoB + eHOST Oracle): median F1-measure | Experiment (BoB + eHOST Oracle): N | Significance: Pr > \|Z\| |
|---|---|---|---|---|---|
| | 0.91 | 1156 | 0.91 | 741 | 0.296 |
| | 1 | 365 | 1 | 274 | < |
| Patient Names | 1 | 78 | 1 | 32 | |
| | 1 | 338 | 1 | 201 | 0.278 |
| Relative Names | 1 | 8 | 0.5 | 2 | 0.553 |
| Other Person Names | 0.5 | 11 | 0 | 106 | < |
| | 1 | 879 | 1 | 579 | 0.0748 |
| Street City | 1 | 68 | 0 | 72 | < |
| State Country | 0.96 | 48 | 0 | 64 | |
| Zip codes | – | – | – | – | – |
| Deployments | 0.33 | 2 | – | – | – |
| Other Organization Names | 0.5 | 72 | 0 | 65 | |
| Dates | 1 | 533 | 1 | 342 | 0.195 |
| Ages | 1 | 764 | 1 | 493 | 0.992 |
| Phone Numbers | 0.58 | 4 | 1 | 2 | 0.140 |
| Electronic Addresses | – | – | – | – | – |
| Other ID Numbers | 0.20 | 47 | 0 | 37 | 0.553 |
| | 0.667 | 277 | 0 | 221 | |
| | 0.667 | 277 | 0 | 221 | |
| | 0.857 | 729 | 0.995 | 459 | 0.7103 |
| Anatomic structures | 0.933 | 101 | 0.8 | 61 | 0.600 |
| Devices | 0.872 | 485 | 1 | 303 | 0.103 |
| Diseases | 0.919 | 116 | 1 | 66 | 0.784 |
| Procedures | 0.667 | 347 | 0.667 | 211 | 0.929 |
| | 1 | 141 | 1 | 72 | 0.458 |
Control condition generated significantly higher quality data than the experimental condition.
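For readers unfamiliar with the test named in the table, the following is a minimal sketch of a two-sided Wilcoxon rank sum test using the normal approximation. It assumes average ranks for ties and applies neither a tie correction nor a continuity correction, so its p-values will not necessarily match those produced by the statistical package used in the study. The function names and the two input samples are hypothetical; the samples are made-up F1 scores, not values from the study.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (z, two-sided p) for the rank sum of sample `x` against `y`."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                            # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2                  # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # no tie correction
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))           # 2 * (1 - Phi(|z|))
    return z, p


# Hypothetical F1 scores for two conditions (illustration only).
control = [1.0, 1.0, 0.9, 0.8]
experiment = [0.0, 0.5, 0.0, 0.6]
z, p = rank_sum_test(control, experiment)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With samples this small an exact test would normally be preferred; the normal approximation is used here only to keep the sketch short.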
Fig. 3. PHI coverage differences as a function of annotator number.