Bruce A Beckwith, Rajeshwarri Mahaadevan, Ulysses J Balis, Frank Kuo.
Abstract
BACKGROUND: Electronic medical records, including pathology reports, are often used for research purposes. Currently, few freely available programs remove identifiers while leaving the remainder of the pathology report text intact. Our goal was to produce an open source, Health Insurance Portability and Accountability Act (HIPAA) compliant deidentification tool tailored for pathology reports. METHODS: We designed a three-step process for removing potential identifiers. The first step looks for identifiers known to be associated with the patient, such as name, medical record number and pathology accession number. Next, a series of pattern matches looks for predictable patterns likely to represent identifying data, such as dates, accession numbers and addresses, as well as patient, institution and physician names. Finally, individual words are compared with a database of proper names and geographic locations. Pathology reports from three institutions were used to design and test the algorithms. The software was improved iteratively on training sets until it exhibited good performance. A set of 1800 new pathology reports was then processed. Each report was reviewed manually before and after deidentification to catalog all identifiers and note those that were not removed.
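The three-step process described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `scrub`, the three regular expressions, the tiny `NAME_DB` set and the `**REMOVED**` marker are all assumptions made for the example, since this record does not give the tool's actual patterns or name database.

```python
import re

MARK = "**REMOVED**"  # hypothetical replacement marker

# Step 2 patterns (illustrative assumptions, not the tool's real patterns)
PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),             # dates such as 3/14/2005
    re.compile(r"\b[A-Z]{1,2}\d{2}-\d{3,6}\b"),              # accession-like numbers such as S05-12345
    re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.?\s+[A-Z][a-z]+\b"),    # titled names such as Dr. Smith
]

# Step 3 lookup: a tiny stand-in for a database of proper names and places
NAME_DB = {"smith", "jones", "boston"}


def scrub(text, known_identifiers):
    """Apply the three steps in order and return the scrubbed text."""
    # Step 1: remove identifiers already known to belong to this patient.
    for ident in known_identifiers:
        if ident:
            text = re.sub(re.escape(ident), MARK, text, flags=re.IGNORECASE)

    # Step 2: remove text matching predictable identifier patterns.
    for pattern in PATTERNS:
        text = pattern.sub(MARK, text)

    # Step 3: compare each remaining word with the name/place list.
    def replace_if_listed(match):
        word = match.group(0)
        return MARK if word.lower() in NAME_DB else word

    return re.sub(r"[A-Za-z]+", replace_if_listed, text)


report = ("Patient John Doe, MRN 1234567, seen 3/14/2005 by Dr. Smith "
          "in Boston. Accession S05-12345.")
print(scrub(report, ["John Doe", "1234567"]))
```

Because step 3 matches single words without context, a sketch like this would also remove medically meaningful words that happen to appear in the name list, which is one way over-scrubbing can arise.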
Year: 2006 PMID: 16515714 PMCID: PMC1421388 DOI: 10.1186/1472-6947-6-12
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Identifiers that must be removed to deidentify medical data per HIPAA.
| Names |
| Geographic subdivisions smaller than a State * |
| All elements of dates (except year) |
| All ages over 89 * |
| Telephone numbers |
| Fax numbers |
| Electronic mail addresses |
| Social security numbers |
| Medical record numbers |
| Health plan beneficiary numbers |
| Account numbers |
| Certificate/license numbers |
| Vehicle identifiers and serial numbers, including license plate numbers |
| Device identifiers and serial numbers |
| Web Universal Resource Locators (URLs) |
| Internet Protocol (IP) address numbers |
| Biometric identifiers, including finger and voice prints |
| Full face photographic images and any comparable images |
| Any other unique identifying number, characteristic, or code |
* Additional details are given in the regulation text.
Note that these categories refer only to identifiers that concern an individual or that individual's relatives, employers or household members.
Figure 1. Example surgical pathology report in SPIN XML format. This is an example of a simple (and fictitious) surgical pathology report, which has been converted into the SPIN XML schema and deidentified by our scrubber. The schema supports a great amount of detail, but only a few of the elements are mandatory. This example shows the minimal elements needed to have a valid XML file which the scrubber will process. The format is somewhat redundant as the textual portions of the report are included twice, once in the
Figure 2. Average number of unique identifiers present per report. Note that Department B did not have any consult reports included in the sample.
Performance summary of the deidentification software
| | Dept. A | Dept. B | Dept. C | Total |
| Reports | 600 | 600 | 600 | 1800 |
| Reports with any identifier | 415 | 239 | 600 | 1254 |
| Unique identifiers | 1079 | 338 | 2082 | 3499 |
| Unique identifiers per report | 1.8 | 0.6 | 3.5 | 1.9 |
| Unique identifiers removed | 1057 | 320 | 2062 | 3439 |
| Unique identifiers remaining, total | 22 | 18 | 20 | 60 |
| Unique HIPAA identifiers remaining | 11 | 1 | 7 | 19 |
| % Unique identifiers removed | 98.0% | 94.7% | 99.0% | 98.3% |
| Unique over-scrubs | 1126 | 961 | 2584 | 4671 |
| Unique over-scrubs per report | 1.9 | 1.6 | 4.3 | 2.6 |
| % unique phrases removed that were identifiers | 48.4% | 25.0% | 44.4% | 42.4% |
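The derived rows in the table above can be reproduced from the raw counts. The short check below is a sketch: the counts are copied from the table, while the formula for the last row (removed identifiers divided by removed identifiers plus over-scrubs) is an assumed reading of "% unique phrases removed that were identifiers" that happens to match the published values. The dictionary keys and the `summarize` helper are names invented for this example.

```python
# Raw counts copied from the performance table
counts = {
    "Dept. A": {"reports": 600, "identifiers": 1079, "removed": 1057, "over_scrubs": 1126},
    "Dept. B": {"reports": 600, "identifiers": 338, "removed": 320, "over_scrubs": 961},
    "Dept. C": {"reports": 600, "identifiers": 2082, "removed": 2062, "over_scrubs": 2584},
}

# The Total column is the sum of the three departments.
counts["Total"] = {
    key: sum(dept[key] for dept in counts.values())
    for key in ("reports", "identifiers", "removed", "over_scrubs")
}


def summarize(c):
    """Compute the derived rows for one column of the table."""
    return {
        "identifiers_per_report": c["identifiers"] / c["reports"],
        "remaining": c["identifiers"] - c["removed"],
        "pct_removed": 100 * c["removed"] / c["identifiers"],
        "over_scrubs_per_report": c["over_scrubs"] / c["reports"],
        # Assumed formula: share of all removed phrases that were true identifiers
        "pct_removed_that_were_identifiers":
            100 * c["removed"] / (c["removed"] + c["over_scrubs"]),
    }


for name, c in counts.items():
    s = summarize(c)
    print(f"{name}: {s['pct_removed']:.1f}% removed, "
          f"{s['pct_removed_that_were_identifiers']:.1f}% of removed phrases were identifiers")
```

Read this way, the last row behaves like a precision figure and the "% unique identifiers removed" row like a recall figure for the scrubber.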
Summary of identifiers that were not removed.
| Identifier | Identifier Type | In-house Cases | Consult Cases | Total |
| Accession number | HIPAA | 0 | 10 | 10 |
| Patient Name | | | | |
| Misspelled | HIPAA | 5 | 2 | 7 |
| Correctly spelled | HIPAA | 0 | 0 | 0 |
| Medical record number | HIPAA | 1 | 0 | 1 |
| Date | HIPAA | 1 | 0 | 1 |
| HIPAA subtotal | | 7 | 12 | 19 |
| Institution address, partial | Non-HIPAA | 0 | 17 | 17 |
| Age <90 | Non-HIPAA | 16 | 0 | 16 |
| Health care organization name | Non-HIPAA | 0 | 6 | 6 |
| Doctor name | Non-HIPAA | 1 | 1 | 2 |
| Non-HIPAA subtotal | | 17 | 24 | 41 |
| Grand total | HIPAA and Non-HIPAA | 24 | 36 | 60 |
There were a total of 3499 unique identifiers in the test reports, of which 1809 were from consult reports and 1690 from in-house reports.