Beau Norgeot1, Kathleen Muenzen1, Thomas A Peterson1, Xuancheng Fan1, Benjamin S Glicksberg1, Gundolf Schenk1, Eugenia Rutenberg1, Boris Oskotsky1, Marina Sirota1, Jinoos Yazdany2, Gabriela Schmajuk2,3, Dana Ludwig1, Theodore Goldstein1, Atul J Butte1,4.
Abstract
There is a great and growing need to ascertain what exactly is the state of a patient, in terms of disease progression, actual care practices, pathology, adverse events, and much more, beyond the paucity of data available in structured medical record data. Ascertaining these harder-to-reach data elements is now critical for the accurate phenotyping of complex traits, detection of adverse outcomes, efficacy of off-label drug use, and longitudinal patient surveillance. Clinical notes often contain the most detailed and relevant digital information about individual patients, the nuances of their diseases, the treatment strategies selected by physicians, and the resulting outcomes. However, notes remain largely unused for research because they contain Protected Health Information (PHI), which is synonymous with individually identifying data. Previous clinical note de-identification approaches have been rigid and still too inaccurate to see any substantial real-world use, primarily because they have been trained on medical text corpora that were too small. To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and developed a customizable open-source de-identification software called Philter ("Protected Health Information filter"). Here we describe the design and evaluation of Philter, and show how it offers substantial real-world improvements over prior methods.
Keywords: Health care; Medical research
Year: 2020 PMID: 32337372 PMCID: PMC7156708 DOI: 10.1038/s41746-020-0258-y
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Performance comparison of tools and corpora.
| Tool | UCSF P | UCSF R | UCSF F2 | I2B2 P | I2B2 R | I2B2 F2 |
|---|---|---|---|---|---|---|
| PHIlter | 78.28 | 99.46 | 94.36 | 78.58 | 99.92 | 94.77 |
| Physionet | 90.62 | 85.10 | 86.15 | 89.49 | 69.84 | 73.05 |
| Scrubber | 79.24 | 95.30 | 91.59 | 76.26 | 87.80 | 85.22 |
P: precision; R: recall; F2: F-measure with β = 2, which weights recall more heavily than precision.
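The F2 values in the table can be sanity-checked against the reported precision and recall. The following is a minimal sketch using the standard F-beta definition with β = 2; the function name `f_beta` and the dictionary `ucsf` are illustrative and are not taken from the Philter code base. Because the table reports rounded percentages, recomputed values may differ from the published F2 in the second decimal place.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was retrieved correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs in percent, copied from the UCSF columns of the table.
ucsf = {
    "PHIlter": (78.28, 99.46),
    "Physionet": (90.62, 85.10),
    "Scrubber": (79.24, 95.30),
}

for tool, (p, r) in ucsf.items():
    print(f"{tool}: F2 = {f_beta(p, r):.2f}")
```

With β = 2, a tool that misses PHI (low recall) is penalized more than one that over-redacts (low precision), which is why PHIlter scores highest on F2 despite having lower precision than Physionet.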
Remaining PHI analysis by tool, UCSF test corpus.
| PHI category | Instances of PHI remaining (PHIlter) | Instances of PHI remaining (Physionet) | Instances of PHI remaining (Scrubber) |
|---|---|---|---|
| Age ≥ 90 | 0 | 0 | 0 |
| Patient_Vehicle_or_Device_Id | 0 | 18 | 0 |
| Patient_Account_Number | 0 | 35 | 4 |
| Patient_Medical_Record_Id | 0 | 445 | 0 |
| Patient_Social_Security_Number | 0 | 0 | 6 |
| Patient_Phone_Fax | 0 | 0 | 1 |
| Patient_Initials | 2 | 120 | 132 |
| Patient_Name_or_Family_Member_Name | 6 | 211 | 93 |
| Patient_Address | 7 | 25 | 16 |
| Patient_Unique_ID | 20 | 442 | 34 |
| [category label missing] | 0 | 1 | 1 |
| URL_IP | 4 | 20 | 153 |
| Date | 7 | 257 | 269 |
| Provider_Certificate_or_License | 0 | 276 | 99 |
| Provider_Name | 12 | 546 | 90 |
| Provider_Initials | 12 | 236 | 217 |
| Provider_Address_or_Location | 43 | 1597 | 210 |
| Provider_Phone_Fax | 45 | 49 | 43 |
Counts of PHI remaining after de-identification by PHIlter, Physionet, and Scrubber on the UCSF test corpus. Each instance is a single token; a multi-token item of PHI contributes one instance per token.
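Token-level counting, as described in the caption above, can be illustrated with a small sketch. The note text, the token indices, and the function name `token_level_counts` are all hypothetical and chosen only for illustration; this is not the evaluation code used in the study.

```python
def token_level_counts(gold_phi, removed):
    """Compare sets of token indices and return (tp, fp, fn).

    gold_phi: indices of tokens annotated as PHI
    removed:  indices of tokens the tool redacted
    """
    tp = len(gold_phi & removed)   # PHI tokens correctly redacted
    fp = len(removed - gold_phi)   # non-PHI tokens redacted unnecessarily
    fn = len(gold_phi - removed)   # PHI tokens remaining in the output
    return tp, fp, fn


# Hypothetical note, split on whitespace into tokens 0..9.
note = "Seen by Dr. Jane Doe on 03/04/2019 at the clinic".split()

gold = {3, 4, 6}      # "Jane", "Doe", "03/04/2019"
removed = {3, 6, 9}   # the tool missed "Doe" and over-redacted "clinic"

tp, fp, fn = token_level_counts(gold, removed)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"remaining PHI tokens: {fn}, recall: {recall:.2f}, precision: {precision:.2f}")
```

Here the two-token name "Jane Doe" counts as two instances of PHI, so missing only "Doe" leaves one instance remaining.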
Remaining PHI analysis by tool, I2B2 corpus.
| PHI category | Instances of PHI remaining (PHIlter) | Instances of PHI remaining (Physionet) | Instances of PHI remaining (Scrubber) |
|---|---|---|---|
| Age | 0 | 1 | 0 |
| Device | 0 | 6 | 0 |
| Medical record | 0 | 524 | 18 |
| Patient | 2 | 154 | 92 |
| Date | 0 | 4590 | 1587 |
| Fax | 0 | 2 | 0 |
| Phone | 0 | 31 | 67 |
| Zip | 0 | 3 | 1 |
| Username | 1 | 92 | 92 |
| Street | 2 | 27 | 21 |
| Location-other | 2 | 9 | 12 |
| Idnum | 2 | 297 | 206 |
| City | 2 | 14 | 52 |
| Doctor | 5 | 197 | 186 |
Counts of PHI remaining after de-identification by PHIlter, Physionet, and Scrubber on the I2B2 corpus.
Fig. 1 Algorithm Pipeline.
A conceptual overview of the Philter pipeline and process.
Fig. 2 Ecosystem.
The compute environment and system used for the development and validation of Philter.