OBJECTIVE: De-identified medical records are critical to biomedical research. Text de-identification software exists, including "resynthesis" components that replace real identifiers with synthetic identifiers. The goal of this research is to evaluate the effectiveness of, and examine possible bias introduced by, resynthesis in de-identification software. DESIGN: We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, with clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files, including laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records. MEASUREMENTS: We measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule. RESULTS: The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Results for training and testing on the real records were 0.990 accuracy and 0.960 F-measure. The results improved when trained and tested on resynthesized records, with 0.998 accuracy and 0.980 F-measure, but deteriorated moderately when trained on real records and tested on resynthesized records, with 0.989 accuracy and 0.862 F-measure. Moreover, the results declined significantly when trained on resynthesized records and tested on real records, with 0.942 accuracy and 0.728 F-measure. CONCLUSION: The de-identification tool achieves high accuracy when training and test sets are homogeneous (i.e., both real or resynthesized records). The resynthesis component regularizes the data to make them less "realistic," resulting in loss of performance, particularly when training on resynthesized data and testing on real data.
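The abstract reports detection performance as precision, recall, F-measure and accuracy. As a minimal sketch of how these four metrics relate, the following computes them from token-level confusion-matrix counts; the counts used here are hypothetical and are not taken from the study:

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int):
    """Compute standard detection metrics from confusion-matrix counts.

    tp: identifier tokens correctly flagged
    fp: non-identifier tokens incorrectly flagged
    fn: identifier tokens missed
    tn: non-identifier tokens correctly left alone
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Illustrative counts only (hypothetical, chosen for readability):
p, r, f, a = detection_metrics(tp=96, fp=4, fn=4, tn=896)
print(p, r, f, a)  # 0.96 0.96 0.96 0.992
```

Note that accuracy can stay high even when the F-measure drops sharply, because non-identifier tokens dominate clinical text; this is consistent with the reported 0.942 accuracy alongside a 0.728 F-measure in the resynthesized-to-real condition.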
Authors: Ben Wellner; Matt Huyck; Scott Mardis; John Aberdeen; Alex Morgan; Leonid Peshkin; Alex Yeh; Janet Hitzeman; Lynette Hirschman Journal: J Am Med Inform Assoc Date: 2007-06-28 Impact factor: 4.497