Aditi Gupta, Albert Lai, Jessica Mozersky, Xiaoteng Ma, Heidi Walsh, James M DuBois.
Abstract
OBJECTIVE: Sharing health research data is essential for accelerating the translation of research into actionable knowledge that can impact health care services and outcomes. Qualitative health research data are rarely shared due to the challenge of deidentifying text and the potential risks of participant reidentification. Here, we establish and evaluate a framework for deidentifying qualitative research data using automated computational techniques, including removal of identifiers that are not considered HIPAA Safe Harbor (HSH) identifiers but are likely to be found in unstructured qualitative data.
Keywords: data sharing; deidentification; natural language processing; qualitative research data
Year: 2021; PMID: 34435175; PMCID: PMC8382275; DOI: 10.1093/jamiaopen/ooab069
Source DB: PubMed; Journal: JAMIA Open; ISSN: 2574-2531
Figure 1. Overall approach for the qualitative research data review, development, and validation of the natural language processing (NLP)-based deidentification pipeline.
List of categories identified during qualitative analysis
| Category name | Identifier category and classification |
|---|---|
| Name | HSH: Names |
| Location | HSH: All geographic subdivisions smaller than a state. Non-HSH: References to a geographic area at the state level or larger including country such as “I was born on the East Coast”. |
| Date/time/age | HSH: All elements of dates (except year), age greater than 89. Non-HSH: References to age in years, months, or weeks not considered HSH such as “The baby was four weeks old on Christmas Day” or “It was my thirtieth birthday”. |
| Numbers | HSH: Telephone numbers, vehicle identifiers and serial numbers (including license plate numbers), fax numbers, device identifiers and serial numbers, Social Security numbers, medical record numbers, health plan beneficiary numbers, account numbers, any other unique identifying number, characteristic, or code, certificate/license numbers. Non-HSH: Any numerical value or digit not categorized as HSH such as “He weighed over 600 pounds” or “She had 13 children” or “Our highest paid nurse earns $12,500 a year”. |
| Web emails/URLs | HSH: Email addresses, web universal resource locators (URLs), internet protocol (IP) addresses. |
| Organization | Non-HSH: Institution or organization name: References to the name of an institution or organization that is not categorized as an HSH geographic region smaller than a state such as “Barnes-Jewish Hospital” or “Washington University in St Louis” which are not actual addresses and constitute multiple potential locations. Proper names of institutions or organizations would go here such as “Pfizer” or “World Health Organization”. |
| Rare diseases | Non-HSH: Commonly recognized rare diseases obtained from public databases. |
| Race, ethnicity | Non-HSH: References to NIH racial/ethnic categories, indigenous status, or nationality such as “Most patients were from Haiti” “A Hispanic nurse working on the psychiatric ward treated me”. |
| Sexual orientation | Non-HSH: Reference to sex, gender, or sexual orientation that is not heterosexual including LGBTQI. |
| Other | Non-HSH: Rare events and other rare references not captured under any existing category and that are unlikely to be captured by automation such as “He won the Olympic gold medal for swimming in Houston” or “Nobel laureate in 1995”. |
Note: Each category contained HSH and/or non-HSH identifiers. Identifier text was replaced by its corresponding category name in the deidentified text.
HSH: HIPAA Safe Harbor.
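
The note above says that identifier text was replaced by its category name. A minimal sketch of that substitution step is shown here, assuming simple regular expressions for two of the categories. The patterns, the bracketed tag format, and the `deidentify` helper are hypothetical illustrations, not the rules or models used in the study's NLP pipeline.

```python
import re

# Hypothetical patterns for illustration only; the study's pipeline is not
# described at this level of detail in the text above.
PATTERNS = [
    # Email addresses (HSH) -> "Web emails/URLs" category
    ("Web emails/URLs", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # http/https URLs (HSH) -> "Web emails/URLs" category
    ("Web emails/URLs", re.compile(r"https?://\S+")),
    # US-style telephone numbers such as 314-555-0100 (HSH) -> "Numbers"
    ("Numbers", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]


def deidentify(text):
    """Replace each matched identifier with its category name in brackets."""
    for category, pattern in PATTERNS:
        text = pattern.sub(f"[{category}]", text)
    return text


sample = "Email me at jane.doe@example.org or call 314-555-0100."
print(deidentify(sample))
```

Categories such as names, organizations, or rare events generally cannot be captured by patterns like these and would need named-entity recognition, dictionaries, or manual review.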
Descriptive statistics of the 2 datasets used in the study
| | Dataset 1 (NIB stories) | Dataset 2 (Interviews) | Total |
|---|---|---|---|
| Number of files | 304 | 120 | 424 |
| Number of word tokens | 547 733 | 683 580 | 1 231 313 |
| Mean length of file (# word tokens) | 1801.75 | 5696.50 | 2904.04 |
NIB: Narrative Inquiry in Bioethics.
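
The mean file lengths in the table follow from dividing the token counts by the file counts. The short sketch below reproduces that arithmetic from the reported figures; the `mean_tokens_per_file` helper and the dictionary layout are assumptions made for illustration.

```python
# File and word-token counts as reported in the table above.
datasets = {
    "Dataset 1 (NIB stories)": {"files": 304, "tokens": 547_733},
    "Dataset 2 (Interviews)": {"files": 120, "tokens": 683_580},
}


def mean_tokens_per_file(files, tokens):
    """Mean file length in word tokens."""
    return tokens / files


for name, d in datasets.items():
    print(name, round(mean_tokens_per_file(d["files"], d["tokens"]), 2))

# The total row pools both datasets before dividing.
total_files = sum(d["files"] for d in datasets.values())
total_tokens = sum(d["tokens"] for d in datasets.values())
print("Total", round(mean_tokens_per_file(total_files, total_tokens), 2))
```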
Figure 2. Distribution of identifier categories (HIPAA Safe Harbor [HSH] and non-HSH identifiers) in the 2 datasets.
Figure 3. Flowchart of how our NLP pipeline deidentifies the qualitative research documents.
Descriptive statistics of the 2 datasets used in the study and the number (%) of identifiers (HSH and non-HSH) extracted using the NLP pipeline from each set; and gold-standard evaluation of the NLP system
| Group | Dataset name (number of files) | Token count | Identifier count (%) | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|
| Pilot files | NIB stories (6 files) | 12 620 | 389 (3%) | 0.93 | 0.92 | 0.93 |
| Pilot files | QDS interviews (9 files), Iteration 1 | 85 590 | 650 (1%) | 0.98 | 0.83 | 0.90 |
| Pilot files | QDS interviews (9 files), Iteration 2 | 85 590 | 650 (1%) | 0.98 | 0.90 | 0.94 |
| Additional files | NIB stories (25 files) | 48 807 | 858 (2%) | 0.93 | 0.98 | 0.95 |
| Additional files | QDS interviews (30 files), Iteration 1 | 139 323 | 998 (1%) | 0.97 | 0.81 | 0.88 |
| Additional files | QDS interviews (30 files), Iteration 2 | 139 323 | 998 (1%) | 0.97 | 0.95 | 0.96 |
| Total | 70 files | 286 340 | 2888 (1%) | 0.95 | 0.88 | 0.91 |
| Total, Iteration 2 | 70 files | 286 340 | 2888 (1%) | 0.95 | 0.96 | 0.96 |
We performed an error analysis after Iteration 1 and observed that a single organization name, which was repeated as part of an interview question in every transcript of dataset 2, was being missed by our pipeline and was driving the low recall. Iteration 2 results show the performance of the pipeline after excluding that one unrecognized organization name.
HSH: HIPAA Safe Harbor; NIB: Narrative Inquiry in Bioethics; QDS: qualitative data sharing.
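
For readers who want to relate the columns of the evaluation table, the sketch below gives the standard definitions of precision, recall, and F1 score (the harmonic mean of precision and recall). The function names are assumptions made for illustration, and the study's own evaluation code is not shown in the text. Because the table reports rounded precision and recall, recomputing F1 from those rounded values can differ from the reported F1 in the last digit for some rows.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and
    false-negative counts against a gold standard."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pilot QDS interviews (9 files), using the rounded values from the table.
print(round(f1_score(0.98, 0.83), 2))  # Iteration 1
print(round(f1_score(0.98, 0.90), 2))  # Iteration 2
```

In this pair of rows precision is unchanged between iterations, so the gain in F1 comes entirely from the higher recall, which is consistent with the error analysis described above.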