Tanmoy Paul, Md Kamruz Zaman Rana, Preethi Aishwarya Tautam, Teja Venkat Pavan Kotapati, Yaswitha Jampani, Nitesh Singh, Humayera Islam, Vasanthi Mandhadi, Vishakha Sharma, Michael Barnes, Richard D Hammer, Abu Saleh Mohammad Mosa.
Abstract
BACKGROUND: Electronic health record (EHR) systems contain a large volume of text, including visit notes, discharge summaries, and various reports. To protect patient confidentiality, these records often need to be fully de-identified before being circulated for secondary use. Machine learning (ML)-based named entity recognition (NER) has emerged as a popular technique for automatic de-identification.
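To make the idea of NER-based de-identification concrete, the following is a minimal sketch of the final redaction step only, assuming a model has already assigned BIO-style labels (such as `B-NAME`, `I-NAME`, `O`) to each token. The function name `redact`, the label scheme, and the sample sentence are illustrative assumptions and are not taken from the study.

```python
from typing import List, Optional


def redact(tokens: List[str], labels: List[str]) -> str:
    """Replace each run of PHI-tagged tokens with a single [TYPE] placeholder.

    Assumes BIO labels of the form "B-TYPE" / "I-TYPE" / "O" (hypothetical scheme).
    """
    out: List[str] = []
    current: Optional[str] = None  # PHI type of the span being redacted
    for token, label in zip(tokens, labels):
        if label == "O":
            out.append(token)  # non-PHI tokens pass through unchanged
            current = None
            continue
        prefix, _, phi_type = label.partition("-")
        # Emit a new placeholder on a B- tag or when the PHI type changes.
        if prefix == "B" or phi_type != current:
            out.append(f"[{phi_type}]")
        current = phi_type
    return " ".join(out)


# Hypothetical example sentence and predicted labels.
tokens = ["Seen", "by", "Dr", "Jane", "Doe", "on", "03/14/2019", "."]
labels = ["O", "O", "O", "B-NAME", "I-NAME", "O", "B-DATE", "O"]
print(redact(tokens, labels))  # Seen by Dr [NAME] on [DATE] .
```

In a real pipeline the labels would come from the trained NER model; any PHI token the model misses (a recall error) would pass through unredacted, which is why recall matters so much for this task.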
Keywords: NLP; clinical text de-identification; conditional random field; data warehousing; de-identification; named entity recognition; protected health information
Year: 2022 PMID: 35252956 PMCID: PMC8890696 DOI: 10.3389/fdgth.2022.728922
Source DB: PubMed Journal: Front Digit Health ISSN: 2673-253X
Figure 1. Workflow of the 1st stage of the experiment: identifying the best feature set.
Figure 2. Workflow of the 2nd stage of the experiment: determining the minimum size of training set.
Precision (P), recall (R), and F1-score achieved by the models trained on each individual feature, for each PHI type.
|
|
|
|
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|---|---|---|---|
| NAME | P | 0.57 |
|
| 0.40 | N/A | N/A | 0.58 | 0.78 |
| 0.34 |
| R | 0.36 |
| 0.75 | 0.21 | 0 | 0 | 0.44 |
| 0.75 | 0.06 | |
| F1 | 0.44 |
| 0.77 | 0.28 | 0 | 0 | 0.50 |
| 0.77 | 0.11 | |
| LOCATION | P | 0.78 |
|
| 0.81 | N/A | N/A |
|
|
| 0.10 |
| R | 0.72 |
|
| 0.75 | 0 | 0 | 0.75 |
| 0.80 | 0.01 | |
| F1 | 0.75 |
|
| 0.78 | 0 | 0 | 0.79 |
| 0.82 | 0.01 | |
| DATE | P | 0.74 |
|
| 0.72 | N/A | N/A | 0.76 |
|
| 0.80 |
| R | 0.45 |
|
| 0.45 | 0 | 0 | 0.45 |
|
| 0.70 | |
| F1 | 0.56 |
|
| 0.55 | 0 | 0 | 0.56 | 0.80 |
| 0.74 | |
| HOSPITAL | P | 0.82 |
|
| 0.78 | N/A | N/A |
|
|
| 0.05 |
| R | 0.77 |
|
| 0.74 | 0 | 0 | 0.85 |
|
| 0.01 | |
| F1 | 0.79 |
|
| 0.76 | 0 | 0 | 0.88 |
|
| 0.01 | |
| PHONE | P | N/A |
|
|
| N/A | N/A | 0.93 |
|
|
|
| R | 0 |
|
|
| 0 | 0 |
|
|
|
| |
| F1 | 0 |
|
|
| 0 | 0 |
|
|
|
| |
| ID | P | 0.98 |
| 0.97 | N/A | N/A | N/A | 0.64 | 0.92 |
| 0.96 |
| R | 0.02 | 0.78 |
| 0 | 0 | 0 | 0.03 | 0.78 |
| 0.29 | |
| F1 | 0.04 | 0.87 |
| 0 | 0 | 0 | 0.06 | 0.84 |
| 0.44 | |
| INITIALS | P | 0.91 | 0.96 |
|
| N/A | N/A | 0.91 | 0.90 |
| 0.70 |
| R | 0.08 |
| 0.45 | 0.06 | 0 | 0 | 0.12 | 0.19 |
| 0.10 | |
| F1 | 0.14 |
|
| 0.13 | 0 | 0 | 0.22 | 0.32 |
| 0.02 |
The highest and the second highest values of each performance measure for each PHI are marked by red and green colors, respectively.
Precision (P), recall (R), and F1-score achieved by the models trained on all the features and on the four best-performing features.
| PHI | Measure | All features | Four best features |
|---|---|---|---|
| NAME | P | 0.81 | 0.80 |
| NAME | R | 0.80 | 0.79 |
| NAME | F1 | 0.81 | 0.80 |
| LOCATION | P | 0.87 | 0.85 |
| LOCATION | R | 0.84 | 0.83 |
| LOCATION | F1 | 0.85 | 0.84 |
| DATE | P | 0.86 | 0.86 |
| DATE | R | 0.78 | 0.79 |
| DATE | F1 | 0.82 | 0.82 |
| HOSPITAL | P | 0.95 | 0.96 |
| HOSPITAL | R | 0.94 | 0.93 |
| HOSPITAL | F1 | 0.95 | 0.95 |
| PHONE | P | 0.97 | 0.95 |
| PHONE | R | 0.96 | 0.94 |
| PHONE | F1 | 0.96 | 0.95 |
| ID | P | 0.98 | 0.99 |
| ID | R | 0.82 | 0.82 |
| ID | F1 | 0.89 | 0.90 |
| INITIALS | P | 0.98 | 0.97 |
| INITIALS | R | 0.47 | 0.49 |
| INITIALS | F1 | 0.64 | 0.65 |
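The F1-scores in these tables are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The sketch below shows that relationship and checks it against the INITIALS and ID rows above. It also illustrates one common convention, assumed here and not stated in the source, for the rows where precision is reported as N/A with recall and F1 of 0: if a model predicts no entity of a given type, precision is undefined and F1 is taken as 0. The function names and the counts passed to `prf` are hypothetical.

```python
from typing import Optional, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[Optional[float], float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts.

    Precision is None (reported as N/A) when nothing was predicted at all.
    """
    predicted = tp + fp
    actual = tp + fn
    precision = tp / predicted if predicted else None
    recall = tp / actual if actual else 0.0
    # F1 is defined as 0 when precision is undefined or either score is 0.
    if not precision or not recall:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values taken from the INITIALS and ID rows of the table above.
print(round(f1_from(0.98, 0.47), 2))  # 0.64
print(round(f1_from(0.98, 0.82), 2))  # 0.89

# Hypothetical counts: no predictions at all gives precision N/A, recall 0, F1 0.
print(prf(tp=0, fp=0, fn=5))  # (None, 0.0, 0.0)
```

The INITIALS row shows why F1 is reported alongside precision: a model can be almost always right when it flags a token (P = 0.98) while still missing about half of the true instances (R = 0.47), and the harmonic mean is pulled toward the weaker of the two scores.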
Figure 3. Variation of performance (F1-score) of CRF model with the size of training data for (A) LOCATION, DATE, HOSPITAL, PHONE, NAME, ID, and (B) INITIALS.
Figure 4. Framework for automatic clinical de-identification pipeline.