Jitendra Jonnagaddala, Aipeng Chen, Sean Batongbacal, Chandini Nekkantti.
Abstract
For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist the development of automatic methods for redacting sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings: serial annotations, parallel annotations, and pre-annotations. The quality of the annotations was evaluated for each setting. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations approach, but it can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients, with an average of 737.49 tokens and 18.29 protected health information entities annotated per report (standard deviation 7.35). The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
Year: 2021 PMID: 34620985 PMCID: PMC8497517 DOI: 10.1038/s41598-021-99554-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the OpenDeID corpus where n = total number of pathology reports.
| Category | All settings (n = 2100) | Setting 1 (n = 700) | Setting 2 (n = 700) | Setting 3 (n = 700) |
|---|---|---|---|---|
| NAME | 11,789 | 3929 | 3903 | 3957 |
| AGE | 141 | 40 | 42 | 59 |
| CONTACT | 7 | 1 | 5 | 1 |
| LOCATION | 9861 | 3359 | 3151 | 3351 |
| DATE | 7665 | 2566 | 2501 | 2598 |
| ID | 8951 | 2913 | 3042 | 2996 |
| PROFESSION | 0 | 0 | 0 | 0 |
| OTHER | 0 | 0 | 0 | 0 |
| Total number of PHI entities | 38,414 | 12,808 | 12,644 | 12,962 |
| Average number of PHI entities per report | 18.29 | 18.29 | 18.06 | 18.51 |
| Standard deviation of PHI entities | 7.35 | 6.85 | 7.67 | 7.50 |
| Total number of tokens | 1,548,741 | 510,357 | 508,988 | 529,396 |
| Average number of tokens per report | 737.49 | 729.08 | 727.12 | 756.28 |
| Standard deviation of tokens | 362.33 | 345.18 | 374.74 | 366.22 |
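The per-report averages in the summary table follow from dividing each total by the number of reports. The sketch below recomputes them in Python; the dictionary layout and the function name `per_report_average` are illustrative assumptions, and the totals are copied from the table. Some published averages differ from the recomputed ones in the last digit (for example, 12,808 / 700 ≈ 18.2971 is reported as 18.29), so any comparison should allow a tolerance of about 0.01.

```python
# Totals copied from the summary table; the data layout is an illustrative assumption.
CORPUS_TOTALS = {
    "All settings": {"reports": 2100, "phi": 38414, "tokens": 1548741},
    "Setting 1": {"reports": 700, "phi": 12808, "tokens": 510357},
    "Setting 2": {"reports": 700, "phi": 12644, "tokens": 508988},
    "Setting 3": {"reports": 700, "phi": 12962, "tokens": 529396},
}


def per_report_average(total: int, n_reports: int) -> float:
    """Mean count per report: a total divided by the number of reports."""
    return total / n_reports


if __name__ == "__main__":
    for name, row in CORPUS_TOTALS.items():
        phi_avg = per_report_average(row["phi"], row["reports"])
        token_avg = per_report_average(row["tokens"], row["reports"])
        print(f"{name}: {phi_avg:.4f} PHI entities, {token_avg:.4f} tokens per report")
```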
Time spent under each annotation setting.
| Setting | Annotator | No. of reports | Time spent by annotators independently (hours) | Time spent by annotators collaboratively (hours) | Total time (hours) |
|---|---|---|---|---|---|
| 1 | 1 | 700 | 24.65 | 8.25 | 37.4 |
| 1 | 2 | | 4.5 | | |
| 2 | 1 | 700 | 25.9 | 9.75 | 55.2 |
| 2 | 2 | | 19.55 | | |
| 3 | 1 | 700 | 10.8 | 9.75 | 27.75 |
| 3 | 2 | 700 | 7.2 | | |
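Each setting's total time equals the two annotators' independent hours plus the collaborative hours. The following minimal Python sketch reproduces that arithmetic; the dictionary layout and the function names `total_hours` and `minutes_per_report` are illustrative assumptions, and the hours are copied from the table.

```python
# Hours copied from the time table; the data layout is an illustrative assumption.
TIME_SPENT = {
    1: {"independent": [24.65, 4.5], "collaborative": 8.25},
    2: {"independent": [25.9, 19.55], "collaborative": 9.75},
    3: {"independent": [10.8, 7.2], "collaborative": 9.75},
}


def total_hours(setting: int) -> float:
    """Both annotators' independent hours plus the collaborative hours."""
    entry = TIME_SPENT[setting]
    return round(sum(entry["independent"]) + entry["collaborative"], 2)


def minutes_per_report(setting: int, n_reports: int = 700) -> float:
    """Total annotation effort per report, in minutes."""
    return total_hours(setting) * 60 / n_reports


if __name__ == "__main__":
    for setting in sorted(TIME_SPENT):
        print(
            f"Setting {setting}: {total_hours(setting):.2f} h total, "
            f"{minutes_per_report(setting):.2f} min per report"
        )
```

On these figures, Setting 3 needed about half the total time of Setting 2 (27.75 versus 55.2 hours).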
IAA across individual settings.
| | All settings (n = 2100) | Setting 1 | Setting 2 | Setting 3 | Setting 3 |
|---|---|---|---|---|---|
| Precision | 0.9482 | 0.9565 | 0.9329 | 0.9263 | 0.9618 |
| Recall | 0.9445 | 0.9552 | 0.9346 | 0.824 | 0.8455 |
| IAA | 0.9464 | 0.9559 | 0.9337 | 0.8721 | 0.8999 |
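The IAA values in this table are numerically consistent with the harmonic mean of precision and recall (the F1 score). That reading is inferred from the numbers and is not a definition given in this record. The Python sketch below checks it; the function name `f1_score`, the row labels for the two Setting 3 columns, and the 0.001 tolerance (to absorb rounding in the published figures) are assumptions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported IAA) copied from the table.
# The two "Setting 3" labels only distinguish the two columns; they are assumptions.
IAA_TABLE = {
    "All settings": (0.9482, 0.9445, 0.9464),
    "Setting 1": (0.9565, 0.9552, 0.9559),
    "Setting 2": (0.9329, 0.9346, 0.9337),
    "Setting 3 (first column)": (0.9263, 0.824, 0.8721),
    "Setting 3 (second column)": (0.9618, 0.8455, 0.8999),
}


def max_deviation() -> float:
    """Largest absolute gap between the recomputed F1 and the reported IAA."""
    return max(abs(f1_score(p, r) - iaa) for p, r, iaa in IAA_TABLE.values())


if __name__ == "__main__":
    for name, (p, r, iaa) in IAA_TABLE.items():
        print(f"{name}: recomputed {f1_score(p, r):.4f}, reported {iaa:.4f}")
```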
DS across individual settings.
| | All settings (n = 2100), Annotator 1 | All settings (n = 2100), Annotator 2 | Setting 1 (n = 700), Annotator 1 | Setting 1 (n = 700), Annotator 2 | Setting 2 (n = 700), Annotator 1 | Setting 2 (n = 700), Annotator 2 | Setting 3 (n = 700), Annotator 1 | Setting 3 (n = 700), Annotator 2 |
|---|---|---|---|---|---|---|---|---|
| Precision | 0.954 | 0.997 | 0.9564 | 0.9998 | 0.9508 | 0.991 | 0.9549 | 0.9934 |
| Recall | 0.9466 | 0.9931 | 0.9552 | 1 | 0.9411 | 0.979 | 0.9436 | 0.9934 |
| DS | 0.9503 | 0.995 | 0.9558 | 0.9999 | 0.9459 | 0.985 | 0.9492 | 0.9934 |
| Average DS (both annotators) | 0.9726 | | 0.9779 | | 0.9655 | | 0.9713 | |
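The "Average DS" row matches the arithmetic mean of the two annotators' deviation scores within each setting. The Python sketch below verifies this against the published values; the function name `average_ds`, the dictionary layout, and the 0.0001 tolerance for rounding are assumptions.

```python
def average_ds(ds_a: float, ds_b: float) -> float:
    """Arithmetic mean of two annotators' deviation scores."""
    return (ds_a + ds_b) / 2


# (DS of one annotator, DS of the other, reported average) copied from the table.
DS_TABLE = {
    "All settings": (0.9503, 0.995, 0.9726),
    "Setting 1": (0.9558, 0.9999, 0.9779),
    "Setting 2": (0.9459, 0.985, 0.9655),
    "Setting 3": (0.9492, 0.9934, 0.9713),
}


def max_deviation() -> float:
    """Largest absolute gap between the recomputed mean and the reported average."""
    return max(abs(average_ds(a, b) - avg) for a, b, avg in DS_TABLE.values())


if __name__ == "__main__":
    for name, (a, b, avg) in DS_TABLE.items():
        print(f"{name}: recomputed {average_ds(a, b):.5f}, reported {avg:.4f}")
```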
Figure 1. Overall OpenDeID corpus construction process.
Figure 2. Annotation process under three different settings.