Mahanazuddin Syed1, Shaymaa Al-Shukri1, Shorabuddin Syed1, Kevin Sexton1, Melody L Greer1, Meredith Zozus2, Sudeepa Bhattacharyya3, Fred Prior1.
Abstract
Named Entity Recognition (NER), which aims to identify and classify entities into predefined categories, is a critical pre-processing task in the Natural Language Processing (NLP) pipeline. Readily available off-the-shelf NER algorithms or programs are trained on a general corpus and often need to be retrained when applied to a different domain. The end model's performance depends on the quality of the named entities generated by the NER models used in the NLP task. To improve NER model accuracy, researchers build domain-specific corpora for both model training and evaluation. However, in the clinical domain, there is a dearth of training data for privacy reasons, forcing many studies to use NER models trained on non-clinical text to generate the NER feature set. This, in turn, influences the performance of downstream NLP tasks such as information extraction and de-identification. In this paper, our objective is to create a high-quality annotated clinical corpus for training NER models that generalizes easily and can be used in a downstream de-identification task to generate a named-entity feature set.
Keywords: Annotation; Clinical Corpus; De-identification; Named Entity Recognition; Natural Language Processing
Year: 2021 PMID: 34042780 PMCID: PMC9019788 DOI: 10.3233/SHTI210195
Source DB: PubMed Journal: Stud Health Technol Inform ISSN: 0926-9630
Details of entities, time taken to annotate, and Inter-Annotator Agreement (IAA) measured.
| Entity Type | Total Entities identified by Annotator 1 (n=5603, time=30 hours) (no pre-annotation set) | Total Entities identified by Annotator 2 (n=5776, time=34.5 hours) (pre-annotation set) | Average Inter Annotator Agreement (IAA) after 5 iterations |
|---|---|---|---|
| DATE | 3493 | 3541 | 0.94 |
| PERSON | 1403 | 1481 | 0.91 |
| AGE | 253 | 256 | 0.94 |
| ID | 169 | 171 | 0.98 |
| ORGANIZATION | 114 | 134 | 0.87 |
| LOCATION | 113 | 130 | 0.86 |
| PHONE | 55 | 59 | 0.93 |
| WEB | 3 | 4 | 0.84 |
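The paper reports average IAA per entity type but does not spell out the metric in this excerpt. A common choice for span-annotation tasks like this is pairwise entity-level F1 over exact span matches; the sketch below is illustrative only, and the function name and toy spans are our own assumptions, not the authors' code.

```python
# Hedged sketch: pairwise inter-annotator agreement for NER computed as
# entity-level F1 over exact (start, end, type) matches. This is one
# common IAA metric; the paper may have used a different one.

def entity_f1(ann1, ann2):
    """Agreement between two annotators' entity sets.

    ann1, ann2: sets of (start_offset, end_offset, entity_type) tuples.
    Returns F1 treating ann1 as reference and ann2 as hypothesis
    (the score is symmetric for exact-match comparison).
    """
    if not ann1 and not ann2:
        return 1.0  # both annotators marked nothing: perfect agreement
    matches = len(ann1 & ann2)  # spans identical in offsets and type
    precision = matches / len(ann2) if ann2 else 0.0
    recall = matches / len(ann1) if ann1 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example (hypothetical offsets): annotators agree on two DATE
# spans and each mark one span the other did not.
a1 = {(0, 10, "DATE"), (15, 25, "DATE"), (30, 36, "PERSON")}
a2 = {(0, 10, "DATE"), (15, 25, "DATE"), (40, 46, "AGE")}
print(round(entity_f1(a1, a2), 2))  # 0.67
```

In the study's workflow, a score like this would be computed per entity type after each of the five iterations and averaged to give the rightmost column of the table above.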
Figure 1. A workflow depicting the annotation process, which was divided into a preparation stage and an annotation stage. The preparation stage finalizes the prerequisites, and the annotation stage performs annotation, computes inter-annotator agreement, and finalizes the corpus based on discussion and consensus.
Figure 2. Example file depicting annotations using the BRAT tool. (a) Set of pre-defined entities, (b) sample BRAT annotation document, and (c) final metadata file with annotations that will be used for training.