| Literature DB >> 31419946 |
Yipei Wang1, Xingyu Fan2, Luoxin Chen1, Eric I-Chao Chang3, Sophia Ananiadou4, Junichi Tsujii4,5, Yan Xu6,7.
Abstract
*: Background Consisting of dictated free-text documents such as discharge summaries, medical narratives are widely used in medical natural language processing. Relationships between anatomical entities and human body parts are crucial for building medical text mining applications. To capture them, we establish a mapping system consisting of a Wikipedia-based scoring algorithm and a named entity normalization (NEN) method. The mapping system makes full use of information available on Wikipedia, a comprehensive Internet medical knowledge base. We also built a new ontology, the Tree of Human Body Parts (THBP), from core anatomical parts by consulting anatomical experts and the Unified Medical Language System (UMLS) to make the mapping system useful for clinical applications. *: Results The gold standard is derived from 50 discharge summaries from our previous work, containing 2,224 anatomical entities. The F1-measure of the baseline system is 70.20%, while our Wikipedia-based algorithm achieves 86.67% with the assistance of NEN. *: Conclusions We construct a framework that maps anatomical entities to the THBP ontology using normalization and a Wikipedia-based scoring algorithm. The proposed framework is shown to be substantially more effective and efficient than the baseline system.
Keywords: Anatomical entity; Human body parts; Named entity normalization; Natural language processing; Wikipedia
Mesh:
Year: 2019 PMID: 31419946 PMCID: PMC6697955 DOI: 10.1186/s12859-019-3005-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The top level of the Tree of Human Body Parts (THBP). THBP consists of 9 parts: head, neck, chest, abdomen, pelvis, back, hip, extremity, and trunk. For each part, its sublayer is constructed from the organs or tissues in that part. The human body image in the figure was created by the author
Inter-annotator agreement between A1 and A2
| Annotator | Precision | Recall | F1 |
|---|---|---|---|
| A1 and A2 | 89.93% | 91.34% | 90.63% |
Inter-annotator agreement between each annotator and the gold standard
| Annotator | Precision | Recall | F1 |
|---|---|---|---|
| A1 | 89.53% | 88.63% | 89.07% |
| A2 | 90.28% | 91.88% | 91.07% |
Fig. 2 The flowchart of mapping. Anatomy-related entities are first extracted from medical text. The entities are then normalized using our synonym dictionary or the co-reference chains provided in the corpus of [51]. After that, we match the normalized entities against THBP to see whether they are included. If an entity is successfully matched (e.g. lower limb), the result is considered final. If not (e.g. myocarditis), the system turns to the external knowledge base to normalize the entity and match it against THBP again
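The flow above can be sketched in a few lines of Python. This is a toy illustration only: the dictionaries, the `lookup_external` stand-in for the Wikipedia step, and all entries in them are hypothetical, not the authors' actual resources or implementation.

```python
# Toy fragment of the THBP ontology: top-level body part -> member terms.
# Contents are illustrative, not the real ontology.
THBP = {
    "extremity": {"lower limb", "upper limb", "leg", "arm"},
    "chest": {"heart", "lung", "rib"},
}

# Toy synonym dictionary standing in for the NEN step.
SYNONYMS = {"lower extremity": "lower limb"}

def lookup_external(entity):
    """Stand-in for the external knowledge-base (Wikipedia) step:
    map an unmatched entity, e.g. a disease name, to an anatomical term."""
    external = {"myocarditis": "heart"}  # illustrative only
    return external.get(entity)

def map_to_thbp(entity):
    entity = SYNONYMS.get(entity, entity)      # normalize the entity
    for part, members in THBP.items():         # try a direct THBP match
        if entity == part or entity in members:
            return part
    resolved = lookup_external(entity)         # fall back to external KB
    if resolved is not None:
        return map_to_thbp(resolved)           # match the resolved term
    return None                                # not matched

print(map_to_thbp("lower extremity"))  # extremity (matched after NEN)
print(map_to_thbp("myocarditis"))      # chest (matched via external KB)
```

Here `lower limb` matches THBP directly after normalization, while `myocarditis` only matches after the external lookup resolves it to an anatomical term, mirroring the two branches in the flowchart.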
Fig. 3 The baseline system. First, anatomy-related entities are extracted from medical records. The entities are then matched against THBP to see whether they are included. If so, as with left eye, which is included under eye, the result is that left eye belongs to the class eye. Otherwise (e.g. myocarditis), the result is reported as not matched
Accuracy of the distance and frequency scoring methods
| Method | Correct | Wrong | Accuracy |
|---|---|---|---|
| Distance | 35 | 15 | 70.00% |
| Frequency | 34 | 16 | 68.00% |
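The distance and frequency methods in the table can be read as two heuristics for ranking candidate body parts against the text of an entity's Wikipedia article: frequency favors the candidate mentioned most often, while distance favors the candidate mentioned earliest. The sketch below illustrates that idea; the exact scoring formulas and the sample text are assumptions, not the paper's definitions.

```python
def frequency_score(article_text, candidates):
    """Score each candidate body part by how often it occurs
    in the article text (assumed formulation, not the paper's)."""
    text = article_text.lower()
    return {c: text.count(c) for c in candidates}

def distance_score(article_text, candidates):
    """Score each candidate by its first occurrence: earlier mentions
    score higher (assumed formulation, not the paper's)."""
    text = article_text.lower()
    scores = {}
    for c in candidates:
        pos = text.find(c)
        scores[c] = 1.0 / (1 + pos) if pos >= 0 else 0.0
    return scores

# Hypothetical article snippet for the entity "myocarditis".
article = ("Myocarditis is inflammation of the heart muscle. "
           "The heart lies within the chest.")
candidates = ["heart", "chest", "abdomen"]

best_by_freq = max(candidates, key=lambda c: frequency_score(article, candidates)[c])
best_by_dist = max(candidates, key=lambda c: distance_score(article, candidates)[c])
print(best_by_freq, best_by_dist)  # heart heart
```

On this snippet both heuristics agree on `heart`; the table above shows they perform comparably (70.00% vs. 68.00%) on the sampled entities, which motivates combining them (D&F) in the results that follow.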
Results of combinations of different methods with baseline
| Method | Precision | Recall | F1 | F1 gain over B |
|---|---|---|---|---|
| B | 83.87% | 60.37% | 70.20% | - |
| B+N | 86.87% | 67.00% | 75.65% | 5.45% |
| B+D | 86.16% | 67.12% | 75.46% | 5.26% |
| B+F | 86.24% | 68.66% | 76.45% | 6.25% |
| B+D&F | 86.51% | 69.19% | 76.89% | 6.69% |
| B+N+D | 88.48% | 79.15% | 83.55% | 13.35% |
| B+N+F | 88.47% | 79.09% | 83.52% | 13.32% |
| B+N+D&F | 89.11% | 84.36% | 86.67% | 16.47% |
Note: B-Baseline, N-Normalization, D-Distance, F-Frequency, D&F-Distance & Frequency