Wangjin Lee (1), Kyungmo Kim (2), Eun Young Lee (3), Jinwook Choi (4).
1. Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea. Electronic address: jinsamdol@snu.ac.kr.
2. Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea. Electronic address: medinfoman@snu.ac.kr.
3. Division of Rheumatology, Department of Internal Medicine, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea. Electronic address: elee@snu.ac.kr.
4. Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea; Department of Biomedical Engineering, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea; Institute of Medical and Biological Engineering, Medical Research Center, Seoul National University, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea. Electronic address: jinchoi@snu.ac.kr.
Abstract
BACKGROUND: This study demonstrates clinical named entity recognition (NER) methods on the clinical texts of rheumatism patients in South Korea. Although adoption of electronic health record (EHR) systems by health institutions worldwide has increased recently, health information technologies for handling and acquiring information from the many unstructured texts in EHR systems are still at an early stage of development. The aim of this study is to evaluate two conventional NER methods, namely dictionary-lookup-based string matching and conditional random fields (CRFs).
METHODS: We selected discharge summaries for 200 rheumatic patients from the EHR system of the Seoul National University Hospital and attempted to identify the heterogeneous semantic types present in the clinical notes of each patient's history.
RESULTS: CRFs outperformed string matching in extracting most semantic types (median F1 = 0.761, minimum = 0.705, maximum = 0.906). String matching was better suited to identifying hospital visit information. The two methods performed comparably in identifying medications. In 10-fold cross-validation, CRFs achieved a median F1 of 0.811 (minimum = 0.752, maximum = 0.918) and performed well even when trained with simple features.
CONCLUSION: CRFs are a good candidate for implementing clinical NER in Korean clinical narrative documents. Increasing the training data and incorporating sophisticated feature engineering might improve the accuracy of identifying health information, enabling automated patient history summarization in the future.
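As an illustration of the dictionary-lookup baseline and the F1 metric reported in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a greedy longest-match strategy over token n-grams and exact-span, exact-type matching for scoring; the abstract specifies neither. The lexicon entries, the semantic type names (`Medication`, `Diagnosis`, `Event`), the English example sentence, and the function names `dictionary_lookup` and `f1_score` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (start token index, end token index exclusive, semantic type)
Span = Tuple[int, int, str]


def dictionary_lookup(
    tokens: List[str],
    lexicon: Dict[Tuple[str, ...], str],
    max_len: int = 4,
) -> List[Span]:
    """Greedy longest-match lookup of token n-grams in a lexicon (assumed strategy)."""
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in lexicon:
                spans.append((i, i + n, lexicon[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


def f1_score(predicted: Set[Span], gold: Set[Span]) -> float:
    """Entity-level F1: a prediction counts only if span and type both match."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical lexicon and sentence, for illustration only.
lexicon = {
    ("methotrexate",): "Medication",
    ("rheumatoid", "arthritis"): "Diagnosis",
    ("arthritis",): "Diagnosis",
}
tokens = ["Patient", "started", "methotrexate", "for", "rheumatoid", "arthritis"]

predicted = dictionary_lookup(tokens, lexicon)
gold = {(2, 3, "Medication"), (4, 6, "Diagnosis"), (1, 2, "Event")}
score = f1_score(set(predicted), gold)
print(predicted, round(score, 3))
```

In this toy case the lexicon has no entry for the `Event` annotation, so recall drops while precision stays perfect. A CRF, by contrast, labels each token from contextual features and can in principle recognize mentions that no dictionary lists.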