Jianbo Lei (1), Buzhou Tang (2), Xueqin Lu (3), Kaihua Gao (3), Min Jiang (4), Hua Xu (4)
1. Center for Medical Informatics, Peking University, Beijing, China; The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA.
2. The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA; Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.
3. Center for Medical Informatics, Peking University, Beijing, China.
4. The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA.
Abstract
OBJECTIVE: Named entity recognition (NER) is one of the fundamental tasks in natural language processing. In the medical domain, a number of studies have addressed NER in English clinical notes; however, very little NER research has been carried out on clinical notes written in Chinese. The goal of this study was to systematically investigate features and machine learning algorithms for NER in Chinese clinical text.
MATERIALS AND METHODS: We randomly selected 400 admission notes and 400 discharge summaries from Peking Union Medical College Hospital in China. For each note, four types of entity (clinical problems, procedures, laboratory tests, and medications) were annotated according to a predefined guideline. Two-thirds of the 400 notes of each type were used to train the NER systems and one-third for testing. We investigated the effects of different types of feature, including bag-of-characters, word segmentation, part-of-speech, and section information, and of different machine learning algorithms, including conditional random fields (CRF), support vector machines (SVM), maximum entropy (ME), and structural SVM (SSVM), on the Chinese clinical NER task. All classifiers were trained on the training dataset and evaluated on the test set, and micro-averaged precision, recall, and F-measure were reported.
RESULTS: Our evaluation on the independent test set showed that most types of feature were beneficial to Chinese NER systems, although the improvements were limited. The system achieved the highest performance by combining word segmentation and section information, indicating that these two types of feature complement each other. When the same types of optimized feature were used, CRF and SSVM outperformed SVM and ME. More specifically, SSVM achieved the highest performance of the four algorithms, with F-measures of 93.51% and 90.01% for admission notes and discharge summaries, respectively.
Published by the BMJ Publishing Group Limited.
For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
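The micro-averaged precision, recall, and F-measure mentioned in the methods can be illustrated with a short sketch. It pools true positives, false positives, and false negatives over all notes before computing the scores. The exact-match criterion (same start, end, and entity type), the `Entity` tuple layout, and the function name `micro_prf` are assumptions made for illustration; the abstract does not state which matching criterion the authors used.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def micro_prf(
    gold_docs: Iterable[Set[Entity]],
    pred_docs: Iterable[Set[Entity]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F-measure.

    Counts are pooled over all documents first, so every entity
    contributes equally regardless of which note it comes from.
    Assumes exact matching on span boundaries and entity type.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # predicted and annotated
        fp += len(pred - gold)   # predicted but not annotated
        fn += len(gold - pred)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data, not taken from the study.
gold_docs = [
    {(0, 2, "problem"), (5, 7, "medication")},
    {(1, 3, "test")},
]
pred_docs = [
    {(0, 2, "problem"), (5, 7, "procedure")},  # wrong type on second span
    {(1, 3, "test"), (8, 9, "problem")},       # one spurious entity
]
p, r, f = micro_prf(gold_docs, pred_docs)
```

In this toy case two predictions match, two are spurious, and one annotated entity is missed, so precision is 2/4 and recall is 2/3.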
Keywords:
Chinese clinic notes; Machine learning algorithm; Medical concept recognition; Natural language processing
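As a rough illustration of how character-level features and section information might be fed to a sequence labeler such as a CRF, the sketch below converts annotated spans to per-character labels and builds a feature dictionary for one character. The BIO labeling scheme, the context window of one character, the `<PAD>` token, and all function and feature names are assumptions for illustration only; the abstract does not describe the authors' actual feature templates.

```python
from typing import Dict, List, Tuple


def spans_to_bio(n_chars: int, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end, type) spans (end exclusive) into BIO labels."""
    labels = ["O"] * n_chars
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels


def char_features(chars: str, i: int, section: str, window: int = 1) -> Dict[str, str]:
    """Features for character i: neighboring characters plus the note section."""
    feats = {"section": section}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence get a padding token.
        feats[f"char[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


# Toy sentence: "The patient has hypertension"; not taken from the study data.
text = "患者有高血压"
labels = spans_to_bio(len(text), [(3, 6, "problem")])
features = [char_features(text, i, section="history") for i in range(len(text))]
```

Working at the character level sidesteps word segmentation errors; segmentation and part-of-speech output could be added to the same dictionary as further keys.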