Yao Chen1, Changjiang Zhou1, Tianxin Li1, Hong Wu1, Xia Zhao2, Kai Ye3, Jun Liao4. 1. School of Science, China Pharmaceutical University, Nanjing, China. 2. Adverse Drug Reaction Monitoring Center of Wuxi, Wuxi, China. 3. MandalaT Software Corporation, Wuxi, China. 4. School of Science, China Pharmaceutical University, Nanjing, China. Electronic address: liaojun@cpu.edu.cn.
Abstract
BACKGROUND: The Adverse Drug Event Reports (ADERs) from the spontaneous reporting system are important data sources for studying Adverse Drug Reactions (ADRs) as well as post-marketing pharmacovigilance. Apart from the conventional ADR information contained in the structured section of ADERs, more detailed information such as pre- and post- ADR symptoms, multi-drug usages and ADR-relief treatments are described in the free-text section, which can be mined through Natural Language Processing (NLP) tools. OBJECTIVE: The goal of this study was to extract ADR-related entities from free-text section of Chinese ADERs, which can act as supplements for the information contained in structured section, so as to further assist in ADR evaluation. METHODS: Three models of Conditional Random Field (CRF), Bidirectional Long Short-Term Memory-CRF (BiLSTM-CRF) and Lexical Feature based BiLSTM-CRF (LF-BiLSTM-CRF) were constructed to conduct Named Entity Recognition (NER) tasks in free-text section of Chinese ADERs. A semi-supervised learning method of tri-training was applied on the basis of the three established models to give un-annotated raw data with reliable tags. RESULTS: Among the three basic models, the LF-BiLSTM-CRF achieved the highest average F1 score of 94.35%. After the process of tri-training, almost half of the un-annotated cases were tagged with labels, and the performances of all the three models improved after iterative training. CONCLUSIONS: The LF-BiLSTM-CRF model that we constructed could achieve a comparatively high F1 score, and the fusion of CRF, while BiLSTM-CRF and LF-BiLSTM-CRF in tri-training might further strengthen the reliability of predicted tags. The results suggested the usefulness of our methods in developing the specialized NER tools for identifying ADR-related information from Chinese ADERs.
BACKGROUND: The Adverse Drug Event Reports (ADERs) from the spontaneous reporting system are important data sources for studying Adverse Drug Reactions (ADRs) as well as post-marketing pharmacovigilance. Apart from the conventional ADR information contained in the structured section of ADERs, more detailed information such as pre- and post- ADR symptoms, multi-drug usages and ADR-relief treatments are described in the free-text section, which can be mined through Natural Language Processing (NLP) tools. OBJECTIVE: The goal of this study was to extract ADR-related entities from free-text section of Chinese ADERs, which can act as supplements for the information contained in structured section, so as to further assist in ADR evaluation. METHODS: Three models of Conditional Random Field (CRF), Bidirectional Long Short-Term Memory-CRF (BiLSTM-CRF) and Lexical Feature based BiLSTM-CRF (LF-BiLSTM-CRF) were constructed to conduct Named Entity Recognition (NER) tasks in free-text section of Chinese ADERs. A semi-supervised learning method of tri-training was applied on the basis of the three established models to give un-annotated raw data with reliable tags. RESULTS: Among the three basic models, the LF-BiLSTM-CRF achieved the highest average F1 score of 94.35%. After the process of tri-training, almost half of the un-annotated cases were tagged with labels, and the performances of all the three models improved after iterative training. CONCLUSIONS: The LF-BiLSTM-CRF model that we constructed could achieve a comparatively high F1 score, and the fusion of CRF, while BiLSTM-CRF and LF-BiLSTM-CRF in tri-training might further strengthen the reliability of predicted tags. The results suggested the usefulness of our methods in developing the specialized NER tools for identifying ADR-related information from Chinese ADERs.
Keywords:
Adverse drug reaction; Chinese natural language processing; Lexical feature based bidirectional long short-term memory; Named entity recognition; Tri-training