Kai Xu1, Zhenguo Yang2, Peipei Kang3, Qi Wang4, Wenyin Liu5. 1. Department of Computer Science, Guangdong University of Technology, Guangzhou, China. Electronic address: kaixu.gdut@foxmail.com. 2. Department of Computer Science, Guangdong University of Technology, Guangzhou, China; Department of Computer Science, City University of Hong Kong, Hong Kong, China. Electronic address: zhengyang5-c@my.cityu.edu.hk. 3. Department of Computer Science, Guangdong University of Technology, Guangzhou, China. Electronic address: ppkanggdut@126.com. 4. Department of Computer Science, Guangdong University of Technology, Guangzhou, China. Electronic address: wangqi_6414@sina.com. 5. Department of Computer Science, Guangdong University of Technology, Guangzhou, China. Electronic address: liuwy@gdut.edu.cn.
Abstract
BACKGROUND: Disease named entity recognition (NER) plays an important role in biomedical research. There are a significant number of challenging issues to be addressed; among these, the identification of rare diseases and complex disease names and the problem of tagging inconsistency (i.e., if an entity is tagged differently in a document) are attracting substantial research attention. METHODS: We propose a new neural network method named Dic-Att-BiLSTM-CRF (DABLC) for disease NER. DABLC applies an efficient exact string matching method to match disease entities with a disease dictionary; here, the dictionary is constructed based on the Disease Ontology. Furthermore, DABLC constructs a dictionary attention layer by incorporating a disease dictionary matching method and document-level attention mechanism. Finally, a bidirectional long short-term memory network and conditional random field (BiLSTM-CRF) with a dictionary attention layer is proposed to combine the disease dictionary to develop disease NER. RESULTS: Extensive experiments are conducted on two widely-used corpora: the NCBI disease corpus and the BioCreative V CDR corpus. We apply each test on 10 executions of each model, with a 95% confidence interval. DABLC achieves the highest F1 scores (NCBI: Precision = 0.883, Recall = 0.89, F1 = 0.886; BioCreative V CDR: Precision = 0.891, Recall = 0.875, F1 = 0.883), outperforming the state-of-the-art methods. CONCLUSION: DABLC combines the advantages of both external dictionary resources and deep attention neural networks. This aids the identification of rare diseases and complex disease names; moreover, it reduces the impact of tagging inconsistency. Special disease NER and deep learning models addressing long sentences are noteworthy areas for future examination.
BACKGROUND: Disease named entity recognition (NER) plays an important role in biomedical research. There are a significant number of challenging issues to be addressed; among these, the identification of rare diseases and complex disease names and the problem of tagging inconsistency (i.e., if an entity is tagged differently in a document) are attracting substantial research attention. METHODS: We propose a new neural network method named Dic-Att-BiLSTM-CRF (DABLC) for disease NER. DABLC applies an efficient exact string matching method to match disease entities with a disease dictionary; here, the dictionary is constructed based on the Disease Ontology. Furthermore, DABLC constructs a dictionary attention layer by incorporating a disease dictionary matching method and document-level attention mechanism. Finally, a bidirectional long short-term memory network and conditional random field (BiLSTM-CRF) with a dictionary attention layer is proposed to combine the disease dictionary to develop disease NER. RESULTS: Extensive experiments are conducted on two widely-used corpora: the NCBI disease corpus and the BioCreative V CDR corpus. We apply each test on 10 executions of each model, with a 95% confidence interval. DABLC achieves the highest F1 scores (NCBI: Precision = 0.883, Recall = 0.89, F1 = 0.886; BioCreative V CDR: Precision = 0.891, Recall = 0.875, F1 = 0.883), outperforming the state-of-the-art methods. CONCLUSION:DABLC combines the advantages of both external dictionary resources and deep attention neural networks. This aids the identification of rare diseases and complex disease names; moreover, it reduces the impact of tagging inconsistency. Special disease NER and deep learning models addressing long sentences are noteworthy areas for future examination.
Authors: Maaly Nassar; Alexander B Rogers; Francesco Talo'; Santiago Sanchez; Zunaira Shafique; Robert D Finn; Johanna McEntyre Journal: Gigascience Date: 2022-08-11 Impact factor: 7.658
Authors: Sandra Brasil; Carlota Pascoal; Rita Francisco; Vanessa Dos Reis Ferreira; Paula A Videira; And Gonçalo Valadão Journal: Genes (Basel) Date: 2019-11-27 Impact factor: 4.096