Kai Xu1, Zhanfan Zhou2, Tao Gong3,4, Tianyong Hao5, Wenyin Liu6.
Abstract
BACKGROUND: Disease named entity recognition (NER) is a fundamental step in information processing of medical texts. However, disease NER involves complex issues such as descriptive modifiers in actual practice. The accurate identification of disease NER is a still an open and essential research problem in medical information extraction and text mining tasks.Entities:
Keywords: Biomedical informatics; Machine learning; Neural networks; Text mining
Year: 2018 PMID: 30526592 PMCID: PMC6284263 DOI: 10.1186/s12911-018-0690-y
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 The overall architecture of the proposed SBLC model, comprising three layers: the first layer is word embedding, containing word embeddings trained on three large-scale datasets; the second layer is a Bi-LSTM used to learn context information; the third layer is a CRF, combined with Ab3P, capturing the relationships among word part-of-speech labels
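The CRF layer described above selects the best tag sequence jointly rather than per token. A minimal sketch of the Viterbi decoding such a layer performs, using hypothetical emission scores (standing in for Bi-LSTM outputs) and a toy transition matrix over BIO disease tags; the values are illustrative, not from the paper:

```python
# Minimal Viterbi decoding for a linear-chain CRF tag layer.
# Emission and transition scores here are hypothetical toy values.

TAGS = ["O", "B-Disease", "I-Disease"]

def viterbi_decode(emissions, transitions):
    """emissions: list of per-token score dicts {tag: float};
    transitions: dict {(prev_tag, tag): float}.
    Returns the highest-scoring tag sequence."""
    score = dict(emissions[0])   # best path score ending in each tag
    back = []                    # backpointers per step
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda p: score[p] + transitions[(p, tag)])
            new_score[tag] = score[best_prev] + transitions[(best_prev, tag)] + em[tag]
            pointers[tag] = best_prev
        score = new_score
        back.append(pointers)
    # Backtrack from the best final tag
    last = max(TAGS, key=lambda t: score[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy 3-token sentence; transitions forbid O -> I-Disease.
emissions = [
    {"O": 0.1, "B-Disease": 2.0, "I-Disease": 0.0},
    {"O": 0.2, "B-Disease": 0.5, "I-Disease": 1.8},
    {"O": 2.5, "B-Disease": 0.1, "I-Disease": 0.3},
]
transitions = {(p, t): 0.0 for p in TAGS for t in TAGS}
transitions[("O", "I-Disease")] = -10.0   # I- may not follow O
transitions[("B-Disease", "I-Disease")] = 1.0

tags = viterbi_decode(emissions, transitions)
# tags == ["B-Disease", "I-Disease", "O"]
```

The transition constraint is what lets the CRF repair locally plausible but globally invalid taggings (such as an `I-Disease` tag with no preceding `B-Disease`).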
The statistics of the NCBI dataset for disease NER
| Characteristics | Training | Development | Testing | Total |
|---|---|---|---|---|
| # of PubMed article abstracts | 593 | 100 | 100 | 793 |
| # of annotated disease mentions | 5145 | 787 | 960 | 6892 |
| # of unique annotated disease mentions | 1710 | 368 | 427 | 2136 |
| Avg. sentences/abstract | 10 | 10 | 10 | 10 |
| Avg. words/sentence | 20 | 22 | 22 | 21 |
| Avg. words/abstract | 217 | 226 | 232 | 225 |
Comparison of model components used by SBLC and the baseline methods
| Methods | Dictionary look-up | Disease name normalization | Word embedding | LSTM | CRF |
|---|---|---|---|---|---|
| Dictionary look-up | Y | Y | N | N | N |
| cTAKES | Y | Y | N | N | Y |
| MetaMap | Y | Y | N | N | N |
| Inference Method | Y | Y | N | N | N |
| CRF + UMLS | Y | Y | N | N | Y |
| CRF + CMT | Y | Y | N | N | Y |
| CRF + MeSH | Y | Y | N | N | Y |
| DNorm | Y | Y | N | N | N |
| C-Bi-LSTM-CRF | N | N | N | Y | Y |
| TaggerOne | N | Y | N | N | N |
| DNER | N | Y | N | N | Y |
| SBLC | N | N | Y | Y | Y |
The optimized parameter settings of the LSTM network
| Parameter | Setting | Description |
|---|---|---|
| Word_dim | 200 | Token embedding dimension |
| Word_LSTM_dim | 100 | Token size in LSTM hidden layer |
| Word_bidirectional | TRUE | Using Bi-LSTM |
| Word Embedding | TRUE | Using word embedding |
| CRF | TRUE | Using CRF |
| Dropout | 1 | Input dropout |
| Learning method | SGD | Chosen from SGD, Adadelta, and Adam |
| Abbreviation | TRUE | Using Ab3P |
Effects of different hidden layer dimension settings in the Bi-LSTM
| Dimensions | Precision | Recall | F1 |
|---|---|---|---|
| 50 | 0.802 | 0.738 | 0.768 |
| 150 | 0.838 | 0.684 | 0.753 |
| 200 | 0.848 | 0.702 | 0.768 |
The highest values are denoted in bold type
Effects of different word embedding dimension settings
| Dimensions | Precision | Recall | F1 |
|---|---|---|---|
| 50 | 0.816 | 0.737 | 0.774 |
| 100 | 0.834 | 0.750 | 0.790 |
| 150 | 0.859 | 0.686 | 0.763 |
The highest values are denoted in bold type
Performance comparison using different combinations of external training datasets
| Pre-Data Sets | Precision | Recall | F1 |
|---|---|---|---|
| Wikipedia | 0.842 | 0.838 | 0.840 |
| PMC (full text) | 0.866 | 0.856 | 0.861 |
| PubMed (abstract) | 0.847 | 0.838 | 0.843 |
| PubMed (abstract) + PMC (full text) | | | |
| Wikipedia + PubMed (abstract) + PMC (full text) | 0.865 | 0.858 | 0.861 |
The highest values are denoted in bold type
Fig. 2 The performance of SBLC using different numbers of testing texts. The lines show the F1 score averaged over 100 test runs; the shaded areas indicate the 95% confidence interval
Effects of different parameter settings and the final optimized result
| Parameter | Precision | Recall | F1 |
|---|---|---|---|
| CRF | 0.701 | 0.675 | 0.688 |
| Bi-LSTM | 0.600 | 0.425 | 0.498 |
| Ab3p + CRF | 0.726 | 0.689 | 0.707 |
| Ab3p + Bi-LSTM | 0.645 | 0.452 | 0.532 |
| Bi-LSTM + CRF | 0.806 | 0.800 | 0.803 |
| Ab3p + Bi-LSTM + CRF | 0.813 | 0.808 | 0.811 |
| Word Embedding + Bi-LSTM | 0.675 | 0.501 | 0.575 |
| Word Embedding + CRF | 0.821 | 0.772 | 0.796 |
| Word Embedding + Bi-LSTM + CRF | 0.842 | 0.828 | 0.835 |
| Ab3p + Word Embedding + Bi-LSTM | 0.613 | 0.689 | 0.648 |
| Ab3p + Word Embedding + CRF | 0.846 | 0.786 | 0.815 |
| Ab3p + Word Embedding + Bi-LSTM + CRF (SBLC) | | | |
The highest values are denoted in bold type
Performance comparison of the SBLC model with the baseline methods on the same NCBI test dataset
| Methods | Precision | Recall | F1 |
|---|---|---|---|
| Dictionary look-up | 0.213 | 0.718 | 0.316 |
| cTAKES (version 4.0) | 0.476 | 0.541 | 0.506 |
| MetaMap (semantic type filtering) | 0.495 | 0.679 | 0.541 |
| MetaMap (MEDIC filtering) | 0.510 | 0.702 | 0.559 |
| Inference method | 0.597 | 0.731 | 0.637 |
| CRF + CMT | 0.795 | 0.683 | 0.735 |
| CRF + MeSH | 0.855 | 0.660 | 0.746 |
| CRF + UMLS | 0.839 | 0.688 | 0.756 |
| DNorm | 0.822 | 0.775 | 0.798 |
| C-Bi-LSTM-CRF | 0.848 | 0.761 | 0.802 |
| TaggerOne | 0.835 | 0.796 | 0.815 |
| TaggerOne (+ normalization) | 0.851 | 0.808 | 0.829 |
| DNER | 0.853 | 0.833 | 0.843 |
| SBLC | | | |
The highest values are denoted in bold type
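The precision, recall, and F1 figures reported throughout these tables follow the standard mention-level definitions. A minimal sketch, assuming exact-match spans identified by `(doc_id, start, end)` offsets; the example spans are hypothetical, not taken from the NCBI corpus:

```python
def prf1(gold, pred):
    """Mention-level precision/recall/F1 with exact span matching.
    gold, pred: sets of (doc_id, start, end) spans."""
    tp = len(gold & pred)                          # spans found in both
    precision = tp / len(pred) if pred else 0.0    # correct / predicted
    recall = tp / len(gold) if gold else 0.0       # correct / annotated
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)          # harmonic mean
    return precision, recall, f1

# Toy example: 3 gold mentions, 3 predictions, 2 overlap.
gold = {("d1", 0, 14), ("d1", 20, 27), ("d2", 5, 12)}
pred = {("d1", 0, 14), ("d2", 5, 12), ("d2", 30, 35)}
p, r, f = prf1(gold, pred)
# p == r == f == 2/3
```

Exact-match scoring is strict: a prediction that clips or extends a gold mention by a single character counts as both a false positive and a false negative, which is one reason descriptive modifiers make disease NER hard.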
Fig. 3 The annotations of the identified disease named entities
Semantic similarity among the identified disease concepts using the cosine similarity measure
| hepatic | Sim. | copper | Sim. | accumulation | Sim. | overload | Sim. | liver | Sim. |
|---|---|---|---|---|---|---|---|---|---|
| Hepatic | 0.784 | cobalt | 0.849 | depletion | 0.736 | overloading | 0.807 | kidney | 0.81 |
| liver | 0.770 | nickel | 0.831 | accumulates | 0.688 | Nontransfusional | 0.672 | hepatic | 0.77 |
| extra-hepatic | 0.738 | manganese | 0.824 | overaccumulation | 0.684 | overload-related | 0.632 | pancreas | 0.741 |
| intra-hepatic | 0.733 | iron | 0.811 | degradation | 0.683 | overload-induced | 0.626 | kidneys | 0.716 |
| extrahepatic | 0.714 | zinc | 0.799 | redistribution | 0.681 | dyshomeostasis | 0.611 | livers | 0.698 |
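The scores in this table are cosine similarities between word embedding vectors. A minimal sketch of the measure with toy 3-dimensional vectors (hypothetical values for illustration, not the actual 200-dimensional embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors:
    dot(u, v) / (|u| * |v|), in [-1, 1] for real-valued embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; related terms point in similar directions.
hepatic = [0.9, 0.2, 0.1]
liver = [0.8, 0.3, 0.05]
sim = cosine(hepatic, liver)   # close to 1.0 for near-parallel vectors
```

Because cosine similarity ignores vector magnitude, it compares only the direction of two embeddings, which is why frequent and rare but semantically related terms (e.g. "hepatic" and "liver") can still score highly.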
Fig. 4 An example of word embeddings projected into a two-dimensional space