Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, Hongfang Liu, Graciela Gonzalez.
Abstract
Because of privacy concerns and the expense involved in creating an annotated corpus, the existing small annotated corpora might not have sufficient examples for learning to statistically extract all the named entities precisely. In this work, we evaluate what value may lie in automatically generated features based on distributional semantics when used with machine-learning-based named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM) regions, and term clustering, all of which are considered distributional semantic features. Adding the n-nearest-words feature to a baseline system resulted in a greater increase in F-score than adding a manually constructed lexicon. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons but also replace them. This phenomenon is observed in extracting concepts from both biomedical literature and clinical notes.
Keywords: concept extraction; distributional semantics; empirical lexical resources; named entity recognition; natural language processing
Year: 2013 PMID: 23847424 PMCID: PMC3702195 DOI: 10.4137/BII.S11664
Source DB: PubMed Journal: Biomed Inform Insights ISSN: 1178-2226
Figure 1. Overall Architecture of the System.
Notes: The design of the system to identify concepts using machine learning and distributional semantics. The top three components are related to distributional semantics.
Description of different components and settings of the system.
| Name | Description |
|---|---|
| Conditional Random Fields (CRF) | CRF is a discriminative, sequential machine-learning algorithm that is considered state of the art for concept extraction in general English, biomedical literature, and clinical narratives. We use the MALLET implementation. |
| Sentence-level orthographic and linguistic features | These machine-learning features, used in all settings, are generated through NLP tasks such as tokenization, part-of-speech tagging, chunking, and parsing. We used Apache OpenNLP. |
| Lexicons for clinical concept extraction | Compiled from the UMLS Metathesaurus. |
| Semantic vectors | Semantic Vectors creates semantic vector spaces of individual tokens and documents from free natural language text. This package is extended in this paper to empirically construct three types of lexical resources: quasi-lexicons using SVMs, word clusters using K-means, and a quasi-thesaurus using K-nearest neighbors. |
| BANNER | One of the best CRF-based protein-tagging systems. |
| BioCreative II gene normalization training set | The source for the 344,000 single-word lexicon used by BANNER by default (called BANNER_Dict in this paper). |
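The word clusters mentioned above are obtained by running K-means over term vectors. As a rough illustration only (not the Semantic Vectors implementation, and using hypothetical toy two-dimensional vectors), a minimal K-means that maps each token to a cluster ID could look like:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny K-means over dense word vectors; returns a token -> cluster-id map.
    A stand-in for the K-means clustering used to derive word-cluster features."""
    rng = random.Random(seed)
    tokens = list(vectors)
    centers = [vectors[t] for t in rng.sample(tokens, k)]
    assign = {}
    for _ in range(iters):
        # assign each token to its nearest center (squared Euclidean distance)
        for t in tokens:
            assign[t] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(vectors[t], centers[c])),
            )
        # recompute each center as the mean of its members
        for c in range(k):
            members = [vectors[t] for t in tokens if assign[t] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign
```

With toy vectors such as `{"aspirin": [0.0, 1.0], "ibuprofen": [0.1, 0.9], "fever": [1.0, 0.0], "cough": [0.9, 0.1]}` and `k=2`, the two drug terms end up sharing one cluster ID and the two symptom terms the other, which is the granularity at which cluster IDs become CRF features.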
Figure 2. Nearest Tokens to Haloperidol.
Notes: The closest tokens to haloperidol in the word space are psychiatric drugs. Using the nearest tokens to haloperidol as features would help to infer (statistically) that haloperidol is a drug (medical treatment), even when haloperidol is not in a manually compiled lexicon or when the context is unclear.
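The paper derives such neighbors from Semantic Vectors spaces built by random indexing; a minimal stand-in for the same idea, using sliding-window co-occurrence vectors and cosine similarity over toy sentences (not the authors' implementation), could look like:

```python
from collections import Counter, defaultdict
import math

def build_word_space(sentences, window=2):
    """Sliding-window co-occurrence vectors: token -> Counter of neighbor tokens."""
    space = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    space[tok][tokens[j]] += 1
    return space

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def n_nearest_words(space, token, n=5):
    """The n distributionally closest tokens, usable as extra CRF features."""
    if token not in space:
        return []
    sims = sorted(((cosine(space[token], vec), other)
                   for other, vec in space.items() if other != token), reverse=True)
    return [other for _, other in sims[:n]]
```

Given toy sentences like `["haloperidol", "treats", "psychosis"]` and `["risperidone", "treats", "psychosis"]`, the nearest token to `haloperidol` is `risperidone`, because the two share identical contexts; at corpus scale this is what lets an unseen drug name inherit the feature profile of known drugs.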
Clinical NER: comparison of SVM-based and clustering-based features with n-nearest-neighbors-based features.
| Setting | Exact F | Inexact F | Exact increase | Inexact increase |
|---|---|---|---|---|
| MED_Dict | 80.3 | 89.7 | – | – |
| MED_Dict+SVM | 80.6 | 90.0 | 0.3 | 0.3 |
| MED_Dict+NN | 81.7 | 90.9 | 1.4 | 1.2 |
| MED_Dict+NN+SVM | 81.9 | 91.0 | 1.6 | 1.3 |
| MED_Dict+CL | 80.8 | 90.1 | 0.5 | 0.4 |
| MED_Dict+NN+SVM+CL | 81.7 | 90.9 | 1.4 | 1.2 |
Notes: MED_Dict is the baseline: a machine-learning clinical NER system with several sentence-level orthographic and syntactic features, along with features from lexicons such as UMLS, Drugs@FDA, and MedDRA. In MED_Dict+SVM, the quasi-lexicons are also used. In MED_Dict+NN, the quasi-thesaurus is used. In MED_Dict+CL, the automatically generated clusters are used in addition to the other features in MED_Dict. Exact F is the F-score for exact match as calculated by the shared-task software. Inexact F is the F-score for inexact match, where the extracted span and the reference annotation overlap only partially. Exact Increase and Inexact Increase are the increases in Exact F and Inexact F over the baseline (the first row).
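The difference between these settings amounts to which feature groups are attached to each token before CRF training. A sketch of such a per-token feature extractor follows; the feature names (`IN_LEXICON`, `NEAR=`, `CLUSTER=`) and the function itself are illustrative placeholders, not the actual MALLET feature templates used in the paper:

```python
def token_features(token, lexicon=None, quasi_thesaurus=None, clusters=None):
    """Assemble per-token feature strings for one experimental setting.
    Passing only `lexicon` corresponds to MED_Dict; adding `quasi_thesaurus`
    gives MED_Dict+NN; adding `clusters` gives the +CL variants."""
    # baseline orthographic features (a tiny subset, for illustration)
    feats = [f"WORD={token.lower()}", f"CAP={token[0].isupper()}"]
    # manually compiled lexicon lookup (e.g. UMLS, Drugs@FDA, MedDRA)
    if lexicon and token.lower() in lexicon:
        feats.append("IN_LEXICON")
    # quasi-thesaurus: n-nearest words from the distributional space
    if quasi_thesaurus:
        for near in quasi_thesaurus.get(token.lower(), []):
            feats.append(f"NEAR={near}")
    # word-cluster ID from K-means over term vectors
    if clusters and token.lower() in clusters:
        feats.append(f"CLUSTER={clusters[token.lower()]}")
    return feats
```

For example, `token_features("Haloperidol", lexicon={"haloperidol"}, quasi_thesaurus={"haloperidol": ["risperidone"]}, clusters={"haloperidol": 7})` yields word, capitalization, lexicon, neighbor, and cluster features together, i.e. the full MED_Dict+NN+CL-style feature set for that token.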
Clinical NER: impact of distributional semantic features.
| Setting | Exact F | Inexact F | Exact increase | Inexact increase |
|---|---|---|---|---|
| MED_noDict | 79.4 | 89.2 | – | – |
| MED_Dict | 80.3 | 89.7 | 0.9 | 0.5 |
| MED_noDict+NN+SVM | 81.4 | 90.8 | 2.0 | 1.6 |
| MED_Dict+NN+SVM | 81.9 | 91.0 | 2.5 | 1.8 |
Notes: MED_noDict is the machine-learning clinical NER system with all the sentence-level orthographic and syntactic features, but no features from lexicons such as UMLS, Drugs@FDA, and MedDRA. MED_noDict+NN+SVM also has the features generated using SVM and the nearest neighbors algorithm.
Clinical NER: impact of the source of unlabeled corpus.
| Unlabeled corpus | Exact F | Inexact F |
|---|---|---|
| None | 80.3 | 89.7 |
| Medline | 81.9 | 91.0 |
| UT Houston | 82.3 | 91.3 |
| Mayo | 82.0 | 91.3 |
Notes: None = The machine-learning clinical NER system that does not use any distributional semantic features. Medline = The machine-learning clinical NER system that uses distributional semantic features derived from the Medline abstracts indexed as pertaining to clinical trials. UT Houston = The machine-learning clinical NER system that uses distributional semantic features derived from the notes in the clinical data warehouse at University of Texas Health Sciences Center. Mayo = The machine-learning clinical NER system that uses distributional semantic features derived from the clinical notes of Mayo Clinic, Rochester, MN.
Figure 3. Impact of the size of the unlabeled corpus.
Notes: On the X-axis, N represents the system created using distributional semantic features from N unlabeled documents. N = 0 refers to the system that does not use any distributional semantic features.
Protein tagging: impact of distributional semantic features on BANNER.
| Rank | Setting | Precision | Recall | F-score | Significance |
|---|---|---|---|---|---|
| 1 | Rank 1 system | 88.48 | 85.97 | 87.21 | 6–11 |
| 2 | Rank 2 system | 89.30 | 84.49 | 86.83 | 8–11 |
| 3 | BANNER_Dict+DistSem | 88.25 | 85.12 | 86.66 | 8–11 |
| 4 | Rank 3 system | 84.93 | 88.28 | 86.57 | 8–11 |
| 5 | BANNER_noDict+DistSem | 87.95 | 85.06 | 86.48 | 10–11 |
| 6 | Rank 4 system | 87.27 | 85.41 | 86.33 | 10–11 |
| 7 | Rank 5 system | 85.77 | 86.80 | 86.28 | 10–11 |
| 8 | Rank 6 system | 82.71 | 89.32 | 85.89 | 10–11 |
| 9 | BANNER_Dict | 86.41 | 84.55 | 85.47 | – |
| 10 | Rank 7 system | 86.97 | 82.55 | 84.70 | – |
| 11 | BANNER_noDict | 85.63 | 83.10 | 84.35 | – |
Notes: The significance column indicates which systems (by rank) are significantly less accurate than the system in the corresponding row. These values are based on the bootstrap re-sampling calculations performed as part of the evaluation in the BioCreative II shared task (the latest gene/protein tagging task). BANNER_Dict+DistSem uses both manual and empirical lexical resources. BANNER_noDict+DistSem uses only empirical lexical resources. BANNER_Dict uses only manual lexical resources; it is the system available prior to this research and the baseline for this study. BANNER_noDict uses neither manual nor empirical lexical resources. BANNER_Dict+DistSem is significantly more accurate than the baseline. Equally important, BANNER_noDict+DistSem is more accurate than BANNER_noDict. The most significant research contribution is that accuracy equivalent to BANNER_Dict could be achieved by BANNER_noDict+DistSem, without using any manually compiled lexical resources apart from the annotated corpora.
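The bootstrap re-sampling behind the significance column can be sketched as a paired resample over per-document true-positive/false-positive/false-negative counts. This is a generic approximation under assumed inputs, not the official BioCreative II evaluation script:

```python
import random

def bootstrap_f_diff(per_doc_counts_a, per_doc_counts_b, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which system A's F-score does not
    exceed system B's (an approximate one-sided p-value for A > B).
    Both inputs are equal-length lists of per-document (tp, fp, fn) triples."""
    def f_score(counts):
        tp = sum(c[0] for c in counts)
        fp = sum(c[1] for c in counts)
        fn = sum(c[2] for c in counts)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    rng = random.Random(seed)
    n = len(per_doc_counts_a)
    worse = 0
    for _ in range(n_boot):
        # resample document indices with replacement, paired across systems
        idx = [rng.randrange(n) for _ in range(n)]
        if f_score([per_doc_counts_a[i] for i in idx]) <= f_score([per_doc_counts_b[i] for i in idx]):
            worse += 1
    return worse / n_boot
```

A small returned value means system A outperformed system B on nearly every resample, i.e. the F-score gap is unlikely to be a sampling artifact; pairing the resampled indices controls for document-level difficulty.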
Examples of additional true positives found by the clinical NER system that uses distributional semantic features, compared with the one that does not.
| Annotation | Sentence | Quasi-thesaurus |
|---|---|---|
| Concept = mesna; type = treatment | She also received Cisplatin 35 per meter squared on 06/19 and Ifex and | |
| Concept = mild bradycardia; type = problem | May start beta-blocker at a low dose given | |
| Concept = 2 liters nasal cannula oxygen; type = treatment | She needs home oxygen and is currently at | |
Notes: These examples are from the annotated corpus that belongs to Partners Healthcare. We were allowed to share them publicly after removing the protected health information.