Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan, Rebecca S Jacobson.
Abstract
BACKGROUND: Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and clinical corpus. NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system's matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator.Entities:
Year: 2016 PMID: 26763894 PMCID: PMC4712516 DOI: 10.1186/s12859-015-0871-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Key terms and definitions
| Term | Definition |
|---|---|
| Abbreviation | A shortened form of a word, name, or phrase. |
| Annotation | The tagging of words comprising a mention to assign them to a concept or text feature. |
| Auto coder | A computer-based system that automatically matches text terms to a code or concept. |
| Concept | A “cognitive construct” that is built on our perception or understanding of something. |
| Controlled vocabulary | A vocabulary that reduces ambiguity and establishes relationships by linking each concept to a term and its synonyms. |
| Entity | An “object of interest.” |
| Gazetteer | A list or dictionary of entities. |
| Lexical variant | Different forms of the same term that occur due to variations in spelling, grammar, etc. |
| Mention | One or more words and/or punctuation marks within a text that refer to a specific entity. |
| Named entity | A specific word or phrase referring to an object of interest. |
| Ontology | A defined group of terms and their relationships to each other, within the context of a particular domain. |
| Semantic type | A logical category of related terms. |
| Stop word | A word of high frequency but limited information value (e.g., determiners) that is excluded from a vocabulary to improve the results of a subsequent task. |
| Synonym | A term with the same meaning as another term; terms that describe the same concept. |
| Term | One or more words, including punctuation, that represent a concept; there may be multiple terms associated with one concept. |
| Terminology | A catalog of terms related to a specific domain. |
| Vocabulary | A terminology in which the terms and concepts are defined. |
| Word | A linguistic unit that has a definable meaning and/or function. |
Widely used concept recognition systems
| System | Approach | Availability | Interoperability | Terminologies | Terminology building tools |
|---|---|---|---|---|---|
| MetaMap (and MMTx) | Noun-phrase, lexical variants | Open source | Java API for MMTx | UMLS | MetamorphoSys, DataFileBuilder |
| MGrep | Single-word variations, radix-tree search | Closed-source binary utility | Command-line utility (MGrep) integrated with RESTful API in OBA | Custom dictionaries (MGrep) with UMLS and BioPortal in OBA | N/A |
| Concept Mapper | Word lookup table | Open source | UIMA plugin | XML file | N/A |
| cTAKES Dictionary Lookup Annotator | Noun-phrase, dictionary lookup | Open source | Java API with full integration in UIMA | UMLS (RRF), bar-separated value (BSV) file | Example scripts available |
| cTAKES Fast Dictionary Lookup Annotator | Rare-word index | Open source | Java API with full integration in UIMA | UMLS (RRF), bar-separated value (BSV) file | Example scripts available |
| Index Finder | Word lookup table | N/A | N/A | UMLS | N/A |
| Doublet | Bigram lookup table | Open source | Command-line utility (Perl) | Custom dictionary format | N/A |
| MedLEE | Noun-phrase | Commercial | XML-based input/output | UMLS | N/A |
| NOBLE Coder | Word lookup table | Open source | Java API, UIMA and GATE wrappers | UMLS (RRF), OWL, OBO, BioPortal | Terminology Loader UI |
Fig. 1 NOBLE Coder Algorithm, showing (a) the terminology build process and (b) the concept recognition process
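The core idea of a word-lookup-table recognizer is to index every vocabulary term by the words it contains, then treat any term whose words all appear in the input as a candidate match. The sketch below is a simplified, illustrative variant of this approach; the function names and the greedy bag-of-words matching are assumptions for demonstration, not the actual NOBLE Coder API.

```python
# Minimal word-lookup-table concept recognition sketch (illustrative only).
from collections import defaultdict

def build_index(vocabulary):
    """Map each word to the set of vocabulary terms containing it (the lookup table)."""
    index = defaultdict(set)
    for term in vocabulary:
        for word in term.lower().split():
            index[word].add(term)
    return index

def find_concepts(text, vocabulary):
    """Return vocabulary terms all of whose words occur somewhere in the input text."""
    index = build_index(vocabulary)
    words = set(text.lower().replace(",", " ").split())
    candidates = set()
    for word in words:                      # gather terms sharing any word with the text
        candidates |= index.get(word, set())
    return sorted(t for t in candidates     # keep terms fully covered by the text's words
                  if all(w in words for w in t.lower().split()))

print(find_concepts("Deep lateral margin", ["Deep", "Margin", "Deep margin", "Tumor"]))
# prints ['Deep', 'Deep margin', 'Margin']
```

Matching options such as subsumption, contiguity, and order (described in the tables below) would then filter or constrain this candidate set.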
NOBLE matching options with examples
| Matching feature | Explanation | Example input text | Example input vocabulary | Output (TRUE) | Output (FALSE) |
|---|---|---|---|---|---|
| Subsumption | Only the most comprehensive concepts are coded | “Deep margin” | Deep; Margin; Deep margin | Deep margin | Deep; Margin; Deep margin |
| Overlap | A mapped concept may be fully or partially within the boundaries of another concept | “Deep lateral margin” | Deep; Lateral margin; Deep margin | Deep; Lateral margin; Deep margin | Deep; Lateral margin |
| Contiguity | Words in text that map to a concept term must be contiguous (within the word gap) | “Deep lateral margin” | Deep; Margin; Deep margin | Deep; Margin | Deep; Margin; Deep margin |
| Order | Order of words in text must be the same as in the concept term | “Margin, deep” | Deep; Margin; Deep margin | Deep; Margin | Deep; Margin; Deep margin |
| Partial | Input text that only partially matches the concept term is coded | “Margin” | Deep margin | Deep margin | No concepts coded |
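The subsumption option from the table can be sketched as a post-filter over candidate matches: a match is dropped whenever another candidate's word set strictly contains it. This is a minimal illustrative implementation, not NOBLE Coder's actual code.

```python
# Illustrative subsumption filter: keep only matches not strictly contained
# in a more comprehensive match (compared as bags of lowercase words).
def apply_subsumption(matches):
    kept = []
    for m in matches:
        mw = set(m.lower().split())
        # drop m if any other match's word set is a strict superset of m's
        if not any(mw < set(o.lower().split()) for o in matches):
            kept.append(m)
    return kept

print(apply_subsumption(["Deep", "Margin", "Deep margin"]))
# prints ['Deep margin']  -- matching the table's subsumption = TRUE example
```

With subsumption off, all three candidates would be reported, as in the table's FALSE column.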
Examples of NOBLE matching strategies produced by combinations of matching options
| Task | Description | Subsumption | Overlap | Contiguity | Order | Partial |
|---|---|---|---|---|---|---|
| Best match | Provides the narrowest meaningful match with the fewest candidates. Best for concept coding and information extraction. | Yes | Yes | Yes (gap = 1) | No | No |
| All match | Provides as many matched candidates as possible. Best for information retrieval and text mining. | No | Yes | No | No | No |
| Precise match | Attempts to minimize the number of false positives by filtering out candidates that do not appear in exactly the same form as in controlled terminology. Similar to best match, but increases precision at the expense of recall. | Yes | No | Yes (gap = 0) | Yes | No |
| Sloppy match | Allows matching of a concept even if the entire term representing it is not mentioned in the input text. Best for concept coding with small, controlled terminologies and poorly developed synonymy. | No | Yes | No | No | Yes |
For contiguity, the gap indicates the number of words (not counting stop words) that can occur between the words that make up a valid term
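The four strategies in the table are just fixed combinations of the matching options. They can be written down as option presets; this dictionary is purely illustrative (the key names mirror the table, not any actual NOBLE Coder configuration format):

```python
# Matching-strategy presets transcribed from the table above (illustrative).
# "gap" is the contiguity word gap and applies only when contiguity is enforced.
STRATEGIES = {
    "best_match":    {"subsumption": True,  "overlap": True,  "contiguity": True,
                      "gap": 1, "order": False, "partial": False},
    "all_match":     {"subsumption": False, "overlap": True,  "contiguity": False,
                      "order": False, "partial": False},
    "precise_match": {"subsumption": True,  "overlap": False, "contiguity": True,
                      "gap": 0, "order": True,  "partial": False},
    "sloppy_match":  {"subsumption": False, "overlap": True,  "contiguity": False,
                      "order": False, "partial": True},
}
```

Expressed this way, the trade-off is easy to see: precise match tightens every constraint (zero gap, enforced order) to raise precision, while sloppy match relaxes subsumption and allows partial matches to raise recall.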
Fig. 2 NOBLE Coder User Interface. a Example of NOBLE Coder processing reports. b Terminology Importer loads a BioPortal ontology, one of the many supported formats. c Terminology Exporter creates a custom terminology by merging branches from two different terminologies
Fig. 3 Benchmarking Study
Parameter settings used for concept recognition systems
| Concept recognition system | Parameters changed from standard | Goal achieved |
|---|---|---|
| MMTx | Best-mapping = false | Overlap |
| MGrep | Longest match = true | Subsumption |
| Concept Mapper | Contiguous match = true | Contiguity |
| cTAKES Dictionary Lookup Annotator | N/A | N/A |
| cTAKES Fast Dictionary Lookup Annotator | OverlapJCasTermAnnotator, PrecisionTermConsumer | Overlap, subsumption |
| NOBLE Coder | Best Match strategy | Overlap, contiguity, subsumption |
Vocabulary build steps required for each system
| Concept recognition system | Vocabulary build steps |
|---|---|
| MMTx^a | Used MetamorphoSys to convert RRF to ORF and used the bundled data file builder to create a terminology for each corpus; this process required significant user interaction and took many hours |
| MGrep | Sent RRF files for both corpora to the MGrep authors and received from them a tab delimited text file that could be used with the MGrep system enriched with LVG; there is limited publicly available information about the vocabulary format required by MGrep |
| Concept Mapper | Wrote custom Java code to convert RRF files to an XML file formatted in the Concept Mapper valid syntax |
| cTAKES Dictionary Lookup Annotator | Wrote custom Java code to convert RRF files to seed a Lucene Index |
| cTAKES Fast Dictionary Lookup Annotator | Wrote custom Java code to convert RRF into Bar Separated Values (BSV) file that FDLA imports |
| NOBLE Coder^a | Directly imported RRF files |
^a Systems that have vocabulary import and selection tooling
Performance metrics
| Corpus | Concept Recognition System | TP | PP | FP | FN | Precision | Recall | F1 | Median runtime over 10 runs (ms)** | IQR (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| CRAFT | MMTx | 35,140 | 659 | 45,791 | 51,875 | 0.43 | 0.40 | 0.42 | 640,450 | 3,937 |
| CRAFT | MGrep | 9,955 | 292 | 10,666 | 77,427 | 0.48 | 0.12 | 0.19 | 27,448* | 747 |
| CRAFT | Concept Mapper | 29,353 | 713 | 32,122 | 57,608 | 0.48 | 0.34 | 0.40 | 5,329 | 113 |
| CRAFT | cTAKES Dictionary Lookup | 37,736 | 742 | 36,951 | 49,196 | 0.51 | 0.43 | 0.47 | 4,082,685 | 3,459 |
| CRAFT | cTAKES Fast Lookup | 35,078 | 784 | 51,383 | 51,812 | 0.41 | 0.4 | 0.41 | 9,812 | 1,160 |
| CRAFT | NOBLE Coder | 36,568 | 1,637 | 46,344 | 49,469 | 0.44 | 0.43 | 0.43 | 17,431 | 44 |
| ShARe | MMTx | 2,375 | 101 | 1,675 | 1,735 | 0.58 | 0.58 | 0.58 | 52,016 | 2,678 |
| ShARe | MGrep | 2,340 | 35 | 1,075 | 1,836 | 0.68 | 0.56 | 0.62 | 7,103* | 148 |
| ShARe | Concept Mapper | 2,302 | 34 | 2,483 | 1,875 | 0.48 | 0.55 | 0.51 | 1,543 | 57 |
| ShARe | cTAKES Dictionary Lookup | 2,417 | 39 | 2,587 | 1,755 | 0.48 | 0.58 | 0.53 | 263,336 | 2,316 |
| ShARe | cTAKES Fast Lookup | 2,374 | 36 | 1,101 | 1,801 | 0.68 | 0.57 | 0.62 | 2,754 | 81 |
| ShARe | NOBLE Coder | 2,315 | 99 | 1,413 | 1,797 | 0.62 | 0.56 | 0.59 | 6,466 | 94 |
*MGrep runtime is the sum of the runtimes of a harness and a stand-alone MGrep invocation on a corpus
**All measurements were performed on the UIMA platform on a Linux workstation with 32 GB RAM and an Intel® Core™ i7-3770 CPU @ 3.40 GHz
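The precision, recall, and F1 columns can be recomputed from the TP/FP/FN counts using the standard definitions (P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2TP/(2TP+FP+FN)); note the table's PP (partial-match) counts may be handled separately in the paper's own scoring.

```python
# Recompute precision, recall, and F1 from raw counts (standard definitions).
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # equivalent to 2PR / (P + R)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# MMTx on CRAFT: TP=35,140, FP=45,791, FN=51,875
# -> matches the reported 0.43 / 0.40 / 0.42
print(prf(35140, 45791, 51875))
```

The same check reproduces the other rows, e.g. cTAKES Fast Lookup on ShARe (TP=2,374, FP=1,101, FN=1,801) yields the reported 0.68 / 0.57 / 0.62.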
Fig. 4 Frequency of All Errors Made by NOBLE Coder and Other Systems. Shows the total number of errors made by NOBLE Coder and the number of other systems that made the same errors. a Total number of FN and FP errors on the CRAFT corpus. b Total number of FN and FP errors on the ShARe corpus
Analysis of sampled NOBLE Coder errors
| Error Type | Definition | Type of error | CRAFT | ShARe |
|---|---|---|---|---|
| Boundary detection | Incorrectly incorporates words from earlier or later in the sentence, considering them to be part of the concept annotated | FP | 0 (0 %) | 3 (1.5 %) |
| Concept hierarchy | Incorrectly assigns more general or more specific concept than gold standard | FP and FN | 18 (9 %) | 13 (6.5 %) |
| Context/background knowledge | Concept annotated incorrectly because context or background knowledge was needed | FP and FN, usually FN | 72 (36 %) | 81 (40.5 %) |
| Exact match missed | Concept not annotated despite exactly matching the preferred name or a synonym | FN | 2 (1 %) | 3 (1.5 %) |
| Importance | Annotated concept was not deemed relevant by gold annotators | FP | 33 (16.5 %) | 51 (25.5 %) |
| Abbreviation detection | An abbreviation defined in the dictionary was matched case-insensitively because it did not fit a defined abbreviation pattern | FP | 18 (9 %) | 0 (0 %) |
| Alternative application of terminology | Gold used obsolete term, term is not in SNOMED, or same term existed in multiple ontologies, resulting in different annotations for same mention | FN | 31 (15.5 %) | 10 (5 %) |
| Text span | Concept annotated was identical to gold but text span was different than gold | FP and FN | 10 (5 %) | 20 (10 %) |
| Word sense ambiguity | Concept annotated was used in different word sense | FP | 4 (2 %) | 0 (0 %) |
| Wording mismatch | Missing or incorrect annotation due to word inflection mismatch between dictionary term and input text | FP and FN, usually FN | 12 (6 %) | 19 (9.5 %) |
| Total errors | | | 200 (100 %) | 200 (100 %) |
Fig. 5 Frequency of Error Types Made by NOBLE Coder and Other Systems. Shows the number of FP and FN errors made by NOBLE Coder on the CRAFT and ShARe corpora, and the number of other systems that made the same errors. a Number of FN errors on the CRAFT corpus. b Number of FN errors on the ShARe corpus. c Number of FP errors on the CRAFT corpus. d Number of FP errors on the ShARe corpus