Christopher Funk, William Baumgartner, Benjamin Garcia, Christophe Roeder, Michael Bada, K Bretonnel Cohen, Lawrence E Hunter, Karin Verspoor.
Abstract
BACKGROUND: Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem.
Year: 2014 PMID: 24571547 PMCID: PMC4015610 DOI: 10.1186/1471-2105-15-59
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Characteristics of ontologies evaluated
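| Ontology | Version | # Concepts | Avg. term length | Avg. words in term | Avg. # synonyms | % with punctuation | % with numerals | % with stop words |
|---|---|---|---|---|---|---|---|---|
| Cell type | 25:05:2007 | 838 | 20.0 ± 9.5 | 3.0 ± 1.4 | 0.5 ± 1.1 | 11.6 | 4.8 | 3.3 |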
| Sequence | 30:03:2009 | 1,610 | 21.6 ± 13.3 | 3.1 ± 1.0 | 1.4 ± 1 | 91.9 | 6.6 | 9.3 |
| ChEBI | 28:05:2008 | 19,633 | 25.5 ± 24.2 | 4.3 ± 4.8 | 2.0 ± 2.5 | 54.8 | 41.3 | 0 |
| NCBITaxon | 12:07:2011 | 789,538 | 24.6 ± 10.2 | 3.6 ± 2.0 | N/A | 53.7 | 56.0 | 0.3 |
| GO-MF | 28:11:2007 | 7,984 | 39.1 ± 15.4 | 4.6 ± 2.2 | 2.8 ± 4.6 | 52.8 | 26.6 | 2.7 |
| GO-BP | 28:11:2007 | 14,306 | 40.1 ± 19.0 | 5.0 ± 2.7 | 2.1 ± 2.5 | 23.5 | 7.0 | 45.7 |
| GO-CC | 28:11:2007 | 2,047 | 26.6 ± 14.2 | 3.6 ± 1.7 | 0.1 ± 0.9 | 29.5 | 14.4 | 6.8 |
| Protein | 22:04:2011 | 26,807 | 38.4 ± 18.5 | 5.5 ± 2.5 | 3.1 ± 3.2 | 68.4 | 74.8 | 4.3 |
System parameter description and values
| Parameter | Description - (possible values) |
|---|---|
| NCBO Annotator parameters | |
| wholeWordOnly | Term recognition must match whole words - (YES, NO) |
| filterNumber | Specifies whether the entity recognition step should filter numbers - (YES, NO) |
| stopWords | List of stop words to exclude from matching - (PubMed - commonly found terms from PubMed (included as Additional file |
| stopWordsCaseSensitive | Whether stop words are case sensitive - (YES, NO) |
| minTermSize | Specifies minimum length of terms to be returned - (ONE, THREE, FIVE) |
| withSynonyms | Whether to include synonyms in matching - (YES, NO) |
| MetaMap parameters | |
| model | Determines which data model is used - (STRICT - lexical, manual, and syntactic filtering are applied, RELAXED - lexical and manual filtering are used) |
| gaps | Specifies how to handle gaps in terms when matching - (ALLOW, NONE) |
| wordOrder | Specifies how to handle word order when matching - (ORDER MATTERS, IGNORE) |
| acronymAbb | Determines which generated acronyms or abbreviations are used - (NONE, DEFAULT, UNIQUE - restricts variants to only those with unique expansions) |
| derivationalVars | Specifies which type of derivational variants will be used - (NONE, ALL, ONLY ADJ NOUN) |
| scoreFilter | MetaMap reports a score from 0 to 1000 for every match, with 1000 being the highest; only matches with scores at or above the selected threshold are returned - (0, 600, 800, 1000) |
| minTermSize | Specifies minimum length of terms to be returned - (ONE, THREE, FIVE) |
| ConceptMapper parameters | |
| searchStrategy | Specifies the dictionary lookup strategy - (CONTIGUOUS - longest match of contiguous tokens, SKIP ANY - returns the longest match of not-necessarily contiguous tokens, and the next lookup begins in the next span, SKIP ANY ALLOW OVERLAP - returns the longest match of not-necessarily contiguous tokens in the span, and the next lookup begins after the next token) |
| caseMatch | Specifies the case folding mode to use - (IGNORE - fold everything to lower case, INSENSITIVE - fold only tokens with initial caps to lower case, SENSITIVE - no folding, FOLD DIGIT - fold only tokens with digits to lower case) |
| stemmer | Name of the stemmer to use before matching - (Porter - classic stemmer that removes common morphological and inflectional endings from English words, BioLemmatizer - domain-specific lemmatization tool for the morphological analysis of biomedical literature presented in Liu |
| orderIndependentLookup | Specifies if ordering of tokens within a span can be ignored - (TRUE, FALSE) |
| findAllMatches | Specifies if all matches will be returned - (TRUE, FALSE - only the longest match will be returned) |
| stopWords | List of stop words to exclude from matching - (PubMed - commonly found terms from PubMed (included as Additional file |
| synonyms | Specifies which synonyms will be included when creating the dictionary - (EXACT ONLY, ALL) |
Parameters evaluated for each system, with a description of each; possible values are listed in all capital letters. Most parameters are self-explanatory; for more information, see the documentation for each system: ConceptMapper (CM) [29], NCBO Annotator [44], MetaMap (MM) [18].
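To give a feel for how quickly the search space grows, the MetaMap values listed in the table can be enumerated as a full cross-product. This is a minimal sketch, assuming every listed value is combined with every other; the names `METAMAP_GRID` and `enumerate_grid` are hypothetical and are not part of MetaMap or of the evaluation pipeline, and the count below is only the size of that cross-product, not a statement about how many runs were actually performed.

```python
from itertools import product

# Values copied from the parameter table; the dict itself is illustrative only.
METAMAP_GRID = {
    "model": ["STRICT", "RELAXED"],
    "gaps": ["ALLOW", "NONE"],
    "wordOrder": ["ORDER MATTERS", "IGNORE"],
    "acronymAbb": ["NONE", "DEFAULT", "UNIQUE"],
    "derivationalVars": ["NONE", "ALL", "ONLY ADJ NOUN"],
    "scoreFilter": [0, 600, 800, 1000],
    "minTermSize": ["ONE", "THREE", "FIVE"],
}


def enumerate_grid(grid):
    """Yield one dict per parameter combination in the full cross-product."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(enumerate_grid(METAMAP_GRID))
print(len(combinations))  # 2 * 2 * 2 * 3 * 3 * 4 * 3 = 864
```

Each yielded dict is one candidate configuration, so a comparison across systems amounts to running the recognizer once per dict and recording the resulting scores.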
Figure 1. Maximum F-measure for each system-ontology pair. A wide range of maximum scores is seen for each system within each ontology.
Best performing parameter combinations for CL and GO subsections
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | ANY | searchStrategy | CONTIGUOUS |
| filterNumber | ANY | gaps | NONE | caseMatch | INSENSITIVE |
| stopWords | ANY | wordOrder | ORDER MATTERS | stemmer | Porter/BioLemmatizer |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | NONE |
| minTermSize | ONE/THREE | derivationalVariants | ALL | orderIndLookup | OFF |
| withSynonyms | YES | scoreFilter | 0 | findAllMatches | NO |
| | | minTermSize | 1/3 | synonyms | EXACT ONLY |
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | ANY | searchStrategy | CONTIGUOUS |
| filterNumber | ANY | gaps | NONE | caseMatch | INSENSITIVE |
| stopWords | ANY | wordOrder | ORDER MATTERS | stemmer | Porter |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | NONE |
| minTermSize | ONE/THREE | derivationalVariants | ANY | orderIndLookup | OFF |
| withSynonyms | ANY | scoreFilter | 0/600 | findAllMatches | NO |
| | | minTermSize | 1/3 | synonyms | EXACT ONLY |
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | NO | model | ANY | searchStrategy | CONTIGUOUS |
| filterNumber | ANY | gaps | NONE | caseMatch | ANY |
| stopWords | ANY | wordOrder | ORDER MATTERS | stemmer | BioLemmatizer |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | NONE |
| minTermSize | ANY | derivationalVariants | ANY | orderIndLookup | OFF |
| withSynonyms | NO | scoreFilter | 0/600 | findAllMatches | NO |
| | | minTermSize | 1/3 | synonyms | EXACT ONLY |
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | ANY | searchStrategy | CONTIGUOUS |
| filterNumber | ANY | gaps | NONE | caseMatch | INSENSITIVE |
| stopWords | ANY | wordOrder | ORDER MATTERS | stemmer | Porter |
| SWCaseSensitive | ANY | acronymAbb | ANY | stopWords | NONE |
| minTermSize | ANY | derivationalVariants | ADJ NOUN VARS | orderIndLookup | OFF |
| withSynonyms | YES | scoreFilter | 0 | findAllMatches | NO |
| | | minTermSize | 5 | synonyms | ALL |
Suggested parameters corresponding to the best score on CRAFT. Parameters whose setting does not appear to affect performance are shown as “ANY”.
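One way to read these suggestion tables programmatically is to treat “ANY” as no constraint and a slash-separated cell such as ONE/THREE as a set of equally good alternatives. The sketch below works under that assumption; the helpers `allowed_values` and `matches` are hypothetical names introduced for illustration, and the slash convention would need special handling for shorthand cells such as SKIP ANY/ALLOW.

```python
def allowed_values(cell):
    """Return the set of values a table cell permits, or None for 'ANY' (no constraint)."""
    if cell == "ANY":
        return None
    return {value.strip() for value in cell.split("/")}


def matches(suggested, concrete):
    """Check whether a concrete configuration satisfies every constrained parameter."""
    for name, cell in suggested.items():
        allowed = allowed_values(cell)
        if allowed is not None and concrete.get(name) not in allowed:
            return False
    return True


# First column pair of the first sub-table above (wholeWordOnly ... withSynonyms).
suggested = {
    "wholeWordOnly": "YES",
    "filterNumber": "ANY",
    "stopWords": "ANY",
    "SWCaseSensitive": "ANY",
    "minTermSize": "ONE/THREE",
    "withSynonyms": "YES",
}

# An illustrative concrete configuration, not one reported in the source.
candidate = {
    "wholeWordOnly": "YES",
    "filterNumber": "NO",
    "stopWords": "PubMed",
    "SWCaseSensitive": "NO",
    "minTermSize": "THREE",
    "withSynonyms": "YES",
}

print(matches(suggested, candidate))                             # True
print(matches(suggested, {**candidate, "minTermSize": "FIVE"}))  # False
```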
Best performing parameter combinations for SO, ChEBI, NCBITaxon, and PRO
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | STRICT | searchStrategy | CONTIGUOUS |
| filterNumber | ANY | gaps | NONE | caseMatch | INSENSITIVE |
| stopWords | ANY | wordOrder | ANY | stemmer | Porter/BioLemmatizer |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | NONE |
| minTermSize | THREE | derivationalVariants | NONE | orderIndLookup | OFF |
| withSynonyms | YES | scoreFilter | 600 | findAllMatches | NO |
| | | minTermSize | 3 | synonyms | EXACT ONLY |
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | ANY | searchStrategy | ANY |
| filterNumber | ANY | gaps | NONE | caseMatch | CASE FOLD DIGITS |
| stopWords | PubMed | wordOrder | ANY | stemmer | NONE |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | NONE |
| minTermSize | ONE/THREE | derivationalVariants | NONE | orderIndLookup | OFF |
| withSynonyms | YES | scoreFilter | 600 | findAllMatches | NO |
| | | minTermSize | 3/5 | synonyms | ALL |
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | ANY | searchStrategy | SKIP ANY/ALLOW |
| filterNumber | ANY | gaps | NONE | caseMatch | ANY |
| stopWords | ANY | wordOrder | ORDER MATTERS | stemmer | BioLemmatizer |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | PubMed |
| minTermSize | FIVE | derivationalVariants | NONE | orderIndLookup | OFF |
| withSynonyms | ANY | scoreFilter | 0/600 | findAllMatches | NO |
| | | minTermSize | 3 | synonyms | EXACT ONLY |
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | STRICT | searchStrategy | CONTIGUOUS |
| filterNumber | ANY | gaps | NONE | caseMatch | ANY |
| stopWords | ANY | wordOrder | ORDER MATTERS | stemmer | BioLemmatizer |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | NONE |
| minTermSize | ONE/THREE | derivationalVariants | NONE | orderIndLookup | OFF |
| withSynonyms | YES | scoreFilter | 0/600 | findAllMatches | YES |
| | | minTermSize | 5 | synonyms | EXACT ONLY |
Suggested parameters corresponding to the best score on CRAFT. Parameters whose setting does not appear to affect performance are shown as “ANY”.
Best performance for each ontology-system pair
| System | F-measure | Precision | Recall | TP | FP | FN |
|---|---|---|---|---|---|---|
| NCBO Annotator | 0.32 | 0.76 | 0.20 | 1169 | 379 | 4591 |
| MetaMap | 0.69 | 0.61 | 0.80 | 4590 | 3010 | 1170 |
| **ConceptMapper** | 0.83 | 0.88 | 0.78 | 4478 | 592 | 1282 |
| NCBO Annotator | 0.40 | 0.75 | 0.27 | 2287 | 779 | 6067 |
| MetaMap | 0.70 | 0.67 | 0.73 | 6111 | 2969 | 2341 |
| **ConceptMapper** | 0.77 | 0.92 | 0.66 | 5532 | 452 | 2822 |
| NCBO Annotator | 0.08 | 0.47 | 0.04 | 173 | 195 | 4007 |
| MetaMap | 0.09 | 0.09 | 0.09 | 393 | 3846 | 3787 |
| **ConceptMapper** | 0.14 | 0.44 | 0.08 | 337 | 425 | 3834 |
| NCBO Annotator | 0.25 | 0.70 | 0.15 | 2592 | 1120 | 14321 |
| **MetaMap** | 0.42 | 0.53 | 0.34 | 5802 | 4994 | 11111 |
| ConceptMapper | 0.36 | 0.46 | 0.29 | 4909 | 5710 | 12004 |
| NCBO Annotator | 0.44 | 0.63 | 0.33 | 7056 | 4094 | 14231 |
| MetaMap | 0.50 | 0.47 | 0.54 | 11402 | 12634 | 9885 |
| **ConceptMapper** | 0.56 | 0.56 | 0.57 | 12059 | 9560 | 9228 |
| **NCBO Annotator** | 0.56 | 0.70 | 0.46 | 3782 | 1595 | 4355 |
| MetaMap | 0.42 | 0.36 | 0.50 | 4424 | 8689 | 3717 |
| **ConceptMapper** | 0.56 | 0.55 | 0.56 | 4583 | 3687 | 3554 |
| NCBO Annotator | 0.04 | 0.16 | 0.02 | 157 | 807 | 7292 |
| MetaMap | 0.45 | 0.31 | 0.88 | 6587 | 14954 | 862 |
| **ConceptMapper** | 0.69 | 0.61 | 0.79 | 5857 | 3793 | 1592 |
| NCBO Annotator | 0.50 | 0.49 | 0.51 | 7958 | 8288 | 7636 |
| MetaMap | 0.36 | 0.39 | 0.34 | 5255 | 8307 | 10339 |
| **ConceptMapper** | 0.57 | 0.57 | 0.57 | 8843 | 6620 | 6751 |
Maximum F-measure for each system on each ontology. Bolded systems produced the highest F-measure.
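The scores in the table can be reproduced from the raw counts. The sketch below assumes that the first three numeric columns are F-measure, precision, and recall and that the last three are true positives, false positives, and false negatives; the arithmetic is consistent with that reading. The function name `precision_recall_f` is a hypothetical helper, not code from the evaluation.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and balanced F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from the table row 0.83 | 0.88 | 0.78 | 4478 | 592 | 1282.
p, r, f = precision_recall_f(tp=4478, fp=592, fn=1282)
print(round(f, 2), round(p, 2), round(r, 2))  # 0.83 0.88 0.78
```

Because false positives appear only in the precision denominator, removing false positives raises precision and leaves recall untouched.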
Figure 2. All parameter combinations for CL. The distribution of all parameter combinations for each system on CL. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Figure 3. All parameter combinations for GO_CC. The distribution of all parameter combinations for each system on GO_CC. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Word length in GO - Biological Process
| 5 | 7 | 14.3 | 14.3 | 14.3 |
| 4 | 109 | 17.4 | 3.7 | 9.2 |
| 3 | 317 | 37.2 | 33.4 | 35.0 |
| 2 | 2077 | 49.0 | 50.7 | 43.3 |
| 1 | 13574 | 27.6 | 34.2 | 11.6 |
Figure 4. All parameter combinations for GO_BP. The distribution of all parameter combinations for each system on GO_BP. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Figure 5. All parameter combinations for GO_MF. The distribution of all parameter combinations for each system on GO_MF. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Figure 6. Improvement seen by CM on GO_MF by adding synonyms to the dictionary. Adding synonyms of terms without “activity” to the GO_MF dictionary increases both precision and recall.
Figure 7. All parameter combinations for SO. The distribution of all parameter combinations for each system on SO. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Figure 8. All parameter combinations for PRO. The distribution of all parameter combinations for each system on PRO. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Figure 9. Improvement on PRO when the top 5 FPs are removed. The top 5 FPs for each system are removed; arrows show the resulting increase in precision. Recall is unchanged.
Figure 10. All parameter combinations for NCBITaxon. The distribution of all parameter combinations for each system on NCBITaxon. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Figure 11. All parameter combinations for ChEBI. The distribution of all parameter combinations for each system on ChEBI. (MetaMap - yellow square, ConceptMapper - green circle, NCBO Annotator - blue triangle, default parameters - red).
Figure 12. Two CM parameters that interact on ChEBI. The synonyms (left) and stemmer (right) parameters interact. The stemmer produces distinct clusters when only exact synonyms are used. When all synonyms are used, it is hard to distinguish any pattern in the stemmer.
Figure 13. Differences between maximum F-measure and performance when optimizing one dimension. Arrows point from the best-performing F-measure combination to the best precision/recall parameter combination. All systems and all ontologies are shown.
Best parameters for optimizing performance for precision or recall
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | YES | model | STRICT | searchStrategy | CONTIGUOUS |
| filterNumber | ANY | gaps | NONE | caseMatch | SENSITIVE |
| stopWords | ANY | wordOrder | ORDER MATTERS | stemmer | NONE |
| SWCaseSensitive | ANY | acronymAbb | DEFAULT/UNIQUE | stopWords | NONE |
| minTermSize | THREE/FIVE | derivationalVariants | NONE | orderIndLookup | OFF |
| withSynonyms | NO | scoreFilter | 1000 | findAllMatches | NO |
| | | minTermSize | 3/5 | synonyms | EXACT ONLY |
| NCBO Annotator | | MetaMap | | ConceptMapper | |
| wholeWordOnly | NO | model | RELAXED | searchStrategy | SKIP ANY/ALLOW |
| filterNumber | ANY | gaps | ALLOW | caseMatch | IGNORE/INSENSITIVE |
| stopWords | ANY | wordOrder | IGNORE | stemmer | Porter/BioLemmatizer |
| SWCaseSensitive | ANY | acronymAbb | ALL | stopWords | PubMed |
| minTermSize | ONE/THREE | derivationalVariants | ALL/ADJ NOUN | orderIndLookup | ON |
| withSynonyms | YES | scoreFilter | 0 | findAllMatches | YES |
| | | minTermSize | 1/3 | synonyms | ALL |