Jörg Hakenberg, Conrad Plake, Loic Royer, Hendrik Strobelt, Ulf Leser, Michael Schroeder.
Abstract
BACKGROUND: The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization - to identify biomedical objects in text - and extraction of qualified relationships between those objects. We describe a method for identifying genes and relationships between proteins.
Year: 2008 PMID: 18834492 PMCID: PMC2559985 DOI: 10.1186/gb-2008-9-s2-s14
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Example for gene mention normalization using context models. Disambiguation in gene mention normalization. Terminology relating to function, location, disease, and so on is explained in the text and defines the textual context, which is matched against potential gene contexts. Although there are four contexts for the gene name 'p54', only one encodes a human RNA helicase and is located on band q23.3 of chromosome 11, as described in the text.
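The matching of a textual context against candidate gene contexts can be illustrated with a minimal sketch. It assumes each context is reduced to a set of terms and scored by Jaccard overlap; the scoring function, the function name `pick_gene`, and all candidate IDs and terms below are hypothetical placeholders, not the scoring or data used by the authors.

```python
from __future__ import annotations


def pick_gene(text_terms: set[str], candidates: dict[str, set[str]]) -> str | None:
    """Return the candidate whose context overlaps most with the text (Jaccard)."""
    best_id, best_score = None, 0.0
    for gene_id, context in candidates.items():
        union = text_terms | context
        score = len(text_terms & context) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = gene_id, score
    return best_id  # None when no candidate shares any term with the text


# Toy data: four made-up contexts for one ambiguous name (assumption, not real records).
text_terms = {"human", "rna helicase", "11q23.3"}
candidates = {
    "candidate_A": {"human", "rna helicase", "11q23.3"},
    "candidate_B": {"human", "transcription factor"},
    "candidate_C": {"mouse", "rna helicase"},
    "candidate_D": {"human", "kinase"},
}
print(pick_gene(text_terms, candidates))  # prints candidate_A
```

Only `candidate_A` matches on species, function, and chromosomal location at once, so it outscores the three partial matches.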
Figure 2. Multiple aligned sentences define a consensus pattern. Logo for a motif derived from multiple sequence alignments that can be applied to sentences of unknown content. PTN represents arbitrary protein names, as does P. V and W are interaction verbs in present and past form, respectively; E is a preposition and D a determiner.
Results for gene mention normalization
| Short description of the submitted run | Precision (%) | Recall (%) | F measure (%) | True positives | False positives | False negatives |
| Training set | 82.1 | 81.6 | 81.8 | 522 | 114 | 118 |
| Training set, no filtering, no disambiguation | 20.2 | 92.7 | 33.1 | 593 | 2,348 | 47 |
| Test set | 78.9 | 83.3 | 81.0 | 654 | 175 | 131 |
| Test set, no disambiguation | 49.6 | 87.5 | 63.3 | 687 | 699 | 98 |
| Test set, unextended lexicon | 70.7 | 72.5 | 71.6 | 569 | 236 | 216 |
| Test set, current performance | 90.7 | 82.4 | 86.4 | 647 | 66 | 138 |
Performance of the gene mention normalization component on the BioCreative II gene normalization sets. Each run includes the extended gene name lexicon, all false-positive filters, and the disambiguation, unless indicated otherwise. Results on the test set reflect official results achieved in the external evaluation; the last row shows the current performance, resulting from improvements added in the aftermath of BioCreative II.
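The precision, recall, and F measure columns follow from the true-positive, false-positive, and false-negative counts in the same rows. The sketch below assumes the balanced F measure (harmonic mean of precision and recall); the function name `precision_recall_f` is illustrative.

```python
from __future__ import annotations


def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return precision, recall, and balanced F measure as percentages (1 decimal)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F = 2PR / (P + R), which simplifies to 2TP / (2TP + FP + FN).
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return (
        round(100 * precision, 1),
        round(100 * recall, 1),
        round(100 * f_measure, 1),
    )


# Counts taken from the "Test set" row of the table above.
print(precision_recall_f(654, 175, 131))  # prints (78.9, 83.3, 81.0)
```

Applied to the "Training set" row (522, 114, 118), the same function gives 82.1, 81.6, and 81.8, matching the table.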
Impact of different context types on human gene mention normalization
| Context type | Precision | Recall | F measure |
| Baseline: NER only | 9.7 | 91.1 | 17.5 |
| NER + GeneRifs | 50.8 | 78.3 | 61.6 |
| NER + GO terms | 46.3 | 81.2 | 59.0 |
| NER + EntrezGene summaries | 49.0 | 66.7 | 56.5 |
| NER + diseases | 22.7 | 43.9 | 29.9 |
| NER + functions | 50.8 | 72.5 | 59.7 |
| NER + keywords | 53.0 | 53.6 | 53.3 |
| NER + locations | 74.2 | 14.8 | 24.7 |
| NER + tissues | 39.4 | 29.1 | 33.4 |
| NER + immediate context filter (heuristics) | 23.5 | 89.8 | 37.2 |
| NER + immediate context filter (HMM) | 52.9 | 80.8 | 63.4 |
| NER + PMIDs | 96.2 | 50.8 | 66.4 |
Starting from a baseline configuration (pure recognition of named entities; see text), each context type was evaluated separately. In addition, we present the impact of filtering by the immediate context: excluding genes from wrong species, abbreviations, and similar heuristics, and using a hidden Markov model (HMM) learned from the training data. Using PubMed IDs (PMIDs) curated for each gene (for instance, via GeneRIFs, Gene Ontology [GO] annotation, and UniProt) would be the best way to ensure high precision and F measure, although these data were not used for the BioCreative II evaluation. NER, named entity recognition.
Performance of the IPS system on the BioCreative II data
| Short description | Precision | Recall | Micro-F | Macro-F |
| | 1.8 | 43.7 | 3.5 | 3.1 |
| | 25.2 | 23.3 | 24.2 | 21.1 |
| | 22.8 | 21.6 | 22.2 | 19.7 |
Results for different strategies using the IntAct pattern collection on the BioCreative II test set. gl, guide list cut-off; ids, number of submitted IDs per protein; IPS, interaction pair subtask; ld, maximum length difference; m, minimum of identified interactions per pair and article required; sr, species resolution (pick first species from abstract [a], pick human [h], mouse [m], yeast [y], or from best scored protein [*], in the given order; see text for explanations on these parameters).
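The difference between the Micro-F and Macro-F columns can be sketched as follows, assuming the common convention that micro-averaging pools counts over all articles before computing F, while macro-averaging computes F per article and then takes the mean. The per-article counts below are toy values, and the function names are illustrative.

```python
from __future__ import annotations


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F measure from raw counts; 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f(per_article: list[tuple[int, int, int]]) -> tuple[float, float]:
    """per_article holds one (tp, fp, fn) tuple per article."""
    if not per_article:
        return 0.0, 0.0
    tp = sum(a[0] for a in per_article)
    fp = sum(a[1] for a in per_article)
    fn = sum(a[2] for a in per_article)
    micro = f_measure(tp, fp, fn)  # pool counts, then score once
    macro = sum(f_measure(*a) for a in per_article) / len(per_article)  # score, then average
    return micro, macro


# Toy counts (assumption): one article extracted well, one extracted poorly.
micro, macro = micro_macro_f([(8, 2, 2), (0, 1, 3)])
print(round(micro, 3), round(macro, 3))  # prints 0.667 0.4
```

Under this convention, an article with no correct pairs pulls the macro average down more than the micro average, which is consistent with the Macro-F values being lower than the Micro-F values in every row.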
Performance of PPI extraction on the Spies corpus
| Corpus | Short description of the submitted run | Precision | Recall | F measure |
| Spies | Initial pattern set | 85.8 | 15.2 | 25.8 |
| Spies | CP, single layer (POS tag including entity) | 76.6 | 47.1 | 58.3 |
| Spies | CP, multilayer (token, POS tag, stem, entity) | 78.7 | 51.9 | 62.6 |
| Spies | CP, optimized for precision | +1 | -4 | 60.1 |
| Spies | CP, optimized for recall | -5 | +5 | 63.9 |
Performance of our approach to protein-protein interaction (PPI) extraction on other external corpora. We also show the influence of using part-of-speech (POS) tags only compared with multilayer alignments, and results for optimization towards a single metric (precision or recall). Note that these evaluations do not require the identification of proteins, as in BioCreative II, so figures are higher in general. CP, consensus patterns resulting from clustering and multiple sentence alignment.
Sources of errors for the gene mention normalization
| Count | Cause | Evidence or examples |
| | False negatives | Evidence from abstract/closest lexicon entry |
| 24 | Polluting tokens | spectrin betaIV/spectrin beta non-erythrocytic |
| 35 | Unrecognized variations (orthographic, lexical, structural, morphological) | DCoHm/DCOHM; prothrombin/thrombin |
| 4 | Segmentation of name failed | hOBP (IIb)/hOBPIIb |
| 2 | Syntactically unrelated | polycomblike/PHD finger protein |
| 66 | Removed by filtering step | |
| | False positives | Examples, with EntrezGene ID |
| 30 | Triggered by wrong name boundary | type II |
| 30 | Context filtering (reference to cell etc.) | CD4+ |
| 22 | TF*IDF filter | five |
| 11 | Disambiguation picked wrong gene | Nup358 (440872 instead of 5903) |
| 8 | Abbreviation resolution failed | Wolf-Hirschhorn syndrome ( |
| 4 | Wrong species | |
| 2 | Overlap of names not recognized | |
| 2 | NER missed correct ID | TR2 (8740 instead of 10587) |
| 26 | Multiple identifiers for one name | |
| 40 | Other | |
Analysis of errors that occurred during gene identification, false negatives and false positives, and examples of errors. Words in italics are the parts recognized in longer compound names. NER, named entity recognition.
Performance for gene mention normalization for mouse, yeast, and fruit fly datasets
| Short description of the submitted run | Precision (%) | Recall (%) | F measure (%) | True positives | False positives | False negatives |
| Mouse, training set | 86.6 | 69.2 | 77.0 | 322 | 50 | 143 |
| Yeast, training set | 89.0 | 84.0 | 86.4 | 219 | 27 | 42 |
| Fly, training set | 87.9 | 55.6 | 68.1 | 124 | 17 | 99 |
| Mouse, test set | 91.6 | 72.6 | 81.0 | 355 | 36 | 149 |
| Yeast, test set | 94.9 | 84.8 | 89.6 | 520 | 28 | 93 |
| Fly, test set | 82.1 | 69.5 | 75.3 | 298 | 65 | 131 |
Current performance of the gene mention normalization component on the BioCreative I gene normalization sets. Each run includes an extended gene name lexicon (based on BioCreative I data and with additional synonyms from EntrezGene), all false positive filters, and the disambiguation.
Filtering rules for species, direct references, and chromosomal locations
| Species | |
| - | |
| + | |
| - | <candidate name> {(, ','} {a, an, the} |
| + | <candidate name> {(, ','} {a, an, the} human |
| + | |
| Direct mentions, cell lines, chromosomal loci | |
| + | <candidate name> {gene, protein} |
| - | <candidate name> {cell(s), culture(s)} |
| + | {locus, loci, location, chromosome, chromosomal, gene * associated} |
Examples of heuristic rules to filter out candidate names when they appear to refer to some other concept (gene from another species, cell line, disease locus).
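Two of the rules in the table, `<candidate name> {gene, protein}` and `<candidate name> {cell(s), culture(s)}`, can be expressed as regular expressions. The sketch below assumes that '+' means the candidate is kept and '-' means it is filtered out; the function name `keep_candidate` and the example sentences are hypothetical.

```python
from __future__ import annotations

import re


def keep_candidate(name: str, sentence: str) -> bool | None:
    """Return True (keep), False (filter out), or None when neither rule fires."""
    n = re.escape(name)  # names such as 'CD4+' contain regex metacharacters
    # '+' rule (assumed): the name is directly followed by 'gene' or 'protein'.
    if re.search(rf"{n}\s+(?:gene|protein)\b", sentence, re.IGNORECASE):
        return True
    # '-' rule (assumed): the name is followed by a reference to cells or cultures.
    if re.search(rf"{n}\s+(?:cells?|cultures?)\b", sentence, re.IGNORECASE):
        return False
    return None


print(keep_candidate("p54", "The p54 gene encodes an RNA helicase."))  # prints True
print(keep_candidate("CD4+", "CD4+ cells were isolated."))             # prints False
print(keep_candidate("p54", "p54 binds RNA."))                         # prints None
```

Returning `None` for the no-match case leaves the decision to later steps, such as the disambiguation stage, rather than forcing a verdict from the immediate context alone.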
Figure 3. Example of a multiple sentence alignment to identify consensus patterns. Here, five patterns extracted from the corpus define one consensus pattern (bottom). Three layers are used in this example: tokens, stems, and part-of-speech (POS) tags. The weights represent the overall distribution per position.
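The per-position weights of a consensus pattern can be illustrated for a single layer. The sketch below assumes the patterns have already been aligned to equal length, with `-` marking a gap, and simply counts symbol frequencies per column; it does not perform the alignment itself. The five toy rows reuse the symbols from the Figure 2 caption (PTN, V, W, E, D) but are invented for illustration.

```python
from __future__ import annotations

from collections import Counter


def consensus(aligned: list[list[str]]) -> list[dict[str, float]]:
    """Relative frequency of each symbol at each position of pre-aligned rows."""
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("rows must be aligned to the same length")
    return [
        {symbol: count / len(aligned) for symbol, count in Counter(column).items()}
        for column in zip(*aligned)  # iterate over alignment columns
    ]


# Toy single-layer alignment (assumption): five patterns, '-' is a gap.
rows = [
    ["PTN", "V", "-", "PTN"],
    ["PTN", "V", "E", "PTN"],
    ["PTN", "W", "E", "PTN"],
    ["PTN", "V", "D", "PTN"],
    ["PTN", "W", "E", "PTN"],
]
weights = consensus(rows)
for position, distribution in enumerate(weights):
    print(position, distribution)
```

In this toy alignment the second position is split between present-tense (V, 3 of 5) and past-tense (W, 2 of 5) interaction verbs. A multilayer version would keep one such distribution per layer (token, stem, POS tag) at each position.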