Yoshimasa Tsuruoka, John McNaught, Sophia Ananiadou.
Abstract
BACKGROUND: One of the difficulties in mapping biomedical named entities (e.g. genes, proteins, chemicals and diseases) to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherently heavy computational cost discourages its use when the dictionaries are large or when real-time processing is required. A less computationally demanding approach is to normalize the terms using heuristic rules, which enables us to look up a dictionary in constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and is thus the bottleneck of the normalization approach.
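The normalization approach described above can be illustrated with a minimal sketch. The two rules used here (lowercasing, then removing spaces and hyphens) are hypothetical stand-ins chosen for illustration, not the rules from the paper, and `normalize`, `build_index` and `lookup` are made-up names. The point is only that once every dictionary term is stored under its normalized form in a hash table, a query costs one normalization plus one hash lookup, independent of dictionary size on average.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Hypothetical heuristic rules: lowercase, then drop spaces and hyphens."""
    return re.sub(r"[\s\-]", "", term.lower())


def build_index(entries):
    """Map each normalized term to the set of concept IDs it belongs to."""
    index = defaultdict(set)
    for concept_id, term in entries:
        index[normalize(term)].add(concept_id)
    return index


def lookup(index, query: str):
    """Normalize the query and fetch its concept IDs with one hash lookup."""
    return index.get(normalize(query), set())


# Toy dictionary of (concept ID, term) pairs, assumed for illustration.
entries = [(1, "IL2"), (1, "IL-2"), (2, "IL3"), (2, "IL-3")]
index = build_index(entries)

print(lookup(index, "il 2"))  # variant spelling resolves to concept 1
print(lookup(index, "IL4"))   # unknown term yields an empty set
```

Soft string matching, by contrast, would have to compare the query against many or all dictionary terms, which is where the cost difference comes from.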
Year: 2008 PMID: 18426547 PMCID: PMC2352870 DOI: 10.1186/1471-2105-9-S3-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Ambiguity and variability in a dictionary.
This is an imaginary dictionary consisting of three concept IDs. All terms belonging to the same concept ID are assumed to be synonymous (conveying the same meaning).
| Concept ID | Term |
| 1 | IL2 |
| 1 | IL-2 |
| 1 | Interleukin |
| 2 | IL3 |
| 2 | IL-3 |
| 2 | Interleukin |
| 3 | ZFP580 |
| 3 | ZFP581 |
| 3 | Zinc finger protein |
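As a rough illustration, the two dictionary statistics can be computed for the imaginary dictionary above. Variability is taken here as the average number of terms per concept ID, which agrees with the dictionary statistics reported for the real dictionaries (for example, 205,909 terms over 14,893 concept IDs gives about 13.826). The definition of ambiguity used below, the average number of concept IDs per distinct term, is an assumption, since this page does not spell it out.

```python
from collections import defaultdict

# The imaginary dictionary as (concept ID, term) pairs.
entries = [
    (1, "IL2"), (1, "IL-2"), (1, "Interleukin"),
    (2, "IL3"), (2, "IL-3"), (2, "Interleukin"),
    (3, "ZFP580"), (3, "ZFP581"), (3, "Zinc finger protein"),
]


def variability(entries):
    """Average number of terms per concept ID."""
    terms_by_id = defaultdict(set)
    for concept_id, term in entries:
        terms_by_id[concept_id].add(term)
    return sum(len(t) for t in terms_by_id.values()) / len(terms_by_id)


def ambiguity(entries):
    """Average number of concept IDs per distinct term (assumed definition)."""
    ids_by_term = defaultdict(set)
    for concept_id, term in entries:
        ids_by_term[term].add(concept_id)
    return sum(len(i) for i in ids_by_term.values()) / len(ids_by_term)


print(variability(entries))  # 9 entries / 3 concept IDs
print(ambiguity(entries))    # 9 entries / 8 distinct terms
```

Under these definitions, the shared term "Interleukin" is what pushes ambiguity above 1, while the three spellings per concept give a variability of 3.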
Statistics of the dictionaries
| Dictionary | #Concept IDs | #Terms | Ambiguity | Variability |
| Gene/protein name dictionary (original) | 14,893 | 205,909 | 5.715 | 13.826 |
| Gene/protein name dictionary (reduced) | 14,882 | 174,162 | 1.000 | 11.703 |
| Disease dictionary (original) | 48,391 | 148,531 | 1.005 | 3.069 |
| Disease dictionary (reduced) | 48,391 | 147,859 | 1.000 | 3.056 |
Discovering rules from a gene/protein dictionary
| Iter. | Dictionary ambiguity | Dictionary variability | Rule | Lookup precision | Lookup recall |
| 0 | 1.004 | 10.399 | | 0.975 | 0.194 |
| 1 | 1.006 | 10.101 | ‘ ’ → ‘-’ | 0.967 | 0.233 |
| 2 | 1.009 | 9.759 | ‘-’ → ‘’ | 0.966 | 0.280 |
| 3 | 1.012 | 9.318 | ‘protein’ → ‘’ | 0.958 | 0.340 |
| 4 | 1.013 | 9.155 | ‘precursor’ → ‘’ | 0.959 | 0.347 |
| 5 | 1.013 | 9.038 | ‘,’ → ‘’ | 0.961 | 0.366 |
| 6 | 1.013 | 9.006 | ‘incfinger’ → ‘nf’ | 0.961 | 0.368 |
| 7 | 1.013 | 8.979 | ‘isoforma’ → ‘’ | 0.962 | 0.375 |
| 8 | 1.013 | 8.953 | ‘isoformb’ → ‘’ | 0.962 | 0.377 |
| 9 | 1.013 | 8.937 | ‘prepro’ → ‘’ | 0.962 | 0.379 |
| 10 | 1.013 | 8.916 | ‘ike’ → ‘’ | 0.962 | 0.380 |
| 11 | 1.013 | 8.911 | ‘rotocadherin’ → ‘cdh’ | 0.962 | 0.380 |
| 12 | 1.013 | 8.891 | ‘(drosophila)’ → ‘’ | 0.962 | 0.383 |
| 13 | 1.013 | 8.873 | ‘variant’ → ‘’ | 0.962 | 0.384 |
| 14 | 1.014 | 8.867 | ‘nterleukin’ → ‘l’ | 0.962 | 0.384 |
| 15 | 1.014 | 8.857 | ‘drosophilahomologof’ → ‘homolog’ | 0.963 | 0.385 |
| 16 | 1.014 | 8.846 | ‘coupledrecepto’ → ‘p’ | 0.963 | 0.387 |
| 17 | 1.014 | 8.830 | ‘(s.cerevisiae)’ → ‘’ | 0.963 | 0.390 |
| : | : | : | : | : | : |
| 20 | 1.014 | 8.805 | ‘oncogene’ → ‘’ | 0.963 | 0.393 |
| 21 | 1.014 | 8.796 | ‘ingfinger’ → ‘nf’ | 0.963 | 0.394 |
| 22 | 1.014 | 8.790 | ‘isoformc’ → ‘’ | 0.963 | 0.395 |
| 23 | 1.014 | 8.783 | ‘ransmembrane’ → ‘mem’ | 0.963 | 0.395 |
| 24 | 1.014 | 8.778 | ‘ibosomal’ → ‘p’ | 0.964 | 0.396 |
| 25 | 1.014 | 8.770 | ‘subunit’ → ‘chain’ | 0.964 | 0.397 |
| 26 | 1.014 | 8.761 | ‘s.cerevisiaehomologof’ → ‘’ | 0.964 | 0.398 |
| : | : | : | : | : | : |
| 34 | 1.014 | 8.719 | ‘/’ → ‘f’ | 0.962 | 0.400 |
| : | : | : | : | : | : |
| 37 | 1.014 | 8.703 | ‘hypothetical’ → ‘’ | 0.962 | 0.402 |
| : | : | : | : | : | : |
| 41 | 1.014 | 8.685 | ‘eptid’ → ‘rote’ | 0.962 | 0.403 |
| 42 | 1.014 | 8.682 | ‘eucinerichrepeatcontaining’ → ‘rrc’ | 0.962 | 0.403 |
| 43 | 1.014 | 8.678 | ‘betadefensin’ → ‘defb’ | 0.962 | 0.404 |
| : | : | : | : | : | : |
| 57 | 1.014 | 8.639 | ‘molecule’ → ‘antigen’ | 0.962 | 0.405 |
| : | : | : | : | : | : |
| 62 | 1.014 | 8.631 | ‘oxonly’ → ‘x’ | 0.962 | 0.406 |
| 63 | 1.014 | 8.627 | ‘hromosome21openreadingframe’ → ‘21orf’ | 0.962 | 0.407 |
| 64 | 1.014 | 8.625 | ‘typeicytoskeletal’ → ‘’ | 0.962 | 0.408 |
| : | : | : | : | : | : |
| 68 | 1.014 | 8.611 | ‘member’ → ‘’ | 0.962 | 0.410 |
| 69 | 1.014 | 8.587 | ‘lfactoryreceptorfamily’ → ‘r’ | 0.963 | 0.413 |
| : | : | : | : | : | : |
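To make the table above easier to read, the following sketch applies its first six rules to a term. It assumes that terms are lowercased before any rule is applied (the rule strings are all lowercase) and that each rule is a global substring replacement applied in iteration order; both are inferences from the table rather than details stated on this page. The name `apply_rules` is hypothetical.

```python
# Rules from iterations 1-6 of the gene/protein table, as (old, new) pairs.
RULES = [
    (" ", "-"),
    ("-", ""),
    ("protein", ""),
    ("precursor", ""),
    (",", ""),
    ("incfinger", "nf"),
]


def apply_rules(term: str, rules=RULES) -> str:
    """Lowercase the term (assumed step 0), then apply each rule in order."""
    normalized = term.lower()
    for old, new in rules:
        normalized = normalized.replace(old, new)
    return normalized


# A long name and its abbreviated symbol end up with the same lookup key.
print(apply_rules("Zinc finger protein 580"))  # znf580
print(apply_rules("ZNF580"))                   # znf580
```

Read this way, a seemingly odd rule such as ‘incfinger’ → ‘nf’ makes sense: it operates on strings from which spaces and the word "protein" have already been removed, so it maps "zincfinger" to "znf".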
Discovering rules from a disease dictionary
| Iter. | Dictionary ambiguity | Dictionary variability | Rule | Lookup precision | Lookup recall |
| 0 | 1.001 | 2.794 | | 0.994 | 0.158 |
| 1 | 1.002 | 2.747 | ‘,’ → ‘’ | 0.989 | 0.184 |
| 2 | 1.002 | 2.667 | ‘ nos’ → ‘’ | 0.986 | 0.216 |
| 3 | 1.003 | 2.609 | ‘[x]’ → ‘’ | 0.985 | 0.263 |
| 4 | 1.003 | 2.580 | ‘o’ → ‘’ | 0.982 | 0.275 |
| 5 | 1.003 | 2.554 | ‘ies’ → ‘y’ | 0.983 | 0.291 |
| 6 | 1.003 | 2.529 | ‘ ’ → ‘-’ | 0.984 | 0.305 |
| 7 | 1.003 | 2.504 | ‘-’ → ‘;’ | 0.984 | 0.317 |
| 8 | 1.003 | 2.484 | ‘e’ → ‘i’ | 0.985 | 0.332 |
| 9 | 1.004 | 2.472 | ‘iasi’ → ‘rdir’ | 0.986 | 0.336 |
| 10 | 1.004 | 2.459 | ‘’s’ → ‘’ | 0.986 | 0.345 |
| 11 | 1.004 | 2.449 | ‘s’ → ‘z’ | 0.986 | 0.347 |
| 12 | 1.004 | 2.448 | ‘;(nz)’ → ‘’ | 0.986 | 0.347 |
| 13 | 1.004 | 2.447 | ‘kidniy’ → ‘rinal’ | 0.986 | 0.347 |
| 14 | 1.004 | 2.446 | ‘pulmnary’ → ‘lung’ | 0.986 | 0.347 |
| 15 | 1.004 | 2.443 | ‘ir’ → ‘ri’ | 0.986 | 0.348 |
| 16 | 1.004 | 2.441 | ‘aimia’ → ‘imiaz’ | 0.986 | 0.349 |
| 17 | 1.004 | 2.439 | ‘[d]’ → ‘’ | 0.986 | 0.349 |
| 18 | 1.004 | 2.436 | ‘aimlytic;animiaz’ → ‘imlytic;animia’ | 0.986 | 0.351 |
| : | : | : | : | : | : |
| 24 | 1.004 | 2.427 | ‘z;thi’ → ‘’ | 0.986 | 0.354 |
| : | : | : | : | : | : |
| 31 | 1.004 | 2.420 | ‘z;’ → ‘/’ | 0.986 | 0.355 |
| 32 | 1.004 | 2.348 | ‘/’ → ‘;’ | 0.987 | 0.377 |
| 33 | 1.004 | 2.348 | ‘dizrdri;liv’ → ‘livri;dizrd’ | 0.987 | 0.377 |
| : | : | : | : | : | : |
| 38 | 1.004 | 2.345 | ‘uding’ → ‘’ | 0.987 | 0.378 |
| : | : | : | : | : | : |
| 42 | 1.005 | 2.343 | ‘zufficiincy’ → ‘cmpitinci’ | 0.987 | 0.380 |
| : | : | : | : | : | : |
| 50 | 1.005 | 2.339 | ‘(in;zputum)’ → ‘in;zputum’ | 0.987 | 0.381 |
| : | : | : | : | : | : |
| 57 | 1.005 | 2.335 | ‘iincy’ → ‘’ | 0.987 | 0.382 |
| : | : | : | : | : | : |
| 70 | 1.005 | 2.333 | ‘[idta]’ → ‘’ | 0.987 | 0.385 |
| : | : | : | : | : | : |
| 89 | 1.005 | 2.327 | ‘ph’ → ‘f’ | 0.987 | 0.387 |
| : | : | : | : | : | : |
| 93 | 1.005 | 2.325 | ‘ci’ → ‘x’ | 0.987 | 0.388 |
| : | : | : | : | : | : |
Gene/protein name snippets.
Examples of the gene/protein name snippets used in the lookup experiments reported in Tables 6 and 7. The snippets are indicated in boldface type.
| Snippets in context | EntrezGene IDs |
| … conserved in | 1845 |
| These properties suggest that | 1845 |
| … the kinase domain of the | 2263 |
| … ( | 2263 |
| The | 196 |
| … as a component of the DNA binding form of the | 196 |
| : | : |
Evaluation using gene/protein name snippets from MEDLINE abstracts
| Iter. | Dictionary ambiguity | Dictionary variability | Rule | Lookup precision | Lookup recall |
| 0 | 5.797 | 12.479 | | 0.782 | 0.582 |
| 1 | 5.807 | 12.161 | ‘-’ → ‘’ | 0.766 | 0.603 |
| 2 | 5.811 | 12.025 | ‘ precursor’ → ‘’ | 0.767 | 0.611 |
| 3 | 5.812 | 11.941 | ‘,’ → ‘’ | 0.767 | 0.611 |
| 4 | 5.812 | 11.907 | ‘inc finger protein’ → ‘nf’ | 0.767 | 0.611 |
| 5 | 5.812 | 11.868 | ‘ isoform 1’ → ‘’ | 0.767 | 0.611 |
| 6 | 5.813 | 11.832 | ‘ isoform 2’ → ‘’ | 0.766 | 0.611 |
| 7 | 5.813 | 11.806 | ‘ isoform a’ → ‘’ | 0.766 | 0.611 |
| 8 | 5.813 | 11.781 | ‘ isoform b’ → ‘’ | 0.766 | 0.611 |
| 9 | 5.813 | 11.748 | ‘ containing protein’ → ‘containing’ | 0.766 | 0.611 |
| 10 | 5.813 | 11.730 | ‘ variant’ → ‘’ | 0.766 | 0.611 |
| : | : | : | : | : | : |
| 21 | 5.815 | 11.597 | ‘nterleukin’ → ‘l’ | 0.767 | 0.613 |
| : | : | : | : | : | : |
| 24 | 5.816 | 11.566 | ‘specific’ → ‘’ | 0.767 | 0.615 |
| : | : | : | : | : | : |
| 33 | 5.816 | 11.450 | ‘protein’ → ‘gene’ | 0.765 | 0.616 |
| 34 | 5.828 | 11.056 | ‘ gene’ → ‘’ | 0.765 | 0.619 |
| : | : | : | : | : | : |
| 38 | 5.829 | 11.016 | ‘ recepto’ → ‘’ | 0.767 | 0.623 |
| : | : | : | : | : | : |
| 44 | 5.830 | 10.970 | ‘ alph’ → ‘’ | 0.765 | 0.625 |
| : | : | : | : | : | : |
| 75 | 5.831 | 10.838 | ‘ i’ → ‘1’ | 0.766 | 0.626 |
| : | : | : | : | : | : |
| 84 | 5.831 | 10.790 | ‘ lpha’ → ‘’ | 0.766 | 0.627 |
| : | : | : | : | : | : |
| 86 | 5.831 | 10.782 | ‘ beta’ → ‘b’ | 0.767 | 0.630 |
| : | : | : | : | : | : |
| 100 | 5.832 | 10.732 | ‘ type’ → ‘’ | 0.767 | 0.633 |
Dictionary lookup performance.
This table shows the speed and accuracy of dictionary lookup tasks using the human gene/protein dictionary and gene/protein name snippets. F-score is the harmonic mean of precision and recall. The values in the parentheses are the threshold values in soft string matching.
| Method | Precision | Recall | F-score | Average lookup time (microseconds) |
| Bigram similarity (0.97) | 0.758 | 0.587 | 0.661 | 6.7 × 10^5 |
| Bigram similarity (0.95) | 0.691 | 0.592 | 0.638 | 6.8 × 10^5 |
| Bigram similarity (0.93) | 0.612 | 0.610 | 0.611 | 6.8 × 10^5 |
| No normalization | 0.809 | 0.502 | 0.619 | 7 |
| Case normalization | 0.782 | 0.582 | 0.666 | 8 |
| Heuristic normalization [ | 0.730 | 0.657 | 0.692 | 8 |
| | 0.767 | 0.633 | 0.694 | 29 |
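The F-score column can be reproduced from the precision and recall columns with the harmonic mean mentioned in the caption. The helper name `f_score` is hypothetical, and the zero guard is an added convention rather than something taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs from the last two rows of the table.
print(round(f_score(0.730, 0.657), 3))
print(round(f_score(0.767, 0.633), 3))
```

Because the table lists precision and recall rounded to three decimals, recomputing the F-score from them can differ from the printed value in the last digit for some rows.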