Haibin Liu, Tom Christiansen, William A Baumgartner, Karin Verspoor.
Abstract
BACKGROUND: The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of performing natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in recent biomedical research.
Year: 2012 PMID: 22464129 PMCID: PMC3359276 DOI: 10.1186/2041-1480-3-3
Source DB: PubMed Journal: J Biomed Semantics
OED categories related to biomedicine
| Animal Physiology | Bacteriology | Biochemistry | Biology |
|---|---|---|---|
| Botany | Cytology | Embryology | Genetics |
| Geomorphology | Haematology | Immunology | Marine Biology |
| Medicine | Microbiology | Morphology | Old Medicine |
| Palaeobotany | Palaeontolgy | Palaeontology | Pathology |
| Physiological | Physiology | Pisciculture | Plant Physiology |
| Veg. Physiolology | Veterinary Medicine | Veterinary Science | Zoology |
Consensus and disagreement of annotations across lemmatization tools
| | Consensus (No.) | Percentage | Disagreement (No.) | Percentage |
|---|---|---|---|---|
| All 9 tools | 4559 | 70.78% | 1882 | 29.22% |
| 8 tools (exclude CLEAR) | 5207 | 80.84% | 1234 | 19.16% |
| 6 tools (further exclude Norm and LuiNorm) | 5862 | 91.01% | 579 | 8.99% |
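The consensus and disagreement counts above amount to a per-token agreement check across tools. The following is a minimal sketch of that check, assuming each tool's output is available as a list of lemmas aligned token by token. The tool names and lemmas in the toy data are hypothetical and are not taken from the paper; only the counts passed to `share` come from the table.

```python
from typing import Dict, List, Tuple


def split_by_consensus(outputs: Dict[str, List[str]]) -> Tuple[List[int], List[int]]:
    """Return (consensus, disagreement) token indices.

    A token is in consensus when every tool produced the same lemma for it.
    """
    lengths = {len(lemmas) for lemmas in outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all tools must lemmatize the same token sequence")
    n_tokens = lengths.pop()

    consensus, disagreement = [], []
    for i in range(n_tokens):
        distinct = {lemmas[i] for lemmas in outputs.values()}
        (consensus if len(distinct) == 1 else disagreement).append(i)
    return consensus, disagreement


def share(part: int, total: int) -> str:
    """Format part/total as a percentage with two decimals."""
    return f"{part / total:.2%}"


# Hypothetical toy outputs for three tools over four tokens.
toy = {
    "tool_a": ["cell", "bind", "phosphorylate", "mouse"],
    "tool_b": ["cell", "bind", "phosphorylated", "mouse"],
    "tool_c": ["cell", "bind", "phosphorylate", "mice"],
}
agree, differ = split_by_consensus(toy)
print(agree, differ)

# Counts reported in the table for all 9 tools.
print(share(4559, 4559 + 1882), share(1882, 4559 + 1882))
```

Dropping a tool from the dictionary and rerunning the check mirrors the "8 tools" and "6 tools" rows, where excluding tools enlarges the consensus set.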
Lemmatization performance comparison of lemmatization tools on CRAFT set
| | Recall | Precision | F-score |
|---|---|---|---|
| MorphAdorner | 81.87% (474/579) | 82.29% (474/576) | 82.08% |
| | 72.71% (421/579) | 72.71% (421/579) | 72.71% |
| CLEAR | 72.37% (419/579) | 72.37% (419/579) | 72.37% |
| WordNet | 74.27% (430/579) | 70.03% (430/614) | 72.09% |
| GENIA Tagger | 72.02% (417/579) | 72.02% (417/579) | 72.02% |
| Norm | 83.25% (482/579) | 59.36% (482/812) | 69.30% |
| LuiNorm | 62.18% (360/579) | 62.50% (360/576) | 62.34% |
| TreeTagger | 50.78% (294/579) | 50.78% (294/579) | 50.78% |
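The three columns can be recomputed from the raw counts shown in parentheses. The sketch below assumes the F-score is the balanced harmonic mean of precision and recall (F1), which is consistent with the values in the table. The function name and its parameters are illustrative, not part of any of the evaluated tools. The differing denominators (for example 812 returned lemmas for Norm against 579 gold tokens) suggest that some tools return more than one candidate per token, although the table itself does not state this.

```python
from typing import Tuple


def prf(correct: int, gold_total: int, returned_total: int) -> Tuple[float, float, float]:
    """Compute (recall, precision, F1) from raw counts.

    correct        -- number of correct lemmas produced
    gold_total     -- number of gold-standard instances (recall denominator)
    returned_total -- number of lemmas the tool returned (precision denominator)
    """
    recall = correct / gold_total
    precision = correct / returned_total
    if correct == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Counts copied from the CRAFT table rows for MorphAdorner and Norm.
for name, counts in {"MorphAdorner": (474, 579, 576), "Norm": (482, 579, 812)}.items():
    r, p, f = prf(*counts)
    print(f"{name}: recall={r:.2%} precision={p:.2%} F-score={f:.2%}")
```

When a tool returns exactly one lemma per token, the two denominators coincide and all three measures are equal, as in the CLEAR, GENIA Tagger, and TreeTagger rows.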
Lemmatization performance comparison of lemmatization tools on OED set
| | Recall | Precision | F-score |
|---|---|---|---|
| | 81.56% (659/808) | | |
| | 75.74% (612/808) | 75.74% (612/808) | 75.74% |
| LuiNorm | 73.02% (590/808) | 73.02% (590/808) | 73.02% |
| Norm | | 61.18% (692/1131) | 71.37% |
| CLEAR | 62.50% (505/808) | 62.50% (505/808) | 62.50% |
| MorphAdorner | 55.45% (448/808) | 55.45% (448/808) | 55.45% |
| WordNet | 56.56% (457/808) | 54.21% (457/843) | 55.36% |
| TreeTagger | 53.96% (436/808) | 53.96% (436/808) | 53.96% |
| GENIA Tagger | 49.01% (396/808) | 49.01% (396/808) | 49.01% |
Lemmatization performance comparison of lemmatization tools on LLL05 set
| | Recall | Precision | F-score |
|---|---|---|---|
| MorphAdorner | 97.22% (908/934) | 97.22% (908/934) | 97.22% |
| GENIA Tagger | 96.79% (904/934) | 96.79% (904/934) | 96.79% |
| | 96.36% (900/934) | 96.36% (900/934) | 96.36% |
| TreeTagger | 96.25% (899/934) | 96.25% (899/934) | 96.25% |
| WordNet | 96.90% (905/934) | 95.36% (905/949) | 96.12% |
| CLEAR | 93.36% (872/934) | 93.36% (872/934) | 93.36% |
| LuiNorm | 84.90% (793/934) | 85.92% (793/923) | 85.41% |
| Norm | 90.79% (848/934) | 79.55% (848/1066) | 84.80% |
Incorrect and inconsistent instances in LLL05 set
| | Token | Generated POS | BioLemmatizer lemma | LLL05 gold lemma |
|---|---|---|---|---|
| 1 | predominants | NNS | predominant | predominants |
| 2 | coding | NN | coding | code |
| 3 | Most | JJS | many | most |
| 4 | primer | NN | primer | prime |
| 5 | directed | JJ | directed | direct |
| 6 | might | MD | may | might |
| 7 | located | JJ | located | locate |
| 8 | least | JJS | little | least |
| 9 | more | RBR | much | more |
Task-specific normalization instances in LLL05 set
| | Token | Generated POS | BioLemmatizer lemma | LLL05 gold lemma |
|---|---|---|---|---|
| 1 | sigmaG | NN | sigmaG | sigG |
| 2 | sigmaK | NN | sigmaK | sigK |
| 3 | sigmaE | NN | sigmaE | sigE |
| 4 | sigmaA | NN | sigmaA | sigA |
| 5 | sigmaD | NN | sigmaD | sigD |
| 6 | sigmaF | NN | sigmaF | sigF |
| 7 | sigmaL | NN | sigmaL | sigL |
| 8 | sigmaB | NN | sigmaB | sigB |
| 9 | sigmaH | NN | sigmaH | sigH |
| 10 | ykvD | NN | ykvD | kinD |
| 11 | ykrQ | NN | ykrQ | kinE |
| 12 | B. | NNP | B. | Bacillus |
| 13 | fulfil | VB | fulfil | fulfill |
Lemmatization performance comparison of lemmatization tools on updated LLL05 set
| | Recall | Precision | F-score |
|---|---|---|---|
| MorphAdorner | 98.93% (924/934) | 98.93% (924/934) | 98.93% |
| GENIA Tagger | 97.97% (915/934) | 97.97% (915/934) | 97.97% |
| | 97.75% (913/934) | 97.75% (913/934) | 97.75% |
| WordNet | 98.18% (917/934) | 96.63% (917/949) | 97.40% |
| TreeTagger | 96.68% (903/934) | 96.68% (903/934) | 96.68% |
| CLEAR | 94.65% (884/934) | 94.65% (884/934) | 94.65% |
| LuiNorm | 85.87% (802/934) | 86.89% (802/923) | 86.38% |
| Norm | 91.86% (858/934) | 80.49% (858/1066) | 85.80% |
Lemmatization performance of the BioLemmatizer resources on CRAFT set
| Silver Standard | Recall | Precision | F-score |
|---|---|---|---|
| Base (MorphAdorner lexicon) | 94.37% (5532/5862) | 94.16% (5532/5875) | 94.26% |
| Base + GENIA | 94.20% (5522/5862) | 93.90% (5522/5881) | 94.05% |
| Base + BioLexicon | 98.41% (5769/5862) | 98.23% (5769/5873) | 98.32% |
| Entire Lexicon | 98.60% (5780/5862) | 98.42% (5780/5873) | 98.51% |
| Rule Only | 97.83% (5735/5862) | 97.83% (5735/5862) | 97.83% |
| Rule + Lexicon Validation | 98.67% (5784/5862) | 98.67% (5784/5862) | 98.67% |
| | Recall | Precision | F-score |
| Base (MorphAdorner lexicon) | 53.71% (311/579) | 53.34% (311/583) | 53.52% |
| Base + GENIA | 62.69% (363/579) | 61.95% (363/586) | 62.32% |
| Base + BioLexicon | 64.77% (375/579) | 64.10% (375/585) | 64.43% |
| Entire Lexicon | 76.68% (444/579) | 75.90% (444/585) | 76.29% |
| Rule Only | 85.84% (497/579) | 85.84% (497/579) | 85.84% |
| Rule + Lexicon Validation | 90.85% (526/579) | 90.85% (526/579) | 90.85% |
Lemmatization performance of the BioLemmatizer resources on OED set
| | Recall | Precision | F-score |
|---|---|---|---|
| Base (MorphAdorner lexicon) | 53.34% (431/808) | 53.34% (431/808) | 53.34% |
| Base + GENIA | 52.97% (428/808) | 52.97% (428/808) | 52.97% |
| Base + BioLexicon | 54.08% (437/808) | 54.08% (437/808) | 54.08% |
| Entire Lexicon | 54.21% (438/808) | 54.21% (438/808) | 54.21% |
| Rule Only | 66.96% (541/808) | 66.96% (541/808) | 66.96% |
| Rule + Lexicon Validation | 71.29% (576/808) | 71.29% (576/808) | 71.29% |
Event extraction performance using various lemmatization tools on GE development set
| | Recall | Precision | F-score |
|---|---|---|---|
| | | 58.70% (1106/1884) | |
| | 33.58% (1089/3243) | 58.83% (1089/1851) | 42.76% |
| GENIA Tagger | 33.58% (1089/3243) | 58.64% (1089/1857) | 42.71% |
| MorphAdorner | 33.52% (1087/3243) | 58.57% (1087/1856) | 42.64% |
| WordNet | 33.21% (1077/3243) | | 42.51% |
| TreeTagger | 33.09% (1073/3243) | 58.89% (1073/1822) | 42.37% |
Event extraction performance using various lemmatization tools on GE development set
| Simple Events | Recall | Precision | F-score |
|---|---|---|---|
| | | 77.82% (656/843) | |
| | 58.94% (653/1108) | 77.92% (653/838) | 67.11% |
| GENIA Tagger | 58.84% (652/1108) | 77.99% (652/836) | 67.08% |
| WordNet | 58.84% (652/1108) | 77.90% (652/837) | 67.04% |
| MorphAdorner | 58.75% (651/1108) | 77.87% (651/836) | 66.98% |
| TreeTagger | 58.30% (646/1108) | | 66.80% |
| | Recall | Precision | F-score |
| TreeTagger | 24.66% (92/373) | | |
| | | 43.46% (93/214) | 31.69% |
| | 24.93% (93/373) | 43.46% (93/214) | 31.69% |
| GENIA Tagger | 24.93% (93/373) | 43.46% (93/214) | 31.69% |
| MorphAdorner | 24.93% (93/373) | 43.46% (93/214) | 31.69% |
| WordNet | 23.32% (87/373) | 43.72% (87/199) | 30.42% |
| | Recall | Precision | F-score |
| | 19.47% (343/1762) | 42.93% (343/799) | 26.79% |
| GENIA Tagger | 19.52% (344/1762) | 42.63% (344/807) | 26.78% |
| MorphAdorner | 19.47% (343/1762) | 42.56% (343/806) | 26.71% |
| WordNet | 19.18% (338/1762) | 42.89% (338/788) | 26.51% |
| TreeTagger | 19.01% (335/1762) | 42.41% (335/790) | 26.25% |
Figure 1. The MorphAdorner lemmatization flow diagram.
Figure 2. The BioLemmatizer Part-Of-Speech search hierarchy.
Distribution of sources for the BioLemmatizer lexicon
| | Lexical Source | Domain of Focus | POS tagset | No. of Entries | % of Total |
|---|---|---|---|---|---|
| 1 | MorphAdorner | General English | NUPOS | 161,166 | 46% |
| 2 | GENIA tagger | Biomedicine | Penn Treebank | 68,990 | 20% |
| 3 | BioLexicon | Biomedicine | Penn Treebank | 116,809 | 34% |
| Total | BioLemmatizer | Biomedicine | NUPOS, Penn Treebank | 346,965 | 100% |
Lemmatization rule comparison between BioLemmatizer and MorphAdorner
| | MorphAdorner | BioLemmatizer | MorphAdorner | BioLemmatizer |
|---|---|---|---|---|
| Adjective | 24 | 26 | 0 | 25 |
| Adverb | 3 | 3 | 0 | 0 |
| Verb | 163 | 165 | 6 | 11 |
| Noun | 10 | 22 | 0 | 6 |
| Total | 200 | 216 | 6 | 42 |
Figure 3. The BioLemmatizer lemmatization flow diagram.