Ning Kang, Erik M van Mulligen, Jan A Kors.
Abstract
BACKGROUND: To train chunkers to recognize noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available but small GSC that is supplemented with an SSC.
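The abstract does not state how the chunker outputs are combined into an SSC. The following is a minimal sketch, assuming a simple voting scheme in which a chunk is kept when enough systems propose exactly the same span and label. The function name `build_silver_chunks`, the `min_votes` threshold, and the example chunks are hypothetical and not taken from the paper.

```python
from collections import Counter


def build_silver_chunks(system_outputs, min_votes=2):
    """Keep every chunk proposed by at least `min_votes` systems.

    system_outputs: one collection per system, each holding
    (start_token, end_token, label) tuples for the same sentence.
    Assumption: chunks count as the same only on an exact span and label match.
    """
    votes = Counter()
    for chunks in system_outputs:
        # set() ensures a system cannot vote twice for the same chunk
        votes.update(set(chunks))
    return sorted(chunk for chunk, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three chunkers for one sentence (token offsets).
lingpipe = {(0, 2, "NP"), (2, 3, "VP"), (3, 6, "NP")}
opennlp = {(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")}
yamcha = {(0, 2, "NP"), (2, 4, "VP"), (3, 6, "NP")}

silver = build_silver_chunks([lingpipe, opennlp, yamcha], min_votes=2)
print(silver)
```

Raising `min_votes` to the number of systems keeps only unanimous chunks, which trades coverage for precision in the resulting silver annotations.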
Year: 2012 PMID: 22289351 PMCID: PMC3280170 DOI: 10.1186/1471-2105-13-17
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance (F-score) of chunkers and their combination when trained for noun-phrase and verb-phrase recognition on different training sets.
| System | Noun phrases: GENIA GSC | Noun phrases: PennBioIE SSC | Noun phrases: PennBioIE GSC | Verb phrases: GENIA GSC | Verb phrases: PennBioIE SSC | Verb phrases: PennBioIE GSC |
|---|---|---|---|---|---|---|
| Lingpipe | 75.8% | 78.5% | 78.7% | 90.6% | 91.6% | 91.9% |
| OpenNLP | 80.8% | 83.9% | 84.7% | 90.7% | 93.2% | 94.8% |
| Yamcha | 80.1% | 83.2% | 84.0% | 89.5% | 92.8% | 94.2% |
| Combined | 84.3% | 84.5% | 87.2% | 93.7% | 93.9% | 95.5% |
All systems are tested on the PennBioIE corpus.
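The record reports F-scores without giving the matching criterion. As a hedged illustration, the sketch below computes the F-score as the harmonic mean of precision and recall, assuming that a predicted chunk counts as correct only when its span and label match a gold chunk exactly. The function name and the example chunks are hypothetical.

```python
def chunk_f_score(gold, predicted):
    """F-score over chunks, assuming exact (start, end, label) matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted chunks for one sentence.
gold_chunks = {(0, 2, "NP"), (2, 3, "VP"), (3, 6, "NP")}
predicted_chunks = {(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")}

f_score = chunk_f_score(gold_chunks, predicted_chunks)
print(f"{f_score:.1%}")
```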
Performance (F-score) of chunkers and their combination trained on subsets of different size of the GENIA GSC and on the GSC subset supplemented with an SSC, for noun-phrase and verb-phrase recognition.
| GSC size | Lingpipe: GSC | Lingpipe: GSC+SSC | OpenNLP: GSC | OpenNLP: GSC+SSC | Yamcha: GSC | Yamcha: GSC+SSC | Combined: GSC | Combined: GSC+SSC |
|---|---|---|---|---|---|---|---|---|
| Noun phrases | ||||||||
| 10 | 65.8% | 80.8% | 83.0% | 87.9% | 82.7% | 85.6% | 86.8% | 90.7% |
| 25 | 72.2% | 81.1% | 85.7% | 88.3% | 84.3% | 86.0% | 87.9% | 90.9% |
| 50 | 76.8% | 81.3% | 87.5% | 88.6% | 85.4% | 86.2% | 88.9% | 91.2% |
| 100 | 78.2% | 81.9% | 87.9% | 88.9% | 85.6% | 86.6% | 89.3% | 91.5% |
| 250 | 82.4% | 82.8% | 88.3% | 89.3% | 86.7% | 87.2% | 90.6% | 92.0% |
| 500 | 84.5% | n.a. | 89.7% | n.a. | 88.1% | n.a. | 92.8% | n.a. |
| Verb phrases | ||||||||
| 10 | 64.1% | 86.9% | 84.3% | 93.6% | 86.2% | 92.5% | 91.3% | 94.6% |
| 25 | 73.8% | 87.3% | 88.8% | 94.0% | 89.7% | 92.9% | 93.0% | 94.9% |
| 50 | 79.2% | 87.6% | 92.1% | 94.4% | 91.7% | 93.1% | 94.4% | 95.5% |
| 100 | 83.6% | 87.9% | 93.6% | 94.7% | 92.3% | 93.4% | 95.4% | 95.8% |
| 250 | 88.3% | 88.7% | 95.0% | 95.3% | 93.8% | 93.9% | 95.8% | 96.0% |
| 500 | 90.3% | n.a. | 95.7% | n.a. | 94.1% | n.a. | 96.3% | n.a. |
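To make the effect of the SSC supplement easier to read off the table, the short sketch below computes the gain in percentage points for the noun-phrase row with GSC size 10. The F-scores are copied from that row; the variable names are illustrative only.

```python
# Noun-phrase F-scores (%) for GSC size 10, copied from the table above:
# system -> (trained on GSC only, trained on GSC+SSC)
np_size_10 = {
    "Lingpipe": (65.8, 80.8),
    "OpenNLP": (83.0, 87.9),
    "Yamcha": (82.7, 85.6),
    "Combined": (86.8, 90.7),
}

# Gain in percentage points from adding the SSC, rounded to one decimal.
gains = {
    system: round(with_ssc - gsc_only, 1)
    for system, (gsc_only, with_ssc) in np_size_10.items()
}

for system, gain in gains.items():
    print(f"{system}: +{gain} points")
```

For this row the gain is largest for Lingpipe (15.0 points) and smallest for Yamcha (2.9 points).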