Zorana Ratkovic, Wiktoria Golik, Pierre Warnier.
Abstract
BACKGROUND: Bacteria biotopes cover a wide range of diverse habitats including animal and plant hosts, natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. This information is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will provide a great benefit to the field.
Year: 2012 PMID: 22759462 PMCID: PMC3384252 DOI: 10.1186/1471-2105-13-S11-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Biotope example. An example of a Bacteria Biotope event extraction task (BTID-50185).
Figure 2. BB event extraction workflow. Outline of the BB event extraction workflow.
Examples of type assignment.
| Example | Corpus term | Matching rule | Head | Assigned type |
|---|---|---|---|---|
| Ex. 1 | | Exact term | | Environment |
| Ex. 2 | | Term head | | Environment |
| Ex. 3 | | Term head and disambiguation | | Host-Part (not Water) |
| Ex. 4 | | Subterm head | | Water |
| Ex. 5 | | None | | None |
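The matching rules listed in the table above (exact term, term head, subterm head, none) can be read as a fallback cascade. The following is only a minimal sketch of that idea, not the Alvis implementation: the lexicon `TOY_LEXICON` and the function `assign_type` are hypothetical, the head of a term is assumed to be its last token, and the disambiguation step of Ex. 3 is left out.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy lexicon: resource label -> location type (illustrative only).
TOY_LEXICON: Dict[str, str] = {
    "soil": "Soil",
    "water": "Water",
    "lake": "Water",
    "human": "Host",
}


def assign_type(
    term: str, lexicon: Dict[str, str] = TOY_LEXICON
) -> Tuple[Optional[str], Optional[str]]:
    """Return (matching_rule, assigned_type) for a corpus term."""
    key = term.lower().strip()
    tokens = key.split()

    # Rule 1: the whole term is a lexicon entry.
    if key in lexicon:
        return ("Exact term", lexicon[key])
    if not tokens:
        return (None, None)

    # Rule 2: the main head matches. Assumption: the head is the last token.
    if tokens[-1] in lexicon:
        return ("Term head", lexicon[tokens[-1]])

    # Rule 3: fall back to the remaining tokens, nearest to the head first.
    for token in reversed(tokens[:-1]):
        if token in lexicon:
            return ("Subterm head", lexicon[token])

    # No rule applied: the term receives no type.
    return (None, None)
```

With this toy lexicon, `assign_type("forest soil")` falls through to the term-head rule, while a term with no known token receives no type. A real system would take heads from a syntactic term analysis rather than from token position.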
Examples of disambiguation and location boundary adjustment rules.
| Rule role | Rule example | Location term |
|---|---|---|
| Type disambiguation | IF | |
| Boundary extension | IF nationality before term (Env) | |
| Boundary reduction | IF irrelevant modifier in term | |
Number of terms in the test corpus per type assignment method and per resource.
| | MBTO | EnvO | DTa2 |
|---|---|---|---|
| % of corpus terms | 16% | 9% | 10% |
| Exact match | 147 | 46 | 72 |
| Main head of term | 133 | 114 | 103 |
| Subterm head | 5 | 4 | 3 |
| Ambiguous head | 26 | 21 | 17 |
| Total head matching | 164 | 139 | 123 |
| Total | 311 | 185 | 195 |
Number of anaphors found in each corpus per anaphor type.
| Corpus | Single ante | Bi ante | Taxon ante |
|---|---|---|---|
| Train | 933 | 4 | 129 |
| Dev | 204 | 3 | 22 |
| Test | 240 | 0 | 18 |
| Total | 1377 | 7 | 169 |
Alvis system scores for event extraction on the BB task.
| | Event recall | Event precision | F-score |
|---|---|---|---|
| Part-of event | 45 (23) +22 | 66 (79) -13 | 53 (36) +17 |
The Part-of event and total system scores are presented as provided by the task organizers. In each cell, the current system score is on the left, the previous (BioNLP 2011 challenge) score on the BB task is in parentheses, and the difference (+/-) between the two is on the right.
Alvis system scores for entity recall, recall, precision and F-score on the BB task by location type.
| | Bacteria | Host | Host-part | | | Water | Soil |
|---|---|---|---|---|---|---|---|
| Entity recall | 84 (84) | 82 (82) | 76 (72) | 61 (53) | 70 (29) | 64 (83) | 88 (86) |
| Recall | | 63 (61) | 56 (53) | 41 (29) | 31 (13) | 35 (60) | 70 (69) |
| Precision | | 55 (48) | 40 (42) | 24 (24) | 36 (38) | 44 (55) | 70 (59) |
| F-score | | 59 (53) | 47 (47) | 31 (26) | 34 (19) | 39 (57) | 70 (63) |
In each cell, the current system score is on the left and the previous (BioNLP 2011 challenge) score is in parentheses. Medical is not included because it is not significant. Food was not found in the test corpus.
Scores obtained by different resources and the symbolic syntactic-semantic type assignment method.
| | Entity recall | Recall | Precision | F-score |
|---|---|---|---|---|
| DTa2 | 74.6 | 47.6 | 50 | 48.8 |
| MBTO | 78.6 | 51.4 | 46 | 48.5 |
| MBTO+DTa2 | 79.2 | 52.1 | 46.3 | 49.1 |
| EnvO | 63.9 | 30.4 | 38.8 | 34.1 |
Scores obtained by different resources and the exact match type assignment method.
| | Entity recall | Recall | Precision | F-score |
|---|---|---|---|---|
| DTa2 | 64.3 | 37.7 | 51.9 | 43.7 |
| MBTO | 69.3 | 41.3 | 46.5 | 43.7 |
| MBTO+DTa2 | 69.7 | 41.7 | 47.1 | 44.3 |
| EnvO | 58.4 | 28.7 | 46 | 35.4 |
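The F-score columns in the two resource tables above are consistent with the usual balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the record itself does not state the formula) and recomputes the F-score from the recall and precision values of the symbolic syntactic-semantic table; the names `f_score` and `SYMBOLIC_ROWS` are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score), copied from the symbolic
# syntactic-semantic type assignment table.
SYMBOLIC_ROWS = {
    "DTa2": (47.6, 50.0, 48.8),
    "MBTO": (51.4, 46.0, 48.5),
    "MBTO+DTa2": (52.1, 46.3, 49.1),
    "EnvO": (30.4, 38.8, 34.1),
}

for resource, (recall, precision, reported) in SYMBOLIC_ROWS.items():
    recomputed = f_score(precision, recall)
    # Inputs are already rounded, so small deviations are expected.
    print(f"{resource}: recomputed {recomputed:.2f}, reported {reported}")
```

Because the published recall and precision are themselves rounded, the recomputed values agree with the reported ones only to within about 0.1.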
Figure 3. Score improvement. Performance changes with the addition of different modules. The baseline performance is the result obtained using only publicly available resources. Each subsequent experiment involves the addition of the specified modules.