Paul Thompson, Syed A Iqbal, John McNaught, Sophia Ananiadou.
Abstract
BACKGROUND: Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.
Year: 2009 PMID: 19852798 PMCID: PMC2774701 DOI: 10.1186/1471-2105-10-349
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Example annotation output
| Verb | Argument 1 | Argument 2 | Argument 3 |
| --- | --- | --- | --- |
| activated | NifA: | glnAP2: | In Escherichia coli: |
| encodes | GlnA: | glutamine synthetase: | |
For each verb, its separate arguments within the sentence are indicated, together with the biological concepts assigned to them. Each argument is also categorized according to the semantic roles assigned.
Semantic roles
| Role | Description | Example sentence |
| --- | --- | --- |
| AGENT | Drives/instigates event | |
| THEME | a) Affected by/results from event | |
| MANNER | Method/way in which event is carried out | cpxA gene |
| INSTRUMENT | Used to carry out event | EnvZ |
| LOCATION | Where | Phosphorylation of OmpR |
| SOURCE | Start point of event | A transducing lambda phage was |
| DESTINATION | End point of event | Transcription is activated by |
| TEMPORAL | Situates event in time/w.r.t another event | The Alp protease activity is |
| CONDITION | Environmental conditions/changes in conditions | Strains carrying a mutation in the crp structural gene fail to |
| RATE | Change of level or rate | marR mutations |
| DESCRIPTIVE-AGENT | Descriptive information about AGENT of event | HyfR |
| DESCRIPTIVE-THEME | Descriptive information about THEME of event | The ptsH mutant |
| PURPOSE | Purpose/reason for the event occurring | The fusion strains were |
For each semantic role, a brief description is given, together with an example sentence containing an instance of the role. In the example sentences, the verb on which the event is centred is indicated in italics, whilst the event argument corresponding to the appropriate semantic role is indicated in bold.
Inter-annotator agreement during training
| Subtask | C1 | C2 | C3 | C4 | C5 |
| --- | --- | --- | --- | --- | --- |
| Event identification | 58.35 | 56.01 | 68.26 | 77.07 | 71.94 |
| Argument identification (relaxed span match) | 80.45 | 85.05 | 91.45 | 89.39 | 91.09 |
| Argument identification (exact span match) | 61.92 | 63.98 | 73.96 | 79.84 | 79.17 |
| Semantic role assignment | 67.27 | 75.21 | 93.91 | 84.89 | 86.59 |
| Bio-concept identification | 71.35 | 78.65 | 78.29 | 88.55 | 82.36 |
| Bio-concept category assignment (exact category) | 72.34 | 72.05 | 71.61 | 68.84 | 59.76 |
| Bio-concept category assignment (including parent) | 77.53 | 76.74 | 75.11 | 71.58 | 63.65 |
| Bio-concept supercategory assignment | 89.21 | 89.32 | 93.45 | 90.57 | 84.09 |
Each numbered column (C1 to C5) displays the IAA results calculated after a particular cycle of training, for a number of separate annotation subtasks. Agreement was calculated between each pair of annotators, and the figures shown in the table are averages amongst all pairs of annotators. Training cycles C1 to C4 were concerned with E. coli abstracts, whilst cycle C5 concerned human abstracts.
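The pairwise averaging described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: it assumes an F-score as the agreement measure for the training cycles and exact matching of annotated items. The annotator names and trigger words in `toy` are hypothetical.

```python
from itertools import combinations


def f_score(items_a, items_b):
    """F-score between two annotators' item sets, treating one as reference.

    With exact matching, precision and recall simply swap when the
    reference annotator changes, so the resulting F-score is symmetric.
    """
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0  # assumption: two empty annotations count as full agreement
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


def average_pairwise_agreement(annotations):
    """Average the F-score over every unordered pair of annotators."""
    pairs = list(combinations(sorted(annotations), 2))
    return sum(f_score(annotations[x], annotations[y]) for x, y in pairs) / len(pairs)


# Hypothetical toy data: event trigger words marked by three annotators.
toy = {
    "ann1": {"activated", "encodes", "binds"},
    "ann2": {"activated", "encodes"},
    "ann3": {"activated", "binds", "represses"},
}
print(round(100 * average_pairwise_agreement(toy), 2))
```

Averaging over unordered pairs means each pair contributes once, regardless of which annotator is treated as the reference.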
General corpus statistics
| Statistic | Complete corpus | E. coli | Human |
| --- | --- | --- | --- |
| No of abstracts | 240 | 167 | 73 |
| No of events | 3067 | 2394 | 673 |
| Average Events per abstract | 12.78 | 14.34 | 9.22 |
| Distinct nom. verbs annotated | 91 | 81 | 36 |
| Events centred on nominalised verbs | 1274 | 1066 | 208 |
| Distinct verbs annotated | 184 | 152 | 107 |
| Events centred on verbs | 1793 | 1328 | 465 |
Separate figures are shown for the complete corpus, the E. coli part of the corpus and the human part of the corpus.
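As a quick consistency check, the derived row (average events per abstract) can be recomputed from the counts in the table. The sketch below only copies figures from the table. The dictionary layout and key names are illustrative choices, not part of the corpus distribution.

```python
# Figures copied from the General corpus statistics table.
corpus = {
    "complete": {"abstracts": 240, "events": 3067, "nominalised": 1274, "verbs": 1793},
    "E. coli": {"abstracts": 167, "events": 2394, "nominalised": 1066, "verbs": 1328},
    "human": {"abstracts": 73, "events": 673, "nominalised": 208, "verbs": 465},
}


def events_per_abstract(part):
    """Average number of events per abstract, rounded to two decimals."""
    return round(part["events"] / part["abstracts"], 2)


for name, part in corpus.items():
    # In the table, verb-centred and nominalisation-centred events
    # add up to the total number of events for each corpus part.
    assert part["nominalised"] + part["verbs"] == part["events"]
    print(name, events_per_abstract(part))
```

The E. coli and human figures also sum to the complete-corpus figures for abstracts and events (167 + 73 = 240 and 2394 + 673 = 3067).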
Most common words describing events
| Word (combined) | Events | Word (E. coli) | Events | Word (human) | Events |
| --- | --- | --- | --- | --- | --- |
| Expression | 362 | Expression | 309 | Expression | 53 |
| Encode | 175 | Transcription | 139 | Encode | 50 |
| Transcription | 171 | Encode | 125 | Express | 36 |
| Bind | 143 | Bind | 110 | Bind | 33 |
| Regulation | 119 | Regulation | 102 | Transcription | 32 |
| Activate | 106 | Regulate | 87 | Activate | 29 |
| Regulate | 106 | Activate | 77 | Interact | 21 |
| Repress | 82 | Repress | 72 | Regulate | 19 |
| Require | 73 | Binding | 61 | Require | 19 |
| Activation | 67 | Repression | 60 | Involve | 18 |
Separate lists are shown for the corpus as a whole (combined), and for the separate E. coli and human parts of the corpus. For each word, its type is given (either (V)erb or (N)ominalised verb) together with an indication of the total number of annotated events centred on the word and the percentage of all events in the corpus (or corpus part) that this figure represents.
Figure 1. Distribution of event argument counts. Each section of the chart shows the percentage of events in the GREC that have been annotated with the indicated number of arguments.
Semantic role occurrences
| Role | Arguments | % of events |
| --- | --- | --- |
| THEME | 2593 | 84.55 |
| AGENT | 1648 | 53.73 |
| MANNER | 416 | 13.56 |
| LOCATION | 300 | 9.78 |
| DESTINATION | 193 | 6.29 |
| CONDITION | 152 | 4.96 |
| DESCRIPTIVE-THEME | 137 | 4.47 |
| SOURCE | 83 | 2.71 |
| DESCRIPTIVE-AGENT | 68 | 2.22 |
| PURPOSE | 65 | 2.12 |
| TEMPORAL | 53 | 1.73 |
| RATE | 50 | 1.63 |
| INSTRUMENT | 32 | 1.04 |
For each role, the total number of event arguments to which the role has been assigned in the corpus is indicated, together with the percentage of events in which the role has been assigned to an argument.
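The two figures per role can be derived from event annotations roughly as sketched below. The representation of an event as a list of role labels is an assumption made for illustration, and the toy events are hypothetical. For reference, the percentages in the table are consistent with dividing each count by the 3067 events in the corpus (for example, 2593 / 3067 is about 84.55%).

```python
from collections import Counter


def role_statistics(events):
    """Return {role: (argument_count, percent_of_events_with_role)}.

    `events` is a list of events, each given as a list of the semantic
    roles assigned to its arguments (an assumed, simplified format).
    """
    arg_counts = Counter()    # every argument carrying the role
    event_counts = Counter()  # events with at least one such argument
    for roles in events:
        arg_counts.update(roles)
        event_counts.update(set(roles))
    total = len(events)
    return {
        role: (arg_counts[role], round(100 * event_counts[role] / total, 2))
        for role in arg_counts
    }


# Hypothetical toy events.
toy_events = [
    ["AGENT", "THEME"],
    ["THEME"],
    ["THEME", "LOCATION"],
    ["AGENT", "THEME", "MANNER"],
]
for role, (n_args, pct) in role_statistics(toy_events).items():
    print(role, n_args, pct)
```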
Most common semantic role patterns
| AGENT | THEME | Other role | Events (%) |
| --- | --- | --- | --- |
| AGENT | THEME | | 947 (30.88) |
| | THEME | | 693 (22.60) |
| | THEME | DESCRIPTIVE-THEME | 119 (3.88) |
| | THEME | LOCATION | 117 (3.81) |
| AGENT | THEME | MANNER | 113 (3.68) |
| AGENT | | DESTINATION | 113 (3.68) |
| | THEME | MANNER | 112 (3.65) |
| AGENT | THEME | LOCATION | 64 (2.09) |
| AGENT | | | 59 (1.92) |
| AGENT | | DESCRIPTIVE-AGENT | 51 (1.66) |
| | THEME | CONDITION | 47 (1.53) |
| | | MANNER | 42 (1.37) |
| AGENT | THEME | CONDITION | 38 (1.24) |
| | | SOURCE | 36 (1.17) |
| | THEME | PURPOSE | 31 (1.01) |
The patterns shown are independent of their ordering in the text. If the roles of AGENT and/or THEME are present in the pattern, this is indicated in the 1st and 2nd columns, respectively. The 3rd column shows any other role present in the pattern (the most common patterns all have a maximum of one role which is not AGENT and/or THEME). The final column shows the total number of events that have been annotated with each pattern in the corpus, together with the percentage of all events in the corpus that this figure represents.
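Counting patterns independently of argument order amounts to normalising each event's roles before tallying. The sketch below sorts the role labels to obtain an order-free key. The list-of-roles event format and the toy events are assumptions for illustration only.

```python
from collections import Counter


def pattern_counts(events):
    """Tally semantic role patterns, ignoring the order of arguments.

    Returns (pattern, count, percent_of_events) tuples, most common first.
    """
    # Sorting the roles makes ["THEME", "AGENT"] and ["AGENT", "THEME"]
    # map to the same key.
    counts = Counter(tuple(sorted(roles)) for roles in events)
    total = len(events)
    return [
        (pattern, n, round(100 * n / total, 2))
        for pattern, n in counts.most_common()
    ]


# Hypothetical toy events; the first two differ only in argument order.
toy_events = [
    ["AGENT", "THEME"],
    ["THEME", "AGENT"],
    ["THEME"],
    ["THEME", "LOCATION"],
]
for pattern, n, pct in pattern_counts(toy_events):
    print(" + ".join(pattern), n, pct)
```

A sorted tuple is used rather than a set so that an event with two arguments of the same role would remain distinguishable from one with a single such argument.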
Figure 2. Distribution of biological concept supercategories. Each section of the chart shows the percentage of annotated biological concepts in the GREC that have been assigned a concept class belonging to the indicated supercategory.
Comparison of concept assignments in E. coli and human abstracts
| Category (E. coli) | Concepts | Type | Category (human) | Concepts | Type |
| --- | --- | --- | --- | --- | --- |
| Gene | 645 | G | Gene | 129 | G |
| Gene_Expression | 350 | G | Protein | 112 | S |
| Regulator | 287 | S | Transcription_Factor | 107 | G |
| Promoter | 255 | S | Gene_Expression | 83 | G |
| Transcription | 200 | S | Cells | 61 | S |
| Regulation | 199 | S | Transcription | 60 | S |
| Gene_Activation | 189 | S | Gene_Activation | 60 | S |
| Protein | 170 | S | Activator | 47 | S |
| Repressor | 158 | S | Regulation | 43 | S |
| Activator | 150 | S | DNA | 33 | S |
| Operon | 148 | S | Promoter | 31 | S |
| Gene_Repression | 136 | S | Transcription_Binding_Site | 31 | G |
| Locus | 99 | S | Protein_Complex | 26 | S |
| Enzyme | 82 | G | Sub_Unit | 23 | S |
| DNA | 79 | S | mRNA | 22 | S |
Separate lists are shown for E. coli abstracts and human abstracts. For each category, the total number of identified concepts assigned to the category is indicated, together with the percentage of all events in the corpus section that this figure represents. The Type column indicates whether each category is a (G)eneral category within its hierarchy (meaning that it has its own child concepts) or a (S)pecific category, indicating that it is a bottom-level category with no child concepts within its hierarchy.
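The distinction between general and specific categories, and the way a parent-inclusive comparison might work, can be sketched as below. This is an illustration under stated assumptions: the hierarchy fragment uses placeholder child names that are not categories from the corpus, and treating two assignments as agreeing when one is the direct parent of the other is one plausible reading of "including parent", not a definition taken from the source.

```python
# Hypothetical fragment of a concept hierarchy, stored as child -> parent.
# The child names are placeholders, not categories from the corpus.
PARENT = {
    "Example_Gene_Subtype": "Gene",
    "Example_Enzyme_Subtype": "Enzyme",
}


def is_general(category):
    """A general (G) category has at least one child in the hierarchy."""
    return category in PARENT.values()


def categories_agree(cat_a, cat_b, include_parent=False):
    """Compare two annotators' category assignments for one concept.

    With include_parent=True, an assignment also counts as agreeing when
    one category is the direct parent of the other (an assumed rule).
    """
    if cat_a == cat_b:
        return True
    if include_parent:
        return PARENT.get(cat_a) == cat_b or PARENT.get(cat_b) == cat_a
    return False


print(categories_agree("Gene", "Example_Gene_Subtype"))
print(categories_agree("Gene", "Example_Gene_Subtype", include_parent=True))
```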
General agreement statistics in the GREC
| Subtask | E. coli | Human |
| --- | --- | --- |
| Event identification | 72.27% | 76.37% |
| Argument identification (relaxed span match) | 90.23% | 91.27% |
| Argument identification (exact span match) | 75.10% | 77.48% |
| Semantic role assignment | 88.96% | 88.30% |
| Biological concept identification | 82.55% | 82.03% |
| Bio-concept category assignment (exact) | 71.02% | 66.03% |
| Bio-concept assignment (including parent) | 75.38% | 68.97% |
| Bio-concept supercategory assignment | 95.52% | 94.75% |
Average F-Score agreement figures are shown for several annotation subtasks, with separate figures being shown for the E. coli and human parts of the corpus.
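The difference between the exact and relaxed span match rows can be illustrated with a small comparison function. The text here does not define relaxed matching, so the sketch assumes it means that the two spans overlap by at least one character. Spans are represented as hypothetical `(start, end)` character offsets with an exclusive end.

```python
def spans_match(span_a, span_b, relaxed=False):
    """Compare two argument spans given as (start, end) offsets, end exclusive.

    Exact matching requires identical boundaries. Relaxed matching is
    assumed here to mean any overlap between the two spans.
    """
    start_a, end_a = span_a
    start_b, end_b = span_b
    if relaxed:
        return start_a < end_b and start_b < end_a
    return (start_a, end_a) == (start_b, end_b)


# Two annotators mark the same argument with different left boundaries.
print(spans_match((0, 10), (3, 10)))
print(spans_match((0, 10), (3, 10), relaxed=True))
```

Under this reading, every exact match is also a relaxed match, which is consistent with the relaxed figures in the table being the higher of the two.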
Individual role agreement statistics
| Role (E. coli) | N | Agreement | Role (human) | N | Agreement |
| --- | --- | --- | --- | --- | --- |
| THEME | 5560 | 92.41% | SOURCE | 10 | 100% |
| AGENT | 3702 | 92.31% | LOCATION | 302 | 96.36% |
| MANNER | 697 | 86.68% | AGENT | 2009 | 92.95% |
| DESTINATION | 486 | 85.42% | DESTINATION | 403 | 92.12% |
| SOURCE | 250 | 84.71% | MANNER | 344 | 90.84% |
| LOCATION | 425 | 84.25% | THEME | 2485 | 89.67% |
| RATE | 176 | 76.44% | PURPOSE | 53 | 89.47% |
| CONDITION | 227 | 67.26% | TEMPORAL | 41 | 72.00% |
| PURPOSE | 85 | 41.95% | CONDITION | 21 | 58.82% |
| DESCRIPTIVE-THEME | 259 | 39.72% | DESCRIPTIVE-THEME | 234 | 57.46% |
| DESCRIPTIVE-AGENT | 100 | 34.32% | DESCRIPTIVE-AGENT | 90 | 36.36% |
| TEMPORAL | 33 | 25.00% | INSTRUMENT | 9 | 0.00% |
| INSTRUMENT | 9 | 16.5% | RATE | 5 | 0.00% |
Separate statistics are shown for the E. coli and human parts of the corpus. Within each part, semantic roles are ordered according to their agreement rates. The columns headed N show the total number of assignments for each role. Assignments by each pair of annotators are counted separately and added to the total.
Individual biological concept category agreement statistics
| Category (E. coli) | N | Agreement | Category (human) | N | Agreement |
| --- | --- | --- | --- | --- | --- |
| Gene | 2010 | 90.55% | Gene | 432 | 89.35% |
| Protein | 771 | 51.88% | Protein | 419 | 61.58% |
| Promoter | 644 | 95.34% | Transcription_Factor | 301 | 51.83% |
| Repressor | 436 | 68.35% | DNA | 298 | 63.08% |
| Operon | 434 | 85.25% | Promoter | 154 | 92.21% |
| Gene_Expression | 407 | 78.62% | Transcription_Binding_Site | 140 | 50.00% |
| Regulator | 349 | 25.21% | Transcription | 118 | 100.00% |
| Activator | 345 | 42.32% | Cells | 111 | 95.49% |
| Locus | 192 | 72.91% | Regulation | 66 | 96.97% |
| Enzyme | 176 | 89.77% | Activator | 65 | 9.23% |
Separate statistics are shown for the E. coli and human parts of the corpus. Within each part, categories are ordered according to their total number of assignments, as shown in the columns headed with N. Assignments by each pair of annotators are counted separately and added to the total.
Most common concept categories confused with Protein
| Category (E. coli) | N | Category (human) | N |
| --- | --- | --- | --- |
| Regulator | 108 | Transcription_Factor | 74 |
| Activator | 87 | Activator | 27 |
| Repressor | 59 | Regulator | 16 |
| Transcription_Factor | 29 | Gene | 9 |
| Gene | 27 | Sub_Unit | 8 |
Separate statistics are shown for the E. coli and human parts of the corpus. Within each part, the categories are ordered according to the number of times that the confusion occurred, as indicated in the columns headed N.
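A confusion tally of this kind can be sketched as follows, assuming that each input item is the pair of categories that two annotators assigned to the same concept. The input format and the toy pairs are hypothetical, and the counts they produce are unrelated to the figures in the table.

```python
from collections import Counter


def confusions_with(target, paired_assignments):
    """Count categories that the other annotator chose instead of `target`.

    `paired_assignments` holds (category_a, category_b) tuples, one per
    concept annotated by a pair of annotators (an assumed format).
    """
    counts = Counter()
    for cat_a, cat_b in paired_assignments:
        if cat_a == cat_b:
            continue  # agreement, not a confusion
        if cat_a == target:
            counts[cat_b] += 1
        elif cat_b == target:
            counts[cat_a] += 1
    return counts.most_common()


# Hypothetical toy assignments from one pair of annotators.
toy_pairs = [
    ("Protein", "Regulator"),
    ("Activator", "Protein"),
    ("Protein", "Protein"),
    ("Regulator", "Protein"),
    ("Gene", "Operon"),
]
for category, n in confusions_with("Protein", toy_pairs):
    print(category, n)
```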