K Bretonnel Cohen, Martha Palmer, Lawrence Hunter.
Abstract
BACKGROUND: This paper presents data on alternations in the argument structure of common domain-specific verbs and their associated verbal nominalizations in the PennBioIE corpus. Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs, e.g. the different forms of "stimulate" in "FSH stimulates follicular development" and "follicular development is stimulated by FSH". The data is used to assess the implications of alternations for biomedical text mining systems and to test the fit of the sublanguage model to biomedical texts. METHODOLOGY/PRINCIPAL
Year: 2008 PMID: 18779866 PMCID: PMC2527518 DOI: 10.1371/journal.pone.0003158
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
A sample predicate for which the three prepositions of, in, and by are insufficient for capturing all arguments.
| Argument | Description | Associated prepositions |
| Arg0 | Causer of increase | |
| Arg1 | Thing increasing | |
| Arg2 | Amount increased by | |
| Arg3 | Start point | |
| Arg4 | End point | |
Our representation of this predicate is the same as PropBank's.
The tag set for verbs.
| Tag | Voice | Transitivity | Category | Example |
| AT | active | transitive | verbal | |
| AI | active | intransitive | verbal | |
| At | active | transitive | adjectival | |
| Ai | active | intransitive | adjectival | |
| PT | passive | transitive | verbal | |
| PI | passive | intransitive | verbal | Not attested |
| Pt | passive | transitive | adjectival | |
| Pi | passive | intransitive | adjectival | Not attested |
| N | Noun | | | |
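The tag set above is compositional, which a short sketch can make explicit. The decoder below is an illustration only, not part of the original annotation tooling: it assumes that the first letter encodes voice, the second letter encodes transitivity, and the case of the second letter separates verbal (upper case) from adjectival (lower case) uses. The names `decode_tag`, `VOICE`, and `TRANSITIVITY` are hypothetical.

```python
# Hypothetical decoder for the verb tag set listed above.
# Assumptions (inferred from the table, not stated by the authors):
#   first letter  -> voice (A = active, P = passive)
#   second letter -> transitivity (T = transitive, I = intransitive)
#   case of the second letter -> verbal (upper) vs adjectival (lower)
VOICE = {"A": "active", "P": "passive"}
TRANSITIVITY = {"T": "transitive", "I": "intransitive"}


def decode_tag(tag: str) -> tuple:
    """Return the features encoded by a tag such as 'AT', 'Pt', or 'N'."""
    if tag == "N":
        return ("noun",)
    if len(tag) != 2 or tag[0] not in VOICE or tag[1].upper() not in TRANSITIVITY:
        raise ValueError(f"unrecognized tag: {tag!r}")
    category = "verbal" if tag[1].isupper() else "adjectival"
    return (VOICE[tag[0]], TRANSITIVITY[tag[1].upper()], category)


for tag in ["AT", "Ai", "PT", "Pt", "N"]:
    print(tag, decode_tag(tag))
```

The sketch covers only the tags as listed in this table; variant spellings used elsewhere would need their own handling.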
Figure 1. A screenshot showing the representation of a predicate and the annotation of a token of that predicate in text.
The top pane shows the textual data. The slots in the bottom right pane indicate the arguments of the predicate activate: an Arg0, the activator, and an Arg1, the activatee. The subpanes corresponding to those slots show the text in which the arguments are instantiated ("by either cromakalim or NS-1619" and "K+ channel") and indicate the syntactic position of each (post-predicate and pre-predicate, respectively). The bottom left pane lists all segments of text that have been annotated. Since the predicate itself is highlighted in the bottom left pane, its argument structure and arguments are displayed in the bottom right pane.
Counts of annotated tokens.
| Nominalization | BioIE (both) | BioIE-P450 | BioIE-Onc |
| Inhibition | 100 | 50 | 50 |
| Induction | 100 | 50 | 50 |
| Increase | 100 | 50 | 50 |
| Expression | 101 | 50 | 51 |
| Association | 91 | 14 | 77 |
| Mediation | 2 | 1 | 1 |
| Containment | 1 | 0 | 1 |
| Occurrence | 51 | 3 | 48 |
| Treatment | 96 | 46 | 50 |
| Activation | 100 | 50 | 50 |
Rows are ordered by frequency of the corresponding verb in the BioIE corpus. The goal was 100 tokens per type.
Most common domain-specific verb lemmas.
| BioIE (both) | BioIE-P450 | BioIE-Onc | GENIA |
| inhibit (637) | inhibit (615) | associate (101) | induce (1322) |
| induce (310) | induce (238) | identify (84) | activate (1122) |
| increase (257) | increase (188) | occur (81) | express (827) |
| express (135) | treat (102) | activate (78) | inhibit (811) |
| associate (133) | decrease (102) | include (73) | demonstrate (734) |
| mediate (130) | catalyze (100) | induce (72) | bind (730) |
| contain (125) | mediate (94) | contain (70) | increase (700) |
| occur (124) | reduce (74) | increase (69) | regulate (659) |
| treat (118) | follow (69) | express (68) | contain (595) |
| activate (116) | stimulate (68) | analyze (62) | require (555) |
Alternations involving BioIE top-10 verbs in the CYP450 section.
| Lemma | bare | -s | -ing | -ed |
| inhibit | AT, AI | AT | AT, at, ai | AT, AX, PT, pt |
| induce | AT | AT | AT, at, ai | AT, PT, pt |
| increase | AT, AI, N | AT, AI, N | AT, AI, ai, X | AT, AI, ai, PT, X, PX, px |
| express | AT | AT | AT, at | AT, PT, pt, PX |
| associate | — | — | — | PT, pt |
| mediate | AT | AT | AT | PT, pt |
| contain | AT | AT | AT, at | AT, PT |
| occur | AI | AI | AI, ai | AI |
| treat | AT | — | AT | PT, pt |
| activate | AT | AT | AT, at | AT, PT, pt |
Dashes indicate that a verbal form did not occur in the corpus. AT = active, transitive, verbal. at = active, transitive, participial modifier. N = nominalization. P = verbal passive. p = adjectival passive. PT = verbal passive, transitive. pt = adjectival passive, transitive.
Passive alternations in the 10 most common verbs.
| Alternation | count |
| Verbal passive (5.1) | 287 |
| Adjectival passive (5.3) | 186 |
| Adjectival perfect participle (5.4) | 0 |
| All passive (5.1+5.3+5.4) | 473 |
| All active | 1,142 |
The top half of the table breaks down the passives by type. The bottom half gives the sum of the passives, and the corresponding number of actives.
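The relation between the two halves of this table can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the table; the variable names are illustrative, and the passive share it prints is derived here rather than reported in the table itself.

```python
# Counts copied from the table of passive alternations above.
verbal_passive = 287
adjectival_passive = 186
adjectival_perfect_participle = 0
all_active = 1142

# The "All passive" row is the sum of the three passive types.
all_passive = verbal_passive + adjectival_passive + adjectival_perfect_participle

# Share of passives among all tokens (passive plus active); a derived figure.
passive_share = all_passive / (all_passive + all_active)

print(all_passive)              # 473
print(round(passive_share, 3))  # 0.293
```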
Incidence of transitives and intransitives for verbs that varied.
| Lemma | Trans. | Intrans. | Couldn't tell |
| inhibit | 539 | 2 | 1 |
| induce | 187 | 1 | 0 |
| increase | 96 | 60 | 5 |
Adjectival alternations among the ten most common verbs.
| Alternation | count |
| Adjectival passive (5.3) | 184 |
| Adjectival perfect participle (5.4) | 0 |
| Adjectival “X” | 2 |
| Present participial adjective (transitive) | 59 |
| Present participial adjective (intransitive) | 49 |
| Present participial adjective (all) | 108 |
| All adjectival | 294 |
| All non-adjectival verbs | 1,321 |
The top half of the table gives the breakdown among adjectival types. The bottom half of the table gives the sums across all types.
Counts of the nominalizations in the BioIE and GENIA corpora.
| Nominalization | BioIE (both) | BioIE-P450 | BioIE-Onc | GENIA |
| Inhibition | 861 | 774 | 87 | 445 |
| Induction | 342 | 273 | 69 | 826 |
| Increase | – | – | – | 324 |
| Expression | 1,306 | 300 | 1,006 | 3,190 |
| Association | 112 | 14 | 98 | 92 |
| Mediation | 2 | 1 | 1 | 0 |
| Containment | 1 | 0 | 1 | 0 |
| Occurrence | 51 | 3 | 48 | 9 |
| Treatment | 690 | 477 | 213 | 455 |
| Activation | 552 | 250 | 302 | 2,403 |
| Total | 3,917 | 2,092 | 1,825 | 7,744 |
Rows are ordered by frequency of the corresponding verb in the BioIE corpus.
Occurrence, the lone single-argument predicate.
| | P450 | Onc | Both |
| Pre-nominal | 1 | 4 | 5 |
| Post-nominal | 2 | 41 | 43 |
| NP-external | 0 | 3 | 3 |
| Absent | 0 | 0 | 0 |
| Total | 3 | 48 | 51 |
Activation, a two-argument predicator (Arg0 and Arg1).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | 3 | 3 | 1 | 32 |
| Arg1 Post | 4 | 6 | 3 | 27 |
| Arg1 Ext | – | 1 | 3 | 3 |
| Arg1 Abs | 1 | 1 | – | 3 |
Data is combined from both parts of the BioIE corpus. 14/16 possible patterns are attested in 91 tokens (9 can't-tell).
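The summary figures in this caption can be recomputed from the cell counts. The sketch below is illustrative only. It assumes that the four data rows correspond to the Arg1 positions Pre, Post, Ext, and Abs in that order, that the columns are the Arg0 positions, and that a dash means zero tokens. The row assignment is an inference, although the resulting totals agree with the per-position totals reported for activation elsewhere in the paper.

```python
POSITIONS = ["Pre", "Post", "Ext", "Abs"]

# Activation, both parts of the BioIE corpus.
# Rows: Arg1 position (assumed order); columns: Arg0 position; dash -> 0.
activation = [
    [3, 3, 1, 32],
    [4, 6, 3, 27],
    [0, 1, 3, 3],
    [1, 1, 0, 3],
]

# Marginal totals per syntactic position for each argument.
arg1_totals = dict(zip(POSITIONS, (sum(row) for row in activation)))
arg0_totals = dict(zip(POSITIONS, (sum(col) for col in zip(*activation))))

# Attested patterns are the non-empty cells; tokens are the grand total.
attested = sum(1 for row in activation for cell in row if cell > 0)
tokens = sum(sum(row) for row in activation)

print(arg0_totals)       # {'Pre': 8, 'Post': 11, 'Ext': 7, 'Abs': 65}
print(arg1_totals)       # {'Pre': 39, 'Post': 40, 'Ext': 7, 'Abs': 5}
print(attested, tokens)  # 14 91
```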
Activation, a two-argument predicator (Arg0 and Arg1).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | 3 | 1 | 9 |
| Arg1 Post | 1 | 3 | – | 14 |
| Arg1 Ext | – | 1 | 3 | 3 |
| Arg1 Abs | 1 | 1 | – | 3 |
Data is from the CYP450 section of the corpus. 12/16 possible patterns are attested in 43 tokens (7 can't-tell).
Activation, a two-argument predicator (Arg0 and Arg1).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | 3 | – | – | 23 |
| Arg1 Post | 3 | 3 | 3 | 13 |
| Arg1 Ext | – | – | – | – |
| Arg1 Abs | – | – | – | – |
Data is from the Oncology section of the corpus. 6/16 patterns are attested in 48 tokens (2 can't-tell).
Alternations for the four two-argument predicates.
| Predicate | Alternations | Tokens | X | attested/possible | type/token |
| Expression | 6 | 97 | 4 | 0.375 | 0.062 |
| Mediation | 2 | 2 | 2 | 0.124 | 1.0 |
| Containment | 1 | 1 | 0 | .063 | 1.0 |
| Activation | 14 | 91 | 9 | 0.875 | 0.154 |
The maximum number possible is 4^2 (16). Data is given for the full BioIE corpus. The column labelled tokens shows the number of tokens for which no argument was labelled “can't tell.” The column labelled X shows the number of tokens with at least one argument labelled “can't tell.”
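The two ratio columns can be reproduced from the counts. The sketch below assumes four syntactic positions per argument (pre-nominal, post-nominal, NP-external, absent), so that an n-argument predicate has 4^n possible patterns, and it assumes that type/token is the number of attested alternations divided by the number of tokens. Both readings are inferred from the reported figures; the function name `pattern_stats` is hypothetical.

```python
def pattern_stats(n_args, attested, tokens, n_positions=4):
    """Return (possible patterns, attested/possible, type/token).

    Assumes each argument independently takes one of n_positions
    syntactic positions, giving n_positions ** n_args patterns.
    """
    possible = n_positions ** n_args
    return possible, attested / possible, attested / tokens


# Two-argument predicates with 6 alternations in 97 tokens (expression)
# and 14 alternations in 91 tokens (activation).
for name, attested, tokens in [("expression", 6, 97), ("activation", 14, 91)]:
    possible, coverage, type_token = pattern_stats(2, attested, tokens)
    print(name, possible, round(coverage, 3), round(type_token, 3))
# expression 16 0.375 0.062
# activation 16 0.875 0.154
```

The same function gives 64, 256, and 1,024 possible patterns for three-, four-, and five-argument predicates.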
Alternations for the five three-argument predicates.
| Predicate | Alternations | Tokens | X | attested/possible | type/token |
| Inhibition | 24 | 95 | 5 | 0.375 | 0.253 |
| Induction | 19 | 92 | 8 | 0.297 | 0.21 |
| Association.01 | 5 | 8 | 0 | 0.078 | 0.625 |
| Association.02 | 10 | 78 | 1 | 0.156 | 0.128 |
| Treatment.04 | 9 | 58 | 7 | 0.141 | 0.155 |
The maximum number possible is 4^3 (64). Data is given for the full BioIE corpus. The column labelled tokens shows the number of tokens for which no argument was labelled “can't tell.” The column labelled X shows the number of tokens with at least one argument labelled “can't tell.”
Alternations for the lone 4-argument predicate (treatment.03) and the lone 5-argument predicate (increase).
| Predicate | Alternations | Tokens | X | attested/possible | type/token |
| Treatment.03 | 16 | 29 | 2 | .063 | 0.552 |
| Increase | 21 | 83 | 17 | .021 | 0.253 |
The maximum number possible is 4^4 (256) and 4^5 (1,024), respectively. Data is given for the full BioIE corpus. The column labelled tokens shows the number of tokens for which no argument was labelled “can't tell.” The column labelled X shows the number of tokens with at least one argument labelled “can't tell.”
Inter-annotator agreement for the two most difficult nominalizations.
| | IAA | TP | FP | FN |
| Both predicates | 100% | 28 | 0 | 0 |
| All arguments for both predicators | 87.5% | 49 | 7 | 7 |
| Positionally defined types for both predicators | 95.8% | 34 | 3 | 0 |
| Other types for both predicators | 68.4% | 13 | 5 | 7 |
| Arg0 for both | 74.1% | 20 | 7 | 7 |
| Arg1 for both | 96.4% | 27 | 1 | 1 |
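The agreement percentages in this table are consistent with a balanced F-measure computed from the TP, FP, and FN counts, 2TP / (2TP + FP + FN). The table does not state the formula, so this is an inference from the numbers rather than the authors' stated method. The sketch below recomputes each row under that assumption; the function name `f_measure` is illustrative.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# (TP, FP, FN) triples copied from the inter-annotator agreement table.
rows = {
    "Both predicates": (28, 0, 0),
    "All arguments for both predicators": (49, 7, 7),
    "Positionally defined types for both predicators": (34, 3, 0),
    "Other types for both predicators": (13, 5, 7),
    "Arg0 for both": (20, 7, 7),
    "Arg1 for both": (27, 1, 1),
}

iaa = {label: round(100 * f_measure(*counts), 1) for label, counts in rows.items()}
for label, value in iaa.items():
    print(f"{label}: {value}%")
```

Under this assumption the recomputed values match the reported column: 100.0, 87.5, 95.8, 68.4, 74.1, and 96.4.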
Confusion matrix for both nominalizations and both arguments.
| Pre | Post | Ext | Abs | X | |
|
| .321 | .018 | |||
|
| .018 | .286 | .018 | ||
|
| .089 | .089 | |||
|
| .143 | ||||
|
| .018 |
Confusion between NP-external and absent arguments was the largest source of disagreements. Fractions are the count for the cell divided by the number of slots (56). They sum to 1.
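Since each fraction is a cell count divided by 56, the underlying counts can be recovered by multiplying back and rounding. The sketch below lists the nine non-empty cells in reading order without assigning them to particular row and column positions, which the flattened table does not preserve. It is a consistency check under that limitation, not a reconstruction of the matrix.

```python
SLOTS = 56

# Non-empty cells of the confusion matrix, in reading order.
fractions = [0.321, 0.018, 0.018, 0.286, 0.018, 0.089, 0.089, 0.143, 0.018]

# Recover integer counts by undoing the division and rounding.
counts = [round(f * SLOTS) for f in fractions]

print(counts)                    # [18, 1, 1, 16, 1, 5, 5, 8, 1]
print(sum(counts))               # 56
print(round(sum(fractions), 3))  # 1.0
```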
Confusion matrix for Arg0 of both nominalizations. The denominator is 28.
| Pre | Post | Ext | Abs. | X | |
|
| .036 | .036 | |||
|
| .25 | .036 | |||
|
| .143 | .179 | |||
|
| .286 | ||||
|
| .036 |
Confusion matrix for Arg1 of both nominalizations.
| Pre | Post | Ext | Abs. | X | |
|
| .61 | ||||
|
| .036 | .32 | |||
|
| .036 | ||||
|
| |||||
|
|
The denominator is 28.
Confusion matrix for Arg0 of activation.
| Pre | Post | Ext | Abs | X | |
|
| .071 | .071 | |||
|
| .286 | ||||
|
| .143 | .143 | |||
|
| .286 | ||||
|
|
The denominator is 14.
Confusion matrix for Arg1 of activation.
| Pre | Post | Ext | Abs | X | |
|
| .643 | ||||
|
| .286 | ||||
|
| .071 | ||||
|
| |||||
|
|
The denominator is 14.
Confusion matrix for Arg0 of expression.
| Pre | Post | Ext | Abs. | X | |
|
| |||||
|
| .214 | .071 | |||
|
| .143 | .214 | |||
|
| .286 | ||||
|
| .071 |
Confusion matrix for Arg1 of expression.
| Pre | Post | Ext | Abs. | X | |
|
| .61 | ||||
|
| .036 | .32 | |||
|
| .036 | ||||
|
| |||||
|
|
The most frequent syntactic positions for each semantic role (cf. Palmer et al.'s Table 7, 2005:91).
| Semantic role | Total | Most common syntactic positions |
| Arg0 | 570 | Absent (378), NP-external (82), Post-nominal (64), Pre-nominal (46) |
| Arg1 | 612 | Post-nominal (341), Pre-nominal (124), Absent (79), NP-external (68) |
See Tables 43 and 44 for the raw data.
The most frequent semantic roles for each syntactic position (cf. Palmer et al.'s Table 6, 2005:90).
| Position | Most frequent role | Second most frequent role | Total |
| Pre-nominal | Arg1 (124) | Arg0 (51) | 175 |
| Post-nominal | Arg1 (341) | Arg0 (107) | 448 |
| NP-external | Arg0 (85) | Arg1 (68) | 153 |
| Absent | Arg0 (378) | Arg1 (79) | 461 |
Only Args 0 and 1 are indicated. Association.02,03 are omitted. See Tables 43 and 44 for the raw data.
Argumentless nominalizations.
| No arguments at all | 20 |
| No Arg0 or Arg1 | 71 |
Occurrence, a 1-argument predicator (Arg1).
| | P450 | Onc | Both |
| Pre-nominal | 1 | 4 | 5 |
| Post-nominal | 2 | 41 | 43 |
| NP-external | 0 | 3 | 3 |
| Absent | 0 | 0 | 0 |
Data is combined from both parts of the BioIE corpus. 3/4 possible patterns are attested in 51 tokens (0 can't-tell).
Inhibition, a 3-argument predicator (Arg0, Arg1, and Arg2; only Args 0 and 1 are shown).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | 2 | 8 | 4 |
| Arg1 Post | 1 | 15 | 16 | 26 |
| Arg1 Ext | 1 | 3 | 5 | 1 |
| Arg1 Abs | 3 | 2 | 2 | 6 |
Data is combined from both parts of the BioIE corpus. 24/64 possible patterns are attested in 95 tokens (5 can't-tell).
Induction, a 3-argument predicator (Arg0, Arg1, and Arg2; only Args 0 and 1 are shown).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | 1 | 3 | – | 8 |
| Arg1 Post | 11 | 12 | 3 | 33 |
| Arg1 Ext | 3 | 2 | 2 | 3 |
| Arg1 Abs | 2 | 1 | – | 8 |
Data is combined from both parts of the BioIE corpus. 19/64 possible patterns are attested in 92 tokens (8 can't-tell). For comparability with other tables in the paper, tokens where any arg is X are omitted from this table, but there are 3 additional tokens where the X is in Arg2 that could be added to this table: 1 Arg0-Ext/Arg1-Ext, and 1 Arg0-Ext/Arg1-Abs.
Increase, a 5-argument predicator (Arg0, Arg1, Arg2, Arg3, and Arg4; only Args 0 and 1 are shown).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | – | – | 7 |
| Arg1 Post | 6 | 10 | 27 | 30 |
| Arg1 Ext | – | 1 | 2 | 3 |
| Arg1 Abs | – | – | – | 2 |
Data is combined from both parts of the BioIE corpus. 21/1,024 (4^5) possible patterns are attested in 83 tokens (17 can't-tell).
Expression, a 2-argument predicator (Arg0 and Arg1).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | 1 | 1 | 42 |
| Arg1 Post | – | – | 3 | 44 |
| Arg1 Ext | – | – | – | 6 |
| Arg1 Abs | – | – | – | – |
Data is combined from both parts of the BioIE corpus. 6/16 possible patterns are attested in 97 tokens (4 can't-tell).
Association.01, a 3-argument predicator (Arg0, Arg1, and Arg2; only Args 0 and 1 are shown).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | – | 1 | – |
| Arg1 Post | – | – | 2 | – |
| Arg1 Ext | – | – | 3 | 2 |
| Arg1 Abs | – | – | – | – |
Data is combined from both parts of the BioIE corpus. 5/64 possible patterns are attested in 8 tokens (0 can't-tell).
Association.02 (reciprocal); only Args 0 and 1 for non-plural associands are shown.
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | – | – | 1 |
| Arg1 Post | 8 | 39 | 13 | 4 |
| Arg1 Ext | 1 | – | 8 | – |
| Arg1 Abs | – | – | – | 48 |
Data is combined from both parts of the BioIE corpus. 10 patterns are attested in 78 tokens (1 can't-tell); the number of possible patterns is greater than 64, but its exact value depends on how reciprocal tokens are handled.
Mediation, a 2-argument predicator.
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | – | – | – |
| Arg1 Post | 1 | 1 | – | – |
| Arg1 Ext | – | – | – | – |
| Arg1 Abs | – | – | – | – |
Data is combined from both parts of the BioIE corpus. 2/16 possible patterns are attested in 2 tokens (2 can't-tell).
Containment, a 2-argument predicator.
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | – | – | 1 |
| Arg1 Post | – | – | – | – |
| Arg1 Ext | – | – | – | – |
| Arg1 Abs | – | – | – | – |
Data is combined from both parts of the BioIE corpus. 1/16 possible patterns are attested in 1 token (0 can't-tell).
Treatment.03 (medical), a 4-argument predicator (only Args 0 and 1 are shown).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | – | – | 1 |
| Arg1 Post | – | – | – | 3 |
| Arg1 Ext | – | – | – | 3 |
| Arg1 Abs | – | – | – | 22 |
Data is combined from both parts of the BioIE corpus. 16/256 (4^4) possible patterns are attested in 29 tokens (2 can't-tell).
Treatment.04 (affect a change in something by applying a substance), a 3-argument predicator (only Args 0 and 1 are shown).
| | Arg0 Pre | Arg0 Post | Arg0 Ext | Arg0 Abs |
| Arg1 Pre | – | – | – | – |
| Arg1 Post | – | – | – | 14 |
| Arg1 Ext | – | – | – | 18 |
| Arg1 Abs | – | – | – | 26 |
Data is combined from both parts of the BioIE corpus. 9/64 possible patterns are attested in 58 tokens (7 can't-tell).
Most frequent syntactic positions for Arg0 (association.02 (reciprocal) and association.03 omitted).
| Nominalization | Arg0 absent | Arg0 external | Arg0 post-nom | Arg0 pre-nom |
| Occurrence | – | – | – | – |
| Activation | 65 | 7 | 11 | 8 |
| Inhibition | 37 | 31 | 22 | 5 |
| Induction | 52 | 5 | 18 | 17 |
| Increase | 42 | 29 | 11 | 6 |
| Expression | 92 | 4 | 1 | 0 |
| Association.01 | 2 | 6 | 0 | 0 |
| Mediation | 0 | – | 1 | 1 |
| Containment | 1 | 0 | 0 | 0 |
| Treatment.03 | 29 | 0 | 0 | 0 |
| Treatment.04 | 58 | 0 | 0 | 0 |
| Total | 378 | 82 | 64 | 46 |
Note that occurrence is a single-argument predicate and has no Arg0.
Most frequent syntactic positions for Arg1.
| Nominalization | Arg1 post-nom | Arg1 pre-nom | Arg1 absent | Arg1 external |
| Occurrence | 43 | 5 | 0 | 3 |
| Activation | 40 | 39 | 5 | 7 |
| Inhibition | 58 | 14 | 13 | 10 |
| Induction | 59 | 12 | 11 | 10 |
| Increase | 73 | 7 | 2 | 6 |
| Expression | 47 | 44 | 0 | 6 |
| Association.01 | 2 | 1 | 0 | 5 |
| Mediation | 2 | 0 | 0 | 0 |
| Containment | 0 | 1 | 0 | 0 |
| Treatment.03 | 3 | 1 | 22 | 3 |
| Treatment.04 | 14 | 0 | 26 | 18 |
| Total | 341 | 124 | 83 | 68 |
Levin classes of the most common verbs.
| Lemma | Class | |
| inhibit | — | — |
| induce | — | — |
| increase | 45.4 | Verbs of Change of State: Other alternating verbs of change of state |
|
|
| |
| express | 11.1 | Verbs of Sending and Carrying: |
| 48.1.2 | Reflexive verbs of appearance | |
| associate |
|
|
| mediate | — | — |
| contain | 8.2 | Verbs Requiring Special Diatheses: Obligatorily reflexive object |
|
|
| |
| 54.3 | Measure Verbs: | |
| occur |
|
|
| treat | 8.5 | Verbs Requiring Special Diatheses: Obligatory adverb |
| 29.2 | Verbs with Predicative Complements: | |
Verbs are ordered by frequency. Dashes indicate that the verb does not appear in Levin (1993). Bolding indicates that the Levin class seems to fit the semantics of the verb as used in the CYP450 section of the BioIE corpus.