Richard Tzong-Han Tsai, Wen-Chi Chou, Ying-Shan Su, Yu-Chun Lin, Cheng-Lung Sung, Hong-Jie Dai, Irene Tzu-Hsuan Yeh, Wei Ku, Ting-Yi Sung, Wen-Lian Hsu.
Abstract
BACKGROUND: Bioinformatics tools that automatically process biomedical literature are invaluable for both the design and interpretation of large-scale experiments. Many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have therefore been developed for the biomedical field. A key IE task in this field is the extraction of biomedical relations, such as protein-protein and gene-disease interactions. However, most biomedical relation extraction systems ignore adverbial and prepositional phrases and words that identify location, manner, timing, and condition, all of which are essential for describing biomedical relations. Semantic role labeling (SRL) is an NLP technique that identifies the semantic roles of these words or phrases in sentences and expresses them as predicate-argument structures. We construct a biomedical SRL system called BIOSMILE that uses a maximum entropy (ME) machine-learning model to extract biomedical relations. BIOSMILE is trained on BioProp, our semi-automatically annotated biomedical proposition bank. Currently, we are focusing on 30 biomedical verbs that are frequently used or considered important for describing molecular events.
Year: 2007 PMID: 17764570 PMCID: PMC2072962 DOI: 10.1186/1471-2105-8-325
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A parsing tree annotated with semantic roles.
The thirty selected verbs
| Verb | Among the 30 most frequent verbs in GENIA? | Does usage differ between the newswire and biomedical domains? | # of PASs in BioProp |
| activate | Yes | Yes | 145 |
| affect | No | No | 53 |
| alter | No | No | 27 |
| associate | Yes | No | 81 |
| bind | Yes | Yes | 189 |
| block | No | No | 56 |
| decrease | No | No | 41 |
| differentiate | No | No | 10 |
| encode | Yes | Yes | 75 |
| enhance | Yes | No | 37 |
| express | Yes | Yes | 186 |
| increase | Yes | No | 99 |
| induce | Yes | No | 263 |
| inhibit | Yes | No | 181 |
| interact | No | Yes | 34 |
| mediate | Yes | No | 103 |
| modulate | No | Yes | 22 |
| mutate | No | Yes | 5 |
| phosphorylate | No | Yes | 12 |
| prevent | No | No | 15 |
| promote | No | Yes | 13 |
| reduce | No | No | 38 |
| regulate | Yes | No | 116 |
| repress | No | No | 17 |
| signal | No | No | 7 |
| stimulate | Yes | No | 75 |
| suppress | No | No | 37 |
| transactivate | No | Yes | 21 |
| transform | No | No | 10 |
| trigger | No | No | 14 |
Argument types and their descriptions
| Arg0 | agent |
| Arg1 | direct object/theme/patient |
| Arg2–5 | not fixed |
| ArgM-NEG | negation marker |
| ArgM-LOC | location |
| ArgM-TMP | time |
| ArgM-MNR | manner |
| ArgM-EXT | extent |
| ArgM-ADV | general-purpose |
| ArgM-PNC | purpose |
| ArgM-CAU | cause |
| ArgM-DIR | direction |
| ArgM-DIS | discourse connectives |
| ArgM-MOD | modal verb |
| ArgM-REC | reflexives and reciprocals |
| ArgM-PRD | marks of secondary predication |
Framesets and examples of "modulate" and "regulate"
| modulate (VerbNet) | [Arg1 The chords] | |
| regulate (VerbNet) | The battle focuses on [Arg0 the state's certificate-of-need law], [R-Arg0 which] | |
| modulate (BioProp) | [Arg0 Cytomegalovirus] |
Inter-annotator agreement
| | | P(A) | P(E) | Kappa score |
| Including ArgM | role identification | .97 | .52 | .94 |
| | role classification | .96 | .18 | .95 |
| | combined decision | .96 | .18 | .95 |
| Excluding ArgM | role identification | .97 | .26 | .94 |
| | role classification | .99 | .28 | .98 |
| | combined decision | .99 | .28 | .98 |
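The Kappa scores in the "Including ArgM" rows can be reproduced from the standard chance-corrected agreement formula, kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement and P(E) the agreement expected by chance. The sketch below assumes that formula; the function and variable names are illustrative and do not come from the source.

```python
def kappa(p_a: float, p_e: float) -> float:
    """Chance-corrected agreement: (P(A) - P(E)) / (1 - P(E))."""
    return (p_a - p_e) / (1.0 - p_e)


# (P(A), P(E)) pairs copied from the "Including ArgM" rows of the table.
including_argm = {
    "role identification": (0.97, 0.52),
    "role classification": (0.96, 0.18),
    "combined decision": (0.96, 0.18),
}

# Round to two decimals, the precision used in the table.
kappas = {
    task: round(kappa(p_a, p_e), 2)
    for task, (p_a, p_e) in including_argm.items()
}

for task, value in kappas.items():
    print(f"{task}: kappa = {value:.2f}")
```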
Distribution of argument types in PropBank I
| Arg0 | 897 | 23.96% |
| Arg1 | 1440 | 38.46% |
| Arg2 | 361 | 9.64% |
| Arg3 | 133 | 3.55% |
| ArgM-NEG | 55 | 1.47% |
| ArgM-LOC | 58 | 1.55% |
| ArgM-TMP | 207 | 5.53% |
| ArgM-MNR | 122 | 3.26% |
| ArgM-EXT | 7 | 0.19% |
| ArgM-ADV | 122 | 3.26% |
| ArgM-PNC | 21 | 0.56% |
| ArgM-CAU | 29 | 0.77% |
| ArgM-DIR | 1 | 0.03% |
| ArgM-DIS | 86 | 2.30% |
| ArgM-MOD | 204 | 5.45% |
| ArgM-REC | 1 | 0.03% |
| Total | 3744 | 100.00% |
Distribution of argument types in BioProp
| Arg0 | 1355 | 25.03% |
| Arg1 | 1961 | 36.22% |
| Arg2 | 313 | 5.78% |
| Arg3 | 10 | 0.18% |
| ArgM-NEG | 103 | 1.90% |
| ArgM-LOC | 377 | 6.96% |
| ArgM-TMP | 141 | 2.60% |
| ArgM-MNR | 477 | 8.81% |
| ArgM-EXT | 23 | 0.42% |
| ArgM-ADV | 301 | 5.56% |
| ArgM-PNC | 3 | 0.06% |
| ArgM-CAU | 15 | 0.28% |
| ArgM-DIR | 22 | 0.41% |
| ArgM-DIS | 179 | 3.31% |
| ArgM-MOD | 121 | 2.23% |
| ArgM-REC | 6 | 0.11% |
| ArgM-PRD | 7 | 0.13% |
| Total | 5414 | 100.00% |
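The percentage column in both distribution tables is each argument type's count divided by the corpus total. As a small sketch (names are illustrative), the snippet below recomputes the shares for three adjunct types from the counts above. It shows how much more frequent location and manner arguments are in BioProp than in the PropBank I sample.

```python
PROPBANK_TOTAL = 3744  # total arguments in the PropBank I table
BIOPROP_TOTAL = 5414   # total arguments in the BioProp table

# Counts copied from the two distribution tables.
propbank_counts = {"ArgM-LOC": 58, "ArgM-MNR": 122, "ArgM-TMP": 207}
bioprop_counts = {"ArgM-LOC": 377, "ArgM-MNR": 477, "ArgM-TMP": 141}


def share(count: int, total: int) -> float:
    """Percentage of all arguments accounted for by one argument type."""
    return 100.0 * count / total


for arg_type in propbank_counts:
    pb = share(propbank_counts[arg_type], PROPBANK_TOTAL)
    bp = share(bioprop_counts[arg_type], BIOPROP_TOTAL)
    print(f"{arg_type}: PropBank I {pb:.2f}%  BioProp {bp:.2f}%")
```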
Results of all configurations
| Configuration | Training set | Test set | P (%) | R (%) | F (%) |
| SMILE | PropBank I | BioProp | 74.95 | 54.05 | 62.80 |
| BIOSMILEBaseline | BioProp | BioProp | 87.03 | 81.65 | 84.25 |
| BIOSMILENE | BioProp | BioProp | 87.31 | 81.66 | 84.38 |
| BIOSMILETemplate | BioProp | BioProp | 87.56 | 82.15 | 84.76 |
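Reading the three numeric columns of this table as precision, recall, and F-score, the F values are close to the harmonic mean F = 2PR / (P + R). The sketch below checks that relationship. The tabulated F values differ from the recomputed ones by about 0.01, which would be expected if the reported scores are averages over several runs; that explanation is an assumption, not something this extract states.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the results table.
results = {
    "SMILE": (74.95, 54.05, 62.80),
    "BIOSMILEBaseline": (87.03, 81.65, 84.25),
    "BIOSMILENE": (87.31, 81.66, 84.38),
    "BIOSMILETemplate": (87.56, 82.15, 84.76),
}

for name, (p, r, reported) in results.items():
    print(f"{name}: recomputed F = {f_score(p, r):.2f}, reported F = {reported:.2f}")

# Largest absolute gap between recomputed and reported F.
max_gap = max(abs(f_score(p, r) - f) for p, r, f in results.values())
```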
Comparison of performance on SMILE and BIOSMILEBaseline
| | SMILE | | | | BIOSMILEBaseline | | | | | | |
| Type | P (%) | R (%) | F (%) | SD | P (%) | R (%) | F (%) | SD | ΔF (%) | t | Significant? |
| Arg0 | 85.66 | 63.47 | 72.86 | 2.66 | 92.33 | 90.52 | 91.41 | 1.44 | 18.55 | 33.59 | Y |
| Arg1 | 82.10 | 75.02 | 78.39 | 1.96 | 88.86 | 85.71 | 87.25 | 1.42 | 8.86 | 20.05 | Y |
| Arg2 | 39.58 | 30.69 | 34.35 | 5.73 | 86.46 | 81.26 | 83.68 | 3.93 | 49.33 | 38.89 | Y |
| ArgM-ADV | 38.59 | 22.52 | 27.94 | 7.96 | 64.14 | 51.20 | 56.60 | 5.77 | 28.66 | 15.97 | Y |
| ArgM-DIS | 72.58 | 52.12 | 59.92 | 8.62 | 83.74 | 74.91 | 78.83 | 5.39 | 18.91 | 10.19 | Y |
| ArgM-LOC | 62.17 | 1.98 | 3.79 | 3.60 | 76.03 | 77.12 | 76.48 | 3.67 | 72.69 | 77.45 | Y |
| ArgM-MNR | 45.29 | 18.61 | 25.95 | 6.99 | 83.30 | 81.02 | 82.04 | 2.74 | 56.09 | 40.92 | Y |
| ArgM-MOD | 99.25 | 87.48 | 92.84 | 3.66 | 97.22 | 94.67 | 95.82 | 2.36 | 2.98 | 3.75 | Y |
| ArgM-NEG | 99.37 | 76.77 | 86.24 | 6.66 | 97.70 | 94.98 | 96.17 | 2.80 | 9.93 | 7.53 | Y |
| ArgM-TMP | 71.60 | 57.33 | 62.98 | 9.88 | 81.48 | 61.65 | 69.67 | 7.25 | 6.69 | 2.99 | Y |
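The last three columns of this table appear to be the F-score difference, a test statistic, and a significance flag. Reading the fourth column of each system as the standard deviation of F, the statistic is consistent with a two-sample t value, t = ΔF / sqrt((s1² + s2²) / n). The sample size n = 30 used below is an assumption: it is not stated in this extract and was chosen only because it reproduces the tabulated values for the rows checked.

```python
import math


def t_statistic(f1: float, sd1: float, f2: float, sd2: float, n: int) -> float:
    """Two-sample t statistic for equal sample sizes n (assumed, not from the source)."""
    return (f2 - f1) / math.sqrt((sd1 ** 2 + sd2 ** 2) / n)


N_RUNS = 30  # assumed number of runs per system

# (SMILE F, SMILE SD, BIOSMILEBaseline F, BIOSMILEBaseline SD) from the table.
rows = {
    "Arg0": (72.86, 2.66, 91.41, 1.44),
    "Arg1": (78.39, 1.96, 87.25, 1.42),
}

t_values = {
    arg: t_statistic(f1, sd1, f2, sd2, N_RUNS)
    for arg, (f1, sd1, f2, sd2) in rows.items()
}

for arg, t in t_values.items():
    print(f"{arg}: t = {t:.2f}")
```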
Comparison of performance on BIOSMILEBaseline and BIOSMILENE
| | BIOSMILEBaseline | | | | BIOSMILENE | | | | | | |
| Type | P (%) | R (%) | F (%) | SD | P (%) | R (%) | F (%) | SD | ΔF (%) | t | Significant? |
| Arg0 | 92.33 | 90.52 | 91.41 | 1.44 | 92.29 | 90.46 | 91.35 | 1.53 | -0.05 | -0.14 | N |
| Arg1 | 88.86 | 85.71 | 87.25 | 1.42 | 89.32 | 86.07 | 87.66 | 1.31 | 0.41 | 1.18 | N |
| Arg2 | 86.46 | 81.26 | 83.68 | 3.93 | 86.78 | 81.07 | 83.73 | 4.39 | 0.05 | 0.05 | N |
| ArgM-ADV | 64.14 | 51.20 | 56.60 | 5.77 | 64.73 | 50.90 | 56.61 | 6.06 | 0.01 | 0.01 | N |
| ArgM-DIS | 83.74 | 74.91 | 78.83 | 5.39 | 84.14 | 74.71 | 78.81 | 5.66 | -0.02 | -0.01 | N |
| ArgM-LOC | 76.03 | 77.12 | 76.48 | 3.67 | 76.54 | 77.06 | 76.71 | 3.74 | 0.23 | 0.25 | N |
| ArgM-MNR | 83.30 | 81.02 | 82.04 | 2.74 | 83.05 | 81.20 | 82.02 | 2.79 | -0.02 | -0.03 | N |
| ArgM-MOD | 97.22 | 94.67 | 95.82 | 2.36 | 97.31 | 94.47 | 95.76 | 2.68 | -0.05 | -0.08 | N |
| ArgM-NEG | 97.70 | 94.98 | 96.17 | 2.80 | 97.45 | 94.97 | 96.03 | 2.91 | -0.14 | -0.19 | N |
| ArgM-TMP | 81.48 | 61.65 | 69.67 | 7.25 | 81.80 | 61.33 | 69.62 | 7.31 | -0.05 | -0.03 | N |
Comparison of performance on BIOSMILEBaseline and BIOSMILETemplate
| | BIOSMILEBaseline | | | | BIOSMILETemplate | | | | | | |
| Type | P (%) | R (%) | F (%) | SD | P (%) | R (%) | F (%) | SD | ΔF (%) | t | Significant? |
| Arg0 | 92.33 | 90.52 | 91.41 | 1.44 | 92.35 | 90.48 | 91.40 | 1.52 | -0.01 | -0.02 | N |
| Arg1 | 88.86 | 85.71 | 87.25 | 1.42 | 88.83 | 85.75 | 87.25 | 1.39 | 0.00 | 0.01 | N |
| Arg2 | 86.46 | 81.26 | 83.68 | 3.93 | 86.45 | 81.63 | 83.87 | 3.93 | 0.19 | 0.19 | N |
| ArgM-ADV | 64.14 | 51.20 | 56.60 | 5.77 | 66.96 | 54.77 | 59.93 | 5.83 | 3.33 | 2.22 | Y |
| ArgM-DIS | 83.74 | 74.91 | 78.83 | 5.39 | 84.14 | 74.92 | 78.99 | 5.39 | 0.16 | 0.12 | N |
| ArgM-LOC | 76.03 | 77.12 | 76.48 | 3.67 | 79.65 | 78.07 | 78.75 | 3.21 | 2.27 | 2.55 | Y |
| ArgM-MNR | 83.30 | 81.02 | 82.04 | 2.74 | 84.15 | 83.02 | 83.49 | 2.69 | 1.44 | 2.06 | Y |
| ArgM-MOD | 97.22 | 94.67 | 95.82 | 2.36 | 97.55 | 94.67 | 96.00 | 2.39 | 0.18 | 0.29 | N |
| ArgM-NEG | 97.70 | 94.98 | 96.17 | 2.80 | 97.70 | 94.98 | 96.17 | 2.80 | 0.00 | 0.00 | N |
| ArgM-TMP | 81.48 | 61.65 | 69.67 | 7.25 | 83.90 | 63.33 | 71.75 | 6.32 | 2.08 | 1.18 | N |
Figure 3. Performance improvement of template features overall and on several adjunct argument types.
Comparison of performance difference on verbs that have different framesets and the same framesets in Experiment 1
| Different frame set | 28.12% |
| The same set | 18.42% |
Distribution of NEs in the main and NULL arguments
| Compound | 97 | 7 | 0 | 2361 |
| Space | 46 | 48 | 0 | 5371 |
| Protein | 167 | 169 | 13 | 10651 |
| Other | 25 | 260 | 0 | 4860 |
| Nucleotide | 18 | 36 | 1 | 3753 |
Template feature statistics for the five argument types
| Argument Type | Baseline F-score (%) | Template F-score (%) | ΔF (%) | # of templates | # of instances | Template Density |
| ArgM-ADV | 56.60 | 59.93 | 3.33 | 88 | 301 | 0.292359 |
| ArgM-DIS | 78.83 | 78.99 | 0.16 | 2 | 22 | 0.090909 |
| ArgM-LOC | 76.48 | 78.75 | 2.27 | 274 | 377 | 0.726790 |
| ArgM-MNR | 82.04 | 83.49 | 1.44 | 72 | 477 | 0.150943 |
| ArgM-TMP | 69.67 | 71.75 | 2.08 | 57 | 141 | 0.404255 |
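The template density column equals the number of templates divided by the number of instances for each argument type. The sketch below recomputes it from the counts in the table; the variable names are illustrative.

```python
# (number of templates, number of instances) copied from the table.
template_stats = {
    "ArgM-ADV": (88, 301),
    "ArgM-DIS": (2, 22),
    "ArgM-LOC": (274, 377),
    "ArgM-MNR": (72, 477),
    "ArgM-TMP": (57, 141),
}


def template_density(n_templates: int, n_instances: int) -> float:
    """Templates per instance for one argument type."""
    return n_templates / n_instances


densities = {
    arg: round(template_density(t, i), 6)
    for arg, (t, i) in template_stats.items()
}

for arg, d in densities.items():
    print(f"{arg}: {d:.6f}")
```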
Figure 4. Relationship between ΔF and template density.
Figure 5. Relationship between ΔF and template density after removing ArgM-ADV.
An example of using an ArgM-TMP template
| NAC | (PTN*) | NN | - | (Arg0*) | (Arg0*) |
| Not | * | RB | - | * | * |
| Only | * | RB | - | * | * |
| blocks | * | VBZ | - | * | * |
| The | * | DT | - | * | * |
| Effect | * | NN | - | * | * |
| Of | * | IN | - | * | * |
| TPCK | (OOC*) | NN | - | * | * |
| But | * | CC | - | * | * |
| enhances | * | VBZ | enhance | (V*) | (V*) |
| mitogenesis | * | NN | - | (Arg1* | (Arg1* |
| And | * | CC | - | * | * |
| cytokine | (OTR(PTN*) | NN | - | * | * |
| production | *) | NN | - | *) | *) |
| ( | * | -LRB- | - | (ArgM-EXT* | (ArgM-EXT* |
| > | * | JJR | - | * | * |
| 2.5-fold | * | RB | - | * | * |
| In | * | IN | - | * | * |
| some | * | DT | - | * | * |
| cases | * | NNS | - | * | * |
| ) | * | -RRB- | - | *) | *) |
| upon | * | IN | - | * | (ArgM-TMP* |
| activation | * | NN | - | * | * |
| of | * | IN | - | * | * |
| unsuppressed | (SRC* | JJ | - | * | * |
| T | (SRC* | NN | - | * | * |
| cells | *)) | NNS | - | * | *) |
| * | - | * | * |
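The last two columns of this example use a bracket notation in which "(Label*" opens an argument span, "*" continues the current state, and "*)" closes the most recently opened span. This resembles the column format used in the CoNLL SRL shared tasks, although that link is an assumption here. The sketch below converts one such column into labeled token spans; the function name and the shortened token list are illustrative.

```python
import re


def spans_from_tags(tags):
    """Convert bracket tags into (label, start, end) spans with inclusive token indices."""
    spans, stack = [], []
    for i, tag in enumerate(tags):
        # Each "(LABEL" opens a span that starts at token i.
        for label in re.findall(r"\(([^()*]+)", tag):
            stack.append((label, i))
        # Each ")" closes the most recently opened span at token i.
        for _ in range(tag.count(")")):
            label, start = stack.pop()
            spans.append((label, start, i))
    return spans


# A fragment of the example: the predicate and its Arg1.
tokens = ["enhances", "mitogenesis", "and", "cytokine", "production"]
tags = ["(V*)", "(Arg1*", "*", "*", "*)"]

spans = spans_from_tags(tags)
for label, start, end in spans:
    print(label, " ".join(tokens[start:end + 1]))
```

Because the spans come out as index ranges, the two label columns can be compared span by span, for example to locate the ArgM-TMP span that appears only in the final column.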
The features used in the baseline argument classification model
Five NE categories in GENIA ontology
| Protein | Proteins include protein groups, families, molecules, complexes, and substructures. | PTN |
| Nucleotide | A nucleic acid molecule or the compounds that consist of nucleic acids. | NUC |
| Other organic compounds | Organic compounds excluding proteins and nucleotides. | OOC |
| Source | Sources are biological locations where substances are found and their reactions take place. | SRC |
| Others | The terms that are not categorized as sources or substances can be marked. | OTR |
Figure 6. An aligned argument pair.