Haibin Liu, Lawrence Hunter, Vlado Kešelj, Karin Verspoor.
Abstract
The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from the literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Such knowledge is extracted by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. By introducing error tolerance into the graph matching process, our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts while keeping extraction precision high. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. This performance is comparable to that of previously reported systems on these tasks and thus demonstrates the generalizability of the proposed approach.
Year: 2013 PMID: 23613763 PMCID: PMC3629260 DOI: 10.1371/journal.pone.0060954
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Dependency Graph of “Methylation is known to regulate expression of c-abl.”
Figure 2. General Architecture of ASM-based Event Extraction.
Figure 3. Event Rule Induction Example.
Event rule representation.
| Rule ID | Type | Trigger | Theme | Cause | Graph Representation |
| E1a | Pos. reg. | lead-20/VBP | Phosphorylation: phosphorylation-23/NN | Binding: ligation-6/NN | nsubj(lead-20/VBP, ligation-6/NN); prep_to(lead-20/VBP, phosphorylation-23/NN) |
| E1b | Pos. reg. | lead-20/VBP | Phosphorylation: phosphorylation-23/NN | Binding: ligation-6/NN | rcmod(ligation-6/NN, lead-20/VBP); prep_to(lead-20/VBP, phosphorylation-23/NN) |
| E1c | Pos. reg. | lead-20/VBP | Phosphorylation: phosphorylation-23/NN | | prep_to(lead-20/VBP, phosphorylation-23/NN) |
| E1d | Pos. reg. | lead-20/VBP | | Binding: ligation-6/NN | nsubj(lead-20/VBP, ligation-6/NN) |
| E1e | Pos. reg. | lead-20/VBP | | Binding: ligation-6/NN | rcmod(ligation-6/NN, lead-20/VBP) |
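The idea of matching a rule graph against a sentence graph with some error tolerance can be illustrated with a toy sketch. It is not the authors' algorithm or distance measure: it assumes a graph is a set of `(relation, governor, dependent)` triples, that nodes match when their lemma and POS tag agree, and that the only tolerated error is a rule edge with no counterpart in the sentence. The function names, the `max_missing` parameter, and the sentence graph `SENTENCE` are hypothetical and exist only for illustration; the two rule graphs are taken from rules E1a and E1b in the table above.

```python
from itertools import permutations

# Rule graphs copied from the event rule table (E1a and E1b).
RULE_E1A = {
    ("nsubj", "lead-20/VBP", "ligation-6/NN"),
    ("prep_to", "lead-20/VBP", "phosphorylation-23/NN"),
}
RULE_E1B = {
    ("rcmod", "ligation-6/NN", "lead-20/VBP"),
    ("prep_to", "lead-20/VBP", "phosphorylation-23/NN"),
}

# Hypothetical sentence graph, invented for this example only.
SENTENCE = {
    ("rcmod", "ligation-3/NN", "lead-7/VBP"),
    ("prep_to", "lead-7/VBP", "phosphorylation-10/NN"),
    ("prep_of", "phosphorylation-10/NN", "protein-12/NN"),
}


def label(node):
    """Drop the token position: 'lead-20/VBP' -> 'lead/VBP'."""
    word, pos = node.rsplit("/", 1)
    return word.rsplit("-", 1)[0] + "/" + pos


def graph_nodes(graph):
    """Collect the distinct nodes that appear in any edge of the graph."""
    return sorted({n for _, gov, dep in graph for n in (gov, dep)})


def approx_match(rule, sentence, max_missing):
    """Return (mapping, missing_edges) for the best injective, label-preserving
    mapping of rule nodes onto sentence nodes, or None if even the best
    mapping leaves more than max_missing rule edges unmatched."""
    rule_nodes = graph_nodes(rule)
    best = None
    # Brute-force search over injective mappings; fine for tiny graphs only.
    for image in permutations(graph_nodes(sentence), len(rule_nodes)):
        if any(label(r) != label(s) for r, s in zip(rule_nodes, image)):
            continue
        mapping = dict(zip(rule_nodes, image))
        missing = sum(
            (rel, mapping[gov], mapping[dep]) not in sentence
            for rel, gov, dep in rule
        )
        if best is None or missing < best[1]:
            best = (mapping, missing)
    if best is None or best[1] > max_missing:
        return None
    return best


print(approx_match(RULE_E1B, SENTENCE, max_missing=0))  # exact match
print(approx_match(RULE_E1A, SENTENCE, max_missing=0))  # None: nsubj edge absent
print(approx_match(RULE_E1A, SENTENCE, max_missing=1))  # accepted with one miss
```

With zero tolerance the sketch behaves like exact subgraph matching, so rule E1a fails on this sentence because its `nsubj` edge is absent; allowing one unmatched edge lets the same rule fire. A realistic measure would also have to penalize differences in path length, edge labels, and edge direction rather than only counting missing edges.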
Figure 4. Iterative Bottom-up Event Extraction Example.
Figure 5. ASM-based Event Extraction.
Figure 6. Dependency Kernel Example.
Statistics of BioNLP-ST 2011 GE dataset (values in parentheses are the numbers of full articles).
| Attributes Counted | Training | Development | Testing |
| Abstracts+Full articles | 908 (5) | 259 (5) | 347 (4) |
| Sentences | 8,759 | 2,954 | 3,437 |
| Proteins | 11,625 | 4,690 | 5,301 |
| Total events | 10,287 | 3,243 | 4,457 |
| Sentence-based events | 9,583 | 3,058 | hidden |
ASM parameter setting for the GE task.
| Parameter | Value | Parameter | Value |
| | 7 | | 3 |
| | 5 | | 3 |
| | 7 | | 3 |
| | 10 | | 10 |
| | 10 | | 10 |
| | 7 | | 10 |
Distribution of event rules.
| Event type | No. of event rules |
| Gene_expression | 2,438 |
| Transcription | 479 |
| Protein_catabolism | 130 |
| Phosphorylation | 282 |
| Localization | 281 |
| Binding | 1,651 |
| Regulation | 1,487 |
| Positive_regulation | 4,626 |
| Negative_regulation | 1,619 |
| TOTAL | 12,993 |
GE results on testing set evaluated by “Approximate Span/Approximate Recursive Matching.”
| Event type (No. of events) | Recall (%) | Precision (%) | F-score (%) |
| Gene_expression (1002) | 68.66 | 85.36 | 76.11 |
| Transcription (174) | 47.13 | 76.64 | 58.36 |
| Protein_catabolism (15) | 53.33 | 100.00 | 69.57 |
| Phosphorylation (185) | 80.00 | 71.15 | 75.32 |
| Localization (191) | 45.55 | 75.65 | 56.86 |
| [SVT-TOTAL] (1567) | 64.65 | 81.43 | 72.07 |
| Binding (491) | 35.44 | 54.55 | 42.96 |
| [EVT-TOTAL] (2058) | 57.68 | 75.94 | 65.56 |
| Regulation (385) | 22.34 | 42.16 | 29.20 |
| Positive_regulation (1443) | 33.75 | 54.66 | 41.73 |
| Negative_regulation (571) | 28.55 | 39.95 | 33.30 |
| [REG-TOTAL] (2399) | 30.68 | 48.97 | 37.72 |
| [ALL-TOTAL] (4457) | 43.15 | 62.72 | 51.12 |
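The relationship between the columns of this table can be checked with a short sketch. It assumes the standard balanced F-score (harmonic mean of recall and precision) and assumes that the bracketed totals are micro-averaged over pooled event counts; neither is stated in this section, but both are consistent with the reported numbers. The function and variable names are illustrative.

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of recall and precision (in %)."""
    return 2 * recall * precision / (recall + precision)


# (no. of events, recall %, precision %, reported F-score %) from the table.
SVT_ROWS = {
    "Gene_expression": (1002, 68.66, 85.36, 76.11),
    "Transcription": (174, 47.13, 76.64, 58.36),
    "Protein_catabolism": (15, 53.33, 100.00, 69.57),
    "Phosphorylation": (185, 80.00, 71.15, 75.32),
    "Localization": (191, 45.55, 75.65, 56.86),
}

# Recomputed F-scores can differ from the reported ones in the last digit,
# because the recall and precision inputs are themselves rounded.
for name, (n, r, p, f_reported) in SVT_ROWS.items():
    print(f"{name}: recomputed {f_score(r, p):.2f}, reported {f_reported:.2f}")

# Micro-averaged recall: recover the correctly extracted events per type,
# pool them, and divide by the pooled number of gold events.
true_positives = sum(round(n * r / 100) for n, r, _, _ in SVT_ROWS.values())
gold_events = sum(n for n, _, _, _ in SVT_ROWS.values())
svt_recall = round(100 * true_positives / gold_events, 2)
print(true_positives, gold_events, svt_recall)
```

The pooled recall reproduces the [SVT-TOTAL] recall of 64.65% over 1567 events, which suggests that the totals weight each event equally rather than averaging the per-type scores.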
Performance comparison with other systems on the GE task of BioNLP-ST 2011.
| System | SVT F-score | BIND F-score | REG F-score | TOTAL Recall | TOTAL Precision | TOTAL F-score |
| UMass | 73.50 | 48.79 | 43.82 | 48.49 | 64.08 | 55.20 |
| UTurku | 72.11 | 43.28 | 42.72 | 49.56 | 57.65 | 53.30 |
| MSR-NLP | 71.54 | 41.39 | 40.02 | 48.64 | 54.71 | 51.50 |
| ASM | 72.07 | 42.96 | 37.72 | 43.15 | 62.72 | 51.12 |
| ConcordU | 70.52 | 36.88 | 40.16 | 43.55 | 59.58 | 50.32 |
| UWMadison | 68.70 | 36.88 | 40.37 | 42.56 | 61.21 | 50.21 |
| Stanford | 70.88 | 44.34 | 35.21 | 42.36 | 61.08 | 50.03 |
| | 68.47 | 36.21 | 36.01 | 37.45 | 66.41 | 47.89 |
Performance comparison with other systems on the dataset of BioNLP-ST 2009.
| System | SVT F-score | BIND F-score | REG F-score | TOTAL Recall | TOTAL Precision | TOTAL F-score |
| UMass | 71.54 | 50.76 | 45.51 | 48.74 | 65.94 | 56.05 |
| UTurku | 70.36 | 47.50 | 44.30 | 50.06 | 59.48 | 54.37 |
| MSR-NLP | 70.08 | 43.86 | 40.85 | 48.52 | 56.47 | 52.20 |
| | 70.07 | 43.21 | 38.78 | 42.80 | 64.73 | 51.53 |
| Stanford | 69.29 | 47.57 | 36.09 | 42.55 | 62.69 | 50.69 |
| UWMadison | 65.13 | 43.21 | 41.08 | 42.17 | 62.30 | 50.30 |
| ConcordU | 67.75 | 37.41 | 40.96 | 43.09 | 60.37 | 50.28 |
| | 64.78 | 41.55 | 36.68 | 36.77 | 68.86 | 47.94 |
Wilcoxon signed rank test results.
| Wilcoxon test | ASM | ASM | ASM( | ASM( | p-value |
| Recall | 42.26 | 5.71 | 36.62 | 5.52 | 0.002 |
| Precision | 69.39 | 5.19 | 72.93 | 9.14 | 0.037 |
| F-score | 52.40 | 5.52 | 48.51 | 5.70 | 0.002 |
Figure 7. Physical Validation of Protein Residue Relation.
Statistics of Protein-Residue relation dataset.
| Attributes Counted | No. of instances |
| Total abstracts | 18,045 |
| Total sentences | 138,790 |
| Sentences with co-mentions of protein and residue | 5,256 |
| Physically validated protein-residue relations | 2,814 |
Performance comparison on Protein-Residue association extraction.
| System | Recall (%) | Precision (%) | F-score (%) |
| Co-occurrence baseline | 100.00 | 62.42 | 76.86 |
| | 78.43 | 83.60 | 80.93 |
| ASM | 86.62 | 81.96 | 84.22 |