Sampo Pyysalo, Tomoko Ohta, Rafal Rak, Andrew Rowley, Hong-Woo Chun, Sung-Jae Jung, Sung-Pil Choi, Jun'ichi Tsujii, Sophia Ananiadou.
Abstract
BACKGROUND: Since their introduction in 2009, the BioNLP Shared Task events have been instrumental in advancing the development of methods and resources for the automatic extraction of information from the biomedical literature. In this paper, we present the Cancer Genetics (CG) and Pathway Curation (PC) tasks, two event extraction tasks introduced in the BioNLP Shared Task 2013. The CG task focuses on cancer, emphasizing the extraction of physiological and pathological processes at various levels of biological organization, and the PC task targets reactions relevant to the development of biomolecular pathway models, defining its extraction targets on the basis of established pathway representations and ontologies.
Year: 2015 PMID: 26202570 PMCID: PMC4511510 DOI: 10.1186/1471-2105-16-S10-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Properties of selected shared tasks in biomedical information extraction prior to the BioNLP ST'13.
| Task | Levels of biological organization | Representation |
|---|---|---|
| LLL'05 | Molecular | Binary relations |
| BioCreative II PPI | Molecular | Binary relations |
| BioCreative II.5 IPT | Molecular | Binary relations |
| BioNLP ST'09 GE | Molecular, subcellular | Event structures |
| BioCreative III PPI | Molecular | Binary relations |
| BioNLP ST'11 GE | Molecular, subcellular | Event structures |
| BioNLP ST'11 EPI | Molecular | Event structures |
| BioNLP ST'11 ID | Molecular, organism | Event structures |
| BioNLP ST'11 BB | Cellular, anatomical, environment | Binary relations |
| BioNLP ST'11 BI | Molecular | Event structures |
Entity mention-focused tasks (detection and normalization) and the supporting tasks of the BioNLP ST are not included.
Figure 1. Illustration of entity annotations. a) Cancer Genetics task b) Pathway Curation task. (Illustrations created with BRAT [59].)
Figure 2. Illustration of relation annotations. a) Cancer Genetics task b) Pathway Curation task.
Figure 3. Illustration of event annotations. a) Cancer Genetics task b) Pathway Curation task.
Figure 4. Illustration of the data format. Adapted from [36].
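The content of Figure 4 is not reproduced here, so the following is only a minimal sketch of how annotations in a tab-separated standoff format of the kind used in BioNLP Shared Task data might be read. It assumes entity lines of the form `T<id>`, tab, `<type> <start> <end>`, tab, `<text>`, and event lines of the form `E<id>`, tab, `<type>:<trigger id>` followed by `<role>:<argument id>` pairs. The sample lines and the function name `parse_standoff` are illustrative assumptions, not taken from the task data.

```python
def parse_standoff(lines):
    """Parse entity (T...) and event (E...) lines of an assumed standoff format."""
    entities, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, rest = line.split("\t", 1)
        if ident.startswith("T"):
            # Assumed layout: "<type> <start> <end>" TAB "<covered text>"
            head, text = rest.split("\t", 1)
            etype, start, end = head.split(" ")
            entities[ident] = {
                "type": etype,
                "start": int(start),
                "end": int(end),
                "text": text,
            }
        elif ident.startswith("E"):
            # Assumed layout: "<type>:<trigger id> <role>:<argument id> ..."
            fields = rest.split()
            etype, trigger = fields[0].split(":", 1)
            args = [tuple(field.split(":", 1)) for field in fields[1:]]
            events[ident] = {"type": etype, "trigger": trigger, "args": args}
    return entities, events


# Hypothetical annotation of the text "p53 phosphorylation".
SAMPLE = [
    "T1\tGene_or_gene_product 0 3\tp53",
    "T2\tPhosphorylation 4 19\tphosphorylation",
    "E1\tPhosphorylation:T2 Theme:T1",
]

entities, events = parse_standoff(SAMPLE)
print(entities["T1"]["text"], events["E1"]["args"])
```

The sketch ignores relation and modification lines as well as discontinuous spans, which a complete reader would also need to handle.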
Cancer Genetics task entity types, reference resources, and definitions.
| Type | Reference | Ontology ID |
|---|---|---|
| NCBI taxonomy | CARO:0000012 | |
| | Species-specific | CARO:0000032 |
| | anatomy | CARO:0000011 |
| | resources (e.g. | CARO:0000024 |
| M | FMA), derived | CARO:0000055 |
| T | resources (e.g. | CARO:0000043 |
| D | UBERON) | UBERON:0005423 |
| C | CL | CARO:0000013 |
| | GO-CC | GO:0005575 |
| | FMA etc. | CARO:0000004 |
| | FMA etc. | CARO:0000007 |
| | - | - |
| | - | - |
| | gene, protein, | SBO:0000246 |
| | and related | SBO:0000493 |
| | entity DBs | SBO:0000493 |
| | ChEBI | SBO:0000247 |
| | ChEBI | CHEBI:33709 |
The indentation of the types corresponds to is-a relations. The labels in italics are not annotated types but groupings defined only for organization.
Cancer Genetics task event types and their arguments.
| Type | Core arguments | Additional arguments |
|---|---|---|
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| | ||
| (other chemical modifications defined similarly to PHOSPHORYLATION) | ||
| P | ||
| | ||
| | ||
| | ||
| Regulation | ||
| | ||
| | ||
The indentation corresponds to ontological structure (is-a/part-of relations). The suffixes ?, *, and + denote zero or one, zero or more, and one or more arguments of the shown type, respectively. GGP stands for Gene or gene product. For brevity, additional argument types are not shown in the table: the AtLoc, FromLoc and ToLoc arguments take an anatomical entity type, and Site arguments take a Protein domain or region or DNA domain or region entity type.
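The suffix notation can be read as a set of cardinality constraints on event arguments. The following minimal sketch checks a list of argument roles against such a specification. The role names in `EXAMPLE_SPEC` are hypothetical and are not copied from the task definition, and treating an unsuffixed role as "exactly one" is an assumption.

```python
import re
from collections import Counter

# Hypothetical specification written in the ?, *, + suffix notation.
EXAMPLE_SPEC = ["Theme", "Cause?", "Site*", "Participant+"]

# Suffix -> (minimum, maximum) count; None means unbounded.
# The unsuffixed case (exactly one) is an assumption of this sketch.
SUFFIX_BOUNDS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def parse_spec(spec):
    """Map each role name to its (minimum, maximum) allowed count."""
    bounds = {}
    for item in spec:
        match = re.fullmatch(r"(\w+)([?*+]?)", item)
        if match is None:
            raise ValueError(f"unrecognized argument spec: {item!r}")
        role, suffix = match.groups()
        bounds[role] = SUFFIX_BOUNDS[suffix]
    return bounds


def check_arguments(spec, roles):
    """Return a list of constraint violations for the given argument roles."""
    bounds = parse_spec(spec)
    counts = Counter(roles)
    problems = [f"unexpected role {role}" for role in counts if role not in bounds]
    for role, (low, high) in bounds.items():
        n = counts.get(role, 0)
        if n < low:
            problems.append(f"too few {role}: {n} < {low}")
        if high is not None and n > high:
            problems.append(f"too many {role}: {n} > {high}")
    return problems


print(check_arguments(EXAMPLE_SPEC, ["Theme", "Cause", "Cause", "Participant"]))
```

An empty result means that the argument list satisfies every constraint in the specification.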
Pathway Curation task entity types, reference resources, and definitions.
| Entity type | Reference | Ontology ID |
|---|---|---|
| ChEBI | SBO:0000247 | |
| gene/protein DBs | SBO:0000246 | |
| complex DBs | SBO:0000253 | |
| GO-CC | SBO:0000290 |
Pathway Curation task event types and arguments.
| Event type | Core arguments | Additional arguments | Ontology ID |
|---|---|---|---|
| SBO:0000182 | |||
| | SBO:0000216 | ||
| | SBO:0000330 | ||
| (Other modifications, such as ACETYLATION, defined similarly.) | |||
| GO:0051179 | |||
| | SBO:0000185 | ||
| GO:0010467 | |||
| | SBO:0000183 | ||
| | SBO:0000184 | ||
| SBO:0000179 | |||
| SBO:0000177 | |||
| SBO:0000180 | |||
| GO:0065007 | |||
| P | GO:0048518, | ||
| GO:0044093 | |||
| SBO:0000412 | |||
| | GO:0048519, | ||
| GO:0044092 | |||
| | SBO:0000412 | ||
| SBO:0000375 | |||
"Molecule" represents any of Simple chemical, Gene or gene product, or Complex. "Any" refers to an annotation of any type. The indentation of the types corresponds to ontological relations (is-a and part-of) between the event types.
Figure 5. Pathway model reactions and event representations. Illustration of reactions in a pathway model (left), idealized explicit statements annotated with a directly mapped representation (center), and realistic expressions in text with actual event annotation (right). Figure from [5].
Queries for Cancer Genetics task document selection.
| Domain | Documents | Query terms |
|---|---|---|
| Carcinogenesis | 150 | cell transformation, neoplastic AND (proteins OR genes) |
| Metastasis | 100 | neoplasm metastasis AND (proteins OR genes) |
| Apoptosis | 50 | apoptosis AND (proteins OR genes) |
| Glucose metabolism | 50 | (glucose/metabolism OR glycolysis) AND neoplasms |
Only matches against MeSH terms were queried for, excluding cases where the query terms appeared only in text (e.g. "neoplasms"[MeSH Terms]).
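As a rough illustration of restricting every query term to MeSH matches, the sketch below assembles query strings with the `[MeSH Terms]` field tag shown in the example above. The helper names `mesh` and `any_of` are hypothetical, and the exact query strings used for document selection are not given in the source, so the output is only one plausible rendering of the table rows.

```python
def mesh(term):
    """Quote a term and restrict it to MeSH matches."""
    return f'"{term}"[MeSH Terms]'


def any_of(*terms):
    """Parenthesized OR over MeSH-restricted terms."""
    return "(" + " OR ".join(mesh(term) for term in terms) + ")"


# One plausible rendering of the four table rows (an assumption, not the authors' exact queries).
queries = {
    "Carcinogenesis": mesh("cell transformation, neoplastic") + " AND " + any_of("proteins", "genes"),
    "Metastasis": mesh("neoplasm metastasis") + " AND " + any_of("proteins", "genes"),
    "Apoptosis": mesh("apoptosis") + " AND " + any_of("proteins", "genes"),
    "Glucose metabolism": any_of("glucose/metabolism", "glycolysis") + " AND " + mesh("neoplasms"),
}

for domain, query in queries.items():
    print(f"{domain}: {query}")
```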
Pathway models used to select documents for the Pathway Curation task.
| Pathway | Repository | ID | Publication |
|---|---|---|---|
| mTOR | BioModels | MODEL1012220002 | [ |
| mTORC1 upstream regulators | BioModels | MODEL1012220003 | [ |
| TLR | BioModels | MODEL2463683119 | [ |
| Yeast Cell Cycle | BioModels | MODEL1011020000 | [ |
| Rb | BioModels | MODEL4132046015 | [ |
| EGFR | BioModels | MODEL2463576061 | [ |
| Human Metabolic Network | BioModels | MODEL6399676120 | [ |
| NF-kappaB pathway | - | - | [ |
| p38 MAPK | PANTHER DB | P05918 | - |
| p53 | PANTHER DB | P00059 | - |
| p53 feedback loop pathway | PANTHER DB | P04392 | - |
| Wnt signaling pathway | PANTHER DB | P00057 | - |
Cancer Genetics task corpus statistics.
| Item | Train | Devel | Test | Total |
|---|---|---|---|---|
| Documents | 300 | 100 | 200 | 600 |
| Words | 66082 | 21732 | 42064 | 129878 |
| Entities | 11034 | 3665 | 6984 | 21683 |
| Relations | 466 | 176 | 275 | 917 |
| Events | 8803 | 2915 | 5530 | 17248 |
| Modifications | 670 | 214 | 442 | 1326 |
Pathway Curation task corpus statistics.
| Item | Train | Devel | Test | Total |
|---|---|---|---|---|
| Documents | 260 | 90 | 175 | 525 |
| Words | 53811 | 18579 | 35966 | 108356 |
| Entities | 7855 | 2734 | 5312 | 15901 |
| Relations | 455 | 128 | 330 | 913 |
| Events | 5992 | 2129 | 4004 | 12125 |
| Modifications | 317 | 80 | 174 | 571 |
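The Total columns of the two corpus statistics tables can be cross-checked against the train, development, and test splits. The sketch below copies the figures from the tables and recomputes each total; the dictionary layout and the function name `mismatches` are illustrative choices.

```python
# (train, devel, test, reported total), copied from the two corpus statistics tables.
CORPUS_STATS = {
    "CG": {
        "Documents": (300, 100, 200, 600),
        "Words": (66082, 21732, 42064, 129878),
        "Entities": (11034, 3665, 6984, 21683),
        "Relations": (466, 176, 275, 917),
        "Events": (8803, 2915, 5530, 17248),
        "Modifications": (670, 214, 442, 1326),
    },
    "PC": {
        "Documents": (260, 90, 175, 525),
        "Words": (53811, 18579, 35966, 108356),
        "Entities": (7855, 2734, 5312, 15901),
        "Relations": (455, 128, 330, 913),
        "Events": (5992, 2129, 4004, 12125),
        "Modifications": (317, 80, 174, 571),
    },
}


def mismatches(stats):
    """List (task, item) pairs whose splits do not sum to the reported total."""
    bad = []
    for task, items in stats.items():
        for item, (train, devel, test, total) in items.items():
            if train + devel + test != total:
                bad.append((task, item))
    return bad


print(mismatches(CORPUS_STATS))

# Average number of events per document, derived from the Total column.
for task, items in CORPUS_STATS.items():
    print(task, round(items["Events"][3] / items["Documents"][3], 2))
```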
Participating teams, ranks and references to system descriptions.
| Team | Institution | Tasks (rank) | Members | Ref |
|---|---|---|---|---|
| TEES-2.1 | University of Turku | CG(1), PC(2) | 1 BI | [ |
| NaCTeM | National Centre for Text Mining | PC(1), CG(2) | 1 NLP | [ |
| NCBI | National Center for Biotechnology Information | CG(3) | 3 BI | [ |
| RelAgent | RelAgent Private Ltd. | CG(4) | 1 LI, 1 CS | [ |
| UET-NII | University of Engineering and Technology, Vietnam and National Institute of Informatics, Japan | CG(5) | 6 CS | [ |
| ISI | Indian Statistical Institute | CG(6) | 2 ML, 2 NLP | - |
Abbreviations: BI: Bioinformatician, CS: Computer Scientist, LI: Linguist, ML: Machine Learning researcher, NLP: Natural Language Processing researcher.
Summary of system architectures.
| Team | NLP methods | NLP methods | Events: Trigger | Events: Arg | Events: Group | Events: Modif. | Resources | Resources |
|---|---|---|---|---|---|---|---|---|
| TEES-2.1 | Porter | McCCJ + SD | SVM | SVM | SVM | SVM | GE | hedge words |
| NaCTeM | Snowball | Enju, GDep | SVM | SVM | SVM | SVM | (see text) | triggers |
| NCBI | MedPost, BLem | McCCJ + SD | Joint, subgraph matching | - | GE, EPI | - | ||
| RelAgent | Brill | fnTBL, custom | rules | rules | rules | rules | - | - |
| UET-NII | Porter | Enju | SVM | MaxEnt | Earley | - | - | triggers |
| ISI | CoreNLP | CoreNLP | NERsuite | Joint, MaltParser | - | - | - | |
Abbreviations: Trigger: event trigger detection, Arg: trigger-argument relation detection, Group: argument grouping into event structures, Modif.: event modification prediction, CoreNLP: Stanford CoreNLP, Porter: Porter stemmer, BLem: BioLemmatizer, Snowball: Snowball stemmer, McCCJ: McClosky-Charniak-Johnson parser, Charniak: Charniak parser, SD: Stanford Dependency conversion, SVM: Support Vector Machines, MaxEnt: Maximum Entropy Modeling.
Cancer Genetics task primary evaluation result summary.
| Team | Recall | Precision | F-score |
|---|---|---|---|
| TEES-2.1 | 48.76 | | |
| NaCTeM | | 55.82 | 52.09 |
| NCBI | 38.28 | 58.84 | 46.38 |
| RelAgent | 41.73 | 49.58 | 45.32 |
| UET-NII | 19.66 | 62.73 | 29.94 |
| ISI | 16.44 | 47.83 | 24.47 |
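The reported F-scores are consistent with the balanced harmonic mean of recall and precision, F = 2PR / (P + R). The sketch below recomputes it for the rows that list all three values, assuming that the numeric columns are recall, precision, and F-score in that order.

```python
def f_score(recall, precision):
    """Balanced F-score (harmonic mean of recall and precision), in percent."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F-score) for rows where all three values are present.
ROWS = {
    "NCBI": (38.28, 58.84, 46.38),
    "RelAgent": (41.73, 49.58, 45.32),
    "UET-NII": (19.66, 62.73, 29.94),
    "ISI": (16.44, 47.83, 24.47),
}

for team, (recall, precision, reported) in ROWS.items():
    print(team, round(f_score(recall, precision), 2), reported)
```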
Cancer Genetics task primary evaluation F-scores by event type.
| Event | TEES-2.1 | NaCTeM | NCBI | RelAgent | UET-NII | ISI |
|---|---|---|---|---|---|---|
| | 64.77 | 67.33 | 66.31 | 61.72 | 53.66 | |
| | 78.82 | 81.92 | 79.60 | 21.49 | 13.56 | |
| | 75.97 | 59.85 | 66.67 | 70.87 | 65.52 | |
| | 73.17 | 74.07 | 64.71 | 77.78 | 63.16 | |
| | 73.30 | 75.18 | 66.98 | 25.17 | 7.35 | |
| | 78.33 | 72.73 | 64.39 | 71.43 | 57.40 | |
| | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | 56.34 | 48.48 | 48.98 | 54.55 | 24.14 | |
| | 30.00 | 22.22 | 21.05 | 20.00 | 23.53 | |
| | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| 71.31 | 73.68 | 70.82 | 50.04 | 38.86 | ||
| | 38.00 | 25.11 | 27.36 | 27.91 | 9.52 | |
| | 72.18 | 67.14 | 64.12 | 35.96 | 24.72 | |
| | 81.56 | 71.13 | 67.07 | 57.14 | 32.39 | |
| | 70.13 | 76.54 | 42.42 | 58.67 | 50.70 | |
| | 51.05 | 52.69 | 47.79 | 56.41 | 26.20 | |
| | 69.57 | 69.23 | 33.33 | 11.76 | 0.00 | |
| 59.78 | 54.19 | 48.14 | 46.90 | 25.17 | ||
| | 70.27 | 74.29 | 80.00 | 68.75 | 71.43 | |
| | 71.11 | 53.57 | 64.71 | 48.65 | ||
| | 36.36 | 38.10 | 23.08 | 20.00 | 36.36 | |
| | 0.00 | 95.45 | 97.78 | 0.00 | 0.00 | |
| | 0.00 | 0.00 | 0.00 | |||
| | 78.21 | 73.69 | 69.45 | 58.01 | 53.28 | |
| | 37.33 | 42.86 | 28.12 | 32.00 | 20.93 | |
| | 22.22 | 0.00 | 0.00 | 0.00 | 0.00 | |
| | 0.00 | |||||
| | 66.67 | 66.67 | 66.67 | |||
| | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| | 63.33 | 53.12 | 64.15 | 58.33 | 50.00 | |
| | 0.00 | 0.00 | 100.00 | |||
| | 0.00 | 80.00 | 0.00 | 0.00 | ||
| | 30.30 | 42.11 | 32.43 | 33.33 | ||
| | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | 59.07 | 51.14 | 34.29 | 18.31 | 35.64 | |
| 72.77 | 67.33 | 60.72 | 49.35 | 46.70 | ||
| | 43.93 | 37.89 | 32.69 | 33.94 | 11.92 | |
| | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | 54.83 | 47.58 | 45.22 | 44.94 | 35.94 | |
| 52.20 | 44.70 | 40.89 | 41.76 | 29.59 | ||
| | 28.73 | 14.19 | 26.48 | 5.51 | 4.57 | |
| | 44.18 | 34.70 | 38.40 | 13.00 | 12.33 | |
| | 43.17 | 33.20 | 40.47 | 10.30 | 12.16 | |
| 39.79 | 29.21 | 35.58 | 10.30 | 10.29 | ||
| | 39.43 | 34.28 | 28.57 | 22.74 | 21.22 | |
| 53.50 | 48.56 | 46.37 | 31.72 | 25.90 | ||
| | 29.55 | 0.00 | 34.64 | 0.00 | 0.00 | |
| | 27.14 | 0.00 | 25.90 | 0.00 | 0.00 | |
| 29.95 | 0.00 | 30.88 | 0.00 | 0.00 | ||
| 52.09 | 46.38 | 45.32 | 29.94 | 24.47 | ||
Pathway Curation task primary evaluation result summary.
| Team | Recall | Precision | F-score |
|---|---|---|---|
| NaCTeM | 52.23 | 53.48 | 52.84 |
| TEES-2.1 | 47.15 | 55.78 | 51.10 |
Pathway Curation task primary evaluation results by event type.
| Event | NaCTeM recall | NaCTeM precision | NaCTeM F-score | TEES-2.1 recall | TEES-2.1 precision | TEES-2.1 F-score |
|---|---|---|---|---|---|---|
| | 34.33 | 35.48 | 34.90 | 35.82 | 42.86 | |
| | 62.46 | 55.94 | 59.02 | 53.40 | 66.00 | |
| | 45.00 | 56.25 | 35.00 | 77.78 | 48.28 | |
| | 69.57 | 72.73 | 71.11 | 82.61 | 76.00 | |
| | 33.33 | 33.33 | 0.00 | 0.00 | 0.00 | |
| | 42.86 | 60.00 | 50.00 | 57.14 | 80.00 | |
| | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| | 52.94 | 64.29 | 58.06 | 58.82 | 76.92 | |
| | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| | 42.25 | 61.22 | 43.66 | 54.39 | 48.44 | |
| | 65.52 | 61.29 | 56.55 | 59.85 | 58.16 | |
| | 90.65 | 83.15 | 84.55 | 79.39 | 81.89 | |
| | 71.15 | 82.22 | 57.69 | 73.17 | 64.52 | |
| | 0.00 | 0.00 | 0.00 | 50.00 | 100.00 | |
| 66.42 | 64.80 | 60.40 | 67.87 | 63.92 | ||
| | 78.57 | 89.19 | 78.57 | 78.57 | 78.57 | |
| | 78.54 | 70.96 | 72.06 | 72.06 | 72.06 | |
| | 44.62 | 55.77 | 38.46 | 45.45 | 41.67 | |
| | 64.96 | 47.30 | 53.96 | 53.96 | 53.96 | |
| | 38.46 | 46.88 | 35.90 | 45.16 | 40.00 | |
| | 84.91 | 75.50 | 70.94 | 75.50 | 73.15 | |
| 69.07 | 62.69 | 61.16 | 65.74 | 63.37 | ||
| | 33.33 | 33.97 | 33.65 | 29.73 | 39.51 | |
| | 35.49 | 42.81 | 38.81 | 34.51 | 45.45 | |
| | 45.75 | 50.64 | 41.02 | 47.37 | 43.97 | |
| 37.73 | 42.79 | 35.17 | 44.76 | 39.39 | ||
| 53.47 | 53.96 | 48.23 | 56.22 | 51.92 | ||
| | 24.52 | 35.87 | 29.13 | 25.16 | 41.30 | |
| | 15.79 | 22.22 | 0.00 | 0.00 | 0.00 | |
| 23.56 | 34.65 | 28.05 | 22.41 | 40.00 | ||
| 52.23 | 53.48 | 47.15 | 55.78 | 51.10 | ||
Cancer Genetics task core evaluation results.
| Team | Recall | Precision | F-score | Δ |
|---|---|---|---|---|
| TEES-2.1 | 52.14 | 66.18 | 58.33 | 2.92 |
| NaCTeM | 53.32 | 58.98 | 56.01 | 3.92 |
| NCBI | 43.33 | 62.07 | 51.04 | 4.66 |
| RelAgent | 44.82 | 52.40 | 48.32 | 3.00 |
| UET-NII | 22.08 | 65.21 | 33.00 | 3.06 |
| ISI | 18.57 | 49.93 | 27.08 | 2.61 |
The Δ column shows the absolute difference from the primary evaluation F-score.
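The Δ values can be reproduced by subtracting the primary evaluation F-score from the core evaluation F-score. The sketch below does this for the teams whose primary F-score is listed in the primary evaluation summary; the tuple layout is an illustrative choice.

```python
# (core F-score, primary F-score, reported delta), copied from the result tables.
SCORES = {
    "NaCTeM": (56.01, 52.09, 3.92),
    "NCBI": (51.04, 46.38, 4.66),
    "RelAgent": (48.32, 45.32, 3.00),
    "UET-NII": (33.00, 29.94, 3.06),
    "ISI": (27.08, 24.47, 2.61),
}


def delta(core_f, primary_f):
    """Absolute F-score difference in percentage points, rounded to two decimals."""
    return round(abs(core_f - primary_f), 2)


for team, (core_f, primary_f, reported) in SCORES.items():
    print(team, delta(core_f, primary_f), reported)
```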
Pathway Curation task core evaluation results.
| Team | Recall | Precision | F-score | Δ |
|---|---|---|---|---|
| NaCTeM | 54.14 | 54.78 | 54.46 | 1.62 |
| TEES-2.1 | 49.49 | 57.02 | 52.99 | 1.89 |
The Δ column shows the absolute difference from the primary evaluation F-score.
Cancer Genetics task evaluation results with single partial penalty.
| Team | Primary recall | Primary precision | Primary F-score | Primary Δ | Core recall | Core precision | Core F-score | Core Δ |
|---|---|---|---|---|---|---|---|---|
| TEES-2.1 | 50.64 | 68.78 | 58.33 | 2.92 | 53.72 | 70.49 | 60.97 | 2.64 |
| NaCTeM | 50.70 | 61.88 | 55.74 | 3.65 | 55.03 | 64.43 | 59.36 | 3.35 |
| NCBI | 39.75 | 65.97 | 49.61 | 3.23 | 44.35 | 68.54 | 53.85 | 2.81 |
| RelAgent | 43.47 | 54.39 | 48.32 | 3.00 | 46.55 | 57.21 | 51.33 | 3.01 |
| UET-NII | 22.35 | 67.01 | 33.52 | 3.58 | 24.85 | 68.65 | 36.49 | 3.49 |
| ISI | 17.65 | 51.17 | 26.25 | 1.78 | 19.92 | 52.96 | 28.95 | 1.87 |
The Δ columns show the absolute difference from the corresponding F-scores without single partial penalty.
Pathway Curation task evaluation results with single partial penalty.
| Team | Primary recall | Primary precision | Primary F-score | Primary Δ | Core recall | Core precision | Core F-score | Core Δ |
|---|---|---|---|---|---|---|---|---|
| NaCTeM | 54.14 | 57.02 | 55.54 | 2.70 | 56.12 | 58.34 | 57.21 | 2.75 |
| TEES-2.1 | 49.66 | 58.77 | 53.83 | 2.73 | 52.13 | 59.85 | 55.72 | 2.73 |
The Δ columns show the absolute difference from the corresponding F-scores without single partial penalty.
Figure 6. Simple events. Events with single arguments are reliably extracted regardless of factors such as text domain or level of biological organization.
Figure 7. Complex events. Events involving multiple participants, recursive structure, and modifications continue to represent challenges for extraction.