Jin-Dong Kim, Jung-Jae Kim, Xu Han, Dietrich Rebholz-Schuhmann.
Abstract
BACKGROUND: The third edition of the BioNLP Shared Task was held with the grand theme "knowledge base (KB) construction". The Genia Event (GE) task was redesigned and implemented in light of this theme. For its final report, the participating systems were evaluated from the perspective of annotation. To further explore the grand theme, we extended the evaluation to the perspective of KB construction. The Gene Regulation Ontology (GRO) task was also newly introduced in the third edition. In its final evaluation, the participating systems showed relatively low performance, which was attributed to the large size and complex semantic representation of the ontology. To investigate the potential benefits of resource exchange between these presumably similar tasks, we measured the overlap between the datasets of the two tasks and tested whether the dataset for one task can be used to enhance performance on the other.
Year: 2015 PMID: 26202680 PMCID: PMC4511578 DOI: 10.1186/1471-2105-16-S10-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Annotation example 1 in visualization. MTb induces NFAT5 gene expression via the MyD88-dependent signaling cascade.
Figure 2. Annotation example 2 in visualization.
Figure 3. Annotation example 3 in visualization.
Figure 4. Annotation example 3 in BioNLP .
Figure 5. Annotation example 3 in JSON.
Figure 6. Annotation example 3 in RDF.
Queries used for the KB evaluation.
| # | Meaning | SPARQL |
|---|---|---|
| Q1 | Find the proteins that are in the context of gene expression. | |
| Q2 | Find the proteins that are in the context of localization. | |
| Q3 | Find the proteins that are in the context of binding. | |
| Q4 | Find the protein pairs that bind to each other. | |
| Q5 | Find the protein pairs of which one regulates the other. | |
| Q6 | Find the protein pairs of which one regulates the other (transitive). | |
| Q7 | Find the protein pairs of which one regulates expression of the other. | |
| Q8 | Find the protein pairs of which one regulates expression of the other (transitive). | |
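
The transitive queries (Q6 and Q8) can be illustrated with a small sketch. It assumes that "transitive" means reachability through a chain of one or more direct regulation links, which is an interpretation rather than something the table states. The function name `transitive_pairs` and the protein names `A`, `B`, `C` are hypothetical.

```python
from collections import defaultdict


def transitive_pairs(direct_pairs):
    """Return every (regulator, target) pair connected by one or more direct links.

    `direct_pairs` is an iterable of (regulator, target) tuples. This is a
    hypothetical helper, not the query implementation used in the evaluation.
    """
    graph = defaultdict(set)
    for regulator, target in direct_pairs:
        graph[regulator].add(target)

    closure = set()
    for start in list(graph):
        # Depth-first walk over everything reachable from `start`.
        stack = list(graph[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(graph.get(node, ()))
    return closure


# Hypothetical direct regulation links: A regulates B, B regulates C.
direct = {("A", "B"), ("B", "C")}
print(sorted(transitive_pairs(direct)))
```

With the two direct links above, the result also contains the indirect pair `("A", "C")`. If the links form a cycle, a protein is reported as regulating itself, which a real query might want to filter out.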
Basic statistics of GE'13 and GRO'13 benchmark datasets.
| | GE'13 | | | GRO'13 | | |
|---|---|---|---|---|---|---|
| Documents | 10 papers | 10 papers | 14 papers | 150 abstracts | 50 abstracts | 100 abstracts |
| Entities | 3692 | 4452 | 4686 | 5902 | 1910 | 4007 |
| Events | 2817 | 3199 | 3348 | 2175 | 747 | 2319 |
Mappings between the Genia ontology concepts for the GE'13 task and the GRO concepts for the GRO'13 task.
| Genia concept | GRO concept |
|---|---|
| Acetylation | Acetylation |
| Binding | BindingToProtein |
| Gene_expression | GeneExpression |
| Localization | Localization |
| Negative_regulation | NegativeRegulation |
| Phosphorylation | Phosphorylation |
| Positive_regulation | PositiveRegulation |
| Protein | Gene |
| Protein | Protein |
| Protein_catabolism | ProteinCatabolism |
| Protein_modification | ProteinModification |
| Regulation | RegulatoryProcess |
| Transcription | Transcription |
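
The mapping table can be read as a lookup from Genia concepts to GRO concepts. The sketch below encodes the rows as a Python dictionary. Because Protein appears twice (mapped to both Gene and Protein), each value is a list of candidates. The names `GENIA_TO_GRO` and `convert_genia_label` are illustrative, and how the one-to-many case is resolved is left open here.

```python
# Rows of the mapping table, keyed by Genia concept.
GENIA_TO_GRO = {
    "Acetylation": ["Acetylation"],
    "Binding": ["BindingToProtein"],
    "Gene_expression": ["GeneExpression"],
    "Localization": ["Localization"],
    "Negative_regulation": ["NegativeRegulation"],
    "Phosphorylation": ["Phosphorylation"],
    "Positive_regulation": ["PositiveRegulation"],
    "Protein": ["Gene", "Protein"],  # one-to-many: listed twice in the table
    "Protein_catabolism": ["ProteinCatabolism"],
    "Protein_modification": ["ProteinModification"],
    "Regulation": ["RegulatoryProcess"],
    "Transcription": ["Transcription"],
}


def convert_genia_label(label):
    """Return candidate GRO concepts for a Genia label, or [] if none is listed."""
    return GENIA_TO_GRO.get(label, [])


print(convert_genia_label("Binding"))
print(convert_genia_label("Protein"))
print(convert_genia_label("SomeUnmappedType"))  # hypothetical label with no mapping
```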
Results of annotation-oriented evaluation on Gene-expression and Localization. Acronyms: GS=gold standard, P=positives, TP=true positives, R=recall, P=precision, F=f-score.
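| System | Gene_expression: P | Gene_expression: TP | Gene_expression: R / P / F | Localization: P | Localization: TP | Localization: R / P / F |
|---|---|---|---|---|---|---|
| GS | 619 | 619 | - | 99 | 99 | - |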
| EVEX | 600 | 504 | 81.42 / 84.00 / 82.69 | 56 | 47 | 47.47 / 83.93 / 60.65 |
| TEES-2.1 | 600 | 504 | 81.42 / 84.00 / 82.69 | 59 | 50 | |
| BioSEM | 526 | 457 | 73.83 / 86.88 / 79.83 | 47 | 42 | 42.42 / 89.36 / 57.53 |
| NCBI | 641 | 495 | 79.97 / 77.22 / 78.57 | 47 | 39 | 39.39 / 82.98 / 53.42 |
| DlutNLP | 580 | 480 | 77.54 / 82.76 / 80.07 | 39 | 35 | 35.35 / 89.74 / 50.72 |
| HDS4NLP | 556 | 501 | | 66 | 50 | 50.51 / 75.76 / 60.61 |
| NICTANLM | 761 | 506 | 81.74 / 66.49 / 73.33 | 52 | 31 | 31.31 / 59.62 / 41.06 |
| USheff | 450 | 386 | 62.36 / 85.78 / 72.22 | 27 | 23 | 23.23 / 85.19 / 36.51 |
| UZH | 497 | 406 | 65.59 / 81.69 / 72.76 | 39 | 34 | 34.34 / 87.18 / 49.28 |
| HCMUS | 790 | 488 | 78.84 / 61.77 / 69.27 | 61 | 32 | 32.32 / 52.46 / 40.00 |
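
The R / P / F triples in these tables follow from the three counts in each row: the gold-standard size (GS), the number of positives returned by a system (P), and the number of true positives (TP). The sketch below is a minimal illustration, assuming the F-score is the balanced harmonic mean of precision and recall. The function name `precision_recall_f` is illustrative.

```python
def precision_recall_f(gold, predicted, true_positives):
    """Return (recall, precision, f_score) as percentages rounded to two decimals."""
    recall = true_positives / gold if gold else 0.0
    precision = true_positives / predicted if predicted else 0.0
    denom = precision + recall
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / denom if denom else 0.0
    return tuple(round(100 * value, 2) for value in (recall, precision, f_score))


# EVEX on Gene_expression: GS=619, P=600, TP=504.
print(precision_recall_f(gold=619, predicted=600, true_positives=504))
# (81.42, 84.0, 82.69)
```

The same function can be applied to rows whose R / P / F cell is blank, since the counts for those rows are still present.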
Results of annotation-oriented evaluation on Binding and REGULATION-ALL. Acronyms: GS=gold standard, P=positives, TP=true positives, R=recall, P=precision, F=f-score.
| System | Binding: P | Binding: TP | Binding: R / P / F | REGULATION_ALL: P | REGULATION_ALL: TP | REGULATION_ALL: R / P / F |
|---|---|---|---|---|---|---|
| GS | 333 | 333 | - | 1944 | 1944 | - |
| EVEX | 306 | 137 | 41.14 / 44.77 / 42.88 | 1336 | 630 | |
| TEES-2.1 | 318 | 141 | | 1436 | 643 | 33.08 / 44.78 / 38.05 |
| BioSEM | 302 | 158 | 47.45 / 52.32 / 49.76 | 1115 | 547 | 28.19 / 49.06 / 35.80 |
| NCBI | 299 | 125 | 37.54 / 41.81 / 39.56 | 865 | 481 | 24.74 / 55.61 / 34.25 |
| DlutNLP | 308 | 136 | 40.84 / 44.16 / 42.43 | 1185 | 515 | 26.49 / 43.46 / 32.92 |
| HDS4NLP | 412 | 139 | 41.74 / 33.74 / 37.32 | 780 | 411 | 21.14 / 52.69 / 30.18 |
| NICTANLM | 344 | 107 | 32.13 / 31.10 / 31.61 | 891 | 420 | 21.60 / 47.14 / 29.63 |
| USheff | 224 | 105 | 31.53 / 46.88 / 37.70 | 1050 | 324 | 16.67 / 30.86 / 21.64 |
| UZH | 264 | 74 | 22.22 / 28.03 / 24.79 | 1912 | 381 | 19.60 / 19.93 / 19.76 |
| HCMUS | 478 | 129 | 38.74 / 26.99 / 31.81 | 693 | 215 | 11.06 / 31.02 / 16.31 |
Results of KB-oriented evaluation for Q1 (Find the proteins in the context of gene expression) and Q2 (Find the proteins in the context of localization).
| System | Q1 (Gene_expression): P | Q1: TP | Q1: R / P / F | Q2 (Localization): P | Q2: TP | Q2: R / P / F |
|---|---|---|---|---|---|---|
| GS | 604 | 604 | - | 94 | 94 | - |
| EVEX | 604 | 497 | 82.28 / 82.28 / 82.28 | 56 | 45 | 47.87 / 80.36 / 60.00 |
| TEES-2.1 | 604 | 497 | 82.28 / 82.28 / 82.28 | 59 | 48 | |
| BioSEM | 537 | 456 | 75.50 / 84.92 / 79.93 | 52 | 43 | 45.74 / 82.69 / 58.90 |
| NCBI | 647 | 493 | 81.62 / 76.20 / 78.82 | 46 | 38 | 40.43 / 82.61 / 54.29 |
| DlutNLP | 591 | 475 | 78.64 / 80.37 / 79.50 | 38 | 35 | 37.23 / 92.11 / 53.03 |
| HDS4NLP | 563 | 500 | | 68 | 50 | 53.19 / 73.53 / 61.73 |
| NICTANLM | 748 | 501 | 82.95 / 66.98 / 74.11 | 52 | 30 | 31.91 / 57.69 / 41.10 |
| USheff | 452 | 386 | 63.91 / 85.40 / 73.11 | 26 | 23 | 24.47 / 88.46 / 38.33 |
| UZH | 496 | 404 | 66.89 / 81.45 / 73.45 | 40 | 34 | 36.17 / 85.00 / 50.75 |
| HCMUS | 763 | 481 | 79.64 / 63.04 / 70.37 | 68 | 31 | 32.98 / 45.59 / 38.27 |
Acronyms: GS=gold standard, P=positives, TP=true positives, R=recall, P=precision, F=f-score.
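
A KB-oriented score can be sketched as a comparison of query answers rather than of individual annotations. The example below assumes that the answers a query returns from the gold-standard KB and from a system's KB are compared as sets, so repeated mentions of the same fact count once. That assumption, the function name `evaluate_query_results`, and the sample pairs are all illustrative.

```python
def evaluate_query_results(gold_results, system_results):
    """Score one query by comparing its answers on the gold KB and a system KB."""
    gold = set(gold_results)
    system = set(system_results)
    tp = len(gold & system)  # answers found in both KBs
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(system) if system else 0.0
    total = len(gold) + len(system)
    # 2*TP / (GS + P) equals the harmonic mean of precision and recall.
    f_score = 2 * tp / total if total else 0.0
    return {
        "gold": len(gold),
        "positives": len(system),
        "true_positives": tp,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical answers to a pair query such as Q5 (regulator, target).
gold_pairs = [("A", "B"), ("B", "C"), ("C", "D")]
system_pairs = [("A", "B"), ("C", "D"), ("D", "A"), ("A", "B")]  # one duplicate
print(evaluate_query_results(gold_pairs, system_pairs))
```

Pair order matters in this sketch, which suits directed relations such as regulation. A symmetric relation such as binding (Q4) would need the pairs normalized first, for example by sorting each pair.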
Results of KB-oriented evaluation for Q3 (Find the proteins in the context of binding) and Q4 (Find the protein pairs binding to each other).
| System | Q3 (Binding): P | Q3: TP | Q3: R / P / F | Q4 (pair Binding): P | Q4: TP | Q4: R / P / F |
|---|---|---|---|---|---|---|
| GS | 300 | 300 | - | 83 | 83 | - |
| EVEX | 324 | 182 | 60.67 / 56.17 / 58.33 | 122 | 27 | 32.53 / 22.13 / 26.34 |
| TEES-2.1 | 336 | 188 | 62.67 / 55.95 / 59.12 | 144 | 32 | |
| BioSEM | 355 | 182 | 60.67 / 51.27 / 55.57 | 114 | 21 | 25.30 / 18.42 / 21.32 |
| NCBI | 318 | 177 | 59.00 / 55.66 / 57.28 | 167 | 24 | 28.92 / 14.37 / 19.20 |
| DlutNLP | 352 | 193 | 64.33 / 54.83 / 59.20 | 179 | 25 | 30.12 / 13.97 / 19.08 |
| HDS4NLP | 393 | 219 | | 72 | 15 | 18.07 / 20.83 / 19.35 |
| NICTANLM | 369 | 175 | 58.33 / 47.43 / 52.32 | 177 | 21 | 25.30 / 11.86 / 16.15 |
| USheff | 252 | 156 | 52.00 / 61.90 / 56.52 | 43 | 13 | 15.66 / 30.23 / 20.63 |
| UZH | 255 | 143 | 47.67 / 56.08 / 51.53 | 0 | 0 | 00.00 / 00.00 / 00.00 |
| HCMUS | 491 | 207 | 69.00 / 42.16 / 52.34 | 75 | 19 | 22.89 / 25.33 / 24.05 |
Acronyms: GS=gold standard, P=positives, TP=true positives, R=recall, P=precision, F=f-score.
Results of KB evaluation for Q5 (Find the protein pairs of which one regulates the other) and Q6 (Find the protein pairs of which one regulates the other, transitive).
| System | Q5 (Regulation): P | Q5: TP | Q5: R / P / F | Q6 (transitive Regulation): P | Q6: TP | Q6: R / P / F |
|---|---|---|---|---|---|---|
| GS | 108 | 108 | - | 360 | 360 | - |
| EVEX | 61 | 18 | 16.67 / 29.51 / 21.30 | 197 | 126 | 35.00 / 63.96 / 45.24 |
| TEES-2.1 | 65 | 18 | 16.67 / 27.69 / 20.81 | 218 | 133 | |
| BioSEM | 45 | 14 | 12.96 / 31.11 / 18.30 | 155 | 90 | 25.00 / 58.06 / 34.95 |
| NCBI | 34 | 8 | 07.41 / 23.53 / 11.27 | 103 | 69 | 19.17 / 66.99 / 29.81 |
| DlutNLP | 69 | 20 | | 174 | 106 | 29.44 / 60.92 / 39.70 |
| HDS4NLP | 31 | 13 | 12.04 / 41.94 / 18.71 | 31 | 17 | 04.72 / 54.84 / 08.70 |
| NICTANLM | 31 | 5 | 04.63 / 16.13 / 07.19 | 112 | 64 | 17.78 / 57.14 / 27.12 |
| USheff | 18 | 5 | 04.63 / 27.78 / 07.94 | 60 | 35 | 09.72 / 58.33 / 16.67 |
| UZH | 6 | 0 | 00.00 / 00.00 / 00.00 | 21 | 8 | 02.22 / 38.10 / 04.20 |
| HCMUS | 94 | 9 | 08.33 / 09.57 / 08.91 | 156 | 33 | 09.17 / 21.15 / 12.79 |
Acronyms: GS=gold standard, P=positives, TP=true positives, R=recall, P=precision, F=f-score.
Results of KB evaluation for Q7 (Find the protein pairs of which one regulates expression of the other) and Q8 (Find the protein pairs of which one regulates expression of the other, transitive).
| System | Q7 (Regulation of Exp): P | Q7: TP | Q7: R / P / F | Q8 (transitive Regulation of Exp): P | Q8: TP | Q8: R / P / F |
|---|---|---|---|---|---|---|
| GS | 111 | 111 | - | 128 | 128 | - |
| EVEX | 52 | 38 | 34.23 / 73.08 / 46.63 | 61 | 50 | 39.06 / 81.97 / 52.91 |
| TEES-2.1 | 67 | 43 | | 77 | 56 | |
| BioSEM | 26 | 21 | 18.92 / 80.77 / 30.66 | 31 | 25 | 19.53 / 80.65 / 31.45 |
| NCBI | 37 | 25 | 22.52 / 67.57 / 33.78 | 37 | 32 | 25.00 / 86.49 / 38.79 |
| DlutNLP | 49 | 36 | 32.43 / 73.47 / 45.00 | 52 | 41 | 32.03 / 78.85 / 45.56 |
| HDS4NLP | 0 | 0 | 00.00 / 00.00 / 00.00 | 0 | 0 | 00.00 / 00.00 / 00.00 |
| NICTANLM | 42 | 24 | 21.62 / 57.14 / 31.37 | 40 | 30 | 23.44 / 75.00 / 35.71 |
| USheff | 31 | 20 | 18.02 / 64.52 / 28.17 | 29 | 20 | 15.63 / 68.97 / 25.48 |
| UZH | 10 | 2 | 01.80 / 20.00 / 03.31 | 10 | 3 | 02.34 / 30.00 / 04.35 |
| HCMUS | 38 | 16 | 14.41 / 42.11 / 21.48 | 38 | 16 | 12.50 / 42.11 / 19.28 |
Acronyms: GS=gold standard, P=positives, TP=true positives, R=recall, P=precision, F=f-score, Exp=Gene_expression.
Statistics of conversion rates.
| | | GE → GRO | GRO → GE |
|---|---|---|---|
| Entities | Convertible | 6,449 (98.1%) | 4,193 (51.9%) |
| | Non-convertible | 125 (1.9%) | 3,881 (48.1%) |
| Events | Convertible | 3,436 (92.5%) | 1,094 (25.5%) |
| | Non-convertible | 280 (7.5%) | 3,188 (74.6%) |
Most frequent GRO concepts and their ancestor concepts that correspond to Genia concepts.
| Genia concept (count) | GRO concepts and their ancestors corresponding to the Genia concept (count) |
|---|---|
| Protein (2887) | Protein (1521) |
| | Gene (482) |
| | TranscriptionFactor < TranscriptionRegulator < Protein (294) |
| | Enzyme < Protein (264) |
| | ProteinSubunit < Protein (143) |
| Regulation (289) | RegulatoryProcess (221) |
| | PositiveRegulationOfGeneExpression < RegulationOfGeneExpression < RegulatoryProcess (22) |
| | NegativeRegulationOfTranscription < RegulationOfTranscription < RegulatoryProcess (18) |
| Gene_expression (237) | GeneExpression (237) |
| Positive_regulation (229) | PositiveRegulation (229) |
| Negative_regulation (145) | NegativeRegulation (145) |
| Binding (126) | BindingToProtein (126) |
| Localization (112) | Localization (62) |
| | Transport < Localization (36) |
| | ProteinTargeting < ProteinTransport < Localization (12) |
| Transcription (105) | Transcription (83) |
| | TranscriptionOfGene < Transcription (22) |
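
The ancestor chains in this table suggest one way a GRO concept could be converted to a Genia concept: walk up the ontology until a concept with a direct Genia counterpart is reached. The sketch below illustrates that idea. It encodes only the parent links shown in the table, assumes a single parent per concept, and uses the hypothetical names `GRO_PARENT`, `GRO_TO_GENIA` and `gro_to_genia`; it is not presented as the conversion procedure actually used.

```python
# Parent links taken from the chains listed in the table (child -> parent).
GRO_PARENT = {
    "TranscriptionFactor": "TranscriptionRegulator",
    "TranscriptionRegulator": "Protein",
    "Enzyme": "Protein",
    "ProteinSubunit": "Protein",
    "PositiveRegulationOfGeneExpression": "RegulationOfGeneExpression",
    "RegulationOfGeneExpression": "RegulatoryProcess",
    "NegativeRegulationOfTranscription": "RegulationOfTranscription",
    "RegulationOfTranscription": "RegulatoryProcess",
    "Transport": "Localization",
    "ProteinTargeting": "ProteinTransport",
    "ProteinTransport": "Localization",
    "TranscriptionOfGene": "Transcription",
}

# GRO concepts with a direct Genia counterpart (from the two mapping tables).
GRO_TO_GENIA = {
    "Protein": "Protein",
    "Gene": "Protein",
    "RegulatoryProcess": "Regulation",
    "GeneExpression": "Gene_expression",
    "PositiveRegulation": "Positive_regulation",
    "NegativeRegulation": "Negative_regulation",
    "BindingToProtein": "Binding",
    "Localization": "Localization",
    "Transcription": "Transcription",
}


def gro_to_genia(concept):
    """Return the Genia concept for a GRO concept or its nearest mapped ancestor."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept in GRO_TO_GENIA:
            return GRO_TO_GENIA[concept]
        seen.add(concept)  # guard against accidental cycles in the parent map
        concept = GRO_PARENT.get(concept)
    return None  # no mapped ancestor: treated as non-convertible


print(gro_to_genia("TranscriptionFactor"))
print(gro_to_genia("ProteinTargeting"))
print(gro_to_genia("Tissue"))
```

Concepts with no mapped ancestor in this excerpt, such as Tissue, return `None`, which corresponds to the non-convertible case.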
Most frequent GRO concepts that are not convertible to Genia.
| Level 3 | Level 4 | Level 5 | Level 6 | Count |
|---|---|---|---|---|
| (under the branch of Continuant > PhysicalContinuant) | | | | |
| LivingEntity | > Organism | > Eukaryote | | 470 |
| LivingEntity | > Cell | | | 383 |
| Tissue | | | | 218 |
| MolecularEntity | > InformationBiopolymer | > NucleicAcid | > DNA | 193 |
| MolecularEntity | > InformationBiopolymer | > ProteinDomain | | 171 |
| MolecularEntity | > Chemical | > OrganicChemical | > AminoAcid | 129 |
| CellComponent | | | | 122 |
| (under the branch of Occurrent > Process) | | | | |
| Increase | | | | 92 |
| Disease | | | | 91 |
| PhysicalInteraction | > Binding | > BindingOfProteinToDNA | | 71 |
| MolecularProcess | > Pathway | > SignalingPathway | | 67 |
| Mutation | | | | 47 |
The count of the last column is the count of the concept at the lowest level in each row.
Performance changes for different GRO concepts after using the additional converted data from the GE task.
| Concept | No. of instances of concept in the original training dataset of GRO'13 | No. of instances of concept converted from the GE'13 training dataset | F-measure before conversion integration | F-measure after conversion integration | Change |
|---|---|---|---|---|---|
| GeneExpression | 221 | 748 | 58.8% | 63.7% | 4.9% |
| PositiveRegulation | 206 | 785 | 16.3% | 13.7% | −2.6% |
| RegulatoryProcess | 183 | 305 | 24.1% | 23.7% | −0.4% |
| NegativeRegulation | 124 | 512 | 16.5% | 16.1% | −0.4% |
| BindingToProtein | 126 | 201 | 32.7% | 32.1% | −0.6% |
| SignalingPathway | 66 | 0 | 54.6% | 35.3% | −19.3% |
| CellGrowth | 17 | 0 | 32.3% | 26.4% | −5.9% |
| Mutation | 45 | 0 | 23.3% | 17.9% | −5.4% |
| Disease | 91 | 0 | 19.0% | 10.1% | −8.9% |
Average number of instances per ontology concept.
| (per applicable concept) | GE | GRO |
|---|---|---|
| Average number of instances | 82 | 13 |
| Average number of convertible instances to the other task | 287 | 47 |
Changes of errors in events with at least one Gene/Protein participant by the data conversion.
| | Before conversion: FP | Before conversion: FN | After conversion: FP | After conversion: FN |
|---|---|---|---|---|
| Protein | 85 | 121 | 67 | 125 |
| Gene | 26 | 58 | 31 | 57 |
Changes of errors for frequent subconcepts of the GRO concepts that have equivalent correspondent Genia concepts, by the data conversion.
| Concept | No. of instances | Before conversion: FP | Before conversion: FN | After conversion: FP | After conversion: FN |
|---|---|---|---|---|---|
| BindingOfProteinToDNA | 55 | 42 | 45 | 34 | 46 |
| PositiveRegulationOfGeneExpression | 33 | 11 | 25 | 6 | 28 |
| Heterodimerization | 32 | 6 | 25 | 4 | 25 |
| BindingOfTranscriptionFactorToDNA | 25 | 0 | 25 | 0 | 25 |
| PositiveRegulationOfTranscription | 24 | 0 | 4 | 0 | 4 |
FP stands for false positives and FN stands for false negatives.