Juliane Fluck1, Sumit Madan2, Sam Ansari3, Alpha T Kodamullil2, Reagon Karki2, Majid Rastegar-Mojarad4, Natalie L Catlett5, William Hayes5, Justyna Szostak3, Julia Hoeng3, Manuel Peitsch3.
Abstract
Success in extracting biological relationships depends mainly on the complexity of the task and on the availability of high-quality training data. Here, we describe new corpora in the systems biology modeling language BEL, prepared for the BioCreative V BEL track, for training and testing biological relationship extraction systems. BEL was designed to capture not only relationships between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship together with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces derived mainly from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as the relationship types 'increases' and 'decreases'. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured with the BEL track evaluation environment and reached a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we provide not only the gold standard expert annotations, but also text excerpts pre-selected by two automated systems.
Those text excerpts were evaluated and manually annotated as supportive or not supportive (true or false) in the course of the BioCreative V BEL track task.
Database URL: http://wiki.openbel.org/display/BIOC/Datasets
Year: 2016 PMID: 27554092 PMCID: PMC4995071 DOI: 10.1093/database/baw113
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. An example of a BEL statement with the associated context information.
An overview of all functions and relationships used in the corpora
| Example | Example description |
|---|---|
| a(CHEBI:water) | the abundance of water |
| p(HGNC:IL6) | the abundance of human IL6 protein |
| complex(NCH:"AP-1 Complex") | the abundance of the AP-1 complex |
| complex(p(MGI:Fos), p(MGI:Jun)) | the abundance of the complex comprised of mouse Fos and Jun proteins |
| composite(p(MGI:Il13),p(MGI:Ifng)) | the abundance of Il13 and Ifng protein, together |
| g(HGNC:ERBB2) | the abundance of the ERBB2 gene (DNA) |
| m(MGI:Mir21) | the abundance of mouse Mir21 microRNA |
| r(HGNC:IL6) | the abundance of human IL6 RNA |
| p(HGNC:AKT1, pmod(P)) | the abundance of human AKT1 protein modified by phosphorylation |
| p(MGI:Rela, pmod(A, K)) | the abundance of mouse Rela protein acetylated at an unspecified lysine |
| p(HGNC:HIF1A, pmod(H, N, 803)) | the abundance of human HIF1A protein hydroxylated at asparagine 803 |
| p(HGNC:PIK3CA, sub(E, 545, K)) | the abundance of the human PIK3CA protein in which glutamic acid 545 has been substituted with lysine |
| p(HGNC:ABCA1, trunc(1851)) | the abundance of human ABCA1 protein that has been truncated at amino acid residue 1851 via introduction of a stop codon |
| p(HGNC:BCR, fus(HGNC:JAK2, 1875, 2626)) | the abundance of a fusion protein of the 5' partner BCR and 3' partner JAK2, with the breakpoint for BCR at 1875 and JAK2 at 2626 |
| p(HGNC:BCR, fus(HGNC:JAK2)) | the abundance of a fusion protein of the 5' partner BCR and 3' partner JAK2 |
| deg(r(HGNC:MYC)) | the degradation of human MYC RNA |
| sec(p(MGI:Il6)) | the secretion of mouse Il6 protein |
| surf(p(RGD:Fas)) | the cell surface expression of rat Fas protein |
| tloc(p(HGNC:NFE2L2), MESHCL:Cytoplasm, MESHCL:"Cell Nucleus") | the event in which human NFE2L2 protein is translocated from the cytoplasm to the nucleus |
| rxn(reactants(a(CHEBI:"leukotriene D4")), products(a(CHEBI:"leukotriene E4"))) | the reaction in which the reactant leukotriene D4 is converted into leukotriene E4 |
| act(p(RGD:Sod1)) | the activity of rat Sod1 protein |
| cat(p(RGD:Sod1)) | the catalytic activity of rat Sod1 protein |
| chap(p(HGNC:CANX)) | the chaperone activity of the human CANX (Calnexin) protein |
| gtp(p(PFH:"RAS Family")) | the GTP-bound activity of RAS Family protein |
| kin(p(HGNC:CHEK1)) | the kinase activity of the human protein CHEK1 |
| pep(p(RGD:Ace)) | the peptidase activity of the rat angiotensin converting enzyme (ACE) |
| phos(p(HGNC:DUSP1)) | the phosphatase activity of human DUSP1 protein |
| ribo(p(HGNC:PARP1)) | the ribosylation activity of human PARP1 protein |
| tscript(p(MGI:Trp53)) | the transcriptional activity of mouse TRP53 (p53) protein |
| tport(complex(NCH:"ENaC Complex")) | the ion transport activity of the epithelial sodium channel (ENaC) complex |
| bp(GO:"cellular senescence") | the biological process cellular senescence |
| path(MESHD:Atherosclerosis) | the pathology Atherosclerosis |
| cat(p(MGI:Crk)) increases p(MGI:Bcar1,pmod(P)) | the catalytically active form of the mouse protein Crk induces phosphorylation of the mouse protein Bcar1 |
| cat(p(MGI:Crk)) directlyIncreases p(MGI:Bcar1,pmod(P)) | the catalytically active form of the mouse protein Crk induces phosphorylation of the mouse protein Bcar1 |
| p(HGNC:TIMP2) decreases cat(p(HGNC:MMP2)) | the protein TIMP2 decreases the catalytic activity of MMP2 |
| p(HGNC:TIMP2) directlyDecreases cat(p(HGNC:MMP2)) | the protein TIMP2 decreases the catalytic activity of MMP2 |
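The relationship rows at the end of the table share one shape: a subject term, a relationship keyword, and an object term. As a minimal sketch, assuming only the four relationship keywords shown in the table and ignoring nested statements, such a statement could be split as follows. The function name `split_statement` is hypothetical and is not part of any BEL tooling.

```python
# Relationship keywords taken from the table above; real BEL defines more.
RELATIONS = {"increases", "decreases", "directlyIncreases", "directlyDecreases"}


def split_statement(statement):
    """Split 'subject relation object' at the first top-level relation keyword.

    Spaces inside parentheses or double quotes are never treated as
    separators, so terms such as p(HGNC:AKT1, pmod(P)) stay intact.
    """
    depth = 0          # current parenthesis nesting level
    in_quotes = False  # True while inside a quoted namespace value
    token_start = 0    # index where the current whitespace-delimited token began
    for i, ch in enumerate(statement):
        if ch == '"':
            in_quotes = not in_quotes
        elif in_quotes:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == " ":
            token = statement[token_start:i]
            if depth == 0 and token in RELATIONS:
                subject = statement[:token_start].strip()
                obj = statement[i:].strip()
                return subject, token, obj
            token_start = i + 1
    raise ValueError("no top-level relationship keyword found")


subject, relation, obj = split_statement(
    "cat(p(MGI:Crk)) increases p(MGI:Bcar1,pmod(P))"
)
print(subject, "|", relation, "|", obj)
```

Tracking parenthesis depth rather than splitting on whitespace keeps function arguments that contain spaces in one piece.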
An overview of all corpora summaries
| Purpose | Selection criterion | Content | No. unique sentences | No. statements |
|---|---|---|---|---|
| Source corpus for selection of the different corpora | Automatic filtering of expert generated BEL nanopubs | positive examples | 18 224 | 29 484 |
| Training for task 1 and 2 BioCreative V BEL track | Randomly selected from BEL base corpus | positive examples | 6353 | 11 066 |
| Evaluation for task 1 during the training phase of the BioCreative V BEL track | Randomly selected from BEL base corpus; reannotated | positive examples | 183 | 354 |
| Provided during test phase BioCreative V BEL track, task 1 | Randomly selected from BEL base corpus | | 105 | |
| Gold standard for evaluation of BioCreative V BEL track, task 1 | Randomly selected from BEL base corpus; reannotated for task 1 evaluation | positive examples | 105 | 202 |
| Provided during test phase BioCreative V BEL track, task 2 | Randomly selected from BEL base corpus; reannotated; unpublished | | | 100 |
| Annotated for future training of task 2; not available during BioCreative V | Predicted by two different systems, one prediction done by a system participating in the BioCreative BEL track task 2 | positive and negative examples | 806 (fully supportive: 316 TP/490 FP; partially supportive: 429 TP/377 FP) | 100 |
Figure 2. The workflow of the first method employing the semantic search engine SCAIView.
Figure 3. The workflow of the second method implemented by Rastegar-Mojarad et al. (44).
Figure 4. An example of the BEL_Extraction sample corpus.
Distributions of term, function, and relationship types in the BEL_Extraction corpora
| Type | Training | Sample | Test | Total |
|---|---|---|---|---|
| p | 19 918 | 497 | 346 | 20 761 |
| a | 1927 | 79 | 37 | 2043 |
| bp | 877 | 79 | 31 | 987 |
| path | 244 | 54 | 15 | 313 |
| act | 6332 | 0 | 36 | 6368 |
| pmod | 1411 | 24 | 9 | 1444 |
| complex | 750 | 26 | 15 | 791 |
| tloc | 406 | 11 | 13 | 430 |
| deg | 205 | 18 | 6 | 229 |
| sub | 23 | 0 | 0 | 23 |
| trunc | 6 | 0 | 0 | 6 |
| increases | 8112 | 221 | 155 | 8488 |
| decreases | 2956 | 84 | 53 | 3093 |
Distribution of positive and negative sentences in the BEL_Sentence_Classification corpus
| Class | True | False | Total |
|---|---|---|---|
| Fully supportive | 578 | 976 | 1554 |
| Partially supportive | 804 | 750 | 1554 |
A relationship is annotated as fully supportive if the BEL statement is fully expressed in the text excerpt, that is, the excerpt contains all information needed to extract the corresponding statement.
The category partially supportive is always valid if the fully supportive value is true. In addition, a relationship is annotated as partially supportive if the statement can only be extracted by taking biological background knowledge or contextual details into account.
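The two definitions above imply a simple rule: an entry marked fully supportive must also be marked partially supportive. A minimal sketch of a consistency check for such annotations follows. The `Annotation` class and its field names are assumptions made for illustration and do not reflect the corpus file format.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One annotated (BEL statement, text excerpt) pair; hypothetical layout."""
    fully_supportive: bool
    partially_supportive: bool


def is_consistent(annotation):
    """Fully supportive implies partially supportive."""
    return annotation.partially_supportive or not annotation.fully_supportive


# Hypothetical entries, not taken from the corpus.
entries = [
    Annotation(fully_supportive=True, partially_supportive=True),
    Annotation(fully_supportive=False, partially_supportive=True),
    Annotation(fully_supportive=False, partially_supportive=False),
    Annotation(fully_supportive=True, partially_supportive=False),  # violates the rule
]
violations = [e for e in entries if not is_consistent(e)]
print(len(violations))
```

The same rule explains why the count of true partially supportive entries can never be lower than the count of true fully supportive entries.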
IAA (F-score) of 40 randomly chosen BEL statements from the BEL_Extraction sample corpus
| Class | | | |
|---|---|---|---|
| | 97.35 | 94.92 | 95.73 |
| | 93.33 | 97.32 | 93.33 |
| | 95.24 | 97.48 | 95.24 |
| | 98.55 | 93.33 | 91.89 |
| | 97.14 | 86.49 | 89.19 |
| Full statement | 91.18 | 85.33 | 83.78 |
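An F-score between two annotators can be obtained by treating one annotator's statements as the reference and the other's as the candidate. The sketch below assumes exact string matching on whole statements, which is a simplification: the BEL track evaluation environment used for the values above is not reproduced here. The example sets are hypothetical.

```python
def f_score(reference, candidate):
    """Balanced F-score of the candidate statements against the reference statements."""
    reference, candidate = set(reference), set(candidate)
    overlap = len(reference & candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)  # share of candidate statements also in the reference
    recall = overlap / len(reference)     # share of reference statements also in the candidate
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for the same excerpts; the annotators agree on two statements.
annotator_a = {
    "cat(p(MGI:Crk)) increases p(MGI:Bcar1,pmod(P))",
    "p(HGNC:TIMP2) decreases cat(p(HGNC:MMP2))",
    "p(HGNC:IL6) increases bp(GO:\"cellular senescence\")",
}
annotator_b = {
    "cat(p(MGI:Crk)) increases p(MGI:Bcar1,pmod(P))",
    "p(HGNC:TIMP2) decreases cat(p(HGNC:MMP2))",
    "p(HGNC:IL6) decreases bp(GO:\"cellular senescence\")",
}
print(round(100 * f_score(annotator_a, annotator_b), 2))
```

Because the balanced F-score is symmetric in precision and recall, swapping the two annotators gives the same value.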
IAA (observed agreement and kappa score) of 30 randomly chosen entries from the BEL_Sentence_Classification corpus
| Class | | | |
|---|---|---|---|
| Fully supportive | 90.0% (Kappa: 0.80) | 86.7% (Kappa: 0.72) | 75.0% (Kappa: 0.44) |
| Partially supportive | 86.7% (Kappa: 0.71) | 93.3% (Kappa: 0.86) | 85.0% (Kappa: 0.69) |
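Observed agreement and Cohen's kappa for a binary label can be computed from the two annotators' label lists. The sketch below uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. The labels are hypothetical and are not the actual annotations.

```python
def cohen_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two equally long label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both annotators pick the same class at random.
    classes = set(labels_a) | set(labels_b)
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    if p_expected == 1.0:
        return p_observed, 1.0  # both annotators used a single identical class
    return p_observed, (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for 30 entries; the annotators disagree on 3 of them.
annotator_a = [True] * 15 + [False] * 15
annotator_b = [True] * 13 + [False] * 2 + [False] * 14 + [True] * 1
agreement, kappa = cohen_kappa(annotator_a, annotator_b)
print(round(agreement, 3), round(kappa, 2))
```

Kappa falls below the observed agreement because part of that agreement would be expected even if both annotators labeled at random with the same class frequencies.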