Justyna Szostak, Sam Ansari, Sumit Madan, Juliane Fluck, Marja Talikka, Anita Iskandar, Hector De Leon, Martin Hofmann-Apitius, Manuel C Peitsch, Julia Hoeng.
Abstract
Capture and representation of scientific knowledge in a structured format are essential to improve the understanding of biological mechanisms involved in complex diseases. Biological knowledge and knowledge about standardized terminologies are difficult to capture from literature in a usable form. A semi-automated knowledge extraction workflow is presented that was developed to allow users to extract causal and correlative relationships from scientific literature and to transcribe them into the computable and human-readable Biological Expression Language (BEL). The workflow combines state-of-the-art linguistic tools for recognition of various entities and extraction of knowledge from literature sources. Unlike most other approaches, the workflow outputs the results to a curation interface for manual curation and converts them into BEL documents that can be compiled to form biological networks.

We developed a new semi-automated knowledge extraction workflow that was designed to capture and organize scientific knowledge and reduce the required curation skills and effort for this task. The workflow was used to build a network that represents the cellular and molecular mechanisms implicated in atherosclerotic plaque destabilization in an apolipoprotein-E-deficient (ApoE(-/-)) mouse model. The network was generated using knowledge extracted from the primary literature. The resultant atherosclerotic plaque destabilization network contains 304 nodes and 743 edges supported by 33 PubMed-referenced articles. A comparison between the semi-automated and conventional curation processes showed similar results, but significantly reduced curation effort for the semi-automated process.

Creating structured knowledge from unstructured text is an important step for the mechanistic interpretation and reusability of knowledge. Our new semi-automated knowledge extraction workflow reduced the curation skills and effort required to capture and organize scientific knowledge. The atherosclerotic plaque destabilization network that was generated is a causal network model for vascular disease, demonstrating the usefulness of the workflow for knowledge extraction and construction of mechanistically meaningful biological networks.
Year: 2015 PMID: 26200752 PMCID: PMC5630939 DOI: 10.1093/database/bav057
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overall workflow used to create the atherosclerosis plaque destabilization network. The knowledge extraction workflow contains six steps. Step 1: Selection of articles that represent the specific biological context from which the knowledge base was constructed. Step 2: Text processing with special character clearance and line break corrections to ensure proper machine parsing. Step 3: Text mining pipeline with automated named entity recognition (NER), relationship extraction and coding into a BEL-compliant syntax. Step 4: Domain expert curation in the curation interface, based on the automatically created and proposed BEL statements. Step 5: Validation and transformation of BEL statements into knowledge network models. Step 6: Visualization of the knowledge network model.
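The text-processing step (Step 2) can be illustrated with a minimal Python sketch. The source does not describe the actual cleanup rules, so the function name `clean_text` and every rule in it (Unicode normalization, control-character removal, rejoining words hyphenated across line breaks) are assumptions chosen for illustration, not the authors' implementation.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical Step 2 cleanup: normalize characters and repair line breaks."""
    # Normalize Unicode compatibility forms (e.g. the "fi" ligature) to plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control characters other than newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Rejoin words hyphenated across a line break: "desta-\nbilization" -> "destabilization".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph boundaries; collapse other whitespace into single spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = "plaque desta-\nbilization in\nApoE mice\x0c\n\nNext paragraph"
    print(clean_text(sample))
```

A rule such as the hyphen rejoin is a trade-off: it repairs words split by typesetting but would also merge a genuine hyphenated compound that happens to break at a line end, so a real pipeline would likely need a dictionary check.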
Evaluation of the efficiency of BEL text mining.
| Dictionary | Overall entity count in gold standard | Recall rate initial version (%) | Recall rate final version (%) |
|---|---|---|---|
| Genes/protein (HGNC) | 1673 | 80 | 93 |
| Chemical compounds (ChEBI) | 218 | 15 | 66 |
| Chemical compounds (SCHEM) | 575 | 30 | 75 |
| Chemical compounds (ChEBI + SCHEM + ChEMBL) | 793 | Not determined | 91 |
| Selventa-human-complex | 45 | 40 | 46 |
| GO-complex | 45 | Not determined | 64 |
| Selventa-human-complex + Complex | 45 | Not determined | 82 |
| GO-function | 66 | 22 | Not determined |
| Selventa-human-families | 66 | 8 | 77 |
The recall rate statistics (i.e. the numbers of existing BEL names that were detected) are shown as well as the counts in the manually annotated reference corpus composed of 1348 sentences and 2577 annotations.
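As a rough illustration of how a recall rate like those in the table is computed, the sketch below compares a set of detected entity names against a gold-standard set. The function name `recall_percent` and the toy entity sets are hypothetical; the source does not give the matching criteria used in the actual evaluation.

```python
def recall_percent(detected: set[str], gold: set[str]) -> float:
    """Share of gold-standard entities that were detected, in percent."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    # Only detections that also appear in the gold standard count toward recall.
    return 100.0 * len(detected & gold) / len(gold)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the reference corpus.
    gold = {"VEGFA", "MMP9", "TIMP1", "CCL2"}
    detected = {"VEGFA", "MMP9", "XYZ"}
    print(recall_percent(detected, gold))
```

Recall ignores false positives (here `XYZ`), which fits a workflow where a curator reviews and rejects wrong proposals afterwards.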
Figure 2. Screenshot of the knowledge extraction curation interface.
Content of the two networks created with the semi-automated and manual knowledge extraction processes.
| Biological entities | Semi-automated knowledge extraction | Manual knowledge extraction |
|---|---|---|
| Number of statements | 234 | 191 |
| Number of annotations | 112 | 46 |
| Overall curation time [min] | 395 | 613 |
| Curation time per statement [min] | 1.7 | 3.2 |
| Curation time per statement + annotation [min] | 1.2 | 2.6 |
| Number of nodes | 149 | 145 |
| Number of edges | 285 | 251 |
The semi-automated and manually created networks contain 149 and 145 nodes that are connected by 285 and 251 edges, respectively. Overall curation time, time per statement, and time per statement and annotation were calculated. The topological analysis of the degree distribution of both networks showed that the most highly connected nodes (degree 13 to 19) were similar, namely VEGFA, HMGB1, THBS1, FGF2, CYP4A11, Col4a3 and ABL1. Only two of the connected nodes, 20-HETE and Tumstatin, were different in the two networks.
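The per-statement curation times in the table follow from dividing overall curation time by the number of statements. The short sketch below reproduces that arithmetic from the table values; the variable and function names are illustrative, and the relative time saving it prints is derived here rather than reported in the source.

```python
# Values copied from the table above: statement counts and overall curation time.
curation = {
    "semi-automated": {"statements": 234, "minutes": 395},
    "manual": {"statements": 191, "minutes": 613},
}


def minutes_per_statement(entry: dict) -> float:
    """Overall curation time divided by the number of curated statements."""
    return entry["minutes"] / entry["statements"]


per_statement = {
    name: round(minutes_per_statement(entry), 1)
    for name, entry in curation.items()
}

# Relative reduction in overall curation time (derived, not stated in the source).
time_saving = 1 - curation["semi-automated"]["minutes"] / curation["manual"]["minutes"]

if __name__ == "__main__":
    print(per_statement)
    print(f"overall time saving: {time_saving:.1%}")
```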
Summary of the contents of the atherosclerosis plaque destabilization network.
| BEL function | Number of nodes |
|---|---|
| Abundance (chemicals or lipids) | 33 |
| Protein abundance | 114 |
| RNA abundance | 42 |
| Gene abundance | 1 |
| Complex abundance | 17 |
| Composite abundance | 4 |
| Molecular activity | 2 |
| Peptidase activity | 8 |
| Kinase activity | 6 |
| Catalytic activity | 4 |
| Transcriptional activity | 1 |
| Degradation | 4 |
| Cell secretion | 12 |
| Biological process | 43 |
| Pathology | 13 |
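As a consistency check, the node counts per BEL function can be summed and compared with the 304 nodes reported in the abstract. The sketch below simply copies the table values into a dictionary; the variable names are illustrative.

```python
# Node counts per BEL function, copied from the table above.
node_counts = {
    "Abundance (chemicals or lipids)": 33,
    "Protein abundance": 114,
    "RNA abundance": 42,
    "Gene abundance": 1,
    "Complex abundance": 17,
    "Composite abundance": 4,
    "Molecular activity": 2,
    "Peptidase activity": 8,
    "Kinase activity": 6,
    "Catalytic activity": 4,
    "Transcriptional activity": 1,
    "Degradation": 4,
    "Cell secretion": 12,
    "Biological process": 43,
    "Pathology": 13,
}

total_nodes = sum(node_counts.values())

# Group all "... activity" functions to see how many nodes describe activities.
activity_nodes = sum(
    count for name, count in node_counts.items() if name.endswith("activity")
)

if __name__ == "__main__":
    print(f"total nodes: {total_nodes}")
    print(f"activity nodes: {activity_nodes}")
    print(f"protein abundance share: {node_counts['Protein abundance'] / total_nodes:.1%}")
```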
Figure 3. Atherosclerosis plaque destabilization network showing the degree distribution of nodes. Biological entities or nodes. Pink circles indicate the 10 most connected nodes (degree 41 to 28), defined as hubs. From left to right, the hub nodes are Ile2-angiotensin II (1-7), plaque destabilization, ETS2, CCL2, TIMP1, MMP9, atherosclerosis, COL1A1, atherogenesis and CD40LG. Relationships or edges. Gray lines with arrows indicate positive causal relationships; fine dotted lines with Ts indicate negative causal relationships; gray sine waves indicate correlative relationships; and fine dotted lines indicate non-causal relationships.
Figure 4. Part of the network showing CD40LG and its interactions with other hub nodes. (A) Biological entities or nodes. CD40LG is indicated in red, and the nodes that are regulated by CD40LG are indicated in blue. An example of evidence extracted from Inoue et al. (53) (PMID: 12438296) with the semi-automated extraction workflow and the associated BEL statement are given in the two boxes on the bottom left of the figure. Square, RNA abundance; triangle, protein abundance; V shape, protein activity; hexagon, complex; diamond, secretion. (B) Relationships or edges. Lines with dark arrows indicate positive causal relationships; lines with dark Ts indicate negative causal relationships; black sine waves indicate correlative relationships; and black dotted lines indicate non-causal relationships.