Charles Tapley Hoyt1,2, Daniel Domingo-Fernández1,2, Rana Aldisi1,2, Lingling Xu1,2, Kristian Kolpeja1, Sandra Spalek1, Esther Wollert1, John Bachman3, Benjamin M Gyori3, Patrick Greene3, Martin Hofmann-Apitius1,2.
Abstract
The rapid accumulation of new biomedical literature not only causes curated knowledge graphs (KGs) to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich KGs. We have developed two workflows: one for re-curating a given KG to assure its syntactic and semantic quality and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the KGs encoded in Biological Expression Language from the NeuroMMSig database using content that was pre-extracted from MEDLINE abstracts and PubMed Central full-text articles using text mining output integrated by INDRA. We have made this workflow freely available at https://github.com/bel-enrichment/bel-enrichment.
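As a rough illustration of what prioritizing "nodes with low information density" could look like, the sketch below ranks the nodes of a toy graph by how many edges touch them. Using node degree as a stand-in for information density is an assumption made here for illustration, and the function name `rank_low_degree_nodes` and the edge list are hypothetical; neither is taken from the bel-enrichment code.

```python
from collections import Counter


def rank_low_degree_nodes(edges, top_n=3):
    """Return the nodes with the fewest incident edges, lowest first.

    Assumption: node degree is used as a simple proxy for information density.
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Sort by degree, then by name so that ties are broken deterministically.
    return sorted(degree, key=lambda node: (degree[node], node))[:top_n]


# Toy, hypothetical edge list; it is not taken from NeuroMMSig.
toy_edges = [
    ("APP", "MAPT"),
    ("APP", "GSK3B"),
    ("MAPT", "GSK3B"),
    ("GSK3B", "DKK1"),
    ("APP", "BACE1"),
]

candidates = rank_low_degree_nodes(toy_edges, top_n=2)
print(candidates)
```

The nodes returned first are the ones a curator might look up in the pre-extracted text mining output before any others.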
Year: 2019 PMID: 31225582 PMCID: PMC6587072 DOI: 10.1093/database/baz068
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Statistics for the number of BEL nodes and BEL statements in the 10 KGs selected from the NeuroMMSig inventory before re-curation (using the version last updated on 6 December 2016), after re-curation and after enrichment
| Label | Description | Nodes (before re-curation) | Edges (before re-curation) | Nodes (after re-curation) | Edges (after re-curation) | Nodes (after enrichment) | Edges (after enrichment) |
|---|---|---|---|---|---|---|---|
| Tau protein subgraph | The downstream effects of the post-translational modification, aggregation and transport of the Tau protein | 191 | 493 | 261 | 733 | 708 | 2054 |
| DKK1 subgraph, GSK3 subgraph | The interaction partners of GSK-3β and its targets of post-translational modification. The complementary DKK1 pathway is a specific signaling cascade upstream of GSK-3β | 128 | 254 | 174 | 377 | 376 | 1165 |
| Inflammatory response | Processes related to inflammation in the context of Alzheimer's disease | 182 | 373 | 341 | 743 | 2003 | 7607 |
| Insulin signal transduction | The molecular relationships between insulin resistance and inflammation, motivated by epidemiological studies that suggested a correlation between Alzheimer's disease (AD) and type II diabetes | 251 | 739 | 315 | 881 | 612 | 1973 |
| Amyloidogenic subgraph | The downstream effects of the amyloid precursor protein (APP), its protein modifiers and its cleavage products | 493 | 1223 | 652 | 1751 | 2090 | 7436 |
| Non-amyloidogenic subgraph | Chemicals and processes known to down-regulate the expression of the transcript corresponding to APP or the abundance of the APP protein | 195 | 359 | 325 | 635 | 795 | 2238 |
| Apoptosis and cell death | Processes relevant to AD that result in apoptosis including the Caspase subgraph, XIAP subgraph and Complement system subgraph | 104 | 143 | 170 | 229 | 1065 | 2401 |
| Acetylcholine subgraph | Pathways including biological entities and processes related to cholinergic neurons and acetylcholine transmission | 106 | 197 | 148 | 337 | 423 | 1275 |
| GABA subgraph | Pathways including biological entities and processes related to GABAergic neurons and GABA transmission | 21 | 30 | 91 | 190 | 305 | 721 |
| Reactive oxygen species subgraph | The effects of reactive oxygen species, including the Myeloperoxidase subgraph, Hydrogen peroxide subgraph, Free radical formation subgraph and Nitric oxide subgraph | 104 | 173 | 126 | 224 | 1401 | 6277 |
| Total | | 1188 | 3529 | 1704 | 5391 | 5850 | 23 811 |
Later, we discuss these statistics in terms of INDRA statements—the discrepancies are due to the ontological reasoner applied in the conversion process from INDRA statements to BEL statements.
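To make the growth reported in the table easier to compare across subgraphs, the following minimal sketch computes the fold change in edge counts between stages. The counts are copied from three rows of the table; the dictionary layout and the helper name `edge_fold_change` are illustrative choices and not part of the published workflow.

```python
# (nodes, edges) at each stage, copied from three rows of the table above.
stats = {
    "Tau protein subgraph": {
        "before": (191, 493),
        "recurated": (261, 733),
        "enriched": (708, 2054),
    },
    "GABA subgraph": {
        "before": (21, 30),
        "recurated": (91, 190),
        "enriched": (305, 721),
    },
    "Total": {
        "before": (1188, 3529),
        "recurated": (1704, 5391),
        "enriched": (5850, 23811),
    },
}


def edge_fold_change(row, start, end):
    """Return the ratio of edge counts between two curation stages."""
    return row[end][1] / row[start][1]


for label, row in stats.items():
    recuration = edge_fold_change(row, "before", "recurated")
    enrichment = edge_fold_change(row, "before", "enriched")
    print(f"{label}: x{recuration:.2f} after re-curation, x{enrichment:.2f} after enrichment")
```

Because the same node can appear in several subgraphs, the totals are not simple column sums, so the ratio is computed on the reported total row rather than on a recomputed one.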
Confidence annotations using the Likert scale for re-curation
| Confidence | Rationale |
|---|---|
| None | If the evidence string is nonsense or contains no reasonable biological knowledge, delete it and the related statements entirely. It is okay to remove BEL statements that are not supported. |
| Low | If it is not clear which BEL statement should represent the biology, add SET Confidence = "Low" for later discussion. |
| Medium | If the statement is wrong, fix it and add the annotation SET Confidence = "Medium". |
| High | If the statement can be asserted from the given evidence, add the annotation SET Confidence = "High". |
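The decision rule in the table can be sketched as a small helper that turns a curator's judgment into the annotation line to add to a BEL script. The function name `confidence_annotation` is hypothetical and the helper only covers the annotation text itself; it is an illustration under those assumptions, not the tooling used by the authors.

```python
# Levels that lead to an annotation; "None" means the statement is deleted instead.
ANNOTATED_LEVELS = {"Low", "Medium", "High"}


def confidence_annotation(level):
    """Return the BEL annotation line for a confidence level.

    Returns None for the level "None", because in that case the evidence
    and its statements are removed rather than annotated.
    """
    if level == "None":
        return None
    if level not in ANNOTATED_LEVELS:
        raise ValueError(f"Unknown confidence level: {level!r}")
    return f'SET Confidence = "{level}"'


for level in ["None", "Low", "Medium", "High"]:
    print(level, "->", confidence_annotation(level))
```

Keeping "None" as an explicit case makes the deletion rule visible in code, so that an unrecognized level fails loudly instead of silently dropping a statement.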
Figure 1. A workflow for syntactic quality assessment. This figure can be found on FigShare at https://doi.org/10.6084/m9.figshare.7643006.v1.
Figure 2. A workflow for the rational enrichment of knowledge graphs. This figure can be found on FigShare at https://doi.org/10.6084/m9.figshare.7642964.v1.
Figure 3. (a) Recovered BEL statements per minute. Note that the time reported here includes the time invested in annotating the statement as well as INDRA errors. (b) A comparison of the curation effort between genes for which INDRA had high accuracies (top 20) and genes presenting low accuracies (bottom 20).
Figure 4. (a) The distribution of the accuracies in triple identification by INDRA for each gene. X-axis: Correct statements (%). Y-axis: Number of genes (frequency). (b) Distribution of recovered statements after curation (mean: 74.63%).
Figure 5. The frequencies of common errors found while curating BEL statements generated from 113 genes. Further details about each error type and the annotation process are given in the guidelines at https://github.com/pharmacome/curation/blob/master/indra-errors.rst.
Examples of errors that resulted in suggestions for improvements for the underlying relation extraction systems
| Gene | Evidence | Issue | Suggestion |
|---|---|---|---|
| MRC1 | In conclusion, these results suggest that BCR and ABL kinase abrogates MMR activity to inhibit apoptosis and induce mutator phenotype. | MRC1, also known as MMR, was confused with mismatch repair (MMR) | Machine learning methods generating contextual word embeddings could be used to improve the named entity recognition component such as NeuralCoref. |
| TIMP1 | In our work, the restoration of cholesterol efflux capacities from EPA-enriched human monocyte-derived macrophages (HMDM) treated with both the adenylate cyclase activator forskolin and the phosphodiesterase inhibitor IBMX strongly suggests that EPA decreased the ABCA1 mediated cholesterol efflux from HMDM through a PKA dependent pathway. | TIMP1, also known as EPA, was confused with eicosapentaenoic acid (EPA) | Improve the named entity recognition (disambiguation) process, for example, by updating synonym dictionaries in rule-based systems. |
| TRPV1 | Moreover, recently TRPV1 has been demonstrated to be either inhibited or activated by PIP 2. | Only the inhibition relationship was extracted | Rule-based relation extraction systems could be appended with new rules to handle sentences with multiple objects. This and similar examples could be included in the training data for machine learning-based relation extraction. |
| NUMB | This interaction is mediated by the NPXY motif of LNX1 and leads to ubiquitination of Numb by the RING domain of LNX1, thereby targeting Numb to proteasomal degradation. | The complex sentence structure of the 'ubiquitination' and 'targeting' events was not resolved properly, and the ubiquitination was omitted. | Rule-based systems like REACH that explicitly handle ubiquitination events could be appended with new rules. |
| USF2 | Taken together, the results shown in | Relation should be treated as an indirect, rather than direct, increase | Update the INDRA PybelAssembler to make use of information about whether a relation is mediated through physical contact. |
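The MRC1 and TIMP1 rows both describe a gene synonym that collides with an unrelated term. One possible way to surface such collisions before text mining is sketched below. The two small dictionaries are hypothetical stand-ins built only from the examples in the table, and the function `ambiguous_synonyms` is an illustrative helper; it does not represent how INDRA or its underlying readers perform disambiguation.

```python
# Hypothetical gene synonym dictionary, built from the examples in the table.
gene_synonyms = {
    "MRC1": {"MMR"},
    "TIMP1": {"EPA"},
    "TRPV1": set(),
}

# Hypothetical lookup of the same abbreviations used for non-gene concepts.
other_meanings = {
    "MMR": "mismatch repair",
    "EPA": "eicosapentaenoic acid",
}


def ambiguous_synonyms(gene_synonyms, other_meanings):
    """Map each gene to the synonyms that also name a non-gene concept."""
    flagged = {}
    for gene, synonyms in gene_synonyms.items():
        clashes = {s: other_meanings[s] for s in synonyms if s in other_meanings}
        if clashes:
            flagged[gene] = clashes
    return flagged


flagged = ambiguous_synonyms(gene_synonyms, other_meanings)
for gene, clashes in flagged.items():
    for synonym, meaning in clashes.items():
        print(f"{gene}: synonym {synonym} may also mean {meaning}")
```

A list like this could be handed to curators as a set of genes whose extracted statements deserve a closer look, without changing the extraction systems themselves.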