Literature DB >> 32460703
Title: Extracting chemical reactions from text using Snorkel
Emily K Mallory, Matthieu de Rochemonteix, Alex Ratner, Ambika Acharya, Chris Ré, Roselie A Bright, Russ B Altman.
Abstract
BACKGROUND: Enzymatic and chemical reactions are key for understanding biological processes in cells. Curated databases of chemical reactions exist, but they struggle to keep up with the exponential growth of the biomedical literature. Conventional text mining pipelines provide tools to automatically extract entities and relationships from the scientific literature and can partially replace expert curation, but such machine learning frameworks often require a large amount of labeled training data and thus lack scalability, both for larger document corpora and for new relationship types.
Entities:
Keywords: Chemical reactions; Curation; Database; Snorkel; Text mining
MeSH:
Year: 2020 PMID: 32460703 PMCID: PMC7251675 DOI: 10.1186/s12859-020-03542-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of the Snorkel pipeline. First, users input a text corpus. Snorkel extracts relationships of interest by (1) detecting co-occurring entities (i.e., relationship candidates), (2) applying labeling functions to automatically label noisy training examples, and (3) training generative and discriminative machine learning models using the labeling functions and general features to predict which candidates are true relationships. The output is a binary prediction of a true relationship for each relationship candidate.
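To make the three steps concrete, here is a minimal sketch of such a pipeline using the current snorkel package (v0.9+) API. The paper used an earlier Snorkel release, so the toy candidate table, the example labeling functions, and the training settings below are illustrative assumptions rather than the authors' code.

```python
# Minimal sketch of the pipeline in Fig. 1 using the modern snorkel API
# (v0.9+). The candidate table and labeling functions are toy stand-ins.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, FALSE, TRUE = -1, 0, 1

# Step 1: co-occurring chemical pairs detected in text (toy examples).
df = pd.DataFrame({
    "sentence": [
        "Glucose is converted to pyruvate during glycolysis.",
        "Cells were grown on glucose or fructose.",
        "Pyruvate yields acetyl-CoA via pyruvate dehydrogenase.",
    ],
    "substrate": ["glucose", "glucose", "pyruvate"],
    "product": ["pyruvate", "fructose", "acetyl-CoA"],
})

# Step 2: labeling functions vote TRUE/FALSE/ABSTAIN on each candidate.
@labeling_function()
def lf_keyword_context(x):
    return TRUE if "converted" in x.sentence.lower() else ABSTAIN

@labeling_function()
def lf_sep_or(x):
    pair = f"{x.substrate} or {x.product}"
    return FALSE if pair in x.sentence.lower() else ABSTAIN

@labeling_function()
def lf_keyword_yields(x):
    return TRUE if "yields" in x.sentence.lower() else ABSTAIN

applier = PandasLFApplier(lfs=[lf_keyword_context, lf_sep_or, lf_keyword_yields])
L_train = applier.apply(df)  # noisy label matrix: candidates x LFs

# Step 3: the generative label model denoises the votes into per-candidate
# probabilities; a discriminative text classifier is then trained on these
# probabilistic labels using general features.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100, seed=123)
print(label_model.predict_proba(L_train))
```

The division of labor in step 3 is the point of the design: the discriminative model is trained on general text features, so it can score candidates that no labeling function covers.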
Number of abstracts and candidates in MetaCyc_Corpus and Bacteria_Corpus
| Corpus | Abstracts | Candidates |
|---|---|---|
| MetaCyc_Corpus | 1799 | 67,922 |
| Bacteria_Corpus | 873,237 | 8,936,941 |
Example labeling functions for the MetaCyc corpus; a code sketch of three of these rules follows the table
| Example labeling function | Description |
|---|---|
| LF_keyword_context | If there is a word from a given keyword list, such as …, in the candidate's context, we label TRUE |
| LF_sep_verb | If the chemicals are separated by a verb, we label TRUE |
| LF_argument_order | If the candidate product is before the candidate substrate, we label FALSE |
| LF_followed_ase | If one of the chemicals is followed by a word that ends with “ase”, we label FALSE |
| LF_sep_or | If the chemicals are separated by the word "or", we label FALSE |
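These rules are simple enough to state directly in code. Below is a sketch of three of them as plain Python predicates; the Candidate data model, its field names, and the string labels are assumptions for illustration, not the paper's implementation.

```python
# Sketch of three MetaCyc rules as plain Python predicates. The Candidate
# fields and string labels are illustrative assumptions.
import re
from dataclasses import dataclass

@dataclass
class Candidate:
    sentence: str       # sentence containing both chemical mentions
    substrate: str      # candidate substrate mention
    product: str        # candidate product mention
    substrate_pos: int  # character offset of the substrate
    product_pos: int    # character offset of the product

def lf_argument_order(c):
    # FALSE when the candidate product precedes the candidate substrate.
    return "FALSE" if c.product_pos < c.substrate_pos else "ABSTAIN"

def lf_followed_ase(c):
    # FALSE when a chemical is followed by an "-ase" word: the pair then
    # likely names an enzyme, not a reaction partner.
    for chem in (c.substrate, c.product):
        if re.search(re.escape(chem) + r"\s+\w+ase\b", c.sentence, re.I):
            return "FALSE"
    return "ABSTAIN"

def lf_sep_or(c):
    # FALSE when "or" separates the chemicals (a list of alternatives).
    lo, hi = sorted((c.substrate_pos, c.product_pos))
    return "FALSE" if re.search(r"\bor\b", c.sentence[lo:hi], re.I) else "ABSTAIN"

c = Candidate("Cells were fed glucose or fructose.", "glucose", "fructose", 15, 26)
print(lf_argument_order(c), lf_followed_ase(c), lf_sep_or(c))
```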
Examples of additional labeling functions on the Bacteria_Corpus; a sketch of the two lookup rules follows the table
| Example labeling function | Description |
|---|---|
| LF_metacyc | If the chemical reaction is already in the MetaCyc curated database, we label TRUE |
| LF_chemical_elements | If one of the chemicals is a chemical element, we label FALSE |
| LF_group | If there is a close mention of a functional chemical group, we label FALSE |
| LF_treatment | If there is mention of keywords frequently associated with clinical trials, we label FALSE |
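The two lookup rules above, LF_metacyc and LF_chemical_elements, amount to set membership tests. A sketch under that assumption follows; the sets are tiny stand-ins, whereas the paper draws reaction pairs from the MetaCyc database itself.

```python
# Sketch of LF_metacyc and LF_chemical_elements as set lookups. The sets
# here are tiny stand-ins for the real resources.
KNOWN_REACTIONS = {("glucose", "glucose-6-phosphate"), ("atp", "adp")}
ELEMENTS = {"hydrogen", "carbon", "nitrogen", "oxygen", "iron", "zinc"}

def lf_metacyc(substrate, product):
    # Distant supervision: pairs already curated in MetaCyc are TRUE.
    key = (substrate.lower(), product.lower())
    return "TRUE" if key in KNOWN_REACTIONS else "ABSTAIN"

def lf_chemical_elements(substrate, product):
    # Bare element names are unlikely reaction substrates/products here.
    if substrate.lower() in ELEMENTS or product.lower() in ELEMENTS:
        return "FALSE"
    return "ABSTAIN"

print(lf_metacyc("ATP", "ADP"))              # TRUE
print(lf_chemical_elements("iron", "heme"))  # FALSE
```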
Sizes and gold label statistics of the three splits for the MetaCyc_Corpus
| Split | Abstracts | Candidates | Positives | Docs w. candidates | Docs w. positives |
|---|---|---|---|---|---|
| MetaCyc_Train | 1753 | 65,398 | – | 1544 | – |
| MetaCyc_Dev | 23 | 1292 | 60 | 23 | 16 |
| MetaCyc_Test | 23 | 1232 | 51 | 23 | 15 |
Sizes and gold label statistics of the splits for the Bacteria_Corpus
| Split | Abstracts | Candidates | Positives | Docs w. candidates | Docs w. positives |
|---|---|---|---|---|---|
| Bacteria_Train | 872,591 | 8,928,937 | – | 417,404 | – |
| Bacteria_Test | 200 | 2398 | 43 | 96 | 13 |
| Bacteria_Dev | 223 | 2806 | 69 | 110 | 22 |
| MetaCyc_Test | 23 | 1212 | 49 | 23 | 15 |
Labeling function metrics for MetaCyc_Corpus and Bacteria_Corpus. Coverage refers to the proportion of candidates labeled by the labeling function. Overlaps refers to the proportion of candidates also labeled by at least one other labeling function. Conflicts refers to the proportion of candidates labeled differently by another labeling function. A sketch computing these three metrics follows the table.
| Labeling function | Coverage (MetaCyc) | Overlaps (MetaCyc) | Conflicts (MetaCyc) | Coverage (Bacteria) | Overlaps (Bacteria) | Conflicts (Bacteria) |
|---|---|---|---|---|---|---|
| LF_keyword_context | 0.005963 | 0.002110 | 0.001361 | 0.001902 | 0.001750 | 0.001719 |
| LF_sep_verb | 0.000933 | 0.000291 | 0.000092 | 0.001252 | 0.001146 | 0.001137 |
| LF_argument_order | 0.500000 | 0.238234 | 0.005520 | 0.499939 | 0.470721 | 0.016873 |
| LF_followed_ase | 0.180954 | 0.155800 | 0.001697 | 0.015969 | 0.015838 | 0.000966 |
| LF_sep_or | 0.006453 | 0.003333 | 0.000000 | 0.006702 | 0.006399 | 0.000526 |
| LF_metacyc | – | – | – | 0.031805 | 0.030915 | 0.030823 |
| LF_chemical_elements | – | – | – | 0.130835 | 0.130539 | 0.004871 |
| LF_treatment | – | – | – | 0.029490 | 0.028674 | 0.000609 |
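The three metrics can be computed directly from the label matrix (rows are candidates, columns are labeling functions, -1 denotes abstain). The sketch below mirrors what snorkel's LFAnalysis summarizes; the toy matrix is made up.

```python
# Coverage, overlaps, and conflicts from a label matrix L
# (rows = candidates, columns = labeling functions, -1 = abstain).
import numpy as np

L = np.array([
    [ 1, -1,  0],
    [-1, -1, -1],
    [ 1,  1, -1],
    [ 0,  1,  1],
])

voted = L != -1  # which LF voted on which candidate
for j in range(L.shape[1]):
    this = voted[:, j]
    others = np.delete(voted, j, axis=1).any(axis=1)
    coverage = this.mean()             # candidates this LF labels
    overlaps = (this & others).mean()  # ...that another LF also labels
    # conflict: another LF voted and disagreed with this LF's vote
    disagree = (L != L[:, [j]]) & voted & this[:, None]
    conflicts = np.delete(disagree, j, axis=1).any(axis=1).mean()
    print(f"LF{j}: coverage={coverage:.2f} "
          f"overlaps={overlaps:.2f} conflicts={conflicts:.2f}")
```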
Evaluation results for MetaCyc_Corpus. We evaluated three models: majority voting, a generative model, and a discriminative model; a sketch of the majority-voting baseline follows the table
| Model | Coverage | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Majority voting | 0.73 | 0.79 | 0.22 | 0.34 |
| Generative model | 0.73 | 0.79 | 0.22 | 0.34 |
| Discriminative model | | | | |
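The majority-voting baseline assigns each candidate the most common non-abstaining vote, which is also why its coverage sits below 1.0: candidates that no labeling function votes on remain unlabeled. A sketch with a made-up label matrix:

```python
# Majority-voting baseline over a label matrix
# (rows = candidates, columns = labeling functions, -1 = abstain).
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # uncovered candidate: no LF voted
    # most_common breaks ties by the first vote seen (Python 3.7+)
    return Counter(votes).most_common(1)[0][0]

L = [[1, -1, 0], [-1, -1, -1], [1, 1, 0], [0, -1, -1]]
preds = [majority_vote(row) for row in L]
coverage = sum(p != ABSTAIN for p in preds) / len(preds)
print(preds, f"coverage={coverage:.2f}")
```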
Evaluation results on the Bacteria_Corpus, using a 0.5 threshold. We evaluated three models: majority voting, a generative model, and a discriminative model; a sketch of this thresholded evaluation follows the table
| Model | Coverage | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Majority voting | | 0.92 | 0.20 | 0.34 |
| Generative model | | 0.94 | 0.33 | 0.49 |
| Discriminative model | | 0.50 | | |
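The 0.5 threshold in the caption binarizes the model's marginal probabilities before scoring. A sketch with made-up marginals and gold labels:

```python
# Threshold the label model's marginal probabilities at 0.5,
# then score the binary predictions against gold labels.
def precision_recall_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

marginals = [0.91, 0.12, 0.67, 0.44, 0.80]  # made-up P(true reaction)
gold      = [1,    0,    0,    1,    1]
pred = [int(p >= 0.5) for p in marginals]
print("P=%.2f R=%.2f F1=%.2f" % precision_recall_f1(gold, pred))
```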