Andre Brincat, Markus Hofmann.
Abstract
The detection of bacterial antibiotic resistance phenotypes is important when making clinical decisions for patient treatment. Conventional phenotypic testing involves culturing bacteria, which requires a significant amount of time and work. Whole-genome sequencing is emerging as a fast alternative for resistance prediction that considers the presence or absence of certain genes. Much research has focused on determining which bacterial genes cause antibiotic resistance, and efforts are being made to consolidate these facts in knowledge bases (KBs). KBs are usually manually curated by domain experts to ensure the highest quality. However, this limits the pace at which new facts are added. Automated extraction of gene-antibiotic resistance relations from the biomedical literature is one solution that can simplify the curation process. This paper reports on the development of a text mining pipeline that takes in English biomedical abstracts and outputs genes that are predicted to cause resistance to antibiotics. To test the generalisability of this pipeline, it was then applied to predict genes associated with Helicobacter pylori antibiotic resistance that are not present in common antibiotic resistance KBs or in publications studying H. pylori. These genes would be candidates for further lab-based antibiotic research and inclusion in these KBs. For relation extraction, state-of-the-art deep learning models were used. These models were trained on a newly developed silver corpus, which was generated by distant supervision of abstracts using the facts obtained from KBs. The top performing model was superior to a co-occurrence model, achieving a recall of 95%, a precision of 60% and an F1-score of 74% on a manually annotated holdout dataset. To our knowledge, this project was the first attempt at developing a complete text mining pipeline that incorporates deep learning models to extract gene-antibiotic resistance relations from the literature.
Additional related data can be found at https://github.com/AndreBrincat/Gene-Antibiotic-Resistance-Relation-Extraction.
Year: 2022 PMID: 35134132 PMCID: PMC9263533 DOI: 10.1093/database/baab077
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 4.462
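The abstract describes generating the silver corpus by distant supervision: candidate gene-antibiotic pairs found in abstracts are labelled using facts from the KBs. The following is a minimal sketch of that idea, assuming exact matching of normalised identifiers. The `Candidate` type, the `label_candidates` function and the toy names `geneA`, `geneB` and `drugX` are hypothetical and are not taken from the authors' repository.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One gene-antibiotic pair co-occurring in a sentence (hypothetical type)."""
    sentence: str
    gene: str        # normalised gene identifier
    antibiotic: str  # normalised antibiotic name


def label_candidates(candidates, kb_facts):
    """Label a candidate positive when its (gene, antibiotic) pair is a KB fact.

    Assumption: a pair absent from the KB is treated as a negative example,
    which is the usual (noisy) distant-supervision heuristic.
    """
    return [(c, (c.gene, c.antibiotic) in kb_facts) for c in candidates]


# Toy KB and sentences with placeholder names, for illustration only.
kb_facts = {("geneA", "drugX")}
candidates = [
    Candidate("geneA confers resistance to drugX.", "geneA", "drugX"),
    Candidate("geneB was detected in isolates treated with drugX.", "geneB", "drugX"),
]
labelled = label_candidates(candidates, kb_facts)
labels = [is_positive for _, is_positive in labelled]
```

Because the labels come from the KB rather than from reading the sentence, some will be wrong, which is why the resulting corpus is called silver rather than gold.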
Figure 1. Complete flowchart of the developed pipeline, which is divided into data acquisition on the left and data processing on the right.
Description of the different dataset subsets used for model training
| Dataset code | Dataset description | Training counts (% positive labels) | Validation counts (% positive labels) |
|---|---|---|---|
| SINGLE | Candidate relations from sentences containing only a single candidate relation | 8828 (89 %) | 1528 (88 %) |
| MULTI | Candidate relations from sentences containing multiple candidate relations | 37 002 (69 %) | 8351 (54 %) |
| MULTI_LIMITED | Candidate relations from sentences containing multiple relations with the genes belonging to a maximum of 2 UniRef50 cluster IDs and the antibiotics belonging to a maximum of 2 different antibiotic groups | 28 961 (87 %) | 4712 (81 %) |
| FULL | All candidate relations from all sentences | 45 830 (73 %) | 9879 (60 %) |
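The subset definitions in the table above can be sketched as a simple partition by sentence. This is an illustrative reading of the descriptions, not the authors' code: the function name `split_datasets` and the dictionary keys `sentence_id`, `cluster` (UniRef50 cluster ID) and `group` (antibiotic group) are assumptions. The counts in the table are consistent with SINGLE and MULTI being disjoint and together making up FULL (8828 + 37 002 = 45 830).

```python
from collections import defaultdict


def split_datasets(candidates):
    """Partition candidate relations into the four dataset codes.

    `candidates` is a list of dicts with hypothetical keys
    'sentence_id', 'cluster' and 'group'.
    """
    by_sentence = defaultdict(list)
    for cand in candidates:
        by_sentence[cand["sentence_id"]].append(cand)

    subsets = {"SINGLE": [], "MULTI": [], "MULTI_LIMITED": [], "FULL": list(candidates)}
    for relations in by_sentence.values():
        if len(relations) == 1:
            subsets["SINGLE"].extend(relations)
            continue
        subsets["MULTI"].extend(relations)
        clusters = {r["cluster"] for r in relations}
        groups = {r["group"] for r in relations}
        # At most 2 distinct gene clusters and 2 distinct antibiotic groups.
        if len(clusters) <= 2 and len(groups) <= 2:
            subsets["MULTI_LIMITED"].extend(relations)
    return subsets


# Toy input with placeholder cluster and group names.
toy = [
    {"sentence_id": "s1", "cluster": "c1", "group": "g1"},
    {"sentence_id": "s2", "cluster": "c1", "group": "g1"},
    {"sentence_id": "s2", "cluster": "c1", "group": "g2"},
    {"sentence_id": "s3", "cluster": "c1", "group": "g1"},
    {"sentence_id": "s3", "cluster": "c2", "group": "g1"},
    {"sentence_id": "s3", "cluster": "c3", "group": "g1"},
]
subsets = split_datasets(toy)
sizes = {name: len(rows) for name, rows in subsets.items()}
```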
Summary of holdout dataset
| Dataset grouping | Relation | NA |
|---|---|---|
| Single-instance | 235 (43 %) | 306 (57 %) |
| Multi-instance (bags) | 186 (48 %) | 203 (52 %) |
Summary of deep learning models used
| Model code | Model type |
|---|---|
| PCNN | Piecewise Convolutional Neural Network (PCNN) with pretrained word2vec word vectors |
| BIOBERT | BioBERT uncased base model with pretrained weights with entity markers |
| BAG_PCNN | Multi-instance learning PCNN using pretrained word2vec word vectors and with instances having the same entity pairs grouped in bags. |
| BAG_BIOBERT | Multi-instance learning BioBERT uncased base model with pretrained weights and with instances having the same entity pairs grouped in bags. |
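The bag-based models above group instances that share the same entity pair. A minimal sketch of that grouping step follows, assuming each instance is a `(gene, antibiotic, sentence)` tuple. The function name `group_into_bags` and the placeholder entity names are hypothetical, and the sketch covers only the grouping, not how a model aggregates a bag into one prediction.

```python
from collections import defaultdict


def group_into_bags(instances):
    """Group sentences by their (gene, antibiotic) entity pair.

    Each resulting bag holds every sentence that mentions the same pair,
    so a multi-instance model can make one prediction per pair.
    """
    bags = defaultdict(list)
    for gene, antibiotic, sentence in instances:
        bags[(gene, antibiotic)].append(sentence)
    return dict(bags)


# Toy instances with placeholder names.
instances = [
    ("geneA", "drugX", "geneA confers resistance to drugX."),
    ("geneA", "drugX", "Isolates carrying geneA were resistant to drugX."),
    ("geneB", "drugY", "geneB was not associated with drugY resistance."),
]
bags = group_into_bags(instances)
```

Here three single instances collapse into two bags, which mirrors how the holdout dataset has fewer multi-instance bags than single instances.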
Figure 2. Single-instance vs multi-instance RE.
Figure 3. Number of publications related to antibiotic resistance up to Q1 of 2020.
Summary of the abstracts obtained from PubMed related to antibiotic resistance
| Description | Statistic |
|---|---|
| No. of abstracts | 60 490 |
| No. of abstracts with sentences having both gene and antibiotic mentions | 22 915 (38 %) |
| Mean no. of unique genes per abstract | 3.6 (± 2.9 SD) |
| Mean no. of unique antibiotics per abstract | 3.5 (± 2.9 SD) |
| Mean no. of characters in the abstracts’ text | 1485.8 (± 596.1 SD) |
| Mean no. of tokens in the abstracts’ text | 209 (± 85.4 SD) |
Summary of the sentences having both gene and antibiotic mentions
| Description | Statistic |
|---|---|
| No. of unique sentences | 29 935 |
| Mean no. of tokens in sentences | 26.9 (± 11.4 SD) |
| Mean no. of unique gene mentions per sentence | 1.5 (± 1.1 SD) |
| Mean no. of unique antibiotic mentions per sentence | 1.4 (± 1.0 SD) |
Figure 4. Top five gene identifiers and antibiotic combinations found in sentences.
Evaluation of the rule-based method used for generating the silver standard corpus using the holdout dataset
| Metric | Score |
|---|---|
| Precision | 0.87 |
| Recall | 0.63 |
| F1-score | 0.73 |
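The F1-score in the table above is the harmonic mean of the reported precision and recall, which can be checked directly:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \times 0.87 \times 0.63}{0.87 + 0.63} = \frac{1.0962}{1.50} \approx 0.73
$$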
Summary of candidate relations obtained from sentences containing both gene and antibiotic entities
| Description | Statistic |
|---|---|
| No. of candidate relations | 81 889 |
| No. of unique candidate relations | 11 625 |
| No. of candidate relations related to | 4976 |
| No. of unique candidate relations related to | 1434 |
Summary of knowledge bases composition of genes, antibiotics and associated facts
| Description | Statistic |
|---|---|
| No. of unique facts (gene identifier and antibiotic name) | 42 380 |
| No. of unique facts (gene UniRef50 ID and antibiotic group) | 2455 |
| Gene identifiers | |
| No. of unique gene identifiers | 35 905 |
| Mean no. of associated antibiotics with each gene identifier | 1.2 (± 0.5 SD) |
| Mean no. of associated antibiotic groups with each gene identifier | 1.1 (± 0.4 SD) |
| UniRef50 clusters | |
| No. of unique UniRef50 clusters | 2147 |
| Mean no. of associated antibiotics with each UniRef50 cluster | 1.6 (± 1.2 SD) |
| Mean no. of associated antibiotic groups with each UniRef50 cluster | 1.1 (± 0.6 SD) |
| Antibiotics | |
| No. of unique antibiotics | 121 |
| No. of unique antibiotic groups | 35 |
Holdout dataset metrics of all models tested on different datasets
| Dataset and model used for training | Precision | Recall | F1-score |
|---|---|---|---|
| Co-occurrence (bag level) | 0.48 | 1.00 | 0.65 |
| FULL | | | |
| | 0.62 | 0.55 | 0.58 |
| | 0.65 | 0.56 | 0.60 |
| | 0.59 | 0.95 | 0.73 |
| | 0.55 | 0.98 | 0.70 |
| MULTI | | | |
| | 0.61 | 0.62 | 0.62 |
| | 0.62 | 0.63 | 0.63 |
| | 0.60 | 0.95 | 0.74 |
| | 0.56 | 0.97 | 0.71 |
| MULTI_LIMITED | | | |
| | 0.65 | 0.66 | 0.66 |
| | 0.61 | 0.66 | 0.63 |
| | 0.56 | 0.94 | 0.70 |
| | 0.52 | 0.99 | 0.68 |
| SINGLE | | | |
| | 0.55 | 0.88 | 0.68 |
| | 0.48 | 1.00 | 0.65 |
| | 0.48 | 1.00 | 0.65 |
| | 0.48 | 1.00 | 0.65 |
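The co-occurrence (bag level) row can be reproduced from the holdout summary, under the assumption that the co-occurrence baseline labels every candidate bag as a relation. With 186 positive and 203 negative bags, every positive is recovered (recall 1.00) and precision equals the share of positive bags. The helper `precision_recall_f1` below is a hypothetical name used only for this check.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Multi-instance (bag) holdout counts from the summary table.
positive_bags, negative_bags = 186, 203

# Assumed baseline: predict "relation" for every bag, so all positives are
# true positives, all negatives are false positives, and nothing is missed.
precision, recall, f1 = precision_recall_f1(tp=positive_bags, fp=negative_bags, fn=0)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.48 1.0 0.65
```

Several trained rows report exactly 0.48 / 1.00 / 0.65, the same figures as this baseline, which is what a model predicting a relation for every bag would produce.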
Figure 5. Number of predicted gene-antibiotic resistance associations linked to H. pylori for the top 10 antibiotic groups.
Figure 6. Top 10 bacterial species which were the main organism under study in publications mentioning genes linked to H. pylori.
Figure 7. Network graph of all predicted gene-antibiotic group relations for antibiotic resistance in H. pylori, where blue nodes represent the antibiotic groups and red nodes represent the individual genes.
Figure 8. Subgraph of the nitroimidazole antibiotic group with predicted gene relations, where the edge thickness is proportional to the number of times this association was predicted.