Jiyu Chen, Benjamin Goudey, Justin Zobel, Nicholas Geard, Karin Verspoor.
Abstract
MOTIVATION: Literature-based gene ontology annotations (GOA) are biological database records that use a controlled vocabulary to uniformly represent gene function information described in the primary literature. Assuring the quality of GOA is crucial for supporting biological research. However, a range of inconsistencies between the literature used as evidence and the annotated GO terms can be identified; these have not been systematically studied at the record level. The existing manual-curation approach to GOA consistency assurance is inefficient and unable to keep pace with the rate at which gene function knowledge is updated. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection.
Year: 2022 PMID: 35758780 PMCID: PMC9235499 DOI: 10.1093/bioinformatics/btac230
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Visual abstract. Example of a consistent GOA in the BC4GO corpus annotated by expert curators, and the four major types of inconsistent GOA; in each case we show the synthetic strategy for modifying a consistent instance to generate an inconsistency.
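The synthetic strategies in Fig. 1 can be sketched as a small generator: starting from a consistent (gene, GO term, passage) triple, each inconsistency type is produced by a targeted substitution. The ontology, term IDs and gene names below are hypothetical toys, not the paper's data.

```python
import random

# Toy is_a hierarchy (hypothetical IDs; the paper uses the real Gene Ontology).
PARENT = {"GO:child": "GO:mid", "GO:mid": "GO:root"}
CHILDREN = {"GO:mid": ["GO:child"], "GO:root": ["GO:mid"]}

def make_inconsistent(instance, kind, all_terms, all_genes, rng=random):
    """Derive a synthetic inconsistent GOA record from a consistent one.

    kind: 'OS' (over-specific), 'OB' (over-broad),
          'IM' (irrelevant GO mention), 'IG' (incorrect gene).
    """
    gene, term, passage = instance
    if kind == "OS":   # annotate a term more specific than the evidence supports
        term = rng.choice(CHILDREN[term])
    elif kind == "OB":  # annotate the broader parent term instead
        term = PARENT[term]
    elif kind == "IM":  # swap in an unrelated GO term
        term = rng.choice([t for t in all_terms if t != term])
    elif kind == "IG":  # keep the term but attach it to the wrong gene
        gene = rng.choice([g for g in all_genes if g != gene])
    return gene, term, passage
```

Pairing each synthetic negative with its consistent source keeps the passage text fixed, so a classifier must rely on the gene/term substitution rather than surface cues.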
Fig. 2. Architecture of the graph neural network (GNN) with the objective of edge type classification for encoding GO specificity knowledge. Notation: the vector of vertex u in the nth layer of the GNN; e, the edge type vector; flat concatenation of vectors; ReLU, rectified linear unit (the activation function); Norm, batch normalization; MLP, single-layered multi-layer perceptron.
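The caption's pipeline (flat concatenation of vertex vectors, ReLU, then a single-layer MLP scoring each edge type) can be sketched as follows. Dimensions and weights here are hypothetical and randomly initialized (i.e. untrained), and batch normalization is omitted for brevity; this illustrates the classification head, not the paper's full GNN.

```python
import numpy as np

rng = np.random.default_rng(0)
EDGE_TYPES = ["is_a", "part_of", "parent_is_a", "parent_part_of"]
DIM = 8  # hypothetical vertex-embedding size

# Hypothetical parameters; in the paper these are learned during training.
W1 = rng.normal(size=(2 * DIM, 16))
W2 = rng.normal(size=(16, len(EDGE_TYPES)))

def classify_edge(h_u, h_v):
    """Predict the edge type of (u, v) from the two vertex vectors."""
    x = np.concatenate([h_u, h_v])    # flat concatenation of vertex vectors
    hidden = np.maximum(0.0, x @ W1)  # linear layer + ReLU activation
    logits = hidden @ W2              # single-layered MLP head, one logit per type
    return EDGE_TYPES[int(np.argmax(logits))]
```

Training this head on is_a/part_of edges (and their parent variants) is what pushes specificity relations between GO terms into the vertex embeddings.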
Performance of GO-GNN on edge type classification

| Edge type | Precision | Recall | F1 |
|---|---|---|---|
| is_a | 0.98 | 0.96 | 0.97 |
| part_of | 0.97 | 0.95 | 0.96 |
| parent_is_a | 0.97 | 0.96 | 0.96 |
| parent_part_of | 0.97 | 0.94 | 0.95 |
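The three score columns follow the familiar precision/recall/F1 layout: for every row the third value is the harmonic mean of the first two, which a quick check confirms.

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from the GO-GNN edge classification table
rows = {
    "is_a": (0.98, 0.96),
    "part_of": (0.97, 0.95),
    "parent_is_a": (0.97, 0.96),
    "parent_part_of": (0.97, 0.94),
}
for edge, (p, r) in rows.items():
    print(f"{edge}: F1 = {f1(p, r):.2f}")
```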
Fig. 3. Architecture of the GNN-BERT model for GOA (in)consistency detection. Tok* denotes a linguistic token; E* and T* denote token embeddings; [CLS] and [SEP] are special tokens marking the boundaries of an input pair.
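The [CLS]/[SEP] framing in Fig. 3 is the standard BERT sentence-pair format. A minimal sketch of assembling such an input is below; exactly which text forms each segment (e.g. the GOA side versus the literature passage) is an assumption here, not stated by the figure.

```python
def build_pair_input(first_segment_tokens, second_segment_tokens):
    """Assemble a BERT-style sentence-pair input.

    [CLS] starts the sequence (its final hidden state feeds the classifier);
    [SEP] separates and terminates the two segments.
    """
    tokens = (["[CLS]"] + first_segment_tokens + ["[SEP]"]
              + second_segment_tokens + ["[SEP]"])
    # Segment ids distinguish the two halves of the pair (0 = first, 1 = second).
    segment_ids = ([0] * (len(first_segment_tokens) + 2)
                   + [1] * (len(second_segment_tokens) + 1))
    return tokens, segment_ids
```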
Hyperparameter settings for PubMedBERT and GNN-BERT during fine-tuning on the synthetic training set
| Hyperparameter | PubMedBERT | GNN-BERT |
|---|---|---|
| Fine-tune epochs | 5 | 3 |
| Fine-tune batch size | 16 | 16 |
| Warmup steps | 300 | 300 |
| Weight decay | 0.01 | 0.01 |
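The warmup setting in the table refers to the usual learning-rate warmup used with BERT fine-tuning. A minimal sketch, assuming a linear ramp over the 300 warmup steps (the base learning rate is not given in the table, so the value in the test below is illustrative; real schedules typically also decay afterwards):

```python
def warmup_lr(step, base_lr, warmup_steps=300):
    """Linear learning-rate warmup: ramp from 0 to base_lr over
    warmup_steps optimizer steps, then hold (decay omitted)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```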
Performance of the baseline, PubMedBERT and GNN-BERT in discriminating inconsistencies: over-specific (OS), over-broad (OB), irrelevant GO mention (IM) and incorrect gene (IG). Bold values indicate the highest performance in detection of each type of inconsistency.
| Type | Metric | Baseline | PubMedBERT (ID) | PubMedBERT (OOD) | GNN-BERT (ID) | GNN-BERT (OOD) |
|---|---|---|---|---|---|---|
| OS | Precision | 0.18 | 0.30 | 0.41 | | **0.45** |
| OS | Recall | 1.00 | 0.52 | 0.42 | | **0.56** |
| OS | F1 | 0.30 | 0.38 | 0.41 | | **0.50** |
| OB | Precision | 0.53 | 0.49 | 0.52 | 0.53 | |
| OB | Recall | **0.56** | 0.69 | 0.69 | 0.65 | |
| OB | F1 | 0.61 | 0.57 | 0.59 | 0.58 | |
| IM | Precision | NA | 0.38 | 0.33 | | **0.33** |
| IM | Recall | NA | 0.73 | 0.71 | | **0.79** |
| IM | F1 | NA | 0.50 | 0.45 | | **0.47** |
| IG | Precision | NA | 0.18 | 0.15 | | **0.17** |
| IG | Recall | NA | 0.19 | 0.27 | | **0.35** |
| IG | F1 | NA | 0.18 | 0.20 | | **0.23** |
ID, the model is fine-tuned on in-distribution samples; OOD, the model is fine-tuned on out-of-distribution samples; NA, not applicable (the baseline method does not recognize gene mentions and has zero recall, as it considers every GO mention as negative regardless of true IM inconsistency).
Fig. 4. Change in performance of PubMedBERT and GNN-BERT on the test set with respect to the fraction of noisy samples in the training set.
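One simple way to realize the noisy-training-set condition studied in Fig. 4 is to flip the consistency label of a chosen fraction of training examples. This is an illustrative sketch of that setup, not necessarily the paper's exact noise model.

```python
import random

def inject_label_noise(labels, fraction, rng=random.Random(7)):
    """Flip a given fraction of binary (0/1) labels, chosen at random,
    to simulate a training set containing noisy samples."""
    labels = list(labels)  # copy so the caller's list is untouched
    n_noisy = int(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), n_noisy):
        labels[i] = 1 - labels[i]
    return labels
```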
Fig. 5. Change in F1 of PubMedBERT and GNN-BERT in discriminating the four types of GOA inconsistency with respect to prediction uncertainty.
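A common way to quantify the prediction uncertainty analyzed in Fig. 5 is the entropy of the model's output probability; the paper's exact measure may differ, so this is one representative choice.

```python
import math

def binary_entropy(p):
    """Uncertainty of a binary prediction as entropy in bits:
    0 when the model is certain (p near 0 or 1), 1 at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

Binning test instances by this quantity, as in Fig. 5, shows whether low-uncertainty predictions are also the more reliable ones.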