Valerie Wood1,2, Seth Carbon3, Midori A Harris1,2, Antonia Lock4, Stacia R Engel5, David P Hill6, Kimberly Van Auken7, Helen Attrill8, Marc Feuermann9, Pascale Gaudet9, Ruth C Lovering10, Sylvain Poux9, Kim M Rutherford1,2, Christopher J Mungall3.
Abstract
Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes probably reflects errors in literature curation, ontology structure or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biological processes which are unlikely to be correctly co-annotated to the same gene products (e.g. amino acid metabolism and cytokinesis), and traced erroneous annotations to their sources. To date we have generated 107 quality control rules, and corrected 289 manual annotations in eukaryotes and over 52 700 automatically propagated annotations across all taxa.
Keywords: annotation; biocuration; gene ontology; quality control
Year: 2020 PMID: 32875947 PMCID: PMC7536087 DOI: 10.1098/rsob.200149
Source DB: PubMed Journal: Open Biol ISSN: 2046-2441 Impact factor: 6.411
Rule file format. (Two mandatory columns contain the GO identifiers (IDs) for the pair of mutually exclusive terms, and remaining columns allow optional identifiers for exceptions to the rule (see ‘Allowable annotation overlaps’ in the main text). For example, line 1 consists only of ‘GO:0006399 GO:0006457’ in columns 1 and 2, and states that the GO terms ‘tRNA metabolic process’ (GO:0006399) and ‘protein folding’ (GO:0006457) should not both be associated with a single gene. Column 3 may contain one or more pipe-separated IDs for GO terms that allow correct use of an otherwise mutually exclusive pair. In line 2, ‘GO:0006399 GO:0006310 GO:0045190’ states that genes may be annotated to both ‘tRNA metabolic process’ (GO:0006399) and ‘DNA recombination’ (GO:0006310) only if they are annotated to ‘isotype switching’ (GO:0045190). Similarly, column 4 allows identifiers for individual gene products or for specific PANTHER families that cover entire orthologous groups, where annotation to both terms in a pair has been confirmed as accurate. In line 3, ‘GO:0002181 GO:0006605 WB:WBGene00006946’ states that C. elegans prx-10, but not other genes, may be annotated to both ‘cytoplasmic translation’ (GO:0002181) and ‘protein targeting’ (GO:0006605) owing to a tandem gene fusion in C. elegans.)
| Term1 | Term2 | excepted GO term | excepted gene |
|---|---|---|---|
| GO:0006399 | GO:0006457 | | |
| GO:0006399 | GO:0006310 | GO:0045190 | |
| GO:0002181 | GO:0006605 | | WB:WBGene00006946 |
Figure 1. Annotation matrices showing fission yeast annotations for 21 selected GO term pairs in 2012 and 2020. Each row–column intersection off the diagonal shows the number of genes annotated to two different terms. Cells are colour-coded by number of co-annotated genes. Disputed phylogenetically-inferred annotations have been removed from the 2020 dataset.
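The off-diagonal counts in such a matrix are set intersections of the genes annotated to each term. The sketch below illustrates the idea on toy data and is not the Term Matrix implementation: it counts only direct annotations, whereas a real tool would presumably also include annotations to descendant terms. The function name `coannotation_counts` and the gene names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coannotation_counts(annotations, terms):
    """Count genes shared by each unordered pair of terms.

    annotations: iterable of (gene_id, go_id) pairs (direct annotations only).
    terms: the GO IDs that label the rows and columns of the matrix.
    """
    genes_by_term = defaultdict(set)
    for gene, term in annotations:
        genes_by_term[term].add(gene)
    return {
        (a, b): len(genes_by_term[a] & genes_by_term[b])
        for a, b in combinations(terms, 2)
    }


# Toy data; the gene names are placeholders, not real annotations.
annotations = [
    ("geneA", "GO:0006399"),
    ("geneA", "GO:0006457"),
    ("geneB", "GO:0006399"),
    ("geneC", "GO:0006457"),
    ("geneC", "GO:0006310"),
]
terms = ["GO:0006399", "GO:0006457", "GO:0006310"]
counts = coannotation_counts(annotations, terms)
```

Pairs with small but non-zero counts are the ones worth inspecting by hand, since they are the likeliest to reflect annotation or ontology errors.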
Figure 2. (a) For each of 35 GO BP subset terms, the cumulative number of genes in all organisms annotated to both the BP term and the CC term ‘cohesin complex’ (GO:0008278) is shown for May 2016 and August 2019. (b) For each database, the table shows the number of annotation errors of each type identified and corrected.
Figure 3. Intersection-based annotation quality control workflow. Step 1: Term Matrix retrieves annotations shared between pairs of GO terms. For term pairs with few annotations, both annotations and ontology are inspected, and errors corrected. Step 2: based on known biology, create co-annotation QC rules that disallow simultaneous annotation to term pairs (‘NO OVERLAP’ between annotation sets for the indicated terms). Step 3: re-run Term Matrix to find annotations that violate the rules; report to contributing databases for validation. Step 4: correct annotation errors, or amend rules to allow specific biologically valid exceptions.
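Step 3 of the workflow, finding annotations that violate a ‘NO OVERLAP’ rule while honouring the allowed exceptions, can be sketched as follows. This is an illustrative assumption about how such a check might be written, not the authors' pipeline: rules are plain tuples, annotations are treated as direct (no ontology traversal), PANTHER family exceptions are not handled, and all gene names except the C. elegans ID quoted in the rule file example are placeholders.

```python
def find_violations(rules, gene_terms):
    """Return (gene, term1, term2) for every co-annotation that breaks a rule.

    rules: iterable of (term1, term2, excepted_terms, excepted_genes).
    gene_terms: dict mapping gene ID to the set of GO IDs it is annotated to.
    """
    violations = []
    for term1, term2, ok_terms, ok_genes in rules:
        for gene, terms in sorted(gene_terms.items()):
            if term1 in terms and term2 in terms:
                # Allowed if the gene itself is excepted, or if it also
                # carries one of the GO terms that permits the overlap.
                if gene in ok_genes or terms & ok_terms:
                    continue
                violations.append((gene, term1, term2))
    return violations


# Rules mirror the three example lines of the rule file format.
rules = [
    ("GO:0006399", "GO:0006457", set(), set()),
    ("GO:0006399", "GO:0006310", {"GO:0045190"}, set()),
    ("GO:0002181", "GO:0006605", set(), {"WB:WBGene00006946"}),
]

# Toy annotation sets for illustration only.
gene_terms = {
    "geneA": {"GO:0006399", "GO:0006457"},
    "geneB": {"GO:0006399", "GO:0006310", "GO:0045190"},
    "geneC": {"GO:0006399", "GO:0006310"},
    "WB:WBGene00006946": {"GO:0002181", "GO:0006605"},
    "geneD": {"GO:0002181", "GO:0006605"},
}
violations = find_violations(rules, gene_terms)
```

In this toy run, `geneB` and `WB:WBGene00006946` are skipped because of the excepting GO term and the excepted gene respectively. The remaining hits would be reported to the contributing databases for validation (step 3) before any correction or rule amendment (step 4).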
Error types. (Number of different errors of each type found in annotations and the ontology structure, and the number of annotations affected. ND, not determined.)
| correction type | occurrences | entries affected |
|---|---|---|
| UniProt keyword to GO mapping | 41 | ND |
| UniPathway to GO mapping | 2 | ND |
| InterPro to GO mapping | 55 | >380 000 |
| PAINT annotation | 14 | 1818 |
| affecting all annotations | 19 | >2 000 000 |
| 289 |