Gaston K Mazandu, Nicola J Mulder.
Abstract
With the advent of high-throughput sequencing technologies, the number of genome sequencing projects worldwide has increased, yielding complete genome sequences of humans, other animals, and plants. Subsequently, several labs have focused on genome annotation, which consists of assigning functions to gene products, mostly using Gene Ontology (GO) terms. As a consequence, annotations have become increasingly heterogeneous across genomes, both because different pipelines use different approaches to infer them and because of the nature of the GO structure itself. This makes a curator's task difficult, even when they adhere to the established guidelines for assessing these protein annotations. Here we develop a genome-scale approach for integrating GO annotations from different pipelines using semantic similarity measures. We used this approach to identify inconsistencies and similarities in functional annotations between orthologs of human and Drosophila melanogaster, to assess the quality of GO annotations derived from InterPro2GO mappings against manually curated GO annotations for the Drosophila melanogaster proteome (from a FlyBase dataset) and the human proteome, and to filter GO annotation data for these proteomes. The results indicate that efficient integration of GO annotations eliminates redundancy of up to 27.08% and 22.32% in the Drosophila melanogaster and human GO annotation datasets, respectively. Furthermore, we identified orthologs that lack annotations or have missing annotations, as well as annotation mismatches between the InterPro2GO and manual pipelines in these two proteomes, all of which require further curation. The approach simplifies the task of curators assessing protein annotations, reduces redundancy, and eliminates inconsistencies in large annotation datasets, which eases comparative functional genomics.
Keywords: Gene Ontology annotation; annotation pipeline; electronic annotation; functional annotation; manual annotation
Year: 2014 PMID: 25147557 PMCID: PMC4123725 DOI: 10.3389/fgene.2014.00264
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
GO annotations of the protein .
| GO ID | GO term | Level | Evidence code | Source |
|---|---|---|---|---|
| GO:0008270 | Zinc ion binding | 6 | IEA | InterPro |
| GO:0004842 | Ubiquitin-protein transferase activity | 6 | IMP | UniProt |
| GO:0004842 | Ubiquitin-protein transferase activity | 6 | IDA | BHF-UCL |
| GO:0004842 | Ubiquitin-protein transferase activity | 6 | IDA | UniProt |
| GO:0019789 | SUMO ligase activity | 6 | IDA | UniProt |
| GO:0019789 | SUMO ligase activity | 6 | IMP | UniProt |
| GO:0044547 | DNA topoisomerase binding | 4 | IPI | UniProt |
| GO:0003677 | DNA binding | 4 | IDA | UniProt |
| GO:0005515 | Protein binding | 2 | IPI | UniProt |
| GO:0003823 | Antigen binding | 2 | IPI | UniProt |
These annotations were retrieved from the GOA-human gene association file of the GOA database. The level is the maximum number of links from the root to the GO term in the GO DAG, with the root of each ontology placed at level 0. The evidence codes IEA, IMP, IPI, and IDA stand for Inferred from Electronic Annotation, Inferred from Mutant Phenotype, Inferred from Physical Interaction, and Inferred from Direct Assay, respectively.
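The level definition in the note above (the longest path from the root, with the root at level 0) can be illustrated with a minimal sketch. The toy DAG and its term names are hypothetical placeholders rather than real GO structure, and `term_levels` is an illustrative helper, not a function from the authors' tooling.

```python
# Toy ontology as a child -> parents map (hypothetical terms, not real GO data).
TOY_PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["C", "B"],  # D is reachable by a 3-link path and by a 2-link path
}


def term_levels(parents):
    """Return {term: level}, where level is the maximum number of links
    from the root to the term and the root sits at level 0."""
    levels = {}

    def level(term):
        # Memoised recursion: a term sits one link below its deepest parent.
        if term not in levels:
            term_parents = parents[term]
            if not term_parents:
                levels[term] = 0
            else:
                levels[term] = 1 + max(level(p) for p in term_parents)
        return levels[term]

    for term in parents:
        level(term)
    return levels


LEVELS = term_levels(TOY_PARENTS)
```

Here `D` receives level 3 rather than 2, because the longest route (`root` to `A` to `C` to `D`) determines the level, not the shortest one.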
Percentage redundancy of manual and electronic pipelines for different confidence levels.
| Confidence level | Organism | EXP | IPR | AEC | EXP | IPR | AEC | EXP | IPR | AEC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | Human | 22.70 | 0.00 | 22.32 | 28.01 | 0.00 | 12.66 | 22.09 | 0.00 | 18.11 |
| 0.0 | Fruitfly | 29.02 | 0.00 | 27.08 | 18.31 | 0.00 | 13.45 | 20.72 | 0.00 | 17.18 |
| 0.3 | Human | 16.74 | 0.00 | 17.44 | 26.04 | 0.00 | 11.63 | 10.06 | 0.00 | 8.89 |
| 0.3 | Fruitfly | 20.30 | 0.00 | 19.65 | 16.31 | 0.00 | 12.16 | 10.17 | 0.00 | 8.52 |
| 0.7 | Human | 11.20 | 0.00 | 11.22 | 6.08 | 0.00 | 4.22 | 2.28 | 0.00 | 1.91 |
| 0.7 | Fruitfly | 11.02 | 0.00 | 11.40 | 5.04 | 0.00 | 6.74 | 3.89 | 0.00 | 3.41 |
A confidence level of 0.0 refers to strict non-redundancy, which uses the “true path” rule of the GO structure to flag any ancestor of an annotated term as a redundant annotation. For confidence levels of 0.3 and 0.7, an ancestor of a term is considered a redundant annotation for a protein if their semantic similarity score is 0.3 and 0.7, respectively. EXP, IPR, and AEC denote the manual pipeline, the electronic pipeline, and all evidence codes considered together, respectively.
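The strict case (confidence level 0.0) can be sketched as follows: under the true path rule, an annotated term is treated as redundant for a protein when it is an ancestor of another term annotated to the same protein. This is only an illustration under stated assumptions. The toy DAG, the protein identifiers, and the helper names are hypothetical, and the denominator used for the percentage (all distinct protein-term pairs) is an assumption, since the text here does not spell out how the percentage is normalised. The semantic-similarity thresholds of 0.3 and 0.7 are not covered.

```python
# Toy ontology as a child -> parents map (hypothetical terms, not real GO data).
TOY_PARENTS = {
    "root": [],
    "binding": ["root"],
    "dna_binding": ["binding"],
    "catalytic": ["root"],
}

# Hypothetical protein -> annotated terms map.
TOY_ANNOTATIONS = {
    "p1": ["binding", "dna_binding", "catalytic"],
    "p2": ["dna_binding"],
}


def ancestors(term, parents):
    """Return all proper ancestors of `term` in the child -> parents map."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def redundant_terms(terms, parents):
    """Annotated terms that are ancestors of another annotated term."""
    annotated = set(terms)
    redundant = set()
    for term in annotated:
        redundant |= ancestors(term, parents) & annotated
    return redundant


def percent_redundancy(protein_terms, parents):
    """Percentage of distinct protein-term pairs flagged as redundant
    (assumed normalisation, see the lead-in)."""
    total = 0
    flagged = 0
    for terms in protein_terms.values():
        unique = set(terms)
        total += len(unique)
        flagged += len(redundant_terms(unique, parents))
    return 100.0 * flagged / total if total else 0.0


P1_REDUNDANT = redundant_terms(TOY_ANNOTATIONS["p1"], TOY_PARENTS)
TOY_PERCENT = percent_redundancy(TOY_ANNOTATIONS, TOY_PARENTS)
```

In the toy data, `binding` is flagged for `p1` because the more specific `dna_binding` already implies it, so one of the four protein-term pairs is redundant.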
Figure 1. Comparison of annotations inferred manually and electronically in human and fruitfly genomes in terms of annotation specificity score computed using the GO-universal metric. (A) Human genome. (B) Fruitfly genome.
Figure 2. Comparison of annotations inferred manually and electronically in human and fruitfly genomes in terms of annotation consistency score computed using the GO-universal metric. (A) Human genome. (B) Fruitfly genome.
Figure 3. Annotation matches between manual and electronic pipelines for human and fruitfly genomes with scores computed using the GO-universal metric. (A) Human genome. (B) Fruitfly genome.
General features of fruitfly-human ortholog proteins, in terms of the total number of ortholog proteins with GO annotations in the BP, MF, and CC ontologies for the genomes under consideration.
| Annotated ortholog | 1766 | 2674 | 1678 | 2669 | 1682 | 2752 |
| Annotated ortholog pair | 2866 | | 2829 | | 2759 | |
| Uncharacterized ortholog pair | 109 | | 198 | | 102 | |
| Missing annotation ortholog | 180 | 191 | 95 | 224 | 103 | 382 |
Figure 4. Fruitfly-human ortholog functional similarity scores. Comparison of GO annotations for ortholog protein pairs between human and fruitfly genomes, with scores computed using the GO-universal metric.