Pascale Gaudet, Michael S Livstone, Suzanna E Lewis, Paul D Thomas.
Abstract
The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. Protein annotations are either based on experiments or predicted from protein sequences. Since most sequences have not been experimentally characterized, most available annotations need to be based on predictions. To make inferences as accurate as possible, the GO Consortium's Reference Genome Project is using an explicit evolutionary framework to infer, in a semi-automated manner, annotations for proteins from a broad set of genomes based on experimental annotations. Most components in the pipeline, such as selection of sequences, building multiple sequence alignments and phylogenetic trees, retrieving experimental annotations and depositing inferred annotations, are fully automated. However, the most crucial step in our pipeline relies on software-assisted curation by an expert biologist. This curation tool, the Phylogenetic Annotation and INference Tool (PAINT), helps curators to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and to record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions. In this article, we describe how we use PAINT to infer protein function in a phylogenetic context, with emphasis on its strengths, limitations and guidelines. We also discuss specific examples showing how PAINT annotations compare with those generated by other widely used homology-based methods.
Year: 2011 PMID: 21873635 PMCID: PMC3178059 DOI: 10.1093/bib/bbr042
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Species with more than 1000 experimentally-based annotations (evidence codes: EXP, IDA, IEP, IMP, IGI and IPI)
| Species name | Number of annotations based on experimental data |
|---|---|
| | 54 131 |
| | 53 428 |
| | 50 291 |
| | 37 367 |
| | 32 320 |
| | 29 169 |
| | 24 332 |
| | 23 861 |
| | 14 708 |
| | 9442 |
| | 6684 |
| | 5244 |
| | 4350 |
| | 3720 |
| | 2307 |
| | 1779 |
| | 1673 |
| | 1250 |
| | 1093 |
| | 1081 |
See http://geneontology.org/GO.evidence.shtml for a description of the evidence codes.
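The selection criterion behind this table can be sketched as a filter followed by a count. The following is a minimal illustration, not the GO Consortium's pipeline: the record layout (species, gene, term, evidence code) and the toy records are hypothetical, and only the six evidence codes and the 1000-annotation threshold come from the table caption.

```python
from collections import Counter

# Experimental evidence codes listed in the table caption.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IEP", "IMP", "IGI", "IPI"}


def species_above_threshold(annotations, threshold=1000):
    """Count experimental annotations per species; keep species above threshold.

    `annotations` is an iterable of (species, gene, term, evidence_code)
    tuples. This layout is an assumption made for illustration.
    """
    counts = Counter(
        species
        for species, _gene, _term, code in annotations
        if code in EXPERIMENTAL_CODES
    )
    # "More than 1000" in the caption is read as a strict inequality.
    kept = {species: n for species, n in counts.items() if n > threshold}
    # Sort from most to fewest annotations, as in the table.
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records; the IEA (electronic) annotation is not counted.
toy = [
    ("species A", "g1", "term1", "IDA"),
    ("species A", "g2", "term2", "IMP"),
    ("species A", "g3", "term3", "IEA"),
    ("species B", "g4", "term1", "IPI"),
]
print(species_above_threshold(toy, threshold=1))
```

With the toy threshold of 1, only the hypothetical "species A" is retained, because it has two experimental annotations while "species B" has one.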
Figure 1: The concept of PAINT. This example presents a MutS homolog family showing experimental evidence for ‘GO term’. (A) Primary experimentally based annotations to one term or any of its ancestors (light green labels) are used to infer that the most recent common ancestor (MRCA) of all those proteins also had that function. The curator records this by dragging the term onto the MRCA node (orange box). (B) PAINT then propagates this annotation forward to the other descendant leaves (blue labels).
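The two steps in this caption, locating the MRCA of the experimentally annotated proteins and propagating the term to the remaining descendant leaves, can be sketched as follows. This is a minimal illustration under stated assumptions, not PAINT's implementation: the `Node` class, the helper functions and the gene names are all hypothetical.

```python
class Node:
    """A node in a rooted gene tree (hypothetical minimal structure)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def leaves(self):
        """Return all leaf nodes at or below this node."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def mrca(nodes):
    """Return the most recent common ancestor of the given nodes."""
    nodes = list(nodes)
    # Ancestors shared by every node (compared by identity).
    common = {id(n) for n in path_to_root(nodes[0])}
    for node in nodes[1:]:
        common &= {id(n) for n in path_to_root(node)}
    # Walking up from the first node, the first shared ancestor is the MRCA.
    for candidate in path_to_root(nodes[0]):
        if id(candidate) in common:
            return candidate
    raise ValueError("nodes are not in the same tree")


def propagate(ancestor, experimental):
    """Names of descendant leaves that inherit the term by inference."""
    return sorted(
        leaf.name for leaf in ancestor.leaves() if leaf.name not in experimental
    )


# Toy tree with hypothetical gene names.
a, b, c, d = Node("geneA"), Node("geneB"), Node("geneC"), Node("geneD")
inner = Node("anc1", [a, b])
clade = Node("anc2", [inner, c])
root = Node("root", [clade, d])

experimental = {"geneA", "geneC"}  # leaves with experimental annotations
target = mrca([a, c])
print(target.name)                      # anc2
print(propagate(target, experimental))  # ['geneB']
```

In this toy tree, `geneD` lies outside the annotated ancestor's clade and therefore receives no inferred annotation.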
Figure 2: Gain of function. The MRCA of all eukaryotic MSH2 orthologs (leftmost orange circle) already likely functioned in DNA repair (inherited from LUCA, data not shown) and maintenance of DNA repeats. The gene was then co-opted in the animal MRCA for a role in apoptosis, and later, in the vertebrate MRCA, for a role in somatic hypermutation of immunoglobulin genes. Inferences for ancestral genes (orange circles) are based on experimental GO annotations for the genes shown in green, and are then inherited by descendants, including uncharacterized genes in extant organisms (shown in blue). Thus, the ortholog in Bos taurus, for example, will be annotated by PAINT with different functions than the ortholog in Saccharomyces cerevisiae.
Figure 3: Loss of function. The active site residues of PGM1 relatives have been annotated in the CDD database based on the 3D protein structure for PGM from Paramecium tetraurelia. In PAINT, the biocurator used the integrated multiple sequence alignment viewer to determine that key active site residues are mutated in all of the vertebrate PGM5 orthologs, suggesting that phosphoglucomutase activity was lost shortly after duplication. The biocurator correspondingly annotated the vertebrate ancestor of PGM5 with ‘NOT phosphoglucomutase activity’, which PAINT then propagated to all vertebrate orthologs of PGM5.
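One way to picture how a ‘NOT’ annotation on an ancestor overrides an inherited function is a top-down traversal in which annotations on a node replace those inherited from above. The sketch below is illustrative only and is not PAINT's code: the function `inherit`, the dictionary-based tree, the simplified topology and the node names (including the mouse leaves) are hypothetical.

```python
def inherit(tree, node_annotations, root):
    """Propagate ancestral annotations down a tree, honoring NOT overrides.

    tree: dict mapping node name -> list of child names; leaves may be absent.
    node_annotations: dict mapping node name -> {term: bool}, where True
        marks a gain and False marks a loss ("NOT term").
    Returns {leaf name: {term: bool}}.
    """
    result = {}

    def walk(node, inherited):
        # Annotations on this node override those inherited from above.
        current = {**inherited, **node_annotations.get(node, {})}
        children = tree.get(node, [])
        if not children:
            result[node] = current
        for child in children:
            walk(child, current)

    walk(root, {})
    return result


# Hypothetical, simplified topology loosely modeled on the caption.
tree = {
    "PGM_ancestor": ["PGM1_clade", "PGM5_vertebrate_ancestor"],
    "PGM1_clade": ["PGM1_human", "PGM1_mouse"],
    "PGM5_vertebrate_ancestor": ["PGM5_human", "PGM5_mouse"],
}
node_annotations = {
    "PGM_ancestor": {"phosphoglucomutase activity": True},
    "PGM5_vertebrate_ancestor": {"phosphoglucomutase activity": False},
}

leaf_terms = inherit(tree, node_annotations, "PGM_ancestor")
for leaf, terms in sorted(leaf_terms.items()):
    for term, present in terms.items():
        print(leaf, term if present else "NOT " + term)
```

In this toy example the PGM1 leaves inherit the activity from the ancestral node, while the PGM5 leaves inherit the ‘NOT’ annotation placed on their vertebrate ancestor.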
Figure 4: General workflow for annotation of functional evolution events using PAINT. Step 1: The curator uses experimentally based annotations to form an initial hypothesis that the function first appeared in the MRCA of all genes with a related experiment-based annotation. Step 2: The curator decides which ancestor is most appropriate for annotation: either the initially hypothesized MRCA (Option A); an earlier ancestor (Option B), meaning that the MRCA from Step 1 likely inherited its annotation from an earlier ancestor; or more recent ancestor(s) (Option C), meaning that there was homoplasy and the MRCA from Step 1 is not where the function first appeared.
GO annotations inferred for different human genes by InterPro2GO, Compara and PAINT
| Human Gene | Aspect | InterPro2GO | Compara | PAINT |
|---|---|---|---|---|
| SOD1 | MF | Metal ion binding | SOD activity, chaperone binding | SOD activity, zinc ion binding, copper ion binding |
| | CC | | Nucleus, cytoplasm, mitochondrion, neuronal cell body | Nucleus, cytosol, mitochondrion, extracellular region |
| | BP | Superoxide metabolic process, oxidation-reduction process | Activation of MAPK activity, response to reactive oxygen species, ovarian follicle development, myeloid cell homeostasis, retina homeostasis, anti-apoptosis, spermatogenesis, aging, locomotory behavior, response to drug, 31 others | Removal of superoxide radicals |
| CCS | MF | Metal ion binding | | SOD copper chaperone activity, zinc ion binding, copper ion binding, NOT SOD activity |
| | CC | | | Cytosol, mitochondrion, nucleus |
| | BP | Superoxide metabolic process, oxidation-reduction process, metal ion transport | | Removal of superoxide radicals, intracellular copper ion transport |
| PGM1 | MF | Magnesium ion binding, intramolecular transferase activity, phosphotransferases | | Phosphoglucomutase activity |
| | CC | | | Cytosol |
| | BP | Carbohydrate metabolic process | | Glycogen biosynthetic process, glucose-1-phosphate metabolic process |
| PGM5 | MF | Magnesium ion binding, intramolecular transferase activity, phosphotransferases | | NOT phosphoglucomutase activity |
| | CC | | Spot adherens junction, Z disc, focal adhesion | Cytosol, spot adherens junction, Z disc, stress fiber, focal adhesion, intercalated disc |
| | BP | Carbohydrate metabolic process | | NOT glycogen biosynthetic process, NOT glucose-1-phosphate metabolic process |
Annotations are arranged by GO aspect: molecular function (MF), cellular component (CC) and biological process (BP).
Figure 5: A simplified phylogeny of the SOD family (PTHR10003). The ancestral gene present in the last universal common ancestor (LUCA) was duplicated in the ancestor of eukaryotes (square node). The descendant of the duplication that shows the least divergence from its ancestor retained SOD activity, whereas this activity was lost in the CCS clade.