Sidahmed Benabderrahmane, Malika Smail-Tabbone, Olivier Poch, Amedeo Napoli, Marie-Dominique Devignes.
Abstract
BACKGROUND: The Gene Ontology (GO) is a well-known controlled vocabulary describing the biological process, molecular function and cellular component aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO aspects, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (i.e. their evidence codes).
Year: 2010 PMID: 21122125 PMCID: PMC3098105 DOI: 10.1186/1471-2105-11-588
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
EC weight lists assigned to the 16 GO ECs considered in this study
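| | Auth | Auth | Exp | Exp | Exp | Exp | Exp | Exp | Comp | Comp | Comp | Comp | Comp | Comp | Cur | Cur | Auto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EC | TAS | NAS | EXP | IDA | IPI | IMP | IGI | IEP | ISS | RCA | ISA | ISO | ISM | IGC | IC | ND | IEA |
| List1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| List2 | 1 | 0.5 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.5 | 0 | 0.4 |
| List3 | 1 | 0.5 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.5 | 0 | 0 |
| List4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |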
Table 1: The various weights assigned to the ECs are listed in the following lines as EC weight lists 1 to 4. TAS: Traceable Author Statement; NAS: Non-traceable Author Statement; EXP: Inferred from Experiment; IDA: Inferred from Direct Assay; IPI: Inferred from Physical Interaction; IMP: Inferred from Mutant Phenotype; IGI: Inferred from Genetic Interaction; IEP: Inferred from Expression Pattern; ISS: Inferred from Sequence Similarity; RCA: Inferred from Reviewed Computational Analysis; ISA: Inferred from Sequence Alignment; ISO: Inferred from Sequence Orthology; ISM: Inferred from Sequence Model; IGC: Inferred from Genomic Context; IC: Inferred from Curator; IEA: Inferred from Electronic Annotation; ND: No biological Data available. The EC categories are indicated in the first line of the table. Auth: Author statement; Exp: Experimental; Comp: Computational Analysis; Cur: Curator statement; Auto: Automatically assigned.
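As a rough illustration, the four weight lists can be encoded as mappings from evidence code to weight and used to filter a gene's annotations. This is only a sketch: the column order (TAS, NAS, ..., IC, ND, IEA) is an assumption inferred from the EC categories, and the names `EC_ORDER`, `WEIGHT_LISTS` and `kept_annotations` are hypothetical, not taken from the study.

```python
# Evidence codes in the column order ASSUMED for Table 1 (grouped by category).
EC_ORDER = [
    "TAS", "NAS",                              # Auth
    "EXP", "IDA", "IPI", "IMP", "IGI", "IEP",  # Exp
    "ISS", "RCA", "ISA", "ISO", "ISM", "IGC",  # Comp
    "IC", "ND",                                # Cur
    "IEA",                                     # Auto
]


def _weights(values):
    """Pair each evidence code with its weight, checking the row length."""
    assert len(values) == len(EC_ORDER)
    return dict(zip(EC_ORDER, values))


# Hypothetical encoding of the four EC weight lists.
WEIGHT_LISTS = {
    "List1": _weights([1.0] * 17),
    "List2": _weights([1.0, 0.5] + [0.8] * 6 + [0.6] * 6 + [0.5, 0.0, 0.4]),
    "List3": _weights([1.0, 0.5] + [0.8] * 6 + [0.6] * 6 + [0.5, 0.0, 0.0]),
    "List4": _weights([0.0] * 16 + [1.0]),
}


def kept_annotations(annotations, list_name):
    """Return (term, ec, weight) for annotations whose EC weight is non-zero."""
    weights = WEIGHT_LISTS[list_name]
    return [(term, ec, weights[ec]) for term, ec in annotations if weights[ec] > 0]


# Toy annotations (term identifiers are placeholders, not real data).
toy = [("term_A", "IEA"), ("term_B", "IDA")]
print(kept_annotations(toy, "List3"))  # IEA annotation is dropped
print(kept_annotations(toy, "List4"))  # only the IEA annotation remains
```

Under this encoding, List3 behaves as a "non-IEA" filter and List4 as an "IEA only" filter, which is why some genes may be left without any usable annotation.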
Figure 1: Distribution of EC (evidence codes) in yeast and human gene annotations according to BP and MF aspects. The number of annotations assigned to a gene with a given EC is represented for each EC. Note that some genes can be annotated twice with the same term but with a different EC. The cumulative numbers of all non-IEA annotations are 18,496 and 9,564 for the yeast BP and MF annotations, respectively, and 21,462 and 16,243 for the human BP and MF annotations, respectively. Statistics are derived from the NCBI annotation file, version June 2009.
List of yeast and human pathways used in this study.
| KEGG category | KEGG subcategory | Yeast pathway | Name | Nb genes | Human pathway | Name | Nb genes |
|---|---|---|---|---|---|---|---|
| 01100 Metabolism | 01101 Carbohydrate Metabolism | sce00562 | Inositol phosphate metabolism | 15 | hsa00040 | Pentose and glucuronate interconversions | 26 |
| | 01102 Energy Metabolism | sce00920 | Sulfur metabolism | 13 | hsa00920 | Sulfur metabolism | 13 |
| | 01103 Lipid Metabolism | sce00600 | Sphingolipid metabolism | 13 | hsa00140 | C21-Steroid hormone metabolism | 17 |
| | 01105 Amino Acid Metabolism | sce00300 | Lysine biosynthesis | 13 | hsa00290 | Valine, leucine and isoleucine biosynthesis | 11 |
| | | sce00410 | Alanine biosynthesis | 8 | | | |
| | 01107 Glycan Biosynthesis and Metabolism | sce00514 | O-Mannosyl glycan biosynthesis | 13 | hsa00563 | Glycosylphosphatidylinositol (GPI)-anchor biosynthesis | 23 |
| | 01109 Metabolism of Cofactors and Vitamins | sce00670 | One carbon pool by folate | 14 | hsa00670 | One carbon pool by folate | 16 |
| | 01110 Biosynthesis of Secondary Metabolites | sce00903 | Limonene and pinene degradation | 7 | hsa00232 | Caffeine metabolism | 7 |
| 01120 Genetic Information Processing | 01121 Transcription | sce03022 | Basal transcription factors | 24 | hsa03022 | Basal transcription factors | 38 |
| | | | | | hsa03020 | RNA polymerase | 29 |
| | 01123 Folding, Sorting and Degradation | sce04130 | SNARE interactions in vesicular transport | 23 | hsa04130 | SNARE interactions in vesicular transport | 38 |
| | 01124 Replication and Repair | sce03450 | Non-homologous end-joining | 10 | hsa03450 | Non-homologous end-joining | 14 |
| | | | | | hsa03430 | Mismatch repair | 23 |
| 01130 Environmental Information Processing | 01132 Signal Transduction | sce04070 | Phosphatidylinositol signaling system | 15 | | | |
| 01140 Cellular Processes | 01151 Transport and Catabolism | sce04140 | Regulation of autophagy | 17 | | | |
| 01160 Human Diseases | 01164 Metabolic Disorders | | | | hsa04950 | Maturity onset diabetes of the young | 25 |
| Total genes number | | | | 185 | | | 280 |
| Non-IEA:IEA ratio | | | | 572:435 (1.3) | | | 560:620 (0.9) |
Table 2: The KEGG categories and subcategories are indicated for each pathway as well as its name and the number of genes it contains (KEGG version Dec 2009). The non-IEA:IEA ratio refers to Biological Process GO annotation of the complete set of genes for each species.
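The non-IEA:IEA ratio is a simple count ratio over annotation evidence codes. A minimal sketch, assuming the annotations are available as a flat list of evidence-code strings (the function name `non_iea_ratio` is hypothetical), reproduces the rounded values reported for the two species:

```python
def non_iea_ratio(evidence_codes):
    """Return (non-IEA count, IEA count, ratio) for a list of evidence codes."""
    iea = sum(1 for ec in evidence_codes if ec == "IEA")
    non_iea = len(evidence_codes) - iea
    # Guard against a dataset with no IEA annotation at all.
    ratio = non_iea / iea if iea else float("inf")
    return non_iea, iea, ratio


# Synthetic code lists built only to match the reported counts;
# "IDA" stands in for any non-IEA evidence code.
yeast = ["IDA"] * 572 + ["IEA"] * 435
human = ["IDA"] * 560 + ["IEA"] * 620

for name, codes in (("yeast", yeast), ("human", human)):
    n, i, r = non_iea_ratio(codes)
    print(f"{name}: {n}:{i} ({r:.1f})")
```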
List of yeast and human genes and Pfam clans used in this study.
| Pfam clan accession (yeast) | Nb genes | Pfam clan name | Pfam clan accession (human) | Nb genes | Pfam clan name |
|---|---|---|---|---|---|
| CL0328.1 | 15 | 2heme_cytochrom | CL0099.10 | 18 | ALDH-like |
| CL0059.12 | 13 | 6_Hairpin | CL0106.10 | 8 | 6PGD_C |
| CL0092.9 | 8 | ADF | CL0417.1 | 9 | BIR-like |
| CL0099.10 | 11 | ALDH-like | CL0165.8 | 5 | Cache |
| CL0179.11 | 11 | ATP-grasp | CL0149.9 | 7 | CoA-acyltrans |
| CL0255.6 | 7 | ATP_synthase | CL0085.11 | 12 | FAD_DHS |
| CL0378.1 | 10 | Ac-CoA-synth | CL0076.9 | 18 | FAD_Lum_binding |
| CL0257.6 | 18 | Acetyltrans-like | CL0289.3 | 6 | FBD |
| CL0034.12 | 11 | Amidohydrolase | CL0119.10 | 7 | Flavokinase |
| CL0135.8 | 14 | Arrestin_N-like | CL0042.9 | 10 | Flavoprotein |
| Total genes number | 118 | | | 100 | |
| Non-IEA:IEA ratio | 121:366 (0.3) | | | 144:309 (0.46) | |
Table 3: Clans are indicated by their accession identifier in the Sanger Pfam database (October 2009 release) and by the number of genes retrieved either in yeast (left part) or in human (right part). Each clan contains several Pfam entries listed in the Pfam_C file at [57]. The non-IEA:IEA ratio refers to the Molecular Function GO annotation of the complete set of genes for each species.
Figure 2: Intra-set similarities with the KEGG pathway dataset using BP annotations. The intra-set similarity is calculated as the mean of all pairwise gene similarities within a KEGG pathway, with the four measures compared in this study, namely, IntelliGO (using EC weight List1), Lord-normalized, Al-Mubaid, and Weighted-cosine. Thirteen pathways were selected from the KEGG Pathway database for each of yeast (top panel) and human (bottom panel). Only BP annotations are used here (see also Table 2).
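The intra-set similarity described in this caption can be sketched as a mean over gene pairs. The example below is illustrative only: it averages over unordered pairs of distinct genes (the caption does not say whether self-pairs are included), and it uses a Jaccard index on annotation term sets as a stand-in similarity, not IntelliGO or any of the other compared measures. The names `intra_set_similarity` and `jaccard`, and the toy gene set, are hypothetical.

```python
from itertools import combinations
from statistics import mean


def intra_set_similarity(genes, sim):
    """Mean similarity over all unordered pairs of distinct genes in a set."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        raise ValueError("need at least two genes")
    return mean(sim(a, b) for a, b in pairs)


# Toy annotations: gene -> set of annotation terms (placeholders, not real data).
toy_annotations = {
    "g1": {"t1", "t2"},
    "g2": {"t1", "t2"},
    "g3": {"t1"},
}


def jaccard(a, b):
    """Stand-in pairwise similarity: Jaccard index of the two term sets."""
    ta, tb = toy_annotations[a], toy_annotations[b]
    return len(ta & tb) / len(ta | tb)


print(intra_set_similarity(list(toy_annotations), jaccard))
```

Any function taking two gene identifiers and returning a number can be passed as `sim`, so the same helper works for comparing several measures on the same gene set.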
Figure 3: Intra-set similarities with the Pfam clan dataset using MF annotations. The intra-set similarity is calculated for all genes of a given species within a Pfam clan using MF annotations. Two collections of ten Pfam clans were selected from the Sanger Pfam database to retrieve yeast (top panel) and human (bottom panel) genes belonging to these clans (see also Table 3).
Figure 4: Influence of various EC weight lists on the distribution of pairwise similarity values obtained for intra-set similarity calculation. KEGG pathway datasets are handled with BP annotations, and Pfam clans with MF annotations. The MV bar is for Missing Values and represents the number of pairwise similarity values that cannot be calculated using List3 or List4 due to the missing annotations for certain genes. Pairwise similarity intervals are displayed on the x axis of the histograms, while values on the y axis represent the number of pairwise similarity values present in each interval.
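A histogram of this kind can be sketched by binning pairwise similarity values and counting missing values separately. The sketch assumes similarities lie in [0, 1], that a pair which cannot be computed is represented by `None`, and that ten equal-width intervals are used; the bin count and the name `bin_similarities` are assumptions, not details from the study.

```python
def bin_similarities(values, n_bins=10):
    """Count values per equal-width interval of [0, 1]; None counts as missing."""
    counts = [0] * n_bins
    missing = 0
    for v in values:
        if v is None:
            missing += 1  # contributes to the MV bar
            continue
        # A value of exactly 1.0 is placed in the last interval.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts, missing


# Toy similarity values, invented for illustration.
counts, missing = bin_similarities([0.05, 0.95, 1.0, None, 0.5])
print(counts, missing)
```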
Figure 5: Comparison of the inter-set discriminating power of four similarity measures using KEGG pathways and BP annotations. The DP values obtained with the IntelliGO, Lord-normalized, Al-Mubaid, and SimGIC similarity measures are plotted for each KEGG pathway (top panel for yeast and bottom panel for human).
Figure 6: Comparison of the inter-set discriminating power of four similarity measures using Pfam clans and MF annotations. The DP values obtained with the IntelliGO, Lord-normalized, Al-Mubaid, and SimGIC similarity measures are plotted for each Pfam clan (yeast genes on top and human genes at bottom).
Evaluation results obtained with the CESSM evaluation tool.
| EC set | Metric | SimGIC | SimUI | RA | RM | RB | LA | LM | LB | JA | JM | JB | IntelliGO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All EC | ECC | 0.62 | 0.63 | 0.39 | 0.45 | 0.60 | 0.42 | 0.45 | 0.64 | 0.34 | 0.36 | 0.56 | 0.65 |
| All EC | Pfam | 0.63 | 0.61 | 0.44 | 0.18 | 0.57 | 0.44 | 0.18 | 0.56 | 0.33 | 0.12 | 0.49 | 0.48 |
| All EC | SeqSim | 0.71 | 0.59 | 0.50 | 0.12 | 0.66 | 0.46 | 0.12 | 0.60 | 0.29 | 0.10 | 0.54 | 0.40 |
| Non-IEA EC | ECC | 0.58 | 0.57 | 0.37 | 0.47 | 0.48 | 0.38 | 0.51 | 0.51 | 0.37 | 0.46 | 0.51 | 0.48 |
| Non-IEA EC | Pfam | 0.58 | 0.55 | 0.43 | 0.44 | 0.52 | 0.42 | 0.42 | 0.51 | 0.33 | 0.34 | 0.45 | 0.43 |
| Non-IEA EC | SeqSim | 0.66 | 0.59 | 0.46 | 0.48 | 0.65 | 0.41 | 0.40 | 0.59 | 0.31 | 0.36 | 0.52 | 0.43 |
Table 4: Pearson linear correlation coefficients are displayed for the ECC (Enzyme Classification Comparison), Pfam, and sequence similarity (SeqSim) metrics. The Molecular Function GO annotation is used including (first three rows) or excluding (last three rows) annotation terms with IEA evidence codes. The column headings are: SimUI: Union Intersection similarity; RA: Resnik Average; RM: Resnik Max; RB: Resnik Best match; LA: Lord Average; LM: Lord Max; LB: Lord Best match; JA: Jaccard Average; JM: Jaccard Max; JB: Jaccard Best match.
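Each cell of the table is a Pearson linear correlation between a semantic similarity measure and a reference metric, computed over the same protein pairs. As a minimal sketch of that computation (the function name `pearson` and the toy score vectors are hypothetical; this is not the CESSM implementation):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy scores for four protein pairs, invented for illustration.
semantic_scores = [0.1, 0.4, 0.6, 0.9]
reference_scores = [0.2, 0.3, 0.7, 0.8]
print(round(pearson(semantic_scores, reference_scores), 2))
```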