Abstract
BACKGROUND: Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have improved prediction quality and extended prediction coverage. However, it has also been observed that integrating more data sources does not always increase prediction quality. Therefore, selecting the data sources that contribute most to protein function prediction has become an important issue.
Year: 2009 PMID: 20043848 PMCID: PMC2813249 DOI: 10.1186/1471-2105-10-455
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Prediction quality of different prediction approaches
| Specificity | # of GO terms | KLR: Pfam | KLR: Interpro | KLR: All (NOS) | KLR: All (S) | KLR: ES | KL1LR, first λ (NOS) | KL1LR, first λ (S) | KL1LR, second λ (NOS) | KL1LR, second λ (S) | KLR with Relief |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3-10 | 952 | 0.60 | 0.58 | 0.55 | 0.54 | 0.55 | 0.73 | 0.75 | 0.72 | 0.74 | 0.58 |
| 11-30 | 435 | 0.74 | 0.76 | 0.70 | 0.70 | 0.82 | 0.85 | 0.88 | 0.79 | 0.85 | 0.69 |
| 31-100 | 239 | 0.79 | 0.79 | 0.82 | 0.82 | 0.84 | 0.84 | 0.88 | 0.78 | 0.86 | 0.64 |
| 101-300 | 100 | 0.80 | 0.80 | 0.84 | 0.84 | 0.83 | 0.82 | 0.86 | 0.73 | 0.84 | 0.61 |
GO terms are categorized into four groups based on the number of genes covering the GO term (specificity in the first column). Prediction quality is estimated using AUC values for KLR using Pfam or Interpro data only, KLR using all data sources, KLR using a data source selected by exhaustive search (ES), KL1LR, and KLR using data sources selected by the Relief method. In the case of KL1LR, two different values of the regularization parameter λ are used. NOS stands for non-standardization of features, and S for standardization.
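The AUC values reported in these tables summarize how well a ranking of genes separates annotated from non-annotated genes for a GO term. As a minimal sketch (not the authors' code), the AUC can be computed by pairwise comparison of positive and negative scores, which is equivalent to the Mann-Whitney U statistic. The function name `auc` and the toy inputs are illustrative assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: iterable of 0/1 (1 = gene annotated with the GO term)
    scores: iterable of real-valued prediction scores, same length
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): three of four pos/neg pairs are ordered correctly.
example_auc = auc([1, 1, 0, 0], [0.9, 0.3, 0.5, 0.1])
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one, which is the scale on which the table entries should be read.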
Figure 1. The AUC values of function predictions based on different prediction approaches. The AUC values of KLR using Pfam, KLR using all data sets (standardized), KLR using a data source selected by exhaustive search, KL1LR (λ = 0.01, standardized), and KLR using data sources selected by the Relief method are shown.
Contributions of genomic data sources
| Data type | Data source | Exhaustive search: # of GO terms | Exhaustive search: (# in union) | Exhaustive search: AUC | KL1LR: # of GO terms | KL1LR: (# in union) | KL1LR: AUC | NCG |
|---|---|---|---|---|---|---|---|---|
| Protein-protein interactions | OPHID | 192 | | 0.82 | 201 | | 0.89 | 83 |
| Protein domain | Interpro | 522 | (697) | 0.87 | 408 | (518) | 0.89 | 266 |
| | Pfam | 600 | | 0.86 | 311 | | 0.89 | 210 |
| Phenotype | MGI | 213 | | 0.87 | 346 | | 0.90 | 129 |
| Phylogenetic profile | BioMart | 33 | (95) | 0.83 | 59 | (166) | 0.88 | 4 |
| | Inparanoid | 70 | | 0.84 | 124 | | 0.88 | 22 |
| Disease | OMIM | 41 | | 0.85 | 32 | | 0.88 | 3 |
| Gene expression | Zhang | 28 | | 0.81 | 147 | | 0.90 | 10 |
| | Su | 21 | (55) | 0.82 | 158 | (309) | 0.89 | 8 |
| | SAGE | 16 | | 0.83 | 113 | | 0.90 | 7 |
The numbers of GO terms satisfying the cut-off of prediction accuracies by the AUC and P20R values are presented for each data source along with the average AUC values of the GO terms. For the protein domain, phylogenetic profile, and gene expression data, the number of terms in the union set is shown in parentheses. The numbers of common terms between the two approaches are shown in the last column (NCG).
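The cut-off above relies on P20R alongside the AUC. The sketch below assumes that P20R means precision at 20% recall, computed by walking down the score-ranked gene list until at least 20% of the annotated genes have been recovered. That reading, the function name `precision_at_recall`, and the toy data are assumptions, not details confirmed by this text.

```python
def precision_at_recall(labels, scores, recall_level=0.2):
    """Precision at the first rank where recall reaches `recall_level`.

    labels: list of 0/1 (1 = gene annotated with the GO term)
    scores: list of prediction scores, same length
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank genes from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    true_pos = 0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        if true_pos / total_pos >= recall_level:
            return true_pos / k
    return true_pos / len(ranked)  # only reached if recall_level > 1


# Toy example (hypothetical): the first positive appears at rank 2 of 5.
example_p20r = precision_at_recall([0, 1, 0, 1, 1], [0.9, 0.8, 0.7, 0.6, 0.5])
```

Because P20R looks only at the top of the ranking, a GO term can have a high AUC and still a low P20R, as several rows in the tables show.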
GO terms giving a high prediction quality using only one data source
| Data source | GO | Term | NPWGD | NPWG | AUC | P20R | DA |
|---|---|---|---|---|---|---|---|
| OPHID (PPI) | 0048489 | synaptic vesicle transport | 13 | 14 | 0.92 | 0.39 | 0.17 |
| | 0006887 | exocytosis | 21 | 27 | 0.91 | 0.87 | 0.12 |
| Interpro (Domain) | 0006071 | glycerol metabolism | 10 | 11 | 0.87 | 0.8 | 0.44 |
| | 0000160 | two-component signal transduction system (phosphorelay) | 10 | 10 | 1 | 1 | 0.41 |
| | 0006801 | superoxide metabolism | 10 | 10 | 0.93 | 1 | 0.31 |
| | 0043632 | modification-dependent macromolecule catabolism | 47 | 47 | 0.96 | 0.5 | 0.28 |
| | 0006508 | proteolysis | 233 | 240 | 0.92 | 0.69 | 0.28 |
| | 0006812 | cation transport | 173 | 176 | 0.90 | 0.93 | 0.15 |
| Pfam (Domain) | 0016311 | dephosphorylation | 48 | 51 | 0.97 | 0.81 | 0.35 |
| | 0006338 | chromatin remodeling | 21 | 22 | 0.91 | 0.28 | 0.33 |
| | 0031497 | chromatin assembly | 29 | 30 | 0.97 | 0.63 | 0.32 |
| | 0006470 | protein amino acid dephosphorylation | 46 | 49 | 0.98 | 0.69 | 0.31 |
| | 0006333 | chromatin assembly or disassembly | 41 | 42 | 0.96 | 0.71 | 0.3 |
| MGI (Phenotype) | 0008344 | adult locomotory behavior | 14 | 19 | 0.9 | 0.21 | 0.31 |
| | 0030534 | adult behaviour | 18 | 23 | 0.9 | 0.21 | 0.3 |
| | 0007605 | sensory perception of sound | 26 | 40 | 0.94 | 0.55 | 0.27 |
| | 0048232 | male gamete generation | 44 | 70 | 0.93 | 0.34 | 0.26 |
| | 0007283 | spermatogenesis | 44 | 70 | 0.94 | 0.28 | 0.25 |
| | 0000003 | reproduction | 101 | 152 | 0.87 | 0.52 | 0.20 |
| OMIM (Diseases) | 0008643 | carbohydrate transport | 11 | 30 | 0.94 | 0.87 | 0.15 |
| Zhang | 0001502 | cartilage condensation | 10 | 10 | 0.85 | 0.23 | 0.15 |
GO terms and data sources that make outstanding contributions to the prediction of the given GO term are listed; only a subset of the GO terms covering at least 10 proteins is shown. NPWGD is the number of proteins that have both the given GO term and the given data source, NPWG is the number of proteins with the given GO term, and DA is the difference between the AUC score and the second highest accuracy.
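The DA column can be reproduced with a few lines of code. The sketch below assumes that the "second highest accuracy" is the second highest AUC obtained for the same GO term from any other single data source. In the toy dictionary only the Interpro value (0.90 for cation transport) comes from the table; the other AUC values and the function name `difference_in_accuracy` are hypothetical.

```python
def difference_in_accuracy(auc_by_source):
    """Return (best data source, best AUC minus second-best AUC)."""
    if len(auc_by_source) < 2:
        raise ValueError("need AUC values for at least two data sources")
    ranked = sorted(auc_by_source.items(), key=lambda kv: kv[1], reverse=True)
    (best_source, best_auc), (_, second_auc) = ranked[0], ranked[1]
    return best_source, best_auc - second_auc


# Hypothetical per-source AUCs for one GO term; only the Interpro value is from the table.
toy_aucs = {"Interpro": 0.90, "Pfam": 0.75, "MGI": 0.70, "OPHID": 0.62}
best, da = difference_in_accuracy(toy_aucs)
```

A large DA indicates that one data source carries most of the predictive signal for that GO term.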
Figure 2. Illustration of genomic data sources for the genes with 'Cation transport' function. 176 genes have the 'Cation transport' (GO:0006812) function. The relationships between these genes and the five different genomic data sources are illustrated. For each data source, the AUC value and the P20R value obtained with the KLR method are also shown. (a) Protein domains belonging to the genes with the given GO term are coloured blue in the matrix. Domains appearing in more than 10 genes are boxed in red in the matrix, and their names and identifiers are listed below. (b) Expression levels of genes with the given GO term are presented. Genes and tissues are grouped based on hierarchical bi-clustering of expression levels. Tissues commonly over-expressed in several genes are circled in blue, and the names of the tissues are listed below. (c) MGI phenotypes belonging to the genes with the given GO term are coloured blue in the matrix. Phenotypes appearing in more than 15 genes are boxed in red in the matrix, and their names and identifiers are listed below. (d) Protein-protein interaction network of genes with the given GO term. Red (black) lines indicate direct (indirect) interactions. Highly connected proteins based on direct interactions are listed with their identifiers. (e) OMIM diseases are presented in the same way as the domains and phenotypes.
Figure 3. Illustration of genomic data sources for the genes having a 'Positive regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism' function. 161 genes have the 'Positive regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism' (GO:0045935) function. (a) - (e) as described in Figure 2 (a) - (e).
Figure 4. Illustration of genomic data sources for the genes having a 'Reproduction' function. 152 genes have the 'Reproduction' (GO:0000003) function. (a) - (e) as described in Figure 2 (a) - (e).
Enrichment test for an informative data source in the GO hierarchy
| Data source | GO term | Description | P-value | NO | NOPW |
|---|---|---|---|---|---|
| OPHID (PPI) | | | | | |
| Interpro (Domain) | | | | | |
| Pfam (Domain) | 0044271 | nitrogen compound biosynthetic process | 0 | 10 | 10 |
| | 0006725 | aromatic compound metabolic process | 4.85E-08 | 25 | 21 |
| MGI (Phenotype) | | | | | |
| BioMart (Phylogenetic profile) | 0046164 | alcohol catabolic process | 1.21E-08 | 6 | 4 |
| | 0030100 | regulation of endocytosis | 5.89E-07 | 5 | 3 |
| | 0046365 | monosaccharide catabolic process | 5.89E-07 | 5 | 3 |
| Inparanoid (Phylogenetic profile) | | | | | |
| OMIM (Diseases) | 0006812 | cation transport | 1.62E-05 | 15 | 4 |
| Zhang | 0040013 | negative regulation of locomotion | 3.21E-06 | 3 | 2 |
| | 0008380 | RNA splicing | 6.22E-05 | 6 | 2 |
| Su | | | | | |
| | 0007059 | chromosome segregation | 1.39E-05 | 5 | 2 |
| SAGE (Gene expression) | | | | | |
| | 0050851 | antigen receptor-mediated signaling pathway | 1.54E-05 | 7 | 2 |
The data source in the first column is informative for predicting gene functions belonging to the GO terms in the second column and their offspring GO terms. The p-values in the fourth column represent the significance of the number of offspring terms that are well predicted using the given data source. This table presents GO terms having a p-value < 1.00E-04 and a sufficient number of offspring terms. The bold and italic fonts indicate the significant GO terms based on the enrichment tests from the exhaustive search feature selection and the KL1LR method for the given data source. Among them, if the data source is informative for the GO term itself, the GO term is underlined (see the main text for more explanation). NO stands for the number of all offspring terms of a GO term and NOPW for the number of offspring terms predicted well using a given data source.
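The caption does not say which statistical test produces these p-values. Purely as an illustration, and assuming a one-sided hypergeometric test (an assumption, not the authors' stated method), the significance of seeing NOPW well-predicted terms among NO offspring terms could be sketched as follows. The background sizes in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(n_total, n_well, n_offspring, n_offspring_well):
    """P(X >= n_offspring_well) for a hypergeometric draw.

    n_total:          all GO terms considered (background)
    n_well:           background terms predicted well by the data source
    n_offspring:      offspring terms of the tested GO term (NO)
    n_offspring_well: offspring terms predicted well (NOPW)
    """
    denom = comb(n_total, n_offspring)
    upper = min(n_well, n_offspring)
    numer = sum(
        comb(n_well, i) * comb(n_total - n_well, n_offspring - i)
        for i in range(n_offspring_well, upper + 1)
    )
    return numer / denom


# Hypothetical background: 1000 GO terms, 50 of them predicted well by the source.
# NO = 6 and NOPW = 4 follow the pattern of a table row, but the background is made up.
p_value = hypergeom_tail(1000, 50, 6, 4)
```

Under this assumed test, a small p-value means that the well-predicted terms are concentrated under the tested GO term more than chance would suggest.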
Figure 5. The hierarchy of 'Ion transport' (GO:0006811). Coloured GO terms have high AUC values based on the Interpro domain data set. Among them, the red boxes mark GO terms that have a high AUC in only that data source. The parentheses give the number of genes having the given GO term (along with the number of those genes having Interpro data), the AUC value, and the P20R value obtained with the Interpro data source.
Prediction quality of newly annotated genes
| Specificity | # of genes | KLR (exhaustive search) | KL1LR (first λ) | KL1LR (second λ) |
|---|---|---|---|---|
| 3-10 | 540 | 0.61 | 0.76 | 0.79 |
| 11-30 | 273 | 0.71 | 0.79 | 0.81 |
| 31-100 | 163 | 0.72 | 0.80 | 0.82 |
| 101-300 | 75 | 0.69 | 0.78 | 0.80 |
GO terms are categorized into four groups based on the number of genes covering the GO term. Prediction quality is estimated by using the AUC values for KLR with a data source selected by exhaustive search and KL1LR with all data sources. In the case of KL1LR, two different values of regularization parameter λ are used. NOS stands for non-standardization, and S for standardization.
Performance of GO term prediction of yeast genes
| Specificity | # of GO terms | KLR: All (NOS) | KLR: All (S) | KLR: ES | KL1LR, first λ (NOS) | KL1LR, first λ (S) | KL1LR, second λ (NOS) | KL1LR, second λ (S) | KLR with Relief |
|---|---|---|---|---|---|---|---|---|---|
| 3-10 | 567 | 0.51 | 0.51 | 0.61 | 0.68 | 0.70 | 0.67 | 0.68 | 0.52 |
| 11-30 | 348 | 0.70 | 0.70 | 0.79 | 0.80 | 0.82 | 0.78 | 0.79 | 0.66 |
| 31-100 | 210 | 0.75 | 0.75 | 0.80 | 0.79 | 0.82 | 0.77 | 0.79 | 0.64 |
| 101-300 | 121 | 0.77 | 0.77 | 0.79 | 0.77 | 0.80 | 0.73 | 0.78 | 0.65 |
GO terms are categorized into four groups based on the number of genes covering the GO term. Prediction quality is estimated using AUC values for KLR with all data sources, KLR with a data source selected by exhaustive search (ES), KL1LR method (with two regularization parameters), and KLR with features selected by the Relief method. NOS stands for non-standardization of the features, S for standardization.
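The NOS and S labels distinguish models trained on raw features from models trained on standardized features. As a minimal sketch, assuming that standardization here means z-scoring each feature column (the captions do not spell out the exact transform), the preprocessing step could look like this. The function name `standardize` is illustrative.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Z-score each feature column; constant columns map to all zeros.

    columns: list of feature columns, each a list of floats
    """
    standardized = []
    for col in columns:
        mu = mean(col)
        sd = pstdev(col)  # population standard deviation
        standardized.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    return standardized


# Toy feature matrix (hypothetical): one varying column and one constant column.
z = standardize([[1.0, 2.0, 3.0], [5.0, 5.0]])
```

Putting features on a common scale matters for L1-regularized models such as KL1LR, because the penalty acts on coefficient magnitudes and these depend on feature scale.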
Consistencies of informative genomic data types between old and new annotation data in Mus musculus and between Mus musculus and yeast
| Specificity | Mus musculus: # of GO terms | Mus musculus: # of GO terms having the same informative feature types | Yeast: # of GO terms | Yeast: # of GO terms having the same informative feature types |
|---|---|---|---|---|
| 3-10 | 540 | 203 | 270 | 61 |
| 11-30 | 273 | 177 | 176 | 59 |
| 31-100 | 163 | 91 | 119 | 41 |
| 101-300 | 75 | 35 | 57 | 25 |
| Total | 1051 | 506 | 622 | 186 |
GO terms are categorized into four groups based on the number of genes covering the GO term. The number of genes and the number of GO terms with the same informative feature types as in the Mus musculus Feb 2006 data are presented for the genes newly annotated as of Aug 2009 in Mus musculus and for the yeast data set, respectively.
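The totals in this table translate directly into consistency rates. The short computation below uses only the counts from the table and shows that roughly 48% of the GO terms keep the same informative feature types in the newer Mus musculus annotation, against roughly 30% for yeast. The variable and function names are illustrative.

```python
# (Mus musculus terms, Mus musculus same, yeast terms, yeast same), copied from the table.
rows = {
    "3-10": (540, 203, 270, 61),
    "11-30": (273, 177, 176, 59),
    "31-100": (163, 91, 119, 41),
    "101-300": (75, 35, 57, 25),
}


def consistency(table):
    """Return (Mus musculus fraction, yeast fraction) of consistent GO terms."""
    mouse_total = sum(r[0] for r in table.values())
    mouse_same = sum(r[1] for r in table.values())
    yeast_total = sum(r[2] for r in table.values())
    yeast_same = sum(r[3] for r in table.values())
    return mouse_same / mouse_total, yeast_same / yeast_total


mouse_frac, yeast_frac = consistency(rows)
```

The per-group sums also reproduce the "Total" row of the table (1051, 506, 622, and 186).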