Lourdes Peña-Castillo, Murat Tasan, Chad L Myers, Hyunju Lee, Trupti Joshi, Chao Zhang, Yuanfang Guan, Michele Leone, Andrea Pagnani, Wan Kyu Kim, Chase Krumpelman, Weidong Tian, Guillaume Obozinski, Yanjun Qi, Sara Mostafavi, Guan Ning Lin, Gabriel F Berriz, Francis D Gibbons, Gert Lanckriet, Jian Qiu, Charles Grant, Zafer Barutcuoglu, David P Hill, David Warde-Farley, Chris Grouios, Debajyoti Ray, Judith A Blake, Minghua Deng, Michael I Jordan, William S Noble, Quaid Morris, Judith Klein-Seetharaman, Ziv Bar-Joseph, Ting Chen, Fengzhu Sun, Olga G Troyanskaya, Edward M Marcotte, Dong Xu, Timothy R Hughes, Frederick P Roth.
Abstract
BACKGROUND: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.
Year: 2008 PMID: 18613946 PMCID: PMC2447536 DOI: 10.1186/gb-2008-9-s1-s2
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Data collection description: summary of the data sources
| Data type | Description | Representation |
| Gene expression | Expression data from oligonucleotide arrays for 13,566 genes across 55 mouse tissues (Zhang | Median-subtracted, arcsinh intensity measurements |
| Gene expression | Expression data from Affymetrix arrays for 18,208 genes across 61 mouse tissues (Su | gcRMA-condensed intensity measurements |
| Gene expression | Tag counts at quality 0.99 cut-off from 139 SAGE libraries for 16,726 genes [ | Average and total tag counts |
| Sequence patterns | Protein sequence pattern annotations from Pfam-A (release 19) for 15,569 genes with 3,133 protein families [ | Binary annotation patterns |
| Sequence patterns | Protein sequence pattern annotations from InterPro (release 12.1) for 16,965 genes with 5,404 sequence patterns [ | Binary annotation patterns |
| Protein interactions | Protein-protein interactions from OPHID for 7,125 genes [ | Binary interaction patterns and shortest path between genes |
| Phenotypes | Phenotype annotations from MGI for 3,439 genes with 33 phenotypes [ | Binary annotation patterns |
| Conservation profile | Conservation pattern from Ensembl (v38) for 15,939 genes across 18 species [ | Binary conservation patterns and conservation scores |
| Conservation profile | Conservation pattern from Inparanoid (v4.0) for 15,703 genes across 21 species [ | Binary conservation patterns and Inparanoid scores |
| Disease associations | Disease associations from OMIM for 1,938 genes to 2,488 diseases/phenotypes [ | Binary annotation patterns |
gcRMA, robust multi-array analysis with background adjustment for GC content of probes; OMIM, Online Mendelian Inheritance in Man; OPHID, Online Predicted Human Interaction Database; SAGE, serial analysis of gene expression.
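Several rows in the table above list "binary annotation patterns" as the representation. As a minimal sketch of what such an encoding could look like, the snippet below turns a gene-to-annotation mapping into one 0/1 vector per gene. The function name `binary_patterns`, the toy gene names, and the example family identifiers are illustrative assumptions, not taken from the data collection itself.

```python
def binary_patterns(annotations):
    """Encode {gene: set of annotation IDs} as 0/1 vectors.

    Returns (features, matrix), where `features` is the sorted list of all
    annotation IDs seen and `matrix` maps each gene to a list of 0/1 flags
    in the same order as `features`.
    """
    features = sorted({f for ids in annotations.values() for f in ids})
    matrix = {
        gene: [1 if f in ids else 0 for f in features]
        for gene, ids in annotations.items()
    }
    return features, matrix


# Hypothetical toy input: three genes and two protein-family identifiers.
toy = {
    "geneA": {"PF00001", "PF00069"},
    "geneB": {"PF00069"},
    "geneC": set(),  # a gene with no annotation gets an all-zero row
}
features, matrix = binary_patterns(toy)
```

In this encoding, genes without any annotation still receive a row, which keeps the gene set aligned across data types.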
Brief description of function prediction methods used
| Submission identifier | Approach | Name | Author initials |
| A | Compute several kernel matrices (SVM) for each data matrix, train one GO term specific SVM per kernel, and map SVMs' discriminants to probabilities using logistic regression | Calibrated ensembles of SVMs | GO, GL, JQ, CG, MJ, and WSN |
| B | Four different kernels are used per data set. Integration of best kernels and data sources is done using the kernel logistic regression model | Kernel logistic regression [ | HL, MD, TC, and FS |
| C | Construct similarity kernels, assign a weight to each kernel using linear regression, combine the weighted kernels, and use a graph based algorithm to obtain the score vector | geneMANIA | SM, DW-F, CG, DR, and QM |
| D | Train SVM classifiers on each GO term and individual data sets, construct several Bayesian networks that incorporate diverse data sources and hierarchical relationships, and choose for each GO term the Bayes net or the SVM yielding the highest AUC | Multi-label hierarchical classification [ | YG, CLM, ZB, and OGT |
| E | Combination of an ensemble of classifiers (naïve Bayes, decision tree, and boosted tree) with guilt-by-association in a functional linkage network, choosing the maximum score | Combination of classifier ensemble and gene network | WKK, CK, and EMM |
| F | Code the relationship between functional similarity and the data into a functional linkage graph and predict gene functions using Boltzmann machine and simulated annealing | GeneFAS (gene function annotation system) [ | TJ, CZ, GNL, and DX |
| G | Two methods with scores combined by logistic regression: guilt-by-association using a weighted functional linkage graph generated by probabilistic decision trees; and random forests trained on all binary gene attributes | Funckenstein | WT, MT, FDG, and FPR |
| H | Pairwise similarity features for gene pairs were derived from the available data. A Random Forest classifier was trained using pairs of genes for each GO term. Predictions are based on similarity between the query gene and the positive examples for that GO term | Function prediction through query retrieval | YQ, JK, and ZB |
| I | Construct an interaction network per data set, merge data set graphs into a single graph, and apply a belief propagation algorithm to compute the probability for each protein to have a specific function given the functions assigned to the proteins in the rest of the graph | Function prediction with message passing algorithms [ | ML and AP |
AUC, area under the receiver operating characteristic curve; GO, Gene Ontology.
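Two of the approaches in the table above (E and G) mention guilt-by-association in a functional linkage network. The following is a generic sketch of that idea, not a reconstruction of either submission: each gene is scored for one GO term by the weighted fraction of its neighbors that already carry the term. The function name `gba_scores`, the edge weights, and the gene names are assumptions made for illustration.

```python
def gba_scores(edges, positives):
    """Guilt-by-association scores for a single GO term.

    edges:     {(gene_u, gene_v): weight} for an undirected linkage graph
    positives: set of genes already annotated with the term
    Returns {gene: share of its total edge weight that goes to positives}.
    """
    total = {}      # summed weight of all edges touching each gene
    to_pos = {}     # summed weight of edges leading to annotated genes
    for (u, v), w in edges.items():
        for gene, neighbor in ((u, v), (v, u)):
            total[gene] = total.get(gene, 0.0) + w
            if neighbor in positives:
                to_pos[gene] = to_pos.get(gene, 0.0) + w
    return {g: to_pos.get(g, 0.0) / t for g, t in total.items() if t > 0}


# Hypothetical toy network: a chain g1 - g2 - g3 - g4 with one heavy edge.
edges = {("g1", "g2"): 1.0, ("g2", "g3"): 1.0, ("g3", "g4"): 3.0}
scores = gba_scores(edges, positives={"g1", "g4"})
```

A gene's own annotation is deliberately ignored here, so the score reflects only the evidence from its neighbors.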
Figure 1. Measures of performance for the initial round of GO term predictions. (a) Mean area under the receiver operating characteristic curve (AUC) within each evaluation category, evaluated using the held-out genes. Gene Ontology Biological process (GO-BP), Cellular component (GO-CC), and Molecular function (GO-MF) branches are indicated on the x-axis, grouped by specificity (indicated by the minimum number of genes in the training set associated with each GO term in a given category). Upper case letters associated with the color code correspond to submission identifier. (b) Mean AUC within each evaluation category, evaluated prospectively using newly annotated genes. (c) For each pair of submissions X and Y, we test for difference in AUC value for every GO term (evaluated using held-out genes). Color bars indicate fraction of pairwise comparisons for which X's AUC is significantly higher (blue), not significantly different (beige), and significantly lower (maroon). (d) As (c), except evaluated using the newly annotated genes. (e) The fraction of GO terms exceeding the indicated precision at 20% recall (P20R) value, evaluated using held-out genes. The black line corresponds to the fraction of GO terms for which the 'straw man' approach achieved the indicated precision. (f) As (e), except with P20R values derived prospectively from newly annotated genes.
Figure 2. Measures of performance for the second round of GO term predictions. (a, b) As described in Figure 1a, b, except that the gray color area indicates performance in the first set of submissions. (c-f) As described in Figure 1c-f, except that asterisks in (c) and (d) indicate second-round submissions and dashed lines in (e) and (f) indicate the performance of an earlier submission by the same group. GO, Gene Ontology.
Figure 3. Factors affecting prediction performance. (a) Precision at 20% recall (P20R) values evaluated using held-out annotations on all Gene Ontology (GO) terms (vertical axis) within each of the 12 evaluation categories for each submission (left panel) and for a simple guilt-by-association approach using each data set in turn as its sole evidence source (right panel). The number of genes in each evaluation category is shown in parentheses. GO-BP, GO Biological process; GO-CC, GO Cellular component; GO-MF, GO Molecular function; NB, naïve Bayes. Data sets are described in Table 1. (b) Fraction of the 21,603 genes in the data collection with at least one annotated neighbor per data set. (c) Analysis of variance (ANOVA), exploring the effects of various factors on P20R values. (d) Fraction of total variance in P20R values that is explained by each effect. Asterisks in (c, d) indicate interaction between two factors.
Figure 4. Distribution of GO terms at several precision/recall performance points. Proportion of Gene Ontology (GO) terms per evaluation category with a precision/recall performance point that is both above and to the right of a given precision/recall point in the contour plots. GO-BP, GO Biological process; GO-CC, GO Cellular component; GO-MF, GO Molecular function.
Figure 5. Number of high-precision predictions among GO terms for which precision can be confidently estimated. Number of currently annotated (green) versus predicted genes (orange, predictions expected to be correct; gray, predictions expected to be incorrect) for a subset of Gene Ontology (GO) terms for which 30% precision on held-out annotations was achieved while recovering at least 10 positives in the held-out set. The number of predicted genes displayed was limited to 1,000. GO terms were ordered according to similarity of prediction/annotation patterns. Terminal digits of GO term identifiers are shown in parentheses. GO-BP, GO Biological process; GO-CC, GO Cellular component; GO-MF, GO Molecular function.
Figure 6. Illustration of evidence underlying predictions for the GO term 'Cell adhesion'. As an assessment of predictive usefulness, the precision at 20% recall (P20R) value based on each single data source is shown in parentheses. (a) Expression levels of annotated genes (dark green) and predictions (orange), grouped by Pearson correlation and complete-linkage hierarchical clustering. (b) Protein domains in common among predictions and annotated genes. (c) Largest protein-protein interaction network among predictions and annotated genes. OPHID, Online Predicted Human Interaction Database. (d) Disease and (e) phenotype annotations in common between predictions and annotated genes. Terminal digits of identifiers are shown in parentheses. OMIM, Online Mendelian Inheritance in Man.
Figure 7. Illustration of evidence underlying predictions for the GO term 'Mitochondrial part'. (a-e) As described in Figure 6a-e. GO, Gene Ontology.
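The figure captions above rely on two per-GO-term measures, AUC and precision at 20% recall (P20R). The sketch below shows one common way to compute both from a list of prediction scores and true labels. It assumes that P20R is read off at the first rank in the score-sorted list where recall reaches 20%; the exact convention used in the study is not stated here, and the function names and toy data are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive outscores a
    randomly chosen negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_recall(scores, labels, recall=0.2):
    """Precision at the first rank where recall reaches `recall`."""
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += 1 if y else 0
        if true_pos / n_pos >= recall:
            return true_pos / k


# Hypothetical toy predictions for one GO term: 10 genes, 5 true positives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]
toy_auc = auc(scores, labels)
toy_p20r = precision_at_recall(scores, labels, recall=0.2)
```

The pairwise AUC is quadratic in the number of genes, which is fine for a small illustration; a rank-based formulation would be preferable for genome-scale score lists.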