Ernesto Iacucci, Hans H Zingg, Theodore J Perkins.
Abstract
High-throughput molecular biology studies, such as microarray assays of gene expression, two-hybrid experiments for detecting protein interactions, or ChIP-Seq experiments for transcription factor binding, often result in an "interesting" set of genes, for example genes that are co-expressed or bound by the same factor. One way of understanding the biological meaning of such a set is to consider what processes or functions, as defined in an ontology, are over-represented (enriched) or under-represented (depleted) among genes in the set. Usually, the significance of enrichment or depletion scores is based on simple statistical models and on the membership of genes in different classifications. We consider the more general problem of computing p-values for arbitrary integer additive statistics, or weighted membership functions. Such membership functions can be used to represent, for example, prior knowledge on the role of certain genes or classifications, differential importance of different classifications or genes to the experimenter, hierarchical relationships between classifications, or different degrees of interestingness or evidence for specific genes. We describe a generic dynamic programming algorithm that can compute exact p-values for arbitrary integer additive statistics. We also describe several optimizations for important special cases, which can provide orders-of-magnitude speedups in the computations. We apply our methods to datasets describing oxidative phosphorylation and parturition and compare p-values based on computations of several different statistics for measuring enrichment. We find major differences between p-values resulting from these statistics, and that some statistics recover "gold standard" annotations of the data better than others. Our work establishes a theoretical and algorithmic basis for far richer notions of enrichment or depletion of gene sets with respect to gene ontologies than has previously been available.
Keywords: depletion; dynamic programming; enrichment; gene ontology; weighted membership
Year: 2012 PMID: 22375144 PMCID: PMC3284693 DOI: 10.3389/fgene.2012.00024
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
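The dynamic programming idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or any of their optimizations. It assumes a null model in which the random gene set is a uniformly chosen subset of fixed size `k`, and that each gene carries one integer weight for the classification of interest. The function name `exact_upper_pvalue` and its parameters are hypothetical.

```python
from fractions import Fraction
from math import comb


def exact_upper_pvalue(weights, k, observed):
    """Return P(S >= observed), where S sums the integer weights of a
    uniformly random k-element subset of the genes (assumed null model)."""
    n = len(weights)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of genes")
    # counts[j][s] = number of j-gene subsets (of the genes processed so far)
    # whose weights sum to s.
    counts = [{} for _ in range(k + 1)]
    counts[0][0] = 1
    for w in weights:
        # Visit larger subset sizes first so each gene is used at most once.
        for j in range(k, 0, -1):
            for s, c in counts[j - 1].items():
                counts[j][s + w] = counts[j].get(s + w, 0) + c
    favourable = sum(c for s, c in counts[k].items() if s >= observed)
    return Fraction(favourable, comb(n, k))
```

With 0/1 weights the statistic is a plain membership count, and under the assumed null model the result coincides with the upper tail of the hypergeometric distribution. Other integer weights give the weighted case with the same recurrence.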
Figure 1. Schematic of GO. Abstract example of an ontology illustrating the principles of the GO DRAG and the scoring functions. Boxes labeled with Root or n correspond to classifications. Sets represent the genes directly mapped to each classification. The broken line indicates the subgraph rooted at v − i.
Figure A1. Explicit examples of weighted membership functions.
Figure 2. Urn representation of Figure . Gray balls (genes) are those found at or below v4 and white balls (genes) are those found in the rest of the DRAG. (A) Gray and white balls (genes) in the urn are independent. (B) Gray and white balls (genes) in the urn are interrelated. The solid lines indicate that balls (genes) are related.
Summary of notation used in this paper.
| Symbol | Description |
|---|---|
| | Root of the ontology |
| | Vertex (classification) in the ontology |
| | Set of all vertices |
| | The complement of |
| | Gene |
| | Set of all genes |
| | Set of interesting genes (Subset of |
| | Random subset of |
| Φ( | Weighted membership of gene |
| Φ1( | Counts the membership in the DRAG from |
| Φ2( | Measures the number of paths in the DRAG from |
| Φ3( | Is equal to some global “score” (e.g., differential expression) assigned to |
| Φ( | Sum of weighted memberships in classification |
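The membership functions in the notation table can be made concrete with a small sketch. Because the table entries are truncated, the code follows one plausible reading only: Φ1 is taken as a 0/1 indicator that a gene is mapped at or below a classification, Φ2 as the number of downward paths from the classification to the vertices where the gene is directly mapped, and the score of a classification as the sum of weighted memberships over a gene set. The toy ontology, the gene names, and every function name are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy ontology: edges point from a classification to its children.
CHILDREN = {"Root": ("v1", "v2"), "v1": ("v3",), "v2": ("v3",), "v3": ()}
# Genes mapped directly to each classification (also hypothetical).
DIRECT_GENES = {
    "Root": frozenset(),
    "v1": frozenset({"g1"}),
    "v2": frozenset(),
    "v3": frozenset({"g2"}),
}


@lru_cache(maxsize=None)
def phi2(v, g):
    """Assumed reading of Phi2: number of downward paths from v to a
    classification to which gene g is directly mapped."""
    here = 1 if g in DIRECT_GENES[v] else 0
    return here + sum(phi2(child, g) for child in CHILDREN[v])


def phi1(v, g):
    """Assumed reading of Phi1: 1 if g is mapped at or below v, else 0."""
    return 1 if phi2(v, g) > 0 else 0


def classification_score(v, genes, phi):
    """Sum of weighted memberships of the given genes in classification v."""
    return sum(phi(v, g) for g in genes)
```

In this toy graph the gene `g2` is reachable from `Root` along two paths, so the path-counting weight gives it more influence on the score of `Root` than the indicator weight does.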
Comparison of methods with added noise.
| | Φ2\|Φ1 | Φ2\|Φ2 | Φ1\|Φ1 |
|---|---|---|---|
| AUC | 0.6969 | 0.7367 | 0.7107 |
| Sensitivity | 0.4857 | 0.8857 | 0.3142 |
| Specificity | 0.9488 | 0.3352 | 0.9261 |
| AUC | 0.7179 | 0.8002 | 0.7867 |
| Sensitivity | 0.4571 | 0.8000 | 0.4571 |
| Specificity | 0.9240 | 0.5379 | 0.9438 |
| AUC | 0.7241 | 0.8124 | 0.7754 |
| Sensitivity | 0.4000 | 0.7714 | 0.4571 |
| Specificity | 0.9401 | 0.7356 | 0.9476 |
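For readers unfamiliar with the metrics in this table, the following sketch shows how sensitivity, specificity, and AUC are commonly computed from per-classification p-values and a list of gold-standard labels. It is a generic illustration, not the authors' evaluation code. The significance threshold `alpha=0.05` and the function names are assumptions.

```python
def sensitivity_specificity(pvalues, is_gold, alpha=0.05):
    """Call a classification significant when p <= alpha (assumed threshold)
    and compare the calls against the gold-standard labels."""
    pairs = list(zip(pvalues, is_gold))
    tp = sum(1 for p, gold in pairs if gold and p <= alpha)
    fn = sum(1 for p, gold in pairs if gold and p > alpha)
    tn = sum(1 for p, gold in pairs if not gold and p > alpha)
    fp = sum(1 for p, gold in pairs if not gold and p <= alpha)
    return tp / (tp + fn), tn / (tn + fp)


def auc(pvalues, is_gold):
    """Area under the ROC curve: the fraction of (gold, non-gold) pairs in
    which the gold classification has the smaller p-value; ties count half."""
    pos = [p for p, gold in zip(pvalues, is_gold) if gold]
    neg = [p for p, gold in zip(pvalues, is_gold) if not gold]
    wins = sum(1.0 if p < q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC depends only on the ranking of the p-values, which is why the two kinds of figures in the table can move in different directions.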
Figure 3. Maintenance of rank. The methods were compared based on their ability to maintain the rank of the “gold standard” classifications. (A) ROC plot of the rankings of the classifications from the dataset with 500 added random genes. (B) ROC plot of the rankings of the classifications from the dataset with 300 added random genes. (C) ROC plot of the rankings of the classifications from the dataset with 100 added random genes.
Comparison of methods on the Girotti and Zingg (.
| Gene ontology | Φ1\|Φ1 | Φ2\|Φ1 | Φ3\|Φ3 | Φ2\|Φ2 |
|---|---|---|---|---|
| Ribosome | 0.093 | 0.078 | 0.956 | 5.91 × 10⁻⁵ |
| Protein modification | 0.842 | 0.653 | 0.268 | 4.21 × 10⁻⁵ |
| Defense response | 0.209 | 0.068 | 0.009 | 3.02 × 10⁻⁴ |
| Lipid transport | 0.423 | 0.423 | 0.601 | 0.285 |
| Lipid metabolism | 0.233 | 0.106 | 0.616 | 1.88 × 10⁻⁵ |
| Intracellular protein transport | 0.545 | 0.548 | 0.504 | 0.096 |
| Hormone | 0.308 | 0.113 | 0.248 | 0.552 |
| Cell differentiation | 0.281 | 0.281 | 0.423 | 0.791 |
| Extracellular matrix | 0.002 | 0.003 | 0.977 | 0.176 |
| Cytoskeleton | 0.259 | 0.125 | 0.432 | 0.017 |
| Cell motility | 0.022 | 0.024 | 0.128 | 0.003 |
Figure A2. Pair-wise method comparisons (A–F). R² values are provided as a summary of the agreement.
Agreement of significance calls.
| Method | Φ1\|Φ1 | Φ2\|Φ1 | Φ3\|Φ3 | Φ2\|Φ2 |
|---|---|---|---|---|
| Φ1\|Φ1 | 23 | 21 | 10 | 12 |
| Φ2\|Φ1 | | 36 | 14 | 21 |
| Φ3\|Φ3 | | | 56 | 24 |
| Φ2\|Φ2 | | | | 107 |