Aleksandar Stojmirović, Yi-Kuo Yu.
Abstract
MOTIVATION: Term-enrichment analysis facilitates biological interpretation by assigning annotation, in the form of terms from controlled vocabularies, to experimentally or computationally obtained data. This process usually involves obtaining statistical significance for each vocabulary term and using the most significant terms to describe a given set of biological entities, which are often associated with weights. Many existing enrichment methods require selecting an arbitrary number of the most significant entities and/or do not account for the weights of entities. Others either mandate extensive simulations to obtain statistics or assume a normal weight distribution. In addition, most methods have difficulty assigning correct statistical significance to terms with few entities.
Year: 2010 PMID: 20826881 PMCID: PMC2958744 DOI: 10.1093/bioinformatics/btq511
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 2. P-value consistency and retrieval stability. (A) The output of ITM Probe emitting mode with human MLL protein (histone methyltransferase subunit) as the source (top) and the log2 ratios from the human T-cell signaling microarray GSM89756 (bottom) were processed by each of the five investigated statistical methods with varying numbers of weighted entities included for analysis (All and Pos include all entities; All uses raw weights while Pos sets all negative weights to 0). The P-values for GO terms from the union of the sets of top five hits, taken over each method and each number of selected entities, are indicated by the color of the corresponding cell. Red dots show the actual top five hits for the method represented by that column. (B) Degree of overlap between sets of significant GO terms. Each panel corresponds to a single method with different numbers of entities used for analysis, with the results from microarray queries shown in the upper triangle and those based on network flow shown in the lower triangle. The color of each cell indicates the average pairwise overlap between the two sets of top ten entities retrieved. For example, consider the light orange cell (horizontally labeled by 100 and vertically labeled by 500) in the mHG panel. It indicates that, on average, the top ten terms retrieved by mHG using the top 100 and the top 500 network flow proteins share about three common terms.
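The overlap measure in panel (B) can be illustrated with a short sketch. It assumes that the overlap is simply the number of terms shared by two top-ten lists, averaged over queries. The function names and the toy GO identifiers are hypothetical and are not taken from the paper.

```python
def top_k_overlap(ranked_a, ranked_b, k=10):
    """Number of terms shared by the top-k entries of two ranked term lists."""
    return len(set(ranked_a[:k]) & set(ranked_b[:k]))


def average_overlap(pairs, k=10):
    """Mean top-k overlap over (ranked_a, ranked_b) pairs, one pair per query."""
    pairs = list(pairs)
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)


def toy_terms(start, stop):
    """Made-up GO-style identifiers, used only to exercise the functions."""
    return ["GO:%07d" % i for i in range(start, stop)]


# Two toy queries: the same method run with two different numbers of entities.
query_pairs = [
    (toy_terms(0, 10), toy_terms(7, 17)),  # shares 3 terms
    (toy_terms(0, 10), toy_terms(6, 16)),  # shares 4 terms
]
mean_shared = average_overlap(query_pairs)
```

Under this reading, a cell value of about three means that two settings of the same method agree on roughly three of their ten best terms.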
Fig. 1. Empirical P-values versus P-value cutoffs reported for investigated enrichment methods. Methods with accurate statistics have their curves follow the dotted line closely over the whole range. Each curve was constructed by aggregating the results of ∼109 GO-based decoy term queries. Displayed on the left (right) are results using weights derived from protein network information flow simulations (microarrays). In microarray plots for SaddleSum, T-profiler and GAGE, full lines indicate the results where negative weights were set to 0, while dashed lines show the results using all weights. The HGEM curves run below the theoretical line and parallel to it because every curve is an aggregate of many curves, each of which (i) represents a single sample of weights determining the parameters to be fed into the hypergeometric distribution, and (ii) is a step function touching the theoretical line and dropping below it. Merging curves from many samples produces the effect seen in our plots.
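A minimal sketch of how such a calibration curve could be computed is given here. It assumes that the empirical P-value at a cutoff is the fraction of decoy queries whose reported P-value is at or below that cutoff. The function name and the simulated data are illustrative only, and the uniform random numbers stand in for a perfectly calibrated method rather than for any of the methods evaluated.

```python
import bisect
import random


def empirical_p_values(reported, cutoffs):
    """Fraction of reported P-values at or below each cutoff."""
    ordered = sorted(reported)
    n = len(ordered)
    return [bisect.bisect_right(ordered, c) / n for c in cutoffs]


# Stand-in for decoy queries: a method with accurate statistics reports
# P-values that are uniform on [0, 1] when there is no real enrichment.
rng = random.Random(0)
reported = [rng.random() for _ in range(100_000)]

cutoffs = [0.001, 0.01, 0.1]
curve = empirical_p_values(reported, cutoffs)
# For an accurate method each value in `curve` stays close to its cutoff,
# which corresponds to following the dotted diagonal in the figure.
```

A method whose values fall below the cutoffs reports P-values that are too large (conservative), while values above the cutoffs indicate P-values that are too small.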
Running times of evaluated enrichment statistics algorithms (in seconds)
| Method | Total running time (Network) | Total running time (Microarray) | Average time per query (Network) | Average time per query (Microarray) |
|---|---|---|---|---|
| SaddleSum | 558 | 872 | 0.56 | 0.64 |
| HGEM | 501 | 615 | 0.50 | 0.45 |
| T-profiler | 446 | 586 | 0.45 | 0.43 |
| GAGE | 499 | 651 | 0.50 | 0.48 |
| mHG | 2433 | 3407 | 2.43 | 2.51 |
We queried GO 10 times with each of the five examined enrichment methods using weights from 100 network simulation results and 136 microarrays (the same datasets used for the P-value accuracy experiments). Running times for P-value calculations on dual-core 2.8 GHz AMD Opteron 254 processors (using a single core for each run), aggregated over all samples, are shown on the left, while average times per query are shown on the right. The HGEM method used a 100-object cutoff.
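The per-query averages are consistent with the totals if each method was run 10 times over every weight set, giving 10 × 100 = 1000 network queries and 10 × 136 = 1360 microarray queries. The following Python check of that arithmetic uses only the numbers reported above; the variable and function names are illustrative.

```python
# Total running times in seconds (network, microarray), copied from the table.
totals = {
    "SaddleSum": (558, 872),
    "HGEM": (501, 615),
    "T-profiler": (446, 586),
    "GAGE": (499, 651),
    "mHG": (2433, 3407),
}

REPEATS = 10        # each weight set was queried 10 times
N_NETWORK = 100     # network simulation results
N_MICROARRAY = 136  # microarrays


def average_per_query(total_seconds, n_weight_sets, repeats=REPEATS):
    """Average seconds per query, given the total over all repeated queries."""
    return total_seconds / (n_weight_sets * repeats)


averages = {
    method: (
        round(average_per_query(network, N_NETWORK), 2),
        round(average_per_query(microarray, N_MICROARRAY), 2),
    )
    for method, (network, microarray) in totals.items()
}
```

Rounded to two decimals, these values reproduce the averages in the table, for example 0.56 and 0.64 seconds for SaddleSum and 2.43 and 2.51 seconds for mHG.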