Hui Sun Leong, David Kipling.
Abstract
A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more often than expected by chance within a gene list. Existing applications of ORA have been largely limited to pre-defined terminologies such as GO and KEGG. We report our explorations of whether ORA can be applied to a wider mining of free text. We found that a hitherto underappreciated feature of experimentally derived gene lists is that their constituent genes carry substantially more annotation, because they have been studied for a longer period of time. This bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for it. We have therefore developed three approaches to overcome this bias, and demonstrate their usability on a wide range of published datasets covering different species. A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be obtainable by mining pre-defined ontologies alone.
Year: 2009 PMID: 19429895 PMCID: PMC2699530 DOI: 10.1093/nar/gkp310
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 2. A scatter plot of Chip versus List frequencies for tokens in the ISG gene list. Each data point represents an abstract term. Terms that were identified as significantly enriched (i.e. Bonferroni P ≤ 0.05) in the ISG gene list by the Outlier method are circled, and the adjacent numbers correspond to their rankings. Chip (y-axis) represents the number of genes associated with each term on the whole chip. List (x-axis) represents the number of genes associated with each term in the ISG gene list. The log2-transformed List and Chip frequencies are plotted.
Significantly over-represented abstract terms in the ISG gene list identified using the classical hypergeometric test
| Rank | Term | Rank | Term | Rank | Term |
|---|---|---|---|---|---|
| 1 | INTERFERON | 33 | INFECT | 65 | LYSIS |
| 2 | IFN | 34 | INDUCE | 66 | AUTOIMMUNE |
| 3 | ANTIVIRAL | 35 | HLA-B | 67 | INDIGENOUS |
| 4 | IFN-BETA | 36 | HISTOCOMPATIBILITY | 68 | PROTEASOME |
| 5 | IFN-ALPHA | 37 | LINE | 69 | LMP2 |
| 6 | INDUCIBLE | 38 | HEPATITIS | 70 | LMP7 |
| 7 | INTERFERON-ALPHA | 39 | MELANOMA | 71 | PKR |
| 8 | INFECTION | 40 | ENCEPHALOMYOCARDITIS | 72 | INDUCIBILITY |
| 9 | VIRAL | 41 | REPLICATION | 73 | CORRESPONDING |
| 10 | IMMUNE | 42 | AFTER | 74 | MOLECULE |
| 11 | TREAT | 43 | MONOCLONAL | 75 | DEFENSE |
| 12 | INNATE | 44 | EPSTEIN-BARR | 76 | DIFFERENTIAL |
| 13 | IFN-GAMMA | 45 | UPREGULATE | 77 | ACTION |
| 14 | VIRUS | 46 | SYNTHESIS | 78 | TAP |
| 15 | IMMUNITY | 47 | BETA2-MICROGLOBULIN | 79 | STIMULATE |
| 16 | DSRNA | 48 | EBV | 80 | CONFER |
| 17 | INDUCTION | 49 | GAMMA-INTERFERON | 81 | LOAD |
| 18 | OLIGOADENYLATE | 50 | HLA | 82 | REACTIVITY |
| 19 | LYMPHOBLASTOID | 51 | INTERFERON-GAMMA | 83 | OR-C |
| 20 | ISRE | 52 | HLA-G | 84 | MEDIATE |
| 21 | HOST | 53 | OAS | 85 | RECOMBINANT |
| 22 | ISG | 54 | TYPE | 86 | CTL |
| 23 | MHC | 55 | MXA | 87 | MICROGLOBULIN |
| 24 | TREATMENT | 56 | ALPHA | 88 | STRAND |
| 25 | HLA-A | 57 | DEFINE | 89 | RECOGNIZE |
| 26 | STOMATITIS | 58 | IMMUNODEFICIENCY | 90 | ALSO |
| 27 | BETA | 59 | PROMYELOCYTIC | 91 | DERIVE |
| 28 | RESPONSE | 60 | INTACT | 92 | P69 |
| 29 | HLA-CLASS | 61 | LEUKEMIA | 93 | VSV |
| 30 | EVASION | 62 | INDEPENDENT | 94 | DOUBLE |
| 31 | CYTOKINE | 63 | EACH | | |
| 32 | ANTIGEN | 64 | TAPASIN | | |
Over-represented abstract terms are defined as those tokens with P ≤ 0.05 after Bonferroni correction. The most significant hits are ranked at the top of the table. See Supplementary Data 1, Table S1 for term frequencies and P-values associated with the individual terms.
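The classical test behind this table can be sketched as an upper-tail hypergeometric probability followed by a Bonferroni correction. This is a minimal illustration only, not the authors' implementation: the function names and the toy numbers in the usage lines are assumptions, and only the Python standard library is used.

```python
from math import comb


def hypergeom_upper_tail(chip_total, chip_with_term, list_total, list_with_term):
    """P(X >= list_with_term) when list_total genes are drawn without
    replacement from a chip of chip_total genes, chip_with_term of which
    are associated with the term."""
    max_hits = min(chip_with_term, list_total)
    denominator = comb(chip_total, list_total)
    numerator = sum(
        comb(chip_with_term, k) * comb(chip_total - chip_with_term, list_total - k)
        for k in range(list_with_term, max_hits + 1)
    )
    return numerator / denominator


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy numbers (hypothetical, not taken from the paper):
# a 10-gene chip, 4 genes carry the term, a 3-gene list contains 2 of them.
p = hypergeom_upper_tail(10, 4, 3, 2)
p_adjusted = bonferroni(p, n_tests=2)
significant = p_adjusted <= 0.05  # cutoff used in the table above
```

Because the sum uses exact integer binomial coefficients, the sketch avoids rounding problems for small inputs; for chip-sized inputs a library routine would normally be faster.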
Figure 1. The relationship between annotation bias and gene age. (a) 52 gene lists from the HG-U133A chip were collated from the published literature, and equivalently sized random gene lists were created for each of them. The number of PMIDs associated with each list was calculated and plotted against the size of the gene list. Both axes are on a logarithmic scale. (b) A mean age was calculated for each of the 52 literature gene lists by averaging the consensus ages of its constituent genes. Fold-change in PMID was calculated by dividing the number of PMIDs associated with a literature gene list by the average PMID count in an equivalently sized random gene list. The vertical dashed line represents the mean age of a random gene list, which is 1996 in this case; the horizontal dashed line represents the level at which there is no difference between the numbers of PMIDs associated with the literature and random gene lists.
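The fold-change calculation described in panel (b) can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function name, the `pmids_by_gene` mapping, the number of random lists, and the choice to count distinct PMIDs per list are all assumptions.

```python
import random


def pmid_fold_change(gene_list, pmids_by_gene, chip_genes, n_random=1000, seed=0):
    """Ratio of the PMID count of gene_list to the mean PMID count of
    equally sized random gene lists drawn from chip_genes."""
    rng = random.Random(seed)

    def count_pmids(genes):
        # Assumption: a PMID shared by several genes is counted once.
        return len(set().union(*(pmids_by_gene.get(g, set()) for g in genes)))

    observed = count_pmids(gene_list)
    random_counts = [
        count_pmids(rng.sample(chip_genes, len(gene_list)))
        for _ in range(n_random)
    ]
    random_mean = sum(random_counts) / n_random
    if random_mean == 0:
        raise ValueError("random gene lists carry no PMIDs")
    return observed / random_mean


# Toy data (hypothetical): every gene has exactly one PMID, so any
# two-gene list matches the random expectation and the fold-change is 1.
toy_pmids = {"a": {1}, "b": {2}, "c": {3}, "d": {4}}
fold = pmid_fold_change(["a", "b"], toy_pmids, ["a", "b", "c", "d"], n_random=50)
```

A fold-change of 1 corresponds to the horizontal dashed line in the figure; values above 1 indicate a list whose genes are more heavily annotated than random lists of the same size.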
A comparison of the results from different methods when applied to the ISG gene list
| Abstract term | Chip | List | Bonferroni P-value (Permutation) | | |
|---|---|---|---|---|---|
| INTERFERON | 414 | 46 | < | ||
| IFN | 245 | 35 | < | ||
| IFN-BETA | 71 | 18 | < | ||
| ANTIVIRAL | 176 | 23 | < | ||
| IFN-ALPHA | 114 | 19 | < | ||
| INTERFERON-ALPHA | 59 | 14 | < | ||
| OLIGOADENYLATE | 18 | 8 | < | ||
| ISG | 14 | 7 | < | ||
| ISRE | 31 | 9 | < | ||
| DSRNA | 60 | 11 | < | ||
| HLA-CLASS | 11 | 6 | < | ||
| HLA-A | 30 | 8 | 1 | ||
| HLA-B | 25 | 7 | 1 | ||
| INDUCIBLE | 1068 | 37 | < | ||
| ENCEPHALOMYOCARDITIS | 16 | 6 | < | ||
| STOMATITIS | 52 | 9 | < | ||
| OAS | 10 | 5 | 0.0968 | 0.1047 | |
| HLA-G | 10 | 5 | 1 | 0.1047 | |
| MXA | 11 | 5 | 1 | 0.1624 | |
| EVASION | 65 | 9 | < | ||
| INNATE | 363 | 21 | |||
| TAPASIN | 12 | 5 | 1 | 0.2407 | |
| VIRAL | 892 | 32 | |||
| INFECTION | 1177 | 36 | 0.0968 | ||
| OR-C | 5 | 4 | < | 0.0717 | 0.5133 |
| LYMPHOBLASTOID | 239 | 16 | < | 0.0872 | |
| IFN-GAMMA | 443 | 22 | 0.1936 | 0.1113 | |
| IMMUNITY | 387 | 20 | 0.0968 | 0.1799 | |
| IMMUNE | 1275 | 35 | 0.6776 | 0.4363 | |
| TREAT | 1817 | 40 | < | 0.9165 | |
| MHC | 353 | 17 | 1 | 1 | |
| VIRUS | 1408 | 34 | 1 | 1 | |
Abstract terms that are identified as over-represented by the corresponding method are shown in bold. The cutoff used is P ≤ 0.05 after Bonferroni correction. 100 000 randomisations were performed for the permutation test. 4840 tokens were tested during the permutation test, so the best possible Bonferroni P-value attainable is 10⁻⁵ × 4840 = 0.0484. Any term with an empirical P-value less than 10⁻⁵ is provisionally assigned a value of <10⁻⁵, and the corresponding Bonferroni P-value is set to <0.0484. The unadjusted P-values and other details (e.g. Z-scores and odds ratios) can be found in Supplementary Data 1, Table S1. Chip is the number of genes that contain a given term on the entire chip; List is the number of genes that are associated with a given term in the ISG gene list.
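The arithmetic in this footnote can be reproduced with a short sketch. It assumes the empirical P-value is the number of randomisations scoring at least as high as the observed list, divided by the number of randomisations; that definition and the function name are assumptions, not details stated in the text.

```python
def bonferroni_from_permutations(hits, n_perm, n_tests):
    """Return (bonferroni_p, is_upper_bound).

    hits    : randomisations at least as extreme as the observed list
    n_perm  : total number of randomisations
    n_tests : number of tokens tested
    With zero hits the empirical P-value is only known to be below
    1 / n_perm, so the returned value is an upper bound."""
    if hits == 0:
        return min(1.0, n_tests / n_perm), True
    return min(1.0, hits * n_tests / n_perm), False


# Settings from the footnote: 100 000 randomisations, 4840 tokens.
floor, floor_is_bound = bonferroni_from_permutations(0, 100_000, 4840)
two_hits, _ = bonferroni_from_permutations(2, 100_000, 4840)
```

Under these settings the floor is 0.0484, and every reported permutation value is a whole multiple of it (for example, two hits give 0.0968), which is why no permutation result in the table falls between those steps.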
Significantly over-represented GO terms in the ISG gene list identified by DAVID
| GO term | Chip | List | Bonferroni |
|---|---|---|---|
| Response to biotic stimulus | 853 | 49 | 6.40 |
| Immune response | 737 | 44 | 8.30 |
| Defense response | 816 | 45 | 2.90 |
| Response to stimulus | 1765 | 52 | 1.10 |
| Organismal physiological process | 1660 | 46 | 2.00 |
| Response to virus | 70 | 14 | 2.20 |
| Response to pest, pathogen or parasite | 503 | 25 | 2.40 |
| Response to other organism | 514 | 25 | 3.90 |
| Response to stress | 956 | 27 | 6.00 |
| MHC protein complex | 18 | 6 | 6.30 |
| MHC class I protein complex | 18 | 6 | 6.30 |
| Antigen presentation, endogenous antigen | 27 | 7 | 6.70 |
| Antigen processing, endogenous antigen via MHC class I | 28 | 7 | 8.50 |
| MHC class I receptor activity | 36 | 7 | 2.00 |
| Antigen processing | 36 | 7 | 4.20 |
| Antigen presentation | 42 | 7 | 1.10 |
| Immunological synapse | 31 | 6 | 1.20 |
The ontological tool DAVID 2.0 was used to identify over-represented GO terms in the ISG gene list. The analysis was performed using all levels of GO terms and HG-U133A chip as background (database version as of 19 December 2007). Over-represented GO terms were defined as having Bonferroni P-value ≤0.05 based on Fisher's exact test (threshold settings: Count = 2, EASE = 0.1).
Figure 3. A comparison of the performance of Outlier (a) and ExtendedHG (b) across different species. The average number of tokens called significant by each approach is plotted against the annotation density (i.e. the number of PMIDs per gene) for experimentally derived gene lists generated on 10 Affymetrix platforms representing eight different species: HG-U133A (hsa), HG-U133 Plus 2.0 (hum), Mouse 430 2.0 (mou), Rat 230 2.0 (rat), Arabidopsis ATH1 (ath), DrosGenome1 (dm), Drosophila 2.0 (dros), Xenopus laevis (xl), C. elegans (ce) and Zebrafish (dr).