Damien Chaussabel, Alan Sher.
Abstract
BACKGROUND: The rapidly expanding fields of genomics and proteomics have prompted the development of computational methods for managing, analyzing and visualizing expression data derived from microarray screening. Nevertheless, the lack of efficient techniques for assessing the biological implications of gene-expression data remains an important obstacle in exploiting this information.
Year: 2002 PMID: 12372143 PMCID: PMC134484 DOI: 10.1186/gb-2002-3-10-research0055
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Gene-specific and baseline term occurrences in the literature. The literature-mining technique we describe compares the occurrence of terms in a collection of abstracts relating to a specific gene with their occurrence in an unbiased set of abstracts (baseline occurrence in the literature). In the example illustrated here, the occurrence values for terms present in more than 25% of the abstracts relating to the gene RANTES are plotted on the y-axis. To determine baseline occurrence, the occurrence values found in the literature concerning this gene are then averaged with the values found for an increasing number of genes chosen randomly from all known human genes indexed in the LocusLink database (x-axis). Terms with high occurrence values in the collection of abstracts relating to RANTES and a low baseline occurrence in the literature are plotted in green.
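The two quantities compared in this figure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the tokenization rule and the toy abstracts are all assumptions, and the baseline here is simply the mean over whichever genes are supplied rather than over genes sampled from LocusLink.

```python
import re
from statistics import mean

def term_occurrence(abstracts):
    """Percentage of abstracts in which each term appears at least once."""
    if not abstracts:
        return {}
    counts = {}
    for text in abstracts:
        # Count each term once per abstract, ignoring case (assumed tokenizer).
        for term in set(re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())):
            counts[term] = counts.get(term, 0) + 1
    return {t: 100.0 * n / len(abstracts) for t, n in counts.items()}

def baseline_occurrence(per_gene):
    """Average each term's occurrence over all genes (0 where a term is absent)."""
    terms = set().union(*per_gene.values()) if per_gene else set()
    return {t: mean(occ.get(t, 0.0) for occ in per_gene.values()) for t in terms}

# Invented toy abstracts for two hypothetical genes.
abstracts = {
    "GENE_A": ["Interferon induces the protein.", "The protein is active."],
    "GENE_B": ["A histone protein.", "Histone levels rise."],
}
per_gene = {gene: term_occurrence(texts) for gene, texts in abstracts.items()}
baseline = baseline_occurrence(per_gene)
```

With only two genes the baseline is dominated by each gene's own vocabulary; averaging over many randomly chosen genes, as the caption describes, is what pushes gene-specific terms toward a low baseline.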
Term selection by filtering
| | Occurrence in abstracts (%) | | | | |
| Terms | Baseline | AK3 | H2A | IRF7 | ISG15 |
| A | | 85.7 | 84.6 | 100 | 100 |
| Active | | 14.3 | 7.7 | 28.6 | 0 |
| Cell-free | 0.6 | 0 | 0 | 0 | 0 |
| Histone | 1.4 | 0 | | 0 | 0 |
| Infected | 1.2 | 0 | 0 | | |
| Interestingly | 2.7 | 0 | 0 | 14.3 | 0 |
| Interferon | 1.1 | 0 | 0 | | |
| Levels | | 0 | 7.7 | 7.1 | 42.9 |
| Protein | | 28.6 | 38.5 | 57.1 | 100 |
| Signaling | | 0 | 0 | 7.1 | 0 |
This sample was extracted from a table (see Additional data files) containing occurrence values for nearly 25,000 terms for each of the 50 genes used in our example. It illustrates the selection that results from applying several rounds of filtering. Baseline occurrence levels are calculated by averaging the occurrence values determined for 250 randomly chosen genes. The occurrence values for four of the genes included in the analysis are shown here: AK3 (adenylate kinase 3), H2A (histone 2A), IRF7 (interferon regulatory factor 7) and ISG15 (interferon-stimulated protein, 15 kDa). The first filter removes terms with high baseline occurrence levels (shown in italics). The second filter selects the terms whose occurrence values exceed baseline by at least 25% (bold). Only terms meeting this criterion for at least two genes (in this case 'interferon' and 'infected') are retained.
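The filtering rounds described in this note can be sketched as follows. This is a hedged illustration only: the function name, the `max_baseline` cutoff (the note does not state what counts as a "high" baseline) and all numeric values are invented, and "over baseline by at least 25%" is read here as 25 percentage points, which is an assumption.

```python
def filter_terms(occurrence, baseline, max_baseline=5.0, min_excess=25.0, min_genes=2):
    """Return {term: [genes]} for terms that pass all three filtering rounds."""
    retained = {}
    for term, base in baseline.items():
        # Round 1: drop terms that are common throughout the literature.
        # max_baseline is a hypothetical cutoff, not a value from the source.
        if base > max_baseline:
            continue
        # Round 2: genes whose occurrence exceeds baseline by min_excess points.
        genes = sorted(
            gene for gene, occ in occurrence.items()
            if occ.get(term, 0.0) - base >= min_excess
        )
        # Round 3: keep terms that meet the criterion for enough genes.
        if len(genes) >= min_genes:
            retained[term] = genes
    return retained

# Invented toy values for four hypothetical genes (not data from the table).
baseline = {"protein": 40.0, "interferon": 1.0, "histone": 1.5, "interestingly": 3.0}
occurrence = {
    "gene1": {"protein": 30.0, "interferon": 0.0, "histone": 0.0},
    "gene2": {"protein": 40.0, "histone": 90.0},
    "gene3": {"protein": 60.0, "interferon": 70.0, "interestingly": 14.0},
    "gene4": {"protein": 100.0, "interferon": 55.0},
}
selected = filter_terms(occurrence, baseline)
```

In this toy input 'protein' fails the first round, 'interestingly' fails the second, 'histone' passes the second round for one gene only, and 'interferon' alone survives all three.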
Table 1. List of genes used to illustrate the technique and their abbreviations
| Abbreviation | Gene name |
| ABCB2 | ATP-binding cassette, subfamily B (MDR/TAP), member 2 |
| AK3 | Adenylate kinase 3 |
| B2M | Beta-2-microglobulin |
| BIRC3 | Baculoviral IAP repeat-containing 3 |
| CFLAR | CASP8 and FADD-like apoptosis regulator |
| DUSP1 | Dual specificity phosphatase 1 |
| DUSP4 | Dual specificity phosphatase 4 |
| DUSP5 | Dual specificity phosphatase 5 |
| G1P3 | Interferon, alpha-inducible protein (clone IFI-6-16) |
| GADD45A | Growth arrest and DNA-damage-inducible, alpha |
| GADD45B | Growth arrest and DNA-damage-inducible, beta |
| GBP1 | Guanylate binding protein 1, interferon-inducible, 67kD |
| GCH1 | GTP cyclohydrolase 1 |
| H2AFO | H2A histone family, member O |
| HLA-F | Major histocompatibility complex, class I, F |
| IFIT | Interferon-induced protein with tetratricopeptide repeats |
| IFITM | Interferon induced transmembrane protein |
| IL15R | Interleukin 15 receptor |
| IL7R | Interleukin 7 receptor |
| IP10 | Interferon induced protein 10 |
| IP9 | Interferon induced protein 9 |
| IRF4 | Interferon regulatory factor 4 |
| IRF7 | Interferon regulatory factor 7 |
| ISG15 | Interferon-stimulated protein, 15 kDa |
| ISG20 | Interferon stimulated gene (20kD) |
| MCP2 | Monocyte chemotactic protein 2 |
| MCP3 | Monocyte chemotactic protein 3 |
| MIG | Monokine induced by gamma interferon |
| MIP3A | Macrophage inflammatory protein 3 alpha |
| MMP9 | Matrix metalloproteinase 9 |
| MT1A | Metallothionein 1A (functional) |
| MX1 | Myxovirus (influenza) resistance 1 |
| MX2 | Myxovirus (influenza) resistance 2 |
| NFKB1 | Nuclear factor kappaB 1 (p105) |
| NFKB2 | Nuclear factor kappaB2 (p49/p100) |
| NFKBIA | Nuclear factor kappaB inhibitor, alpha |
| NR4A3 | Nuclear receptor subfamily 4, group A, member 3 (NOR1) |
| NRP2 | Neuropilin 2 |
| OAS | 2'-5'-oligoadenylate synthetase |
| PDE4B | Phosphodiesterase 4B, cAMP-specific |
| PSMA | Proteasome (prosome, macropain) subunit, alpha |
| PSME | Proteasome activator subunit 2 (PA28) |
| PTP1B | Protein tyrosine phosphatase 1B |
| RANTES | RANTES |
| SOD2 | Superoxide dismutase 2, mitochondrial |
| STAT1 | Signal transducer and activator of transcription 1, 91kD |
| STAT4 | Signal transducer and activator of transcription 4 |
| TNFAIP3 | Tumor necrosis factor, alpha-induced protein 3 |
| TNFAIP6 | Tumor necrosis factor, alpha-induced protein 6 |
| TRAF1 | TNF receptor-associated factor 1 |
| VEGF | Vascular endothelial growth factor |
The genes for which a sufficient number of abstracts could be retrieved are listed. For the complete list of co-induced genes and ESTs included in the analysis, see Additional data files.
Figure 2. Analysis of patterns of term occurrence in abstracts. After filters have been applied to the original list, selected term-occurrence values relating to each gene are analyzed. Terms (columns) and genes (rows) were grouped on the basis of similarities between patterns of term occurrence in abstracts by hierarchical clustering. Some of the areas of the clustergram are shown in detail. Clusters are further referenced by color codes: blue, 'nuclear factors'; orange, 'receptor-ligand pair'; green, 'interferon-related'; red, 'chemokines'; violet, 'MHC class I antigen-presentation pathway'. Shades of yellow indicate different levels of term occurrence in abstracts.
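The grouping of genes by hierarchical clustering can be illustrated with a small pure-Python sketch. The caption does not state which distance measure or linkage rule was used, so Euclidean distance and average linkage are assumptions here, as are the function name and the toy profiles (each list holds one gene's occurrence values over the same ordered set of retained terms).

```python
from itertools import combinations
from math import dist

def cluster_genes(profiles, n_clusters):
    """Agglomerative average-linkage clustering of genes on their term profiles."""
    clusters = [[gene] for gene in sorted(profiles)]

    def linkage(a, b):
        # Average Euclidean distance over all cross-cluster gene pairs.
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > max(n_clusters, 1):
        # Find and merge the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters

# Invented occurrence values (%) over three hypothetical retained terms.
profiles = {
    "gene1": [80.0, 60.0, 0.0],
    "gene2": [70.0, 50.0, 0.0],
    "gene3": [0.0, 0.0, 90.0],
    "gene4": [0.0, 10.0, 75.0],
}
groups = cluster_genes(profiles, 2)
```

Clustering the terms instead of the genes amounts to running the same routine on the transposed table, which is how a two-way clustergram of this kind is typically ordered.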
Figure 3. Annotated dendrogram obtained by clustering term-occurrence values relative to each gene. The corresponding clustergram is shown in Figure 2. Genes are arranged according to patterns of term occurrence. Distances between nodes of the tree diagram indicate the degree of association between genes or groups of genes. A subset of representative terms used in the analysis was chosen to annotate this list of genes. Shades of yellow indicate different levels of term occurrence in abstracts. Table 1 lists the gene abbreviations used.
Figure 4. The degree of association found among groups of genes by literature profiling correlates with their likelihood of being related. (a) The clustergram resulting from the analysis of the list of co-induced genes used to illustrate the mining technique is given for comparison. (b) A group of 50 genes was picked at random from all known human genes listed in the LocusLink database and their literature content was analyzed. (c) A group of 50 genes was picked at random from the list of known interleukins, chemokines and chemokine receptors and subjected to a similar analysis. The number of positive gene-term associations retained after filtering (term occurrence for a given gene higher than the baseline by at least 25%) is shown for each group. The numbers of shared terms for (a), (b) and (c) were 101, 49 and 116, respectively.
Figure 5. Conditions for the emergence of groups of related genes. (a) Groups of related genes found by clustering term-occurrence values. The color code is similar to the one used in Figure 2. (b) Grouping is conserved after gene names or terms making up gene names are removed from the analysis (for example, NFkappaB, RANTES, interferon, vascular, MIG). (c) Associations shown in (a) disappear when occurrence values are permuted for each of the genes, suggesting that associations made through the analysis of patterns of term occurrence do not arise by chance from a sufficiently high number of co-occurring terms.
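The permutation control in panel (c) can be sketched as below. This is an assumed reading of "permuted for each of the genes": each gene's occurrence values are shuffled across terms independently, so every gene keeps the same set of values but loses its link to specific terms. The function name, the seed parameter and the toy profiles are hypothetical.

```python
import random

def permute_profiles(profiles, seed=0):
    """Shuffle each gene's occurrence values across terms, independently per gene."""
    rng = random.Random(seed)  # seeded so the control is reproducible
    permuted = {}
    for gene, values in profiles.items():
        shuffled = list(values)  # copy, so the input profiles are left untouched
        rng.shuffle(shuffled)
        permuted[gene] = shuffled
    return permuted

# Invented occurrence values (%) over five hypothetical terms.
profiles = {
    "gene1": [80.0, 60.0, 0.0, 0.0, 10.0],
    "gene2": [70.0, 50.0, 0.0, 5.0, 0.0],
    "gene3": [0.0, 0.0, 90.0, 75.0, 0.0],
}
control = permute_profiles(profiles)
```

The permuted profiles can then be passed through the same clustering procedure as the originals; groups that survive the shuffle would point to chance co-occurrence rather than shared vocabulary.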
Figure 6. Profiling the bacteria-induced macrophage activation program. Literature profiles were generated for a list of nearly 200 genes constituting the 'common transcriptional program' induced in human macrophages upon bacterial infection ([12]; see also Additional data files). The clustergram generated for the analysis of patterns of term occurrence is shown at top left. (a-g) Detailed views for groups of genes (columns) sharing a common vocabulary (rows). Groups of terms were selected on the basis of clustering hierarchy, whereas the number of genes shown in the inserts is arbitrary. For gene abbreviations see Additional data files.
Figure 7. Profiling classic medulloblastomas. Literature profiles were generated for a list of 200 genes found to be differentially expressed by classic versus desmoplastic medulloblastomas in a study of central nervous system embryonal tumors recently published by Pomeroy et al. ([19]; see also Additional data files). The clustergram generated for the analysis of patterns of term occurrence is shown at top left. (a-i) Detailed views for groups of genes (columns) found to share a common vocabulary (rows). Groups of terms were selected on the basis of clustering hierarchy, whereas the number of genes shown in the inserts is arbitrary. For gene abbreviations see Additional data files.