Ralf Steuer, Peter Humburg, Joachim Selbig.
Abstract
BACKGROUND: The biological interpretation of large-scale gene expression data is one of the paramount challenges in current bioinformatics. In particular, placing the results in the context of other available functional genomics data, such as existing bio-ontologies, has already provided substantial improvement for detecting and categorizing genes of interest. One common approach is to look for functional annotations that are significantly enriched within a group or cluster of genes, as compared to a reference group.
Year: 2006 PMID: 16911788 PMCID: PMC1586215 DOI: 10.1186/1471-2105-7-380
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Validating clustering results by the mutual information: A schematic example. Each gene is uniquely assigned to one functional category A and grouped into a cluster C by a given clustering algorithm. The joint probabilities can be estimated directly from the associated contingency table, and the mutual information is calculated according to Eq. (1). To assess how related the clustering is to the annotation, the value of the mutual information is compared to random assignments of genes to cluster numbers, i.e. each gene is randomly assigned to a cluster, preserving the total number of genes within each cluster but destroying any relationship between the clustering and the functional annotation. The lower right plot shows the mutual information compared to an ensemble of 500 randomized assignments. In this example, the z-score, estimated according to Eq. (8), is S ≈ 3.8. For a z-score to be deemed significant, we further require that no random assignment results in a mutual information equal to or larger than that of the tested annotation. Note that, though we expect the mutual information to be zero for the randomized assignments, the average estimated mutual information for randomized data has a bias towards positive values due to finite-size effects [19,20]. As a rule of thumb, to obtain a reliable estimate of the mutual information, the number of genes should be at least three times larger than the number of clusters or functional categories [20].
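The procedure described in this caption can be sketched in a few lines of Python. Eq. (1) and Eq. (8) are not reproduced in this record, so the sketch assumes the standard plug-in estimate of the mutual information (in bits) and a z-score defined as the observed value minus the mean of the randomized ensemble, divided by its standard deviation. The function names `mutual_information` and `z_score` are illustrative only.

```python
import math
import random
from collections import Counter


def mutual_information(clusters, annotations):
    """Plug-in estimate of I(C, A) in bits from two equal-length label lists."""
    n = len(clusters)
    joint = Counter(zip(clusters, annotations))
    n_c = Counter(clusters)
    n_a = Counter(annotations)
    mi = 0.0
    for (c, a), n_ca in joint.items():
        # p(c,a) * log2(p(c,a) / (p(c) p(a))), written with raw counts
        mi += (n_ca / n) * math.log2(n_ca * n / (n_c[c] * n_a[a]))
    return mi


def z_score(clusters, annotations, n_random=500, seed=0):
    """Compare the observed I(C, A) with shuffled cluster assignments.

    Assumed definition: z = (I_observed - mean(I_random)) / std(I_random).
    """
    rng = random.Random(seed)
    observed = mutual_information(clusters, annotations)
    shuffled = list(clusters)
    null = []
    for _ in range(n_random):
        # Shuffling keeps every cluster size but breaks the link to the annotation
        rng.shuffle(shuffled)
        null.append(mutual_information(shuffled, annotations))
    mean = sum(null) / n_random
    std = math.sqrt(sum((x - mean) ** 2 for x in null) / (n_random - 1))
    z = (observed - mean) / std if std > 0 else float("nan")
    # Extra criterion from the caption: no randomized value reaches the observed one
    significant = std > 0 and all(x < observed for x in null)
    return z, observed, mean, significant
```

Because the plug-in estimate is never negative, the mean over the randomized ensemble is typically above zero, which is the finite-size bias mentioned in the caption.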
The multiple functions of genes defy a straightforward estimation of the mutual information. Each gene is assigned a vector of binary attributes A = {A1, A2, ..., An}, which can be described by a number a ∈ [0, 2^n - 1]. The contingency table to evaluate the mutual information I(C, [A1, ..., An]), taking all possible combinations into account, would thus include up to 2^n columns.
| annotation | ||||||||
| Cluster | ... | |||||||
| gene 1 | 0 | 1 | 0 | 0 | ... | 1 | ||
| gene 2 | 0 | 0 | 0 | 1 | ... | 0 | ||
| gene 3 | 1 | 0 | 1 | 0 | ... | 0 | ||
| ⋮ | ⋮ | ⋮ | ... | ⋮ | ⋮ | |||
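As an illustration of the encoding described above, the following sketch maps a gene's binary attribute vector to a single integer and counts genes per (cluster, combination) pair. The bit order (first attribute as most significant bit) and the names `encode` and `contingency_table` are assumptions made for this example; the source does not fix them.

```python
from collections import Counter


def encode(attribute_vector):
    """Map binary attributes (A1, ..., An) to an integer in [0, 2**n - 1].

    Assumption: A1 is treated as the most significant bit.
    """
    code = 0
    for bit in attribute_vector:
        code = (code << 1) | int(bit)
    return code


def contingency_table(clusters, attribute_vectors):
    """Count genes per (cluster, combination code).

    Only combinations that actually occur are stored, so the table stays
    far smaller than the 2**n columns that are possible in principle.
    """
    return Counter((c, encode(v)) for c, v in zip(clusters, attribute_vectors))


# Hypothetical toy input: three genes, four binary attributes each
table = contingency_table([0, 0, 1], [(0, 1, 0, 0), (0, 1, 0, 0), (1, 0, 1, 0)])
```

Storing only observed combinations matters in practice: with a few thousand genes, at most a few thousand of the possible columns can ever be non-empty.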
Figure 2. Histogram of the z-score, Eq. (8), for all individual attributes. The attributes with the highest score are marked. The figure is for the cell cycle dataset of Spellman et al. [24], using the same preprocessing as described in [13] (see Appendix). After filtering, ≈ 2500 GO attributes remained for evaluation. Repeating the analysis for all datasets given in Table 2 yields similar results. The clustering was obtained using a k-means algorithm with Euclidean distance and k = 25; the results do not change significantly for different choices of k (tested between k = 5 and k = 30, corresponding to the region where the z-score of the mutual information is largest [13]). Note that the top-scoring attributes appear to be largely redundant, i.e. a gene that is annotated to the cellular component 'cytosolic ribosome' can intuitively be expected to be annotated to the biological process 'protein biosynthesis' as well. See next section for details.
Several datasets were used to verify the results, corresponding to different experimental setups and conditions. In each case, only the 3000 genes with the highest variance were selected for further analysis. Note that this implies that the set of selected genes is (slightly) different for each dataset. For details on the preprocessing and normalization of the data, see the Appendix. In the following, all results shown refer to the dataset of Spellman et al. [24].
| Name | Year | No. of points | Description | Ref. |
| Spellman | (1998) | 75 | Cell Cycle | [24] |
| Zhu | (2000) | 26 | Cell Cycle | [25] |
| Gasch | (2000) | 175 | Various conditions | [26] |
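The variance filter mentioned in the caption could look like the sketch below. It assumes expression profiles without missing values, stored as a dictionary from gene name to a list of measurements, and uses the population variance; the preprocessing actually applied is described in the Appendix and may differ. The name `top_variance_genes` is hypothetical.

```python
import statistics


def top_variance_genes(expression, n_keep=3000):
    """Return the names of the n_keep genes with the highest variance.

    expression: dict mapping gene name -> list of expression values
    (assumed complete, i.e. no missing measurements).
    """
    variances = {gene: statistics.pvariance(values) for gene, values in expression.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_keep]
```

Since the filter is applied per dataset, the selected gene sets differ slightly between datasets, as noted in the caption.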
The contingency tables between the five top-scoring attributes given in Fig. 2, along with the z-score S for the pairwise mutual information I(Ai, Aj), estimated according to Eq. (8) with respect to 500 randomized realizations. High values of S indicate that the two attributes are not independent, i.e. that the probability of observing such a value of the mutual information I(Ai, Aj) for statistically independent attributes Ai and Aj is low. Shown are the nodes: 'GO:0005830' (component: cytosolic ribosome), 'GO:0003735' (function: structural constituent of ribosome), 'GO:0005840' (component: ribosome), 'GO:0006412' (process: protein biosynthesis), and 'GO:0019538' (process: protein metabolism). Note that the contingency tables, as well as the z-scores, were estimated for the full set of 6312 genes. Reducing the analysis to those 3000 genes used in the creation of Fig. 2 increases the redundancy even more.
| GO:0005830 | GO:0003735 | GO:0005840 | GO:0006412 | GO:0019538 | |||||||
| GO:0005830 | 140 | 0 | 137 | 3 | 140 | 0 | 137 | 3 | 137 | 3 | |
| 0 | 6172 | 61 | 6111 | 89 | 6083 | 421 | 5751 | 709 | 5463 | ||
| - | |||||||||||
| GO:0003735 | 137 | 61 | 198 | 0 | 196 | 2 | 198 | 0 | 198 | 0 | |
| 3 | 6111 | 0 | 6114 | 33 | 6081 | 360 | 5754 | 648 | 5466 | ||
| - | - | ||||||||||
| GO:0005840 | 140 | 89 | 196 | 33 | 229 | 0 | 219 | 10 | 220 | 9 | |
| 0 | 6083 | 2 | 6081 | 0 | 6083 | 339 | 5744 | 626 | 5457 | ||
| - | - | - | |||||||||
| GO:0006412 | 137 | 421 | 198 | 360 | 219 | 339 | 558 | 0 | 558 | 0 | |
| 3 | 5751 | 0 | 5754 | 10 | 5744 | 0 | 5754 | 288 | 5466 | ||
| - | - | - | - | ||||||||
| GO:0019538 | 137 | 709 | 198 | 648 | 220 | 626 | 558 | 288 | 846 | 0 | |
| 3 | 5463 | 0 | 5466 | 9 | 5457 | 0 | 5466 | 0 | 5466 | ||
| - | - | - | - | - | |||||||
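To make the redundancy concrete, the pairwise mutual information can be computed directly from one of the 2×2 blocks above. The sketch below uses the block for 'GO:0005830' versus 'GO:0003735' and, for comparison, the diagonal block of 'GO:0005830' with itself. It assumes the standard plug-in estimate in bits; the z-scores S of Eq. (8) would additionally require the randomized ensemble. The name `mi_from_table` is illustrative.

```python
import math


def mi_from_table(table):
    """Plug-in mutual information (bits) from a contingency table of counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing
                mi += (n_ij / total) * math.log2(n_ij * total / (row_sums[i] * col_sums[j]))
    return mi


# Rows: GO:0005830 present / absent; columns: GO:0003735 present / absent
ribosome_vs_structural = [[137, 3], [61, 6111]]
# Diagonal block: an attribute compared with itself, giving its entropy
identical_attribute = [[140, 0], [0, 6172]]

print(mi_from_table(ribosome_vs_structural), mi_from_table(identical_attribute))
```

The mutual information of an attribute with itself is an upper bound for its mutual information with any other attribute, so the ratio of the two printed values gives a rough sense of how much of the information is shared.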
Combinations of GO attributes: Shown is a schematic example of 8 genes, separated into two distinct clusters (Table 4a). As can be observed, neither of the two given attributes is significantly enriched within either of the clusters, resulting in a vanishing mutual information between the clustering and the annotation (see the respective contingency tables in Table 4b and 4c). However, the combination of both attributes clearly does determine both clusters uniquely. In particular, genes with the combination A = (A1, A2) = (0, 0) or (1, 1) are grouped together in the first cluster, while genes sharing the annotation A = (0, 1) or (1, 0) are grouped together in the second cluster.
(a)
| gene | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| cluster | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| A1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| A2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
(b)
| Cluster: | 0 | 1 |
| A1 = 0 | 2 | 2 |
| A1 = 1 | 2 | 2 |
(c)
| Cluster: | 0 | 1 |
| A2 = 0 | 2 | 2 |
| A2 = 1 | 2 | 2 |
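The schematic example can be checked numerically. In the sketch below the attribute rows are taken from Table 4a, and the cluster labels are derived from the rule stated in the caption (combinations (0, 0) and (1, 1) form the first cluster, labelled 0 here). The mutual information is the standard plug-in estimate in bits, which is an assumption about Eq. (1).

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of the mutual information (bits) between two label lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    n_x = Counter(xs)
    n_y = Counter(ys)
    return sum(
        (n_xy / n) * math.log2(n_xy * n / (n_x[x] * n_y[y]))
        for (x, y), n_xy in joint.items()
    )


a1 = [1, 0, 0, 0, 0, 1, 1, 1]
a2 = [1, 0, 1, 0, 1, 0, 0, 1]
# Caption rule: equal attribute values -> cluster 0, unequal -> cluster 1
cluster = [x ^ y for x, y in zip(a1, a2)]

mi_a1 = mutual_information(cluster, a1)                  # single attribute A1
mi_a2 = mutual_information(cluster, a2)                  # single attribute A2
mi_both = mutual_information(cluster, list(zip(a1, a2)))  # combination (A1, A2)
print(mi_a1, mi_a2, mi_both)
```

Each attribute alone carries no information about the cluster, while the pair determines it completely (one bit for two equally sized clusters).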
Combinations of GO attributes selected for the dataset of Spellman et al. [24]. Starting with the highest-scoring attribute 'cytosolic ribosome', new attributes were iteratively included until kmax = 31; the first 20 are given here. Note that the results do not depend specifically on which of the datasets was used: GO IDs that have been selected among the top 32 for all datasets listed in Table 2 are indicated in bold. The clustering was the same as considered above; see the caption of Fig. 2 for details. Note that none of the attributes is specifically related to the cell cycle, except 'cell cycle' and 'mitosis', which were likewise found for all of the considered datasets.
| rank k | GO ID | description |
| 0 | ||
| 1 | ||
| 2 | ||
| 3 | ||
| 4 | ||
| 5 | ||
| 6 | GO:0016043 | cell organization and biogenesis |
| 7 | ||
| 8 | ||
| 9 | ||
| 10 | ||
| 11 | GO:0009058 | biosynthesis |
| 12 | ||
| 13 | ||
| 14 | ||
| 15 | GO:0006259 | DNA metabolism |
| 16 | GO:0009056 | catabolism |
| 17 | GO:0006519 | amino acid and derivative metabolism |
| 18 | ||
| 19 | ||
| 20 | GO:0005488 | binding |
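An iterative selection of this kind can be sketched as a greedy forward search. The selection criterion is not stated in this record, so the sketch makes a loud assumption: at each step it adds the attribute that maximizes the plug-in mutual information between the clustering and the combination of all attributes chosen so far. The function `greedy_select` and its arguments are hypothetical and do not reproduce the authors' exact procedure.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of the mutual information (bits) between two label lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    n_x = Counter(xs)
    n_y = Counter(ys)
    return sum(
        (n_xy / n) * math.log2(n_xy * n / (n_x[x] * n_y[y]))
        for (x, y), n_xy in joint.items()
    )


def greedy_select(clusters, attributes, k_max):
    """Greedy forward selection of attribute combinations.

    attributes: dict mapping attribute name -> list of 0/1 values, one per gene.
    Assumed criterion: maximize I(C, [selected attributes + candidate]).
    """
    selected = []
    remaining = set(attributes)
    while remaining and len(selected) < k_max:

        def score(name):
            # One tuple per gene, holding its values for all chosen attributes
            combo = list(zip(*(attributes[s] for s in selected + [name])))
            return mutual_information(clusters, combo)

        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The raw mutual information can only grow as attributes are added, and its finite-size bias grows with the number of combinations, so in practice a comparison against randomized assignments would be needed to decide when to stop.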
The contingency tables of the five top-scoring attributes listed in Table 5. Note that in this case the respective scores are significantly lower than the results given in Table 3. This indicates that the respective attributes, though not statistically independent, are much less redundant than in the previous case. Shown are the nodes: 'GO:0005830' (component: cytosolic ribosome), 'GO:0005737' (component: cytoplasm), 'GO:0007049' (process: cell cycle), 'GO:0005634' (component: nucleus), and 'GO:0003824' (function: catalytic activity).
| GO:0005830 | GO:0005737 | GO:0007049 | GO:0005634 | GO:0003824 | |||||||
| GO:0005830 | 140 | 0 | 140 | 0 | 0 | 140 | 1 | 139 | 0 | 140 | |
| 0 | 6172 | 1108 | 5064 | 390 | 5782 | 520 | 5652 | 1153 | 5019 | ||
| - | |||||||||||
| GO:0005737 | 140 | 1108 | 1248 | 0 | 124 | 1124 | 46 | 1202 | 364 | 884 | |
| 0 | 5064 | 0 | 5064 | 266 | 4798 | 475 | 4589 | 789 | 4275 | ||
| - | - | ||||||||||
| GO:0007049 | 0 | 390 | 124 | 266 | 390 | 0 | 139 | 251 | 140 | 250 | |
| 140 | 5782 | 1124 | 4798 | 0 | 5922 | 382 | 5540 | 1013 | 4909 | ||
| - | - | - | |||||||||
| GO:0005634 | 1 | 520 | 46 | 475 | 139 | 382 | 521 | 0 | 193 | 328 | |
| 139 | 5652 | 1202 | 4589 | 251 | 5540 | 0 | 5791 | 960 | 4831 | ||
| - | - | - | - | ||||||||
| GO:0003824 | 0 | 1153 | 364 | 789 | 140 | 1013 | 193 | 960 | 1153 | 0 | |
| 140 | 5019 | 884 | 4275 | 250 | 4909 | 328 | 4831 | 0 | 5159 | ||
| - | - | - | - | - | |||||||
Figure 3. Combinations of GO attributes: Shown is a graphical representation of the contingency tables between the clustering result and the GO annotations. Darker color indicates more genes in that cluster with this annotation. Upper plot: The results corresponding to Fig. 2, i.e. the highest-scoring attributes as determined by the individual mutual information I(C, A). The attributes are sorted according to their appearance in the GO database. Lower plot: Combined attributes: Shown are the results for the first 5 entries of Table 5. For simplicity, the combinations are given as binary code A = (A0, ..., A4), where A0 = cytosolic ribosome, A1 = cytoplasm, A2 = cell cycle, A3 = nucleus, and A4 = catalytic activity. Genes not possessing any of the top five attributes listed in Table 5 are omitted.