Meng P Tan, Erin N Smith, James R Broach, Christodoulos A Floudas.
Abstract
BACKGROUND: DNA microarray technology allows for the measurement of genome-wide expression patterns. Within the resultant mass of data lies the problem of analyzing and presenting information on this genomic scale, and a first step towards the rapid and comprehensive interpretation of this data is gene clustering with respect to the expression patterns. Classifying genes into clusters can lead to interesting biological insights. In this study, we describe an iterative clustering approach to uncover biologically coherent structures from DNA microarray data based on a novel clustering algorithm, EP_GOS_Clust.
Year: 2008 PMID: 18538024 PMCID: PMC2442101 DOI: 10.1186/1471-2105-9-268
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Results of iterative clustering with dataset I. A. Number of clusters as a function of the number of iterations of clustering dataset I with a p-value cutoff of 10^-4. B. Percent of genes residing in biologically coherent clusters as a function of iteration cycle. Data are shown for the percent of clusters with a minimum of biological coherence of p-value less than 10^-3 and of p-value less than 10^-4. C. Average p-value over the entire set of clusters as a function of iteration cycle.
Comparison of biological coherence of clusters obtained for dataset I by different clustering algorithms
| Clustering Method | Average Correlation | Genes (%) in Clusters with p-value <= 10^-4 | Genes (%) in Clusters with p-value <= 10^-3 |
| EP_GOS_Clust | 0.617 | 32.8* | 64.9* |
| Iterated EP_GOS_Clust | 0.685* | > 69* | > 90* |
| KMedians | 0.615 | 30.8 | 62.2* |
| KCityBlk | 0.398 | 27.5 | 56.7 |
| KCorr | 0.630* | 32.6* | 60.1 |
| KMeans | 0.614 | 25.1 | 55.2 |
| KAvePair | 0.567 | 25.2 | 54.4 |
| QTClust | 0.572 | 31.1 | 56.9 |
| SOTA | 0.604 | 30.2 | 58.9 |
| SOM | 0.623* | 30.5 | 59.2 |
Shown is the comparative biological coherence of the clusters formed by various clustering methods on dataset I. The first two rows (shaded in the original table) represent clustering by the standalone EP_GOS_Clust backbone and by the proposed iterative approach.
* The top three performers in each category are indicated with an asterisk.
Figure 2. Results of iterative clustering with dataset II. A. Number of clusters as a function of the number of iterations of clustering dataset II with a p-value cutoff of 10^-4, either including all 5657 genes in the dataset (No Noise Adjustment) or including only those 4346 genes that exhibit a 1.7-fold change for at least 10% of the time points (1.7-Fold Change Noise Adjusted). B. Percent of genes residing in biologically coherent clusters as a function of iteration cycle. Data are shown for percent of clusters with a minimum of biological coherence of p-value less than 10^-4 and 10^-5 for all genes in the dataset and for a biological coherence of p-value less than 10^-4, 10^-5 and 10^-6 for the subset of genes that exhibit a 1.7-fold change for at least 10% of the time points (minimal fold-change for meaningful clustering).
Improvement in biological coherence after iterative clustering of dataset II
| Iteration | Genes (%), p-value <= 10^-4, Uncorrected | Genes (%), p-value <= 10^-5, Uncorrected | Genes (%), p-value <= 10^-4, Bonferroni-Corrected | Genes (%), p-value <= 10^-5, Bonferroni-Corrected |
| Initial Iteration | 79.4 | 66.1 | 57.2 | 52.2 |
| Final Iteration | 86.2 | 83.9 | 73.1 | 68.9 |
Shown are the percentages of genes in clusters in which one or more subsets of the genes in the cluster exhibit a statistically non-random membership in a common biological function, as defined by Gene Ontology (GO) classification, at a significance level of either 10^-4 or 10^-5, either uncorrected (left) or Bonferroni-corrected for multiple hypothesis testing (right). The values are for the clusters obtained following initial EP_GOS_Clust clustering or after six rounds of the iterative algorithm described in Methods.
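The coherence criterion in this legend can be illustrated with a minimal sketch. It rests on assumptions that the legend does not confirm: that the p-value for "statistically non-random membership" is a hypergeometric upper-tail probability, and that the Bonferroni correction multiplies it by the number of GO terms tested. The function names and the toy numbers are hypothetical.

```python
from math import comb

def hypergeom_tail(genome_size, annotated_in_genome, cluster_size, annotated_in_cluster):
    """P(X >= annotated_in_cluster) when cluster_size genes are drawn
    without replacement from a genome in which annotated_in_genome genes
    carry the GO term (assumed test, not confirmed by the source)."""
    upper = min(cluster_size, annotated_in_genome)
    total = comb(genome_size, cluster_size)
    hits = sum(
        comb(annotated_in_genome, k)
        * comb(genome_size - annotated_in_genome, cluster_size - k)
        for k in range(annotated_in_cluster, upper + 1)
    )
    return hits / total

def bonferroni(p_value, n_tests):
    """Bonferroni correction: scale by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

def is_coherent(p_value, cutoff=1e-4):
    """A cluster counts as biologically coherent if p <= cutoff."""
    return p_value <= cutoff

# Toy example (hypothetical numbers): 8 of 20 clustered genes share a GO
# term that annotates 100 of 6000 genes; 1000 GO terms were tested.
p_raw = hypergeom_tail(6000, 100, 20, 8)
p_corrected = bonferroni(p_raw, 1000)
print(is_coherent(p_raw), is_coherent(p_corrected))
```

The percentage reported in the table would then be the share of all genes that sit in clusters for which `is_coherent` holds at the chosen cutoff.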
Comparison of biological coherence of clusters obtained for dataset II by different clustering algorithms
| Clustering Method | Genes (%), p-value <= 10^-4, Uncorrected | Genes (%), p-value <= 10^-5, Uncorrected | Genes (%), p-value <= 10^-4, Bonferroni-Corrected | Genes (%), p-value <= 10^-5, Bonferroni-Corrected |
| EP_GOS_Clust | 79.4* | 66.1* | 57.2 | 52.2* |
| Iterated EP_GOS_Clust | 86.2* | 83.9* | 73.1* | 68.9* |
| K-Means | 78.0 | 62.1 | 58.0* | 51.7 |
| K-Correlation | 77.1 | 63.9 | 57.3* | 51.8 |
| K-Medians | 78.8* | 65.3 | 56.9 | 52.2* |
| SOTA | 75.2 | 66.9* | 57.1 | 39.3 |
| IClust | 66.0 | 54.0 | 34.2 | 29.1 |
| Clustering Method | Cluster Correlation, Max. | Cluster Correlation, Min. | Cluster Correlation, Ave. | -log10(P) Value, Average |
| EP_GOS_Clust | 0.920 | 0.454* | 0.730* | 9.17* |
| Iterated EP_GOS_Clust | 0.956* | 0.489* | 0.750* | 11.09* |
| K-Means | 0.961* | 0.049 | 0.668 | 9.01 |
| K-Correlation | 0.964* | 0.398* | 0.717* | 9.13 |
| K-Medians | 0.923 | 0.203 | 0.683 | 9.09 |
| SOTA | 0.911 | 0.285 | 0.624 | 9.20* |
| IClust | N.A. | N.A. | N.A. | 9.01 |
Methods include the EP_GOS_Clust backbone, the iterative algorithm described in this report, the K-family of partitional clustering algorithms with pre-assigned clusters, the self-organizing tree algorithm (SOTA), and mutual-information-based clustering (IClust) [28]. Data in the upper table are presented as described in the legend to Table 2, while the lower table presents data on expression correlation within clusters and the average -log10(P) values for biological coherence over all the clusters.
* The top three performers in each category are indicated with an asterisk.
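The correlation columns of the lower table can be sketched as follows. The source does not state how within-cluster correlation is computed, so this example assumes the mean pairwise Pearson correlation between the expression profiles in a cluster, summarized across clusters as maximum, minimum and average. All names and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all gene pairs in one cluster
    (assumed definition of within-cluster correlation)."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def summarize(values):
    """Return (max, min, average) of a per-cluster statistic."""
    return max(values), min(values), sum(values) / len(values)

# Toy clusters (hypothetical expression profiles, one list per gene).
clusters = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]],
]
per_cluster = [mean_pairwise_correlation(c) for c in clusters]
print(summarize(per_cluster))
```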
Comparison of cluster coherence between the full and abridged versions of the iterative clustering algorithm
| Method (Final Iteration) | Genes (%), p-value <= 10^-4, Uncorrected | Genes (%), p-value <= 10^-5, Uncorrected | Genes (%), p-value <= 10^-4, Bonferroni-Corrected | Genes (%), p-value <= 10^-5, Bonferroni-Corrected |
| Full Method | 86.2 | 83.9 | 73.1 | 68.9 |
| Abridged Method | 86.3 | 84.5 | 72.7 | 65.1 |
Results from applying either the full iterative method or an abridged iterative method, as described in the text, to dataset II, executed with a threshold significance of 10^-5.
Data are presented as described in the legend to Table 1.
Effect of imposing minimal gene expression levels on cluster correlation
| | No Adjustment | 1.9-fold, at least 5% time points | 1.8-fold, at least 10% time points | 1.7-fold, at least 10% time points | 1.6-fold, at least 20% time points |
| Number of Genes | 5657 | 4317 | 4045 | 4346 | 4280 |
| Optimal Clusters | 224 | 135 | 118 | 123 | 140 |
| Ave. Correlation | 0.666 | 0.706 | 0.728 | 0.735 | 0.719 |
| Max. Correlation | 0.925 | 0.925 | 0.940 | 0.947 | 0.934 |
| Min. Correlation | 0.387 | 0.445 | 0.478 | 0.474 | 0.469 |
We compare the effects of the expression filters on gene retention and on within-cluster correlation after clustering dataset II by the iterative algorithm.
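The column headings above describe filters of the form "k-fold change in at least a given fraction of time points". A minimal sketch of such a filter follows, assuming that expression values are linear-scale ratios relative to a reference and that both induction (ratio >= k) and repression (ratio <= 1/k) count as a change; neither assumption is confirmed by the source, and the function names and toy data are hypothetical.

```python
def passes_fold_change(ratios, fold=1.7, min_fraction=0.10):
    """True if at least min_fraction of the time points show a change of
    at least `fold` in either direction (assumed interpretation)."""
    changed = sum(1 for r in ratios if r >= fold or r <= 1.0 / fold)
    return changed / len(ratios) >= min_fraction

def filter_genes(expression, fold=1.7, min_fraction=0.10):
    """Keep only the genes whose profile passes the fold-change filter."""
    return {
        gene: ratios
        for gene, ratios in expression.items()
        if passes_fold_change(ratios, fold, min_fraction)
    }

# Toy data (hypothetical): ten time points per gene.
expression = {
    "geneA": [1.0] * 8 + [2.0, 1.8],   # induced at two time points
    "geneB": [1.0] * 10,               # flat profile
    "geneC": [0.5, 0.4] + [1.0] * 8,   # repressed at two time points
}
kept = filter_genes(expression, fold=1.7, min_fraction=0.10)
print(sorted(kept))
```

Tightening `fold` or `min_fraction` retains fewer genes, which is the trade-off between gene retention and cluster correlation that the table explores.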
Effect of imposing minimal gene expression levels on biological coherence
| | No Adjustment | 1.9-fold, at least 5% time points | 1.8-fold, at least 10% time points | 1.7-fold, at least 10% time points | 1.6-fold, at least 20% time points |
| % of genes in clusters with -log10(P) > 0 and < 2 | 0.49 | 0.18 | 0.19 | 0.12 | 0.36 |
| % of genes in clusters with -log10(P) > 4 | 79.39 | 81.70 | 82.55 | 82.60 | 80.24 |
| % of genes in clusters with -log10(P) > 5 | 66.06 | 70.76 | 71.14 | 73.95 | 69.31 |
| Mean -log10(P) value | 8.39 | 8.84 | 9.25 | 9.27 | 8.81 |
| Best -log10(P) value | 59.82 | 66.67 | 70.96 | 69.44 | 66.23 |
We compare the effects of the expression filters on the biological coherence of genes within clusters of dataset II obtained by the iterative algorithm.
Figure 3. Schematic showing the multi-stage clustering process for dataset III. The full set of genes, including dubious ORFs and control samples, is clustered by EP_GOS_Clust to yield 49 clusters. Those clusters with correlation ≥ 0.5 are retained and split into two groups. Those with ≥ 60% of their member genes annotated as unknown biological function are set aside as group B. The second group is subjected to iterative clustering as described in Methods, with a threshold p-value of 10^-4, yielding 21 clusters (group A). The remaining genes from the initial clustering process are first filtered to remove those with little correlation to any other gene or limited expression. Those genes passing the filter are subjected to EP_GOS_Clust and those clusters exhibiting expression correlation ≥ 0.5 are examined. Those clusters that also have at least 30% of their genes annotated to a common function with a p-value less than 10^-3 are retained as group C. Those with ≥ 50% of their member genes annotated as unknown biological function are set aside as group D. The remaining genes are once again clustered by EP_GOS_Clust, yielding one cluster with ≥ 40% of its member genes annotated as unknown biological function (group F) and several clusters with the indicated correlation, precision and coherence. The remaining 3,760 genes are then stringently filtered. Since the genes have already been subjected to clustering, we can assume that the most useful information has already been sieved out. The remaining 3562 genes are probably all irrelevant, but we would still like to identify the genes that have significant levels of expression. We hence look at the number of genes that have a minimum proportion of feature points falling within the data mean ± 0.5*(standard deviation), and find that as the pre-determined proportion is decreased, the number of genes increases almost linearly until the 77% mark, where it then starts to grow exponentially. We take this to signify an increasing bulk of spurious genes and set the cut-off at 77% to extract 206 genes for further clustering. This yields the final group of clusters (group E).
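The final filtering step in this caption can be sketched as a scan over candidate cut-offs. The example assumes that "data mean" and "standard deviation" are computed over all expression values in the remaining dataset (population standard deviation), and that a gene is counted when the fraction of its feature points inside the band mean ± 0.5*SD is at least the cut-off; these readings are assumptions, and the names and toy data are hypothetical.

```python
from statistics import mean, pstdev

def fraction_in_band(profile, center, spread, width=0.5):
    """Fraction of feature points within center +/- width * spread."""
    low, high = center - width * spread, center + width * spread
    return sum(1 for v in profile if low <= v <= high) / len(profile)

def count_genes_at_cutoffs(expression, cutoffs, width=0.5):
    """For each cut-off, count genes whose in-band fraction is at least
    that cut-off (assumed reading of the caption)."""
    values = [v for profile in expression.values() for v in profile]
    center, spread = mean(values), pstdev(values)
    fractions = {
        gene: fraction_in_band(profile, center, spread, width)
        for gene, profile in expression.items()
    }
    return {c: sum(1 for f in fractions.values() if f >= c) for c in cutoffs}

# Toy data (hypothetical): four feature points per gene.
expression = {
    "g1": [0, 0, 0, 0],
    "g2": [0, 0, 5, -5],
    "g3": [5, -5, 5, -5],
}
counts = count_genes_at_cutoffs(expression, [0.9, 0.77, 0.5, 0.0])
print(counts)
```

Plotting such counts against the cut-off is one way to look for the point where the growth changes character, which the caption places at 77% for dataset III.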
Figure 4. Functional analysis of dataset III clusters. Genes in clusters obtained as described in Figure 3 are assessed for motif enrichment in their 5' and 3' flanking regions using FIRE. On the left is the subset of clusters (columns) exhibiting statistically significant enrichment (shades of yellow) or exclusion (shades of blue) of a motif (rows) whose consensus sequence is shown to the right. If known, the name of the motif or the factor that likely binds to it is provided. The clusters are also examined to determine whether the expression pattern of genes in a cluster is associated with a gene segregating in the cross from which the data are derived. The LOD (log of the odds) score of linkage to each 20 kb bin across the entire yeast genome is shown for a representative subset of the clusters. A potential trans-acting factor encoded within the interval of the elevated LOD score is shown next to the peak.
Summary of clustering results with dataset III
| Cluster Group | Iteration | Cluster Size, Max. | Cluster Size, Min. | -log10(P) Value, Ave. | -log10(P) Value, Max. | -log10(P) Value, Min. | Correction, Ave. | Correlation, Ave. | Precision, Ave. |
| B | | 61 | 4 | 2.5 | 4.2 | 1.2 | 1.7 | 0.609 | 0.753 |
| D | | 175 | 8 | 5.5 | 13.7 | 1.5 | 3.9 | 0.362 | 0.641 |
| F | | 102 | 102 | 2.9 | 2.9 | 2.9 | 0.9 | 0.172 | 0.494 |
| A | Initial | 271 | 2 | 20.0 | 140.1 | 0 (a) | 19.7 | 0.655 | 0.461 |
| A | Final | 271 | 2 | 21.7 | 140.1 | 1.8 (b) | 20.6 | 0.707 | 0.522 |
| C | | 116 | 2 | 10.4 | 33.3 | 3.2 | 9.5 | 0.672 | 0.735 |
| E | Initial (c) | - | - | - | - | - | - | - | - |
| E | Final | 88 | 2 | 4.4 | 11.4 | 1.1 | 3.4 | 0.635 | 0.440 |
Genes in dataset III are clustered by EP_GOS_Clust through a sequential process outlined in Figure 3. Genes in cluster groups A and E are further clustered by the iterative algorithm, yielding an initial and final set of clusters. Precision is defined as the fraction of genes within a cluster assigned to the predominant functional group within that cluster.
(a) The cluster -log10(P) value is reported as zero if a GO search did not uncover any significant annotation.
(b) After iteratively clustering 184 genes into 15 initial clusters ('A' in Figure 3), just one poor cluster remains. The next-worst cluster has a -log10(P) value of 4.1.
(c) There are no applicable initial values here, since the remaining genes to be clustered are subjected to the second filter before being re-clustered into the initial 6 clusters (see Figure 3).
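Precision, as defined in the table legend above, is the fraction of genes in a cluster that belong to the cluster's predominant functional group. A minimal sketch of that definition follows, assuming each gene carries exactly one functional label; how genes with several GO annotations are handled is not stated in the source, and the names and labels here are hypothetical.

```python
from collections import Counter

def cluster_precision(annotations):
    """Return the predominant functional group in a cluster and the
    fraction of the cluster's genes assigned to it.

    annotations: one functional-group label per gene (assumption)."""
    counts = Counter(annotations)
    label, size = counts.most_common(1)[0]
    return label, size / len(annotations)

# Toy cluster (hypothetical labels): three of four genes share a function.
labels = ["ribosome biogenesis"] * 3 + ["unknown"]
group, precision = cluster_precision(labels)
print(group, precision)
```

The "Precision, Ave." column would then correspond to this value averaged over the clusters in each group.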
Figure 5. Schematic of the EP_GOS_Clust clustering algorithm used in this study. Although the formulation in this study has been notated for DNA microarray data, the algorithm framework can be adapted for clustering any numeric data.