Curtis Huttenhower, Avi I Flamholz, Jessica N Landis, Sauhard Sahi, Chad L Myers, Kellen L Olszewski, Matthew A Hibbs, Nathan O Siemers, Olga G Troyanskaya, Hilary A Coller.
Abstract
BACKGROUND: The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes).
Year: 2007 PMID: 17626636 PMCID: PMC1941745 DOI: 10.1186/1471-2105-8-250
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. NNN algorithm overview. An example of the Nearest Neighbor Networks algorithm operating on 14 genes with clique size g = 3 and neighborhood size n = 4. A. A directed graph is generated in which each gene is connected to its n nearest neighbors. B. An undirected graph is constructed from the bidirectional connections. C. Overlapping cliques of size g are merged to produce preliminary networks. D. Preliminary networks containing cut-vertices are split into final networks, with copies of the cut-vertices occupying both resulting networks.
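The four stages of the caption can be sketched in code. This is an illustrative reconstruction from the caption only, not the authors' implementation: the similarity matrix is assumed precomputed, cliques are enumerated by brute force, and the criterion for "overlapping" cliques (here, sharing at least two genes, i.e. an edge) is an assumption of this sketch.

```python
import itertools

def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    left, comps = set(nodes), []
    while left:
        stack, seen = [next(iter(left))], set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v] & left - seen)
        comps.append(seen)
        left -= seen
    return comps

def split_at_cut_vertices(cluster, adj):
    """Stage D: split a preliminary network at any cut-vertex, placing a
    copy of the cut-vertex in each resulting piece, then recurse."""
    for v in cluster:
        parts = components(cluster - {v}, adj)
        if len(parts) > 1:
            return [c for p in parts
                    for c in split_at_cut_vertices(p | {v}, adj)]
    return [set(cluster)]

def nnn_clusters(sim, n=4, g=3):
    """Stages A-D for a precomputed similarity matrix `sim`
    (dict: gene -> dict: gene -> score); paper defaults are n=25, g=5."""
    genes = list(sim)
    # A. Directed graph: each gene points to its n most similar genes.
    nbrs = {a: set(sorted((b for b in genes if b != a),
                          key=lambda b: -sim[a][b])[:n]) for a in genes}
    # B. Undirected graph from the bidirectional (mutual) connections.
    edges = {frozenset((a, b)) for a in genes for b in nbrs[a] if a in nbrs[b]}
    adj = {a: {b for b in genes if frozenset((a, b)) in edges} for a in genes}
    # C. Enumerate size-g cliques (brute force; fine for a sketch) and
    # merge cliques sharing at least two genes (an edge) -- an assumed
    # merge criterion -- into preliminary networks.
    merged = []
    for c in itertools.combinations(genes, g):
        if not all(frozenset(e) in edges for e in itertools.combinations(c, 2)):
            continue
        hit = [m for m in merged if len(m & set(c)) >= 2]
        merged = [m for m in merged if m not in hit] + [set(c).union(*hit)]
    # D. Split preliminary networks at cut-vertices into final networks.
    return [f for m in merged for f in split_at_cut_vertices(m, adj)]
```

On two tight groups of three mutually similar genes, for example, the sketch recovers the two triangles as separate networks; a "bowtie" of two triangles joined at one gene is split at that cut-vertex, with the shared gene copied into both final networks.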
Clustering algorithm summary statistics. (Row and column labels were not preserved in extraction; per the legend below, the five columns correspond to the clustering algorithms compared, and each four-row group summarizes one data set or random condition.)

| 1527 | 3410 | 6162 | 6137 | 2284 |
| 54 | 800 | 82 | 127 | 113 |
| 28.4 | 4.26 | 75.1 | 48.3 | 102 |
| 49.2 | 16.91 | 161 | 93.3 | 70.3 |

| 1142 | 4079 | 6115 | 6092 | 3120 |
| 38 | 666 | 9 | 69 | 128 |
| 30.1 | 6.12 | 679 | 88.3 | 130 |
| 62.5 | 35.58 | 787 | 220 | 101 |

| 64 | 6251 | 6256 | 6236 | 280 |
| 11 | 45 | 16 | 56 | 5 |
| 5.82 | 138.9 | 391 | 11.4 | 88.4 |
| 1.19 | 347.3 | 474 | 258 | 36.5 |

| 1996 | 2579 | 6153 | 6121 | 3375 |
| 29 | 519 | 75 | 177 | 325 |
| 68.9 | 4.97 | 82.0 | 34.6 | 45.9 |
| 245.4 | 11.95 | 107 | 57.8 | 44.1 |

| 2247 | 5820 | 6005 | 5970 | 778 |
| 27 | 687 | 46 | 110 | 25 |
| 83.2 | 8.47 | 131 | 54.3 | 139 |
| 390 | 19.26 | 187 | 80.4 | 96.3 |

| 2050 | 5535 | 5701 | 5669 | 777 |
| 28 | 616 | 47 | 100 | 32 |
| 73.3 | 8.99 | 121 | 56.7 | 69.0 |
| 324 | 30.14 | 206 | 114 | 37.3 |

| 694 | 6155 | 6160 | - | 4892 |
| 29 | 7 | 5 | - | 609 |
| 23.9 | 879.3 | 1232 | - | 63.7 |
| 34.7 | 2140 | 1768 | - | 82.0 |

| 0 (± 0) | 5988 (± 0.89) | 3600 (± 3286) | 5964 (± 28.8) | 0 (± 0) |
| 0 (± 0) | 216.2 (± 2.95) | 9.8 (± 9.81) | 109 (± 4.72) | 0 (± 0) |
| 0 (± 0) | 27.7 (± 0.38) | 190 (± 175) | 53.0 (± 1.39) | 0 (± 0) |
| 0 (± 0) | 21.86 (± 0.25) | 48.8 (± 45.7) | 35.2 (± 0.791) | 0 (± 0) |

| 0 (± 0) | 5986 (± 3.58) | 6000 (± 0) | 5975 (± 4.77) | 0 (± 0) |
| 0 (± 0) | 231.6 (± 3.29) | 28.8 (± 11.9) | 124 (± 1.30) | 0 (± 0) |
| 0 (± 0) | 25.85 (± 0.36) | 235 (± 82.6) | 48.3 (± 0.482) | 0 (± 0) |
| 0 (± 0) | 18.14 (± 0.15) | 64.8 (± 46.3) | 30.9 (± 0.374) | 0 (± 0) |

| 101.4 (± 28.85) | 0 (± 0) | 6162 (± 0) | 5837 (± 260.6) | 1061 (± 35.87) |
| 16.2 (± 3.96) | 0 (± 0) | 36.2 (± 28.99) | 428 (± 33.88) | 156 (± 4.85) |
| 6.23 (± 0.78) | 0 (± 0) | 680.7 (± 864.7) | 13.67 (± 0.46) | 32.46 (± 1.35) |
| 1.64 (± 0.79) | 0 (± 0) | 884.5 (± 1179) | 2.36 (± 0.52) | 18.03 (± 1.13) |

| 19.4 (± 6.66) | 0 (± 0) | 4586 (± 3058) | 5507 (± 47.19) | 1382 (± 15.27) |
| 3.6 (± 1.34) | 0 (± 0) | 20.75 (± 33.71) | 411.2 (± 5.12) | 219.8 (± 15.27) |
| 5.47 (± 1.04) | 0 (± 0) | 701 (± 941.3) | 13.39 (± 0.058) | 18.38 (± 0.35) |
| 0.66 (± 1.48) | 0 (± 0) | 950.8 (± 1197) | 1.7 (± 0.03) | 9.25 (± 0.38) |

| 20.2 (± 8.61) | 572.8 (± 12.74) | 4922 (± 2752) | 4815 (± 76.96) | 1808 (± 56.32) |
| 3.6 (± 1.82) | 224 (± 8.22) | 13 (± 10.84) | 407.2 (± 7.56) | 390.8 (± 5.67) |
| 6.13 (± 1.64) | 2.56 (± 0.044) | 592.5 (± 826.8) | 11.83 (± 0.038) | 11.09 (± 0.39) |
| 0.53 (± 0.71) | 0.82 (± 0.046) | 101.7 (± 200.2) | 1.15 (± 0.024) | 5.59 (± 0.5) |
Summary statistics for the Nearest Neighbor Networks clusters formed from each data set employed in this study, from their concatenation, and from two synthetic random data sets, using default parameters (g = 5, n = 25). Results from other clustering algorithms with appropriate output formats (CAST, CLICK, QTC, and SAMBA) are included, also using the default parameter settings provided by each algorithm's implementation. Values for random data are shown with standard deviations over five different seeds.
Figure 2. Example NNN output. A subset of the Nearest Neighbor Networks clusters produced from the data set of [35] using the parameters g = 5 and n = 10, visualized using Java TreeView [42]. NNN clusters have been colored and internally hierarchically clustered, and the cluster centroids have in turn been hierarchically clustered to provide an easily interpretable tree.
Figure 3. Global evaluation of clustering algorithms. Evaluation results for eight clustering algorithms and six microarray data sets based on the global answer set (employing 200 GO terms of functional interest and discarding ribosome biogenesis and assembly [37]). Performance is measured using log2(TP) on the horizontal axis and the log-likelihood score LLS = log2((TP/FP)/(P/N)) on the vertical axis, where P is the total number of positive pairs, N the total number of negative pairs, and TP and FP the numbers of true and false positives at a particular recall threshold. A. Brem 2005. B. Gasch 2000. C. Haugen 2004. D. Hughes 2000. E. Primig 2000. F. Spellman 1998. G. All six data sets concatenated.
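The LLS formula in the caption is straightforward to compute; a minimal sketch, directly transcribing the caption's definition:

```python
import math

def lls(tp, fp, p, n):
    """Log-likelihood score from the Figure 3 caption:
    LLS = log2((TP/FP) / (P/N)) -- the enrichment of true over false
    positives relative to the background ratio of positive to negative
    pairs. LLS = 0 means performance no better than chance; positive
    values indicate enrichment for true functional relationships."""
    return math.log2((tp / fp) / (p / n))
```

For example, recovering 50 true and 10 false positive pairs against a background of 100 positive and 1000 negative pairs gives LLS = log2(5 / 0.1) = log2(50), roughly 5.6 bits above chance.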
Figure 4. Functional evaluation of clustering algorithms. Function-specific evaluation results for each clustering method on a per data set and GO term basis. Each cell represents an AUC score calculated analytically using the Wilcoxon rank-sum formula; below-baseline performance appears in blue, and yellow indicates higher performance. Data set and term combinations for which ten or fewer pairs could be evaluated are excluded and appear as gray missing values; functions for which less than 10% of methods were available due to gene exclusion by NNN, QTC, or SAMBA were removed. Visualization provided by TIGR MeV [41].
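Computing AUC "analytically using the Wilcoxon rank-sum formula," as the caption describes, avoids integrating an ROC curve. A sketch of the standard calculation (the average-rank handling of ties is a common convention assumed here, not stated in the caption):

```python
def auc_rank_sum(pos, neg):
    """AUC from the Wilcoxon rank-sum statistic: AUC = U / (n1 * n2),
    where U = R - n1*(n1 + 1)/2 and R is the sum of the positive
    examples' ranks in the pooled, ascending-sorted scores.
    Tied scores receive average ranks (an assumed convention)."""
    scored = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg])
    r, i, n = 0.0, 0, len(scored)
    while i < n:
        j = i
        while j < n and scored[j][0] == scored[i][0]:
            j += 1
        # items i..j-1 are tied; their ranks i+1..j average to (i+1+j)/2
        r += (i + 1 + j) / 2 * sum(lbl for _, lbl in scored[i:j])
        i = j
    n1, n2 = len(pos), len(neg)
    return (r - n1 * (n1 + 1) / 2) / (n1 * n2)
```

Perfectly separated scores give AUC = 1, perfectly inverted scores give 0, and identical scores for both classes give the chance baseline of 0.5.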
Figure 5. Functional diversity of clustering algorithms. An evaluation of each clustering algorithm's ability to detect the 88 biological processes for which data were available in our analysis. For each algorithm, the maximum AUC across all six data sets was determined, and the resulting AUCs are presented here in descending order per algorithm. NNN correctly clusters genes from substantially more biological processes than previous methods.