Pablo A Jaskowiak, Ricardo J G B Campello, Ivan G Costa.
Abstract
BACKGROUND: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers gain insights and formulate new hypotheses about biological data from microarrays. Across the different settings of microarray experiments, clustering proves to be a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. To obtain useful clustering results, however, the parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance to employ between data objects is probably one of the most difficult decisions. RESULTS AND
Year: 2014 PMID: 24564555 PMCID: PMC4072854 DOI: 10.1186/1471-2105-15-S2-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
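The abstract's point that the choice of distance between data objects matters can be illustrated with a minimal sketch. It assumes three commonly used measures (Euclidean, cosine and Pearson-based distance) and uses made-up profiles; neither the function names nor the values come from the paper.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance; sensitive to the absolute scale of the values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """1 - cosine of the angle between x and y (undefined for all-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def pearson_distance(x, y):
    """1 - Pearson correlation, computed as the cosine distance of centered vectors.

    Undefined for constant profiles (zero variance).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_distance([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical expression profiles (made-up values, not taken from the paper).
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [10.0, 20.0, 30.0, 40.0]  # same shape as profile_a, ten times the scale
profile_c = [4.0, 3.0, 2.0, 1.0]      # mirror image of profile_a

measures = [
    ("euclidean", euclidean_distance),
    ("cosine", cosine_distance),
    ("pearson", pearson_distance),
]
for name, dist in measures:
    print(name, dist(profile_a, profile_b), dist(profile_a, profile_c))
```

In this toy case the Euclidean distance places `profile_a` closer to its mirror image than to its scaled copy, while the Pearson-based distance does the opposite, so the same clustering method could group the profiles differently depending on the distance chosen.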
Evaluation scenarios applied to each type of data.
| Evaluation Scenario | Cancer Sample | Gene Time-Series |
|---|---|---|
| Fixed # of Clusters | ✓ | - |
| Variable # of Clusters | ✓ | - |
| Estimated # of Clusters | ✓ | ✓ |
| Robustness to Noise | ✓ | - |
Figure 1. Cancer Datasets Results: Class recovery obtained for cancer datasets under the three evaluation scenarios considered, shown in subfigures (a), (b), and (c). Bars display mean results for each pair of clustering method and distance function for the two types of datasets: cDNA (left) and Affymetrix (right).
Figure 2. Robustness to Noise for Cancer Datasets: ARI values at different noise levels (%) for PE, JK, SP, RM, COS and EUC. Plots show the mean ARI values over runs performed on 100 different noisy datasets with the same amount (%) of noise points. Bars indicate standard deviations.
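For readers unfamiliar with the ARI values reported in Figure 2, the following is a minimal sketch of the Adjusted Rand Index in its standard pair-counting form. The function name and the toy labels are illustrative assumptions, not the authors' code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same objects.

    1.0 means identical partitions (up to label names); values near 0 are
    what random labelings would give on average.
    """
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 objects)")

    # Pairs of objects that share a cluster in both, the first, or the second partition.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (pairs_true + pairs_pred)        # largest attainable agreement
    if maximum == expected:                          # degenerate case, e.g. one cluster each
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Toy labels (hypothetical): same grouping with swapped names, then a crossed grouping.
same_grouping = adjusted_rand_index([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed_grouping)
```

Because the index is corrected for chance, a partition that agrees with the reference less than expected at random receives a negative value, as the crossed grouping does here.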
Wins/Ties/Losses for 15 distances and 17 datasets.
| | SL | AL | CL | KM |
|---|---|---|---|---|
| SL | -- | 531/370/2924 | 378/384/3063 | 385/323/3117 |
| AL | 2912/406/507 | -- | 1903/93/1829 | 1710/80/2035 |
| CL | 3063/386/376 | 1821/106/1898 | -- | 1803/17/2005 |
| KM | 3117/323/385 | 2032/80/1713 | 2001/18/1806 | -- |
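Each cell in the table above sums to 3825 = 15 × 15 × 17, which is consistent with comparing every distance of the row method against every distance of the column method on each dataset. The sketch below assumes that reading, and assumes that higher scores are better; the function name, the score layout and the toy values are hypothetical.

```python
from itertools import product


def wins_ties_losses(row_scores, col_scores, tol=1e-12):
    """Count wins/ties/losses of the row method against the column method.

    Both arguments map distance name -> {dataset name -> score}, with higher
    scores assumed to be better. Every (row distance, column distance) pair
    is compared on every dataset.
    """
    wins = ties = losses = 0
    for row_dist, col_dist in product(row_scores, col_scores):
        for dataset, row_value in row_scores[row_dist].items():
            col_value = col_scores[col_dist][dataset]
            if abs(row_value - col_value) <= tol:
                ties += 1
            elif row_value > col_value:
                wins += 1
            else:
                losses += 1
    return wins, ties, losses


# Toy scores (made up): two distances, two datasets, two clustering methods.
method_a = {"d1": {"A": 0.9, "B": 0.5}, "d2": {"A": 0.4, "B": 0.5}}
method_b = {"d1": {"A": 0.4, "B": 0.7}, "d2": {"A": 0.9, "B": 0.5}}

print(wins_ties_losses(method_a, method_b))  # 2 x 2 x 2 = 8 comparisons in total
print(wins_ties_losses(method_b, method_a))
```

Under this counting scheme, swapping the row and column methods swaps wins and losses and leaves ties unchanged.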
Figure 3. Gene Time-Series Results: Results for gene time-series data. Subfigures (a), (b) and (c) depict pairwise comparisons of distances for each clustering method. Subfigure (d) depicts an all-against-all pairwise comparison. Each cell accounts for the number of datasets in which the method from the row obtained a better enrichment than the method from the column. The "hotter"/"colder" the cell, the better/worse the row method is in comparison to the column one.
Summary of the cancer benchmark data employed in our evaluation.
| Name | Platform | n_c | n_o | n_f |
|---|---|---|---|---|
| | Affymetrix | 2 | 72 | 1081 |
| | Affymetrix | 2 | 104 | 182 |
| | Affymetrix | 2 | 72 | 1877 |
| | Affymetrix | 2 | 181 | 1626 |
| | Affymetrix | 2 | 37 | 2202 |
| | Affymetrix | 2 | 28 | 1070 |
| | Affymetrix | 2 | 22 | 1152 |
| | Affymetrix | 2 | 34 | 857 |
| | Affymetrix | 2 | 77 | 798 |
| | Affymetrix | 2 | 102 | 339 |
| | Affymetrix | 2 | 49 | 1198 |
| | Affymetrix | 2 | 248 | 2526 |
| | Affymetrix | 3 | 72 | 2194 |
| | Affymetrix | 3 | 40 | 1203 |
| | Affymetrix | 3 | 72 | 1877 |
| | Affymetrix | 4 | 50 | 1377 |
| | Affymetrix | 5 | 203 | 1543 |
| | Affymetrix | 5 | 42 | 1379 |
| | Affymetrix | 6 | 248 | 2526 |
| | Affymetrix | 10 | 174 | 1571 |
| | Affymetrix | 14 | 190 | 1363 |
| | cDNA | 2 | 42 | 1095 |
| | cDNA | 2 | 180 | 85 |
| | cDNA | 2 | 38 | 2201 |
| | cDNA | 3 | 50 | 1739 |
| | cDNA | 3 | 69 | 1625 |
| | cDNA | 3 | 37 | 1411 |
| | cDNA | 3 | 62 | 2093 |
| | cDNA | 4 | 92 | 1288 |
| | cDNA | 4 | 62 | 2093 |
| | cDNA | 4 | 66 | 4553 |
| | cDNA | 4 | 83 | 1069 |
| | cDNA | 4 | 110 | 2496 |
| | cDNA | 4 | 42 | 1771 |
| | cDNA | 5 | 104 | 2315 |

Columns display the name of the data, the microarray platform, the number of clusters (n_c), the number of objects (n_o), and the number of features (n_f), respectively.
Summary of the time-series benchmark data employed in our evaluation.
| Name | Source | n_fo | n_oo | n_f |
|---|---|---|---|---|
| | Gasch | 1030 | 6152 | 7 |
| | Gasch | 1016 | 6152 | 7 |
| | Gasch | 962 | 6152 | 7 |
| | Gasch | 999 | 6152 | 7 |
| | Gasch | 1038 | 6152 | 8 |
| | Gasch | 991 | 6152 | 8 |
| | Gasch | 988 | 6152 | 8 |
| | Gasch | 1050 | 6152 | 9 |
| | Gasch | 976 | 6152 | 10 |
| | Gasch | 1011 | 6152 | 10 |
| | Gasch | 1022 | 6152 | 10 |
| | Gasch | 1011 | 6152 | 12 |
| | Spellman | 935 | 6178 | 14 |
| | Spellman | 1044 | 6178 | 17 |
| | Spellman | 1099 | 6178 | 18 |
| | Spellman | 1086 | 6178 | 24 |
| | Chu | 1171 | 6118 | 7 |

Columns display the name of the data, the source, the number of filtered objects (n_fo), the number of objects originally in the dataset (n_oo), and the number of features (n_f), respectively.