Pierre R Bushel, Russell D Wolfinger, Greg Gibson.
Abstract
BACKGROUND: Commonly employed clustering methods for analysis of gene expression data do not directly incorporate phenotypic data about the samples. Furthermore, clustering of samples with known phenotypes is typically performed in an informal fashion. The inability of clustering algorithms to incorporate biological data in the grouping process can limit proper interpretation of the data and its underlying biology.
Year: 2007 PMID: 17408499 PMCID: PMC1839893 DOI: 10.1186/1752-0509-1-15
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Figure 1. Modified k-prototypes clustering of mixed data types. a) The data sets used for clustering and the components of the modk-prototypes algorithm. The type of each data set is denoted in parentheses. b) The k-prototypes algorithm was modified (termed modk-prototypes) to include B iterations of the assignment of the samples to the k clusters, for each k from 2 to N, the number of samples. d(X, Q) is the dissimilarity function between the ith sample and the lth cluster prototype. The cluster prototypes are updated and the samples are reassigned repeatedly until the cluster assignments no longer change. The validity score is computed for the final assignment of the samples. The number of clusters in the data is estimated by finding the assignment of the samples (over all B initializations and all k partitions) that yields the optimal validity score.
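The loop described in panel b can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-domain distances (squared Euclidean for numeric data, mismatch counting for categorical data) follow the general k-prototypes idea, the weighted sum of the three domain dissimilarities is assumed, and the names `dissimilarity`, `make_prototype`, `cluster_once`, `modk_prototypes` and the caller-supplied `validity` function are all hypothetical.

```python
import random

def dissimilarity(x, q, weights):
    """Weighted mixed-type dissimilarity d(X, Q) between a sample and a prototype.

    ASSUMPTION: each sample is a dict with numeric 'expr' and 'chem' lists and a
    categorical 'histo' list; the three domain terms are combined as a weighted sum.
    """
    alpha, beta, gamma = weights
    d_expr = sum((a - b) ** 2 for a, b in zip(x["expr"], q["expr"]))
    d_chem = sum((a - b) ** 2 for a, b in zip(x["chem"], q["chem"]))
    d_histo = sum(a != b for a, b in zip(x["histo"], q["histo"]))
    return alpha * d_expr + beta * d_chem + gamma * d_histo

def make_prototype(members):
    """Prototype = per-feature mean (numeric) and most frequent level (categorical)."""
    n = len(members)
    proto = {key: [sum(col) / n for col in zip(*(m[key] for m in members))]
             for key in ("expr", "chem")}
    proto["histo"] = [max(sorted(set(col)), key=col.count)
                      for col in zip(*(m["histo"] for m in members))]
    return proto

def cluster_once(samples, k, weights, rng, max_iter=100):
    """One initialization: reassign and update until the assignment stops changing."""
    prototypes = [dict(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda l: dissimilarity(x, prototypes[l], weights))
                      for x in samples]
        if new_labels == labels:
            break
        labels = new_labels
        for l in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == l]
            if members:  # keep the previous prototype if a cluster empties
                prototypes[l] = make_prototype(members)
    return labels, prototypes

def modk_prototypes(samples, weights, validity, k_values, B=10, seed=0):
    """Run B random initializations per k and keep the best-scoring assignment.

    ASSUMPTION: `validity(labels, prototypes)` returns a float where larger is
    better; the validity index used in the source is not reproduced here.
    """
    rng = random.Random(seed)
    best = None
    for k in k_values:
        for _ in range(B):
            labels, prototypes = cluster_once(samples, k, weights, rng)
            score = validity(labels, prototypes)
            if best is None or score > best[0]:
                best = (score, k, labels)
    return best
```

A call such as `modk_prototypes(samples, (0.5, 0.25, 0.25), validity, range(2, len(samples) + 1))` would then return the best score, the corresponding k and the sample labels. The toy weights here are placeholders.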
Figure 2. Determination of k clusters in the heart disease and acetaminophen data sets using modk-prototypes. The a) heart disease data and b) acetaminophen data were clustered using the modk-prototypes algorithm at values of k increasing from 2 to N, the number of samples in the data. DVI_CU (y axis) was computed and plotted for the clustering of the data at each value of k (x axis). Only k = 2 to 10 is shown.
Figure 3. Validation of the cluster assignments for the acetaminophen data. modk-prototypes clustering of the acetaminophen data was performed 100 times at k = 3 using equal weighting of the microarray, clinical chemistry and histopathology domain data. The histopathology observation for necrosis of the centrilobular region of the rat liver was removed from the data prior to clustering and used as an external indicator to validate the cluster assignments. The adjusted Rand index (x axis) was computed for each clustering of the data and graphed by the count (y axis) of cluster assignments scoring within each range of the index.
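For readers unfamiliar with the adjusted Rand index, the following sketch computes it from two label vectors using the standard pair-counting formula. The function name and the toy labels are illustrative only; they are not taken from the study data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same samples."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of samples placed together in both partitions (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical example: an external indicator versus cluster labels.
necrosis = ["none", "none", "mild", "mild"]
clusters = [2, 2, 3, 1]
ari = adjusted_rand_index(necrosis, clusters)  # 4/7, about 0.571
```

A value of 1 means the two partitions agree exactly (up to relabeling), and values near 0 indicate agreement no better than chance.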
Proposed weighting schemes for the domain data.
| Scheme | Microarray (α) | Clinical chemistry (β) | Histopathology (γ) |
| 1 | 0.6 | 0.2 | 0.2 |
| 2 | 0.5 | 0.2 | 0.3 |
| 3 | 0.5 | 0.25 | 0.25 |
| 4 | 0.6 | 0 | 0.4 |
| 5a | 0.2 | 0.5 | 0.3 |
| 5b | 0.8 | 0.1 | 0.1 |
| 6 | 0.6 | 0.1 | 0.3 |
| 7a | 0.4 | 0.4 | 0.2 |
| 7b | 0.4 | 0.2 | 0.4 |
| 8a | 0.6 | 0.3 | 0.1 |
| 8b | 0.4 | 0.2 | 0.4 |
| 9 | 0.6 | 0.2 | 0.2 |
| 10 | 0.6 | 0.2 | 0.2 |
| 11 | 0.6 | 0.2 | 0.2 |
| 12 | 0.4 | 0.3 | 0.3 |
| 13 | 0.333 | 0.333 | 0.333 |
| 14a | 0.333 | 0.333 | 0.333 |
| 14b | 0.5 | 0.2 | 0.3 |
| 15a | 0.7 | 0.2 | 0.1 |
| 15b | 0.3 | 0.4 | 0.3 |
| 15c | 0.7 | 0.2 | 0.1 |
| Average | 0.508 | 0.239 | 0.253 |
| Standard Dev. | 0.152 | 0.113 | 0.100 |
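The summary rows can be checked directly from the 21 schemes listed. The sketch below assumes the three weight columns are ordered microarray, clinical chemistry, histopathology, and that the reported spread is the sample standard deviation; the variable names are illustrative.

```python
from statistics import mean, stdev

# Weights copied from the table rows above (schemes 1 through 15c).
# ASSUMPTION: column order is microarray, clinical chemistry, histopathology.
schemes = [
    (0.6, 0.2, 0.2), (0.5, 0.2, 0.3), (0.5, 0.25, 0.25), (0.6, 0.0, 0.4),
    (0.2, 0.5, 0.3), (0.8, 0.1, 0.1), (0.6, 0.1, 0.3), (0.4, 0.4, 0.2),
    (0.4, 0.2, 0.4), (0.6, 0.3, 0.1), (0.4, 0.2, 0.4), (0.6, 0.2, 0.2),
    (0.6, 0.2, 0.2), (0.6, 0.2, 0.2), (0.4, 0.3, 0.3), (0.333, 0.333, 0.333),
    (0.333, 0.333, 0.333), (0.5, 0.2, 0.3), (0.7, 0.2, 0.1), (0.3, 0.4, 0.3),
    (0.7, 0.2, 0.1),
]

columns = list(zip(*schemes))                     # one tuple per data domain
averages = [round(mean(c), 3) for c in columns]   # [0.508, 0.239, 0.253]
spreads = [round(stdev(c), 3) for c in columns]   # [0.152, 0.113, 0.1]
```

Both rows match the table when every listed scheme, including the lettered variants, is counted once.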
Validation of clustering the samples from the acetaminophen data using modk-prototypes.
| Scheme | α | β | γ | Validation score | k |
| 1 | 0.64 | ||||
| 2 | 1 | 0 | 0 | 0.51 | 2 |
| 3 | 0 | 1 | 0 | 0.69 | 6 |
| 4 | 0 | 0 | 1 | 0.33 | 2 |
| 5 | 0.5 | 0.5 | 0 | 0.66 | 5 |
| 6 | 0.5 | 0 | 0.5 | 0.46 | 4 |
| 7 | 0 | 0.5 | 0.5 | 0.64 | 3 |
| 8 | 0.51 | 0.24 | 0.25 | 0.67 | 3 |
| 9 | 0.4 | 0.4 | 0.2 | 0.64 | 3 |
| 10 | 0.4 | 0.2 | 0.4 | 0.67 | 3 |
| 11 | 0.6 | 0.2 | 0.2 | 0.67 | 3 |
| 12 | 0.2 | 0.4 | 0.4 | 0.64 | 3 |
| 13* | 0.26 | 0.39 | 0.35 | 0.64 | 3 |
| 14 | 0.8 | 0.1 | 0.1 | 0.67 | 3 |
| 15 | 0.7 | 0.15 | 0.15 | 0.67 | 3 |
α, β and γ denote the weights for the microarray, clinical chemistry and histopathology data domain dissimilarity measures, respectively. k is the number of clusters formed.
Partial end-point components of the phenotypic prototypes from the clustering of the acetaminophen-treated samples.
| | Cluster | | |
| Features | 1 | 2 | 3 |
| Cong_Sinusoid | Moderate | None | Minimal |
| Necr_Cent | Moderate | None | Mild |
| Infl_Cent | None | None | Minimal/Mild* |
| Hypert_Hepa | Minimal | None | None |
| Regen_Hepa | None | None | Minimal |
| Dege_Hepa* | None | None | Minimal |
| ALB (g/dL) | 5.16 | 5.03 | 4.78 |
| ALP (IU/L) | 413.22 | 323.43 | 368.78 |
| ALT (IU/L) | 9649.40 | 118.48 | 1676.10 |
| AST (IU/L) | 20304.00 | 171.80 | 2820.20 |
| Creat (mg/dL) | 0.70 | 0.70 | 0.70 |
| BUN (mg/dL) | 23.56 | 15.56 | 18.11 |
| CHOLE (mg/dL) | 59.78 | 86.54 | 85.44 |
| TBA (µmol/L) | 61.67 | 7.61 | 43.56 |
| SDH (IU/L) | 2.89 | 32.41 | 398.89 |
| TP (g/dL) | 7.53 | 7.52 | 7.19 |
| * Observed in the left medial lobe | |||
Cluster assignment of the acetaminophen-treated samples.
| Cluster 1 | | Cluster 2 | | Cluster 3 | |
| Treatment | Animal # | Treatment | Animal # | Treatment | Animal # |
| 1500 MG/KG, 18 HR | 404 | 50 MG/KG, 6 HR | 202 | 1500 MG/KG, 24 HR | 407 |
| 1500 MG/KG, 24 HR | 419 | 50 MG/KG, 6 HR | 203 | 1500 MG/KG, 48 HR | 411 |
| 1500 MG/KG, 24 HR | 421 | 50 MG/KG, 18 HR | 204 | 1500 MG/KG, 48 HR | 412 |
| 2000 MG/KG, 18 HR | 505 | 50 MG/KG, 18 HR | 206 | 1500 MG/KG, 18 HR | 416 |
| 2000 MG/KG, 18 HR | 506 | 50 MG/KG, 24 HR | 208 | 1500 MG/KG, 24 HR | 420 |
| 2000 MG/KG, 24 HR | 508 | 50 MG/KG, 24 HR | 209 | 1500 MG/KG, 48 HR | 424 |
| 2000 MG/KG, 24 HR | 509 | 50 MG/KG, 48 HR | 210 | 2000 MG/KG, 48 HR | 510 |
| 2000 MG/KG, 18 HR | 516 | 50 MG/KG, 48 HR | 211 | 2000 MG/KG, 48 HR | 512 |
| 2000 MG/KG, 24 HR | 521 | 50 MG/KG, 48 HR | 212 | 2000 MG/KG, 48 HR | 524 |
| | | 50 MG/KG, 6 HR | 213 | | |
| | | 50 MG/KG, 6 HR | 214 | | |
| | | 50 MG/KG, 18 HR | 216 | | |
| | | 50 MG/KG, 18 HR | 217 | | |
| | | 50 MG/KG, 24 HR | 220 | | |
| | | 50 MG/KG, 24 HR | 221 | | |
| | | 50 MG/KG, 48 HR | 223 | | |
| | | 150 MG/KG, 6 HR | 302 | | |
| | | 150 MG/KG, 6 HR | 303 | | |
| | | 150 MG/KG, 18 HR | 306 | | |
| | | 150 MG/KG, 24 HR | 307 | | |
| | | 150 MG/KG, 24 HR | 308 | | |
| | | 150 MG/KG, 48 HR | 310 | | |
| | | 150 MG/KG, 48 HR | 311 | | |
| | | 150 MG/KG, 48 HR | 312 | | |
| | | 150 MG/KG, 6 HR | 314 | | |
| | | 150 MG/KG, 6 HR | 315 | | |
| | | 150 MG/KG, 18 HR | 316 | | |
| | | 150 MG/KG, 18 HR | 317 | | |
| | | 150 MG/KG, 18 HR | 318 | | |
| | | 150 MG/KG, 24 HR | 319 | | |
| | | 150 MG/KG, 24 HR | 320 | | |
| | | 150 MG/KG, 48 HR | 324 | | |
| | | 1500 MG/KG, 6 HR | 402 | | |
| | | 1500 MG/KG, 6 HR | 403 | | |
| | | 1500 MG/KG, 18 HR | 405 | | |
| | | 1500 MG/KG, 18 HR | 406 | | |
| | | 1500 MG/KG, 6 HR | 413 | | |
| | | 1500 MG/KG, 6 HR | 414 | | |
| | | 1500 MG/KG, 48 HR | 423 | | |
| | | 2000 MG/KG, 6 HR | 501 | | |
| | | 2000 MG/KG, 6 HR | 503 | | |
| | | 2000 MG/KG, 6 HR | 513 | | |
| | | 2000 MG/KG, 6 HR | 514 | | |
| | | 2000 MG/KG, 18 HR | 518 | | |
| | | 2000 MG/KG, 24 HR | 520 | | |
| | | 2000 MG/KG, 48 HR | 522 | | |
Figure 4. Gene expression components of the phenotypic prototypes. Plots of the gene expression component of the prototypes from the clusters generated by clustering the acetaminophen data with the modk-prototypes algorithm (with the levels of necrosis of the centrilobular region of the rat liver included, 100 iterations, and the average of the suggested weights of the domain data). a) All genes detected as significantly differentially expressed. b) The 82 genes significant and unique in distinguishing contrasts between the levels of necrosis of the centrilobular region of the rat liver. The red, blue and green lines denote the gene expression prototypes from Clusters 1, 2 and 3, respectively. The log10 ratio values of the genes from the prototypes are shown on the y axis and the gene indices on the x axis.
Subset of significant and unique genes that distinguish between levels of centrilobular necrosis of the rat liver.
| | | | Cluster comparison (A vs B) | |
| Feature ID | Gene | Description | A | B |
| A_42_P464546 | AI501407 | TNFAIP3 interacting protein 2 | 1 | 2 |
| A_42_P496622 | AI232716 | Similar to thioether S-methyltransferase | 1 | 2 |
| A_42_P552441 | BI303289 | Growth arrest specific 5 | 1 | 2 |
| A_42_P565917 | BF392498 | Cytochrome P450, family 2, subfamily u, polypeptide 1 | 1 | 2 |
| A_42_P681012 | NM_013055 | Mitogen activated protein kinase kinase kinase 12 | 1 | 2 |
| A_42_P684538 | NM_138827 | Solute carrier family 2 (facilitated glucose transporter), member 1 | 1 | 2 |
| A_42_P767698 | BE097112 | EBNA1 binding protein 2 | 1 | 2 |
| A_42_P786624 | NM_012693 | Cytochrome P450, subfamily 2A, polypeptide 1 | 1 | 2 |
| A_42_P788480 | BF419374 | Thyrotroph embryonic factor | 1 | 2 |
| A_43_P11472 | NM_012580 | Heme oxygenase (decycling) 1 | 1 | 2 |
| A_43_P11681 | NM_013048 | Tocopherol (alpha) transfer protein | 1 | 2 |
| A_43_P12400 | NM_024134 | DNA-damage inducible transcript 3 | 1 | 2 |
| A_43_P12595 | NM_031576 | P450 (cytochrome) oxidoreductase | 1 | 2 |
| A_43_P12996 | NM_053955 | Crystallin, mu | 1 | 2 |
| A_42_P487744 | BF396233 | Similar to 2410004L22Rik protein | 1 | 3 |
| A_42_P607568 | AI176590 | Similar to RIKEN cDNA C730048E16 | 1 | 3 |
| A_42_P634040 | AW918024 | Ngg1 interacting factor 3-like 1 (S. pombe) | 1 | 3 |
| A_42_P634187 | AW252746 | Forkhead box D4 | 1 | 3 |
| A_42_P677628 | NM_031642 | Core promoter element binding protein | 1 | 3 |
| A_42_P681533 | AI237597 | Transcribed locus, moderately similar to NP_034610.1 heat shock protein 1, alpha [Mus musculus] | 1 | 3 |
| A_43_P11142 | Y10056 | S100 calcium binding protein A11 (calgizzarin) (predicted) | 1 | 3 |
| A_43_P11477 | NM_012591 | Interferon regulatory factor 1 | 1 | 3 |
| A_43_P12806 | NM_053439 | RAN, member RAS oncogene family | 1 | 3 |
| A_43_P14163 | NM_012615 | Ornithine decarboxylase 1 | 1 | 3 |
| A_43_P14782 | AI406490 | Tyrosine kinase, non-receptor, 2 | 1 | 3 |
| A_43_P16550 | CA509226 | Splicing factor 3a, subunit 1 (predicted) | 1 | 3 |
| A_43_P19988 | CB545293 | Similar to CGI-94 protein (predicted) | 1 | 3 |
| A_42_P539275 | AA891212 | Replication factor C (activator 1) 3 | 2 | 3 |
| A_42_P660046 | BF551617 | Kinesin family member 16B (predicted) | 2 | 3 |
| A_42_P780457 | AI071307 | Ectodermal-neural cortex 1 | 2 | 3 |
| A_43_P11285 | BQ209715 | Similar to Cc2-27 (predicted) | 2 | 3 |
| A_43_P20438 | CB545761 | Small optic lobes homolog (Drosophila) (predicted) | 2 | 3 |
Figure 5. Hierarchical clustering of the biological samples. Log10 transformed gene expression ratio values of the 82 genes from the prototypes of the clusters of the biological samples were subjected to agglomerative hierarchical clustering using cosine correlation as the similarity measure and average linkage methodology. The branches of the dendrograms represent the amount of similarity between clusters of samples.
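As an illustration of the procedure named in this caption, the sketch below performs agglomerative clustering with average linkage on a cosine-based distance. It is a naive pure-Python version for small inputs, not the software used in the study; converting cosine similarity to a distance as 1 − cosine is an assumption, and the function names and toy profiles are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def average_linkage(profiles):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    Original samples have ids 0..n-1; each merge creates the next unused id.
    """
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Average linkage: mean distance over all cross-cluster pairs.
                dists = [cosine_distance(profiles[p], profiles[q])
                         for p in clusters[a] for q in clusters[b]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

# Toy log-ratio profiles (hypothetical): two pairs of similar samples.
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
merges = average_linkage(profiles)
```

Here samples 0 and 1 merge first, then 2 and 3, and the two resulting clusters join last; the merge distances correspond to branch heights in a dendrogram.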
Figure 6. Two-way hierarchical clustering of the biological samples and the extracted genes. Log10 transformed gene expression ratio values of the 82 genes from the prototypes of the clusters of the biological samples were subjected to agglomerative hierarchical clustering as detailed in Figure 5. The resulting gene expression heat map shows the genes as rows and the samples as columns, with red indicating up-regulation, green denoting down-regulation and black signifying no change. At the top of the heat map, the level of ALT (IU/L) is plotted for each sample. At the bottom of the heat map, the severity of centrilobular necrosis observed is shown for each sample (yellow, none; blue, minimal; magenta, mild; green, moderate).