Marcel Dettling, Peter Bühlmann.
Abstract
BACKGROUND: We focus on microarray data where experiments monitor gene expression in different tissues and where each experiment is equipped with an additional response variable such as a cancer type. Although the number of measured genes is in the thousands, it is assumed that only a few marker components of gene subsets determine the type of a tissue. Here we present a new method for finding such groups of genes by directly incorporating the response variables into the grouping process, yielding a supervised clustering algorithm for genes.
Year: 2002 PMID: 12537558 PMCID: PMC151171 DOI: 10.1186/gb-2002-3-12-research0069
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Lymphoma data. Average cluster expression shaped for the separation of response class 1 (FL) versus response classes 0 and 2 (DLBCL and CLL) on the x-axis, and formed for discrimination of class 2 versus classes 0 and 1 on the y-axis.
Margin statistics
| Margins | m(0) | max m* | med m* | min m* |
| Leukemia | 0.20 | 0.05 | -0.01 | -2.41 |
| Breast cancer | 1.29 | 0.23 | 0.04 | -0.82 |
| Prostate | 0.05 | 0.02 | -0.04 | -0.90 |
| Colon | 0.08 | 0.05 | -0.12 | -1.39 |
| SRBCT | 1.00 | 0.11 | -0.06 | -1.16 |
| Lymphoma | 1.65 | 0.14 | 0.01 | -1.16 |
| Brain | 1.03 | 0.32 | 0.09 | -0.29 |
| NCI | 2.52 | 0.44 | 0.12 | -0.91 |
Margins m(0) from the original datasets, together with the maximal, median and minimal margins m* from 1,000 permuted replicates, for the leukemia data (AML/ALL distinction), breast cancer data (ER-positive/ER-negative distinction), prostate data (tumor/normal distinction), colon data (tumor/normal distinction), SRBCT data (distinction of the Ewing family of tumors versus three other tumor types), lymphoma data (distinction of DLBCL versus FL and CLL), brain tumor data (separation of atypical teratoid/rhabdoid tumors (AT/RTs) from four other tumor types) and NCI data (distinction of leukemia from seven other cancers).
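The comparison of an original margin with margins from permuted responses can be sketched as follows. This is a minimal illustration, not the authors' implementation: the margin definition (smallest class-1 value minus largest class-0 value of a one-dimensional cluster expression) is an assumption, because m is not defined in this section, and the function names and toy numbers are hypothetical. A full replicate would presumably re-run the clustering on each permuted response; the sketch keeps the expression values fixed for brevity.

```python
import random
import statistics

def margin(values, labels):
    """Smallest class-1 value minus largest class-0 value.

    ASSUMPTION: this gap-based definition is not stated in the text above.
    A positive margin means the two classes are perfectly separated.
    """
    class1 = [v for v, y in zip(values, labels) if y == 1]
    class0 = [v for v, y in zip(values, labels) if y == 0]
    return min(class1) - max(class0)

def permuted_margins(values, labels, n_perm=1000, seed=0):
    """Margins obtained after randomly shuffling the response labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    result = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        result.append(margin(values, shuffled))
    return result

# Toy cluster expression for six tissues (hypothetical numbers).
values = [0.1, 0.2, 0.3, 1.0, 1.2, 1.5]
labels = [0, 0, 0, 1, 1, 1]

m0 = margin(values, labels)
m_star = permuted_margins(values, labels)
# Same layout as one table row: m(0), max m*, med m*, min m*.
summary = (m0, max(m_star), statistics.median(m_star), min(m_star))
```

An original margin that clearly exceeds the largest permuted margin, as in most rows of the table, suggests that the separation is unlikely to be a chance finding.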
Scores
| Scores | s(0) | min s* | max s* | Proportion of s* = 0 |
| Leukemia | 0 | 0 | 279 | 0.41 |
| Breast Cancer | 0 | 0 | 43 | 0.91 |
| Prostate | 0 | 0 | 566 | 0.17 |
| Colon | 0 | 0 | 164 | 0.11 |
| SRBCT | 0 | 0 | 148 | 0.26 |
| Lymphoma | 0 | 0 | 78 | 0.67 |
| Brain | 0 | 0 | 11 | 0.98 |
| NCI | 0 | 0 | 13 | 0.95 |
Scores s(0) from the original datasets, minimal and maximal scores s* from L = 1,000 permuted replicates, and the proportion of shuffled bootstrap trials in which score 0 was achieved. The selection of data was as in Table 1.
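The last column of this table, the proportion of permuted trials that still reach score 0, can be illustrated with a short sketch. The score used here is a Wilcoxon-type count of misordered pairs; that definition is an assumption, since s is not defined in this section, and all names and numbers are hypothetical. As above, the expression values are kept fixed instead of re-clustering for each permuted response.

```python
import random

def score(values, labels):
    """Number of (class 0, class 1) pairs in which the class-1 value is smaller.

    ASSUMPTION: this Wilcoxon-type definition is not stated in the text above.
    A score of 0 means that every class-1 value exceeds every class-0 value.
    """
    class1 = [v for v, y in zip(values, labels) if y == 1]
    class0 = [v for v, y in zip(values, labels) if y == 0]
    return sum(1 for a in class0 for b in class1 if b < a)

def proportion_zero_score(values, labels, n_perm=1000, seed=0):
    """Fraction of label permutations for which the score is still 0."""
    rng = random.Random(seed)
    shuffled = list(labels)
    zero = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(values, shuffled) == 0:
            zero += 1
    return zero / n_perm

# Toy cluster expression for six tissues (hypothetical numbers).
values = [0.1, 0.2, 0.3, 1.0, 1.2, 1.5]
labels = [0, 0, 0, 1, 1, 1]

s0 = score(values, labels)
prop_zero = proportion_zero_score(values, labels)
```

A high proportion, as reported for the brain and NCI data, indicates that score 0 alone says little about the quality of a cluster, which is why the margin is reported as well.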
Figure 2. Histograms showing the empirical distribution of scores (left) and margins (right) for the leukemia dataset (AML/ALL distinction), based on 1,000 bootstrap replicates with permuted response variables. The dashed vertical lines mark the values of score and margin with the original response variables.
Misclassification rates based on leave-one-out cross validation
| Leukemia | |||||||
| Nearest neighbor | 5.56% | 5.56% | 4.17% | 2.78% | 2.78% | 2.78% | 2.78% |
| Aggregated trees | 5.56% | 5.56% | 1.39% | 1.39% | 2.78% | 2.78% | 2.78% |
| Breast | |||||||
| Nearest neighbor | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| Aggregated trees | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| Prostate | |||||||
| Nearest neighbor | 13.73% | 7.84% | 4.90% | 6.86% | 4.90% | 4.90% | 5.88% |
| Aggregated trees | 13.73% | 13.73% | 6.86% | 8.82% | 6.86% | 5.88% | 5.88% |
| Colon | |||||||
| Nearest neighbor | 27.42% | 22.58% | 22.58% | 19.35% | 16.13% | 17.74% | 19.35% |
| Aggregated trees | 27.42% | 29.03% | 19.35% | 19.35% | 16.13% | 17.74% | 17.74% |
| SRBCT | |||||||
| Nearest neighbor | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 1.59% |
| Aggregated trees | 3.17% | 0.00% | 0.00% | 0.00% | 1.59% | 1.59% | 1.59% |
| Lymphoma | |||||||
| Nearest neighbor | 3.23% | 1.61% | 1.61% | 1.61% | 0.00% | 0.00% | 0.00% |
| Aggregated trees | 3.23% | 1.61% | 1.61% | 1.61% | 0.00% | 0.00% | 0.00% |
| Brain | |||||||
| Nearest neighbor | 30.95% | 23.81% | 19.05% | 16.67% | 19.05% | 16.67% | 16.67% |
| Aggregated trees | 42.86% | 23.81% | 21.43% | 19.05% | 14.29% | 11.90% | 11.90% |
| NCI | |||||||
| Nearest neighbor | 40.98% | 40.98% | 36.07% | 29.51% | 24.59% | 27.87% | 26.23% |
| Aggregated trees | 49.18% | 47.54% | 39.34% | 29.51% | 21.31% | 21.31% | 19.67% |
Misclassification rates for out-of-sample classification with q gene clusters as features, based on leave-one-out cross-validation.
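Leave-one-out cross-validation with a nearest-neighbor rule can be sketched as below. This is an illustration under stated assumptions: the feature vectors stand for average cluster expressions and are taken as given, whereas an honest out-of-sample estimate would require forming the gene clusters again inside every fold. Function names and data are hypothetical.

```python
import math

def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training point closest to x in Euclidean distance."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]

def loocv_error(features, labels):
    """Fraction of samples misclassified when each one is held out in turn."""
    errors = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, features[i]) != labels[i]:
            errors += 1
    return errors / len(features)

# Hypothetical data: six tissues described by q = 2 cluster averages.
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
            [1.0, 1.1], [1.1, 0.9], [1.2, 1.0]]
labels = [0, 0, 0, 1, 1, 1]

error_rate = loocv_error(features, labels)
```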
Misclassification rates based on random splitting
| Leukemia | |||||||
| Nearest neighbor | 6.58% | 4.62% | 4.21% | 3.75% | 3.33% | 3.38% | 3.25% |
| Aggregated trees | 6.58% | 6.12% | 3.71% | 3.54% | 2.79% | 2.71% | 2.62% |
| Breast | |||||||
| Nearest neighbor | 1.00% | 0.75% | 0.75% | 1.00% | 0.83% | 1.00% | 1.00% |
| Aggregated trees | 1.00% | 1.58% | 1.67% | 2.33% | 2.58% | 2.42% | 3.00% |
| Prostate | |||||||
| Nearest neighbor | 14.47% | 11.68% | 9.62% | 7.97% | 7.26% | 6.94% | 6.91% |
| Aggregated trees | 14.47% | 16.47% | 10.32% | 8.79% | 8.12% | 8.00% | 7.79% |
| Colon | |||||||
| Nearest neighbor | 23.35% | 20.35% | 19.10% | 16.95% | 16.45% | 16.05% | 15.95% |
| Aggregated trees | 23.35% | 21.80% | 19.70% | 18.10% | 16.95% | 16.20% | 16.45% |
| SRBCT | |||||||
| Nearest neighbor | 1.33% | 0.48% | 0.43% | 0.48% | 0.76% | 0.95% | 1.05% |
| Aggregated trees | 5.76% | 0.95% | 0.71% | 1.10% | 1.76% | 1.90% | 2.14% |
| Lymphoma | |||||||
| Nearest neighbor | 2.15% | 2.20% | 1.50% | 0.85% | 0.65% | 0.50% | 0.50% |
| Aggregated trees | 3.45% | 2.45% | 1.40% | 0.80% | 0.25% | 0.20% | 0.30% |
| Brain | |||||||
| Nearest neighbor | 31.21% | 27.50% | 26.36% | 24.71% | 23.86% | 23.71% | 23.36% |
| Aggregated trees | 35.43% | 28.43% | 24.43% | 22.14% | 19.64% | 18.29% | 16.86% |
| NCI | |||||||
| Nearest neighbor | 45.25% | 40.25% | 37.90% | 34.80% | 32.10% | 30.50% | 29.65% |
| Aggregated trees | 51.85% | 42.35% | 38.05% | 34.05% | 29.30% | 27.75% | 26.50% |
Misclassification rates for out-of-sample classification with q gene clusters as features, based on N = 100 random divisions into learning set (two thirds of the data) and test set (one third of the data).
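The random-splitting scheme in this caption (N = 100 divisions, two thirds for learning, one third for testing) can be sketched as follows. The classifier is passed in as a function; the one-nearest-neighbor rule shown is only a stand-in, and all names and data are hypothetical. As with the cross-validation above, the features are assumed to be precomputed, although the clusters would need to be formed on each learning set alone.

```python
import math
import random

def one_nn(learn_x, learn_y, x):
    """Stand-in classifier: label of the closest learning sample."""
    best = min(range(len(learn_x)), key=lambda i: math.dist(learn_x[i], x))
    return learn_y[best]

def random_split_error(features, labels, classify, n_splits=100, seed=0):
    """Mean test error over random 2/3 learning and 1/3 test divisions."""
    n = len(features)
    n_test = n // 3
    if n_test == 0:
        raise ValueError("need at least three samples")
    rng = random.Random(seed)
    rates = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, learn = idx[:n_test], idx[n_test:]
        learn_x = [features[i] for i in learn]
        learn_y = [labels[i] for i in learn]
        wrong = sum(classify(learn_x, learn_y, features[i]) != labels[i]
                    for i in test)
        rates.append(wrong / n_test)
    return sum(rates) / len(rates)

# Hypothetical, well-separated data: twelve tissues, one cluster average each.
features = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5],
            [5.0], [5.1], [5.2], [5.3], [5.4], [5.5]]
labels = [0] * 6 + [1] * 6

mean_error = random_split_error(features, labels, one_nn)
```

Averaging over many divisions gives smoother estimates than leave-one-out, which is consistent with the non-integer multiples of 1/n that appear in this table.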
Benchmark misclassification rates
| Leukemia | |||||||
| Nearest neighbor | 6.33% | 4.79% | 4.50% | 4.08% | 3.67% | 3.75% | 3.79% |
| Aggregated trees | 8.50% | 6.04% | 4.54% | 3.92% | 4.83% | 6.79% | 8.46% |
| Breast | |||||||
| Nearest neighbor | 1.08% | 0.83% | 0.92% | 1.17% | 1.33% | 1.50% | 1.58% |
| Aggregated trees | 5.42% | 2.50% | 1.83% | 2.42% | 4.17% | 5.42% | 8.33% |
| Prostate | |||||||
| Nearest neighbor | 13.24% | 10.68% | 9.15% | 8.44% | 7.76% | 8.18% | 7.85% |
| Aggregated trees | 25.47% | 21.29% | 18.56% | 17.44% | 16.65% | 17.65% | 18.94% |
| Colon | |||||||
| Nearest neighbor | 23.40% | 21.95% | 20.15% | 18.90% | 16.65% | 16.25% | 15.70% |
| Aggregated trees | 30.95% | 29.70% | 30.20% | 31.20% | 33.55% | 34.15% | 34.90% |
| SRBCT | |||||||
| Nearest neighbor | 1.76% | 0.86% | 0.81% | 1.05% | 1.19% | 1.43% | 1.48% |
| Aggregated trees | 4.38% | 2.00% | 2.62% | 3.95% | 6.48% | 6.95% | 8.43% |
| Lymphoma | |||||||
| Nearest neighbor | 2.43% | 2.29% | 1.76% | 1.05% | 0.81% | 0.81% | 0.86% |
| Aggregated trees | 4.38% | 2.81% | 2.10% | 1.00% | 0.81% | 1.05% | 1.24% |
| Brain | |||||||
| Nearest neighbor | 30.79% | 29.07% | 29.50% | 27.57% | 28.50% | 28.00% | 27.50% |
| Aggregated trees | 40.14% | 35.29% | 34.64% | 33.50% | 34.36% | 34.79% | 35.29% |
| NCI | |||||||
| Nearest neighbor | 39.63% | 34.89% | 32.84% | 31.95% | 30.68% | 29.74% | 28.95% |
| Aggregated trees | 56.58% | 49.53% | 44.84% | 42.42% | 39.21% | 39.05% | 37.79% |
Benchmark misclassification rates for out-of-sample classification with the very same but non-averaged genes from q clusters as features, based on N = 100 random divisions into learning set (two thirds of the data) and test set (one third of the data).
Classification of the breast cancer validation sample
| Tumor | 14 | 31 | 33 | 44 | 45 | 46 | 47 | 48 | 49 |
| Status | Neg? | Neg? | Neg? | Neg | Pos? | Pos? | Pos | Pos | Neg |
| Prediction | Neg | Neg | Neg | Neg | Pos | Pos | Pos | Pos | Neg |
The sample is classified with q = 3 cluster expression profiles based on the training sample with 38 tumors as features and aggregated trees as predictor. The status of the tumors is according to the information provided on the Proc Natl Acad Sci USA website [32]. The question mark means that two clinical tests yielded conflicting results. Displayed here is the outcome of the immunoblot assay method.
Comparison against the literature
| | Leukemia | Breast | Prostate | Colon | SRBCT | Lymphoma | Brain | NCI* |
| Supervised clustering | 1.39% | 0.00% | 4.90% | 16.13% | 0.00% | 0.00% | 11.90% | 26.50% |
| Literature | 1.39% | 5.26% | 9.80% | 9.68% | 0.00% | ? | 16.67% | ≅ 35% |
Best leave-one-out cross-validation error rates from our supervised clustering procedure compared with the best reported results from the literature, where directly comparable; references are given in the main text. *The mean error rate on the NCI data is based on random divisions into training and test set, and is compared against the median error rate obtained under the same framework in [16].
Cluster size
| Cluster size | Mean | SD | Min | Max |
| Leukemia | 5.855 | 2.910 | 1 | 23 |
| Breast cancer | 4.344 | 2.062 | 1 | 13 |
| Prostate | 6.327 | 2.373 | 2 | 17 |
| Colon | 6.642 | 2.733 | 2 | 20 |
| SRBCT | 4.739 | 1.816 | 1 | 14 |
| Lymphoma | 5.485 | 2.679 | 1 | 16 |
| Brain | 6.094 | 2.751 | 1 | 19 |
| NCI | 6.174 | 2.930 | 1 | 20 |
Variability in the size of clusters shaped with the supervised algorithm, based on 1,000 bootstrap replicates. Leukemia stands for the distinction between AML and ALL; in the breast cancer data, the separation by ER status was analyzed; prostate and colon stand for discrimination of normal versus tumorous tissue; in the SRBCT dataset, the Ewing family of tumors was separated from three other phenotypes; for the lymphoma dataset, discrimination of DLBCL from FL and CLL was considered; in the brain tumor dataset, AT/RTs were discriminated from four further malignancies; and in the NCI dataset, leukemia was separated from seven other cancers. The figures presented for the four multiclass datasets are representative of all their binary distinctions of one tumor type against all others.
Number and proportion of genes used in the various clusters
| Active genes | Number (ever used) | Proportion (ever used) | Number (> 50 trials) | Proportion (> 50 trials) |
| Leukemia | 624 | 17.474% | 18 | 0.504% |
| Breast cancer | 128 | 1.803% | 9 | 0.130% |
| Prostate | 949 | 15.730% | 16 | 0.265% |
| Colon | 1028 | 51.400% | 12 | 0.600% |
| SRBCT | 68 | 2.946% | 11 | 0.477% |
| Lymphoma | 279 | 6.930% | 19 | 0.472% |
| Brain | 345 | 6.164% | 21 | 0.375% |
| NCI | 227 | 4.329% | 23 | 0.439% |
Number and proportion of genes that were ever used in the first cluster (first two columns), as well as number and proportion of genes that were used for clustering in more than 50 of the 1,000 bootstrap trials (last two columns). The selection of data is identical to Table 8.
Most frequently clustered genes in DLBC lymphoma discrimination
| Numbers | ||||
| | Gene 3786 | Gene 3804 | Gene 761 | Gene 780 |
| Gene 3763 | 184 (301) | 68 (220) | 144 (155) | 173 (133) |
| Gene 3786 | | 289 (187) | 153 (132) | 72 (113) |
| Gene 3804 | | | 136 (96) | 60 (83) |
| Gene 761 | | | | 40 (58) |
| P-values | ||||
| | Gene 3786 | Gene 3804 | Gene 761 | Gene 780 |
| Gene 3763 | (-) 0.000 | (-) 0.000 | (-) 0.359 | (+) 0.001 |
| Gene 3786 | | (+) 0.000 | (+) 0.055 | (-) 0.000 |
| Gene 3804 | | | (+) 0.000 | (-) 0.007 |
| Gene 761 | | | | (-) 0.015 |
The top part of the table gives the numbers of observed and (in parentheses) expected (under the hypothesis of independence) gene pairs of the five most frequently clustered genes in the discrimination of DLBC lymphoma from the other two phenotypes, based on 1,000 bootstrap replicates. In the bottom part of the table, p-values for attraction (+) and repulsion (-) of gene pairs from two-sided binomial tests that compare the joint probability against the product of the marginals are shown.
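The attraction and repulsion test described in this caption can be sketched with an exact binomial calculation. Under independence, the expected number of joint occurrences is the number of trials times the product of the two marginal frequencies; the sketch takes that expected count directly from the table. The two-sided convention used here (doubling the smaller tail) is an assumption, as the caption does not say which one was applied, so the results may differ slightly from the tabulated p-values. Function names are hypothetical.

```python
import math

def binom_log_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def pair_test(observed, expected, n_trials=1000):
    """Sign (+ attraction, - repulsion) and two-sided binomial p-value.

    `expected` is the joint count under independence, that is
    n_trials * P(gene A clustered) * P(gene B clustered).
    ASSUMPTION: two-sided p-value = twice the smaller tail, capped at 1.
    """
    p0 = expected / n_trials
    pmf = [math.exp(binom_log_pmf(k, n_trials, p0))
           for k in range(n_trials + 1)]
    lower = sum(pmf[:observed + 1])   # P(X <= observed)
    upper = sum(pmf[observed:])       # P(X >= observed)
    p_value = min(1.0, 2.0 * min(lower, upper))
    sign = "+" if observed > expected else "-"
    return sign, p_value

# Observed and expected counts taken from the top part of the table.
sign_a, p_a = pair_test(184, 301)   # Gene 3763 with Gene 3786
sign_b, p_b = pair_test(144, 155)   # Gene 3763 with Gene 761
sign_c, p_c = pair_test(289, 187)   # Gene 3786 with Gene 3804
```

With these inputs the first pair shows strong repulsion, the second no clear dependence and the third strong attraction, in line with the signs reported in the bottom part of the table.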
Functional description of the most frequently clustered genes in DLBC lymphoma discrimination
| Sign | Gene | Clone | Function |
| - | 3763 | 769861 | CD63 antigen (melanoma 1 antigen) |
| - | 3786 | 345538 | Cathepsin L |
| - | 3804 | 343867 | Allograft-inflammatory factor-1 or interferon gamma induced macrophage protein or ionized calcium binding adaptor molecule 1 |
| + | 761 | 1341294 | Unknown |
| + | 780 | 1334411 | Unknown UG Hs.32553 ESTs |
Clone numbers and function description of the five genes that have been clustered most frequently in the discrimination of DLBC lymphoma from the other two phenotypes in the lymphoma dataset.