Yingchun Liu, Markus Ringnér.
Abstract
BACKGROUND: A routine goal in the analysis of microarray data is to identify genes with expression levels that correlate with known classes of experiments. In a growing number of array data sets, it has been shown that there is an over-abundance of genes that discriminate between known classes as compared to expectations for random classes. Therefore, one can search for novel classes in array data by looking for partitions of experiments for which there is an over-abundance of discriminatory genes. We have previously used such an approach in a breast cancer study.
Year: 2004 PMID: 15180908 PMCID: PMC434495 DOI: 10.1186/1471-2105-5-70
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The essential algorithmic steps in the class discovery procedure. For the parameter values used in the analysis, see Table 1.
Parameters in the class discovery procedure and the values used for the SRBCT and the BRCAx data.
| Parameter | Value |
| | 3.0 |
| | 0.1 |
| η | 0.9 |
| | 10 |
| | 150 (SRBCT) or 50 (BRCAx) |
Batches of production for the 88 SRBCT microarrays.
| Batch | Category | Experiments |
| 104 | EWS | T1, T2, T3, T4, C1, C2, C3, C4 |
| 104 | BL | C1, C2, C3, C4 |
| 104 | NB | C1, C2, C3 |
| 104 | RMS | T1, T2, T3, T4, C8, C11 |
| 104 | TEST | 5, 24 |
| 118 | EWS | T6, T7, T9, T11, T12, T13, T14, T15, T19 |
| 118 | RMS | T5, T6, T7, T8, C3, C4 |
| 118 | TEST | 6, 9, 11, 20, 21 |
| 119 | EWS | C6, C7, C8, C9, C10, C11 |
| 119 | BL | C5, C6, C7, C8 |
| 119 | NB | C4, C5, C6, C7, C8, C9, C10, C11, C12 |
| 119 | RMS | C2, C5, C6, C7, C9, C10 |
| 119 | TEST | 3 |
| 143 | RMS | T11 |
| 143 | TEST | 1, 2, 4, 7, 12, 17 |
| 163 | RMS | T10 |
| 163 | TEST | 8, 10, 13, 14, 15, 16, 18, 19, 22, 23, 25 |
Batch: identifier of the batch of production. T: tumor samples; C: cell lines.
Figure 2. The number of discriminatory genes (score) as a function of the cut-off in P value. The data shown are for discovery of two classes in the SRBCT data set. The two curves are for the best partition found (light gray) and for random partitions (dark gray). For the P value cut-off 0.001, the best partition is supported by 602 genes, whereas the expectation for a random partition is 2.3 genes.
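The comparison in Figure 2 can be illustrated with a minimal sketch. It assumes a two-sided rank-sum test with a normal approximation as the per-gene test, which is a stand-in chosen for brevity and not necessarily the test used in the study. The function names, the toy data, and the number of random partitions are all hypothetical.

```python
import math
import random


def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p(values, labels):
    """Two-sided rank-sum P value (normal approximation, no tie correction)."""
    n1 = sum(labels)
    n2 = len(labels) - n1
    if n1 == 0 or n2 == 0:
        return 1.0
    ranks = midranks(values)
    r1 = sum(r for r, lab in zip(ranks, labels) if lab)
    u = r1 - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs((u - mean) / sd) / math.sqrt(2.0))


def score(data, labels, cutoff):
    """Number of genes (rows of data) with a P value below the cut-off."""
    return sum(1 for gene in data if rank_sum_p(gene, labels) < cutoff)


def random_expectation(data, labels, cutoff, n_perm, rng):
    """Mean score over random partitions with the same class sizes."""
    shuffled = list(labels)
    total = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += score(data, shuffled, cutoff)
    return total / n_perm


# Toy data (assumption): 200 genes x 20 experiments,
# with the first 30 genes shifted upward in class 1.
rng = random.Random(0)
labels = [1] * 10 + [0] * 10
data = []
for g in range(200):
    shift = 3.0 if g < 30 else 0.0
    data.append([rng.gauss(shift * lab, 1.0) for lab in labels])

observed = score(data, labels, cutoff=0.001)
expected = random_expectation(data, labels, cutoff=0.001, n_perm=20, rng=rng)
```

A partition whose `observed` score is far above `expected` is one for which discriminatory genes are over-abundant, which is the pattern the two curves in the figure compare.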
The four classes of experiments identified by the class discovery program (score = 470; P < 0.001; E = 1.4) after removing 923 genes discriminatory for batches of array production.
| Category | Class 1 | Class 2 | Class 3 | Class 4 |
| BL | 8 | 0 | 0 | 0 |
| EWS-C | 0 | 8 | 1 | 1 |
| EWS-T & RMS-T | 0 | 0 | 22 | 1 |
| NB-C & RMS-C | 0 | 2 | 2 | 18 |
T: tumor samples; C: cell lines
The four classes of experiments identified by the class discovery program (score = 353; P < 0.001; E = 1.2) after removing 1076 genes discriminatory for batches of array production or cell lines versus tumors.
| Category | Class 1 | Class 2 | Class 3 | Class 4 |
| BL | 8 | 0 | 0 | 0 |
| EWS | 0 | 12 | 10 | 1 |
| RMS | 0 | 4 | 13 | 3 |
| NB | 0 | 0 | 0 | 12 |
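One way to summarize how well the discovered classes in the two tables above line up with the listed categories is cluster purity: the fraction of experiments that belong to the most common category of their class. Purity is used here only as an illustrative summary and is not a measure reported in the source; the function and variable names are hypothetical. In the first table the categories are themselves merged groups (for example EWS-T & RMS-T), so its purity is relative to those groupings.

```python
def purity(table):
    """Fraction of experiments that fall in the majority category of their class.

    table maps a category name to a list of counts, one per discovered class.
    """
    rows = list(table.values())
    n_classes = len(rows[0])
    total = sum(sum(row) for row in rows)
    majority = sum(max(row[c] for row in rows) for c in range(n_classes))
    return majority / total


# Counts copied from the two tables above.
after_batch_removal = {
    "BL": [8, 0, 0, 0],
    "EWS-C": [0, 8, 1, 1],
    "EWS-T & RMS-T": [0, 0, 22, 1],
    "NB-C & RMS-C": [0, 2, 2, 18],
}
after_batch_and_cell_line_removal = {
    "BL": [8, 0, 0, 0],
    "EWS": [0, 12, 10, 1],
    "RMS": [0, 4, 13, 3],
    "NB": [0, 0, 0, 12],
}

purity_first = purity(after_batch_removal)                 # 56 of 63 experiments
purity_second = purity(after_batch_and_cell_line_removal)  # 45 of 63 experiments
```

Both tables sum to 63 experiments. In the second table most of the impurity comes from EWS and RMS sharing classes 2 and 3, while BL and NB are separated cleanly.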
Figure 3. Hierarchical clustering of the 63 SRBCT training experiments. The clustering was performed using the 353 genes discriminatory for the best partition found in the data set reduced for genes discriminating cell lines versus tumors or between print batches. Using the discriminatory genes found by our unsupervised method results in clusters that correspond to the disease categories. The scale shows the linear-correlation-based distance used to construct the dendrogram.
Figure 4. Hierarchical clustering of all 88 SRBCT experiments. The clustering was performed using the 353 genes discriminatory for the best partition found using our unsupervised method applied to the training data set reduced for genes discriminating cell lines versus tumors or between print batches. Using these genes, the test samples fall into clusters dominated by the correct disease category. The scale shows the linear-correlation-based distance used to construct the dendrogram.
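The kind of clustering described in the captions of Figures 3 and 4 can be sketched as follows. The sketch assumes that the correlation-based distance is 1 minus the Pearson correlation and that clusters are joined by average linkage; the captions specify neither, so both are assumptions. The function names and the toy profiles are hypothetical.

```python
import math


def correlation_distance(x, y):
    """Return 1 - Pearson r (assumed form of the correlation-based distance)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def average_linkage(profiles):
    """Naive agglomerative clustering.

    Returns the merges in order as (members_a, members_b, distance).
    """
    n = len(profiles)
    dist = {
        (i, j): correlation_distance(profiles[i], profiles[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

    def between(a, b):
        # Average pairwise distance between two clusters.
        pairs = [dist[(min(i, j), max(i, j))] for i in a for j in b]
        return sum(pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        d, ia, ib = min(
            (between(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        merges.append((clusters[ia], clusters[ib], d))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return merges


# Toy expression profiles (made-up numbers): four experiments over five genes.
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.1, 2.9, 4.2, 4.8],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [4.9, 4.2, 3.1, 1.8, 1.2],
]
merges = average_linkage(profiles)
```

Each entry in `merges` corresponds to one join in a dendrogram, with the distance giving the height of the join. For real data sets a library implementation would be preferable to this quadratic-memory sketch.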