Ka Yee Yeung, Roger E Bumgarner.
Abstract
Prediction of the diagnostic category of a tissue sample from its gene-expression profile and selection of relevant genes for class prediction have important applications in cancer research. We have developed the uncorrelated shrunken centroid (USC) and error-weighted, uncorrelated shrunken centroid (EWUSC) algorithms that are applicable to microarray data with any number of classes. We show that removing highly correlated genes typically improves classification results using a small set of genes.
Year: 2003 PMID: 14659020 PMCID: PMC329422 DOI: 10.1186/gb-2003-4-12-r83
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Tumor types and class sizes of the NCI 60 dataset
| Origin of cell lines | Class size (total 61 samples) |
| Breast | 9 |
| Central nervous system | 5 |
| Colon | 7 |
| Leukaemia | 8 |
| Melanoma | 8 |
| Non-small-cell-lung-carcinoma | 9 |
| Ovarian | 6 |
| Renal | 9 |
Tumor types and class sizes of the original full data with a total of 61 experiments.
Tumor types and class sizes of the randomly partitioned training and test sets of the NCI 60 dataset
| Origin of cell lines | Training set (total 43) | Test set (total 18) |
| Breast | 6 | 3 |
| Central nervous system | 4 | 1 |
| Colon | 5 | 2 |
| Leukaemia | 6 | 2 |
| Melanoma | 6 | 2 |
| Non-small-cell-lung-carcinoma | 6 | 3 |
| Ovarian | 4 | 2 |
| Renal | 6 | 3 |
As no additional test set is available for the NCI 60 data, we randomly divided each class of these 61 samples into roughly three parts and reserved one third of the samples as a test set.
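The per-class partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the fixed seed and the use of floor division for "roughly one third" are assumptions. Floor division does reproduce the training and test sizes listed in the table.

```python
import random

def stratified_split(labels, parts=3, seed=0):
    """Reserve roughly 1/parts of each class as a test set (assumed rounding: floor)."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = len(shuffled) // parts  # e.g. 9 -> 3, 8 -> 2, 5 -> 1
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)

# Class sizes taken from the NCI 60 table (61 samples in total).
class_sizes = {
    "Breast": 9, "Central nervous system": 5, "Colon": 7, "Leukaemia": 8,
    "Melanoma": 8, "Non-small-cell-lung-carcinoma": 9, "Ovarian": 6, "Renal": 9,
}
labels = [name for name, size in class_sizes.items() for _ in range(size)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Repeating the call with different seeds gives the multiple random partitions over which typical results are reported.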
Tumor types and class sizes for the training set and test set of the subset of multiple tumor data used in this study
| Tumor type | Training set (total 96) | Test set (total 27) |
| Breast | 7 | 0 |
| Lung | 4 | 2 |
| Colorectal | 7 | 3 |
| Lymphoma | 14 | 5 |
| Melanoma | 5 | 0 |
| Uterus | 7 | 2 |
| Leukemia | 23 | 6 |
| Renal | 5 | 3 |
| Pancreas | 7 | 0 |
| Mesothelioma | 8 | 3 |
| CNS | 9 | 3 |
Prognosis groups and class sizes of the training set and test set of the breast cancer data
| Prognosis group | Training set (total 78) | Test set (total 19) |
| Good (> 5 years of survival time) | 44 | 7 |
| Poor (≤5 years of survival time) | 34 | 12 |
Figure 1. Comparison of prediction accuracy of USC and SC on the NCI 60 data. The percentage of prediction accuracy is plotted against the number of relevant genes using the USC algorithm at ρ0 = 0.6 and the SC algorithm (USC at ρ0 = 1.0). The horizontal axis is shown on a log scale. Because no independent test set is available for these data, we randomly divided the samples in each class into roughly three parts multiple times, such that a third of the samples are reserved as a test set. Thus the training set consists of 43 samples and the test set of 18 samples. The graph represents typical results over these multiple random runs.
Figure 2. Prediction accuracy on the multiple tumor data using the EWUSC algorithm over the range of Δ from 0 to 20. The percentage of classification errors is plotted against Δ on (a) the full training set (96 samples) and (c) the test set (27 samples). In (b) the average percentage of errors is plotted against Δ on the cross-validation data over five random runs of fourfold cross-validation. In (d), the number of relevant genes is plotted against Δ. Different colors are used to specify different correlation thresholds (ρ0 = 0.6, 0.7, 0.8, 0.9 or 1). Results for ρ0 < 0.6 are shown in Figure S1 at [30]. Optimal parameters are inferred from the cross-validation data in (b).
Figure 3. Comparison of feature stability of EWUSC, USC and SC on the multiple tumor data. The average Jaccard index is plotted against the number of relevant genes over five random runs of fourfold cross-validation using EWUSC and USC at ρ0 = 0.8 and SC. A high average Jaccard index indicates high feature stability. The EWUSC algorithm selects the most stable features. Note that the horizontal axis is shown on a log scale.
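The stability measure in this caption can be illustrated with a short sketch. The Jaccard index of two gene sets is the size of their intersection divided by the size of their union. How the index is averaged over folds and runs is not stated here, so averaging over all pairs of selected gene sets is an assumption, and the gene names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two gene sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def average_pairwise_jaccard(gene_sets):
    """Mean Jaccard index over all pairs of gene sets (assumed averaging scheme)."""
    pairs = list(combinations(gene_sets, 2))
    if not pairs:
        raise ValueError("need at least two gene sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical gene sets selected in three cross-validation folds.
fold_selections = [
    {"g1", "g2", "g3"},
    {"g2", "g3", "g4"},
    {"g1", "g2", "g3"},
]
stability = average_pairwise_jaccard(fold_selections)
print(round(stability, 3))
```

A value near 1 means nearly the same genes are selected in every fold; a value near 0 means the selections barely overlap.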
Figure 4. Comparison of prediction accuracy of EWUSC, USC, SVM and SC algorithms on the multiple tumor data. The horizontal axis shows the total number of distinct genes selected over all binary SVM classifiers on a log scale. Some results are not available on the full range of the total number of genes. For example, the maximum numbers of selected genes for EWUSC and USC are roughly 1,000. The reported prediction accuracy is 78% [10] using all 16,000 available genes on the full data. The EWUSC algorithm achieves 89% prediction accuracy with only 89 genes. With 680 genes, EWUSC produces 93% prediction accuracy.
Figure 5. Comparison of prediction accuracy of EWUSC, USC and SC on the breast cancer data. The percentage of prediction accuracy is plotted against the number of relevant genes using the EWUSC algorithm at ρ0 = 0.7, the USC algorithm at ρ0 = 0.6 and the SC algorithm (USC at ρ0 = 1.0). Note that the horizontal axis is shown on a log scale.
Figure 6. Comparison of feature stability of EWUSC, USC and SC on the breast cancer data. The average Jaccard index is plotted against the number of relevant genes over five random runs of 10-fold cross-validation using the EWUSC algorithm at ρ0 = 0.7, the USC algorithm at ρ0 = 0.6 and the SC algorithm (USC at ρ0 = 1). The EWUSC algorithm produces relatively more stable features when the number of relevant genes is small.
Comparison of classification accuracy results from EWUSC, USC and SC on synthetic datasets at optimal parameters
| α | Number of measurements | λ | EWUSC | USC | SC | ||
| 0.1 | 4 | Low | 100% | 100% | 100% | Average % CV prediction accuracy | |
| 100% | 100% | 100% | % prediction accuracy | ||||
| 24 | 72 | Number of genes | |||||
| (18, 0.8) | (17, 0.7) | (17.5, 1) | (Δ, ρ) | ||||
| 0.1 | 4 | High | 100% | 100% | 100% | Average % CV prediction accuracy | |
| 100% | 100% | 100% | % prediction accuracy | ||||
| 16 | 22 | Number of genes | |||||
| (12.5, 0.9) | (12.5, 0.9) | (12.5, 1) | (Δ, ρ) | ||||
| 1 | 4 | Low | 100% | 100% | 100% | Average % CV prediction accuracy | |
| 100% | 100% | 100% | % prediction accuracy | ||||
| 144 | 124 | Number of genes | |||||
| (2.8, 0.5) | (3.1, 0.6) | (3.1, 1) | (Δ, ρ) | ||||
| 1 | 4 | High | 100% | 100% | 100% | Average % CV prediction accuracy | |
| 100% | 100% | 100% | % prediction accuracy | ||||
| 120 | 122 | Number of genes | |||||
| (1.9, 0.5) | (2.6, 0.6) | (2.6, 1) | (Δ, ρ) | ||||
| 2 | 4 | Low | 96.8% | 98.8% | Average % CV prediction accuracy | ||
| 97.5% | % prediction accuracy | ||||||
| 326 | 326 | Number of genes | |||||
| (1.1, 0.5) | (1, 0.4) | (1.2, 1) | (Δ, ρ) | ||||
| 2 | 4 | High | 93.3% | 98.8% | Average % CV prediction accuracy | ||
| 92.5% | % prediction accuracy | ||||||
| 186 | Number of genes | ||||||
| (1, 0.7) | (1.5, 0.5) | (1.5, 1) | (Δ, ρ) | ||||
| 2 | 1 | Low | NA | 99.5% | 99.5% | Average % CV prediction accuracy | |
| NA | 100.0% | 100.0% | % prediction accuracy | ||||
| NA | 304 | Number of genes | |||||
| NA | (1.2, 0.5) | (1.2, 1) | (Δ, ρ) | ||||
| 2 | 1 | High | NA | 95.5% | Average % CV prediction accuracy | ||
| NA | 92.5% | 92.5% | % prediction accuracy | ||||
| NA | 282 | Number of genes | |||||
| NA | (1.2, 0.5) | (1.2, 1) | (Δ, ρ) | ||||
| 2 | 8 | Low | 99.8% | Average % CV prediction accuracy | |||
| 100.0% | 100.0% | 100.0% | % prediction accuracy | ||||
| 246 | 221 | Number of genes | |||||
| (1.3, 0.5) | (1.4, 0.5) | (1.4, 1) | (Δ, ρ) | ||||
| 2 | 8 | High | 98.3% | Average % CV prediction accuracy | |||
| 97.5% | % prediction accuracy | ||||||
| 242 | 245 | Number of genes | |||||
| (1, 0.4) | (1.3, 0.5) | (1.3, 1) | (Δ, ρ) | ||||
| 2 | 20 | Low | 99.8% | Average % CV prediction accuracy | |||
| 100.0% | 100.0% | 100.0% | % prediction accuracy | ||||
| 296 | 325 | Number of genes | |||||
| (1.3, 0.5) | (1.2, 0.6) | (1.2, 1) | (Δ, ρ) | ||||
| 2 | 20 | High | 99.8% | Average % CV prediction accuracy | |||
| 100.0% | 100.0% | 100.0% | % prediction accuracy | ||||
| 252 | 252 | Number of genes | |||||
| (0.9, 0.6) | (1.3, 0.5) | (1.3, 1) | (Δ, ρ) |
Synthetic datasets were generated at different levels of biological noise (α) and technical noise (λ). The average percentage of cross-validation (% CV) accuracy, the percentage of prediction accuracy on the test set, and the number of relevant genes at the optimal parameters (Δ, ρ0) are shown. For each synthetic dataset, the algorithm with the maximum percentage of average cross-validation accuracy, maximum prediction accuracy, or the minimum number of relevant genes is shown in bold.
(a) Typical classification accuracy results using synthetic datasets with four repeated measurements at different biological noise levels (α = 0.1, 1 or 2) and different technical noise levels (λ = 1, 5 or 10). When the biological noise level is low (α = 0.1), EWUSC consistently achieves the same prediction accuracy using fewer relevant genes at various technical noise levels. However, at medium biological noise level (α = 1), EWUSC typically outperforms USC and SC at high technical noise level but not at low technical noise level. When the biological noise level is high (α = 2), EWUSC is often not the method of choice.
(b) Typical classification accuracy results using synthetic datasets at high biological noise level (α = 2) with 1, 8, or 20 repeated measurements at different technical noise levels. When there is no repeated measurement (the number of repeated measurements = 1), there are no variability estimates over repeated measurements and hence EWUSC is reduced to USC. The results with four repeated measurements at α = 2 are shown in (a). Our results over multiple synthetic datasets showed that, at high biological noise (α = 2), EWUSC outperforms USC only with a large number of repeated measurements (20). We also showed that USC typically outperforms SC by choosing a smaller number of relevant genes in most scenarios (over different biological and technical noise levels, and different numbers of repeated measurements).
Summary of prediction accuracy results
| Data | Parameters | EWUSC | USC | SC | Published results |
| NCI 60 data* | ρ0 | NA | 0.6 | 1.0 | NA |
| Δ | NA | 1.0 | 1.0 | NA | |
| Number of relevant genes | NA | 3998 | 200 | ||
| Prediction accuracy | NA | 72% | 72% | ~40-60% [ | |
| Multiple tumor data (estimated optimal parameters)† | ρ0 | 0.8 | 0.8 | 1.0 | NA |
| Δ | 5.6 | 5.6 | 8.8 | NA | |
| Number of relevant genes | 735 | 3902 | All genes | ||
| Prediction accuracy | 85% | 78% | 78% [ | ||
| Multiple tumor data (global optimal parameters)‡ | ρ0 | 0.9 | 0.9 | 1.0 | NA |
| Δ | 0 | 0 | 0.4 | NA | |
| Number of relevant genes | 1634 | 7129 | All genes | ||
| Prediction accuracy | 74% | 74% | |||
| Breast cancer data | ρ0 | 0.7 | 0.6 | 1.0 | NA |
| Δ | 0.80 | 1.15 | 1.1 | NA | |
| Number of relevant genes | 271 | 82 | 187 | ||
| Prediction accuracy | 79% | 84% |
The optimal parameters (ρ0 and Δ), number of relevant genes chosen, and prediction accuracy for the NCI 60 data, multiple tumor data and breast cancer data are summarized here. Both EWUSC (error-weighted, uncorrelated shrunken centroid) and USC (uncorrelated shrunken centroid) were motivated by SC (shrunken centroid) [17]. Both EWUSC and USC take advantage of interdependence between genes by removing highly correlated relevant genes. EWUSC makes use of error estimates or variability over repeated measurements. SC [17] is equivalent to USC at ρ0 = 1. The optimal parameters (Δ, ρ0) for EWUSC are estimated from the cross-validation results of EWUSC, while the optimal parameters (Δ, ρ0) for USC are independently estimated from the cross-validation results of USC. Entries with the minimum number of selected genes or highest prediction accuracy across all methods are highlighted in boldface type. *Since no repeated measurements or error estimates are available, EWUSC is not applicable to the NCI 60 data. In addition, because there is no separate test set available for the NCI 60 data, typical results of random partitions of the original 61 samples into training and test sets are shown. †The prediction accuracy and number of relevant genes are produced using optimal parameters (Δ, ρ0) estimated by visual observation of 'bends' in the random cross-validation curves. ‡The prediction accuracy and number of relevant genes are produced using global optimal parameters, that is, the (Δ, ρ0) pair that produces the minimum average number of cross-validation errors over all Δ and all ρ0.
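The idea of removing highly correlated relevant genes at a threshold ρ0 can be sketched as a greedy filter. This is an illustration under stated assumptions, not the published procedure: the function `drop_correlated`, the use of signed Pearson correlation, the pre-ranked gene order and the toy expression profiles are all hypothetical. The sketch does preserve one property stated above: at ρ0 = 1 no gene is removed, which corresponds to SC.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles, clamped to [-1, 1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treat as uncorrelated (assumption)
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))

def drop_correlated(profiles, ranked_genes, rho0):
    """Keep a gene only if its correlation with every kept gene is <= rho0."""
    kept = []
    for gene in ranked_genes:  # assumed to be ordered from most to least relevant
        if all(pearson(profiles[gene], profiles[k]) <= rho0 for k in kept):
            kept.append(gene)
    return kept

# Hypothetical expression profiles over four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],   # nearly proportional to geneA
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
ranked = ["geneA", "geneB", "geneC"]
print(drop_correlated(profiles, ranked, 0.6))
print(drop_correlated(profiles, ranked, 1.0))
```

With these toy values, geneB is dropped at ρ0 = 0.6 because it is almost perfectly correlated with geneA, while all three genes survive at ρ0 = 1.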
Summary of EWUSC, USC and SC
| Desirable features | EWUSC | USC | SC |
| Make use of variability over repeated measurements | + | ||
| Applicable to data with any number of classes | + | + | + |
| Exploit dependence relationships between genes | + | + | |
| Integrated approach for both feature selection and classification | + | + | + |
| No assumption on data distributions | + | + | + |
Pattern matrix for synthetic data
| Class 1 | Class 2 | Class 3 | Class 4 |
| 1 | 0 | 0 | 0 |
| -1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | -1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | -1 | 0 |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | -1 |
| 1 | 1 | 0 | 0 |
| -1 | -1 | 0 | 0 |
| 1 | -1 | 0 | 0 |
| -1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| -1 | 0 | -1 | 0 |
| 1 | 0 | -1 | 0 |
| -1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| -1 | 0 | 0 | -1 |
| 1 | 0 | 0 | -1 |
| -1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 |
| 0 | -1 | -1 | 0 |
| 0 | 1 | -1 | 0 |
| 0 | -1 | 1 | 0 |
| 0 | 1 | 0 | 1 |
Each row represents a pattern, and each column represents a class such that entry P(i, j) is the ith pattern of class j. An entry of 1 means upregulated while an entry of -1 means downregulated. For example, the first row indicates that a patterned gene is upregulated in class 1 compared to all the other three classes.
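As a reading aid for the pattern matrix, the following sketch turns a row into a plain-language description. The helper `describe_pattern` is hypothetical, and treating an entry of 0 as "no class-specific change" is an assumption; only the meanings of 1 and -1 are stated above.

```python
def describe_pattern(row):
    """Describe one pattern-matrix row; entry j applies to class j (1-based)."""
    parts = []
    for class_number, entry in enumerate(row, start=1):
        if entry == 1:
            parts.append(f"upregulated in class {class_number}")
        elif entry == -1:
            parts.append(f"downregulated in class {class_number}")
        elif entry != 0:
            raise ValueError(f"unexpected entry {entry!r}")
    return "; ".join(parts) if parts else "no class-specific change"

# Rows copied from the pattern matrix above.
rows = [(1, 0, 0, 0), (-1, 0, 0, 0), (1, 0, -1, 0), (0, 1, 0, 1)]
for row in rows:
    print(row, "->", describe_pattern(row))
```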