Yongxi Tan, Leming Shi, Weida Tong, Charles Wang.
Abstract
DNA microarray technology provides a promising approach to the diagnosis and prognosis of tumors on a genome-wide scale by monitoring the expression levels of thousands of genes simultaneously. One problem with microarray data is the difficulty of analyzing high-dimensional gene expression profiles, which typically have thousands of variables (genes) but far fewer observations (samples) and often show severe collinearity. This makes it difficult to apply classical statistical methods directly to microarray data. In this paper, total principal component regression (TPCR) is proposed to classify human tumors by extracting the latent variable structure underlying microarray data from the augmented subspace of both independent and dependent variables. A salient feature of the method is that it takes into account not only the latent variable structure but also the errors in the microarray gene expression profiles (the independent variables). The prediction performance of TPCR was evaluated by both leave-one-out and leave-half-out cross-validation (LOOCV and LHOCV) on four well-known microarray datasets. The stability and reliability of the classification models were further assessed by re-randomization and permutation studies. A fast kernel algorithm was applied to reduce the computation time dramatically. (MATLAB source code is available upon request.)
Year: 2005 PMID: 15640445 PMCID: PMC546133 DOI: 10.1093/nar/gki144
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Classification results for leukemia dataset
| Selected genes (rows) × TPCR components (columns); %, λ = 19, LOOCV | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 50 | 13.89 | 2.78 | 4.17 | 5.56 | 8.33 | 6.94 |
| 100 | 15.28 | 4.17 | 4.17 | 5.56 | 5.56 | 5.56 |
| 200 | 13.89 | 2.78 | 2.78 | 1.39 | 4.17 | 4.17 |
| 500 | 13.89 | 2.78 | 2.78 | 4.17 | 4.17 | 4.17 |
| 1000 | 15.28 | 2.78 | 4.17 | 4.17 | 4.17 | 4.17 |
Given are the percentages of misclassification out of 72 samples using LOOCV.
Classification results for hereditary breast cancer dataset
| Selected genes (rows) × TPCR components (columns); %, λ = 9, LOOCV | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 50 | 54.55 | 45.45 | 40.91 | 40.91 | 45.45 | 40.91 |
| 100 | 59.09 | 50.00 | 40.91 | 31.82 | 36.36 | 22.73 |
| 200 | 54.55 | 22.73 | 18.18 | 27.27 | 31.82 | 31.82 |
| 500 | 77.27 | 31.82 | 27.27 | 36.36 | 40.91 | 40.91 |
| 1000 | 59.09 | 63.64 | 45.45 | 45.45 | 40.91 | 40.91 |
Given are the percentages of misclassification out of 22 samples using LOOCV.
Classification results for SRBCT dataset
| Selected genes (rows) × TPCR components (columns); %, λ = 200, LOOCV | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 50 | 36.14 | 14.46 | 1.20 | 0.00 | 0.00 | 0.00 |
| 100 | 34.94 | 14.46 | 0.00 | 0.00 | 0.00 | 0.00 |
| 200 | 34.94 | 14.46 | 0.00 | 0.00 | 0.00 | 0.00 |
| 500 | 34.94 | 15.66 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1000 | 34.94 | 18.07 | 1.20 | 0.00 | 0.00 | 0.00 |
Given are the percentages of misclassification out of 83 samples using LOOCV.
Classification results for NCI60 dataset
| Selected genes (rows) × TPCR components (columns); %, λ = 20, LOOCV | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 50 | 100.00 | 60.00 | 60.00 | 8.57 | 8.57 | 8.57 |
| 100 | 91.43 | 57.14 | 37.14 | 2.86 | 5.71 | 5.71 |
| 200 | 77.14 | 57.14 | 20.00 | 2.86 | 2.86 | 5.71 |
| 500 | 77.14 | 57.14 | 20.00 | 2.86 | 5.71 | 5.71 |
| 1000 | 77.14 | 48.57 | 20.00 | 2.86 | 2.86 | 2.86 |
Given are the percentages of misclassification out of 35 samples using LOOCV.
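Because each LOOCV table reports percentages over a fixed number of samples, every entry corresponds to a whole number of misclassified samples. The short sketch below makes that conversion explicit; the sample counts and best error rates are copied from the tables and figure captions here, while the function names are illustrative choices.

```python
# Sample counts taken from the footnotes of the four LOOCV tables.
SAMPLES = {"leukemia": 72, "hereditary breast cancer": 22, "SRBCT": 83, "NCI60": 35}

# Lowest LOOCV error rate (%) reported for each original dataset.
BEST = {"leukemia": 1.39, "hereditary breast cancer": 18.18, "SRBCT": 0.00, "NCI60": 2.86}


def misclassified_count(percent: float, n_samples: int) -> int:
    """Convert a LOOCV error percentage back to a number of misclassified samples."""
    return round(percent / 100 * n_samples)


def error_percent(n_wrong: int, n_samples: int) -> float:
    """Error rate as a percentage, rounded to two decimals as in the tables."""
    return round(100 * n_wrong / n_samples, 2)


for name, pct in BEST.items():
    n = SAMPLES[name]
    print(f"{name}: {pct:.2f}% = {misclassified_count(pct, n)} of {n} samples")
```

Read this way, the best leukemia and NCI60 results each correspond to a single misclassified sample, and the best hereditary breast cancer result to four.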
Classification results using TPCR and PLS under LHOCV procedure for Leukemia, hereditary breast cancer, SRBCT and NCI60 datasets
| Selected genes | TPCR 1 | TPCR 2 | TPCR 3 | TPCR 4 | TPCR 5 | TPCR 6 | PLS 1 | PLS 2 | PLS 3 | PLS 4 | PLS 5 | PLS 6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Leukemia dataset (%, LHOCV) (λ = 19) | ||||||||||||
| 50 | 14.97 | 4.19 | 5.83 | 6.72 | 6.61 | 6.50 | 15.00 | 4.53 | 5.28 | 5.19 | 5.36 | 6.08 |
| 100 | 15.08 | 4.08 | 4.61 | 5.00 | 5.31 | 5.19 | 15.22 | 4.17 | 4.36 | 4.50 | 4.75 | 5.14 |
| 200 | 15.19 | 3.89 | 3.75 | 3.92 | 4.89 | 4.97 | 15.28 | 4.36 | 4.53 | 4.56 | 4.89 | |
| 500 | 14.86 | 4.31 | 4.28 | 4.06 | 4.14 | 4.25 | 14.94 | 4.36 | 4.75 | 4.67 | 4.58 | 4.58 |
| 1000 | 14.72 | 5.25 | 4.83 | 4.44 | 4.19 | 4.19 | 14.72 | 4.72 | 4.89 | 4.61 | 4.61 | 4.72 |
| Hereditary breast cancer dataset (%, LHOCV) (λ = 9) | ||||||||||||
| 50 | 60.27 | 47.18 | 47.00 | 46.09 | 45.55 | 46.27 | 60.27 | 47.91 | 47.09 | 46.27 | 46.55 | 46.45 |
| 100 | 58.91 | 45.09 | 45.00 | 45.81 | 44.18 | 43.82 | 58.55 | 44.55 | 45.36 | 44.45 | 44.36 | |
| 200 | 58.73 | 45.45 | 45.27 | 44.64 | 44.00 | 44.18 | 59.18 | 46.09 | 44.55 | 44.00 | 44.00 | 44.91 |
| 500 | 61.09 | 46.45 | 46.00 | 45.27 | 44.18 | 44.73 | 60.64 | 45.73 | 45.00 | 44.45 | 45.18 | 45.27 |
| 1000 | 64.64 | 54.73 | 49.00 | 48.27 | 47.55 | 46.45 | 63.55 | 50.91 | 48.91 | 47.73 | 47.45 | 47.45 |
| SRBCT dataset (%, LHOCV) (λ = 200) | ||||||||||||
| 50 | 40.85 | 18.63 | 0.83 | 0.39 | 0.56 | 0.59 | 45.07 | 20.98 | 0.73 | 0.54 | 0.56 | 0.68 |
| 100 | 41.29 | 18.80 | 0.66 | 0.17 | 0.29 | 0.22 | 46.54 | 21.66 | 0.61 | 0.39 | 0.37 | |
| 200 | 42.59 | 19.22 | 1.07 | 0.32 | 0.27 | 0.32 | 49.66 | 21.54 | 0.90 | 0.29 | 0.39 | 0.39 |
| 500 | 43.83 | 20.71 | 2.20 | 0.78 | 0.56 | 0.51 | 52.71 | 23.63 | 1.88 | 0.66 | 0.56 | 0.63 |
| 1000 | 45.61 | 22.95 | 4.54 | 1.85 | 1.15 | 0.80 | 54.88 | 26.22 | 4.07 | 1.37 | 0.71 | 0.61 |
| NCI60 dataset (%, LHOCV) (λ = 20) | ||||||||||||
| 50 | 72.65 | 55.24 | 36.12 | 13.18 | 13.41 | 13.88 | 71.00 | 53.12 | 34.47 | 14.18 | 13.88 | 14.53 |
| 100 | 72.18 | 54.00 | 34.41 | 11.12 | 11.35 | 11.41 | 70.24 | 52.29 | 32.82 | 11.47 | 11.65 | 11.41 |
| 200 | 70.94 | 52.59 | 33.18 | 9.47 | 9.82 | 10.12 | 70.47 | 51.41 | 30.82 | 9.65 | 10.65 | 10.65 |
| 500 | 70.59 | 50.94 | 31.35 | 8.82 | 9.24 | 9.53 | 70.65 | 50.47 | 30.71 | 9.06 | 9.53 | 9.76 |
| 1000 | 70.76 | 49.29 | 30.35 | 9.53 | 9.00 | 9.41 | 71.35 | 49.82 | 30.88 | 9.82 | 9.71 | |
Given are the percentages of misclassification averaged over 100 re-randomizations. (Bold numbers denote the minimum value in each row; bold and underlined numbers denote the minimum value in the whole 5 × 6 data matrix.)
Figure 1. Distribution of error rate (percentage of misclassified samples) over 100 runs of permutation analysis. The original class memberships of all samples were randomly shuffled 100 times and then used together with the original gene expression profiles for classification by TPCR, using the same LOOCV as applied to the original dataset. The solid line labeled with an asterisk in each plot represents the minimum LOOCV error rate for the original dataset: (a) Leukemia dataset (number of selected genes, number of TPCR components, λ and corresponding LOOCV error rate for the original dataset: 200, 4, 19 and 1.39%); (b) Hereditary breast cancer dataset (200, 3, 9 and 18.18%); (c) SRBCT (50, 4, 200 and 0.00%); and (d) NCI60 (100, 4, 20 and 2.86%).
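The permutation analysis behind Figure 1 can be illustrated with a minimal sketch. This is not the authors' MATLAB code: `nearest_centroid_predict` is a hypothetical stand-in classifier used in place of TPCR, and the toy data, function names and random seeds are assumptions made only so that the example runs on its own.

```python
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT TPCR): assign x to the class with the closest mean profile."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1

    def dist2(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sorted(sums), key=dist2)


def loocv_error(X, y, predict=nearest_centroid_predict):
    """Percentage of samples misclassified when each sample is held out in turn."""
    wrong = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) != y[i]:
            wrong += 1
    return 100.0 * wrong / len(X)


def permutation_errors(X, y, n_runs=100, seed=0):
    """LOOCV error for n_runs random shufflings of the class labels (the null distribution)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_runs):
        shuffled = y[:]
        rng.shuffle(shuffled)  # break the link between profiles and classes
        errors.append(loocv_error(X, shuffled))
    return errors


# Toy data (assumption): 20 samples, 5 "genes", two well-separated classes.
rng = random.Random(1)
X = [[rng.gauss(0.0 if i < 10 else 3.0, 1.0) for _ in range(5)] for i in range(20)]
y = [0] * 10 + [1] * 10

observed = loocv_error(X, y)
null_errors = permutation_errors(X, y, n_runs=100)
print(f"observed LOOCV error: {observed:.2f}%")
print(f"permuted labels: min {min(null_errors):.2f}%, "
      f"mean {sum(null_errors) / len(null_errors):.2f}%")
```

If the observed error lies well below the whole permuted distribution, as the asterisk-marked line does in each panel of the figure, the classifier is unlikely to be fitting chance structure.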
Figure 2. Distribution of error rate (percentage of misclassified test samples) over 100 runs of LHOCV (re-randomization) analysis using TPCR. The original dataset was randomly split, half and half, into training and test samples 100 times, and each newly generated training set was used to predict the corresponding test set using TPCR: (a) Leukemia dataset (number of selected genes, number of TPCR components, λ and corresponding averaged LHOCV error rate: 200, 3, 19 and 3.75%); (b) Hereditary breast cancer dataset (100, 6, 9 and 43.82%); (c) SRBCT (100, 4, 200 and 0.17%); and (d) NCI60 (500, 4, 20 and 8.82%).
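The leave-half-out procedure can be sketched in the same spirit. The code below is an illustration under stated assumptions, not the published implementation: `one_nn_predict` is a hypothetical stand-in for TPCR, the split is a plain random half/half split (the text here does not say whether class proportions were preserved), and the toy data and seeds are invented for the example.

```python
import random


def one_nn_predict(train_X, train_y, x):
    """Stand-in classifier (NOT TPCR): return the label of the closest training sample."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def lhocv_errors(X, y, n_runs=100, seed=0, predict=one_nn_predict):
    """Test-set error (%) for n_runs random half/half splits into training and test samples."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(n_runs):
        order = list(range(n))
        rng.shuffle(order)
        train, test = order[: n // 2], order[n // 2:]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        wrong = sum(predict(train_X, train_y, X[i]) != y[i] for i in test)
        errors.append(100.0 * wrong / len(test))
    return errors


# Toy data (assumption): 24 samples, 5 "genes", two well-separated classes.
rng = random.Random(2)
X = [[rng.gauss(0.0 if i < 12 else 4.0, 1.0) for _ in range(5)] for i in range(24)]
y = [0] * 12 + [1] * 12

errors = lhocv_errors(X, y, n_runs=100)
print(f"mean LHOCV error over {len(errors)} splits: {sum(errors) / len(errors):.2f}%")
```

Averaging over many random splits, rather than reporting a single split, is what makes the LHOCV figures a measure of model stability as well as accuracy.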
Comparison of the speed of fast kernel EVD algorithm and classic SVD algorithm
| Algorithm | Hereditary breast cancer (size: 22 × 3226) | Leukemia (72 × 7129) | SRBCT (83 × 2308) | NCI60 (35 × 1299) |
|---|---|---|---|---|
| SVD | 9.7188 | 141.3125 | 16.3751 | 2.2969 |
| Kernel EVD | 0.0156 | 0.2031 | 0.0625 | 0.0156 |
| Speed gain (fold) | 623 | 696 | 262 | 147 |
Given is the time (in seconds) used to calculate the singular vectors U from the microarray gene expression profiles (X matrix).
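The exact kernel algorithm is not spelled out here, but one common way to obtain U cheaply when samples are far fewer than genes is to eigendecompose the small n × n matrix XXᵀ instead of running an SVD on the full n × p matrix: the eigenvectors of XXᵀ are the left singular vectors of X, and its eigenvalues are the squared singular values. The sketch below illustrates that idea using only the Python standard library. The Jacobi eigen-solver, all function names and the toy matrix are assumptions for illustration, not the authors' MATLAB routine.

```python
import math
import random


def gram_matrix(X):
    """Kernel matrix K = X X^T (n x n) for n samples (rows) and p genes (columns)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = sum(a * b for a, b in zip(X[i], X[j]))
    return K


def jacobi_eigh(A, max_sweeps=50):
    """Eigenvalues and eigenvectors (columns of V) of a small symmetric matrix,
    computed with cyclic Jacobi rotations."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    total = sum(v * v for row in A for v in row)
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off <= 1e-28 * total:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                diff = A[q][q] - A[p][p]
                sign = 1.0 if diff >= 0 else -1.0
                # Small-angle rotation that zeroes A[p][q].
                theta = 0.5 * math.atan2(2 * sign * A[p][q], abs(diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # right-multiply A and V by the rotation J
                    A[k][p], A[k][q] = c * A[k][p] - s * A[k][q], s * A[k][p] + c * A[k][q]
                    V[k][p], V[k][q] = c * V[k][p] - s * V[k][q], s * V[k][p] + c * V[k][q]
                for k in range(n):  # left-multiply A by J^T
                    A[p][k], A[q][k] = c * A[p][k] - s * A[q][k], s * A[p][k] + c * A[q][k]
    return [A[i][i] for i in range(n)], V


def left_singular_vectors(X, rank):
    """Top `rank` left singular vectors (columns of U) and singular values of X,
    obtained from the eigendecomposition of X X^T."""
    eigvals, V = jacobi_eigh(gram_matrix(X))
    order = sorted(range(len(eigvals)), key=lambda i: -eigvals[i])[:rank]
    sigmas = [math.sqrt(max(eigvals[i], 0.0)) for i in order]
    U = [[V[r][i] for i in order] for r in range(len(X))]
    return U, sigmas


# Toy matrix (assumption): 6 samples x 200 genes, far smaller than the real datasets.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(6)]
U, sigmas = left_singular_vectors(X, rank=3)
print([round(s, 3) for s in sigmas])
```

The eigen-solver only ever sees an n × n matrix, so its cost depends on the number of samples rather than the number of genes, which is consistent with the largest gains appearing for the datasets with the most genes.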