Donghwan Lee, Woojoo Lee, Youngjo Lee, Yudi Pawitan.
Abstract
BACKGROUND: Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true vector loadings; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced for reducing the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients.
Year: 2010 PMID: 20525176 PMCID: PMC2902448 DOI: 10.1186/1471-2105-11-296
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
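The contrast drawn in the abstract between dense PCA loadings and sparse loadings can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it uses soft-thresholded power iteration (a LASSO-type shrinkage) as a stand-in, not the HL-penalty method of the paper, and the sample size, noise level, threshold, and the helper name `leading_loading` are all hypothetical choices.

```python
import math
import random

def leading_loading(S, threshold=0.0, iters=200):
    """Power iteration on a covariance matrix S.

    With threshold > 0 each update is soft-thresholded, which sets small
    loadings exactly to zero (an illustrative stand-in, not the HL method).
    """
    p = len(S)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(S[j][k] * v[k] for k in range(p)) for j in range(p)]
        w = [math.copysign(max(abs(x) - threshold, 0.0), x) for x in w]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data (assumed setup): 4 of 10 variables carry the signal.
random.seed(0)
n, p = 50, 10
v_true = [0.5] * 4 + [0.0] * (p - 4)
X = []
for _ in range(n):
    score = random.gauss(0.0, 2.0)
    X.append([score * v_true[j] + random.gauss(0.0, 0.1) for j in range(p)])

# Columns are mean-zero by construction, so X'X / n estimates the covariance.
S = [[sum(row[j] * row[k] for row in X) / n for k in range(p)] for j in range(p)]

dense = leading_loading(S)                  # ordinary PCA loading
sparse = leading_loading(S, threshold=0.3)  # thresholded loading

n_dense = sum(x != 0.0 for x in dense)
n_sparse = sum(x != 0.0 for x in sparse)
print("nonzero loadings, PCA:", n_dense)
print("nonzero loadings, thresholded:", n_sparse)
```

The ordinary loading assigns a small but nonzero coefficient to every noise variable, whereas the thresholded update removes them; how aggressively depends entirely on the assumed threshold.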
Simulation results: estimation
| | | | | SPCA | | | |
|---|---|---|---|---|---|---|---|
| | | | | PCA | HL | LASSO | EN |
| 80 | 20 | 2.0 | 0.1 | 0.054 (0.010) | 0.023 (0.011) | 0.022 (0.010) | 0.025 (0.013) |
| | | 0.5 | 0.1 | 0.109 (0.021) | 0.045 (0.021) | 0.051 (0.022) | 0.055 (0.029) |
| 50 | 200 | 2.0 | 0.1 | 0.223 (0.022) | 0.029 (0.014) | 0.035 (0.015) | 0.056 (0.028) |
| | | 0.5 | 0.1 | 0.424 (0.041) | 0.062 (0.032) | 0.080 (0.033) | 0.122 (0.058) |
| 80 | 20 | 2.0 | 0.1 | 0.055 (0.010) | 0.020 (0.009) | 0.021 (0.010) | 0.022 (0.010) |
| | | 0.5 | 0.1 | 0.113 (0.020) | 0.042 (0.020) | 0.050 (0.023) | 0.050 (0.026) |
| 50 | 200 | 2.0 | 0.1 | 0.218 (0.025) | 0.026 (0.013) | 0.032 (0.014) | 0.055 (0.030) |
| | | 0.5 | 0.1 | 0.993 (0.010) | 0.063 (0.030) | 0.083 (0.044) | 0.866 (0.000) |
The median of dist(v1, v̂1), with the median absolute deviation in parentheses.
PCA*: PCA using X*
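The summary statistics reported in the estimation table (a median distance with its median absolute deviation) can be computed as in the following sketch. The definition of dist is not given in this excerpt, so the code assumes the sine of the angle between the two loading vectors, which is invariant to sign flips; the unscaled MAD, the toy estimates, and the function names are likewise assumptions made for illustration.

```python
import math
import statistics

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dist(u, v):
    """Assumed distance: sine of the angle between u and v (sign-invariant)."""
    inner = sum(a * b for a, b in zip(unit(u), unit(v)))
    return math.sqrt(max(0.0, 1.0 - inner * inner))

def median_and_mad(values):
    """Median and raw (unscaled) median absolute deviation."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return med, mad

# Hypothetical true loading and a few hypothetical estimates.
v_true = [0.5, 0.5, 0.5, 0.5]
estimates = [
    [0.5, 0.5, 0.5, 0.5],
    [0.6, 0.5, 0.4, 0.5],
    [1.0, 0.0, 0.0, 0.0],
]
errors = [dist(v_true, est) for est in estimates]
print(median_and_mad(errors))
```

Under this assumed definition, an estimate that retains only one of four equal nonzero loadings has inner product 0.5 with the truth and therefore a distance of √(1 − 0.25) ≈ 0.866.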
Simulation results: model selection
| | | | | SPCA | | | | SSPCA | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | PCA | HL | LASSO | EN | PCA* | HL | LASSO | EN |
| 80 | 20 | 2.0 | 0.1 | 0 | 72 | 12 | 64 | 0 | 95 | 14 | 99 |
| | | | | 0/16 | 16/16 | 14/16 | 16/16 | 0/16 | 16/16 | 15/16 | 16/16 |
| | | | | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 |
| | | 0.5 | 0.1 | 0 | 77 | 1 | 56 | 0 | 100 | 43 | 99 |
| | | | | 0/16 | 16/16 | 12/16 | 16/16 | 0/16 | 16/16 | 15/16 | 16/16 |
| | | | | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 |
| 50 | 200 | 2.0 | 0.1 | 0 | 73 | 0 | 88 | 0 | 100 | 27 | 87 |
| | | | | 0/196 | 196/196 | 184.5/196 | 196/196 | 0/196 | 196/196 | 194/196 | 196/196 |
| | | | | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 |
| | | 0.5 | 0.1 | 0 | 79 | 0 | 70 | 0 | 97 | 84 | 0 |
| | | | | 0/196 | 196/196 | 185.5/196 | 196/196 | 0/196 | 196/196 | 196/196 | 196/196 |
| | | | | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 0/4 | 3/4 |
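For each setting, the three rows give the percentage of runs selecting the true model, the median number of correctly identified zeros divided by the true number of zeros, and the median number of incorrectly identified zeros divided by the true number of nonzeros.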
PCA*: PCA using X*
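The three quantities summarised in the model selection table can be computed for a single estimated loading as sketched below. The function name `selection_summary` and the toy vectors are hypothetical; the percentages and medians in the table would then be taken over repeated simulation runs.

```python
def selection_summary(v_true, v_hat):
    """Compare the zero pattern of an estimate with that of the true loading.

    Returns (exact_match, correct_zeros, incorrect_zeros):
      exact_match     - True if the zero patterns coincide (true model selected)
      correct_zeros   - true zeros that are estimated as zero
      incorrect_zeros - true nonzeros that are estimated as zero
    """
    pairs = list(zip(v_true, v_hat))
    correct_zeros = sum(1 for t, e in pairs if t == 0 and e == 0)
    incorrect_zeros = sum(1 for t, e in pairs if t != 0 and e == 0)
    exact_match = all((t == 0) == (e == 0) for t, e in pairs)
    return exact_match, correct_zeros, incorrect_zeros

# Hypothetical example with 4 nonzero and 16 zero true loadings.
v_true = [0.5] * 4 + [0.0] * 16
v_hat = [0.48, 0.52, 0.5, 0.5] + [0.01, -0.02] + [0.0] * 14

exact, correct, incorrect = selection_summary(v_true, v_hat)
print(exact, f"{correct}/16", f"{incorrect}/4")
```

In this toy case two noise variables keep small nonzero loadings, so the true model is not selected even though no true signal is lost.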
Simulation results: prediction
| | | | | SSPCA | | | |
|---|---|---|---|---|---|---|---|
| | | | | PCA | HL | LASSO | EN |
| 80 | 20 | 2.0 | 0.1 | 7.979 (0.831) | 7.998 (0.837) | 7.996 (0.842) | 7.970 (1.116) |
| | | 0.5 | 0.1 | 2.050 (0.213) | 2.057 (0.222) | 2.055 (0.225) | 2.088 (0.283) |
| 50 | 200 | 2.0 | 0.1 | 7.907 (1.633) | 8.242 (1.599) | 8.242 (1.601) | 8.149 (1.386) |
| | | 0.5 | 0.1 | 1.769 (0.362) | 2.143 (0.418) | 2.140 (0.414) | 2.071 (0.349) |
| 80 | 20 | 2.0 | 0.1 | 7.954 (1.125) | 7.999 (0.849) | 7.997 (0.850) | 7.978 (1.115) |
| | | 0.5 | 0.1 | 2.062 (0.292) | 2.057 (0.226) | 2.057 (0.225) | 2.088 (0.280) |
| 50 | 200 | 2.0 | 0.1 | 7.564 (1.718) | 8.243 (1.593) | 8.243 (1.597) | 7.928 (1.755) |
| | | 0.5 | 0.1 | 0.242 (0.075) | 2.137 (0.452) | 2.149 (0.424) | 0.503 (0.316) |
The median test variance, with the median absolute deviation in parentheses.
PCA*: PCA using X*
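The exact definition of test variance is not included in this excerpt. A minimal sketch, assuming it means the sample variance of the scores obtained by projecting held-out observations onto the estimated first loading, is shown below; the function name `holdout_variance` and the toy data are hypothetical.

```python
import statistics

def holdout_variance(X_test, v_hat):
    """Sample variance of the held-out scores X_test @ v_hat.

    Assumption: "test variance" is the variance explained on data that were
    not used to estimate v_hat; the (n - 1) denominator is also an assumption.
    """
    scores = [sum(x * w for x, w in zip(row, v_hat)) for row in X_test]
    return statistics.variance(scores)

# Hypothetical held-out observations (rows) and an estimated unit loading.
X_test = [
    [1.0, 0.2],
    [3.0, -0.1],
    [5.0, 0.0],
]
v_hat = [1.0, 0.0]
print(holdout_variance(X_test, v_hat))
```

Under this reading, a loading estimate that captures the true direction yields a larger variance on new data than one that has drifted toward noise variables.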
Analyses of NCI data: number of zero loadings
| PCA | SPCA (HL) | SPCA (LASSO) | SSPCA (HL) | SSPCA (LASSO) |
|---|---|---|---|---|
| 214/21225 | 7966/21225 | 650/21225 | 19965/21225 | 1144/21225 |
| (1.01) | (37.53) | (3.06) | (94.06) | (5.39) |
The proportion (percentage) of zero elements in the first loading vector in the NCI data analysis.
Analysis of NCI data: number of zero loadings
| Principal component scores | ||||||||
|---|---|---|---|---|---|---|---|---|
| PCA | ||||||||
| Number of nonzero loadings | 21011 | 20385 | 19226 | 21099 | 20948 | 20817 | 20945 | 20997 |
| Adjusted Variance (%) | 12.3 | 10.2 | 6.6 | 4.1 | 3.6 | 3.2 | 2.9 | 2.6 |
| Cumulative adjusted Variance (%) | 12.3 | 22.5 | 29.1 | 33.2 | 36.8 | 40.0 | 42.9 | 45.5 |
| SPCA - HL | ||||||||
| Number of nonzero loadings | 13259 | 4086 | 15362 | 13547 | 13946 | 10445 | 9890 | 10958 |
| Adjusted Variance (%) | 20.6 | 13.4 | 11.5 | 6.4 | 6.1 | 4.9 | 4.0 | 4.1 |
| Cumulative adjusted Variance (%) | 20.6 | 34.0 | 45.5 | 51.9 | 58.0 | 62.9 | 66.9 | 71.0 |
| SSPCA - HL | ||||||||
| Number of nonzero loadings | 1260 | 681 | 375 | 290 | 47 | 58 | 33 | 3434 |
| Adjusted Variance (%) | 22.3 | 8.7 | 6.1 | 6.5 | 1.3 | 0.4 | 0.0 | 1.6 |
| Cumulative adjusted Variance (%) | 22.3 | 31.0 | 37.1 | 43.6 | 44.9 | 45.3 | 45.3 | 46.9 |
Number of nonzero loadings and cumulative variance for different methods.
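Sparse loading vectors are generally not orthogonal, so the component scores can be correlated and their plain variances would double-count shared variation. This excerpt does not state how the adjusted variance in the table is defined; the sketch below assumes one common definition, in which each score vector is orthogonalised against the preceding ones (the squared diagonal of R in a QR decomposition of the score matrix). Function names and the toy scores are hypothetical.

```python
import math
from itertools import accumulate

def adjusted_variances(score_columns):
    """Squared norm of each score vector after removing the part explained
    by earlier components (Gram-Schmidt, i.e. R_kk^2 of a QR decomposition)."""
    basis = []
    adjusted = []
    for z in score_columns:
        r = list(z)
        for q in basis:
            coef = sum(a * b for a, b in zip(q, r))
            r = [a - coef * b for a, b in zip(r, q)]
        norm2 = sum(a * a for a in r)
        adjusted.append(norm2)
        if norm2 > 1e-12:  # skip directions already fully explained
            basis.append([a / math.sqrt(norm2) for a in r])
    return adjusted

def as_cumulative_percent(adjusted, total_variance):
    """Express adjusted variances as percentages of the total, with running sum."""
    percents = [100.0 * a / total_variance for a in adjusted]
    return percents, list(accumulate(percents))

# Two correlated toy score vectors (hypothetical values).
z1 = [1.0, 0.0]
z2 = [1.0, 1.0]
adjusted = adjusted_variances([z1, z2])
percents, cumulative = as_cumulative_percent(adjusted, total_variance=4.0)
print(adjusted, percents, cumulative)
```

Here the second score has squared norm 2, but half of it lies along the first score, so only the remaining part is credited to the second component.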
Gene Ontology analysis
| Number | GO ID | GO Term | P-value(1) | P-value(2) | P-value(3) |
|---|---|---|---|---|---|
| 1 | GO:0048856 | anatomical structure development | 1.6e-10 | 1.5e-09 | 4.5e-07 |
| 2 | GO:0009653 | anatomical structure morphogenesis | 2.9e-10 | 4.8e-06 | |
| 3 | GO:0008283 | cell proliferation | 1.3e-09 | ||
| 4 | GO:0050793 | regulation of developmental process | 1.7e-09 | 9.4e-06 | |
| 5 | GO:0032502 | developmental process | 3.8e-09 | 8.1e-08 | 4.9e-06 |
| 6 | GO:0042127 | regulation of cell proliferation | 5.8e-08 | 3.9e-06 | |
| 7 | GO:0048513 | organ development | 6.6e-08 | ||
| 8 | GO:0048869 | cellular developmental process | 1e-07 | ||
| 9 | GO:0048731 | system development | 1.1e-07 | 3.6e-07 | 5.3e-06 |
| 10 | GO:0007155 | cell adhesion | 1.3e-07 | 7.6e-07 | |
| 11 | GO:0022610 | biological adhesion | 1.3e-07 | 7.6e-07 | |
| 12 | GO:0051093 | negative regulation of developmental process | 1.9e-06 | ||
| 13 | GO:0048519 | negative regulation of biological process | 2.8e-06 | ||
| 14 | GO:0048523 | negative regulation of cellular process | 3.4e-06 | ||
| 15 | GO:0009605 | response to external stimulus | 2.8e-07 | ||
| 16 | GO:0043065 | positive regulation of apoptosis | 7.4e-06 | ||
| 17 | GO:0043068 | positive regulation of programmed cell death | 8.6e-06 | ||
| 18 | GO:0042981 | regulation of apoptosis | 9.6e-06 | ||
| 19 | GO:0032501 | multicellular organismal process | 1.3e-06 | ||
| 20 | GO:0007275 | multicellular organismal development | 4.3e-06 |
The top 20 most enriched biological process GO terms and the associated P-values for the first three principal components from SSPCA.
Figure 1: HL penalty functions associated with the ridge (…).
The derivatives of the penalty functions.
| Types | |
|---|---|
| LASSO | λ |
| SCAD | |
| HL | |
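The SCAD and HL entries of the table did not survive in this copy. As a point of comparison only, the sketch below evaluates the LASSO derivative given in the table together with the standard SCAD derivative in the form usually attributed to Fan and Li, with the conventional default a = 3.7; the notation of the original table may differ, and the HL derivative is deliberately omitted because its formula is not available here.

```python
def lasso_derivative(theta, lam):
    """Derivative of the LASSO penalty for |theta| > 0: constant lam."""
    return lam

def scad_derivative(theta, lam, a=3.7):
    """Standard SCAD penalty derivative (a > 2).

    Equals lam up to |theta| = lam, then decays linearly and is zero once
    |theta| >= a * lam, so large coefficients are not shrunk.
    """
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

for theta in (0.5, 1.0, 2.0, 4.0):
    print(theta, lasso_derivative(theta, 1.0), scad_derivative(theta, 1.0))
```

The constant LASSO derivative shrinks every nonzero coefficient by the same amount, whereas the SCAD derivative vanishes for large coefficients.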
… p, the eigen-structure tends to be systematically distorted unless p/n is small [27], resulting in an ill-conditioned estimator for Σ.
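The distortion of the eigen-structure can be seen in a small simulation. The sketch below assumes the estimator in question is the sample covariance matrix and uses a true covariance equal to the identity, so every true eigenvalue is 1; the sample sizes, the seed, and the function name are hypothetical choices.

```python
import math
import random

def largest_sample_eigenvalue(n, p, rng, iters=100):
    """Largest eigenvalue of S = X'X / n for n draws from N(0, I_p),
    approximated by power iteration without forming S explicitly."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    v = [1.0 / math.sqrt(p)] * p
    value = 0.0
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]  # X v
        w = [sum(X[i][j] * scores[i] for i in range(n)) / n for j in range(p)]  # S v
        value = math.sqrt(sum(a * a for a in w))  # ||S v|| approaches the top eigenvalue
        v = [a / value for a in w]
    return value

rng = random.Random(0)
small_ratio = largest_sample_eigenvalue(n=50, p=5, rng=rng)   # p/n = 0.1
large_ratio = largest_sample_eigenvalue(n=50, p=50, rng=rng)  # p/n = 1
print("p/n = 0.1:", round(small_ratio, 2))
print("p/n = 1.0:", round(large_ratio, 2))
```

Although every population eigenvalue equals 1, the largest sample eigenvalue moves further from 1 as p/n grows, which is the kind of systematic distortion the sentence above describes.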