Rosember Guerra-Urzola, Katrijn Van Deun, Juan C Vera, Klaas Sijtsma.
Abstract
PCA is a popular tool for exploring and summarizing multivariate data, especially those consisting of many variables. PCA, however, is often not simple to interpret, as each component is a linear combination of all the variables. To address this issue, numerous methods have been proposed to sparsify the coefficients in the components, including rotation-thresholding methods and, more recently, PCA methods subject to sparsity-inducing penalties or constraints. Here, we offer guidelines on how to choose among the different sparse PCA methods. The current literature lacks clear guidance on the properties and performance of the different sparse PCA methods, often relying on the misconception that the equivalence of the formulations for ordinary PCA also holds for sparse PCA. To guide potential users of sparse PCA methods, we first discuss several popular sparse PCA methods in terms of where the sparseness is imposed (on the loadings or on the weights), the assumed model, and the optimization criterion used to impose sparseness. Second, using an extensive simulation study, we assess each of these methods by means of performance measures such as squared relative error, misidentification rate, and percentage of explained variance for several data-generating models and conditions for the population model. Finally, two examples using empirical data are considered.
Keywords: dimension reduction; exploratory data analysis; high dimension-low sample size; regularization; sparse principal components analysis
Year: 2021 PMID: 34185214 PMCID: PMC8636462 DOI: 10.1007/s11336-021-09773-2
Source DB: PubMed Journal: Psychometrika ISSN: 0033-3123 Impact factor: 2.290
Summary of methods for sparse PCA.
| Method | Estimated | Objective | Sparsity | Algorithm |
|---|---|---|---|---|
| VARIMAX | | Rotation | Threshold | Block |
| SIMPLIMAX | | Rotation | Threshold | Block |
| sPCA-rSVD | | Low-rank | | Deflating |
| SPCA | | Max. variance | | Block |
| pathSPCA | | Max. variance | | Deflating |
| GPower | | Max. variance | | Deflating |
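The rotation-thresholding rows of the table (VARIMAX with a threshold) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `varimax` and `threshold_to_sparsity`, the random data, the use of unscaled right singular vectors as loadings, and the choice to threshold over the whole matrix rather than per component are all assumptions made for illustration. SIMPLIMAX and the penalized methods are not covered.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a J x K loading matrix toward simple structure."""
    n_vars, n_comp = loadings.shape
    rotation = np.eye(n_comp)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion; the SVD maps it back to a rotation.
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        grad = loadings.T @ (rotated ** 3 - rotated @ col_ss / n_vars)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion - criterion < tol * max(new_criterion, 1.0):
            break
        criterion = new_criterion
    return loadings @ rotation

def threshold_to_sparsity(loadings, prop_sparse):
    """Zero the smallest-magnitude entries so that prop_sparse of them are zero."""
    n_zero = int(round(prop_sparse * loadings.size))
    sparse = loadings.copy()
    if n_zero > 0:
        # Flat indices of the n_zero smallest absolute loadings (assumed global rule).
        flat_idx = np.argsort(np.abs(sparse), axis=None)[:n_zero]
        sparse[np.unravel_index(flat_idx, sparse.shape)] = 0.0
    return sparse

# Hypothetical data: 100 observations, 10 variables, 2 components.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
X = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
pca_loadings = vt[:2].T                    # J x K, orthonormal columns
rotated = varimax(pca_loadings)
sparse_loadings = threshold_to_sparsity(rotated, prop_sparse=0.8)
print(np.count_nonzero(sparse_loadings))
```

Because the rotation is orthogonal, the rotated loadings span the same subspace as the PCA solution; only the thresholding step changes the fit.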
Simulation design factors and their levels.
| Model | Sparse | I | J | K | VAF | PS | Repetitions |
|---|---|---|---|---|---|---|---|
| | | 100, 500 | 10, 100, 1000 | 2, 3 | 80%, 95%, 100% | 0.0, 0.5, 0.8 | 100 |
| | | 100, 500 | 10, 100, 1000 | 2, 3 | 80%, 95%, 100% | 0.0, 0.5, 0.8 | 100 |
| | | 100, 500 | 10, 100, 1000 | 2, 3 | 80%, 95%, 100% | 0.7, 0.8, 0.9 | 100 |
I: sample size; J: number of variables; K: number of components; VAF: variance accounted for; PS: proportion of sparsity.
Simulation description summary.
| Condition | Sparse structure | Algorithm | Measurements | | |
|---|---|---|---|---|---|
| Type I | | Alg-I | SRE | MR | PEV |
| | | Alg-II | SRE | MR | PEV |
| Type II | | Alg-III | SRE | MR | PEV |
| | | Alg-III | SRE | MR | PEV |
| Type III | | Alg-II | CosSim | MR | PEV |
| | | Alg-I | CosSim | MR | PEV |
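The measurements named in the table (SRE, MR, PEV, CosSim) can be sketched in Python as follows. The record does not give the formulas, so the definitions below are common conventions adopted here as assumptions, and the function names and toy matrices are hypothetical. In practice, estimated components would also need to be matched in order and sign to the true ones before comparison; that step is omitted.

```python
import numpy as np

def squared_relative_error(true, est):
    """Assumed SRE: squared Frobenius error relative to the true matrix."""
    return np.sum((true - est) ** 2) / np.sum(true ** 2)

def misidentification_rate(true, est):
    """Assumed MR: share of entries whose zero/nonzero status is wrong."""
    return np.mean((true != 0) != (est != 0))

def prop_explained_variance(X, X_hat):
    """Assumed PEV: 1 minus the residual sum of squares over the total."""
    return 1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

def cosine_similarity(a, b):
    """Cosine of the angle between two matrices treated as vectors."""
    return np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy true and estimated coefficient matrices (3 variables, 2 components).
W_true = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
W_est = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]])

sre = squared_relative_error(W_true, W_est)   # (1 + 0.25) / (1 + 4) = 0.25
mr = misidentification_rate(W_true, W_est)    # 1 wrong entry out of 6
print(sre, mr, cosine_similarity(W_true, W_est))
```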
Fig. 1 Matching sparsity: boxplots of the performance measures in conditions with 80% of variance accounted for by the model in the data and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods. The top row summarizes the squared relative error (SRE-LW) for the loadings (at the left of the dashed line) and weights (at the right of the dashed line), the second row the SRE-S for the component scores, the third row (PEV) the proportion of variance in the data explained by the estimated model, and the bottom row the misidentification rate (MR).
Fig. 2 Double sparsity: boxplots of the performance measures in conditions with 80% of variance accounted for by the model in the data and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods. The top row summarizes the squared relative error (SRE-LW) for the loadings (at the left of the dashed line) and weights (at the right of the dashed line), the second row the SRE-S for the component scores, the third row (PEV) the proportion of variance in the data explained by the estimated model, and the bottom row the misidentification rate (MR).
Fig. 3 Mismatching sparsity: boxplots of the performance measures in conditions with 80% of variance accounted for by the model in the data and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods. The top row summarizes the squared relative error (SRE-LW) for the loadings (at the left of the dashed line) and weights (at the right of the dashed line), the second row the SRE-S for the component scores, the third row (PEV) the proportion of variance in the data explained by the estimated model, and the bottom row the misidentification rate (MR).
Fig. 4 Misidentification rate (MR): boxplots of the MR in conditions with 80% of variance accounted for by the model in the data, a proportion of sparsity of 0.8, and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods.
Fig. 5 Percentage of explained variance (PEV): boxplots of the PEV in conditions with 80% of variance accounted for by the model in the data, a proportion of sparsity of 0.8, and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods.
Fig. 6 Index of sparseness (IS) and percentage of explained variance (PEV) against the proportion of sparsity (PS).
Fig. 7 Biplot: the dots in each subplot represent the component scores, the arrows the component loadings.
Sparse loading and weights composition by trait (OCEAN).
| Trait | sPCA-rSVD | | | | | Varimax | | | | | Simplimax | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Openness | 0 | 9 | 1 | 4 | 41 | 1 | 0 | 8 | 5 | 42 | 0 | 17 | 9 | 4 | 30 |
| Conscientiousness | 9 | 3 | 11 | 43 | 2 | 7 | 7 | 3 | 44 | 4 | 15 | 0 | 23 | 31 | 7 |
| Extraversion | 17 | 19 | 21 | 6 | 9 | 16 | 15 | 30 | 5 | 7 | 15 | 10 | 6 | 7 | 11 |
| Agreeableness | 4 | 29 | 23 | 2 | 5 | 3 | 33 | 16 | 4 | 4 | 6 | 33 | 13 | 14 | 5 |
| Neuroticism | 34 | 4 | 8 | 9 | 7 | 37 | 9 | 7 | 6 | 7 | 28 | 4 | 13 | 8 | 11 |
| Total nonzero | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| Trait | SPCA | | | | | pathSPCA | | | | | GPower | | | | |
| Openness | 0 | 17 | 4 | 13 | 25 | 16 | 12 | 14 | 12 | 10 | 27 | 4 | 12 | 41 | 33 |
| Conscientiousness | 15 | 0 | 26 | 24 | 8 | 15 | 15 | 11 | 10 | 13 | 11 | 3 | 42 | 11 | 15 |
| Extraversion | 15 | 10 | 15 | 6 | 16 | 16 | 10 | 14 | 14 | 10 | 3 | 34 | 5 | 10 | 12 |
| Agreeableness | 6 | 27 | 13 | 10 | 3 | 15 | 9 | 11 | 17 | 12 | 39 | 4 | 1 | 5 | 5 |
| Neuroticism | 28 | 10 | 6 | 11 | 12 | 17 | 10 | 12 | 9 | 16 | 1 | 2 | 0 | 0 | 0 |
| Total nonzero | 64 | 64 | 64 | 64 | 64 | 79 | 56 | 62 | 62 | 61 | 81 | 47 | 60 | 67 | 65 |
Each column gives, for one component, the number of items per trait that have a nonzero loading/weight. The components were ordered such that the number of nonzero loadings/weights on the diagonal is maximized.
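The ordering rule in the note, pairing each trait with one component so that the matched nonzero counts are as large as possible, can be sketched as a brute-force search over permutations. This is an illustrative Python sketch, not the authors' procedure: the function name `best_matching` is hypothetical, and the counts are the first five-column block of the table above (rows in the order Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism).

```python
from itertools import permutations

# Nonzero counts: rows are traits (O, C, E, A, N), columns are components.
counts = [
    [0, 9, 1, 4, 41],
    [9, 3, 11, 43, 2],
    [17, 19, 21, 6, 9],
    [4, 29, 23, 2, 5],
    [34, 4, 8, 9, 7],
]

def best_matching(counts):
    """Return the trait-to-component assignment with the largest matched total."""
    n = len(counts)

    def matched_total(perm):
        # perm[i] is the component (column) assigned to trait (row) i.
        return sum(counts[i][perm[i]] for i in range(n))

    best = max(permutations(range(n)), key=matched_total)
    return best, matched_total(best)

order, total = best_matching(counts)
print(order, total)  # (4, 3, 2, 1, 0) 168
```

With five components there are only 120 permutations, so exhaustive search is enough; for these counts the result pairs each trait with the component on which it has its largest count.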
Fig. 8 Index of sparseness and percentage of explained variance against the proportion of sparsity when applying GPower to the gene expression data set.
Fig. 9 Scatter plot of component scores.