Ines Wilms, Christophe Croux.
Abstract
BACKGROUND: Canonical correlation analysis (CCA) is a multivariate statistical method that describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set that have maximal correlation. In genomics, CCA has become increasingly important for estimating the associations between gene expression data and DNA copy number change data. Identifying such associations might help to increase our understanding of the development of diseases such as cancer. However, these data sets are typically high-dimensional, containing many variables relative to the number of objects. Moreover, the data sets might contain atypical observations, since objects are likely to react differently to treatments. We discuss a method for Robust Sparse CCA that addresses both issues. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero, which improves their interpretability. Robust methods can cope with atypical observations in the data.
Keywords: Canonical correlation analysis; Penalized estimation; Robust estimation
Year: 2016 PMID: 27516087 PMCID: PMC4982144 DOI: 10.1186/s12918-016-0317-9
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
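As a point of reference for the abstract's description of CCA (linear combinations of each variable set with maximal correlation), the following is a minimal sketch of classical CCA for the first canonical pair. It is not the Robust Sparse CCA method discussed here: it is neither robust nor sparse, and it assumes more observations than variables so that both sample covariance matrices are invertible. The function name `classical_cca` is hypothetical.

```python
import numpy as np


def classical_cca(X, Y):
    """First canonical pair of classical CCA (illustrative; not robust, not sparse).

    X is an (n, p) array and Y an (n, q) array observed on the same n objects.
    Returns (a, b, rho), where rho is the sample correlation of X @ a and Y @ b.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance and cross-covariance matrices.
    Sxx = X.T @ X / (n - 1)
    Syy = Y.T @ Y / (n - 1)
    Sxy = X.T @ Y / (n - 1)

    # Whitening via Cholesky factors: Sxx = Lx Lx^T and Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)

    # M = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M)

    # Map the leading singular vectors back to the original coordinates.
    a = np.linalg.solve(Lx.T, U[:, 0])
    b = np.linalg.solve(Ly.T, Vt[0])
    return a, b, float(s[0])
```

In the high-dimensional settings the abstract describes, the covariance matrices in this sketch would be singular, which is one reason a penalized (sparse) formulation is needed there.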
Simulation designs
| Design |
|---|
| Uncorrelated Sparse Low-dimensional |
| Correlated Sparse Low-dimensional |
| NonSparse Low-dimensional |
| Sparse High-dimensional 1 |
| Sparse High-dimensional 2 |
| Sparse Ultra High-dimensional |
Simulation results. Average of the angles between the space spanned by the true and estimated canonical vectors; average true positive rate and true negative rate are reported for each method
| Design | Method | No contamination | | | | | | Contamination | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | Angle | TPR | TNR | Angle | TPR | TNR | Angle | TPR | TNR |
| Uncorrelated Sparse Low-dimensional | CCA | 0.11 | 1.00 | 0.00 | 0.22 | 1.00 | 0.00 | 0.38 | 1.00 | 0.00 |
| | Robust CCA | 0.14 | 1.00 | 0.00 | 0.15 | 1.00 | 0.00 | 0.15 | 1.00 | 0.00 |
| | Sparse CCA | 0.04 | 0.98 | 0.97 | 0.19 | 0.94 | 0.63 | 0.34 | 1.00 | 0.04 |
| | Robust Sparse CCA | 0.04 | 1.00 | 0.82 | 0.11 | 1.00 | 0.52 | 0.05 | 1.00 | 0.76 |
| Correlated Sparse Low-dimensional | CCA | 0.06 | 1.00 | 0.00 | 0.13 | 1.00 | 0.00 | 0.43 | 1.00 | 0.00 |
| | Robust CCA | 0.08 | 1.00 | 0.00 | 0.09 | 1.00 | 0.00 | 0.09 | 1.00 | 0.00 |
| | Sparse CCA | 0.13 | 1.00 | 1.00 | 0.19 | 0.96 | 0.76 | 0.57 | 0.52 | 0.02 |
| | Robust Sparse CCA | 0.07 | 1.00 | 0.57 | 0.09 | 1.00 | 0.34 | 0.07 | 1.00 | 0.53 |
| NonSparse Low-dimensional | CCA | 0.08 | 1.00 | NA | 0.32 | 1.00 | NA | 0.20 | 1.00 | NA |
| | Robust CCA | 0.11 | 1.00 | NA | 0.12 | 1.00 | NA | 0.12 | 1.00 | NA |
| | Sparse CCA | 0.41 | 0.93 | NA | 0.67 | 0.82 | NA | 0.23 | 1.00 | NA |
| | Robust Sparse CCA | 0.16 | 0.99 | NA | 0.22 | 0.99 | NA | 0.13 | 1.00 | NA |
| Sparse High-dimensional 1 | Sparse CCA | 0.65 | 0.62 | 0.99 | 0.70 | 0.71 | 0.87 | 0.36 | 1.00 | 0.80 |
| | Robust Sparse CCA | 0.66 | 0.84 | 0.86 | 0.56 | 0.82 | 0.86 | 0.16 | 0.96 | 0.97 |
| Sparse High-dimensional 2 | Sparse CCA | 1.08 | 0.31 | 1.00 | 1.14 | 0.23 | 1.00 | 1.25 | 0.38 | 0.97 |
| | Robust Sparse CCA | 0.59 | 0.87 | 0.87 | 0.60 | 0.94 | 0.89 | 0.84 | 0.97 | 0.82 |
| Sparse Ultra High-dimensional | Sparse CCA | 1.18 | 0.17 | 1.00 | 1.22 | 0.15 | 1.00 | 1.25 | 0.40 | 1.00 |
| | Robust Sparse CCA | 1.42 | 0.93 | 1.00 | 1.24 | 0.98 | 1.00 | 0.98 | 1.00 | 1.00 |
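The three quantities reported in the simulation results (angle between the spaces spanned by the true and estimated canonical vectors, true positive rate, true negative rate) can be sketched as below. The exact definitions used in the simulation study are not given in this excerpt, so the code rests on assumptions: the angle is taken as the largest principal angle in radians, TPR is the share of truly nonzero coefficients estimated as nonzero, and TNR is the share of truly zero coefficients estimated as zero. The function names are hypothetical.

```python
import numpy as np


def largest_principal_angle(A_true, A_est):
    """Largest principal angle (radians) between the column spaces of two arrays.

    Assumption: this is one common way to define an angle between subspaces;
    1-D inputs are treated as a single column.
    """
    Qt, _ = np.linalg.qr(np.reshape(A_true, (len(A_true), -1)))
    Qe, _ = np.linalg.qr(np.reshape(A_est, (len(A_est), -1)))
    # Singular values of Qt^T Qe are the cosines of the principal angles.
    cosines = np.linalg.svd(Qt.T @ Qe, compute_uv=False)
    return float(np.arccos(np.clip(cosines.min(), -1.0, 1.0)))


def selection_rates(a_true, a_est, tol=0.0):
    """Return (TPR, TNR) for the sparsity pattern of an estimated vector.

    A rate is NaN when its denominator is empty, e.g. TNR when the true
    vector has no zero elements.
    """
    true_nz = np.abs(a_true) > tol
    est_nz = np.abs(a_est) > tol
    n_pos = true_nz.sum()
    n_neg = (~true_nz).sum()
    tpr = (true_nz & est_nz).sum() / n_pos if n_pos else float("nan")
    tnr = (~true_nz & ~est_nz).sum() / n_neg if n_neg else float("nan")
    return float(tpr), float(tnr)
```

Under these assumptions, a non-sparse estimator such as CCA sets no coefficient to zero and therefore gets a TNR of 0, and a design without true zeros has an undefined TNR, which matches the 0.00 and NA entries in the table.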
As in Table 3, comparing Robust Sparse CCA to other alternatives in the “Sparse High-dimensional 2” design
| Method | No contamination | | | | | | Contamination | | |
|---|---|---|---|---|---|---|---|---|---|
| | Angle | TPR | TNR | Angle | TPR | TNR | Angle | TPR | TNR |
| Sparse CCA of [ | 0.93 | 1.00 | 0.93 | 1.41 | 0.94 | 0.72 | 1.28 | 0.89 | 0.00 |
| Sparse CCA of [ | 0.79 | 0.65 | 1.00 | 1.16 | 0.30 | 0.92 | 1.57 | 0.00 | 0.00 |
| Sparse CCA of [ | 0.44 | 1.00 | 0.08 | 1.01 | 1.00 | 0.02 | 1.25 | 1.00 | 0.00 |
| Sparse CCA on pre-processed data | 0.58 | 0.92 | 0.79 | 0.72 | 0.88 | 0.74 | 1.36 | 0.74 | 0.25 |
| Sparse CCA with robust initialization | 1.07 | 0.32 | 1.00 | 1.13 | 0.24 | 1.00 | 1.25 | 0.38 | 0.97 |
| Robust Sparse CCA | 0.59 | 0.87 | 0.87 | 0.60 | 0.94 | 0.89 | 0.84 | 0.97 | 0.82 |
Fig. 1 Evaporation data set: Distance-Distance plot
Evaporation data set: Cross-validation score for standard CCA, Robust CCA, Sparse CCA and Robust Sparse CCA
| Method | CV-score (0 % trimming) | CV-score (10 % trimming) |
|---|---|---|
| CCA | 0.74 | 0.49 |
| Robust CCA | 0.57 | 0.39 |
| Sparse CCA | 0.57 | 0.41 |
| Robust Sparse CCA | 0.48 | 0.31 |
Evaporation data set: Estimated canonical vectors using Robust CCA and Robust Sparse CCA
| Data set | Variable | Robust CCA, canonical vector 1 | Robust CCA, canonical vector 2 | Robust Sparse CCA, canonical vector 1 | Robust Sparse CCA, canonical vector 2 |
|---|---|---|---|---|---|
| First data set | MAXST: Max. daily soil temperature | -0.35 | -0.76 | 0 | -0.70 |
| | MINST: Min. daily soil temperature | 0.03 | 0.63 | 0 | 0.71 |
| | AVST: Avg. daily soil temperature | 0.93 | 0.18 | 1 | 0 |
| Second data set | MAXAT: Max. daily air temperature | 0.54 | -0.11 | 0.94 | 0 |
| | MINAT: Min. daily air temperature | 0.67 | 0.84 | 0.14 | 0.38 |
| | AVAT: Avg. daily air temperature | 0.14 | -0.03 | 0.17 | 0.36 |
| | MAXH: Max. daily relative humidity | -0.13 | 0.09 | 0 | 0 |
| | MINH: Min. daily relative humidity | -0.03 | 0.36 | 0 | 0.85 |
| | AVH: Avg. daily relative humidity | -0.28 | 0.32 | -0.24 | 0 |
| | WIND: Total wind, measured in miles per day | -0.37 | -0.19 | 0 | 0 |
| | | 0.93 | 0.56 | 0.87 | 0.48 |
Nutrimouse data set: Cross-validation score for Sparse CCA and Robust Sparse CCA
| Method | CV-score (0 % trimming) | CV-score (10 % trimming) |
|---|---|---|
| Sparse CCA | 98.78 | 92.53 |
| Robust Sparse CCA | 6.30 | 4.31 |
Fig. 2 Nutrimouse data set: Coefficients of selected genes (top) and coefficients of selected fatty acids (bottom) in the first canonical vector pair
Fig. 3 Breast cancer data set: 23 cross-validation scores (one for each chromosome) for Robust Sparse CCA (horizontal axis) and Sparse CCA (vertical axis). The dashed line is the 45° line
Fig. 4 Breast cancer data set: Residual Distance plot for chromosome 3 (left) and chromosome 8 (right)