| Literature DB >> 29070790 |
Lei Du, Kefei Liu, Xiaohui Yao, Jingwen Yan, Shannon L. Risacher, Junwei Han, Lei Guo, Andrew J. Saykin, Li Shen.
Abstract
Brain imaging genetics aims to uncover associations between genetic markers and neuroimaging quantitative traits. Sparse canonical correlation analysis (SCCA) can discover bi-multivariate associations and select relevant features, and is becoming popular in imaging genetics studies. The L1-norm function is not only convex but also singular at the origin, which is a necessary condition for sparsity. Thus most SCCA methods impose the L1-norm onto individual features or structured groups of features to pursue the corresponding sparsity. However, the L1-norm penalty over-penalizes large coefficients and may incur estimation bias. A number of non-convex penalties have been proposed to reduce the estimation bias in regression tasks, but their use in SCCA remains largely unexplored. In this paper, we design a unified non-convex SCCA model, based on seven non-convex functions, for unbiased estimation and stable feature selection simultaneously. We also propose an efficient optimization algorithm. The proposed method obtains both higher correlation coefficients and better canonical loading patterns. Specifically, the SCCA methods with non-convex penalties discover a strong association between the APOE e4 rs429358 SNP and the hippocampus region of the brain. Both are Alzheimer's disease related biomarkers, indicating the potential and power of the non-convex methods in brain imaging genetics.
Year: 2017 PMID: 29070790 PMCID: PMC5656688 DOI: 10.1038/s41598-017-13930-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
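The abstract's central point — that the L1 penalty biases large coefficients while non-convex penalties do not — can be seen directly from the thresholding (proximal) operators. The sketch below is illustrative only, not code from the paper; it compares soft-thresholding with the standard MCP thresholding rule (unit step size assumed):

```python
import numpy as np

def soft_threshold(theta, lam):
    """Proximal operator of the l1 penalty: every surviving coefficient
    is shrunk toward zero by lam, so large coefficients stay biased."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def mcp_threshold(theta, lam, gamma=3.0):
    """Proximal operator of the MCP penalty (gamma > 1): shrinkage is
    attenuated for small inputs and vanishes once |theta| > gamma*lam,
    leaving large coefficients unbiased."""
    theta = np.asarray(theta, dtype=float)
    shrunk = gamma / (gamma - 1.0) * np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)
    return np.where(np.abs(theta) <= gamma * lam, shrunk, theta)

# A large coefficient: l1 still shrinks it by lam, MCP returns it intact.
print(soft_threshold(5.0, 0.5))        # 4.5 (biased by lam)
print(float(mcp_threshold(5.0, 0.5)))  # 5.0 (|5.0| > gamma*lam = 1.5)
```

This is the estimation-bias argument in miniature: for any coefficient larger than γλ, the MCP operator is the identity, whereas soft-thresholding always subtracts λ.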
The seven non-convex penalty functions and their supergradients (the formulas were lost in extraction; the standard forms from the non-convex regularization literature are given below, written for θ ≥ 0 with penalty weight λ, supergradients for θ > 0).

| Penalty | Function λP_γ(θ) | Supergradient |
|---|---|---|
| ℓ_γ-norm (0 < γ < 1) | λθ^γ | λγθ^(γ−1) |
| Geman | λθ/(θ + γ) | λγ/(θ + γ)² |
| SCAD (γ > 2) | λθ if θ ≤ λ; (2γλθ − θ² − λ²)/(2(γ − 1)) if λ < θ ≤ γλ; λ²(γ + 1)/2 if θ > γλ | λ if θ ≤ λ; (γλ − θ)/(γ − 1) if λ < θ ≤ γλ; 0 if θ > γλ |
| Laplace | λ(1 − e^(−θ/γ)) | (λ/γ)e^(−θ/γ) |
| MCP (γ > 1) | λθ − θ²/(2γ) if θ ≤ γλ; γλ²/2 if θ > γλ | λ − θ/γ if θ ≤ γλ; 0 if θ > γλ |
| ETP | λ(1 − e^(−γθ))/(1 − e^(−γ)) | λγe^(−γθ)/(1 − e^(−γ)) |
| Logarithm | λ log(1 + γθ)/log(1 + γ) | λγ/((1 + γθ)log(1 + γ)) |
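For readers who want to check the shapes numerically, here is a sketch (not from the paper) of the seven penalty values for θ ≥ 0, using the standard parameterizations with penalty weight λ = 1 and illustrative default γ values:

```python
import numpy as np

LAM = 1.0  # penalty weight lambda (illustrative)

# Penalty values P(theta) for theta >= 0; default gammas are illustrative.
def lq(theta, gamma=0.5):       return LAM * theta ** gamma          # l_gamma "norm"
def geman(theta, gamma=0.1):    return LAM * theta / (theta + gamma)
def laplace(theta, gamma=0.1):  return LAM * (1.0 - np.exp(-theta / gamma))
def etp(theta, gamma=10.0):     return LAM * (1.0 - np.exp(-gamma * theta)) / (1.0 - np.exp(-gamma))
def log_pen(theta, gamma=10.0): return LAM * np.log(1.0 + gamma * theta) / np.log(1.0 + gamma)

def mcp(theta, gamma=3.0):
    # quadratic up to gamma*lam, then constant (no further penalization)
    return np.where(theta <= gamma * LAM,
                    LAM * theta - theta ** 2 / (2.0 * gamma),
                    gamma * LAM ** 2 / 2.0)

def scad(theta, gamma=3.7):
    # linear, then quadratic, then constant
    small = theta <= LAM
    mid = (theta > LAM) & (theta <= gamma * LAM)
    return np.where(small, LAM * theta,
                    np.where(mid,
                             (2 * gamma * LAM * theta - theta ** 2 - LAM ** 2) / (2 * (gamma - 1)),
                             (gamma + 1) * LAM ** 2 / 2.0))

# All seven vanish at the origin and flatten out for large theta,
# which is what removes the l1-style over-penalization.
for pen in (lq, geman, laplace, etp, log_pen, mcp, scad):
    assert np.isclose(float(pen(np.array(0.0))), 0.0)
```

Note how MCP and SCAD become exactly constant beyond γλ, while Geman, Laplace, ETP, and Logarithm saturate asymptotically.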
Figure 1. Illustration of the ℓ1-norm and the seven non-convex penalty functions. All the non-convex penalty functions share two common properties: they are singular at the origin, and they are concave and monotonically decreasing on (−∞, 0] and concave and monotonically increasing on [0, ∞).
Figure 2. Canonical loadings estimated on four synthetic data sets. The first column shows results for Data1, the second for Data2, and so forth. The first row is the ground truth, and each remaining row corresponds to an SCCA method: (1) Ground Truth; (2) L1-SCCA; (3) L1-NSCCA; (4) L1-S2CCA; (5) ℓγ-norm; and so on through the seven non-convex penalties. For each data set and each method, the estimated weights of u are shown on the left panel and those of v on the right. In each individual heat map, the x-axis indicates the indices of elements in u or v; the y-axis indicates the indices of the cross-validation folds.
Participant characteristics.
| | HC | MCI | AD |
|---|---|---|---|
| Num | 204 | 363 | 176 |
| Gender (M/F) | 111/93 | 235/128 | 95/81 |
| Handedness (R/L) | 190/14 | 329/34 | 166/10 |
| Age (mean ± std) | 76.07 ± 4.99 | 74.88 ± 7.37 | 75.60 ± 7.50 |
| Education (mean ± std) | 16.15 ± 2.73 | 15.72 ± 2.30 | 14.84 ± 3.12 |
The search range of the optimal γ for each non-convex penalty.

| | ℓγ-norm | SCAD | Geman, Laplace, MCP | ETP, Log |
|---|---|---|---|---|
| Range of γ | 0.1, 0.2, 0.3 | 3.7 | 0.1, 0.01, 0.001 | 10, 100, 1000 |
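A minimal sketch (hypothetical, not the paper's code) of how the γ grid above could be searched: for each penalty, choose the γ with the highest mean held-out correlation across cross-validation folds. The fold scores below are made-up numbers standing in for whatever SCCA solver is being tuned.

```python
import numpy as np

# The gamma grid from the table above, keyed by penalty name
# ("lq" stands for the l_gamma-norm penalty).
GAMMA_GRID = {
    "lq":      [0.1, 0.2, 0.3],
    "scad":    [3.7],
    "geman":   [0.1, 0.01, 0.001],
    "laplace": [0.1, 0.01, 0.001],
    "mcp":     [0.1, 0.01, 0.001],
    "etp":     [10, 100, 1000],
    "log":     [10, 100, 1000],
}

def select_gamma(penalty, fold_scores):
    """Pick the gamma with the highest mean held-out correlation.

    fold_scores: dict mapping gamma -> list of per-fold correlations
    (produced by the SCCA solver under tuning; hypothetical here).
    """
    return max(GAMMA_GRID[penalty], key=lambda g: np.mean(fold_scores[g]))

# Made-up fold correlations for the Geman penalty.
scores = {0.1: [0.60, 0.62], 0.01: [0.64, 0.66], 0.001: [0.61, 0.63]}
print(select_gamma("geman", scores))  # 0.01
```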
Performance comparison on synthetic data sets. The AUC (area under the curve) values (mean ± std) of the estimated canonical loadings u and v.

| | u: Data1 | u: Data2 | u: Data3 | u: Data4 | v: Data1 | v: Data2 | v: Data3 | v: Data4 |
|---|---|---|---|---|---|---|---|---|
| L1-SCCA | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.74 ± 0.10 | 1.00 ± 0.00 | 1.00 ± 0.00 |
| L1-S2CCA | 1.00 ± 0.00 | 0.38 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.75 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 0.75 ± 0.00 |
| L1-NSCCA | 0.80 ± 0.45 | 0.30 ± 0.41 | 0.80 ± 0.45 | 0.40 ± 0.55 | 1.00 ± 0.00 | 0.65 ± 0.15 | 1.00 ± 0.00 | 0.80 ± 0.27 |
| ℓγ | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.76 ± 0.04 | 1.00 ± 0.00 | 1.00 ± 0.00 |
| Geman | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.74 ± 0.01 | 1.00 ± 0.00 | 1.00 ± 0.00 |
| SCAD | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.74 ± 0.02 | 1.00 ± 0.00 | 1.00 ± 0.00 |
| Laplace | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.75 ± 0.02 | 1.00 ± 0.00 | 1.00 ± 0.00 |
| MCP | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.76 ± 0.04 | 1.00 ± 0.00 | 1.00 ± 0.00 |
| ETP | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.76 ± 0.04 | 1.00 ± 0.00 | 1.00 ± 0.00 |
| Log | 1.00 ± 0.00 | 0.75 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.75 ± 0.02 | 1.00 ± 0.00 | 1.00 ± 0.00 |
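The AUC criterion in the table above scores how well the magnitude of an estimated canonical loading ranks the true signal features above the noise features. A self-contained sketch with made-up numbers (not the paper's data):

```python
import numpy as np

def auc_score(truth, score):
    """AUC via the Mann-Whitney identity: the fraction of
    (signal, noise) pairs ranked correctly, ties counting half."""
    pos = score[truth == 1]
    neg = score[truth == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Hypothetical ground truth: the first 10 of 100 features carry signal.
truth = np.zeros(100)
truth[:10] = 1.0

# Hypothetical estimated loading u: strong weights on the true support,
# small noise elsewhere, so |u| separates signal from noise perfectly.
rng = np.random.default_rng(0)
u_hat = 0.01 * rng.standard_normal(100)
u_hat[:10] += 0.8

print(auc_score(truth, np.abs(u_hat)))  # 1.0
```

An AUC of 1.00 ± 0.00 in the table thus means the method's loadings perfectly separated signal from noise features in every fold.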
Training and testing correlation coefficients (mean ± std) of 5-fold cross-validation on the synthetic data sets. In the original, the best values were shown in boldface; those entries were not recovered from the source and are marked "—".

| | Train: Data1 | Train: Data2 | Train: Data3 | Train: Data4 | Test: Data1 | Test: Data2 | Test: Data3 | Test: Data4 |
|---|---|---|---|---|---|---|---|---|
| L1-SCCA | 0.65 ± 0.03 | 0.83 ± 0.03 | 0.65 ± 0.05 | 0.66 ± 0.04 | 0.59 ± 0.14 | 0.82 ± 0.05 | 0.59 ± 0.25 | 0.62 ± 0.08 |
| L1-S2CCA | 0.51 ± 0.25 | 0.67 ± 0.30 | 0.63 ± 0.28 | 0.32 ± 0.15 | 0.55 ± 0.23 | 0.68 ± 0.28 | 0.53 ± 0.29 | 0.24 ± 0.20 |
| L1-NSCCA | 0.62 ± 0.04 | 0.80 ± 0.01 | 0.75 ± 0.01 | 0.65 ± 0.02 | 0.61 ± 0.17 | 0.80 ± 0.04 | 0.73 ± 0.13 | 0.65 ± 0.10 |
| ℓγ | 0.62 ± 0.04 | — | 0.75 ± 0.01 | 0.65 ± 0.02 | 0.61 ± 0.17 | — | 0.73 ± 0.13 | 0.66 ± 0.10 |
| Geman | 0.62 ± 0.04 | — | 0.75 ± 0.01 | 0.65 ± 0.02 | 0.62 ± 0.17 | 0.83 ± 0.02 | 0.72 ± 0.13 | 0.66 ± 0.10 |
| SCAD | 0.62 ± 0.04 | — | 0.75 ± 0.01 | 0.65 ± 0.03 | 0.61 ± 0.17 | — | 0.73 ± 0.13 | 0.66 ± 0.10 |
| Laplace | 0.62 ± 0.04 | — | 0.75 ± 0.01 | 0.65 ± 0.02 | 0.61 ± 0.17 | 0.83 ± 0.02 | 0.73 ± 0.13 | 0.66 ± 0.10 |
| MCP | 0.62 ± 0.04 | — | 0.75 ± 0.01 | 0.65 ± 0.02 | 0.61 ± 0.17 | — | 0.73 ± 0.13 | 0.66 ± 0.10 |
| ETP | 0.62 ± 0.04 | — | 0.75 ± 0.01 | 0.65 ± 0.02 | 0.61 ± 0.17 | — | 0.73 ± 0.13 | 0.66 ± 0.10 |
| Log | — | — | — | — | — | 0.83 ± 0.03 | — | — |
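The training and testing correlations reported in the table above are the sample correlation between the two projected views, corr(Xu, Yv), computed on the corresponding fold. A toy sketch with made-up data and loadings (not the paper's data):

```python
import numpy as np

def canonical_correlation(X, Y, u, v):
    """Sample correlation between the projections Xu and Yv, the
    quantity reported per fold in the tables above."""
    return np.corrcoef(X @ u, Y @ v)[0, 1]

# Toy two-view data sharing a latent factor z (all numbers made up).
rng = np.random.default_rng(1)
z = rng.standard_normal(50)
X = np.outer(z, [1.0, 0.5]) + 0.1 * rng.standard_normal((50, 2))
Y = np.outer(z, [0.8, -0.3]) + 0.1 * rng.standard_normal((50, 2))

u = np.array([1.0, 0.5])
v = np.array([0.8, -0.3])
r = canonical_correlation(X, Y, u, v)
print(round(r, 2))  # close to 1: both projections recover z
```

A large gap between training and testing correlation signals overfitting, which is why the real-data table below also reports the training−testing gap.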
Figure 3. Canonical loadings estimated on real imaging genetics data. Each row corresponds to an SCCA method: (1) L1-SCCA; (2) L1-NSCCA; (3) L1-S2CCA; (4) ℓγ-norm; and so on through the seven non-convex penalties. For each method, the estimated u is shown on the left panel and v on the right. In each individual heat map, the x-axis indicates the indices of elements in u or v (i.e., SNPs or ROIs); the y-axis indicates the indices of the cross-validation folds.
Figure 4. Mapping the averaged canonical weights v estimated by every SCCA method onto the brain. The left and right panels each show five methods, where each row corresponds to an SCCA method. L1-SCCA identifies the most signals, followed by L1-NSCCA and L1-S2CCA. All the proposed methods identify a clean signal that aids further investigation.
Performance comparison on the real data set. Training and testing correlation coefficients (mean ± std) of 5-fold cross-validation are shown. In the original, the best value was shown in boldface; entries not recovered from the source are marked "—".

| | L1-SCCA | L1-S2CCA | L1-NSCCA | ℓγ | Geman | SCAD | Laplace | MCP | ETP | Log |
|---|---|---|---|---|---|---|---|---|---|---|
| Training | 0.27 ± 0.01 | 0.29 ± 0.02 | 0.27 ± 0.01 | 0.28 ± 0.02 | 0.27 ± 0.02 | 0.29 ± 0.02 | 0.27 ± 0.02 | 0.28 ± 0.02 | 0.28 ± 0.02 | — |
| Testing | 0.18 ± 0.04 | 0.25 ± 0.10 | 0.22 ± 0.07 | 0.26 ± 0.09 | 0.26 ± 0.10 | 0.27 ± 0.09 | 0.26 ± 0.10 | 0.26 ± 0.09 | 0.26 ± 0.09 | — |
| Training−Testing gap | 0.09 | 0.04 | 0.05 | 0.02 | 0.01 | 0.02 | 0.01 | 0.02 | 0.02 | 0.06 |