
Structured sparse CCA for brain imaging genetics via graph OSCAR.

Lei Du¹, Heng Huang², Jingwen Yan¹, Sungeun Kim¹, Shannon Risacher¹, Mark Inlow³, Jason Moore⁴, Andrew Saykin¹, Li Shen⁵.

Abstract

BACKGROUND: Recently, structured sparse canonical correlation analysis (SCCA) has received increased attention in brain imaging genetics studies. It can identify bi-multivariate imaging genetic associations as well as select relevant features with desired structure information. These SCCA methods either use the fused lasso regularizer to induce the smoothness between ordered features, or use the signed pairwise difference which is dependent on the estimated sign of sample correlation. Besides, several other structured SCCA models use the group lasso or graph fused lasso to encourage group structure, but they require the structure/group information provided in advance which sometimes is not available.
RESULTS: We propose a new structured SCCA model, which employs the graph OSCAR (GOSCAR) regularizer to encourage highly correlated features to have similar or equal canonical weights. Our GOSCAR-based SCCA has two advantages: 1) It does not require the sign of the sample correlation to be pre-defined, and thus could reduce the estimation bias. 2) It could pull highly correlated features together regardless of whether they are positively or negatively correlated. We evaluate our method using both synthetic data and real data. Using the 191 ROI measurements of amyloid imaging data and 58 genetic markers within the APOE gene, our method identifies a strong association between APOE SNP rs429358 and the amyloid burden measure in the frontal region. In addition, the estimated canonical weights present a clear pattern, which is preferable for further investigation.
CONCLUSIONS: Our proposed method shows better or comparable performance on the synthetic data in terms of the estimated correlations and canonical loadings. It has successfully identified an important association between an Alzheimer's disease risk SNP rs429358 and the amyloid burden measure in the frontal region.


Keywords: Brain imaging genetics; Canonical correlation analysis; Machine learning; Structured sparse model


Year: 2016 PMID: 27585988 PMCID: PMC5009827 DOI: 10.1186/s12918-016-0312-1

Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509

Background

In recent years, bi-multivariate analysis techniques [1], especially sparse canonical correlation analysis (SCCA) [2-8], have been widely used in brain imaging genetics studies. These methods are powerful in identifying bi-multivariate associations between genetic biomarkers, e.g., single nucleotide polymorphisms (SNPs), and imaging factors such as quantitative traits (QTs). Witten et al. [3, 9] first employed the penalized matrix decomposition (PMD) technique to handle the SCCA problem, which had a closed-form solution. This SCCA imposed the ℓ1-norm on the traditional CCA model to induce sparsity. Since the ℓ1-norm tends to select only one of a set of correlated features, essentially at random, it performed poorly in finding the structure information that usually exists in biological data. Witten et al. [3, 9] also implemented a fused lasso based SCCA, which penalized the difference between adjacent features in a given order. This SCCA could capture some structure information, but it demanded that the features be ordered. As a result, many structured SCCA approaches arose. Lin et al. [7] imposed the group lasso regularizer on the SCCA model, which could make use of non-overlapping group information. Chen et al. [10] proposed a structure-constrained SCCA (ssCCA), which used a graph-guided fused ℓ2-norm penalty for one canonical loading according to the features' biological relationships. Du et al. [8] proposed a structure-aware SCCA (S2CCA) to identify group-level bi-multivariate associations, which combined both the covariance matrix information and the prior group information via the group lasso regularizer. On one hand, these structured SCCA methods can generate a good result when the prior knowledge fits the hidden structure within the data well. On the other hand, they become inapplicable when the prior knowledge is incomplete or not available. Moreover, it is hard to precisely capture the prior knowledge in real-world biomedical studies.
To facilitate structural learning via grouping the weights of highly correlated features, graph theory has been widely utilized in sparse regression analysis [11-13]. Recently, graph theory has also been employed to address the grouping issue in SCCA. Let each graph vertex correspond one-to-one to a feature, and let $\rho_{ij}$ be the sample correlation between features i and j. Chen et al. [4, 5] proposed a network-structured SCCA (NS-SCCA), which used the ℓ1-norm of $|\rho_{ij}|(u_i-\mathrm{sign}(\rho_{ij})u_j)$ to pull positively correlated features together and to fuse negatively correlated features in the opposite direction. The knowledge-guided SCCA (KG-SCCA) [14] was an extension of both NS-SCCA [4, 5] and S2CCA [8]. It used an ℓ2-norm penalty on the signed pairwise difference for one canonical loading, similar to what Chen proposed, and employed the ℓ2,1-norm penalty for the other canonical loading. Both NS-SCCA and KG-SCCA could be used as group-pursuit methods if the prior knowledge was not available. However, one limitation of both models is that they depend on the sign of the pairwise sample correlation to recover the structure pattern. This may incur undesirable bias, since the sign of the correlations could be wrongly estimated due to possible graph misspecification caused by noise [13].
To address the issues above, we propose a novel structured SCCA which requires neither prior knowledge nor the sign of the sample correlations to be specified. It also works well if prior knowledge is provided. GOSC-SCCA, named from Graph Octagonal Selection and Clustering algorithm for Sparse Canonical Correlation Analysis, is inspired by the outstanding feature grouping ability of the octagonal selection and clustering algorithm for regression (OSCAR) [11] regularizer and the graph OSCAR (GOSCAR) [13] regularizer in regression tasks. Our contributions can be summarized as follows: 1) GOSC-SCCA could pull highly correlated features together when no prior knowledge is provided. While positively correlated features are encouraged to have similar weights, negatively correlated ones are also encouraged to have similar weights but with different signs. 2) GOSC-SCCA could reduce the estimation bias because it does not require the sign of the sample correlation to be specified. 3) We provide a theoretical quantitative description of the grouping effect of GOSC-SCCA.
We use both synthetic data and real imaging genetic data to evaluate GOSC-SCCA. The experimental results show that our method is better than or comparable to state-of-the-art methods, i.e., L1-SCCA, FL-SCCA [3] and KG-SCCA [14], in identifying stronger imaging genetic correlations and more accurate and cleaner canonical loading patterns. Note that the PMA software package was used to implement the L1-SCCA (SCCA with lasso penalty) and FL-SCCA (SCCA with fused lasso penalty) methods. Please refer to http://cran.r-project.org/web/packages/PMA/ for more details.

Methods

We denote a vector as a boldface lowercase letter, and a matrix as a boldface uppercase letter. $m^i$ indicates the i-th row of matrix $M=(m_{ij})$. Matrices $X\in\mathbb{R}^{n\times p}$ and $Y\in\mathbb{R}^{n\times q}$ denote two separate datasets collected from the same population. Imposing the lasso on a traditional CCA model [15], the L1-SCCA model is formulated as follows [3, 9]:
$$\min_{u,v}\; -u^\top X^\top Y v \quad \text{s.t.}\quad \|Xu\|_2^2=1,\; \|Yv\|_2^2=1,\; \|u\|_1\le c_1,\; \|v\|_1\le c_2, \tag{1}$$
where ||u||1≤c1 and ||v||1≤c2 are sparsity penalties controlling the complexity of the SCCA model. The fused lasso [2–4, 9] can also be used instead of the lasso. In order to make the problem convex, the equality sign is usually replaced by a less-than-or-equal sign, i.e., $\|Xu\|_2^2\le 1$ and $\|Yv\|_2^2\le 1$ [3].
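
The quantity that a CCA-type model seeks to maximize is the correlation between the projections Xu and Yv. As a minimal sketch, assuming NumPy and toy data invented purely for illustration, the following computes that correlation for given loadings. The function name `canonical_correlation` is hypothetical and is not part of the original method or of the PMA package.

```python
import numpy as np

def canonical_correlation(X, Y, u, v):
    """Pearson correlation between the projections X @ u and Y @ v."""
    a = X @ u
    b = Y @ v
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # degenerate projection, no correlation defined
    return float(a @ b / denom)

# Toy data (an assumption for illustration): both views share a latent signal z.
rng = np.random.default_rng(0)
z = rng.standard_normal(50)
X = np.outer(z, [1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal((50, 3))
Y = np.outer(z, [0.0, 1.0]) + 0.1 * rng.standard_normal((50, 2))
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0])
r = canonical_correlation(X, Y, u, v)
r_scaled = canonical_correlation(X, Y, 2.0 * u, 3.0 * v)  # scale-invariant
```

Because the correlation does not change when u or v is rescaled, the norm constraints in the model only fix the scale of the loadings.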

The graph OSCAR regularization

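The OSCAR regularizer was first introduced by Bondell et al. [11] and has been proved to group features automatically by encouraging highly correlated features to have similar weights. Formally, the OSCAR penalty is defined as follows,
$$\Omega_{\text{OSCAR}}(u)=\sum_{i<j}\max\{|u_i|,|u_j|\},\qquad \Omega_{\text{OSCAR}}(v)=\sum_{i<j}\max\{|v_i|,|v_j|\}. \tag{2}$$
Note that this penalty is applied to each feature pair. To make OSCAR more flexible, Yang et al. [13] introduced the GOSCAR,
$$\Omega_{\text{GOSCAR}}(u)=\sum_{(i,j)\in E_u}\max\{|u_i|,|u_j|\},\qquad \Omega_{\text{GOSCAR}}(v)=\sum_{(i,j)\in E_v}\max\{|v_i|,|v_j|\}, \tag{3}$$
where $E_u$ and $E_v$ are the edge sets of the u-related and v-related graphs, respectively. Obviously, the GOSCAR reduces to OSCAR when both graphs are complete [13]. Applying $\max\{|u_i|,|u_j|\}=\frac{1}{2}\left(|u_i-u_j|+|u_i+u_j|\right)$, the GOSCAR regularizer takes the following form,
$$\Omega_{\text{GOSCAR}}(u)=\frac{1}{2}\sum_{(i,j)\in E_u}\left(|u_i-u_j|+|u_i+u_j|\right),\qquad \Omega_{\text{GOSCAR}}(v)=\frac{1}{2}\sum_{(i,j)\in E_v}\left(|v_i-v_j|+|v_i+v_j|\right). \tag{4}$$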
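
The pairwise-max penalty and its rewriting through the identity max{|a|,|b|} = (|a−b| + |a+b|)/2 can be checked numerically. The sketch below is illustrative only: the function names, the toy vector and the edge list are assumptions and do not come from the paper.

```python
def goscar_penalty(u, edges):
    """Sum of max(|u_i|, |u_j|) over the graph edges (i, j)."""
    return sum(max(abs(u[i]), abs(u[j])) for i, j in edges)

def goscar_penalty_split(u, edges):
    """Same penalty written with max{|a|,|b|} = (|a - b| + |a + b|) / 2."""
    return 0.5 * sum(abs(u[i] - u[j]) + abs(u[i] + u[j]) for i, j in edges)

# Toy example: a chain graph on four features (hypothetical values).
u = [0.5, -0.5, 0.0, 2.0]
edges = [(0, 1), (1, 2), (2, 3)]
p_max = goscar_penalty(u, edges)          # 0.5 + 0.5 + 2.0
p_split = goscar_penalty_split(u, edges)  # same value via the identity
```

Note that the first edge contributes the same amount whether its two weights share a sign or not, which is why this penalty does not need the sign of the correlation.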

The GOSC-SCCA model

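Since the grouping effect is also an important consideration in SCCA learning, we propose to extend L1-SCCA to GOSC-SCCA by imposing GOSCAR in addition to the ℓ1-norm, as follows:
$$\min_{u,v}\; -u^\top X^\top Y v \quad \text{s.t.}\quad \|Xu\|_2^2\le 1,\; \|Yv\|_2^2\le 1,\; \Omega_{\text{GOSCAR}}(u)\le c_1,\; \Omega_{\text{GOSCAR}}(v)\le c_2,\; \|u\|_1\le c_3,\; \|v\|_1\le c_4, \tag{5}$$
where $\Omega_{\text{GOSCAR}}$ denotes the GOSCAR regularizer and (c1,c2,c3,c4) are parameters that control the solution path of the canonical loadings. Since S2CCA [8] has shown that the covariance matrix information could help improve the prediction ability, we also use $\|Xu\|_2^2\le 1$ and $\|Yv\|_2^2\le 1$ rather than $\|u\|_2^2\le 1$ and $\|v\|_2^2\le 1$. As a structured sparse model, GOSC-SCCA will encourage $u_i$ and $u_j$ to have similar magnitudes if the i-th feature and the j-th feature are highly correlated. We will give a quantitative description of this later.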

The proposed algorithm

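We can write the objective function in unconstrained form via the Lagrange multiplier method, i.e.
$$\mathcal{L}(u,v)=-u^\top X^\top Y v+\lambda_1\Omega_{\text{GOSCAR}}(u)+\lambda_2\Omega_{\text{GOSCAR}}(v)+\beta_1\|u\|_1+\beta_2\|v\|_1+\frac{\gamma_1}{2}\|Xu\|_2^2+\frac{\gamma_2}{2}\|Yv\|_2^2, \tag{6}$$
where $\Omega_{\text{GOSCAR}}$ denotes the GOSCAR regularizer, (λ1,λ2,β1,β2) are tuning parameters that have a one-to-one correspondence to parameters (c1,c2,c3,c4) in the GOSC-SCCA model [4], and (γ1,γ2) weight the two variance constraints. Taking the derivative with respect to u and v respectively, and setting them to zero, we obtain
$$-X^\top Y v+\lambda_1(L_1+\hat L_1)u+\beta_1\Lambda_1 u+\gamma_1 X^\top X u=0, \tag{7}$$
$$-Y^\top X u+\lambda_2(L_2+\hat L_2)v+\beta_2\Lambda_2 v+\gamma_2 Y^\top Y v=0, \tag{8}$$
where Λ1 is a diagonal matrix with the k1-th element as $1/|u_{k_1}|$, and Λ2 with the k2-th element as $1/|v_{k_2}|$; L1 is the Laplacian matrix, which can be obtained from L1=D1−W1; and $\hat L_1$ is a matrix obtained from $\hat L_1=\hat D_1+\hat W_1$. L2 and $\hat L_2$ have the same form as L1 and $\hat L_1$, respectively, based on v. In the initialization, W1 and $\hat W_1$ share the same off-diagonal entries. But W1 and $\hat W_1$ become different after each iteration, i.e., $w_{ij}=\frac{1}{2|u_i-u_j|}$ and $\hat w_{ij}=\frac{1}{2|u_i+u_j|}$ for each edge (i,j) of the u-related graph. If $|u_i-u_j|=0$, the corresponding element in matrix W1 will not exist. So we regularize it as $\frac{1}{2\sqrt{(u_i-u_j)^2+\zeta}}$ (ζ is a very small positive number) when $|u_i-u_j|=0$. We also approximate $|u_{k_1}|=0$ with $\sqrt{u_{k_1}^2+\zeta}$ for Λ1. Then the objective function regarding u is
$$\tilde{\mathcal{L}}(u)=-u^\top X^\top Y v+\frac{\lambda_1}{2}\sum_{(i,j)\in E_u}\left(\sqrt{(u_i-u_j)^2+\zeta}+\sqrt{(u_i+u_j)^2+\zeta}\right)+\beta_1\sum_{k}\sqrt{u_k^2+\zeta}+\frac{\gamma_1}{2}\|Xu\|_2^2.$$
It is easy to prove that $\tilde{\mathcal{L}}(u)$ will reduce to problem (6) regarding u when ζ→0. The cases of $|v_{k_2}|=0$ and $|v_i-v_j|=0$ can be addressed using a similar regularization method. D1 is a diagonal matrix and its i-th diagonal element is obtained by summing the i-th row of W1, i.e., $d_{ii}=\sum_j w_{ij}$. The diagonal matrix $\hat D_1$ is obtained from $\hat W_1$ in the same way. Likewise, we can calculate W2, $\hat W_2$, D2 and $\hat D_2$ by the same method in terms of v. Then according to Eqs. (7-8), we can obtain the solution to our problem with respect to u and v separately.
We observe that L1, $\hat L_1$ and Λ1 depend on u, which is an unknown variable, and that v, which is used to calculate L2, $\hat L_2$ and Λ2, is also unknown. Thus we propose an effective iterative algorithm to solve this problem. We first fix v to solve for u, and then fix u to solve for v. Algorithm 1 exhibits the pseudo code of the proposed GOSC-SCCA algorithm. For the key calculation steps, i.e., Step 5 and Step 10, we solve a system of linear equations with quadratic complexity rather than computing the matrix inverse with cubic complexity. Thus the whole algorithm can work with the desired efficiency. In addition, the algorithm is guaranteed to converge, and we will prove this in the next subsection.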
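
The alternating scheme described above (fix v and solve a reweighted linear system for u, then fix u and solve for v) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the weight formulas, the uniform initialization, the rescaling to unit variance after each solve, the shared parameter values and all function names are assumptions made for the sketch.

```python
import numpy as np

def reweighted_update(A, cross, cur, edges, lam, beta, gamma, zeta=1e-8):
    """One reweighted linear solve for a loading vector (u or v).

    A: data matrix of this view; cross: X.T @ Y @ v (for u) or Y.T @ X @ u (for v);
    cur: current estimate used to build the weights (assumed weight formulas).
    """
    p = cur.size
    W = np.zeros((p, p))      # weights on pairwise differences
    W_hat = np.zeros((p, p))  # weights on pairwise sums
    for i, j in edges:
        W[i, j] = W[j, i] = 0.5 / np.sqrt((cur[i] - cur[j]) ** 2 + zeta)
        W_hat[i, j] = W_hat[j, i] = 0.5 / np.sqrt((cur[i] + cur[j]) ** 2 + zeta)
    L = np.diag(W.sum(axis=1)) - W              # Laplacian D - W
    L_hat = np.diag(W_hat.sum(axis=1)) + W_hat  # counterpart for the sums
    Lam = np.diag(1.0 / np.sqrt(cur ** 2 + zeta))
    M = lam * (L + L_hat) + beta * Lam + gamma * (A.T @ A)
    new = np.linalg.solve(M, cross)  # linear solve, no explicit inverse
    scale = np.linalg.norm(A @ new)
    return new / scale if scale > 0 else new

def gosc_scca_sketch(X, Y, edges_u, edges_v, lam=0.1, beta=0.1, gamma=1.0,
                     tol=1e-5, max_iter=100):
    """Alternate between the u-update and the v-update until changes are small."""
    u = np.ones(X.shape[1]) / X.shape[1]  # assumed initialization
    v = np.ones(Y.shape[1]) / Y.shape[1]
    for _ in range(max_iter):
        u_new = reweighted_update(X, X.T @ Y @ v, u, edges_u, lam, beta, gamma)
        v_new = reweighted_update(Y, Y.T @ X @ u_new, v, edges_v, lam, beta, gamma)
        done = max(np.abs(u_new - u).max(), np.abs(v_new - v).max()) <= tol
        u, v = u_new, v_new
        if done:
            break
    return u, v

# Toy data (hypothetical): two views driven by one latent signal, chain graphs.
rng = np.random.default_rng(0)
z = rng.standard_normal(40)
X = np.outer(z, [1, 1, 1, 0, 0, 0]) + 0.3 * rng.standard_normal((40, 6))
Y = np.outer(z, [1, 1, 0, 0, 0]) + 0.3 * rng.standard_normal((40, 5))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
edges_u = [(i, i + 1) for i in range(5)]
edges_v = [(i, i + 1) for i in range(4)]
u, v = gosc_scca_sketch(X, Y, edges_u, edges_v)
```

Each update solves a symmetric positive definite system, since the graph terms are positive semidefinite and the diagonal ℓ1 weights are strictly positive, so `np.linalg.solve` is applicable without forming an inverse.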

Convergence analysis

We first introduce the following lemma.

Lemma 1

For any two nonzero real numbers $\tilde u$ and $u$, we have
$$|\tilde u|-\frac{\tilde u^2}{2|u|}\le |u|-\frac{u^2}{2|u|}.$$

Proof

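Given the lemma in [16], we have $\|\tilde u\|_2-\frac{\|\tilde u\|_2^2}{2\|u\|_2}\le\|u\|_2-\frac{\|u\|_2^2}{2\|u\|_2}$ for any two nonzero vectors. We also have $\|\tilde u\|_1=\|\tilde u\|_2=|\tilde u|$ and $\|u\|_1=\|u\|_2=|u|$ for any two nonzero real numbers, which completes the proof. □
Based on Lemma 1, the same inequality holds when $(\tilde u,u)$ is replaced by $(\tilde u_i,u_i)$, $(\tilde u_i-\tilde u_j,u_i-u_j)$ or $(\tilde u_i+\tilde u_j,u_i+u_j)$, provided that all of these quantities are nonzero. We now have the following theorem regarding the GOSC-SCCA algorithm.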
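
For scalars, the inequality behind Lemma 1 can also be verified directly. The short derivation below assumes the lemma takes the form commonly used for iteratively reweighted algorithms, with $\tilde u$ and $u$ denoting the two nonzero real numbers; it is a sketch for the reader, not the authors' proof.

```latex
\begin{align*}
(|\tilde u| - |u|)^2 \ge 0
&\;\Longrightarrow\; \tilde u^2 - 2|\tilde u|\,|u| + u^2 \ge 0 \\
&\;\Longrightarrow\; |\tilde u| - \frac{\tilde u^2}{2|u|} \le \frac{|u|}{2}
  = |u| - \frac{u^2}{2|u|},
\end{align*}
```

where the second step divides by $2|u|>0$ and rearranges.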

Theorem 1

The objective function value of GOSC-SCCA will monotonically decrease in each iteration until the algorithm converges.
The proof consists of two parts. (1) Part 1: From Step 3 to Step 7 in Algorithm 1, u is the only unknown variable to be solved. With v fixed, the objective function (6) can be equivalently transferred to a reweighted quadratic function of u whose weight matrices are computed from the current u. According to Step 5, the value of this quadratic function at $\tilde u$ is no larger than its value at u, where $\tilde u$ is the updated u. It is known that $u^\top L u=\sum_{(i,j)\in E}w_{ij}(u_i-u_j)^2$ if L is the Laplacian matrix [17]. Similarly, $u^\top \hat L u=\sum_{(i,j)\in E}\hat w_{ij}(u_i+u_j)^2$. Then according to Eq. (9), we obtain the corresponding inequalities for each feature and each feature pair. We first multiply 2λ1 on both sides of Eq. (13) for each feature pair separately, and do the same to both sides of Eq. (14). After that, we multiply β1 on both sides of Eq. (12). Finally, by adding all these inequalities to both sides of Eq. (15) accordingly, we arrive at an inequality between the objective values at $\tilde u$ and at u. Therefore, GOSC-SCCA will decrease the objective function in each iteration, i.e., $\mathcal{L}(\tilde u,v)\le\mathcal{L}(u,v)$. (2) Part 2: From Step 8 to Step 12, the only unknown variable is v. Similarly, we can arrive at the analogous inequality for the updated $\tilde v$. Thus GOSC-SCCA also decreases the objective function in each iteration during the second phase, i.e., $\mathcal{L}(\tilde u,\tilde v)\le\mathcal{L}(\tilde u,v)$. Based on the analysis above, we easily have $\mathcal{L}(\tilde u,\tilde v)\le\mathcal{L}(u,v)$ according to the transitive property of inequalities. Therefore, the objective value monotonically decreases in each iteration. Note that the CCA objective ranges over [-1,1], and both $u^\top X^\top X u$ and $v^\top Y^\top Y v$ are constrained to be 1. Thus $-u^\top X^\top Y v$ is lower bounded by -1, and so Eq. (6) is lower bounded by -1. In addition, Eqs. (16–17) imply that the KKT condition is satisfied. Therefore, the GOSC-SCCA algorithm will converge to a local optimum. □
Based on the convergence analysis, we set the stopping criterion of Algorithm 1 as $\max\{|\delta| \mid \delta\in(u^{t+1}-u^{t})\}\le\tau$ and $\max\{|\delta| \mid \delta\in(v^{t+1}-v^{t})\}\le\tau$, where $u^{t}$ and $v^{t}$ denote the estimates at iteration t and τ is a predefined estimation error. Here we set $\tau=10^{-5}$ empirically from the experiments.
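
The stopping criterion amounts to checking the largest element-wise change of both loading vectors between two consecutive iterations. A minimal sketch, with the hypothetical helper name `has_converged` and plain Python lists standing in for the loading vectors:

```python
def has_converged(u_old, u_new, v_old, v_new, tau=1e-5):
    """True when the largest absolute change in u and in v is at most tau."""
    du = max(abs(a - b) for a, b in zip(u_new, u_old))
    dv = max(abs(a - b) for a, b in zip(v_new, v_old))
    return du <= tau and dv <= tau

# Example: u changes by about 1e-6 and v does not change at all.
small_step = has_converged([1.0, 2.0], [1.0, 2.000001], [0.5], [0.5])
large_step = has_converged([1.0], [1.1], [0.5], [0.5])
```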

The grouping effect of GOSC-SCCA

For structured sparse learning in high-dimensional situations, the automatic feature grouping property is of great importance [18]. In regression analysis, Zou and Hastie [18] have suggested that a regressor exhibits the grouping effect when it assigns similar weights to the regression coefficients of the same group. This is also the case for structured SCCA methods. So, it is important and meaningful to investigate the theoretical bound of the grouping effect. We have the following theorem in terms of GOSC-SCCA.

Theorem 2

Let X and Y be two data sets, and (λ,β,γ) be the pre-tuned parameters. Let $(\hat u,\hat v)$ be the solution to our SCCA problem of Eqs. (10–11). Suppose the i-th feature and j-th feature only link to each other on the graph, and $\hat u_i$ and $\hat u_j$ are their optimal solutions. Then the pairwise difference between $\hat u_i$ and $\hat u_j$ satisfies an upper bound determined by $\rho_{ij}$ and $w_{ij}$, where $\rho_{ij}$ is the sample correlation between features i and j, and $w_{ij}$ is the corresponding element in the u-related matrix W1.
Let $(\hat u,\hat v)$ be the solution to our problem Eq. (6). Taking the partial derivative with respect to $\hat u_i$ and $\hat u_j$, respectively, gives two stationarity equations. We know that features i and j are only linked to each other, thus the i-th and j-th rows of the intermediate matrices each contain a single nonzero weight, and the two equations simplify according to the definitions of L1, $\hat L_1$ and Λ1. Subtracting these two equations, we obtain Eq. (20). Then we take the ℓ2-norm on both sides of Eq. (20) and apply the triangle inequality. Since our problem implies that the norm constraints hold at the solution, we arrive at the stated bound. □
The upper bound for the canonical loadings v can be obtained in the same way, where $\rho'_{ij}$ is the sample correlation between the i-th and j-th features in v, and $w'_{ij}$ is the corresponding element in the v-related matrix W2. Theorem 2 provides a theoretical upper bound for the difference between the estimated coefficients of the i-th feature and j-th feature. This bound may not seem tight. However, it is deliberately slack: it places little restriction on the pairwise difference of features i and j if $\rho_{ij}\ll 1$, which is desirable for two irrelevant features [19]. If two features have a very small correlation, i.e., $\rho_{ij}\approx 0$, their coefficients do not need to be the same or similar, so we do not need a tight bound on their pairwise difference. This quantitative description of the grouping effect makes the GOSCAR penalty an ideal choice for structured SCCA.

Results

We compare GOSC-SCCA with several state-of-the-art SCCA and structured SCCA methods, including L1-SCCA [3], FL-SCCA [3] and KG-SCCA [14]. We do not compare GOSC-SCCA with S2CCA [8], ssCCA [10] and CCA-SG (CCA Sparse Group) [7], since they require prior knowledge to be available in advance. We do not choose NS-SCCA [5] as a benchmark either, for the following two reasons. (1) NS-SCCA generates many intermediate variables during its iterative procedure. As the authors stated, NS-SCCA's per-iteration complexity is linear in (p+|E|), and thus the complexity becomes $O(p^2)$ when it is in the group pursuit mode. (2) Its penalty term is similar to that of KG-SCCA, which has been selected for comparison.
There are six parameters to be decided before using GOSC-SCCA, so tuning them blindly would take too much time. We tune the parameters following two principles. On one hand, Chen and Liu [5] found that the result is not very sensitive to γ1 and γ2, so we choose them from the small set {0.1, 1, 10}. On the other hand, if the parameters are too small, SCCA reduces to CCA because the penalties have only a negligible influence, whereas parameters that are too large over-penalize the results. Therefore, we tune the rest of the parameters within the range $\{10^{-3},10^{-2},10^{-1},10^{0},10^{1},10^{2},10^{3}\}$. In this study, we conduct all the experiments using the nested 5-fold cross-validation strategy, and the parameters are tuned on the training set only. In order to save time, we only tune these parameters on the first run of the cross-validation, that is, when the first four folds are used as the training set. Then we directly use the tuned parameters for all the remaining experiments. All methods use the same partition for cross-validation in the experiment.
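
The tuning setup described above can be sketched in a few lines: a fixed 5-fold partition shared by all methods, a grid of powers of ten for four of the parameters, and the small set {0.1, 1, 10} for γ1 and γ2. The exhaustive product over all combinations, the helper names and the sample size of 80 are assumptions made for illustration; the text does not state how the grid was searched.

```python
import itertools
import random

def make_folds(n_samples, n_folds=5, seed=0):
    """Shuffle the sample indices once and split them into n_folds parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]

penalty_grid = [10.0 ** e for e in range(-3, 4)]  # 1e-3, 1e-2, ..., 1e3
gamma_grid = [0.1, 1.0, 10.0]

def candidate_settings():
    """Iterate over (lam1, lam2, beta1, beta2, gamma1, gamma2) combinations."""
    return itertools.product(penalty_grid, penalty_grid, penalty_grid,
                             penalty_grid, gamma_grid, gamma_grid)

folds = make_folds(80)
# Parameters are tuned once, with the first four folds as the training set.
train_first_run = [i for fold in folds[:4] for i in fold]
n_settings = sum(1 for _ in candidate_settings())
```

Even this modest grid contains more than twenty thousand combinations, which illustrates why tuning on a single run of the cross-validation saves considerable time.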

Evaluation on synthetic data

We generate four synthetic datasets to investigate the performance of GOSC-SCCA and the benchmarks. Following [4, 5], these datasets are generated in four steps: 1) We predefine the structures and use them to create u and v respectively. 2) We create a latent vector z from N(0,I). 3) We create X and Y, where each sample of X is generated from z and u, and each sample of Y is generated from z and v. 4) For the first group of nonzero features in u, we change half of their signs, and also change the signs of the corresponding data. Since the synthetic datasets are order-independent, this setup is equivalent to randomly changing the signs of a portion of the features in u. Because we change the signs of both the coefficients and the data simultaneously, we still have X′u′=Xu, where X′ and u′ indicate the data and coefficients after the sign swap. We do the same on the Y side to make our simulation more challenging [13]. In addition, we set all four datasets with n=80, p=100 and q=120. They also have different correlation coefficients and different group structures. Therefore, the simulation is designed to cover a set of diverse cases for a fair comparison.
The estimated correlation coefficients of each method on the four datasets are contained in Table 1. The best values and those that are not significantly worse than the best values are shown in bold. On the training results, we observe that GOSC-SCCA either estimates the largest correlation coefficients (Dataset 1 and Dataset 4), or is not significantly worse than the best method (Dataset 2 and Dataset 3). GOSC-SCCA also has the best average correlation coefficients. On the testing results, GOSC-SCCA also outperforms the benchmarks in terms of the average correlation coefficients, though KG-SCCA does not perform significantly worse than our method. For the overall average obtained across the four datasets, GOSC-SCCA obtains better correlation coefficients than the competing methods on both the training set and the testing set.
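
The four generation steps, and in particular the fact that flipping the sign of a loading together with the matching data column leaves the projection unchanged, can be illustrated with a short sketch. The group sizes, the isotropic Gaussian noise and the function names are assumptions made for illustration; the exact sampling distributions used in the paper are not reproduced here.

```python
import numpy as np

def make_synthetic(n=80, p=100, q=120, noise=0.5, seed=0):
    """Steps 1-3: predefined loadings, latent z ~ N(0, I), noisy data."""
    rng = np.random.default_rng(seed)
    u = np.zeros(p)
    u[:20] = 1.0                   # one group of nonzero features (assumed)
    v = np.zeros(q)
    v[:30] = -1.0
    z = rng.standard_normal(n)     # latent vector
    X = np.outer(z, u) + noise * rng.standard_normal((n, p))
    Y = np.outer(z, v) + noise * rng.standard_normal((n, q))
    return X, Y, u, v

def swap_signs(A, w, idx):
    """Step 4: flip the selected loadings and the matching data columns."""
    A2, w2 = A.copy(), w.copy()
    A2[:, idx] *= -1.0
    w2[idx] *= -1.0
    return A2, w2

X, Y, u, v = make_synthetic()
X2, u2 = swap_signs(X, u, np.arange(10))   # half of the first group in u
Y2, v2 = swap_signs(Y, v, np.arange(15))   # same treatment on the Y side
```

After the swap, the first ten features of X are negatively correlated with the next ten, so a method that relies on a correctly estimated correlation sign faces a harder problem while the underlying association is unchanged.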

Table 1

5-fold cross-validation results on synthetic data

Training results
Methods			Dataset 1			MEAN			Dataset 2			MEAN			Dataset 3			MEAN			Dataset 4			MEAN	AVG.
L1-SCCA	0.52	0.56	0.52	0.53	0.51	0.53	0.25	0.29	0.16	0.20	0.23	0.23	0.56	0.24	0.57	0.53	0.52	0.48	0.46	0.50	0.53	0.48	0.35	0.46	0.43
FL-SCCA	0.52	0.60	0.52	0.53	0.50	0.53	NaN	NaN	0.17	NaN	0.23	0.08	0.63	0.43	0.56	0.55	0.55	0.54	0.51	0.56	NaN	0.53	0.40	0.40	0.39
KG-SCCA	0.52	0.55	0.52	0.53	0.53	0.53	0.25	0.29	0.15	0.20	0.22	0.22	0.56	0.24	0.43	0.52	0.52	0.45	0.51	0.56	0.48	0.52	0.40	0.49	0.42
GOSC-SCCA	0.57	0.62	0.57	0.59	0.63	0.60	0.26	0.30	0.15	0.21	0.17	0.22	0.64	0.31	0.42	0.61	0.59	0.51	0.51	0.56	0.55	0.54	0.41	0.52	0.46
Testing results
L1-SCCA	0.57	0.43	0.58	0.49	0.59	0.53	0.00	0.21	0.32	0.17	0.08	0.16	0.36	0.20	0.37	0.49	0.46	0.38	0.45	0.29	0.20	0.40	0.67	0.40	0.37
FL-SCCA	0.56	0.38	0.57	0.49	0.59	0.52	NaN	NaN	0.48	NaN	0.08	0.11	0.30	0.80	0.36	0.51	0.41	0.47	0.55	0.30	NaN	0.46	0.72	0.40	0.38
KG-SCCA	0.56	0.43	0.57	0.49	0.58	0.53	0.00	0.21	0.31	0.18	0.07	0.15	0.37	0.20	0.45	0.50	0.45	0.39	0.52	0.29	0.34	0.46	0.71	0.46	0.38
GOSC-SCCA	0.73	0.39	0.68	0.56	0.45	0.56	0.02	0.09	0.57	0.20	0.38	0.25	0.23	0.18	0.43	0.44	0.43	0.34	0.53	0.31	0.31	0.36	0.72	0.45	0.40

The estimated correlation coefficients and their MEAN are shown. 'NaN' means a method fails to estimate a pair of canonical loadings. '0.00' means a very small correlation coefficient. 'AVG.' denotes the MEAN across all four datasets. The best values and those that are NOT significantly worse than the best ones (t-test with p-value smaller than 0.05) are shown in bold

Figure 1 shows the estimated canonical loadings of all four SCCA methods in a typical run. As we can see, L1-SCCA cannot accurately recover the true signals, and it fails to recognize the coefficients whose signs were swapped. FL-SCCA slightly improves on L1-SCCA's performance but cannot identify the sign-swapped coefficients either. Our GOSC-SCCA successfully groups the nonzero features together and accurately recognizes the coefficients whose signs are changed. No matter what structures are within the dataset, GOSC-SCCA is able to estimate signals which are very close to the ground truth. Although KG-SCCA also recognizes the sign-swapped coefficients, it is unable to recover every group of nonzero coefficients. For example, KG-SCCA misses two groups of nonzero features in terms of v for the second dataset. The results on synthetic datasets reveal that GOSC-SCCA not only estimates stronger correlation coefficients than the competing methods, but also identifies more accurate and cleaner canonical loadings.

Fig. 1

Canonical loadings estimated on four synthetic datasets. The first column is for Dataset 1, and the second column is for Dataset 2, and so forth. For each dataset, the weights of u are shown on the left panel, and those of v are on the right. The first row is the ground truth, and each remaining row corresponds to a specific method: (1) Ground Truth. (2) L1-SCCA. (3) FL-SCCA. (4) KG-SCCA. (5) GOSC-SCCA

Evaluation on real neuroimaging genetics data

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see www.adni-info.org.
Table 2 contains the characteristics of the ADNI dataset used in this work. The participants include 568 non-Hispanic Caucasian subjects, comprising 196 healthy control (HC), 343 MCI and 28 AD participants. However, many participants' data are incomplete due to various factors such as data loss. After removing the participants with incomplete information, 282 participants remain in our experiments. The genotype data were downloaded from LONI (adni.loni.usc.edu), and the preprocessed [18F] florbetapir PET scans (i.e., amyloid imaging data) were also obtained from LONI. Before the experiment, the amyloid imaging data had been preprocessed, and the specific pipeline can be found in [14]. These imaging measures were adjusted by removing the effects of baseline age, gender, education, and handedness via the regression weights derived from HC participants. We finally obtained 191 region-of-interest (ROI) level amyloid measurements, which were extracted using the MarsBaR AAL atlas. We included four genetic markers, i.e., rs429358, rs439401, rs445925 and rs584007, from the known AD risk gene APOE. We intend to investigate whether GOSC-SCCA can identify the widely known associations between amyloid deposition and APOE SNPs.
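
The covariate adjustment mentioned above (removing the effects of age, gender, education and handedness with regression weights derived from HC participants) can be sketched as an ordinary least-squares fit on the HC subset followed by subtraction of the fitted covariate effects from every participant. This is one plausible reading of the description, assuming NumPy; the function name, the centering at the HC mean and the toy data are assumptions and do not reproduce the actual preprocessing pipeline.

```python
import numpy as np

def adjust_with_hc_weights(measures, covariates, is_hc):
    """Remove covariate effects using regression weights fitted on HC only."""
    hc_cov = covariates[is_hc]
    hc_mean = hc_cov.mean(axis=0)
    # Design matrix for the HC subset: intercept plus centered covariates.
    design_hc = np.column_stack([np.ones(hc_cov.shape[0]), hc_cov - hc_mean])
    coef, *_ = np.linalg.lstsq(design_hc, measures[is_hc], rcond=None)
    slopes = coef[1:]  # drop the intercept row
    # Apply the HC-derived slopes to all participants.
    return measures - (covariates - hc_mean) @ slopes

# Toy data (hypothetical): 30 subjects, 2 covariates, 3 imaging measures.
rng = np.random.default_rng(1)
covariates = rng.standard_normal((30, 2))
true_slopes = np.array([[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]])
measures = 3.0 + covariates @ true_slopes    # noiseless linear effects
is_hc = np.arange(30) < 10                   # first ten subjects play the HC role
adjusted = adjust_with_hc_weights(measures, covariates, is_hc)
```

In this noiseless toy case the adjusted measures become constant across subjects, since all of their variation was due to the covariates.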

Table 2

Real data characteristics

	HC	MCI	AD
Num	196	343	28
Gender(M/F)	102/94	203/140	18/10
Handedness(R/L)	178/18	309/34	23/5
Age (mean ±std.)	74.77 ±5.39	71.92 ±7.47	75.23 ±10.66
Education (mean ±std.)	15.61 ±2.74	15.99 ±2.75	15.61 ±2.74

Shown in Table 3 are the 5-fold cross-validation results of various SCCA methods. We observe that GOSC-SCCA and KG-SCCA obtain similar correlation coefficients on every run, in both training and testing performance. Besides, they both are significantly better than L1-SCCA and FL-SCCA, which is consistent with the analysis in [14]. This result shows that GOSC-SCCA can improve the ability to identify interesting imaging genetic associations compared with L1-SCCA and FL-SCCA.

Table 3

5-fold cross-validation results on real data

Methods	Training results					MEAN	Testing results					MEAN
L1-SCCA	0.50	0.50	0.53	0.53	0.54	0.52	0.56	0.61	0.45	0.47	0.38	0.49
FL-SCCA	0.44	0.43	0.46	0.45	0.46	0.45	0.49	0.56	0.39	0.43	0.37	0.45
KG-SCCA	0.53	0.52	0.55	0.54	0.56	0.54	0.56	0.61	0.47	0.52	0.45	0.52
GOSC-SCCA	0.53	0.52	0.55	0.55	0.56	0.54	0.56	0.62	0.47	0.51	0.45	0.52

The estimated correlation coefficients and their MEAN are shown. The best correlation coefficients and those that are NOT significantly worse than the best ones (t-test with p-value smaller than 0.05) are shown in bold

Figure 2 contains the estimated canonical loadings obtained from 5-fold cross-validation. To facilitate the interpretation, we employ a heat map for the real data. Each row denotes a method; u (genetic markers) is shown on the left panel and v (imaging markers) on the right. As we can see, on the genetic side, all four SCCA methods exhibit a similar canonical loading pattern. Since every SCCA method here incorporates the lasso (ℓ1-norm), they select only the APOE e4 SNP (rs429358), which is a widely known AD risk marker, and discard the irrelevant ones to ensure sparsity. On the imaging side, L1-SCCA identifies many signals, which are hard to interpret. FL-SCCA fuses adjacent features together due to its pairwise smoothness, which can be easily observed from the figure, but its result is also difficult to interpret. GOSC-SCCA and KG-SCCA perform similarly again in this run. They both identify the imaging signals in accordance with the findings in [20]. It is easy to observe that they estimate a very clean signal pattern, which makes further investigation easy. Recalling the results in Table 3, the association between the marker rs429358 and the amyloid accumulation in the brain is relatively strong, and thus the signal can be well captured by both KG-SCCA and GOSC-SCCA. In addition, the correlations among the imaging variables and those among the genetic variables are high enough that the signs of these correlations can hardly be affected by noise. That is, the signs of the sample correlations tend to be correctly estimated. Therefore, KG-SCCA does not suffer from the sign directionality issue, and so performs similarly to GOSC-SCCA. However, if some sample correlations are not very strong and their signs are mis-estimated, KG-SCCA may not work very well (see the results of the second synthetic dataset). In summary, this reveals that our method has better generalization ability and could identify biologically meaningful imaging genetic associations.

Fig. 2

Canonical loadings estimated on the real dataset. Each row corresponds to a SCCA method: (1) L1-SCCA. (2) FL-SCCA. (3) KG-SCCA. (4) GOSC-SCCA. For each row, the estimated weights of u are shown on the left figure, and those of v on the right

Discussion

In this paper, we have proposed a structured SCCA method, GOSC-SCCA, which is intended to reduce the estimation bias caused by the incorrect sign of the sample correlation. GOSC-SCCA employs the GOSCAR (Graph OSCAR) regularizer, which is an extension of the popular OSCAR penalty. GOSC-SCCA can pull highly correlated features together regardless of whether they are positively or negatively correlated. We also provide a theoretical quantitative description of the grouping effect of our SCCA method. An effective algorithm is proposed to solve the GOSC-SCCA problem, and the algorithm is guaranteed to converge.
We evaluated GOSC-SCCA and three other popular SCCA methods on both synthetic datasets and a real imaging genetics dataset. The synthetic datasets had different ground truths, i.e., different correlation coefficients and canonical loadings. GOSC-SCCA was capable of consistently identifying strong correlation coefficients on both the training set and the testing set, and either outperformed or performed similarly to the competing methods. Besides, the signals recovered by GOSC-SCCA were the closest to the ground truth among the compared methods. The results on the real data showed that both GOSC-SCCA and KG-SCCA could find an important association between the APOE SNPs and the amyloid burden measure in the frontal region of the brain.
KG-SCCA performs similarly to GOSC-SCCA on this real data largely because of the strong correlations between the variables within the genetic data, as well as those within the imaging data. In this case, the signs of the correlation coefficients between these variables tend to be correctly calculated, and so KG-SCCA does not have the sign directionality issue. On the other hand, if the correlations among some variables are not very strong, the performance of KG-SCCA can be affected by the mis-estimation of some correlation signs. In this case, GOSC-SCCA, which is designed to overcome the sign directionality issue, is expected to perform better than KG-SCCA. This has already been validated by the results of the second synthetic dataset.
The satisfactory performance of GOSC-SCCA, coupled with its theoretical convergence and grouping effect, demonstrates the promise of our method as an effective structured SCCA method for identifying meaningful bi-multivariate imaging genetic associations. The following are a few possible future directions. (1) Note that the identified pattern between the APOE genotype and amyloid deposition is a well-known and relatively strong imaging genetic association. Thus one direction is to apply GOSC-SCCA to more complex imaging genetic data to reveal novel but less obvious associations. (2) The data tested in this study are brain-wide but targeted only at APOE SNPs. Another direction is to apply GOSC-SCCA to imaging genetic data with higher dimensionality, where more effective and efficient strategies for parameter tuning and cross-validation warrant further investigation. (3) The third direction is to employ GOSC-SCCA as a knowledge-driven approach, where pathways, networks or other relevant biological knowledge can be incorporated in the model to aid association discovery. In this case, comparative studies can also be done between GOSC-SCCA and other state-of-the-art knowledge-guided SCCA methods in bi-multivariate imaging genetics analyses.

Conclusions

We have presented a new structured sparse canonical correlation analysis (SCCA) model for analyzing brain imaging genetics data and identifying interesting imaging genetic associations. This SCCA model employs a regularization term based on the graph octagonal selection and clustering algorithm for regression (GOSCAR). The goal is twofold: (1) to encourage highly correlated features to have similar canonical weights, and (2) to reduce the estimation bias by removing the requirement of pre-defining the sign of the sample correlation. As a result, it could pull highly correlated features together regardless of whether they are positively or negatively correlated. Empirical results on both synthetic and real data have demonstrated the promise of the proposed method.

14 in total

1. Canonical correlation analysis: an overview with application to learning methods.

Authors: David R Hardoon; Sandor Szedmak; John Shawe-Taylor
Journal: Neural Comput Date: 2004-12 Impact factor: 2.026

2. Network-constrained regularization and variable selection for analysis of genomic data.

Authors: Caiyan Li; Hongzhe Li
Journal: Bioinformatics Date: 2008-03-01 Impact factor: 6.937

3. Sparse canonical correlation analysis with application to genomic data integration.

Authors: Elena Parkhomenko; David Tritchler; Joseph Beyene
Journal: Stat Appl Genet Mol Biol Date: 2009-01-06

4. Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis.

Authors: Jun Chen; Frederic D Bushman; James D Lewis; Gary D Wu; Hongzhe Li
Journal: Biostatistics Date: 2012-10-15 Impact factor: 5.899

5. Interpretable whole-brain prediction analysis with GraphNet.

Authors: Logan Grosenick; Brad Klingenberg; Kiefer Katovich; Brian Knutson; Jonathan E Taylor
Journal: Neuroimage Date: 2013-01-05 Impact factor: 6.556

6. IMAGING GENETICS VIA SPARSE CANONICAL CORRELATION ANALYSIS.

Authors: Eric C Chi; Genevera I Allen; Hua Zhou; Omid Kohannim; Kenneth Lange; Paul M Thompson
Journal: Proc IEEE Int Symp Biomed Imaging Date: 2013-12-31

7. A novel structure-aware sparse learning algorithm for brain imaging genetics.

Authors: Lei Du; Yan Jingwen; Sungeun Kim; Shannon L Risacher; Heng Huang; Mark Inlow; Jason H Moore; Andrew J Saykin; Li Shen
Journal: Med Image Comput Comput Assist Interv Date: 2014

8. APOE and BCHE as modulators of cerebral amyloid deposition: a florbetapir PET genome-wide association study.

Authors: V K Ramanan; S L Risacher; K Nho; S Kim; S Swaminathan; L Shen; T M Foroud; H Hakonarson; M J Huentelman; P S Aisen; R C Petersen; R C Green; C R Jack; R A Koeppe; W J Jagust; M W Weiner; A J Saykin
Journal: Mol Psychiatry Date: 2013-02-19 Impact factor: 15.992