
Analysis of genome-wide association studies with multiple outcomes using penalization.

Jin Liu, Jian Huang, Shuangge Ma.

Abstract

Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and in marker identification. This study is partly motivated by the analysis of the heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is then adopted to select markers associated with all of the outcome variables, and an efficient computational algorithm is developed. Simulation studies and the analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods.


Year:  2012        PMID: 23272092      PMCID: PMC3522680          DOI: 10.1371/journal.pone.0051198

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

This study has been partly motivated by the analysis of the genetic architecture of complex traits in heterogeneous stock mice from the Wellcome Trust Centre. This data resource, which also includes pedigree information, was based on an advanced intercross among 8 inbred strains followed by over 50 generations of pseudorandom mating [1], [2]; breeding of this kind over 50 generations should result in an average distance between recombinants of 2 cM. The average linkage disequilibrium (LD) between adjacent markers is 0.62 [3]. As with many complex mammalian diseases, clinical risk factors and environmental exposures have failed to provide a comprehensive description of immunological disorders. The laboratory mouse is a key model organism for understanding gene function in mammals. Valdar et al. [1], [4] conducted a genome-wide association study and gene-environment interaction modeling to search for genetic markers potentially associated with phenotypes. We analyze the CD4/CD8 ratio and CD4:CD3 in this study. The CD4/CD8 ratio, also known as the T-Lymphocyte Helper/Suppressor Profile, is a basic laboratory test in which the percentages of CD3-positive lymphocytes in the blood positive for CD4 (T helper cells) and for CD8 (a class of regulatory T cells) are counted and compared. CD4:CD3 is another clinical index for immunological diseases. Both indices are related to the diagnosis of immunological diseases. Since the two indices are highly correlated and the mechanisms behind them are related, the potentially associated genetic markers are expected to be very similar. Thus it may be more powerful to analyze the two phenotypes simultaneously.

GWAS data have high dimensionality. Conventional statistical approaches analyze one SNP at a time and then adjust for multiple comparisons.
Such approaches are easy to implement; however, they conflict with the fact that the development and progression of complex diseases and traits are caused by the aggregated effects of multiple SNPs, and they may miss SNPs with weak marginal but strong joint effects. In the analysis of the joint effects of a large number of SNPs, regularized estimation is needed. In addition, it is expected that only a subset of the profiled SNPs are associated with the response variables, so marker selection is needed along with estimation. With high-dimensional data, penalization has been extensively applied for regularized estimation and variable selection. Commonly used penalization methods include Lasso, elastic net, bridge, SCAD, MCP and others. Such methods can effectively analyze data with a single response variable and interchangeable covariate effects. When a hierarchical structure exists among covariates, for example the two-level "pathway, SNP-within-pathway" structure, group versions of the aforementioned penalization methods have been proposed. The group penalty is usually a composite penalty: for example, with group SCAD [5], the outer penalty is SCAD and the inner penalty is ridge. We note that such group penalization methods are mainly used for the analysis of data with a single response variable. In this study, our goal is to analyze data with multiple correlated response variables and to conduct marker selection. In classic statistical analysis with a small number of covariates, data with multiple response variables can be accommodated under the framework of multivariate analysis of variance (MANOVA) [6] and multivariate analysis of covariance (MANCOVA). However, such methods cannot accommodate high-dimensional covariates. It is possible to first analyze each response variable separately with existing penalization methods, for example Lasso, and then combine the results using meta-analysis methods.
However, such an approach ignores the correlation among the response variables and hence can be less informative. Yuan and Ekici [7] introduced a nuclear norm approach that encourages sparsity among the singular values and yields shrunken coefficient estimates, thus conducting dimension reduction and coefficient estimation simultaneously in multivariate linear models. Chen et al. [8] proposed an approach for solving reduced-rank multivariate stochastic regression models. In the heterogeneous stock mice dataset, there are multiple continuously distributed, highly correlated response variables. Under a joint modeling framework, we propose first transforming the multi-response data into uni-response data following the same distribution, and then applying a group Lasso approach to the transformed uni-response data. With two responses, the effect of one SNP is represented by two regression coefficients, which naturally form a "group". We emphasize that, unlike other group penalization studies in which one group usually corresponds to multiple covariates, here one group corresponds to a single covariate for multiple responses.

Materials and Methods

Analysis of multi-response data

Consider data with multiple correlated response variables. With data like the heterogeneous stock mice from the Wellcome Trust Centre, it is reasonable to assume that the multiple responses share a certain common genetic basis, in particular the same set of susceptible SNPs. We note, however, that although the response variables are correlated, they are not identical. Given this inherent heterogeneity, it is not sensible to enforce the same model with the same regression coefficients for different response variables. Let q be the number of response variables, n the number of subjects, and p the number of SNPs. Denote y_1, …, y_q as the response variables and X as the n × p covariate matrix. For k = 1, …, q, assume that y_k is associated with X via the linear model y_k = X β_k + ε_k, where β_k is the regression coefficient vector corresponding to the kth response variable. We first transform the original data frame; for simplicity of notation, we use the same symbols with different subscripts for the new response variable. Although the proposed method can accommodate different covariates for distinct response variables, we assume that the same set of covariates is measured for all responses. Let y_i = (y_{i1}, …, y_{iq})' be the length-q vector of response variables for the ith subject. The covariates for the ith subject then take the block form X_i = x_i' ⊗ I_q, where x_i' is the ith row of X, and the stacked regression coefficient vector is b = (b_1', …, b_p')', where b_j = (β_{j1}, …, β_{jq})' collects the coefficients of the jth SNP across the q responses. To better illustrate the basic features of this model setting, consider a dataset with q = 2 response variables and p SNPs, and assume that only the first four SNPs are associated with the responses. Then the stacked coefficient vector has the form b = (b_1', b_2', b_3', b_4', 0, …, 0)' with b_j = (β_{j1}, β_{j2})' ≠ 0 for j ≤ 4. The regression coefficient and the corresponding model have the following features. First, only the four response-associated SNPs have nonzero regression coefficients (i.e., the model is sparse). Thus marker identification amounts to identifying SNPs with non-zero regression coefficients; this strategy has been commonly used in regularized marker selection.
Second, as the two response variables share the same susceptible SNPs, there is a natural grouping structure among the transformed covariates. For example, the first two regression coefficients/covariates correspond to the first SNP; they form a group of size two and should be selected at the same time. Motivated by the heterogeneous stock mice dataset, we describe the proposed approach for studies with quantitative traits under linear models. The proposed approach can be extended to other types of response variables and other statistical models, as long as the joint modeling of the response variables can be conducted. In a study with q response variables, the least-squares loss function for the transformed data can be written as

L(b) = (1/2) ∑_{i=1}^{n} (y_i − X_i b)' Σ⁻¹ (y_i − X_i b),

where Σ is the covariance matrix of the residuals.
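The data transformation described above can be sketched in a few lines of code. The following is a minimal illustration (the function name and array layout are ours, not from the paper), using the Kronecker structure X_i = x_i' ⊗ I_q so that the q coefficients of the jth SNP occupy adjacent positions:

```python
import numpy as np

def stack_responses(Y, X):
    """Transform multi-response data (Y: n x q, X: n x p) into the
    uni-response form used for joint modeling.  Each subject contributes
    q rows; the coefficients of SNP j across the q responses occupy the
    adjacent columns j*q, ..., j*q + q - 1, forming one group of size q."""
    n, q = Y.shape
    p = X.shape[1]
    y_new = Y.reshape(n * q)              # (y_11, ..., y_1q, y_21, ...)
    X_new = np.zeros((n * q, p * q))
    for i in range(n):
        # row block for subject i: x_i' (Kronecker product) I_q
        X_new[i * q:(i + 1) * q, :] = np.kron(X[i], np.eye(q))
    return y_new, X_new
```

With this layout, the stacked coefficient vector corresponding to a p × q coefficient matrix B is B.reshape(p * q), so X_new @ b reproduces the q separate regressions.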

Penalized estimation and marker selection

Penalized estimation

By definition, b_j is the vector of coefficients for the q responses at the jth locus. We define b̂ as the minimizer of the penalized least-squares loss function

b̂ = argmin_b { L(b) + λ ∑_{j=1}^{p} √d_j ‖b_j‖₂ },    (1)

where ‖·‖₂ is the ℓ₂ norm, λ > 0 is a tuning parameter, and d_j is the number of levels at the jth locus (equal to q under the present setting). Note that prior to the transformation, we assume that the response follows a multivariate normal distribution; after the transformation, each element of the new response follows a univariate normal distribution. We center the response so that the grand mean equals zero. The proposed penalty has been motivated by the following considerations. For a given SNP locus, we treat its regression coefficients for the q response variables as a group, so that we can evaluate its overall effect. The within-group penalty is an ℓ₂ norm, and the group-level penalty is an ℓ₁ norm. Thus the proposed penalty has the following main properties. First, it conducts group-level selection. Second, if a group is selected, then all members within that group are selected with non-zero estimates, although the magnitudes of the regression coefficients may differ; on the other hand, if a group is not selected, all of its members are set to zero. Such properties fit the goal of the proposed analysis. As discussed in [9], we orthogonalize the transformed covariates block-wise to achieve computational efficiency. Via the Cholesky decomposition, write X_j' X_j / n = R_j' R_j for an upper-triangular matrix R_j, which we assume is invertible. Let X̃_j = X_j R_j⁻¹ and b̃_j = R_j b_j; then the penalized least squares in expression (1) becomes

min_{b̃} { L̃(b̃) + λ ∑_{j=1}^{p} √d_j ‖b̃_j‖₂ },    (2)

where L̃ is the loss L evaluated with X̃_j in place of X_j. If we center the X̃_j, there is no need to fit an intercept in (2).
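The block-wise orthogonalization can be sketched as follows (a minimal illustration with our own function name, assuming each block X_j' X_j / n is positive definite):

```python
import numpy as np

def orthogonalize_groups(X, groups):
    """Block-wise orthonormalization via Cholesky decomposition.
    For each group j, write X_j' X_j / n = R_j' R_j with R_j upper
    triangular, and replace X_j by X_j R_j^{-1}, so the transformed
    group satisfies X_j' X_j / n = I."""
    n = X.shape[0]
    Xt = X.astype(float).copy()
    R = {}
    for j, idx in enumerate(groups):
        G = Xt[:, idx].T @ Xt[:, idx] / n
        R[j] = np.linalg.cholesky(G).T        # upper-triangular factor
        Xt[:, idx] = Xt[:, idx] @ np.linalg.inv(R[j])
    return Xt, R
```

After solving for the transformed coefficients b̃_j, the original coefficients are recovered as b_j = R_j⁻¹ b̃_j.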

Computational algorithm

We use the group cyclical coordinate descent (GCD) algorithm, a natural extension of the coordinate descent algorithm [10]. It optimizes the target function with respect to a single group of parameters at a time and iteratively cycles through all groups until convergence. It is particularly suitable for problems like the present one, which have a simple closed-form solution for a single group but lack one for multiple groups. The GCD algorithm proceeds as follows. For a given λ:

1. Let b̃^(0) be the initial estimate; a sensible initial estimate is zero (component-wise). Initialize the vector of residuals r = y − X̃ b̃^(0).
2. For j = 1, …, p: compute the group-wise least-squares solution z_j = X̃_j' r / n + b̃_j^(s); update b̃_j^(s+1) = (1 − λ √d_j / ‖z_j‖₂)₊ z_j; update the residuals r ← r − X̃_j (b̃_j^(s+1) − b̃_j^(s)).
3. Iterate Step 2 until convergence.

Breheny and Huang [11] discussed the convergence of coordinate descent algorithms for SCAD and MCP; we now consider the GCD for group Lasso. For any fixed λ, starting from an initial estimate b̃^(0), the GCD algorithm generates a sequence of updates b̃^(s), s = 1, 2, …. Since the corresponding sequence of objective values is non-increasing and bounded below by 0, it always converges. The following proposition concerns the convergence of the estimates themselves. Proposition 1: for any fixed λ, the GCD updates b̃^(s) converge to a global minimizer of the group Lasso criterion. This proposition can be proved following the arguments of [12], who established the convergence of coordinate descent algorithms for concave penalized selection methods, including the Lasso.
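The updates in Steps 1-3 can be sketched as follows (our own minimal implementation; it assumes the group blocks have been orthonormalized so that the single-group solution is the group soft-threshold):

```python
import numpy as np

def group_lasso_gcd(X, y, groups, lam, tol=1e-8, max_iter=500):
    """Group cyclical coordinate descent for group Lasso.
    Assumes each group's columns satisfy X_j' X_j / n = I, so the
    single-group subproblem has the closed-form group soft-threshold."""
    n = X.shape[0]
    b = np.zeros(X.shape[1])
    r = y - X @ b                          # residual vector
    for _ in range(max_iter):
        max_change = 0.0
        for idx in groups:
            Xj = X[:, idx]
            z = Xj.T @ r / n + b[idx]      # group-wise least-squares step
            norm_z = np.linalg.norm(z)
            if norm_z > 0:
                scale = max(0.0, 1.0 - lam * np.sqrt(len(idx)) / norm_z)
            else:
                scale = 0.0
            b_new = scale * z              # group soft-thresholding
            r -= Xj @ (b_new - b[idx])     # keep residuals in sync
            max_change = max(max_change, np.max(np.abs(b_new - b[idx])))
            b[idx] = b_new
        if max_change < tol:
            break
    return b
```

With λ = 0 and an orthonormal design this reduces to ordinary least squares; with a sufficiently large λ every group is thresholded to zero.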

Choice of tuning parameter

There are various methods that can be applied, including AIC, BIC, cross-validation, and generalized cross-validation. Chen and Chen [13] developed a family of extended Bayesian information criteria (EBIC) to overcome the overly liberal selection caused by the small-n-large-p situation, and further established the consistency of EBIC under generalized linear models in that setting [14]. For group Lasso, Yuan and Lin [15] proposed an approximation of the degrees of freedom (DF). Here we apply EBIC with the approximated DF to select the tuning parameter λ. The EBIC is defined as

EBIC(λ) = n log(RSS(λ)/n) + DF(λ) log n + 2γ DF(λ) log p,    (3)

where RSS(λ) is the residual sum of squares under a fixed λ. The DF for group Lasso [15] is approximated as

DF(λ) = ∑_j I(‖b̂_j‖ > 0) + ∑_j (‖b̂_j‖ / ‖b̂_j^LS‖)(d_j − 1),    (4)

where d_j is the number of predictors in the jth group and b̂_j^LS is the least-squares estimate for the jth group obtained by fitting group j only. Note that when d_j = 1 for all j, group Lasso reduces to Lasso, and its DF is the number of non-zero parameters selected. Therefore one can take Lasso as a special case of group Lasso, and the DF in expression (4) reduces accordingly.
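In code, the DF approximation and the EBIC score might be computed as below (a sketch: the function names are ours, and the EBIC is written in one standard form, n·log(RSS/n) + DF·log n + 2γ·DF·log p):

```python
import numpy as np

def group_lasso_df(b_hat, b_ls, groups):
    """Approximate degrees of freedom for group Lasso (Yuan & Lin):
    df = sum_j 1{||b_j|| > 0} + sum_j (||b_j|| / ||b_j^LS||) * (d_j - 1),
    where b_ls holds the group-wise least-squares estimates."""
    df = 0.0
    for idx in groups:
        nb = np.linalg.norm(b_hat[idx])
        if nb > 0:
            df += 1 + nb / np.linalg.norm(b_ls[idx]) * (len(idx) - 1)
    return df

def ebic(rss, n, df, n_groups, gamma=0.5):
    """One standard form of the extended BIC for linear models."""
    return n * np.log(rss / n) + df * np.log(n) + 2 * gamma * df * np.log(n_groups)
```

When every group has size one, `group_lasso_df` reduces to counting the non-zero coefficients, matching the Lasso special case noted above.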

Significance level for the selected SNPs

With penalization methods, the relevance of a covariate is usually determined by whether its regression coefficient is nonzero. As a secondary analysis, it may also be of interest to compute p-values. However, it should be noted that it is usually not sensible to use both the estimation magnitude (zero or nonzero) and the significance level for selection. Here we use a multi-split method, modified from the one proposed by Meinshausen et al. [17], to obtain p-values. With linear regression, we use an F-test for each group to evaluate whether there are elements in the group with significant effects, which yields p-values at the group level. The procedure is simulation-based and can adjust for multiple comparisons. The multi-split method proceeds as follows:

1. Randomly split the data into two disjoint sets of equal size.
2. Fit the first set with the proposed method. Denote the set of selected groups by S.
3. Compute P_j, the p-value for group j, as follows: if group j is in S, set P_j equal to the p-value of the F-test in the regular linear regression in which group j is the only group; if group j is not in S, set P_j = 1.
4. Define the adjusted p-value as P̃_j = min(P_j |S|, 1), where |S| is the size of S.

This procedure is repeated B times for each group. Let P̃_j^(b) denote the adjusted p-value for group j in the bth iteration. For γ ∈ (0, 1), let Q_j(γ) be the γ-quantile of {P̃_j^(b)/γ; b = 1, …, B}, truncated at 1. It is shown in [17] that Q_j(γ) is an asymptotically correct p-value, adjusted for multiplicity. The authors also proposed an adaptive version that selects a suitable quantile level based on the data:

P_j = min{1, (1 − log γ_min) inf_{γ ∈ (γ_min, 1)} Q_j(γ)},

where γ_min is chosen to be 0.05. It is shown that the P_j can be used for both FWER (family-wise error rate) and FDR (false discovery rate) control [17].
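The aggregation step, computing Q_j(γ) and the adaptive p-value, can be sketched as follows (the per-split adjusted p-values are taken as input; the grid over γ is our own choice):

```python
import numpy as np

def multi_split_pvalue(P, gamma_min=0.05):
    """Aggregate per-split adjusted p-values into one p-value per group.
    P has shape (B, G): the adjusted p-value for each of G groups in each
    of B random splits (1 for groups not selected in a split).
    Q_j(gamma) = min(1, gamma-quantile of {P_bj / gamma}); the adaptive
    version searches over gamma >= gamma_min with a log penalty."""
    B, G = P.shape
    gammas = np.linspace(gamma_min, 1.0, 20, endpoint=False)
    pvals = np.empty(G)
    for j in range(G):
        q = np.array([min(1.0, np.quantile(P[:, j] / g, g)) for g in gammas])
        pvals[j] = min(1.0, (1 - np.log(gamma_min)) * q.min())
    return pvals
```

A group that is never selected (all adjusted p-values equal to 1) ends up with an aggregated p-value of 1, while a group with uniformly tiny p-values across splits stays highly significant after the multiplicity adjustment.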

Results

Simulation studies

In our simulation, we consider six different scenarios, each with 500 subjects and 5,000 or 10,000 SNPs. For each subject, we simulate two response variables. The correlation between the two responses is set to 0.1, 0.5 or 0.9, representing weak, moderate and strong correlation. For each response variable, there are twelve SNPs with nonzero effects, which can be grouped into three clusters. Within each cluster, the correlation between two SNPs is 0.2. The correlation among SNPs not associated with the responses is also set to 0.2. Response-associated and noise SNPs are independent. More specifically, the genotypes are first generated from multivariate normal distributions and then categorized into 0, 1 or 2. To mimic SNPs with equal genotype frequencies, we categorize the latent normal variable z in a way similar to [16]: the genotype is set to 0, 1 or 2 depending on whether z < q_{1/3}, q_{1/3} ≤ z < q_{2/3}, or z ≥ q_{2/3}, where q_c is the c-quantile of N(0, 1). The regression coefficient vectors for the two response variables share the same support (SNPs 25–28, 41–44 and 57–60) but are not identical; the two responses depend on the same genotype data and are correlated through the residuals. A clustering structure thus exists in this simulation. To better gauge the performance of the proposed approach, we also consider the following alternative approach: we first analyze each response variable separately using Lasso, and then combine the results by examining the overlapping SNPs. For both approaches, we apply the EBIC method described in the previous section to select the tuning parameter λ. We evaluate the number of SNPs identified, the number of true positives, the false discovery rate (FDR) and the false negative rate (FNR). In addition, estimation performance is evaluated using the SSE (sum of squared errors). Results based on 100 replicates are summarized in Table 1. Note that the true response-associated SNPs are 25–28, 41–44 and 57–60 for both responses; in total, there are 24 SNP–response pairs with nonzero effects.
Table 1 shows that under all simulation scenarios, the proposed approach identifies almost all of the true positives, significantly more than the individual-analysis approach, at the price of a few more false positives. With the proposed approach, the highest FDR is 0.18, which can be acceptable in practice. Under all scenarios, the proposed approach has significantly smaller SSEs. Taking both marker identification and estimation into consideration, we conclude that the proposed approach provides a competitive alternative to the existing individual-analysis approach. For one simulated dataset, p-values evaluated by the multi-split method for the selected groups are presented in Table 2. Many true positives have significant p-values, while all false positives have insignificant p-values.
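The genotype-generation step can be sketched as below (a simplified single-block version; the cutoffs are the 1/3- and 2/3-quantiles of N(0, 1), hard-coded to four decimals):

```python
import numpy as np

def simulate_genotypes(n, p, rho=0.2, seed=0):
    """Generate an n x p genotype matrix: draw equicorrelated standard
    normals (pairwise correlation rho, via a shared factor) and cut them
    at the 1/3- and 2/3-quantiles of N(0,1), so the genotype codes 0/1/2
    are roughly equally frequent.  A simplified sketch of the scheme
    described in the text."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n, 1))                    # shared factor
    z = np.sqrt(rho) * u + np.sqrt(1 - rho) * rng.normal(size=(n, p))
    q13, q23 = -0.4307, 0.4307                     # 1/3-, 2/3-quantiles of N(0,1)
    return np.where(z < q13, 0, np.where(z < q23, 1, 2))
```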
Table 1

Simulation studies: the numbers are mean (standard deviation) based on 100 replicates.

p      ρ    True Positive  Model Size   FDR         FNR         SSE
5000   0.1  17.60(1.99)    20.18(2.93)  0.12(0.09)  0.27(0.08)  96.89(12.67)
5000   0.5  17.80(1.82)    20.66(3.02)  0.13(0.08)  0.26(0.08)  97.85(13.51)
5000   0.9  17.66(1.72)    20.32(3.11)  0.12(0.09)  0.26(0.07)  97.39(15.61)
10000  0.1  16.78(1.64)    19.24(2.76)  0.12(0.09)  0.30(0.07)  92.76(11.16)
10000  0.5  16.84(1.80)    18.96(2.56)  0.10(0.08)  0.30(0.07)  92.44(12.57)
10000  0.9  16.70(1.79)    18.22(2.41)  0.08(0.06)  0.30(0.07)  91.37(13.68)

False discovery rate (FDR) and false negative rate (FNR) are reported together with true positives and model sizes.

Table 2

Multi-split p-values for simulated data with all matched non-zero coefficients and ρ = 0.9.

             p = 5000               p = 10000
SNP index    Estimate   p-value     Estimate   p-value
25           0.293      9.3E−10     0.318      2.2E−10
26           0.270      7.3E−12     0.263      1.5E−07
27           0.263      3.1E−10     0.346      2.9E−10
28           0.251      1.1E−11     0.301      1.2E−08
41           0.264      0.054       0.182      1.000
42           0.100      1.000       0.336      0.007
43           0.245      0.006       0.249      0.019
44           0.096      1.000       0.177      0.798
57           0.107      0.004       0.093      1.000
58           0.174      1.000       0.071      1.000
59           0.183      2.3E−05     0.173      7.4E−05
60           0.089      1.000       0.094      1.000
342          0.006      1.000
2200         0.009      1.000
3623         0.010      1.000
3920         0.013      1.000
4177         0.004      1.000
4555         0.003      1.000
5494                                0.008      1.000
5899                                0.037      1.000
7156                                0.020      1.000
9061                                0.001      1.000
9343                                0.004      1.000
9501                                0.004      1.000
9884                                0.013      1.000

Empty cells stand for SNPs that are not identified from the model.

With the proposed approach, it is assumed that the multiple responses of interest have exactly the same set of important SNPs. Such an assumption is reasonable under some settings but too restrictive under others. To obtain a more comprehensive understanding of the proposed approach, we also conduct simulations where the two sets of important SNPs are only partially matched. In Table 3, we consider the setting where 25% of the important SNPs are not matched; in Table 4, we consider the scenario with 50% unmatched important SNPs. Under both scenarios, the proposed approach identifies more true positives, but the model sizes and FDRs are considerably larger. Such an observation is reasonable: when a SNP associated with only one response variable is identified by the proposed approach, it is automatically identified for the response variable it is not associated with, creating one false positive. Thus, with the proposed approach and partially matched important SNP sets, identifying more true positives inevitably leads to much larger model sizes. It is interesting to note that under all simulation scenarios, the proposed approach still has significantly smaller SSEs.
Table 3

Simulation studies: the numbers are mean (standard deviation) based on 100 replicates.

p      ρ    True Positive  Model Size   FDR         FNR         SSE
5000   0.1  16.96(1.92)    19.52(3.07)  0.12(0.09)  0.29(0.08)  90.82(12.20)
5000   0.5  16.96(1.82)    19.82(3.45)  0.13(0.10)  0.29(0.08)  91.35(12.19)
5000   0.9  17.10(1.67)    19.68(3.45)  0.12(0.10)  0.29(0.07)  91.63(13.66)
10000  0.1  16.06(1.73)    18.42(3.38)  0.11(0.09)  0.33(0.07)  86.30(10.36)
10000  0.5  15.92(1.70)    18.24(2.88)  0.12(0.09)  0.34(0.07)  86.13(10.90)
10000  0.9  15.96(1.75)    17.88(2.80)  0.10(0.08)  0.34(0.07)  85.43(12.50)

False discovery rate (FDR), false negative rate (FNR) and sum of squared errors (SSE) are reported together with true positives and model sizes. 25% of the regression coefficients are not matched.

Table 4

Simulation studies: the numbers are mean (standard deviation) based on 100 replicates.

p      ρ    True Positive  Model Size   FDR         FNR         SSE
5000   0.1  16.94(1.89)    19.80(2.92)  0.14(0.09)  0.29(0.08)  89.29(11.00)
5000   0.5  17.00(1.92)    19.82(2.99)  0.13(0.08)  0.29(0.08)  89.67(11.41)
5000   0.9  17.08(1.90)    20.02(3.47)  0.13(0.09)  0.29(0.08)  89.94(14.40)
10000  0.1  16.26(1.55)    19.36(3.08)  0.15(0.09)  0.32(0.06)  84.41(10.06)
10000  0.5  16.20(1.58)    19.06(2.85)  0.14(0.09)  0.32(0.07)  84.24(10.48)
10000  0.9  16.16(1.60)    18.34(2.50)  0.11(0.08)  0.33(0.07)  83.38(11.16)

False discovery rate (FDR), false negative rate (FNR) and sum of squared errors (SSE) are reported together with true positives and model sizes. 50% of the regression coefficients are not matched.

Here we focus on the scenario with two response variables to match the data analyzed in the next section. It is possible to conduct the analysis with three or more responses, at a higher computational cost.

Application to heterogeneous stock mice dataset

The heterogeneous stock mice dataset is described in the Introduction; we refer to the original publications for more detailed descriptions [1], [2], [4]. This dataset includes full phenotypic records on 2,202 mice, each genotyped at 13,459 SNP markers. In joint modeling, SNPs with missing values cannot be included, so we implement fastPHASE to impute the missing genotypes [18]. After deleting observations with missing phenotypes and SNPs with minor allele frequency less than 0.05, there are 1,514 mice and 9,991 SNP markers on 19 autosomes. We analyze the data using three different approaches: the traditional one-SNP-at-a-time analysis, analysis of each individual response using Lasso, and the proposed approach. In Figure 1, we show the absolute values of the estimates from the single-SNP analysis of both the CD4/CD8 ratio and CD4:CD3; the single-SNP analysis is conducted with a Bonferroni correction at overall p-value 0.05. In Figure 2, we show the estimates from Lasso on both phenotypes and from the proposed method. In Figure 1, the signal-to-noise ratio is weak, and it is difficult to tell the real associated signals from the background. In contrast, in Figure 2 the signal-to-noise ratio is strong, and a small number of SNPs are selected by the Lasso and the proposed method. When analyzing each response separately using Lasso and the multiple responses using the proposed method, we use the EBIC method described in the previous section to select the tuning parameter λ, and the multi-split method to evaluate the significance of the selected SNPs. In Figure 2, the larger dots represent selected SNPs with significant p-values. In Table 5, the total number of significant SNPs is given in parentheses for the Lasso on both phenotypes and for the proposed method. Detailed information on the SNPs selected by the proposed method and by the individual Lasso analyses of the CD4/CD8 ratio and CD4:CD3 is presented in Tables 6, 7 and 8, respectively.
Note that there is no one-to-one correspondence between the magnitude of the estimates and the significance level. Such an observation is not uncommon in regression analysis. In addition, the proposed penalization approach is based on Lasso, which is known to shrink estimates towards zero. Another observation is that SNPs in high LD may have very different estimates, which is also as expected: in single-response analysis, Lasso tends to select one out of a set of highly correlated covariates. Thus it is possible, or even likely, that among SNPs in high LD, one has a large estimate while the others have very small or zero estimates. The numbers of selected SNPs and the overlaps among the proposed method, the Lasso method and the single-SNP analysis are presented in Table 5. The single-SNP analysis selects a large number of SNPs, which may be partly because the selection of assayed SNPs is not totally random.
Figure 1

Absolute values of estimates from the simple linear regression on CD4/CD8 ratio and CD4∶CD3.

Figure 2

Absolute values of estimates from Lasso on CD4/CD8 ratio and CD4∶CD3, and estimates from the proposed method. Smaller dots represent SNPs selected by the Lasso/proposed method with insignificant multi-split p-values; larger dots represent SNPs with significant p-values.

Table 5

Number of SNPs identified, and overlap of SNPs among the proposed method, the Lasso and single-SNP analysis for heterogeneous stock mice dataset.

Method                      # of SNPs   # of overlapping SNPs
                                        L1*   L2**   S1***   S2****
The Proposed Method         45(38)      12    13     38      45
Lasso on M1                 53(49)            10     51      53
Lasso on M2                 31(28)                   30      31
single-SNP analysis on M1   2964                             2964
single-SNP analysis on M2   3128

Short for Lasso on M1.

Short for Lasso on M2.

Short for single-SNP analysis on M1.

Short for single-SNP analysis on M2.

The number in the parenthesis is the number of SNPs with significant -values.

Table 6

SNPs selected by the proposed method on both phenotypes CD4/CD8 ratio and CD4∶CD3.

SNP         Chr  Position   MAF    Band     Gene*           Estimate  p-value
rs13475794  1    32202097   0.189  1qB      Khdrbs2         0.024     1.7E−07
rs13475847  1    45969220   0.301  1qC1.1   Slc40a1         0.008     2.8E−01
rs3679459   1    120341835  0.098  1qE2.3   Clasp1          0.024     4.7E−08
rs8256197   1    130485642  0.428  1qE4     Cxcr4           0.006     3.8E−05
rs8256196   1    130485675  0.428  1qE4     Cxcr4           5.1E−15   4.3E−05
rs3682465   2    156317950  0.146  2qH1     Epb4.1l1        0.004     3.9E−07
rs3718812   3    52605874   0.155  3qC      Cog6            0.036     3.1E−08
rs3659643   3    115759847  0.205  3qG1     Extl2           2.3E−03   2.2E−06
rs6176477   3    117874757  0.259  3qG1     Snx7            4.3E−04   9.6E−06
rs13460366  4    129804978  0.137  4qD2.2   Pef1            5.5E−04   0.241
rs13477979  4    130004434  0.137  4qD2.2   Zcchc17         2.6E−15   0.332
rs13477980  4    130281564  0.137  4qD2.2   Pum1            2.4E−16   0.332
rs13478285  5    61706070   0.078  5qC3.1   G6pd2           0.015     0.003
rs3692826   5    63287018   0.078  5qC3.1   Gm17384         8.5E−16   0.004
rs6222023   5    76590704   0.397  5qC3.3   Srd5a3          0.007     2.1E−08
rs3711751   5    137393986  0.290  5qG2     4933404O12Rik   0.009     5.9E−07
rs13478656  6    21893927   0.078  6qA3.1   Ing3            0.038     1.0E−18
rs3665567   6    71342207   0.442  6qC1     Rmnd5a          0.043     6.4E−13
rs3671932   6    134808128  0.228  6qG1     Crebl2          0.041     2.9E−07
rs3657482   7    121209199  0.458  7qF1     Rras2           0.013     0.559
rs13479673  8    30344780   0.102  8qA3     Unc5d           0.017     3.5E−08
rs33227034  8    131027085  0.480  8qE2     Nrp1            0.013     0.016
rs29634420  9    16961090   0.075  9qA2     Gm5611          0.006     0.015
rs13480141  9    36754648   0.474  9qA4     Pknox2          0.015     6.7E−07
rs13480826  10   127874456  0.194  10qD3    Rnf41           0.017     2.6E−04
rs3719526   10   127890255  0.194  10qD3    Smarcc2         1.4E−14   2.6E−04
rs3670360   11   6153674    0.107  11qA1    Ddx56           0.051     4.2E−10
rs13481186  11   100224674  0.441  11qD     Jup             0.003     2.1E−06
rs13481187  11   100513551  0.441  11qD     Zfp385c         5.5E−15   2.1E−06
rs6393715   11   111796714  0.322  11qE2    Gm11679         0.002     1.000
rs13472132  13   55515090   0.184  13qB1    Slc34a1         0.002     3.0E−05
rs3692326   13   99316615   0.143  13qD1    Gm10320         0.028     1.0E−18
rs4161101   16   10701008   0.369  16qA1    Clec16a         0.002     0.537
rs4163042   16   13142435   0.172  16qA1    Ercc4           0.001     0.005
rs3714738   16   14722893   0.091  16qA1    Si2             0.008     1.5E−03
rs4219905   16   92999911   0.348  16qC4    Runx1           0.024     3.3E−09
rs33886220  17   33354677   0.345  17qB1    Zfp955a         0.165     1.0E−18
rs33477985  17   33744640   0.345  17qB1    Myo1f           1.7E−15   1.0E−18
rs33661797  17   35276713   0.456  17qB1    Bag6            0.038     2.4E−10
rs13482968  17   37268628   0.445  17qB1    Olfr93          0.061     1.8E−10
rs33270235  17   38311721   0.093  17qB1    Olfr134         0.011     1.0E−18
rs3668036   17   45823731   0.339  17qB3    Tmem63b         0.004     7.4E−11
rs3712953   17   50402827   0.076  17qC     Dazl            0.021     6.0E−08
rs3720827   18   63449870   0.248  18qE1    Fam38b          0.020     1.0E−18
rs13483449  18   77559708   0.141  18qE3    8030462N17Rik   0.023     5.1E−11

Gene names that SNPs belong to or are closest to.

Table 7

SNPs selected by individual Lasso on CD4/CD8 ratio.

SNP         Chr  Position   MAF    Band     Gene*           Estimate  p-value
rs13475847  1    45969220   0.301  1qC1.1   Slc40a1         0.017     1.4E−04
rs3727162   1    118830782  0.098  1qE2.3   Cntp5a          −0.015    8.7E−11
rs13476234  1    172771818  0.479  1qH3     Atf6            0.031     0.004
rs13476239  1    174151892  0.346  1qH3     Atp1a4          −0.002    1.9E−05
rs13476242  1    175295510  0.423  1qH3     Cadm3           −0.034    2.5E−04
rs13476251  1    176722388  0.430  1qH3     Fmn2            −0.024    1.000
rs13476764  2    127974055  0.360  2qF1     Bcl2l11         0.007     1.000
rs6411422   2    128199227  0.447  2qF1     Gm14005         0.018     1.000
rs3718812   3    52605874   0.155  3qC      Cog6            2.4E−04   8.2E−10
rs3674296   3    52738092   0.155  3qC      Cog6            −0.010    8.5E−10
rs3709732   3    117669810  0.259  3qG1     Snx7            −0.012    7.6E−12
rs6176477   3    117874757  0.259  3qG1     Snx7            0.013     7.8E−12
rs13477434  3    136689645  0.482  3qG3     Gm10955         0.008     3.7E−07
rs13477551  4    9051848    0.463  4qA1     Rps18-ps2       0.008     2.9E−09
rs13477584  4    17051798   0.353  4qA2     Gm11850         −0.009    9.7E−10
rs13478285  5    61706070   0.078  5qC3.1   G6pd2           −0.027    0.001
rs13478286  5    62201328   0.078  5qC3.1   G6pd2           3.1E−13   0.003
rs29501536  5    72711078   0.465  5qC3.2   Corin           −0.011    1.0E−18
rs31537882  5    72995337   0.465  5qC3.2   Cnga1           −3.4E−14  6.4E−13
rs3711751   5    137393986  0.290  5qG2     4933404O12Rik   0.026     2.29E−12
rs13478801  6    65057795   0.365  6qC1     Smarcad1        −0.009    1.0E−18
rs3665567   6    71342207   0.436  6qC1     Rmnd5a          −0.072    1.0E−18
rs13478941  6    103348834  0.441  6qE1     Chl1            0.015     5.6E−08
rs6334723   6    134651968  0.368  6qG1     Loh12cr1        −0.042    1.1E−12
rs13479376  7    91596873   0.074  7qD3     Gm2115          0.005     4.2E−08
rs13479465  7    120046978  0.075  7qF1     Tead1           −0.020    2.1E−04
rs13479621  8    15993378   0.451  8qA1.1   Csmd1           −0.044    1.0E−18
rs13479930  8    97201198   0.248  8qC5     Pllp            0.013     0.001
rs6180306   8    109166165  0.074  8qD3     Cdh1            −0.006    0.002
rs29634420  9    16961090   0.075  9qA2     Gm5611          −0.018    4.2E−04
rs6280411   10   125575083  0.451  10qD3    AC153489.1      −0.024    1.11E−07
rs3701568   10   128933102  0.248  10qD3    Olfr790         0.001     1.1E−09
rs3670360   11   6153674    0.107  11qA1    Ddx56           0.039     1.0E−18
rs3656583   11   64442910   0.456  11qB3    Gm12291         0.030     1.5E−09
rs6297520   11   64472210   0.456  11qB3    Gm12291         −0.003    2.5E−09
rs13481170  11   95489416   0.074  11qD     Gm11528         0.038     1.0E−18
rs3684699   12   28209015   0.076  12qA2    Sox11           0.027     0.001
rs13481411  12   42060667   0.071  12qB1    Immp2l          −0.015    0.001
rs13481412  12   42722660   0.071  12qB1    Immp2l          2.4E−16   0.004
rs13472132  13   55515090   0.184  13qB1    Slc34a1         0.020     3.4E−08
rs13482225  14   65324729   0.363  14qD1    Kif13b          −0.011    0.076
rs4139535   14   109988359  0.082  14qE3    Slitrk1         −0.004    0.009
rs6209981   14   110067383  0.082  14qE3    Slitrk1         −6.3E−15  0.031
rs31100152  14   110432009  0.082  14qE3    n-R5s50         7.7E−14   0.041
rs4163058   16   13269758   0.181  16qA1    Mkl2            0.003     9.3E−06
rs4163196   16   13400890   0.181  16qA1    Mkl2            1.5E−16   8.8E−05
rs4199044   16   69289859   0.449  16qC2    Speer2          0.007     1.2E−06
rs13482952  17   32937360   0.345  17qB1    Zfp811          0.230     1.0E−18
rs13459151  17   33078090   0.345  17qB1    Cyp4f13         1.3E−16   1.0E−18
rs33886220  17   33354677   0.345  17qB1    Zfp955a         −1.4E−17  1.0E−18
rs33661797  17   35276713   0.456  17qB1    Bag6            0.126     2.1E−11
rs3712953   17   50402827   0.076  17qC     Dazl            0.076     1.0E−18
rs6194426   19   50203520   0.286  19qD1    Sorcs1          0.010     1.5E−06

Gene names that SNPs belong to or are closest to.

Table 8

SNPs selected by individual Lasso on CD4∶CD3.

SNP         Chr  Position   MAF    Band     Gene*           Estimate  p-value
rs13475794  1    32202097   0.189  1qB      Khdrbs2         −9.5E−04  1.9E−05
rs13476239  1    174151892  0.346  1qH3     Atp1a4          0.012     0.014
rs3682465   2    156317950  0.146  2qH1     Epb4.1l1        0.007     4.7E−08
rs3679962   3    127795535  0.490  3qG2     Gm10650         0.016     0.135
rs6290401   3    142297855  0.314  3qH1     Gbp2            0.003     1.2E−08
rs13477459  3    142492044  0.354  3qH1     Pkn2            0.011     2.6E−09
rs29501536  5    72711078   0.465  5qC3.2   Corin           6.1E−05   1.0E−18
rs3710735   5    73123583   0.465  5qC3.2   Txk             −1.4E−16  2.2E−12
rs6340166   5    73188279   0.465  5qC3.2   Tec             0.011     1.000
rs4225267   5    73700837   0.465  5qC3.2   Ociad1          −1.8E−16  1.000
rs3711751   5    137393986  0.290  5qG2     4933404O12Rik   −0.010    9.4E−12
rs13478656  6    21893927   0.078  6qA3.1   Ing3            0.014     2.0E−10
rs13478800  6    64766250   0.435  6qC1     Atoh1           0.005     1.0E−18
rs3665567   6    71342207   0.442  6qC1     Rmnd5a          0.100     1.0E−18
rs13479621  8    15993378   0.441  8qA1.1   Csmd1           0.027     1.0E−18
rs13479673  8    30344780   0.102  8qA3     Unc5d           0.011     8.6E−09
rs13480141  9    36754648   0.474  9qA4     Pknox2          0.029     1.2E−11
rs13480153  9    40483617   0.455  9qA5.1   9030425E11Rik   −0.003    1.7E−11
rs6280411   10   125575083  0.451  10qD3    AC153489.1      0.028     8.5E−11
rs13480817  10   125932724  0.451  10qD3    AC153489.1      −1.7E−16  3.2E−10
rs29383570  10   127146595  0.420  10qD3    Myo1a           0.004     4.3E−10
rs13481170  11   95489416   0.074  11qD     Gm11528         −0.055    1.0E−18
rs3692326   13   99316615   0.143  13qD1    Gm10320         0.014     1.0E−18
rs4219905   16   92999911   0.348  16qC4    Runx1           −0.040    1.0E−18
rs13482952  17   32937360   0.345  17qB1    Zfp811          −0.194    1.0E−18
rs33661797  17   35276713   0.456  17qB1    Bag6            −0.154    1.0E−18
rs33270235  17   38311721   0.093  17qB1    Olfr134         0.007     1.0E−18
rs3712953   17   50402827   0.076  17qC     Dazl            −0.113    1.0E−18
rs13483448  18   77559708   0.141  18qE3    Loxhd1          −0.023    6.7E−13
rs13483449  18   77876027   0.141  18qE3    8030462N17Rik   4.8E−15   8.6E−13

*Gene names that SNPs belong to or are closest to.

[Figure captions (images not shown): absolute values of coefficient estimates from single-SNP simple linear regression on CD4/CD8 ratio and CD4CD3; estimates from individual Lasso on CD4/CD8 ratio and CD4CD3; and estimates from the proposed method. Smaller dots mark SNPs selected by the Lasso/proposed method with insignificant multi-split p-values; larger dots mark SNPs with significant p-values. "M1" and "M2" denote the two response variables; the number in parentheses is the number of SNPs with significant p-values.]

With our limited knowledge of susceptibility SNPs for immunology, we cannot objectively evaluate the biological implications of the identified SNPs. As an alternative, we evaluate prediction performance, which may provide partial information on identification performance: (a) randomly split the sample into five parts of equal size; (b) analyze four parts using the proposed approach; (c) use the resulting model to predict outcomes for subjects in the left-out part; (d) repeat steps (b) and (c) over all five parts. For comparison, the same scheme is applied to the individual Lasso approach. The prediction mean squared errors are 1.66 for the proposed approach and 2.33 for the combined Lasso; by jointly analyzing the two responses, the proposed approach achieves better prediction performance.
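The five-fold evaluation described above can be sketched as follows. This is an illustration on synthetic data, not on the mice data: scikit-learn's MultiTaskLasso (whose l2/l1 penalty is a group Lasso across responses) stands in for the proposed joint method, and a per-trait Lasso stands in for the individual analysis; the sample size, number of SNPs, and tuning parameter are our own choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, MultiTaskLasso
from sklearn.model_selection import KFold

# Synthetic stand-in for the mice data: n subjects, p SNP-like covariates,
# and two correlated traits driven by the same 5 causal covariates
# (assumed structure, mirroring the shared-marker setup of the paper).
rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.standard_normal((n, p))
B = np.zeros((p, 2))
B[:5, :] = rng.standard_normal((5, 2))          # shared important covariates
Y = X @ B + 0.5 * rng.standard_normal((n, 2))   # two response variables

kf = KFold(n_splits=5, shuffle=True, random_state=1)

def cv_mse_joint(alpha=0.1):
    """Five-fold prediction MSE for the joint (group-Lasso-type) model."""
    errs = []
    for tr, te in kf.split(X):
        fit = MultiTaskLasso(alpha=alpha).fit(X[tr], Y[tr])
        errs.append(np.mean((Y[te] - fit.predict(X[te])) ** 2))
    return float(np.mean(errs))

def cv_mse_individual(alpha=0.1):
    """Five-fold prediction MSE fitting each trait with its own Lasso."""
    errs = []
    for tr, te in kf.split(X):
        pred = np.column_stack([
            Lasso(alpha=alpha).fit(X[tr], Y[tr][:, k]).predict(X[te])
            for k in range(Y.shape[1])
        ])
        errs.append(np.mean((Y[te] - pred) ** 2))
    return float(np.mean(errs))

mse_joint, mse_indiv = cv_mse_joint(), cv_mse_individual()
```

When the two traits share the same important covariates, the joint fit typically yields a cross-validated MSE at least as good as the trait-by-trait fits, mirroring the 1.66 versus 2.33 comparison reported above.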

Discussion

In the study of complex diseases, it is not uncommon that a single trait cannot provide a comprehensive description, and multiple traits need to be measured. In this article, we analyze data with multiple response variables under the assumption that they share the same set of important SNPs. A penalization approach is developed for marker selection. The proposed approach accommodates the joint effects of multiple SNPs and is more informative than single-SNP analysis. Compared with existing approaches that analyze different traits separately, it more effectively accommodates the correlation among traits and hence is more efficient in marker selection. Numerical studies, including simulation and the analysis of the heterogeneous stock mice dataset, show satisfactory performance of the proposed approach. The heterogeneous stock mice data have two continuous response variables with marginally normal distributions. For other types of response variables, there is a rich literature on joint modeling, which can be coupled with the proposed marker-selection procedure. The proposed approach is based on the group Lasso penalty; we expect that other "group-type" penalties, such as group SCAD or group bridge, can also be applied. The group Lasso is chosen for its relatively low computational cost, which is especially desirable with high-throughput data. Our numerical study focuses on scenarios where the MAFs are not very low. When the MAFs are low, our unpublished numerical study suggests that penalization methods may not perform well because the covariate design matrix is "overly sparse"; the use of penalization methods with rare variants is still being explored. Analysis of the heterogeneous stock mice data shows that the proposed approach can identify SNPs missed by single-response analysis and has improved prediction performance. The proposed method therefore provides a useful alternative for the analysis of multivariate traits in GWAS.
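To make the penalty concrete: with q response variables sharing the same p SNPs, a group-Lasso criterion of the kind described here groups each SNP's coefficients across responses. The following is a sketch in our own notation, ignoring for simplicity any residual-correlation weighting in the full joint model:

```latex
\min_{B=(\beta_{jk})}\;
\frac{1}{2}\sum_{k=1}^{q}\bigl\|y_k - X\beta_{\cdot k}\bigr\|_2^2
\;+\;\lambda\sum_{j=1}^{p}\bigl\|\beta_{j\cdot}\bigr\|_2
```

Here \beta_{j\cdot}=(\beta_{j1},\ldots,\beta_{jq}) collects SNP j's coefficients over all q responses; penalizing its \ell_2 norm selects a SNP for all responses or for none, which is how the method borrows strength across correlated traits. Group SCAD or group bridge would substitute their respective group penalties for this \ell_1-of-\ell_2 term.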
References (9 in total)

1.  Genome-wide genetic association of complex traits in heterogeneous stock mice.

Authors:  William Valdar; Leah C Solberg; Dominique Gauguier; Stephanie Burnett; Paul Klenerman; William O Cookson; Martin S Taylor; J Nicholas P Rawlins; Richard Mott; Jonathan Flint
Journal:  Nat Genet       Date:  2006-07-09       Impact factor: 38.330

2.  A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.

Authors:  Paul Scheet; Matthew Stephens
Journal:  Am J Hum Genet       Date:  2006-02-17       Impact factor: 11.025

3.  Group SCAD regression analysis for microarray time course gene expression data.

Authors:  Lifeng Wang; Guang Chen; Hongzhe Li
Journal:  Bioinformatics       Date:  2007-04-26       Impact factor: 6.937

4.  Genome-wide association analysis by lasso penalized logistic regression.

Authors:  Tong Tong Wu; Yi Fang Chen; Trevor Hastie; Eric Sobel; Kenneth Lange
Journal:  Bioinformatics       Date:  2009-01-28       Impact factor: 6.937

5.  SparseNet: Coordinate Descent With Nonconvex Penalties.

Authors:  Rahul Mazumder; Jerome H Friedman; Trevor Hastie
Journal:  J Am Stat Assoc       Date:  2011       Impact factor: 5.033

6.  Genetic and environmental effects on complex traits in mice.

Authors:  William Valdar; Leah C Solberg; Dominique Gauguier; William O Cookson; J Nicholas P Rawlins; Richard Mott; Jonathan Flint
Journal:  Genetics       Date:  2006-08-03       Impact factor: 4.562

7.  COORDINATE DESCENT ALGORITHMS FOR NONCONVEX PENALIZED REGRESSION, WITH APPLICATIONS TO BIOLOGICAL FEATURE SELECTION.

Authors:  Patrick Breheny; Jian Huang
Journal:  Ann Appl Stat       Date:  2011-01-01       Impact factor: 2.083

8.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

9.  A protocol for high-throughput phenotyping, suitable for quantitative trait analysis in mice.

Authors:  Leah C Solberg; William Valdar; Dominique Gauguier; Graciela Nunez; Amy Taylor; Stephanie Burnett; Carmen Arboledas-Hita; Polinka Hernandez-Pliego; Stuart Davidson; Peter Burns; Shoumo Bhattacharya; Tertius Hough; Douglas Higgs; Paul Klenerman; William O Cookson; Youming Zhang; Robert M Deacon; J Nicholas P Rawlins; Richard Mott; Jonathan Flint
Journal:  Mamm Genome       Date:  2006-02-06       Impact factor: 2.957

