
Comparison between single-marker analysis using Merlin and multi-marker analysis using LASSO for Framingham simulated data.

Yun Ju Sung, Treva K Rice, Gang Shi, C Charles Gu, DC Rao.

Abstract

We compared family-based single-marker association analysis using Merlin and multi-marker analysis using LASSO (least absolute shrinkage and selection operator) for the low-density lipoprotein phenotype at the first visit for all 200 replicates of the Genetic Analysis Workshop 16 Framingham simulated data sets. Using "answers," we selected single-nucleotide polymorphisms (SNPs) on chromosome 22 for comparison of results between single-marker and multi-marker analyses. For the major causal SNP rs2294207 on chromosome 22, both single-marker and multi-marker analyses provided similar results, indicating the importance of this SNP. For the 12 polygenic SNPs on the same chromosome, both single-marker and multi-marker analyses failed to provide statistically significant associations, indicating that their effects were too weak to be detected by either method. The main difference between the two methods was that for the 14 SNPs near the causal SNPs, p-values from Merlin were the next smallest, whereas LASSO often excluded these non-causal neighboring SNPs entirely from the first 10,000 models.

Year:  2009        PMID: 20018017      PMCID: PMC2795924          DOI: 10.1186/1753-6561-3-s7-s27

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

Association analysis is often performed using single markers or haplotype analysis of multiple single-nucleotide polymorphisms (SNPs) within adjoining short regions or candidate genes. However, analysis that simultaneously uses multiple markers may be more powerful for detecting several causal genes and, hence, may be more appropriate for complex diseases [1]. The least absolute shrinkage and selection operator (LASSO) is a penalized least squares method that imposes an L1-penalty on the regression coefficients [2]. Because this penalty induces shrinkage, prediction using LASSO is more reproducible than ordinary multiple linear regression when there are more predictors than individuals (small n, large p). Unlike ordinary least squares, LASSO can also handle the multicollinearity that results from highly correlated markers. Moreover, due to the nature of the L1-penalty, many regression coefficients are exactly zero. Hence, LASSO performs shrinkage and automatic variable selection simultaneously, a form of parsimonious model selection. Our main goal in this paper was to explore the performance of LASSO for SNP selection in association analysis. In particular, we compared the relative importance (ranks) of SNPs provided by LASSO to that of SNPs inferred by single-marker analysis.

Methods

Phenotypes and genotypes

We used the low-density lipoprotein (LDL) phenotype at the first visit for all 200 replicates of the Genetic Analysis Workshop 16 (GAW16) Framingham simulated data sets. This phenotype was adjusted for age, smoking, and diet separately for both sexes and then corrected for medication (HMG-CoA reductase inhibitors) [3]. Because the GAW16 data set only contained individuals with genotypes, we created records for untyped parents as founder individuals. Because their actual relationship with other members in the same family ID was not provided, one extended family was often divided into multiple families: 1129 families with size ranging from 1 to 470 became 1920 families with size ranging from 1 to 72. Chromosome 22 included one major causal SNP and 12 polygenic SNPs that influenced the simulated LDL phenotype [4]. To reduce the number of SNPs, we chose 5011 SNPs located between 23.28 Mb and 49.10 Mb, 0.1 Mb in each direction past the left and right influencing SNPs. We excluded SNPs with minor allele frequency (MAF) less than or equal to 0.003 (we wanted to include one polygenic SNP with MAF 0.004). The final data set for analysis consisted of 4589 SNPs and 6857 individuals.
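The MAF filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes genotypes coded as 0/1/2 allele counts with `None` for missing values, counts alleles as if individuals were unrelated, and uses hypothetical SNP names and a toy threshold.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF of one SNP coded as 0/1/2 allele counts; None marks a missing genotype."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("no observed genotypes")
    freq = sum(observed) / (2 * len(observed))  # frequency of the counted allele
    return min(freq, 1 - freq)                  # the minor allele is the rarer one


def filter_by_maf(snps, threshold=0.003):
    """Keep SNPs whose MAF is strictly greater than the threshold (exclude MAF <= threshold)."""
    return {name: g for name, g in snps.items() if minor_allele_frequency(g) > threshold}


# Toy data with hypothetical SNP names; the threshold is raised so the tiny sample is informative.
snps = {
    "snp_common": [0, 1, 2, 1, 0, None, 1, 0],
    "snp_monomorphic": [0, 0, 0, 0, 0, 0, 0, 0],
    "snp_flipped": [2, 2, 2, 1, 2, 2, 2, 2],
}
kept = filter_by_maf(snps, threshold=0.05)
print(sorted(kept))
```

The strict inequality mirrors the exclusion rule "less than or equal to 0.003", so a SNP with MAF 0.004 is retained at the default threshold.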

Single-marker analysis using Merlin

For single-marker analysis, we used Merlin [5,6]. The family-based association test provided by Merlin has two advantages. First, missing genotypes (1.5% of all genotypes) were imputed, using flanking markers and family relationships, and incorporated in the association test. Second, unlike most family-based linkage and association programs, which do not provide results for data sets with Mendelian-inconsistent genotypes, the Merlin association test does provide results by ignoring families with Mendelian-inconsistent genotypes. Even though this may not be an optimal way to handle genotype errors, it avoids the need to remove genotype errors, which can be tedious for data sets with a large number of SNPs and large families. Linkage disequilibrium (LD) between the major causal SNP and other SNPs (measured by r^2) was computed using the R package genetics.
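As a rough illustration of the LD measure, the sketch below computes r^2 as the squared Pearson correlation between two genotype vectors coded 0/1/2. This is a common approximation and an assumption here: the R package genetics uses its own estimator, which may give somewhat different values, and the function name and toy genotypes are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2 (None = missing)."""
    # Use only individuals typed at both SNPs.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (sxy * sxy) / (sxx * syy)


# Toy genotypes: the "neighbor" differs from the "causal" SNP in one individual.
causal = [0, 1, 2, 1, 0, 2, 1, 0]
neighbor = [0, 1, 2, 1, 0, 2, 1, 1]
print(genotype_r2(causal, neighbor))
```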

Multi-marker analysis using LASSO

For the covariate-adjusted phenotype $y_i$ and SNPs $x_{i1}, \ldots, x_{ip}$ of the $i$th individual, LASSO minimizes $\sum_{i} \bigl(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\bigr)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$. The LASSO solution path provides a sequence of models, from the simplest model including only an intercept (when t = 0) to the most complex model including all SNPs as predictors (when t is very large). If a particular SNP becomes a predictor in the kth model, then that SNP tends to stay as a predictor in all larger models, but this does not always happen. For ranking SNPs, we used this "entry" number, which indicates when a particular SNP first becomes a predictor in the LASSO solution path. For our analysis, we evaluated the first 10,000 models in the LASSO solution path, using the R package lars [7]. We used Merlin to impute missing SNPs because lars requires each individual to have values for all predictors: removing individuals with partially missing SNPs would make use of only one-tenth of the data. This also makes the data set more consistent with the one used for single-marker analysis.
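The idea of an entry number can be illustrated with a small sketch. Note the substitution: the analysis above used the lars package, which traces the exact solution path, whereas this sketch solves the equivalent penalized (Lagrangian) form by coordinate descent on a decreasing grid of penalties and records the first grid step at which each coefficient becomes nonzero. All names, the grid, and the toy data are assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(columns):
    """Center each predictor and scale it to unit mean square (assumes no constant column)."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        centered = [v - mean for v in col]
        scale = (sum(v * v for v in centered) / n) ** 0.5
        result.append([v / scale for v in centered])
    return result


def lasso_cd(columns, y, lam, beta, sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1, warm-started at beta."""
    n = len(y)
    beta = list(beta)
    resid = [y[i] - sum(b * col[i] for b, col in zip(beta, columns)) for i in range(n)]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Covariance of predictor j with the residual that leaves predictor j out.
            rho = sum(col[i] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * col[i]
                beta[j] = new
    return beta


def entry_numbers(columns, y, n_steps=30):
    """Map predictor index -> first grid step at which its coefficient is nonzero."""
    X = standardize(columns)
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    # Smallest penalty at which every coefficient is zero.
    lam_max = max(abs(sum(col[i] * yc[i] for i in range(n))) / n for col in X)
    beta = [0.0] * len(X)
    entry = {}
    for step in range(1, n_steps + 1):
        lam = lam_max * 0.01 ** (step / n_steps)  # geometric grid ending at 1% of lam_max
        beta = lasso_cd(X, yc, lam, beta)
        for j, b in enumerate(beta):
            if b != 0.0 and j not in entry:
                entry[j] = step
    return entry


# Deterministic toy data: SNP 0 is causal, SNP 1 is a correlated neighbor, SNP 2 is unrelated.
n = 84
x0 = [[0, 0, 1, 0, 1, 2][i % 6] for i in range(n)]
x1 = [(g + 1) % 3 if i % 5 == 0 else g for i, g in enumerate(x0)]
x2 = [[0, 1, 0, 2, 1, 0, 1][i % 7] for i in range(n)]
noise = [0.3 * ((i * 37) % 11 - 5) / 5 for i in range(n)]
y = [2 * g + e for g, e in zip(x0, noise)]

entry = entry_numbers([x0, x1, x2], y)
print(entry)  # predictors missing from the dict never entered on this grid
```

A predictor that never enters on the grid plays the role of a SNP that is absent from the first 10,000 models. Because the grid is coarse, several predictors can share an entry step, which does not happen with the exact path.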

Results

Figure 1A shows association test results for Replicate 1 of the 200 simulated LDL phenotypes; results were consistent across all 200 replicates (Table 1). The major causal SNP rs2294207 showed statistically significant association, with a p-value of 4.5 × 10^-19 for Replicate 1; across all 200 replicates, this SNP ranked 1.1 on average (Table 1), with p-values ranging from 6.9 × 10^-13 to 1.6 × 10^-29. In Replicate 1, 14 SNPs near the major causal SNP (10 SNPs around 30.91 Mb and 4 SNPs around 30.95 Mb) had p-values ranging from 3.0 × 10^-8 to 3.8 × 10^-19 (Figure 1A); these SNPs showed significant association across all 200 replicates (Table 1). Ranks of these neighboring SNPs were almost in the order of their LD with the causal SNP. Of the 12 polygenic SNPs, the most significantly associated was rs5765113 (p-value 3.5 × 10^-5, ranking 20 for Replicate 1); across all 200 replicates, this SNP ranked 35.8 on average (Table 1), with p-values ranging from 5.7 × 10^-2 to 7.9 × 10^-8.
Figure 1

Association tests of 4589 SNPs on chromosome 22 for Replicate 1 of the simulated LDL phenotype. A, p-values from single-marker analysis using Merlin; B, entry numbers from multi-marker analysis using LASSO; C, comparison of ranks from Merlin and LASSO (correlation = 0.08). Red dots indicate 1 major causal SNP and 12 polygenic SNPs. Cyan points in A indicate 960 SNPs that were not in any of the first 10,000 models from LASSO.

Table 1

Summary statistics, based on Replicates 1 through 200, for 12 polygenic SNPs and SNPs near the major causal SNP rs2294207 (shown in red) in chromosome 22.

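Columns E(p) through Max.p are p-values from Merlin; columns E(e) through Out are entry numbers from LASSO.

| Marker | Loc (Mb) | LD^a | E(p)^b | rank E(p)^c | E rank(p)^d | Min.p | Max.p | E(e) | rank E(e) | E rank(e) | Min.e | Max.e^e | Out^f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs131464 | 23.3948 | 0 | 0.075 | 94 | 439.6 | 1.1 × 10^-4 | 0.67 | 1228 | 205 | 954.5 | 56 | 8702 | 0 |
| rs133252 | 24.137 | 0 | 0.27 | 525 | 1345 | 7.9 × 10^-4 | 0.98 | 1690 | 466 | 1170 | 32 | 10001 | 4 |
| rs5752309 | 25.0701 | 0 | 0.066 | 74 | 390.3 | 6.6 × 10^-6 | 0.99 | 1570 | 398 | 915.6 | 31 | 10001 | 11 |
| rs1543335 | 33.5136 | 0 | 0.64 | 4457 | 3007.5 | 5.2 × 10^-2 | 1 | 6452 | 3757 | 3113.3 | 445 | 10001 | 82 |
| rs17002034 | 39.3263 | 0 | 0.026 | 36 | 184.2 | 3.3 × 10^-6 | 0.48 | 5127 | 3167 | 2499.2 | 192 | 10001 | 71 |
| rs6519313 | 40.8828 | 0 | 0.14 | 198 | 743.2 | 2.8 × 10^-4 | 0.95 | 3431 | 2009 | 2021.8 | 23 | 10001 | 20 |
| rs7364152 | 41.6466 | 0 | 0.15 | 212 | 785.2 | 1.2 × 10^-4 | 0.85 | 6747 | 3873 | 3293.2 | 93 | 10001 | 73 |
| rs5765113 | 43.7736 | 0 | 1.9 × 10^-3 | 20 | 35.8 | 7.9 × 10^-8 | 0.057 | 2462 | 1125 | 1438.2 | 200 | 10001 | 19 |
| rs6007503 | 43.989 | 0 | 0.069 | 80 | 396.5 | 9.0 × 10^-3 | 0.78 | 2539 | 1211 | 1766.3 | 130 | 10001 | 3 |
| rs12159871 | 47.74 | 0 | 0.39 | 1077 | 1902 | 1.4 × 10^-3 | 1 | 8620 | 4397 | 3781 | 1802 | 10001 | 141 |
| rs4528878 | 47.8062 | 0 | 0.38 | 1013 | 1851.9 | 0.01 | 1 | 5139 | 3175 | 2708.2 | 432 | 10001 | 53 |
| rs17013240 | 48.9997 | 0 | 0.56 | 2735 | 2671.8 | 0.016 | 1 | 6241 | 3666 | 3134.9 | 529 | 10001 | 63 |
| rs5994481 | 30.8996 | 0.06 | 0.016 | 25 | 123 | 5.0 × 10^-6 | 0.52 | 3586 | 2136 | 2155.4 | 15 | 10001 | 13 |
| rs136414 | 30.9051 | 0.33 | 4.5 × 10^-6 | 6 | 8.4 | 1.6 × 10^-13 | 1.8 × 10^-4 | 3809 | 2301 | 2013.1 | 87 | 10001 | 49 |
| rs136416 | 30.9052 | 0.32 | 5.1 × 10^-6 | 7 | 8.8 | 1.7 × 10^-14 | 2.0 × 10^-4 | 5726 | 3452 | 2751.9 | 1 | 10001 | 77 |
| rs136417 | 30.9053 | 0.32 | 5.2 × 10^-6 | 8 | 9.5 | 2.3 × 10^-14 | 2.0 × 10^-4 | 8015 | 4246 | 3470.7 | 3 | 10001 | 139 |
| rs136422 | 30.9064 | 0.32 | 6.4 × 10^-6 | 9 | 10.5 | 5.3 × 10^-14 | 2.6 × 10^-4 | 6769 | 3882 | 3050.7 | 3 | 10001 | 106 |
| rs136457 | 30.9147 | 0.22 | 1.7 × 10^-5 | 11 | 11.9 | 5.1 × 10^-13 | 1.4 × 10^-3 | 8810 | 4429 | 3750.6 | 7 | 10001 | 159 |
| rs136458 | 30.9147 | 0.22 | 1.7 × 10^-5 | 12 | 12.4 | 5.1 × 10^-13 | 1.4 × 10^-3 | 8871 | 4437 | 3789 | 155 | 10001 | 158 |
| rs136460 | 30.9148 | 0.22 | 1.6 × 10^-5 | 10 | 11.2 | 3.7 × 10^-13 | 1.4 × 10^-3 | 8222 | 4310 | 3530.7 | 8 | 10001 | 144 |
| rs136477 | 30.9184 | 0.22 | 2.0 × 10^-5 | 14 | 13.3 | 2.1 × 10^-12 | 1.4 × 10^-3 | 4543 | 2791 | 2557.2 | 165 | 10001 | 37 |
| rs136485 | 30.9221 | 0.22 | 1.8 × 10^-5 | 13 | 11.7 | 7.0 × 10^-13 | 1.4 × 10^-3 | 5008 | 3092 | 2649 | 3 | 10001 | 54 |

a. LD, linkage disequilibrium between the major causal SNP and other SNPs (measured by r^2)

b. E(p), averaged p-value over 200 replications

c. rank E(p), rank of the averaged p-value over 200 replications

d. E rank(p), averaged rank of p-values (similarly for entry number)

e. Max.e, 10001 if the SNP was not in any of the first 10,000 models

f. Out, count of replicates for which the SNP was excluded in the LASSO solution path (up to 10,000 models)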
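The two rank summaries in Table 1 are easy to confuse: rank E(p) ranks the p-values after averaging over replicates, whereas E rank(p) ranks within each replicate and then averages. A minimal sketch of the distinction, with hypothetical function names, toy p-values, and ties broken by position for simplicity:

```python
def ranks(values):
    """Rank 1 = smallest value; ties are broken by position (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def summarize(replicates):
    """replicates[r][j] is the statistic (p-value or entry number) of SNP j in replicate r."""
    n_rep = len(replicates)
    n_snp = len(replicates[0])
    mean_value = [sum(rep[j] for rep in replicates) / n_rep for j in range(n_snp)]
    rank_of_mean = ranks(mean_value)                       # analogue of rank E(p)
    per_replicate_ranks = [ranks(rep) for rep in replicates]
    mean_rank = [sum(r[j] for r in per_replicate_ranks) / n_rep for j in range(n_snp)]  # E rank(p)
    return [
        {
            "mean": mean_value[j],
            "rank_of_mean": rank_of_mean[j],
            "mean_rank": mean_rank[j],
            "min": min(rep[j] for rep in replicates),
            "max": max(rep[j] for rep in replicates),
        }
        for j in range(n_snp)
    ]


# Two toy replicates of p-values for three SNPs.
p_values = [[0.01, 0.20, 0.50], [0.03, 0.90, 0.40]]
summary = summarize(p_values)
for row in summary:
    print(row)
```

In this toy case the second and third SNPs share the same average rank (2.5) but differ in the rank of their average p-value, because one large p-value dominates the mean.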

Figure 1B shows LASSO results for Replicate 1 of the 200 simulated LDL phenotypes. For Replicate 1, the major causal SNP rs2294207 entered first in the LASSO solution path, which happened in 114 out of 200 replicates. In 84 out of the remaining 86 replicates, one of three nearby SNPs entered first: rs8137034 (42 times), rs2294208 (34 times), and rs5998330 (8 times). Ranks of these four SNPs, including the major causal SNP, were 5.3, 57.2, 334.1, and 1174 on average for 200 replicates (Table 1). Because these nearby SNPs were highly correlated with the causal SNP, once one of them was included as a predictor the causal SNP became a predictor much later (hence its average rank of 5.3). In contrast to single-marker analysis, in which the 15 SNPs with the smallest p-values were all at or near the major causal SNP, only 3 of the first 15 SNPs to enter the LASSO path were near the major causal SNP; the remaining 12 were spread more or less uniformly across the region (Figure 1B).
For Replicate 1 (Figure 1A), the 960 SNPs that were not in any of the first 10,000 LASSO models (cyan points) included these neighboring SNPs. This was consistent across all 200 replicates: each of the 14 neighboring SNPs was excluded from the LASSO solution path in at least some replicates. For example, SNP rs136457 was excluded from the LASSO path in 159 out of 200 replicates even though its average rank from single-marker analysis was 11.9 (Table 1). Overall, we found little consistency between ranks from Merlin and those from LASSO (correlation = 0.07 across all 200 replicates and correlation = 0.08 in Replicate 1, shown in Figure 1C).
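A correlation between two rankings can be sketched as below. The text does not state which correlation coefficient was used, so this sketch assumes a Pearson correlation computed on ranks (that is, a Spearman correlation), with SNPs that never entered the LASSO path given the entry number 10001 and therefore tied. The function names and toy values are hypothetical.

```python
def average_ranks(values):
    """Rank 1 = smallest value; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based ranks start+1 .. end+1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rank_correlation(p_values, entry_numbers):
    """Correlation between ranks of p-values and ranks of LASSO entry numbers."""
    return pearson(average_ranks(p_values), average_ranks(entry_numbers))


# Toy example: two SNPs with tiny p-values never enter the path (entry number 10001).
p_values = [1e-19, 1e-12, 1e-10, 0.2, 0.6]
entries = [1, 10001, 10001, 40, 7]
r = rank_correlation(p_values, entries)
print(round(r, 3))
```

In the toy data, strongly associated SNPs that LASSO leaves out pull the correlation toward zero, which is the pattern described for the non-causal neighboring SNPs.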

Conclusion

In this paper, we applied single-marker analysis using Merlin and multi-marker analysis using LASSO to the simulated LDL phenotype data on chromosome 22. Single-marker analysis using Merlin correctly detected statistically significant association of the major causal SNP rs2294207, with p-values no larger than 6.9 × 10^-13 in all 200 replicates. Multi-marker analysis using LASSO also included this causal SNP as the first predictor in 114 out of 200 replicates, indicating the importance of this SNP. When the causal SNP was not the first predictor, one of three neighboring SNPs was almost always the first predictor instead (84 of the remaining 86 replicates). Merlin declared the 14 non-causal neighboring SNPs statistically significant, whereas the first 10,000 models in the LASSO solution paths often excluded these 14 SNPs. The 12 polygenic SNPs were less statistically significant than these 14 neighboring SNPs in both Merlin and LASSO analyses, indicating that their effects were too small to be detected. Overall, there was little consistency between the rank orders of the 4589 SNPs provided by Merlin and LASSO. Our results indicate that Merlin and LASSO analyses provide different results. We observed that, of the 15 SNPs that showed very strong association in Merlin, LASSO typically included 3 SNPs near the causal SNP and excluded the remaining SNPs from the LASSO path (up to the first 10,000 models). This may be useful because these neighboring SNPs are not causal. We expected that LASSO would provide better results for the 12 polygenic SNPs. However, this may not have occurred because the strength of their effects was much smaller than the effect of the major causal SNP; thus, for this data set the phenotype appears to be influenced by a single SNP, in which case single-marker analysis will perform better than multi-marker analysis. Hence, our results are inconclusive in terms of whether the LASSO analysis provides additional information.
The relative advantage of multi-marker analyses over single-marker will depend on the underlying disease model. Other penalized least-squares methods may provide results more similar to single-marker analysis than LASSO. Ridge regression (penalized regression with L2 penalty) shrinks the coefficients of correlated predictors toward each other, so they borrow strength from each other. In the extreme case of k identical predictors, they each get identical coefficients with 1/kth the size that any single one would get if fit alone. On the other hand, LASSO (with L1 penalty) is somewhat indifferent to very correlated predictors and will tend to pick one and ignore the rest. The elastic net regression (penalized regression with a convex combination of both penalties) can have the advantages of both ridge and LASSO [8]. We suspect that LASSO may provide better inference for diseases with multiple causal SNPs that are not in LD. For other cases (i.e., diseases with multiple causal SNPs in LD), ridge, elastic net, or haplotype analysis may provide better inference. Further investigation is needed.
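The statement about k identical predictors can be checked with a short derivation. This is a sketch under stated assumptions: a single predictor vector x duplicated k times, no intercept, and a penalty weight λ > 0 (the symbols λ, s, and b are introduced here for illustration). It shows that the "1/k" rule is exact when the single-predictor fit uses penalty λ/k, and approximate when λ is small relative to k xᵀx.

```latex
% Ridge with k identical copies of one predictor vector x
\begin{align*}
&\min_{\beta_1,\dots,\beta_k}\;
  \Bigl\| y - x \sum_{j=1}^{k} \beta_j \Bigr\|^2 + \lambda \sum_{j=1}^{k} \beta_j^2 .\\
\intertext{For a fixed sum $s = \sum_j \beta_j$ the fit is fixed, and $\sum_j \beta_j^2$ is
minimized only at $\beta_j = s/k$, so all coefficients equal a common value $b$:}
&\min_{b}\; \| y - k b\, x \|^2 + \lambda k b^2
  \;\Longrightarrow\;
  -2k\, x^{\top}(y - k b\, x) + 2 \lambda k b = 0
  \;\Longrightarrow\;
  b = \frac{x^{\top} y}{k\, x^{\top} x + \lambda}.\\
\intertext{A single copy fit alone with penalty $\lambda'$ gives
$\hat\beta(\lambda') = x^{\top} y / (x^{\top} x + \lambda')$, hence}
&b = \frac{1}{k}\cdot\frac{x^{\top} y}{x^{\top} x + \lambda/k}
   = \frac{1}{k}\,\hat\beta(\lambda/k).\\
\intertext{With the L1 penalty, and taking $s \ge 0$ (the case $s \le 0$ is symmetric), the
objective $\| y - s\, x \|^2 + \lambda \sum_j |\beta_j|$ is the same for every split with
$\beta_j \ge 0$ and $\sum_j \beta_j = s$, because then}
&\sum_{j=1}^{k} |\beta_j| = s ,
\end{align*}
% so the LASSO solution is not unique and may put all weight on one copy.
```

This is consistent with the behavior described above: ridge spreads the effect evenly across perfectly correlated predictors, while LASSO has no preference among them and a path algorithm will typically keep one.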

List of abbreviations used

GAW16: Genetic Analysis Workshop 16; LASSO: Least absolute shrinkage and selection operator; LD: Linkage disequilibrium; LDL: Low-density lipoprotein; MAF: Minor allele frequency; SNP: Single-nucleotide polymorphism

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YJS conceived the study, carried out association and LASSO analyses, and drafted the manuscript. TKR carried out all phenotype adjustments. GS developed the concept and set up genotype and map files in the appropriate formats. CCG acquired the data and carried out quality control of the genotypes. DCR participated in the design of the study, helped to draft the manuscript and revised the manuscript. All authors read and approved the final manuscript.

1.  Merlin--rapid analysis of dense genetic maps using sparse gene flow trees.

Authors:  Gonçalo R Abecasis; Stacey S Cherny; William O Cookson; Lon R Cardon
Journal:  Nat Genet       Date:  2001-12-03       Impact factor: 38.330

2.  Accommodating linkage disequilibrium in genetic-association analyses via ridge regression.

Authors:  Nathalie Malo; Ondrej Libiger; Nicholas J Schork
Journal:  Am J Hum Genet       Date:  2008-02       Impact factor: 11.025

3.  Family-based association tests for genomewide association scans.

Authors:  Wei-Min Chen; Goncalo R Abecasis
Journal:  Am J Hum Genet       Date:  2007-09-18       Impact factor: 11.025

4.  The Genetic Analysis Workshop 16 Problem 3: simulation of heritable longitudinal cardiovascular phenotypes based on actual genome-wide single-nucleotide polymorphisms in the Framingham Heart Study.

Authors:  Aldi T Kraja; Robert Culverhouse; E Warwick Daw; Jun Wu; Andrew Van Brunt; Michael A Province; Ingrid B Borecki
Journal:  BMC Proc       Date:  2009-12-15

5.  Genome-wide association analysis of Framingham Heart Study data for the Genetics Analysis Workshop 16: effects due to medication use.

Authors:  Treva K Rice; Yun Ju Sung; Gang Shi; C Charles Gu; Dc Rao
Journal:  BMC Proc       Date:  2009-12-15