
Determining population stratification and subgroup effects in association studies of rare genetic variants for nicotine dependence.

Ai-Ru Hsieh¹, Li-Shiun Chen², Ying-Ju Li³, Cathy S J Fann³.

Abstract

BACKGROUND: Rare variants (minor allele frequency < 1% or 5%) can help researchers address the problem of 'missing heritability' and have a proven role in dissecting the etiology of human diseases and complex traits.
METHODS: We extended the combined multivariate and collapsing (CMC) and weighted sum statistic (WSS) methods and accounted for the effects of population stratification and subgroup effects using stratified analyses by the principal component analysis, named here as 'str-CMC' and 'str-WSS'. To evaluate the validity of the extended methods, we analyzed the Genetic Architecture of Smoking and Smoking Cessation database, which includes African Americans and European Americans genotyped on Illumina Human Omni2.5, and we compared the results with those obtained with the sequence kernel association test (SKAT) and its modification, SKAT-O that included population stratification and subgroup effect as covariates. We utilized the Cochran-Mantel-Haenszel test to check for possible differences in single nucleotide polymorphism allele frequency between subgroups within a gene. We aimed to detect rare variants and considered population stratification and subgroup effects in the genomic region containing 39 acetylcholine receptor-related genes.
RESULTS: The Cochran-Mantel-Haenszel test applied to GABRG2 was significant (P = 0.001). GABRG2 was detected by both str-CMC (P = 8.04E-06) and str-WSS (P = 0.046) in African Americans but not by SKAT or SKAT-O.
CONCLUSIONS: Our results imply that if associated rare variants are only specific to a subgroup, a stratified analysis might be a better approach than a combined analysis.


Year: 2019 PMID： 31033776 PMCID： PMC6636808 DOI： 10.1097/YPG.0000000000000227

Source DB: PubMed Journal: Psychiatr Genet ISSN： 0955-8829 Impact factor: 2.458

Background

Cigarette smoking is a primary risk factor for many chronic diseases (Bergen and Caporaso, 1999) including many cancers, diabetes, cardiovascular disease, and chronic lung disease (Fang et al., 2014). Recent candidate-gene association studies (Fang ) and genome-wide association studies (GWASs) (Thorgeirsson ), many of which have been reviewed by Wang and Li (2010), have searched for and, at varying levels of significance, identified common variants associated with measures of response to tobacco, tobacco consumption, nicotine dependence, nicotine metabolism, and smoking cessation. Current smoking prevalence is similar in European Americans and African Americans (Centers for Disease Control and Prevention, 2008; Saccone ; Choi ). Nicotine dependence is common in both groups, with evidence of slightly lower levels of dependence in African Americans by standard measures such as the number of cigarettes currently smoked per day (Breslau ; Saccone ). Smoking cessation rates, however, are lower in African Americans than in European Americans (Breslau ; Covey ; Saccone ; Choi ). Furthermore, there is evidence that African Americans have a higher risk of dependence at lower cigarettes-per-day levels compared with European Americans (Luo ; Saccone ). Also important are the disparities in health consequences from smoking: African Americans have higher lung cancer incidence and mortality than European Americans (Haiman ; Jemal ; Saccone ). An understanding of the genetic loci involved, and their effects and allele frequencies in diverse populations, can provide important clues to the risk of developing nicotine dependence across all populations. Multiple nicotinic receptor subunit genes outside of chromosome 15q25 are likely to be important in the biological processes and development of nicotine dependence, and some of these risks may be shared across diverse populations (Saccone ).
GWASs constitute an important means for identifying risk genes for complex human diseases, such as diabetes (Hindorff ; Dajani ), heart disease (Hofker ; Erdmann ), and Alzheimer’s disease (Shen and Jia, 2016; Chung ), among others. Despite many successes in identifying risk alleles, most associated variants discovered through GWAS do not account for the majority of heritability estimated for these complex human diseases and traits. From a genetics perspective, heritability is estimated at about 60–80% for the most studied of these complex human diseases and traits (Lichtenstein ), whereas GWAS have explained only 5–10% of this heritability, leading many researchers to ponder which alleles underlie the missing heritability (Manolio ; Zuk ; Auer and Lettre, 2015). One approach to dealing with missing heritability is to detect new DNA variants, especially rare variants (those with minor allele frequency < 5%) that have a relatively large impact on disease etiology and may also contribute to complex disease (Schork ; Marian, 2012; Gusev ; Zuk ; Auer and Lettre, 2015; Ma ; Nicolae, 2016). With efforts from the 1000 Genomes Project, which sought to identify most rare genetic variants in a group of 1092 multiethnic individuals, a new generation of GWAS is being designed to enable the discovery of rare variants using next-generation sequencing data (Abecasis ; Sampson ). Hence, improved technologies for discovering rare variants provide a possible means of explaining the missing heritability. A number of methods have been developed for identifying associations between rare variants and common diseases (Li and Leal, 2008; Madsen and Browning, 2009; Schork ). Madsen and Browning (2009) proposed a weighted sum statistic (WSS) method that assigns weights to variants according to their frequency in controls such that the variants with lower frequencies have greater weights.
Li and Leal (2008) proposed the combined multivariate and collapsing (CMC) method for case-control data. Wu ) proposed the sequence kernel association test (SKAT), a variance-component method that aggregates individual variant-score test statistics. However, population structure and subgroups can be strong confounding factors in association studies (Pritchard ; Ziv and Burchard, 2003; Clayton ; Roeder and Luca, 2009), and thus accounting for population structure and subgroups is crucial even when seemingly homogeneous ethnic populations are sampled. To our knowledge, only a few articles have discussed rare-variant detection and considered population stratification and subgroup effects [e.g., reviewed by Moore ), O’Connor ), Wang ), Prokopenko )] for nicotine dependence and smoking cessation studies (Saccone ). However, whether population stratification is better dealt with using stratified analyses or by simply including population as a covariate has not been studied enough (Culverhouse ). To address this question, we considered two situations: (1) assessing the strata in separate analyses and (2) pooling data from all strata, using population as a covariate. The results from the two situations were then compared. We utilized the Cochran–Mantel–Haenszel (CMH) test to check whether the allelic distribution of single nucleotide polymorphisms (SNPs) is similar between the population strata/subgroups. Furthermore, we extended WSS and CMC to identify rare variants while also considering population stratification and subgroup effects using stratified analyses by principal component analysis (PCA), named here as ‘str-CMC’ and ‘str-WSS’.
To compare results obtained with the two aforementioned situations for nicotine dependence and smoking cessation studies, we analyzed a smoking cessation dataset to test for rare variants associated with nicotine dependence which was downloaded from the Database of Genotypes and Phenotypes (accession number phs000404.v1.p1). The smoking cessation dataset was from the Collaborative Genetic Study of Nicotine Dependence (COGEND; principal investigator: Laura Bierut) and the University of Wisconsin Transdisciplinary Tobacco Use Research Center (UW-TTURC; principal investigator: Timothy Baker). Evidence has recently accumulated that SNPs in the genetic region encoding the nicotinic acetylcholine receptor (nAChR) subunits α6, α5, α3, and β4 are associated with smoking and nicotine dependence (Russo ). For the smoking cessation dataset analyses, we were only interested in the acetylcholine receptor region that has been reported previously. To evaluate the issue by considering two situations, we compared the results to those obtained with two variance-component methods, namely, SKAT (Wu ) and optimal sequencing kernel association test (SKAT-O) (Lee ), which treat both population stratification and subgroup effects in the PCA as covariates. In our results, we found ethnicity (i.e., African American and European American) was associated with the first axis of variation (PC1) arising from PCA (Supplementary Additional file 2, Supplemental digital content 2, ). Our results imply that if a gene showed allele frequency differences between the two groups, it would be better to use str-CMC or str-WSS in detecting associated rare variants. By contrast, if a gene has a similar distribution of allele frequency between the two groups, this would be better dealt with by including population stratification and subgroup effects as covariates in SKAT or SKAT-O. These results will assist researchers in identifying a biological basis for the etiology of nicotine dependence.

Materials and methods

The CMC and WSS methods cannot be adjusted for covariates, whereas SKAT and SKAT-O can (Liang and Xiong, 2013). We therefore evaluated two ways of dealing with population subgroups in rare-variant methods: (1) CMC and WSS were applied separately within each population subgroup in stratified analyses, that is, str-CMC and str-WSS, and (2) SKAT and SKAT-O were applied to the combined data from all population subgroups, using population subgroup as a covariate. First, we calculated the first axis of variation (PC1) using the EIGENSTRAT software (Price ) to consider population stratification and subgroup effects. PCA is a linear dimensionality reduction technique used to infer continuous axes of genetic variation. Price ) developed the program EIGENSTRAT to correct for population structure in association tests. It uses the top eigenvectors of the sample covariance matrix as covariates in a regression setting. Second, we performed the CMH test (Mantel and Haenszel, 1959) to assess differences in SNP allele frequencies between the subgroups defined by PC1 from the PCA. The CMH test was two-tailed for all analyses. To check the assumption of homogeneous association across strata, we used the Breslow–Day test and found no significant evidence of heterogeneity. Third, we detected rare variants while accounting for population stratification and subgroup effects. Rare genetic variants are here defined as alleles with a frequency of less than 1–5% (Wu ). For all rare-variant methods, rare variants were detected within a gene, and a minor allele frequency of less than 5% was used as the rare-variant criterion.
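As an illustration of the second step, the CMH statistic for a set of stratified 2 x 2 tables can be sketched as follows. This is a minimal sketch rather than the code used in the study: the function name `cmh_test`, the table layout (rows = subgroup, columns = minor/major allele counts, one table per stratum), and the omission of a continuity correction are all assumptions made for the example.

```python
import math

def cmh_test(tables):
    """Cochran-Mantel-Haenszel test for a list of 2x2 tables.

    Each table is ((a, b), (c, d)), one table per stratum (layout is an
    assumption for this sketch). Returns the chi-square statistic (1 df,
    no continuity correction) and its two-sided p-value.
    """
    diff = 0.0  # sum over strata of (observed a - expected a)
    var = 0.0   # sum over strata of the hypergeometric variance of a
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than 2 alleles carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var == 0:
        raise ValueError("all strata are degenerate")
    stat = diff * diff / var
    # Survival function of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

A small p-value from such a test would indicate that allele frequencies differ between the subgroups for the gene in question.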

Smoking cessation data

Our analyses were based on a publicly available smoking cessation dataset from the Database of Genotypes and Phenotypes (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap) (accession number phs000404.v1.p1). Genotyping during the GWAS discovery phase used the HumanOmni2.5 BeadChip, which is designed to analyze 2 443 179 loci. All individuals (n = 1515) in the study were from two projects: COGEND (principal investigator: Laura Bierut) and UW-TTURC (principal investigator: Timothy Baker). The individuals reported smoking at least 10 cigarettes per day. Both the International Classification of Diseases 10th Revision (ICD-10) (Janca ) and the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) have separate categories for dependent and nondependent smokers. The ICD-10 and the DSM-IV are unsatisfactory and are rarely used in daily clinical care because they do not allow treatments to be tailored to individual needs (Helzer ; Rüther ). By contrast, the Fagerström Test for Nicotine Dependence (FTND) measures tobacco dependence dimensionally, on a continuum that indicates the severity of the dependence. Thus, the FTND has become an internationally recognized and proven method for determining tobacco dependence (Heatherton ; Rüther ). For this reason, we binned cases and controls based on the FTND when evaluating smoking cessation. Both the COGEND and UW-TTURC projects assessed nicotine dependence using the FTND (Heatherton ). The FTND is a six-item self-report measure of nicotine dependence. FTND scores range from 0 to 10 and are categorized as follows: 0–2 = very low dependence; 3–4 = low dependence; 5 = moderate dependence; 6–7 = high dependence; and 8+ = very high dependence (López-Torrecillas ). For the current study, cases were defined as having nicotine dependence if their FTND score was at least 6; all controls had a score of 4 or less.
To avoid potential rare-variant detection biases associated with misclassification of FTND scores due to FTND breakpoints (López-Torrecillas ), our study excluded participants with an FTND score of 5 and analyzed two groups of participants with large differences in FTND score. According to this definition, there were 923 cases (135 African Americans and 788 European Americans) and 592 controls (69 African Americans and 523 European Americans). For this dataset, we were only interested in the acetylcholine receptor region, whose genes have been reported previously as candidate genes for smoking cessation (Conti ). According to Conti ), several studies have identified associations of genetic regions encoding the nAChRs with nicotine dependence (Saccone ) and with smoking cessation (Berrettini ). Therefore, the list of nAChR genes was analyzed with Ingenuity Pathways Analysis (IPA; Ingenuity Systems, Redwood City, CA, USA; http://www.ingenuity.com) to identify candidate genes previously reported in the literature, based on findings in the Ingenuity Knowledge Base. All candidate genes identified through IPA are listed in Supplementary Additional file 1, Supplemental digital content 1.
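The case/control definition above can be written as a short rule. The sketch below is only illustrative; the function name `classify_ftnd` and the use of `None` for excluded participants are assumptions, while the thresholds and the counts in the final check come from the text.

```python
def classify_ftnd(score):
    """Assign a participant to a group from an FTND score (0-10).

    Returns "case" for scores of at least 6, "control" for scores of 4 or
    less, and None for a score of 5, which is excluded from the analysis.
    """
    if not 0 <= score <= 10:
        raise ValueError("FTND scores range from 0 to 10")
    if score >= 6:
        return "case"
    if score <= 4:
        return "control"
    return None  # score of 5: excluded to avoid breakpoint misclassification

# Consistency check of the reported group sizes
# (African Americans + European Americans).
n_cases = 135 + 788
n_controls = 69 + 523
total = n_cases + n_controls
```

With the reported numbers, the cases and controls together add up to the 1515 individuals stated for the dataset.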

Combined multivariate and collapsing with population stratification and subgroup effects (str-CMC)

CMC (Li and Leal, 2008) aggregates multiple rare variants across a genomic region (e.g., gene, haplotype, and pathway) and analyzes them together. CMC divides markers into subgroups based on predefined criteria (e.g., allele frequency) and, within each group, marker data are collapsed into an indicator variable. The procedure we used consisted of the following four steps: (1) data were divided into subgroups by the first axis of variation (PC1) using EIGENSTRAT software (Price ); (2) markers in each gene group were classified as either rare variants or common variants; (3) markers in each gene group were divided into subgroups on the basis of predefined criteria (e.g., allele frequencies), and within each group, marker data were collapsed into indicator variables, defined from the genotype at the ith variant site for the jth individual in the case population and in the control population, respectively, as described by Li and Leal (2008); and (4) Hotelling’s T² test was used to compare the groups of marker data between cases and controls in each gene group. This procedure was named ‘str-CMC’.
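The collapsing and testing steps can be sketched for the simplest situation, in which all rare variants of a gene are collapsed into a single indicator, so that Hotelling's T² reduces to the squared pooled-variance two-sample t statistic. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function names, the record layout `(stratum, is_case, genotype_row)`, and the use of externally supplied minor allele frequencies are hypothetical, and the full CMC method additionally handles several marker groups jointly.

```python
def collapse_rare(genotypes, mafs, threshold=0.05):
    """Collapse rare variants into one indicator per individual.

    genotypes: list of rows, each a list of minor-allele counts (0/1/2).
    mafs: minor allele frequency of each variant (assumed precomputed).
    Returns 1 if the individual carries a minor allele at any variant
    with MAF below the threshold, else 0.
    """
    rare = [i for i, f in enumerate(mafs) if f < threshold]
    return [1 if any(row[i] > 0 for i in rare) else 0 for row in genotypes]

def hotelling_t2_univariate(cases, controls):
    """Hotelling's T^2 for a single variable (squared two-sample t)."""
    n1, n2 = len(cases), len(controls)
    m1, m2 = sum(cases) / n1, sum(controls) / n2
    ss = sum((x - m1) ** 2 for x in cases) + sum((y - m2) ** 2 for y in controls)
    pooled = ss / (n1 + n2 - 2)
    if pooled == 0:
        raise ValueError("no variation in the collapsed indicator")
    return (m1 - m2) ** 2 / (pooled * (1 / n1 + 1 / n2))

def stratified_cmc(records, mafs, threshold=0.05):
    """Run the collapsed test separately within each stratum.

    records: list of (stratum, is_case, genotype_row) tuples.
    Returns {stratum: T^2}.
    """
    by_stratum = {}
    for stratum, is_case, row in records:
        by_stratum.setdefault(stratum, []).append((is_case, row))
    results = {}
    for stratum, rows in by_stratum.items():
        indicators = collapse_rare([r for _, r in rows], mafs, threshold)
        cases = [x for x, (c, _) in zip(indicators, rows) if c]
        controls = [x for x, (c, _) in zip(indicators, rows) if not c]
        results[stratum] = hotelling_t2_univariate(cases, controls)
    return results
```

Running the test within each stratum, rather than on the pooled sample, is what distinguishes the stratified variant from plain CMC in this sketch.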

Weighted sum statistic with population stratification and subgroup effects (str-WSS)

Madsen and Browning (2009) described WSS, which determines a weighted rare-variant count in a genomic region (e.g., gene, haplotype, and pathway). The weights are determined according to the variance of the allele frequency estimated from cases and controls and are used to down-weight mutation counts when constructing the genetic score, as described below. The procedure we used consisted of the following five steps: (1) data were divided into subgroups by the PC1 using EIGENSTRAT software (Harvard University, USA; https://www.hsph.harvard.edu/alkes-price/software/) (Price ); (2) the set of markers was divided into k genomic regions; (3) the genetic score as described in Madsen and Browning (2009) was calculated for each gene. Madsen and Browning (2009) defined the genetic score of individual j as $\gamma_j = \sum_{i=1}^{L} I_{ij} / \hat{w}_i$, where $I_{ij}$ is the number of mutations (usually this will be the minor allele, unless the common allele has been reported to confer susceptibility to disease) in variant i for individual j in a genomic region, L is the number of variants genotyped, and i = 1, 2,…, L. The weight, $\hat{w}_i = \sqrt{n_i q_i (1 - q_i)}$, is the estimated SD of the total number of mutations in the sample (including cases and controls), under the null hypothesis of no frequency differences between cases and controls, where $q_i = (m_i^U + 1) / (2 n_i^U + 2)$, $m_i^U$ is the number of mutant alleles observed for variant i in the controls, $n_i^U$ is the number of controls genotyped for variant i, and $n_i$ is the total number of individuals genotyped for variant i (cases and controls); (4) genetic scores were ranked for the cases and controls combined; and (5) a Wilcoxon rank sum test was used to test for association between the set of rare variants and disease status via permutation tests. This procedure was named ‘str-WSS’.
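Steps (3) to (5) can be sketched for a single gene within one stratum as follows. This is a minimal sketch, not the software used in the study: the function names are hypothetical, complete genotype data are assumed (so every variant is genotyped in all individuals), and the significance is taken as a simple one-sided empirical permutation p-value, which is a simplification of the permutation procedure described by Madsen and Browning (2009).

```python
import math
import random

def wss_weights(genotypes, is_case):
    """Weight of each variant: sqrt(n * q * (1 - q)), with q estimated
    from controls as (m_u + 1) / (2 * n_u + 2)."""
    n_total = len(genotypes)
    controls = [g for g, c in zip(genotypes, is_case) if not c]
    n_u = len(controls)
    weights = []
    for i in range(len(genotypes[0])):
        m_u = sum(g[i] for g in controls)  # mutant alleles in controls
        q = (m_u + 1) / (2 * n_u + 2)
        weights.append(math.sqrt(n_total * q * (1 - q)))
    return weights

def genetic_scores(genotypes, weights):
    """Genetic score of each individual: sum of mutation counts / weight."""
    return [sum(g[i] / w for i, w in enumerate(weights)) for g in genotypes]

def case_rank_sum(scores, is_case):
    """Sum of the ranks of the cases, using midranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        midrank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = midrank
        pos = end + 1
    return sum(r for r, c in zip(ranks, is_case) if c)

def wss_permutation_p(genotypes, is_case, n_perm=1000, seed=0):
    """One-sided empirical p-value; weights are recomputed per permutation."""
    rng = random.Random(seed)

    def statistic(labels):
        w = wss_weights(genotypes, labels)
        return case_rank_sum(genetic_scores(genotypes, w), labels)

    observed = statistic(is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if statistic(labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

For str-WSS, a function such as `wss_permutation_p` would be called once per gene within each PC1-defined subgroup.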

Sequence kernel association test-based methods

SKAT (Wu ) and SKAT-O (Lee ) use a multiple regression model to directly relate a phenotype to the genetic variants in a genomic region (e.g., gene, haplotype, and pathway) and to covariates, here the PC1 obtained using EIGENSTRAT software (Price ).
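For orientation, the test statistics of these two methods are commonly written as below. The notation is not taken from this article and is introduced here only for illustration: $y$ is the vector of case/control labels, $\hat{\mu}$ the fitted values of a null model containing only the covariates (here PC1), $G$ the genotype matrix of the region, and $W$ a diagonal matrix of variant weights.

```latex
% SKAT: variance-component score statistic
Q_{\mathrm{SKAT}} = (y - \hat{\mu})^{\top} K \,(y - \hat{\mu}), \qquad K = G\,W\,G^{\top}

% SKAT-O: combination of the burden and SKAT statistics,
% with \rho chosen to give the smallest p-value
Q_{\rho} = (1 - \rho)\, Q_{\mathrm{SKAT}} + \rho\, Q_{\mathrm{burden}}, \qquad 0 \le \rho \le 1
```

Because the subgroup enters only through $\hat{\mu}$, these tests adjust the phenotype for ancestry but still pool the genotype information across subgroups.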

Results

A total of 2 295 169 SNPs in chromosomes 1–22 were excluded according to the following quality-control (QC) criteria: genotype call rate < 0.95, or departure from Hardy–Weinberg equilibrium (P < 10⁻⁴) in the control group. Missing SNPs were imputed using Beagle (University of Washington, USA; https://faculty.washington.edu/browning/beagle/beagle.html). Beagle produces a measure, r², that estimates the squared correlation between imputed and true alleles for the marker. For QC purposes, we excluded SNPs with r² less than 0.3. Finally, we used 1785 SNPs in the acetylcholine receptor region, whose genes have been reported previously as candidate genes for smoking cessation (Conti ).
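The exclusion criteria above amount to a simple per-SNP filter. The sketch below assumes that the call rate, the Hardy–Weinberg P value in controls, and the Beagle r² have already been computed for each SNP; the function name `passes_qc` and its argument names are hypothetical.

```python
def passes_qc(call_rate, hwe_p_controls, imputation_r2=None):
    """Return True if a SNP survives the quality-control criteria.

    call_rate: genotype call rate of the SNP.
    hwe_p_controls: Hardy-Weinberg equilibrium P value in the control group.
    imputation_r2: Beagle r^2 for imputed SNPs, or None if not imputed.
    """
    if call_rate < 0.95:
        return False  # low genotype call rate
    if hwe_p_controls < 1e-4:
        return False  # departure from Hardy-Weinberg equilibrium in controls
    if imputation_r2 is not None and imputation_r2 < 0.3:
        return False  # poorly imputed marker
    return True
```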

str-CMC

After using EIGENSTRAT software (Price ), we found that ethnicity was the first axis of variation (PC1) arising from PCA. Hence, we divided the data into two subgroups, that is, groups European Americans and African Americans. The acetylcholine receptor genes CHRNA5 (cholinergic receptor, nicotinic, alpha 5) (P = 0.038) and PSMA4 (proteasome subunit alpha 4) (P = 0.019) of group European Americans were identified by str-CMC (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1, ). CHRNA5 is associated with risk of failure for individuals who attempt to reduce cigarette smoking (Chen ) and contributes to lung cancer susceptibility in smoking-associated nasopharyngeal carcinoma (Ji ). PSMA4 is associated with lung cancer risk in Caucasians and African Americans (Hansen ).

Fig. 1

(a) The Venn diagram of rare variants detected by str-CMC in European Americans (str-CMC_European Americans), African Americans (str-CMC_African Americans) and CMC. Gene symbols are color-coded by CMH result: blue for significant, black for not significant. (b) The Venn diagram of rare variants detected by str-WSS in European Americans (str-WSS_European Americans), African Americans (str-WSS_African Americans) and WSS. (c) The Venn diagram of rare variants detected by SKAT and SKAT-O. CMC, combined multivariate and collapsing; WSS, weighted sum statistic.

The following genes were identified by str-CMC in African Americans (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1): GABRB2 (gamma-aminobutyric acid A receptor, beta 2) (P = 2.20E-05), GABRG2 (gamma-aminobutyric acid A receptor, gamma 2) (P = 8.04E-06), JAK2 (Janus kinase 2) (P = 6.58E-07), DRD2 (dopamine receptor D2) (P = 2.92E-04), RAPSN (receptor-associated protein of the synapse) (P = 0.031), RIC3 (RIC3 acetylcholine receptor chaperone) (P = 0.005), CHRM5 (cholinergic receptor, muscarinic 5) (P = 0.03), CHRNA7 (cholinergic receptor, nicotinic, alpha 7) (P = 5.49E-06), CHRNE (cholinergic receptor, nicotinic, epsilon) (P = 0.0176), CDH2 (cadherin 2, type 1, N-cadherin) (P < 0.00000001), and APP [amyloid beta (A4) precursor protein] (P < 0.00000001). GABRB2 is associated with susceptibility to drug addiction (Hondebrink ), psychiatric disorders (Zhao ), and nonsmall cell lung cancer (Zhang ). GABRG2 is also associated with epilepsy (Reinthaler ) and may contribute to the potential for suicidal behavior in schizophrenia patients with alcohol dependence or abuse (Zai ). PTK2B is associated with nonsmall cell lung cancer (Kuang ). Mutations in JAK2, when considered in the context of cigarette smoking status, can affect breast cancer–specific mortality (Slattery ).
Although Choi ) did not find a significant relationship between DRD2 polymorphisms and success during smoking cessation therapy, DRD2 was found to be associated with nicotine and alcohol addiction (Ma ). RIC3 is associated with nicotinic receptor assembly, expression, and nicotine-induced receptor upregulation (Dau ). CHRM5 may be involved in addiction to tobacco and cannabis, but not alcohol, in group European Americans (Anney ). CHRNA7 may be involved in the development of physical dependence on nicotine (Kishioka ).

str-WSS

The following genes were identified by str-WSS in European Americans (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1): CHRNA2 (P = 0.038), JAK2 (P = 0.013), CHRNA3 (P = 0.011), CHRNA5 (P = 0.002), and PSMA4 (P = 0.004). CHRNA2 is associated with nicotine dependence in European Americans and African Americans (Wang ) and with smoking cessation (Heitjan ). CHRNA3 is associated with nicotine dependence (Munafò ), and CHRNA3 polymorphisms are genetic modifiers of the risk for developing lung adenocarcinoma (He ). However, CHRNA3 may not merely operate as a marker for the difficulty, willingness, or motivation to quit smoking (Munafò ). The following genes were identified by str-WSS in African Americans (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1): GABRG2 (P = 0.046), RAPSN (P = 0.011), CHRNA5 (P = 0.039), CHRNA7 (P = 0.016), and CHRNB1 (P = 0.046). Whereas an association with CHRNB1 has been found in an African American sample, no significant association was found in European Americans (Lou ).

Sequence kernel association test and SKAT-O

We also observed a difference between African Americans and European Americans along the first axis of variation (PC1) arising from PCA using EIGENSTRAT software (Price ). Hence, we used the subgroup effect as a covariate in SKAT and SKAT-O. The following genes were detected by SKAT (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1): CHRNA1 (P = 0.001), UBQLN1 (P = 0.025), and CHRM1 (P = 0.041). CHRNA1 is associated with smoking cessation (Rose ) and lung adenocarcinoma (Chang ). UBQLN1 is associated with smoking cessation (Rose ) and nonsmall cell lung cancer (Shah ). The following genes were detected by SKAT-O: CHRNA1 (P = 0.015), CHRND (P = 0.034), UBQLN1 (P = 0.034), CHRM1 (P = 0.013), CHRNA3 (P = 0.018), and CHRNA5 (P = 0.048). CHRND has been reported to modify the risk for nicotine dependence associated with peer smoking (Johnson ).

Comparison of the rare variant–associated results from str-CMC, str-WSS, sequence kernel association test, and SKAT-O

We applied CMH analysis to the 39 acetylcholine receptor-related genes, which revealed that 15 of these genes had differences in SNP allele frequencies between the subgroups arising from PCA using EIGENSTRAT software (Price ) (i.e., European Americans and African Americans) (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1). Among these 15 genes (with different SNP allele frequencies), str-CMC found 10 genes (two in European Americans and eight in African Americans) and str-WSS detected six genes (four in European Americans, one in African Americans, and one in both European Americans and African Americans). However, SKAT and SKAT-O detected only one gene (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1). By contrast, we found 22 genes that did not differ with respect to SNP allele frequency between the subgroups arising from PCA using EIGENSTRAT software. Of these 22 genes, str-CMC detected three genes (none in European Americans and three in African Americans) and str-WSS detected two genes (none in European Americans and two in African Americans). SKAT and SKAT-O detected five genes (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1). To investigate differences between stratified and combined analyses (a within-test comparison), we again considered the 15 genes with different SNP allele frequencies: in stratified analyses, str-CMC found 10 of these genes (two in European Americans and eight in African Americans), whereas in combined analyses, CMC found only two of them. By contrast, str-WSS (six genes detected) and WSS (five genes detected) had comparable performance for these 15 genes.
The results indicated that, for this dataset, including the two subgroups (European Americans and African Americans) as a covariate was not an effective substitute for analyzing the subgroups separately when only one of them contained an associated rare variant.

Discussion

We examined whether population stratification and subgroup effects are better dealt with using stratified analyses or by including population as a covariate. Upon comparing results from str-CMC, str-WSS, SKAT, and SKAT-O, we found that the inclusion of samples from other subgroups often introduced noise when the signal for a particular gene was strong in one of the subgroups. For example, in the CMH test, the result for GABRG2 was significant at P = 0.0009. With stratified analysis using str-CMC, the P value for GABRG2 was 0.00000804 (Fig. 1; Supplementary Additional file 1, Supplemental digital content 1), and it was 0.046 for str-WSS in African Americans. By contrast, GABRG2 was not significant using SKAT (P = 0.40426) or SKAT-O (P = 0.60138). In addition, GABRG2, GABRB2, CHRNA2, JAK2, RIC3, AGPHD1, PSMA4, CHRNA5, CHRNA3, CHRNB4, CHRM5, CHRNE, CDH2, and APP were significant in subsamples representing more than half of the data, and dealing with the strata in separate analyses increased the chances of detecting associated rare variants. Although its allele frequencies did not differ between subgroups according to the CMH test, CHRNA1 was not significant using CMC and WSS. However, using SKAT and SKAT-O, its P value was borderline significant (P = 0.01, Fig. 1; Supplementary Additional file 1, Supplemental digital content 1). Using these data, our results demonstrated that all rare-variant association methods considered here could yield a relatively high rate of spurious associations in the presence of fine-scale population structure. In addition, we showed that population stratification and subgroup effects can confound rare-variant analyses. The differences in disease risk between subgroups that generated such high spurious association rates are plausible, and they are important for interpreting rare-variant association results.
For instance, there is a 2.5–10% difference in the prevalence of lung cancer among different populations of European men, although there is a less striking difference for women (Boyle and Ferlay, 2005). In our study, GABRG2 was detected by str-CMC (P = 8.04E-06, Fig. 1; Supplementary Additional file 1, Supplemental digital content 1, ) and str-WSS (P = 0.046, Fig. 1; Supplementary Additional file 1, Supplemental digital content 1, ) in African Americans. By using SKAT and SKAT-O, the P values for GABRG2 are 0.404 and 0.601, respectively (Supplementary Additional file 1, Supplemental digital content 1, ). GABRG2 was not detected by the two tests if stratification was not considered. GABRG2 was also associated with addiction (Klee ) and alcohol use disorder (Li ; Zai ). In our study, GABRG2 might affect nicotine dependence risk in African Americans. Gene-based methods for detecting rare variants are as effective as the SNP-based methods for GWAS (Schaid ; Wessel and Schork, 2006; Tzeng and Zhang, 2007). The grouping of multiple SNPs within a genomic region allows combined calculations to enhance statistical power, because rare variants include extremely sparse data so that traditional SNP-set methods for common variants might not be applicable to rare-variant detection. Some well-known rare-variant detection methods can potentially be used to combine low-frequency SNPs in GWAS. However, the number of SNPs in a gene might be important in the rare variant detection methods. For genes with larger numbers of markers, such as acetylcholine receptor-related genes, CMC and str-CMC were more likely to detect the effects compared with other methods, that is, WSS, str-WSS, SKAT, and SKAT-O. For example, in our study, we identified 262 SNPs in APP and 180 in CDH2 by str-CMC in group African Americans (Supplementary Additional file 1, Supplemental digital content 1, ). 
Population structure and subgroups can be strong confounding factors in association studies (Pritchard ; Ziv and Burchard 2003; Clayton ; Roeder and Luca, 2009), so their effects need to be taken into account. To tackle this problem, we used EIGENSTRAT software (Price ), which can remove some of their effects. However, the number of principal components used should depend on the distribution of the eigenvalues (Jiang and Dong, 2011) and the sample sizes of the subgroups. Rare-variant detection methods can be divided into two main categories, burden tests and variance-component tests (Bansal ), which should complement each other for the purpose of identifying possible risk factors for nicotine dependence or other complex traits. Nicotine usage is associated with 5 million deaths per year worldwide and is considered one of the gateway drugs that lead to the use of illicit drugs. Before detecting rare variants, the CMH test can be used to determine whether there are any differences in SNP allele frequencies between subgroups within a gene/genomic region/haplotype/pathway. If SNP allele frequencies within a gene differ between subgroups, stratified analyses such as str-CMC or str-WSS should be used to detect rare variants. Hence, the CMH test should be applied first in rare-variant detection analysis. By contrast, when SNP allele frequencies are similar between subgroups, all methods can be used directly. In our study, we cannot fully determine whether the results demonstrate the differences between stratified and combined analyses, or whether they reflect the differences between the statistical approaches. Because the stratified/combined analyses and the statistical approaches are fundamentally different, these methods should be considered complementary to each other when studying rare variants in various disease analyses.
In our study, the small sample size for the African American population is a limitation, particularly when addressing a question involving low-frequency variants. It is difficult to determine how much of the results is driven by the small numbers in the African American group. In future studies, we will adopt a pairwise sampling design based on Imai ) to increase sample sizes. In conclusion, we have extended the CMC and WSS methods to identify rare variants and stratify by population/subgroups while analyzing smoking cessation data. We found that including population as a covariate was not an effective substitute for analyzing the subpopulations separately when only one subpopulation contained a rare variant linked to the phenotype. This conclusion agrees with previous study findings (Culverhouse ). Our results will help researchers deal with population stratification and subgroup effects when detecting rare variants. More importantly, these analyses showed that even when an identical genetic model is applied to multiple subgroups, sample size is not the only factor that determines association results. If rare causative variants are unique to a subgroup, stratified analyses might be more powerful than combined analyses, although stratified analyses may entail a considerable decrease in sample size.

Acknowledgements

We are grateful to the National Science Council and Institute of Biomedical Sciences, Academia Sinica of Taiwan and China Medical University of Taiwan for funding (MOST102-2314-B-001-003-MY2, CMU103-N-15 and CMU105-N-23). The authors thank Jurg Ott for editorial support of the manuscript. A.-R.H. and C.S.J.F. each contributed to statistical analysis, data interpretation, and writing of the manuscript. L.-S.C. contributed to writing of the manuscript. Y.J.L. contributed to statistical analysis. The data were downloaded from the Database of Genotypes and Phenotypes (phs000404.v1.p1). The data were accessed after a Controlled Access Application (https://dbgap.ncbi.nlm.nih.gov/aa) and approval from both the National Center for Biotechnology Information and the Institutional Review Board at Academia Sinica (AS-IRB01-17006). The datasets analyzed during the current study are available in the Database of Genotypes and Phenotypes (phs000404.v1.p1) https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000404.v1.p1.

Conflicts of interest

There are no conflicts of interest.

88 in total

1. Statistical aspects of the analysis of data from retrospective studies of disease.

Authors: N Mantel; W Haenszel
Journal: J Natl Cancer Inst Date: 1959-04 Impact factor: 13.506

2. Nonparametric tests of association of multiple genes with human disease.

Authors: Daniel J Schaid; Shannon K McDonnell; Scott J Hebbring; Julie M Cunningham; Stephen N Thibodeau
Journal: Am J Hum Genet Date: 2005-03-22 Impact factor: 11.025

3. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.

Authors: Bingshan Li; Suzanne M Leal
Journal: Am J Hum Genet Date: 2008-08-07 Impact factor: 11.025

4. Rare variants in γ-aminobutyric acid type A receptor genes in rolandic epilepsy and related syndromes.

Authors: Eva M Reinthaler; Borislav Dejanovic; Dennis Lal; Marcus Semtner; Yvonne Merkler; Annika Reinhold; Dorothea A Pittrich; Christoph Hotzy; Martha Feucht; Hannelore Steinböck; Ursula Gruber-Sedlmayr; Gabriel M Ronen; Birgit Neophytou; Julia Geldner; Edda Haberlandt; Hiltrud Muhle; M Arfan Ikram; Cornelia M van Duijn; Andre G Uitterlinden; Albert Hofman; Janine Altmüller; Amit Kawalia; Mohammad R Toliat; Peter Nürnberg; Holger Lerche; Michael Nothnagel; Holger Thiele; Thomas Sander; Jochen C Meier; Günter Schwarz; Bernd A Neubauer; Fritz Zimprich
Journal: Ann Neurol Date: 2015-03-28 Impact factor: 10.422

5. Genetic variation in the JAK/STAT/SOCS signaling pathway influences breast cancer-specific mortality through interaction with cigarette smoking and use of aspirin/NSAIDs: the Breast Cancer Health Disparities Study.

Authors: Martha L Slattery; Abbie Lundgreen; Lisa M Hines; Gabriela Torres-Mejia; Roger K Wolff; Mariana C Stern; Esther M John
Journal: Breast Cancer Res Treat Date: 2014-08-08 Impact factor: 4.872

Review 6. Common vs. rare allele hypotheses for complex diseases.

Authors: Nicholas J Schork; Sarah S Murray; Kelly A Frazer; Eric J Topol
Journal: Curr Opin Genet Dev Date: 2009-05-28 Impact factor: 5.578

7. Association of gamma-aminobutyric acid A receptor α2 gene (GABRA2) with alcohol use disorder.

Authors: Dawei Li; Arvis Sulovari; Chao Cheng; Hongyu Zhao; Henry R Kranzler; Joel Gelernter
Journal: Neuropsychopharmacology Date: 2013-10-18 Impact factor: 7.853

8. Rare variant association studies: considerations, challenges and opportunities.

Authors: Paul L Auer; Guillaume Lettre
Journal: Genome Med Date: 2015-02-23 Impact factor: 11.117

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. Nicotinic acetylcholine receptor beta2 subunit gene implicated in a systems-based candidate gene study of smoking cessation.

Authors: David V Conti; Won Lee; Dalin Li; Jinghua Liu; David Van Den Berg; Paul D Thomas; Andrew W Bergen; Gary E Swan; Rachel F Tyndale; Neal L Benowitz; Caryn Lerman
Journal: Hum Mol Genet Date: 2008-07-01 Impact factor: 6.150