Z John Daye1, Hongzhe Li, Zhi Wei. 1. Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA.
Abstract
Next-generation sequencing data will soon become routinely available for association studies between complex traits and rare variants. Sequencing data, however, are characterized by the presence of sequencing errors at each individual genotype. This makes it especially challenging to perform association studies of rare variants, which, due to their low minor allele frequencies, can be easily perturbed by genotype errors. In this article, we develop the quality-weighted multivariate score association test (qMSAT), a new procedure that allows powerful association tests between complex traits and multiple rare variants under the presence of sequencing errors. Simulation results based on quality scores from real data show that the qMSAT often dominates over current methods, that do not utilize quality information. In particular, the qMSAT can dramatically increase power over existing methods under moderate sample sizes and relatively low coverage. Moreover, in an obesity data study, we identified using the qMSAT two functional regions (MGLL promoter and MGLL 3'-untranslated region) where rare variants are associated with extreme obesity. Due to the high cost of sequencing data, the qMSAT is especially valuable for large-scale studies involving rare variants, as it can potentially increase power without additional experimental cost. qMSAT is freely available at http://qmsat.sourceforge.net/.
Next-generation sequencing data will soon become routinely available for association studies between complex traits and rare variants. Sequencing data, however, are characterized by the presence of sequencing errors at each individual genotype. This makes it especially challenging to perform association studies of rare variants, which, due to their low minor allele frequencies, can be easily perturbed by genotype errors. In this article, we develop the quality-weighted multivariate score association test (qMSAT), a new procedure that allows powerful association tests between complex traits and multiple rare variants under the presence of sequencing errors. Simulation results based on quality scores from real data show that the qMSAT often dominates over current methods, that do not utilize quality information. In particular, the qMSAT can dramatically increase power over existing methods under moderate sample sizes and relatively low coverage. Moreover, in an obesity data study, we identified using the qMSAT two functional regions (MGLL promoter and MGLL 3'-untranslated region) where rare variants are associated with extreme obesity. Due to the high cost of sequencing data, the qMSAT is especially valuable for large-scale studies involving rare variants, as it can potentially increase power without additional experimental cost. qMSAT is freely available at http://qmsat.sourceforge.net/.
Analysis of common variants through genome-wide association (GWA) studies has enjoyed much success over the last few years. More than a thousand genetic loci associated with more than 200 complex traits have been identified (1). However, these variants typically explain only a small fraction of the inheritable variability for common diseases (2). This may, in part, be explained by the genetic theory that implicates natural selection in pruning out disease-causing variants efficiently before they may become common among the population (3). On the other hand, rare variants, having minor allele frequencies (MAFs) between 0.1 and 1%, are often evolved from more recent mutations and less subjected to natural selection. Thus, association studies of rare variants may hold the potential for identifying genetic components that are functionally relevant and causal for a larger proportion of inheritable variability (4–6). Several studies have already demonstrated the relevance of rare variants on complex traits (7–13). The availability of emerging sequencing technologies will soon allow association studies to be routinely performed between complex traits and rare variants, making the development of robust and powerful procedures for rare variants association especially relevant.Recent studies reveal several important characteristics of rare variants association. The rare variant hypothesis states that a large proportion of common diseases may be due to the aggregate effects of multiple rare variants, acting independently and with detectable effects on disease risks (4,6,14). Due to low allele frequencies, rare variants often do not exhibit strong linkage disequilibrium (LD) with either rare or common SNPs (6,15). Hence, GWA studies through indirect LD-mapping with tagged SNPs are usually ineffective in detecting rare variants. Rare variants identified in recent literatures usually have strong effects with a mean odds ratio (OR) of 3.74 and most variants having ORs greater than 2 (4). This is quite different from the effects of identified common variants, which usually have ORs between 1.1 and 1.4. Recent studies also present evidences of multiple rare variants acting collectively on disease risks (8,11,16).Impelled by growing interests, several association tests have been proposed for rare variants. These tests often employ the idea of pooling or collapsing multiple rare variants. This is partly due to the need to aggregate effects of low-frequency variants in order to increase power and further due to recent evidences suggesting that multiple rare variants often act collectively upon disease risks (8,11,16,17). For example, the combined multivariate and collapsing (CMC) test (18) extends the cohort allelic sum test (CAST) (17) by collapsing variants in subgroups according to allele frequencies and combining these subgroups using Hotelling's T2 test. The step-up procedure determines whether each variant is protective, deleterious or non-causal and aggregates them accordingly (19). The C-alpha test is a homogeneity test for case-control data via binomial proportions (20). The sequence kernel association test (SKAT) extends kernel machine-based tests for rare variants with more accurate asymptotic approximations in the tail distribution (21). Moreover, several popular multivariate tests for GWA studies have been evaluated for rare variants association, including the mininum of univariate P-values (UminP) (22), sum score (23), sum of squared score (SSU) (24) and weighted sum of squared score (SSUw) tests (25).These existing methods can sometimes be limited for rare variants association as they do not incorporate sequencing quality information. Associating rare variants to complex traits can be challenging due to the presence of sequencing errors. Under low MAFs, a few minor alleles mistaken as major or a few major alleles mistaken as minor can both result in significant loss of power in rare variants association. The quality of individual genotype estimate can vary significantly across both variants and samples, even when the mean sequencing coverage is high. Nevertheless, accurate measures of sequencing qualities can be readily obtained for each genotype using variant calling toolkits (26,27). Genotype quality scores (GQ) are often provided in Phred scale with GQ = −10 log10(1 − Pr), where Pr is the probability of a genotype being correctly called (28). Thus, a GQ value of 10 indicates a 10% sequencing error, whereas a value of 30 indicates 0.1% error. Incorporating sequencing qualities in association tests may allow us to increase power and to fully explore the potential of emerging sequencing technologies for identifying causal rare variants.Some existing methods allow the incorporation of prior weights on each variant (21,24,29), which can potentially incorporate average quality scores across variants. However, sequencing qualities also vary significantly across samples, in addition to variants. For example, a variant may consist of some proportion of high-quality genotype samples but others that are sequenced at low qualities, with the extreme case of having missing values. Utilizing these genotypes without considering varying sequencing qualities across samples can significantly decrease power in rare variants association studies due to large perturbations from low-quality samples. Thus, a new procedure is needed that can address the incorporation of sequencing qualities at each individual genotype.To achieve a robust and powerful test for rare variants association studies, we propose a quality-weighted multivariate score association test (qMSAT) that directly incorporates sequencing qualities in association tests with multiple rare variants. The qMSAT utilizes a quality-weighted multivariate regression model to incorporate sequencing qualities at each individual genotype. It is robust towards the inclusion of noncausal variants and variants having effects with different magnitudes and directions. Furthermore, it can coherently account for missing genotypes and conjoin in a principled way individuals and variants sequenced at varying coverage depths. When no quality information is utilized, the qMSAT is equivalent to the C-alpha, SSU and SKAT using the linear kernel. The qMSAT is therefore expected to perform equivalently or better than established methods when quality information is utilized.We present extensive simulation studies based upon sequencing qualities from real data, obtained from the 1000 Genomes Project (30). To guide investigators, we examine the performances of the qMSAT with current methods across a spectrum of MAFs, odds ratios, coverage depths, sample sizes and LD structures. We also provide scenarios of having non-causal variants and variants having effects with different magnitudes and directions. These results suggest that the qMSAT can often dominate over current methods, that do not utilize quality information. In particular, the qMSAT often demonstrates significant improvements in power under moderate sample sizes and relatively low coverage for rare variants association. Furthermore, we apply the qMSAT in an application to the UCSD obesity data study, where we identified two functional regions, the MGLL promoter and MGLL 3′-untranslated region (3′-UTR), that exhibit significant multiple rare variants association with extreme obesity. Most importantly, our results suggest the qMSAT to be a potentially valuable tool for increasing power without incurring additional experimental costs, such as by sequencing at higher coverage depths or larger sample sizes (31).
MATERIALS AND METHODS
Quality-Weighted Multivariate Score Association Test (qMSAT)
In this article, we consider case-control association studies having d multiple rare variants and n individuals. For the i-th individual, y denotes a dichotomous trait with y = 1 for cases and y = 0 for controls and G = (G,…, G) are variant genotypes. We assume, in this article, that genotypes are represented by the additive genetic model where G = 0, 1 or 2 are the numbers of minor alleles at a locus. However, our procedure can be used with other genetic models, such as the dominant or recessive model, and our conclusions regarding the inclusion of quality information will remain the same using another genotypic assignment.Consider the logistic regression model,
where α0 is an intercept term and β is the effect of the j-th variant on trait. The rare variants hypothesis suggests that rare variants often act independently on disease risks. Further, having low MAFs, rare variants usually do not exhibit strong LD, even at close proximity (6,11,15). Hence, we consider a quality-weighted log-likelihood,
where
and q are quality weights on each pair (i, j) of individuals and variants. The quality-weighted log-likelihood (2) utilizes relative independence among rare variants by aggregating over weighted marginal likelihoods. It allows effects of individual genotypes to be evaluated based upon sequencing qualities, down-weighting those having relatively low qualities. The weighted multivariate log-likelihood approach has been utilized successfully in several other statistical contexts (32,33). It is often employed for the incorporation of quality-related constituents (34). The approach enables conjoined analysis of sequencing data over a spectrum of qualities and read depths. Moreover, problems associated with imputing rare variants for traditional association tests can be circumvented by assigning q = 0 at missing genotypes in Equation (2). The quality-weighted likelihood naturally allows missing genotypes without the need for imputation, which can be quite challenging under low MAFs.For simplicity, we utilize the genotype quality scores directly as weights q in this article. The quality-based weights q can also be assigned using the number of reads and monotonically non-decreasing functions of quality scores, such as where {·} is an indicator function and t is some threshold above which qualities are believed to be equally acceptable. A myriad of quality-based weights can be applied based upon an investigator's preferences and prior knowledge.To achieve a robust and powerful test for multiple rare variants association, we utilize a quality-weighted Rao's score statistic (35) in testing the multivariate null hypothesis,
Computing for the slope of the log-likelihood Equation (2) with genotype coefficients β restricted at the null hypothesis, we obtain the quality-weighted score vector S = (S1, S2,…, S) where
and
are the predicted proportions of cases at the j-th variant.We define the qMSAT statistic as the norm squared of the quality-weighted score vector
If no quality information is utilized (q = 1), the qMSAT is reduced to the SSU and linear SKAT and is also equivalent to the C-alpha test. The qMSAT provides a computationally efficient means to estimate expected improvements in model fit according to sequencing qualities when the null hypothesis in Equation (3) may be violated. It admits a simple form as explicit computations of genotype coefficients β are not required. Utilizing a quadratic form, the qMSAT is robust towards the inclusion of variants having effects with opposing directions. However, with prior weights imposed upon the responses y in Equation (2), the qMSAT statistic does not assume a readily approximable asymptotic distribution. Thus, we utilize a permutation-based strategy, where the distribution of the qMSAT statistic is approximated using statistics computed with randomly re-sampled responses in Equation (4) (36). As asymptotic-based tests are not guaranteed to be reliable under low MAFs and limited sample sizes, permutation-based procedures are often preferred in rare variants association studies (20,29). Having a simple form, the qMSAT can be computed efficiently even when large numbers of permutations are required. Indeed, the qMSAT is equivalent in computational speed to C-alpha and other score-based procedures, that have been successfully employed in several earlier studies (20,25,29).
Simulations
We present extensive simulations based upon sequencing qualities from the Pilot Phase 3 study of the 1000 Genomes project (30). The study provides sequencing data for 90 Caucasian individuals over 2385 rare variant sites, amounting to 214 650 genotype quality scores across a spectrum of read depths. In order to compare performances of various methods under realistic error distributions, we sampled directly from these empirical quality scores in generating simulated data.We generated simulated data over a spectrum of MAFs, odds ratios, coverage depths, sample sizes and LD structures. We also provided scenarios where non-causal variants and variants having effects with different magnitudes and directions are included. As in previous studies (25,37), we first generated d latent variables from the multivariate normal distribution with the autoregressive covariance structure Σ = ρ|, where ρ = 0 was used to generate independent variants and ρ close to 1 was used to generate variants in strong LD with its neighbors. The latent variables were then used to produce a haplotype at a given MAF by dichotomizing variables at a specified quantile. Two independent haplotypes thus generated were combined to obtain the underlying genotypes , with which we generated dichotomous phenotypes for a case–control study under the logistic regression model
at given effect sizes β or odds ratios exp(β). The intercept term α0 was set to attain a disease rate of 5%. As in conventional case–control design, we randomly sampled n/2 cases and n/2 controls along with their underlying genotypes . Next, we generated observed genotypes G by perturbing the underlying ones with sequencing qualities sampled from the 1000 Genomes project. At a given mean coverage depth (dp), we simulated the number of reads for each genotype from the gamma distribution with shape and scale parameters at 6.3 and dp/6.3, respectively. The gamma distribution with the specified parameters have been found to approximate observed numbers of reads well from mean coverage in previous studies (38,39). The quality score GQ was obtained for each genotype by sampling from quality scores of the 214 650 Pilot Phase 3 study genotypes with an observed number of reads equal to or closest to the number of reads simulated from mean coverage. Finally, we generated genotypes G by randomly perturbing using the Bernoulli distribution with error probabilities 10(−. Each procedure was applied by associating phenotypes y with perturbed genotypes G. Sampled quality scores at each genotype were utilized as weights q in the qMSAT.We compared the qMSAT with 10 other methods that have been proposed recently for rare variants association. The SSU, SSUw, Score, SKAT with linear and quadratic kernels, and UminP are score-based procedures similar to the qMSAT, whereas the Sum and CMC tests are derived from estimates in a logistic regression with collapsed variants. The step-up procedure also performs variable selection to determine causality and directions of effects in addition to collapsing variants together. The C-alpha procedure is a homogeneity test based upon binomial proportions. In a case–control study, the C-alpha is equivalent to the SSU and linear SKAT (21,40). Moreover, the C-alpha, SSU and linear SKAT are, in turn, equivalent to the qMSAT if no quality information is utilized (q = 1). We implemented the qMSAT in R, available online at http://qmsat.sourceforge.net/. The SSU, SSUw and Score tests were computed using R codes from an earlier study (24), whereas the SKAT was computed using the R package from the original paper (21). The SKAT package also includes options for incorporating MAF-based weights on each variant. Nonetheless, in order to clearly demonstrate the effects of MAF on association tests, our simulations were performed at a spectrum of values at fixed MAFs. Incorporating MAF-based weight on each variant is only beneficial if both rare and common variants are included. Thus, we do not incorporate MAF-based weights for the SKAT in this study. The thgenetics R package was used for the step-up procedure (19). All other procedures were computed using available codes in a recent study (40). P-values for the qMSAT, C-alpha, and step-up procedures were computed using 500 permutations. Simulated data for type I errors were generated using the restricted logistic regression model instead of (6),
where ORs were set to 1 for each variant. Both empirical powers and type I errors for each procedure were estimated using 1000 repetitions as the proportion of P-values < 0.05.
Application to UCSD obesity data set
We analyzed a targeted re-sequencing data of candidate genes at extreme obesity levels from the UCSD obesity study (41,42). The data set consisted of 289 individuals of European ancestry; among which, 73 men and 70 women having body mass index (BMI) ≥40 kg/m2 were selected as cases, and 74 men and 72 women with BMI ≤ 30 kg/m2 were assigned as controls. Two intervals were sequenced over 188 kb that encompassed regions encoding the endocannabinoid metabolic enzymes, fatty-acid amide hydrolase (FAAH) and monoglyceride lipase (MGLL). Earlier studies had found genes in the endocannabinoid system that regulates obesity levels (43,44), making the FAAH and MGLL prime candidates for association analysis of variants to extreme obesity. We mapped short reads using the Burrows-Wheeler Alignment (BWA) algorithm (45). We employed the most recent state-of-the-art variant calling toolkit GATK (26) to perform variant calls and extract corresponding quality scores at each genotype. An earlier algorithm MAQ (46) was used in the original study (42) for variant calling. We obtained 977 variants with a mean coverage of 101×. These variants were then annotated using SeattleSeq (http://gvs.gs.washington.edu/SeattleSeqAnnotation/). Given the annotations, we focused on the 455 rare variants that were not cataloged in the dbSNP database (47); these variants were then divided based on their genomic positions into groups of promoters, 5′-UTRs, exon regions, intron regions and 3′-UTRs, for performing association tests. For each gene, we combined all exon variants together in a group and considered only non-synonymous variants. The FAAH promoter was sequenced over the 11.5 kb upstream region, whereas the MGLL promoter encompassed the 12.5 kb upstream region. Short functional regions with less than two variants were not included in this study. Missing genotypes were imputed by sampling randomly from non-missing ones in the same variant, and their associated quality scores were set to be 0. Number of permutations in statistical tests was increased to 10 000 for increased precision in P-values.
RESULTS
Presence of sequencing errors
Figure 1 presents sequencing qualities of the data from the Pilot Phase 3 study of the 1000 Genomes project at each genotype in terms of genotype quality scores (GQ) and probability of sequencing errors (1 − Pr) over varying coverage depths. The quantities are related through the conversion formula Pr = 1 − 10−. We observe that sequencing qualities tend to increase with increasing coverage depths or numbers of reads at a genotype. Nonetheless, even at relatively high coverage depths, a significant proportion of genotypes still assumes low sequencing qualities. For example, the 99% interval includes genotypes with GQ = 15 (or 1 − Pr = 0.03) at a coverage depth of 20×. In association analysis of rare variants where MAFs are low, even a small proportion of unaccounted sequencing errors can influence results unexpectedly. In the following studies, we evaluate the effects of having sequencing errors under realistic settings by sampling sequencing qualities directly from the Pilot Phase 3 study as presented in Figure 1 (see ‘Material and Methods’ section).
Figure 1.
Genotype sequencing qualities generated in Pilot Study 3 of the 1000 Genomes Project. Sequencing qualities in terms of genotype quality scores (GQ) and probability of sequencing errors (1 − Pr) are displayed, where 1 − Pr = 10−. A total of 2385 rare variants are called, amounting to 214 650 genotypes at various coverage depths. (a) Plot of sequencing qualities at all genotypes, with coverage depths spaced logarithmically. (b) Modified box plot of genotypes at coverage depths of 5×, 10×, 20×, 30×, 50× and 100×. Short, bold lines are drawn at the median. Each whisker ends at the 1 and 99% quantiles, and each box encloses the 5–95% interval.
Genotype sequencing qualities generated in Pilot Study 3 of the 1000 Genomes Project. Sequencing qualities in terms of genotype quality scores (GQ) and probability of sequencing errors (1 − Pr) are displayed, where 1 − Pr = 10−. A total of 2385 rare variants are called, amounting to 214 650 genotypes at various coverage depths. (a) Plot of sequencing qualities at all genotypes, with coverage depths spaced logarithmically. (b) Modified box plot of genotypes at coverage depths of 5×, 10×, 20×, 30×, 50× and 100×. Short, bold lines are drawn at the median. Each whisker ends at the 1 and 99% quantiles, and each box encloses the 5–95% interval.
Evaluation of the effects of coverage depth and sample size
We evaluated empirical powers and type I errors over mean coverage depths of 5×, 10×, 20×, 30×, 50× and 100× and moderate sample sizes of 200, 400, 600, 800 and 1000, each having balanced numbers of cases and controls. We considered 20 rare variants (d = 20) consisting of five causal ones with odds ratios of 2, 3, 4, 0.5 and 0.5 for power computations. No LD was assumed among the variants. In each case, we compared the qMSAT with 10 other methods that do not utilize quality information (see ‘Material and Methods’ section).Figure 2 presents type I errors at an MAF of 0.5%. Type I errors are generally well controlled. However, score-based procedures using asymptotic distributions (SSU, SSUw, Score, SKAT and UminP) tend to be conservative at moderate or high coverage, especially when sample sizes are small. On the other hand, permutation-based methods (qMSAT, C-alpha and step-up) are usually well controlled around the 0.05 error rate. This may suggest that permutation methods can sometimes be advantageous under finite samples. Supplementary Figures S1 and S3 further depict type I errors at MAFs of 0.1% and 1%, respectively.
Figure 2.
Type I errors over varying coverage depths and sample sizes at MAF=0.5%. There are n/2 numbers of cases and controls each. Twenty non-causal rare variants (d = 20) are considered with odds ratios at 1 for all variants. No LD is assumed among the variants. Reference lines (in red) are drawn at the 0.05 error rate.
Type I errors over varying coverage depths and sample sizes at MAF=0.5%. There are n/2 numbers of cases and controls each. Twenty non-causal rare variants (d = 20) are considered with odds ratios at 1 for all variants. No LD is assumed among the variants. Reference lines (in red) are drawn at the 0.05 error rate.Empirical powers are provided in Figure 3 at an MAF of 0.5%. We observe that the qMSAT often dominates over other procedures, that do not utilize quality information. Improvements in power are most significant at relatively low coverage. Consider the permutation-based C-alpha, which is related to the qMSAT when no quality information is utilized (see ‘Material and Methods’ section). At n = 600, the qMSAT outperforms C-alpha with 78.0, 158, 61.2, 27.4, 21.0 and 14.8% increases in power at mean coverage depths of 5×, 10×, 20×, 30×, 50×, and 100×, respectively (Figure 3). Moreover, the qMSAT often achieves fairly high powers even at relatively low coverage. At n = 1000, the qMSAT attains powers of 0.530, 0.739 and 0.773 at mean coverage depths of 10×, 20× and 30×, respectively, whereas a similar level of power is attained at 0.780 with a much higher mean coverage depth of 100× (Figure 3). This suggests the qMSAT as a valuable tool for achieving powerful results without incurring additional costs, such as by sequencing at higher coverage depths. The SSU usually tends to have higher powers than SSUw and score tests. Previous studies have suggested that the SSUw and score tests that depend upon accurate estimations of variances and the covariance matrix, respectively, may be limited for multivariate analysis when the dimension d is large (24,48,49). The SKAT using linear kernels often has higher powers than the SKAT using quadratic kernels as epistatic effects were not included in this analysis. Due to the inclusion of causal variants having opposing directions, the sum test often has limited powers. Supplementary Figures S2 and S4 further provide power evaluations at MAFs of 0.1 and 1%, respectively.
Figure 3.
Power evaluation of the effects of coverage depths and sample sizes at MAF=0.5%. There are n/2 numbers of cases and controls each. Twenty rare variants (d = 20) are considered, of which five are causal with odds ratios at 2, 3, 4, 0.5 and 0.5. No LD is assumed among the variants.
Power evaluation of the effects of coverage depths and sample sizes at MAF=0.5%. There are n/2 numbers of cases and controls each. Twenty rare variants (d = 20) are considered, of which five are causal with odds ratios at 2, 3, 4, 0.5 and 0.5. No LD is assumed among the variants.
Evaluation of the effects of MAF and odds ratio
We compared empirical powers and type I errors over varying MAFs of 0.1, 0.5, 1, 5 and 10% and ORs of 1–6 for each of the five variants among the 20 (d = 20) considered, in which the remaining are non-causal with ORs of 1. Identified rare variants in recent literatures have a mean OR of 3.74, and most have ORs greater than 2 (4). To focus on evaluating effects of MAF and OR, we considered n = 500 observations with balanced numbers of cases and controls and assumed no LD among the variants.Figure 4 presents empirical powers and type I errors (ORs = 1) at a mean coverage depth of 30×. Power increases for all methods with increasing MAF and OR. The qMSAT often outperforms other procedures at low MAFs for rare variants, whereas all procedures do well at high MAFs for common variants. For low MAFs, power for the qMSAT improves dramatically for increasing ORs. For example, at an MAF of 0.1%, the qMSAT outperforms C-alpha with 66.0, 84.4, 113, 124 and 127% increases in power at causal variants' ORs of 2, 3, 4, 5 and 6, respectively (Figure 4). This may suggest the qMSAT as a potentially useful procedure for identifying rare variants (with MAFs between 0.1 and 1%) under moderate sample sizes and, possibly, for identifying novel variants (with MAFs <0.1%) having strong effect sizes in large-scale studies. The sum test performs competitively with related methods in this example, as all causal variants have the same direction. Supplementary Figures S5–S9 present empirical powers and type I errors at mean coverage depths of 5×, 10×, 20×, 50× and 100×, respectively. Improvements of the qMSAT tend to be more pronounced at lower coverage depths in this example.
Figure 4.
Effects of MAFs and odds ratios at a mean coverage depth of 30×. Empirical powers and type I errors are presented at a mean coverage depth of 30× with varying MAFs and odds ratios for five variants amongst twenty (d = 20) considered altogether. We use n = 500 observations, where there are n/2 = 250 numbers of cases and controls each. No LD is assumed among the variants. Reference lines (in red) are drawn at the 0.05 error rate.
Effects of MAFs and odds ratios at a mean coverage depth of 30×. Empirical powers and type I errors are presented at a mean coverage depth of 30× with varying MAFs and odds ratios for five variants amongst twenty (d = 20) considered altogether. We use n = 500 observations, where there are n/2 = 250 numbers of cases and controls each. No LD is assumed among the variants. Reference lines (in red) are drawn at the 0.05 error rate.
Evaluation of the effects of LD
We evaluated the effects of LD at a mean coverage depth of 30× and sample sizes of n = 500, with equal numbers of cases and controls. A spectrum of LD structures was evaluated with latent variables having autoregressive covariance matrices Σ = ρ| with ρ at 0, 0.25, 0.5, 0.75 and 0.9 (Figure 5). We employed 20 variants (d = 20) with ORs of (3, 2, 1, 1, 4, 1, 1, 1, 0.5, 0.5, 1,…,1). Variants generated with ρ = 0 are independent, whereas those generated with large values of ρ are in LD with its neighbors.
Figure 5.
Power evaluation of the effects of linkage disequilibrium. Empirical powers are presented at a mean coverage depth of 30× and sample sizes n = 500 with equal numbers of cases and controls for a spectrum of LD structures generated from latent variables having autoregressive covariance matrices Σ = ρ|. Variants generated with ρ = 0 have no LD, whereas variants generated with large values of ρ are in high LD. Twenty variants (d = 20) are considered having odds ratios of (3, 2, 1, 1, 4, 1, 1, 1, 0.5, 0.5, 1,…,1).
Power evaluation of the effects of linkage disequilibrium. Empirical powers are presented at a mean coverage depth of 30× and sample sizes n = 500 with equal numbers of cases and controls for a spectrum of LD structures generated from latent variables having autoregressive covariance matrices Σ = ρ|. Variants generated with ρ = 0 have no LD, whereas variants generated with large values of ρ are in high LD. Twenty variants (d = 20) are considered having odds ratios of (3, 2, 1, 1, 4, 1, 1, 1, 0.5, 0.5, 1,…,1).Figure 5 presents empirical powers for evaluating LD effects. Power increases for all methods with increasing ρ or degree of LD among neighboring variants. This is largely because variants in LD tend to have effect sizes augmented by aggregation. The qMSAT tends to dominate other methods at relatively low MAFs, even at high LD. Rare variants usually do not exhibit strong LD (6,11,15). However, these results suggest that the qMSAT can also be effective when LD effects are strong. Supplementary Figure S10 further provides type I errors.We analyzed the UCSD obsesity data from a previous study that suggested multiple rare variants to be associated with the FAAH promoter, the MGLL promoter, and a region of MGLL introns (42). We performed multiple rare variants association tests using all statistical approaches as above. As in the original study (42), none of these methods found significant rare variants in the FAAH and MGLL exon regions (Supplementary Table S1); several methods confirmed associations of multiple rare variants in the FAAH promoter and MGLL promoter regions (Table 1). Specifically, the qMSAT attained P-values of 0.061 and 0.037 for the FAAH promoter and MGLL promoter regions, respectively. No intron regions were found to be significant by any of these methods (Supplementary Table S1). Interestingly, several methods, including the qMSAT (Table 1), identified the MGLL 3′-UTR to be significant, which was missed in the previous study (42). Promoters are well known to be gene transcription regulators that specify binding of transcription factors to DNA sequences. Moreover, protein binding to 3′-UTR is known to function in some situations by protecting mRNA from nuclease degradation (50). These results suggest that recent mutations in the FAAH promoter, MGLL promoter, and MGLL 3′-UTR may have led to extreme obesity levels in some individuals by altering gene transcription mechanisms. Further studies, such as by associating rare variants to gene expression levels, may help to verify these results. On the other hand, recent mutations in MGLL introns might not have played as significant a role as previous study (42) suggested in extreme obesity. Differences between our results from those of the previous study might be due to using more recent variant calling algorithm in our analysis (see ‘Material and Methods’ section). Among the procedures evaluated, the qMSAT has the smallest P-values for association of rare variants in the MGLL promoter and MGLL 3′-UTR with extreme obesity levels (Table 1). The C-alpha, SSU and linear SKAT are equivalent to the qMSAT when no quality information is utilized. Both C-alpha and qMSAT utilize a permutation-based estimation of P-values in order to better control type I errors under moderate samples. In all three regions where at least one method indicates significant association, the qMSAT has significantly smaller P-values than the C-alpha (Table 1). These results confirm our simulation studies suggesting that the incorporation of quality information can often improve power in rare variants association analysis. We observe that the SKAT using linear kernels has smaller P-values than the SKAT using quadratic kernels (Table 1). This may suggest that rare variants, having low MAFs, might not be able to exhibit strong interaction and higher-order effects. In addition, we performed marginal association analysis of rare variants in the FAAH promoter, MGLL promoter and MGLL 3′-UTR (Supplementary Table S2). The MGLL promoter contains two variants with marginal associations significant at the 0.05 P-value level but with opposing effects on extreme obesity (Supplementary Table S2). In this region at the MGLL promoter, the multivariate sum test has the largest P-value of 0.578 (Table 1). This may support previous studies asserting the sum test to be limited when association of significant variants in a region are in opposite directions, though it might be powerful when directions are the same (25). Due to low MAFs, marginal association tests for rare variants often have low power and can be subjected to perturbations from genotype errors. Both the FAAH promoter and MGLL 3′-UTR, indicated to be significant by several multivariate tests, do not have rare variants exhibiting marginal associations significant at the 0.05 P-value level (Supplementary Table S2). This result further suggests the benefit of pooling multiple variants together in identifying rare variants associated with common diseases.
Table 1.
Multiple rare variants association analysis of the UCSD obesity data
FAAH promoter (d=17)
MGLL promoter (d=25)
MGLL 3′-UTR (d=6)
qMSAT
0.061
0.037
0.002
C-alpha
0.077
0.049
0.044
SSU
0.080
0.038
0.246
SSUw
0.145
0.234
0.308
Score
0.146
0.123
0.303
SKAT (Linear)
0.079
0.038
0.246
SKAT (Quad)
0.107
0.050
0.250
UminP
0.631
0.211
0.627
Step up
0.022
0.063
0.015
Sum
0.002
0.578
0.052
CMC
0.015
0.109
0.052
P-values are provided for association tests performed over d multiple rare variants. FAAH promoter, MGLL promoter and MGLL 3′-UTR are regions exhibiting significant association of multiple rare variants with extreme BMI at the 0.05 level by one or more methods. P-values significant at the 0.05 level are boldfaced.
Multiple rare variants association analysis of the UCSD obesity dataP-values are provided for association tests performed over d multiple rare variants. FAAH promoter, MGLL promoter and MGLL 3′-UTR are regions exhibiting significant association of multiple rare variants with extreme BMI at the 0.05 level by one or more methods. P-values significant at the 0.05 level are boldfaced.In this application, the qMSAT P-values were computed in 25.039 s for all 12 regions considered using 10 000 permutations with the R programming language. In large-scale rare variants association studies involving hundreds of sequenced genes, it is expected to take only a couple of hours. Computing speed can also be greatly improved by using more efficient programming languages, such as C or Fortran. Among the methods evaluated, the step-up procedure has comparatively much slower computational speed, as it has to ascertain whether each variant is protective, deleterious or non-causal.
DISCUSSION
We have proposed a novel procedure that allows the incorporation of sequencing qualities directly in association tests of complex traits and multiple rare variants. Extensive simulation studies with sequencing qualities sampled directly from real data have been used to evaluate our method across a spectrum of factors of interest. Results have suggested the qMSAT as a potentially very useful method for improving power, especially under moderate sample sizes and relatively low coverage. Moreover, our analysis of the UCSD obesity data set using the qMSAT has identified two regions (MGLL promoter and MGLL 3′-UTR), where multiple rare variants are significantly associated with extreme obesity.We have not considered the inclusion of additional covariates in order to provide a concise presentation. It is clear that our conclusions regarding the incorporation of sequencing qualities will remain the same whether additional covariates are included or not. The qMSAT can be easily extended to include additional covariates, such as relevant common variants and environmental effects. For dichotomous traits y and X a d-vector of additional covariates, the qMSAT statistic qMSAT = S
S is defined using scores in Equation (4) with
where and are estimated under the null hypothesis by maximizing the log-likelihood in Equation (2) with using only the covariates X and quality-based weights q1, q2,…, q at the j-th variant, i.e. q = 0 for all i = 1,…, n and j′ ≠ j.Moreover, we focused on association tests with the logistic regression for case–control studies. This is partly due to the prevalence of case–control data in biomedical researches and the fact that effects of rare variants, having extremely low MAFs, can be more readily modeled with dichotomous traits than continuous ones. However, with the advent of large-scale rare variants association studies in the near future, quantitative traits may also be considered. For continuous traits y, the qMSAT statistic is defined with scores
for which the quality-weighted log-likelihood under the linear model
is employed in place of Equation (2). The coefficients and are estimated under the null hypothesis by weighted least squares regression of y on additional covariates X only with quality-based weights q1, q2,…, q at the j-th variant. The same benefits demonstrated for the qMSAT for case–control data are expected to extend to applications involving quantitative traits.We have not considered association studies simultaneously involving both rare and common variants. A common strategy proposed in the literature is to re-weight each variant according to its MAF, augmenting effects of rare variants relative to common ones. However, it is not clear which is the best way to re-weight variants objectively. For example, re-weighting by standard deviations, control MAFs (51), and beta transformations of MAFs (21) have been proposed. As rare variants usually act on disease risks independently of common ones (6,15), a natural approach is to first identify common variants that may be causal, such as by routine GWA studies, and then incorporate them as additional covariates X as in the qMSAT.In this article, we have utilized genotype quality scores as weights in the qMSAT. These scores can account for errors, such as those arising from base calling in sequencer traces and mapping of short sequencing reads, and can quantify genotype uncertainty accurately at most loci. However, variant calling at some loci may suffer due to artifacts, such as from strand bias, that cannot be easily accounted for and may lead to less accurate genotype quality scores. Nonetheless, current variant calling programs usually can filter out the variants at these suspicious loci (26,27,52). In the situation when genotype qualities are less accurate, the qMSAT may be less powerful while type I errors are expected to be controlled due to permutation (36). Nevertheless, our results demonstrate that the incorporation of sequencing qualities can consistently improve power over methods that do not incorporate quality information. Moreover, the development of variant calling methods is an active area (26,27,52,53), and our procedure will continue to benefit from future improvements in variant calling programs.On the other hand, the qMSAT is quite flexible and can incorporate other forms of weights, in addition to quality information. For example, if there is evidence to believe that some variants are more likely to be functional, such as by prior knowledge or evolutionary models (54), or some individuals are more representative of the effected population, the qMSAT weights q can be redefined accordingly in order to incorporate additional information at each variant or individual. Indeed, the qMSAT extends earlier methods with weights on each variant (24,51,55) to allow varying weights on both individuals and variants.Statistical analysis of sequencing data is ultimately limited by inadequacies in the underlying technology employed. With the qMSAT, we approached this challenge by incorporating sequencing quality, a technology-derived constituent, directly in a statistical procedure. The qMSAT also offers a coherent strategy for integrating missing genotypes and individual genotypes sequenced at varying coverage depths and quality scores. Improvements in performances that we observed for the qMSAT in both empirical and real-data studies may suggest the importance of considering the limitations of the technology at hand in order to improve statistical analysis of genomic data sets.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online: Supplementary Tables S1–S2 and Supplementary Figures S1–S10.
FUNDING
National Institutes of Health (grant numbers CA127334, ES009911). Funding for open access charge: Waived by Oxford University Press.Conflict of interest statement. None declared.
Authors: Stefano Romeo; Wu Yin; Julia Kozlitina; Len A Pennacchio; Eric Boerwinkle; Helen H Hobbs; Jonathan C Cohen Journal: J Clin Invest Date: 2008-12-15 Impact factor: 14.808
Authors: Teri A Manolio; Francis S Collins; Nancy J Cox; David B Goldstein; Lucia A Hindorff; David J Hunter; Mark I McCarthy; Erin M Ramos; Lon R Cardon; Aravinda Chakravarti; Judy H Cho; Alan E Guttmacher; Augustine Kong; Leonid Kruglyak; Elaine Mardis; Charles N Rotimi; Montgomery Slatkin; David Valle; Alice S Whittemore; Michael Boehnke; Andrew G Clark; Evan E Eichler; Greg Gibson; Jonathan L Haines; Trudy F C Mackay; Steven A McCarroll; Peter M Visscher Journal: Nature Date: 2009-10-08 Impact factor: 49.962
Authors: Andriy Derkach; Theodore Chiang; Jiafen Gong; Laura Addis; Sara Dobbins; Ian Tomlinson; Richard Houlston; Deb K Pal; Lisa J Strug Journal: Bioinformatics Date: 2014-04-14 Impact factor: 6.937
Authors: Camelia C Minică; Giulio Genovese; Christina M Hultman; René Pool; Jacqueline M Vink; Michael C Neale; Conor V Dolan; Benjamin M Neale Journal: Twin Res Hum Genet Date: 2017-02-27 Impact factor: 1.587