Literature DB >> 31063461

An integrative association method for omics data based on a modified Fisher's method with application to childhood asthma.

Qi Yan¹, Nianjun Liu², Erick Forno¹, Glorisa Canino³, Juan C Celedón¹, Wei Chen^1,4,5.

Abstract

The development of high-throughput biotechnologies allows the collection of omics data to study the biological mechanisms underlying complex diseases at different levels, such as genomics, epigenomics, and transcriptomics. However, each technology is designed to collect a specific type of omics data. Thus, the association between a disease and one type of omics data is usually tested individually, but this strategy is suboptimal. To better articulate biological processes and increase the consistency of variant identification, omics data from various platforms need to be integrated. In this report, we introduce an approach that uses a modified Fisher's method (denoted as Omnibus-Fisher) to combine separate p-values of association testing for a trait and SNPs, DNA methylation markers, and RNA sequencing, calculated by kernel machine regression into an overall gene-level p-value to account for correlation between omics data. To consider all possible disease models, we extend Omnibus-Fisher to an optimal test by using perturbations. In our simulations, a usual Fisher's method has inflated type I error rates when directly applied to correlated omics data. In contrast, Omnibus-Fisher preserves the expected type I error rates. Moreover, Omnibus-Fisher has increased power compared to its optimal version when the true disease model involves all types of omics data. On the other hand, the optimal Omnibus-Fisher is more powerful than its regular version when only one type of data is causal. Finally, we illustrate our proposed method by analyzing whole-genome genotyping, DNA methylation data, and RNA sequencing data from a study of childhood asthma in Puerto Ricans.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Genetic Markers

Year: 2019 PMID： 31063461 PMCID： PMC6524814 DOI： 10.1371/journal.pgen.1008142

Source DB: PubMed Journal: PLoS Genet ISSN： 1553-7390 Impact factor: 5.917

Introduction

Because of major advances in high-throughput biotechnologies, large amounts of omics data have been collected to study the biological mechanisms underlying complex diseases at different levels, such as genomics, epigenomics, and transcriptomics. Such different types of omics data can help us understand a disease from several perspectives. However, each of the arrays or sequencing technologies is designed to collect a specific type of omics data, such as SNPs, DNA methylation markers, and RNA sequencing. Thus, the association between a complex disease and one type of omics data is usually tested individually, but this strategy is suboptimal and has some disadvantages. Researchers often find that only a small proportion of disease variation can be explained by one type of omics (e.g., genetic) data, leading to “missing heritability” [1]. Moreover, molecular variants identified by different studies usually suffer from poor reproducibility [2, 3]. Most importantly, only partial information is used for each individual analysis. Therefore, in order to better characterize biological processes and increase the consistency of variant identification, omics data from separate platforms need to be integrated and analyzed. Integrating information from different biological datasets has the potential to yield better insight into causal mechanisms of complex diseases than that from individual omics datasets. Although integrative analysis of omics data is clearly needed, the complexity of disease mechanisms, the large number of collected molecular variables, and relatively small datasets can make such analysis quite challenging. Bersanelli et al. [4] summarized a list of existing statistical approaches for integrative analysis. When developing integrative analysis methods, two inevitable issues are: 1. handling a large number of variables and 2. dealing with data from relatively small studies. In genetic studies, to handle a large number of genetic variants in a gene, gene-based approaches [5-15] have been developed to evaluate the joint effects of genetic variants in the same gene on the disease of interest. Of the existing methods, the sequence kernel-machine-based associations test (SKAT) [16, 17] is a powerful, flexible, and computationally efficient test. In this kernel machine (KM) approach, the test statistic follows a mixture of chi-square distributions, and thus p-values can be computed analytically and quickly without using resampling techniques. Although gene-based tests were originally developed for genetic studies, the same concept can be applied to studies of multi-omics data. Another issue is small sample size, especially for epigenomic or transcriptomic data. For example, large-scale genome wide association studies (GWASs) have been widely conducted for genetic studies for many years, so researchers usually have hundreds or thousands of genotyped samples. However, genome- wide methylation studies are more recent, and thus researchers often have a small number of samples. Moreover, incomplete samples may be wasted when using methods requiring complete samples (e.g., methods incorporating multi-omics data variables into one regression model). In this scenario, methods combining multiple p-values can be applied to make full use of data. For example, the p-values for association testing of a disease and SNPs, DNA methylation markers, and RNA sequencing data are be calculated separately, and then these separate p-values can be appropriately combined into one final p-value. In order to test for overall gene-level significance, we here present an approach to use a modified Fisher’s method (denoted as Omnibus-Fisher) to combine separate p-values for association testing of a disease or trait and SNPs, methylation markers, and RNA sequencing data calculated by KM regression into an overall gene-level p-value accounting for correlation between omics data. This method can be applied to either samples with all three types of omics data or samples with one or two types. To account for all possible disease models, we further extend the modified Fisher’s method to an optimal test by using perturbations. In our simulation studies, we show that a usual Fisher’s method has inflated type I error rates when directly applied to correlated omics data. In contrast, our Omnibus-Fisher test preserves the expected type I error rates when employed in correlated omics data. Moreover, the Omnibus-Fisher method has increased power compared to its optimal version when the true disease model involves all of SNPs, methylation markers, and RNA sequencing data. On the other hand, the optimal Omnibus-Fisher method is more powerful than its regular version when only one type of data is causal. Finally, we illustrate our proposed methodology by analyzing whole-genome genotyping, DNA methylation, and RNA sequencing data from a study of childhood asthma in Puerto Ricans.

Results

Simulation of the Type I error rate

When applied to samples with independent SNPs (G), methylation markers (M), and RNA sequencing data (E), all of the methods used (i.e., the Fisher’s methods with and without considering p-value covariance [Omnibus-Fisher and usual Fisher], and the optimal test with p-values from Omnibus-Fisher as inputs [optimal Omnibus-Fisher]) had empirical Type I error rates close to the nominal level ( and ). When the usual Fisher’s method without considering covariance was applied to G and E correlated data, the Type I error rate was inflated ( and ). In contrast, the optimal and regular Omnibus-Fisher methods with considering covariance retained the desired Type I error rates as evidenced by the patterns observed in the QQ plots shown in and . Similar results were observed when extending to 100,000 datasets for evaluation (S1 Table).

QQ plot of the p-values.

A 95% pointwise confidence band (gray area) was computed under the assumption that the p-values were drawn independently from a uniform [0, 1] distribution. (A) Independent G, M and E; and (B) G and E correlated.

Statistical power comparison

When we compared the power of the statistics on the samples with independent G, M and E (), the power of optimal Omnibus-Fisher was consistent higher than that of the regular Omnibus-Fisher method when G was the only causal factor, but when G, M and E were all causal factors, the optimal methods had lower power. This was expected because the optimal methods automatically searched for the appropriate disease model; in contrast, the regular Omnibus-Fisher assumed that G, M and E were all causal factors. Thus, when the simulation matched the assumption of the regular version method, they performed better than the optimal version and vice versa. However, when G and M were causal factors, no methods were consistently better than another. Furthermore, similar patterns were observed, when evaluated using the samples with G and E correlated (). Since the causal SNPs in G were correlated with E, GM causal was equivalent to G, M and E causal. Note that the usual Fisher’s method is not included in because of its inflated Type I error rate with correlated data.

Power comparison for the scenario of independent G, M and E.

(A) Binary trait at α = 0.05; (B) Binary trait at α = 0.01; (C) Continuous trait at α = 0.05; and (D) Continuous trait at α = 0.01.

Power comparison for the scenario of G and E correlated.

(A) Binary trait at α = 0.05; (B) Binary trait at α = 0.01; (C) Continuous trait at α = 0.05; and (D) Continuous trait at α = 0.01.

Results of childhood asthma in Puerto Ricans

We used the proposed optimal Omnibus-Fisher statistic and its regular version to analyze the Puerto Rican childhood asthma data from WBCs for associations between asthma status and 14,808 genes with all SNPs, DNA methylation markers, and gene expression, with adjustment for age, gender and first two principal components calculated based genotypes. In addition, batch effect and cell type composition were also adjusted for DNA methylation and RNA sequencing data. We found that ZPBP2 was the most significant gene from both optimal (P = 1.40×10−5) and regular (P = 3.39×10−5) Omnibus-Fisher tests, although it didn’t reach a Bonferroni corrected significance level (P = 3.38×10−6) (). The ZPBP2 region from chromosome 17q21 has been consistently replicated as an asthma-susceptibility locus across diverse ethnic groups [18-28] including Puerto Ricans [29] and this region regulates its gene expression in Puerto Ricans [30]. In a meta-analysis of GWAS in Puerto Ricans [29], the only region associated with asthma was the ZPBP2 locus and the current genotypic dataset was analyzed as a part of the data. This gene could be served as a positive control in asthma genetic studies. In the optimal Omnibus-Fisher test, the significance of ZPBP2 as well as GSDMB was mainly driven by their genetic effect (P = 2.89×10−6 for ZPBP2 and 2.36×10−6 for GSDMB). Moreover, five additional genes (KAT2A, HIST1H1C, NFRKB, C14orf178 and ZNF213-AS1) were suggestively associated with asthma (P < 0.0001 []) from the regular Omnibus-Fisher test. Of these genes, KAT2A had moderate effects for SNPs, DNA methylation, and RNA expression separately, which could be overlooked by a single type of data analysis. The results also indicate that the optimal Omnibus-Fisher test was more powerful than its regular version when the significance was driven by one type of data. Conversely, when statistical significance was driven by two or three types of data, the regular Omnibus-Fisher had overall better power than the optimal version. These observations were generally consistent with the simulation results. Here, the optimal Omnibus-Fisher test does not outperform its regular version that assumes all types of omics data are in the disease model. Since only three types of omics data were analyzed in this study, it was still fine to assume they all were in the disease model. However, when more omics data are analyzed, the optimal test could be more useful than simply assuming all types of data are causal. We additionally output the p-value correlation for each gene across the whole genome (S1 Fig): 1. between SNPs and DNA methylation markers, 2. between SNPs and expression genes, and 3. between DNA methylation markers and expression genes.

–log10(p-values) of the association between 14,808 genes and asthma status in WBCs.

(A) The optimal Omnibus-Fisher test and (B) the regular Omnibus-Fisher test. The blue line is the suggestive significance level, 1×10−4, and the red line is the stringent Bonferroni-corrected significance level, P = 3.38×10−6. Analysis of the WBC genome-wide data with 1,116 samples and 14,808 genes took ~108.8 hours on a single computing node with a 3 GHz CPU and 4 GB memory. Using a computer cluster with multiple nodes, we anticipate that genome-wide data analysis should be finished within hours using our proposed methods.

Discussion

In this work, we developed an Omnibus-Fisher statistic using a kernel machine (KM) regression framework, which can be employed to test overall gene-level significance by combining separate p-values of association testing for a disease and SNPs, methylation markers, and expression genes, accounting for correlation between omics data. The separate p-values are calculated by gene-based KM regression. The gene-based analysis methods can improve power by testing a set of variants jointly and by reducing the multiple testing penalty. In addition, the method using a gene as the unit can easily combine different types of omics data that are mapped to the same gene and thus easily interpret the results. Since we do not know the exact disease model in reality, the extended optimal Omnibus-Fisher test can account for all possible disease models. Moreover, our proposed tests can be applied to either samples with all three types of omics data or with one or two types. In other words, samples with incomplete data can still contribute to the test statistic. The information about whether the different types of omics data are from the sample can also be accounted. In the simulation studies, we showed that using a usual Fisher’s method on correlated omics data results in an inflated Type I error rate, while the modified Fisher’s method, Omnibus-Fisher, had the correct Type I error rate because it considered the omics data correlation in the model. The Omnibus-Fisher method achieves better power performance compared to its optimal version when the true disease model involves all of SNPs, RNA expressions and DNA methylations. On the other hand, the optimal Omnibus-Fisher method has better power than its regular version assuming all types of data are causal when only one type of data is actually causal. Our real data study also shows that the regular Omnibus-Fisher test has better power than the optimal test, when two or three types of data contribute to the combined p-value. Because we only consider three types of omics data in this study, assuming they all are causal could be still acceptable. However, when more omics data are analyzed, we believe that the optimal test would be more powerful for most genes than simply assuming all types of data are causal. Nevertheless, both the optimal and regular Omnibus-Fisher tests are able to detect genes with moderate separate effects, which could be overlooked by single type of data analyses. Although the optimal Omnibus-Fisher test uses perturbation to consider the correlation between omics data and search for the optimal disease model, the genome-wide data analysis could be completed within hours using multiple CPUs (e.g., one CPU for each chromosome). We adapt a stepwise manner to implement perturbation (e.g., more iterations for smaller p-values) so as to save computation times when calculating large p-values. Thus, the majority of the computation time is used by genes with small p-values. However, if a large number of genes are highly associated with the phenotype, the optimal test may be infeasible due to computational intensity. In such case, the regular test is recommended. Although the regular test also involves resampling technique to calculate covariances between different types of omics data, it only requires a small number of resampling (e.g., the default setting is 200 times). Our method framework is general and flexible. Both continuous and binary traits for independent samples can be analyzed. Covariates can be easily incorporated into the model and different covariates can be used for different omics data. The regular and optimal version of Omnibus-Fisher algorithms were implemented in R (http://www.r-project.org) and the R package (https://cran.r-project.org/web/packages/OmnibusFisher/index.html) is available.

Materials and methods

KM Regression for testing gene-level effects of SNPs, DNA methylation and RNA expression

We used KM regression to calculate the gene-level p-values for association testing of a disease and SNPs, methylation markers, and expression genes. First, we test the effect of SNPs. Let there be n subjects with q genetic variants. The n × 1 vector of the continuous trait y follows a linear model: when the phenotypes are binary, y follows a logistic model: where X is an n × p covariate matrix, β is a p × 1 vector containing parameters for the fixed effects (an intercept and p– 1 covariates), G is an n × q genotype matrix for the q genetic variants of interest where an additive genetic model is assumed (i.e., coded as 0, 1, or 2 representing the copies of minor alleles) for illustration, γ is a q × 1 vector for the random effects of the q genetic variants, and ε is an n × 1 vector for the random error. The random effect γ for variant j is assumed to be normally distributed with mean zero and variance τw; thus, the null hypothesis H: γ = 0 is equivalent to H: τ = 0, which can be tested with a variance component score test [17] in the mixed model. The random variable ε is assumed to be normally distributed, and is uncorrelated with γ: where W is a predefined q × q diagonal weight matrix for each variant and may use W = I when lacking of prior information, and is the error variance. Following the same rationale as in the derivation of the SKAT score statistic [31-33], the test statistic is: when phenotypes are continuous, and when phenotypes are binary, where is the vector of estimated fixed effects of covariates under H and . Under the null hypothesis, the linear model is y = Xβ + ε, and the estimates are the logistic model is logit P(y = 1) = Xβ, and the estimates are The statistic Q is a quadratic form and follows a mixture of chi-square distributions under H. Thus, where λ are the eigenvalues of the matrix [34] for both continuous and binary traits. The p-values can be calculated by numerical algorithms, such as Davies’ method [35] and Kuonen’s saddlepoint method [36], which are both included in the R package. Analogously, the gene-level effects of DNA methylation markers and expression genes can be tested by replacing Gγ with Mρ and Eη, M is an n × k matrix for the k methylated loci, ρ is a k × 1 vector for the random effects of the k methylated loci, E is an n × g matrix for the RNA expression, and η is a g × 1 vector for the random effects of the RNA expression. When using microarray platform, multiple probes could map to the same gene and each probe has an expression value, which result in more than one expression value for one gene. Here, g is the number of probes for one gene. When using RNA sequencing platform, one gene can always have one expression value (i.e., E is an n × 1 vector and η is a scalar), although it is also possible to obtain the transcript (i.e., isoform) level expression values. The null hypothesis is ρ = 0 for testing DNA methylation markers and η = 0 for testing expression genes. It is worth to note that all three models can have the same or different null models.

Modified Fisher’s method (Omnibus-Fisher) for combining gene-level effects of SNPs, DNA methylation and RNA expression

In order to have one single p-value to represent the significance of a gene, we propose an approach to test if the trait is associated with any SNP, DNA methylation marker, and RNA sequencing variant. This could help researchers to screen out potentially interesting genes. Thus, after obtaining the three p-values for SNPs, DNA methylation markers, and expression genes, respectively, we used a modified Fisher’s method [37] to combine the three p-values to one. In Fisher’s method, let p (i = 1, 2,…, w) be independent p-values obtained from n hypothesis tests. Under the null hypothesis that p-values follow a Uniform(0, 1) distribution, the combined test statistic is equal to that follows . However, within a gene, these p-values are correlated, thus the generalized Fisher’s method cannot be used directly. To address this issue, we consider a Satterthwaite approximation by approximating a scaled T statistic with a new chi-square distribution [38]. where w = 3 for SNPs, DNA methylation markers, and expression genes. The covariance part takes the correlations of p-values into account and can be empirically estimated by perturbations. The perturbation details are described in the following section.

Optimal test for the gene-level effects of SNPs, DNA methylation and RNA expression, using perturbations with p-values from the Omnibus-Fisher method as inputs

If the disease risk only depends on SNPs and the model with SNPs, DNA methylation markers, and expression genes is used, then the testing power will lose. Since in reality we do not know the underlying true disease model (e.g., only SNP effect, both SNP and RNA variant effects, or all SNP, RNA variant, and DNA methylation marker effects; totally 7 combinations), it is difficult to choose the correct model. Thus, it is desirable to develop a method accommodating all possible disease models to maximize power. This can be achieved by using the minimum p-value of all possible models (7 combinations) as a new test statistic. Then, perturbation can be used to calculate the final p-value. The perturbation-based approach was described in Wu et al. [39]. For continuous phenotypes, with large n, under H the are approximately standard normal. Then each is essentially comprised of a vector of standard normal variables sandwiching a square matrix. Thus, we can perturb each Q by replacing with a new, common vector of normal values to generate new score statistics. Following a similar procedure as described in Urrutia et al. [40]: 1. Calculate the p-values for SNPs (G), DNA methylation (M) and RNA expression (E) separately (i.e., , and ) by KM regression. 2. For l ∈ {G, M and E}, compute Λ = diag(λ,⋯,λ), and V = [,⋯,] where λ ≥ λ ≥⋯≥ λ are the ml positive eigenvalues of with corresponding eigenvectors ,⋯,, where D ∈ {omics data matrices G, M and E}. For example, the aforementioned is for G. 3. Generate with each . This indicates that one subject has one . If the subject has all G, M and E, the same will be used for G, M and E, respectively. Thus, whether G, M and E come from the same subjects or different subjects are considered. 4. For l ∈ {G, M and E}, rotate ( using the eigenvectors to generate . 5. Compute for each l and obtain a corresponding p-value, . 6. Repeat (3)-(5) B times to obtain and for some large number B. 7. Calculate the covariance between p, p and p by using , and for b ∈ {0, 1,…, B}. 8. Calculate the joint p-values of SNPs, DNA methylation and RNA expression (i.e., for b ∈ {0, 1,…, B}, , and ) by Omnibus-Fisher considering p-values covariance. 9. For l* ∈ {G, M, E, GM, GE, ME, and GME}; b ∈ {0, 1,…, B}, set . 10. The final p-value for significance is estimated as

Simulation studies

Pools of haplotypes, DNA methylation markers, and expression genes

For the pool of haplotypes, we simulated 10,000 haplotypes over a 200-kb region generated by the calibrated coalescent model [41], mimicking the European ancestry linkage disequilibrium (LD) structure. For the pool of DNA methylation, 689 samples with 8,557 CpG sites in chromosome 22 were extracted from an epigenome-wide association study of rheumatoid arthritis [42]. The DNA methylation data were generated using the Illumina 450K methylation array. For the pool of RNA expression, 142 samples with 327 genes in chromosome 22 were extracted from an expression continuous trait loci (eQTL) study [30]. The RNA expressions were assessed using the Illumina HT-12 microarray.

Simulation settings

We simulated two main settings: 1. G, M and E are independent with each other and 2. G and E are correlated, but independent with M. In each of the two main settings, we considered both continuous and binary traits. We evaluated the type I error rate by assuming the simulated phenotypes were independent with G, M and E, and evaluated the power performance by assuming the simulated phenotypes were dependent with (1) G; (2) G and M; or (3) G, M and E. Thus, we had the scenarios for G, M and E are independent: (1.1) continuous traits and no causal factors (denoted as null-c-ind), (1.2) continuous traits and G is causal (denoted as causal-G-c-ind), (1.3) continuous traits and G and M are causal (denoted as causal-GM-c-ind), (1.4) continuous traits and G, M and E are causal (denoted as causal-GME-c-ind), (1.5) binary traits and no causal factors (denoted as null-b-ind), (1.6) binary traits and G is causal (denoted as causal-G-b-ind), (1.7) binary traits and G and M are causal (denoted as causal-GM-b-ind), and (1.8) binary traits and G, M and E are causal (denoted as causal-GME-b-ind); and we had the scenarios for G and E are correlated: (2.1) continuous traits and no causal factors (denoted as null-c-cor), (2.2) continuous traits and G is causal (denoted as causal-G-c-cor), (2.3) continuous traits and G and M are causal (denoted as causal-GM-c-cor. When the causal SNPs in G are correlated with E, GM causal is similar to GME causal), (2.4) binary traits and no causal factors (denoted as null-b-cor), (2.5) binary traits and G is causal (denoted as causal-G-b-cor), and (2.6) binary traits and G and M are causal (denoted as causal-GM-b-cor). As shown in S2 Fig, we first simulated the scenarios of binary traits with causal factors.

Simulation of (1.6) causal-G-b-ind, (1.5) null-b-ind, (1.2) causal-G-c-ind and (1.1) null-c-ind

First, we simulated the scenario of (1.6) causal-G-b-ind. Following the flow in S3 Fig, we simulated 50 SNPs, 5 methylated sites and 1 RNA expression for one sample. Then, the phenotype for sample i was generated via the model: where -2.94 = log(0.05/0.95) indicates the disease prevalence is 0.05; X is a continuous covariate generated from a normal distribution with a mean of 20 and a standard deviation of 1; X is a binary covariate generated from a Bernoulli distribution with a probability of 0.5; G,G,…,G are the genotypes of randomly selected causal SNPs. We saved the first 500 cases and 500 controls to form a dataset, and we simulated 1,000 such datasets (S3 Fig). After obtaining (1.6), we permuted the binary traits 10 times for each of the 1,000 (1.6) datasets to get 10,000 datasets for the scenario of (1.5) null-b-ind (S2A Fig). Furthermore, we used the same G, M and E data as generated in (1.6) to simulate 1,000 sets of continuous phenotypes for (1.2) causal-G-c-ind (S2A Fig) via the following model: where X, X and G are the same as described above; ε follows a standard normal distribution. We further permuted the continuous traits 10 times for each of the 1,000 (1.2) datasets to get 10,000 datasets for the scenario of (1.1) null-c-ind (S2A Fig).

Simulation of (1.7) causal-GM-b-ind and (1.3) causal-GM-c-ind

To simulate (1.7) causal-GM-b-ind, we followed a similar procedure to simulating (1.6) causal-G-b-ind via the model: where X, X and G are the same as described above; M and M are the randomly selected causal methylated sites. We further used the same G, M and E data as generated in (1.7) to simulate 1,000 sets of continuous phenotypes for (1.3) causal-GM-c-ind (S2A Fig) via the following model:

Simulation of (1.8) causal-GME-b-ind and (1.4) causal-GME-c-ind

Similarly, (1.8) causal-GME-b-ind and (1.4) causal-GME-c-ind were simulated via the following models for binary and continuous traits respectively: where X, X, G and M are the same as described above; E is the RNA expression and 7.45 is the global mean of RNA expression in the pool. Finally, all the scenarios in the setting of assuming G and E are correlated (i.e., (2.1) ~ (2.6)) were simulated in the analogous ways to the first setting and used the same models. The only difference is that one causal SNP in the gene is correlated with its RNA expression: where E is the aforementioned RNA expression used in the setting of G, M and E independent, G is the first causal SNP in the simulated gene, and is the RNA expression used in the setting of G and E correlated. We compared the optimal test (optimal Omnibus-Fisher) of G, M and E with the Fisher’s method combing G, M and E either considering covariance (Omnibus-Fisher) or not (usual Fisher’s method).

Studies of childhood asthma in Puerto Ricans

Genetic variants

Subject recruitment and study procedures have been described in detail [43, 44]. In brief, 449 children ages 6–14 yrs were recruited from schools in Hartford (CT) from 09/03 to 07/08. From 03/09 to 06/10, 678 children ages 6–14 yrs were recruited in San Juan (Puerto Rico) using a multistage probabilistic sampling design [45]. At each study site, there were no significant differences in age, sex, or area of residence between eligible children whose parents did or did not agree to participate. At both sites, the main recruitment tool was a screening questionnaire on the child’s respiratory health and PR ancestry. We selected case children with physician-diagnosed asthma and wheezing in the prior year. All children had four PR grandparents. The genotypes were coded as 0, 1 and 2 indicating the number of rare allele copies. The imputation was performed on Michigan Imputation Server [46] with HRC r1.1 2016 [47] as the reference genome.

Epigenetic variants

Children with and without asthma (aged 9–20 years) were recruited in San Juan (PR) from February 2014 to May 2017, using a similar approach to that used in a previous study [48]. Whole-genome methylation assays were performed using HumanMethylation450 BeadChips (Illumina, San Diego, CA), and M-values were used in all downstream analyses. RNA-Seq was performed with the Illumina NextSeq 500 platform, paired-end reads at 75 cycles, and 80M reads/sample; reads were aligned to reference human genome (hg19) [49] and TPM (Transcripts Per Kilobase Million) were used as proxy for gene expression level. In summary, we acquired multiple-dimensional data from microarray platforms for SNP genotyping and DNA methylation from white blood cells (WBCs), and RNA sequence data from WBCs in PR children. 10,994,111 imputed and genotyped SNPs with no missing genotypes were selected and they were assigned to 22,332 genes. 368,529 methylated sites with M values were selected and they were assigned to 20,596 genes. 16,188 genes with average TPM value greater than 0.05 were included for RNA expression. In the final analysis, we used the 14,808 genes with all SNP, DNA methylation and RNA expression information, and 1,116 subjects. Note that 471 out of the 1,116 subjects have all three types of omics data ().

Simulated Type I error rates based on 100,000 datasets.

(DOCX) Click here for additional data file.

The p-value correlation for each gene across the whole genome.

(A) between SNPs and DNA methylation markers, (B) between SNPs and expression genes, and (C) between DNA methylation markers and expression genes. (PDF) Click here for additional data file.

Simulation scenarios.

(A) SNPs, DNA methylation and RNA expression are independent; (B) SNPs and RNA expression are correlated. (PDF) Click here for additional data file.

Simulation details of the scenario of (1.6) causal-G-b-ind.

(PDF) Click here for additional data file.

Table 1

Simulated Type I error rates based on 10,000 datasets.

Significance level	Optimal Omnibus-Fisher	Omnibus-Fisher	Usual Fisher
Independent G, M and E with binary traits
0.05	0.0521	0.0516	0.0522
0.01	0.0100	0.0109	0.0108
0.001	0.0015	0.0015	0.0015
Independent G, M and E with continuous traits
0.05	0.0483	0.0472	0.0478
0.01	0.0088	0.0095	0.0095
0.001	0.0007	0.0009	0.0009
G and E correlated with binary traits
0.05	0.0488	0.0526	0.0675
0.01	0.0089	0.0107	0.0161
0.001	0.0010	0.0015	0.0029
G and E correlated with continuous traits
0.05	0.0525	0.0509	0.0655
0.01	0.0093	0.0096	0.0166
0.001	0.0005	0.0011	0.0029

Table 2

Genes with P < 1×10−4 from the optimal or regular Omnibus-Fisher tests for the asthma status analysis in WBCs.

P-valuesGenes Chr		Optimal Omnibus-Fisher	Regular Omnibus-Fisher	Gene-level SNP	Gene-levelMethylation	Gene-level RNA expression
ZPBP2	17	1.40×10⁻⁵	3.39×10⁻⁵	2.89×10⁻⁶	0.1071	0.6728
GSDMB	17	2.00×10⁻⁵	1.67×10⁻⁴	2.36×10⁻⁶	0.8816	0.6153
KAT2A	17	7.00×10⁻⁴	8.06×10⁻⁵	0.0102	0.0047	0.0127
HIST1H1C	6	1.90×10⁻⁴	5.65×10⁻⁵	0.2343	0.0109	2.00×10⁻⁴
NFRKB	11	4.00×10⁻⁴	9.25×10⁻⁵	0.0086	4.76×10⁻⁴	0.1779
C14orf178	14	7.00×10⁻⁵	5.13×10⁻⁵	0.1012	2.30×10⁻⁵	0.1757
ZNF213-AS1	16	0.0020	7.25×10⁻⁵	3.57×10⁻⁴	0.0688	0.0197

45 in total

1. Replication of an association between 17q21 SNPs and asthma in a French-Canadian familial collection.

Authors: Anne-Marie Madore; Karine Tremblay; Thomas J Hudson; Catherine Laprise
Journal: Hum Genet Date: 2007-11-09 Impact factor: 4.132

2. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.

Authors: Bingshan Li; Suzanne M Leal
Journal: Am J Hum Genet Date: 2008-08-07 Impact factor: 11.025

3. Effect of 17q21 variants and smoking exposure in early-onset asthma.

Authors: Emmanuelle Bouzigon; Eve Corda; Hugues Aschard; Marie-Hélène Dizier; Anne Boland; Jean Bousquet; Nicolas Chateigner; Frédéric Gormand; Jocelyne Just; Nicole Le Moual; Pierre Scheinmann; Valérie Siroux; Daniel Vervloet; Diana Zelenika; Isabelle Pin; Francine Kauffmann; Mark Lathrop; Florence Demenais
Journal: N Engl J Med Date: 2008-10-15 Impact factor: 91.245

4. A meta-analysis of genome-wide association studies of asthma in Puerto Ricans.

Authors: Qi Yan; John Brehm; Maria Pino-Yanes; Erick Forno; Jerome Lin; Sam S Oh; Edna Acosta-Perez; Cathy C Laurie; Michelle M Cloutier; Benjamin A Raby; Adrienne M Stilp; Tamar Sofer; Donglei Hu; Scott Huntsman; Celeste S Eng; Matthew P Conomos; Deepa Rastogi; Kenneth Rice; Glorisa Canino; Wei Chen; R Graham Barr; Esteban G Burchard; Juan C Celedón
Journal: Eur Respir J Date: 2017-05-01 Impact factor: 16.671

5. A study of disruptive behavior disorders in Puerto Rican youth: I. Background, design, and survey methods.

Authors: Héctor R Bird; Glorisa J Canino; Mark Davies; Cristiane S Duarte; Vivian Febo; Rafael Ramírez; Christina Hoven; Judith Wicks; George Musa; Rolf Loeber
Journal: J Am Acad Child Adolesc Psychiatry Date: 2006-09 Impact factor: 8.829

6. Kernel machine SNP-set testing under multiple candidate kernels.

Authors: Michael C Wu; Arnab Maity; Seunggeun Lee; Elizabeth M Simmons; Quaker E Harmon; Xinyi Lin; Stephanie M Engel; Jeffrey J Molldrem; Paul M Armistead
Journal: Genet Epidemiol Date: 2013-03-07 Impact factor: 2.135

7. A polymorphism controlling ORMDL3 expression is associated with asthma that is poorly controlled by current medications.

Authors: Roger Tavendale; Donald F Macgregor; Somnath Mukhopadhyay; Colin N A Palmer
Journal: J Allergy Clin Immunol Date: 2008-04 Impact factor: 10.793

8. Discovery of rare variants via sequencing: implications for the design of complex trait association studies.

Authors: Bingshan Li; Suzanne M Leal
Journal: PLoS Genet Date: 2009-05-15 Impact factor: 5.917

9. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis.

Authors: Yun Liu; Martin J Aryee; Leonid Padyukov; M Daniele Fallin; Espen Hesselberg; Arni Runarsson; Lovisa Reinius; Nathalie Acevedo; Margaret Taub; Marcus Ronninger; Klementy Shchetynsky; Annika Scheynius; Juha Kere; Lars Alfredsson; Lars Klareskog; Tomas J Ekström; Andrew P Feinberg
Journal: Nat Biotechnol Date: 2013-01-20 Impact factor: 54.908