Literature DB >> 35886007

MicrobiomeGWAS: A Tool for Identifying Host Genetic Variants Associated with Microbiome Composition.

Xing Hua^1,2, Lei Song¹, Guoqin Yu³, Emily Vogtmann¹, James J Goedert¹, Christian C Abnet¹, Maria Teresa Landi¹, Jianxin Shi¹.

Abstract

The microbiome is the collection of all microbial genes and can be investigated by sequencing highly variable regions of 16S ribosomal RNA (rRNA) genes. Evidence suggests that environmental factors and host genetics may interact to impact human microbiome composition. Identifying host genetic variants associated with human microbiome composition not only provides clues for characterizing microbiome variation but also helps to elucidate biological mechanisms of genetic associations, prioritize genetic variants, and improve genetic risk prediction. Since a microbiota functions as a community, it is best characterized by β diversity; that is, a pairwise distance matrix. We develop a statistical framework and a computationally efficient software package, microbiomeGWAS, for identifying host genetic variants associated with microbiome β diversity with or without interacting with an environmental factor. We show that the score statistics have positive skewness and kurtosis due to the dependent nature of the pairwise data, which makes p-value approximations based on asymptotic distributions unacceptably liberal. By correcting for skewness and kurtosis, we develop accurate p-value approximations, whose accuracy was verified by extensive simulations. We exemplify our methods by analyzing a set of 147 genotyped subjects with 16S rRNA microbiome profiles from non-malignant lung tissues. Correcting for skewness and kurtosis eliminated the dramatic deviation in the quantile-quantile plots. We provided preliminary evidence that six established lung cancer risk SNPs were collectively associated with microbiome composition for both unweighted (p = 0.0032) and weighted (p = 0.011) UniFrac distance matrices. In summary, our methods will facilitate analyzing large-scale genome-wide association studies of the human microbiome.

Entities: Chemical

Keywords: gene–environment interaction; genome-wide association study; host genetics; microbiome; skewness and kurtosis; tail probabilities

Mesh：

Substances：
RNA, Ribosomal, 16S

Year: 2022 PMID： 35886007 PMCID： PMC9317577 DOI： 10.3390/genes13071224

Source DB: PubMed Journal: Genes (Basel) ISSN： 2073-4425 Impact factor: 4.141

1. Introduction

The human body is colonized by bacteria, viruses, and other microbes that exceed the number of human cells by at least 10-fold and that exceed the number of human genes by at least 100-fold. The relationship between a person and his or her microbial population, termed the microbiota, is generally mutualistic. The microbiota may promote human health by inhibiting infection by pathogens, conditioning the immune system, synthesizing and digesting nutrients, and maintaining overall homeostasis. The microbiome, which is the collection of all microbial genes, can be investigated through massively parallel, next-generation DNA sequencing technologies. By amplifying and sequencing highly variable regions of 16S ribosomal RNA genes that are present in all eubacteria, cost-effective and informative microbiome profiles down to the genus level are obtained. The human microbiome has been associated with diseases, including obesity [1], inflammatory bowel disease (IBD) [2], colorectal cancer [3], and breast cancer [4]. Thus, identifying factors that have a sustained impact on the microbiome is fundamental for elucidating its role in health conditions and for developing treatment strategies. Increasing evidence suggests that microbiome composition at a specific site of the human body is impacted by environmental factors [5,6], host genetics [7,8], and possibly by their interactions. In the mouse, quantitative trait loci (QTL) studies have identified loci contributing to the variation in the gut microbiome using linkage analysis [9,10]. Recently, Goodrich et al. [11] systematically investigated the heritability of the human gut microbiome by comparing monozygotic twins to dizygotic twins and found substantial heritability in different microbiome metrics, suggesting the important role of host genetics on gut microbiome diversity. Associations between individual host genetic variants and microbiome taxa abundances have also begun to emerge in other human samples [7,8,12]. These studies suggest that genome-wide association studies (GWAS) have great potential to identify host genetic variants associated with microbiome diversity. GWAS of complex human diseases have identified many risk SNPs; however, the biological mechanisms are largely unknown for the majority of the risk SNPs. QTL studies of intermediate traits, e.g., gene expression [13,14], DNA methylation [15,16], chromatin structure [17,18], and metabolite production [19,20], have provided useful insights into the biological mechanisms of the GWAS findings. The human microbiome at a specific body site is another important and informative intermediate trait for interpreting GWAS signals. Knights et al. [8] reported that a risk SNP for IBD located in NOD2 was associated with the relative abundance of Enterobacteriaceae in the human gut microbiome. Tong et al. [7] show that a loss-of-function allele in FUT2 that increases the risk of developing Crohn’s Disease (CD) may modulate the energy metabolism of the gut microbiome. In both examples, the microbiome is a potential intermediate for explaining the association between risk SNPs and disease risks, although a formal mediation analysis is required based on samples with genotype, microbiome, and disease status data. Moreover, identifying microbiome-associated host genetic variants has the potential to prioritize SNPs for discovery and to improve the performance of polygenetic risk prediction. Three types of microbiome metrics can be derived as phenotypes for GWAS analysis. First, for each taxon at a specified taxonomic level (phylum, class, order, family, genus, and species), we calculate the relative abundance (RA) of the taxon as the ratio of the number of sequencing reads assigned to the taxon to the total number of sequencing reads. In 16S ribosomal RNA sequence profiles, approximately 100–200 taxa with average RAs ≥ 0.1% (from the phylum level to the genus level) across samples are abundant enough for QTL analysis. One can perform a Poisson regression to examine the association between the RA of each taxon and each SNP. Significant associations are identified using Bonferroni correction (p < 5 × 10−8/200 = 2.5 × 10−10) or by controlling FDR at an appropriate level. Second, multiple α-diversity metrics [21] can be calculated to reflect the richness (e.g., number of unique taxa) and evenness of each microbiome community after a procedure called rarefication, which eliminates the dependence between the estimated α diversity and the variable total number of sequence reads across subjects. Once the α-diversity metrics are derived, one may perform standard GWAS with α diversity as the phenotype using linear regression. Because a microbiota functions as a community, the most important analysis for a microbiome GWAS may be by assessing the complete structure of the community by using a pairwise microbiome distance matrix (or β diversity) of the microbial community. Microbiome distances can be defined in different ways, based on using phylogenetic tree information or each taxon’s abundance information. Bray–Curtis dissimilarity [22] quantifies the difference between two microbiome communities using the abundance information of specific taxa. UniFrac [23,24,25] is another widely used distance metric. Unlike the Bray–Curtis dissimilarity metric, UniFrac compares microbiome communities by using information on the relative relatedness of each taxon, specifically by phylogenetic distance (branch lengths on a phylogenetic tree). UniFrac has two variants: the weighted UniFrac [24], which accounts for the taxa abundance information, and the unweighted UniFrac [23], which only models the information of presence or absence. Recently, a generalized UniFrac distance metric [26] was developed to automatically appreciate the advantages of weighted and unweighted UniFrac metrics and was shown to provide better statistical power to detect associations between human health conditions and microbiome communities. GWAS based on a microbiome distance matrix aims to identify the host SNPs associated with microbiome composition. This has been done frequently by fitting non-parametric multivariate models [27]. This approach requires permutations to assess significance [28], which is computationally prohibitive, particularly when evaluating p-values less than 5 × 10−8—the standard GWAS p-value threshold—or even lower when testing multiple-diversity matrices. In a recent microbiome GWAS, the computation is prohibitive even using a moment matching method based on the F-statistic. Intuitively, the microbiome distances tend to be smaller for pairs of subjects with similar genotypic values at the associated SNP. In addition, it is also of great interest to identify host SNPs that interact with an environmental factor to affect microbiome composition. Importantly, β diversity is temporally more stable compared with RA of taxa and α-diversity metrics based on the data from the Human Microbiome Project [29], suggesting a smaller power loss for a GWAS due to temporal variability. To our knowledge, no statistical methods or software packages have been designed to efficiently analyze microbiome GWAS data using distance matrices as phenotypes. In this paper, we develop a statistical framework and a computationally efficient package, microbiomeGWAS, for analyzing microbiome GWAS data. Our package allows the detection of host SNPs with the main effect or interaction with an environment factor; i.e., host SNPs interacting with an environment factor to affect the microbiome composition. We calculate the variance of the score statistics by appropriately considering the dependence of the pairwise distances. Importantly, we show that the score statistics have positive skewness and kurtosis due to the dependence in pairwise distances, which makes the approximation of small p-values based on the asymptotic distribution too liberal, which easily yields false positive associations. Resampling methods, e.g., bootstrap or permutation, are computationally prohibitive for accurately approximating small p-values. We propose to improve the tail probability approximation by correcting for skewness and kurtosis of the score statistics. Numerical investigations demonstrate that our method provides a very accurate approximation, even for p = 5 × 10−8. MicrobiomeGWAS runs very efficiently, taking 36 min for analyzing main effects and 69 min for analyzing both main and interaction effects for a study with 2000 subjects and 500,000 SNPs, using a single core. MicrobiomeGWAS is available at https://github.com/lsncibb/microbiomeGWAS [30], accessed on 30 May 2022. We illustrate our methods by applying microbiomeGWAS to non-malignant lung tissue samples in the Environment And Genetics in Lung cancer Etiology (EAGLE) study [31,32]. Because smoking may alter microbiome composition, we tested both the main effect and gene–smoking interaction effect. When p-values were calculated based on asymptotic distributions, the quantile–quantile (QQ) plots strongly deviated from the uniform distribution. Nine loci also achieved genome-wide significance based on asymptotic approximations. Correcting for skewness and kurtosis eliminated the inflation and also the genome-wide significance of these loci. However, we provide evidence that the established lung cancer risk SNPs are associated with lung microbiome composition.

2. Material and Methods

2.1. A Score Statistic for Testing Main Effect

Suppose that we have a set of subjects genotyped with SNP arrays. For notational simplicity, we consider only one SNP with a minor allele frequency (MAF) denoted as. Our interest centers on testing whether the genotype of the SNP is associated with microbiome composition. Let represent the number of the minor alleles for subject n. We assume that the 16S rRNA gene of microbiota from a target site (e.g., gut) has been sequenced for these samples. Let be the microbiome distance between subject and subject and be the distance matrix. Intuitively, if the SNP is associated with the microbiome composition, the microbiome distances tend to be smaller for subject pairs with similar genotypic values, as is illustrated in Figure 1. For subjects, pairs can be divided into three groups with genetic distance 0, 1, and 2. For example, a pair of subjects with genotype (AA, AA) or (BB, BB) has genetic distance 0; a pair of subjects with genotype (AA, BB) or (BB, AA) has genetic distance 2; all other pairs have genetic distance 1. Apparently, we expect the microbiome distance to be positively correlated with genetic distance for subject pairs.

Figure 1

Microbiome distances are positively correlated with genetic distances at an associated SNP.

We define as the genetic distance for a pair of subjects . We assume . A score statistic for testing (main effect) vs. is derived as: The variance under is calculated by considering the dependence in and conditioning on the distance matrix . Briefly, we have When are distinct, and are independent; i.e., . Some algebra leads to where and The details for calculating and are in Appendix A. The normalized statistic under asymptotically. In analyses of real data, we typically have to adjust for covariates, including demographic variables and principal component analysis (PCA) scores derived based on genotypes, to eliminate potential population stratification. Given a distance matrix and covariates , we perform distance-based redundancy analyses using function capscale in the vegan package [33]. The residual matrix , extracted using the residuals function in the vegan package [33], is now adjusted for these potential confounding factors and can be used for genetic analysis.

2.2. A Score Statistic for Testing Gene–Environment Interaction

Let denote an environmental variable. Define . We extend the statistical framework to detect the SNP–environment interaction by assuming where denotes the main genetic effect, denote the additive gene–environment effect, and denotes the main effect of the environmental factor. We consider testing the null hypothesis that the SNP is not associated with microbiome composition either directly or by interacting with . The alternative hypothesis is . We estimate and under and calculate . Let be the residual matrix. The scores evaluated under are and for . Similar to (2), we derive the variance by accounting for the dependence in : Let and . Asymptotically, and under . In Appendix B, we derive Let be the correlation between the two statistics. Asymptotically, follows a bivariate normal distribution with a correlation matrix . In Appendix C, we derive a statistic for jointly testing vs. . Briefly, the 2D plane is partitioned to four parts (Figure 2). The joint statistic is derived as where , and . The asymptotic p-value is calculated as where .

Figure 2

Define the joint test for testing vs. . We assume that , and under . Details are in Appendix C.

2.3. Improved p-Value Approximations by Correcting for Skewness and Kurtosis

Theoretic investigation suggests that the score statistics and have a positive skewness, which makes the tail probability approximations based on the asymptotic distribution unacceptably liberal (Figure 3A,B). In a numeric example with skewness , 2.9 × 10−7 based on , which is approximately two orders of magnitude more significant than p = 3.9 × 10−5 based on 108 permutations. The significance inflation becomes worse for smaller p-values and larger skewness . Similar but more tedious calculations suggest that both statistics have positive kurtosis, making the approximation based on even worse. One possible solution is to approximate tail probabilities using permutations or bootstrap. However, these resampling methods are computationally prohibitive for testing millions of common SNPs in a large-scale study.

Figure 3

Correcting tail probabilities for skewness and kurtosis. (A) The standard normal distribution and an approximately normal distribution with positive skewness. The skewness has big impact when calculating the tail probability for a large value of . (B) Numerical evaluation of tail probability approximation for . We used the unweighted UniFrac distance matrix of 500 samples from the American Gut Project (AGP). For each value of b (>0), we calculated p-values based on , skewness correction, both skewness and kurtosis correction, and 108 simulations. (C) Skewness depends on minor allele frequency (MAF) of SNPs and the sample size of the study, calculated based on the weighted UniFrac distance matrix in AGP data. (D) Kurtosis depends on MAF of SNPs and the sample size, calculated based on the weighted UniFrac distance matrix in the AGP data.

To address this problem, we calculated the skewness and kurtosis of the score statistics under (Appendix D). We propose to improve the tail probability approximation by correcting for the skewness and kurtosis, following the skewness correction in linkage analysis [34,35]. Technical details are provided in Appendix E. Correcting for both skewness and kurtosis leads to an approximation where satisfies , and is the cumulative distribution function of . Correcting for skewness but ignoring kurtosis (i.e., assuming ) leads to an approximation where , . Numerical results presented in Figure 3B demonstrate that (9) works very well. Given the distance matrix , , , and (Appendix D). Thus, skewness decays much more slowly with sample size than kurtosis (Figure 3C,D). Thus, even for a large study with thousands of samples, correcting for skewness is necessary for accurately evaluating the tail probabilities. Importantly, both skewness and kurtosis highly depend on the MAF, suggesting that the impact of skewness and kurtosis is different across SNPs with a different MAF. Numerical studies (Figure 3C,D) show that skewness and kurtosis are minimized when MAF = 0.5 and maximized when MAF 0.2–0.3. Finally, we discuss how to approximate the tail probability of in (7) for testing by correcting for non-normality in and . When (or ), we calculate the skewness and the kurtosis and use (9) to approximate . When , we first approximate their marginal p-values as and by (9), and then calculate the normal quantile and . Because the correction primarily impacts the tails of the distributions, the correlation between the two statistics will remain roughly unchanged; i.e., . Thus, when , the tail probability is approximated as .

3. Results

3.1. Simulation Results

The main purpose of simulations was to investigate the type-I error of (for testing the main genetic effect), (for detecting SNP–environment interactions), and (for detecting either the main genetic effect or SNP–environment effect or both). Simulations were performed under different combinations of sample size, MAF, and microbiome distance matrices. To make the simulations realistic, we used an unweighted distance matrix of the fecal microbiome samples with the 16S rRNA V4 region sequences from the American Gut Project (AGP) [36]. The OTU table, rarefied to 10,000 sequence reads per sample, as well as the metadata were downloaded from the AGP website. Samples with less than 10,000 sequence reads were excluded from the analysis. The weighted and the unweighted UniFrac distance matrices were generated in the Quantitative Insights Into Microbial Ecology [21] (QIIME) pipeline. Because antibiotics may substantially change the microbiome composition to generate outliers that may distort the null distribution, we excluded samples with self-reported history of antibiotic usage within one month. After quality control, 1879 subjects remained for analysis. In the simulations, we randomly selected samples for a given sample size . For each setting, the type-I error rates were evaluated based on 108 simulations under . For the interaction test and the joint test, the binary environment factor had a frequency of 50% and was simulated independent of the SNP. The type-I error rates are summarized in Table 1 for the weighted UniFrac distance matrix. The skewness and kurtosis are reported in Figure 3C,D. The statistics adjusted for skewness and kurtosis have accurate type-I error rates while the statistics without adjustment have unacceptably high type-I error rates. As the sample size increases, the impact of skewness and kurtosis decreases. However, even for a study with , the type-I error rates are still seriously inflated. The results for the unweighted UniFrac distance matrix and for MAF = 0.5 are reported in Table S1.

Table 1

Type-I error rates estimated based on 108 simulations. Minor allele frequency = 20%. Simulations were based on the weighted UniFrac distance matrix of the gut microbiome data from the American Gut Project. Reported are the type-I error inflation factor. A value greater than 1 indicates an inflated type-I error.

		Z_M			Z_I			Q
	N	α = 10⁻³	10⁻⁵	10⁻⁷	10⁻³	10⁻⁵	10⁻⁷	10⁻³	10⁻⁵	10⁻⁷
Asymptotic approximation	100	5.5	51.6	610.0	4.7	36.1	342.8	7.3	80.9	1148.0
	200	3.7	23.0	187.3	3.1	15.8	105.5	4.6	33.0	316.7
	500	2.4	9.4	45.2	2.1	6.7	25.5	2.8	11.9	64.1
	1000	2.0	5.7	21.3	1.8	4.4	14.0	2.2	6.9	28.5
Adjusted for skewness and kurtosis	100	1.0	1.2	0.7	1.0	1.1	0.6	1.0	1.5	2.0
	200	1.0	1.1	1.0	1.0	1.1	0.7	0.9	1.3	1.8
	500	1.0	1.1	1.3	1.0	1.0	0.9	0.9	1.0	1.7
	1000	1.0	1.0	1.2	1.0	1.0	0.8	0.9	1.0	1.1

3.2. Software Implementation, Memory Requirement, and Computational Complexity

We implemented our algorithms in a software package, microbiomeGWAS, which is freely available at https://github.com/lsncibb/microbiomeGWAS [30], accessed on 30 May 2022. MicrobiomeGWAS requires three sets of files: a microbiome distance matrix file, a set of PLINK binary files for GWAS genotypes, and a set of covariates. MicrobiomeGWAS processes one SNP at a time and does not load all genotype data into memory; thus, it requires only memory for storing the distance matrix. Variance, skewness, and kurtosis can be partitioned into two parts related with the microbiome distance matrix and the MAF of the SNP separately; thus, we can quickly calculate these quantities for a predefined grid of MAFs. The overall computational complexity is about , where is the sample size and is the number of SNPs. Figure 4 reports the computation time on a Linux server using a single core. For a study with 10,000 subjects, it takes approximately 15 h for analyzing the main effect and approximately 30 h for analyzing both the main and interaction effects for 0.5 million variants. As a comparison, in a recent microbiome GWAS [37], to analyze 7 × 10−6 variants for the main effect and n = 3382 subjects in the SHIP-TREND cohort [37], it would take 61 years using one CPU and 94 days using one graph-processing unit for parallel computation. Moreover, their analytic pipeline could not jointly analyze all 8956 subjects from five cohorts because of the computational burden; instead, they performed a stepwise search that may cause power loss.

Figure 4

Computation time for a microbiome GWAS with 500,000 SNPs. “Main”: computation time for testing main effect only. “All”: computation time for testing main effect, interaction and the joint null hypothesis .

3.3. GWAS of Microbiome Diversity in Adjacent Normal Lung Tissues

We applied our methods to a set of lung cancer patients of Italian ancestry in the EAGLE [31] study. All subjects have germline genome-wide SNPs [32] and 16S rRNA microbiome data (V3-V4 region, Illumina MiSeq, 300 paired-end) in histologically normal lung tissues from these patients. Here, the histologically normal lung tissues were 1~5 cm from the tumor tissue. We performed a series of quality control steps to filter out low-quality sequence reads: average quality score <20 over 30 bp windows, less than 60% similarity to the Greengenes [38] reference, or identified as chimera reads using UCHIME [39]. Sequence reads were then processed by QIIME [21] to produce the relative abundances (RA) of taxa, two α-diversity metrics (observed number of species and Shannon’s index), and β diversity metrics (unweighted and weighted UniFrac distances) rarified to 1000 reads. We included 147 subjects with at least 1000 high-quality sequence reads for genetic association analysis. Out of the 147 subjects, 78 are current smokers, 8 never smoked, and 61 are former smokers. Because of the small number of never smokers, we merged never and former smokers as non-current smokers. All of the genetic association analyses were adjusted for sex, age, smoking status, and the top three PCA scores derived based on genome-wide SNPs. Here, the top three PCA scores were selected for controlling population stratification because the other PCA scores were unassociated with the distance matrices. We included 383,263 common SNPs with MAF ≥ 10% because rarer SNPs were expected to have no statistical power given the current sample size. We first performed GWAS analysis using PLINK [40] to identify the SNPs associated with taxa with an average RA greater than 0.1% or two α-diversity metrics. We did not detect genome-wide significant associations with either the main effects or gene–smoking interactions. Next, we performed GWAS analysis using unweighted and weighted UniFrac distance matrices as a representation of eubacteria β diversity. The results for testing the main effects are reported in Figure 5. Results for testing the joint effects (main effect and SNP by smoking status interaction) are reported in Figure S1. Because of the small sample size, we observed large values of skewness and kurtosis, with the magnitude varying with the MAF of the SNPs (Figure 5A). The score statistics based on the weighted UniFrac distance matrix had a much larger skewness and kurtosis than did the unweighted UniFrac matrix. Figure 5B,C report the quantile–quantile (QQ) plot of the logarithm of the association p-values for the unweighted and weighted UniFrac distance matrices, respectively. For each distance matrix, we produced QQ plots for p-values based on the asymptotic approximation and for p-values adjusted for skewness and kurtosis. For both distance matrices, the QQ plots before adjustment strongly deviated from the expected uniform distribution. Our adjustment eliminated the deviation. In addition, consistent with the observation that the skewness and kurtosis were larger for the weighted UniFrac distance matrix, the QQ plot deviated more for the analysis based on the weighted UniFrac distance. Note that the skewness and kurtosis only affect the tail probabilities; thus, the inflation of the QQ plot is not reflected by the genomic control lambda value [41], calculated as the median of the p-values. In fact, lambda for all four QQ plots.

Figure 5

Results of analyzing the microbiome GWAS data of 147 adjacent normal lung tissues in the EAGLE study. (A) Skewness and kurtosis for the main effect test using the unweighted and the weighted UniFrac distance matrices. (B) Quantile–quantile (QQ) plot for association p-values using the unweighted UniFrac distance matrix. “Adjusted”: p-values were corrected for skewness and kurtosis. “Unadjusted”: p-values were approximated based on the asymptotic distribution . (C) Quantile–quantile (QQ) plot for association p-values using the weighted UniFrac distance matrix. (D) Manhattan plots based on the unweighted or the weighted UniFrac distance matrices. (E) Box plots for the top nine loci in microbiome GWAS analysis. Subject pairs are classified into three groups according to the genetic distance at the SNP. The y-coordinate is the microbiome distance.

Without correcting for skewness and kurtosis, we identified three and six loci achieving genome-wide significance () for the unweighted and weighted UniFrac distance matrices, respectively (Figure 5D). After correcting for skewness and kurtosis, no locus remained genome-wide significant (Figure 5D), which was verified by 108 permutations. Importantly, skewness and kurtosis had a dramatic effect on tail probabilities. Here, we use SNP rs12785513 as an example, which was identified as the top SNP in both analyses. In the unweighted UniFrac analysis, p = 4.4 × 10−9 without adjustment and p = 1.6 × 10−6 after adjustment, a 364-fold inflation. The inflation was even larger for weighted UniFrac analysis because of larger skewness and kurtosis (Figure 5A). In fact, p = 3.4 × 10−10 without adjustment and p = 3.5 × 10−6 after adjustment, a 1000-fold inflation. Although these SNPs were not significant genome-wide, they were the top SNPs from the current study. Thus, we report box-plots for each of these nine SNPs (Figure 5E). As expected, in all box plots, microbiome distances tend to be larger in subject pairs with greater genetic distance at these SNPs. These associations remain to be replicated in studies with larger sample sizes. Finally, we concentrated on the six common SNPs in four genomic regions reported to be associated with lung cancer risk in GWAS of European subjects: rs2036534 and rs1051730 at 15q25.1 [42,43,44,45] (CHRNA5–CHRNA3–CHRNB4), rs2736100 and rs401681 at locus 5p15.33 [31,46] (TERT/CLPTM1L), rs6489769 [47] at 12p13.3 (RAD52), and rs1333040 at 9p21.3 [48] (CDKN2A/CDKN2B). The SNPs at 15q25.1 and 5p15.33 have the largest effect sizes for lung cancer risk based on the meta-analysis from the Transdisciplinary Research in Cancer of the Lung (TRICL) consortium [48]: OR = 1.32 for rs1051730, OR = 1.26 for rs2036534, OR = 1.13 for rs2736100, and OR = 1.14 for rs401681. Rs3131379 at locus 6p21.33 [46] (BAT3/MSH5) was excluded because the MAF = 7.5%. No SNPs were significantly associated with taxa RAs or α-diversity metrics after correcting for multiple testing. However, association analysis based on the UniFrac distance matrices provided evidence that these SNPs may be associated with the lung microbiota (Table 2). These SNPs were independent except that rs2036534 and rs1051730 at 15q25.1 were weakly correlated with R2 = 0.15. A test combining six -scores () and adjusting for the weak correlation yielded overall p-values of 0.0033 and 0.011 for the unweighted and the weighted UniFrac distance matrices, respectively. These results suggest that lung cancer risk SNPs were enriched for genetic association with the composition of the lung microbiome. The results for testing interactions and joint effects are reported in Table S2.

Table 2

Association p-values between lung cancer risk SNPs and microbiome composition in the EAGLE data.

SNP	Chr	Annotated Genes	Unweighted UniFrac	Weighted UniFrac
rs2036534	15q25.1	CHRNA3/4/5	0.425	0.167
rs1051730	15q25.1	CHRNA3/4/5	0.020	0.401
rs2736100	5p15.33	TERT	0.089	0.267
rs401681	5p15.33	CLPTM1L	0.056	0.005
rs6489769	12p13.3	RAD52	0.197	0.329
rs1333040	9p21.3	CDKN2A/B	0.249	0.224
Overall test			0.0032	0.011

4. Discussion

We developed a software package, microbiomeGWAS, for identifying host genetic variants associated with microbiome composition. MicrobiomeGWAS can test both the main effect and SNP–environment interactions. Importantly, we found that the score statistics had positive skewness and kurtosis and that the tail probabilities evaluated based on asymptotic approximations were very liberal. We addressed this problem by explicitly adjusting for skewness and kurtosis. MicrobiomeGWAS runs very efficiently and takes only 36 min for testing main effects and 69 min for testing joint effects in a GWAS with 2000 subjects and 500,000 markers. Other statistical methods exist for testing the association of microbiome distance matrices. PERMANOVA [27] is an extension of multivariate analysis of variance to a matrix of pairwise distances and relies on permutations to evaluate significance. MiRKAT [49], a recently proposed method based on kernel regression, takes hours for evaluating one association for 2000 subjects. Neither is computationally feasible for analyzing a large-scale GWAS of a microbiome. Recently, an asymptotic distribution was proposed to approximate the p-value for the PERMANOVA pseudo-F statistic [50]; however, whether it is sufficiently accurate for very small p-values (p < 5 ) remains to be investigated. Interactions of host genetic susceptibility with the microbiome have been postulated for many conditions, including inflammatory bowel diseases [51,52], autoimmune and rheumatic diseases [53,54,55,56], diabetes [57], and cancer, especially of the colon [58]. All models of these host–microbiome interactions also note the critical role of environmental factors, including diet, smoking, drugs, and antibiotics and other medications [59]. Although based on a very small initial sample set, the suggestive associations that we found between the six known lung cancer risk SNPs and the microbiome of adjacent normal lung tissue samples, including effects of cigarette smoking, provide preliminary evidence that our microbiomeGWAS method is likely to be a useful tool for generating data that will unravel host–microbiome interactions with high confidence. We are working on two extensions for microbiomeGWAS: (1) jointly testing additive and dominant effects; and (2) testing genetic associations using many microbiome distance matrices. We have assumed an additive effect model (Figure 1); however, several top SNPs in the EAGLE data suggest a dominant effect (e.g., rs8083714 in Figure 5E). Thus, a statistic for jointly testing the additive and dominant effects might be powerful for this scenario. The second extension is motivated by the fact that that the power to detect associations depends heavily on the choice of distance matrix. The recently developed generalized UniFrac [26] (gUniFrac) defines a series of distance matrices to reflect the different emphases of using taxa relative abundance information. gUniFrac has been shown to have a robust power for association studies [26]. Extending microbiomeGWAS to gUniFrac, however, requires solving two problems. First, the computational complexity is proportional to the number of distance matrices analyzed for associations, which can be addressed by implementing the algorithms using multithreading technology. Second, we need to derive accurate analytic approximations to the association p-values by correcting for the multiple testing introduced by many distance matrices. MiRKAT [49] has an option for using gUniFrac; however, intensive permutations are required to evaluate p-values. In summary, GWAS of the microbiome of each body site has the potential to help one understand microbiome variation, to elucidate the biological mechanisms of genetic associations, to improve the power of identifying novel disease-associated genetic variants, and to improve the performance of genetic risk prediction. We expect our methods and software to be useful for large-scale GWAS of the human microbiome.

53 in total

Review 1. Diet, genes, and microbes: complexities of colon cancer prevention.

Authors: Diane F Birt; Gregory J Phillips
Journal: Toxicol Pathol Date: 2013-10-15 Impact factor: 1.902

2. Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota.

Authors: Jun Wang; Louise B Thingholm; Jurgita Skiecevičienė; Philipp Rausch; Martin Kummen; Johannes R Hov; Frauke Degenhardt; Femke-Anouska Heinsen; Malte C Rühlemann; Silke Szymczak; Kristian Holm; Tönu Esko; Jun Sun; Mihaela Pricop-Jeckstadt; Samer Al-Dury; Pavol Bohov; Jörn Bethune; Felix Sommer; David Ellinghaus; Rolf K Berge; Matthias Hübenthal; Manja Koch; Karin Schwarz; Gerald Rimbach; Patricia Hübbe; Wei-Hung Pan; Raheleh Sheibani-Tezerji; Robert Häsler; Philipp Rosenstiel; Mauro D'Amato; Katja Cloppenborg-Schmidt; Sven Künzel; Matthias Laudes; Hanns-Ulrich Marschall; Wolfgang Lieb; Ute Nöthlings; Tom H Karlsen; John F Baines; Andre Franke
Journal: Nat Genet Date: 2016-10-10 Impact factor: 38.330

Review 3. Multi-omics analysis of inflammatory bowel disease.

Authors: Hu Huang; Pajau Vangay; Christopher E McKinlay; Dan Knights
Journal: Immunol Lett Date: 2014-08-15 Impact factor: 3.685

4. Human genetics shape the gut microbiome.

Authors: Julia K Goodrich; Jillian L Waters; Angela C Poole; Jessica L Sutter; Omry Koren; Ran Blekhman; Michelle Beaumont; William Van Treuren; Rob Knight; Jordana T Bell; Timothy D Spector; Andrew G Clark; Ruth E Ley
Journal: Cell Date: 2014-11-06 Impact factor: 41.582

5. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population.

Authors: Chiara Sabatti; Susan K Service; Anna-Liisa Hartikainen; Anneli Pouta; Samuli Ripatti; Jae Brodsky; Chris G Jones; Noah A Zaitlen; Teppo Varilo; Marika Kaakinen; Ulla Sovio; Aimo Ruokonen; Jaana Laitinen; Eveliina Jakkula; Lachlan Coin; Clive Hoggart; Andrew Collins; Hannu Turunen; Stacey Gabriel; Paul Elliot; Mark I McCarthy; Mark J Daly; Marjo-Riitta Järvelin; Nelson B Freimer; Leena Peltonen
Journal: Nat Genet Date: 2008-12-07 Impact factor: 38.330

6. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans.

Authors:
Journal: Science Date: 2015-05-07 Impact factor: 47.728

7. UniFrac: a new phylogenetic method for comparing microbial communities.

Authors: Catherine Lozupone; Rob Knight
Journal: Appl Environ Microbiol Date: 2005-12 Impact factor: 4.792

8. Murine gut microbiota is defined by host genetics and modulates variation of metabolic traits.

Authors: Autumn M McKnite; Maria Elisa Perez-Munoz; Lu Lu; Evan G Williams; Simon Brewer; Pénélope A Andreux; John W M Bastiaansen; Xusheng Wang; Stephen D Kachman; Johan Auwerx; Robert W Williams; Andrew K Benson; Daniel A Peterson; Daniel C Ciobanu
Journal: PLoS One Date: 2012-06-18 Impact factor: 3.240

9. The Human Microbiome Project: a community resource for the healthy human microbiome.

Authors: Dirk Gevers; Rob Knight; Joseph F Petrosino; Katherine Huang; Amy L McGuire; Bruce W Birren; Karen E Nelson; Owen White; Barbara A Methé; Curtis Huttenhower
Journal: PLoS Biol Date: 2012-08-14 Impact factor: 8.029

10. Characterizing the genetic basis of methylome diversity in histologically normal human lung tissue.

Authors: Jianxin Shi; Crystal N Marconett; Jubao Duan; Paula L Hyland; Peng Li; Zhaoming Wang; William Wheeler; Beiyun Zhou; Mihaela Campan; Diane S Lee; Jing Huang; Weiyin Zhou; Tim Triche; Laufey Amundadottir; Andrew Warner; Amy Hutchinson; Po-Han Chen; Brian S I Chung; Angela C Pesatori; Dario Consonni; Pier Alberto Bertazzi; Andrew W Bergen; Mathew Freedman; Kimberly D Siegmund; Benjamin P Berman; Zea Borok; Nilanjan Chatterjee; Margaret A Tucker; Neil E Caporaso; Stephen J Chanock; Ite A Laird-Offringa; Maria Teresa Landi
Journal: Nat Commun Date: 2014-02-27 Impact factor: 14.919