Literature DB >> 31874630

wtest: an integrated R package for genetic epistasis testing.

Rui Sun^1,2, Xiaoxuan Xia^1,2, Ka Chun Chong^1,2, Benny Chung-Ying Zee^1,2, William Ka Kei Wu^3,4, Maggie Haitian Wang^5,6.

Abstract

BACKGROUND: With the increasing amount of high-throughput genomic sequencing data, there is a growing demand for a robust and flexible tool to perform interaction analysis. The identification of SNP-SNP, SNP-CpG, and higher order interactions helps explain the genetic etiology of human diseases, yet genome-wide analysis for interactions has been very challenging, due to the computational burden and a lack of statistical power in most datasets.
RESULTS: The wtest R package performs association testing for main effects, pairwise and high order interactions in genome-wide association study data, and cis-regulation of SNP and CpG sites in genome-wide and epigenome-wide data. The software includes a number of post-test diagnostic and analysis functions and offers an integrated toolset for genetic epistasis testing.
CONCLUSIONS: The wtest is an efficient and powerful statistical tool for integrated genetic epistasis testing. The package is available in CRAN: https://CRAN.R-project.org/package=wtest.

Entities: Chemical Disease Gene Mutation Species

Keywords: Association study; Epistasis testing; R package

Mesh：

Year: 2019 PMID： 31874630 PMCID： PMC6929460 DOI： 10.1186/s12920-019-0638-9

Source DB: PubMed Journal: BMC Med Genomics ISSN： 1755-8794 Impact factor: 3.063

Background

The etiology of complex disorder involves an interplay of polygenic biomarkers, lifestyle and environmental factors [1]. Robust and efficient statistical tools are needed to perform interaction analysis in high volume genome data. Besides SNP-SNP interactions, the analysis of interactions of SNPs and cytosine-phosphate-guanine (CpG) sites might provide novel insight into the regulatory mechanism DNA methylation and gene expression underlying complex diseases. Here we introduce a software that provides estimations for different types of genetic associations, including the main effect, second or higher order interaction, and gene-methylation interaction. This package is built upon the W-test [2] to perform epistasis testing. The statistic compares distributional differences of a set of biomarkers in cases and controls and follows a chi-squared distribution with data-set adaptive degrees of freedom. The method has the advantage of correcting p-value bias caused by complicated genetic architectures. Flexible implementation options are provided. The package can calculate SNP-CpG epistasis for biomarkers located in physical proximity of the input genome and epigenome. A number of post-test diagnostic, visualization and statistical genetic analysis functions are provided for model diagnosis. This is the first statistical software providing functions for direct gene-methylation interaction and high-order interaction evaluations in genome and epigenome dataset.

Implementation

Design

The wtest package is based on the W-test [2] to measure the association between binary phenotype and categorical genetic data. To test the association of a subset marker, a k by 2 contingency table can be formed, where k is the number of non-empty category combination formed by the SNP-set, and 2 is the binary phenotype. The statistic tests for the existence of distributional difference of a subset in the case group from a comparison control group, and it takes the following form, where n1 and n0 are the number of cases and controls in the i cell of the contingency table; N1 and N0 are the total cell counts of cases and controls; and are the conditional cell probabilities of the i cell of the contingency table; and SE is the standard error of the i log odds ratio. The W-test follows a chi-squared distribution of f degrees of freedom. The scalar h and degree of freedom f take forms of covariance matrices of the log odds ratios and are estimated from bootstrapped samples under the null hypothesis by the large sample theory. The W-test inherits a data-set adaptive degree of freedom that absorbs the genetic variation not attribute to phenotypes, therefore robust to complicated genetic architectures. In this software, we further extend it to evaluate high-order interaction effect and gene-methylation interaction effect. For gene-methylation interaction, methylation data are clustered into two categories according to high and low methylation levels by two-mean clustering algorithm. We also use a novel triangular network diagram to display interaction effects up to the third order. Extensive simulation studies testing the power and type I error of the W-test can be found in Wang, Sun et al. (2016) [2] and Sun et al. (2017) [3].

Implementation

Figure 1 demonstrates the major functions in the package and illustrates the implementation step by step using example data in the package. The implementation is performed in two steps: (1) Estimation of parameters h and f; (2) Testing by the W-test. Step 1. Estimation of parameters h and f. In genotype data, the hf() function is called, and in genotype and methylation data, the function hf.snps.meth() is called. Parameter h is the scaler in Eq. (1) and f is the degrees of freedom of a chi-squared distribution of the W-test. The two parameters are estimated using bootstrap samples with permutated phenotypes (null hypothesis) for B times. Simulations suggest that the estimation converges at B>400 when the number of variables is 1000 and the number of subjects is 1000 (Additional file 1). If step 1 is not performed, the p-value of W-test will be calculated by default h and f: h=k/(k−1) and f=k−1. In this case, k is the integer categorical combinations formed by the marker set. When k=2, the W-test is equivalent to the odds ratio test for a 2-by-2 table.

Fig. 1

Integrated genetic epistasis testing and functions

Integrated genetic epistasis testing and functions Step 2. Testing by the W-test. The wtest() evaluates main and second order interaction and wtest.high() evaluates third or higher order interaction in genotype data. The wtest.snps.meth() calculates SNP-CpG interactions for genome and epigenome data. Oftentimes users are interested to explore the interactions among biomarkers with a certain level of main effect signals. The input.pval option in the function can be used to screen candidate SNPs according to their p-values to form interaction sets. While the output.pval option allows the convenient output of interaction sets reaching a p-value threshold. In function wtest.snps.meth(), positions of the biomarkers are input alongside the genome and epigenome data sets, and the window size to calculate cis-regulation relationship can be specified. The methylation.recode() function transforms the methylation data into high and low methylated levels. For high order interaction calculation, a simple check for sample size can be done by estimating the average number of cell counts formed by a set, and a high order is feasible if the number is at least two. A reference table could be found in Additional file 2 with suggested sample sizes for various order of interactions. Diagnostic checking for test statistic distribution can be performed by w.diagnosis(), which plots the W-test statistics histograms from the observed data and the curve of the chi-squared distribution using estimated parameters, indexed by the number of categorical combinations k. Close overlaying of the densities indicates the goodness of fit of estimation. An example is shown in the real data application section. The w.qqplot() function assists the diagnostic of probability distribution and degree of population stratification.

Results

Real data example

The software is applied to a number of real data analysis with novel biomarker findings and interesting implications [2-9]. Here we demonstrate its usage by two data sets: a genotypic dataset for bipolar disorder from the Genetic Association Information Network (GAIN) project, and a gene-methylation data for the lipid control treatment.

Application I. GAIN bipolar disorder dataset

This data contains 653 bipolar disorder patients and 1767 healthy controls, and 46,181 SNPs of chromosome 6 [10]. The result of h and f estimation can be found in Additional file 3. At second order interaction (order=2), setting input.pval=0.001 and output.pval=0.001, the function would output second order epistasis marker pairs with p-value <0.001. Figure 2 is the diagnostic plot for this estimation using w.diagnosis() function. The estimated red color chi-square curve follows closely with the histogram of the test statistics calculated from the observed data, showing a good estimation of the parameters.

Fig. 2

Diagnostic plot by w.diagnostics. At each combination size k, the estimated red color chi-square curve follows closely with the histogram of the W-test statistics calculated from the observed data, showing a good estimation of the parameters Data analysis identified one SNP with significant main effect: rs2495982 near GRM4, p-value =2.06×10−7. GRM4 is a major excitatory neurotransmitter in central nervous system and it is a susceptible gene for bipolar disorder and schizophrenia [11, 12]. For interaction effects, a number of SNP sets surpassed the Bonferroni corrected significance level. The top SNPs identified from different orders of interaction are listed in Additional file 4, and the interaction network up to the third order is plotted in a triangular network in Fig. 3. Each colored triangle in the network indicates a significant third order interaction, and the bold edge shows a significant second order interaction. It could be seen from the plot that the strongest interaction is formed by the gene set (SYNE1, BTBD9, RPL12P2) in the middle of the plot, in which BTBD9 plays a key role and extends to form significant combinations with FGD2 and CDKAL1. The BTBD9 is reported to be associated with neuropsychiatric disorders such as restless legs syndrome in Schizophrenia and the Tourette Syndrome [13, 14]. The gene encodes the BTB/POZ domain-containing protein that involved in protein-protein interactions [15], and is highly expressed in brain tissues [16]. It is very encouraging to discover this gene with known physical protein interaction function from pure computational and statistical perspective.

Fig. 3

Triangular network for third order genetic interactions

Application II. gene-methylation interaction analysis for lipid control data

This application was originally reported in Sun et al 2018 [3]. The data set contains 476 diabetic patients undergone lipid control treatments, and 150,000 candidate SNP-CpG pairs within 10kb genome distance (window.size=10,000). The phenotype is whether or not a subject responded to the treatment, calculated by comparing the before and after treatment triglyceride levels [3]. The h and f are estimated by hf.snps.meth(), and the gene-methylation interactions are calculated by wtest.snps.meth(). Table 1 summarized the top 5 markers identified by gene-methylation interaction associations. The cluster of genes is found to be involved in neuronal and retinal functions, including MPPED2 [17] and GUCY2E [18].

Table 1

Gene-methylation interaction in lipid control data

	SNP	CpG	Distance(kb)	Gene	MAF	P-value
1	rs12288568	cg13342435	1.27	MPPED2	0.003	7.49×10⁻⁶
2	rs11031153	cg13342435	3.86	MPPED2	0.003	7.49×10⁻⁶
3	rs16921036	cg13342435	1.35	MPPED2	0.001	8.68×10⁻⁶
4	rs11237066	cg13340272	4.52	GUCY2E	0.120	1.57×10⁻⁵
5	rs7119411	cg17432267	3.75	C11orf63	0.430	1.65×10⁻⁵

Gene-methylation interaction in lipid control data

Performance

The speed of the wtest package is evaluated on a laptop computer of 1.6GHz Intel Core i5 processor and 4GB RAM. Simulation data are used to compare the speed of different methods. On a data set consists of 5000 subjects and 100 SNPs, when B=200, n.sample=1000, the time elapsed for estimating h and f is 40.5s. After h and f calculation or assuming default values, the time used to evaluate main effects is 0.04s, and took 1.69s for second order interaction. In the same environment, the running time for existing tests for interaction yields 36.41s by chi-squared test and 130.56s by logistic regression. In the real data set, the genome-wide main effect calculation on 5000 subjects and 500,000 SNPs took around 5 min; and second order interaction calculation on 8000 SNPs used around 3.5 h.

Conclusions

Genetic epistasis testing is important to fathom the massive genomic data, and it also provides a way to explore the relationship between diseases and various types of biomarkers. This package offers an integrated toolset to analyse the association of genetic signals at all levels: from main effects, high order interactions, to gene-methylation interactions. The software is available in CRAN from https://CRAN.R-project.org/package=wtest under the GPL-2.0 license.

Availability and requirements

Project name: wtest Project home page: https://CRAN.R-project.org/package=wtest Operation systems: Platform independent Programming language: R (>= 3.1), C++ License: GPL (>= 2) Restrictions to use by non-academics: None Additional file 1 Convergency simulation study. The coefficient of variance of h at different B for pairwise interactions. Simulated dataset contains 1000 subjects and 1000 SNPs. A convergent h and f estimation is reached at B>400. Additional file 2 Reference table of sample size estimation. When the averaged MAF is 0.3 and the sample size is greater than the estimated sample size, no more than 25% cells have averaged cell count less than 2 in the contingency tables. Additional file 3 h and f estimation for main effects, second order interaction, and third order interaction analysis. Additional file 4 Top three identified sNPs at different levels of interaction orders. Note: Bonferroni corrected significant thresholds: main effect p-value <1.1×10−6, second order interaction p-value <4.69×10−11, and third order interaction p-value <3.05×10−15. Gene: the gene located within 35kb of the identified SNPs.

17 in total

Review 1. Gene-environment interactions in human diseases.

Authors: David J Hunter
Journal: Nat Rev Genet Date: 2005-04 Impact factor: 53.242

Review 2. Molecular genetics of bipolar disorder and depression.

Authors: Tadafumi Kato
Journal: Psychiatry Clin Neurosci Date: 2007-02 Impact factor: 5.188

3. Analysis of multiple genetic loci reveals MPDZ-NF1B rs1324183 as a putative genetic marker for keratoconus.

Authors: Yu Meng Wang; Li Ma; Shi Yao Lu; Tommy Chung Yan Chan; Jason C S Yam; Shu Min Tang; Ka Wai Kam; Pancy O S Tam; Clement C Tham; Alvin L Young; Vishal Jhanji; Chi Pui Pang; Li Jia Chen
Journal: Br J Ophthalmol Date: 2018-07-12 Impact factor: 4.638

4. A W-test collapsing method for rare-variant association testing in exome sequencing data.

Authors: Rui Sun; Haoyi Weng; Inchi Hu; Junfeng Guo; William K K Wu; Benny Chung-Ying Zee; Maggie Haitian Wang
Journal: Genet Epidemiol Date: 2016-08-16 Impact factor: 2.135

5. Stratified polygenic risk prediction model with application to CAGI bipolar disorder sequencing data.

Authors: Maggie Haitian Wang; Billy Chang; Rui Sun; Inchi Hu; Xiaoxuan Xia; William Ka Kei Wu; Ka Chun Chong; Benny Chung-Ying Zee
Journal: Hum Mutat Date: 2017-06-13 Impact factor: 4.878

6. Retinal-specific guanylate cyclase gene mutations in Leber's congenital amaurosis.

Authors: I Perrault; J M Rozet; P Calvas; S Gerber; A Camuzat; H Dollfus; S Châtelin; E Souied; I Ghazi; C Leowski; M Bonnemaison; D Le Paslier; J Frézal; J L Dufier; S Pittler; A Munnich; J Kaplan
Journal: Nat Genet Date: 1996-12 Impact factor: 38.330

7. The BTBD9 gene polymorphisms in Polish patients with Gilles de la Tourette syndrome.

Authors: Piotr Janik; Mariusz Berdyński; Krzysztof Safranow; Cezary Zekanowski
Journal: Acta Neurobiol Exp (Wars) Date: 2014 Impact factor: 1.579

8. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics.

Authors: Linn Fagerberg; Björn M Hallström; Per Oksvold; Caroline Kampf; Dijana Djureinovic; Jacob Odeberg; Masato Habuka; Simin Tahmasebpoor; Angelika Danielsson; Karolina Edlund; Anna Asplund; Evelina Sjöstedt; Emma Lundberg; Cristina Al-Khalili Szigyarto; Marie Skogs; Jenny Ottosson Takanen; Holger Berling; Hanna Tegel; Jan Mulder; Peter Nilsson; Jochen M Schwenk; Cecilia Lindskog; Frida Danielsson; Adil Mardinoglu; Asa Sivertsson; Kalle von Feilitzen; Mattias Forsberg; Martin Zwahlen; IngMarie Olsson; Sanjay Navani; Mikael Huss; Jens Nielsen; Fredrik Ponten; Mathias Uhlén
Journal: Mol Cell Proteomics Date: 2013-12-05 Impact factor: 5.911

9. A Zoom-Focus algorithm (ZFA) to locate the optimal testing region for rare variant association tests.

Authors: Maggie Haitian Wang; Haoyi Weng; Rui Sun; Jack Lee; William Ka Kei Wu; Ka Chun Chong; Benny Chung-Ying Zee
Journal: Bioinformatics Date: 2017-08-01 Impact factor: 6.937

10. A novel susceptibility locus in MST1 and gene-gene interaction network for Crohn's disease in the Chinese population.

Authors: William K K Wu; Rui Sun; Tao Zuo; Yuanyuan Tian; Zhirong Zeng; Jeffery Ho; Justin C Y Wu; Francis K L Chan; Matthew T V Chan; Jun Yu; Joseph J Y Sung; Sunny H Wong; Maggie H Wang; Siew C Ng
Journal: J Cell Mol Med Date: 2018-02-14 Impact factor: 5.310

3 in total

1. Roles of interacting stress-related genes in lifespan regulation: insights for translating experimental findings to humans.

Authors: Anatoliy I Yashin; Deqing Wu; Konstantin Arbeev; Arseniy P Yashkin; Igor Akushevich; Olivia Bagley; Matt Duan; Svetlana Ukraintseva
Journal: J Transl Genet Genom Date: 2021-10-19

2. Evaluating the detection ability of a range of epistasis detection methods on simulated data for pure and impure epistatic models.

Authors: Dominic Russ; John A Williams; Victor Roth Cardoso; Laura Bravo-Merodio; Samantha C Pendleton; Furqan Aziz; Animesh Acharjee; Georgios V Gkoutos
Journal: PLoS One Date: 2022-02-18 Impact factor: 3.240

3. Immune System and Neuroinflammation in Idiopathic Parkinson's Disease: Association Analysis of Genetic Variants and miRNAs Interactions.

Authors: Claudia Strafella; Valerio Caputo; Andrea Termine; Francesca Assogna; Clelia Pellicano; Francesco E Pontieri; Lucia Macchiusi; Giulietta Minozzi; Stefano Gambardella; Diego Centonze; Paola Bossù; Gianfranco Spalletta; Carlo Caltagirone; Emiliano Giardina; Raffaella Cascella
Journal: Front Genet Date: 2021-06-03 Impact factor: 4.599

3 in total