
A variance component-based gene burden test.

Juan M Peralta¹, Marcio Almeida², Jack W Kent², John Blangero².

Abstract

We propose a novel variance component approach for the analysis of next-generation sequencing data. Our method is based on the detection of the proportion of the trait phenotypic variance that can be explained by the introduction of a new variance component that accounts for the local gene-specific departure of the empirical kinship relationship matrix, estimated from single-nucleotide polymorphism (SNP) genotypes, from its theoretical expectation based on the genealogical information in the pedigree. We tested our method with simulated phenotypes and imputed SNP genotypes from the Genetic Analysis Workshop 18 data set. We observed considerable variation in the differences between theoretical and gene-specific kinship estimates that proved to be informative for our test and allowed us to detect the MAP4 causal gene at a genome-wide significance level. The distribution of our test statistic shows no inflation under the null hypothesis, and results from a random set of genes suggest that the detection of MAP4 is both sensitive and specific. The use of 2 different strategies for the selection of the SNPs used to derive the gene-specific empirical kinship relationship matrices provides us with suggestive evidence that our method is performing as an empirical test of linkage.


Year: 2014 PMID: 25519388 PMCID: PMC4143638 DOI: 10.1186/1753-6561-8-S1-S49

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

Complex phenotypes are thought to be determined by the aggregate effects of many rare causal variations [1-3]. Detection of the true causal variations present in next-generation sequencing data sets [4,5] is challenging because their faint signals are difficult to separate from background noise. Most of the current analytical methods try to improve the signal-to-noise ratio by reducing the number of statistical tests needed for a significant signal to be detected. A common approach to alleviate the multiple-testing problem is to collapse the information conveyed by individual variants, commonly grouped by membership in a known annotated gene or pathway, into a single measure, such as a principal component or a weighted rank, that can then be tested [6]. However, a common limitation of many approaches is that the aggregation of the variants into a single measure often involves an arbitrary definition of the directionality of each variant's fixed effects. We present a novel random-effect, variance component-based approach that uses gene-specific relationship matrices to collapse variants into a per-gene genetic contribution effect.

Methods

Data set

The Genetic Analysis Workshop 18 (GAW18) data [7], based on whole genome sequencing data for the odd-numbered chromosomes of 464 individuals released by the T2D-GENES Consortium, was used to test our method. Specifically, we used pedigrees, minor allele-based single-nucleotide polymorphism (SNP) dosages, and the SIMPHEN.1 simulated phenotypes in the GAW18 data set.

Definition of the gene loci

The transcription start site and the stop codon coordinates for the longest transcript associated with a gene were obtained from the UCSC's human genome release 19 (hg19) known gene table.

Gene-specific SNP dosages

To investigate whether the procedure used to select the SNPs collected for each gene locus affected our test results, we used 2 different SNP selection approaches: the intragenic and the nonsyn strategies. The intragenic strategy consisted of the selection of all SNPs within the bounds of a gene. The nonsyn strategy consisted of the selection of the subset of intragenic SNPs that were annotated as being nonsynonymous coding changes using ANNOVAR [8]. GAW18 SNP dosages from the imputed genotypes were then collected into separate, gene-specific dosage files for SNPs selected using the intragenic and nonsyn strategies.
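The two selection strategies can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Snp` record, its `nonsynonymous` flag (standing in for an annotation parsed from a tool such as ANNOVAR), and the `select_snps` helper are hypothetical names introduced only for this example, and the gene bounds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    """Hypothetical SNP record used only for this illustration."""
    name: str
    chrom: str
    pos: int
    nonsynonymous: bool  # assumed to come from a prior annotation step


def select_snps(snps, chrom, start, end, strategy="intragenic"):
    """Select SNPs for one gene locus.

    'intragenic' keeps every SNP within [start, end] on the gene's chromosome;
    'nonsyn' keeps only the nonsynonymous subset of those SNPs.
    """
    if strategy not in ("intragenic", "nonsyn"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    inside = [s for s in snps if s.chrom == chrom and start <= s.pos <= end]
    if strategy == "nonsyn":
        return [s for s in inside if s.nonsynonymous]
    return inside
```

By construction, the nonsyn selection is always a subset of the intragenic selection for the same gene.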

Gene-specific empirical kinship matrices

Gene-specific dosages were transformed into genotypes and processed with KING [9], a method for relationship inference from large SNP genotype data sets that is robust to population substructure, to produce a gene-specific matrix of empirical kinship coefficients.

Control for unknown population substructure

To control for possible population stratification, principal component loadings were calculated using the prcomp function in R [10], with data from 117 unrelated individuals for approximately 29,000 haplotype tagging SNPs in low mutual linkage disequilibrium, and then projected onto the full set of genotyped individuals. The first 5 principal components explained 5% of the total phenotypic variance and were added as covariates to our variance component model.

Trait and covariates

We used the simulated phenotypic data at the first exam for the systolic blood pressure (SBP_1) trait. Sex (SEX), age at the first exam (AGE_1), and smoking status at the first exam (SMOKE_1) were introduced as covariates into our variance component model. The Q1 trait was used to assess the distribution of our test statistic under the null hypothesis.

Variance component model

Our method uses gene-specific relationship matrices (GSRMs) to extract the proportion of the trait's variance explained by a single gene as a result of the departure of its localized empirical kinship estimates (EKEs) from their pedigree-derived theoretical kinship expectations (TKEs). A new variance component parameter ($g_{\mathrm{eff}}$) was introduced into a standard variance component model

$$\Omega = \sigma_p^2 \left( 2\Phi\, h_r^2 + 2E\, g_{\mathrm{eff}} + I\, e^2 \right)$$

where $\Omega$ is the covariance matrix; $\sigma_p^2$ is the total phenotypic variance; $h_r^2$, $g_{\mathrm{eff}}$, and $e^2$, respectively, represent the proportion of $\sigma_p^2$ that can be attributed to the residual additive effect of polygenes, a gene-specific effect, and a random environmental effect; $\Phi$ is the TKE kinship matrix; $E$ is the EKE kinship matrix; and $I$ is the identity matrix. This partitioning of the trait variance was estimated using an extension of the polygenic command from SOLAR [11] independently for each gene. The significance of each $g_{\mathrm{eff}}$ estimate was obtained from a likelihood ratio test against the null model in which $g_{\mathrm{eff}} = 0$. Because the variance component is tested on its boundary, the likelihood ratio test statistic is distributed as a ½:½ mixture of a 1 degree of freedom (DF) chi-square and a point mass at zero [12].
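As a rough illustration of the model and its test, the following Python sketch assembles a covariance matrix from the two kinship matrices and converts a pair of log-likelihoods into a p value under the ½:½ mixture. It is not SOLAR code. The function names are hypothetical, the factor of 2 on each kinship matrix and the constraint that the three proportions sum to 1 are assumptions of this sketch, and the log-likelihoods are assumed to come from a separate model-fitting step.

```python
import math


def covariance_matrix(sigma2_p, h2r, geff, phi, ekin):
    """Assumed form: Omega = sigma2_p * (2*Phi*h2r + 2*E*geff + I*e2).

    phi and ekin are square nested lists of kinship coefficients (TKE and EKE);
    e2 is taken as 1 - h2r - geff (an assumption of this sketch).
    """
    e2 = 1.0 - h2r - geff
    if min(h2r, geff, e2) < 0.0:
        raise ValueError("variance proportions must be non-negative")
    n = len(phi)
    return [
        [
            sigma2_p * (2.0 * phi[i][j] * h2r
                        + 2.0 * ekin[i][j] * geff
                        + (e2 if i == j else 0.0))
            for j in range(n)
        ]
        for i in range(n)
    ]


def mixture_lrt_pvalue(loglik_null, loglik_alt):
    """p value for a 1/2:1/2 mixture of chi-square (1 DF) and a point mass at zero."""
    stat = 2.0 * (loglik_alt - loglik_null)
    if stat <= 0.0:
        return 1.0  # the statistic sits on the point mass at zero
    # For 1 DF, P(chi2 > x) = erfc(sqrt(x / 2)); the mixture halves it.
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))
```

Returning 1 when the statistic is zero matches the rows of Table 1 in which a gene-specific effect estimate of 0 is reported with a p value of 1.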

Results

We compared the observed gene-specific EKE values obtained from the imputed SNP dosages with the TKE values derived from the pedigree and found substantial differences between them (Figure 1). The negative skew in Figure 1 indicates that gene-specific EKE values tend to be larger than their TKE counterparts; that is, for certain genes, individuals appear to be more closely related than expected from their relatedness in the pedigree.
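The per-gene quantity summarized in Figure 1 can be sketched as a simple average of TKE minus EKE. The helper below is hypothetical, and averaging over all distinct pairs of individuals is an assumption of this sketch, as the text does not state which pairs entered the average.

```python
def mean_kinship_difference(tke, eke):
    """Average of (TKE - EKE) over all distinct pairs i < j for one gene.

    tke and eke are square nested lists of kinship coefficients for the same
    individuals in the same order. A negative result means that the empirical
    estimates are, on average, larger than the pedigree expectations.
    """
    n = len(tke)
    diffs = [tke[i][j] - eke[i][j] for i in range(n) for j in range(i + 1, n)]
    if not diffs:
        raise ValueError("at least two individuals are required")
    return sum(diffs) / len(diffs)
```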

Figure 1

Distribution of the gene-specific differences between TKEs and EKEs. Differences between TKE and EKE values were averaged by gene for a sample of 100 random and 12 SBP_1 causal genes. The negative sign indicates that the gene-specific EKE average is larger than the TKE average.

We then performed variance component analyses using GSRMs with intragenic and nonsyn EKE values for 12 of the causal SBP_1 genes in the simulated data set (Table 1) and a random gene sample (Table 2). We detected a clear and significant signal from the MAP4 causal gene using both the intragenic and nonsyn strategies; the signal reached genome-wide significance (after a conservative Bonferroni correction for 30,000 tests, p < 1.6 × 10⁻⁶) with the nonsyn strategy (Table 1). The magnitude of the MAP4 signal is strong enough for it to be specifically detected as the top result in a random sample of 100 genes (Table 2). Other causal genes also rank among the top results, but their signals are weaker (Table 2). Figure 2 suggests that our approach can separate true-positive signals from false-positive ones, as there is no inflation or deflation of the p values that we obtained for the estimates of the gene effects evaluated under the null hypothesis.
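The genome-wide threshold quoted above can be checked with a short calculation. The family-wise error rate of 0.05 is an assumption of this sketch (the text gives only the number of tests and the resulting threshold); the two p values are the MAP4 geff_p entries from Table 1.

```python
ALPHA = 0.05      # assumed family-wise error rate
N_TESTS = 30_000  # number of tests quoted in the text

# Bonferroni threshold: roughly 1.67e-6, quoted in the text as 1.6e-6
threshold = ALPHA / N_TESTS

# MAP4 gene-specific effect p values (geff_p) from Table 1
map4_geff_p = {"intragenic": 7.20e-6, "nonsyn": 1.00e-7}

# True where the signal passes the Bonferroni threshold
genome_wide = {strategy: p < threshold for strategy, p in map4_geff_p.items()}
```

Under these assumptions only the nonsyn result passes the threshold, in agreement with the text.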

Table 1

Estimated effects on the simulated SBP_1 trait for known causal genes

Gene	Intragenic h2r	Intragenic h2r_p	Intragenic geff	Intragenic geff_p	Nonsyn h2r	Nonsyn h2r_p	Nonsyn geff	Nonsyn geff_p
MAP4	0.17	3.90 × 10⁻⁶	0.10955	7.20 × 10⁻⁶	0.18	7.00 × 10⁻⁷	0.10382	1.00 × 10⁻⁷
LEPR	0.26	4.16 × 10⁻⁸	0.04702	6.52 × 10⁻³	0.31	2.28 × 10⁻¹⁰	0.01147	1.71 × 10⁻¹
LRP8	0.28	6.97 × 10⁻⁹	0.03575	6.55 × 10⁻³	0.32	3.44 × 10⁻¹¹	0	1
GTF2IRD1	0.29	4.19 × 10⁻⁹	0.01755	9.24 × 10⁻²	0.32	3.44 × 10⁻¹¹	0	1
TNN	0.30	9.51 × 10⁻¹⁰	0.01615	9.29 × 10⁻²	0.27	7.20 × 10⁻⁹	0.03433	1.26 × 10⁻³
FLT3	0.30	8.37 × 10⁻¹⁰	0.00906	1.59 × 10⁻¹	0.32	3.44 × 10⁻¹¹	0	1
CABP2	0.32	4.12 × 10⁻¹¹	0.00037	4.76 × 10⁻¹	0.32	3.44 × 10⁻¹¹	0	1
ABTB1	0.32	3.44 × 10⁻¹¹	0	1	0.21	4.01 × 10⁻¹¹	0.17969	1.90 × 10⁻¹
GAB2	0.32	3.44 × 10⁻¹¹	0	1	0.32	3.44 × 10⁻¹¹	0	1
GSN	0.32	3.44 × 10⁻¹¹	0	1	0.32	3.44 × 10⁻¹¹	0	1
KRTAP11-1	0.32	3.44 × 10⁻¹¹	0	1	0.32	3.44 × 10⁻¹¹	0	1
PSMD5	0.32	3.44 × 10⁻¹¹	0	1	0.30	9.46 × 10⁻¹¹	0.00949	1.25 × 10⁻¹

geff, gene-specific effect estimate; geff_p, significance of the gene-specific effect estimate; h2r, trait heritability estimate; h2r_p, significance of the trait heritability estimate.

Table 2

Top 10 most significant results for genes in a combined sample of 100 random and 12 causal genes

Rank	Intragenic gene	Intragenic h2r	Intragenic h2r_p	Intragenic geff	Intragenic geff_p	Nonsyn gene	Nonsyn h2r	Nonsyn h2r_p	Nonsyn geff	Nonsyn geff_p
1	MAP4*	0.17	3.90 × 10⁻⁶	0.10955	7.20 × 10⁻⁶	MAP4*	0.18	7.00 × 10⁻⁷	0.10382	1.00 × 10⁻⁷
2	OR9A4	0.18	6.16 × 10⁻¹¹	0.20337	4.64 × 10⁻³	TNN*	0.27	7.20 × 10⁻⁹	0.03433	1.26 × 10⁻³
3	LEPR*	0.26	4.16 × 10⁻⁸	0.04702	6.52 × 10⁻³	LSM12	0.15	3.40 × 10⁻¹¹	0.26452	4.96 × 10⁻³
4	LRP8*	0.28	6.97 × 10⁻⁹	0.03575	6.55 × 10⁻³	NAT6	0.30	3.20 × 10⁻¹¹	0.02515	1.16 × 10⁻²
5	NAT6	0.28	1.12 × 10⁻¹⁰	0.03592	8.39 × 10⁻³	AK123654	0.15	2.27 × 10⁻¹⁰	0.25783	1.37 × 10⁻²
6	CCDC169-SOHLH2	0.28	1.05 × 10⁻⁸	0.03547	2.25 × 10⁻²	OR2T27	0.28	1.05 × 10⁻¹⁰	0.04869	1.46 × 10⁻²
7	OR2T27	0.30	1.07 × 10⁻¹⁰	0.03072	4.20 × 10⁻²	HSPA9	0.15	8.69 × 10⁻¹²	0.26952	5.04 × 10⁻²
8	CCDC169	0.31	8.85 × 10⁻¹⁰	0.01913	4.53 × 10⁻²	LOC389493	0.21	1.52 × 10⁻¹⁰	0.16663	5.12 × 10⁻²
9	GNG3	0.18	1.85 × 10⁻¹⁰	0.20838	5.94 × 10⁻²	SRD5A1	0.32	2.25 × 10⁻¹¹	0.01056	1.12 × 10⁻¹
10	GAS7	0.28	1.91 × 10⁻⁸	0.03356	6.07 × 10⁻²	PSMD5*	0.30	9.46 × 10⁻¹¹	0.00949	1.25 × 10⁻¹

geff, gene-specific effect estimate; geff_p, significance of the gene-specific effect estimate; h2r, trait heritability estimate; h2r_p, significance of the trait heritability estimate.

*Known causal gene for SBP_1 in the simulated data set.

Figure 2

Q-Q plot of the p values for the gene-specific effect estimates under the null hypothesis. The p values were calculated using SNPs selected with the intragenic strategy for a random sample of 5000 genes, using the Q1 trait, a highly heritable trait that is not influenced by any of the GAW18 SNPs.
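The coordinates of a Q-Q plot such as Figure 2 can be computed as in the sketch below. The helper name is hypothetical, and the plotting positions (rank + 0.5) / n are a common convention assumed here rather than taken from the text.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, ordered from the smallest p value."""
    observed = sorted(pvalues)
    n = len(observed)
    if n == 0 or observed[0] <= 0.0 or observed[-1] > 1.0:
        raise ValueError("p values must lie in (0, 1]")
    return [
        (-math.log10((rank + 0.5) / n), -math.log10(p))
        for rank, p in enumerate(observed)
    ]
```

Points that fall along the diagonal indicate neither inflation nor deflation. Because of the point mass at zero in the null distribution, about half of the null p values are expected to equal 1, so the comparison with the uniform expectation is informative mainly for p values below 0.5.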


Discussion

We performed variance component analyses using a novel approach to estimate the proportion of the trait phenotypic variance that can be attributed to a single gene. We first collapsed the genotypes from SNP variants into a GSRM that more closely approximates the correlations between related individuals at a gene-specific level. Figure 1 shows that there is substantial variation among genes in the differences between TKE and gene-specific EKE values, variation that has the potential to explain part of the trait variance. We then obtained gene-specific estimates of the gene-specific effect parameter and its significance from SOLAR, using the empirical GSRM. Our results showed that the gene with the highest effect on the simulated SBP_1 trait was detected at a significance level that surpasses a conservative multiple-testing threshold. Figure 2 shows that our test statistic was not inflated when evaluated under the null hypothesis using the Q1 trait and a random sample of genes. MAP4 was also consistently detected using the intragenic and nonsyn strategies (see Tables 1 and 2), with other causal genes ranking within our top 10 results. This suggests that our test is sensitive and specific enough to detect true-positive signals without enrichment of false-positive ones. Our results for MAP4 also depended on the strategy used to select the SNPs for the estimation of the empirical GSRM: they were more than an order of magnitude less significant for the intragenic than for the nonsyn strategy. We believe that this is the result of rare functional alleles driving the EKEs of the GSRMs for the nonsyn strategy without the noise introduced by shared noncoding alleles. In effect, the nonsyn GSRMs better approximate the gene's probability of identity-by-descent sharing, thus making our test a gene-specific empirical test of linkage that is also robust to the heterogeneity of the causal variants.

Finally, we want to note that our method is not restricted either to a particular measure of genetic identity or to its estimation on a gene-specific basis; identity-by-state and genomic regions, even if they are nonsyntenic [13], can potentially be used instead.

Conclusions

We were able to obtain encouraging, proof-of-concept results from the application of our method to GAW18 data. We observed differences between the TKEs and their gene-specific empirical estimations. We obtained genome-wide significant results on the SBP_1 simulated trait for MAP4 that seem to indicate that our test is both specific and sensitive enough, and which also suggest that our method is behaving as a gene-specific empirical test of linkage.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JB designed the overall study; JMP, MA, and JK conducted statistical analyses. JMP drafted the manuscript. All authors read and approved the final manuscript.

References

1. Are rare variants responsible for susceptibility to complex diseases?

Authors: J K Pritchard
Journal: Am J Hum Genet Date: 2001-06-12 Impact factor: 11.025

2. Robust relationship inference in genome-wide association studies.

Authors: Ani Manichaikul; Josyf C Mychaleckyj; Stephen S Rich; Kathy Daly; Michèle Sale; Wei-Min Chen
Journal: Bioinformatics Date: 2010-10-05 Impact factor: 6.937

Review 3. The allelic architecture of human disease genes: common disease-common variant...or not?

Authors: Jonathan K Pritchard; Nancy J Cox
Journal: Hum Mol Genet Date: 2002-10-01 Impact factor: 6.150

4. Multipoint quantitative-trait linkage analysis in general pedigrees.

Authors: L Almasy; J Blangero
Journal: Am J Hum Genet Date: 1998-05 Impact factor: 11.025

Review 5. Statistical analysis of rare sequence variants: an overview of collapsing methods.

Authors: Carmen Dering; Claudia Hemmelmann; Elizabeth Pugh; Andreas Ziegler
Journal: Genet Epidemiol Date: 2011 Impact factor: 2.135

6. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees.

Authors: Laura Almasy; Thomas D Dyer; Juan M Peralta; Goo Jun; Andrew R Wood; Christian Fuchsberger; Marcio A Almeida; Jack W Kent; Sharon Fowler; Tom W Blackwell; Sobha Puppala; Satish Kumar; Joanne E Curran; Donna Lehman; Goncalo Abecasis; Ravindranath Duggirala; John Blangero
Journal: BMC Proc Date: 2014-06-17

7. Pedigree-based random effect tests to screen gene pathways.

Authors: Marcio Almeida; Juan M Peralta; Vidya Farook; Sobha Puppala; John W Kent; Ravindranath Duggirala; John Blangero
Journal: BMC Proc Date: 2014-06-17

8. Targeted capture and massively parallel sequencing of 12 human exomes.

Authors: Sarah B Ng; Emily H Turner; Peggy D Robertson; Steven D Flygare; Abigail W Bigham; Choli Lee; Tristan Shaffer; Michelle Wong; Arindam Bhattacharjee; Evan E Eichler; Michael Bamshad; Deborah A Nickerson; Jay Shendure
Journal: Nature Date: 2009-08-16 Impact factor: 49.962

9. Genetic variation in an individual human exome.

Authors: Pauline C Ng; Samuel Levy; Jiaqi Huang; Timothy B Stockwell; Brian P Walenz; Kelvin Li; Nelson Axelrod; Dana A Busam; Robert L Strausberg; J Craig Venter
Journal: PLoS Genet Date: 2008-08-15 Impact factor: 5.917

10. Accurate whole human genome sequencing using reversible terminator chemistry.

Authors: David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal: Nature Date: 2008-11-06 Impact factor: 49.962
