Identifying causal rare variants of disease through family-based analysis of Genetics Analysis Workshop 17 data set.

Wai-Ki Yip1, Gourab De1, Nan Laird1, Benjamin A Raby2.   

Abstract

Linkage- and association-based methods have been proposed for mapping disease-causing rare variants. Based on the family information provided in the Genetic Analysis Workshop 17 data set, we formulate a two-pronged approach that combines both methods. Using the identity-by-descent information provided for eight extended pedigrees (n = 697) and the simulated quantitative trait Q1, we explore various traditional nonparametric linkage analysis methods; the best result is obtained by assuming between-family heterogeneity and applying the Haseman-Elston regression to each pedigree separately. We discover strong signals from two genes in two different families and weaker signals for a third gene from two other families. As an exploratory approach, we apply an association test based on a modified family-based association test statistic to all rare variants (frequency < 1% or < 3%) designated as causal for Q1. Family-based association tests correctly identified causal single-nucleotide polymorphisms for four genes (KDR, VEGFA, VEGFC, and FLT1). Our results suggest that both linkage and association tests with families show promise for identifying rare variants.

Year:  2011        PMID: 22373204      PMCID: PMC3287856          DOI: 10.1186/1753-6561-5-S9-S21

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

In contrast to the common variant/common disease hypothesis that dominated the era of linkage-disequilibrium-based genome-wide association studies (GWAS), there is increasing awareness that rare variants of modest to large individual effect contribute to disease liability and may explain a substantial proportion of the so-called missing heritability of common traits. There is therefore great interest in developing statistical methods to detect rare causal variants. Rare variant analysis is complicated by several challenges: sequencing-based uncertainty in variant calling, the large search space of rare variants, and the inherently low carrier frequencies of these variants. It has been theorized that both linkage and family-based association analysis work well for rare variants [1,2]. Combining both approaches may provide a powerful strategy for identifying rare variants.

Methods

The Genetic Analysis Workshop 17 (GAW17) data set was developed to model a real-world rare variant screen using data generated from a mini-exome scan [3]. The genotype data correspond to 24,487 variants (in 3,205 genes) derived from low-coverage sequence data provided by the 1000 Genomes Project. In our analysis, we use the simulated family-based sample of eight three-generation pedigrees (697 individuals). The founders of these pedigrees are a random sample of 202 individuals selected from the population-based sample. As a result, only four of the nine causal genes have low-frequency causal single-nucleotide polymorphisms (SNPs) in the family data. In our linkage analysis and initial family-based association test (FBAT) analysis, we average the Q1 phenotype over the 200 replications to maximize power. Detailed information about the pedigrees is shown in Table 1.
Table 1

Pedigree information based on the combined sample

Pedigree number | Number of nuclear families | Number of affected sibs | Total number of sib pairs | Number of affected sib pairs
0 | 23 | 22 | 86 | 5
1 | 29 | 26 | 100 | 8
2 | 26 | 29 | 90 | 4
3 | 20 | 19 | 74 | 1
4 | 20 | 18 | 73 | 2
5 | 20 | 20 | 73 | 7
6 | 36 | 48 | 128 | 34
7 | 20 | 18 | 73 | 1
Total | 194 | 200 | 697 | 62

Linkage analysis

Our initial goal is to evaluate a variety of linkage-based approaches using the within-family identity-by-descent (IBD) information provided. In the absence of knowledge regarding the disease model, we restricted our evaluation to nonparametric methods so as to maximize power [4]. We evaluated several approaches that consider either all sib pairs (SPs) or only affected sib pairs (ASPs), including the goodness-of-fit, mean, and trend tests using ASPs and the Haseman-Elston and modified Haseman-Elston regressions using all SPs [4-7]. We also applied the Haseman-Elston regression to the quantitative trait Q1. Because this proved to be the most powerful, we restrict our reporting to this approach.
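As an illustration of the regression just described, the following is a minimal sketch of the original Haseman-Elston approach applied to each pedigree separately: the squared sib-pair difference in Q1 is regressed on the proportion of alleles shared IBD, and a clearly negative slope suggests linkage. The function names, the input layout (one tuple per sib pair), and the toy values are assumptions made for illustration; they are not taken from the GAW17 files or from the authors' code. Only the slope and its t statistic are returned, since a p-value would require a t-distribution routine.

```python
from math import sqrt
from statistics import mean


def haseman_elston(sib_pairs):
    """Haseman-Elston regression for the sib pairs of one pedigree.

    sib_pairs: list of (trait_1, trait_2, ibd_proportion) tuples, where
    ibd_proportion is the proportion of alleles shared IBD at the locus.
    Returns (slope, t_statistic); linkage predicts a negative slope.
    """
    y = [(t1 - t2) ** 2 for t1, t2, _ in sib_pairs]  # squared trait differences
    x = [ibd for _, _, ibd in sib_pairs]
    n = len(sib_pairs)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / std_err


def linkage_by_pedigree(pairs_by_pedigree):
    """Run the regression separately in each pedigree (hypothetical helper)."""
    return {ped: haseman_elston(pairs) for ped, pairs in pairs_by_pedigree.items()}


# Toy data: pairs sharing more alleles IBD have more similar trait values.
toy_pairs = [
    (1.0, 1.1, 1.0), (0.5, 0.7, 1.0),
    (0.2, 1.0, 0.5), (1.5, 0.9, 0.5),
    (0.0, 1.6, 0.0), (2.0, 0.5, 0.0),
]
results = linkage_by_pedigree({"pedigree_A": toy_pairs})
```

Running the regression per pedigree rather than on the pooled sib pairs is what allows a signal that is private to one family to stand out.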

Family-based association test

The resolution of linkage analysis is limited by the number of informative meioses within each pedigree (a function of pedigree structure and randomness). We therefore consider family-based association methods to facilitate fine mapping of linked regions. The association test is based on a modified FBAT [8] statistic as follows. Suppose we have i = 1, …, N independent trios and j = 1, …, M rare variants in a given gene. We apply the test to markers below a defined rare variant allele frequency threshold (<1% and <3% are illustrated). The cutoff is arbitrary and deserves further exploration. The test statistic has the following numerator:

\[ W = \sum_{i=1}^{N} \sum_{j=1}^{M} (T_i - \mu)\,\bigl(X_{ij} - E[X_{ij} \mid P_{ij}]\bigr) \tag{1} \]

where T_i is the trait of the offspring in the ith family, X_{ij} is the observed number of rare variant alleles at the jth variant in the offspring of the ith family, μ is the trait offset (typically the mean for measured traits), and P_{ij} is the parental genotype at the jth variant in the ith family. The numerator is the sum of the individual numerators of the FBAT statistics for all M SNPs. It represents the contributions of all families over all variants in a given gene to the new FBAT statistic. The test statistic W/[Var(W)]^{1/2} is a Z-statistic that can be used to test against a one-sided or two-sided alternative.

The variance of W has a complicated expression. Even if we assume that the nuclear families within a pedigree are independent, estimating the covariance structure between the SNPs for each family is difficult because of the presence of linkage disequilibrium between variants. For the purpose of this project, we use the empirical variance as the denominator, which gives:

\[ Z = \frac{W}{\sqrt{\sum_{i=1}^{N} \Bigl\{ \sum_{j=1}^{M} (T_i - \mu)\,\bigl(X_{ij} - E[X_{ij} \mid P_{ij}]\bigr) \Bigr\}^{2}}} \tag{2} \]

Instead of trios, we can extend the numerator by summing contributions over all nuclear families in all pedigrees:

\[ W = \sum_{k} \sum_{i} \sum_{l} \sum_{j=1}^{M} (T_{kil} - \mu)\,\bigl(X_{kilj} - E[X_{kilj} \mid P_{kij}]\bigr) \tag{3} \]

where the summand corresponds to the lth offspring of the ith nuclear family in the kth pedigree.

We can compute the empirical variance in two different ways, by treating either the pedigrees:

\[ \widehat{\mathrm{Var}}(W) = \sum_{k} \Bigl\{ \sum_{i} \sum_{l} \sum_{j=1}^{M} (T_{kil} - \mu)\,\bigl(X_{kilj} - E[X_{kilj} \mid P_{kij}]\bigr) \Bigr\}^{2} \tag{4} \]

or the nuclear families:

\[ \widehat{\mathrm{Var}}(W) = \sum_{k} \sum_{i} \Bigl\{ \sum_{l} \sum_{j=1}^{M} (T_{kil} - \mu)\,\bigl(X_{kilj} - E[X_{kilj} \mid P_{kij}]\bigr) \Bigr\}^{2} \tag{5} \]

as independent units, where the term in braces in expression (4) or (5) is the contribution of the pedigree or the nuclear family, respectively. The choice of assumption has important implications for test performance. Assuming that nuclear families are independent gives a biased estimate of the variance if phenotypic correlation exists between nuclear families within a pedigree. Alternatively, assuming that pedigrees are independent gives a conservative estimate of the variance when only a small number of pedigrees are studied, as in the GAW17 family data. This test can also be extended to nuclear families with missing parents by conditioning on a sufficient statistic for transmission instead of parental genotype.
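The gene-level statistic and the two choices of independent unit can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout, the function names, and the toy values are hypothetical, both parents are assumed to be genotyped, and the expected offspring count given the parents is taken to be half the sum of the parental rare-allele counts (Mendelian transmission at a biallelic variant).

```python
from collections import defaultdict
from math import sqrt


def expected_count(father, mother):
    """Expected rare-allele count in a child given parental counts (0, 1, or 2)."""
    return (father + mother) / 2


def modified_fbat(offspring, mu, unit="family"):
    """Gene-level FBAT-style statistic with an empirical variance.

    offspring: list of dicts with keys 'pedigree', 'family', 'trait',
      'x' (child's rare-allele count at each variant in the gene) and
      'parents' ((father, mother) rare-allele counts at each variant).
    unit: 'family' or 'pedigree', the groups treated as independent.
    Returns (W, Z).
    """
    contributions = defaultdict(float)
    for child in offspring:
        # Sum of (observed - expected) rare-allele counts over all variants.
        residual = sum(x - expected_count(f, m)
                       for x, (f, m) in zip(child["x"], child["parents"]))
        if unit == "pedigree":
            key = child["pedigree"]
        else:
            key = (child["pedigree"], child["family"])
        contributions[key] += (child["trait"] - mu) * residual
    w = sum(contributions.values())
    variance = sum(c ** 2 for c in contributions.values())  # empirical variance
    z = w / sqrt(variance) if variance > 0 else float("nan")
    return w, z


# Toy data: two variants in one gene, three nuclear families in two pedigrees.
toy_offspring = [
    {"pedigree": 1, "family": 1, "trait": 2.0, "x": [1, 0], "parents": [(1, 0), (0, 0)]},
    {"pedigree": 1, "family": 1, "trait": 0.0, "x": [0, 0], "parents": [(1, 0), (0, 0)]},
    {"pedigree": 1, "family": 2, "trait": 3.0, "x": [0, 1], "parents": [(0, 0), (0, 1)]},
    {"pedigree": 2, "family": 1, "trait": 1.5, "x": [0, 0], "parents": [(0, 1), (0, 0)]},
]
w_family, z_family = modified_fbat(toy_offspring, mu=1.0, unit="family")
w_pedigree, z_pedigree = modified_fbat(toy_offspring, mu=1.0, unit="pedigree")
```

In this toy example both calls return the same numerator, but grouping by pedigree yields a larger variance estimate and therefore a smaller Z, which mirrors the conservative behavior described in the text when only a few pedigrees are available.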

Results

Comparison of linkage-based approaches

We observe striking differences in the performance of the various linkage-based approaches evaluated. Any linkage method that aggregated results across pedigrees failed to identify any of the causal genes among the top candidates. In contrast, when genetic heterogeneity was considered by performing pedigree-stratified analysis, some of the causal genes were identified. The results are summarized in Table 2. KDR (p = 2.0 × 10⁻⁸) is tied for the top gene in pedigree 3; VEGFA (p = 1.4 × 10⁻⁵) is among the top genes in pedigree 1; and FLT1 (p = 5.4 × 10⁻³ and 1.0 × 10⁻³) appears among the top genes in pedigrees 4 and 5, although its signal is considerably weaker.
Table 2

Top candidate genes from separate pedigrees

Pedigree 1 | p-value | Pedigree 3 | p-value | Pedigree 4 | p-value | Pedigree 5 | p-value
GPR115 | 0.000004 | KDR | 0.00000002 | EPHA6 | 0.0052 | PIBF1 | 0.0003
C6orf130 | 0.000013 | KIT | 0.00000002 | GPR128 | 0.0052 | CCNA1 | 0.0009
GUCA1B | 0.000013 | LNX1 | 0.00000002 | OR5K1 | 0.0052 | CYSLTR2 | 0.0009
KIAA0240 | 0.000013 | PDGFRA | 0.00000002 | OR5K2 | 0.0052 | DGKH | 0.0009
MEA1 | 0.000013 | SGCB | 0.00000002 | OR5K3 | 0.0052 | DNAJC15 | 0.0009
PPP2R5D | 0.000013 | SPATA18 | 0.00000002 | OR5K4 | 0.0052 | ELF1 | 0.0009
PRPH2 | 0.000013 | PPAT | 0.00000045 | ST3GAL6 | 0.0052 | FNDC3A | 0.0009
PTK7 | 0.000013 | SPINK2 | 0.00000045 | B3GALTL | 0.0054 | FREM2 | 0.0009
RGL2 | 0.000013 | GUF1 | 0.00005598 | BRCA2 | 0.0054 | HTR2A | 0.0009
SLC26A8 | 0.000013 | NFXL1 | 0.00005598 | FLT1 | 0.0054 | NUFIP1 | 0.0009
TAF11 | 0.000013 | CHRNA9 | 0.00047549 | LOC650794 | 0.0054 | P2RY5 | 0.0009
TBCC | 0.000013 | NSUN7 | 0.00047549 | SGCG | 0.0054 | RB1 | 0.0009
TFEB | 0.000013 | RHOH | 0.00047549 | TNFRSF19 | 0.0054 | RCBTB2 | 0.0009
ZNF76 | 0.000013 | LZTR1 | 0.00167619 | ZMYM2 | 0.0054 | TRPC4 | 0.0009
NFKBIE | 0.000014 | SCARF2 | 0.00167619 | ZMYM5 | 0.0054 | FLT1 | 0.0010
RUNX2 | 0.000014 | SDF2L1 | 0.00167619 | NFKBIZ | 0.0092 | STARD13 | 0.0011
SUPT3H | 0.000014 | TOP3B | 0.00167619 | STARD13 | 0.0208 | B3GALTL | 0.0013
VEGFA | 0.000014 | JMJD2C | 0.00242110 | ATP10A | 0.0213 | BRCA2 | 0.0013
HFE | 0.000019 | PTPRD | 0.00242110 | ADCY5 | 0.0229 | LOC650794 | 0.0015
HIST1H2AA | 0.000019 | KIAA1432 | 0.00303729 | ADPRH | 0.0229 | SGCG | 0.0015

Linkage analysis results of top candidate genes by regressing the square of the difference of Q1 against IBD for all sib pairs in a pedigree.

Fine mapping result

To assess the performance of our modified FBAT statistic, we first screened all variants with a frequency less than 1% for association using a univariate application of the standard FBAT statistic (considering individual variants separately). We found that, with the exception of one disease-causing variant (C4S1884), all the disease-causing variants demonstrated trends of association (at α = 0.05), although none reached significance after adjustment for multiple tests. We next applied our modified FBAT, performing gene-based tests of all rare variants with frequencies less than 1% or less than 3%. The p-values corresponding to the true causal genes are summarized in Table 3. Of the four genes with causal rare variants in the family data, treating nuclear families as independent units, we detected association (p < 0.01) for three genes: VEGFA and VEGFC at both cutoffs and FLT1 at the 3% cutoff only. The fourth gene (KDR) reached only nominal significance (p = 0.03), at the 1% cutoff. Using pedigrees as independent units instead of nuclear families yielded nonsignificant results; given the small number of pedigrees, this was expected.
Table 3

P-values corresponding to the true causal genes using Q1 as phenotype

Chromosome | Gene | 1% cutoff, nuclear families | 1% cutoff, pedigrees | 3% cutoff, nuclear families | 3% cutoff, pedigrees
1 | ARNT | 0.441 | 0.301 | 0.450 | 0.406
1 | ELAVL4 | 0.447 | 0.347 | 0.952 | 0.948
4 | KDR^a | 0.03 | 0.09 | 0.229 | 0.092
4 | VEGFC^a | 0.009 | 0.317 | 0.009 | 0.317
5 | FLT4 | 0.314 | 0.299 | 0.319 | 0.304
6 | VEGFA^a | 0.0002 | 0.122 | 0.002 | 0.156
13 | FLT1^a | 0.076 | 0.128 | 0.0003 | 0.024
14 | HIF1A | NA | NA | 0.317 | 0.317
19 | HIF3A | 0.508 | 0.466 | 0.638 | 0.609

^a Gene that has polymorphic causal SNPs. The other five causal genes (not marked by superscript a) cannot be identified in our method because there were no causal SNPs corresponding to those genes in the sample.

To estimate the FBAT statistic’s true- and false-positive rates, we ran our method on the 200 individual phenotype replicates and reported the proportion of times a gene was declared significant (at p < 0.01). As can be seen in Table 4, the FBAT has high power to detect association for three of the four polymorphic causal genes: power approaches 1 for VEGFA and VEGFC, regardless of allele frequency cutoff, whereas power varies by allele frequency cutoff for FLT1. Power is poor for KDR, regardless of cutoff. Among genes that were modeled as disease causing but for which random sampling resulted in the absence of polymorphic rare variants in our data sets, the false-positive rates are low. Two related genes, HIF1A and HIF3A, have false-positive rates of 0, and the other three genes have rates no higher than 0.02, suggesting high test specificity (not shown). However, a more comprehensive assessment of all genes reveals a substantially higher false-positive rate. Figure 1 graphs the detection rates for all genes on chromosomes 4, 5, 6, and 13. We found several genes that seem to have high rates of detection despite not being associated with the trait. Most notable are PCDHGA2 (rate = 0.245), PSMB8 (rate = 0.475), and TRPC4 (rate = 0.205). The high false-positive rate for KIT can be explained by its close proximity to KDR.
Table 4

True-positive rates corresponding to the true causal genes using Q1 as phenotype (estimated from the 200 replications provided in the GAW17 data set)

Chromosome | Gene | 1% cutoff | 3% cutoff
4 | KDR | 0.085 | 0.035
4 | VEGFC | 0.995 | 1
6 | VEGFA | 0.995 | 0.990
13 | FLT1 | 0.075 | 0.775
Figure 1

Detection rates from modified FBAT for all genes on chromosomes 4, 5, 6, and 13. Each bar in the graphs represents the percentage of times that the gene was significant (p < 0.05) in the 200 replicates. True-positive disease genes are labeled. Of note, the KIT locus on chromosome 4, frequently detected as a false positive, is in close proximity (394 kb) to the disease-causing KDR locus.
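The replicate-based rates reported in Table 4 and Figure 1 amount to counting, for each gene, the fraction of phenotype replicates in which the gene-level p-value falls below a chosen threshold. A minimal sketch follows; the function name, the input layout, and the toy p-values are assumptions for illustration and are not GAW17 results. For a causal gene the rate estimates power, and for a noncausal gene it estimates the false-positive rate.

```python
def detection_rates(pvalues_by_gene, alpha=0.01):
    """Fraction of replicates in which each gene is declared significant.

    pvalues_by_gene: dict mapping gene name -> list of p-values,
    one per phenotype replicate (hypothetical layout).
    """
    return {
        gene: sum(p < alpha for p in pvalues) / len(pvalues)
        for gene, pvalues in pvalues_by_gene.items()
    }


# Toy p-values for two made-up genes over four replicates.
toy_pvalues = {
    "GENE_A": [0.001, 0.2, 0.005, 0.5],
    "GENE_B": [0.3, 0.6, 0.02, 0.9],
}
rates = detection_rates(toy_pvalues, alpha=0.01)
```

The threshold is left as a parameter because the text reports rates at p < 0.01 whereas the Figure 1 caption uses p < 0.05.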


Discussion

Rare variants are likely to be private to one or a limited number of families. As a consequence, it is likely that the genetic liability conferred by rare variants will exhibit pronounced genetic heterogeneity, with different individual contributions from numerous variants. It is well recognized that model misspecification, including failure to consider allelic heterogeneity, can severely limit disease-gene mapping efforts. It therefore follows that gene-mapping efforts that focus on rare variants must accommodate this reality. In our study, aggregating linkage statistics across all pedigrees yielded negative results, whereas modeling linkage within individual pedigrees performed well. Thus linkage analysis shows some promise in analyzing rare variants, given sufficiently large pedigrees.

The modified FBAT is also promising. It correctly identifies causal genes that contain polymorphic SNPs in the family sample. However, we found a considerable number of false positives. Many factors could be responsible for the high false-positive rates, for example, failure to adjust for multiple testing, linkage disequilibrium between causal and noncausal SNPs, incorrect variance estimation, lack of normality resulting from the restriction to rare variants, and the method used to simulate the replications. With regard to variance estimation, there are only 8 pedigrees and 194 nuclear families, so differences between the two approaches to computing the variance are to be expected. In study designs often seen in actual samples, these differences may not be so important, but clearly, better approaches are needed. Some limited examination of the sensitivity of the false-positive rate suggests that the use of only rare variants does not have a major impact.

Furthermore, the simulation structure of the family-based sample makes it difficult to evaluate the performance of any family-based method. First, many of the true causal SNPs are not polymorphic in the family-based sample, making it impossible for either linkage or association analysis to identify the causal genes with those variants. Second, for the proposed family-based methods the random variable is the transmission of genotype. Hence the simulated replicates of phenotypes cannot be used to appropriately evaluate the power or validity of such methods.

Further research should investigate possible approaches to extend the proposed association test using variable thresholds for identifying rare variants and using available pathway information. Another issue that can be addressed in future research is the assumption that all rare variants affect disease risk in the same direction; potential ways to address violations of this assumption in the context of our method should be tested.

Conclusions

Linkage analysis, stratified by pedigree, provides a promising method for identifying rare variants, provided that pedigrees are large. The modified FBAT is also promising, but its false-positive rates need to be addressed. Although not attempted here, a promising strategy may be to combine the two approaches, using linkage to screen genes or regions and then using the FBAT to test the selected regions. Given the scale of sequencing studies, this approach not only may be more powerful but may also provide substantial cost savings. Finally, methods for evaluating power and type I error for linkage and transmission testing need to be designed differently to provide valid estimates for those tests.

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

W-KY performed the initial cleaning of the data set and the linkage analysis; GD developed and applied the novel FBAT statistics for the fine mapping analysis. BAR and NL supervised the project.
References

1.  Haseman and Elston revisited.

Authors:  R C Elston; S Buxbaum; K B Jacobs; J M Olson
Journal:  Genet Epidemiol       Date:  2000-07       Impact factor: 2.135

2.  Implementing a unified approach to family-based tests of association.

Authors:  N M Laird; S Horvath; X Xu
Journal:  Genet Epidemiol       Date:  2000       Impact factor: 2.135

3.  Regression-based sib pair linkage analysis for binary traits.

Authors:  Maurice P A Zeegers; John P Rice; Frühling V Rijsdijk; Goncalo R Abecasis; Pak C Sham
Journal:  Hum Hered       Date:  2003       Impact factor: 0.444

Review 4.  Family-based methods for linkage and association analysis.

Authors:  Nan M Laird; Christoph Lange
Journal:  Adv Genet       Date:  2008       Impact factor: 1.944

5.  Parametric and nonparametric linkage analysis: a unified multipoint approach.

Authors:  L Kruglyak; M J Daly; M P Reeve-Daly; E S Lander
Journal:  Am J Hum Genet       Date:  1996-06       Impact factor: 11.025

Review 6.  Statistical analysis of rare sequence variants: an overview of collapsing methods.

Authors:  Carmen Dering; Claudia Hemmelmann; Elizabeth Pugh; Andreas Ziegler
Journal:  Genet Epidemiol       Date:  2011       Impact factor: 2.135

7.  Genetic Analysis Workshop 17 mini-exome simulation.

Authors:  Laura Almasy; Thomas D Dyer; Juan Manuel Peralta; Jack W Kent; Jac C Charlesworth; Joanne E Curran; John Blangero
Journal:  BMC Proc       Date:  2011-11-29