Why breeding values estimated using familial data should not be used for genome-wide association studies.

Chinyere C Ekine, Suzanne J Rowe, Stephen C Bishop, Dirk-Jan de Koning.

Abstract

In animal breeding, the genetic potential of an animal is summarized as its estimated breeding value, which is derived from its own performance as well as the performance of related individuals. Here, we illustrate why estimated breeding values are not suitable as a phenotype for genome-wide association studies. We simulated human-type and pig-type pedigrees with a range of quantitative trait loci (QTL) effects (0.5-3% of phenotypic variance) and heritabilities (0.3-0.8). We analyzed 1000 replicates of each scenario with four models: (a) a full mixed model including a polygenic effect, (b) a regression analysis using the residual of a mixed model as a trait score (the so-called GRAMMAR approach), (c) a regression analysis using the estimated breeding value as a trait score, and (d) a regression analysis that uses the raw phenotype as a trait score. We show that using breeding values as a trait score gives very high false-positive rates (up to 14% in human pedigrees and >60% in pig pedigrees). Simulations based on a real pedigree show that additional generations of pedigree increase the type I error. Including the family relationship as a random effect provides the greatest power to detect QTL while controlling for type I error at the desired level and providing the most accurate estimates of the QTL effect. Both the use of residuals and the use of breeding values result in deflated estimates of the QTL effect. We derive the contributions of QTL effects to the breeding value and residual and show how this affects the estimates.

Keywords:  family structure; genome-wide association; statistical power; type I error

Year:  2014        PMID: 24362310      PMCID: PMC3931567          DOI: 10.1534/g3.113.008706

Source DB:  PubMed          Journal:  G3 (Bethesda)        ISSN: 2160-1836            Impact factor:   3.154


Genome-wide association studies (GWAS) are now commonplace in humans, livestock, plants, and model organisms. A commonality among these studies is that genetic links exist between genotyped subjects, and these must be accounted for in statistical analyses. Several approaches have been proposed to take account of these genetic structures in GWAS (Yu; Aulchenko; Kang), resulting in a range of statistical tools such as TASSEL (Zhang), EMMA(X) (Kang), and GenABEL (Aulchenko). Somewhat less attention has been paid to the definition of the trait value that is used for the GWAS. In many situations, a phenotypic trait may be decomposed into an estimated breeding value (EBV) and a residual. The EBV is an estimated measure of the additive genetic merit of an individual (e.g., animal, plant, tree) for the given trait based on its own performance and/or that of genetically related individuals. In the genome-wide rapid association using mixed model and regression (GRAMMAR) approach, the observed phenotype is analyzed under a mixed model resulting in an EBV and a residual, with the latter being used as the trait value for GWAS (Aulchenko). Other investigators, however, have used the EBV of the individual as the trait score for GWAS, assuming that it encompasses the best estimate of the genetic merit of an individual (Johnston; Becker; Čepica). A citation from Becker illustrates this view: “Breeding values have the advantage that they are free of systematic environmental effects on measured phenotypes, as these effects are considered in the statistical model used for the estimation of EBVs. Additionally, they reflect the genetic makeup more accurately because they do not solely rely on own records but include information from all measured relatives.”
We will show here with a straightforward simulation study that the “information from all measured relatives” is a prime source of false-positive results in GWAS. We note that this insight is neither profound nor novel, but our aim is to provide a clear and concise illustration that using EBV comprising familial information (e.g., parents, sibs, etc.) can give much greater false-positive rates than ignoring family relationships altogether.

Materials and Methods

Simulations

The simulation scheme follows that of Aulchenko, with two simulated family structures (human and pig) and one complex real pedigree (pig in this example). For the human pedigree, we simulated 337 nuclear families of three full-sibs with parents that are not related to each other or to any of the other parents. For the pig pedigree, we simulated 10 sires, each mated to 10 dams that had 10 or 11 offspring, resulting in 1010 measured individuals for analysis. For the real pig pedigree, we randomly sampled 1010 last-generation offspring from a total pedigree of 5390 commercial pigs and included either two or five generations of pedigree information. The latter was to test the impact of the depth of pedigree information on the performance of the EBV approach. The pedigrees that were the basis for the simulations are presented in Supporting Information, File S1. Each of the 46 scenarios was simulated in 1000 replicates using the MORGAN genedrop program (George). MORGAN genedrop simulates genotypes at marker loci, trait genotypes, and polygenic values contributing to the quantitative traits. Quantitative traits were defined as the sum of the single-nucleotide polymorphism (SNP) effect, the polygenic effects, and a random environmental error. Two SNP genotypes were simulated and analyzed for association. One SNP was not linked with the trait of interest, or with any other marker, and was used for studying the type I error rate. For studying power, a causal SNP with an additive effect of 4.0 and a minor allele frequency of 0.3 was simulated, explaining 0.5, 1, 2, or 3% of the total variation in the trait. The simulated traits had a total heritability of 0.30, 0.40, 0.50, 0.60, or 0.80. The QTL effect and the variance due to the QTL were constant throughout the simulations, whereas the polygenic variance and the residual variance were scaled to achieve the different QTL contributions and overall heritabilities.
An example of the MORGAN genedrop script files that were used for simulation is given as File S2.
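The scaling of the variance components can be illustrated with a short calculation. The sketch below is not the MORGAN genedrop configuration; it assumes Hardy-Weinberg proportions, treats the additive effect of 4.0 as the effect per allele copy (genotypes coded 0, 1, 2), and assumes that the "total heritability" includes the QTL variance. The function name `variance_components` is hypothetical.

```python
def variance_components(qtl_fraction, heritability, additive_effect=4.0, maf=0.3):
    """Return (qtl, polygenic, residual, phenotypic) variances for one scenario."""
    # Additive variance of a biallelic locus under Hardy-Weinberg: 2pq * a^2.
    qtl_var = 2.0 * maf * (1.0 - maf) * additive_effect ** 2
    # The QTL variance is fixed, so the phenotypic variance follows from
    # the share of variance the QTL is meant to explain.
    phenotypic_var = qtl_var / qtl_fraction
    # ASSUMPTION: total heritability = (QTL + polygenic variance) / phenotypic.
    genetic_var = heritability * phenotypic_var
    polygenic_var = genetic_var - qtl_var
    residual_var = phenotypic_var - genetic_var
    return qtl_var, polygenic_var, residual_var, phenotypic_var


if __name__ == "__main__":
    for h2 in (0.30, 0.40, 0.50, 0.60, 0.80):
        for fraction in (0.005, 0.01, 0.02, 0.03):
            q, g, e, p = variance_components(fraction, h2)
            print(f"h2={h2:.2f} QTL share={fraction:.3f} "
                  f"QTL={q:.2f} polygenic={g:.2f} residual={e:.2f} total={p:.2f}")
```

Under these assumptions the QTL variance is 2 x 0.3 x 0.7 x 16 = 6.72 in every scenario, and only the polygenic and residual variances change.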

Statistical analyses

The simulated data were analyzed using four different approaches:

Measured genotype: The SNP to be tested for association was fitted as a covariate in a polygenic mixed model (1), which accounted for familial relatedness of individuals in the pedigree using the additive genetic relationships among individuals. The SNP effect and polygenic effect were estimated together using this model:

$$\mathbf{y} = \mu + \mathbf{w}a + \mathbf{Z}\mathbf{u} + \mathbf{e} \qquad (1)$$

where $\mathbf{y}$ is the vector of trait values, $\mu$ is the overall mean, $a$ is the additive QTL effect, and $\mathbf{u}$ and $\mathbf{e}$ are vectors of additive polygenic effects (random) and random residuals, respectively; $\mathbf{u} \sim N(0, \mathbf{A}\sigma^2_a)$, where $\mathbf{A}$ is the additive genetic relationship matrix based on pedigree information, and $\mathbf{e} \sim N(0, \mathbf{I}\sigma^2_e)$, where $\mathbf{I}$ is an identity matrix; $\sigma^2_a$ and $\sigma^2_e$ are the additive genetic and residual error variance, respectively. $\mathbf{w}$ is a vector of marker genotypes (coded as 0, 1, 2) and $\mathbf{Z}$ is an incidence matrix relating observations to polygenic effects.

GRAMMAR: The GRAMMAR approach consists of two steps (Aulchenko): the first step accounts for the familial dependence among family members and the second step tests the single SNP effect on the remaining variation by analysis of variance. Step 1: For the simulated trait score, we fitted the following mixed model [with the same variable definitions as (1)] without the marker effect:

$$\mathbf{y} = \mu + \mathbf{Z}\mathbf{u} + \mathbf{e} \qquad (2)$$

Step 2: Using the estimated residuals from Step 1 as the new quantitative trait ($\mathbf{y}^{*}$), the marker genotype effect of each SNP on the trait was tested by linear regression:

$$\mathbf{y}^{*} = \mu + \mathbf{w}a + \mathbf{e} \qquad (3)$$

Ignoring family structure (IF): The IF analysis is comparable with the second step of the GRAMMAR analysis. It uses a direct regression of the phenotypic observation ($\mathbf{y}$) on the SNP data and does not take account of family relationships.

EBV: Similar to GRAMMAR, but in this analysis the EBV from the polygenic model [$\hat{\mathbf{u}}$, from model (2)] is used as the trait score ($\mathbf{y}^{*}$) for the association study (3).

All analyses were performed in ASReml (Gilmour).
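The regression step shared by the GRAMMAR, EBV, and IF analyses can be sketched in a few lines. This is a plain least-squares illustration, not the ASReml analysis used in the study; the function name `snp_regression` and the toy data are hypothetical. The trait score passed in would be the residual (GRAMMAR), the EBV, or the raw phenotype (IF).

```python
def snp_regression(genotypes, trait_score):
    """Regress a trait score on allele counts (0/1/2); return (slope, F)."""
    n = len(genotypes)
    mean_w = sum(genotypes) / n
    mean_y = sum(trait_score) / n
    # Corrected sums of squares and cross-products.
    sxx = sum((w - mean_w) ** 2 for w in genotypes)
    sxy = sum((w - mean_w) * (y - mean_y) for w, y in zip(genotypes, trait_score))
    syy = sum((y - mean_y) ** 2 for y in trait_score)
    slope = sxy / sxx                     # estimate of the SNP effect a
    ss_regression = slope * sxy           # variation explained by the SNP (1 df)
    ss_error = syy - ss_regression        # remaining variation (n - 2 df)
    f_stat = ss_regression / (ss_error / (n - 2))
    return slope, f_stat


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    w = [0, 0, 1, 1, 2, 2]
    y = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
    print(snp_regression(w, y))
```

Note that this test treats all observations as independent, which is exactly the assumption that relatedness among individuals violates.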
The type I error of each scenario was estimated using the unlinked SNP and a tabulated threshold of F > 3.85 (P < 0.05). The statistical power to detect the causal SNP was estimated using either the tabulated F-threshold of 3.85 or an empirical threshold based on a 5% error rate for the unlinked SNP. To handle the computational load of 1000 replicates of 46 scenarios, the simulations and analyses were run on the Edinburgh Compute and Data Facility.
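The bookkeeping behind these quantities can be sketched as follows, assuming that the F statistics of all replicates for the unlinked and the causal SNP have been collected in lists. The function names and the toy values are hypothetical, and the way the empirical threshold is picked (the largest null F statistic that leaves at most 5% of null replicates above it) is one reasonable reading of the text rather than the authors' exact procedure.

```python
def false_positive_rate(null_f_stats, threshold=3.85):
    """Share of unlinked-SNP replicates whose F statistic exceeds the threshold."""
    return sum(f > threshold for f in null_f_stats) / len(null_f_stats)


def empirical_threshold(null_f_stats, alpha=0.05):
    """Threshold that at most a fraction alpha of the null F statistics exceed."""
    ordered = sorted(null_f_stats)
    n_allowed_above = int(alpha * len(ordered) + 1e-9)
    return ordered[len(ordered) - n_allowed_above - 1]


def power(causal_f_stats, threshold):
    """Share of causal-SNP replicates whose F statistic exceeds the threshold."""
    return sum(f > threshold for f in causal_f_stats) / len(causal_f_stats)


if __name__ == "__main__":
    # Toy values, made up for illustration only.
    null_f = [float(i) for i in range(1, 21)]
    causal_f = [10.0, 2.0, 30.0, 25.0]
    cutoff = empirical_threshold(null_f)
    print(false_positive_rate(null_f), cutoff, power(causal_f, cutoff))
```

Comparing power at the empirical threshold puts methods with very different false-positive rates on an equal footing, which is why both thresholds are reported.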

Results

Type I error

The false-positive rates (FPRs) for the four methods are summarized in Table 1, averaged across QTL effect sizes. Because they were estimated on an unlinked SNP, they were, as expected, independent of QTL size. The GRAMMAR approach was conservative, whereas the measured genotype approach performed very close to the nominal 5% level. Using the EBV as the trait score or ignoring family relationships resulted in much greater FPRs. The size of the inflation depended on the family structure in the data: FPRs were greater for the pig scenarios, which have a greater degree of relatedness, than for the human scenarios. Particularly for the pig data, the FPR for IF increased with increasing heritability, whereas it decreased for EBV. GRAMMAR, in contrast, was slightly less conservative for the human data than for the pig data.
Table 1

Type I error rate for MG, GRAMMAR, EBV, and IF analysis for the simulated human and pig population structures, averaged across QTL effects for each heritability (h2) class

| h2 | Human MG | Human GRA | Human EBV | Human IF | Pig MG | Pig GRA | Pig EBV | Pig IF |
|---|---|---|---|---|---|---|---|---|
| 30% | 0.050 | 0.038 | 0.139 | 0.067 | 0.051 | 0.017 | 0.630 | 0.268 |
| 40% | 0.047 | 0.031 | 0.127 | 0.068 | 0.057 | 0.018 | 0.600 | 0.324 |
| 50% | 0.044 | 0.025 | 0.122 | 0.070 | 0.043 | 0.009 | 0.579 | 0.352 |
| 60% | 0.055 | 0.031 | 0.144 | 0.091 | 0.054 | 0.012 | 0.570 | 0.401 |
| 80% | 0.053 | 0.023 | 0.135 | 0.111 | 0.045 | 0.007 | 0.485 | 0.445 |

MG, measured genotype; GRAMMAR, genome-wide rapid association using mixed model and regression; EBV, estimated breeding value; IF, ignoring family; GRA, GRAMMAR.

The simulations based on the commercial pig pedigree also showed a large type I error when using EBV (Table S1), albeit slightly lower than for the simulated pedigree. This was due to the smaller family sizes in the sample from the real pedigree compared with the simulated pedigree. The use of five generations of pedigree in the mixed model (2) gave a greater FPR when we used EBV as the trait score compared with using only two generations (Table S1). The use of GRAMMAR was more conservative when applying five generations of pedigree compared with two. There was no clear trend in type I error when using the measured genotype approach (1) or ignoring family structure: in some scenarios using five generations of pedigree gave a more conservative type I error, whereas in other scenarios it was more liberal than using two generations of pedigree information (Table S1).

Power

The full comparisons of statistical power to detect the QTL across all scenarios are presented in Table S2 for the simulated human and pig pedigrees, and in Table S1 for the real pedigree simulations. Measured genotype analyses (1) usually, but not always, had the greatest power, irrespective of whether empirical or tabulated thresholds were used (Figure 1 and Table S2). Using EBV or ignoring family relationships gave greater power with tabulated thresholds than with empirical thresholds (Figure 1 and Table S2), but at the cost of high FPRs (Table 1). GRAMMAR was conservative when using tabulated thresholds but comparable in power with the measured genotype approach when applying empirical thresholds (Figure 1 and Table S2), in agreement with Aulchenko.
Figure 1

Empirical and tabulated power of detecting a QTL that explains 1% of phenotypic variance in a trait with 40% heritability. MG: measured genotype; tab: tabulated power, emp: empirical power.

For scenarios in which the heritability and QTL effect sizes were low, e.g., a heritability of 30% and a QTL explaining 1% of phenotypic variance as shown in Figure 1, the human pedigree had greater power than the pig pedigree. However, at greater heritabilities, these differences diminished and for some scenarios the pig pedigree had the higher power (Table S2). For the simulations based on the commercial pig pedigree, the measured genotype approach had the highest power regardless of whether the thresholds were tabulated or based on empirical results (Table S1). The two-generation pedigree generally gave slightly greater power to detect QTL than the five-generation pedigree.

Estimated effects

The estimates of the QTL effect and their empirical standard deviations over 1000 replicates are summarized in Table 2. Noting that the expected effect size was 4.0, using measured genotype gave the most accurate, and apparently unbiased, estimates of the QTL effect regardless of the variance explained by the QTL or the overall heritability (Table 2). Ignoring family information also gave accurate estimates of the QTL effect, except when the proportion of variance explained by the QTL was small, in which case the estimates were slightly inflated (Table 2). When using GRAMMAR or EBV, the QTL effects were underestimated dramatically. This confirms earlier observations that GRAMMAR underestimates the QTL effects (Aulchenko; Crooks).
Table 2

Mean estimates (mean) and empirical standard deviations (SD) of QTL effect for different association analyses across a range of relative QTL effects and heritabilities (h2) in simulated human and pig pedigrees

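Human pedigree

| h2 | QTL effect | MG Mean | MG SD | GRAMMAR Mean | GRAMMAR SD | EBV Mean | EBV SD | IF Mean | IF SD |
|---|---|---|---|---|---|---|---|---|---|
| 30% | 0.5% | 4.01 | 1.83 | 2.55 | 1.18 | 1.49 | 0.83 | 4.02 | 1.86 |
| 30% | 1% | 3.97 | 1.37 | 2.53 | 0.91 | 1.45 | 0.64 | 3.70 | 1.38 |
| 30% | 2% | 3.98 | 0.95 | 2.52 | 0.65 | 1.47 | 0.53 | 3.99 | 0.98 |
| 30% | 3% | 3.97 | 0.79 | 2.51 | 0.53 | 1.46 | 0.49 | 3.96 | 0.81 |
| 50% | 0.5% | 3.90 | 1.86 | 1.74 | 0.86 | 2.20 | 1.23 | 3.91 | 1.97 |
| 50% | 1% | 3.96 | 1.38 | 1.77 | 0.67 | 2.20 | 0.95 | 3.96 | 1.46 |
| 50% | 2% | 4.04 | 1.01 | 1.81 | 0.54 | 2.25 | 0.70 | 4.06 | 1.05 |
| 50% | 3% | 4.05 | 0.78 | 1.80 | 0.42 | 2.26 | 0.61 | 4.06 | 0.81 |
| 80% | 0.5% | 4.01 | 1.92 | 0.75 | 0.44 | 3.34 | 1.88 | 4.07 | 2.18 |
| 80% | 1% | 3.96 | 1.34 | 0.74 | 0.34 | 3.25 | 1.36 | 3.98 | 1.52 |
| 80% | 2% | 3.94 | 0.96 | 0.73 | 0.23 | 3.18 | 0.99 | 3.91 | 1.09 |
| 80% | 3% | 3.98 | 0.80 | 0.74 | 0.28 | 3.25 | 0.83 | 3.99 | 0.89 |

Pig pedigree

| h2 | QTL effect | MG Mean | MG SD | GRAMMAR Mean | GRAMMAR SD | EBV Mean | EBV SD | IF Mean | IF SD |
|---|---|---|---|---|---|---|---|---|---|
| 30% | 0.5% | 4.09 | 2.01 | 2.15 | 1.09 | 2.64 | 2.05 | 4.36 | 2.85 |
| 30% | 1% | 4.05 | 1.52 | 2.11 | 0.82 | 2.24 | 1.65 | 4.15 | 2.20 |
| 30% | 2% | 4.01 | 1.04 | 2.09 | 0.61 | 1.95 | 1.26 | 3.96 | 1.57 |
| 30% | 3% | 3.97 | 0.86 | 2.09 | 0.53 | 1.90 | 1.10 | 3.97 | 1.25 |
| 50% | 0.5% | 4.00 | 1.93 | 1.55 | 0.79 | 3.45 | 2.69 | 4.60 | 3.14 |
| 50% | 1% | 3.97 | 1.48 | 1.54 | 0.63 | 2.82 | 2.05 | 4.08 | 2.48 |
| 50% | 2% | 4.03 | 1.04 | 1.57 | 0.49 | 2.60 | 1.64 | 4.04 | 1.90 |
| 50% | 3% | 3.98 | 0.83 | 1.53 | 0.42 | 2.47 | 1.45 | 3.93 | 1.61 |
| 80% | 0.5% | 4.14 | 1.81 | 0.74 | 0.46 | 4.63 | 3.38 | 5.15 | 3.58 |
| 80% | 1% | 4.02 | 1.40 | 0.74 | 0.39 | 3.80 | 2.66 | 4.38 | 2.86 |
| 80% | 2% | 3.98 | 0.97 | 0.73 | 0.35 | 3.49 | 2.13 | 4.14 | 2.22 |
| 80% | 3% | 3.99 | 0.78 | 0.74 | 0.35 | 3.26 | 1.77 | 3.96 | 1.80 |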

The simulated QTL effect was always 4. MG, measured genotype; GRAMMAR, genome-wide rapid association using mixed model and regression; EBV, estimated breeding value; IF, ignoring family.

A clear trend was apparent for the effect of the heritability on the estimates when using GRAMMAR or EBVs. With increasing heritability, the GRAMMAR estimates were increasingly biased downward while those from the EBV approach became less biased (Table 2). Furthermore, in the pig simulations the estimates from the EBV approach were more severely underestimated with increasing variance explained by the QTL. On the other hand, the GRAMMAR estimates appeared to be unaffected by the proportion of variance explained by the QTL in both the pig and the human simulations (Table 2). When we looked at individual replicates, it was apparent that the sum of the GRAMMAR and EBV estimates provided an unbiased estimate of the SNP effect. This could provide a quick estimate of the true effect of significant SNPs after GRAMMAR analyses, rather than re-estimating the effect in a full mixed model as suggested previously (Aulchenko). Across all scenarios and analyses, the precision of the estimates increased as the proportion of variance explained by the QTL increased, as shown by the empirical standard deviations (Table 2).
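The observation about summing the two estimates can be checked against the mean estimates for the human pedigree in Table 2. The sketch below only adds the tabulated means; it does not reproduce the per-replicate comparison described in the text, and the names `HUMAN_TABLE2` and `summed_estimates` are hypothetical.

```python
# (h2, QTL share) -> (GRAMMAR mean, EBV mean), human pedigree, copied from Table 2.
HUMAN_TABLE2 = {
    ("30%", "0.5%"): (2.55, 1.49),
    ("30%", "1%"): (2.53, 1.45),
    ("30%", "2%"): (2.52, 1.47),
    ("30%", "3%"): (2.51, 1.46),
    ("50%", "0.5%"): (1.74, 2.20),
    ("50%", "1%"): (1.77, 2.20),
    ("50%", "2%"): (1.81, 2.25),
    ("50%", "3%"): (1.80, 2.26),
    ("80%", "0.5%"): (0.75, 3.34),
    ("80%", "1%"): (0.74, 3.25),
    ("80%", "2%"): (0.73, 3.18),
    ("80%", "3%"): (0.74, 3.25),
}

SIMULATED_EFFECT = 4.0


def summed_estimates(table):
    """Add the GRAMMAR and EBV mean estimates for every scenario."""
    return {key: grammar + ebv for key, (grammar, ebv) in table.items()}


if __name__ == "__main__":
    for (h2, share), total in summed_estimates(HUMAN_TABLE2).items():
        print(f"h2={h2:>3} QTL={share:>4} GRAMMAR+EBV={total:.2f} "
              f"(simulated effect {SIMULATED_EFFECT})")
```

For the human pedigree every sum lies within 0.1 of the simulated effect of 4. The same arithmetic on the pig columns gives sums further from 4 when the QTL explains little variance (for example 2.15 + 2.64 = 4.79 at a heritability of 30% and a QTL share of 0.5%), so the shortcut is probably best treated as approximate.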

Discussion

Our simulations have shown clearly that the use of EBV in association studies, incorporating information from relatives from the same or previous generations, can result in several problems, most notably huge increases in the type I error. This finding is attributed to the fact that when an individual’s EBV is estimated with familial information, the estimate is a linear combination of the individual’s phenotype, expressed as a deviation from the family mean, and the family mean itself. Although the individual phenotype captures the within-family segregation of the QTL, the family mean contains information on the QTL alleles carried by other family members. This “contamination” of the EBV by family information can affect both power and the FPR as follows. First, power can be reduced (as shown by results for the empirical thresholds in Table S1 and Table S2) because many sibs may have received alternative QTL alleles. This dilutes the SNP effect, and it will have greater impact in situations in which family information makes a greater contribution to the EBV. Second, there is a greater risk of false positives, as any SNP that differs in frequency between families risks being correlated by chance with the family mean polygenic value for the trait and hence showing a significant association with the EBV. This finding implies that in analyses of real human pedigrees the type I errors are expected to be even more serious than those shown here in the simulated pedigrees, as a result of minor ethnic variations from one pedigree to the next. In the Appendix, we demonstrate how an EBV may be decomposed into individual and family information and into major gene and polygenic (unlinked) effects. With EBVs, it is apparent that the weighting given to the family mean is always greater than that given to the Mendelian sampling term, hence the risks of false positives and reduced power described previously, with the converse true for residuals.
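The mechanism described above can be illustrated with a small stand-alone simulation. This is a sketch under simplifying assumptions, not the ASReml analysis or the Appendix derivation: it uses unrelated full-sib families, an EBV-like score built from textbook selection-index weights for the family mean and the within-family deviation (no common environment), an unlinked SNP, and the tabulated threshold of 3.85. All function names and default parameter values are hypothetical.

```python
import random


def f_statistic(x, y):
    """F statistic (1 and n - 2 df) for the simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0:  # SNP not segregating in this replicate
        return 0.0
    ss_reg = sxy * sxy / sxx
    return ss_reg / ((syy - ss_reg) / (n - 2))


def simulate_replicate(rng, n_fam, n_sib, h2, maf):
    """One replicate: F for the raw phenotype and for an EBV-like score."""
    t = 0.5 * h2  # phenotypic correlation between full sibs (ASSUMPTION: no common environment)
    w_dev = h2 * 0.5 / (1.0 - t)  # weight on the within-family deviation
    w_fam = h2 * (1.0 + (n_sib - 1) * 0.5) / (1.0 + (n_sib - 1) * t)  # weight on the family mean
    genotypes, phenotypes, family_means = [], [], []
    for _ in range(n_fam):
        sire = [rng.random() < maf, rng.random() < maf]
        dam = [rng.random() < maf, rng.random() < maf]
        family_effect = rng.gauss(0.0, (0.5 * h2) ** 0.5)  # mid-parent breeding value
        sib_values = []
        for _ in range(n_sib):
            # The SNP is unlinked: it is inherited but has no effect on the trait.
            genotypes.append(int(rng.choice(sire)) + int(rng.choice(dam)))
            # Mendelian sampling plus environment, total variance 1 - h2/2.
            sib_values.append(family_effect + rng.gauss(0.0, (1.0 - 0.5 * h2) ** 0.5))
        mean = sum(sib_values) / n_sib
        phenotypes.extend(sib_values)
        family_means.extend([mean] * n_sib)
    grand_mean = sum(phenotypes) / len(phenotypes)
    ebv_like = [w_fam * (m - grand_mean) + w_dev * (y - m)
                for y, m in zip(phenotypes, family_means)]
    return f_statistic(genotypes, phenotypes), f_statistic(genotypes, ebv_like)


def false_positive_rates(n_rep=200, n_fam=50, n_sib=20, h2=0.3, maf=0.3,
                         threshold=3.85, seed=1):
    """Return (FPR for raw phenotype, FPR for EBV-like score) over n_rep replicates."""
    rng = random.Random(seed)
    hits_raw = hits_ebv = 0
    for _ in range(n_rep):
        f_raw, f_ebv = simulate_replicate(rng, n_fam, n_sib, h2, maf)
        hits_raw += f_raw > threshold
        hits_ebv += f_ebv > threshold
    return hits_raw / n_rep, hits_ebv / n_rep


if __name__ == "__main__":
    print(false_positive_rates())
```

Because the family-mean weight exceeds the within-family weight, most of the variation in the EBV-like score lies between families, so under these assumptions it should cross the threshold for an unlinked SNP more often than the raw phenotype does, which is the pattern reported in Table 1.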
The relative weightings, and the expected value of the regression of phenotype on marker, are somewhat complex and depend on the trait heritability, the accuracy with which the family mean is estimated, and the QTL frequency. Derivations are shown in the Appendix for family information estimated solely from sib means; however, the same principles apply to information obtained from other sources. This is seen in Table S1, where the five-generation pedigree resulted in a greater FPR than the two-generation pedigree: because the family mean was estimated more accurately, a greater weighting was applied to it, resulting in a greater FPR.

In species in which the bulk of the information is derived from progeny testing, such as dairy cattle, the contribution to the EBV coming from relatives other than the direct offspring becomes smaller, and using EBV will be less detrimental than in the cases considered here. This is because offspring information directly estimates the Mendelian sampling term of the animal being evaluated, and hence assists in estimation of the QTL effect. We acknowledge that for many species, most commonly dairy cattle, where EBV are derived for bulls using a large number of daughter records, EBV for a wide range of traits are routinely available and are a convenient source of information. In these cases, the use of de-regressed EBV has been proposed for GWAS and genomic prediction (Garrick). De-regressed EBV take account of the heterogeneous variances of EBV that result from, e.g., different numbers of daughter records per sire. However, de-regression does not remove the component of the EBV coming from information on other relatives. To remove effects from other relatives from the (de-regressed) EBV, Garrick suggested adjusting for the parent average effect.
The use of de-regressed EBVs, adjusted for parent average effects, can also be relevant when the EBV is the result of repeated measurements that are not easily replaced by a single trait score for GWAS. In summary, when each genotyped individual has its own associated trait score(s), we recommend the use of a measured genotype approach (1) or an approximation using GRAMMAR. Although GRAMMAR was once again shown to be conservative and to underestimate the QTL effect, recent developments of the GenABEL software have accounted for this in the GRAMMAR-Lambda module, which provides an adjusted test statistic and a correction for the estimated QTL effect (Svishcheva). At all costs, naïve usage of EBVs incorporating familial information should be avoided, as use of EBVs will achieve the triple whammy of reducing power, increasing the FPR, and misestimating QTL effect sizes.
References

1.  GenABEL: an R library for genome-wide association analysis.

Authors:  Yurii S Aulchenko; Stephan Ripke; Aaron Isaacs; Cornelia M van Duijn
Journal:  Bioinformatics       Date:  2007-03-23       Impact factor: 6.937

2.  Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis.

Authors:  Yurii S Aulchenko; Dirk-Jan de Koning; Chris Haley
Journal:  Genetics       Date:  2007-07-29       Impact factor: 4.562

3.  Association mapping of quantitative trait loci for carcass and meat quality traits at the central part of chromosome 2 in Italian Large White pigs.

Authors:  S Cepica; P Zambonelli; F Weisz; M Bigi; A Knoll; Z Vykoukalová; M Masopust; M Gallo; L Buttazzoni; R Davoli
Journal:  Meat Sci       Date:  2013-05-09       Impact factor: 5.209

4.  Efficient control of population structure in model organism association mapping.

Authors:  Hyun Min Kang; Noah A Zaitlen; Claire M Wade; Andrew Kirby; David Heckerman; Mark J Daly; Eleazar Eskin
Journal:  Genetics       Date:  2008-03       Impact factor: 4.562

5.  Variance component model to account for sample structure in genome-wide association studies.

Authors:  Hyun Min Kang; Jae Hoon Sul; Susan K Service; Noah A Zaitlen; Sit-Yee Kong; Nelson B Freimer; Chiara Sabatti; Eleazar Eskin
Journal:  Nat Genet       Date:  2010-03-07       Impact factor: 38.330

6.  Rapid variance components-based method for whole-genome association analysis.

Authors:  Gulnara R Svishcheva; Tatiana I Axenovich; Nadezhda M Belonogova; Cornelia M van Duijn; Yurii S Aulchenko
Journal:  Nat Genet       Date:  2012-09-16       Impact factor: 38.330

7.  Genome-wide association mapping identifies the genetic basis of discrete and quantitative variation in sexual weaponry in a wild sheep population.

Authors:  Susan E Johnston; John C McEwan; Natalie K Pickering; James W Kijas; Dario Beraldi; Jill G Pilkington; Josephine M Pemberton; Jon Slate
Journal:  Mol Ecol       Date:  2011-03-29       Impact factor: 6.185

8.  Mixed linear model approach adapted for genome-wide association studies.

Authors:  Zhiwu Zhang; Elhan Ersoz; Chao-Qiang Lai; Rory J Todhunter; Hemant K Tiwari; Michael A Gore; Peter J Bradbury; Jianming Yu; Donna K Arnett; Jose M Ordovas; Edward S Buckler
Journal:  Nat Genet       Date:  2010-03-07       Impact factor: 38.330

9.  A genome-wide association study to detect QTL for commercially important traits in Swiss Large White boars.

Authors:  Doreen Becker; Klaus Wimmers; Henning Luther; Andreas Hofer; Tosso Leeb
Journal:  PLoS One       Date:  2013-02-05       Impact factor: 3.240

10.  Comparison of analyses of the QTLMAS XII common dataset. II: genome-wide association and fine mapping.

Authors:  Lucy Crooks; Goutam Sahana; Dirk-Jan de Koning; Mogens Sandø Lund; Orjan Carlborg
Journal:  BMC Proc       Date:  2009-02-23