Simon Haworth, Ruth Mitchell, Laura Corbin, Kaitlin H Wade, Tom Dudding, Ashley Budu-Aggrey, David Carslake, Gibran Hemani, Lavinia Paternoster, George Davey Smith, Neil Davies, Daniel J Lawson, Nicholas J Timpson.
Abstract
Large studies use genotype data to discover genetic contributions to complex traits and infer relationships between those traits. Co-incident geographical variation in genotypes and health traits can bias these analyses. Here we show that single genetic variants and genetic scores composed of multiple variants are associated with birth location within UK Biobank and that geographic structure in genotype data cannot be accounted for using routine adjustment for study centre and principal components derived from genotype data. We find that major health outcomes appear geographically structured and that coincident structure in health outcomes and genotype data can yield biased associations. Understanding and accounting for this phenomenon will be important when making inference from genotype data in large studies.
Year: 2019 PMID: 30659178 PMCID: PMC6338768 DOI: 10.1038/s41467-018-08219-1
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Within-UK ancestry predicts migration that confounds education: estimated educational attainment of the United Kingdom, when seen only through the ALSPAC cohort based in Bristol. Scores are 1: vocational, 2: CSEs, 3: O-levels, 4: A-levels, 5: degree. The predicted mean education for each region is given, along with 95% confidence intervals estimated by bootstrap resampling of individuals. Each region is coloured by predicted mean education, where predicted mean = 2 is shaded in red and predicted mean = 5 is shaded in white. See Methods for details. ALSPAC: Avon Longitudinal Study of Parents and Children; CSE: Certificate of Secondary Education
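The caption for Fig. 1 mentions confidence intervals obtained by bootstrap resampling of individuals. A minimal sketch of a percentile bootstrap is given here for orientation. It assumes a plain sample mean stands in for the paper's predicted mean (the actual prediction model is described in Methods and is not reproduced), and the scores and the function name `bootstrap_mean_ci` are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = []
    for _ in range(n_boot):
        # Draw n individuals with replacement and record the resampled mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    lo_idx = round((alpha / 2) * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    return statistics.fmean(values), boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical education scores (1 to 5) for individuals assigned to one region.
scores = [3, 4, 2, 5, 3, 3, 4, 1, 5, 4, 3, 2, 4, 5, 3]
mean, lower, upper = bootstrap_mean_ci(scores)
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Resampling whole individuals, rather than individual measurements, keeps each person's data together, which is what the caption's wording suggests.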
Fig. 2 The relationship between polygenic scores (PS; right-hand label) and geographical terms (left-hand label) within the UK Biobank sample. Tiles are shaded by p value testing the null hypothesis of no association between PS and geographical term, where p = 0 is shaded in black and p = 2e−16 is shaded in red. Statistical adjustment was performed as follows: model 1: no adjustment; model 2: adjustment for genotyping array only; model 3: adjustment for genotyping array, 10 principal components (PCs) and study participation centre; model 4: adjustment for genotyping array, 40 PCs and study participation centre
Fig. 3 Fitted spline regression plots showing the non-linear distribution of polygenic scores (PS) for educational attainment (weighted version, including variants with p < 1.0e−05) in unadjusted model (left) and model after adjustment for 40 principal components and study centre (right). The centre of major population centres is marked for reference. The shaded area represents 95% confidence intervals
Relationship between PS and birth location within UK Biobank
| Axis | Weighted PS, model 1 | Weighted PS, model 2 | Weighted PS, model 3 | Weighted PS, model 4 | Unweighted PS, model 1 | Unweighted PS, model 2 | Unweighted PS, model 3 | Unweighted PS, model 4 |
|---|---|---|---|---|---|---|---|---|
| BMI (GIANT) | | | | | | | | |
| N/S | 9.7e−7 | 9.9e−7 | 0.063 | 0.40 | 0.0013 | 0.0012 | 0.0032 | 0.58 |
| E/W | 0.0036 | 0.0035 | 0.24 | 0.93 | 0.053 | 0.054 | 0.032 | 0.47 |
| EA (SSGAC) | | | | | | | | |
| N/S | 2e−16 | <2e−16 | 6.4e−6 | 6.7e−6 | <2e−16 | <2e−16 | 1.3e−9 | 1.6e−6 |
| E/W | <2e−16 | <2e−16 | 1.5e−9 | 6.0e−11 | <2e−16 | <2e−16 | 7.5e−14 | 1.3e−11 |
| Height (GIANT) | | | | | | | | |
| N/S | <2e−16 | <2e−16 | 1.3e−5 | 0.14 | <2e−16 | <2e−16 | 4.6e−6 | 0.13 |
| E/W | <2e−16 | <2e−16 | 2.1e−4 | 0.095 | <2e−16 | <2e−16 | 3.4e−5 | 0.046 |
| BMI (GIANT) | | | | | | | | |
| N/S | 2.4e−9 | 2.5e−9 | 0.023 | 0.019 | 2.4e−10 | 2.6e−10 | 0.0029 | 0.074 |
| E/W | 1.4e−13 | 1.7e−13 | 0.134 | 0.34 | <2e−16 | <2e−16 | 0.020 | 0.14 |
| EA (SSGAC) | | | | | | | | |
| N/S | <2e−16 | <2e−16 | <2e−16 | <2e−16 | 7.6e−11 | 8.5e−11 | 0.012 | 0.16 |
| E/W | <2e−16 | <2e−16 | <2e−16 | <2e−16 | 9.7e−12 | 8.9e−12 | 0.0021 | 0.041 |
| Height (GIANT) | | | | | | | | |
| N/S | <2e−16 | <2e−16 | 5.9e−5 | 0.16 | <2e−16 | <2e−16 | 2.5e−4 | 0.17 |
| E/W | <2e−16 | <2e−16 | 1.4e−4 | 0.051 | <2e−16 | <2e−16 | 7.2e−5 | 0.014 |
P value for non-linear association between component of birth location and polygenic score. For all models n = 321,439. Statistical adjustment was performed as follows: model 1: no adjustment; model 2: adjustment for genotyping array only; model 3: adjustment for genotyping array, 10 PCs and study participation centre; model 4: adjustment for genotyping array, 40 PCs and study participation centre
N/S: north/south axis of birth location; E/W: east/west axis of birth location; PS: polygenic scores; BMI: body mass index; GIANT: Genetic Investigation of ANthropometric Traits; EA: educational attainment; SSGAC: Social Science Genetic Association Consortium
Linear relationships between observed traits and PS in UK Biobank
| Observed trait (unit) | Weighted PS, model 1 | Weighted PS, model 2 | Weighted PS, model 3 | Weighted PS, model 4 | Unweighted PS, model 1 | Unweighted PS, model 2 | Unweighted PS, model 3 | Unweighted PS, model 4 |
|---|---|---|---|---|---|---|---|---|
| PS for BMI (GIANT) | | | | | | | | |
| Household income (£ per year) | −335 (1.8e−9) | −325 (5.2e−9) | −251 (4.0e−6) | −229 (3.4e−5) | −304 (4.7e−8) | −294 (1.3e−7) | −212 (1.0e−4) | −190 (0.0057) |
| BMI (kg/m²) | 0.612 (<2e−16) | 0.611 (<2e−16) | 0.606 (<2e−16) | 0.606 (<2e−16) | 0.549 (<2e−16) | 0.547 (<2e−16) | 0.541 (<2e−16) | 0.541 (<2e−16) |
| Age at completion of full-time education (years) | −0.0219 (3.2e−4) | −0.0216 (4.0e−4) | −0.0201 (9.2e−4) | −0.0187 (0.0025) | −0.0231 (1.6e−4) | −0.0227 (2.0e−4) | −0.0201 (9.6e−4) | −0.0187 (0.0024) |
| Number of siblings (persons) | 0.0107 (3.0e−4) | 0.0105 (3.6e−4) | 0.00783 (0.0071) | 0.00750 (0.011) | 0.00130 (1.0e−5) | 0.00129 (1.3e−5) | 0.00850 (0.0035) | 0.00807 (0.0068) |
| PS for EA (SSGAC) | | | | | | | | |
| Household income (£ per year) | 1066 (<2e−16) | 1062 (<2e−16) | 874 (<2e−16) | 835 (<2e−16) | 1454 (<2e−16) | 1446 (<2e−16) | 1200 (<2e−16) | 1140 (<2e−16) |
| BMI (kg/m²) | −0.112 (<2e−16) | −0.111 (<2e−16) | −0.101 (<2e−16) | −0.097 (<2e−16) | −0.151 (<2e−16) | −0.150 (<2e−16) | −0.132 (<2e−16) | −0.129 (<2e−16) |
| Age at completion of full-time education (years) | 0.0878 (<2e−16) | 0.0877 (<2e−16) | 0.0844 (<2e−16) | 0.0831 (<2e−16) | 0.12 (<2e−16) | 0.119 (<2e−16) | 0.112 (<2e−16) | 0.109 (<2e−16) |
| Number of siblings (persons) | −0.0250 (<2e−16) | −0.0250 (<2e−16) | −0.0258 (<2e−16) | −0.0253 (<2e−16) | −0.038 (<2e−16) | −0.0382 (<2e−16) | −0.0293 (<2e−16) | −0.0279 (<2e−16) |
| PS for height (GIANT) | | | | | | | | |
| Household income (£ per year) | 522 (<2e−16) | 515 (<2e−16) | 418 (1.8e−14) | 406 (2.7e−13) | 515 (<2e−16) | 509 (<2e−16) | 419 (1.7e−14) | 405 (2.9e−13) |
| BMI (kg/m²) | −0.129 (<2e−16) | −0.128 (<2e−16) | −0.112 (<2e−16) | −0.116 (<2e−16) | −0.122 (<2e−16) | −0.121 (<2e−16) | −0.105 (<2e−16) | −0.109 (<2e−16) |
| Age at completion of full-time education (years) | 0.0350 (9.4e−9) | 0.0348 (1.1e−8) | 0.0289 (2.0e−6) | 0.0263 (2.0e−5) | 0.0349 (1.1e−8) | 0.0347 (1.2e−8) | 0.0286 (2.6e−6) | 0.0265 (1.8e−5) |
| Number of siblings (persons) | −0.0249 (<2e−16) | −0.0248 (<2e−16) | −0.0130 (8.1e−6) | −0.0119 (7.2e−5) | −0.0264 (<2e−16) | −0.0263 (<2e−16) | −0.0136 (3.0e−6) | −0.0127 (2.1e−5) |
Cells show beta coefficients per 1 SD increase in PS; the p values in brackets test the null hypothesis of no linear association between each observed trait and PS. For household income, N = 276,779; BMI, N = 336,031; age at completion of full-time education, N = 228,886; number of siblings, N = 332,037. Statistical adjustment was performed as follows: model 1: no adjustment; model 2: adjustment for genotyping array only; model 3: adjustment for genotyping array, 40 PCs and study participation centre; model 4: adjustment for genotyping array, 40 PCs, study participation centre and non-linear regression terms for North and East axes of birth location
PS: polygenic score; PC: principal component; BMI: body mass index; EA: educational attainment; GIANT: Genetic Investigation of ANthropometric Traits; SSGAC: Social Science Genetic Association Consortium
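The attenuation between models in the table above is the pattern expected when a PS and a trait share geographic structure. The toy simulation below illustrates that mechanism only and is not the authors' analysis. It assumes a single simulated birth-location axis, a purely linear geographic gradient (the paper uses non-linear terms), and no true effect of the PS on the trait; all variable names and effect sizes are hypothetical.

```python
import random


def slope(x, y):
    """OLS slope of y on x (single predictor plus intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residualise(y, z):
    """Residuals of y after a linear regression on z."""
    n = len(y)
    b = slope(z, y)
    mz, my = sum(z) / n, sum(y) / n
    return [yi - my - b * (zi - mz) for yi, zi in zip(y, z)]


rng = random.Random(1)
n = 5000
north = [rng.gauss(0, 1) for _ in range(n)]         # simulated birth-location axis
ps = [0.3 * g + rng.gauss(0, 1) for g in north]     # PS varies with geography
trait = [0.5 * g + rng.gauss(0, 1) for g in north]  # trait varies with geography only

# Unadjusted: the trait appears associated with the PS via shared geography.
unadjusted = slope(ps, trait)
# Adjusted: regressing residuals on residuals equals including `north` as a covariate.
adjusted = slope(residualise(ps, north), residualise(trait, north))
print(f"unadjusted beta = {unadjusted:.3f}, adjusted beta = {adjusted:.3f}")
```

In this simulation the unadjusted slope is clearly non-zero even though the PS has no effect on the trait, and it shrinks towards zero once the location axis is accounted for.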
Fig. 4 Attenuation in linear relationship between polygenic scores (PS) and complex traits in the UK Biobank sample at varying degrees of statistical adjustment. N sibs refers to number of siblings. For each PS, the relationship with four traits was estimated using an unadjusted model (plotted in circle) and this estimate and its corresponding 95% confidence intervals were rescaled to a value of 1. Error bars represent 95% confidence intervals for the rescaled estimate. Adjustment was then performed for genotyping array only (triangles), genotyping array, 40 principal components (PCs) and study participation centre (cross) and 40 PCs, study participation centre and non-linear regression terms for North and East axes of birth location (square). A value of 0.5 on the y-axis would mean that 50% of the unadjusted effect estimate remained after adjustment. Lines are drawn at x = 1 (red) and y = 0 (black) for reference
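The rescaling described for Fig. 4 can be sketched as a simple ratio. This assumes that each adjusted estimate is divided by the unadjusted (model 1) estimate, which is how the caption's "50% of the unadjusted effect estimate remained" reads; confidence bounds would be divided by the same value. The betas are the weighted BMI PS and household income entries from the table above, and the function name `rescale` is hypothetical.

```python
def rescale(estimates, reference="model 1"):
    """Express each estimate as a fraction of the reference (unadjusted) estimate."""
    ref = estimates[reference]
    return {model: beta / ref for model, beta in estimates.items()}


# Betas (£ per year per 1 SD of weighted BMI PS) for household income, from the table.
income_on_bmi_ps = {
    "model 1": -335,
    "model 2": -325,
    "model 3": -251,
    "model 4": -229,
}

ratios = rescale(income_on_bmi_ps)
for model, ratio in ratios.items():
    print(f"{model}: {ratio:.2f} of the unadjusted estimate")
```

Under this reading, roughly 68% of the unadjusted estimate remains in model 4 for this PS and trait pair (229/335).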