Andrea G Allegrini1, Ville Karhunen2, Jonathan R I Coleman1,3, Saskia Selzam1, Kaili Rimfeld1, Sophie von Stumm4, Jean-Baptiste Pingault1,5, Robert Plomin1. 1. Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, United Kingdom. 2. Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, United Kingdom. 3. NIHR Maudsley Biomedical Research Centre, King's College London, United Kingdom. 4. Department of Education, University of York, United Kingdom. 5. Division of Psychology and Language Sciences, University College London, United Kingdom.
Abstract
Polygenic scores are increasingly powerful predictors of educational achievement. It is unclear, however, how sets of polygenic scores, which partly capture environmental effects, perform jointly with sets of environmental measures, which are themselves heritable, in prediction models of educational achievement. Here, for the first time, we systematically investigate gene-environment correlation (rGE) and interaction (GxE) in the joint analysis of multiple genome-wide polygenic scores (GPS) and multiple environmental measures as they predict tested educational achievement (EA). We predict EA in a representative sample of 7,026 16-year-olds, with 20 GPS for psychiatric, cognitive and anthropometric traits, and 13 environments (including life events, home environment, and SES) measured earlier in life. Environmental and GPS predictors were modelled, separately and jointly, in penalized regression models with out-of-sample comparisons of prediction accuracy, considering the implications that their interplay had on model performance. Jointly modelling multiple GPS and environmental factors significantly improved prediction of EA, with cognitive-related GPS adding unique independent information beyond SES, home environment and life events. We found evidence for rGE underlying variation in EA (rGE = .38; 95% CIs = .30, .45). We estimated that 40% (95% CIs = 31%, 50%) of the polygenic scores effects on EA were mediated by environmental effects, and in turn that 18% (95% CIs = 12%, 25%) of environmental effects were accounted for by the polygenic model, indicating genetic confounding. Lastly, we did not find evidence that GxE effects significantly contributed to multivariable prediction. Our multivariable polygenic and environmental prediction model suggests widespread rGE and unsystematic GxE contributions to EA in adolescence.
Polygenic scores are increasingly powerful predictors of educational achievement. It is unclear, however, how sets of polygenic scores, which partly capture environmental effects, perform jointly with sets of environmental measures, which are themselves heritable, in prediction models of educational achievement. Here, for the first time, we systematically investigate gene-environment correlation (rGE) and interaction (GxE) in the joint analysis of multiple genome-wide polygenic scores (GPS) and multiple environmental measures as they predict tested educational achievement (EA). We predict EA in a representative sample of 7,026 16-year-olds, with 20 GPS for psychiatric, cognitive and anthropometric traits, and 13 environments (including life events, home environment, and SES) measured earlier in life. Environmental and GPS predictors were modelled, separately and jointly, in penalized regression models with out-of-sample comparisons of prediction accuracy, considering the implications that their interplay had on model performance. Jointly modelling multiple GPS and environmental factors significantly improved prediction of EA, with cognitive-related GPS adding unique independent information beyond SES, home environment and life events. We found evidence for rGE underlying variation in EA (rGE = .38; 95% CIs = .30, .45). We estimated that 40% (95% CIs = 31%, 50%) of the polygenic scores effects on EA were mediated by environmental effects, and in turn that 18% (95% CIs = 12%, 25%) of environmental effects were accounted for by the polygenic model, indicating genetic confounding. Lastly, we did not find evidence that GxE effects significantly contributed to multivariable prediction. Our multivariable polygenic and environmental prediction model suggests widespread rGE and unsystematic GxE contributions to EA in adolescence.
Education is compulsory in nearly all countries because it provides children with the skills, such as literacy and numeracy, that are essential for successfully participating in society. How well children perform at school, indicated by their educational achievement (EA; not to be confused with educational attainment, which is a measure of years spent in education), predicts many important life outcomes, especially further education and occupational status [1]. Quantitative genetic research based on twin studies showed that EA is 60% heritable throughout the school years [2, 3]. These studies also suggested that about 20% of the variance of EA and other learning-related traits can be ascribed to shared environmental factors, for example growing up in the same family and going to the same school. However, the picture became more complicated with the discovery that ostensible measures of the environment associated with educational achievement showed genetic influence–most notably, parents’ educational attainment, socio-economic status (SES) and aspects of the home environment [4].Quantitative genetic theory distinguishes two types of interplay between genetic and environmental effects, genotype-environment correlation (rGE) and genotype-environment interaction (GxE) [5]. rGE occurs when an individual’s genotype covaries with environmental exposures. There are three types of rGE: passive, active and evocative. Passive rGE results from the inheritance of both genetic propensities and environments linked to parental genotypes. That is, individuals inherit from parents a genetic predisposition to a particular trait, but parental genotypes are also associated with rearing environments that, in turn, increase the likelihood of developing a particular trait. For example, individuals with stronger genetic predispositions to educational attainment tend to grow up in higher socioeconomic status families [6]. Evocative rGE happens when individuals’ genetic propensities evoke a response from the surrounding environment; for example children’s predisposition to higher food intake might elicit restrictive food behaviors from their parents [7]. Active rGE results from individuals actively selecting environments that are linked to their genetic propensity; for example, individuals with a higher genetic predisposition to educational attainment tend to migrate to economically prosperous regions that offer greater educational opportunities [8].GxE, on the other hand, refers to genetic moderation of environmental effects. That is, when the effects of environmental exposures on phenotypes depend on individuals’ genotypes. Equivalently, environmentally moderated genetic effects occur when genetic effects on a phenotype depend on environmental exposures. Importantly, however, rGE may confound GxE effects [9]. For example, if a genetic predisposition for a particular trait is found in a particular environment, it is difficult to know whether this represents rGE between the trait and the environment or true GxE. This picture becomes even more complicated when we consider that environments are themselves heritable [4].Research on GxE was rejuvenated when it became possible to include measured genetic and environmental factors in statistical models. Hundreds of studies were published purporting to show interactions between candidate genes and environmental measures as they predict behavioural traits. For example, a seminal GxE study in the field [10] showed that carriers of two copies of the short serotonine allele on the 5HTT gene exposed to adversity had an increased the risk for depression compared to their genetic counterpart. However, GxE effects such as these have a poor replication history [11, 12]. The main problem with this approach is that it ignores the high polygenicity of complex traits, with a reductionist focus on single ‘candidate’ variants. This combined with typically small sample sizes, underpowered to detect the very small effects that can be expected for GxE, led to a replication failure [13].In complex traits, very few individual variants capture more than a tiny fraction of trait variance [14]. Genome-wide polygenic scores (GPS) are the missing piece for investigating the interplay between genes and environment because they can theoretically capture genetic influences up to the limit of SNP-based heritability, which is usually 25–50% of the total heritability for behavioural traits. GPS are indices of an individual’s genetic propensity for a trait and are typically derived as the sum of the total number of trait-associated alleles across the genome, weighted by their respective association effect size estimated through genome-wide association analysis [15]. A GPS derived from a genome-wide association study of educational attainment (years of schooling) [16] predicts up to 15% of the variance of EA [17]. As more powerful GPS become available, they have begun to be used widely in research on GxE [18-23] and rGE [7, 24–27].Recently it has been possible to dissect the role of parental genetics on child achievement by splitting the parental genome into transmitted alleles (indexing passive rGE) and non-transmitted alleles (indexing environmentally transmitted parental genetic effects). The latter demonstrated that parental genotypes are associated with the environment they provide for the child [28, 29]. In fact, a growing body of evidence is showing the importance of considering gene-environment correlation when assessing polygenic effects on trait variation [30, 31], especially for educationally relevant traits. Paralleling previous findings from the quantitative genetics literature, a key point is that environmental measures are themselves heritable and GPS effects can be mediated by the environment, while environmental effects can be accounted for by genetics (genetic confounding). In this sense, polygenic scores for cognitive traits are not pure measures of genetic predisposition: their predictive power also captures environmental effects. For the same reason, environmental measures are not pure measures of the environment.Rather than examining rGE and GxE for single polygenic scores and environmental measures, here we look at sets of GPS [32] and environmental measures. A multivariable approach is especially warranted for EA because twin analyses show that the high heritability (60%) of EA reflects many genetically influenced traits, including personality and behaviour problems in addition to cognitive traits [33, 34]. Correspondingly, EA GPS is associated with a wide range of traits, including psychiatric, anthropometric and behavioural traits [35]. Similarly, environmental predictors of EA are also intercorrelated (e.g. SES and home environment). However, it is not yet clear how sets of polygenic scores, partly capturing environmental effects, perform jointly with sets of environmental measures, which are themselves heritable, and the effect that their interplay (rGE and GxE) might have on prediction.Here for the first time we systematically investigate the interplay of GPS and environmental measures in the multivariable prediction of tested educational achievement. We jointly analyse multiple GPS and multiple environmental measures, considering the effect of their interplay in hold-out set prediction. Specifically, we test the joint prediction of 20 well-powered GPS for psychiatric, cognitive and anthropometric traits and 13 proximal and distal measured environments including life events, home environment and SES (see methods for descriptions of all measures). First, we model polygenic scores (henceforth G model) and environmental measures (henceforth E model), separately and jointly (full model), to predict educational achievement in penalized regression models [36] with hold-out set tests of prediction accuracy. Models are tuned using repeated cross-validation in 80% of the sample and tested in the remaining 20% hold-out set. Penalized methods are especially warranted when dealing with multiple correlated predictors as they can overcome problems of multicollinearity and overfitting. To investigate the relative contributions of the employed predictors to the full model, we carry out post-selection estimation [37] of partial regression coefficients, testing independent effects of single GPS and environmental measures. Secondly, we separate direct from mediated effects of the multivariable G and E models on EA and assess rGE defined in terms of the GPS and environmental measures employed. Finally, we assess GxE using a hierarchical group-lasso technique [38] to systematically discover two-way interactions between all GPS and environmental measures, and test their improvement in prediction of EA.
Results
Joint modelling of GPS and environmental effects
In a first step we tested three models for association with EA: all genetic factors (polygenic scores; G model), all environmental factors (measured environments; E model), and a joint model of all factors (full model; G+E). The G+E model achieved the best hold-out sample prediction compared to the G or E models considered separately. The full model predicted 36% of the variance (95% CI = 30.4, 41.6) in EA (Fig 1 panel B, S2 Table), 6% more than the E model (30.1%; 95% CI = 24.3, 35.6; S1 Fig) and up to 18% more compared to the G model alone (18.3%; 95% CI = 12.7, 23.6; S2 Fig). Nested comparisons of the G+E model vs the G and E models separately indicated that the difference in hold-out set prediction accuracy between models (Fig 1 panel D, S2 Table) was significant for both the G+E model vs the E model (median R2 diff = 5.9%; 95% CI = 2.8, 9.1) and the G+E model vs the G model (median R2 diff = 17.7%; 95% CI = 13.2, 22.3). This suggested the presence of genetic effects on EA not mediated via environmental effects, and vice versa of environmental effects not accounted for by the genetic effects. Next, we untangled the specific independent contributions of GPS and measured environments to variation in EA.
Fig 1
Multivariable prediction of educational achievement.
Panel A = repeated 10-fold cross validation in training set, for the environmental (E), multi-polygenic score (G), joint (G+E), and interaction (G*E) prediction models. Panel B = Hold-out set prediction of EA for best models obtained via repeated cross validation in training set. Error bars are 95% bootstrapped confidence intervals. Panel C = G+E model used in hold-out set prediction. Figure shows variables selected via repeated cross-validation in the training set, and relative importance. Panel D = Comparison of prediction accuracy for models tested as bootstrapped R2 difference between nested models in the hold-out set. Distributions represent independent (non-mediated) genetic effects (G+E−E), environmental effects (G+E−G), and G*E effects (G*E–G+E). Note. PGS = polygenic scores, ENV = Environmental measures. ASD = Autism Spectrum Disorder, BIP = Bipolar Disorder, BMI = Body Mass Index, EA3 = Educational Attainment, IQ3 = Intelligence, OCD = Obsessive Compulsive Disorder, PTSD = Post-Traumatic Stress Disorder, SCZ = Schizophrenia.
Multivariable prediction of educational achievement.
Panel A = repeated 10-fold cross validation in training set, for the environmental (E), multi-polygenic score (G), joint (G+E), and interaction (G*E) prediction models. Panel B = Hold-out set prediction of EA for best models obtained via repeated cross validation in training set. Error bars are 95% bootstrapped confidence intervals. Panel C = G+E model used in hold-out set prediction. Figure shows variables selected via repeated cross-validation in the training set, and relative importance. Panel D = Comparison of prediction accuracy for models tested as bootstrapped R2 difference between nested models in the hold-out set. Distributions represent independent (non-mediated) genetic effects (G+E−E), environmental effects (G+E−G), and G*E effects (G*E–G+E). Note. PGS = polygenic scores, ENV = Environmental measures. ASD = Autism Spectrum Disorder, BIP = Bipolar Disorder, BMI = Body Mass Index, EA3 = Educational Attainment, IQ3 = Intelligence, OCD = Obsessive Compulsive Disorder, PTSD = Post-Traumatic Stress Disorder, SCZ = Schizophrenia.
Best-model and coefficient estimation
The best G+E model selected via 10-fold repeated (100 repeats; Fig 1 panel A) cross-validation in the training set included 24 predictors, 14 of which were GPS (blue) while 10 were environments (orange) (Fig 1, panel C). Of these top EA-increasing variables were SES in early life, followed by the GPS for educational attainment (EA3GPS) and the GPS for intelligence (IQ3 GPS), while the top trait decreasing variable was chaos at home at age 12. In terms of coefficient estimation, partial regression coefficients in post-selection inference analyses (Fig 2 and S3 Table) showed that EA3GPS (β = 0.13; 95% CI = 0.09, 0.17; p = 8.45E-7) and IQ3 GPS (β = 0.12; 95% CI = 0.08, 0.15; p = 1.33E-7) remained significant in the model after adjusting for the other predictors. SES was by far the most powerful predictor in the conditional model (β = 0.37; 95% CI = 0.34, 0.40; p = 2.30E-60). Other environmental exposures that remained significant were ‘chaos at home’ at age 12 (β = -0.14; 95% CI = -0.17, -0.12; p = 3.93E-15) and two life events experienced in the past year (all trait decreasing), including ‘moving to a new school’ (β = -0.07; 95% CI -0.10, -0.04; p = 2E-5) and ‘involved with drugs’ (β = -0.06; 95% CI -0.09, -0.03; p = 2E-3). SES, EA3GPS, IQ3 GPS and ‘chaos at home’ were significant in all three models (i.e. naive, hold-out and conditional).
Fig 2
Relative contributions of model selected variables for the G+E model in the prediction of educational achievement.
Figure shows partial regression coefficients, and 95% CIs around estimates. Naive = partial regression coefficients from multiple regression of selected variables in Training set; Hold-out = partial regression coefficients of selected variables in the hold-out set; Conditional = partial regression coefficients of training set for selected variables estimated with a conditional probability from a truncated distribution (see methods section). Note. ASD = Autism Spectrum Disorder, ADHD = Attention-Deficit Hyperactivity Disorder, BIP = Bipolar Disorder, EA3 = Educational Attainment, IQ3 = Intelligence, MDD = Major Depressive Disorder, SWB = Subjective Well-Being, OCD = Obsessive Compulsive Disorder, PTSD = Post-Traumatic Stress Disorder, Risk PC1 = First Principal Component of Risky behaviors, SCZ = Schizophrenia.
Relative contributions of model selected variables for the G+E model in the prediction of educational achievement.
Figure shows partial regression coefficients, and 95% CIs around estimates. Naive = partial regression coefficients from multiple regression of selected variables in Training set; Hold-out = partial regression coefficients of selected variables in the hold-out set; Conditional = partial regression coefficients of training set for selected variables estimated with a conditional probability from a truncated distribution (see methods section). Note. ASD = Autism Spectrum Disorder, ADHD = Attention-Deficit Hyperactivity Disorder, BIP = Bipolar Disorder, EA3 = Educational Attainment, IQ3 = Intelligence, MDD = Major Depressive Disorder, SWB = Subjective Well-Being, OCD = Obsessive Compulsive Disorder, PTSD = Post-Traumatic Stress Disorder, Risk PC1 = First Principal Component of Risky behaviors, SCZ = Schizophrenia.
rGE and mediated environmental vs GPS effects
S2 Table shows prediction model estimates for all models considered, as well as nested comparisons of hold-out set prediction accuracy (R2) for the full model vs. E and the full model vs. G. We tested the correlation between the EA predicted values from the G model (Gea) and the E model (Eea) in the hold-out-set. This was r = 0.38 (95% CIs = 0.30, 0.45), indicating the extent of overlapping information between the G and E models in hold-out set prediction or, in other words, of rGE (as defined by the variables employed) underlying variation in EA. Then we proceeded to test the extent to which G and E effects on EA were reciprocally mediated (see methods). S4 Table shows results of mediation analyses. We found evidence for environmentally mediated genetic effects (indirect path: β = 0.17; bootstrapped 95% CI 0.13, 0.21) and genetically mediated environmental effects (indirect path: β = 0.10; bootstrapped 95% CI 0.06, 0.14). The effects of Gea on EA (β = 0.43; bootstrapped 95% CIs = 0.36, 0.50) were reduced by 40% after introduction of the Eea mediator in the model (β = 0.26; bootstrapped 95% CIs = 0.19, 0.33); these effects can be interpreted as the direct G model contributions to EA not accounted for by the E model. In other words, 40% of G effects on EA were explained by environmental mediation. Similarly, the direct Eea effects on EA (β = 0.55; bootstrapped 95% CIs = 0.50, 0.60) were subject to a reduction of 18% (β = 0.45; bootstrapped 95% CIs = 0.39, 0.51) after introduction of Gea as a mediator in the model, indicating partial genetic mediation of environmental effects (i.e. genetic confounding).
GxE effects and multivariable prediction
We finally tested all possible two-way interactions jointly modelled by means of a hierarchical group lasso procedure using glinternet. Out of the possible 528 two-way interactions between all study variables (i.e. interactions between and within sets of GPS and environmental measures), 32 two-way interactions were detected by the hierarchical group-lasso technique (glinternet, S5 Table), 15 of which were GxE interactions. Fig 3 depicts an interaction network from the trained glinternet model (10-fold cross validation). Hold-out set prediction accuracy was only slightly improved (R2 = 36.4%; 95% CI = 29.7, 41.1) over the joint G and E model (R2 = 36.1%; 95% CI = 30.4, 41.6). We then introduced the 15 GxE interactions found in the full elastic net model (S3 Fig) to test whether they improved the prediction of EA over the full model that had only considered additive effects of GPS and environmental measures. There was no improvement in hold-out set prediction accuracy (R2 = 36.1%; 95% CI = 30.5, 41.8), and the difference in prediction with the G+E model was not significant (median R2 diff = 0.1%; 95% CI = -1.2, 1.3). S2 Table shows fit statistics for the glinternet and elastic net models. S5 Table reports GxE interactions detected by the hierarchical lasso model.
We tested the joint prediction accuracy of sets of multiple environmental measures and polygenic scores in prediction models of educational achievement and considered the effect of their interplay on model performance. Three main findings emerged from our analyses. First, the joint modelling of multiple GPS and related environmental exposures improved the prediction of EA, consistent with theory [39]. Second, paralleling previous quantitative genetic findings, we found consistent evidence of rGE effects underlying variation in EA (rGE = 0.38; 95% CIs = 0.30, 0.48), with a substantial proportion of polygenic score effects mediated by the environmental effects (40%), and evidence for genetic confounding (18%). Lastly, we did not find evidence that GxE effects jointly contributed to the prediction of EA.Our multivariable GPS model alone predicted 18.3% of the variance in EA. Integration of multiple polygenic scores in the same model can be expected to increase as sample size in genome-wide association studies (GWAS) increases [40]. Here we constructed GPS in lassosum [41] based on previous observations that lassosum tends to perform better than more conventional approaches [17, 41] for educationally relevant traits. However, other methods for GPS construction can be expected to yield similar results when considering multivariable GPS penalized approaches, with performance of the relative approaches likely to converge as accuracy of GWAS estimates increases.Previous work [16] showed that the predictive accuracy of EA3 provides unique information beyond correlated demographics and distal control variables such as income and parental educational attainment. Here we extend this observation to multiple polygenic scores within a prediction framework, as well as multiple measured environments, including proximal measures of home environment and life events. Furthermore, we take this a step further by formally testing trait associations of the relative polygenic scores and environmental measures when jointly considered in the same model.Of interest were the relative contributions of the single GPS to the best model selected via repeated cross-validation in the training set. In post-selection inference analyses, IQ3 and EA3 were the only GPS independently associated with variation in EA after adjusting for measured environments and polygenic scores. This indicated that both these GPS contributed unique predictive information beyond other related, proxy environmental predictors (e.g. SES, parental educational attainment), and polygenic scores (e.g. household income GPS). Similarly, we found that several environments were independently predictive of EA. The best predictor was early life SES, a composite of parental educational attainment, employment status and maternal age at first birth. Life events and chaos at home were also significant contributors to the model, with negative independent effects on EA. Polygenic scores, however, improved the prediction of EA on top of the environment with a 20% increase in accuracy (from 30% to 36%). It is noteworthy that EA3 and IQ3 GPS were both significant in post-selection inference models after adjusting for SES, home environment and proximal environmental effects, all of which also tag genetic variance partly overlapping with that captured by the GPS. This suggested that cognitive-relevant GPS independently captured variation beyond environmental variables and variance due to rGE in our model. While this was important to understand the model composition, it should be highlighted that these estimates are dependent on variables included in analyses, and can be expected to change as other variables are considered in the model (see below).A central finding of the current study emerged when we separated direct and indirect effects of the GPS and environmental models by statistically testing for rGE. We found significant G mediation of the prediction of EA by the E model. This is in line with several quantitative genetics findings [42-44]. However, since it would be unreasonable to assume a causal effect of E on G (i.e. E does not change DNA sequence), in the sense employed here G acts as a ‘confounder’–in causal modelling parlance, ‘third variable confounding’–of E effects on EA (E ← G → EA). That is, because our G model is associated with both the E model and EA, it partly induces an association between the E model and EA in addition to the independent effects of E on EA. This rGE effect explained 18% of the E effects on EA.Different types of genetic confounding have been described in detail elsewhere [45].Conversely, we also found evidence of environmental mediation of the G model effects on EA. The E model explained 40% of the GPS model effects on EA. This result is also in line with previous research in quantitative genetics [27–29, 46]. A growing body of evidence points to the rGE conclusion that genetic effects on cognitive trait variation are partly environmentally mediated [25], which is likely to be due to passive rGE. Passive rGE emerges because parents create a family environment that corresponds to their genotypes and, by extension, also correlates with the genotypes of their offspring. As previously described, alternative mechanisms include evocative and active rGE effects. As noted elsewhere [26] these possibilities are not mutually exclusive. However, in order to disentangle these rGE effects, different study designs are needed, for example, looking within families at the effects of maternal and paternal non-transmitted genotypes on child outcomes. Disentangling the different underlying mechanisms to the predicted variance in this regard is an issue for future studies, but out of the scope of the present investigation.Previous work [30] has shown that prediction of educational achievement by EA3GPS consistently decreases within-family, suggesting that passive gene-environment correlation explains part of the predictive power of EA3. Here we show that reciprocal indirect effects between multivariable E and G models explain a substantial proportion of variation of their total effects on EA. These results provide converging evidence with recent research looking at rGE underlying parenting and children educational attainment [27, 47]. Both genetic confounding and environmental mediation are important factors to take into account in the prediction of EA.Lastly, we applied a hierarchical group-lasso model (glinternet) to automatically detect two-way interactions. This model helped us to identify GxE effects that show strong hierarchy, which would have otherwise been difficult to detect due to the great multiple-testing burden relative to the sample size of the present study. Furthermore, since glinternet performs shrinkage and grouping before testing for interaction effects, this enabled discovery of interactions that would have been confounded by strong main effects of correlated predictors. In other words, because the coefficients of main effects have been regularized (that is shrunk, see Methods), their fit is reduced, which facilitates the discovery of interaction effects [38]. However, neither the glinternet model including all discovered pairwise interactions, nor the elastic net model including two-way GxE effects, significantly improved hold-out set prediction over the G+E model. One possible explanation for this finding is that GxE effects are typically very small, and that the trade-off between true effect and variance introduced in the model, signal to noise ratio, was too small. It might be that even if two-way GxE effects were relevant the noise incurred in fitting their coefficients may outweigh the improvement in accuracy that they bring to the model. In this regard we note that in repeated cross-validation in the training set the model performance of both the elastic net based GxE model, and the glinternet model was substantially increased compared to the G+E model. Application of this method in larger datasets, or using different phenotypes with different genetic architectures, might be fruitful for hypothesis-free GxE discovery as well as for prediction.This study must be considered in light of a few limitations. First, our results are subject to the constraint that we performed an apriori selection on variables to be employed in our analyses. For example, we modelled exposures that are typically defined as environmental; however, many other variables can be argued to capture EA relevant environmental influences. In this regard estimates for non-mediated genetic effects for the model presently tested are likely upper-bounds, in the sense that if we were to include more E variables predictive of the outcome EA, the polygenic score contributions independent of E would either stay the same or decrease. Likewise, we included a broad range of polygenic scores that are currently available as the most predictive for cognitive, psychiatric and anthropometric traits. However, polygenic scores predictive power is in part a function of GWAS sample size [40], therefore as more powerful GWAS become available these prediction estimates are expected to increase. This in turn suggests that the contributions of the G model are likely to be on the lower bound compared to future polygenic score work in this area.We note that the environmental variables employed in the models jointly represent only a noisy proxy for the true EA relevant environmental effects, just as the multi-polygenic scores model is a noisy proxy for the true additive genetic predisposition to EA. As such, estimates derived from mediation analyses will imprecisely capture the extent to which E and G models are reciprocally mediated.Finally, we focused here on EA but predictive models of other complex traits are likely to yield different results, because EA shows comparatively great shared environmental influences [30]. This suggests that rGE is likely to be stronger for EA than for other behavioural traits, such as personality traits and social-emotional competencies. Regarding our analytical approach, we focused on GxE interactions that obeyed strong hierarchy as identified by the group lasso technique. Future studies could relax this assumption and include interactions where one of the main effect sizes is not significant, as well as higher order interactions. Finally, although it is a strength of our study that we used measured environmental exposures, we note that methods for inferring GxE without measured environmental data are emerging that have reported GxE for some complex traits [48]. The extent to which these effects are systematic, stable, and generalizable to EA remains to be determined.As large multidimensional biobank datasets become increasingly available, the integration of multi-omics data with multiple environmental measures will become more common in prediction modelling. Here, we provide an indication of the effects of integrating multiple GPS and environmental measures in prediction models of EA and the effect that their interplay has on prediction accuracy in a population cohort of adolescents. In conclusion, we found consistent evidence for rGE in prediction models of EA that systematically tested the interplay between polygenic scores and measured environments within a hypothesis-free multivariable prediction framework. When integrating multiple GPS and environmental measures, their interplay must be taken into account. Separate effects of environmental and polygenic scores cannot just be assumed to add up because pervasive rGE affects prediction.
Material and methods
Sample
We test our models using data from 16 year olds from the UK Twin Early Development Study [TEDS; 49], a large longitudinal study involving 16,810 pairs of twins born in England and Wales between 1994–1996, with DNA data available for 10,346 individuals (including 3,320 dizygotic twin pairs and 7,026 unrelated individuals, all of European ancestry). Genotypes for the 10,346 individuals were processed with stringent quality control procedures followed by SNP imputation using the Haplotype Reference Consortium (release 1.1) reference panels. Current analyses were limited to the genotyped sample, and we retained only one individual at random from sibling pairs. Following imputation, we excluded variants with minor allele frequency <0.5%, Hardy-Weinberg equilibrium p-values of <1 × 10−5. To ease computational demands, we selected variants with an info score of 1, resulting in 515,000 SNPs used for analysis (see S1 Methods for a full description of quality control and imputation procedures).
Ethics statement
Ethical approval for TEDS has been provided by the King’s College London Ethics Committee (reference: PNM/09/10–104). Written parental consent was obtained before data collection.
Measures
Dependent measure
Educational achievement was measured as the self-reported mean grade of three core subjects (English, math and science) scored by the individuals at age 16 in their standardized UK General Certificate of Secondary Education (GCSE) exams.EA was operationalized as the mean grade of the three compulsory subjects, with results coded from 4 (G, or lowest grade) to 11 (A+, or highest grade). These self-report measures are highly replicable and show high genetic and phenotypic correlations with teacher scores [50]. The variable distribution was slightly negatively skewed (similar to the national average) and subject to a rank based inverse normal transformation to approximate a normal distribution.
Environmental measures
Socio economic status: SES at recruitment (mean children age = 18 months) was calculated as the mean composite score of five standardized measures including mother and father qualification levels ranging from 1 = ‘no qualifications’ to 8 = ‘postgraduate qualification’, mother and father employment status [51], and mother’s age at birth of first child.Chaos at home: as a measure of home environment a shortened version of the Confusion, Hubbub and Order Scale [52] was used to measure children’s perception of chaos in the family environment at age 12. Children rated the extent to which they agree (range: ‘not true’, ‘quite true’ or ‘very true’) to six items: ‘I have a regular bedtime routine’ (reversed coded), ‘You can’t hear yourself think in our home’, ‘It’s a real zoo in our home’, ‘We are usually able to stay on top of things’ (reversed coded), ‘There is usually a television turned on somewhere in our home’ and ‘The atmosphere in our house is calm’ (reverse coded). The Chaos score was computed as the mean of the rated items.Life events: Self-reported life events experienced in the past year were measured (at age 16) using a shortened version of the Coddington life events [53]. Individuals had to report on 20 items that might have happened in the past year, by responding yes (coded as 1) if the event had happened or no (coded as 0) if it didn’t happen. Items included stochastic, proximal events such as “death of a close friend or relative”, “being hospitalized”, as well as family-wide events e.g. “loss of a parent job”, “decrease in parental income”. When considering prediction of educational achievement, educationally relevant items were removed from the models (i.e. “failing exam” and “outstanding achievement”). Items being endorsed by fewer than 100 people were discarded from analyses. A total of 11 life events were retained in analyses. All items were considered separately in prediction models (i.e. they were not aggregated in a scale). S1 Table reports descriptive statistics for variables employed in this study, separately by training and hold-out sets.
Genome-wide polygenic scores (GPS)
GPS for 20 cognitive, anthropometric and psychopathological traits were constructed using Lassosum [41]. Lassosum is a penalized regression approach applied to GWAS summary statistics. In Lassosum we try to minimize the following loss function:Where y is a vector of the phenotype, X is the matrix of genotypes, such that XrefT Xref is a matrix of correlations between SNPs, the LD matrix. r denotes the correlation between SNPs and the phenotype, r = XTy. The subscript ref in XrefT Xref indicates that SNPs employed to obtain the LD matrix (based on a reference panel, see below) will generally not correspond to SNPs used to infer the correlation with the phenotype.In the equation λ controls the L1 penalty (L1 norm, [54]). The notation ||β||1 describes the L1 norm of a coefficient vector β, defined as ||β||1 = ∑|β|, while s is another tuning parameter controlling the L2 penalty (|β|2, the sum of the squared betas). Here s has the additional constraint of being between 0 and 1. When λ = 0 and s = 1 the problem becomes unconstrained.Tuning parameters, λ and s, are chosen in the validation step (this is akin to optimization that can be performed in p-value thresholding methods). We used our training set to perform parameter tuning optimizing (with respect to R2) polygenic scores against EA. LD was accounted for via a reference panel, here the same as the training set sample, and estimation of LD blocks was performed using LD regions defined in [55].We employed the most powerful and publicly available GWAS summary statistics for cognitive, psychopathology and anthropometric traits. We created cognitive and educationally relevant polygenic scores for educational attainment [16], intelligence [56], and income [57]. We also created polygenic scores for mental health-related traits: autism spectrum disorder [58], major depressive disorder [MDD; 59], bipolar disorder [BIP; 60], schizophrenia [SCZ; 61], attention deficit hyperactivity disorder [ADHD; 62], obsessive compulsive disorder [OCD; 63], anorexia nervosa [AN; 64], post-traumatic stress disorder [PTSD; 65], broad depression [66], mood swings [67], subjective well-being [68], neuroticism [69], irritability [67], insomnia [70], and risk taking [71]. Finally, we created polygenic scores for height and BMI [72]. S6 Table reports information on these summary statistics, while S7 Table reports parameter tunings for the lassosum GPS.
Analyses
All variables were first regressed on age, sex, 10 genetic principal components and genotyping chip. The obtained standardized residuals were used in all subsequent analyses.
Penalised regression
We fit elastic net regularization [73] models for EA. Elastic Net minimizes the residual sum of squares (RSS) subject to the L1 penalty, consisting of the sum of the absolute coefficients, which introduce sparsity allowing for parameters selection, and the L2 penalty, consisting of the sum of the squared coefficients, which allows for parameters shrinkage [73].Elastic net tries to minimise the following loss function:
where ||y–X’β||2 is the residual sum of squares, |β|2 is the sum of the squared betas (the L2 penalty), and |β|1 is the sum of the absolute betas (the L1 penalty).Here X’ is an nxp (‘n’ observations and ‘p’ predictors) matrix of polygenic scores, environmental predictors or a combination of both (see below). α determines the mixing of penalties, where the first parameter introduces sparsity while the second shrinks correlated predictors towards each other. λ is a tuning parameter that control the effect of the penalty terms over the regression coefficients. When α = 1 the solution is equivalent to a LASSO regression, while when α = 0 the solutions is equivalent to a Ridge regression. For every α multiple λ exists, and the optimal combination of tuning parameters is determined by cross-validation, here a 10-fold cross-validation repeated 100 times. For every model tested we split the sample into an independent training set (80%) and a hold-out set (20%). In the training set we perform 10-fold repeated cross-validation to select the model that minimises the Root Mean Square Error (RMSE)–that is the tuning parameter for which the cross-validation error is the smallest. The model performance is then assessed by the variance explained (R2) in the hold-out test set. The hold-out set R2 was calculated as 1- (SSE = sum of squared errors, SST = sum of square total).
Bootstrapping
For every model tested we sampled with replacement from the data (1000 times) to calculate bootstrapped confidence intervals for the hold-out set prediction accuracy (R2). Rows of data for resampling included the phenotype under study and the predictors according to the model tested: either polygenic scores, environmental predictors or a combination of both. For each bootstrap sample drawn we calculated the hold-out set R2, and we took the difference in R2 between nested models. This procedure yielded a distribution of R2 for each model tested and a distribution of R2 differences (ΔR2) for each pairwise comparison. We then calculated 95% confidence levels as the 2.5th and 97.5th percentiles of these distributions. For nested comparisons, if the interval didn’t contain 0 we concluded that the pairwise model ΔR2 was significantly different from 0 with a α level of .05.
Post selection inference
For every model tested we conducted statistical inference of models coefficients after selection of most informative predictors performed by Elastic Net, that is effect sizes, p-values and confidence levels around the prediction estimates.Post-selection inference [37] refers to the practice of attempting to accurately estimate prediction coefficients after a model selection has been performed. If we fit the optimal model’s selected predictors in a multiple regression model in the training set (that is where the selection has been performed) our confidence in the estimates will tend to be over-optimistic. On the other hand, estimation of these parameters in a hold-out set would not be subject to this problem. The hold-out set, however, will typically be smaller than the training set, leading to wider confidence intervals. In addition, the results will be dependent on the random split (80–20) performed. A third way is to calculate P-values conditional to the selection that has been made in the training set. Briefly, after selection is performed, accurate estimation of a given partial regression coefficient can be approximated by a truncated normal distribution:With mean β, variance τ2 and boundaries of the truncated normal distribution (TN) ‘a’ and ‘b’ given by the data and the selection procedure, in this case the predictors, the active set (the variables with non 0 coefficients selected by our model) and λ [37]. We refer elsewhere to a thorough discussion of the topic [74], with a focus on lasso like approaches. Here we compare results from the three procedures: the ‘naive’ estimation of partial regression coefficients in the training set, estimation of coefficients in the hold-out set, and the conditional estimation of p-values performed using the R package ‘SelecitveInference’ [75].
rGE
We quantified rGE in two ways. First, the hold-out set predicted EA values from the GPS (henceforth Gea) and environmental (henceforth Eea) models can be tested for correlation. In this sense the covariance between these variables would be an indication of overlapping information between E and G underlying EA, i.e. rG,E = cor(Gea,Eea).Second, another way to quantify rGE is by modelling E and G effects in a mediation model (Fig 4), considering the indirect effects of G on EA via E, and vice versa the indirect effects of E on EA via G. We used the predicted EA values from the GPS and environmental models (i.e. Gea and Eea) to test mediation models in the hold-out set. We fit a structural equation model (SEM) in ‘lavaan’ [76] to test whether and to what extent E and G effects on EA were reciprocally mediated. Panel A (Fig 4) is a schematic representation of a mediation model, where βC is the effect of a predictor X on an outcome Y, βa the effect of X on the mediator (M), and βb the effect of M on Y after adjusting for X. βC’ corresponds to the effects of the predictor on the outcome when controlling for the mediator (i.e. when the full equation is estimated). If the effects are reduced (partial mediation) or are not different from 0 (full mediation) then there is evidence for mediation. We quantify the proportion of the mediated effects as (βC- βC’) / βC and test for significance of the indirect path using bootstrapping (with 1000 repetitions).
Fig 4
Panel A = schematic representation of mediation analysis; βC = effect of a predictor X on an outcome Y; βa = effect of X on a mediator (M); βb = effect of M on Y after adjusting for X; βC’ = effect of X on Y after adjusting for M. Panel B = Directed acyclic graph (DAG) showing Eea mediated effects of Gea on EA in the hold out-set; βge = causal path between Gea and Eea equivalent to rG,E; βeEA = direct independent Eea effects on EA; β = total Gea effects on EA. Panel C = DAG showing Gea mediated effects on EA (genetic confounding, see methods and discussion); βeg = causal path between Eea and Gea equivalent to rG,E; βgEA = direct independent Gea effects on EA; β = total Eea effects on EA. Note. Blue paths represent G model effects, yellow paths represent E model effects.
Fig 4 represent direct and indirect effects of the G model effects on EA mediated by E (panel B), and of the E model effects on EA mediated by G (panel C). While panel B represents a causal model where we estimate the environmentally mediated G effects on EA, panel C is a statistical abstraction since it would be unreasonable to assume a causal relationship of E on G. Here we model G as mediator to estimate the third variable confounding effects underlying the relationship between E and EA, as mediating and confounding effects have been shown to be equivalent in a linear context [77]. In other words, confounding and mediation effects are statistically equivalent, such that they can both be estimated by mediation analysis, but conceptually distinct (77).Panel A = schematic representation of mediation analysis; βC = effect of a predictor X on an outcome Y; βa = effect of X on a mediator (M); βb = effect of M on Y after adjusting for X; βC’ = effect of X on Y after adjusting for M. Panel B = Directed acyclic graph (DAG) showing Eea mediated effects of Gea on EA in the hold out-set; βge = causal path between Gea and Eea equivalent to rG,E; βeEA = direct independent Eea effects on EA; β = total Gea effects on EA. Panel C = DAG showing Gea mediated effects on EA (genetic confounding, see methods and discussion); βeg = causal path between Eea and Gea equivalent to rG,E; βgEA = direct independent Gea effects on EA; β = total Eea effects on EA. Note. Blue paths represent G model effects, yellow paths represent E model effects.
GxE
After fitting the joint GPS and environmental models, we apply a hierarchical lasso procedure to automatically search the feature space for interactions, and retrain our models introducing GxE interactions. With 33 predictors there is a total of 33(33–1)/2 = 528 possible 2-ways interactions. Testing all models separately would imply a multiple testing burden (e.g. bonferroni correction .05/528 = 9E-5), in addition to the expected low signal to noise ratio for GxE effects. Here we employ a hierarchical group lasso approach to automatically search for two-way interactions, implemented in the R package ‘glinternet’ [38] (group-lasso interaction network). Glinternet leverages group lasso, an extension of LASSO, to perform variable selection on groups of variables, dropping or retaining them in the model at the same time, to select interactions. As noted above, the L1 regularization produces sparsity. Glinternet uses a group lasso for the variables and variable interactions, which introduces a strong hierarchy: an interaction between two variables can only be picked by the model if both variables are also selected as main effects. That is, interactions between two predictors are not considered unless both predictors have non-zero coefficients in the model. Once two-way interactions obeying strong hierarchy were identified, we selected GxE interactions (i.e. GPS that interact with environmental variables) and reintroduced them in our best elastic net models to test whether the hold-out set prediction accuracy improved beyond the full (E+G) prediction model.
Descriptive statistics.
(XLSX)Click here for additional data file.
Training vs hold-out set fit indices, and nested comparisons.
(XLSX)Click here for additional data file.
Statistical inference.
(XLSX)Click here for additional data file.
Mediation models, bootstrapped estimates and 95% Confidence Intervals.
(XLSX)Click here for additional data file.
List of interactions identified through Glinternet.
(XLSX)Click here for additional data file.
GWAS Summary statistics.
(XLSX)Click here for additional data file.
Parameter tuning for lassosum GPS.
(XLSX)Click here for additional data file.
Quality control and genotyping protocol.
(DOCX)Click here for additional data file.
Polygenic score (G) model used in hold-out set prediction.
Variables importance for the best G model selected via repeated cross-validation in the training set. Note. ASD = Autism Spectrum Disorder, ADHD = Attention-Deficit Hyperactivity Disorder, BIP = Bipolar Disorder, EA3 = educational attainment, IQ3 = intelligence, MDD = Major Depressive Disorder, SWB = Subjective Well-Being, OCD = Obsessive Compulsive Disorder, PTSD = Post-Traumatic Stress Disorder, SCZ = Schizophrenia.(TIF)Click here for additional data file.
Environmental predictors (E) model used in hold-out set prediction.
Variables importance for the best E model selected via repeated cross-validation in the training set.(TIF)Click here for additional data file.
G*E model used in hold-out set prediction.
Variables importance for the best G*E model selected via repeated cross-validation in the training set. Note. For interactions the first name refers to polygenic scores, the second name refers to environmental predictors. ASD = Autism Spectrum Disorder, ADHD = Attention-Deficit Hyperactivity Disorder, BIP = Bipolar Disorder, EA3 = educational attainment, IQ3 = intelligence, MDD = Major Depressive Disorder, SWB = Subjective Well-Being, OCD = Obsessive Compulsive Disorder, PTSD = Post-Traumatic Stress Disorder, SCZ = Schizophrenia.(TIF)Click here for additional data file.25 Jun 2020Dear Dr Allegrini,Thank you very much for submitting your Research Article entitled 'Multivariable G-E interplay in the prediction of educational achievement' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.In addition, PLOS Genetics policy requires that data be made available to other researchers, and that this availability not be at the sole discretion of authors. Therefore, we would like to have you clarify data sharing prior to a final decision on your manuscript..Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.[LINK]We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.Yours sincerely,Chris Cotsapas, PhDAssociate EditorPLOS GeneticsScott WilliamsSection Editor: Natural VariationPLOS GeneticsReviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: Uploaded as an attachmentReviewer #2: See attachment.Reviewer #3: In this paper, the authors consider the joint role of genome-wide polygenic scores (GPSs) and environmental variables at predicting educational achievement. While it's known that such measures are known to be individually good predictors of educational variables, this study quantifies the degree of overlapping signal between a set of genetic and environmental variables. At a high level, they do this using regularized prediction models and mediation-style analyses. They find that 40% of the predictive power of GPSs and 18% of the predictive power of environmental variables is due to the correlation between these variables. They also test whether interactions of environmental and genetic variables add to the predictive power of joint genetic-environmental predictive models, and find no evidence that interactions of the variables they considered contribute significantly to multivariable prediction.This paper represents an impressive effort in a rich data set. There is a lot of interesting information presented. Also the paper was easy to read, the analyses were clearly described in a way that I believe I would be able to replicate the analyses, and the tables and figures had clear and comprehensive captions. That said, upon reading the paper, I found myself with a lot of questions about what the major take-aways are for this project.Major concerns:1) How does this paper fit relative to other papers that do similar work? The authors make several strong statements about the novelty of their work, but there are a number of papers that ask very closely related questions. For example, Selzam et al (2019) compare the within- and between-family predictive power of GPSs for educational achievement, which would represent the passive rGE that the authors describe. Also, Lee et al. (2018) contains an analysis quite similar to this one looking at the incremental R2 of an education GPS at predicting educational attainment above several environmental variables. While educational attainment and educational achievement are different variables, they are very highly genetically correlated. It would be helpful if the authors could explicitly highlight what we are learning from this paper relative to related existing work.2) I had a difficult time interpreting their rGE results for a number of reasons.a) How does the sparse variable selection procedure affect the authors' estimates? As far as I could tell, the authors selected their genetic and environmental variables using an elastic net procedure in the same step. If there is strong rGE (as the authors find) some of the genetic variables that are highly correlated with the environmental variables may be selected in the place of the environmental variables in the penalization process, even if they represent independent signals. Likewise, some environmental variables will be selected in the place of the genetic variables. Since you are removing variables that are correlated with each other, this would have an effect on the subsequent rGE analysis. It may be possible to address this concern by selecting the genetic and environmental variables separately.b) How appropriate is it to estimate the mediation analysis in both directions? In order to interpret the estimates of how much the environmental variables mediate the genetic variables, you need to assume the model in Figure 4B. To estimate the amount of genetic confounding, you need to assume the model described in line 324 of the manuscript. Since both of these models can't be simultaneously true, I believe it can't be the case that the two estimates in lines 35 to 39 are both true.c) How do errors in variables affect the interpretation of these analyses? For example, the 'chaos at home' variable is likely a noisy proxy for the true amount of chaos at home. This means that even if 100% of the genetic signal were mediated through this variable, the GPS would still remain predictive after including that variable in the regression. Similarly, the polygenic scores may be thought of as noisy measures of the true additive genetic component. If that's the case, then a mediation analysis will overestimate the amount that the environmental variables mediate the predictive power of the genetic factor.Minor concern:3) The notation used the methods section is confusing. The authors often use the same variables to represent different things. For example, in the equation on line 463, the authors use 'r' to denote both a vector covariances and also as an indicator that the genotypes come from a reference sample. Also the variables and coefficients in line 463 correspond to different variables and parameters in line 505 despite having the same names.References:Lee, J. J., Wedow, R., Okbay, A., Kong, E., Maghzian, O., Zacher, M., ... & Fontana, M. A. (2018). Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment. Nature Genetics, 50(8), 1112.Selzam, S., Ritchie, S. J., Pingault, J. B., Reynolds, C. A., O’Reilly, P. F., & Plomin, R. (2019). Comparing within-and between-family polygenic score prediction. The American Journal of Human Genetics, 105(2), 351-363.**********Have all data underlying the figures and results presented in the manuscript been provided?Large-scale datasets should be made available via a public repository as described in the PLOS Genetics
data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.Reviewer #1: YesReviewer #2: No: All data underlying the figures are available in the supplement. Some restrictions will apply regarding data access. Data used for the submission may be made available on request to the Twins Early Development Study (TEDS), through their data access mechanism (see www.teds.ac.uk/researchers/teds-data-access-policy).Reviewer #3: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: NoSubmitted filename: RC_PGENETICS-D-20-00796.docxClick here for additional data file.Submitted filename: Review PLOS Genetics June2020.docxClick here for additional data file.4 Aug 2020Submitted filename: response_reviewer_comments_31_07_2020.docxClick here for additional data file.15 Sep 2020Dear Dr Allegrini,We are pleased to inform you that your manuscript entitled "Multivariable G-E interplay in the prediction of educational achievement" has been editorially accepted for publication in PLOS Genetics, subject to the proposed waiver of data access restrictions we discussed. Congratulations!Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!Yours sincerely,Chris Cotsapas, PhDAssociate EditorPLOS GeneticsScott WilliamsSection Editor: Natural VariationPLOS Geneticswww.plosgenetics.orgTwitter: @PLOSGenetics----------------------------------------------------Comments from the reviewers (if applicable):Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: The authors addressed all points that I am concerned.Reviewer #2: I agree with the authors there are good arguments for such a composite measures, as the SES and chaos mesures applied in the manuscript. They also do appear to predict well.I do not agree that life events often represent stochastic events unrelated to each other. They tend to correlate and have similar effect on e.g. psychopathology, I guess that could be the case for effect on EA as well, which is in line with the authors findings. Further, some of these life event variables are probably rather rare, and therefore not powerful in separate analyses.In the G, E model comparisons it makes no difference, but in figure 1C and figure 2, finding significant effect of the composite measures and not of the individual life events is not surprising?For choices of PRS’s applying the most powerful does make sense, implying this may well change in the coming years. How was power assessed?I do acknowledge there is no chance adding all the variables that capture EA relevant influences, therefore of course it should be as clear as possible how you choice your variables, and what the limitation is according to that choice.**********Have all data underlying the figures and results presented in the manuscript been provided?Large-scale datasets should be made available via a public repository as described in the PLOS Genetics
data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.Reviewer #1: YesReviewer #2: No: there are restriction on access, due to confidentiality for participants**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: No----------------------------------------------------Data DepositionIf you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly:http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-00796R1More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.----------------------------------------------------Press QueriesIf you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.28 Oct 2020PGENETICS-D-20-00796R1Multivariable G-E interplay in the prediction of educational achievementDear Dr Allegrini,We are pleased to inform you that your manuscript entitled "Multivariable G-E interplay in the prediction of educational achievement" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!With kind regards,Matt LylesPLOS GeneticsOn behalf of:The PLOS Genetics TeamCarlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdomplosgenetics@plos.org | +44 (0) 1223-442823plosgenetics.org | Twitter: @PLOSGenetics
Authors: Augustine Kong; Gudmar Thorleifsson; Michael L Frigge; Bjarni J Vilhjalmsson; Alexander I Young; Thorgeir E Thorgeirsson; Stefania Benonisdottir; Asmundur Oddsson; Bjarni V Halldorsson; Gisli Masson; Daniel F Gudbjartsson; Agnar Helgason; Gyda Bjornsdottir; Unnur Thorsteinsdottir; Kari Stefansson Journal: Science Date: 2018-01-26 Impact factor: 47.728
Authors: Henrik Dobewall; Kateryna Savelieva; Ilkka Seppälä; Ariel Knafo-Noam; Christian Hakulinen; Marko Elovainio; Liisa Keltikangas-Järvinen; Laura Pulkki-Råback; Olli T Raitakari; Terho Lehtimäki; Mirka Hintsanen Journal: J Child Psychol Psychiatry Date: 2018-10-25 Impact factor: 8.982
Authors: Peter M Visscher; Naomi R Wray; Qian Zhang; Pamela Sklar; Mark I McCarthy; Matthew A Brown; Jian Yang Journal: Am J Hum Genet Date: 2017-07-06 Impact factor: 11.025
Authors: Jakob Grove; Stephan Ripke; Thomas D Als; Manuel Mattheisen; Raymond K Walters; Hyejung Won; Jonatan Pallesen; Esben Agerbo; Ole A Andreassen; Richard Anney; Swapnil Awashti; Rich Belliveau; Francesco Bettella; Joseph D Buxbaum; Jonas Bybjerg-Grauholm; Marie Bækvad-Hansen; Felecia Cerrato; Kimberly Chambert; Jane H Christensen; Claire Churchhouse; Karin Dellenvall; Ditte Demontis; Silvia De Rubeis; Bernie Devlin; Srdjan Djurovic; Ashley L Dumont; Jacqueline I Goldstein; Christine S Hansen; Mads Engel Hauberg; Mads V Hollegaard; Sigrun Hope; Daniel P Howrigan; Hailiang Huang; Christina M Hultman; Lambertus Klei; Julian Maller; Joanna Martin; Alicia R Martin; Jennifer L Moran; Mette Nyegaard; Terje Nærland; Duncan S Palmer; Aarno Palotie; Carsten Bøcker Pedersen; Marianne Giørtz Pedersen; Timothy dPoterba; Jesper Buchhave Poulsen; Beate St Pourcain; Per Qvist; Karola Rehnström; Abraham Reichenberg; Jennifer Reichert; Elise B Robinson; Kathryn Roeder; Panos Roussos; Evald Saemundsen; Sven Sandin; F Kyle Satterstrom; George Davey Smith; Hreinn Stefansson; Stacy Steinberg; Christine R Stevens; Patrick F Sullivan; Patrick Turley; G Bragi Walters; Xinyi Xu; Kari Stefansson; Daniel H Geschwind; Merete Nordentoft; David M Hougaard; Thomas Werge; Ole Mors; Preben Bo Mortensen; Benjamin M Neale; Mark J Daly; Anders D Børglum Journal: Nat Genet Date: 2019-02-25 Impact factor: 38.330
Authors: E Krapohl; J Euesden; D Zabaneh; J-B Pingault; K Rimfeld; S von Stumm; P S Dale; G Breen; P F O'Reilly; R Plomin Journal: Mol Psychiatry Date: 2015-08-25 Impact factor: 15.992
Authors: N Mullins; R A Power; H L Fisher; K B Hanscombe; J Euesden; R Iniesta; D F Levinson; M M Weissman; J B Potash; J Shi; R Uher; S Cohen-Woods; M Rivera; L Jones; I Jones; N Craddock; M J Owen; A Korszun; I W Craig; A E Farmer; P McGuffin; G Breen; C M Lewis Journal: Psychol Med Date: 2015-11-03 Impact factor: 7.723
Authors: Rosa Cheesman; Avina Hunjan; Jonathan R I Coleman; Yasmin Ahmadzadeh; Robert Plomin; Tom A McAdams; Thalia C Eley; Gerome Breen Journal: Psychol Sci Date: 2020-04-17
Authors: Margot P van de Weijer; Dirk H M Pelt; Catharina E M van Beijsterveldt; Gonneke Willemsen; Meike Bartels Journal: Eur Child Adolesc Psychiatry Date: 2021-05-24 Impact factor: 5.349