Literature DB >> 35023833

Epigenetic scores for the circulating proteome as tools for disease prediction.

Danni A Gadd¹, Robert F Hillary¹, Daniel L McCartney¹, Shaza B Zaghlool^2,3, Anna J Stevenson¹, Yipeng Cheng¹, Chloe Fawns-Ritchie^1,4, Cliff Nangle¹, Archie Campbell¹, Robin Flaig¹, Sarah E Harris^4,5, Rosie M Walker⁶, Liu Shi⁷, Elliot M Tucker-Drob^8,9, Christian Gieger^10,11,12,13, Annette Peters^11,12,13, Melanie Waldenberger^10,11,12, Johannes Graumann^14,15, Allan F McRae¹⁶, Ian J Deary^4,5, David J Porteous¹, Caroline Hayward^1,17, Peter M Visscher¹⁶, Simon R Cox^4,5, Kathryn L Evans¹, Andrew M McIntosh^1,18, Karsten Suhre², Riccardo E Marioni¹.

Abstract

Protein biomarkers have been identified across many age-related morbidities. However, characterising epigenetic influences could further inform disease predictions. Here, we leverage epigenome-wide data to study links between the DNA methylation (DNAm) signatures of the circulating proteome and incident diseases. Using data from four cohorts, we trained and tested epigenetic scores (EpiScores) for 953 plasma proteins, identifying 109 scores that explained between 1% and 58% of the variance in protein levels after adjusting for known protein quantitative trait loci (pQTL) genetic effects. By projecting these EpiScores into an independent sample (Generation Scotland; n = 9537) and relating them to incident morbidities over a follow-up of 14 years, we uncovered 137 EpiScore-disease associations. These associations were largely independent of immune cell proportions, common lifestyle and health factors, and biological aging. Notably, we found that our diabetes-associated EpiScores highlighted previous top biomarker associations from proteome-wide assessments of diabetes. These EpiScores for protein levels can therefore be a valuable resource for disease prediction and risk stratification.

Entities: Chemical

Keywords: aging; biomarker; epidemiology; epigenetic; genetics; genomics; global health; human; morbiditiy; prediction; proteomics

Mesh：

Substances：
Biomarkers
Proteome

Year: 2022 PMID： 35023833 PMCID： PMC8880990 DOI： 10.7554/eLife.71802

Source DB: PubMed Journal: Elife ISSN： 2050-084X Impact factor: 8.140

Introduction

Chronic morbidities place longstanding burdens on our health as we age. Stratifying an individual’s risk prior to symptom presentation is therefore critical (NHS England, 2016). Though complex morbidities are partially driven by genetic factors (Fuchsberger et al., 2016; Yao et al., 2018), epigenetic modifications have also been associated with disease (Lord and Cruchaga, 2014). DNA methylation (DNAm) encodes information on the epigenetic landscape of an individual and blood-based DNAm signatures have been found to predict all-cause mortality and disease onset, providing strong evidence to suggest that methylation is an important measure of disease risk (Hillary et al., 2020a; Lu et al., 2019; Zhang et al., 2017). DNAm can regulate gene transcription (Lea et al., 2018), and epigenetic differences can be reflected in the variability of the proteome (Hillary et al., 2019; Hillary et al., 2020b; Zaghlool et al., 2020). Low-grade inflammation, which is thought to exacerbate many age-related morbidities, is particularly well captured through DNAm studies of plasma protein levels (Zaghlool et al., 2020). As proteins are the primary effectors of disease, connecting the epigenome, proteome, and time to disease onset may help to resolve predictive biological signatures. Epigenetic predictors have utilised DNAm from the blood to estimate a person’s ‘biological age’ (Lu et al., 2019), measure their exposure to lifestyle and environmental factors (McCartney et al., 2018c; McCartney et al., 2018a; Peters et al., 2021), and predict circulating levels of inflammatory proteins (Stevenson et al., 2020; Stevenson et al., 2021). A leading epigenetic predictor of biological aging, the GrimAge epigenetic clock incorporates methylation scores for seven proteins along with smoking and chronological age, and is associated with numerous incident disease outcomes independently of smoking (Hillary et al., 2020a; Lu et al., 2019). This suggests there is predictive value gained in utilising DNAm scores relevant to protein levels as intermediaries for predictions. Methylation scores also point towards the pathways that may act on health beyond the protein biomarker that they are trained on. A portfolio of methylation scores for proteins across the circulating proteome could therefore aid in the prediction of disease and offer a different, but additive signal beyond methylation or protein data alone. Generation of an extensive range of epigenetic scores for protein levels has not been attempted to date. The capability of specific protein scores to predict a range of morbidities has also not been tested. However, DNAm scores for interleukin-6 (IL-6) and C-reactive protein (CRP) have been found to associate with a range of phenotypes independently of measured protein levels, show more stable longitudinal trajectories than repeated protein measurements, and, in some cases, outperform blood-based proteomic associations with brain morphology (Stevenson et al., 2021; Conole et al., 2021). This is likely due to DNAm representing the accumulation of more sustained effects over a longer period of time than protein measurements, which have often been shown to be variable in their levels when measured at multiple time points (Koenig et al., 2003; Liu et al., 2015; Moldoveanu et al., 2000; Shah et al., 2014). DNAm scores for proteins could therefore be used to alert clinicians to individuals with high-risk biological signatures, many years prior to disease onset. Here, we report a comprehensive association study of blood-based DNAm with proteomics and disease (Figure 1). We trained epigenetic scores – referred to as EpiScores – for 953 plasma proteins (with sample size ranging from 706 to 944 individuals) and validated them using two independent cohorts with 778 and 162 participants. We regressed out known genetic pQTL effects from the protein levels prior to generating the EpiScores to preclude the signatures being driven by common SNP data that are invariant across the lifespan. We then examined whether the most robust predictors (n = 109 EpiScores) associated with the incidence of 12 major morbidities (Table 1), over a follow-up period of up to 14 years in the Generation Scotland cohort (n = 9537). We also tested for associations between EpiScore levels and COVID-19 disease outcomes. We regressed out the effects of age on protein levels prior to training and testing; age was also included as a covariate in disease prediction models. We controlled for common risk factors for disease and assessed the capacity of EpiScores to identify previously reported protein-disease associations.

Figure 1.

EpiScores for plasma proteins as tools for disease prediction study design.

DNA methylation scores were trained on 953 circulating plasma protein levels in the KORA and LBC1936 cohorts. There were 109 EpiScores selected based on performance (r > 0.1, p < 0.05) in independent test sets. The selected EpiScores were projected into Generation Scotland, a cohort that has extensive data linkage to GP and hospital records. We tested whether levels of each EpiScore at baseline could predict the onset of 12 leading causes of morbidity, over a follow-up period of up to 14 years; 137 EpiScore-disease associations were identified, for 11 morbidities. We then assessed whether EpiScore associations reflected protein associations for diabetes, which is a trait that has been well characterised using SOMAscan protein measurements. Of the 29 SOMAscan-derived EpiScore-diabetes associations, 23 highlighted previously reported protein-diabetes associations.

Table 1.

Incident morbidities in the Generation Scotland cohort.

	Basic model			Fully adjusted model
Morbidity	N cases	N controls	Years to event(mean, SD)	N cases	N controls	Years to event(mean, SD)
Rheumatoid arthritis	65	9281	6.1 (3.5)	54	7736	6.4 (3.3)
Alzheimer’s dementia	69	3764	8.3 (2.7)	52	3137	8.2 (2.7)
Bowel cancer	77	9398	6.4 (3.2)	65	7817	6.5 (3.2)
Depression	101	8306	3.9 (3.3)	80	6976	3.7 (3.3)
Breast cancer	129	5355	6 (3.4)	110	4401	5.9 (3.4)
Lung cancer	201	9265	5.2 (3.1)	172	7705	5.1 (3.1)
Inflammatory bowel disease	203	9083	5 (3.5)	163	7567	4.9 (3.5)
Stroke	317	9023	6.5 (3.4)	248	7546	6.4 (3.5)
COPD	346	8939	6.2 (3.4)	273	7459	6.1 (3.4)
Ischaemic heart disease	395	8646	5.8 (3.3)	309	7248	5.9 (3.3)
Diabetes	428	8756	5.7 (3.4)	322	7331	5.7 (3.4)
Pain (back/neck)	1494	5341	5.2 (3.5)	1221	4475	5.3 (3.5)

COPD: chronic obstructive pulmonary disease.

EpiScores for plasma proteins as tools for disease prediction study design.

Incident morbidities in the Generation Scotland cohort.

Counts are provided for the number of cases and controls for each incident trait in the basic and fully adjusted Cox models run in the Generation Scotland cohort (n = 9537). Mean time-to-event is summarised in years for each phenotype. Alzheimer’s dementia cases and controls were restricted to those older than 65 years. Breast cancer cases and controls were restricted to females. COPD: chronic obstructive pulmonary disease. A video summarising this study and detailing how the 109 EpiScores can be calculated in cohorts with DNAm data is available at: https://youtu.be/xKDWg0Wzvrg. Our MethylDetectR shiny app (Hillary and Marioni, 2020) has CpG weights for the 109 EpiScores integrated such that it automates the process of score generation for any DNAm dataset and is available at: https://www.ed.ac.uk/centre-genomic-medicine/research-groups/marioni-group/methyldetectr. A video on how to use the MethylDetectR shiny app to generate EpiScores is available at: https://youtu.be/65Y2Rv-4tPU.

Results

Selecting the most robust EpiScores for protein levels

To generate epigenetic scores for a comprehensive set of plasma proteins, we ran elastic net penalised regression models using protein measurements from the SOMAscan (aptamer-based) and Olink (antibody-based) platforms. We used two cohorts: the German population-based study KORA (n = 944, mean age 59 years [SD 7.8], with 793 SOMAscan proteins) and the Scottish Lothian Birth Cohort 1936 (LBC1936) study (between 706 and 875 individuals in the training cohort, with a total of 160 Olink neurology and inflammatory panel proteins). The mean age of the LBC1936 participants at sampling was 70 (SD 0.8) for inflammatory and 73 (SD 0.7) for neurology proteins. Full demographic information is available for all cohorts in Supplementary file 1A. Prior to running the elastic net models, we rank-based inverse normalised protein levels and adjusted for age, sex, cohort-specific variables and, where present, cis and trans pQTL effects identified from previous analyses (Hillary et al., 2019; Hillary et al., 2020b; Suhre et al., 2017) (Materials and methods). Of a possible 793 proteins in KORA, 84 EpiScores had Pearson r > 0.1 and p < 0.05 when tested in an independent subset of Generation Scotland (The Stratifying Resilience and Depression Longitudinally [STRADL] study, n = 778) (Supplementary file 1B). These EpiScores were selected for EpiScore-disease analyses. Of the 160 Olink proteins trained in LBC1936, there were 21 with r > 0.1 and p < 0.05 in independent test sets (STRADL, n = 778, Lothian Birth Cohort 1921: LBC1921, n = 162) (Supplementary file 1C). Independent test set data were not available for four Olink proteins. However, they were included based on their performance (r > 0.1 and p < 0.05) in a holdout sample of 150 individuals who were left out of the training set. We then retrained all selected predictors on the full training samples. A total of 109 EpiScores (84 SOMAscan-based and 25 Olink-based) were brought forward (r > 0.1 and p < 0.05) to EpiScore-disease analyses (Figure 2 and Supplementary file 1D). There were five EpiScores for proteins common to both Olink and SOMAscan panels, which had variable correlation strength (GZMA r = 0.71, MMP.1 r = 0.46, CXCL10 r = 0.35, NTRK3 r = 0.26, and CXCL11 r = 0.09). Predictor weights, positional information, and cis/trans status for CpG sites contributing to these EpiScores are available in Supplementary file 1E. The number of CpG features selected for EpiScores ranged from 1 (lyzozyme) to 395 (aminoacylase-1 [ACY-1]), with a mean of 96 (Supplementary file 1F). The most frequently selected CpG was the smoking-related site cg05575921 (mapping to the AHRR gene), which was included in 25 EpiScores. Counts for each CpG site are summarised in Supplementary file 1G. This table includes the set of protein EpiScores that each CpG contributes to, along with phenotypic annotations (traits) from the MRC-IEU EWAS catalog (MRC-IEU, 2021) for each CpG site having genome-wide significance (p < 3.6 × 10–8) (Saffari et al., 2018). GeneSet enrichment analysis of the original proteins used to train the 109 EpiScores highlighted pathways associated with immune response and cell remodelling, adhesion, and extracellular matrix function (Supplementary file 1H).

Figure 2.

Test performance for the 109 selected protein EpiScores.

Test set correlation coefficients for associations between protein EpiScores for (a) inflammatory Olink, (b) neurology Olink, and (c) SOMAmer protein panel EpiScores and measured protein levels are plotted. 95% confidence intervals are shown for each correlation. The 109 protein EpiScores shown had r > 0.1 and p < 0.05 in either one or both of the GS:STRADL (n = 778) and LBC1921 (n = 162) test sets, wherever protein data was available for comparison. Data shown corresponds to the results included in Supplementary file 1B-C. Correlation heatmaps between the 109 EpiScore measures (Figure 2—figure supplement 1) are provided, along with a summary of the most enriched functional pathways for the genes of the 109 proteins used to train EpiScores (Figure 2—figure supplement 2).

Correlation heatmap for EpiScore measures projected into Generation Scotland (N = 9537) for the 109 protein EpiScores selected in the test sample (r > 0.1, p < 0.05). At the top of the heatmap, an annotation bar is displayed. Olink proteins are shown in pale green and Somalogic proteins are shown in purple.

Genes selected for pathway enrichment (false discovery rate [FDR]-adjusted p < 0.05) are summarised, with the proportion of overlapping genes enriched in the gene-set also shown. The corresponding data for this figure can be accessed in full in Supplementary file 1H.

Test performance for the 109 selected protein EpiScores.

Figure 2—figure supplement 1.

Correlation heatmap for protein EpiScore measures in Generation Scotland.

Figure 2—figure supplement 2.

GeneSet enrichment of canonical pathways common to the genes encoding proteins that were used to train the 109 selected EpiScores.

Correlation heatmap for protein EpiScore measures in Generation Scotland.

GeneSet enrichment of canonical pathways common to the genes encoding proteins that were used to train the 109 selected EpiScores.

EpiScore-disease associations in Generation Scotland

The Generation Scotland dataset contains extensive electronic health data from GP and hospital records as well as DNAm data for 9537 individuals. This makes it uniquely positioned to test whether EpiScore signals can predict disease onset. We ran nested mixed effects Cox proportional hazards models (Figure 3) to determine whether the levels of each EpiScore at baseline associated with the incidence of 12 morbidities over a maximum of 14 years of follow-up. The correlation structures for the 109 EpiScore measures used for Cox modelling are presented in Figure 2—figure supplement 1.

Figure 3.

Nested Cox proportional hazards assessment of protein EpiScore-disease prediction.

Mixed effects Cox proportional hazards analyses in Generation Scotland (n = 9537) tested the relationships between each of the 109 selected EpiScores and the incidence of 12 leading causes of morbidity (Supplementary file 1I-J). The basic model was adjusted for age and sex and yielded 294 associations between EpiScores and disease diagnoses, with false discovery rate (FDR)-adjusted p < 0.05. In the fully adjusted model, which included common risk factors as additional covariates (smoking, deprivation, educational attainment, body mass index (BMI), and alcohol consumption), 137 of the basic model associations remained significant with p < 0.05. In a sensitivity analysis, the addition of estimated white blood cells (WBCs) to the fully adjusted models led to the attenuation of 34 of the 137 associations. In a further sensitivity analysis, 81 associations remained after adjustment for both immune cell proportions and GrimAge acceleration.

Nested Cox proportional hazards assessment of protein EpiScore-disease prediction.

Phenotypic trait and estimated white blood cell proportion correlations with EpiScores.

Heatmap of Pearson's correlations (r) between the 78 protein EpiScore measures that were associated with incident disease (with p < 0.05 in the fully adjusted Cox mixed effects proportional hazards models) and continuous phenotypic/lifestyle trait variables and Houseman-estimated white blood cell proportions in Generation Scotland (total N = 9537). Protein measurements used to train the predictors were adjusted for age and sex. The maximum sample size available was used for each correlation. GrimAge: GrimAge acceleration. Units: weekly units of alcohol. EpiSmoker: DNAm-derived score for smoking. SIMD: Scottish Index of Multiple Deprivation. EA: educational attainment. Mono: monocytes. Gran: granulocytes. NK: natural killer cells. The Cox proportional hazards assumption was assessed through the Schoenfeld residuals from the models. Two associations in the basic model adjusting for age and sex failed to satisfy the global assumption (across all covariates) and were excluded. There were 294 remaining EpiScore-disease associations with a false discovery rate (FDR)-adjusted p < 0.05 in the basic model. After further adjustment for common risk factor covariates (smoking, social deprivation status, educational attainment, body mass index [BMI], and alcohol consumption), 137 of the 294 EpiScore-disease associations from the basic model had p < 0.05 in the fully adjusted model (Supplementary file 1I-J). Eleven of the 137 fully adjusted associations failed the Cox proportional hazards assumption for the EpiScore variable (p < 0.05 for the association between the Schoenfeld residuals and time; Supplementary file 1K). When we restricted the time-to-event/censor period by each year of possible follow-up, there were minimal differences in the EpiScore-disease hazard ratios between follow-up periods that did not violate the assumption and those that did (Supplementary file 1L). The 137 associations were therefore retained as the primary results. The 137 associations found in the fully adjusted model comprised 78 unique EpiScores that were related to the incidence of 11 of the 12 morbidities studied. Diabetes and chronic obstructive pulmonary disease (COPD) had the greatest number of associations, with 33 and 41, respectively. Figure 4 presents the EpiScore-disease relationships for COPD and the remaining nine morbidities: stroke, lung cancer, ischaemic heart disease (IHD), inflammatory bowel disease (IBD), rheumatoid arthritis (RA), depression, bowel cancer, pain (back/neck), and Alzheimer’s dementia. There were 13 EpiScores that associated with the onset of three or more morbidities. Figure 5 presents relationships for these 13 EpiScores in the fully adjusted Cox model results. Of note is the EpiScore for Complement 5 (C5), which associated with five outcomes: stroke, diabetes, IHD, RA, and COPD. Of the 29 SOMAscan-derived EpiScore associations with incident diabetes, 23 replicated previously reported SOMAscan protein associations (Elhadad et al., 2020; Gudmundsdottir et al., 2020; Ngo et al., 2021) with incident or prevalent diabetes in one or more cohorts (Figure 6 and Supplementary file 1M).

Figure 4.

Protein EpiScore associations with incident disease.

EpiScore-disease associations for 10 of the 11 morbidities with associations where p < 0.05 in the fully adjusted mixed effects Cox proportional hazards models in Generation Scotland (n = 9537). Hazard ratios are presented with confidence intervals for 104 of the 137 EpiScore-incident disease associations reported. Models were adjusted for age, sex, and common risk factors (smoking, body mass index (BMI), alcohol consumption, deprivation, and educational attainment). IBD: inflammatory bowel disease. IHD: ischaemic heart disease. COPD: chronic obstructive pulmonary disease. For EpiScore-diabetes associations, see Figure 6. Data shown corresponds to the results included in Supplementary file 1J.

Figure 5.

Protein EpiScores that associated with the greatest number of morbidities.

EpiScores with a minimum of three relationships with incident morbidities in the fully adjusted Cox models. The network includes 13 EpiScores as dark blue (SOMAscan) and grey (Olink) nodes, with disease outcomes in black. EpiScore-disease associations with hazard ratios < 1 are shown as blue connections, whereas hazard ratios > 1 are shown in red. COPD: chronic obstructive pulmonary disease. IHD: ischaemic heart disease. Data shown corresponds to the results included in Supplementary file 1J.

Figure 6.

Replication of known protein-diabetes associations with protein EpiScores.

EpiScore-incident diabetes associations in Generation Scotland (n = 9537). The 29 SOMAscan (top panel) and four Olink (bottom panel) associations shown with p < 0.05 in fully adjusted mixed effects Cox proportional hazards models. Of the 29 SOMAscan-derived EpiScores, 23 associations were consistent with protein-diabetes associations (pink) in one or more of the comparison studies that used SOMAscan protein levels. Six associations were novel (blue). Data shown corresponds to the results included in Supplementary files 1J and M.

Protein EpiScore associations with incident disease.

Protein EpiScores that associated with the greatest number of morbidities.

Replication of known protein-diabetes associations with protein EpiScores.

Immune cell and GrimAge sensitivity analyses

Correlations of the 78 EpiScores that were associated with incident disease (P < 0.05 in the fully-adjusted cox proportional hazards models) with covariates suggested interlinked relationships with both estimated white blood cell proportions and GrimAge acceleration (Figure 3—figure supplement 1). These covariates were therefore added incrementally to the fully-adjusted Cox models (Figure 3). There were 103 associations that remained statistically significant (FDR p < 0.05 in the basic model and p < 0.05 in the fully adjusted model) after adjustment for immune cell proportions, of which 81 remained significant when GrimAge acceleration scores were added to this model (Supplementary file 1J). In a further sensitivity analysis, relationships between both estimated white blood cell (WBC) proportions and GrimAge acceleration scores with incident diseases were assessed in the Cox model structure independently of EpiScores. Of the 60 possible relationships between WBC measures and the morbidities assessed, four were statistically significant (FDR-adjusted p < 0.05) in the basic model and remained significant with p < 0.05 in the fully adjusted model (Supplementary file 1N). A higher proportion of natural killer cells was linked to decreased risk of incident COPD, RA, diabetes, and pain (back/neck). The GrimAge acceleration composite score was associated with COPD, IHD, diabetes, and pain (back/neck) in the fully adjusted models (p < 0.05) (Supplementary file 1O). The magnitude of the GrimAge effect sizes was comparable to the EpiScore findings.

Figure 3—figure supplement 1.

Phenotypic trait and estimated white blood cell proportion correlations with EpiScores.

Relationship between EpiScores and subsequent COVID-19

Two previous studies including pilot proteomic measurements from the Generation Scotland cohort (N = 199 controls) as part of wider analyses found that several proteins corresponding to our EpiScores were associated with COVID-19 outcomes (Demichev et al., 2021; Messner et al., 2020). These included proteins such as CRP, C9, SELL, and SHBG, all of which were associated with one or more incident diseases in this study. Two subsets (N = 268 and N = 173) of the Generation Scotland sample who contracted COVID-19 were therefore used to test the hypothesis that EpiScores would associate with COVID-19 outcomes (acquired >9 years after the blood draw for DNAm analyses). No significant associations were identified that delineated differences between the development of long-covid (duration >4 weeks) or hospitalisation from COVID-19 (associations that had p < 0.05 did not withstand Bonferroni adjustment for multiple testing) (Supplementary file 1P).

Discussion

Here, we report a comprehensive DNAm scoring study of 953 circulating proteins. We define 109 robust EpiScores for plasma protein levels that are independent of known pQTL effects. By projecting these EpiScores into a large cohort with extant data linkage, we show that 78 EpiScores associate with the incidence of 11 leading causes of morbidity (137 EpiScore-disease associations in total), but do not associate with COVID-19 outcomes. Finally, we show that EpiScore-diabetes associations highlight previously measured protein-diabetes relationships. The bulk of EpiScore-disease associations are independent of common lifestyle and health factors, differences in immune cell composition and GrimAge acceleration. EpiScores therefore provide methylation-proteomic signatures for disease prediction and risk stratification. The consistency between our EpiScore-diabetes associations and previously identified protein-diabetes relationships (Elhadad et al., 2020; Gudmundsdottir et al., 2020; Ngo et al., 2021) suggests that epigenetic scores identify disease-relevant biological signals. In addition to the comprehensive lookup of SOMAscan proteins with diabetes, several of the markers we identified for COPD and IHD also reflect previous associations with measured proteins (Ganz et al., 2016; Serban et al., 2021). The three studies used for the diabetes comparison represent the largest candidate protein characterisations of type 2 diabetes to date and the top markers identified included aminoacylase-1 (ACY-1), sex hormone-binding globulin (SHBG), growth hormone receptor (GHR), and insulin-like growth factor-binding protein 2 (IFGBP-2) (Elhadad et al., 2020; Gudmundsdottir et al., 2020; Ngo et al., 2021). Our EpiScores for these top markers were also associated with diabetes, in addition to EpiScores for several other protein markers reported in these studies. A growing body of evidence suggests that type 2 diabetes is mediated by genetic and epigenetic regulators (Kwak and Park, 2016) and proteins such as ACY-1 and GHR are thought to influence a range of diabetes-associated metabolic mechanisms (Kim and Park, 2017; Pérez-Pérez et al., 2012). Proteins that we identify through EpiScore associations, such as NTR domain-containing protein 2 (WFIKKN2), have also been causally implicated in type 2 diabetes onset through Mendelian randomisation analysis (Ngo et al., 2021). In the case of diabetes, EpiScores may therefore be used as disease-relevant risk biomarkers, many years prior to onset. Validation should be tested when sufficient data become available for the remaining morbidities. With modest test set performances (e.g., SHBG r = 0.18 and ACY-1 r = 0.25), it is perhaps surprising that such strong synergy is observed between EpiScores for proteins that associated with diabetes and the trends seen with measured proteins. Nonetheless, DNAm scores for CRP and IL-6 have previously been shown to perform modestly in test sets (r ~ 0.2, equivalent to ~4% explained variance in protein level), but augment and often outperform the measured protein related to a range of phenotypes (Stevenson et al., 2020; Stevenson et al., 2021). Compared to scores utilising DNAm for the prediction of singular diseases, our EpiScores enable the granular study of individual protein predictor signatures with clinical outcomes. For example, levels of the acid sphingomyelinase (ASM) EpiScore predicted onset of Alzheimer’s dementia, several years prior to diagnosis. ASM (encoded by SMPD1) has been discussed as a therapeutic candidate for Alzheimer’s disease (Cataldo et al., 2004; Lee et al., 2014; Park et al., 2020) and has been shown to disrupt autophagic protein degradation and associate with accumulation of amyloid-beta in murine models of Alzheimer’s pathology (Lee et al., 2014; Park et al., 2020). Our large-scale assessment of EpiScores provides a platform for future studies, as composite predictors for traits may be created using our EpiScore database. These should be tested in incident disease predictions when sufficient case data are available. Our results indicated that the set of 109 EpiScores are likely to be heavily enriched for inflammatory, complement system and innate immune system pathways, in addition to extracellular matrix, cell remodelling, and cell adhesion pathways. This reinforces previous work linking chronic inflammation and the epigenome (Zaghlool et al., 2020). It also suggests that EpiScores could be useful in the prediction of morbidities that are characterised by differential inflammatory states. An example of this is the EpiScore for Complement Component 5 (C5), which was associated with the onset of five morbidities, the highest number for any EpiScore (Figure 5). The EpiScore for C5 is likely to reflect the biological pathways occurring in individuals with heightened complement cascade activity and could be utilised to alert clinicians to individuals at high risk of multimorbidity. Elevated levels of C5 peptides have been associated with severe inflammatory, autoimmune, and neurodegenerative states (Ma et al., 2019; Mantovani et al., 2014; Morgan and Harris, 2015) and a range of C5-targetting therapeutic approaches are in development (Alawieh et al., 2018; Brandolini et al., 2019; Hawksworth et al., 2017; Hernandez et al., 2017; Morgan and Harris, 2015; Ort et al., 2020). Though EpiScores such as C5 – which occupy central hubs in the disease prediction framework – may provide evidence of early methylation signatures common to the onset of multiple diseases, we did not observe associations between EpiScores and COVID-19 hospitalisation or long-COVID status. This is perhaps surprising, given that many of the morbidities that our EpiScores predicted are also known risk factors for increased risk of death due to COVID-19 (Williamson et al., 2020). Many of the proteins corresponding to EpiScores in our study were also associated with COVID-19 severity and progression in two previous studies that included a pilot sample (N = 199) from the Generation Scotland cohort at baseline as control data (Demichev et al., 2021; Messner et al., 2020). COVID-19 likely has multiple intersecting risk factors that impact severity and recovery, and the lack of associations we observe is likely to be in part due to the limited number of COVID-19 cases available in Generation Scotland. Additionally, there is a large lag time between baseline biological measurement and COVID-19 in our analyses, whereas the two studies that found protein marker associations integrated protein measurements longitudinally and from samples extracted during COVID-19 progression. With increased power available through continued data linkage, EpiScore relationships with COVID-19 outcomes may be observed in future work. This study has several limitations. First, we demonstrate that EpiScores carry disease-relevant signals that may be clinically meaningful to delineate early disease risk when comparing relative differences within a cohort. However, projecting a new individual onto a reference set is complicated due to absolute differences in methylation quantification resulting from batch and processing effects. Second, future studies should assess paired protein and EpiScore contributions to traits, as inference from EpiScores alone, while useful for disease risk stratification, is not sufficient to determine mechanisms. This may also highlight EpiScores that outperform the measured protein equivalent in disease. Third, the epitope nature of the protein measurement in the SOMAscan panel may incur probe cross-reactivity and non-specific binding; there may also be differences in how certain proteins are measured across panels (Pietzner et al., 2020; Sun et al., 2018). Comparisons of multiple protein measurement technologies on the same samples would help to explore this in more detail. Fourth, there may be pQTLs with small effect sizes that were not regressed from the proteins prior to generating the EpiScores. Fifth, while training and testing was performed across multiple cohorts, it is likely that further development of EpiScores in larger proteomic samples with diverse ancestries will improve power to generate robust scores. Upper bounds for DNAm prediction of complex traits, such as proteins, can be estimated by variance components analyses (Hillary et al., 2020b; Trejo Banos et al., 2020; Zhang et al., 2019). Finally, associations present between EpiScore measures and disease incidence may have been influenced by external factors such as prescription medications for comorbid conditions and comorbid disease prevalence. We have shown that EpiScores for circulating protein levels predict the incidence of multiple diseases, up to 14 years prior to diagnosis. Our findings suggest that DNAm phenotyping approaches and data linkage to electronic health records in large, population-based studies have the potential to (1) capture inter-individual variability in protein levels; (2) predict incident disease risk many years prior to morbidity onset; and (3) highlight disease-relevant biological signals for further exploration. The EpiScore weights are publicly available, enabling any cohort with Illumina DNAm data to generate them and to relate them to various outcomes. Given the increasingly widespread assessment of DNAm in cohort studies (McCartney et al., 2020; Min et al., 2021), EpiScores offer an affordable and consistent (i.e., array-based) way to utilise these signatures. This information is likely to be important in risk stratification and prevention of age-related morbidities.

Materials and methods

The KORA sample population

The KORA F4 study includes 3080 participants who reside in Southern Germany. Individuals were between 32 and 81 years of age when recruited to the study from 2006 and 2008. In the current study, there were 944 individuals with methylation, proteomics, and genetic data available. The Infinium HumanMethylation450 BeadChip platform was used to generate DNAm data for these individuals. The Affymetrix Axiom array was used to generate genotyping data and the SOMAscan platform was used to generate proteomic measurements in the sample.

DNAm in KORA

Methylation data were generated for 1814 individuals (Petersen et al., 2014); 944 also had protein and genotype measurements available. During preprocessing, 65 SNP probes were excluded and background correction was performed in minfi (Aryee et al., 2014). Samples with a detection rate of less than 95% were excluded. Next, the minfi R package was used to perform normalisation on the intensity methylation measures (Aryee et al., 2014), with a method consistent with the Lumi:QN + BMIQ pipeline. After excluding non-cg sites and CpGs on sex chromosomes or with fewer than 100 measures available, 470,837 CpGs were available for analyses.

Proteomics in KORA

The SOMAscan platform (Version 3.2) (Gold et al., 2010) was used to quantify protein levels in undepleted plasma for 1129 SOMAmer probes (Suhre et al., 2017). Of the 1000 samples provided for analysis, two samples were excluded due to errors in bio-bank sampling and one based on quality control (QC) measures. Of the 997 samples available, there were 944 individuals with methylation and genotypic data. Of the 1129 probes available, five failed the QC, leaving a total of 1124 probes for the subsequent analysis. Protein measurements were transformed by rank-based inverse normalisation and regressed onto age, sex, known pQTLs, and 20 genetic principal components of ancestry derived from the Affymetrix Axiom Array to control for population structure. pQTLs for each protein were taken from a previous GWAS in the sample (Suhre et al., 2017).

The LBC1936 and LBC1921 sample populations

The Lothian Birth Cohorts of 1921 (LBC1921; N = 550) and 1936 (LBC1936; N = 1091) are longitudinal studies of aging in individuals who reside in Scotland (Deary et al., 2012; Taylor et al., 2018). Participants completed an intelligence test at age 11 years and were recruited for these cohorts at mean ages of 79 (LBC1921) and 70 (LBC1936). They have been followed up triennially for a series of cognitive, clinical, physical, and social data, along with blood donations that have been used for genetic, epigenetic, and proteomic measurement. DNAm, proteomic (Olink platform), and genetic data for individuals from Waves 1 (n=875 at mean age 70 years and sd 0.8) and 2 (n=706 at mean age 73 years and sd 0.7) of the LBC1936 and Wave 3 of the LBC1921 (n=162 at mean age 87 years and sd 0.4) were available.

DNAm in LBC1936 and LBC1921

DNA from whole blood was assessed using the Illumina 450 K methylation array. Details of QC have been described elsewhere (Shah et al., 2014; Zhang et al., 2018). Raw intensity data were background-corrected and normalised using internal controls. Manual inspection resulted in the removal of low-quality samples that presented issues related to bisulphite conversion, staining signal, inadequate hybridisation, or nucleotide extension. Probes with low detection rate <95% at p < 0.01 and samples with low call rates (<450,000 probes detected at p < 0.01) were removed. Samples were also removed if they had a poor match between genotype and SNP control probes, or incorrect DNAm-predicted sex.

Proteomics in LBC1936 and LBC1921

Plasma samples were analysed using either the Olink neurology 92-plex or the Olink inflammation 92-plex proximity extension assays (Olink Bioscience, Uppsala Sweden). One inflammatory panel protein (BDNF) failed QC and was removed. A further 21 proteins were removed, as over 40% of samples fell below the lowest limit of detection. Two neurology proteins, MAPT and HAGH, were excluded due to >40% of observations being below the lower limit of detection. This resulted in 90 neurology (LBC1936 Wave 2) and 70 inflammatory (LBC1936 Wave 1) proteins in LBC1936 and 92 neurology proteins available in LBC1921. Protein levels were rank-based inverse normalised and regressed onto age, sex, four genetic components of ancestry derived from multidimensional scaling of the Illumina 610-Quadv1 genotype array and Olink array plate. In LBC1936, pQTLs were adjusted for, through reference to GWAS in the samples (Hillary et al., 2019; Hillary et al., 2020b).

Generation Scotland and STRADL sample populations

Generation Scotland: the Scottish Family Health Study (GS) is a large, family-structured, population-based cohort study of >24,000 individuals from Scotland (mean age 48 years) (Smith et al., 2013). Recruitment took place between 2006 and 2011 with a clinical visit where detailed health, cognitive, and lifestyle information was collected along with biological samples (blood, urine, saliva). In GS, there were 9537 individuals with DNAm and phenotypic information available. The STRADL cohort is a subset of 1188 individuals from the GS cohort who undertook additional assessments approximately 5 years after the study baseline (Navrady et al., 2018).

DNAm in Generation Scotland and STRADL

In the GS cohort, blood-based DNAm was generated in two sets using the Illumina EPIC array. Set 1 comprised 5190 related individuals whereas Set 2 comprised 4583 individuals, unrelated to each other and to those in Set 1. During QC, probes were removed based on visual outlier inspection, bead count <3 in over 5% of samples, and samples with detection p-value below adequate thresholds (McCartney et al., 2018b; Seeboth et al., 2020). Samples were removed based on sex mismatches, low detection p-values for CpGs and saliva samples and genetic outliers (Amador et al., 2015). The quality-controlled dataset comprised 9537 individuals (nSet1 = 5087, nSet2 = 4450). The same steps were also applied to process DNAm in STRADL.

Proteomics in STRADL

Measurements for 4235 proteins in 1065 individuals from the STRADL cohort were recorded using the SOMAscan technology; 793 epitopes matched between the KORA and STRADL cohorts and were included for training in KORA and testing in STRADL. There were 778 individuals with proteomics data and DNAm data in STRADL. Protein measurements were transformed by rank-based inverse normalisation and regressed onto age, sex, and 20 genetic principal components (derived from multidimensional scaling of genotype data from the Illumina 610-Quadv1 array).

Electronic health data linkage in Generation Scotland

Over 98% of GS participants consented to allow access to electronic health records via data linkage to GP records (Read 2 codes) and hospital records (ICD codes). Data are available prospectively from the time of blood draw, yielding up to 14 years of linkage. We considered incident data for 12 morbidities. Ten of the diseases are listed by the World Health Organization (WHO) as leading causes of either morbidity or mortality (Hay et al., 2017; World Health Organization, 2018). Inflammatory bowel disease (IBD) (Kassam et al., 2014) and RA (James et al., 2018) are also included as traits as they have been reported as leading causes of disability and morbidity and the global burdens of these diseases are rising (Alatab et al., 2020; Safiri et al., 2019). Prevalent cases (ascertained via retrospective ICD and Read 2 codes or self-report from a baseline questionnaire) were excluded. For IBD prevalent cases were excluded based on data linkage alone. Included and excluded terms can be found in Supplementary files 1Q-1B1. Alzheimer’s dementia was limited to cases/controls with age of event/censoring ≥65 years. Breast cancer was restricted to females only. Recurrent, major and moderate episodes of depression were included in depression. Diabetes was comprised of predominantly type 2 diabetes codes and additional general diabetes codes such as diabetic retinopathy and diabetes mellitus with renal manifestation that often occur in individuals with type 2 diabetes. Type 1 and juvenile diabetes cases were excluded.

Elastic net protein EpiScores

Penalised regression models were generated for 160 proteins in LBC1936 and 793 proteins in KORA using Glmnet (Version 4.0-2) (Friedman et al., 2010) in R (Version 4.0) (R Development Core Team, 2020). Protein levels were the outcome and there were 428,489 CpG features per model in the LBC1936 training and 397,630 in the KORA training. An elastic net penalty was specified (alpha = 0.5) and cross validation was applied. DNAm and protein measurements were scaled to have a mean of zero and variance of one. In the KORA analyses, 10-fold cross validation was applied and EpiScores were tested in STRADL (n = 778). Of 480 EpiScores that generated ≥1 CpG features, 84 had Pearson r > 0.1 and p < 0.05 in STRADL. As test set comparisons were not available for every protein in the LBC1936 analyses, a holdout sample was defined, with two folds set aside as test data and 10-fold cross validation carried out on the remaining data (ntrain = 576, ntest = 130 for neurology and ntrain = 725, ntest = 150 for inflammatory proteins). We retained 36 EpiScores with Pearson r > 0.1 and p < 0.05. New predictors for these 36 proteins were then generated using 12-fold cross validation and tested externally in STRADL (n = 778) and LBC1921 (n = 162, for the neurology panel). Twenty-one EpiScores had r > 0.1 and p < 0.05 in at least one of the external test sets. Four EpiScores did not have external comparisons and were included based on holdout performance. Functional annotations for each of the proteins used to train the finalised set of 109 EpiScores were sourced from the STRING database (Jensen et al., 2009). GeneSet enrichment analysis against protein-coding genes was performed using the FUMA database, to quantify which canonical pathways were most commonly implicated across the 109 genes corresponding to the proteins used to train the 109 EpiScores (Watanabe et al., 2017). The background gene-set was specified as protein coding genes and a threshold of FDR p < 0.05 was applied for enrichment status, with the minimum overlapping genes with gene-sets set to ≥2. The 109 selected EpiScores were then applied to Generation Scotland (n = 9537). DNAm at each CpG site was scaled to have a mean of zero and variance of one, with scaling performed separately for GS sets.

Associations with health linkage phenotypes in Generation Scotland

Mixed effects Cox proportional hazards regression models adjusting for age, sex, and methylation set were used to assess the relationship between 109 EpiScores and 12 morbidities in Generation Scotland. Models were run using coxme (Therneau, 2020b) (Version 2.2-16) with a kinship matrix accounting for relatedness in Set 1. Cases included those diagnosed after baseline who had died, in addition to those who received a diagnosis and remained alive. Controls were censored if they were disease free at time of death, or at the end of the follow-up period. EpiScore levels were rank-base inverse normalised. Fully adjusted models included the following additional covariates measured at baseline: alcohol consumption (units consumed in the previous week); deprivation assessed by the Scottish Index of Multiple Deprivation (GovScot, 2016); BMI (kg/m2); educational attainment (an 11-category ordinal variable); and a DNAm-based score for smoking status (Bollepalli et al., 2019). A false discovery rate multiple testing correction p < 0.05 was applied to the 1306 EpiScore-disease associations (109 EpiScores by 12 incident disease traits, with two associations excluded for failing the global proportional hazards assumption). Proportional hazards assumptions were checked through Schoenfeld residuals (global test and a test for the protein-EpiScore variable) using the coxph and cox.zph functions from the survival package (Therneau, 2020a) (Version 3.2-7). For each association failing to meet the assumption (Schoenfeld residuals p < 0.05), a sensitivity analysis was run across yearly follow-up intervals. Fully adjusted Cox proportional hazards models were run with Houseman-estimated white blood cell proportions as covariates (Houseman et al., 2012). A further sensitivity analysis added GrimAge acceleration (Lu et al., 2019) as an additional covariate. Basic and fully adjusted Cox models were also run with estimated monocyte, B-cell, CD4T, CD8T, and natural killer cell proportions as predictors, in addition to models with GrimAge acceleration as the predictor of incident disease. Correlation structures for EpiScores, DNAm-estimated white blood cell proportions, and phenotypic information were assessed using Pearson correlations and pheatmap (Kolde, 2019) (Version 1.0.12) and ggcorrplot packages (Version 0.1.3) (Kassambara, 2019). The psych package (Version 2.0.9) (Revelle, 2020) was used to perform principal components analysis on EpiScores. Figures 1 and 2 were created with BioRender.com. Associations for EpiScores that were related to a minimum of three morbidities were subset from the fully adjusted Cox proportional hazards results and were visualised using the ggraph package (Version 2.0.5) (Pedersen, 2021). This network representation was used (Figure 5) to highlight protein EpiScores that were connected with multiple morbidities.

Consistency of disease associations between EpiScores and measured proteins

Comparisons were conducted between EpiScore-diabetes associations and type 2 diabetes associations with measured proteins using three previous large-scale proteomic studies (Elhadad et al., 2020; Gudmundsdottir et al., 2020; Ngo et al., 2021) In these studies, six cohorts were included (Study 1: KORA n = 993, HUNT n = 940 [Elhadad et al., 2020], Study 2: AGES-Reykjavik n = 5438 and QMDiab n = 356 [Gudmundsdottir et al., 2020], Study 3: Framingham Heart Study n = 1618 and the Malmo Diet and Cancer Study n = 1221). Study 1 included the KORA dataset, which we use in this study to generate SOMAscan EpiScores. We characterised which SOMAscan-based EpiScore-diabetes associations from our fully adjusted results reflected those observed with measured protein levels. We included basic (nominal p < 0.05) and fully adjusted results (with either FDR or Bonferroni-corrected p < 0.05), wherever available, across the lookup cohorts (Supplementary file 1M).

Relationship between EpiScores and COVID-19 outcomes

Associations between each of the 109 selected protein EpiScores and subsequent long-COVID or COVID-19 hospitalisation were tested in the Generation Scotland population. A binary variable was used for long-COVID based on self-reported COVID-19 duration from the CovidLife study survey 3 questionnaire (N = 2399 participating individuals) (Fawns-Ritchie et al., 2021). Participants were asked about the total overall time they experienced symptoms in their first/only episode of illness, as well as their COVID-19 illness duration. The dataset is correct as of February 2021 when the survey 3 was administered. Of the 9537 individuals with DNAm that were included in incident disease analyses, 173 indicated that they had COVID-19 and 56 of these individuals reported having long-COVID (>4 weeks duration of symptoms after infection). The mean duration from DNAm measurement to long-COVID for these 56 individuals was 11.2 years (sd 1.2). Hospitalisation information, derived from the Scottish Morbidity Records (SMR01), was used to obtain COVID-19 hospital admissions using ICD-10 codes U07.1 (lab-confirmed COVID-19 diagnosis), and U07.2 (clinically diagnosed COVID-19). This data linkage identified 268 of the 9537 individuals that had COVID-19 diagnoses and 29 had been recorded as being hospitalised due to COVID-19. The mean duration from DNAm measurement to hospitalisation for these 29 individuals was 11.9 years (sd 1.4). Logistic regression models with either hospitalisation or long-COVID status as binary outcomes were used, with the 109 scaled protein EpiScores as the independent variables. Sex and age at COVID testing were included as covariates. The latter was defined as the age at positive COVID-19 test or 1 January 2021 if COVID-19 test data were not available. This is an important study that demonstrates the potential utility of the circulating proteome for disease prediction and risk stratification. In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses. Decision letter after peer review: Thank you for submitting your article "Epigenetic scores for the circulating proteome as tools for disease prediction" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Y M Dennis Lo as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Luigi Ferruccilu (Reviewer #1). The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. 1. The authors should explain why producing methylation score for proteins add information to using proteins directly. Proteins are effectors and physicians are accustomed to using them. The development of GrimAge showed something quite peculiar, namely that the score developed for smoking was predictive of mortality even when the analysis was adjusted for smoking. This suggests that methylation score point to "mechanisms" that are acting on health beyond what is revealed by the variable that was used for initial discovery. Could this hypothesis be tested in this study? It would potentially justify developing methylation score even when information on proteins is available. 2. It is not very clear, at least to one of the reviewers, the value of the network analysis that is reported in Figure 5. Could this be explain both in terms of it technical development and in terms of what such analysis adds to the message of this paper? 3. While the authors touch on the possible clinical values of the analysis reported here, it would be great to read what is the view of the authors in more details and indicate what they suggest should the next step to translate their findings to the clinic. 4. The list of marker proteins is interesting, but it would be useful to carry out their functional clustering for a more meaningful discussion. Reviewer #1: The manuscript is written in a comprehensible language that should understandable both to the researchers in this filed but also for scientists physicians who can gather beyond the details the potential for this science to change our current approach to medical diagnosis. The authors used penalized regression to developed methylation scores that predict a large number of circulating proteins in two separate high quality datasets. Than used these scores to predict the onset of major medical condition in a third, independent large database, including information on genetic background and many important confounders. They found that more than 100 of theses scores significantly predicted the development of selected chronic diseases with sufficient effect size to be meaningful, and the prediction was independent of circulating proteins and potential confounders. The level of prediction was not sufficient for medical diagnosis but more than enough to alert a clinician that a patients has a high risk of a developing a specific disease and should be looked at more carefully. The manuscript is well done, with a rigorous analysis. I have few suggestion that may improve the robustness of the results and also increase the clinical meaningfulness of the results. 1. The authors should explain why producing methylation score for proteins add information to using proteins directly. Proteins are effectors and physicians area accustomed to using them. The development of GrimAge showed something quite peculiar, namely that the score developed for smoking was predictive of mortality even when the analysis was adjusted for smoking. This suggests that methylation score point to "mechanisms" that are acting on health beyond what is revealed by the variable that was used for initial discovery. Is this hypothesis could be tested in this study, it would justify developing methylation score even when information on proteins is available. 2. It is not very clear, at least to me, the value of the network analysis that is reported in Figure 5. Could this be explain both in terms of it technical development and in terms of what such analysis adds to the message of this paper? 3. While the authors touch on the possible clinical values of the analysis reported here, it would be great to read what is the view of the authors in more details and indicate what they suggest should the next step to translate their findings to the clinic. Reviewer #2: From a methodological point of view, not satisfied by the relatively small size of test samples with a huge number of analyzed markers, which creates the likelihood of conclusions that are valid only within this study (especially without the necessary correction for multiple comparisons in such cases). The list of marker proteins is interesting, but it would be useful to carry out their functional clustering for a more meaningful discussion. The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. 1. The authors should explain why producing methylation score for proteins add information to using proteins directly. Proteins are effectors and physicians are accustomed to using them. The development of GrimAge showed something quite peculiar, namely that the score developed for smoking was predictive of mortality even when the analysis was adjusted for smoking. This suggests that methylation score point to "mechanisms" that are acting on health beyond what is revealed by the variable that was used for initial discovery. Could this hypothesis be tested in this study? It would potentially justify developing methylation score even when information on proteins is available. Contextualising the relevance of protein EpiScores Firstly, we note that the way reviewer 1 spoke about the EpiScores in their review was very helpful to contextualise the potential use for these scores from a clinical standpoint. We have therefore updated our manuscript with this ethos in mind. Regarding the points made in comment one, we have updated our introduction on pages 4-5 as follows, to clarify the value EpiScores may bring to disease prediction and clinical decision-making: Page 4: “As proteins are the primary effectors of disease, connecting the epigenome, proteome and time to disease onset may help to resolve predictive biological signatures.” Page 4: “A leading epigenetic predictor of biological aging, the GrimAge epigenetic clock incorporates methylation scores for seven proteins along with smoking and chronological age, and is associated with numerous incident disease outcomes independently of smoking (Hillary, Stevenson, et al., 2020; Lu et al., 2019). This suggests there is predictive value gained in utilising DNAm scores relevant to protein levels as intermediaries for predictions. Methylation scores also point towards the pathways that may act on health beyond the protein biomarker that they are trained on. A portfolio of methylation scores for proteins across the circulating proteome could therefore aid in the prediction of disease and offer a different, but additive signal beyond methylation or protein data alone.” Page 5: “DNAm scores for Interleukin-6 and C-Reactive protein have been found to associate with a range of phenotypes independently of measured protein levels, show more stable longitudinal trajectories than repeated protein measurements, and, in some cases, outperform blood-based proteomic associations with brain morphology (Conole et al., 2020; A. J. Stevenson et al., 2021). This is likely due to DNA methylation representing the accumulation of more sustained effects over a longer period of time than protein measurements, which have often been shown to be highly variable in their levels when measured at multiple time points (Koenig et al., 2003; Liu et al., 2015; Moldoveanu et al., 2000; Shah et al., 2014). DNA methylation scores for proteins could therefore be used to alert clinicians to individuals with high-risk biological signatures, many years prior to disease onset.” Testing protein EpiScores against measured proteins In response to the specific question noted above (Could this hypothesis be tested in this study?), we would absolutely want to test protein EpiScores against the measured protein equivalents as a priority. We have previously published work that shows that EpiScores for traits (such as smoking and BMI) and protein levels (IL6 and CRP) have been able to achieve comparable, or enhanced performance as compared to the measured proteins or phenotypes within the same study populations. References: – IL6 and CRP EpiScores vs measured proteins, in relation to lifestyle phenotypes (Stevenson et al., 2021) https://pubmed.ncbi.nlm.nih.gov/33595649/ – CRP EpiScore vs the measured protein, in relation to a range of brain imaging measures (Conole et al., 2021 – now accepted in Neurology) https://www.medrxiv.org/content/10.1101/2020.10.08.20205245v1 – BMI trait score (Hamilton et al., 2019) https://www.nature.com/articles/s41366-018-0262-3 – Smoking trait score (Corley et al., 2019) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6779733/ Like these studies, we would ideally like to have direct comparisons between protein EpiScores and measured protein levels for the associations we test in Generation Scotland. However, we are limited by the lack of equivalent protein measurements in Generation Scotland. We are currently generating ~4,000 peptide measurements that map to ~300 proteins in N=20,000 individuals in this cohort and we will therefore be able to validate how protein EpiScores perform in comparison to measured proteins. Many protein biomarkers and acute-phase inflammatory proteins such as CRP and IL6 can be highly variable in single sampling measurements taken across various timepoints. As methylation is likely to represent a more consistent profile of long-term exposure to biological and environmental stressors in the body than single time-point protein levels, we may see that methylation scores outperform the measured proteins in some cases for disease predictions. We have been trialling this experimental design with two protein biomarkers (GDF15 and Nt-pro-BNP) in ongoing (and unpublished) work. In preliminary results (n=20,000), we see that EpiScores for these protein levels are comparable to (or better than) the measured protein in associations with cognitive ability and lifestyle traits. However, in the present study we recognise the lack of direct comparison is a limitation. We have ensured this is mentioned in our discussion on page 23: “Second, future studies should assess paired protein and EpiScore contributions to traits, as inference from EpiScores alone, while useful for disease risk stratification, is not sufficient to determine mechanisms. This may also highlight EpiScores that outperform the measured protein equivalent in disease.” 2. It is not very clear, at least to one of the reviewers, the value of the network analysis that is reported in Figure 5. Could this be explain both in terms of it technical development and in terms of what such analysis adds to the message of this paper? We apologise that the technical development of this figure was not clear. Figure 5 is a visual representation of the fully-adjusted Cox model results between protein EpiScores and incident diseases. It builds on the individual associations presented in Figures 4 and 6 (that are grouped by disease outcomes) by plotting the associations for EpiScores that were predictive of three or greater morbidities. This provides insight into the EpiScores that may harbour a predictive signal that is relevant across multiple disease states. We feel that this visualisation helps to illustrate pathways that may be connected to the development of multimorbidity states as we age. Many of the most highly connected protein EpiScores (e.g. C5, SELL, MMP9) are trained on proteins that are known to associate with a pro-inflammatory state in the body. The remainder also implicate cell adhesion and migration pathways. We have updated our methods section as follows on page 33: “Associations for EpiScores that were related to a minimum of three morbidities were subset from the fully-adjusted Cox proportional hazards results and were visualised using the ggraph package (Version 2.0.5) (Pedersen, 2021). This network representation was used (Figure 5) to highlight protein EpiScores that were connected with multiple morbidities.” We have updated our discussion as follows on page 20: “Our results indicated that the set of 109 EpiScores are likely to be heavily enriched for inflammatory, complement system, innate immune system pathways, in addition to extracellular matrix, cell remodelling and cell adhesion pathways. This reinforces previous work linking chronic inflammation and the epigenome (Zaghlool et al., 2020). It also suggests that EpiScores could be useful in the prediction of morbidities that are characterised by pro-inflammatory states. An example of this is the EpiScore for Complement Component 5 (C5), which was associated with the onset of five morbidities, the highest number for any EpiScore (Figure 5). The EpiScore for C5 is likely to reflect the biological pathways occurring in individuals with heightened complement cascade activity and could be utilised to alert clinicians to individuals at high risk of multimorbidity. Elevated levels of C5 peptides have been associated with severe inflammatory, autoimmune and neurodegenerative states (Ma et al., 2019; Mantovani et al., 2014; Morgan and Harris, 2015) and a range of C5-targetting therapeutic approaches are in development (Alawieh et al., 2018; Brandolini et al., 2019; Hawksworth et al., 2017; Hernandez et al., 2017; Morgan and Harris, 2015; Ort et al., 2020).” If after reviewing the additional context provided the reviewers still feel strongly that this Figure should be removed, we would be happy to oblige and either remove it or add it as a supporting Figure attached to Figure 4. However, we believe that there is additional value in illustrating the associations in the context of potential signals of multimorbidity, rather than only showing them grouped by single disease states. These EpiScores for proteins may alert clinicians to those in a population that are likely to represent the highest risk of multimorbidity. These individuals are likely to have a biological profile reflective of poorer general health. 3. While the authors touch on the possible clinical values of the analysis reported here, it would be great to read what is the view of the authors in more details and indicate what they suggest should the next step to translate their findings to the clinic. We thank the reviewers for this interesting comment. We have provided details in our response below and updated specific parts of the manuscript to improve the narrative on clinical translatability of these scores. We would need to extend this research in several ways before these EpiScores can become clinically translatable: First, we hope to validate a wider set of protein EpiScores. This is something we plan to do with samples that we will generate over the course of the next year, which will allow us to train scores in n=20,000 individuals for roughly 300 additional proteins. Second, once we have a substantial set of EpiScores, we then hope to train single-trait, disease-specific predictor scores using the EpiScores as inputs. This will derive singular scores that inform on the risk of developing individual diseases. These scores may be projected into any population that has DNA methylation measured, and used to stratify those in a population that are high risk for specific diseases (i.e. in the top 5% of a score distribution). Third, more work is required before we are able to project these types of EpiScores for an individual onto a reference population. Due to processing differences and batch effects, comparing absolute differences in DNAm across populations is complicated. By comparison, relative differences within a single cohort or population are far easier to analyse robustly. If these issues could be overcome, our scores for proteins, diseases and lifestyle traits could be used by a clinician to highlight personalised risk scores for each individual based on their epigenetic profile. This would need to be accompanied by information on the magnitude of that risk in real terms. We have made the following adjustment to our discussion based on this on page 23: “First, we demonstrate that EpiScores carry disease-relevant signals that may be clinically meaningful to delineate early disease risk when comparing relative differences within a cohort. However, projecting a new individual onto a reference set is complicated due to absolute differences in methylation quantification resulting from batch and processing effects.” 4. The list of marker proteins is interesting, but it would be useful to carry out their functional clustering for a more meaningful discussion. We have interpreted this comment as requiring additional functional information about the relationships that may exist between proteins used to train EpiScores, such that we could group EpiScores into subsets by particular functions. As a caution for this response, we would like to note that the proteins are derived from separate sources (84 SomaScan and 25 Olink) and assays. As the protein selection for the SomaScan and Olink panels also contains inherent biases towards proteins with certain functions (i.e. neurology-associated, inflammatory-associated, or of likely relevance to health systems), interpretation of any protein clustering is not straightforward since we are likely to amplify such biases. We have assessed the functional clustering of the 109 proteins through the use of the STRING tool, which summarises likely protein-protein interaction based on a range of databases. This provides a visual summary of projected interactions, in addition to a tabular list of functional annotations for each protein (see the revised Supplementary File 1 – Table 1D – STRING functional annotation column). While this is informative (see Author response image 1 – for the 109 gene names corresponding to proteins), we believe that there are several caveats to consider when interpreting the functional clustering of these proteins.

Author response image 1.

While this visual informs us on the likely interactions between the proteins, we feel that the interpretation of this plot in the context of the protein EpiScores is somewhat unhelpful. The EpiScores reflect the biological state of an individual respective to the levels of specific proteins, but they do not directly represent the proteins themselves and are potentially a different signal. Therefore, the functional clustering for the original proteins only direct us towards pathways of possible relevance that the EpiScore signature may be capturing, rather than being directly comparable to the EpiScore signal. For this reason, we have chosen to include only the functional annotations for the individual proteins used in the original training, rather than the interaction information in Author response image 1, as EpiScores for proteins may not interact in the same manner as the protein equivalents.In Supplementary File 1D, we have added a “Generalised grouping” column, which provides an indication of top-line functional groupings of each of the proteins. We have also performed a GeneSet enrichment test for the 109 genes corresponding to the proteins using FUMA. This has allowed for mapping of canonical pathways that are most enriched across the set of 109 proteins. We have added these results to the manuscript (Figure 2 —figure supplement 3 for a visual summary, and Supplementary File 1D for a tabular summary). The two updates we have made in response to this comment have provided additional confirmation that many of the proteins in the set of 109 are heavily associated with immune-regulatory, complement cascade and innate immune system pathways. There are also many that are involved with metabolic pathways and cell migration/conformation signalling pathways. This has helped to make our discussion of the EpiScores more robust and we have updated our manuscript as follows: Page 9 (results): “GeneSet enrichment analysis of the original proteins used to train the 109 EpiScores highlighted pathways associated with immune response and cell remodelling, adhesion and extracellular matrix function (Supplementary File 1E).” Page 31 (methods): “Functional annotations for each of the proteins used to train the finalised set of 109 EpiScores were sourced from the STRING database (Jensen et al., 2009). GeneSet enrichment analysis against protein-coding genes was performed using the FUMA database, to quantify which canonical pathways were most commonly implicated across the 109 genes corresponding to the proteins used to train the 109 EpiScores (Watanabe et al., 2017). The background gene-set was specified as protein coding genes and a threshold of FDR P < 0.05 was applied for enrichment status, with the minimum overlapping genes with gene-sets set to greater than or equal to two.” Page 20 (discussion): “Our results indicated that the set of 109 EpiScores are likely to be heavily enriched for inflammatory, complement system, innate immune system pathways, in addition to extracellular matrix, cell remodelling and adhesion signalling pathways. This reinforces previous work linking chronic inflammation and the epigenome (Zaghlool et al., 2020b) and suggests that these EpiScores may be useful in the prediction of age-morbidities that are characterised by pro-inflammatory states.”

72 in total

1. Refinement of the association of serum C-reactive protein concentration and coronary heart disease risk by correction for within-subject variation over time: the MONICA Augsburg studies, 1984 and 1987.

Authors: Wolfgang Koenig; Malte Sund; Margit Fröhlich; Hannelore Löwel; Winston L Hutchinson; Mark B Pepys
Journal: Am J Epidemiol Date: 2003-08-15 Impact factor: 4.897

2. Exercise elevates plasma levels but not gene expression of IL-1beta, IL-6, and TNF-alpha in blood mononuclear cells.

Authors: A I Moldoveanu; R J Shephard; P N Shek
Journal: J Appl Physiol (1985) Date: 2000-10

3. Recent genomic heritage in Scotland.

Authors: Carmen Amador; Jennifer Huffman; Holly Trochet; Archie Campbell; David Porteous; James F Wilson; Nick Hastie; Veronique Vitart; Caroline Hayward; Pau Navarro; Chris S Haley
Journal: BMC Genomics Date: 2015-06-06 Impact factor: 3.969

4. Investigating the relationship between DNA methylation age acceleration and risk factors for Alzheimer's disease.

Authors: Daniel L McCartney; Anna J Stevenson; Rosie M Walker; Jude Gibson; Stewart W Morris; Archie Campbell; Alison D Murray; Heather C Whalley; David J Porteous; Andrew M McIntosh; Kathryn L Evans; Ian J Deary; Riccardo E Marioni
Journal: Alzheimers Dement (Amst) Date: 2018-06-21

5. Genetic architecture of host proteins involved in SARS-CoV-2 infection.

Authors: Maik Pietzner; Eleanor Wheeler; Julia Carrasco-Zanini; Johannes Raffler; Nicola D Kerrison; Erin Oerton; Victoria P W Auyeung; Jian'an Luan; Chris Finan; Juan P Casas; Rachel Ostroff; Steve A Williams; Gabi Kastenmüller; Markus Ralser; Eric R Gamazon; Nicholas J Wareham; Aroon D Hingorani; Claudia Langenberg
Journal: Nat Commun Date: 2020-12-16 Impact factor: 14.919

6. DNA methylation arrays as surrogate measures of cell mixture distribution.

Authors: Eugene Andres Houseman; William P Accomando; Devin C Koestler; Brock C Christensen; Carmen J Marsit; Heather H Nelson; John K Wiencke; Karl T Kelsey
Journal: BMC Bioinformatics Date: 2012-05-08 Impact factor: 3.169

7. Cohort Profile Update: The Lothian Birth Cohorts of 1921 and 1936.

Authors: Adele M Taylor; Alison Pattie; Ian J Deary
Journal: Int J Epidemiol Date: 2018-08-01 Impact factor: 7.196

Review 8. Potential therapeutic target for aging and age-related neurodegenerative diseases: the role of acid sphingomyelinase.

Authors: Min Hee Park; Hee Kyung Jin; Jae-Sung Bae
Journal: Exp Mol Med Date: 2020-03-13 Impact factor: 8.718

9. Epigenetic measures of ageing predict the prevalence and incidence of leading causes of death and disease burden.

Authors: Robert F Hillary; Anna J Stevenson; Daniel L McCartney; Archie Campbell; Rosie M Walker; David M Howard; Craig W Ritchie; Steve Horvath; Caroline Hayward; Andrew M McIntosh; David J Porteous; Ian J Deary; Kathryn L Evans; Riccardo E Marioni
Journal: Clin Epigenetics Date: 2020-07-31 Impact factor: 6.551

Review 10. Complement, a target for therapy in inflammatory and degenerative diseases.

Authors: B Paul Morgan; Claire L Harris
Journal: Nat Rev Drug Discov Date: 2015-10-23 Impact factor: 112.288

7 in total

Review 1. DNA methylation-based predictors of health: applications and statistical considerations.

Authors: Paul D Yousefi; Matthew Suderman; Ryan Langdon; Oliver Whitehurst; George Davey Smith; Caroline L Relton
Journal: Nat Rev Genet Date: 2022-03-18 Impact factor: 53.242

2. Integrated methylome and phenome study of the circulating proteome reveals markers pertinent to brain health.

Authors: Danni A Gadd; Robert F Hillary; Daniel L McCartney; Liu Shi; Aleks Stolicyn; Neil A Robertson; Rosie M Walker; Robert I McGeachan; Archie Campbell; Shen Xueyi; Miruna C Barbu; Claire Green; Stewart W Morris; Mathew A Harris; Ellen V Backhouse; Joanna M Wardlaw; J Douglas Steele; Diego A Oyarzún; Graciela Muniz-Terrera; Craig Ritchie; Alejo Nevado-Holgado; Tamir Chandra; Caroline Hayward; Kathryn L Evans; David J Porteous; Simon R Cox; Heather C Whalley; Andrew M McIntosh; Riccardo E Marioni
Journal: Nat Commun Date: 2022-08-09 Impact factor: 17.694

3. Functional Medicine Past, Present, and Future.

Authors: Jeffrey S Bland
Journal: Integr Med (Encinitas) Date: 2022-05

4. Getting closer to the clinic.

Authors: Toshiko Tanaka; Luigi Ferrucci
Journal: Elife Date: 2022-02-25 Impact factor: 8.140

5. Feasibility and ethics of using data from the Scottish newborn blood spot archive for research.

Authors: Sarah Cunningham-Burley; Daniel L McCartney; Archie Campbell; Robin Flaig; Clare E L Orange; Carol Porteous; Mhairi Aitken; Ciaran Mulholland; Sara Davidson; Selena M McCafferty; Lee Murphy; Nicola Wrobel; Sarah McCafferty; Karen Wallace; David StClair; Shona Kerr; Caroline Hayward; Andrew M McIntosh; Cathie Sudlow; Riccardo E Marioni; Jill Pell; Zosia Miedzybrodzka; David J Porteous
Journal: Commun Med (Lond) Date: 2022-10-06

6. Epigenome-wide association study of global cortical volumes in generation Scotland: Scottish family health study.

Authors: Miruna Carmen Barbu; Mat Harris; Xueyi Shen; Stolicyn Aleks; Claire Green; Carmen Amador; Rosie Walker; Stewart Morris; Mark Adams; Anca Sandu; Christopher McNeil; Gordon Waiter; Kathryn Evans; Archie Campbell; Joanna Wardlaw; Douglas Steele; Alison Murray; David Porteous; Andrew McIntosh; Heather Whalley
Journal: Epigenetics Date: 2021-11-23 Impact factor: 4.861

7. A blood DNA methylation biomarker for predicting short-term risk of cardiovascular events.

Authors: Andrea Cappozzo; Cathal McCrory; Oliver Robinson; Anna Freni Sterrantino; Carlotta Sacerdote; Vittorio Krogh; Salvatore Panico; Rosario Tumino; Licia Iacoviello; Fulvio Ricceri; Sabina Sieri; Paolo Chiodini; Gareth J McKay; Amy Jayne McKnight; Frank Kee; Ian S Young; Bernadette McGuinness; Eileen M Crimmins; Thalida Em Arpawong; Rose Anne Kenny; Aisling O'Halloran; Silvia Polidoro; Giuliana Solinas; Paolo Vineis; Francesca Ieva; Giovanni Fiorito
Journal: Clin Epigenetics Date: 2022-09-29 Impact factor: 7.259

7 in total