BACKGROUND: Since the onset of the SARS-CoV-2 pandemic, most clinical testing has focused on RT-PCR1. Host epigenome manipulation post coronavirus infection2-4 suggests that DNA methylation signatures may differentiate patients with SARS-CoV-2 infection from uninfected individuals, and help predict COVID-19 disease severity, even at initial presentation. METHODS: We customized Illumina's Infinium MethylationEPIC array to enhance immune response detection and profiled peripheral blood samples from 164 COVID-19 patients with longitudinal measurements of disease severity and 296 patient controls. RESULTS: Epigenome-wide association analysis revealed 13,033 genome-wide significant methylation sites for case-vs-control status. Genes and pathways involved in interferon signaling and viral response were significantly enriched among differentially methylated sites. We observe highly significant associations at genes previously reported in genetic association studies (e.g. IRF7, OAS1). Using machine learning techniques, models built using sparse regression yielded highly predictive findings: cross-validated best fit AUC was 93.6% for case-vs-control status, and 79.1%, 80.8%, and 84.4% for hospitalization, ICU admission, and progression to death, respectively. CONCLUSIONS: In summary, the strong COVID-19-specific epigenetic signature in peripheral blood driven by key immune-related pathways related to infection status, disease severity, and clinical deterioration provides insights useful for diagnosis and prognosis of patients with viral infections.
BACKGROUND: Since the onset of the SARS-CoV-2 pandemic, most clinical testing has focused on RT-PCR1. Host epigenome manipulation post coronavirus infection2-4 suggests that DNA methylation signatures may differentiate patients with SARS-CoV-2 infection from uninfected individuals, and help predict COVID-19 disease severity, even at initial presentation. METHODS: We customized Illumina's Infinium MethylationEPIC array to enhance immune response detection and profiled peripheral blood samples from 164 COVID-19 patients with longitudinal measurements of disease severity and 296 patient controls. RESULTS: Epigenome-wide association analysis revealed 13,033 genome-wide significant methylation sites for case-vs-control status. Genes and pathways involved in interferon signaling and viral response were significantly enriched among differentially methylated sites. We observe highly significant associations at genes previously reported in genetic association studies (e.g. IRF7, OAS1). Using machine learning techniques, models built using sparse regression yielded highly predictive findings: cross-validated best fit AUC was 93.6% for case-vs-control status, and 79.1%, 80.8%, and 84.4% for hospitalization, ICU admission, and progression to death, respectively. CONCLUSIONS: In summary, the strong COVID-19-specific epigenetic signature in peripheral blood driven by key immune-related pathways related to infection status, disease severity, and clinical deterioration provides insights useful for diagnosis and prognosis of patients with viral infections.
Coronaviruses (CoV) comprise a large group of human and animal pathogens, including the novel enveloped RNA betacoronavirus referred to as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)[5]. This pathogen is associated with coronavirus disease 2019 (COVID-19) first identified in Wuhan, China in 2019[6] and declared a pandemic on March 11, 2020[7]. Since the onset of the pandemic, multiple tests for diagnosing COVID-19 have been launched, including real-time reverse transcriptase–polymerase chain reaction (RT-PCR), specific antibody detection, and next-generation sequencing assays that query for current or past infections[1]. With the exception of next-generation sequencing, which can discern viral subtypes, most diagnostic tests are viral strain dependent, can carry a high false negative rate, do not discern if the virus is viable and replicating, and do not predict clinical outcomes of infection[1,8,9]. For example, pre-symptomatic patients may test negative[10,11] while patients who have recovered may continue to test positive though they are no longer infectious[12]. Accurate diagnostics are urgently required to control continued communal spread, to better understand host response, and for the development of vaccines and antivirals[13].Individuals infected with SARS-CoV-2 have a variable course of infection, ranging from asymptomatic to death. Although the fatality rate varies tremendously according to demographic characteristics and co-morbidities[14], the U.S. ranks as one of the countries with the highest COVID-19 mortality rates[15]. Identification of which SARS-CoV-2-infected patients are most likely to develop severe disease would enable clinicians to triage patients via augmented clinical decision support. Having more information on disease severity has recently become critical due to widespread lack of hospital and intensive care unit (ICU) capacity, necessitating difficult decisions about resource triage. To our knowledge, no test can predict COVID-19 clinical course or severity, although work on cytokine abundance ratios after hospitalization has been proposed as a prognostic indicator of severe outcomes[16].There is considerable evidence that enveloped RNA viruses such as CoV can manipulate the host’s epigenome via evolved functions that antagonize and regulate the host innate immune antiviral defense processes[2,3], specifically via DNA methylation. Viral-mediated antagonism of antigen-presentation gene expression in the case of Middle East respiratory syndrome coronavirus (MERS-CoV) was shown to occur via DNA methylation[4]. DNA methylation changes at cytosine-phosphate-guanine (CpG) sites have been increasingly leveraged in the emerging field of clinical epigenetics to characterize unique epigenetic signatures that diagnose disease. To date, considerable success has been demonstrated in developing highly accurate and robust machine learning (ML)-based disease classifiers using DNA methylation patterns to differentiate Mendelian disorders[17], behavior disorders[18], coronary artery disease[19], and some cancers[20-22]. Consequently integration of a methylation-based disease classification can result in relevant improvement in clinical practice[23,24].With a goal to leverage Illumina’s Infinium MethylationEPIC Array to classify differential methylation signatures of SARS-CoV-2-positive (hereafter referred to as SARS-CoV-2+, regardless of additional symptoms) and control peripheral blood DNA samples (either confirmed SARS-CoV-2 negative or samples collected prior to the SARS-CoV-2 pandemic), we conducted this study to determine whether DNA methylation patterns could differentiate SARS-CoV-2-infected patients from non-infected patients from whole blood obtained from patients. Our secondary objective was to determine whether DNA methylation patterns could differentiate patients with SARS-CoV-2 infection who go on to develop severe disease. In this study, we identified a strong COVID-19-specific epigenetic signature in peripheral blood driven by key immune-related pathways related to SARS-CoV-2 infection status, disease severity, and clinical deterioration.
Methods
Source of data
This protocol was reviewed and approved by the Colorado Multiple Institutional Review Board (COMIRB) and the research adheres to the ethical principles of research outlined in the U.S. Federal Policy for the Protection of Human Subjects. SARS-CoV-2+ were defined as those patients who tested positive for SARS-CoV-2 infection via a routine diagnostic RT-PCR assay in the Biobank at the Colorado Center for Personalized Medicine (Thermo Fisher Scientific, Waltham, MA) or in the UCHealth University of Colorado Hospital Clinical Laboratory (Roche Diagnostics, Indianapolis, IN) of a nasopharyngeal swab collected in viral transport media; controls were defined as those who tested negative. Peripheral blood DNA samples were collected in EDTA tubes from patients seen at the UCHealth University of Colorado Hospital and tested for SARS-CoV-2 epigenetic signatures starting on March 1, 2020. Blood specimens were collected from patients consented to the University of Colorado COVID-19 Biorepository (https://research.cuanschutz.edu/university-research/covid-19-clinical-research/covid-19-biobank-specimen-repository) or the University of Colorado Emergency Medicine Specimen Bank (EMSB)[25]. Control subjects included patients from each study who tested negative for SARS-CoV-2 infection during the index visit. Through the University of Colorado COVID-19 Biorepository and the EMSB, patients tested were consented for blood collection and data abstraction from their electronic health record (EHR). Data obtained from EHR abstraction included demographics, past medical history, laboratory testing (including SARS-CoV-2), treatments, vital signs, hospital disposition, and clinical outcomes. In addition, previously collected samples from patients with acute upper respiratory viral infections (SARS-CoV-2 negative/pan-negative for upper respiratory viral infections/positive for non-SARS-CoV-2 upper respiratory viral infections) between February 5, 2018 and January 1, 2020 were obtained through the EMSB as SARS-CoV-2-negative controls. Additional biospecimens included discarded clinical samples from patients not approached for biorepository enrollment through the UCHealth University of Colorado Hospital Clinical Laboratory. Discarded samples were linked to a limited EHR dataset through the Colorado Center for Personalized Medicine’s health data warehouse, Health Data Compass, and then deidentified. The limited dataset included age, gender, race, ethnicity, viral test status (SARS-CoV-2 and other upper respiratory viruses), and clinical outcomes. The use of discarded samples and accompanying limited datasets was determined to be exempt from Institutional Review Board approval and the need for informed consent by COMIRB. All samples were frozen at −20 °C after collection prior to processing for methylation analyses.
Customization of the Infinium MethylationEPIC Array
Following a literature review of known epigenetic associations with respiratory viral infections from recent CoV outbreaks, we selected additional content to enrich Illumina’s Infinium MethylationEPIC Array[26]. We specifically enriched for known HLA alleles accounting for known genomic variation[27] as well as multiple alternative haplotypes and unpublished reference sequences spanning the major histocompatibility complex genomic region, the natural killer cell immunoreceptor, and other immunogenetic loci (e.g., cytokines, interferon response genes), to enhance the sensitivity of immune response detection. The custom panel targeted 262 genes with 7831 additional probes. While the majority of the additional probes targeted unique sequences within the genome, a number of probes were intentionally designed to target genomic sequences with a limited degree of repetitiveness. The list of genes and the Illumina IDs for the probes that target these genes are given in Supplementary Data 1.
Methylation array and quality assessment
DNA extraction
Biospecimens were accessioned and tracked via the Colorado Anschutz Research Genetics Organization (CARGO) laboratory information management system (LIMS). Genomic DNA was extracted from SARS-CoV-2+ peripheral blood on the bead-based, automated extraction Maxwell(R) RSC System (Promega) in a biological safety cabinet in compliance with CDC safety guidelines and procedures for handling SARS-CoV-2 biospecimens (biospecimens from SARS-CoV-2+ cases) and from controls on the Autogen FlexSTAR+ using the Autogen’s FlexiGene Blood Extraction Kit (Holliston, MA). All DNA samples were quantified using both absorbance (NanoDrop 2000; Thermo Fisher Scientific, Waltham, MA) and fluorescence-based methods (Qubit; Thermo Fisher Scientific, Waltham, MA) using standard dyes selective for double-stranded DNA, minimizing the effects of contaminants that affect the quantitation. DNA quality was assessed using an Agilent TapeStation (Agilent, Santa Clara, CA). Samples were then uploaded to CARGO’s LIMS, barcoded, and labeled.
Bisulfite conversion and amplification
Purified DNA samples were processed using the Zymo EZ-96 DNA Methylation bisulfite conversion kits (Zymo, Irvine, CA) as described previously[28]. The product of this process contains cytosine converted to uracil if it was previously unmethylated. The bisulfite-treated DNA was subjected to whole-genome amplification via random hexamer priming and Phi29 DNA polymerase, and the amplification products were then enzymatically fragmented, purified from dNTPs, primers, and enzymes, and applied to the Illumina chip as described elsewhere[29].
Hybridization and single-base extension
The bisulfite-converted amplified DNA products were denatured into single strands and hybridized to the customized Infinium 850K BeadChip (EPIC+; Illumina Inc., San Diego, CA) via allele-specific annealing to either the methylation-specific probe or the non-methylation probe. Hybridization to the chip was followed by single-base extension with labeled di-deoxynucleotides according to Illumina’s Infinium protocol at the CARGO laboratory[28].
Fluorescence staining and scanning of chip
The hybridized BeadChips were stained, washed, and scanned to show the intensities of the un-methylated and methylated bead types using Illumina’s iScan System.
Data processing and quality control (QC)
IDAT files were processed, filtered, and normalized using the SeSAMe R package[30]. Type I probe channel was empirically determined from signal intensities. Probe detection P values (representing the ability to differentiate true signal from background fluorescence) were calculated for each color channel using pOOBAH, which leverages the fluorescence of out-of-band (OOB) probes. Normalization was performed using noob, which uses OOB probes to perform a normal-exponential deconvolution of fluorescent intensities[31]. Finally, a common dye bias that results in greater intensities in the red color channel was corrected to ensure that the distribution of intensities in the two color channels were equal. Probes with detection P values >0.05 were removed, as well as probes overlapping single-nucleotide polymorphisms with global minor allele frequency >1% in dbSNP, probes with poor mapping, and probes containing non-unique sequence according to Zhou et al.[32]. Beta values were logit-transformed into M values for modeling. Probes with >25% missingness were removed. Remaining missing values were then imputed with mean probe M value.
Selection of discovery/training and testing cohorts and controls
Case–control analyses were performed using the entire genotyped dataset passing epigenetics QC, with SARS-CoV-2 infection status determined as described above (see Fig. 1 for a summary of the workflow). Analyses were repeated including and excluding controls with other upper respiratory infections validated by clinical respiratory panels. Measurements of disease severity and progression (e.g., hospitalization, ICU admittance, ventilator use) were extracted from chart review within the UCHealth EHR.
Fig. 1
Flowchart of the study sample collection.
Six hundred and forty-eight samples were collected for analysis, of which 644 were processed on MethylationEPIC arrays. Five hundred and twenty-five arrays passed quality control and were included in the final analysis.
Flowchart of the study sample collection.
Six hundred and forty-eight samples were collected for analysis, of which 644 were processed on MethylationEPIC arrays. Five hundred and twenty-five arrays passed quality control and were included in the final analysis.
Control for batch effect and robustness of the identified epigenetic signatures
To minimize possible batch effects and other sources of variability, samples were split into SARS-CoV-2+ and SARS-CoV-2-negative control sets, randomized within sets to account for unavailable phenotypes, and then distributed across chips. To reduce batch and plating effects a minimum of two SARS-CoV-2+ and two SARS-CoV-2-negative control samples were run on each chip (12 chips per plate, 8 samples each) and positive/negative status was randomized across the chip.
Epigenome-wide association study (EWAS) with COVID-19 disease status
Preprocessing was performed using the GLINT[33] package for association testing and estimating components to adjust for population structure (EPISTRUCTURE[34]) and we used ReFACTor[35] to account for cell-type proportions. We chose ReFACTor to account for cell proportion information in a data-driven fashion. The linear mixed-effects model in GLINT was fit to each probe, testing for differences based on COVID-19 disease status while correcting for age, sex, chip position, 6 ReFACTor components, 1 EPISTRUCTURE component, and a variance component representing individual covariance[36]. Enrichment of top hits in common databases was performed using enrichR[37]. Probes were sorted by adjusted P value and the top 800 genes to which differentially methylated probes map were used as input to perform overrepresentation enrichment analysis within Gene Ontology (GO) categories, Kyoto Encyclopedia of Genes and Genomes pathways (KEGG), BioPlanet, and WikiPathways[38-41]. Probes were annotated to CpG island and genic regions using annotatr[42].
Clinical outcome stratification
Clinical data were abstracted via detailed chart review for all EMSB patients. COVID-19 disease severity was determined by an ordered severity score of (1) discharged from emergency department; (2) admitted to inpatient care; (3) progressed to ICU; and (4) death. We also determined a hospital duration variable, where individuals without a measured hospital stay (i.e., discharged from the emergency department) were assigned 0 and individuals who died were removed from the cohort for length of stay analysis to minimize bias associated with timing of decisions to withdraw care.
Construction and validation of a prediction model
Predictive modeling was performed using the Lasso[43] and Elastic Net[44] algorithms for sparse penalized regression modeling available in the glmnet software package[45]. For each prediction model, only autosomal methylation probes passing QC were included, to remove potential confounding from sex-linked chromosomes. No demographic, clinical, or cell count variables were included in the predictive models, requiring the algorithm to pick CpG sites with strong enough associations to surpass the level of penalization of the hyperparameters across the entire least angle regression path. For each trait of interest, a separate model was created and best-fitting parameters were chosen after tenfold cross-validation either by maximizing area under the receiver-operator characteristic curve (AUC for dichotomous traits) or minimizing mean-squared error (MSE for quantitative traits). Each was fit across a grid of parameters representing various strengths of penalization and combination of L1 and L2 penalties under the weighted elastic net model. Both the days of hospitalization and case severity were modeled as continuous outcomes. To assess performance for quantitative traits in a manner comparable to dichotomous traits, we swept across potential cutpoints to estimate AUCs for this newly derived dichotomous variable. While case–control status was the primary phenotype of interest, measures of severity were assessed in SARS-CoV-2+ cases only.To estimate stability of estimation in parameters, we performed 100 iterations of model training and testing. Within each iteration for case–control and severity outcomes, we employed tenfold cross-validation to derive the model and a held-out set of 30% removed from train/test to gauge out-of-sample performance of the best-fitting model. Our train/test and validation splits were created within each stratum to preserve representation across all outcomes and reflect the distribution across the total dataset. For hospitalization duration, the train/test/validation models had instability in convergence and so we reverted to a train/test model using the tenfold cross-validation within the default cv.glmnet() function. We assessed overall performance for the dichotomous COVID+/COVID− case–control status using out-of-sample AUC, the F1 score (a measure of the relationship between precision and recall), the distribution of best-fit λ penalty via cross-validation, and the number of probes chosen in the final model. For the quantitative outcomes, we assessed overall performance using out-of-sample R2, the slope of the model, and λ number of probes. Finally, these were stratified each across the elastic net weights (α) from 0.01 to 1, representing the proportion of ridge (L2) vs Lasso (L1) penalty to choose a final model. All models included nonzero λ to encourage sparsity (a L2-only model would include prediction from the entire array). Final models described in results were chosen based on best-performing (maximum R2 or AUC) vs median values for each chosen set of hyperparameters. The final, out-of-sample best-fit prediction for each outcome was considered the “methylation score” used in downstream modeling, characterization of association, and determination of potential confounding with demographic and blood cell proportion characteristics.
Authors: Golareh Agha; Michael M Mendelson; Cavin K Ward-Caviness; Roby Joehanes; TianXiao Huan; Rahul Gondalia; Elias Salfati; Jennifer A Brody; Giovanni Fiorito; Jan Bressler; Brian H Chen; Symen Ligthart; Simonetta Guarrera; Elena Colicino; Allan C Just; Simone Wahl; Christian Gieger; Amy R Vandiver; Toshiko Tanaka; Dena G Hernandez; Luke C Pilling; Andrew B Singleton; Carlotta Sacerdote; Vittorio Krogh; Salvatore Panico; Rosario Tumino; Yun Li; Guosheng Zhang; James D Stewart; James S Floyd; Kerri L Wiggins; Jerome I Rotter; Michael Multhaup; Kelly Bakulski; Steven Horvath; Philip S Tsao; Devin M Absher; Pantel Vokonas; Joel Hirschhorn; M Daniele Fallin; Chunyu Liu; Stefania Bandinelli; Eric Boerwinkle; Abbas Dehghan; Joel D Schwartz; Bruce M Psaty; Andrew P Feinberg; Lifang Hou; Luigi Ferrucci; Nona Sotoodehnia; Giuseppe Matullo; Annette Peters; Myriam Fornage; Themistocles L Assimes; Eric A Whitsel; Daniel Levy; Andrea A Baccarelli Journal: Circulation Date: 2019-08-19 Impact factor: 29.690
Authors: Maxim V Kuleshov; Matthew R Jones; Andrew D Rouillard; Nicolas F Fernandez; Qiaonan Duan; Zichen Wang; Simon Koplev; Sherry L Jenkins; Kathleen M Jagodnik; Alexander Lachmann; Michael G McDermott; Caroline D Monteiro; Gregory W Gundersen; Avi Ma'ayan Journal: Nucleic Acids Res Date: 2016-05-03 Impact factor: 16.971
Authors: Juliana Imgenberg-Kreuz; Jonas Carlsson Almlöf; Dag Leonard; Christopher Sjöwall; Ann-Christine Syvänen; Lars Rönnblom; Johanna K Sandling; Gunnel Nordmark Journal: Front Immunol Date: 2019-07-30 Impact factor: 7.561
Authors: Eugene Andres Houseman; William P Accomando; Devin C Koestler; Brock C Christensen; Carmen J Marsit; Heather H Nelson; John K Wiencke; Karl T Kelsey Journal: BMC Bioinformatics Date: 2012-05-08 Impact factor: 3.169
Authors: Peter Horby; Wei Shen Lim; Jonathan R Emberson; Marion Mafham; Jennifer L Bell; Louise Linsell; Natalie Staplin; Christopher Brightling; Andrew Ustianowski; Einas Elmahi; Benjamin Prudon; Christopher Green; Timothy Felton; David Chadwick; Kanchan Rege; Christopher Fegan; Lucy C Chappell; Saul N Faust; Thomas Jaki; Katie Jeffery; Alan Montgomery; Kathryn Rowan; Edmund Juszczak; J Kenneth Baillie; Richard Haynes; Martin J Landray Journal: N Engl J Med Date: 2020-07-17 Impact factor: 91.245
Authors: Veronica Davalos; Carlos A García-Prieto; Gerardo Ferrer; Sergio Aguilera-Albesa; Juan Valencia-Ramos; Agustí Rodríguez-Palmero; Montserrat Ruiz; Laura Planas-Serra; Iolanda Jordan; Iosune Alegría; Patricia Flores-Pérez; Verónica Cantarín; Victoria Fumadó; Maria Teresa Viadero; Carlos Rodrigo; Maria Méndez-Hernández; Eduardo López-Granados; Roger Colobran; Jacques G Rivière; Pere Soler-Palacín; Aurora Pujol; Manel Esteller Journal: EClinicalMedicine Date: 2022-06-25
Authors: Nicholas S Giroux; Shengli Ding; Micah T McClain; Thomas W Burke; Elizabeth Petzold; Hong A Chung; Grecia O Rivera; Ergang Wang; Rui Xi; Shree Bose; Tomer Rotstein; Bradly P Nicholson; Tianyi Chen; Ricardo Henao; Gregory D Sempowski; Thomas N Denny; Maria Iglesias De Ussel; Lisa L Satterwhite; Emily R Ko; Geoffrey S Ginsburg; Bryan D Kraft; Ephraim L Tsalik; Xiling Shen; Christopher W Woods Journal: Sci Rep Date: 2022-07-09 Impact factor: 4.996
Authors: Nicholas S Giroux; Shengli Ding; Micah T McClain; Thomas W Burke; Elizabeth Petzold; Hong A Chung; Grecia O Rivera; Ergang Wang; Rui Xi; Shree Bose; Tomer Rotstein; Bradly P Nicholson; Tianyi Chen; Ricardo Henao; Gregory D Sempowski; Thomas N Denny; Maria Iglesias De Ussel; Lisa L Satterwhite; Emily R Ko; Geoffrey S Ginsburg; Bryan D Kraft; Ephraim L Tsalik; Xiling Shen; Christopher Woods Journal: Res Sq Date: 2022-04-07
Authors: Ivana Yang; Christopher Gignoux; Kathleen C Barnes; Cosby G Arnold; Iain Konigsberg; Jason Y Adams; Sunita Sharma; Neil Aggarwal; Andrew Hopkinson; Alexis Vest; Monica Campbell; Meher Boorgula; Andrew A Monte Journal: Hum Genomics Date: 2022-07-27 Impact factor: 6.481
Authors: Guillermo Barturen; Elena Carnero-Montoro; Manuel Martínez-Bueno; Silvia Rojo-Rello; Beatriz Sobrino; Óscar Porras-Perales; Clara Alcántara-Domínguez; David Bernardo; Marta E Alarcón-Riquelme Journal: Nat Commun Date: 2022-08-06 Impact factor: 17.694