Gabriel E Hoffman1,2,3, Jaroslav Bendl1,3, Kiran Girdhar1,3, Eric E Schadt2,3,4, Panos Roussos1,2,3,5,6,7. 1. Pamela Sklar Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, NY, USA. 2. Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA. 3. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA. 4. Sema4, Stamford, CT, USA. 5. Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY, USA. 6. Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA. 7. Mental Illness Research, Education, and Clinical Center (VISN 2 South), James J. Peters VA Medical Center, Bronx, NY, USA.
Abstract
Identifying functional variants underlying disease risk and adoption of personalized medicine are currently limited by the challenge of interpreting the functional consequences of genetic variants. Predicting the functional effects of disease-associated protein-coding variants is increasingly routine. Yet, the vast majority of risk variants are non-coding, and predicting the functional consequence and prioritizing variants for functional validation remains a major challenge. Here, we develop a deep learning model to accurately predict locus-specific signals from four epigenetic assays using only DNA sequence as input. Given the predicted epigenetic signal from DNA sequence for the reference and alternative alleles at a given locus, we generate a score of the predicted epigenetic consequences for 438 million variants observed in previous sequencing projects. These impact scores are assay-specific, are predictive of allele-specific transcription factor binding and are enriched for variants associated with gene expression and disease risk. Nucleotide-level functional consequence scores for non-coding variants can refine the mechanism of known functional variants, identify novel risk variants and prioritize downstream experiments.
Identifying functional variants underlying disease risk and adoption of personalized medicine are currently limited by the challenge of interpreting the functional consequences of genetic variants. Predicting the functional effects of disease-associated protein-coding variants is increasingly routine. Yet, the vast majority of risk variants are non-coding, and predicting the functional consequence and prioritizing variants for functional validation remains a major challenge. Here, we develop a deep learning model to accurately predict locus-specific signals from four epigenetic assays using only DNA sequence as input. Given the predicted epigenetic signal from DNA sequence for the reference and alternative alleles at a given locus, we generate a score of the predicted epigenetic consequences for 438 million variants observed in previous sequencing projects. These impact scores are assay-specific, are predictive of allele-specific transcription factor binding and are enriched for variants associated with gene expression and disease risk. Nucleotide-level functional consequence scores for non-coding variants can refine the mechanism of known functional variants, identify novel risk variants and prioritize downstream experiments.
Genome-wide association studies (GWAS) have identified thousands of loci associated with risk to human diseases (1). Yet progress in understanding the molecular etiology of disease and the development of novel therapies has been limited by the fact that these studies are often not able to identify a specific functional variant and mechanistically relevant gene due to linkage disequilibrium (LD) (1–4). Integrating independent biological knowledge has the potential to increase the resolution of the associated region and improve the interpretation of GWAS results (5–7). Most notably, risk variants are enriched in non-coding regulatory regions (8–10). While interpreting the functional consequences of protein coding variants has been remarkably successful and improved the understanding of the biology of human disease (11–14), the rules governing the functional effects of variants in non-coding regulatory DNA have been more challenging to decipher. Novel approaches are needed to interpret non-coding variants from ongoing whole genome sequencing projects, for example, of somatic variants in cancer (15) and de novo variants in autism (16).Recent work has sought to better understand the regulatory genome by characterizing the epigenetic differences in transcription factor (TF) binding, chromatin accessibility and histone modifications between tissues and cell types (17–19). Yet, these epigenetic tracks can cover a substantial portion of the genome, even though polymorphisms at only a fraction of sites are presumed to have a functional consequence. Moreover, these efforts have generally not integrated genetic variation. Other efforts have focused on the effects of genetic variation on gene expression (20–22) as well as multiple epigenetic assays (23–26). Yet, these molecular trait QTL studies are subject to the same challenges with linkage disequilibrium as GWAS so they generally cannot pinpoint the functional variant, or predict the functional consequence of a rare variant not observed in the dataset.Recent progress in developing computational models able to predict TF binding, chromatin accessibility, and histone modifications from only the genome sequence in the surrounding region offers a novel paradigm to interpret the functional consequences of non-coding variants (27–31). These models leverage advances in deep learning (32) to use DNA sequence context in the predictive model of functional consequences. Yet with few exceptions (28), these computational approaches consider only the discrete absence versus presence of an epigenetic signal (27,29–31). Moreover, these methods rely on the sequence of the reference genome so they do not model the contribution of genetic variation driving the epigenetic signal. Based on the extensive contribution of genetic variation to molecular phenotypes (20–26), and the increasing availability of epigenetics datasets from multiple individuals paired with genetic data (23–26,33), integrating genetics into model training has the potential to improve prediction accuracy and increase power of variant impact predictions. Finally, although these methods are trained jointly across many datasets, they only consider a single experiment from a given cell type and assay.Here we introduce a deep learning framework for functional interpretation of genetic variants (DeepFIGV). Our approach extends the method of Kelley et al. (28) by (i) performing model training on many epigenetic experiments for a particular assay and cell type instead just a single representative sample, (ii) integrating whole genome sequencing to create a personalized genome sequence for each individual, and (iii) modeling quantitative variation in the epigenetic readout rather than dividing the genome into two classes. We develop predictive models of quantitative epigenetic variation in chromatin accessibility from DNase-seq and histone modifications (H3K27ac, H3K4me3 and H3K4me1) from 75 lymphoblastoid cell lines (LCL) (23,26). By training the models on many experiments from the same cell type and assay, integrating whole genome sequencing, and modeling quantitative variation in the epigenetic signal, we identify genetic variants with functional effects on the epigenome.
MATERIALS AND METHODS
Epigenomic data from lymphoblastoid cell lines
The dataset comprises ChIP-seq experiments for 3 histone modifications (H3K27ac, H3K4me1 and H3K4me3) for 75 individuals (23) and DNase I hypersensitivity experiments for 69 individuals (26). All individuals are of Yoruban ancestry from the 1000 Genomes Project (34). Processed data was downloaded from the ChromoVar3D website (chromovar3d.stanford.edu). Peak coordinates and signal intensities for each sample and each DNase peak were extracted from DNase_removeBlacklist_Log10PvalueThreshold_5_DATA_MATRIX.gz, and corresponding files were used for the 3 histone modifications. VCF of variants from whole genome sequencing was obtained from the same website.
Deep learning with a convolutional neural network
An extended version of the basset software (30) was used to learn parameters in a predictive model mapping from genome sequence as input to epigenetic signal as output. The analysis was customized to take advantage of this particular dataset by (i) integrating genetic variation from whole genome sequencing, (ii) modeling the quantitative variation in the epigenetic signal and (iii) combining many experiments from the same cell type into a large single-task learning application. This customized analysis enables a focus on genetic variants with relatively small effects on the quantitative signal value, rather than the strong effect required to completely lose or gain a binding or histone modification event. Each of the four assays was analyzed separately using a single-task learning approach.
Constructing DNA sequence as input to neural network
Personalized genome sequences were constructed using the GRCh37 reference genome with sites modified according to biallelic SNPs in the whole genome sequence using the bcftools consensus command. At homozygous alternate sites the reference allele is simply replaced by the alternative allele. Heterozygous sites are represented using the IUPAC nucleotide codes (35), so that for example an A/C heterozygote is indicated with the characters ‘M’. Only biallelic SNPs are considered, so there are 6 additional characters, one for each pair of nucleotides.Homozygous sites are one-hot coded as a matrix of mostly 0’s with 4 rows corresponding to ‘A’, ‘T’, ‘C’ and ‘G’. Coding a 1 in the ‘T’ row indicates the presence of that nucleotide in the corresponding position in the genome sequence. Heterozygous sites are encoded with a value of 0.5 in the two corresponding rows. Thus the training data does not explicitly include any information about phasing of the SNPs or allele-specific signals.For each peak interval called in the processed data, the genome sequence within a specified distance from the center of the peak was extracted and matched to the corresponding signal value. Peaks exceeding an assay-specific width cutoff were excluded from the analysis (Supplementary Table S1). An assay-specific window size (DNase: 300 bp, H3K27ac: 2000 bp, H3K4me3: 2000 bp, H3K4me1: 1400 bp) was used to extract regions from the personalized reference genomes (Supplementary Table S1). Larger window sizes have been shown to increase prediction performance (27), and we used the largest window size where encoding the DNA sequence from all peaks and all individuals to one-hot coded format could be computed on a machine with 256 Gb RAM. This high memory usage is a limitation of the current basset implementation (30).
Model training and testing
The default model architecture from Basset (30) was used to train a 3 layer deep neural network and 300 convolutional filters each 19 bp wide. The model included rectified linear unit (ReLU) and max pooling in order to learning a non-linear function mapping from DNA sequence to epigenetic readout (Supplementary Table S2). A 30% dropout was applied to avoid overfitting. All training was performed on NVIDIA Tesla K20X GPU. Training on a single assay took between 14 and 45 GPU hours. We observed that changing the number of filters between 100 and 400 and changing the filter width between 10 and 20 bp did not produce a substantial change in prediction accuracy. Multiple restarts gave similar prediction accuracy.The dataset was divided into training, validation and testing sets. In order to avoid overfitting, an early stopping approach was used where the parameter values in the model were learned from the training set, but the final values were selected to minimize the squared prediction error in the validation set. Training was stopped after 10 epochs with no decrease in error in the validation set. The prediction performance for each assay was reported based on the test set.The training, validation and testing sets were specially constructed using a conservative approach in order to ensure independence of the three sets. Since the signal values at a given peak are relatively similar across individuals, including the same peak region, albeit from different individuals, in both the training and testing sets could overstate the prediction performance. Similarly, peaks from the same individual are generated under the same experimental conditions and are subject to technical batch effects. Thus, including peaks from the same individual in both the training and testing set could also overstate the prediction performance. In order to avoid this issue, the test set is composed of peaks on chr1–chr8 from 60% of individuals, the validation set is composed of peaks on chr9-chr15 from the next 20% of individuals, and the test set is composed of peaks on chr16-chr22 from last 20% of individuals. Thus, the three sets have no overlap in either peaks or individuals (Supplementary Table S1, Figure S1) to ensure a conservative estimate of prediction performance.We describe the analysis workflow with numbers from DNase data; numbers for other assays are shown in Supplementary Table S1. There were 681 990 total DNase peak intervals from 69 individuals. In order to focus on peaks of approximately equal size, peaks exceeding 250 bp were excluded. This left 463 094 peaks (67.9% of total) with a mean width of 150.7 bp. Multiplying the number of remaining peaks by the number of individuals gives a dataset of 31 953 486 examples. Since DNase and histone modification ChIP-seq are not strand specific-assays, the reverse complement sequence gives the same epigenetic signal as the original sequence. Augmenting the dataset by including the reverse complement of each example doubles the number of sequence-signal pairs. Constructing the training set from peaks on chr1–chr8 from the first 60% of individuals gives 229 421 unique peak regions and 18 812 522 total examples.
Genomic correlates
Minor allele frequency across populations were obtained from gnomAD r2.0.2 (11). Transcription factor binding motifs were obtained from the JASPAR 2018 database (36) and genomic location of bindings sites were identified with FIMO (37). For genome-wide summaries, a TFBS score cutoff of 500 was used, correspond to a P-value cutoff of 1e–5. For specific lookups (i.e. Figures 4F and 6C), a more liberal cutoff of 400 (i.e. P-value of 1e–4) was used. Genomic locations from ChIP-seq experiments for transcription factors in LCL GM12878 (18) were downloaded from http://egg2.wustl.edu/roadmap/src/chromHMM/bin/COORDS/hg19/TFBS/gm12878/. List of genes expressed in LCLs were obtained from http://egg2.wustl.edu/roadmap/src/chromHMM/bin/COORDS/hg19/expr/gm12878/. ChromHMM tracks (17) for LCL GM12878 were downloaded from http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/E116_18_core_K27ac_dense.bed.gz. Genome annotation of sites were obtained from VEP v85 (38) provided by gnomAD (11). CpG islands were obtained from Annotatr (39).
Figure 4.
DeepFIGV scores predict results of molecular trait QTL analysis. (A–C) Lead variants from molecular trait QTL analysis from lymphoblastoid cell lines are enriched for SNPs with DeepFIGV absolute z-score exceeding a range of cutoffs. Enrichments are evaluated using (A) DeepFIGV scores for 4 assays for DNase-QTLs from LCL’s of Yoruban ancestry, (B) expression QTLs from LCL’s of European ancestry, and (C) chromatin accessibility QTL’s assayed by ATAC-seq in human post mortem brain homogenate of European ancestry. Shaded regions indicated 95% confidence intervals. (D) Rare variants associated with gene expression outliers are enriched for DeepFIGV absolute z-score for DNase compared to rare variants not associated with outliers. Shaded regions indicated 95% confidence intervals. (E) Enrichment of somatic variants in cancer that drive gene expression changes (15) for strong DeepFIGV scores. (F) Candidate causal variants for expression QTLs with higher posterior probability are enriched for exceeding a DeepFIGV absolute z-score of 10 for DNase. Enrichments are shown for skin and LCL samples from TwinsUK, and LCL samples from GEAUVIDIS (21). Shaded regions indicated 95% confidence intervals. (G) DeepFIGV independently identifies candidate causal variant rs11547207 (shown in red) for QTL affecting expression of both CIB2 and IDH3A. eQTL analysis of GEUAVDIS identifies many correlated variants associated with these genes, but statistical fine-mapping identifies a signal candidate variant (42). Although this variant is not an DNase QTL in LCLs Yoruban individuals (23), DeepFIGV analysis on the same data identifies the same candidate causal variant identified by statistical fine mapping. In silico mutagenesis of 50bp around rs11547207 indicates that variants at nearby positions are predicted to decrease the DNase signal. Size of letters in DNA sequence is proportional to the maximum absolute delta at that position. Bottom panel shows TFBS motifs.
Figure 6.
Variants with large DeepFIGV z-scores are enriched for activity in experimental massively parallel reporter assay. (A) Variants that modulate gene expression in this assay are enriched for having large DeepFIGV z-scores compared to variants in sequences that do not drive expression. (B) Variants that modulate gene expression in this assay are enriched for having large DeepFIGV z-scores for DNase compared to variants that have no allelic effect in this experiment. Shaded regions indicate 95% confidence intervals.
Comparison to canonical motifs
Each convolutional filter in the first layer of the neural network reads in 19 bp at a time and uses a weight for each of the four nucleotides to transform the DNA sequence to the next layer of the neural network. Each filter is a 19 × 4 matrix of continuous values that can be treated as a position weight matrix (PWM) (30). Each PWM learned in the current dataset were then compared to PWM’s of transcription factors from the JASPAR 2018 database (36). We applied the widely used software tomtom (40) to query a our new set of PWM’s against a know database of PWM and generate p- and q-values. Since a given filter can show high similarity to multiple JASPAR motifs, only the best match is reported. Motifs were visualized using ggseqlogo (41).
Evaluating variant effects
Coordinates and alleles of SNPs were obtained from multiple public resources (Supplementary Table S3) and combined into a non-redundant list comprising 413 223 060 sites and 437 960 283 variants (due to multi-allelic sites). The delta between the predicted signal from the reference and alternative alleles was evaluated for each of the four epigenetic assays. The median and standard deviation of the delta values for each assay were obtained for 208 million biallelic SNVs from whole genome sequencing (WGS) from gnomAD r2.0.2 and were used to compute z-scores for the entire set of variants for the corresponding assay. This approach used sites distributed across the genome that were identified independently of their predicted functional consequence and avoids double counting multiallelic sites. The standard deviation was computed using a robust method (i.e. winsorized) where delta values below the 1st percentile or above the 99th percentile were set to the value at the corresponding cutoff. This approach reduced the effect of variants with extreme scores. Changing the cutoff values had a very minimal effect of the resulting z-scores.Evaluating all variants for the four assays took 5929 GPU hours using 10 NVIDIA Tesla K20X GPUs.
Integration with molecular trait QTLs
We downloaded QTLs for gene expression, DNase and histone modifications on Yoruban individuals (23), QTLs for gene expression on LCLs from multiple European populations (21). Enrichment was evaluated by comparing the DeepFIGV absolute z-cores from the lead QTLs to the scores for to variants ranked between 5th and 10th. Statistical fine mapping results were obtained from multiple cell types (42). Rare variants associated with gene expression outliers from multiple tissues were obtained from GTEx (43). Enrichments are evaluated based on 2113 rare variants associated with outliers and 67 044 not associated with outliers.
Chromatin accessibility QTLs from brain homogenate
Raw ATAC-seq (44) data from 189 human post mortem brains (45) of European ancestry was aligned to GRCh38 with STAR aligner (46). To create a final peakset, we subsampled and merged BAM-files separately for schizophrenia-case and control samples. We subsequently called peaks separately on these two merged BAM files with MACS2 (47) at q-value < 0.01, and merged these two peaksets into a single consensus peakset. For each sample, the reads in each consensus peak were quantified by featureCounts (48). Only peaks with 1 counts per million in at least 10% samples were retained for QTL analysis on the TMM normalized log2 counts per million values (49). QTL analysis was performed with QTLtools (50) using 5 ancestry PC’s and variants with MAF > 5%. Covariates also included 10 PEER components (51) gender, and GC content of reads for each peak for each sample. QTL analysis was performed on variants within 2 kb of each peak boundary.
Cancer somatic variants driving gene expression
We downloaded somatic variants in tumors that were identified by whole genome sequencing and results from an eQTL analysis combining nearby variants and testing the association with proximal genes (15). We considered the 569 genes with cis-eQTLs at FDR < 30% and evaluated the DeepFIGV z-score for each of four epigenetic assays for the 2309 somatic variants in the proximal regions. The enrichment analysis compared these variants to somatic variants in this dataset that were not associated with gene expression changes and which were matched for distance to transcription start site.
Prediction of allele specific binding (ASB)
DeepFIGV scores were used to predict the presence and direction of ASB using sites identified from transcription factor ChIP-seq and DNase I hypersensitivity experiments in LCLs (52,53) and HeLa-S3 cells (53). For AlleleDB (52), the ASB status for 42 ChIP-seq targets across 14 individuals totaling 77 experiments were reported at a total of 276,589 sites with sufficient read coverage (accB.auto.v2.1.aug16.txt.gz). Shi et al. (53) reported ABS for 36 targets across 7 LCL and HeLa-S3 cells across 51 518 total sites (ASB_GM12878_HeLa_1based.txt, ASB_other_GMs_1based.txt). Sites from the two databases were combined to produce a non-redundant set, so sites identified for the same target and individual in both databases were not double counted. Sites were considered as ASB or non-ASB based on a Benjamini-Hochberg corrected P-value (beta-binomial for AlleleDB and binomial for Shi et al.) <0.05, or >0.99, respectively. Only assays with at least 20 ASB examples were considered.Precision-recall (PR) curves and area under the PR curve (AUPR) were used to evaluate the classification performance since the ASB vs non-ASB class counts were very imbalanced. PR curves and AUPR of empirical and random classes classifiers were evaluated with the PRROC package (54).No allele-specific signal was used in the training of DeepFIGV and no re-training was performed for the ASB analysis. The DeepFIGV z-scores for each of the 4 assays in the training set were extracted for each site, and the PR and AUPR were computed by the intersecting these scores with the combined ASB dataset. Classifying ASB sites from non-ASB sites used the absolute value of the z-scores, while classifying the direction of ASB used the z-score itself. ASB magnitude was encoded as the number of alternative reads at a site divided by the total number of reads at that site. Thus, a positive ASB magnitude corresponds to a positive DeepFIGV z-score indicating that the alternative allele is predicted to increase signal compared to the reference allele.
Publicly available GWAS summary statistics were obtained for immune diseases as well as other representative diseases and traits. We performed a partitioned heritability analysis with LD-score regression (LDSC) (8) in order to quantify the contribution to the trait heritability of variants with high absolute DeepFIGV z-scores. The per-SNP heritability was computed after accounting for the other genomic annotations. Annotations included 28 provided with LDSC baseline model (i.e. TFBS, TSS, UTR, intron, promoter, enhancer, superenhancer, epigenetic assays multiple sources (H3K27ac, H3K4me1, H3K4me3, H3K9ac, DNase)) in addition to peak regions from the four assays in LCLs used by DeepFIGV. DeepFIGV was the only annotation with nucleotide level resolution; other annotations were 10s or 100s of bases wide.This analysis was restricted to common variants outside of the MHC region. The per-SNP heritability was evaluated for site exceeding absolute z-score cutoffs for 1, 2, 3, 4 and 5. There was not a sufficient number of sites with larger scores, since only common variants were considered.
Candidate causal SNPs identified from finemapping analysis of autoimmune diseases were obtained from http://www.broadinstitute.org/pubs/finemapping (10). The DeepFIGV z-scores were obtained for the 8741 candidate causal SNPs for 39 traits. Enrichments for each trait were evaluated by comparing the number of candidate causal sites with a DeepFIGV absolute z-score exceeding a given cutoff to the expected value from a random set of sites from a null distribution. This null was constructed for each site by drawing 10 000 sites from across the genome matching the MAF, gene density, distance to nearest genes and number of sites within LD of 0.5 of the original site (55).
Comparison to other variant scoring methods
Variant-level scores were obtained from DeepSea (27) evaluated on 17 DNase datasets and deltaSVM (29) evaluated on 35 DNase datasets, and CAPE (56) evaluated on two datasets. In addition we included CADD (57) and LINSIGHT (58). Scores were obtained from: https://www.ncbi.nlm.nih.gov/research/snpdelscore/rawdata/For DeepSea and deltaSVM the reported delta values for the predicted signal from the reference and alternative alleles were transformed to a z-score using the observed standard deviation. Analysis was performed on a shared set of 12 million variants.
Analysis of LCL MPRA results
Results were downloaded from Tewhey et al. (59) and variants were divided into three classes: (i) expression modulating variants that showed significant difference in expression between reference and alternative alleles, (ii) variants that drove expression but did not show allelic differences, and (iii) variants whose sequence did not drive expression in this assay. Enrichment of expression modulating variants (i.e. class 1) were compared to the other two classes as a function of high predicted epigenetic signal or DeepFIGV z-scores for each assay.
Predicting pathogenic from benign variants in ClinVar
ClinVar variants were downloaded on April 22, 2019. Variants labeled as ‘Pathogenic’ or ‘Likely_pathogenic’ was considered as the positive class and variants labeled ‘Benign’ or ‘Likely_benign’ were considered the negative class. Variants were then stratified based on variant consequence (i.e. missense, 3′ UTR, etc) and the performance of each prediction (DeepFIGV, CADD (57), GWAVA (60)) were evaluated in each strata. Due to the high degree of class imbalance, performance of each predictor was reported as the area under the precision recall curve (AUPR). We note that performance of a random prediction varied across variant consequence since it depends on the class ratios.
Evaluation of CAGI5 regulation saturation data
Since DeepFIGV predictions were learned on a different dataset, we evaluated concordance between DeepFIGV and CAGI5 relative expression values on the combined training + test set from CAGI5. Data was obtained from genomeinterpretation.org.
RESULTS
Deep learning maps from genome sequence to epigenetic signal
DeepFIGV combines the quantitative signal from epigenetic experiments across multiple individuals with whole genome sequencing into a single machine learning task (Figure 1). While standard molecular trait QTL analyses rely on the correlation between the epigenetic signal and a given genetic variant (Figure 1A, B), deep learning using a convolutional neural network explicitly models the DNA sequence context to train a predictive model (Figure 1C, D). Evaluating the predicted effect of each variant produces a large database of nucleotide-level scores (Figure 1E, F) that can be integrated with other analyses to refine the mechanism of known functional variants, identify novel risk variants and prioritize downstream experiments (Figure 1G).
Figure 1.
Computational workflow for Deep Functional Interpretation of Genetic Variants (DeepFIGV). (A) Quantitative signal from epigenetic assay (i.e. ChIP-seq, DNase-seq) across multiple individuals and genomic regions. (B) Standard genetic analysis stratifies quantitative signal by the allelic state at a given SNP, yet linkage disequilibrium complicates the interpretation of the functional variant. (C) DeepFIGV encodes a DNA sequences as an ‘image’ matrix of mostly zeros with a 1 (i.e. a dark box) indicating the presence of a particular nucleotide at that position. Heterozygous SNPs are encoded as a 0.5 for each allele. Convolutions are local matrix operations with parameter values learned from the data. A neural network uses the convolutions to predict the epigenetic signal from the DNA sequence. (D) Training the computational model links DNA sequences from many individuals to the epigenetic signal in each region. (E) The epigenetic signal is estimated for a query sequence with the reference and the alternate allele. The difference between the estimated signal values (i.e. delta) indicates the predicted effect of the variant. (F) In silico mutagenesis evaluates the delta value for every possible single nucleotide substitution. (G) DeepFIGV delta values are used to predict allele specific binding of transcription factors and identify candidate functional variants.
Computational workflow for Deep Functional Interpretation of Genetic Variants (DeepFIGV). (A) Quantitative signal from epigenetic assay (i.e. ChIP-seq, DNase-seq) across multiple individuals and genomic regions. (B) Standard genetic analysis stratifies quantitative signal by the allelic state at a given SNP, yet linkage disequilibrium complicates the interpretation of the functional variant. (C) DeepFIGV encodes a DNA sequences as an ‘image’ matrix of mostly zeros with a 1 (i.e. a dark box) indicating the presence of a particular nucleotide at that position. Heterozygous SNPs are encoded as a 0.5 for each allele. Convolutions are local matrix operations with parameter values learned from the data. A neural network uses the convolutions to predict the epigenetic signal from the DNA sequence. (D) Training the computational model links DNA sequences from many individuals to the epigenetic signal in each region. (E) The epigenetic signal is estimated for a query sequence with the reference and the alternate allele. The difference between the estimated signal values (i.e. delta) indicates the predicted effect of the variant. (F) In silico mutagenesis evaluates the delta value for every possible single nucleotide substitution. (G) DeepFIGV delta values are used to predict allele specific binding of transcription factors and identify candidate functional variants.Datasets from each of four epigenetic assays (DNase-seq and H3K27ac, H3K4me3, and H3K4me1 histone marks) were analyzed separately (Figure 2). Previous quantifications using liberal peak calling thresholds were used in order to capture a wide range of quantitative variation (23,26). The parameters of the convolutional neural network were chosen to minimize the least squares prediction error. Extensive steps were taken to avoid overfitting, and all prediction results are reported on a set of individuals and chromosomes that were excluded from the training set (see Materials and Methods, Supplementary Figure S1). Increasing the number of individuals in the training set and including genetic variation in the genome sequence of each individual decreased prediction error on withheld test data (Supplementary Figure S2). Although the model uses only DNA sequence in the predictions, the predicted DNase signal shows strong concordance in the test set with the observed signal in peak regions (Spearman rho = 0.485, Pearson R = 0.707) (Figure 2A). Focusing on the more robust (i.e. rank based) Spearman correlation metric shows that these predictive models give substantial accuracy for all four assays for the quantitative epigenetic signal (Figure 2B). Genomic intervals corresponding to ENCODE blacklisted regions were already excluded and removing additional genomic intervals based on low sequence uniqueness/mappability (61) did not change the results (Supplementary Table S4). Examining the predicted signal for DNase for a representative example in the test set along a segment of chromosome 22 shows notable concordance with the observed signal in peak regions (Figure 2C).
Figure 2.
Evaluating DeepFIGV model and interpreting variant scores. (A) Predicted DNase signal compared to observed DNase signal evaluated on the test set. (B) Spearman correlation between predicted and observed epigenetic signal on the test set for four assays. (C) Predicted and observed DNase signal for peaks in a 300 kb segment on chr22 in the test set. (D) Mean, median and standard deviation of the delta scores for 208 million biallelic SNPs for each assay. (E) Density plot of z-scores for four assays. Dashed line indicates the null distribution of the z-scores, which is the standard normal distribution.
Evaluating DeepFIGV model and interpreting variant scores. (A) Predicted DNase signal compared to observed DNase signal evaluated on the test set. (B) Spearman correlation between predicted and observed epigenetic signal on the test set for four assays. (C) Predicted and observed DNase signal for peaks in a 300 kb segment on chr22 in the test set. (D) Mean, median and standard deviation of the delta scores for 208 million biallelic SNPs for each assay. (E) Density plot of z-scores for four assays. Dashed line indicates the null distribution of the z-scores, which is the standard normal distribution.Functional impact scores from the predicted difference in epigenetic signal for the reference versus the alternate allele were evaluated for 438 million variants observed in previous sequencing projects (Supplementary Table S3). Subsequent analysis was performed on 208 million biallelic SNPs from the gnomAD database of 15 000 whole genome sequences (11). The delta value for each variant is defined as Δ = SALT – SREF with terms representing the predicted epigenetic signal from the alternative and reference alleles, respectively. Thus, a positive delta value indicates that the alternative allele increases the epigenetic signal compared to the reference allele. As expected, the mean and median delta values for all assays were very close to zero (Figure 2D). Transforming these delta values to a standard scale (i.e. z-score) by dividing by the standard deviation for each assay shows an excess of variants with scores near zero compared to the standard normal distribution (Figure 2E). This is consistent with the vast majority of variants having no functional effect on the epigenome. Yet, there is an excess of variants with large effects on all four epigenetic assays, with DNase showing the highest excess followed by H3K27ac, H3K4me3 and finally H3K4me1.
Genomic correlates of predicted variant effects
Although no prior biological information is included in training the model, DeepFIGV recovers multiple aspects of known regulatory biology (Figure 3). The predictive model learned by the convolutional neural network is composed of a set of local sequence features called filters. Although learned de novo, the predictive sequences features extracted by these filters are often similar to known transcription factor bindings site (TFBS) motifs. Some filters have a direct correspondence to a known motif, but other filters model only a portion of a motif so that multiple filters combine to capture the signal encoded by the sequence (Figure 3A). Variants in TFBS motifs are enriched for the alternative allele decreasing the DNase signal (Figure 3B). TFBS nucleotides with high information content (i.e. high weight) in the position weight matrix have an even stronger enrichment for decreasing the DNase signal, consistent with variants being more likely to weaken rather than strengthen the affinity of a TFBS motif (Supplementary Figure S3). The TFBS enrichments are consistent with the biology of these assays: variants predicted to affect the open chromatin assay DNase are most enriched for TFBS motifs, followed by the H3K4me3 promoter mark and the H3K27ac active promoter and enhancer mark (Figure 3C). H3K4me1 is an enhancer mark that tags active or primed sequences and is not enriched for strong variants in TFBS.
Figure 3.
Genomic enrichments of predicted functional variants. (A) Canonical transcription factor binding motif along with the motif representation of convolutional filters learn from the DNase dataset. P-values indicate the probability of concordance this high between a canonical motif and convolutional filter occurring by chance given the motif database. Q-values correct for multiple testing since 300 convolutional filters were queried. (B) Ratio indicating the fraction of sites where the alternative allele increases the DNase signal (i.e has a positive DeepFIGV z-score) for DNase for four cutoffs. Ratios are shown for variants within or outside a TFBS motif. A value of 0.5 indicates an equal number of variants with positive and negative z-scores. A value <0.5 indicates an depletion of variants where the alternative allele increases the DNase signal, corresponding to an excess of variants where the alterative allele decreases DNase signal. (C) Fraction of variants in transcription factor binding sites that exceed a DeepFIGV absolute z-score of 10 for each of four assays. (D) Fraction of sites that are in a transcription factor binding site motif, or in the flanking 5 or 10 bp, for a range of DeepFIGV absolute z-score cutoffs for DNase. (E) Enrichment of variants near a TFBS motif exceeding four z-score cutoffs for DNase. Black box indicates median size of TFBS motif. (F) Enrichment of sites with absolute z-scores greater than 5 near the transcription start site of genes stratified by whether the genes are expressed in LCLs. Sites with absolute z-scores less than the genome-wide mean are used as the baseline for the enrichment. (G) Fraction of sites with absolute z-scores for DNase greater than 10 within 7 minor allele frequency bins based on non-Finnish Europeans from gnomAD. Dashed line indicates genome-wide fraction of sites. P-value is based a logistic regression where the response is a binary variable indicating if the absolute z-scores is >10 and the log minor allele frequency is the predictor. Error bars show 95% confidence intervals.
Genomic enrichments of predicted functional variants. (A) Canonical transcription factor binding motif along with the motif representation of convolutional filters learn from the DNase dataset. P-values indicate the probability of concordance this high between a canonical motif and convolutional filter occurring by chance given the motif database. Q-values correct for multiple testing since 300 convolutional filters were queried. (B) Ratio indicating the fraction of sites where the alternative allele increases the DNase signal (i.e has a positive DeepFIGV z-score) for DNase for four cutoffs. Ratios are shown for variants within or outside a TFBS motif. A value of 0.5 indicates an equal number of variants with positive and negative z-scores. A value <0.5 indicates an depletion of variants where the alternative allele increases the DNase signal, corresponding to an excess of variants where the alterative allele decreases DNase signal. (C) Fraction of variants in transcription factor binding sites that exceed a DeepFIGV absolute z-score of 10 for each of four assays. (D) Fraction of sites that are in a transcription factor binding site motif, or in the flanking 5 or 10 bp, for a range of DeepFIGV absolute z-score cutoffs for DNase. (E) Enrichment of variants near a TFBS motif exceeding four z-score cutoffs for DNase. Black box indicates median size of TFBS motif. (F) Enrichment of sites with absolute z-scores greater than 5 near the transcription start site of genes stratified by whether the genes are expressed in LCLs. Sites with absolute z-scores less than the genome-wide mean are used as the baseline for the enrichment. (G) Fraction of sites with absolute z-scores for DNase greater than 10 within 7 minor allele frequency bins based on non-Finnish Europeans from gnomAD. Dashed line indicates genome-wide fraction of sites. P-value is based a logistic regression where the response is a binary variable indicating if the absolute z-scores is >10 and the log minor allele frequency is the predictor. Error bars show 95% confidence intervals.The role of TFBS in regulating gene expression and epigenetics is well established, and consequences of variants in TFBS motifs are more interpretable than variants in other genome annotations (18,36,62). Yet despite notable enrichment in TFBS motifs, variants in these motifs account for a minority of variants with strong DeepFIGV scores. Only 30.1% of sites with an absolute z-score between 9 and 10, and 44.4% with an absolute z-score between 29 and 30 for DNase fall in a TFBS (Figure 3D, Supplementary Figure S4). Including flanking nucleotides within 5 or 10 bp increases these percentages, but for most z-score cutoffs, variants in or proximal to known TFBS motifs are a minority. While variants in TFBS motifs are enriched for variants predicted to affect DNase signal, the enrichment is not observed when expanding beyond these proximal nucleotides (Figure 3E for DNase, Supplementary Figure S3C for ChIP-seq). We note that these enrichments are not being driven by biases in peak location since variants located within peak intervals for each assay show a similar enrichment profile as variants not located within the peak intervals (Supplementary Figure S3D). Therefore, the majority of variants with strong predicted effects on all four assays do not fall in nor are they proximal to these known TFBS, indicating that DeepFIGV models a more complicated relationship between genetic variants in epigenetic signal than is encoded by TFBS motifs alone.Variant effects show a degree of cell type specificity as variants with strong predicted effects on DNase, H3K27ac and H3K4me3 are more enriched around the transcription start site (TSS) of genes expressed in LCLs, compared to genes not expressed in LCLs (Figure 3F, Supplementary Figure S5). Variants with strong predicted effects are also more enriched around the TSS of LCL-specific genes, compared to tissue-specific genes from each of 52 additional GTEx tissues (Supplementary Figure S6). Moreover, variants with strong predicted effects on DNase, H3K27ac and H3K4me3 are enriched in CpG islands, and ChromHMM tracks from LCLs (Supplementary Figures S7 and S8). Finally, variants with large predicted effects are depleted among common variants (minor allele frequency > 1%) and are enriched in rare variants across multiple human populations, consistent with negative selection against variants that disrupt the epigenome (Figure 3G, Supplementary Figure S9).
Concordance with molecular trait QTLs
Lead cis-QTL variants (i.e. the local variant with the smallest P-value) for multiple assays are enriched for having strong predicted effect on the epigenome (Figure 4, Supplementary Figure S10). Variants that are lead cis-QTLs for DNase from the current dataset (23,26) are particularly enriched for having a strong predicted effect on DNase and H3K4me1, compared to variants ranked 5th–10th in the cis-QTL analysis (Figure 4A). Similarly, variants that are lead cis-QTLs for gene expression in an independent dataset of LCLs from European individuals (21) are most enriched for variants with a strong predicted effect on DNase (Figure 4B). In addition, variants that are lead cis-QTLs chromatin accessibility assayed by ATAC-seq in post mortem homogenate of human brain tissue from European individuals (45) are also enriched for variants with a strong predicted effect on DNase (Figure 4C). Rare variants associated with expression outliers in multiple tissue types (43) are enriched for variants with strong predicted effects on DNase (Figure 4D), but not ChIP-seq, compared to rare variants not associated with expression outliers. Moreover, somatic variants in cancer that drive changes in expression of nearby genes are enriched for variants with strong predicted effects on DNase, H3K4me3 and H3K27ac, compared to somatic variants matched for distance to transcription start site that were not associated with expression changes (15) (Figure 4E). Furthermore, candidate causal variants for expression QTLs identified by statistical fine mapping (42) are enriched for variants with strong predicted effects on DNase, compared to cis variants that have lower posterior probability (Figure 4F).DeepFIGV scores predict results of molecular trait QTL analysis. (A–C) Lead variants from molecular trait QTL analysis from lymphoblastoid cell lines are enriched for SNPs with DeepFIGV absolute z-score exceeding a range of cutoffs. Enrichments are evaluated using (A) DeepFIGV scores for 4 assays for DNase-QTLs from LCL’s of Yoruban ancestry, (B) expression QTLs from LCL’s of European ancestry, and (C) chromatin accessibility QTL’s assayed by ATAC-seq in human post mortem brain homogenate of European ancestry. Shaded regions indicated 95% confidence intervals. (D) Rare variants associated with gene expression outliers are enriched for DeepFIGV absolute z-score for DNase compared to rare variants not associated with outliers. Shaded regions indicated 95% confidence intervals. (E) Enrichment of somatic variants in cancer that drive gene expression changes (15) for strong DeepFIGV scores. (F) Candidate causal variants for expression QTLs with higher posterior probability are enriched for exceeding a DeepFIGV absolute z-score of 10 for DNase. Enrichments are shown for skin and LCL samples from TwinsUK, and LCL samples from GEAUVIDIS (21). Shaded regions indicated 95% confidence intervals. (G) DeepFIGV independently identifies candidate causal variant rs11547207 (shown in red) for QTL affecting expression of both CIB2 and IDH3A. eQTL analysis of GEUAVDIS identifies many correlated variants associated with these genes, but statistical fine-mapping identifies a signal candidate variant (42). Although this variant is not an DNase QTL in LCLs Yoruban individuals (23), DeepFIGV analysis on the same data identifies the same candidate causal variant identified by statistical fine mapping. In silico mutagenesis of 50bp around rs11547207 indicates that variants at nearby positions are predicted to decrease the DNase signal. Size of letters in DNA sequence is proportional to the maximum absolute delta at that position. Bottom panel shows TFBS motifs.For example, rs11547207 is identified as an eQTL in LCLs from European individuals, but this SNP is in linkage disequilibrium with many nearby variants (Figure 4G). Statistical fine mapping indicates that this SNP has a high probably of being the causal variant in this region driving gene expression variation. Although analysis of the DNase signal from LCLs does not identify this SNP as a QTL, DeepFIGV directly models the sequence context of this variant and predicted a strong effect of the epigenetic signal in this region. In silico saturation mutagenesis in this region gives predictions at nucleotide-resolution and indicates that variants within ∼5 bp are also predicted to decrease the DNase signal while falling just upstream of a TFAP2A TFBS (36).
DeepFIGV variant scores predict allele specific binding
The predicted functional effect of genetic variants on each of the four epigenetic assays analyzed in DeepFIGV can identify allele-specific binding (ASB) of TFs in independent ChIP-seq experiments in LCLs (52,53) (Figure 5). Heterozygous variants can be divided into three categories based on ASB: (a) no allele specific effect, (b) ASB favoring the reference allele and (c) ASB favoring the alternative allele (Figure 5A). We evaluated the ability of DeepFIGV to distinguish between these categories even though no allele-specific information is included in model training. The predicted effect on the DNase signal can classify variants showing ASB versus no ASB for CCCTC-binding factor (CTCF) with an area under the precision recall (AUPR) curve of 0.202 while a random classifier gives an AUPR of 0.0493 (Figure 5B, C). This gives an AUPR increase of 0.202 – 0.0493 = 0.1527 compared to random. Given a variant with an allele-specific effect, the predicted effect on DNase signal is able to classify the direction of the effect (i.e. favoring reference versus alternative) for CTCF with an AUPR of 0.704 compared to a random classifier of 0.36 (Figure 5D, E). This gives an AUPR increase of 0.344. Since the number of sites in each category varied substantially across TFs, we consider the increase in AUPR from the DeepFIGV score compared to a TF-specific baseline. DeepFIGV scores show an AUPR increase compared to random classifiers for predicting ASB events and predicting ASB direction for variants from independent assays of multiple transcription factors in LCLs (Figure 5F) and HeLa S3 cells (Supplementary Figure S11).
Figure 5.
DeepFIGV scores predict allele specific transcription factor binding in LCLs. (A) Diagram illustrating 3 categories of allele specific binding (ASB): (i) no ASB, (ii) ASB favoring the reference allele, and 3) ASB favoring the alternative allele. (B) Precision-recall curve indicating performance of absolute DeepFIGV z-score for DNase in predicting ASB of CTCF. AUPR indicates the area under the precision-recall curve. Dashed line indicates the performance of a random predictor. (C) Density plot showing absolute DeepFIGV z-score for variants in (B) in the ABS or no ASB classes. (D) Precision-recall curve indicating performance of DeepFIGV z-score for DNase in predicting the directionality of ASB (reference versus alternative) for CTCF. AUPR indicates the area under the precision-recall curve. Dashed line indicates the performance of a random predictor. (E) Plot of ASB magnitude versus DeepFIGV DNase z-score from (D). (F) Increase in AUPR of predicting ASB status for DeepFIGV scores for four epigenetic assays compared to a TF-specific random predictor. Increase in AUPR is shown for predicting ASB versus no ASB (left) and predicting the directionality of ASB (reference versus alternative) (center). Right panel shows the number of ASB SNPs considered in each analysis.
DeepFIGV scores predict allele specific transcription factor binding in LCLs. (A) Diagram illustrating 3 categories of allele specific binding (ASB): (i) no ASB, (ii) ASB favoring the reference allele, and 3) ASB favoring the alternative allele. (B) Precision-recall curve indicating performance of absolute DeepFIGV z-score for DNase in predicting ASB of CTCF. AUPR indicates the area under the precision-recall curve. Dashed line indicates the performance of a random predictor. (C) Density plot showing absolute DeepFIGV z-score for variants in (B) in the ABS or no ASB classes. (D) Precision-recall curve indicating performance of DeepFIGV z-score for DNase in predicting the directionality of ASB (reference versus alternative) for CTCF. AUPR indicates the area under the precision-recall curve. Dashed line indicates the performance of a random predictor. (E) Plot of ASB magnitude versus DeepFIGV DNase z-score from (D). (F) Increase in AUPR of predicting ASB status for DeepFIGV scores for four epigenetic assays compared to a TF-specific random predictor. Increase in AUPR is shown for predicting ASB versus no ASB (left) and predicting the directionality of ASB (reference versus alternative) (center). Right panel shows the number of ASB SNPs considered in each analysis.
Concordance with large-scale functional experiments of variant impact
Scalable experimental approaches to measure the functional consequence of non-coding variants have recently been proposed (59,63–66). These massively parallel reporter assays (MPRA) couple thousands to millions of nucleotide sequences to a molecular readout that can be quantified by short read sequencing. Tewhey et al. (59) performed an MPRA of 32K variants in LCLs by inserting 150 bp sequences centered at the variant into an episomal vector. Based on experimental readout, sequences were divided into three classes: (1) expression modulating variants that showed significant difference in expression between reference and alternative alleles, (2) variants that drove expression but did not show allelic differences, and (3) variants whose sequence did not drive expression in this assay. Applying predictions from the DeepFIGV models is concordant with experimental readout for sequences and variants from these experiments (Figure 6). Evaluating each sequence based on predicted signal magnitude for the four epigenetic assays shows that sequences with higher predicted signal are strongly enriched for driving expression in this experiment compared to sequences that do not drive expression (Supplementary Figure S12). As expected, variants with a high DeepFIGV z-score are enriched in variants with allelic effect (i.e. class 1) compared to variants that are in sequences that do not drive expression in this assay (i.e. class 3) (Figure 6A). Yet, despite the challenge of small experimental effect sizes between two alleles of a variant (59), variants found to drive changes in gene expression (i.e. class 1) are enriched for having strong predicted effects on DNase by DeepFIGV compared to variants that do not show an allelic effect (i.e. class 2) (Figure 6B).Variants with large DeepFIGV z-scores are enriched for activity in experimental massively parallel reporter assay. (A) Variants that modulate gene expression in this assay are enriched for having large DeepFIGV z-scores compared to variants in sequences that do not drive expression. (B) Variants that modulate gene expression in this assay are enriched for having large DeepFIGV z-scores for DNase compared to variants that have no allelic effect in this experiment. Shaded regions indicate 95% confidence intervals.
Enrichment for disease risk variants and interpreting causal variants
Integrating DeepFIGV scores with large-scale genome-wide association studies shows that risk variants for common disease are enriched for variants predicted to impact the epigenome (Figure 7). We applied stratified LD-score regression (8) to evaluate the contribution of variants with different genomic annotations to disease risk. Analysis of 19 traits identified a contribution of variants passing multiple DeepFIGV z-score cutoffs to trait heritability, even after accounting for a baseline set of 32 genomic annotations (see Methods) (Figure 7A, Supplementary Figures S13,S14). Immune traits show the strongest contribution of DeepFIGV variants to trait heritability since the model was trained in LCLs (a B-cell lineage), yet there are also cell type autonomous effects and a contribution to non-immune traits. Further investigation of the impact of immune traits shows that candidate causal variants identified by statistical fine mapping (10) are enriched for variants with strong DeepFIGV effects (Figure 7B, Supplementary Figure S15).
Figure 7.
Disease risk variants are enriched for large DeepFIGV scores. (A) Linkage-disequilibrium score regression (LDSC) (8) partitioned heritability estimates for diseases in 4 categories. Heritability per SNP is computed for variants that exceed 5 cutoffs for DeepFIGV absolute z-score for DNase. Error bars indicate 2 standard deviations. (B) Enrichment of candidate causal variants for autoimmune disease (10) are variants exceeding six cutoffs DeepFIGV absolute z-score for DNase. Error bars indicate 2 standard deviations. (C) DeepFIGV elucidates molecular function of candidate causal variant for inflammatory bowel disease (67). GWAS identifies many correlated variants associated with disease risk, but statistical fine mapping identifies a single SNP (shown in red) as the candidate causal variant. This variant, rs10748781, disrupts a CpG site and is predicted to decrease the DNase signal in this region. In silico mutagenesis of 50bp around this SNP indicates that variants at nearby positions are predicted to decrease the DNase signal. Size of letters in DNA sequence is proportional to the maximum absolute delta at that position. Bottom panel shows TFBS motifs. Disease abbreviations: AD (Atopic dermatitis), ALZ (Alzheimer's), AS (Ankylosing spondylitis), ASD (Autism spectrum disorder), AT (Autoimmune thyroiditis), BMD (Bone mineral density), BMI (Body mass index), CAD (Coronary artery disease), CD (Crohn's disease), CKD (Chronic kidney disease) HbA1c (HbA1c protein level in blood), HDL (High-density lipoprotein), IBD (Inflammatory bowel disease), JIA (Juvenile idiopathic arthritis), LDL (Low-density lipoprotein), Liver enz (gamma glutamyl transferase), MI (myocardial infarction), MS (Multiple sclerosis), PBC (Primary biliary cirrhosis), PSC (Primary sclerosing cholangitis), PSP (Progressive supranuclear palsy), PS (Psoriasis), RA (Rheumatoid arthritis), SLE (Systemic lupus erythematosus), SWB (Subjective well-being), T1D (Type 1 diabetes), T2D (Type 2 diabetes), TC (total cholesterol), TG (Triglycerides), UC (Ulcerative colitis).
Disease risk variants are enriched for large DeepFIGV scores. (A) Linkage-disequilibrium score regression (LDSC) (8) partitioned heritability estimates for diseases in 4 categories. Heritability per SNP is computed for variants that exceed 5 cutoffs for DeepFIGV absolute z-score for DNase. Error bars indicate 2 standard deviations. (B) Enrichment of candidate causal variants for autoimmune disease (10) are variants exceeding six cutoffs DeepFIGV absolute z-score for DNase. Error bars indicate 2 standard deviations. (C) DeepFIGV elucidates molecular function of candidate causal variant for inflammatory bowel disease (67). GWAS identifies many correlated variants associated with disease risk, but statistical fine mapping identifies a single SNP (shown in red) as the candidate causal variant. This variant, rs10748781, disrupts a CpG site and is predicted to decrease the DNase signal in this region. In silico mutagenesis of 50bp around this SNP indicates that variants at nearby positions are predicted to decrease the DNase signal. Size of letters in DNA sequence is proportional to the maximum absolute delta at that position. Bottom panel shows TFBS motifs. Disease abbreviations: AD (Atopic dermatitis), ALZ (Alzheimer's), AS (Ankylosing spondylitis), ASD (Autism spectrum disorder), AT (Autoimmune thyroiditis), BMD (Bone mineral density), BMI (Body mass index), CAD (Coronary artery disease), CD (Crohn's disease), CKD (Chronic kidney disease) HbA1c (HbA1c protein level in blood), HDL (High-density lipoprotein), IBD (Inflammatory bowel disease), JIA (Juvenile idiopathic arthritis), LDL (Low-density lipoprotein), Liver enz (gamma glutamyl transferase), MI (myocardial infarction), MS (Multiple sclerosis), PBC (Primary biliary cirrhosis), PSC (Primary sclerosing cholangitis), PSP (Progressive supranuclear palsy), PS (Psoriasis), RA (Rheumatoid arthritis), SLE (Systemic lupus erythematosus), SWB (Subjective well-being), T1D (Type 1 diabetes), T2D (Type 2 diabetes), TC (total cholesterol), TG (Triglycerides), UC (Ulcerative colitis).DeepFIGV scores can elucidate the molecular mechanism of a causal variant and prioritize downstream experiments. Integrating DeepFIGV scores with candidate causal variants for inflammatory bowel disease (67) shows that for rs10748781, which has 99% posterior probability of being the causal variant in this region and is in a Stat4 TFBS, the alternative allele is predicted to decrease the DNase signal in this region in LCLs (Figure 7C). This result gives a specific cell type and biological assay to design a validation experiment. Moreover, this variant disrupts a CpG site, is a known DNA methylation QTL, and the methylation at nearby sites is predicted to mediate the effect of the variant on disease risk (68) (Supplementary Figure S16).
DISCUSSION
Translating findings of genetic studies to a molecular understanding of disease etiology and then to novel therapies has been hindered by the challenge of interpreting the functional consequence of genetic variants. There is a widely recognized need for accurate computational predictions of the functional impact of non-coding regulatory variants (69). Genomic annotations of the non-coding regions have generally taken one of four approaches. Evolutionary conservation or selection can identify functional regions of the genome, but consecutive nucleotides often have very similar scores and this approach does not give cell type- and assays-specific functional consequences (58,70,71). Epigenetic maps across multiple cell types, tissues and assays provide a functional interpretation, but peaks from these assays cover millions of nucleotides (17,18). Molecular trait QTL studies correlate genetic variants with gene expression or epigenetic signals, but interpretation of this correlation analysis is limited by linkage disequilibrium and is only applicable to variants observed in the dataset (20–22). Most recently, deep convolutional neural networks have been used to develop predictive models linking the genome sequence to splicing (72), protein binding (73), and epigenetic signals (27,29–31). Although these deep learning methods have been promising, their biological application has so far been limited.Here, we present a deep learning framework that learns a predictive model linking DNA sequence to quantitative variation in epigenetic signal and evaluates the predicted functional impact of genetic variants on multiple assays. This framework models quantitative variation in the epigenome, integrates whole genome sequencing to create a personalized genome sequence for each individual, and trains on many experiments from the same cell type and assay. Because this framework fits a predictive model based on sequence context, it is less susceptible to issues of linkage disequilibrium and can predict the functional impact of variants even if they are not observed in the training dataset.Application to epigenetic assays of open chromatin (DNase-seq) and histone modifications (H3K27ac, H3K4me3 and H3K4me1) from 75 lymphoblastoid cell lines (LCL) (23,26) produces functional consequence scores that are concordant with other genomic annotations while capturing sequence context information beyond known TFBS motifs. We note that potential mechanisms of variants outside these motifs include affecting local DNA shape, DNA methylation or nucleosome positioning (74,75), but interpretation remains an open challenge (76). We demonstrate that these functional consequence scores inform molecular mechanism, are concordant with molecular trait QTL analysis, are predictive of allele-specific binding, and inform interpretation of risk variants for common disease. Moreover, these scores can prioritize variants for downstream experiments and indicate the appropriate cell type and functional assay. DeepFIGV scores are complementary to other non-coding variant scores, and compared to DeepSea (27) identifies more variants with extreme z-scores. (Supplementary Methods, Supplementary Figure S17).Yet computational prediction of functional variants remains a challenging problem. DeepFIGV z-scores for DNase are correlated (spearman rho = 0.0802, P = 5.32e–16) with relative expression from an MPRA from the regulation saturation challenge from the Critical Assessment of Genome Interpretation (CAGI5, genomeinterpretation.org) (Supplementary Figure S18). But there is substantial room for improvement. Moreover, CADD (57) and GWAVA (60) show better performance in classifying pathogenic from benign variants in ClinVar (77) (Supplementary Figure S19). This result underscores that predicting the ‘proximal’ relationship between DNA and epigenetics is different that predicting the more complex relationship between DNA and higher-level disease phenotypes.The differing performance of the prediction and biological enrichments across the four epigenetic assays is attributable to both biological and technical factors. These assays differ in the biological processes they measure. DNase measures open chromatin with high signal representing protein interacting with the DNA within a narrow region of ∼150 bp. Thus DNase signal is largely determined by the proximal DNA sequence and especially TF binding. Histone modification ChIP-seq is more complex readout of chromatin state with H3K4me3 at active promoters, H3K27ac at active promoters and enhancers, and H3K4me1 at active or primed enhancers. Due to spatial chromatin spreading, the role of trans-factors, and the increased width of these marks (300 bp to 1 kb), sequence-based prediction is known to be less accurate (27). Since genetic variants conferring disease risk or regulating gene expression can act through a number of mechanisms (9,20,23,26,67,69), the value of additional epigenetic assays depends on the accuracy of a predictive model as well as the regulatory mechanism of interest.We note that since the current method is trained on the continuous epigenetic signal within peak regions learned from each assay, the method is dependent on the set of peaks. Moreover, there is likely meaningful epigenetic variation outside of strictly defined peak regions and, indeed, there is much interest in performing basepair-level predictions rather than summarizing the epigenetic signal for each peak (28,78). Thus although our predictions are peak-centric, and we observe that for DNase SNPs within DNase peaks are certainly associated with the observed basepair-level signal (P = 6.5e–22), SNPs outside peaks are still associated with basepair-level signal P = 6.2e–5) (Supplementary Figure S20).Despite the remarkable experimental throughput of recent massively parallel report assays, these assays are limited to cell culture and they assay the function of the query sequence either in an episomal vector or through random insertion into the genome (59,63–66). Thus the degree to which results from MPRAs recapitulate function in the disease relevant cell type and natural genomic context remains unclear (65,79,80). In contrast, predictive models based on sequence context use natural genetic variation, are extensible to multiple biological assays, and evaluate sequences in their native chromosomal context. Moreover, they are applicable to cell culture, as well as cells from post mortem, biopsy or blood draws to more precisely target the relevant cell type.The growth of large-scale resources pairing quantitative epigenetic assays with genetic data offers an opportunity to train rich predictive models on disease relevant cell types (25,33,81). Finally, we have developed a public resource of the DeepFIGV predicted functional scores for 438 million variants observed in previous sequencing projects available at deepfigv.mssm.edu.Click here for additional data file.
Authors: Jacob C Ulirsch; Satish K Nandakumar; Li Wang; Felix C Giani; Xiaolan Zhang; Peter Rogov; Alexandre Melnikov; Patrick McDonel; Ron Do; Tarjei S Mikkelsen; Vijay G Sankaran Journal: Cell Date: 2016-06-02 Impact factor: 41.582
Authors: Peter M Visscher; Naomi R Wray; Qian Zhang; Pamela Sklar; Mark I McCarthy; Matthew A Brown; Jian Yang Journal: Am J Hum Genet Date: 2017-07-06 Impact factor: 11.025
Authors: Lu Chen; Bing Ge; Francesco Paolo Casale; Louella Vasquez; Tony Kwan; Diego Garrido-Martín; Stephen Watt; Ying Yan; Kousik Kundu; Simone Ecker; Avik Datta; David Richardson; Frances Burden; Daniel Mead; Alice L Mann; Jose Maria Fernandez; Sophia Rowlston; Steven P Wilder; Samantha Farrow; Xiaojian Shao; John J Lambourne; Adriana Redensek; Cornelis A Albers; Vyacheslav Amstislavskiy; Sofie Ashford; Kim Berentsen; Lorenzo Bomba; Guillaume Bourque; David Bujold; Stephan Busche; Maxime Caron; Shu-Huang Chen; Warren Cheung; Oliver Delaneau; Emmanouil T Dermitzakis; Heather Elding; Irina Colgiu; Frederik O Bagger; Paul Flicek; Ehsan Habibi; Valentina Iotchkova; Eva Janssen-Megens; Bowon Kim; Hans Lehrach; Ernesto Lowy; Amit Mandoli; Filomena Matarese; Matthew T Maurano; John A Morris; Vera Pancaldi; Farzin Pourfarzad; Karola Rehnstrom; Augusto Rendon; Thomas Risch; Nilofar Sharifi; Marie-Michelle Simon; Marc Sultan; Alfonso Valencia; Klaudia Walter; Shuang-Yin Wang; Mattia Frontini; Stylianos E Antonarakis; Laura Clarke; Marie-Laure Yaspo; Stephan Beck; Roderic Guigo; Daniel Rico; Joost H A Martens; Willem H Ouwehand; Taco W Kuijpers; Dirk S Paul; Hendrik G Stunnenberg; Oliver Stegle; Kate Downes; Tomi Pastinen; Nicole Soranzo Journal: Cell Date: 2016-11-17 Impact factor: 41.582
Authors: Hailiang Huang; Ming Fang; Luke Jostins; Maša Umićević Mirkov; Gabrielle Boucher; Carl A Anderson; Vibeke Andersen; Isabelle Cleynen; Adrian Cortes; François Crins; Mauro D'Amato; Valérie Deffontaine; Julia Dmitrieva; Elisa Docampo; Mahmoud Elansary; Kyle Kai-How Farh; Andre Franke; Ann-Stephan Gori; Philippe Goyette; Jonas Halfvarson; Talin Haritunians; Jo Knight; Ian C Lawrance; Charlie W Lees; Edouard Louis; Rob Mariman; Theo Meuwissen; Myriam Mni; Yukihide Momozawa; Miles Parkes; Sarah L Spain; Emilie Théâtre; Gosia Trynka; Jack Satsangi; Suzanne van Sommeren; Severine Vermeire; Ramnik J Xavier; Rinse K Weersma; Richard H Duerr; Christopher G Mathew; John D Rioux; Dermot P B McGovern; Judy H Cho; Michel Georges; Mark J Daly; Jeffrey C Barrett Journal: Nature Date: 2017-06-28 Impact factor: 49.962
Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean Journal: Nature Date: 2012-11-01 Impact factor: 49.962
Authors: Wei Zhang; Ana Bojorquez-Gomez; Daniel Ortiz Velez; Guorong Xu; Kyle S Sanchez; John Paul Shen; Kevin Chen; Katherine Licon; Collin Melton; Katrina M Olson; Michael Ku Yu; Justin K Huang; Hannah Carter; Emma K Farley; Michael Snyder; Stephanie I Fraley; Jason F Kreisberg; Trey Ideker Journal: Nat Genet Date: 2018-04-02 Impact factor: 41.307
Authors: Julien Bryois; Melanie E Garrett; Lingyun Song; Alexias Safi; Paola Giusti-Rodriguez; Graham D Johnson; Annie W Shieh; Alfonso Buil; John F Fullard; Panos Roussos; Pamela Sklar; Schahram Akbarian; Vahram Haroutunian; Craig A Stockmeier; Gregory A Wray; Kevin P White; Chunyu Liu; Timothy E Reddy; Allison Ashley-Koch; Patrick F Sullivan; Gregory E Crawford Journal: Nat Commun Date: 2018-08-07 Impact factor: 14.919
Authors: Jiyun Zhou; Qiang Chen; Patricia R Braun; Kira A Perzel Mandell; Andrew E Jaffe; Hao Yang Tan; Thomas M Hyde; Joel E Kleinman; James B Potash; Gen Shinozaki; Daniel R Weinberger; Shizhong Han Journal: Proc Natl Acad Sci U S A Date: 2022-08-15 Impact factor: 12.779
Authors: Kamil Wnuk; Jeremi Sudol; Kevin B Givechian; Patrick Soon-Shiong; Shahrooz Rabizadeh; Christopher Szeto; Charles Vaske Journal: iScience Date: 2019-09-14