Jun Cheng, Thi Yen Duong Nguyen, Kamil J Cygan, Muhammed Hasan Çelik, William G Fairbrother, Žiga Avsec, Julien Gagneur.
Abstract
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing), with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with performance matching or exceeding the state of the art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files.
Keywords: Deep learning; Modular modeling; Splicing; Variant effect; Variant pathogenicity
Year: 2019 PMID: 30823901 PMCID: PMC6396468 DOI: 10.1186/s13059-019-1653-z
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Individual modules of MMSplice and their combination to predict the effect of genetic variants on various splicing quantities. a MMSplice consists of six modules scoring sequences from donor, acceptor, exon, and intron sites. Modules were trained with rich genomics datasets probing the corresponding regulatory regions. b Modules from a are combined with a linear model to score variant effects on exon skipping (ΔΨ), alternative donor (ΔΨ3), or alternative acceptor site (ΔΨ5), and splicing efficiency, and they are combined with a logistic regression model to predict variant pathogenicity. L and L stand for the length of intron sequence taken from the acceptor and donor side respectively
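The combination step in panel b can be illustrated with a minimal sketch. It assumes, for illustration only, that each module yields one score per sequence, that the linear model acts on alternative-minus-reference score differences on the logit scale, and that the result is converted to ΔΨ using a known reference Ψ. The module names, weights, and function names are hypothetical and are not the fitted MMSplice parameters.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_delta_psi(ref_scores, alt_scores, weights, intercept, psi_ref):
    """Combine per-module score differences linearly (hypothetical model).

    ref_scores / alt_scores: dicts mapping module name -> score.
    weights: dict mapping module name -> linear coefficient (made up here).
    psi_ref: percent spliced-in of the reference sequence, in (0, 1).
    """
    delta_logit = intercept + sum(
        w * (alt_scores[name] - ref_scores[name]) for name, w in weights.items()
    )
    # Shift the reference Psi on the logit scale, then map back to [0, 1].
    return sigmoid(logit(psi_ref) + delta_logit) - psi_ref


# Illustrative numbers only: the variant weakens the donor score.
weights = {"acceptor": 1.0, "exon": 0.5, "donor": 1.2}
ref = {"acceptor": 2.0, "exon": 0.3, "donor": 3.0}
alt = {"acceptor": 2.0, "exon": 0.3, "donor": 1.0}
print(predict_delta_psi(ref, alt, weights, 0.0, 0.9))
```

Working on the logit scale keeps the predicted Ψ of the alternative sequence within 0 and 1, which a linear model applied directly to Ψ would not guarantee.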
Summary of trained modules and models
| MMSplice model | Training data | Architecture | Loss function | Target value | Parameters |
|---|---|---|---|---|---|
| Donor module | GENCODE 24, positive: annotated donors, negative: random sequence (“ | Four layer neural network with dropout and batch normalization, Additional file | Binary cross entropy | Positive vs. negative | 18,049 |
| Acceptor module | GENCODE 24, positive: annotated acceptors, negative: random sequence (“ | Two layer conv. neural network with dropout and batch normalization, Additional file | Binary cross entropy | Positive vs. negative | 4833 |
| Exon 5′ module | MPRA [ | One conv. layer shared with the Exon 3′ module, followed with one specific dense layer, Additional file | Binary cross entropy | | 6145 |
| Exon 3′ module | MPRA [ | One conv. layer shared with the Exon 5′ module, followed with one specific dense layer, Additional file | Binary cross entropy | | 6145 |
| Intron 5′ module | MPRA [ | One conv. layer shared with the Intron 3′ module, followed with one specific dense layer, Additional file | Binary cross entropy | | 13,825 |
| Intron 3′ module | MPRA [ | One conv. layer shared with the Intron 5′ module, followed with one specific dense layer, Additional file | Binary cross entropy | | 13,825 |
| | Vex-seq [ | Linear regression | Huber loss | | 9 |
| Splicing efficiency model (in vivo) | MaPSy (“ | Linear regression | Huber loss | Splicing efficiency, Eq. | 5 |
| Splicing efficiency model (in vitro) | MaPSy (“ | Linear regression | Huber loss | Splicing efficiency, Eq. | 5 |
| Pathogenicity model (w/o phyloP and CADD) | ClinVar [ | Logistic regression | Binary cross entropy | Pathogenic vs. benign | 14 |
| Pathogenicity model (with phyloP and CADD) | ClinVar [ | Logistic regression | Binary cross entropy | Pathogenic vs. benign | 18 |
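The Huber loss listed for the linear regression models can be sketched as follows. This is the textbook definition, not code from the paper; the threshold `delta=1.0` is an assumed default, since the value used for training is not stated here.

```python
def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Quadratic for small residuals, linear for large ones.

    delta is the switch point between the two regimes (assumed value).
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def mean_huber(y_true, y_pred, delta: float = 1.0) -> float:
    """Average Huber loss over paired observations and predictions."""
    return sum(huber_loss(t - p, delta) for t, p in zip(y_true, y_pred)) / len(y_true)


print(huber_loss(0.5), huber_loss(3.0))
```

Compared with squared error, the linear tail limits the influence of outlying measurements on the fitted coefficients.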
Fig. 2 MMSplice improves the prediction of variant effect on exon skipping. a Schema of the Vex-seq experiment [29]. The effects of 2059 ExAC variants (red star) within or adjacent to 110 alternative exons were tested with reporter genes by measuring the percent spliced-in of the reference sequence (Ψref) and of the alternative sequence (Ψalt) by RNA-seq. b–d Measured (y-axis) versus predicted (x-axis) Ψ differences between alternative and reference sequence for MMSplice (b), HAL [18] (c), and SPANR [17] (d) on Vex-seq test data. Color scale represents counts in hexagonal bins. The black line marks the y=x diagonal. Each plot shows the subset of variants that the respective model can score. Pearson correlations (R) and root-mean-square errors (RMSE) were likewise calculated on the scored variants. The 95% confidence intervals for these two metrics were calculated with bootstrap (“Methods” section). e Schema of the MFASS experiment [34]. Exon skipping effects of 27,733 ExAC SNVs (red star) spanning or adjacent to 2339 exons were tested by genome integration of designed constructs. A splice-disrupting variant (SDV) is defined as a variant that changes the original exon inclusion index of an exon by at least 0.5. f Precision-recall curves of MFASS SDV classification based on model-predicted ΔΨ. The precision-recall curve for each of the three models was calculated on the set of variants it can score. MMSplice (black) scored all 27,733 variants, SPANR (yellow) scored 27,663 variants (1,048 SDVs), and HAL (blue) scored 14,353 variants (489 SDVs)
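The bootstrap confidence intervals mentioned in the caption can be sketched as a percentile bootstrap over paired measured and predicted values. The number of resamples, the seed, and the function names are assumptions for illustration; the exact procedure is described in the “Methods” section, not here.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, stat=pearson, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a statistic of paired data."""
    rng = random.Random(seed)
    n = len(x)
    values = []
    for _ in range(n_boot):
        # Resample (measured, predicted) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        v = stat([x[i] for i in idx], [y[i] for i in idx])
        if v is not None:
            values.append(v)
    values.sort()
    lo = values[int((alpha / 2) * (len(values) - 1))]
    hi = values[int((1 - alpha / 2) * (len(values) - 1))]
    return lo, hi
```

Passing a different `stat`, for example an RMSE function with the same two-argument signature, yields the interval for that metric from the same resampling scheme.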
Fig. 3 Evaluation of models predicting ΔΨ5 and ΔΨ3 on the GTEx dataset. Associated effects (y-axis) versus predictions (x-axis) are shown for GTEx variants around alternatively spliced donors (3 nt in the exon and 6 nt in the intron) and acceptors (3 nt in the exon and 20 nt in the intron). Ψ5 (or Ψ3) of homozygous (black) and heterozygous (blue) alternative variants as well as homozygous reference variants was calculated by taking the mean Ψ5 (or Ψ3) across individuals with the same genotype (excluding individuals with multiple variants within 300 nt around splice sites) in brain and skin (not sun exposed) samples. For donor variants, MMSplice (a) was benchmarked against COSSMO (b), HAL (c), and MaxEntScan (d). For acceptor variants, MMSplice (e) was benchmarked against COSSMO (f) and MaxEntScan (g). The 95% confidence intervals for Pearson correlation (R) and root-mean-square errors (RMSE) were calculated with bootstrap (“Methods” section). The dotted line marks the y=x diagonal
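The per-genotype averaging described in the caption can be sketched as below. It assumes that the associated effect is the difference in mean Ψ between an alternative genotype group and the homozygous-reference group, and it encodes genotypes as the number of alternative alleles (0, 1, or 2). Both the encoding and the function names are hypothetical.

```python
from collections import defaultdict


def mean_psi_by_genotype(samples):
    """samples: iterable of (genotype, psi); genotype = alt allele count."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for genotype, psi in samples:
        sums[genotype] += psi
        counts[genotype] += 1
    return {g: sums[g] / counts[g] for g in sums}


def associated_effects(samples):
    """Mean Psi of each alternative genotype minus the homozygous-reference mean."""
    means = mean_psi_by_genotype(samples)
    ref_mean = means[0]
    return {g: m - ref_mean for g, m in means.items() if g != 0}


# Made-up individuals: (alt allele count, Psi at the splice site of interest).
samples = [(0, 0.8), (0, 0.6), (1, 0.5), (2, 0.2), (2, 0.4)]
print(associated_effects(samples))
```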
Fig. 4 Splicing efficiency prediction. a MaPSy experiment (“Methods” section). The effect of 5761 published disease-causing exonic mutations on splicing efficiency was measured both in vivo and in vitro. Changes in splicing efficiency were quantified by the allelic log-ratio. b–e Measured (y-axis) versus predicted (x-axis) allelic ratio for 797 variants in the test set for MMSplice (b, c) and the SMS score [28] (d, e). The dotted line marks the y=x diagonal. The 95% confidence intervals for Pearson correlation (R) and root-mean-square errors (RMSE) were calculated with bootstrap (“Methods” section)
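An allelic log-ratio of the kind named in the caption can be sketched as follows. The sketch assumes that it compares spliced-output-to-input read ratios between the mutant and wild-type alleles on a log2 scale; the argument names and the optional pseudocount are hypothetical, and the exact definition is given by the equation referenced in the table above.

```python
import math


def allelic_log_ratio(mut_out, mut_in, wt_out, wt_in, pseudocount=0.0):
    """log2 of (mutant spliced/input) over (wild-type spliced/input).

    A positive pseudocount (an assumption, not from the source) avoids
    division by zero and log of zero for low-count alleles.
    """
    mutant_ratio = (mut_out + pseudocount) / (mut_in + pseudocount)
    wildtype_ratio = (wt_out + pseudocount) / (wt_in + pseudocount)
    return math.log2(mutant_ratio / wildtype_ratio)


# Made-up read counts: the mutant allele is spliced half as efficiently.
print(allelic_log_ratio(50, 100, 100, 100))
```

Under this definition a value of 0 means the mutation leaves splicing efficiency unchanged, and negative values mean the mutant allele is spliced less efficiently than the wild type.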
Fig. 5 Predictions on ClinVar variants. a Variants are first mapped to potentially affected exons. Variants in the exon or in the intron, within L nt of the acceptor site or within L nt of the donor site, are considered to affect splicing of the exon. Afterwards, reference and alternative sequences are retrieved and subjected to MMSplice for prediction. MMSplice gives a prediction for each variant-exon pair. b Model comparison on classifying pathogenicity of ClinVar splice variants. Models were trained and evaluated in 10-fold cross-validation. Error bars indicate one standard deviation calculated across folds. The six leftmost models (blue) are incrementally added to the ensemble model: “+phyloP+CADD” uses all five previous models as well as phyloP and CADD scores. Performance of MMSplice and SPANR alone, as well as their performance with phyloP and CADD scores, is shown on the right (orange)
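The variant-to-exon mapping in panel a can be sketched as an interval test. The sketch assumes 1-based inclusive exon coordinates and treats the two intron lengths as free parameters; the `Exon` class, the function name, and the example lengths are hypothetical and do not reproduce the MMSplice implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int   # 1-based, inclusive, genomic coordinate
    end: int     # 1-based, inclusive, genomic coordinate
    strand: str  # "+" or "-"


def affected_exons(variant_pos, exons, acceptor_intron_len, donor_intron_len):
    """Return exons whose body or flanking intron window contains the variant."""
    hits = []
    for exon in exons:
        # On the + strand the acceptor lies at the genomic start of the exon,
        # on the - strand it lies at the genomic end.
        if exon.strand == "+":
            left, right = acceptor_intron_len, donor_intron_len
        else:
            left, right = donor_intron_len, acceptor_intron_len
        if exon.start - left <= variant_pos <= exon.end + right:
            hits.append(exon)
    return hits


exons = [Exon(100, 200, "+"), Exon(300, 400, "-")]
# Example window sizes only; one variant can map to several exons.
print(affected_exons(210, exons, acceptor_intron_len=50, donor_intron_len=13))
```

Each returned exon would form one variant-exon pair to be scored separately.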