Abstract
Recent progress in deep learning has greatly improved the prediction of RNA splicing from DNA sequence. Here, we present Pangolin, a deep learning model to predict splice site strength in multiple tissues. Pangolin outperforms state-of-the-art methods for predicting RNA splicing on a variety of prediction tasks. Pangolin improves prediction of the impact of genetic variants on RNA splicing, including common, rare, and lineage-specific genetic variation. In addition, Pangolin identifies loss-of-function mutations with high accuracy and recall, particularly for mutations that are not missense or nonsense, demonstrating remarkable potential for identifying pathogenic variants.
Year: 2022 PMID: 35449021 PMCID: PMC9022248 DOI: 10.1186/s13059-022-02664-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 Overview of Pangolin and evaluation. a Schematic and architecture of Pangolin. b Heatmap summarizing the performance of Pangolin, SpliceAI, HAL, MMSplice, and MaxEntScan with respect to three metrics including top-1 accuracy. c Precision-recall curves representing the precision and recall from multiple methods for the prediction of splice-disrupting variants as identified in Cheung et al. [8] (1050 splice-disrupting variants out of 27,733 total). d Scatter plots showing measured versus predicted effects of single genetic variants (left) or a combination of genetic variants (right) on RNA splicing. Measured effects of single genetic variants and combinations of variants were obtained from Julien et al. [15] and Baeza-Centurion et al. [3], respectively. e In silico mutagenesis of 6416 exons from human chromosomes 7 and 8. Bar plots show for each base the percent of mutations (square root) predicted to increase or decrease usage by at least 0.2.
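The precision-recall comparison in panel c is easier to read with the chance-level baseline in mind: with 1050 positives among 27,733 variants, a predictor that ranks variants at random has an expected precision equal to the positive fraction. The sketch below computes that baseline and shows how average precision (a common summary of a precision-recall curve) is obtained from ranked scores. It is an illustration only, not the evaluation code used for Pangolin; the function name `average_precision` and the toy scores and labels are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision for binary labels (1 = positive), ranking by descending score.

    Assumes at least one positive label; tied scores are kept in input order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    true_positives = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += true_positives / rank
    return precision_sum / total_positives


# Chance-level precision for the variant set described in Fig. 1c.
n_positive = 1050    # splice-disrupting variants (from the caption)
n_total = 27733      # all variants tested (from the caption)
baseline_precision = n_positive / n_total

# Hypothetical toy example: four variants, two of them splice-disrupting.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_ap = average_precision(toy_scores, toy_labels)

print(f"baseline precision: {baseline_precision:.4f}")
print(f"toy average precision: {toy_ap:.4f}")
```

Because positives make up under 4% of this variant set, any method whose precision-recall curve sits well above that baseline is doing substantially better than random ranking.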
Fig. 2 Application of Pangolin to a variety of prediction tasks. a Cumulative density plot of the log10 sQTL p-value fold difference between the SNP predicted to affect splicing and that of the lead sQTL SNP for the top 500 sQTLs identified in DGN (All predictions), or for the 100 predictions with the largest predicted effects (inset). b Example of a splice site that shows a large inter-species difference in usage. A single-nucleotide difference between chimp (T) and human (C) is predicted to strongly decrease (resp. increase) usage of a chimp (resp. human) splice site (dashed vertical line indicates the human site). The T (resp. C) difference likely disrupts (resp. creates) a 3’ canonical splice site in chimp (resp. human). c Locations and effects of SNVs ±50 bp from a splice site predicted to underlie inter-species differences in splice site usage for 71 3’ and 74 5’ sites. A large fraction, but not all, of splice-altering variants are located near the canonical splice sites. d Survival function plots of BRCA1 variants in splice regions as a function of their predicted effects on splicing. The variants are separated by their classification as loss-of-function (LOF, blue), intermediate effect (INT, orange), or functional (FUNC, green). We observe a huge enrichment of LOF variants among variants with large predicted splicing effects. e Precision-recall curves for different variant types representing the precision and recall for distinguishing LOF variants from functional variants. Pangolin achieves a remarkable AUPRC for variants in extended splice regions (note that this excludes canonical splice variants). See Additional file 1: Fig. S8 for variants from additional annotation bins. f Predicted splicing effects of mutations in or flanking 4 BRCA1 exons from Findlay et al. [12]. Mutations identified to be LOF or to have intermediate phenotypes, as well as missense, nonsense, and canonical splice site mutations, are annotated. See Additional file 1: Fig. S9 for all 13 exons with predictions. g Precision-recall curves representing the precision and recall for distinguishing variants annotated as pathogenic from variants annotated as benign in ClinVar. The blue (resp. orange) line represents the PRC for variants excluding (resp. including) variants in annotated splice sites. Missense and nonsense variants are excluded.