Rileen Sinha, Swetlana Nikolajewa, Karol Szafranski, Michael Hiller, Niels Jahn, Klaus Huse, Matthias Platzer, Rolf Backofen.
Abstract
Alternative splicing (AS) involving NAGNAG tandem acceptors is an evolutionarily widespread class of AS. Recent predictions of alternative acceptor usage reported better results for acceptors separated by larger distances than for NAGNAGs. To improve the latter, we used Bayesian networks (BN) together with extensive experimental validation of the predictions. Using carefully constructed training and test datasets, a balanced sensitivity and specificity of ≥92% was achieved. A BN trained on the combined dataset was then used to make predictions, and 81% (38/47) of the experimentally tested predictions were verified. Applying a BN learned on human data to six other genomes, we show that while the performance for the vertebrate genomes matches that achieved on human data, there is a slight drop for Drosophila and worm. Lastly, using the prediction accuracy according to experimental validation, we estimate the number of yet undiscovered alternative NAGNAGs. State-of-the-art classifiers can produce highly accurate predictions of AS at NAGNAGs, indicating that we have identified the major features of the 'NAGNAG-splicing code' within the splice site and its immediate neighborhood. Our results suggest that the mechanism behind NAGNAG AS is simple, stochastic, and conserved among vertebrates and beyond.
Year: 2009 PMID: 19359358 PMCID: PMC2699507 DOI: 10.1093/nar/gkp220
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. NAGNAG alternative splicing. Nomenclature of NAGNAG AS with E and I sites and isoforms.
Figure 2. Nomenclature of features used in this study. Sequence features used to analyze NAGNAG splicing. The region used to derive all 42 features is shown, along with the names given to the positional features. Positional features, including the last three nucleotides of the upstream intron, were derived using the database TassDB, which in turn used reference annotations (RefSeq when available, else ENSEMBL).
Features for machine learning used in this study
| Feature subset | Number of features | Motivation |
|---|---|---|
| N1, N2, D1, D2, D3 and positions in the PPT | 25 | NAGNAG splicing is influenced by the NAGNAG motif and its sequence context |
| U1, U2, U3 | 3 | Potential influence on protein context |
| Length of neighboring exons and upstream intron | 3 | The architecture of the pre-mRNA influences AS |
| GC content of neighboring exons and upstream intron | 3 | GC content can influence AS |
| Features related to the pyrimidine content of the PPT | 6 | Composition of the PPT influences splicing |
| Splice site strength of E and I splice sites | 2 | Alternative NAGNAGs tend to have comparable splice site strengths |
Performance on the dataset D1, using SVMs
| Classification problem | AUC (original sample labels) | Features (original sample labels) | AUC (sample labels according to TassDB) | Features used (sample labels according to TassDB) |
|---|---|---|---|---|
| E versus EI | 0.82 | N1, N2, MAXENT_E, MAXENT_I, D1, p−1, Y-content | 0.89 | N1, N2, D1, D3, U1, U2, p−8, p−5, p−2, p−1 |
| I versus EI | 0.77 | N1, N2, MAXENT_E, MAXENT_I, D1, p−2, p−1, GC_intron | 0.85 | N1, N2, D1, D2, D3, U1, U2, U3, p−19, p−18, p−16, p−13, p−12, p−11, p−10, p−9, p−8, p−6, p−5, p−2, p−4, p−3, p−2, p−1 |
For nucleotide nomenclature, see Figure 2. Y-content: fraction of the 20 bp upstream of the NAGNAG motif that are pyrimidines; GC_intron: G + C content of the intron ending with the NAGNAG; MAXENT_E, MAXENT_I: MAXENT scores for the E and I splice sites.
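The two composition features defined in the footnote above are simple to compute from sequence. The following is a minimal sketch, not the authors' code: the function names, the assumption that the input ends with the 6-nt NAGNAG motif, and the example sequence are all illustrative.

```python
PYRIMIDINES = frozenset("CTU")  # U is included so RNA input also works
GC = frozenset("GC")


def fraction_in(seq: str, alphabet) -> float:
    """Fraction of positions in `seq` whose base belongs to `alphabet`."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in alphabet for base in seq.upper()) / len(seq)


def y_content(intron_end: str, window: int = 20) -> float:
    """Pyrimidine fraction of the `window` nt directly upstream of the NAGNAG.

    Assumption: `intron_end` ends with the 6-nt NAGNAG motif.
    """
    upstream = intron_end[:-6][-window:]
    if len(upstream) < window:
        raise ValueError("not enough sequence upstream of the motif")
    return fraction_in(upstream, PYRIMIDINES)


def gc_content(seq: str) -> float:
    """G + C fraction of a sequence (for example, a whole intron or exon)."""
    return fraction_in(seq, GC)


# Made-up sequence: 6 nt, then 20 nt with 18 pyrimidines, then CAGCAG.
example = "GTAAGT" + "TTTCTCTCTTTCCTGTTCTA" + "CAGCAG"
print(y_content(example))      # 0.9
print(gc_content("GGCCAATT"))  # 0.5
```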
Figure 3. In-silico performance of the Bayesian network. ROC plot showing the performance achieved on the 3-class [I-class (red), E-class (green), and EI-class (blue)] classification problem. The I-class is the easiest to predict, whereas the EI-class, or AS, is the hardest.
Figure 4. Experimental validation of predictions using RT–PCR and quantification by capillary electrophoresis. Experimental results indicating (A) constitutive NAGNAG splicing of VPS13D exon 27 and (B) alternative NAGNAG splicing of INPP5E exon 6, minor isoform abundance 24%.
Figure 5. Bayesian network to predict NAGNAG alternative splicing. The 14-feature Bayesian network learned on the D2 dataset. Note that the class node, which has an edge to all other nodes, is omitted for ease of visualization. Thus, this is just the augmenting tree in the TAN classifier.
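For readers unfamiliar with the tree-augmented naive Bayes (TAN) classifier mentioned in the Figure 5 caption, the augmenting tree is conventionally built as a maximum-weight spanning tree over the pairwise conditional mutual information between features given the class. The sketch below illustrates that standard construction on toy data; it is not the software used in the study, and the function names and data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def conditional_mutual_information(xi, xj, c):
    """I(Xi; Xj | C) in bits, estimated from empirical counts."""
    n = len(c)
    n_c = Counter(c)
    n_ic = Counter(zip(xi, c))
    n_jc = Counter(zip(xj, c))
    n_ijc = Counter(zip(xi, xj, c))
    total = 0.0
    for (a, b, k), n_abk in n_ijc.items():
        # p(a,b,k) * log2( p(a,b|k) / (p(a|k) * p(b|k)) )
        ratio = n_abk * n_c[k] / (n_ic[(a, k)] * n_jc[(b, k)])
        total += (n_abk / n) * math.log2(ratio)
    return total


def tan_tree(columns, labels):
    """Directed (parent, child) edges of the augmenting tree, rooted at feature 0.

    Uses Prim's algorithm to grow a maximum-weight spanning tree.
    """
    m = len(columns)
    weight = {
        pair: conditional_mutual_information(columns[pair[0]], columns[pair[1]], labels)
        for pair in combinations(range(m), 2)
    }
    in_tree, edges = {0}, []
    while len(in_tree) < m:
        candidates = [(i, j) for i in in_tree for j in range(m) if j not in in_tree]
        parent, child = max(candidates, key=lambda e: weight[tuple(sorted(e))])
        edges.append((parent, child))
        in_tree.add(child)
    return edges


# Toy data: feature 1 copies feature 0; feature 2 is independent given the class.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
x0 = [0, 1, 0, 1, 0, 1, 0, 1]
x1 = list(x0)
x2 = [0, 0, 1, 1, 0, 0, 1, 1]
edges = tan_tree([x0, x1, x2], labels)
print(edges[0])  # (0, 1): the strongly dependent pair is linked first
```

In a full TAN classifier, every feature additionally receives the class node as a parent, which is the edge set the caption says was omitted from the figure.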
Accuracy of prediction against threshold of the minor splice variant
| Threshold of the minor splice variant (%) | Experimentally confirmed predictions of AS (%), Class 1 (a) | Experimentally confirmed predictions of AS (%), Class 3 (b) |
|---|---|---|
| 10 | 50 | 60 |
| 8 | 50 | 90 |
| 6 | 67 | 90 |
| 4 | 100 | 100 |
(a) Six predictions with P(EI) ≥ 0.9.
(b) Ten predictions with P(EI) ≥ 0.9.
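The dependence of the confirmation rate on the abundance threshold can be illustrated with a short sketch. The abundances below are HYPOTHETICAL values, chosen only so that the output reproduces the Class 1 column of the table above; they are not measurements from the study, and the function name is illustrative.

```python
def confirmed_fraction(minor_abundances, threshold):
    """Share of cases whose minor-isoform abundance (%) reaches `threshold` (%)."""
    hits = sum(abundance >= threshold for abundance in minor_abundances)
    return hits / len(minor_abundances)


# HYPOTHETICAL minor-isoform abundances (%) for six predictions.
hypothetical = [24.0, 15.0, 12.0, 7.0, 5.0, 4.5]

for threshold in (10, 8, 6, 4):
    percent = round(100 * confirmed_fraction(hypothetical, threshold))
    print(threshold, percent)
# 10 50
# 8 50
# 6 67
# 4 100
```

Lowering the threshold can only add confirmed cases, which is why the confirmed share never decreases from the top row to the bottom row.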
Accuracy of predictions against posterior probability
| P(EI) | Accuracy of predictions (a) |
|---|---|
| 0.9–1 | 100% (10/10) |
| 0.7–0.89 | 80% (8/10) |
| 0.5–0.69 | 50% (5/10) |
(a) Abundance of the minor splice variant ≥ 4%.
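The percentages in this table follow directly from the confirmed/tested counts. As a quick arithmetic check (a sketch; the dictionary layout and names are illustrative), the counts can also be pooled across the three bins. Note that the pooled figure covers only the predictions in this table, not the 38/47 reported in the abstract.

```python
# (confirmed, tested) per posterior-probability bin, taken from the table above.
BINS = {"0.9-1": (10, 10), "0.7-0.89": (8, 10), "0.5-0.69": (5, 10)}


def accuracy(confirmed, tested):
    """Fraction of tested predictions that were experimentally confirmed."""
    return confirmed / tested


for label, (confirmed, tested) in BINS.items():
    print(f"P(EI) {label}: {accuracy(confirmed, tested):.0%} ({confirmed}/{tested})")

pooled_confirmed = sum(confirmed for confirmed, _ in BINS.values())
pooled_tested = sum(tested for _, tested in BINS.values())
print(pooled_confirmed, pooled_tested)  # 23 30
```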
Top 10 features according to the information gain
| Feature | Information gain |
|---|---|
| N2 | 0.492 |
| MaxEntScan I | 0.448 |
| MaxEntScan E | 0.252 |
| N1 | 0.199 |
| p−1 | 0.040 |
| D1 | 0.020 |
| p−2 | 0.014 |
| p−3 | 0.005 |
| G/C content of 3′ exon | 0.005 |
| D3 | 0.004 |
For nomenclature, see Figure 2.
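Information gain measures how much knowing a feature reduces the entropy of the class label. The sketch below computes it for a discrete feature on toy data. It assumes log base 2 (bits) and categorical values, so a continuous feature such as a MaxEntScan score would first need to be discretized; the data and names are hypothetical and unrelated to the study's datasets.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(feature, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - conditional


# Toy example: one feature separates the classes perfectly, the other not at all.
labels = ["EI", "EI", "E", "E"]
informative = ["A", "A", "G", "G"]
uninformative = ["A", "G", "A", "G"]
print(information_gain(informative, labels))    # 1.0
print(information_gain(uninformative, labels))  # 0.0
```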
Area under the ROC curve for the three classes and six organisms
| Organism | AUC (EI) | AUC (E) | AUC (I) |
|---|---|---|---|
| Human | 0.967 | 0.985 | 0.989 |
| Mouse | 0.966 | 0.982 | 0.989 |
| Rat | 0.967 | 0.985 | 0.991 |
| Chicken | 0.972 | 0.983 | 0.986 |
| Zebrafish | 0.967 | 0.983 | 0.992 |
| Fruitfly | 0.924 | 0.971 | 0.952 |
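Per-class AUC values such as those above are typically obtained one-vs-rest: each class in turn is treated as positive and scored by its posterior probability. The sketch below uses the rank-based definition of AUC (the probability that a random positive outscores a random negative, with ties counted as one half). The posteriors and labels are toy values, and the function names are illustrative.

```python
def auc(scores, is_positive):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(posteriors, true_labels, cls):
    """AUC for class `cls` against all other classes."""
    scores = [posterior[cls] for posterior in posteriors]
    return auc(scores, [label == cls for label in true_labels])


# Toy posteriors for four samples (hypothetical, not from the study).
posteriors = [
    {"EI": 0.90, "E": 0.05, "I": 0.05},
    {"EI": 0.20, "E": 0.70, "I": 0.10},
    {"EI": 0.60, "E": 0.30, "I": 0.10},
    {"EI": 0.40, "E": 0.10, "I": 0.50},
]
true_labels = ["EI", "E", "E", "EI"]
print(one_vs_rest_auc(posteriors, true_labels, "EI"))  # 0.75
print(one_vs_rest_auc(posteriors, true_labels, "E"))   # 1.0
```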
Predictions of the 14-feature BN on experimentally studied cases from the literature (30)
| Gene | Isoform ratios (E:I) in different tissues | P(EI) | P(E) | P(I) |
|---|---|---|---|---|
| DRPLA | 8:2–9:1 | 0.76 | 0.22 | 0.02 |
| GHRHR | 2:8 | 0.92 | 0.04 | 0.05 |
| BAIAP2 | 1:9–0:10 | 0.88 | 0.04 | 0.07 |
| PTMA | 0:10–1:9 | 0.14 | 0.33 | 0.53 |
| IGF1R | 7:3–8:2 | 0.56 | 0.43 | 0 |
| PAX3 | 0:10–10:0 | 0.72 | 0.03 | 0.25 |
| PAX7 | 0:10–9:1 | 0.69 | 0.13 | 0.18 |
| LEP | 1:9–10:0 | 0.61 | 0.38 | 0.02 |
| DNMT1 (Mouse) | 4:6–6:4 | 0.58 | 0.07 | 0.35 |
| CAST | 9:1–10:0 | 0.90 | 0.08 | 0.03 |
| MAN2B1 | 0:10–3:7 | 0.23 | 0.67 | 0.10 |
| PSEN2 | 7:3 | 0.45 | 0.55 | 0 |
| LAP1B | 0:10–10:0 | 0.84 | 0.15 | 0.01 |
| NOXO1 | 0:10–9:1 | 0.08 | 0.91 | 0.01 |
| CCL20 | 4:6–9:1 | 0.80 | 0.18 | 0.02 |
| SGNE1 | 4:6–8:2 | 0.48 | 0.41 | 0.11 |
| TGFA | 5:5–9:1 | 0.93 | 0.04 | 0.03 |
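One way to read the table above is to assign each gene the class with the highest posterior probability. The sketch below does this for the 17 rows. The argmax decision rule is an assumption made for illustration and may not be the rule used in the study; the variable names are likewise illustrative.

```python
# (gene, P(EI), P(E), P(I)) copied from the table above.
ROWS = [
    ("DRPLA", 0.76, 0.22, 0.02), ("GHRHR", 0.92, 0.04, 0.05),
    ("BAIAP2", 0.88, 0.04, 0.07), ("PTMA", 0.14, 0.33, 0.53),
    ("IGF1R", 0.56, 0.43, 0.00), ("PAX3", 0.72, 0.03, 0.25),
    ("PAX7", 0.69, 0.13, 0.18), ("LEP", 0.61, 0.38, 0.02),
    ("DNMT1 (Mouse)", 0.58, 0.07, 0.35), ("CAST", 0.90, 0.08, 0.03),
    ("MAN2B1", 0.23, 0.67, 0.10), ("PSEN2", 0.45, 0.55, 0.00),
    ("LAP1B", 0.84, 0.15, 0.01), ("NOXO1", 0.08, 0.91, 0.01),
    ("CCL20", 0.80, 0.18, 0.02), ("SGNE1", 0.48, 0.41, 0.11),
    ("TGFA", 0.93, 0.04, 0.03),
]
CLASSES = ("EI", "E", "I")


def predicted_class(p_ei, p_e, p_i):
    """Class with the highest posterior (assumed argmax decision rule)."""
    return max(zip(CLASSES, (p_ei, p_e, p_i)), key=lambda pair: pair[1])[0]


calls = {gene: predicted_class(*probs) for gene, *probs in ROWS}
not_ei = sorted(gene for gene, call in calls.items() if call != "EI")
print(sum(call == "EI" for call in calls.values()), len(calls))  # 13 17
print(not_ei)  # ['MAN2B1', 'NOXO1', 'PSEN2', 'PTMA']
```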