Mihaela Pertea, Stephen M Mount, Steven L Salzberg.
Abstract
BACKGROUND: Algorithmic approaches to splice site prediction have relied mainly on the consensus patterns found at the boundaries between protein coding and non-coding regions. However, exonic splicing enhancers have been shown to enhance the utilization of nearby splice sites.
Year: 2007 PMID: 17517127 PMCID: PMC1892810 DOI: 10.1186/1471-2105-8-159
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Sequence logos for motifs detected in the ESEAra exons. a) Motif detected at the 5' end of ESEAra exons, and b) motif detected at the 3' end of ESEAra exons. Both logos were computed with WebLogo [45].
Experimental evidence for predicted ESE hexamers.
| 9-mer ESE | ESE Score | Mutant ESE | Mutant Score | Contained Hexamer Motifs |
| GAAGAAGAA | 5 | GCAGAAAAA | -1 | gaagaa, aagaag |
| TGCTGCTGG | 5 | | | tgctgc, gctgct |
| TGCAGCTGG | 5 | | | gcagct, cagctg |
| GAAGATGGA | 5 | | | gaagat, aagatg, gatgga |
| GAAGGAAGA | 5 | | | gaagga, aaggaa, ggaaga |
| GAGAAGAAG | 5 | | | gagaag, gaagaa, aagaag |
| TTGGAGCAA | 5 | | | ttggag, ggagca |
| AGCTGCTGG | 4 | | | agctgc, gctgct |
| TGCTGGTGG | 4 | | | tggtgg |
| TGCTGCAGG | 4 | | | tgctgc, ctgcag |
| TGCTGCTCG | 4 | | | tgctgc, gctgct |
| TGCTGCTGC | 4 | TACTTCTGC | -3 | tgctgc, gctgct |
| GAGGATTGA | 4 | GAGAATTGA | -1 | gaggat |
| TGCAGATGA | 4 | | | gcagat, cagatg |
| CAAGAAACA | 4 | | | aagaaa |
| GAAGAGAAA | 4 | GCAGAAAAA | -1 | aagaga |
| AAAGGAGAT | 4 | | | aaggag, aggaga, ggagat |
| GAAGAAAGA | 4 | | | gaagaa, aagaaa |
| GAGCAGAAG | 4 | | | gagcag |
| TGCTGCCGC | 4 | | | tgctgc |
| TTGAAGAAG | 3 | TTGAAAAAG | -3 | ttgaag, tgaaga, gaagaa, aagaag |
| TTGAAGCTG | 3 | TTAAAGCTG | -3 | ttgaag, tgaagc, gaagct, aagctg |
| GAAGATTGA | 3 | GAGAATTGA | -1 | gaagat |
| TTTGGTGGA | 3 | | | tggtgg, ggtgga |
| ATGGAGAAA | 3 | ATTGAGAAA | -3 | atggag, tggaga, ggagaa |
Hexamer motifs that are contained within experimentally confirmed 9-mers with ESE activity (column 5). Experiments to confirm the 9-mers are described elsewhere (S. Mount et al., manuscript in preparation). Column 1 shows the containing ESE 9-mer, and column 3 shows 9-mers without ESE activity that lie within an edit distance of 1–2 bp of the ESE 9-mer. The ESE activity of each 9-mer in the table is given as a score equal to log2(inclusion/skipping) [34].
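The relationships described in this note can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, the inclusion and skipping values passed to `ese_score` are made-up numbers, and the edit distance is approximated by a substitution-only (Hamming) distance, which is an assumption.

```python
import math


def hexamers(ninemer, k=6):
    """Return every overlapping k-mer of a sequence, lowercased."""
    seq = ninemer.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def ese_score(inclusion, skipping):
    """Score defined in the table note: log2(inclusion / skipping)."""
    return math.log2(inclusion / skipping)


# First row of the table: the 9-mer, its mutant, and one listed hexamer.
print(hexamers("GAAGAAGAA"))
print("aagaag" in hexamers("GAAGAAGAA"))
print(hamming("GAAGAAGAA", "GCAGAAAAA"))
# Hypothetical measurement: 32 units of inclusion per unit of skipping.
print(ese_score(32, 1))
```

A 9-mer contains four overlapping hexamers, so column 5 lists only the subset of those windows that were predicted as ESE motifs.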
Figure 2. Sensitivity versus specificity rates for GeneSplicer and GeneSplicerESE. Sensitivity is defined as the fraction of all true splice sites found by the splice site predictor; specificity is the fraction of the predicted elements labelled correctly as splice sites. Rates are shown for a) donor sites (GS don and GSESE don), and b) acceptor sites (GS acc and GSESE acc). Results are obtained using a 5-fold cross-validation procedure on the ESEAra data set. Weight matrices for the selected motifs to describe each of the splice sites were recomputed on each training data set from the 5 partitions of the CV procedure.
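The two rates defined in this caption can be written out directly. The sketch below uses hypothetical counts and function names; note that specificity as defined here (correct predictions over all predictions) is the quantity often called precision elsewhere.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of all true splice sites that the predictor found."""
    return true_pos / (true_pos + false_neg)


def specificity(true_pos, false_pos):
    """Fraction of predicted sites that are real splice sites."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts: 90 true sites found, 10 missed, 30 wrong predictions.
print(sensitivity(90, 10))
print(specificity(90, 30))
```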
False negative (FN) vs. false positive (FP) rates on test and intergenic data sets for acceptor sites
| FN (%) | FP (%), GS-test | FP (%), GS-intg | FP (%), GSESE-test | FP (%), GSESE-intg |
| 0.5 | 14.27 | 29.58 | 12.47 | 20.67 |
| 1 | 10.03 | 23.39 | 8.09 | 15.74 |
| 2 | 7.11 | 18.51 | 5.80 | 11.30 |
| 3 | 5.64 | 15.76 | 4.21 | 9.00 |
| 5 | 4.00 | 12.41 | 2.94 | 6.56 |
| 7 | 3.13 | 10.43 | 2.18 | 5.20 |
| 10 | 2.32 | 8.41 | 1.62 | 4.01 |
| 15 | 1.55 | 6.20 | 1.05 | 2.74 |
| 20 | 1.10 | 4.86 | 0.71 | 2.01 |
Rates on test data are obtained from a 5-fold CV procedure on the ESEAra data set, while FP rates on intergenic data are averages of the FP rates obtained on INTAra by setting, for each of the 5 test folds, a threshold that would produce the same FN rate on that fold.
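The thresholding procedure in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not GeneSplicer's implementation: the function names and toy scores are hypothetical, a site is assumed to be predicted when its score is at or above the threshold, and the FP rate is assumed to be the fraction of non-site candidates that are predicted.

```python
from statistics import mean


def threshold_for_fn_rate(true_scores, fn_rate):
    """Score cutoff that misses at most fn_rate of the true sites."""
    if not 0 <= fn_rate < 1:
        raise ValueError("fn_rate must be in [0, 1)")
    ordered = sorted(true_scores)
    n_missed = int(fn_rate * len(ordered))  # true sites allowed below the cutoff
    return ordered[n_missed]


def fp_rate(decoy_scores, threshold):
    """Fraction of non-site candidates scoring at or above the cutoff."""
    return sum(s >= threshold for s in decoy_scores) / len(decoy_scores)


def average_intergenic_fp(test_folds, intergenic_scores, fn_rate):
    """Average the intergenic FP rate over per-fold thresholds."""
    return mean(
        fp_rate(intergenic_scores, threshold_for_fn_rate(fold, fn_rate))
        for fold in test_folds
    )


# Toy scores for true sites in two folds and for intergenic candidates.
folds = [list(range(1, 11)), list(range(2, 21, 2))]
intergenic = list(range(10))
print(average_intergenic_fp(folds, intergenic, fn_rate=0.2))
```

Choosing the threshold per fold keeps the FN rate fixed, so the FP rates of the two predictors can be compared row by row.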
False negative (FN) vs. false positive (FP) rates on test and intergenic data sets for donor sites
| FN (%) | FP (%), GS-test | FP (%), GS-intg | FP (%), GSESE-test | FP (%), GSESE-intg |
| 0.5 | 11.06 | 17.99 | 9.11 | 12.84 |
| 1 | 7.58 | 13.11 | 6.24 | 9.35 |
| 2 | 5.33 | 9.75 | 4.10 | 6.34 |
| 3 | 4.21 | 7.99 | 3.25 | 5.08 |
| 5 | 2.94 | 5.86 | 2.20 | 3.77 |
| 7 | 2.22 | 4.65 | 1.62 | 2.95 |
| 10 | 1.61 | 3.58 | 1.15 | 2.27 |
| 15 | 1.03 | 2.48 | 0.74 | 1.58 |
| 20 | 0.73 | 1.86 | 0.52 | 1.20 |
Rates on test data are obtained from a 5-fold CV procedure on the ESEAra data set, while FP rates on intergenic data are averages of the FP rates obtained on INTAra by setting, for each of the 5 test folds, a threshold that would produce the same FN rate on that fold.
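As a worked reading of the donor-site table above, the relative drop in FP rate from GeneSplicer to GeneSplicerESE at a fixed FN rate can be computed as below. The helper name is hypothetical and only three rows are copied in for brevity; the relative-reduction measure is one way to summarize the table, not a statistic reported in the source.

```python
def relative_reduction(baseline_fp, improved_fp):
    """Fraction by which the FP rate drops relative to the baseline."""
    return (baseline_fp - improved_fp) / baseline_fp


# (FN %, GS-intg FP %, GSESE-intg FP %) copied from the donor-site table.
donor_intergenic = [
    (1, 13.11, 9.35),
    (5, 5.86, 3.77),
    (10, 3.58, 2.27),
]

for fn, gs, gsese in donor_intergenic:
    drop = relative_reduction(gs, gsese)
    print(f"FN {fn}%: intergenic FP rate reduced by {drop:.1%}")
```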
Figure 3. The contribution of weak splice sites to GeneSplicerESE's performance. For each threshold that would produce a given false negative rate over all splice sites in the test data, we show the difference between the number of false positives predicted by GeneSplicer and the number predicted by GeneSplicerESE. The red plot shows this value for all splice sites, while the green plot shows it for weak splice sites only. See Methods for the definition of weak sites. (a) donor sites; (b) acceptor sites.
False positive rates obtained by SpliceMachine and SpliceMachineESE on the GSAra data set
| Sn | Donors FP%, SpliceMachine | Donors FP%, SpliceMachineESE | Acceptors FP%, SpliceMachine | Acceptors FP%, SpliceMachineESE |
| 0.97 | 3.2 | 3.1 | 4.7 | 4.5 |
| 0.95 | 2.1 | 1.8 | 2.7 | 2.4 |
| 0.93 | 1.5 | 1.3 | 1.8 | 1.7 |
| 0.92 | 1.3 | 1.2 | 1.6 | 1.5 |
| 0.90 | 1.0 | 0.9 | 1.2 | 1.1 |
| 0.85 | 0.6 | 0.5 | 0.8 | 0.7 |
| 0.80 | 0.4 | 0.4 | 0.5 | 0.4 |
| 0.70 | 0.2 | 0.2 | 0.3 | 0.2 |
The false positive rates for SpliceMachine are copied from [29].