Martin Akerman, Yael Mandel-Gutfreund.
Abstract
Alternative splicing constitutes a major mechanism creating protein diversity in humans. This diversity can result from the alternative skipping of entire exons or from alternative selection of the 5' or 3' splice sites that define the exon boundaries. In this study, we analyze the sequence and evolutionary characteristics of alternative 3' splice sites conserved between human and mouse genomes for distances ranging from 3 to 100 nucleotides. We show that alternative splicing events can be distinguished from constitutive splicing by a combination of properties which vary depending on the distance between the splice sites. Among the unique features of alternative 3' splice sites, we observed an unexpectedly high occurrence of events in which a polypyrimidine tract was found to overlap the upstream splice site. By applying a machine-learning approach, we show that we can successfully discriminate true alternative 3' splice sites from constitutive 3' splice sites. Finally, we propose that the unique features of the intron flanking alternative splice sites are indicative of a regulatory mechanism that is involved in splice site selection. We postulate that the process of splice site selection is influenced by the distance between the competitive splice sites.
Year: 2007 PMID: 17704130 PMCID: PMC2018619 DOI: 10.1093/nar/gkm603
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
P-values for Student's t- and F-tests comparing alternative acceptors against constitutive/pseudo acceptors for the following features: distal splice sites (Dist SS), proximal splice sites (Prox SS), average intronic conservation in the 100 nt upstream of the proximal splice site (IC100), PPT length, PPT score, distance of the PPT to the distal site (PPT∼D) and to the proximal site (PPT∼P), ESE/ESS density, pseudo HAG sites and GC content
| Feature | NAGNAG-P | NAGNAG-D | CLOSE | MID | FAR | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| Dist SS | 0.215 | 0.371 | 0.183 | 0.321 | 0.005 | 0.003 | 0.049 | |||
| Prox SS | 0.236 | 0.026 | 0.730 | 0.241 | 0.083 | 0.169 | 0.063 | 0.260 | ||
| IC100 | 0.144 | 0.007 | 0.037 | 0.013 | 0.435 | 0.373 | 0.465 | |||
| PPT Length | 0.226 | 0.344 | 0.996 | 0.030 | 0.823 | 0.004 | 0.001 | 2. | 0.061 | 2. |
| PPT Score | 0.525 | 0.748 | 0.211 | 0.199 | 0.012 | 3. | ||||
| PPT∼D | 0.054 | 0.233 | 0.901 | 0.516 | 0.002 | 0.854 | 0.038 | 0.161 | ||
| PPT∼P | na | na | na | na | 0.007 | 0.986 | 0.055 | 0.023 | ||
| BP score | 0.525 | 0.748 | 0.2388 | 0.2338 | 0.289 | 0.539 | 0.4418 | 0.239 | 0.4564 | 0.550 |
| BP∼D | 0.054 | 0.233 | 0.6724 | 0.9282 | 0.017 | 0.002 | 0.154 | 0.021 | ||
| BP∼P | na | na | na | na | 2.90 | 0.031 | 0.143 | 0.005 | ||
| ESE | 0.042 | 0.020 | 0.850 | 0.812 | 0.708 | 0.618 | 0.031 | 0.331 | 0.042 | 0.020 |
| ESS.hex2 | 0.873 | 0.252 | 0.582 | 0.948 | 0.721 | 0.186 | 0.064 | 0.157 | 0.176 | 0.135 |
| ESS.hex3 | 0.350 | 0.023 | 0.710 | 0.779 | 0.985 | 0.170 | 0.606 | 0.112 | 0.324 | 0.428 |
| HAG | 0.888 | 0.561 | 0.601 | 0.023 | 0.023 | 0.723 | 0.490 | 0.868 | ||
| GC | 0.265 | 0.512 | 0.714 | 0.711 | 0.044 | 0.999 | 0.149 | 0.312 | ||
Results are shown for the different datasets: FAR, MID, CLOSE, NAGNAG-proximal and NAGNAG-distal. Significant values (based on Westfall–Young correction) are indicated in bold.
Figure 1. Human–mouse evolutionary conservation shown for the CLOSE (A) and FAR (B) groups. Conservation was calculated for the 30 nt upstream of the proximal (or pseudo) splice site in overlapping windows of length 10. Gray circles represent alternative acceptor (AA) pairs, black triangles represent constitutive/pseudo acceptor (CA/PA) pairs and gray crosses represent a set of 1000 randomly selected constitutive acceptors (CA). For the CA set, the conservation was calculated upstream of the constitutive splice site.
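The windowed calculation described in the Figure 1 caption can be illustrated with a minimal sketch. The record states neither the conservation measure nor the window step, so the code assumes fraction identity over a gap-free human–mouse pairwise alignment and a step of 1 nt; the function name `window_conservation` and the example sequences are hypothetical.

```python
def window_conservation(human, mouse, window=10, step=1):
    """Fraction of identical aligned positions in each overlapping window.

    Assumptions (not stated in the source): identity as the conservation
    measure, aligned sequences of equal length, and a step of 1 nt.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have the same length")
    scores = []
    for start in range(0, len(human) - window + 1, step):
        h = human[start:start + window]
        m = mouse[start:start + window]
        # Count matching positions; an alignment gap never counts as a match.
        matches = sum(1 for a, b in zip(h, m) if a == b and a != "-")
        scores.append(matches / window)
    return scores


# Hypothetical 30 nt regions upstream of a splice site (illustration only).
human_seq = "TTCTCTTTCCTTTTCTCTGTTTCCTTGCAG"
mouse_seq = "TTCTCTTTCCTTTTCTCTGTTTCCTTGCAG"
profile = window_conservation(human_seq, mouse_seq)
print(len(profile), min(profile), max(profile))
```

A 30 nt region with windows of length 10 and a step of 1 yields 21 windows; a larger step would give fewer points per profile.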
Figure 2. PPT distribution. (A) The distance between the most downstream nt of the PPT and position −1 (or N site) at the distal NAG site is plotted against the number of nts between position −1 of the proximal and the distal splice site. (B) A control set in which the PPT-to-constitutive splice site distance is plotted against the constitutive-to-pseudo splice site distance. The diagonal indicates positions for which the PPT is adjacent to the proximal (or pseudo) splice site.
Figure 3. Position of PPTs relative to the proximal splice site in the FAR group. The bars indicate the percent of observations in the data. I–III are cases in which only one PPT was found upstream of (I), overlapping (II) or downstream of (III) the proximal splice site. IV–VI are cases in which two PPTs were observed: (IV) flanking the proximal site, (V) one PPT overlapping the splice site and the second one upstream and (VI) one PPT overlapping the splice site and the second one downstream of the proximal splice site. The gray bars represent alternative acceptor pairs and the black bars represent constitutive/pseudo acceptor pairs in which the pseudo splice site mimics the proximal site. The number of occurrences is shown in brackets.
SVM performance
| Dataset | FP | FN | TP | TN | SN (%) | SP (%) | TA (%) | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| FAR | 12 | 7 | 40 | 129 | 85.106 | 91.489 | 89.894 | 0.741 | 0.936 |
| MID | 37 | 14 | 63 | 194 | 81.818 | 83.983 | 83.442 | 0.608 | 0.913 |
| CLOSE | 51 | 22 | 53 | 174 | 70.667 | 77.333 | 75.667 | 0.437 | 0.802 |
| NAGNAG-P | 121 | 47 | 150 | 276 | 76.142 | 69.521 | 71.717 | 0.432 | 0.785 |
| NAGNAG-D | 53 | 75 | 122 | 124 | 61.929 | 70.056 | 65.775 | 0.32 | 0.698 |
The table displays the number of false positives (FP), false negatives (FN), true positives (TP) and true negatives (TN), the sensitivity (SN), specificity (SP) and total accuracy (TA), as well as the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC) for the different datasets.
Figure 4. ROC plot summarizing the SVM results: False positive rate is plotted against the true positive rate for alternative acceptors versus constitutive/pseudo acceptor pairs in the FAR (black line), MID (red dashed), CLOSE (green line), NAGNAG-proximal (blue dots) and NAGNAG-distal (black dots) groups.
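For readers unfamiliar with how such a curve is built, the following sketch derives ROC points and the AUC from classifier scores by sweeping the decision threshold. It is a generic illustration and not the authors' pipeline; the labels and scores are made-up toy values, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for alternative acceptors, 0 for constitutive/pseudo pairs.
    scores: classifier decision values; higher means more likely alternative.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        j = i
        # Items with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (illustration only, not from the study).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(round(toy_auc, 3))  # 0.889
```

An AUC of 1 corresponds to perfect separation of the two classes and 0.5 to random ranking, which is the scale on which the AUC column of the SVM performance table is read.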
Figure 5. ΔAUC values for the different feature sets are plotted for the FAR (black), MID (red), CLOSE (green) and NAGNAG-proximal (blue) groups. The feature sets are splice sites (SS), intronic conservation (CON), polypyrimidine tract (PPT), pseudo splice sites (PSE) and GC content (GC).