Rileen Sinha, Michael Hiller, Rainer Pudimat, Ulrike Gausmann, Matthias Platzer, Rolf Backofen.
Abstract
BACKGROUND: Alternative splicing is a major contributor to the diversity of eukaryotic transcriptomes and proteomes. Currently, large-scale detection of alternative splicing using expressed sequence tags (ESTs) or microarrays does not capture all alternative splicing events. Moreover, for many species genomic data is being produced at a far greater rate than corresponding transcript data; hence in silico methods of predicting alternative splicing have to be improved.
Year: 2008 PMID: 19014490 PMCID: PMC2621368 DOI: 10.1186/1471-2105-9-477
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features for machine learning used in this study
| Feature | Number of features | Rationale | Reference |
| --- | --- | --- | --- |
| Exon: length, symmetry, and identity with mouse ortholog | 3 | Alternative exons tend to be shorter, frame-preserving, and more conserved compared to constitutive exons | [ |
| Conservation of intronic flanks: length/identity of the best local and identity of the global alignment | 2 × 3 | Alternative exons tend to have higher conservation in their intronic flanks | [ |
| Conservation in a 12 nucleotide region spanning the 3' and 5'ss | 2 | As alternative exons and their intronic flanks are more conserved, this may in particular concern the exon/intron boundaries | This work |
| PPT intensity | 1 | Alternative exons tend to have weaker PPTs | [ |
| Nucleotides at seven positions flanking the 5'ss | 4 × 7 | Alternative exons tend to have specific nucleotide preferences near the 5'ss | [ |
| Frequency of di- and trimers in the exon and flanking introns | 3 × 16 | Motifs which are part of splice regulatory motifs might differ in their abundance in alternative and constitutive exons | [ |
| Splice site strength of 3' and 5'ss | 2 | Alternative exons tend to have weak splice sites | [ |
| Length of flanking introns | 2 | Alternative exons tend to be flanked by long introns | [ |
| GC content of exon and intronic flanks | 3 | GC-poor regions tend to promote alternative splicing | This work |
| Features based on NI scores | 24 | Alternative exons tend to have fewer ESEs and more ESSs | This work |
| Features based on PU values | 15 | Single-stranded motifs are likelier to bind to regulators | This work |
| PTB-binding sites | 6 | PTB is a regulator of alternative splicing | This work |
| Features based on ISREs | 8 | Alternative exons tend to have more ISREs in their intronic flanks | This work |
| Density of various motifs | 22 | Several motifs are known to be associated with alternative splicing | This work |
| Combination features | 7 | Combining features can capture more information | This work |
Note that the total number of features used is 365 whereas the sum of the entries here is 378, because some features have been counted in more than one category (for example, in PU value and NI score related features).
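Several of the sequence-derived features in the table (exon symmetry, GC content, and di- or trimer frequencies) are simple enough to sketch directly. The following is a minimal illustration, not the authors' implementation: the function names and the toy sequence are hypothetical, and k-mer frequency is assumed here to mean overlapping-window counts normalised by the number of windows.

```python
from collections import Counter
from itertools import product


def is_symmetric(exon: str) -> int:
    """Return 1 if the exon length is divisible by 3 (frame-preserving), else 0."""
    return 1 if len(exon) % 3 == 0 else 0


def gc_content(seq: str) -> float:
    """Fraction of G or C nucleotides in the sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of every k-mer over ACGT, using overlapping windows."""
    seq = seq.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product("ACGT", repeat=k)
    }


exon = "ATGGAGGAACCT"  # toy 12-nt sequence, purely illustrative
symmetry = is_symmetric(exon)
gc = gc_content(exon)
dimers = kmer_frequencies(exon, 2)   # 16 dimer features
trimers = kmer_frequencies(exon, 3)  # 64 trimer features
```

Applying the same functions to the exon and to each intronic flank would give one feature set per region.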
Figure 1. ROC plot showing the average performance of the 3-fold cross-validation on datasets D1 (red line) and D2 (green line).
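For readers unfamiliar with how such a curve is obtained, the sketch below computes ROC points and the area under the curve for each cross-validation fold and then averages the areas. It is only an illustration under stated assumptions: the scores and labels are invented, and averaging per-fold AUC values is one common convention, not necessarily the averaging used for the figure.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    labels: 1 for the positive class (e.g. alternative exon), 0 otherwise.
    Tied scores are processed together so they yield a single point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels for three folds.
folds = [
    ([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]),
    ([0.9, 0.6, 0.7, 0.2], [1, 1, 0, 0]),
    ([0.8, 0.3, 0.6, 0.1], [1, 0, 1, 0]),
]
fold_aucs = [auc(roc_points(s, y)) for s, y in folds]
mean_auc = sum(fold_aucs) / len(fold_aucs)
```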
Top features according to information gain and information gain ratio (excluding combination features)
| Rank | Feature (information gain) | Information gain | Feature (information gain ratio) | Information gain ratio |
| --- | --- | --- | --- | --- |
| 1 | Length of best alignment in the upstream intron flank | 0.169 | Abundance of GA in exon | 0.172 |
| 2 | Upstream intron flank conservation | 0.169 | Density of single stranded ESEs in exon | 0.151 |
| 3 | Identity of best alignment in the upstream intron flank | 0.142 | Exon identity | 0.128 |
| 4 | Downstream intron flank conservation | 0.138 | Average of positive NI scores in exon | 0.118 |
| 5 | Length of best alignment in the downstream intron flank | 0.138 | Length of best alignment in the upstream intron flank | 0.117 |
| 6 | Exon identity | 0.120 | Density of AC in exon | 0.115 |
| 7 | Identity of best alignment in the downstream intron flank | 0.088 | Average of negative NI scores in exon | 0.112 |
| 8 | Exon length | 0.080 | Density of CT in exon | 0.111 |
| 9 | Matches in 12-mer near 3'ss | 0.066 | ESE density in exon | 0.104 |
| 10 | Symmetry | 0.042 | Length of best alignment in the upstream intron flank | 0.103 |
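The two ranking criteria in this table can be sketched as follows. Information gain is the reduction in class entropy after observing a feature, and gain ratio divides that by the entropy of the feature itself. This is a minimal sketch for discrete features with invented toy data; continuous features such as alignment length would first need to be discretised, and the discretisation used in the study is not stated here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def information_gain(feature, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info else 0.0


# Toy data: 1 = alternative exon, 0 = constitutive exon (hypothetical values).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
symmetric = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 if exon length is divisible by 3

ig = information_gain(symmetric, labels)
gr = gain_ratio(symmetric, labels)
```

Because gain ratio normalises by the feature's own entropy, the two criteria can rank features differently, which is consistent with the two columns of the table listing different top features.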
Top combination features according to information gain
| Rank | Feature | Information gain |
| --- | --- | --- |
| 1 | Product of identities of exon and both intron flanks | 0.208 |
| 2 | Product of identity of both intron flanks | 0.196 |
| 3 | Product of identities of exon and upstream intron flank | 0.181 |
| 4 | Product of identities of exon and downstream intron flank | 0.153 |
| 5 | Ratio of the downstream intron length to exon length | 0.051 |
| 6 | Ratio of ESE density to ESS density | 0.029 |
| 7 | Sum of splice site scores | 0.023 |
| 8 | Ratio of the upstream intron length to exon length | 0.022 |
| 9 | Ratio of trusted ESE density to trusted ESS density | 0.010 |
| 10 | Density of putative PTB binding sites in exon | 0.008 |
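The top-ranked combination features are simple products, ratios, and sums of basic features. The sketch below shows how a few of them could be derived from per-exon values; the `ExonRecord` container, its field names, and the example numbers are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ExonRecord:
    """Hypothetical container; field names are illustrative only."""
    exon_identity: float         # identity with mouse ortholog, 0..1
    upstream_identity: float     # upstream intron flank identity, 0..1
    downstream_identity: float   # downstream intron flank identity, 0..1
    exon_length: int
    upstream_intron_length: int
    downstream_intron_length: int
    score_3ss: float             # splice site score of the 3'ss
    score_5ss: float             # splice site score of the 5'ss


def combination_features(r: ExonRecord) -> dict[str, float]:
    """Derive a few combination features from the basic per-exon values."""
    return {
        "identity_exon_x_both_flanks":
            r.exon_identity * r.upstream_identity * r.downstream_identity,
        "identity_both_flanks": r.upstream_identity * r.downstream_identity,
        "downstream_intron_to_exon_length":
            r.downstream_intron_length / r.exon_length,
        "upstream_intron_to_exon_length":
            r.upstream_intron_length / r.exon_length,
        "sum_splice_site_scores": r.score_3ss + r.score_5ss,
    }


record = ExonRecord(0.9, 0.8, 0.5, 120, 1200, 600, 8.5, 7.0)
features = combination_features(record)
```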
Top trimers in the exon and intron flanks according to information gain
| Rank | Trimer (exon) | Information gain | Intron flank | Trimer (intron flank) | Information gain |
| --- | --- | --- | --- | --- | --- |
| 1 | TCC | 0.034 | upstream | TTC | 0.016 |
| 2 | ATG | 0.031 | downstream | AGG | 0.014 |
| 3 | CCT | 0.029 | downstream | GAG | 0.012 |
| 4 | TCG | 0.028 | upstream | TTT | 0.012 |
| 5 | CAT | 0.028 | upstream | TCT | 0.012 |
| 6 | AAG | 0.027 | downstream | GGA | 0.012 |
| 7 | GTA | 0.027 | downstream | TTT | 0.011 |
| 8 | GAC | 0.026 | upstream | GAG | 0.011 |
| 9 | GAT | 0.026 | upstream | AGG | 0.011 |
| 10 | CAA | 0.026 | upstream | CAG | 0.009 |
Figure 2. 34-feature Bayesian network. Note that the BN in fact has 35 nodes. The class node, which has an edge to all other nodes and makes the actual number of edges 67, is omitted for ease of visualization. Thus, this is just the augmenting tree in the TAN classifier. The features associated with the nodes are as follows:
1: 1 if exon length is divisible by 3, otherwise 0.
2: Length of the best alignment in the 3' 100 nt intronic region.
3: Length of the best alignment in the 5' 100 nt intronic region.
4: Percent identity of the best alignment in the 5' 100 nt intronic region.
5: Length of the 5' intron.
6: Ratio of the lengths of the 3' intron and the exon.
7: Product of the identities of the exon and both 100-nt intronic flanks with their mouse orthologs.
8: 1 if G at +4 of the 5'ss, otherwise 0.
9: T at +4.
10: A at +6.
11: MAXENTSCAN score of the 5'ss.
12: Sum of the MAXENTSCAN scores of the 3' and 5'ss.
13: Average of the NI scores of all the hexamers with a negative NI score.
14: Variance of the NI scores of all the hexamers with a "strong" (≥ 0.8 or ≤ -0.8) score.
15: Average of the NI scores of all the hexamers with a "strong" (≤ -0.8) negative score.
16: Density of single-stranded (PU value ≥ 0.6), "trusted" ESEs (NI score = 1).
17: Ratio of the number of "trusted" ESEs (NI score = 1) to the number of ESSs (NI score = -1).
18: Density of ISREs enriched in the flanks of AS exons, in the 5' intron flank.
19: Density of single-stranded (PU value ≥ 0.6) intronic splice regulatory elements (ISREs) enriched in the flanks of AS exons, in the 5' intron flank.
20: PTB-binding site TCTT density in the exon.
21: Dimer CC density in the exon.
22: Dimer GA density in the exon.
23: Dimer GA density in the 3' intron flank.
24-34: Trimer density in the exon: 24: AAG, 25: AGG, 26: ATG, 27: CAA, 28: CCA, 29: CGG, 30: CTC, 31: GCA, 32: GGT, 33: TAG, 34: TCC.
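The node and edge counts quoted in the caption follow from the structure of a tree-augmented naive Bayes (TAN) classifier: the class node points to every feature, and the features themselves form a tree. The sketch below checks this arithmetic. The chain-shaped tree is a hypothetical stand-in, since the actual tree shape is given only in the figure, but any tree over 34 nodes has 33 edges.

```python
def tan_edges(feature_parent):
    """Edges of a TAN classifier given its augmenting tree.

    feature_parent maps each feature to its parent feature (None for the root).
    The class node contributes one edge to every feature.
    """
    class_node = "class"
    edges = [(class_node, f) for f in feature_parent]
    edges += [(p, f) for f, p in feature_parent.items() if p is not None]
    return edges


n_features = 34
# Hypothetical chain-shaped tree: feature 1 is the root, feature i has parent i - 1.
tree = {1: None, **{i: i - 1 for i in range(2, n_features + 1)}}

edges = tan_edges(tree)
n_nodes = n_features + 1                               # features plus class node
n_tree_edges = sum(p is not None for p in tree.values())
```

With 34 features this gives 35 nodes, 33 tree edges, and 34 class edges, for 67 edges in total, matching the caption.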