Mikita Suyama, Eoghan D Harrington, Svetlana Vinokourova, Magnus von Knebel Doeberitz, Osamu Ohara, Peer Bork.
Abstract
Cis-acting short sequence motifs play important roles in alternative splicing. It is now possible to identify such sequence motifs as conserved sequence patterns in genome sequence alignments. Here, we report a systematic search for motifs in the neighboring introns of alternatively spliced exons by comparative analysis of mammalian genome alignments. We identified 11 conserved sequence motifs that might be involved in the regulation of alternative splicing. These motifs are not only significantly overrepresented near alternatively spliced exons but also co-occur with each other, thus forming a network of cis-elements that is likely to be the basis for context-dependent regulation. Based on this finding, we used motif co-occurrence to predict alternatively skipped exons. We verified exon skipping in 29 cases out of 118 predictions (25%) by EST and mRNA sequences in the databases. For the predictions not verified by the database sequences, we confirmed exon skipping in 10 additional cases by using both RT-PCR experiments and publicly available RNA-Seq data. These results indicate that even more alternative splicing events will be found with the progress of large-scale and high-throughput analyses of various tissue samples and developmental stages.
Year: 2010 PMID: 20702423 PMCID: PMC3001076 DOI: 10.1093/nar/gkq705
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The regions analyzed to search for intronic splicing regulators. The skipped exon is shown as a gray box, and the analyzed regions are shown as gray arrows. If a flanking intron of the skipped exon is shorter than 1000 bases, only the region up to the neighboring exon is taken. In such cases, the 60 residues upstream of the 3′ splice site of the neighboring exon and the 5 residues downstream of the 5′ splice site of the neighboring exon are also excluded from the analysis to ensure that the region does not contain splice site signals, the polypyrimidine tract, or branch point sequences.
Table 1. Conserved pentamers with statistical significance
| Rank | Conserved pentamer | Clustered motif | P-value^a |
|---|---|---|---|
| 1 | GCATG | TGCATG | 1.0 × 10^-29 |
| 2 | TGCAT | TGCATG | 1.4 × 10^-20 |
| 3 | ACTAA | ACTAAC | 7.5 × 10^-12 |
| 4 | CTAAC | ACTAAC | 9.8 × 10^-10 |
| 5 | TGCTG | CTGCTGC | 8.7 × 10^-08 |
| 6 | GCTGC | CTGCTGC | 3.3 × 10^-07 |
| 7 | TGCTT | CTTGCTT | 7.2 × 10^-06 |
| 8 | CTTGC | CTTGCTT | 8.2 × 10^-06 |
| 9 | GTGGG | GTGGTGGG | 1.1 × 10^-05 |
| 10 | TTTCT | TTTCT | 1.2 × 10^-05 |
| 11 | AAGAT | AAGAT | 3.6 × 10^-05 |
| 12 | TGGAA | TGGAA | 4.2 × 10^-05 |
| 13 | GCTAA | GCTAA | 5.6 × 10^-05 |
| 14 | CTGCT | CTGCTGC | 5.6 × 10^-05 |
| 15 | AAAGG | AAAGG | 6.8 × 10^-05 |
| 16 | GTGGT | GTGGTGGG | 9.8 × 10^-05 |
| 17 | TCTTG | TCTTG | 1.2 × 10^-04 |
| 18 | GGTGG | GTGGTGGG | 1.3 × 10^-04 |
Pentamers are sorted by P-values. The pentamers that cover only a part of known motifs are excluded from the list (see Supplementary Table S4 for a complete list of the pentamers sorted by P-value).
^a P-values are calculated by Fisher’s exact test.
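The footnote names Fisher's exact test without spelling out the underlying table. As a sketch, assume a 2 × 2 contingency table of exons (skipped versus constitutive) against the presence or absence of a conserved pentamer in the flanking introns, with cell counts a, b (skipped, with and without the pentamer) and c, d (constitutive, with and without), and n = a + b + c + d. These symbols are introduced here for illustration and do not come from the source. The probability of a single table, and a one-sided enrichment P-value, could then be written as:

```latex
\[
\Pr(a) = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}},
\qquad
P = \sum_{x=a}^{\min(a+b,\,a+c)} \frac{\binom{a+b}{x}\binom{c+d}{a+c-x}}{\binom{n}{a+c}}
\]
```

Whether the reported values come from a one-sided or a two-sided test is not stated in this section, so the one-sided form above is an assumption.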
Figure 2. Distribution of the number of conserved pentamers. The 18 pentamers listed in Table 1 are counted in the flanking introns of alternatively spliced exons. The x-axis indicates the position relative to the corresponding splice site. The regions upstream of the 3′ splice site and downstream of the 5′ splice site are shown in light and dark gray, respectively.
Table 2. Comparison of the 18 pentamers with the patterns in other data sets
| Pentamer | Voelker^a: Donor (819) | Voelker^a: Acceptor (1007) | Yeo^b: Upstream (1069) | Yeo^b: Downstream (911) | Churbanov^c: 3′-ISE (814) | Churbanov^c: 3′-ISS (1032) | Churbanov^c: 5′-ISE (478) | Churbanov^c: 5′-ISS (174) | Wang^d: 3′SS-intronic (175) | Wang^d: 5′SS-intronic (187) | Zhang^e: IIE (708) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GCATG | M | M | s | M | – | s | – | – | s | s | s |
| TGCAT | M | M | M | M | s | – | – | – | s | s | s |
| ACTAA | M | M | M | s | s | – | – | – | s | s | – |
| CTAAC | M | M | M | M | M | – | – | – | s | s | – |
| TGCTG | M | M | s | s | s | s | – | – | s | – | s |
| GCTGC | M | s | s | s | – | s | s | – | s | – | – |
| TGCTT | M | M | M | M | – | – | – | – | – | – | s |
| CTTGC | M | s | M | s | s | s | – | – | s | s | s |
| GTGGG | s | – | – | s | s | s | s | s | – | – | s |
| TTTCT | M | M | s | s | s | – | s | – | – | – | s |
| AAGAT | – | – | – | M | – | s | – | – | s | s | – |
| TGGAA | s | M | s | s | – | s | – | – | – | – | s |
| GCTAA | s | M | M | s | s | – | – | – | s | – | – |
| CTGCT | M | M | M | M | s | – | – | – | – | – | s |
| AAAGG | s | s | – | – | – | s | – | s | s | – | – |
| GTGGT | – | – | s | – | – | s | – | – | s | – | s |
| TCTTG | s | M | M | s | s | – | – | – | s | s | s |
| GGTGG | s | – | – | s | s | s | s | s | s | – | s |
The total number of predicted motifs is indicated in parentheses under each data set.
^a Significantly conserved n-mers found in the donor/acceptor intronic flanking region (18).
^b Conserved words that are located upstream/downstream of internal exons (19).
^c 3′-/5′-splice site-related intronic splicing silencers/enhancers (20).
^d 3′-/5′-splice site intronic motifs (6).
^e Intron-identity elements (43).
M, exact match; s, the pentamer is included as a sub-string of the longer pattern; –, no match found.
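The M / s / – legend amounts to a simple string comparison. The following is a minimal sketch of that classification, assuming each external data set is available as a plain collection of motif strings. The function name and the example patterns are hypothetical and not taken from the cited data sets.

```python
def classify_pentamer(pentamer, patterns):
    """Return 'M' for an exact match, 's' if the pentamer is a substring
    of a longer pattern, and '-' if no match is found
    (the table prints the last case as an en dash)."""
    patterns = set(patterns)
    if pentamer in patterns:
        return "M"
    if any(pentamer in pattern for pattern in patterns):
        return "s"
    return "-"


# Hypothetical pattern list, for illustration only.
example_patterns = ["GCATG", "TTTGCTTA"]
for pentamer in ("GCATG", "TGCTT", "AAGAT"):
    print(pentamer, classify_pentamer(pentamer, example_patterns))
```

Checking the exact match first matters, because a pentamer that is itself in the list would otherwise also satisfy the substring test.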
Table 3. GCATG motif in the genome alignments of various groups of species
| Species | Conserved GCATG in the introns around skipped exons | Conserved GCATG in the introns around constitutive exons | P-value^a |
|---|---|---|---|
| Human + mouse | 303 | 136 | 1.1 × 10^-23 |
| Euarchontoglires^b | 177 | 29 | 1.4 × 10^-33 |
| Boreoeutheria^c | 158 | 17 | 6.4 × 10^-36 |
| Placental mammals^d | 140 | 21 | 6.8 × 10^-28 |
| Mammals^e | 157 | 26 | 1.0 × 10^-29 |
In the mammals set, we count the motif if it is strictly conserved in at least 10 species.
^a P-values are calculated by Fisher’s exact test.
^b Human, chimpanzee, macaque, mouse, rat, rabbit.
^c Human, chimpanzee, macaque, mouse, rat, rabbit, dog, cow.
^d Human, chimpanzee, macaque, mouse, rat, rabbit, dog, cow, armadillo, elephant, tenrec.
^e Human, chimpanzee, macaque, mouse, rat, rabbit, dog, cow, armadillo, elephant, tenrec, opossum.
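The counts in this table depend on what "strictly conserved" means across an alignment. The sketch below shows one way to count conserved occurrences, under several assumptions: the alignment is a dictionary of equal-length aligned strings keyed by species name, the motif is located in the human row, and an occurrence counts if the same alignment columns spell the motif in at least `min_species` rows. The function name, the data layout, and the toy alignment are hypothetical.

```python
def conserved_motif_positions(alignment, motif, reference="human", min_species=None):
    """Return alignment columns where `motif` starts in the reference row
    and is identical in at least `min_species` rows (reference included).
    If `min_species` is None, every row must carry the motif."""
    if min_species is None:
        min_species = len(alignment)
    ref = alignment[reference].upper()
    k = len(motif)
    hits = []
    for i in range(len(ref) - k + 1):
        if ref[i:i + k] != motif:
            continue
        # Count the rows that spell the motif in exactly the same columns.
        n_match = sum(1 for seq in alignment.values() if seq[i:i + k].upper() == motif)
        if n_match >= min_species:
            hits.append(i)
    return hits


# Toy alignment, for illustration only.
toy = {"human": "AAGCATGTT", "mouse": "ACGCATGTA", "dog": "AAGCTTGTT"}
print(conserved_motif_positions(toy, "GCATG"))                 # all rows required
print(conserved_motif_positions(toy, "GCATG", min_species=2))  # relaxed threshold
```

Passing `min_species=10` with a 12-species alignment would mirror the rule stated for the mammals set. The stricter default corresponds to requiring conservation in every listed species.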
Figure 3. Examples of co-occurring motifs. (A) ACTAAC and TGCATG in the flanking intron of the skipped exon of NDEL1 and (B) TGCATG and TGCTT in the flanking intron of the skipped exon of MYBPC1. The part of the gene structure around the skipped exon is shown above the alignment with the exon numbers. The format of the alignment is as follows: first column, the name of the species; second column, chromosome or scaffold identifier; third column, direction of the gene on the chromosome; fourth column, position on the chromosome. Conserved residues are indicated by asterisks under the alignments. Conserved motifs are indicated by colored boxes.
Figure 4. Distribution of the probability calculated by the cumulative hypergeometric distribution function. The pairs among the statistically significant motifs are shown in orange, with their scale on the right vertical axis. The randomly selected pairs are shown in cyan, with their scale on the left vertical axis.
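The caption refers to a cumulative hypergeometric probability for each motif pair. The sketch below computes an upper-tail version of that quantity, assuming each motif is represented by the set of skipped exons that carry it and that the probability of interest is that of seeing at least the observed overlap by chance. The function name, the set-based input, and the choice of the upper tail are assumptions made for illustration.

```python
from math import comb


def cooccurrence_p(exons_a, exons_b, n_total):
    """Upper-tail hypergeometric probability P(X >= k), where k is the
    number of exons carrying both motifs, out of n_total exons."""
    n_a, n_b = len(exons_a), len(exons_b)
    k = len(exons_a & exons_b)
    upper = min(n_a, n_b)
    # Sum the hypergeometric probability mass from the observed overlap upward.
    tail = sum(
        comb(n_a, x) * comb(n_total - n_a, n_b - x)
        for x in range(k, upper + 1)
    )
    return tail / comb(n_total, n_b)


# Toy exon identifiers, for illustration only.
with_motif_a = {1, 2, 3, 4}
with_motif_b = {2, 3, 4, 5}
print(cooccurrence_p(with_motif_a, with_motif_b, n_total=8))
```

Exact integer arithmetic with `math.comb` avoids the rounding problems that arise when very small tail probabilities are accumulated from floating-point terms.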
Figure 5. A network of co-occurring motifs. Each node represents a motif. The number of skipped exons with the motif is indicated in the node. If several pentamers make up a single motif, the pentamers are shown in the node. Each edge represents a pair of co-occurring motifs. The number of skipped exons with a pair of co-occurring motifs is indicated at the edge. Thick and thin lines indicate statistically significant co-occurrence at P < 1.0 × 10^-5 and P < 1.0 × 10^-4, respectively. Nodes that are not connected at P < 1.0 × 10^-4 are linked with dotted lines to the closest motifs. The P-values for these edges are <1.0 × 10^-3, except for the edge that links AAAGG (P < 2.1 × 10^-3). An edge connecting a node to itself indicates significant co-occurrence of the same motif.
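The edge rules in this caption can be expressed as a small thresholding step. The sketch below assumes the pairwise P-values are already available as a dictionary keyed by motif pairs, and it interprets "closest motif" as the partner with the smallest P-value. That interpretation, the function name, and the example values are assumptions, not details given in the source.

```python
def build_edges(pair_p, thick=1e-5, thin=1e-4):
    """Assign a line style to each motif pair: 'thick' for P < thick,
    'thin' for P < thin, and 'dotted' for the best remaining link of any
    motif that has no thick or thin edge to another motif."""
    edges = {}
    connected = set()
    for (a, b), p in pair_p.items():
        if p < thick:
            edges[(a, b)] = "thick"
        elif p < thin:
            edges[(a, b)] = "thin"
        else:
            continue
        if a != b:  # a self-loop does not connect a motif to the network
            connected.update((a, b))
    nodes = {motif for pair in pair_p for motif in pair}
    for node in sorted(nodes - connected):
        candidates = [
            (p, pair) for pair, p in pair_p.items()
            if node in pair and pair[0] != pair[1]
        ]
        if candidates:
            _, best_pair = min(candidates)
            edges[best_pair] = "dotted"
    return edges


# Hypothetical P-values, for illustration only.
example = {
    ("TGCATG", "ACTAAC"): 1e-6,
    ("TGCATG", "CTGCTGC"): 5e-5,
    ("AAAGG", "TGCATG"): 2e-3,
    ("AAAGG", "ACTAAC"): 5e-2,
}
for pair, style in build_edges(example).items():
    print(pair, style)
```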
Figure 6. The number of exon-skipping events supported by RNA-Seq data increases with the number of tissue samples. All possible combinations of tissues are taken into account for each number of tissues. The 118 predictions of exon skipping made for the Ensembl gene set were used.
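The combination analysis described in this caption can be sketched as follows, assuming that each tissue is represented by the set of predicted exons whose skipping is supported by RNA-Seq reads in that tissue. Summarizing the combinations of a given size by their mean is an assumption, since the caption does not say how they are aggregated. The function name, tissue names, and exon identifiers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def support_by_tissue_count(supported):
    """For each number of tissues k, return the mean number of exons
    supported in at least one tissue, over all k-tissue combinations."""
    tissues = sorted(supported)
    result = {}
    for k in range(1, len(tissues) + 1):
        sizes = [
            len(set().union(*(supported[t] for t in combo)))
            for combo in combinations(tissues, k)
        ]
        result[k] = mean(sizes)
    return result


# Toy data, for illustration only.
toy_support = {
    "brain": {"exon1", "exon2"},
    "liver": {"exon2"},
    "heart": {"exon3"},
}
for k, avg in support_by_tissue_count(toy_support).items():
    print(k, avg)
```

Because the number of combinations grows quickly with the number of tissues, sampling a subset of combinations may be preferable for large tissue panels.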
Figure 7. Exon skipping of the third exon of ENST00000256858 confirmed by the PCR experiment. The PCR product lengths for skipped and included forms are 136 bases (open triangle) and 208 bases (closed triangle), respectively (see Supplementary Table S1). F, fetal brain; h, hippocampus; a, amygdala; N, negative control.