Literature DB >> 16914448

Comprehensive splice-site analysis using comparative genomics.

Nihar Sheth1, Xavier Roca, Michelle L Hastings, Ted Roeder, Adrian R Krainer, Ravi Sachidanandam.   

Abstract

We have collected over half a million splice sites from five species-Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana-and classified them into four subtypes: U2-type GT-AG and GC-AG and U12-type GT-AG and AT-AC. We have also found new examples of rare splice-site categories, such as U12-type introns without canonical borders, and U2-dependent AT-AC introns. The splice-site sequences and several tools to explore them are available on a public website (SpliceRack). For the U12-type introns, we find several features conserved across species, as well as a clustering of these introns on genes. Using the information content of the splice-site motifs, and the phylogenetic distance between them, we identify: (i) a higher degree of conservation in the exonic portion of the U2-type splice sites in more complex organisms; (ii) conservation of exonic nucleotides for U12-type splice sites; (iii) divergent evolution of C.elegans 3' splice sites (3'ss) and (iv) distinct evolutionary histories of 5' and 3'ss. Our study proves that the identification of broad patterns in naturally-occurring splice sites, through the analysis of genomic datasets, provides mechanistic and evolutionary insights into pre-mRNA splicing.

Entities:  

Mesh:

Substances:

Year:  2006        PMID: 16914448      PMCID: PMC1557818          DOI: 10.1093/nar/gkl556

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Pre-mRNA splicing is a nuclear process that is conserved across eukaryotes (1). This process involves the recognition of exon–intron junctions by the spliceosome, and intron excision through a two-step transesterification reaction (2,3). The spliceosome is a large and dynamic complex assembled from RNA and protein components, including four small nuclear RNAs (snRNAs) and associated proteins that make up the small nuclear ribonucleoprotein particles (snRNPs) (4). The spliceosome recognizes three conserved sequences at or near the exon–intron boundaries, namely the 5′ splice site (5′ss), the branch point sequence (BPS) and the 3′ss (Figure 1). In the first catalytic step, the 5′ end of the intron is cleaved and covalently joined to an Adenosine (A) on the BPS. In the second catalytic step, the neighboring exons are joined and the intron is excised as a lariat. There are at least two classes of pre-mRNA introns, based on the splicing machineries that catalyze the reaction: Whereas U2-type introns have been found in virtually all eukaryotes (1) and comprise the vast majority of the splice sites found in any organism, U12-type introns have only been identified in vertebrates, insects, jellyfish and plants (8).
Figure 1

Initial steps in splice-site selection. The dark bars stand for exons, while the lines represent introns. ss stands for splice sites. AG is the 3′ terminus of the intron, where Y is a pyrimidine. PPT is the poly-pyrimidine tract. BPS is the branch point sequence, U1, U2, U11 and U12 are snRNPs. U2AF65 and U2AF35 are protein splicing factors. See text for details.

U2 snRNP-dependent introns make up the majority of all introns and are excised by spliceosomes containing the U1, U2, U4, U5 and U6 snRNPs. These introns consist of three subtypes, according to their terminal dinucleotides: GT–AG, GC–AG and AT–AC introns. U12 snRNP-dependent introns are the minor class of introns and are excised by spliceosomes containing U11, U12, U4atac, U6atac and U5 snRNPs. These introns mainly consist of two subtypes, as defined by their terminal dinucleotides: AT–AC and GT–AG introns. In addition, a small fraction of the U12-type introns exhibit other terminal dinucleotides (5–7). U2-type intron splicing initially involves base pairing of U1 snRNA to the 5′ss and U2 snRNA to the BPS (2) (Figure 1). The base pairing of U2 snRNA to the BPS is facilitated by the binding of the large subunit of the U2 Auxiliary Factor (U2AF65) to the poly-pyrimidine tract (PPT) located immediately upstream of the intron 3′ terminus, and binding of the small subunit (U2AF35) to the 3′ terminal AG dinucleotide of the intron (9,10). Following the initial recognition of the splice sites by the U1 and U2 snRNPs, the U4/U6/U5 tri-snRNP is recruited to the splice site leading to the two catalytic steps of splicing (11). In U12-type introns, the roles of U1, U2, U4 and U6 snRNPs in U2-type introns are replaced by the U11, U12, U4atac and U6atac snRNPs, respectively (12–14). The overall similarity in the predicted secondary structure between analogous U2- and U12-type snRNAs suggests that the spliceosome rearrangements during catalysis are conserved between the two spliceosomes (15,16). The 5′ss, 3′ss and BPS elements conform to specific consensus sequences as determined by the alignment of splice-site compilations (8,17–24). The U2-type splicing signals have highly degenerate sequence motifs; many different sequences can function as U2-type splice sites. In contrast, U12-type 5′ss and the BPS, which lies close to the 3′ end of the intron, are highly conserved (22,25). In most U2-type cases, the PPT is located immediately upstream of the AG but there are examples in alternatively spliced exons in Drosophila melanogaster with a longer PPT–AG distance (26), or even with the PPT placed downstream of the AG (27). Furthermore, the mammalian U2-type BPS can also be located very far (>100 nt) from the intron–exon junction sequence (28). U12-type introns also lack an obvious PPT at the 3′ss. Historically, splice sites are ranked, based on compilations of splice sites (19,20,29,30). However, none of these ranking schemes accurately identify the bona fide splice sites (31–33). In addition, alternative splicing, involving the choice of competing splice sites, is not amenable to an analysis based solely on splice sites (34). In order to identify common and distinguishing features in each splice-site type, we have collected and analyzed a comprehensive set of naturally-occurring splice sites from the genomes of five model organisms: Homo sapiens, Mus musculus, D.melanogaster, Caenorhabditis elegans and Arabidopsis thaliana. Since a well-classified dataset did not exist previously, we have carefully classified the splice sites into various categories, generating a user-friendly resource, SpliceRack (). Using this large-scale splice-site database, we have revisited relevant themes in splicing, such as the frequencies of U12-type and other rare introns, and the variations in the 5′ and 3′ss motifs among the five organisms. We have also applied phylogenetic approaches to infer the evolutionary histories of the different splice-site subtypes. Finally, we have developed tools, available on SpliceRack, to facilitate various splice-site analyses.

MATERIALS AND METHODS

Data collection

The idea driving the collection of data was to generate a large, reliable dataset that would allow statistical analyses. For this reason, sequences from RefSeq (35) were used instead of ESTs and cDNAs and other sources of noisy data. The RefSeq collection contains sequences with different levels of confidence (including categories, such as reviewed, provisional, predicted etc.). RefSeq sequence data from NCBI (from June 2005) for five reference genomes was collected, including the primate H.sapiens, the rodent M.musculus, the arthropode D.melanogaster, the nematode C.elegans and the land plant A.thaliana. If a splice site occurred more than once in different alternatively spliced isoforms, it was considered only once in the database, as part of one of the variants, in order to avoid counting splice sites multiple times. The flanking sequences at both exon–intron junctions were extracted, 38 nt long at the 5′ss (8 nt within the exon and 30 nt within the intron), and 43 nt long at the 3′ss (35 nt within the intron and 8 nt within the exon). The splice sites were identified using annotations of the genome from GenBank. C.elegans 3′ss that are trans-spliced to spliced leader (SL) RNAs (36) were not included in the analysis. The data, along with lengths of introns and exons, are stored in a relational database.

Position weight matrices (PWMs), information content and phylogeny

Sequences at the 5′ and 3′ junctions for each intron were separately aligned using the intron–exon junction as the anchor. The alignment was used to calculate the frequencies of nucleotides at each position. This results in a PWM that has four rows (one for each of A, C, G and T) and the number of columns is equal to the length of the splice-site motif. We chose a 9 nt long motif for the 5′ss (3 in the exon and 6 in the intron) and a 14 nt long motif for the 3′ss (13 in the intron and 1 in the exon). The consensus sequence for a PWM is constructed by choosing the nucleotide with the highest frequency of occurrence at each position. A log-odds (LOD) matrix was constructed from the PWM by the usual method of taking a logarithm to base 2 of the ratio of frequencies at each position of the matrix against a background frequency of 0.25 for each nucleotide. Using a different background will not change the results, but will change the scores and thresholds. A position on a sequence was scored by aligning it with the matrix and adding up the entries in the matrix that correspond to the nucleotides at each position on the sequence. Each position on the splice site has an associated information content that can be derived from the frequencies of the 4 nt, using information theory (37,38). For a random variable that can take on four values with four probabilities, p, the information content is . Information content can be used to study the variability (or level of conservation) at each position, varying from 0 (one possibility is certain) to 2 (all possibilities are likely equal). Some papers use as a measure of conservation, which just inverts the scale (39). A distance, d, can be derived between two PWMs, p and q, of equal length, using a measure called the Kullback–Leibler distance (40), a symmetrized form of which is defined as where i is the index of position on the PWM. This distance was used to derive phylogenetic trees for the 5′ and 3′ss separately, using the program Phylip (41).

Scoring a putative splice site

Given a PWM of a given splice-site type, a log-odds scoring matrix was constructed to score the splice sites. A pseudocount of 0.0001 was added to avoid logarithms of zero. The small pseudocount ensures a strong penalty for mismatches at the exon–intron junctions. The scores obtained by each LOD score matrix were rescaled to fall in the 0–100 range. All positive scores, from 0 to the maximum, were rescaled to lie between 50 and 100, and all negative scores, from the minimum score to 0, were rescaled to lie between 0 and 50. A score close to 50 means the background model is favored, a score close to 100 favors the splice-site motif in question, and a score close to 0 implies that the splice site is different from both the background and the splice-site distribution. Finally, strongly conserved splice sites have a bigger spread of scores, so a single mismatch can drop the score dramatically, whereas less conserved splice sites are more tolerant of mismatches compared to the highest scoring sequences.

Classification of introns

PWMs were first generated for each splice-site type and BPS using human curation. The U12-type BPS consists of an 8 nt pyrimidine-rich sequence (approximately TTTTCCTT), followed by two positions enriched in As and two positions enriched in pyrimidines (C or T). The U12-type BPS requires an A, either at positions 9 or 10 (42). In order to take this into account, we constructed two U12-type BPS PWMs, one with a highly conserved A at position 9 (U12-BPS-A9), and another with a highly conserved A at position 10 (U12-BPS-A10). A sequence was considered a good BPS if it had a score greater than 65 for either one of the two BPS PWMs and if it was located between 8 and 30 nt upstream of the 3′ss. The short fragments around the exon–intron boundaries were scored using the LOD matrices, and each intron received a score for 5′ss, 3′ss and BPS for U12-type introns. The best examples of each type, those with high 5′ss scores and consistent dinucleotide boundaries, were selected. For U12-type GT–AG introns, an additional criterion for the presence of a BPS between 8 and 38 nt upstream of the 3′ end was used to eliminate possible contaminants. The best set of introns for each subtype was used to generate new species and splice-site type specific PWMs, which were then used to rescore the splice sites and reclassify them. Intron classification was unambiguous in almost all cases, either due to their distinctive boundaries or their 5′ss scores, with the exception of a few introns with similar scores for U2- and U12-type GT–AG 5′ss matrices. The introns that remained unclassified were subsequently classified using criteria described below. Previous classification schemes (6,8,25) assumed that U12-dependent splicing required a good BPS within 10–20 nt from the 3′ end of the intron (22), but recent data has shown this is not always necessary (7,25,43). Thus, we used the presence of an optimal U12-type BPS only to discriminate between introns that could not be unambiguously classified by their 5′ss scores alone. If the U12-type 5′ss score was greater than the U2-type 5′ss score by more than 25 then it was assigned the U12-type GT–AG. If the U12-type 5′ss score was greater than the U2-type 5′ss score by more than 10 and the intron had a good U12-type BPS, then again it was assigned the U12-type GT–AG. In all other cases it was assigned the U2-type GT–AG. The thresholds of 25 and 10 are arbitrary, but were optimized by human curators to minimize the error rate, which was estimated at <1% by manual scrutiny. The borderline cases could be classified in either group, and their unambiguous assignment will require experimental verification. Finally, introns with a minimum 5′ss score less than 50 are kept in the database, and can be accessed from the SpliceRack website, but are not used in our analysis.

Introns with non-canonical boundaries

A few of the introns with dinucleotide boundaries that do not fit into the canonical categories are real, and we delineate the strategy to identify them. In general, current datasets are biased against non-canonical splice sites, but it is possible to gather some information on them from the RefSeq sequences, as we show below. To identify non-canonical U12-type introns from this unclassified set, we chose those with good matches to the U12-type 5′ss, with a non-canonical dinucleotide at the intron's 3′ end, and a significant number of these introns were found (Table 1). Many of these introns have a U12-BPS. Analogous to the scheme outlined above, those introns without a match to the U12-type BPS were only classified as U12-type when the score for the U12-type 5′ss was >25 score units higher than that for the U2-type 5′ss. These introns were included in the U12-type RT–RN category (where R is a purine, and N is any nucleotide).
Table 1

A synopsis of the collection of splice sites in SpliceRack

H.sapiensM.musculusD.melanogasterC. elegansA.thalianaTotal
Number of accessionsa23 78821 82411 76319 27822 43499 087
Number of genesa23 50521 77211 75619 26922 35498 656
U2-type GT–AG183 678174 67140 63793 699111 351604 036
U2-type GC–AG160214391853519034480
U2-type II (AT–AC)15630024
Total U2-type185 295176 11640 82594 072112 254608 562
U12-type GT–AG469444110*1621086*
U12-type AT–AC1691516027353
Other U12-typeb333010266
Total U12-type6716251801911505*
Total number185 966176 74140 84394 072112 445610 067
% U2-type GC–AG0.8650.8170.4530.3730.8040.736
% U2-type AT–AC0.0080.0030.007000.003
% U12-type0.3610.3540.04400.1700.247

This table shows the total number of introns in SpliceRack for each subtype.

The * indicates that there are 22 U12-type-like GT–AG introns in C. elegans, which were not included in the counts of U12-type introns (see text for details).

aThe number of accessions (RefSeqs) includes alternatively spliced isoforms.

bOther U12-type introns refers to those U12-type introns lacking the canonical U12-type terminal dinucleotide pairs, which were sorted in the RT–RN and GC–AG categories (see text for details).

Some GC–AG introns revealed a strong match to the PWM for U12-type GT–AG 5′ss, except for the mismatch at position +2. We searched for GC–AG introns with the GCATCC motif from positions +1 to +6 of the 5′ss, and found that most of these introns had a match to the BPS at an optimal distance from the 3′ss. These introns were named U12-type GC–AG. U2-type AT–AC introns (or AT–AC type II introns) were identified by searching for unclassified AT–AC introns that bear a good match to the U2-type 5′ss, except for the intron's terminal dinucleotides. A PWM for U2-type AT–AC 5′ss was generated from the PWM for U2-type GT–AG 5′ ss, by changing the full conservation from G to A at position +1. Only those introns with a minimal 5′ss score of 75 were reclassified as U2-type AT–AC. Putative examples of other classes of non-canonical U2-type introns were identified by searching for all exon–intron boundaries with good 5′ GT–AG U2-type splice sites (score greater than 70), with a non-canonical dinucleotide boundary at the 3′ terminus of the intron and a good PPT near the 3′ terminus in the intron. Introns that could not be classified into any of the seven categories either lack a canonical exon–intron boundary dinucleotide, and/or have all 5′ss scores below 40. These can be accessed through SpliceRack. Most of these unknown splice sites are probably due to sequencing or annotation errors in the genome (44), as shown in previous examinations of non-canonical splice sites (45–47). The unclassified introns in our dataset represent a small fraction (<1.5%) of the total number of introns.

RESULTS AND DISCUSSION

SpliceRack

We have created the largest compilations of splice-site sequences to-date. The splice-site sequences for the five species were extracted from the RefSeq collection of NCBI (35). Introns were classified into various splice-site subtypes, based on scoring methods derived from PWMs, as explained in the Materials and Methods section (Table 1). Our data is publicly available through the SpliceRack website (see ).

Tools to explore SpliceRack

SpliceRack contains sequences flanking the 5′ss (38 nt long, 8 nt within the exon and 30 nt within the intron), and the 3′ss (43 nt long, 35 nt within the intron and 8 nt within the exon). We have developed several useful tools, available on the SpliceRack website, that allow exploration of the splice sites and their flanking sequences. Each tool results in the selection of a subset of splice sites. The genes containing these splice sites can be analysed for functional and other biases by using GObar (48), which is provided as a built-in tool.

Retrieve splice sites by type

This tool allows users to retrieve splice site flanking sequences by specifying one or more splice-site subtypes, cut-off 5′ss or 3′ss scores, and species (Figure 2). A similar display is obtained in the ‘Search for genes’ tool, which allows visualization of all splice sites within a queried gene. Introns with weak splice-site boundaries can be retrieved using a minimum 5′ss score of 40 and a maximum of 50.
Figure 2

Two of the 169 human U12-type AT–AC introns are shown. These were accessed using a minimum 5′ ss score of 50 and a maximum of 100. The output includes links to the GenBank accession, gene names and the intron number and length. The two highest scores for each splice-site subtype are shown. The classification of introns is explained in the Materials and Methods section.

Search for patterns

This tool can be used to locate potential protein-binding sites by searching for sequences that match user-specified patterns in the form of regular expressions. For example, IUPAC SNP codes as well as standard symbols, such as ‘+’ and ‘*’ can be used. The search can be restricted by species, splice-site type and start position, as explained on the website. The neural-specific splicing factor Fox-1 binds to the hexamer UGCAUG. In order to find introns with the hexamer in the vicinity downstream of the 5′ss, the database was queried for U2-type GT–AG 5′ss with the pattern N{13,31}TGCATG. The N allows match to any nucleotide, the 13 and 31 ensure that the TGCATG motif will start anywhere from position 13 on the flanking sequence, which is after the U1 binding site, to position 31. This query results in 1221 introns. By using the GObar function, we confirmed that the genes containing these motifs are enriched for function in neural tissues (49). This analysis will not find genes that have this motif deep inside the intron.

Search for motifs

This tool is similar to the search for patterns tool, except the patterns are now of fixed-length, which speeds up the search. For example, to find the frequency of potential stop codons (TAG,TTA,TAA) at the last 3 nt of an exon, the search is carried out for human U2-type GT–AG 5′ ss with a motif start position of 5. A query with the sequence TAG returns 4591 introns, while TRA (R is a purine), representing the other two stop codons, returns an additional 1439 introns.

Search for matrices

This tools finds matches to a user-supplied PWM in the sequences in SpliceRack. Users can also supply cut-off scores, and the tool calculates the frequency of the motif's occurrence, starting from a user-specified position relative to the exon–intron boundary. It also displays the rank (based on counts) of a searched motif relative to the frequencies for all the motifs that are of the length of the PWM and start at the user-specified start site. Results can be limited by species, 5′ss or 3′ss and splice-site type. This tool can be used to locate sequences in the neighborhood of splice sites that can serve as protein-binding sites.

Splice-site locator

This tool can help identify potential splice sites in user-specified genes or sequences, using scores derived from the PWMs in SpliceRack. When a gene name (or sequence) is supplied along with a user-specified score threshold, the known exon–intron boundaries as well as predicted splice sites are shown. In addition, a distribution of counts of predicted splice sites against scores is also shown, so that it is easy to see how the genuine splice sites compare to the predicted splice sites. A link to the sequence in the gene where the potential splice sites are located is also provided.

U2-type splice sites

U2-type GT–AG and GC–AG splice sites

Of all the splice sites we have classified, 99.6% (608 562) are of the U2-type in the five genomes (Table 1). GC–AG introns represent 0.82% of U2-type introns in H.sapiens, 0.86% in M.musculus, 0.8% in A.thaliana, 0.45% in D.melanogaster and 0.37% in C.elegans, indicating that more primitive organisms tend to have fewer GC–AG introns. GT–AG and GC–AG subtypes of U2-type introns are considered separately. Though there is no evidence that these types are spliced by distinct mechanisms, their PWMs are significantly different (Figure 3). The 5′ss from GC–AG and GT–AG introns bind to U1 snRNP during spliceosome assembly (2). The consensus sequence for the U2-type GT–AG 5′ss motif corresponds to a perfect base pairing to the U1 snRNA 5′ end. Despite this, a high degree of variability seems to be tolerated at most positions of the actual 5′ss. In GC–AG introns, the substitution at position +2 of the 5′ ss introduces a mismatch in the U1: 5′ss helix. The PWMs (Figure 3) show that the remainder of the 5′ss sequence has a higher degree of match to the consensus GT–AG 5′ss, apparently compensating for the mismatch at position +2 (17,23,24,50,51). Subtype switching of introns from U2-type GT–AG to U2-type GC–AG between human, rodent and chicken has been shown to be a common event (24).
Figure 3

PWMs for 5′ and 3′ ss for the U2-type GT–AG and GC–AG intron subtypes, for the five species. We used color-coded vertical bars stacked one over the other to represent the percentage of each nucleotide, Green (A), Blue (C), Black (G) and Red (T); similar logos have been used elsewhere (66). The bars are ordered by their percentages.

Non-canonical U2-type introns

Non-canonical U2-type introns with AT–AC dinucleotide intron ends have been known to occur and are functional. This is the first genome-wide survey for U2-dependent AT–AC II introns (Table 1, Materials and Methods). Previously, only two human AT–AC introns in the SCN4A and SCN5A genes were experimentally demonstrated to be spliced by the U2- but not the U12-type splicing machinery (5,51). The borders of these introns were conserved between human and puffer fish. In addition, the chicken parvalbumin gene and some xylanase genes in filamentous fungi were hypothesized to contain AT–AC II introns. Furthermore, a G to A transition at the first intronic nucleotide can be rescued for splicing by a G to C transversion at the last nucleotide of U2-type introns in yeast (52,53) and human pre-mRNAs (54), although less efficiently. We found a total of 15 AT–AC II introns in the human, 6 in the mouse and 3 in the D.melanogaster genomes. The case for these introns is bolstered by the fact that they are preserved in orthologous genes, such as the AT–AC II intron in the human and D.melanogaster TAF2 gene. In the human dataset, 10 out of 15 of these introns are found in members of the alpha subunit of the voltage-gated sodium channel gene family, which also have U12-type introns. Hence, the higher number of AT–AC II introns in mammals could be explained by the expansion of this gene family. All six mouse AT–AC II introns are found in members of this family, and conserved in the human genome. Interestingly, AT–AC II introns seem to be absent in C.elegans and A.thaliana. The very low abundance of these introns, and their presence in vertebrates and fungi, suggest that they constitute a relic of U2-dependent introns that have been progressively lost. We searched for other putative non-canonical U2-type splice sites as explained in the Materials and Methods section. Many of the known types (55) (see ) are not in our compilation as they do not occur in the RefSeq collection. We found 99 introns (33 in human, 46 in mouse, 5 in fly, 11 in the worm and 4 in the plant) with an optimal U2-type 5′ss and PPT, but a non-canonical 3′ intron terminus. Out of these, the GT–AT and GT–TG introns in the DGCR2 and ARS2 genes, respectively, are conserved between human and mouse, suggesting that at least some of the remaining 95 might be bonafide, functional splice sites.

U12-type splice sites

Canonical U12-type introns

We have substantially expanded the catalogue of U12-type splice sites in four of the five organisms we examined (Table 1). The PWMs for these splice sites have been derived and are consistent with previous reports (6,25) (Figure 4). However, we find a more extended region of conservation than previously identified (6,22,25). Specifically, the −1 position on 5′ss has a strong preference for T in U12-type GT–AG, and the three exonic positions immediately downstream of the 3′ss favor the sequence ATA.
Figure 4

PWMs for 5′ and 3′ ss for the GT–AG and AT–AC U12-type intron subtypes, in the five species. For C.elegans what is shown is the vestigial U12-type GT–AG splice sites.

We have identified 671 human U12-type introns, ∼0.36% of all introns, which includes 314 introns that were hitherto unrecognized as U12-type introns. The GT–AG subtype accounts for 469 introns while 169 introns belong to the AT–AC subtype, and the remaining 33 lack the canonical U12-type intron borders (Table 1). We have identified 357 of the 404 U12-type introns predicted in the most recent previous survey of U12-type introns (6) while the remainder are from genes not present in the RefSeq collection. Mouse frequencies for U12-type introns are very similar to those for human, consistent with a previous study (24). We identified 189 U12-type introns in A.thaliana (Table 1), as compared to the 165 U12-type introns derived earlier from a mapping of cDNA/ESTs to the genomic sequence (25). Overall, the fraction (0.24%) of U12-type introns in A.thaliana is smaller than the fraction in mammals. We found 18 U12-type introns in D.melanogaster (Table 1), of which 6 are of AT–AC subtype, 11 are of GT–AG subtype and 1 has GC–AG terminal dinucleotide (see below). A previous study identified 19 U12-type introns in D.melanogaster (56). We are missing one of their AT–AC introns and three of their GT–AG introns, either due to the genes not being in RefSeq or due to the sequences not having a good BPS. Even though U12-type introns in D.melanogaster correspond to a very small fraction (0.04%) of all introns, in at least some cases they are essential for the development of this organism (57). Strikingly, the relative abundance of each U12 subtype in the fly is very different from that of mammals and plants, in that D.melanogaster has a greater percentage of AT–AC U12-type introns compared to GT–AG U12-type introns than the other three species (33 versus 14–25%). U12-type introns have not been found in C.elegans (8) and recent systematic searches for U12-type splicing factors failed to identify proteins unique to the U12-type spliceosome in the C.elegans genome (1,58). Since other metazoans and plants have U12-type introns, it has been suggested that the C.elegans lineage lost the U12-type introns (8). Indeed, we found some vestigial U12-type C.elegans introns whose 5′ss are strongly reminiscent of U12-type GT–AG introns in other species (Table 1).However, because the 3′ end sequences of these introns have the signature features of C.elegans U2-type 3′ss (see below), and a good candidate sequence for a U12-type BPS is missing, they are probably spliced by the U2-type machinery.

Non-canonical U12-type introns

A significant fraction of U12-type introns in our dataset (≈5%) lack the canonical pairs of dinucleotide (Table 2, see Materials and Methods). These transcripts are supported by RefSeq sequences. However, a large fraction of these introns are incorrectly mapped in other genome browsers, probably due to automated mapping based on canonical splice-site predictions.
Table 2

The distribution of non-canonical U12-type introns

H.sapiensM.musculusD.melanogasterA.thalianaTotal
AT–AG1180120
AT–AA34018
AT–AT660012
AT–GT01001
GT–AT43007
GT–GG53008
GT–TG01001
GC–AG44109
Total33301266
% of all U12-type4.924.805.561.05
Only two of these introns were found in A.thaliana, including the previously described intron 7 of the AtG5 gene (59). Such non-canonical introns have been experimentally shown to be competent for splicing by the U12-type machinery (7). The most efficient pairs were found to be GT–AG and AT–AC, but AT–AN (where N is any nucleotide) and GT–AT could also engage in productive splicing. Remarkably, the splicing efficiency of these splice-site combinations correlates with their frequencies in our database. Furthermore, evidence has been found for U12-type non-canonical splicing in samples from a Peutz–Jeghers syndrome patient that have a A to G mutation at the +1 position of the AT–AC U12-type 5′ ss in the LKB1 gene (43). We have also identified another subtype of putative U12-type introns with GC–AG borders, which were predicted in an earlier survey (8). There are four candidates of these introns in the human and mouse, and one in the fly. All of these candidates have a perfectly conserved GCATCCT 5′ ss motif, and most have an optimal U12-type BPS at an appropriate location. Furthermore, the exonic nucleotide upstream of the 5′ss of most of these candidates do not usually conform to the U2-type consensus making them poor substrates for U2-type splicing.

Biases in the distribution of U12-type and other rare introns

The persistence of U12-type introns over evolutionary time is a mystery, as they represent a very small fraction of all genes, and some lineages and organisms, such as C.elegans, appear to have completely lost them. We used GObar (48) (see ), a tool that exploits the information in Gene Ontology to unveil pathways or functions that are enriched in a particular sample gene set, to study the functional distribution of genes containing U12-type introns. We find that U12-type introns are usually found in genes involved in ion transport (25,60), protein trafficking and cell-cycle control, among other processes. Genes with more than one U12-type intron are found more often than expected from a random model for the disribution of U12-type introns in genes, consistent with previous reports (6,8). H.sapiens and M.musculus have 45 and 46 genes that contain two U12-type introns, respectively, A.thaliana has only six, and D.melanogaster has none. Furthermore, human and mouse have two and three genes, respectively, with three U12-type introns, all of them members of the solute carrier family (SLC) gene family, that encode sodium and hydrogen exchanger proteins. In mammals, considering an average number of eight introns per gene (61) and assuming an independent U12-intron distribution model, the expected number of genes with two U12-type introns is only six. If we include more realistic distributions of introns amongst genes, the total expected goes up to only 16. This physical clustering of U12-type introns could be related to function or to the origin of U12-type introns. The U12-type machinery is ≈100 times less abundant in the nucleus than the U2-type machinery (62,63) and a possible consequence of this is slower splicing kinetics of U12-type introns. Indeed, there is some experimental corroboration for this idea (64). These observations suggest that U12-type introns persist because of their slower rates of splicing reaction, which might be important for the expression of the gene that contains them. If the U12-type intron is the slowest step in the pre-mRNA processing, having more than one U12-type intron in a gene might have no additional impact on splicing dynamics. The possibility of intron conversion from U12- to U2-type was previously demonstrated by looking at orthologous U12-type introns from different species (8,60), but the rate of conversion is unknown. As pointed out in these studies, the high conservation of U12-type splice sites and the high degeneracy of U2-type splice sites suggests that this conversion could be achieved by a few mutations, and that the reciprocal conversion would be very difficult. Interestingly, we noticed frequent subtype switching of non-canonical U12-type introns between human and mouse. Out of 24 conserved U12-type introns between human and mouse of which at least one of the two is non-canonical, only 10 showed the same intron borders (Table 3) The remaining 14 introns underwent subtype switching. Amongst these, we found two cases of subtype switching from a non-canonical intron pair to another one, and ten introns that were switched from a non-canonical intron pair to a canonical GT–AG or AT–AC intron or vice versa. Remarkably, all these non-conserved intron borders differ at only one position.
Table 3

Subtype switching for non-canonical U12-type introns between human and mouse

H.sapiensM.musculusSubtype
Human gene
    XPO4AT–AGGT–AGNon-canonical/canonical
    ACTR10AT–AGGT–AGNon-canonical/canonical
    NUP210AT–AGAT–AGConserved
    SCN10AAT–AGGT–AGNon-canonical/canonical
    CUL4BAT–AGAT–ATNon-canonical/non-canonical
    USP21AT–AGAT–AGConserved
    RAPGEFL1AT–AGAT–AGConserved
    INPP5BAT–AGAT–AGConserved
    RASGRP4AT–AAAT–AAConserved
    FLJ21106AT–AAAT–AAConserved
    MSH3AT–AAAT–ACNon-canonical/canonical
    ERCC5AT–ATAT–ACNon-canonical/canonical
    IKAT–ATAT–AANon-canonical/non-canonical
    DEADC1GT–ATGT–ATConserved
    SLC24A6GT–ATGT–ATConserved
    SLC12A7GT–GGGT–AGNon-canonical/canonical
    ARAFGT–GGGT–GGConserved
    LZTR1GC–AGGC–AGConserved
Mouse gene
    4930449E07RikATACAT–AGNon-canonical/canonical
    Cacna1hATACAT–AGNon-canonical/canonical
    4930590J08RikGTAGAT–AGNon-canonical/canonical
    Gpaa1ATACAT–ATNon-canonical/canonical
    Zc3hdc6ATACAT–ATNon-canonical/canonical
    Slc9a8ATACAT–ATNon-canonical/canonical

The canonical borders, AT–AC or GT–AG, for one of the species are shown in bold letters. For non-conserved introns, the differences in the nucleotides on the intron borders are underlined. There are 10 cases where the non-canonical boundaries are conserved between mouse and human, and 2 cases where one non-canonical form switched to another non-canonical form between the species. In the rest of the cases, one of the boundaries, either the mouse or human is a canonical form.

The AT–AG U12-type introns might represent viable intermediates of subtype switching events between U12-type AT–AC and GT–AG introns. In light of this, the >50% subtype switching for the non-canonical introns seem to be at odds with the observation that no subtype switching between U12-type AT–AC and GT–AG intron orthologs was found between human and chicken (24)

Splice-site analysis by comparative genomics

We used SpliceRack to analyze the species-specific properties of splice sites from an evolutionary perspective. The global patterns of the splice sites were compared amongst the five species by phylogenetic approaches.

Splice-site information content across species

We measured and plotted the information content at each position of the 5′ and 3′ss from the PWMs (Figures 5 and 6 see Materials and Methods). The information content is a measure of the degree of conservation in splice-site sequences, with lower information content corresponding to a higher degree of conservation, and vice versa (24,65–68).
Figure 5

Information content in the exonic and intronic portions of 5′ ss in various organisms. (A) U2-type GT–AG introns. (B) U2-type GC–AG introns. (C) U12-type GT–AG introns. (D) U12-type AT–AC introns.

Figure 6

Information content in the exonic and intronic portions of 3′ ss in various organisms. (A) U2-type GT–AG introns. (B) U2-type GC–AG introns. (C) U12-type GT–AG introns. (D) U12-type AT–AC introns.

Information content of U2-type 5′ss (Figure 5): The 5′ss motifs for the two mammals are virtually identical, as described previously (24,69). The U2-type GT–AG 5′ss graph shows that there is more conservation in the exonic nucleotides for H.sapiens, M.musculus and A.thaliana, compared to the other two simpler species, D.melanogaster and C.elegans. This is consistent with the fact that budding and fission yeast have poor conservation at the exonic 5′ss positions, but a very strict conservation at intronic positions (70). C.elegans and A.thaliana show weak conservation of Ts at positions downstream of +6, which is not the case for the other organisms (8,71). The GC–AG 5′ss show higher conservation at both intronic and exonic positions when compared to the GT–AG 5′ss (23,24,45). Information content of U2-type 3′ss (Figure 6): The U2-type GT–AG and GC–AG 3′ss show similar information content, suggesting they might be functionally very similar. All species, except for C.elegans, show little conservation in the exonic or intronic portions of the 3′ss, probably due to the degenerate PPT. C.elegans exhibits a shorter and more conserved PPT in the intronic portion, with a consensus sequence of TTTTCAGR, where AG is the terminus of the intron (72). The existence of this conserved 3′ss motif in C.elegans is explained by a more stringent requirement for an optimal binding site for the U2AF heterodimer that is unique to this species (73). The conservation of G at position +1, conserved only in H.sapiens, M.musculus and A.thaliana, can be seen as a small decrease in information content at that position for the three species. Information content of U12-type 5′ss (Figure 5): The information content distribution for U12-type GT–AG and AT–AC 5′ss shows a high degree of conservation at the intronic positions that starts decaying at position +6. The conservation of T at position −1 of mammalian and plant, but not fly U12-type GT–AG 5′ss can be seen as a small increase in the fly's information content at that position. The information content distribution for C.elegans U12-type-like GT–AG introns is kept in the graph for comparison. Information content of U12-type 3′ss (Figure 6): The information content distributions for both U12-type GT–AG and AT–AC 3′ss show a very small degree of conservation in the intronic nucleotides. This observation is consistent with the notion that U12-type 3′ss are slightly enriched in pyrimidines but lack a PPT. The location of a BPS near (8–14 nt) the 3′ end of the U12-type intron contributes to conservation at these positions, as seen in the slight decrease of information content on the left of these graphs. The small numbers of these types of introns makes it difficult to infer general trends for D.melanogaster and A.thaliana's AT–AC 3′ss. The vestigial U12-type-like introns in the nematode also contain the conserved motif of the U2-type 3′ss, consistent with the view that these introns are indeed U2-dependent.

Phylogenetic trees from PWMs

The Kullback–Leibler distance (40) (see Materials and Methods) between PWMs for 5′ and 3′ss was used to calculate a distance matrix between species, and phylogenetic trees were derived from these distance matrices. It is possible to use other methods to build trees, but this is the simplest one (50). Trees were generated for all the splice sites together (Figure 7), after removing the conserved dinucleotide terminus of the introns from consideration, as they would hide the signal from the other positions on the splice-sites.
Figure 7

Phylogenetic trees for all splice site types. Trees were constructed in a manner similar to the trees for the individual splice site subtypes, but the terminal dinucleotides of the intron were removed from the PWM (see text for details).

From Figure 7, it is easy to see a clear separation between U2- and U12-type 5′ss motifs. The various 5′ splice types separate according to splice-site type, rather than species. All the U2-type motifs cluster together, with a small separation between GC–AG and GT–AG types suggesting that the GC–AG subtype is a specialized version of the U2-type GT–AG 5′ss motif. The C.elegans GC–AG motif is not as conserved as the other GC–AG motifs, so it clusters closer to the GT–AG motif from the same species. The two C.elegans U2-type motifs and the A.thaliana U2-type GT–AG motif are close to each other due to the T-rich composition of plant and nematode introns. The 5′ss motifs for U12-type introns are distant from the U2-type introns, their closest relative being the GT–AG 5′ss motif from A.thaliana, probably because these motifs are rich in pyrimidines from positions +7 to +10. The C.elegans U12-type-like GT–AG motif is located between the U2-type and the bona fide U12-type GT–AG motifs, but it is closer to the U12-type motifs in the other species, consistent with the idea that these introns could represent a small group of vestigial U12-type introns that were converted to the U2-type. Interestingly, in this tree the four AT–AC motifs are located on the same branch of the tree, because these motifs are slightly more conserved than those for U12-type GT–AG introns. The tree for 3′ss shows that, in general, the 3′ss tend to cluster by species rather than by subtype. Most remarkably, the three subtypes of C.elegans 3′ss motifs, U2-type GT–AG, GC–AG and U12-type-like GT–AG, occupy a distinct branch on the tree, indicating that: (i) these motifs are strongly related to each other, in turn suggesting that the U12-type-like introns are indeed U2-dependent; (ii) these motifs are separated from the rest, in agreement with the idea of divergent evolution of C.elegans 3′ss. C.elegans exhibits two distinct features, a short and highly conserved PPT in 3′ss (72,73) and trans-splicing (36). The selection of 3′ss for efficiency in cis- as well as trans-splicing might explain these features and the divergent evolution. Three of the motifs in A.thaliana, U2-type GT–AG and GC–AG and the U12-type GT–AG, also cluster together. In summary, the phylogenetic trees for the different splice-site motifs reflect the evolutionary relationship between the corresponding splice sites in the five species. The clustering of the 5′ss motifs by splice-site subtype strongly suggests that these subtypes were present in a common ancestor of these five eukaryotes, and have evolved separately. In contrast, 3′ss subtypes tend to cluster by species, suggesting a more recent origin for many of the 3′ss. Since U2-type 5′ss is initially recognized by its base pairing to the U1 snRNA which is a more specific interaction, while the 3′ss is defined by a more non-specific binding of the U2AF heterodimer to the PPT and AG element, which in turn stabilizes base pairing between U2 snRNA and the BPS. The different properties of RNA–RNA interactions at the 5′ss versus RNA–protein interactions at the 3′ss imply a greater flexibility in 3′ss selection. In support of this view we note that even within a species, there is more leeway within the 3′ss for specification of the PPT, the AG and the upstream BPS (10,26–28).

CONCLUSIONS

We have created a reliable dataset of genomic splice sites from five model organisms, based on RefSeq genes and their mappings. We have classified these splice sites into various categories, based on carefully curated PWMs for each species and splice-site type. Non-canonical splice sites, that do not fit into the four major categories, get a short shrift in databases that use large-scale automated annotation, but we have considered these cases carefully. This comprehensive set of splice sites allows robust statistical analyses, while minimizing errors. We have implemented a variety of tools, which can be used to further analyze the splice sites in our collection. Experiments can never reproduce the entire range of phenomena that occur in nature, nor can they easily detect certain subtle features that can only be revealed by statistical analysis of a large dataset. For these reasons, the genomic dataset provides an ideal ‘dry-bench’ to study global patterns in splice sites. In addition, by using the information content of the splice-site motifs and the phylogenetic distance derived from them, we draw inferences on the evolution of the splice sites, which are only possible with large and reliable datasets. We believe this dataset and its analyses will spur further experimental work.
  69 in total

Review 1.  Mechanical devices of the spliceosome: motors, clocks, springs, and things.

Authors:  J P Staley; C Guthrie
Journal:  Cell       Date:  1998-02-06       Impact factor: 41.582

Review 2.  Classification of introns: U2-type or U12-type.

Authors:  P A Sharp; C B Burge
Journal:  Cell       Date:  1997-12-26       Impact factor: 41.582

3.  Sequence walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences.

Authors:  T D Schneider
Journal:  Nucleic Acids Res       Date:  1997-11-01       Impact factor: 16.971

4.  Splicing of a divergent subclass of AT-AC introns requires the major spliceosomal snRNAs.

Authors:  Q Wu; A R Krainer
Journal:  RNA       Date:  1997-06       Impact factor: 4.942

5.  Comparison of splice sites in mammals and chicken.

Authors:  Josep F Abril; Robert Castelo; Roderic Guigó
Journal:  Genome Res       Date:  2004-12-08       Impact factor: 9.043

6.  Inhibition of msl-2 splicing by Sex-lethal reveals interaction between U2AF35 and the 3' splice site AG.

Authors:  L Merendino; S Guth; D Bilbao; C Martínez; J Valcárcel
Journal:  Nature       Date:  1999-12-16       Impact factor: 49.962

Review 7.  Pre-mRNA splicing: the discovery of a new spliceosome doubles the challenge.

Authors:  W Y Tarn; J A Steitz
Journal:  Trends Biochem Sci       Date:  1997-04       Impact factor: 13.807

8.  An LKB1 AT-AC intron mutation causes Peutz-Jeghers syndrome via splicing at noncanonical cryptic splice sites.

Authors:  Michelle L Hastings; Nicoletta Resta; Daniel Traum; Alessandro Stella; Ginevra Guanti; Adrian R Krainer
Journal:  Nat Struct Mol Biol       Date:  2004-12-19       Impact factor: 15.369

9.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors:  Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971

10.  The splicing regulatory element, UGCAUG, is phylogenetically and spatially conserved in introns that flank tissue-specific alternative exons.

Authors:  Simon Minovitsky; Sherry L Gee; Shiruyeh Schokrpur; Inna Dubchak; John G Conboy
Journal:  Nucleic Acids Res       Date:  2005-02-03       Impact factor: 16.971

View more
  161 in total

Review 1.  Intron creation and DNA repair.

Authors:  Hermann Ragg
Journal:  Cell Mol Life Sci       Date:  2010-09-19       Impact factor: 9.261

2.  De novo assembly and analysis of RNA-seq data.

Authors:  Gordon Robertson; Jacqueline Schein; Readman Chiu; Richard Corbett; Matthew Field; Shaun D Jackman; Karen Mungall; Sam Lee; Hisanaga Mark Okada; Jenny Q Qian; Malachi Griffith; Anthony Raymond; Nina Thiessen; Timothee Cezard; Yaron S Butterfield; Richard Newsome; Simon K Chan; Rong She; Richard Varhol; Baljit Kamoh; Anna-Liisa Prabhu; Angela Tam; YongJun Zhao; Richard A Moore; Martin Hirst; Marco A Marra; Steven J M Jones; Pamela A Hoodless; Inanc Birol
Journal:  Nat Methods       Date:  2010-10-10       Impact factor: 28.547

3.  PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text.

Authors:  Yuri Pirola; Raffaella Rizzi; Ernesto Picardi; Graziano Pesole; Gianluca Della Vedova; Paola Bonizzoni
Journal:  BMC Bioinformatics       Date:  2012-04-12       Impact factor: 3.169

4.  Features of 5'-splice-site efficiency derived from disease-causing mutations and comparative genomics.

Authors:  Xavier Roca; Andrew J Olson; Atmakuri R Rao; Espen Enerly; Vessela N Kristensen; Anne-Lise Børresen-Dale; Brage S Andresen; Adrian R Krainer; Ravi Sachidanandam
Journal:  Genome Res       Date:  2007-11-21       Impact factor: 9.043

5.  Large-scale comparative analysis of splicing signals and their corresponding splicing factors in eukaryotes.

Authors:  Schraga H Schwartz; João Silva; David Burstein; Tal Pupko; Eduardo Eyras; Gil Ast
Journal:  Genome Res       Date:  2007-11-21       Impact factor: 9.043

6.  AtCopeg1, the unique gene originated from AtCopia95 retrotransposon family, is sensitive to external hormones and abiotic stresses.

Authors:  Ke Duan; Xiangzhen Ding; Qiong Zhang; Hong Zhu; Aihu Pan; Jianhua Huang
Journal:  Plant Cell Rep       Date:  2008-02-29       Impact factor: 4.570

7.  The U11-48K protein contacts the 5' splice site of U12-type introns and the U11-59K protein.

Authors:  Janne J Turunen; Cindy L Will; Michael Grote; Reinhard Lührmann; Mikko J Frilander
Journal:  Mol Cell Biol       Date:  2008-03-17       Impact factor: 4.272

8.  Evolution of spliceosomal snRNA genes in metazoan animals.

Authors:  Manuela Marz; Toralf Kirsten; Peter F Stadler
Journal:  J Mol Evol       Date:  2008-12       Impact factor: 2.395

9.  Widespread recognition of 5' splice sites by noncanonical base-pairing to U1 snRNA involving bulged nucleotides.

Authors:  Xavier Roca; Martin Akerman; Hans Gaus; Andrés Berdeja; C Frank Bennett; Adrian R Krainer
Journal:  Genes Dev       Date:  2012-05-15       Impact factor: 11.361

10.  Conservation of the protein composition and electron microscopy structure of Drosophila melanogaster and human spliceosomal complexes.

Authors:  Nadine Herold; Cindy L Will; Elmar Wolf; Berthold Kastner; Henning Urlaub; Reinhard Lührmann
Journal:  Mol Cell Biol       Date:  2008-11-03       Impact factor: 4.272

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.