
Aberrant 3' splice sites in human disease genes: mutation pattern, nucleotide structure and comparison of computational tools that predict their utilization.

Igor Vorechovský

Abstract

The frequency distribution of mutation-induced aberrant 3' splice sites (3'ss) in exons and introns is more complex than for 5' splice sites, largely owing to sequence constraints upstream of intron/exon boundaries. As a result, prediction of their localization remains a challenging task. Here, nucleotide sequences of 218 previously reported aberrant 3'ss activated by disease-causing mutations in 131 human genes were compared with their authentic counterparts using currently available splice site prediction tools. Each tested algorithm distinguished authentic 3'ss from cryptic sites more effectively than from de novo sites. The best discrimination between aberrant and authentic 3'ss was achieved by the maximum entropy model. Almost one half of aberrant 3'ss were activated by AG-creating mutations and approximately 95% of the newly created AGs were selected in vivo. The overall nucleotide structure upstream of aberrant 3'ss was characterized by higher purine content than for authentic sites, particularly in position -3, which may be compensated by more stringent requirements for positive and negative nucleotide signatures centred around position -11. A newly developed online database of aberrant 3'ss will facilitate identification of splicing mutations in a gene or phenotype of interest and future optimization of splice site prediction tools.

Year:  2006        PMID: 16963498      PMCID: PMC1636351          DOI: 10.1093/nar/gkl535

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Mutations that affect pre-mRNA splicing have been shown to account for up to a half of disease-causing gene alterations (1,2), potentially representing the most frequent cause of hereditary disorders (3). The most common consequence of splicing mutations is skipping of one or more exons, followed by the activation of aberrant 5′ (donor) splice sites (5′ss), 3′ (acceptor) splice sites (3′ss) and retention of full introns in mRNA (4,5). Each of these four events may have a dramatic impact on the structure or outcome of mature transcripts, function of their translation products and phenotypic manifestations. However, gene mutations or variants can also have more subtle effects at the level of splicing by altering the expression of pre-existing alternatively spliced mRNA isoforms, which can considerably modify not only phenotypic severity of both Mendelian and complex traits, but also their population prevalence (6–9). Mutation-induced aberrant splice sites have been classified into two categories (10): (i) cryptic splice sites, which are only used when a mutation disrupts use of the authentic site, and (ii) de novo splice sites, which are induced by mutations elsewhere in introns or exons and increase the match to a splice site consensus. However, distinction between the two categories may be ambiguous in some cases since disruption of the authentic site may create a new splice site consensus, and is less obvious for 3′ss than 5′ss because accurate recognition of acceptor sites requires additional signal sequences in introns (11). The splicing signals of acceptor sites, namely the branch point sequence (BPS), polypyrimidine tract (PPT), and 3′AG, are recognized by RNA–protein interactions involving splicing factor 1 (SF1) and 65 and 35 kDa subunits of the U2 small nuclear RNP auxiliary factor (U2AF65 and U2AF35), respectively (12–17). 
The overall strength of 3′ss is defined by optimal sequences for interaction with each cognate factor as well as their distances from each other (18,19). Cryptic 5′ss have a similar frequency distribution in exons and introns and their number decreases with an increasing distance from authentic 5′ss (10). In contrast, the localization of cryptic 3′ss is biased towards exons, whereas de novo 3′ss usually reside in introns, particularly within the PPT of authentic 3′ss (11). The distribution bias and a lower prevalence of aberrant 3′ss than 5′ss in vivo is most likely due to sequence constraints near intron/exon boundaries, including depletion of AG dinucleotides and the presence of PPT and BPS upstream of 3′ss (11). In addition, the multifaceted distribution of aberrant 3′ss would be predicted to reflect variable distances between the 3′ss signal sequence from intron to intron (18–21), including the presence of putative ‘distant BPS’ that are not located within an optimal distance of 18–40 nt 5′ of 3′ss, but may reside up to several hundred nucleotides further upstream (22). Despite a growing number of reported splicing mutations and associated phenotypes, the localization of the resulting aberrant 3′ss and their effect on gene expression remain difficult to predict. Currently available computational tools that estimate the splice site strength have been based on a variety of methods, including nucleotide frequency matrices (23,24), machine learning approaches (25), neural networks (26), information theory (27) and interdependence between adjacent (the first-order Markov model) or more distant (the maximum entropy model) positions of the splicing consensus sequences (28). Gene prediction algorithms that take into account protein coding information have been shown to perform better than those that rely only on signals present in the splice sites (29). 
However, the strength of mutation-induced aberrant splice acceptor sites has not been systematically analyzed, and it is unknown at present which models best predict the localization of cryptic or de novo 3′ss activated in vivo. Here, nucleotide sequences of aberrant 3′ss that were reported previously in human disease genes have been compiled and made available to the public through an online retrieval tool. Comparison of the splice site strength using current prediction algorithms showed that the maximum entropy model allowed the best discrimination between authentic and mutation-induced aberrant 3′ss, validating this model as the most sensitive instrument. In addition, this study provides a detailed characterization of the underlying mutation pattern and comparison of nucleotide composition upstream of aberrant and corresponding authentic 3′ss.

MATERIALS AND METHODS

Compilation of mutation-induced aberrant 3′ss in human disease genes

Published reports of cryptic and de novo 3′ss were identified by searching PubMed and home pages of peer-reviewed journals. A subset of case reports was identified by searching locus-specific mutation databases. The search was restricted to human genes with sequence-verified aberrant RNA products published before May 2006 that resulted from disease-associated mutations or variants. Nine cases in which no patient RNA was available but aberrant RNA products of wild-type and mutated alleles were characterized in minigene splicing reporter assays were also included. Aberrant 3′ss were manually validated by mapping the information in the literature to sequences in the Human Genome Project databases. Nucleotide sequences of authentic, mutated and aberrant 3′ss are available in the first online database of aberrant 3′ss, termed DBASS3.

Comparison of computational methods to predict aberrant 3′ss

Validated sequences of aberrant and corresponding authentic 3′ss were used as input files for several splice site prediction algorithms. The Shapiro and Senapathy (S&S) matrix is based on nucleotide frequencies at each position of the 3′ss consensus sequence (23,24). The S&S matrix scores were computed using an online tool. The information theory-based server (27) was used to obtain the information content (Ri) of 3′ss in bits. To accommodate dependencies between adjacent and non-adjacent positions, the compiled sequences were analyzed using the first-order Markov (MM) and the maximum entropy (ME) models (28). The former method considers dependencies between adjacent positions, whereas the latter approximates short sequence motif distributions with the ME distribution and may include dependencies between non-adjacent as well as adjacent positions. The MM and ME scores (28) were derived for each 3′ss using online tools. The Wilcoxon Mann–Whitney rank test (Stat-200, v. 2.01, Biosoft, UK) was employed to test the significance of score differences between authentic, mutated and aberrant 3′ss in each category.
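
The idea behind frequency-based scores such as the information content can be illustrated with a minimal sketch. It assumes the common formulation in which the information of a site is the sum over positions of 2 + log2 f(b, l), where f(b, l) is the frequency of base b at position l in a set of aligned reference sites. The toy alignment, the pseudocount and the function names are hypothetical, and the published server (27) applies further corrections that are not reproduced here.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequencies(aligned_sites, pseudocount=0.5):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(aligned_sites[0])
    total = len(aligned_sites) + pseudocount * len(BASES)
    freqs = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs

def information_content(site, freqs):
    """Sum of 2 + log2 f(b, l) over all positions of the site, in bits."""
    return sum(2.0 + math.log2(freqs[pos][base]) for pos, base in enumerate(site))

# Hypothetical toy alignment: pyrimidine tract, 3'AG and the first exon base.
reference_sites = [
    "TTTCTTTCAGG",
    "CTTTTCTTAGG",
    "TCTTTTCCAGA",
    "TTCTCTTTAGG",
    "TTTTTCTCAGG",
]
freqs = position_frequencies(reference_sites)
authentic = information_content("TTTCTTTCAGG", freqs)
mutated = information_content("TTTCTTTCAAG", freqs)  # G>A at position -1
print(f"authentic: {authentic:.1f} bits, mutated: {mutated:.1f} bits")
```

In this toy example the substitution at position −1 lowers the score because G is invariant at that position in the reference set, which mirrors the drop seen for mutated 3′ss.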

DBASS3 construction

DBASS3 is an online retrieval and submission tool for mutation-induced aberrant 3′ss. The web application was created using the ASP server technology (Microsoft) and SQL database software. In addition to aberrant 3′ss induced by germ-line and somatic mutations, DBASS3 contains naturally occurring variants common in the population if they have been convincingly shown to modify both alternative pre-mRNA splicing and disease phenotypes, such as FECH IVS3-48T/C in protoporphyria (8). Genetic polymorphisms that may influence utilization of tandemly arranged ‘NAGNAG’ 3′ss (30) and exert putative functional effects have been reported elsewhere (31) and were not included in DBASS3, nor were the mutations leading to exon skipping or complete intron retention.

RESULTS

Mutations that activate aberrant 3′ss

An exhaustive search for previously published aberrant 3′ss identified 218 unique aberrant acceptors in 131 genes (Table 1). They were generated by a total of 16 deletions/insertions (32–46) and 211 point mutations (Table 2). Single-nucleotide substitutions of purine residues were much more frequent than those of pyrimidines (165 versus 46, P < 10⁻¹⁶). This overrepresentation was not attributable solely to substitutions at 3′YAG (102 versus 8), but was also observed for de novo 3′ss (63 versus 38, P = 0.004). The most frequently introduced base in each of the four categories of aberrant 3′ss was guanine (G), accounting for ∼42% (89/211) of all point mutations (Table 2).
Table 1

Summary of aberrant 3′ss

| Location of cryptic or de novo 3′ss | Exon: mutation in 3′YAG (cryptic) | Exon: mutation outside 3′YAG (‘de novo’) | Intron: mutation in 3′YAG (cryptic) | Intron: mutation outside 3′YAG (‘de novo’) | Both: all mutations |
|---|---|---|---|---|---|
| Number of genes | 54 | 25 | 23 | 56 | 131 |
| Number of cryptic and de novo 3′ss (per cent) | 88 (39) | 32 (14) | 29 (13) | 78 (34) | 227 (100) |
| Number of unique 3′ss (per cent) | 83 (38) | 29 (13) | 28 (13) | 78 (36) | 218 (100) |
| Number of aberrant 3′ss affecting terminal exons (per cent) | 11 (13) | 4 (14) | 8 (29) | 4 (5) | 27 (12) |
| Median distance (nucleotide) between authentic and aberrant 3′ss | 12 | 55 | −44 | −14 | 1 |
| Change in the reading frame for unique aberrant 3′ss | | | | | |
| 0 | 29 | 10 | 8 | 27 | 74 |
| +1 | 38 | 7 | 13 | 26 | 84 |
| +2 | 21 | 15 | 8 | 25 | 69 |
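Table 2

Summary of mutations leading to aberrant 3′ss

| Location of cryptic or de novo 3′ss | Exon: mutation in 3′YAG (cryptic) | Exon: mutation outside 3′YAG (‘de novo’) | Intron: mutation in 3′YAG (cryptic) | Intron: mutation outside 3′YAG (‘de novo’) | Both: all mutations |
|---|---|---|---|---|---|
| Number of deletions/insertions | 7 | 3 | 0 | 6 | 16 |
| Number of single-nucleotide substitutions | 81 | 29 | 29 | 72 | 211 |
| Wild-type nucleotide | | | | | |
| A | 36 | 3 | 13 | 29 | 81 |
| C | 2 | 9 | 5 | 9 | 25 |
| G | 43 | 13 | 10 | 18 | 84 |
| T | 0 | 4 | 1 | 16 | 21 |
| Mutated nucleotide | | | | | |
| A | 23 | 8 | 5 | 28 | 64 |
| C | 21 | 3 | 5 | 2 | 31 |
| G | 25 | 9 | 14 | 41 | 89 |
| T | 12 | 9 | 5 | 1 | 27 |
| Number of AG-creating mutations | | | | | |
| Total number (%) | 16 (7) | 12 (5) | 8 (4) | 62 (27) | 98 (43) |
| Not used as aberrant 3′ss (%) | 0 | 5 | 2 | 4 | 11 (5) |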
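
The purine/pyrimidine imbalance and the share of G among introduced bases can be recomputed from the ‘All mutations’ column of Table 2. The sketch below is illustrative only: the exact two-sided binomial test against an equal-probability null is an assumption, since the test used for this comparison is not stated, and the function name is hypothetical.

```python
from math import comb

# Counts from the "All mutations" column of Table 2.
wild_type = {"A": 81, "C": 25, "G": 84, "T": 21}
mutated = {"A": 64, "C": 31, "G": 89, "T": 27}

purines = wild_type["A"] + wild_type["G"]
pyrimidines = wild_type["C"] + wild_type["T"]
total = purines + pyrimidines

def binomial_two_sided(k, n):
    """Exact two-sided binomial P-value against a null of p = 0.5 (assumed null)."""
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_value = binomial_two_sided(purines, total)
g_fraction = mutated["G"] / sum(mutated.values())
print(purines, pyrimidines, f"{g_fraction:.0%}", f"{p_value:.1e}")
```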
As expected, point mutations were most common in highly conserved positions −1 (53/211; 25%) and −2 (48/211; 23%) relative to natural intron/exon junctions (Table 3). Position −3 was mutated in nine cases (∼4%). As noted in the initial analysis of all splice site mutations for position −2 (47), G-to-Y (in position −1; Y is pyrimidine) and A-to-Y (position −2) transversions were under-represented as compared with G-to-A and A-to-G transitions, respectively (P < 0.01 and P < 0.00001, assuming that substitutions to the remaining nucleotides were equally probable; Table 3). Since transitions are in significant excess in humans compared with the expected frequency of 33% (47), the expected numbers were calculated for each substitution using previously published single-nucleotide mutability rates in disease genes (Table 3). However, the observed number of G−1-to-T−1 mutations was too low to be explained by chance, suggesting that primary transcripts carrying the A−2T−1 acceptors generate on average more canonical mRNAs as compared with 3′AG mutated to other dinucleotides, leading to a detection bias against less severe phenotypes. This notion is supported by similar frequencies of G>T/C>A and G>C/C>T alterations among disease-causing point mutations (48) and by the presence of residual amounts of natural transcripts in some 5′G+1T+2–3′A−2T−1 introns both in Saccharomyces cerevisiae (49) and humans (50). However, comparison of the observed and expected distributions derived from di-nucleotide mutability rates that allow for the influence of neighbouring nucleotides (48) failed to confirm any bias for both intron positions (Table 3). 
Thus, although small effects of ‘leaky’ dinucleotides on the observed distribution cannot be excluded, these data are consistent with dramatic consequences for splicing of any point mutation in the highly conserved 3′AG and with indistinguishable defects of the second splicing step previously observed in vitro both for intron position −1 (51) and −2 (52).
Table 3

Number of single-nucleotide substitutions in 3′YAG that resulted in cryptic 3′ss

| Location of cryptic 3′ splice site | Observed: exon | Observed: intron | Observed: both | Expected: mono (a) | Expected: di (b) |
|---|---|---|---|---|---|
| Point mutations in position IVS-1 | 43 | 10 | 53 | | |
| −1G>A | 23 | 4 | 27 | 25.2 (c) | 31.5 (d) |
| −1G>C | 14 | 3 | 17 | 10.9 | 13.2 |
| −1G>T | 6 | 3 | 9 | 16.9 | 8.3 |
| Point mutations in position IVS-2 | 36 | 12 | 48 | | |
| −2A>C | 7 | 2 | 9 | 9.3 (e) | 8.8 (f) |
| −2A>G | 23 | 8 | 31 | 30.6 | 35.5 |
| −2A>T | 6 | 2 | 8 | 8.1 | 3.7 |
| Point mutations in position IVS-3 | 2 | 7 | 9 | | |
| −3C>N | 2 | 5 | 7 | | |
| −3T>N | 0 | 1 | 1 | | |
| −3A>N | 0 | 1 | 1 | | |

Expected numbers of both exonic and intronic cryptic 3′ss were calculated as a weighted average of relative mono-nucleotide (a) (47) and di-nucleotide (b) (48) mutability rates in the sense and antisense DNA strands that were published previously for a large number of point mutations in human disease genes. Relative substitution rates at the di-nucleotide level allow for the nearest-neighbour effects as previously described (48). (c) χ2 = 6.2, P = 0.046; (d) χ2 = 1.3, P > 0.05; (e) χ2 = 0.02, P > 0.05; (f) χ2 = 4.4, P > 0.05.

Interestingly, as many as 14/53 (26%) point mutations in position −1 (IVS-1G>A if the first exon nucleotide was G) (34,53–64), 3/48 (6%) substitutions in position −2 (IVS-2A>G) (65,66) and 2/9 (22%) point mutations in position −3 [IVS-3T>G (67) and IVS-3A>G (68)] created new 3′AG sites that were used in vivo (Figure 1). The proportion of AG-creating mutations in position −1 was higher than in position −2 (P = 0.01, Fisher's exact test), which may have contributed to the higher number of substitutions observed in position −1 than −2 (Table 3). In contrast to mutations in the 3′YAG consensus, the majority of substitutions in the PPT were AG-creating mutations. For example, in positions −5 to −26 relative to natural intron/exon junctions as many as 61/73 (84%) point mutations created new AGs (Figure 1). The overall proportion of AG-creating mutations that resulted in aberrant 3′ss was 43%, and ∼95% of the newly introduced 3′AGs were used in vivo (Table 2).
Figure 1

Frequency distribution of 184 intronic point mutations that activated aberrant 3′ss
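
The comparison of AG-creating mutations in positions −1 (14/53) and −2 (3/48) reported above can be checked with a small standard-library sketch of Fisher's exact test. The two-sided convention used here (summing all tables no more probable than the observed one) is an assumption, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k "successes" in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# AG-creating versus other point mutations in positions -1 and -2.
p_value = fisher_exact_two_sided(14, 53 - 14, 3, 48 - 3)
print(f"P = {p_value:.3f}")
```

Under these assumptions the result is on the order of the reported P = 0.01.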

Purine transitions, which accounted for ∼54% (113/211) of all aberrant 3′ss and dominated the mutation pattern of cryptic 3′ss, were also the most frequent point mutations leading to de novo 3′ss (54/101; 53%). De novo sites in introns resulted from purine transitions more often than de novo sites in exons (45/72 versus 9/29, χ2 = 7.0, P = 0.008). Intronic de novo 3′ss were most frequently induced by substitutions of A (29/72, 40%), whereas exonic de novo 3′ss were most commonly activated by point mutations of G (13/29, 45%; Table 2).
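
The comparison of purine transitions between intronic and exonic de novo sites (45/72 versus 9/29) can be sketched as a 2 × 2 chi-square test. The use of Yates' continuity correction is an assumption: the variant of the test is not stated in the text, and the correction is applied here because it gives a value close to the one reported. Function names are hypothetical.

```python
import math

def chi2_yates(a, b, c, d):
    """Chi-square statistic with Yates' continuity correction for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def chi2_p_value_1df(stat):
    """Upper-tail P-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Purine transitions versus other substitutions:
# intronic de novo sites (45 of 72) and exonic de novo sites (9 of 29).
stat = chi2_yates(45, 72 - 45, 9, 29 - 9)
p_value = chi2_p_value_1df(stat)
print(f"chi2 = {stat:.1f}, P = {p_value:.3f}")
```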

Comparison of computational tools to predict mutation-induced aberrant 3′ss in vivo

The predicted strength of aberrant, mutated and corresponding authentic 3′ss was analyzed using publicly available computational tools shown in Table 4. Each of the tested models distinguished authentic, mutated and aberrant 3′ss, with authentic sites giving, on average, the highest scores or information bits, followed by aberrant and then by mutated 3′ss (Table 5). However, this was not the case for each category of aberrant acceptors.
Table 4

Comparison of the strength of authentic, mutated and aberrant 3′ss

| Location of aberrant 3′ss | Exon: mutation in 3′YAG (cryptic) | Exon: mutation outside 3′YAG (‘de novo’) | Intron: mutation in 3′YAG (cryptic) | Intron: mutation outside 3′YAG (‘de novo’) | Both: all mutations |
|---|---|---|---|---|---|
| Shapiro and Senapathy matrix score | | | | | |
| A (SD) | 84.5 (6.5) | 79.3 (9.8) | 85.0 (6.1) | 81.0 (7.8) | 82.6 (7.7) |
| M (SD) | 67.5 (8.0) | 78.9 (9.4) | 70.2 (6.6) | 79.4 (8.7) | 73.5 (10.0) |
| CR (SD) | 74.5 (9.2) | 78.6 (10.4) | 82.7 (8.5) | 77.3 (9.0) | 77.1 (9.5) |
| A–M | 17.1 | 0.5 | 14.8 | 1.6 | 9.2 |
| M–CR | 7.1 | −0.3 | 12.4 | −2.1 | 3.6 |
| A–CR | 10.0 | 0.7 | 2.4 | 3.6 | 5.5 |
| Maximum entropy model | | | | | |
| A (SD) | 8.7 (3.5) | 6.7 (4.0) | 8.6 (3.0) | 7.3 (2.9) | 7.9 (3.4) |
| M (SD) | −0.3 (4.5) | 5.8 (5.1) | −0.4 (3.6) | 3.8 (4.1) | 2.0 (5.0) |
| CR (SD) | 4.1 (3.9) | 5.5 (5.6) | 6.4 (3.6) | 5.2 (4.5) | 5.0 (4.4) |
| A–M | 8.4 | 0.5 | 9.0 | 3.2 | 5.6 |
| M–CR | 4.1 | −0.4 | 6.8 | 1.8 | 3.0 |
| A–CR | 4.4 | 0.9 | 2.2 | 1.4 | 2.6 |
| First-order Markov model | | | | | |
| A (SD) | 9.1 (3.0) | 7.1 (4.1) | 8.9 (2.8) | 7.5 (3.0) | 8.2 (3.2) |
| M (SD) | 0.1 (3.9) | 6.2 (5.1) | 0.3 (3.0) | 4.0 (4.1) | 2.3 (4.7) |
| CR (SD) | 4.4 (3.9) | 5.3 (5.9) | 7.1 (3.2) | 5.4 (4.8) | 5.2 (4.6) |
| A–M | 8.9 | 0.8 | 8.5 | 3.5 | 5.9 |
| M–CR | 4.3 | −0.9 | 6.7 | 1.4 | 2.9 |
| A–CR | 4.7 | 1.7 | 1.8 | 2.1 | 3.0 |
| Weight matrix model | | | | | |
| A (SD) | 9.9 (3.8) | 7.3 (4.3) | 9.6 (3.0) | 7.5 (4.1) | 8.7 (4.0) |
| M (SD) | 1.1 (4.3) | 7.1 (4.2) | 1.4 (3.1) | 6.6 (4.0) | 3.9 (4.9) |
| CR (SD) | 4.6 (4.8) | 6.2 (5.3) | 8.1 (3.5) | 6.2 (4.8) | 5.8 (4.8) |
| A–M | 8.8 | 0.3 | 8.2 | 0.9 | 4.8 |
| M–CR | 3.6 | −0.9 | 6.8 | −0.4 | 2.0 |
| A–CR | 5.3 | 1.2 | 1.5 | 1.3 | 2.8 |
| Information content | | | | | |
| A (SD) | 9.9 (3.9) | 8.1 (3.9) | 9.6 (3.1) | 7.9 (3.9) | 8.9 (3.9) |
| M (SD) | 2.1 (3.8) | 7.6 (3.8) | 2.4 (3.3) | 7.0 (3.7) | 4.6 (4.5) |
| CR (SD) | 5.9 (3.7) | 7.2 (3.3) | 8.0 (3.5) | 7.5 (3.6) | 6.9 (3.7) |
| A–M | 7.8 | 0.5 | 7.2 | 0.8 | 4.3 |
| M–CR | 3.5 | −0.7 | 5.5 | 0.1 | 2.2 |
| A–CR | 4.1 | 1.0 | 1.7 | 0.7 | 2.2 |

Means and SD of splice prediction scores (S&S, ME, MM, weight matrix model) or bits (information content) for authentic (A), mutated authentic (M) and aberrant (CR) 3′ss. The length of input sequences was 15 (−14 to +1 relative to authentic 3′ss), 23 (−20 to +3), 23 (−20 to +3), 23 (−20 to +3) and 28 (−26 to +2) nt, respectively. The information content algorithm failed to recognize 7 authentic, 12 mutated and 28 aberrant 3′ss used in vivo. The missing values (47/654, 7%) were treated as a group mean.

Table 5

Discrimination of computational tools between authentic, mutated and cryptic/de novo 3′ss

| Location of aberrant 3′ splice sites | Exon: mutation in 3′YAG (cryptic) | Exon: mutation outside 3′YAG (‘de novo’) | Intron: mutation in 3′YAG (cryptic) | Intron: mutation outside 3′YAG (‘de novo’) | Both: all mutations |
|---|---|---|---|---|---|
| Shapiro and Senapathy matrix score | | | | | |
| A–M | 7.2 × 10⁻²⁶ | 0.4 | 1.3 × 10⁻⁹ | 0.16 | 2.1 × 10⁻²³ |
| CR–M | 5.2 × 10⁻⁸ | 0.5 | 4.1 × 10⁻⁷ | 0.09 | 2.9 × 10⁻⁵ |
| A–CR | 1.2 × 10⁻¹³ | 0.4 | 0.13 | 0.01 | 9.5 × 10⁻¹¹ |
| Maximum entropy model | | | | | |
| A–M | 2.4 × 10⁻²⁸ | 0.3 | 1.6 × 10⁻¹⁰ | 8.4 × 10⁻⁹ | <10⁻³² |
| CR–M | 1.3 × 10⁻¹¹ | 0.4 | 2.2 × 10⁻⁸ | 8.4 × 10⁻³ | 6.9 × 10⁻¹³ |
| A–CR | 1.5 × 10⁻¹⁵ | 0.2 | 4.5 × 10⁻³ | 1.9 × 10⁻³ | 1.8 × 10⁻¹⁵ |
| First-order Markov model | | | | | |
| A–M | 2.3 × 10⁻²⁸ | 0.3 | 1.2 × 10⁻¹⁰ | 1.5 × 10⁻⁸ | <10⁻³² |
| CR–M | 1.0 × 10⁻¹¹ | 0.2 | 5.2 × 10⁻⁹ | 9.3 × 10⁻³ | 1.7 × 10⁻¹² |
| A–CR | 5.1 × 10⁻¹⁶ | 0.1 | 0.01 | 6.3 × 10⁻³ | 1.4 × 10⁻¹⁴ |
| Weight matrix model | | | | | |
| A–M | 1.9 × 10⁻²⁴ | 0.4 | 4.1 × 10⁻¹⁰ | 0.09 | 2.5 × 10⁻²⁴ |
| CR–M | 3.2 × 10⁻⁷ | 0.2 | 1.4 × 10⁻⁸ | 0.4 | 8.1 × 10⁻⁶ |
| A–CR | 2.7 × 10⁻¹³ | 0.2 | 0.04 | 0.07 | 1.8 × 10⁻¹⁰ |
| Information content | | | | | |
| A–M | 9.2 × 10⁻²³ | 0.3 | 4.8 × 10⁻⁹ | 0.08 | 1.9 × 10⁻²⁰ |
| CR–M | 1.8 × 10⁻¹⁰ | 0.3 | 2.9 × 10⁻⁷ | 0.2 | 5.7 × 10⁻⁹ |
| A–CR | 6.8 × 10⁻¹¹ | 0.2 | 0.02 | 0.2 | 3.2 × 10⁻⁹ |

Table cells contain P-values of Wilcoxon Mann–Whitney rank tests comparing authentic (A), mutated (M) and cryptic/de novo (CR) 3′ss. P-values < 0.05 are in bold.
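
The rank test behind the P-values in Table 5 can be sketched with the standard library. This is a minimal illustration of a two-sided Mann–Whitney U test using the normal approximation with a tie correction and no continuity correction; it is not the Stat-200 implementation, and the scores below are hypothetical values, not data from this study.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation with tie correction)."""
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical scores for authentic and aberrant 3'ss.
authentic_scores = [9.1, 8.4, 7.9, 10.2, 8.8, 9.5]
aberrant_scores = [5.0, 4.2, 6.1, 3.8, 8.4, 5.6]
u_stat, p_value = mann_whitney_u(authentic_scores, aberrant_scores)
print(u_stat, f"{p_value:.3f}")
```

For samples as large as those in Table 4 the normal approximation is usually adequate; for very small groups an exact test would be preferable.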

First, each computational tool was more effective in discriminating authentic and aberrant 3′ss that resulted from mutations in the 3′YAG consensus than from mutations elsewhere (Table 5). This was due to significantly higher scores for authentic 3′ss that corresponded to cryptic 3′ss than for authentic counterparts of de novo sites. For example, the S&S scores for authentic counterparts of de novo and cryptic 3′ss were 80.5 ± 8.4 (±SD) and 84.6 ± 6.4 (P < 10⁻⁷, Wilcoxon Mann–Whitney rank test), respectively. Similarly, the ME scores were 7.2 ± 3.2 and 8.6 ± 3.3, respectively (P < 10⁻⁷). In contrast, the score differences between cryptic and de novo 3′ss were not statistically significant (means of the S&S matrix scores were 76.5 versus 77.7, P = 0.3; means of the ME scores were 4.7 versus 5.3, P = 0.4, respectively). Scores or information bits for each category of aberrant acceptors are shown in Table 4. These results indicate that authentic counterparts of de novo 3′ss are intrinsically weak and can be outcompeted by newly created splicing consensus elements. They also suggest that mutations or genetic variants flanking weak splice sites are more likely to play a role in regulated splicing than those near well-defined sites, consistent with weakening of splicing signals in evolution from virtually invariable sequences in yeasts to highly degenerate in humans and a need for more sophisticated regulation in complex organisms at the level of alternative splicing. Second, each algorithm could distinguish cryptic and authentic 3′ss in exons, whereas matrix-based scores struggled to differentiate between authentic and cryptic 3′ss in introns where the ME and MM were the only models that showed P-values of 0.01 or lower (Table 5). Third, de novo 3′ss could not be discriminated from authentic sites by any algorithm if located in exons. 
Although this could be partly attributed to a smaller sample size of exonic than intronic de novo sites (Table 1), a similar sample of intronic cryptic 3′ss did show statistically significant differences for a subset of algorithms (Table 5). Finally, the difference between intronic de novo sites and their authentic counterparts was statistically significant with the ME and MM models but not with the remaining algorithms, except for the S&S matrix scores. Taken together, these results indicated that the value of computational tools to predict aberrant 3′ss depended on their localization in introns and exons as well as on the underlying mutation, and that the ME was the best model discriminating mutation-induced aberrant 3′ss in vivo from corresponding authentic 3′ss. They also suggested that the failure to distinguish exonic de novo 3′ss from authentic counterparts may be due to our as yet incomplete understanding of the role of exonic splicing silencers or enhancer elements in 3′ss selection.

Single-nucleotide composition upstream of aberrant 3′ss

Comparison of the nucleotide structure upstream of aberrant and authentic 3′ss revealed a significantly higher proportion of purines in aberrant 3′ss. For example, in intronic positions −3 through −26 aberrant 3′ss had 1760 purines as opposed to 1526 purines in authentic 3′ss (χ2 = 23.7, P < 0.00001; Supplementary Figure 1). Overall, this was attributable to a higher number of As (χ2 = 13.5, P < 0.001) rather than Gs (χ2 = 6.4, P = 0.01; Supplementary Figure 1A). The increase of purine residues was almost exclusively at the expense of uridines for aberrant 3′ss in exons (Supplementary Figure 1B and C). In contrast, aberrant 3′ss in introns showed only a borderline increase of purine residues (χ2 = 3.2, P = 0.07), largely owing to cytosine depletion (Supplementary Figure 1D and E). De novo 3′ss in exons had a smaller number of Gs as compared with authentic 3′ss, but the difference was not statistically significant (Supplementary Figure 1C). The increase of purines in aberrant 3′ss was the highest in position −3 where As were 6× more frequent than in authentic 3′ss (Supplementary Figure 2A, χ2 = 26.5, P < 0.000001). The number of aberrant 3′ss with G in position −3 was also higher (7 versus 2) in aberrant (65,66,69–73) than in corresponding authentic (70,74) 3′ss. Positive associations between −3C and upstream Cs in the PPT and between −3T and upstream Ts, which were described previously for authentic 3′ss (75), were observed also for aberrant 3′ss (Supplementary Table 2). Although the influence of −3C on the relative usage of C versus T in the PPT may be attributed to autocorrelation due to compositional similarities of local genomic regions (75), sequence constraints resulting from cooperative interactions at the 3′ss could not be excluded. 
Indeed, non-random distributions at −3 observed for positions −11, −12, −17 and −19 of aberrant 3′ss (Supplementary Table 2) may be explained by inefficient binding of U2AF to RNAs carrying 3′TAG as compared to 3′CAG (76) and a need for functional compensation of the former by stronger interaction of U2AF65 (or other PPT-binding proteins) with uridines at positions −11 and −12 rather than cytosines. Associations further upstream may involve similar compensation by more optimal BPS interactions with the RS domain of U2AF (77–79) and/or, possibly, other BPS-interacting factors, including K- and Quaking-homology 2 domains of SF1 (12,80,81) or U2 small nuclear RNA (82,83). Similar associations at −3 with upstream intron positions were seen also for authentic counterparts of aberrant 3′ss (data not shown), confirming previous findings with a larger dataset (75). Although most of the analyzed positions upstream of aberrant 3′ss showed uridine depletion as compared to authentic sites (e.g. 565 versus 659 Ts in positions −5 to −10; χ2 = 12.8, P < 0.001; Supplementary Figure 2A), their numbers were similar further upstream between positions −11 and −13 (311 versus 319, P = 0.7). Cs were slightly under-represented between positions −11 and −13 in aberrant 3′ss (168 versus 202, χ2 = 4.0, P = 0.04). The T-to-C ratio in aberrant 3′ss was the highest in position −11 (2.53 versus 1.70 in authentic), while the average (±SD) ratios between positions −4 and −26 in aberrant and authentic 3′ss were similar (1.52 ± 0.34 and 1.55 ± 0.28, respectively). Aberrant 3′ss with purine at −3 had higher T-to-C ratios between −11 and −13 than aberrant 3′ss with pyrimidine at −3 (2.54 versus 1.69). The number of Gs in this region was significantly higher in aberrant than authentic 3′ss (153 versus 92 in positions −9 to −12, χ2 = 17.0, P < 0.0001), particularly in cryptic sites, whereas the number of As in these positions was not different (125 versus 115, χ2 = 0.4, P = 0.5).
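
The position-specific T-to-C ratios discussed above can be computed from aligned intron ends along the following lines. This is a sketch under stated assumptions: each sequence is taken to end at intron position −1, so that Python's negative indexing maps directly onto intron positions, and the three sequences and function names are hypothetical.

```python
def base_counts_at(sites, position):
    """Count bases at an intron position (e.g. -11) across sequences ending at position -1."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for seq in sites:
        counts[seq[position]] += 1  # negative index counts back from the 3' end
    return counts

def t_to_c_ratio(sites, first, last):
    """T-to-C ratio pooled over intron positions first..last (inclusive, both negative)."""
    t = c = 0
    for position in range(first, last + 1):
        counts = base_counts_at(sites, position)
        t += counts["T"]
        c += counts["C"]
    return t / c if c else float("inf")

# Hypothetical 14-nt intron ends (positions -14 to -1), each ending in the 3'AG.
toy_sites = [
    "TTTCTTCCTTTCAG",
    "CTTTTCTTTCCTAG",
    "TCTCTTTTCTTAAG",
]
ratio = t_to_c_ratio(toy_sites, -13, -11)
print(ratio)
```

Splitting the input by the base at position −3 before calling `t_to_c_ratio` would give the purine versus pyrimidine comparison described in the text.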

Di-nucleotide composition upstream of aberrant 3′ss

The number of AG dinucleotides, which are depleted in ‘AG exclusion zones’ upstream of authentic 3′ss (24,75,84), was significantly higher in aberrant than corresponding authentic 3′ss (Supplementary Figure 2B). In a 17 nt sequence upstream of 3′ss where the AG depletion in natural 3′ss is the most pronounced (75), the numbers of authentic and aberrant 3′ss with a non-3′ss (‘intervening’) AG were 15 and 36, respectively (χ2 = 8.8, P = 0.003), while the number of AGs in the two groups was 15 and 40 (binomial test, P = 0.0003). The observed frequency of authentic 3′ss with non-3′ss AGs in this region (∼16%) was similar to those previously reported for constitutively (14%) and alternatively (17%) spliced introns that contained intervening AGs downstream of predicted BPS (21). Between positions −3 and −26, there were 53 versus 80 AG-containing 3′ss (χ2 = 7.2, P = 0.007) and 64 versus 95 intervening AGs (binomial test, P = 0.003), respectively. No AG dinucleotides were found in positions −10 and −11 of aberrant 3′ss. Although the number of intervening AGs was low, putative differences of these and other purine dinucleotides between aberrant and authentic 3′ss in intron positions −25, −24, −22, −20 or −19 upstream of 3′ss are consistent with a distinct average distance of the BPS from aberrant versus authentic 3′ss. Peak frequencies of the GA and AA dinucleotides that may signify the presence of a branchpoint in the mammalian BPS consensus YNYURAY were shifted several nucleotides upstream in aberrant 3′ss (Supplementary Figure 2B). The remaining purine dinucleotides were also more common in aberrant than in authentic sites. The increase of AA dinucleotides (253 versus 185 in positions −26 to −3, P = 0.001), which were found in excess upstream of authentic 3′ss as compared to pseudo-sites (75), was largely attributable to position −3 due to the excess of −3As in aberrant 3′ss (Supplementary Figure 2A, B). 
The GG dinucleotides (186 in aberrant versus 118 in authentic sites in the same region, P < 0.0001) also clustered in some positions, such as −17 to −21 (56 versus 19, χ2 = 17.9, P < 0.0001) and −8 to −12 (49 versus 26, χ2 = 6.7, P < 0.01, respectively). A region upstream of 3′ss in vertebrates (75) and Arabidopsis thaliana (85) contains a higher number of TG dinucleotides as compared to pseudo-splice sites, suggesting that they are important for correct 3′ss recognition. Although the total number of TGs in positions −3 to −26 was similar in aberrant and authentic 3′ss (430 versus 428), there were 94 and 60 TGs in positions −10 to −13 in aberrant and corresponding authentic sites, respectively (χ2 = 7.7, P = 0.005). The number of GTs in the same region was also higher in aberrant sites (56 versus 34; χ2 = 5.2, P = 0.02). In contrast, the number of TTs in the same region was similar (235 versus 200, P > 0.05) both in cryptic and de novo sites, whereas aberrant 3′ss showed TT depletion for most of the remaining positions. The number of CC dinucleotides between position −10 and −13 was lower in aberrant 3′ss (71 versus 99, χ2 = 4.7, P = 0.03), but this difference was limited to de novo sites (χ2 = 10.7, P = 0.001). The TT-to-CC ratio in aberrant 3′ss was the highest in position −12 (8.14 versus 2.76 in authentic), whereas the average (±SD) between positions −5 to −26 was 2.26 (±1.43), with 2.21 ± 0.56 in authentic counterparts. Position −11 shows peak uridine frequencies in vertebrate PPTs (86), most probably due to highly conserved interactions with the second RNA recognition motif (RRM2) of U2AF65, a central organizing force for 3′ss recognition in higher eukaryotes, or with competing pyrimidine-binding proteins (14,87,88). 
The same position was efficiently crosslinked to RRM2 of U2AF65 in several PPTs (87) and substitutions of T−11 generated lower levels of spliced products and prespliceosomal complexes than identical mutations of T−8 or T−14 (89), suggesting that the observed single- and di-nucleotide imbalances between aberrant and authentic 3′ss centred around this position have functional significance. Higher T-to-C and TT-to-CC ratios in aberrant 3′ss in this area are proposed to improve these interactions and functionally compensate their less favourable sequence context (Supplementary Figure 2A and Tables 4 and 5). The difference in the number of C−12C−11 between aberrant and authentic 3′ss (7 versus 21, χ2 = 6.4, P = 0.01) suggests that this di-nucleotide does not sufficiently promote U2AF binding and that at least one uridine is required in either position for the productive interaction since the numbers of T−12C−11 or C−12T−11 were not significantly different in aberrant and authentic 3′ss (Supplementary Figure 2B). This notion is in agreement with ∼80- to 100-fold inhibition of U2AF65 binding following chemical modification of the uridine N3 and O4 atoms, the only positions that differ between the two nucleosides (90). However, the CC dinucleotides in positions −11 to −13 were over-represented in authentic counterparts of de novo sites (53 versus 21, χ2 = 13.7, P < 0.001) but not cryptic sites (19 versus 26, P > 0.05), suggesting that they signify natural 3′ss that compete poorly with and may be susceptible to mutation-induced 3′ss. In contrast to cytosines, both de novo and cryptic 3′ss showed an increase of TGs/GTs between positions −10 and −13 (64 versus 40, χ2 = 5.4, P < 0.05 and 86 versus 54, χ2 = 7.4, P < 0.01, respectively) as compared to authentic counterparts. 
A relative lack of G−12T−11/T−11G−10 in authentic sites suggests that such 3′ss may compete relatively well with newly introduced 3′ss, consistent with an earlier observation that GU tracts can substitute for pyrimidine tracts (91), probably as a result of flexible side chain rearrangements of U2AF65 and/or relocation of bound water molecules (92).
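
The dinucleotide comparisons in this section can be approximated with a short script. This is a minimal sketch, not the procedure used in the study: the helper names (`count_dinucleotide`, `chi2_equal_expectation`), the indexing convention (each sequence is intronic and ends with the 3′AG, so its last base is position −1) and the choice of a one-degree-of-freedom test with equal expected counts are all assumptions. The text does not state how its χ2 values were derived, so the statistic obtained here for 56 versus 19 (about 18.3) is close to, but not identical with, the reported 17.9.

```python
import math


def count_dinucleotide(seqs, dinuc, first, last):
    """Count occurrences of `dinuc` whose 5' base lies at positions first..last.

    Positions are negative and relative to the intron end: the last base of
    each sequence is position -1 (assumed convention). `first` is the more
    upstream bound, e.g. first=-13, last=-10.
    """
    total = 0
    for seq in seqs:
        s = seq.upper().replace("U", "T")
        for pos in range(first, last + 1):
            start = len(s) + pos
            # Skip windows that fall outside the sequence.
            if start < 0 or start + 2 > len(s):
                continue
            if s[start:start + 2] == dinuc:
                total += 1
    return total


def chi2_equal_expectation(a, b):
    """Chi-square (1 d.f.) for two counts expected to be equal, with P value."""
    chi2 = (a - b) ** 2 / (a + b)
    # Survival function of the chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical toy intron end, for illustration only.
toy = ["CCCCCCTGTTTTTTTTTCAG"]
tt_count = count_dinucleotide(toy, "TT", -13, -10)

# Counts quoted in the text for GG dinucleotides at positions -17 to -21.
chi2, p = chi2_equal_expectation(56, 19)
print(f"TT count in toy sequence: {tt_count}; chi2 = {chi2:.2f}, P = {p:.2g}")
```

A contingency-table test that also accounts for the total number of sites in each group would be a more faithful model, but the required denominators are not given in this section.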

Depletion of aberrant 3′ss upstream and downstream of authentic 3′ss

Distribution of the distances between aberrant and authentic 3′ss with the updated sample confirmed a previously reported (11) bias of cryptic 3′ss towards exons and de novo sites towards introns (Supplementary Figure 3A and B). Major frequency peaks for cryptic and de novo 3′ss were 8 and −10 nt from authentic 3′ss, respectively (median distances in each category of aberrant 3′ss are in Table 1). In addition, a relative depletion of both cryptic and de novo 3′ss emerged further upstream and downstream. A lack of cryptic 3′ss upstream is apparently due to AG depletion (11), although cryptic 3′ss activation may also be prevented by spliceosomal complexes assembled around the branch site. The latter explanation is likely to account for the observed depletion of de novo 3′ss, which is located further upstream than that of cryptic sites (∼50 nt, Supplementary Figure 3B). Smaller areas of depletion for cryptic 3′ss 30–40 nt downstream of authentic 3′ss and ∼20 nt downstream for de novo sites were followed by a second peak at 50–60 nt. The exonic depletion may be explained by a lack of suitable alternative BP adenosines within an optimal distance from de novo 3′ss, cross-exon interactions, selection against codons carrying AGs or a combination of these factors. In contrast to the asymmetric distributions of cryptic and de novo 3′ss, the frequency plot of all aberrant 3′ss was virtually symmetric, with a median distance of just 1 nt from authentic 3′ss (Table 1 and data not shown). Finally, the observed frequency distribution suggests that aberrant 3′ss retaining the BPS and PPT of their authentic counterparts may be more frequent than those that use a new BPS-PPT-3′AG unit.

DBASS3: a database of aberrant 3′ss

Nucleotide sequences of all aberrant 3′ss were compiled in a new online resource available at . The DBASS3 web interface provides access to the database through the ‘search’ option. The user can search DBASS3 by phenotype, gene designation, mutation, location of aberrant 3′ss and their distance from authentic 3′ss. Aberrant 3′ss generated in terminal exons can also be easily retrieved. In cases in which a search identifies more than one database entry, the results page displays the gene, phenotype and location of aberrant 3′ss for all corresponding hits. The user can then choose details pages that show nucleotide sequences flanking the authentic and cryptic 3′ss, literature references with PubMed links and the estimated strength of each splice site for the tested algorithms. In addition, the details page shows how aberrant 3′ss change the reading frame of each transcript (0, +1 and +2 nt). DBASS3 visitors can also submit published data to the corresponding author and receive regular updates by email. Potential applications of DBASS3 include the optimization of computational tools for prediction of aberrant splice sites, detection of introns or exons that are frequently involved in aberrant splicing, identification of splicing mutations and aberrant 3′ss in a gene or phenotype of interest, and investigating basic mechanisms of 3′ss selection.
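
The reading-frame annotation (0, +1 and +2 nt) mentioned above can be illustrated with a few lines of code. This is a sketch under stated assumptions, not the DBASS3 implementation: the function name `frame_shift` is hypothetical, and the sign convention (negative offsets are intronic and add nucleotides to the mRNA, positive offsets are exonic and remove them) follows the distances quoted in the text but may differ from the convention used by the database.

```python
def frame_shift(offset_nt):
    """Return the reading-frame shift (0, 1 or 2) caused by an aberrant 3'ss.

    offset_nt is the position of the aberrant 3'ss relative to the authentic
    one (assumed convention): negative = upstream, intronic sequence is
    included in the mRNA; positive = downstream, exonic sequence is lost.
    """
    if offset_nt == 0:
        raise ValueError("aberrant and authentic 3'ss coincide")
    # Net change in mRNA length: upstream sites add nt, downstream sites remove nt.
    length_change = -offset_nt
    # Python's % always returns a non-negative result for a positive modulus.
    return length_change % 3


# Peak distances quoted in the text for cryptic (8) and de novo (-10) sites.
for offset in (8, -10, -3):
    print(offset, "->", frame_shift(offset))
```

A result of 0 means the aberrant transcript stays in frame; 1 and 2 correspond to the +1 and +2 nt classes.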

DISCUSSION

A high proportion of AG-creating mutations activating aberrant 3′ss

This study is the first to provide a detailed survey of mutations leading to aberrant 3′ss. It showed that the distribution of single-nucleotide substitutions roughly reflected the degree of conservation of consensus sequences that define 3′ss (Figure 1) and revealed a high proportion of mutations creating the 3′AG consensus (Table 1). The observed frequency of AG-creating mutations (42%) was considerably higher than the estimated ∼13% in the initial analysis of splicing mutations (47). Only ∼5% (n = 11, Table 2) of these mutations failed to activate de novo 3′ss in situ and instead induced one or more aberrant 3′ss upstream (36,70,93–96) or downstream (62,97,98) of the newly introduced AGs. These mutations were in position −3 (36,93), −9 (62,98), −10 (96), −14 (70), −15 (97), −17 (95) and −24 (94) relative to authentic 3′ss (Supplementary Table 1). Mutations in positions −3 and −24 directly inactivated 3′YAG and BPS, respectively, but the remaining AG-creating mutations were all in ‘AG-exclusion zones’ downstream of the BPS. The distance between the predicted BP adenosine and the new 3′AG was 9–20 nt (Supplementary Table 1). Aberrant 3′ss with the ‘BP-new AG’ distances between 9 and 16 nt were either in exons or upstream of the BPS, and new AGs were never selected as 3′ss, consistent with protein complexes bound to a ∼19 nt region downstream of the BP (99). In the FALDH gene (70), this distance was 20 nt and a normally silent AG located 9 nt downstream of the BPS was activated by the newly created AG a further 11 nt downstream. However, this putative exception can be explained by inefficient recognition of the new 3′AG, which was preceded by G, unlike the remaining aberrant 3′ss (Supplementary Table 1). Alternatively, selection of aberrant 3′ss in this FALDH intron can be explained by almost identical BPS sequences arranged in tandem, with the upstream BP at the optimal distance (18 nt) from aberrant 3′ss. 
In contrast, wild-type AGs 6 and 7 nt downstream of the predicted BP were not selected (36,98). Although the location of AG exclusion zones is likely to be substrate-dependent, these data suggest that the average zone is between ∼7 and ∼19 nt downstream of the BP adenosine, consistent with previous studies of intervening AGs (11,19,21,99).
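
The notion of an AG exclusion zone can be made concrete with a small scanning routine. This is a minimal sketch, assuming a zone of 7 to 19 nt taken from the approximate range quoted above; the function name `ags_downstream_of_bp`, the 0-based `bp_index` argument and the definition of distance (index of the A of each AG minus the index of the BP adenosine) are illustrative assumptions, and the text stresses that the actual zone is likely to be substrate-dependent.

```python
def ags_downstream_of_bp(seq, bp_index, zone=(7, 19)):
    """List AG dinucleotides downstream of a predicted branch point adenosine.

    Returns (distance, in_zone) tuples, where distance is counted from the
    BP adenosine to the A of each AG (assumed convention) and in_zone flags
    AGs that fall inside the approximate exclusion zone.
    """
    s = seq.upper().replace("U", "T")
    lo, hi = zone
    hits = []
    i = s.find("AG", bp_index + 1)
    while i != -1:
        distance = i - bp_index
        hits.append((distance, lo <= distance <= hi))
        i = s.find("AG", i + 1)
    return hits


# Hypothetical intron fragment: BPS-like CTAAC, an intervening AG, then the 3'AG.
fragment = "CTAAC" + "T" * 8 + "AG" + "T" * 10 + "CAG"
for distance, in_zone in ags_downstream_of_bp(fragment, bp_index=3):
    label = "inside" if in_zone else "outside"
    print(f"AG at +{distance} nt from BP: {label} the assumed exclusion zone")
```

In this toy fragment the intervening AG lies within the assumed zone, whereas the terminal AG lies beyond it.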

Selection of cryptic 3′ss upstream of BPS

If 3′ss are selected by unidirectional scanning for 3′YAG downstream of the BPS (91), why are so many cryptic 3′ss upstream of the predicted BPS used in vivo? Inspection of downstream exonic sequences in 29 cases of intronic cryptic 3′ss (Table 2) showed that eight were in terminal introns (67,100–106) (Table 1), which was significantly more frequent (χ2 = 5.6, P = 0.018) than for the remaining categories of aberrant 3′ss (Table 1), one was activated in a downstream intron (107) and two were associated with cryptic 3′ss in the following exon (108). Of the remaining sites, 13 cases either completely lacked exonic 3′YAG consensus in the context of four or more upstream pyrimidines or contained this consensus only in the last 20 nt of the exon (2,65,66,72,93,109–116). These 3′YAGs are unlikely to be used as 3′ss given inefficient inclusion of very small exons in mRNA (117) and a typical recognition site of RRM of ∼4–7 nt [(87) and references therein]. This strongly suggests that the choice of upstream 3′ss is influenced by the availability of 3′YAGs in the downstream exon and their distance from the exon end, and is consistent with unidirectional scanning that is inefficient in terminal exons. It is therefore possible that a new, competing BPS-PPT-3′AG unit is selected after the initial scanning of the downstream exon for AGs is completed. However, there has been no obvious reason for using upstream 3′ss in at least some of the remaining introns (36,118,119). These rare cases and similar examples identified in the future might provide interesting insights into cellular mechanisms that discriminate between authentic 3′ss and pseudo-acceptors.

Random distribution of the reading frames in transcripts that use aberrant 3′ss

Aberrant splicing often results in transcripts containing premature termination codons (PTCs). Such transcripts are downregulated by nonsense-mediated RNA decay (NMD), which degrades PTC-containing mRNAs whose translation may be deleterious for the cell (120). Whereas EST databases over-represent alternative splicing events that maintain the reading frame (121), neither cryptic 5′ss (10) nor aberrant 3′ss (Table 1, χ2 = 8.2, 6 d.f., P = 0.2) (11) showed any bias against splice sites involving a frameshift with respect to the authentic sites, even though many mRNAs frameshifted by +1 and +2 nt would be expected to trigger NMD. These results can be explained by a great reduction of RNA downregulation in response to a PTC in transcripts containing PPT Y-to-R mutations that reduced splicing (122). In addition, NMD usually does not completely eliminate RNAs with PTCs, and the activated cryptic sites that result in frameshifts can still be detected with RT–PCR, a method used by the authors of most DBASS3 records.
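
The P value quoted for the reading-frame comparison (χ2 = 8.2, 6 d.f.) can be checked from the statistic alone. The sketch below uses the closed-form survival function of the chi-square distribution, which holds for an even number of degrees of freedom; the function name `chi2_sf_even_df` is an assumption introduced for illustration and the underlying contingency table (Table 1) is not reproduced here.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square statistic for even d.f.

    Uses the identity P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which is exact when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0   # k = 0 term of the series
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(8.2, 6)
print(f"P = {p_value:.3f}")
```

The result, approximately 0.22, agrees with the rounded value of 0.2 given in the text.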

The maximum entropy model as a method of choice for predicting aberrant 3′ss

This study demonstrated that the ability of current computational tools to predict utilization of aberrant 3′ss is influenced by their localization and the underlying mutation. The best overall model discriminating authentic and aberrant 3′ss was the ME model, validating previous predictions based on comparisons of genuine 3′ss and pseudo-acceptors (28). The ME model outperformed the remaining algorithms for each category of aberrant 3′ss and, together with the MM model, was the only method that could separate authentic from de novo 3′ss in introns at a significance level <0.01. Since none of the tested tools discriminated between de novo 3′ss in exons and their authentic counterparts (Table 5), these aberrant 3′ss were tested with additional algorithms, including NetGene2 (25,123) available at and ASSP (alternative splice site predictor; ) method (124). NetGene2 considers more distant features that include global coding information and distances between potential splice sites, whereas ASSP is based on two neural networks pre-processed by position-specific matrix scores. However, neither method revealed a difference for this category of aberrant 3′ss. Although this study is the first to focus on 3′ss utilized in vivo as opposed to previous comparisons with pseudo-sites, this approach has limitations. First, even though each aberrant 3′ss was confirmed by sequencing, aberrant splicing was reliably and accurately quantified only in a subset of case reports and was highly variable from mutation to mutation, ranging from a few per cent to 100% utilization. This could be improved in future case reports and, as DBASS3 submissions permit inclusion of this information in future database records, taken into account in subsequent analyses. Second, despite the cell-specific nature of alternative splicing, measurements of aberrant and authentic RNA products have been obtained largely for blood leukocytes and only rarely for other cell types. 
Even with these limitations, future updates of DBASS3 may provide valuable insights into nucleotide dependencies between individual positions and distribution of trinucleotides that were significantly favoured or avoided upstream of authentic 3′ss as compared to pseudo-sites (75), as well as other motifs.

CONCLUSIONS

This work showed that (i) almost one half of aberrant 3′ss resulted from AG-creating mutations and from the introduction of guanosine, a virtually invariant nucleotide in both terminal positions of U2-dependent introns; (ii) the higher frequency of transitions over transversions observed for both positions of 3′AG can be attributed to relative di-nucleotide mutability rates rather than a detection bias resulting from a differential splicing efficiency of mutated 3′AGs; (iii) purine transitions leading to de novo sites in introns were more frequent than for de novo sites in exons; (iv) the maximum entropy model was the best model discriminating authentic and mutation-induced aberrant 3′ss used in vivo; (v) authentic counterparts of de novo 3′ss were intrinsically weak; (vi) the nucleotide sequence upstream of aberrant 3′ss had a higher purine content than corresponding authentic sites, particularly in position −3; (vii) as with authentic sites, aberrant 3′ss showed positive associations at −3 with upstream positions that may result from functional compensation of weaker interactions of U2AF with 3′TAG by stronger interactions with PPT uridines around position −11 and with more optimal BPS; (viii) the extreme rarity of AGs between positions −6 and −15 in authentic 3′ss (75,84) was violated in aberrant 3′ss, particularly 5–9 nt upstream of new intron/exon junctions; (ix) although uridines were generally under-represented upstream of aberrant 3′ss, they maintained their high numbers at position −11 and flanking nt for predicted interaction with U2AF65 or other PPT-binding proteins; (x) in this region, aberrant 3′ss had higher T-to-C and TT-to-CC ratios, required a complete lack of AGs, but tolerated more guanosines and UG/GU dinucleotides than authentic sites. 
Finally, the development and maintenance of DBASS3 will facilitate prediction of cryptic or de novo 3′ss in mutated disease genes, identification of introns or exons that are frequently involved in aberrant splicing, structural dissection of interactions leading to selection of 3′ss in vivo, and refinement of computational methods that estimate the splice site strength.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.