Uwe Ohler, Noam Shomron, Christopher B Burge.
Abstract
The split structure of most mammalian protein-coding genes allows for the potential to produce multiple different mRNA and protein isoforms from a single gene locus through the process of alternative splicing (AS). We propose a computational approach called UNCOVER based on a pair hidden Markov model to discover conserved coding exonic sequences subject to AS that have so far gone undetected. Applying UNCOVER to orthologous introns of known human and mouse genes predicts skipped exons or retained introns present in both species, while discriminating them from conserved noncoding sequences. The accuracy of the model is evaluated on a curated set of genes with known conserved AS events. The prediction of skipped exons in the approximately 1% of the human genome represented by the ENCODE regions leads to more than 50 new exon candidates. Five novel predicted AS exons were validated by RT-PCR and sequencing analysis of 15 introns with strong UNCOVER predictions and lacking EST evidence. These results imply that a considerable number of conserved exonic sequences and associated isoforms are still completely missing from the current annotation of known genes. UNCOVER also identifies a small number of candidates for conserved intron retention.
Year: 2005 PMID: 16110330 PMCID: PMC1185642 DOI: 10.1371/journal.pcbi.0010015
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Structure of the UNCOVER pHMM
The model is used to globally align a pair of orthologous human/mouse introns and detect conserved coding AS events.
(A) A schematic overview of the model architecture, with circles indicating groups of functionally related states. For accurate splicing, the two ends of an intron must be precisely determined by the splicing machinery. The prominent sites for this process are the 5′ splice site (5′ss) at the junction between the upstream exon and the intron and the 3′ splice site (3′ss) at the junction of the intron and the downstream exon. As reference, pictograms of the mammalian 5′ splice site and 3′ splice site are depicted, in which the letters at individual positions are scaled according to their frequency. We restrict ourselves to U2-type splice sites with perfectly conserved GT–AG dinucleotides. The alignment always starts with the conserved 5′ splice site after the initial GT dinucleotide. The transitions of the model then allow it to pursue several paths, corresponding to different types of AS, indicated by small icons. (1) The “default” is to observe conserved or nonconserved noncoding sequence, possibly alternating between these two. (2) Transitions to an ACE sequence of conserved {3′ splice site, skipped exon, 5′ splice site} are possible at any time, and can also occur more than once. (3) An IRE is modeled by going from a 5′ splice site to a 3′ splice site by only passing through a coding submodel. (4) and (5) An early exit from this codon model through another 5′ splice site leads to an alternative 5′ exon at the beginning of the sequence, or correspondingly to an alternative 3′ exon at the end. The alignment is fixed on the right side by the 3′ splice site at the end of the intron. All splice site states are first-order pair states not allowing for insertions or deletions. The 5′ splice site part of the model covers 9 nt (3 nt in the exon, plus the conserved GT and the following 4 nt in the intron); the 3′ splice site is 23 nt long (18 nt and the conserved AG in the intron plus 3 nt in the exon).
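The splice site window sizes given in (A) can be made concrete with a minimal sketch. It only illustrates the stated geometry (9 nt for the 5′ splice site, 23 nt for the 3′ splice site) and the GT–AG restriction; the function name `splice_site_windows` and its arguments are hypothetical and are not part of UNCOVER.

```python
def splice_site_windows(upstream_exon, intron, downstream_exon):
    """Return the (5' splice site, 3' splice site) windows around one intron.

    5' window: last 3 nt of the upstream exon + GT + next 4 intron nt = 9 nt.
    3' window: 18 intron nt + AG + first 3 nt of the downstream exon = 23 nt.
    Hypothetical helper for illustration; sequences are plain DNA strings.
    """
    upstream_exon = upstream_exon.upper()
    intron = intron.upper()
    downstream_exon = downstream_exon.upper()

    # 6 intron nt are used on the 5' side and 20 on the 3' side.
    if len(upstream_exon) < 3 or len(downstream_exon) < 3 or len(intron) < 26:
        raise ValueError("sequences too short for both splice site windows")

    # Restrict to U2-type introns with GT...AG boundaries, as in the caption.
    if not (intron.startswith("GT") and intron.endswith("AG")):
        raise ValueError("not a GT-AG intron")

    five_ss = upstream_exon[-3:] + intron[:6]
    three_ss = intron[-20:] + downstream_exon[:3]
    return five_ss, three_ss
```

For an upstream exon ending in `AAG` and an intron starting with `GTAAGT`, the 5′ window is `AAGGTAAGT`. Applying the same function to the human and the mouse intron yields the pair of windows that the first-order, gap-free splice site states would align.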
(B and C) A detailed view of the noncoding intronic submodel (B) and a close-up of the coding submodel (C), with closed circles representing pair states and dashed circles representing single states. Thick straight arrows indicate the allowed start and end states of the submodels. The noncoding conservation (B) is modeled by a first-order pair state, allowing insertions and deletions of individual nucleotides. The null model contains single first-order states representing nonconserved human and mouse intronic sequences. The coding states (C) comprise three second-order pair states for nucleotides in the three codon positions, as well as three second-order single states each for human and mouse to capture species-specific codon insertion/deletion events. The transition matrix ensures that only those insertion/deletion events covering complete codons are admissible.
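The codon-level constraint described for the coding states in (C) can be illustrated with a simple check on a finished alignment. This is a sketch under stated assumptions and not the model's transition matrix: it assumes a gapped human/mouse alignment of a coding segment that starts at codon position 1 and ends at codon position 3, with `-` as the gap character. The function name `codon_indels_only` is hypothetical.

```python
def codon_indels_only(human, mouse):
    """Check that all gaps in a coding alignment cover complete codons.

    `human` and `mouse` are equal-length aligned strings using '-' for gaps.
    Assumes the segment starts at codon position 1 and ends at position 3,
    so alignment columns can be read in consecutive groups of three.
    """
    if len(human) != len(mouse) or len(human) % 3 != 0:
        return False

    for i in range(0, len(human), 3):
        h_gaps = human[i:i + 3].count("-")
        m_gaps = mouse[i:i + 3].count("-")
        # Each codon column group is either fully present or fully gapped.
        if h_gaps not in (0, 3) or m_gaps not in (0, 3):
            return False
        # A group that is gapped in both species aligns nothing.
        if h_gaps == 3 and m_gaps == 3:
            return False
    return True
```

An alignment such as `ATGGCC---TAA` against `ATGGCCAAATAA` passes, since the single gap removes one whole codon, whereas a one-nucleotide gap inside a codon fails.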
Figure 2. Experimental Validation of UNCOVER Predictions
(A) RT-PCR validation of newly identified alternative exons with no prior EST evidence. Lane numbers are given in Arabic numerals below the gel; sample numbers of new verifications and negative controls are in Roman numerals above. Lanes 2–5 were verified using flanking primers and therefore show two bands each, the larger one corresponding to the event including the newly identified ACE. Lanes 6–9 used a primer internal to the newly identified exon and therefore only show one band each. Lanes 10–13 are typical examples of ten randomly selected introns in the ENCODE target regions that were not predicted to harbor AS events. Lane 14 shows a blank reaction control without adding template. Lanes 1 and 15 contain size markers spaced at 100 nt intervals, with the strong bands corresponding to 1,000 and 500 nt. Ensembl ID pairs for the known exon upstream of the validated new one and the corresponding gene are as follows: internal exons, lanes 2–6: ENSE00000881911.1:ENSG00000004866.5, ENSE00000862512.1:ENSG00000126217.3, ENSE00001201432.1:ENSG00000168781.5, ENSE00001146476.1:ENSG00000168781.5, and ENSE00001084095.4:ENSG00000164402.2; terminal exons, lanes 7–9: ENSE00001379673.1:ENSG00000159140.5, ENSE00001046164.1:ENSG00000067369.1, and ENSE00000952769.2:ENSG00000142183.3 (a known case as positive control); random negative controls, lanes 10–13: ENSE00001321652.4:ENSG00000161980.2, ENSE00000868377.2:ENSG00000102125.4, ENSE00001239587.1:ENSG00000100220.2, and ENSE00001307891.1:ENSG00000185721.1.
(B) Example UNCOVER alignment of a newly detected ACE. Aligned nucleotides are connected with a vertical dash in case of identity, a colon in case of a transition, and a dot in case of a transversion. The alignment is labeled with the types of the states that lead to the most likely alignment: C, conserved noncoding sequence; F, 5′ splice site; I, nonconserved intronic sequence; T, 3′ splice site; 1, 2, and 3, coding sequence, with the number giving the position in a codon. The detected ACE is flanked by highly conserved noncoding sequence, a characteristic of true ACEs. The sequence shown corresponds to the event in sample i in (A).
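The symbols connecting aligned nucleotides in (B) follow the usual purine/pyrimidine distinction: a transition swaps A with G or C with T, and every other substitution is a transversion. A minimal sketch of how such a connector line could be produced is given below. The function names are hypothetical, and leaving a blank for gaps and for characters other than A, C, G, and T is an assumption, not something the caption specifies.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def match_symbol(a, b):
    """Connector for one alignment column: '|' identity, ':' transition,
    '.' transversion, ' ' for gaps or non-ACGT characters (assumption)."""
    a, b = a.upper(), b.upper()
    if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
        return " "
    if a == b:
        return "|"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return ":"
    return "."


def match_line(human, mouse):
    """Build the connector line for two equal-length aligned sequences."""
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    return "".join(match_symbol(a, b) for a, b in zip(human, mouse))
```

Printing the human row, the result of `match_line`, and the mouse row on three consecutive lines gives a layout of the kind described in the caption.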
Prediction Results on a Known Set of 241 Conserved Skipped Exons
We compare results obtained by BLASTN analysis with those of the UNCOVER approach. Predicted regions overlapping with a known ACE in the human sequence are counted as true positives, and the fractions are given for which the locations of the 5′ splice site, the 3′ splice site, or both are correct. Sensitivity is calculated as the number of true positives divided by the total number of known exons, and specificity as the number of true positives divided by the total number of predictions. We also show the total number of nucleotides spanned by all predictions, and the number of nucleotides overlapping the known ACEs.
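The evaluation measures defined above can be sketched as follows. Note that specificity here is true positives divided by the number of predictions, exactly as stated in the text. The sketch assumes half-open `(start, end)` coordinates on the forward strand, so that the 3′ splice site of a skipped exon lies at its start and the 5′ splice site at its end. Reporting the splice site fractions relative to the true positives is also an assumption. The names `overlaps` and `evaluate` are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(predictions, known_exons):
    """Score predicted regions against known exons (forward strand assumed)."""
    true_pos = ss3_ok = ss5_ok = both_ok = 0
    for p in predictions:
        hits = [k for k in known_exons if overlaps(p, k)]
        if not hits:
            continue
        true_pos += 1
        # Exon start = 3' splice site, exon end = 5' splice site.
        ss3_ok += any(p[0] == k[0] for k in hits)
        ss5_ok += any(p[1] == k[1] for k in hits)
        both_ok += any(p[0] == k[0] and p[1] == k[1] for k in hits)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "true_positives": true_pos,
        "sensitivity": ratio(true_pos, len(known_exons)),
        "specificity": ratio(true_pos, len(predictions)),
        "frac_3ss_correct": ratio(ss3_ok, true_pos),
        "frac_5ss_correct": ratio(ss5_ok, true_pos),
        "frac_both_correct": ratio(both_ok, true_pos),
    }
```

With three known exons and three predictions of which two overlap a known exon, both sensitivity and specificity evaluate to 2/3.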
Predicted Conserved Coding IREs and Their Evidence
For each candidate, the table shows Ensembl IDs of gene, transcript, and upstream exon; the HUGO gene ID (if available); whether the predicted retained intron is covered by spliced ESTs; whether it is predicted to continue in-frame with upstream exon; whether it has a size that is a multiple of three; and whether a protein domain detected by InterPro [40] spans across it. For four genes, the frame of the upstream exon could not be uniquely determined from the human and mouse annotations.