Literature DB >> 12093376

Full-length messenger RNA sequences greatly improve genome annotation.

Brian J Haas1, Natalia Volfovsky, Christopher D Town, Maxim Troukhan, Nickolai Alexandrov, Kenneth A Feldmann, Richard B Flavell, Owen White, Steven L Salzberg.   

Abstract

BACKGROUND: Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism.
RESULTS: Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation.
CONCLUSIONS: Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.

Entities:  

Mesh:

Substances:

Year:  2002        PMID: 12093376      PMCID: PMC116726          DOI: 10.1186/gb-2002-3-6-research0029

Source DB:  PubMed          Journal:  Genome Biol        ISSN: 1474-7596            Impact factor:   13.583


Background

The scientific community has recently witnessed the publication of several large eukaryotic genomes in various stages of completion, including the human genome [1,2], the nematode Caenorhabditis elegans [3], the fruit fly Drosophila melanogaster [4], and the model plant Arabidopsis thaliana [5,6]. Each of these genomes contains over 10,000 genes, and as scientists attempt to study these genes more closely, the need for accurate gene models becomes increasingly apparent. For high-throughput genome sequencing projects, equally rapid high-throughput genome annotation is necessary, and bioinformaticists use a variety of computational methods to generate this annotation. Despite many improvements in recent years, these computational methods still fall short of producing correct models for every gene. In order to improve the annotation and facilitate further research, it is essential that we develop methods to identify genes correctly. Annotation of a gene model should include a precise description of where the genomic DNA sequence is transcribed into messenger RNA, the positions in the mRNA of any and all introns, and the translated protein sequence of the gene. If alternative splice variants are present, these should be annotated as well. Computational methods for genome annotation have several shortcomings that result in the following errors in annotation. Gene prediction programs predict exon boundaries correctly only about 80% of the time, even for the most intensively studied organisms [7,8,9]. Thus a gene with five exons will be completely correct only 0.85 = 33% of the time, and genes with more exons will be even less likely to be correct. Gene prediction programs also tend to merge and split gene models. Thus two real genes may be merged into one predicted transcript, or vice versa. In addition, programs to align genomic DNA to protein sequences often miss small exons, especially when the homologous proteins are not well conserved. Annotation protocols also tend to miss short genes. For example, recent work has shown that at least one large family of Arabidopsis genes encodes a short (80-120 amino acid) protein similar to a secreted polypeptide ligand for a receptor-like kinase that functions in meristems [10]. Most of these were missed in the original, automated annotation of the Arabidopsis genome. Alignment programs also make mistakes when genes occur in tandemly repeated copies. Finally, alignment of protein sequence to genomic DNA cannot predict untranslated regions (UTRs), and the leading ab initio gene prediction programs (Genscan [11], GlimmerM [12], Genemark.HMM [13]) have great difficulty predicting UTRs; most of them predict only the coding portion of a transcript. The solution to many of these problems is to identify the complete sequence of the transcribed portions of the genome. Sequencing the mature transcripts (spliced mRNA) solves three major problems: first, it permits accurate identification of the 5' and 3' UTRs. Second, in conjunction with complete genomic sequence, it enables alignment software to identify the precise locations of all introns. Third, it aids in the discovery of new genes.

Results and discussion

We used sequences from 5,000 full-length transcripts sequenced by Ceres, Inc. and released to the public in March 2001 ([14], and at GenBank as accession numbers AY084215-AY089214). We merged this data with a small set of full-length transcripts created in a pilot study, yielding 5,016 complementary DNA (cDNA) sequences in total. As described in detail below, after comparing the cDNA alignments to the previous annotation, we found that 62% of the matching gene models correctly predicted the exon-intron structure for the gene, 33% needed to be modified, and 5% represented previously undiscovered genes. We mapped and aligned the cDNA sequences to the five chromosomes of A. thaliana. This step is not entirely straightforward, in part because several different programs are available for aligning cDNA to genomic sequence (see Materials and methods). Although the programs are similar, they are not identical, and we conducted a detailed comparison in order to determine which one would produce the best results for these data. Note that the conclusions from this comparison might change for alignments between cDNAs and genomic sequences from different species - for example, alignments between human cDNAs and the mouse genome - where the genome and the cDNAs match less closely than in this study. Because the cDNA sequences are derived from two A. thaliana ecotypes (Wassilewskija and Landsberg erecta; see Materials and methods) that differ from the genome's ecotype (Columbia), we expected to find some polymorphisms when we compared the cDNAs to the genome; on average, the three ecotypes are more than 99% identical. The four programs used for the alignment comparison were sim4 [15], dds/gap2 [16], est_genome [17] and GeneSeqer [18]. Three of these are general-purpose alignment algorithms designed to map expressed sequence tag (EST) or cDNA sequences to a genome. One program, GeneSeqer, is more specialized, in that it has particular subroutines designed to recognize splice sites in A. thaliana, and it is able to find AT-AC introns. The other three programs require only that an intron begin with the dinucleotide GT and end with AG. Note that none of these programs is tuned to find AT-AC introns [19] or any other introns with non-consensus splice sites. The first question we asked in the comparison was how often the four programs produced identical alignments between the cDNA sequences and the genome. Because the programs differed in how they treated the first and last nucleotides of the cDNA, we ignored those nucleotides in deciding if two alignments were identical. The number of identical alignments is shown in Table 1. The table shows that the two programs in greatest agreement with each another were gap2 and GeneSeqer, which agreed on 4,839 out of 5,016 alignments. The total number of cDNAs for which all four programs agreed was 4,124. This initial comparison does not provide a true picture of the extent of agreement, though, because most of the differences were in the alignment at the 5' or 3' ends of the cDNAs.
Table 1

Number of identical cDNA alignments produced by four different programs on a set of 5,016 cDNA sequences

Programgap2est_genomeGeneSeqer
sim44,8194,3424,784
gap24,2744,839
est_genome4,257
To determine how well the programs agreed at the ends of the alignments, we compared the lengths of the alignments. Because the cDNAs derive from a different ecotype of Arabidopsis than the genomic sequence, there are a few polymorphisms in the UTRs, but even these regions are usually > 98% identical. Differences between ecotypes sometimes interrupt an open reading frame (ORF). The result is that a gene in one ecotype may be a pseudogene, or a severely truncated gene, in the other [20]. We addressed this problem by first aligning cDNAs to the genome, and then always using the genomic sequence to create the predicted protein. Therefore the alignments should span the entire length of the cDNA. Our analysis indicates that gap2 and sim4 performed the best at matching the entire cDNA to the genome, as shown in Table 2.
Table 2

Comparison of lengths of cDNA alignments for the 5,016 cDNA sequences

gap2est_genomeGeneSeqer
sim432/346/65231/67
gap22/7083/50
est_genome706/708706/733

Entries show the number of sequences for which the program listed along the top produced a longer alignment than the program listed on the left; for example, the entry 32/34 indicates that gap2 produced a longer alignment in 32 cases out of the total 34 for which sim4 and gap2 had alignments of different lengths.

Two immediate observations from Table 2 are that gap2 produced longer alignments than any of the other programs, and that est_genome produced shorter alignments than all the others. Sim4 and gap2 are closest to each other, disagreeing on alignment length in only 34 cases, with gap2 generating the longer alignment on 32 of those. GeneSeqer has a few more differences than gap2, but is clearly much closer to gap2 than est_genome. Even though est_genome often produces shorter alignments, the unaligned sequence is almost always restricted to a small number of terminal nucleotides. A more critical question is how many times the programs agreed on the precise locations of all the splice sites for a particular transcript. This question is answered for all pairwise comparisons between the programs in Table 3. Overall, there were 4,918 cDNAs that yielded identical splice sites (and therefore identical introns) regardless of which program was used to generate the alignment. As Table 3 shows, the programs disagreed on fewer than 100 alignments, less then 2% of the total. This still leaves open the question of whether one program is clearly superior on these problematic alignments. We therefore evaluated the 98 cDNAs (5,016 - 4,918) for which there was disagreement among the programs to determine the cause of the differences. In 64 of these cases, three programs agree and only one disagrees, while in the remaining 32 cases, there was no alignment shared by a majority of the programs. We looked individually at all 64 cases in which a majority agreed to evaluate whether or not a 'majority wins' rule would always produce the correct result, and came to the following conclusions.
Table 3

Number of cDNA alignments, out of 5,016 total, for which all splice sites are identical

Programgap2est_genomeGeneSeqer
sim44,9464,9654,960
gap24,9544,955
est_genome4,961
Originally, there were 140 cDNAs for which the programs did not all agree on the locations of the introns. Of these, 106 were cases in which only one program disagreed with the three others. These differences were communicated to the authors of the GeneSeqer program (V. Brendel) and the sim4 program (L. Florea). Both authors identified bugs in their systems, fixed the bugs, and issued new releases, which were then re-run on all the data used in this study. The result was that these two programs are in much closer agreement with gap2 and with each other than previously. Gap2 disagreed with the other three programs in 24 alignments. The most common reason for the disagreement was an erroneous alignment to non-consensus splice sites (other than GT and AG). Gap2's alignment appeared to be incorrect for all 24 cases. Sim4 disagreed with the other three programs 16 times. Like gap2, in most cases these were due to erroneous non-consensus splice sites. However, in one case, sim4 seems to have found a correct alignment missed by the other three programs. GeneSeqer disagreed with the majority 14 times, sometimes as the result of a tendency to create additional short exons. On the other hand, GeneSeqer is excellent at identifying potential short exons, especially at the termini: in four cases out of 14, GeneSeqer's alignment contained an additional exon that results in a greater percent identity in the overall alignment to the genome. Est_genome disagreed with the other three programs on 10 alignments; in all 10 cases the majority was correct. The mistake in eight of these alignments was that a short exon was missed. Overall, it does indeed matter which program is used to align cDNA sequences to genome sequences. Three of the four programs do an excellent job of extending the alignment to cover the full length of the cDNA; only est_genome consistently failed to extend the alignments. When the programs disagree, this sometimes indicates that an exon was missed by one or more programs, and manual inspection is necessary to determine the best alignment. Finally, there is a substantial difference in computational speed. Sim4 is more than 200 times faster than any of the other programs, which can have a significant impact in efforts such as this to align large numbers of sequences. For the searches in this study, the average computation time per cDNA (on an 850 MHz Pentium III computer) for sim4 was 0.026 second; for gap2, 6.4 seconds; for est_genome, 9.2 seconds; and for Gene-Seqer, 12.2 seconds. GeneSeqer's speed per search increases dramatically, approximating the speed of sim4, if the cDNAs are submitted as a batch rather than one at a time. In addition, memory requirements make it impossible to run some programs on a standard desktop (dds/gap2 runs out of memory if one attempts to align a cDNA to a whole chromosome, for example).

Re-annotation of the Arabidopsis genome

The alignments generated from the cDNA sequences were used to create new gene models for the corresponding genes in the A. thaliana genome. Many of the genes have been manually curated, but many others were created by automated scripts [5,6]. Manual curation is still ongoing. We used the cDNA alignments to create new gene models automatically according to the following criteria. As described above, there were 4,918 cDNAs for which all alignment programs agreed on the positions of all introns. Using a majority voting scheme for the remaining 98 cDNAs did not always give a correct answer, as discussed above, therefore we used these only after manual inspection. We assume the protein-coding region is the longest ORF on the forward strand, and required it to span at least 40% of the cDNA length. This allowed us to create 4,809 gene models automatically, leaving 109 cDNAs that were inspected manually to determine if they represent RNA genes, pseudogenes or other types of sequence. In one case, cDNA Ceres: 104289, the protein-coding region was actually on the opposite strand, corresponding to expressed protein At2g23670, and Ceres: 20125 matched the correct strand, supporting the gene annotation. (This could be explained in several ways: as an example of antisense-mediated translational control, as two separate proteins on opposite strands, perhaps expressed in different cell types, or simply erroneous data.) In most of the other cases, the problematic cDNA is either an RNA gene or a likely pseudogene. Using the alignments from the 4,809 gene models, we updated the annotation of the genome, and evaluated how this had changed the previous annotation. For the vast majority of genes, 5' and 3' UTRs had not been annotated previously, and these were added with the incorporaton of the cDNA data. More interesting is how the protein-coding regions changed. Of the gene models, 2,978 contained identical protein-coding regions to what had already been annotated and required only UTR refinements, but 1,591 were adjusted, yielding more accurate protein sequences. Some of these contain very short 'micro-exons' that are usually missed by ab initio gene prediction programs. Perhaps most significant was the addition of 240 completely novel genes not previously included in the Arabidopsis genome annotation. Of the 240 novel genes, 92 have significant homology to known proteins, and the rest do not match any previously described proteins. In summary, we found that 62% of the matching gene models further validated the existing exon-intron structure for the gene, 33% needed to be corrected, and 5% represented previously undiscovered genes.

Micro-exons

We also used the cDNA alignments to detect 'micro-exons', very short exons that are typically missed by both gene-finding programs and alignment algorithms. Using new software protocols we developed, we found 47 micro-exons, ranging from 3 to 25 base pairs (bp) in length, distributed evenly across all five chromosomes. To find micro-exons, we analyzed the results of sim4 alignments using all 5,016 Ceres cDNAs. Sim4 identified 36 cDNAs encoding exons of 25 bp or less. In an effort to identify additional micro-exons, sim4 alignments containing imperfect intron-exon boundaries were examined. We selected only those cases with near-perfect alignments, requiring that all but one or two exons have 100% identity. We then checked to see if the 1-2 exons with slightly lower identity were misaligned as the result of the presence of a small, undetected, exon. We used the 5 bp segments at the boundaries of the exon as probes. If these 5 bp probes mismatched in the original alignment, we searched the adjacent intron (that is, the intron identified by the initial alignment) for short exons that would produce a perfect match with the cDNA. We also required that any new exon would generate introns with a standard GT-AG consensus on either end. This procedure therefore yielded valid exon-intron structures that always improved the identity of the alignment between cDNA and genomic DNA. Figure 1 shows an example of the cDNA alignments before and after inserting a micro-exon.
Figure 1

An example showing how a micro-exon improves a cDNA alignment. (a) Alignment showing the boundaries of the fourth and fifth exons from the sim4 alignment of cDNA Ceres:20761 to chromosome 4. (b) The improvement resulting from insertion of a micro-exon; all three exons now align with 100% identity to the cDNA sequence. Intron positions are shown by '>' in the alignment.

Using this method, we were able to identify 11 additional micro-exons, all shorter than 12 bp. An extraordinarily short exon of only 3 bp was identified, corresponding to exon 2 of disease-resistance gene RAR1 (At5g51700). A listing of these micro-exons from all chromosomes is shown in Table 4. Note that in some cases the length of the micro-exons is not a multiple of three; for these, one of the preceding or following exons had its boundary realigned to maintain the reading frame. In comparison to the other alignment programs examined, GeneSeqer proved to be highly competent in identifying micro-exons; 46 of the 47 micro-exons were identified by GeneSeqer using the default settings. After lowering the minimum exon length cut-off to 1 bp, all 47 were identified.
Table 4

Micro-exons from each of the five chromosomes, listed in order of increasing length

LocusGene namecDNA accessionExon numberExon length (nucleotides)Micro-exon sequence
At5g51700RAR1Ceres:996152 of 63AG<GGA>GT
At1g63290D-ribulose-5-phosphate-3-epimeraseCeres:378432 of 85AG<GACGG>GT
At3g01850D-ribulose-5-phosphate-3-epimeraseCeres:23982 of 95AG<GACGG>GT
At4g01610Cysteine proteaseCeres:207615 of 115AG<ATCAG>GT
At5g14030Expressed proteinCeres:16313.5 of 66AG<GCCAAG>GT
At2g38880Putative CCAAT-binding transcription factor subunitCeres:7805.4 of 76AG<TTGGAG>GT
At2g07340Expressed proteinCeres:34060.4 of 67AG<GAAGAAC>GT
At2g41710AP2 domain transcription factorCeres:414623 of 99AG<TTTATCTAG>GT
At2g36190Beta-fructofuranosidaseCeres:1180382 of 69AG<ATCCAAATG>GT
At4g13720Auxin-regulated proteinCeres:83614 of 810AG<GGCCATACAT>GT
At4g29510Arginine methyltransferase (pam1)Ceres:386012 of 911AG<GAATCCATGAA>GT
At3g55260Beta-N-acetylhexosaminidaseCeres:1182867 of 1517AG<GTTTGCCAAAATGAGAG>GT
At1g80380Auxin-regulated proteinCeres:1176983 of 718AG<GTACCTAGGTACAATAAG>GT
At3g55630Tetrahydrofolylpolyglutamate synthaseCeres:230791.6 of 1518AG<GAGAAAACCAGCAATGAG>GT
At5g61530Auxin-regulated proteinCeres:1525575 of 1019AG<GGAGTTGCCAGCTCAGATG>GT
At5g03880Auxin-regulated proteinCeres:376684 of 1219AG<CTGTCCCTTCTGCCGGAAG>GT
*At4g23470Expressed proteinCeres:25694.7 of 819AG<GGTTTGCGCATGTATGCAG>GT
At1g67320Expressed proteinCeres:116252.7 of 1820AG<TTGAAAACATTTACTACAAG>GT
At3g50210Flavonol synthaseCeres:25787.8 of 1220AG<TGGAGCTCACACTGACTATG>GT
At4g37680Expressed proteinCeres: 262351.2 of 521AG<GGTTTGTCTTTCGAAATTCAG>GT
At3g60340Palmitoyl-protein thioesteraseCeres:38539.5 of 1221AG<ACATCAGTTGTTTGTGAGAAG>GT
At3g23600Expressed proteinCeres:11339.2 of 722AG<GTTTTGAAGCTCCAAACTTAAG>GT
At5g46030Expressed proteinCeres:15222.4 of 522AG<CTTTGACTGAAGCTATAGACAT>GT
At4g33925Expressed proteinCeres:24360.2 of 522AG<TAACCGAAGAACAGCTCTCAAT>GT
*At4g23470Expressed proteinCeres:25694.2 of 822AG<ATTGTTGCTTCGCGTTGTGGTG>GT
At3g13860Chaperonin, putativeCeres:38045.2 of 1722AG<CTCGTCTACTTCCAGGAAACTG>GT
At1g73180Expressed proteinCeres:108165.13 of 1423AG<TTACTTGGAATAAGCACAACAGG>GT
At5g09830Expressed proteinCeres:37422.2 of 323AG<GAAGTCATTGACATATCTGGAGG>GT
At2g23930Small nuclear ribonucleoprotein ECeres:4850.2 of 423AG<GTACATGGATAAGAAGCTCCAAA>GT
At1g66940Expressed proteinCeres:110066.3 of 524AG<AATCTAATATTAGATGGATAATAG>GT
At5g51100Expressed proteinCeres:126592.8 of 924AG<CACGCTTACTATCTGGATTTTGAG>GT
At1g05070Expressed proteinCeres:13725.2 of 324AG<AGCTCAGTAATGCTTCTTTTGCTG>GT
At2g32580Expressed proteinCeres:16625.2 of 324AG<GACTCAGCAATGGTTCATTCACTG>GT
At2g29960CyclophilinCeres:19211.4 of 624AG<AAAACTTCAGAGCTTTGTGCACAG>GT
At1g65220Expressed proteinCeres:21223.2 of 824AG<CTCAAAGGAGAAGCCCACTCTCGG>GT
At5g23310Iron superoxide dismutase 3Ceres:26637.7 of 824AG<CACTCTTATTATCTGGACTACAAG>GT
At4g25100Superoxide dismutaseCeres:32935.6 of 724AG<CATGCTTACTACCTTGACTTCCAG>GT
At3g55920Cyclophilin-like proteinCeres:94608.5 of 824AG<AGAACTTTCGGTCACTTTGCACGG>GT
At1g77060Carboxyphosphonoenol-pyruvate mutase, putativeCeres:12293.4 of 625AG<GACCAAGCATGGCCAAAGAAGTGTG>GT
At4g15900PRL1 proteinCeres:123113.2 of 1725AG<CAAGCAGATTCGTCTCAGCCATAAG>GT
At2g47640Putative small nuclear ribonucleoprotein D2Ceres:26123.3 of 625AG<CAAGCCAATGGAAGAGGATACCAAT>GT
At2g41630Transcription factor IIB (TFIIB)Ceres:2657.2 of 725AG<GTTGGGACTTGTTGCAACTATCAAG>GT
At3g62840Small nuclear ribonucleoproteinCeres:32457.3 of 525AG<TAAACCAATGGAAGAGGATACCAAC>GT
At2g21270Putative ubiquitin fusion-degradation proteinCeres:34470.4 of 1025AG<CCACAACTTGAAAGTGGTGACAAGA>GT
At3g10330Transcription initiation factor IIB (TFIIB)Ceres:38950.2 of 725AG<GTTGGGACTTGTTGCGACCATCAAG>GT
At1g42480Expressed proteinCeres:42677.7 of 925AG<ATTGCTGGAGGAAACTGAAGATGAG>GT
At2g23985Expressed proteinCeres:252843.2 of 425AG<TGTCTTGTTCAGGTGAACAAAAAAG>GT

*At4g23470 contains two micro-exons.

One indication that these micro-exons are correct (in addition to the identity with the cDNA) is that many of them are homologous to exons in other Arabidopsis genes. For example, a search of GenBank in late 2001 revealed that the micro-exon of Ceres: 118038 is homologous to exons from five different cDNAs (accession numbers gi: 15028118, gi:6683111, gi:14517549, gi:15027838, and gi:16974574). The consensus sequence of these exons, ATCCTAA(T/C)G, has been previously characterized as a micro-exon in the potato invertase gene [21].

Splicing anomalies

Analysis of cDNA sequences can help to estimate the frequency of alternative splicing in a species. Alternative splicing appears to be relatively common in animals [22,23]; in plants this phenomenon has been less frequently observed, possibly as a result of the smaller collections of ESTs compared with mammalian systems. Recently, some reports have appeared documenting a small number of cases [24,25]. We examined the alignments of cDNAs to the genome, looking for examples where more than one cDNA aligned to overlapping locations on the same chromosome in such a way as to predict a different splicing pattern. The working hypothesis was that if two cDNAs mapped to the same locus, but presented distinct sets of exons, this would constitute evidence of alternative splicing, or possibly another type of splicing anomaly. We broadened the search for splicing anomalies by including in this protocol all the complete cDNAs available from GenBank, including the Institute for Physical and Chemical Research (RIKEN) collection described below. A total of 1,515 Ceres transcripts overlapped another transcript, of which 1,129 overlapped a sequence from the RIKEN set. This protocol identified 158 genes with apparent splicing anomalies, each of which was inspected manually. They fall into many different classes, representing different genetic events, as follows: 27 alignments indicate an alternative 3' acceptor site for an intron; 17 alignments indicate an alternative 5' donor site for an intron; 23 alignments indicate that one or more introns remained unspliced. In some cases more than one intron was unspliced; for example, in one interesting case only one intron was spliced in the RIKEN transcript (gi: 15146259), whereas four introns were spliced from the corresponding Ceres transcript (Ceres:3992, corresponding to gene At2g35520). These unspliced transcripts may arise from nuclear rather than mature cytoplasmic mRNA sequences. Six alignments indicate that an internal exon is missing in one isoform; presumably the adjacent introns are spliced as a single intron containing the exon sequence. Fifty-seven alignments suggest possible alternative transcription initiation sites. For 17 of these transcripts, the putative initiation site was shifted far enough in the 3' direction to move past the first donor site, making it impossible to splice out the first intron, producing an additional 5' exon in one of the transcripts. Many of the other transcripts contained one or more additional 5' exons as a result of alternative initiation sites. Thirteen alignments suggest alternative 3' polyadenylation (poly(A)) sites that affect splicing. The prediction of poly(A) sites can be confounded by misannealing of the oligo(dT) primers used for reverse transcription; for example, the presence of multiple adenines within the 3' UTR can be mistaken for a poly(A) site. Misannealing cannot explain the presence of unspliced intronic sequence found at the terminus of 12 of these 13 transcripts, suggesting that these putative poly(A) sites are genuine and have an impact on splicing. We have found similar evidence for the occurrence of multiple poly(A) sites in RACE-PCR experiments directed at cloning cDNAs from hypothetical genes. Finally, 15 alignments display multiple splicing anomalies, falling into more than one of the categories above. Table 5 lists many of these alternatively spliced genes; the complete list, with graphical and textual alignment data, is available on-line [26] and is also provided as Additional data with this paper online. Figure 2 highlights several interesting examples. In Figure 2a, the alternative 3' splice site on the second intron leads to a shift in the reading frame, producing a different protein sequence. In Figure 2b, alignments of several cDNAs indicate that the last intron is unspliced. Figure 2c shows that different 5' ends lead to differing 5' introns and exons, while not changing the protein sequence in this particular example. Figure 2d shows a centrally located exon that is spliced out along with the surrounding introns. Figure 2e contains three different 5' transcription start sites, three different 3' termination sites, and two unspliced introns in the middle transcript. The unspliced introns occur within exon 2 of GI:14335057, which corresponds to three exons and two introns in both the other transcripts. Note that some of the alternative splicing events occur within the same ecotype.
Table 5

Alternative acceptor and donor splice sites, alternative 5'exons, and exon skipping examples based on cDNA alignments

 LocusGene namecDNA accessions
Alternative acceptor splice sites
 At3g58710WRKY DNA-binding proteinCeres:100465, gi:15991735
 At5g35680Expressed proteinCeres:11304, gi:14596002
 At2g38860Expressed proteinCeres:114031, gi:13122287
 At4g31550DNA-binding proteinCeres:11953, gi:15384214
 At1g22700Expressed proteinCeres:120133, gi:15294175
 At4g30480Expressed proteinCeres:12573, gi:14423435
 At1g52870Expressed proteinCeres:126586, gi:14326544
 At5g41810Expressed proteinCeres:126660, gi:14532565
 At1g63970Expressed proteinCeres:15758, gi:11386014
 At2g33830Auxin-regulated proteinCeres:1711, gi:11127600
 At5g20040IPP transferaseCeres:19250, gi:14279069
 At3g55330Oxygen-evloving complex subunitCeres:21674, Ceres:3747
 At1g60850RNA polymerase subunitCeres:21961, gi:514321
 At1g76405Expressed proteinCeres:23773, gi:13358245
 At1g22630Expressed proteinCeres:37537, gi:15010607
 At1g02500S-adenosylmethionine synthaseCeres:37800, gi:15450420
 At4g20380Zinc finger protein Lsd1Ceres:38456, gi:1872520
 At3g54380Expressed proteinCeres:38778, gi:14423485
 At3g04830Expressed proteinCeres:38917, gi:15293268
 At1g11840Lactoylglutathione lyaseCeres:39107, gi:11094298
 At1g02090COP9 complex subunitCeres:40042, gi:3288822
 At1g79650DNA repair protein RAD23Ceres:40579, gi:14334441
 At2g25625Expressed proteinCeres:465, gi:14334615
 At4g02640Expressed proteinCeres:6568, gi:10954094
 At3g11930Ethylene-responsive proteinCeres:7474, gi:13926249
 At2g20820Expressed proteinCeres:91872, gi:14190456
 At3g09150Expressed proteinCeres:98026, gi:13359272
Alternative donor splice sites
 At1g16460Mercaptopyruvate sulfurtransferaseCeres:111646, gi:6009982
 At5g16540Zinc finger protein 3Ceres:113763, gi:4689375
 At2g41070bZIP family transcription factorCeres:114632, gi:13346156
 At2g36000Expressed proteinCeres:123727, gi:14532493
 At3g21175Expressed proteinCeres:12996, gi:14596058
 At5g61880Expressed proteinCeres:146274, gi:14517497
 At3g14230RAP2 family proteinCeres:158240, gi:15450917
 At3g03890Expressed proteinCeres:18355, gi:14190432
 At1g67700Expressed proteinCeres:19973, gi:15215605
 At3g55630Tetrahydrofolylpolyglutamate synthaseCeres:230791, gi:15292866
 At2g21620Expressed proteinCeres:31655, gi:15320407
 At4g10100Expressed proteinCeres:35962, gi:6635742
 At1g23950Expressed proteinCeres:41387, gi:15146261
 At1g24260Floral homeotic proteinCeres:5055, gi:2345157
 At2g39730Expressed proteinCeres:7114, gi:15450670
 At3g07760Expressed proteinCeres:7246, gi:15451021
 At3g06720Importin alphaCeres:9351, gi:4191743
Alternative 5' exons
 At3g57810Expressed proteinCeres:101256, Ceres:29384
 At3g03780Methionine synthaseCeres:111720, gi:14532771
 At5g59890Actin depolymerizing factor 4Ceres:11691, gi:15215858
 At3g4901060S ribosomal protein L13Ceres:12182, gi:15292840
 At5g52210GTP-binding proteinCeres:16621, gi:1184980
 At1g01100Acidic ribosomal proteinCeres:24367, gi:15293082
 At2g41430ERD15Ceres:31388, gi:13926319
 At3g08580Adenylate translocatorCeres:36818, gi:1433
 At5g05000GTP-binding proteinCeres:6734, gi:1151243
 At3g48880Expressed proteinCeres:99337, gi:15028346
Exon skipping
 At5g54940Translation initiation factorCeres:103464, Ceres:32071
 At2g46800Zinc transporterCeres:207558, gi:4206639
 At5g53860Expressed proteinCeres:22860, gi:15215799
 At5g27840TOPP8 Ser/Thr protein phosphatase type-1Ceres:38656, gi:14596132
 At3g23280Expressed proteinCeres:41648, gi:15010671
 At1g77080Expressed proteinCeres:92459, gi:11545544, gi:11545546, gi:13649968

A full list with illustrations and supporting alignment data is available at [26].

Figure 2

Alternative splice variants discovered by cDNA alignments. Red bars indicate the protein-coding portion of each exon. Black bars indicate noncoding exons and the UTR portions of the initial and terminal exons. Exon boundaries that line up exactly between two or more cDNAs are highlighted in blue. Thin lines connecting the exons represent introns. The genes involved are: (a) auxin-regulated protein, At2g20820, chromosome (chr) 2; (b) SKP1-interacting partner 5 (SKIP5), At3g54480, chr 3; (c) acidic ribosomal protein, At1g01100, chr 1; (d) auxin-regulated protein, At5g53860, chr 5; (e) unknown expressed protein, At2g45740, chr 2.

Neither collection of cDNAs can be considered a random sample of transcripts, and therefore the number of examples of alternative splicing discovered in this data (approximately 10% of the overlapping transcripts) should not be used to extrapolate to the entire genome. The discovery of transcripts with different introns spliced out raises the question of whether the different spliced products are translated and whether the splicing differences reflect programmed developmental variation or simply splicing errors. It is not possible to answer these questions now, but incomplete splicing and consequential variants in plants have been noted previously to be associated with gene silencing and were postulated to reflect the regulated production of aberrant RNA products not destined to be translated [27]. One clear conclusion is that alternative splicing can be discovered via analysis of cDNAs and genomic sequence, and that a fuller collection of cDNAs will provide a valuable resource for more discoveries about splicing and gene regulation.

Are the sequences full-length?

An independent project to sequence complete Arabidopsis cDNAs is ongoing by the SPP consortium [28], using clones created by K. Shinozaki at RIKEN in Japan. These sequences are publicly available from GenBank (search for "RIKEN cDNA Arabidopsis"). These data provided the opportunity to compare the two sets of cDNAs and measure independently how many of them appear to cover the entire length of the predicted mRNA transcript. The sequencing of the RIKEN cDNAs generated 2,996 sequences as of October 2001; we compared these to the 5,016 cDNAs from Ceres and found 1,129 sequences that are contained in both data sets. Of the 1,129 sequences, 941 alignments yield the same exon-intron structure for the underlying gene. We then asked, for each of the sequences containing identical introns, do the 5' and 3' ends match, and if not, how large is the difference? The results are illustrated in Figure 3.
Figure 3

Comparison of the lengths of the 941 cDNAs from the clones that are contained in both the Ceres and RIKEN collections. (a) Comparison of the 5'-end difference between Ceres and RIKEN clones; (b) comparison of the 3'-end difference between Ceres and RIKEN clones. Peak height indicates the percentage of sequences with a length difference as indicated along the horizontal axis. Positive values on the horizontal axis correspond to longer Ceres clones, while negative values correspond to longer RIKEN clones.

Several observations can be made about these results. First, it is important to note that the Ceres clones were selected for full-length sequencing from among a large number of clustered 5' sequences (see Materials and methods), whereas the RIKEN clones were sequenced on the 3' end followed by clustering and selection of a clone for sequencing [29]. The methods for creating the full-length cDNA sequences at both centers involve multiple sequencing runs, followed by assembly of the overlapping sequences. Second, we observed that in the Ceres data, many mRNAs appeared to have two or more putative alternative transcription start sites. This became apparent when different cDNA assemblies were found to overlap exactly except for an extension on the 5' end on one or more clones. It is interesting to note that when the RIKEN clones were longer or shorter on the 5' end, clones of the equivalent length could often be found in the Ceres collection. Multiple clones with the same 5' end provided strong validation that these were truly representative of alternative transcription initiation sites or repeatable artifacts of the cloning process. Overall, there were 397 Ceres clones that were > 10 bp longer on the 5' end, and 136 RIKEN clones that were longer on the 5' end. If alternative transcription initiation is the correct explanation, then it is relatively common. It is worth noting that in almost all cases, both alternative cDNAs contain complete ORFs. On the 3' end, the Ceres and RIKEN databases each contained 316 sequences that were >10 bp longer than their match from the other set. If these represent alternative polyadenylation sites or stabilized ends of RNA that get polyadenylated, then these are quite common. Further investigation will be necessary to determine if the 3' end of transcripts truly varies at such a high frequency. In summary, work described in this study on Arabidopsis illustrates the utility of full-length cDNAs for finding alternative splice variants, short exons, UTRs, short genes and alternative transcription start sites. The annotation of eukaryotic genomes is currently an inexact and developing science, and the results described here demonstrate the power of full-length cDNA sequences for improving the quality of multiple aspects of genome annotation.

Materials and methods

Preparation and sequencing of cDNA

Starting material for cDNA synthesis was polysomal RNA isolated from the top-most inflorescence tissues (ecotype Wassilewskija) and from roots (ecotype Landsberg erecta). RNA from roots of Landsberg erecta was used to construct the libaries because of the availability of high-quality RNA. Nine parts inflorescence to one part root, as measured by wet mass, was used to make three size-fractionated libraries. Because the ecotypes were mixed before library construction, we cannot determine the source ecotype for any individual cDNA. Polysomal RNA was isolated from a detergent-generated supernatant on a 2 M sucrose cushion. To capture full-length cDNAs, an oligonucleotide is first attached to intact 5' ends, taking advantage of the cap. After first- and second-strand synthesis, the full-length cDNAs were selected, size fractionated and cloned into pBluescript. The ligation mixture was transformed into bacteria, selected on appropriate antibiotics and picked into 384-well microtiter plates. In repeated rounds of sequencing, several tens of thousands of clones from the three libraries were sequenced from the 5' end, the sequences clustered, and the clone with the longest 5' end in each cluster selected for complete sequencing. The number of clones sequenced in each round depended on the percentage of new full-length clones that could be obtained from each of the size-fractionated libraries. As the clones reported in this study came from non-normalized libraries, only three rounds of 5' sequencing were employed; 42,000 in the first round, 59,000 in the second round and 22,000 in the final round. Following each round of 5' sequencing, all sequences were clustered using a clustering algorithm that forms separate clusters if there are more than 6 nucleotide differences in any 30-nucleotide window of the match. In this way, clones would not fall into separate clusters simply because of ecotypic differences, different putative transcription start sites or sequencing errors. However, they would cluster separately if alternative splicing occurred in the first approximately 500 nucleotides and involved more than 6 nucleotides. Following clustering, the clone that was longest on the 5' end was selected for full-length sequencing. If clones were of comparable length on the 5' end, the clone to be sequenced was selected from the library with the highest percentage of full-length clones. Sequencing of 5' ends was performed on capillary sequencers (Molecular Dynamics); full-length sequencing was done on ABI377 sequencers using primer walking. The 5,016 clones analyzed in this study were selected from all full-length clones based on length (> 400 nucleotides), non-redundancy (eliminating alternatively spliced clones), and length of the putative ORF relative to overall clone length.

Alignment of cDNA sequences to the A. thaliana genome

Four programs were used to align all 5,016 Ceres cDNA sequences to the A. thaliana genome as follows. First, each program was used to align each cDNA sequence to the genome. Some programs cannot efficiently handle a search comparing a cDNA to a 30 + Mb eukaryotic chromosome, and to compensate for those programs, we created a modified procedure that first used BLASTN to identify and extract a region of 20,000 bp surrounding the gene. Each cDNA was aligned to the corresponding 20 kb genome sequence segment using all four programs with default parameter settings. The resulting alignments were then compared automatically to generate the comparison data appearing in the main text. The programs are sim4, available from [30]; dds/gap2, available from [31]; GeneSeqer, available from [32]; and est_genome, available from [33]. Gene models were constructed by first recreating the cDNA sequence using the Arabidopsis genome sequence, employing the longest alignment for which all programs predicted identical splice sites. The longest ORF was identified along the forward strand of the cDNA followed by a division of the ORF into protein-coding exon segments and untranslated regions of exons. These constructed gene models were then compared to the existing gene annotation at the mapped genomic region. Previously annotated gene structures that disagreed with the cDNAs were replaced by the cDNA alignment-based gene models, and new gene models were created where pre-existing gene annotations were lacking.

Additional data files

Additional data corresponding to anomalous splicing, including png image files and text-formatted multiple alignments, is available.

Additional data file 1

Additional data corresponding to anomalous splicing Click here for additional data file
  25 in total

1.  A large family of genes that share homology with CLAVATA3.

Authors:  J M Cock; S McCormick
Journal:  Plant Physiol       Date:  2001-07       Impact factor: 8.340

Review 2.  Alternative RNA splicing in the nervous system.

Authors:  P J Grabowski; D L Black
Journal:  Prog Neurobiol       Date:  2001-10       Impact factor: 11.685

Review 3.  Finding the genes in genomic DNA.

Authors:  C B Burge; S Karlin
Journal:  Curr Opin Struct Biol       Date:  1998-06       Impact factor: 6.809

4.  A tool for analyzing and annotating genomic sequences.

Authors:  X Huang; M D Adams; H Zhou; A R Kerlavage
Journal:  Genomics       Date:  1997-11-15       Impact factor: 5.736

5.  GeneMark.hmm: new solutions for gene finding.

Authors:  A V Lukashin; M Borodovsky
Journal:  Nucleic Acids Res       Date:  1998-02-15       Impact factor: 16.971

6.  A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT-AC) intron in vitro.

Authors:  W Y Tarn; J A Steitz
Journal:  Cell       Date:  1996-03-08       Impact factor: 41.582

7.  EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA.

Authors:  R Mott
Journal:  Comput Appl Biosci       Date:  1997-08

8.  ETR2 is an ETR1-like gene involved in ethylene signaling in Arabidopsis.

Authors:  H Sakai; J Hua; Q G Chen; C Chang; L J Medrano; A B Bleecker; E M Meyerowitz
Journal:  Proc Natl Acad Sci U S A       Date:  1998-05-12       Impact factor: 11.205

9.  Prediction of complete gene structures in human genomic DNA.

Authors:  C Burge; S Karlin
Journal:  J Mol Biol       Date:  1997-04-25       Impact factor: 5.469

10.  The genome of Caenorhabditis elegans.

Authors:  R Waterston; J Sulston
Journal:  Proc Natl Acad Sci U S A       Date:  1995-11-21       Impact factor: 11.205

View more
  83 in total

1.  Cloning and sequencing of cDNAs for hypothetical genes from chromosome 2 of Arabidopsis.

Authors:  Yong-Li Xiao; Mukesh Malik; Catherine A Whitelaw; Christopher D Town
Journal:  Plant Physiol       Date:  2002-12       Impact factor: 8.340

2.  Genomewide comparative analysis of alternative splicing in plants.

Authors:  Bing-Bing Wang; Volker Brendel
Journal:  Proc Natl Acad Sci U S A       Date:  2006-04-21       Impact factor: 11.205

3.  The Arabidopsis root transcriptome by serial analysis of gene expression. Gene identification using the genome sequence.

Authors:  Cécile Fizames; Stéphane Muños; Céline Cazettes; Philippe Nacry; Jossia Boucherez; Frédéric Gaymard; David Piquemal; Valérie Delorme; Thérèse Commes; Patrick Doumas; Richard Cooke; Jacques Marti; Hervé Sentenac; Alain Gojon
Journal:  Plant Physiol       Date:  2004-01       Impact factor: 8.340

4.  Computational gene prediction using multiple sources of evidence.

Authors:  Jonathan E Allen; Mihaela Pertea; Steven L Salzberg
Journal:  Genome Res       Date:  2004-01       Impact factor: 9.043

5.  GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders.

Authors:  William H Majoros; Mihaela Pertea; Corina Antonescu; Steven L Salzberg
Journal:  Nucleic Acids Res       Date:  2003-07-01       Impact factor: 16.971

6.  GeneSeqer@PlantGDB: Gene structure prediction in plant genomes.

Authors:  Shannon D Schlueter; Qunfeng Dong; Volker Brendel
Journal:  Nucleic Acids Res       Date:  2003-07-01       Impact factor: 16.971

7.  Annotation of the Arabidopsis genome.

Authors:  Jennifer R Wortman; Brian J Haas; Linda I Hannick; Roger K Smith; Rama Maiti; Catherine M Ronning; Agnes P Chan; Chunhui Yu; Mulu Ayele; Catherine A Whitelaw; Owen R White; Christopher D Town
Journal:  Plant Physiol       Date:  2003-06       Impact factor: 8.340

8.  Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping.

Authors:  Wei Zhu; Shannon D Schlueter; Volker Brendel
Journal:  Plant Physiol       Date:  2003-06       Impact factor: 8.340

9.  Agrobacterium T-DNA integration in Arabidopsis is correlated with DNA sequence compositions that occur frequently in gene promoter regions.

Authors:  Richard G Schneeberger; Ke Zhang; Tatiana Tatarinova; Max Troukhan; Shing F Kwok; Josh Drais; Kevin Klinger; Francis Orejudos; Kimberly Macy; Amit Bhakta; James Burns; Gopal Subramanian; Jonathan Donson; Richard Flavell; Kenneth A Feldmann
Journal:  Funct Integr Genomics       Date:  2005-03-03       Impact factor: 3.410

10.  MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics.

Authors:  Heiko Schoof; Rebecca Ernst; Vladimir Nazarov; Lukas Pfeifer; Hans-Werner Mewes; Klaus F X Mayer
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.