Brian J Haas, Natalia Volfovsky, Christopher D Town, Maxim Troukhan, Nickolai Alexandrov, Kenneth A Feldmann, Richard B Flavell, Owen White, Steven L Salzberg.
Abstract
BACKGROUND: Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism.
Year: 2002 PMID: 12093376 PMCID: PMC116726 DOI: 10.1186/gb-2002-3-6-research0029
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Number of identical cDNA alignments produced by four different programs on a set of 5,016 cDNA sequences
| Program | gap2 | est_genome | GeneSeqer |
| sim4 | 4,819 | 4,342 | 4,784 |
| gap2 | | 4,274 | 4,839 |
| est_genome | | | 4,257 |
Comparison of lengths of cDNA alignments for the 5,016 cDNA sequences
| Program | gap2 | est_genome | GeneSeqer |
| sim4 | 32/34 | 6/652 | 31/67 |
| gap2 | | 2/708 | 3/50 |
| est_genome | 706/708 | | 706/733 |
Entries show the number of sequences for which the program listed along the top produced a longer alignment than the program listed on the left; for example, the entry 32/34 indicates that gap2 produced a longer alignment in 32 cases out of the total 34 for which sim4 and gap2 had alignments of different lengths.
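The counting rule described above can be illustrated with a minimal sketch. It assumes that each program's alignment lengths are available as a dictionary keyed by cDNA accession; the function name, data layout, and toy values are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def longer_alignment_counts(lengths):
    """Return {(left, top): (n_top_longer, n_different)} for every ordered pair.

    `lengths` maps program name -> {cDNA id -> alignment length}.
    Only cDNAs aligned by both programs are compared (an assumption).
    """
    table = {}
    for left, top in permutations(lengths, 2):
        shared = lengths[left].keys() & lengths[top].keys()
        # cDNAs whose two alignments differ in length
        differing = [c for c in shared if lengths[left][c] != lengths[top][c]]
        # of those, how many are longer in the program listed along the top
        top_longer = sum(1 for c in differing if lengths[top][c] > lengths[left][c])
        table[(left, top)] = (top_longer, len(differing))
    return table


# Toy data (hypothetical values, for illustration only)
toy = {
    "sim4": {"c1": 100, "c2": 200, "c3": 300},
    "gap2": {"c1": 100, "c2": 205, "c3": 298},
}
counts = longer_alignment_counts(toy)
print(counts[("sim4", "gap2")])  # (1, 2)
```

An entry such as `(1, 2)` corresponds to the table notation 1/2: the program along the top produced the longer alignment in 1 of the 2 cases where the lengths differed.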
Number of cDNA alignments, out of 5,016 total, for which all splice sites are identical
| Program | gap2 | est_genome | GeneSeqer |
| sim4 | 4,946 | 4,965 | 4,960 |
| gap2 | | 4,954 | 4,955 |
| est_genome | | | 4,961 |
Figure 1. An example showing how a micro-exon improves a cDNA alignment. (a) Alignment showing the boundaries of the fourth and fifth exons from the sim4 alignment of cDNA Ceres:20761 to chromosome 4. (b) The improvement resulting from insertion of a micro-exon; all three exons now align with 100% identity to the cDNA sequence. Intron positions are shown by '>' in the alignment.
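One way to picture the micro-exon insertion is as a search of the intron for the unaligned cDNA bases flanked by an AG acceptor and a GT donor. The sketch below is a simplified illustration under that assumption, not the procedure used by the authors; the function name and the toy intron sequence are hypothetical, and only the micro-exon ATCAG is taken from the Ceres:20761 example.

```python
def find_micro_exon(intron_seq, micro_exon):
    """Return 0-based start offsets of `micro_exon` inside `intron_seq`
    where it is immediately preceded by AG and immediately followed by GT."""
    intron_seq = intron_seq.upper()
    pattern = "AG" + micro_exon.upper() + "GT"
    hits = []
    pos = intron_seq.find(pattern)
    while pos != -1:
        hits.append(pos + 2)  # skip the two acceptor bases
        pos = intron_seq.find(pattern, pos + 1)
    return hits


# Toy intron (hypothetical flanking sequence) containing the micro-exon ATCAG
upstream = "GTAAGTTTTCTTTGCAG"
downstream = "GTATGTTTCTCTTTTGCAG"
toy_intron = upstream + "ATCAG" + downstream
hits = find_micro_exon(toy_intron, "ATCAG")
print(hits)
```

Very short micro-exons can match by chance at many intron positions, so a real pipeline would need additional evidence (for example, splice-site scoring) to choose among candidates.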
Micro-exons from each of the five chromosomes, listed in order of increasing length
| Locus | Gene name | cDNA accession | Exon number | Exon length (nucleotides) | Micro-exon sequence |
| At5g51700 | RAR1 | Ceres:99615 | 2 of 6 | 3 | AG<GGA>GT |
| At1g63290 | D-ribulose-5-phosphate-3-epimerase | Ceres:37843 | 2 of 8 | 5 | AG<GACGG>GT |
| At3g01850 | D-ribulose-5-phosphate-3-epimerase | Ceres:2398 | 2 of 9 | 5 | AG<GACGG>GT |
| At4g01610 | Cysteine protease | Ceres:20761 | 5 of 11 | 5 | AG<ATCAG>GT |
| At5g14030 | Expressed protein | Ceres:16313 | 5 of 6 | 6 | AG<GCCAAG>GT |
| At2g38880 | Putative CCAAT-binding transcription factor subunit | Ceres:7805 | 4 of 7 | 6 | AG<TTGGAG>GT |
| At2g07340 | Expressed protein | Ceres:34060 | 4 of 6 | 7 | AG<GAAGAAC>GT |
| At2g41710 | AP2 domain transcription factor | Ceres:41462 | 3 of 9 | 9 | AG<TTTATCTAG>GT |
| At2g36190 | Beta-fructofuranosidase | Ceres:118038 | 2 of 6 | 9 | AG<ATCCAAATG>GT |
| At4g13720 | Auxin-regulated protein | Ceres:8361 | 4 of 8 | 10 | AG<GGCCATACAT>GT |
| At4g29510 | Arginine methyltransferase (pam1) | Ceres:38601 | 2 of 9 | 11 | AG<GAATCCATGAA>GT |
| At3g55260 | Beta-N-acetylhexosaminidase | Ceres:118286 | 7 of 15 | 17 | AG<GTTTGCCAAAATGAGAG>GT |
| At1g80380 | Auxin-regulated protein | Ceres:117698 | 3 of 7 | 18 | AG<GTACCTAGGTACAATAAG>GT |
| At3g55630 | Tetrahydrofolylpolyglutamate synthase | Ceres:230791 | 6 of 15 | 18 | AG<GAGAAAACCAGCAATGAG>GT |
| At5g61530 | Auxin-regulated protein | Ceres:152557 | 5 of 10 | 19 | AG<GGAGTTGCCAGCTCAGATG>GT |
| At5g03880 | Auxin-regulated protein | Ceres:37668 | 4 of 12 | 19 | AG<CTGTCCCTTCTGCCGGAAG>GT |
| *At4g23470 | Expressed protein | Ceres:25694 | 7 of 8 | 19 | AG<GGTTTGCGCATGTATGCAG>GT |
| At1g67320 | Expressed protein | Ceres:116252 | 7 of 18 | 20 | AG<TTGAAAACATTTACTACAAG>GT |
| At3g50210 | Flavonol synthase | Ceres:25787 | 8 of 12 | 20 | AG<TGGAGCTCACACTGACTATG>GT |
| At4g37680 | Expressed protein | Ceres:262351 | 2 of 5 | 21 | AG<GGTTTGTCTTTCGAAATTCAG>GT |
| At3g60340 | Palmitoyl-protein thioesterase | Ceres:38539 | 5 of 12 | 21 | AG<ACATCAGTTGTTTGTGAGAAG>GT |
| At3g23600 | Expressed protein | Ceres:11339 | 2 of 7 | 22 | AG<GTTTTGAAGCTCCAAACTTAAG>GT |
| At5g46030 | Expressed protein | Ceres:15222 | 4 of 5 | 22 | AG<CTTTGACTGAAGCTATAGACAT>GT |
| At4g33925 | Expressed protein | Ceres:24360 | 2 of 5 | 22 | AG<TAACCGAAGAACAGCTCTCAAT>GT |
| *At4g23470 | Expressed protein | Ceres:25694 | 2 of 8 | 22 | AG<ATTGTTGCTTCGCGTTGTGGTG>GT |
| At3g13860 | Chaperonin, putative | Ceres:38045 | 2 of 17 | 22 | AG<CTCGTCTACTTCCAGGAAACTG>GT |
| At1g73180 | Expressed protein | Ceres:108165 | 13 of 14 | 23 | AG<TTACTTGGAATAAGCACAACAGG>GT |
| At5g09830 | Expressed protein | Ceres:37422 | 2 of 3 | 23 | AG<GAAGTCATTGACATATCTGGAGG>GT |
| At2g23930 | Small nuclear ribonucleoprotein E | Ceres:4850 | 2 of 4 | 23 | AG<GTACATGGATAAGAAGCTCCAAA>GT |
| At1g66940 | Expressed protein | Ceres:110066 | 3 of 5 | 24 | AG<AATCTAATATTAGATGGATAATAG>GT |
| At5g51100 | Expressed protein | Ceres:126592 | 8 of 9 | 24 | AG<CACGCTTACTATCTGGATTTTGAG>GT |
| At1g05070 | Expressed protein | Ceres:13725 | 2 of 3 | 24 | AG<AGCTCAGTAATGCTTCTTTTGCTG>GT |
| At2g32580 | Expressed protein | Ceres:16625 | 2 of 3 | 24 | AG<GACTCAGCAATGGTTCATTCACTG>GT |
| At2g29960 | Cyclophilin | Ceres:19211 | 4 of 6 | 24 | AG<AAAACTTCAGAGCTTTGTGCACAG>GT |
| At1g65220 | Expressed protein | Ceres:21223 | 2 of 8 | 24 | AG<CTCAAAGGAGAAGCCCACTCTCGG>GT |
| At5g23310 | Iron superoxide dismutase 3 | Ceres:26637 | 7 of 8 | 24 | AG<CACTCTTATTATCTGGACTACAAG>GT |
| At4g25100 | Superoxide dismutase | Ceres:32935 | 6 of 7 | 24 | AG<CATGCTTACTACCTTGACTTCCAG>GT |
| At3g55920 | Cyclophilin-like protein | Ceres:94608 | 5 of 8 | 24 | AG<AGAACTTTCGGTCACTTTGCACGG>GT |
| At1g77060 | Carboxyphosphonoenol-pyruvate mutase, putative | Ceres:12293 | 4 of 6 | 25 | AG<GACCAAGCATGGCCAAAGAAGTGTG>GT |
| At4g15900 | PRL1 protein | Ceres:123113 | 2 of 17 | 25 | AG<CAAGCAGATTCGTCTCAGCCATAAG>GT |
| At2g47640 | Putative small nuclear ribonucleoprotein D2 | Ceres:26123 | 3 of 6 | 25 | AG<CAAGCCAATGGAAGAGGATACCAAT>GT |
| At2g41630 | Transcription factor IIB (TFIIB) | Ceres:2657 | 2 of 7 | 25 | AG<GTTGGGACTTGTTGCAACTATCAAG>GT |
| At3g62840 | Small nuclear ribonucleoprotein | Ceres:32457 | 3 of 5 | 25 | AG<TAAACCAATGGAAGAGGATACCAAC>GT |
| At2g21270 | Putative ubiquitin fusion-degradation protein | Ceres:34470 | 4 of 10 | 25 | AG<CCACAACTTGAAAGTGGTGACAAGA>GT |
| At3g10330 | Transcription initiation factor IIB (TFIIB) | Ceres:38950 | 2 of 7 | 25 | AG<GTTGGGACTTGTTGCGACCATCAAG>GT |
| At1g42480 | Expressed protein | Ceres:42677 | 7 of 9 | 25 | AG<ATTGCTGGAGGAAACTGAAGATGAG>GT |
| At2g23985 | Expressed protein | Ceres:252843 | 2 of 4 | 25 | AG<TGTCTTGTTCAGGTGAACAAAAAAG>GT |
*At4g23470 contains two micro-exons.
Alternative acceptor and donor splice sites, alternative 5' exons, and exon skipping examples based on cDNA alignments
| Locus | Gene name | cDNA accessions |
| Alternative acceptor splice sites | | |
| At3g58710 | WRKY DNA-binding protein | Ceres:100465, gi:15991735 |
| At5g35680 | Expressed protein | Ceres:11304, gi:14596002 |
| At2g38860 | Expressed protein | Ceres:114031, gi:13122287 |
| At4g31550 | DNA-binding protein | Ceres:11953, gi:15384214 |
| At1g22700 | Expressed protein | Ceres:120133, gi:15294175 |
| At4g30480 | Expressed protein | Ceres:12573, gi:14423435 |
| At1g52870 | Expressed protein | Ceres:126586, gi:14326544 |
| At5g41810 | Expressed protein | Ceres:126660, gi:14532565 |
| At1g63970 | Expressed protein | Ceres:15758, gi:11386014 |
| At2g33830 | Auxin-regulated protein | Ceres:1711, gi:11127600 |
| At5g20040 | IPP transferase | Ceres:19250, gi:14279069 |
| At3g55330 | Oxygen-evolving complex subunit | Ceres:21674, Ceres:3747 |
| At1g60850 | RNA polymerase subunit | Ceres:21961, gi:514321 |
| At1g76405 | Expressed protein | Ceres:23773, gi:13358245 |
| At1g22630 | Expressed protein | Ceres:37537, gi:15010607 |
| At1g02500 | S-adenosylmethionine synthase | Ceres:37800, gi:15450420 |
| At4g20380 | Zinc finger protein Lsd1 | Ceres:38456, gi:1872520 |
| At3g54380 | Expressed protein | Ceres:38778, gi:14423485 |
| At3g04830 | Expressed protein | Ceres:38917, gi:15293268 |
| At1g11840 | Lactoylglutathione lyase | Ceres:39107, gi:11094298 |
| At1g02090 | COP9 complex subunit | Ceres:40042, gi:3288822 |
| At1g79650 | DNA repair protein RAD23 | Ceres:40579, gi:14334441 |
| At2g25625 | Expressed protein | Ceres:465, gi:14334615 |
| At4g02640 | Expressed protein | Ceres:6568, gi:10954094 |
| At3g11930 | Ethylene-responsive protein | Ceres:7474, gi:13926249 |
| At2g20820 | Expressed protein | Ceres:91872, gi:14190456 |
| At3g09150 | Expressed protein | Ceres:98026, gi:13359272 |
| Alternative donor splice sites | | |
| At1g16460 | Mercaptopyruvate sulfurtransferase | Ceres:111646, gi:6009982 |
| At5g16540 | Zinc finger protein 3 | Ceres:113763, gi:4689375 |
| At2g41070 | bZIP family transcription factor | Ceres:114632, gi:13346156 |
| At2g36000 | Expressed protein | Ceres:123727, gi:14532493 |
| At3g21175 | Expressed protein | Ceres:12996, gi:14596058 |
| At5g61880 | Expressed protein | Ceres:146274, gi:14517497 |
| At3g14230 | RAP2 family protein | Ceres:158240, gi:15450917 |
| At3g03890 | Expressed protein | Ceres:18355, gi:14190432 |
| At1g67700 | Expressed protein | Ceres:19973, gi:15215605 |
| At3g55630 | Tetrahydrofolylpolyglutamate synthase | Ceres:230791, gi:15292866 |
| At2g21620 | Expressed protein | Ceres:31655, gi:15320407 |
| At4g10100 | Expressed protein | Ceres:35962, gi:6635742 |
| At1g23950 | Expressed protein | Ceres:41387, gi:15146261 |
| At1g24260 | Floral homeotic protein | Ceres:5055, gi:2345157 |
| At2g39730 | Expressed protein | Ceres:7114, gi:15450670 |
| At3g07760 | Expressed protein | Ceres:7246, gi:15451021 |
| At3g06720 | Importin alpha | Ceres:9351, gi:4191743 |
| Alternative 5' exons | | |
| At3g57810 | Expressed protein | Ceres:101256, Ceres:29384 |
| At3g03780 | Methionine synthase | Ceres:111720, gi:14532771 |
| At5g59890 | Actin depolymerizing factor 4 | Ceres:11691, gi:15215858 |
| At3g49010 | 60S ribosomal protein L13 | Ceres:12182, gi:15292840 |
| At5g52210 | GTP-binding protein | Ceres:16621, gi:1184980 |
| At1g01100 | Acidic ribosomal protein | Ceres:24367, gi:15293082 |
| At2g41430 | ERD15 | Ceres:31388, gi:13926319 |
| At3g08580 | Adenylate translocator | Ceres:36818, gi:1433 |
| At5g05000 | GTP-binding protein | Ceres:6734, gi:1151243 |
| At3g48880 | Expressed protein | Ceres:99337, gi:15028346 |
| Exon skipping | | |
| At5g54940 | Translation initiation factor | Ceres:103464, Ceres:32071 |
| At2g46800 | Zinc transporter | Ceres:207558, gi:4206639 |
| At5g53860 | Expressed protein | Ceres:22860, gi:15215799 |
| At5g27840 | TOPP8 Ser/Thr protein phosphatase type-1 | Ceres:38656, gi:14596132 |
| At3g23280 | Expressed protein | Ceres:41648, gi:15010671 |
| At1g77080 | Expressed protein | Ceres:92459, gi:11545544, gi:11545546, gi:13649968 |
A full list with illustrations and supporting alignment data is available at [26].
Figure 2. Alternative splice variants discovered by cDNA alignments. Red bars indicate the protein-coding portion of each exon. Black bars indicate noncoding exons and the UTR portions of the initial and terminal exons. Exon boundaries that line up exactly between two or more cDNAs are highlighted in blue. Thin lines connecting the exons represent introns. The genes involved are: (a) auxin-regulated protein, At2g20820, chromosome (chr) 2; (b) SKP1-interacting partner 5 (SKIP5), At3g54480, chr 3; (c) acidic ribosomal protein, At1g01100, chr 1; (d) auxin-regulated protein, At5g53860, chr 5; (e) unknown expressed protein, At2g45740, chr 2.
Figure 3. Comparison of the lengths of the 941 cDNAs from the clones that are contained in both the Ceres and RIKEN collections. (a) Comparison of the 5'-end difference between Ceres and RIKEN clones; (b) comparison of the 3'-end difference between Ceres and RIKEN clones. Peak height indicates the percentage of sequences with a length difference as indicated along the horizontal axis. Positive values on the horizontal axis correspond to longer Ceres clones, while negative values correspond to longer RIKEN clones.
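The distribution plotted in such a figure can be sketched as follows, assuming that for each shared clone the number of bases by which the Ceres and RIKEN sequences extend at a given end is already known. The function name, the input representation, and the toy values are hypothetical; the sign convention (positive means the Ceres clone is longer) follows the caption.

```python
from collections import Counter


def length_difference_percentages(ceres_ext, riken_ext):
    """Return {difference: percentage of clones} for paired end extensions.

    A positive difference means the Ceres clone extends further,
    a negative difference means the RIKEN clone extends further.
    """
    diffs = [c - r for c, r in zip(ceres_ext, riken_ext)]
    counts = Counter(diffs)
    total = len(diffs)
    return {d: 100.0 * n / total for d, n in sorted(counts.items())}


# Toy values for four clone pairs (hypothetical, for illustration only)
pct = length_difference_percentages([10, 12, 8, 10], [10, 10, 10, 10])
print(pct)  # {-2: 25.0, 0: 50.0, 2: 25.0}
```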