Yutaka Satou, Katsuhiko Mineta, Michio Ogasawara, Yasunori Sasakura, Eiichi Shoguchi, Keisuke Ueno, Lixy Yamada, Jun Matsumoto, Jessica Wasserscheid, Ken Dewar, Graham B Wiley, Simone L Macmil, Bruce A Roe, Robert W Zeller, Kenneth E M Hastings, Patrick Lemaire, Erika Lindquist, Toshinori Endo, Kohji Hotta, Kazuo Inaba.
Abstract
BACKGROUND: The draft genome sequence of the ascidian Ciona intestinalis, along with associated gene models, has been a valuable research resource. However, recently accumulated expressed sequence tag (EST)/cDNA data have revealed numerous inconsistencies with the gene models due in part to intrinsic limitations in gene prediction programs and in part to the fragmented nature of the assembly.
Year: 2008 PMID: 18854010 PMCID: PMC2760879 DOI: 10.1186/gb-2008-9-10-r152
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
cDNA sequence evidence used in the present study
| ESTs (conventional cDNA clones)* | 1,179,850 |
| 5' EST | 589,329 |
| 3' EST | 590,521 |
| 5'-full-length ESTs | 202,535 |
| Oligo-capping cDNA library-derived ESTs† | 2,079 |
| Spliced-leader mRNA derived ESTs‡ | 199,947 |
| 5'-RACEs from oligo-capping cDNA pool§ | 509 |
| Full insert cDNA sequences¶ | 8,877 |
*A total of 672,390 ESTs had been published previously [1,12]. The remaining ESTs were produced recently, and the high-quality reads among them were deposited in the GenBank database ([GenBank:FF685517-FF836289] and [GenBank:FF848360-FG007279]).
†Described in [2].
‡Pooled data from two sets of SL-based reverse-transcription PCR analyses. One dataset consisted of 19,571 sequences derived from oligo(dT)-primed cDNA of mRNA from pooled embryonic/adult stages and several adult tissues (Y Satou et al., unpublished data). The other consisted of 180,376 SL-containing sequences >30 nucleotides derived from random-hexamer-primed cDNA of mRNA from tailbud embryos (J Matsumoto et al., manuscript in preparation).
§From an oligo-capping 5'-RACE study to determine the 5'-ends of mRNAs encoding transcription factors (Y Satou et al., unpublished data).
¶Full-insert sequences of cDNA clones downloaded from the public database.
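The subtotals in this table can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are made up for this example, and the numbers are copied from the table and its footnotes.

```python
# Counts copied from the cDNA evidence table; variable names are illustrative.
est_5_prime = 589_329
est_3_prime = 590_521
conventional_ests = 1_179_850

oligo_capping_ests = 2_079
sl_derived_ests = 199_947
race_5_prime = 509
full_length_5_prime_ests = 202_535

# Footnote values: previously published ESTs and the two SL-based datasets.
previously_published = 672_390
sl_oligo_dt = 19_571
sl_random_hexamer = 180_376

# 5' and 3' ESTs should add up to the conventional EST total.
conventional_sum = est_5_prime + est_3_prime

# The three 5'-full-length categories should add up to their subtotal.
full_length_sum = oligo_capping_ests + sl_derived_ests + race_5_prime

# The two SL-based datasets should add up to the SL-derived EST count.
sl_sum = sl_oligo_dt + sl_random_hexamer

# ESTs produced after the previously published sets.
newly_produced = conventional_ests - previously_published

print(conventional_sum == conventional_ests)        # True
print(full_length_sum == full_length_5_prime_ests)  # True
print(sl_sum == sl_derived_ests)                    # True
print(newly_produced)                               # 507460
```

The last figure, 507,460, is the number of ESTs implied by the first footnote to have been produced after the previously published sets.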
Figure 1. Concordant identification of linkage between version 1 scaffolds from EST mate pairs and BAC paired-end sequences. (a) Multiple 5'- and 3'-EST mate pairs identified a linkage between version 1 scaffolds 21 and 103. (b) Paired-end sequence data of two independent BAC clones also identified this joined-scaffold linkage. (c) Such linkages, together with FISH data, were used to build a larger scaffold representing chromosome 9. This new scaffold includes 61 version 1 scaffolds. Black and red arrows indicate version 1 scaffolds in leftward and rightward directions, respectively. (d) FISH data are used to orient and place tentative joined scaffolds, which are built by EST mate pairs and paired BAC ends, on chromosomes. Left panel: two-color FISH of GECi23_g02 (green) and GECi42_e12 (red) BAC clones, which are mapped onto the same tentative joined scaffold, determines the orientation of this tentative joined scaffold on chromosome 9. Right panel: similarly, two-color FISH of GECi45_n13 (green) and GECi42_e12 (red) BAC clones, which are mapped onto different tentative joined scaffolds, indicates that these two tentative joined scaffolds are in this order on chromosome 9. White arrowheads indicate the centromere.
Figure 2. Improvement of gene models. (a) Improvement of a gene model for Gli, including the joining of two JGI version 1 scaffolds. 5'-ESTs and 3'-ESTs are shown as yellow and purple boxes and EST pairs are connected by dashed lines. Multiple EST pairs indicate that this locus is artifactually split into two version 1 scaffolds. This Gli gene locus was not precisely predicted in previous studies (exons are indicated by pink boxes and joined by lines). The new gene model (green boxes) precisely coincides with the structure of a cDNA sequence (yellow boxes) and ESTs. (b) The alignment of ESTs and gene models with the genome sequence around the 5'-end of the Gli locus. The 5'-full-length EST shown here has the spliced leader sequence (red letters), which is not aligned with the genome sequence because it is appended to Gli mRNA by trans-splicing. The acceptor dinucleotide for this trans-splicing is shown in red in the genome sequence. Note that only the new model precisely represents the 5'-end of this locus. (c) A gene locus that had not been modeled in previous annotations. Although 5'-ESTs (yellow boxes) and 3'-ESTs (purple boxes) indicate the existence of genes in this region, no previous model sets have included models in this region. Two gene models for this locus were built on the basis of EST evidence.
Figure 3. Operons in the Ciona genome. In the genomic region indicated, 5'-ESTs (yellow boxes) and 3'-ESTs (purple boxes) clearly indicate that there are (a) two and (b) three genes encoded. (Note that the genomic region indicated in (a) is not included in the version 2 genome and there are no version 2 gene models.) Previous models (pink boxes) failed to model these loci precisely and the present study yielded gene models that faithfully reflect cDNA evidence. The lower panel in (a) is a magnification of the region around the intergenic region of this operon and the inset shows corresponding DNA sequences.
Statistics of the KH gene model set
| Predicted gene loci | 15,254 |
| Predicted transcripts | 24,025 |
| Transcripts that putatively encode the full ORF | 20,239 |
| Transcript 5'-ends identified by SL ESTs | 11,797 |
| Transcript 5'-ends identified by non-SL oligocapping ESTs | 818 |
| In-frame stop codons in the 5'-region of the longest ORFs of transcripts not represented by 5'-full-length ESTs | 7,624 |
| Operons | 1,310 |
| Operon genes | 2,909 |
Introns with GT-AG, GC-AG and AT-AC terminal dinucleotides
| Terminal dinucleotides | Number of introns |
| GT-AG | 112,989 |
| GC-AG | 556 |
| AT-AC | 40 |
| Uncategorized* | 294 |
| Total | 113,879 |
*The terminal dinucleotides of these introns contain 'N'.
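The categories in this table can be illustrated with a minimal sketch that classifies an intron by its first and last two nucleotides. This is not the authors' pipeline: the function name `classify_intron`, the example sequences, and the "Other" fallback are assumptions made for illustration. The table itself lists only the three dinucleotide pairs plus introns whose termini contain 'N'.

```python
# Hypothetical helper: classify an intron by its terminal dinucleotides.
# The three named classes follow the table; "Other" is an assumed fallback
# for any pair the table does not list.
KNOWN_PAIRS = {
    ("GT", "AG"): "GT-AG",
    ("GC", "AG"): "GC-AG",
    ("AT", "AC"): "AT-AC",
}


def classify_intron(intron_seq: str) -> str:
    """Return the terminal-dinucleotide class of an intron sequence."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 nucleotides long")
    donor, acceptor = seq[:2], seq[-2:]
    # Termini containing an ambiguous base cannot be assigned to a class.
    if "N" in donor or "N" in acceptor:
        return "Uncategorized"
    return KNOWN_PAIRS.get((donor, acceptor), "Other")


# Counts copied from the table; they should add up to the stated total.
table_counts = {"GT-AG": 112_989, "GC-AG": 556, "AT-AC": 40, "Uncategorized": 294}
table_total = sum(table_counts.values())

print(classify_intron("GTAAGTTTTCAG"))  # GT-AG
print(classify_intron("GNAAAAAG"))      # Uncategorized
print(table_total)                      # 113879
```

The four counts in the table sum to the stated total of 113,879, so every intron in the model set falls into one of the listed categories.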
Numbers of genes per operon
| Number of genes per operon | Number of operons |
| 2 | 1,079 |
| 3 | 185 |
| 4 | 36 |
| 5 | 8 |
| 6 | 2 |
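The operon-size distribution can be checked against the totals reported in the KH gene model statistics (1,310 operons and 2,909 operon genes). The sketch below is only an arithmetic illustration; the variable names are made up for this example.

```python
# Operon-size distribution copied from the table:
# key = number of genes per operon, value = number of operons of that size.
operon_size_counts = {2: 1_079, 3: 185, 4: 36, 5: 8, 6: 2}

# Total number of operons is the sum of the right-hand column.
total_operons = sum(operon_size_counts.values())

# Total number of operon genes weights each operon by its gene count.
total_operon_genes = sum(size * n for size, n in operon_size_counts.items())

# Average number of genes per operon.
mean_genes_per_operon = total_operon_genes / total_operons

print(total_operons)                    # 1310
print(total_operon_genes)               # 2909
print(round(mean_genes_per_operon, 2))  # 2.22
```

Both totals match the statistics table, and the distribution implies an average of about 2.22 genes per operon, with two-gene operons accounting for 1,079 of the 1,310.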
Figure 4. Prevalence of single-exon 5'-most genes in Ciona operons. Ratio of genes containing a given number of exons within non-operonic (blue) and operonic (green) gene populations. Red and black lines indicate the ratio within the 5'-most upstream genes encoded in operons and the downstream operonic genes, respectively. Genes with 11 or more exons are not shown in this graph for simplicity. Note that single-exon genes are more prevalent in operons than in the non-operon (monocistronic) gene population, and are especially prevalent among the 5'-most genes of operons.