Monica Jain, Jeff Shrager, Elizabeth H Harris, Renee Halbrook, Arthur R Grossman, Charles Hauser, Olivier Vallon.
Abstract
Clustering and assembly of expressed sequence tags (ESTs) constitute the basis for most genome-wide descriptions of a transcriptome. This approach is limited by the decline in sequence quality toward the end of each EST, impacting both sequence clustering and assembly. Here, we exploit the available draft genome sequence of the unicellular green alga Chlamydomonas reinhardtii to guide clustering and to correct errors in the ESTs. We have grouped all available EST and cDNA sequences into 12,063 ACEGs (assembly of contiguous ESTs based on genome) and generated 15,857 contigs of average length 934 nt. We predict that roughly 3000 of our contigs represent full-length transcripts. Compared to previous assemblies, ACEGs show extended contig length, increased accuracy and a reduction in redundancy. Because our assembly protocol also uses ESTs with no corresponding genomic sequences, it provides sequence information for genes interrupted by sequence gaps. Detailed analysis of randomly sampled ACEGs reveals several hundred putative cases of alternative splicing, many overlapping transcription units and new genes not identified by gene prediction algorithms. Our protocol, although developed for and tailored to the C. reinhardtii dataset, can be exploited by any eukaryotic genome project for which both a draft genome sequence and ESTs are available.
Year: 2007 PMID: 17355987 PMCID: PMC1874618 DOI: 10.1093/nar/gkm081
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overall algorithm for ACEG assembly.
Source of sequences and remaining numbers after quality-screening and ghost generation
| ESTs | Total (input) | After Lucy | After BLAT |
|---|---|---|---|
| CGP Libraries | 194 920 | 145 686 | 114 809 |
| Kazusa | 50 961 | 50 961 | 48 044 |
| Genbank | 765 | 765 | 698 |
| Purton | 283 | 283 | 262 |
| Private | 43 | 43 | 42 |
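The attrition at each step is easier to compare as a fraction of the input. The following is a minimal sketch that copies the read counts from the table above and computes, per source, the share of reads remaining after Lucy and after BLAT. The names `counts`, `retention` and `totals` are illustrative only and are not part of the published pipeline.

```python
# Read counts copied from the table above: (input, after Lucy, after BLAT).
counts = {
    "CGP Libraries": (194920, 145686, 114809),
    "Kazusa": (50961, 50961, 48044),
    "Genbank": (765, 765, 698),
    "Purton": (283, 283, 262),
    "Private": (43, 43, 42),
}


def retention(table):
    """Return, per source, the fraction of input reads left after each step."""
    return {
        source: (after_lucy / total, after_blat / total)
        for source, (total, after_lucy, after_blat) in table.items()
    }


# Column sums over all sources: (input, after Lucy, after BLAT).
totals = tuple(sum(column) for column in zip(*counts.values()))

if __name__ == "__main__":
    for source, (lucy_frac, blat_frac) in retention(counts).items():
        print(f"{source:15s} Lucy {lucy_frac:6.1%}  BLAT {blat_frac:6.1%}")
    print("Totals:", totals)
```

In this table only the CGP libraries lose reads at the Lucy step; the counts for the other sources are unchanged between the input and Lucy columns.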
Figure 2. Different stages of ACEG generation. An example is given of a hypothetical gene (bracketed on the scaffold line) split by three introns (black bars) and with two possible polyadenylation sites. Its last exon is interrupted by a sequence gap (red) that leads to stretches of N in some of the ghosts (dotted cyan lines). Thin arrows indicate ghost position and orientation; dotted black lines group paired ghosts from the same clone. Assembly starts with ghost g894001H03.y1 and generates two non-overlapping contigs (purple arrows). Because ESTs are introduced at stage 5, chlre3.1.1.2.11 contains the sequence missing in the genome sequence gap.
Figure 3. Number of ACEGs (bars) and average number of reads (open circles) as a function of number of contigs in the ACEG.
Distribution of contigs among categories, based on suffix type. The median contig length is indicated for each category.
| Contig composition | In ACEGs with a single contig (median length) | In ACEGs with several contigs (median length) |
|---|---|---|
| Both 5′ and 3′ reads | 2894 (1338 nt) | 2980 (1372 nt) |
| Only 5′ reads | 4195 (487 nt) | 4010 (692 nt) |
| Only 3′ reads | 1 (715 nt) | 1777 (754 nt) |
Figure 4. Distribution of contig lengths. Data has been placed into bins of 25 units in width. The inset is an enlarged display (with 400 units in width) of the longest contigs.
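The binning described in the Figure 4 caption can be sketched as follows. This is an illustration under assumptions, not the authors' plotting code: the function name `bin_lengths` is hypothetical, the example lengths are made up, and bins are taken to start at 0 and be closed on the left.

```python
from collections import Counter


def bin_lengths(lengths, width=25):
    """Count lengths per fixed-width bin, keyed by the bin's lower edge.

    Assumption: bins are [0, width), [width, 2 * width), and so on.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return Counter((length // width) * width for length in lengths)


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    example = [10, 24, 25, 49, 934]
    for lower_edge, n in sorted(bin_lengths(example).items()):
        print(f"[{lower_edge}, {lower_edge + 25}): {n}")
```

Calling the same function with a larger `width` gives a coarser view, which is one way to summarise a sparse tail of long contigs.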
Figure 5. Comparison of ACEGs and gene models. ACEGs show either a complete match to a single gene model (for all contigs, coverage is above cutoff and identity at least 98%), a partial match (some contigs match the model, but others match nothing), a mixed match to several gene models (some contigs match one model, others match another model), or no match at all. Results are displayed for three minimum coverage levels on the ACEG contigs.
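The four categories in the Figure 5 caption amount to a small decision rule, sketched below. This is a hedged reading of the caption rather than the authors' implementation: the function `classify_aceg`, the `Hit` tuple layout and the cutoff value in the example are all hypothetical, and each contig is assumed to have been reduced to its single best gene-model hit beforehand. The caption does not say how to label an ACEG whose contigs match several models while others match nothing; the sketch calls that case "mixed" by assumption.

```python
from typing import Optional, Sequence, Tuple

# Best hit for one contig: (gene model id, coverage, identity), or None if no hit.
Hit = Optional[Tuple[str, float, float]]


def classify_aceg(hits: Sequence[Hit], min_coverage: float,
                  min_identity: float = 0.98) -> str:
    """Label an ACEG as 'complete', 'partial', 'mixed' or 'none'.

    A contig counts as matching when its coverage is above min_coverage
    and its identity is at least min_identity (98% in the caption).
    """
    models = set()
    unmatched = 0
    for hit in hits:
        if hit is not None and hit[1] > min_coverage and hit[2] >= min_identity:
            models.add(hit[0])
        else:
            unmatched += 1
    if not models:
        return "none"
    if len(models) > 1:
        return "mixed"  # assumption: applies even if some contigs are unmatched
    return "complete" if unmatched == 0 else "partial"


if __name__ == "__main__":
    # Hypothetical hits and a hypothetical coverage cutoff of 0.5.
    print(classify_aceg([("model_a", 0.9, 0.99), None], min_coverage=0.5))
```

Re-running the classification with different `min_coverage` values corresponds to the three coverage levels mentioned in the caption, whose actual values are not given here.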