Mark Stapleton, Joe Carlson, Peter Brokstein, Charles Yu, Mark Champe, Reed George, Hannibal Guarin, Brent Kronmiller, Joanne Pacleb, Soo Park, Ken Wan, Gerald M Rubin, Susan E Celniker.
Abstract
BACKGROUND: A collection of sequenced full-length cDNAs is an important resource both for functional genomics studies and for the determination of the intron-exon structure of genes. Providing this resource to the Drosophila melanogaster research community has been a long-term goal of the Berkeley Drosophila Genome Project. We have previously described the Drosophila Gene Collection (DGC), a set of putative full-length cDNAs that was produced by generating and analyzing over 250,000 expressed sequence tags (ESTs) derived from a variety of tissues and developmental stages.
Year: 2002 PMID: 12537569 PMCID: PMC151182 DOI: 10.1186/gb-2002-3-12-research0080
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Status of DGCr1 and DGCr2 clones
| | DGCr1 | DGCr2 | Total |
| Clones in each release | 5,849 | 5,061 | 10,910 |
| Clones stopped while in progress* | 148 | 739 | 887 |
| Incorrect clone | 0 | 40 | 40 |
| Co-ligated inserts | 13 | 493 | 506 |
| No poly(A) | 9 | 97 | 106 |
| Transposable element (TE) | 11 | 71 | 82 |
| Incomplete coding sequence | 115 | 38 | 153 |
| Candidate clones to be sequenced | 5,701 | 4,322 | 10,023 |
| Submitted to GenBank† | 5,291 | 3,479 | 8,770 |
| Clones in progress | 410 | 843 | 1,253 |
*Quality-control analysis was carried out on clones during the sequencing process. Initial quality-control analysis was carried out for DGCr1 clones before full-length sequencing and for DGCr2 clones during the initial shotgun phase. This difference accounts for the different frequencies of error types observed in the DGCr1 and DGCr2. For example, the DGCr1 3' ESTs were generated before adding the clones to the sequencing pipeline, allowing us to eliminate co-ligated clones and clones without poly(A) tails. Conversely, the DGCr2 has fewer clones with incomplete coding sequences because the DGCr2 clones were selected by aligning ESTs to the annotated genomic sequence, providing a more reliable way of selecting clones with complete ORFs than the inter se clustering of ESTs used to select the DGCr1. Clones were removed from finishing if they: were the incorrect clone as revealed by their 5'-end sequence; consisted of two cDNA molecules ligated into the same plasmid vector, as indicated by their 5'- and 3'-end reads aligning more than 300 kb apart in the genome; did not contain a poly(A) tract at their 3' end; corresponded to a member of the transposable element data set [20]; or did not extend to the ATG start site of the corresponding predicted protein in the Release 2 CDS data set.
†Each clone submitted to GenBank has a contiguous sequence with a phrap-estimated error rate of not more than one error per 50,000 bases. Additionally, each individual base has a phred [32,33] quality score of 25 or higher. An exception to these rules was made for 475 clones from the DGCr1 clone set that were submitted to GenBank before we increased our error rate standard from one in 10,000 to one in 50,000. These clones are undergoing additional sequencing to improve their quality to meet the higher standard.
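The removal criteria in the footnote above can be read as a short decision procedure. The following is a minimal sketch under stated assumptions, not the project's actual pipeline: the `CloneQC` record, its field names, and the order in which the checks are applied are all hypothetical. Only the five criteria and the 300 kb threshold come from the footnote. The helper `phred_error_probability` uses the standard phred definition, Q = -10 log10(P), to show what a quality score of 25 implies per base.

```python
from dataclasses import dataclass
from typing import Optional

# "more than 300 kb apart" is the co-ligation criterion stated in the footnote.
COLIGATION_DISTANCE_BP = 300_000


@dataclass
class CloneQC:
    """Hypothetical per-clone QC record; field names are illustrative only."""
    five_prime_matches_expected: bool   # 5'-end read agrees with the expected clone
    end_read_distance_bp: int           # genomic distance between 5' and 3' end-read alignments
    has_poly_a: bool                    # poly(A) tract present at the 3' end
    matches_transposable_element: bool  # hit against the transposable element data set
    reaches_start_codon: bool           # extends to the ATG of the predicted protein


def removal_reason(clone: CloneQC) -> Optional[str]:
    """Return the Table row a clone would fall under, or None if it passes.

    The order of the checks is an assumption; the footnote lists the
    criteria but does not say how overlapping failures were resolved.
    """
    if not clone.five_prime_matches_expected:
        return "Incorrect clone"
    if clone.end_read_distance_bp > COLIGATION_DISTANCE_BP:
        return "Co-ligated inserts"
    if not clone.has_poly_a:
        return "No poly(A)"
    if clone.matches_transposable_element:
        return "Transposable element (TE)"
    if not clone.reaches_start_codon:
        return "Incomplete coding sequence"
    return None


def phred_error_probability(q: float) -> float:
    """Convert a phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)
```

Under this definition a phred score of 25 corresponds to a per-base error probability of about 0.003, which is a per-base floor and is distinct from the phrap-estimated rate of one error per 50,000 bases quoted for the assembled sequence.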
Finished clone statistics
| | DGCr1 | DGCr2 | Total |
| Number submitted to GenBank | 5,291 | 3,479 | 8,770 |
| Percentage of clones finished without custom primers* | 88% | 88% | 88% |
| Average number of reads/kb for finished clones | 12.9 | 19.4 | 15 |
| Average number of primers to finish* | 3.7 | 2.4 | 3.4 |
| Average insert size of finished clones (kb) | 2.23 | 1.67 | 2.01 |
| Sequence (Mb) | 11.8 | 5.7 | 17.5 |
*Excludes clones <1.4 kb in size.
cDNA analysis
| | DGCr1 | DGCr2 | Total |
| Clones that encode complete ORFs | | | |
| ORFs identical to the Release 3 predicted proteins* | 3,429 | 1,946 | 5,375 |
| ORFs with 1-2% differences to Release 3 proteins† | 235 | 306 | 541 |
| Total | 3,664 | 2,252 | 5,916 |
| Clones known to be compromised‡ | | | |
| Nucleotide discrepancies | 485 | 829 | 1,314 |
| 5' short | 618 | 150 | 768 |
| 3' truncated | 57 | 26 | 83 |
| Co-ligated inserts | 23 | 54 | 77 |
| ORFs with less than 50 amino acids | 49 | 21 | 70 |
| Antisense transcripts | 53 | 58 | 111 |
| Transposable elements | 12 | 9 | 21 |
| Bacterial contaminants | 2 | 4 | 6 |
| Total | 1,299 | 1,151 | 2,450 |
| Clones that may represent alternative transcripts§ | | | |
| 5' short with upstream in-frame stop codon | 32 | 4 | 36 |
| 3' truncated with downstream in-frame stop codon | 55 | 17 | 72 |
| Putative missed micro-exon in Release 3 annotation | 23 | 7 | 30 |
| Total | 110 | 28 | 138 |
| Unclassified clones¶ | 257 | 160 | 417 |
Summary of analysis of the 8,770 clones in GenBank plus 151 clones for which we do not yet have accession numbers.
*The ORF predicted from the cDNA sequence is identical to the corresponding Release 3 predicted protein; 4,620 of these clones are from the LD, GH, HL, LP, RE or RH cDNA libraries, which were made from the same strain that was sequenced. Thus, we required their ORFs to be identical to those of the predicted Release 3 proteins. An additional 755 clones with ORFs identical to Release 3 proteins are from the AT, GM or SD libraries.
†The ORF predicted from the cDNA sequence is the same length as the Release 3 predicted protein with less than 2% amino-acid difference. These clones are derived from the AT, GM or SD cDNA libraries, which were made from strains or cell lines that are not isogenic with the strain that was sequenced.
‡See text for explanation of the individual subclasses of compromised clones.
§These clones have structures that are inconsistent with the corresponding Release 3 predicted gene. The 5'-short and 3'-truncated clones may reflect alternative splice products or promoters, or perhaps more likely, incompletely processed primary transcripts with retained introns. Additional experimental work will be required to distinguish these possibilities. Those clones referred to as putative missed micro-exons in Release 3 annotations are cases in which the cDNA clone contains additional nucleotides that are a multiple of 3, relative to the Release 3 predicted mRNA, and maintains the ORF. We expect that most of these discrepancies result from a failure of Sim4 to align micro-exons and that these cases will be resolved by modifying the Release 3 gene model; see [15] for more discussion.
¶The predicted ORF from the cDNA clone does not match a Release 3 predicted protein, but the underlying cause could not be classified into one of the above categories. We expect that very few of these clones accurately reflect actual gene transcripts.
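The two "complete ORF" categories can be illustrated with a small comparison function. This is only a sketch of the rule as the footnotes state it (identity required for libraries from the sequenced strain; same length and less than 2% amino-acid difference tolerated for the non-isogenic AT, GM and SD libraries). The function name, its arguments and the returned labels are hypothetical, and the real analysis may have compared sequences differently, for example via alignment.

```python
def classify_orf(cdna_orf: str, predicted: str, isogenic_library: bool) -> str:
    """Classify a cDNA-derived ORF against the predicted protein.

    cdna_orf and predicted are amino-acid strings; isogenic_library is True
    for libraries made from the sequenced strain (hypothetical flag).
    """
    if cdna_orf == predicted:
        return "identical"
    # Differences are only tolerated for non-isogenic libraries, and only
    # when the two proteins have the same length.
    if not isogenic_library and len(cdna_orf) == len(predicted):
        mismatches = sum(a != b for a, b in zip(cdna_orf, predicted))
        if mismatches / len(predicted) < 0.02:
            return "within 2%"
    return "other"


if __name__ == "__main__":
    reference = "M" + "A" * 99               # toy 100-residue protein
    one_change = "M" + "A" * 98 + "G"        # 1% amino-acid difference
    print(classify_orf(reference, reference, True))    # identical
    print(classify_orf(one_change, reference, False))  # within 2%
    print(classify_orf(one_change, reference, True))   # other
```

A position-by-position comparison is the simplest reading of "same length with less than 2% amino-acid difference"; clones returning "other" would still need to be sorted into the compromised, alternative-transcript or unclassified groups by separate checks.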
Figure 1. A putative example of RNA editing as revealed by comparison of cDNA and genomic DNA sequences. (a) Gene models for CG18314 based on sequence of two DGCr1 full-length cDNA clones (GH15292.c, GH08370.c) that differ at their 5' and 3' termini. Although the cDNAs have alternative 5' and 3' UTRs and are alternatively spliced, they share the same protein-coding potential (shown in blue). CG18314 encodes a G-protein-coupled receptor of the rhodopsin family, containing a seven-transmembrane protein domain (7tm_1; the red bar shows the extent of the domain) with similarity to β2-adrenergic receptors of mouse (X15643, E value = 9e-23) and human (M15169, E = 8e-22). Shown hatched is a 310-bp portion of cDNA sequences with A-to-G nucleotide variation. (b) Sequence alignments of this 310-bp portion of genomic sequence, two cDNA and three EST sequences (GH14918, GH14553, HL02270). Shown in yellow are codons with A-to-G nucleotide variation. Above the genomic nucleotide sequence is its translated amino-acid sequence starting at amino acid 224 of the protein. Comparing the cDNA nucleotide sequence to the genomic sequence identifies 10 A-to-G nucleotide variations. Two are silent, seven result in amino-acid changes, and one alters the stop codon, allowing two additional amino acids to be encoded. The amino acids that are affected are shown below the nucleotide sequence (red letters in a gray circle). Two of the amino-acid changes (N224S and S229G) map to the conserved seven-transmembrane protein domain. The Anopheles gambiae genomic draft contains sequence encoding this protein (gi|21299606|gb|EAA11751.1| (AAAB01008960) agCP5433) which is highly conserved at the amino-acid sequence level (E = e-168) and also encodes N and S at these sites. To sample additional transcripts of this gene, we performed gene-specific RT-PCR to amplify the region shown in (b).
From a total of 64 independent transcripts we confirmed the 10 cases of editing diagrammed above, and identified 15 new sites of A-to-G nucleotide variations. A list of these putative editing sites showing the resulting amino-acid change and the number of times this change was observed, given in parentheses, is as follows: N224D (2), N224S (12), L225L (9), N227S (1), S229G (9), H230R (1), M231V (1), L236L (16), A239A (1), P246P (2), E254G (1), I272I (1), I275M (1), I281V (1), S286G (1), K306R (16), K308R (5), K308G (8), Q312Q (1), A313A (1), L315L (31), I316V (52), *323W (44) and S324G (4).
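The list of putative editing sites is regular enough to be tabulated programmatically. The sketch below, which is an illustration and not part of the original analysis, parses each entry of the form `N224S (12)` into reference residue, codon position, observed residue and count, then separates silent changes from amino-acid changes. The variable and function names are hypothetical; the data string is copied from the caption.

```python
import re

# Putative editing sites copied from the figure caption; "*" marks the stop codon.
SITES = (
    "N224D (2), N224S (12), L225L (9), N227S (1), S229G (9), H230R (1), "
    "M231V (1), L236L (16), A239A (1), P246P (2), E254G (1), I272I (1), "
    "I275M (1), I281V (1), S286G (1), K306R (16), K308R (5), K308G (8), "
    "Q312Q (1), A313A (1), L315L (31), I316V (52), *323W (44) and S324G (4)"
)

# reference residue, codon position, observed residue, observation count
ENTRY = re.compile(r"([A-Z*])(\d+)([A-Z*]) \((\d+)\)")


def parse_sites(text):
    """Return a list of (ref, position, alt, count) tuples."""
    return [(ref, int(pos), alt, int(n)) for ref, pos, alt, n in ENTRY.findall(text)]


sites = parse_sites(SITES)
silent = [s for s in sites if s[0] == s[2]]      # residue unchanged
changing = [s for s in sites if s[0] != s[2]]    # residue (or stop codon) altered
positions = {pos for _, pos, _, _ in sites}      # distinct codon positions
most_frequent = max(sites, key=lambda s: s[3])

if __name__ == "__main__":
    print(len(sites), "entries at", len(positions), "codon positions")
    print(len(silent), "silent,", len(changing), "amino-acid changing")
    print("most frequently observed:", most_frequent)
```

Note that the entries are amino-acid-level outcomes, so these tallies count listed changes per codon and do not map one-to-one onto the nucleotide sites counted in the caption.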