Robust analysis of 5'-transcript ends (5'-RATE): a novel technique for transcriptome analysis and genome annotation.

Malali Gowda¹, Haumeng Li, Joe Alessi, Feng Chen, Richard Pratt, Guo-Liang Wang.

Abstract

Complicated cloning procedures and the high cost of sequencing have inhibited the wide application of serial analysis of gene expression and massively parallel signature sequencing for genome-wide transcriptome profiling of complex genomes. Here we describe a new method called robust analysis of 5'-transcript ends (5'-RATE) for rapid and cost-effective isolation of long 5' transcript ends (approximately 80 bp). It consists of three major steps including 5'-oligocapping of mRNA, NlaIII tag and ditag generation, and pyrosequencing of NlaIII tags. Complicated steps, such as purification and cloning of concatemers, colony picking and plasmid DNA purification, are eliminated and the conventional Sanger sequencing method is replaced with the newly developed pyrosequencing method. Sequence analysis of a maize 5'-RATE library revealed complex alternative transcription start sites and a 5' poly(A) tail in maize transcripts. Our results demonstrate that 5'-RATE is a simple, fast and cost-effective method for transcriptome analysis and genome annotation of complex genomes.

Year: 2006 PMID: 17012272 PMCID: PMC1636456 DOI: 10.1093/nar/gkl522

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Rapid sequencing of many complex eukaryotic genomes has provided unprecedented opportunities to understand gene function, genome structure and genome evolution. However, accurate annotation of all expressed genes in the sequenced genomes remains one of the most challenging tasks for genome biologists. Although various computer-based gene prediction methods play a role in genome annotation, experimental data provide essential evidence for the determination of gene structure and function. In the last decade, various sequence-based strategies, such as expressed sequence tags (ESTs) (1), full-length (FL) cDNA (2,3), serial analysis of gene expression (SAGE) (4,5) and massively parallel signature sequencing (MPSS), have been developed for transcriptome studies (6,7). These approaches have contributed valuable resources for gene discovery and genome annotation, but their application in most molecular studies has been limited. Generally, EST and FL-cDNA sequencing techniques are neither cost-effective nor deep enough to isolate rare transcripts or address transcript variability. Sequencing millions of cDNA clones from various tissues can only sample ∼60% of the expressed genes (8). To overcome this limitation, high-throughput and short tag-based approaches such as SAGE (4) and MPSS (6) have been developed. SAGE library construction involves several tedious steps before tags can be cloned into a plasmid vector. The process includes isolation of short tags (14–26 bp) from the 3′ or 5′ ends of transcripts, ditag formation, concatenation and sequencing of SAGE clones. The time-consuming procedure of colony picking and storage, and the high cost for sequencing individual clones in SAGE library construction has prohibited use of this approach in many biological studies (5,9). The MPSS strategy involves in vitro cloning of cDNA molecules on the surface of microbeads and non-gel-based sequencing of millions of tags (17–20 bp) (6). 
MPSS library construction can be performed only by experienced technicians at Solexa, Inc. The multiple-location matching of some 17–21 bp tags from SAGE or MPSS libraries in a sequenced genome is problematic when mapping tags to EST or genomic sequences. To obtain accurate matches for tags of interest in the genome, longer transcripts have to be isolated. This is usually accomplished using techniques such as rapid amplification of cDNA ends (RACE) (10) or generation of longer cDNA fragments using the GLGI method (11,12). These individual gene confirmation assays are tedious and expensive, and they are not practical when many positive tags have been identified. The Sanger method of DNA sequencing is expensive and laborious (13,14). Currently, several strategies and platforms are under development, including sequencing by synthesis (SBS), sequencing by hybridization and nanopore sequencing (14). Pyrosequencing is an SBS method that can sequence thousands of DNA fragments in a few hours. The entire genome of a bacterium was sequenced in 4.5 h with high accuracy (13), compared with the several months required by the Sanger procedure (15). The pyrosequencing technique generates high-quality short sequences (∼100 bases), and it has many potentially important applications when combined with tag-based expression profiling methods (14). In this study, we describe a novel approach called robust analysis of 5′-transcript ends (5′-RATE). The method includes three major steps: 5′-oligocapping of mRNA using the FL-cDNA isolation strategy (16,17), NlaIII tag and ditag formation using the RL-SAGE strategy (5), and tag sequencing by pyrosequencing (13). The 5′-RATE method simplifies transcript tag isolation by eliminating the complicated concatemer cloning procedure. This allows quick, efficient and cost-effective identification and characterization of the 5′ signatures of expressed genes.
This strategy is flexible because it can also be adapted easily for 3′ end isolation of expressed genes. We applied the 5′-RATE method to characterize expressed genes from the maize inbred line B73, which is being used for whole genome sequencing. Maize, which has a genome ∼80% the size of the human genome, is one of the most important crops and a model for plant genetics, breeding and crop evolution (18). Sequencing of the gene-rich regions has predicted that the maize genome contains a large number of genes (59 000 genes) as compared to mammalian genomes (18). Although hundreds of thousands of ESTs and full-length cDNAs have been released to the public, only a limited number of 5′ end signatures have been identified. In this study, we developed the 5′-RATE method using total RNA isolated from B73. Sequence analysis of the 5′-RATE tags revealed the complex nature of alternative transcription start sites (TSSs) and promoter regions. Interestingly, the 5′-RATE method is comprehensive enough to identify poly(A) tails at the 5′ regions of many maize transcripts, which have not been detected so far in any organism using other expression approaches. These results indicate that 5′-RATE is a powerful profiling method for rapid identification of long transcript ends in complex genomes.

MATERIALS AND METHODS

RNA isolation

Total RNA was isolated from ∼2.0 g of leaves of 30-day-old maize plants (inbred line B73) using a Trizol solution (Invitrogen, Carlsbad, CA). The mRNA was purified using a Qiagen (Valencia, CA) kit according to the manufacturer's instructions.

Oligo-capping at the 5′ regions of mRNA

About 1.0 µg of poly(A)+ mRNA was used for 5′-decapping. RNA oligocapping was carried out as described by Suzuki et al. (16,17) and Hashimoto et al. (19) with minor modifications. Bacterial alkaline phosphatase was used to remove 5′ phosphate groups from mRNAs; subsequently, the 5′ G-cap was hydrolyzed using tobacco acid pyrophosphatase. The decapped mRNA was divided into two pools (pools 1 and 2) and ligated with two different synthetic RNA oligos (5′-oligo A and 5′-oligo B) (Figure 1) using T4 RNA ligase (TaKaRa, New York, NY).

Figure 1

Experimental procedure for the 5′-RATE. The mRNA from maize is treated with bacterial alkaline phosphatase and acid pyrophosphatase to modify the cap structure at the 5′ regions. The 5′ decapped mRNA is divided into pools 1 and 2 and ligated with RNA oligos (A and B). The cDNA is synthesized and tags are released from the 5′ regions of cDNA using the NlaIII enzyme. Tags from the two pools are self-ligated to generate ditag cassettes. Ditags are amplified using PCR and linkers are removed by XhoI digestion. Ditag fragments are sequenced using the 454 pyrosequencer at DOE Joint Genome Institute (JGI), CA.

First-strand cDNA synthesis

The decapped mRNA was pre-heated at 70°C for 10 min to prepare for single-strand cDNA synthesis. Pre-heated mRNA was then combined with 10 pmol of random adapter primer and 3.5 µl (200 U/µl) of reverse transcriptase (Superscript; Invitrogen) in 100 µl volume. The RT reaction was incubated at 12°C for 1 h, followed by 42°C for 4 h, according to the procedure described by Hashimoto et al. (19). The mRNA was hydrolyzed using 15 µl of 0.1 M NaOH at 65°C for 40 min. The cDNA synthesis was confirmed using actin and ubiquitin primers.

Double-strand cDNA amplification

The single-stranded cDNA (10 µl) was amplified in a 50 µl PCR using 10 pmol of 3′ primer specific to random primer sequences and 10 pmol of biotinylated 5′ primer (primer A for pool 1 and primer B for pool 2). About 5 U of PfuTurbo® DNA polymerase (Stratagene Inc., La Jolla, CA) was used for each PCR. A total of 12 PCR cycles were performed at 94°C, 1 min; 58°C, 1 min; 72°C, 1 min with a final extension of 5 min at 72°C.

NlaIII tag and ditag formation

Amplified cDNA was digested with 200 U of NlaIII enzyme for 3 h at 37°C. Biotinylated PCR fragments were captured using 100 µl of Dynal streptavidin beads. Ditag cassettes were formed by ligating NlaIII tags from pools 1 and 2 with 15 U of T4 DNA ligase (USB Inc., Cleveland, OH) in 25 µl volume at 16°C overnight. Ditags (1 µl of 1:100 dilution) were amplified in five 50 µl PCRs (22 cycles of 94°C, 5 min; 94°C, 1 min; 60°C, 30 s; 72°C, 1 min; and 72°C, 5 min extension) with 10 pmol of biotinylated forward and reverse primers and 5 U of Platinum Taq DNA polymerase (Invitrogen). Ditag DNA fragments (50 bp to 1.5 kb) were purified from a 3% agarose gel (Figure 1) and linkers corresponding to pools 1 and 2 were removed by digesting with 200 U of XhoI in 300 µl volume for 3 h at 37°C. Ditags were then purified from a 3% agarose gel and dissolved in 50 µl of ultra pure water. The linkers and any undigested ditags were removed using 100 µl of Dynal streptavidin beads. The supernatant was treated twice with phenol:chloroform:isoamyl alcohol (24:24:1), precipitated and dissolved in 10 µl of ultra pure water.

Pyrosequencing of NlaIII tags

The NlaIII ditags were blunt-ended and ligated with pyrosequencing adaptors. Single-stranded ditags were captured on beads and subjected to emulsion PCR (emPCR) to enrich the templates. The enriched beads were loaded on a pico-titer-plate for pyrosequencing according to Margulies et al. (13). Pyrosequencing was carried out at the DOE Joint Genome Institute (JGI), CA. A limited amount of one nucleotide (A, G, C or T) was added at a time to pause the DNA polymerase reaction. During this process, pyrophosphate (PPi) was released upon each nucleotide incorporation and was in turn converted into ATP by sulfurylase. The resulting ATP was then used by luciferase to emit light. The emitted light was detected by a CCD camera and converted into a pyrogram (13). The signal peaks in the pyrogram were converted into nucleotide sequence information.
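The conversion of pyrogram peaks into bases can be illustrated with a minimal sketch. It assumes a cyclic nucleotide flow order (T, A, C, G here, which is an assumption and not stated in the text) and signal intensities that have already been normalized so that a value near 1.0 corresponds to one incorporated nucleotide. The function name and the example signal values are hypothetical; this is not the 454 base-calling software.

```python
def flowgram_to_sequence(signals, flow_order="TACG"):
    """Convert normalized flow signals into a base sequence.

    Each signal is rounded to the nearest whole number, which is taken as
    the number of identical bases incorporated during that nucleotide flow
    (0 means the flowed nucleotide was not incorporated).
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]  # assumed cyclic flow order
        count = max(0, int(signal + 0.5))             # round; noise below 0.5 -> 0
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical normalized signals for two flow cycles (T, A, C, G, T, A, C, G).
example_signals = [1.02, 0.05, 2.10, 0.96, 0.03, 3.05, 0.00, 1.10]
example_sequence = flowgram_to_sequence(example_signals)
```

Because a homopolymer run is read from the height of a single peak, the rounding step is where errors in long runs of the same base would be expected to arise in a scheme like this one.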

5′-RATE tag extraction

We developed the RATEspy program to extract the NlaIII tags from the 454 raw sequences. For forward tag extraction, the 5′ oligo signature sequence (TCGAGT) was identified. Then a tag was extracted after the signature sequence and before the first NlaIII site (CATG). If no CATG site was found, a tag was extracted until an ‘N’ was found or up to 80 bp after the signature sequence. For reverse tag extraction, the raw NlaIII sequences were reverse complemented and then the tags were extracted using the same method for the forward tags. Forward and reverse tags were clustered separately to get unique tags using the RATEspy program.
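The extraction rules described above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the RATEspy source code: the function names are hypothetical, and the handling of edge cases (for example, treating any non-ACGT character as N when reverse complementing) is an assumption.

```python
from collections import Counter

SIGNATURE = "TCGAGT"   # 5' oligo signature sequence given in the text
NLAIII_SITE = "CATG"   # NlaIII recognition site
MAX_TAG_LEN = 80       # maximum tag length when no CATG site is present


def reverse_complement(seq):
    """Reverse complement a DNA sequence; unknown characters become N."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement.get(base, "N") for base in reversed(seq.upper()))


def extract_tag(read):
    """Return the tag that follows the signature, or None if it is absent."""
    read = read.upper()
    start = read.find(SIGNATURE)
    if start == -1:
        return None
    rest = read[start + len(SIGNATURE):]
    site = rest.find(NLAIII_SITE)
    if site != -1:
        return rest[:site]          # tag ends just before the first CATG
    rest = rest[:MAX_TAG_LEN]       # no CATG: cap the tag at 80 bp
    n_pos = rest.find("N")
    return rest if n_pos == -1 else rest[:n_pos]  # or stop at the first N


def extract_forward_and_reverse(read):
    """Extract the forward tag and the tag from the reverse-complemented read."""
    return extract_tag(read), extract_tag(reverse_complement(read))


def cluster_tags(tags):
    """Collapse identical tags into unique tags with copy numbers."""
    return Counter(tag for tag in tags if tag)
```

For a ditag read, the forward call returns the tag adjacent to one linker and the reverse call returns the tag adjacent to the other; the two sets would then be clustered separately, as in the text.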

Mapping NlaIII tags

Stand-alone local BLAST 2.0 was used to map 5′-RATE tags to the target databases, including maize genomic sequences and maize FL-cDNA sequences, separately. The 5′ end sequences of maize FL-cDNAs were used to determine the matching rate of the NlaIII tags. An identity of 90% and an E-value of e−5 were used as BLAST search criteria. The BLAST results were processed and analyzed using RATEspy to obtain matching statistics.
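As an illustration of how such search criteria could be applied, the sketch below filters BLAST hits and computes a matching rate. It assumes the results were saved in the standard 12-column tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score) and reads the E-value cutoff as 1e-5; both are assumptions, and the function names are hypothetical rather than part of RATEspy.

```python
import csv
import io

MIN_IDENTITY = 90.0   # percent identity threshold from the text
MAX_EVALUE = 1e-5     # E-value cutoff, read here as 1e-5 (assumption)


def filter_hits(tabular_text, min_identity=MIN_IDENTITY, max_evalue=MAX_EVALUE):
    """Return {tag_id: [subject_ids]} for hits that pass both thresholds."""
    hits = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity, evalue = float(row[2]), float(row[10])
        if identity >= min_identity and evalue <= max_evalue:
            hits.setdefault(query, []).append(subject)
    return hits


def matching_rate(hits, total_tags):
    """Percentage of tags with at least one accepted hit."""
    return 100.0 * len(hits) / total_tags if total_tags else 0.0


# Hypothetical tabular output for three tags.
example_output = (
    "tag1\tcdnaA\t98.75\t80\t1\t0\t1\t80\t5\t84\t2e-35\t145\n"
    "tag2\tcdnaB\t88.00\t50\t6\t0\t1\t50\t10\t59\t1e-08\t60.0\n"
    "tag3\tcdnaC\t100.00\t20\t0\t0\t1\t20\t3\t22\t0.002\t40.1\n"
    "tag1\tcdnaD\t95.00\t80\t4\t0\t1\t80\t1\t80\t1e-30\t130\n"
)
example_hits = filter_hits(example_output)
example_rate = matching_rate(example_hits, total_tags=3)
```

A tag with more than one accepted subject, such as `tag1` in the example, corresponds to the multiple-location matching discussed for short tags.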

Putative promoter identification

About 200 bp of genomic DNA upstream from TSSs were extracted in order to predict maize promoter regions as described by Shahmuradov et al. (20). Promoter motifs such as the TATA box and other cis-acting elements were predicted using a PlantProm DB program (20).
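The upstream extraction step can be sketched as follows. The code takes a genomic sequence and a 0-based TSS coordinate, returns up to 200 bp upstream on the transcribed strand, and reports the distance of simple TATA-like motifs from the TSS. The regular expression `TATA[AT]A` is a deliberately crude stand-in chosen for illustration; it is an assumption and does not reproduce the PlantProm DB prediction method. All function names are hypothetical.

```python
import re

UPSTREAM_LEN = 200  # bp of genomic DNA upstream of the TSS, as in the text


def reverse_complement(seq):
    """Reverse complement a DNA sequence; unknown characters become N."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement.get(base, "N") for base in reversed(seq.upper()))


def upstream_region(genome, tss, strand="+", length=UPSTREAM_LEN):
    """Return up to `length` bp upstream of a 0-based TSS, read 5'->3'
    on the transcribed strand."""
    if strand == "+":
        return genome[max(0, tss - length):tss]
    # Minus strand: upstream lies to the right of the TSS on the forward strand.
    return reverse_complement(genome[tss + 1:tss + 1 + length])


def find_tata_boxes(upstream, motif="TATA[AT]A"):
    """Return the distance (bp upstream of the TSS) of each motif start."""
    return [len(upstream) - m.start() for m in re.finditer(motif, upstream)]


# Hypothetical sequence with a TATA-like motif starting 35 bp upstream of the TSS.
example_genome = "G" * 100 + "TATAAA" + "C" * 29 + "A" + "G" * 50
example_tss = 135  # index of the "A" that follows the C-rich spacer
example_upstream = upstream_region(example_genome, example_tss)
example_distances = find_tata_boxes(example_upstream)
```

With coordinates defined this way, a reported distance between 30 and 40 would fall in the window where TATA boxes are described in the Results.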

RESULTS

Improvements in the isolation of 5′ ends and generation of NlaIII tags

The three major steps (5′-oligocapping, formation of NlaIII tags and ditags, and pyrosequencing) involved in 5′-RATE library construction are presented in Figure 1. About 1.0 µg of mRNA was used for the 5′-RATE protocol, as compared to 5.0–25.0 µg of mRNA in SAGE (19,21) and FL-cDNA (16,17) library construction protocols. To improve the ligation efficiency, two distinct RNA oligos were ligated overnight to the 5′ regions of the mRNA pools (1 and 2), as compared to 3–16 h in the original procedures (19). Digestion of synthesized cDNA with NlaIII released much longer 5′ tags (average 250 bp) than those released from the type IIS (22) or type III enzyme (23) digestions during SAGE library construction (Table 1). Ditags (average 500 bp) were generated by ligating tags from the two pools overnight, similar to the ditag ligation in RL-SAGE (5). Only five ditag PCRs were enough to generate a 5′-RATE library, as compared to 20 (5,9) to 1000 PCRs (Catalog no. T5001-01; I-SAGE kit from Invitrogen) in other SAGE methods. The longer RATE ditags (∼500 bp) were easily purified on a 2% agarose gel, as compared to PAGE for the purification of shorter SAGE tags (5). Purified ditags (∼5.0 µg) were precipitated and shipped to JGI for 454 pyrosequencing. The complete 5′-RATE experimental procedure is available online.

Table 1

Comparison of 5′-RATE with SAGE and MPSS

Feature	5′-RATE	LongSAGE	SuperSAGE	MPSS
Tagging enzyme	NlaIII (Type II) (a)	MmeI (Type IIS) (b)	EcoP15I (Type III) (c)	BsmFI/MmeI (Type IIS)
Binding sequences	CATG	TCCRAC	CAGCAG	GGGAC/TCCRAC
Cleavage	Within the binding site	Away from the binding site	Away from the binding site	Away from the binding site
Tag size (bp)	∼80	19–21	25–26	17–20
Method of sequencing	Pyrosequencing	Sanger method	Sanger method	Hybridization
Cloning and colony picking	Not required	Required	Required	Required
Standard kits	Lab made	I-SAGE kit	SAGE kit	Custom library at Solexa, Inc.
Technical difficulty	Simple	Challenging	Challenging	Challenging
Cost/library ($)	Inexpensive (∼9000)	Expensive (∼30 000)	Expensive (∼30 000)	Expensive (∼30 000)
Time requirement	10–15 days	Several months	Several months	Several months

(a) Restriction enzymes consisting of a homodimer that recognize a palindromic sequence and cleave within the recognition site. Only Mg2+ is required as a cofactor in this case.

(b) Restriction enzymes consisting of a monomer that recognize non-palindromic sites and cleave outside the recognition sequence. SAM (S-adenosylmethionine) and Mg2+ are required cofactors for successful cleavage.

(c) Type III restriction enzymes consist of restriction and methylation subunits. Recognition sites are non-palindromic and cleavage is ∼25 bases from the recognition site. ATP and Mg2+ are required cofactors for successful cleavage.

Generation and characterization of a maize 5′-RATE library

About 160 000 sequence reads were obtained from a 454 sequence run. Using the RATEspy program, we isolated over 116 000 NlaIII tags with good quality sequences (Table 2). The size of sequenced 5′ NlaIII tags varied from 21 to 150 bp with an average of 80 bp (Table 2 and Figure 2). The RATE procedure improved tag size ∼3- to 5-fold in comparison with SAGE or MPSS procedures (tag size, 14–26 bp; Table 1). To validate the 5′-RATE method, 3259 significant tags (≥2 copies) were matched against the maize genome sequence and the 5′ regions of maize FL-cDNAs. Among these tags, 44% matched to the 5′ regions of FL-cDNAs and 34% matched to the maize genome sequence at 95% identity (Table 2), which is lower than the matching rate of the 3′ RL-SAGE tags (data not shown). The low matching rate is likely due to unfinished genome sequencing, incomplete sampling of FL-cDNAs, and the heterogeneity of the TSSs that was observed in our study (see below) and in other 5′ LongSAGE libraries (19,21). As expected, >70% of the tags matched to the 5′ region (within 100 bp) of the maize FL-cDNAs, which is similar to the results in other 5′ LongSAGE studies (19,21). Matching analysis with different lengths of 5′-RATE tags showed that as the tag length increased, the rate of multiple hits to the non-redundant nucleotide database at NCBI decreased (Supplementary Table 1), which was also demonstrated by Matsumura et al. (23,24). For example, only the 60 bp tag has a unique match in the NCBI non-redundant nucleotide database for the homolog of the rice acireductone dioxygenase 2 gene (Supplementary Table 1).
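The selection of significant tags (two or more copies) and the summary of tag lengths can be expressed compactly. The sketch below is a hypothetical illustration with made-up tag sequences; the function name and the returned field names are assumptions, not part of RATEspy.

```python
from collections import Counter


def summarize_tags(tags, min_copies=2):
    """Count unique tags, select significant ones, and summarize tag lengths."""
    counts = Counter(tags)
    significant = {tag: n for tag, n in counts.items() if n >= min_copies}
    lengths = [len(tag) for tag in tags]
    return {
        "total_tags": len(tags),
        "unique_tags": len(counts),
        "significant_tags": significant,   # tags seen at least min_copies times
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Hypothetical tags; real 5'-RATE tags ranged from 21 to 150 bp.
example_tags = ["ACGT", "ACGT", "TTTTTT", "GG", "GG", "GG"]
example_summary = summarize_tags(example_tags)
```

Requiring at least two copies is a simple way to reduce the influence of singleton tags that may carry sequencing errors, at the cost of discarding some rare transcripts.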

Table 2

Features of the maize B73 5′-RATE library

Inbred line	B73
Treatment	None
Growth stage	4-week-old leaves
Total mRNA	1 µg
Template DNA sequenced	Ditag
No. of reads sequenced	160 000
Total cost/library ($)	9000
Average tag size	∼80 bp
Matching of significant tags to genomic DNA	34%
Matching of significant tags to 5′ regions of maize FL-cDNA	44%

Figure 2

Size distribution of the 5′-RATE tags.

Sequence diversity in the 5′ region of maize transcripts

Preliminary sequence analysis of the 5′-RATE tags revealed that many maize transcripts had alternative TSSs. For example, the gene encoding a jasmonate-induced protein (ID: Q564C9) had 46 different TSSs (Figure 3) and the rubisco small subunit-encoding gene (ID: P05348) had 9 different TSSs (Supplementary Figure 3). In general, the TSS location of different transcripts varied from 1 to 99 nt from the 5′ region of maize FL-cDNAs. The length of the TSSs ranged from 8 to 14 nt (Figure 3 and Supplementary Figures 1–3). The analysis of the TSS data did not reveal any consensus sequences. Among the analyzed transcripts, the rubisco small subunit-encoding gene had the lowest 5′-tag diversity (33.33%) (Supplementary Figure 3). The transcript with the highest tag diversity (90%) was the gene encoding an intermediate filament C2 protein (ID: Q9NG13), with 2–75 non-template-derived nucleotides (Supplementary Figure 2). Alternative TSSs were found for many of the genes analyzed in this study, which is similar to the findings from the FL-cDNA libraries in Arabidopsis (25).

Figure 3

Sequence alignment of the jasmonate-induced gene (ID: Q564C9) with its alternative TSSs and 5′ poly(A) tail. The underlined nucleotides are non-template sequences.

When the 5′-RATE tags were matched to the maize genomic sequences, TATA boxes were found 30–40 bp upstream of the TSSs (Figure 3 and Supplementary Figures 1 and 3); similar results have been reported in other plants (20,25) and animals (17–19). This result suggests that the majority of the 5′-RATE tags are true 5′ sequences from the TSSs and that the isolation of these sequences will facilitate the identification of their putative promoters. Surprisingly, we found poly(A) tails at the 5′ ends of maize transcripts (Figure 3 and Supplementary Figures 1 and 2). Over 8% of the maize transcripts contained both a G-cap and a poly(A) signature in the 5′ regions. The length of the 5′ poly(A) tail varied from 20 to 150 adenine residues in most transcripts (Figure 3 and Supplementary Figures 1 and 2). Except for the rubisco small subunit gene (Supplementary Figure 3), we found poly(A) tails for other highly expressed genes (Figure 3 and Supplementary Figures 1 and 2). To investigate whether any FL-cDNAs had 5′ poly(A) tails, we searched plant, animal and viral cDNA databases. Several FL-cDNAs with a 5′ poly(A) tail were found in plants (maize, rice and Arabidopsis), animals (human, mouse and Drosophila) and viruses (vaccinia and cowpox virus) (Figure 4). Interestingly, the translation initiation codon (ATG) immediately followed the 5′ poly(A) tail in the FL-cDNAs from Arabidopsis, rice, human, mouse, Drosophila and virus, but this feature was not observed in maize FL-cDNAs (Figure 4).

Figure 4

FL-cDNA sequences with 5′ poly(A) tail from plants and animals. The 5′ poly(A) sequences are shown in boldface letters and translation initiation codon (ATG) is shown in capital letters.

DISCUSSION

The mRNA sequence provides crucial information for localizing expressed genes in a sequenced genome including maize, which has a relatively large genome size. Although millions of dollars have been invested in the identification of FL-cDNAs and ESTs in many organisms, these sequences are estimated to cover a maximum of only 60% of the transcriptome (8). The SAGE technology has provided an unprecedented high-throughput and high-efficiency approach to identify uncharacterized transcripts. For example, the Cancer Genome Anatomy Project (CGAP) has adopted the SAGE method for the analysis of different cancerous cell types that has produced >7 million transcript tags from 171 libraries (26). Many of the tags are novel transcripts since they are not present in any FL-cDNA or EST collections (11). However, the current SAGE methods require complicated tag, ditag and concatemer cloning, tedious colony picking and expensive clone sequencing. Similarly, the MPSS library construction and sequencing procedures are more complicated and can only be performed by Solexa Inc. Owing to these limitations, SAGE and MPSS methods have not been widely used for transcriptome analysis. The 5′-RATE procedure reported here is simple and fast, and has eliminated tag cloning and colony picking procedures. One can now make a 5′-RATE library within 2 weeks. In contrast, several months are required for SAGE and MPSS library construction. At the current sequencing cost, a 5′-RATE library with 160 000 tags costs about $9000, which is two to three times cheaper than that of the SAGE and MPSS methods. A unique feature of pyrosequencing is that several small samples can be run on a single 454 chip. This provides the possibility to perform replications and multiplexing of multiple samples from the same or different organisms in a single experiment. If only 40 000 tag sequences are needed for a 5′-RATE library, then the cost will be less than $2500. 
The 5′-RATE method is highly flexible and can, therefore, be adapted to characterize tags from the 3′ end of transcripts. A comparison of 5′-RATE with other tag-based methods is summarized in Table 1. Two unique features of the 5′-RATE method are noteworthy. First, most of the tags in a 5′-RATE library are derived from the 5′ end sequence of transcripts. Since the majority of the SAGE or MPSS tags isolated so far are from the 3′ region of transcripts, identification of the TSS sequence in the 5′ ends is essential for the characterization of complete transcription units. Only a few methods for 5′ end isolation have been reported so far, including the CAGE (27), 5′ LongSAGE (19,21) and PET (28) methods. For example, among 15 448 5′ tags identified in humans, 86–96% of the 5′ LongSAGE tags were assigned within −500 to +200 nt of the mRNA start sites (21). In the maize 5′-RATE library, >70% of the tags matched to 5′ regions (within 100 bp) of the maize FL-cDNAs. As more FL-cDNA sequences from the maize full-length cDNA project become available to the public, we expect that a higher matching rate will be obtained for our 5′-RATE tags. The second unique feature of the 5′-RATE method is that the tag length ranges from 21 to 150 bp with an average of 80 bp. This will circumvent the multi-location matching of some SAGE (14 bp), LongSAGE (21 bp), SuperSAGE (26 bp) and MPSS (17–21 bp) tags. Highly homologous gene family members should be more easily distinguished by using the long 5′-RATE tags rather than SAGE/LongSAGE/SuperSAGE or MPSS tags (Supplementary Table 1). Limited work has been done towards the identification of TSSs in plants as compared to animals. The recent study by Alexandrov et al. (25) identified alternative TSSs in 30–50% of genes in Arabidopsis using FL-cDNAs. Similarly, the 5′-RATE method described here has demonstrated that many maize genes have alternative TSSs.
Interestingly, substitutions, deletions and additions were also identified in the maize TSS regions, similar to the results observed in animal genomes (19,21). We also demonstrated that promoter signatures such as TATA boxes are localized 30–40 bp upstream of the TSS region in maize, which is consistent with other plants (20) and animals (17,19). These results suggest that the 5′-RATE sequence data are an excellent genomics resource for the identification of TSSs, promoter regions and 5′-untranslated regions. The addition of these sequence data should ultimately increase the accuracy of genome annotation. Newly generated mRNA transcripts, called heterogeneous nuclear RNAs, are further modified by the addition of 5′ cap structures (a guanosine nucleotide attached via a 5′–5′ triphosphate linkage) and 3′ poly(A) tails (150–200 adenines) in eukaryotes (29). Unexpectedly, sequences obtained using the 5′-RATE method revealed poly(A) tails (20–150 bp) at the 5′ ends of maize transcripts, which has not been reported previously in any other organism. The size of the poly(A) tails at the 5′ end identified in this study is similar to that of the 3′ poly(A) tail. The longer 5′ adenylation might increase the half-life of transcripts and may also regulate translation and stability, as was reported for the 3′ poly(A) tail (30,31). These results motivated us to analyze further FL-cDNA sequences obtained from the biotinylated CAP trapper procedure in rice (3), Arabidopsis (2) and mouse (32), and also from the oligocapping procedure in maize, human (33) and Drosophila (34). Surprisingly, several FL-cDNAs with 5′ poly(A) tails from plants (maize, rice and Arabidopsis) and animals (human, mouse and Drosophila) had been unknowingly deposited in the databases. These results support the idea that 5′ G-capping and 5′ poly(A) tail structures in transcripts are widely present in plants and animals.
The chance that the 5′ poly(A) tails are experimental artifacts of the oligocapping method is low because the cDNA clones with a 5′ poly(A) tail in Arabidopsis (2) and mouse (32) FL-cDNAs were identified by the biotinylation CAP trapper method. So far only one gene, called late gene or 11 kDa protein (M64569) in poxvirus (vaccinia and cowpox), has been reported to contain 5′ poly(A) sequences that are not complementary to the viral DNA template (35–42). Until now, there have been no reports either on 5′ poly(A) tail identification or on the mechanism of 5′ polyadenylation in any eukaryotic organism. However, a recent in vitro experiment showed that poly(A) tail addition to the 5′ regions of mRNA enhances the translation rate (43). The same study also reported that translation inhibition is possible with an excess of mRNA containing poly(A) tails (43). Our results confirmed the presence of non-template-encoded poly(A) sequences at the 5′ regions of mRNAs in maize. The poly(A) tags identified by the 5′-RATE method were also G-capped at the first nucleotide, similar to the poxviral late mRNAs (40). We believe that G-capping and poly(A) addition to the 5′ ends might be coupled to each other during mRNA processing. Interestingly, a novel translation initiation codon (ATG) was created due to the presence of 5′ poly(A) tails in the eukaryotic transcripts, as shown in Figure 4. We speculate that 5′ polyadenylation of transcripts might generate novel protein diversity in eukaryotes. Although the possibility that the poly(A) tails are artifacts of the oligo-capping method is low, we will experimentally confirm the presence of the 5′ poly(A) tails in selected transcripts, and determine how a poly(A) tail is added to the 5′ region and its role in transcript stability and function in the near future. It is worthwhile to report here that the presence of the 5′ poly(A) tail in maize transcripts caused a major problem in our 5′ LongSAGE library construction (M. Gowda and G.L. Wang, unpublished data). Initially, we optimized the 5′ LongSAGE method (19) using the same RNA from maize that was used for the 5′-RATE method. Owing to the occurrence of long poly(A) signatures at the 5′ regions of maize transcripts, we failed to obtain enough concatemer clones for sequencing. The homopolymeric tracts of A/T in the plasmid might inhibit the replication and gene expression processes, as shown in Escherichia coli (44,45). Similar problems have also been reported during cDNA library generation (46). In addition, sequencing of 5′ LongSAGE clones with poly(A) sequences was not successful (M. Gowda and G.L. Wang, unpublished data). Therefore, we were unable to generate long concatemer inserts or obtain good sequencing results from a maize 5′ LongSAGE library. In summary, 5′-RATE has the following advantages over existing tag-based methods: (i) it is simple because the difficult steps for purifying and cloning concatemers in E. coli are eliminated, allowing the technique to be used in most molecular labs with no specialized equipment; (ii) it is fast because colony picking and DNA purification for Sanger sequencing are eliminated, and a 454 sequencing run can be finished in a few hours; (iii) it is more comprehensive because 5′-RATE tags are more informative for genome and EST matching owing to their longer length (average 80 bp) compared to 21 bp RL-SAGE/LongSAGE tags or 17–21 bp MPSS tags; (iv) it is cost-effective, as 160 000 5′-RATE tags cost about $9000 in comparison to about $30 000 for LongSAGE tags; and (v) the 5′-RATE tags have potential applications in subsequent biological experiments. For example, the 5′-RATE tag sequences can be used as templates for RNAi-based gene silencing, for designing probes for oligo-chips, or for designing primers for RT–PCR assays. The 5′-RATE method could be further improved with the following two approaches.
First, the average tag size can be increased to >100 bp if other novel sequencing methods are used in ditag sequencing (13,14). Second, a small fraction of transcripts might be missed during 5′-RATE library construction due to the absence of an NlaIII site in the transcripts. This can be overcome by making an additional 5′-RATE library using different tagging enzymes such as DpnII, TaqI, MseI or Sau3AI.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

