
SAGE is far more sensitive than EST for detecting low-abundance transcripts.

Miao Sun, Guolin Zhou, Sanggyu Lee, Jianjun Chen, Run Zhang Shi, San Ming Wang.

Abstract

BACKGROUND: Isolation of low-abundance transcripts expressed in a genome remains a serious challenge in transcriptome studies. The sensitivity of the methods used for analysis has a direct impact on the efficiency of detection. We compared the EST method and the SAGE method to determine which is more sensitive, and by how much, for the detection of low-abundance transcripts.
RESULTS: Using the same low-abundance transcripts detected by both methods as the targeted sequences, we observed that the SAGE method is 26 times more sensitive than the EST method for the detection of low-abundance transcripts.
CONCLUSIONS: The SAGE method is more efficient than the EST method in detecting low-abundance transcripts.

Substances：
3' Untranslated Regions

Year: 2004 PMID: 14704093 PMCID: PMC317289 DOI: 10.1186/1471-2164-5-1

Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969

Background

Identification of a complete set of transcripts expressed in a genome is one of the ultimate goals of transcriptome studies. Such information is essential for genome annotation and for further study of the function of each gene. Three classes of transcripts are expressed from a genome: high-abundance, intermediate-abundance and low-abundance transcripts [1]. Whereas most of the high- and intermediate-abundance transcripts have been identified, fully identifying the low-abundance transcripts remains a serious challenge [2-4]. Since the beginning of human genome studies, transcript identification has been performed mainly with EST (expressed sequence tag)-based methods [5]. To identify low-abundance transcripts, extensive subtraction and normalization have been performed in these EST efforts [4,6]. The number of novel transcripts identified in humans through the EST-based approaches has reached a plateau [2,7]. Recently, the SAGE (serial analysis of gene expression) method has been applied to transcriptome analyses, with the collection of large numbers of 10-base SAGE tags from different species [8-10]. Although both the EST and the SAGE methods are applied to transcriptome studies, they use different approaches. The EST method follows a single transcript-single clone-single sequencing process; thus, each sequence represents a single transcript. In contrast, SAGE follows a multiple transcripts-multiple tags-single clone-single sequencing process; thus, each SAGE sequence represents multiple transcripts. At the same scale of sequence collection, SAGE should detect far more transcripts than EST does; therefore, SAGE might identify more low-abundance transcripts than EST does. Indeed, it is frequently observed that many SAGE tags have no match among the existing ESTs, and most of these SAGE tags have low copy numbers [11-13].
Our previous analyses indicated that the majority of these unmatched SAGE tags are derived from low-abundance transcripts [7]. To determine whether SAGE is indeed more sensitive than the EST method for the detection of low-abundance transcripts and, if so, to what extent, we analyzed existing EST and SAGE data, and we report our observations here.

Results and Discussion

Because a SAGE tag is located at the 3' part of a transcript [8], we used 3' ESTs for comparison. We collected 3' ESTs representing low-abundance transcripts by searching for UniGene clusters that contained only a single 3' EST (ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.seq.all.gz, UniGene Build #161). We identified 42,500 such UniGene clusters and obtained the same number of 3' ESTs. For comparison with SAGE tags, we extracted virtual tags from these ESTs. Of the 42,500 3' ESTs, 32,587 contained one or more CATG sites, a precondition for the release of a SAGE tag from a transcript, and we extracted 32,587 virtual SAGE tags (the 10 bases downstream of the last CATG) from these sequences. We removed virtual tags that were shared by more than one 3' EST. This resulted in a final set of 22,243 virtual tags from 22,243 3' ESTs representing low-abundance transcripts. To obtain experimental SAGE tags for the comparison, we downloaded 477,261 SAGE tags comprising 6,847,555 copies collected from 154 SAGE libraries. Comparison of the 22,243 virtual SAGE tags with the experimental SAGE tag set identified 20,575 tags that were present in both sets. By matching the 20,575 tags against the SAGEmap database (http://www.ncbi.nlm.nih.gov/SAGE/), we identified 2,278 tags that represented the same 3' ESTs detected by both the EST method and the SAGE method. We used these 2,278 tags as the final set for quantitative comparison. Whereas each of the 2,278 virtual tags represents a transcript detected only once by the EST method, the copy number of each of the 2,278 experimental SAGE tags represents the frequency with which a transcript was detected by SAGE. The 2,278 experimental SAGE tags had a total copy number of 59,754; 1,424 (63%) of these SAGE tags appeared from two to more than 100 times. On average, SAGE was 26 times more sensitive than the EST method in detecting these transcripts (59,754 copies / 2,278 tags; Table 1).
The data clearly show that the SAGE method is much more sensitive than the EST method for the detection of low-abundance transcripts.
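
The virtual-tag step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the function names `virtual_tag` and `unique_virtual_tags` and the toy sequences are hypothetical, and each EST is assumed to be supplied as a sense-strand string read 5' to 3'.

```python
from collections import Counter

ANCHOR = "CATG"  # site that must be present for a SAGE tag to be released
TAG_LEN = 10     # bases taken downstream of the last CATG


def virtual_tag(est_seq):
    """Return the 10 bases following the last CATG, or None if unavailable."""
    seq = est_seq.upper()
    pos = seq.rfind(ANCHOR)
    if pos == -1:
        return None  # CATG-negative sequence: no tag can be extracted
    start = pos + len(ANCHOR)
    tag = seq[start:start + TAG_LEN]
    # A CATG closer than 10 bases to the end cannot yield a full tag.
    return tag if len(tag) == TAG_LEN else None


def unique_virtual_tags(ests):
    """Map tag -> EST id, keeping only tags produced by exactly one EST."""
    tags = {est_id: virtual_tag(seq) for est_id, seq in ests.items()}
    counts = Counter(t for t in tags.values() if t is not None)
    return {t: est_id for est_id, t in tags.items()
            if t is not None and counts[t] == 1}


# Toy input (hypothetical sequences, not real ESTs).
ests = {
    "est1": "GGCATGAAACCCGGGTTTAA",      # yields AAACCCGGGT
    "est2": "TTCATGCCCATGACGTACGTACGG",  # last CATG yields ACGTACGTAC
    "est3": "AACATGACGTACGTACTT",        # same tag as est2, so both are dropped
    "est4": "GGGGCCCCTTTT",              # no CATG site
}
result = unique_virtual_tags(ests)
print(result)
```

Dropping every tag shared by more than one EST, rather than keeping one copy, mirrors the filtering that reduced 32,587 virtual tags to 22,243 in the text.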

Table 1

Comparison between EST and SAGE methods for the detection of low-abundance transcripts

Frequency of detection	Virtual tags from 3' EST	Experimental SAGE tags	Experimental SAGE tag copies
1	2,278	854	854
2		482	964
3		313	939
4		190	760
5		97	485
6 to 10		217	1,578
11 to 20		86	1,234
21 to 100		37	1,279
>100		2	51,661
Total	2,278	2,278	59,754
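
The summary figures quoted in the text can be recomputed from the rows of Table 1. The short sketch below does only that arithmetic; the variable names are illustrative and the numbers are copied from the table.

```python
# Rows of Table 1: (frequency class, SAGE tags in class, SAGE copies in class)
table1 = [
    ("1", 854, 854),
    ("2", 482, 964),
    ("3", 313, 939),
    ("4", 190, 760),
    ("5", 97, 485),
    ("6 to 10", 217, 1578),
    ("11 to 20", 86, 1234),
    ("21 to 100", 37, 1279),
    (">100", 2, 51661),
]

total_tags = sum(tags for _, tags, _ in table1)
total_copies = sum(copies for _, _, copies in table1)

# Each of these tags was detected exactly once by the EST method, so the
# mean SAGE copy number per tag is the SAGE/EST detection ratio.
mean_copies = total_copies / total_tags

# Tags detected more than once by SAGE (every class except the first).
multi_tags = total_tags - table1[0][1]
multi_fraction = multi_tags / total_tags

print(total_tags, total_copies)          # 2278 59754
print(round(mean_copies, 1))             # 26.2
print(multi_tags, round(100 * multi_fraction))  # 1424 63
```

The ratio is a mean over all 2,278 tags, so it is sensitive to the two tags in the >100 class, which hold 51,661 of the 59,754 copies.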

What could explain the difference between the EST and SAGE methods in detecting low-abundance transcripts? It is unlikely that the difference is due to the depth of sequence collection. The current number of human ESTs is about 4.5 million, including 131,229 mRNAs and 1,470,982 3' ESTs, whereas the total number of human SAGE tags is about 8 million. Considering that over 20 tags can be detected by a single SAGE sequence, the roughly 8 million tags correspond to no more than about 400,000 SAGE sequences, far fewer than the number of EST sequences. In our previous studies [2], we observed a "loss" effect on EST collection due to the non-specific poly(dA/dT) hybridization during the subtraction/normalization widely used in EST library construction [6], as evidenced by the quantitative loss of a group of targeted transcripts, although it is difficult to give an absolute rate of loss at the whole-genome level because of the complexity of the transcriptome. This phenomenon can explain the difference in part, but other causes of loss may also exist, such as limited cloning efficiency when ligating cDNAs into the vector during cDNA library construction and clonal loss during library transformation. In the SAGE process, there is no subtraction/normalization step, and the cDNA fragments at each step of SAGE library construction have nearly the same length and the same ends until they are cloned into the vector. Therefore, the repertoire of the total transcripts is well preserved in SAGE libraries for detection. It is true that the SAGE method has many limitations for transcript detection.
For example, a 14-base SAGE tag contains less sequence information about the detected transcript than an EST of hundreds of bases; the specificity of a SAGE tag in representing a unique transcript is also lower than that of an EST, particularly for SAGE tags with higher copy numbers [14-16]; and SAGE cannot detect CATG-negative transcripts, although their number is low: only 151 (0.78%) of the 19,399 full-length human cDNAs in the RefSeq (NM) database are CATG-negative. Another issue concerns erroneous SAGE tags. A SAGE tag has 10 bases. In theory, a sequencing error could occur at any base within a tag, so that up to 4 × 4 × 4 × 4 × 4 × 4 × 4 × 4 × 4 × 4 = 4^10 mutated tags could be generated. However, this does not happen in practice [7]. We have experimentally converted thousands of SAGE tags into their 3' cDNAs using the GLGI method. From these studies, we clearly see that over 70% of the low-copy SAGE tags represent real transcripts expressed at low levels (these were experimentally confirmed; the real rate may be higher considering the limited experimental sensitivity). Although erroneous SAGE tags certainly exist, they cannot make up a significant portion of the total SAGE tag collection, particularly among the SAGE tags with low copy numbers. Regardless of these limitations, SAGE has unique features for transcriptome study. Among these is that the presence of a SAGE tag largely implies the presence of a transcript. It is worth noting that we focused only on known low-abundance transcripts in this analysis. Many unknown low-abundance transcripts may not be present in EST libraries and are therefore not detectable as novel ESTs. However, these unknown low-abundance transcripts may be well preserved in SAGE libraries and are therefore readily detectable as novel SAGE tags.
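
The size of the 10-base tag space mentioned above is easy to check. The sketch below counts all possible 10-base sequences and, as an additional illustration that is not taken from the text, the number of tags that differ from a given tag by exactly one substituted base. The helper name `single_substitution_neighbors` and the example tag are hypothetical.

```python
BASES = "ACGT"
TAG_LEN = 10

# Every one of the 10 positions can hold any of the four bases.
tag_space = len(BASES) ** TAG_LEN


def single_substitution_neighbors(tag):
    """All tags that differ from `tag` at exactly one position."""
    return {tag[:i] + b + tag[i + 1:]
            for i in range(len(tag))
            for b in BASES if b != tag[i]}


neighbors = single_substitution_neighbors("ACGTACGTAC")
print(tag_space)       # 1048576
print(len(neighbors))  # 30
```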

Conclusions

The high sensitivity of the SAGE method for transcript detection is valuable for the isolation of low-abundance transcripts. Coupling SAGE with amplification-based high-throughput methods such as GLGI (generation of longer 3' cDNA from SAGE tag for gene identification) [17], which convert SAGE tags into the original transcripts, provides an efficient way to isolate low-abundance transcripts.

Methods

Sequences used for the analysis

The ESTs were downloaded from the UniGene database (Build #161) (ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.seq.all.gz). The UniGene clusters containing CATG+ 3' ESTs were identified. Virtual SAGE tags were extracted from these 3' ESTs downstream of their last CATG sites. The virtual SAGE tags were pooled, and tags with the same sequence were combined to generate the final virtual SAGE tag list from the 3' ESTs, with quantitative information for each tag. The experimental SAGE tags were downloaded from the GEO database, which contained 154 SAGE libraries. The SAGE tags from different libraries were pooled. Identical SAGE tags in the pool were combined and their copy numbers summed to generate the final SAGE tag list, with quantitative information for each SAGE tag.
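
The pooling and comparison steps described above can be sketched as follows. This is an illustrative outline, not the authors' program: the function names `pool_libraries` and `shared_tags`, the dictionary layout (tag mapped to copy number per library), and the toy data are all assumptions.

```python
from collections import Counter


def pool_libraries(libraries):
    """Sum the copy numbers of identical tags across all SAGE libraries."""
    pooled = Counter()
    for tag_counts in libraries:
        pooled.update(tag_counts)  # adds counts for tags already present
    return pooled


def shared_tags(virtual_tags, pooled):
    """Pooled SAGE copy number for each virtual tag also observed by SAGE."""
    return {tag: pooled[tag] for tag in virtual_tags if tag in pooled}


# Toy data (hypothetical tags and counts).
libraries = [
    {"AAACCCGGGT": 3, "ACGTACGTAC": 1},
    {"AAACCCGGGT": 2, "TTTTGGGGCC": 7},
]
virtual_tags = {"AAACCCGGGT", "GGGGGGGGGG"}

pooled = pool_libraries(libraries)
matched = shared_tags(virtual_tags, pooled)
print(matched)  # {'AAACCCGGGT': 5}
```

In this layout each matched tag carries its pooled copy number, which is the quantity compared against the single EST detection per virtual tag.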

Computational process

Programs were written in Java for the extraction of virtual SAGE tags from the 3' ESTs and for the comparison between the experimental SAGE tags and the EST-derived virtual SAGE tags. The programs are available upon request.

List of abbreviations

EST – expressed sequence tag
SAGE – serial analysis of gene expression
GLGI – generation of longer 3' cDNA from SAGE tag for gene identification

Authors' contributions

M.S. carried out all computational analyses. G.Z., S.L., J.C., and R.Z.S. generated experimental data for the development of the concept. S.M.W. designed the study. All authors read and approved the final manuscript.

17 in total

1. Analysis of human transcriptomes.

Authors: V E Velculescu; S L Madden; L Zhang; A E Lash; J Yu; C Rago; A Lal; C J Wang; G A Beaudry; K M Ciriello; B P Cook; M R Dufault; A T Ferguson; Y Gao; T C He; H Hermeking; S K Hiraldo; P M Hwang; M A Lopez; H F Luderer; B Mathews; J M Petroziello; K Polyak; L Zawel; K W Kinzler
Journal: Nat Genet Date: 1999-12 Impact factor: 38.330

2. The pattern of gene expression in human CD15+ myeloid progenitor cells.

Authors: S Lee; G Zhou; T Clark; J Chen; J D Rowley; S M Wang
Journal: Proc Natl Acad Sci U S A Date: 2001-03-13 Impact factor: 11.205

3. The pattern of gene expression in human CD34(+) stem/progenitor cells.

Authors: G Zhou; J Chen; S Lee; T Clark; J D Rowley; S M Wang
Journal: Proc Natl Acad Sci U S A Date: 2001-11-20 Impact factor: 11.205

4. High-throughput GLGI procedure for converting a large number of serial analysis of gene expression tag sequences into 3' complementary DNAs.

Authors: Jianjun Chen; Sanggyu Lee; Guolin Zhou; San Ming Wang
Journal: Genes Chromosomes Cancer Date: 2002-03 Impact factor: 5.006

5. Large-scale transcriptional activity in chromosomes 21 and 22.

Authors: Philipp Kapranov; Simon E Cawley; Jorg Drenkow; Stefan Bekiranov; Robert L Strausberg; Stephen P A Fodor; Thomas R Gingeras
Journal: Science Date: 2002-05-03 Impact factor: 47.728

6. Correct identification of genes from serial analysis of gene expression tag sequences.

Authors: Sanggyu Lee; Terry Clark; Jianjun Chen; Guolin Zhou; L Ridgway Scott; Janet D Rowley; San Ming Wang
Journal: Genomics Date: 2002-04 Impact factor: 5.736

7. An anatomy of normal and malignant gene expression.

Authors: Kathy Boon; Elisson C Osorio; Susan F Greenhut; Carl F Schaefer; Jennifer Shoemaker; Kornelia Polyak; Patrice J Morin; Kenneth H Buetow; Robert L Strausberg; Sandro J De Souza; Gregory J Riggins
Journal: Proc Natl Acad Sci U S A Date: 2002-07-15 Impact factor: 11.205

8. Normalization and subtraction: two approaches to facilitate gene discovery.

Authors: M F Bonaldo; G Lennon; M B Soares
Journal: Genome Res Date: 1996-09 Impact factor: 9.043

9. Screening poly(dA/dT)- cDNAs for gene identification.

Authors: S M Wang; S C Fears; L Zhang; J J Chen; J D Rowley
Journal: Proc Natl Acad Sci U S A Date: 2000-04-11 Impact factor: 11.205

10. Computational Analysis of Gene Identification with SAGE.

Authors: Terry Clark; Sanggyu Lee; L Ridgway Scott; San Ming Wang
Journal: J Comput Biol Date: 2002 Impact factor: 1.479
