DeepSAGE--digital transcriptomics with high sensitivity, simple experimental protocol and multiplexing of samples.

Kåre L Nielsen¹, Annabeth Laursen Høgh, Jeppe Emmersen.

Abstract

Digital transcriptomics with pyrophosphate-based ultra-high throughput DNA sequencing of ditags provides high sensitivity and cost-effective gene expression profiling. Sample preparation and handling are greatly simplified compared to Serial Analysis of Gene Expression (SAGE). We compare DeepSAGE and LongSAGE data and demonstrate greater power of detection and multiplexing of samples derived from potato. The transcript analysis revealed a great abundance of up-regulated potato transcripts associated with stress in dormant potatoes compared to harvest. Importantly, many transcripts were detected that cannot be matched to known genes, but are likely to be part of the abiotic stress response in potato.

Year: 2006 PMID: 17028099 PMCID: PMC1636492 DOI: 10.1093/nar/gkl714

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Transcriptomics is essential to monitoring the genomic activation of cells or organisms in response to environmental signals. Global gene expression analysis has been conducted either by hybridization with oligonucleotide microarrays (1), or by counting of sequence tags. An advantage of microarray analysis is that once the array has been made at a high cost, many measurements can be made at a relatively low cost. However, only known genes can be spotted on the array. In contrast, sequence tag based approaches, like Serial Analysis of Gene Expression (SAGE) (2) and massively parallel signature sequencing (MPSS) (3), can measure the expression of both known and unknown genes. The MPSS technology, however, is very expensive and too complex to be performed in non-specialized laboratories. A SAGE experiment, by contrast, consists of a series of molecular biology manipulations that, in principle, can be carried out in any molecular biology laboratory with access to a 96 capillary DNA sequencer. SAGE relies on the extraction of one 14–21 nt sequence tag from each mRNA. Tags are ligated together, cloned and sequenced. In a typical sequence run of 96 samples, ∼1500 tags of corresponding mRNAs can be detected. Due to the cost of sequencing, a SAGE study typically encompasses 50 000 tags and provides detailed knowledge of the 2000 most highly expressed genes in the tissue analyzed. In practice, it can be difficult to obtain enough clones of the appropriate insert length (4) to facilitate efficient detection. Here we describe an experimentally simple method for ditag-based transcript detection, DeepSAGE, which combines the initial steps of LongSAGE (5) with emulsion-based amplification and pyrophosphate-based ultra-high throughput DNA sequencing (6). DeepSAGE allows the counting of more than 300 000 tags with less effort and cost than a typical LongSAGE study encompassing 50 000 tags.
The deep sampling facilitates the measurement of rare transcripts below the detection limit of existing global transcript profiling technologies. Moreover, multiple samples can be sequenced in a single run.

MATERIALS AND METHODS

DeepSAGE sample preparation

RNA was isolated (7) from field grown potato tubers cv. Kuras at the time of harvest (HAR) and at dormancy after 60 days of storage at 10°C (DOR). Quality of RNA was verified from integrity and intensity of ribosomal RNA following 1% TAE-agarose gel electrophoresis. Fifty micrograms of RNA was used to construct LongSAGE ditags as described by Saha et al. (5). Following ligation of linker-tags to form ditags, six 50 μl PCRs were prepared, each consisting of 2.5 U Taq polymerase (Ampliqon, Copenhagen, Denmark), 0.5 mM deoxynucleotide triphosphates, 1 μl 1:160 dilution of the ligation reaction, 2 μM of 5′-GCCTTGCCAGCCCGCTCAGCAAGCTTCTAACGATGTACGT-3′ and 2 μM of either 5′-GCCTCCCTCGCGCCATCAGAAGTGGTGCAGTACAACTAGGCT-3′ (HAR) or 5′-GCCTCCCTCGCGCCATCAGACGTGGTGCAGTACAACTAGGCT-3′ (DOR) in 10 mM Tris–HCl, 50 mM KCl, 3 mM MgCl2, 1% Triton X-100. The PCRs were subjected to 26 cycles of amplification at 94°C for 30 s, 55°C for 1 min and 70°C for 1 min. The presence of a 125 bp ditag band was verified by 15% TAE–PAGE prior to pooling and ethanol precipitation by addition of 2 μl 20 g/l glycogen (Fermentas, Burlington, Canada), 50 μl 7.5 M ammonium acetate, 1 ml 100% ethanol (De Danske Spritfabrikker, Aalborg, Denmark) and incubation at −80°C for 1 h. The tubes were centrifuged at maximum speed at room temperature for 20 min. The pellets were washed with 1 ml 70% ethanol and redissolved in 75 μl 10 mM Tris–HCl, 0.1 mM EDTA, pH 7.5. The two amplified ditag samples were separated by 12% TAE–PAGE. Following staining of the gel for 2 min with ethidium bromide (2 μg/ml), the 130 bp band was excised using a clean scalpel, and the gel piece was transferred into a 0.6 ml tube that had been punctured in the bottom with a 12-gauge needle. The tube was inserted into a 1.5 ml tube and centrifuged at maximum speed for 1 min in a benchtop centrifuge. 375 μl 10 mM Tris–HCl, 0.1 mM EDTA, pH 7.5 and 125 μl 7.5 M ammonium acetate were added to the crushed gel pieces, and the tubes were incubated at 4°C overnight.
The entire contents of each sample were transferred to two Spin-X filter tubes (Corning, New York, USA) and centrifuged at maximum speed for 30 s. The eluates were transferred to a 2 ml tube prior to addition of 2 μl 20 mg/ml glycogen and 1500 μl 100% ethanol. Following incubation at −80°C for 1 h, the tubes were centrifuged at maximum speed at room temperature for 20 min, and the pellets were washed with 1 ml 70% ethanol and redissolved in 20 μl 10 mM Tris–HCl, 0.1 mM EDTA, pH 7.5. The integrity of the 130 bp ditag band was checked by 15% TAE–PAGE and the concentration was determined by absorption at 260 nm. The two samples were mixed in equimolar amounts prior to sequencing by 454 Life Sciences Corp., Branford, CT, USA according to Ref. (6).

Tag extraction and data analysis

Tags were extracted from sequence FASTA files containing ditags using the PERL script DeepSAGE_extract.pl (see Supplementary Data). Linker and poly-A derived tags were removed, but duplicate ditags were not. The tags were mapped to potato tentative contiguous sequences () using Sagemap-tsv.pl (). The resulting tab-separated value files were imported into Excel for further analysis. The entire dataset, including tags observed in only one of the datasets, was used for the calculation of correlation coefficients. To improve interpretability, however, Figures 2 and 3 are displayed in logarithmic scale, which omits tags observed in only one of the datasets. Statistically significant gene expression changes were detected (8) using strict Bonferroni correction.
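The extraction step can be illustrated with a minimal Python sketch. It is not the authors' DeepSAGE_extract.pl, whose logic is not reproduced here. It assumes that a ditag lies between the first and last CATG site of a read and that each tag is the 19 nt adjacent to its CATG; the function names and the minimum-length rule are hypothetical.

```python
# Illustrative sketch only; not the authors' DeepSAGE_extract.pl.
# Assumption: the ditag lies between the first and last CATG in the read,
# and each tag is the tag_len nucleotides next to its CATG site.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_tags(read, anchor="CATG", tag_len=19):
    """Return [tag1, tag2] for a read containing a ditag, else []."""
    read = read.upper()
    left = read.find(anchor)
    right = read.rfind(anchor)
    if left == -1 or right <= left:
        return []  # fewer than two anchor sites
    insert = read[left + len(anchor):right]
    if len(insert) < tag_len:
        return []  # too short to hold a full tag (hypothetical rule)
    tag1 = insert[:tag_len]
    # The second tag is read on the opposite strand.
    tag2 = reverse_complement(insert[-tag_len:])
    return [tag1, tag2]


example_read = "AAA" + "CATG" + "A" * 19 + "C" * 19 + "CATG" + "TTT"
example_tags = extract_tags(example_read)
```

Counting the tags returned for every read, for example with `collections.Counter`, would give the tag table that is then mapped to reference sequences. Removal of linker and poly-A derived tags is not shown.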

RESULTS AND DISCUSSION

mRNA was extracted from two stages of potato tuber development, harvest (HAR) and dormancy (DOR). Following the preparation of LongSAGE ditags (5), 100–400 50 μl PCRs are usually pooled to provide enough concatemers for LongSAGE. In the present study, only six 50 μl reactions yielded more than 10 times the material used for a DeepSAGE experiment. Amplification of ditags was carried out using primers containing a sequencing primer recognition site, a 3 nt sample identification key (AAG for HAR, ACG for DOR) and a sequence complementary to the linkers used in LongSAGE. Both samples yielded amplification products of 125 bp, which were purified by gel electrophoresis. DNA concentration was determined and equimolar amounts of the two samples were pooled. In contrast to LongSAGE, these amplified ditags were used directly for sequencing. Preparation of beads carrying sequence templates, clonal amplification in emulsion and DNA sequencing were done according to Margulies et al. (6). A total of 224 310 sequences were obtained in a single sequence run, which included both forward and reverse sequences (Table 1). The distribution of the length between the two CATG sequences flanking the ditags (Figure 1) was found to be very similar to traditional LongSAGE. A PERL script (DeepSAGE_extract.pl) was used to extract 314 212 tags of 19 nt (167 367 from forward sequences and 146 845 from reverse sequences). Overall, 70% of these sequences yielded a good ditag sequence. This is comparable to our experience with traditional LongSAGE, where 73% of sequenced clones contained at least one ditag.
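Sorting reads by the 3 nt identification key can be sketched as follows. This is a minimal illustration, assuming that each forward read begins with its key immediately after the sequencing primer site; the function name and the "unassigned" bin are hypothetical.

```python
# Keys as given in the text: AAG marks HAR, ACG marks DOR.
SAMPLE_KEYS = {"AAG": "HAR", "ACG": "DOR"}


def sort_by_key(reads, keys=SAMPLE_KEYS, key_len=3):
    """Group forward reads by sample.

    Assumption: the identification key is the first key_len nucleotides
    of each forward read. Reads with an unknown key are kept whole in a
    hypothetical 'unassigned' bin.
    """
    samples = {name: [] for name in keys.values()}
    samples["unassigned"] = []
    for read in reads:
        name = keys.get(read[:key_len].upper())
        if name is None:
            samples["unassigned"].append(read)
        else:
            samples[name].append(read[key_len:])  # strip the key
    return samples


demo = sort_by_key(["AAGTGGT", "ACGTGGT", "TTTTGGT"])
```

Only the forward primer differs between the two samples, so only forward reads can be assigned in this way.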

Table 1

Summary of sequencing statistics

Library	# Sequences	# Tags extracted			% Match to unigenes (a)	% Error rate (b)
		1st run	2nd run	Total		Sub	Ins	Del	Total
DeepSAGE	224 310	314 212	119 835	434 047
Forward sequences		167 367	46 313	213 680
DOR		95 427	26 673	122 100	49.5	5.2	2.5	1.7	9.1
HAR		71 940	19 640	91 580	50.5	9.2	2.2	1.3	12.4
LongSAGE DOR	3206	53 688			52.3	15.3	0.7	0.8	16.6

(a) Tags were matched to the 38 239 unigenes in Solanum tuberosum Gene Index release 10 ().

(b) Error rates were estimated according to Akmaev and Wang (9).

Figure 1

Distribution of ditag length in LongSAGE (solid) and DeepSAGE (hatched).

It was reported that the pyrosequencing employed in this study has a somewhat higher error rate than Sanger sequencing, especially in homopolymer regions of four or more (6). Therefore, we inspected our dataset for tags containing homopolymers which were truncated or elongated. Surprisingly, we found such tags only in very low abundance, similar to other types of sequencing errors, even though several abundant tags contained homopolymers. To further address sequence accuracy and the impact on tag based transcriptome analysis, the forward sequences from both runs were sorted by their identification key into 91 580 from the HAR sample and 122 100 from the DOR sample. We determined the sequence error rates using SAGEscreen (9) for both the DeepSAGE datasets and for a LongSAGE DOR library of 53 688 tags. Tags observed more than 50 times (87, 229 and 141 tags for LongSAGE, DeepSAGE DOR and HAR, respectively) were used in this analysis. The results are shown in Table 1. Overall estimates of the proportion of tags containing sequence errors are in fact lower in DeepSAGE (9.1–12.4%) than in LongSAGE (16.6%). The overall estimates are composed of a lower substitution error rate in DeepSAGE compared to LongSAGE (5.2–9.2% versus 15.3% of tags) and higher insertion (2.2–2.5% versus 0.72%) and deletion rates (1.3–1.7% versus 0.8%), in agreement with what was previously found for ultra-high throughput pyrophosphate sequencing (6). Presumably, the higher sequence accuracy of DeepSAGE is obtained because tag sequences are extracted from nt 33 to approximately 73 (dependent on variation in MmeI cleavage) of DNA sequences, well within the first 90 nt, which are determined with the highest accuracy (6). Indeed, correlation analysis of tags extracted from forward and reverse sequences (Figure 2A) indicated good sequence fidelity and reliable tag extraction (R2 = 0.96). Reproducibility was confirmed by performing a second limited run yielding 119 835 tags and comparing the two runs (R2 = 0.96) (Figure 2B).

Figure 2

Correlation of tag counts. (A) Counts extracted from forward and reverse sequences. Data sets consisted of 167 159 forward sequences and 199 413 reverse sequences. Using only tags observed at least once in both directions (12 025 tags), R2 = 0.9611. (B) Counts extracted from two different sequencing runs. Data sets consisted of 96 427 tags from the first run and 26 673 tags from the second run. Using only tags observed at least once in both runs (6631 tags), R2 = 0.9609.
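The restriction to tags seen in both data sets can be made concrete with a short Python sketch. It computes the squared Pearson correlation of raw counts over shared tags; whether the authors correlated raw or log-transformed counts is not stated, so raw counts are an assumption, and the function name and toy counts are hypothetical.

```python
def r_squared_shared(counts_a, counts_b):
    """Squared Pearson correlation over tags present in both count tables.

    Assumption: raw (untransformed) counts are correlated.
    """
    shared = sorted(set(counts_a) & set(counts_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared tags")
    xs = [counts_a[tag] for tag in shared]
    ys = [counts_b[tag] for tag in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("counts are constant in one data set")
    return sxy * sxy / (sxx * syy)


# Toy counts: tags seen in only one data set are ignored.
forward = {"tag1": 1, "tag2": 2, "tag3": 3, "only_forward": 9}
reverse = {"tag1": 2, "tag2": 4, "tag3": 6, "only_reverse": 5}
toy_r2 = r_squared_shared(forward, reverse)
```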

Figure 3 shows a comparison of the numbers of LongSAGE DOR tags versus DeepSAGE DOR tags and DeepSAGE HAR tags, respectively. The distribution of DOR tags was very similar for the Long- and DeepSAGE methods (R2 = 0.96), showing that these measurements of the transcriptome are equivalent. The transcriptomes at dormancy and harvest differed markedly (R2 = 0.33), as expected. A similar correlation of R2 = 0.35 was obtained for DeepSAGE DOR versus DeepSAGE HAR (data not shown).

Figure 3

Correlation of LongSAGE DOR tags with DeepSAGE DOR tags (A) and DeepSAGE HAR tags (B). Data sets consisted of 51 918 LongSAGE tags, 122 100 DeepSAGE DOR tags and 91 580 DeepSAGE HAR tags. The most abundant DOR tag was encountered 1397 times in LongSAGE and 3145 times in DeepSAGE. The least abundant tags were seen once in all data sets. Using only tags observed at least once in both libraries (8567 tags), the R2 for the comparison of DOR Deep- and LongSAGE increases to 0.9694.

Little is known about the potato tuber's adaptation to the abiotic stress imposed by the unnatural environment above ground during storage. Comparing the gene expression of the analyzed potato libraries, 69 genes were up-regulated and 65 genes were down-regulated (P < 0.05 with Bonferroni correction) in DeepSAGE DOR compared to DeepSAGE HAR (Supplementary Table 1). Strikingly, among the 69 up-regulated transcripts, 22 of the 42 transcripts that can be matched to a known sequence are homologues of either chaperones (4), genes involved in the ubiquitin protein degradation pathway (5) or genes suggested to be otherwise stress-related (13) (see Table 2). Intriguingly, among these transcripts are three members of Ca2+ signal transduction pathways: TC112122 (Calmodulin, 4-fold up-regulated), TC126796 (Phospholipase C, 15-fold up-regulated) and TC119057 (Annexin P34, 5-fold up-regulated). Interestingly, Phospholipase C and Annexin P34 transcripts were also observed during EST sequencing in three cDNA libraries derived from abiotically stressed tissue [ and Ref. (10)]. In addition, the cell wall protein (Q40142), which has been shown to be involved in thickening the protective periderm layer, was increased (11), consistent with the fact that potatoes thicken their skin during storage, presumably as a response to drought. In stark contrast, only 3 of the 65 down-regulated genes encode chaperones and no other stress-related transcripts were identified. It seems likely that several of the unknown up-regulated transcripts are additional parts of the potato tuber's response to the abiotic stress induced by storage. Of the 65 down-regulated genes, 38 can be matched to a potato tentative consensus sequence (TC), and of these more than half (25) match storage proteins, such as patatins, metallocarboxypeptidase inhibitors or Kunitz type protease inhibitors.
This indicates that loading of storage protein into the potato tuber can, at least to some extent, take place up to the point of harvest, but is completely shut down after the tuber has been taken out of the ground. Using the LongSAGE DOR tags (see Supplementary Table 2), only 97 of these 134 genes were identified as regulated by comparison to DeepSAGE HAR. Importantly, 26 additional 'false positive' transcripts, which fail to meet the statistical significance criteria using the larger dataset, were deemed differentially regulated using the smaller dataset.
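The effect of a strict Bonferroni correction can be sketched in Python. The per-tag test below is a two-proportion z-test chosen purely for illustration; it is a stand-in and not the test of Ref. (8), and the number of tests (10 000) is a hypothetical value. Library sizes are the DeepSAGE totals from Table 1 and the counts for TC126796 are taken from Table 2.

```python
import math


def two_proportion_p(count_a, total_a, count_b, total_b):
    """Two-sided p-value of a pooled two-proportion z-test.

    Stand-in test for illustration; not the method of Ref. (8).
    """
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 1.0
    z = (count_a / total_a - count_b / total_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Apply a strict Bonferroni threshold of alpha / n_tests."""
    return p_value < alpha / n_tests


DOR_TOTAL, HAR_TOTAL = 122_100, 91_580  # DeepSAGE totals, Table 1
N_TESTS = 10_000                        # hypothetical number of tags tested

# TC126796 (Phospholipase C): 62 tags in DOR versus 4 in HAR (Table 2).
p_plc = two_proportion_p(62, DOR_TOTAL, 4, HAR_TOTAL)
plc_significant = bonferroni_significant(p_plc, N_TESTS)

# A hypothetical tag with nearly equal counts for comparison.
p_flat = two_proportion_p(10, DOR_TOTAL, 8, HAR_TOTAL)
flat_significant = bonferroni_significant(p_flat, N_TESTS)
```

Because the threshold shrinks with the number of tags tested, only large and well-sampled differences pass.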

Table 2

Stress genes regulated between harvest and dormant potato tubers

Tentative consensus	DeepSAGE DOR	DeepSAGE HAR	LongSAGE DOR	Gene name	References
TC111718	337	50	76	Glycine rich binding protein	(14)
TC119035	140	18	59	ADP-ribosylation factor	(15)
TC112122	122	33	52	Calmodulin	(16)
TC119041	311	47	135	Starch phosphorylase L-1	(17)
TC120303	84	11	35	Universal stress protein	(18)
TC118975	121	23	ND	S-Adenosylmethionine synthase	(19)
CN514071	99	17	36	Glycine rich RNA binding protein	(14)
TC126796	62	4	21	Phospholipase C	(20)
TC125909	1478	646	583	Glycine rich binding protein	(14)
TC125835	137	22	69	Low temp and salt responsive protein	(21)
TC126028	367	157	ND	Auxin-repressed protein	(22)
TC126037	136	37	67	Stress-related protein	(a)
TC119057	339	67	163	Annexin P34	(23)

(a) This entry was associated with stress by Li and McKersie during direct submission of Q9SW70 to SwissProt.
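The fold changes quoted in the text can be checked against Table 2 with a few lines of Python. The sketch assumes that the tabulated values are raw tag counts; it also shows a ratio normalised to library size (DeepSAGE totals from Table 1), which is an added illustration rather than a calculation from the paper.

```python
# DeepSAGE library sizes (forward-sequence totals, Table 1).
LIBRARY_SIZES = {"DOR": 122_100, "HAR": 91_580}

# (DeepSAGE DOR, DeepSAGE HAR) counts from Table 2.
TABLE2_COUNTS = {
    "TC112122 Calmodulin": (122, 33),
    "TC126796 Phospholipase C": (62, 4),
    "TC119057 Annexin P34": (339, 67),
}


def raw_fold(dor, har):
    """Ratio of counts without any normalisation."""
    return dor / har


def normalised_fold(dor, har, sizes=LIBRARY_SIZES):
    """Ratio of tag frequencies (count divided by library size)."""
    return (dor / sizes["DOR"]) / (har / sizes["HAR"])


for name, (dor, har) in TABLE2_COUNTS.items():
    print(f"{name}: raw {raw_fold(dor, har):.1f}, "
          f"normalised {normalised_fold(dor, har):.1f}")
```

The raw ratios (about 3.7, 15.5 and 5.1) are close to the 4-, 15- and 5-fold changes quoted in the text. The normalised ratios are lower because the DOR library is larger than the HAR library.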

Typically 225 000 sequences are obtained in a single experiment, and these can be divided into forward and reverse sequences (as in this study) or be generated exclusively from one end. In DeepSAGE, each sequence contains a ditag. Therefore, at the 70% success rate of this study, 225 000 sequences represent 315 000 tags determined in a single sequence run. This might be compared to microarray transcriptomics, as Lu et al. (12) have estimated that the sensitivity towards detecting rare transcripts in an Affymetrix gene chip experiment is comparable to a SAGE study of 120 000 tags. Therefore, three samples can be multiplexed in a single run to generate this sensitivity. To estimate the usefulness of the increased sensitivity we analyzed the expression of transcription factors, a class of genes known to be expressed at very low abundance. Mapping the DOR derived tags to the ∼190 000 tentative contiguous sequences of potato (), only 67 different gene matches were identified as homologues of transcription factors among LongSAGE tags (see Supplementary Table 3 for details). Of these, 13 were observed >5 times and 8 tags were observed 3, 4 or 5 times. The majority (69%) of the 67 tags were seen only twice (17) or once (29). Because a large proportion of the very rare tags might be generated by sequencing errors (13), it cannot be established with certainty whether the corresponding transcripts are present. Furthermore, reliable expression levels cannot be established for rare tags because of random sampling (13). In comparison, 94 different gene matches to putative transcription factors were found in the DeepSAGE analysis of the DOR sample. Twenty-two tags were observed >5 times, whereas 51 tags (54%) were seen twice (16) or once (36). Interestingly, 21 tags, or almost three times as many, were encountered 3, 4 or 5 times. This underscores that deeper sampling does detect more rare transcripts and provides more reliable gene expression estimates.
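The influence of sampling depth on rare tags can be illustrated with a simple Poisson model. This is a sketch under stated assumptions, not an analysis from the paper: tags are assumed to be sampled independently, and the transcript frequency of 1 in 50 000 is a hypothetical value. The two depths are the LongSAGE DOR and DeepSAGE DOR tag totals from Table 1.

```python
import math


def detection_probability(frequency, n_tags, min_count=1):
    """P(a tag is seen at least min_count times) under Poisson sampling.

    Assumption: tag counts follow a Poisson distribution with mean
    frequency * n_tags.
    """
    lam = frequency * n_tags
    p_below = sum(
        math.exp(-lam) * lam ** k / math.factorial(k)
        for k in range(min_count)
    )
    return 1.0 - p_below


RARE_FREQUENCY = 1 / 50_000  # hypothetical transcript frequency

# Probability of seeing the tag at least three times at each depth.
p_long = detection_probability(RARE_FREQUENCY, 53_688, min_count=3)
p_deep = detection_probability(RARE_FREQUENCY, 122_100, min_count=3)
```

Under this model the probability of observing such a tag three or more times is several-fold higher at the DeepSAGE depth, in line with the larger number of transcription factor tags seen 3, 4 or 5 times.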
Multiplexing of different samples or replicates of the same sample, each tagged with a unique nucleotide identification key, is a further possibility of DeepSAGE, as we have shown here by co-analyzing potato tuber ditag libraries at dormancy and harvest. Until now the lack of replicates has been a severe drawback of SAGE. Following sequencing, different samples are first sorted according to their identification keys and the tags are counted prior to comparison of gene expression. The DeepSAGE protocol omits the sample-consuming concatenation of ditags, the tedious clone picking and the sequence template preparation, which constitute most of the experimental SAGE protocol. As an example, a single person in our laboratory has consistently spent six weeks to generate data from two LongSAGE libraries, including sequencing of concatemers. Using DeepSAGE, the same person has recently generated 19 SAGE libraries from pig mRNA in 2 weeks. Sequence template preparation and analysis were performed in 1 week. Tag based gene expression profiling methods have been inhibited by the cost of DNA sequencing, despite their advantageous global and digital nature, but will become increasingly cost-effective as the $1000 genome is approached. The total cost of DeepSAGE, including labor costs, was reduced at least 10-fold compared to LongSAGE, from 25 cents to ∼2.5 cents per tag.
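The yield and cost figures quoted in this section follow from simple arithmetic, sketched below in Python. All inputs are taken from the text (225 000 sequences per run, 70% ditag success rate, two tags per ditag, 25 and ∼2.5 cents per tag, 50 000 tags for a typical LongSAGE study); the function names are hypothetical.

```python
def tags_per_run(n_sequences, ditag_success_rate, tags_per_ditag=2):
    """Expected number of tags from one sequencing run."""
    return round(n_sequences * ditag_success_rate * tags_per_ditag)


def study_cost_dollars(n_tags, cents_per_tag):
    """Total cost of a study at a given per-tag cost."""
    return n_tags * cents_per_tag / 100


deep_tags = tags_per_run(225_000, 0.70)             # 315 000 tags
deep_cost = study_cost_dollars(deep_tags, 2.5)      # DeepSAGE run
long_cost = study_cost_dollars(50_000, 25)          # typical LongSAGE study
fold_reduction = 25 / 2.5                           # per-tag cost reduction
```

With these inputs a full DeepSAGE run of 315 000 tags comes to about $7875, compared with about $12 500 for a 50 000 tag LongSAGE study, consistent with the statement that DeepSAGE counts more than 300 000 tags at lower cost.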

21 in total

1. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays.

Authors: S Brenner; M Johnson; J Bridgham; G Golda; D H Lloyd; D Johnson; S Luo; S McCurdy; M Foy; M Ewan; R Roth; D George; S Eletr; G Albrecht; E Vermaas; S R Williams; K Moon; T Burcham; M Pallas; R B DuBridge; J Kirchner; K Fearon; J Mao; K Corcoran
Journal: Nat Biotechnol Date: 2000-06 Impact factor: 54.908

Review 2. Signal transduction mechanisms in plants: an overview.

Authors: G B Clark; G Thompson; S J Roux
Journal: Curr Sci Date: 2001-01-25 Impact factor: 1.102

3. DNA chip-based expression profile analysis indicates involvement of the phosphatidylinositol signaling pathway in multiple plant responses to hormone and abiotic treatments.

Authors: Wen Hui Lin; Rui Ye; Hui Ma; Zhi Hong Xu; Hong Wei Xue
Journal: Cell Res Date: 2004-02 Impact factor: 25.617

4. A comparison of gene expression profiles produced by SAGE, long SAGE, and oligonucleotide chips.

Authors: Jun Lu; Anita Lal; Barry Merriman; Stan Nelson; Gregory Riggins
Journal: Genomics Date: 2004-10 Impact factor: 5.736

5. Correction of sequence-based artifacts in serial analysis of gene expression.

Authors: Viatcheslav R Akmaev; Clarence J Wang
Journal: Bioinformatics Date: 2004-02-10 Impact factor: 6.937

6. A novel extracellular matrix protein from tomato associated with lignified secondary cell walls.

Authors: C Domingo; M D Gómez; L Cañas; J Hernández-Yago; V Conejero; P Vera
Journal: Plant Cell Date: 1994-08 Impact factor: 11.277

7. The role of a zinc finger-containing glycine-rich RNA-binding protein during the cold adaptation process in Arabidopsis thaliana.

Authors: Yeon-Ok Kim; Hunseung Kang
Journal: Plant Cell Physiol Date: 2006-04-11 Impact factor: 4.927

8. Arabidopsis proteins containing similarity to the universal stress protein domain of bacteria.

Authors: David Kerk; Joshua Bulgrien; Douglas W Smith; Michael Gribskov
Journal: Plant Physiol Date: 2003-03 Impact factor: 8.340

9. Plastidial alpha-glucan phosphorylase is not required for starch degradation in Arabidopsis leaves but has a role in the tolerance of abiotic stress.

Authors: Samuel C Zeeman; David Thorneycroft; Nicole Schupp; Andrew Chapple; Melanie Weck; Hannah Dunstan; Pierre Haldimann; Nicole Bechtold; Alison M Smith; Steven M Smith
Journal: Plant Physiol Date: 2004-06-01 Impact factor: 8.340

10. Robust-LongSAGE (RL-SAGE): a substantially improved LongSAGE method for gene discovery and transcriptome analysis.

Authors: Malali Gowda; Chatchawan Jantasuriyarat; Ralph A Dean; Guo-Liang Wang
Journal: Plant Physiol Date: 2004-03 Impact factor: 8.340
