Zhihong Zhang, Fred S Dietrich.
Abstract
A minimally addressed area in Saccharomyces cerevisiae research is the mapping of transcription start sites (TSS). Mapping of TSS in S.cerevisiae has the potential to contribute to our understanding of gene regulation, transcription, mRNA stability and other aspects of RNA biology. Here, we use 5' SAGE to map 5' TSS in S.cerevisiae. Tags identifying the first 15-17 bases of the transcripts are created, ligated to form ditags, amplified, concatemerized and ligated into a vector to create a library. Each clone sequenced from this library identifies 10-20 TSS. We have identified 13,746 unique, unambiguous sequence tags from 2231 S.cerevisiae genes. TSS identified in this study are consistent with published results, with primer extension results described here, and with expectations based on previous work on transcription initiation. We have aligned the sequence flanking 4637 TSS to identify the consensus sequence A(A(rich))5NPyA(A/T)NN(A(rich))6, which confirms and expands the previously reported PyA(A/T)Pu consensus pattern. The TSS data allowed the identification of a previously unrecognized gene, uncovered errors in previous annotation, and identified potential regulatory RNAs and upstream open reading frames in 5'-untranslated regions.
Year: 2005 PMID: 15905473 PMCID: PMC1131933 DOI: 10.1093/nar/gki583
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Scheme of 5′ SAGE methodology. The poly(A)-rich RNA is divided into two pools. Different oligo sets (blue or red) are used to carry out reverse transcription, template-switching and primer extension. After digestion with the tagging enzyme (yellow oval), the two sample pools are combined to make 120 bp ditags. The anchoring enzyme (blue triangle) is used to generate 50 bp ditags for concatenation. Specifics are discussed in Materials and Methods.
Validation of protein-coding gene TSS from published results
| ORF | Gene | Total occurrence | Tags position (occurrence) | Published data | References |
|---|---|---|---|---|---|
| Gene TSS consistent with the published results | | | | | |
| | CDC19 | 4 | −30(1), 27(2), | Around −20 | ( |
| | PHO5 | 4 | | −40, | ( |
| | HIS4 | 7 | −63(1), − | −60 | ( |
| | PGK1 | 58 | −46(1), −43(4), | Within −48 to −27 | ( |
| | HMRA1 | 6 | | Around | ( |
| | COX9 | 14 | −46(9), | −52, | ( |
| | RPP1A | 17 | −78(2), | | ( |
| | RPP1B | 7 | −115(1), −22(2), 20(2) | −18, | ( |
| | ARO3 | 3 | | −8, −9, | ( |
| | RPP2B | 11 | −122(1), −83(1), | | ( |
| | CYC7 | 1 | −89(1) | −93, | ( |
| | | 12 | −473(6), −271(1), −219(1), | −15 to −43 | ( |
| | HIS1 | 3 | | −116 to −109, | ( |
| | RPL30 | 11 | −80(1), −69(1), −64(2), −59(1), | −58 | ( |
| | IMD2 | 14 | −240(1), | −106 | ( |
| | RPS17A | 15 | −42(1), | | ( |
| | PHO84 | 14 | −54(1), −45(1), | −39 | ( |
| | COX5A | 9 | −34(1), | −31 | ( |
| | RPP2A | 5 | | −55, | ( |
| | ARG1 | 5 | | −77, | ( |
| | ADH1 | 16 | −61(1), −49(1), −43(2), | −37, | ( |
| | CYT1 | 7 | −360(1), −289(1), | −296, | ( |
| | GLN1 | 38 | −187(1), −126(2), | −120 | ( |
| Gene TSS having minor differences between 5′ SAGE and published results | | | | | |
| | HHT1 | 26 | | −30 | ( |
| | TEF2 | 23 | −111(1), −110(1), −87(1), −74(2), −45(1), −30(1), | −23 | ( |
| | ARO4 | 2 | −235(1), | −91 | ( |
| | HEM3 | 4 | −346(1), −193(1), | −176 | This study |
| | HEM13 | 2 | | −75, −52, | ( |
| | | 5 | −603(1), | | ( |
| | | 11 | | | ( |
| | TDH3 | 32 | −47(2), −45(1), | −40 | ( |
| | ENO1 | 23 | | −39 | ( |
| | ARG4 | 3 | −55(3) | −57 | ( |
| | ENO2 | 9 | | −35 | ( |
| | KGD1 | 4 | −149(1), −137(1), −117(1), | −254, | ( |
| | CYC1 | 4 | −56(1), | −61, | ( |
| | SDH2 | 6 | | −40 | ( |
| | CTS1 | 2 | −428(1), | −238 | ( |
| | | 4 | | −244 | ( |
| | CAR1 | 1 | −40(1) | −49, −48, −46, | ( |
| | TEF1 | 32 | | −32 | ( |
| | RPO26 | 2 | | −84, −78, −76, −75, −60, −59, −58, −51, | ( |
| Gene TSS that are not consistent with the published results | |||||
| | PIM1 | 1 | −133(1) | −146, −341 | ( |
| | TPI1 | 28 | −41(1), −38(1), −30(24), −15(2) | −20 | ( |
| | HEM1 | 1 | −260(1) | −76, −72, −68, −63 | ( |
| | URA1 | 5 | −62(1), −44(1), −42(2), −32(1) | −155, −151, −143, −138, −136, −133 | ( |
| | URA4 | 1 | −79(1) | −41, −40, −22, −18 | ( |
| | HSC82 | 7 | −35(3), −31(1), −30(1), −20(1), −10(1) | −98, −42 | ( |
| | ADH2 | 3 | −46(3) | −67, −63, −58, −55 | ( |
For each gene, the total occurrence of tags in the 5′-UTR region is shown. Published data were collected from the literature. Each unitag position is shown relative to the translation start position, followed by its occurrence in parentheses. Bold type indicates TSS positions that agree between 5′ SAGE and previously published results.
a. The total occurrence = ∑ tag occurrence.
b. SER3 contains an upstream regulatory non-coding RNA, SRG1; see the text for details.
c. Both GCN4 and CPA1 contain long 5′-UTRs containing uORF(s). Tags representing the GCN4 TSS were identified by examining all tags upstream of GCN4, as they were beyond the 500 bp 5′-UTR mapping range used in this study.
d. COX4 contains a 342 bp 5′-UTR intron.
Gene-associated unitag positions relative to annotated S.cerevisiae genes
| Tag occurrence | Putative 5′-UTR (−500, −1): unitags | Putative 5′-UTR (−500, −1): occurrence | Coding region: unitags | Coding region: occurrence | Putative 3′-UTR (+1, +100): unitags | Putative 3′-UTR (+1, +100): occurrence | Total: unitags | Total: occurrence |
|---|---|---|---|---|---|---|---|---|
| =1 | 3895 (48%) | 3895 (33%) | 2754 (34%) | 2754 (23%) | 380 (4.6%) | 380 (3.2%) | 7029 (86%) | 7029 (59%) |
| >1 | 1041 | 4461 (38%) | 107 | 299 (2.5%) | 17 | 35 (0.3%) | 1165 (14%) | 4795 (41%) |
| Total | 4936 | 8356 | 2861 (35%) | 3053 (26%) | 397 (4.8%) | 415 (3.5%) | 8194 | 11 824 |
The numbers of 5′ SAGE unitags and the corresponding total occurrences (Occurrence) were categorized into three groups based on mapping position and two groups based on the tag occurrence threshold. For each number, the percentage of the total unitags or total occurrence (both in bold) is shown in parentheses.
a. The (−500, −1) range is the position relative to the ATG start codon. The (+1, +100) range refers to the position relative to the STOP codon.
b. The numbers of multiple occurrence unitags. In the 5′-UTR category, 1041/4936 = 21.1% of unitags are multiple occurrence, while only 107/2861 = 3.7% and 17/397 = 4.3% of unitags mapping to the coding region and 3′-UTR, respectively, are multiple occurrence.
c. The 4936 putative 5′-UTR mapped unitags come from 2231 genes, of which 660 genes are represented by at least one multiple occurrence unitag and 1571 genes are represented by one or more single occurrence unitags.
d. Total tag occurrence of all 5′-UTR mapped unitags. We estimate the percentage of tags representing actual TSS as bounded by 8356/11 824 = 70.7% and (4461 + 1904)/11 824 = 53.8%, where 1904 is an estimate of the number of real single occurrence tags in the putative 5′-UTR.
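The percentages quoted in these footnotes can be rechecked from the counts in the table. The Python sketch below is only illustrative: the variable and function names are ours, the 5′-UTR total occurrence is taken as the sum 3895 + 4461 from the table rows, and the value 1904 is taken as given from footnote d rather than derived.

```python
# Counts copied from the table: (single occurrence unitags, multiple occurrence unitags).
unitags = {"5utr": (3895, 1041), "coding": (2754, 107), "3utr": (380, 17)}

occ_5utr_single = 3895       # occurrences of single occurrence 5'-UTR unitags
occ_5utr_multi = 4461        # occurrences of multiple occurrence 5'-UTR unitags
total_occurrences = 11824    # all gene-associated tag occurrences
estimated_real_single = 1904 # estimate stated in footnote d (not derived here)


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Share of unitags in each region that occur more than once (footnote b).
multi_fraction = {
    region: pct(multi, single + multi)
    for region, (single, multi) in unitags.items()
}

# Bounds on the share of tags representing actual TSS (footnote d).
upper_bound = pct(occ_5utr_single + occ_5utr_multi, total_occurrences)
lower_bound = pct(occ_5utr_multi + estimated_real_single, total_occurrences)

print(multi_fraction, upper_bound, lower_bound)
```

The contrast between the 5′-UTR and the other two regions in `multi_fraction` is the basis for treating multiple occurrence unitags as the more reliable TSS calls.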
Figure 2. 5′ SAGE tag distribution around the ORF start codon. (A) Distribution of all 8194 unitags from 500 bp upstream of ATG to 1000 bp downstream of ATG. (B) Distribution of all 1165 multiple occurrence unitags from 500 bp upstream of ATG to 1000 bp downstream of ATG. (C) Zoom-in view of the tag distribution shown in (B) within the 200 bp putative 5′-UTR region. (D) Cumulative distribution function (CDF) plots of all unitags (red) and multiple occurrence unitags (blue).
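As a reminder of what a CDF plot like panel (D) summarizes, the following Python sketch computes an empirical CDF over tag positions relative to the ATG. The positions are hypothetical toy values, not data from this study, and the function name is ours.

```python
from bisect import bisect_right


def empirical_cdf(positions):
    """Return a function giving the fraction of positions <= x."""
    ordered = sorted(positions)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical tag positions relative to the ATG (negative = upstream).
toy_positions = [-120, -45, -30, -30, -12, 85, 400]
cdf = empirical_cdf(toy_positions)

# Fraction of toy tags that fall upstream of the start codon.
print(cdf(-1))
```

A curve that rises steeply just upstream of position 0 would indicate that most tags map within a short putative 5′-UTR.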
Figure 3. Correlation between tag occurrence and ORF length/gene expression level. Each 5′-UTR unitag occurrence was plotted against the corresponding gene's (A) ORF length and (B) expression level (mRNA copies/cell). The expression level data were obtained from a website based on microarray data (24), and the ORF length was calculated based on SGD annotation (20). A linear regression model was applied (R² = 0.3726, P < 2.2 × 10⁻¹⁶). Negative (−0.6320, P = 1.67 × 10⁻⁶) and positive (0.3804, P < 2 × 10⁻¹⁶) correlation coefficients were observed from the model for ORF length (bp) and expression level (mRNA copies/cell), respectively.
Figure 4. Primer extension verification. Primer extension was used to map the TSS of 12 S.cerevisiae genes. For each gene, a gene-specific ³²P-end-labeled primer was used to reverse transcribe to the 5′ end of the respective mRNA. Fragment sizes were analyzed by denaturing PAGE and autoradiography. Lane M: ΦX174 Hinf I DNA markers; lane 1: primer extension reaction; lane 2: reaction without RNA (negative control). (a) The actual marker size (nt); (b) the corresponding position (bp) relative to the ATG start codon; (c) the assigned 5′ SAGE tag position (bp) with occurrence in parentheses (some are assigned to a single band because of the gel resolution); (d) numbers in brackets are the estimated positions (bp) of apparent bands without 5′ SAGE data.
Figure 5. The consensus sequence of the TSS. The sequence of ±10 bp flanking each TSS was extracted from the S.cerevisiae genomic sequence and analyzed using WebLOGO (38,39). Sequence LOGO of TSS flanking sequences derived from (A) all 4936 unitags mapping to the putative 5′-UTR region, (B) 1041 multiple occurrence unitags mapping to the putative 5′-UTR and (C) 3258 unitags mapping to the coding region and putative 3′-UTR (negative control).
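The flank extraction described in this caption can be sketched as follows. This is not the WebLOGO implementation; it only builds the per-position base frequency table that a sequence logo visualizes. The genome string and TSS coordinates are made-up toy values, coordinates are assumed to be 0-based, and strand handling is omitted.

```python
from collections import Counter


def flank(genome, tss, width=10):
    """Return bases from tss-width to tss+width (0-based), or None near the ends."""
    if tss - width < 0 or tss + width >= len(genome):
        return None
    return genome[tss - width : tss + width + 1]


def position_frequencies(flanks):
    """Per-column base frequencies for a list of equal-length sequences."""
    n = len(flanks)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*flanks)
    ]


# Toy sequence with two hypothetical TSS, each an A preceded by a pyrimidine.
genome = "A" * 10 + "TCAAG" + "A" * 11 + "CAATA" + "A" * 11
tss_sites = [12, 27]

flanks = [f for f in (flank(genome, t) for t in tss_sites) if f is not None]
freqs = position_frequencies(flanks)

# Index 10 is the TSS itself; index 9 is the base immediately upstream.
print(freqs[9], freqs[10])
```

With real data, columns dominated by a single base (such as the pyrimidine at −1 and the A at the TSS in the reported consensus) would stand out in the frequency table.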
Figure 6. New features predicted in S.cerevisiae. Insights from the TSS information in 5′ SAGE data, combined with comparative genomics methods, have versatile uses, including: (A) New gene discovery: synteny view of S.cerevisiae chromosome IV and III regions, which are believed to be duplicated regions resulting from the whole genome duplication. Each orthologous gene pair is shown in the same color. Two 5′ SAGE unitags with a total of three occurrences (yellow arrow) revealed a new gene, YCL048W-A, which is homologous to YDR524C-B. (B) Determining the real ATG start codon: two unitags, one having multiple occurrences, are mapped to the coding region of LSM6, while no tag is associated with its 5′-UTR. Protein sequence alignment to orthologs from other Saccharomyces species further supports the proposed LSM6 translation start position. (C) Search for putative regulatory RNA elements similar to SRG1–SER3. Two multiple occurrence tags upstream of the ODC2 coding region are shown (yellow arrow). The phylogenetic comparison showed homology around these two TSS positions among multiple species. There is also a conventional SAGE tag (green arrow) that maps to this region at position −286. (D) Example of a uORF-containing gene. Four unitags, one having multiple occurrences, were mapped to 300+ bp upstream of the PCL5 coding region. Two small uORFs (blue and red) are found and are conserved among all four sensu stricto species in terms of position, length and sequence. Five other hemiascomycete species also contain similar putative uORF(s) in that region. In C.glabrata, three uORFs are present, with two overlapping in different reading frames.
Annotation correction of ORF translational start position
| ORF | Gene | Unitag position (occurrence) | Chr | New ATG position | Predicted |
|---|---|---|---|---|---|
| YDL208W | NHP2 | 6(1) | 4 | +51 | Yes |
| YDR378C | LSM6 | | 4 | +111 | No |
| YER030W | | | 5 | +21 | Yes |
| YER050C | RSM18 | | 5 | +192 | No |
| YGR088W | CTT1 | | 7 | +33 | No |
| YHR163W | SOL3 | | 8 | +93 | Yes |
| YIL043C | CBR1 | 82(1) 94(1) | 9 | +114 | Yes |
| YIL053W | RHR2 | | 9 | +63 | No |
| YIL076W | SEC28 | | 9 | +189 | No |
| YJL046W | | 99(2) | 10 | +126 | No |
| YKR042W | UTH1 | | 11 | +255 | Yes |
| YOR091W | | 96(1) | 15 | +168 | No |
| YOR147W | MDM32 | 56(3) | 15 | +93 | No |
| YPR169W | JIP5 | 32(1) 50(1) | 16 | +66 | Yes |
Proposed translational start codon changes for 14 genes based on 5′ SAGE TSS data and alignment of multiple orthologous sequences. 5′ SAGE TSS positions are shown relative to the currently annotated ATG, with unitag occurrence in parentheses; multiple occurrence unitags are shown in bold. The proposed new downstream ATG start codon position is listed relative to the original start position.
a. ‘Predicted’ means the annotation change had been previously suggested by multiple Saccharomyces species comparison (42).
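The logic behind these proposals can be illustrated with a small consistency check: a TSS that maps inside the annotated coding region should still lie upstream of the proposed new ATG. The Python sketch below uses only the rows whose unitag positions are present in this text. The function name is ours, and it assumes that tag positions and new ATG positions share the same coordinate system (relative to the annotated ATG), so the differences are indicative rather than exact 5′-UTR lengths.

```python
# (ORF, unitag positions, proposed new ATG position), copied from the table
# rows whose unitag positions are available here.
rows = [
    ("YDL208W", [6], 51),
    ("YIL043C", [82, 94], 114),
    ("YJL046W", [99], 126),
    ("YOR091W", [96], 168),
    ("YOR147W", [56], 93),
    ("YPR169W", [32, 50], 66),
]


def distances_to_new_atg(tag_positions, new_atg):
    """Distance (bp) from each tag to the proposed ATG, in the table's coordinates."""
    return [new_atg - position for position in tag_positions]


results = {}
for orf, tags, new_atg in rows:
    distances = distances_to_new_atg(tags, new_atg)
    # A positive distance means the tag lies upstream of the proposed start codon.
    results[orf] = (distances, all(d > 0 for d in distances))
    print(orf, distances, "upstream of new ATG" if results[orf][1] else "check")
```

For every listed row, the tags fall upstream of the proposed start codon, which is the pattern expected if the downstream ATG is the real translation start.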