ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes.

Gang-Qing Hu, Xiaobin Zheng, Yi-Fan Yang, Philippe Ortet, Zhen-Su She, Huaiqiu Zhu.

Abstract

Correct annotation of the translation initiation site (TIS) is essential for both experimental and bioinformatics studies of the prokaryotic translation initiation mechanism, as well as for understanding gene regulation and gene structure. Here we describe a comprehensive database, ProTISA, which collects TISs confirmed by a variety of available evidence for prokaryotic genomes, including Swiss-Prot experimental records, literature, conserved domain hits and sequence alignment between orthologous genes. Moreover, by incorporating the predictions from our recently developed TIS post-processor, ProTISA provides a refined annotation for the public database RefSeq. Furthermore, the database annotates the potential regulatory signals associated with translation initiation in the region upstream of the TIS. As of July 2007, ProTISA includes 440 microbial genomes with more than 390 000 confirmed TISs. The database is available at http://mech.ctb.pku.edu.cn/protisa.

Substances: Codon, Initiator

Year: 2007 PMID: 17942412 PMCID: PMC2238952 DOI: 10.1093/nar/gkm799

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Over the past few years, the number of completed microbial genomes has grown exponentially. It is imperative to annotate a genome as precisely as possible, especially given the proliferation of genome-based experimental approaches such as DNA microarrays and protein arrays (1). Specifically, accurate translation initiation site (TIS) annotation is important for experiments such as identifying native purified proteins through N-terminal amino acid sequencing, as well as for work on heterologous protein products (1). Meanwhile, in silico studies of the translation initiation mechanism and gene regulation, as well as the prediction of operons, promoters and small untranslated RNAs, also rely on correct TIS annotation (2,3). A TIS can be reliably identified by experiments such as N-terminal protein sequencing. Unfortunately, such data cover only a small portion of all known proteins. Even for the best-studied genome, Escherichia coli K-12, as collected in EcoGene (4), fewer than a quarter of the proteins have been verified in this way. Nevertheless, as the number of proteomic projects increases, the number of TISs with experimental evidence is expected to grow significantly in the near future (5). On the other hand, progress has been made in reliable TIS identification through computational evidence such as sequence alignment (2,6). Frishman et al. (6) were the first to annotate open reading frames (ORFs) that have significant matches to known proteins, with hits distributed in a way that ensures the 5′-most candidate start codon is the true one. Recently, Makita et al. (2) introduced another method that identifies high-quality TISs by sequence alignment between orthologous genes, and applied the resulting dataset to evaluate the performance of TIS prediction. The sequence patterns around TISs have frequently been used for in silico studies of the translation initiation mechanism and hence for the design of TIS prediction algorithms (2,7–12).
According to textbooks, the ribosome is recruited to the mRNA to initiate translation by specific signals near the TIS, such as the start codon and the Shine-Dalgarno (SD) signal (13). However, for genes with no (or almost no) 5′UTR in the mRNA, i.e. leaderless genes, a transcriptional signal such as the Pribnow box (Bacteria) or the TATA box (Archaea) has been found upstream of the TIS instead of the SD signal (9,14–16). Recently, a comprehensive study of hundreds of prokaryotic genomes revealed that ‘non-SD-led genes are as common as SD-led genes’ (17). Thus, it is reasonable to expect that the complexity of the prokaryotic translation initiation mechanism will attract closer attention as more and more genomes are sequenced. In addition to the experimental data in public databases such as Swiss-Prot, the increasing number of sequenced genomes, which cover a wide range of prokaryotic branches, now allows a high-throughput approach that systematically collects confirmed TISs through database scanning and sequence alignment. It is also of interest to combine state-of-the-art prediction tools to refine the current public database annotation. Moreover, annotating regulatory signals upstream of TISs will facilitate studies of the initiation mechanism. Herein, we describe a comprehensive relational database, ProTISA, which is designed to collect confirmed TISs and to annotate potential transcriptional or translational signals immediately upstream of the TISs for each of the hundreds of currently available prokaryotic genomes. We expect that the database will serve prokaryotic genome annotation and facilitate a wide range of studies on translation initiation.

DATA COLLECTION

Annotation on TIS location

Confirmed TISs (IPT, CDC and HSC) were collected through database scanning, literature survey, conserved domain search and sequence alignment, as follows.
ImPorTed (IPT): a script was written to extract high-quality manual annotation from Swiss-Prot. The feature key ‘INIT_MET’ indicates whether the initiator methionine has been cleaved off. We extracted the Swiss-Prot entries whose initiator methionine is annotated as cleaved off. In addition, we collected experimentally confirmed data through a literature survey. Finally, the IPT data set was enriched by a simple N-terminal sequence comparison between closely related species.
Conserved domain confirmed (CDC): the method for identifying a TIS through conserved domain search is essentially similar to that in (6). We searched each gene against the Conserved Domain Database. The TIS of a gene with only one possible start codon upstream of the 5′-most conserved domain hit was readily identified. To compensate for random matches, which could lead to an incorrect CDC TIS annotation, we removed six amino acids from the most upstream hit before processing (18), since the frequency with which a hit overlaps the non-coding region by more than six amino acids is <1% when examined on genes with IPT TISs.
High similarity confirmed (HSC): identifying a TIS through sequence alignment between orthologous genes has been described in (2). Briefly, the TIS of a gene is determined by reference to an orthologous gene with a known TIS from another genus, and the two genes are required to align in the N-terminal region. In ProTISA, genes with TISs labeled as IPT or CDC constitute the references for determining HSC TISs. We are aware that errors in the IPT or CDC TISs might propagate into the HSC TISs via sequence alignment, especially among closely related genomes. To minimize such errors, an HSC TIS is annotated only if it is supported by orthologous genes from more than one genus.
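
The CDC rule described above can be illustrated with a minimal sketch. It is not the authors' C++ implementation. The start codon set (ATG, GTG, TTG), the function name `cdc_tis` and the input convention (an in-frame sequence that begins right after the upstream in-frame stop codon, with the domain hit position given as a 0-based codon index) are assumptions made for illustration only.

```python
# Hypothetical sketch of the CDC rule: accept a TIS only when exactly one
# candidate start codon lies at or upstream of the trimmed domain hit.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed start codon set
TRIM_AA = 6  # amino acids removed from the most upstream hit, as in the text


def cdc_tis(orf_dna, hit_start_aa):
    """Return the nucleotide offset of the unique candidate start, or None.

    orf_dna: in-frame coding-strand sequence starting right after the
             upstream in-frame stop codon (assumed input convention).
    hit_start_aa: 0-based codon index where the 5'-most domain hit begins.
    """
    # Trimming shifts the start of the hit downstream by TRIM_AA codons.
    boundary = hit_start_aa + TRIM_AA
    codons = [orf_dna[i:i + 3].upper() for i in range(0, len(orf_dna) - 2, 3)]
    # Candidate starts are start codons at or upstream of the trimmed hit.
    starts = [i for i, c in enumerate(codons[:boundary + 1]) if c in START_CODONS]
    # The TIS counts as confirmed only when the candidate is unambiguous.
    return starts[0] * 3 if len(starts) == 1 else None


# Toy sequence (made up): one ATG upstream, one GTG further downstream.
toy = "CCCAAAATG" + "GCT" * 7 + "GTGAAATAA"
print(cdc_tis(toy, 3))  # single candidate within the boundary
print(cdc_tis(toy, 5))  # two candidates within the boundary, so no call
```

In this sketch, a gene with several candidate starts upstream of the hit is simply left unannotated, which mirrors the statement that only genes with a single possible start codon are readily identified.
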
MED-Start is a TIS predictor that uses an iterative self-training algorithm based on a four-component statistical model of the TIS in prokaryotic genomes (7). Its high performance has been demonstrated by evaluation on the E. coli and Bacillus subtilis genomes. For the present work, which computationally relocates TISs for a large number of bacterial and archaeal genomes, several improvements introduced in another of our studies (9) were made to the original MED-Start algorithm. First, the modified algorithm accounts for the effects of the genomic background when recovering regulatory motifs and when characterizing the sequence patterns around the motifs and the start codon. Second, the operon structure of prokaryotic genomes is taken into account. Third, the bias in codon positional GC-content is used to describe the coding potential of the context around a candidate start. Details of the improved algorithm are available from http://ctb.pku.edu.cn/main/SheGroup/MEDStartPlus.htm. The TISs confirmed by the above-mentioned evidence, together with the improved predictions (MED), then serve as a refined annotation resource for public databases such as RefSeq.
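
As an illustration of the codon positional GC-content mentioned above, the following sketch computes the GC fraction at each of the three codon positions of an in-frame sequence. How the improved MED-Start turns this bias into a coding-potential measure is not specified here, so the sketch stops at the raw per-position fractions. The function name `codon_position_gc` is hypothetical.

```python
def codon_position_gc(seq):
    """Return the GC fraction at codon positions 1, 2 and 3.

    seq is assumed to be in frame; trailing bases that do not fill a
    complete codon are ignored.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    fractions = []
    for pos in range(3):
        # Every third base starting at this codon position.
        bases = seq[pos:usable:3]
        gc = sum(b in "GC" for b in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


# Toy example (made up): codons ATG, GCC, GCG.
gc1, gc2, gc3 = codon_position_gc("ATGGCCGCG")
print(gc1, gc2, gc3)
```

In GC-rich coding regions the third codon position typically shows the strongest GC enrichment, which is why a positional profile can help separate coding from non-coding context around a candidate start.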

Annotation on regulatory signal

We implemented a MEME-like algorithm to find signals upstream of the TIS (19). It combines the positional weight matrix (PWM) of the signal, the distribution of the number of nucleotides between the signal and the TIS (the spacer length), and the background nucleotide frequencies into a likelihood function. An EM algorithm and a simulated annealing strategy were used to estimate the parameters. To classify the signals, we first included two kinds of typical signals as references, i.e. the widely accepted SD consensus ‘AAGGAGGTGA’ (3) and the Pribnow (or TATA) box. We used the PWM of the −10 promoter in E. coli K-12 for the Pribnow box (20) and the PWM of the AT-rich motif found in Archaeoglobus fulgidus for the TATA box. Two scores were calculated for each signal, i.e. the SD score and the TA score. The former is calculated by matching the reference consensus against the PWM of the signal, and the latter is measured by the Euclidean distance between the PWM of the signal and the reference PWM. Interestingly, each score follows a bimodal distribution, which allows us to readily classify the signals into three categories: (i) TA-like, those that resemble the Pribnow (or TATA) box; (ii) SD-like, those that resemble the SD signal; and (iii) atypical, those that resemble neither the SD signal nor the Pribnow (or TATA) box. A Bayesian methodology is employed to predict the potential signal upstream of each TIS. We introduce a scoring function that measures the significance of a string as a signal compared with a random background. A string is more likely to be a functional signal than a random sequence if its score is >0. Using the PWM and the spacer length distribution of the signal, we score each substring in the TIS-upstream sequence and select the one with the highest score as the potential signal. Details of the methods are available from http://mech.ctb.pku.edu.cn/protisa.
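
The scanning step described above can be sketched as follows. The exact scoring function used by ProTISA is documented on the project website and is not reproduced here. This sketch assumes a plain log-likelihood ratio: the PWM log-odds against the background plus the log ratio of the spacer probability to a uniform spacer distribution. The function name `best_signal`, the pseudo-probability for unseen spacer lengths and the toy parameters are all hypothetical.

```python
import math


def best_signal(upstream, pwm, spacer_prob, background):
    """Return (score, offset, spacer) of the top-scoring window, or None.

    upstream: sequence ending immediately before the start codon.
    pwm: list of dicts (one per motif position) mapping base -> probability;
         all probabilities are assumed to be non-zero (pseudocounts applied).
    spacer_prob: dict mapping spacer length -> probability.
    background: dict mapping base -> background frequency.
    """
    width = len(pwm)
    n_windows = len(upstream) - width + 1
    if n_windows <= 0:
        return None
    uniform = 1.0 / n_windows  # assumed null model for the spacer length
    best = None
    for offset in range(n_windows):
        window = upstream[offset:offset + width]
        # Spacer: nucleotides between the end of the window and the TIS.
        spacer = len(upstream) - (offset + width)
        score = math.log(spacer_prob.get(spacer, 1e-6) / uniform)
        score += sum(math.log(col[b] / background[b])
                     for col, b in zip(pwm, window))
        if best is None or score > best[0]:
            best = (score, offset, spacer)
    return best


# Toy parameters (made up): a 4-column PWM favouring GGAG.
bg = {b: 0.25 for b in "ACGT"}
toy_pwm = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "GGAG"]
toy_spacer = {4: 0.25, 5: 0.5, 6: 0.25}
score, offset, spacer = best_signal("TTTTGGAGTTTTT", toy_pwm, toy_spacer, bg)
print(offset, spacer, score > 0)
```

Under this assumed form, a positive score means the window is better explained by the signal model than by the background, which matches the decision rule stated in the text.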

DATA STATISTICS

As of July 2007, ProTISA provides refined annotations for 440 genomes in RefSeq, with more than 390 000 confirmed TISs: IPT (8898), CDC (302 192) and HSC (258 031) (Table 1). The percentage of confirmed TISs in a genome varies from 9% to 73%, with an average of 33%. The group Gammaproteobacteria contributes the largest share of the data, and this is even more evident for the IPT TISs.

Table 1.

Statistics of confirmed TISs (as of July 2007)

Kingdom	Group	Genome No.	IPT No.	CDC No.	HSC No.	Gene No.^a
Archaea	Crenarchaeota	8	207	4238	1454	4729
	Euryarchaeota	23	156	13 175	7330	15 389
	Nanoarchaeota	1	0	152	54	163
Bacteria	Acidobacteria	2	0	1836	932	2213
	Actinobacteria	36	286	21 137	12 561	27 297
	Aquificae	1	3	687	350	746
	Bacteroidetes/Chlorobi	11	10	6782	5431	8516
	Chlamydiae/Verrucomicrobia	11	2	3405	2212	3858
	Chloroflexi	2	0	911	701	1099
	Cyanobacteria	22	277	13 461	9682	16 855
	Deinococcus-Thermus	4	99	2011	1276	2603
	Firmicutes	89	864	63 647	51 122	77 320
	Fusobacteria	1	1	773	419	837
	Planctomycetes	1	0	439	429	655
	Alphaproteobacteria	53	138	31 928	32 043	45 656
	Betaproteobacteria	36	52	23 893	23 601	34 828
	Gammaproteobacteria	103	6716	93 832	93 644	127 165
	Deltaproteobacteria	14	17	8932	7416	11 996
	Epsilonproteobacteria	11	39	6051	4143	6911
	Spirochaetes	9	23	3906	2580	4635
	Thermotogae	1	8	497	292	591
	Other Bacteria	1	0	499	359	663
Sum	–	440	8898	302 192	258 031	394 725

^a Number of genes with at least one confirmed TIS. A TIS might be confirmed by several lines of evidence. About 1–2% of the genes have more than one confirmed TIS.

Transcriptional and translational signals are classified into three classes: SD-like, TA-like and atypical. Of the 440 genomes, nearly half (212) were found to have only SD-like signals (mainly in Firmicutes and Proteobacteria), and 22 genomes only atypical signals (mainly Bacteroidetes/Chlorobi and Cyanobacteria). The other genomes have dual signals: SD-like and TA-like signals were found in 76 genomes (mainly Actinobacteria and Archaea), SD-like and atypical signals in 126 genomes (mainly Proteobacteria) and TA-like and atypical signals in four genomes (Table 2).

Table 2.

Statistics of genomes with specific signals (as of July 2007)

Kingdom	Group	SD_like only	Atypical only	SD_like and TA_like	SD_like and Atypical	TA_like and Atypical
Archaea	Crenarchaeota	–	–	6	2	–
	Euryarchaeota	5	–	16	–	2
	Nanoarchaeota	–	–	1	–	–
Bacteria	Acidobacteria	–	–	2	–	–
	Actinobacteria	1	–	33	–	2
	Aquificae	–	–	1	–	–
	Bacteroidetes/Chlorobi	–	6	–	5	–
	Chlamydiae/Verrucomicrobia	1	–	–	10	–
	Chloroflexi	1	–	1	–	–
	Cyanobacteria	–	11	–	11	–
	Deinococcus-Thermus	–	–	4	–	–
	Firmicutes	79	2	8	–	–
	Fusobacteria	1	–	–	–	–
	Thermotogae	1	–	–	–	–
	Planctomycetes	–	1	–	–	–
	Alphaproteobacteria	15	1	–	37	–
	Betaproteobacteria	2	–	–	34	–
	Gammaproteobacteria	82	1	–	20	–
	Deltaproteobacteria	9	–	2	3	–
	Epsilonproteobacteria	11	–	–	–	–
	Spirochaetes	3	–	2	4	–
	Other Bacteria	1	–	–	–	–
Sum	–	212	22	76	126	4

DATA ACCESS

ProTISA was implemented in an Apache/PHP/MySQL environment on a Linux platform. The basic functions are browsing the stored data and searching the database with user-specified input. The browse page is composed of two sections. The first section shows general information about a genome, such as organism name, taxonomic group and genomic GC-content. This section also displays a sequence logo (21) and a histogram of the spacer length for each signal (Figure 1). The second section contains the TIS annotation, with start site and initiation signal, for each gene. It shows the gene coordinates, gene identity (PID and gene name), TIS evidence type (IPT, CDC, HSC or MED) and the predicted signal. It also provides links to the evidence supporting each confirmed TIS: PMID or external database links for IPT TISs, conserved domain search results for CDC TISs and multiple N-terminal sequence alignments among orthologous genes for HSC TISs.

Figure 1.

Sequence logo and spacer length distribution of representative signals for the genomes (A) E. coli K-12; (B) S. coelicolor; (C) A. fulgidus; and (D) Synechocystis sp. PCC 6803. The positional weight matrix of the signal is visualized by a sequence logo in which the height of a letter at a given position is proportional to its frequency of occurrence. A letter is shown upside down if its frequency is lower than the background frequency. The consensus is shown below the logo. The spacer length is defined as the distance (the number of nucleotides) between the TIS and each annotated signal, where the signals are identified using the positional weight matrix visualized in the sequence logo.

The web interface lets users search the TIS annotation by specifying a region in the genome sequence or a gene identifier such as the name or PID. Users can also specify the TIS evidence types and compare the output with the RefSeq annotation. The annotation on which the web server is built is available for batch download. The files can be easily imported into a database management system such as MySQL. In addition, the source code (written in C++) for the generation of CDC/HSC TISs, the new version of MED-Start and the motif-finding algorithm is freely available on our website under the GNU GPL license. The reference genes used for HSC TIS creation are also compiled in a FASTA file for download.
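
To make the region search concrete, here is a minimal sketch of loading TIS annotation into a relational table and querying it by genomic region and evidence type. The layout of the ProTISA download files and the actual database schema are not described in this text, so the table name `tis`, its columns and the placeholder rows are all hypothetical. SQLite (from the Python standard library) is used in place of MySQL only to keep the sketch self-contained.

```python
import sqlite3

# Hypothetical schema; the real ProTISA schema is not described in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tis (pid TEXT, gene TEXT, strand TEXT, "
    "left_end INTEGER, right_end INTEGER, evidence TEXT)"
)
# Placeholder rows, not real annotation data.
conn.executemany(
    "INSERT INTO tis VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("0001", "geneA", "+", 100, 400, "IPT"),
        ("0002", "geneB", "-", 500, 900, "CDC"),
        ("0003", "geneC", "+", 1000, 1600, "MED"),
    ],
)


def search_region(conn, lo, hi, evidence=()):
    """Return genes lying entirely within [lo, hi], optionally filtered by evidence type."""
    sql = ("SELECT gene, left_end, right_end, evidence FROM tis "
           "WHERE left_end >= ? AND right_end <= ?")
    params = [lo, hi]
    if evidence:
        # One placeholder per requested evidence type keeps the query parameterized.
        sql += " AND evidence IN (%s)" % ",".join("?" * len(evidence))
        params.extend(evidence)
    sql += " ORDER BY left_end"
    return conn.execute(sql, params).fetchall()


print(search_region(conn, 1, 950))
print(search_region(conn, 1, 950, evidence=("CDC", "HSC")))
```

The same query shape carries over to MySQL, with the driver's own placeholder syntax substituted for `?`.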

CONCLUSIONS AND FUTURE DIRECTIONS

Despite the remarkable progress made in computational annotation, publications continue to raise concerns about TIS annotation quality in public databases such as RefSeq (1,5,9,22,23). A notable feature of ProTISA is its compilation of reliable TISs from experimental evidence, literature, conserved domain search, sequence alignment and accurate prediction. It is informative to use the most reliable resource, the IPT TISs, to estimate the reliability of the CDC, HSC and MED TISs. After removing redundancy from closely related genomes, we collected a set of 3413 IPT TISs as a benchmark, on which the CDC TISs reach an accuracy of 99.4% and the HSC TISs an accuracy of 99.0%. For the MED TISs predicted by the modified MED-Start algorithm, the accuracy against the same benchmark reaches 92.9%. It has been argued that the signal upstream of the TIS usually reflects the translation initiation mechanism (3,9,14,17,24). Another merit of ProTISA is its annotation of transcriptional and translational signals, which is visualized for each genome by a sequence logo for the signal content and a histogram for the distribution of spacer lengths to the TISs (Figure 1). This should help biologists to form hypotheses about the initiation mechanism of a specific genome (14). For example, in addition to the SD-like motif, we found in Streptomyces coelicolor a ‘TANNNT’ motif that closely resembles the previously reported Pribnow box, which is generally located 10 bp upstream of the transcription start site (TSS) (20). Moreover, the motif has a conserved position about 10 bp upstream of the TIS, counting from the TIS to the 5′ T in the ‘TANNNT’ motif (not shown in Figure 1). In other words, for some genes, the TSS is located just a few base pairs upstream of, or even overlaps with, the TIS, resulting in a leaderless gene. This suggests that an initiation mechanism for leaderless genes exists in S. coelicolor, which is consistent with the results reported in (25).
Interestingly, this motif was also found in several bacterial groups, for example Actinobacteria, Deinococcus-Thermus and Firmicutes, implying that leaderless genes may not be the marginal phenomenon, in terms of bacterial gene structure, that they are usually believed to be (26). An atypical signal is likely to be functional, especially given its conserved position relative to the TIS. For instance, we detected in Synechocystis sp. PCC 6803 a conserved dual-pyrimidine located immediately upstream of the TISs (Figure 1D), which is consistent with the findings in (24). This phenomenon was also found in several other genera. Such a signal could serve as a target for experiments to decipher its regulatory role, leading to a better understanding of the initiation mechanism. With the growing number of completely sequenced bacterial and archaeal genomes, the scientific value of a dedicated resource for TISs and the corresponding initiation signals is clear. We hope to increase the update frequency so that the database stays current as each new prokaryotic genome becomes available at NCBI. We plan to make ProTISA an evolving resource and to add significant functionality over time. One direction of ongoing development is to explore comparative genome analysis based on the divergent translation initiation mechanisms of Bacteria and Archaea. This is somewhat challenging, since it is difficult to develop a quantitative model that describes these complex mechanisms. ProTISA may also add features such as gene classification based on initiation signals. To sum up, we believe that the resource will expand to suit the needs and requests of the research community for translation initiation studies.
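
The benchmark comparison reported in this section (CDC, HSC and MED TISs evaluated against IPT TISs) amounts to a simple agreement rate. The sketch below assumes that accuracy is the fraction of genes, among those present in both sets, whose TIS coordinate matches the benchmark. The text does not define the denominator, so this is an assumption, and the function name `tis_accuracy` and the toy coordinates are hypothetical.

```python
def tis_accuracy(benchmark, predicted):
    """Fraction of shared genes whose predicted TIS equals the benchmark TIS.

    benchmark, predicted: dicts mapping gene identifier -> TIS coordinate.
    Returns None when the two sets share no genes.
    """
    shared = [gene for gene in benchmark if gene in predicted]
    if not shared:
        return None
    hits = sum(benchmark[gene] == predicted[gene] for gene in shared)
    return hits / len(shared)


# Toy coordinates (made up): g1 agrees, g2 disagrees, g3 and g4 are unshared.
bench = {"g1": 100, "g2": 250, "g3": 900}
pred = {"g1": 100, "g2": 259, "g4": 5}
print(tis_accuracy(bench, pred))
```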

25 in total

1. EcoGene: a genome sequence database for Escherichia coli K-12.

Authors: K E Rudd
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.

Authors: J Besemer; A Lomsadze; M Borodovsky
Journal: Nucleic Acids Res Date: 2001-06-15 Impact factor: 16.971

3. A probabilistic method for identifying start codons in bacterial genomes.

Authors: B E Suzek; M D Ermolaeva; M Schreiber; S L Salzberg
Journal: Bioinformatics Date: 2001-12 Impact factor: 6.937

Review 4. Leaderless mRNAs in bacteria: surprises in ribosomal recruitment and translational control.

Authors: Isabella Moll; Sonja Grill; Claudio O Gualerzi; Udo Bläsi
Journal: Mol Microbiol Date: 2002-01 Impact factor: 3.501

5. Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures.

Authors: Jiong Ma; Allan Campbell; Samuel Karlin
Journal: J Bacteriol Date: 2002-10 Impact factor: 3.490

6. WebLogo: a sequence logo generator.

Authors: Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner
Journal: Genome Res Date: 2004-06 Impact factor: 9.043

Review 7. Compilation and analysis of DNA sequences associated with apparent streptomycete promoters.

Authors: W R Strohl
Journal: Nucleic Acids Res Date: 1992-03-11 Impact factor: 16.971

8. Large-scale identification of N-terminal peptides in the halophilic archaea Halobacterium salinarum and Natronomonas pharaonis.

Authors: Michalis Aivaliotis; Kris Gevaert; Michaela Falb; Andreas Tebbe; Kosta Konstantinidis; Birgit Bisle; Christian Klein; Lennart Martens; An Staes; Evy Timmerman; Jozef Van Damme; Frank Siedler; Friedhelm Pfeiffer; Joël Vandekerckhove; Dieter Oesterhelt
Journal: J Proteome Res Date: 2007-04-20 Impact factor: 4.466

9. GS-Finder: a program to find bacterial gene start sites with a self-training method.

Authors: Hong-Yu Ou; Feng-Biao Guo; Chun-Ting Zhang
Journal: Int J Biochem Cell Biol Date: 2004-03 Impact factor: 5.085

10. EasyGene--a prokaryotic gene finder that ranks ORFs by statistical significance.

Authors: Thomas Schou Larsen; Anders Krogh
Journal: BMC Bioinformatics Date: 2003-06-03 Impact factor: 3.169
