
ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes.

Gang-Qing Hu, Xiaobin Zheng, Yi-Fan Yang, Philippe Ortet, Zhen-Su She, Huaiqiu Zhu.

Abstract

Correct annotation of translation initiation sites (TISs) is essential for both experimental and bioinformatics studies of the prokaryotic translation initiation mechanism, as well as for understanding gene regulation and gene structure. Here we describe a comprehensive database, ProTISA, which collects TISs confirmed through a variety of available evidence for prokaryotic genomes, including Swiss-Prot experimental records, literature, conserved domain hits and sequence alignment between orthologous genes. Moreover, by combining the predictions from our recently developed TIS post-processor, ProTISA provides a refined annotation for the public database RefSeq. Furthermore, the database annotates the potential regulatory signals associated with translation initiation in the region upstream of the TIS. As of July 2007, ProTISA includes 440 microbial genomes with more than 390 000 confirmed TISs. The database is available at http://mech.ctb.pku.edu.cn/protisa.

Year:  2007        PMID: 17942412      PMCID: PMC2238952          DOI: 10.1093/nar/gkm799

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Over the past few years, the number of completed microbial genomes has grown exponentially. It is imperative to annotate a genome as precisely as possible, especially given the flourishing of genome-based experimental approaches such as DNA microarrays and protein arrays (1). Specifically, accurate translation initiation site (TIS) annotation is important for experiments such as identifying native purified proteins through N-terminal amino acid sequencing, as well as heterologous protein products (1). Meanwhile, in silico studies of the translation initiation mechanism and of gene regulation, as well as the prediction of operons, promoters and small untranslated RNAs, also rely on correct TIS annotation (2,3). A TIS can be reliably identified by experiments such as N-terminal protein sequencing. Unfortunately, such data cover only a small portion of all known proteins. Even for the best-studied genome, Escherichia coli K-12, as collected in EcoGene (4), less than a quarter of the proteins have been verified in this way. Nevertheless, as the number of proteomic projects increases, the number of TISs with experimental evidence is expected to grow significantly in the near future (5). On the other hand, progress has been made in reliable TIS identification through computational evidence such as sequence alignment (2,6). Frishman et al. (6) were the first to annotate open reading frames (ORFs) that have significant matches to known proteins, with hits distributed in a way that ensures the 5′-most candidate start codon is the true one. Recently, Makita et al. (2) introduced another method to identify high-quality TISs by sequence alignment between orthologous genes and applied the resulting dataset to evaluate the performance of TIS prediction. The sequence patterns around TISs have frequently been used for in silico study of the translation initiation mechanism and thus for the design of TIS prediction algorithms (2,7–12).
In the textbook view, the ribosome is recruited to the mRNA to initiate translation by specific signals near the TIS, such as the start codon and the Shine-Dalgarno (SD) signal (13). However, for genes with no (or almost no) 5′UTR in the mRNA, i.e. leaderless genes, a transcriptional signal such as the Pribnow box (Bacteria) or the TATA box (Archaea) has been found upstream of the TIS instead of the SD signal (9,14–16). Recently, a comprehensive study of hundreds of prokaryotic genomes revealed that ‘non-SD-led genes are as common as SD-led genes’ (17). Thus, it is reasonable to expect that the complexity of the prokaryotic translation initiation mechanism will attract closer attention as more and more genomes are sequenced. In addition to the experimental data in public databases such as Swiss-Prot, the increasing number of sequenced genomes, which cover a wide range of prokaryotic branches, now allows a high-throughput approach to systematically collect confirmed TISs through database scanning and sequence alignment. It is also of interest to combine state-of-the-art prediction tools to refine the current public database annotation. Moreover, annotating regulatory signals upstream of the TIS will facilitate studies of the initiation mechanism. Herein, we describe a comprehensive relational database, ProTISA, which is designed to collect confirmed TISs and to annotate potential transcriptional or translational signals immediately upstream of the TISs for each of the hundreds of prokaryotic genomes currently available. We expect that the database will serve prokaryotic genome annotation and facilitate a wide range of studies on translation initiation.

DATA COLLECTION

Annotation on TIS location

Confirmed TISs (IPT, CDC and HSC) were collected through database scanning, literature survey, conserved domain search and sequence alignment as follows.
ImPorTed (IPT): a script was written to extract high-quality manual annotation from Swiss-Prot. The feature key ‘INIT_MET’ is used to indicate whether the initiator methionine has been cleaved off or not. We extracted the Swiss-Prot entries in which it is identified as being cleaved off. In addition, we collected experimentally confirmed data by literature survey. Finally, the IPT data set was enriched by a simple N-terminal sequence comparison between closely related species.
Conserved domain confirmed (CDC): the method to identify a TIS through conserved domain search is essentially similar to that in (6). We searched each gene against the Conserved Domain Database. The TIS of a gene with only one possible start codon upstream of the 5′-most conserved domain hit was readily identified. To compensate for random matches, which could lead to an incorrect CDC-TIS annotation, we removed six amino acids from the most upstream hit before processing (18), since the frequency with which a hit overlaps the non-coding region by more than six amino acids is <1% when examined on genes with IPT TISs.
High similarity confirmed (HSC): identifying a TIS through sequence alignment between orthologous genes has been described in (2). Briefly, the method determines the TIS of a gene by referring to an orthologous gene with a known TIS from another genus, and it requires that the two genes align in the N-terminal region. In ProTISA, genes with TISs labeled as IPT or CDC constitute the references used to determine HSC TISs. We are aware that errors in the IPT or CDC TISs might propagate into the HSC TISs via sequence alignment, especially among closely related genomes. To minimize such errors, an HSC TIS is annotated only if it is supported by orthologous genes from more than one genus.
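The CDC rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the ProTISA C++ implementation: the extended ORF is assumed to be a forward-strand string read in frame, `hit_start_aa` is assumed to be the 0-based codon index at which the most upstream conserved-domain hit begins, and ATG, GTG and TTG are assumed to be the candidate start codons. All function and parameter names are hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed candidate start codons
TRIM_AA = 6  # residues trimmed from the 5' end of the most upstream hit


def candidate_starts(orf, hit_start_aa, trim_aa=TRIM_AA):
    """Codon indices of in-frame candidate starts at or upstream of the trimmed hit."""
    # Trimming the hit moves its effective start `trim_aa` codons downstream,
    # which makes the "only one candidate" test stricter.
    limit = hit_start_aa + trim_aa
    n_codons = len(orf) // 3
    return [
        i
        for i in range(min(limit + 1, n_codons))
        if orf[3 * i:3 * i + 3].upper() in START_CODONS
    ]


def cdc_tis(orf, hit_start_aa):
    """Return the codon index of the TIS if it is unambiguous, else None."""
    candidates = candidate_starts(orf, hit_start_aa)
    return candidates[0] if len(candidates) == 1 else None


# Toy ORF: ATG, ten AAA codons, one GTG, five CCC codons.
toy_orf = "ATG" + "AAA" * 10 + "GTG" + "CCC" * 5
unambiguous = cdc_tis(toy_orf, hit_start_aa=2)  # only the ATG lies upstream
ambiguous = cdc_tis(toy_orf, hit_start_aa=6)    # the GTG also qualifies
```

A gene for which the function returns None would simply receive no CDC TIS in this sketch; it says nothing about how such genes are handled elsewhere in the pipeline.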
MED-Start is a TIS predictor with an iterative self-training algorithm based on a four-component statistical model that describes the TIS in prokaryotic genomes (7). The high performance of MED-Start has been demonstrated by evaluation on the E. coli and Bacillus subtilis genomes. For the present work, which computationally relocates TISs for a large number of genomes from both Bacteria and Archaea, several improvements used in another of our studies (9) have been made to the original MED-Start algorithm. The modification first accounts for the effects of the genomic background in order to recover regulatory motifs and to characterize the sequence patterns around the motifs and the start codon. The operon structure in prokaryotic genomes was also taken into account. Further, the bias of the codon positional GC-content was applied to describe the coding potential of the context around a candidate start. Details of the improved algorithm are available from http://ctb.pku.edu.cn/main/SheGroup/MEDStartPlus.htm. The TISs confirmed by the above-mentioned evidence, together with the improved predictions (MED), then served as a refined annotation resource for public databases such as RefSeq.
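As a small illustration of the codon positional GC-content mentioned above, the sketch below computes the GC fraction at each of the three codon positions of an in-frame sequence. It shows only the raw statistic; how the improved MED-Start turns this bias into a coding-potential term is not described here, and the function name is hypothetical.

```python
def codon_position_gc(seq):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this codon position
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


# Codons ATG, GCC, GCG: position 1 has A, G, G; position 2 has T, C, C; position 3 has G, C, G.
gc1, gc2, gc3 = codon_position_gc("ATGGCCGCG")
```

In coding regions the three fractions typically differ from one another, whereas in non-coding sequence they tend to be similar, which is why such a statistic can help to separate the two around a candidate start.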

Annotation on regulatory signal

We implemented a MEME-like algorithm to find signals upstream of the TIS (19). It combines the positional weight matrix (PWM) of the signal, the distribution of the number of nucleotides between the signal and the TIS (spacer length), and the background nucleotide frequencies into a likelihood function. An EM algorithm and a simulated annealing strategy were used to estimate the parameters. To classify the signals, we first included two kinds of typical signals as references, i.e. the widely accepted SD consensus ‘AAGGAGGTGA’ (3) and the Pribnow (or TATA) box. We used the PWM of the −10 promoter in E. coli K-12 for the Pribnow box (20), and the PWM of the AT-rich motif found in Archaeoglobus fulgidus for the TATA box. Two scores were calculated for each signal, i.e. the SD score and the TA score. The former is calculated by matching the reference consensus against the PWM of the signal, and the latter is measured by the Euclidean distance between the PWM of the signal and the reference PWM. Interestingly, each score follows a bimodal distribution, which allows us to readily classify the signals into three categories: (i) TA-like, those resembling the Pribnow (or TATA) box; (ii) SD-like, those resembling the SD signal; and (iii) atypical, those resembling neither the SD signal nor the Pribnow (or TATA) box. A Bayesian methodology is employed to predict the potential signal upstream of each TIS. We introduce a scoring function to measure the significance of a string as a signal compared with a random background. A string is more likely to be a functional signal than a random sequence if its score is >0. With the PWM and the spacer length distribution of the signal, we score each substring in the TIS-upstream sequence and select the one with the highest score as the potential signal. Details of the methods are available from http://mech.ctb.pku.edu.cn/protisa
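The scanning step described above can be sketched as follows. The exact scoring function used by ProTISA is not given in the text, so the form here is an assumption: a log-odds of the PWM against the background, plus a log-odds of the spacer probability against a uniform distribution over the modelled spacer lengths. The toy PWM, the spacer distribution and all names are hypothetical and serve only to make the example runnable. The last function shows the Euclidean distance between two PWMs, the measure named for the TA score.

```python
import math


def signal_score(site, spacer, pwm, spacer_prob, background):
    """Log-odds of `site` at distance `spacer` under the signal model vs. background (assumed form)."""
    score = 0.0
    for column, base in zip(pwm, site):
        score += math.log2(column[base] / background[base])
    uniform = 1.0 / len(spacer_prob)
    # Spacer lengths outside the modelled range get a small pseudo-probability.
    score += math.log2(spacer_prob.get(spacer, 1e-6) / uniform)
    return score


def best_signal(upstream, pwm, spacer_prob, background):
    """Score every window of the TIS-upstream sequence; return (score, start, site) of the best."""
    width = len(pwm)
    best = None
    for start in range(len(upstream) - width + 1):
        site = upstream[start:start + width]
        spacer = len(upstream) - (start + width)  # nucleotides between site and TIS
        score = signal_score(site, spacer, pwm, spacer_prob, background)
        if best is None or score > best[0]:
            best = (score, start, site)
    return best


def pwm_distance(pwm_a, pwm_b):
    """Euclidean distance between two PWMs of equal width."""
    return math.sqrt(
        sum((a[base] - b[base]) ** 2 for a, b in zip(pwm_a, pwm_b) for base in "ACGT")
    )


# Toy model favouring the 4-mer GGAG (made-up numbers, not a ProTISA matrix).
g_col = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
a_col = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
toy_pwm = [g_col, g_col, a_col, g_col]
toy_spacer = {5: 0.2, 6: 0.3, 7: 0.3, 8: 0.2}
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hit = best_signal("TTTTGGAGAAAAAAA", toy_pwm, toy_spacer, uniform_bg)
```

With this toy input the best window is the GGAG at offset 4, seven nucleotides upstream of the TIS, and its score is positive, so under the rule quoted above it would be reported as a potential signal.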

DATA STATISTICS

As of July 2007, ProTISA provides refined annotations for 440 genomes in RefSeq, with more than 390 000 confirmed TISs: IPT (8898), CDC (302 192) and HSC (258 031) (Table 1). The percentage of confirmed TISs in a genome varies from 9% to 73%, with an average of 33%. The group Gammaproteobacteria contributes the majority of the data collection, most evidently for the IPT TISs.
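Table 1. Statistics of confirmed TISs (as of July 2007)

| Kingdom | Group | Genome No. | IPT No. | CDC No. | HSC No. | Gene No.(a) |
|---|---|---|---|---|---|---|
| Archaea | Crenarchaeota | 8 | 207 | 4238 | 1454 | 4729 |
| | Euryarchaeota | 23 | 156 | 13 175 | 7330 | 15 389 |
| | Nanoarchaeota | 1 | 0 | 152 | 54 | 163 |
| Bacteria | Acidobacteria | 2 | 0 | 1836 | 932 | 2213 |
| | Actinobacteria | 36 | 286 | 21 137 | 12 561 | 27 297 |
| | Aquificae | 1 | 3 | 687 | 350 | 746 |
| | Bacteroidetes/Chlorobi | 11 | 10 | 6782 | 5431 | 8516 |
| | Chlamydiae/Verrucomicrobia | 11 | 2 | 3405 | 2212 | 3858 |
| | Chloroflexi | 2 | 0 | 911 | 701 | 1099 |
| | Cyanobacteria | 22 | 277 | 13 461 | 9682 | 16 855 |
| | Deinococcus-Thermus | 4 | 99 | 2011 | 1276 | 2603 |
| | Firmicutes | 89 | 864 | 63 647 | 51 122 | 77 320 |
| | Fusobacteria | 1 | 1 | 773 | 419 | 837 |
| | Planctomycetes | 1 | 0 | 439 | 429 | 655 |
| | Alphaproteobacteria | 53 | 138 | 31 928 | 32 043 | 45 656 |
| | Betaproteobacteria | 36 | 52 | 23 893 | 23 601 | 34 828 |
| | Gammaproteobacteria | 103 | 6716 | 93 832 | 93 644 | 127 165 |
| | Deltaproteobacteria | 14 | 17 | 8932 | 7416 | 11 996 |
| | Epsilonproteobacteria | 11 | 39 | 6051 | 4143 | 6911 |
| | Spirochaetes | 9 | 23 | 3906 | 2580 | 4635 |
| | Thermotogae | 1 | 8 | 497 | 292 | 591 |
| | Other Bacteria | 1 | 0 | 499 | 359 | 663 |
| Sum | | 440 | 8898 | 302 192 | 258 031 | 394 725 |

(a) Number of genes with at least one confirmed TIS. A TIS might be confirmed by several kinds of evidence. About 1–2% of the genes have more than one confirmed TIS.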

Transcriptional and translational signals are classified into three classes: SD-like, TA-like and atypical signals. Of the 440 genomes, nearly half (212) were reported with only SD-like signals (mainly in Firmicutes and Proteobacteria) and 22 genomes with only atypical signals (Bacteroidetes/Chlorobi and Cyanobacteria). The other genomes were found with dual signals: SD-like and TA-like signals were found in 76 genomes (Actinobacteria and Archaea), SD-like and atypical signals in 126 genomes (Proteobacteria) and TA-like and atypical signals in four genomes (Table 2).
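As a quick arithmetic check of the figures just quoted, the five signal categories should account for all 440 genomes. The snippet below uses only the counts stated in the text; the variable names are arbitrary.

```python
# Genome counts per signal category, as stated in the text above.
signal_classes = {
    "SD-like only": 212,
    "atypical only": 22,
    "SD-like and TA-like": 76,
    "SD-like and atypical": 126,
    "TA-like and atypical": 4,
}

total = sum(signal_classes.values())  # should equal the 440 genomes in ProTISA

# Share of each category as a percentage of all genomes, rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in signal_classes.items()}
```

The categories sum to 440, and the SD-like-only class comes to roughly 48% of the genomes, consistent with the "nearly half" stated above.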
Table 2. Statistics of genomes with specific signals (as of July 2007)

Columns: Kingdom | Group | SD_like only | Atypical only | SD_like and TA_like | SD_like and Atypical | TA_like and Atypical
(Empty cells were not preserved in this copy; each row gives its non-empty counts run together in column order.)

Archaea
Crenarchaeota: 62
Euryarchaeota: 5162
Nanoarchaeota: 1
Bacteria
Acidobacteria: 2
Actinobacteria: 1332
Aquificae: 1
Bacteroidetes/Chlorobi: 65
Chlamydiae/Verrucomicrobia: 110
Chloroflexi: 11
Cyanobacteria: 1111
Deinococcus-Thermus: 4
Firmicutes: 7928
Fusobacteria: 1
Thermotogae: 1
Planctomycetes: 1
Alphaproteobacteria: 15137
Betaproteobacteria: 234
Gammaproteobacteria: 82120
Deltaproteobacteria: 923
Epsilonproteobacteria: 11
Spirochaetes: 324
Other Bacteria: 1
Sum: 212 | 22 | 76 | 126 | 4

DATA ACCESS

ProTISA was implemented in an Apache/PHP/MySQL environment on a Linux platform. Its basic functions are browsing the stored data and searching the database with user-specified input. The browse page is composed of two sections. The first section shows general information for a genome, such as organism name, taxonomic group and genomic GC-content. This section also displays a sequence logo (21) and a histogram of the spacer length for each signal (Figure 1). The second section contains the TIS annotation, with the start site and initiation signal for each gene. It shows the gene coordinates, the gene identity (PID and gene name), the TIS evidence type (i.e. IPT, CDC, HSC or MED) and the predicted signal. It also provides links to the evidence that supports the confirmed TIS: PMID or external database links for an IPT TIS, conserved domain search results for a CDC TIS and multiple N-terminal sequence alignments among orthologous genes for an HSC TIS.
Figure 1. Sequence logo and spacer length distribution of representative signals for the genomes (A) E. coli K-12; (B) S. coelicolor; (C) A. fulgidus; and (D) Synechocystis sp. PCC 6803. The positional weight matrix of the signal is visualized by a sequence logo in which the height of a letter at a given position is proportional to its frequency of occurrence. A letter is shown upside down if its frequency is lower than that in the background. The consensus is shown below the logo. The spacer length is defined as the distance (the number of nucleotides) between the TIS and each annotated signal; the signals are located with the positional weight matrix visualized in the sequence logo.

The webpage provides a user-friendly interface to search the TIS annotation by specifying a region in the genome sequence or a gene identifier such as the name or PID. Users can also specify the TIS evidence types and compare the output with the RefSeq annotation. The annotation on which the web server is built is available for batch download. The files can be easily imported into a database management system such as MySQL. In addition, the source code (written in C++) for the generation of CDC/HSC TISs, the new version of MED-Start and the motif finding algorithm is freely available on our website under the GNU GPL license. The reference genes used for HSC TIS creation are also compiled in a FASTA file for download.
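Once the downloaded annotation has been imported into a database, a region search of the kind offered by the web interface could look like the sketch below. Everything about the schema is an assumption: the table name `tis_annotation`, its columns and the three placeholder rows are invented for illustration, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE tis_annotation (
               pid TEXT, gene TEXT, strand TEXT,
               tis INTEGER, stop INTEGER, evidence TEXT)"""
    )
    rows = [  # placeholder rows, not real ProTISA records
        ("p1", "geneA", "+", 100, 400, "IPT"),
        ("p2", "geneB", "-", 900, 600, "CDC"),
        ("p3", "geneC", "+", 1500, 2100, "MED"),
    ]
    conn.executemany("INSERT INTO tis_annotation VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def search_region(conn, left, right, evidence=None):
    """Return (gene, tis, evidence) for TISs within [left, right], optionally filtered by evidence type."""
    sql = "SELECT gene, tis, evidence FROM tis_annotation WHERE tis BETWEEN ? AND ?"
    params = [left, right]
    if evidence:
        placeholders = ",".join("?" for _ in evidence)
        sql += " AND evidence IN (" + placeholders + ")"
        params.extend(evidence)
    sql += " ORDER BY tis"
    return conn.execute(sql, params).fetchall()


conn = build_demo_db()
all_hits = search_region(conn, 0, 1000)
confirmed_hits = search_region(conn, 0, 1000, evidence=("CDC", "HSC"))
```

The query is parameterized throughout, so user-supplied coordinates and evidence types are never interpolated into the SQL string; the same pattern carries over to a MySQL client library.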

CONCLUSIONS AND FUTURE DIRECTIONS

Despite the remarkable progress made in computational annotation, publications continue to raise concerns about the quality of TIS annotation in public databases such as RefSeq (1,5,9,22,23). A notable feature of ProTISA is the compilation of reliable TISs by collecting evidence from experiments, literature, conserved domain search, sequence alignment and accurate prediction. It is of interest to apply the most reliable resource, the IPT TISs, to estimate the reliability of the CDC, HSC and MED TISs. After removing redundancy from closely related genomes, we collected a set of 3413 IPT TISs as a benchmark, on which the CDC TISs show an accuracy of 99.4% and the HSC TISs an accuracy of 99.0%. For the MED TISs predicted by the modified MED-Start algorithm, the accuracy against the same benchmark reaches 92.9%. It is argued that the signal upstream of the TIS usually reflects the translation initiation mechanism (3,9,14,17,24). Another merit of ProTISA is the annotation of transcriptional and translational signals, which is visualized for each genome by a sequence logo for the signal content and a histogram for the distribution of spacer lengths to the TISs (Figure 1). This should help biologists to form hypotheses about the initiation mechanism of a specific genome (14). For example, in addition to the SD-like motif, we found in Streptomyces coelicolor a ‘TANNNT’ motif that closely resembles the previously reported Pribnow box, which is generally located 10 bp upstream of the transcription start site (TSS) (20). Moreover, the motif has a conserved position about 10 bp upstream of the TIS, counting from the TIS to the 5′ T in the ‘TANNNT’ motif (not shown in Figure 1). In other words, for some genes, the TSS is located just a few base pairs upstream of, or even overlaps with, the TIS, resulting in a leaderless gene. This suggests the existence of an initiation mechanism for leaderless genes in S. coelicolor, which is consistent with the results reported in (25).
Interestingly, this motif was also found in several bacterial groups, for example Actinobacteria, Deinococcus-Thermus and Firmicutes, implying that leaderless genes may not be as marginal a phenomenon in bacterial gene structure as is usually believed (26). An atypical signal is likely to be functional, especially given its conserved position relative to the TIS. For instance, we detected in Synechocystis sp. PCC 6803 a conserved dual-pyrimidine located immediately upstream of the TISs (Figure 1D), which is consistent with the findings in (24). This phenomenon was also found in several other genera. Such a signal could serve as a target for experiments to decipher its regulatory role, thus leading to a better understanding of the initiation mechanism. With the growing number of completely sequenced bacterial and archaeal genomes, the scientific value of a dedicated resource for TISs and the corresponding initiation signals is clear. We hope to increase the update frequency so that the database stays current as each new prokaryotic genome becomes available at NCBI. We plan to make ProTISA an evolving resource and to add significant functionality over time. One direction of ongoing development is to explore comparative genome analysis based on the divergent translation initiation mechanisms of Bacteria and Archaea. This is somewhat challenging, since it is difficult to develop a quantitative model to describe these complex mechanisms. ProTISA may also add items such as the classification of genes based on their initiation signals. To sum up, we believe that the resource will expand to suit the needs and requests of the research community for translation initiation studies.
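The spacing observation for the ‘TANNNT’ motif discussed above can be reproduced on any upstream sequence with a short scan. The sketch assumes that the input string ends immediately before the first base of the start codon, and that the distance is counted as the number of nucleotides from the 5′ T of the motif to the TIS; both conventions are assumptions, as is the function name.

```python
import re

# A lookahead is used so that overlapping TANNNT occurrences are all reported.
TANNNT = re.compile(r"(?=TA[ACGT]{3}T)")


def tannnt_distances(upstream):
    """Distances (nt) from the 5' T of each TANNNT match to the TIS.

    `upstream` is assumed to end right before the start codon.
    """
    upstream = upstream.upper()
    return [len(upstream) - match.start() for match in TANNNT.finditer(upstream)]


# Toy sequence: TACGCT begins 10 nt upstream of the TIS.
distances = tannnt_distances("CCCCTACGCTGGGG")
```

Collecting these distances over all genes of a genome and plotting them as a histogram would give the kind of spacer length distribution shown for each signal in Figure 1.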
REFERENCES (10 of 25 listed)

K E Rudd. EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 2000-01-01.

J Besemer; A Lomsadze; M Borodovsky. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001-06-15.

B E Suzek; M D Ermolaeva; M Schreiber; S L Salzberg. A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics. 2001-12.

Isabella Moll; Sonja Grill; Claudio O Gualerzi; Udo Bläsi. Leaderless mRNAs in bacteria: surprises in ribosomal recruitment and translational control (Review). Mol Microbiol. 2002-01.

Jiong Ma; Allan Campbell; Samuel Karlin. Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J Bacteriol. 2002-10.

Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner. WebLogo: a sequence logo generator. Genome Res. 2004-06.

W R Strohl. Compilation and analysis of DNA sequences associated with apparent streptomycete promoters (Review). Nucleic Acids Res. 1992-03-11.

Michalis Aivaliotis; Kris Gevaert; Michaela Falb; Andreas Tebbe; Kosta Konstantinidis; Birgit Bisle; Christian Klein; Lennart Martens; An Staes; Evy Timmerman; Jozef Van Damme; Frank Siedler; Friedhelm Pfeiffer; Joël Vandekerckhove; Dieter Oesterhelt. Large-scale identification of N-terminal peptides in the halophilic archaea Halobacterium salinarum and Natronomonas pharaonis. J Proteome Res. 2007-04-20.

Hong-Yu Ou; Feng-Biao Guo; Chun-Ting Zhang. GS-Finder: a program to find bacterial gene start sites with a self-training method. Int J Biochem Cell Biol. 2004-03.

Thomas Schou Larsen; Anders Krogh. EasyGene: a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics. 2003-06-03.