
Genome information management and integrated data analysis with HaloLex.

Friedhelm Pfeiffer, Alexander Broicher, Thomas Gillich, Kathrin Klee, José Mejía, Markus Rampp, Dieter Oesterhelt.

Abstract

HaloLex is a software system for the central management, integration, curation, and web-based visualization of genomic and other -omics data for any given microorganism. The system has been employed for the manual curation of three haloarchaeal genomes, namely Halobacterium salinarum (strain R1), Natronomonas pharaonis, and Haloquadratum walsbyi. In particular, HaloLex enables the integrated analysis of genome-wide proteomic results with the underlying genomic data. This has proven indispensable for generating reliable gene predictions for GC-rich genomes, which, due to their characteristically low abundance of stop codons, are known to be hard targets for standard gene finders, especially concerning start codon assignment. The proteomic identification of more than 600 N-terminal peptides has greatly increased the reliability of the start codon assignment for Halobacterium salinarum. Application of homology-based methods to the published genome of Haloarcula marismortui allowed us to detect 47 previously unidentified genes (missed genes are a particularly serious problem for short protein sequences) and to correct more than 300 start codon misassignments.


Year: 2008; PMID: 18592220; PMCID: PMC2516542; DOI: 10.1007/s00203-008-0389-z

Source DB: PubMed; Journal: Arch Microbiol; ISSN: 0302-8933; Impact factor: 2.552


Introduction

In the era of high-throughput biochemical experiments and large-scale systems-modeling approaches, the availability of high-quality input data like the complete gene or protein inventory of an organism is of paramount importance (cf. Kitano 2002). In practice, however, progress is often hampered by the lack of access to the relevant data, their insufficient integration with related information, or simply inadequate reliability of the data. Although the current era is commonly referred to as “postgenomic,” many problems related to the quality of (microbial) genome annotations are still not satisfactorily solved. The GC-rich genomes of halophilic archaea, for example, are known to pose particular challenges for the bioinformatic prediction of their gene and protein inventory. Unsupervised, automatic gene prediction is likely to fail, and blindly relying on such data can compromise any further analysis or experiment. Thus, there is a need not only for making genomic and other related data available to the end-user in the most convenient and comprehensive way, but also for tools that support generating, managing, and manually curating the data and allow experts to assess and improve their quality. To this end, we have developed HaloLex, which serves both of these purposes. HaloLex is a software system for the central management, integration, and web-based visualization of genomic and other -omics data for any given microorganism. Centered on the genomic information, HaloLex provides a comprehensive and user-friendly web interface (http://www.halolex.mpg.de) with many different interlinked views and various search functionalities to the underlying database. Advanced data mining tasks can be performed by employing high-level programming interfaces to access and automatically process bulk data with computer scripts and programs.
The main scientific purpose of HaloLex is to support in-depth analysis of selected prokaryotic genomes and to assist knowledge-based manual revision and refinement of their annotation, in particular by also taking into account nongenomic, experimental data (e.g., proteomics). Typically, genomic data enter the system after automatic gene identification, classification, and basic annotations have been accomplished using a general-purpose genome annotation system such as “GenDB” (Meyer et al. 2003). We are not trying to parallel seemingly similar efforts like the “integrated microbial genomes browser IMG” (Markowitz et al. 2006), the “UCSC Archaeal Genome Browser” (Schneider et al. 2006), “PEDANT” (Riley et al. 2007), “AGMIAL” (Bryson et al. 2006), or the like (see Bryson et al. 2006 for a recent overview), which are full-blown (automatic) genome annotation and/or information systems and provide exhaustive data repositories to the community. The focus of the HaloLex system is rather to assist experts in achieving an extraordinarily high data quality for a selection of (model) organisms and to make that data available for further analysis like systems modeling, experiment design, etc. To this end, we place particular emphasis on integrating standard genomic data with proteomic (see also Pleissner et al. 2004), transcriptomic, and metabolomic data. This has, for example, enabled a number of genome-scale proteomics (Tebbe et al. 2005; Klein et al. 2005; Bisle et al. 2006; Falb et al. 2006; Aivaliotis et al. 2007; Konstantinidis et al. 2007) and transcriptomic (Twellmeyer et al. 2007) analyses as well as a whole-genome metabolic flux simulation (Gonzalez et al. 2008).
So far, scientific applications with HaloLex have mainly focused on a number of halophilic archaea, in particular on Halobacterium salinarum strain R1 (DSM 671, Pfeiffer et al. 2008), Natronomonas pharaonis strain Gabara (DSM 2160, Falb et al. 2005), and Haloquadratum walsbyi strain HBSQ001 (DSM 16790, Bolhuis et al. 2006). These genomes, together with Halobacterium salinarum strain NRC-1 (Ng et al. 2000), Haloarcula marismortui (Baliga et al. 2004) and Haloferax volcanii (J. Eisen, unpublished), are of primary interest to our own group and our collaborators. For specific examples, the reader is referred to the articles of Teufel et al. (2008), Scheuch et al. (2008), Dambeck and Soppa (2008) (all three references in this issue of Archives of Microbiology), and to the general review on the genomics and functional genomics of halophilic archaea by Soppa et al. (2008) (this issue of Archives of Microbiology). However, HaloLex is not limited to halophilic archaea but currently covers all publicly available archaeal and a few selected bacterial genomes as obtained from the NCBI (ftp.ncbi.nih.gov/genomes/Bacteria). Besides demonstrating that the system is in principle capable of handling data for a large number of organisms, it allows users to browse the underlying GenBank data (augmented with additional information like bioinformatic predictions) within the same, coherent web interface. The integrated access within the same data model and software environment has proven to be a key prerequisite for conducting comprehensive statistical analyses like the ones presented in the second part of this paper.
The paper is organized as follows: in “Section 1: overview of the HaloLex system”, we describe the main functionalities of the HaloLex system and give some notes on its implementation. “Section 2: integrated data analysis with HaloLex” highlights a number of biological problems that have been addressed with HaloLex, and points out bioinformatic solutions for the specific challenges posed by the GC-rich genomes of halophilic archaea. In particular, we shall present new, significantly improved annotation data for Haloarcula marismortui.

Section 1: overview of the HaloLex system

In short, HaloLex is based on a relational database serving as the central repository for all kinds of data, which are available for a given microorganism. A dynamic web application provides integrated access to the data and supports the daily work with the genomic and proteomic information in an economical way. The web interface is complemented by a programming interface, which enables (computationally experienced) local users to perform complex data mining tasks, based on a coherent data model and query methods.

Main functionalities of the web application

The primary and most accessible interface to the data stored in HaloLex is a web application, which allows users to conveniently browse and query data over the Internet with minimal technical effort (http://www.halolex.mpg.de). Depending on their individual role, anonymous or appropriately authorized users get read-only access to various browsing and search functionalities or are equipped with additional privileges for data curation and management, respectively. Access rights can be granted separately for each individual strain, allowing us to handle all data within the same data store and code base. Wherever applicable, graphics are rendered in the SVG (Scalable Vector Graphics) format. This greatly facilitates postprocessing of results and improves the quality of their presentation as compared to working with pixel-based formats (GIF, JPEG, PNG, etc.), which are conventionally employed by the majority of existing web applications.

Genome viewer

The available information about an individual coding sequence is summarized by a central “details page” listing sequences (coding region and protein translation), functional information (e.g., protein name, gene name, EC number, functional classification), general gene and protein characteristics (e.g., sequence length, start and stop codons, GC content, theoretical pI value), and results from several bioinformatic tools, e.g., transmembrane and signal peptide prediction with “Phobius” (Kall et al. 2004), protein export signals with “Tatfind” (Rose et al. 2002), codon adaptation index (Sharp and Li 1987), etc. In addition, the details page shows homologous sequences as well as cross-references to entries of the same protein in major public sequence databases like GenBank, UniProt, and KEGG, and also links to relevant PubMed abstracts. Usually, the details page is reached by selecting an organism and directly specifying an identifier or name for the gene of interest. Less specific searches and browsing functionalities are also supported, including the option to obtain complete lists of genes or proteins, which can optionally be filtered by various characteristics like pI value range, type of proteomic identification, etc. (cf. Fig. 1).
Fig. 1

Screenshot of the search functionality of HaloLex. Example output of a query for all genes of Halobacterium salinarum (R1), which were “reliably” identified by proteomics (indicated in the rightmost column). The complete list of 1,992 identifications was truncated for brevity

If the organism or gene of interest is not specified a priori, the user can alternatively start out with a BLAST-based search (Altschul et al. 1997) for all sequences in the HaloLex database that are similar to a given query. To reach the details page, one may also start from a graphical display of a particular region on the genome. The corresponding “region viewer” page provides standard genome browsing functionalities and allows genes to be color-coded according to a variety of characteristics like the annotation status, assigned function class, GC content, proteomic identification (see Fig. 2), and many more.
Fig. 2

Screenshot of the region viewer of HaloLex. Genomic region on the Halobacterium chromosome with ORFs color-coded according to different trust levels of proteomic identification. “Spurious” ORFs (which are hidden by default) are rendered as open symbols

Genome curation

For the manual curation of genome-based data, the web interface provides basic forms for updating the protein function annotation of individual genes (i.e., changing protein name, gene name, EC number, etc.). In addition, the gene assignment itself can be revised. HaloLex supports the introduction of newly identified genes, which, e.g., may have been missed by an automatic gene prediction tool. Such tools may also have produced false positives, i.e., open reading frames (ORFs) that are eventually found not to code for proteins. Such “spurious ORFs”, which are especially frequent in GC-rich genomes (cf. “Section 2: integrated data analysis with HaloLex”), are not eliminated from the database but are tagged accordingly. This allows such ORFs to be optionally retained in viewing and data mining tools (cf. Fig. 2). Furthermore, start codons may have been misassigned, which is also a common problem for GC-rich genomes (cf. “Section 2: integrated data analysis with HaloLex”). HaloLex assists the curator in assessing and revising the setting of the start codon by showing a number of characteristic quantities like the resulting amino-acid distribution or pI values corresponding to all relevant alternative choices of the start codon.

Viewers for proteomic data

Figure 3 illustrates a navigation path from a spot on a two-dimensional gel image via the spectrum taken in a mass-spectrometric experiment to the identified protein. Individual spots on the gel image, for which spectra have been taken, are classified and color-coded according to the type and quality of the protein identification. The corresponding mass-intensity spectra are annotated and rendered such that the interpretation of the “raw” spectrum is immediately transparent to the user.
Fig. 3

Integrated access to genomic and proteomic data. Montage of different views of the HaloLex web interface on proteomic data. Blue arrows indicate example navigation tracks from a particular spot on a 2D gel image via two different mass-spectra to the identified protein, and its location on the genome, respectively

Data mining capabilities

Naturally, not all conceivable types of data analysis can be anticipated and implemented in a web application with limited effort. For example, we opted not to provide sophisticated web-based cross-genome comparison functionalities. To still support complex and highly customizable data mining applications, HaloLex offers full programmatic access to all data and tools within a well-structured data model. Being able to work in such a coherent environment has proven to be a fundamental prerequisite for a large variety of research projects, which have been conducted with HaloLex in the course of several years, a few current examples of which shall be highlighted in the subsequent section. The corresponding application programming interface (API) requires analysis programs to be written in the Java language and to run in the same local-area network where the HaloLex server is located. Both restrictions can, however, be relaxed by means of a SOAP-based web service interface, which we are internally already employing successfully (for a nontechnical introduction to web services and their role in biosciences, see Stein 2002).

Integration of other -omics data

The HaloLex database allows storing and accessing other -omics data in an integrated way and links them with the corresponding genomic data. As shown above, this is well established for proteomic data (currently limited to database searches using MASCOT) and also applies to transcriptomic (Twellmeyer et al. 2007), as well as to curated metabolic data based on KEGG information (Falb et al. 2008). Access to the latter, however, is currently restricted to internal data mining applications, i.e., transcriptomic and metabolic data have not yet been made publicly available via the HaloLex web interface.

Notes on the implementation

HaloLex was originally implemented as a classic “LAMP” system, i.e., it was operated on a Linux platform, using an Apache web server, the MySQL relational database management system, and the Perl programming language. The system was mainly used for department-internal purposes and covered only a few genomes. To substantially extend the system with respect to the amount and complexity of data and to provide user-friendly public web access to the wealth of internal HaloLex functionalities, the system was recently reimplemented based on the Java Enterprise Edition 5 platform (see, e.g., Stearns et al. 2006). Besides many other well-established benefits delivered by this technology, we take advantage of the so-called “distributed components” approach, which promotes (loose) coupling of different stand-alone services through standardized interfaces. Specifically, HaloLex uses remote services offered by the MIGenAS sequence analysis platform (Rampp et al. 2006), e.g., for computing bioinformatic predictions like the transmembrane topology and for cross-referencing database identifiers (cf. Wu et al. 2004). For genome sequences imported from GenBank, we employ the SIMAP web service (Rattei et al. 2006) to retrieve precalculated and regularly updated similarities of proteins with public sequence databases like UniProt, PDB, etc. Data mining applications are enabled by an Application Programming Interface (API), which is built upon the “Remote Interface” component of Java’s Enterprise Edition. The same technology is easily exploited to export a web service interface.

Section 2: integrated data analysis with HaloLex

A typical gene prediction problem, ORF overprediction, was chosen as a principal topic to illustrate several applications of HaloLex. We describe the statistical basis for this problem and how an integrated analysis of proteomic and genomic data allows it to be overcome. In addition, we describe homology-based methods to detect and resolve gene prediction problems. Using the manual curation tools of HaloLex, we were able to substantially improve the gene prediction for the published genome of Haloarcula marismortui (Baliga et al. 2004). GC-rich genomes like those of halophilic archaea are known to challenge standard gene prediction tools (Nielsen and Krogh 2005; McHardy et al. 2004). Two types of problems are encountered: (1) the existence of alternative long open reading frames (Veloso et al. 2005) makes it difficult to discriminate protein-coding genes from spurious ORFs; (2) start codon selection is highly error-prone due to long N-terminal ORF extensions in front of the start codon used in vivo (Aivaliotis et al. 2007). In both cases, which we summarize as the “ORF overprediction problem”, noncoding DNA may be erroneously “translated” into protein sequences upon unwary application of gene predictors. This markedly deteriorates the quality of the resulting protein-coding gene set. A high-quality gene set is, however, essential for genetic experiments, analysis of transcription and translation signals, or the analysis of protein export signals, which are commonly located in the N-terminal region, not to mention systems biology applications such as metabolic modeling. ORF overprediction is illustrated in Fig. 2, which shows a 10 kb region of the Halobacterium salinarum strain R1 genome. Protein-coding genes are outnumbered by spurious ORFs, which are all longer than 100 codons. In many cases, a spurious ORF is even longer than the protein-coding gene with which it overlaps.
Spurious ORFs with a length of up to 1,300 codons have been found in the Halobacterium genome (Pfeiffer et al. 2008). The ORF overprediction problem is also strikingly illustrated by the fact that 20% of the predicted protein sequences of strain NRC-1 of Halobacterium salinarum are inconsistent with those of strain R1, although the DNA sequences of both strains are virtually identical (four single-base differences, five one-base frameshifts, three indels; see Pfeiffer et al. 2008). Among the genes with a start codon assignment discrepancy is the TATA-binding protein tbpA (Scheuch et al. 2008).
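The abundance of long open frames described above can be reproduced with a very small scan. The following is a minimal sketch, not the gene finder or the HaloLex code: the function name open_frames and its parameters are hypothetical, and it simply reports every stop-free stretch of at least 100 codons in the six reading frames, without looking for start codons.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq: str, min_codons: int = 100) -> List[Tuple[str, int, int]]:
    """List (strand, frame, length_in_codons) for each stop-free stretch
    of at least min_codons codons in the six reading frames."""
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run = 0  # codons seen since the last stop codon
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if run >= min_codons:
                        hits.append((strand, frame, run))
                    run = 0
                else:
                    run += 1
            if run >= min_codons:  # stretch that runs to the sequence end
                hits.append((strand, frame, run))
    return hits


# Toy GC-rich "gene": 120 alanine codons followed by a stop codon.
toy = "GCC" * 120 + "TAA"
for hit in open_frames(toy):
    print(hit)
```

For this toy sequence, all six frames contain a stretch longer than 100 codons, although at most one of them could be a real gene. This is the situation that a curator faces in a GC-rich genome.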

Genome statistical data

ORF overprediction is caused by the low number of stop codons in GC-rich genomes (Veloso et al. 2005). Because of the reduced frequencies of T and A, there is a low expectation value for each of the three stop codons (TAA, TAG, and TGA). The problem is further aggravated because the number of stop codons actually found in prokaryotic genomes is even lower than that predicted by basic statistics of single-nucleotide frequencies. In the case of Halobacterium, only 66% of the statistically expected stop codons are found. It is interesting to note that nearly all prokaryotic genomes have fewer stop codons than expected (Fig. 4). While the reduction is moderate for AT-rich genomes, it is significant for genomes with a GC content larger than 60%, where 28% of the expected stop codons are missing on average.
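The comparison of expected and actual stop codon numbers can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for Fig. 4: the function name stop_codon_stats is hypothetical, only one strand is scanned, and trinucleotides are counted at every position (all three frames together). The expectation uses nothing but single-nucleotide frequencies, e.g., f(T) x f(A) x f(A) for TAA.

```python
from collections import Counter

STOPS = ("TAA", "TAG", "TGA")


def stop_codon_stats(seq):
    """Return (expected, observed) stop-trinucleotide counts for one strand.

    expected: number of trinucleotide windows times the summed products
              of the single-nucleotide frequencies of the three stop codons.
    observed: number of windows that actually spell a stop codon.
    """
    seq = seq.upper()
    n = len(seq)
    freq = {base: count / n for base, count in Counter(seq).items()}
    windows = n - 2
    expected = windows * sum(
        freq.get(a, 0.0) * freq.get(b, 0.0) * freq.get(c, 0.0)
        for a, b, c in STOPS
    )
    observed = sum(1 for i in range(windows) if seq[i:i + 3] in STOPS)
    return expected, observed


# Toy sequence with equal base frequencies but no stop trinucleotide.
expected, observed = stop_codon_stats("ACGT" * 25)
print(f"expected {expected:.2f}, observed {observed}")
```

The ratio observed/expected corresponds to the percentages quoted in the text (e.g., 66% for Halobacterium), provided that the counting scheme matches the one used by the authors, which is an assumption here.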
Fig. 4

Expected and actual frequency of stop codons for 425 microbial genomes. For the chromosomes of 425 microbial strains, the expected and the actual number of stop codons was counted and normalized by the total number of codons. Species are sorted along the abscissa by decreasing GC content. For nearly all genomes, the number of actually present stop codons (open circles) is significantly lower than that expected (filled symbols). The small inset shows that for the group of genomes with a GC content >60% (to the left of the dashed vertical line), only 72% of the expected stop codons are found, whereas more than 85% of the expected stop codons are actually present in the group of genomes with a GC content <60% (to the right of the dashed vertical line). The GenBank data for all microbial strains were downloaded from ftp.ncbi.nih.gov/genomes/Bacteria. Only the chromosome (more precisely: the longest replicon) was chosen for each strain and only one representative strain was used for each species

This observation can be explained by an additional bias at the dinucleotide level, which exists on top of the aforementioned bias due to an altered GC content. For the Halobacterium chromosome, as an example, this is illustrated by Fig. 5a, which shows that dinucleotides with the same number of A or T residues do not occur with equal frequencies. In particular, the “TA” dinucleotide, which appears in two of the three stop codons, is especially rare. Reduced “TA” dinucleotide frequencies have been found in most prokaryotic genomes (Karlin et al. 2002).
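One common way to quantify a dinucleotide bias is the ratio of the observed dinucleotide count to the count expected from the single-nucleotide frequencies. The sketch below uses this odds ratio as an assumption; it is not necessarily the measure underlying Fig. 5a, and the function name dinucleotide_odds is hypothetical. A value well below 1 for “TA” would correspond to the depletion described above.

```python
from collections import Counter


def dinucleotide_odds(seq):
    """Map each dinucleotide that occurs in seq to observed / expected,
    where expected = f(x) * f(y) * (number of dinucleotide windows)."""
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    odds = {}
    for pair, count in di.items():
        x, y = pair
        expected = (mono[x] / n) * (mono[y] / n) * (n - 1)
        odds[pair] = count / expected
    return odds


# Toy sequence: "CG" is frequent, "TA" occurs only once.
toy = "CGCGCGATCGCGTACGCG"
for pair, ratio in sorted(dinucleotide_odds(toy).items()):
    print(pair, round(ratio, 2))
```

Dinucleotides that never occur are absent from the returned dictionary; a caller who needs all 16 entries would have to add them with a ratio of zero.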
Fig. 5

Dinucleotide bias for Halobacterium salinarum. a Counts of dinucleotides in the Halobacterium salinarum chromosome. Dinucleotides are grouped according to the number of G or C residues. Within each group, each dinucleotide is adjacent to its reverse complement (e.g., TC and GA). The four palindromic dinucleotides are indicated by green arrows. For each group, the theoretically expected average (blue line) is compared with the average, which is actually observed (yellow line). b Same as (a) but showing the counts of trinucleotides. Red circles highlight stop codons and blue circles highlight trinucleotides that correspond to arginine codons. c The amino acid composition as computed from the protein-coding gene set (black) and from trinucleotide counts (gray). The over-representation of the acidic amino acids aspartate and to a lesser extent glutamate (red circles) in protein-coding genes contrasts with the over-representation of the basic amino acid arginine, prolines and serines (blue circles) in translations of random stretches of DNA. This is the basis for a strong pI difference between these two sets of ORFs

In Halobacterium, the “CG” dinucleotide is much more frequent than the other dinucleotides consisting only of G and C residues. As already noted by Karlin et al. (2002), an excess of “CG” is rather exceptional for prokaryotic genomes, which commonly are enriched for “GC.” Indirectly, this “CG excess” facilitates gene selection and start codon assignment in Halobacterium to some extent, as it results in an excess of four trinucleotides that correspond to arginine codons. Translations of random stretches of DNA (spurious ORFs) are therefore preferentially arginine-rich and highly alkaline, while halophilic proteins are known to be rich in aspartic acid and highly acidic: the pI value of 82% of the halobacterial proteins is between 3.5 and 5.5 (Tebbe et al. 2005).
Like spurious ORFs, N-terminal gene extensions in front of the correct start codon tend to be highly alkaline, whereas the rest of the N-terminal region of the protein tends to be acidic. In combination, this results in a large pI upshift in front of the correct start codon (see Fig. 6), which can help to assign it properly (Tebbe et al. 2005). In the HaloLex web interface, the indicative pI values are shown to assist the annotator in assigning the correct start codon (see “Section 1: overview of the HaloLex system”).
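The pI comparison around a candidate start codon can be sketched as follows. This is a rough illustration, not the HaloLex implementation: the function names are hypothetical, and the pKa values are approximate values of the kind used by common pI calculators (an assumption; other pKa sets shift the results slightly). The theoretical pI is found by bisection on the net charge, which decreases monotonically with pH.

```python
# Approximate pKa values (assumption; other published sets differ slightly).
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM, C_TERM = 8.6, 3.6


def net_charge(peptide, ph):
    """Net charge of a peptide at the given pH (Henderson-Hasselbalch)."""
    charge = 1 / (1 + 10 ** (ph - N_TERM)) - 1 / (1 + 10 ** (C_TERM - ph))
    for aa in peptide:
        if aa in PKA_POS:
            charge += 1 / (1 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1 / (1 + 10 ** (PKA_NEG[aa] - ph))
    return charge


def isoelectric_point(peptide):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if net_charge(peptide, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pi_shift(translation, start_index, window=20):
    """Return (pI_pre, pI_post) for a candidate start at start_index.

    translation: protein translation of the ORF including its extension.
    pI_pre covers up to `window` residues before the candidate start;
    pI_post covers `window` residues after it, excluding the initial Met.
    """
    pre = translation[max(0, start_index - window):start_index]
    post = translation[start_index + 1:start_index + 1 + window]
    return isoelectric_point(pre), isoelectric_point(post)


# Toy example: arginine-rich extension followed by an acidic N-terminus.
toy = "RARPRSRARGRPRARSRARG" + "M" + "TDEALDDLVEADSEEIDAAL"
pi_pre, pi_post = pi_shift(toy, start_index=20)
print(f"pI_pre = {pi_pre:.1f}, pI_post = {pi_post:.1f}")
```

A large difference between the two values, with an alkaline stretch before and an acidic stretch after the candidate position, is the pattern that the text describes as indicative of the correct start codon.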
Fig. 6

pI shift around start codons. The distribution of pI values of the 20 N-terminal residues excluding the initial Met (solid line) and the 20 residues of the spurious ORF extension (broken line) that precedes the start codon is plotted for Halobacterium. Transmembrane proteins and proteins with a signal sequence or twin-arginine export motif have been excluded from the analysis. The small inset shows the correlation of the pI value of the N-terminal region of the protein (pI-post, plotted on the x-axis) and the pI value of the spurious ORF extension (pI_pre, plotted on the y-axis). The majority of the N-terminal regions are acidic, while a large fraction of the spurious extensions is highly alkaline

Integrated analysis of proteomic and genomic data

Gene selection and start codon assignment are greatly facilitated by experimental evidence, especially by proteome analysis. We have collected genome-scale proteomic data for Halobacterium salinarum (68% of all proteins identified, Tebbe et al. 2005; Klein et al. 2005; Bisle et al. 2006; Falb et al. 2006; Aivaliotis et al. 2007) and for Natronomonas pharaonis (43% of all proteins identified, Konstantinidis et al. 2007). This allowed us to address and solve the two problems associated with gene prediction in GC-rich genomes, as an ORF is unambiguously confirmed as a gene if the protein product is identified by a proteomic experiment. More than 100 orphans (ORFs that potentially code for proteins but do not have any homologs in the databases) could thereby be confirmed as genes. In many cases, initial gene predictions had to be corrected on the basis of proteomic data (see Tebbe et al. 2005). No evidence for “ORF overprinting” (i.e., more than one gene is located on the same genomic sequence stretch, Keese and Gibbs 1992) was found in Halobacterium and Natronomonas, although throughout their chromosomes, more than one reading frame is open at a given genome location (cf. Fig. 2). Searching for protein identifications resulting from alternative overlapping reading frames, we did not find a single pair of identified overlapping proteins (see Aivaliotis et al. 2007). Therefore, we conclude that, if ORF overprinting occurs at all, it is a very rare event (Konstantinidis et al. 2007; Pfeiffer et al. 2008). To address the problem of start codon assignment, we selected N-terminal peptides from the aforementioned set of proteomic data. In addition, we designed experiments in an attempt to specifically identify N-terminal peptides (Aivaliotis et al. 2007). In total, N-termini from 606 proteins in H. salinarum and from 328 in N. pharaonis were identified (Falb et al. 2006; Aivaliotis et al. 2007).
On the basis of these experimental data, the subsequent integrated analysis of proteomic and genomic data in the HaloLex system confirmed that commonly applied gene finders have a high error rate with respect to start codon selection (Falb et al. 2005; Aivaliotis et al. 2007). Major difficulties in assigning correct start codons are also evident from the fact that several hundred start codon assignment discrepancies exist between Halobacterium salinarum strains R1 (Pfeiffer et al. 2008) and NRC-1 (Ng et al. 2000), although the DNA sequences are virtually identical. Whenever N-terminal peptides could be identified by proteomics, they confirmed the start codon assignment for strain R1 (Pfeiffer et al. 2008). A selection of additional results from our proteomic analysis illustrates the power of integrated analysis with the HaloLex system. Our set of experimentally validated N-terminal peptides is among the largest in the prokaryotic world and allowed us to unravel N-terminal protein maturation in halophilic archaea, which consists of methionine cleavage and N-terminal protein acetylation (Falb et al. 2006). N-terminal protein maturation critically depends on the penultimate residue (the one following the initiator methionine). The set of proteins with an identified N-terminal peptide contains 90 integral membrane proteins (again being one of the largest sets currently available). The data show that a major fraction of the integral membrane proteome is synthesized without a cleavable signal sequence and processed analogously to cytosolic proteins (Falb et al. 2006). One focus of our group is on membrane proteins, which we have extensively analyzed by proteomics (Klein et al. 2005; Bisle et al. 2006). While identification of integral membrane proteins has become highly efficient, our data show that the identification of peptides that form the transmembrane domain is still in its infancy. Most of the integral membrane proteins are identified exclusively through loop peptides.
Statistical analysis shows that this hampers protein modification-based quantitative proteomics of integral membrane proteins (Bisle et al. 2006). Yet another issue concerning gene selection could be solved by experimental means. We were uncertain whether our protein-coding gene set contained a major overprediction of small genes. Two statistical results pointed to such an overprediction: (1) proteins smaller than 20 kDa are severely underrepresented in the set of proteomically identified proteins (Tebbe et al. 2005; Klein et al. 2007); (2) although we had used gel systems that are able to separate proteins below 20 kDa, the number of 2D gel spots in this size range was much smaller than expected from a theoretical 2D gel (Tebbe et al. 2005). Experimental analysis showed that the small proteins indeed exist, but had so far been missed due to technical problems in standard biochemical experiments. There is a severe washout of small proteins during standard SDS gel handling procedures (Klein et al. 2007). Also, the low number of peptides obtained upon tryptic digestion severely hampers proteomic identification. With improved experimental techniques, 380 proteins smaller than 20 kDa could be identified (which increased the fraction of identified small proteins by a factor of six).
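The search for ORF overprinting described earlier in this section (looking for pairs of proteomically identified proteins whose genes overlap on the genome) can be illustrated with a minimal sketch. It is not the procedure implemented in HaloLex: the `Orf` record, the `min_overlap` threshold, and the example codes are assumptions made only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Orf:
    code: str         # hypothetical ORF code
    start: int        # 1-based genome coordinate, inclusive
    end: int          # genome coordinate, inclusive
    identified: bool  # True if the protein product was seen in proteomic data


def overlap_length(a, b):
    """Number of genome positions shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def identified_overlapping_pairs(orfs, min_overlap=30):
    """Return code pairs of identified ORFs that overlap by at least min_overlap.

    min_overlap is an assumed cutoff meant to ignore short overlaps between
    neighbouring genes; the source does not state such a value.
    """
    hits = [o for o in orfs if o.identified]
    return [
        (a.code, b.code)
        for a, b in combinations(hits, 2)
        if overlap_length(a, b) >= min_overlap
    ]


if __name__ == "__main__":
    example = [
        Orf("orfA", 1, 300, True),
        Orf("orfB", 100, 400, False),
        Orf("orfC", 250, 600, True),
        Orf("orfD", 700, 900, True),
    ]
    print(identified_overlapping_pairs(example))
```

An empty result for a real data set would correspond to the finding reported above, namely that no pair of identified overlapping proteins was observed.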

Homology-based checking of ORF prediction

Small protein-coding genes are easily missed during gene prediction. Therefore, we implemented a semiautomatic homology-based procedure to detect as yet unannotated small genes. To this end, short protein sequences from closely related organisms are used for independent homology searches using blastP (protein vs. protein) and tblastN (protein vs. six-frame translation of the genome). Proteins with a higher score in tblastN than in blastP are selected for subsequent manual curation. Annotations for new genes detected by this procedure can be generated using a six-frame translator implemented in HaloLex. We applied this procedure to the published genome of Haloarcula marismortui, using proteins with up to 150 residues from H. salinarum strain R1, N. pharaonis, and H. walsbyi as seeds. This enabled us to detect 47 previously missed genes in Haloarcula (Table 1); among them, four encode ribosomal proteins and 10 encode small CPxCG-related zinc finger proteins, which are a prominent class of potential gene regulators found in all archaeal genomes (Tarasov et al. 2008).
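The selection step of this procedure (keep seed proteins that score higher in tblastN than in blastP) can be sketched as follows. This is an illustration only, under several assumptions: the scores are taken to be best-hit bit scores already parsed into dictionaries, a seed without any blastP hit is given a score of 0, and the seed identifiers are hypothetical. The source does not specify these details.

```python
def select_candidates(blastp_scores, tblastn_scores):
    """Return seeds whose best tblastN score exceeds their best blastP score.

    Both arguments map a seed protein ID to its best score against the target
    genome (tblastN) or its annotated protein set (blastP). A seed missing
    from blastp_scores is treated as having score 0.0 (an assumption).
    The result is sorted by score difference, largest first.
    """
    candidates = []
    for seed, t_score in tblastn_scores.items():
        p_score = blastp_scores.get(seed, 0.0)
        if t_score > p_score:
            candidates.append((seed, p_score, t_score))
    return sorted(candidates, key=lambda c: c[2] - c[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical seed IDs and scores, for illustration only.
    blastp = {"seed1": 50.0, "seed2": 120.0}
    tblastn = {"seed1": 95.0, "seed2": 118.0, "seed3": 60.0}
    for seed, p_score, t_score in select_candidates(blastp, tblastn):
        print(seed, p_score, t_score)
```

The returned list is only a preselection; as stated above, each candidate still goes to manual curation.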
Table 1

Newly assigned genes in Haloarcula marismortui

ORF | Length (aa) | Best homolog | Function | Seq id. (%) | Other homologs
rrnAC0103_A | 75 | NP3662A | rib_prot S28.eR | 86 | OE2664F, HQ2884A
rrnAC0208_A | 60 | NP0350A | CHY | 48 |
rrnAC0216_A | 126 | OE2874F | CHY | 42 | HQ1719A
rrnAC0301_A | 80 | HQ2541A | Small ZnF | 45 |
rrnAC0669_A | 150 | NP0856A | CHY | 66 | HQ1219A, OE1540R
rrnAC0678_A | 53 | NP0788A | Small ZnF | 73 | HQ1109A, OE1789R, HQ2748A, OE7210R
rrnAC0696_A | 99 | NP0816A | Small ZnF | 47 | OE1556F
rrnAC0797_A | 57 | HQ2892A | rib_prot L37.eR | 92 | OE3141R, NP4310A
rrnAC0991_A | 48 | NP2998A | CHY | 79 | OE3047F
rrnAC1044_A | 86 | HQ1848A | moaD family protein | 33 | NP2500A, NP5020A, NP3946A, OE3595R
rrnAC1515_A | 66 | NP1736A | Small ZnF | 58 | HQ3220A, OE3365R
rrnAC1588_A | 146 | OE5063R | IS200-type transposase | 72 | NP4630A, OE1439F, rrAC0815, OE4728F
rrnAC1597_A | 61 | NP4882A | rib_prot S14 | 72 | OE3408F, HQ2828A, NP1768A
rrnAC1603_A | 94 | NP4870A | RNAseP comp. 1 | 52 | OE3398F, HQ2834A
rrnAC1676_A | 44 | NP4282A | CHY | 77 | NP2940A
rrnAC1676_B | 122 | HQ1297A | CHY | 60 |
rrnAC1678_A | 100 | HQ1827A | CHY | 59 |
rrnAC1706_A | 141 | NP1764A | CHY | 52 |
rrnAC1831_A | 63 | NP1510A | CHY | 63 | HQ1704A, OE1775R
rrnAC1867_A | 52 | HQ1176A | Small ZnF | 69 | OE1435R, NP5316A
rrnAC1929_A | 212 | OE3249F | Cob cluster protein | 54 | NP5310A, HQ1412A, NP1896A
rrnAC1936_A | 89 | NP3612A | CHY | 48 |
rrnAC1983_A | 231 | NP1896A | CHY | 43 |
rrnAC2105_A | 134 | NP0772A | CHY | 54 | HQ1375A, OE4661R
rrnAC2167_A | 96 | NP4084A | CHY | 35 | HQ2323A
rrnAC2212_A | 142 | HQ1071A | CHY | 38 | OE2090R
rrnAC2268_A | 116 | NP2558A | Transcription regulator | 74 | OE2591R, NP3596A, rrnAC3399, HQ1949A
rrnAC2270_A | 59 | HQ1365A | Small ZnF | 68 | NP0928A, OE4676F
rrnAC2286_A | 84 | NP5336A | CHY | 43 |
rrnAC2448_A | 54 | NP5086A | CHY | 51 |
rrnAC2530_A | 130 | HQ2261A | CHY | 52 |
rrnAC2569_A | 49 | HQ3677A | Small ZnF | 74 | NP0778A, OE4167C1R, HQ2748A, OE7210R
rrnAC2574_A | 52 | NP0788A | Small ZnF | 92 | OE1789R, HQ1109A, HQ2748A
rrnAC2592_A | 98 | HQ1034A | CHY | 79 | NP1820A
rrnAC2764_A | 111 | NP5102A | CHY | 59 | HQ3659A, OE4054F
rrnAC2791_A | 115 | NP0196A | CHY | 70 | HQ3647A, OE3914R
rrnAC2834_A | 129 | OE4148F | CHY | 31 | HQ3411A
rrnAC2897_A | 73 | OE4475R | Small ZnF | 58 | HQ3437A, NP0708A
rrnAC2982_A | 137 | HQ2813A | CHY | 42 | HQ1789A, HQ2547A, pNG7092, NP1808A
rrnAC3115_A | 57 | NP0186A | rib_prot HL32 | 75 | HQ3421A
rrnB0024_A | 139 | OE6004F | Small ZnF | 64 | NP6252A, HQ1149A
rrnB0146_A | 118 | OE1549F | CHY | 54 | NP1698A, HQ1429A, OE3894R
rrnB0177_A | 89 | HQ2065A | CHY | 50 |
pNG3034_A | 47 | NP4282A | CHY | 80 | NP2940A
pNG6117_A | 85 | OE6052R | CHY | 59 |
pNG6164_A | 115 | OE6242R | CHY | 79 |
pNG6170_A | 53 | NP0788A | CHY | 75 | OE1789R, HQ1109A, HQ2748A, OE7210R

Using tblastN, previously unannotated genes were detected and then annotated using the manual curation options within HaloLex. For each newly assigned gene, its code, length, the best homolog (with a brief function indication and percentage of sequence identity) and other homologous genes are given. Codes are systematically assigned using the number of the upstream ORF and a letter attached with an intervening underscore (commonly _A). Function assignment abbreviations: CHY, conserved hypothetical protein; rib_prot, ribosomal protein; Small ZnF, small CPxCG-related zinc finger protein (Tarasov et al. 2008).

In a similar way, sequence homology analysis makes it possible to identify genes whose start codons were very likely assigned incorrectly. For this purpose, we analyze the results of a blastP search against closely related organisms. For each organism, the best homolog is used (provided the e-value is better than 1E-20). The alignment start positions for query and hit are used to categorize the alignment. Alignments are considered to indicate a start codon misassignment if (1) the alignment starts very close to the N-terminus for one sequence but far away from it for the other and (2) the alignment starts at the initiator methionine for one sequence and this methionine aligns with a potential start codon translation (Met or Val) in the other sequence. Candidates are further analyzed by manual inspection. Table 2 lists 337 genes from the published genome of Haloarcula marismortui for which we have reassigned the start codon (196 genes are shortened and 141 extended). We briefly discuss two example cases by showing the corresponding multiple sequence alignments (Figs. 7, 8).
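The two alignment criteria can be expressed as a small check. The following sketch is illustrative and not the HaloLex implementation: the `near` and `far` thresholds stand in for "very close to" and "far away from" the N-terminus, which the text does not quantify, and the function reports the two criteria separately rather than deciding how to combine them. The e-value used in the example call is hypothetical.

```python
START_RESIDUES = {"M", "V"}  # translations of potential start codons named in the text


def check_alignment(query_start, hit_start, residue_opposite_met, e_value,
                    max_e_value=1e-20, near=3, far=10):
    """Evaluate one best-homolog blastP alignment for start codon problems.

    query_start, hit_start: 1-based position of the first aligned residue in
        each sequence.
    residue_opposite_met: residue of the longer sequence that aligns with the
        initiator methionine of the other one, or None if not applicable.
    near, far: assumed thresholds, not taken from the source.

    Returns None if the hit is too weak, otherwise a tuple
    (offset_flag, start_residue_flag) for criteria (1) and (2).
    """
    if e_value >= max_e_value:
        return None
    low, high = sorted((query_start, hit_start))
    offset_flag = low <= near and high >= far
    start_residue_flag = low == 1 and residue_opposite_met in START_RESIDUES
    return offset_flag, start_residue_flag


if __name__ == "__main__":
    # Positions as in the Fig. 7 example: Met-1 of one protein aligns with
    # Met-121 of its homolog. The e-value is a made-up placeholder.
    print(check_alignment(1, 121, "M", 1e-50))
```

A flagged alignment does not say which of the two genes is wrong; as in the examples of Figs. 7 and 8, that decision is left to manual inspection.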
Table 2

Genes with corrected start codon assignments in Haloarcula marismortui

ORF | Corrected length (aa) | Original length (aa) | Direction of change | Homologous ORFs
rrnAC0004 | 1,551 | 1,356 | Extended | NP4364A, OE3175F, HQ3018A
rrnAC0005 | 624 | 360 | Extended | OE2052F, NP3952A
rrnAC0012 | 957 | 1,083 | Shortened | HQ1987A, NP4816A
rrnAC0041 | 279 | 339 | Shortened | NP3812A, HQ1503A, OE2950R
rrnAC0053 | 864 | 717 | Extended | NP2042A, HQ3014A
rrnAC0080 | 1,347 | 1,383 | Shortened | NP3698A, OE2648F, HQ2890A
rrnAC0083 | 936 | 654 | Extended | NP1382A, HQ2430A
rrnAC0101 | 2,238 | 2,328 | Shortened | OE2656R, NP3690A
rrnAC0115 | 774 | 909 | Shortened | NP4142A, OE2472F
rrnAC0137 | 882 | 993 | Shortened | OE2860R, NP3116A, rrnAC1236
rrnAC0145 | 1,125 | 843 | Extended | OE4359F, HQ1341A
rrnAC0171 | 984 | 1,080 | Shortened | OE1599F
rrnAC0178 | 1,812 | 1,725 | Extended | OE1613R
rrnAC0181 | 1,866 | 1,413 | Extended | NP4200A, OE3010F, HQ2528A
rrnAC0198 | 1,236 | 1,305 | Shortened | NP1646A
rrnAC0199 | 480 | 537 | Shortened | NP1296A
rrnAC0213 | 951 | 726 | Extended | HQ1261A
rrnAC0215 | 723 | 522 | Extended | NP0356A, HQ2398A, OE1636F
rrnAC0239 | 999 | 1,110 | Shortened | NP1902A, OE2918F
rrnAC0240 | 333 | 372 | Shortened | OE3652F, HQ2230A
rrnAC0249 | 879 | 1,017 | Shortened | OE3606R, NP3184A
rrnAC0261 | 267 | 300 | Shortened | HQ2898A, NP3686A, OE3683R
rrnAC0280 | 333 | 381 | Shortened | NP2580A, HQ2722A
rrnAC0284 | 447 | 225 | Extended | NP2596A, HQ2724A
rrnAC0304 | 1,104 | 1,155 | Shortened | OE2360R, NP5226A
rrnAC0305 | 957 | 1,011 | Shortened | OE2551F, NP3722A
rrnAC0322 | 1,176 | 1,203 | Shortened | NP3076A, OE2763F
rrnAC0324 | 411 | 153 | Extended | NP3362A, OE4451F
rrnAC0329 | 1,245 | 1,035 | Extended | NP2206A, HQ2700A
rrnAC0374 | 414 | 324 | Extended | NP2642A, OE2237F, HQ2615A
rrnAC0394 | 1,008 | 1,056 | Shortened | NP1082A, OE4339R, HQ3696A
rrnAC0426 | 1,386 | 1,464 | Shortened | pNG7203, NP0964A, HQ3464A
rrnAC0430 | 936 | 975 | Shortened | HQ2692A, NP1916A, OE2547R
rrnAC0436 | 1,530 | 1,605 | Shortened | OE2288F, NP2702A, HQ2668A
rrnAC0481 | 1,002 | 894 | Extended | NP2798A
rrnAC0494 | 249 | 306 | Shortened | NP4192A, OE1860F, HQ1556A
rrnAC0497 | 708 | 471 | Extended | HQ1562A, OE1794R
rrnAC0505 | 1,710 | 1,485 | Extended | OE3490R, NP1742A, HQ3347A
rrnAC0506 | 1,527 | 1,575 | Shortened | NP3956A, OE2049R
rrnAC0536 | 573 | 606 | Shortened | NP2906A, HQ2751A
rrnAC0546 | 1,791 | 1,833 | Shortened | HQ1573A, OE1495R, NP1746A
rrnAC0568 | 699 | 435 | Extended | HQ1615A, rrnAC3127, NP1996A
rrnAC0572 | 804 | 855 | Shortened | HQ3669A, rrnAC2557, NP0792A, OE3115F
rrnAC0589 | 657 | 846 | Shortened | rrnAC2321, OE8048F
rrnAC0617 | 1,335 | 1,413 | Shortened | NP0212A, HQ2634A, OE8010R
rrnAC0619 | 1,272 | 963 | Extended | HQ3141A, OE4634F, NP0578A
rrnAC0620 | 1,566 | 1,431 | Extended | NP4066A, HQ2635A, OE7174R
rrnAC0628 | 912 | 957 | Shortened | NP4072A, HQ2637A, OE1748R
rrnAC0629 | 738 | 774 | Shortened | NP4074A, HQ2638A, OE1752F
rrnAC0631 | 1,122 | 642 | Extended | NP0380A, OE1582R, HQ1531A
rrnAC0633 | 969 | 1,008 | Shortened | NP0384A, OE1578F, HQ1670A
rrnAC0638 | 474 | 606 | Shortened | HQ3692A, NP2228A
rrnAC0651 | 1,191 | 1,059 | Extended | OE4393R, NP0888A
rrnAC0655 | 603 | 657 | Shortened | NP1390A, HQ1666A
rrnAC0660 | 333 | 417 | Shortened | NP4090A, HQ2743A, OE1651F
rrnAC0663 | 1,029 | 240 | Extended | NP0372A, OE1646R, HQ2392A
rrnAC0666 | 708 | 582 | Extended | NP0100A, HQ3478A, OE1004F
rrnAC0674 | 810 | 945 | Shortened | rrnAC0848, OE7042R, rrnAC2044, HQ2141A, NP6028A
rrnAC0687 | 1,269 | 1,347 | Shortened | OE4207F
rrnAC0696 | 762 | 900 | Shortened | NP0818A, OE1554R
rrnAC0717 | 636 | 699 | Shortened | NP1230A, OE1713F, HQ1537A
rrnAC0721 | 1,371 | 1,395 | Shortened | OE3467R, HQ1298A
rrnAC0753 | 879 | 936 | Shortened | OE2785R, HQ2762A
rrnAC0777 | 903 | 492 | Extended | HQ2440A
rrnAC0779 | 861 | 915 | Shortened | OE2138F, NP1596A
rrnAC0801 | 951 | 1,071 | Shortened | NP4302A, OE3145F, HQ2933A
rrnAC0825 | 1,284 | 1,383 | Shortened | NP4134A, HQ2196A
rrnAC0833 | 795 | 858 | Shortened | NP1462A, OE1641R, HQ2394A
rrnAC0838 | 1,779 | 1,827 | Shortened | NP2726A, HQ1873A, OE2653R
rrnAC0841 | 954 | 1,062 | Shortened | NP2730A, OE2561R, HQ1874A
rrnAC0843 | 1,524 | 1,587 | Shortened | NP2738A, OE2555R, HQ2402A
rrnAC0852 | 1,017 | 966 | Extended | OE3343R
rrnAC0875 | 543 | 405 | Extended | OE3121R, HQ2788A
rrnAC0878 | 810 | 783 | Extended | OE3119R, NP4334A
rrnAC0883 | 1,116 | 504 | Extended | NP4236A
rrnAC0896 | 1,152 | 1,293 | Shortened | HQ1590A, OE2358F, NP3650A
rrnAC0917 | 1,065 | 1,140 | Shortened | HQ1663A, OE1669F
rrnAC0925 | 990 | 1,035 | Shortened | NP2796A, OE2451R
rrnAC0934 | 423 | 471 | Shortened | NP2710A, OE2005F, HQ2301A
rrnAC0942 | 1,611 | 1,416 | Extended | OE3436R
rrnAC0944 | 1,302 | 1,350 | Shortened | HQ1663A, OE1669F
rrnAC0956 | 462 | 537 | Shortened | HQ1497A, OE2934R
rrnAC1042 | 1,806 | 1,851 | Shortened | rrnAC1570, HQ3533A
rrnAC1083 | 1,965 | 2,010 | Shortened | NP4322A, OE2871F
rrnAC1106 | 519 | 420 | Extended | NP4198A, OE2985F, HQ2561A
rrnAC1107 | 1,476 | 1,176 | Extended | NP4904A, HQ1686A
rrnAC1115 | 270 | 324 | Shortened | NP4036A, OE2903R, HQ2458A
rrnAC1138 | 849 | 507 | Extended | OE2020F, NP1592A
rrnAC1169 | 1,311 | 867 | Extended | NP3742A, OE2827R, HQ2339A
rrnAC1218 | 1,794 | 1,821 | Shortened | HQ1754A
rrnAC1220 | 663 | 606 | Extended | HQ1752A
rrnAC1261 | 1,155 | 1,182 | Shortened | NP4050A, HQ2389A
rrnAC1263 | 798 | 894 | Shortened | OE2913R, NP3970A
rrnAC1281 | 2,529 | 2,592 | Shortened | OE2573F, NP1526A
rrnAC1299 | 1,881 | 1,704 | Extended | OE1143R, HQ3344A, NP1442A
rrnAC1308 | 339 | 399 | Shortened | NP2066A, HQ1665A, OE1673F
rrnAC1336 | 399 | 426 | Shortened | NP4972A
rrnAC1341 | 1,683 | 1,800 | Shortened | NP0164A
rrnAC1350 | 1,050 | 1,191 | Shortened | NP3216A, OE1906R, HQ2500A
rrnAC1361 | 687 | 498 | Extended | OE2276F, NP2980A, HQ1692A
rrnAC1365 | 1,026 | 1,161 | Shortened | rrnAC0576
rrnAC1377 | 582 | 777 | Shortened | rrnAC0508, NP3954A
rrnAC1383 | 210 | 342 | Shortened | NP1548A
rrnAC1395 | 849 | 1,029 | Shortened | NP4160A
rrnAC1438 | 429 | 351 | Extended | NP3220A, OE2139R, HQ1579A
rrnAC1443 | 1,275 | 1,398 | Shortened | NP3228A, HQ1584A, OE2149R
rrnAC1444 | 1,623 | 1,746 | Shortened | HQ1934A, HQ2096A, pNG7256
rrnAC1447 | 414 | 534 | Shortened | NP2292A, HQ1637A, OE1953F
rrnAC1454 | 303 | 207 | Extended | OE1963F, HQ1645A, NP2308A
rrnAC1477 | 1,047 | 1,251 | Shortened | OE2014F, HQ2353A
rrnAC1497 | 1,431 | 1,707 | Shortened | NP4594A, OE3274R
rrnAC1500 | 1,092 | 1,155 | Shortened | OE3278R, NP4774A
rrnAC1504 | 846 | 528 | Extended | NP4780A, HQ2866A, OE3286F
rrnAC1516 | 189 | 489 | Shortened | OE3330F
rrnAC1530 | 765 | 459 | Extended | NP1786A, HQ3174A
rrnAC1532 | 711 | 579 | Extended | NP1788A, OE3352R, HQ3173A
rrnAC1536 | 1,332 | 1,395 | Shortened | HQ1685A, NP4902A, OE5298F
rrnAC1542 | 1,416 | 1,644 | Shortened | OE1133F
rrnAC1567 | 291 | 474 | Shortened | HQ2131A
rrnAC1588 | 1,308 | 882 | Extended | OE5062R
rrnAC1621 | 570 | 459 | Extended | HQ2801A, OE3367F
rrnAC1626 | 2,532 | 2,808 | Shortened | rrnAC2044, rrnAC0848, OE7042R, HQ2141A
rrnAC1628 | 366 | 216 | Extended | OE3324R, HQ2783A, NP3352A
rrnAC1630 | 591 | 411 | Extended | OE2334R, NP0028A
rrnAC1638 | 975 | 1,107 | Shortened | NP1214A
rrnAC1647 | 402 | 222 | Extended | NP1834A
rrnAC1655 | 750 | 675 | Extended | NP1082A, OE4339R, HQ3696A
rrnAC1665 | 642 | 735 | Shortened | NP2884A
rrnAC1669 | 279 | 351 | Shortened | OE1371R, HQ1283A, NP5232A
rrnAC1680 | 390 | 429 | Shortened | NP0612A, OE1371R, HQ1286A
rrnAC1690 | 1,545 | 1,509 | Extended | NP0624A, HQ1292A
rrnAC1702 | 1,719 | 1,218 | Extended | NP1742A, OE3490R, HQ3347A
rrnAC1708 | 1,347 | 1,089 | Extended | NP4502A, HQ3336A, OE3496R
rrnAC1718 | 1,359 | 1,065 | Extended | NP4542A, OE3506F, HQ3330A
rrnAC1726 | 1,527 | 1,485 | Extended | HQ3326A, OE3511F, NP4534A
rrnAC1743 | 549 | 486 | Extended | HQ1673A, NP5358A, rrnAC3526
rrnAC1764 | 924 | 978 | Shortened | rrnAC1777, NP5168A, OE1385F, HQ1277A
rrnAC1774 | 339 | 411 | Shortened | HQ1279A, OE1379R
rrnAC1776 | 588 | 633 | Shortened | NP5166A, OE1384F, HQ1278A
rrnAC1779 | 651 | 516 | Extended | NP5170A
rrnAC1782 | 933 | 963 | Shortened | HQ1276A, NP5174A, OE4651F
rrnAC1797 | 858 | 903 | Shortened | NP4932A, OE3445F, rrnAC0317
rrnAC1809 | 735 | 444 | Extended | OE1445R, NP1134A
rrnAC1812 | 705 | 525 | Extended | OE1451F, HQ1168A, NP1178A
rrnAC1822 | 717 | 666 | Extended | NP1636A, OE1793F, HQ1712A
rrnAC1826 | 522 | 237 | Extended | NP1498A, HQ1709A, OE1785F
rrnAC1840 | 2,676 | 2,727 | Shortened | NP1516A, OE1770F, HQ1701A
rrnAC1849 | 1,746 | 1,083 | Extended | HQ1189A, NP5206A
rrnAC1853 | 3,003 | 2,616 | Extended | NP5214A, HQ1185A
rrnAC1855 | 891 | 330 | Extended | NP5218A, OE1417F, HQ1183A
rrnAC1867 | 522 | 756 | Shortened | HQ1177A, OE1434R, NP5318A
rrnAC1870 | 1,455 | 1,209 | Extended | NP4702A, OE3960F
rrnAC1880 | 1,239 | 1,284 | Shortened | NP0438A, OE3971R, rrnAC3166
rrnAC1905 | 429 | 279 | Extended | NP4526A, pNG6069
rrnAC1930 | 924 | 366 | Extended | OE3253F, NP5308A, HQ1411A
rrnAC1931 | 804 | 504 | Extended | HQ1410A, NP5306A, OE3255F
rrnAC1950 | 1,893 | 1,680 | Extended | NP0158A, HQ1329A
rrnAC1957 | 1,491 | 1,671 | Shortened | HQ2578A
rrnAC1979 | 795 | 591 | Extended | NP1462A, OE1641R
rrnAC1983 | 1,218 | 1,755 | Shortened | NP1754A, OE2013R
rrnAC1992 | 738 | 591 | Extended | NP1470A
rrnAC2014 | 792 | 747 | Extended | NP5122A, OE1306F, HQ1416A
rrnAC2085 | 360 | 522 | Shortened | NP0342A, OE4713R, HQ3071A
rrnAC2098 | 1,878 | 1,905 | Shortened | NP0198A, OE4671R, HQ1369A
rrnAC2105 | 522 | 1,014 | Shortened | NP0774A, OE4663F, HQ1347A
rrnAC2127 | 1,005 | 711 | Extended | NP0962A
rrnAC2129 | 864 | 894 | Shortened | OE4355R, NP3186A
rrnAC2158 | 1,974 | 2,034 | Shortened | NP0404A, OE4613F, HQ3117A
rrnAC2159 | 462 | 609 | Shortened | NP0954A, HQ3116A, OE4610R
rrnAC2181 | 1,098 | 1,227 | Shortened | NP1140A, OE4571R, HQ1074A
rrnAC2221 | 435 | 579 | Shortened | OE4541F, NP1718A, HQ1065A, rrnAC2455
rrnAC2223 | 372 | 387 | Shortened | NP1710A, OE4544R, HQ1063A
rrnAC2245 | 1,137 | 870 | Extended | OE4034R, HQ3066A, NP0030A
rrnAC2247 | 1,296 | 1,443 | Shortened | NP1050A
rrnAC2258 | 624 | 435 | Extended | NP0018A
rrnAC2261 | 954 | 1,026 | Shortened | OE1151R, NP0014A, HQ1359A
rrnAC2278 | 528 | 600 | Shortened | NP3368A, HQ2565A, rrnAC0868, OE2992R
rrnAC2284 | 1,038 | 993 | Extended | NP5368A, OE2438R
rrnAC2352 | 1,167 | 1,188 | Shortened | pNG7026, OE5170F, HQ1989A
rrnAC2356 | 999 | 1,251 | Shortened | NP5048A, HQ1275A, OE4196R
rrnAC2359 | 432 | 180 | Extended | NP4806A, rrnAC0738, OE3162F, HQ2346A
rrnAC2377 | 1,281 | 924 | Extended | NP0578A, OE4634F, HQ3141A
rrnAC2440 | 684 | 780 | Shortened | NP0956A, OE4360R, HQ3733A
rrnAC2460 | 1,977 | 2,127 | Shortened | NP1264A, HQ3402A, OE4140R
rrnAC2469 | 1,299 | 1,422 | Shortened | HQ2809A, HQ2192A
rrnAC2473 | 1,524 | 1,569 | Shortened | OE2133R, NP3020A
rrnAC2474 | 288 | 351 | Shortened | NP1258A, OE4136R, HQ3399A
rrnAC2476 | 936 | 582 | Extended | NP1312A, OE4133R
rrnAC2518 | 1,266 | 1,221 | Extended | NP1318A, OE3943R, HQ3056A
rrnAC2525 | 1,026 | 735 | Extended | HQ1021A, OE4201R
rrnAC2526 | 390 | 477 | Shortened | NP1272A, HQ2001A, OE4217R
rrnAC2529 | 741 | 906 | Shortened | NP1268A, OE4218F, HQ1025A
rrnAC2532 | 2,766 | 3,069 | Shortened | NP0538A, OE1272R, HQ1460A
rrnAC2550 | 2,154 | 2,241 | Shortened | OE1267R, NP0536A, HQ1456A
rrnAC2558 | 1,863 | 1,443 | Extended | OE3889R, HQ3102A, NP1576A
rrnAC2565 | 978 | 1,038 | Shortened | HQ3671A, NP0900A, OE4195F
rrnAC2582 | 699 | 870 | Shortened | NP1406A, HQ1040A, OE4235F
rrnAC2586 | 1,200 | 1,065 | Extended | HQ3704A, NP1412A, OE4236F
rrnAC2592 | 975 | 1,281 | Shortened | HQ1035A, NP1818A
rrnAC2627 | 2,061 | 2,106 | Shortened | NP1344A, HQ2213A
rrnAC2629 | 804 | 864 | Shortened | NP5160A
rrnAC2630 | 858 | 993 | Shortened | OE4085R, NP0606A, HQ3650A
rrnAC2633 | 1,356 | 888 | Extended | rrnAC0404, OE3070R
rrnAC2636 | 624 | 720 | Shortened | NP5088A, OE3906F
rrnAC2642 | 738 | 246 | Extended | NP5114A, HQ2624A, OE2740F
rrnAC2656 | 1,125 | 1,182 | Shortened | HQ2450A, OE2317R
rrnAC2657 | 1,194 | 1,062 | Extended | OE5132F, rrnB0290, NP1412A
rrnAC2714 | 1,569 | 1,179 | Extended | NP0482A, HQ1003A, OE4390F
rrnAC2722 | 510 | 558 | Shortened | NP0462A, HQ3640A, OE4429F
rrnAC2748 | 435 | 582 | Shortened | NP5152A, HQ3265A, OE4027F
rrnAC2749 | 453 | 186 | Extended | NP5150A, OE4028R, HQ3266A
rrnAC2753 | 648 | 687 | Shortened | HQ1473A
rrnAC2754 | 366 | 417 | Shortened | NP5146A, OE4039F
rrnAC2755 | 1,206 | 1,257 | Shortened | OE4034R, HQ3066A, NP0030A
rrnAC2756 | 2,046 | 1,770 | Extended | NP5144A, HQ3065A, OE4041F
rrnAC2761 | 1,143 | 960 | Extended | OE2170R, HQ2450A
rrnAC2772 | 1,521 | 1,617 | Shortened | NP1074A, HQ2660A
rrnAC2776 | 1,488 | 1,260 | Extended | pNG7305, OE2076F, HQ2506A
rrnAC2780 | 1,356 | 1,443 | Shortened | NP4376A, HQ3643A, OE3922R
rrnAC2781 | 894 | 711 | Extended | NP0190A, OE3921F, HQ3644A
rrnAC2782 | 1,236 | 1,140 | Extended | NP0192A
rrnAC2783 | 1,437 | 1,296 | Extended | NP5292A, OE2063R
rrnAC2798 | 552 | 393 | Extended | OE3905F, HQ1379A, NP0086A
rrnAC2800 | 612 | 699 | Shortened | NP0092A, OE3902R, HQ1377A
rrnAC2804 | 516 | 447 | Extended | NP1700A, OE3895F
rrnAC2806 | 381 | 228 | Extended | NP1698A, HQ1429A, OE3894R
rrnAC2810 | 909 | 990 | Shortened | OE3892R, NP1688A, HQ3137A
rrnAC2811 | 1,905 | 1,128 | Extended | OE3889R, HQ3102A, NP1576A
rrnAC2818 | 1,344 | 1,479 | Shortened | NP2252A, HQ3104A, OE3882R
rrnAC2822 | 951 | 309 | Extended | NP2248A, OE3879F, HQ3106A
rrnAC2831 | 540 | 411 | Extended | NP5076A, HQ1339A, OE3871R
rrnAC2834 | 1,032 | 1,677 | Shortened | NP1066A, OE4144R, HQ1407A
rrnAC2836 | 1,206 | 1,359 | Shortened | HQ2439A, NP3538A
rrnAC2851 | 744 | 786 | Shortened | NP0554A, OE4165R, HQ3686A
rrnAC2857 | 2,196 | 2,238 | Shortened | NP1350A, OE4181R, HQ3684A
rrnAC2859 | 630 | 678 | Shortened | NP1332A, HQ3517A
rrnAC2867 | 1,467 | 1,353 | Extended | OE4370R, NP5292A
rrnAC2870 | 1,092 | 951 | Extended | HQ3382A, OE4359F
rrnAC2891 | 1,698 | 1,773 | Shortened | NP1008A, OE4122R, HQ3049A
rrnAC2893 | 975 | 873 | Extended | NP0698A, HQ3439A, OE2975F
rrnAC2901 | 1,728 | 1,278 | Extended | NP0898A, OE4471R
rrnAC2933 | 720 | 558 | Extended | NP0238A, HQ1390A
rrnAC2937 | 2,844 | 2,895 | Shortened | OE1286R, NP0232A
rrnAC3005 | 927 | 999 | Shortened | NP1116A, HQ3034A, OE3214F
rrnAC3008 | 1,158 | 723 | Extended | NP1114A, HQ3033A, OE3216F
rrnAC3046 | 1,104 | 1,146 | Shortened | NP0884A, HQ2919A
rrnAC3050 | 2,199 | 2,514 | Shortened | rrnAC0848, rrnAC2044, HQ2141A, NP6028A
rrnAC3062 | 954 | 1,032 | Shortened | rrnAC1698, OE1358R, HQ2259A
rrnAC3071 | 915 | 1,149 | Shortened | OE3959R, HQ3234A, NP5036A
rrnAC3074 | 699 | 963 | Shortened | NP0072A, HQ3236A, OE3964R
rrnAC3079 | 573 | 624 | Shortened | NP3402A
rrnAC3083 | 375 | 423 | Shortened | NP0948A, OE4292F, HQ3465A
rrnAC3100 | 2,010 | 2,049 | Shortened | NP2262A, HQ1094A, OE3832F
rrnAC3121 | 651 | 1,194 | Shortened | HQ1495A
rrnAC3130 | 939 | 645 | Extended | HQ1618A, rrnB0227
rrnAC3132 | 1,392 | 1,419 | Shortened | HQ1619A, rrnAC2624
rrnAC3137 | 1,047 | 798 | Extended | NP0860A, HQ3045A, OE4446R
rrnAC3167 | 687 | 750 | Shortened | OE1188F, rrnAC1953
rrnAC3182 | 627 | 666 | Shortened | NP3516A, rrnAC1228
rrnAC3198 | 1,215 | 1,044 | Extended | NP1072A
rrnAC3210 | 1,311 | 1,356 | Shortened | NP4992A, OE3792F, HQ3101A
rrnAC3214 | 864 | 936 | Shortened | HQ3098A, OE3787R, NP2524A
rrnAC3226 | 1,671 | 1,599 | Extended | NP5136A
rrnAC3236 | 1,383 | 420 | Extended | OE1018F, rrnAC1586
rrnAC3256 | 546 | 663 | Shortened | NP3054A, HQ3084A, OE3752R
rrnAC3268 | 678 | 498 | Extended | NP5010A, OE3731R, HQ3131A
rrnAC3272 | 981 | 834 | Extended | NP5006A, HQ3129A, OE3735F
rrnAC3279 | 1,107 | 1,032 | Extended | NP2398A, OE3722F, HQ3125A
rrnAC3302 | 786 | 735 | Extended | OE3439F, HQ2249A
rrnAC3328 | 531 | 621 | Shortened | NP1380A, OE1858F, HQ1684A
rrnAC3342 | 891 | 945 | Shortened | NP4916A, OE3430F, HQ2764A
rrnAC3345 | 453 | 531 | Shortened | HQ1964A, rrnAC2948, OE2717R
rrnAC3348 | 876 | 963 | Shortened | NP4524A, HQ3322A, OE3531R
rrnAC3352 | 609 | 765 | Shortened | NP4518A, OE3537R, HQ2230A
rrnAC3371 | 2,361 | 1,971 | Extended | pNG2034, NP3562A, HQ1851A, OE5286R
rrnAC3385 | 573 | 396 | Extended | OE3854R, NP2906A
rrnAC3394 | 444 | 525 | Shortened | NP1284A, rrnB0323
rrnAC3420 | 843 | 885 | Shortened | OE3633R, NP2432A, HQ3036A
rrnAC3450 | 366 | 399 | Shortened | OE3588C1R, NP2486A, HQ3026A
rrnAC3452 | 894 | 951 | Shortened | NP2484A, OE3586R, HQ3027A
rrnAC3462 | 1,929 | 2,019 | Shortened | NP2410A, OE3580R, HQ2579A
rrnAC3475 | 1,113 | 594 | Extended | NP6258A
rrnAC3486 | 636 | 687 | Shortened | NP4286A
rrnAC3509 | 624 | 471 | Extended | OE1814R, HQ1569A
rrnAC3528 | 459 | 477 | Shortened | rrnAC3526, pNG2015
rrnAC3536 | 276 | 153 | Extended | NP0840A, OE1853R
rrnAC3537 | 1,089 | 681 | Extended | HQ1674A, OE1854R, NP0842A
rrnAC3551 | 1,026 | 750 | Extended | NP2032A, HQ3010A, rrnAC1926, OE5141R
rrnB0092 | 1,533 | 1,599 | Shortened | HQ1972A
rrnB0172 | 591 | 507 | Extended | NP1842A
rrnB0198 | 1,530 | 1,623 | Shortened | NP0556A, OE4115F
rrnB0242 | 1,173 | 1,287 | Shortened | NP2606A, HQ2307A
rrnB0257 | 834 | 888 | Shortened | NP4244A, OE1942F
rrnB0265 | 1,683 | 1,851 | Shortened | NP4242A
rrnB0266 | 1,404 | 1,011 | Extended | NP3416A
rrnB0275 | 1,101 | 1,065 | Extended | rrnAC0899, OE4576F
rrnB0325 | 1,107 | 1,278 | Shortened | OE5142F, NP2128A, rrnAC3284, HQ3147A
pNG2007 | 939 | 1,119 | Shortened | OE4023F, NP1282A, rrnAC2744, HQ3263A
pNG2015 | 516 | 615 | Shortened | rrnAC3526, NP1956A
pNG4017 | 1,254 | 1,125 | Extended | NP2168A, rrnAC2207, OE2401F
pNG4035 | 561 | 516 | Extended | OE3768F, NP5358A
pNG5001 | 1,134 | 1,104 | Extended | rrnAC0252, NP0102A, HQ1815A, OE1005F
pNG5004 | 1,488 | 1,068 | Extended | rrnAC0250
pNG5010 | 1,632 | 879 | Extended | OE5248F, HQ1543A, NP2464A
pNG5131 | 633 | 579 | Extended | rrnAC3384, OE4753R
pNG5139 | 1,251 | 312 | Extended | HQ2051A, NP6268A, OE1070R
pNG6047 | 591 | 618 | Shortened | HQ1118A, OE2691R, rrnAC0503, NP2664A
pNG6054 | 423 | 1,047 | Shortened | NP5022A
pNG6069 | 417 | 444 | Shortened | NP4526A, rrnAC1905
pNG6075 | 378 | 477 | Shortened | OE7144R, NP3058A
pNG6092 | 294 | 324 | Shortened | pNG6058, OE7057F, NP3002A, HQ2407A
pNG6120 | 861 | 921 | Shortened | OE5424R
pNG6141 | 615 | 585 | Extended | OE5415R
pNG7012 | 1,281 | 1,026 | Extended | OE1077R, rrnAC3239, HQ2680A, NP2322A
pNG7037 | 1,488 | 1,725 | Shortened | NP5056A
pNG7040 | 1,182 | 999 | Extended | pNG7041, OE4576F
pNG7050 | 750 | 705 | Extended | HQ3696A, rrnAC0479, NP1198A, OE3661F
pNG7058 | 528 | 501 | Extended | OE1252R, HQ2374A, NP1606A
pNG7060 | 1,272 | 1,302 | Shortened | NP6204A
pNG7066 | 1,017 | 1,107 | Shortened | OE2128F, HQ2746A, NP1386A
pNG7078 | 1,071 | 1,242 | Shortened | NP1388A, HQ1592A
pNG7081 | 399 | 363 | Extended | HQ4010A
pNG7101 | 1,971 | 2,004 | Shortened | HQ1729A
pNG7106 | 897 | 819 | Extended | OE2497F, HQ2422A, NP1346A
pNG7178 | 984 | 834 | Extended | HQ2189A
pNG7227 | 747 | 612 | Extended | NP0054A, HQ1091A, OE3843F
pNG7244 | 2,481 | 2,166 | Extended | pNG7246, HQ1944A
pNG7252 | 1,659 | 1,788 | Shortened | OE2316R, rrnAC2655, HQ2451A
pNG7278 | 1,041 | 1,155 | Shortened | OE4674F, HQ1124A
pNG7280 | 540 | 489 | Extended | pNG6134, NP5298A
pNG7297 | 603 | 630 | Shortened | NP0672A
pNG7321 | 381 | 408 | Shortened | HQ1769A, pNG7235, OE3930R, NP0566A
pNG7327 | 1,512 | 1,572 | Shortened | HQ1784A, NP0802A, OE1568F
pNG7342 | 1,098 | 1,128 | Shortened | pNG7026, OE5170F
pNG7351 | 1,065 | 1,089 | Shortened | rrnAC0191, NP1260A, OE4674F, HQ3648A
pNG7377 | 432 | 324 | Extended | HQ3372A
pNG7380 | 1,746 | 1,932 | Shortened | HQ1768A

Using our semiautomatic checking procedure, candidate genes with probable errors in start codon assignment were identified and subjected to manual curation. When sufficiently strong evidence was found, the start codon was reassigned using the manual curation options of HaloLex. For each gene in the list (first column), we provide the corrected (second column) and original length (third column) of the amino-acid sequence, and the set of homologous genes that support our decision for the new start codon assignment (fifth column). The redundant fourth column facilitates a quick overview of whether sequences were extended or shortened with respect to their original length.
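Because the fourth column of Table 2 is redundant, it can be recomputed from the two length columns. The short sketch below shows this for four rows copied from the table; the function and variable names are illustrative assumptions, and applying the same tally to all rows would be one way to cross-check the counts of shortened and extended genes given in the text.

```python
from collections import Counter


def direction_of_change(corrected_length, original_length):
    """Derive the 'Direction of change' entry from the two length columns."""
    if corrected_length > original_length:
        return "Extended"
    if corrected_length < original_length:
        return "Shortened"
    return "Unchanged"


def tally_directions(rows):
    """Count how many rows fall into each direction category."""
    return Counter(direction_of_change(c, o) for _, c, o in rows)


# Four rows taken from Table 2: (ORF, corrected length, original length).
ROWS = [
    ("rrnAC0004", 1551, 1356),
    ("rrnAC0012", 957, 1083),
    ("rrnAC2377", 1281, 924),
    ("rrnAC2722", 510, 558),
]

if __name__ == "__main__":
    for orf, corrected, original in ROWS:
        print(orf, direction_of_change(corrected, original))
    print(dict(tally_directions(ROWS)))
```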

Fig. 7

Homology-based start codon checking for the detection of ORFs that are too short. A sequence alignment of four homologous proteins of H. salinarum (strains R1 and NRC-1), N. pharaonis, H. walsbyi and H. marismortui is shown. Codes starting with OE are from H. salinarum strain R1, those with VNG from strain NRC-1, NP from N. pharaonis, HQ from H. walsbyi and those starting with rrnAC from H. marismortui. Uppercase letters indicate the protein sequence as obtained from the current database, with the first methionine in bold. Lowercase letters indicate additional residues obtained by our correction of the start codon assignment. Residues conserved in all sequences are indicated by asterisks.

Fig. 8

Homology-based start codon checking for the detection of ORFs that are too long. A sequence alignment of four homologous proteins of H. salinarum (strains R1 and NRC-1), N. pharaonis, H. walsbyi and H. marismortui is shown. Codes starting with OE are from H. salinarum strain R1, those with VNG from strain NRC-1, NP from N. pharaonis, HQ from H. walsbyi and those starting with rrnAC from H. marismortui. The protein sequences are highly homologous. Residues conserved in all sequences are indicated by asterisks (lower alignment block). Spurious N-terminal sequence extensions are possible in three of the four species, but are considered to be incorrect as they are not homologous to each other (upper alignment block). Uppercase letters indicate the protein sequence as obtained from the current database, with the first methionine in bold. The position of the probable initiator methionine in the current database sequence is indicated. Lowercase letters indicate gene extensions that are possible but are considered spurious.

Figure 7 shows a gene that needs to be extended in Haloarcula marismortui (and also in H. salinarum strain NRC-1). Met-1 of rrnAC2377 aligns with Met-121 of NP0578A. Using the longer sequence (here NP0578A) for tblastN shows that the homologous region extends beyond the assigned start codon (lowercase sequence letters for rrnAC2377). VNG2591C can also be extended to match OE4634F, as the genome sequences of strains R1 and NRC-1 are identical in this region (also indicated by lowercase sequence letters). Figure 8 shows an example of a gene that needs to be shortened in Haloarcula marismortui (and also in H. salinarum strain NRC-1). The methionine at position 17 in the rrnAC2722 sequence aligns with the methionine at position 1 of NP0462A. Using the longer sequence (here rrnAC2722) for tblastN does not result in an extension of the homologous region as compared to the shorter sequence (NP0462A), which indicates that the extension may be spurious. Spurious ORF extensions are possible in three of the four halophiles, but they are not homologous to each other. It should be stressed that the homology-based procedures described above are not suitable for performing automatic, unsupervised gene predictions. Rather, they serve to preselect candidates with probable gene prediction errors, which then need to be manually inspected. The HaloLex system is well suited to support such manual curation: it not only supports detailed analysis but, once a decision has been taken, also allows that decision to be made persistent with a few clicks.

Conclusions and outlook

We have described HaloLex, a software system for the central management, integration, and web-based visualization of genomic and other -omics data. A number of HaloLex functionalities are specifically tailored to halophilic archaea, but the system can handle any given microorganism. HaloLex has proven an indispensable tool for the data management, curation, and in-depth bioinformatic analysis of three halophilic archaea sequenced in-house, namely Halobacterium salinarum (strain R1), Natronomonas pharaonis, and Haloquadratum walsbyi. HaloLex summarizes all available data for a given organism, including experimental data such as proteomics, in an easy-to-use web interface. This proved to be of enormous importance both for the daily user of genome information and for the manual curator of the gene annotation in these organisms. In this article, we further reviewed a number of selected, biologically relevant results that we obtained for these species, thus highlighting the capabilities of HaloLex for prediction and curation of gene assignment, in particular by the integrated analysis of genomic and proteomic data. Lately, we have applied HaloLex functionalities to the published genome of another halophilic archaeon, Haloarcula marismortui, which resulted in a significantly improved version of the original gene prediction. Other halophiles (also from the bacterial kingdom) such as Halobacillus halophilus are currently being annotated by different collaborations, which shows that HaloLex could also be a useful tool for a broader user community. Based on our promising experiences, we thus encourage potential collaborators to consider employing our HaloLex server as a data repository and a tool for curation and analysis of their genomes (and proteomes, etc.) of interest.
At the same time, HaloLex would allow such groups to make their data available to the public (or to restricted user groups) without having to take up the burden of developing and hosting their own software and hardware infrastructure. Our ongoing and future activities are focused on making fully available those data and methods that so far can be used only internally (e.g., data from transcriptomics experiments). Moreover, HaloLex functionalities are continuously being improved and extended. Currently, we are about to couple software modules for text mining and metabolic modeling, which we are developing in our group, to the HaloLex web application. We also plan to release our web-service interface to support mining of HaloLex data over the Internet.