The UCSC Archaeal Genome Browser.

Kevin L Schneider, Katherine S Pollard, Robert Baertsch, Andy Pohl, Todd M Lowe.

Abstract

As more archaeal genomes are sequenced, effective research and analysis tools are needed to integrate the diverse information available for any given locus. The feature-rich UCSC Genome Browser, created originally to annotate the human genome, can be applied to any sequenced organism. We have created a UCSC Archaeal Genome Browser, available at http://archaea.ucsc.edu/, currently with 26 archaeal genomes. It displays G/C content, gene and operon annotation from multiple sources, sequence motifs (promoters and Shine-Dalgarno), microarray data, multi-genome alignments and protein conservation across phylogenetic and habitat categories. We encourage submission of new experimental and bioinformatic analysis from contributors. The purpose of this tool is to aid biological discovery and facilitate greater collaboration within the archaeal research community.

Substances: Archaeal Proteins

Year: 2006 PMID: 16381898 PMCID: PMC1347496 DOI: 10.1093/nar/gkj134

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

As more genomic sequences become available, analysis and interpretation of existing annotation become increasingly difficult without useful tools (1). This large quantity of data poses a challenge: what information should be presented at the top level so that the data inform but do not overwhelm? The extensibility of the UCSC Genome Browser allows easy integration of new organisms into a system that has a well-developed database and a graphical web-page user interface (2). BLAT, an alignment tool designed to quickly find sequences of 95% and greater similarity (3), allows quick look-up of sequences of interest. The UCSC Genome Browser was first implemented for the relatively large human genome, containing 24 linear chromosomes and dozens of annotation ‘tracks’ (2,4). Because the genome and annotation datasets are comparatively small for all prokaryotic species, loading new archaeal genomes into the browser requires only modest computational resources. To date, the Archaeal Genome Browser contains all 24 complete archaeal genomes, plus two draft assemblies (Table 1). Four bacterial genomes are also provided for comparative purposes. The genomes of these species, along with a variety of annotation tracks from our own group and other public datasets, provide the kind of simple, interactive genome information resource for which the UCSC Genome Browser is renowned.

Table 1

Species available in the Archaeal Genome Browser at time of publication

Archaeal species	Hyperthermophile	Halophile	Acidophile	Methanogen	Other extreme
Pyrobaculum aerophilum	X
Aeropyrum pernix	X
Sulfolobus acidocaldarius			X
Sulfolobus solfataricus			X
Sulfolobus tokodaii			X
Nanoarchaeum equitans	X
Thermococcus kodakaraensis	X
Pyrococcus abyssi	X
Pyrococcus furiosus	X
Pyrococcus horikoshii	X
Methanopyrus kandleri				X
Methanothermobacter thermautotrophicus				X
Methanocaldococcus jannaschii				X
Methanococcus maripaludis				X
Ferroplasma acidarmanus			X
Picrophilus torridus	X		X
Thermoplasma acidophilum			X
Thermoplasma volcanium			X
Archaeoglobus fulgidus	X
Halobacterium sp.		X
Haloarcula marismortui		X
Natronomonas pharaonis		X			Alkaliphile
Methanococcoides burtonii				X	Psychrotroph
Methanosarcina barkeri				X
Methanosarcina mazei				X
Methanosarcina acetivorans				X
Bacterial species
Aquifex aeolicus	X
Thermotoga maritima	X
Escherichia coli K12
Vibrio cholerae

All 24 fully sequenced archaeal genomes are included. In addition, two draft archaeal genomes (M.burtonii, F.acidarmanus) and four bacterial genomes are also displayed. Each species is characterized by features of its optimal growth environment.

MATERIALS AND METHODS

Archaeal browser tracks

The data displayed in the Archaeal Genome Browser give the user the ability to integrate information to make new observations. Each track presents a different biological perspective on the organism. A detailed overview of the tracks can be found on corresponding description pages in the browser. We briefly describe the 11 currently available tracks here. The Pyrobaculum aerophilum and Pyrococcus furiosus browsers have the most complete sets of tracks.

Basic tracks provide the framework for understanding a genome. These include G/C (guanine/cytosine) content, annotated protein gene predictions (ORFs) from GenBank (5) and The Institute for Genomic Research (TIGR), all possible start/stop codons in all reading frames and current known or predicted non-coding RNA (ncRNA) genes. Clicking on a GenBank ORF opens a page detailing the available RefSeq annotation for that gene, InterPro (6) and Pfam (7) domains, COG group (8), GO categories (9), KEGG pathway information (10) and ModBase (11) structural predictions when available.

Additional tracks provide data from computational analyses and experimental genomics studies. The Shine-Dalgarno track indicates positions with base pairing potential to the 3′ end of 16S rRNA, which is required for translation initiation. A companion promoter track indicates similarity to the consensus transcription factor IIB (BRE)-TATA box, which highlights possible sites of transcription initiation. The promoter track can be useful to detect incorrect start codons of annotated genes or novel genes. There are two operon tracks: published predictions by Ermolaeva and colleagues (12) at TIGR and a less stringent set of predictions using a simple distance-rule method integrated with TIGR-predicted operons. Microarray tracks display results from experiments within our laboratory. The MULTIZ conservation track provides a visual display of a full-genome multiple sequence alignment to closely related species (13). The browser also graphs a conservation score based on a phylogenetic hidden Markov model (14), useful for finding candidate ncRNAs, repetitive elements or features that are not annotated in the genome. A track displaying BLASTZ (15) hits to other loci within the same genome is useful for identifying repetitive elements and paralogous genes.

Finally, a protein conservation track displays each ORF's alignment score to other species by phylogenetic category or native environment (i.e. hyperthermophile, halophile, acidophile). This similarity database was created locally using BLASTP (16) against NCBI's non-redundant protein database. The information page for each ORF contains a ranked table of hits; each entry includes the gene name and annotated function, a link to the full GenBank entry and a link to the corresponding genome locus in the browser of the other organism (if available). The latter provides a level of ‘genome-interoperability’ not available in single-species genome browser tools. An additional link allows one to re-run the BLASTP search against the most current protein database at NCBI. Together, these data help improve or update functional prediction of poorly characterized genes.
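The less stringent operon track is described only as using a simple distance rule, so the details are not given here. As an illustration of the general idea, the following minimal sketch groups adjacent genes on the same strand when the gap between them is small. The `Gene` record, the function name `group_by_distance`, the 50 bp threshold and the example coordinates are all assumptions made for this example, and the integration with TIGR-predicted operons is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    name: str
    start: int   # 1-based inclusive start coordinate
    end: int     # 1-based inclusive end coordinate
    strand: str  # "+" or "-"


def group_by_distance(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group adjacent same-strand genes separated by at most max_gap bp.

    max_gap is a hypothetical threshold chosen for illustration only.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    groups: List[List[Gene]] = []
    for gene in ordered:
        if groups:
            prev = groups[-1][-1]
            # Number of intergenic bases; negative when the genes overlap.
            gap = gene.start - prev.end - 1
            if gene.strand == prev.strand and gap <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    # Only groups of two or more genes are reported as candidate operons.
    return [[g.name for g in grp] for grp in groups if len(grp) > 1]


# Invented coordinates, not taken from any real genome.
genes = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 420, 900, "+"),
    Gene("geneC", 1500, 2000, "+"),
    Gene("geneD", 2010, 2600, "-"),
    Gene("geneE", 2590, 3000, "-"),
]
candidate_operons = group_by_distance(genes)
print(candidate_operons)
```

Raising or lowering `max_gap` trades sensitivity against specificity, which is one way a distance rule can be made more or less stringent than a published prediction set.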

Archaeal browser features

Every organism's browser includes a BLAT server (3) for performing sequence similarity searches. Given a DNA or protein sequence, BLAT returns a list of links to all genome positions that share 95% or greater identity with the input sequence. A browser position search utility takes one or more terms (but not sequence) as input and returns all matching genome positions. Together, BLAT and the browser position search utility allow one to quickly search a genome by sequence, keyword or locus. The table browser is a powerful tool for retrieving raw data and performing intersections and unions between data in different tracks (4). For example, with a few mouse clicks, one can obtain a list of all GenBank coding genes that overlap known non-coding RNAs or all places where GenBank and TIGR ORF annotations disagree. A correlation feature computes and plots the correlation coefficient between any two tracks at a user-specified genome location and base resolution. The browser also provides a ‘Custom Track’ tool, which allows one to easily and privately upload datasets for display alongside the publicly available tracks described earlier. Detailed help pages explain how to use the browser and format data for custom tracks.
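The table browser intersection described above (coding genes that overlap known non-coding RNAs) reduces to an interval-overlap test. The sketch below shows that logic on its own, outside the browser. The `Feature` type, the function names, the half-open coordinate convention and the example features are assumptions for illustration, and all features are assumed to lie on a single sequence.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(a: Feature, b: Feature) -> bool:
    """True when the two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def genes_overlapping(genes: List[Feature], rnas: List[Feature]) -> List[str]:
    """Names of genes that overlap at least one ncRNA feature."""
    return [g.name for g in genes if any(overlaps(g, r) for r in rnas)]


# Invented features, for illustration only.
coding = [
    Feature("orf1", 0, 300),
    Feature("orf2", 350, 900),
    Feature("orf3", 1200, 1500),
]
ncrna = [Feature("rna1", 880, 960), Feature("rna2", 1000, 1100)]

hits = genes_overlapping(coding, ncrna)
print(hits)
```

This pairwise comparison is quadratic in the number of features, which is acceptable for genomes of this size; sorted sweeps or interval trees would be the usual choice for larger datasets.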

RESULTS

Archaeal browser data mining

The Archaeal Genome Browser is a powerful research tool that has facilitated new discoveries in our laboratory. For example, it has allowed us to integrate sequence annotations with whole genome microarray data for two species to date, P.aerophilum and P.furiosus. Visualizing expression patterns along with gene conservation and sequence motifs has been particularly useful for these species, where 60–80% of protein coding genes do not have a well-annotated function. By comparison, the Escherichia coli K12 genome has <24% of genes with no known function (17) and Saccharomyces cerevisiae has <37% (18). Furthermore, in these and other archaea, the start codons of many ORFs are likely to be incorrect because of frequent use of UUG and GUG as alternatives to AUG. We find many ‘intergenic’ regions of the genome to be transcribed, but with no annotation clues in GenBank entries. Figure 1 illustrates an intergenic region for which the browser track data strongly suggest the presence of a novel gene or untranslated region (UTR). While a single piece of information (e.g. microarray data) alone would be only suggestive, the combination of evidence across multiple tracks helps to build confidence in new observations. Promising hypotheses such as these can then be followed up by further computational analysis or traditional experimental testing.
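Because UUG and GUG can serve as start codons alongside AUG, one simple way to look for a possibly mis-annotated start is to list the in-frame alternatives upstream of the annotated one. The sketch below is an assumption-laden illustration rather than the browser's own method: it handles the forward strand only, uses DNA letters (ATG, GTG, TTG), stops at the first in-frame stop codon, and runs on an invented sequence. The function name `upstream_start_candidates` is hypothetical.

```python
from typing import List, Tuple

START_CODONS = {"ATG", "GTG", "TTG"}  # DNA equivalents of AUG, GUG, UUG
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_start_candidates(seq: str, annotated_start: int) -> List[Tuple[int, str]]:
    """List in-frame start codons upstream of an annotated start.

    seq is the forward-strand DNA sequence and annotated_start is the
    0-based index of the first base of the annotated start codon.
    Scanning walks upstream one codon at a time and stops at the first
    in-frame stop codon or at the beginning of the sequence.
    """
    seq = seq.upper()
    candidates = []
    pos = annotated_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append((pos, codon))
        pos -= 3
    return candidates


# Invented sequence: two leading bases, then codons in frame with the ORF.
toy = "".join(["CC", "TAA", "GTG", "AAA", "TTG", "CCC", "ATG", "GCT", "TAA"])
found = upstream_start_candidates(toy, annotated_start=17)
print(found)
```

Candidates found this way are only hypotheses; as the text notes, supporting evidence from other tracks (promoter motifs, conservation, expression) is what builds confidence in a revised start.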

Figure 1

The intergenic region denoted Pfu.i140, located between P.furiosus ORFs PF0216 and PF0217, shows high conservation among closely related species (‘Conservation’ track), high ‘G/C Percent’ values (indicative of RNA structure in hyperthermophiles), and a different expression pattern from the neighboring ORFs (green: suppressed relative to baseline, red: induced relative to baseline in a heat shock time course). The ‘Promoter +’ track with green peaks indicates a TATA box that is ∼88 bp upstream of PF0217, not the −38 position that is typical for this species. Together, these pieces of evidence suggest that transcription starts about 50 bp (88 − 38) upstream of the predicted coding region, which implies one of three possibilities: (i) this is a non-coding RNA gene that is co-transcribed with PF0217 but later processed away from the transcript; (ii) this is a regulatory UTR of PF0217 with secondary structure; or (iii) the start codon is incorrectly annotated for PF0217.
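The ‘G/C Percent’ values mentioned in the caption are a windowed base-composition measure. A minimal sketch of such a calculation follows; the window and step sizes, the function name `gc_percent_windows` and the example sequence are assumptions for illustration, and the browser's actual window size is not stated here.

```python
from typing import List, Tuple


def gc_percent_windows(seq: str, window: int = 20, step: int = 20) -> List[Tuple[int, float]]:
    """Return (window start, percent G+C) for each full window in seq.

    window and step are hypothetical values chosen for this example.
    """
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append((start, 100.0 * gc / window))
    return values


# Invented sequence: a G/C-only stretch followed by an A/T-only stretch.
toy_seq = "GGCC" * 5 + "ATAT" * 5
profile = gc_percent_windows(toy_seq)
print(profile)
```

Using a step smaller than the window gives overlapping windows and a smoother profile, at the cost of more values to plot.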

DISCUSSION

Future developments

The Archaeal Genome Browser runs on the same source code and MySQL database as the UCSC Genome Browser, which allows for easy code, data and feature updating. When new computational or functional analyses become available for an organism with an existing genome browser, the database can quickly be updated to create a new track of genome annotation. Development of browsers for additional sequenced archaeal species is planned. We strongly encourage members of the archaeal research community to contribute genome-wide datasets to the browser project.

CONCLUSIONS

The Archaeal Genome Browser is a simple, graphic navigation tool and annotation database designed to help researchers perform effective research and analysis, while being relatively easy to improve in content and function.

18 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Prediction of operons in microbial genomes.

Authors: M D Ermolaeva; O White; S L Salzberg
Journal: Nucleic Acids Res Date: 2001-03-01 Impact factor: 16.971

3. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

4. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

5. MODBASE, a database of annotated comparative protein structure models, and associated resources.

Authors: Ursula Pieper; Narayanan Eswar; Hannes Braberg; M S Madhusudhan; Fred P Davis; Ashley C Stuart; Nebojsa Mirkovic; Andrea Rossi; Marc A Marti-Renom; Andras Fiser; Ben Webb; Daniel Greenblatt; Conrad C Huang; Thomas E Ferrin; Andrej Sali
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. The UCSC Genome Browser Database.

Authors: D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

7. Human-mouse alignments with BLASTZ.

Authors: Scott Schwartz; W James Kent; Arian Smit; Zheng Zhang; Robert Baertsch; Ross C Hardison; David Haussler; Webb Miller
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

8. EcoCyc: a comprehensive database resource for Escherichia coli.

Authors: Ingrid M Keseler; Julio Collado-Vides; Socorro Gama-Castro; John Ingraham; Suzanne Paley; Ian T Paulsen; Martín Peralta-Gil; Peter D Karp
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; David L Wheeler
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
