
Comparison of human (and other) genome browsers.

Abstract

The sequence of the human genome provides a scaffold on which numerous annotations, such as the locations of genes, can be laid. Genome browsers have been created to allow the simultaneous display of multiple annotations within a graphical interface. In addition, they provide the ability to search for markers and sequences, to extract annotations for specific regions or for the whole genome and to act as a central starting point for genomic research. This review describes the basic functionality of genome browsers and compares three of them: the University of California Santa Cruz (UCSC) Genome Browser, the Ensembl Genome Browser and the NCBI MapViewer.

Substances：
Genetic Markers
DNA

Year: 2006 PMID： 16460652 PMCID： PMC3525149 DOI： 10.1186/1479-7364-2-4-266

Source DB: PubMed Journal: Hum Genomics ISSN： 1473-9542 Impact factor: 4.639

Introduction

Genome browsers allow researchers to navigate the genome in a way analogous to navigating the internet with Internet Explorer or Mozilla. As with the internet, the amount of available genomic data is overwhelming, and browsers aim to make these data accessible to all researchers. The number and variety of annotations have increased dramatically, enabling a detailed view of many aspects of the genome. Of course, one of the primary annotations is still the location and structure of genes, but even this is not straightforward, as many sources of information (sometimes conflicting) necessitate the creation of several gene-related annotations. These include the locations of mRNA and expressed sequence tag (EST) sequences deposited in the major sequence databases; curated gene sequence projects such as the Vertebrate Genome Annotation (VEGA) [1], RefSeq [2], MGC [3] and ENSEMBL [4]; and computational predictions such as GenScan [5] and Twinscan [6]. There is a wide range of additional annotations. The locations of clones from bacterial artificial chromosome (BAC) and other clone libraries, sequence-tagged site (STS) markers from genetic maps [7-9] and estimated boundaries of cytogenetic bands [10] provide crucial mapping information. Alignments with genomic sequences from other species delineate regions of synteny and help to identify orthologous genes. Single nucleotide polymorphisms (SNPs) and other types of variation point to differences within a species. Locations of repetitive sequences, due both to retrotransposable elements and to simple repeats such as microsatellites, help to provide a more complete description of the genomic landscape. An incomplete listing of annotations is shown in Table 1. Browsers display these annotations simultaneously, allowing for the investigation and appreciation of the genomic context in which to consider a gene or region of interest.

Table 1

A sample of annotations found in one or more of the UCSC, Ensembl and NCBI genome browsers.

Type	Annotations
Mapping and sequence	Chromosome bands; GC percent; CpG Islands; restriction enzyme recognition sites; BAC and fosmid clones; STS markers from genetic, RH maps; Mitelman breakpoints

Genes, transcription and expression	RefSeq mRNAs; VEGA genes; Ensembl genes; UniGene; pseudogenes; retroposed genes; Non-coding RNA genes; tRNAs; mRNAs and ESTs; computational gene predictions; GNF Atlas expression values; Affymetrix microarray probes; DNase1 hypersensitive sites

Variation and repeats	SNPs from dbSNP, HapMap projects haplotypes; recombination rates and hotspots; segmental duplications; repetitive sequences (RepeatMasker); tandem repeats

Cross-species	Evolutionarily conserved regions; syntenic mappings to many organisms including chimp, mouse, rat, chicken, cow, dog, opossum, fish

Abbreviations: BAC, bacterial artificial chromosome; EST, expressed sequence tag; GNF, Genomics Institute of the Novartis Research Foundation; NCBI, National Center for Biotechnology Information; RH, Radiation hybrid; SNP, single nucleotide polymorphism; STS, sequence-tagged site; UCSC, University of California Santa Cruz; VEGA, Vertebrate Genome Annotation.

Three browsers in particular, the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) [11], the Ensembl Genome Browser (http://www.ensembl.org) [12] and the National Center for Biotechnology Information (NCBI) MapViewer (http://www.ncbi.nih.gov/mapview) [13], provide information portals for multiple genome sequences, including human. They share many common features, but differ in significant ways. The following presents an overview and comparison of these browsers.

Genome browser comparisons

Genome browsers can be described and compared with respect to presentation, content and functionality. Presentation refers to how the data are displayed in a graphical form and the overall structure of the website. Content refers to what data is accessible, such as particular genome sequences and annotations for a specific genome. Functionality refers to tools available for mining the genome sequence and annotations, such as sequence and text searches and data extraction. The UCSC, Ensembl and NCBI genome browsers aim to present genomic data in a manner that will facilitate research, but they do so in different ways. Table 2 summarises some of these differences, and a more complete, yet still high-level, discussion of these is presented below.

Table 2

Feature comparison of the UCSC Genome Browser, Ensembl Genome Browser and NCBI MapViewer.

	UCSC	Ensembl	NCBI
Presentation	Genome in horizontal orientation; main page contains a single graphic displaying annotations ('tracks'); clicking on an annotation element presents a web page of detailed information and links to other resources	Genome in horizontal orientation; main ContigView page contains three graphics displaying annotations at different resolutions; clicking on an annotation element presents a box with links to other resources or Views with more detailed information	Genome in vertical orientation; annotations graphically presented in columns ('maps'); clicking on annotation elements or links in columns provides quick access to other, primarily NCBI, resources

Content	13 vertebrate, 15 invertebrate; many cross-species annotations, including conservation across eight species; ENCODE Project annotations	13 vertebrate, six invertebrate; heavy focus on gene annotations such as Ensembl genes and VEGA; HapMap project-related Views	11 vertebrate, five invertebrate, one protozoan, 12 plant, eight fungi; annotations primarily from NCBI resources

Functionality	Text search, BLAT sequence search, isPCR primer search; advanced annotation extraction using Table Browser; ability to upload and view own annotations	Text search, BLAST and SSAHA sequence search, e-PCR primer search; advanced annotation extraction using BioMart; ability to upload and view own annotations; simultaneous view of syntenic regions	Text search, BLAST sequence search, e-PCR primer search; basic annotation extraction

Abbreviations: BLAT, BLAST-like alignment tool; ENCODE, ENCyclopedia Of DNA Elements; e-PCR, electronic polymerase chain reaction; NCBI, National Center for Biotechnology Information; SSAHA, Sequence search and alignment by hashing algorithm; UCSC, University of California Santa Cruz; VEGA, Vertebrate Genome Annotation.

Presentation

UCSC features three types of browsers: a genome browser, a gene family browser (Gene Sorter) and a proteome browser. The genome browser is the most widely used and will be the focus of this discussion, although this in no way implies that the other two are not very valuable research tools. The primary web page of the genome browser consists of a graphic that displays annotations for some specified genomic region, surrounded by navigational buttons and links to tools. The navigational buttons allow for zooming in and out or moving left or right along the genomic sequence. Within the graphic, annotations (also referred to as 'tracks') are displayed horizontally, with the genome sequence running from left to right. The locations of specific elements within annotations are primarily indicated by boxes, with lines sometimes connecting them to show relationships, such as in gene structures (boxes = exons, lines = introns). Arrows indicate forward or reverse strand, where applicable. The use of different colours and shading of boxes highlights the properties of certain annotations, such as confidence in the underlying data, as is the case in the Known Genes track, and quantitative traits, employed by the GC Percent track to indicate differing levels of content of guanine (G) and cytosine (C) base pairs. Clicking on an element within an annotation will bring up a separate 'details' web page with specific information about that element and links to other databases and resources such as GenBank [13] and SwissProt [14]. The amount of this additional information varies between annotations. Drop-down menus towards the bottom of the page, also accessible through a separate 'configuration' page, allow for the selection of annotations to display in the graphic (Table 1).

Ensembl structures its site around 'Views'. For humans, there are 22 Views that display different types of data and/or provide various functions. The primary View, analogous to the UCSC main browser page, is the ContigView. Within this View are three graphic displays that provide information at different resolutions for a region in the genome. The Overview graphic displays multiple megabases (Mb), the Detailed view shows approximately 1 Mb and the Basepair view details about 100 bases. Similarly to UCSC, the genome is shown in a horizontal fashion, with navigational buttons located within the Detailed view graphic. In the three graphics on this page, annotations are delineated by boxes, sometimes connected by lines and other times contained within a larger box. In the Detailed and Basepair views, the DNA contigs annotation divides the graphic, with elements on the forward strand appearing above and those on the reverse strand below. Clicking on an element in an annotation will cause a small pop-up window to appear with some basic information and possibly links to other Views within Ensembl or resources at other sites. For example, clicking on an Ensembl gene provides links to GeneView, TranscriptView and ProtView pages, which contain additional information about the gene or a region of the gene. Menus at the top of the Detailed view graphic provide the ability to select specific annotations for display.

The primary display of NCBI's MapViewer differs significantly from both UCSC and Ensembl by orienting the genome sequence in a vertical fashion. Again, boxes and lines indicate positions of elements in annotations, also referred to as 'maps', which are presented in columns. The ability to navigate the genome is provided in a side bar to the left of the screen. Links within the annotations, as well as the LinkOut column, provide easy access to other relevant resources at the NCBI, such as Entrez Gene (formerly LocusLink) [15], Online Mendelian Inheritance in Man (OMIM) [16] and dbSNP [17]. A 'Maps & Options' button brings up a separate window, allowing one to select annotations to display.
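As an illustration of the kind of quantitative annotation that the GC Percent track conveys through shading, the following minimal sketch computes the percentage of G and C bases in consecutive windows of a sequence. The function name, the window size and the handling of ambiguous bases are assumptions made for this example; they are not taken from any browser's implementation.

```python
def gc_percent_windows(sequence, window=5):
    """Return (start, end, gc_percent) for consecutive non-overlapping windows.

    Coordinates are 0-based and half-open. gc_percent is None when a window
    contains no unambiguous base (for example, a run of N).
    """
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        # Count only A, C, G and T so that ambiguous bases do not dilute the value.
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            results.append((start, start + len(chunk), None))
            continue
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, start + len(chunk), 100.0 * gc / called))
    return results


if __name__ == "__main__":
    for start, end, pct in gc_percent_windows("GGCCAATTNN", window=5):
        print(start, end, pct)
```

A browser would typically map each window's value to a shade or bar height; here the values are simply returned as a list.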

Content

The NCBI provides access to the largest number of genome sequence assemblies, including 11 vertebrates, five invertebrates, one protozoan, eight plants and 12 fungi. Ensembl and UCSC are more heavily slanted towards the larger eukaryotic genomes, providing access to a similar set of 13 vertebrate genomes and six (Ensembl) or 15 (UCSC) invertebrates, and are devoid of the other classes of species. Annotations available within the NCBI MapViewer primarily originate in the numerous databases available at the NCBI. The MapViewer, therefore, is very tightly integrated with these data sources, some of which, such as the Mitelman Breakpoint annotation, are not available at the other sites. UCSC and Ensembl also present annotations that originate from outside resources, such as the databases at NCBI, but supplement these with numerous additional annotations contributed by in-house or third-party researchers.

The UCSC browser arguably contains the broadest set of annotations, especially in the area of cross-species comparisons. For example, the Conservation annotation, developed at and displayed only at UCSC, shows a measure of evolutionary conservation across eight vertebrate species, as determined by a phylogenetic hidden Markov model [18]. UCSC is also the official repository for, and displays data from, the ENCODE (Encyclopedia Of DNA Elements) project [19], containing annotations ranging from histone modifications to regions of DNase 1 hypersensitivity.

The Ensembl browser contains the most extensive set of gene and transcription-related data, with 14 of its 22 Views primarily focused on the presentation of gene- or protein-related data. There is tight integration with gene data originating from both the Ensembl genes annotation [4], a computationally generated, evidence-based set that Ensembl produces, and the VEGA project [1], a manual curation effort. The Ensembl browser also has the most extensive presentation of haplotype data generated by the HapMap project [20], especially in its LDView.

The underlying genomic sequence is exactly the same at all three sites, but analogous annotations may differ. For example, locations of mRNA and EST sequences require an alignment to the genome sequence. Their precise alignment may vary, based on the alignment program used and specific parameter settings within the program. The three sites do not employ the same alignment methods, resulting in slight differences, although they are in agreement the vast majority of the time.
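One way to quantify the slight differences between two sites' alignments of the same mRNA is to compare the genomic bases that each alignment covers. The sketch below is an illustrative assumption rather than a method used by any of the browsers: it represents each alignment as a list of half-open (start, end) blocks on one chromosome and reports the fraction of covered bases shared by both. The coordinates in the example are invented.

```python
def covered_bases(blocks):
    """Return the set of genomic positions covered by half-open (start, end) blocks."""
    covered = set()
    for start, end in blocks:
        covered.update(range(start, end))
    return covered


def block_agreement(blocks_a, blocks_b):
    """Fraction of bases covered by either alignment that are covered by both."""
    a = covered_bases(blocks_a)
    b = covered_bases(blocks_b)
    union = a | b
    if not union:
        return 1.0  # two empty alignments agree trivially
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical exon blocks for one mRNA as placed by two different aligners;
    # the second aligner starts the second exon five bases later.
    site_one = [(100, 200), (300, 400)]
    site_two = [(100, 200), (305, 400)]
    print(block_agreement(site_one, site_two))
```

Building explicit position sets keeps the sketch short but is only practical for small regions; an interval-based calculation would be preferable for whole-genome comparisons.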

Functionality

There are many common functions that all three sites provide. Specific regions of interest can be quickly and easily displayed using keywords such as gene or marker names, exact base pair positions within chromosomes, or sequences via alignment programs like BLAST [21] (Ensembl and NCBI) or BLAST-like alignment tool (BLAT) [22] (UCSC). Locations of paired primer sequences can be obtained via electronic polymerase chain reaction (e-PCR) [23] (NCBI and Ensembl) or isPCR (UCSC).

Associated FTP sites allow for the download of complete genome sequences and annotations. Annotation data can also be downloaded for particular regions. NCBI allows users to view annotations in a tabular format that can then be downloaded into a text file. Ensembl's BioMart [24] and the UCSC Table Browser [25] allow for both simple downloads of annotations and for quite complex datasets to be generated. These two tools also allow for the uploading of files of genomic regions or names of genes or markers for which annotation data, including the underlying sequence, can be obtained.

UCSC and Ensembl provide the ability for researchers to display their own annotation information within the browser. A simple text file denoting the base pair locations of annotation elements is uploaded and used to create a corresponding temporary annotation within the graphic, which is essentially only viewable by the originator. In this way, researchers can usefully view their own data within the context of all other available genomic data.

Ensembl provides the ability to view syntenic regions of two genomes simultaneously in its MultiContigView. The layout is similar to the ContigView described previously, but with the addition of data from two separate genomes being displayed in the Detailed view graphic, and a Navigational view replacing the Overview with a zoomed-out display of the regions being analysed in both genomes.
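The 'simple text file' used for uploading one's own annotations can be sketched as follows. The text above does not name a file format; this example assumes the tab-separated BED layout (chromosome, 0-based start, exclusive end, name) preceded by a `track` line, which is the form commonly used for UCSC custom tracks. The function name, track name and coordinates are hypothetical.

```python
def format_custom_track(name, description, features):
    """Build the text of a BED-style custom annotation file.

    features is an iterable of (chrom, start, end, label) tuples with
    0-based start and exclusive end coordinates.
    """
    lines = ['track name="{}" description="{}"'.format(name, description)]
    for chrom, start, end, label in features:
        if start < 0 or end <= start:
            raise ValueError("invalid interval for {}: {}-{}".format(label, start, end))
        # One annotation element per line, fields separated by tabs.
        lines.append("\t".join([chrom, str(start), str(end), label]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = format_custom_track(
        "myRegions",
        "Regions of interest",
        [("chr7", 1000, 2000, "candidateA"), ("chr7", 5000, 5600, "candidateB")],
    )
    print(text, end="")
```

The resulting text would be pasted or uploaded through the browser's custom annotation page; the exact upload procedure differs between sites and is not shown here.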

Last words

This overview of the UCSC, Ensembl and NCBI genome browsers is by no means complete and is not meant to recommend the use of one or the other of these sites. Users should explore the capabilities of each browser to determine the one they prefer. In the end, the browser that allows a researcher to be the most productive is the best. The genome browsers reviewed here provide access not only to human genome sequence data, but also to annotations from an ever-growing set of species. Similar functionality is provided for the genome assemblies of all species, although the range of annotations varies dramatically. These are by no means the only genome-related browsers available, but they are among the most comprehensive. Similar browsers with narrower foci, such as a single organism, share many of the features and functions described above. The quality of the publicly available data displayed in browsers is highly variable. Therefore, researchers must view these data as critically as any other. Appropriate experimentation is required to test the accuracy of any hypothesis generated using these data. Nevertheless, genome browsers offer a powerful research tool to be utilised by researchers worldwide.

25 in total

1. dbSNP: the NCBI database of genetic variation.

Authors: S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Integrating genomic homology into gene structure prediction.

Authors: I Korf; P Flicek; D Duan; M R Brent
Journal: Bioinformatics Date: 2001 Impact factor: 6.937

3. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

4. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.

Authors: Steffen Durinck; Yves Moreau; Arek Kasprzyk; Sean Davis; Bart De Moor; Alvis Brazma; Wolfgang Huber
Journal: Bioinformatics Date: 2005-08-15 Impact factor: 6.937

5. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

6. The Vertebrate Genome Annotation (Vega) database.

Authors: J L Ashurst; C-K Chen; J G R Gilbert; K Jekosch; S Keenan; P Meidl; S M Searle; J Stalker; R Storey; S Trevanion; L Wilming; T Hubbard
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

7. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

8. The Universal Protein Resource (UniProt).

Authors: Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. Ensembl 2005.

Authors: T Hubbard; D Andrews; M Caccamo; G Cameron; Y Chen; M Clamp; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; J Gilbert; M Hammond; J Herrero; H Hotz; K Howe; V Iyer; K Jekosch; A Kahari; A Kasprzyk; D Keefe; S Keenan; F Kokocinsci; D London; I Longden; G McVicker; C Melsopp; P Meidl; S Potter; G Proctor; M Rae; D Rios; M Schuster; S Searle; J Severin; G Slater; D Smedley; J Smith; W Spooner; A Stabenau; J Stalker; R Storey; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; E Birney
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.

Authors: Ada Hamosh; Alan F Scott; Joanna S Amberger; Carol A Bocchini; Victor A McKusick
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
