
The ENCODE Project at UC Santa Cruz.

Daryl J Thomas, Kate R Rosenbloom, Hiram Clawson, Angie S Hinrichs, Heather Trumbower, Brian J Raney, Donna Karolchik, Galt P Barber, Rachel A Harte, Jennifer Hillman-Jackson, Robert M Kuhn, Brooke L Rhead, Kayla E Smith, Archana Thakkapallayil, Ann S Zweig, David Haussler, W James Kent.

Abstract

The goal of the Encyclopedia Of DNA Elements (ENCODE) Project is to identify all functional elements in the human genome. The pilot phase is for comparison of existing methods and for the development of new methods to rigorously analyze a defined 1% of the human genome sequence. Experimental datasets are focused on the origin of replication, DNase I hypersensitivity, chromatin immunoprecipitation, promoter function, gene structure, pseudogenes, non-protein-coding RNAs, transcribed RNAs, multiple sequence alignment and evolutionarily constrained elements. The ENCODE project at UCSC website (http://genome.ucsc.edu/ENCODE) is the primary portal for the sequence-based data produced as part of the ENCODE project. In the pilot phase of the project, over 30 labs provided experimental results for a total of 56 browser tracks supported by 385 database tables. The site provides researchers with a number of tools that allow them to visualize and analyze the data as well as download data for local analyses. This paper describes the portal to the data, highlights the data that has been made available, and presents the tools that have been developed within the ENCODE project. Access to the data and types of interactive analysis that are possible are illustrated through supplemental examples.

Year: 2006 PMID: 17166863 PMCID: PMC1781110 DOI: 10.1093/nar/gkl1017

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The goal of the ENCODE project is to identify all functional elements in the human genome sequence (1). The pilot phase of the project is focused on a specific 30 megabases (∼1%) of the human genome, with an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. UC Santa Cruz is the main repository for sequence-based data, with microarray data being held at GEO and ArrayExpress. The roles of UC Santa Cruz are (i) to collect the experimental data and analyses, (ii) to perform basic quality assurance (QA) on the submitted data, (iii) to publicly release the data with comprehensive descriptions, (iv) to provide interactive displays for integrating the ENCODE data with existing genome-wide data and (v) to provide interactive tools for analysis.

General details of the Genome Browser have been described previously (2,3), and are briefly reviewed here for clarity. Within the Genome Browser, each dataset is represented as a track, which is a horizontal, graphical representation of the underlying data table. A complete description of each dataset is available on the description page for each track. The Table Browser has been previously described as a general purpose tool for analyzing data in the UCSC Genome Browser (4), possibly with integrated user-supplied data. Several features have been added to this platform in the context of the ENCODE project.

In addition to interactive browsing and analysis tools that are only available at the ENCODE project at UCSC site (http://genome.ucsc.edu/ENCODE), the data are available for public download. Data from this project are made publicly available as quickly as possible after submission. All data on the UCSC Browsers, including the ENCODE data, pass through an extensive QA and documentation process before release. Biological validation criteria have been defined for each of the datasets, and submitters are responsible for confirming them before submission. Our developers and QA staff work with the data to provide fast, clear display and to confirm that the file formats and genomic coordinates are consistent. It is expected that the ENCODE project will transition from the May 2004 human genome assembly (hg17; NCBI Build 35) to the newly released human genome assembly (hg18; NCBI Build 36) in early 2007. Following this, as the ENCODE project expands from the current 1% to the whole genome, UCSC is poised to support this growth. This paper describes the site and the tools that have been developed for viewing, retrieving and analyzing the data from the ENCODE project.

RESULTS

Portal and data

We have extended the UCSC Genome Browser (5) to include specialized support for the ENCODE project and its data. The ENCODE portal is accessible both through a link on the main Genome Browser site and directly at http://genome.ucsc.edu/ENCODE. This portal provides access to the ENCODE data and serves as a starting point for the computational analyses that are possible with the new data and analysis tools. It also contains announcements of new data releases and tool deployments, terms of use for the ENCODE data, and information about the contributors. The ‘Regions’ link opens a frames page allowing the user to quickly scan all ENCODE regions by selecting one region in the navigator frame, which opens a customizable view of that region in a display frame.

Track groups

We have added two levels of organization to reduce the complexity of accessing the data. Tracks of a similar type are collected into track groups, which provide high-level organization to the datasets. The six ENCODE-specific track groups roughly parallel the analysis working groups: Regions and Genes; Transcript Levels; Chromatin Immunoprecipitation; Chromosome, Chromatin and DNA Structure; Comparative Genomics; and Variation. The individual tracks are too numerous to list here and are frequently updated with new results from the Consortium. The track status page provides a current snapshot of the data, including new datasets that are being developed, those that are in QA, and the fully released datasets.

Composite tracks

Sometimes one experiment will be run repeatedly under many different experimental conditions, producing the same data type but many parallel datasets, such as with the many combinations of cell lines, antibodies and stimulation conditions used in chromatin immunoprecipitation. For organizational simplicity, composite tracks allow a set of similar data, usually from a single data provider, to be controlled through a single interface. On the track's user interface page, parameters that are common to all sub-tracks (e.g. visibility mode, track height, display range limits) are presented once. Just below those controls, a checkbox for each sub-track allows it to be individually included or excluded from the display. Experiments can be grouped into logical categories (e.g. cell type, transcription factor) with shared controls. Figure 1 shows the Yale RNA transcriptionally active regions (TARs) (6,7) track as an example of the streamlined interface and the resulting display of the composite tracks.

Figure 1

Composite track control and display. (A) Controls for options that apply to all data in this track (top) with checkboxes to include or exclude individual sub-tracks as desired (bottom). (B) Example of a composite track display showing the IRF1 gene, repeats, Yale transcript maps and Yale transcriptionally active regions (6,7). The latter two are composite tracks, each containing multiple datasets. The Placenta RNA checkbox is deselected above, so that the data are not displayed in the image below.

Multiple sequence alignment display

In addition to our data display and repository role, UCSC and collaborators have been developing algorithms for sequence alignment (8) and conservation analysis (9). As this produces extremely rich datasets and parallels the efforts of several other consortium members, we have created a special display that combines multiple species alignments and conservation scores in the same track, as shown in Figure 2. Alignments are projected onto a reference species for display in the browser by removing alignment columns in which the reference species is a gap. Additional enhancements include annotation of alignment gaps to indicate missing sequence and syntenic breaks, and translation in coding regions with user-selectable reading frames based on available gene annotations. When even more detail is necessary, full unprojected alignments are available on the details page for this track.
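The projection step described above can be illustrated with a minimal sketch. It assumes the alignment is held as a plain dictionary of equal-length gapped strings with `-` as the gap character; the function name `project_onto_reference` and the toy sequences are hypothetical and are not taken from the UCSC code base.

```python
def project_onto_reference(alignment, reference):
    """Drop every alignment column in which the reference row holds a gap.

    `alignment` maps a species name to its gapped sequence; all rows are
    assumed to have the same length and to use '-' as the gap character.
    """
    ref_row = alignment[reference]
    # Indices of the columns where the reference has a real base.
    keep = [i for i, base in enumerate(ref_row) if base != "-"]
    return {
        name: "".join(row[i] for i in keep)
        for name, row in alignment.items()
    }


# Toy alignment (invented for illustration only).
alignment = {
    "human":   "ACG-TA--C",
    "mouse":   "ACGGTAT-C",
    "chicken": "A-GGTATTC",
}
projected = project_onto_reference(alignment, "human")
for name, row in projected.items():
    print(f"{name:8s} {row}")
```

After projection every row has exactly one column per reference base, so the display can be indexed directly by reference coordinates. Bases that other species carry inside reference gaps are dropped from this view, which is why the full unprojected alignment remains useful.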

Figure 2

Conservation display. (A) Conservation track at the base level shows details of a multiple sequence alignment, conservation scores and amino acid translations in coding regions. (‘.’: base is identical to human; ‘N’: missing sequence; ‘=’: sequence that does not align to reference is present in this species; orange numbers/lines: additional bases that are present in other species). (B) Conservation track zoomed out shows pairwise identity summary and conservation scores, highlighting non-coding elements in addition to exons.

The comparative genomics efforts within the ENCODE Consortium are also receiving special attention. The group is producing a common dataset of sequences from 23 mammals and 5 other vertebrates, which provides a rich dataset for the development and comparison of algorithms for multiple sequence alignment and detection of evolutionary constraint. Four separate alignment algorithms are being developed [MAVID (10), MLAGAN (11), PECAN (B. Paten and E. Birney, submitted for publication) and TBA (8)], and three separate conservation scoring methods [binCons (12), GERP (13,14) and phastCons (9)] are being applied to each of these alignments. Each alignment is presented in its own Alignment track, with two composite tracks to represent the real-valued Conservation scores and the predicted Elements.

Tools for analysis

The Table Browser has always provided summary statistics on a single dataset, and we have added tools for exploring correlation between genomic datasets. Data within composite tracks can be treated as a single set for simplified comparison against other tracks. An example of this is available in Supplementary Data, where promoters that are active in at least one cell line are joined to create a set of ‘functional’ promoters. The correlation function calculates correlation coefficients, covariance, scatter plots, residuals and histograms on the fly for the selected datasets. Briefly, the data points from each table are projected down to the base level. The two datasets are intersected and only bases that contain values in both datasets are retained, resulting in datasets of equal length n. These two datasets (X,Y) are then used in a standard linear correlation function, computing the correlation coefficient:

$$r = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

where σX and σY are the standard deviations of the datasets X and Y, and σXY is the covariance, computed as follows:

$$\sigma_{XY} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

with X̄ and Ȳ denoting the means of X and Y. The data values from a track are used in the calculations when available. For tracks that do not have data values, such as gene-structured tracks, the data value is 1.0 for bases that are covered by exons and 0.0 at all other positions in the region. Simple tracks that are neither gene structures nor have data values, e.g. BED tracks, are encoded as 1.0 over the extent of the item and 0.0 for all other positions in the region. Figure 3 shows such correlation between the Boston University •OH Radical Cleavage Intensity Database (ORChID) (15–17) and the CpG Island and GC Percent tracks. The CpG Island histogram shows significant skew in the data due to many zero values, which obscures the correlation of ORChID values within CpG Islands. The correlation of ORChID values with GC Percent is very strong at r = 0.89, which reveals a potential confounding factor when comparing the ORChID values with other datasets. This method is further described in Supplementary Data.
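The correlation procedure described above (base-level projection, intersection, then a linear correlation) can be sketched as follows. This is a minimal illustration and not the Table Browser implementation: the function names, the half-open coordinate convention and the toy intervals are assumptions, and the covariance is normalized by n, a choice that does not affect r as long as the standard deviations use the same normalization.

```python
import math


def project_valued(items):
    """Project (start, end, value) items to a {base position: value} map.

    Coordinates are assumed to be half-open: start inclusive, end exclusive.
    """
    values = {}
    for start, end, value in items:
        for pos in range(start, end):
            values[pos] = float(value)
    return values


def project_coverage(items, region_start, region_end):
    """Encode a track without data values: 1.0 inside items, 0.0 elsewhere."""
    values = {pos: 0.0 for pos in range(region_start, region_end)}
    for start, end in items:
        for pos in range(max(start, region_start), min(end, region_end)):
            values[pos] = 1.0
    return values


def correlate(x_by_pos, y_by_pos):
    """Return (r, covariance, n) over the bases present in both datasets."""
    shared = sorted(x_by_pos.keys() & y_by_pos.keys())
    n = len(shared)
    if n == 0:
        raise ValueError("the two datasets share no bases")
    xs = [x_by_pos[p] for p in shared]
    ys = [y_by_pos[p] for p in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant dataset")
    return cov / (sd_x * sd_y), cov, n


# Toy data (invented): a valued track and a BED-like track in region 0..10.
signal = project_valued([(0, 4, 2.0), (4, 8, 6.0)])
islands = project_coverage([(4, 8)], region_start=0, region_end=10)
r, cov, n = correlate(signal, islands)
print(f"n={n} covariance={cov:.3f} r={r:.3f}")
```

Only the eight bases covered by the valued track survive the intersection, even though the coverage track spans ten. The per-base dictionaries keep the sketch readable; for regions of realistic size, array-based storage would be the more practical choice.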

Figure 3

Track correlation in the Table Browser. Correlation of the Boston University •OH Radical Cleavage Intensity Database (ORChID) (15–17) is shown with the CpG Island (left) and with the GC Percent (right) tracks. Statistical summaries (upper panels), scatter and residual plots (middle panels) and histograms (lower panels) are shown.

The hgLiftOver tool, accessible via the Genome Browser's ‘Utilities’ link, translates genomic coordinates within a species from one assembly version to another and also retrieves putative orthologous regions between species using UCSC's chained and netted alignments. These tools have been used to migrate the ENCODE regions from one assembly to another, and have also been used in the Multiple Species Alignment working group to provide orthology predictions for the preparation of the sequence datasets as described above.
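The idea behind coordinate translation between assemblies can be sketched with a deliberately simplified model. It assumes that the alignment between two assemblies has already been reduced to a list of ungapped blocks on the same strand, each given as (old_start, new_start, size). The function names and block values are hypothetical; this is not how hgLiftOver is implemented, and it ignores strand handling, chain scoring and partially mapped intervals.

```python
def lift_position(pos, blocks):
    """Map one 0-based position through ungapped aligned blocks.

    Returns None when the position falls in sequence that has no
    counterpart in the target assembly.
    """
    for old_start, new_start, size in blocks:
        if old_start <= pos < old_start + size:
            return new_start + (pos - old_start)
    return None


def lift_interval(start, end, blocks):
    """Map a half-open interval; succeed only if both ends are aligned."""
    new_start = lift_position(start, blocks)
    new_last = lift_position(end - 1, blocks)
    if new_start is None or new_last is None or new_last < new_start:
        return None
    return new_start, new_last + 1


# Toy blocks (invented): old bases 150..159 are absent from the new assembly.
blocks = [
    (100, 1100, 50),  # old 100..149 -> new 1100..1149
    (160, 1150, 40),  # old 160..199 -> new 1150..1189
]
print(lift_position(120, blocks))       # inside the first block
print(lift_position(155, blocks))       # inside the unaligned gap
print(lift_interval(140, 170, blocks))  # spans the gap
```

An interval that spans a gap changes length after conversion, which is one reason migrated regions should be checked against the target assembly rather than assumed to keep their original size.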

DISCUSSION

The ENCODE project at UC Santa Cruz extends the powerful Genome Browser with datasets and tools to aid researchers in their quest to understand the functional elements in the genome. This extension of the Browser brings datasets on DNA replication, chromatin regulation, promoter function, gene models and multiple species comparisons together and makes them available for visualization, analysis and download. Integration of the datasets generated by the ENCODE Consortium, in addition to other genome-wide data, proves to be a rich source for addressing questions about functional elements in 1% of the human genome, and is poised to expand with the needs of the ENCODE project. Extensions have been made to the display, providing capabilities such as composite tracks for better organization and increased customization. Analysis tools have been built into the Table Browser to simplify merging of related tables and to assess correlation between datasets. These build on the general usability, integration with genome-wide resources, ability to do online analyses and simplicity of exporting data for external analyses that have made the data analysis more accessible to biologists. Newer additions such as the Gene Sorter, In-Silico PCR and VisiGene (2,3) continue to add value by bringing resources together so that detailed analysis can proceed rapidly.

WEBSITES FOR REFERENCE

UCSC Genome Browser: http://genome.ucsc.edu; ENCODE Portal: http://genome.ucsc.edu/ENCODE; Data downloads; Status.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

16 in total

1. Large-scale transcriptional activity in chromosomes 21 and 22.

Authors: Philipp Kapranov; Simon E Cawley; Jorg Drenkow; Stefan Bekiranov; Robert L Strausberg; Stephen P A Fodor; Thomas R Gingeras
Journal: Science Date: 2002-05-03 Impact factor: 47.728

2. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

3. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

Authors: Michael Brudno; Chuong B Do; Gregory M Cooper; Michael F Kim; Eugene Davydov; Eric D Green; Arend Sidow; Serafim Batzoglou
Journal: Genome Res Date: 2003-03-12 Impact factor: 9.043

4. MAVID multiple alignment server.

Authors: Nicolas Bray; Lior Pachter
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

5. Using hydroxyl radical to probe DNA structure.

Authors: M A Price; T D Tullius
Journal: Methods Enzymol Date: 1992 Impact factor: 1.600

6. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution.

Authors: Jill Cheng; Philipp Kapranov; Jorg Drenkow; Sujit Dike; Shane Brubaker; Sandeep Patel; Jeffrey Long; David Stern; Hari Tammana; Gregg Helt; Victor Sementchenko; Antonio Piccolboni; Stefan Bekiranov; Dione K Bailey; Madhavan Ganesh; Srinka Ghosh; Ian Bell; Daniela S Gerhard; Thomas R Gingeras
Journal: Science Date: 2005-03-24 Impact factor: 47.728

7. Distribution and intensity of constraint in mammalian genomic sequence.

Authors: Gregory M Cooper; Eric A Stone; George Asimenos; Eric D Green; Serafim Batzoglou; Arend Sidow
Journal: Genome Res Date: 2005-06-17 Impact factor: 9.043

8. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Authors: Adam Siepel; Gill Bejerano; Jakob S Pedersen; Angie S Hinrichs; Minmei Hou; Kate Rosenbloom; Hiram Clawson; John Spieth; Ladeana W Hillier; Stephen Richards; George M Weinstock; Richard K Wilson; Richard A Gibbs; W James Kent; Webb Miller; David Haussler
Journal: Genome Res Date: 2005-07-15 Impact factor: 9.043

9. DNA strand breaking by the hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the DNA backbone.

Authors: B Balasubramanian; W K Pogozelski; T D Tullius
Journal: Proc Natl Acad Sci U S A Date: 1998-08-18 Impact factor: 11.205

10. The UCSC Genome Browser Database: update 2006.

Authors: A S Hinrichs; D Karolchik; R Baertsch; G P Barber; G Bejerano; H Clawson; M Diekhans; T S Furey; R A Harte; F Hsu; J Hillman-Jackson; R M Kuhn; J S Pedersen; A Pohl; B J Raney; K R Rosenbloom; A Siepel; K E Smith; C W Sugnet; A Sultan-Qurraie; D J Thomas; H Trumbower; R J Weber; M Weirauch; A S Zweig; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
