Developments in CORG: a gene-centric comparative genomics resource.

C Dieterich, M W Franz, M Vingron.

Abstract

The CORG resource (Comparative Regulatory Genomics, http://corg.eb.tuebingen.mpg.de) provides extensive cross-species comparisons of promoter regions in particular and whole gene loci in general. Pairwise as well as multiple alignments of 10 vertebrate species form the key component of CORG. We implemented a rapid alignment approach based on weight matrix motif anchors to ensure efficient computation and biologically informative alignments. All CORG workbench components have been enhanced towards more flexibility and interactivity. Reference sequence based data presentation and analysis was put into the well-known and modular Generic Genome Browser framework. Herein, various plugins facilitate online data analysis and integration with static conservation data. Main emphasis was put on the design of a new JAVA WebStart application for comparative data display. Flexible data import and export options for standard formats complete the provided services.

Year:  2006        PMID: 17135197      PMCID: PMC1751536          DOI: 10.1093/nar/gkl977

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Web-based comparative genomics resources support researchers in finding significant patterns in biological sequences and inferring their likely functions. The CORG workbench and database are dedicated to the analysis of non-coding DNA of homologous gene loci and have served this purpose since their start in 2003 (1). Unlike other resources [e.g. ref. (2,3)], we do not compare whole genomes but the DNA sequence of whole homologous gene loci and their flanking regions. Local sequence conservation is an indicator of functional elements that act either in cis (e.g. promoter elements) or in trans (e.g. RNA genes). Small transcriptional units such as micro-RNAs or enhancers have been reported to reside within larger protein-coding genes [e.g. ref. (4)]. To this end, CORG has been extended in content, form and function since the last report (1), and this paper details the improvements. Two sequence regions are at the center of CORG's new capabilities: (i) upstream regions that are likely to encompass one or more promoters of a gene and (ii) whole gene loci covering the complete gene structure plus 5 kb of flanking sequence. In the first group we expect to find functional promoter elements by combining conservation with motif searches and experimentally defined transcription start sites. The non-coding portion of the second group is largely unexplored territory where cross-species conservation patterns for homologous gene groups may provide crucial hints on function. Furthermore, we have improved the strategy for alignment computation and visualization. CORG has also become more flexible with respect to data manipulation, as it provides interactive analysis tools and data exchange facilities.

RESULTS

The current release of CORG contains local pairwise and multiple alignments for 10 vertebrate species: human, rhesus monkey, rat, mouse, dog, cow, chicken, frog and two fish. This species selection covers an enormous evolutionary distance and can be used to address questions about the turnover of functional elements.

Alignment of homologous gene loci

Local pairwise alignments are still the key component of the CORG database. Multiple cross-species comparisons demand new ways of computing meaningful local sequence similarities. Upstream regions are aligned by SITEBLAST (5), a modified version of BLASTZ that employs weight matrix scans to find alignment anchors. The rationale is that biologically meaningful anchors can be extended into alignments that capture further conserved functional elements. The JASPAR library (6) of family profiles provides seed motifs. Program parameters are set to pick up all motif instances that have power of 0.2 or less. The scoring scheme follows the HKY model, which takes sequence-specific GC content into account, and is set to an expected distance of 47 PAM (7). No positional constraints are enforced on the alignments, but alignment scores must exceed 10 times the average match score. The second set of alignments is computed for whole gene loci by applying the same scoring scheme, but enforcing a collinear block structure and using a traditional alignment seed approach (12 matching positions out of 19). Multiple alignments are subsequently computed from pairwise ones. We enumerated all distinct and maximally long paths through the graph of overlapping alignments. Corresponding sequences are optimally realigned with the POA software (8).
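
The idea of motif-based anchors can be illustrated with a minimal Python sketch. It is not the SITEBLAST implementation: the count matrix, the uniform background, the pseudocount and the score threshold are all hypothetical choices made for illustration, and real profiles would be taken from a motif library such as JASPAR. The sketch scans two sequences with a log-odds weight matrix and pairs up the hits as candidate anchors that an alignment program could then try to extend.

```python
import math

ALPHABET = "ACGT"

# Hypothetical count matrix (rows: motif positions, columns: A, C, G, T).
# It describes the toy motif TGAC and is not taken from any real library.
COUNTS = [
    [0, 0, 0, 10],
    [0, 0, 10, 0],
    [10, 0, 0, 0],
    [0, 10, 0, 0],
]


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert counts to log2 odds scores against a uniform background."""
    matrix = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        matrix.append(
            [math.log2(((c + pseudocount) / total) / background) for c in row]
        )
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in ALPHABET for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][ALPHABET.index(b)] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def anchor_pairs(seq_a, seq_b, matrix, threshold):
    """Pair motif hits from two sequences; each pair is a candidate anchor."""
    hits_a = scan(seq_a, matrix, threshold)
    hits_b = scan(seq_b, matrix, threshold)
    return [(pos_a, pos_b) for pos_a, _ in hits_a for pos_b, _ in hits_b]


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS)
    print(anchor_pairs("AATGACGG", "TGACTT", pwm, threshold=6.0))
```

Only the forward strand is scanned here, and every hit in one sequence is paired with every hit in the other; a production tool would restrict and score these pairs before extending them.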

CORG workbench

The website is divided into four sections: Search page, Batch retrieval page, DAS tutorial and Help page. An interactive user session usually starts with a search for a particular gene identifier. Matching identifiers are displayed along with a concise description to help the user make the right choice. The following page features three views on the database content. The top panel (Figure 1) shows the genomic position and structure of the selected gene (transcript variants along with transcription start sites). Two additional tracks represent the range of the two CORG regions, upstream and whole gene. The middle panel constitutes a local view of the selected gene and its annotation. Here, we use the Generic Genome Browser framework (9) to display, edit, export and import annotations of CORG regions. The modular plugin architecture of this browser allowed us to add further functions such as filters, finders and annotators. Our filter plugin can reduce the number of displayed conserved blocks with respect to their score, length or percentage identity. Two annotator plugins serve to pinpoint motif matches: either one or more IUPAC consensus sequences serve as input, or a weight matrix search can be performed on the displayed segment. Figure 2 shows a 2000 bp section of the upstream region of the human c-fos gene. c-fos is a component of the AP-1 transcription factor complex and therefore requires tight regulation (10). It is known that the promoter can be activated by protein kinase A, Ras or JAK-STAT signaling (11). Endpoints of these signaling cascades sit at locations −60, −300 and −340 relative to the transcription start site (11). Figure 2 demonstrates the use of plugins along with CORG annotation to highlight the aforementioned promoter elements. Both static database content and newly generated annotation may be exported by the ‘Download Sequence File’ plugin.
Furthermore, we continue to develop plugins for emerging data sources in gene regulation such as binding assays, reporter gene assays or high-resolution expression profiling experiments.
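
A consensus search of the kind offered by the annotator plugins can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugin's code: the function names are hypothetical, and the query `GGAW` in the usage example is a made-up consensus rather than one used by CORG. The sketch expands IUPAC ambiguity codes into a regular expression and reports matches on both strands.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(consensus):
    """Reverse-complement a consensus, ambiguity codes included."""
    return consensus.upper().translate(COMPLEMENT)[::-1]


def consensus_to_regex(consensus):
    """Compile a consensus; the lookahead lets overlapping sites be found."""
    body = "".join(IUPAC[symbol] for symbol in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(sequence, consensus):
    """Return sorted (start, strand, matched_text) tuples, 0-based starts.

    Minus-strand hits are reported at their forward-strand coordinate,
    with the forward-strand text that matched the reverse complement.
    """
    sequence = sequence.upper()
    hits = set()
    queries = (("+", consensus), ("-", reverse_complement(consensus)))
    for strand, query in queries:
        for match in consensus_to_regex(query).finditer(sequence):
            hits.add((match.start(), strand, match.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    print(find_sites("ACGGAAGT", "GGAW"))
    print(find_sites("ACTTCCGT", "GGAW"))
```

A palindromic consensus is reported once per strand at the same position, which callers may want to collapse.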
Figure 1

Genome location and structure of the human c-fos gene. The upper panel of the CORG workbench shows the selected gene's structure and orientation on the genome assembly. RefSeq transcription start sites are represented by red diamond symbols. The extent of associated CORG regions is given by turquoise filled arrows.

Figure 2

Genome Browser view on 2 kb segment of c-fos upstream region. Conserved blocks are displayed on the top track, followed by conserved binding sites as given by pairwise alignment anchor points. All items are clickable and provide additional details on demand. A diamond shaped symbol represents the annotated transcription start site. The three bottom tracks display analysis results from the ‘Annotate DNA sites’ plugin for queries of consensus sequences of CREB boxes, Serum response elements and ETS binding sites.

MyCORG—the CGViz AlignmentViewer

The bottom part of this webpage constitutes a paradigm shift away from the web browser towards a complete software application. We have abandoned the previous JAVA applet for the comparative display of alignments and implemented a JAVA WebStart application (requires Java 1.5) based on the CGViz framework. This application is installed locally and requires standard input formats: MAF for alignments and GFF for sequence feature annotation. Both formats are provided by the CORG database, but data may also be obtained from other resources, which gives the user maximal flexibility. Figure 3 is an example of the Alignment viewer showing the whole gene locus of c-fos with CORG alignments and annotation. The application offers seamless navigation through cross-species alignments.
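
Because MAF is a plain-text format, alignment blocks can also be inspected outside the viewer. The following Python sketch reads the `a` (block header) and `s` (sequence) lines of a MAF text and computes the percentage identity of two rows. It is a simplified reader written for illustration: the example block, its source names and its coordinates are invented, and other MAF line types are ignored.

```python
# Invented example block; names and coordinates are hypothetical.
EXAMPLE_MAF = """##maf version=1
a score=120.0
s hg.chr14 100 8 + 1000 ACGT-ACGT
s mm.chr12 200 9 + 2000 ACGTTACGA
"""


def parse_maf(text):
    """Yield alignment blocks as {'score': float or None, 'rows': [...]}."""
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # header and comment lines
        if not line:
            # A blank line terminates the current block.
            if block is not None:
                yield block
                block = None
            continue
        fields = line.split()
        if fields[0] == "a":
            if block is not None:
                yield block
            attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            score = float(attrs["score"]) if "score" in attrs else None
            block = {"score": score, "rows": []}
        elif fields[0] == "s" and block is not None and len(fields) >= 7:
            src, start, size, strand, src_size, seq = fields[1:7]
            block["rows"].append({
                "src": src,
                "start": int(start),      # 0-based start on the given strand
                "size": int(size),        # number of non-gap characters
                "strand": strand,
                "src_size": int(src_size),
                "text": seq,
            })
    if block is not None:
        yield block


def percent_identity(row_a, row_b):
    """Identity over columns in which neither row has a gap."""
    pairs = [
        (x, y)
        for x, y in zip(row_a["text"].upper(), row_b["text"].upper())
        if x != "-" and y != "-"
    ]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    for blk in parse_maf(EXAMPLE_MAF):
        first, second = blk["rows"][:2]
        print(blk["score"], percent_identity(first, second))
```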
Figure 3

New Comparative alignment viewer implemented as JAVA WebStart application. The screenshot shows the gene locus of the c-fos gene on chromosome 14 and corresponding pairwise alignments. Gene structure annotation (exons in green, repetitive elements in yellow) is imported from GFF-formatted files. Pairwise or multiple alignments are imported from MAF-formatted files.

Large-scale data retrieval

Complete CORG data sets can be easily obtained from the Batch retrieval section. The user can select from a variety of GFF annotation and MAF alignment dumps. A DAS service is also in operation; please visit the DAS tutorial section for details.
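
As an illustration of how a downloaded GFF dump might be processed, the following Python sketch reads tab-separated GFF records and keeps those above a minimum score or length, in the spirit of the browser's filter plugin. The record layout follows the general GFF column order; the source name `CORG`, the feature type `conserved_block`, the attribute strings and the thresholds are hypothetical and not taken from the actual dumps.

```python
from typing import List, NamedTuple, Optional

# Invented example records; the feature type and attributes are hypothetical.
EXAMPLE_GFF = (
    "##gff-version 2\n"
    "chr14\tCORG\tconserved_block\t101\t150\t85.5\t+\t.\tblock 1\n"
    "chr14\tCORG\tconserved_block\t300\t310\t.\t+\t.\tblock 2\n"
)


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    score: Optional[float]   # None when the column holds "."
    strand: str
    phase: str
    attributes: str

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def parse_gff(text: str) -> List[Feature]:
    """Parse GFF text, skipping comments, blanks and malformed lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a feature record
        score = None if cols[5] == "." else float(cols[5])
        attributes = cols[8] if len(cols) > 8 else ""
        features.append(Feature(cols[0], cols[1], cols[2], int(cols[3]),
                                int(cols[4]), score, cols[6], cols[7],
                                attributes))
    return features


def keep(features, min_score=None, min_length=None):
    """Keep features meeting every threshold that was given.

    Features without a score are dropped when min_score is set.
    """
    kept = []
    for feature in features:
        if min_score is not None:
            if feature.score is None or feature.score < min_score:
                continue
        if min_length is not None and feature.length < min_length:
            continue
        kept.append(feature)
    return kept


if __name__ == "__main__":
    for feature in keep(parse_gff(EXAMPLE_GFF), min_score=50.0):
        print(feature.seqid, feature.start, feature.end, feature.score)
```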

Future directions

The CORG resource will primarily remain web-based as this guarantees convenient access to CORG while having minimal requirements on the client computer. Client-side software components will continue to supplement the web resource. To keep CORG up-to-date with developments in sequencing and functional annotation, we strive to release a new CORG version every six months. Large-scale experimental data sets will be integrated into the CORG framework.

CONCLUSIONS

The CORG database has been updated to offer a great diversity of 10 vertebrate species. Pairwise as well as multiple alignments guide the user in her or his data interpretation. This move became possible with improvements in alignment computation (SITEBLAST) using biological motifs as anchors. Whole gene loci alignments with locus-specific scoring schemes cover introns, UTRs and downstream regions in the search for functional elements. Greatest flexibility is guaranteed by keeping the CORG workbench open and adjustable through the Generic Genome Browser framework. Finally, JAVA WebStart technology enables the seamless integration and visualization of CORG data on each local desktop.
  10 in total

1.  Multiple sequence alignment using partial order graphs.

Authors:  Christopher Lee; Catherine Grasso; Mark F Sharlow
Journal:  Bioinformatics       Date:  2002-03       Impact factor: 6.937

2.  CORG: a database for COmparative Regulatory Genomics.

Authors:  C Dieterich; H Wang; K Rateitschak; H Luz; M Vingron
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

3.  The generic genome browser: a building block for a model organism system database.

Authors:  Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal:  Genome Res       Date:  2002-10       Impact factor: 9.043

Review 4.  Growth factor-induced gene expression: the ups and downs of c-fos regulation.

Authors:  V M Rivera; M E Greenberg
Journal:  New Biol       Date:  1990-09

5.  SITEBLAST--rapid and sensitive local alignment of genomic sequences employing motif anchors.

Authors:  Morris Michael; Christoph Dieterich; Martin Vingron
Journal:  Bioinformatics       Date:  2004-12-14       Impact factor: 6.937

6.  E2A and IRF-4/Pip promote chromatin modification and transcription of the immunoglobulin kappa locus in pre-B cells.

Authors:  Adam S Lazorchak; Mark S Schlissel; Yuan Zhuang
Journal:  Mol Cell Biol       Date:  2006-02       Impact factor: 4.272

Review 7.  Signal integration at the c-fos promoter.

Authors:  R Janknecht; M A Cahill; A Nordheim
Journal:  Carcinogenesis       Date:  1995-03       Impact factor: 4.944

8.  Ensembl 2006.

Authors:  E Birney; D Andrews; M Caccamo; Y Chen; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; P Flicek; S Gräf; M Hammond; J Herrero; K Howe; V Iyer; K Jekosch; A Kähäri; A Kasprzyk; D Keefe; F Kokocinski; E Kulesha; D London; I Longden; C Melsopp; P Meidl; B Overduin; A Parker; G Proctor; A Prlic; M Rae; D Rios; S Redmond; M Schuster; I Sealy; S Searle; J Severin; G Slater; D Smedley; J Smith; A Stabenau; J Stalker; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; T J P Hubbard
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

9.  A new generation of JASPAR, the open-access repository for transcription factor binding site profiles.

Authors:  Dominique Vlieghe; Albin Sandelin; Pieter J De Bleser; Kris Vleminckx; Wyeth W Wasserman; Frans van Roy; Boris Lenhard
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

10.  The UCSC Genome Browser Database: update 2006.

Authors:  A S Hinrichs; D Karolchik; R Baertsch; G P Barber; G Bejerano; H Clawson; M Diekhans; T S Furey; R A Harte; F Hsu; J Hillman-Jackson; R M Kuhn; J S Pedersen; A Pohl; B J Raney; K R Rosenbloom; A Siepel; K E Smith; C W Sugnet; A Sultan-Qurraie; D J Thomas; H Trumbower; R J Weber; M Weirauch; A S Zweig; D Haussler; W J Kent
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971
