DG-CST (Disease Gene Conserved Sequence Tags), a database of human-mouse conserved elements associated to disease genes.

Angelo Boccia, Mauro Petrillo, Diego di Bernardo, Alessandro Guffanti, Flavio Mignone, Stefano Confalonieri, Lucilla Luzi, Graziano Pesole, Giovanni Paolella, Andrea Ballabio, Sandro Banfi.

Abstract

The identification and study of evolutionarily conserved genomic sequences that surround disease-related genes is a valuable tool to gain insight into the functional role of these genes and to better elucidate the pathogenetic mechanisms of disease. We created the DG-CST (Disease Gene Conserved Sequence Tags) database for the identification and detailed annotation of human-mouse conserved genomic sequences that are located within or in the vicinity of human disease-related genes. CSTs are defined as sequences that show at least 70% identity between human and mouse over a length of at least 100 bp. The database contains CST data for over 1088 genes responsible for monogenic human genetic diseases or involved in the susceptibility to multifactorial/polygenic diseases. DG-CST is accessible via the internet at http://dgcst.ceinge.unina.it/ and may be searched using both simple and complex queries. A graphic browser allows direct visualization of the CSTs and related annotations within the context of the relevant gene and its transcripts.

Year: 2005 PMID: 15608249 PMCID: PMC539965 DOI: 10.1093/nar/gki011

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Alignment of DNA sequences from different species provides an effective tool to decode genomic information, based on the assumption that functional sequences tend to evolve at a slower rate than non-functional sequences. The availability of complete genomic sequences from a variety of species (1–4) makes it possible to carry out these analyses very effectively and to identify not only coding sequences but also non-coding sequences with either regulatory or structural functions (5–8). A comparative analysis of the human and murine genomes revealed a surprisingly high number of sequence elements longer than 100 bp with a sequence identity >70% between human and mouse (6). Interestingly, more than half of these conserved sequences do not correspond to known elements of protein-coding genes and may therefore represent non-coding RNAs, expression control elements or chromosomal structural elements. Such sequences have previously been termed CNG (conserved non-genic sequences) (9,10) or CNS (conserved non-coding sequences) (2). Here, we use the more neutral and descriptive expression ‘conserved sequence tags’ (CST), which is also appropriate for describing exons. To gain further insight into the biological role of these conserved sequences, we chose to identify and annotate CSTs belonging to a set of human genes involved in the pathogenesis of genetic diseases. These are among the best-studied human genes, as they have been the object of very detailed structural and functional characterization in the past 15–20 years. Furthermore, novel functional elements within these genes may be targets of as yet unidentified mutations leading to genetic diseases. Information on CSTs related to human disease genes can also be gathered from reference genome sequence databases, such as Ensembl (11) and Genome Browsers (12), or from more specialized resources, such as Vista Browser (13) and GALA (14). However, these valuable resources are not specifically designed for the study of human disease genes: retrieving CST data for these genes may prove difficult, and statistical analysis impossible, since CSTs are not explicitly annotated. Therefore, we decided to build DG-CST (Disease Gene CST), a database of human–mouse conserved elements associated with disease genes. To this end, CSTs in human disease genes were systematically identified and then subjected to detailed bioinformatic analysis aimed at identifying novel functional elements associated with these genes, either transcribed and possibly coding sequences or non-transcribed sequence elements with a hypothetical role in the control of gene expression. The DG-CST database is available to the scientific community through a Web interface at http://dgcst.ceinge.unina.it/. The annotation of CSTs related to disease genes will be valuable for elucidating the functional role of these conserved sequences and for a better understanding of the pathogenesis of human genetic disorders.

CONSTRUCTION AND ORGANIZATION OF THE DG-CST

Sequence acquisition and CST identification

A list of human genes involved either in the pathogenesis of monogenic human disorders or in the predisposition to multifactorial diseases was obtained by screening the GeneCards (15) and the Online Mendelian Inheritance in Man (OMIM) (16) databases. We then searched the human Ensembl database (assembly release NCBI34) to retrieve the human genomic sequences spanning the selected transcripts as well as 250 additional kilobases of flanking sequence on both sides. The extent of the flanking sequence was reduced when known genes were annotated in the proximity of the disease gene, but a minimum of 20 kb was taken in all cases. The Ensembl database was also used as the source of the corresponding murine sequences. Orthologous gene annotation was used, when available, to find the mouse counterparts; when more than one orthologous gene was found, sequences were manually selected on the basis of overall sequence conservation and relationships with other neighboring sequences. Mouse sequence size was defined according to the length of the human sequence. A total set of 1088 human genomic sequences was compared to the corresponding murine orthologous genomic sequences (the full list is available online). Overall, 193 million bp of human genomic sequence were analyzed, corresponding to 7% of the human genome. Human and mouse genomic sequences, prefiltered to mask all known repeated sequences, were compared using the BLASTZ program (17). Sequences showing at least 70% identity over a region of at least 100 bp were selected and further analyzed to eliminate redundancies, leading to the identification of 66 495 repeat-free, non-overlapping human and mouse CST pairs. The CSTs were found to correspond to or overlap known human exon sequences in about 32% of cases (n = 21 139), while they were located in either intronic or intergenic regions in the remaining 68% of cases (n = 45 356) (Table 1).
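The selection rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used to build DG-CST: the `Hit` record, the function names, the rule that a flank is cut at the nearest neighboring gene, and the tie-breaking rule used to remove overlaps (longer hit first, then higher identity) are all hypothetical. Only the thresholds (250 kb, 20 kb, 70% identity, 100 bp) come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """One human-mouse alignment block (hypothetical record layout)."""
    start: int    # 0-based start on the human sequence
    end: int      # exclusive end on the human sequence
    matches: int  # identical aligned positions
    aligned: int  # alignment columns, gaps included

    @property
    def identity(self) -> float:
        return self.matches / self.aligned


def flank_length(distance_to_neighbor: Optional[int],
                 default: int = 250_000, minimum: int = 20_000) -> int:
    """Flank on one side: 250 kb, shortened at a nearer gene, never below 20 kb."""
    if distance_to_neighbor is None:  # no annotated gene nearby
        return default
    return max(minimum, min(default, distance_to_neighbor))


def select_csts(hits: list[Hit], min_identity: float = 0.70,
                min_length: int = 100) -> list[Hit]:
    """Keep hits that pass both thresholds, then drop overlapping ones."""
    passing = [h for h in hits
               if h.end - h.start >= min_length and h.identity >= min_identity]
    # Assumed redundancy rule: prefer longer hits, then higher identity.
    passing.sort(key=lambda h: (h.end - h.start, h.identity), reverse=True)
    kept: list[Hit] = []
    for h in passing:
        if all(h.end <= k.start or h.start >= k.end for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h.start)


# Made-up alignment blocks, for illustration only.
example_hits = [
    Hit(0, 150, 120, 150),    # passes: 150 bp, 80% identity
    Hit(100, 220, 100, 120),  # passes thresholds but overlaps the first hit
    Hit(300, 380, 80, 80),    # too short
    Hit(400, 520, 70, 120),   # identity too low
    Hit(600, 700, 70, 100),   # passes exactly at both thresholds
]
selected = select_csts(example_hits)
```

With these made-up blocks, `selected` holds the first and the last hit; the second hit meets both thresholds but is discarded because it overlaps a longer one.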

Table 1.

Classification of human CSTs present in DG-CST

CST type	Number	%	Length (bp)	Length (%)
Exonic	21 139	31.8	5 247 362	34.9
Intronic	18 390	27.7	3 832 169	25.5
Intergenic	26 966	40.5	5 962 769	39.6
Total	66 495	100	15 042 300	100
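The totals and percentages in Table 1 can be recomputed from the per-class counts and lengths. The short sketch below does this and also derives a mean CST length per class, which the table does not report. The variable names are illustrative.

```python
# Counts and cumulative lengths (bp) as reported in Table 1.
table = {
    "Exonic": (21_139, 5_247_362),
    "Intronic": (18_390, 3_832_169),
    "Intergenic": (26_966, 5_962_769),
}

total_n = sum(n for n, _ in table.values())
total_bp = sum(bp for _, bp in table.values())

# Share of each class by number of CSTs and by cumulative length.
shares = {
    name: (100 * n / total_n, 100 * bp / total_bp)
    for name, (n, bp) in table.items()
}

# Mean CST length per class (derived here, not given in the table).
mean_length = {name: bp / n for name, (n, bp) in table.items()}

for name, (pct_n, pct_bp) in shares.items():
    print(f"{name:<11}{pct_n:6.2f}%{pct_bp:7.2f}%{mean_length[name]:8.1f} bp")
```

The recomputed totals match the table (66 495 CSTs, 15 042 300 bp), and the intronic and intergenic counts add up to the 45 356 non-exonic CSTs quoted in the text. The intergenic share by number comes out at about 40.55%; the table lists 40.5, possibly so that the column sums to 100.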

CST annotation

The identified CSTs are collected in the DG-CST database, together with a large number of annotations including: species; genomic location, i.e. chromosome, position, relationship with the closest gene and with the selected disease gene (often coincident); sequence content, i.e. sequence, length, GC percentage; identity between human and mouse sequences, number of gaps, polarity; BLAST matches with other CSTs, as well as with other human genomic sequences; BLAST matches versus non-redundant nucleotide databases; conservation in other species, as assessed by BLAST analysis versus the drafts of the fugu (3), chicken (11), rat (4) and zebrafish (11) genome sequences; classification of CSTs as ‘intronic’, ‘intergenic’ or ‘exonic’ based on Ensembl gene annotations; potential of CSTs to represent transcribed/coding elements based on a number of different tests, including determination of maximum ORF size, presence of putative splice sites, exonic splicing enhancers (18), exon predictions based on GENSCAN (19), BLAST matches with expressed sequence tags (ESTs) and non-redundant protein databases, word frequencies, and determination of the coding potential score (c.p.s.) according to the CSTMiner algorithm (20,21), a recently developed software tool based on pairwise genome comparison; presence of single nucleotide polymorphisms (SNPs), as reported in Ensembl; presence of palindromes, tandem repeats and putative RNA secondary structures, as predicted using the ddbRNA software (22); presence of putative transcription factor (TF) binding sites, as assessed using BID, a newly developed algorithm (A. Ambesi, M. Bansal and D. di Bernardo, unpublished data).
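Three of the simpler annotations listed above (GC percentage, maximum ORF size and the exonic/intronic/intergenic class) can be illustrated with a minimal Python sketch. The text does not say how DG-CST computes them, so the definitions used here are assumptions: an ORF is taken as the longest stop-free stretch in any of the six reading frames, with no start codon required, and a CST is called exonic if it overlaps any exon, intronic if it otherwise overlaps the gene span, and intergenic in all other cases. All function names are hypothetical.

```python
from __future__ import annotations

STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def max_orf_length(seq: str) -> int:
    """Longest stop-free stretch (nt) over six frames; no ATG required (assumption)."""
    best = 0
    for strand in (seq.upper(), reverse_complement(seq)):
        for frame in range(3):
            run = 0
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i + 3] in STOPS:
                    run = 0  # a stop codon ends the current open stretch
                else:
                    run += 3
                    best = max(best, run)
    return best


def _overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def classify_cst(start: int, end: int, gene_span: tuple[int, int],
                 exons: list[tuple[int, int]]) -> str:
    """Assign 'exonic', 'intronic' or 'intergenic' from gene coordinates."""
    if any(_overlaps(start, end, s, e) for s, e in exons):
        return "exonic"
    if _overlaps(start, end, *gene_span):
        return "intronic"
    return "intergenic"
```

For example, `classify_cst(400, 500, (0, 1000), [(150, 300)])` returns `"intronic"`, because the interval lies inside the gene span without touching the single annotated exon.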

DATABASE SEARCH

The DG-CST database contains all the annotations and is designed to allow easy retrieval of CST information. Searching is supported in several ways, and a graphic browser allows direct visualization of the CSTs within the context of the relevant gene and its transcripts. Briefly, CST information can be accessed in the following ways. Each CST entry in DG-CST is assigned a unique identifier (CST ID), which can be used to find the CST quickly from different sections of the database, including the home page. CSTs can also be reached by choosing a gene from the list of all analyzed disease genes, available from the home page (Figure 1A and B).

Figure 1

The DG-CST database: examples of query interfaces. (A) The DG-CST home page. The quick search boxes are highlighted in color: the CST ID box in green; the gene box in black; and the BLAST box in red. (B) The list of all analyzed genes obtained following the link on the home page. (C) The DNA-based feature search page. (D) The advanced CST search page, where all annotated features may be used in combination or alone to query the database. (E) The gene-based CST search page, which allows a more detailed gene search.

In addition, CSTs can be retrieved by selecting one or more genes, either as a quick search option from the home page (Figure 1A, black box) or by following the ‘gene’ link. Genes may be selected by gene symbol, disease name and several other criteria, alone or in combination (Figure 1E). The database can further be queried in the ‘Advanced’ section for CSTs selected according to a large number of annotated features, alone or in combination (Figure 1D). To facilitate the search, reduced feature sets are available in which CSTs can be searched by (a) DNA-based features, such as presence of tandem repeats, palindromes and SNPs (Figure 1C); (b) RNA-based features, such as presence of putative secondary structures, matches with ESTs and GENSCAN predicted exons; (c) protein-coding features, such as exon annotation, coding potential and BLAST matches with proteins; (d) localization to selected chromosomal regions. Finally, CSTs can be searched by BLAST sequence analysis from the home page (Figure 1A, red box).
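The idea of combining annotated features in a single query can be sketched as a set of predicates applied to CST records. The example below is purely illustrative: the `CST` record, its field names, the `search` and `in_region` helpers and the three sample records are all hypothetical and do not reflect the actual DG-CST schema or interface.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CST:
    """Hypothetical CST record with a handful of annotated features."""
    cst_id: int
    gene: str
    chromosome: str
    start: int
    end: int
    identity: float
    has_snp: bool
    est_matches: int


Predicate = Callable[[CST], bool]


def search(csts: Iterable[CST], *criteria: Predicate) -> list[CST]:
    """Return the CSTs that satisfy every criterion (logical AND)."""
    return [c for c in csts if all(p(c) for p in criteria)]


def in_region(chromosome: str, start: int, end: int) -> Predicate:
    """Criterion: the CST overlaps the given chromosomal interval."""
    return lambda c: (c.chromosome == chromosome
                      and c.start < end and c.end > start)


# Made-up records, for illustration only.
records = [
    CST(1, "GENE_A", "12", 1_000, 1_150, 0.82, True, 3),
    CST(2, "GENE_A", "12", 5_000, 5_120, 0.74, False, 0),
    CST(3, "GENE_B", "7", 2_000, 2_300, 0.91, True, 0),
]

# DNA-based (SNP), RNA-based (EST match) and localization criteria combined.
found = search(
    records,
    lambda c: c.has_snp,
    lambda c: c.est_matches > 0,
    in_region("12", 0, 2_000),
)
```

Here `found` contains only the first record. Passing criteria as separate functions keeps each feature test independent, so the same helper covers single-feature and combined searches.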

DATA DISPLAY

When searching DG-CST using the previously described ‘DNA-based’, ‘RNA-based’, ‘protein-based’, ‘advanced’ and ‘localization’ features, a list of CST entries that meet the search criteria can be accessed. Individual CSTs may be visualized in a specific page where all annotations available on that particular CST are displayed (Figure 2C). Matching CSTs from other species may be seen and compared in a multi-sequence alignment (Figure 2D). Matches found for each CST in a number of BLAST searches, pre-run against collections of genomic, EST or protein databases, may also be displayed starting from the CST page.

Figure 2

The DG-CST database: data display. (A) Example of a gene entry (A2M) and the related CST list. (B) Graphical representation of the selected gene, accessible via the map link in (A). On mouse over, details of CST #250083 are displayed as an example. In this representation, CSTs are color-coded based on the number of matches with human ESTs. (C) Example of a CST entry with all annotations and the list of the corresponding CSTs conserved in other species. CST details are accessible either from the CST list of the gene page (A) or by clicking on the interactive graphical browser in (B). (D) Graphical representation of the sequence alignment of the orthologous CSTs shown in (C).

On the other hand, when searching by gene/disease name and/or symbol, it is possible to obtain a list of gene entries that meet the search criteria. Each gene entry, in addition to links to external resources such as LocusLink, Ensembl and OMIM, provides a ‘CST list’ link that gives access to the list of all CSTs found by analyzing the selected disease gene region, as shown in Figure 2A. By clicking on each entry, it is possible to access all the data pertaining to a given human CST, as described above. A graphical representation is accessible through a ‘map’ link, where CSTs and related annotations are shown within the context of the relevant gene and its transcripts (Figure 2B). Moving through the genomic region and zooming to various levels of detail are supported. CSTs may be labeled by a color code on the basis of several quantitative parameters, such as degree of human–mouse sequence identity, GC content, number of gaps, putative RNA secondary structures, palindromes and tandem repeats. To avoid an exceedingly crowded map, the graphic visualization tool allows the user to display selected CST subsets, such as: intergenic, intronic or exonic CSTs; CSTs containing putative TF binding sites; CSTs with matches to ESTs; CSTs conserved in additional species besides human and mouse, such as chicken, fugu and zebrafish. CSTs conserved in additional species have a higher probability of representing functional elements playing a basic role in vertebrates, as suggested by recent reports (23).
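Color coding by a quantitative parameter amounts to binning the value against a set of thresholds. The following sketch shows one way this could be done; the palette, the threshold values and the function name are assumptions chosen for illustration, since the text does not describe the actual scheme used by the DG-CST browser.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical palette, from lowest to highest bin.
PALETTE = ("grey", "blue", "green", "orange", "red")


def color_for(value: float, thresholds: Sequence[float],
              palette: Sequence[str] = PALETTE) -> str:
    """Map a value to a color; thresholds must be ascending, one fewer than colors."""
    if len(thresholds) != len(palette) - 1:
        raise ValueError("need exactly len(palette) - 1 thresholds")
    # bisect_right counts how many thresholds are <= value.
    return palette[bisect_right(thresholds, value)]


# Assumed bin edges for human-mouse identity (%).
identity_bins = (75.0, 80.0, 90.0, 95.0)
colors = [color_for(v, identity_bins) for v in (70.0, 80.0, 99.0)]
```

The same function could be reused for GC content or gap counts by supplying a different set of thresholds.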

CONCLUSIONS

DG-CST is an annotated collection of conserved sequences related to genes involved in genetic diseases and may represent a valuable resource for investigators interested in studying the molecular mechanisms that underlie genetic diseases. The database will be updated on a regular basis to include information on newly identified human disease genes as well as new genomic data (e.g. sequences from additional organisms). DG-CST may help in deciphering the spectrum of pathogenic mutations that cause genetic diseases. Mutations are usually searched for in the coding regions of a gene, but may also occur elsewhere. CSTs provide a vast library of putative novel functional sites, such as previously undescribed exons and/or elements that may play a role in regulating the level of gene expression. These sites may be functionally tested as well as screened for mutations in patients, particularly in diseases where analysis of the known functional elements of the disease gene has so far failed to identify a relevant number of causative mutations (24–27). Several lines of evidence point to the direct involvement of regulatory control elements in the pathogenesis of human disorders, through both chromosomal rearrangements (28,29) and point mutations (30–32). However, the recognition of pathogenic mutations in regulatory elements has so far been hampered by our limited knowledge of the structure and function of the elements associated with disease genes. The DG-CST database should be a valuable resource for filling this information gap and for facilitating efforts aimed both at elucidating the function of disease genes and at better understanding the pathogenetic mechanisms of genetic diseases.

32 in total

1. GALA, a database for genomic sequence alignments and annotations.

Authors: Belinda Giardine; Laura Elnitski; Cathy Riemer; Izabela Makalowska; Scott Schwartz; Webb Miller; Ross C Hardison
Journal: Genome Res Date: 2003-04 Impact factor: 9.043

2. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

3. Numerous potentially functional but non-genic conserved sequences on human chromosome 21.

Authors: Emmanouil T Dermitzakis; Alexandre Reymond; Robert Lyle; Nathalie Scamuffa; Catherine Ucla; Samuel Deutsch; Brian J Stevenson; Volker Flegel; Philipp Bucher; C Victor Jongeneel; Stylianos E Antonarakis
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

4. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes.

Authors: Samuel Aparicio; Jarrod Chapman; Elia Stupka; Nik Putnam; Jer-Ming Chia; Paramvir Dehal; Alan Christoffels; Sam Rash; Shawn Hoon; Arian Smit; Maarten D Sollewijn Gelpke; Jared Roach; Tania Oh; Isaac Y Ho; Marie Wong; Chris Detter; Frans Verhoef; Paul Predki; Alice Tay; Susan Lucas; Paul Richardson; Sarah F Smith; Melody S Clark; Yvonne J K Edwards; Norman Doggett; Andrey Zharkikh; Sean V Tavtigian; Dmitry Pruss; Mary Barnstead; Cheryl Evans; Holly Baden; Justin Powell; Gustavo Glusman; Lee Rowen; Leroy Hood; Y H Tan; Greg Elgar; Trevor Hawkins; Byrappa Venkatesh; Daniel Rokhsar; Sydney Brenner
Journal: Science Date: 2002-07-25 Impact factor: 47.728

5. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.

Authors: Ada Hamosh; Alan F Scott; Joanna Amberger; Carol Bocchini; David Valle; Victor A McKusick
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

6. Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly.

Authors: Laura A Lettice; Taizo Horikoshi; Simon J H Heaney; Marijke J van Baren; Herma C van der Linde; Guido J Breedveld; Marijke Joosse; Nurten Akarsu; Ben A Oostra; Naoto Endo; Minoru Shibata; Mikio Suzuki; Eiichi Takahashi; Toshikatsu Shinka; Yutaka Nakahori; Dai Ayusawa; Kazuhiko Nakabayashi; Stephen W Scherer; Peter Heutink; Robert E Hill; Sumihare Noji
Journal: Proc Natl Acad Sci U S A Date: 2002-05-28 Impact factor: 11.205

7. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly.

Authors: Laura A Lettice; Simon J H Heaney; Lorna A Purdie; Li Li; Philippe de Beer; Ben A Oostra; Debbie Goode; Greg Elgar; Robert E Hill; Esther de Graaff
Journal: Hum Mol Genet Date: 2003-07-15 Impact factor: 6.150

8. ESEfinder: A web resource to identify exonic splicing enhancers.

Authors: Luca Cartegni; Jinhua Wang; Zhengwei Zhu; Michael Q Zhang; Adrian R Krainer
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

9. Human-mouse alignments with BLASTZ.

Authors: Scott Schwartz; W James Kent; Arian Smit; Zheng Zhang; Robert Baertsch; Ross C Hardison; David Haussler; Webb Miller
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

10. Strategies and tools for whole-genome alignments.

Authors: Olivier Couronne; Alexander Poliakov; Nicolas Bray; Tigran Ishkhanov; Dmitriy Ryaboy; Edward Rubin; Lior Pachter; Inna Dubchak
Journal: Genome Res Date: 2003-01 Impact factor: 9.043
