EPD in its twentieth year: towards complete promoter coverage of selected model organisms.

Christoph D Schmid, Rouaïda Perier, Viviane Praz, Philipp Bucher.

Abstract

The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, experimentally defined by a transcription start site (TSS). Access to promoter sequences is provided by pointers to positions in the corresponding genomes. Promoter evidence comes from conventional TSS mapping experiments for individual genes, or, starting from release 73, from mass genome annotation projects. Subsets of promoter sequences with customized 5' and 3' extensions can be downloaded from the EPD website. The focus of current development efforts is to reach complete promoter coverage for important model organisms as soon as possible. To speed up this process, a new class of preliminary promoter entries has been introduced as of release 83, which requires less stringent admission criteria. As part of a continuous integration process, new web-based interfaces have been developed, which allow joint analysis of promoter sequences with other bioinformatics resources developed by our group, in particular programs offered by the Signal Search Analysis Server, and gene expression data stored in the CleanEx database. EPD can be accessed at http://www.epd.isb-sib.ch.

Year: 2006 PMID: 16381980 PMCID: PMC1347508 DOI: 10.1093/nar/gkj146

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

HISTORICAL BACKGROUND

The Eukaryotic Promoter Database (EPD) originates from a promoter compilation published in this journal 20 years ago (1). Two years later, this collection became available in machine-readable form as an accessory database of the EMBL nucleotide sequence data library. Since then, EPD has undergone many changes, but its primary objective has remained the same: to provide access to experimentally mapped eukaryotic promoter sequences and to keep track of transcription start site (TSS) mapping data.

Over the last 20 years, not only has our database evolved; the biologist's view of promoters, the experimental protocols for mapping TSSs, and the biological data environment have changed as well. When we started to compile promoter sequences, the commonly held views were that (i) each gene has one promoter, (ii) transcription always initiates at the same nucleotide and (iii) there is one sequence motif, the TATA-box, common to all promoters recognized by the eukaryotic polymerase II system. None of these assumptions has turned out to be true. Today we know that many human genes are transcribed from multiple promoters, which are not necessarily close to each other on the genome and often give rise to alternative first exons. Moreover, transcription initiation mechanisms appear to be less precise than initially assumed. In the human genome, it is not uncommon for the 5′ ends of mRNAs transcribed from the same promoter region to be spread over >50 bp (2). Finally, promoters have turned out to be heterogeneous with regard to sequence motif content. According to recent surveys, the TATA-box element, once considered universal, occurs in only about a third of all promoters in the systematically analyzed genomes of human (3), Drosophila melanogaster (4) and Arabidopsis thaliana (5).

The experimental procedures for mapping promoters, as well as the way EPD entries are constructed from public data, underwent drastic changes at the beginning of the functional genomics era. Previously, promoters were mapped one gene at a time by techniques such as nuclease protection assays and primer extension analysis. The corresponding EPD entries were the result of a critical examination and independent interpretation of data published in paper-based journal articles. Today, TSSs are mapped for a whole genome at once with high-throughput technologies such as 5′ SAGE (6) or CAGE (7). The resulting data are disseminated in machine-readable form over the internet. As a consequence, EPD entries are now largely generated by intelligent Perl scripts with built-in quality control procedures rather than by critical readers of scientific articles. An overview of publicly available mass genome annotation data useful for promoter mapping is given in Table 1.

Table 1

Summary of currently accessible mass genome annotation data for promoter mapping

Organism	Access	Number of sequences or tags	Reference
5′ EST sequences from oligo-capped cDNA libraries
Human		400 225	Suzuki et al. (11)
Mouse		580 209	Suzuki et al. (11)
Drosophila	Sequences available from GenBank/EMBL, accession numbers extractable from Unigene (23), Unilib IDs 23941 or 23942	102 617	Stapleton et al. (27)
Arabidopsis		92 654	Seki et al. (28)
5′ sequence tags (5′ SAGE, CAGE, GIS ditag)
Human		22 546	Hashimoto et al. (6)
Human		5 992 395	Carninci et al. (7)
Mouse		11 567 973	Carninci et al. (7)
Mouse		225 914	Ng et al. (29)
Reference sequence collections from oligo-capped cDNA libraries
Rice		30 598	Kikuchi et al. (22)

The third column indicates the number of available sequences or tags.

Undoubtedly, the sequence data environment has undergone the most spectacular revolution during EPD's life span. When we started to compile promoters, sequences were available for only a few hundred short pieces of the human genome, most of them barely exceeding a thousand base pairs in length. Today, we have access to several complete genomes of higher eukaryotes totaling billions of nucleotides.

Despite these changes, the conceptual organization and data representation of EPD have remained remarkably stable. In fact, we anticipated many of the forthcoming changes in the initial design of EPD. For instance, we distinguished from the very beginning three classes of promoters, characterized by (i) single initiation sites, (ii) clustered multiple initiation sites and (iii) transcription initiation regions. We also allowed for multiple promoters per gene, being aware of the few such examples known at that time. The decision to provide access to promoter sequences indirectly, through machine-readable pointers to sequences stored elsewhere, turned out to be very helpful during the transition from the old-style nucleotide sequence database to the whole genome environment.

EPD is no longer the only public database maintained by our group. The gene expression database CleanEx (8) and the Signal Search Analysis (SSA) server (9) are complementary bioinformatics resources developed in close coordination by partly overlapping teams. Note that CleanEx originated from a companion database of EPD called EPDEX (10), which has by now become largely obsolete. Whereas the source file distributions of the three products via FTP will remain self-contained and stand-alone, efforts are underway to integrate the corresponding web access tools into a tightly interconnected system for gene regulatory sequence analysis.

EPD is also no longer the only database providing information about experimentally mapped TSSs. DBTSS (11) and PromoSer (12) are comprehensive collections of mammalian promoters based on clustering of expressed sequence tag (EST) and full-length cDNA sequences. These resources define the TSS as the furthest 5′ position in the genome that can be aligned with the 5′ end of a cDNA from the corresponding gene. In contrast, EPD considers the most frequent cDNA 5′ end to be the TSS and further applies a specialized algorithm to infer multiple promoters for a given gene. Arguments and results in favor of our approach were presented in a previous article (13). PlantProm (14) is a smaller database of plant promoters based on published TSS mapping data. HemoPDB (15) is a more specialized resource for promoters of genes of the hematopoietic system, providing information on transcription factor binding sites in addition to TSS annotation. OMGProm (16), DoOP (17) and CORG (18) are databases of orthologous promoters with a comparative genomics focus.

A detailed description of the contents and format of EPD was given in Ref. (19). Information about interfaces and support for local installations can be found in Refs (20,21). New format features for promoter entries derived from mass genome annotation data are described in Ref. (10). The in silico primer extension protocol used for generating promoter entries from mass genome annotation data is detailed in Ref. (13).
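The two TSS definitions contrasted above can be illustrated with a minimal Python sketch. It is not the code used by EPD, DBTSS or PromoSer; the function names, the tie-breaking rule and the example coordinates are assumptions made for illustration only.

```python
from collections import Counter


def furthest_upstream_tss(five_prime_ends, strand="+"):
    """TSS defined as the most 5' (upstream-most) aligned cDNA end."""
    return min(five_prime_ends) if strand == "+" else max(five_prime_ends)


def most_frequent_tss(five_prime_ends, strand="+"):
    """TSS defined as the most frequently observed cDNA 5' end.

    Ties are resolved in favour of the upstream-most position; this
    tie-breaking rule is an assumption, not taken from the article.
    """
    counts = Counter(five_prime_ends)
    best = max(counts.values())
    tied = [pos for pos, n in counts.items() if n == best]
    return min(tied) if strand == "+" else max(tied)


# Hypothetical genomic coordinates of cDNA 5' ends for one gene (+ strand).
ends = [1000, 1041, 1042, 1042, 1042, 1042, 1043]
print(furthest_upstream_tss(ends))  # 1000
print(most_frequent_tss(ends))      # 1042
```

In this made-up example a single long cDNA moves the "furthest 5′" estimate 42 bp upstream of the position supported by most cDNAs, which is the kind of discrepancy the two definitions can produce.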

TOWARDS COMPLETE COVERAGE FOR MODEL ORGANISMS

In the past, the maintenance policy of EPD was to guarantee high quality standards. In order to be included, a promoter had to satisfy stringent criteria regarding its experimental characterization (13). Undoubtedly, the user community of EPD (mostly computational biologists) has appreciated this focus on quality rather than quantity. The downside, however, is that promoter coverage of important model genomes has remained modest.

Today, the demand is slowly changing. As a result of the Human Genome Sequencing Project, the so-called global approach to organisms has become fashionable. Intensive efforts are currently being made to functionally annotate the complete genomes of various model organisms by experimental as well as computational methods. In response to these trends, we have redefined the priorities of our development efforts. Our stated objective is now to reach complete promoter coverage for three model organisms (human, D.melanogaster and rice) as soon as possible. To reconcile the contrasting objectives of high quality and quantity, we have introduced a new class of promoter entries called ‘preliminary’. Such entries fulfill less stringent admission criteria and are generated automatically from mass genome annotation data and other genome information resources. Some of these entries are based on external annotation efforts. There are several potential reasons why a preliminary entry may not be acceptable as a standard, high-quality entry: insufficient experimental data, missing information about computational TSS inference procedures, format incompatibilities with third-party annotations, or uncertainties about the identity of the corresponding genes. The last of these arises, for instance, with sparsely annotated new genomes such as the rice genome.

The inclusion of preliminary promoter entries was encouraged by the successful development of sequence motif-based tests to assess the quality of automatically generated promoter sets (13). These tests take into account the occurrence frequency and positional distribution of predicted promoter elements in the evaluated promoter set in order to estimate the amount of contaminating non-promoter sequences and the average error of TSS positions. Corresponding results obtained from a high-quality promoter set from the same organism are used for calibration. Since preliminary promoter entries are always generated in large numbers by the same procedure, statistically robust quality estimates can be obtained for groups of such entries, but obviously not for individual promoters. The quality evaluation procedure outlined above is also used for choosing the acceptance threshold and for fine-tuning certain parameters of the data processing pipeline for preliminary entries.

To illustrate this principle, consider the in silico primer extension method for inferring TSSs. Our currently implemented procedure relies on the program madap for identifying clusters of cDNA 5′ ends mapped to the genome. For standard EPD entries, we require at least 10 cDNAs per cluster. For preliminary entries, we could simply lower this threshold. Sequence motif-based tests as described above suggest that a threshold as low as three cDNAs would still yield acceptable quality for preliminary promoter entries.

The first set of preliminary entries ready for inclusion in EPD happened to be a collection of 13 046 rice promoters, derived from a reference collection of ∼30 500 mRNA sequences published by the rice full-length cDNA Consortium (22). This reference collection was generated by clustering and genome mapping of ∼170 000 initial cDNA sequences, all from libraries generated with the oligo-capping method. Several reasons prevented us from making standard EPD entries from this external genome annotation resource: (i) we had no access to the primary data, (ii) in the general case, we had to rely on one full-length cDNA sequence per gene and consequently were unable to assign the promoter to one of the three TSS classes (single, multiple or region) and (iii) the preliminary annotation status of the rice genome made it impossible to provide meaningful gene descriptions for many promoters.

In our local data processing pipeline, we first subjected the rice mRNA sequence collection to additional quality control steps. Sequences were discarded unless their 5′ terminal 11 nt matched the rice genome with at most one mismatch. By checking the assignment of the corresponding GenBank/EMBL accession numbers to Unigene clusters (23), we were able to eliminate hitherto undiscovered redundancy in the original collection. For the remaining entries, the rudimentary gene annotation provided by the consortium was complemented with information from the genome annotation project at TIGR [release 3.0: December 30, 2004 (24)]. Application of the sequence motif-based evaluation procedure to this preliminary promoter set indicated that the TSS assignment was of similar precision to that of standard EPD entries. However, we cannot exclude for the moment the possibility that the collection is contaminated with a sizable fraction of non-promoter sites. In an additional test, we tried to compare the newly generated entries with already existing EPD entries for the same promoters. We found only seven examples suitable for this purpose. Of those, five preliminary entries matched their high-quality counterparts, with TSS position shifts of −4, −2, +2, +2 and +25 bp.

Additional preliminary promoter sets are in preparation. Most of them will be based on in silico primer extension protocols with relaxed constraints, as described above.
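The effect of relaxing the cluster-size threshold can be sketched in Python as follows. This is an illustration under stated assumptions, not the madap program: the gap-based clustering rule, the `max_gap` parameter and the example coordinates are hypothetical, and only the thresholds of 10 and 3 cDNAs come from the text. Only the forward strand is handled.

```python
from collections import Counter

STANDARD_MIN_CDNAS = 10    # stated requirement for standard entries
PRELIMINARY_MIN_CDNAS = 3  # lower threshold suggested by motif-based tests


def cluster_five_prime_ends(positions, max_gap=20):
    """Group 5' end positions; a gap larger than max_gap opens a new cluster.

    max_gap is a hypothetical parameter, not a documented madap setting.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def classify_cluster(cluster):
    """Assign an entry class from the number of supporting cDNAs."""
    n = len(cluster)
    if n >= STANDARD_MIN_CDNAS:
        return "standard"
    if n >= PRELIMINARY_MIN_CDNAS:
        return "preliminary"
    return "rejected"


def call_promoters(positions, max_gap=20):
    """Return (tss, n_cdnas, status) for every accepted cluster."""
    calls = []
    for cluster in cluster_five_prime_ends(positions, max_gap):
        status = classify_cluster(cluster)
        if status == "rejected":
            continue
        counts = Counter(cluster)
        # Most frequent 5' end; ties go to the smallest coordinate (assumption).
        tss = min(counts, key=lambda p: (-counts[p], p))
        calls.append((tss, len(cluster), status))
    return calls


# Hypothetical 5' end coordinates on the forward strand.
ends = [100] * 6 + [101] * 3 + [103] * 2 + [500, 502, 502, 502] + [900]
print(call_promoters(ends))
# [(100, 11, 'standard'), (502, 4, 'preliminary')]
```

With these made-up data, the cluster of four cDNAs would be lost under the standard criterion but is retained as a preliminary entry, while the singleton at position 900 is rejected in both cases.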
Preliminary EPD entries are available in a separate file named epd_bulk.dat from our FTP server. The web-based pages provide access to both standard and preliminary entries. Note that preliminary entries are identified by the keyword ‘preliminary’ on the ID line.
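Selecting preliminary entries by the keyword on the ID line could look like the following Python sketch. It assumes an EMBL-style flat file in which each entry starts with an `ID` line and ends with a `//` line; the sample records and their field layout are invented for illustration and do not reproduce real EPD entries.

```python
def split_entries(text):
    """Yield entries as lists of lines; '//' is assumed to end an entry."""
    entry = []
    for line in text.splitlines():
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)
    if entry:
        yield entry


def is_preliminary(entry):
    """True if the entry's ID line carries the keyword 'preliminary'."""
    for line in entry:
        if line.startswith("ID"):
            return "preliminary" in line.lower()
    return False


# Hypothetical records; real ID lines may contain different fields.
SAMPLE = """\
ID   XX_EXAMPLE1   standard; single.
//
ID   XX_EXAMPLE2   preliminary.
//
"""

entries = list(split_entries(SAMPLE))
print([e[0].split()[1] for e in entries if is_preliminary(e)])
# ['XX_EXAMPLE2']
```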

OTHER RECENT DEVELOPMENTS

In response to numerous requests, we have included in the FTP release new FASTA-formatted promoter sequence library files with an extended range of −9999 to +6000 relative to the TSS. The popular sequence download page, which can be used for retrieval of biologically meaningful promoter sequence subsets of user-defined extension, has a new feature allowing direct sequence transfer to the SSA server (9). The web-based entry viewers have been equipped with genome position hyperlinks to the ‘ENSEMBL ContigView’ (25) and the ‘UCSC Genome Browser’ (26). Moreover, a graphical representation of the initiation site patterns (Figure 1) has been added to the ‘niceview’ display for those EPD entries that include a cDNA 5′ end profile derived by in silico primer extension.
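How a TSS-relative range translates into genomic coordinates on either strand can be sketched in Python as below. The convention used here (1-based inclusive coordinates, offset 0 denoting the TSS itself) is an assumption made for the example and may differ from the one used in the EPD files; the function names and the toy sequence are likewise hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def promoter_window(tss, strand, upstream=-9999, downstream=6000):
    """Return (start, end), 1-based inclusive, for a TSS-relative range.

    Offset 0 is taken to be the TSS itself (assumed convention).
    """
    if strand == "+":
        return tss + upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss - upstream
    raise ValueError("strand must be '+' or '-'")


def extract_promoter(genome_seq, tss, strand, upstream, downstream):
    """Cut the window out of genome_seq, reverse-complementing on '-'."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    if start < 1 or end > len(genome_seq):
        raise ValueError("window extends beyond the sequence")
    seq = genome_seq[start - 1:end]
    return seq if strand == "+" else seq.translate(_COMPLEMENT)[::-1]


print(promoter_window(20000, "+"))  # (10001, 26000)
print(promoter_window(20000, "-"))  # (14000, 29999)

toy = "AAACCCGGGTTT"
print(extract_promoter(toy, 6, "+", -2, 1))  # CCCG
print(extract_promoter(toy, 7, "-", -2, 1))  # CCCG
```

Under this convention the default window spans 16 000 positions on both strands; the sketch raises an error rather than clipping when a window would run off the end of the sequence.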

Figure 1

Graphical representation of the distribution of 5′ ends of full-length transcripts. The diagram is based on data from the Berkeley Drosophila Genome Project for gene ARF79F and is part of the ‘niceview’ display of EPD entry DM_ARF1_2.

29 in total

1. CleanEx: a database of heterogeneous gene expression data based on a consistent gene nomenclature.

Authors: Viviane Praz; Vidhya Jagannathan; Philipp Bucher
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. 5'-end SAGE for the analysis of transcriptional start sites.

Authors: Shin-ichi Hashimoto; Yutaka Suzuki; Yasuhiro Kasai; Kei Morohoshi; Tomoyuki Yamada; Jun Sese; Shinichi Morishita; Sumio Sugano; Kouji Matsushima
Journal: Nat Biotechnol Date: 2004-08-08 Impact factor: 54.908

3. The Eukaryotic Promoter Database (EPD): recent developments.

Authors: R C Périer; T Junier; C Bonnard; P Bucher
Journal: Nucleic Acids Res Date: 1999-01-01 Impact factor: 16.971

4. The Eukaryotic Promoter Database EPD.

Authors: R Cavin Périer; T Junier; P Bucher
Journal: Nucleic Acids Res Date: 1998-01-01 Impact factor: 16.971

5. Compilation and analysis of eukaryotic POL II promoter sequences.

Authors: P Bucher; E N Trifonov
Journal: Nucleic Acids Res Date: 1986-12-22 Impact factor: 16.971

6. OMGProm: a database of orthologous mammalian gene promoters.

Authors: Saranyan K Palaniswamy; Victor X Jin; Hao Sun; Ramana V Davuluri
Journal: Bioinformatics Date: 2004-11-05 Impact factor: 6.937

7. DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants.

Authors: Endre Barta; Endre Sebestyén; Tamás B Pálfy; Gábor Tóth; Csaba P Ortutay; László Patthy
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

8. Genome wide analysis of Arabidopsis core promoters.

Authors: Carlos Molina; Erich Grotewold
Journal: BMC Genomics Date: 2005-02-25 Impact factor: 3.969

9. Ensembl 2005.

Authors: T Hubbard; D Andrews; M Caccamo; G Cameron; Y Chen; M Clamp; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; J Gilbert; M Hammond; J Herrero; H Hotz; K Howe; V Iyer; K Jekosch; A Kahari; A Kasprzyk; D Keefe; S Keenan; F Kokocinsci; D London; I Longden; G McVicker; C Melsopp; P Meidl; S Potter; G Proctor; M Rae; D Rios; M Schuster; S Searle; J Severin; G Slater; D Smedley; J Smith; W Spooner; A Stabenau; J Stalker; R Storey; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; E Birney
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Wolfgang Helmberg; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Joan U Pontius; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
