Literature DB >> 15608204

The PEDANT genome database in 2005.

M Louise Riley¹, Thorsten Schmidt, Christian Wagner, Hans-Werner Mewes, Dmitrij Frishman.

Abstract

The PEDANT genome database (http://pedant.gsf.de) contains pre-computed bioinformatics analyses of publicly available genomes. Its main mission is to provide robust automatic annotation of the vast majority of amino acid sequences, which have not been subjected to in-depth manual curation by human experts in high-quality protein sequence databases. By design PEDANT annotation is genome-oriented, making it possible to explore genomic context of gene products, and evaluate functional and structural content of genomes using a category-based query mechanism. At present, the PEDANT database contains exhaustive annotation of over 1,240,000 proteins from 270 eubacterial, 23 archeal and 41 eukaryotic genomes.

Entities: Chemical Gene Species

Mesh：

Year: 2005 PMID： 15608204 PMCID： PMC539973 DOI： 10.1093/nar/gki019

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Sequencing of a new genome is not a sensational event anymore: while the first genome sequences determined several years ago often made it to the front page of the New York Times, nowadays even getting them published in major scientific journals is becoming increasingly difficult. The novelty of genomic data as such is certainly passé, but their usefulness remains intact, and perhaps is even increasing owing to virtually unlimited opportunities for comparative genomic analysis and large-scale data mining. At the same time, due to the sheer amount of genomic data currently available, the task of maintaining an up-to-date and complete genome analysis database represents a significant challenge. The MIPS group (now Institute for Bioinformatics) in Munich began to provide exhaustive automatic analysis of all publicly available genomes in 1996, when only five genomic sequences were published (1). The main mission of the PEDANT genome database is to fill the gap between manually curated high-quality protein sequence databases, such as SWISS-PROT (2) or PIR International (3), and the enormous amounts of other protein sequences produced by genome sequencing projects at an ever increasing pace. For example, release 44.0 of SWISS-PROT contains 153 871 manually annotated proteins, while the total number of currently known protein sequences stands at roughly 2 500 000. Since the aforementioned gap is quickly growing, it is probably safe to say that the majority of protein sequences will never be subjected to in-depth annotation by human experts.

IMPLEMENTATION

We use the PEDANT software suite (4) for annotation of large amounts of protein sequences by a carefully selected set of established bioinformatics methods. Exhaustive functional characterization of protein sequences includes similarity searches against the entire non-redundant sequence database, detection of motifs and patterns, automatic assignment of genes to functional categories and clusters of orthologous groups (5), similarity-based prediction of enzyme classification, and extraction of keywords and superfamily information. Structural characterization of gene products is based on similarity searches against the Protein Data Bank (PDB) (6) database, sensitive recognition of structural domains using profile searches, secondary structure prediction, detection of transmembrane regions, and prediction of low complexity and coiled coil regions. By design, PEDANT provides protein sequence annotation in genomic context. The PEDANT genome browser enables the user to select functional or structural categories of interest, obtain the list of gene products from a particular organism assigned to this category, and then view detailed information on each protein presented as an integrated report page. Advanced DNA and protein viewers allow visualizing the positions of genes and other genetic elements on the chromosome, and predicted structural and functional information about proteins, respectively. Facilities for searching the PEDANT annotation using text queries as well as BLAST (7) and pattern searches are provided. The PEDANT genome database is produced by systematically applying the automatic annotation pipeline described above to all genomic sequences that are being released in the public domain. The major premises of the PEDANT database are as listed below: Timeliness. The MIPS CPU resources make it possible to process a medium-size prokaryotic genome and make it available online essentially overnight. Completeness. We seek to process all completely sequenced genomes as well as many incomplete genomes, which are being made available by sequencing centers. In many cases, PEDANT represents the only source of annotation for a given genome. Standardization. Automatic annotation of sequences follows a clearly defined protocol in terms of the particular set of bioinformatics techniques applied to each sequence and the values of pre-determined recognition thresholds used for individual methods (e.g. BLAST E-values). Documentation. Since the results of automatic sequence analyses are inevitably afflicted by a large number of false assignments, we make available the raw output of each bioinformatics method used. This allows the user to make his own judgment on the validity of functional predictions appearing on each protein's report page.

DATABASE CONTENT

Over the past eight years, the number of analyzed genomes in the PEDANT database has grown steadily (Figure 1) and stands at 334 at the time of writing, including 228 completely sequenced and 106 unfinished genomic sequences from all three kingdoms of life (Figure 2). Most of these genomes were annotated in a totally unsupervised fashion. However, the database also includes several genomes that were manually annotated and, in many cases, published by MIPS. Those are Saccharomyces cerevisiae (8), Thermoplasma acidophilum (9), Arabidopsis thaliana (10), Neurospora crassa (11), Parachlamydia UWE25 (12), Listeria monocytogenes EGD, Listeria innocuaClip 11262 and Helicobacter pylori KE26695. The total amount of data managed by PEDANT via a relational database system MySQL, is ∼360 GB, more than one gigabyte per genome on average.

Figure 1

Growth of the number of annotated genomes in the PEDANT database since 1998. *Number as of September 1, 2004.

Figure 2

Number of annotated genomes (A) and protein sequences (B) in different genome categories.

To illustrate the functional and structural content of the PEDANT database, we calculated the coverage of all 1 240 000 annotated protein sequences by three selected popular categories: PFAM sequence motifs (13), SCOP structural domains (14) and MIPS functional role categories (15). As seen in Figure 3, the coverage varies in a wide range—from 64.3% by PFAM to 34.5% by SCOP. Only 15.2% of proteins possess all three attributes emphasizing the usefulness of applying many complementary bioinformatics techniques. The total number of all attributes computed by PEDANT for each sequence exceeds 20. The PEDANT database thus represents a valuable resource for large-scale association rule mining in automatically generated protein annotation.

Figure 3

An illustration of the functional and structural content of the PEDANT database. The figure shows the percentage of protein sequences associated with PFAM sequence motifs, SCOP structural domains and MIPS functional categories, as well as any combinations of these three attributes.

AUTOMATIC FUNCAT

The MIPS Functional Catalogue (FunCat) was developed in 1996 and used in the annotation of S.cerevisiae (8). It comprises a hierarchically structured classification system, which at first only contained categories describing yeast biology. Since then, it has been extended and used to annotate the following genomes: T.acidophilum, Bacillus subtilis 168, L.monocytogenes EGD, L.innocuaClip 11262, H.pylori KE26695, N.crassa, A.thaliana and H.sapiens. The most recent version of the FunCat (v. 2.0; 16) is organism independent and consists of 28 main categories, covering features such as metabolism and cellular transport, as well as some more recently introduced categories (e.g. development and organ localization). The main categories are assigned a unique two-digit number e.g. 01. metabolism, which appears as the first two digits of the FunCat number. The main categories are branched into more specific categories, with up to six levels of increasing specificity (e.g. 01.01.06.05.01.01 biosynthesis of homocysteine). The PEDANT software calculates automatic FunCat numbers based on a gene product's similarity to proteins in the manually annotated protein FunCat database. Although assignment of FunCat numbers by homology alone is not always reliable, it may provide useful information in the absence of manual annotation. The automatic FunCat tables for all PEDANT databases were recalculated using the new FunCat version and updated manually annotated FunCat database. Figure 4 shows the FunCat distribution of all 334 genomes in PEDANT.

Figure 4

The FunCat distribution of all 334 genomes in PEDANT. Here, the relative amounts of proteins that are assigned to one or more of six general FunCat classes are shown. Since proteins can be assigned to more than one functional category, the total fraction exceeds 100%.

15 in total

1. The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum.

Authors: A Ruepp; W Graml; M L Santos-Martinez; K K Koretke; C Volker; H W Mewes; D Frishman; S Stocker; A N Lupas; W Baumeister
Journal: Nature Date: 2000-09-28 Impact factor: 49.962

2. The distribution and query systems of the RCSB Protein Data Bank.

Authors: Philip E Bourne; Kenneth J Addess; Wolfgang F Bluhm; Li Chen; Nita Deshpande; Zukang Feng; Ward Fleri; Rachel Green; Jeffrey C Merino-Ott; Wayne Townsend-Merino; Helge Weissig; John Westbrook; Helen M Berman
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.

Authors: Brigitte Boeckmann; Amos Bairoch; Rolf Apweiler; Marie-Claude Blatter; Anne Estreicher; Elisabeth Gasteiger; Maria J Martin; Karine Michoud; Claire O'Donovan; Isabelle Phan; Sandrine Pilbout; Michel Schneider
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. The PEDANT genome database.

Authors: Dmitrij Frishman; Martin Mokrejs; Denis Kosykh; Gabi Kastenmüller; Grigory Kolesov; Igor Zubrzycki; Christian Gruber; Birgitta Geier; Andreas Kaps; Kaj Albermann; Andreas Volz; Christian Wagner; Matthias Fellenberg; Klaus Heumann; Hans-Werner Mewes
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. Illuminating the evolutionary history of chlamydiae.

Authors: Matthias Horn; Astrid Collingro; Stephan Schmitz-Esser; Cora L Beier; Ulrike Purkhold; Berthold Fartmann; Petra Brandt; Gerald J Nyakatura; Marcus Droege; Dmitrij Frishman; Thomas Rattei; Hans-Werner Mewes; Michael Wagner
Journal: Science Date: 2004-04-08 Impact factor: 47.728

6. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes.

Authors: Andreas Ruepp; Alfred Zollner; Dieter Maier; Kaj Albermann; Jean Hani; Martin Mokrejs; Igor Tetko; Ulrich Güldener; Gertrud Mannhaupt; Martin Münsterkötter; H Werner Mewes
Journal: Nucleic Acids Res Date: 2004-10-14 Impact factor: 16.971

7. MIPS: a database for genomes and protein sequences.

Authors: H W Mewes; D Frishman; U Güldener; G Mannhaupt; K Mayer; M Mokrejs; B Morgenstern; M Münsterkötter; S Rudd; B Weil
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

Review 8. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

9. Overview of the yeast genome.

Authors: H W Mewes; K Albermann; M Bähr; D Frishman; A Gleissner; J Hani; K Heumann; K Kleine; A Maierl; S G Oliver; F Pfeiffer; A Zollner
Journal: Nature Date: 1997-05-29 Impact factor: 49.962

10. SCOP database in 2004: refinements integrate structure and sequence family data.

Authors: Antonina Andreeva; Dave Howorth; Steven E Brenner; Tim J P Hubbard; Cyrus Chothia; Alexey G Murzin
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

23 in total

Review 1. FINDSITE: a combined evolution/structure-based approach to protein function prediction.

Authors: Jeffrey Skolnick; Michal Brylinski
Journal: Brief Bioinform Date: 2009-03-26 Impact factor: 11.622

2. Munich information center for protein sequences plant genome resources: a framework for integrative and comparative analyses 1(W).

Authors: Heiko Schoof; Manuel Spannagl; Li Yang; Rebecca Ernst; Heidrun Gundlach; Dirk Haase; Georg Haberer; Klaus F X Mayer
Journal: Plant Physiol Date: 2005-07 Impact factor: 8.340

3. Combining experimental and predicted datasets for determination of the subcellular location of proteins in Arabidopsis.

Authors: Joshua L Heazlewood; Julian Tonti-Filippini; Robert E Verboom; A Harvey Millar
Journal: Plant Physiol Date: 2005-10 Impact factor: 8.340

4. The filamentous fungal gene expression database (FFGED).

Authors: Zhang Zhang; Jeffrey P Townsend
Journal: Fungal Genet Biol Date: 2009-12-16 Impact factor: 3.495

5. A global topology map of the Saccharomyces cerevisiae membrane proteome.

Authors: Hyun Kim; Karin Melén; Marie Osterberg; Gunnar von Heijne
Journal: Proc Natl Acad Sci U S A Date: 2006-07-17 Impact factor: 11.205

6. Fungome: Annotating proteins implicated in fungal pathogenesis.

Authors: Ranganath Gudimella; Sivaramaiah Nallapeta; Pritish Varadwaj; Prashanth Suravajhala
Journal: Bioinformation Date: 2010-11-01

7. The Mouse Functional Genome Database (MfunGD): functional annotation of proteins in the light of their cellular context.

Authors: Andreas Ruepp; Octave Noubibou Doudieu; Jos van den Oever; Barbara Brauner; Irmtraud Dunger-Kaltenbach; Gisela Fobo; Goar Frishman; Corinna Montrone; Christine Skornia; Steffi Wanka; Thomas Rattei; Philipp Pagel; Louise Riley; Dmitrij Frishman; Dimitrij Surmeli; Igor V Tetko; Matthias Oesterheld; Volker Stümpflen; H Werner Mewes
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971