
PEDANT covers all complete RefSeq genomes.

Mathias C Walter, Thomas Rattei, Roland Arnold, Ulrich Güldener, Martin Münsterkötter, Karamfilka Nenova, Gabi Kastenmüller, Patrick Tischler, Andreas Wölling, Andreas Volz, Norbert Pongratz, Ralf Jost, Hans-Werner Mewes, Dmitrij Frishman.

Abstract

The PEDANT genome database provides exhaustive annotation of nearly 3000 publicly available eukaryotic, eubacterial, archaeal and viral genomes with more than 4.5 million proteins by a broad set of bioinformatics algorithms. In particular, all completely sequenced genomes from the NCBI's Reference Sequence collection (RefSeq) are covered. The PEDANT processing pipeline has been sped up by an order of magnitude through the utilization of precalculated similarity information stored in the similarity matrix of proteins (SIMAP) database, making it possible to process newly sequenced genomes immediately as they become available. PEDANT is freely accessible to academic users at http://pedant.gsf.de. For programmatic access Web Services are available at http://pedant.gsf.de/webservices.jsp.

Year:  2008        PMID: 18940859      PMCID: PMC2686588          DOI: 10.1093/nar/gkn749

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Since its first announcement in 1997 (1), the PEDANT genome database has steadily grown to become one of the most comprehensive collections of automatically annotated genomes. As of September 2008, PEDANT covers all complete genomes as provided by the RefSeq (2) database. In total, 861 completely sequenced genomes from all three domains of life as well as 2081 complete viral genomes are available (Table 1). Here, we define a ‘complete genome’ as a genome whose chromosomal datasets exist as RefSeq records or Ensembl (3) entries and for which genes have been predicted. For those eukaryotic genomes (currently 33) that are available from both RefSeq and Ensembl, we provide the annotation of both versions. This results in a total of 2975 genome databases with 4.5 million proteins, occupying 3.1 TB of storage. All PEDANT databases are continuously updated. For example, assignments of genes to the MIPS Functional Catalog (FunCat) (4) have recently been recalculated using the new 2.1 version of FunCat (http://mips.gsf.de/projects/funcat).
Table 1. The number of species from major taxonomic groups contained in the PEDANT genome database as of September 2008

NCBI Taxonomy ID    Taxonomic group         Number of genomes
131567              Cellular organisms      861
2157                Archaea                 53
2                   Bacteria                691
2759                Eukaryota               117
4751                    Fungi               57
33208                   Metazoa             42
33090                   Viridiplantae       6
                        Other               12
10239               Viruses                 2081
                    Total                   2942

Other groups: Alveolata (2), Amoebozoa (1), Cryptophyta (1), Euglenozoa (5), Rhodophyta (1), Stramenopiles (2).
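
The totals quoted in the text can be cross-checked against Table 1 with a few lines of arithmetic. The sketch below assumes that each of the 33 eukaryotic genomes available from both RefSeq and Ensembl contributes exactly one additional database; the variable names are illustrative only.

```python
# Genome counts as given in Table 1 (September 2008).
cellular = {"Archaea": 53, "Bacteria": 691, "Eukaryota": 117}
eukaryota = {"Fungi": 57, "Metazoa": 42, "Viridiplantae": 6, "Other": 12}
viruses = 2081

# Assumption: every genome annotated from both RefSeq and Ensembl
# adds one extra genome database.
dual_source_eukaryotes = 33

cellular_total = sum(cellular.values())
genome_total = cellular_total + viruses
database_total = genome_total + dual_source_eukaryotes

# The eukaryotic subgroups must add up to the Eukaryota row.
assert sum(eukaryota.values()) == cellular["Eukaryota"]

print(cellular_total, genome_total, database_total)  # prints: 861 2942 2975
```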

The current version of the software driving the PEDANT web site, which we refer to as PEDANT3, represents an industry-strength Java workbench that supports large-scale grid computing and utilizes a workflow-based processing engine (D. Frishman et al., manuscript in preparation). Dozens of custom workflows are available: generic workflows for eukaryotic, prokaryotic and viral genomes as well as more specialized workflows supporting specific genome groups (gram-positive versus gram-negative bacteria, fungi, plants), data types (EST collections, raw contigs without any predicted Open Reading Frames (ORFs), protein-only datasets, etc.) and bioinformatics methods (e.g. alternative gene prediction techniques). Advanced protein and DNA viewers implemented using server-side Java provide graphical representation of protein annotation features as well as genetic elements on chromosomes.

NEW FEATURES AND IMPROVEMENTS

Genome import pipeline

Given the quick pace of genome sequencing, keeping track of currently available data and obtaining them from source databases for local processing is a time-consuming and technically challenging task. To organize a more efficient import of genomes into PEDANT from various sources, we set up a specialized processing pipeline (Figure 1). In the first step, we acquire a list of available genomes from each genome resource. We then look up the Entrez genome project ID by using the Entrez Programming Utilities (eUtils, http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) to query the NCBI databases (5) for genome project information. If available, we use the genome project ID as the primary key for a given genome; otherwise the NCBI taxonomy ID is used. The advantage of genome project IDs is that they are stable, in contrast to taxonomy IDs, which may change, especially for the species/strains of newly sequenced genomes. The genome IDs are then stored in our local meta-database, which also serves as the basis for generating the full genome list for the PEDANT web page.
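
The key selection rule described above can be sketched as follows. This is a minimal illustration, not the PEDANT implementation: the function names are hypothetical, and the `bioproject` database name is an assumption, since the genome project database queried in 2008 has since been superseded at NCBI. The code only builds the ESearch URL and does not contact any server.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_project_query(taxonomy_id: int, db: str = "bioproject") -> str:
    """Build an ESearch URL that looks for project records of one taxon.

    The default database name is an assumption made for this sketch.
    """
    params = {"db": db, "term": f"txid{taxonomy_id}[Organism]"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def choose_primary_key(project_id: Optional[int], taxonomy_id: int) -> Tuple[str, int]:
    """Prefer the stable genome project ID; fall back to the taxonomy ID."""
    if project_id is not None:
        return ("genome_project", project_id)
    return ("taxonomy", taxonomy_id)


if __name__ == "__main__":
    print(build_project_query(562))
    print(choose_primary_key(None, 562))
```

Returning the key type together with the value keeps the two ID namespaces apart in the meta-database, so a project ID can never be mistaken for a taxonomy ID with the same number.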
Figure 1.

UML activity model of the PEDANT genome import and processing pipeline. Symbols according to the UML 2.0 specification (http://www.uml.org) for activity diagrams.

Data retrieval procedures have been adapted to several different sources of genome information. For downloading RefSeq genomes, we use a patched version (retry on connection timeouts, improved error handling) of the NCBI ToolBox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox) program. For Ensembl genomes, we install the provided MySQL database dumps (ftp://ftp.ensembl.org/pub/current_mysql) on our local MySQL server and extract the genomic data directly. Retrieval of genomes not contained in RefSeq or Ensembl can only be done in a semi-automatic fashion with manual verification. In many cases, RefSeq lists the sequencing centers involved, from which the original data can be obtained. Another useful resource for locating genomes is the Genomes OnLine Database (GOLD) (6). We then retrieve the assembly and annotation data directly from the sequencing centers and check them for missing sequences, nonunique identifiers and unusual formatting. If the gene annotation data are missing or only available as a draft version (especially for fungal genomes), gene predictions are carried out or existing models are improved, depending on the annotation project (7,8).
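
The three checks named above (missing sequences, nonunique identifiers, unusual formatting) could look roughly like this for FASTA input. This is a sketch under stated assumptions rather than the checks actually used in PEDANT: the function names are hypothetical, the input is assumed to be nucleotide FASTA, and "unusual formatting" is reduced here to characters outside the IUPAC nucleotide alphabet.

```python
import re
from collections import Counter

# Accepted characters for a nucleotide record (IUPAC codes); an assumption
# made for this sketch.
VALID_SEQUENCE = re.compile(r"^[ACGTUNRYKMSWBDHV]*$", re.IGNORECASE)


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def check_records(text):
    """Return a list of human-readable problems found in the FASTA text."""
    records = list(parse_fasta(text))
    counts = Counter(identifier for identifier, _ in records)
    problems = []
    for identifier, count in counts.items():
        if count > 1:
            problems.append(f"nonunique identifier: {identifier}")
    for identifier, sequence in records:
        if not sequence:
            problems.append(f"missing sequence: {identifier}")
        elif not VALID_SEQUENCE.match(sequence):
            problems.append(f"unexpected characters: {identifier}")
    return problems
```

An empty result means that none of the three problem classes was detected; it does not replace the manual verification mentioned in the text.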

Integration of PEDANT and SIMAP

Calculating and updating protein similarities and domain assignments is the most time-consuming and computationally expensive task in our genome annotation pipeline. Previously, BLASTP (9) and InterProScan (10) searches required up to 80% of the total CPU time of the PEDANT genome annotation workflow. To cope with the large number of newly sequenced genomes and to keep the data in PEDANT up-to-date, a radical reduction of this huge computational effort has become necessary. The most obvious answer to this problem is to utilize high-performance computing facilities and avoid redundant calculations. The similarity matrix of proteins (SIMAP) (11) provides precalculated and up-to-date all-against-all alignments as well as domain assignments for essentially all publicly available protein sequences (21 million as of this writing). Our recent efforts to integrate PEDANT with SIMAP have made it possible to avoid computationally intensive BLASTP and InterProScan runs and have led to a dramatic acceleration of the genome annotation work. Compared with de novo calculations, retrieving similarities and domains from the SIMAP database reduces the required CPU time by factors between 5 and 60. A typical bacterial genome with 3000 predicted genes can be processed at MIPS in <40 min using 60 Sun Grid Engine (SGE, http://gridengine.sunsource.net) nodes. To generate and obtain these data, we have developed a computational workflow that coordinates the tasks between PEDANT and SIMAP. The first step in this workflow involves the import and maintenance of genome sequences and primary annotation provided by the respective source databases in PEDANT. In a subsequent step, SIMAP automatically retrieves protein and sequence data from PEDANT. If novel protein sequences previously unknown to SIMAP have been imported, their similarities to all other protein sequences and their domain architecture are calculated in SIMAP by utilizing large public resource computing facilities (12).
As soon as the precalculated data are completely available in SIMAP, a notification event is triggered to start the SIMAP-based methods in PEDANT. These methods have been implemented as remote Enterprise Java Bean (EJB) invocations, which allow for rapid and efficient retrieval of data from SIMAP. One method designed to replace BLASTP retrieves homologs from a composite nonredundant database that includes PDB, UniProt/Swissprot, UniProt/TrEMBL, as well as all protein sequences already present in PEDANT. The second method which serves as a substitute for InterProScan retrieves precalculated protein domain assignments considering all InterPro member databases according to the InterPro XML format specification, except for the TMHMM (13), SignalP (14) and TargetP (15) methods which are run by PEDANT itself considering the appropriate genomic context (i.e. gram stain for signal peptides).
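
The central idea of this workflow, reusing stored results for sequences that are already known and computing only the novel ones, can be illustrated with a small lookup sketch. Everything here is hypothetical: the function names, the in-memory dictionary standing in for the SIMAP store, and the use of an MD5 checksum of the sequence as the lookup key are assumptions made for illustration, not a description of the SIMAP interface.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Return a checksum for a protein sequence, ignoring case and whitespace."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def split_known_and_novel(proteins, precalculated):
    """Separate proteins with stored results from those still to be computed.

    proteins:      mapping of protein ID -> amino acid sequence
    precalculated: mapping of sequence key -> stored results (hypothetical
                   stand-in for a precalculated similarity store)
    """
    known, novel = {}, []
    for protein_id, sequence in proteins.items():
        key = sequence_key(sequence)
        if key in precalculated:
            known[protein_id] = precalculated[key]
        else:
            novel.append(protein_id)
    return known, novel


if __name__ == "__main__":
    store = {sequence_key("MKV"): ["hitA"]}
    known, novel = split_known_and_novel({"p1": "mkv", "p2": "MAA"}, store)
    print(known, novel)
```

Keying on the sequence rather than on the protein identifier means that identical proteins from different genomes share one stored result, which is what makes the precalculation reusable across genomes.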

Web Services

The comprehensive collection of nearly 3000 extensively annotated genomes provides a unique foundation for data mining and large-scale investigation of genome properties. While information on a limited number of genes of interest can be conveniently explored using the PEDANT web interface, any computational analysis of genomes at large necessitates local access to data. However, the large amount of annotation data computed for 4.5 million PEDANT proteins makes systematic dissemination of database dumps or flat files impractical (although we do provide them upon request). Instead, we offer simple, transparent and programming language-independent remote access based on Web Service technology. This service has been implemented as a document-style, SOAP-based Web Service (see http://www.w3.org/TR/soap12-part0). It can be easily integrated into users' own applications, since libraries for accessing this kind of service exist for most programming languages. The functions provided by the Web Service are described in a Web Services Description Language (WSDL, see http://www.w3.org/TR/wsdl) file, which allows for automatic generation of a client program, e.g. by using the Perl SOAP::Lite (http://www.soaplite.com) or the Java Axis (http://ws.apache.org/axis/java/index.html) libraries. The PEDANT3 WSDL file can be found at http://mips.gsf.de/webservice/pedant3/Pedant3AccessBeanService/Pedant3AccessWebService?wsdl.

At present the service provides the following query types: (i) return the list of organisms processed in PEDANT; (ii) return the computational methods used to annotate a particular organism; (iii) return a result overview (e.g. which functional category appears how many times) for a certain method in a certain organism; (iv) return the genetic elements of an organism; (v) return the result of a certain method for a single genetic element or for a whole genome ordered by its genetic elements. For the latter query type it is possible to search in both directions: the service can return all genetic elements having a certain property (e.g. a certain functional attribute), or all properties of a certain genetic element (e.g. all functional attributes of a protein). Furthermore, in the former case it is possible to query several genomes at once. For BLASTP- and SIMAP-based methods, it is possible to restrict the results by an E-value cutoff. A detailed overview of the Web Service functionality can be found at http://pedant.gsf.de/webservices.jsp.

The PEDANT3 Web Service encapsulates the complicated internal data structures of the PEDANT database and returns the results in a generic format that consists of key-value pairs of properties assigned to a given genetic element. This generic format ensures that the end-user client software will not have to be reprogrammed if new methods are introduced into the PEDANT system.
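
To illustrate why a generic key-value format keeps client code stable, the sketch below models results as lists of (key, value) pairs per genetic element and implements the two search directions and the E-value restriction on that model. It is a local, hypothetical stand-in: the data layout, the function names and the `evalue` field are assumptions for illustration and are not taken from the PEDANT3 WSDL.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for service results: each genetic element maps to a
# list of (key, value) property pairs.
Annotation = Dict[str, List[Tuple[str, str]]]


def elements_with_property(data: Annotation, key: str, value: str) -> List[str]:
    """Direction 1: all genetic elements carrying a given property."""
    return sorted(element for element, props in data.items() if (key, value) in props)


def properties_of(data: Annotation, element: str) -> List[Tuple[str, str]]:
    """Direction 2: all properties of one genetic element."""
    return list(data.get(element, []))


def filter_by_evalue(hits: List[dict], cutoff: float) -> List[dict]:
    """Keep similarity hits whose E-value does not exceed the cutoff."""
    return [hit for hit in hits if float(hit["evalue"]) <= cutoff]


if __name__ == "__main__":
    data = {
        "gene1": [("funcat", "01.01")],
        "gene2": [("funcat", "01.01"), ("funcat", "02")],
    }
    print(elements_with_property(data, "funcat", "01.01"))
    print(properties_of(data, "gene2"))
```

Because the client only ever handles opaque keys and values, a new annotation method shows up as a new key and requires no change to these functions.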

DISCUSSION

There is no fixed release cycle for PEDANT. As soon as new genomes become available at RefSeq or any other listed genome resource, they are imported, processed and made available via the web server. However, since SIMAP has a monthly release cycle, the computation of a genome by PEDANT is typically finished roughly 1 month after its import. Since the PEDANT3 software is now stable and all genomes from the previous version, PEDANT2, have been either migrated or reimported into PEDANT3, we took PEDANT2 and its Web Service offline. We also discarded all incomplete genomes previously available via PEDANT2 because the new high-throughput technologies now allow genome sequencing projects to be finished within a very short time frame. In the future, genomes from further resources [e.g. UCSC Genome Browser Database (16), Vega (17)] will be imported and previously imported genomes will be kept up-to-date. We are also in the process of supplementing the PEDANT web site with multiple new features, including viewing of genome project information [RefSeq status, source sequencing centers, whole-genome shotgun (WGS) (18) sequencing coverage, number of records, etc.], taxonomic selection of genomes and improved search capabilities. A cross-genome index for precomputed annotations is nearly finished and will be available online shortly. This will allow for comparison of genomes based on their annotated features, such as domain content, functional categories and structural folds.

FUNDING

Funding for open access charge: Helmholtz Gemeinschaft. Conflict of interest statement. None declared.
  16 in total

1.  Improved prediction of signal peptides: SignalP 3.0.

Authors:  Jannick Dyrløv Bendtsen; Henrik Nielsen; Gunnar von Heijne; Søren Brunak
Journal:  J Mol Biol       Date:  2004-07-16       Impact factor: 5.469

2.  The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes.

Authors:  Andreas Ruepp; Alfred Zollner; Dieter Maier; Kaj Albermann; Jean Hani; Martin Mokrejs; Igor Tetko; Ulrich Güldener; Gertrud Mannhaupt; Martin Münsterkötter; H Werner Mewes
Journal:  Nucleic Acids Res       Date:  2004-10-14       Impact factor: 16.971

3.  A strategy of DNA sequencing employing computer programs.

Authors:  R Staden
Journal:  Nucleic Acids Res       Date:  1979-06-11       Impact factor: 16.971

4.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.

Authors:  O Emanuelsson; H Nielsen; S Brunak; G von Heijne
Journal:  J Mol Biol       Date:  2000-07-21       Impact factor: 5.469

5.  SIMAP--structuring the network of protein similarities.

Authors:  Thomas Rattei; Patrick Tischler; Roland Arnold; Franz Hamberger; Jörg Krebs; Jan Krumsiek; Benedikt Wachinger; Volker Stümpflen; Werner Mewes
Journal:  Nucleic Acids Res       Date:  2007-11-23       Impact factor: 16.971

6.  Ensembl 2007.

Authors:  T J P Hubbard; B L Aken; K Beal; B Ballester; M Caccamo; Y Chen; L Clarke; G Coates; F Cunningham; T Cutts; T Down; S C Dyer; S Fitzgerald; J Fernandez-Banet; S Graf; S Haider; M Hammond; J Herrero; R Holland; K Howe; K Howe; N Johnson; A Kahari; D Keefe; F Kokocinski; E Kulesha; D Lawson; I Longden; C Melsopp; K Megy; P Meidl; B Ouverdin; A Parker; A Prlic; S Rice; D Rios; M Schuster; I Sealy; J Severin; G Slater; D Smedley; G Spudich; S Trevanion; A Vilella; J Vogel; S White; M Wood; T Cox; V Curwen; R Durbin; X M Fernandez-Suarez; P Flicek; A Kasprzyk; G Proctor; S Searle; J Smith; A Ureta-Vidal; E Birney
Journal:  Nucleic Acids Res       Date:  2006-12-05       Impact factor: 16.971

7.  The UCSC Genome Browser Database: 2008 update.

Authors:  D Karolchik; R M Kuhn; R Baertsch; G P Barber; H Clawson; M Diekhans; B Giardine; R A Harte; A S Hinrichs; F Hsu; K M Kober; W Miller; J S Pedersen; A Pohl; B J Raney; B Rhead; K R Rosenbloom; K E Smith; M Stanke; A Thakkapallayil; H Trumbower; T Wang; A S Zweig; D Haussler; W J Kent
Journal:  Nucleic Acids Res       Date:  2007-12-17       Impact factor: 16.971

8.  Database resources of the National Center for Biotechnology Information.

Authors:  David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael Dicuccio; Ron Edgar; Scott Federhen; Michael Feolo; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; Oleg Khovayko; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; Vadim Miller; James Ostell; Kim D Pruitt; Gregory D Schuler; Martin Shumway; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Roman L Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal:  Nucleic Acids Res       Date:  2007-11-27       Impact factor: 16.971

9.  The vertebrate genome annotation (Vega) database.

Authors:  L G Wilming; J G R Gilbert; K Howe; S Trevanion; T Hubbard; J L Harrow
Journal:  Nucleic Acids Res       Date:  2007-11-14       Impact factor: 16.971

10.  The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata.

Authors:  Konstantinos Liolios; Konstantinos Mavromatis; Nektarios Tavernarakis; Nikos C Kyrpides
Journal:  Nucleic Acids Res       Date:  2007-11-02       Impact factor: 16.971
