
Enteropathogen Resource Integration Center (ERIC): bioinformatics support for research on biodefense-relevant enterobacteria.

Jeremy D Glasner¹, Guy Plunkett, Bradley D Anderson, David J Baumler, Bryan S Biehl, Valerie Burland, Eric L Cabot, Aaron E Darling, Bob Mau, Eric C Neeno-Eckwall, David Pot, Yu Qiu, Anna I Rissman, Sara Worzella, Sam Zaremba, Joel Fedorko, Tom Hampton, Paul Liss, Michael Rusch, Matthew Shaker, Lorie Shaull, Panna Shetty, Silpa Thotakura, Jon Whitmore, Frederick R Blattner, John M Greene, Nicole T Perna.

Abstract

ERIC, the Enteropathogen Resource Integration Center (www.ericbrc.org), is a new web portal serving as a rich source of information about enterobacteria on the NIAID established list of Select Agents related to biodefense: diarrheagenic Escherichia coli, Shigella spp., Salmonella spp., Yersinia enterocolitica and Yersinia pestis. More than 30 genomes have been completely sequenced, many more exist in draft form and additional projects are underway. These organisms are increasingly the focus of studies using high-throughput experimental technologies and computational approaches. This wealth of data provides unprecedented opportunities for understanding the workings of basic biological systems and discovery of novel targets for development of vaccines, diagnostics and therapeutics. ERIC brings information together from disparate sources and supports data comparison across different organisms, analysis of varying data types and visualization of analyses in human and computer-readable formats.

Year: 2007 PMID: 17999997 PMCID: PMC2238966 DOI: 10.1093/nar/gkm973

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The family Enterobacteriaceae includes a variety of pathogens that pose significant threats to human health directly, and indirectly through agricultural crops and livestock. The Enteropathogen Resource Integration Center (ERIC, www.ericbrc.org) is one of the eight Bioinformatics Resource Centers (BRC) for Biodefense and Emerging/Re-Emerging Infectious Diseases (http://www.brc-central.org/). Funded by the National Institute of Allergy and Infectious Diseases (NIAID), ERIC serves as an information resource for enterobacteria on the NIAID established list of select agents related to biodefense: diarrheagenic Escherichia coli, Shigella spp., Salmonella spp., Yersinia enterocolitica and Yersinia pestis. ERIC seeks to support basic research on pathogenesis and development of novel vaccines, therapeutics and diagnostics for these organisms by: adding value to genome data through manual and automated curation, with particular focus on biological subsystems relevant to pathogenicity; integrating diverse sources of data, ranging from publications on individual genes to large-scale proteomics data sets; developing tools for analyzing and visualizing these data; and offering training and specialized analyses to the research community.

THE ERIC–BRC PORTAL OFFERS INTEGRATED ACCESS TO ALL TOOLS AND ANALYSES

The ERIC–BRC is a web portal that provides a single point of access to information about the focus organisms. The portal, implemented with the JBoss Application Server 4.05GA and JBoss Portal Server 2.4.1, offers a standardized method of accessing the diverse resources integrated into the system. In addition to the specific resources described in the sections below, the portal provides general information about pathogenic enterobacteria, summaries of the genome database contents, and links to other relevant databases, such as the Immune Epitope Database (1), a curated set of epitopes for the Category A–C select agents. New functionalities and data sets are added to existing sections of the portal when appropriate or incorporated into new portlets within the main ERIC portal. This architecture permits rapid deployment of new components and customizable display of contents.

ERIC–ASAP GENOME ANNOTATIONS

ERIC provides access to continuously updated genome annotations for all ERIC pathogens, as well as information from a variety of other enterobacteria useful for reference and comparison, including E. coli K-12 (Table 1). ERIC uses the ASAP genome annotation database system (2), backed by an Oracle 10g database, for genome annotation and curation. ERIC–ASAP permits database updates continuously, obviating the need for the periodic database releases that are a common feature of many genome databases. There are three general types of user accounts available for genome annotation purposes. Administrator accounts permit users the full range of capabilities, including the ability to create new genome projects in the system. Curator accounts give users the ability to update ERIC annotations using sophisticated web-based interfaces for manual annotation and curation of information, as well as tools for uploads of large sets of annotation data. Annotator accounts provide users with interfaces for manual annotation of individual annotation records. The annotation interfaces are all web-based and can be accessed by any member of the research community who requests an account. The availability of three different types of user accounts is designed to meet the needs of different types of annotators and to encourage training in the annotation tools that can update large numbers of annotation records at a time. Genomes in ERIC can be either ‘public’ or ‘private’ projects, with users assigned to any of the three types of user accounts. All ‘public’ genome sequence data and annotations, including any newly added information, are accessible without an account.

Table 1.

Genomes (all publicly available complete or draft sequences) contained in ERIC-ASAP as of August 2007

Organism	Complete	Draft	Total
Diarrheagenic Escherichia coli	5	7	12
Shigella spp.	8	2	10
Salmonella spp.	6	0	6
Yersinia enterocolitica	1	0	1
Yersinia pestis	6	10	16
Other related genomes	13	8	21
Total	39	27	66

Our goal is to provide genome annotations that are accurate, detailed, up-to-date and consistent across genomes. Descriptions of the standard operating procedures (SOPs) used by the ERIC curators are available for download from the portal (http://www.ericbrc.org/portal/eric/aboutasap). Every annotation record includes a description of the evidence supporting the data, and this is the primary way we assess the quality of the annotation information and measure improvements over time. Explanations of the evidence codes and how they are used can be found in the SOP describing gene annotation (http://www.ericbrc.org/portal/eric/sopCdsAnnotation). ERIC–ASAP is open for contribution by the research community to encourage annotation by domain experts. An additional layer of quality control is provided by a ‘curation status’ tag for each annotation that indicates whether the information has been independently approved by one of a select group of trusted users and dedicated curators. Sequences and annotations in ERIC can be downloaded in a variety of formats including GenBank flatfile format and GFF3. Files downloaded directly from ERIC reflect continuous updates by the dedicated curatorial staff as well as community-contributed annotations. Snapshots of sequences annotated de novo by ERIC are also deposited in GenBank. Examples include the genome of Y. pestis strain CA88-4125 (GenBank accession number ABCD00000000) and plasmid pMAR7 from enteropathogenic Escherichia coli (3). ERIC is working toward an efficient mechanism for updates of existing GenBank and/or RefSeq records regardless of historical constraints. However, users should be aware that while ERIC provides support for documenting evidence for each individual line of annotation, this is not currently supported by NCBI.
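As an illustration of working with a GFF3 download, the following minimal Python sketch parses a single feature line into its nine standard columns and decodes the attribute column. The example line, the source name and the attribute keys (ID, product) are hypothetical; the attributes actually present in ERIC exports are not described here and would need to be checked against a real file.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    # Column 9 holds semicolon-separated key=value pairs with URL-style escapes.
    attributes = {}
    for pair in attr_text.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


# Hypothetical feature line, for illustration only.
example = "contig1\texample\tCDS\t190\t255\t.\t+\t0\tID=cds0001;product=leader peptide"
record = parse_gff3_line(example)
```

A real workflow would loop over the lines of the downloaded file and skip the None results returned for header and comment lines.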

ENTEROFAMS: PROTEIN FAMILIES FOR ENTEROBACTERIA

The first version of the EnteroFams is a collection of 1579 protein families. Each family is represented by a profile Hidden Markov Model (HMM), similar to Pfam (4) or TIGRfam (5) protein families. EnteroFams differ from these other databases of protein families in that they contain only full-length alignments of proteins from enterobacterial species. The current collection of EnteroFams consists of proteins that are nearly ubiquitous in enterobacteria. Each HMM was constructed from an alignment of putative orthologous proteins from eight genomes (E. coli MG1655, E. coli EDL933, Salmonella enterica Typhimurium LT2, S. enterica Typhi CT18, Y. pestis CO92, Yersinia pseudotuberculosis 32953, Erwinia chrysanthemi 3937 and Erwinia carotovora atroseptica SCRI1043) and used to scan 11 additional genomes for new members. The threshold for inclusion in a family was defined as the lowest score obtained for a protein from one of the eight seed genomes. The alignment of the seed proteins, the complete alignment of all members and the annotations for each family were manually curated. All members have a link to the associated EnteroFam page that contains alignments, cutoff thresholds and annotations for each family. EnteroFam HMMs will be made available from each EnteroFam page and for bulk download through the ERIC portal. Annotations were selected to be appropriate for all species so that they can be applied to all members of the family, enabling the propagation of high-quality annotations across features in related enterobacteria.
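The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are invented, the protein identifiers are hypothetical, and the sketch assumes a candidate is admitted when its HMM score is greater than or equal to the lowest seed score.

```python
def inclusion_threshold(seed_scores):
    """Return the lowest HMM score observed among the seed proteins."""
    if not seed_scores:
        raise ValueError("at least one seed score is required")
    return min(seed_scores.values())


def new_members(seed_scores, candidate_scores):
    """Return candidates scoring at or above the family threshold (assumed rule)."""
    threshold = inclusion_threshold(seed_scores)
    return sorted(
        name for name, score in candidate_scores.items() if score >= threshold
    )


# Hypothetical scores for one family; real values would come from an HMM search.
seed_scores = {"MG1655": 412.5, "EDL933": 410.0, "LT2": 388.2}
candidate_scores = {"genomeA_p1": 395.0, "genomeB_p7": 120.3, "genomeC_p2": 388.2}
members = new_members(seed_scores, candidate_scores)
```

Because the threshold is derived per family from its own seeds, a divergent family admits lower-scoring members than a tightly conserved one.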

ANNOTATION PROPAGATION: WHEN TO CUT, COPY AND PASTE?

The quality and quantity of annotation data vary within and between genomes. To reduce these inconsistencies, we would like to replace poor annotations with better information. We have developed an ‘annotation propagation’ tool within ERIC–ASAP to facilitate the comparison, evaluation and replacement of annotations across related genome features. This tool compares the text of annotation data between source and destination features as well as the evidence supporting the annotations. If the source feature annotation is supported by better evidence, the existing annotations for the destination feature are replaced with the annotation from the source feature. The user of the annotation propagation tool chooses the source and destination features and assigns the relative values to different categories of supporting evidence. Using this tool, high-quality annotations from well-curated genomes or protein families can be rapidly applied to other genomes while at the same time preserving any well-supported manual annotations that may already exist. The database retains a record of all annotations, regardless of their approval status, so no information is lost and earlier annotations can be reapplied as necessary. Propagation of inaccurate or erroneous annotations has the potential to do great harm to the quality of genome annotations. Care must be taken to ensure that only high-quality annotations are propagated across appropriate genome features. For example, propagation of annotations to members of EnteroFam families across genomes required that the annotation of the EnteroFam family be ‘curated’, and that membership in the family was approved by a curator. If a genome already contained an annotation with better supporting evidence, such as a gene product description with an experimental evidence code linked to a publication, the existing annotation was preserved. New annotations added by the propagation procedure all contain an indication that they were added by an automated process and have a link to the SOP describing the procedure.
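The replacement rule can be illustrated with a small Python sketch. It is not the ERIC–ASAP implementation: the Annotation class, the evidence category names and the numeric ranks are all hypothetical, standing in for the relative values a user assigns to evidence categories. The sketch only models the evidence comparison, not the comparison of annotation text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    text: str
    evidence: str            # hypothetical evidence category name
    automated: bool = False  # True when added by the propagation step


def propagate(source, destination, evidence_rank):
    """Replace the destination annotation only if the source has better evidence."""
    if (destination is None
            or evidence_rank[source.evidence] > evidence_rank[destination.evidence]):
        # Mark the copy so it is recognizable as the product of an automated process.
        return Annotation(source.text, source.evidence, automated=True)
    return destination


# User-assigned relative values for evidence categories (illustrative only).
rank = {"experimental": 3, "curated_family": 2, "automated_similarity": 1}

family = Annotation("leader peptide", "curated_family")
weak = Annotation("hypothetical protein", "automated_similarity")
strong = Annotation("attenuator leader peptide", "experimental")

replaced = propagate(family, weak, rank)    # weaker evidence is overwritten
kept = propagate(family, strong, rank)      # experimental evidence is preserved
```

Returning a new record rather than mutating the old one mirrors the idea that earlier annotations stay on record and can be reapplied.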

INSERTION SEQUENCES: ANNOTATING JUMPING GENES

Insertion Sequence (IS) element activity is a significant source of variation between genomes. We have annotated the boundaries of 3412 intact IS elements and 758 IS fragments in a core set of 20 complete genomes based on known IS element sequences collected in the ISfinder Database (6). We have not annotated the IS elements as thoroughly in draft genome sequences since IS elements frequently occur at contig boundaries and are often misassembled in draft sequences. All IS annotations include the identity of the IS element and a link to the related entry in the ISfinder database. IS feature names are assigned to distinguish between multiple copies of the same IS within one genome. The IS annotations can be viewed in ERIC's genome viewers (described below) to examine differences in IS content between genomes.

HIGH-THROUGHPUT EXPERIMENTAL DATA SETS

The ERIC–ASAP database stores and displays results of high-throughput gene expression experiments and proteomics data. For example, ERIC–ASAP contains newly discovered proteomic information about NIAID's Category A–C biodefense organisms obtained through the NIAID-funded Proteomics Research Centers (PRCs). Information about the presence and absence of S. enterica Typhimurium proteins under different growth conditions (7) can be obtained from each gene annotation page or downloaded in bulk from ERIC–ASAP. Links are provided from each page to more detailed information about the mass-spectrometry data available at the Administrative Resource Center (http://www.proteomicsresource.org/).

ERIC COMPARATIVE GENOMICS TOOLS

Comparative genomics is a powerful way to identify genes conserved among subsets of related pathogens as well as sequences that differentiate strains and species. ERIC has several components that facilitate genome comparisons and classification of relationships between genes within and across genomes.

Multiple genome alignments

Multiple genome alignments are available for each species of enterobacteria represented in ERIC–BRC (Figure 1). These alignments were constructed using a newly released progressive alignment tool, Mauve 2.0 (8,9), that dramatically improves alignment in regions conserved among subsets of genomes, a particularly important feature for recognition of genomic islands. This new version has significantly improved visualization and navigational tools and provides a powerful mechanism for comparative genomics of bacterial genomes.

Figure 1.

Two views of the same region of the Mauve 2.0 alignment of six E. coli genomes. The visualization on the left uses the default color scheme based on homologous segments. Each color represents a collinear block that contains regions of homologous sequence. Importantly, islands unique to a single genome or collinear islands common to a subset of genomes are indicated. The visualization of the same aligned region shown on the right is colorized by multiplicity. Here, pink blocks indicate that the region is conserved across all six genomes. Other colors mark regions found in a subset of genomes.

Genome alignments are currently available for: six complete Escherichia genomes; ten complete Escherichia and Shigella genomes; five complete Salmonella genomes; and seven complete Yersinia genomes. Each alignment includes all available complete published genome sequences as of April 2007, with links directly from the graphical gene display to the annotations in ERIC. As new genomes become available, these alignments will be updated. Older versions will be archived and remain available.

Curated ortholog sets

The ERIC–ASAP database maintains curated sets of proteins predicted to be orthologous. Initial sets of orthologs are constructed by analysis of pairwise BLAST searches between genomes. As described in more detail in the ERIC Ortholog SOP, this process is limited to cases where there is a single unambiguous best reciprocal match, and filtered according to empirically selected comparison-specific thresholds for percent identity and proportion of the proteins aligned. This results in a conservative set of predicted orthologs. These sets are augmented and confirmed by manual review and additional processes such as confirmation of co-linearity in genome alignments.
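The initial, automated step of this procedure can be sketched in Python as a reciprocal best hit filter. The hit records, field names and threshold values below are hypothetical and chosen for illustration; the actual comparison-specific thresholds are defined in the ERIC Ortholog SOP and are not reproduced here. Queries whose top two hits tie on score are treated as ambiguous and dropped, which is one possible reading of "single unambiguous best match".

```python
def best_hits(hits):
    """Map each query to its single best hit by bit score; skip tied top hits."""
    by_query = {}
    for hit in hits:
        by_query.setdefault(hit["query"], []).append(hit)
    best = {}
    for query, query_hits in by_query.items():
        ranked = sorted(query_hits, key=lambda h: h["bitscore"], reverse=True)
        if len(ranked) > 1 and ranked[0]["bitscore"] == ranked[1]["bitscore"]:
            continue  # ambiguous best match, so no ortholog call is made
        best[query] = ranked[0]
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a, min_identity, min_coverage):
    """Return (a, b) pairs that are mutual best hits and pass both thresholds."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for a, forward in best_ab.items():
        b = forward["subject"]
        backward = best_ba.get(b)
        if backward is None or backward["subject"] != a:
            continue
        if all(h["identity"] >= min_identity and h["coverage"] >= min_coverage
               for h in (forward, backward)):
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical pairwise search results between genome A and genome B.
a_vs_b = [
    {"query": "a1", "subject": "b1", "bitscore": 500, "identity": 95.0, "coverage": 0.98},
    {"query": "a1", "subject": "b2", "bitscore": 120, "identity": 40.0, "coverage": 0.50},
    {"query": "a2", "subject": "b2", "bitscore": 300, "identity": 60.0, "coverage": 0.90},
]
b_vs_a = [
    {"query": "b1", "subject": "a1", "bitscore": 498, "identity": 95.0, "coverage": 0.97},
    {"query": "b2", "subject": "a2", "bitscore": 305, "identity": 60.0, "coverage": 0.90},
]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a, min_identity=70.0, min_coverage=0.8)
```

Requiring both directions to pass the thresholds keeps the set conservative; the manual review and co-linearity checks described above would then operate on this output.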

Generic Genome Browser (GBrowse)

ERIC provides GBrowse (10) for querying and viewing genomic data and linking to annotations within the database. Users can search for genes across genomes and zoom in or out on genome maps. Several tracks of annotation data can be selected for display. We plan to expand the visualizations available through GBrowse to include high-throughput data sets as well as large-scale bioinformatics predictions such as transcriptional units and regulatory protein-binding sites.

MICROARRAY ANALYSIS SYSTEM

The mAdb microarray database and analysis system (11) is a core component of ERIC. This is a web-based system that supports sharing of data between groups and includes a microarray storage database and a variety of built-in analysis tools. mAdb can import Affymetrix data as well as spotted arrays, and can use the quantitation and composite image files from a number of microarray scanners. The central concept for mAdb is that of creating filtered, reusable data sets for analyzing microarray data. Once the raw data is processed and placed in ERIC's relational database, a user can filter the data for quality, using a variety of filters for spot size and signal/background ratio, excluding those spots marked as ‘Bad’ or ‘Not Found’ by the scanner software, as well as a number of other quantitative metrics. Normalization can be done either on the raw data or only on those spots which pass the spot quality criteria set by the user. This creates a parent-filtered data set, which can be further filtered in other ways, such as by expression ratios, by genes (rows) or by arrays (columns), or used directly in the analysis tools. Each data set maintains a history associated with it, so users can see how it was derived. The analysis tools allow hierarchical, K-means and self-organizing map-based clustering of the data by a number of metrics and linkage methods, as well as other related visualization techniques such as scatter plotting, Principal Components Analysis, and Multidimensional Scaling. Graphics can be exported for publication and protocols for MIAME-formatted data can be stored. Microarray data is linked to corresponding annotated features in ERIC genomes to provide a way to access up-to-date annotation records while viewing data in mAdb. There are security controls for access to data in mAdb. All users must register for an account. With an account, users have access to publicly available projects and their own private workspace, which is available only to themselves and to other users to whom they grant access. Users can collaborate on experiments and analyses in their secure workspace and, if they desire, make data available for analysis by all users.
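The idea of a filtered data set that carries its own history can be sketched in Python as follows. This is an illustration only and does not reflect mAdb's internal data model: the spot fields, the threshold values and the dictionary layout are hypothetical.

```python
EXCLUDED_FLAGS = {"Bad", "Not Found"}


def passes_quality(spot, min_size, min_signal_to_background):
    """Apply the spot-level quality criteria to one spot record."""
    if spot["flag"] in EXCLUDED_FLAGS:
        return False
    if spot["size"] < min_size:
        return False
    if spot["background"] <= 0:
        return False  # ratio is undefined, so treat the spot as unusable
    return spot["signal"] / spot["background"] >= min_signal_to_background


def filtered_dataset(spots, min_size, min_signal_to_background, history=()):
    """Return the passing spots together with a record of how they were derived."""
    kept = [s for s in spots if passes_quality(s, min_size, min_signal_to_background)]
    step = (f"quality filter: size>={min_size}, "
            f"signal/background>={min_signal_to_background}")
    return {"spots": kept, "history": list(history) + [step]}


# Hypothetical spot measurements.
spots = [
    {"id": "s1", "size": 50, "signal": 900.0, "background": 100.0, "flag": ""},
    {"id": "s2", "size": 50, "signal": 150.0, "background": 100.0, "flag": ""},
    {"id": "s3", "size": 50, "signal": 900.0, "background": 100.0, "flag": "Bad"},
    {"id": "s4", "size": 10, "signal": 900.0, "background": 100.0, "flag": ""},
]
parent = filtered_dataset(spots, min_size=25, min_signal_to_background=2.0)
```

Passing parent["history"] into a later filtering call would extend the same list, so any derived data set can report the full chain of steps that produced it.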

FUTURE DIRECTIONS

Genome sequencing is ongoing at several institutions for a number of additional strains and isolates of pathogenic enterobacteria, and ERIC will continue to incorporate new sequence data as it becomes available. We plan to continue our efforts of careful manual annotation for these organisms to provide high-quality information that is supported by direct experimentation. The number of genome sequences already available for these pathogens and the large number of new sequences anticipated suggest that manual inspection of every annotation is an impossible task. For this reason, we will focus annotation efforts on a few reference genomes for each group of pathogens as well as continue to carefully annotate protein families from the EnteroFams. The annotation propagation tool will be applied judiciously to distribute these carefully curated annotations to other genomes. Use of consistent vocabulary to describe biological entities and functions is critical for comparison of annotations within and between genomes. The Gene Ontology (GO) Consortium is a group dedicated to creating and applying a structured and controlled vocabulary for describing gene products, their functions and locations (12,13). The current annotations of genomes in ERIC make limited use of GO, and we plan to expand this in the future. The use of high-throughput experiments to characterize bacterial genes, proteins and metabolites is increasing, and ERIC will continue to integrate these types of data and provide tools for analysis and visualization. The ERIC portal is continually under development to improve data content and usability. The goal of integration of information within ERIC is to provide researchers with a simple-to-use, richly populated database of accurate genome annotation and associated data that will aid in creation of novel diagnostics, therapeutics and vaccines to mitigate the threats posed by pathogenic enterobacteria.

12 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. The Gene Ontology (GO) database and informatics resource.

Authors: M A Harris; J Clark; A Ireland; J Lomax; M Ashburner; R Foulger; K Eilbeck; S Lewis; B Marshall; C Mungall; J Richter; G M Rubin; J A Blake; C Bult; M Dolan; H Drabkin; J T Eppig; D P Hill; L Ni; M Ringwald; R Balakrishnan; J M Cherry; K R Christie; M C Costanzo; S S Dwight; S Engel; D G Fisk; J E Hirschman; E L Hong; R S Nash; A Sethuraman; C L Theesfeld; D Botstein; K Dolinski; B Feierbach; T Berardini; S Mundodi; S Y Rhee; R Apweiler; D Barrell; E Camon; E Dimmer; V Lee; R Chisholm; P Gaudet; W Kibbe; R Kishore; E M Schwarz; P Sternberg; M Gwinn; L Hannick; J Wortman; M Berriman; V Wood; N de la Cruz; P Tonellato; P Jaiswal; T Seigfried; R White
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

4. Mauve: multiple alignment of conserved genomic sequence with rearrangements.

Authors: Aaron C E Darling; Bob Mau; Frederick R Blattner; Nicole T Perna
Journal: Genome Res Date: 2004-07 Impact factor: 9.043

5. Analysis of the Salmonella typhimurium proteome through environmental response toward infectious conditions.

Authors: Joshua N Adkins; Heather M Mottaz; Angela D Norbeck; Jean K Gustin; Joanne Rue; Therese R W Clauss; Samuel O Purvine; Karin D Rodland; Fred Heffron; Richard D Smith
Journal: Mol Cell Proteomics Date: 2006-05-08 Impact factor: 5.911

6. The NCI/CIT microArray database (mAdb) system - bioinformatics for the management and analysis of Affymetrix and spotted gene expression microarrays.

Authors: J M Greene; E Asaki; X Bian; C Bock; S Castillo; G Chandramouli; R Martell; K Meyer; T Ruppert; S Sundaram; J Tomlin; L Yang; J Powell
Journal: AMIA Annu Symp Proc Date: 2003

7. ISfinder: the reference centre for bacterial insertion sequences.

Authors: P Siguier; J Perochon; L Lestrade; J Mahillon; M Chandler
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. ASAP: a resource for annotating, curating, comparing, and disseminating genomic data.

Authors: Jeremy D Glasner; Michael Rusch; Paul Liss; Guy Plunkett; Eric L Cabot; Aaron Darling; Bradley D Anderson; Paul Infield-Harm; Michael C Gilson; Nicole T Perna
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. Pfam: clans, web tools and services.

Authors: Robert D Finn; Jaina Mistry; Benjamin Schuster-Böckler; Sam Griffiths-Jones; Volker Hollich; Timo Lassmann; Simon Moxon; Mhairi Marshall; Ajay Khanna; Richard Durbin; Sean R Eddy; Erik L L Sonnhammer; Alex Bateman
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes.

Authors: Jeremy D Selengut; Daniel H Haft; Tanja Davidsen; Anurhada Ganapathy; Michelle Gwinn-Giglio; William C Nelson; Alexander R Richter; Owen White
Journal: Nucleic Acids Res Date: 2006-12-06 Impact factor: 16.971
