Ross Overbeek, Tadhg Begley, Ralph M Butler, Jomuna V Choudhuri, Han-Yu Chuang, Matthew Cohoon, Valérie de Crécy-Lagard, Naryttza Diaz, Terry Disz, Robert Edwards, Michael Fonstein, Ed D Frank, Svetlana Gerdes, Elizabeth M Glass, Alexander Goesmann, Andrew Hanson, Dirk Iwata-Reuyl, Roy Jensen, Neema Jamshidi, Lutz Krause, Michael Kubal, Niels Larsen, Burkhard Linke, Alice C McHardy, Folker Meyer, Heiko Neuweger, Gary Olsen, Robert Olson, Andrei Osterman, Vasiliy Portnoy, Gordon D Pusch, Dmitry A Rodionov, Christian Rückert, Jason Steiner, Rick Stevens, Ines Thiele, Olga Vassieva, Yuzhen Ye, Olga Zagnitko, Veronika Vonstein.
Abstract
The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180 177 distinct proteins with 2133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms.
Year: 2005 PMID: 16214803 PMCID: PMC1251668 DOI: 10.1093/nar/gki866
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Accumulation of complete archaeal and bacterial genome sequences at NCBI 1994–2004, and prediction of the release of genomes through 2010. The data were extracted and plotted by year, as shown by the crosses. Values for 2004–2010 are projected by the power law and are represented by open circles. At the current rate of growth, the 1000th complete microbial genome will be released in late 2007 or early 2008.
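The caption states that the 2004–2010 values are projected by a power law but gives neither the fitted parameters nor the underlying counts. As a rough sketch of how such a projection could be made, the Python below fits y = a * t**b by least squares on a log-log scale and extrapolates. The functional form, the time variable (assumed here to be years since some chosen origin), and the function names `fit_power_law` and `project` are assumptions for illustration, not the authors' procedure.

```python
import math


def fit_power_law(ts, ys):
    """Fit y = a * t**b by ordinary least squares on (log t, log y).

    ts and ys must be positive. Returns the pair (a, b).
    """
    xs = [math.log(t) for t in ts]
    ls = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(ls) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    b = sxy / sxx                      # slope on the log-log scale
    a = math.exp(mean_l - b * mean_x)  # intercept mapped back from log space
    return a, b


def project(a, b, t):
    """Evaluate the fitted power law at time t."""
    return a * t ** b


# Synthetic points generated from y = 2 * t**3, used only to check that the
# fit recovers the known parameters. These are NOT the NCBI genome counts.
ts = [1, 2, 3, 4, 5]
ys = [2 * t ** 3 for t in ts]
a, b = fit_power_law(ts, ys)
```

With real yearly totals in place of the synthetic points, `project(a, b, t)` would give the extrapolated cumulative count for a later year `t`.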
Glossary
| Term | Definition |
| --- | --- |
| Annotation | An unstructured text string associated with specific genes and/or proteins. |
| Clearinghouse | A site for publish-request type peer-to-peer exchange of subsystems in a system-independent manner. |
| Functional role | An abstract function that a protein performs. Subsystem developers specify a single, precise text string to represent each functional role. |
| Functional variants | Different combinations of functional roles that represent distinct operational forms of a subsystem. |
| Missing gene | A gene that is predicted to be present in the genome of an organism but has not yet been identified. |
| Populated subsystem | A subsystem along with a spreadsheet in which each column represents a functional role for the subsystem, each row represents a specific genome, and each cell contains those genes from the specific organism that have a subsystem connection to the specific functional role. |
| Product name | A short text string used to represent the function of the protein encoded by a gene. No constraints are placed on the strings used as product names, and it is common to see the same abstract function denoted by numerous similar expressions. |
| Protein family | A collection of proteins that were grouped by a curator. Proteins may be grouped based on domain structure, similarity, or some other characteristic. Proteins within a family may implement the same or multiple functional roles. |
| Subsystem | A collection of functional roles that together implement a specific biological process or structural complex. There is no distinction between metabolic subsystems and non-metabolic subsystems. |
| Subsystem connections | The set of functional roles that tie protein-encoding genes to different subsystems. Most protein-encoding genes currently have a single subsystem connection. |
| Variant code | Numeric codes used to distinguish different functional variants. |
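The glossary entries for 'Populated subsystem' and 'Variant code' together describe a small data structure: a list of functional roles, a spreadsheet keyed by genome and role whose cells hold genes, and a numeric code per genome. One way this might be represented is sketched below in Python. The class name, its method names, and the gene ID are hypothetical and do not reflect the SEED's actual implementation or exchange format.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """Illustrative model of a populated subsystem (not the SEED's own format)."""

    name: str
    roles: list[str]  # one spreadsheet column per functional role
    # Spreadsheet: genome (row) -> functional role (column) -> gene IDs (cell)
    rows: dict[str, dict[str, list[str]]] = field(default_factory=dict)
    # Numeric variant code per genome
    variant_codes: dict[str, int] = field(default_factory=dict)

    def add_gene(self, genome: str, role: str, gene_id: str) -> None:
        """Place a gene in the cell for (genome, role)."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a functional role of {self.name}")
        self.rows.setdefault(genome, {}).setdefault(role, []).append(gene_id)

    def cell(self, genome: str, role: str) -> list[str]:
        """Genes from this genome connected to this functional role."""
        return self.rows.get(genome, {}).get(role, [])

    def empty_roles(self, genome: str) -> list[str]:
        """Roles with an empty cell for this genome.

        Whether an empty cell is a 'missing gene' depends on the variant.
        """
        return [r for r in self.roles if not self.cell(genome, r)]


hal = "Histidine ammonia-lyase (EC 4.3.1.3)"
uro = "Urocanate hydratase (EC 4.2.1.49)"
ss = PopulatedSubsystem("Histidine Degradation", [hal, uro])
ss.add_gene("Bacillus subtilis", hal, "hypothetical_gene_1")  # made-up gene ID
```

Keeping the cell contents as lists allows several genes from one organism to share a functional role, as the glossary definition permits.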
Figure 2. Subsystem and Populated Subsystem. The Histidine Degradation Subsystem is used as an example to illustrate the relevant terms. (A) The subsystem comprises 7 functional roles (e.g. Histidine ammonia-lyase (EC 4.3.1.3), Urocanate hydratase (EC 4.2.1.49), etc.). Together with the spreadsheet it becomes the ‘Populated subsystem’. (B) The Subsystem Spreadsheet is populated with genes from 8 organisms (simplified from the original subsystem), where each row represents one organism and each column one of the functional roles of the subsystem. Genes performing a given functional role in a given organism populate the corresponding cell. Gray shading of cells indicates proximity of the respective genes on the chromosome. (C) The Subsystem Diagram illustrates the populated subsystem: key intermediates (circles with roman numerals), connected by enzymes (boxes with abbreviations matching the spreadsheet abbreviations) and reactions (arrows). Three distinct variants of Histidine Degradation are presented in this populated subsystem. Variant 1 (green shading) is present in Caulobacter crescentus, Pseudomonas putida and Xanthomonas campestris. N-Formimino-L-Glutamate (IV) is converted to L-Glutamate (VI) via N-Formyl-L-Glutamate (V) by the enzymatic activities of Formiminoglutamic iminohydrolase (EC 3.5.3.13) (ForI) and N-formylglutamate deformylase (EC 3.5.1.68) (NfoD). Variant 2 (yellow shading) is present in Halobacterium sp., Deinococcus radiodurans and Bacillus subtilis. In this variant the conversion from intermediate IV to VI is performed by Formiminoglutamase (EC 3.5.3.8) (HutG). Variant 3 (blue shading) is present in Bacteroides thetaiotaomicron and Desulfotalea psychrophila. Here, Glutamate formiminotransferase (EC 2.1.2.5) (GluF) performs the conversion from intermediate IV to VI.
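The three Histidine Degradation variants in Figure 2 differ only in which roles carry out the conversion from intermediate IV to VI, so a variant code can in principle be read off the set of roles populated for a genome. The Python below is a minimal sketch of that idea using the abbreviations from the caption (ForI, NfoD, HutG, GluF). The function name, the rule that exactly one signature must match, and the placeholder abbreviations for the shared upstream roles are assumptions, not the curators' actual procedure.

```python
# Roles that distinguish each variant, taken from the Figure 2 caption.
VARIANT_SIGNATURES = {
    1: frozenset({"ForI", "NfoD"}),  # IV -> V -> VI in two steps
    2: frozenset({"HutG"}),          # IV -> VI via Formiminoglutamase
    3: frozenset({"GluF"}),          # IV -> VI via Glutamate formiminotransferase
}


def assign_variant(present_roles):
    """Return the variant code whose signature roles are all present.

    Returns None when no signature matches or when more than one does,
    leaving such genomes for manual curation (an assumed policy).
    """
    present = set(present_roles)
    matches = [code for code, sig in VARIANT_SIGNATURES.items() if sig <= present]
    return matches[0] if len(matches) == 1 else None


# "HutH" and "HutU" stand in for the shared upstream roles; they are
# placeholder abbreviations, not taken from the spreadsheet.
example = assign_variant({"HutH", "HutU", "HutG"})
```

A genome that has ForI but lacks NfoD matches no signature and yields `None`, which is the kind of case where a curator would look for a missing gene.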
Figure 3. Leucine Degradation and HMG-CoA Metabolism Subsystem. Functional roles, abbreviations, key intermediates and reactions in the pathway diagram are presented using the same conventions as in Figure 2. (A) Functional roles in the subsystem. (B) The Subsystem diagram shows the presence of genes assigned the respective functions for B.melitensis and G.metallireducens, using color-coded highlighting as explained in the panel. (C) Subsystem spreadsheet: the presence of genes with the respective functions is shown by gene names for B.subtilis or by ‘+’ for all other genomes (modified from a regular SEED display showing all gene IDs). Highlighting by a matching color indicates proximity on the chromosome. (D) Clustering on the chromosome of genes involved in the Subsystem (large yellow cluster), demonstrated by alignment of the chromosomal contigs of the respective genomes around a signature pathway gene, yngG. Homologous genes are shown by arrows with matching colors and numbers corresponding to the functional roles in panel A. B.subtilis genes are marked by gene names. Other genes (not conserved within the cluster) are colored gray.
Figure 4. CoA Biosynthesis Subsystem. Functional roles, abbreviations, key intermediates and reactions in the pathway diagram are presented using the same conventions as in Figure 2. Background colors in the diagram illustrate the comparison of subsystem variants by highlighting functional roles asserted in two organisms: E.coli (yellow) and H.sapiens (blue). Shared functional roles are highlighted green. The lower panel is a modification of the subsystem spreadsheet. It shows a classification of major subsystem variants, each representing a substantially different reaction topology revealed by semi-automated graph analysis as described in (21). Selected genomes unambiguously associated with each variant are shown after the variant description (e.g. De novo, complete/100). Patterns of functional roles that constitute each functional variant are generalized by: ‘+’, presence of a gene (for a given role) is required; ‘±’, optional; ‘?’, function is inferred by pathway analysis but a gene is unknown or ‘missing’ (i.e. cannot be located by similarity). Typical sub-variants characterized by the same topology but relying on alternative (non-orthologous) forms of specific enzymes (e.g. PANK) are illustrated by the following genomes: E.coli K12 [NCBI taxonomy ID 83333.1], D.radiodurans R1 [243230.1], S.aureus subsp. aureus N315 [158879.1], S.oneidensis MR-1 [211586.1], G.metallireducens [28232.1], Saccharomyces cerevisiae [4932.1], P.aerophilum str. IM2 [178306.1], Streptococcus pneumoniae R6 [171101.1], Thermoanaerobacter tengcongensis [119072.1], H.sapiens [9606.2], B.aphidicola str. APS (Acyrthosiphon pisum) [107806.1], Treponema pallidum subsp. pallidum str. Nichols [243276.1] and Chlamydia trachomatis D/UW-3/CX [272561.1]. Genes assigned the respective functional roles are shown by SEED unique IDs for all illustrated genomes (except E.coli, where common gene names are used). Matching background colors highlight genes that occur close to each other on the chromosome.
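The ‘+’, ‘±’ and ‘?’ symbols in Figure 4 amount to a small pattern language over functional roles. The Python below is a minimal sketch of how a genome's populated roles might be checked against such a pattern and how roles flagged ‘?’ could be listed as missing-gene candidates. The function names are hypothetical, `role_A` and `role_B` are placeholders rather than real CoA biosynthesis roles (only PANK appears in the caption), and the reading of ‘?’ as "no gene expected to be found" is an interpretation of the caption.

```python
REQUIRED, OPTIONAL, INFERRED = "+", "±", "?"


def matches_variant(pattern, present_roles):
    """Check a genome's populated roles against a variant pattern.

    pattern maps each functional role to '+', '±' or '?'. Only '+' roles
    must have a gene; '±' and '?' roles never cause a mismatch.
    """
    present = set(present_roles)
    for role, symbol in pattern.items():
        if symbol not in (REQUIRED, OPTIONAL, INFERRED):
            raise ValueError(f"unknown pattern symbol {symbol!r} for {role!r}")
        if symbol == REQUIRED and role not in present:
            return False
    return True


def missing_gene_candidates(pattern, present_roles):
    """Roles marked '?' for which no gene has been located in the genome."""
    present = set(present_roles)
    return sorted(r for r, s in pattern.items() if s == INFERRED and r not in present)


# Placeholder pattern: role_A and role_B are invented names for illustration.
pattern = {"PANK": "?", "role_A": "+", "role_B": "±"}
ok = matches_variant(pattern, {"role_A"})
candidates = missing_gene_candidates(pattern, {"role_A"})
```

Separating the match from the candidate list keeps the two questions apart: which variant a genome belongs to, and which of that variant's functions still lack an identified gene.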