Luke E Ulrich, Igor B Zhulin.
Abstract
The MiST2 database (http://mistdb.com) identifies and catalogs the repertoire of signal transduction proteins in microbial genomes. Signal transduction systems regulate the majority of cellular activities, including the metabolism, development, host recognition, biofilm production, virulence, and antibiotic resistance of human pathogens. Thus, knowledge of the proteins and interactions that comprise these communication networks is essential to furthering biomedical discovery. Signal transduction proteins are identified by searching protein sequences for specific domain profiles that implicate a protein in signal transduction. Compared to the previous version of the database, MiST2 contains a host of new features and improvements, including the following: draft genomes; extracytoplasmic function (ECF) sigma factor protein identification; enhanced classification of signaling proteins; novel, high-quality domain models for identifying histidine kinases and response regulators; neighboring two-component genes; gene cart; better search capabilities; enhanced taxonomy browser; advanced genome browser; and a modern, biologist-friendly web interface. MiST2 currently contains 966 complete and 157 draft bacterial and archaeal genomes, which collectively contain more than 245 000 signal transduction proteins. The majority (66%) of these are one-component systems, followed by two-component proteins (26%), chemotaxis proteins (6%), and finally ECF factors (2%).
Year: 2009 PMID: 19900966 PMCID: PMC2808908 DOI: 10.1093/nar/gkp940
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Semi-automatic algorithm for defining high-quality domain models. (A) Bona fide domain members which have had their structure solved are subjected to iterative PSI-BLAST searches (30) against the UniRef90 (31) database with a stringent E-value threshold. The resulting sequences are then clustered, aligned and edited (CAT, part B) to form the set of core homologs. Remote homologs are identified by the same procedure with a much relaxed threshold and then removing hits that do not match a secondary structure type associated with at least one core homolog. The resulting remote homologs are combined with the core homologs and then subjected to the CAT process to produce the final domain model(s). (B) The CAT sub-algorithm is a divide-and-conquer method for addressing the extreme sequence divergence present in signal transduction families. Markov Clustering Linkage (32) simulates a random walk through all-versus-all BLAST results and produces clusters of related members. After aligning and editing each individual subgroup, they are further combined into one or more final curated alignments.
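The clustering step in part B can be illustrated with a toy version of Markov clustering. The sketch below is a generic textbook formulation (alternating expansion and inflation of a column-stochastic matrix), not the authors' pipeline. The similarity matrix, the self-loop weight, the inflation value, the iteration count and the threshold are all assumptions chosen for illustration; in practice the weights would be derived from all-versus-all BLAST scores.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(similarity, inflation=2.0, iterations=50, threshold=1e-6):
    """Toy Markov clustering on a symmetric similarity matrix (list of lists)."""
    n = len(similarity)
    # Self-loops (weight 1.0, an arbitrary choice here) keep the walk aperiodic.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: take two steps of the random walk (matrix squared).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize each column,
        # which strengthens strong flows and weakens weak ones.
        m = normalize_columns([[x ** inflation for x in row] for row in m])

    # Nodes that still share flow above the threshold belong to one cluster;
    # a small union-find structure groups them.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Hypothetical similarities: two tight triangles joined by one weak hit.
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_cluster(similarity)
print(clusters)
```

A higher inflation value tends to produce more, smaller clusters, which is one way such a divide-and-conquer step could cope with highly divergent families.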
Domain-based rules for classifying signal transduction proteins
| Rule | Classification |
|---|---|
| Chemotaxis domains | |
| HK_CA:Che (Agfam) | Chemotaxis, CheA |
| CheW and transmitter domain | Chemotaxis, CheA |
| CheW and receiver domain | Chemotaxis, CheV |
| CheW | Chemotaxis, CheW |
| CheB_methylest | Chemotaxis, CheB |
| CheR or CheR_N | Chemotaxis, CheR |
| CheD | Chemotaxis, CheD |
| CheZ | Chemotaxis, CheZ |
| CheC and not SpoA | Chemotaxis, CheCX |
| MCPsignal | Chemotaxis, MCP |
| * | Chemotaxis, Other |
| Transmitter or receiver domain | |
| HATPase_c signaling domain + receiver | |
| HATPase_c before receiver | Two-component, HHK |
| N-terminal receiver | Two-component, HRR |
| * | Two-component, other |
| HATPase_c signaling domain | Two-component, HK |
| Receiver domain | Two-component, RR |
| * | Two-component, other |
| Output domain | One-component |
| * | Other |
| ECF domain | ECF |
The rule system is hierarchical and each rule is processed sequentially. Proteins are classified according to the first matching rule. Asterisks (*) match all proteins. The complete list of Pfam and Agfam signaling domains is provided in Supplementary Table 1. Marker domains are signaling domains that implicate a protein as participating in signal transduction. Unless otherwise indicated, all domains are from Pfam (19). HHK – hybrid histidine kinase, HRR – hybrid response regulator, HK – histidine kinase, RR – response regulator.
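The first-match behavior described above can be sketched as an ordered list of predicates. This is a simplified illustration, not the MiST2 implementation: a protein is assumed to be a list of domain names ordered from N- to C-terminus, the labels `transmitter`, `receiver`, `output` and `ECF` are hypothetical stand-ins for the underlying Pfam and Agfam models, the "Chemotaxis, Other" catch-all is omitted, and the position of the ECF rule relative to the final fallback is an assumption.

```python
def hatpase_before_receiver(domains):
    """True if a receiver domain occurs C-terminal to the first HATPase_c."""
    if "HATPase_c" not in domains:
        return False
    first_kinase = domains.index("HATPase_c")
    return "receiver" in domains[first_kinase + 1:]


# Ordered (predicate, classification) pairs; order encodes rule priority.
RULES = [
    (lambda d: "HK_CA:Che" in d, "Chemotaxis, CheA"),
    (lambda d: "CheW" in d and "transmitter" in d, "Chemotaxis, CheA"),
    (lambda d: "CheW" in d and "receiver" in d, "Chemotaxis, CheV"),
    (lambda d: "CheW" in d, "Chemotaxis, CheW"),
    (lambda d: "CheB_methylest" in d, "Chemotaxis, CheB"),
    (lambda d: "CheR" in d or "CheR_N" in d, "Chemotaxis, CheR"),
    (lambda d: "CheD" in d, "Chemotaxis, CheD"),
    (lambda d: "CheZ" in d, "Chemotaxis, CheZ"),
    (lambda d: "CheC" in d and "SpoA" not in d, "Chemotaxis, CheCX"),
    (lambda d: "MCPsignal" in d, "Chemotaxis, MCP"),
    (hatpase_before_receiver, "Two-component, HHK"),
    (lambda d: "HATPase_c" in d and d[0] == "receiver", "Two-component, HRR"),
    (lambda d: "HATPase_c" in d and "receiver" in d, "Two-component, other"),
    (lambda d: "HATPase_c" in d, "Two-component, HK"),
    (lambda d: "receiver" in d, "Two-component, RR"),
    (lambda d: "output" in d, "One-component"),
    (lambda d: "ECF" in d, "ECF"),
]


def classify(domains):
    """Return the classification of the first rule that matches."""
    for matches, label in RULES:
        if matches(domains):
            return label
    return "Other"  # plays the role of the final '*' rule


print(classify(["CheW", "receiver"]))       # Chemotaxis, CheV
print(classify(["HATPase_c", "receiver"]))  # Two-component, HHK
print(classify(["receiver", "output"]))     # Two-component, RR
```

Because evaluation stops at the first match, a protein with both a receiver and an output domain is reported as a response regulator rather than a one-component system, which mirrors the ordering of the table.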
Figure 2. Screenshots of the MiST2 website. (A) E. coli genome summary page. Below the header and navigational links there are three sections: genome and organism metadata with a hyperlinked graphical image of the genome’s signal transduction profile; fully linked tables displaying the genomic distribution of one-component, two-component, chemotaxis and ECF signaling proteins by replicon; and lastly a table containing the counts of neighboring two-component proteins. (B) E. coli CheA protein page. The RefSeq annotation and database cross-references for the currently viewed protein and corresponding gene are displayed at the top. This is followed by an interactive visualization of the protein’s domain architecture. The genome neighborhood section contains an AJAX-driven, dynamic representation of the genomic context surrounding the currently viewed protein. In the neighboring DNA section, it is possible to retrieve upstream or downstream DNA sequence data. Hyperlinked cross-references to external databases appear at the bottom of the page.
Distribution of signal transduction proteins within complete and draft genomes belonging to Archaeal and Bacterial phyla
| | Genomes | One-component | Two-component: HK | Two-component: HHK | Two-component: RR | Two-component: HRR | Chemotaxis | ECF |
|---|---|---|---|---|---|---|---|---|
| Archaea | | | | | | | | |
| Complete | 67 | 3265 | 546 | 8 | 304 | 142 | 453 | – |
| Draft | 2 | 77 | – | – | 1 | – | 10 | – |
| Bacteria | | | | | | | | |
| Complete | 899 | 135 396 | 20 862 | 4717 | 26 962 | 923 | 13 549 | 5332 |
| Draft | 155 | 22 217 | 3791 | 364 | 3981 | 61 | 1418 | 784 |
| Total | 1123 | 160 955 | 25 199 | 5089 | 31 248 | 1126 | 15 430 | 6116 |
HK, histidine kinase; HHK, hybrid histidine kinase; RR, response regulator; HRR, hybrid response regulator.
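As a quick consistency check, the percentages quoted in the abstract can be recomputed from the Total row of this table. The short sketch below only does that arithmetic; the grouping of HK, HHK, RR and HRR into a single two-component category follows the abstract, and the variable names are illustrative.

```python
# Totals copied from the "Total" row of the table above.
totals = {
    "One-component": 160955,
    "Two-component": 25199 + 5089 + 31248 + 1126,  # HK + HHK + RR + HRR
    "Chemotaxis": 15430,
    "ECF": 6116,
}

grand_total = sum(totals.values())

# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * count / grand_total)
          for name, count in totals.items()}

print(grand_total)
for name, percent in shares.items():
    print(f"{name}: {percent}%")
```

The grand total is 245 163 proteins, consistent with the "more than 245 000" stated in the abstract, and the rounded shares come out to 66%, 26%, 6% and 2%.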