Margrethe H Serres, Monica Riley.
Abstract
BACKGROUND: Escherichia coli, a model organism, provides information for the annotation of other genomes. Our analysis of its genome has shown that proteins encoded by fused genes need special attention. Such composite (multimodular) proteins consist of two or more components (modules) encoding distinct functions. Multimodular proteins have been found to complicate both annotation and the generation of sequence-similar groups. Previous work overstated the number of multimodular proteins in E. coli. This work corrects the identification of modules by including sequence information from proteins in 50 sequenced microbial genomes.
Year: 2005 PMID: 15757509 PMCID: PMC555942 DOI: 10.1186/1471-2164-6-33
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Identification and sequence similarity of multimodular E. coli proteins. (a) An E. coli protein (gi1787250) aligns with two smaller proteins from C. acetobutylicum, histidinol phosphatase (gi15026114) and imidazoleglycerol-phosphate dehydratase (gi15023840). The E. coli protein represents a fused or multimodular protein encoding the two functions in separate parts of the protein, as indicated by the two non-overlapping alignment regions. Based on the alignment regions, the E. coli protein is separated into two components (modules). The modules are identified with the extensions "_1" or "_2" to indicate their location in the gene product as N-terminal or C-terminal, respectively. (b) Sequence similarity between modules of the multimodular proteins is shown. No detectable similarity between the joined modules is indicated by a difference in the module patterns in the cartoon. Similarity is measured by Darwin and indicates that the proteins align at a distance of ≤ 200 PAM units over at least 83 amino acid residues or >45% of the length of the proteins. This level of similarity also reflects whether the modules belong to the same paralogous group.
Examples of multimodular E. coli proteins.
| Gene | Module | Start | End | Type1 | Gene product |
| | b0002_1 | 1 | 461 | e | aspartokinase I, threonine sensitive |
| | b0002_2 | 464 | 820 | e | homoserine dehydrogenase I, threonine sensitive |
| | b0414_1 | 1 | 143 | e | diaminohydroxyphosphoribosylaminopyrimidine deaminase |
| | b0414_2 | 147 | 366 | e | 5-amino-6-(5-phosphoribosylamino) uracil reductase |
| | b1014_1 | 1 | 569 | e | bifunctional: transcriptional repressor (N-terminal); proline dehydrogenase, FAD-binding (C-terminal) |
| | b1014_2 | 618 | 1320 | e | pyrroline-5-carboxylate dehydrogenase |
| | b1241_1 | 1 | 400 | e | acetaldehyde-CoA dehydrogenase |
| | b1241_2 | 449 | 891 | e | iron-dependent alcohol dehydrogenase |
| | b0067_1 | 1 | 274 | t | thiamin transport protein (ABC superfamily, membrane) |
| | b0067_2 | 285 | 536 | t | thiamin transport protein (ABC superfamily, membrane) |
| | b0448_1 | 1 | 310 | pt | putative transport protein, multidrug resistance-like (ABC superfamily, membrane) |
| | b0448_2 | 314 | 590 | pt | putative transport protein, multidrug resistance-like (ABC superfamily, ATP_bind) |
| | b0760_1 | 1 | 260 | t | molybdenum transport protein (ABC superfamily, ATP_bind) |
| | b0760_2 | 261 | 490 | t | molybdenum transport protein (ABC superfamily, ATP_bind) |
| | b0731_1 | 1 | 178 | t | PTS family enzyme IIA, induction of ompC |
| | b0731_2 | 186 | 454 | t | PTS family enzyme IIB, induction of ompC |
| | b0731_3 | 456 | 628 | t | PTS family enzyme IIC, induction of ompC |
| | b2220_1 | 1 | 125 | r | response regulator |
| | b2220_2 | 145 | 461 | r | sigma54 interaction module of response regulator (EBP family) |
| evgS | b2370_1 | 1 | 935 | e | histidine kinase of hybrid sensory kinase |
| evgS | b2370_2 | 953 | 1197 | r | response regulator of hybrid sensory histidine kinase |
| | b3868_1 | 1 | 120 | r | response regulator, two-component regulator with GlnL, nitrogen regulation |
| | b3868_2 | 139 | 469 | r | sigma54 interaction module of response regulator (EBP family) |
| | b0465_1 | 1 | 779 | o | unknown function module of mechanosensitive channel |
| | b0465_2 | 780 | 1120 | t | mechanosensitive channel (MscS family) |
| | b2818_1 | 1 | 293 | o | acetylglutamate kinase homolog (inactive) |
| | b2818_2 | 298 | 442 | e | N-alpha-acetylglutamate synthase (amino acid acetyltransferase) |
| | b1439_1 | 1 | 117 | pr | putative transcriptional regulator (GntR family) |
| | b1439_2 | 118 | 468 | pe | putative amino transferase |
| | b1629_1 | 1 | 448 | pc | Fe-S binding module of electron transport protein |
| | b1629_2 | 450 | 740 | o | unknown function module of electron transport protein |
1 Gene product type: e, enzyme; pe, putative enzyme; r, regulatory protein; pr, putative regulatory protein; t, transport protein; pt, putative transport protein; pc, putative carrier protein; o, unknown function.
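The module coordinates in the table above can be used to cut a multimodular protein into its modules before any similarity search. The following is a minimal sketch, assuming the start and end columns are 1-based, inclusive residue positions; the `Module` type, the helper names, and the placeholder sequence are illustrative and not part of the original analysis.

```python
from typing import NamedTuple


class Module(NamedTuple):
    name: str
    start: int  # assumed 1-based, inclusive residue position
    end: int    # assumed 1-based, inclusive residue position


def split_modules(sequence: str, modules: list[Module]) -> dict[str, str]:
    """Return the subsequence of each module, keyed by module name."""
    return {m.name: sequence[m.start - 1:m.end] for m in modules}


def linker_lengths(modules: list[Module]) -> list[int]:
    """Number of residues left unassigned between consecutive modules."""
    ordered = sorted(modules, key=lambda m: m.start)
    return [nxt.start - cur.end - 1 for cur, nxt in zip(ordered, ordered[1:])]


# Coordinates taken from the table; the sequence is a placeholder of the
# right length, not the real protein sequence.
b0002 = [Module("b0002_1", 1, 461), Module("b0002_2", 464, 820)]
b0731 = [
    Module("b0731_1", 1, 178),
    Module("b0731_2", 186, 454),
    Module("b0731_3", 456, 628),
]
placeholder = "A" * 820

parts = split_modules(placeholder, b0002)
print({name: len(seq) for name, seq in parts.items()})
print(linker_lengths(b0002), linker_lengths(b0731))
```

Under these assumptions the two b0002 modules are 461 and 357 residues long, with 2 unassigned residues between them.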
Figure 2. Size distribution for multimodular and single module proteins. The protein lengths in amino acid residues are shown for single module proteins (□) and for multimodular proteins (■). On average the multimodular proteins are longer than the unimodular proteins, 637 amino acids versus 314 amino acids. The length of a protein alone does not imply multimodularity, and long single module proteins are seen.
Features of multimodular E. coli proteins:
| | No. Modules |
| 109 multimodular proteins | 226 |
| 101 bimodular proteins | 202 |
| 8 trimodular proteins | 24 |
| with identity to unfused orthologs | 203 |
| without identity to unfused orthologs | 23 |
| known function | 151 |
| putative function | 66 |
| unknown function | 9 |
| type of protein1: | |
| enzyme | 97 |
| transport protein | 85 |
| regulatory protein | 26 |
| other | 18 |
1 includes putative assignments
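The counts in the table above are related by simple arithmetic: each bimodular protein contributes two modules, each trimodular protein three, and every breakdown should sum to the same module total. A short sketch of that consistency check, using only the numbers printed in the table (the variable names are illustrative):

```python
# Protein counts and modules per protein, as given in the table.
proteins = {"bimodular": 101, "trimodular": 8}
modules_per_protein = {"bimodular": 2, "trimodular": 3}

total_proteins = sum(proteins.values())
total_modules = sum(proteins[k] * modules_per_protein[k] for k in proteins)

# Each breakdown of the modules should add up to the same total.
breakdowns = {
    "identity to unfused orthologs": [203, 23],
    "function status": [151, 66, 9],
    "type of protein": [97, 85, 26, 18],
}
consistent = {
    name: sum(counts) == total_modules for name, counts in breakdowns.items()
}

print(total_proteins, total_modules, consistent)
```

With the values as printed, 101 × 2 + 8 × 3 gives 226 modules from 109 proteins, and all three breakdowns agree with that total.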
Types of multimodular proteins.
| Gene type1 | Proteins2 |
| Enzyme | Aas, AdhE, AegA, ArgA, ArnA, CysG, Dfp, DgoA, DsbD, FadB, FadJ, FtsY, GlcE, GlmU, GlnE, Gsp, HisB, HisI, HldE, HmpA, MaeB, MetL, MrcA, MrcB3, NifJ3, PaaZ, PbpC, PheA, PolA, PurH, PutA, RbbA3, RibD, Rne3, ThrA, TrpC, TrpD, TyrA, YdiF, YfiQ, YgfN, YgfT, YjiR |
| Transport protein | AlsA, AraG, CydC, CydD, DhaH, Ego, FeoB, FhuB, FruA, FruB, FrvB, HrsA3, KefA, MacB, MalK, MalX, ManX, MdlA, MdlB, MglA, ModF, MsbA, MtlA, NagE3, PtsA, PtsG, PtsP, RbsA, ThiP, Uup, XylG, YbhF, YbiT, YddA, YejF, YheS, YjjK, YliA, YnjC, YojI, YpdD3, YphE |
| Regulatory protein | Ada, Aer, ArcB, AtoC, BarA, BglF, CheA, CheB, EvgS, GlnG, KdpD, MalT, RcsC, TorS, YdcR, YfhA, YieN, ZraR |
| Other | InfB, MukB3, RnfC, YegH, YfcK, YoaE |
1 Gene type includes known and putative functions.
2 Protein names derived from gene names.
3 Genes encoding three modules.
Size distribution of paralogous groups.
| Group size | No. groups |
| 2 | 279 |
| 3 | 91 |
| 4 | 32 |
| 5 | 31 |
| 6 | 6 |
| 7 | 18 |
| 8 | 7 |
| 9 | 2 |
| 10 | 2 |
| 11 | 3 |
| 12 | 1 |
| 13 | 2 |
| 14 | 2 |
| 18 | 2 |
| 20 | 1 |
| 21 | 1 |
| 22 | 2 |
| 24 | 1 |
| 30 | 2 |
| 40 | 1 |
| 43 | 1 |
| 46 | 1 |
| 51 | 1 |
| 92 | 1 |
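The size distribution above can be summarized with a few sums. The sketch below assumes the first column is the group size and the second the number of groups of that size; the 2 to 4 versus 5 or more split follows the small and large group classes used in the Figure 4 caption.

```python
# Group size -> number of groups, copied from the table above.
size_to_groups = {
    2: 279, 3: 91, 4: 32, 5: 31, 6: 6, 7: 18, 8: 7, 9: 2, 10: 2, 11: 3,
    12: 1, 13: 2, 14: 2, 18: 2, 20: 1, 21: 1, 22: 2, 24: 1, 30: 2, 40: 1,
    43: 1, 46: 1, 51: 1, 92: 1,
}

n_groups = sum(size_to_groups.values())
n_members = sum(size * count for size, count in size_to_groups.items())

# Small groups have 2-4 members, large groups 5 or more.
small_groups = sum(c for size, c in size_to_groups.items() if size <= 4)
large_groups = n_groups - small_groups

print(n_groups, n_members, small_groups, large_groups)
```

With the counts as printed, this gives 490 groups containing 1946 members, of which 402 groups are small and 88 are large.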
Sequence relationships within paralogous groups.
| 3 | 92 | 56 | 36 |
| 4 | 32 | 21 | 11 |
| 5 | 31 | 7 | 24 |
| 6 | 6 | 0 | 6 |
| 7 | 18 | 2 | 16 |
Paralogous enzyme groups in E. coli.
| Group size | Function |
| 20 | oxidoreductase, Fe-S-binding |
| 18 | oxidoreductase, NAD(P)-binding |
| 18 | oxidoreductase1, NAD(P)-binding |
| 13 | aldehyde oxidoreductase, NAD(P)-binding |
| 13 | oxidoreductase, FAD/NAD(P)-binding |
| 11 | sugar kinase |
| 10 | terminal oxidoreductase, subunit |
| 9 | aldo-keto oxidoreductase, NAD(P)-binding |
| 8 | phosphatase |
| 8 | nucleoside diphosphate (Nudix) hydrolase |
| 8 | acyl-CoA ligase |
| 7 | glutathione S-transferase |
| 7 | RNA helicase, ATP-binding |
| 7 | sugar epimerase/dehydratase, NAD(P)-binding |
| 7 | alcohol oxidoreductase |
| 7 | acyltransferase |
| 7 | aminotransferase, PLP-binding |
| 7 | decarboxylase, TPP-binding |
| 7 | crotonase |
| 7 | acyltransferase |
1 Contains GroES-like structural domain (SCOP sf50129).
Paralogous transport protein groups in E. coli.
| Group size | Function |
| 92 | ABC superfamily transport protein, ATP-binding component |
| 51 | ABC superfamily transport protein, membrane component |
| 40 | MFS family transport protein |
| 24 | ABC superfamily transport protein, periplasmic binding component / transcriptional regulator (GalR/LacI family) |
| 22 | APC family transport protein |
| 12 | ABC superfamily transport protein, membrane component |
| 11 | PTS family transport protein, enzyme IIA |
| 9 | ABC superfamily transport protein, periplasmic binding component |
| 8 | ABC superfamily transport protein, periplasmic binding component |
| 7 | GntP family transport protein |
| 7 | RND family transport protein |
| 7 | ABC superfamily transport protein, membrane component |
| 5 | HAAP family transport protein |
| 5 | PTS family transport protein, enzyme IIB |
| 5 | PTS family transport protein, enzyme I |
| 5 | GPH family transport protein |
| 5 | NCS2 family transport protein |
| 5 | HAAP family transport protein |
| 5 | transport protein |
| 5 | PTS family enzyme IIC |
| 5 | RhtB family transport protein |
| 5 | outer membrane porin |
Paralogous regulatory protein groups in E. coli.
| Group size | Function |
| 46 | LuxR/UhpA or OmpR family transcriptional response regulator of two-component regulatory system |
| 43 | LysR family transcriptional regulator |
| 30 | GntR or DeoR family transcriptional regulator |
| 22 | sensory histidine kinase in two-component regulatory system |
| 14 | sigma54 activator protein, enhancer binding protein |
| 14 | AraC/XylS family transcriptional regulator |
| 7 | ROK family transcriptional regulator/sugar kinase |
| 7 | IclR family transcriptional regulator |
| 5 | methyl-accepting chemotaxis protein |
| 5 | MerR family transcriptional regulator |
| 4 | DNA-binding regulatory protein |
| 3 | AraC/XylS family transcriptional regulator |
| 3 | MarR family transcriptional regulator |
| 3 | AsnC family transcriptional regulator |
Cross genome comparisons of enzyme groups.
| E. coli1 | S. typhimurium2 | B. subtilis3 | Function |
| 20 | 18 | 1 | oxidoreductase, Fe-S-binding |
| 18 | 14 | 31 | oxidoreductase, NAD(P)-binding |
| 18 | 13 | 10 | oxidoreductase4, NAD(P)-binding |
| 13 | 13 | 11 | aldehyde dehydrogenase, NAD(P)-binding |
| 13 | 11 | 13 | oxidoreductase, FAD/NAD(P)-binding |
| 11 | 16 | 6 | sugar kinase |
| 10 | 13 | 5 | respiratory reductase, alpha subunit |
| 9 | 8 | 8 | aldo-keto reductase, NAD(P)-binding |
| 8 | 7 | 5 | phosphatase |
| 8 | 8 | 2 | nucleoside diphosphate (Nudix) hydrolase |
1 No. proteins in Escherichia coli paralogous group.
2 No. sequence matches for E. coli paralogous group in Salmonella typhimurium LT2.
3 No. sequence matches for E. coli paralogous group in Bacillus subtilis.
4 Contains GroES-like structural domain (SCOP sf50129).
Figure 3. Annotation and composition of multimodular proteins. (a) Annotation is complicated by multimodular proteins. An E. coli protein (gi1786183) contains two modules, an N-terminal aspartokinase and a C-terminal homoserine dehydrogenase. Two single module proteins from L. lactis and B. halodurans (gi12723655 and gi10174117) align to the N-terminal aspartokinase module of the E. coli protein. Based on the sequence alignments, both of these proteins should be annotated as aspartokinases. However, errors are seen in the annotation of the L. lactis and B. halodurans proteins. These errors stem from transferring functions between multimodular proteins and partially aligned sequences without taking the alignment regions into account. (b) Different combinations of modules are seen in multimodular proteins of different organisms. While aspartokinase is fused to homoserine dehydrogenase in E. coli, it is fused to DAP decarboxylase in X. fastidiosa. In both organisms the fusions are between enzymes of metabolic pathways, threonine biosynthesis in E. coli and lysine biosynthesis in X. fastidiosa.
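The annotation problem described in panel (a) can be avoided by checking which module an alignment actually covers before transferring a function. The sketch below is one possible way to do this, not the authors' procedure: it uses the b0002 module coordinates from the table of examples, while the alignment coordinates and the 50% coverage threshold (`min_fraction`) are hypothetical values chosen for illustration.

```python
def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of residues shared by two 1-based, inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def functions_to_transfer(aln_start, aln_end, modules, min_fraction=0.5):
    """Return (module, function) pairs whose module is sufficiently covered.

    modules: list of (name, start, end, function) tuples.
    min_fraction: assumed coverage threshold, not taken from the paper.
    """
    covered = []
    for name, start, end, function in modules:
        module_length = end - start + 1
        if overlap(aln_start, aln_end, start, end) / module_length >= min_fraction:
            covered.append((name, function))
    return covered


# Module coordinates from the table of examples.
b0002_modules = [
    ("b0002_1", 1, 461, "aspartokinase I, threonine sensitive"),
    ("b0002_2", 464, 820, "homoserine dehydrogenase I, threonine sensitive"),
]

# Hypothetical alignment of a single module protein to residues 10-450.
hits = functions_to_transfer(10, 450, b0002_modules)
print(hits)
```

In this example only the N-terminal module is covered, so only the aspartokinase function would be transferred to the aligned protein.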
Figure 4. Sequence similarity of E. coli paralogous protein groups versus the group size. Protein sequences were aligned by the AllAllDb program of Darwin. Multimodular proteins were separated into modules (independent functional units) prior to the Darwin analysis. Alignments with distances of ≤ 200 PAM units over at least 83 amino acids, in which >45% of the length of both proteins in the pair was aligned, were used to generate protein groups. The average PAM distances for the protein pairs in the smaller groups having 2–4 members (▲) and in the larger groups of ≥ 5 members (△) are shown. The smaller groups are more abundant and show a wide range of similarities. The larger groups appear to be more divergent, with higher average PAM values clustering around PAM 150.
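The pairwise criteria in this caption can be written as a filter, followed by a grouping step. The sketch below makes two assumptions that the caption does not confirm: a single aligned length is used for both proteins, and groups are built by transitive (single-linkage) clustering of passing pairs. The `Alignment` type and the example values are hypothetical and are not output of Darwin.

```python
from dataclasses import dataclass

MAX_PAM = 200        # maximum PAM distance
MIN_ALIGNED = 83     # minimum aligned residues
MIN_FRACTION = 0.45  # aligned fraction required for both proteins


@dataclass(frozen=True)
class Alignment:
    a: str
    b: str
    pam: float
    aligned_len: int  # simplification: one aligned length for both proteins
    len_a: int
    len_b: int


def passes(aln: Alignment) -> bool:
    """Apply the PAM, length, and coverage thresholds to one pair."""
    return (
        aln.pam <= MAX_PAM
        and aln.aligned_len >= MIN_ALIGNED
        and aln.aligned_len > MIN_FRACTION * aln.len_a
        and aln.aligned_len > MIN_FRACTION * aln.len_b
    )


def group(alignments: list[Alignment]) -> list[list[str]]:
    """Assumed grouping rule: connected components of passing pairs."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for aln in alignments:
        if passes(aln):
            parent[find(aln.a)] = find(aln.b)

    groups: dict[str, set[str]] = {}
    for member in list(parent):
        groups.setdefault(find(member), set()).add(member)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


pairs = [
    Alignment("m1", "m2", 150, 120, 200, 220),  # passes all thresholds
    Alignment("m2", "m3", 180, 100, 220, 210),  # passes all thresholds
    Alignment("m4", "m5", 250, 150, 200, 200),  # fails: PAM > 200
    Alignment("m6", "m7", 100, 90, 300, 150),   # fails: < 45% of m6 aligned
]
result = group(pairs)
print(result)
```

Here m1, m2, and m3 end up in one group even though m1 and m3 were never compared directly, which is the consequence of the assumed transitive grouping.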