LAB-Secretome: a genome-scale comparative analysis of the predicted extracellular and surface-associated proteins of Lactic Acid Bacteria.

Miaomiao Zhou, Daniel Theunissen, Michiel Wels, Roland J Siezen.

Abstract

BACKGROUND: In Lactic Acid Bacteria (LAB), the extracellular and surface-associated proteins can be involved in processes such as cell wall metabolism, degradation and uptake of nutrients, communication and binding to substrates or hosts. A genome-scale comparative study of these proteins (secretomes) can provide vast information towards the understanding of the molecular evolution, diversity, function and adaptation of LAB to their specific environmental niches.
RESULTS: We have performed an extensive prediction and comparison of the secretomes from 26 sequenced LAB genomes. A new approach to detect homolog clusters of secretome proteins (LaCOGs) was designed by integrating protein subcellular location prediction and homology clustering methods. The initial clusters were further adjusted semi-manually based on multiple sequence alignments, domain compositions, pseudogene analysis and biological function of the proteins. Ubiquitous protein families were identified, as well as species-specific, strain-specific, and niche-specific LaCOGs. Comparative analysis of protein subfamilies has shown that the distribution and functional specificity of LaCOGs could be used to explain many niche-specific phenotypes. A comprehensive and user-friendly database, LAB-Secretome, was constructed to store, visualize and update the extracellular proteins and LaCOGs (http://www.cmbi.ru.nl/lab_secretome/). This database will be updated regularly when new bacterial genomes become available.
CONCLUSIONS: The LAB-Secretome database could be used to understand the evolution and adaptation of lactic acid bacteria to their environmental niches, to improve protein functional annotation and to serve as basis for targeted experimental studies.

Year:  2010        PMID: 21092245      PMCID: PMC3017865          DOI: 10.1186/1471-2164-11-651

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Lactic Acid Bacteria (LAB) have been used for centuries as starter cultures in industrial and artisanal food and feed fermentations, and are important bacteria linked to the human gastro-intestinal (GI) tract [1-8]. Phylogenetically they form a relatively compact group of mainly Gram-positive, anaerobic, non-sporulating, acid-tolerant bacteria with a low G+C content [9-12]. The genera that comprise the LAB belong to the order Lactobacillales, and are primarily Lactobacillus, Pediococcus, Lactococcus, Streptococcus and Leuconostoc, while some peripheral genera are Enterococcus, Oenococcus, Aerococcus, and Carnobacterium. Interestingly, even within such a compact group, vastly divergent phenotypes have been reported, indicating high flexibility and adaptation of these species to their living environments [13-16]. Extracellular and surface-associated proteins play an important role in many essential interactions and adaptations of LAB to their environment [17-26]. By definition these proteins are either exposed on the bacterial cell surface (anchored to membrane, GO:0046658; intrinsic to external side of plasma membrane, GO:0031233; cell wall, GO:0005618) or released from it (extracellular milieu, GO:0005576). On a genome scale these proteins form a subset of the proteome that contains both the exoproteome [27] and part of the surface proteome [28], but excludes the integral membrane proteins (GO:0005887) and the proteins that are intrinsic to the internal side of the plasma membrane (GO:0031235). This subset of the proteome belongs to what Desvaux et al. have defined as the "secretome" [27] and is known to be involved mainly in processes such as: (1) recognition, binding, degradation and uptake of extracellular complex nutrients, (2) signal transduction, (3) communication with the environment and (4) attachment of the bacterial cell to specific sites or surfaces, e.g. to intestinal mucosa cells of the host [29-37].
Hence, genome-scale comparative analysis of these secretome (surface-associated and released from the cell) proteins may provide an understanding of the molecular function, evolution, and diversity of different LAB species and their adaptation to different environments. Here we report a comparison of the predicted secretomes of 26 sequenced genomes of LAB representing 18 different species (Table 1). The secretome clusters of orthologous protein families (LaCOGs: Lactobacillales Cluster of Ortholog Groups) were extracted by combining homology clustering methods with protein subcellular location (SCL) prediction. The comparative analysis of LaCOGs shows many niche-specific protein families that can be used as leads for future experiments.
Table 1

The predicted LAB secretomes (genomes included in the original LaCOG analysis [43] are marked by *). Columns A-G and Total (%) give secretome proteins as a percentage of the total proteins.

LAB species and strains                | Total proteins | A    | B    | C    | D    | E    | F    | G    | Total (%)
E.faecalis_V583                        | 3186           | 2.32 | 1.26 | 3.36 | 0.97 | 0.16 | 1.6  | 0.13 | 9.8
L.acidophilus_NCFM                     | 1834           | 2.24 | 0.65 | 4.09 | 0.93 | 0    | 2.45 | 0.05 | 10.41
L.gasseri_ATCC_33323*                  | 1733           | 1.85 | 0.69 | 3.92 | 0.52 | 0.12 | 0.69 | 0    | 7.79
L.johnsonii_NCC_533*                   | 1789           | 2.07 | 0.89 | 4.3  | 0.56 | 0.39 | 0.06 | 0    | 8.27
L.delbrueckii_bulgaricus_ATCC11842     | 1536           | 1.56 | 0.13 | 3.45 | 1.04 | 0.07 | 2.02 | 0    | 8.27
L.delbrueckii_bulgaricus_ATCC_BAA-365* | 1681           | 1.43 | 0.06 | 3.15 | 0.95 | 0.18 | 2.08 | 0    | 7.85
L.casei_ATCC_334*                      | 2693           | 1.63 | 0.78 | 3.79 | 0.78 | 0.15 | 1.41 | 0.07 | 8.61
L.casei_BL23                           | 2973           | 1.68 | 0.77 | 3.4  | 0.84 | 0    | 1.35 | 0.13 | 8.17
L.salivarius_UCC118                    | 1973           | 0.91 | 0.25 | 3.4  | 0.61 | 0.15 | 1.27 | 0.1  | 6.69
L.sakei_23K                            | 1845           | 1.52 | 0.33 | 3.36 | 0.76 | 0.05 | 2.06 | 0.27 | 8.35
L.plantarum_WCFS1*                     | 2981           | 1.61 | 1.11 | 3.99 | 0.91 | 0.3  | 0.1  | 0    | 8.02
L.brevis_ATCC_367                      | 2178           | 1.29 | 0.55 | 3.35 | 1.52 | 0.14 | 2.53 | 0.09 | 9.47
L.fermentum_IFO_3956                   | 1826           | 0.66 | 0.22 | 2.96 | 0.55 | 0    | 1.15 | 0.05 | 5.59
L.helveticus_DPC_4571                  | 1597           | 1.38 | 0.13 | 4.51 | 0.44 | 0    | 2.13 | 0    | 8.59
L.reuteri_F275_JGI                     | 1881           | 0.74 | 0.21 | 3.67 | 0.85 | 0    | 1.01 | 0    | 6.48
L.reuteri_F275_Kitasato                | 1803           | 0.78 | 0.28 | 3.55 | 1    | 0    | 1.22 | 0    | 6.83
L.lactis_cremoris_MG1363               | 2393           | 1.46 | 0.46 | 3.01 | 0.79 | 0    | 1.96 | 0    | 7.68
L.lactis_cremoris_SK11*                | 2459           | 1.38 | 0.41 | 3.17 | 1.02 | 0.12 | 1.67 | 0.08 | 7.85
L.lactis_lactis_IL1403*                | 2284           | 1.4  | 0.61 | 4.29 | 0.74 | 0.04 | 1.62 | 0.18 | 8.88
L.citreum_KM20                         | 1784           | 0.06 | 0.28 | 4.43 | 1.23 | 1.23 | 0    | 0.06 | 7.29
S.thermophilus_CNRZ1066*               | 1872           | 1.28 | 0.05 | 3.47 | 0.53 | 0.27 | 0.43 | 0.05 | 6.08
S.thermophilus_LMD-9*                  | 1669           | 1.5  | 0.24 | 3.89 | 0.54 | 0.18 | 0.84 | 0    | 7.19
S.thermophilus_LMG_18311               | 1854           | 1.29 | 0.11 | 3.78 | 0.54 | 0.49 | 0.65 | 0    | 6.86
L.mesenteroides_ATCC_8293*             | 1966           | 0.1  | 0.31 | 4.93 | 1.12 | 0.31 | 1.22 | 0.15 | 8.14
O.oeni_PSU-1*                          | 1664           | 0.12 | 0.06 | 4.33 | 0.9  | 1.56 | 0    | 0.06 | 7.03
P.pentosaceus_ATCC_25745*              | 1727           | 1.1  | 0.17 | 3.88 | 0.35 | 0.17 | 0.98 | 0.12 | 6.77

A: Lipid anchored; B: LPxTG Cell-wall anchored; C: N-terminally anchored (No cleavage site); D: N-terminally anchored (with cleavage site); E: Secreted via minor pathways (bacteriocin) (no cleavage site); F: Extracellular (with cleavage site); G: C-terminally anchored (with cleavage site)

The SCL prediction was made by LocateP.

The complete results of this study are stored in our open-source database LAB-Secretome (http://www.cmbi.ru.nl/lab_secretome), which has a user-friendly web interface. An automatic update scheme was constructed so that information on new bacterial genomes can be added to the database.
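As a reading aid for Table 1, the following minimal sketch checks that the per-category percentages (A-G) of a row add up to its reported total, and converts that total into an approximate protein count. The three rows are copied from Table 1; the function name and the tolerance are assumptions made for this illustration, not part of the original analysis.

```python
# Rows copied from Table 1:
# (genome, total proteins, [percentages for categories A-G], reported total %).
ROWS = [
    ("E.faecalis_V583", 3186, [2.32, 1.26, 3.36, 0.97, 0.16, 1.60, 0.13], 9.80),
    ("L.plantarum_WCFS1", 2981, [1.61, 1.11, 3.99, 0.91, 0.30, 0.10, 0.00], 8.02),
    ("O.oeni_PSU-1", 1664, [0.12, 0.06, 4.33, 0.90, 1.56, 0.00, 0.06], 7.03),
]


def check_row(name, total_proteins, category_pct, total_pct, tol=0.011):
    """Return (summed category %, approximate secretome protein count).

    `tol` is an assumed allowance for rounding in the published table.
    """
    summed = round(sum(category_pct), 2)
    if abs(summed - total_pct) > tol:
        raise ValueError(f"{name}: categories sum to {summed}, table says {total_pct}")
    approx_count = round(total_proteins * total_pct / 100)
    return summed, approx_count


for row in ROWS:
    print(row[0], check_row(*row))
```

The counts obtained this way are estimates from rounded percentages. They need not equal the secretome sizes listed in Table 2, presumably because those reflect the dataset after curation.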

Results and Discussion

Construction of the secretome protein clusters (LaCOGs)

In this study we focus on those proteins that are predicted to be wholly or largely on the outside of the cell, regardless of the translocation systems. These proteins form a sub-proteome of what Desvaux et al. defined as the "secretome" [27], obtained by excluding the translocation systems, the integral membrane proteins, and non-protein products. Although we adopt the term "secretome" to describe our protein subset of interest, we must specify that in our analysis it refers only to the proteins that are released from the cells to the extracellular milieu (also called the exoproteome) and the proteins that remain cell-surface associated, and to nothing else. Ideally, a comparative secretome analysis should be performed on experimentally validated sub-proteomes or on in silico predicted secretome proteins with the highest possible accuracy. However, it is well known that wet-lab proteomic studies are extremely costly and can yield many false assignments of subcellular location, while all the currently available in silico protein SCL predictors have only 80%-93% prediction accuracy [38-41]. Therefore, instead of clustering predicted extracellular proteins directly, we designed an alternative process that first groups all proteins in the sequenced LAB genomes into ortholog groups (LaCOGs) and afterwards extracts the secretome groups by using genome-scale SCL predictions (Figure 1). In this way, the number of wrongly predicted secretome proteins can be reduced, because homologous proteins with similar functions and domains tend to have the same SCL, and vice versa [39-42].
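The cluster-first, filter-second idea described above can be sketched as follows. This is only an illustration under stated assumptions: the location labels, the function name and the majority cutoff `min_fraction` are hypothetical, and the paper's actual selection involved manual curation rather than a fixed threshold.

```python
# Hypothetical location labels standing in for the secretome categories
# predicted by an SCL tool (the paper used LocateP; see Table 1, A-G).
SECRETOME_LABELS = {
    "lipid_anchored",
    "lpxtg_anchored",
    "n_terminally_anchored",
    "c_terminally_anchored",
    "secreted",
}


def secretome_clusters(clusters, scl, min_fraction=0.5):
    """Select ortholog groups whose members are mostly predicted secretome proteins.

    clusters:     dict mapping cluster id -> list of protein ids
    scl:          dict mapping protein id -> predicted location label
    min_fraction: assumed cutoff (not taken from the paper); a cluster is kept
                  when more than this fraction of its members is predicted
                  to be surface-associated or secreted.
    """
    selected = {}
    for cluster_id, members in clusters.items():
        if not members:
            continue
        hits = sum(1 for protein in members if scl.get(protein) in SECRETOME_LABELS)
        fraction = hits / len(members)
        if fraction > min_fraction:
            selected[cluster_id] = fraction
    return selected


# Toy data: one isolated "intracellular" call does not remove cluster_x,
# and one isolated "secreted" call does not pull in cluster_y.
clusters = {"cluster_x": ["p1", "p2", "p3", "p4"], "cluster_y": ["p5", "p6", "p7"]}
scl = {
    "p1": "secreted", "p2": "secreted", "p3": "lipid_anchored", "p4": "intracellular",
    "p5": "intracellular", "p6": "intracellular", "p7": "secreted",
}
print(secretome_clusters(clusters, scl))
```

Judging a whole ortholog group at once is what lets consistent predictions for most members outweigh an occasional mis-prediction for a single protein.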
Figure 1

The flowchart for constructing the secretome LaCOGs. The completely sequenced LAB genomes are used as input data. No plasmid sequences were used for the Inparanoid search. The squares with dashed-line frames are intermediate products that are not user-queryable from the LAB-Secretome interface; the squares with full-line frames are the final information stored in the LAB-Secretome database. The upper left frame shows the processes that produce new LaCOGs; the upper right frame shows the processes that extend existing LaCOGs. The new LaCOGs are coded starting with "9"; the extended existing LaCOGs retain the original names from Makarova et al. [43]. BlastP1: the Blast results were processed by a revised criterion "uniform top 3" (see Material and Methods); BlastP2: the Blast results were processed with an E-value cut-off of 1e-3 and an aligned sequence coverage of 60% for distant homolog identification. This work scheme can be used to update the LAB-Secretome database when new bacterial genomes are available.
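The BlastP2 criterion in the legend above (cut-off of 1e-3 and 60% aligned sequence coverage) can be written as a simple filter. The sketch below is an assumption-laden illustration: the `Hit` record is hypothetical, coverage is assumed to mean aligned length divided by query length, and whether the published cut-offs are inclusive is not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise protein hit (hypothetical record, not a BLAST output format)."""
    query: str
    subject: str
    evalue: float
    align_len: int  # number of aligned query residues
    query_len: int  # full length of the query protein


def passes_distant_homolog_filter(hit, max_evalue=1e-3, min_coverage=0.60):
    """Apply the two thresholds quoted for BlastP2.

    Coverage is computed relative to the query, which is an assumption;
    the legend does not say whether query or subject length was used.
    """
    coverage = hit.align_len / hit.query_len
    return hit.evalue <= max_evalue and coverage >= min_coverage


hits = [
    Hit("prot1", "refA", 1e-5, 120, 180),   # coverage 0.67, kept
    Hit("prot2", "refB", 1e-5, 90, 180),    # coverage 0.50, rejected
    Hit("prot3", "refC", 1e-2, 170, 180),   # E-value too high, rejected
]
kept = [h.query for h in hits if passes_distant_homolog_filter(h)]
print(kept)
```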

The Lactobacillales-specific clusters of orthologous groups of proteins (LaCOGs) previously generated by Makarova et al. [43] were used as the basis for clustering proteins into protein families. In total 3374 (729 new and 2645 existing) LaCOGs were formed by adding 14 recently sequenced LAB genomes to the Makarova et al. set. Subsequently, a genome-scale SCL prediction was performed on all proteins in the 26 genomes (Table 1). By combining the SCL prediction and LaCOGs, and after manual curation (see below), we defined 462 secretome LaCOGs (of which 212 are new compared to the Makarova et al. set) composed of 3357 proteins, representing 7.4% of the complete genome dataset and 93% of all predicted secretome proteins in these 26 genomes. We defined thirteen general functional classes for these proteins, and the distribution of these clustered secretome proteins over the classes and LaCOGs is shown in Figure 2.
An additional 249 putative secretome proteins could not be grouped into these LaCOGs, comprising 69 proteins that had only a distant homolog in non-LAB, and 180 proteins that had no homolog in any sequenced bacterial genomes, which we termed the extracellular "ORFans" (Table 2, Additional file 1, sheet S1).
Figure 2

Overview of the distribution of secretome proteins in LaCOGs. The central pie depicts the distribution of secretome proteins in LaCOGs according to their functional classes. The percentage was calculated as the number of proteins in the category divided by the total of 3357 secretome proteins that were clustered into LaCOGs. The number of LaCOGs in each category is listed in the pie chart legend behind the name of the functional class. The separate yellow-red-green pie charts for each functional class represent the distribution of its LaCOGs in the LAB genomes, i.e. ubiquitous, species/strain-specific, or variable.

Table 2

Overview of the LaCOGs (genomes included in the original LaCOG analysis [43] are marked by *).

LAB species and strains                 | Secretome size | Proteins in LaCOG | Distant homologs | ORFans | LaCOGs
E.faecalis V583                         | 281            | 232               | 22               | 27     | 131
L.acidophilus NCFM                      | 171            | 161               | 2                | 8      | 108
L.brevis ATCC 367                       | 177            | 154               | 5                | 18     | 113
L.casei ATCC 334 *                      | 192            | 187               | 3                | 2      | 148
L.casei BL23                            | 205            | 197               | 0                | 8      | 153
L.citreum KM20                          | 112            | 112               | 0                | 0      | 93
L.delbrueckii bulgaricus ATCC BAA-365 * | 115            | 113               | 0                | 2      | 94
L.delbrueckii bulgaricus ATCC11842      | 87             | 79                | 3                | 5      | 68
L.fermentum IFO 3956                    | 112            | 112               | 0                | 0      | 89
L.gasseri ATCC 33323 *                  | 115            | 113               | 0                | 2      | 88
L.helveticus DPC 4571                   | 131            | 123               | 2                | 6      | 97
L.johnsonii NCC 533 *                   | 236            | 209               | 6                | 21     | 131
L.lactis cremoris MG1363                | 105            | 103               | 0                | 2      | 86
L.lactis cremoris SK11 *                | 105            | 105               | 0                | 0      | 87
L.lactis lactis IL1403 *                | 136            | 114               | 4                | 18     | 80
L.mesenteroides ATCC 8293 *             | 112            | 94                | 5                | 13     | 77
L.plantarum WCFS1 *                     | 160            | 151               | 5                | 4      | 123
L.reuteri F275 JGI                      | 159            | 156               | 1                | 2      | 124
L.reuteri F275 Kitasato                 | 171            | 156               | 2                | 13     | 123
L.sakei 23K                             | 114            | 103               | 4                | 7      | 80
L.salivarius UCC118                     | 135            | 126               | 3                | 6      | 103
O.oeni PSU-1 *                          | 95             | 90                | 0                | 5      | 70
P.pentosaceus ATCC 25745 *              | 99             | 89                | 1                | 9      | 79
S.thermophilus CNRZ1066 *               | 90             | 90                | 0                | 0      | 77
S.thermophilus LMD-9 *                  | 97             | 94                | 1                | 2      | 84
S.thermophilus LMG 18311                | 94             | 94                | 0                | 0      | 81
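
The per-genome counts in Table 2 can be tied back to the totals quoted in the text (3357 clustered proteins, 69 proteins with only distant homologs, 180 ORFans, and 93% of predicted secretome proteins in LaCOGs). The short sketch below, with values copied from Table 2 and names chosen only for this illustration, performs that arithmetic.

```python
# Values copied from Table 2:
# (genome, secretome size, proteins in LaCOGs, distant homologs, ORFans).
TABLE2 = [
    ("E.faecalis V583", 281, 232, 22, 27),
    ("L.acidophilus NCFM", 171, 161, 2, 8),
    ("L.brevis ATCC 367", 177, 154, 5, 18),
    ("L.casei ATCC 334", 192, 187, 3, 2),
    ("L.casei BL23", 205, 197, 0, 8),
    ("L.citreum KM20", 112, 112, 0, 0),
    ("L.delbrueckii bulgaricus ATCC BAA-365", 115, 113, 0, 2),
    ("L.delbrueckii bulgaricus ATCC11842", 87, 79, 3, 5),
    ("L.fermentum IFO 3956", 112, 112, 0, 0),
    ("L.gasseri ATCC 33323", 115, 113, 0, 2),
    ("L.helveticus DPC 4571", 131, 123, 2, 6),
    ("L.johnsonii NCC 533", 236, 209, 6, 21),
    ("L.lactis cremoris MG1363", 105, 103, 0, 2),
    ("L.lactis cremoris SK11", 105, 105, 0, 0),
    ("L.lactis lactis IL1403", 136, 114, 4, 18),
    ("L.mesenteroides ATCC 8293", 112, 94, 5, 13),
    ("L.plantarum WCFS1", 160, 151, 5, 4),
    ("L.reuteri F275 JGI", 159, 156, 1, 2),
    ("L.reuteri F275 Kitasato", 171, 156, 2, 13),
    ("L.sakei 23K", 114, 103, 4, 7),
    ("L.salivarius UCC118", 135, 126, 3, 6),
    ("O.oeni PSU-1", 95, 90, 0, 5),
    ("P.pentosaceus ATCC 25745", 99, 89, 1, 9),
    ("S.thermophilus CNRZ1066", 90, 90, 0, 0),
    ("S.thermophilus LMD-9", 97, 94, 1, 2),
    ("S.thermophilus LMG 18311", 94, 94, 0, 0),
]


def column_totals(rows):
    """Sum the LaCOG, distant-homolog and ORFan columns over all genomes."""
    in_lacog = sum(r[2] for r in rows)
    distant = sum(r[3] for r in rows)
    orfans = sum(r[4] for r in rows)
    return in_lacog, distant, orfans


in_lacog, distant, orfans = column_totals(TABLE2)
clustered_pct = 100 * in_lacog / (in_lacog + distant + orfans)
print(in_lacog, distant, orfans, round(clustered_pct, 1))
```

Each row also satisfies secretome size = proteins in LaCOGs + distant homologs + ORFans, which is a convenient check when re-entering the table by hand.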
Although the LAB genomes vary in size, the size of the secretome as a fraction of each genome was fairly consistent (6-10%), as was the distribution of proteins over the different SCLs. The N-terminally anchored proteins with no signal peptidase cleavage site are the most abundant class among all predicted secretome proteins. A striking feature of numerous secretome proteins, and particularly of surface-associated proteins, is that they are large and consist of many different domains (often in repeats) in varying compositions (see examples in Figure 3). In fact, this variation in domain composition has been used in constructing and sub-dividing the LaCOGs and in separating sub-families of homologous proteins. Distinct combinations of domains provide hints about the functions of these extracellular proteins in cell-wall metabolism, cell-wall binding and communication with the environment (see below).
Figure 3

Variations in domain composition. Examples of LaCOGs families showing different domain types, domain compositions and repeats.

False predictions and pseudogenes

The preliminary secretome clusters were curated manually and corrected based on expert knowledge, e.g. for false-positive and false-negative predictions, incorrect gene starts and pseudogenes. Examples of proteins of known intracellular function, but with a consistent false-positive extracellular SCL prediction, are listed in Additional file 2, sheet S1. In most cases the mis-prediction was caused by an α-helix-like N-terminal sequence in these proteins (possibly part of the hydrophobic core of a globular protein), which LocateP predicted to be a signal peptide. A further improvement was made by finding and removing those LaCOGs whose proteins are anchored in the cell membrane by a single N-terminal transmembrane helix, but with the rest of the protein inside the cell (so-called outside-in topology, GO:0031235) [44-53]. By aligning proteins within these LaCOGs we found that these proteins do not have positively charged residues preceding the N-terminal hydrophobic helix, but exclusively have one or more positively charged residues immediately downstream of the transmembrane helix (examples in Additional file 2, sheet S2). Hence such features could be used for further development of a model for SCL prediction of N-terminally anchored proteins by LocateP.
Nearly 400 pseudogenes were identified, but this is probably an underestimate. In most cases the pseudogenes resulted from frameshifts, and occasionally from N- or C-terminal truncation of genes. Most of these genes could be concatenated to encode larger proteins with high similarity to known proteins in the LaCOGs. Many of these pseudogenes were initially predicted by LocateP to encode intracellular proteins, but after concatenation the proteins are predicted to be extracellular and/or contain domains with extracellular functions. An example is the pair of proteins encoded by the adjacent genes LSA1731 and LSA1730 in L. sakei 23K, which were annotated as hypothetical proteins. The concatenated protein showed high similarity to proteins in LaCOG02935, which were exclusively members of the cell-surface protein Csc complex family [54]. In total 129 concatenated pseudoproteins were made from 279 protein fragments (Additional file 3, sheet S1), while 87 pseudogenes could not be combined (Additional file 3, sheet S2).
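The charge pattern reported above for N-terminally anchored proteins whose bulk lies inside the cell (no positive residues before the hydrophobic helix, positive residues directly after it) lends itself to a simple check. The sketch below is a hypothetical illustration, not the LocateP model: the window size, the restriction to Lys/Arg, the helix coordinates and the synthetic sequences are all assumptions.

```python
POSITIVE = set("KR")  # assumption: only Lys and Arg are counted as positively charged


def flank_charges(sequence, helix_start, helix_end, window=5):
    """Count positive residues in the windows flanking a transmembrane helix.

    helix_start/helix_end are 0-based, end-exclusive coordinates that would
    come from a separate helix predictor (not implemented here).
    """
    upstream = sequence[max(0, helix_start - window):helix_start]
    downstream = sequence[helix_end:helix_end + window]
    n_up = sum(residue in POSITIVE for residue in upstream)
    n_down = sum(residue in POSITIVE for residue in downstream)
    return n_up, n_down


def matches_inside_pattern(sequence, helix_start, helix_end, window=5):
    """True when no positive charge precedes the helix and at least one follows it."""
    n_up, n_down = flank_charges(sequence, helix_start, helix_end, window)
    return n_up == 0 and n_down > 0


# Synthetic sequences for illustration only.
inside_like = "MSS" + "L" * 18 + "KRDEQ"     # charges only downstream of the helix
outside_like = "MKKS" + "L" * 18 + "DEQNS"   # charges only upstream of the helix
print(matches_inside_pattern(inside_like, 3, 21))
print(matches_inside_pattern(outside_like, 4, 22))
```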

The LAB-Secretome database

The LAB-Secretome database http://www.cmbi.ru.nl/lab_secretome was constructed to store and browse all the predicted extracellular proteins and LaCOGs. An overview page summarizes all predicted secretomes, LaCOGs, distant homologs in non-LAB species and the ORFans, with hyperlinks to the corresponding HTML pages to help users to browse the whole database (Figure 4A). The LAB-Secretome database can be queried in many ways, e.g. by bacterial species, protein subcellular location, protein accession identifiers, LaCOG numbers, protein functional classes, and Pfam domain accession codes or domain functions (Figure 4B). Visualization includes a description of LaCOG members and function, protein functional domain composition, and multiple alignments with notification of corrected start codons, pseudogenes and concatenated proteins (Figure 4D). A Blast function, utilizing the BlastP [55] program, enables users to query the clustering information of their proteins of interest to the extracellular proteins and families that are already in the database (Figure 4C). An automatic updating scheme for the LaCOGs (Figure 1) was designed to ensure that the need for manual curation is minimized when adding new bacterial genomes to the database.
Figure 4

Screen shot of the LAB-Secretome database. A: Overview page of the database showing statistical information of the predicted LAB secretomes with active links to their corresponding pages; B: The search engine in LAB-Secretome which can browse the database by various types of queries; C: The BlastP search page of LAB-Secretome; D: An example page depicting parts of the detailed information that LAB-Secretome presents for each LaCOG.

Overview of the extracellular protein families

Ubiquitous/essential LaCOGs

Only 22 LaCOGs were found to be fully conserved among all 26 LAB secretomes, or only lacking in 1 genome (5 LaCOGs), e.g. the absence of an ATP-dependent protease from LaCOG01453 in P. pentosaceus (Additional file 1, sheet S3). Most of these LaCOGs contain proteins with universal functionalities involved in cell-wall metabolism, secretion, transport and DNA uptake (Figure 2). Only one conserved family (LaCOG01219) contains proteins of as yet unknown function, but presumably essential as they are conserved in all genomes.
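
The conservation criterion used here (present in all 26 genomes, or lacking in exactly one) can be expressed as a small classification over the set of genomes in which a LaCOG has a member. The function and label names below are assumptions for illustration; the genome identifiers are placeholders.

```python
def conservation_class(present_in, all_genomes):
    """Classify a protein family by the number of genomes that lack a member."""
    missing = set(all_genomes) - set(present_in)
    if not missing:
        return "conserved in all genomes"
    if len(missing) == 1:
        return "lacking in one genome"
    return "not ubiquitous"


# Placeholder identifiers for the 26 genomes of this study.
genomes = [f"genome_{i:02d}" for i in range(1, 27)]
print(conservation_class(genomes, genomes))
print(conservation_class(genomes[1:], genomes))
print(conservation_class(genomes[:10], genomes))
```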

Most common functionalities in the secretomes of LAB

Among all 215 secretome LaCOGs with known or presumed functions, almost half contain proteins that are involved in cell-wall metabolism, e.g. the muramidase, lysin, lysozyme and beta-lactamase families (Figure 2). Many of these enzyme families are further subdivided into different LaCOGs based on variations in sequence homology and protein domain composition, and some may represent species/niche-specific subfamilies. One example is the subdivision of proteins with an NlpC/P60 family domain (e.g. gamma-D-glutamate-meso-diaminopimelate muropeptidase) into 5 separate LaCOGs (Additional file 4, sheet S1). These proteins vary in length from ~150 to ~500 amino acids, all with the NlpC/P60 domain in the C-terminal part. In only one of these subfamilies (LaCOG90015), all 16 members have 1-3 copies of LysM domains (Pfam PF01476) in their N-terminal part, indicating extra binding functions to the cell envelope. A similar domain architecture is found in one of the four N-acetylmuramoyl-L-alanine amidase subfamilies (LaCOG01848), which has an enzymatic C-terminal domain and 0-3 N-terminal SH3 domains (Pfam PF08239), known to bind to proline-rich regions of proteins. In the peptidoglycan hydrolase subfamilies LaCOG00186 and LaCOG01653 the enzymatic domain is located at the N-terminus and can be followed by different kinds, combinations and numbers of binding domains such as LysM, SH3 or the surface layer domain (Pfam PF03217) (Figure 5). These examples all illustrate that the many types of extracellular enzymes involved in cell-wall turnover have different mechanisms to attach to components of the cell surface.
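The domain-composition part of this subdivision can be illustrated by grouping proteins on their ordered domain architecture, with tandem repeats collapsed so that one or three LysM copies count as the same architecture. This is a minimal sketch under stated assumptions: protein identifiers are hypothetical, and the subdivision in the paper also relied on sequence homology and manual curation.

```python
from collections import defaultdict
from itertools import groupby


def architecture(domains):
    """Collapse consecutive repeats, e.g. [LysM, LysM, NlpC/P60] -> (LysM, NlpC/P60)."""
    return tuple(name for name, _ in groupby(domains))


def group_by_architecture(proteins):
    """Group protein ids by collapsed N- to C-terminal domain architecture."""
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[architecture(domains)].append(protein_id)
    return dict(groups)


# Hypothetical proteins; domain names follow those mentioned in the text.
proteins = {
    "protA": ["LysM", "LysM", "NlpC/P60"],
    "protB": ["LysM", "NlpC/P60"],
    "protC": ["NlpC/P60"],
    "protD": ["SH3", "SH3", "SH3", "Amidase"],
}
for arch, members in group_by_architecture(proteins).items():
    print(" + ".join(arch), members)
```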
Figure 5

Domain structure variation of enzymes within a family. Examples of an enzyme family (N-acetyl-glucosaminidase) with variations in the type and number of cell-envelope binding domains.

Niche-specific LaCOG families

1/L. acidophilus complex specific

The acidophilus "complex", which includes the species L. acidophilus, L. johnsonii, L. gasseri, L. delbrueckii ssp. bulgaricus and L. helveticus, has long been regarded as a phylogenetic subgroup [56-58]. About 30 LaCOGs appear to be specific for these species (Additional file 1, sheet S4). Their proteins include an ABC-type phosphate/phosphonate transport system (LaCOG02118), the aggregation promoting factor (LaCOG90005) [59-61], a putative competence protein (LaCOG03110) and several families of S-layer proteins, which may reflect the special binding function that these S-layer proteins generally share in the acidophilus complex species [62-69]. Interestingly, twenty of these acidophilus complex-specific LaCOGs contain only extracellular proteins of unknown function, and experimental determination of their function remains a challenge.

2/GI-tract specific

If we consider the LAB species L. acidophilus, L. johnsonii, L. gasseri, L. reuteri, and L. salivarius to be specifically found in the GI-tract, then we can identify 17 LaCOGs which are not found outside of this group, of which 13 families contain only proteins of unknown function (Additional file 1, sheet S4). One mucus-binding protein family (LaCOG02280) was found to be specific for these GI-tract LAB, and contains 4 proteins from L. acidophilus, L. gasseri and L. johnsonii. All four proteins are larger than 2300 amino acids, contain a signal peptide with a YSIRK domain (Pfam PF04650) and appear to be anchored to the peptidoglycan by an LPxTG cell-wall anchor (Pfam PF00746). Each protein has 5-11 copies of a mucus-binding domain, as defined by Boekhorst et al. [60], suggesting a particular role in binding to mucus components in the GI-tract [5,70-72]. The 3D structure of this domain of 184 residues has recently been determined and shows similarity to the functional repeat found in a family of immunoglobulin-binding proteins [73].

3/Plant-associated specific

Twelve LaCOGs appear to be specific for the group of plant-associated species Leuconostoc, Oenococcus, L. plantarum, L. brevis, and P. pentosaceus, of which 7 families contain only proteins of unknown function (Additional file 1, sheet S4). One of these (LaCOG02876) includes 4 homologous proteins from L. brevis, L. plantarum, O. oeni and L. citreum, which show a high sequence similarity to each other, but the protein from L. plantarum has a much longer serine-rich spacer between the N- and C-terminal domains. A similar domain structure, differing in a long serine-rich spacer, is seen in the 2 hypothetical proteins from L. plantarum and L. brevis in LaCOG02927.

4/Dairy LAB specific

A few protein families were found only to occur in the secretomes of the dairy LAB S. thermophilus, L. lactis and E. faecalis (Additional file 1, sheet S4). These proteins have functional properties that may be relevant to the dairy niche, e.g. LaCOG00374 contains ABC transporter substrate-binding proteins for polar amino acids, and could possibly be required for growth in milk [74-77]. The L. lactis strains have a single copy of this gene, while the S. thermophilus strains all have 3 consecutive genes encoding paralogs of this amino acid-binding protein. All dairy Streptococcus and Lactococcus strains contain a single gene encoding a beta-lactamase (LaCOG00012) which may play a role in destroying penicillin that these strains may encounter in milk [78-82]. A putative chitinase (glycosyl hydrolase family 18; LaCOG02690) is found exclusively in E. faecalis and in L. lactis strains.
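
The niche-specificity criterion used in the four subsections above amounts to a subset test: a LaCOG counts as specific for a niche when none of its members comes from a species outside the niche group. The sketch below encodes the GI-tract, plant and dairy groups as listed in the text, with Leuconostoc represented by the two species in Table 1; the function name and data layout are assumptions for illustration.

```python
# Species groups as described in the text; "L." abbreviations follow the paper
# (L. lactis is Lactococcus; L. mesenteroides and L. citreum are Leuconostoc).
NICHES = {
    "GI-tract": {"L. acidophilus", "L. johnsonii", "L. gasseri", "L. reuteri", "L. salivarius"},
    "plant": {"L. mesenteroides", "L. citreum", "O. oeni", "L. plantarum", "L. brevis", "P. pentosaceus"},
    "dairy": {"S. thermophilus", "L. lactis", "E. faecalis"},
}


def niches_for(lacog_species):
    """Return the niches whose species group contains every species of the LaCOG."""
    species = set(lacog_species)
    if not species:
        return []
    return [niche for niche, group in NICHES.items() if species <= group]


# Species distributions quoted in the text for three LaCOGs.
print(niches_for({"L. acidophilus", "L. gasseri", "L. johnsonii"}))           # LaCOG02280
print(niches_for({"L. brevis", "L. plantarum", "O. oeni", "L. citreum"}))     # LaCOG02876
print(niches_for({"E. faecalis", "L. lactis"}))                               # LaCOG02690
```

A family with members in species from two different groups returns an empty list, which corresponds to "not niche-specific" under this simple definition.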

Species-specific and strain-specific LaCOGs

Up to 150 LaCOGs were found to be species-specific or strain-specific (Additional file 1, sheet S5). The distinction is not yet clear-cut, because for some species several strains were sequenced (e.g. L. lactis, S. thermophilus) while for many species only a single strain has been sequenced to date. Most of these families are made up solely of hypothetical proteins with highly conserved sequence (Figure 2). L. casei and L. lactis have the highest number of species-specific LaCOGs, indicating that they may have more unique extracellular functions. Examples of species-specific extracellular proteins are the PrgA/PrgB/PrgC surface proteins of E. faecalis [83-85], an alpha-amylase (LaCOG02644) in L. lactis strains, a phospholipase A2 family enzyme (LaCOG99223) in L. casei strains, a cyclo-nucleotide phosphodiesterase (LaCOG00213) in S. thermophilus strains, and a mucus-binding protein (LaCOG90010) in L. delbrueckii strains.

Extracellular proteins not in LaCOGs: ORFans and proteins with only distant homologs in non-LAB

In total, 249 putative extracellular proteins could not be classified into LaCOG families; they comprise 69 proteins that have only distant homologs in non-LAB species and 180 ORFans that are species-specific (Additional file 1, sheets S6 and S7). While the ORFans are nearly all hypothetical proteins of unknown function, the distant homologs also include proteins with a variety of known functions, such as extracellular enzymes (e.g. xylanase, pectate lyase, endo-beta-N-acetylglucosaminidase, proteases and beta-fructosidase), substrate-binding proteins of transporters, miscellaneous binding proteins and specific bacteriocins. The uniqueness of these proteins suggests that most species or strains have a few extracellular proteins that are not found in other sequenced LAB and that may provide unique functions related to their environmental niche. Quite a few of the proteins of unknown function are predicted to be lipid-anchored and may therefore represent substrate-binding proteins of uncharacterized transporters.

Specific enzyme families

LAB possess a variety of extracellular hydrolytic enzymes and transglycosylases which presumably relate to interactions with their environment, e.g. for degradation of growth substrate polymers. These enzymes have been clustered and sub-divided into protein families (LaCOGs) based on specific domain compositions (Table 3, Additional file 4, sheet S2). For instance, the subtilisin-like serine proteases (Pfam PF00082), known to be important for growth on protein substrates [86-89], were clustered into 2 LaCOGs: the first family (LaCOG02153) is composed of 7 proteins containing a protease-associated PA domain (Pfam PF02225) inserted in the catalytic domain, where it forms a lid structure that covers the active site, whereas the other family (LaCOG90024) was only found in L. casei and L. acidophilus, and contains subtilisin-like serine proteases without the PA domain. Putative transglycosylases, also referred to as aggregation-promoting factors [59,90-92], are divided into three subfamilies (LaCOG01580, LaCOG02932, LaCOG90005), and have a highly conserved C-terminal domain [71]. Furthermore, there are several families of hydrolases of unknown function (Table 3). The extracellular alpha/beta hydrolases with a DUF915 domain (Pfam PF06028) are subdivided into four families, two of which are highly populated (LaCOG01137 and LaCOG01138, with 46 and 30 members, respectively) and found in nearly all LAB, suggesting that they have an essential, but as yet unknown, function.
Table 3

Examples of specific enzyme and binding-protein sub-families

Product | LaCOG | Functional domain | Distribution | Special features

Specific enzyme families
Subtilisin-like serine protease | LaCOG02153 | Subtilase family | L. casei, L. delbrueckii bulgaricus, L. johnsonii, L. lactis, S. thermophilus | PA domain (PF02225) inserted in the subtilase family domain
Subtilisin-like serine protease | LaCOG90024 | Subtilase family | L. acidophilus, L. casei | no PA insert domain
Trans-glycosylase | LaCOG01094 | Transglycosylase-like domain | mainly in L. plantarum, L. lactis, S. thermophilus | different domains for PG binding
Trans-glycosylase | LaCOG01589 | aggregation promoting factor related surface protein | not in L. acidophilus group | PG bound by LysM domain; highly conserved C-terminal domain ending in GWY
Trans-glycosylase | LaCOG02932 | aggregation promoting factor related surface protein | only in L. delbrueckii bulgaricus, L. plantarum, L. acidophilus group | highly conserved C-terminal domain ending in WY
Trans-glycosylase | LaCOG90005 | aggregation promoting factor related surface protein | only in L. acidophilus group | highly conserved C-terminal domain ending in GWY
Dextran sucrase | LaCOG90016 | glycosyl hydrolase family 70 | only in Leuconostoc, L. reuteri, O. oeni |

Cell-surface hydrolases
alpha/beta hydrolase | LaCOG01137 | alpha/beta hydrolase of unknown function (DUF915) | ubiquitous |
alpha/beta hydrolase | LaCOG01138 | alpha/beta hydrolase (DUF915) | ubiquitous |
alpha/beta hydrolase | LaCOG01920 | alpha/beta hydrolase (DUF915) | only in L. delbrueckii bulgaricus, L. plantarum, L. casei |
alpha/beta hydrolase | LaCOG02785 | alpha/beta hydrolase (DUF915) | only in L. plantarum, L. casei, L. sakei |
lipase/Acyl-hydrolase | LaCOG00342 | GDSL-like Lipase/Acylhydrolase | not in L. acidophilus group | with GDSL-like motif
general cell surface hydrolase | LaCOG02019 | cell surface hydrolase, membrane-bound (putative) | only in L. delbrueckii bulgaricus, L. plantarum, L. casei, L. fermentum |
general cell surface hydrolase | LaCOG01618 | cell-surface hydrolase | only in L. plantarum, L. delbrueckii bulgaricus, P. pentosaceus |

Binding proteins
mannose-specific adhesin | LaCOG01741 | MUB domain, Gram positive anchor | only in L. plantarum, L. delbrueckii bulgaricus, P. pentosaceus, L. acidophilus group |
collagen-binding protein | LaCOG00092 | Collagen binding domain, Gram positive anchor | not in L. acidophilus group |
mucus-binding protein | LaCOG00885 | MucBP domain (Classical), Gram positive anchor | not in L. acidophilus group | Leucine Rich Repeat, PT repeat
mucus-binding protein | LaCOG01470 | MUB domain, Gram positive anchor | | many pseudogenes; most L. acidophilus group proteins have YSIRK-type signal peptide
mucus-binding protein | LaCOG02280 | MUB domain, Gram positive anchor | only in L. acidophilus group | very large; YSIRK-type signal peptide
mucus-binding protein | LaCOG03211 | MUB domain, Gram positive anchor | | 5 of 10 are pseudogenes; YSIRK SP in L. acidophilus group members
mucus-binding protein | LaCOG99309 | MUB domain, Gram positive anchor | only in L. acidophilus group | all pseudogenes; YSIRK-type signal peptide
chitin-binding protein | LaCOG01300 | Chitin binding domain | E. faecalis, L. plantarum, L. sakei, L. lactis | maybe related to niche
adherence protein | LaCOG01366 | von Willebrand factor type A domain, Cna protein B-type domain | only in L. lactis, E. faecalis, L. citreum, L. casei |

Specific binding-protein families

Many extracellular proteins contain known domains for binding to macromolecular substrates. In addition to domains for binding to the cell wall of the producing cell (e.g. LysM, SH3), several other domains are found which are related to binding to host macromolecules (e.g. domains annotated as mucus-binding, chitin-binding, collagen-binding, fibronectin-binding, carbohydrate-binding, etc.) (Table 3). Some of these annotations derive from in vitro binding studies and may not reflect in vivo functions. In LAB, mucus-binding domains (MUB, MucBP) are found in many proteins and are thought to play a role in binding to the host GI-tract mucus layer [57,93,94]. These mucus-binding proteins vary enormously in size and in the number of mucus-binding domains. We have made a preliminary separation into 7 different subfamilies of mucus-binding proteins based on protein size, sequence homology, domain composition and phylogeny (Table 3). The three largest subfamilies are (1) LaCOG00885, containing 11 members from different LAB but not from L. acidophilus group members, (2) LaCOG01470, with 28 members, found in many LAB, and (3) LaCOG03211, which includes 10 proteins. The proteins of LaCOG00885 contain solely the MucBP domains as defined by Pfam (PF00746), while the proteins of the other two LaCOGs possess multiple copies of the larger MUB domains as defined by Boekhorst et al. [71] (see also Figure 2 in [95]). Many mucus-binding proteins of L. acidophilus group members contain an N-terminal [Y/F]SIRKxxxGxxS-containing signal peptide (PF04650), which was earlier reported as a typical characteristic of the L. acidophilus MUB proteins [94,96] and may relate to a specific function in sorting or folding [97,98]. Furthermore, it is striking that many large genes encoding mucus-binding proteins are pseudogenes (e.g. in LaCOG01470, LaCOG03211 and LaCOG99309). While it is unlikely that these are all due to sequencing errors, it is not yet clear whether these are truly pseudogenes or may encode functional proteins after transcription with strand slippage [5,71].

Conclusions

Lactic Acid Bacteria (LAB) occur naturally in many different fermentation environments such as plant, meat, dairy and cereal. Overall similarities have been identified among the genomes of many LAB species [61,99-105]. However, biodiversity has also been reported frequently, showing that subtle variations in the presence or absence of proteins and in functional domain composition might lead to important traits during bacterial adaptation to their living environments [106-113]. Our comparative research on extracellular and surface-associated protein families has provided a more solid basis for this hypothesis. Universal families have been identified which are apparently essential for survival of all LAB, as well as species-specific protein families. Besides the clustered proteins with known functions, many families of hypothetical proteins and unique proteins (ORFans and proteins with only distant homologs in non-LAB) were found. Protein clustering supports niche-dependent features of specific subgroups of LAB (e.g. the L. acidophilus group) and could aid in linking bacterial phenotypes to genotypes. The distinct sub-families of the different LaCOGs have provided clues for adaptation of the bacterial cells to their living environment, such as the GI-tract. The results of this study can be used as leads for experimental work on the molecular evolution, diversity, function and adaptation of bacteria to specific environments. Our clustering methods and database structure were designed so that they can be adapted to groups of bacteria other than LAB. The analysis results are stored in a queryable database with browsing and visualization functions, which will be updated regularly to ensure continuity of the service to the biology community. The clustering of proteins into families could also help in checking the quality of newly sequenced genomes and in genome (re-)annotation.

Methods

Genome sequences and bioinformatics tools used in this research

The genome sequences of 26 selected representative lactic acid bacteria, including the protein functional annotation and the gene contexts, were obtained from the NCBI bacterial genome database (version 15 Aug., 2008) [114]. BlastP (default cutoff values of E < 1, low-complexity filter disabled) [55] and Inparanoid [115] were used for sequence homology and orthology searches, respectively. Protein subcellular location (SCL) was predicted by LocateP [38]. Multiple sequence alignments were constructed using Muscle [116]. Motif searches were performed using MEME and MAST [117]. Protein domains (version Dec. 2008) [118] originating from the Pfam database [119-121] and additional HMMs reported in other studies [54,71,96,122-124] were searched using HMMER [125] with the respective cut-off of each model. The domain functions were obtained from the GO database [126] using the PFAM2GO dataset [126]. The LAB-Secretome database was created in MySQL and the database interface was written in PHP (version 5.2.7). The protein domain composition was visualized using scalable vector graphics (SVG).

Protein clustering into orthology groups (LaCOGs)

First, the 22,191 proteins in 3195 LaCOGs generated by Makarova et al. [1] from 12 LAB genomes were used as the basis for protein clustering. All protein sequences from 14 newly sequenced LAB genomes were searched against the Makarova LaCOG set using BlastP. The proteins with high homology to the existing LaCOGs were then selected using a revised criterion based on the well-known COG extension rule "uniform top 3" [127]: if all of the top 3 BlastP hits of a query protein belong to the same LaCOG (for a LaCOG of size 2, the top 2 hits were taken; the LaCOG size must be greater than or equal to 2), then the query protein is added to this LaCOG. Since this extension was based purely on homologs of proteins that were already included in the LaCOGs by Makarova et al., proteins specific to newly sequenced species, e.g. L. reuteri, were not added due to the absence of "seeding sequences" for BlastP. In order to cluster all proteins that originated from the newly sequenced genomes, a complete all-to-all Inparanoid [115] search was performed in parallel on the proteins encoded in the 14 new genomes to identify orthologous proteins. Cut-off settings of bit score 50 and sequence overlap of 50% were used. The proteins with an all-to-all bidirectional-best-hit (BBH) relationship [128,129] were clustered into groups, meaning that in any such group, each member is the BBH of another member. This stringent criterion generates new cores of orthologous proteins. The proteins that were not previously included in any cluster, including those from Makarova LaCOGs containing only 1 member, were then used as BlastP queries against the core ortholog clusters and the extended LaCOGs made in step one. In this step, the revised criterion "uniform top 3" was used and new LaCOGs were made. The newly made LaCOGs were merged with the extended Makarova LaCOGs, and the newly made ones were assigned coding numbers starting with "9" in their names, e.g. LaCOG90001, to distinguish them from the extended Makarova LaCOGs.
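The revised "uniform top 3" rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name, the input structures (a best-first list of BlastP subject IDs, a membership dictionary and a cluster-size dictionary) and the handling of queries with too few hits are all hypothetical.

```python
from typing import Dict, List, Optional


def assign_by_uniform_top3(
    ranked_hits: List[str],
    protein_to_lacog: Dict[str, str],
    lacog_sizes: Dict[str, int],
) -> Optional[str]:
    """Return the LaCOG a query would join under the revised rule, or None.

    ranked_hits: BlastP subject protein IDs for one query, best hit first
                 (hypothetical input format).
    protein_to_lacog: cluster membership of already-clustered proteins.
    lacog_sizes: number of members currently in each LaCOG.
    """
    if not ranked_hits:
        return None
    candidate = protein_to_lacog.get(ranked_hits[0])
    # Only LaCOGs with at least 2 members can be extended.
    if candidate is None or lacog_sizes.get(candidate, 0) < 2:
        return None
    # A two-member LaCOG can supply at most two hits, so only the top 2 are checked.
    needed = 2 if lacog_sizes[candidate] == 2 else 3
    top = ranked_hits[:needed]
    if len(top) < needed:
        return None  # assumption: too few hits means no assignment
    if all(protein_to_lacog.get(hit) == candidate for hit in top):
        return candidate
    return None
```

A query whose best hits are split between two clusters is left unassigned in this sketch, which corresponds to the proteins that are passed on to the later clustering steps.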

LaCOG quality control

In order to check the quality of the merged LaCOGs, an iterative BlastP search was performed using the clustered proteins as queries against all the proteins that were not included in any constructed LaCOG, using an E-value cut-off of 1E-3 and a query-hit protein length ratio of 0.6, criteria that have been tested by Boekhorst et al. [130] for distant homolog identification. This iterative search found that only 13 non-clustered proteins (mostly hypothetical proteins) had a distant homolog in 11 different LaCOGs, indicating that our clustering methods have included most of the proteins in their possible homologous clusters.
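As a minimal sketch, the two cut-offs used for distant homolog identification could be applied to a single BlastP hit as shown below. The function and parameter names are hypothetical. The text does not define how the length ratio is computed or whether the cut-offs are inclusive, so this sketch assumes the ratio of the shorter to the longer protein and inclusive thresholds.

```python
def is_distant_homolog(
    evalue: float,
    query_len: int,
    hit_len: int,
    max_evalue: float = 1e-3,
    min_length_ratio: float = 0.6,
) -> bool:
    """Apply an E-value cut-off and a query-hit length-ratio cut-off to one hit."""
    if query_len <= 0 or hit_len <= 0:
        return False
    # Assumption: the ratio is shorter length / longer length, so it lies in (0, 1].
    ratio = min(query_len, hit_len) / max(query_len, hit_len)
    return evalue <= max_evalue and ratio >= min_length_ratio
```

For example, a hit with E = 1E-5 between proteins of 300 and 400 residues passes (ratio 0.75), whereas the same E-value between proteins of 100 and 400 residues does not (ratio 0.25).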

ORFans and proteins with only non-LAB distant homologs

The LAB proteins that could not be clustered into LaCOGs by the previously described procedures were then collected and searched with BlastP against all completely sequenced non-LAB bacterial genomes (both Gram-negative and Gram-positive species). The same criterion for distant homolog identification [130] was used. Proteins that had no homologs in any other species were named "ORFans".

Secretome LaCOG extraction

The clustering information of merged LaCOGs, proteins that have only distant homologs in non-LAB species and the ORFans was then combined with the SCL prediction made by LocateP (Table 1). Initially, only the LaCOGs that had at least half of the members with a predicted secretome SCL corresponding to (1) lipid-anchored; (2) N-/C-terminally anchored; (3) secreted by Tat- or Sec- pathway; (4) secreted via non-classical pathways, or (5) cell-wall anchored were identified as the secretome LaCOGs. Later, all other LaCOGs were manually inspected, and a few families were identified with a mixture of secretome and intracellular proteins; only the secretome proteins were added to the database. The same classification was applied to the secretome ORFans and proteins that have only distant homologs in non-LAB species. The resulting clusters of secretome proteins, the "secretome", can be further extended by similar processes when new (LAB) genome sequences become available. Proteins that are exported by unknown mechanisms and so-called "moon-lighting" proteins (known intracellular function, but often also found on the outside of the cell) [131] were not considered as their extracellular SCL cannot be predicted.
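The initial "at least half of the members" selection can be illustrated with the short sketch below. The SCL labels are placeholder strings chosen for this example and are not LocateP's actual output vocabulary; the function name is likewise hypothetical.

```python
from typing import List

# Placeholder labels for the five secretome SCL classes listed above
# (assumption: not the literal class names emitted by LocateP).
SECRETOME_SCL = {
    "lipid-anchored",
    "N-terminally anchored",
    "C-terminally anchored",
    "secreted",
    "non-classically secreted",
    "cell-wall anchored",
}


def is_secretome_lacog(member_scl: List[str]) -> bool:
    """True if at least half of a LaCOG's members have a secretome SCL prediction."""
    if not member_scl:
        return False
    n_secretome = sum(1 for scl in member_scl if scl in SECRETOME_SCL)
    # Integer comparison avoids floating-point rounding at exactly one half.
    return 2 * n_secretome >= len(member_scl)
```

LaCOGs that fail this test are the ones that were subsequently inspected manually for mixed secretome and intracellular membership.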

Manual curation

In order to obtain a prediction of secretome proteins and their classification into LaCOGs that is as accurate as possible, we performed a thorough manual inspection of all the secretome proteins, including the ORFans and the ones included in LaCOGs. All proteins were double-checked for ORF-calling quality using criteria that combine protein length, possible alternative start (end) codons, multiple sequence alignments, protein domain composition and SCL prediction consistency. Incorrectly chosen start codons in the original annotations were corrected based on sequence alignment with protein family members, position of putative ribosome-binding sites, and known features of signal peptides. Pseudogenes were initially identified when BlastP analysis of the encoded proteins showed that they belong to extracellular protein families in LaCOGs, but represent only a fragment of the protein. By analysis of the coding region of these pseudogenes together with their adjacent nucleotide sequences we could generally identify frameshifts, such that the missing protein part(s) were found to be encoded in a different reading frame. In these cases, the entire open reading frames were translated into protein fragments, regardless of the absence of start codons, and these protein fragments were concatenated to form new protein sequences that share high similarity with other known full-length proteins. In a few cases, ORFans were also identified as pseudogenes when they lacked a signal peptide but otherwise contained protein domains typical of extracellular proteins. Generally, we expected the ORFans to be real genes that represent unique functionality of the specific LAB in which they occur. However, because the average size of these hypothetical ORFs was below 100 amino acids, it is possible that some small ORFans are wrongly predicted ORFs or pseudogenes. Proteins smaller than 80 amino acids containing only a Sec-type N-terminal signal sequence were removed from the set of predicted extracellular proteins, since their C-terminal part is generally too small to represent an extracellular domain. Moreover, many such small proteins with a single predicted TM helix are now increasingly considered to be small integral membrane proteins [132].
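The size filter in the last step can be written as a simple predicate. This is a sketch only: the record fields below are hypothetical, and "containing only a Sec-type N-terminal signal sequence" is interpreted here as having a Sec signal peptide and no other predicted anchor or domain.

```python
from dataclasses import dataclass


@dataclass
class PredictedProtein:
    """Hypothetical record holding the features needed for the filter."""
    protein_id: str
    length_aa: int
    has_sec_signal: bool
    n_other_features: int  # predicted domains or anchors besides the signal peptide


def keep_in_secretome(protein: PredictedProtein, min_length: int = 80) -> bool:
    """Drop proteins shorter than min_length whose only feature is a Sec-type signal."""
    too_small = protein.length_aa < min_length
    signal_only = protein.has_sec_signal and protein.n_other_features == 0
    return not (too_small and signal_only)
```

Small proteins that carry an additional predicted domain or anchor are retained under this interpretation, which is an assumption rather than a rule stated in the text.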

Authors' contributions

MZ, DT and MW carried out the LaCOGs and database construction. RS performed the manual curation of the clustered proteins and DT and MM refined the LAB-Secretome database. MM and DT drafted the manuscript. Both MW and RS participated in its coordination and helped to draft and finalize the manuscript. All authors read and approved the final manuscript.

Additional file 1

The overview of LAB-Secretome. Sheet S1: an overview of secretomes included in this research; sheet S2: the presence and absence patterns of the LaCOGs in 26 LAB genomes; sheet S3: the ubiquitous LaCOGs; sheet S4: the niche-specific LaCOGs; sheet S5: the species-specific LaCOGs; sheet S6: the ORFans; sheet S7: the proteins with only distant homologs.

Additional file 2

False-positive SCL predictions. The false-positive SCL predictions that were corrected using domain composition and homolog information of LaCOGs. Sheet S1: the intracellular proteins that had been wrongly predicted to be extracellular; sheet S2: the N-terminally anchored LaCOGs with C-terminal inside topology.

Additional file 3

The extracellular pseudogenes. The secretome pseudogenes. The pseudogenes with wrongly annotated start/end codons were corrected and concatenated with corresponding gene neighbors. The resulting proteins seem to have homologs in various LaCOGs. The concatenated protein sequences are listed in the last column, with an "x" showing the conjunction site of each sequence.

Additional file 4

Interesting cases of extracellular protein families. The distribution of binding protein families: sheet S1: Nlpc-P60 families; sheet S2: Cell surface hydrolase; sheet S3: Binding proteins.