Literature DB >> 32236097

A systematic pipeline for classifying bacterial operons reveals the evolutionary landscape of biofilm machineries.

Cedoljub Bundalovic-Torma1,2, Gregory B Whitfield1,2, Lindsey S Marmont1,2, P Lynne Howell1,2, John Parkinson1,2,3.   

Abstract

In bacteria functionally related genes comprising metabolic pathways and protein complexes are frequently encoded in operons and are widely conserved across phylogenetically diverse species. The evolution of these operon-encoded processes is affected by diverse mechanisms such as gene duplication, loss, rearrangement, and horizontal transfer. These mechanisms can result in functional diversification, increasing the potential evolution of novel biological pathways, and enabling pre-existing pathways to adapt to the requirements of particular environments. Despite the fundamental importance that these mechanisms play in bacterial environmental adaptation, a systematic approach for studying the evolution of operon organization is lacking. Herein, we present a novel method to study the evolution of operons based on phylogenetic clustering of operon-encoded protein families and genomic-proximity network visualizations of operon architectures. We applied this approach to study the evolution of the synthase dependent exopolysaccharide (EPS) biosynthetic systems: cellulose, acetylated cellulose, poly-β-1,6-N-acetyl-D-glucosamine (PNAG), Pel, and alginate. These polymers have important roles in biofilm formation, antibiotic tolerance, and as virulence factors in opportunistic pathogens. Our approach revealed the complex evolutionary landscape of EPS machineries, and enabled operons to be classified into evolutionarily distinct lineages. Cellulose operons show phyla-specific operon lineages resulting from gene loss, rearrangement, and the acquisition of accessory loci, and the occurrence of whole-operon duplications arising through horizonal gene transfer. Our evolution-based classification also distinguishes between PNAG production from Gram-negative and Gram-positive bacteria on the basis of structural and functional evolution of the acetylation modification domains shared by PgaB and IcaB loci, respectively. We also predict several pel-like operon lineages in Gram-positive bacteria and demonstrate in our companion paper (Whitfield et al PLoS Pathogens, in press) that Bacillus cereus produces a Pel-dependent biofilm that is regulated by cyclic-3',5'-dimeric guanosine monophosphate (c-di-GMP).

Entities:  

Year:  2020        PMID: 32236097      PMCID: PMC7112194          DOI: 10.1371/journal.pcbi.1007721

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

The generation of novel genomes through next generation sequencing is creating a wealth of opportunities for understanding the evolution of biological systems. A key challenge is the development of robust and systematic approaches that allow genes to be classified into functional categories and which are also capable of inferring evolutionary relationships. In bacterial genomes, functionally-related genes corresponding to metabolic pathways or protein complexes are often encoded by neighbouring co-regulated and co-transcribed loci, termed an operon. Computational prediction of operons based on the conservation of short inter-genetic distances found between homologous genes across phylogenetically diverse bacteria has been frequently used to predict the biological roles of neighbouring uncharacterized genes [1-6]. Such approaches are valuable for computational inference of gene function in experimentally uncharacterized organisms and facilitate comparative genomics of adaptive traits across phylogenetically diverse bacteria. Analyzing patterns of sequence divergence within each gene yields insights into species-specific functionalities. However, genes in an operon do not function in isolation but typically form parts of higher-order, biological modules (e.g. protein complexes or metabolic pathways). Consequently, analysing evolutionary events in an operonic context provides additional opportunities to better infer functional relationships. For example, while sequence divergence has the potential to impact the function of a single gene, evolutionary events that alter operon structure (e.g. rearrangements, duplications, gains and losses) have the potential to dramatically alter the overall function of the operon [7,8]. Due to the lack of a systematic framework, very few studies have attempted to examine the influence of evolutionary events on operon structure [9,10]. Phylogenetic-tree based classification of 197 ATP binding motif sequences associated with operon-encoded bacterial ATP-binding cassette (ABC) transporters was successful in resolving two evolutionarily distinct transporter clades associated with import and export functions [11]. Gene duplications have been shown to play an important role in driving protein superfamily expansion and are positively correlated with bacterial genome size [12]. Duplications have been found to be associated with biological processes associated with environmental adaptation of species clusters [13], such as outer membrane polysaccharide export proteins involved in capsule biosynthesis [14,15], and amplification of beta-lactamase enzymes associated with increased antibiotic resistance [16]. The study of co-localized “gene blocks” across bacteria has also shown that gene duplication, loss, and rearrangement play important roles in shaping the large-scale organization of bacterial genomes [10]. Key to these analyses is the use of a rigorous and systematic approach for assigning genes into evolutionarily related ‘families’ that are likely to share similar functions. However, the inference of biological function based on sequence similarities of genes or proteins is often complicated by functional divergence arising through recent gene duplication events. A variety of metrics have been employed for determining the relatedness of genes and their protein products from which groups (i.e. clustering) can be defined. These metrics include: evolutionary distances derived through the construction of phylogenetic trees [17-19]; global protein sequence similarities [20-22]; and shared sequence features such as conserved amino acids at specific sites or shared amino acid subsequences, which define motifs or structural domains [23,24]. The aim of these approaches is to automatically resolve large protein families comprising potentially thousands of genes into a smaller number of clusters defining evolutionarily related subfamilies with similar biological roles. An additional challenge faced by clustering methodologies is defining which set of clusters result in an optimal partitioning of the underlying data. To help guide such partitioning, a variety of cluster validation approaches have been devised. These are broadly divided into two categories: external-validation and internal-validation, based on whether previous information is available for the data being clustered [25]. For example, methods developed for classifying orthologous genes (i.e. those related by common ancestry) and paralogous genes (i.e. those emerging from duplication after a speciation event) rely on internal-validation tests. In one such approach, internal branch lengths between one-to-many homologous gene relationships are compared between two species [26]. In an alternative approach, clustering is employed to define “triangles” of proteins with significant sequence similarity occurring between three distinct species [20,27]. However, to distinguish the finer-scale evolutionary relationships occurring within an orthologous group or gene family, phylogenetic tree construction is required. Such methods have typically focused on well-characterized biological systems, e.g. homologs of the bacterial flagellar and type III secretion system subunits [28] and diverse systems associated with the type IV filament superfamily [29], which have utilized an external-validation approach for defining functionally distinct phylogenetic clades. Here we build on these methods and present a framework for the systematic classification and analysis of diverse gene families in the context of operons. Focusing on synthase-dependent exopolysaccharide (EPS) biosynthetic machineries, we use our framework to explore how gene divergence in combination with duplication, loss, and rearrangement events have shaped the evolution of EPS operons, and may have influenced the biofilm producing capabilities of evolutionarily diverse bacteria. EPS are an important component of bacterial biofilms that not only ensure survival in response to limited nutrient availability, but are also involved in antibiotic tolerance, immune evasion and serve as virulence factors in many clinically relevant pathogens [30-32]. Distinct mechanisms have been identified in the production of bacterial EPS, including the well-characterized Wzx/Wzy and ABC transporter-dependent pathways [33], and synthase-dependent systems [34]. Typically, Gram-negative synthase-dependent EPS systems are organized as discrete operons comprised of genes encoding: 1) an inner membrane associated polysaccharide synthase; 2) a regulatory domain or co-polymerase subunit responsible for binding the intracellular signaling molecule cyclic-3’,5’-dimeric guanosine monophosphate (c-di-GMP); 3) periplasmic polysaccharide modification enzymes; and 4) a periplasmic tetratricopeptide repeat (TPR) domain coupled with an outer membrane pore [34]. This operonic organization allows bacteria to acquire complete EPS functionality through discrete lateral gene transfer events and may act as a key driver in niche adaptation [35]. To date five synthase-dependent EPS have been identified: cellulose, acetylated cellulose, poly-β-1,6-N-acetyl-D-glucosamine (PNAG), alginate and the Pel polysaccharide. While much interest has focused on the molecular basis of biofilm formation, these systems have been characterized for only a relatively limited set of bacterial species. Consequently, little is known concerning how these systems have evolved. Of interest is how mechanisms such as gene divergence, duplication, acquisition, loss, and rearrangement of EPS operons have contributed to bacterial adaptation to diverse environments, and from a human health perspective, contributed to a pathogen’s ability to infect and cause disease. While a previous survey of cellulose EPS machineries has been reported [36], a comprehensive systematic analysis of all EPS machineries is lacking. In this study, we describe a phylogenetic tree-based clustering method for defining protein sequence subfamilies and apply it to study the evolutionary relationships of operons. This method was employed for the systematic classification of EPS operons predicted from a survey of over a thousand bacterial genomes. Applying a graphical visualization approach, we demonstrate that phylogenetic clustering enables the resolution of discrete EPS operon clades which differ in their organization from experimentally characterized operons, providing valuable insights toward further understanding the roles of gene duplication, rearrangement, and loss/absence in the evolution of biofilm production between phylogenetically diverse species from distinct environmental niches. For example, we demonstrate the biological implications of operon evolution that has been shaped by horizontal gene transfer (HGT) and subsequent divergence, for two cellulose operon clades among Proteobacteria which correspond to the production of cellulose polymers with different structural organizations. Furthermore, we note that most of our operon predictions are novel and demonstrate the value of applying computational predictions to guide the discovery of EPS production in previously uncharacterized species. We highlight an example for Pel production, which was initially identified and characterized in Pseudomonas aeruginosa [37] and other Gram-negative bacteria. Our approach identified several pel-like operons in some Bacillus spp. and other Gram-positive bacteria, which appeared to be regulated by the intracellular signaling molecule c-di-GMP. In our companion paper (Whitfield et al PLoS Pathogens, in press) we experimentally validate these findings by demonstrating the production of Pel by the Gram-positive Bacillus cereus ATCC 10987 and its regulation by c-di-GMP.

Results

A systematic survey of bacterial EPS operons reveals EPS systems across bacteria of diverse lifestyles and environmental niches

A schematic overview of the pipeline we used for classifying bacterial operons is provided in . The process begins with a systematic survey of all five previously characterized synthase-dependent EPS systems (cellulose, acetylated cellulose, PNAG, Pel, and alginate) (S1 & ) through an iterative hidden Markov-model (HMM) -based search strategy and subsequent genomic-proximity based reconstruction of 1733 complete reference and representative bacterial genomes (downloaded April 20, 2015—see ). We identified 407 cellulose, 321 PNAG, 146 Pel, 64 alginate, and 4 acetylated cellulose EPS “operons” defined as comprising at least: 1) a polysaccharide synthase subunit; and 2) one additional locus involved in EPS modification or transport as defined previously [30] (). These could be allocated to 367, 288, 140, 60 and 4 different bacterial species, respectively (). We note that for all previously characterized EPS producing species included in this set of fully sequenced genomes, we successfully detected an operon corresponding to the type of EPS produced (). Furthermore, for experimentally characterized species lacking a fully sequenced genome, we also identified several closely related strains possessing an EPS operon identical to the characterized strain, providing a valuable resource for further experimental validation. PNAG was significantly enriched in pathogen genomes (161/288–56%; Chi-squared test p-value = 2.05e-09). Conversely, Pel (84/140–60%; Chi-squared test p-value ~ 1.83e-4), alginate (39/60–65%; Chi-squared test p-value ~ 3.5e-2) and cellulose (187/367–51%; Chi-squared test p-value = 2.05e-14) were significantly enriched in non-pathogen genomes (). Interestingly, both cellulose and PNAG operons were significantly associated with genomes with host-associated lifestyles (Chi-squared p-values ~ 1.05e-9 and ~1.48e-12, respectively). From a phylogenetic perspective, each EPS system was well represented by Proteobacteria, with cellulose, PNAG and Pel additionally featuring operons from Bacilli and Clostridia, which to our knowledge have not been previously reported (). Further, we note that Pel operons exhibited the greatest diversity of bacterial families (Shannon index of bacterial families– 2.74) with representation in Thermotogae, Actinobacteria and Rubrobacteria, among others. While most genomes contain only a single synthase-dependent EPS system, we observed many instances of co-occurrence (), with cellulose and PNAG systems being the most common combination (83 genomes), followed by alginate and Pel (20 genomes). Notably, all species possessing three systems were Pseudomonas spp., e.g.: Pseudomonas protegens strains Pf-5 and CHA0 (alginate, Pel and PNAG); Pseudomonas fluorescens SBW25 and Pseudomonas sp. TKP (acetylated cellulose, alginate, and PNAG).

Summary of predicted bacterial EPS operons.

(A) Overview of the main steps of the EPS operon prediction pipeline: 1) De novo construction of reference operon locus HMM models; 2) Sequence homology searches against 1733 fully sequenced bacterial genomes retrieved from NCBI; 3) Prediction of putative EPS operons through genomic-proximity reconstruction of significant locus hits; 4) Definition of clusters defining evolutionary relationships for each EPS locus; 5) Integration of operon and cluster predictions to visualize operon networks. (B) Number of predicted EPS operons and percentages of species summarized by bacterial lifestyle (pathogen, non-pathogen, unknown) and corresponding niche (host-associated, environmental/other, unknown). (C) Percentage of evolutionary events associated with EPS operons: Locus loss (core EPS operon loci not detected by HMM searches); Locus rearrangement (EPS operons featuring locus orderings that differ from the canonical operon for that type–S1 Table); Locus duplication (defined by two loci possessing a significant match to the same EPS HMM within the same operon); Operon duplication, defined as a genome encoding two copies of the same type of EPS system, separated by greater than 10 kb; Locus fusion, loci possessing significant matches to multiple EPS HMMs. (D) A cladogram based on 16S rRNA sequences illustrating the phylogenetic distribution of predicted EPS operons. Branches are coloured according to taxonomic class and piecharts represent the proportion of EPS operon types identified for each clade. Leaf labels represent the major taxonomic families found in each clade, along with the number of genomes represented. 16S rRNA sequences were retrieved from full genome sequences using Barrnap (https://github.com/tseemann/barrnap), clustered using CD-HIT [94] at 95% sequence identity and aligned using MUSCLE [92] for phylogenetic tree construction with Fasttree [103] and subsequent visualization with iTOL [104]. (E) An upset plot (generated using the R (https://www.r-project.org) package UpSetR [105]) representing the number of different EPS operon combinations identified across bacterial genomes.

Evolution of EPS operons is driven by gene duplication, loss and rearrangements

The processes underlying EPS operon evolution across diverse bacterial phyla are poorly understood. We examined how operon organization is influenced by the following evolutionary events that are likely to affect EPS production capabilities among bacteria: 1) single locus or whole operon duplications, which could lead to dosage effects and alter the level of EPS modification or export; 2) locus losses, that may indicate a reduction or loss in EPS production or modification, or may suggest supplementation of the lost function with a novel gene; 3) operon rearrangements which may affect the regulation of EPS production through the order of expression of individual EPS system components; and, 4) gene-fusions, resulting in enhanced co-expression of interacting subunits. For each set of predicted EPS operons, the resulting number of operon evolutionary events was defined relative to the locus composition and order of reference Gram-negative experimentally characterized operons [30,38-41]. In contrast to previous studies of operon evolution [28,29], we use the term evolutionary events to refer to key changes that define distinct organizations between evolutionarily distinct operon clades and not divergence of operons from an ancestral state. With the exception of acetylated cellulose, locus losses were found to be the most frequent event (~46% of predicted operons lacked one or more reference loci), and occurred with the greatest frequency for Pel which exhibited an average loss of 2.6 loci lost per operon (). Among all EPS systems the majority of locus losses were associated with the outer membrane pore encoding loci (441 / 993–44% of all locus loss events identified) among Gram-positive species (), consistent with the lack of an outer membrane bilayer in Gram-positive cell envelope architectures. Operon rearrangements were the next most frequent evolutionary events (~ 39%), largely associated with cellulose operons [36] (327 / 407–80%). Focusing on duplication events, within-operon loci duplications tended to be more common than whole-operon duplications (2 or more core EPS loci identified > = 1 kb apart), with the exception of cellulose operons (29 whole operon duplications compared to 24 loci duplications). All duplicated operons were found to be separated by at least 10 kb, suggesting they may have been acquired through HGT rather than tandem duplication of a pre-existing operon [42,43].

Systematic phylogenetic distance-based clustering of EPS operon loci and genomic-proximity networks identifies evolutionarily distinct operon clades

To better understand how these evolutionary events may have altered operon function, we next devised an agnostic, systematic classification strategy to cluster each family of EPS operon loci on the basis of phylogenetic distance (). In brief, for each EPS operon locus, multiple sequence alignments were generated and used to construct phylogenetic trees. From these trees, we defined sets of clusters through an iterative scan of the tree structure that captures an increasing sequence distance between family members, starting at the leaves and ending at the root. During this scan, sequences are grouped into a cluster if they share a common node (i.e. are within a specified evolutionary distance). To define the optimal set of clusters for each locus, we then applied three cluster quality scoring schemes (Q1, Q2 and Q3) based on the following metrics: proportion of sequences clustered (to maximize the number of sequences clustered); average silhouette score (to minimize the occurrence of clusters containing highly divergent sequences); and the Dunn index (to maximize the separation of closely related sequences from divergent sequences). For each scoring scheme, we defined the optimal pattern of clustering based on the evolutionary distance (expected number of substitutions per site) derived from a maximum-likelihood constructed phylogenetic tree (see s) that maximizes the quality score. Comparisons across scoring schemes (see below) for cellulose operon loci identified Q2 as providing the most informative sets of clusters. Applying this scoring scheme to all EPS loci revealed the average number of sequence clusters generated correlated with the total number of operons predicted for each type of EPS system (), which further corresponded to the underlying differences in species distributions of EPS systems (). For example, the cellulose system was predicted to have the largest average number of sequence clusters overall (30 clusters) and also had the greatest species diversity (Shannon index 2.16 –) compared to all other systems. Furthermore, for each EPS system the variability of the number of sequence clusters predicted per locus () suggests differing degrees of locus evolution that are likely to be the result of different structural and functional constraints. For example, a higher degree of conservation would be expected for glycosyl transferase (GT) subunits to maintain efficient co-ordination between polymerization and inner membrane transport of EPS, while increased variability of periplasmic modification enzymes suggests that only a subset of highly conserved motifs are required to carry out polysaccharide modification reactions.

Clustering of EPS loci.

(A) Schematic illustrating the process of scanning through a phylogentic tree and identifying sets of clusters associated at different evolutionary distance cutoffs. Here evolutionary distance is defined as the number of expected amino-acid substitutions normalized over the multiple sequence alignment length. To identify optimal patterns of clusters, we examined three scoring schemes (Q1, Q2 and Q3). Q1 is defined as the sum of the average silhouette score for all clusters: μ(s(i)) and the Dunn index (DI). Q2 is defined as the sum of the proportion of sequences identified in clusters (Σc/©m), μ(s(i)) and DI. Q3 is defined as the product of (Σc/Σm) and the sum of μ(s(i)) and DI. For the family of genes related to the bcsA locus, each scoring scheme identifies a different optimal evolutionary distance cutoff resulting in defining different sets of clusters. (B) Graph illustrating the average number of sequence clusters predicted (sum of # of clusters over all loci / total number of EPS loci) for each type of EPS operon. (C) Graph illustrating the average evolutionary distance of EPS loci cluster members with other members of the same cluster. (D) Cellulose operon networks generated using the different types of scoring scheme cutoffs used in (A). For each network, nodes indicate clusters of sequences representing individual cellulose loci, edges indicate genome proximity between the two linked loci. Nodes are organized into sets of four, ordered from top to bottom as bcsA, bcsB, bcsZ and bcsC. Node size indicates the number of family members associated with that locus cluster. Node colour indicates phylogenetic representation of cluster members. Edge colour indicates genomic proximity of phylogenetic clusters. At higher evolutionary distances (as defined by Q2 and Q3), networks yield more informative patterns of evolutionary relationships as illustrated by larger clusters of loci featuring larger number of interconnections. To compare patterns of clusters identified by each scoring scheme, we applied the three scoring schemes to each set of genomically-neighbouring protein sequences assigned by HMM searches cellulose EPS machinery. Here we focused on four core cellulose genes: bcsA, bcsB, bcsZ, and bcsC encoding the inner membrane GT, co-polymerase, periplasmic hydrolase and outer membrane export pore, respectively [36]. While previous studies have identified additional genes associated with the production of cellulose biofilms (e.g. bcsD, bcsE, bcsF, bcsG and bcsQ; [36]), to simplify our analyses we limited our query to bcsA, bcsB, bcsC, and bcsZ based on the observation that these are the most abundant genes when comparing all identified cellulose biosynthesis operons. Although this may bias the diversity of operons identified, given that a functional cellulose biosynthesis locus should contain the synthase genes bcsA and bcsB, we expect our analysis should result in few false negative predictions. From the resulting clusters we generated operon genomic-proximity networks (). These networks provide a visual display of the conservation of individual loci, together with their respective genomic proximity to yield patterns of sequence divergence associated with the emergence of distinct forms of operon organization. In the absence of any clustering (Q0), the network trivially resolves into individual operons featuring up to four loci. Applying the Q1 scoring scheme to each locus, the network reveals a variable number of clusters across operon loci, with each cluster generally comprising sequences belonging to the same bacterial genus. Application of the Q2 scoring scheme results in the generation of clusters of increased size, encompassing species featuring distinct operon organizations and compositions. For example, two distinct lineages of alpha-proteobacterial cellulose operons can be easily distinguished, one of which is more closely related in sequence and composition to gamma-proteobacterial operons, and a second which lacks two loci and appears evolutionarily divergent from gamma-proteobacterial operons [25]. However, these distinctions were more difficult to resolve using the Q3 scoring scheme due to clustering of highly divergent sequences. Given the trade-off between clustering highly divergent sequences (Q3) with the depiction of individual operons (Q1), we applied the Q2 scoring scheme to generate clusters for all EPS loci (). Using this locus-specific phylogenetic clustering approach, we were able to devise a classification scheme to define EPS locus clades based on the average evolutionary distance of a group of clustered locus sequences to a reference operon sequence (). For example, the cellulose polysaccharide synthase locus, bcsA, from Escherichia coli is assigned to clade 1, while divergent alpha-proteobacterial species including Rhodobacter sphaeroides are assigned to clade 2. We further resolved operons into distinct groups based on the genomic co-occurrence patterns of locus clades; e.g. for the cellulose operon (bcsABZC) we identify clade combinations of 1:1:1:1, 1:2:2:2 and 1:3:5:3, which correspond to operons identified in Escherichia spp. and other closely related enterobacteria, Klebsiella spp., and Burkholderia spp., respectively.

Phylogenetic clustering and genomic proximity networks reveal evolutionary events driving EPS operon divergence

Having generated clustering patterns for each EPS locus, we next used the sets of cellulose operon associated loci, bcsA, bcsB, bcsZ, and bcsC, to examine how these patterns might inform on the evolution of this EPS system. Relative to BcsA, the three other subunits (BcsB, BcsZ and BcsC) display greater sequence diversity as indicated by a larger number of sequence clusters (). Detailed structure-function studies of the BcsA-BcsB inner membrane cellulose synthase complex, outlined below, illustrate how these findings are consistent with their known functional roles. Further inspection of the cellulose operon network identifies a number of sub-networks comprised of taxon-specific loci clusters associated with distinct patterns of operon organization as illustrated through the following examples: 1) a subnetwork comprised of loci from several beta-proteobacteria, represented here by Burkholderia cenocepacia and Pandoraea promenusa (), which feature a rearrangement of the bcsA locus and novel locus gains (also supported by inspection of corresponding Genbank genomic annotations) as indicated by a genomic distance of > 0.1 kb between bcsA and the neighbouring locus bcsC; 2) a subnetwork composed of loci from several species of the alpha-proteobacterial Zymomonas show rearrangement of bcsZ and/or the loss of bcsB or bcsZ (). Further inspection reveal such losses were due to gene fusion events; 3) a subnetwork composed of loci from a separate group of alpha-proteobacteria which reveals a diverse set of bcsB loci that additionally lack the bcsC outer membrane pore (); 4) a subnetwork of loci from a group of gamma-proteobacteria reveal instances of HGT and divergence (). In this latter example, our network identifies two distinct clades of operons, sharing a common group of bcsA loci, but featuring two evolutionarily divergent sets of bcsB, bcsZ and bcsC loci which co-occur in several genomes separated by inter-genic distances greater than 10 kb. Detailed investigation of the operonic arrangements of species possessing single copies of either of these clades of operons reveal two distinct loci organizations: the first representing the canonical cellulose locus order (clade A1), bcsABZC, found among E. coli and Salmonella enterica strains; while the second represents a non-canonical locus ordering (clade B1), in which the periplasmic glycoside hydrolase, BcsZ, has undergone a rearrangement, bcsABCZ. This clade is found among Dickeya, Erwinia and Pantoea spp. (). Of note, we found that several species (e.g. Enterobacter and Klebsiella spp.) possess both operon clades. The clades have been suggested to have originated by HGT [19]; a hypothesis further supported by our phylogenetic clustering assignments (). Furthermore, we identified two additional divergent BcsB sequences associated with a novel organization of operon clade B1 and include several loci with other roles in cellulose production (designated operon clade B2; ). The divergence of BcsB sequences associated with clade B2 were also found to distinguish bacterial genomes possessing multiple cellulose operons of distinct evolutionary lineages: Proteus mirabilis (2 cellulose operons: Clades A1 and B2) and Enterobacter spp. (3 cellulose operons: Clades A1, B1 and B3) (). Additional sequence database searches revealed that the non-core loci associated with operon clades B2 and B3 share functionally homologous loci to the cellulose accessory protein D (AxcesD), which has been characterized as increasing the efficiency of cellulose production in the Acetobacter xylinus cellulose synthase complex [44]; GalU an uridine triphosphate (UTP)-glucose-1-phosphate uridylyltransferase involved in cellulose precursor biosynthesis; and an additional uncharacterized locus predicted to possess both PAS_9 and GGDEF signaling domains, indicating the potential adaptation in Proteus and Enterobacter spp. to produce varied forms of cellulose upon different environmental stimuli [45].

Genomic-Proximity network of phylogenetically clustered cellulose operons.

Phylogenetically clustered operon loci are arranged vertically with respect to the canonical ordering of the cellulose operon (indicated by grey side bar). Inset boxes depict selected examples of cellulose operon clades, illustrating how the network can inform on evolutionary events: (i) Rearrangement of bcsA among betaproteobacteria–Here, bcsA appears in closer proximity to bcsC than to bcsB or bcsZ (as indicated by a cyan coloured edge for the former and a grey coloured edge for the latter). Further the cyan edge indicates a relatively large intergenic distance, suggesting a locus gain between bcsA and bcsC, confirmed upon inspection of the genome of Burkholderia cenocepacia; (ii) Rearrangement and gene fusions in alpha-proteobacteria–in examples 1 and 2, the red edge indicates operons in which bcsB is closer to bcsC than bcsZ, the cyan edges suggest that bcsZ is present, but appears after bcsC (example 1), while in other operons, bcsZ appears missing (example 2). Detailed inspection of example operons (e.g. Zymomonas spp.) reveals the fusion of the periplasmic hydrolase and outer membrane pore (BcsZC), in example 3, the apparent loss of bcsB in another Zymomonas spp. is explained by a fusion between the inner membrane cellulose synthase complex subunits (BcsAB); (iii) Loss of outer membrane pore, BcsC, and divergence of the inner membrane cellulose co-polymerase, BcsB, in alpha-proteobacteria–in these taxa, BcsB appears highly divergent (as indicated by their identification through more sensitive HMM searches–grey nodes) and no BcsC was identified (confirmed through inspection of representative operons). Further interpretation of the operons identified in the box denoted with an asterisk ‘*’, which represent HGT events, are illustrated in Fig 4. Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the taxonomic distribution of loci for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of loci clusters.
Fig 4

Horizontal gene transfer of cellulose operons identified from analysis of the genomic-proximity network.

Here we show how a subgraph (A) from the global cellulose EPS operon genomic-proximity network (), may be interpreted to reveal HGT events involving two distinct gamma proteobacterial operon clades, A (canonical bcsABZC) and B (bcsABC-Z). (B) Examples of operons in two species which possess either a single A1 (“canonical”) or B1 (rearrangement of bcsZC) operon clade. (C) Example from Klebsiella pneumoniae in which a single genome contains both A1 and B1 operons, indicating a HGT event. (D) Example from Proteus mirabilis featuring two copies (designated A2 and B2 respectively) of the cellulose EPS operon, which appear to be divergent forms of A1 and B1: A2 features an apparent loss of the bcsZ locus from A1; B2 features a locus gain between bcsC and bcsZ from B1. Example from Enterobacter spp. in which the genome carries three copies of the cellulose EPS operon. In addition to clade A1 and B1 operon arrangements, a further operon (designated B3) appears in which bcsB has diverged from a B2 clade operon. Arrows within the network schematics depict the order of loci within the operon and are coloured according to intergenic distance: red < 100bp; cyan >100bp & <5 kb; grey >5 kb.

Horizontal gene transfer of cellulose operons identified from analysis of the genomic-proximity network.

Here we show how a subgraph (A) from the global cellulose EPS operon genomic-proximity network (), may be interpreted to reveal HGT events involving two distinct gamma proteobacterial operon clades, A (canonical bcsABZC) and B (bcsABC-Z). (B) Examples of operons in two species which possess either a single A1 (“canonical”) or B1 (rearrangement of bcsZC) operon clade. (C) Example from Klebsiella pneumoniae in which a single genome contains both A1 and B1 operons, indicating a HGT event. (D) Example from Proteus mirabilis featuring two copies (designated A2 and B2 respectively) of the cellulose EPS operon, which appear to be divergent forms of A1 and B1: A2 features an apparent loss of the bcsZ locus from A1; B2 features a locus gain between bcsC and bcsZ from B1. Example from Enterobacter spp. in which the genome carries three copies of the cellulose EPS operon. In addition to clade A1 and B1 operon arrangements, a further operon (designated B3) appears in which bcsB has diverged from a B2 clade operon. Arrows within the network schematics depict the order of loci within the operon and are coloured according to intergenic distance: red < 100bp; cyan >100bp & <5 kb; grey >5 kb.

Genomic-proximity networks of pel operons reveal a novel pel locus in the gram positive bacterium, bacillus cereus that is regulated by c-di-GMP

Examination of the genomic-proximity networks of pel loci also reveal novel operon organizations across phylogenetically divergent bacteria (). As with cellulose loci bcsA and bcsZ, we identify examples of operon rearrangements involving pelB (outer membrane transport pore and TPR domain) loci and pelA (periplasmic modification hydrolase) (), across several species associated with diverse environments. Again, consistent with our findings for cellulose, we noted loci losses and acquisitions. Although it has not been demonstrated that the pel operon forms a trans-envelope biosynthetic complex, the ordering of operon loci has been shown to play an important role in the assembly of macromolecular complexes [46] and optimizing biosynthetic pathways [47], suggesting that there exists a functional coupling between Pel outer membrane transport and periplasmic modification [48]. However, the effects of these rearrangement events on Pel production still remain to be experimentally investigated.

Genomic-proximity network of phylogenetically clustered pel operons.

Phylogenetically clustered operon loci are arranged vertically with respect to the canonical ordering of the pel operon (indicated by grey side bar). As for Fig 4, inset boxes depict selected examples of pel operon clades, illustrating how the network can inform on evolutionary events: (i) Canonical organization of the pel operon, as defined in the Pseudomonas aeruginosa genome.; (ii) Duplication of the pel operon in Nitrosospira multiformis with subsequent evolution through locus gain and loss, as well as rearrangement of pelA; (iii) pelB fission, locus gain and rearrangement in aquatic thermophilic species; (iv) A potentially novel duplicated pel operon identified in Leptospirillum ferrooxidans comprised of divergent pelA and pelF loci; (v) pel operons identified in Gram-positive species including divergent pelD loci involved in regulation through c-di-GMP. Node size indicates the relative number of sequences per phylogenetic cluster; node border colouring represents the taxonomic distribution of loci for a given cluster; grey filled nodes indicate loci predicted by iterative HMM searches; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of loci clusters. We also observed a high degree of overall conservation among components which are known to play key roles in Pel biogenesis, such as the putative polysaccharide synthase (PelF), putative inner membrane protein (PelG), hydrolase/deacetylase (PelA) and cyclic-di-GMP receptor (PelD) [32]. In contrast, a greater degree of divergence can be seen among inner (PelE) and outer membrane (PelB, PelC) transport associated loci. In these loci, there appears a consistent pattern of clustering across bacterial phyla suggesting co-evolution of potentially physically interacting components, however no evidence of interaction has been shown to date. Our genomic proximity network revealed two distinct clades in several Gram-positive species (). Of the synthase dependent EPS operons known to date, only PNAG production has been genetically and structurally characterized in Gram-positive Staphylococci [49]. Operons reconstructed from initial HMM searches identified putative pel operons in several Gram-positive bacteria, comprised of the GT encoding PelF and the PelG putative transport protein (). To determine whether these were bona-fide pel operons with additional loci, iterative HMM searches were performed including additional protein sequences from predicted pel operons. These searches revealed additional loci including a homolog of PelD (). C-di-GMP signaling in Gram-positive bacteria is less well characterized [50] and this finding suggests a role for this secondary metabolite in regulating biofilm formation in these species. In our companion paper, we have experimentally validated our predictions by showing that single gene deletions within the predicted B. cereus pel operon result in a loss of EPS production, and that PelD regulates EPS production through binding of c-di-GMP (Whitfield et al PLoS Pathogens, in press). This work provides the first and crucial piece of evidence which suggests that divergent Gram-positive pel operons, particularly those belonging to the same phylogenetic clade as B. cereus, likely possess the ability to produce a Pel-dependent biofilm. However, it is also possible that significantly divergent operon loci may not result in the production of EPS of identical composition to that characterized in Gram-negative bacteria, as in the case of PNAG modification between Gram-negative (pga) and Gram-positive (ica) operons [51].

Genomic-proximity networks of PNAG uncover locus loss and duplication events in pathogenic and environmental bacteria

To examine how locus duplication, loss, and rearrangement events have contributed to the evolution of PNAG operons across bacterial phyla, selected examples of pga operon clusters were identified and compared (). For example, within a group of enterobacteria possessing related pgaD loci, there exist a number of closely related pathogenic enterobacteria that have lost pgaA (E. coli ETEC H10407), as well as pgaB (Shigella flexneri 5 str. 8401). These losses suggest these taxa may no longer be able to produce PNAG (). Interestingly, we also observed a lack of pga operons among pathogenic Salmonella spp. genomes surveyed in this study. Previous work has shown that loss of PNAG production is associated with adaptation of Salmonella spp. to an intracellular pathogenic lifestyle [52]. Although PNAG production in E. coli H10407 has not been examined, our findings are also consistent with the adaptation of E. coli H10407 from a commensal to a pathogenic lifestyle [53], where enterotoxins and colonization factors serve a crucial role for attachment to the intestinal epithelium and enhanced toxicity [54,55]. Furthermore, it has been previously shown that biofilm formation in S. flexneri impairs invasive ability and virulence [56], which suggests that the loss of PNAG production in S. flexneri [57] is the result of adaptation to an intracellular mode of infection. These results shed light on the relationship between biofilm production capability and adaptation of enteric bacteria toward a pathogenic lifestyle. Based on the divergence of pgaB loci, we also identified pga operon clades corresponding to partial and whole operon duplications in aquatic bacteria, including a partial duplication of the pga operon specific to the important pathogen Acinetobacter baumannii spp. and Methylovora versatilis 301, respectively (). Also, in environmental bacteria we discovered a novel pga organization resulting from rearrangement of pgaC, and a lack of pgaB and pgaD loci, which may have been too divergent to detect from initial HMM searches (). Although our HMM models only used Gram-negative pga operon protein sequences, we also identified a number of Gram-positive pga operons consisting of pgaB and pgaC (). Upon closer inspection these loci were found to correspond to Staphylococcus polysaccharide intercellular adhesion (PIA) loci icaB and icaA, respectively. This suggests a potential common evolutionary origin of synthase-dependent PNAG production between Gram-positive and -negative organisms. A clade of pga operons were also identified possessing varying numbers of divergent pgaC loci resulting from repeated tandem duplication events (). Despite lacking a detectable pgaA locus, a possible role of these gene clusters in EPS production was investigated. One member of this operon clade, Thauera sp. MZ1T, inhabits a wide range of environments, and is an abundant producer of EPS responsible for viscous bulking in activated sludge wastewater treatment processes [58]. Furthermore, a recent mutagenesis study [59] demonstrated that biofilm-formation defective Thaurea mutants could be rescued by the complementation of the predicted pgaB deacetylase locus identified in the present study. Together, these findings suggest that the divergence of deacetylase and duplication of GT related loci in PNAG biosynthesis have resulted in the emergence of a distinct operon lineage.

Genomic proximity networks of alginate uncover distinct operon clades in pseudomonas spp. and atypical operon architectures in environmental bacteria

Although the majority of alginate operons were predicted largely among Pseudomonas spp. genomes (), phylogenetic clustering and genomic-proximity network reconstruction revealed an array of events influencing alginate operon evolution. For example, two distinct alginate operon clades were identified among Pseudomonas spp., defined by whole operon duplication and rearrangement of alginate polysaccharide modification loci (). Also identified were divergent, “atypical”, alginate operons () comprising extensive rearrangements and also losses of functionally related subsets of alginate loci, e.g. outer membrane transport loci (algKE), and polysaccharide modification machinery (algGXLIJF). Closer examination of the alginate genomic-proximity network also indicated a greater number of clusters for alg44 and algX loci, which were reflective of increased divergence among distinct alginate operon clades. Given that both loci play related roles in the regulation, polymer-modification, and assembly of the alginate EPS secretion machinery [60], these results provide an avenue for future research toward elucidating how species may modify alginate production to adapt to diverse environmental niches.

Genomic proximity networks of acetylated cellulose operons reveals duplication of co-polymerase subunits and sequence homology of loci with alginate acetylation machinery

From the genome sequences surveyed, only four species were identified as possessing acetylated cellulose operons, comprising two distinct operon clusters with differing operon constitutions among three Pseudomonas spp. and Bordetella avium 197N (). Contrary to cellulose phylogenetic clusters, the polysaccharide synthase, wssB, was divided into distinct Gamma- and Beta- proteobacterial clusters. We also found a distinct phylogenetic cluster identifying a unique tandem duplication of wssC in Bordetella avium 197N, which was not observed among orthologous cellulose bcsB co-polymerase loci (). This observation might suggest a divergent mechanism of action of cellulose inner membrane transport. As we previously observed (), 3 out of 4 of the predicted acetylated cellulose operons were also found to co-occur with alginate operons. Additional HMM-searches identified significant sequence similarity between acetylated cellulose wssBCDE operon sequences to those previously identified as bcsABZC, as well as between acetylated cellulose acetylation-machinery and their functional homologs in alginate operons (WssH–AlgI; WssI–AlgJ/AlgX). Taken together, these findings suggest that acetylated cellulose production has likely evolved through the duplication and operonic acquisition of the alginate acetylation machinery loci.

Sequence variability of phylogenetic clusters reveals different degrees of structural conservation of cellulose biosynthesis machinery

With the availability of a crystal structure for the BcsA-BcsB inner membrane complex responsible for cellulose biosynthesis and transport [61], we examined the potential structural and functional consequences of the sequence variability of the BcsA and BcsB phylogenetic clusters highlighted above (). In brief, we generated multiple sequence alignments of eight BcsA and BcsB sequences summarizing the evolutionary diversity of cellulose operon clades identified in . Residue conservation information from this alignment were subsequently mapped onto the structure of the BcsA-BcsB complex (PDB ID:4HG6 [62]; ). The results of the following analysis are also consistent when including all predicted BcsA and BcsB sequences. We identified a high degree of sequence conservation among BcsA sequences corresponding to the GT domain responsible for cellulose polymerization. Conserved residues mapped specifically to a cleft in the GT domain where a uridine diphosphate (UDP) carrier moiety is bound and oriented through a conserved QxxRW motif to enable polymerization of glucose monomers of the growing cellulose chain [61]. Conversely, the PilZ domain of BcsA, involved in regulation of the GT function in response to c-di-GMP levels shows low conservation overall, except for the subset of residues required for c-di-GMP binding. Further, the periplasmic region of BcsB shows low sequence conservation overall, aside from a number of highly conserved residues in the carbohydrate binding and ferredoxin domains. These residues include one (L193 of the Rhodobacter sphaeroides ATCC 17025 reference sequence) representing a putative cellulose binding residue oriented in close proximity to the growing cellulose chain near the exit of the BcsA inner membrane translocation channel. From phylogenetic sequence clustering, structurally relevant conservation features of the cellulose synthase complex were identified which should facilitate further investigation of cellulose EPS production across phylogenetically diverse species. For example, c-di-GMP binding residues of the PilZ domain of BcsA vary in conservation across phylogenetic clusters, which could impact the ability of the protein to bind the nucleotide. This may in turn limit access of activated glucose monomers to the GT domain, thus altering the rate of cellulose polymerization. Insertion/deletion events are also observed across BcsB phylogenetic clusters that may facilitate the recruitment of additional periplasmic processing proteins [63], or macromolecular assembly of the BcsA-BcsB complex [64]. This might result in differences in the higher-ordered structuring of cellulose microfibres as a consequence of adaptation to diverse environmental niches. These results demonstrate how the application of our phylogenetic clustering methodology can be extended to provide biologically informative insights into the function of components of EPS secretion machineries.

Phylogenetic clustering elucidates the structural and functional divergence of the pgaB locus, revealing the evolution of PNAG production across gram-negative and gram-positive bacteria

PNAG production is found across phylogenetically diverse species and is carried out by the pgaABCD operon of Gram-negative bacteria [39] and icaADBC operon of Gram-positive bacteria [65]. Although the functional and immunological properties of PNAG produced by the pga and ica loci appear to be similar [66], there are important differences between the roles of pga and ica operon loci [67]. Common to both operons is the presence of an integral membrane GT locus, pgaC and icaA, which are both members of the GT-2 family and share sequence homology [67]. In addition, non-homologous loci encoding integral membrane proteins, pgaD and icaD, are also present and required for the full function of their respective GTs [66,68]. In Gram-negative bacteria, PNAG production is regulated through physical interactions between PgaD and PgaC which are stabilized by the allosteric binding of c-di-GMP [56]. In Staphylococci, PNAG production does not depend on c-di-GMP and is likely regulated by an alternate signaling pathway [69]. Deacetylation of PNAG is carried out by pgaB and icaB loci and has been shown to play a crucial role in biofilm formation and immune evasion [51,70]. pgaB also possesses an additional C-terminal glycoside hydrolase domain which cleaves the PNAG polymer following its partial deacetylation [71], although the mechanism of how these activities are coordinated and the biological role of the hydrolase activity is unknown. Unique to pga operons is a locus encoding an outer membrane export pore, pgaA [72] and, in ica operons, an additional integral membrane protein, icaC, which has been proposed to be involved in PNAG O-succinylation [67]. Using Gram-negative pga loci as seed sequences for the reconstruction of synthase-dependent PNAG operons, we were also able to identify Gram-positive ica operons based on significant sequence similarities to pgaB and pgaC loci. Our phylogenetic clustering approach also revealed that pgaC/icaA sequences clustered into a single clade, while pgaB/icaB were associated with distinct sequence clades (). To explore the evolution of Gram-negative and Gram-positive pga and ica operons, we generated multiple sequence alignments for representative sequences of 18 PgaB clades. Our phylogenetic clustering results confirm previous observations [67] that the glycoside hydrolase domain is exclusively associated with Gram-negative pga operons (PgaB_G1) and is absent in a clade of Staphylococcus Gram-positive ica sequences (PgaB_G3; ). We also identified additional Gram-positive icaB clades among non-Staphylococcus spp., e.g. Bacillus and Lactococcus (), which possess operons lacking the icaC locus [67]. Interestingly, we also identified a number of divergent Gram-negative pgaB clades resembling icaB clade sequences. Members of these clades lacked the canonical N-terminal glycosyl hydrolase domain, and were distinguished by possessing N-terminal fusions, primarily of GT domains. Furthermore, these pgaB clades are associated with operons lacking detectable pgaA outer membrane pore locus and pgaD (). Although PNAG production in these species has not been experimentally confirmed, these findings suggest that if the polymer is produced, it may be regulated through a novel mechanism, that glycoside hydrolase activity might not be essential for PNAG export across all Gram-negative species, and that other modes of export may exist. The N-terminal fusion of GT with PgaB de-acetylase domains would also suggest that the de-acetylase activity of PgaB in these organisms may be associated with the periplasmic face of the inner membrane, in contrast to dual domain PgaB clades where the protein is predicted to function at the periplasmic face of the outer membrane [72]. In addition to these novel domain fusion events, PgaB phylogenetic clustering enabled us to resolve distinct events affecting the evolution of the deacetylase domain across different operon clades. Using the E. coli K12 MG1655 sequence of the largest PgaB clade (PgaB_G1) as a reference, multiple sequence alignments against other representative PgaB clade sequences identified several regions of insertion/deletion events (). When these regions were mapped to the published crystal structure of PgaB (PDB ID: 4F9D [73]), they were found to correspond to distinct structural elements surrounding the conserved deacetylase core (). We assigned insertion/deletion regions a number according to their order of appearance in the multiple sequence alignment of PgaB deacetylase domains, and divided them into two categories (). The first two indel regions, 1 and 2, resided in the N-terminal region of the reference E. coli sequence, and corresponded to beta-strands flanking the conserved active site residues involved in deacetylation, His55, Asp114, and Asp115. Region 1 was associated with Gram-positive icaB and comprised insertions of ~10aa in Staphylococcus aureus VC40 (PgaB_G3), as well as Bacillus infantis NRL B-14911 (PgaB_G7), Lactobacillus plantarum 16 (PgaB_G9), Leptospirillum ferriphilum ML-04 (PgaB_11). Structural characterization of Ammonifex degensii IcaB (PgaB_G3) identified residues overlapping with Region 1 as encoding a hydrophobic loop responsible for membrane localization in this species [74]. Region 2 was found to be exclusive to Gram-negative pgaB loci and comprised a much larger insert of ~77aa in Geobacter metallireducens GS-15 (PgaB_G2), Crinalium epipsammum PCC 9333 (PgaB_G5), and Colwellia psychrerythraea 34H (PgaB_6). The functional role of this insertion is unknown. The last three insertion/deletion regions, 3–5, occurred in a region oriented away from the deacetylase active site, and correspond to two beta-turn motifs and an alpha-helix cap, respectively. To further elucidate the biological import of identified PgaB indel regions, we examined regions 3, and 5 in the context of Gram-negative PNAG modification. In the E. coli K12 MG1655 PgaB_G1 sequence, region 3 encompasses a beta-turn with an elongated loop, which is spatially proximal to a disordered loop and alpha helix (pos. 367–392) on the N-terminal region of the PgaB glycoside hydrolase domain. Region 3 also encodes a histidine (E. coli PgaB—H189) which is part of the nickel binding pocket of Gram-negative PgaB deacetylases. Both regions contain polar and electrostatically charged residues which are highly conserved across PgaB_G1 sequences (). Region 5 corresponds to an 8 amino acid elongation of an alpha-helix (pos. 219–226), which also appears to provide an additional point of contact between the deacetylase and hydrolase domains. Although region 5 is also shared with icaB associated sequences (PgaB_G3), region 3 appears only in other dual deacetylase-hydrolase Gram-positive pgaB sequences identified in the sporulating bacteria Lachnoclostridium phytofermentans ISDg and Kitasatospora setae KM-5043. Although initial PFAM searches failed to identify the additional Gram-positive C-terminal domains, subsequent BLAST searches revealed them to be homologous to glycoside hydrolases. In region 4 a unique 29 amino acid insertion was also identified in Lachnoclostridium phytofermentans ISDg (PgaB_G16), which may play a compensatory role for the absence of 9aa in region 3. These insertion regions suggest an overall functional importance in ensuring stability between each domain and could play a role in coordinating their activities. These findings in combination with our identification of ica-like operon organizations among environmental Gram-negative species () suggest that Gram-negative pga operons may share a common evolutionary origin with Gram-positive ica operons. Recent research is providing growing evidence for the emergence of the di-derm Gram-negative architecture from sporulating monodermal Gram-positives [75], which provides a plausible evolutionary context for the insertion/deletion events observed among pgaB/icaB deacetylase domains. Through the loss of inner membrane localization [74] (Region 1), the compensatory gain of an N-terminal palmitoylation site [66], along with a C-terminal fusion of a hydrolase domain (Regions 3–5), an ancestral deacetylase locus may have been adapted to regulate the export of PNAG [66] at the outer-membrane of Gram-negative pga operon lineage.

Discussion

In this work we describe a novel and generalizable approach for the systematic classification and presentation of bacterial protein families in the context of their host operon. Protein families are defined as sets of homologs (groups of related sequences having a common evolutionary ancestor) sharing a particular set of sequence motifs or structural domains that can be utilized to determine their biological roles. For example, the PFAM database utilizes curated sets of protein family sequences in the generation of profile HMMs [76]. A key challenge that complicates the definition of these relationships are evolutionary events such as duplication, gene fusion, and HGT. In attempts to account for such events, a variety of computational approaches have been developed for refining functional assignments. These operate either by graphical clustering of pair-wise protein sequence similarities (e,g, COG [27], OrthoMCL [19] and EggNOG [20]), or through the generation of hierarchical evolutionary relationships and construction of phylogenetic trees (e.g. TreeFAM [77] and TreeCL [78]). However, these methods are limited in their ability to provide further resolution of sequence diversity within a family that might otherwise offer additional insights into evolutionary events that allow taxa to adapt to specific environments. Agnostic approaches to define sub-clusters of evolutionarily related protein families have ranged from phylogenetic tree reconstructions [79] to hierarchical clustering of pairwise global sequence alignments [80]. Here we present an extension of previous efforts and introduce a novel systematic approach for defining protein sub-family relationships through the clustering of phylogenetic trees. Key to this approach is defining a scoring function that allows a phylogenetic tree to be resolved into optimal clusters that best capture the similarities between cluster members, as well as the dissimilarities between clusters. Combining two clustering quality metrics (Silhouette and Dunn index) and proportion of sequences clustered, we demonstrate that our approach classifies a diverse array of operon-associated protein families into taxonomically consistent and functionally informative sub-clusters. Genomic-proximity networks were also constructed to provide an intuitive means of utilizing phylogenetic clusters to examine diverse mechanisms of operon evolution across taxonomically diverse bacterial genomes. Genomic-proximity networks have previously been utilized for inferring functional relationships [81], understanding mechanisms underlying bacterial genomic organization into functionally related gene clusters [82], and transcriptional regulation of bacterial operons [83]. In this study we extend the application of genomic-proximity networks as a tool for the systematic exploration of operon evolution resulting from locus divergence, loss, duplication, and rearrangement events. To demonstrate the effectiveness of our approach, we applied our methods to classify the synthase-dependent bacterial EPS operon machineries for 5 different polymers: cellulose, acetylated cellulose, alginate, Pel and PNAG. There has been only one previous attempt to classify synthase-dependent EPS operons and this focused specifically on the cellulose system [36]. In that study, cellulose operons were categorized into four major types, based on the presence or absence of experimentally validated accessory loci involved in cellulose production. Here, we based our analysis on the four core operon loci, bcsABZC, deemed essential for cellulose production. Cellulose operon clades identified in this study showed little consistency with the previously defined four major cellulose operon types [36], suggesting that the conservation of accessory loci is more variable across bacterial species compared to loci encoding core EPS functionalities. However, one operon type was identified in this analysis, representing the loss of the BcsC outer membrane transporter identified among a subset of alpha-proteobacterial genomes, which include several known cellulose producing species [62,84] suggesting a novel mechanism of cellulose export () [36]. We also found that the loss of BcsC has resulted in an increased divergence of BcsB loci in these genomes, which highlights the key role of BcsB as an intermediary between cellulose biogenesis and periplasmic transport (). In general, inner membrane components involved in EPS polymerization were found to be relatively conserved across all systems examined, while periplasmic and outer membrane components showed a relatively increased degree of evolution, which are likely to have important functional implications. For example, in the cellulose and Pel operon networks (Figs ), rearrangement events involving the periplasmic glycosyl hydrolase (BcsZ) and glycosyl hydrolase/deacetylase (PelA) were found to be a defining feature of several operon clades. It is interesting to note that these rearrangements have resulted in a change in the ordering of bcsZ and pelA relative to their respective outer membrane transport pore loci, which highlights the important role of polysaccharide modification in both the biogenesis and regulation of extracelluar EPS transport [48,85,86]. Similarly, the rearrangement of alginate modification machinery loci (algIJF) was observed as a distinguishing feature of Pseudomonas spp. operon clades. These findings suggest that rearrangement and locus ordering may serve as an important means of regulating EPS production by modifying the timing of translation of modification enzymes, which could affect the assembly of EPS complexes or the structural properties of EPS produced [47,64,87]. Furthermore, identifying operon clades through a phylogenetic approach elucidated numerous instances of cellulose whole operon duplications arising from HGT of two evolutionary distinct operon clades (). Such large-scale duplications, if they are functional, may either serve as a dosage response to given environmental stressors, as observed in the duplication of bacterial multiple-drug transporter operons [88], or could be under the regulation of different environmental stimuli. Interestingly, representative species of the two cellulose operon lineages identified in HGT events, e.g. the plant and human pathogens, D. dadantii and S. enterica, respectively, are known to produce structurally distinct forms of cellulose with different properties and roles in pathogenesis [89,90]. In addition, we identified that BcsB divergence was also seen to accompany the rearrangement or horizontal transfer of these operons, which further suggests that it may play a key role in the fine-tuning of cellulose production by coordinating the export of growing cellulose polymers through the periplasm. Furthermore, our analyses of acetylated cellulose, alginate and PNAG operons suggest a dynamic evolutionary scenario for the evolution of EPS biofilm production through the acquisition of novel polysaccharide modification loci. The limited number of acetylated cellulose operons identified, their frequent co-occurrence in alginate possessing species, and significant sequence similarities between acetylation machinery loci, suggests that the cellulose acetylation machinery is likely to have originated from previously existing alginate operons in Pseudomonas spp. The evolutionary trajectories of Gram-positive and Gram-negative PNAG operon lineages appears to have resulted through the fusion of glycosyl hydrolase and deacetylase domains in Gram-negative pgaB loci. A further key finding from this study was the identification of homologous pel operons in the genomes of several Gram-positive bacteria. With the additional identification of homologs of PelD through iterative HMM searches, our analyses have uncovered a novel example of c-di-GMP regulation of biofilm machinery in Gram-positive bacteria. In the accompanying paper we experimentally validate that a predicted pel-like operon in B. cereus ATCC 10987 is responsible for biofilm production and is regulated by the binding of c-di-GMP to PelD (Whitfield et al PLoS Pathogens, in press). Together this work demonstrates a novel integrative approach combining phylogenomics and genomic-context approaches to systematically explore the adaptive implications of sequence divergence of protein families associated with operon associated EPS secretion machineries. Further extension of this work holds great potential as a general approach for elucidating how bacterial operon encoded biological pathways and complexes have contributed to bacterial adaptation to and survival in diverse environmental niches and lifestyles.

Methods

Sources of data

Sequences corresponding to experimentally characterized EPS operon loci were obtained from the National Centre for Biotechnology Information (NCBI) reference sequence database [91] (). Fully sequenced genomes and associated protein sequences were obtained for 1758 bacteria from the NCBI (Retrieved April 20th 2015) (). In this set of genomes the major phyla represented are largely Gram-negative Proteobacteria (47%), followed by Gram-positive Firmicutes (~20%) and Actinobacteria (~10%). For each bacterial strain predicted to possess an EPS operon, metadata corresponding to niche (host-associated or environmental) and lifestyle (pathogenic or non-pathogenic) were collated from literature searches (–and is also made available for download from https://github.com/ParkinsonLab/eps_biofilms).

Prediction of EPS operons

To identify putative EPS operons, we applied an iterative HMM-based sequence similarity profiling strategy. For each set of EPS loci, we first constructed a HMM from previously characterized EPS producing bacteria (); alignments were constructed using MUSCLE v.3.8.1551 [92], with default settings, from which HMM-models were built using HMMER v.3.1b2 [93], with default settings. As the number of characterized EPS producing species varies greatly by system, each HMM was then used to identify additional EPS loci within the set of 1733 bacterial genomes and build an expanded HMM model for downstream operon prediction. First a protein-BLAST search of reference sequence EPS protein coding sequences was performed against the set of reference+representative completely sequenced bacterial genomes downloaded from NCBI (e-value threshold of < = 1e-5). Next, these putative EPS sequences were subject to a second-round of all-vs-all protein-BLAST searches. These results were then processed using an in-house custom perl script to select non-redundant sequences (percentage-identity of < 97%) in an incremental fashion: starting with the reference EPS sequence used in the first step, its highest scoring non-redundant match was selected and subsequently used to identify the next non-redundant sequence with the highest scoring match, and repeated until a pre-defined number of sequences were selected. To capture a consistent degree of locus sequence diversity and HMM sensitivity for each EPS system, we selected 20 sequences to represent each locus with the expectation that this would provide an adequate lower-bound on sampling potential amino acid substitutions for sites undergoing random mutation, i.e. not-under functional constraints. Using these new sets of HMMs, sets of EPS loci for the reconstruction of EPS operons (see below) were predicted through sequence similarity searches of the 1733 genomes using HMMER, with default settings. Significant sequence matches were defined as those with E-values < = 1e-5. HMM models as well as the table detailing lifestyles and environmental niches of bacteria used in this study () have been made available online (https://github.com/ParkinsonLab/eps_biofilms). To reconstruct putative EPS operons from the sets of loci retrieved from our searches, we first retrieved locus start and stop positions for each locus from their RefSeq entry. We then define putative operons using the following two rules: first only loci that occur within a distance of twice the size of a reference EPS operon to other loci are considered; second intergenic distances of individual loci must be < = 5 kb; third putative operons must consist of at least one locus encoding a putative polysaccharide synthase, together with at least one other locus. To detect previously undiscovered loci that may have been missed in the first rounds of HMM searches, predicted loci of reconstructed operons were used to generate expanded locus-specific HMM models and were subjected to an additional round of HMM searches. This process was performed using custom Perl scripts and results in a list of predicted EPS operons identified in each of the 1733 genomes.

Classification of evolutionary events

For each EPS system (cellulose, acetylated cellulose, PNAG, Pel, and alginate), the locus assignments of each reconstructed operon was compared to a defined reference EPS operon compositions and locus ordering () and were classified into the following evolutionary events; 1) locus losses—the total number of reference loci missing or not detected by HMM searches; 2) locus duplications–number of distinct loci appearing as multiple significant hits to the same HMM model < 10 kb apart; 3) locus fusions–the number of loci that were significant hits to two or more reference EPS locus HMM models; 4) operon rearrangements–the number of predicted operons with locus ordering (accounting for transcriptional direction) different from the reference operon; 5) operon duplications–number of predicted operons (as defined above) present in the same genome > = 10 kb apart.

Classification of EPS loci

Systematic classification of each EPS operon family starts with first merging closely related sequences using CD-HIT v.4.6.3 [94] with default settings (using global sequence identity threshold 0.9; word length 5) to generate a non-redundant set of sequences for each family. Multiple sequence alignments (MSAs) were then generated using MUSCLE and trimmed using trimal v.1.2rev59 [95] (using -automated1 setting). The resulting alignment was then used to construct a consensus phylogenetic tree using PhyML v.3 [96], with default parameters (LG substitution model, with 1000 bootstrap replicates). For each consensus tree, pairwise evolutionary distances (defined as the number of expected average number of amino-acid substitutions per site) for all locus protein sequences were extracted using a custom perl script. These evolutionary distances were subsequently used to iteratively generate sets of clusters, with proteins sharing an evolutionary distance less than a defined cutoff (starting at 0 and incrementing by 0.01) placed in the same cluster. This results in the generation of increasingly coarse clusters of sequences with increasing sequence dissimilarity, such that in the final step all sequences are assigned to a single cluster. At this stage, for all possible clusterings three metrics are calculated and summed together to calculate a clustering quality score: (1) proportion of sequences clustered (p) number of sequences clustered / total number of sequences); (2) the average silhouette score (s_avg) [97]: For each sequence, i, its silhouette score, s(i), is defined as: Where a(i) = average evolutionary distance (expected number of substitutions per site) i) is the lowest average evolutionary distance to any other cluster of which i is not a member; and (3) Dunn index (DI)[98], for a set of m clusters, its Dunn index, DI, is defined as: Where DI is the evolutionary distance between clusters i and j and Δc is the size of cluster c. Note that a higher s(i) indicates that a sequence is well matched to other members of its cluster and not well matched to neighbouring clusters. Furthermore, a higher DI indicates clusters that are compact (smaller cluster sizes) and well differentiated (larger inter-cluster distances). Thus, the evolutionary distance cutoff which maximizes p + s_avg + DI is chosen as the optimal phylogenetic clustering for a given set of EPS locus sequences. In these analyses we have chosen not to incorporate bootstrap support parameters. Bootstrap values have previously been used in phylogenetic-based clustering approaches (particularly for epidemiological studies of pathogen transmission and evolution [99,100]) and provide an important metric in assessing the reliability of phylogenetic tree topologies [101]. In contrast, we define clusters in a topology-agnostic manner by clustering sequences based only on pairwise evolutionary distances. This provides a readily automated approach for defining evolutionary relationships, particularly for sets of sequences that exhibit diverse phylogenetic distributions that are subject to varying evolutionary selection pressures and feature a variety of sequence lengths.

Construction of EPS operon genomic-proximity networks

To visualize evolutionary and genomic organization relationships of predicted EPS operons, genomic proximity networks were generated in which each node represents an individual EPS locus cluster (as defined above), and an edge connecting a pair of nodes represents the average genomic distance (base pairs) between loci represented by each node found in the same genome. Further, nodes are represented as pie-charts indicating phylogenetic distribution of each EPS locus, as defined by NCBI taxonomic classification scheme. Networks were visualized using Cytoscape (version 3.5) [102].

Lifestyle and niche distribution of predicted EPS operons.

The number of bacterial genomes with different combinations of predicted EPS operons, further represented with their distribution (% bacterial genomes) across different lifestyles and environmental niches. (PDF) Click here for additional data file.

Species diversity of predicted synthase dependent EPS systems (shannon diversity).

(PDF) Click here for additional data file.

Identification of gram-positive pel operons.

(A) Subnetwork depicting Gram-positive pel operon clades with varying numbers of loci identified as significant matches (e-value < 1e-5) in first-pass (unfilled nodes) and iterative HMM searches (grey nodes). Selected examples shown: (i) PelA-PelFG sequences identified by first-pass HMM hits; (i.b) Iterative HMM searches identifying additional pelA loci in B. cereus ATCC 10987, a known pellicle producing Gram-positive; (ii) Additional pelD loci identified by iterative HMM; (iii) Gram-positive pel operons with only pelF and pelG loci identified. (B) Operon organizations of selected examples of Gram-positive pel operons (corresponding highlighted in panel A) with additional highly divergent loci identified (red boxes: hits above HMM e-value threshold of 1e-5). (PDF) Click here for additional data file.

Genomic-proximity network of phylogenetically clustered pga operons.

Phylogenetically clustered operon loci are arranged according to the canonical pga operon ordering indicated by the grey sidebar. Inset boxes depict selected examples of pga operon clades distinguished by evolutionary events: i) Divergence of pgaD corresponding to related enterobacterial species including pathogen-specific losses of pgaA and pgaB loci critical for PNAG export; ii) Operon duplications occurring in aquatic niche dwelling bacteria, including a partial duplication of the pga operon specific to the opportunistic pathogen Acinetobacter baumannii spp. and a whole operon duplication identified in Methylovora versatilis; iii) A unique pga operon organization among environmental bacteria lacking a pgaD locus; iv) Gram-positive ica operons (annotated by their HMM hits to corresponding Gram-negative pga loci) with divergent icaB loci, resulting from novel domain acquisitions (iv.b and iv.c); v) A novel pga derived operon resulting from multiple tandem duplications of the pgaC polysaccharide synthase and lack of detectable pgaA outer membrane pore and pgaD. Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the taxonomic distribution of loci for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of loci clusters. (PDF) Click here for additional data file.

Genomic-proximity network of phylogenetically clustered alginate operons.

Phylogenetically clustered operon loci are arranged according to the canonical alginate operon ordering indicated by the grey sidebar. Inset boxes depict selected examples of alginate operon clades distinguished by evolutionary events: Inset boxes depict selected examples of alginate operon clades distinguished by evolutionary events: i) Canonical alginate operon organization with a partial operon duplication event identified in Pseudomonas resinovorans 136 resulting in the loss of alginate acetylation machinery (ib–indicated by A*); ii) A distinct alginate operon clade (ii.a-c) identified by rearrangement of acetylation machinery (indicated by B*) as well as HGT events with canonical alginate operon possessing species; iii) Atypical alginate operons involving loss of outer membrane transport loci or portions of acetylation machinery in deep sea dwelling bacteria. Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the taxonomic distribution of loci for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of loci clusters. (PDF) Click here for additional data file.

Genomic-proximity network of phylogenetically clustered acetylated cellulose operons.

Phylogenetically clustered operon loci are arranged according to the canonical acetylated cellulose operon ordering indicated by the grey sidebar. Inset panels identify three acetylated cellulose operons identified in Pseudomonas spp. (i) and a single Bordetella avium genome possessing a duplicated polysaccharide co-polymerase wssC locus (ii—indicated by red asterisk). Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the taxonomic distribution of loci for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of loci clusters. (PDF) Click here for additional data file.

Phylogenetic sequence clustering reflect differences in structural conservation between cellulose synthase complex subunits BcsA and BcsB.

Top panel—Sequence conservation was mapped onto the cellulose synthase complex, BcsA-BcsB (4HG6 –Rhodobacter sphaeroides ATCC 17025) comprising sequences from eight species representing distinct cellulose operon clades (Fig 4(i)–4(iv)). Lower panels—structural and multiple sequence alignments indicate a high degree of conservation corresponding to BcsA glycosyl hydrolase catalytic core domain and regions of the cellulose translocation channel (i) and UDP binding sites of the BcsA PilZ domain (ii). In Contrast, low overall sequence conservation is found among the carbohydrate binding and ferredoxin domains (CBD1-2, and FD1-2) of BcsB sequences, except the highly conserved cellulose binding site residing in CBD-2 (iii). The translocated cellulose polymer is indicated in green. BcsA domains identified using PFAM predictions for the R. sphaeroides reference sequence, BcsB domains were assigned according to [45]. Multiple sequence alignment was visualized generated using Geneious 10.2.2 (http://www.geneious.com), protein structure was visualized using Chimera 1.11.2 [106]. (PDF) Click here for additional data file.

Phylogenetic clustering reveals structural evolution of PNAG PgaB periplasmic modifying enzyme distinguishing gram-negative and gram-postive PNAG operon clades.

A)—Multiple sequence alignment of representative sequences comprising all PgaB phylogenetic clusters. Global sequence conservation compared against E. coli MG1655 K12 PgaB, phylogenetic cluster PgaB_G1, indicates presence of polysaccharide deacetylase domain (blue box) but an absence of glycosyl-hydrolase domain in non-PgaB_G1 sequences. Red arrows indicate phylogenetic group specific N-terminal domain fusions predicted by PFAM searches; C-terminal domain fusions identified (red box) as putative hydrolase domains from BLAST searches. B)—A close up view of sequence conservation of PgaB polysaccharide deacetylase domains with indel events highlighted: green boxes indicate insertions identified in non PgaB_G1 sequences; teal boxes indicate insertions in PgaB_G1 sequence residing in the C-terminal alpha-helix cap (yellow box). C–Crystal structure of E. coli PgaB (4F9D) indicating conservation of the deacetylase domain catalytic core. D–Deacetylase domain with indel regions indicated according to the colour scheme described for panel B. E–C-terminal alpha helical cap region of the PgaB deacetylase domain indicating insertions of the PgaB_G1 region that are spatially proximal to an N-terminal region of the hydrolase domain (light purple); comparison of the same regions with PgaB_G1 sequence conservation indicated. Multiple sequence alignment was visualized generated using Geneious 10.2.2 (http://www.geneious.com), protein structure was visualized using Chimera 1.11.2 [106]. (PDF) Click here for additional data file.

Reconstructed synthase-dependent EPS operons.

(XLSX) Click here for additional data file.

Phylogenetic sequence clade assignments of reconstructed synthase-dependent EPS operon loci.

(XLSX) Click here for additional data file.

EPS reference operon loci by species and RefSeq accession number.

(XLSX) Click here for additional data file.

Previously experimentally characterized EPS producing species; successful identification of a predicted EPS operon by the present study, and additional novel EPS operon possessing strains identified.

(XLSX) Click here for additional data file.

Summary of synthase-dependent EPS operon evolutionary events.

(XLSX) Click here for additional data file.

Synthase-dependent EPS operon locus clade associations.

(XLSX) Click here for additional data file.

NCBI reference complete bacterial genomes and genbank accessions used for synthase-dependent EPS operon reconstruction.

(XLSX) Click here for additional data file.

Associated lifestyle and environmental niche metadata for bacterial reference genomes.

(XLSX) Click here for additional data file. 18 Nov 2019 Dear Dr Parkinson, Thank you very much for submitting your manuscript 'A systematic pipeline for classifying bacterial operons reveals the evolutionary landscape of biofilm machineries' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts. In addition, when you are ready to resubmit, please be prepared to provide the following: (1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors. (2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text. (3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution. Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are: - Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition). - Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video. - Funding information in the 'Financial Disclosure' box in the online system. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here. We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us. Sincerely, Mark M. Tanaka Associate Editor PLOS Computational Biology Alice McHardy Deputy Editor PLOS Computational Biology A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] The reviewers all found the manuscript to be a potentially important contribution. They make a number of suggestions for improvement. One reviewer remarks that the novelty of your study resides more in the results than development of a new method. This should be addressed - indeed, if you feel it is appropriate you can re-orient the narrative of the paper. All reviewers raise a number of technical issues which should be addressed with revisions or clarification. The reviewers also point out the need for the senior authors to check the manuscript for language errors. In fact, all authors should carefully read and edit the text. Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The paper by Bundalovic-Torma and colleagues presents a comparative analysis of the operons that encode the enzymes synthesizing five exopolysaccharides: cellulose, acetyl-cellulose, poly-N-acetylglucosamine, Pel (a polymer of mannose, rhamnose, and glucose residues), and alginate (mannuronate/guluronate polymer). The operons are clustered and compared on the basis of their gene content, phylogenetic distribution, and the likely evolutionary history (gene rearrangement, duplication, loss, fusion,and horizontal gene transfer). This is an interesting and potentially useful work. However, as could be expected for such an ambitious project, a number of important issues have been either glossed over or relegated to the supplementary files. The paper would benefit from addressing the following points. Major comments. 1. The paper is entitled "A systematic pipeline for classifying bacterial operons...". Unfortunately, I do not see a clear description of any pipeline. An explanatory figure or a flow chart would be helpful. It would also be useful to highlight which steps of the pipeline result in which results (predictions, discoveries). 2. Given the numerous errors in language (see below for examples), which is unexpected for a Toronto-based team, it is essential that the senior author(s) read and checked the entire text. 2. This ms fails to mention the first publication on the use of operon structure to predict functional coupling: Overbeek et al. 1999, PNAS 96 (6) 2896-2901, https://www.pnas.org/content/96/6/2896. This paper needs to be properly acknowledged. 3. The authors do not explain why they have limited their description of the cellulose synthase operon to just four genes, one of which, bcsZ, actually encodes an inhibitor of cellulose biosynthesis (Ahmad et al. 2016, Microb Cell Fact. 15:177. PMID: 27756305). Traditionally, the reference cellulose synthase operon was the best-studied bcsABCD from Gluconobacter xylinus. As detailed in a recent review (ref. 25), the bcs operon in E. coli and Salmonella consists of at least 6 genes, with three more involved in cellulose modification by phosphoethanolamine (Thongsomboon et al. 2018, Science 359:334-338, PMID: 30232265). Ignoring these extra genes and the additional modification of cellulose simplifies the calculations but results in a biased presentation of the diversity of cellulose synthase operons. At the very least the existence of the extra genes and the associated modification must be acknowledged and the reasons for trimming the gene set carefully discussed. 4. In this paper, any deviations from the reference gene set are described in evolutionary terms (gene rearrangement, duplication, acquisition and loss), implicitly assuming that chosen the reference sets are ancestral operons. Since there is no evidence presented for this assumption being correct for any analyzed operon, all these terms must be qualified and the likely pathways of the evolution of the respective operons discussed without making any unnecessary assumptions. 5. It is not clear whether the trimmed operons actually produce the same compound. The example of Pel biosynthesis by B. cereus is very impressive but it could still be an exception. 6. Supplemental Table 7 is a very useful product of this work. Any chance it could be made available to the community as a separate resource (e.g. a web site), not getting lost among 15 other supplementary files? Minor comments. L. 52. "all known synthase dependent bacterial biofilm machineries". Not all, there is also Psl, levan, and phosphoethanolamine cellulose, and probably other biofilm-forming polysaccharides. L. 67. Strictly speaking, the definition of an operon includes co-regulation. Rephrase. L. 69. What do you mean by "co-conserved"? Rephrase. L. 70. Remove "subsequently" (or move to the beginning of the sentence). L. 71 function isolation -> function in isolation L. 78-79. "role of evolutionary events on operon structure" Role in or Influence on? L. 103. "EPS are an important constituent". Rephrase. L. 117-118. Again, what about the Psl and phosphoethanolamine cellulose? L. 138. these finding -> these findings L. 156. Actinobacteridae and Rubrobacteridae are obsolete names for the classes Actinobacteria and Rubrobacteria, see https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=201174&lvl=1 L. 170. "The processes ... is poorly understood." L. 179-180. " the resulting number ... were assessed" L. 180. ordering -> order L. 338. pathogen enterobacteria -> pathogenic enterobacteria L. 384 and elsewhere. Acetylated cellulose does not need a dash. In the Supplementary Table 6, the "NCBI Genbank Genome ID" column are actually RefSeq genome IDs. RefSeq is different from GenBank, e.g. the RefSeq entry NC_015671 for Cellulomonas [Cellvibrio] gilvus ATCC 13127 corresponds to the GenBank entry CP002665. Reviewer #2: The manuscript entitled « A systematic pipeline for classifying bacterial operons reveals the evolutionary landscape of biofilm machineries » by Bundalovic-Torma and colleagues represent an incredible effort to characterize the composition and evolution of the five major EPS operons known across the bacterial diversity. The authors should be commended by the amount of work performed and the depth of their analysis and all the details provided. Despite the load of work, the results are presented, in most cases, clearly and the figures in the text (and supplements) provide a lot of interesting and insightful information. Overall, I think this manuscript is of interest to the biofilm and comparative genomics community, it provides novel results and further understanding the evolution of different EPS operons. The reviewer also appreciated the analysis of the genomic results in light of the biological function each protein has. There are however, a few points that I feel should be address prior to acceptance. Major comments • Validation of results. My major concern relies on the validation of the EPS predictions. How was this performed? Typically, in a real agnostic way, the database of reference genomes should have been split in two, one should be used for generating the models (HMM profiles and genomic reference organization) and the other one as a validation dataset. Was this done? Further, there is no evidence that the results obtained were systematically contrasted with any published literature, at least a random subset of those. I would suggest that the authors validate their data using new genomes from NCBI. (They worked on genomes from 2015, and NCBI has most likely doubled the number of genomes since then). I am aware that there is an accompanying manuscript validating one novel EPS in Bacillus, but it would provide robustness to this manuscript if a more thorough literature search would be done and potentially detect false negatives, or maybe even false positives. • L157-158: Statistical analysis. I do not understand how a t-test could have been done with this data. This data concerns distribution of a trait (EPS) across a character (pathogen-non pathogen). This should be analysed using contingency tables and the appropriate statistical analysis would be a Chi-squared or a Fisher’s test but not a T-test. Further, given that there are multiple EPS operons, this should be then corrected post hoc for multiple observations. Statistical analysis should be redone and claims of link towards pathogenicity should be modified accordingly. • L157-158: In some occasions, pathogenicity is not a well-defined trait. For example, many species are not pathogenic but there can be a reference in which in a given patient they may have caused disease. For example, commensal bacteria can under certain circumstances (immunosuppressed individuals) become facultative pathogens. Were these cases consistently assigned as pathogens or non –pathogen? This should be more explicitly mentioned in the methods sections. Can you explain for example why two very similar S. aureus, “(Sequence type (ST) 59 from an epidemic lineage of community-associated (CA) methicillin-resistant Staphylococcus aureus (MRSA) isolates. Taiwanese CA-MRSA isolates belong to ST59 and can be grouped into 2 )” were classified as pathogenic (SA957) and non-pathogenic (SA40)? • Figure 1, L 148-157 I believe it would be more informative to provide the percentage of genomes in which each EPS was found rather than absolute numbers (i.e. as shown in Figure S1). I also believe that for visualization purposes, a figure with a tree of the bacterial diversity and the presence/absence of each EPS could be shown in each clade (or better, the degree prevalence (%) of each EPS on each clade). • The authors present this paper at the beginning as a methodology paper. I think this is an overstatement, and somehow undermines the important results (evolutionary and biological results) presented here. Authors refer to a “novel” method to study the evolution of operons using phylogenetic clustering. I am confused as to what is the precise novelty of this approach. There are a number of papers that have used genomic operon architecture & phylogenetics to study the evolution for instance of secretion systems in bacteria (doi: 10.1371/journal.pgen.1002983, doi: 10.1371/journal.pbio.3000390), as well as stand-alone programs to detect the presence of complex operons in genomes (doi:10.1371/journal.pone.0110726). I would tone down the novelty of the method and focus directly in the evolutionary results. Minor comments • L71 word missing? “do not function in isolation” • L82 more references could be added for example that investigating the divergence between the outer membrane exporter of bacterial capsules and EPS (https://doi.org/10.1016/B978-0-12-394313-2.00007-X, doi: 10.1128/MMBR.00024-08) • Figure 1D was confusing at first. Wouldn’t it be simpler to do a more traditional co-occurrence matrix? Or a small interaction network? Or maybe modify the legend. I felt it was somewhat counterintuitive to read the table vertically. • L199-200 It would be of use for the molecular microbiologists to add to the methods the precise genes/proteins used to generate the multiple sequence alignment, and why were those genes chosen. Were those core genes for each operon? Those described in the literature as essential for the biosynthesis? For instance, concerning cellulose operon, in several E. coli natural isolates, the operon is a divergent operon with bcsABZC on one side and bcsEFG on the other side (both of which are essential for cellulose production, doi:10.1111/j.1365-2958.2009.06678.x). • In Figure 2, given the low quality/resolution of the trees presented in figure 2A, I wonder whether this panel is necessary or it could go to the supplemental material. • Also in Figure 2, t would be of help to keep the colours constant for each EPS operon across panels B and C. • Figure C, phylogenetic distance between loci seems to be sometimes really high > 2. What is the unit represented, SNP/site? Can the authors comment on the robustness of their phylogeny? Bootstrap values are never commented, and it would be of use. Further, do authors believe that with distances over 2, these results are trustworthy? I think this grants a small commentary or caveat section in the discussion • L257-26X It would help the reader if the authors would briefly state the function of each protein in the bscABZC operon clearly, as it was done in the next section for the pel operon for instance. • L347-350 It is not clear whether these two species are known to produce PNAG or this is a putative novel discovery? • L 472 Is there a word missing? • L521 there seems to be a problem with the references. It feels that instead of reference #64 it should be #63. Please check this thoroughly throughout the manuscript. • L631 Authors mention repeatedly to have analysed 1861 bacteria, but table S6 only contains 1388. This is important, specially to evaluate the relative presence of each EPS operon across clades. L631 I think it would be useful to have some summary statistics concerning the bacteria analysed, concerning number of genomes per bacterial clade, gram –ve vs gram +ve bacteria. I am guessing that the NCBI is very biased towards Gammaproteobacteria. • L641 When building HMM profiles, how was the non-redundancy eliminated (> 97% similarity). Which program or settings were used to determine sequence similarity? This should be detailed in the Methods. Wouldn’t protein clustering be a more appropriate method to reduce redundancy? • Data availability: I think it would be of use to the community if the HMMs were made available (or if used directly from Pfam: add Pfam link) as well as a simple pipeline for the search to run locally. Reviewer #3: In this manuscript, the authors state that “a systematic approach for studying the evolution of operon organization is lacking.” They aim to “present a novel method to study the evolution of operons based on phylogenetic clustering of operon-encoded protein families and genomic-proximity network visualizations of operon architectures.” This is a laudable aim for an important problem. The authors present methods for analyzing phylogenetic trees to identify clades of functionally similar genes. They then tune their clades using cluster quality measures. This is mostly standard (one might argue about the specific cluster quality measures used, but this is not a major issue). What makes the method more interesting is the joint integration of phylogenetic and genomic-proximity data. From an algorithmic perspective, there is no question that effective tools and methodologies for these tasks are important, and it is obvious that combining phylogenetic and genomic locus information is essential. As both of these data are graphical, and noisy, the algorithmic challenges are huge. The method that the authors have developed appears to depend on human visual inspection of the operon architectures, and as such, may not be fully automatable for a systematic approach. On the other hand, for such an important task, even small steps towards more effective tools are valuable. This being said, the paper could benefit from some attention to clarity in the method presentation. This is particularly important as the method is being proposed as a model for others to implement. For instance, in the subsection titled, “Classification of EPS loci” (lines 671-700) the phylogenetic clustering needs to be clarified. Bootstrap analysis was used, but there is no mention of how (or even if) bootstrap values were incorporated into the method, or even if the final tree analyzed was the consensus tree (or simply the tree constructed from the full multiple sequence alignment). What is meant by “all sequences which share a branch less than the given threshold are assigned to the same evolutionary cluster”? By “branch” do the authors mean tree distance (so that sequences were included in an evolutionary cluster if their tree distance fell within the threshold)? As a second example, in the iterative HMM methodology described on page 27, lines 637-644, why did the authors select 20 sequences only to expand the HMM training set? Since HMMs generally need large training sets, this is puzzling. Lastly, although the writing is generally quite good, some editing is needed to fix minor issues in English usage. A few examples are provided: (a) line 22: “The evolution of these operon-encoded processes is affected by diverse mechanisms such gene duplication, loss, rearrangement, and horizontal transfer.” Insert the word “as” after “such.” (b) “gene-families” (line 24) should not be hyphenated. (c) line 88: “the inference of biological function based on sequence similarities of genes or proteins are often…” replace “are” with “is” ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: None ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No 17 Jan 2020 Submitted filename: Response.docx Click here for additional data file. 11 Feb 2020 Dear Parkinson, We are pleased to inform you that your manuscript 'A systematic pipeline for classifying bacterial operons reveals the evolutionary landscape of biofilm machineries' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch within two working days with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Mark M. Tanaka Associate Editor PLOS Computational Biology Alice McHardy Deputy Editor PLOS Computational Biology *********************************************************** The reviewers are now satisfied with the manuscript. Please note the minor inconsistencies raised by Reviewer 2 to be repaired during the production process. Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: All my concerns have been addressed. Reviewer #2: The manuscript entitled « A systematic pipeline for classifying bacterial operons reveals the evolutionary landscape of biofilm machineries » by Bundalovic-Torma and colleagues represent an incredible effort to characterize the composition and evolution of the five major EPS operons known across the bacterial diversity. I had the chance to review this manuscript in its initial version. The authors have satisfactorily answered/ addressed all the concerns that I raised in my previous review. I believe the manuscript, including the figures, is clearer and has improved significantly. I thus believe this meets the standards of PloS Computational Biology for publication. Minor comment. Please revise once again manuscript for minor inconsistencies, for instance, note that the in-file legend of Supplemental Table 8, states it is Supplemental Table 5. Reviewer #3: My concerns from my original review have been basically satisfied. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Submitted filename: Bundalovic.PlosCB.R1.docx Click here for additional data file. 6 Mar 2020 PCOMPBIOL-D-19-01601R1 A systematic pipeline for classifying bacterial operons reveals the evolutionary landscape of biofilm machineries Dear Dr Parkinson, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Sarah Hammond PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
  99 in total

1.  Estimating maximum likelihood phylogenies with PhyML.

Authors:  Stéphane Guindon; Frédéric Delsuc; Jean-François Dufayard; Olivier Gascuel
Journal:  Methods Mol Biol       Date:  2009

2.  Cytoscape: a community-based framework for network modeling.

Authors:  Sarah Killcoyne; Gregory W Carter; Jennifer Smith; John Boyle
Journal:  Methods Mol Biol       Date:  2009

3.  PelA and PelB proteins form a modification and secretion complex essential for Pel polysaccharide-dependent biofilm formation in Pseudomonas aeruginosa.

Authors:  Lindsey S Marmont; Gregory B Whitfield; Jacquelyn D Rich; Patrick Yip; Laura B Giesbrecht; Carol A Stremick; John C Whitney; Matthew R Parsek; Joe J Harrison; P Lynne Howell
Journal:  J Biol Chem       Date:  2017-09-27       Impact factor: 5.157

4.  Inability of toxin inhibitors to neutralize enhanced toxicity caused by bacteria adherent to tissue culture cells.

Authors:  I Ofek; D Zafriri; J Goldhar; B I Eisenstein
Journal:  Infect Immun       Date:  1990-11       Impact factor: 3.441

5.  Gene duplications in prokaryotes can be associated with environmental adaptation.

Authors:  Marit S Bratlie; Jostein Johansen; Brad T Sherman; Da Wei Huang; Richard A Lempicki; Finn Drabløs
Journal:  BMC Genomics       Date:  2010-10-20       Impact factor: 3.969

6.  Biosynthesis of the Pseudomonas aeruginosa Extracellular Polysaccharides, Alginate, Pel, and Psl.

Authors:  Michael J Franklin; David E Nivens; Joel T Weadge; P Lynne Howell
Journal:  Front Microbiol       Date:  2011-08-22       Impact factor: 5.640

7.  The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements.

Authors:  Adrian M Altenhoff; Nives Škunca; Natasha Glover; Clément-Marie Train; Anna Sueki; Ivana Piližota; Kevin Gori; Bartlomiej Tomiczek; Steven Müller; Henning Redestig; Gaston H Gonnet; Christophe Dessimoz
Journal:  Nucleic Acids Res       Date:  2014-11-15       Impact factor: 16.971

8.  Top-down clustering for protein subfamily identification.

Authors:  Eduardo P Costa; Celine Vens; Hendrik Blockeel
Journal:  Evol Bioinform Online       Date:  2013-05-06       Impact factor: 1.625

9.  CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors:  Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal:  Bioinformatics       Date:  2012-10-11       Impact factor: 6.937

10.  DOOR 2.0: presenting operons and their functions through dynamic and integrated views.

Authors:  Xizeng Mao; Qin Ma; Chuan Zhou; Xin Chen; Hanyuan Zhang; Jincai Yang; Fenglou Mao; Wei Lai; Ying Xu
Journal:  Nucleic Acids Res       Date:  2013-11-07       Impact factor: 16.971

View more
  7 in total

Review 1.  Weaving of bacterial cellulose by the Bcs secretion systems.

Authors:  Wiem Abidi; Lucía Torres-Sánchez; Axel Siroy; Petya Violinova Krasteva
Journal:  FEMS Microbiol Rev       Date:  2022-03-03       Impact factor: 16.408

2.  The Pel polysaccharide is predominantly composed of a dimeric repeat of α-1,4 linked galactosamine and N-acetylgalactosamine.

Authors:  François Le Mauff; Erum Razvi; Courtney Reichhardt; Piyanka Sivarajah; Matthew R Parsek; P Lynne Howell; Donald C Sheppard
Journal:  Commun Biol       Date:  2022-05-26

3.  Identification of the Clostridial cellulose synthase and characterization of the cognate glycosyl hydrolase, CcsZ.

Authors:  William Scott; Brian Lowrance; Alexander C Anderson; Joel T Weadge
Journal:  PLoS One       Date:  2020-12-02       Impact factor: 3.240

4.  The Matrix Revisited: Opening Night for the Pel Polysaccharide Across Eubacterial Kingdoms.

Authors:  Gregory B Whitfield; P Lynne Howell
Journal:  Microbiol Insights       Date:  2021-02-15

5.  OperonSEQer: A set of machine-learning algorithms with threshold voting for detection of operon pairs using short-read RNA-sequencing data.

Authors:  Raga Krishnakumar; Anne M Ruffing
Journal:  PLoS Comput Biol       Date:  2022-01-05       Impact factor: 4.475

Review 6.  Bacillus cereus sensu lato biofilm formation and its ecological importance.

Authors:  Yicen Lin; Romain Briandet; Ákos T Kovács
Journal:  Biofilm       Date:  2022-02-15

7.  Structural and biochemical characterization of the exopolysaccharide deacetylase Agd3 required for Aspergillus fumigatus biofilm formation.

Authors:  Natalie C Bamford; François Le Mauff; Jaime C Van Loon; Hanna Ostapska; Brendan D Snarr; Yongzhen Zhang; Elena N Kitova; John S Klassen; Jeroen D C Codée; Donald C Sheppard; P Lynne Howell
Journal:  Nat Commun       Date:  2020-05-15       Impact factor: 14.919

  7 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.