Marine bacteria in the Roseobacter and SAR11 lineages successfully exploit the ocean habitat, together accounting for ~40% of bacteria in surface waters, yet have divergent life histories that exemplify patch-adapted versus free-living ecological roles. Here, we use a phylogenetic birth-and-death model to understand how genome content supporting different life history strategies evolved in these related alphaproteobacterial taxa, showing that the streamlined genomes of free-living SAR11 were gradually downsized from a common ancestral genome only slightly larger than the extant members (~2,000 genes), while the larger and variably sized genomes of roseobacters evolved along dynamic pathways from a sizeable common ancestor (~8,000 genes). Genome changes in the SAR11 lineage occurred gradually over ~800 million years, whereas Roseobacter genomes underwent more substantial modifications, including major periods of expansion, over ~260 million years. The timing of the first Roseobacter genome expansion was coincident with the predicted radiation of modern marine eukaryotic phytoplankton of sufficient size to create nutrient-enriched microzones and is consistent with present-day ecological associations between these microbial groups. We suggest that diversification of red-lineage phytoplankton is an important driver of divergent life history strategies among the heterotrophic bacterioplankton taxa that dominate the present-day ocean.
IMPORTANCE: One-half of global primary production occurs in the oceans, and more than half of this is processed by heterotrophic bacterioplankton through the marine microbial food web. The diversity of life history strategies that characterize different bacterioplankton taxa is an important subject, since the locations and mechanisms whereby bacteria interact with seawater organic matter have effects on microbial growth rates, metabolic pathways, and growth efficiencies, and these in turn affect rates of carbon mineralization to the atmosphere and sequestration into the deep sea. Understanding the evolutionary origins of the ecological strategies that underlie biochemical interactions of bacteria with the ocean system, and which scale up to affect globally important biogeochemical processes, will improve understanding of how microbial diversity is maintained and enable useful predictions about microbial response in the future ocean.
Heterotrophic marine bacterioplankton taxa have frequently been conceptualized into two ecological categories: those with large genomes, versatile metabolic capabilities, and rapid responses to transient conditions, likened to ecological r-strategists of macroorganisms (1), and those with streamlined genomes, the ability to grow under extremely low substrate concentrations, and the inability to take advantage of enhanced nutrients (2), paralleling the K-strategist paradigm (3–5). Ephemeral patches of organic matter formed at nanometer to millimeter scales through biotic and abiotic processes (6, 7) and harboring nutrient concentrations up to three orders of magnitude higher than bulk seawater (7, 8) are postulated to underlie these divergent strategies, as they provide enriched microhabitats that contrast with the nutrient-poor bulk seawater matrix. While genome sequences have offered insights into differing tactics for obtaining resources in the ocean, the evolution of alternate bacterioplankton life history strategies is not yet well understood.
Two phylogenetically related marine taxa, the Roseobacter and SAR11 lineages, exemplify extremes in the free-living to patch-adapted continuum while sharing a common ancestor in the alphaproteobacteria. Because they are two of the most abundant heterotrophic bacterial groups in ocean surface waters (1, 2), the evolutionary paths leading to their divergent ecological strategies have likely influenced where and when fixed carbon is processed in the ocean (9, 10) and what fraction is exported into deep waters (11, 12).
We sought to interpret the evolution of genome properties associated with these marine alphaproteobacterial clades by integrating ancestral gene content reconstruction and patterns of protein-coding sequence evolution. 
The reconstruction of genome content in ancestral lineages has frequently been modeled using maximum parsimony methods (13–20), but these techniques are not able to model parallel and repeated gene insertions and deletions and are known to underestimate the number of evolutionary events. They also cannot model rate variability among different lineages and gene functions (21). More recently, maximum likelihood approaches have been developed to overcome the disadvantages of parsimony-based reconstructions (21–25). In these likelihood analyses, however, insertion rates are frequently unrealistically assumed to be equal to deletion rates, and no differentiation is made between lateral gene transfer (LGT) and gene duplication.
A recently developed maximum likelihood method implemented in the COUNT software (26, 27) is based on the birth-and-death evolutionary model of multigene families (28). The birth-and-death model assumes that genes are lost, gained, and duplicated independently (29), with constant rates for a fixed family and phylogeny branch, thereby modeling microbial genome evolution in a more realistic way (27, 30). The model is thus described by lineage- and family-specific gene loss and duplication rates, coupled with a lineage-specific family gain process accounting for LGT. Here, we apply this method to Roseobacter and SAR11 clades to address the evolutionary history of their distinct ecological strategies.
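As a rough illustration of the birth-and-death idea described above, the following sketch simulates the copy number of a single gene family along a single branch with a Gillespie-style stochastic simulation. It is not the COUNT implementation, which fits rates by maximum likelihood rather than simulating them. The per-copy loss and duplication rates, the copy-number-independent gain rate standing in for LGT, and all function and parameter names are assumptions made for illustration.

```python
import random


def simulate_family_size(n0, loss, duplication, gain, branch_length, rng):
    """Simulate the copy number of one gene family along one branch.

    Assumed (illustrative) model: each copy is lost at rate `loss` and
    duplicated at rate `duplication`; new copies arrive by lateral
    transfer at rate `gain`, independent of the current copy number.
    """
    n, t = n0, 0.0
    while True:
        loss_rate = loss * n
        dup_rate = duplication * n
        total = loss_rate + dup_rate + gain
        if total == 0.0:
            return n  # no further event is possible
        t += rng.expovariate(total)  # waiting time to the next event
        if t > branch_length:
            return n  # the branch ended before the next event
        u = rng.random() * total
        if u < loss_rate:
            n -= 1  # loss of one copy
        elif u < loss_rate + dup_rate:
            n += 1  # duplication of an existing copy
        else:
            n += 1  # gain of a copy by lateral transfer


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical rates: loss dominates, duplication and gain are rare.
    sizes = [simulate_family_size(1, 1.0, 0.2, 0.05, 2.0, rng) for _ in range(1000)]
    print("fraction of families lost:", sizes.count(0) / len(sizes))
```

Keeping duplication and gain as separate rates, even though both add one copy, mirrors the distinction the model draws between expansion of an existing family and acquisition by LGT.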
RESULTS AND DISCUSSION
Ancestral reconstruction of the marine alphaproteobacterial tree.
Ancestral reconstruction of genome content requires a robust phylogenetic tree describing the evolutionary relationship of the taxa. Using four different phylogenomic approaches which take into account different aspects of heterogeneous evolutionary processes that likely have occurred during the evolution of alphaproteobacterial lineages (P4, RAxML, PhyloBayes, and MrBayes), we obtained a robust phylogenetic position of the marine Roseobacter clade and other major lineages in the alphaproteobacterial tree (Fig. 1; see also Fig. S1 in the supplemental material). However, the SAR11 clade was placed in three alternate evolutionary positions, all of which were supported by extremely high bootstrap values or posterior probabilities within that phylogenomic approach (Fig. 1; see also Fig. S1). Regardless of the specific position in the competing reconstructions, however, the phylogenetic birth-and-death model consistently predicted that the small extant SAR11 genomes (1,300 to 1,500 genes) evolved from a slightly larger common ancestor (~2,000 genes; Fig. 2A; see also Fig. S2A and C), while the large and variable extant Roseobacter genomes (2,000 to 5,000 genes; median, >4,000) evolved from a quite large common ancestor (~8,000 genes; Fig. 2B; see also Fig. S2B and D), a characteristic echoed in the genome size of nonmarine Roseobacter relatives.
FIG 1
Model-based phylogenomic trees of alphaproteobacteria based on a concatenation of 60 orthologous protein sequences using the P4 Bayesian software with the NDCH and NDRH models (A), the RAxML software (B), and the PhyloBayes software with the CAT model (C). A Bayesian phylogeny using MrBayes with or without the covarion model had the same branching order as the RAxML tree. The node representing the most recent common ancestor (MRCA) of the Roseobacter and SAR11 lineages is indicated with a red dot, and the predicted gene number for the MRCA is indicated. For clarity, only the deep branches connecting the major lineages and their statistical support values are shown. The complete trees are shown in Fig. S1 in the supplemental material.
FIG 2
Ancestral genome content reconstruction using the COUNT software. The reconstruction is based on the P4-based alphaproteobacterial tree (see Fig. S1 in the supplemental material), but only the parts of the results involving marine SAR11 (A) and Roseobacter (B) are shown. The log-scale color coding represents numbers of reconstructed gain and loss events of each lineage. Numbers in parentheses are predicted gene numbers for ancestral nodes and observed gene numbers for extant lineages. The genome expansion on the Roseobacter branch leading to R37 was statistically significant based on reconstruction of randomized genome content in 100 bootstrapped replicates (see Table S3).
Dynamics of genome content.
The phylogenetic birth-and-death model imposed on the phylogenomic trees shows a steady trend toward streamlined genomes in the SAR11 lineage, with no abrupt changes in gene family content since the common ancestor (Fig. 2A; see also Fig. S2A and C in the supplemental material). The Roseobacter lineage exhibits a more complicated evolutionary path to net genome reduction, however (Fig. 2B; see also Fig. S2B and D), with the Roseobacter ancestor experiencing an early surge in gene content (leading to the R37 node in Fig. 2B; see also Fig. S2B and D). The model suggests that this surge occurred exclusively through gain of new families rather than expansion of existing ones (see Table S1). The calculated rate of gene loss compared to the amino acid substitution rate for Roseobacter branches varies depending on the underlying phylogenetic tree reconstruction (14 deletions per amino acid substitution for P4, 5 for RAxML, 8 for PhyloBayes), but all three predict that genes were lost at a constant rate for both ancestral and exterior branches (Fig. 3A; see also Fig. S3A and D and Table S2A). For LGT, however, calculated rates are significantly lower for ancestral than for exterior branches (Fig. 3B; see Fig. S3B and E and Table S2A), with the LGT rate following a molecular clock only for the ancestral branches (averaging 0.036, 0.015, or 0.024 gene family acquisitions per amino acid substitution, depending on tree construction; R2 > 0.69 and P < 0.001 in all cases). The notable exception is the ancestral branch leading to Roseobacter node R37, showing a significantly higher LGT rate than any other ancestral branch (see Table S2B), in agreement with the significant surge of genome content on that branch predicted by the birth-and-death model (Fig. 2B; see also Fig. S2B and D) with bootstrapped genome content data sets (see Table S3). For gene duplication, calculated rates are low in the majority of Roseobacter branches (Fig. 3C; see also Fig. 
S3C and F) and do not follow a molecular clock (P > 0.05) (see Table S2A). The branch leading to the Arctic strain Octadecabacter arcticus 238 is an outlier (P < 0.001; see Table S2B), regardless of whether the observed expansion of insertion sequence families is included (see Fig. S4), suggesting that gene duplication rates may be enhanced in polar roseobacters.
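One way to make the clock-like criterion used above concrete is an ordinary least-squares fit of an event rate against the amino acid substitution rate across branches, reporting the slope (events per substitution) and R2. The sketch below is a minimal illustration under that assumption and is not necessarily the statistical procedure used in the analysis. The function name and the per-branch values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Invented per-branch values: substitutions per site and gene losses.
    substitutions = [0.01, 0.02, 0.03, 0.04]
    losses = [0.14, 0.28, 0.42, 0.56]
    slope, intercept, r2 = fit_line(substitutions, losses)
    print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.2f}")
```

A slope near a constant value with high R2 across branches is what a clock-like rate would look like in this framing; widely scattered points, as reported for LGT on the exterior branches, would give a low R2.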
FIG 3
Analysis of gene loss rate (A), lateral gene transfer rate (B), and gene duplication rate (C) versus amino acid substitution rate on the Roseobacter branches of the alphaproteobacterial phylogeny constructed using P4. For the exterior Roseobacter branches, LGT rate calculations were highly variable and did not exhibit a clock-like pattern (R2 = 0.14; P = 0.02).
One basal Roseobacter lineage (represented by strain HTCC2255) diverged at node R38 and escaped the early surge, evolving directly toward a highly reduced genome of only 2,240 genes (Fig. 2B; see also Fig. S2B and D in the supplemental material), while in the remaining clades, a trend toward gradual genome reduction followed the rapid early innovation (Fig. 2B; see also Fig. S2B and D). At the tips of the phylogeny, Roseobacter lineages show either gradual genome downsizing or expansion (Fig. 2B; see also Fig. S2B and D). Thus, two time periods of substantial evolutionary change in Roseobacter genomes are predicted: one occurring early in their history and manifested as genome expansion via LGT along the branch leading to node R37, and the second occurring more recently along the branches leading to extant members. 
A flux of gene family content is also observed in some SAR11 leaf lineages but is of considerably smaller magnitude than that observed at the tips of the Roseobacter phylogeny (Fig. 2; see also Fig. S2).
Biased gene acquisition in roseobacters and SAR11s.
Characterization of gene families based on clusters of orthologous groups (COGs) (31) indicated that putative biological functions gained during the evolution of the SAR11 and Roseobacter lineages were significantly different (chi-square test, P < 0.001). Along the SAR11 branches, acquired families were biased toward cell wall biogenesis (55 families of lipopolysaccharide, cell wall, and polysaccharide synthesis proteins) and pilus synthesis (15 families of assembly proteins). Along the Roseobacter branches, a greater proportion of acquired families were involved in gene regulation (450 families of transcriptional regulators, DNA binding proteins, and sigma factors) and replication/recombination/repair (431 families of transposases, endonucleases, recombinases, methylases, and mismatch repair enzymes) (Fig. 4). During later innovation in the Roseobacter lineage, lateral gene acquisition was biased toward gene regulation (41 families of transcriptional regulators and DNA binding proteins) and defense mechanisms (12 families of antibiotic synthesis and export proteins and multidrug efflux pumps) (Fig. 4), potentially equipping the cells to better compete in microbial communities associated with enriched patches (32). This nonrandom collection of SAR11 and Roseobacter gene functions gained through LGT is indicative of adaptive evolution.
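The comparison of functional profiles above rests on a chi-square test across COG categories. As a minimal sketch of how such a statistic is computed, the code below evaluates the Pearson chi-square statistic and its degrees of freedom for a lineage-by-category table of gained families. The counts in the usage example are invented for illustration and are not the study's data, and the function name is hypothetical. Converting the statistic to a P value requires the chi-square distribution and is left out here.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must have a positive total")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if category proportions were equal across rows.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


if __name__ == "__main__":
    # Invented counts of gained families in four COG categories.
    gained = [
        [12, 30, 55, 15],    # hypothetical SAR11 row
        [450, 431, 120, 40],  # hypothetical Roseobacter row
    ]
    stat, dof = chi_square_statistic(gained)
    print(f"chi-square={stat:.1f} with {dof} degrees of freedom")
```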
FIG 4
Gene families gained per branch in Roseobacter versus SAR11 lineages (left) and in Roseobacter ancestral nodes R37 versus R1 to R36 (right). Letters represent COG categories. Asterisks indicate significant differences in proportions based on Xipe analysis (64) (P < 0.01). The horizontal axis indicates the number of families gained per branch for each COG class. Cell motility families gained in SAR11 represent pilus formation genes.
Pattern of gene loss in Roseobacter strain HTCC2255.
Not all extant Roseobacter lineages have evolved toward genome content suggestive of patch-associated life histories, however. The birth-and-death model predicts that the HTCC2255 lineage lost >5,000 gene families since diverging from the clade ancestor, including those conserved in a majority of roseobacters and involved in motility, chemotaxis, secondary metabolite synthesis and metabolism, signal transduction, and various regulatory functions, making the genetic composition of HTCC2255 and the Roseobacter clade ancestor (node R38 in Fig. 2B; see also Fig. S2B and D in the supplemental material) significantly different (see Fig. S5; chi-square test, P < 0.001). In this case, the pattern of gene family loss is suggestive of relaxation of purifying selection on gene families not necessary for a small, free-living bacterioplankter, and in fact the functional profile of the HTCC2255 genome is more similar to that of SAR11 than to that of other roseobacters (Fig. 5).
FIG 5
High-throughput multidimensional scaling (HiT-MDS) plot of the genetic composition of the Roseobacter and SAR11 extant lineages and predicted composition of their respective common ancestors. The genetic composition was determined by mapping the gene families to COG functional categories. COG classes significantly negatively correlated with dimension 1 and hypothesized to include traits associated with r-selected life histories are G (carbohydrate transport and metabolism), I (lipid transport and metabolism), K (transcription), N (cell motility), P (inorganic ion transport and metabolism), Q (secondary-metabolite biosynthesis, transport, and catabolism), T (signal transduction mechanisms), and V (defense mechanisms). COG classes significantly positively correlated with dimension 1 and hypothesized to include traits associated with K-selected life histories are C (energy production and conversion), D (cell cycle control, cell division, and chromosome partitioning), F (nucleotide transport and metabolism), H (coenzyme transport and metabolism), J (translation, ribosomal structure, and biogenesis), M (cell wall/membrane/envelope biogenesis), O (posttranslational modification and protein turnover, chaperones), and U (intracellular trafficking, secretion, and vesicular transport). No significant correlation was found between dimension 1 and COG classes E (amino acid transport and metabolism) or L (replication, recombination, and repair).
An evolutionary timeline.
The timing of diversification of the lineages was inferred using a maximum likelihood method based on a relaxed molecular clock calibrated by the fossil record (33). This approach dates the occurrence of the common ancestor of SAR11 at 826 (±21) million years ago (mya) (Fig. 6). The prediction from the phylogenetic birth-and-death model that extant SAR11 genomes have been streamlined by only 25 to 30% from their common ancestor emphasizes the importance of the SAR11 position on the alphaproteobacterial tree to the genome streamlining theory (2, 9). If the SAR11 lineage clusters with Rickettsiales at the base of the alphaproteobacterial tree (Fig. 1B), the most recent common ancestor (MRCA) (Fig. 1B) of the SAR11 and Roseobacter lineages is predicted to have had only ~2,100 genes, suggesting only a trivial reduction to the SAR11 ancestor. If the SAR11 lineage branched off either before (Fig. 1A) or after (Fig. 1C) the marine SAR116 lineage (represented by the “Candidatus Puniceispirillum marinum” IMCC1322 genome), the MRCA genome is predicted to contain either ~3,300 or ~6,900 genes (Fig. 1A and C), with the latter most strongly supporting the hypothesis that genomic and metabolic streamlining is the primary evolutionary process influencing the content of extant SAR11 genomes.
FIG 6
A chronogram of alphaproteobacteria. Nodes with fossil record corrections are indicated with an asterisk. The tree branching order was constructed using RAxML version 7.3.0 software with a data partition model determined using PartitionFinder, and molecular dating was performed using the r8s software.
The timeline of the Roseobacter lineage indicates that their common ancestor (node R38 in Fig. 2B; see Fig. S2B and D in the supplemental material) occurred more recently, at 260 (±7) mya (Fig. 6). In comparison to SAR11, extant Roseobacter genomes exhibited a greater net genome reduction from the common ancestor (50 to 70%) within a considerably shorter evolutionary time frame. Molecular dating of the R37 node at 196 (±7) mya (Fig. 6) places the timing of the first episode of Roseobacter diversification concurrent with the Mesozoic radiation of the eukaryotic red-lineage phytoplankton (dinoflagellates, coccolithophorids, and diatoms), predicted as early as 250 mya (34). Because the cyanobacteria and green algae that dominated the early ocean were not much larger than bacteria (34) and probably of insufficient size to be detected by bacterial chemosensory mechanisms (35), the radiation of larger phytoplankton groups likely offered new habitats for heterotrophic bacterioplankton, particularly for lineages with large genomes encoding chemotaxis, motility, defense, and other functions beneficial for locating and tracking nutrient-enriched microzones (7, 36). Indeed, members of the Roseobacter lineage in the contemporary ocean frequently occur in association with red-lineage phytoplankton cells (37, 38).
Conclusion.
Although it is a simplified representation of the evolutionary paths taken by heterotrophic marine bacterioplankton (9, 10), the free-living versus patch-adapted dichotomy is nonetheless useful for exploring the implications of disparate life history strategies of marine bacteria (3, 5). The comparative evolutionary history of the Roseobacter and SAR11 lineages points to the emergence of large eukaryotic phytoplankton as an important event driving divergence of patch-adapted from free-living bacterioplankton, the former implicated in enhancing export flux of organic matter to deeper waters via aggregation (11, 12) and the latter linked to intensive remineralization of upper ocean fixed carbon through the microbial loop. A future ocean shaped by rising greenhouse gas emissions is consistently predicted to favor picophytoplankton over diatoms, dinoflagellates, and coccolithophorids (39–42) and therefore may favor free-living over patch-adapted bacterioplankton. Subsequent effects on ocean heterotrophy mediated through alterations in the rates and efficiencies (43) of bacterial assimilation of distinct classes of organic compounds (44) could intensify future changes to the oceanic carbon cycle.
MATERIALS AND METHODS
Since resolving the evolutionary position of the SAR11 clade on the alphaproteobacterial tree has proven to be difficult (20, 45), we used multiple evolutionary models to account for the potential heterogeneity in phylogenetic reconstruction and studied the genome evolution of the marine Roseobacter and SAR11 clades in the context of this controversy. Phylogenetic reconstruction used a concatenation of 60 conserved orthologous proteins in 65 alphaproteobacterial genomes (39 and 7 representatives of marine Roseobacter and SAR11 clades, along with additional related lineages; see Fig. S1 in the supplemental material) and 8 outgroup species associated with gammaproteobacteria and betaproteobacteria. Phylogenetic models and software included a maximum likelihood method using a data partition model in the RAxML version 7.3.0 software (46) and a Bayesian method using a data partition model with and without the covarion model in MrBayes version 3.1.2 (47). The partition model involves estimating independent evolutionary models for different genes or subsets of genes, which are implemented in the PartitionFinder software (48). Two alternate partition schemes were chosen depending on the statistical evaluation method. The covarion model takes into account the variation of substitution rate at a site across time (49, 50). Both RAxML and MrBayes are implemented in parallel versions, making them computationally efficient for this large data set on a high-performance computing cluster. Nevertheless, these phylogenetic methods are unable to model other inherent heterogeneities of this data set, including a substantial variation of amino acid composition across sites and across lineages. 
We thus employed the PhyloBayes and P4 Bayesian software, which are computationally expensive but designed to account for these two aspects.
As there exists a substantial variation in nucleotide G+C content among alphaproteobacterial lineages (~<30% to 70%), and it is known that amino acid composition is affected by the G+C bias (51), the concatenated protein sequence was recoded into the following six Dayhoff groups to reduce this bias (52): cysteine; alanine, serine, threonine, proline, glycine; asparagine, aspartic acid, glutamic acid, glutamine; histidine, arginine, lysine; methionine, isoleucine, leucine, valine; phenylalanine, tyrosine, tryptophan. After recoding was complete, we used a Bayesian method with the CAT model in the PhyloBayes version 3.2e software (53) and a Bayesian method with the node-discrete composition heterogeneity (NDCH) and the node-discrete rate heterogeneity (NDRH) models in the P4 software (54). The CAT model integrates heterogeneity of amino acid composition across sites of a protein alignment (55). The NDCH model allows heterogeneity of amino acid composition across different branches, and the NDRH model allows different rate matrices on different branches (54). All models were used with a Gamma distribution of rate variation among sites.
To study the evolution of life history strategies, we compiled a comprehensive data set of 44,064 orthologous gene families covering the 65 alphaproteobacterial genomes. Gene families were identified using the OrthoMCL software (56). To reconstruct ancestral gene family sizes, we adapted a recently developed pipeline that is suitable for the analysis of such large data sets (27, 57–60), as implemented in the COUNT software package (26). The reconstruction is based on numerical phylogenetic patterns formed by the gene copy numbers across extant genomes in homologous families. 
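The six-group Dayhoff recoding listed above can be sketched in a few lines. The grouping follows the list in the text; the digit labels 0 to 5, the treatment of gaps and ambiguous symbols, and the function name are assumptions made for this illustration and do not describe how PhyloBayes or P4 perform recoding internally.

```python
# The six Dayhoff groups from the text, as one-letter amino acid codes.
DAYHOFF_GROUPS = ("C", "ASTPG", "NDEQ", "HRK", "MILV", "FYW")

# Map each amino acid to its group label ("0".."5"); labels are arbitrary.
RECODE = {aa: str(label) for label, group in enumerate(DAYHOFF_GROUPS) for aa in group}


def dayhoff_recode(sequence, unknown="?"):
    """Recode an aligned protein sequence into six Dayhoff classes.

    Alignment gaps ("-") are kept; any other unrecognized symbol
    (for example X) is replaced by `unknown`.
    """
    out = []
    for symbol in sequence.upper():
        if symbol == "-":
            out.append("-")
        else:
            out.append(RECODE.get(symbol, unknown))
    return "".join(out)


if __name__ == "__main__":
    print(dayhoff_recode("MKV-CX"))
```

Because residues within a group are treated as identical, substitutions driven mainly by G+C bias within a group no longer contribute signal, at the cost of discarding some phylogenetic information.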
Ancestral family sizes are inferred in COUNT by assuming a probabilistic framework involving a phylogenetic birth-and-death model (28) along a rooted phylogeny. In particular, the model is described by lineage- and family-specific gene loss and duplication rates, coupled with a family gain process accounting for arrival by LGT. In contrast to gene-species tree reconciliation methods (61, 62), phylogenetic birth-and-death methods gain expediency by ignoring sequence information and infer ancestral events affecting family sizes by using solely the copy number information. Both larger (27) and smaller (30) ancestral genomes than extant genomes have been predicted with the birth-and-death model when investigating archaeal and virus evolution. In order to infer confidence intervals of the predicted number of gene families in the ancestral nodes, we repeated the procedure with 100 bootstrap data sets generated by randomly sampling gene families (with repetition).
For draft Roseobacter genomes, a regression analysis (see Fig. S6 in the supplemental material) for universal single-copy genes (63) indicated that completeness ranged from 90 to 100%, with a median of 98%. The molecular dating was performed using penalized likelihood based on a relaxed clock model implemented in r8s software version 1.71 (33). Molecular dating requires a phylogenomic tree with fossil calibrations, and thus a few cyanobacterial branches with time constraints were included. This phylogenomic tree was constructed using RAxML version 7.3.0 (46) with an optimized data partition of a concatenation of 61 conserved single-copy orthologous protein sequences. Details of the computational methods can be found in the
Text S1 in the supplemental materials. The orthologous sequences, the partitioned amino acid sequences, the gene families that are gained on the Roseobacter and SAR11 branches, and the gene families that are lost on the HTCC2255 branch are available upon request.
Supplemental methods. Text S1, DOCX file, 0.1 MB.
Three alternate well-supported phylogenomic trees of alphaproteobacteria based on a concatenation of 60 orthologous protein sequences. (A) A Bayesian tree using the P4 software. This fifty-percent majority-rule consensus of 1,000 trees was sampled from the posterior distribution of a Metropolis-coupled Markov chain Monte Carlo (MCMCMC) with the NDCH model of 14 across-tree composition vectors and the NDRH model of 3 across-tree rate matrix vectors with Dayhoff-recoded data and with a gamma model accounting for among-site rate variation (GTR + G + NDCH + NDRH). A chi-square test using a posterior predictive distribution rejected the homogeneous model, which does not take into account the variation in amino acid composition across the branches of the tree. The number of vectors in the NDCH and NDRH models was determined by Bayes factors. The scale bar indicates the number of substitutions per site. The value near each internal branch is the posterior probability for that branch. (B) A maximum likelihood phylogeny using the RAxML version 7.3.0 software with a data partition model determined using PartitionFinder. Values at the nodes show the number of times the clade defined by that node appeared in the 100 bootstrapped data sets using RAxML. A Bayesian phylogeny using MrBayes with the same partition scheme and with or without the covarion model had the same branching order as the RAxML tree. (C) A Bayesian phylogeny using the PhyloBayes version 3.2e software based on Dayhoff-recoded protein sequences with the CAT model, a gamma model for among-site rate variation, and Poisson substitution matrix (Poisson + G + CAT).
The value near each internal branch is the posterior probability for that branch. Trees are rooted using species from betaproteobacteria and gammaproteobacteria. Figure S1, PDF file, 0.7 MB.
Ancestral genome content reconstruction using the COUNT software. The reconstruction is based on the RAxML-based (see Fig. S1B) and PhyloBayes-based (see Fig. S1C) alphaproteobacterial trees, but only the parts of the results involving marine SAR11 (A, C) and Roseobacter (B, D) are shown. The log-scale color coding represents numbers of reconstructed gain and loss events of each lineage. Numbers in parentheses are predicted gene numbers for ancestral nodes and observed gene numbers for extant lineages. The genome expansion on the Roseobacter branch leading to R37 was statistically significant based on reconstruction of randomized genome content in 100 bootstrapped replicates (see Table S3). Figure S2, PDF file, 1 MB.
Analysis of gene loss rate (A, D), lateral gene transfer rate (B, E), and gene duplication rate (C, F) versus amino acid substitution rate on the Roseobacter branches of the alphaproteobacterial phylogeny constructed using RAxML (see Fig. S1B) and PhyloBayes (see Fig. S1C). For the exterior Roseobacter branches, LGT rate calculations were highly variable and did not exhibit a clock-like pattern (R² = 0.13; P = 0.02); this is the case for both reconstructions. Figure S3, PDF file, 0.3 MB.
Gene duplication rate versus amino acid substitution rate on the Roseobacter branches of the alphaproteobacterial phylogeny constructed by P4 (A), RAxML (B), and PhyloBayes (C) after 667 transposase families were removed. The transposases were identified using RPS-BLAST against the COG database, using the best BLASTP hit to the NCBI RefSeq database and using the ISsaga database. Figure S4, PDF file, 0.2 MB.
Genetic composition in the common ancestor of the Roseobacter clade (R38) versus the basal Roseobacter lineage Rhodobacterales bacterium HTCC2255.
Letters represent COG categories. Asterisks indicate significant differences using Xipe resampling analysis. The horizontal axis indicates the percentage that each COG category represents in the genome. Figure S5, PDF file, 0.3 MB.
Regression model for Roseobacter genome coverage estimation based on genome statistics for 39 cultured Roseobacter strains. The x axis is the ratio of the number of conserved single-copy genes universally present in Roseobacter species to the number of predicted protein-encoding genes; the y axis is the number of nucleotides sequenced. The data were fit to a power regression model (R² = 0.94). Figure S6, PDF file, 0.3 MB.
Gene number (predicted for ancestral nodes) and predicted number of gene families gained, lost, expanded, and contracted in the Roseobacter and SAR11 clade genomes. The ancestral reconstruction is based on an alphaproteobacterial tree estimated using P4 (A), RAxML (B), and PhyloBayes (C). Note that the gain of a gene family or expansion of an existing gene family could be a result of either LGT or gene duplication. The relative contribution of these two mechanisms is presented in Fig. 3 and Fig. S3. Table S1, PDF file, 0.1 MB.
Statistical analyses of gene loss, LGT, and gene duplication rates for Roseobacter branches. (A) Analysis of covariance results of F tests for ancestral compared to exterior branches; (B) deleted t residual analysis from regressions of gene loss rate, LGT rate, and gene duplication rate versus amino acid substitution rate, conducted separately for ancestral and exterior branches. Table S2, PDF file, 0.1 MB.
Gene count estimates and confidence intervals of ancestral Roseobacter and SAR11 nodes for three ancestral reconstructions of the alphaproteobacterial tree. Table S3, PDF file, 0.1 MB.
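The confidence intervals summarized in Table S3 rest on the bootstrap described in the methods, in which 100 data sets were generated by randomly sampling gene families with repetition. The resampling step might be sketched in Python as follows. The function names, the fixed seed, and the percentile-style interval are assumptions made for illustration; the text does not state how intervals were derived from the replicates, and the ancestral reconstruction itself (performed with COUNT) is not reproduced here.

```python
import random


def bootstrap_replicates(families, n_replicates=100, seed=0):
    """Build bootstrap data sets by sampling gene families with replacement.

    `families` is a list of per-family records, for example copy-number
    profiles across genomes. Each replicate has the same number of
    families as the input. The seed is fixed only for reproducibility.
    """
    rng = random.Random(seed)
    return [rng.choices(families, k=len(families)) for _ in range(n_replicates)]


def percentile_interval(estimates, alpha=0.05):
    """Symmetric nearest-rank interval over replicate estimates.

    Assumption of this sketch: a simple percentile interval. The study
    may have computed its intervals differently.
    """
    ordered = sorted(estimates)
    last = len(ordered) - 1
    offset = int(alpha / 2 * last)
    return ordered[offset], ordered[last - offset]
```

In use, each replicate would be passed through the same ancestral reconstruction as the original data, and the per-node gene counts collected across replicates would then be handed to `percentile_interval`.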