Horizontal gene transfer via plasmids could favour cooperation in bacteria, because transfer of a cooperative gene turns non-cooperative cheats into cooperators. This hypothesis has received support from theoretical, genomic and experimental analyses. By contrast, we show here, with a comparative analysis across 51 diverse species, that genes for extracellular proteins, which are likely to act as cooperative 'public goods', were not more likely to be carried on either: (1) plasmids compared to chromosomes; or (2) plasmids that transfer at higher rates. Our results were supported by theoretical modelling which showed that, while horizontal gene transfer can help cooperative genes initially invade a population, it has less influence on the longer-term maintenance of cooperation. Instead, we found that genes for extracellular proteins were more likely to be on plasmids when they coded for pathogenic virulence traits, in pathogenic bacteria with a broad host-range.
The growth and success of many bacterial populations depends upon the production of cooperative ‘public goods’[1-4]. Public goods are molecules whose secretion provides a benefit to the local group of cells. Examples include iron-scavenging siderophores[5], exotoxins that disintegrate host cell membranes[6,7], and elastases that break down connective tissues[8-10]. A problem is that cooperation can be exploited by ‘cheats’: cells which avoid the cost of producing public goods but can still use and benefit from those produced by cooperative cells[3,11,12]. What prevents cheats from outcompeting cooperators, and ultimately destabilising cooperation?
In bacteria, some genetic elements are able to move between cells[13]. This horizontal gene transfer has been suggested as a mechanism to help stabilize the production of cooperative public goods[14-18] (Figure 1a). If a gene coding for the production of a public good can be transferred horizontally, it would allow cheats to be ‘infected’ with the cooperative gene and turned into cooperators. Theoretical models have shown that this can facilitate the invasion of cooperative genes, in conditions where they would not be favoured on chromosomes[14-18]. Experiments on a synthetic Escherichia coli system have shown that location on a plasmid helped the gene for a cooperative public good to invade, particularly in structured populations[18]. In addition, bioinformatic analyses across a range of species found that genes that code for extracellular proteins, many of which act as public goods, are more likely to be found on plasmids than the chromosome[15,19,20].
Figure 1
Three hypotheses for why selection might favour genes coding for extracellular proteins to be located on plasmids.
(a) Cooperation Hypothesis. Blue cells produce extracellular proteins which act as cooperative public goods, while red cells are ‘cheats’ which exploit this cooperation. Over time cheats grow faster than cooperators since they forgo the cost of public good production. However, because the gene for the extracellular protein is located on a plasmid, cooperators can transfer the gene to the cheats, turning them into cooperators, increasing genetic relatedness at the cooperative locus, and stabilising cooperation[14–18]. (b) Gain and Loss Hypothesis. The production of the extracellular protein is required in some environments, but not others. Transitions between these environments can result from temporal or spatial change. Cells are selected to either lose (Environment A) or gain (Environment B) the plasmid coding for the production of the extracellular protein. (c) Beyond Horizontal Gene Transfer Hypothesis. The location of a gene on a plasmid could provide a number of benefits, other than the possibility for horizontal gene transfer[38]. For example, when the quantity of extracellular protein required varies across environments (A versus B), plasmid copy number could be varied to adjust production[38]. Created with BioRender.com.
There are, however, three potential problems for the hypothesis that horizontal gene transfer favours cooperation. First, previous bioinformatic analyses made important first steps, but are not conclusive. One study examined only a single species, which may not be representative of all bacteria[15]. Two additional studies examined multiple species, but assumed that genes and genomes from the same and different species can be treated as independent data points, in a way that could have led to spurious results[19,20]. Statistical tests typically assume that data points are independent, and even slight non-independence can lead to heavily biased results (type I errors)[21,22]. There is an extensive literature in the field of evolutionary biology showing that species share characteristics inherited through common descent, rather than through independent evolution, and so cannot be considered independent data points[23-25]. Genomes are nested within species, and genes are nested within genomes, multiplying this problem of non-independence, analogous to the problem of pseudoreplication in experimental studies[26-29]. Phylogenetically-controlled bioinformatic analyses are required to address this problem of non-independence, and test the robustness of previous conclusions.
Second, from a theoretical perspective, while horizontal gene transfer can favour the initial invasion of cooperation, it is not clear if it favours the maintenance of cooperation in the long run[16]. For example, after a plasmid carrying a cooperative gene has spread through a population, a loss of function mutation could easily lead to a cheat plasmid evolving, which could then potentially outcompete the plasmid carrying the cooperative gene[16,30].
Theory is required that examines the maintenance as well as the invasion of cooperation, while accounting for important biological details, such as how plasmid transmission depends on the population frequency of the plasmid, and how frequently plasmids are lost, for example by segregation during cell division.
Third, there are alternative hypotheses for why genes coding for extracellular proteins might be preferentially carried on plasmids in some species (Figure 1)[20,31]. Bacteria can rapidly adapt to new and/or changing environments by acquiring new genes via horizontal gene transfer, and losing genes no longer required but costly to maintain (Figure 1b)[32-34]. Genes which facilitate adaptation to environmental variability are often those which code for molecules secreted outside the cell[34-37]. Consequently, we might expect to find genes for extracellular proteins on plasmids to facilitate rapid gain and loss of genes depending on environmental conditions, and not because they are cooperative per se. Alternatively, genes may be favoured to be on plasmids for reasons other than horizontal gene transfer (Figure 1c)[38]. For example, a higher plasmid copy number offers a mechanism for more expression of a gene, potentially even conditionally, in response to certain environmental conditions[38]. The benefit of being able to regulate gene expression in this way could be higher in genes which code for molecules that are secreted outside the cell, when different quantities of molecule are required in different environments. These different hypotheses are not mutually exclusive.
We addressed all three of these potential problems for the hypothesis that horizontal gene transfer favours cooperation. We first tested two predictions that would be expected to hold if horizontal gene transfer favours cooperation.
Specifically, cooperative genes would be more likely to be found on: (i) plasmids relative to chromosomes; (ii) more mobile plasmids relative to less mobile plasmids[14-20]. We used phylogeny-based statistical methods that control for the problem of non-independence, analysing 1632 genomes from 51 bacterial species, to examine the location of genes that code for extracellular proteins. We then used theoretical models, to examine whether horizontal gene transfer facilitates the evolution as well as the initial spread of cooperation.
Finally, we also tested alternative hypotheses for why genes coding for extracellular proteins might be preferentially carried on plasmids. We used three measures of environmental variability to ask whether species which had more variable environments were those most likely to carry genes for extracellular proteins on their plasmids. Additionally, we examined one of these measures in more detail, to help determine whether genes for extracellular proteins were located on plasmids so that they could be gained and lost easily (Figure 1b), or instead because of some additional benefit conferred by plasmid carriage (Figure 1c).
Results
Genomic Analyses
We used the approach developed by Nogueira et al.[15,19,20], applying PSORTb[39] to predict the subcellular location of every protein encoded by 1632 complete genomes from 51 diverse bacterial species (Extended Data Figure 1; Table S3). We also built upon the work of researchers who pointed out that extracellular (secreted) proteins are likely to provide a benefit to the local population of cells, and hence act as cooperative public goods[2,15,19,20,40]. The advantage of this method is that it allows a large number of genes to be examined, across multiple species.
Extended Data Fig. 1
Protein subcellular localisations.
Visualisation of all possible subcellular locations predicted by PSORTb. The left panel shows a cross-section of a typical Gram-negative bacterium and the right panel shows the equivalent for a Gram-positive bacterium. Both kinds of bacteria have an inner membrane, known as the cytoplasmic membrane. The main difference is that Gram-positive bacteria are surrounded by a thick layer of a molecule called peptidoglycan, while Gram-negative bacteria have a much thinner layer of peptidoglycan, and have an additional membrane. Created with BioRender.com.
Overall, we found the average bacterial genome had 2696 protein-coding genes on the chromosome(s), and 223 on the plasmid(s). Of these, an average of 57 genes (~2%) coded for the production of an extracellular protein, with 52 on the chromosome(s) and 5 on the plasmid(s). This means, on average, 1.9% of chromosome genes and 2.4% of plasmid genes coded for extracellular proteins. To control for the number of genomes per species, we first calculated the mean number of genes for each species, and then the mean of these species means. Therefore, the values above give an indication of the location of genes coding for extracellular proteins in an average genome. Genes with unknown protein localisations were not included (Chromosome: 26.2%; Plasmid: 38.3%). Across species, the proportion of genes coding for extracellular proteins for plasmid(s) was generally more variable than for the chromosome(s) (Figure S2). These patterns are very similar to those found previously[3,15,19,20].
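The two-step averaging described above (a mean per species, then the mean of those species means) can be sketched as follows. This is a minimal illustration only: the function name and the toy counts are hypothetical and are not taken from the dataset.

```python
from statistics import mean

def mean_of_species_means(genomes):
    """Average per species first, then average the species means.

    `genomes` is a list of (species, value) pairs, one per genome.
    """
    by_species = {}
    for species, value in genomes:
        by_species.setdefault(species, []).append(value)
    return mean(mean(values) for values in by_species.values())

# Hypothetical counts of extracellular-protein genes per genome (illustrative only).
toy = [
    ("species_A", 60), ("species_A", 62), ("species_A", 58), ("species_A", 60),
    ("species_B", 20),
]

pooled = mean(value for _, value in toy)   # dominated by the heavily sampled species_A
balanced = mean_of_species_means(toy)      # each species contributes exactly once
print(pooled, balanced)
```

In this toy case the pooled mean is pulled towards the species with four genomes, whereas the mean of species means weights both species equally.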
Extracellular proteins are not overrepresented on plasmids
We found that extracellular proteins were not more likely to be carried on plasmids compared to chromosomes (Figure 2). The difference in the proportion of genes that coded for extracellular proteins between plasmid and chromosome was not significantly different from zero across all species (MCMCglmm[41]; posterior mean = 0.004, 95% CI = -0.063 to 0.057, pMCMC= 0.87; n = 1632 genomes; R2 of species sample size = 0.47, R2 of phylogeny = 0.17; Table S2, row 1a). This result was robust to alternative forms of analysis. We also found no significant difference when we: (i) compared chromosomes to plasmids of only certain mobilities (Fig S3; Table S2, rows 20-22); (ii) analysed our data by two alternative methods, by looking at the ratio of proportions instead of the difference, or by considering only whether the plasmid proportion was greater than the chromosome proportion, removing any effect of the magnitude of this difference (Extended Data Figure 2; Table S2, rows 2 and 3). Our analyses use a bacterial phylogeny, which assumes plasmid evolution follows bacterial phylogeny, but we also found no significant pattern if we ignored phylogeny and analysed species as independent data points (Figure 2; Table S2, row 1b; pMCMC = 0.644).
Fig 2
Extracellular proteins are not overrepresented on plasmids.
For each species we calculated the mean difference between plasmid(s) and chromosomes in the proportion of genes coding for extracellular proteins. Species in blue have a difference greater than zero, meaning their plasmid genes code for a greater proportion of extracellular proteins than chromosome genes. Species in red have a difference less than zero, meaning their chromosome genes code for a greater proportion of extracellular proteins than plasmid genes. Error bars indicate the standard error. The dot and error bar at the top of the graph indicate the mean difference and 95% Credible Interval given by a MCMCglmm analysis across all species, controlling for phylogeny and sample size. We arcsine square root transformed proportion data before calculating the difference. Overall, there is no consistent trend that genes coding for extracellular proteins are more likely to be carried on plasmids (i.e. no consistent trend towards species in blue).
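The transformation mentioned in this caption can be sketched as below. This is an assumed, minimal form of the arcsine square root transform (the caption does not give the exact implementation), and the function names are hypothetical. The example values 0.024 and 0.019 are the average plasmid and chromosome proportions reported in the text, used here purely for illustration.

```python
import math

def arcsine_sqrt(p):
    """Arcsine square root transform of a proportion in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(p))

def back_transform(x):
    """Inverse of arcsine_sqrt, for displaying values as proportions again."""
    return math.sin(x) ** 2

def transformed_difference(plasmid_prop, chromosome_prop):
    """Plasmid minus chromosome proportion, on the transformed scale."""
    return arcsine_sqrt(plasmid_prop) - arcsine_sqrt(chromosome_prop)

# Positive values mean a larger extracellular proportion on plasmids ("blue" species).
diff = transformed_difference(0.024, 0.019)
print(diff > 0)
```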
Extended Data Fig. 2
Substantial variation within and between species in the genomic location of extracellular proteins.
The x-axis is the % of genomes in each species where the proportion of plasmid proteins predicted as extracellular is greater than the proportion of chromosome proteins predicted as extracellular. Crucially, this considers only whether the plasmid proportion is greater than the chromosome proportion for each genome, rather than also considering the magnitude of the difference (Figure 2). Error bars are the 95% Confidence Intervals from a binomial test on each species, comparing the number of genomes which have plasmid proportion > chromosome proportion to a null prediction of 50% of genomes. Species in blue have >50% of genomes where plasmid > chromosome extracellular proportion, meaning extracellular proteins are significantly overrepresented on plasmids. Species in red have <50% of genomes where plasmid > chromosome extracellular proportion, meaning extracellular proteins are significantly overrepresented on chromosomes. Species in grey have a 95% CI which overlaps 50%, so extracellular proteins are not significantly overrepresented on either plasmids or chromosomes in these species.
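The per-species classification described in this caption can be sketched with an exact binomial confidence interval. The caption does not say which interval method was used, so the Clopper-Pearson interval below is an assumption, and all function names and the example counts are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    pmf = [binom_pmf(i, n, p) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-7)))

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval, found by bisection on tail probabilities."""
    def upper_tail(p):          # P(X >= k), increasing in p
        return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

    def one_minus_lower(p):     # 1 - P(X <= k), increasing in p
        return 1 - sum(binom_pmf(i, n, p) for i in range(k + 1))

    def solve(f, target):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(upper_tail, alpha / 2)
    upper = 1.0 if k == n else solve(one_minus_lower, 1 - alpha / 2)
    return lower, upper

def classify(k, n):
    """Colour a species by whether its 95% CI excludes the 50% null."""
    lower, upper = clopper_pearson(k, n)
    if lower > 0.5:
        return "blue"   # plasmid proportion greater in significantly more than half of genomes
    if upper < 0.5:
        return "red"    # chromosome proportion greater in significantly more than half
    return "grey"       # interval overlaps 50%

# Hypothetical species: k genomes (out of 20) with plasmid > chromosome proportion.
print(classify(18, 20), classify(10, 20), classify(2, 20))
```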
The lack of an overall significant result was clear when looking at the raw data for the different species that we examined (Figure 2; Extended Data Figure 2). There was considerable variation across species in the location of genes coding for extracellular proteins. Overall, extracellular proteins were more likely to be on plasmids in 51% of species (26/51), and more likely to be on the chromosome(s) in 49% (25/51) of species (Extended Data Figure 2). For example, in Bacillus anthracis genes coding for extracellular proteins were three times more likely to be on plasmids, whereas in Acinetobacter baumannii genes coding for extracellular proteins were three times more likely to be on the chromosome(s) (Extended Data Figure 2). Clearly, across species, genes coding for extracellular proteins are not consistently more likely to be on plasmids.
As a control, we also analysed the genomic location of the genes coding for all other classes of protein (Extended Data Figure 1). Specifically, we analysed genes that coded for the production of Cytoplasmic, Cytoplasmic Membrane, Periplasmic, Outer Membrane and Cell Wall proteins. We found that none of these protein localisations were significantly overrepresented on plasmids or chromosomes across the 51 species (Extended Data Figure 3; Table S2, rows 5-10). Plasmids are highly variable in the genes they carry.
Extended Data Fig. 3
Difference in plasmid and chromosome proportion for all protein classes predicted by PSORTb.
The x-axis is the difference in plasmid and chromosome extracellular proportions, as in Figure 2. The y-axis is all possible subcellular locations predicted by PSORTb. These protein ‘classes’ are ordered along the y-axis by location within the cell, from intracellular to increasingly extracellular. Each dot is the posterior mean and 95% Credible Intervals from a MCMCglmm[42] on the difference in plasmid and chromosome proportion across all species, accounting for phylogeny and sample size. The only proteins significantly overrepresented in either direction are unknown proteins, which make up a higher proportion of plasmid proteins in all species we analysed.
Importance of controlling for non-independence of genomes
Our results contrast with previous studies, which found that plasmid genes code for proportionally more extracellular proteins than chromosomes[15,19,20]. The first of these studies found this pattern across 20 Escherichia coli genomes[15]. We also found that genes coding for extracellular proteins in E. coli were more likely to be found on plasmids (Figure 2; Extended Data Figure 2). However, Figure 2 shows that this is not a consistent pattern across species: approximately half (25/51) of the species we analysed showed a pattern in the opposite direction, with genes coding for extracellular proteins more likely to be on their chromosome(s) than their plasmid(s).
Two subsequent, multi-species studies found that plasmid genes were significantly more likely to code for extracellular proteins than chromosome genes[19,20]. These studies used statistical tests such as the Wilcoxon signed-rank test to ask whether there was a consistent pattern, using bacterial genomes as independent data points. When we analysed our data with the same statistical methods used in these studies, we also obtained a significant result (Wilcoxon signed-rank test; V= 826530, p-value <0.001, R2 = 0.385; n = 1632 plasmid-chromosome pairs). When analysing other questions, Garcia-Garcera & Rocha[20] used MCMCglmm to control for phylogeny.
Why does using bacterial genomes as independent data points lead to a significant result? By using a Wilcoxon signed-rank test, at the level of the genome, we are implicitly assuming that all the genomes analysed are: (i) independent from one another; (ii) a representative sample of bacteria in nature. Neither of these is true for multi-species genomic datasets. First, due to shared ancestry, species are not independent from one another, and so neither are genomes in such analyses[24,42]. Even a slight lack of independence can lead to heavily biased results in statistical analyses and spurious conclusions[21].
Second, genomic databases tend to have a disproportionate abundance of certain species and genera. This will bias the results towards commonly sequenced species.
Consequently, when asking questions across species, it is inappropriate to treat all the genomes in genomic datasets as independent data points. When we performed an analysis analogous to the Wilcoxon signed-rank test, using the same untransformed data which produced a significant result above, but controlled for the number of genomes per species and the non-independence of species, we no longer found any significant difference between the proportion of plasmid and chromosome genes coding for extracellular proteins (MCMCglmm; posterior mean = 0.017, 95% CI = -0.021 to 0.057, pMCMC = 0.332; n = 1632 plasmid-chromosome paired differences in extracellular proportion; R2: species sample size = 0.46, phylogeny = 0.34; Table S2, row 4). Furthermore, we found that the number of genomes per species and the non-independence of species explained 46% and 34% of the variation in the data respectively (paired plasmid and chromosome differences across our 1632 genomes). Taken together, this illustrates that it is not our data which disagree with previous studies, but instead our use of statistical analyses appropriate for multi-genome, multi-species datasets[23-25].
These data also illustrate the importance of examining effect sizes, and not just whether results are statistically significant. With large sample sizes it is possible to get results that are significant but not biologically important. The percentage of variance explained that is considered biologically significant can depend upon the kind of data you are examining and the field of research, but a baseline of 5-10% seems reasonable for many areas of evolutionary biology (Supp. Info. 1)[43-45]. When bacterial genomes are assumed to be independent data points in across-species analyses, this leads to inflated sample sizes.
Consequently, even when results are statistically significant at P<0.05, they can still only explain 1-2% of the variation in the data, which is clearly not biologically significant. The flip side of such considerations is that effect sizes and examination of raw data at the species level (e.g. Figure 2) are also useful checks against non-significant results due to a lack of statistical power (type II errors).
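The effect of uneven sampling discussed in this section can be illustrated with a toy calculation. This sketch swaps in an exact sign test as a simple stand-in for the Wilcoxon signed-rank test, and it addresses only the number of genomes per species, not phylogenetic non-independence (which the analyses above handle with MCMCglmm). The data are hypothetical: one heavily sequenced species plus ten sparsely sampled ones.

```python
from math import comb

def sign_test_p(n_pos, n):
    """Exact two-sided sign test against a 50:50 null."""
    k = max(n_pos, n - n_pos)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical plasmid-minus-chromosome differences, one value per genome.
diffs = {"species_0": [0.01] * 200}            # heavily sequenced, all positive
for i in range(1, 11):                         # ten species with 5 genomes each
    diffs[f"species_{i}"] = [-0.01] * 5 if i % 2 else [0.01] * 5

# (1) Treat every genome as an independent data point.
all_genomes = [d for values in diffs.values() for d in values]
genome_p = sign_test_p(sum(d > 0 for d in all_genomes), len(all_genomes))

# (2) Use one data point per species (the species mean).
species_means = [sum(values) / len(values) for values in diffs.values()]
species_p = sign_test_p(sum(m > 0 for m in species_means), len(species_means))

print(f"genome-level p = {genome_p:.3g}; species-level p = {species_p:.3g}")
```

Here the genome-level test is driven almost entirely by the single over-represented species, while the species-level test sees six positive species out of eleven and finds no consistent pattern.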
Plasmids with higher mobility do not carry more genes for extracellular proteins
We then tested another prediction of the cooperation hypothesis: cooperation is more likely to be favoured when coded for on more mobile plasmids[14-18]. We used data from the MOBsuite database to assign plasmids to one of three levels of mobility (Fig 3a)[46,47]. We classified conjugative plasmids, which carry all genes necessary to transfer, as the most mobile; mobilizable plasmids, which depend upon the machinery of conjugative plasmids to transfer, as having intermediate mobility; and non-mobilizable plasmids, which cannot be transferred via conjugation, as the least mobile (Fig 3a)[46,48].
Figure 3
Plasmid mobility and extracellular proteins.
(a) We divided plasmids into three mobility types: non-mobilizable (lowest or no mobility); mobilizable (intermediate mobility); conjugative (highest mobility). Blue cells are potential plasmid donors, while red cells are potential recipients. Each panel shows when plasmid transfer is possible for one of the three plasmid mobility types. Non-mobilizable plasmids cannot be transferred. Mobilizable plasmids cannot be transferred alone, but they carry enough genes to ‘hijack’ the machinery of a conjugative plasmid that is in the same cell. Conjugative plasmids carry all genes necessary to transfer independently. Created with BioRender.com. (b) The 40 species which carried plasmids of all three mobilities are shown, with a panel for each of these species. Dots in each panel indicate the mean % of genes coding for extracellular proteins of all plasmids of each mobility level. The lines are the linear regression of these three points, coloured blue if the slope is positive and orange if the slope is negative. Note that each row of species has a different y-axis scale, indicated on the left, which applies to all species in that row. We arcsine square root transformed proportion data before calculating the mean for each species, and then back-transformed these values for display of the data. Overall, there is no consistent trend for genes that code for extracellular proteins to be on more mobile plasmids.
Genes coding for extracellular proteins were not more likely to be on plasmids with higher transfer rates (Figure 3b). Examining the slope of the regression between plasmid mobility and the proportion of genes coding for extracellular proteins, we found no consistent pattern across species (MCMCglmm; posterior mean = 0.006, 95% CI = -0.040 to 0.052, pMCMC = 0.73; n = 40; Table S2, row 11). This lack of a significant relationship was robust to different forms of analysis, including an examination of the means of each mobility type of each species (Figure S4; Table S2, row 12). We also found no correlation between the proportion of a species’ plasmids which can transfer and how overrepresented or underrepresented extracellular proteins are on plasmids compared to chromosomes (Extended Data Figure 4; Table S2, rows 16 and 17).
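The per-species slope examined here can be sketched as an ordinary least-squares fit through the three mobility levels. The equally spaced scores 0, 1 and 2 are an assumption made for illustration (the numeric coding is not stated above), and the names and example proportions are hypothetical.

```python
# Assumed ordinal coding of the three mobility types (equal spacing is an assumption).
MOBILITY_SCORE = {"non-mobilizable": 0, "mobilizable": 1, "conjugative": 2}

def mobility_slope(mean_prop_by_type):
    """Least-squares slope of extracellular proportion against mobility score."""
    xs = [MOBILITY_SCORE[mobility] for mobility in mean_prop_by_type]
    ys = list(mean_prop_by_type.values())
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical species: mean proportion of extracellular-protein genes per mobility type.
rising = {"non-mobilizable": 0.02, "mobilizable": 0.03, "conjugative": 0.04}
falling = {"non-mobilizable": 0.04, "mobilizable": 0.03, "conjugative": 0.02}

print(mobility_slope(rising) > 0, mobility_slope(falling) < 0)
```

A positive slope corresponds to a blue regression line in Figure 3b and a negative slope to an orange one; the cooperation hypothesis predicts consistently positive slopes across species.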
Extended Data Fig. 4
No effect of plasmid mobility on the difference in plasmid and chromosome proportion of genes coding for extracellular proteins.
The x-axis is the % of a species’ plasmids which are conjugative or mobilizable. The y-axis shows the difference in the plasmid and chromosome proportions of genes coding for extracellular proteins, as in Figure 2. Each dot is the mean for all genomes in a species. Species in blue are those with genes coding for extracellular proteins overrepresented on plasmids, while species in red have genes coding for extracellular proteins overrepresented on chromosomes.
To examine our assumption that mobilizable plasmids are likely to be less mobile than conjugative plasmids, we examined how frequently these two kinds of plasmids co-occurred within a genome. If mobilizable plasmids are present in the same cell as conjugative plasmids, they could be transmitted at similar rates. However, we found that of genomes with a mobilizable plasmid(s), 60% did not also carry a conjugative plasmid (434/727). In addition, when mobilizable plasmids did co-occur with a conjugative plasmid, they did not have a higher proportion of genes coding for extracellular proteins (Supp. Info. 1; Figure S6). A caveat here is that our estimates of transfer rates across different types of plasmid are relative, and it would be very useful to obtain quantitative estimates of transfer rates.
Theoretical Stability of Cooperation
Our empirical results did not support the theoretical prediction that cooperative genes should be overrepresented on plasmids, relative to the chromosome[14-18,49]. Consequently, we then extended existing theory, to examine whether we could find conditions where cooperative genes were not predicted to be overrepresented on plasmids. We investigated the consequences of two factors: (1) allowing for a greater range of possible genetic architectures, especially plasmids that lacked the gene for cooperation (non-cooperative or ‘cheat’ plasmids); and (2) examining the evolutionary stability (maintenance) of cooperation, not just its initial invasion[16,49].
We examined two possible reasons for why cooperative genes could be overrepresented on plasmids, relative to the chromosome. First, horizontal gene transfer on a plasmid could allow cooperation to be favoured in conditions where it would otherwise not be favoured[14-18]. For example, plasmid transfer can turn non-cooperators into cooperators, and increase relatedness at the loci for cooperation[17]. Second, even if horizontal gene transfer did not increase the range of biological scenarios (parameter space) where cooperation was favoured, there could be selection for cooperation to be coded for on a plasmid, rather than a chromosome.
We assumed an infinite population of haploid individuals (bacterial cells). Individuals may carry a cooperative gene, that codes for public goods production, either on a plasmid, or the chromosome, or both (redundancy). We also allowed for the possibility of: non-cooperative plasmids and chromosomes; plasmid-free cells; a cost of plasmid carriage (C).
Each generation, the population is divided into patches, each founded by N independent cells. Cells reproduce clonally until there are a large number of cells per patch.
Cells are then randomly shuffled into pairs on their patch and, if a plasmid-free individual has a plasmid-bearing partner, with probability β, the plasmid-free individual acquires a copy of its partner’s plasmid (horizontal gene transfer). Individuals with a gene for cooperation then produce a public good, at a cost C, which generates a benefit B that is shared between all members of the patch. Individuals then survive according to their fitness. Plasmid-bearing individuals lose their plasmid with probability s. Finally, individuals disperse to found new patches.
Consistent with previous analyses, we found that, in the short term, horizontal gene transfer on a plasmid can initially help cooperation invade (Figure 4)[14-18]. Horizontal gene transfer increased the frequency of cooperation, by turning non-cooperators into cooperators, which also increases relatedness at the cooperative locus on the plasmid[14-18,49]. Relatedness is increased because, in the short term, whilst plasmids are spreading from rarity, there are many plasmid-free cells available, meaning plasmids have many opportunities to be transferred, generating genetic similarity.
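The transfer and loss steps of the life cycle described above can be sketched in isolation. This is a deliberately stripped-down illustration and not the model analysed here: it assumes a single well-mixed population, ignores patches, selection, cooperation and plasmid costs, and tracks only the frequency p of plasmid-bearing cells. The function names are hypothetical; β and s are the transfer and loss probabilities defined in the text.

```python
def transfer_events(p, beta):
    """Fraction of cells that newly acquire a plasmid in one round of pairing.

    A plasmid-free cell (frequency 1 - p) meets a plasmid-bearing partner with
    probability p and then acquires the plasmid with probability beta.
    """
    return beta * p * (1 - p)

def plasmid_frequency_step(p, beta, s):
    """One generation: pairwise transfer followed by plasmid loss."""
    after_transfer = p + transfer_events(p, beta)
    return after_transfer * (1 - s)

def iterate(p0, beta, s, generations):
    """Plasmid frequency after repeated transfer-and-loss generations."""
    p = p0
    for _ in range(generations):
        p = plasmid_frequency_step(p, beta, s)
    return p

no_loss = iterate(0.01, beta=0.5, s=0.0, generations=200)
some_loss = iterate(0.01, beta=0.5, s=0.2, generations=200)
print(no_loss, transfer_events(no_loss, 0.5))      # plasmid near fixation, transfer near zero
print(some_loss, transfer_events(some_loss, 0.5))  # intermediate frequency, ongoing transfer
```

In this simplified setting, a plasmid that is never lost spreads until almost no plasmid-free recipients remain, so transfer events dwindle, whereas some plasmid loss keeps recipients available and transfer continuing.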
Figure 4
Plasmids facilitate the invasion but not the maintenance of cooperation.
In parts (a) and (b), we plot the results of our theoretical model for the case when there is no plasmid loss (s=0). (a) Cooperation is only maintained at equilibrium (green shaded area) when it is favoured at the chromosomal level RB > C, which is unaffected by plasmid transfer (β). (b) Plasmids can facilitate the invasion and initial spread of cooperation (blue line shoots above red line), but cooperative plasmids are eventually outcompeted by cheat plasmids (red line goes to 1). We note that, in (b), all individuals are chromosomal defectors – chromosomal cooperation was permitted, but did not evolve in this run. To generate the plots in (a) and (b), we assumed the following parameter values: (a & b) B = 1.435, C = 0.1, C = 0.2; (b) β = 0.5, N = 16.
In contrast, we found that transfer on a plasmid did not appreciably increase the range of parameter space where cooperation was maintained at evolutionary equilibrium (Fig 4a & 5) (Supp. Info. 4). First, in the absence of plasmid loss (s=0), cooperation was only favoured when RB-C>0, where R is the genetic relatedness at the chromosomal (individual) level (R=1/N). Cooperation was therefore only favoured on the plasmid when it provided a kin selected benefit at the level of the chromosome (individual), as predicted by Hamilton’s rule[50,51].
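The condition stated above for the case without plasmid loss, RB - C > 0 with R = 1/N, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, C is taken to be the cost of producing the public good, and the parameter values are made up rather than those used for the figures.

```python
def cooperation_favoured(B, C, N):
    """Hamilton's rule with chromosomal relatedness R = 1/N (no plasmid loss)."""
    R = 1 / N
    return R * B - C > 0

def largest_founder_number(B, C, n_max=1000):
    """Largest number of founders N for which R*B - C > 0 still holds (0 if none)."""
    favoured = [N for N in range(1, n_max + 1) if cooperation_favoured(B, C, N)]
    return max(favoured) if favoured else 0

# Hypothetical benefit and cost: cooperation requires B/N > C, i.e. N < B/C = 8.
print(cooperation_favoured(B=2.0, C=0.25, N=4))    # small patches, high relatedness
print(cooperation_favoured(B=2.0, C=0.25, N=16))   # large patches, low relatedness
print(largest_founder_number(B=2.0, C=0.25))
```

Because R = 1/N, the condition rearranges to N < B/C, so in this case the number of founders per patch, and not the transfer rate β, determines whether cooperation is favoured.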
Figure 5
Plasmid loss can favour the maintenance of cooperation.
We plot the results of our theoretical model for different levels of plasmid loss (s=0-1). The areas encapsulated by the coloured lines show the regions of parameter space where cooperation is polymorphic at equilibrium (i.e. population comprises some cooperators & some defectors). When plasmid loss is absent (s=0), there is no polymorphism (encapsulated area collapses to nothing), meaning cooperation is only maintained at equilibrium (at fixation) when it is favoured at the chromosomal level RB > C (to the left of the black dotted line) (R=1/N). When plasmid loss is intermediate (s=0.1,0.2,0.3,0.4), cooperation can be polymorphic at equilibrium (encapsulated areas), with cooperation being disfavoured in the encapsulated areas to the left of the black dotted line, and favoured in the encapsulated areas to the right of the black dotted line, relative to when plasmids are absent (β=0). When plasmid loss is high (s≥0.5), or when transmission (β) is low, plasmids fail to persist at equilibrium, meaning they have no long-term effect on cooperation (encapsulated areas collapse to nothing). Overall, plasmid loss can facilitate cooperation, but only if plasmid loss (s) is intermediate and transmission (β) is high. To generate this plot, we assumed the following parameter values: B = 1.435, C = 0.1, C = 0.2 (same as Fig. 4).
The reason for this result is that, in the absence of plasmid loss (s=0), plasmids continue to increase in frequency after invasion, ultimately reaching fixation in the population. This means that, in the long term, there are no plasmid-free individuals left to infect, which means that the overall level of horizontal gene transfer in the population goes to zero. Consequently, competition between plasmids with and without a cooperative gene (cooperators and cheats) becomes analogous to the scenario in which the gene for cooperation is on the chromosome[17].
Second, when plasmids can be lost (s>0), this can favour cooperation on plasmids, but only in certain areas of parameter space (Figure 5). Plasmid loss means that plasmids do not reach fixation in the population, and so some plasmid transfer still occurs in the evolutionary long term, increasing relatedness at the cooperative plasmid locus. This increased relatedness may favour cooperation on the plasmid, when it would not otherwise be favoured on the chromosome, if plasmids are transferred rapidly (high β) and rates of plasmid loss are intermediate (Figure 5). Specifically, plasmids need to be lost quickly enough that plasmid relatedness appreciably deviates from chromosomal relatedness, but not so quickly that plasmids are not maintained (Figure 5). Another factor that might prevent plasmids from reaching fixation is a constant, high influx of plasmid-free cells (immigration).
Overall, our model suggests that horizontal gene transfer can help cooperation initially invade, but will then often have less influence on whether cooperation is maintained in the long term (Figures 4 & 5). We are not saying that horizontal gene transfer can never favour cooperation, just that there is an appreciable area of parameter space where it does not. Consequently, our model provides an explanation for why cooperative genes are not consistently overrepresented on plasmids (Figures 2 & 3).
An analogous theoretical result for the case without plasmid loss (s=0) was also found in a meta-population model by Mc Ginty et al.[16]. Our predictions are consistent with experiments carried out by Bakkeren et al.[30], who found that location on a conjugative plasmid could help a cooperative trait invade in Salmonella Typhimurium (S.Tm), but that this was only stable with strong population bottlenecks (high relatedness). Dimitriu et al.[18] found that cooperative plasmids were favoured in structured but not well-mixed populations, and that cooperation was favoured more during ‘epidemic spreads’ into a population.
In addition, we found that, when cooperation is favoured, cooperative traits are not more likely to be favoured on, or transferred to, plasmids. The reason is that, when cooperation is favoured, non-cooperators (cheats) are purged from the population, which means there is no extra fitness benefit of coding for the cooperative trait on a plasmid rather than the chromosome. Consequently, our results suggest that horizontal gene transfer only favours cooperation in a restricted area of parameter space. However, horizontal gene transfer could still matter through transient dynamics, with cooperation being favoured temporarily (Figure 4), or when cooperation has other consequences, such as increasing plasmid transmission[52,53]. Another important factor is the rate of horizontal gene transfer. While plasmids clearly transmit fast enough to influence evolution, the transfer rates per cell per generation might not be high enough to significantly influence relatedness at the locus for cooperation (i.e. β may not be high enough)[54].
Alternate hypotheses
Finally, we examined whether alternate hypotheses may better explain the considerable variation in the location of genes coding for extracellular proteins across species. Species which live in more variable environments may be more likely to carry extracellular genes on plasmids. This could be expected for different reasons, including plasmid transfer allowing genes for different environments to be gained and lost (Figure 1b), or plasmids conferring some other advantage not associated with horizontal gene transfer, such as allowing copy number to be conditionally adjusted (Figure 1c)[31,32,38,55]. There are a number of different ways to classify environmental variability, and so we used three different methods.
Broad host-range pathogens are most likely to carry genes for extracellular proteins on plasmids
We first used the diversity of pathogen hosts as a proxy for environmental variability. Although this does not capture all environmental variability experienced by species in our data set, pathogenicity is a key aspect of bacterial lifestyle that has been suggested to be important for plasmid gene content, such as antibiotic resistance and virulence factors[6,40,56,57]. We divided species into three categories: pathogens with broad host-range, pathogens with narrow host-range, and non-pathogens. Broad host-range pathogens are expected to encounter more variable environments than narrow host-range pathogens.
We found that pathogens with a broad host-range were more likely to carry genes coding for extracellular proteins on their plasmids, compared with both narrow host-range pathogens and non-pathogens (Fig 6a). Specifically, we compared the difference in the proportion of genes coding for extracellular proteins between plasmid(s) and chromosome(s) across these three categories of species (MCMCglmm; Narrow compared to Broad host-range pathogens: posterior mean = -0.222, 95% CI = -0.322 to -0.123, pMCMC < 0.001; Non-pathogens compared to Broad host-range pathogens: posterior mean = -0.161, 95% CI = -0.252 to -0.067, pMCMC < 0.001; n = 701 genomes; R2 of pathogenicity/host-range = 0.35, R2 of species sample size = 0.28, R2 of phylogeny = 0.11; Table S2, row 23). There was no significant difference between narrow host-range pathogens and non-pathogens in the proportion of genes coding for extracellular proteins on their plasmids compared to chromosome(s) (MCMCglmm; Non-pathogens compared to Narrow host-range pathogens: posterior mean = 0.031, 95% CI = -0.065 to 0.127, pMCMC = 0.482; n = 389; Table S2, row 25). These patterns hold irrespective of whether we included species that we could not reliably classify into either category, such as opportunistic pathogens, in our analyses (Extended Data Figure 5).
Figure 6
Pathogenicity, host-range and the location of genes coding for extracellular proteins.
We have divided species into either pathogens or non-pathogens, with pathogens further categorised into those with a narrow or broad host-range. The y-axis in (a) shows the difference in the proportion of genes on plasmids and chromosomes coding for extracellular proteins – this is the same as the x-axis in Figure 2. The y-axes in (b)(i) and (b)(ii) show the difference in the proportion of a subset of genes coding for extracellular proteins on plasmids and chromosomes which are predicted by MP3 as either (i) pathogenic or (ii) non-pathogenic. Each dot is the mean for all genomes in a species. Species in blue are those with the relevant subset of extracellular proteins overrepresented on plasmids, while species in red are those with the subset of extracellular proteins overrepresented on chromosomes. (c) Phylogeny based on recently published maximum likelihood tree using 16S ribosomal protein data[64]. The inner ring indicates whether extracellular proteins were more likely to be coded for on the plasmid(s) or chromosome(s), as in Figure 2. The outer ring indicates how we classified each species’ pathogenicity, and the presence or absence of diagonal lines for pathogens indicates narrow or broad host-range, respectively. Species with a pink or green label in the outer ring are those included in (a) and (b), since for these we could be reasonably confident of whether or not pathogenicity was an important and consistent aspect of their lifestyle. Overall, pathogens with a broad host-range are more likely to have genes coding for extracellular proteins, and particularly those involved in pathogenicity, on their plasmids.
Extended Data Fig. 5
No difference in where extracellular proteins are coded for in pathogens compared to non-pathogens.
The y-axis shows the difference in the plasmid and chromosome proportion of genes coding for extracellular proteins. Each dot is the mean for all genomes in a species. Species in blue are those with genes coding for extracellular proteins overrepresented on plasmids, while species in red have genes coding for extracellular proteins overrepresented on chromosomes. Species were categorised as pathogens or non-pathogens; those we could not classify as either are shown in the ‘Opportunistic + others’ category. The black bars indicate the mean for all species in each category.
Plasmids of broad host-range pathogens carry many pathogenicity genes
We suspected that the additional extracellular proteins coded for by plasmids of broad host-range species, compared to narrow host-range species, may be particularly involved in facilitating pathogenicity[40,56,57]. To investigate this, we used the program MP3[58] to assign each extracellular protein as either ‘pathogenic’ or ‘non-pathogenic’.
We found that plasmids of broad host-range pathogens were particularly enriched with extracellular proteins involved in facilitating pathogenicity, compared to plasmids of narrow host-range species (Figure 6b(i)). Specifically, we found that pathogens with a broad host-range were significantly more likely to code for pathogenic extracellular proteins on their plasmids compared to narrow host-range species (Figure 6b(i)) (MCMCglmm; Narrow compared to Broad host-range pathogens: posterior mean = -0.209, 95% CI = -0.350 to -0.086, pMCMC = 0.012; n=474 genomes; Table S2, row 26). In contrast, the relative location of non-pathogenic extracellular proteins did not vary between broad and narrow host-range pathogens (Figure 6b(ii)) (MCMCglmm; Narrow compared to Broad host-range pathogens: posterior mean = -0.036, 95% CI = -0.115 to 0.040, pMCMC = 0.296; n=474 genomes; Table S2, row 27). Consequently, the excess of genes coding for extracellular proteins on the plasmids of broad host-range species (Figure 6a) appears to arise due to an excess of pathogenicity genes coding for extracellular proteins (Figure 6b).
Most genomic databases are biased towards species that interact with and/or infect humans, so we examined whether human pathogens had driven the above results. In our dataset, 5 out of 10 broad host-range species and 3 out of 5 narrow host-range species can infect humans. We found no significant difference in how likely both pathogenic and non-pathogenic extracellular proteins were to be on plasmids of human pathogens compared to non-human pathogens. We also found that while host-range had a significant effect on how likely plasmids were to code for pathogenic extracellular proteins, whether a species could infect humans had no significant effect (Table S2, rows 28 to 30).
Pathogenic extracellular proteins could be preferentially coded for on plasmids to facilitate their gain and loss (Figure 1b: Gain and loss hypothesis), or because of some other benefit provided by being carried on a plasmid (Figure 1c: Beyond horizontal gene transfer hypothesis). We tested these possibilities by examining whether pathogenic extracellular proteins were more likely to be on plasmids that transfer at higher rates. This would be predicted by the gain and loss hypothesis, but not the beyond horizontal gene transfer hypothesis. We found that plasmids with higher mobility did not code for more pathogenic extracellular proteins. Specifically, across broad host-range pathogen species, the slope of the regression between plasmid mobility and the proportion of genes coding for pathogenic extracellular proteins was not consistently positive (Figure S7) (MCMCglmm; posterior mean = -0.020, 95% CI = -0.224 to 0.185, pMCMC = 0.774; n=7; Table S2, row 31). This lack of a significant relationship was robust to additional forms of analysis, such as considering all pathogenic species, including narrow host-range pathogens and those not carrying plasmids of all three mobility types (Figure S8; Table S2, rows 32 and 33).
Taken together, our results are most consistent with the hypothesis that genes coding for extracellular proteins are overrepresented on plasmids when plasmid carriage provides a benefit other than mobility (Figure 1c). A number of other factors may influence which genes are carried on plasmids, beyond horizontal gene transfer.
First, there is evidence that increasing the copy number of plasmids can lead to increasing rates of evolution in the genes they carry[59], and it also may act as a mechanism to increase the expression of genes carried on plasmids[60,61]. For example, increased expression of genes coding for extracellular public goods such as virulence factors could help invasion of a host and utilisation of host resources. This could be particularly beneficial for broad host-range pathogens that frequently invade a variety of different hosts. Copy number of plasmids has also recently been shown to lead to genetic dominance effects[55], with likely implications for the phenotypes of genes selected for plasmid carriage[55]. Second, plasmids compete with their bacterial hosts for resources such as replication machinery and nucleotides[62,63]. To resolve this competition, plasmids should be under selection to reduce their cost to the host, with a likely impact on their gene content. For example, extracellular proteins are, on average, cheaper to produce than intracellular proteins[15,20]. Plasmid-host competition could consequently select for plasmids to carry more genes coding for cheaper proteins, and so more extracellular proteins. Our conclusion here should be seen as tentative, as some form of the gain and loss hypothesis (Figure 1b) could still be argued to be consistent with the data, if it is just the potential for horizontal gene transfer that matters, and not the rate.
Number of environments and core vs accessory genes
To further examine a potential association with environmental variability, as could be predicted by both hypotheses b (“Gain and Loss”) and c (“Beyond Horizontal Gene Transfer”), we also looked at two additional measures of environmental variability: (i) the number of five broad environments a species was sequenced in[20,65,66]; (ii) the proportion of a species’ genomes that is composed of ‘core’ genes, which are those found in all genomes of the species – species which experience more variable environments appear to have relatively smaller core genomes[32]. We found no significant correlation between either of these measures and the likelihood that genes coding for extracellular proteins were carried on plasmids (Extended Data Figure 6) (Supp. Info. 1; Table S2, rows 35 and 37). Garcia-Garcera & Rocha[20] previously analysed a different but related question, examining the type of environment, and also used a MCMCglmm to control for the phylogenetic structure of the data (Supp. Info. 1). Our finding of no correlation between these two measures of environmental variability and whether plasmids code for extracellular proteins is in contrast to our above results with respect to pathogen host-range (Figure 6). This suggests that hypothesis c, which our data is most consistent with, may be important for pathogens in particular, but not necessarily across all bacterial species and lifestyles.
Extended Data Fig. 6
Additional measures of environmental variability.
We used two additional methods to estimate the environmental variability encountered by these species. (a) The x-axis shows published data on the number of five broad environments each species was recorded in, which we supplemented with information from the literature to include all species. (b) The x-axis shows the proportion of each species’ genes which are ‘core’ genes, meaning they are found in all members of the species. The y-axis in both graphs shows the difference in the proportion of genes on plasmids and chromosomes coding for extracellular proteins. Each dot is the mean for all genomes in a species. Species in blue are those with extracellular proteins overrepresented on plasmids, while species in red are those with extracellular proteins overrepresented on chromosomes. For both these measures, we found no significant correlation with the genomic location of genes coding for extracellular proteins across species.
Complementary Analyses
There are a number of directions in which our analyses could be expanded. We focused on plasmids because they have been the focus of previous theoretical and empirical work[14,16-18]. Other mobile genetic elements include bacteriophages and integrative conjugative elements[67,68]. Comparing core and accessory genes could be a potential way to lump together all causes of horizontal gene transfer[15,19]. We considered the relative transfer rates among mobility types; quantitative estimates of plasmid transfer rates would be very useful for further examination of plasmid mobility[48,54,69-71]. We followed previous genomic studies by using extracellular proteins as indicators of cooperative traits[2,15,19,20]. The advantages of this approach are that: (i) we could compare our results with those from previous studies; (ii) secretion systems are highly conserved, allowing us to examine a large number of species, where detailed genetic annotations are lacking; (iii) cooperation mediated by extracellular proteins is usually controlled by only one gene, making them potentially more suitable for plasmid carriage compared to cassettes of multiple genes[72,73]. However, while extracellular proteins are likely to be cooperative traits, not all cooperative genes code for extracellular proteins (e.g. secondary metabolites such as siderophores), and not all extracellular proteins are involved in cooperation (e.g. those involved in motility such as flagellin). It would be very useful to examine more detailed annotations of social genes, and expand to other mobile genetic elements.
Discussion
We found no support for the hypothesis that horizontal gene transfer generally favours cooperation. Our genomic analyses showed that extracellular proteins are not: (i) overrepresented on plasmids compared to chromosomes (Figure 2); (ii) more likely to be carried by plasmids that transfer at higher rates (Figure 3). These patterns could be explained by our theoretical modelling, which showed that while horizontal gene transfer may help cooperation to initially invade a population, it has less influence on the maintenance of cooperation in the long term (Figures 4 & 5). Once plasmids become common, cheat plasmids that do not code for cooperation are able to outcompete cooperative plasmids, analogous to selection at the level of the chromosome[16,30]. Our results suggest that horizontal gene transfer on plasmids has not consistently favoured cooperation across bacterial species – but it is still possible that horizontal gene transfer could have an influence in certain scenarios or species. In contrast, we found that genes coding for extracellular proteins involved in pathogenicity and virulence are preferentially located on plasmids in pathogens with a broad host-range (Figure 6). These pathogenic virulence genes were not preferentially located on plasmids that transfer at a higher rate, suggesting that the benefit of being located on a plasmid is something other than horizontal gene transfer, such as the ability to vary copy number.
Methods
Genome Collection
We retrieved 1632 complete genomes comprising 51 bacterial species from GenBank RefSeq (https://www.ncbi.nlm.nih.gov) between February and November 2019. We used species on panX (http://pangenome.tuebingen.mpg.de)[74] as a list of potential species for our dataset, since these comprise the most sequenced bacterial species. To allow comparison of chromosome and plasmid genes within the same genome, we only retrieved genomes that contained at least one plasmid sequence. We included in our analysis species with 10 or more available RefSeq genomes carrying one or more plasmids. We retrieved up to 100 genomes for each species; this was either all complete genomes available for the species, or a random sample where more than 100 were available. Where two or more genomes had the same strain name, we randomly retrieved one genome to reduce the risk of pseudoreplication.
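The selection rules above can be sketched as follows. This is a minimal illustration in Python, not the pipeline actually used: the function name, the dictionary keys (`species`, `strain`, `n_plasmids`) and the order in which the filters are applied (plasmid filter, then one genome per strain name, then the 10-genome threshold, then the cap of 100) are all assumptions made for the example.

```python
import random


def select_genomes(genomes, min_per_species=10, max_per_species=100, seed=0):
    """Apply the genome selection rules to a list of genome records.

    Each record is a dict with hypothetical keys 'species', 'strain'
    and 'n_plasmids'. Returns a dict mapping species to selected records.
    """
    rng = random.Random(seed)

    # Keep only genomes with at least one plasmid, grouped by species and strain.
    by_species = {}
    for genome in genomes:
        if genome["n_plasmids"] >= 1:
            strains = by_species.setdefault(genome["species"], {})
            strains.setdefault(genome["strain"], []).append(genome)

    selected = {}
    for species, strains in by_species.items():
        # One randomly chosen genome per strain name (limits pseudoreplication).
        unique = [rng.choice(candidates) for candidates in strains.values()]
        # Species need at least `min_per_species` usable genomes.
        if len(unique) < min_per_species:
            continue
        # Random sample when more than `max_per_species` genomes are available.
        if len(unique) > max_per_species:
            unique = rng.sample(unique, max_per_species)
        selected[species] = unique
    return selected
```

Fixing the seed makes the random choices repeatable, which is convenient when the same sample has to be regenerated for later analyses.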
Prediction of Subcellular Location of Proteins
We used PSORTb v.3[39] to predict the subcellular location of every protein encoded by each genome in our dataset. We used a Docker image of PSORTb developed by the Brinkman Lab, available at: https://github.com/brinkmanlab/psortb_commandline_docker. We chose PSORTb because it is widely regarded as one of the best performing programs of its kind[75]. It has also been used in previous analyses to identify ‘cooperative’ genes and/or extracellular proteins in bacteria[15,20]. The program has a number of modules which are trained to recognise particular features of proteins. Results from these modules are combined to give a Final Prediction for each protein. We consulted the literature to confirm the Gram stain of each of our species. For Gram-positive species, PSORTb assigns proteins to one of four locations within the cell: cytoplasmic, cytoplasmic membrane, extracellular or cell wall (Extended Data Figure 1). The locations for Gram-negative species are the same, except that cell wall is replaced with outer membrane and periplasmic, meaning there are five possible locations for proteins of Gram-negative species (Extended Data Figure 1). We used these predicted locations throughout all subsequent analyses in this work. PSORTb could not reliably assign a subcellular location to 27% of proteins we analysed, giving a final prediction of ‘unknown’ (Table S1). Unless explicitly stated, we did not include these unknown proteins in our analyses.
Predicting Plasmid Mobility
We also predicted the mobility of every plasmid in our dataset using the MOB-typer tool of the program MOBsuite[46]. This searches for features of plasmid sequences including the origin of transfer (oriT), relaxase and mating-pair formation to give each plasmid one of three mobility predictions: (i) conjugative, where plasmids encode all machinery required to transfer via conjugation; (ii) mobilizable, where plasmids do not encode all machinery, but encode oriT and/or relaxase, allowing them to ‘hijack’ another plasmid’s conjugation machinery and mobilize; (iii) non-mobilizable, where plasmids do not encode the genes necessary to be mobilized by themselves or other plasmids, and so cannot transfer via conjugation. 628 of the 4150 plasmids in our dataset were flagged as ‘unverified’ against the MOBsuite dataset, meaning their mobility prediction was unreliable and they were not included. This left 3522 plasmids for subsequent analysis.
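The three mobility categories can be summarised by a simple decision rule. The Python sketch below is a simplified illustration of the definitions given above and is not MOBsuite's implementation; the function name and its boolean arguments are hypothetical, and we assume that "all machinery" means a relaxase together with mating-pair formation genes.

```python
def predict_mobility(has_orit, has_relaxase, has_mpf):
    """Classify a plasmid from the presence of three conjugation features.

    has_orit     -- an origin of transfer (oriT) was detected
    has_relaxase -- a relaxase gene was detected
    has_mpf      -- mating-pair formation genes were detected
    """
    # Assumption: relaxase plus mating-pair formation = full conjugation machinery.
    if has_relaxase and has_mpf:
        return "conjugative"
    # oriT and/or relaxase without the full machinery: can be mobilized
    # by another plasmid's conjugation machinery.
    if has_orit or has_relaxase:
        return "mobilizable"
    return "non-mobilizable"


# Plasmids retained after excluding those flagged as 'unverified'.
n_retained = 4150 - 628
print(n_retained)
```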
Effect of Mobility on Plasmid Extracellular Protein Content
We next examined how plasmid mobility correlates with each plasmid’s extracellular protein proportion. As part of its mobility prediction, MOBsuite[46] identifies sequences within each plasmid involved with conjugation. To control for the possibility that conjugative plasmids, by definition of being conjugative, must carry genes controlling this process, we subtracted the total number of these sequences from the total number of proteins when calculating the extracellular proportion of each plasmid. This is a highly conservative control, since it assumes none of the proteins predicted as extracellular are involved in conjugation. We did all analyses on these data with and without removing these mating-pair accessions to ensure any results were not affected by factors unrelated to plasmids’ extracellular protein content.
Additionally, we used the plasmid mobility predictions to ask whether differences in the mobility of species’ plasmids correlated with whether genes encoding extracellular proteins are overrepresented on plasmids compared to chromosomes. We calculated the proportion of plasmids in each genome capable of transferring via conjugation (conjugative and mobilizable plasmids), and averaged across all genomes to give a general measure of the mobility of each species’ plasmids.
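The two calculations described above can be sketched in Python as follows. The function names and input formats are hypothetical, and we assume that the protein total passed in already excludes proteins with an ‘unknown’ predicted location.

```python
def extracellular_proportion(n_extracellular, n_proteins, n_conjugation=0):
    """Proportion of a plasmid's proteins predicted as extracellular.

    n_conjugation is the number of conjugation-related sequences, which is
    subtracted from the protein total as a conservative control
    (use 0 for the uncorrected proportion).
    """
    denominator = n_proteins - n_conjugation
    if denominator <= 0:
        raise ValueError("no proteins left after removing conjugation sequences")
    return n_extracellular / denominator


def species_mobility(genomes):
    """Mean, across genomes, of the proportion of transferable plasmids.

    genomes is a list with one entry per genome; each entry is a list of
    mobility labels, one per plasmid in that genome.
    """
    transferable = ("conjugative", "mobilizable")
    per_genome = [
        sum(label in transferable for label in plasmids) / len(plasmids)
        for plasmids in genomes
    ]
    return sum(per_genome) / len(per_genome)
```

For example, a plasmid with 5 extracellular proteins out of 100, of which 20 sequences are conjugation-related, has a corrected proportion of 5/80 rather than 5/100.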
Measures of Bacterial Lifestyle and Environmental Variability
We classified a species as pathogenic if it was described in the literature as an obligate or facultative pathogen. Given that some bacterial species only rarely act as pathogens, such as opportunistic pathogens, we only included species where we could be sure pathogenicity was a key aspect of their lifestyle and a regular selection pressure acting on their genome content. For this reason, we decided not to include species described as opportunistic pathogens in the literature and those which frequently live as commensals in their hosts. We classified non-pathogens as species which are strictly environmental (never live in hosts) or strictly mutualists and/or commensals (never cause pathogenicity in their hosts). There were 26 species we could not definitively assign to either of these categories. These were not included in our main analyses, although we carried out additional analyses to ensure that removing these species did not bias our results (Extended Data Figure 5).
To estimate the host-range of pathogens, we used information from the literature to determine the maximum taxonomic level of hosts each species is able to invade. We defined narrow host-range species as those which can invade either only one host species, or host species within the same genus or family. In contrast, we defined broad host-range pathogens as those capable of invading host species within the same order, class or phylum. For example, Xanthomonas citri acts as a plant pathogen within the genus Citrus[76], while Pseudomonas syringae acts as a plant pathogen across multiple orders of flowering plants[77]. For more details and references to the literature used for this classification, please see Table S3.
We completed additional analyses for two other measures and proxies of environmental variability, the details and results of which can be found in Supp. Info. 1. In brief, we used previously published data which classified the habitat diversity of species using 16S rRNA environmental datasets across five broad habitats: water, wastewater, sediment, soil and host[65,66]. We also supplemented this with information from the literature for species not included in the published data. We used this to ask whether species which lived in multiple habitats had genes encoding extracellular proteins more overrepresented on their plasmids.
We also looked at bacterial pangenomes as a proxy for environmental variability, since it has been noted that species with a high percentage of accessory genes, defined as genes found in only a subset of genomes within a species, are generally those with more variable environments. All pangenome data was collected from panX[74] (http://pangenome.tuebingen.mpg.de), since this calculates the pangenome using the same method across all of our species.
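The host-range rule described earlier in this section amounts to a lookup on the highest taxonomic level spanned by a pathogen's hosts. The Python sketch below illustrates this; the function name and the lower-case level labels are hypothetical, and the example levels are for illustration only (the classifications actually used are listed in Table S3).

```python
NARROW_LEVELS = {"species", "genus", "family"}
BROAD_LEVELS = {"order", "class", "phylum"}


def classify_host_range(max_host_level):
    """Map the maximum taxonomic level of a pathogen's hosts to a host-range class."""
    level = max_host_level.lower()
    if level in NARROW_LEVELS:
        return "narrow"
    if level in BROAD_LEVELS:
        return "broad"
    raise ValueError(f"unrecognised taxonomic level: {max_host_level}")


# A pathogen restricted to hosts within one genus is narrow host-range;
# one whose hosts span a whole class (illustrative level) is broad host-range.
print(classify_host_range("genus"))
print(classify_host_range("class"))
```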
Pathogenicity categorisation of extracellular proteins
We used MP3[58] to examine the pathogenicity of extracellular protein-coding genes in broad host-range and narrow host-range pathogens. MP3 compares protein sequences to a curated dataset of proteins known to be involved in various aspects of pathogenicity: adhesion, invasion, secretion and resistance[58]. MP3 uses two modules to produce a ‘Hybrid’ prediction for each protein: either ‘Pathogenic’ or ‘Non-Pathogenic’. We used MP3 with default parameters to gain this prediction for every extracellular protein in all genomes of broad and narrow host-range species. MP3 was unable to give a prediction for approximately 9% of extracellular proteins, and so these were not included in this analysis.
For each genome in broad and narrow host-range pathogens, we summed the MP3 predictions to give the total number of ‘Pathogenic’ and ‘Non-Pathogenic’ extracellular proteins on the chromosome and on the plasmid(s). We then calculated the proportions of plasmid and chromosome genes which code for ‘Pathogenic’ and ‘Non-Pathogenic’ extracellular proteins.
Statistical analyses
MCMCglmm
Many commonly used statistical methods in biology require data points to be independent from one another. However, due to shared ancestry, species cannot be considered as independent data points[24]. Recently developed statistical methods now allow for phylogenetic relationships to be controlled for within mixed effects models. For all statistical analyses we used the MCMCglmm (Markov Chain Monte Carlo generalised linear mixed effects model) package in R with phylogeny as a random effect[41,78]. This means the phylogeny is implemented in the model as a covariance matrix of the relationships between species, which is controlled for when considering whether patterns exist across species[41,78]. We also included sample size as a random effect when analysing at the genome level to control for differences in the number of genomes per species. Specific details of each model can be found in Table S2. We extracted from each model the posterior mean, 95% Credible Intervals (functionally similar to 95% Confidence Intervals), and the pMCMC value (generally interpreted in a similar way to a ‘p-value’). We also calculated R2 values for models of particular interest using previously described methods[79,80]. A detailed description of MCMCglmm can be found elsewhere[41,78].
The response variable in all of our analyses is either a proportion or a measure calculated from proportions. Proportion data are bounded between 0 and 1 and have a non-normal distribution. To account for this, all proportion data in our analyses were arcsine square root transformed to improve normality.
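For reference, the arcsine square root transformation maps a proportion p to arcsin(√p). A minimal Python sketch is given below; the function name is hypothetical, and the analyses themselves were carried out in R.

```python
import math


def arcsine_sqrt(proportion):
    """Arcsine square root transform of a proportion in [0, 1], in radians."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(proportion))


# The transform stretches values near 0 and 1 relative to those near 0.5.
for p in (0.0, 0.01, 0.5, 0.99, 1.0):
    print(p, arcsine_sqrt(p))
```

The transformed values range from 0 (for p = 0) to π/2 (for p = 1).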
Phylogeny
To control for species relationships, we generated a phylogeny including all 51 species in our dataset (Figure S1). We used a recently published maximum likelihood tree using 16S ribosomal protein data as the basis for our phylogeny[64]. This tree of life typically had only one representative species per genus. We used the R package ‘ape’ to extract all branches matching species in our dataset[81]. In cases where the genus representative was different to the species in our dataset, we swapped the tip name with our species, since all members of the same genus are equally related to members of a sister genus. In cases where we had multiple species within a single genus in our dataset, we used the R package ‘phylotools’ to add these species as additional branches into their genus[82]. We used published phylogenies from the literature to add any within-genus clustering of species’ branches. We used this phylogeny in nexus format for all our MCMCglmm analyses (Fig S1, Table S2). Methods are also available to control for uncertainty in phylogenetic reconstruction[83,84], although we have not done this here.
Protein subcellular localisations.
Visualisation of all possible subcellular locations predicted by PSORTb. The left panel shows a cross-section of a typical Gram-negative bacterium and the right panel shows the equivalent for a Gram-positive bacterium. Both kinds of bacteria have an inner membrane, known as the cytoplasmic membrane. The main difference is that Gram-positive bacteria are surrounded by a thick layer of a molecule called peptidoglycan, while Gram-negative bacteria have a much thinner layer of peptidoglycan, and have an additional membrane. Created with BioRender.com.
Substantial variation within and between species in the genomic location of extracellular proteins.
The x-axis is the % of genomes in each species where the proportion of plasmid proteins predicted as extracellular is greater than the proportion of chromosome proteins predicted as extracellular. Crucially, this considers only whether the plasmid proportion is greater than the chromosome proportion for each genome, rather than also considering the magnitude of the difference (Figure 2). Error bars are the 95% Confidence Intervals from a binomial test on each species, comparing the number of genomes which have plasmid proportion > chromosome proportion to a null prediction of 50% of genomes. Species in blue have >50% of genomes where plasmid > chromosome extracellular proportion, meaning extracellular proteins are significantly overrepresented on plasmids. Species in red have <50% of genomes where plasmid > chromosome extracellular proportion, meaning extracellular proteins are significantly overrepresented on chromosomes. Species in grey have a 95% CI which overlaps 50%, so extracellular proteins are not significantly overrepresented on either plasmids or chromosomes in these species.
Difference in plasmid and chromosome proportion for all protein classes predicted by PSORTb.
The x-axis is the difference between the plasmid and chromosome proportions for each protein class, calculated as for extracellular proteins in Figure 2. The y-axis shows all possible subcellular locations predicted by PSORTb. These protein ‘classes’ are ordered along the y-axis by location within the cell, from intracellular to increasingly extracellular. Each dot is the posterior mean, with its 95% credible interval, from an MCMCglmm model[42] of the difference in plasmid and chromosome proportion across all species, accounting for phylogeny and sample size. The only proteins significantly overrepresented in either direction are unknown proteins, which make up a higher proportion of plasmid proteins in all species we analysed.
No effect of plasmid mobility on the difference in plasmid and chromosome proportion of genes coding for extracellular proteins.
The x-axis is the % of a species’ plasmids which are conjugative or mobilizable. The y-axis shows the difference in the plasmid and chromosome proportions of genes coding for extracellular proteins, as in Figure 2. Each dot is the mean for all genomes in a species. Species in blue are those with genes coding for extracellular proteins overrepresented on plasmids, while species in red have genes coding for extracellular proteins overrepresented on chromosomes.
No difference in where extracellular proteins are coded for in pathogens compared to non-pathogens.
The y-axis shows the difference in the plasmid and chromosome proportion of genes coding for extracellular proteins. Each dot is the mean for all genomes in a species. Species in blue are those with genes coding for extracellular proteins overrepresented on plasmids, while species in red have genes coding for extracellular proteins overrepresented on chromosomes. Species were categorised as pathogens or non-pathogens; those we could not classify as either are shown in the ‘Opportunistic + others’ category. The black bars indicate the mean for all species in each category.
Additional measures of environmental variability.
We used two additional methods to estimate the environmental variability encountered by these species. (a) The x-axis shows published data on the number of five broad environments each species was recorded in, which we supplemented with information from the literature to include all species. (b) The x-axis shows the proportion of each species’ genes which are ‘core’ genes, meaning they are found in all members of the species. The y-axis in both graphs shows the difference in the proportion of genes on plasmids and chromosomes coding for extracellular proteins. Each dot is the mean for all genomes in a species. Species in blue are those with extracellular proteins overrepresented on plasmids, while species in red are those with extracellular proteins overrepresented on chromosomes. For both these measures, we found no significant correlation with the genomic location of genes coding for extracellular proteins across species.