Hye Jin Park1,2,3, Chaitanya S Gokhale4, Frederic Bertels5. 1. Department of Evolutionary Theory, Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany. 2. Asia Pacific Center for Theoretical Physics, Pohang, 37673, Korea. 3. Department of Physics, POSTECH, Pohang, 37673, Korea. 4. Research Group for Theoretical Models of Eco-evolutionary Dynamics, Department of Evolutionary Theory, Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany. 5. Research Group for Microbial Molecular Evolution, Department of Microbial Population Biology, Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany.
Abstract
Compared to their eukaryotic counterparts, bacterial genomes are small and contain extremely tightly packed genes. Repetitive sequences are rare but not completely absent. One of the most common repeat families is REPINs. REPINs can replicate in the host genome and form populations that persist for millions of years. Here, we model the interactions of these intragenomic sequence populations with the bacterial host. We first confirm well-established results, in the presence and absence of horizontal gene transfer (hgt) sequence populations either expand until they drive the host to extinction or the sequence population gets purged from the genome. We then show that a sequence population can be stably maintained, when each individual sequence provides a benefit that decreases with increasing sequence population size. Maintaining a sequence population of stable size also requires the replication of the sequence population to be costly to the host, otherwise the sequence population size will increase indefinitely. Surprisingly, in regimes with high hgt rates, the benefit conferred by the sequence population does not have to exceed the damage it causes to its host. Our analyses provide a plausible scenario for the persistence of sequence populations in bacterial genomes. We also hypothesize a limited biologically relevant parameter range for the provided benefit, which can be tested in future experiments.
Compared to their eukaryotic counterparts, bacterial genomes are small and contain extremely tightly packed genes. Repetitive sequences are rare but not completely absent. One of the most common repeat families is REPINs. REPINs can replicate in the host genome and form populations that persist for millions of years. Here, we model the interactions of these intragenomic sequence populations with the bacterial host. We first confirm well-established results, in the presence and absence of horizontal gene transfer (hgt) sequence populations either expand until they drive the host to extinction or the sequence population gets purged from the genome. We then show that a sequence population can be stably maintained, when each individual sequence provides a benefit that decreases with increasing sequence population size. Maintaining a sequence population of stable size also requires the replication of the sequence population to be costly to the host, otherwise the sequence population size will increase indefinitely. Surprisingly, in regimes with high hgt rates, the benefit conferred by the sequence population does not have to exceed the damage it causes to its host. Our analyses provide a plausible scenario for the persistence of sequence populations in bacterial genomes. We also hypothesize a limited biologically relevant parameter range for the provided benefit, which can be tested in future experiments.
Repetitive sequences can be found in most genomes. They are particularly abundant in eukaryotes, where often only a small proportion of the genome encodes for host proteins (Jurka ). In contrast, about 90% of a typical bacterial genome encodes for host proteins (Silby ). The extragenic space is mostly taken up by rRNA, tRNA, transcription and translation promoters, repressors, and terminators (Rogozin ). Yet, repetitive sequences can also be found in the extragenic space of many bacteria (Treangen ).Short repetitive sequences were first identified in Escherichia coli in the early 1980s (Higgins ). Then, due to their characteristics, they were called REPs, short for epetitive xtragenic alindromic sequences (Stern ). It was unclear if REP sequences fulfill a functional role in the host bacterium and if so what kind of function this might be. Numerous studies found REP sequences to be involved in different biological processes, for example in transcription termination, RNA stabilization, gyrase, and integration host factor binding, as well as nucleoid folding (Higgins ; Newbury ; Yang and Ames 1988; Boccard and Prentki 1993; Espéli ; Qian ). However, whether the identified functions are locally co-opted, or common to all REP sequences and therefore able to explain the presence of REP sequences in the bacterial genome, is not clear.To determine whether a function is incidental or whether it can explain the persistence and emergence of an entire sequence class requires the understanding of the evolution of REP sequences. A study in Pseudomonas fluorescens SBW25 showed that REP sequences are not evolutionarily relevant units (Bertels and Rainey 2011b), but a part of a larger replicative unit, called REPIN ( doublet forming hairp). REPINs consist of two inverted REP sequences separated by a short and highly diverse spacer region. This arrangement allows REPINs to form hairpins in single-stranded DNA or RNA. REP singlets also exist, but these are usually decaying remnants of full-length REPINs. REPINs are nonautonomous transposable elements that are duplicated by RAYT (EP ssociated trosine ransposase) proteins (Nunvar ; Bertels and Rainey 2011b; Ton-Hoang ).RAYT transposases are single-copy genes that have been vertically inherited for millions of years (Bertels ), making RAYTs domesticated transposases. Despite the domestication of the RAYT transposase by the bacterium, RAYTs have not lost their association to REPINs and actively replicate REPINs albeit at very low rates (Bertels ).Although the RAYT transposase’s exact function is unknown, it is conceivable that formerly parasitic genes are domesticated by the host. It is much less clear how a population of replicating sequences can be maintained in a bacterial genome over long periods of time. There is a large body of literature on the persistence of transposable elements (TEs). In the 1980s research was mostly focused on how it is possible to maintain TEs in sexually reproducing eukaryotic genomes (Doolittle and Sapienza 1980; Hickey 1982; Charlesworth and Charlesworth 1983; Wright and Finnegan 2001). These studies showed that beneficial effects need not be invoked to explain the presence of TEs in the genome. Instead, if the TE copies or transposes itself from one sister chromatid to the other during meiosis, TEs can even reduce the host’s fitness by up to 50% and still spread through the host population.TEs are much rarer in asexually reproducing prokaryotic genomes than in sexually reproducing eukaryotic genomes. Nevertheless, studies of TEs in asexually reproducing organisms followed shortly after the first studies on eukaryotes (Sawyer and Hartl 1986). The authors assume, similar to sexually reproducing organisms, “that the TE performs no function for the host and, that the reduction in fitness with increased copy number is due to effects such as impairment of beneficial genes by transposition or homologous recombination.” These models can explain the distribution of simple TEs such as insertion sequences (ISs), and even short repetitive sequences assumed to act as promoters (mobile promoters, MPs) as long as there is replicative horizontal gene transfer (hgt) (Sawyer and Hartl 1986; Dolgin and Charlesworth 2006; Matus-Garcia ; Bichsel ; van Passel ).As more and more sequence data became available, it was noticed that TEs often cause beneficial mutations in prokaryotic genomes (Schneider ). When incorporating the mutational effect of TEs into models, analyses showed that mutation rates increased by TEs can elevate TE persistence time in bacterial genomes in novel or fluctuating environments (Martiel and Blot 2002; Edwards and Brookfield 2003; McGraw and Brookfield 2006; Startek ). TEs can theoretically be maintained at intermediate numbers if the environment fluctuates regularly (Startek ). However, there are numerous issues with this result. As the authors point out, TEs will not be maintained through this mechanism over long evolutionary time periods.One reason is that nonautonomous TEs are expected to quickly evolve by inactivating mutations of the encoded transposase. Nonautonomous elements cannot produce a transposase protein, but can be the target of transposases produced by autonomous elements. The evolution of nonautonomous elements will quickly lead to the extinction of full-length elements, making the long-term survival of TEs in prokaryotic genomes unlikely, consistent with the transient nature of prokaryotic TEs in sequenced bacterial genomes (Sawyer and Hartl 1986; Startek ).A second reason is that increasing mutation rates by ISs is not a viable strategy over long evolutionary time periods. Each time a beneficial mutation is generated through the insertion of a TE, the transposition rate increases. Increasing the transposition rate will, of course, increase the mutation rate and lead to high costs for the cell. Hence, increasing mutation rates by modifying the DNA repair system should, in the long term, be a less costly route of adapting to novel environments (Wielgoss ; Consuegra ).In eukaryotes stably maintained sequence populations exist in Drosophila populations (Charlesworth and Charlesworth 1983; Charlesworth and Langley 1989). A stable population can only be obtained when the accumulation of TEs is stopped (Figure 1). This can be achieved by either an exponentially increasing fitness cost of TEs or the down regulation of transposition rates. Without hgt or recombination a stable equilibrium of intermediate TE numbers cannot be maintained (Wright and Schoen 1999).
Figure 1
Previous research shows there are two trivial outcomes for transposable element evolution. In prokaryotes, transposable elements go extinct by default (Dolgin and Charlesworth 2006). In eukaryotes, transposable elements tend to increase indefinitely until eventually the TE population collapses and a large part of the genome is lost. The TE population size will then increase again until eventual collapse. This has been shown to have happened in birds and mammals (Kapusta ). Superscripts indicate the following references: a (Sawyer and Hartl 1986), b (RANKIN ), c (Hickey 1982), d (Charlesworth and Charlesworth 1983), e (Charlesworth and Langley 1989).
Previous research shows there are two trivial outcomes for transposable element evolution. In prokaryotes, transposable elements go extinct by default (Dolgin and Charlesworth 2006). In eukaryotes, transposable elements tend to increase indefinitely until eventually the TE population collapses and a large part of the genome is lost. The TE population size will then increase again until eventual collapse. This has been shown to have happened in birds and mammals (Kapusta ). Superscripts indicate the following references: a (Sawyer and Hartl 1986), b (RANKIN ), c (Hickey 1982), d (Charlesworth and Charlesworth 1983), e (Charlesworth and Langley 1989).To obtain stable sequence populations in prokaryotes, the high cost of transposition has to be alleviated to prevent the extinction of TEs (Figure 1). Since until recently stable sequence populations in prokaryotes have not been observed, no mathematical model has been proposed to explain the persistence of intermediate numbers of TEs over long time periods.Currently, REPINs are to our knowledge the only intragenomic sequence population that is stably maintained in prokaryotes. REPINs have been maintained in the genome for millions of years (Bertels ) and mean and mode of the population size is far greater than 0 in E. coli.For example, across 20 representative E. coli strains the minimum REPIN number is 96, and the average is 156 (Touchon ), whereas IS5 is only present in four of 20 strains. This pattern also holds for larger E. coli strain collections. In a selection of 300 E. coli genomes only 44% (133) contain one or more IS5 genes (Figure 2). The maximum number of IS5 copies is 53. In contrast, across the same strain collection, the minimum REPIN number is 69 and the maximum 235. RAYT containing [the transposase responsible for REPIN transposition (Nunvar ; Bertels and Rainey 2011b; Messing )] genomes harbor more REPINs than genomes lacking the RAYT transposase responsible for REPIN dissemination. There is no overlap between the REPIN distribution and the IS5 distribution in E. coli, which strongly suggests that fundamentally different evolutionary processes maintain REPINs inside bacterial genomes compared to ISs.
Figure 2
Distribution of IS5 elements (red) compared to REPINs (blue) across 300 de-replicated E. coli genomes. To display the REPIN numbers, E. coli genomes are divided into two categories. Genomes that contain the RAYT transposase gene (dark blue) and genomes that do not (light blue). REPINs are more common in genomes that contain a RAYT gene compared to genomes that do not contain a RAYT gene. The distribution of REPIN numbers does not overlap with the distribution of IS5 elements (i.e., IS5 occurs at most 53 times per E. coli genome, yet there are at least 69 REPINs present per E. coli genome).
Distribution of IS5 elements (red) compared to REPINs (blue) across 300 de-replicated E. coli genomes. To display the REPIN numbers, E. coli genomes are divided into two categories. Genomes that contain the RAYT transposase gene (dark blue) and genomes that do not (light blue). REPINs are more common in genomes that contain a RAYT gene compared to genomes that do not contain a RAYT gene. The distribution of REPIN numbers does not overlap with the distribution of IS5 elements (i.e., IS5 occurs at most 53 times per E. coli genome, yet there are at least 69 REPINs present per E. coli genome).Our study aims to understand the conditions that allow the maintenance of intermediate REPIN numbers. We start by devising a simple model for REPIN evolution. In agreement with previous work, we show that in our model the bacterial population will either be driven to extinction by the cost of the transposition activity of an ever-increasing intragenomic sequence population or the sequence population will be lost from the bacterial population, with and without nonreplicative hgt (Figure 1). However, persistence of intermediate numbers is possible when each sequence provides a small benefit to the host bacterium, decreasing as the sequence number per genome increases. Interestingly, for high nonreplicative hgt rates, sequence populations can persist even if the caused harm outweighs the fitness benefit provided to the host. Together, our analyses provide testable hypotheses to explain the persistence of intragenomic sequence populations in bacteria.
Materials and methods
REPIN and IS5 distribution in E. coli
We downloaded 1165 E. coli genomes from NCBI (https://www.ncbi.nlm.nih.gov/) on the 27th of February 2020 using the following query “(“Escherichia coli”[Organism] OR Escherichia coli[All Fields]) AND (latest[filter] AND (all[filter] NOT” derived from surveillance project”[filter] AND all[filter] NOT anomalous[filter])) AND (”complete genome”[filter] OR” chromosome level”[filter]) AND” has annotation”[Properties]”. We then de-replicated those genomes to make sure that all nucleotide sequences of all genomes differed by at least 0.5% (Mash distance) using the “Assembly de-replicator” (downloaded on the 27th of February 2020). A selection of 300 genomes remained. The sequences can be downloaded using the code provided at GitHub.For all genomes, REPINs were identified by first determining the most common 21 bp long sequences in E. coli O15: H11 strain 90-9272 (GATGCGGCGTGAACGCCTTAT). All related sequences that differ in at most one position are identified recursively for this seed sequence until no more new sequences are found. This procedure was repeated with the same 21 bp long seed sequence for all 300 E. coli genomes.IS5 sequences were identified using TBLASTN in BLAST+ (Camacho ) (version 2.10.0) with an e-value threshold of 1e-90 and the IS5 protein with NCBI accession number QEF05883.1 as a query. Similarly, RAYT sequences were identified with TBLASTN using the YafM protein from E. coli K-12 MG1655 as a query and an e-value threshold of 1e-90. We chose low e-value thresholds to ensure that we only analyze full-length and likely functional genes that mainly evolved inside the E. coli species.The analyses can be done with the RAREFAN webtool.
Local REPIN amplification rate λ
REPINs are often found in two or more tandem repeat copies (Bachellier ; Bertels and Rainey 2011b). Hence, REPINs can get locally amplified or deleted. To estimate the local amplification and deletion rates in the genome, we consulted mutation accumulation data from E. coli MG1655 (Foster ). In this experiment, the authors started 50 parallel mutation accumulation lines from a single E. coli MG1655 wild-type clone (strain PFM2m). These 50 lines were grown on minimal medium and serially transferred about 220 times through single-cell bottlenecks. Between bottlenecks, the cells grew for about 28 generations. The final bacterial clones experienced about 6160 cell divisions from the start to the end of the experiment (Lee ; Foster ). At the end of the experiment, the authors observed 277 single base-pair substitutions across the 50 individual mutation accumulation lines and based on this data estimated a per genome mutation rate of .Using the same logic, and further data from (Lee ), we can estimate the local amplification rates of REPINs. Across the experiment, they only observed a single large indel that involved REPINs and hence is relevant for the estimation of local REPIN amplifications and deletions (λ). We analyzed the Illumina sequence data with breseq (Deatherage and Barrick 2014) to verify the presence of a mutation in a REPIN cluster. This event occurred in M2M-85 (SRA accession number: SRR2169198) at position 4295870.4296434 in the E. coli MG1655 ancestor (Genbank accession number: U00096.3) and deleted five REPIN copies in a tandem cluster of six REPINs. From these numbers, we can estimate the magnitude of the amplification rate λ the same way Lee et al. have done for the substitution rate. To focus on the rate per REPIN we have to also divide by the REPIN population size of E. coli MG1655 (224) to obtain a maximum likelihood estimate of . The 95% confidence interval of λ ranges from to .
REPIN transposition rate δ
Note, that strictly speaking, the REPINs we identify in E. coli and other enterobacterial strains are REP sequences. REPINs consist of two REP sequences in an inverted orientation. However, since REPINs in enterobacteria are asymmetric (i.e., the 5 REP sequence differs from the 3 REP sequence by a single nucleotide deletion/insertion), it is difficult to identify and analyze the whole REPIN (Bertels ). However, despite focusing our analyses on REP sequences in enterobacteria, we speak of REPINs as these are the actual mobile elements. REP sequences, when encountered as singlets (which is relatively rare) are immobile remnants of REPINs (Bertels and Rainey 2011b).We first identified the most common 21–25 bp long sequences in ten different Enterobacterial strains to determine approximate REPIN transposition rates. We identified the corresponding REPIN populations for each of these highly abundant sequences by recursively searching all sequences that differ in exactly one position from any already identified sequence in the genome [see Bertels ) for more details]. Using the mutation-selection (or Quasispecies) model, we inferred REPIN transposition rates as described in Bertels ). This model considers four mutation classes. The first mutation class only contains a single sequence, the master sequence. The second and third mutation classes contain all sequences that differ from the master sequence in exactly one and two positions, respectively. The last mutation class contains all sequences that differ in three or more positions to the master sequence. By assuming that the frequency distribution of the four mutation classes is in a steady-state, the REPIN transposition rate can be estimated for a constant mutation rate. Using this procedure, we obtained five transposition rates for the master sequence per bacterial strain, one for each sequence length. For each strain, we report the highest master sequence transposition rate. All estimated transposition rates are summarized in Table 1.
Table 1
Estimated transposition rates and REPIN population sizes
Strain
Seq.
Transp.
REPIN
Length (bp)
Rate (δ)
Pop. size (r)
Salmonella enterica ATCC 9150
24
5.2×10−9
98
Citrobacter koseri ATCC BAA-895
24
3.8×10−9
323
Enterobacteriaceae bacterium FGI57
21
9.3×10−9
150
Klebsiella variicola 342
23
5.4×10−9
91
Escherichia albertii 07-3866
24
7.6×10−9
226
Escherichia coli K-12 MG1655
23
1.2×10−8
224
Escherichia coli B REL606
24
9.7×10−9
220
Escherichia coli UMN026
21
1.4×10−8
159
Escherichia coli UTI89
24
7.7×10−9
137
Escherichia coli 536
24
8.7×10−9
158
Estimated transposition rates and REPIN population sizes
Model
Our main objective is to explore the conditions that would allow REPINs to persist in their bacterial host genome for millions of years or billions of bacterial generations. We begin by describing the dynamics of the hosts—the bacteria. We assume that bacteria grow near exponentially when the population size is small, and growth saturates when the population size is close to carrying capacity (i.e., logistic growth).
where B is defined as , K is the population carrying capacity, and n is the number of bacteria in the population. g is defined as .We can define bacterial subpopulations depending on the number of REPINs r each bacterium carries. The relative abundance of bacteria carrying r REPINs with respect to K is denoted by . The bacterial pool is the sum of all bacteria with different numbers of REPINs, . Hence B becomesThe number of bacteria carrying r REPINs can change due to bacterial growth and the REPIN dynamics. For example, if a REPIN is deleted, the bacterium changes its state from r to r—1, which happens with rate . Similarly, if a REPIN successfully duplicates then we see the transition from r to r + 1, which happens at rate . The REPIN dynamics are sketched in Figure 3. Altogether, the change in the relative bacterial abundance is captured by the following set of differential equations,
Figure 3
Modeling intragenomic sequence population in a bacterial population. Bacteria in the population only differ in the number of REPINs they contain. A bacterium with r REPINs gains a REPIN with rate and loses a REPIN with rate . The gain and loss of REPINs depend on the parameter λ (random amplification and deletion of REPINs) and δ (REPIN transposition rate). The transposition rate δ also decreases the growth rate of each bacterium by , since with probability γ a bacterium will be killed after a transposition event. The minimum number of REPINs is zero, the upper REPIN population size limit for maintaining a viable bacterial population is given by .
Modeling intragenomic sequence population in a bacterial population. Bacteria in the population only differ in the number of REPINs they contain. A bacterium with r REPINs gains a REPIN with rate and loses a REPIN with rate . The gain and loss of REPINs depend on the parameter λ (random amplification and deletion of REPINs) and δ (REPIN transposition rate). The transposition rate δ also decreases the growth rate of each bacterium by , since with probability γ a bacterium will be killed after a transposition event. The minimum number of REPINs is zero, the upper REPIN population size limit for maintaining a viable bacterial population is given by .Since having zero REPINs is a boundary condition, for r = 0 we have . The last equality also confirms that once the REPINs are lost, they cannot be regained.We connect growth and transition rates in the above equation with our observation in the previous section. The RAYT transposase duplicates REPINs by copying them into another location of the genome (Bertels and Rainey 2011b). This transposition rate is denoted as δ. However, transposition comes at a cost. Once a REPIN is copied into a gene, then the gene will be destroyed. If the gene is essential for bacterial survival, then the bacterium that carries the REPIN population, including the transposed REPIN, will die. We denote γ as the fatality probability that a bacterium dies due to a REPIN transposition. Hence bacterial growth rate, g, can be written asOur observation of REPINs in bacterial genomes suggests that besides the RAYT transposase activity, REPINs may be able to reproduce locally. Local amplification and deletion of REPINs are probably mediated by the host replication machinery and not by the RAYT transposase (Bertels and Rainey 2011a,b). This mode of amplification and deletion is captured by including a birth rate λ and an equal death rate λ giving the transition rates,
Results
Simple replicating intragenomic sequence populations cannot persist in bacterial genomes
Our model describes a bacterial population in which each bacterium carries a certain number of REPINs r. REPINs can transpose to a different position in the genome through duplication. Every REPIN transposition can harm the bacterium. There is a chance γ that a REPIN transposition leads to the bacterial host’s death. This model will lead to two different outcomes depending on the parameter values and initial conditions. Either the REPIN population will go extinct in the bacterial population (, purple distribution in Figure 4) or the REPIN population will grow uncontrolled and eventually drive the bacterial population to extinction (B = 0, green distributions in Figure 4).
Figure 4
The dynamics of the bacterial pool and the average REPIN numbers found per bacterial genome under the base model. The y-axis shows the relative bacterial abundances . The cartoon demonstrates the two possible stable equilibria of the bacterial population governed by Equation (3) with Equations (4) and (5). Different initial conditions lead to two different outcomes indicated by the purple and green circles. When the initial condition is close to the zero-REPIN state, the bacterial population follows the thick purple line leading to the extinction of the REPIN population. When starting with a large initial REPIN population size with a small fatality probability γ, the REPIN population size will increase across the entire bacterial population. Consequently, the bacterial population size will decrease and eventually go extinct (green line). Each point marked with an arrow shows the distribution of the bacterial population b at that time point.
The dynamics of the bacterial pool and the average REPIN numbers found per bacterial genome under the base model. The y-axis shows the relative bacterial abundances . The cartoon demonstrates the two possible stable equilibria of the bacterial population governed by Equation (3) with Equations (4) and (5). Different initial conditions lead to two different outcomes indicated by the purple and green circles. When the initial condition is close to the zero-REPIN state, the bacterial population follows the thick purple line leading to the extinction of the REPIN population. When starting with a large initial REPIN population size with a small fatality probability γ, the REPIN population size will increase across the entire bacterial population. Consequently, the bacterial population size will decrease and eventually go extinct (green line). Each point marked with an arrow shows the distribution of the bacterial population b at that time point.For a fatality probability greater than 0 () any transposition event can lead to the death of the bacterial host, and thus the fittest subpopulation is the population without REPINs. Bacteria devoid of REPINs have the highest growth rate. They cannot acquire REPINs in the absence of hgt. Hence, as soon as a fraction of bacteria loses all REPINs, REPINs will go extinct in the bacterial population. REPIN extinction usually occurs when a population starts with small REPIN numbers or a large fatality probability (γ). When γ is large, bacteria are more likely to die after a transposition event than to successfully increase the REPIN number (purple distribution in Figure 4).Alternatively, the accumulation of REPINs can lead to the extinction of the bacterial population. The bacterial population will go extinct when large REPIN numbers accumulate, for example, when the fatality probability (γ) is low (the bacterium is unlikely to die after a transposition event). In this case, an increasing number of REPINs will lead to a decreasing number of bacteria. Thus eventually the entire population becomes extinct (green distribution Figure 4).We analytically prove that these two trivial scenarios are the only possible, stable solutions of our model (see Appendix B for detailed calculations), showing that our model agrees with existing literature. Hence, our basic model does not explain what we observe in nature: an intragenomic sequence population that persists for millions of years.
Horizontal gene transfer within a bacterial population cannot explain REPIN persistence
Hgt has been shown to be essential to explain the persistence of selfish genetic elements (Doolittle and Sapienza 1980; Sawyer and Hartl 1986; Bichsel ). Although for REPINs there is no evidence of significant hgt, at least on the species level (Bertels ), hgt within populations may be able to explain the persistence of REPINs as shown for a specific model and a very specific parameter set in ISs (Bichsel ).To understand how exactly hgt affects the evolutionary dynamics of REPIN populations, we implemented hgt as a simple mixing process to mimic the process of gene conversion (Vos 2009). Currently, we believe that replicative hgt is unlikely to occur for REPINs, since they are nonautonomous elements and cannot simply copy themselves on a plasmid and then from that plasmid back into a new host unless the RAYT gene is copied at the same time. Furthermore, RAYT genes have not been observed on plasmids compared to IS elements, and do not copy themselves (Bertels ).The hgt rate h determines the frequency at which REPINs are transferred from one bacterium to another.This mixing process makes the complete loss of REPINs (b0) reversible, allowing bacteria without REPINs to gain a REPIN from the rest of the population.However, even though hgt provides a way to escape the zero-REPIN state, hgt by itself does not lead to a sustainable REPIN population. The number of REPINs in the population will still either decrease until all bacteria lose all REPINs or increase until the bacterial population is extinct.Whether the REPIN population or the bacterial population goes extinct is mainly determined by the fatality probability γ for high hgt rates (Appendix C). For REPIN population size increases to infinity because REPINs successfully duplicate most of the time (eukaryotic regime in Figure 1). In contrast, REPINs go extinct for due to a twofold effect: (1) REPIN populations grow more slowly because most transposition events are unsuccessful and (2) carrying REPINs is more costly because transposition events often kill the bacterial host (Prokaryotic regime in Figure 1). Hence, as established previously with similar models, hgt alone cannot stabilize a REPIN population in bacterial genomes.
Beneficial effects can lead to stable REPIN population sizes
To explain the persistence of REPINs in the genome, we propose a mutualistic relationship between REPINs and their host. In a simple model, each REPIN contributes a constant benefit α to the host. The total fitness benefit will then be αr. Besides being unrealistic (adding too much of anything will eventually be detrimental), such a benefit function does not lead to a stable REPIN population. If α is smaller than the transposition rate δ, then the possible steady states do not change; either REPINs get purged from the genome, or the whole bacterial population goes extinct together with the REPINs. If α is larger than δ, then REPIN population size will grow to infinity and so will the bacterial population size, which is not a plausible scenario.Ergo the fitness benefit function needs to be more complex to describe a realistic biological scenario. An additional parameter, w modifies the beneficial effect each additional REPIN provides to the host. The following functional form changes the benefit provided by each additional REPIN, where w is the base of the change: (Dawes ; Hauert ; Gokhale and Hauert 2016),The benefit function captures the total benefit of r REPIN sequences (Figure 5A). For w = 1 each REPIN provides a constant benefit α (discussed above). With w < 1, each additional REPIN provides a smaller benefit, saturating the total benefit. Similarly, with w > 1, each additional REPIN provides a larger benefit, exponentially increasing the total benefit. The beneficial effect of REPINs is reflected in the bacterial growth rate, .
Figure 5
Benefit functions and dynamics of average REPIN numbers for various hgt rates. (A) Benefit function with synergy (w > 1) and discounting (w < 1) effects. Total benefit increases with the number of REPINs r. With w = 1 the benefit a REPIN provides is constant (gray dashed line). For w > 1, REPIN benefits are synergistic, i.e., each additional REPIN provides a greater benefit than the previously added REPIN. For w < 1, REPIN benefits are discounting, i.e., each additional REPIN provides a smaller benefit than the previously added REPIN. The black arrow points at the benefit function, which is used in (B). (B) Changes of average REPIN population sizes over time for different hgt rates (h). The black dotted line is the expected REPIN population size () at the steady-state for high hgt rates. Lower hgt rates lead to smaller average REPIN population sizes. We used the following model parameters , and w = 0.975.
Benefit functions and dynamics of average REPIN numbers for various hgt rates. (A) Benefit function with synergy (w > 1) and discounting (w < 1) effects. Total benefit increases with the number of REPINs r. With w = 1 the benefit a REPIN provides is constant (gray dashed line). For w > 1, REPIN benefits are synergistic, i.e., each additional REPIN provides a greater benefit than the previously added REPIN. For w < 1, REPIN benefits are discounting, i.e., each additional REPIN provides a smaller benefit than the previously added REPIN. The black arrow points at the benefit function, which is used in (B). (B) Changes of average REPIN population sizes over time for different hgt rates (h). The black dotted line is the expected REPIN population size () at the steady-state for high hgt rates. Lower hgt rates lead to smaller average REPIN population sizes. We used the following model parameters , and w = 0.975.Decreasing benefits (w < 1) allow a stable REPIN population to persist in the bacterial genome (Figure 5B). For high hgt rates, we can analytically determine the size of the REPIN population in steady-state. To obtain a stable REPIN population, the fatality rate needs to be high () and the benefit strength α needs to be higher than (Appendix D for the detailed calculation). For these conditions, we can calculate the average number of REPINs in a bacterial genome:A careful analysis of the model parameters shows that few parameter combinations yield a REPIN population of biologically relevant size. The REPIN population size is determined by three free parameters (α, γ and w). We set the parameter range for α to , close to the transposition rate δ, also determining the fitness cost of each REPIN. The other two parameters are bounded by the model itself: γ can range from and w can range from .Each parameter combination yields an average REPIN population size in the bacterial population. Yet, the biologically relevant REPIN population sizes should be between 91 and 323 REPINs (Table 1). To assess, which parameter combinations lead to biologically relevant REPIN population sizes one of the three free parameters was fixed. The other two parameters were varied across the entire range (Figure 6).
Figure 6
Observable parameter range and likelihood. Three model parameters, benefit effect w, fatality probability γ, and benefit strength α, determine the average REPIN population size . Only certain parameter combinations result in biologically relevant REPIN population sizes (i.e., between 91 and 323 REPINs, Table 1). To visualize this observable parameter range, we fixed one parameter while the other two parameters varied. In the colored area, REPIN population sizes are between 91 and 323. The carrying capacity K is measured in the absence of REPINs. Hence, the bacterial population size increases with REPINs in yellow-colored areas, while bacterial population size becomes smaller in blue-colored areas. Each row is associated with one parameter. For example, in the first row, for three fixed benefit effects w, we determine the observable parameter range (A–C). The size of the observable parameter range is plotted in (D), which corresponds to the likelihood that a w-value is part of a parameter combination that leads to a stable REPIN population of observable size. The second and third rows show the same plots for the fatality probability γ and benefit strength α, respectively. Note that for all parameter ranges (, and ) the proportion of parameter combinations that result in stable REPIN populations of observable size is only about 2%.
Observable parameter range and likelihood. Three model parameters, benefit effect w, fatality probability γ, and benefit strength α, determine the average REPIN population size . Only certain parameter combinations result in biologically relevant REPIN population sizes (i.e., between 91 and 323 REPINs, Table 1). To visualize this observable parameter range, we fixed one parameter while the other two parameters varied. In the colored area, REPIN population sizes are between 91 and 323. The carrying capacity K is measured in the absence of REPINs. Hence, the bacterial population size increases with REPINs in yellow-colored areas, while bacterial population size becomes smaller in blue-colored areas. Each row is associated with one parameter. For example, in the first row, for three fixed benefit effects w, we determine the observable parameter range (A–C). The size of the observable parameter range is plotted in (D), which corresponds to the likelihood that a w-value is part of a parameter combination that leads to a stable REPIN population of observable size. The second and third rows show the same plots for the fatality probability γ and benefit strength α, respectively. Note that for all parameter ranges (, and ) the proportion of parameter combinations that result in stable REPIN populations of observable size is only about 2%.Without a detailed analysis of our model, the biologically relevant range of the discounting effect w (how strongly the benefit of each REPIN decreases with increasing REPIN number) is hard to predict. However, our model suggests that the effect needs to be in the range of 0.95 and 0.99 (Figure 6D). Otherwise, the other parameter values have to become unrealistic to yield suitable average REPIN population sizes. Intuitively, this means that the host’s benefit decreases by 1–5% with each REPIN added to the genome. Furthermore, for large discounting effects, relevant REPIN population sizes are only observed for small fatality probabilities (Figure 6A). On the contrary, small discounting effects require a small benefit strength to lead to relevant REPIN population sizes (Figure 6C).In our model, it is impossible to maintain a stable REPIN population if γ is below 0.5 because this regime would lead to an ever-increasing sequence population. Hence, at least 50% of the bacterial genome needs to be critical for long-term survival to maintain REPIN populations. A fatality probability of close to 0.5 also yields the most parameter combinations to maintain a stable REPIN population (Figure 6H).Finally, high-benefit strength is most likely to yield a stable REPIN population (Figure 6, I–K). Whereas low benefit strength is only possible when the discounting effect is close to 1 and the fatality probability is close to 0.5 (Figure 6I) and always leads to bacterial populations that are less fit than a population without REPINs.
Discussion
In prokaryotic genomes, TEs get continuously purged from the genome due to a combination of low hgt (and recombination rates) and the high cost of transposition. As a result, ISs are usually present in only a fraction of strains within a species (Touchon and Rocha 2007). Nonautonomous REPINs are different. If present in a species, then most strains of that species will contain a significant number of REPINs. To maintain a large number of TEs inside a genome, where transposition costs are high and hgt is low, the continuous extinction process has to be halted.Here, we propose a model that endows each REPIN with a fitness benefit to the host bacterium. The benefit function prevents the REPIN population from going extinct and allows them to be maintained as a stable population inside the bacterial genome. The benefit, however, follows a particular functional form. If each REPIN provides a constant beneficial effect to the bacterium, REPIN populations are still not stably maintained. If the benefit is lower than the cost of carrying a REPIN, then the aforementioned scenarios apply, otherwise the REPIN and bacterial population size increase indefinitely. Only discounting benefits (i.e., the benefit each REPIN provides decreases with increasing REPIN population size) can lead to stable REPIN population sizes.In eukaryotes, sequence populations have been discovered and modeled since in the 1980s (Hickey 1982; Charlesworth and Charlesworth 1983; Charlesworth and Langley 1989). The models suggest that instead of preventing TEs from going extinct, TEs have to be prevented from indefinitely accumulating in the genome in eukaryotes. Accumulation can be stopped when the cost of carrying TEs increases synergistically or the transposition rate is regulated. Interestingly, a synergistic increase of fitness costs, in eukaryotes, is a symmetric solution to discounting fitness benefits in prokaryotes (Figure 1). The reason for this symmetry probably lies in the cost of transposition. In prokaryotes, transposition is very costly (), and hence extinction needs to be prevented by supplying a benefit, whereas in eukaryotes the low cost () of transposition leads to increasing TE population sizes that have to be countered by a synergistically increasing fitness cost, eventually pushing the fatality rate γ in our model past 0.5. In both cases, the TEs modify host fitness to form stable sequence population sizes.Our results also show that only a small subset of discounting fitness functions allow REPINs to persist in bacteria. The range is particularly small for the discounting effect w. Only if the benefit each additional REPIN provides decreases by about 1% to 5%, are there many parameter combinations that lead to a REPIN population of biologically relevant size (i.e., between 91 and 323 REPINs). The surprisingly narrow range of the discounting effect will allow us to test our model in the future. In a laboratory experiment, one could, for example, delete all REPINs in a single bacterial strain (e.g., with CRISPR technology) and then add REPINs one at a time (or vice versa). We would expect the average additional benefit for each REPIN added to decrease by about 1–5%.The fitness advantage of bacteria carrying a single REPIN over bacteria carrying no REPINs should be on average in the range of the benefit strength α. The benefit strength α is expected to be low per individual REPIN . Low benefit strength is a consequence of low levels of harm done by REPIN transposition due to low-transposition rates . Interestingly, even when the benefit provided by each REPIN is less than the harm done () it is possible to maintain stable REPIN populations at least in the presence of hgt. It is unclear whether these results still hold in the absence of hgt, which might be more biologically relevant. Our simulations suggest that in almost all cases where bacterial populations survive with low benefits in the presence of hgt, the populations would not survive in the absence of hgt (Appendix D).Currently, there is little evidence to what benefit REPINs (in conjunction with RAYTs) could provide to the host bacterium. We have previously speculated that RAYTs and REPINs could be part of a promoter (REPIN) and transcription factor system (RAYT) (Bertels and Rainey 2011a). This speculation was based on the fact that REP sequences (repetitive components of REPINs) have been shown to affect gene expression of neighboring genes by terminating transcription or affecting mRNA half-life (Merino ; Espéli ; Liang ). In turn, the RAYT protein could modify this effect by binding to the REPIN and excising it from the mRNA. Alternatively, RAYT and REPINs could affect gene expression by altering folding of the DNA; another function REP sequences have been implicated in Yang and Ames (1988) and Qian .For ISs, it has been argued that they can increase their persistence time because they occasionally cause beneficial mutations in the host (Schneider ; Startek ). It is unlikely that the same argument can be made for REPINs, for the following reasons. First, REPINs are maintained for millions of years as a stable sequence population. If it were possible to explain their persistence through occasional beneficial mutations, then we would also expect IS elements to persist as populations, which they do not. Second, one of the reasons IS elements cannot persist over long periods inside bacterial genomes, is that the mutator phenotype they can cause is extremely costly (every additional insertion increases the transposition rate and hence also the mutation rate), unable to compete with mutator phenotypes generated through mutations in mut genes (Wielgoss ; Consuegra ). Third, to significantly contribute to the host bacterium’s mutation rate, REPIN transposition rates would have to be 1000 times higher than measured in E. coli and other species (Bertels ).Another appealing aspect of our study is the result concerning the fatality probability γ. The fatality probability describes what proportion of REPIN transposition events leads to the death of a bacterium. Our model suggests that γ has to be larger than 0.5 to yield a stable REPIN population. This result was somewhat surprising to us and initially did not seem to be compatible with the biological reality since studies have shown that only about 10% of all genes in the genome are essential (Baba ; Freed ). For fatality probabilities of less than 0.5 it is impossible to maintain sequence populations in bacterial genomes under our model.One could also argue that γ describes the proportion of essential genes in at least one of the bacterium’s natural environments. That the set of essential genes in one environment differs from essential genes in a different environment has been shown in E. coli (Nichols ). Hence, if E. coli regularly encounters a large range of different environments; the proportion of genes that contribute to fitness that is too high to result in their loss from the genome might exceed 50%.Another indicator of the importance of genes for long term survival is the size of the core genome. In E. coli about 46% of the genes in the genome are found in all strains (Touchon ). Suppose we now add essential noncoding regions such as rRNa, tRNA, and essential regulatory regions. In that case, it is likely that indeed more than 50% of the genome is essential for long-term survival of the strain. What “long term” means depends, of course, on how we define a species. The most common ancestor of all E. coli strains is predicted to have lived about 15 million years ago (Ochman ; Bertels ). Hence, about 50% of the genes were necessary for all E. coli strains to survive for the last 15 million years. For a more specialized subset of strains within the E. coli species a much larger proportion of genes is expected to be shared and important for survival. Hence, it seems plausible that a large proportion of the bacterial genome is required for long-term survival as predicted by our model. If this is not the case, then the number of repetitive sequences should increase over time, similar to what can be observed in birds and mammals, where transposon replication is only counteracted by infrequent loss events of large parts of the genome (Kapusta ).Our results in Figure 6 only hold if the hgt rate is much higher than the transposition rate δ. Active hgt mediated by the RAYT transposase is very unlikely to occur in nature (Bertels ). Although REPINs and RAYTs may be passively transferred to other genomes through homologous recombination (Guttman and Dykhuizen 1994), the resulting REPIN transfer rate is probably low. Hence, the results in Figure 6 might not be directly applicable to REPIN populations. Nevertheless, simulations for low-hgt rates show that REPIN populations can persist without hgt, given that the REPIN population is beneficial for the host (Appendix D).In the absence of hgt antagonistic coevolution as observed for other mobile genetic elements is nigh impossible. A predominantly vertical mechanism of inheritance ties the evolutionary fate of REPINs almost entirely to the host’s fate. The only way to ensure REPIN survival is to ensure the survival of the host. REPIN populations that are not providing enough of a benefit will be purged. Hence, coevolution between REPINs and the bacterial host is unlikely to be antagonistic compared to other mobile genetic elements.One of the main issues we have not addressed in our current study is the RAYT transposase evolution. If we assume that RAYTs can be lost and gained from the genome leading to a REPIN transposition rate δ of 0, then our model’s long-term dynamics could change. Extending our current model with the possibility of RAYT evolution requires at least one more parameter to describe RAYT gain and loss rates. In Appendix E, we present an elementary analysis of such a model. We assume that the number of RAYTs linearly increases the transposition rate δ but does not affect the benefit accrued by the REPIN (an assumption ripe for experimental testing). A numerical simulation of the extended model shows that REPINs are maintained at a stable equilibrium, which slightly varies between bacteria containing a RAYT gene(s) and bacteria that do not contain a RAYT gene. We plan to extend this model in the future to set RAYT gain and loss rates to correspond to observed data. Ideally, such a model may accurately predict a set of parameters that could, for example, explain the E. coli data presented in Figure 2.Currently, all our analyses are deterministic. Although these models do not currently allow us to measure the long term stability of the system, we are confident that at least for REPIN populations that are larger than 100 individuals, populations should be stable for long periods [as investigated in a previous study (Bertels )]. Hence, the conditions explored here could explain the presence and maintenance of sequence populations in bacterial genomes.In conclusion, our analyses show that discounting beneficial effects can explain the presence of stable REPIN populations in bacterial genomes. The small parameter range of our benefit function provides a plethora of testable hypotheses on the evolution of intragenomic sequence populations in bacterial genomes.
Data Availability
All codes with code README file for simulations are available on GitHub. Data for Figure 1 are in the same repository in a subfolder REPINSDataFig.
Authors: Igor B Rogozin; Kira S Makarova; Darren A Natale; Alexey N Spiridonov; Roman L Tatusov; Yuri I Wolf; Jodie Yin; Eugene V Koonin Journal: Nucleic Acids Res Date: 2002-10-01 Impact factor: 16.971
Authors: Patricia L Foster; Heewook Lee; Ellen Popodi; Jesse P Townes; Haixu Tang Journal: Proc Natl Acad Sci U S A Date: 2015-10-12 Impact factor: 11.205
Authors: Marie Touchon; Claire Hoede; Olivier Tenaillon; Valérie Barbe; Simon Baeriswyl; Philippe Bidet; Edouard Bingen; Stéphane Bonacorsi; Christiane Bouchier; Odile Bouvet; Alexandra Calteau; Hélène Chiapello; Olivier Clermont; Stéphane Cruveiller; Antoine Danchin; Médéric Diard; Carole Dossat; Meriem El Karoui; Eric Frapy; Louis Garry; Jean Marc Ghigo; Anne Marie Gilles; James Johnson; Chantal Le Bouguénec; Mathilde Lescat; Sophie Mangenot; Vanessa Martinez-Jéhanne; Ivan Matic; Xavier Nassif; Sophie Oztas; Marie Agnès Petit; Christophe Pichon; Zoé Rouy; Claude Saint Ruf; Dominique Schneider; Jérôme Tourret; Benoit Vacherie; David Vallenet; Claudine Médigue; Eduardo P C Rocha; Erick Denamur Journal: PLoS Genet Date: 2009-01-23 Impact factor: 5.917
Authors: Bram van Dijk; Frederic Bertels; Lianne Stolk; Nobuto Takeuchi; Paul B Rainey Journal: Philos Trans R Soc Lond B Biol Sci Date: 2021-11-29 Impact factor: 6.237