Literature DB >> 32280434

Selection or drift: The population biology underlying transposon insertion sequencing experiments.

Anel Mahmutovic¹, Pia Abel Zur Wiesch^1,2,3,4, Sören Abel^1,2,4,5.

Abstract

Transposon insertion sequencing methods such as Tn-seq revolutionized microbiology by allowing the identification of genomic loci that are critical for viability in a specific environment on a genome-wide scale. While powerful, transposon insertion sequencing suffers from limited reproducibility when different analysis methods are compared. From the perspective of population biology, this may be explained by changes in mutant frequency due to chance (drift) rather than differential fitness (selection). Here, we develop a mathematical model of the population biology of transposon insertion sequencing experiments, i.e. the changes in size and composition of the transposon-mutagenized population during the experiment. We use this model to investigate mutagenesis, the growth of the mutant library, and its passage through bottlenecks. Specifically, we study how these processes can lead to extinction of individual mutants depending on their fitness and the distribution of fitness effects (DFE) of the entire mutant population. We find that in typical in vitro experiments few mutants with high fitness go extinct. However, bottlenecks of a size that is common in animal infection models lead to so much random extinction that a large number of viable mutants would be misclassified. While mutants with low fitness are more likely to be lost during the experiment, mutants with intermediate fitness are expected to be much more abundant and can constitute a large proportion of detected hits, i.e. false positives. Thus, incorporating the DFEs of randomly generated mutations in the analysis may improve the reproducibility of transposon insertion experiments, especially when strong bottlenecks are encountered.

Entities: CellLine Chemical Disease Species

Keywords: Bottleneck; DFE; Distribution of fitness effects; Drift; Multinomial random sampling; Population biology; Random birth-death process; Selection; Tn-seq; Transposon insertion sequencing

Year: 2020 PMID： 32280434 PMCID： PMC7138912 DOI： 10.1016/j.csbj.2020.03.021

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Introduction

Transposon insertion sequencing is a powerful method to detect genomic loci (e.g., genes) that contribute to growth and survival in a given environment. It relies on genome-wide random disruptions of loci by transposon insertion in a bacterial population and the detection of mutants with transposons at specific insertion sites by sequencing. Mutants with transposon insertions in loci that are important for survival in the tested environment are assumed to be underrepresented in the population of mutated cells. This is the central paradigm of transposon insertion sequencing: The number of sequence reads per locus is correlated with mutant fitness in the tested environment. In the last decade this technique has been very successfully used in a wide variety of organisms and strains to determine the gene products essential for virulence [1], [2], [3], [4], [5], [6], [7], [8], [9], tumorigenesis [10], persistence [11], [12], biofilm formation [13] and antibiotic resistance [14], [15] both in vivo and in vitro (see [16], [17], [18] for a more comprehensive list of applications). While several transposon insertion methods have been developed e.g., Tn-seq [19], Tn-seq Circle [20], INSeq [21], TraDIS [22] and HITS [23], the primary distinction between them is the protocol employed to amplify the transposon-genome junction to identify the transposon insertion site. For simplicity, we will refer to transposon insertion sequencing as Tn-seq in this work. The fundamental steps shared by all Tn-seq methods are transposon mutagenesis (Fig. 1 – Step 1), growth in a selective environment (Fig. 1 – Step 2), selection for mutants (Fig. 1 – Step 3), preparation of the transposon-genome junction for sequencing (depending on the technique) and sequencing (Fig. 1 – Step 4), mapping the sequence reads to the wild-type genome and tallying the number of reads for each transposon insertion site. Afterwards, the reads are analyzed (Fig. 1 – Step 5). During the analysis, the sequence reads are mapped to the genome to identify the respective transposon insertion site, tallied and the loci are categorized based on the tally. Essential insertion sites are those for which the mutants display a strong growth defect relative to the wild-type population, for example when no sequence reads are mapped to all transposon insertion sites within a gene. Accordingly, mutants displaying no growth defect are neutral and mutants that show a strong growth advantage over the wild-type population are categorized as advantageous. Hence, the objective of a Tn-seq experiment is to assess fitness costs where the basic premise is that the number of reads is proportional to fitness, i.e. that changes in mutant frequencies are due to selection.

Fig. 1

Schematic of transposon insertion sequencing workflow. Description of individual steps to create a transposon insertion library and/or define essential genes in a specific condition. Not all steps can be easily observed experimentally; we highlight routinely measured quantities (eye symbol). In the first step (1), transposons (colored rectangles) are delivered to recipient bacteria and integrated into the genome (rings) at different positions (i) out of all possible insertion sites (k; black rectangles), resulting in N mutant cells. Wild-type cells grow with a division rate β and a death rate δ. The transposons disrupt genes which can result in altered division rates (w) and altered death rates (w) that are specific for the cells bearing a transposon at site i. Typically, experimental constraints lead to inadvertent (or sometimes intended) bacterial growth and death before the library can be analyzed (2AB). This typically serves to select against the wild-type (2B) (dead cells are marked by red x) and leads to a distortion of the mutant frequencies present in the library created by mutagenesis. Sampling of cells (3) can lead to additional distortions. Sampling includes various experimental processes for example harvest of the cells, genomic DNA preparation, and the small amount of genomic DNA subjected to PCR amplification. During the last experimental steps (4), the transposon-genome junctions are prepared for sequencing (exact protocol varies by technique) and then sequenced. Since sequencing capacity is typically limiting, the sequencing bottleneck is another sampling event. Finally, the sequencing data are analyzed (5) by mapping them to the genome and by quantifying the number of sequences per transposon insertion site (n) (green bars). The probability of no reads for a transposon insertion site i is given as qi. The probability of no reads in m sites is qj(m). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) However, the results are frequently not clear-cut. Repetitions of very similar (if not identical) Tn-seq experiments by different laboratories often have poor overlap [24], [25]. Part of the problem is that a number of factors contribute to noisy results and obscure the correlation between the number of reads and fitness. Among them are random birth-death processes and sampling events/bottlenecks. Both can lead to random distortions of mutant frequencies, i.e. drift. Depending on the experimental setup, bottlenecks can randomly remove substantial proportions of the transposon library, especially for in vivo experiments. For example, ~99.99% of an infectious dose of Vibrio cholerae is lost during infections of a rabbit model host [26] and ~99.9999% of Listeria monocytogenes does not survive orogastric inoculation of mice [27]. In addition, sequencing itself can become a bottleneck if a (too) low number of sequence reads is acquired to analyze an experiment. Statistical methods have been developed to analyze noisy reads and assign essentiality to transposon insertion sites [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38]. In general, transposon insertion experiments have been greatly successful to generate lists of genomic loci enriched for relevant hits from which researchers pick individual genes or groups of genes with related function. These are then confirmed independently and analyzed further e.g. in [39], [40]. Since Tn-seq is a global genetic screening technique, it allows in principle to test all genes in a single experiment simultaneously. However, a global screen poses much higher demands on understanding experimental noise: from the perspective of a single gene or pathway, a much higher false positive rate is acceptable than in a global screen where the false positive rate has to be multiplied with the number of all genes that are tested. Experimental noise, i.e. distortion of mutant frequencies, can be understood with population biological models. Birth-death processes and bottlenecks are well understood in population biology, and both bottlenecks [25] and bacterial growth have been modeled implicitly and explicitly [32], [41]. However, understanding random distortions of mutant frequencies requires a population biological model encompassing all processes that add noise. In addition, when screening globally for a comprehensive list of genes that are crucial for viability in the given environment, it is important to carefully formulate the goal of the screen. After random mutagenesis, most mutants will have fitness defects, even though they may be on average mild. The distribution of fitness effects of random mutagenesis is a matter of intensive research [42]. Since the vast majority of mutations is detrimental, the goal of a Tn-seq experiment is to enrich for those mutants that are significantly more impaired than an average random mutant. To assess this, the distribution of fitness effects of random mutations has to be taken into account. Here, we develop a theoretical modeling framework that can describe the effects of mutagenesis, random birth-death processes and bottlenecks on the composition of a library created by transposon mutagenesis. In contrast to previous approaches, we do take the distribution of fitness effects (DFE) of random mutations into account, i.e. assume that the fitness of the generated mutants is on average slightly lower than the one of the wild-type, with some mutants having very low fitness and only few mutations being beneficial. Our model allows us to address the question whether mutant frequencies changed because of selection or drift. For simplicity and to get conservative estimates, we focus on the extreme case of mutant extinction, i.e. zero reads in a specific locus. While we find that in the absence of strong bottlenecks, mutants that go extinct typically have strongly reduced fitness, we also find that bottlenecks of the size to be expected in animal models add so much random extinction that the fitness of extinct mutants is not substantially larger than what can be expected of any random mutant.

Results

Mutagenesis

A standard transposon insertion experiment proceeds by first creating a pool of insertion mutants whereby transposons are randomly inserted into the genome of cells by means of transposon mutagenesis (Fig. 1). We model mutagenesis so we can characterize the composition of the starting population of N mutant cells by making some simplifying assumptions. The primary assumption is that mutagenesis is completely random where the distribution of mutants over the potential insertion sites is uniform. A uniform distribution of mutants over potential insertion sites requires neglecting the influence of potential genomic cold-/hotspots at which transposons are more or less likely to insert [43]. In addition, this requires that the procedure to create mutants itself does not distort the mutant proportions, for example by more growth of mutants that are created early in contrast to mutants that are created late in the mutagenesis process. Finally, we assume that at most one transposon integrates into the genome. This can for example be experimentally achieved by having an excess of wild-type recipients over transposon donors such that the chance of more than one transposon donor transferring a transposon to a particular wild-type cell is negligible. Given these assumptions, the mutagenesis process is equivalent to a multinomial random sampling process where N mutants are picked from an infinite pool of uniformly distributed mutants (Fig. 2A). Equivalence is here taken to mean that both the experimental process of mutagenesis and the model result in a uniform distribution of N mutants over the potential insertion sites. A consequence of the uniform distribution is that the chance of picking a mutant cell corresponding to any transposon insertion site i is 1/k where k is the number of potential transposon insertion sites. The number of mutant cells where the transposon has inserted into insertion site i independently n time is described by n. The probability distribution of obtaining n mutants is binomial in the multinomial random sampling model where the average and the variance over repetitions of the mutagenesis experiment is µ=N/k and µ(1 − 1/k), respectively.

Fig. 2

Illustration of mutagenesis model and the resulting correlation between library complexity and number of mutants. (A) Transposon mutagenesis leads to a pool of uniformly distributed mutant cells over transposon insertion sites i. Illustration of the experimental mutagenesis and the multinomial random sampling model that leads to the same distribution. We show three examples of experiments in which three independent transposon mutants N were created. In this example the transposon (colored rectangle) can insert in any of four potential transposon insertion sites k (black rectangles) on the genome (rings) and depending on the insertion site, can lead to different library complexities C. This is equivalent to randomly picking N mutants from an infinite pool (triple dots) of uniformly distributed mutants. Both the experiment and the model lead to the same distribution of cells with mean µ and variance µ. Because the probability of a transposon to insert to any transposons insertion site is equal, the mean and the variance can be determined either over multiple repetitions of the same experiment or over the different transposon insertion sites within a single experiment (arrows). (B) The relationship between the number of mutants after mutagenesis, N, and the library complexity, C, for k potential insertion sites where k was set to 104 (solid), 105 (dashed) and 106 (dotted). Eq. (1) was used and solved for the number of mutants N = kln(1-C) to generate the plot where the range of C was set to 0.001 to 0.999. The larger the number of potential insertion sites, the more mutants are needed to reach a given library complexity shown as a shift of the graph to the right for larger k. At this stage, we approximate the binomial distribution as a Poisson distribution for which the variance is equal to the mean, µ. This approximation works well for large N and small probabilities for picking a mutant cell (1/k). Typical transposon insertion experiments are particularly amenable to this approximation. For example, for the Himar1 mariner transposon k is of the order 105 and while it depends on the organism, an N of the order 105-106 can be achieved in many bacteria. In the context of the Poisson model, the probability C of sampling at least one mutant corresponding to transposon insertion site i is In supplementary figure S1 we compare and validate this equation against random sampling simulations. See table 1 for an explanation of all variables in this paper.

Table 1

A summary of the variables used in this work.

Variable	Meaning	Comments
i	A potential transposon insertion site in the genome of the cell.	The range is i = 1,2,…,k.When focusing on extinction probabilities within a gene: i = 1,2,…,m = k_G.In the main paper, we consider complete extinctions of all sites within a gene.In the supplementary material, we show the equations for quantifying extinctions for a subset of insertion sites, m, with k_G being the potential number of transposon insertion sites within a gene (Supplementary data S1).
k_G	Number of insertion sites per gene/locus	This variable is used in the supplementary material to quantify extinctions for a subset of insertion sites, m, with k_G being the potential number of transposon insertion sites within a gene (Supplementary data S1).
k	Total number of potential transposon insertion sites in the genome.	The value of k depends on the specific transposon used in the experiment and the wild-type organism and strain.
Na	Number of mutants in the mutagenesis step.
β	Division rate of wild-type cells.	Defined as the inverse average time it takes for a wild-type cell to divide.
δ	Death rate of wild-type cells.	Defined as the inverse average time it takes for a wild-type cell to die.
w_i	Fitness coefficient.	Throughout the paper we assume that the effect of inserting a transposon into the genome of wild-type cells is to modify the net growth rate by w_ir with w_i > 0 and r > 0.
<n_i(t) > b	Average number of mutants with a transpons insertion at site i at time t.
m	Subset of insertion sites in one locus that are simultaneously extinct
q_{i,growth/bottleneck}^m	Extinction probability of all mutants with transposon insertions in gene i in all m sites.	We use q_i,growth if the extinction is due to a random birth–death process or q_i,bottleneck if the extinction is due to a random sampling event.
µ	Number of mutants per potential number of transposon insertion sites.
C	Library complexity after mutagenesis.	Defined as the number of transposon insertion sites with at least one transposon insertion divided by the potential number of transposon insertion sites.
r	Net growth rate of wild-type cells.	The net growth rate is the difference between the division rate β and the death rate δ.
t	Time of growth
s	Selection coefficient s = w-1
<N(t)>	Total average number of mutants at time t.
f_i(t)	The proportion of mutants with a transposon insertion at site i.	<f_i > denotes the average proportion of mutants where the average is taken over realizations.
Z_m	Average number of zero reads over m transposon insertion sites.	Z_k, i.e. the average number of zero reads over all transposon insertion sites in the genome, is used to calculate the reduction in library complexity due to random birth–death events and bottlenecks.
n_s	Sampling size.
b	Bottleneck size	Sample size relative to the total mutant population size N.
C′	The library complexity after growth/death or sampling.
a_i	The base of Eq. (5):ai=δ-δewirtδ-βewirt	Introduced for notational simplicity to express Eq. (9) in an easily accessible form.The variable a_i carries the interpretation of the extinction probability of a mutant population consisting of a single cell for transposon insertion site i.

All cell numbers are implicitly expressed as per unit volume.

Averages over repetitions are denoted with angular brackets <>.

A summary of the variables used in this work. All cell numbers are implicitly expressed as per unit volume. Averages over repetitions are denoted with angular brackets <>. Because mutants are uniformly distributed over k, repetitions over experiments are equivalent to repetitions over transposon insertion sites. Therefore, the probability of a given number of independent mutants with transposon insertions at site i is equal to the proportion of mutants with that given number in a single experiment. Therefore, C is the proportion of transposon insertion sites with at least one mutant, often referred to as the library complexity [16]. The library complexity C is experimentally easily observable and serves a good measure for how comprehensive the Tn-seq screen is. Finally, since C is the probability of sampling at least one mutant for transposon insertion site i, 1-C is the probability of sampling zero mutants. The relationship between the number of mutants (N), the library complexity (C) and the potential number of transposon insertion sites (k) (eq. (1)) is illustrated in (Fig. 2B) and validated against Monte-Carlo simulations in (Fig. S1).

Growth of mutant library, a random birth-death process

The next step in the workflow of a transposon insertion sequencing experiment involves growing the mutant library in a bacterial growth medium (Fig. 1) and is typically required to select against wild-type cells that did not receive a transposon. While this step can be used to simply prepare a mutant library for later investigation, it can also be used to identify the genetic elements for which disruption has a strong fitness effect on the mutant strain under the respective growth conditions, often called “essential gene analysis” [44], [45], [46], [47], [48], [49], [50]. Over time, mutants with a low fitness will decrease in frequency, while mutants with a high fitness will increase. The growth process can be understood as random birth-death events and changes the abundance of mutant cells. This can even lead to extinction of mutants, especially when the initial mutant population size is small, either because the mutant was not created often during the mutagenesis process in the first place or because its abundance decreased over time. For simplicity and clarity, we will first describe the changes in mutant frequencies due to fitness differences (Section 2.2.1). In a second step, we will investigate how mutants may disappear (i.e. lead to zero reads) (2.2.2, 2.3).

Frequency changes due to fitness costs

The effect of a transposon insertion at site i can in general be complex and can affect the baseline (wild-type) division rate, β, and the baseline death rate, δ, differentially. Here, the meaning of the division rate β is the inverse of the average division time of a cell and conversely, the death rate δ is the inverse average of the time it takes for a wild-type cell to die. Usually, bacterial division and death rates are unknown for cells and would need to be determined for all mutants by complex setups, e.g. by single cell microscopy [51] or the plasmid segregation method [52]. Hence, we do not distinguish between the fitness effects on division and death, but investigate the net division rate, i.e. the net change of the bacterial population size over time of the transposon mutant. In contrast to the division and the death rate, the net division rate is easily observable and is described as r = β-δ for wild-type cells. For the mutant population at insertion site i this becomes wir. Accordingly, the average number of mutants for insertion site i at time t is,where ni(0) is the mutant population size corresponding to transposon insertion site i at the start of the birth-death process. The fitness coefficients in Eq. (2) are real numbers and can be either positive or negative. For example, the meaning of a negative fitness coefficient wi is that the mutant subpopulation for transposon insertion site i is dying if the wild-type population of cells is on average growing (r > 0). Likewise, if wi > 0 and r > 0 then a small value for wi means that the mutant cells grow at a slower rate than wild-type cells. We only consider mutants that would be able to grow under the measured conditions, because only those will contribute to the population dynamics in the long term. Depending on the definition of essentiality, mutants that are able to grow but go extinct during the experiment can be regarded as false-positives. It has been argued that random mutations with positive fitness coefficients follow a gamma distribution [53], and consequently we sample the fitness of our mutant library from a theoretical gamma distribution. The modelling approach is illustrated in (Fig. 3). However, the quantitative method presented here works just as well for any real numbers wi and r.

Fig. 3

Modelling growth of mutant library. (A) Illustration of the distribution of fitness coefficients (distribution 1). We assume a gamma-distribution for a mutant library created by transposon insertion [53] with a shape parameter of 10 and a scale parameter of 0.09. The fitness of the wild-type, w = 1, is highlighted by the red dashed line. We do not model lethal mutants with a fitness below 0 that would have a negative net growth rate. (B) Same as (A), illustrating the distribution of selection coefficients s = w-1. (C) Illustration of mutant composition of an arbitrary mutant library during exponential growth. The x-axis shows the time in bacterial doubling times, the y-axis shows the number of bacteria. There are six mutants with fitnesses w1 = w2 = 1 (violet and blue), w3 = w4 = w5 = 0.8 (green, yellow and orange), w6 = 0.5 (red). At t = 0, the simulation starts with one mutant of each of the six genotypes and follows them for 10 generations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Eq. (2) is based on well-established theory on stochastic birth-death processes [54] where is the average over multiple stochastic trajectories. This equation formalizes the assumption underlying all Tn-seq analysis: The number of sequence reads are proportional to the fitness cost of a transposon mutant [19]. When the population size goes to infinity (and the dynamics can be described by deterministic models), stochastic fluctuations do not exist and we do not need to take the average of several realizations. Hence can be replaced with ni(t). This is valid for large population sizes for example during exponential growth. The net growth rates for the mutant subpopulations in Eq. (2) are wir where r = β-δ is the net growth rate for wild-type cells. Mutant cells therefore divide with a uniformly scaled division rate w and die with a death rate w. Developing Eq. (2) a bit further, the total average population size is simply the sum of the average mutant subpopulation sizes (eq. (2)) over all potential transposon insertion sites, Hence the average proportion of mutants at insertion site i ( ) is approximately < fi(t)>= ≈/ for which the equation reads after substituting in Eqs. (2), (3), The approximation is a consequence of the fact that the population sizes in a stochastic birth-death process are random for which the average of a ratio of random variables is not exactly equal to the ratio of the averages. Eq. (4) becomes an equality for infinite population sizes, i.e. for deterministic models. Note that while the mutant subpopulations grow independently of each other, the average proportion of mutants at insertion site i depend on the population sizes of the mutants at all insertion sites. To rephrase this, the average proportion of mutants at insertion site i depends on the fitness distribution in the entire population. As such, calculated quantities that depend on the average proportion of mutants will be sensitive to the fitness distribution of the whole population. This can for example complicate the analysis of Tn-seq experiments that compare two mutant libraries generated in different parental backgrounds.

Random extinction due to birth-death events

One of the major readouts of a Tn-seq experiment is the absence of any insertions in a gene. This is often taken as a sign that the gene in question is essential under the conditions in which the experiment was conducted. To further improve our quantitative understanding of the underlying birth-death processes we investigate mutant extinction due to stochastic fluctuations and quantify the average number of extinct mutant populations over m sites, (Z). For instance, Z, is the average number of extinct mutant populations in a gene with 50 potential transposon insertion sites. Based on random birth-death processes, the extinction probability of all mutants for one transposon insertion site i (q) reads,where ni(0) is again the mutant population size corresponding to transposon insertion site i at the start of the birth-death process. The extinction probability corresponding to a particular transposon insertion site i within a gene is dependent on only the fitness cost of the transposon insertion in that particular site. In other words, the extinction probability of each individual mutant due to stochastic fluctuations is independent of the fitness distribution of fitness values of all mutants (DFE). This stems from the assumption that all mutant subpopulations grow and die independently of each other. Additionally, the extinction probability depends on not only the net growth rate for the wild-type population but also the baseline turnover of cells under the investigated conditions, i.e. the division rate and death rate of wild-type cells. Therefore, to quantitatively answer whether zero transposon insertion reads are due to a significant fitness cost (i.e., the gene is “essential”) would require knowing the wild-type division rate β, the wild-type death rate δ, and the mutant subpopulation sizes ni(0) at the start of the birth-death process. We get Z, the average number of extinct mutant populations over m sites in a given gene, by recognizing that the extinction probability of mutants for insertion site i (q) is equal to the average number of times that the mutants go extinct for site i. Therefore, the average number of extinct mutant populations over m sites iswhere we have labeled the insertion sites within a gene as i = 1,2,…,m. Moreover the extinction probability of all mutants in a gene with m potential insertion sites is q. For the sake of simplicity we will focus on complete extinctions of all sites within a gene and use q to calculate their extinction probability. In the supplementary material (Supplementary data S1, Supplementary figure 1, Supplementary figure 2, Supplementary figure 3), we show how to get the extinction probability of mutants corresponding to m sites in a gene with k potential insertion sites where the combinatorics of counting the number of ways that m extinction events can be realized is taken into account.

Supplementary figure 1

Supplementary figure 2

Supplementary figure 3

In summary, this section sheds light on the factors that influence extinction probabilities due to a random birth-death process while cells grow in liquid culture. The following example illustrates the importance of the baseline division and death rates. Two identical mutant populations, i.e. the same mutant population sizes ni(0) with the same relative fitness w, are grown for the same time span with different baseline division and death rates, for example by growing them in media with different nutrient content. Even though the mutants have the same relative fitness compared to the wild-type at the respective growth conditions, the extinction probabilities of each mutant differ in the two environments because of their dependence on baseline division and death rates. Another example illustrates that the extinction probability during a birth death process is independent of the fitness of other cells, i.e. the fitness distribution in the entire population (DFE). We simulate the growth of two different mutant libraries in the same media for two hours and for four hours (Fig. 4). These libraries need not have the same fitness distribution, however, we assume that the wild-type populations grow with the same division rate (β) and the same death rate (δ). Over time the mutant populations grow, i.e. the total number of cells increases, seen as an increase in the area of the distribution (Fig. 4ABC). Simultaneously, the relative abundance of a specific mutant changes over time because fitter mutants, larger w, grow faster than mutants with higher fitness costs, (wi-1). How the relative abundance changes depends on the fitness distribution in the entire population. In our example, this is evident by the change in the binned mean fitness values of the mutant library over time (Fig. 4ABC). This also means that a specific mutant with the same fitness in both mutant libraries will be present in different proportions at the same time-point. Nevertheless, the extinction probabilities for two mutants with the same relative fitness would be the same for both mutant libraries as a consequence of the extinction probability being independent of the fitness distribution (Fig. 4DE). Even for mutants with a substantially reduced fitness (wi = 0.15), extinction is very unlikely. After two hours (Fig. 4D), extinction probabilities for the mutants corresponding to 1, 2, 3 and 4 insertion sites within a gene are 9.3 × 10−4, 8.7 × 10−7, 8.2 × 10−10 and 7.6 × 10−13 respectively. Hence, on average a single insertion site in a gene goes extinct every 1/9.3 × 10−4 ≈ 1000 repetitions of the experiment with more than one extinction event being orders of magnitude less likely. The extinction probability of all mutants within a gene increases with time, though it remains independent of the fitness distribution for a constant baseline division rate and death rate. After four hours, the extinction probabilities corresponding to 1, 2, 3 and 4 insertion sites within a gene are 7.6 × 10−3, 5.8 × 10−5, 4.4 × 10−7 and 3.3 × 10−9, respectively. Therefore, growing the cells for two additional hours increases the extinction probability by approximately one order of magnitude where there is, on average, one extinction event of a single insertion site in 1/3.4 × 10−3 ≈ 130 repetitions of the experiment.

Fig. 4

Random birth/death process and extinctions for two different DFEs. In this graph we compare the dynamics of two mutant libraries with different DFEs. The DFEs are gamma-distributed with shape parameter 10 and scale parameter 0.09 (distribution 1, black) and 0.04 (distribution 2, red). The number of potential transposon insertion sites (k) was set to 105 with the number of mutant cells set to 5 at t = 0 for i = 1,2,…, 105. In the top panel, the fitness coefficient wi is shown on the x-axis and the number of mutants i with the corresponding fitness is shown on the y axis. The fitness coefficients were binned using a bin width of 0.01 for wi values between 0.01 and 2.5 for both distributions. The binned mean fitness values (magenta and green dashed vertical lines) were calculated by summing wifi over i where i is the number of bins (250) and fi is the proportion of mutant cells in bin i. (A) The number of mutants present at the start of a birth-death process. (B) The distribution of the number of mutants over the fitness coefficients after 2 h of growth (Eq. (2)) with a baseline division rate set to β = 0.03 min−1 and a baseline death rate set to δ = 0.02 min−1. (C) The distribution of the number of mutants over the fitness coefficients after 4 h of growth with the same rates as in (B). The bottom panel shows the extinction probability (y-axis) for all mutants corresponding to 1–4 insertion sites within a gene (x-axis). Eq. (5) was used to calculate the extinction probabilities where wi was either sampled from distribution 1 (black) or distribution 2 (red). All insertion sites within an essential gene have the same fitness cost with wj arbitrarily chosen and set to 0.15 to represent a gene with high fitness costs. (D) Extinction probabilities after 2 h of growth. (E) Same as (D) except the mutant cells have been growing for 4 h. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Random sampling events (bottlenecks)

In addition to stochastic fluctuations during bacterial growth and death, sampling events (also known as bottlenecks) contribute to changes in mutant frequencies and may lead to extinction of mutants. Bottlenecks are frequently encountered during several steps of a typical Tn-seq experiment; some of which are unavoidable. For example, pipetting and sequencing can constitute bottlenecks. During pipetting, often a small volume is taken from a larger volume and during sequencing, the number of sequences acquired is limited. In addition, when Tn-seq studies are performed in vivo in animal models, the host defenses of the animal will impose additional bottlenecks [55]. For example, only one in 104 V. cholerae will contribute to colonization in a rabbit model after inoculation [56]. Generally speaking, all bottlenecks result in random distortions of mutant frequencies and extinction of mutants, both of which are independent of mutant fitness. Since the underlying assumption of all Tn-seq experiments is that mutant frequencies change depend on mutant fitness, bottlenecks add noise to the experimental readouts. This can be illustrated with an extreme example: When only one cell makes it through a bottleneck, the absence of all other mutants at the end of the experiment does then not state much about their fitness. The aim of this section is to formalize this intuitive reasoning and to predict mutant extinction due to random sampling. To model bottlenecks we use the multinomial random sampling model and the Poisson approximation for the probability distribution of sampling ni mutants for insertion site i. The multinomial random sampling model and the accompanying Poisson approximation are elaborated upon in the mutagenesis Section 2.1. In contrast to the mutagenesis model however, the total population size is finite. As a consequence, the multinomial random sampling model will only be accurate if small samples, ns, are taken from a very large population size N such that perturbations in the proportions of mutants for insertion sites i are negligible. Based on the Poisson model the probability q of all mutants for transposon insertion site i going extinct readswhere fi is the proportion of mutants for insertion site i prior to sampling and nsfi is the average number of mutant cells for transposon insertion site i after sampling. The extinction probability due to a bottleneck is a random variable if the frequency of the mutant i is itself a random variable due to preceding random processes. In our framework (Fig. 1) this could be a birth-death process and/or mutagenesis (or potentially another sampling event). To account for this, we move from the extinction probability of mutant i in a single experiment, q, to the average extinction probability when repeating experiments, . The average extinction probability can be estimated by using the error propagation method for which the simplest estimate is a substitution of the proportion of mutants fi with the average proportion of mutants in Eq. (7). In the context of library complexity reduction (Section 2.4), we include higher order terms that depends on the variances and covariances in the proportion of mutants to estimate the average extinction probability (Supplementary data S2). The interpretation of Eq. (7) and the connection to experiments depend on the experimental protocol prescribing how a sample is to be taken. Here, we distinguish between absolute and fractional bottlenecks [55]. In an absolute bottleneck the number of mutants that are present after a bottleneck is constant and independent of the pre-bottleneck population size. An example of such a bottleneck is the sequencing step, when a limited and constant number of sequences (often around 106−7) are read. When a population undergoes a fractional bottleneck, the number of mutant after a sampling event is always the same fraction of the original population, i.e. directly proportional. Such a bottleneck is encountered for example during pipetting where a fixed fraction of the total volume, independently from the concentration of the cells, genomic DNA, etc., is processed. The effect of absolute bottlenecks on mutant frequencies is illustrated in (Fig. 5), where we take the two mutant libraries that underwent a birth-death process from Fig. 4 and sampled 106 cells from them. The initial mutant population (Figs. 4A and 5A) for 105 potential insertion sites (k) contains 5 mutants for each individual transposon insertion site i with a fitness coefficient drawn from two different gamma distributions (red and black). This distribution and the total number of cells changes during the two hours or four hours long birth-death process (Fig. 4BC). While bottlenecks result in random distortions of mutant frequencies for single realizations of Tn-seq experiments they do not change the average frequencies. Since the only difference between Fig. 4, Fig. 5 is an added bottleneck, the mean fitness of the resulting distributions are identical. However, the extinction probability of individual mutants might strongly differ. In Fig. 5DE, we show the extinction probabilities of an arbitrary mutant with a fitness coefficient of w = 0.15 after a random birth-death process and a subsequent bottleneck as shown in Fig. 5BC.

Fig. 5

The effects of bottlenecks on mutant extinction. (A) The fitness distributions (DFE) of two mutant populations with high average (black, distribution 1) and low average (red, distribution 2) fitness at the start of the birth-death process and before passage through a bottleneck. This figure is equivalent to Fig. 4A. The binned mean fitness is indicated in dashed magenta for distribution 1 and dashed green for distribution 2. (B) The distribution of the number of mutants over the fitness coefficients after a sample of 106 mutants following 2 h of growth. The distribution of the pre-bottleneck population is shown in Fig. 4B. (C) The distribution of the number of mutants over the fitness coefficients after a sample of 106 mutants following 4 h of growth. The distribution of the pre-bottleneck population is shown in Fig. 4C. (D) The extinction probability (y-axis) for all mutants corresponding to 1–4 insertion sites within a gene (x-axis) after sampling 106 mutants following 2 h of growth. Eq. (7) was used to calculate the extinction probabilities where wi was either sampled from distribution 1 (black) or distribution 2 (red). All insertion sites within a gene have the same fitness cost where we show an example with wj arbitrarily chosen and set to 0.15. (E) Same as (D) except the mutant cells have been growing for 4 h after which 106 mutant cells are sampled. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) After two hours of growth, the extinction probabilities for the mutants corresponding to 1,2,3 and 4 insertion sites within a gene are 0.02186, 4.8 × 10−4, 1.0 × 10−5 and 2.3 × 10−7 for the first distribution (black) and 6.7 × 10−4, 4.4 × 10−7, 2.9 × 10−10 and 1.9 × 10−13 for the second distribution (red) (Fig. 5D). Hence, there is on average one extinction event per transposon insertion site in 1/0.02186 ≈ 46 repetitions of the experiment for the first distribution with more than one extinction event being orders of magnitude less likely. The extinction probabilities increase with additional time to grow (Fig. 5E). The extinction probabilities corresponding to 1, 2, 3 and 4 insertion sites within a gene are 0.2828, 0.08, 0.02263 and 0.0064 for the first distribution and 0.005, 2.9 × 10−5, 1.6 × 10−7 and 8.6 × 10−10 for the second distribution. Therefore, growing the cells for two additional hours prior to sampling increases the extinction probability for one event by ~2 orders of magnitude where there is, on average, one extinction event in 1/0.2828 ≈ 3–4 repetitions of the experiment for the first distribution. These numbers illustrate that even when two mutants from the two distributions have the same fitness, the extinction probability differs. The intuitive explanation is that mutants “compete” with others in the same library for a “spot” among the 106 cells that are let through the bottleneck. In our example, a mutant with the low fitness value of w = 0.15 will have a higher extinction probability and more extinctions over m sites for distribution 1 (higher average fitness) than for distribution 2 (lower average fitness). A large part of the differences in extinction probabilities can be explained by the bottleneck after four hours being relatively more restrictive than after two hours. Most cells for distribution 1 grow much faster than cells from distribution 2 because of their higher fitness. Therefore, a smaller fraction of distribution 1 is sampled when keeping the sampled cells constant at 106. However, if the number of sampled cells is adjusted to the total bacterial population size, i.e. a fractional bottleneck, the extinction probabilities between a two hours and four hours sampling time point would be equal. This is because the mutant frequencies do not change that much, compared to the total population size which changes dramatically. Mathematically, this can be explained by looking at the exponent in Eq. (7): −ns. For fractional bottlenecks, a constant fraction b of the total mutant population size N is sampled at both time points. Hence, the sample size is ns = b. Moreover ≈ /, such that ns = b and is independent of the total bacterial population size.

Comparison of effects of birth-death events and bottlenecks on library complexity

So far, we have quantified the effect of random birth-death processes (Section 2.2) and sampling bottlenecks (Section 2.3) via extinction probabilities q and the average number of mutant subpopulation extinctions over m sites, q. In order to better quantify the relative impact of extinctions due to random sampling events and random birth-death processes we aggregate the extinction probabilities into a single measure, the library complexity C. To do this, we use Eq. (6) where we set m = Ck to get Z which is the average number of mutant extinctions over the transposon insertion sites with at least one mutant cell after mutagenesis due to stochastic fluctuations if Eq. (5) is used (q) or random sampling events if Eq. (7) is used (q). Since Ck is the number of transposon insertion sites with at least one mutant, Ck-Z is the number of transposon insertion sites with at least one mutant after a random birth-death process or sampling. By dividing Ck-Z with k and applying the average operator, <>, we get the average proportion of insertion sites with at least one mutant after a sampling bottleneck or random birth-death process, ,where the average is taken with respect to repetitions of the experiment. In different words, mutants for Ck transposon insertion sites emerge from mutagenesis and subsequently undergo extinction events due to either a birth-death process or a random sampling event in a single repeat of the experiment. The implication is that only successful integrations at the very beginning of the experiment should be counted when taking the average of all steps to get the average reduced library complexity. In supplementary text S2 we take this into account and use the error propagation method to derive the average extinction probabilities ,due to a birth-death process where a has been introduced for notational simplicity and is equal to the base of Eq. (5). For a fractional random sampling event, the average extinction probability reads, The average extinction probabilities in Eqs. (9), (10) are subsequently used to calculate (eq. (6)) and the average reduced library complexity, (eq. (8)). Fig. 6 illustrates the library reduction due to stochastic fluctuations in a random birth-death process (Fig. 6A) and due to a random sampling event (Fig. 6B). The library complexity reduction due to stochastic fluctuations becomes negligible as µ becomes bigger than 10. However, the library complexity can become as low as 35% if on average, there is one mutant per insertion site (µ = 1), for which the initial library complexity is 63% according to Eq. (1). It is therefore advisable to have at least one order of magnitude more mutants after mutagenesis than the potential number of transposon insertion sites to minimize the chance of extinctions due to chance fluctuations caused by random birth-death events. In addition, the time the mutant population of cells is allowed to grow should be carefully evaluated. Depending on the underlying DFE, the baseline division rate, and death rate, the proportion of mutants with the lowest fitness cost will become overrepresented in the population. Sequencing will then act as a bottleneck and will select for the mutant cells present in the highest proportions, which could potentially lead to a significant reduction in library complexity. For example, if the number of mutants following mutagenesis is 106, which corresponds to a library complexity of 1 according to Eq. (1). By iteratively applying Eq. (8) to a birth-death step and then a sequencing bottleneck of size n = 2 × 107 we get an average library complexity of 2.92%. The duration of the birth-death step was chosen to be 24 h, otherwise we used the same parameter values as in Fig. 6 and a gamma distributed DFE with shape parameter 10 and scale parameter 0.09. If the scale parameter is reduced to 0.04 to decrease the likelihood of introducing an advantageous mutant, the library complexity jumps to 90.58%. Hence, the library complexity can be very sensitive to the shape of the underlying DFE, where the sensitivity is proportional to the baseline net growth rate and the time that the mutant cells spend in exponential growth phase. If the growth time is reduced from 24 h to 12 h, the library complexity becomes 78.82% for scale parameter 0.09 and 97.44% for scale parameter 0.04. Therefore, it is advisable to minimize the time the cells spend in exponential growth and maximize the number of mutants from mutagenesis when creating a transposon insertion library.

Fig. 6

Library complexity reduction due to random birth-death events and random sampling events. The initial mutant population emerges from the process of mutagenesis with an average number of mutant cells per potential transposon insertion site µ (x-axis) related to the initial library complexity C through Eq. (1). (A) The mutant population grows on average with a baseline division rate β set to 0.03 min−1 and a baseline death rate δ set to 0.02 min−1 with fitness coefficients sampled from distribution 1 (shape = 10, scale = 0.09). The mutant subpopulations have a chance to go extinct as a consequence of stochastic fluctuations shown as a reduction in library complexity, C’ (y-axis). The theoretical results were calculated using Eq. (9). The black solid line shows the library complexity reduction due to stochastic fluctuations after 2 h of growth and the black dashed line after 4 h of growth. Stochastic tau-leaping simulations were ran for 20 iterations for each value of µ where for each iteration the number of cells for site i = 1,2,…,103 was drawn from a Poisson distribution with mean µ. The mean and the standard error in the library complexity was subsequently calculated and plotted as red circles for cells growing for 2 h and red squares for cells growing for 4 h. (B) Library complexity reduction after sampling 10% (black solid line) and 1% (black dashed line) of the initial mutant population that emerges from the process of mutagenesis. The theoretical results were calculated using Eq. (10). Multinomial random sampling simulations were ran for 20 iterations where for each iteration the number of cells for site i = 1,2,…,105 was drawn from a Poisson distribution with mean µ. The mean and the standard error in the library complexity was subsequently calculated and plotted as red circles and red squares for the 10% bottleneck case and the 1% bottleneck case, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Bottlenecks are a major source of false positives during Tn-seq experiments

The main concern with fitness-independent disappearance of mutants due to bottlenecks is that genes that are extinct at the end of the experiment are classified as essential even though they may not be. Most mutants with a random mutation will have reduced fitness [53]. The aim of a Tn-seq experiment is to identify those that have a very large effect on fitness, ideally significantly larger than an average random mutant. In this section, we investigate how many mutants with fitness larger than zero, i.e. that would be able to grow under the selected conditions, go extinct by chance and how the fitness of these mutations is distributed. Since we have seen in Fig. 6 that birth-death processes generally are not the main source of fitness-independent extinctions, we now simplify our approach and model the growth of individual mutants i as deterministic exponential growth (deterministic version of Eq. (2)). When we substitute ni in Eq. (7) with Eq. (2), we obtain an extinction probability qiwith b as the bottleneck size, ni(0) as the initial post-mutagenesis number of mutants i, wi as the fitness coefficient, r as the net growth rate, and t as the time. If we now do not only look at the extinction of a specific insertion site i, but require all insertion sites m being extinct, the extinction probability of an entire gene, becomeswhere we assume that the fitness cost of all transposon insertion within a gene are equal. Note that requiring all insertion sites per gene being hit is mathematically the same as focusing on one individual insertion site, repeating the experiment m times and always observing extinction. Eq. (12) describes an approach that is often used to assign essentiality: Quantifying the probability of random extinction. Many approaches exist to do so, and many use more sophisticated frameworks [16], [41]. However, the prior fitness distribution of random mutations (DFE) is neglected. We illustrate the impact of the prior distribution in Fig. 7, where we derive the fitness distribution and size of the population of extinct mutants.

Fig. 7

Bottlenecks are a major source of false positives during Tn-seq experiments. (A) Distribution of fitness coefficients (DFE) for a library of random mutants. The DFE is the same as distribution 1 in Fig. 4, Fig. 5, i.e. gamma-distributed with shape parameter 10 and scale parameter 0.09. Here, we scaled this distribution to a population size of 106 cells after mutagenesis and assume five mutants per transposon insertion site at the beginning of the experiment. (B) Extinction probabilities (y-axis) of a single insertion site depending on the fitness coefficient (x-axis) and different bottleneck sizes (0.0001% black, 0.01% red, 0.1% green) as predicted by Eq. (12). We assume that bacteria grew for 4 h with a doubling time of 20 min (r = ln(2)/20 min−1). (C) Fitness distribution of the population of mutants that went extinct, i.e. had zero sequence reads at the end of the experiment. This graph was obtained by multiplying the gamma distribution in (A) with the extinction probability in (B). (B) and (C) focus on a single transposon insertion site. The percentage of loci going extinct for a given bottleneck is specified in the graph. (D) Same as (B) for the simultaneous extinction of 30 insertion sites with the same fitness, i.e. located in the same gene. (E) Same as (C), also for 30 insertion sites. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Because completely non-viable mutants would not contribute to the population, we focus only on mutants with a fitness larger than zero, i.e. we focus on “false positives” that would be classified as “essential” because of their absence in the sequencing data despite the fact that they can still grow under the experimental conditions. We furthermore assume that all insertion sites were hit (C = 1) and that ni(0) is uniformly distributed, i.e. the same for all mutants i. We also assume that the number of insertion sites that are grouped and go extinct simultaneously (e.g. per gene), m, is the same for all genes. This leads to conservative estimates that underreport extinction, which would be more likely if some mutants were only present infrequently or some genes had only few insertion sites (e.g. short genes). Under these assumptions, we can multiply the gamma-distributed DFE (Fig. 7A) with Eq. (12) (Fig. 7B and D) to obtain the distribution of fitness values for genes that randomly disappeared (Fig. 7C and E). This shows that even though the extinction probability per site is highest for mutants that have very low fitness, the majority of mutants that randomly disappear actually have intermediate fitness. This is because mutants with intermediate fitness were more frequent in the original population after mutagenesis. The percentage of entire genes that disappear randomly depends strongly on the bottleneck size and on the number of insertion sites per gene (compare Fig. 7BC to DE). Other factors that influence the percentage of “false positive” genes are the prior fitness distribution (Fig. S2) and the number of mutants per site i at the beginning of the experiment, ni(0) (Fig. S2). In our example, depending on the number of insertion sites per locus, 0.8%-10% of all genes would be misclassified as essential after a 0.01% bottleneck which was reported for V. cholerae infection models [26]. On an E. coli or V. cholerae genome scale (with approx. 4000 genes), this means 32–400 false positive hits. If a comprehensive answer to which genes are essential was the goal, the repletion would have to increase until the expected number of false positives falls below one. In this case, one would have to repeat a gene-wide analysis (approx. 30 insertion sites per gene, [16]) for eight experimental replicates, thereby bringing the number of repetitions to 30 × 8 = 240 (Fig. S2). We additionally analyzed the number of false positives in our full stochastic model to ascertain that our main conclusions hold true under more realistic scenarios (Fig. S3). Again, we find that the majority of false positive mutants that go extinct have intermediate fitness costs.

Discussion

Tn-seq is a powerful method to identify genes that are important for bacterial viability in a given environment. It has been successfully used to identify genes that are critical for specific processes and determine important phenotypes including virulence [1], [2], [3], [4], [5], [6], [7], [8], [9], persistence [12], [13], biofilm formation [14] and antibiotic resistance [15], [16]. Since Tn-seq is a global genetic screening technique, it allows in principle to test all genes in a single experiment simultaneously. In reality, several aspects of the experimental setup are limiting and repetitions of very similar (if not identical) Tn-seq experiments by different laboratories have poor overlap [25]. From a population biological perspective, this may be explained by the fact that drift, i.e. random extinction, plays a larger role than anticipated. All steps in the experimental process have their limitations and may obscure the correlation between sequence reads and mutant fitness that underlies all Tn-seq experiments. For example, it is difficult to ascertain that a transposon library is complex enough that all loci that are to be investigated are sufficiently covered. Random events during growth and passage through bottlenecks distort mutant frequencies and thereby add noise. In addition, the analysis is complicated by the fact that most mutants in any bacterial genome will result in a reduced fitness of the organism [53]. All this makes it difficult to achieve the goal of a Tn-seq experiment: to identify loci in which mutation leads to severe fitness defects, which are significantly larger than the mild effects of an average random mutation. Several analysis methods have approached these limitations from different angles. Some studies employ successive sequencing to track mutant populations over time [32], [41], implicitly or explicitly modelling bacterial growth. Also the effect of bottlenecks have been modeled by multinomial sampling [25]. However, a concise population biological framework encompassing all steps is still missing. Most importantly, while the effects of random mutagenesis on the distribution of fitness in a population (DFE) have been a matter of intensive research [42], [53], [57], [58], this underlying or prior distribution at the very start of a Tn-seq experiment has received little attention. In this paper, we investigate the population biology of a Tn-seq experiment, i.e. the changes in size and composition of a population of mutants created by transposon mutagenesis. Specifically, we investigate the population biological mechanisms that underlie experimental noise. We illustrate the effects of experimental noise and focus on a very conservative case, when mutants have zero reads at the end of the experiment despite being able to grow in the given environment. If “essential gene” is taken to mean that a cell cannot grow at all, i.e. the mutation is lethal, we focus on the dynamics of false positives. We take a two-pronged approach to investigate the population biology of a Tn-seq library: we develop both an analytical approach using basic population dynamical and probability theory concepts and a framework based on stochastic tau-leaping simulations. Both frameworks encompass the following steps of a Tn-seq experimental workflow: mutagenesis, a random birth-death process and random sampling events or bottlenecks (including sequencing). This yields the size and distribution of fitness values in those bacteria that go extinct (false-positives). To use the model developed in this work to fully describe a Tn-seq experiment, one requires careful measurements of parameters that influence library complexity. Thus, growth times (in exponential phase), OD values, the sampled volume sizes and the volume sizes of the media from which the cells are sampled, and DNA concentrations before and after each step during which DNA can be lost (shearing, adaptor ligation, …) must be carefully determined. In addition, one would need an estimated shape of the DFE and an estimate of the division rate and death rate of wild-type cells. In addition, care should be taken in how to sample fitness coefficients from the DFE. It is reasonable to assume that, in general, the fitness cost of insertions within the same gene are similar. Given this assumption, the DFE could be sampled on a gene level. Then the sampled fitness coefficients would need to be repeated a number of times corresponding to the number of potential Tn insertion sites in the genes. Taking this approach of integrating the model developed in this work with a specific Tn-seq protocol can aid in minimizing the discrepancies of current statistical methods, which do not fully incorporate a quantitative understanding of the process that generates the sequence reads in the first place. For the mutagenesis step, we find a simple relationship between the library complexity (the fraction of transposon insertion sites with at least one mutant), the total number of mutant cells after mutagenesis, and the potential number of transposon insertion sites. Our approach can be used to plan experiments such that a sufficient library complexity can be ensured. In addition, it serves as a null hypothesis regarding transposon distribution that can be used to identify preferential transposon insertion regions, i.e. genomic cold/hot spots. When modelling bacterial growth after mutagenesis with a birth-death process (a well-established method in population biology [54]), we find that fitter mutants are enriched and that the mean fitness of the population shifts towards higher values. However, we also find that this is not due to pervasive extinction of mutants (which is rare during the birth-death process in our model), but due to substantial growth where the fitter mutants grow faster. While we found little extinction and therefore little contribution to false-positives while bacteria grow, bottlenecks by definition lead to “mass extinctions”. The random disappearance of many mutants makes it very difficult to find out which mutants went extinct by chance and which loci are truly essential. With very stringent bottlenecks, the fitness distribution of the extinct mutants will be exactly the same as the one right after mutagenesis (minus the few cells that survived). Random extinction due to bottlenecks is especially problematic if genes have few insertion sites (e.g., short genes) or if subgenomic locations that only encompass few insertion sites are investigated. Random mutations such as those created by transposon mutagenesis [53] are usually deleterious, with only few being advantageous. The distribution of fitness effects (DFE) peaks at mildly deleterious fitness values with a long tail to strongly deleterious mutations and a sharp decline after mildly beneficial mutations. As stated above, the aim of a Tn-seq experiment is to identify those loci that are critical, i.e. a mutation in this loci is strongly detrimental. To assess in how far this can be achieved in the presence of bottlenecks, we investigate the distribution of fitness values in the population that goes extinct during the experiment. Importantly, we assess an idealized situation where the library complexity is perfect and no extinctions happen during the birth-death process. In this case, the extinction probability depending on mutant fitness can be described with Eq. (11). This probability also depends on the time bacteria grow, their doubling time, the bottleneck stringency, how many insertion sites are grouped together (e.g. located in one gene), and the number of individual mutants per insertion site after mutagenesis. When multiplying the number of loci expected with a certain fitness (from the DFE) with the extinction probability, we obtain both the size and the distribution of fitness values of the extinct mutants. Thus, this equation can serve as a tool to assess false positive rates and the mean fitness of false positives after a bottleneck. We find that permissive bottlenecks (0.1%, e.g. sequencing 106 reads from 1 ml bacterial culture with 109 bacteria, i.e. OD ~1) are only problematic when insertion sites are investigated individually and not grouped to larger loci. If we only count genes as “essential” when not a single insertion site in that gene was hit, the false positive rate for an average gene with 30 insertion sites starts to rise when only approx. 0.01% of the original population survives a bottleneck (such as in rabbit models of cholera infections [26]). For such a stringent bottleneck, one would have to repeat the experiment 8 times to achieve less than 1 false positive, thereby achieving a comprehensive, genome-wide list of essential loci. In extreme cases such as e.g. the colonization by listeria [27], there are so few cells that survive the host defenses, that the absence or presence of all other mutants is not conclusive of essentiality or even low fitness. For an average gene with 30 insertion sites, we would expect 30% of all genes to go extinct, i.e. recorded as false positives. These would have a mean fitness close to the mean fitness of a random mutant. While this work incorporates explicit population biological models of many steps of Tn-seq experiments, we had to make several simplifying assumptions, frequently because of a lack of experimental data. For example, it is conceivable that a mutation affects bacterial division, but not bacterial death. However, since only the net growth rate is routinely measured in batch culture and the rates of bacterial division and death are largely unknown, we here assume that the fitness coefficient scales the division and death rates equally. This has no effect on population turnover when the baseline death rate is 0 (as often assumed in population biological models, [52]). If we however assume there is bacterial death, lower fitness would result both in a lower growth rate and in lower bacterial death, thereby underestimating population turnover and extinction. We also assume for simplicity that bacteria replicate exponentially. Our assumption is only true as long as the culture density is far away from carrying capacity, such that our model is best suited to describe experiments that start with an abundance of nutrients. However, the assumption is helpful because resource limitations lead to competition between mutants of different fitnesses, which in turn leads to complex dynamics that would again depend on the unknown bacterial division and death rates [59]. Finally, the DFEs of bacteria are increasingly investigated, but not well-characterized for different strains or species. This work therefore calls for more determinations of DFEs, better measurements of bacterial division and death rates or supplementing Tn-seq with e.g. the plasmid segregation method [52] or single cell microscopy [51]. Furthermore, bottlenecks during the experimental setup need to be assessed and ideally measured [56]. As stated above, we are taking a relatively conservative approach to assess how many viable mutants would go extinct by chance. While we find that typical sampling during experiments should not be very problematic, our results do show that most mutants that are lost in the presence of strong bottlenecks have average fitness values. This is despite the probability of extinction being dependent on mutant fitness. This probability is not the sole determinant of the DFE of extinct mutants, but the DFE of the population after mutagenesis has to be taken into account by multiplying probabilities. If there are few mutants with low fitness to being with, few will be lost. Among the many mutants with intermediate fitness, some will be lost even though the chance of extinction per mutant is low. Thus, the previous disregard of the prior DFE and the conditional probabilities of extinction may shed light on the problem of reproducibility of Tn-seq experiments, especially in animal models. This reasoning is very similar to Bayesian reasoning in interpreting diagnostic tests. A famous example would be a diagnostic test with 95% accuracy taken by a person who is from a population with a low prevalence (0.1%) of the disease in question [60]. The probability that a randomly selected person who tests positive then actually has the disease is not 95%, but 2%, necessitating further tests [60]. Therefore, the fitness of all mutants extinct in Tn-seq experiments with animal models or other systems with strong bottlenecks should be tested individually against the wild-type, as for example in [5], [9]. The same conclusions can be drawn for any kind of screen that relies on random mutagenesis, such as CRISPR interference (CRISPRi) in eukaryotes [61].

Methods

Codes for (Fig. 3, Fig. 7) and (Fig. S2) were implemented in statistical software package R (version 3.4.4, The R Foundation for Statistical Computing, Vienna, Austria). Codes for all other figures were implemented in MATLAB (R2017b, The MathWorks, Natick, MA, USA). The stochastic tau-leaping simulation results (Figs. 6 and S3) were generated using StochKit2 [62] with a time step of 0.01 with additional code in Matlab to create an input file to StochKit2. All code for reproducing the figures and the Matlab code for generating the input file to StochKit2 for running the stochastic simulations are available as supplementary data. All scripts can be accessed via Mendeley Data under the terms of the Creative Commons (CC) license CC by NC 3.0.

CRediT authorship contribution statement

Anel Mahmutovic: Conceptualization, Methodology, Software, Formal analysis, Writing - original draft, Visualization. Pia Abel zur Wiesch: Conceptualization, Formal analysis, Writing - review & editing, Visualization, Supervision, Funding acquisition. Sören Abel: Conceptualization, Formal analysis, Writing - review & editing, Visualization, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

3 in total

1. Quantification of Brucella abortus population structure in a natural host.

Authors: Aretha Fiebig; Catherine E Vrentas; Thien Le; Marianne Huebner; Paola M Boggiatto; Steven C Olsen; Sean Crosson
Journal: Proc Natl Acad Sci U S A Date: 2021-03-16 Impact factor: 12.779

2. Refined Quantification of Infection Bottlenecks and Pathogen Dissemination with STAMPR.

Authors: Karthik Hullahalli; Justin R Pritchard; Matthew K Waldor
Journal: mSystems Date: 2021-08-17 Impact factor: 6.496

3. A Comprehensive Overview of the Genes and Functions Required for Lettuce Infection by the Hemibiotrophic Phytopathogen Xanthomonas hortorum pv. vitians.

Authors: Lucas Morinière; Laurène Mirabel; Erwan Gueguen; Franck Bertolla
Journal: mSystems Date: 2022-03-21 Impact factor: 7.324

3 in total