Gergely Tibély1,2, Dominik Schrempf2, Imre Derényi2,3, Gergely J Szöllősi1,2,4. 1. MTA-ELTE "Momentum" Evolutionary Genomics Research Group, Budapest, Hungary. 2. Department of Biological Physics, Eötvös University, Budapest, Hungary. 3. MTA-ELTE Statistical and Biological Physics Research Group, Eötvös Loránd Research Network (ELKH), Budapest, Hungary. 4. Institute of Evolution, Centre for Ecological Research, Budapest, Hungary.
Abstract
Tumors often harbor orders of magnitude more mutations than healthy tissues. The increased number of mutations may be due to an elevated mutation rate or frequent cell death and correspondingly rapid cell turnover, or a combination of the two. It is difficult to disentangle these two mechanisms based on widely available bulk sequencing data, where sequences from individual cells are intermixed and, thus, the cell lineage tree of the tumor cannot be resolved. Here we present a method that can simultaneously estimate the cell turnover rate and the rate of mutations from bulk sequencing data. Our method works by simulating tumor growth and finding the parameters with which the observed data can be reproduced with maximum likelihood. Applying this method to a real tumor sample, we find that both the mutation rate and the frequency of death may be high.
Tumors often harbor orders of magnitude more mutations than healthy tissues. The increased number of mutations may be due to an elevated mutation rate or frequent cell death and correspondingly rapid cell turnover, or a combination of the two. It is difficult to disentangle these two mechanisms based on widely available bulk sequencing data, where sequences from individual cells are intermixed and, thus, the cell lineage tree of the tumor cannot be resolved. Here we present a method that can simultaneously estimate the cell turnover rate and the rate of mutations from bulk sequencing data. Our method works by simulating tumor growth and finding the parameters with which the observed data can be reproduced with maximum likelihood. Applying this method to a real tumor sample, we find that both the mutation rate and the frequency of death may be high.
Cancer is an evolutionary phenomenon within a host organism that unfolds on the timescale of years, but only becomes apparent in late stages and, as a result, is most often not directly observed during large part of its evolution. Genomic sequencing data offers a window into the evolutionary processes underlying tumor development and progression. Analyzing widely available bulk sequencing data, where sequences from individual cells are intermixed, however, is challenging. Bulk sequencing can only essay mutation frequencies for a population of cells from each tumor sample and does not resolve the genotypes of individual cells, making even basic evolutionary parameters, like mutation rate and death rate, difficult to recover.Consider the neutral model of tumor growth, where new mutations can appear with each cell division, while cells can also die for reasons such as lack of nutrients or immune reactions [1, 2]. This simple process can be described by two paramaters, the ratio of cell death and birth rates and the mutation rate per cell division. Despite the simplicity of such a model and a plethora of bulk tumor sequencing data, these parameters remain largely unknown, with estimates spanning several orders of magnitudes [2].While tumor cells can contain a large number of mutations, it is not clear whether this is due to an elevated mutation rate or frequent cell death and birth. There are arguments for both cases [3-7], but distinguishing between these two alternatives is difficult because there are no known methods for direct measurements. As a result, estimating the mutation rate per cell division in human tumors must rely on assumptions about the duration of the cell cycle, the growth rate of the tumor and the total mutational burden of the tumor [8-10].In previous works, Williams et al. [2] and Bozic et al. [3] showed that the combined effect of the mutation rate and the death rate can be estimated from the frequencies of neutral mutations that is readily available from bulk sequence data. Distinguishing excess mutations and increased cell death, however, still required external information or assumptions. In follow-up work, Williams et al. [11, 12] showed that separating these two quantities can be achieved by the bulk sequencing of multiple spatially disjunct samples from the same tumor, thus, resolving a coarse grained cell lineage tree. This approach, however, is inherently limited by the number of samples in its ability to resolve the cell lineage tree and, as a result, in its ability to distinguish excess mutations from increased cell death.In general, tumor phylogenies represent the evolutionary history of its subclones and can be used to test different hypotheses about tumor evolution. However, the specific features of cancer data pose challenges to the direct application of classical phylogenetic models. In particular, bulk sequencing data contain an unknown number of novel cancer genomes, while classical phylogenetic approaches assume that taxa are known a priori [13]. Tree deconvolution methods, for instance, attempt to solve this problem by combining phylogenetic inference with a deconvolution step, in which clonal subpopulations from mixed genomic samples are separated prior to or concurrent with inferring phylogenetic relationships between those subpopulations [13-17].Our approach also models mutation accumulation along a tumor phylogeny, a cell lineage tree arising from a birth-death process described by the death-to-birth ratio of cells (characteristic of cell turnover). It does not, however, attempt to infer a tree. Rather, it attempts to approximate the approach conceptualized by Felsenstein, wherein the likelihood of a genetic data set is assessed by considering all possible genealogical histories of the data, each in proportion to its probability [18, 19]. As a result, in contrast to methods such as tree deconvolution and other clone tree methods [15, 16] that attempt to infer complete or partial tumor phylogenies, no single tree is inferred. Instead, the parameters of an explicit probabilistic model of both mutations and the tree along which they occur are estimated by approximating the average over all possible trees.Below we demonstrate, as a proof-of-concept, that approximating the average over all possible trees can differentiate between a wide range of death rates (even very close to the birth rate), and accurately estimate the mutation rates spanning a range of several orders of magnitude, based on a single tumor-normal sample pair. After introducing the underlying probabilistic model and the parameter estimation procedure we assess its accuracy on simulated data and present results on empirical data.
Model
We describe the evolution of tumor cells in terms of a cell lineage tree, i.e., the bifurcating tree traced out by cell divisions. As cells that have died cannot be observed we consider the tree spanned by surviving cells. The leaves of this tree correspond to extant cells and the internal nodes to cell divisions. To model the descent of sampled cells, we consider a birth-death process conditioned to have a fixed number of observed lineages (in our case sequenced cells) [20, 21]. This process is parameterized by the birth rate (which is identical to the cell division rate and is also referred to as the cell turnover rate) b, death rate d, and number of sampled cells n. We measure branch lengths in number cell divisions, i.e., birth events. Consequently, the birth rate b can be considered as a scaling constant that sets the unit of time. Changing the unit of time—an arbitrary decision of ours—should not change the shape of the tree. Therefore, rescaling the rates should have no effect. Thus, the relevant parameter that determines the structure of the cell lineage tree is the death-to-birth ratio δ = d/b. We only consider exponentially growing cell populations (i.e., δ < 1), the growth rates of which in units of cell birth are (b − d)/b = 1 − δ. Mutations occur at a rate μ per site per cell division. They are considered neutral and we neglect the probability that a site is hit by mutation more than once.Throughout the paper, the following notation is used: Branches of the cell lineage tree are denoted by the index k, the length of branch k is denoted by l and L = ∑
l denotes the sum of all branch lengths in the tree.When simulating data we first sample random cell lineage trees with continuous-time branch lengths (measured in units of one over the birth rate) from a conditioned birth-death process parameterized by the growth rate 1 − δ and the number of surviving lineages n using the point process approach described in [20]. Mutations are subsequently simulated along branches of this tree: For each branch k the number of mutations is drawn independently from a Poisson distribution with parameter corresponding to the product of the branch length l and the mutation rate μ, and appears in n ⋅ f cells of the final population, where f is the fraction of cells that descend from branch k (cf. Fig 1A) in the final population.
Fig 1
Distinguishing mutation rate and cell death intensity based on variant allele frequencies.
Two possible scenarios for the generation of mutations along cell lineage trees. A) Different death-to-birth ratios lead to different lineage tree shapes. Bifurcations are cell divisions, leaves are cells comprising the bulk sequencing sample. Note that the (surviving) tree topologies are the same, only branch lengths differ. B) Mutations, symbolized by purple stars, accumulate at cell divisions. High death-to-birth ratio and low mutation rate can lead to the same number of observed mutations as low death-to-birth ratio and high mutation rate, however, the mutation spectrum of the two trees are different. C) For simulated trees of 104 leaves, the differences in the branch length distribution are clearly visible. D) VAF spectra for death-to-birth ratios of 0 (left panel) and 0.999999 (right panel). See also Fig A in S1 Text, which shows the effect of varying the mutation rate on the VAF spectra. The mutation rate was set to μ = 1. Fractions of mutant cells are binned (note the logarithmic scale). Ploidy is set to two, contamination is zero. Simulated sequencing depth is 1000. The VAF spectra are based on the trees used to generate subplot C).
Distinguishing mutation rate and cell death intensity based on variant allele frequencies.
Two possible scenarios for the generation of mutations along cell lineage trees. A) Different death-to-birth ratios lead to different lineage tree shapes. Bifurcations are cell divisions, leaves are cells comprising the bulk sequencing sample. Note that the (surviving) tree topologies are the same, only branch lengths differ. B) Mutations, symbolized by purple stars, accumulate at cell divisions. High death-to-birth ratio and low mutation rate can lead to the same number of observed mutations as low death-to-birth ratio and high mutation rate, however, the mutation spectrum of the two trees are different. C) For simulated trees of 104 leaves, the differences in the branch length distribution are clearly visible. D) VAF spectra for death-to-birth ratios of 0 (left panel) and 0.999999 (right panel). See also Fig A in S1 Text, which shows the effect of varying the mutation rate on the VAF spectra. The mutation rate was set to μ = 1. Fractions of mutant cells are binned (note the logarithmic scale). Ploidy is set to two, contamination is zero. Simulated sequencing depth is 1000. The VAF spectra are based on the trees used to generate subplot C).The data available from bulk sequencing are the mutant and wild type read counts at each site. To illustrate how such data can be used to separate the effects of the mutation rate and cell death intensity, consider the variant allele frequency (VAF) spectrum, which can be obtained from read count data. As shown in (Fig 1A and 1B) mutation frequencies in the population reflect the branch length distribution of the tumor’s cell lineage tree, the leaves of which correspond to the population of cells at the time the sample is taken, and its root is the most recent common ancestor of these cells. Mutation frequencies in the population, however, are not directly observable. What is observed is a random sample of mutant and wild-type read counts per site. The ratio of the observed mutant and wild-type read counts (also called the variant allele frequency) on average corresponds to the fraction of mutant cells in the population, and the total number of reads is determined by sequencing depth.Changing the death-to-birth ratio modifies the shape of the cell lineage tree by changing the relative lengths of branches closer to and further away from the root and (Fig 1A), as a result, modifies mutation frequencies in the population (Fig 1B and 1C) and the observed variant allele frequencies (i.e. changes the shape of the variant allele frequency spectrum or VAF, Fig 1D). Changing the mutation rate, on the other hand, does not have any effect on the branch length distribution as it changes the expected number of mutations uniformly along the tree. Thus, changing the mutation rate leaves the relative frequencies of mutations in the population on average unchanged and, aside from the random sampling of reads, only changes the VAF by on average a multiplicative scaling factor (cf. Fig 1C and 1D and Fig A in S1 Text).It should be noted, however, that sequencing data contains more information than the VAF spectrum. For each site the mutant and total read counts, and not only their ratio is known (e.g., 10 mutant reads out of 30 total reads have different statistical significance from 1 out of 3, even though they both correspond to a mutation frequency of 1/3).To compare different combinations of mutation rates and death-to-birth ratios describing the observed empirical data we employ a likelihood based approach. First, we derive the likelihood of the observed data D, in our case the distribution of variant read counts, as a function of the per site mutation rate μ and the death-to-birth ratio δ. As described below we maximize a score based on the likelihood function using samples of cell lineage trees drawn according to their probability corresponding to the conditioned birth-death process, in order to estimate the parameters that are most likely to have generated the observed data.First we derive the likelihood of the observed data for a fixed cell lineage tree . Similar to other phylogenetic methods we assume that sites collect mutations independently of each other. This assumption is consistent with our assumption of neutral mutations, but is in general a simplification that is motivated by computational tractability. Consequently, takes the form
where the product runs over all sites, m is the number of reads exhibiting a mutation at site i, and r is the total number of reads covering site i. To calculate the probability of observing m mutant reads out of a total of r reads we consider the following two alternatives: First, if m = 0, then either a mutation occurred with probability μ ⋅ l on some branch k with length l, but no mutant read was observed out of the r reads, or with probability 1 − μ ⋅ ∑
l = 1 − μ ⋅ L no mutation occurred on any of the branches. Second, if m > 0, then a mutation occurred on some branch k with length l with probability μ ⋅ l and m mutant reads were observed out of r. Thus,
where Binom(m, r, f) is the binomial distribution of m successes out of r independent Bernoulli trials with success probability f, and f denotes the fraction of sequenced leaf cells that descend from branch k. Multiple mutations at the same site are neglected, which is an appropriate approximation if μ ⋅ L ≪ 1. In all of our applications we verified that this condition is satisfied.To take into consideration sequencing errors, we must consider that they lead to an excess of spurious mutant reads that are in fact the result of sequencing error. We do this by introducing the probability ε of a sequencing error per site per read. In the presence of sequencing error (i.e., ε > 0) a read can be i) an apparent mutant read caused by an actual mutation (a true mutant read) or a sequencing error (a false mutant read), alternatively it can be ii) an apparent wild type read due to no mutation (a true wild type read) or because a mutation was reverted by a sequencing error (false wild type read):
Note that this equation incorporates both the m = 0 and m > 0 cases.So far, we have assumed that each site can have 2 states, wild type or mutant, corresponding to a DNA consisting of only two types of nucleotides, rather than four. It is, however, straightforward to introduce multiple mutant types. The likelihood function for 3 possible mutant types relevant for DNA sequences that we use for inference where indicated, can be found in the Materials and Methods section.
Parameter inference
A conceptually straightforward approach for treating the unknown cell lineage tree as a nuisance parameter is to average over all trees [18] according to their probability:
where is the probability of the cell lineage tree conditioned on a birth-death process with death-to-birth ratio δ. Due to the very large number of possible trees the above average is intractable and because of the inherent lack of resolution resulting from bulk sequencing data a Markov Chain Monte Carlo sampler is impractical. To overcome these issues, here we resort to sampling a finite number of trees from the conditioned birth-death process with fixed δ. However, as we can sample only a small fraction of trees, the estimated likelihood is typically dominated by a single tree, hence, the estimation becomes sensitive to the particular realization of the sample set. Using synthetic data (see below), however, we demonstrate that maximizing the average of the log-likelihood
where is the number of sampled trees, allows accurate and robust inference that is less sensitive to sampling noise.The mutation rate (defined as the number of mutations divided by the number of sites and the total branch length of the tree) is estimated directly from the data, in order to speed up the inference. Since some mutations are expected to be sequencing errors, we only count mutations of significant read count. The threshold number of mutant read counts is set in a sequencing depth dependent manner such that the expected number of sites with mutant reads that result from sequencing error is less than one for the entire dataset by choosing:
where r is the number of reads (i.e., the local sequencing depth) of which m are mutant reads and nsites is the total number of sites in the data. The expected total branch length of corresponding to the significant mutant read counts can be estimated as
where n is the number of sites with r reads and nsites is the genome length (number of all the sites). Using the above formula, the mutation rate can be estimated as:
where the M(ε) = ∑ 1 is the number of sites that have at least a threshold number of mutant reads in the data.
Results on synthetic data
We validated the parameter estimation method on simulated data generated according to the model described above. The simulation procedure is described in the Materials and methods.
No sequencing errors
Fig 2 shows how the estimated death-to-birth ratios and mutation rates compare to the true death-to-birth ratios and mutation rates. The method can reasonably differentiate between datasets with different death-to-birth ratios and mutation rates, and estimate their values.
Fig 2
True vs. estimated death-to-birth ratios and mutation rates.
In subplot A), the growth rate 1 − δ (in units of the birth rate) is shown for better visualization. In both A) and B), 10 synthetic datasets were generated for each true value, with an independent cell lineage tree and associated mutations for each replicate. For each replicate 104 trees with 104 leaves were used for fitting (see the Materials and methods for details on calculating the likelihood). Points are slightly dispersed horizontally for clarity. Datasets are the same for A) and B). Horizontal ordering of the data points is the same for both subplots, e.g., the rightmost point in each group of points corresponds to dataset no. 10 in both plots.
True vs. estimated death-to-birth ratios and mutation rates.
In subplot A), the growth rate 1 − δ (in units of the birth rate) is shown for better visualization. In both A) and B), 10 synthetic datasets were generated for each true value, with an independent cell lineage tree and associated mutations for each replicate. For each replicate 104 trees with 104 leaves were used for fitting (see the Materials and methods for details on calculating the likelihood). Points are slightly dispersed horizontally for clarity. Datasets are the same for A) and B). Horizontal ordering of the data points is the same for both subplots, e.g., the rightmost point in each group of points corresponds to dataset no. 10 in both plots.Fig 3 shows the joint estimation of mutation rate and death-to-birth ratio pairs. The simulated data points are grouped into isolines with a constant number of observed mutations
where denotes the average sequencing depth (average read count). The expected value is approximated by averaging over trees generated with death-to-birth ratio δ.
Fig 3
Joint estimates of mutation rate and death-to-birth ratio pairs.
10 synthetic dataset replicates were generated, each based on an independent cell lineage tree and associated mutations. For each true parameter pair replicates are denoted by the same color. True parameter values are indicated by large filled circles. Solid lines show the numerical approximation of parameter pairs for different values of Mobs mut. Data around the middle line are the same as in Fig 2. Note that the death rate decreases along the horizontal axis.
Joint estimates of mutation rate and death-to-birth ratio pairs.
10 synthetic dataset replicates were generated, each based on an independent cell lineage tree and associated mutations. For each true parameter pair replicates are denoted by the same color. True parameter values are indicated by large filled circles. Solid lines show the numerical approximation of parameter pairs for different values of Mobs mut. Data around the middle line are the same as in Fig 2. Note that the death rate decreases along the horizontal axis.Results on simulated data indicate that the accuracy of parameter estimates for different datasets are not uniform. As Fig F in S1 Text shows resolving large death-to-birth ratios requires trees with more leaves, and accuracy is increased when analyzing trees with a sufficiently large number of leaves. Aside of the effect of the size of the trees in case of high death-to-birth ratios, differences between estimates for different datasets can also result from: (i) sampling noise from the trees used in Eq (5), (ii) stochasticity of the simulated data and (iii) the simulated cell lineage tree used to generate the read counts. To differentiate between these possible factors, we chose a dataset with an estimated death-to-birth ratio that markedly deviated from the true value (dataset no. 7 for 1 − δ = 1.0 in Fig 2 with an estimated value of 1 − δ = 0.47). We calculated the estimated death-to-birth ratio values using 10 independent sets of 1000 trees. Estimates were obtained in the range of 1 − δ ∈ [0.39, 0.54]. Therefore, the deviation of the estimate from 1.0 cannot be attributed to the sample of fitting trees. Then, we simulated 10 additional datasets using the same tree as for the original dataset. The estimated death-to-birth ratio values were: 1 − δ ∈ [0.44, 0.49], even more closely matching the original estimate. Consequently, the effect does not depend on the simulated mutations but on the tree along which they were generated. This suggests that the deviation of the estimates from the true death-to-birth ratios primarily reflects the fluctuation of the shapes of the trees used for sample generation. It seems that estimating the distribution of bifurcation times of the generating tree is not an easy task.
Effects of sequencing error
To estimate the effect of sequencing errors we re-estimated death-to-birth ratios while varying the magnitude of sequencing error, but using the same initial read counts as in Fig 2). In this case, the error rate of the data was also estimated by the fitting procedure, along with the mutation rate and the death-to-birth ratios. The range of the error rates is from 10−3, which is cited as the error rate of the Illumina sequencing technology [22], to 10−8, which is what advanced technology can achieve [23]. The influence of sequencing errors on the estimation of the death-to-birth ratios is shown in Fig 4. For an error rate of 10−3 (the case of the Illumina technology), the estimated death-to-birth ratios can have significant deviations from the true values.
Fig 4
The effect of varying the error rate.
Sequencing error rates are shown in the subplots. 104 trees with 104 leaves were used for fitting. Horizontal coordinates are slightly dispersed for clarity. Open circles are results corresponding to error rates fixed to their true values, crosses correspond to error rates estimated by the parameter fit. Each open circle-cross pair corresponding to the same dataset is vertically aligned. Note that the death rate decreases along the horizontal axis.
The effect of varying the error rate.
Sequencing error rates are shown in the subplots. 104 trees with 104 leaves were used for fitting. Horizontal coordinates are slightly dispersed for clarity. Open circles are results corresponding to error rates fixed to their true values, crosses correspond to error rates estimated by the parameter fit. Each open circle-cross pair corresponding to the same dataset is vertically aligned. Note that the death rate decreases along the horizontal axis.
Results on empirical data
To estimate in vivo death-to-birth ratios and mutation rates, a tumor sample is required. Due to the sensitivity of our method to high sequencing error rates, we need a sample which is sequenced using a very low error rate technology. We found a partial whole genome sequnceing sample of a human hepatocellular carcinoma (HCC) [24], which was sequenced using the o2n sequencing technology [23], providing error rates between 10−5 and 10−8, which is significantly lower than the 10−3 rate of the standard Illumina process. Besides the low error rate, the dataset also has very high sequencing depth coverage, the average sequencing depth is 904, over 923383 sites. High sequencing depth allows the identification of more mutations and provides better resolved VAF spectra.After preprocessing the data (see details in the Materials and methods), 2284 mutations were identified. The variant allele frequency spectrum is shown in Fig 5. Trees of 104 leaves were used for estimating the parameters, as DNA was extracted and quantified from 104 tumor cells that were precisely collected by laser capture microdissection (LCM) [24]. Using the procedure described above we obtained: a sequencing error of ε = 10−7, consistent with the reported error rate of between 10−5 and 10−8 [23] for o2n, a death-to-birth ratio of δ = 1 − 1.1 × 10−3, and a mutation rate of μ = 9.3 × 10−8 per site per cell division. Fig 5 also shows the VAF of a synthetic sample, generated using the sample tree that fits best the empirical data. Fig C in S1 Text shows that the mutation rate and the death-to-birth ratio do not depend strongly on the value of the error rate.
Fig 5
Estimating the death-to-birth ratios and mutation rates on empirical data.
A) The solid purple bars show the VAF spectrum of a human hepatocellular carcinoma sample [24], obtained by o2n sequencing [23]. The orange line shows the VAF spectrum of a synthetic sample, generated using the tree having the highest likelihood for the empirical data. The tree has 104 leaves corresponding to the number of sequenced tumor cells. B) counts are shown on a logarithmic scale.
Estimating the death-to-birth ratios and mutation rates on empirical data.
A) The solid purple bars show the VAF spectrum of a human hepatocellular carcinoma sample [24], obtained by o2n sequencing [23]. The orange line shows the VAF spectrum of a synthetic sample, generated using the tree having the highest likelihood for the empirical data. The tree has 104 leaves corresponding to the number of sequenced tumor cells. B) counts are shown on a logarithmic scale.Estimates for the mutation rate of healthy somatic cells can come from fitting mathematical models to the development of certain tumors, or cell line experiments, both methods observing mutations on a single gene [25]. A third method is to count the number of mutations in sequenced cells and estimate the number of corresponding cell divisions [26]. Finally, a mathematical model fitted to multiple samples from the same donor was applied to estimate mutation rates during early development in haematopoietic stem cells and neurons [11]. For all of these studies, estimated mutation rate values are in the range 10−9 to 10−10 per site per cell division [11, 25, 26], lower than our estimation of the hepatocellular carcinoma data.Estimates of the mutation rate in tumors have high variance. In colorectal cancers, μ = 5 × 10−10 per site per cell division [8] was estimated by observing the number of mutations and estimating the corresponding number of cell divisions. In kidney cancer, μ = 2 × 10−9 per site per cell division was estimated [11], using multiple samples from the same tumor and fitting a mathematical model. The same study estimated μ = 5 × 10−8 per site per cell division for lung cancer. An effective mutation rate was estimated using a different mathematical model, in which the mutation and death rates are not separable [2]. Assuming no cell death (δ = 0), for brain and prostate cancers μ = 9 × 10−8 per site per cell division was estimated, along with μ = 8 × 10−7 per site per cell division for lung and bladder cancers. Substituting our estimation of δ = 0.999, these change to μ = 9 × 10−11 and 8 × 10−10 per site per cell division, respectively. Our estimation of μ = 9.3 × 10−8 per site per cell division for HCC is at the higher end of the estimations for the mutation rates in tumors.For the death-to-birth ratio, earlier estimations range within δ ∈ [0.08, 0.97] [11] for colon cancer, using the same methodology as above. In another study [3], δ = 0.72 was estimated for fast-growing colorectal cancer metastases and δ = 0.999 for premalignant colorectal tumors. Upper limits are provided [2] within δ ∈ [0.99, 0.999], assuming μ = 10−9 per site per cell division. Using our estimated mutation rate μ = 9.3 × 10−8 per site per cell division, this changes to δ ∈ [0, 0.88]. Our estimated death-to-birth ratio is similar to the previously estimated δ = 0.999 of premalignant colorectal tumors. Possible causes for such an elevated death rate include the effect of the immune system, the deleterious nature of mutations, or competition for resources among tumor cells. In conclusion, for this tumor sample, the high number of mutations is due to a combination of an elevated mutation rate and a high death-to-birth ratio.The above results allow us to estimate the number of cell division rounds from the founder cell to the biopsied tumor. The average height of the simulated trees with the estimated parameters is 2023 cell divisions. This value might seem counterintuitively low at first, according to the following estimation. A naive estimation of the average tree height would assume that to reach the size of ntumor cells leaves, the tree should have log2
ntumor cells branches between the root and a leaf. The average length of such a branch should be 1/(1 − δ) = 1/(1.1 × 10−3). As ntumor cells = 2.7 × 109 (see the next paragraph), the naive estimation of tree height is log2(2.7 × 109) ⋅ (1/1.1 × 10−3) = 2.8 × 104. This reasoning, however, does not take into account the fact that trees with a death-to-birth ratio close to 1 have very different shapes compared to cell lineage trees resulting from a pure birth process. Consequently, the much lower tree height of 2023 as opposed to 2.8 × 104 can be attributed to the difference between the average tree with birth-to-death ratio 1 − 1.1 × 10−3 and the average tree of a pure birth process stretched out by a factor of 1/(1.1 × 10−3).It is also possible to estimate the lifetime of the HCC sample and the cell division rate of the HCC tumor. The diameter of the tumor is 35 mm, while the length of a HCC cell is 25 μm [24]. This gives a total number of 2.7 × 109 cells in the entire tumor. The median HCC tumor volume doubling time is 86 days [27]. Based on these figures, the lifetime of the analyzed sample is around 7 years, and the cell division rate is estimated to be around 2023/[log2(2.7 × 109) ⋅ 86days] ≈ 0.75 day−1.In the above calculation, the effect of sampling was neglected, which corresponds to cutting out exactly the descendants of one branch in the tumor lineage tree. This can be regarded as an approximation of taking a local sample from a solid tumor, and the estimated parameters refer to that branch of the lineage tree. Random sampling from a perfectly mixed population can be considered as an opposite extreme. In this case, the transformation of the birth rate is known [28]:
where ρ is the sampling ratio and b′ is the birth rate corresponding to the ρ = 1 case. In our case, the size of the tumor is 2.7 × 109 cells, from which 104 cell were sampled. Then the sampling ratio is ρ = 104/(2.7 × 109) = 3.7 × 10−6 and the division rate is b = 2.03 × 105 day−1, which is perhaps biologically less realistic.
Conclusion
In summary, we describe a method to simultaneously estimate the mutation rate and the death-to-birth ratio together with the sequencing error rate, making it possible to answer the question which of the two is responsible for the elevated number of mutations in tumors. In particular, the mutated sites’ read counts, which are closely related to the distribution of branch lengths and the shape of the cell lineage tree, contain useful information about the death-to-birth ratio, even in the presence of a moderate sequencing error rate.Our results on simulated data show that the estimation method can resolve the death-to-birth ratio even if the birth and death rates are close to each other (i.e., 1 − δ ≪ 1) as long as the sequencing error is sufficiently small. Unfortunately, for the error rates of standard Illumina sequencing, the estimation has a significant variance, therefore applying the method to typical samples is impractical. One solution is to use a sequencing technology with much lower error rates, e.g., as in Refs. [29, 30], or even below the 10−6 error rate of the PCR process [22, 23]. Another possibility is to apply noise filtering to standard sequencing data, e.g., deepSNV [31], and modify the error model of the fitting process accordingly.As a proof of concept application we analyzed low error sequencing data from human hepatocellular carcinoma [23]. For both the death-to-birth ratio and the mutation rate we recovered estimates, a mutation rate being 9.3 × 10−8 per site per cell division, and death-to-birth ratio δ = 0.9989, which are higher than expected for most healthy tissues but fall within the range of previous estimates for tumors [2] and are consistent with a high mutation burden in HCC [32].It is important to note that high death-to-birth ratios can produce marked subclonal peaks in the VAF spectrum (Fig 5). This implies that subclonal peaks are not necessarily the consequence of selection. Neutral processes, in particular a high death-to-birth ratio, can also produce such signal.In this work, the death-to-birth ratio was assumed to be constant during the evolutionary process. It is more realistic to assume a death-to-birth ratio which changes during tumor growth [3]. In our case, the estimated strong cell death suggests that the tumor reached a slowly growing phase, in line with a Gompertzian model of tumor growth [33, 34], which is corroborated by the large sizes of observed tumors (diameter ≥ 1cm) used in the doubling time estimation [27]. It is possible that in earlier stages of tumor development, cell death is less frequent and doubling time is shorter. It might be the case that the rate of cell division is constant during tumor growth, and doubling time is set by the death-to-birth ratio, which, in turn, is limited by external factors.The probabilistic model introduced here, together with the associated likelihood calculation and averaging procedure, including the simulation of cell lineage trees, can be extend in a relatively straightforward manner to consider the death-to-birth ratio as well as the mutation rate to be time or lineage dependent or both. Our current model has the potential to act as a null model compared to which models with increased complexity can be compared.Lineage specific changes in the mutation rate or the death-to-birth ratio during the course of tumor evolution will both result in subtrees of the cell lineage tree that are described by different rates and descend from the cell in which the mutation rate or turnover rate changes. The two changes can be expected to have markedly different effects: a change in the mutation rate will from the perspective of the rest of the mutations observed act to “rescale” the branches of the subtree, while a change in the death-to-birth ratio will change the shape of the subtree together with the expected number of descendants surviving until the present. A decrease in the death-to-birth ratio (corresponding to an increase in 1 − δ, i.e., a positively selected “driver” mutation) is expected to have the most substantial effect on the mutations eventually observed during tumor sequencing. As show in Fig G in S1 Text, despite a substantial lineage specific increase, the mutation rate is recovered relatively accurately. A systematic increase is, however, apparent in the inferred values of both the mutation rate and the death-to-birth ratio compared to “wild type” values resulting from the subtree with an increased death-to-birth ratio. This highlights the limitations [35] of the assumption of only neutral mutations, which our model shares with Williams et al. [2] and which is incompatible with recent empirical results on patterns of selection in cancer and somatic tissues [36].Finnally, our model ignores any reminants of underlying tissue architecture and stem cell renewal mode that might impact the dyanmics of cell proloferation in a tumour. In heathly tissues it is clear that these effects play an imporant role, for instance, the organization of intestinal epithelial stem cells into crypts imposes constraints on clonal expansion resulting in elevated competition between stem cells belonging to the same crypt [37].To go beyond a proof of concept application, aside of introducing “driver” mutations, several other important developments are required: At present we assume a simple uniform mutational process that is independent of genomic context and other mutational processes that shape the genomes of the clones comprising a tumor. In particular, taking into account mutational signatures [38] that incorporate a tumor’s evolutionary context [17] based on the cell lineage tree offer the potential to increase both the realism and accuracy of our method. In addition, currently the method uses a well mixed population model for tumor growth and assumes uniform sampling. In the future, a more realistic growth model that includes spatial effects [39, 40] would enhance the applicability of the method opening up the possibility to use data from spatially resolved sampling of tissues, in which the measured mutation frequencies intertwine the correlated ancestry of sampled cells with the prevalence of the mutations.
Materials and methods
3 mutant types
For 3 possible mutant types, relevant for DNA sequences, instead of the mutant read count m, we introduce three mutant read counts, corresponding to the three possible mutant types, m(1), m(2), m(3). Consequently, the input data need to contain three mutant read counts rather than one for each site. This leads to the use of a multinomial distribution, with four states: wild type and 3 mutant types. The possibility of more than one real mutation at the same site is still neglected, because it is very rare, technically a second-order process in the mutation probability of a site. We also neglect the probability of more than one error hitting the same site. The likelihood function at a single site is then
and
where (j), (j + 1), and (j + 2) denote the three possible mutant types with cyclic notation (j) = (j + 3), and Mult(vector of numbers of outcomes; number of trials; vector of probabilities of outcomes) is the multinomial distribution. The factor of 1/3 is due to the fact the if a mutation occurs at a site than the probability of each of the three possibilities is 1/3. In the multinomial distribution the first outcome corresponds to the true mutant type, the second and third to the false mutant types, and the forth to the wild type.
Generating trees
Cell lineage trees are simulated by our own implementation [41] of the point process described in [20], which is more stable for death-to-birth ratios close to unity then the widely used TreeSim implementation [21].
Generating synthetic samples
Simulated data consists of mutant and wild type read counts based on mutations generated along a cell lineage tree obtained using the ELynx suite.Mutations are generated as follows. For each site, each branch k is checked for contributing a mutation with probability . The first branch providing a hit is selected.The total number of reads of the current site is drawn from a pre-specified distribution.The number of mutant reads is drawn form a hypergeometric distribution, corresponding to the branch frequency f. Sequencing errors, if required, are introduced using a multinomial distribution, with probabilities ε/3, ε/3, ε/3, 1 − ε corresponding to the 3 possible false types and the true type.The mutation rate for different death-to-birth ratios is chosen such that the total number of observed real mutations should remain close to each other, i.e., the estimation algorithm should have a similar amount of input data.Our implementation of the synthetic data generation method is available at https://github.com/tg433/evolgenom.
Calculating the likelihood
We calculate the average of the log-likelihood, Eq (5), on a fixed grid of death-to-birth ratios, where for each point we generate a sample of a given set of simulated trees using the ELynx suite [41]. At each grid point the error rate is optimized either using Brent’s method [42] implemented in Julia’s Optim package [43] or calculating the log-likelihood on a grid of error rate values to speed up the calculation. The mutation rate is estimated in both cases according to Eq (8). The final death-to-birth ratio is then estimated by interpolation using cubic splines and the final estimate for the mutation rate is based on this value.Our implementation of the parameter estimation method is available at https://github.com/tg433/evolgenom.
Preprocessing of the empirical data
The raw sequencing data was preprocessed according to [23], using the code provided by the authors. The DNA contents of 104 cells were sequenced [24], along with a sample of neighboring normal tissue. Mutations were called using VarScan 2 [44], which is flexible and easy to adapt to the requirements of the fitting procedure. The original study [23] targeted 410k sites, however, the raw data covers well over 108 sites. We applied the mapping quality and base quality thresholds of [23] to the data, and also left the minimum sequencing depth requirement of VarScan 2 at its default setting (8), both in the tumor and the normal samples. 923383 sites remained. Among these sites, the distribution of sequencing depths is wide, ranging from 8 to 10853, with a mean of 904. For mutation calling by VarScan 2, the minimum number of mutant reads was set to 1 and the strand filter switched off. Although the number of false positives increases with these parameter choices, the resulting called mutations correspond better to the error model of our fitting procedure than an error rate which changes sharply with threshold frequency or read count values. The minimum variant frequency was set to 10−6 to include even the least frequent mutation. Purity was set to 0.85, in accordance with [24]. We also checked that the default somatic p-value threshold does not exclude any candidate somatic mutations. Other parameters were left at their default values. Mutation frequencies were corrected for i) copy number variation (CNV), using VarScan 2 with default parameters, and ii) for ploidy of the sex chromosomes. CNV detection for targeted sequencing data is a more difficult task than for whole genome data, and VarScan 2 was found to be a stable performer [45]. We followed the CNV detection steps suggested by the VarScan manual. We used VarScan to obtain raw copy numbers, using the copynumber command. The tumor-to-normal ratio was set to 1.091 for this run. Adjustment for GC content and filtering for minimum region size was done using VarScan’s copycaller, with default parameters. Finally, copy number data were smoothed and segmented using the DNAcopy circular binary segmentation algorithm in R. Sites having multiple variant types (i.e., number of reads of wild type plus most frequent mutant type being lower than the sequencing depth) were checked manually. Read counts of all 4 possible genotypes were identified for all variant sites.
Supplementary information for distinguishing excess mutations and increased cell death based on variant allele frequencies.
Fig A: VAF spectra for varying mutation rates and death-to-birth ratios. Branch length distributions for A) low death rate and B) high death rate and corresponding VAFs, respectively C) and D), with the mutation rate varying over two orders of magnitude. Fig B: Loglikelihood-error rate curve of the empirical data. The right panel shows the peak with a narrow loglikelihood range. Fig C: Death-to-birth ratio-error rate and mutation rate-error rate curves of the empirical data. The loglikelihood curve is shown in light gray. Fig D: Loglikelihood-death-to-birth ratio and loglikelihood-mutation rate curves of the HCC data. Interpolation between data points is by cubic splines. Green line highlights the maximum. Fig E: The effect of tree sizes on the estimations. Estimated death-to-birth ratios for fitted trees with 100 (top left), 1000 (top right), 10000 (bottom left), and 100000 (bottom right) leaves. Sizes of the sample generating trees are the same as those of the fitting trees. Fig F: The effect of the error rate on the estimated mutation rate. Sequencing error rates are ε = 10−3 (top left), 10−4 (top right), 10−5 (middle left), 10−6 (middle right), 10−7 (bottom left), 10−8 (bottom right). 104 trees with 104 leaves were used for fitting. Horizontal coordinates are slightly dispersed for clarity. Open circles are results corresponding to error rates fixed to their true values, crosses correspond to error rates estimated by the parameter fit. Each open circle-cross pair corresponding to the same dataset is vertically aligned. Fig G: The effect of the error rate on the estimated mutation rate. In 500 replicate experiments we introduced a “driver” mutation that increases 1 − δ by 30%. The large purple point indicates the parameter values used to simulate the data, gray circles are inferences for data without driver mutations from Fig 3, and purple datapoints are inferences for data including drivers. A small, but systematic increase is apparent in the inferred values of both the mutation rate and the death-to-birth ratio compared to “wild type” values resulting from the subtree with an increased death-to-birth ratio driver lineage.(PDF)Click here for additional data file.8 Dec 2021Dear Tibély,Thank you very much for submitting your manuscript "Distinguishing excess mutations and increased cell death based on variant allele frequencies" for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Teresa M. PrzytyckaAssociate EditorPLOS Computational BiologyDouglas LauffenburgerDeputy EditorPLOS Computational Biology***********************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: This paper seeks to disentangle the processes of new mutations due to a higher mutation rate or due to higher cell turnover rate. This work assumes a fixed mutation rate and a fixed cell turnover rate during the progression of a tumor, and seeks to estimate these two rates given variant and total read counts of mutations in bulk DNA sequencing data. The authors validate their method on simulated data, finding that while their method is sensitive to per-base sequencing errors, in regimes of low sequencing error the method provides accurate estimates of ground-truth mutation and cell turnover rates. Finally, the paper considers an HCC sample sequenced using a technology with low error, and discusses the estimated rates in light of related findings. While I find the work interesting, I have several comments and questions.1. How do cell lineage trees relate to trees inferred using tree deconvolution methods?Many clone tree inference methods have been proposed for reconstructing clone trees from bulk DNA sequencing data of tumors. See for example [1,2]. How do these trees differ from the cell lineage trees that you sample in your method? Is there a way to make use of these clone trees in your method?2. Copy-number aberrationsVAFs are affected by copy-number aberrations (CNAs). I suggest using cancer cell fractions instead. Also, I would appreciate more details on how VAFs were corrected for CNAs for the HCC data.3. Justify/discuss assumptionsThere are two key assumptions: (1) fixed mutation rate and (2) fixed turnover rate. While the latter assumption is discussed in the Discussion, I would have appreciated simulation experiments that assess how your method would perform if this assumption was violated? Would you still be able to accurately infer the mutation rate?Moreover, I would like to see a discussion on the assumption of a fixed mutation rate. It has been shown that the mutation rate can change during the evolution of a tumor, due to for instance mutations in DNA mismatch repair mechanisms. Or more generally, due to exposures to mutational signatures. For an example of the former, please see Ref [3].4. More discussion on subclonal mutations clusters.I found this point extremely interesting. I would appreciate more simulation experiments to further investigate this point. Also, please discuss this in light of the findings in the MOBSTER paper [4].5. Extension to support multiple samplesIt would be good if the method could support multiple bulk samples from the same tumor.6. Software and documentation- Please include example input files.- Add comments to your code/functions.- It looks like your code takes trees as input rather than generating them. Please describe how trees can be generated.- To facilitate reproducibility of the experimental results, please include simulated data and preferably real data as well.Minor comments:- turnover or turn over?- page 3: it'd be good to justify the assumption of independence among sites.- Good to provide the name of the method in the paper.- What is the correspondence between the number of mutations and number of leaves in the trees that you sample? More generally, how do you choose the number of leaves?References[1] El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics. 2015;31(12):i62-i70. doi:10.1093/bioinformatics/btv261[2] Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015;16(1):35. doi:10.1186/s13059-015-0602-8[3] Christensen S, Leiserson MDM, El-Kebir M. PhySigs: Phylogenetic Inference of Mutational Signature Dynamics. In: Biocomputing 2020. WORLD SCIENTIFIC; 2019:226-237. doi:10.1142/9789811215636_0021[4] Caravagna G, Heide T, Williams MJ, et al. Subclonal reconstruction of tumors by using machine learning and population genetics. Nat Genet. 2020;52(9):898-907. doi:10.1038/s41588-020-0675-5Reviewer #2: In this manuscript Tibély et al. propose a likelihood-based approach to estimate the mutation rate and cell renewal rate history of a tumor from DNA sequencing data from a single bulk sample of the tumor and an adjacent normal sample. The methodology used is sounding, even though the applicability and interest for a wide biological community is questionable and possibly limited due to the strong theoretical character and the various (unavoidable) strong assumptions implied. The method is applied to an experimental dataset, but there is a lack of tools for cross-validation of the inferred parameters with real estimates (even if the method is appropriate for synthetic datasets generated under the same methodological assumptions, which is not striking per se). The mathematical approach is clean but the text suffers from a too technical style that sometimes is more proper of a Technical Note than a Main manuscript. Altogether, in my view the authors should pay most attention on the narrative to better introduce the background and succinctly explain the model’s novelty on such a non-new question to be able to convince on its far-reaching biological implications.MAIN POINTS- The Introduction feels short and too simplistic given the level of concretion of the subsequent approach and it should be more pedagogical to acknowledge the multiple facets and limitations involved in the inference of evolutionary parameters. First, Williams et al. work is referred, but since those authors addressed a similar problem in their original publication and even refined their subsequent analysis based on multiple-sampling data, it is unclear what it was inappropriate in their methodology, or, in other words, what is genuine in this work’s approach that makes it that more successful at discerning mutation rate and growth-death ratio even when considering just from bulk sequencing data. Is it just the likelihood approach vs. machine learning methods? Is it the fact of simulating and fitting the whole VAF distribution rather than its global scaling properties derived from Luria-Delbrück model? Secondly, the method shows important limitations (and this is common to Williams et al. work): it assumes a net exponential tumor growth, which is not necessarily realistic, and purely neutral subclonal dynamics. Other authors have shown the prevalence of positive Darwinian selection across tumor types (Martincorena et al, 2017, Cell), and even if this is not incompatible with an excess of neutral passenger mutations, it would affect the subclonal dynamics and the tree shape, consequently affecting inferences from the VAF distribution, as has been largely debated (e.g. Tarabichi et al. 2018 Nat. Genet, as a criticism to Williams et al. 2016). These limitations cannot be ignored and have to be highlighted or authors may want to consider how much inferences would depart from the given estimated values when these assumptions get relaxed. One could draw the impression that estimates can result orders of magnitude deviated if basic evolutionary assumptions are inappropriate. For the same reason, I would avoid overclaims in the Discussion.- The model definition is not sufficiently clear or precise. Authors use a lineage tree representation but should explain what methodology they use to generate the simulated trees, i.e. T(delta). Is it by Markov-chain Monte-Carlo? Continuous or discrete time? What is the initialization condition and end time? What are the random events? From the Methods it seems clear that tree topologies and mutagenesis are simulated independent from one another but a higher level of concretion in the main text would be helpful. In particular, there are two major confusing points:- (i) Whether mutagenesis events are circumscribed to the cell division instant (i.e. internal nodes) or are simulated too all along branches. The text sometimes gives the impression of the former assumption, but at the same time the division rate is taken as unit of time. This is relevant insofar as in one case both parameters are intermingled and mu changes as a result of changes in delta, while in the other they represent independent measures (delta would impact the mutation burden just indirectly, by modulating the total no. of tumor branches, but would not affect its density) and thus parameters might be more easily resolvable? If mu is truly independent it should not depend on b.- (ii) Related with the previous, how empirical VAF distributions (that are indicative of tumor subclonal composition) can be directly contrasted with a tree whose branches are said to represent individual cells rather than subclones of multiple cells. I would assume a representation in terms of a subclonal tree where branches represent populations of clonal cells would be more readily contrastable with experimental VAF data. It is unclear how the fitting procedure is done if branches are restricted to individual cells and not clades. As a suggestion, I think it would help to improve the explanation on Fig. 1. This shows site frequency spectra but these are difficult to relate with trees on the left as there is no sketch drawing read sites along the branches. And the paragraph describing this connection between site frequency spectrum and tree topology is confusing. On the same line, Fig. 1 would benefit much if it shows trees generated under four different scenarios (low mu-low delta, high mu-low delta, low mu-high delta, and high mu-high delta) or at least under scenarios where just one or the other parameter is changed but not both at the same time like is done here. That would better illustrate how the tree shape and mutation burden change with the two parameters of interest. Altogether, the model section is vague and contrasts with that of the parameter estimation procedure which is much more detailed (Indeed, the parameter estimation section is already initiated by the second last paragraph of pg. 3 even if the title comes later).MINOR POINTS- I in part differ from the diagnosis that the main limitation for evolutionary inference is that bulk sequencing does not resolve the genotypes of individual cells. I assume it is not as much a problem of spatial resolution as it is of temporal resolution, as it is limited to a snapshot. I think exploring models accounting for sampling bulk data at various serial time points is a suitable direction for future work. This is poorly explored. On the contrary, I am not persuaded of what the benefit of sampling individual cell genotypes is vs. sampling subclones in regard to tree reconstruction and evolutionary inferences. This is perhaps a misinterpretation of what individual cell trees in this work offer vs. subclone-based approaches in Williams et al, or they are essentially similarly informative.- In the Abstract the sentence referring to the “orders of magnitude” difference between mutation burden in tumors and healthy tissues should be toned down. It is certainly the case in liver but not necessarily that dramatic in other tissues (authors may want to refer to recent literature by P. Campbell group and colleagues). In any case it deviates the attention from the main purpose of the manuscript that is not on this comparison. This difference in mutation rate is perhaps too much overstated too in pg. 10 given that methods to estimate mutation rate in healthy tissues are so relaxed and heterogeneous. Yet, authors should consider that most reliable method to infer mutation rate in healthy tissues is by fitting data from individuals of different age, and for liver in particular, they can refer to the estimate in Brunner et al (2019) Nature. Here again it is patent that the particular election to normalize mutation rate by cell division is inconvenient as it heavily relies on that estimate, while Brunner et al. present a mutation rate as an independent parameter with dimension site-1 real time-1.- Fig. 3 is very nice, even if it is limited to some particular true mu:(1-delta) pairs. I would suggest an accompanying figure (e.g. a heatmap), but not necessarily a comprehensive one, showing how uncertainty in parameter estimates changes across a grid search on different true values of mu and (1-delta) extending to other regions of the grid. Do changes in mu affect uncertainty for the same value of delta? It is certainly interesting how uncertainty increases when the death rate approaches the growth rate. The authors point to the fluctuation on the shapes of the trees used for sample generation. I think this deserves a brief reflection. If I am right, I understand that the difficulty in capturing the distribution of bifurcation times with limited leave sized trees in cases where d is similar to b arises from the broad distribution of possible outcomes when fates are balanced in stochastic branching birth-death processes (Bailey, 1964, The elements of stochastic processes with application to the natural sciences). Similar problems arise in healthy tissues (Snippert et al, 2010, Cell; Piedrafita et al, 2020, Nat Commun). Indeed, when commenting on possible reasons for tumors to show elevated values of delta, the underlying tissue architecture and stem cell renewal mode might play important roles too determining the tumor death-to-birth ratios. These factors can be reflected too in the discussion in pg. 11. For instance, the organization of intestinal epithelial stem cells into crypts imposes constraints on clonal expansion between crypts while there is intrinsically elevated competition forces between stem cells belonging to the same crypt (Snippert et al, 2010, Cell).- The considerations when discussing the cell division rate of the HCC tumor are pertinent but the authors neglect other major factors, such as the significant fraction of the HCC volume occupied by cirrhotic tissue and stroma, or the heterogeneous dynamics between tumor cells depending on their genotype (Darwinian selection) and/or the microenvironment that can greatly affect those estimates. They should tone down the argument that “division rate is realistic, suggesting that the approximation is adequate” – they should do so too when arguing that the fact that the value falls unrealistic when considering random sampling from a large specimen suggests that the sampling ratio may be close to 1.- I would suggest to change the title for: “Distinguishing excess mutations and increased cell renewal based on tumoral variant allele frequencies”, which is more understandable.- Regarding the mathematical formulation, the two conditions for m reflected in Eq. 6: is it the intersection between the two sets that is considered? In Eq. 7, better to include a parenthesis for readability, to show that second summation term is nested to the first. Both in Eq. 6 and 7 subindexes i might be pertinent for r_i, m_i, m’_i and m_th,i.- As a suggestion, variable names m_th, m’, M_obs mut can be explicitly mentioned in the text right before their first use in the respective equations.- Eq. 7 could benefit from a little explanation in the accompanying text saying that it is a way of weighting the influence of coverage on the discovery of any actual given branch length from just significant mutant read counts, which should thus be < or = to Sum_k l_k. In this sense, I wonder how much this consideration affects the estimates in practice? Can some values for cases where mu is estimated ignoring the significance criterion be represented in Fig. 3? Similarly, in the same paragraph, authors may want to explain that Msig is similar to the total number of accumulated true mutations.- The definition and extent of a “synthetic dataset” is unprecise. I infer that when “10 synthetic datasets were generated for each true value” it means that 10 query trees were generated for the same given point in the delta grid search, and then “10^4 trees with 10^4 leaves were used for fitting” it means per each dataset?- Similarly, mu is said to be estimated directly from the data, but since mu_est requires L_sig computation and this depends on T(delta), one infers that it cannot be uncoupled from the delta inference; in other words it is not estimated beforehand but in conjunction with the likelihood for delta, for being a related quantity. Is this correct? This reflection would clarify the procedure.- What is the criterion when selecting a particular number of simulated trees (10^4)? Is this based on empirical observation, e.g. an optimality criterion like the value for which the error on the estimate gets below a certain threshold or similar? Regarding the choice of 10^4 leaves, it seems to fulfill the number of cells in the empirical HCC dataset, but I wonder if fitting bulk data from a larger tumor sample would require more tree leaves or not necessarily insofar as leaves are representative of the overall subclonal heterogeneity? i.e. Can we assume one branch = one subclone in the formalism used?- Confidence intervals on MLE values should be provided for the inferred parameters, at least for the empirical dataset. From pg. 14 it is unclear whether a confidence interval is implemented or not but authors could resort to a Likelihood Ratio test or similar in the light of supplementary figures.- I am just curious to know why not defining delta as a birth-death ratio instead of death-birth ratio. In any case, the election to represent one minus death-to-birth ratio is already sufficiently twisted so that figures might benefit from indicating the extremes corresponding to 0 death and max. death.- When introducing the likelihood model, it would be useful to state what exactly “observed data D” is. Indeed, is “site frequency spectrum” equivalent to “distribution of variant read counts”? I find less confusing the latter term.- Authors may acknowledge in pg. 9 if empirical data corresponds with WGS or WES, and whether it involved amplification that could impair the interpretation of the VAF distribution.TYPOS AND REWORDING- multiple sites where the term “linage” is used and it should read “lineage”- (pg. 2) “mutations from individual cells are intermixed” -> rephrase- (pg. 2) “with the cell linage tree” -> “in terms of a cell lineage tree”- (pg. 3) “descendance of the extant cells” -> “descendance of any given cell”- (pg. 9) Perhaps one could simply point to (ii) the level of randomness on tree structure and (iii) the randomness on mutation occurrence as main sources of noise apart from the limited size of trees (i)?- (pg. 12) “are higher than expected for healthy tissues” -> not true in the case of delta. Rephrase.- (pg. 14) “from a prescribed distribution” -> an obscure term… A Poisson?- (pg. 14) “we generate a sample tree” -> “we generate a sample of a given set of simulated trees”**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: No: See review, simulated data missing.Reviewer #2: No: The authors have made (well annotated) code available to generate synthetic data and compute estimates. They could just perhaps include one or two examples in the repository that summarize results in some of the main Fig.**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: Yes: Gabriel PiedrafitaFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols6 Feb 2022Submitted filename: PLOS CB response.pdfClick here for additional data file.22 Mar 2022Dear Dr Szöllősi,We are pleased to inform you that your manuscript 'Distinguishing excess mutations and increased cell death based on variant allele frequencies' has been provisionally accepted for publication in PLOS Computational Biology.Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.Best regards,Teresa M. PrzytyckaAssociate EditorPLOS Computational BiologyDouglas LauffenburgerDeputy EditorPLOS Computational Biology***********************************************************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: My previous comments have been satisfactorily addressed.Reviewer #2: The authors have addressed all points raised and have significantly improved the narrative. The details of the model are much clearer and the potential approach limitations/future research directions are well acknowledged in the new version of the Discussion, reason why I consider the current version suitable for publication.Just two very minor points for consideration (for which no round of revision would be needed by my side):- In pg. 4 it is explained how cell lineage trees are simulated with continuous-time branch lengths. I guessed but would be nice to state: Is cell turnover simulated as a Poisson process? i.e. cell division rate drawn from an underlying exponential distriibution?- "Poisson distribution with parameter corresponding to the product" : if "expectancy" parameter is implied, perhaps better specifying so.Typos:- pg. 11: "sequnceing"- pg. 14: "Finnally" and "dyanmics"**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: None**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: Yes: Gabriel Piedrafita8 Apr 2022PCOMPBIOL-D-21-01380R1Distinguishing excess mutations and increased cell death based on variant allele frequenciesDear Dr Szöllősi,I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!With kind regards,Katalin SzaboPLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Authors: Andrea Sottoriva; Haeyoun Kang; Zhicheng Ma; Trevor A Graham; Matthew P Salomon; Junsong Zhao; Paul Marjoram; Kimberly Siegmund; Michael F Press; Darryl Shibata; Christina Curtis Journal: Nat Genet Date: 2015-02-09 Impact factor: 38.330
Authors: Amit G Deshwar; Shankar Vembu; Christina K Yung; Gun Ho Jang; Lincoln Stein; Quaid Morris Journal: Genome Biol Date: 2015-02-13 Impact factor: 13.583
Authors: Iñigo Martincorena; Keiran M Raine; Moritz Gerstung; Kevin J Dawson; Kerstin Haase; Peter Van Loo; Helen Davies; Michael R Stratton; Peter J Campbell Journal: Cell Date: 2017-10-19 Impact factor: 41.582