
PyClone-VI: scalable inference of clonal population structures using whole genome data.

Abstract

BACKGROUND: At diagnosis, tumours are typically composed of a mixture of genomically distinct malignant cell populations. Bulk sequencing of tumour samples coupled with computational deconvolution can be used to identify these populations and study cancer evolution. Existing computational methods for population deconvolution are slow and/or potentially inaccurate when applied to large datasets generated by whole genome sequencing.
RESULTS: We describe PyClone-VI, a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers. We demonstrate the utility of the method by analyzing data from 1717 patients from the PCAWG study and 100 patients from the TRACERx study.
CONCLUSIONS: Our proposed method is 10–100× faster than existing methods, while providing results which are as accurate. Software implementing our method is freely available at https://github.com/Roth-Lab/pyclone-vi .

Keywords: Bayesian statistics; Cancer; Cancer evolution; Tumour heterogeneity

Year: 2020 PMID: 33302872 PMCID: PMC7730797 DOI: 10.1186/s12859-020-03919-2

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Cancer is an evolutionary process driven by ongoing somatic mutation within the malignant cell population [1, 2]. The combination of mutation, drift, and selection leads to heterogeneity within the population of cancer cells. Identifying population structure and quantifying the amount of heterogeneity in tumours is an important problem which has been extensively studied [3-8]. High throughput sequencing (HTS) provides a powerful approach to this problem, with both bulk and single cell approaches being employed. While single cell sequencing approaches can more accurately resolve clonal population structure, they are not widely available and have limitations both technical and due to cost. Using bulk sequencing to study heterogeneity thus remains the predominant approach, and methods for studying heterogeneity using bulk sequencing will become even more important as HTS is increasingly used in translational and clinical work [9-12].

Identifying population structure and quantifying heterogeneity from bulk sequencing data is a computationally challenging problem. The core issue is to deconvolve sequence data generated from a mixture of cell populations. This task is challenging because neither the genotypes of the populations nor the number of populations is known. In addition, factors such as tumour cellularity and copy number variation coincident with single nucleotide variants (SNVs) further complicate the analysis.

The past decade has seen the development of a number of methods to deconvolve bulk data and infer clonal population structure, in particular to identify populations using SNV data. One of the first approaches developed was PyClone, which remains widely used. PyClone was originally developed for use with small panels of deeply sequenced mutations as input [4]. While the PyClone method can in principle be applied to genome scale analysis, the computational cost becomes prohibitive. This deficiency has limited the utility of PyClone for the analysis of genome scale datasets with 10,000s to 100,000s of mutations. In this work we present a new tool, which we refer to as PyClone-VI, which is orders of magnitude faster than the original PyClone method, while providing comparable accuracy.

Related work

A number of other methods have been developed to efficiently infer clonal population structure from genome scale data. We provide a brief, non-exhaustive review of some of the most popular methods.

SciClone uses Bayesian mixture models and variational inference (VI), like our proposed approach PyClone-VI [6]. However, because SciClone fails to correct for coincident copy number variation, it is only applicable to clustering mutations in regions with no copy number variation or with single copy deletions. It follows that in practice SciClone cannot be applied to many tumours, especially when multi-region sequencing is performed, as few mutations will fall in such regions.

EXPANDS is based on the principle of clustering probability distributions of cancer cell fractions (CCFs) using a multi-stage optimization procedure [5]. It has been applied to whole genome studies alongside PyClone and shown to perform similarly [13]. One key difference between EXPANDS and PyClone is that mutations are clustered independently in each sample and then the clusters are combined in a post-processing step. As a result of this post-hoc analysis, statistical strength cannot be shared between samples when inferring population structure using EXPANDS.

QuantumClone is a Bayesian mixture model that is fit to the data using expectation maximization (EM) to find the maximum a posteriori (MAP) estimate [8]. MAP estimation for mixture models is prone to overfitting, in the sense that the model will tend to use all possible clusters (clones). To address the model selection problem QuantumClone uses the Bayesian Information Criterion (BIC) to select the number of clusters. QuantumClone can correct for genotype effects and jointly analyse multi-region data. The use of the BIC for model selection requires that multiple runs of the method be performed with varying numbers of clusters. QuantumClone is conceptually similar to our proposed method; however, our approach avoids the expensive model complexity search across varying numbers of clusters. As we demonstrate in the experiments, avoiding restarts for the model complexity search can lead to a considerable reduction in runtime.

PhyloWGS is a popular approach which attempts to solve the more challenging problem of identifying not only clonal populations, but also the phylogeny that relates them [7]. PhyloWGS adopts a very similar model to PyClone, but substitutes the Dirichlet process prior for clustering with a tree structured stick breaking prior [14]. Like PyClone, PhyloWGS relies on Markov Chain Monte Carlo (MCMC) methods and can be computationally expensive to run with large datasets.

Results

PyClone-VI is as accurate as PyClone but faster

PyClone-VI introduces two levels of approximation to the original PyClone model. First, we alter the model to make it more tractable to perform variational inference. Second, we use variational inference, which is an approximate method for inferring a posterior distribution. To assess the impact these approximations have and investigate whether they lead to tangible performance gains, we compared PyClone-VI to PyClone using synthetic data. We simulated data from the PyClone model with varying numbers of mutations, generating datasets with 50, 100 and 1000 mutations. Each simulated dataset had four samples, each with a tumour content of 1.0. Total copy number for each locus ranged from one to four and major copy number was allowed to vary from one to the total copy number. Genotypes were simulated by selecting whether mutations were late events which affected only one copy or early events which occurred on either the major or minor allele before the copy number change. We simulated the depth of coverage from a Poisson distribution with mean 100. We repeated the simulation for each number of mutations 100 times to generate 300 datasets in total.

The results of this analysis are summarized in Fig. 1. Clustering accuracy was assessed using the V-Measure metric with a value of 1.0 indicating perfect accuracy (Fig. 1a) [15]. The mean difference in V-Measure between PyClone and PyClone-VI was 0.011 in favour of PyClone. To assess the accuracy of the CCF estimates we computed the mean absolute deviation of the predicted CCF from truth for each mutation (Fig. 1b). The mean difference in CCF error was 0.00036 in favour of PyClone-VI. These results suggest there is a negligible performance difference between the two approaches. 
We note that we would expect PyClone to have a slight performance advantage in this experiment as we simulated the data from the PyClone model rather than the PyClone-VI model. Finally, we sought to quantify the computational performance of both methods. Figure 1c, d show the runtime and maximum memory used by both methods. PyClone-VI outperforms PyClone in terms of runtime by nearly two orders of magnitude regardless of the number of mutations (Fig. 1c). PyClone-VI also uses significantly less memory than PyClone (Fig. 1d). Theoretical memory usage for the original PyClone method scales as $\mathcal{O}(n^2)$, where $n$ is the number of mutations. In contrast, memory usage for PyClone-VI scales as $\mathcal{O}(n)$. The empirical results in Fig. 1d appear to support this.
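The V-measure used above is the harmonic mean of homogeneity and completeness, both defined through conditional entropies of the true and predicted cluster labels. The following is a minimal sketch of that definition rather than the code used in the paper; the function name `v_measure` and its arguments are illustrative.

```python
import math
from collections import Counter


def v_measure(truth, pred):
    """V-measure: harmonic mean of homogeneity and completeness."""
    n = len(truth)

    def entropy(labels):
        # Shannon entropy of a labelling, in nats.
        return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

    def conditional_entropy(a, b):
        # H(a | b), computed from joint and marginal label counts.
        joint = Counter(zip(a, b))
        b_counts = Counter(b)
        return -sum(c / n * math.log(c / b_counts[y]) for (_, y), c in joint.items())

    h_truth, h_pred = entropy(truth), entropy(pred)
    # Homogeneity: each predicted cluster contains mutations of a single true clone.
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, pred) / h_truth
    # Completeness: all mutations of a true clone fall in the same predicted cluster.
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2.0 * homogeneity * completeness / (homogeneity + completeness)
```

The score is invariant to relabelling of clusters, so a prediction that recovers the true partition under different cluster names still scores 1.0.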

Fig. 1

Comparison of PyClone and PyClone-VI. a V-measure as a function of the number of mutations. b Mean absolute deviation of inferred CCF from truth as a function of the number of mutations. c Runtime of the methods. d Memory usage

We performed additional simulated data experiments (Additional files 2–4) to test the performance of both methods as we varied tumour content (Additional file 2), error rate (Additional file 3), and number of samples (Additional file 4). V-measure scores and inferred CCF accuracy were similar for PyClone and PyClone-VI across all simulation regimes. Running time and memory usage were significantly lower for PyClone-VI in all cases. General trends for both methods were: a decrease in accuracy as tumour content decreased; a decrease in accuracy as error rate increased; an increase in accuracy as more samples were analyzed.

PyClone-VI is significantly faster than existing methods

We next sought to compare the performance of PyClone-VI against other state of the art methods. In addition to comparing against PyClone, we also considered PhyloWGS and QuantumClone. We downloaded synthetic data used in the ICGC-TCGA DREAM Somatic Mutation Calling - Tumour Heterogeneity Challenge, an open competition to benchmark methods for studying clonal heterogeneity [16]. We limited the analysis to tumours with 10,000 mutations or fewer due to issues relating to runtime (PyClone, PhyloWGS and QuantumClone) and memory (PyClone and QuantumClone). As in the previous experiment, we consider two metrics to assess performance: V-measure (Fig. 2a) and mean absolute deviation error in predicted CCF per mutation (Fig. 2b).

Fig. 2

Analysis of the DREAM SMC-Het data. Analysis of the ICGC-TCGA DREAM Somatic Mutation Calling - Tumour Heterogeneity Challenge data using PhyloWGS (PWGS), PyClone (PC), PyClone-VI (PCVI) and QuantumClone (QC). This analysis used the 31 simulated tumours from the competition with fewer than 10,000 mutations. See Additional file 1: Table S5 for details about the characteristics of the datasets. a Comparison of V-measure across the methods (higher is better). b Comparison of the mean absolute deviation of estimated cancer cell fraction across methods (lower is better). c Comparison of runtime across methods (lower is better). d Comparison of memory usage across methods (lower is better)

When comparing methods we applied the Friedman test to see if there were any significant differences in performance between the methods. If the Friedman test was significant we then applied the post-hoc Nemenyi test with a Bonferroni correction to all pairs of methods to determine which methods showed significantly different performance from each other (p-value < 0.01) [17]. All statements of significance are with respect to this test. PyClone-VI significantly outperformed PyClone and QuantumClone with respect to clustering performance. Though PyClone-VI performed better on average than PhyloWGS, the difference was not significant. With respect to accuracy estimating CCF, both PyClone-VI and PhyloWGS outperformed QuantumClone. There were no other significant differences in accuracy metrics between methods. In general, the results were quite similar across methods, with the differences in performance being quite small. However, there was a significant difference in runtime between methods. PyClone-VI was significantly faster and more memory efficient than all other approaches, finishing 10–100× faster than the other approaches while requiring less memory (Fig. 2c, d). A caveat to this analysis is that runtime is a tuneable parameter for all these approaches. Fewer MCMC iterations can be performed for PyClone and PhyloWGS to shorten runtime at the expense of accuracy. Similarly, QuantumClone and PyClone-VI can use fewer random restarts to speed up runtime, again trading accuracy. For this analysis we attempted to select parameters which gave comparable accuracy (see Methods). We did not make use of parallel computing in this experiment. Both QuantumClone and PyClone-VI can perform random restarts in parallel to decrease runtime. The MCMC based methods cannot be parallelised in the same way.
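To illustrate the first step of the comparison described above, the Friedman statistic can be computed by ranking the methods within each dataset and comparing the rank sums. This is a minimal sketch without tie correction, not the analysis code used in the paper; the function names are illustrative, and in practice a library routine such as `scipy.stats.friedmanchisquare` would normally be used to obtain the p-value as well.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(scores):
    """scores[b][m] is the metric of method m on dataset b (no tie correction)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
    # Chi-square form of the Friedman statistic with k - 1 degrees of freedom.
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

Under the null hypothesis of no difference between methods, the statistic is approximately chi-square distributed with k - 1 degrees of freedom, where k is the number of methods.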

Analysis of PCAWG cohort

To demonstrate the real life utility of PyClone-VI we analysed the data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) [18]. We downloaded processed data from the ICGC data portal and pre-processed it for input into PyClone-VI. The only filtering performed was to remove mutations with no copy number information or in regions with total copy number zero. We analysed the resulting data from 1717 patients with 28–881,464 mutations. All data was single sample whole genome data. Figure 3a shows the runtime of PyClone-VI as a function of the number of mutations. Runtime increases linearly with the number of mutations with times ranging from 11 to 28,575 s. Figure 3b shows the runtime as a function of the number of clones detected and Fig. 3c shows how the number of clones detected depends on the number of mutations. The trend is that more clones are detected as more mutations are included, with runtime correspondingly increasing with the number of clones. Figure 3d is an illustrative analysis which shows the number of clones normalised by the number of mutations broken down by ICGC project.

Fig. 3

Analysis of the PCAWG cohort. a Runtime of PyClone-VI as a function of the number of mutations. b Runtime of PyClone-VI as a function of the number of clones inferred. c Comparison between the number of clones found and number of mutations. d Number of clones normalized by total number of mutations for each ICGC project

To generate a rough estimate of the running time of other approaches used in the DREAM benchmark for this dataset, we fit a linear regression to the observed running times on the DREAM data as a function of the number of mutations. We then used the fitted model to predict running times for each method on the PCAWG data (Additional file 1: Table S10). For the DREAM dataset we observed total running times of approximately: 5960 s for PyClone-VI, 38,400 s for QuantumClone, 74,300 s for PyClone and 156,000 s for PhyloWGS. For the PCAWG dataset we predicted total running times of approximately: 842,000 s for PyClone-VI, 6,740,000 s for QuantumClone, 14,200,000 s for PyClone and 28,900,000 s for PhyloWGS. The predicted value of 842,000 s was higher than the observed value of 560,000 s for PyClone-VI, suggesting these predictions may be pessimistic. We note that this analysis assumes a linear increase in running time with the number of mutations.
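The extrapolation described above amounts to fitting a straight line to runtime against mutation count and summing the predictions over a new cohort. The sketch below shows one way to do this with ordinary least squares; it is an illustration under that assumption, and the function names are hypothetical rather than taken from the paper's scripts.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_total_runtime(intercept, slope, mutation_counts):
    """Sum of the predicted per-tumour runtimes (in seconds) over a cohort."""
    return sum(intercept + slope * m for m in mutation_counts)
```

Here `fit_line` would be applied to the per-tumour mutation counts and runtimes of one method on the benchmark data, and `predict_total_runtime` to the mutation counts of the target cohort. Because the target cohort contains tumours far larger than any in the benchmark, the prediction is an extrapolation and inherits the linearity assumption noted in the text.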

Analysis of TRACERx cohort

As another real world demonstration, this time with multiple samples, we analysed whole exome data from the 100 lung cancer patients from the TRACERx study [12]. Patients had between 1 and 7 samples sequenced from different regions of their tumours with between 65 and 3566 mutations detected. Figure 4a shows the runtime of PyClone-VI as a function of the number of mutations. Again, runtime increases linearly with the number of mutations with times ranging from 9 to 1454 s. Figure 4b, c show runtime and runtime normalised by the number of mutations with varying numbers of samples. Runtime does not directly increase with the number of samples (Fig. 4b), but once the runtime is normalised to account for the number of mutations we see an increase (Fig. 4c). In Fig. 4d, e we show the number of mutations and clones that can be resolved as a function of the number of samples. Interestingly, the number of mutations identified does not seem to depend strongly on the number of samples; however, the number of clones which can be detected increases as more samples are added. This result illustrates the important role that multi-region sequencing plays in determining clonal population structure. Eight patients in the cohort had only a single sample. We compared the number of mutations in these patients inferred to be clonal to the number inferred to be clonal from multi-region sequencing (Fig. 4f). The proportion of detected clonal mutations decreases in the multi-sample setting, suggesting that many apparently clonal mutations in single sample sequencing may in fact be sub-clonal, consistent with the findings in [12] which performed a more thorough held out sampling.

Fig. 4

Analysis of the TRACERx cohort. a Runtime of PyClone-VI as a function of the number of mutations. b Runtime of PyClone-VI as a function of the number of samples. c Runtime normalised by number of mutations for varying numbers of samples. d Number of mutations detected with varying numbers of samples. e Number of clones detected with varying numbers of samples. f Comparison of proportion of mutations deemed clonal when using single versus multiple samples

Discussion

PyClone-VI achieves significant computational gains over the original PyClone method by altering the model and changing the approach used for inference. To do so we introduce several approximations on top of those already in the PyClone model. We assume that CCF values can only take on a finite set of values. The number of possible values determines the accuracy of this approximation and the runtime. For the analyses performed in this paper we used a grid of 100 values, which provides CCFs accurate to within 0.01. Using a larger grid of values will provide more accurate estimates if the mutations are sequenced to a sufficient depth. In general, large numbers of mutations are not deeply sequenced, so using relatively sparse grids is appropriate for the data. If a small panel of mutations is deeply sequenced, then the original PyClone method may be more appropriate than PyClone-VI. Another approximation we make is to use a finite mixture model in place of a Dirichlet process (DP) for clustering. We rely on the variational inference procedure to automatically perform model selection by only using the number of clusters supported by the data. The approach of using more clusters than needed is heuristic; however, it is widely employed and generally performs well [19]. We note that neither DP models nor the BIC are guaranteed to consistently estimate the correct number of clusters. The use of VI rather than MCMC for inference means that PyClone-VI will deliver posterior approximations of unknown accuracy. In contrast, MCMC approaches are guaranteed to approximate the posterior to arbitrary accuracy provided enough samples are drawn. In practice, VI approaches are typically observed to estimate the mean of the posterior distribution well, but to underestimate the variance. When inferring clonal population structure the underestimation of variance would lead to over-confident assignment of mutations to clusters and under-estimates of error bar widths for CCF values. 
If accurate estimates of these values are required, then we recommend the use of the original PyClone model. It is our observation that most users do not make use of these values, and instead rely on the point estimates generated by PyClone. In this case, PyClone-VI should be the preferred approach due to reduced runtime. Like PyClone, PyClone-VI clusters mutations which share the same evolutionary history. Such mutations originate at the same point in the phylogeny and exhibit the same pattern of mutation loss. PyClone-VI does not attempt to infer the phylogenetic tree, in contrast to methods such as PhyloWGS. Ignoring the phylogenetic structure is a potential weakness, but it does mean we do not have to make additional assumptions such as mutations cannot be lost once gained. Such assumptions are restrictive and violated in many cancers [20]. We believe that the ability to quickly cluster mutations will be useful for downstream software which attempts to infer phylogenies. By reducing the size of the input data from the number of mutations to the number of clonal populations, more sophisticated and computationally expensive tree building methods can be used [21-23].

Conclusions

We have introduced a new method, PyClone-VI, for inferring clonal population structure in tumours from point mutations measured using high throughput sequencing. PyClone-VI is significantly more computationally efficient than existing approaches and provides comparable accuracy. Tumours with 100,000s of mutations can easily be analysed by PyClone-VI in less than a day on a personal computer, a dramatic reduction in both runtime and memory required for this analysis. PyClone-VI will be a useful tool for researchers performing large cohort studies of tumour heterogeneity. PyClone-VI will also be useful in clinical studies which integrate WGS analysis of tumours and require timely analysis to inform treatment decisions.

Methods

Inference in the original PyClone package was performed using MCMC sampling [4]. As the number of mutations grows, each iteration of the MCMC sampler becomes slower. This is problematic because large datasets likely need many more iterations of MCMC sampling than small datasets, which further adds to the computational cost. However, many users do not adjust for this factor, and as a result PyClone is often run with too few iterations for the MCMC chain to converge, leading to poor performance. One widely observed symptom of this problem is the tendency for PyClone to produce many clusters containing a single mutation [8]. To overcome these limitations we have modified the original PyClone model. This modification has allowed us to develop and implement an efficient VI procedure which is orders of magnitude faster than the previous MCMC method. We refer to this new model and software implementation as PyClone-VI. In addition to being significantly faster, this approach also removes the need for the user to assess the convergence of the MCMC sampler, thus reducing the potential for misuse.

PyClone

We provide a brief review of the original PyClone method here to motivate the changes in PyClone-VI. More details can be found in the original PyClone paper [4], which includes additional details such as how to elicit genotype priors and the form of the emission distributions supported. The original PyClone model is a DP mixture model [24]. The basic hierarchical model is as follows

$$
\begin{aligned}
G \mid \alpha, G_0 &\sim \text{DP}(\alpha, G_0) \\
\phi_i \mid G &\sim G \\
x_{ij} \mid \phi_{ij}, \pi_{ij}, \theta &\sim H(\cdot \mid \phi_{ij}, \pi_{ij}, \theta)
\end{aligned}
$$

Here $\alpha$ is the DP concentration parameter, $G_0$ is the base measure, and $\phi_i = (\phi_{ij})_j$ collects the CCFs of mutation $i$ across samples. We use the distribution $H$ to denote the emission distribution used to generate the observed variant read counts $x_{ij}$, where $i$ indexes the mutation and $j$ the sample. This distribution depends on local hyper-parameters $\pi_{ij}$ which capture information about the genotype and read depth. The parameter $\theta$ represents global hyper-parameters which are shared across mutations. For example, when using the Beta-Binomial distribution of the original PyClone paper, $\theta$ would be the precision of the distribution. The above model induces a clustering of mutations since the measure $G$ sampled from the DP is almost surely discrete, which implies there is a non-zero probability that mutations share the same CCF. We can define a clustering of the mutations as follows: let $\{\phi^*_k\}$ be the unique set of CCFs used to generate the data. Then for mutation $i$ we define $z_i = k$ if $\phi_i = \phi^*_k$. The introduction of the cluster indicator variable $z_i$ is commonly used when developing MCMC sampling strategies for DP mixture models [25]. This formulation is also useful for allowing us to discuss how to modify the PyClone model to derive a more computationally efficient approach. The original PyClone model makes use of the DP to solve the model selection problem. The model selection problem refers to the fact that we do not know the true number of clusters (clones) in the model. The DP formulation solves this by positing that there exists an infinite number of clusters, but that the observed data will only be generated from a finite subset of these. While DP mixtures provide an elegant solution to the model selection problem, they tend to be computationally expensive. The computational expense is primarily due to the need to use MCMC methods to approximate the posterior distribution and thus infer model parameters [25].
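The way a DP prior allows an unbounded number of clusters while any finite dataset occupies only a few of them can be illustrated with the Chinese restaurant process, which describes the partition a DP induces on the data. The sketch below only samples cluster labels from that prior; it is not part of PyClone, and the function name and arguments are illustrative.

```python
import random


def sample_crp(num_mutations, alpha, seed=0):
    """Draw cluster labels for mutations from a Chinese restaurant process."""
    rng = random.Random(seed)
    labels, sizes = [], []
    for i in range(num_mutations):
        # An existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        u = rng.random() * (i + alpha)
        k, acc = 0, 0.0
        for k, size in enumerate(sizes):
            acc += size
            if u < acc:
                break
        else:
            # No existing cluster was selected, so open a new one.
            k = len(sizes)
            sizes.append(0)
        sizes[k] += 1
        labels.append(k)
    return labels
```

With a concentration parameter of 1.0, the expected number of occupied clusters grows only logarithmically with the number of mutations, which is why the prior can posit infinitely many clusters without using many of them.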

Variational inference

VI is a popular alternative to MCMC methods in the Bayesian statistics and machine learning literature [26]. VI reformulates the problem of approximating the posterior as an optimization problem. In the general case, a variational distribution $q(\Theta \mid \lambda)$ is assumed, where $\Theta$ are the model parameters and $\lambda$ are the variational parameters. The goal is to find the variational distribution that minimizes some notion of distance from the posterior distribution $p(\Theta \mid X)$, where $X$ denotes the observed data. A widely used measure of distance is the exclusive Kullback-Leibler divergence denoted $\text{KL}(q \,\|\, p)$. VI using $\text{KL}(q \,\|\, p)$ as the objective can lead to efficient inference procedures that provide adequate approximations to the true posterior for many problems. Mean field VI (MFVI), often called variational Bayes in the machine learning and statistics literature, posits that the variational distribution decomposes as a product of terms for each model parameter, $q(\Theta \mid \lambda) = \prod_i q_i(\Theta_i \mid \lambda_i)$. For models which obey certain conjugacy constraints, simple closed form MFVI updates can be derived, leading to efficient inference algorithms. The updates take the form

$$
\log q_i(\Theta_i \mid \lambda_i) = \mathbb{E}_{-i}\left[\log p(X, \Theta)\right] + \text{const}
$$

where $\mathbb{E}_{-i}$ denotes the expectation taken over all parameters except $\Theta_i$ [27]. The need to compute an expectation is what leads to the constraints on conjugacy for MFVI. We note there has been significant work recently using Monte Carlo methods to compute these expectations in models that do not satisfy conjugacy constraints [28, 29]. These approaches could potentially be used as an alternative to our proposed method for performing VI for the PyClone model. The original PyClone model does not fall in the class of models for which MFVI is easily applicable. There are two issues. The most important issue is that the emission density $H$ does not have a conjugate prior distribution. The second, related issue is that while there are ways to perform VI with DP mixtures, they require that we have a conjugate emission density [30]. Moreover, these approaches impose a finite truncation on the number of clusters. This latter point means there is not a major advantage to using the DP when employing VI [31]. Rather, using over-complete finite mixture models is often equally effective. Here we use over-complete to mean we fit a finite mixture model with more components than we expect to need [32], and allow the inference procedure to perform model selection [19].
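The link between the exclusive KL divergence and the evidence lower bound (ELBO) that is monitored during inference can be made explicit with a standard identity. The notation here is an assumption made for illustration: $X$ is the observed data, $\Theta$ the model parameters, and $q(\Theta \mid \lambda)$ the variational distribution with variational parameters $\lambda$.

```latex
\log p(X)
  = \underbrace{\mathbb{E}_{q}\left[\log p(X, \Theta) - \log q(\Theta \mid \lambda)\right]}_{\text{ELBO}(\lambda)}
  + \text{KL}\left(q(\Theta \mid \lambda) \,\|\, p(\Theta \mid X)\right)
```

Because $\log p(X)$ does not depend on $\lambda$, maximizing the ELBO is equivalent to minimizing the exclusive KL divergence, and the ELBO can be evaluated without access to the intractable posterior.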

PyClone-VI model

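In order to apply VI to fit the PyClone model, we make some modifications to the model. First, we change the model from a DP mixture model to a finite mixture model. In principle the use of a finite mixture model means we must address the model selection problem and fit the model with a varying number of clusters $K$. In practice we avoid this issue by setting $K$ to be large and allowing the inference procedure to only use the number of clusters required. This heuristic strategy has been shown to work well in practice [19, 33]. The second modification is to assume that the CCFs of mutations can only take values in a finite set $\mathcal{G} = \{g_1, \ldots, g_M\}$ where $g_m \in [0, 1]$. This change is primarily motivated by computational considerations, but can be justified by noting that we typically sequence genomes to 50–1000× when performing whole genome or exome sequencing. Thus, it would seem unreasonable to expect to resolve the CCF of a mutation to arbitrary precision. Provided we choose the grid of CCF values to be sufficiently large, this approximation should yield reasonable results. The modified version of the PyClone model, which we call PyClone-VI, is defined as follows

$$
\begin{aligned}
\rho \mid \kappa &\sim \text{Dirichlet}(\cdot \mid \kappa) \\
z_i \mid \rho &\sim \text{Categorical}(\cdot \mid \rho) \\
\phi_{kj} \mid \gamma &\sim \text{Discrete}(\cdot \mid \gamma, \mathcal{G}) \\
x_{ij} \mid z_i = k, \phi_{kj}, \pi_{ij}, \theta &\sim H(\cdot \mid \phi_{kj}, \pi_{ij}, \theta)
\end{aligned}
$$

where $\rho$ denotes the mixture weights over the $K$ clusters, $\phi_{kj}$ the CCF of cluster $k$ in sample $j$, and $\text{Discrete}(\cdot \mid \gamma, \mathcal{G})$ indicates the discrete distribution with mass vector $\gamma$ and support $\mathcal{G}$. We use the uninformative priors $\kappa = \mathbf{1}_K$, where $\mathbf{1}_K$ is the vector of ones of length $K$, and $\gamma = \frac{1}{M} \mathbf{1}_M$. The joint distribution is thus given by

$$
p(X, z, \phi, \rho) = p(\rho) \prod_{k} \prod_{j} p(\phi_{kj}) \prod_{i} \prod_{k} \Big[ \rho_k \prod_{j} h(x_{ij} \mid \phi_{kj}) \Big]^{\mathbb{I}(z_i = k)}
$$

where we have suppressed the dependence on hyper-parameters for notational clarity. We let $h$ denote the emission density and $\mathbb{I}(z_i = k)$ the indicator function which is one when $z_i = k$ and zero otherwise. As we will show in the next section this formulation leads to an efficient MFVI procedure.

Inference

We use MFVI to fit the PyClone-VI model. To do so we make the usual mean field assumption for our variational distribution $q$, where $\rho$ denotes the mixture weights, $z_i$ the cluster indicator of mutation $i$, and $\phi_{kj}$ the CCF of cluster $k$ in sample $j$.

$$
q(z, \phi, \rho) = q(\rho) \prod_{i} q(z_i) \prod_{k} \prod_{j} q(\phi_{kj})
$$

The distributional assumptions are as follows

$$
\begin{aligned}
\rho &\sim \text{Dirichlet}(\cdot \mid \eta) \\
z_i &\sim \text{Categorical}(\cdot \mid \psi_i) \\
\phi_{kj} &\sim \text{Discrete}(\cdot \mid \omega_{kj}, \mathcal{G})
\end{aligned}
$$

The densities are then given by

$$
\begin{aligned}
q(\rho \mid \eta) &= \frac{\Gamma\left(\sum_k \eta_k\right)}{\prod_k \Gamma(\eta_k)} \prod_{k} \rho_k^{\eta_k - 1} \\
q(z_i \mid \psi_i) &= \prod_{k} \psi_{ik}^{\mathbb{I}(z_i = k)} \\
q(\phi_{kj} \mid \omega_{kj}) &= \prod_{m} \omega_{kjm}^{\mathbb{I}(\phi_{kj} = g_m)}
\end{aligned}
$$

Thus we need to optimize the variational parameters $\eta$, $\psi$ and $\omega$. The parameter updates can be derived by applying the standard MFVI update. Thus we have

$$
\begin{aligned}
\eta_k &= \kappa_k + \sum_{i} \psi_{ik} \\
\log \psi_{ik} &= \mathbb{E}_q[\log \rho_k] + \sum_{j} \sum_{m} \omega_{kjm} \log h(x_{ij} \mid g_m) + \text{const} \\
\log \omega_{kjm} &= \log \gamma_m + \sum_{i} \psi_{ik} \log h(x_{ij} \mid g_m) + \text{const}
\end{aligned}
$$

where $\mathbb{E}_q[\log \rho_k] = \Psi(\eta_k) - \Psi\left(\sum_{l} \eta_l\right)$ with $\Psi$ the digamma function, and we have the following normalization constraints

$$
\sum_{k} \psi_{ik} = 1, \qquad \sum_{m} \omega_{kjm} = 1
$$

These updates are iterated until convergence. Convergence can be monitored by computing the difference in the evidence lower bound (ELBO) after each update [26]. Monitoring the ELBO is also useful to assess that the software implementation is correct, as it should increase monotonically. Since we assume the CCFs, $\phi_{kj}$, can only take a finite set of values we can evaluate $\log h(x_{ij} \mid g_m)$ for all mutations and samples across this grid as a pre-processing step during inference. Caching this value leads to a dramatic reduction in runtime for the method. This strategy is only applicable if the global parameters of the emission distribution $h$ are fixed. In practice, this means we fix the precision term of the Beta-Binomial emission distribution, rather than estimating it as PyClone does. We also treat the hyper-parameter $\kappa$ as a fixed parameter. This hyper-parameter weakly controls the number of clusters used, with values greater than one promoting the use of more clusters, and values less than one fewer. For all experiments in this work we used a value of one.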
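The coordinate ascent scheme described above, including the caching of the emission log-density over the CCF grid, can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the PyClone-VI implementation: a plain Binomial with expected VAF equal to CCF/2 (diploid, heterozygous, tumour content 1.0) stands in for the emission density, the prior over the grid is uniform, the Dirichlet prior is symmetric with parameter `kappa`, a fixed number of iterations replaces ELBO monitoring, and all names (`fit_mfvi`, `psi`, `omega`, `eta`) are hypothetical.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0: recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def log_binomial(b, d, p):
    """Log probability of b variant reads out of d under Binomial(d, p)."""
    coeff = math.lgamma(d + 1) - math.lgamma(b + 1) - math.lgamma(d - b + 1)
    return coeff + b * math.log(p) + (d - b) * math.log(1.0 - p)


def normalise(log_w):
    """Turn unnormalised log weights into probabilities (log-sum-exp trick)."""
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def fit_mfvi(counts, num_clusters=5, grid_size=101, kappa=1.0, num_iters=50, seed=0):
    """counts[i][j] = (variant reads, depth) for mutation i in sample j."""
    rng = random.Random(seed)
    n, s = len(counts), len(counts[0])
    grid = [m / (grid_size - 1) for m in range(grid_size)]
    # Assumed emission: expected VAF = CCF / 2, clamped away from 0 and 1.
    vaf = [min(max(g / 2.0, 1e-6), 1.0 - 1e-6) for g in grid]
    # Cache log h(x_ij | g_m) once, since grid and emission parameters are fixed.
    cache = [[[log_binomial(b, d, p) for p in vaf] for (b, d) in row] for row in counts]
    # psi[i][k]: variational probability that mutation i belongs to cluster k.
    psi = [normalise([rng.random() for _ in range(num_clusters)]) for _ in range(n)]
    for _ in range(num_iters):
        # Mixture weight update: eta_k = kappa + sum_i psi_ik.
        eta = [kappa + sum(row[k] for row in psi) for k in range(num_clusters)]
        e_log_rho = [digamma(e) - digamma(sum(eta)) for e in eta]
        # CCF update: log omega_kjm = sum_i psi_ik * log h(x_ij | g_m) + const
        # (the uniform prior over the grid only contributes a constant).
        omega = [[normalise([sum(psi[i][k] * cache[i][j][m] for i in range(n))
                             for m in range(grid_size)])
                  for j in range(s)] for k in range(num_clusters)]
        # Assignment update: log psi_ik = E[log rho_k] + sum_jm omega_kjm * log h + const.
        psi = [normalise([e_log_rho[k] + sum(omega[k][j][m] * cache[i][j][m]
                                             for j in range(s) for m in range(grid_size))
                          for k in range(num_clusters)]) for i in range(n)]
    labels = [max(range(num_clusters), key=lambda k: row[k]) for row in psi]
    # Posterior mean CCF of each cluster in each sample.
    ccf = [[sum(w * g for w, g in zip(omega[k][j], grid)) for j in range(s)]
           for k in range(num_clusters)]
    return labels, ccf


# Toy input: five mutations near CCF 1.0 and five near CCF 0.3, one sample.
toy = [[(50, 100)]] * 5 + [[(15, 100)]] * 5
labels, ccf = fit_mfvi(toy)
```

Although five clusters are available, the toy data should occupy only two of them, which is the over-complete mixture behaviour the method relies on for model selection. A fuller implementation would monitor the ELBO for convergence and repeat the fit from several random initializations, keeping the run with the highest ELBO.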

Experiments

Synthetic data

For the results shown in the main text we simulated data from the PyClone model with 50, 100 and 1000 mutations using a DP concentration parameter of 1.0. Additional simulation parameters are described in the results. Additional simulations were performed with varying tumour content (Additional file 2), error rates for the expected VAF (Additional file 3), and number of samples (Additional file 4). Parameter settings for these simulations are provided in the file descriptions. For all simulations we used PyClone version 0.13.1 run with 10,000 iterations and discarding the first 1000 as burn-in. We ran PyClone-VI using 40 clusters and 100 random restarts.
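The read count part of such a simulation can be sketched as follows. This is an illustration rather than the simulation code used for the paper: it assumes a tumour content of 1.0, that every cancer cell has the same total copy number at the locus, and that the expected VAF is the CCF times the number of mutated copies divided by the total copy number. The function names are hypothetical, and genotype and CCF sampling are left to the caller.

```python
import math
import random


def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for moderate means such as 100."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_mutation(rng, ccf, total_cn, mutated_cn, mean_depth=100):
    """Return (variant reads, depth) for one mutation in one sample."""
    depth = sample_poisson(rng, mean_depth)
    # Assumption: tumour content 1.0 and a single total copy number per locus,
    # so only cells carrying the mutation contribute variant copies.
    vaf = ccf * mutated_cn / total_cn
    variant_reads = sum(rng.random() < vaf for _ in range(depth))
    return variant_reads, depth
```

For example, a late event affecting one copy at a locus with total copy number two and CCF 0.5 has an expected VAF of 0.25 under these assumptions.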

DREAM data

We downloaded the ICGC-TCGA DREAM Somatic Mutation Calling - Tumour Heterogeneity Challenge data [16] from www.synapse.org. To generate realistic data the authors generated a tree structure relating the clones in the sample and simulated the clonal prevalence values. BAMSurgeon [34] was used to manipulate a real sequence data set to introduce mutations in BAM files for each clone. The clonal BAM files were merged and then analyzed with Battenberg [35] for copy number calling and MuTect [36] for SNV/Indel calling. Summary statistics for the datasets used are provided in Additional file 1: Table S5. A custom script was used to process the Battenberg TSV and MuTect VCF files for input into PyClone, PyClone-VI and QuantumClone. We used the included PhyloWGS parser for these input formats to generate input files for PhyloWGS. Tumour content values were set to the ground truth values provided for all methods which accept this argument. PhyloWGS was run for 10 iterations of burn-in and subsequently 100 samples were collected from the MCMC trace. We selected the maximum a posteriori sample, that is the sample with the highest joint probability, to compute estimates from PhyloWGS. PyClone was run for 1000 iterations, discarding the first 100 iterations as burn-in. We used the PyClone Beta-Binomial emission distribution with the connected initialization strategy and major copy number prior elicitation method. Default parameters were used for post-processing the PyClone MCMC trace. QuantumClone was run with 2–10 clones and 10 random restarts. PyClone-VI was run with 10 clusters, 100 random restarts and used the Beta-Binomial emission distribution.

PCAWG data

We downloaded SNV and CNV data from the PCAWG project hosted in the ICGC portal [18]. We used a custom script to pre-process the data into a format compatible with PyClone-VI, extracting read counts from the input VCF files and allele-specific copy number from the CNV data. We ignored sub-clonal CNVs and removed mutations with major copy number zero. We fit PyClone-VI using the Binomial emission distribution with 20 clusters and 100 random restarts.
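The filtering step can be sketched as follows. This is not the authors' custom script: the record layout and function name are assumptions, and the handling of sub-clonal CNVs is not shown. The column names follow the tab-separated input format described in the PyClone-VI README, to the best of our knowledge.

```python
import csv
import io

# Required input columns as described in the PyClone-VI README (assumed here).
COLUMNS = ["mutation_id", "sample_id", "ref_counts", "alt_counts",
           "major_cn", "minor_cn", "normal_cn"]


def write_pyclone_vi_input(records, handle):
    """Write records as TSV, skipping mutations with major copy number zero.

    Returns the number of rows written. `records` is an iterable of dicts
    keyed by COLUMNS (a hypothetical intermediate representation).
    """
    writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    kept = 0
    for rec in records:
        if int(rec["major_cn"]) == 0:
            continue  # mutation lies in a region with no major-allele copies
        writer.writerow({col: rec[col] for col in COLUMNS})
        kept += 1
    return kept


records = [
    {"mutation_id": "chr1:100", "sample_id": "S1", "ref_counts": 80,
     "alt_counts": 20, "major_cn": 2, "minor_cn": 1, "normal_cn": 2},
    {"mutation_id": "chr2:200", "sample_id": "S1", "ref_counts": 95,
     "alt_counts": 5, "major_cn": 0, "minor_cn": 0, "normal_cn": 2},
]
buffer = io.StringIO()
print(write_pyclone_vi_input(records, buffer))  # 1
```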

TRACERx data

We downloaded SNV and CNV data included in the supplementary material of [12]. We used a custom script to pre-process the data into a format compatible with PyClone-VI. We fit PyClone-VI using the Binomial emission distribution with 40 clusters and 100 random restarts.

Additional file 1.
Table S1: Performance results for the comparison of PyClone and PyClone-VI using synthetic data used in Fig. 1.
Table S2: Performance results for the comparison of PyClone and PyClone-VI using synthetic data used in Additional file 2.
Table S3: Performance results for the comparison of PyClone and PyClone-VI using synthetic data used in Additional file 3.
Table S4: Performance results for the comparison of PyClone and PyClone-VI using synthetic data used in Additional file 4.
Table S5: Summary statistics for datasets used in Fig. 2.
Table S6: Performance results for the analysis of DREAM SMC-HET data used in Fig. 2.
Table S7: Friedman test results for comparing methods using the DREAM SMC-HET data.
Table S8: Post-hoc Nemenyi test for comparing methods using the DREAM SMC-HET data.
Table S9: Results from the PCAWG data analysis used in Fig. 3.
Table S10: Predicted run time to analyze PCAWG data for programs used in DREAM analysis.
Table S11: Results from the TRACERx data analysis used in Fig. 4.

Additional file 2. Fig. S1: Comparison of PyClone and PyClone-VI with varying tumour content. PDF file with figures showing the results of running PyClone and PyClone-VI with varying tumour content values. Data was simulated from the PyClone model with 4 samples, 100 mutations, a mean depth of 100 reads and copy number ranging from 1–4 copies. The same tumour content values were used for all 4 samples for each dataset. (a) V-measure as a function of the tumour content of the samples. (b) Mean absolute deviation of inferred CCF from truth as a function of the tumour content. (c) Runtime of the methods. (d) Memory usage.

Additional file 3. Fig. S2: Comparison of PyClone and PyClone-VI with varying error rates. PDF file with figures showing the results of running PyClone and PyClone-VI with varying error rates of the expected variant allele frequency. Data was simulated from the PyClone model with 4 samples, 100 mutations, a mean depth of 100 reads, copy number ranging from 1–4 copies and tumour content for each sample randomly selected from [0.4, 0.8]. To simulate error, the true expected VAF f was computed for each mutation and then a perturbed expected VAF was simulated uniformly from an interval determined by the error rate. The perturbed expected VAF was then used to simulate read count data. (a) V-measure as a function of the error rate. (b) Mean absolute deviation of inferred CCF from truth as a function of the error rate. (c) Runtime of the methods. (d) Memory usage.

Additional file 4. Fig. S3: Comparison of PyClone and PyClone-VI with varying number of samples. PDF file with figures showing the results of running PyClone and PyClone-VI with varying number of samples. Data was simulated from the PyClone model with 1–8 samples, 100 mutations, a mean depth of 100 reads, copy number ranging from 1–4 copies and tumour content for each sample randomly selected from [0.4, 0.8]. (a) V-measure as a function of the number of samples. (b) Mean absolute deviation of inferred CCF from truth as a function of the number of samples. (c) Runtime of the methods. (d) Memory usage.
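The two accuracy metrics reported in these figures can be sketched in plain Python. This is an illustrative implementation under assumptions, not the evaluation code used for the paper: V-measure is computed as the harmonic mean of homogeneity and completeness (the standard definition, also available as `v_measure_score` in scikit-learn), and the mean absolute deviation is taken over a flat list of per-mutation CCF values.

```python
import math
from collections import Counter


def _entropy(counts, total):
    """Shannon entropy (in nats) of a label distribution given its counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness of a clustering."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    h_true = _entropy(true_counts.values(), n)
    h_pred = _entropy(pred_counts.values(), n)
    # Conditional entropies H(true | pred) and H(pred | true).
    h_true_given_pred = -sum(
        (c / n) * math.log(c / pred_counts[p]) for (t, p), c in joint.items())
    h_pred_given_true = -sum(
        (c / n) * math.log(c / true_counts[t]) for (t, p), c in joint.items())
    homogeneity = 1.0 if h_true == 0 else 1.0 - h_true_given_pred / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - h_pred_given_true / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def mean_absolute_deviation(true_ccf, inferred_ccf):
    """Average absolute difference between true and inferred CCF values."""
    return sum(abs(t - p) for t, p in zip(true_ccf, inferred_ccf)) / len(true_ccf)


# Cluster labels are arbitrary, so a relabelled perfect clustering scores 1.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
print(mean_absolute_deviation([0.2, 0.8], [0.3, 0.6]))
```

V-measure is invariant to how clusters are numbered, which is why it is suitable for comparing an inferred clustering against ground-truth clone assignments.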
