Advances in understanding tumour evolution through single-cell sequencing.

Jack Kuipers¹, Katharina Jahn¹, Niko Beerenwinkel¹.

Abstract

The mutational heterogeneity observed within tumours poses additional challenges to the development of effective cancer treatments. A thorough understanding of a tumour's subclonal composition and its mutational history is essential to open up the design of treatments tailored to individual patients. Comparative studies on a large number of tumours permit the identification of mutational patterns which may refine forecasts of cancer progression, response to treatment and metastatic potential. The composition of tumours is shaped by evolutionary processes. Recent advances in next-generation sequencing offer the possibility to analyse the evolutionary history and accompanying heterogeneity of tumours at an unprecedented resolution, by sequencing single cells. New computational challenges arise when moving from bulk to single-cell sequencing data, leading to the development of novel modelling frameworks. In this review, we present the state of the art methods for understanding the phylogeny encoded in bulk or single-cell sequencing data, and highlight future directions for developing more comprehensive and informative pictures of tumour evolution. This article is part of a Special Issue entitled: Evolutionary principles - heterogeneity in cancer?, edited by Dr. Robert A. Gatenby.

Keywords: Cancer evolution; Phylogenetics; Single-cell sequencing; Tumour heterogeneity

Substances:
Biomarkers, Tumor

Year: 2017 PMID: 28193548 PMCID: PMC5813714 DOI: 10.1016/j.bbcan.2017.02.001

Source DB: PubMed Journal: Biochim Biophys Acta Rev Cancer ISSN: 0304-419X Impact factor: 10.680

Tumour evolution and heterogeneity

Cancerous cells experience complex and diverse genomic aberrations which may induce characteristic hallmarks [1], [2] and allow tumour progression. The view of a sequence of genetic changes providing a fitness advantage and leading to a clonal expansion of cells inheriting those characteristics was crystallised by Nowell [3], and exemplified for colon cancer [4]. The consequences of an evolutionary model of competing clones in a Darwinian framework are complex and heterogeneous tumours, as were also initially observed [5] and seen as a founder of metastases [6]. Tumour heterogeneity was quickly established and examined (as reviewed in [7]) but the evolutionary view of competing populations of tumour cells came back into focus with the turn of the millennium [8], [9], [10] with the arrival of genome sequencing. The collection of large amounts of genetic data with next generation sequencing (NGS), spearheaded by the compilation of large public databases by consortia like The Cancer Genome Atlas (TCGA) [11] or the International Cancer Genome Consortium (ICGC) [12], cemented the view of cancer as a dynamic evolutionary process with clones arising, expanding and descendant cells differentiating into further competing subclones [13], [14], [15]. Detailed genomic data have also uncovered the clonal complexity and heterogeneity across many cancer types as recently reviewed [16]. The negative effects of clonal diversity on tumour progression were observed clinically for esophageal adenocarcinoma [17], allowing the use of diversity as a biomarker [18]. This example spurred the examination of the clinical implications of the genetic diversity resulting from tumour heterogeneity [19]. Heterogeneity or diversity is also a cause of drug resistance or relapse [15], [20], [21], [22]. 
Treatment may target the most common clone, whose remission, together with the new selective pressures of treatment, may allow smaller subclones to emerge, develop resistance and progress [23], [24], [25]. Subclones may also cooperate [26], which connects back to the ideas of Heppner [7], who emphasised that subclones belong to a complex tumour ecosystem. The order of mutations can also affect disease progression and response to treatment [27]. The large amounts of genomic data have therefore not only shone light on the complex makeup of tumours, but now highlight how a deeper understanding of their diversity and evolutionary history is needed for more effective and precise cancer therapies [15], [16], [25], [28], [29], [30].

Decoding heterogeneity and evolutionary histories

Typically, approaches to study heterogeneity and clonal evolution have looked at bulk samples which mix the DNA of thousands or millions of cells before sequencing. The resulting output is an estimate of the frequencies of various variants in each sample. To understand the diversity and subclone structure, one needs to be able to decode the evolutionary history from such bulk data. The problem of moving from variant frequencies to evolutionary histories reduces to one of deconvolving the mutations in the mixture into clones and their phylogenetic relationship. We review methods developed for resolving this problem in Section 2. As depicted in Fig. 1 there are situations where the frequencies alone cannot distinguish between different histories. This ambiguity can be reduced by taking multiple samples [31], [32] or by sampling at different time points [33]. The results from bulk data however tend to provide rather low-resolution indications of the evolutionary history and heterogeneity [34], [35] because low-frequency mutations cannot be reliably separated into new clones and tend to be placed together or in existing clones. Again, multiple samples can help to improve the resolution.

Fig. 1

(a) Schematic representation of the clonal expansion that shaped the heterogeneous tumour depicted in (b). The colours of the cells represent their belonging to the different subclones. The small stars inside the cells represent the mutations present. (c) Two bulk samples admixed with normal cells (empty grey circles) taken from the tumour in (b). The bar plots depicted next to the samples can be derived from variant allele frequency data obtained by bulk sequencing. Each bar represents the estimated cellular prevalence of one mutation present in the sample. Note that the dark purple mutation on the bottom left of (a) is absent from the frequency plots because its frequency is too low to be detected. (d) Mutation histories compatible with the cell prevalences of sample 1 or sample 2. (Not all compatible trees are depicted.) The two trees in the intersection are compatible with both samples. It cannot be inferred from the given data that the left one is the true history that matches the clonal expansion in (a).

To arrive at the highest possible resolution of a tumour's history, the sequencing of individual cells has been advocated [35]. All cells in the body and in tumours descend along a binary genealogical tree of which the cells themselves are the taxa, as depicted in Fig. 2. Reconstructing the tree then requires no deconvolution. It does, though, require that mutations, once they arise, are preserved from generation to generation and that they may only occur once in the evolutionary tree, also known as the infinite sites assumption. With this assumption and perfect calling of the mutations in each cell, the phylogeny can be reconstructed very efficiently [36]. The challenge with single-cell data though is that the errors in mutation calling can be very large, and unbalanced. 
In particular, when the single copy of a cell's DNA is amplified to allow it to be sequenced, the coverage may be rather uneven so that some genome positions cannot be called and are effectively missing. Due to feedback in the amplification, one allele may happen to predominate at certain genomic positions so that mutations on the other allele do not appear in the sequencing data. Algorithms have therefore been developed to specifically deal with single-cell data which we review in Section 4 after discussing the advances in single-cell sequencing in Section 3. An overview of the sequencing and phylogenetic reconstruction processes for both bulk and single-cell samples is presented in Fig. 3.
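The role of the infinite sites assumption under perfect mutation calling can be illustrated with a small Python sketch. It is not the linear-time algorithm of [36]; it uses a simpler pairwise test, and the function name and toy matrices are assumptions made for this example. With a mutation-free root, a binary cell-by-mutation matrix fits a perfect phylogeny exactly when the sets of cells carrying any two mutations are either nested or disjoint.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Pairwise test for a rooted perfect phylogeny (illustrative sketch).

    matrix: list of rows, one per cell, each a list of 0/1 mutation calls.
    Returns True when every pair of mutations has nested or disjoint
    carrier sets, which is the condition imposed by infinite sites.
    """
    n_mutations = len(matrix[0]) if matrix else 0
    # For each mutation, collect the indices of the cells that carry it.
    carriers = [
        frozenset(i for i, row in enumerate(matrix) if row[j])
        for j in range(n_mutations)
    ]
    for a, b in combinations(carriers, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False
    return True


# Hypothetical error-free data: mutation 0 is clonal, 1 and 2 are subclonal.
clean = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]

# Same data with one simulated allelic dropout (mutation 0 lost in cell 1).
with_dropout = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]

print(admits_perfect_phylogeny(clean))
print(admits_perfect_phylogeny(with_dropout))
```

In this toy example a single false negative is enough to break the compatibility of the matrix, which is one way to see why error-aware methods are needed for single-cell data.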

Fig. 2

From the heterogeneous tumour from Fig. 1 depicted in (a) which has evolved following the schematic representation in (b), the 10 single cells shown in (b) are selected for sequencing. One cell is normal tissue while the remaining nine cells from the tumour contain additional mutations represented by the stars in the cells. The cells belong to a binary genealogical tree as in (c) where they are connected at their common ancestors. The exact nature of the branch points cannot necessarily be determined by the mutations each cell possesses, for example the three cells on the left can have any arrangement as long as they are all below the purple mutation which distinguishes them from other cells. The representation in (c) is a sample genealogical tree focussing on the relationship between the cells themselves while an equivalent representation is presented in (d). Here the mutations are encapsulated in nodes on a tree with the samples attached as leaves to create a mutation tree. This representation emphasises the ordering and evolutionary history of the mutations.

Fig. 3

Left: Overview of the typical work flow for the reconstruction of mutation histories from bulk tumour samples. DNA is extracted from a bulk sample and sequenced to reveal the admixed mutation profile. Clustering mutations by variant allele frequencies reveals possible subclones and their relative frequency in the admixed sample. Based on this information compatible mutation histories are inferred. Right: Overview of the typical work flow for the reconstruction of mutation histories from single-cell samples. The DNA is extracted from the individual cells and amplified due to the limited starting material. This process does not amplify all genomic sites equally well. The amplified DNA material is then sequenced and mutations are called. The mutation profiles of the individual cells are now combined into a single (noisy) character state matrix that is then used for tree inference.

Bulk sequencing phylogeny approaches

Due to the higher prevalence of bulk-sequencing data, most approaches to reconstruct evolutionary histories of individual tumours are based on this data type. Sequencing the admixed cell populations of hundreds of thousands or even millions of cells that compose a bulk sample only reveals the allele frequencies of the individual mutations in the mixture leaving the number of present subclones, their prevalences, their individual mutation profiles and their genealogy undetermined [35]. Phrased in terms of classic phylogeny reconstruction, this is a situation where the number of taxa, their relative population sizes, their individual character states, as well as their phylogenetic relationships need to be established, while the only information available is the set of characters and an estimate of their relative frequencies across the admixed populations. This constitutes a highly underdetermined problem for which classic approaches to phylogeny reconstruction are not suited. Hence many tools customised to this problem have been developed in the past years.

Phylogeny reconstruction from SNV data

An overview of software tools for reconstructing tumour evolution based on single-nucleotide variant (SNV) data is given in Table 1. We discuss in the following the shared and distinctive features of the underlying methods.

Table 1

Software	Year	Reference	Phylogeny	Multiple samples	Inference
TrAp	2013	[37]	Y	N	Exhaustive search
Clomial	2014	[31]	N	Y	Binomial/EM
PhyloSub	2014	[32]	Y	Y	Tree-structured stick-breaking/MCMC
PyClone	2014	[38]	N	Y	Dirichlet process, beta-binomial/MCMC
RecBTP	2014	[39]	Y	N	Approximation algorithm
SciClone	2014	[40]	N	N	Beta mixture model
AncesTree	2015	[41]	Y	Y	Optimisation/MILP
CITUP	2015	[42]	Y	Y	Optimisation/QIP
LICHeE	2015	[43]	Y	Y	Heuristic
BayClone	2015	[44]	N	Y	Gibbs sampling/Metropolis-Hastings
CTPsingle	2016	[45]	Y	N	Dirichlet process, beta-binomial/MCMC
Cloe	2016	[46]	Y	Y	Metropolis-coupled MCMC

Clonal reconstruction methods based on SNV bulk data. Abbreviations: EM, expectation maximisation; MCMC, Markov chain Monte Carlo; MILP, mixed integer linear programming; QIP, quadratic integer programming.

An important preprocessing step for reconstructing tumour phylogenies from SNV data is the correction of allele frequencies for ploidy aberrations - due to copy number alterations (CNAs) or loss of heterozygosity (LOH) - to estimate the cellular prevalences of the mutations [38], [47]. In practice many SNV based approaches focus on mutations at copy number neutral sites [39], [40], [41], [42], [45], in which case the cellular prevalence of heterozygous mutations is just two times their relative allele frequency.

A key assumption shared by nearly all approaches focusing on phylogeny reconstruction from SNV data is that of infinite sites which restricts the space of possible mutation histories in two ways: First, no genomic site is hit by more than one mutation throughout the entire evolutionary history of a tumour, and second, once present, a mutation persists in the whole lineage founded by the cell where it initially occurred. The motivation for this assumption is mainly its plausibility given the size of the genome and the relatively low number of mutations observed in tumour samples. However it also has the welcome side-effect of reducing the underdetermination of the deconvolution problem and the tree search space.

The next step common to most SNV based approaches is a clustering of mutations with approximately equal allele frequencies. Some approaches use Bayesian mixture models for this step [47], [48]. The assumption behind the clustering is that variants with identical frequency are either both present or both absent in every subpopulation. A scenario for such a connection to arise could be a driver mutation occurring in a cell with a pre-existing passenger mutation. 
Then the increased fitness of the cell with the driver and its descendants may have led to the extinction of all cells carrying only the passenger mutation. For mutation sets with a shared cell prevalence >50% such a connection is the only way they can fit on a single tree. This follows from the infinite sites assumption, which prevents mutations from being split onto separate tree parts, and the pigeon hole principle by which some cell population of the tumour has to have both mutations as the sum of cell prevalences can not exceed 100%. For smaller cell prevalences - especially for low-frequency mutations - it is less obvious why the assumption should be generally true. Two low frequency mutations could have the same approximate cell prevalence by chance without the driver/passenger link described above and could still be erroneously clustered together. It has been shown that the deconvolution problem can be solved without grouping mutations by cellular prevalence [37]. However the complexity of the problem increases significantly with increasing numbers of subclones, and indeed Strino et al. could only solve instances of up to 25 aberrations [37], such that tree inference would in most cases be restricted to a selection of mutations. Once the clustering is fixed, the remaining task is to arrange the mutations in a tree consistent with the cell prevalences of the mutations. The mutation states of the subclones and their relative frequencies in the sample follow immediately from the consistent tree. Consistency here means that the cellular prevalence of each node is at least as large as the sum of the prevalences of its child nodes. This is necessary as the nodes are then interpreted as subclones that contain all the mutations along the path from the root to this node, such that the prevalence of a mutation at a node has to be shared with the whole subtree below the node. This constraint is also referred to as the ‘sum rule’ [32]. 
While it substantially restricts the solution space, it is typically not enough to find a unique solution. For example, a linear chain of mutations sorted by decreasing prevalence is always consistent with a single sample. Biologically motivated constraints, such as minimising the number of populated subclones or the tree depth can be used to pick plausible topologies [37], [39]. Here it is also advantageous that studies increasingly analyse multiple samples per patient. These could either be from spatially distinct tumour parts [49], tumour metastasis pairs, or longitudinal studies such as tumour/relapse pairs [20], or xenograft models [50]. When multiple samples of the same tumour are available, there is a second constraint, the ‘fork rule’, which states that if among two mutations, the first is more prevalent in one sample and the second in another sample, they need to be placed in separate branches [32]. In general, the more samples available the more topologies can be excluded, as long as their subclone composition differs sufficiently. However, in practice this process is complicated by inaccuracies in the estimated cell prevalences and possible errors in the clustering due to which no tree may be consistent with all data. One solution here is to find a tree that minimises the errors in the estimated cell prevalences to fit them to a tree [32], [42], or to exclude some mutations from the tree [41]. While all SNV based reconstruction approaches make use of the combinatoric constraints, they employ vastly different methodologies. Three major lines can be identified: Some perform an exhaustive search enumerating all trees that fulfil the combinatoric constraints plus additional biological restrictions [37] or an approximation thereof [39]. 
Others represent the constraints via a directed ancestry graph, which contains the optimal solutions in the form of spanning trees [41], [43], and finally there is a group of Bayesian approaches that give a posterior distribution over the tree space, thereby quantifying uncertainty in the inference [32], [45]. Recently another Bayesian approach for tree inference has been proposed that merely penalises trees for violations of the infinite sites assumptions instead of generally excluding them [46]. For high-frequency subclones, tree reconstruction from SNV bulk data has sufficient discriminative power to reveal their evolutionary relationships. However for low-frequency populations, the signal in the admixed variant allele frequencies seems to be too weak for a reliable reconstruction [35]. Also the clustering by allele frequency is less convincing for low-frequency mutations leaving their correct placement in the tree a largely unsolved problem. Advances in the sequencing technology towards longer reads may provide further constraints in the future, as mutations located on a single read can not be placed in different tree branches.
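The combinatorial constraints discussed in this section can be made concrete with a short Python sketch. It is a toy illustration under stated assumptions, not the procedure of any tool in Table 1: the function names, the cluster labels A, B, C and the prevalence values are all hypothetical. It converts a variant allele frequency into a cellular prevalence for a heterozygous mutation at a copy number neutral site, and checks the sum rule for a candidate tree in each sample.

```python
def cellular_prevalence(vaf):
    """Heterozygous SNV at a copy number neutral site: twice the VAF."""
    return 2.0 * vaf


def satisfies_sum_rule(parent, prevalence, tol=1e-9):
    """Check the sum rule for one sample.

    parent: dict mapping each mutation cluster to its parent cluster
            (None for the root cluster).
    prevalence: dict mapping each cluster to its cellular prevalence.
    A tree is consistent when every node is at least as prevalent as the
    sum of its children.
    """
    child_sum = {node: 0.0 for node in parent}
    for node, par in parent.items():
        if par is not None:
            child_sum[par] += prevalence[node]
    return all(prevalence[n] + tol >= child_sum[n] for n in parent)


def consistent_in_all_samples(parent, samples):
    """A tree must satisfy the sum rule in every sample simultaneously."""
    return all(satisfies_sum_rule(parent, s) for s in samples)


# Two candidate histories for three hypothetical mutation clusters.
chain = {"A": None, "B": "A", "C": "B"}   # A -> B -> C
fork = {"A": None, "B": "A", "C": "A"}    # B and C on separate branches

# Made-up cellular prevalences in two samples of the same tumour.
sample_1 = {"A": 0.9, "B": 0.5, "C": 0.3}
sample_2 = {"A": 0.8, "B": 0.2, "C": 0.5}

print(consistent_in_all_samples(chain, [sample_1]))
print(consistent_in_all_samples(fork, [sample_1]))
print(consistent_in_all_samples(chain, [sample_1, sample_2]))
print(consistent_in_all_samples(fork, [sample_1, sample_2]))
```

With sample 1 alone both histories pass, mirroring the ambiguity of single-sample data. Adding sample 2, in which C is more prevalent than B, leaves only the branching history, which is the situation the fork rule describes. Real data would additionally require tolerance for noisy prevalence estimates.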

Phylogeny reconstruction from SNV and CNA data

There exist a few approaches such as THetA [51], THetA2 [52] and TITAN [53] that use CNA data alone to infer subclones, but none of them reconstructs tumour phylogenies. More recently CNA and SNV data have been combined to increase the discriminative power in the reconstruction process. A summary of methods following this strategy and their key features is given in Table 2.

Table 2

Clonal reconstruction methods based on SNV and CNA bulk data. Abbreviations: HMM, hidden Markov model; MCMC, Markov chain Monte Carlo.

Software	Year	Reference	Phylogeny	Multiple samples	Inference
CHAT	2014	[54]	N	N	Dirichlet process Gaussian mixture model/MCMC
CloneHD	2014	[55]	N	Y	HMM/local optimisation
SubcloneSeeker	2014	[56]	Y	Y	Exhaustive enumeration
PhyloWGS	2015	[58]	Y	Y	Tree-structured stick-breaking/MCMC
SCHISM	2015	[57]	Y	Y	Likelihood ratio tests/genetic algorithm
SPRUCE	2016	[59]	Y	Y	Exhaustive enumeration
CANOPY	2016	[60]	Y	Y	MCMC

The methods CHAT [54] and CloneHD [55] estimate cellular prevalences of both SNVs and CNAs but do not set them into a phylogenetic context. SubcloneSeeker infers trees based on cellular prevalences of both SNV and CNA data [56]. However it relies on other tools to accurately estimate these prevalences in a preprocessing step and is restricted to two samples such as tumour/relapse pairs. SCHISM [57] also relies on pre-established cellular prevalences. The inference is then a two-step process: It first uses a hypothesis testing framework to establish subclones and their pairwise relationships and then applies a genetic algorithm to find a matching phylogeny. PhyloWGS [58] extends the probabilistic framework of PhyloSub [32] to integrate copy number information. It is also the first approach to model overlaps between CNA and SNV data. Estimates of CNA copy number status and population frequencies are required as input which are then used to transform sites affected by a CNA, or by a CNA and SNV, into pseudo-SNV sites to apply the SNV based probabilistic tree inference method of PhyloSub. All of the tree inference approaches discussed so far make the infinite sites assumption which should be revisited in the context of copy number changes. Since these events typically affect larger segments, the likelihood of two of them overlapping is not negligible. Likewise the chance of a mutated allele being lost by a segmental loss is much higher than that of a point mutation reverting it back to its original state. Neither scenario is compatible with the infinite sites model such that it is debatable whether the assumption is still safe to make. SPRUCE [59] relaxes the assumption to a model where a mutation can change its state multiple times but cannot twice attain the same state independently in the tree. 
This restriction is known as the infinite alleles assumption or multi-state perfect phylogeny. While this is a step in the right direction, it still overlooks many plausible scenarios, such as a site undergoing a copy number change that is later reverted. CANOPY [60] solves the issue of recurrent mutation states in a different way: While it nominally keeps the infinite sites assumption, it restricts the scenarios in which it could be violated to such a small number that the assumption becomes reasonable again. For example a mutation event would only be considered as recurrent when it sets the exact same genomic segment to the exact same copy number state in different parts of the phylogeny. As the endpoints of the segments are defined at the resolution of nucleotide positions, such a recurrence is unlikely to be observed. In contrast to the other methods discussed so far, CANOPY is also the only one to recognise that copy number alterations are interdependent and should be modelled as sequences of events rather than as independent changes of chromosome segments. This view on genome evolution will become even more useful once tree inference models start to consider structural rearrangements and their potential in confounding read-depth data. Pioneering work in this direction was performed by Greenman et al. [61] and Purdom et al. [62]. Neither of these two studies focuses on tree construction, but they estimate the order of genomic rearrangement events. Many of the concepts introduced in these works, such as the use of external linkage information (e.g. HapMap data) for phasing and the assignment of copy numbers to one of the physical alleles [61], may be worthwhile to integrate in future approaches to reconstruct mutation histories of tumours from bulk sequencing data. An approach for phasing using only major and minor allele copy number profiles was recently suggested by Schwarz et al. [63]. 
Besides the phasing, it computes the tree topology and assigns genomes to ancestral states based on the minimum evolution criterion.

Single-cell advances

After the arrival of NGS and the accompanying drop in price of obtaining genomic information, efforts to understand tumour diversity were epitomised by the collection and archiving of thousands of tumour samples by TCGA [11] and the ICGC [12]. Efforts were later also underway to understand intra-tumour diversity at full resolution by sequencing individual tumour cells. The technical advances are reviewed for example in [64], [65] and expounded in [66], and here we focus on their use to uncover tumour heterogeneity from a modelling perspective.

Single-cell sequencing

The first results for single-cell genomics were for mRNA sequencing of a mouse blastomere [67] where the major challenge was to have sensitive enough sequencing for the small amount of primary material. For DNA this involves amplifying the initial single copy enough to be passed on to sequencers. The first successful results [68] used a modified version of PCR for the initial amplification, before further PCR amplification and sequencing. The low resulting coverage (≈ 10%) allowed for the identification of copy number variations, but not high confidence mutation calling. Higher coverage was then quickly achieved through the use of multiple-displacement amplification (MDA) [69], [70], [71], [72] allowing the identification of SNVs. The MDA process involves the attachment of randomly primed Φ29 enzymes which synthesise DNA to create additional and displaced strands, which may then themselves be further amplified. From a modelling perspective the amplification of the two original alleles is more akin to a Pólya urn model: starting with two balls representing the genomic base on each allele, repeatedly one ball is selected at random, duplicated and returned with the duplicate to the urn. This feedback in the MDA process can also lead to rather non-uniform coverage. Sites with low coverage cannot be reliably used for SNV calling, leading to high levels of missing data in early experiments (≈ 60% in [69]). To obtain higher uniformity, although at the cost of higher error rates, hybrid amplification methods have also been developed and utilised [73], [74], [75], [76], [77]. Using cells where the DNA had just duplicated [78] reduced the amount of early amplification needed leading to lower error and missing data rates and can be part of the single nucleus exome sequencing (SNES) protocol of [79]. 
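The Pólya urn picture of amplification described above can be simulated in a few lines of Python. This is a deliberately simplified sketch: the number of duplication steps, the number of simulated sites, the 10% cutoff and all function names are assumptions chosen for illustration, and real MDA chemistry is not literally an urn.

```python
import random


def polya_urn_fraction(n_draws, rng):
    """Fraction of copies from allele 0 after n_draws duplication steps.

    The urn starts with one ball per allele. At each step one ball is
    drawn at random, duplicated, and both copies are returned.
    """
    counts = [1, 1]
    for _ in range(n_draws):
        total = counts[0] + counts[1]
        pick = 0 if rng.random() < counts[0] / total else 1
        counts[pick] += 1
    return counts[0] / (counts[0] + counts[1])


def simulate(n_sites=2000, n_draws=500, seed=0):
    """Run independent urns, one per hypothetical heterozygous site."""
    rng = random.Random(seed)
    return [polya_urn_fraction(n_draws, rng) for _ in range(n_sites)]


def extreme_share(fractions, cutoff=0.1):
    """Share of sites where one allele makes up less than `cutoff`."""
    skewed = sum(1 for f in fractions if f < cutoff or f > 1.0 - cutoff)
    return skewed / len(fractions)


fractions = simulate()
print(f"share of strongly skewed sites: {extreme_share(fractions):.3f}")
```

For an urn started with one ball of each colour, the final allele fraction is spread roughly uniformly between 0 and 1 rather than concentrating near one half, so a sizeable share of sites ends up strongly skewed towards one allele. This is the feedback effect that the urn analogy is meant to capture.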
With current techniques, single-cell sequencing (SCS) provides high coverage and low false positive rates, but the largest source of uncertainty comes from allelic dropout (AD) where one strand (or part of it) does not get amplified (or not sufficiently) in the early stages and is not detectable in the final sequencing. AD, which leads to false negatives, has fallen from highs of 40% or more [69], but rates currently remain in the range of 10–20%. False negatives therefore remain a very important component for any modelling of SCS data. Although the false positive error rates are low (≲ 10⁻⁵) many base positions can be tested across the whole exome or genome so that the total number of falsely detected SNVs may still be in the hundreds or thousands per cell. For cells from the same tumour sample, a simple consensus of SNVs across two or more cells reduces the error rates back to low values, which is fortuitous from a modelling perspective because mutations observed in only one cell are also uninformative for reconstructing the evolutionary history of the tumour. Since SNVs are selected for analysis when they are detected, the false positive rate among them may be enriched compared to the per base pair error rate of the SCS technique. An exciting alternative to whole exome sequencing (WES), or whole genome sequencing, of each single cell to reduce the cost while offering low error rates was to first perform deep bulk sequencing and to liberally select sites which may possess a mutation. A personalised panel was then developed for 6 leukaemia patients to use for the final sequencing and mutation calling [80]. The preselection of sites to test reduces the enrichment of false positives, but AD and other false negatives still occur during the amplification. 
A further alternative to amplifying the DNA of single cells is to culture individual cells (as done for organoids [81], [82]) before harvesting a large number and performing standard bulk sequencing with the downside that culturing will bias the sample by selecting for viable cells, and may introduce new mutations. Before individual cells can have their DNA amplified and sequenced, the cells themselves need to be isolated first. One approach has been to collect circulating tumour cells (CTCs) from blood samples which for DNA experiments first had low coverage for CNA calling [83], [84], [85] and later with WES [86]. For primary tumour cells, early experiments focussed on micropipetting [69], [70], [73], [74], [87] or nuclei sorting [68], [78], [88]. Higher throughput experiments, combined with panel sequencing, have turned to microfluidics [80] or FACS [89], [90]. Barcoding methods [91] are also promising to increase the scope of SCS at lower costs. Microwells or drops combined with barcoded beads [92], [93] now allow the parallel RNA sequencing of thousands of cells. A more recent version of barcoding for DNA sequencing [94] offers the possibility to sequence 48–96 cells simultaneously broadening the scope of single cell sequencing experiments. High-throughput protocols also offer the joint RNA and DNA sequencing of single cells [95]. However the individual cells are isolated, a key point in SCS experiments is to verify that the cells are indeed unique. Any doublet samples obviously break the single cell assumption at the heart of methods designed specifically to analyse single-cell data. Some cell isolating techniques may have high rates of doublet sampling in the range of 10–40% [96] which are important to control experimentally and to bear in mind when modelling.
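The false positive arithmetic from earlier in this section can be worked through in a short Python sketch. The exome size, the number of cells and the assumption that false positives arise independently across cells are all simplifications introduced here for illustration, and the function names are hypothetical.

```python
def expected_false_positives(rate_per_site, n_sites):
    """Expected number of falsely called sites in a single cell."""
    return rate_per_site * n_sites


def prob_called_in_at_least_two(rate_per_site, n_cells):
    """Probability that one site is falsely called in two or more cells.

    Assumes independent errors across cells (binomial model).
    """
    p = rate_per_site
    none = (1.0 - p) ** n_cells
    exactly_one = n_cells * p * (1.0 - p) ** (n_cells - 1)
    return 1.0 - none - exactly_one


EXOME_SITES = 3e7      # assumed rough exome size in base pairs
RATE = 1e-5            # per-site false positive rate, upper range in the text
N_CELLS = 50           # assumed number of sequenced cells

per_cell = expected_false_positives(RATE, EXOME_SITES)
after_consensus = EXOME_SITES * prob_called_in_at_least_two(RATE, N_CELLS)

print(f"expected false positives per cell: {per_cell:.0f}")
print(f"expected sites passing a two-cell consensus: {after_consensus:.1f}")
```

Under these assumptions a per-site rate of 10⁻⁵ yields a few hundred false calls per cell across the exome, whereas requiring support from at least two cells leaves only a handful across the whole dataset. Correlated errors, which the independence assumption ignores, would make the consensus less effective than this estimate suggests.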

Single-cell histories

Once the single cells have been sequenced, and the mutations or copy number events uncovered with standard bioinformatics pipelines, one focus is on understanding the evolutionary history of tumours and their diversity. We highlight some of the key datasets, with their characteristics summarised in Table 3, and how the single-cell phylogenetic history informed their analysis.

Table 3

Cancer type	Year and reference	Number of patients	Number of samples	Number of mutations	Number of cells	False positive rate	Allelic drop out rate	Missing data
Myeloproliferative neoplasm	(2012) [69]	1	1	712	58	6.04 × 10⁻⁵	0.4309	58%
Kidney	(2012) [70]	1	1	35	17	2.67 × 10⁻⁵	0.1643	22%
Bladder	(2012) [71]	1	1	443	44	6.7 × 10⁻⁵	0.4	55%
Colon	(2014) [87]	1	1	176	63	< 1 × 10⁻⁴	> 0.5	–
Breast	(2014) [78]	2	1	40/519	47/16	1.24 × 10⁻⁶	0.0973	1%
Leukemia	(2014) [77]	3	1	≤ 1953ᵃ	11–12	–	0.12	28%
Leukemia	(2014) [80]	6	1	10–105	96–150	–	≤ 0.3	–
Breast (and xenografts)	(2015) [50]	2	2/3	37/45ᵇ	120/90	–	≈ 0.2	7–12%
Ovarian (intraperitoneal)	(2016) [97]	3	4–5	23–33ᵇ	420–672	–	–	–

ᵃ The number of mutations listed for [77] refers to the number of loci sequenced.

ᵇ The number of mutations only indicates those uncovered in targeted panels of 40/45 SNVs for [50] and of 43–84 SNVs for [97].

Characteristics of some single-cell sequencing datasets. The number of samples is per patient. The number of cells, also per patient, only includes those which passed quality control and were used for mutation calling. The false positive and allelic dropout rate estimates are per genomic position. The number of mutations excludes those which occur in only one cell, which are uninformative for the phylogenetic reconstruction. They may, however, include mutations occurring (or with missing data) in all cells, which are also uninformative. These have been removed from the count of [70] and do not occur for the ER+ tumour of [78] or in any of the patient samples from [80].

One of the first single-cell datasets comes from a JAK2-negative myeloproliferative neoplasm [69], where PCA was employed to uncover a likely monoclonal origin of the tumour. The authors also found that the patient-specific mutations did not coincide with the genes commonly implicated in that tumour type. A kidney cancer sample [70] was published back-to-back, and neighbour-joining (NJ) [98] uncovered no real evidence of clonal subpopulations. However, there was large diversity in mutations, suggesting an accumulation of passenger mutations. The cancer cells were also close to the non-tumour controls, indicating a short time frame for the cancer's progression. The first evidence for a branching mutation history in single-cell data was discovered in a bladder cancer [71] using hierarchical clustering. This revealed two main subclones which seemed to be outgrowing the ancestral clone, since they appeared late in the tumour development but still made up sizeable proportions of the tumour itself.
Hierarchical clustering was also employed on a colon cancer sample [87], which uncovered a minor clone alongside a much larger main clone. The main clone possessed early mutations in TP53 and APC, which are highly prevalent in colon cancer, but they were missing in the minor clone, pointing to it having a distinct origin and separate development.

Advances in SCS technology led to better coverage and lower error rates for two breast cancer samples [78]. Phylogenetic histories were reconstructed with NJ. Since copy number analysis was also performed on the same single cells, the authors could uncover an early phase of aneuploid rearrangements followed by clonal expansion dominated by point mutations. For one sample they saw a linear progression of clonal expansions, while for the second sample the clones separated into subclones, with one subclone founded by another aneuploidy event. This combination of copy number and SNV calling on the same individual cells highlighted how both sets of information can be combined to improve the understanding of the phylogenetic history.

Single cells were analysed from three leukaemia patients [77]. In particular, the authors compared different SNV callers, opting for joint calling across samples, and specifically sequenced doublet samples to test for their contamination in the single-cell data. To infer the phylogenetic history, they learnt a maximum likelihood tree from the genetic distances between each pair of single cells. The evolution was mostly linear (with major subclones for one patient sample) but also exhibited low frequency heterogeneity and branching.

Since SNV callers (like [99], [100], [101], [102], [103], [104], [105]) are aimed at uncovering variants of different frequencies from bulk sequencing data, they are less applicable to single-cell data, where the underlying number of copies of any variant is a (low) integer but the amplification and sequencing are much noisier.
To account in particular for the non-uniform coverage of SCS, [106] clustered the reads to correct for errors. More recently a mutation caller designed for single-cell data has been developed [107] which explicitly treats the underlying mutation states in a single cell, allowing it to outperform bulk SNV callers.

For single-cell samples from 6 leukaemia patients (from targeted panel sequencing), [80] took the other direction and modified the phylogenetic reconstruction to account for the particularities of single-cell data. With high dropouts from the MDA step before sequencing, the error rates in single-cell data are highly unbalanced. The distance-based approaches employed before (whether in constructing a tree, in hierarchical clustering or NJ) implicitly weigh both kinds of errors equally, which can adversely affect the reconstruction. Instead, [80] introduced a binomial mixture model to cluster the single-cell genotypes, where the probability of a mutation or its absence varies for each cluster according to the data. Once clustered, the phylogeny can be found as the minimum spanning tree, which for five of the six patient samples featured coexisting high-frequency clones. Often the ancestral clones were also still present in the population. Along with the phylogenies, the clustering highlighted cells sharing mutations from different lineages, indicating that they were the result of doublet sampling. More recently, the clustering in [80] was refined to a variational Bayes approach [108] which could also explicitly model the presence of doublet samples. The clustering, however, like in [80], was performed without enforcing a phylogeny.
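To make the idea of clustering noisy genotypes with a mixture model concrete, the following is a minimal sketch and not the model of [80] or [108]. It assumes a fixed number of clusters, binary genotypes without missing entries, independent Bernoulli probabilities per mutation and cluster, and a plain expectation maximisation fit. The function name `em_bernoulli_mixture`, the clamping constant `eps`, the starting profiles and the toy data are all hypothetical.

```python
import math


def em_bernoulli_mixture(cells, init_profiles, n_iter=50, eps=0.05):
    """Cluster 0/1 genotype rows with a mixture of Bernoulli profiles (illustrative)."""
    k = len(init_profiles)
    profiles = [list(p) for p in init_profiles]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each cell.
        resp = []
        for cell in cells:
            scores = []
            for w, prof in zip(weights, profiles):
                log_p = math.log(w)
                for obs, theta in zip(cell, prof):
                    log_p += math.log(theta if obs == 1 else 1.0 - theta)
                scores.append(log_p)
            top = max(scores)
            unnorm = [math.exp(s - top) for s in scores]
            norm = sum(unnorm)
            resp.append([u / norm for u in unnorm])
        # M-step: update cluster weights and per-mutation probabilities.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = max(mass / len(cells), 1e-9)
            for m in range(len(profiles[c])):
                theta = 0.5
                if mass > 0:
                    theta = sum(r[c] * cell[m] for r, cell in zip(resp, cells)) / mass
                # Clamp so that an observed error never has probability zero.
                profiles[c][m] = min(max(theta, eps), 1.0 - eps)
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, profiles


# Toy data: two clones, with one dropout in the third and in the sixth cell.
cells = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1],
]
init = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]]
labels, profiles = em_bernoulli_mixture(cells, init)
```

Because each cluster learns its own probability of seeing a mutation, a cell with a dropout is still assigned to the clone it resembles, instead of being pushed away as it would be under an unweighted distance.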
After performing deep bulk sequencing on primary tumours and derived xenograft lines from 15 patients, and studying their clonal composition and dynamics with PyClone [38], two examples were selected in [50] for high resolution follow up with SCS: one with strong initial selection upon transplantation, and one with complex clonal evolution through the xenograft generations. For the SCS a targeted panel was designed for each example based on mutations detected with the bulk sequencing. For inferring the tree structure of the single cells, the Bayesian phylogenetic approach of [109] was employed. The resulting single-cell phylogenies were mainly used to corroborate the genotype clusters found by PyClone from the bulk sequencing, but with the advantage of also providing the ancestral histories of the clones. For the example with strong initial selection, the single cell data indicated complete separation between the primary tumour and a late xenograft sample and that the xenograft clone was founded by a very minor clone of the original tumour. The other example showed complex clonal evolution with two main lineages. The second lineage expanded heavily during the second xenograft generation to then vanish compared to further generations of the first lineage. Likewise utilising SCS to enrich bulk sequencing data, the intraperitoneal spread of high-grade ovarian cancer was examined over 68 samples from 7 patients in [97]. For three patients, each with 4 or 5 spatially distinct samples, a total of 1680 single cells were isolated and subjected to targeted sequencing of a small number of genomic sites. The clonal composition of those tumours was inferred from the single cells using the clustering method of [108]. This augmented the bulk clustering analysis by providing higher quality genotypes. 
From the phylogenetic analysis of the multiple spatial samples for each of the 7 patients, the nature of the clonal spread from the ovaries to the intraperitoneal sites could be uncovered [97]. Particularly striking was that along with the five patients exhibiting monoclonal seeding, two patients exhibited reseeding and polyclonal spread. As well as indicating different possible modes of peritoneal spread, this could also suggest that the different microenvironment of the peritoneal cavity leads to novel selective pressures on heterogeneous tumours.

Single-cell phylogenetic reconstruction

Along with approaches to call mutations in single cells [107] and cluster them [80], [108], a different direction has been to modify the phylogenetic inference to account for the specifics of single-cell data. All cells in a tumour live on a genealogical tree, Fig. 2 (c), where they connect with each other at their common ancestors. If we take the infinite sites assumption that the genome is essentially so long that there is no chance that the same position may mutate more than once in the entire tumour's history (which also means that no mutations are lost once they arise), then the mutations in the cells form a perfect phylogeny [36]. However, fast and straightforward phylogenetic algorithms, like hierarchical clustering, NJ, perfect phylogeny or distance-based tree constructions like a minimum spanning tree, can struggle or fail completely when presented with noisy data. Extensions of the perfect phylogeny problem exist to handle imperfect data, but typically aim to remove data until no inconsistencies remain. For example, they may find the minimum number of mutations to remove [110], [111] or the minimum number of sampled cells [112].

A further difficulty with single-cell data, and where these approaches still struggle, is that the errors are very unbalanced. In single-cell data, AD or false negative rates are generally over 10% while false positives are of the order of 10⁻⁵ or less. To account for this fully, probabilistic approaches have been introduced which select possible phylogenetic trees by how well they explain the single-cell data and which consider the full dataset with all of its inconsistencies and the errors due to the technical challenges of sequencing single cells. In particular, the methods start with a given tree, which allows one to check which cells should exhibit which mutations.
If a cell is supposed to possess a mutation under the tree model, but it is absent in the observed data this would be considered a false negative, with a probability of occurrence given by the false negative rate. Conversely if the tree model predicts no mutations, but one is observed, the model would indicate a false positive. Repeating this for all cells provides the joint probability of observing the data for that particular tree and error rates. This is the likelihood of obtaining the observed data under the tree model and naturally accounts for differences in the error rates. A common approach is to find the tree which maximises the likelihood and fits the data most closely. Alternatively, Bayes theorem may be employed to find the probability of the tree from the data as a measure of fit of the tree to the data. These underlying ideas link the methods developed for single-cell phylogenetic inference [113], [114], [115], [116] although the exact details of the models and their inference vary, as we summarise in Table 4 and now explore in some detail.
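The likelihood described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the implementation of any of the cited methods: genotypes are binary, errors are independent across cells and sites, and the genotypes predicted by a candidate tree are supplied directly as a matrix. The function name `log_likelihood` and the toy matrices are hypothetical.

```python
import math


def log_likelihood(observed, expected, fp_rate, fn_rate):
    """Log-probability of observed 0/1 calls given the genotypes a tree predicts.

    Rows are cells, columns are mutations. A None entry in `observed`
    marks missing data and is skipped.
    """
    log_p = {
        (0, 0): math.log(1.0 - fp_rate),  # predicted absent, observed absent
        (0, 1): math.log(fp_rate),        # false positive
        (1, 0): math.log(fn_rate),        # false negative
        (1, 1): math.log(1.0 - fn_rate),  # predicted present, observed present
    }
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for obs, exp in zip(obs_row, exp_row):
            if obs is None:
                continue
            total += log_p[(exp, obs)]
    return total


# Three cells, two mutations (toy data).
observed = [[1, 0], [1, 1], [0, 1]]
# Linear tree (mutation 2 below mutation 1): the third cell needs one false negative.
expected_linear = [[1, 0], [1, 1], [1, 1]]
# Branched tree (separate lineages): the second cell needs one false positive.
expected_branched = [[1, 0], [1, 0], [0, 1]]

ll_linear = log_likelihood(observed, expected_linear, fp_rate=1e-5, fn_rate=0.2)
ll_branched = log_likelihood(observed, expected_branched, fp_rate=1e-5, fn_rate=0.2)
```

With a false negative rate of 0.2 and a false positive rate of 10⁻⁵, explaining the data with one dropout is far more probable than explaining it with one false positive, so the linear tree scores higher. A distance-based method would treat the two discrepancies as equally costly.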

Table 4

Overview of single-cell phylogenetic methods. Abbreviation: MCMC, Markov chain Monte Carlo.

Method	Phylogenetic representation	Inference
Kim and Simon [113]	Mutation tree	Pairwise ordering and maximum spanning tree
BitPhylogeny [114]	Clonal tree	Tree-structure stick-breaking MCMC
OncoNEM [115]	Sample/clonal tree	Greedy structure search
SCITE [116]	Mutation tree (a)	MCMC

(a) SCITE [116] provides the option of using the sample tree representation.

Despite the elevated error rates, an advantage of single-cell data is that, assuming diploid cells and the infinite sites assumption, mutations should be present in either none or one of the alleles, rather than at arbitrary frequencies, and these are the only two cases that need to be tested. Of course, the presence of mutations across single-cell samples is not independent, but related by the phylogenetic history, and in general the challenge is dealing with the vast number of trees that exist and finding optimal trees, or a good set of them.

The first probabilistic single-cell approach of [113] considered three mutation states for the data of [69]: wildtype, and heterozygous and homozygous variants. Homozygous variants are presumed to be the result of an allelic dropout of the normal allele so that only the alternative is amplified. The likelihood of [113] consisted of the probability of the three observable states given either of the two underlying states and the allelic dropout and false positive rates. For the trees themselves, the representation in terms of mutation trees, Fig. 2 (d), was employed with the aim of uncovering the mutation ordering and evolutionary history. Rather than examining the tree as a whole, first the pairwise ordering of each pair of mutations was considered [113]. In particular, the likelihood of the data when the pair of mutations are in the same or different lineages was computed. By simulating genealogical trees [Fig. 2 (c)], Monte Carlo estimates of the prior probability of mutations sharing a lineage were obtained, resulting in a posterior estimate of the probability of different relationships between each pair of mutations. In simulating genealogical trees, a parameter was introduced to model the relative time of the first branching event.
This parameter, which influences the prior distribution, was inferred from the data (an approach known as empirical Bayes). In order to build the mutation tree, first estimates for the pairwise ancestral relationships of all mutations were obtained. The maximal posterior ordering between each pair was encoded as an edge in a directed graph, weighted by the posterior probability. The mutation tree is then defined as the maximum spanning tree. Specifically, edges were removed to achieve a tree which maximised the remaining weights. Although this procedure returns a tree, it is not necessarily the tree with the highest likelihood as a whole model, since the ancestral relations inferred earlier behave more like parent-child relationships when embedded in the directed graph. For the 18 cancer-related mutations in the 58 single cells of [69], for example, the empirical Bayes estimate of the prior tree structure is highly linear while the resulting maximum spanning tree is rather branched.

BitPhylogeny [114] works on the sample tree representation, but rather than using the single cells as leaves, they are clustered together into clones. Since the number of clones and their composition is unknown, the number of nodes and branches in the cluster tree is also unknown. BitPhylogeny therefore considers in its search space all trees with an arbitrary number of clones. A prior for the trees is derived from a nested stick-breaking process following [117]. A stick, or unit interval, is chopped into many parts. Each part is then further divided with the same process, and this is repeated at all scales. At each stage the first part denotes a clone which is a child of the clone at the previous stage, providing the tree structure. The stick-breaking process involves parameters which influence the shape and number of clones in the prior distribution. The process has also been applied to bulk data [32], [58] and BitPhylogeny also includes a model for methylation data [114].
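As an illustration of the building block only, the sketch below breaks a single unit stick into weights; the nested, tree-structured process of [117] applies the same idea recursively and is not reproduced here. The choice of a Beta(1, concentration) distribution for each break, the function name `stick_breaking` and the parameter values are assumptions made for this example, not details taken from BitPhylogeny.

```python
import random


def stick_breaking(concentration, n_pieces, rng):
    """Break a unit stick into n_pieces weights plus the unbroken remainder."""
    weights = []
    remaining = 1.0
    for _ in range(n_pieces):
        # Fraction of the remaining stick assigned to the next piece.
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed so the draw is reproducible
weights, remainder = stick_breaking(concentration=2.0, n_pieces=5, rng=rng)
```

A larger concentration value tends to give many small pieces, and a smaller one tends to give a few dominant pieces, which is how such parameters shape the number and size of clones under the prior.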
Returning to the single-cell treatment, BitPhylogeny employs the Markov chain Monte Carlo (MCMC) inference scheme of [117]. Essentially one component, like the composition of the clones or the division of the stick at a particular stage, is updated while keeping the rest fixed. In the phylogenetic model of [114] the mutations occur along the edges of the clonal tree at the same rate. This leads to a transition probability of mutations accumulating across the phylogeny so that the appearance of mutations in descendant clones is treated probabilistically. For the inference of the tree itself these probabilistic appearances are averaged over so that the mutations become marginalised out. The combining of cells into clones can be seen as a way of correcting for the high error rates of SCS (like [80]) while respecting a phylogeny enforced by the tree framework. The MCMC sampling also provides a posterior distribution of trees and parameters, better representing the uncertainty in the phylogeny than a single maximum likelihood estimate. However, the inference scheme is relatively computationally costly, which might cause convergence issues for more intricate or larger clonal trees. For the example of the full 712 mutations uncovered in the data of [69], BitPhylogeny [114] finds one large clone consisting of 70% of the cells and some smaller clones that branch off near the root of the tree.

The more recent approaches [115], [116] returned to the full tree model with likelihoods given by the false positives and negatives. From there they take complementary paths: OncoNEM [115] focuses on the sample tree, Fig. 2 (c), by marginalising or averaging over the placement of mutations along the edges; SCITE [116] focuses on the mutation tree, Fig. 2 (d), by averaging over the attachment of sampled cells. The averaging serves to vastly simplify and speed up the tree inference, but a complete tree can be obtained from both approaches.
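The averaging over cell attachments can be illustrated with a small sketch in the spirit of the mutation tree approach. It is not the SCITE implementation. It assumes binary genotypes without missing data, a uniform prior over attachment points, and a mutation tree encoded as a hypothetical parent vector in which `-1` denotes the mutation-free root; the encoding must describe a valid tree without cycles.

```python
import math


def carried_mutations(parent, node):
    """Mutations carried by a cell attached at `node` (-1 is the mutation-free root)."""
    carried = set()
    while node != -1:
        carried.add(node)
        node = parent[node]
    return carried


def marginal_log_likelihood(parent, cells, fp_rate, fn_rate):
    """Tree log-likelihood with each cell's attachment point averaged out."""
    attachment_points = [-1] + list(range(len(parent)))
    total = 0.0
    for cell in cells:
        probs = []
        for node in attachment_points:
            carried = carried_mutations(parent, node)
            p = 1.0
            for m, obs in enumerate(cell):
                if m in carried:
                    p *= (1.0 - fn_rate) if obs == 1 else fn_rate
                else:
                    p *= fp_rate if obs == 1 else (1.0 - fp_rate)
            probs.append(p)
        # Uniform prior over where the cell attaches.
        total += math.log(sum(probs) / len(probs))
    return total


cells = [[1, 0], [1, 1], [0, 1]]   # toy data: three cells, two mutations
linear_tree = [-1, 0]              # mutation 1 is a child of mutation 0
branched_tree = [-1, -1]           # both mutations hang off the root

ll_linear = marginal_log_likelihood(linear_tree, cells, fp_rate=1e-5, fn_rate=0.2)
ll_branched = marginal_log_likelihood(branched_tree, cells, fp_rate=1e-5, fn_rate=0.2)
```

Since the attachments are summed out, a search only has to move through the space of mutation trees; the best attachment of each cell can be recovered afterwards by taking the largest term in the inner sum.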
For the phylogenetic inference, both methods utilise a search-and-score framework: OncoNEM with a greedy search and SCITE with a stochastic MCMC scheme. The latter can either provide a single maximum likelihood estimate or a full posterior sample accounting for uncertainty in the inferred trees. After the greedy search in the sample tree space, OncoNEM [115] then attempts to cluster similar cells together into clones in a second step to provide a clone tree like BitPhylogeny [114]. Both of the more recent methods [115], [116] allow error rates to be learnt from the data and significantly outperform previous single-cell approaches and bulk data methods applied to single-cell data.

The different choice of representation between sample and mutation trees as in Fig. 2 (d) is mainly one of interest: if the key question concerns the clonal composition of the tumour then a sample tree is more appropriate, while questions concerning the order and evolutionary history of the mutations are better answered with the mutation trees. The choice is also partly dictated by the nature of the single-cell data. Mutations which occur in only one cell, or in all of them, are not informative for the tree reconstruction (although they may still inform the inferred error rates). If the number of remaining mutations is much larger than the number of cells, then the sample tree representation can be much more computationally efficient. When the number of sampled cells dominates, mutation tree inference is much faster. This occurs for example with the leukaemia datasets of [80] and especially when a targeted panel is utilised as in [50], [97]. SCITE [116] offers the option to change the representation depending on the data.
In reanalysing previous data, both OncoNEM and SCITE were applied to the 58 sequenced cells of [69], with OncoNEM considering the full set of 712 SNVs and SCITE looking at the 18 cancer-related mutations or the set of 78 non-synonymous ones due to the different representations. Both found highly linear or sequential trees suggesting monoclonal evolution and trees with much higher likelihoods than those found previously in [113], [114] with the same data. OncoNEM [115] additionally considered the bladder cancer data set of [71], finding very similar results to the original paper, but refining the clonal composition. SCITE [116] found another highly linear tree for the kidney cancer data of [70], again suggesting monoclonal expansion, but a tree with a long trunk region followed by complex branching lower down for the higher quality ER+ breast tumour sample of [78]. This would be consistent with an early build up of mutations which fixate in the tumour before a more recent division into competing subclones.

Discussion

Studying the evolutionary history of tumours and their heterogeneity covers computational aspects from processing raw sequencing data to resolving the phylogeny. For bulk data, the discovery of the prevalence of mutations in the sample is reasonably accurate, except for low-frequency events. However, low-frequency mutations are common and could account for much of a tumour's diversity and be relevant for treatment. Deeper sequencing can help give better accuracy in distinguishing their prevalence and so in resolving their evolutionary history [118]. Apart from the difficulties in resolving low-frequency mutations, the main issue is with untangling the clonal structure from the mixture of DNA from a large number of cells. Computational approaches started by focusing on the clustering [31], [38], [47] or the phylogenetic [37], [39], [41] aspects before considering their inference jointly [32], [42], [58], [60].

For single-cell data, the deconvolution is no longer needed, but the need for extensive amplification of the initial DNA material, and feedback within the amplification process, introduce more noise in the sequencing data and make uncovering mutations harder. Computational approaches have each so far focused separately on one facet of single-cell data: mutation calling designed for the specifics of SCS [107], clustering to correct for errors in the calling [80], [108], or probabilistic phylogenetic methods tailored for those high (and unbalanced) errors [113], [114], [115], [116]. Mirroring the advances for bulk data, we can expect the next advances for single-cell based approaches to offer a holistic treatment for the process from sequencing to phylogeny, while also considering a larger range of mutation types.
A first step would be to account for the uncertainty in the mutation calling (as performed by [111] for bulk data and as can be extracted from [107] for single cells) in the input for the phylogenetic inference [115], [116], but overall the aim would be joint inference of the mutations and their phylogenetic structure. Along with combining the raw sequencing data with the tree reconstruction, models will also need to account for further technical errors in single-cell data, like the inadvertent sampling of doublets (as was recently considered in the clustering approach of [108]).

Another aspect concerns copy number and aneuploidy changes, which often occur in cancer evolution and can inform the tumour phylogeny. These raise a number of interesting challenges for single-cell data, both for the mutation calling, where the underlying frequencies can differ from those expected for diploid cells, and for the tree reconstruction, where such events can impact several mutations at once. For copy number variations in single cells, this problem also arises at the different scales of the gene and chromosome level. Algorithms have been developed to find the most parsimonious set of aberration events consistent with the data [119]. The data concerned were obtained using fluorescent imaging rather than sequencing, but sequencing data will only add higher resolution of small scale events down to SNVs. Since it has already been shown that CNAs and SNVs can be discovered from the same SCS data [78], we expect further and corresponding modelling frameworks to arise to deal with such data.

A further aspect that CNAs thrust into the spotlight is the infinite sites assumption, that mutations or aberrations only occur once in the evolutionary history and persist afterwards. Although a priori reasonable for sparse point mutations, this is not compatible with back mutations due to LOH.
Indeed, developing and employing a probabilistic model allowing for deletions and loss of mutations, bulk sequencing of ovarian cancer uncovered different CNAs affecting the same genomic regions providing routes to convergent evolution [97]. The copy number changes were still assumed to only occur once, a generalisation of the infinite sites assumption to infinite alleles [59]. Convergent evolution has also been observed at a gene level, with the same driver gene affected in different evolutionary lineages and spatial areas of tumours [120], [121], albeit with mutations at distinct genomic sites consistent with the infinite sites assumption. At the level of point mutations, the resolution of SCS actually allows one to test the persistence of mutations and for convergent recurrence [122]. Results from SCS datasets strongly indicate that the infinite sites assumption is frequently violated [122]. Although employed in the current single-cell phylogenetic methods [113], [114], [115], [116], and bulk methods, as it greatly simplifies the inference, this will need to be relaxed for more general models which capture the full complexity of tumour evolution. These can build on models allowing (and penalising) a single recurrence [122], allowing the loss of mutations [97], or with substitution models allowing arbitrary recurrence and loss as in [123] and the methylation model of BitPhylogeny [114]. Alternatively phylogenetic clustering approaches which do not need to enforce the infinite sites, like [46], can be further explored. Important when relaxing the infinite sites assumption will be to account for and appropriately penalise the increase in complexity of more general models. One general limitation of SCS is that from a relatively small sample of cells it is difficult to obtain an accurate picture of the prevalence of clones and their mutations, especially for highly heterogeneous tumours. 
Low frequency clones are unlikely to be sampled, and those which happen to be sampled would appear more frequent than they really are. Sequencing more cells obviously gives a clearer picture, but at a higher cost and likely to recapitulate high frequency clones while providing little extra information about the low frequency ones. Deep sequencing of bulk samples, however, can give complementary information on these frequencies, which could also inform the phylogenetic reconstruction. This is highlighted by [50], [97] where selected and targeted SCS was employed to enrich bulk analyses. The challenge would be to combine both single-cell and bulk data, with their individual characteristics, into a coherent modelling framework. Several bulk samples may help in particular (as for the bulk phylogeny problem [32], [41], [42], [43], [56], [57], [58], [59], [60]) and importantly this sort of framework could inform experiments on which combinations of bulk and single-cell data would offer the most detailed picture of the tumour's history and heterogeneity. For single-cell data with high coverage and current error rates [78] we can expect a good reconstruction of the mutation order and history with a couple of cells sampled per relevant mutation [116]. For the 40 mutations uncovered in an ER + breast tumour, even the 47 cells sequenced by [78] offer a detailed picture of the clonal expansion and subsequent separation into subclones [116] since probabilistic phylogenetic models account for the uncertainties in the mutations observed or missed in each cell and combine this information when inferring the tree structure. By considering current single-cell datasets, it would seem that sequencing 50–100 single cells should give a high resolution picture of the tumour. Sequencing more cells obviously improves the resolution, but at a higher cost and may be of less marginal value than several very deep bulk sequences. 
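The difficulty of sampling low-frequency clones can be quantified with a simple binomial calculation. The sketch below assumes that cells are sampled independently and uniformly from the tumour, which is an idealisation; the function name `prob_clone_sampled` and the example prevalences are hypothetical.

```python
from math import comb


def prob_clone_sampled(prevalence, n_cells, min_cells=1):
    """Probability that at least min_cells of n_cells come from a clone.

    Assumes independent, uniform sampling of cells (binomial model).
    """
    p_fewer = sum(
        comb(n_cells, k) * prevalence**k * (1.0 - prevalence) ** (n_cells - k)
        for k in range(min_cells)
    )
    return 1.0 - p_fewer


# Chance of seeing a 1% clone at least once, or at least twice, among 50 cells.
p_once = prob_clone_sampled(0.01, 50)
p_twice = prob_clone_sampled(0.01, 50, min_cells=2)
# The same for a 10% clone.
p_major = prob_clone_sampled(0.10, 50, min_cells=2)
```

Under this model a clone at 1% prevalence is missed entirely in the majority of experiments with 50 cells, whereas a clone at 10% is very likely to contribute at least two cells. Requiring two cells matters because mutations seen in a single cell are uninformative for the tree reconstruction.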
Better estimates will however arise once methods arrive to combine single-cell and bulk data. Experimentally it is also worthwhile verifying that samples are indeed single cells before sequencing to avoid contamination from doublets.

A related aspect is to consider the spatial resolution and heterogeneity of tumours, as recently performed by [124], and the temporal evolution, for example by following tumour progression through xenograft generations [50]. Spatiotemporal dynamics also play a key role for the spread of tumours [97] and the link between the primary tumour and metastases [111]. Here a key question, and one with great treatment relevance, is whether the metastases were seeded early in the tumour's development or are derived from later cells. Again we can consider which sorts and combinations of data would best help to answer such questions. To understand where metastases fit in the evolutionary history of the primary tumour and their origin, ideally we would possess a high resolution understanding of the primary tumour with single-cell and deep bulk data. Assuming a single seeding event of each metastasis suggests that their bulk sequencing would suffice (as in the data of [111]), but to test this assumption would also require high resolution of the heterogeneity within the metastases themselves. As well as answering the question of the origin of metastases, SCS and its ability to provide a clear understanding of a tumour's evolutionary history offers great potential for examining tumour development under the action of clinical therapies through serial biopsies or even time course collection of CTCs.

Looking to a future where high quality single-cell (and bulk) data is available across many patient samples, as is currently the case for the TCGA and ICGC databases for bulk samples, such data and its analysis will not only help in the identification of further driver mutations but will also allow the identification of recurring mutational patterns.
These may be informative for cancer treatment and in predicting cancer progression. Furthermore, combining evolutionary histories from real patient data with evolutionary models (like [125], [126], [127]) offers the possibility to infer the fitness landscape of the tumour's aberrations. Different evolutionary models result in different phylogenetic patterns so that single-cell analysis could further help to distinguish between different models of tumour evolution like clonal expansion [15], neutral evolution [124], [128], ‘Big Bang’ models [129] of a sudden selective change followed by mostly neutral evolution, and punctuated evolution [130] of flurries of aberrations followed by clonal expansion.

List of abbreviations

AD	Allelic dropout
CNA	Copy number alteration
CTC	Circulating tumour cell
EM	Expectation maximisation
FACS	Fluorescence-activated cell sorting
ICGC	International Cancer Genome Consortium
LOH	Loss of heterozygosity
MCMC	Markov chain Monte Carlo
MDA	Multiple-displacement amplification
MILP	Mixed integer linear programming
NGS	Next generation sequencing
QIP	Quadratic integer programming
PCA	Principal component analysis
PCR	Polymerase chain reaction
SCS	Single cell sequencing
SNES	Single nucleus exome sequencing
SNV	Single nucleotide variant
TCGA	The Cancer Genome Atlas
WES	Whole exome sequencing

Author contribution

JK, KJ and NB wrote the manuscript.

Funding

JK was supported by ERC Synergy Grant 609883 (http://erc.europa.eu/). KJ was supported by SystemsX.ch RTD Grant 2013/150 (http://www.systemsx.ch/).

Transparency document

The Transparency Document associated with this article can be found in the online version.
