Ron Zeira1, Benjamin J Raphael1. 1. Department of Computer Science, Princeton University, Princeton, NJ 08544, USA.
Abstract
MOTIVATION: Copy number aberrations (CNAs), which delete or amplify large contiguous segments of the genome, are a common type of somatic mutation in cancer. Copy number profiles, representing the number of copies of each region of a genome, are readily obtained from whole-genome sequencing or microarrays. However, modeling copy number evolution is a substantial challenge, because different CNAs may overlap with one another on the genome. A recent popular model for copy number evolution is the copy number distance (CND), defined as the length of a shortest sequence of deletions and amplifications of contiguous segments that transforms one profile into the other. In the CND, all events contribute equally; however, it is well known that rates of CNAs vary by length, genomic position and type (amplification versus deletion). RESULTS: We introduce a weighted CND that allows events to have varying weights, or probabilities, based on their length, position and type. We derive an efficient algorithm to compute the weighted CND as well as the associated transformation. This algorithm is based on the observation that the constraint matrix of the underlying optimization problem is totally unimodular. We show that the weighted CND improves phylogenetic reconstruction on simulated data where CNAs occur with varying probabilities, aids in the derivation of phylogenies from ultra-low-coverage single-cell DNA sequencing data and helps estimate CNA rates in a large pan-cancer dataset. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/raphael-group/WCND. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Copy number aberrations (CNAs), which delete or amplify large contiguous segments of the genome, are a common type of somatic mutation in cancer. Copy number profiles, representing the number of copies of each region of a genome, are readily obtained from whole-genome sequencing or microarrays. However, modeling copy number evolution is a substantial challenge, because different CNAs may overlap with one another on the genome. A recent popular model for copy number evolution is the copy number distance (CND), defined as the length of a shortest sequence of deletions and amplifications of contiguous segments that transforms one profile into the other. In the CND, all events contribute equally; however, it is well known that rates of CNAs vary by length, genomic position and type (amplification versus deletion). RESULTS: We introduce a weighted CND that allows events to have varying weights, or probabilities, based on their length, position and type. We derive an efficient algorithm to compute the weighted CND as well as the associated transformation. This algorithm is based on the observation that the constraint matrix of the underlying optimization problem is totally unimodular. We show that the weighted CND improves phylogenetic reconstruction on simulated data where CNAs occur with varying probabilities, aids in the derivation of phylogenies from ultra-low-coverage single-cell DNA sequencing data and helps estimate CNA rates in a large pan-cancer dataset. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/raphael-group/WCND. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Cancer is an evolutionary process where somatic mutations accumulate in a population of tumor cells (Nowell, 1976). Copy number aberrations (CNAs), the deletion or amplification of large genomic regions, are a common type of somatic mutation in many cancer types (Ciriello ). CNAs range in scale from a few kilobases to chromosome arms and even entire chromosomes (Zack ). CNAs play an important role in driving cancer development (Burrell ; McGranahan and Swanton, 2015), and thus characterization of these events is essential for disease diagnosis, prognosis and treatment (Fisher ). Moreover, CNAs provide important information for reconstructing tumor evolution (Beerenwinkel ; Schwartz, 2019).There are two major challenges in modeling copy number evolution. First, genomes containing multiple duplicated regions are difficult to correctly reconstruct from current short-read DNA sequencing technologies (Li ; McPherson ; Oesper ). Second, genome evolution models that allow multiple genomic copies are computationally hard to solve (Fertin ). To address these difficulties, recent research has focused on modeling the evolution of copy number profiles (CNPs), a simplified representation of a genome. A CNP is a sequence of integers that indicates the number of copies of each region, or segment, from a reference genome that are present in the genome. Thus, CNPs model only the number of copies of segments of the reference genome and not the sequence of rearranged segments. However, they are a useful representation because they can be readily derived from DNA sequencing data or microarrays.CNPs can be derived from DNA sequencing data of bulk tumor samples using specialized algorithms that infer the integer-valued CNPs from the mixtures of normal and cancerous cells in this data (Carter ; Fischer ; Ha ; Nik-Zainal ; Oesper ; Shen and Seshan, 2016; Zaccaria ). While earlier methods calculated the total copy number of the two alleles (Carter ), recent methods derive allele-specific (McPherson ; Zaccaria and Raphael, 2018), and even haplotype-specific CNPs (Jamal-Hanjani ). Recently, single-cell sequencing has emerging as a promising approach for assessing tumor heterogeneity and evolution (Gawad ; Wang ). Single-cell sequencing precludes the need for deconvolution of bulk samples into integer CNPs, and thus enables the detection of small populations of cells with specific aberrations (Wang ). While high-coverage whole-genome single-cell sequencing is technically and financially prohibitive, two recent technologies, Direct Library Preparation (Laks ; Zahn ) and the 10X Genomics CNV Solution (10X Genomics, 2019a; Andor ), have demonstrated the feasibility of obtaining CNPs from thousands of single cells using ultra-low coverage (<0.05× per cell). While earlier methods derived total copy numbers from single-cell sequencing (10X Genomics, 2019b; Garvin ), a recent method called CHISEL (Zaccaria and Raphael, 2019) derives allele-specific and haplotype-specific copy number calls, thus opening new opportunities for analyzing copy number evolution in cancer.Modeling copy number evolution using CNPs is challenging because, unlike single-nucleotide mutations, CNAs often overlap, and therefore the copy numbers of different segments are not independent (Beerenwinkel ; Schwartz, 2019). Recently, several methods have been introduced to describe the evolution of CNPs. Some methods do not rely on an evolutionary model but instead use distance measures such as the Euclidean distance to reconstruct phylogenies from CNPs (Navin ; Pennington ). Other methods consider only events that alter the copy number of single segments independently (McPherson ) or include only single position events as well as whole-chromosome and whole-genome duplication events (Chowdhury ). An extension to the latter model allows for events of different weights (Chowdhury ), but both the unweighted and weighted models lack efficient algorithms to compute the distance between profiles; thus, applications of this model have been limited to very short profiles.An alternative model of CNA evolution is the copy number transformation (CNT) model (Schwarz ). In this model, amplifications and deletions of contiguous intervals are counted as single events. The copy number distance (CND) between two profiles is the length of a shortest sequence of amplifications and deletions that transform one profile into the other. MEDICC (Schwarz ), the first algorithm to compute the CND, uses a heuristic to reconstruct a phylogenetic tree from CNPs, and has been used successfully in several cancer studies (Mangiola ; Schwarz ; Sottoriva ). More recently, Zeira showed that the CND between a pair of profiles can be computed in linear time and El-Kebir gave an integer linear programming formulation for reconstructing a phylogenetic tree between CNPs with the minimum number of events.The CND is useful because it can be computed efficiently, but the CND has the disadvantage that it gives all events equal weight, regardless of their length, position or type (amplification versus deletion). This limitation has several drawbacks. First, CNAs are reported to have different rates in different cancers depending on their location, length and type (Beroukhim ; Ciriello ; Macintyre ; Zack ). Second, some events, such as those affecting oncogeneic regions, may have more profound effect on cancer development and thus are more important than others (Mermel ). Third, the CND is sensitive to errors in copy number calls as each change is counted as a single event. The discrete nature of CND makes it hard to distinguish small focal events that are possibly errors from large scale events such as chromosome losses.Generalization of the CND to a model that weighs events differently is not straightforward. Moreover, it is not immediately apparent that a weighted generalization retains the combinatorial properties of the CNT model that enable its efficient computation. For example, there is no change in computational complexity between computing edit distance and weighted edit distance for sequences of independent characters. However, this is not the case for other models with non-independent characters. Most famously, while reversal distance can be computed in linear time (Hannenhalli and Pevzner, 1995a, b), weighted reversal distance is NP-hard (Bader and Ohlebusch, 2007; Pinter and Skiena, 2002). A recent generalization of the CNT model allows each event to modify the CNP with any amplitude at the same unit cost (Cordonnier and Lafond, 2020). But, computing the optimal transformation under this cost framework is NP-hard. Finally, the existing algorithms that compute the CND efficiently restrict the order of aberrations to specific ordered CNTs that have identical distances (El-Kebir ; Zeira ). This restriction is not appropriate for weighted CND. While the MEDICC algorithm (Schwarz ) computes CND without any assumption on the order of events, its algorithmic complexity has not been analyzed and is suggested to be exponential (Zeira ).In this work, we derive a weighted CND and provide an efficient algorithm to compute this distance. The weighted CND allows for different weights (or probabilities) to be assigned to segmental events according to their genomic positions, lengths and/or types (amplification versus deletion). This is the first efficient algorithm for weighted CND and relies on two key results: (i) a generalization of the ordered CNTs used in the derivation linear-time algorithm for unweighted CND (Zeira ) to semi-ordered CNTs and (ii) formulation of weighted CNTs as a linear program (LP) with a totally unimodular constraint matrix, implying that integer solutions are obtained with a polynomial time algorithm. In addition to the distance, the algorithm also provides a minimum weight transformation, i.e. a likely series of amplifications and deletions between a pair of profiles.We demonstrate the utility of the CND on three applications. First, we show that weighted CND produces more accurate phylogenetic trees on simulated CNPs generated with events with varying probabilities. Next, we use the weighted CND to derive phylogenetic trees from single-cell whole-genome sequencing data of a breast tumor obtained using the 10X Genomics CNV Solution. We show that the weighted CND improves the inference of tumor clones and cell lineages. Finally, we use the weighted CND to infer CNA rate signatures across cancer types and chromosomes on the Cancer Genome Atlas (TCGA) pan-cancer dataset.
2 Materials and methods
We start by reviewing CNTs and the CND (Section 2.1). We then describe the solution space of optimal CNTs, generalizing from ordered CNTs to semi-ordered CNTs (Section 2.2). Finally, we show how to compute optimal weighted CNTs where events have a weight determined by their position, length or type (Section 2.3). Additional details and proofs are in the Appendix (Supplementary Section S1).
2.1 CNT distance
We review the CNT model (Zeira ).We model chromosomes and CNAs as follows. A CNP is a vector of non-negative integers. A segmental event is a triplet where are the start and end positions of the event and is the type of the event. An event with τ = 1 is an amplification and an event with a deletion. Segmental event transforms a profile C into a new profile ; i.e. positive values between are increased by τ.A CNT from a source CNP S to a target CNP T is a sequence of events such that . Given a source CNP S and a target CNP T, the copy number transformation distance (CND) (note that this measure is not a true metric as it is not symmetric and does not obey the triangle inequality) d(S, T) is the length of the shortest CNT from S to T. Note that if a pair S, T of CNPs has s = 0 but for some i, then there is no CNT from S to T; we say that in this case. The CND d(S, T) between a pair of profiles can be computed in linear time (Zeira ).
2.2 Semi-ordered CNTs
Both the linear time algorithm (Zeira ) and the integer linear programming (ILP) algorithm (El-Kebir ) for computing the CND restrict to ordered CNTs, where all deletions come before all amplifications. Formally, let be a CNT from S to T. Suppose, we partition E into maximal contiguous sequences of events of the same type. Thus, where each phase E is a contiguous sequence in E, each E is composed of events of the same type , and no two consecutive subsequences are of the same type. In this case, we say that E is composed of k phases . Let be the number of events of type that affect the ith position in the profile in phase E. CNT E from S to T is phase-bounded provided for all and every phase E, where is the maximum copy number. Zeira showed that for any pair S, T of CNPs with there exists a shortest phase-bounded CNT with two phases: having only deletions, and having only amplifications. A transformation of this form is called an ordered transformation.As a shortest ordered CNT always exists, it is algorithmically sufficient to restrict attention to ordered CNTs in order to compute the CND. However, unordered CNTs may yield the same distances, and in some cases may be more biologically relevant. For example, the CNPs and have CND (Fig. 1). A shortest ordered transformation consists of an amplification of the first two positions followed by an amplification of the last two positions. On the other hand, the unordered transformation also has two events: an amplification of the entire chromosome followed by a deletion of the middle segment. As whole-chromosome duplications and deletions are common in cancer, the unordered transformation may be more plausible than the first transformation. Thus, restricting to ordered CNTs may preclude other optimal transformations that better explain the profiles.
Fig. 1.
Weighted and semi-ordered CNTs from to . The right (yellow) CNT is ordered with two amplifications while the left (blue) CNT is semi-ordered and has one amplification and one deletion. Given a weight function that assigns a weight of 1 to deletions and 2 to amplifications, the left CNT has a weight of 3 while the right CNT has a weight of 4
Weighted and semi-ordered CNTs from to . The right (yellow) CNT is ordered with two amplifications while the left (blue) CNT is semi-ordered and has one amplification and one deletion. Given a weight function that assigns a weight of 1 to deletions and 2 to amplifications, the left CNT has a weight of 3 while the right CNT has a weight of 4We define a semi-ordered CNT from S to T as a CNT with three phases where E1 and E3 have only deletions, E2 has only amplifications, and for all i where t = 0. In other words, a semi-ordered CNT has three phases: deletions, amplifications and deletions, and every zero position in T reaches zero after the first phase of deletions. While there is no specific biological rationale for restricting transformations to three phases, this restriction provides a richer space of transformations than ordered transformations while remaining computationally tractable (Fig. 1). In Supplementary Section S1.1, we show that while finding a shortest semi-ordered transformation between a pair of profiles can be written as an ILP formulation, the corresponding constraint matrix is totally unimodular (TUM). As a result, the ILP can be converted to a linear programming formulation (LP S2) without integrality constraints that is guaranteed to have integer optimum (Hoffman and Kruskal, 2010). In addition, we give a graph-theoretic characterization of the space of shortest CNTs that can be generated from a solution to the LP.
2.3 Weighted CNTs
In this section, we derive a weighted CNT model and describe an efficient algorithm to compute the weighted CNT distance.The CND counts all events equally in the distance, regardless of the length or type of event. This is problematic for two reasons. First, it has been observed that CNAs of different lengths occur at different rates in cancer (Beroukhim ; Ciriello ; Zack ). Second, CNPs inferred from real data often have uncertainty, and this uncertainty is generally length dependent. For instance, consider the following pairs of CNPs: and . Both pairs of profiles have CND . However, the first transformation E includes two focal amplifications which might be less likely in cases where the CNPs have errors. The second transformation also has a distance of 2, but the events are chromosome arm gain and loss. While the CND gives both pairs the same distance, if there is uncertainty in the CNPs, then arguably and should be less similar than S and T.To model differences in events, we introduce an event weight function that maps events on CNPs of length n to positive weights. Namely, is the weight of an operation starting at position i and ending at position k of type τ. The event weight accounts for the position of the event along the chromosome, the length of the event and its type. For example, deletions and amplifications can have different weights depending the rate of these events in a specific cancer type. Longer events may have a higher weight than shorter ones, and the weight of a whole-chromosome duplication may have a different weight regardless of the chromosome length.We define the following weighted CNT model:Weighted CNT model: Let S and T be CNPs, let E be a CNT from S to T and let w be a weight function for events. We define the weight of the CNT E to be the sum of weights of events in E.The weighted CNT model distinguishes transformations based on their weight and not just the number of events. For example, there are two shortest CNTs from to : (i) an amplification of the entire chromosome followed by a deletion of the middle segment, or (ii) an amplification of the first two positions followed by an amplification of the last two positions (Fig. 1). Suppose that the weight function is and . Then, the weight of the first CNT is 3 whereas the weight of second CNT is 4.The weighted CNT model can also be interpreted as a probability model. Suppose that each event occurs with probability p, and that events in a transformation E from S to T are independent. Then, the probability of observing a transformation E is and gives a maximum likelihood CNT between S and T. Therefore, setting the weight for each event e will make the weight of a transformation proportional to its likelihood.The goal of the event weight function is to distinguish between CNTs. In Supplementary Section S1.1, we show how weights can be used to determine a shortest semi-ordered CNT consistent with a single solution to the LP (Supplementary Problem S2). However, there may be multiple optimal solutions to the LP. Moreover, while a shortest CNT has the minimum number of events, the true biological transformation need not be parsimonious. For instance, a shortest transformation of to involves one chromosome arm amplification. Yet, a biological explanation for gaining an arm could be first gaining a whole chromosome and then losing an arm. The next problem generalizes the CND by finding a minimum weight transformation between a pair of profiles.Problem 1 (Minimum weight semi-ordered CNT). Given a source CNP S, a target CNP T and a weight function w, find semi-ordered phase-bounded CNT E having a minimum weight W(E).We now give our main result; an LP formulation to find a minimum weight semi-ordered CNT. Let be a variable indicating the number of events from position l to position k in phase j.Minimum weight semi-ordered CNT (LP 1): subject toLP 1 has a quadratic number of variables and a linear number of constraints. Next, we show that LP 1 yields a minimum weight transformation.Theorem 1 The constraint matrix of is totally unimodular. Thus, has an integer solution corresponding to a minimum weight semi-ordered CNT between a given pair S, T of CNPs and any weight function w.Note that as the minimal weight CNT problem does not find a shortest transformation, it may produce long transformations. Therefore, we may want to find a transformation that balances both its weight and its length. In Problem 2 we find a transformation that minimizes a linear combination of the weight and the length of the transformation, while in Problem 3 we find the minimum weight transformation only among the set of shortest transformations.Problem 2 (Minimum regularized semi-ordered CNT). Given a source CNP S, a target CNP T, a weight function w and a non-negative number λ, find a phase-bounded semi-ordered CNT E that minimizes .Problem 3 (Minimum weight shortest semi-ordered CNT). Given a source CNP S, a target CNP T and a weight function w, find a shortest phase-bounded semi-ordered CNT E having a minimum weight, .We show that LP 1 solves both Problems 2 and 3. First, for Problem 1, let E be a semi-ordered CNT from S to T and denote by the number of events in E from position l to k in phase j. The objective of Problem 2 can be written as . Hence, Problem 2 is equivalent to Problem 1 with modified weights and is solved with LP 1.We reduce Problem 3 to Problem 2 with . As for any phase-bounded CNT E, in order to minimize , the length of the CNT must be minimized first and only then the weight W(E) of the CNT should be minimized. We note that Problem 3 can also be solved directly by modifying LP 1 with a constraint , where d(S, T) is the CND from S to T. Though by adding this constraint, the constraint matrix of the modified formulation would not be TUM, the first approach shows that that Problem 3 has integer optimum regardless.
3 Results
We present three applications of the weighted CND. First, we show on simulated data that the weighted CND provides better estimates of the evolutionary distance between CNPs that evolve with non-uniform length distribution compared to the unweighted CND and the Euclidean distance (Section 3.1). Next, we show how the weighted CND helps recover cell populations in tumors using noisy CNPs derived from low-coverage single-cell DNA sequencing data (Section 3.2). Finally, we use the weighted CND to estimate CNA rates on TCGA data (Section 3.3).
3.1 Reconstruction of simulated copy number trees
In this section, we compare distance-based tree reconstruction on simulated CNPs using three distances: Euclidean distance, unweighted CND and weighted CND. We simulate CNPs from a directed tree via copy number events that occur with different probabilities. We assume that the tree is unknown but the probability distribution of events is known. We further assume that all profiles, including inner nodes are used to reconstruct the evolutionary tree and estimate the events along the edges. Obviously, these assumptions do not hold in real data; rather, the goal of these simulations is to show that using the weighted CND with prior knowledge of the distribution of events gives better estimates of distance than the other distance measures.In this setting, a minimum spanning arborescence (MSA) (Edmonds, 1967) of the simulated profiles corresponds to a maximum parsimony tree when using unweighted CND between nodes. Conversely, we define the likelihood of a tree as the product of all transformation probabilities along its edges. Therefore, if we set the weight of an event e as –log(p(e)), where p(e) is the probability of e in the simulation, then an MSA corresponds to a maximum likelihood tree when using the weighted CND.We generate a rooted, directed binary tree T with n nodes and a designated root node r having a CNP of length m with all entries having the same value b. The length of each edge in T, corresponding to the number of events between nodes, is , where X is drawn from a Poisson(λ) distribution. Thus, each edge has a minimum of one event and an average of events. Each event is an amplification with probability p and a deletion with probability 1 – p. For each event, we draw the length l of an event from a distribution having . The position an event acts along the genome is selected uniformly among possible positions. Profiles in the tree are simulated from the root downwards. Throughout the simulations we fix n = 61, λ = 1, b = 2, m = 50, p = 0.6. We tested different distribution of event lengths (Supplementary Fig. S2). For each β, 50 trees and corresponding profiles are simulated.Let be the set of simulated profiles in the tree T. Given a distance measure d, we calculate a pairwise distance matrix D. Note that D is not necessarily symmetric as both the weighted CND and unweighted CND are not symmetric. Moreover, for some ordered pairs of profiles there may not exist a transformation between them and in this case the distance in undefined. We build a directed weighted graph , i.e. G has a directed edge from u to v of weight D(u, v) if the distance from u to v exists. To make a fair comparison, when building the graph based on the Euclidean distance, we remove edges where the CND will not exist. Finally, we find a minimum weight spanning arborescence rooted at r in G (Edmonds, 1967).We evaluate the difference between the inferred tree and the true tree using two measures. First, we define the true positive edge rate as the fraction of edges common to both trees. Second, we calculate the difference between N(T), the total number of events along the edges of T, and , the corresponding quantity for . quantifies how well the inferred tree recapitulates the events of the simulated tree. As events can overlap and cancel one another over time, a tree that correctly captures the events () is a better estimation of the true evolution.We find that both the unweighted CND (CND) and weighted CND (WCND) outperform the Euclidean distance (EUC) in reconstructing the true tree across all values of β (Fig. 2a). In addition, the weighted CND shows significant improvement over the unweighted CND (p ≤ 3 × 10−5 in paired t-test) over all values of β. The average improvement increases from 0.03 when β = 10 to 0.11 when . Smaller values of β correspond to distributions where longer events are more likely than short ones (Supplementary Fig. S2). In this case, there is a higher probability for events to overlap, creating by chance similar profiles on different branches of the tree. As there are more similar profiles, it is more difficult to correctly recover the true tree topology. We indeed see that the reconstruction performance improves in all methods when β increases. The weighted CND shows most improvement exactly in those hard cases ().
Fig. 2.
Comparison of trees constructed using Euclidean distance (EUC), unweighted CND (CND) and weighted CND (WCND) on simulated profiles with length-based distribution of CNAs. (a) The proportion of edges common to the simulated tree and the tree inferred by the MSA. (b) The difference in the total number of events between the simulated and the tree inferred by the MSA. P-value in paired t-test: *, **, ***, ****
Comparison of trees constructed using Euclidean distance (EUC), unweighted CND (CND) and weighted CND (WCND) on simulated profiles with length-based distribution of CNAs. (a) The proportion of edges common to the simulated tree and the tree inferred by the MSA. (b) The difference in the total number of events between the simulated and the tree inferred by the MSA. P-value in paired t-test: *, **, ***, ****We also find that the weighted CND infers trees with for all values of β (Fig. 2b). In contrast, the Euclidean distance yields trees that grossly overestimate the true number of events in the tree (), while the unweighted CND yields trees that underestimate the true number of events in the tree (). The latter is not surprising as the unweighted CND minimizes the total number of events in the tree. Although for , both the weighted and unweighted CND produce trees with similar number of events, the weighted CND has higher . This shows that the weighted CND is able to recover the tree topology among multiple trees with the same total number of events along the edges. For other β, the weighted CND produces trees with total number of events closer to the true tree, having an average of 0.36–3.78 more events than the unweighted CND.As in real data the true distribution of events is unknown, we repeated the simulations using different probabilities for tree inference with the weighted CND. Specifically, we simulated trees with except we assigned the weight of an event of length l from a distribution with . We find that even with an incorrect weight distribution, the WCND still infers trees that are significantly better than the unweighted CND (Supplementary Fig. S3a). Thus, having prior knowledge of the event distribution is still superior to using an unweighted CND. On the other hand, both weighted and unweighted CND find trees with the exact minimum number of events (Supplementary Fig. S3b).
3.2 Single-cell cancer sequencing data
We analyzed CNPs derived from ultra-low-coverage ( to ) single-cell whole-genome sequencing data of a breast tumor obtained using the 10X CNV Solution (10X Genomics, 2019a). The dataset includes five tumor sections, each comprising cells. CHISEL (Zaccaria and Raphael, 2019) was used to infer a haplotype-specific CNP (separating integer copy numbers of the two homologous chromosomes) for each cell in 5-MB bins across all autosomes. Thus, each cell has 44 CNPs corresponding to the 22 pairs of chromosomes. We analyzed cells jointly across all five sections, focusing on cells that CHISEL classified as tumor cells (Zaccaria and Raphael, 2019); we excluded centromere bins that had highly variable calls and discarded duplicate cells with identical CNPs, resulting in a final dataset of 4012 cells by 1052 CN entries. We arbitrarily designated one allele as A and the other as B (Fig. 3a).
Fig. 3.
(a) Haplotype-specific CNPs obtained from whole-genome single-cell sequencing of a breast tumor. Cells were previously clustered into eight clones I–VIII (Zaccaria and Raphael, 2019). Copy numbers were limited to 4 for simpler presentation. Trees constructed using neighbor joining on clone consensus CNPs using (b) Euclidean distance, (c) unweighted CND and (d) weighted CND. Dashed branches marked with an asterisk indicate differences from previous CHISEL (Zaccaria and Raphael, 2019) analysis
(a) Haplotype-specific CNPs obtained from whole-genome single-cell sequencing of a breast tumor. Cells were previously clustered into eight clones I–VIII (Zaccaria and Raphael, 2019). Copy numbers were limited to 4 for simpler presentation. Trees constructed using neighbor joining on clone consensus CNPs using (b) Euclidean distance, (c) unweighted CND and (d) weighted CND. Dashed branches marked with an asterisk indicate differences from previous CHISEL (Zaccaria and Raphael, 2019) analysisDue to the ultra-low coverage, copy number calls in individual cells are prone to errors. While whole-chromosome and arm-level CNs can be more reliably called, focal (single bin) changes in the CNPs are more likely to be errors. The CHISEL analysis (Zaccaria and Raphael, 2019) addressed this issue by clustering cells using Hamming distance and creating consensus CNPs for each cluster of cells. This procedure resulted in eight groups of cells, or clones, labeled I–VIII and characterized by different large-scale CNAs, including whole-chromosome and chromosome–arm level events and an early whole-genome duplication (Fig. 3a and Supplementary Fig. S4). This analysis also built a phylogeny relating these clones with two main branches: one branch containing clones I and II harboring deletions of chromosome 2A and 3B, and arm deletions of 6A.p and 10B.p; the other branch containing clones III–VIII harboring deletions of chromosomes 4B and 8A, and also a deletion of chromosome 2B in most of the clones in this lineage. The CHISEL tree analysis also suggested that clone III is the ancestor of clone IV which is the ancestor of clones V–VIII.As the individual cell CNPs are noisy, we first analyzed the clone CNPs. Similar to CHISEL, we created consensus CNPs for the cells of each clone (Supplementary Fig. S4). While cells in the dataset had on average 81 break points, i.e. positions where consecutive copy numbers do not match, clone profiles had only 39 breakpoints on average. We then calculated a symmetric distance matrix between clones using Euclidean distance, unweighted CND and weighted CND. Because both CNDs are not symmetric, we defined the symmetric distance between a pair of profiles as the average of the distances in both directions. We constructed a phylogenetic tree of the clones using neighbor joining. As the major clonal events in this sample are large chromosomal aberrations, we gave whole-chromosome and arm level events a high weight corresponding to () a probability of . On the other hand, we suspect that focal changes are more likely to be errors, and thus assigned these events a weight corresponding to a probability of 0.9. We gave amplifications and deletions equal probabilities. While all distances are able to separate the two main lineage of the tumor, only the weighted CND produces a tree concordant with previous analysis (Fig. 3b–d). Notably, the Euclidean distance mistakenly places clones III and VI close together on the tree because the events that distinguish these two clones are shorter relative to other chromosomal events (1A.q gain and 3B loss). Similarly, the unweighted CND also misplaces a branch, placing clones VI and VIII as descendants of clone III. This placement would imply that chromosome 2B was lost twice independently in two separate branches, which is unlikely since most cells on this lineage contain this mutation. This change is caused by a single bin in chromosome 1A.p having a different copy number than the rest of the arm. This change is shared by clones III and VIII making them one event closer in terms of the unweighted CND. On the other hand, the weighted CND gives this change a low weight in comparison to chromosome/arm level events, thus correctly resolving the clone topology.To show that the weighted CND is able to cope with errors, we next computed phylogenies on the 4012 individual cells using neighbor joining. The tree computed using Euclidean distance (Fig. 4a) has clades that largely recapitulate the clone assignment given in Zaccaria and Raphael (2019), which is not surprising because (i) Hamming distance was used to cluster cells into clones in CHISEL and (ii) whole-chromosome and arm level events are the major events in this tumor and since Euclidean distance weighs events proportional to their lengths, it is more robust to small changes in the profiles. However, we find that the Euclidean distance tree has strange evolutionary relationships placing clones III, V and VI together in the same clade although previous analysis (Zaccaria and Raphael, 2019) suggested that clones V and VI descended from clone IV. This arrangement would again imply that chromosome 2B was lost twice independently, which is not likely. The unusual placements are not surprising, because Euclidean distance has no underlying evolutionary model for CNAs. In contrast, the tree based on the unweighted CND (Fig. 4b) mixes cells from different clones in the same clade and even cannot separate the two main lineages (I–II and III–IIIV) that are the most distinct in their CNPs. This is likely because the unweighted CND is susceptible to noise in the CNPs in individual cells, and small differences in CNPs are weighted equally as large events. Finally, the tree inferred using the weighted CND clearly divides the two main lineages, preserves most of the clonal structure and recapitulates the evolutionary relationships from the clone tree in Zaccaria and Raphael (2019). Importantly, the weighted CND tree shows that clones V–VIII descend from clone IV, which is the more reasonable as these cells are distinguished by a 2B loss. The weighted CND tree does group some cells from different clones into the same clade. Closer inspection of cells that were reported to be from the same clone in Zaccaria and Raphael, (2019) but were split in the weighted CND tree sheds light on smaller sub-populations of cells. For instance, a small group of 13 cells that were classified as clone VIII but separated from the rest of the cells from this clone have one fewer copy of the p arm of chromosome 16B compared to the other cells in this clone. As we use higher weights for arm events, these cells were split from the rest of the clone. Similarly, a group of 14 cells of clone V containing a loss of chromosome 11 were separated from the rest of the cells in this clone.
Fig. 4.
Trees built on single-cell CNPs for different distance measures (a-c) visualized using iTOL (Letunic and Bork, 2019) where edge lengths are scaled for visualization and not proportional to the distance
Trees built on single-cell CNPs for different distance measures (a-c) visualized using iTOL (Letunic and Bork, 2019) where edge lengths are scaled for visualization and not proportional to the distanceAs the true weights of events are unknown, we explored how the choice of weights in the weighted CND affects the phylogeny. We restricted our analysis to a single tumor section (E) with 1062 cells and calculated cell pairwise distance matrices using weighted CND with various weights. As the space of possible event weights is large, we reduced the weight function to three parameters: amplification/deletion probability P, local/chromosome event probability q and focal/segmental probability r. Therefore, the probability of a segmental deletion, for example, would be . Each of these parameters was assigned either 0.25 or 0.75 resulting in a total of eight parameter configurations. In addition, we also considered another parameter combination to reflect the fact that chromosomal events are the major events in this tumor’s evolution.We find that varying the weights yields neighbor-joining trees that are quite different from one another, showing that the selection of weights can have a considerable effect of the resulting trees (Supplementary Fig. S6). We see that when perturbing only the probability p of amplifications while keeping other parameters fixed, the resulting trees seem a lot more similar. This can be explained by the fact that when symmetrizing the distance, some of the information on the direction of events is lost. Interestingly, weighted CND trees are more similar to the unweighted tree when the chromosome/arm level events and focal events are more likely () and least similar when segmental events are more likely (). Conversely, the tree is more similar to the Euclidean tree when focal events have lower weights (). This corroborate with our expectation as Euclidean distance gives a higher weight to longer events. Finally, the additional parameter combination that assigns a high weight to chromosome event gives a tree that is the most similar to the Euclidean distance tree than the other parameter combinations, as expected.
3.3 Estimating CNA rates in TCGA pan-cancer data
A key challenge in applying the weighted CND to real data is how to select biologically realistic weights. Estimating CNA rates is substantially more difficult than estimating SNV rates: as CNA overlap with each other, it is difficult to conclude which CNAs have occurred from pairs of moderately diverged CNPs. In this section, we illustrate how the weighted CNT can help in the inference of CNA rates using the inferred transformations between pairs of CNPs. We apply this procedure to CNPs from 26 cancer types from TCGA. It has been observed that different cancer types have different rates of CNAs (Ciriello ; Zack ), and even within a cancer type, different chromosomes show varying patterns of aneuploidy (Taylor ).We infer CNA rates from a collection of CNPs using the following model. Let be a set of independent pairs of source and target profiles. Our goal is to find a probability distribution P over CNAs that will maximize the likelihood of observing the set of CNP pairs, namely . Here, we assume that there are m classes of events with unknown probabilities . Event classes may be determined by length, genomic location, or type of event: e.g. whole-chromosome amplifications, whole p-arm deletions, local q-arm amplifications, etc.To find the most likely CNA probability distribution P, we use an EM-like approach. Previous analysis of CNA used a similar approach based on a heuristic for deconstruction of CNPs to identify segmental events (Mermel ; Zack ). We start with a random probability distribution over CNAs. At each iteration t, we find the most likely transformation for each pair i of profiles given the probabilities at the previous iteration. Then we re-estimate the event probabilities by the proportion of events in each class j in the entire cohort. We continue until convergence or a predefined number of iterations.We obtained total CNPs and whole-genome doubling statuses for 10180 samples from the 26 cancer types from the TCGA pan-cancer dataset that had at least 100 samples (Taylor ). For each cancer type and each chromosome, we created a set of CNP pairs where T is the CNP of the ith sample and is a profile with the same CN b across segments, where b = 2, 4 or 6 depending on the sample’s genome doubling status. For each such cohort, we estimated CNA rates considering the following classes of events. As it has been noted that CNA tend to be localized on each chromosome arm, we classify events based on their start and end points in relation to the centromer (p-arm, q-arm, cross). An event is classified as affecting the whole chromosome/arm if its length consists of at least 80% of the chromosome/arm length (Taylor ). Finally, each event is either an amplification or a deletion giving 12 classes of events overall.We limited our analysis to the 17 non-acrocentric chromosomes resulting in 442 distributions over 12 CNA classes (Supplementary Fig. S7). We clustered the inferred CNA rates into eight groups with k-means using the Silhouette method to determine the number of clusters. The cluster centroids represent CNA rate signatures common to cancers and chromosomes (Fig. 5). The different signatures show variability in CNA rates. For instance, in Cluster 1 about 30% of events are whole-chromosome deletions while in Cluster 4 about 30% of events are chromosome amplifications. Similarly, Cluster 0 shows more deletions on the q-arm and amplifications on the p-arm while Cluster 6 shows more deletions on the p-arm and amplifications on the q-arm. Cluster 2 has around 50% focal amplifications on the q-arm whereas Cluster 3 has around 50% focal deletions on the q-arm.
Fig. 5.
Inferred CNA rates in TCGA pan-cancer CN data. Colors correspond to 12 classes of events: {deletions(−), amplifications(+)} × {whole chromosome, whole p-arm, whole q-arm, focal p-arm, focal q-arm, crossing centromere}
Inferred CNA rates in TCGA pan-cancer CN data. Colors correspond to 12 classes of events: {deletions(−), amplifications(+)} × {whole chromosome, whole p-arm, whole q-arm, focal p-arm, focal q-arm, crossing centromere}We tested whether the clusters were enriched for certain cancer types or chromosomes (Table 1). With the exceptions of Cluster 1 having a high number chromosomes from TGCT and KIRC, and Cluster 5 having a high number of LAML chromosomes, clusters were not enriched to specific cancer types. On the other hand, almost half of the chromosomes were enriched in some cluster. Notably, 24 (out of 26) chromosome 17 CNA rates were included in Cluster 2 suggesting that the distribution of events on chromosome 17 is consistent across cancer types. Cluster 6, having a high proportion of deletions on the p-arm and amplifications on the q-arm, was enriched with Chromosomes 1, 3 and 8. Chromosome 3 p-arm loss and q-arm gain have been shown to be a dominant feature of squamous cell carcinomas (Taylor ). Indeed, we see that HNSC, LUSC, CESC have similar CNA rates characterized by 3p-arm loss and 3q-arm gain (Supplementary Fig. S8).
Table 1.
Significantly enriched chromosome and cancer type in each CNA rate cluster
Cluster
Chromosome
Cancer
0
5 (1.22e-6)
1
18 (5.6e-5)
TGCT (4.04e-7), KIRC (6.36e-5)
2
17 (4.6e-22)
3
4 (3.77e-6)
5
LAML (1.25e-10)
6
1 (3.99e-7), 3 (3.21e-6), 8 (3.99e-7)
7
6 (1.3e-5), 9 (1.3e-5)
Note: Hyper-geometric P-value is presented in parentheses. The P-values were thresholded with Bonferroni correction for multiple testing.
Significantly enriched chromosome and cancer type in each CNA rate clusterNote: Hyper-geometric P-value is presented in parentheses. The P-values were thresholded with Bonferroni correction for multiple testing.
4 Discussion
In this paper, we introduce a weighted CND. The weighted CND allows for segmental events to have different weights, or probabilities, based on type, length and location, enabling more detailed studies of the rates of copy number events compared to the unweighted CND that is currently in use. We give the first efficient polynomial-time algorithm to compute the weighted CND. This algorithm is based on the observation that computation of the weighted CND is an optimization problem with totally unimodular constraint matrix. In addition to computing the minimum weighted distance, the algorithm also provides an explicit transformation that achieves this distance. We illustrate the utility of the weighted CND in three different applications: distance-based phylogenetic tree reconstruction, analysis of noisy CNPs from single-cell DNA sequencing data and estimation of CNA rates. We show on simulated data that the weighted CND outperforms Euclidean distance and unweighted CND in inferring ancestral relations between profiles when events are generated with different rates. We analyze CNPs from single-cell DNA sequencing of a breast tumor and show that the weighted CND overcomes errors in copy number calls and improves tree reconstruction. Finally, we use the weighted CND to infer CNA rate signatures from the TCGA pan-cancer data-set.An important question in applications of the weighted CND is how to determine the weights. We showed that it is possible to estimate CNA rates from cancer cohorts, but there is substantial room improving this process. First, larger cohorts are needed to estimate the distribution of sub-arm focal events as a function of their length and/or position along the chromosome. Second, while we assumed that samples from the same tumor type share the similar CNA rate distribution, there may actually be different mutagenic processes that affect the rate of CNAs in different samples (Alexandrov ; Macintyre ; Shah, 2018). There may be multiple CNA rate distributions both within a cohort and even within a single sample. Third, CNPs from bulk tumors are a mixture of multiple cells with potentially different CNPs. Thus, one must rely on accurate deconvolution of bulk samples into integer copy numbers (Gerstung ; Salcedo ; Zaccaria and Raphael, 2018) or obtain larger cohorts of single-cell sequencing data. Finally, an alternative approach for finding event weights is by examining the relative least-squares fit of the distances to the tree [e.g. using the Fitch–Margoliash method (Fitch and Margoliash, 1967)] for different weight parameters.The probabilistic model used in the weighted CND can be extended in several ways. First, the model assumes that events are independent given a transformation. This may not be the case in general as the probability of events may depend on previous events. Second, the model does not directly estimate errors in the CNPs. Both a probabilistic model and a simulation framework that accurately model real events and errors are important directions for future work. In addition, the number of CNAs and not just their relative proportions should be modeled, especially because some cancers are characterized by a higher number of CNAs (Ciriello ). Finally, extending the model to more complex CNAs such as whole-genome duplications (Bielski ) or breakage–fusion–bridge cycles (Zakov ) remains open.Click here for additional data file.
Authors: Serena Nik-Zainal; Peter Van Loo; David C Wedge; Ludmil B Alexandrov; Christopher D Greenman; King Wai Lau; Keiran Raine; David Jones; John Marshall; Manasa Ramakrishna; Adam Shlien; Susanna L Cooke; Jonathan Hinton; Andrew Menzies; Lucy A Stebbings; Catherine Leroy; Mingming Jia; Richard Rance; Laura J Mudie; Stephen J Gamble; Philip J Stephens; Stuart McLaren; Patrick S Tarpey; Elli Papaemmanuil; Helen R Davies; Ignacio Varela; David J McBride; Graham R Bignell; Kenric Leung; Adam P Butler; Jon W Teague; Sancha Martin; Goran Jönsson; Odette Mariani; Sandrine Boyault; Penelope Miron; Aquila Fatima; Anita Langerød; Samuel A J R Aparicio; Andrew Tutt; Anieta M Sieuwerts; Åke Borg; Gilles Thomas; Anne Vincent Salomon; Andrea L Richardson; Anne-Lise Børresen-Dale; P Andrew Futreal; Michael R Stratton; Peter J Campbell Journal: Cell Date: 2012-05-17 Impact factor: 41.582
Authors: Andrew McPherson; Andrew Roth; Emma Laks; Tehmina Masud; Ali Bashashati; Allen W Zhang; Gavin Ha; Justina Biele; Damian Yap; Adrian Wan; Leah M Prentice; Jaswinder Khattra; Maia A Smith; Cydney B Nielsen; Sarah C Mullaly; Steve Kalloger; Anthony Karnezis; Karey Shumansky; Celia Siu; Jamie Rosner; Hector Li Chan; Julie Ho; Nataliya Melnyk; Janine Senz; Winnie Yang; Richard Moore; Andrew J Mungall; Marco A Marra; Alexandre Bouchard-Côté; C Blake Gilks; David G Huntsman; Jessica N McAlpine; Samuel Aparicio; Sohrab P Shah Journal: Nat Genet Date: 2016-05-16 Impact factor: 38.330
Authors: Travis I Zack; Stephen E Schumacher; Scott L Carter; Andre D Cherniack; Gordon Saksena; Barbara Tabak; Michael S Lawrence; Cheng-Zhong Zhsng; Jeremiah Wala; Craig H Mermel; Carrie Sougnez; Stacey B Gabriel; Bryan Hernandez; Hui Shen; Peter W Laird; Gad Getz; Matthew Meyerson; Rameen Beroukhim Journal: Nat Genet Date: 2013-10 Impact factor: 38.330
Authors: Moritz Gerstung; Clemency Jolly; Ignaty Leshchiner; Stefan C Dentro; Santiago Gonzalez; Daniel Rosebrock; Thomas J Mitchell; Yulia Rubanova; Pavana Anur; Kaixian Yu; Maxime Tarabichi; Amit Deshwar; Jeff Wintersinger; Kortine Kleinheinz; Ignacio Vázquez-García; Kerstin Haase; Lara Jerman; Subhajit Sengupta; Geoff Macintyre; Salem Malikic; Nilgun Donmez; Dimitri G Livitz; Marek Cmero; Jonas Demeulemeester; Steven Schumacher; Yu Fan; Xiaotong Yao; Juhee Lee; Matthias Schlesner; Paul C Boutros; David D Bowtell; Hongtu Zhu; Gad Getz; Marcin Imielinski; Rameen Beroukhim; S Cenk Sahinalp; Yuan Ji; Martin Peifer; Florian Markowetz; Ville Mustonen; Ke Yuan; Wenyi Wang; Quaid D Morris; Paul T Spellman; David C Wedge; Peter Van Loo Journal: Nature Date: 2020-02-06 Impact factor: 49.962
Authors: Adriana Salcedo; Maxime Tarabichi; Shadrielle Melijah G Espiritu; Amit G Deshwar; Matei David; Nathan M Wilson; Stefan Dentro; Jeff A Wintersinger; Lydia Y Liu; Minjeong Ko; Srinivasan Sivanandan; Hongjiu Zhang; Kaiyi Zhu; Tai-Hsien Ou Yang; John M Chilton; Alex Buchanan; Christopher M Lalansingh; Christine P'ng; Catalina V Anghel; Imaad Umar; Bryan Lo; William Zou; Jared T Simpson; Joshua M Stuart; Dimitris Anastassiou; Yuanfang Guan; Adam D Ewing; Kyle Ellrott; David C Wedge; Quaid Morris; Peter Van Loo; Paul C Boutros Journal: Nat Biotechnol Date: 2020-01-09 Impact factor: 68.164