MOTIVATION: Allele-specific copy number alterations are commonly used to trace the evolution of tumours. A key step of the analysis is to segment genomic data into regions of constant copy number. For precise phylogenetic inference, breakpoints shared between samples need to be aligned to each other. RESULTS: Here, we present asmultipcf, an algorithm for allele-specific segmentation of multiple samples that infers private and shared segment boundaries of phylogenetically related samples. The output of this algorithm can directly be used for allele-specific copy number calling using ASCAT. AVAILABILITY AND IMPLEMENTATION: asmultipcf is available as part of the ASCAT R package (version ≥2.5) from github.com/Crick-CancerGenomics/ascat/.
1 Introduction
Allele-specific copy number alterations (CNAs) are commonly used to trace the evolution of tumours. One of the most frequently used algorithms to infer these copy number changes is ASCAT (Van Loo et al., 2010), which segments each sample separately. Due to measurement noise, the inferred locations of breakpoints shared between samples often differ. These differences can impair analyses of phylogenetic relationships between the samples, because evolutionary methods depend on the assumption that shared breakpoints appear at exactly the same location. Previous approaches to address this problem include extensive experimental breakpoint validation (Schwarz et al., 2015), an expensive approach that is not always feasible, or size-based heuristic filters (Mangiola et al., 2016). Another approach infers allele- and clone-specific CNAs from multi-sample data by binning without segmentation (Zaccaria and Raphael, 2018).
To rigorously address the problem of multi-sample breakpoint detection, we have developed asmultipcf (allele-specific multi-sample piecewise constant fitting), a robust allele-specific multi-sample segmentation algorithm that is tightly integrated into the ASCAT framework (Van Loo et al., 2010). The ability of asmultipcf to improve phylogenetic inference was shown in a large case study on 181 samples from 10 patients with lethal metastatic breast cancer (De Mattos-Arruda et al., 2019).
2 Approach
asmultipcf incorporates and extends two copy number segmentation algorithms previously developed by Nilsen et al. (2012), which leverage vector operations for efficient implementation: first, aspcf (an allele-specific segmentation method for single samples), and second, multipcf (a multi-sample segmentation method, which is not allele-specific). Additionally, asmultipcf handles missing values, making extensive data filtering unnecessary.
2.1 Input data
For each sample, the following input data are required across germline heterozygous sites: (i) log ratios (logR), representing log-transformed copy numbers derived from sequencing depth or single nucleotide polymorphism (SNP) array data, and (ii) B allele frequencies (BAF), describing the allelic imbalance of SNPs. The algorithm presented here can handle missing values and thus loci with incomplete data across samples do not need to be excluded.
2.2 Pre-processing
asmultipcf uses the same pre-processing steps as the allele-specific single-sample algorithm of Nilsen et al. (2012), including (i) mirroring BAFs to obtain a single track in regions of allelic imbalance and (ii) removing extreme outliers from logR and BAF data [see Nilsen et al. (2012) for details]. Given n samples across p SNP loci, the pre-processing yields a single matrix that contains both logR and BAF values.
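The mirroring and stacking steps can be illustrated with a minimal Python sketch. It assumes that mirroring means reflecting BAF values below 0.5 to 1 − BAF, that missing values are represented by `None`, and that logR rows are placed before BAF rows; the function names `mirror_baf` and `stack_tracks` are hypothetical, and outlier removal is omitted.

```python
def mirror_baf(baf):
    """Reflect BAF values below 0.5 to 1 - BAF; None marks a missing value."""
    return [None if b is None else max(b, 1.0 - b) for b in baf]


def stack_tracks(logr_by_sample, baf_by_sample):
    """Return one matrix with a logR row and a mirrored-BAF row per sample.

    The row order (all logR rows, then all BAF rows) is an assumption
    made for this sketch only.
    """
    rows = [list(track) for track in logr_by_sample]
    rows += [mirror_baf(track) for track in baf_by_sample]
    return rows


# Two samples, three loci; None marks loci without data in a sample.
logr = [[0.1, -0.4, None], [0.0, -0.5, -0.5]]
baf = [[0.25, 0.5, 0.75], [0.5, None, 1.0]]
Y = stack_tracks(logr, baf)
```

After mirroring, a locus with BAF 0.25 and one with BAF 0.75 fall on the same track, so a region of allelic imbalance appears as a single shifted level rather than two bands.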
2.3 An exact algorithm for weighted segmentation
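We evaluate the fit of a segmentation solution $S$ to the data matrix $Y$ with a weighted least squares function that models missing values in the data matrix. A weight matrix $W$ is derived by assigning $w_{ij}$ a weight of 0 if $y_{ij}$ is missing and 1 otherwise. Then all missing values in $Y$ are assigned an arbitrary (non-NA) value. Our aim is to find a segmentation that minimizes the cost function

$$C(S) = \sum_{I \in S} \sum_{i} \sum_{j \in I} w_{ij} \left( y_{ij} - \bar{y}_i^I \right)^2 + \gamma\,|S|,$$

where $i$ indexes the rows of $Y$, $j$ indexes the loci, and the best fit on a given segment $I$ is the weighted average of the observations on that segment

$$\bar{y}_i^I = \frac{\sum_{j \in I} w_{ij}\, y_{ij}}{\sum_{j \in I} w_{ij}},$$

and where $\gamma$ is a penalty parameter that controls the number of segments. Expanding the square in the cost function and omitting the term $\sum_{i} \sum_{j} w_{ij}\, y_{ij}^2$, which is independent of $S$, the problem is equivalent to minimizing

$$-\sum_{I \in S} \sum_{i} \frac{\left( \sum_{j \in I} w_{ij}\, y_{ij} \right)^2}{\sum_{j \in I} w_{ij}} + \gamma\,|S|.$$

To find an optimal solution to the cost function, we adapt the dynamic programming algorithm of Nilsen et al. (2012) to our weighted problem. The algorithm iteratively minimizes the total error at locus $k$ across all samples using the errors up to $k$, the costs of the current segments and the penalty $\gamma$, together with two intermediate variables: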
Algorithm 1: asmultipcf
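Input: Matrix Y of log-transformed copy numbers and B allele frequencies; weight matrix W; penalty γ
Output: Segment start indices and segment averages
1. Initialize the errors and intermediate variables, then iterate over the loci, updating at each locus the intermediate variables, the costs of the current segments and the minimal total error (using element-wise matrix products and element-wise inverses), and storing also the index at which the minimum in the last step is achieved.
2. Find the segment start indices from right to left by following the stored indices back from the last locus until the start of the first segment is reached.
3. Find the segment averages.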
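The three steps of Algorithm 1 can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the ASCAT code: it minimizes the reduced cost (for each segment and row, minus the squared weighted sum divided by the sum of weights, plus γ per segment), it uses prefix sums instead of the vectorized updates, and all names (`weighted_pcf`, `A`, `D`, `e`, `back`) are hypothetical. Indices are 0-based.

```python
def weighted_pcf(Y, W, gamma):
    """Exact penalized weighted least-squares segmentation (illustrative).

    Y, W: lists of rows (one row per logR or BAF track), each of length p.
    Missing entries of Y may hold any number as long as their weight is 0.
    Returns (starts, means): 0-based segment start indices and, per segment,
    the weighted average of each track (None where a track has no data).
    """
    m, p = len(Y), len(Y[0])
    # Prefix sums of w*y and of w, so the cost of any segment is O(m).
    A = [[0.0] * (p + 1) for _ in range(m)]
    D = [[0.0] * (p + 1) for _ in range(m)]
    for i in range(m):
        for j in range(p):
            A[i][j + 1] = A[i][j] + W[i][j] * Y[i][j]
            D[i][j + 1] = D[i][j] + W[i][j]

    def segment_cost(s, k):
        # Reduced cost of loci s..k-1: -(sum of w*y)^2 / (sum of w) per track.
        total = 0.0
        for i in range(m):
            d = D[i][k] - D[i][s]
            if d > 0:  # a track without observed data contributes nothing
                a = A[i][k] - A[i][s]
                total -= a * a / d
        return total

    # Step 1: forward pass. e[k] is the minimal penalized cost of loci
    # 0..k-1, back[k] the start of the last segment in that optimum.
    e = [0.0] * (p + 1)
    back = [0] * (p + 1)
    for k in range(1, p + 1):
        best, arg = None, 0
        for s in range(k):
            c = e[s] + segment_cost(s, k) + gamma
            if best is None or c < best:
                best, arg = c, s
        e[k], back[k] = best, arg

    # Step 2: recover the segment starts from right to left.
    starts, k = [], p
    while k > 0:
        k = back[k]
        starts.append(k)
    starts.reverse()

    # Step 3: weighted segment averages.
    means = []
    for s, t in zip(starts, starts[1:] + [p]):
        row = []
        for i in range(m):
            d = D[i][t] - D[i][s]
            row.append((A[i][t] - A[i][s]) / d if d > 0 else None)
        means.append(row)
    return starts, means


Y = [[0, 0, 0, 5, 99, 5], [1, 1, 1, 1, 1, 1]]  # 99 stands in for a missing value
W = [[1, 1, 1, 1, 0, 1], [1, 1, 1, 1, 1, 1]]
starts, means = weighted_pcf(Y, W, gamma=1.0)
print(starts, means)
```

In this toy input the placeholder 99 has weight 0 and therefore does not influence either the breakpoint or the segment average. The double loop over `k` and `s` makes the quadratic dependence on the number of loci explicit.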
2.4 A heuristic algorithm for large data sets
Algorithm 1 scales quadratically with the number of loci p, which means that the segmentation becomes computationally expensive for long sequences. However, instead of allowing breakpoints at any of the p positions, we can pre-select potential breakpoints, so that the runtime of the segmentation depends on the number of potential breakpoints q rather than on p. To identify potential breakpoints, different heuristics can be used. Here, we apply Algorithm 1 to overlapping subsequences (length 5000 with an overlap of 1000), combine all of the inferred breakpoints and use them as input for the subsequent global segmentation. Algorithm 2 describes the fast heuristic version of asmultipcf.
Algorithm 2: Fast asmultipcf
Input: Matrix Y of log-transformed copy numbers and B allele frequencies; weight matrix W; penalty γ
Output: Segment start indices and segment averages
1. Split the data set into overlapping subsequences and apply steps 1 and 2 of Algorithm 1 to each of them in order to find the potential breakpoints.
2. Aggregate the sequences between consecutive potential breakpoints into an aggregated data matrix X and a corresponding aggregated weight matrix.
3. Calculate the segmentation solution by using the aggregated matrices as input to Algorithm 1 instead of Y and W, respectively.
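A minimal Python sketch of the fast heuristic follows. Several details are assumptions rather than statements from the text: each block between consecutive potential breakpoints is aggregated into its weighted mean (data) and its summed weight, which leaves the reduced cost of any union of blocks unchanged; the same penalty is used for the windows and for the global run; and window starts themselves are not counted as potential breakpoints. All function names are hypothetical.

```python
def pcf(X, V, gamma):
    """Exact weighted segmentation over the columns of X with weights V.

    Returns the 0-based start indices of the optimal segments.
    """
    m, q = len(X), len(X[0])
    e, back = [0.0], [0]
    for k in range(1, q + 1):
        best, arg = None, 0
        a = [0.0] * m  # running sum of v*x over columns s..k-1
        d = [0.0] * m  # running sum of v over columns s..k-1
        for s in range(k - 1, -1, -1):
            c = e[s] + gamma
            for i in range(m):
                a[i] += V[i][s] * X[i][s]
                d[i] += V[i][s]
                if d[i] > 0:
                    c -= a[i] * a[i] / d[i]
            if best is None or c < best:
                best, arg = c, s
        e.append(best)
        back.append(arg)
    starts, k = [], q
    while k > 0:
        k = back[k]
        starts.append(k)
    return starts[::-1]


def candidate_breakpoints(Y, W, gamma, length=5000, overlap=1000):
    """Step 1: segment overlapping windows and pool their breakpoints."""
    assert length > overlap >= 0
    p = len(Y[0])
    cands, start = {0}, 0
    while True:
        end = min(start + length, p)
        sub_y = [row[start:end] for row in Y]
        sub_w = [row[start:end] for row in W]
        # Skip the first start: it is only the window edge.
        cands.update(start + s for s in pcf(sub_y, sub_w, gamma)[1:])
        if end == p:
            break
        start += length - overlap
    return sorted(cands)


def aggregate(Y, W, cands):
    """Step 2: collapse each block into a weighted mean and a summed weight."""
    bounds = cands + [len(Y[0])]
    X, V = [], []
    for y, w in zip(Y, W):
        xr, vr = [], []
        for lo, hi in zip(bounds, bounds[1:]):
            v = sum(w[lo:hi])
            a = sum(wj * yj for wj, yj in zip(w[lo:hi], y[lo:hi]))
            vr.append(v)
            xr.append(a / v if v > 0 else 0.0)
        X.append(xr)
        V.append(vr)
    return X, V


def fast_pcf(Y, W, gamma, length=5000, overlap=1000):
    """Step 3: run the exact algorithm on the aggregated matrices."""
    cands = candidate_breakpoints(Y, W, gamma, length, overlap)
    X, V = aggregate(Y, W, cands)
    return [cands[b] for b in pcf(X, V, gamma)]


Y = [[0, 0, 0, 0, 3, 3, 99, 3, 3, 1, 1, 1]]  # 99 stands in for a missing value
W = [[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]]
fast_starts = fast_pcf(Y, W, gamma=0.5, length=6, overlap=2)
print(fast_starts)
```

The tiny window length in the usage example is chosen only so that several windows are exercised; the defaults mirror the values quoted above. The heuristic can differ from the exact solution only when a true breakpoint is missed by every window that covers it.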
2.5 Post-processing
Both algorithms yield a single segmentation solution S for all samples. However, we expect that only some of the segments will be shared between all samples while others will be private. While ASCAT can be run directly on the global segmentation solution, removing unnecessary breakpoints on a per-sample basis can reduce noise in the segment average estimates by generating larger segments. To refine breakpoints individually for each sample, we simply use the breakpoints inferred from the multi-sample segmentation and rerun steps 2 and 3 of Algorithm 2 on each sample individually based on these potential breakpoints.
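The per-sample refinement can be sketched as follows: the joint breakpoints serve as the only candidates, the tracks of one sample are aggregated between them, and the exact segmentation is rerun on these blocks. The function name `refine_sample` is hypothetical, and reusing the same penalty γ as in the joint run is an assumption of this sketch.

```python
def refine_sample(rows, weights, starts, gamma):
    """Keep only the joint breakpoints that one sample supports.

    rows, weights: the tracks of ONE sample (e.g. logR and mirrored BAF)
    and their weights, each of length p.
    starts: 0-based segment starts from the multi-sample segmentation.
    Returns the subset of starts retained for this sample.
    """
    p = len(rows[0])
    bounds = list(starts) + [p]
    # Per block and track: (sum of w*y, sum of w).
    sums = [[(sum(wj * yj for wj, yj in zip(w[lo:hi], y[lo:hi])),
              sum(w[lo:hi]))
             for y, w in zip(rows, weights)]
            for lo, hi in zip(bounds, bounds[1:])]
    q = len(sums)
    e, back = [0.0], [0]
    for k in range(1, q + 1):
        best, arg = None, 0
        acc = [[0.0, 0.0] for _ in rows]  # running sums over blocks s..k-1
        for s in range(k - 1, -1, -1):
            c = e[s] + gamma
            for t, (a, d) in zip(acc, sums[s]):
                t[0] += a
                t[1] += d
                if t[1] > 0:
                    c -= t[0] ** 2 / t[1]
            if best is None or c < best:
                best, arg = c, s
        e.append(best)
        back.append(arg)
    kept, k = [], q
    while k > 0:
        k = back[k]
        kept.append(starts[k])
    return kept[::-1]


joint_starts = [0, 4, 9]
ones = [[1] * 12, [1] * 12]
baf = [0.5] * 12
# Sample A has no copy number change at locus 9; sample B does.
sample_a = [[0] * 4 + [3] * 8, baf]
sample_b = [[0] * 4 + [3] * 5 + [1] * 3, baf]
kept_a = refine_sample(sample_a, ones, joint_starts, gamma=0.5)
kept_b = refine_sample(sample_b, ones, joint_starts, gamma=0.5)
print(kept_a, kept_b)
```

In the toy data, the breakpoint at locus 9 is private to sample B, so it is dropped for sample A, whose last two blocks are merged into one larger segment.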
2.6 Implementation
asmultipcf is part of the ASCAT R package from version 2.5 onwards. The asmultipcf function contains a parameter to select whether the exact or the fast algorithm should be run, as well as an option to include the per-sample breakpoint refinement. Furthermore, samples can be weight adjusted to account for quality differences in the data. The manual contains example use cases, including a comparison to HATCHet (Zaccaria and Raphael, 2018).
3 Discussion
The independent segmentation of related samples can artificially inflate tumour heterogeneity. The algorithm presented here addresses this problem by joint segmentation. This approach can potentially underestimate tumour heterogeneity, because CNAs that are shared by many samples are more likely to be detected than CNAs that are private or shared by only a few samples. In practice, however, the penalty parameter γ can be adjusted to ensure sensitivity. Overall, asmultipcf substantially improves the analysis of copy number changes of multiple samples.
Funding
This research was supported by the Cancer Research UK Cambridge Institute with core grant C14303/A17197 and the Francis Crick Institute with core funding from Cancer Research UK [FC001202], the UK Medical Research Council [FC001202] and the Wellcome Trust [FC001202]. P.V.L. is a Winton Group Leader, F.M. is a Royal Society Wolfson Research Merit award holder.
Conflict of Interest: none declared.
Authors: Peter Van Loo; Silje H Nordgard; Ole Christian Lingjærde; Hege G Russnes; Inga H Rye; Wei Sun; Victor J Weigman; Peter Marynen; Anders Zetterberg; Bjørn Naume; Charles M Perou; Anne-Lise Børresen-Dale; Vessela N Kristensen Journal: Proc Natl Acad Sci U S A Date: 2010-09-13 Impact factor: 11.205
Authors: Roland F Schwarz; Charlotte K Y Ng; Susanna L Cooke; Scott Newman; Jillian Temple; Anna M Piskorz; Davina Gale; Karen Sayal; Muhammed Murtaza; Peter J Baldwin; Nitzan Rosenfeld; Helena M Earl; Evis Sala; Mercedes Jimenez-Linan; Christine A Parkinson; Florian Markowetz; James D Brenton Journal: PLoS Med Date: 2015-02-24 Impact factor: 11.069
Authors: Leticia De Mattos-Arruda; Stephen-John Sammut; Edith M Ross; Rachael Bashford-Rogers; Erez Greenstein; Havell Markus; Sandro Morganella; Yvonne Teng; Yosef Maruvka; Bernard Pereira; Oscar M Rueda; Suet-Feung Chin; Tania Contente-Cuomo; Regina Mayor; Alexandra Arias; H Raza Ali; Wei Cope; Daniel Tiezzi; Aliakbar Dariush; Tauanne Dias Amarante; Dan Reshef; Nikaoly Ciriaco; Elena Martinez-Saez; Vicente Peg; Santiago Ramon Y Cajal; Javier Cortes; George Vassiliou; Gad Getz; Serena Nik-Zainal; Muhammed Murtaza; Nir Friedman; Florian Markowetz; Joan Seoane; Carlos Caldas Journal: Cell Rep Date: 2019-05-28 Impact factor: 9.423
Authors: Gro Nilsen; Knut Liestøl; Peter Van Loo; Hans Kristian Moen Vollan; Marianne B Eide; Oscar M Rueda; Suet-Feung Chin; Roslin Russell; Lars O Baumbusch; Carlos Caldas; Anne-Lise Børresen-Dale; Ole Christian Lingjaerde Journal: BMC Genomics Date: 2012-11-04 Impact factor: 3.969
Authors: Stefano Mangiola; Matthew K H Hong; Marek Cmero; Natalie Kurganovs; Andrew Ryan; Anthony J Costello; Niall M Corcoran; Geoff Macintyre; Christopher M Hovens Journal: Sci Rep Date: 2016-09-22 Impact factor: 4.379