
DiNAMIC: a method to identify recurrent DNA copy number aberrations in tumors.

Vonn Walter¹, Andrew B Nobel, Fred A Wright.

Abstract

MOTIVATION: DNA copy number gains and losses are commonly found in tumor tissue, and some of these aberrations play a role in tumor genesis and development. Although high resolution DNA copy number data can be obtained using array-based techniques, no single method is widely used to distinguish between recurrent and sporadic copy number aberrations.
RESULTS: Here we introduce Discovering Copy Number Aberrations Manifested In Cancer (DiNAMIC), a novel method for assessing the statistical significance of recurrent copy number aberrations. In contrast to competing procedures, the testing procedure underlying DiNAMIC is carefully motivated, and employs a novel cyclic permutation scheme. Extensive simulation studies show that DiNAMIC controls false positive discoveries in a variety of realistic scenarios. We use DiNAMIC to analyze two publicly available tumor datasets, and our results show that DiNAMIC detects multiple loci that have biological relevance.
AVAILABILITY: Source code implemented in R, as well as text files containing examples and sample datasets are available at http://www.bios.unc.edu/research/genomic_software/DiNAMIC.

Substances: DNA, Neoplasm

Year: 2010 PMID: 21183584 PMCID: PMC3042182 DOI: 10.1093/bioinformatics/btq717

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

DNA copy number aberrations (CNAs) are commonly found in tumor tissue, and can range from losses (deletions) of one or both copies of chromosomal regions to gains of numerous additional copies (amplifications). The size of these aberrations can range from entire chromosome arms to less than 100 kb (Myllykangas and Knuutila, 2006). A variety of platforms are used to detect CNAs, and provide quantitative signals that reflect the underlying discrete copy number (Coe et al.; Davies et al.; Zhao et al.). Much of the statistical effort in analyzing CNAs has focused on discerning copy number at each location within individual tumors (Hupe et al.; Olshen et al.; Venkatraman and Olshen, 2007), and in handling the potential contamination of normal tissue in tumor samples (Sun et al.). In contrast to heritable copy number variation, CNAs are the result of genomic instability in somatic tumor tissue (Albertson et al.). From the earliest days of modern cancer genetics, it was recognized that such instability could unmask or promote the effects of tumor suppressors and oncogenes (Knudson, 1971; Strachan and Read, 1999). However, surveys of a number of tumor types (e.g. Miller et al.) demonstrate that sporadic gains and losses can also occur throughout the genome, likely representing generic genomic instability, with little effect on tumor progression. The phenomenon of recurrent CNAs, which affect the same region in multiple tumors, is of great interest, as such CNAs may highlight genes or regions that are directly involved in tumor progression. Past studies have detected recurrent CNAs in a wide range of tumor types, with an extensive catalog of these findings in the Mitelman Database (Mitelman et al.) and the Genetic Alterations in Cancer (GAC) database (Jackson et al.). Despite the apparent successes in the field, there is no clear basis for a general approach for sensitive detection of recurrent CNAs, as many regions important for tumor progression may affect only a minority of tumors.
The task of distinguishing between sporadic and recurrent CNAs is thus largely a statistical issue. The instability-selection model introduced by Newton et al. provides a statistical framework specific to loss of heterozygosity (LOH) data, but even for this specific data type difficulties remain in assessing significance over multiple markers (Sterrett and Wright, 2007). The problem of assessing significance for general copy number data has received relatively little attention until recently (Shah, 2008). Few of the existing methods (reviewed below) provide an explicit description of the null hypothesis being tested, or fully acknowledge the inherent correlation structure of copy number data. For these reasons, it has been difficult to place the techniques in a traditional statistical framework or to understand error rates on a genome-wide scale. The purpose of this article is to introduce an explicit testing scheme for recurrent CNAs that preserves correlations inherent to the data.
Before proceeding to our testing framework, we review the current methods for copy number calling/segmentation, which can serve as a useful intermediary to the detection of recurrent CNAs. Numerous technologies are available to measure DNA copy number, ranging from array comparative genomic hybridization (Coe et al.; Davies et al.) at tens of thousands of probes, to high-density SNP platforms (up to 1 million probes or more, Zhao et al.). Reviews of the technologies are provided elsewhere (Davies et al.; Zhao et al.), but a common feature is that a quantitative signal is extracted at each probe that reflects underlying copy number, with additional noise and potentially probe-specific bias inherent to the platform. Regional losses and gains within a single tumor typically cover contiguous sets of numerous probes (Myllykangas and Knuutila, 2006), and so segmentation approaches (Hupe et al.; Olshen et al.; Venkatraman and Olshen, 2007) are popular as a means to estimate the underlying copy number state at each position per tumor.
Here we distinguish between discrete segmentation, where the copy number is constrained to the non-negative integers, and continuous segmentation, where the segmented values need not be integers. Examples of continuous segmentation include methods that essentially average over quantitative probe values within a genomic region determined by the algorithm to be a copy number segment. Regardless of the segmentation procedure, technical artifacts and differences in probe characteristics can lead to probe-specific bias, potentially reducing the accuracy of segmentation. A number of authors have established the presence of probe bias. Marioni et al. show that aCGH data exhibits serial autocorrelation, a phenomenon they term ‘genomic waves’, and Komura et al. note a correlation between apparent DNA copy number and GC content. Marioni et al. and van de Wiel et al. present procedures for ‘smoothing’ genomic waves. Both procedures can be applied to copy number data from normal samples, and the method of van de Wiel et al. can be applied to tumor samples as long as normal samples are present. For sufficient sample sizes, we describe further below an approach to correct the bias by comparing intensities of individual probes using data from surrounding probes (via segmentation), without the need to model or otherwise consider the sequence context.
We also clarify that we are interested in somatic copy number changes in tumors, rather than heritable copy number variants (CNVs). As the resolution of typing technologies increases, it is possible that CNVs, which are rarely larger than 1 Mb (Itsara et al.) and thus considerably shorter than the aberrations found in solid tumors (Albertson et al.), can be mistaken for recurrent CNAs. The distinction can be clarified by comparisons of matched tumor and normal tissue. Researchers using tumor-only datasets should be alert to the possible presence of common copy number polymorphisms when interpreting DiNAMIC output (Redon et al.).
Over a dozen software packages for analyzing DNA copy number data are discussed by Rueda and Diaz-Uriarte (2008), Baross et al. and Shah (2008). We focus here on the approaches that attempt to identify recurrent copy number changes, highlighting the input formats and a few relevant similarities and differences. STAC (Diskin et al.) and CGHregions (van de Wiel and van Wieringen, 2007) require discrete segmented input data, i.e. categorical values such as aberrant/normal, gain/normal/loss or some numerical equivalent. GISTIC (Beroukhim et al.) requires continuous segmented input data, such as one might obtain from a segmentation program such as GLAD (Hupe et al.) or DNAcopy (Olshen et al.; Venkatraman and Olshen, 2007). KC-SMART (Klijn et al.) and MSA (Guttman et al.) accept continuous input data, such as log2 intensity ratios, although MSA performs discrete segmentation internally and then makes multiple calls to the STAC algorithm. GISTIC, KC-SMART, STAC and MSA assess the statistical significance of the most striking marker or region using permutation-based null distributions, while adjusting for multiple comparisons. However, the resulting output differs among the methods. GISTIC produces false discovery rate (FDR) q-values for the ‘significant’ regions. STAC and MSA control the family-wise error rate (FWER) by using the max-T procedure of Westfall and Young (1993), while KC-SMART controls FWER by using a Bonferroni adjustment. GISTIC and KC-SMART analyze genome-wide data, whereas STAC and MSA analyze data at the level of the chromosome or chromosome arm.
Here we introduce Discovering Copy Number Aberrations Manifested In Cancer (DiNAMIC), a new procedure to map recurrent CNAs and assess their statistical significance. DiNAMIC can be applied in the analysis of data from individual chromosomes or genome wide. The input can consist of segmented data, either discrete or continuous.
Alternately, quantitative probe measurements may be used directly, although the reader is advised to read the material on Probe Bias in DNA Copy Number Data in Section 2.4 before analyzing individual probe-level data. DiNAMIC is computationally fast, statistically robust and requires no specialized software. We believe that DiNAMIC is a valuable addition to the methods available to search for recurrent CNAs.

2 METHODS

2.1 Data format and definitions

The data are contained in a numeric n × m matrix X. Each entry $x_{ij}$ represents DNA copy number (or LOH data) for subject i at marker j. In other words, each row $X_{i\cdot}$ of X corresponds to copy number for one subject at m markers, while each column $X_{\cdot j}$ corresponds to data at a single marker for n subjects. Markers that exhibit high or low average copy number are of interest, so it is natural to examine summary statistics for each marker. We define $S_j = \sum_{i=1}^{n} x_{ij}$ to be the sum of the entries in the j-th column of X, leading to the local summary statistics $S_1, S_2, \dots, S_m$. Copy number gains and losses are analyzed separately, and for either type of analysis we want a global summary statistic that is sensitive to the presence of the corresponding CNA. We will restrict our attention to Tgain(X) = $\max(S_1, S_2, \dots, S_m)$ when copy number gains are of interest and Tloss(X) = $\min(S_1, S_2, \dots, S_m)$ when the focus is on copy number losses. For brevity, we restrict all subsequent discussion to Tgain(X) and make comments regarding Tloss(X) when necessary.
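The definitions above can be illustrated with a minimal Python sketch. The function names and the toy matrix are illustrative assumptions and are not part of the DiNAMIC R code; the matrix is stored as a list of rows, one row per subject.

```python
def column_sums(X):
    """Local summary statistics S_1, ..., S_m: one sum per marker (column)."""
    return [sum(col) for col in zip(*X)]


def t_gain(X):
    """Global statistic for gains: the largest column sum."""
    return max(column_sums(X))


def t_loss(X):
    """Global statistic for losses: the smallest column sum."""
    return min(column_sums(X))


# Toy data (hypothetical): n = 3 subjects, m = 4 markers.
X = [
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0, -1.0],
]
S = column_sums(X)
print(S, t_gain(X), t_loss(X))
```

In this toy matrix the second marker carries a gain in every subject, so it determines Tgain(X), while the last marker determines Tloss(X).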

2.2 Permutation, cyclic shift and assessing statistical significance

Sporadic variation in DNA copy number often occurs throughout the genome, so it is important to determine the statistical significance of recurrent CNAs. This can be done if we have a null distribution for Tgain(X) under the hypothesis that no recurrent CNAs are present, i.e. that all CNAs are sporadic. We prefer not to make assumptions about the entries in X, i.e. they may be discrete segmented, continuous segmented or continuous values. Thus, the estimation of a null distribution for Tgain(X) using permutation is attractive. This approach is also broadly taken by GISTIC, STAC, MSA and KC-SMART with their corresponding statistics.
A variety of permutation schemes are possible. Entries in different rows come from different subjects, and may have different rates of CNV, contamination of normal tissue in tumor samples, etc. Thus, permutation of entries across rows should be avoided. KC-SMART randomly permutes the DNA copy number values within a given row, and thus breaks up the genomic positional relationship of the entries. The null distribution assumed by GISTIC is based on a convolution of histograms. Each histogram is created from the entries in a given row, but the serial structure of the entries is not preserved. In contrast, STAC and MSA perform random rearrangements of ‘aberrant’ regions within a given row. These permutations maintain the serial structure within aberrant regions, but require a clear and prior distinction between aberrant and non-aberrant regions.
DNA copy number data is inherently correlated, due to the underlying loss or gain of chromosomal segments, even if the CNAs are sporadic. It is desirable to maintain this correlation under permutation in order to maximize the retained information and to provide proper control of error rates. These considerations provide the motivation for DiNAMIC's permutation scheme.
Let $X_{i\cdot} = (x_{i1}, x_{i2}, \dots, x_{im})$ be the i-th row of X, which corresponds to the data from the i-th subject. For 1 ≤ k ≤ m, we define a cyclic shift of $X_{i\cdot}$ of index k to be $\sigma_k(X_{i\cdot}) = (x_{i,k+1}, \dots, x_{im}, x_{i1}, \dots, x_{ik})$. More generally, a cyclic shift σ(X) of X is obtained by applying cyclic shifts $\sigma_{k_i}$ to each row $X_{i\cdot}$ of X, where the shift index $k_i$ can vary from one row to the next. For each row, this yields a total of m − 1 distinct possible cyclic shifts other than the original state. The biological motivation for cyclic shifts is clear for organisms such as bacteria that have circular chromosomes, for the serial structure of the copy number data from a given row is completely preserved under cyclic shifts. Thus, if the observed copy numbers for each row mimic a circular stationary process (Anderson, 1960), the correlation structure is not changed by the cyclic shifts. Human chromosomes are of course not circular, but the cyclic shift σ(X) maintains all of the serial structure between the markers, except at the breakpoint $x_{ik}$. As the number of markers m is much larger than the total number of breakpoints, we conclude that the correlation structure is approximately maintained.
Another way of motivating the cyclic shift is to consider that the hallmark of a recurrent CNA is that gains or losses tend to ‘line up’ at similar positions across multiple tumors. In contrast, the null hypothesis maintains that sporadic aberrations may occur anywhere on the genome, but that no region is ‘special’. Thus, the cyclic shift assesses significance by shifting each row of X, so that the rarity of apparently recurrent events (multiple tumors showing aberration at a marker) is easily assessed. These assumptions are directly tested in simulations described further below. We now describe our method for assessing the statistical significance of Tgain(X) = $\max(S_1, S_2, \dots, S_m)$. Algorithm 1.
Assessing Statistical Significance of Tgain(X)
(1) (Optional) Set random seed r.
(2) Perform N random cyclic shifts $\sigma^{1}(X), \sigma^{2}(X), \dots, \sigma^{N}(X)$ of the data matrix X.
(3) Compute the value of the summary statistic Tgain($\sigma^{l}(X)$) for each shifted dataset l = 1, 2, …, N.
(4) Define the quantile-based P-value p(Tgain(X)) = $\frac{1}{N} \sum_{l=1}^{N} I(T^{l} \ge T)$, where T = Tgain(X), $T^{l}$ = Tgain($\sigma^{l}(X)$) and I(.) is the indicator function.
The empirical P-value of Tloss(X) is computed by replacing ‘gain’ with ‘loss’ and reversing the inequality in Step 4. Both definitions yield P-values that are easy to interpret and are automatically adjusted for multiple comparisons across the markers.
Natrajan et al. obtained genome-wide aCGH copy number data using 3288 probes on 97 Wilms' tumor samples. By analyzing this data in conjunction with data on tumor relapse, these authors concluded that copy number gains in chr1q are associated with increased risk of tumor relapse. The original data consists of log2 quantitative copy number values, for which we perform segmentation and bias correction (see the material on Probe Bias in DNA Copy Number Data in Section 2.4). The column sums of the resultant X matrix are plotted in Figure 1. Marker 196 on chr1q with probe name RP11-393K10 corresponds to the location of the maximum column sum of X, and applying Algorithm 1 with a random seed r = 12345 and N = 1000 cyclic shifts yields p(Tgain(X)) = 0.001.
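The cyclic shift test can be sketched in a few lines of Python. This is an illustration rather than the authors' R implementation, and it makes several assumptions: shift indices are drawn uniformly and independently for each row (the identity shift is allowed), the P-value is the plain proportion of shifted statistics at least as large as the observed one with no small-sample correction, and the names `cyclic_shift_row` and `p_gain` are hypothetical.

```python
import random


def cyclic_shift_row(row, k):
    """Move the first k entries of a row to its end, keeping the serial order."""
    k %= len(row)
    return row[k:] + row[:k]


def t_gain(X):
    """Largest column sum of the matrix X (a list of rows)."""
    return max(sum(col) for col in zip(*X))


def p_gain(X, n_shifts=1000, seed=None):
    """Empirical P-value of Tgain(X) from random row-wise cyclic shifts."""
    rng = random.Random(seed)  # optional seed, for reproducibility
    m = len(X[0])
    observed = t_gain(X)
    hits = 0
    for _ in range(n_shifts):
        # Each row (subject) receives its own independently drawn shift index.
        shifted = [cyclic_shift_row(row, rng.randrange(m)) for row in X]
        if t_gain(shifted) >= observed:
            hits += 1
    return hits / n_shifts


# Toy data (hypothetical): 20 subjects, 50 markers, every subject gains marker 10.
X = [[1.0 if j == 10 else 0.0 for j in range(50)] for _ in range(20)]
p = p_gain(X, n_shifts=200, seed=12345)
print(p)
```

Because the gains line up at a single marker in the observed data but are scattered after shifting, the shifted maxima rarely reach the observed maximum and the P-value is small. For the loss analysis, the same loop would use the minimum column sum and the reversed inequality.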

Fig. 1.

A plot of the column sums of the Wilms' tumor data of Natrajan et al. The point corresponding to probe RP11-393K10, which corresponds to Tgain(X), is shown in red. Using N = 1000 cyclic shifts, we obtain p(Tgain(X)) = 0.001.

2.3 Peeling

Figure 1 shows that a number of markers have large column sums, both in the region surrounding probe RP11-393K10 and elsewhere in the genome. In fact, Natrajan et al. detected frequent gains on chromosomes 8 and 12. If multiple highly significant recurrent gains are present in the genome, they presumably contribute high copy number values to the overall null distribution, which may reduce power for detection of less-extreme loci. We use these observations as motivation to extend DiNAMIC so that the significance of multiple regions can be assessed in a straightforward manner. Our approach is similar to the one employed by the GISTIC procedure of Beroukhim et al. GISTIC assesses the significance of a ‘new’ region conditional on having found the previously most significant region(s) using a procedure the authors term the ‘peel off’ algorithm. Similarly, we successively correct for each significant region in order to better detect and dissect the recurrent CNAs. However, our peeling method is tailored to the cyclic shift procedure used by DiNAMIC.
For a given data matrix X, our peeling procedure has three components. When analyzing copy number gains we start by identifying the marker k that yields the maximum column sum. Then we find all of the entries in X that contribute to the significance of marker k (i.e. are above their row mean). Algorithm 2, given below, outlines our method for identifying the appropriate entries of X. Next, we multiply these entries of X by a scaling factor τ to create a new data matrix $\tilde{X}$ in which the effect of marker k has been removed. We describe the computation of τ and the creation of $\tilde{X}$ in the second part of Algorithm 2. Then $\tilde{X}$ is subjected to the cyclic shift procedure, with a null distribution conditional on having found marker k in X.
Below we describe the peeling procedure in detail. For convenience, we restrict our attention to copy number gains at marker k, the marker corresponding to the maximum column sum. We write $\bar{x}_{i\cdot}$ and $\bar{x}_{\cdot j}$ for the means of the i-th row and j-th column of X, respectively, and $\bar{x}_{\cdot\cdot}$ for the mean of X.
Algorithm 2. The Peeling Procedure for Copy Number Gains at Marker k
(1) Find the largest interval [a, b] containing k such that the column means $\bar{x}_{\cdot j} > \bar{x}_{\cdot\cdot}$ for all j ∈ [a, b].
(2) If necessary, reduce [a, b] so that the interval contains only markers from the same chromosome arm as marker k.
(3) Let $I = \{i : x_{ik} > \bar{x}_{i\cdot}\}$ be the set of rows such that the entry $x_{ik}$ exceeds the mean of the i-th row.
(4) For each i ∈ I, find the maximal interval $[a_i, b_i]$ such that (i) $[a_i, b_i]$ ⊆ [a, b], (ii) k ∈ $[a_i, b_i]$, (iii) $x_{ij} > \bar{x}_{i\cdot}$ for all j ∈ $[a_i, b_i]$.
We say that $\{x_{ij} : i \in I, j \in [a_i, b_i]\}$ is the set of all matrix entries that contribute to the significance of marker k. Supplementary Figure 1 in the Supplementary Material illustrates the entries of a simulated data matrix X identified by Steps (1)–(4) of the peeling procedure. We now show how to compute the scaling factor τ and the new data matrix $\tilde{X}$.
Algorithm 2 Continued.
(5) Find a constant τ such that $\tau \sum_{i \in I} x_{ik} + \sum_{i \notin I} x_{ik} = n\bar{x}_{\cdot\cdot}$.
(6) Define $\tilde{x}_{ij} = \tau x_{ij}$ if i ∈ I and j ∈ $[a_i, b_i]$, and $\tilde{x}_{ij} = x_{ij}$ otherwise.
(7) Let $\tilde{X}$ be an n × m matrix whose entries are $\tilde{x}_{ij}$.
Recall that marker k corresponds to the maximum column sum in X. By construction, the sum of the k-th column of $\tilde{X}$ is $n\bar{x}_{\cdot\cdot}$, the mean of the column sums in X. Thus, applying the peeling procedure yields a new dataset in which marker k is null. The same constant τ is used to rescale all matrix entries $x_{ij}$ that contribute to the aberration at marker k, so in $\tilde{X}$ we expect markers near k to have column sums close to $n\bar{x}_{\cdot\cdot}$, and thus also be null. Figure 2 shows a plot of the column sums for the Wilms' tumor data of Natrajan et al. before peeling (black and blue, as in Fig. 1) and after peeling (grey). After peeling, column 196 is no longer significant.
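The peeling steps for gains can be sketched in Python as follows. This is a hedged illustration, not the DiNAMIC source: it assumes that the interval in Step 1 requires column means strictly above the overall mean, that τ is chosen so the peeled k-th column sum equals the mean of the column sums, and that chromosome arms are supplied as an optional list of per-marker labels (`arm`, a hypothetical argument). The function name `peel_gain` is also an assumption.

```python
def peel_gain(X, arm=None):
    """Return (peeled matrix, peeled marker index k, scaling factor tau)."""
    n, m = len(X), len(X[0])
    row_means = [sum(row) / m for row in X]
    col_sums = [sum(col) for col in zip(*X)]
    grand_mean = sum(col_sums) / (n * m)
    k = max(range(m), key=lambda j: col_sums[j])  # marker with the largest column sum

    def same_arm(j):
        return arm is None or arm[j] == arm[k]

    # Steps 1-2: widest interval around k with column means above the overall
    # mean, never extending beyond the chromosome arm of marker k.
    a = b = k
    while a > 0 and col_sums[a - 1] / n > grand_mean and same_arm(a - 1):
        a -= 1
    while b < m - 1 and col_sums[b + 1] / n > grand_mean and same_arm(b + 1):
        b += 1

    # Step 3: rows whose entry at marker k exceeds their own row mean.
    rows = [i for i in range(n) if X[i][k] > row_means[i]]

    # Step 4: per-row maximal interval inside [a, b] that stays above the row mean.
    intervals = {}
    for i in rows:
        lo = hi = k
        while lo > a and X[i][lo - 1] > row_means[i]:
            lo -= 1
        while hi < b and X[i][hi + 1] > row_means[i]:
            hi += 1
        intervals[i] = (lo, hi)

    # Step 5: tau makes the k-th column sum equal to the mean column sum.
    inside = sum(X[i][k] for i in rows)
    outside = col_sums[k] - inside
    if inside == 0:
        raise ValueError("no entries to rescale at the peak marker")
    tau = (n * grand_mean - outside) / inside

    # Steps 6-7: rescale only the contributing entries.
    peeled = [list(row) for row in X]
    for i, (lo, hi) in intervals.items():
        for j in range(lo, hi + 1):
            peeled[i][j] = tau * X[i][j]
    return peeled, k, tau


# Toy data (hypothetical): 4 subjects, 6 markers, recurrent gain around marker 2.
X = [
    [0.0, 0.0, 2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 2.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 0.0, 0.0, 0.0],
]
peeled, k, tau = peel_gain(X)
print(k, tau, sum(row[k] for row in peeled))
```

After peeling, the column sum at the peak marker equals the mean of the original column sums, while entries outside the contributing intervals are left untouched.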

Fig. 2.

A plot of the column sums of the Wilms' tumor data of Natrajan et al. The gray lines show the plot of the column sums after applying the peeling procedure to probe RP11-393K10.

The peeling procedure immediately enables the researcher to filter out minor variations in the vicinity of major peaks, and to potentially distinguish among multiple major peaks. Without any further computation (other than the trivial demands of the peeling procedure itself), the peeling procedure may be applied sequentially while comparing all peaks to the original null distribution for significance testing. We note that the number of iterations is an input parameter, and thus the user may choose an appropriate balance between sequential significance of CNAs and computational cost. Such an approach is conservative, as the original X contains the extreme values of true recurrent CNAs, but the resulting P-values are corrected for multiple comparisons. A more powerful but computationally demanding approach is to repeat the cyclic shift assessment of statistical significance using the post-peeling data matrix $\tilde{X}$. Accordingly, DiNAMIC provides two options, Quick Look and Detailed Look, and the flow chart in Supplementary Figure 2 illustrates the differences between the two procedures. In Quick Look, the original null distribution of Tgain(X) is used for significance testing of the most extreme markers, whereas in Detailed Look the null distribution of Tgain(X) is recomputed after each peeling. It is natural to wonder if additional power to detect aberrant markers can be gained by recomputing the null distribution of Tgain(X) after each peeling, and Supplementary Figure 3 indicates that this is the case. A comparison of computation times for Quick Look and Detailed Look can be found in the Supplementary Materials.

Fig. 3.

Power curves for DiNAMIC for simulated datasets containing a single recurrent copy number aberration.

2.4 Probe bias in DNA copy number data

As noted in Section 1, probe-specific variations in hybridization affinity can lead to corresponding variations in array intensity. These in turn can result in biased estimates of DNA copy number. To get some sense of the potential magnitude of the bias, suppose Z is the chromosome 2 data from the glioma dataset of Kotliarov et al., and let Seg(Z) be a continuous segmented version of Z obtained using DNAcopy. Segmentation algorithms use the existing data to model the true underlying copy number as a piece-wise constant function, so the expected column mean of Resid(Z) = Z − Seg(Z) should be zero, with variation that should reflect random error. This assumption can be tested for each marker using a t-test of whether the mean of the entries in the corresponding column of Resid(Z) differs from zero. The histograms in Supplementary Figure 4 of the Supplementary Material show that the t-statistics are markedly overdispersed, which provides clear evidence that probe bias is widespread.
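The per-marker test can be sketched as follows, assuming the standard one-sample t-statistic (column mean divided by its estimated standard error). The function name and toy residual matrix are hypothetical; in practice the residuals would come from a real segmentation such as the one described above.

```python
import math


def column_t_statistics(R):
    """One-sample t-statistic per column of R, testing a column mean of zero."""
    n = len(R)
    stats = []
    for col in zip(*R):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)  # sample variance
        se = math.sqrt(var / n)                            # standard error of the mean
        stats.append(mean / se if se > 0 else float("nan"))
    return stats


# Toy residuals (hypothetical): 3 subjects, 2 markers.
R = [
    [1.0, -1.0],
    [2.0, 1.0],
    [3.0, 0.0],
]
t_stats = column_t_statistics(R)
print(t_stats)
```

Without probe bias, each statistic should behave roughly like a t-distributed variable with n − 1 degrees of freedom, so a histogram of the statistics across markers that is much wider than this reference distribution points to probe-specific bias.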

Fig. 4.

A plot of the column sums of the glioma dataset of Kotliarov et al. near Tgain(X). This figure was constructed using the UCSC Genome Browser (Kent et al.) and LocusZoom (Pruim et al.).

Probe bias can lead to matrices with statistically significant column sums, even in the absence of recurrent CNAs. Failure to correct for it can result in increased type I error. Nevertheless, to the best of our knowledge none of the currently available methods for analyzing DNA copy number data have addressed this issue. One possible method of obtaining a bias-corrected version of the data is to perform continuous segmentation as a preprocessing step, and then to analyze the segmented data. [GISTIC takes this approach, although Beroukhim et al. make no mention of probe bias.] In the Supplementary Material, we discuss simulations that show that probe bias may still affect segmented data. We recommend the following bias-correction procedure when working with data matrices X that contain quantitative probe-level data:
Algorithm 3. Removing Probe Bias
(1) Use a segmentation algorithm to get Seg(X), a segmented version of X.
(2) Compute Resid(X) = X − Seg(X).
(3) Let b be a 1 × m (row) vector whose j-th entry is the mean of the entries in the j-th column of Resid(X).
(4) Let B be an n × m matrix with each row equal to b.
(5) Define $X^{*} = X - B$.
(6) Use a segmentation algorithm to obtain Seg($X^{*}$), a segmented version of $X^{*}$.
The vector b is an estimate of the probe bias in X, and this estimated bias is removed when we compute $X^{*}$. However, it is not appropriate to use DiNAMIC to analyze $X^{*}$ directly. To see why this is the case, we first note that the column sums of $X^{*}$ are identical to the column sums of Seg(X). On the other hand, the entries of the two matrices do not have the same form, because the rows of Seg(X) are piece-wise constant, whereas the rows of $X^{*}$ are not. As a result, the entries in $X^{*}$ are likely to be more variable than those of Seg(X), and clearly the same statement can be made when comparing the entries of σ($X^{*}$) and σ(Seg(X)).
This observation becomes important when we examine the column sums of σ($X^{*}$) and σ(Seg(X)), which need not be the same, even if the same cyclic shift σ is applied to both matrices. Since the entries of σ($X^{*}$) tend to be more variable than those of σ(Seg(X)), the column sums of σ($X^{*}$) tend to be more variable than those of σ(Seg(X)) as well. However, the column sums of σ($X^{*}$) and σ(Seg(X)) should have approximately the same mean value. As a result, Tgain(σ($X^{*}$)) is likely to assume larger values than Tgain(σ(Seg(X))). Therefore, we expect to observe conservative behavior if we use the empirical null distribution of Tgain(σ($X^{*}$)) to assess the significance of Tgain(Seg(X)). Since Tgain($X^{*}$) = Tgain(Seg(X)), it follows that the same conclusion holds when we use cyclic shifts to assess the statistical significance of Tgain($X^{*}$), which is why it is not appropriate to use DiNAMIC to analyze $X^{*}$. Simulation studies of the effectiveness of the peeling procedure are discussed in the Supplementary Material.
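The bias-removal steps can be sketched in Python. The segmentation routine here is a deliberately crude stand-in (means over fixed-width blocks of markers) used only to make the example runnable; it is an assumption and not a substitute for a real segmentation algorithm such as DNAcopy. The names `block_mean_segment`, `remove_probe_bias` and `x_star` (the intermediate matrix X − B) are hypothetical.

```python
def block_mean_segment(X, width=3):
    """Toy segmentation: replace each fixed-width block of a row by its mean."""
    seg = []
    for row in X:
        out = []
        for start in range(0, len(row), width):
            block = row[start:start + width]
            out.extend([sum(block) / len(block)] * len(block))
        seg.append(out)
    return seg


def column_sums(X):
    return [sum(col) for col in zip(*X)]


def remove_probe_bias(X, segment):
    """Return (segmented bias-corrected data, X - B, estimated bias vector b)."""
    n = len(X)
    seg = segment(X)                                          # segment the raw data
    resid = [[x - s for x, s in zip(rx, rs)] for rx, rs in zip(X, seg)]
    b = [sum(col) / n for col in zip(*resid)]                 # per-marker mean residual
    x_star = [[x - bj for x, bj in zip(row, b)] for row in X] # subtract the bias
    return segment(x_star), x_star, b                         # re-segment the corrected data


# Toy probe-level data (hypothetical): 3 subjects, 6 markers.
X = [
    [0.1, 0.3, -0.1, 1.2, 0.9, 1.1],
    [0.0, 0.4, 0.1, 0.1, -0.2, 0.2],
    [1.0, 1.3, 0.8, 0.0, 0.1, -0.1],
]
seg_star, x_star, b = remove_probe_bias(X, block_mean_segment)
print(column_sums(x_star))
print(column_sums(block_mean_segment(X)))
```

The two printed vectors agree up to rounding, which is the identity between the column sums of X − B and those of Seg(X) noted in the text. The matrix passed on for cyclic shift testing would be `seg_star`, whose rows are piece-wise constant, rather than `x_star`.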

3 IMPLEMENTATION

A number of simulated null datasets were created and subsequently analyzed with DiNAMIC in order to study its behavior under the null hypothesis that no recurrent CNAs are present, and the results of these analyses are presented in Table 1. Various marker spacing and correlation schemes were considered in an effort to show that DiNAMIC is robust to the type of deviation from stationarity that can be found in real datasets. A full description of the simulated datasets appears in the Null Simulation Studies section of the Supplementary material. In each case, the observed type I error was computed as follows.
(1) Create a data matrix X using the appropriate simulation scheme.
(2) Compute p(Tgain(X)) using N = 1000 cyclic shifts of X.
(3) Determine whether Tgain(X) is significant at the α = 0.05 level.
Steps (1)–(3) were repeated 10 000 times, and the observed type I error was defined to be the proportion of Tgain(X) that was significant at the α = 0.05 level.

Table 1.

Observed type I error for datasets simulated under the null hypothesis

Null simulation model		Type I error
Copy number data		0.0424
Segmented copy number data		0.0429
Serially correlated normal		0.0466
Clumped copy number data (25%)		0.0473
Clumped copy number data (50%)		0.0450
Clumped copy number data (75%)		0.0456
Clumped copy number data (100%)		0.0410

The values of the observed type I error given in Table 1 suggest that DiNAMIC is slightly conservative, which seems reasonable in light of the effect of the cyclic shift procedure on the underlying correlation of the markers. Markers on either side of a breakpoint will be essentially independent, and hence they are more likely to exhibit greater variability than neighboring markers in the original data. As a result, the distribution of the maximum column sum after cyclic shift should yield larger values than the corresponding distribution for the original data, and similarly for the minimum column sums. Because the values in Table 1 are quite close to 0.05, any difference in the distributions appears to be very minor.
Additional simulations were performed under the alternative hypothesis that a recurrent CNA is present, and a detailed discussion of these simulations can be found in the Power Simulations and Peeling Accuracy section of the Supplementary Material. Briefly, we note that these simulations show that DiNAMIC has equal power to detect gains and losses. Moreover, DiNAMIC's power to detect CNAs increases with the effect size of the aberration. Both of these properties are illustrated by the power curves in Figure 3.
Next we present the results of the analysis of two publicly available tumor datasets. The dataset of Natrajan et al. contains a number of copy number gain and loss loci that are potentially statistically significant. Using both GISTIC and DiNAMIC's Detailed Look, we analyzed a segmented version of this dataset after applying the bias correction scheme described in Section 2.4.
Because no normal tissue reference set was available, the thresholds for amplification and deletion, which are required input parameters for GISTIC, were set to the default values of ±0.1. Table 2 shows all markers that were peeled by DiNAMIC and have either p(Tgain(X)) < 0.025 or p(Tloss(X)) < 0.025 (marked by ‘X’), thereby controlling the overall genome-wide false positive rate (FWER) at α = 0.05. For comparison, we also show all regions detected by GISTIC. By default, GISTIC uses an FDR threshold of q = 0.25, and in order to facilitate comparison with DiNAMIC, we distinguish GISTIC findings with q < 0.05 (marked by ‘X’) from those with 0.05 ≤ q < 0.25 (marked by ‘O’). Note that there are fewer regions declared significant by GISTIC than by DiNAMIC at the respective 0.05 level. For a given error threshold, the FWER is more conservative than the FDR, so this comparison is meaningful.

Table 2.

Markers in the Wilms' tumor dataset of Natrajan et al. discovered by DiNAMIC's Detailed Look and GISTIC

Gain marker	DiNAMIC	GISTIC	Loss marker	DiNAMIC	GISTIC
1q23	X	X	1p36		X
2p16	X	X	1p31	X
2q32	X		2p14	X
2q37	X		2q37	X
5p15	X		3p21	X
6p25		X	3q13	X
6p24	X		4p15	X
6q24	X		4q22		X
7p11	X		4q31	X
7q21		X	5p15	X
7q34	X		5q11	X
8p23	X	X	6p12	X
8q24	X		6q12	X
9q34	X	X	9p24		O
11p15	X	X	9p21	X
12p13	X	X	9q21	X
12q12	X		10p15	X	X
13q12	X		10q11	X
13q31	X		11p15	X
13q32		O	11p13	X	X
15q11		O	11q23	X
16p13	X	X	11q24	X	X
17q25	X		13q21	X	X
18p11	X		14q12	X
18q11	X		14q21	X	X
18q12		X	15q12	X	X
20p11	X		16p11	X
20q13	X		16q21		X
			16q23	X
			17p13	X
			17q12	X
			17q21	X
			18q11	X
			18q21	X	X
			19p12	X
			19q12	X
			21q21	X	X
			22q12	X
			22q13		X

Markers denoted with an ‘X’ are significant at the 0.05 level (DiNAMIC genome-wide P-value and GISTIC q-value). For GISTIC, markers with 0.05 ≤ q < 0.25 are denoted by ‘O.’

Natrajan et al. noted that the most common copy number gains were found in 1q, 8 and 12, with focal gains located at 1q22-25, 8p21-12 and 12p13. Both DiNAMIC and GISTIC detected markers corresponding to these gains. DiNAMIC and GISTIC detected markers at 9q34, the site of the SET oncogene, which is supported by SET protein amplification findings by Carlson et al. in Wilms' tumor. Natrajan et al. also found that gains at 13q31 and 16p13 were associated with tumor relapse. Both methods detected 16p13. DiNAMIC's P-value for the locus in 13q31 is significant at the 0.05 level, whereas GISTIC's q-value for the locus in the neighboring cytoband 13q32 is not. DiNAMIC's detection of 7q34 and 8q24 is noteworthy because the oncogenes BRAF and c-Myc lie in these regions, respectively. Neither of these regions was detected by GISTIC. Losses at 10p15 and 11p13 were found by Natrajan et al. in a number of subjects; these are the sites of WT1 and WT2, genes known to be associated with Wilms' tumor. Both loci were detected by DiNAMIC and GISTIC. The same authors concluded that loss of 21q22 was associated with tumor relapse; both methods detected the nearby locus 21q21. Although the loss sites found by the two methods on 1p, 11q and 16q are not identical, the differences appear to be minor. Using linkage analysis, Rahman et al. discovered FWT1/WT4, a familial Wilms' tumor gene located on 17q12. This site was detected by DiNAMIC but not GISTIC. The gene PDCD6 is located on 5p15, a site that was found by DiNAMIC but not GISTIC. Because PDCD6 is known to be associated with programmed cell death, detection of this locus may have biological relevance. GISTIC and DiNAMIC's Detailed Look were also used to analyze the glioma dataset of Kotliarov et al.
This dataset contains copy number values from 178 tumors, 82 of which are glioblastomas. As above, GISTIC's amplification and deletion thresholds were set to the default values of ±0.1; the q-value threshold was 0.05. With these settings, GISTIC found 47 significant gain regions and 20 significant loss regions. Using DiNAMIC, over 100 loci for gains and losses were found to be significant at the α = 0.05 level. The maximum column sum yielded the most aberrant marker, which is marker 55489 in chr7. Figure 4 shows the column sums near the marker, as well as nearby RefSeq genes (hg18 genomic annotation tracks). The highest peak includes EGFR and a region upstream. EGFR amplification is a very common genetic mutation in glioblastoma (Heimberger ), and the peak finding is a reassuring illustration of the DiNAMIC procedure.
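
The step of locating the most aberrant marker from the column sums can be illustrated with a short sketch. It assumes the copy number data are held as a subject-by-marker list of rows; the function names and the toy values are hypothetical and are not taken from the DiNAMIC R source.

```python
from typing import List, Tuple


def column_sums(x: List[List[float]]) -> List[float]:
    """Sum the copy number values over subjects (rows) for each marker (column)."""
    return [sum(col) for col in zip(*x)]


def most_aberrant_gain(x: List[List[float]]) -> Tuple[int, float]:
    """Return the 0-based index and value of the largest column sum."""
    sums = column_sums(x)
    j = max(range(len(sums)), key=sums.__getitem__)
    return j, sums[j]


# Hypothetical toy matrix: 3 tumors x 5 markers of log-ratio-like values.
toy = [
    [0.0, 0.1, 0.9, 0.2, -0.1],
    [0.1, 0.0, 1.1, 0.0, -0.3],
    [-0.1, 0.2, 0.8, 0.1, -0.2],
]
idx, peak = most_aberrant_gain(toy)
```

In this toy matrix the third marker (index 2) carries a gain in every tumor, so it attains the maximum column sum. Replacing `max` with `min` would give the analogous marker for losses.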

4 DISCUSSION

The analysis of DNA copy number data has proven to be a valuable tool for the study of cancer. Segmentation methods can be used to detect loci where copy number changes occur for a single subject, but different approaches are needed if we wish to assess the statistical significance of DNA copy number changes present in multiple subjects. Here we have introduced DiNAMIC, a new permutation-based method that can be used by researchers to detect statistically significant recurrent CNAs. Compared with existing methods, DiNAMIC has a number of advantages. First, since DiNAMIC makes no distributional assumptions, it can analyze a variety of input data: continuous, continuous segmented or discrete segmented. In addition, DiNAMIC can analyze genome-wide data, as well as data from a single chromosome or chromosome arm. Finally, it does not require any tuning parameters, such as user-defined thresholds for gain or loss, that are potentially arbitrary.

DiNAMIC is similar to GISTIC, STAC, MSA and KC-SMART in that it assesses statistical significance using a permutation-based null distribution. However, in contrast to the other procedures, the cyclic shifts employed by DiNAMIC preserve essentially the entire serial structure in the data. The serial marker correlation can be very high, especially for segmented data. For example, the average successive marker correlation of the segmented glioma data of Kotliarov et al. was 0.985. Thus, DiNAMIC can preserve the overall false positive rate, without the need to resort to overly conservative procedures, which are sometimes employed by other methods. For example, GISTIC provides multiple comparison control across markers via Benjamini–Hochberg FDR control, which is generally conservative under positive dependence structures (Benjamini and Yekutieli, 2001). KC-SMART uses the Bonferroni method to control for multiple testing, but this is also known to be conservative (Simes, 1986).
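
The idea of a cyclic-shift null distribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the DiNAMIC implementation: it assumes each subject's row is rotated by an independent, uniformly chosen offset, it tests only the maximum column sum (gains), and it uses the common (count + 1)/(permutations + 1) P-value convention. Chromosome boundaries, the bias correction and the Detailed Look procedure are not reproduced, and all function names are hypothetical.

```python
import random
from typing import List, Sequence


def cyclic_shift(row: Sequence[float], k: int) -> List[float]:
    """Rotate a row right by k positions; markers pushed past the end wrap around."""
    k %= len(row)
    return list(row[-k:]) + list(row[:-k]) if k else list(row)


def max_column_sum(x: Sequence[Sequence[float]]) -> float:
    """Largest sum over subjects across all markers."""
    return max(sum(col) for col in zip(*x))


def cyclic_shift_p_value(
    x: Sequence[Sequence[float]], n_perm: int = 999, seed: int = 0
) -> float:
    """Permutation P-value for the maximum column sum under random cyclic shifts."""
    rng = random.Random(seed)
    n_markers = len(x[0])
    observed = max_column_sum(x)
    exceed = 0
    for _ in range(n_perm):
        # Each subject gets its own offset, so neighbouring markers stay
        # neighbours within a row while alignment across subjects is broken.
        shifted = [cyclic_shift(row, rng.randrange(n_markers)) for row in x]
        if max_column_sum(shifted) >= observed:
            exceed += 1
    # Assumed convention: count the observed data as one of the permutations.
    return (exceed + 1) / (n_perm + 1)
```

Because a rotation keeps every pair of adjacent markers adjacent, apart from the single wrap-around point, the within-subject correlation is retained in each permuted dataset. Only the alignment of aberrations across subjects is disrupted, which is the feature a test for recurrence needs to break.
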
Based on our analysis of real datasets, it appears that DiNAMIC performs well when compared to currently available methods. In addition, extensive simulation studies are described in Section 3, and these are performed under a variety of marker correlation schemes. The results of these simulations show that DiNAMIC controls the type I error, and is slightly conservative, even when the markers follow non-stationary correlation structures similar to those found in real data. Probe bias is potentially problematic, however. Even under the null hypothesis that no CNAs are present, probe bias can make some columns more likely than others to attain the minimum or maximum column sum. We propose a bias correction procedure in Section 2, and use simulations to illustrate its effectiveness, as discussed in the Supplementary Material. Presumably, other methods for detecting recurrent CNAs are also susceptible to probe bias, but we have not seen this issue discussed elsewhere.

DiNAMIC currently uses the column sums S to assess the local evidence for excess copy number gains and losses, and the global statistics Tgain(X) and Tloss(X) for global testing. These statistics are intuitive and easy to describe, but may potentially ignore additional information, such as the simultaneous evidence provided by multiple markers in a region. In addition, we do not explore the true dissection of multiple regions, which may exhibit correlated gain or loss structure across different tumors. Our intent in developing DiNAMIC is to create a statistically sound testing structure, which has immediate utility and can serve as the basis for further extensions.

We have also performed simulation studies (data not shown) that indicate that DiNAMIC may also be applied to LOH data, provided the data have few missing values. However, in standard SNP-based LOH calling, the presence of heterozygosity produces a larger number of missing values, with missingness rates that vary by marker. Thus, we leave the application of DiNAMIC to LOH data as an extension for future research.

33 in total

1. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

2. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays.

Authors: Xiaojun Zhao; Cheng Li; J Guillermo Paez; Koei Chin; Pasi A Jänne; Tzu-Hsiu Chen; Luc Girard; John Minna; David Christiani; Chris Leo; Joe W Gray; William R Sellers; Matthew Meyerson
Journal: Cancer Res Date: 2004-05-01 Impact factor: 12.701

3. Evidence for a familial Wilms' tumour gene (FWT1) on chromosome 17q12-q21.

Authors: N Rahman; L Arbour; P Tonin; J Renshaw; J Pelletier; S Baruchel; K Pritchard-Jones; M R Stratton; S A Narod
Journal: Nat Genet Date: 1996-08 Impact factor: 38.330

Review 4. Computational methods for identification of recurrent copy number alteration patterns by array CGH.

Authors: S P Shah
Journal: Cytogenet Genome Res Date: 2009-03-11 Impact factor: 1.636

5. On the statistical analysis of allelic-loss data.

Authors: M A Newton; M N Gould; C A Reznikoff; J D Haag
Journal: Stat Med Date: 1998-07-15 Impact factor: 2.373

6. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization.

Authors: John C Marioni; Natalie P Thorne; Armand Valsesia; Tomas Fitzgerald; Richard Redon; Heike Fiegler; T Daniel Andrews; Barbara E Stranger; Andrew G Lynch; Emmanouil T Dermitzakis; Nigel P Carter; Simon Tavaré; Matthew E Hurles
Journal: Genome Biol Date: 2007 Impact factor: 13.583

Review 7. Chromosome aberrations in solid tumors.

Authors: Donna G Albertson; Colin Collins; Frank McCormick; Joe W Gray
Journal: Nat Genet Date: 2003-08 Impact factor: 38.330

8. Expression of SET, an inhibitor of protein phosphatase 2A, in renal development and Wilms' tumor.

Authors: S G Carlson; E Eng; E G Kim; E J Perlman; T D Copeland; B J Ballermann
Journal: J Am Soc Nephrol Date: 1998-10 Impact factor: 10.121

9. Inferring the location of tumor suppressor genes by modeling frequency of allelic loss.

Authors: Andrew Sterrett; Fred A Wright
Journal: Biometrics Date: 2007-03 Impact factor: 2.571

10. CGHregions: dimension reduction for array CGH data with minimal information loss.

Authors: Mark A van de Wiel; Wessel N van Wieringen
Journal: Cancer Inform Date: 2007-02-08
