Franck Rapaport, Christina Leslie.
Abstract
Cancer progression is often driven by an accumulation of genetic changes but also accompanied by increasing genomic instability. These processes lead to a complicated landscape of copy number alterations (CNAs) within individual tumors and great diversity across tumor samples. High-resolution array-based comparative genomic hybridization (aCGH) is being used to profile CNAs of ever larger tumor collections, and better computational methods for processing these data sets and identifying potential driver CNAs are needed. Typical studies of aCGH data sets take a pipeline approach, starting with segmentation of profiles, calls of gains and losses, and finally determination of frequent CNAs across samples. A drawback of pipelines is that choices at each step may produce different results, and biases are propagated forward. We present a mathematically robust new method that exploits probe-level correlations in aCGH data to discover subsets of samples that display common CNAs. Our algorithm is related to recent work on maximum-margin clustering. It does not require pre-segmentation of the data and also provides grouping of recurrent CNAs into clusters. We tested our approach on a large cohort of glioblastoma aCGH samples from The Cancer Genome Atlas and recovered almost all CNAs reported in the initial study. We also found additional significant CNAs missed by the original analysis but supported by earlier studies, and we identified significant correlations between CNAs.
Year: 2010 PMID: 20711339 PMCID: PMC2920822 DOI: 10.1371/journal.pone.0012028
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Toy representation of a linear partition of aCGH samples using large-margin techniques.
The algorithm finds a linear function that partitions the aCGH samples into two groups. By solving an optimization problem, the algorithm determines a weight vector, which geometrically represents the normal vector of a hyperplane (shown in red) separating the samples, along with a bias term and the assignment of samples to groups. In the toy example shown, the hyperplane separates the samples that present a deletion on the q arm (above the hyperplane) from the ones that do not (below the hyperplane).
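The geometric picture in this caption can be made concrete with a minimal sketch. It only shows how a linear function assigns samples to the two sides of a hyperplane once a weight vector and a bias are known; it does not perform the optimization that finds them. The names `w`, `b`, `assign_groups`, and the toy profiles are illustrative assumptions, not values or symbols taken from the paper.

```python
from typing import List, Sequence


def assign_groups(samples: Sequence[Sequence[float]],
                  w: Sequence[float],
                  b: float) -> List[int]:
    """Label each sample +1 or -1 by the side of the hyperplane w.x + b = 0."""
    labels = []
    for x in samples:
        # Linear score: dot product with the normal vector plus the bias.
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        labels.append(1 if score >= 0 else -1)
    return labels


# Toy log-ratio profiles over 4 probes; the last two probes stand in
# for the q arm (hypothetical data, chosen by hand).
profiles = [
    [0.0, 0.1, -0.9, -1.0],   # deletion on the q arm
    [0.1, 0.0, -0.8, -1.1],   # deletion on the q arm
    [0.0, -0.1, 0.1, 0.0],    # no deletion
    [0.1, 0.0, 0.0, -0.1],    # no deletion
]

# Hand-picked weights and bias (assumptions): only q-arm probes matter,
# and strongly negative q-arm signal pushes a sample to the +1 side.
w = [0.0, 0.0, -1.0, -1.0]
b = -1.0

labels = assign_groups(profiles, w, b)
print(labels)
```

With these hand-picked values, the two profiles carrying the q-arm deletion fall on one side of the hyperplane and the other two on the opposite side, mirroring the toy figure.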
Figure 2. Clustering tree for chromosome 19.
At each iteration of the algorithm, each previously identified group of samples is partitioned into two new clusters using a maximum-margin clustering technique that exploits the correlations in aCGH profiles (see Methods). The partitioning process stops when (i) a group has fewer than 5 samples; (ii) the partition generating the group fails to meet the statistical significance threshold; or (iii) the tree is already at the maximum depth of 3. In the figure, each group is represented by its centroid, i.e. its median profile, in green. For visualization purposes, the segmentation of the centroid, produced by circular binary segmentation [30], is shown in red.
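The three stopping rules in this caption can be sketched as a short recursive procedure. This is an illustration under stated assumptions, not the authors' implementation: the `split` callable is a placeholder for the maximum-margin clustering step, `toy_split` is a simple stand-in that halves a group by mean signal and returns a made-up p-value, and `alpha` is a hypothetical parameter because the threshold value is not given in the text above.

```python
from statistics import fmean
from typing import Any, Callable, Dict, List, Sequence, Tuple

Profile = Sequence[float]
SplitFn = Callable[[List[Profile]], Tuple[List[Profile], List[Profile], float]]


def build_tree(group: List[Profile], split: SplitFn, alpha: float,
               min_size: int = 5, max_depth: int = 3,
               depth: int = 0, significant: bool = True) -> Dict[str, Any]:
    """Recursively partition a group, honoring the three stopping rules."""
    node: Dict[str, Any] = {"size": len(group), "depth": depth, "children": []}
    # (i) too few samples, (ii) generating partition not significant,
    # (iii) maximum depth reached: the group is not partitioned further.
    if len(group) < min_size or not significant or depth >= max_depth:
        return node
    left, right, p_value = split(group)
    ok = p_value <= alpha
    node["children"] = [
        build_tree(part, split, alpha, min_size, max_depth, depth + 1, ok)
        for part in (left, right)
    ]
    return node


def count_leaves(node: Dict[str, Any]) -> int:
    """Number of terminal groups in the tree."""
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


def toy_split(group: List[Profile]) -> Tuple[List[Profile], List[Profile], float]:
    """Stand-in for the clustering step (assumption): halve by mean signal."""
    ordered = sorted(group, key=fmean)
    half = len(ordered) // 2
    left, right = ordered[:half], ordered[half:]
    gap = fmean(fmean(p) for p in right) - fmean(fmean(p) for p in left)
    # Made-up p-value: well-separated halves count as significant.
    return left, right, (0.01 if gap > 0.5 else 1.0)


# Twelve one-probe toy profiles: six near -1 (loss) and six near 0 (neutral).
samples = [[-1.0], [-1.1], [-0.9], [-1.0], [-1.05], [-0.95],
           [0.0], [0.1], [-0.1], [0.05], [-0.05], [0.0]]
tree = build_tree(samples, toy_split, alpha=0.05)
print(count_leaves(tree))
```

In this reading of rule (ii), groups produced by a non-significant partition stay in the tree but are never partitioned again; discarding such groups outright would be an equally plausible variant.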
Summary of significant events in glioblastoma data set.
| Event | Iter. | # of samples | % of samples | Size of event | Correlated genes | Examples of candidate genes |
| (a) Chr. 1 | 1 | 26 | 7.5% | 247 Mbp | | LCK, PAX7, RPL22 |
| | 3 | 6 | 1.7% | 25 Kbp | | |
| | 3 | 7 | 2.0% | 236 Kbp | | CHIC2, FIP1L1, KIT, PDGFRA |
| (d) 6q | 2 | 31 | 9% | 110 Mbp | | FOXO3A |
| (a) Chr. 7 | 1 | 169 | 49% | 158 Mbp | | BRAF, CDK6, EGFR, ELN, HIP1, PMS2, SMO, TIF1 |
| | 2 | 76 | 22% | 37 Kbp | | EGFR |
| (d) 9p | 1 | 99 | 29% | 47 Mbp | | CDKN2A-p14ARF, CDKN2A-p16(INK4a), FANCG, JAK2, MLLT3, PSIP2 |
| | 3 | 7 | 2% | 140 Mbp | | |
| (d) Chr. 10 | 1 | 154 | 45% | 135 Mbp | | BMPR1A, D10S70, MYST4, NCOA4, PTEN, SSH3BP1 |
| (d) Chr. 13 | 1 | 61 | 18% | 114 Mbp | | ERCC5, FOXO1A, LHFP, RB1, ZNF198 |
| (d) Chr. 14 | 1 | 165 | 48% | 106 Mbp | | AKT1, BCL11B, DICER1, GPHN, KTN1, TCL1A, TCL6, TSHR |
| | 2 | 21 | 6.1% | 100 Mbp | | BLM, CRTC3, NTRK3, PML |
| | 2 | 15 | 4.3% | 88.8 Mbp | | CBFB, CDH1, CREBBP, CYLD, HERPUD1, IL21R, CDH11, MAF, MHC2TA, MYH11, TNFRSF17 |
| | 1 | 17 | 4.9% | 25.2 Mbp | | BCL3, ERCC2, TFPT, ZNF331 |
| (a) Chr. 19 | 2 | 76 | 22% | 63.8 Mbp | | AKT2, BCL3, BRD4, CIC, ELL, ERCC2, KLK2, SH3GL1, STK11, TCF3, TFPT, TPM4, ZNF331 |
| (a) Chr. 20 | 1 | 74 | 21% | 62.4 Mbp | | ASXL1, GNAS, SS18L1, TOP1 |
| | 3 | 6 | 1.7% | 46.9 Mbp | | ERG, RUNX1, DSCR1 |
| (d) Chr. 22 | 1 | 300 | 87% | 49.7 Mbp | | CTCL1, EWSR1, MKL1, SMRCB1, ZNF278 |
For each event, we indicate the iteration in which it was found, the number of samples assigned to the corresponding cluster, and the percentage of the total number of samples this represents. Deletions are denoted by the symbol (d) and amplifications by the symbol (a). Region names in boldface denote novel CNAs that were not found by previous analyses, while underlined regions represent short events. Candidate genes are the genes in the region that are significantly differentially overexpressed if the CNA is an amplification, or significantly differentially underexpressed if the CNA is a deletion, according to a SAM analysis and out of the total number of genes in the region.
Figure 3. Comparison of the gains and losses found by iterative partitioning versus previous analyses.
The horizontal tracks show the CNAs identified by the first three iterations of our method, compared to the ones found by GISTIC and RAE. The middle track depicts the chromosomes, with even chromosome numbers annotated. Gains are shown in red and losses in blue.