
FocalCall: An R Package for the Annotation of Focal Copy Number Aberrations.

Oscar Krijgsman¹, Christian Benner¹, Gerrit A Meijer¹, Mark A van de Wiel², Bauke Ylstra¹.

Abstract

To identify somatic focal copy number aberrations (CNAs) in cancer specimens and to distinguish them from germ-line copy number variations (CNVs), we developed the software package FocalCall. FocalCall enables user-defined size cutoffs to recognize focal aberrations and builds on established array comparative genomic hybridization segmentation and calling algorithms. To distinguish CNAs from CNVs, the algorithm uses matched patient normal signals as references or, if these are not available, a list of known CNVs in a population. Furthermore, FocalCall differentiates between homozygous and heterozygous deletions as well as between gains and amplifications and is applicable to high-resolution array and sequencing data.


Keywords: DNA copy number; R-package; aCGH; focal CNAs; sequencing

Year: 2014 PMID: 25506197 PMCID: PMC4251178 DOI: 10.4137/CIN.S19519

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

The increase in the resolving power of DNA copy number profiling techniques has led to the simultaneous discovery of the extent of (1) copy number variations (CNVs) of germ-line origin in the general population [1] and (2) focal copy number aberrations (CNAs) of somatic origin in cancer specimens [2]. The limited size of focal CNAs offers an excellent opportunity to pinpoint potential driver genes in cancer [3–6]. CNVs, however, are usually an obstacle in the identification of cancer driver genes: copy number assessment in tumors detects a mix of focal CNAs and CNVs, most of which have the same appearance (Fig. 1). A procedure that partly circumvents the interference of CNVs in tumor samples is the simultaneous analysis of matched patient normal DNA. However, if the diploid balance in a tumor is disturbed, ie, by a single copy gain, a heterozygous CNV will still give rise to a superimposed focal signal. To recognize the CNAs, a negative selection procedure can be applied by identifying CNVs detected in the healthy population through the analysis of a series of healthy normal copy number profiles, preferably patient group matched, or otherwise through an external database of genomic variants (ie, DGV) [7]. Alternatively, an effective positive selection is the identification of focal homozygous deletions and high-level amplifications that differ in amplitude from CNVs [5]. This approach, however, neglects many heterozygous focal CNAs.

Figure 1

Copy number profiles of a lung cancer sequencing sample and matched patient normal signal [12]. Panel (A) shows all aberrations in the tumor sample. X-axis represents the bins ordered according to their chromosomal location. Y-axis represents the log2 ratio (right side). The red line indicates the segmented values as obtained using circular binary segmentation in CGHcall [11]. Panel (B) shows chromosomes 3 (left) and 10 (right) both for patient normal and tumor sample. The gray arrow in the left panels indicates a focal CNV present in both tumor and matched patient normal sample. Somatic focal CNAs on chromosome 10 are only present in the tumor and not in the matched patient normal sample. Focal CNAs and CNVs were detected using focalCall().

Despite the great opportunities focal CNAs offer for cancer gene discovery, only a few software tools are available that address them, eg, GISTIC, WIFA, and Control-FREEC [8–10]. Both GISTIC and WIFA were developed for array data and can detect focal CNAs in series of samples, but not in individual tumor profiles. GISTIC has a dedicated option to discriminate focal CNAs from CNVs based on an external database. Control-FREEC was developed to calculate genome-wide copy number information from whole genome sequencing data and can distinguish CNAs from CNVs, provided a matched patient normal signal is available. Here, we present FocalCall, which builds on commonly used segmentation and calling algorithms [11]. A user-defined size cutoff allows for the identification of focal CNAs in individual samples as well as in series of samples, and FocalCall can distinguish them from CNVs. FocalCall accepts copy number data from high-resolution genome-wide array comparative genomic hybridization (aCGH) and single nucleotide polymorphism (SNP) arrays as well as from sequencing experiments [12], with or without a matched patient normal signal.

Methods

Patient materials and settings

FocalCall was evaluated with four publicly available data sets: (1) shallow whole genome sequencing data (∼0.2× genome coverage) from tumor and normal DNA of a lung cancer patient [12]; (2) SNP array (250K) data of 371 lung cancer patients without matched patient normal samples [2]; (3) aCGH data (244K) of 74 glioblastoma multiforme (GBM) patients, each hybridized against its matched normal [13]; and (4) aCGH data (105K) of 60 high-grade cervical cancer precursor lesions hybridized against a pool of 100 healthy individuals [4]. Dataset 4 is available from the Gene Expression Omnibus (GSE34575) and is used as an example dataset in the R-package.

Detection of recurrent aberrations

Standard data output as produced by CGHcall [11] was used as input for the main function focalCall(). Aberrations below the user-defined size threshold for focal CNAs (default = 3 Mb) were identified in each cancer sample and categorized as “gain”, “loss”, “amplification”, or “homozygous deletion”. For each region, the smallest region of overlap (SRO) was calculated over the complete sample set. Complex regions may contain multiple SROs (Supplementary Figs. 1 and 2). To determine whether focal CNAs were enriched for cancer driver genes, enrichment analysis was performed [3]. In brief, 10,000 sets of simulated focal CNAs were randomly generated throughout the genome, each with the same number and lengths as the observed focal CNAs in the dataset. The overlap of the simulated focal CNAs with the published list of cancer census genes was determined, and the significance of enrichment was expressed as a P value.
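The size filter and the enrichment test described above can be sketched in Python as follows. This is an illustrative sketch only, not FocalCall's R implementation: the tuple layout, the function names `focal_aberrations` and `enrichment_p_value`, the call coding (-2, -1, 0, 1, 2), the uniform random placement of simulated CNAs, and the (k + 1)/(n + 1) estimate of the P value are all assumptions made for the example.

```python
import random

FOCAL_MAX_BP = 3_000_000  # default size threshold stated in the text (3 Mb)

# Assumed call coding; labels follow the four categories named in the text.
CALL_LABELS = {-2: "homozygous deletion", -1: "loss", 1: "gain", 2: "amplification"}


def focal_aberrations(segments, max_bp=FOCAL_MAX_BP):
    """Keep aberrant segments (call != 0) shorter than max_bp.

    segments: iterable of (chrom, start, end, call) tuples.
    Returns a list of (chrom, start, end, label) tuples.
    """
    return [(c, s, e, CALL_LABELS[call])
            for c, s, e, call in segments
            if call != 0 and e - s < max_bp]


def enrichment_p_value(focal, genes, chrom_sizes, n_sim=10_000, seed=0):
    """Permutation test: are focal CNAs enriched for the given genes?

    focal, genes: lists of (chrom, start, end) half-open intervals.
    Assumes every focal CNA is shorter than every chromosome.
    Returns (observed gene hits, estimated P value).
    """
    rng = random.Random(seed)

    def hits(intervals):
        # Number of genes overlapped by at least one interval.
        return sum(1 for gc, gs, ge in genes
                   if any(c == gc and s < ge and gs < e for c, s, e in intervals))

    observed = hits(focal)
    lengths = [e - s for _, s, e in focal]
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]

    at_least = 0
    for _ in range(n_sim):
        simulated = []
        for length in lengths:
            # Pick a chromosome proportionally to its size, then a start position.
            chrom = rng.choices(chroms, weights=weights)[0]
            start = rng.randrange(0, chrom_sizes[chrom] - length + 1)
            simulated.append((chrom, start, start + length))
        if hits(simulated) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_sim + 1)


# Toy data (invented for illustration, not taken from the paper).
segments = [("1", 0, 50_000_000, 1),      # broad gain, not focal
            ("1", 95_000, 120_000, 2),    # focal amplification
            ("1", 500_000, 600_000, 0)]   # normal segment
focal = focal_aberrations(segments)
observed, p_value = enrichment_p_value(
    [(c, s, e) for c, s, e, _ in focal],
    genes=[("1", 100_000, 110_000)],
    chrom_sizes={"1": 100_000_000},
    n_sim=2_000,
)
```

In this toy example the single focal amplification covers the single gene, so a small P value is expected; with real data the gene list would be the cancer census genes and `chrom_sizes` the reference genome in use.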

Distinction between focal CNAs and CNVs

For each SRO (Supplementary Fig. 1), the percentage of overlap of focal CNAs with a normal reference or known CNVs is returned. If matched patient reference data are available, they can be provided to focalCall() as a separate CGHcall object. If no matched patient reference is available, focal CNAs are compared to a list of genomic locations of known CNVs, which can be provided to focalCall() as a flat text or BED file.
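As a rough illustration of these two quantities, the sketch below computes an SRO as the intersection of overlapping focal CNAs on one chromosome and then the percentage of that SRO covered by known CNVs. The function names and the interpretation of "percentage of overlap" as the covered fraction of the SRO length are assumptions; the handling of complex regions with multiple SROs in FocalCall is not reproduced here.

```python
def smallest_region_of_overlap(intervals):
    """Intersection of (start, end) intervals from different samples.

    Returns (start, end), or None if the intervals share no common stretch.
    """
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None


def percent_covered(region, cnvs):
    """Percentage of `region` covered by the union of CNV intervals."""
    start, end = region
    covered = 0
    cursor = start  # right edge of what has been counted so far
    for s, e in sorted(cnvs):
        s, e = max(s, cursor), min(e, end)  # clip to the uncounted part of the region
        if s < e:
            covered += e - s
            cursor = e
    return 100.0 * covered / (end - start)


# Toy coordinates on a single chromosome (invented for illustration).
focal_cnas = [(100, 500), (200, 600), (150, 450)]   # one focal CNA per sample
known_cnvs = [(180, 250), (240, 300), (700, 800)]   # eg, from a CNV list

sro = smallest_region_of_overlap(focal_cnas)        # (200, 450)
pct = percent_covered(sro, known_cnvs)              # 40.0
```

Here the two overlapping CNVs cover positions 200 to 300 of the 250-unit SRO, which gives 40%.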

Reporting of focal CNAs

The function igvFiles() generates tracks compatible with the Integrative Genomics Viewer (IGV, www.broadinstitute.org/igv/home) for CNA frequency, focal CNA frequency, and segmentation values per sample (Supplementary Fig. 3). This allows the user to visually inspect the results generated by FocalCall. The functions freqPlot() and freqPlotFocal() generate .png files for CNA frequency and focal CNA frequency, respectively (Fig. 2).
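The frequencies shown in such tracks and plots can be thought of as the fraction of samples carrying a gain or a loss at each bin. A minimal sketch, assuming calls coded as negative for losses, 0 for normal, and positive for gains; the function name and data layout are hypothetical and do not reflect the internals of igvFiles() or the plotting functions.

```python
def aberration_frequency(calls):
    """Per-bin gain and loss frequencies across samples.

    calls: list of samples, each a list of integer calls per bin
           (all samples must have the same number of bins).
    Returns (gain_freq, loss_freq), two lists with one fraction per bin.
    """
    n_samples = len(calls)
    n_bins = len(calls[0])
    gain = [sum(sample[i] > 0 for sample in calls) / n_samples for i in range(n_bins)]
    loss = [sum(sample[i] < 0 for sample in calls) / n_samples for i in range(n_bins)]
    return gain, loss


# Four toy samples with four bins each (invented for illustration).
calls = [[0, 1, 2, -1],
         [0, 1, 0, -2],
         [0, 0, 0, -1],
         [1, 1, 0, 0]]
gain_freq, loss_freq = aberration_frequency(calls)
```

Applying the same function to calls restricted to focal aberrations would give the focal CNA frequency.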

Figure 2

Frequency plots of the GBM dataset of all aberrations (top) and focal aberrations and CNVs (bottom) as generated by FocalCall functions freqPlot() and FreqPlotFocal(). Red indicates a gain and blue indicates a loss. In the frequency plot of focal aberrations (bottom), the somatic focal aberrations are indicated in red for gains and blue for losses. CNVs are indicated in gray, both for gains and losses.

Computational time

The computational time for the detection of focal CNAs in the GBM dataset (n = 74, 244K probes) with default parameters is approximately 7 minutes on a standard desktop computer with a 1.7 GHz CPU and 4 GB of RAM.

Results

Detection of focal CNAs in single patient and series of tumors

The lung cancer sequencing data yielded a total of 38 focal gains and losses: 7 were identified as CNVs and 31 as focal CNAs, of which 6 were high-level amplifications (including FGFR1) and 4 were homozygous deletions (including CDKN2A, Fig. 1 and Supplementary Table). The lung cancer SNP array dataset yielded a total of 503 focal CNAs with a frequency >5%. A total of 43 of the focal gains and losses overlapped with the CNV regions as archived in the DGV database [7]. All genes in focal CNAs detected by GISTIC in the original paper were also detected by FocalCall [2]. The remaining 460 detected focal CNAs were enriched for known cancer driver genes (n = 6, P < 0.05) and included GNAS and KDM5A. The GBM aCGH dataset yielded a total of 434 somatic focal CNAs and 90 CNVs. The focal CNAs encompassed known cancer driver genes such as EGFR, PTEN, and CDKN2A. All 20 focal CNAs previously reported by GISTIC [13] were recognized by FocalCall. Additionally detected focal CNAs showed a highly significant enrichment for known cancer driver genes (n = 38, P < 0.008). The cervical precursor lesion aCGH dataset yielded a total of 94 focal CNAs with FocalCall. Two of the identified genes, hsa-mir-375 and EYA2, were functionally tested and validated as a new oncogene and tumor suppressor gene [4]. The data and example scripts for this dataset are available in the R-package.

Conclusion

Focal CNAs provide an excellent opportunity to detect potential cancer driver genes [6]. Through advances in techniques, the resolution of DNA copy number detection has increased enormously and the changes we can identify have become smaller. Accurate detection of somatic aberrations and their distinction from germ-line CNVs are therefore mandatory. FocalCall offers researchers a user-friendly tool to detect focal CNAs in high-resolution DNA copy number data and provides multiple methods to distinguish these from CNVs. FocalCall builds on the widely used DNA copy number tool CGHcall [11] and on comprehensive genome analysis packages in the R/Bioconductor environment. In addition, FocalCall output in the IGV data format allows for easy browsing through the data and provides a direct link with the genes affected. In conclusion, we provide an alternative and sensitive procedure for the detection of focal CNAs applicable to both individual samples and series of samples analyzed by either array or next-generation sequencing.

Supplementary Figure 1. Graphical explanation of how the smallest region of overlap is calculated.
Supplementary Figure 2. Flowchart for FocalCall procedures from input to output data.
Supplementary Figure 3. IGV example with the segment values of the GBM dataset.
Supplementary Table. FocalCall output for the single sample lung patient data.
Supplementary Vignette. Explanation, R-code and output of the executable example data provided with the R-package.

13 in total

Review 1. Structural variation in the human genome.

Authors: Lars Feuk; Andrew R Carson; Stephen W Scherer
Journal: Nat Rev Genet Date: 2006-02 Impact factor: 53.242

2. The somatic genomic landscape of glioblastoma.

Authors: Cameron W Brennan; Roel G W Verhaak; Aaron McKenna; Benito Campos; Houtan Noushmehr; Sofie R Salama; Siyuan Zheng; Debyani Chakravarty; J Zachary Sanborn; Samuel H Berman; Rameen Beroukhim; Brady Bernard; Chang-Jiun Wu; Giannicola Genovese; Ilya Shmulevich; Jill Barnholtz-Sloan; Lihua Zou; Rahulsimham Vegesna; Sachet A Shukla; Giovanni Ciriello; W K Yung; Wei Zhang; Carrie Sougnez; Tom Mikkelsen; Kenneth Aldape; Darell D Bigner; Erwin G Van Meir; Michael Prados; Andrew Sloan; Keith L Black; Jennifer Eschbacher; Gaetano Finocchiaro; William Friedman; David W Andrews; Abhijit Guha; Mary Iacocca; Brian P O'Neill; Greg Foltz; Jerome Myers; Daniel J Weisenberger; Robert Penny; Raju Kucherlapati; Charles M Perou; D Neil Hayes; Richard Gibbs; Marco Marra; Gordon B Mills; Eric Lander; Paul Spellman; Richard Wilson; Chris Sander; John Weinstein; Matthew Meyerson; Stacey Gabriel; Peter W Laird; David Haussler; Gad Getz; Lynda Chin
Journal: Cell Date: 2013-10-10 Impact factor: 41.582

Review 3. Preprocessing and downstream analysis of microarray DNA copy number profiles.

Authors: Mark A van de Wiel; Franck Picard; Wessel N van Wieringen; Bauke Ylstra
Journal: Brief Bioinform Date: 2010-02-19 Impact factor: 11.622

4. Candidate driver genes in focal chromosomal aberrations of stage II colon cancer.

Authors: Rebecca P M Brosens; Josien C Haan; Beatriz Carvalho; François Rustenburg; Heike Grabsch; Philip Quirke; Alexander F Engel; Miguel A Cuesta; Nicola Maughan; Marcel Flens; Gerrit A Meijer; Bauke Ylstra
Journal: J Pathol Date: 2010-08 Impact factor: 7.996

5. Estimating optimal window size for analysis of low-coverage next-generation sequence data.

Authors: Arief Gusnanto; Charles C Taylor; Ibrahim Nafisah; Henry M Wood; Pamela Rabbitts; Stefano Berri
Journal: Bioinformatics Date: 2014-03-05 Impact factor: 6.937

Review 6. Focal chromosomal copy number aberrations in cancer-Needles in a genome haystack.

Authors: Oscar Krijgsman; Beatriz Carvalho; Gerrit A Meijer; Renske D M Steenbergen; Bauke Ylstra
Journal: Biochim Biophys Acta Date: 2014-08-07

7. Integrated analysis of homozygous deletions, focal amplifications, and sequence alterations in breast and colorectal cancers.

Authors: Rebecca J Leary; Jimmy C Lin; Jordan Cummins; Simina Boca; Laura D Wood; D Williams Parsons; Siân Jones; Tobias Sjöblom; Ben-Ho Park; Ramon Parsons; Joseph Willis; Dawn Dawson; James K V Willson; Tatiana Nikolskaya; Yuri Nikolsky; Levy Kopelovich; Nick Papadopoulos; Len A Pennacchio; Tian-Li Wang; Sanford D Markowitz; Giovanni Parmigiani; Kenneth W Kinzler; Bert Vogelstein; Victor E Velculescu
Journal: Proc Natl Acad Sci U S A Date: 2008-10-13 Impact factor: 11.205

8. Focal aberrations indicate EYA2 and hsa-miR-375 as oncogene and tumor suppressor in cervical carcinogenesis.

Authors: Mariska Bierkens; Oscar Krijgsman; Saskia M Wilting; Leontien Bosch; Annelieke Jaspers; Gerrit A Meijer; Chris J L M Meijer; Peter J F Snijders; Bauke Ylstra; Renske D M Steenbergen
Journal: Genes Chromosomes Cancer Date: 2012-09-14 Impact factor: 5.006

9. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers.

Authors: Craig H Mermel; Steven E Schumacher; Barbara Hill; Matthew L Meyerson; Rameen Beroukhim; Gad Getz
Journal: Genome Biol Date: 2011-04-28 Impact factor: 13.583

10. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data.

Authors: Valentina Boeva; Tatiana Popova; Kevin Bleakley; Pierre Chiche; Julie Cappo; Gudrun Schleiermacher; Isabelle Janoueix-Lerosey; Olivier Delattre; Emmanuel Barillot
Journal: Bioinformatics Date: 2011-12-06 Impact factor: 6.937