Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization.

Valentina Boeva, Andrei Zinovyev, Kevin Bleakley, Jean-Philippe Vert, Isabelle Janoueix-Lerosey, Olivier Delattre, Emmanuel Barillot.

Abstract

SUMMARY: We present a tool for control-free copy number alteration (CNA) detection using deep-sequencing data, particularly useful for cancer studies. The tool deals with two frequent problems in the analysis of cancer deep-sequencing data: absence of a control sample and possible polyploidy of cancer cells. FREEC (control-FREE Copy number caller) automatically normalizes and segments copy number profiles (CNPs) and calls CNAs. If ploidy is known, FREEC assigns an absolute copy number to each predicted CNA. To normalize raw CNPs, the user can provide a control dataset if available; otherwise GC content is used. We demonstrate that for Illumina single-end, mate-pair or paired-end sequencing, GC-content normalization provides smooth profiles that can be further segmented and analyzed in order to predict CNAs. AVAILABILITY: Source code and sample data are available at http://bioinfo-out.curie.fr/projects/freec/.

Substances：
Guanine
Cytosine

Year: 2010 PMID： 21081509 PMCID： PMC3018818 DOI： 10.1093/bioinformatics/btq635

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

In many studies that apply deep sequencing to cancer genomes, one has to calculate copy number profiles (CNPs) and predict regions of gain and loss. Two obstacles arise frequently in the analysis of cancer genomes: the absence of an appropriate control sample for normal tissue, and possible polyploidy. Most current tools do not take these points into account (Supplementary Table 1). For various reasons, sequencing of an appropriate control sample is not always possible. There is therefore a need for a bioinformatics tool able to automatically detect copy number alterations (CNAs) without use of a control dataset.

Several programs have been published that allow automatic calculation and analysis of CNPs (Chiang et al.; Xie and Tammi, 2009). However, both CNV-seq (Xie and Tammi, 2009) and SegSeq (Chiang et al.) need datasets for the given tumor and its paired normal DNA. Moreover, both programs predict CNAs without providing information about how many copies were lost or gained. An interesting approach for predicting copy number variants was suggested by Yoon et al., where GC content is used to normalize data. However, to estimate the ‘normal’ copy number, they rely on the assumption that there are similar percentages of amplified and deleted regions, which is not true in general for cancer cells. Moreover, their tool was designed to analyze normal human genomes and is unable to take into account possible polyploidy.

Here, we propose an algorithm to call CNAs with or without a control sample. The algorithm is implemented in the C++ program FREEC (control-FREE Copy number caller). FREEC uses a sliding window approach to calculate read count (RC) in non-overlapping windows (raw CNP). Then, if a control sample is available, the program normalizes the raw CNP using the control profile. Otherwise, the program calculates GC content in the same set of windows and performs normalization by GC content. Since this removes a major source of variability in raw CNPs (Chiang et al.; Yoon et al.), the resulting normalized profile becomes sufficiently smooth to apply segmentation. This is followed by the analysis of predicted regions of gains and losses in order to assign copy numbers to these regions.
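The windowing step described above can be illustrated with a minimal Python sketch. It is not the FREEC implementation (which is written in C++); the function names, the use of read start positions and the handling of non-ACGT bases are assumptions made for illustration only.

```python
def raw_copy_number_profile(read_starts, chrom_length, window_size):
    """Count reads per non-overlapping window (raw CNP).

    Assumption: a read is assigned to the window containing its start position.
    """
    n_windows = (chrom_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // window_size] += 1
    return counts


def gc_content_profile(sequence, window_size):
    """Fraction of G/C among the A/C/G/T bases of each window.

    Assumption: other symbols (e.g. N) are ignored; a window without any
    A/C/G/T base gets NaN.
    """
    profile = []
    for start in range(0, len(sequence), window_size):
        window = sequence[start:start + window_size].upper()
        gc = window.count("G") + window.count("C")
        at = window.count("A") + window.count("T")
        profile.append(gc / (gc + at) if gc + at else float("nan"))
    return profile
```

Both functions return one value per window, so the two lists can be paired window by window for the normalization step.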

2 METHODS

The algorithm includes several steps. First, it calculates the raw CNP by counting reads in non-overlapping windows. If not provided by the user, window size can be automatically selected using depth of coverage information to optimize accuracy of CNA prediction. The second step is profile normalization. If a control is not provided by the user, we compute the GC-content profile. The normalization procedure of RC by GC content (or by control RC) is described below. The third step is segmentation of the normalized CNP. To do this, we implemented a LASSO-based algorithm suggested by Harchaoui and Lévy-Leduc (2008). Segmentation provided by this algorithm is robust against outliers, which makes it suitable for segmentation of deep-sequencing CNPs. The last step involves analysis of segmented profiles. This includes identification of regions of genomic gains and losses and prediction of copy number changes in these regions.

To normalize a raw CNP, we fit the observed RC by the GC content (or the control RC if it is available). We base our fitted model on several assumptions: (i) the sample main ploidy P is provided, (ii) the observed RC in P-copy regions (i.e. regions with copy number equal to P) can be modeled as a polynomial of GC content (or of control RC), (iii) the observed RC in a region with altered copy number is linearly proportional to the RC in P-copy regions and (iv) the interval of measured GC contents (respectively control RCs when a control dataset is available) in the main ploidy regions must include the interval of all measured GC contents (respectively control RCs). The polynomial's degree is a user-defined parameter with a default value of three. We provide an initial estimate of the polynomial's parameters and then optimize these parameters by iteratively selecting data points related to P-copy regions and making a least-squares fit on these points only (see Supplementary Methods for more details). The resulting polynomial is then used to normalize the CNP (Fig. 1). The user has an option to include mappability information into the normalization procedure (see Supplementary Methods).
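The iterative fit described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure from the Supplementary Methods: the initial estimate (a fit on all windows) and the selection rule (keep windows whose ratio to the current curve lies within 1/(2P) of 1) are hypothetical choices, as are all function names.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c such that y ~ c[0] + c[1]*x + ... + c[degree]*x**degree.
    """
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def fit_main_ploidy_curve(gc, rc, ploidy, degree=3, n_iter=10):
    """Iteratively fit RC as a polynomial of GC content on putative P-copy windows."""
    coeffs = polyfit(gc, rc, degree)  # assumed initial estimate: all windows
    half_step = 0.5 / ploidy          # assumed tolerance: halfway to (P +/- 1)/P
    for _ in range(n_iter):
        selected = []
        for g, r in zip(gc, rc):
            fitted = polyval(coeffs, g)
            if fitted > 0 and abs(r / fitted - 1.0) < half_step:
                selected.append((g, r))
        if len(selected) <= degree:   # too few points to refit
            break
        coeffs = polyfit([g for g, _ in selected], [r for _, r in selected], degree)
    return coeffs


def normalize(gc, rc, coeffs):
    """Divide each window's RC by the fitted P-copy RC at its GC content."""
    return [r / polyval(coeffs, g) for g, r in zip(gc, rc)]
```

With this convention a normalized value near 1 corresponds to P copies, and by assumption (iii) a region with c copies is expected near c/P.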

Fig. 1.

Normalization of CNPs using only information about average GC content in a window. (A–D) GC content versus RC in 50 kb windows for COLO-829BL (normal diploid genome), COLO-829, NCI-H2171 and HCC1143, respectively. The result of the least-squares fit for P-copy regions is shown in black. Curves corresponding to other frequent copy numbers are shown in gray. Values of copy numbers are given at the right of each panel. Chromosomes X and Y were not included. (E–H) GC-content normalized CNPs for chromosome 1 for COLO-829BL, COLO-829, NCI-H2171 and HCC1143, respectively. Automatically predicted copy numbers are shown in black.
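Under assumption (iii), the expected RC for a region with c copies is the P-copy RC scaled by c/P, and a normalized segment can be mapped back to an integer copy number. The Python sketch below illustrates this relationship; taking the median of a segment and rounding to the nearest integer are assumptions made for illustration, and the function names are hypothetical.

```python
from statistics import median


def expected_read_count(main_ploidy_rc, copy_number, ploidy):
    """Expected RC for `copy_number` copies, given the RC fitted for P-copy regions."""
    return main_ploidy_rc * copy_number / ploidy


def call_segment_copy_number(ratios, ploidy):
    """Assign a copy number to one segment of the normalized profile.

    Assumption: the segment is summarized by its median ratio, which is
    scaled by the ploidy and rounded to the nearest non-negative integer.
    """
    return max(0, round(median(ratios) * ploidy))


def label_segment(copy_number, ploidy):
    """Classify a segment relative to the main ploidy."""
    if copy_number > ploidy:
        return "gain"
    if copy_number < ploidy:
        return "loss"
    return "normal"
```

For example, in a diploid sample a segment whose normalized values cluster around 1.5 would be called as three copies and labeled a gain.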

3 RESULTS

We applied the method to predict CNAs in mate-pair datasets for the melanoma cell line COLO-829 and matched normal cell line COLO-829BL (Pleasance et al.), a paired-end dataset for the small-cell lung cancer cell line NCI-H2171 (Campbell et al.) and a single-end dataset for the breast cancer cell line HCC1143 (Chiang et al.). All four samples were sequenced using the Illumina Genome Analyzer platform. The number of reads in samples varied from 14 to 20 million (Supplementary Table 2). The polynomial fit by GC content explained the observed RC well (Fig. 1A–D). Using CNPs normalized by GC content, we identified regions of gain and loss in the four samples (Fig. 1E–H, Supplementary Fig. 1–4). We also assessed true positive and false positive rates for a normal diploid sample NA18507 (Alkan et al.; Bentley et al.; Supplementary Table 3). We compared FREEC with three other existing tools: CNV-seq, SegSeq and RDXplorer (Supplementary Tables 1 and 4). As well as providing other additional functionalities, FREEC understands more input formats than any other tool. It can be used to analyze data produced for any organism and for polyploid genomes. Being implemented in C++, FREEC shows excellent performance and operating system portability.

4 CONCLUSION

We have presented a tool for automatic detection of CNAs and calculation of CNA frequency. FREEC provides more functionalities than existing tools; in particular, it can deal with the situation when no control experiment is available and when the genome is polyploid, frequent problems in cancer studies. The main steps are (i) normalization of the CNP using GC content (or control CNP if available), (ii) segmentation of normalized profiles and (iii) assignment of copy number changes to losses and gains. The program is fast, accurate and freely available.

Funding: The Ligue Nationale contre le Cancer (V.B., A.Z., E.B., I.J.-L. and O.D. are members of a labeled team).

Conflict of Interest: none declared.

REFERENCES

1. Sensitive and accurate detection of copy number variants using read depth of coverage.

Authors: Seungtai Yoon; Zhenyu Xuan; Vladimir Makarov; Kenny Ye; Jonathan Sebat
Journal: Genome Res Date: 2009-08-05 Impact factor: 9.043

2. A comprehensive catalogue of somatic mutations from a human cancer genome.

Authors: Erin D Pleasance; R Keira Cheetham; Philip J Stephens; David J McBride; Sean J Humphray; Chris D Greenman; Ignacio Varela; Meng-Lay Lin; Gonzalo R Ordóñez; Graham R Bignell; Kai Ye; Julie Alipaz; Markus J Bauer; David Beare; Adam Butler; Richard J Carter; Lina Chen; Anthony J Cox; Sarah Edkins; Paula I Kokko-Gonzales; Niall A Gormley; Russell J Grocock; Christian D Haudenschild; Matthew M Hims; Terena James; Mingming Jia; Zoya Kingsbury; Catherine Leroy; John Marshall; Andrew Menzies; Laura J Mudie; Zemin Ning; Tom Royce; Ole B Schulz-Trieglaff; Anastassia Spiridou; Lucy A Stebbings; Lukasz Szajkowski; Jon Teague; David Williamson; Lynda Chin; Mark T Ross; Peter J Campbell; David R Bentley; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2009-12-16 Impact factor: 49.962

3. High-resolution mapping of copy-number alterations with massively parallel sequencing.

Authors: Derek Y Chiang; Gad Getz; David B Jaffe; Michael J T O'Kelly; Xiaojun Zhao; Scott L Carter; Carsten Russ; Chad Nusbaum; Matthew Meyerson; Eric S Lander
Journal: Nat Methods Date: 2008-11-30 Impact factor: 28.547

4. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing.

Authors: Peter J Campbell; Philip J Stephens; Erin D Pleasance; Sarah O'Meara; Heng Li; Thomas Santarius; Lucy A Stebbings; Catherine Leroy; Sarah Edkins; Claire Hardy; Jon W Teague; Andrew Menzies; Ian Goodhead; Daniel J Turner; Christopher M Clee; Michael A Quail; Antony Cox; Clive Brown; Richard Durbin; Matthew E Hurles; Paul A W Edwards; Graham R Bignell; Michael R Stratton; P Andrew Futreal
Journal: Nat Genet Date: 2008-04-27 Impact factor: 38.330

5. Personalized copy number and segmental duplication maps using next-generation sequencing.

Authors: Can Alkan; Jeffrey M Kidd; Tomas Marques-Bonet; Gozde Aksay; Francesca Antonacci; Fereydoun Hormozdiari; Jacob O Kitzman; Carl Baker; Maika Malig; Onur Mutlu; S Cenk Sahinalp; Richard A Gibbs; Evan E Eichler
Journal: Nat Genet Date: 2009-08-30 Impact factor: 38.330

6. CNV-seq, a new method to detect copy number variation using high-throughput sequencing.

Authors: Chao Xie; Martti T Tammi
Journal: BMC Bioinformatics Date: 2009-03-06 Impact factor: 3.169

7. Accurate whole human genome sequencing using reversible terminator chemistry.

Authors: David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal: Nature Date: 2008-11-06 Impact factor: 49.962
