
Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data.

Valentina Boeva, Tatiana Popova, Kevin Bleakley, Pierre Chiche, Julie Cappo, Gudrun Schleiermacher, Isabelle Janoueix-Lerosey, Olivier Delattre, Emmanuel Barillot.

Abstract

SUMMARY: More and more cancer studies use next-generation sequencing (NGS) data to detect various types of genomic variation. However, even when researchers have such data at hand, single-nucleotide polymorphism arrays have been considered necessary to assess copy number alterations and especially loss of heterozygosity (LOH). Here, we present the tool Control-FREEC that enables automatic calculation of copy number and allelic content profiles from NGS data, and consequently predicts regions of genomic alteration such as gains, losses and LOH. Taking as input aligned reads, Control-FREEC constructs copy number and B-allele frequency profiles. The profiles are then normalized, segmented and analyzed in order to assign genotype status (copy number and allelic content) to each genomic region. When a matched normal sample is provided, Control-FREEC discriminates somatic from germline events. Control-FREEC is able to analyze overdiploid tumor samples and samples contaminated by normal cells. Low mappability regions can be excluded from the analysis using provided mappability tracks.

AVAILABILITY: C++ source code is available at: http://bioinfo.curie.fr/projects/freec/

CONTACT: freec@curie.fr

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year:  2011        PMID: 22155870      PMCID: PMC3268243          DOI: 10.1093/bioinformatics/btr670

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Cancer genomes often display copy number alterations (CNAs) and/or losses of heterozygosity (LOH) (Hanahan and Weinberg, 2011). Genetic abnormalities in specific regions may be related to the aggressiveness of a cancer and be associated with clinical outcomes (Carén et al., 2010; Suzuki et al., 2000). To detect CNA and LOH regions, single-nucleotide polymorphism (SNP) arrays have recently been widely used (Popova et al., 2009). Furthermore, next-generation sequencing (NGS) has been moving to replace SNP arrays in the prediction of CNAs (Boeva et al., 2010). A recent study presented ExomeCNV, a tool to predict CNAs and LOH using exome sequencing data (Sathirapongsasuti et al., 2011). However, detection of LOH regions and, more generally, prediction of the genotype status (copy number and allelic content) of an altered region using whole-genome sequencing data has remained unsolved. The main challenges are non-uniform read coverage of genomic positions [for example, due to different mappability and GC-content (Boeva et al., 2010)] and alignment bias (reference allele coverage is usually higher than the coverage of the alternative allele). Thus, the resulting signal is noisier and more difficult to process than in the case of SNP arrays. Here, we present Control-FREEC (Control-FREE Copy number and allelic content caller), a tool that annotates genotypes and discovers CNAs and LOH. Control-FREEC inherits many features from FREEC (Boeva et al., 2010) (assessment of copy number variation and evaluation of contamination by normal cells) as well as the general methodology of the GAP algorithm for SNP arrays (Popova et al., 2009). Control-FREEC takes aligned reads as input, then constructs and normalizes the copy number profile, constructs the B-allele frequency (BAF) profile, segments both profiles, ascribes the genotype status to each segment using both copy number and allelic frequency information, and finally annotates genomic alterations. If a control (matched normal) sample is available, Control-FREEC discerns somatic variants from germline ones.

2 METHODS

Workflow: the workflow of Control-FREEC consists of three steps: (i) calculation and segmentation of copy number profiles; (ii) calculation and segmentation of smoothed BAF profiles; (iii) prediction of final genotype status, i.e. copy number and allelic content for each segment (for example, A, AB, AAB, etc.).

(i) Calculation of copy number profiles is mainly done as described in our previous publication (Boeva et al., 2010). The most important features of the procedure are: (a) the possibility of using GC-content and mappability profiles to normalize read count if a control sample is unavailable; (b) proper characterization of overdiploid genomes; (c) correction for possible contamination by normal cells when constructing the copy number profile of a tumor genome. The new tool Control-FREEC can also be used on non-mammalian genomes and includes many new user control settings, such as (a) defining the program's behavior in low mappability regions (http://bioinfo.curie.fr/projects/freec/tutorial.html); (b) choosing the minimal number of consecutive windows required to call a CNA.

(ii) We characterize the allelic content via the BAF introduced previously for SNP arrays (Popova et al., 2009). We limit the list of genomic positions that we consider to evaluate allelic content to known SNPs only (Sherry et al., 2001). By the B allele, we mean the alternative variant in the SNP database (dbSNP). SNPs that are homozygous in the genome being considered give no information about allelic content (in SNP arrays they are denoted as non-informative); therefore putatively homozygous positions are discarded. A position is discarded if the probability of having variation due to sequencing errors under the condition of actual homozygosity is greater than a specified threshold (Supplementary Materials). We calculate the total coverage and B-allele coverage for each known putatively heterozygous SNP position.
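The filtering of putatively homozygous SNP positions described above can be sketched as follows. This is only an illustration under stated assumptions, not the test used by Control-FREEC (which is specified in the Supplementary Materials): the binomial error model, the function names, the `error_rate` of 0.01 and the `threshold` of 0.01 are all hypothetical choices made for this example.

```python
from math import comb


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_putatively_homozygous(total, b_count, error_rate=0.01, threshold=0.01):
    """Flag a SNP whose minor-allele reads are explainable by sequencing errors.

    Assumption (hypothetical model): at a truly homozygous site, each read
    shows the other allele independently with probability `error_rate`.
    """
    minor = min(b_count, total - b_count)
    # Probability that errors alone produce at least `minor` discordant reads.
    return binom_tail(minor, total, error_rate) > threshold


def baf_profile(snps, error_rate=0.01, threshold=0.01):
    """Compute BAF for the retained SNPs.

    `snps` is an iterable of (position, total_coverage, b_allele_coverage).
    Positions with zero coverage or flagged as homozygous are dropped.
    """
    return [
        (pos, b / total)
        for pos, total, b in snps
        if total > 0 and not is_putatively_homozygous(total, b, error_rate, threshold)
    ]
```

With these illustrative settings, a position covered by 30 reads with a single B-allele read is discarded, whereas one with 15 of 30 B-allele reads is kept with a BAF of 0.5.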
For each window i, we calculate the median absolute deviation of the BAF values from 0.5: Med_i = median(|x_j − 0.5|), where {x_j} are the BAF values of the remaining SNP positions in window i. We segment {Med_i} using the same lasso-based algorithm as used for copy numbers (Harchaoui and Lévy-Leduc, 2008).

(iii) We predict genotype status for each genomic segment independently, by choosing the allelic content that corresponds to the maximal log-likelihood, given the copy number detected previously. First, we combine breakpoints issued from both copy number and median BAF segmentations to get genomic segments with presumably one status. Second, the copy number status of each segment is detected as described previously (Boeva et al., 2010). If the CNA is present in most of the cells, there is no ambiguity in determining the exact copy number of the region (see Supplementary Materials for more details on the strategy in the case of presence of subclones or normal contamination). Third, given the copy number of the region, we fit Gaussian mixture models (GMMs) with fixed means to the observed BAF values and select the model that provides the highest log-likelihood. For example, for a region with a copy number of two, we fit a two-component model (mixture of ‘AA’ and ‘BB’ alleles) and a three-component model (‘AA’, ‘AB’ and ‘BB’, with a condition on the minimal weight of ‘AB’). The component means in the GMM depend on the level of contamination by normal DNA (Supplementary Materials).

Input and output: the input consists of a SAM pileup (http://samtools.sourceforge.net/pileup.shtml) and a dbSNP file. The control dataset is optional if a reference genome is provided. The output contains a list of CNAs and LOH regions as well as read count, copy number, BAF and genotype information for each window. If a control (matched normal) dataset is available, each event is annotated as somatic or germline.
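The per-window BAF summary and the model choice in step (iii) can be illustrated with a simplified sketch. It is not the Control-FREEC implementation: the mixture weights are held equal and the standard deviation `sigma` is fixed instead of being fitted, the minimal-weight condition on ‘AB’ is omitted, and the component means in `MODELS_CN2` are hypothetical placeholders for a contaminated copy-number-two region (the paper derives the actual means from the contamination level in its Supplementary Materials).

```python
import math
from statistics import median


def window_median(bafs):
    """Median of |x - 0.5| over the BAF values x of one window."""
    return median(abs(x - 0.5) for x in bafs)


def gmm_loglik(bafs, means, sigma=0.05):
    """Log-likelihood of BAF values under an equal-weight Gaussian mixture.

    Simplification: weights and sigma are fixed here rather than fitted.
    """
    weight = 1.0 / len(means)
    norm = sigma * math.sqrt(2.0 * math.pi)
    loglik = 0.0
    for x in bafs:
        density = sum(
            weight * math.exp(-0.5 * ((x - m) / sigma) ** 2) / norm for m in means
        )
        loglik += math.log(max(density, 1e-300))  # guard against log(0)
    return loglik


# Hypothetical fixed means for copy number 2 in a contaminated sample.
MODELS_CN2 = {
    "AA/BB": (0.2, 0.8),
    "AA/AB/BB": (0.2, 0.5, 0.8),
}


def best_model(bafs, models=MODELS_CN2, sigma=0.05):
    """Return the name of the candidate model with the highest log-likelihood."""
    return max(models, key=lambda name: gmm_loglik(bafs, models[name], sigma))
```

With these placeholder means, BAF values clustered near 0.2 and 0.8 select the two-component ‘AA/BB’ model (LOH), while values clustered near 0.5 select the three-component model.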

3 RESULTS

We applied Control-FREEC to detect CNAs and LOH regions in a tumor/normal dataset for a neuroblastoma patient (~30x coverage, unpublished data). Control-FREEC detected somatic CNA and LOH regions covering 75% of the tumor genome (Fig. 1) and was able to identify the genotype status despite contamination of the tumor sample by normal cells (the estimated fraction of tumor cells was 60%).
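To see why contamination matters for genotype calling, the expected BAF at a germline-heterozygous SNP can be computed with a standard linear mixing model. This is a sketch under explicit assumptions (normal cells are diploid with genotype AB, and signal mixes in proportion to DNA content); the function name is hypothetical and the formula is not claimed to be the exact one used in the paper's Supplementary Materials.

```python
def expected_baf(b_copies, total_copies, tumor_fraction):
    """Expected BAF at a germline-heterozygous SNP in a tumor/normal mixture.

    Assumes normal cells are diploid and carry genotype AB at this SNP.
    """
    # B-allele copies contributed by tumor cells and by normal (AB) cells.
    b_signal = tumor_fraction * b_copies + (1.0 - tumor_fraction) * 1
    # All allele copies contributed by tumor cells and by normal (diploid) cells.
    total_signal = tumor_fraction * total_copies + (1.0 - tumor_fraction) * 2
    return b_signal / total_signal


# Expected BAF per tumor genotype at a tumor fraction of 60%.
EXAMPLE = {
    genotype: round(expected_baf(genotype.count("B"), len(genotype), 0.6), 3)
    for genotype in ("AA", "AB", "BB", "AAB", "ABB")
}
```

Under this model and a 60% tumor fraction, copy-neutral LOH (‘AA’ or ‘BB’) appears at a BAF of about 0.2 or 0.8 rather than 0 or 1, so the mixture components have to be shifted accordingly.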
Fig. 1. Control-FREEC calculates copy number and BAF profiles and detects regions of copy number gain/loss and LOH regions. Tumor chromosomes 17 and 19 (bottom panels) versus ‘normal’ chromosomes (top panels; unpublished data). Predicted BAF and copy number profiles are shown in black. Gains, losses (left panels) and LOH (right panels) are shown in red, blue and light blue, respectively.

Our results agreed with the SNP-array analysis output. We obtained 95.4% consistency between the results of Control-FREEC and GAP (Popova et al., 2009), which we applied to SNP array data generated for the same tumor sample (Supplementary Materials).

4 CONCLUSION

Control-FREEC is a tool for automatic detection of CNAs and LOH regions using NGS data. It accurately calls genotype status even when no control experiment is available and/or the genome is polyploid. It corrects for GC-content and mappability biases. In the case of tumor samples, Control-FREEC is able to evaluate the level of contamination by normal cells. The software is written in C++ and freely available.

Funding: ‘Projet Incitatif et Collaboratif Bioinformatique et Biostatistiques’ of the Institut Curie; Ligue Nationale Contre le Cancer.

Conflict of Interest: none declared.

REFERENCES

1.  dbSNP: the NCBI database of genetic variation.

Authors:  S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV.

Authors:  Jarupon Fah Sathirapongsasuti; Hane Lee; Basil A J Horst; Georg Brunner; Alistair J Cochran; Scott Binder; John Quackenbush; Stanley F Nelson
Journal:  Bioinformatics       Date:  2011-08-09       Impact factor: 6.937

3.  High-risk neuroblastoma tumors with 11q-deletion display a poor prognostic, chromosome instability phenotype with later onset.

Authors:  Helena Carén; Hanna Kryh; Maria Nethander; Rose-Marie Sjöberg; Catarina Träger; Staffan Nilsson; Jonas Abrahamsson; Per Kogner; Tommy Martinsson
Journal:  Proc Natl Acad Sci U S A       Date:  2010-02-09       Impact factor: 11.205

4.  An approach to analysis of large-scale correlations between genome changes and clinical endpoints in ovarian cancer.

Authors:  S Suzuki; D H Moore; D G Ginzinger; T E Godfrey; J Barclay; B Powell; D Pinkel; C Zaloudek; K Lu; G Mills; A Berchuck; J W Gray
Journal:  Cancer Res       Date:  2000-10-01       Impact factor: 12.701

5.  Hallmarks of cancer: the next generation.

Authors:  Douglas Hanahan; Robert A Weinberg
Journal:  Cell       Date:  2011-03-04       Impact factor: 41.582

6.  Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization.

Authors:  Valentina Boeva; Andrei Zinovyev; Kevin Bleakley; Jean-Philippe Vert; Isabelle Janoueix-Lerosey; Olivier Delattre; Emmanuel Barillot
Journal:  Bioinformatics       Date:  2010-11-15       Impact factor: 6.937

7.  Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays.

Authors:  Tatiana Popova; Elodie Manié; Dominique Stoppa-Lyonnet; Guillem Rigaill; Emmanuel Barillot; Marc Henri Stern
Journal:  Genome Biol       Date:  2009-11-11       Impact factor: 13.583
