PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data.

Chen Wang¹, Jared M Evans¹, Aditya V Bhagwate¹, Naresh Prodduturi¹, Vivekananda Sarangi¹, Mridu Middha¹, Hugues Sicotte¹, Peter T Vedell¹, Steven N Hart¹, Gavin R Oliver¹, Jean-Pierre A Kocher¹, Matthew J Maurer¹, Anne J Novak¹, Susan L Slager¹, James R Cerhan¹, Yan W Asmann¹.

Abstract

MOTIVATION: Exome sequencing (exome-seq) data, which are typically used for calling exonic mutations, have also been utilized in detecting DNA copy number variations (CNVs). Despite the existence of several CNV detection tools, there is still a great need for a sensitive and accurate CNV-calling algorithm that has built-in QC steps and does not require a paired reference for each sample.
RESULTS: We developed a novel method named PatternCNV, which (i) accounts for the read coverage variations between exons while leveraging the consistencies of this variability across different samples; (ii) reduces alignment BAM files to WIG format and therefore greatly accelerates computation; (iii) incorporates multiple QC measures designed to identify outlier samples and batch effects; and (iv) provides a variety of visualization options including chromosome, gene and exon-level views of CNVs, along with a tabular summarization of the exon-level CNVs. Compared with other CNV-calling algorithms using data from a lymphoma exome-seq study, PatternCNV has higher sensitivity and specificity.
AVAILABILITY AND IMPLEMENTATION: The software for PatternCNV is implemented using Perl and R, and can be used in Mac or Linux environments. Software and user manual are available at http://bioinformaticstools.mayo.edu/research/patterncnv/, and R package at https://github.com/topsoil/patternCNV/.

Year: 2014 PMID: 24876377 PMCID: PMC4155258 DOI: 10.1093/bioinformatics/btu363

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

DNA copy number variations (CNVs) are genomic structural changes that result in regional or chromosomal loss or gain of DNA copies (Hastings et al. 2009). Owing to their significant roles in human diseases, various laboratory techniques have been developed to detect CNVs, including the recently advanced massively parallel sequencing of whole genomes and coding exomes. For exome-seq, it is commonly observed that the coverage depths of short reads vary across regions, owing to differences in target capture efficiency (Parla et al. 2011) and in the mappability of exons. Such coverage variations impose substantial challenges for reliable CNV detection. Most existing methods use a paired-sample approach, based on the intuitive assumption that a somatic sample and its paired reference share similar coverage bias that can be cancelled out through pairing (Koboldt et al. 2012; Sathirapongsasuti et al. 2011). Although this assumption approximately holds, it oversimplifies the problem and leaves two limitations unaddressed: (i) the region-specific noise (coverage variability) of a local region is not accounted for, leading to amplified noise in the log-ratio values of coverage between a sample and its paired reference; (ii) when a reference sample is missing or of low quality, CNV detection based on a paired reference is infeasible or has degraded accuracy/sensitivity. A recently published method, FishingCNV, tried to address the second limitation by using the average of multiple reference samples as the denominator in the log-ratio calculation, but did not address the regional noise in individual samples (the numerator), which led to false CNV calls (details in Supplementary Section S2.3). Considering these issues, we propose a novel method called PatternCNV, which summarizes overall consistent patterns of both depth and variability of exonic region coverage across samples, where ‘patterns’ of coverage and variability are summarized using multiple ‘normal’ or reference samples.
We observed that the same patterns exist only between samples prepared using the same version of the exome capture kit. During CNV detection, we compute the differences between the observed coverage and the common pattern, while penalizing regions associated with larger variability using a weighting scheme. Further, whole-genome CNV can be interpolated from exon-level CNV using any third-party segmentation method, e.g. circular binary segmentation (Olshen et al. 2004). PatternCNV was implemented in two different versions: a Mac and Linux/Unix version, and an R package version. We also developed a conversion tool to transform Binary Alignment/Map (BAM) format files into much smaller wiggle (WIG) format files (<1% of BAM file size), which greatly speeds up pattern learning and CNV calculation. When compared with other state-of-the-art CNV algorithms in a lymphoma case study, PatternCNV displayed higher resolution and greater sensitivity/specificity.
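The noise amplification described in limitation (i) can be illustrated with a small calculation. This is only a sketch under a simplifying assumption that is not stated in the text: the log2 coverage of a region carries independent noise with the same variance in every sample. The function names and the numeric values are hypothetical and chosen purely for illustration.

```python
def paired_log_ratio_variance(sigma2):
    """Variance of (sample - paired reference) in log2 space.

    Assumes independent noise of variance sigma2 in both samples,
    so the variances add.
    """
    return sigma2 + sigma2


def pooled_log_ratio_variance(sigma2, n_refs):
    """Variance of (sample - mean of n_refs references) in log2 space.

    Averaging n_refs independent references shrinks the reference
    contribution to sigma2 / n_refs.
    """
    if n_refs < 1:
        raise ValueError("n_refs must be at least 1")
    return sigma2 + sigma2 / n_refs


# Illustrative numbers only (not taken from the paper).
sigma2 = 0.04
paired = paired_log_ratio_variance(sigma2)
pooled = pooled_log_ratio_variance(sigma2, n_refs=15)
print(f"paired variance: {paired:.4f}, pooled variance: {pooled:.4f}")
```

Under this assumption, pooling references reduces the reference contribution to the noise but leaves the noise of the test sample itself untouched, which is why the per-region variability still needs to be modelled separately.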

2 FEATURES

2.1 Input, output and major functions

PatternCNV is divided into three major functional components: (i) BAM-to-WIG conversion for improved computational performance: a BAM2WIG converter using SAMtools (Li et al. 2009) and BEDtools (Quinlan and Hall 2010), which takes as input a BAM file, a Browser Extensible Data (BED) file defining exon regions and a second BED file for the capture targets defined by the exome capture kit. The outputs are WIG files with greatly reduced file sizes compared with BAM files; (ii) CNV detection: starting with WIG files, PatternCNV estimates the coverage and variability patterns from multiple reference samples and calculates CNVs relative to the pattern for all samples, including the references; and (iii) CNV summary and visualization: this module outputs a detailed exon-level CNV summary file per sample, and provides several visualization options for viewing CNVs at the whole-genome level or chromosome level. In addition, there are built-in QA/QC steps to detect sample outliers and batch effects. Figure 1 displays the overall workflow of PatternCNV along with illustrative examples of program output.
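To make the intermediate file format more concrete, the following sketch reads coverage values from text in the generic fixedStep WIG layout. It is an assumption that this layout resembles what the BAM2WIG converter emits; the text does not describe the exact WIG variant, and the function name parse_fixed_step_wig and the example values are hypothetical.

```python
def parse_fixed_step_wig(lines):
    """Yield (chrom, start, end, value) for each data line of fixedStep WIG text.

    Coordinates are 1-based and inclusive, as in the generic WIG format.
    Only fixedStep sections are handled in this sketch.
    """
    chrom = None
    pos = step = span = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments and track/browser header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            fields = dict(tok.split("=", 1) for tok in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields["step"])
            span = int(fields.get("span", 1))  # span defaults to 1
            continue
        if chrom is None:
            raise ValueError("data line found before a fixedStep declaration")
        yield chrom, pos, pos + span - 1, float(line)
        pos += step


# Hypothetical three-bin example with 10 bp bins.
example = """track type=wiggle_0
fixedStep chrom=chr1 start=1001 step=10 span=10
35
42
40
""".splitlines()

bins = list(parse_fixed_step_wig(example))
for chrom, start, end, value in bins:
    print(chrom, start, end, value)
```

Storing one number per bin rather than one record per read is what makes such files much smaller than the alignments they summarize.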

Fig. 1.

The PatternCNV workflow is shown in the upper panel. Examples of whole-genome and chromosome-level visualization are displayed in the bottom panel, along with an exon-level CNV summary table.

2.2 Description of the PatternCNV algorithm

Each exon is first divided into consecutive bins of user-defined size (e.g. 10 base pairs). To make the exon coverage of different samples comparable, log2-transformed RPKM (reads per kilo-base per million total reads) is used to standardize the bin coverage. Denoting $x_l$ as the log2-transformed RPKM coverage of the l-th bin in a given exon, the standard coverage of a bin without CNV is assumed to approximately follow a normal distribution $N(\mu_l, \sigma_l^2)$. The $\mu_l$ and $\sigma_l$ are estimated from a pool of reference samples as the coverage and variability patterns. For a bin with a copy number of C, the bin signal is calculated as $x_l = \log_2(C/2) + \epsilon_l$, $\epsilon_l \sim N(\mu_l, \sigma_l^2)$. Hence, a bin-level CNV can be estimated as $\widehat{CNV}_l = x_l - \mu_l$. Considering the variability of bin coverage depending on its relative position in an exon or with respect to the capture probe, we further smooth multiple bins within the k-th exon (we denote the related bin indices as $l \in E_k$), leading to a maximum likelihood estimation: $\widehat{CNV}_k = \sum_{l \in E_k} w_l (x_l - \mu_l)$, where $w_l$ is designed to take the variability of each bin into consideration (details of the statistical formulation are described in Supplementary Section S1).
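The steps in this section can be sketched in a few lines of Python. This is a minimal illustration and not the PatternCNV implementation: the per-bin reference mean and standard deviation are called mu and sigma here, the weights are assumed to be normalized inverse variances (the usual maximum likelihood choice for independent normal noise, whereas the exact weights are defined in Supplementary Section S1), and the pseudocount, the min_sigma floor and all numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def log2_rpkm(read_count, bin_length_bp, total_reads, pseudocount=0.5):
    """log2 of reads per kilobase per million total reads for one bin.

    The pseudocount (an assumption of this sketch) avoids log2(0).
    """
    rpkm = (read_count + pseudocount) * 1e9 / (bin_length_bp * total_reads)
    return math.log2(rpkm)


def learn_pattern(reference_matrix):
    """Return per-bin (mu, sigma) from reference samples.

    reference_matrix[s][l] is the log2-RPKM of bin l in reference sample s.
    """
    n_bins = len(reference_matrix[0])
    mu = [mean(sample[l] for sample in reference_matrix) for l in range(n_bins)]
    sigma = [stdev(sample[l] for sample in reference_matrix) for l in range(n_bins)]
    return mu, sigma


def exon_cnv(x, mu, sigma, bin_indices, min_sigma=1e-3):
    """Weighted average of (x - mu) over the bins of one exon.

    Weights are normalized inverse variances (assumed form), so noisy
    bins contribute less to the exon-level log2 ratio.
    """
    weights = [1.0 / max(sigma[l], min_sigma) ** 2 for l in bin_indices]
    total = sum(weights)
    return sum(w * (x[l] - mu[l]) for w, l in zip(weights, bin_indices)) / total


# Hypothetical reference pool: 3 samples x 3 bins of log2-RPKM values.
refs = [[5.0, 6.0, 4.0], [5.2, 6.1, 3.0], [4.8, 5.9, 5.0]]
mu, sigma = learn_pattern(refs)

# A sample whose bins all sit one log2 unit below the pattern (one copy lost).
loss = exon_cnv([4.0, 5.0, 3.0], mu, sigma, bin_indices=[0, 1, 2])

# Same sample, but the most variable bin disagrees; it is down-weighted.
noisy = exon_cnv([4.0, 5.0, 4.5], mu, sigma, bin_indices=[0, 1, 2])

copy_number = 2 * 2 ** loss  # invert log2(C/2)
print(loss, noisy, copy_number)
```

In the second call the third bin deviates in the opposite direction, but because its reference variability is large it barely moves the exon-level estimate, which is the behaviour the weighting scheme is meant to achieve.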

2.3 Lymphoma case study

We applied PatternCNV to a set of 15 germ line–tumor pairs of diffuse large B-cell lymphoma exome-seq data (Lohr et al. 2012). When comparing CNV results derived from exome-seq using PatternCNV with those calculated from SNP microarray data profiled on the same samples, the two sets of results largely agree for large CNVs. As expected, PatternCNV identified many small CNV regions at the single-exon and/or multiple-exon level (Supplementary Section S2.3) that the SNP array failed to detect owing to a lack of probe coverage/density in those regions. In addition, thanks to the digital dynamic range of read coverage, PatternCNV can differentiate high from low amplifications, whereas microarrays are limited by the saturation of the probe hybridization signal. We compared PatternCNV with three other exome-seq-based CNV detection methods, ExomeCNV (Sathirapongsasuti et al. 2011), VarScan 2 (Koboldt et al. 2012) and FishingCNV (Shi and Majewski 2013), using CNVs detected by SNP microarrays as the ground truth. PatternCNV displayed superior visual resolution and achieved better specificity and sensitivity than the paired approaches used by ExomeCNV and VarScan 2 (Supplementary Section S2.2), and produced far fewer false positives than FishingCNV (Supplementary Section S2.3). In several focused comparisons, we also saw an increased resolution of PatternCNV-based estimations compared with these two methods (Supplementary Section S2.1). In situations where a reference sample had less reliable quality than its paired counterpart, we often observed dramatically reduced performance of both VarScan 2 and ExomeCNV for CNV detection, but not of PatternCNV (Supplementary Section S2.1). This highlights the robustness of the pattern-based approach over conventional paired approaches. FishingCNV takes the average across normal samples, which is more similar to PatternCNV than the paired methods used by the other two tools.
However, a detailed comparison shows that FishingCNV differs in its data processing and CNV detection methods (Supplementary Section S2.3). FishingCNV’s principal component analysis (PCA) step overcorrects batch effects and consequently removes CNV signals, resulting in false-negative calls. We recommend that users do not perform the default PCA step of FishingCNV. Moreover, its oversimplified average read-depth approach produces an alarmingly high number of false-positive CNV calls (Supplementary Section S2.3). In contrast, PatternCNV’s novel use of both the weighted average read depth and coverage variability produces results that are superior and simpler to use, improving true positives and greatly reducing false-positive CNV calls.
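For readers who want to reproduce this kind of comparison, sensitivity and specificity against a ground-truth call set can be computed as sketched below. The per-exon boolean representation, the function name and the example calls are assumptions made for illustration; the evaluation actually used is described in Supplementary Section S2.

```python
def sensitivity_specificity(calls, truth):
    """Return (sensitivity, specificity) for per-exon boolean CNV calls.

    calls[i] and truth[i] say whether exon i is called / truly altered.
    Returns NaN for a metric whose denominator is zero.
    """
    if len(calls) != len(truth):
        raise ValueError("calls and truth must have the same length")
    tp = sum(1 for c, t in zip(calls, truth) if c and t)
    fn = sum(1 for c, t in zip(calls, truth) if not c and t)
    tn = sum(1 for c, t in zip(calls, truth) if not c and not t)
    fp = sum(1 for c, t in zip(calls, truth) if c and not t)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical example: 8 exons, 3 truly altered according to the array.
truth = [True, True, True, False, False, False, False, False]
calls = [True, True, False, False, False, False, True, False]
sens, spec = sensitivity_specificity(calls, truth)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Note that with SNP arrays as the reference, small exon-level events that the array cannot resolve would be counted as false positives, so such figures are best read alongside the resolution comparisons.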

3 DISCUSSIONS AND CONCLUSIONS

We introduce PatternCNV, a software package designed for exon-level CNV detection from exome-seq data. CNV estimation is based on coverage and variability patterns summarized from multiple reference samples. The implemented algorithm uses the WIG file format, which improves runtime and space efficiency. Several post-processing functions are included to facilitate interpretation through visualization, segmentation and tabular summarization. As demonstrated by the case study, we believe it is a useful utility for exome-seq studies where robust detection of germ line and/or somatic CNVs is of interest.

Funding: Support for this work was provided by the Center for Individualized Medicine at Mayo Clinic and the NIH (P50 CA97274).

We thank Dr Todd R. Golub and colleagues at the Broad Institute, where the genomic data were generated.

Conflict of interest: none declared.

9 in total

1. Circular binary segmentation for the analysis of array-based DNA copy number data.

Authors: Adam B Olshen; E S Venkatraman; Robert Lucito; Michael Wigler
Journal: Biostatistics Date: 2004-10 Impact factor: 5.899

2. Discovery and prioritization of somatic mutations in diffuse large B-cell lymphoma (DLBCL) by whole-exome sequencing.

Authors: Jens G Lohr; Petar Stojanov; Michael S Lawrence; Daniel Auclair; Bjoern Chapuy; Carrie Sougnez; Peter Cruz-Gordillo; Birgit Knoechel; Yan W Asmann; Susan L Slager; Anne J Novak; Ahmet Dogan; Stephen M Ansell; Brian K Link; Lihua Zou; Joshua Gould; Gordon Saksena; Nicolas Stransky; Claudia Rangel-Escareño; Juan Carlos Fernandez-Lopez; Alfredo Hidalgo-Miranda; Jorge Melendez-Zajgla; Enrique Hernández-Lemus; Angela Schwarz-Cruz y Celis; Ivan Imaz-Rosshandler; Akinyemi I Ojesina; Joonil Jung; Chandra S Pedamallu; Eric S Lander; Thomas M Habermann; James R Cerhan; Margaret A Shipp; Gad Getz; Todd R Golub
Journal: Proc Natl Acad Sci U S A Date: 2012-02-17 Impact factor: 11.205

3. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.

Authors: Daniel C Koboldt; Qunyuan Zhang; David E Larson; Dong Shen; Michael D McLellan; Ling Lin; Christopher A Miller; Elaine R Mardis; Li Ding; Richard K Wilson
Journal: Genome Res Date: 2012-02-02 Impact factor: 9.043

4. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV.

Authors: Jarupon Fah Sathirapongsasuti; Hane Lee; Basil A J Horst; Georg Brunner; Alistair J Cochran; Scott Binder; John Quackenbush; Stanley F Nelson
Journal: Bioinformatics Date: 2011-08-09 Impact factor: 6.937

5. FishingCNV: a graphical software package for detecting rare copy number variations in exome-sequencing data.

Authors: Yuhao Shi; Jacek Majewski
Journal: Bioinformatics Date: 2013-03-28 Impact factor: 6.937

6. Mechanisms of change in gene copy number.

Authors: P J Hastings; James R Lupski; Susan M Rosenberg; Grzegorz Ira
Journal: Nat Rev Genet Date: 2009-08 Impact factor: 53.242

7. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

8. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

9. A comparative analysis of exome capture.

Authors: Jennifer S Parla; Ivan Iossifov; Ian Grabill; Mona S Spector; Melissa Kramer; W Richard McCombie
Journal: Genome Biol Date: 2011-09-29 Impact factor: 13.583
