MOTIVATION: Exome sequencing (exome-seq) data, which are typically used for calling exonic mutations, have also been utilized in detecting DNA copy number variations (CNVs). Despite the existence of several CNV detection tools, there is still a great need for a sensitive and accurate CNV-calling algorithm that has built-in QC steps and does not require a paired reference for each sample. RESULTS: We developed a novel method named PatternCNV, which (i) accounts for the read coverage variations between exons while leveraging the consistencies of this variability across different samples; (ii) reduces alignment BAM files to WIG format and therefore greatly accelerates computation; (iii) incorporates multiple QC measures designed to identify outlier samples and batch effects; and (iv) provides a variety of visualization options including chromosome, gene and exon-level views of CNVs, along with a tabular summarization of the exon-level CNVs. Compared with other CNV-calling algorithms using data from a lymphoma exome-seq study, PatternCNV has higher sensitivity and specificity. AVAILABILITY AND IMPLEMENTATION: The software for PatternCNV is implemented using Perl and R, and can be used in Mac or Linux environments. Software and user manual are available at http://bioinformaticstools.mayo.edu/research/patterncnv/, and R package at https://github.com/topsoil/patternCNV/.
DNA copy number variations (CNVs) are genomic structural changes that result in regional or chromosomal loss or gain of DNA copies (Hastings ). Owing to their significant roles in human diseases, various laboratory techniques have been developed to detect CNVs, including recently advanced massively parallel sequencing of whole genomes and coding exomes. For exome-seq, it is commonly observed that the coverage depth of short reads varies across regions because of differences in target capture efficiency (Parla ) and in the mappability of exons. Such coverage variations impose substantial challenges for reliable CNV detection. Most existing methods use a paired-sample approach, based on the intuitive assumption that a somatic sample and its paired reference share a similar coverage bias that can be cancelled out through pairing (Koboldt ; Sathirapongsasuti ). Although this assumption approximately holds, it oversimplifies the problem and leaves two limitations unaddressed: (i) The region-specific noise (coverage variability) of a local region is not accounted for, leading to amplified noise in the log-ratio values of coverage between the sample and the paired reference. (ii) When the reference sample is missing or of low quality, CNV detection based on a paired reference is infeasible or has degraded accuracy/sensitivity. A recently published method, FishingCNV, tried to address the second limitation by using the average of multiple reference samples as the denominator in the log-ratio calculation, but did not address the regional noise in individual samples (the numerator), which led to false CNV calls (details in Supplementary Section S2.3). Considering these issues, we propose a novel method called PatternCNV, which summarizes the overall consistent patterns of both depth and variability of exonic region coverage across samples, where the ‘patterns’ of coverage and variability are summarized using multiple ‘normal’ or reference samples. 
We observed that such shared patterns exist only among samples prepared using the same version of the exome capture kit. During CNV detection, we compute the differences between the observed coverage and the common pattern, while penalizing regions associated with larger variability using a weighting scheme. Further, whole-genome CNV can be interpolated from exon-level CNV using any third-party segmentation method, e.g. circular binary segmentation (Olshen ).
PatternCNV was implemented in two versions: a Mac and Linux/Unix version, and an R package version. We also developed a conversion tool to transform binary Sequence Alignment/Map (BAM) format files into much smaller wiggle (WIG) format files (<1% of BAM file size), which greatly speeds up pattern learning and CNV calculation. When compared with other state-of-the-art CNV algorithms in a lymphoma case study, PatternCNV displayed higher resolution and greater sensitivity/specificity.
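The noise argument made above against pairing can be illustrated with a short variance calculation. This is only a sketch under simplifying assumptions that are not taken from the paper: the log2 coverage of a given region in any copy-neutral sample is assumed to have the same variance $\sigma^2$, and samples are assumed independent. The symbols $x_s$ (sample), $x_r$ (paired reference), $x_{r_i}$ (the $i$-th of $n$ pooled references) are introduced here purely for illustration.

```latex
\begin{align*}
\operatorname{Var}(x_s - x_r) &= \sigma^2 + \sigma^2 = 2\sigma^2
  && \text{(single paired reference)} \\
\operatorname{Var}\Big(x_s - \tfrac{1}{n}\sum_{i=1}^{n} x_{r_i}\Big)
  &= \sigma^2 + \tfrac{\sigma^2}{n} = \sigma^2\Big(1 + \tfrac{1}{n}\Big)
  && \text{(mean of } n \text{ references)}
\end{align*}
```

Under these assumptions, averaging references shrinks only the reference term. The sample term $\sigma^2$ remains, which is the region-specific noise that a variability-aware weighting is meant to down-weight.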
2 FEATURES
2.1 Input, output and major functions
PatternCNV is divided into three major functional components: (i) BAM-to-WIG conversion for improved computational performance: a BAM2WIG converter using SAMtools (Li ) and BEDtools (Quinlan and Hall 2010), which takes as input a BAM file, a file in Browser Extensible Data (BED) format defining exon regions and a second BED file for the capture targets defined by the exome capture kit. The outputs are WIG files that are much smaller than the BAM files; (ii) CNV detection: starting with WIG files, PatternCNV estimates the coverage and variability patterns from multiple reference samples and calculates CNVs relative to the pattern for all samples, including the references; and (iii) CNV summary and visualization: this module outputs a detailed exon-level CNV summary file per sample, and provides several visualization options for viewing CNVs at the whole-genome level or chromosome level. In addition, there are built-in QA/QC steps to detect sample outliers and batch effects. Figure 1 displays the overall workflow of PatternCNV along with illustrative examples of program output.
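To give a feel for why a binned WIG representation is so much smaller than a BAM file, the following minimal sketch reduces per-base depths over one exon to a fixedStep WIG record. It is not the BAM2WIG converter shipped with PatternCNV: the function name `exon_to_wig`, its parameters and the choice of mean depth per bin are all hypothetical. The per-base depths are assumed to be already available as a dictionary keyed by 1-based position.

```python
def exon_to_wig(chrom, bed_start, bed_end, depth_by_pos, bin_size=10):
    """Summarize one exon as fixedStep WIG lines (illustrative only).

    bed_start / bed_end: 0-based, half-open BED coordinates of the exon.
    depth_by_pos: dict mapping 1-based genomic position -> read depth;
                  positions missing from the dict count as depth 0.
    bin_size: number of bases summarized by each WIG value.
    """
    # WIG coordinates are 1-based, so the first base is bed_start + 1.
    first_base = bed_start + 1
    lines = [
        f"fixedStep chrom={chrom} start={first_base} "
        f"step={bin_size} span={bin_size}"
    ]
    for bin_start in range(first_base, bed_end + 1, bin_size):
        # Simplification: a trailing partial bin is averaged over the
        # bases that actually fall inside the exon.
        bin_end = min(bin_start + bin_size - 1, bed_end)
        depths = [depth_by_pos.get(pos, 0)
                  for pos in range(bin_start, bin_end + 1)]
        lines.append(f"{sum(depths) / len(depths):.2f}")
    return lines


# Usage: a 20 bp exon with depth 30 over its first half and 10 over its second.
example_depth = {pos: 30 for pos in range(100, 110)}
example_depth.update({pos: 10 for pos in range(110, 120)})
wig_lines = exon_to_wig("chr1", 99, 119, example_depth)
```

Each bin collapses `bin_size` bases, and all reads covering them, into a single number, which is where the size reduction comes from.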
Fig. 1.
The PatternCNV workflow is shown in the upper panel. Examples of whole-genome and chromosome-level visualization are displayed in the bottom panel, along with the exon-level CNV summary table.
2.2 Description of the PatternCNV algorithm
Each exon is first divided into consecutive bins of user-defined size (e.g. 10 base pairs). To make the exon coverage of different samples comparable, log2-transformed RPKM (reads per kilo-base per million total reads) is used to standardize the bin coverage. Denoting $x_l$ as the log2-transformed RPKM coverage of the $l$-th bin in a given exon, the standard coverage of a bin without CNV is assumed to approximately follow a normal distribution $N(\mu_l, \sigma_l^2)$. The $\mu_l$ and $\sigma_l$ are estimated from a pool of reference samples as the coverage and variability patterns. For a bin with a copy number of $C$, the bin signal is calculated as $x_l = \mu_l + \log_2(C/2) + \epsilon_l$, $\epsilon_l \sim N(0, \sigma_l^2)$. Hence, a bin-level CNV can be estimated as $x_l - \mu_l$. Considering that the variability of bin coverage depends on the bin's relative position in an exon or with respect to the capture probe, we further smooth multiple bins within the $k$-th exon (we denote the related bin indices as $B_k$), leading to a maximum likelihood estimation: $\widehat{\mathrm{CNV}}_k = \sum_{l \in B_k} w_l (x_l - \mu_l)$, where $w_l = \sigma_l^{-2} / \sum_{j \in B_k} \sigma_j^{-2}$ is designed to take the variability of each bin into consideration (details of the statistical formulation are described in Supplementary Section S1).
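The estimation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PatternCNV implementation (which is written in Perl and R): the function names, the pseudocount and the `min_sigma` floor are hypothetical. The weights are taken to be inverse variances, which is the maximum likelihood solution for a common shift under the normal model; the exact weighting used by the software is the one given in its Supplementary Section S1.

```python
import math
from statistics import mean, stdev


def log2_rpkm(read_count, bin_length_bp, total_reads, pseudocount=1e-3):
    """log2 RPKM of one bin; the pseudocount (an assumption) avoids log2(0)."""
    rpkm = read_count / (bin_length_bp / 1e3) / (total_reads / 1e6)
    return math.log2(rpkm + pseudocount)


def learn_pattern(reference_samples):
    """Estimate per-bin coverage (mean) and variability (SD) patterns.

    reference_samples: list of samples, each a list of log2 RPKM values
    for the same bins in the same order (at least two samples).
    """
    per_bin = list(zip(*reference_samples))
    mu = [mean(values) for values in per_bin]
    sigma = [stdev(values) for values in per_bin]
    return mu, sigma


def exon_cnv(sample_bins, mu, sigma, min_sigma=1e-3):
    """Inverse-variance weighted mean of (x_l - mu_l) over an exon's bins.

    min_sigma (an assumption) stops a bin with near-zero reference
    variability from receiving an unbounded weight.
    """
    precision = [1.0 / max(s, min_sigma) ** 2 for s in sigma]
    weighted = sum(p * (x - m) for p, x, m in zip(precision, sample_bins, mu))
    return weighted / sum(precision)


# Usage: two bins of one exon; bin 1 is stable across references, bin 2 is noisy.
references = [[5.0, 3.0], [5.1, 4.0], [4.9, 2.0]]
mu, sigma = learn_pattern(references)

single_copy_loss = exon_cnv([4.0, 2.0], mu, sigma)   # both bins shifted by -1
noisy_bin_only = exon_cnv([5.0, 5.0], mu, sigma)     # only the noisy bin deviates
copies = 2 * 2 ** single_copy_loss                   # back-transform to copy number
```

In the second call, the large deviation in the noisy bin is almost entirely discounted, whereas an unweighted mean of the two bins would have reported a log2 ratio of about 1.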
2.3 Lymphoma case study
We applied PatternCNV to a set of 15 germ line–tumor pairs of diffuse large B-cell lymphoma exome-seq data (Lohr ). When comparing CNV results derived from exome-seq using PatternCNV with those calculated from SNP microarray data profiled on the same samples, the two sets of results largely correlate for large CNVs. As expected, PatternCNV identified many small CNV regions at the single-exon and/or multiple-exon level (Supplementary Section S2.3) that the SNP array failed to detect owing to a lack of probe coverage/density in those regions. In addition, thanks to the digital dynamic range of read coverage, PatternCNV can differentiate high from low amplifications, whereas microarrays are limited by the saturation of the probe hybridization signal. We compared PatternCNV with three other exome-seq-based CNV detection methods, ExomeCNV (Sathirapongsasuti ), Varscan2 (Koboldt ) and FishingCNV (Shi and Majewski 2013), using CNVs detected by SNP microarrays as the ground truth. PatternCNV displayed superior visual resolution and achieved better specificity and sensitivity than the paired approaches used by ExomeCNV and Varscan2 (Supplementary Section S2.2), and produced far fewer false positives than FishingCNV (Supplementary Section S2.3). In several focused comparisons, we also saw an increased resolution of PatternCNV-based estimates compared with the two paired methods (Supplementary Section S2.1). In situations where a reference sample was of lower quality than its paired counterpart, we often observed dramatically reduced CNV detection performance for both Varscan2 and ExomeCNV, but not for PatternCNV (Supplementary Section S2.1). This highlights the robustness of the pattern-based approach over conventional paired approaches. FishingCNV averages across normal samples, which is more similar to PatternCNV than the paired methods used by the other two tools. 
However, a detailed comparison shows that FishingCNV differs in its data processing and CNV detection methods (Supplementary Section S2.3). FishingCNV’s principal component analysis (PCA) step overcorrects batch effects and consequently removes CNV signals, resulting in false-negative calls. We recommend that users do not perform the default PCA step of FishingCNV. Moreover, its average read-depth approach is oversimplified, producing an alarmingly high number of false-positive CNV calls (Supplementary Section S2.3). In contrast, PatternCNV’s novel use of both the weighted average read depth and the coverage variability yields more true-positive and far fewer false-positive CNV calls, and the tool is simpler to use.
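For readers who want to reproduce this kind of benchmark, the sketch below shows one simple way to compute sensitivity and specificity when array-based calls are treated as the ground truth. It is a hypothetical, exon-level illustration: the function name, the boolean encoding of calls and the rule that uncalled exons count as copy-neutral are assumptions, and the evaluation actually used in the study is the one described in its Supplementary Section S2.

```python
def sensitivity_specificity(predicted, truth):
    """Compare exon-level CNV calls against a ground-truth call set.

    predicted, truth: dicts mapping exon ID -> bool (True = CNV present).
    Exons absent from `predicted` are treated as not called (an assumption).
    Returns (sensitivity, specificity); NaN when a rate is undefined.
    """
    tp = fp = tn = fn = 0
    for exon, has_cnv in truth.items():
        called = predicted.get(exon, False)
        if has_cnv and called:
            tp += 1
        elif has_cnv and not called:
            fn += 1
        elif not has_cnv and called:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Usage with made-up exon IDs: one true CNV is missed, one neutral exon is miscalled.
truth_calls = {"e1": True, "e2": True, "e3": False, "e4": False, "e5": False}
method_calls = {"e1": True, "e2": False, "e3": True, "e4": False}
sens, spec = sensitivity_specificity(method_calls, truth_calls)
```

Because an array cannot confirm small CNVs that fall between its probes, exon-level "false positives" counted this way may include real events, so such figures are best read as relative comparisons between tools.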
3 DISCUSSIONS AND CONCLUSIONS
We introduce PatternCNV, a software package designed for exon-level CNV detection from exome-seq data. CNV estimation is based on coverage and variability patterns summarized from multiple reference samples. The implemented algorithm uses the WIG file format, which improves runtime and space efficiency. Several post-processing functions are included to facilitate interpretation through visualization, segmentation and tabular summarization. As demonstrated by the case study, we believe it is a useful utility for exome-seq studies where robust detection of germ line and/or somatic CNVs is of interest.
Funding: Support for this work was provided by the Center for Individualized Medicine at Mayo Clinic and the NIH (P50 CA97274). We thank Dr Todd R. Golub and colleagues at the Broad Institute, where the genomic data were generated.
Conflict of interest: none declared.
Authors: Jens G Lohr; Petar Stojanov; Michael S Lawrence; Daniel Auclair; Bjoern Chapuy; Carrie Sougnez; Peter Cruz-Gordillo; Birgit Knoechel; Yan W Asmann; Susan L Slager; Anne J Novak; Ahmet Dogan; Stephen M Ansell; Brian K Link; Lihua Zou; Joshua Gould; Gordon Saksena; Nicolas Stransky; Claudia Rangel-Escareño; Juan Carlos Fernandez-Lopez; Alfredo Hidalgo-Miranda; Jorge Melendez-Zajgla; Enrique Hernández-Lemus; Angela Schwarz-Cruz y Celis; Ivan Imaz-Rosshandler; Akinyemi I Ojesina; Joonil Jung; Chandra S Pedamallu; Eric S Lander; Thomas M Habermann; James R Cerhan; Margaret A Shipp; Gad Getz; Todd R Golub Journal: Proc Natl Acad Sci U S A Date: 2012-02-17 Impact factor: 11.205
Authors: Daniel C Koboldt; Qunyuan Zhang; David E Larson; Dong Shen; Michael D McLellan; Ling Lin; Christopher A Miller; Elaine R Mardis; Li Ding; Richard K Wilson Journal: Genome Res Date: 2012-02-02 Impact factor: 9.043
Authors: Jarupon Fah Sathirapongsasuti; Hane Lee; Basil A J Horst; Georg Brunner; Alistair J Cochran; Scott Binder; John Quackenbush; Stanley F Nelson Journal: Bioinformatics Date: 2011-08-09 Impact factor: 6.937
Authors: Jennifer S Parla; Ivan Iossifov; Ian Grabill; Mona S Spector; Melissa Kramer; W Richard McCombie Journal: Genome Biol Date: 2011-09-29 Impact factor: 13.583