MOTIVATION: Several algorithms exist for detecting copy number variants (CNVs) from human exome sequencing read depth, but previous tools have not been well suited for large population studies on the order of tens or hundreds of thousands of exomes. Their limitations include being difficult to integrate into automated variant-calling pipelines and being ill-suited for detecting common variants. To address these issues, we developed a new algorithm--Copy number estimation using Lattice-Aligned Mixture Models (CLAMMS)--which is highly scalable and suitable for detecting CNVs across the whole allele frequency spectrum. RESULTS: In this note, we summarize the methods and intended use-case of CLAMMS, compare it to previous algorithms and briefly describe results of validation experiments. We evaluate the adherence of CNV calls from CLAMMS and four other algorithms to Mendelian inheritance patterns on a pedigree; we compare calls from CLAMMS and other algorithms to calls from SNP genotyping arrays for a set of 3164 samples; and we use TaqMan quantitative polymerase chain reaction to validate CNVs predicted by CLAMMS at 39 loci (95% of rare variants validate; across 19 common variant loci, the mean precision and recall are 99% and 94%, respectively). In the Supplementary Materials (available at the CLAMMS Github repository), we present our methods and validation results in greater detail. AVAILABILITY AND IMPLEMENTATION: https://github.com/rgcgithub/clamms (implemented in C). CONTACT: jeffrey.reid@regeneron.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
Detecting copy number variants (CNVs) with whole exome sequencing data is challenging because CNV breakpoints are likely to fall outside of the exome. Almost all CNV-calling algorithms for whole exome sequencing base their calls on read depths within the CNV, which are linearly correlated with copy number state. Previously published algorithms include CoNIFER (Krumm et al., 2012), XHMM (Fromer et al., 2012), ExomeDepth (Plagnol et al., 2012) and CANOES (Backenroth et al., 2014).
Depth-of-coverage is subject to both systematic biases (often related to sequence GC-content) and stochastic volatility (which is exacerbated by variation in input DNA quality). CNV callers must normalize coverage data to correct for systematic biases and characterize the expected coverage profile given diploid copy number, so that true CNVs can be distinguished from noise. Variability in sample preparation and sequencing procedures results in additional coverage biases, often referred to as ‘batch effects’.
CoNIFER and XHMM use principal components analysis to identify and remove systematic biases, while ExomeDepth and CANOES handle bias by normalizing each sample’s coverage against the average in a small, ‘custom’ reference panel of samples whose coverage profiles are highly correlated with that of the sample in question. Both strategies have quadratic time complexity and large RAM requirements.
Each of these algorithms assumes that reference panel samples are always diploid (presenting a unimodal coverage distribution at each exon), which results in inaccurate genotypes at common CNV loci. They can also mistake population stratification of common CNVs for batch effects if true batch effects are minimal.
2 Algorithm
The CLAMMS algorithm has three steps, outlined in Figure 1.
Fig. 1. Overview of the CLAMMS CNV-calling pipeline. A reference panel is selected for each sample based on seven sequencing QC metrics using an efficient k-d tree data structure. After selecting reference panels, each sample and its corresponding reference panel may be processed in parallel across processes and/or servers, requiring only ∼50 MB of RAM per process.
(1) Coverage values for individual samples are normalized independently to correct for GC-amplification bias and overall average depth-of-coverage. Low-mappability regions are filtered out altogether, as the read depths in these regions do not accurately represent the sequence dosage in the genome.
(2) Given a reference panel of samples, a finite mixture model is fit for each exome capture region. Each mixture component models the expected distribution of coverage across samples for a particular integer copy number state. Model parameters will vary from exon to exon, correcting for additional non-GC-related coverage biases. Exome-wide, copy numbers 0–3 are considered; in known duplication regions, copy numbers 4–6 are considered as well (see Supplement Section S1.3).
(3) CNVs are called for individual samples using a hidden Markov model (HMM). The input sequence to the HMM is the sample’s normalized coverage values for each region. Emission probabilities are based on the trained mixture models and transition probabilities are similar to those used by XHMM.
The details of each step are described in the Supplementary Material. Mixture models allow for copy number polymorphic loci to be handled naturally, while the HMM incorporates the prior expectation that nearby anomalous signals are more likely to be part of a single CNV than multiple small CNVs.
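The way the mixture-model emissions and the HMM fit together can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CLAMMS implementation (which is written in C). The sketch assumes Gaussian emissions, a fixed diploid mean and spread for every exon, and made-up transition parameters (`p_start_cnv`, `p_stay`). The actual distributions and parameters are described in the Supplementary Material. The only property it tries to capture is the lattice alignment, in which each component mean is tied to its integer copy number.

```python
import math

STATES = [0, 1, 2, 3]  # integer copy numbers considered exome-wide
NEG_INF = float("-inf")


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def emission_logp(x, cn, mu_dip=1.0, sigma_dip=0.1):
    """Log-likelihood of normalized coverage x given copy number cn.

    Lattice alignment: the component mean is cn/2 times the diploid mean.
    The variance scaling and the 0.05 floor are illustrative assumptions.
    In practice mu_dip and sigma_dip would be fitted per exon from the panel.
    """
    mu = mu_dip * cn / 2.0
    sigma = max(sigma_dip * math.sqrt(cn / 2.0), 0.05)
    return log_gauss(x, mu, sigma)


def log_trans(a, b, p_start_cnv=1e-3, p_stay=0.9):
    """Assumed transitions: CNVs start rarely and tend to extend once started."""
    if a == 2:
        return math.log(1 - 3 * p_start_cnv) if b == 2 else math.log(p_start_cnv)
    if b == a:
        return math.log(p_stay)
    return math.log(1 - p_stay) if b == 2 else NEG_INF


def viterbi(coverage):
    """Most likely copy number state per exon for one sample."""
    # Assume the chain is diploid just before the first exon.
    score = {s: log_trans(2, s) + emission_logp(coverage[0], s) for s in STATES}
    back = []
    for x in coverage[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda a: score[a] + log_trans(a, s))
            new_score[s] = score[prev] + log_trans(prev, s) + emission_logp(x, s)
            ptr[s] = prev
        back.append(ptr)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # trace back from the best final state
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Three consecutive exons at roughly half coverage are called as one deletion.
print(viterbi([1.0, 0.98, 1.03, 0.52, 0.49, 0.51, 1.01, 0.97]))
```

Because entering a CNV state is penalized far more than extending one, a run of adjacent low-coverage exons is reported as a single heterozygous deletion rather than as several small events. This is the prior expectation described above.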
Mixture models have previously been used by Genome STRiP (a CNV caller for whole-genome data; Handsaker et al., 2015), and XHMM, ExomeDepth and CANOES use HMMs; but to our knowledge, no previous CNV-calling algorithm has integrated both in a single probabilistic model.
Similar to CANOES and ExomeDepth, we handle data heterogeneity by selecting a ‘custom’ reference panel for each sample. Our CNV calling pipeline (discussed further in the Supplementary Material) works as follows:
(1) We define a distance metric between samples based on seven sequencing quality control metrics from Picard (http://broadinstitute.github.io/picard).
(2) Each newly sequenced sample is added to a k-d tree in this metric space. CNVs are called using a reference panel consisting of the sample’s 100 nearest neighbors, which are found efficiently using the k-d tree.
Indexing n samples and finding the k nearest neighbors for each sample takes O(kn log n) time. Once the nearest neighbors have been found, calling CNVs for the n samples takes O(kn) time, so the complexity of the whole pipeline is O(kn log n). This improves on the O(n^2) complexity of previous algorithms, which all compare the coverage profile of each sample to every other sample.
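The reference-panel selection step can be illustrated with a small pure-Python k-d tree. This is a hedged sketch rather than the CLAMMS code. It builds the tree in one batch instead of adding samples incrementally, it uses plain Euclidean distance, and it assumes the QC metrics have already been scaled to comparable ranges. The function names (`build_kdtree`, `knn`, `reference_panel`) are hypothetical.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (metric_vector, sample_id) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])  # cycle through the QC dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two metric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(tree, query, k, heap=None):
    """Collect the k nearest points as a max-heap of (-distance^2, sample_id)."""
    if heap is None:
        heap = []
    if tree is None:
        return heap
    vec, sid = tree["point"]
    d = sq_dist(vec, query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, sid))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, sid))  # evict the current farthest hit
    diff = query[tree["axis"]] - vec[tree["axis"]]
    near, far = (tree["left"], tree["right"]) if diff < 0 else (tree["right"], tree["left"])
    knn(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the worst hit.
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        knn(far, query, k, heap)
    return heap


def reference_panel(tree, sample_id, vector, k=100):
    """Return the ids of the k nearest other samples, closest first."""
    hits = knn(tree, vector, k + 1)  # one extra slot because the sample finds itself
    ranked = sorted((-neg_d, sid) for neg_d, sid in hits)
    return [sid for _, sid in ranked if sid != sample_id][:k]
```

For example, `reference_panel(tree, "sampleA", metrics_of_sample_a, k=100)` would return the 100 samples whose QC profiles are closest to `sampleA`. The pruning test in `knn` is what avoids comparing each sample against every other sample.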
3 Validation
We performed computational and biological validations to compare the performance of CLAMMS and previous algorithms. We summarize our results here; details are provided in the Supplementary Material.
First, we used CLAMMS, XHMM, CoNIFER, CANOES and ExomeDepth to call CNVs for an eight-member pedigree, sequenced in three technical replicates. Ninety-two additional samples were provided as a reference panel. By definition, most CNVs in the pedigree are common variants. CLAMMS called a mean of 13.5 CNV/sample. Ninety-five percent of its calls in children were inherited and 92% of calls were consistent across all technical replicates. XHMM, CoNIFER and CANOES were insensitive to common variants; ExomeDepth is sensitive, but of its calls were inherited.
Second, we compared CNV calls from each algorithm (except CANOES, which ran out of memory on a server with 30 GB RAM) against ‘gold-standard’ calls from microarrays (PennCNV; Wang et al., 2007) for 3164 samples. We limited this analysis to rare variants (AF 0.1%) to avoid false positives related to batch effects in the array data. Using high-quality array-based rare-variant calls as the ground-truth, CLAMMS had the best performance (F-score 6.6% higher than ExomeDepth, 9.3% higher than XHMM and over double that of CoNIFER).
Finally, we used TaqMan quantitative polymerase chain reaction to validate a random subset of CNVs predicted by CLAMMS at 20 rare variant loci and 19 common variant loci that overlap disease-associated genes in the Human Gene Mutation Database (not limited to genes associated with CNVs; 7430 disease genes in total; Stenson); 19/20 (95%) rare variants were validated. At the common variant loci, CLAMMS genotypes achieved mean precision/recall values of 99.0% and 94.0%, respectively. ExomeDepth also had near-perfect performance for mid-frequency CNVs but had inaccurate genotypes at highly polymorphic loci (see Supplementary Table S5).
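For readers less familiar with the metrics quoted above, the following Python sketch shows how precision, recall and F-score are computed from a set of predicted calls and a set of ground-truth calls. It is a simplified illustration only. It assumes that a predicted call matches a true call when their (sample, locus, genotype) keys are identical, whereas the matching criteria actually used in the validation are described in the Supplementary Material. The sample and locus names are invented.

```python
def precision_recall_f(predicted, truth):
    """Return (precision, recall, F-score) for two collections of call keys."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical calls keyed by (sample, locus, genotype).
truth = {("s1", "locusA", "DEL"), ("s2", "locusA", "DEL"),
         ("s3", "locusB", "DUP"), ("s4", "locusC", "DEL")}
predicted = {("s1", "locusA", "DEL"), ("s2", "locusA", "DEL"),
             ("s3", "locusB", "DUP"), ("s5", "locusC", "DEL")}

print(precision_recall_f(predicted, truth))  # (0.75, 0.75, 0.75)
```

Precision penalizes false-positive calls and recall penalizes missed variants, so the F-score summarizes both in a single number for comparing callers.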
References
Kai Wang; Mingyao Li; Dexter Hadley; Rui Liu; Joseph Glessner; Struan F A Grant; Hakon Hakonarson; Maja Bucan. Genome Res, 2007-10-05.
Menachem Fromer; Jennifer L Moran; Kimberly Chambert; Eric Banks; Sarah E Bergen; Douglas M Ruderfer; Robert E Handsaker; Steven A McCarroll; Michael C O'Donovan; Michael J Owen; George Kirov; Patrick F Sullivan; Christina M Hultman; Pamela Sklar; Shaun M Purcell. Am J Hum Genet, 2012-10-05.
Daniel Backenroth; Jason Homsy; Laura R Murillo; Joe Glessner; Edwin Lin; Martina Brueckner; Richard Lifton; Elizabeth Goldmuntz; Wendy K Chung; Yufeng Shen. Nucleic Acids Res, 2014-04-25.
Niklas Krumm; Peter H Sudmant; Arthur Ko; Brian J O'Roak; Maika Malig; Bradley P Coe; Aaron R Quinlan; Deborah A Nickerson; Evan E Eichler. Genome Res, 2012-05-14.
Robert E Handsaker; Vanessa Van Doren; Jennifer R Berman; Giulio Genovese; Seva Kashin; Linda M Boettger; Steven A McCarroll. Nat Genet, 2015-01-26.
Vincent Plagnol; James Curtis; Michael Epstein; Kin Y Mok; Emma Stebbings; Sofia Grigoriadou; Nicholas W Wood; Sophie Hambleton; Siobhan O Burns; Adrian J Thrasher; Dinakantha Kumararatne; Rainer Doffinger; Sergey Nejentsev. Bioinformatics, 2012-08-31.