CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data.

Jonathan S Packer¹, Evan K Maxwell¹, Colm O'Dushlaine¹, Alexander E Lopez¹, Frederick E Dewey¹, Rostislav Chernomorsky¹, Aris Baras¹, John D Overton¹, Lukas Habegger¹, Jeffrey G Reid¹.

Abstract

MOTIVATION: Several algorithms exist for detecting copy number variants (CNVs) from human exome sequencing read depth, but previous tools have not been well suited for large population studies on the order of tens or hundreds of thousands of exomes. Their limitations include being difficult to integrate into automated variant-calling pipelines and being ill-suited for detecting common variants. To address these issues, we developed a new algorithm, Copy number estimation using Lattice-Aligned Mixture Models (CLAMMS), which is highly scalable and suitable for detecting CNVs across the whole allele frequency spectrum.
RESULTS: In this note, we summarize the methods and intended use case of CLAMMS, compare it to previous algorithms and briefly describe results of validation experiments. We evaluate the adherence of CNV calls from CLAMMS and four other algorithms to Mendelian inheritance patterns on a pedigree; we compare calls from CLAMMS and other algorithms to calls from SNP genotyping arrays for a set of 3164 samples; and we use TaqMan quantitative polymerase chain reaction to validate CNVs predicted by CLAMMS at 39 loci (95% of rare variants validate; across 19 common variant loci, the mean precision and recall are 99% and 94%, respectively). In the Supplementary Materials (available at the CLAMMS GitHub repository), we present our methods and validation results in greater detail.
AVAILABILITY AND IMPLEMENTATION: https://github.com/rgcgithub/clamms (implemented in C).
CONTACT: jeffrey.reid@regeneron.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2015; PMID: 26382196; PMCID: PMC4681995; DOI: 10.1093/bioinformatics/btv547

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Detecting copy number variants (CNVs) with whole exome sequencing data is challenging because CNV breakpoints are likely to fall outside of the exome. Almost all CNV-calling algorithms for whole exome sequencing base their calls on read depths within the CNV, which are linearly correlated to copy number state. Previously published algorithms include CoNIFER (Krumm et al.), XHMM (Fromer et al.), ExomeDepth (Plagnol et al.) and CANOES (Backenroth et al.). Depth-of-coverage is subject to both systematic biases (often related to sequence GC-content) and stochastic volatility (which is exacerbated by variation in input DNA quality). CNV callers must normalize coverage data to correct for systematic biases and characterize the expected coverage profile given diploid copy number, so that true CNVs can be distinguished from noise. Variability in sample preparation and sequencing procedures results in additional coverage biases, often referred to as ‘batch effects’.

CoNIFER and XHMM use principal components analysis to identify and remove systematic biases, while ExomeDepth and CANOES handle bias by normalizing each sample’s coverage against the average in a small, ‘custom’ reference panel of samples with coverage profiles that are highly correlated to the individual sample in question. Both strategies have quadratic time complexity and large RAM requirements. Each of these algorithms assumes that reference panel samples are always diploid (presenting a unimodal coverage distribution at each exon), resulting in inaccurate genotypes at common CNV loci. They can also mistake population stratification of common CNVs for batch effects if true batch effects are minimal.

2 Algorithm

The CLAMMS algorithm has three steps, outlined in Figure 1.

Fig. 1.

Overview of the CLAMMS CNV-calling pipeline. A reference panel is selected for each sample based on seven sequencing QC metrics using an efficient k-d tree data structure. After selecting reference panels, each sample and its corresponding reference panel may be processed in parallel across processes and/or servers, requiring only ∼50 MB of RAM per process

1. Coverage values for individual samples are normalized independently to correct for GC-amplification bias and overall average depth-of-coverage. Low-mappability regions are filtered out altogether, as the read depths in these regions do not accurately represent the sequence dosage in the genome.

2. Given a reference panel of samples, a finite mixture model is fit for each exome capture region. Each mixture component models the expected distribution of coverage across samples for a particular integer copy number state. Model parameters will vary from exon to exon, correcting for additional non-GC-related coverage biases. Exome-wide, copy numbers 0–3 are considered; in known duplication regions, copy numbers 4–6 are considered as well (see Supplement Section S1.3).

3. CNVs are called for individual samples using a hidden Markov model (HMM). The input sequence to the HMM is the sample’s normalized coverage values for each region. Emission probabilities are based on the trained mixture models and transition probabilities are similar to those used by XHMM.

The details of each step are described in the Supplementary Material. Mixture models allow for copy number polymorphic loci to be handled naturally, while the HMM incorporates the prior expectation that nearby anomalous signals are more likely to be part of a single CNV than multiple small CNVs. Mixture models have previously been used by Genome STRiP (a CNV caller for whole-genome data; Handsaker et al.), and XHMM, ExomeDepth and CANOES use HMMs; but to our knowledge, no previous CNV-calling algorithm has integrated both in a single probabilistic model.
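As a rough illustration of how per-region mixture components can act as HMM emission probabilities, the Python sketch below decodes a copy number path with the standard Viterbi algorithm. It is not the CLAMMS implementation (which is written in C). The Gaussian components, the fixed standard deviation, the restriction to copy numbers 1 to 3, the use of Viterbi decoding, and the constants `P_ENTER_CNV` and `P_STAY_CNV` are all assumptions made for readability; the actual model is specified in the Supplementary Material.

```python
import math

# Hypothetical parameters for illustration only; these are not CLAMMS defaults.
COPY_NUMBERS = (1, 2, 3)   # heterozygous deletion, diploid, duplication
P_ENTER_CNV = 1e-4         # chance of moving from diploid into a given CNV state
P_STAY_CNV = 0.9           # chance that a CNV extends into the next region


def log_emission(cov, copy_number, diploid_mean=1.0, sd=0.1):
    """Log density of one mixture component (assumed Gaussian).

    Component means sit on a lattice: copy_number / 2 times the diploid mean.
    """
    mean = diploid_mean * copy_number / 2.0
    return -0.5 * math.log(2 * math.pi * sd * sd) - (cov - mean) ** 2 / (2 * sd * sd)


def log_transition(prev_cn, next_cn):
    """Log transition probability; each row of the implied matrix sums to 1."""
    if prev_cn == 2:
        p = 1 - 2 * P_ENTER_CNV if next_cn == 2 else P_ENTER_CNV
    elif next_cn == prev_cn:
        p = P_STAY_CNV
    elif next_cn == 2:
        p = 1 - P_STAY_CNV - P_ENTER_CNV
    else:
        p = P_ENTER_CNV
    return math.log(p)


def viterbi(coverage):
    """Most likely copy number per region for one sample's normalized coverage."""
    # Treat the position before the first region as diploid.
    score = {s: log_transition(2, s) + log_emission(coverage[0], s)
             for s in COPY_NUMBERS}
    back = []
    for cov in coverage[1:]:
        new_score, pointers = {}, {}
        for s in COPY_NUMBERS:
            best_prev = max(COPY_NUMBERS,
                            key=lambda p: score[p] + log_transition(p, s))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + log_transition(best_prev, s)
                            + log_emission(cov, s))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(COPY_NUMBERS, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi([1.0, 1.02, 0.98, 0.5, 0.52, 0.49, 1.01, 1.0]))
```

With these toy parameters, three consecutive regions at about half the diploid coverage are decoded as a single deletion, while one mildly low region (for example 0.8) stays diploid because the cost of entering a CNV state outweighs the small gain in emission likelihood. This is the prior expectation described above, expressed through the transition probabilities.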
Similar to CANOES and ExomeDepth, we handle data heterogeneity by selecting a ‘custom’ reference panel for each sample. Our CNV calling pipeline (discussed further in the Supplementary Material) works as follows. We define a distance metric between samples based on seven sequencing quality control metrics from Picard (http://broadinstitute.github.io/picard). Each newly sequenced sample is added to a k-d tree in this metric space. CNVs are called using a reference panel consisting of the sample’s 100 nearest neighbors, which are found efficiently using the k-d tree. Indexing n samples and finding the k nearest neighbors for each sample takes O(kn log n) time. Once the nearest neighbors have been found, calling CNVs for the n samples takes O(kn) time, so the complexity of the whole pipeline is O(kn log n). This improves on the O(n²) complexity of previous algorithms, which all compare the coverage profile of each sample to every other sample.
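The reference-panel lookup can be sketched with a small k-d tree in pure Python. This is an illustrative sketch under several assumptions, not the CLAMMS code: it uses plain Euclidean distance (the actual metric and any scaling of the seven QC metrics are described in the Supplementary Material), it builds the tree in one batch rather than inserting samples incrementally, and the function names `build_kdtree` and `nearest_neighbors` are hypothetical. The toy example uses two-dimensional points for brevity.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (vector, sample_id) pairs.

    Each node is a tuple (point, axis, left_subtree, right_subtree).
    Re-sorting at every level keeps the code short at the cost of an
    extra log factor during construction.
    """
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest_neighbors(tree, query, k, exclude=None):
    """Return the ids of the k samples closest to `query`, nearest first.

    `exclude` lets a sample be left out of its own reference panel.
    """
    heap = []  # max-heap on distance, stored as (-squared_distance, sample_id)

    def visit(node):
        if node is None:
            return
        (vector, sample_id), axis, left, right = node
        if sample_id != exclude:
            dist = sum((a - b) ** 2 for a, b in zip(vector, query))
            if len(heap) < k:
                heapq.heappush(heap, (-dist, sample_id))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, sample_id))
        diff = query[axis] - vector[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Search the far side only if the splitting plane is closer than
        # the current k-th best neighbor (or the panel is not yet full).
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(tree)
    return [sample_id for _, sample_id in sorted(heap, reverse=True)]


# Toy QC-metric vectors (two dimensions instead of seven).
samples = [((0.0, 0.0), "s0"), ((1.0, 0.0), "s1"), ((0.0, 2.0), "s2"),
           ((5.0, 5.0), "s3"), ((1.0, 1.0), "s4")]
tree = build_kdtree(samples)
print(nearest_neighbors(tree, (0.0, 0.0), k=2, exclude="s0"))
```

The pruning test in `visit` is what lets a typical query skip most of the tree, so that each sample's panel can be found without comparing it to every other sample.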

3 Validation

We performed computational and biological validations to compare the performance of CLAMMS and previous algorithms. We summarize our results here; details are provided in the Supplementary Material.

First, we used CLAMMS, XHMM, CoNIFER, CANOES and ExomeDepth to call CNVs for an eight-member pedigree, sequenced in three technical replicates. Ninety-two additional samples were provided as a reference panel. By definition, most CNVs in the pedigree are common variants. CLAMMS called a mean of 13.5 CNVs/sample. Ninety-five percent of its calls in children were inherited and 92% of calls were consistent across all technical replicates. XHMM, CoNIFER and CANOES were insensitive to common variants; ExomeDepth was sensitive, but of its calls were inherited.

Second, we compared CNV calls from each algorithm (except CANOES, which ran out of memory on a server with 30 GB RAM) against ‘gold-standard’ calls from microarrays (PennCNV; Wang et al.) for 3164 samples. We limited this analysis to rare variants (AF 0.1%) to avoid false positives related to batch effects in the array data. Using high-quality array-based rare-variant calls as the ground truth, CLAMMS had the best performance (F-score 6.6% higher than ExomeDepth, 9.3% higher than XHMM and over double that of CoNIFER).

Finally, we used TaqMan quantitative polymerase chain reaction to validate a random subset of CNVs predicted by CLAMMS at 20 rare variant loci and 19 common variant loci that overlap disease-associated genes in the Human Gene Mutation Database (not limited to genes associated with CNVs; 7430 disease genes in total; Stenson et al.); 19/20 (95%) rare variants were validated. At the common variant loci, CLAMMS genotypes achieved mean precision and recall values of 99.0% and 94.0%, respectively. ExomeDepth also had near-perfect performance for mid-frequency CNVs but had inaccurate genotypes at highly polymorphic loci (see Supplementary Table S5).
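For readers less familiar with the metrics quoted above, the following Python sketch shows how precision, recall and an F-score are computed from call counts against a truth set. The text does not state which F-score variant was used; the sketch assumes the common F1 score (the harmonic mean of precision and recall). The counts in the example are hypothetical and are not results from this study.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from counts of calls compared with a truth set.

    Assumes at least one call was made and at least one true variant exists.
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 90 calls confirmed, 10 unconfirmed, 30 true CNVs missed.
p, r = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(round(p, 3), round(r, 3), round(f_score(p, r), 3))
```

Because the F-score is a harmonic mean, it is pulled toward the lower of the two inputs, so a caller cannot score well by being precise but insensitive, or sensitive but imprecise.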

References

1. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data.

Authors: Kai Wang; Mingyao Li; Dexter Hadley; Rui Liu; Joseph Glessner; Struan F A Grant; Hakon Hakonarson; Maja Bucan
Journal: Genome Res Date: 2007-10-05 Impact factor: 9.043

2. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth.

Authors: Menachem Fromer; Jennifer L Moran; Kimberly Chambert; Eric Banks; Sarah E Bergen; Douglas M Ruderfer; Robert E Handsaker; Steven A McCarroll; Michael C O'Donovan; Michael J Owen; George Kirov; Patrick F Sullivan; Christina M Hultman; Pamela Sklar; Shaun M Purcell
Journal: Am J Hum Genet Date: 2012-10-05 Impact factor: 11.025

3. CANOES: detecting rare copy number variants from whole exome sequencing data.

Authors: Daniel Backenroth; Jason Homsy; Laura R Murillo; Joe Glessner; Edwin Lin; Martina Brueckner; Richard Lifton; Elizabeth Goldmuntz; Wendy K Chung; Yufeng Shen
Journal: Nucleic Acids Res Date: 2014-04-25 Impact factor: 16.971

4. The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution.

Authors: Peter D Stenson; Edward V Ball; Matthew Mort; Andrew D Phillips; Katy Shaw; David N Cooper
Journal: Curr Protoc Bioinformatics Date: 2012-09

5. Copy number variation detection and genotyping from exome sequence data.

Authors: Niklas Krumm; Peter H Sudmant; Arthur Ko; Brian J O'Roak; Maika Malig; Bradley P Coe; Aaron R Quinlan; Deborah A Nickerson; Evan E Eichler
Journal: Genome Res Date: 2012-05-14 Impact factor: 9.043

6. Large multiallelic copy number variations in humans.

Authors: Robert E Handsaker; Vanessa Van Doren; Jennifer R Berman; Giulio Genovese; Seva Kashin; Linda M Boettger; Steven A McCarroll
Journal: Nat Genet Date: 2015-01-26 Impact factor: 38.330

7. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling.

Authors: Vincent Plagnol; James Curtis; Michael Epstein; Kin Y Mok; Emma Stebbings; Sofia Grigoriadou; Nicholas W Wood; Sophie Hambleton; Siobhan O Burns; Adrian J Thrasher; Dinakantha Kumararatne; Rainer Doffinger; Sergey Nejentsev
Journal: Bioinformatics Date: 2012-08-31 Impact factor: 6.937
