Niklas Krumm, Peter H Sudmant, Arthur Ko, Brian J O'Roak, Maika Malig, Bradley P Coe, Aaron R Quinlan, Deborah A Nickerson, Evan E Eichler.
Abstract
While exome sequencing is readily amenable to single-nucleotide variant discovery, the sparse and nonuniform nature of the exome capture reaction has hindered exome-based detection and characterization of genic copy number variation. We developed a novel method using singular value decomposition (SVD) normalization to discover rare genic copy number variants (CNVs) as well as genotype copy number polymorphic (CNP) loci with high sensitivity and specificity from exome sequencing data. We estimate the precision of our algorithm using 122 trios (366 exomes) and show that this method can be used to reliably predict (94% overall precision) both de novo and inherited rare CNVs involving three or more consecutive exons. We demonstrate that exome-based genotyping of CNPs strongly correlates with whole-genome data (median r² = 0.91), especially for loci with fewer than eight copies, and can estimate the absolute copy number of multi-allelic genes with high accuracy (78% call level). The resulting user-friendly computational pipeline, CoNIFER (copy number inference from exome reads), can reliably be used to discover disruptive genic CNVs missed by standard approaches and should have broad application in human genetic studies of disease.
Year: 2012 PMID: 22585873 PMCID: PMC3409265 DOI: 10.1101/gr.138115.112
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Cohorts analyzed
Figure 1. Method overview and CNV discovery. Exome sequencing reads from FASTQ files were divided into nonoverlapping 36-bp constituents (A) and aligned to targeted regions (B), allowing for up to two mismatches per 36-bp alignment. (C) For each exon or targeted region, we calculated RPKM values and then transformed these into “ZRPKM” values based on the median and standard deviation of each exon across all samples. (D) ZRPKM values were used as input to the SVD transformation, where we removed the first 12–15 singular values. Finally, a centrally weighted 15-exon average was passed over the SVD-ZRPKM values in order to reduce false positives, and a ±1.5 SVD-ZRPKM threshold was used to discover CNVs. (E) Final image shows ZRPKM values from 1000 consecutive exons on chromosome 16, plotted for 533 ESP exome background samples (black traces) and NA18507 (pink trace). Blue bar corresponds to a rare duplication in NA18507 at the METTL9/OTOA locus at chr16p12.2 that was validated by SNP microarray CNV analysis.
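The normalization steps in panels C and D can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the CoNIFER source code: the function names, the exons-by-samples matrix layout, the triangular shape of the "centrally weighted" window, and the handling of zero-variance exons are all assumptions made for the example. Only the ZRPKM definition (median and standard deviation per exon across samples), the removal of the leading singular values, the 15-exon window, and the ±1.5 threshold come from the caption.

```python
import numpy as np

def rpkm(read_counts, exon_lengths_bp):
    """Reads per kilobase of exon per million mapped reads.
    read_counts: (n_exons, n_samples); exon_lengths_bp: (n_exons,)."""
    total_reads = read_counts.sum(axis=0, keepdims=True)  # per-sample totals
    return read_counts * 1e9 / (exon_lengths_bp[:, None] * total_reads)

def zrpkm(rpkm_values):
    """Center each exon on its median across samples, scale by its std."""
    med = np.median(rpkm_values, axis=1, keepdims=True)
    std = rpkm_values.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # assumption: leave constant exons unscaled
    return (rpkm_values - med) / std

def remove_top_components(z, n_remove):
    """Zero the n_remove largest singular values and rebuild the matrix."""
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    s = s.copy()
    s[:n_remove] = 0.0
    return (u * s) @ vt

def smooth(values, window=15):
    """Centrally weighted moving average along the exon axis.
    The triangular weights are an assumption; the caption gives no shape."""
    half = window // 2
    weights = np.concatenate([np.arange(1, half + 2), np.arange(half, 0, -1)])
    weights = weights / weights.sum()
    return np.apply_along_axis(
        lambda col: np.convolve(col, weights, mode="same"), 0, values
    )

def call_cnvs(svd_zrpkm, threshold=1.5):
    """+1 for candidate duplications, -1 for candidate deletions, else 0."""
    return np.where(svd_zrpkm >= threshold, 1,
                    np.where(svd_zrpkm <= -threshold, -1, 0))

def pipeline(read_counts, exon_lengths_bp, n_remove=12):
    """Chain the steps: RPKM, ZRPKM, SVD removal, smoothing, thresholding."""
    z = zrpkm(rpkm(read_counts, exon_lengths_bp))
    return call_cnvs(smooth(remove_top_components(z, n_remove)))
```

Calling `pipeline(counts, lengths, n_remove=12)` on an exons-by-samples count matrix returns a matrix of the same shape holding -1, 0, or +1 per exon and sample. The value of `n_remove` would need to be chosen per data set within the range the caption reports, and the zero-padded convolution means values near the ends of the exon axis are attenuated.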
Precision of exome-based CNV calls in HapMap samples
Validation of exome-based CNV calls in autism probands
Figure 2. CNP locus genotyping of RHD and C4A. (A) SVD-transformed values for exons for the Rhesus deletion factor locus (RHD/RHCE) show distinct copy number states across both paralogous genes. (B) Histogram of average SVD-ZRPKM values for the ESP data set (533 individuals) and seven HapMap samples. Clustering was performed using an unsupervised algorithm (Supplemental Note). (C) Correlation between SVD-ZRPKM genotype values (y-axis) and absolute copy number estimate (x-axis) based on whole-genome read-depth for seven HapMap samples and experimentally validated by array-CGH. (D–F) Similar to above, for C4A locus.
Figure 3. Genotyping accuracy across 62 CNP loci. (A) Distribution of correlation coefficients of SVD-ZRPKM to whole-genome copy number estimate (Sudmant et al. 2010) across 62 CNP loci for seven HapMap samples, split by the median copy number of each locus. For loci with copy number less than eight, 32/39 had strong correlations between exome and whole-genome estimates, indicating that exome-based SVD-ZRPKM can be used to genotype such loci. (B) Results from unsupervised clustering algorithm for 43 autosomal loci for which genotype information was available (Campbell et al. 2011).
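The comparison in panel A amounts to computing one squared correlation per locus and grouping loci by median copy number. The sketch below shows one way to do that; it is not taken from the paper's code. The function names, the dictionary input format, and the choice of Pearson correlation are assumptions, and the cutoff of eight copies is the only value drawn from the caption. The caption does not say what counts as a "strong" correlation, so no such threshold is applied here.

```python
from statistics import median

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return (sxy * sxy) / (sxx * syy)

def split_by_median_copy_number(loci, cutoff=8):
    """loci maps a locus name to (svd_zrpkm_values, wgs_copy_numbers),
    one value per sample. Returns two dicts of r^2 values: loci whose
    median whole-genome copy number is below the cutoff, and the rest."""
    low, high = {}, {}
    for name, (genotype_values, copy_numbers) in loci.items():
        target = low if median(copy_numbers) < cutoff else high
        target[name] = pearson_r2(genotype_values, copy_numbers)
    return low, high
```

The two returned dictionaries can then be summarized separately, for example by taking the median r² of each group.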