Stephen Cristiano1, David McKean2, Jacob Carey1, Paige Bracci3, Paul Brennan4, Michael Chou5, Mengmeng Du6, Steven Gallinger7, Michael G Goggins8,9, Manal M Hassan10, Rayjean J Hung7, Robert C Kurtz11, Donghui Li12, Lingeng Lu13, Rachel Neale14, Sara Olson6, Gloria Petersen15, Kari G Rabe15, Jack Fu1, Harvey Risch13, Gary L Rosner1,10, Ingo Ruczinski1, Alison P Klein16,17,18, Robert B Scharpf19,20.
Abstract
BACKGROUND: Germline copy number variants (CNVs) increase risk for many diseases, yet detecting CNVs and quantifying their contribution to disease risk in large-scale studies is challenging due to biological and technical sources of heterogeneity that vary across the genome within and between samples.
Keywords: Batch effects; CNPBayes; Copy number variants; Genome-wide association; Pancreatic cancer; SNP array
Year: 2020 PMID: 32894098 PMCID: PMC7487704 DOI: 10.1186/s12885-020-07304-3
Source DB: PubMed Journal: BMC Cancer ISSN: 1471-2407 Impact factor: 4.430
Fig. 1 Overview of sample processing, estimation of batch effects and copy number, and risk model for pancreatic cancer. a DNA samples for pancreatic cancer cases and healthy controls were obtained from 9 different study centers and processed centrally, where samples were randomized to chemistry plates. b Initial preprocessing of these samples identified candidate CNV regions. As the principal sources of batch effects were unknown, we developed an approach to identify latent batch effects by clustering empirical cumulative distribution functions (eCDFs) of CNV region summaries (c) and to genotype these samples via a Bayesian hierarchical mixture model (d). Uncertainty of the copy number genotypes (e) was propagated from the genomic analyses to the Bayesian logistic regression model for pancreatic cancer risk (f)
Fig. 2 Identification of batch surrogates. a Plate-specific eCDFs of the average log2R ratio for a region on chr5 (155,475,886-155,488,649 bp). b The plate-specific eCDFs were grouped by Kolmogorov-Smirnov test statistics, forming batches. The batch-specific eCDFs after grouping plates are shown on the right. The eCDFs between batches typically differed by a location shift, though here Batch 6 also captured samples with higher variance. c Single- and multi-batch mixture models were evaluated at each CNP. Densities from the posterior predictive distributions overlay the histograms of the 3-component multi-batch model (left). Adjusted for batch, only three components were needed to fit the apparent deletion polymorphism. B allele frequencies were used to genotype the mixture components. The mapping from the mixture component indices to copy number is indicated by the arrows on the x-axis labels (right)
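The grouping of plates described in panel b can be illustrated with a minimal sketch. It assumes a simple rule that is not taken from the paper: two plates join the same batch whenever their two-sample Kolmogorov-Smirnov statistic falls below a hypothetical `threshold`. The function names and the threshold value are illustrative only, and the actual CNPBayes procedure may differ.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_x(t) - F_y(t)|."""
    xs, ys = sorted(x), sorted(y)
    # The largest eCDF gap is attained at one of the observed values.
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in xs + ys
    )


def group_plates(plates, threshold):
    """Merge plates into batches when their eCDFs are close.

    `plates` maps a plate name to its list of region summaries
    (e.g. average log2R ratios). `threshold` is a hypothetical cutoff
    on the KS statistic, chosen here only for illustration.
    """
    names = list(plates)
    batch = {name: i for i, name in enumerate(names)}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if ks_statistic(plates[a], plates[b]) < threshold:
                old, new = batch[b], batch[a]
                # Relabel every plate in b's batch so the two batches merge.
                for name in names:
                    if batch[name] == old:
                        batch[name] = new
    return batch


# Toy data: p1 and p2 differ slightly, p3 is shifted far to the right.
plates = {
    "p1": [0.0, 0.1, 0.2, 0.3],
    "p2": [0.05, 0.15, 0.25, 0.35],
    "p3": [1.0, 1.1, 1.2, 1.3],
}
batches = group_plates(plates, threshold=0.5)
```

With these toy values, `p1` and `p2` end up in one batch and `p3` in another, mirroring the location shift between batches described in the caption.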
Fig. 3 Study site does not capture the major sources of technical variation. Hybridization intensities were available for four probes in a CNP region on chr4 spanning 9,370,866 bp - 9,410,140 bp (CNP_051). Restricting our analysis to high quality samples, we used the first principal component (PC1) as a one-dimensional summary of the 4 x 6,026 matrix of log2R ratios. The densities of the PC1 summaries, both marginal (black) and stratified by study site (gray), are bimodal, suggesting a copy number polymorphism (a). However, stratification of the PC1 summaries by grouping chemistry plates with similar eCDFs reveals an obvious batch effect (b). For example, the chemistry plates in group E, comprising 786 samples originating from all nine study sites, have a markedly different distribution than the 567 samples processed on group C chemistry plates
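A one-dimensional PC1 summary of a probes-by-samples matrix can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses NumPy's SVD, centers each probe across samples, and the function name `pc1_summary` and the simulated data are hypothetical.

```python
import numpy as np


def pc1_summary(log_r):
    """Return one PC1 score per sample for a probes x samples matrix."""
    # Center each probe (row) across samples before decomposition.
    centered = log_r - log_r.mean(axis=1, keepdims=True)
    # Thin SVD: centered = u @ diag(s) @ vt, with vt indexed by sample.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Projection of each sample onto the first probe-space direction.
    return s[0] * vt[0]


# Simulated example: 4 probes, 200 samples, with a hypothetical
# deletion carried by roughly half of the samples.
rng = np.random.default_rng(0)
carrier = rng.integers(0, 2, size=200)
log_r = -0.5 * carrier + rng.normal(0.0, 0.05, size=(4, 200))
scores = pc1_summary(log_r)
```

The sign of a principal component is arbitrary, so only the separation between the two modes of `scores` is meaningful, not which mode is positive.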
Fig. 4 Bayesian regression models for pancreatic cancer risk. To incorporate uncertainty of the copy number assignment from the low-level data, the integer copy number was sampled from the subject-specific posterior probabilities provided by CNPBayes at each iteration of the MCMC. While batch effects on CNV inference were already accounted for in the low and high quality sample collections, an imbalance of the pancreatic cancer cases between these collections warranted a stratified model with an interaction between copy number and data quality. An indicator, z, multiplying these coefficients allowed the slopes to be exactly zero. a Posterior probabilities of association from the stratified model for CNV regions across the genome. For regions where copy number inference was unaffected by data quality and associated with pancreatic cancer risk, regression coefficients for the low and high quality collections were positively correlated and the posterior mean of z (upper right corner) increased in the more powerful unstratified analysis using all 7598 samples (b). By contrast, negatively correlated coefficients indicated an effect of data quality on CNV inference, confirmed by visual inspection; the appropriate follow-up analysis and estimated probability of association were limited to the high quality sample collection (c)
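The step of sampling an integer copy number from subject-specific posterior probabilities at each MCMC iteration can be sketched as below. This is a minimal illustration and not the CNPBayes API: the posterior probabilities are made up, the function name `sample_copy_numbers` is hypothetical, and the update of the regression coefficients that would follow each draw in a full sampler is omitted.

```python
import random


def sample_copy_numbers(posteriors, rng):
    """Draw one integer copy number per subject from its posterior.

    `posteriors` is a list of dicts mapping copy number -> probability,
    one dict per subject (illustrative values, not real output).
    """
    draws = []
    for probs in posteriors:
        states = list(probs)
        weights = [probs[state] for state in states]
        draws.append(rng.choices(states, weights=weights, k=1)[0])
    return draws


# Subject 0 is genotyped with certainty; subject 1 is ambiguous.
posteriors = [
    {0: 0.0, 1: 0.0, 2: 1.0},
    {0: 0.0, 1: 0.6, 2: 0.4},
]
rng = random.Random(1)
# One fresh draw of all copy numbers per (mock) MCMC iteration.
iterations = [sample_copy_numbers(posteriors, rng) for _ in range(1000)]
```

Redrawing the copy number at every iteration means a confidently genotyped subject contributes the same value each time, while an ambiguous subject alternates between states in proportion to its posterior, so the genotyping uncertainty carries through to the posterior of the risk coefficients.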