Maarja Lepamets, Chiara Auwerx, Margit Nõukas, Annique Claringbould, Eleonora Porcu, Mart Kals, Tuuli Jürgenson, Andrew Paul Morris, Urmo Võsa, Murielle Bochud, Silvia Stringhini, Cisca Wijmenga, Lude Franke, Hedi Peterson, Jaak Vilo, Kaido Lepik, Reedik Mägi, Zoltán Kutalik.
Abstract
Copy-number variations (CNVs) are believed to play an important role in a wide range of complex traits, but discovering such associations remains challenging. While whole-genome sequencing (WGS) is the gold-standard approach for CNV detection, there are several orders of magnitude more samples with available genotyping microarray data. Such array data can be exploited for CNV detection using dedicated software (e.g., PennCNV); however, these calls suffer from elevated false-positive and false-negative rates. In this study, we developed a CNV quality score that weights PennCNV calls (pCNVs) based on their likelihood of being true positives. First, we established a measure of pCNV reliability by leveraging evidence from multiple omics data (WGS, transcriptomics, and methylomics) obtained from the same samples. Next, we built a predictor of omics-confirmed pCNVs, termed omics-informed quality score (OQS), using only PennCNV software output parameters. Promisingly, the OQS assigned to pCNVs detected in close family members was up to 35% higher than the OQS of pCNVs not carried by other relatives (p < 3.0 × 10^-90), outperforming other scores. Finally, in an association study of four anthropometric traits in 89,516 Estonian Biobank samples, the use of OQS led to a relative increase in the trait variance explained by CNVs of up to 56% compared with published quality filtering methods or scores. Overall, we put forward a flexible framework to improve any CNV detection method by leveraging multi-omics evidence, applied it to improve PennCNV calls, and demonstrated its utility by improving the statistical power of downstream association analyses.
Keywords: PennCNV; anthropometric traits; copy-number variation; gene expression; methylation; multi-omics; structural variation; whole genome sequencing
Year: 2022 PMID: 36035246 PMCID: PMC9399386 DOI: 10.1016/j.xhgg.2022.100133
Source DB: PubMed Journal: HGG Adv ISSN: 2666-2477
Figure 1. Workflow overview
(A) Quality estimation and modeling pipeline for PennCNV copy-number variation calls (pCNVs).
(B and C) The pCNV quality metrics are estimated based on (B) whole-genome sequencing (WGS) data and (C) gene expression (GE) and/or overall methylation (MET) intensity of genes/CpG sites overlapping the corresponding CNV calls.
(B) The WGS metric is the fraction of a pCNV that can be mapped to WGS CNVs of the same individual.
(C) To calculate the GE/MET metrics, the reference distribution of expression/intensity in non-carriers (pink area) is approximated by a standard normal distribution (red dashed line), and the Z score of the expression/intensity of each pCNV carrier (xi) is compared with it one at a time. The metric is the difference between the fraction of non-carriers with a value ≤xi and the fraction with a value >xi, and it captures how extreme xi is relative to the reference distribution of non-carriers. If a pCNV overlaps several genes/CpG sites, the metric values are averaged over them.
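The two metric definitions in panels (B) and (C) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the half-open single-chromosome coordinates, and the use of the non-carrier mean and standard deviation to form the Z score are all assumptions made for illustration, and the sketch does not check that the pCNV and WGS calls are of the same type (deletion or duplication).

```python
import math
from statistics import mean, stdev


def wgs_metric(pcnv, wgs_cnvs):
    """Fraction of a pCNV interval covered by WGS CNV intervals.

    Assumption: all intervals are (start, end) pairs, half-open, on the
    same chromosome and from the same individual.
    """
    start, end = pcnv
    if end <= start:
        raise ValueError("pCNV must have positive length")
    # Keep only overlapping WGS calls, clipped to the pCNV boundaries.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in wgs_cnvs
        if s < end and e > start
    )
    covered = 0
    cur_end = start
    for s, e in clipped:
        s = max(s, cur_end)  # avoid double counting overlapping WGS calls
        if e > s:
            covered += e - s
            cur_end = e
    return covered / (end - start)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def omics_metric(carrier_value, noncarrier_values):
    """GE/MET-style metric for one gene or CpG site.

    Assumption: the Z score is formed from the non-carrier mean and
    standard deviation; the result lies between -1 and 1.
    """
    z = (carrier_value - mean(noncarrier_values)) / stdev(noncarrier_values)
    below = normal_cdf(z)          # fraction of non-carriers <= xi
    return below - (1.0 - below)   # minus fraction of non-carriers > xi


def pcnv_omics_metric(carrier_values, noncarrier_values_per_feature):
    """Average the per-feature metric over all overlapped genes/CpG sites."""
    return mean(
        [omics_metric(x, ref)
         for x, ref in zip(carrier_values, noncarrier_values_per_feature)]
    )


# A pCNV spanning 100 positions with 60 of them covered by WGS calls.
print(wgs_metric((100, 200), [(50, 120), (110, 150), (190, 300)]))
```

Under the normal approximation the per-feature metric reduces to 2Φ(z) − 1, so a carrier value at the non-carrier mean scores 0 and values far in either tail approach −1 or 1.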
Overview of datasets and final sample sizes used in the analyses.
| Dataset | n | WGS (sample count) | Methyl. (sample count) | RNA-seq (sample count) | Omics-based metrics calculation (analysis step) | Model building (analysis step) | Model selection and validation (analysis step) | CNV associations (analysis step) |
|---|---|---|---|---|---|---|---|---|
| Estonian OmniExpress sample set (N = 7,750) | ||||||||
| EstBB-MO | 1,066 | 983 | 295 | 382 | + | + | – | – |
| First-degree relatives | 504 | N/A | N/A | N/A | – | – | + | – |
| Lifelines deep (N = ∼1,500) | ||||||||
| LLDeep | 1,383 | N/A | 768 | 1,098 | + | + | – | – |
| Swiss Kidney Project on Genes in Hypertension (N = 1,128) | ||||||||
| SkiPOGH | 466 | N/A | 148 | 405 | + | – | – | – |
| Parent-child pairs | 319 | N/A | N/A | N/A | – | – | + | – |
| Estonian Biobank GSA sample set (N = ∼200,000) | ||||||||
| EstBB-GSA (unrelated) | 89,516 | N/A | N/A | N/A | – | – | – | + |
| MZ twins | 312 | N/A | N/A | N/A | – | – | + | – |
| First-degree relatives | 79,903 | N/A | N/A | N/A | – | – | + | – |
| UK Biobank (N = ∼500,000) | ||||||||
| UKB (unrelated British) | 331,522 | N/A | N/A | N/A | – | – | – | + |
| MZ twins | 302 | N/A | N/A | N/A | – | – | + | – |
| First-degree relatives | 42,032 | N/A | N/A | N/A | – | – | + | – |
N/A, not applicable.
Estonian OmniExpress first-degree relatives do not overlap with EstBB-MO samples.
Figure 2. Overview of CNV quality metrics in EstBB-MO
(A and B) Pearson correlations among the omics-based metrics (WGS, MET, and GE) and cQS for EstBB-MO deletions (A) and duplications (B). Note that the number of pCNVs used in the correlation calculations is not identical for each pair of metrics (Figure S9).
(C and D) Bimodal distribution of WGS, MET, and GE metrics (C), as well as their combined metric (see material and methods) (D) for duplications (blue) and deletions (yellow). The combined metric is calculated for pCNVs that have at least two omics-based metrics available (n = 3,496) and the fractions of high-confidence false (combined metric <0.1) and true (combined metric >0.9) calls are reported.
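The reported fractions of high-confidence calls follow directly from the thresholds in the caption. The short Python sketch below assumes the combined metric has already been computed for each pCNV (its definition is in the material and methods and is not reproduced here); the function name and input format are hypothetical.

```python
def confidence_fractions(combined, low=0.1, high=0.9):
    """Fractions of high-confidence false (< low) and true (> high) calls.

    `combined` is assumed to hold one combined-metric value per pCNV with
    at least two omics-based metrics available.
    """
    n = len(combined)
    if n == 0:
        raise ValueError("no pCNVs with a combined metric")
    frac_false = sum(v < low for v in combined) / n
    frac_true = sum(v > high for v in combined) / n
    return frac_false, frac_true
```

Both thresholds are strict, so a call with a combined metric of exactly 0.1 or 0.9 falls into neither high-confidence group.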
Figure 3. Comparison of quality scores on pCNVs of closely related Estonian samples
Consensus-based (cQS) and omics-informed (OQS) CNV quality scores of non-familial and familial (found in two or more family members) deletions (yellow) and duplications (blue), calculated on a subset of Estonian OmniExpress samples (n = 504; no overlap with EstBB-MO). Familial pCNVs are likely true positives, while the non-familial group contains both true and false positives. We included rare (frequency <0.1%, striped background) familial pCNVs as a subset of CNVs less likely to validate in a relative by chance. The mean score of each pCNV group and their pairwise differences are shown at the top of the figure. Compared with cQS, the OQS shows higher values for familial pCNVs and larger differences between non-familial and familial pCNV quality. All differences for both scores are significant with p < 1 × 10^-16 (Wilcoxon test).
Figure 4. Impact of OQS on CNV-trait associations
(A) Change in variance explained in the mirror-type model when using OQS instead of raw PennCNV calls, four published quality filtering approaches, or cQS in EstBB-GSA and UKB, depicted as the distribution of F statistics calculated by randomizing the probe pruning priority order 20 times (see material and methods). Explained variance is increased when F >1 and decreased when F <1. Larger F values indicate a greater improvement in statistical power when using OQS over the given reference approach.
(B) Locus plot of a CNV region in 16p11.2 BP4-BP5 (red dashed lines: chr16:29,590,000–30,200,000 in GRCh37) associated with BMI in the EstBB-GSA dataset. The lines indicate the −log10 association p values from the mirror model with raw PennCNV calls (light blue), cQS (purple), and OQS (black). The yellow and blue areas illustrate the frequency of PennCNV deletion and duplication counts, respectively, across the region.