Grace Png1,2,3, Daniel Suveges1,4, Young-Chan Park1,2, Klaudia Walter1, Kousik Kundu1, Ioanna Ntalla5, Emmanouil Tsafantakis6, Maria Karaleftheri7, George Dedoussis8, Eleftheria Zeggini1,3, Arthur Gilly1,3,9.
Abstract
Copy number variants (CNVs) play an important role in a number of human diseases, but the accurate calling of CNVs remains challenging. Most current approaches to CNV detection use raw read alignments, which are computationally intensive to process. We use a regression tree-based approach to call germline CNVs from whole-genome sequencing (WGS, >18x) variant call sets in 6,898 samples across four European cohorts, and describe a rich large variation landscape comprising 1,320 CNVs. Eighty-one percent of detected events have been previously reported in the Database of Genomic Variants. Twenty-three percent of high-quality deletions affect entire genes, and we recapitulate known events such as the GSTM1 and RHD gene deletions. We test for association between the detected deletions and 275 protein levels in 1,457 individuals to assess the potential clinical impact of the detected CNVs. We describe complex CNV patterns underlying an association with levels of the CCL3 protein (MAF = 0.15, p = 3.6 × 10⁻¹²) at the CCL3L3 locus, and a novel cis-association between a low-frequency NOMO1 deletion and NOMO1 protein levels (MAF = 0.02, p = 2.2 × 10⁻⁷). This study demonstrates that existing population-wide WGS call sets can be mined for germline CNVs with minimal computational overhead, delivering insight into a less well-studied, yet potentially impactful class of genetic variant.
Keywords: association study; copy-number variant; whole-genome sequencing
Year: 2019 PMID: 31520489 PMCID: PMC8653900 DOI: 10.1002/gepi.22260
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.344
Figure 1. Overview of the UN‐CNVc algorithm. (a) Overview of the pipeline, with input and output files in blue, and external tools and libraries in grey. (b) Output of a piecewise constant regression (in red) on a 10 Mb window on chromosome 11, for a homozygous deletion carrier. The grey signal is the raw relative depth at every sequenced marker for that sample. (c) Pooled regressed segments across the population, with colour indicating the attributed ideal depth (0: red, 0.5: blue, 1: green, and 1.5: purple). (d) Raw count (dashed line) and run‐length encoding (shaded green bars) on the number of high‐quality segments with ideal depth <1. (e) Genotyping using both weighted average segment depth (colour, scheme identical to c) and average depth across markers (plotting glyphs, squares: 0, circles: 0.5, and triangles: 1). UN‐CNVc, Unimaginatively Named CNV caller.
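The steps named in panels (b) to (d) can be illustrated with a minimal Python sketch. It is not the UN‐CNVc implementation: a greedy recursive split stands in for the regression tree, the data are synthetic, and the function names and the `min_size` and `min_gain` parameters are hypothetical. Only the ideal depths (0, 0.5, 1, 1.5) come from the caption.

```python
from itertools import groupby

# Ideal relative depths named in the Figure 1 caption.
IDEAL_DEPTHS = (0.0, 0.5, 1.0, 1.5)


def sse(values):
    """Sum of squared errors of the values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_segments(depths, start=0, min_size=5, min_gain=0.5):
    """Greedy piecewise-constant fit of a relative-depth signal.

    Returns (start, end_exclusive, mean_depth) tuples. min_size and
    min_gain are hypothetical tuning parameters, not values from the paper.
    """
    n = len(depths)
    total = sse(depths)
    best_gain, best_cut = 0.0, None
    for cut in range(min_size, n - min_size + 1):
        gain = total - sse(depths[:cut]) - sse(depths[cut:])
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    if best_cut is None or best_gain < min_gain:
        return [(start, start + n, sum(depths) / n)]
    left = fit_segments(depths[:best_cut], start, min_size, min_gain)
    right = fit_segments(depths[best_cut:], start + best_cut, min_size, min_gain)
    return left + right


def snap_to_ideal(mean_depth):
    """Assign the closest ideal depth to a segment mean."""
    return min(IDEAL_DEPTHS, key=lambda d: abs(d - mean_depth))


def per_marker_calls(depths):
    """Expand snapped segment depths back to one value per marker."""
    calls = []
    for seg_start, seg_end, mean_depth in fit_segments(depths):
        calls.extend([snap_to_ideal(mean_depth)] * (seg_end - seg_start))
    return calls


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Synthetic signals: one homozygous deletion carrier, one non-carrier.
noisy_normal = [1.0 + 0.05 * (-1) ** i for i in range(20)]
carrier = noisy_normal + [0.02] * 10 + noisy_normal
non_carrier = noisy_normal + [1.0] * 10 + noisy_normal

# Count, at every marker, the samples whose ideal depth is below 1.
population = [per_marker_calls(carrier), per_marker_calls(non_carrier)]
low_depth_counts = [
    sum(1 for call in column if call < 1.0) for column in zip(*population)
]
encoded = run_length_encode(low_depth_counts)
```

In this toy setting, `encoded` marks one run of markers where a single sample has depth below 1, flanked by runs with no low-depth samples; candidate CNV intervals would correspond to the non-zero runs.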
Number of CNVs called in each cohort
| | MANOLIS | Pomak | TEENAGE | INTERVAL |
|---|---|---|---|---|
| Total called | 401 | 353 | 349 | 973 |
| Centromeric/telomeric regions | 53 | 55 | 47 | 77 |
| Failed regions (Figure S4b) | 150 | 84 | 197 | 155 |
| Deletions that passed both interval‐based QC and genotype QC | 154 | 178 | 60 | 675 |
| Regions that required manual genotyping | 44 | 36 | 45 | 66 |
| Manually genotyped deletions | 58 | 50 | 49 | 96 |
| Final no. of high‐quality deletions | 212 | 228 | 109 | 771 |
Note: Interval‐based QC was done based on calling metrics and diagnostics plots, with passing events having no multiple breaks within the call regions and homogeneous boundaries, whereas genotype QC was performed using genotype diagnostics plots. Called events with inaccurate genotypes or complex regions containing multiple deletion events were manually genotyped. An example of a “failed region” is shown in Figure S4b. The final set of high‐quality deletion events comprises deletions passing both QC and the manually genotyped deletions.
Abbreviations: CNVs, copy number variants; QC, quality control.
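The arithmetic behind the table can be checked with a short Python sketch. The first check follows the table note (final set = deletions passing both QC + manually genotyped deletions). The second check, that the four region categories sum to the total called, is an observation about these numbers rather than a rule stated in the source. The dictionary keys and the function name are hypothetical labels.

```python
COHORTS = ("MANOLIS", "Pomak", "TEENAGE", "INTERVAL")

# Values copied from the table; key names are illustrative labels only.
TABLE = {
    "total_called": (401, 353, 349, 973),
    "centromeric_telomeric": (53, 55, 47, 77),
    "failed_regions": (150, 84, 197, 155),
    "passed_both_qc": (154, 178, 60, 675),
    "manual_regions": (44, 36, 45, 66),
    "manually_genotyped": (58, 50, 49, 96),
    "final_high_quality": (212, 228, 109, 771),
}


def check_table(table):
    """Return {cohort: (final_ok, total_ok)} for the two consistency checks."""
    results = {}
    for i, cohort in enumerate(COHORTS):
        # Stated in the table note: final = passed both QC + manually genotyped.
        final_ok = (
            table["passed_both_qc"][i] + table["manually_genotyped"][i]
            == table["final_high_quality"][i]
        )
        # Assumption (not stated in the source): the four region categories
        # partition the total number of called regions.
        total_ok = (
            table["centromeric_telomeric"][i]
            + table["failed_regions"][i]
            + table["passed_both_qc"][i]
            + table["manual_regions"][i]
            == table["total_called"][i]
        )
        results[cohort] = (final_ok, total_ok)
    return results


for cohort, (final_ok, total_ok) in check_table(TABLE).items():
    print(f"{cohort}: final set consistent={final_ok}, total consistent={total_ok}")
```

Both checks hold for all four cohorts with the values as printed in the table.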
Figure 2. Chromosome map of all CNVs called by UN‐CNVc in four cohorts. Light grey tracks represent CNVs that failed QC, while the red, blue, green, and yellow tracks represent high‐quality CNVs in MANOLIS, Pomak, TEENAGE, and INTERVAL, respectively. Within the chromosomes, dark grey regions represent the centromeres. Regions marked in pink are assembly exceptions and patches, taken from the GRC data for GRCh38.p12. Regions in blue are segmental duplications (from UCSC). Regions in light green are “CNV hotspots,” which are known, highly variable regions comprising an intergenic region on chr6q14.1, an olfactory receptor gene cluster (OR4C11‐OR5L2) on chr11q11, a leukocyte immunoglobulin gene cluster (LILRB3‐LILRB5) on chr19q13.42, the immunoglobulin κ, λ, and heavy chain loci (IGKC, IGLC1, IGH), and the T cell receptor alpha locus (TRA). Regions in orange are retrotransposable elements larger than 5 kb, comprising Alus, SVAs, and L1, L2, and L3 elements. CNVs, copy number variants; QC, quality control; UN‐CNVc, Unimaginatively Named CNV caller.