Jean Monlong1,2, Patrick Cossette3, Caroline Meloche3, Guy Rouleau4, Simon L Girard1,3,5, Guillaume Bourque1,2,6.
Abstract
Copy number variants (CNVs) are known to affect a large portion of the human genome and have been implicated in many diseases. Although whole-genome sequencing (WGS) can help identify CNVs, most analytical methods suffer from limited sensitivity and specificity, especially in regions of low mappability. To address this, we use PopSV, a CNV caller that relies on multiple samples to control for technical variation. We demonstrate that our calls are stable across different types of repeat-rich regions and validate the accuracy of our predictions using orthogonal approaches. Applying PopSV to 640 human genomes, we find that low-mappability regions are approximately 5 times more likely to harbor germline CNVs, in stark contrast to the nearly uniform distribution observed for somatic CNVs in 95 cancer genomes. In addition to known enrichments in segmental duplications and near centromeres and telomeres, we also report that CNVs are enriched in specific types of satellites and in some of the most recent families of transposable elements. Finally, using this comprehensive approach, we identify 3455 regions with recurrent CNVs that were missing from existing catalogs. In particular, we identify 347 genes with a novel exonic CNV in low-mappability regions, including 29 genes previously associated with disease.
Year: 2018 PMID: 30137632 PMCID: PMC6101599 DOI: 10.1093/nar/gky538
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Mappability and population-based read depth (RD) estimates. (A) Inter-sample mean RD and average mappability in 5 kb bins. Regions with the same mappability estimate can have different RD levels. (B) Z-score distribution. For the mappability approach, Z-scores were computed from the mappability-predicted RD and the global standard deviation; for the population approach, from the inter-sample mean and standard deviation. (C) Z-score distribution across the mappability spectrum. (D) Average RD in the Twin study. The right tail of the histogram was winsorized using the IQR, and the different coverage classes are shown in color.
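The population-based Z-score described in panel B can be illustrated with a minimal sketch. It assumes a hypothetical input layout (one list of normalized bin-level RD values per reference sample) and is not PopSV's actual implementation; the function and variable names are illustrative only.

```python
from statistics import mean, stdev

def population_z_scores(reference_rd, sample_rd):
    """Z-score each bin of one sample against a reference cohort.

    reference_rd: list of per-sample lists of read depth, one value per bin
                  (hypothetical layout, at least two reference samples).
    sample_rd:    list of read depth values for the tested sample, one per bin.
    Returns one Z-score per bin, or None where the cohort shows no variation.
    """
    z_scores = []
    for bin_idx, value in enumerate(sample_rd):
        cohort = [sample[bin_idx] for sample in reference_rd]
        mu = mean(cohort)    # inter-sample mean RD for this bin
        sd = stdev(cohort)   # inter-sample standard deviation for this bin
        z_scores.append((value - mu) / sd if sd > 0 else None)
    return z_scores

reference = [[10, 100], [12, 100], [14, 100]]
print(population_z_scores(reference, [16, 100]))
```

Because the mean and standard deviation are estimated separately for each bin, two bins with the same mappability but different typical coverage are each judged against their own baseline, which is the contrast drawn in panels A and B.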
Figure 2. PopSV’s performance in low-mappability regions. (A) Clustering based on PopSV calls in extremely low coverage regions (below 100 reads). (B) Proportion and number of calls replicated in the monozygotic twin. The point shows the median value per sample, and the error bars show the 95% confidence interval. (C) Proportion and number of regions with reliable calls, computed from call replication in twins.
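As a rough illustration of the twin-replication metric in panel B, the sketch below computes the fraction of one twin's calls that overlap a call in the other twin. The replication criterion used here (any overlap on the same chromosome) is an assumption, since the caption does not state the exact rule, and the tuple layout `(chrom, start, end)` is hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def replication_rate(calls, twin_calls):
    """Fraction of `calls` that overlap at least one call in `twin_calls`.

    Assumption: any overlap counts as replication; a stricter rule
    (for example a minimum reciprocal overlap) could be substituted.
    """
    if not calls:
        return 0.0
    replicated = sum(any(overlaps(c, t) for t in twin_calls) for c in calls)
    return replicated / len(calls)

twin_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
twin_b = [("chr1", 150, 250), ("chr2", 300, 400)]
print(replication_rate(twin_a, twin_b))
```

The quadratic scan is adequate for a sketch; an interval tree or sorted sweep would be preferable for thousands of calls per sample.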
CNVs in the Twins, CageKid normals and GoNL datasets. WG: whole genome; ELC: extremely low-coverage regions. The total number of variants is the number after collapsing recurrent variants. Affected genome represents the amount of the reference genome that overlaps at least one CNV.
| Set | Depth | Samples | Variants: total | Variants: per sample (WG) | Variants: per sample (ELC) | Avg size (Kbp) | Variants <3 Kbp: proportion | Variants <3 Kbp: per sample | Affected genome (Mbp): total | Affected genome (Mbp): per sample | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Twin study | 42x | 45 | 20 222 | 1 637.27 | 243.24 | 4.21 | 0.65 | 1 056.84 | 62.22 | 5.30 | 6.89 | 9.03 |
| | | | 10 661 | 727.04 | 13.20 | 4.53 | 0.58 | 423.80 | 33.97 | 2.79 | 3.30 | 3.85 |
| | | | 10 396 | 910.22 | 230.04 | 3.94 | 0.70 | 633.04 | 34.20 | 2.50 | 3.59 | 5.29 |
| CageKid normals | 40x | 95 | 56 256 | 2 132.81 | 336.46 | 3.58 | 0.71 | 1 521.16 | 134.77 | 5.53 | 7.63 | 10.24 |
| | | | 25 367 | 805.08 | 12.74 | 4.30 | 0.63 | 508.56 | 70.65 | 2.65 | 3.46 | 7.26 |
| | | | 32 356 | 1 327.73 | 323.73 | 3.14 | 0.76 | 112.60 | 76.28 | 2.31 | 4.17 | 6.70 |
| GoNL | 13x | 500 | 27 945 | 549.52 | 81.97 | 8.71 | 0.46 | 250.24 | 226.50 | 3.05 | 4.79 | 8.16 |
| | | | 13 818 | 262.41 | 1.45 | 8.50 | 0.42 | 110.16 | 106.83 | 1.30 | 2.23 | 3.96 |
| | | | 15 291 | 287.10 | 80.52 | 8.91 | 0.49 | 140.08 | 139.21 | 1.45 | 2.56 | 5.72 |
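The "affected genome" measure defined in the table caption (the amount of reference genome overlapping at least one CNV) amounts to the length of the union of all CNV intervals. A minimal sketch, assuming hypothetical `(chrom, start, end)` tuples with half-open coordinates in base pairs:

```python
def affected_genome_bp(cnvs):
    """Total number of reference bases covered by at least one CNV.

    cnvs: iterable of (chrom, start, end) tuples, half-open coordinates
          (hypothetical layout). Overlapping CNVs are counted once.
    """
    by_chrom = {}
    for chrom, start, end in cnvs:
        by_chrom.setdefault(chrom, []).append((start, end))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total

calls = [("chr1", 0, 100), ("chr1", 50, 150), ("chr1", 200, 250), ("chr2", 0, 10)]
print(affected_genome_bp(calls) / 1e6, "Mbp")
```

Applied to one sample's calls this would give a per-sample value, and applied to the pooled calls of a cohort it would give the cohort total.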
Figure 3. Comparison with CNV catalogs from the 1000 Genomes Project (34) (1000GP) and a long-read sequencing study (59). (A) The x-axis represents the proportion of individuals with a CNV overlapping a region. The y-axis represents the cumulative proportion of the affected genome. (B) Overlap with the SV catalog from Chaisson et al. (59). In each cohort (color), the proportion of collapsed calls overlapping calls from Chaisson et al. (59) or control regions with a similar size distribution was modeled using logistic regression. Boxplots show the variation across 50 samplings of control regions. low-map: calls in low-mappability regions; ext. low-map: calls in extremely low-mappability regions.
Figure 4. CNVs in normal genomes. (A) Enrichment of CNVs in different genomic classes (x-axis) across different cohorts (colors), controlling for the distance to centromere/telomere/gap. Bars show the median fold enrichment compared to control regions. The error bar represents 90% of the samples in the cohort. (B) Enrichment of CNVs in repeat families (x-axis), controlling for the overlap with segmental duplications and the distance to centromere/telomere/gap. The error bars were winsorized at 7 for clarity. STR: Short Tandem Repeat; TE: Transposable Element.
Impact of CNVs on protein-coding genes. The number of CNVs is the number of distinct CNVs after collapsing CNVs with more than 50% reciprocal overlap. Repeat CNV: more than 90% of the CNV is annotated as repeat. Genes are protein-coding genes, and the promoter region is defined as the 10 kb region upstream of the transcription start site. Novel CNVs are located within regions annotated as novel compared to the 1000 Genomes Project catalog.
| Set | CNVs | Genes with CNVs: Exon | Genes with CNVs: + Promoter | Genes with CNVs: + Intron | OMIM genes with CNVs: Exon | OMIM genes with CNVs: + Promoter | OMIM genes with CNVs: + Intron |
|---|---|---|---|---|---|---|---|
| All CNVs | | | | | | | |
| All | 91 735 | 7 206 | 11 341 | 13 259 | 1 241 | 1 857 | 2 196 |
| Low coverage | 32 707 | 848 | 1 491 | 2 648 | 95 | 160 | 371 |
| Extremely low coverage | 9 348 | 304 | 401 | 442 | 11 | 14 | 25 |
| TE | 20 491 | 164 | 1 747 | 3 998 | 29 | 233 | 664 |
| STR | 4 285 | 45 | 286 | 748 | 5 | 39 | 129 |
| Satellite | 1 822 | 2 | 21 | 33 | 0 | 0 | 0 |
| Novel CNVs | | | | | | | |
| All | 17 046 | 418 | 680 | 1 102 | 38 | 59 | 135 |
| Low coverage | 15 263 | 347 | 560 | 894 | 29 | 47 | 111 |
| Extremely low coverage | 6 591 | 189 | 263 | 285 | 5 | 6 | 8 |
| TE | 3 896 | 17 | 192 | 504 | 1 | 12 | 66 |
| STR | 1 806 | 14 | 81 | 230 | 0 | 9 | 41 |
| Satellite | 890 | 1 | 4 | 5 | 0 | 0 | 0 |
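The collapsing rule mentioned in the table caption (CNVs with more than 50% reciprocal overlap are counted once) can be sketched as follows. This is a simple greedy illustration under stated assumptions, not the authors' procedure: the `(chrom, start, end)` tuple layout, the function names and the choice to merge into the spanning interval are all hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval that is covered by the other.

    a, b: (chrom, start, end) tuples with half-open coordinates.
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))

def collapse_cnvs(cnvs, min_overlap=0.5):
    """Greedily merge CNVs whose reciprocal overlap exceeds `min_overlap`.

    Each CNV is compared with the representatives kept so far; on a match
    the representative is widened to span both intervals (an assumption).
    """
    collapsed = []
    for cnv in sorted(cnvs):
        for i, rep in enumerate(collapsed):
            if reciprocal_overlap(cnv, rep) > min_overlap:
                collapsed[i] = (rep[0], min(rep[1], cnv[1]), max(rep[2], cnv[2]))
                break
        else:
            collapsed.append(cnv)
    return collapsed

calls = [("chr1", 0, 100), ("chr1", 40, 120), ("chr1", 90, 300)]
print(collapse_cnvs(calls))
```

Requiring the overlap to be reciprocal prevents a small CNV nested inside a much larger one from being merged with it, since the shared bases would cover only a small fraction of the larger event.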