Alberto Magi, Matteo Benelli, Seungtai Yoon, Franco Roviello, Francesca Torricelli.
Abstract
The discovery of genomic structural variants (SVs), such as copy number variants (CNVs), is essential to understand genetic variation in human populations and complex diseases. In recent years, the advent of new high-throughput sequencing (HTS) platforms has opened many opportunities for SV discovery, and a very promising approach consists of measuring the depth of coverage (DOC) of reads aligned to the human reference genome. At present, few computational methods have been developed for the analysis of DOC data, and all of them can analyse only one sample at a time. For these reasons, we developed a novel algorithm (JointSLM) that detects common CNVs among individuals by analysing DOC data from multiple samples simultaneously. We test JointSLM performance on synthetic and real data and show its unprecedented resolution, which enables the detection of recurrent CNV regions as small as 500 bp in size. When we apply JointSLM to chromosome 1 of eight genomes with different ancestry, we identify 3000 regions with recurrent CNVs of different frequency and size; hierarchical clustering on these regions segregates the eight individuals into two groups that reflect their ancestry, demonstrating the potential utility of JointSLM for population genetics studies.
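As a rough illustration of what a DOC signal is, the sketch below counts aligned reads in fixed, non-overlapping windows along a chromosome. The function name, the 100 bp window and the use of read start positions are assumptions made for the example; it does not reproduce JointSLM or any normalisation the authors apply.

```python
def depth_of_coverage(read_starts, chrom_length, window=100):
    """Count reads per non-overlapping window.

    read_starts  : iterable of 0-based alignment start positions (assumed input)
    chrom_length : chromosome length in bp
    window       : window size in bp (hypothetical default, not from the paper)
    """
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for start in read_starts:
        if 0 <= start < chrom_length:  # ignore reads outside the chromosome
            counts[start // window] += 1
    return counts


# Toy usage: six reads on a 1000 bp chromosome.
doc = depth_of_coverage([0, 50, 99, 100, 250, 999], chrom_length=1000)
```

A multi-sample method would take one such vector per individual, aligned on the same windows, and look for shifts in the count level that recur across samples.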
Year: 2011 PMID: 21321017 PMCID: PMC3105418 DOI: 10.1093/nar/gkr068
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. TPR and FPR estimates for different values of η and ω on synthetic data made of 10 chromosomes. Each point of the plot is obtained by averaging the JointSLM results over 100 repeated simulations. (a) Each curve represents the TPR estimate against deletion events of different size. Each plot reports the curves for different values of the fraction of altered samples f (with f ranging between 0.1 and 1). (b) Each curve represents the FPR estimate against the size of falsely detected events.
Figure 2. TPR and FPR for JointSLM, EWT, CBS and GLAD on the synthetic chromosome datasets. TPR is calculated as the average fraction of correctly detected alterations in each chromosome and FPR as the average number of false positives (FP) detected in each chromosome. For JointSLM, we report the results obtained on simulated datasets made of 10, 30 and 50 synthetic chromosomes.
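The two quantities defined in the Figure 2 caption can be sketched as follows. The caption does not say when a call counts as a correct detection, so the rule used here (a call and a simulated event share at least one base) is an assumption, as are the function names and the half-open interval convention.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def tpr_fpr(truth_per_chrom, calls_per_chrom):
    """Average per-chromosome TPR and average per-chromosome number of FP calls.

    Both arguments are lists with one list of (start, end) intervals per
    chromosome. The matching rule is an assumption, not taken from the paper.
    """
    tprs, fps = [], []
    for truth, calls in zip(truth_per_chrom, calls_per_chrom):
        # Simulated events hit by at least one call.
        detected = sum(any(overlaps(t, c) for c in calls) for t in truth)
        # Calls that hit no simulated event.
        false_pos = sum(not any(overlaps(c, t) for t in truth) for c in calls)
        tprs.append(detected / len(truth) if truth else 0.0)
        fps.append(false_pos)
    n = len(tprs)
    return sum(tprs) / n, sum(fps) / n


# Toy usage with two synthetic chromosomes.
truth = [[(100, 200), (500, 600)], [(1000, 1500)]]
calls = [[(150, 180), (700, 750)], [(900, 1100)]]
tpr, fpr = tpr_fpr(truth, calls)
```

Note that the FPR defined this way is a count per chromosome, not a rate between 0 and 1.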
Summary statistics for the CNVs detected by JointSLM on chromosome 1
| Number of samples that share the alterations | 100–500 bp | 500–1000 bp | 1–5 kb | 5–10 kb | >10 kb |
|---|---|---|---|---|---|
| 1 | 142 (53% / 19%) | 458 (33% / 9%) | 318 (55% / 23%) | 26 (100% / 77%) | 14 (100% / 86%) |
| 2 | 95 (59% / 23%) | 221 (43% / 11%) | 107 (73% / 48%) | 14 (100% / 71%) | 20 (100% / 95%) |
| 3 | 109 (49% / 20%) | 117 (45% / 15%) | 91 (80% / 41%) | 10 (100% / 80%) | 3 (100% / 100%) |
| 4 | 77 (58% / 19%) | 98 (48% / 16%) | 48 (79% / 42%) | 8 (100% / 88%) | 2 (100% / 50%) |
| 5 | 39 (51% / 28%) | 66 (48% / 6%) | 45 (87% / 40%) | 7 (100% / 86%) | 8 (100% / 100%) |
| 6 | 56 (59% / 29%) | 53 (57% / 8%) | 33 (79% / 48%) | 5 (100% / 60%) | 8 (100% / 75%) |
| 7 | 75 (55% / 20%) | 45 (51% / 7%) | 29 (86% / 38%) | 9 (100% / 67%) | 10 (80% / 40%) |
| 8 | 227 (51% / 14%) | 73 (55% / 18%) | 89 (87% / 45%) | 47 (96% / 64%) | 98 (95% / 62%) |
| Total | 820 (53% / 20%) | 1131 (42% / 11%) | 760 (70% / 35%) | 126 (98% / 71%) | 163 (96% / 70%) |
The numbers of CNVs detected by JointSLM are listed separately by size and by the number of samples that share the alteration. The percentages in brackets give the proportion of JointSLM calls that overlap (by at least 1 bp) with CNV regions in the Database of Genomic Variants (before the /) and in the GSV validation call set (after the /).
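The bracketed percentages rest on a 1 bp overlap criterion, which can be sketched as below. The coordinates are made up, intervals are treated as closed and 1-based (an assumption), and the quadratic scan is only suitable for small inputs.

```python
def fraction_overlapping(calls, known):
    """Fraction of calls sharing at least 1 bp with any known region.

    calls, known : lists of (start, end) closed intervals on one chromosome.
    """
    if not calls:
        return 0.0
    hits = sum(
        any(c_start <= k_end and k_start <= c_end for k_start, k_end in known)
        for c_start, c_end in calls
    )
    return hits / len(calls)


# Toy usage: two of the four calls touch a known region.
calls = [(100, 200), (300, 400), (500, 600), (700, 800)]
known = [(200, 250), (450, 499), (750, 760)]
share = fraction_overlapping(calls, known)
```

Because a single shared base is enough, large calls are more likely to count as overlapping than small ones, which is worth keeping in mind when comparing the size columns.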
Figure 3. Venn diagram of the comparison between the regions called by JointSLM, by PEM-based methods and by the GSV Consortium.
Summary statistics for the seven CNV clusters identified by Ward's hierarchical clustering
| Cluster | N | Size (bp) | SD (%) | SR (%) | RefSeq (%) | Class |
|---|---|---|---|---|---|---|
| A | 434 | 2 228 566 | 73 | 16 | 34 | Amp |
| B | 653 | 956 547 | 29 | 5.5 | 44 | Del |
| C | 545 | 1 112 655 | 63 | 6.1 | 40 | Amp |
| D | 683 | 1 192 517 | 31 | 8.1 | 61 | Del |
| E | 183 | 718 817 | 54 | 6.5 | 34 | Amp |
| F | 242 | 1 132 458 | 23 | 6.8 | 46 | Del |
| G | 260 | 300 840 | 19 | 7.5 | 53 | NA18507 |
For each cluster we listed the total number of regions (N) and the total size in bp. We also reported the overlap between called regions and segmental duplications (SD), simple repeats (SR) and RefSeq genes (RefSeq). Amp, Amplification; Del, Deletion.
Figure 4. Hierarchical clustering on the estimated copy number of the 3000 CNV regions detected by JointSLM on chromosome 1 with parameters η = 10⁻⁶, ω = 0.1 and K0 = 20. Each row represents a separate CNV region and each column a separate individual. The coloured bars on the right of the figure represent clusters of genomic events that share similar CNV patterns over multiple individuals.
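To make the clustering step concrete, here is a minimal pure-Python sketch of agglomerative clustering with Ward's criterion, applied to invented copy-number profiles. The data, function name and output format are assumptions for illustration; the sketch is not the authors' pipeline, and the same routine could be run on rows (regions) or columns (individuals).

```python
def ward_merges(profiles):
    """Agglomerative clustering with Ward's criterion.

    profiles : list of equal-length numeric vectors (toy copy-number estimates).
    Returns the merges in order as (members_a, members_b, cost), where cost is
    the increase in within-cluster sum of squares caused by the merge.
    """
    # cluster id -> (member indices, centroid)
    clusters = {i: ([i], list(p)) for i, p in enumerate(profiles)}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                members_a, cen_a = clusters[a]
                members_b, cen_b = clusters[b]
                na, nb = len(members_a), len(members_b)
                d2 = sum((x - y) ** 2 for x, y in zip(cen_a, cen_b))
                cost = na * nb / (na + nb) * d2  # Ward merge cost
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        members_a, cen_a = clusters.pop(a)
        members_b, cen_b = clusters.pop(b)
        na, nb = len(members_a), len(members_b)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(cen_a, cen_b)]
        clusters[next_id] = (members_a + members_b, centroid)
        merges.append((sorted(members_a), sorted(members_b), cost))
        next_id += 1
    return merges


# Toy usage: four individuals, three CNV regions each (invented values).
profiles = [[2.0, 2.0, 1.0], [2.0, 2.0, 1.2], [3.0, 1.0, 2.0], [3.0, 1.0, 2.1]]
merges = ward_merges(profiles)
group_a, group_b, _ = merges[-1]  # the final merge defines the two top-level groups
```

Reading off the last merge is equivalent to cutting the dendrogram into two groups, which is how a split of individuals into two groups would be obtained from such a tree.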