Nuno Sepúlveda, Susana G Campino, Samuel A Assefa, Colin J Sutherland, Arnab Pain, Taane G Clark.
Abstract
BACKGROUND: The advent of next-generation sequencing technology has accelerated efforts to map and catalogue copy number variation (CNV) in the genomes of micro-organisms important to public health. A typical analysis of the sequence data involves mapping reads onto a reference genome, calculating the coverage, and detecting regions with too-low or too-high coverage (deletions and amplifications, respectively). Current CNV detection methods rely on statistical assumptions (e.g., a Poisson model) that may not hold in general, or require fine-tuning of the underlying algorithms to detect known hits. We propose a new CNV detection methodology based on two Poisson hierarchical models, the Poisson-Gamma and the Poisson-Lognormal, which are flexible enough to describe different data patterns whilst remaining robust against deviations from the often-assumed Poisson model.
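The coverage-based detection outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple method-of-moments fit of the Poisson-Gamma (negative binomial) marginal distribution instead of a Bayesian fit, and it flags windows that fall outside a central probability interval of the fitted distribution. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def fit_poisson_gamma(counts):
    """Method-of-moments fit of a Poisson-Gamma (negative binomial) model.

    Returns (shape, prob) with mean = shape * (1 - prob) / prob and
    variance = mean / prob. Assumption: the data are overdispersed.
    """
    m, v = float(mean(counts)), float(variance(counts))
    if v <= m:
        raise ValueError("variance <= mean: data are not overdispersed")
    return m * m / (v - m), m / v


def nb_pmf(k, shape, prob):
    """Negative binomial probability of observing a coverage of k."""
    log_p = (math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
             + shape * math.log(prob) + k * math.log1p(-prob))
    return math.exp(log_p)


def central_interval(shape, prob, level=0.99):
    """Lower and upper quantiles enclosing the central `level` probability."""
    tail = (1.0 - level) / 2.0
    cdf, k, lo = 0.0, 0, None
    while True:
        cdf += nb_pmf(k, shape, prob)
        if lo is None and cdf >= tail:
            lo = k
        if cdf >= 1.0 - tail:
            return lo, k
        k += 1


def call_windows(counts, level=0.99):
    """Return indices of windows flagged as deletions and amplifications."""
    shape, prob = fit_poisson_gamma(counts)
    lo, hi = central_interval(shape, prob, level)
    deletions = [i for i, c in enumerate(counts) if c < lo]
    amplifications = [i for i, c in enumerate(counts) if c > hi]
    return deletions, amplifications
```

Calling `call_windows(counts, level=0.999)` on a list of per-window coverage values gives a stricter call set than `level=0.99`, because the central interval widens as the level increases.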
Year: 2013 PMID: 23442253 PMCID: PMC3679970 DOI: 10.1186/1471-2164-14-128
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Statistical description of the coverage distributions using 100-bp windows and after filtering the data for the exome
| | | | | | | ||||
|---|---|---|---|---|---|---|---|---|---|
| 3D7 | Africa | 19,590,258 | 162.8 | 767.4 | 0–794 | 1 | 3 | 25 | 14 |
| HB3 | Honduras | 14,024,161 | 116.6 | 585.6 | 0–449 | 188 | 262 | 23 | 0 |
| DD2 | Indonesia | 21,080,366 | 175.2 | 1861.9 | 0–749 | 139 | 214 | 873 | 470 |
| 7G8 | Brazil | 13,736,522 | 114.2 | 2141.2 | 0–794 | 188 | 1419 | 365 | 29 |
| GB4 | Ghana | 17,157,171 | 142.6 | 2087.3 | 0–955 | 151 | 540 | 274 | 7 |
| OX005 | Ghana | 17,214,916 | 143.1 | 4387.9 | 0–1386 | 187 | 308 | 7691 | 109 |
| OX006 | Kenya | 20,850,309 | 173.3 | 1072.1 | 0–733 | 46 | 102 | 656 | 7 |
Figure 1. Empirical coverage distributions are intrinsically overdispersed and skewed. A. Observed coverage distributions. B. Overdispersion defined as the ratio between coverage mean and variance (see also Table 1).
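As a worked illustration of the overdispersion described in the caption, the sketch below computes a dispersion index for each sample. Two assumptions are made: the fourth and fifth columns of the coverage table above are taken to be the coverage mean and variance (the column headers are missing here), and the index is computed as variance divided by mean, so that a value of 1 corresponds to a Poisson model.

```python
# (mean, variance) pairs, assumed to be columns 4 and 5 of the coverage table
coverage_stats = {
    "3D7": (162.8, 767.4),
    "HB3": (116.6, 585.6),
    "DD2": (175.2, 1861.9),
    "7G8": (114.2, 2141.2),
    "GB4": (142.6, 2087.3),
    "OX005": (143.1, 4387.9),
    "OX006": (173.3, 1072.1),
}


def dispersion_index(mean, variance):
    """Variance-to-mean ratio; equals 1 under a Poisson model."""
    return variance / mean


indices = {
    sample: dispersion_index(m, v) for sample, (m, v) in coverage_stats.items()
}

for sample, index in sorted(indices.items(), key=lambda item: item[1]):
    print(f"{sample}: {index:.1f}")
```

Under these assumptions every sample has an index well above 1, which is the pattern the caption describes as overdispersion.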
Analysis of real and simulated 3D7 resequencing data
| | | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| PG with | 0.82% | 0.58% | 0.36% | 0.08% | | 13 | 13 | 13 | 13 |
| PG with | 0.19% | 0.11% | 0.05% | 0.02% | | 13 | 13 | 13 | 13 |
| FREEC | 0.00% | 0.00% | 0.00% | 0.01% | | 0 | 0 | 0 | 13 |
| cn.MOPS + | 0.55% | 0.16% | 0.01% | 0.30% | | 0 | 0 | 0 | 11 |
Results refer to the (mean) percentage of overall hits detected relative to the total number of 100-bp windows (120,309), using real and simulated data from the 3D7 resequencing sample, and to the corresponding (average) number of 100-bp hits detected in the GTP cyclohydrolase I gene locus (PFL1155w, 1.3 kb in size).
*Results based on 10 independent simulated data sets.
+Analysis performed across samples (simulated or real where appropriate).
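To make the units in the table notes concrete, the following sketch converts a reported percentage back into an approximate number of 100-bp windows and counts the windows spanned by a locus of a given size. The helper names are hypothetical, and because the tabulated percentages are rounded, the recovered counts are only approximate.

```python
TOTAL_WINDOWS = 120_309  # total number of 100-bp windows, from the table note
WINDOW_BP = 100


def hits_from_percentage(pct, total=TOTAL_WINDOWS):
    """Approximate number of windows corresponding to a percentage of hits."""
    return round(pct / 100 * total)


def windows_spanned(length_bp, window_bp=WINDOW_BP):
    """Number of windows needed to cover a locus (ceiling division)."""
    return -(-length_bp // window_bp)


# 0.82% of all windows is roughly 987 windows
print(hits_from_percentage(0.82))
# a 1.3 kb locus such as PFL1155w spans 13 windows of 100 bp
print(windows_spanned(1300))
```

The second value explains why 13 is the largest number of hits that any method in the table reports for the PFL1155w locus.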
Summary of CNVs detected by the Poisson-Gamma model across different laboratory strains and clinical samples
| | | | |||||||
|---|---|---|---|---|---|---|---|---|---|
| # | |||||||||
| HB3 | Deletion | 322 | 109 | 60 | | 305 | 101 | 56 | PFI1475 (2.0) |
| | Amplification | 246 | 206 | 119 | | 60 | 53 | 46 | PF11_0503 (0.6) |
| DD2 | Deletion | 279 | 98 | 58 | | 265 | 95 | 55 | PFL2550w (1.7) |
| | Amplification | 678 | 84 | 63 | | 634 | 59 | 43 | PFE1120w (14.8) |
| 7G8 | Deletion | 243 | 125 | 83 | | 205 | 101 | 61 | MAL7P1.64 (1.1) |
| | Amplification | 343 | 118 | 106 | | 215 | 49 | 37 | PFL1130w (6.7) |
| GB4 | Deletion | 262 | 98 | 48 | | 253 | 92 | 45 | PFC0110w (2.5) |
| | Amplification | 108 | 84 | 79 | | 47 | 38 | 36 | PFL1155w (0.6) |
| OX005 | Deletion | 308 | 87 | 49 | | 274 | 73 | 39 | PFC0110w (2.8) |
| | Amplification | 1019 | 772 | 516 | | 192 | 140 | 118 | PFD0669c (1.0) |
| OX006 | Deletion | 170 | 65 | 35 | | 167 | 62 | 33 | PF07_0013 (1.3) |
| | Amplification | 277 | 226 | 188 | | 90 | 70 | 64 | MAL8P1.42 (1.1) |
Results refer to the number of individual hits (i.e., 100-bp windows) and loci (pooled hits where contiguous) using the credible levels γ = 99% and 99.9% in the analysis.
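The pooling of contiguous hits into loci mentioned in the note can be sketched as follows. This is an illustrative assumption about how such pooling might be done, not the authors' code: windows are identified by consecutive integer indices along a single chromosome, and the function name is hypothetical.

```python
def pool_contiguous(hit_windows):
    """Merge window indices into loci of consecutive windows.

    Returns a list of (first_window, last_window) tuples, one per locus.
    Assumes all indices refer to the same chromosome.
    """
    loci = []
    for window in sorted(set(hit_windows)):
        if loci and window == loci[-1][1] + 1:
            loci[-1][1] = window  # extend the current locus
        else:
            loci.append([window, window])  # start a new locus
    return [tuple(locus) for locus in loci]


# six individual hits pooled into three loci
print(pool_contiguous([5, 6, 7, 10, 12, 13]))
```

In practice the pooling would be run separately for each chromosome, and separately for deletions and amplifications, so that unrelated hits are never merged.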
Figure 2. Copy number variation between the PFL1125w and PFL1160w genes across different laboratory and clinical samples. A. HB3 (Honduras); B. DD2 (Indonesia); C. 7G8 (Brazil); D. GB4 (Ghana); E. OX005 (Ghana); F. OX006 (Kenya). Note that the prefix PFL was removed from the gene names, which are as listed in the GeneDB database (http://www.genedb.org).
Hits shared and exclusively detected by the Poisson-Gamma (PG) model, the FREEC and the cn.MOPS approaches
| | | | ||||||
|---|---|---|---|---|---|---|---|---|
| Deletions | HB3 | 175 (55.2) | 130 (41.0) | 12 (3.8) | | 195 (63.5) | 110 (35.8) | 2 (0.7) |
| | DD2 | 175 (63.9) | 90 (32.8) | 9 (3.3) | | 152 (57.4) | 113 (42.6) | 0 (0.0) |
| | 7G8 | 81 (29.1) | 124 (44.6) | 73 (26.3) | | 120 (21.4) | 85 (15.1) | 357 (63.5) |
| | GB4 | 72 (27.4) | 181 (68.8) | 10 (3.8) | | 150 (50.0) | 103 (34.3) | 47 (15.7) |
| | OX005 | 153 (51.0) | 121 (40.3) | 26 (8.7) | | 205 (69.3) | 69 (23.3) | 22 (7.4) |
| | OX006 | 93 (55.0) | 74 (43.8) | 2 (1.2) | | 62 (37.1) | 105 (62.9) | 0 (0.0) |
| Amplifications | HB3 | 19 (29.7) | 41 (64.1) | 4 (6.3) | | 23 (12.2) | 37 (19.7) | 128 (68.1) |
| | DD2 | 586 (84.3) | 48 (6.9) | 61 (8.8) | | 608 (85.2) | 26 (3.6) | 80 (11.2) |
| | 7G8 | 187 (37.6) | 28 (5.6) | 283 (56.8) | | 212 (17.8) | 3 (0.3) | 973 (81.9) |
| | GB4 | 6 (10.5) | 41 (71.9) | 10 (17.5) | | 38 (11.8) | 9 (2.8) | 274 (85.4) |
| | OX005 | 62 (4.2) | 130 (8.8) | 1291 (87.1) | | 168 (5.7) | 24 (0.8) | 2473 (93.5) |
| | OX006 | 27 (22.9) | 63 (53.4) | 28 (23.7) | | 64 (20.9) | 26 (8.5) | 216 (70.6) |
The frequencies (with the respective percentages in brackets) are the numbers of hits shared and exclusively detected when the PG model is compared with FREEC (first three data columns) and with cn.MOPS (last three data columns). Within each comparison, the columns give, in order, the hits shared by the two methods, the hits exclusive to the PG model, and the hits exclusive to the other method. Percentages are relative to the overall number of deletions or amplifications identified by the respective pair of methods.
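The shared and exclusive counts in the table can be reproduced from two sets of hit windows with plain set operations. The sketch below is a hedged illustration: the function name is hypothetical, and the window indices are synthetic, chosen only so that the counts match the HB3 deletion row of the first comparison (175, 130 and 12).

```python
def compare_hits(pg_hits, other_hits):
    """Count shared and exclusive hits between the PG model and another method.

    Returns a dict mapping each category to (count, percentage), where the
    percentage is relative to all hits found by either method.
    """
    pg, other = set(pg_hits), set(other_hits)
    counts = {
        "shared": len(pg & other),
        "pg_only": len(pg - other),
        "other_only": len(other - pg),
    }
    total = sum(counts.values())
    return {
        name: (n, round(100 * n / total, 1) if total else 0.0)
        for name, n in counts.items()
    }


# synthetic window indices: 305 PG hits, 187 hits from the other method
pg_windows = range(0, 305)
other_windows = range(130, 317)
print(compare_hits(pg_windows, other_windows))
```

With these synthetic sets the percentages come out as 55.2, 41.0 and 3.8, which agrees with the bracketed values in that row.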
Hits shared between CGH and coverage data using the Poisson-Gamma (PG) model, the FREEC software, and the cn.MOPS approach
| Sample | Method | Deletions | Amplifications | Total |
|---|---|---|---|---|
| HB3 | FREEC | — | — | 195/210 (92.9%) |
| | cn.MOPS | — | — | 214/348 (61.5%) |
| | PG with γ = 99% | — | — | 431/568 (75.9%) |
| | PG with γ = 99.9% | — | — | 288/365 (78.9%) |
| DD2 | FREEC | — | — | 792/831 (95.3%) |
| | cn.MOPS | — | — | 746/840 (88.8%) |
| | PG with γ = 99% | — | — | 854/957 (89.0%) |
| | PG with γ = 99.9% | — | — | 826/899 (91.9%) |
| 7G8 | FREEC | 89/154 (57.8%) | 285/470 (60.6%) | 374/624 (59.9%) |
| | cn.MOPS | 91/477 (19.1%) | 236/1185 (19.9%) | 327/1662 (19.7%) |
| | PG with γ = 99% | 164/243 (67.5%) | 216/343 (63.0%) | 380/586 (64.9%) |
| | PG with γ = 99.9% | 153/205 (75.6%) | 176/215 (81.9%) | 329/420 (78.3%) |
| GB4 | FREEC | 32/82 (39.0%) | 4/16 (25.0%) | 36/98 (36.7%) |
| | cn.MOPS | 77/197 (39.1%) | 28/273 (10.3%) | 105/470 (22.3%) |
| | PG with γ = 99% | 152/262 (59.0%) | 24/108 (22.2%) | 176/370 (47.6%) |
| | PG with γ = 99.9% | 148/253 (58.5%) | 14/47 (29.8%) | 162/300 (54.0%) |
CGH hits of the HB3 and DD2 lab strains were taken from Samarakoon et al. [19], while CGH hits of the 7G8 and GB4 lab strains were obtained by re-analysing the corresponding original data available from Jiang et al. [37]. The percentages in brackets are relative to the total number of coverage hits produced by the corresponding method.