Hoang T Nguyen, Tony R Merriman, Michael A Black.
Abstract
Recent advances in high-throughput sequencing technologies have made it possible to accurately assign copy number (CN) at CN variable loci. However, current analytic methods often perform poorly in regions in which complex CN variation is observed. Here we report the development of a read depth-based approach, CNVrd2, for investigation of CN variation using high-throughput sequencing data. This methodology was developed using data from the 1000 Genomes Project from the CCL3L1 locus, and tested using data from the DEFB103A locus. In both cases, samples were selected for which paralog ratio test data were also available for comparison. The CNVrd2 method first uses observed read-count ratios to refine segmentation results in one population. Then a linear regression model is applied to adjust the results across multiple populations, in combination with a Bayesian normal mixture model to cluster segmentation scores into groups for individual CN counts. The performance of CNVrd2 was compared to that of two other read depth-based methods (CNVnator, cn.mops) at the CCL3L1 and DEFB103A loci. The highest concordance with the paralog ratio test method was observed for CNVrd2 (77.8/90.4% for CNVrd2, 36.7/4.8% for cn.mops and 7.2/1% for CNVnator at CCL3L1 and DEFB103A). CNVrd2 is available as an R package as part of the Bioconductor project: http://www.bioconductor.org/packages/release/bioc/html/CNVrd2.html.
Keywords: 1000 Genomes; CCL3L1; DEFB4; copy number variation; read depth
Year: 2014 PMID: 25136349 PMCID: PMC4117933 DOI: 10.3389/fgene.2014.00248
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
| African ancestry | ACB | 0 (0%) | 1 (1%) | 13 (13.5%) | 19 (19.8%) | 35 (36.5%) | 13 (13.5%) | 9 (9.4%) | 4 (4.2%) | 2 (2.1%) | 0 (0%) | 96 |
| | ASW | 0 (0%) | 3 (4.5%) | 11 (16.7%) | 14 (21.2%) | 17 (25.8%) | 12 (18.2%) | 6 (9.1%) | 2 (3%) | 1 (1.5%) | 0 (0%) | 66 |
| | ESN | 0 (0%) | 0 (0%) | 6 (6.1%) | 22 (22.2%) | 24 (24.2%) | 20 (20.2%) | 15 (15.2%) | 10 (10.1%) | 0 (0%) | 2 (2%) | 99 |
| | GWD | 0 (0%) | 1 (0.9%) | 10 (8.8%) | 35 (31%) | 27 (23.9%) | 21 (18.6%) | 10 (8.8%) | 7 (6.2%) | 2 (1.8%) | 0 (0%) | 113 |
| | LWK | 0 (0%) | 2 (2%) | 7 (6.9%) | 24 (23.8%) | 31 (30.7%) | 17 (16.8%) | 9 (8.9%) | 7 (6.9%) | 2 (2%) | 2 (2%) | 101 |
| | MSL | 0 (0%) | 2 (2.4%) | 7 (8.2%) | 17 (20%) | 33 (38.8%) | 15 (17.6%) | 5 (5.9%) | 5 (5.9%) | 0 (0%) | 1 (1.2%) | 85 |
| | YRI | 0 (0%) | 1 (0.9%) | 14 (12.8%) | 19 (17.4%) | 30 (27.5%) | 25 (22.9%) | 8 (7.3%) | 9 (8.3%) | 2 (1.8%) | 1 (0.9%) | 109 |
| Americas | CLM | 6 (6.4%) | 8 (8.5%) | 29 (30.9%) | 35 (37.2%) | 11 (11.7%) | 5 (5.3%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 94 |
| | MXL | 0 (0%) | 8 (11.9%) | 15 (22.4%) | 26 (38.8%) | 10 (14.9%) | 7 (10.4%) | 1 (1.5%) | 0 (0%) | 0 (0%) | 0 (0%) | 67 |
| | PEL | 0 (0%) | 7 (8.1%) | 9 (10.5%) | 45 (52.3%) | 21 (24.4%) | 3 (3.5%) | 0 (0%) | 0 (0%) | 1 (1.2%) | 0 (0%) | 86 |
| | PUR | 4 (3.8%) | 18 (17.1%) | 31 (29.5%) | 34 (32.4%) | 11 (10.5%) | 4 (3.8%) | 1 (1%) | 0 (0%) | 2 (1.9%) | 0 (0%) | 105 |
| East Asian ancestry | CDX | 1 (1%) | 12 (12.1%) | 23 (23.2%) | 46 (46.5%) | 8 (8.1%) | 7 (7.1%) | 0 (0%) | 1 (1%) | 1 (1%) | 0 (0%) | 99 |
| | CHB | 1 (1%) | 3 (2.9%) | 11 (10.7%) | 30 (29.1%) | 20 (19.4%) | 21 (20.4%) | 7 (6.8%) | 5 (4.9%) | 4 (3.9%) | 1 (1%) | 103 |
| | CHS | 1 (0.9%) | 8 (7.4%) | 16 (14.8%) | 38 (35.2%) | 12 (11.1%) | 20 (18.5%) | 9 (8.3%) | 0 (0%) | 3 (2.8%) | 1 (0.9%) | 108 |
| | JPT | 0 (0%) | 4 (3.8%) | 12 (11.5%) | 40 (38.5%) | 11 (10.6%) | 19 (18.3%) | 7 (6.7%) | 10 (9.6%) | 0 (0%) | 1 (1%) | 104 |
| | KHV | 3 (3%) | 10 (9.9%) | 18 (17.8%) | 35 (34.7%) | 16 (15.8%) | 11 (10.9%) | 4 (4%) | 4 (4%) | 0 (0%) | 0 (0%) | 101 |
| European ancestry | CEU | 2 (2%) | 26 (26.3%) | 51 (51.5%) | 16 (16.2%) | 4 (4%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 99 |
| | FIN | 2 (2%) | 23 (23.2%) | 36 (36.4%) | 33 (33.3%) | 5 (5.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 99 |
| | GBR | 1 (1.1%) | 21 (22.8%) | 50 (54.3%) | 16 (17.4%) | 4 (4.3%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 92 |
| | IBS | 3 (2.8%) | 33 (30.8%) | 45 (42.1%) | 22 (20.6%) | 4 (3.7%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 107 |
| | TSI | 7 (6.5%) | 34 (31.5%) | 38 (35.2%) | 23 (21.3%) | 5 (4.6%) | 1 (0.9%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 108 |
| South Asian ancestry | BEB | 2 (2.3%) | 9 (10.5%) | 35 (40.7%) | 24 (27.9%) | 10 (11.6%) | 3 (3.5%) | 2 (2.3%) | 1 (1.2%) | 0 (0%) | 0 (0%) | 86 |
| | GIH | 2 (1.9%) | 11 (10.4%) | 51 (48.1%) | 35 (33%) | 6 (5.7%) | 1 (0.9%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 106 |
| | ITU | 3 (2.9%) | 16 (15.5%) | 43 (41.7%) | 30 (29.1%) | 8 (7.8%) | 3 (2.9%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 103 |
| | PJL | 4 (4.2%) | 12 (12.5%) | 39 (40.6%) | 30 (31.2%) | 6 (6.2%) | 5 (5.2%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 96 |
| | STU | 1 (1%) | 18 (17.5%) | 40 (38.8%) | 32 (31.1%) | 7 (6.8%) | 4 (3.9%) | 1 (1%) | 0 (0%) | 0 (0%) | 0 (0%) | 103 |
European ancestry: Northern and Western European ancestry residing in Utah, USA (CEU), Toscani in Italia (TSI), British from England and Scotland (GBR), Finnish from Finland (FIN), Iberian populations in Spain (IBS); African ancestry: African Ancestry in Southwest US (ASW), African Caribbean in Barbados (ACB), Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK), Gambian in Western Division, The Gambia (GWD), Mende in Sierra Leone (MSL), Esan in Nigeria (ESN); East Asian ancestry: Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT), Han Chinese South (CHS), Kinh in Ho Chi Minh City, Vietnam (KHV), Chinese Dai in Xishuangbanna (CDX); Americas: Mexican Ancestry in Los Angeles (MXL), Puerto Rican in Puerto Rico (PUR), Colombian in Medellin, Colombia (CLM), Peruvian in Lima, Peru (PEL); South Asian ancestry: Gujarati Indian in Houston, TX (GIH), Bengali in Bangladesh (BEB), Indian Telugu in the UK (ITU), Punjabi in Lahore (PJL), Sri Lankan Tamil in the UK (STU).
| African ancestry | ACB | 3 (3.1%) | 11 (11.5%) | 28 (29.2%) | 27 (28.1%) | 15 (15.6%) | 8 (8.3%) | 3 (3.1%) | 1 (1%) | 96 |
| | ASW | 0 (0%) | 8 (12.1%) | 20 (30.3%) | 17 (25.8%) | 13 (19.7%) | 4 (6.1%) | 3 (4.5%) | 1 (1.5%) | 66 |
| | ESN | 0 (0%) | 19 (19.2%) | 25 (25.3%) | 24 (24.2%) | 17 (17.2%) | 10 (10.1%) | 3 (3%) | 1 (1%) | 99 |
| | GWD | 0 (0%) | 9 (8%) | 30 (26.5%) | 39 (34.5%) | 19 (16.8%) | 12 (10.6%) | 3 (2.7%) | 1 (0.9%) | 113 |
| | LWK | 0 (0%) | 15 (14.9%) | 28 (27.7%) | 21 (20.8%) | 19 (18.8%) | 15 (14.9%) | 3 (3%) | 0 (0%) | 101 |
| | MSL | 1 (1.2%) | 6 (7.1%) | 28 (32.9%) | 30 (35.3%) | 13 (15.3%) | 3 (3.5%) | 2 (2.4%) | 2 (2.4%) | 85 |
| | YRI | 2 (1.8%) | 14 (12.8%) | 34 (31.2%) | 29 (26.6%) | 17 (15.6%) | 7 (6.4%) | 4 (3.7%) | 2 (1.8%) | 109 |
| Americas | CLM | 2 (2.1%) | 15 (16%) | 46 (48.9%) | 24 (25.5%) | 5 (5.3%) | 2 (2.1%) | 0 (0%) | 0 (0%) | 94 |
| | MXL | 1 (1.5%) | 10 (14.9%) | 26 (38.8%) | 23 (34.3%) | 4 (6%) | 2 (3%) | 1 (1.5%) | 0 (0%) | 67 |
| | PEL | 5 (5.8%) | 16 (18.6%) | 35 (40.7%) | 22 (25.6%) | 6 (7%) | 2 (2.3%) | 0 (0%) | 0 (0%) | 86 |
| | PUR | 3 (2.9%) | 24 (22.9%) | 41 (39%) | 26 (24.8%) | 8 (7.6%) | 0 (0%) | 3 (2.9%) | 0 (0%) | 105 |
| East Asian ancestry | CDX | 2 (2%) | 17 (17.2%) | 38 (38.4%) | 29 (29.3%) | 10 (10.1%) | 1 (1%) | 1 (1%) | 1 (1%) | 99 |
| | CHB | 3 (2.9%) | 22 (21.4%) | 45 (43.7%) | 21 (20.4%) | 11 (10.7%) | 1 (1%) | 0 (0%) | 0 (0%) | 103 |
| | CHS | 3 (2.8%) | 21 (19.4%) | 45 (41.7%) | 26 (24.1%) | 9 (8.3%) | 3 (2.8%) | 1 (0.9%) | 0 (0%) | 108 |
| | JPT | 5 (4.8%) | 18 (17.3%) | 43 (41.3%) | 24 (23.1%) | 9 (8.7%) | 4 (3.8%) | 1 (1%) | 0 (0%) | 104 |
| | KHV | 1 (1%) | 21 (20.8%) | 35 (34.7%) | 28 (27.7%) | 10 (9.9%) | 4 (4%) | 1 (1%) | 1 (1%) | 101 |
| European ancestry | CEU | 3 (3%) | 9 (9.1%) | 49 (49.5%) | 24 (24.2%) | 10 (10.1%) | 2 (2%) | 2 (2%) | 0 (0%) | 99 |
| | FIN | 1 (1%) | 11 (11.1%) | 46 (46.5%) | 30 (30.3%) | 10 (10.1%) | 1 (1%) | 0 (0%) | 0 (0%) | 99 |
| | GBR | 1 (1.1%) | 9 (9.8%) | 38 (41.3%) | 30 (32.6%) | 11 (12%) | 3 (3.3%) | 0 (0%) | 0 (0%) | 92 |
| | IBS | 6 (5.6%) | 19 (17.8%) | 41 (38.3%) | 31 (29%) | 8 (7.5%) | 2 (1.9%) | 0 (0%) | 0 (0%) | 107 |
| | TSI | 3 (2.8%) | 22 (20.4%) | 47 (43.5%) | 22 (20.4%) | 11 (10.2%) | 3 (2.8%) | 0 (0%) | 0 (0%) | 108 |
| South Asian ancestry | BEB | 5 (5.8%) | 14 (16.3%) | 44 (51.2%) | 11 (12.8%) | 11 (12.8%) | 1 (1.2%) | 0 (0%) | 0 (0%) | 86 |
| | GIH | 4 (3.8%) | 15 (14.2%) | 44 (41.5%) | 27 (25.5%) | 12 (11.3%) | 4 (3.8%) | 0 (0%) | 0 (0%) | 106 |
| | ITU | 4 (3.9%) | 24 (23.3%) | 41 (39.8%) | 18 (17.5%) | 14 (13.6%) | 2 (1.9%) | 0 (0%) | 0 (0%) | 103 |
| | PJL | 3 (3.1%) | 17 (17.7%) | 37 (38.5%) | 23 (24%) | 13 (13.5%) | 2 (2.1%) | 1 (1%) | 0 (0%) | 96 |
| | STU | 3 (2.9%) | 18 (17.5%) | 51 (49.5%) | 22 (21.4%) | 8 (7.8%) | 1 (1%) | 0 (0%) | 0 (0%) | 103 |
Refer to Table 1 legend for key to population abbreviations.
Figure 1. A schematic diagram of the pipeline. CNVrd2 is a modified version of CNVrd. CNVrd2 is identical to CNVrd at the counting, transforming, standardizing and segmenting steps (A,B,G: black text). However, CNVrd2 has additional steps: identification of polymorphic regions (C), merging sub-regions inside genes/regions being measured and testing boundary regions (D), using a simple linear regression model to adjust segmentation scores between populations (E) and a Bayesian normal mixture to cluster segmentation scores of highly CN variable regions into different groups (F). These new steps are in blue text.
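The counting, transforming and standardizing steps named in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the CNVrd2 code (which is an R package): the function names, the median-based transformation and the per-window z-score are assumptions chosen only to show the general shape of a read depth pipeline, and the segmentation step is not covered.

```python
from statistics import mean, median, pstdev


def count_reads(read_starts, region_start, region_end, window):
    """Count reads whose start position falls in each fixed-size window."""
    n_windows = (region_end - region_start) // window
    counts = [0] * n_windows
    for pos in read_starts:
        idx = (pos - region_start) // window
        if 0 <= idx < n_windows:  # ignore reads outside the region
            counts[idx] += 1
    return counts


def transform(counts):
    """Scale one sample's window counts by its median and centre at 0.

    Assumption: a median-ratio transform; the published method may differ.
    """
    m = median(counts)
    if m == 0:
        raise ValueError("median window count is zero")
    return [c / m - 1.0 for c in counts]


def standardize(samples):
    """Z-score each window across samples (rows = samples, columns = windows)."""
    n_windows = len(samples[0])
    out = [[0.0] * n_windows for _ in samples]
    for j in range(n_windows):
        column = [row[j] for row in samples]
        mu, sd = mean(column), pstdev(column)
        for i, value in enumerate(column):
            out[i][j] = (value - mu) / sd if sd > 0 else 0.0
    return out
```

A per-window standardization of this kind puts windows with different baseline depth on a common scale before any segmentation is attempted.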
Figure 2. Linear relationship between the segmentation scores called for single populations and for all populations at . SSP, the segmentation scores of a single large population; SSP(P), the segmentation score of pooled populations.
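The linear relationship in Figure 2 suggests how a simple regression could put pooled scores on the scale of a single large population. The sketch below fits an ordinary least-squares line and applies it. The direction of the fit (pooled score as predictor, single-population score as response) and the function names are assumptions made for illustration, not a description of the package's internals.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def adjust_scores(pooled_scores, intercept, slope):
    """Map pooled segmentation scores onto the single-population scale."""
    return [intercept + slope * s for s in pooled_scores]


# Hypothetical scores for samples measured both ways (illustrative values only).
ssp_pooled = [0.0, 1.0, 2.0]  # SSP(P): scores from the pooled analysis
ssp_single = [1.0, 3.0, 5.0]  # SSP: scores from the single large population
intercept, slope = fit_line(ssp_pooled, ssp_single)
adjusted = adjust_scores([3.0], intercept, slope)
```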
Eight loci which were obtained from the intersection of results of Conrad et al. (.
| chr7:141769627-141793931 | 7:141000000-142000000 | 258 | 0 (3; 1.2) 1 (44; 17.1) 2 (208; 80.6) 3 (3; 1.2) |
| chr3:162514938-162619146 | 3:162000000-163000000 | 136 | 0 (93; 68.4) 1 (28; 20.6) 2 (15; 11.0) |
| chr17:44212815-44270230 | 17:43500000-44500000 | 252 | 2 (195; 77.4) 3 (38; 15.1) 4 (19; 7.5) |
| chr1:110222301-110242933 | 1:109500000-110500000 | 251 | 2 (106; 42.2) 3 (102; 40.6) 4 (43; 17.1) |
| chr17:44212815-44270230 | 17:43500000-44500000 | 252 | 2 (195; 77.4) 3 (38; 15.1) 4 (19; 7.5) |
| chr2:79331533-79339762 | 2:79000000-80000000 | 231 | 2 (218; 94.4) 3 (13; 5.6) |
| chr16:72109587-72112297 | 16:71500000-72500000 | 251 | 2 (226; 90.0) 3 (25; 10.0) |
| chr3:26434104-26439360 | 3:26000000-27000000 | 259 | 0 (3; 1.2) 1 (14; 5.4) 2 (240; 92.7) 3 (2; 0.8) |
These samples result from the intersection of the samples of Conrad et al. and Campbell et al. (Conrad et al., .
In parentheses are the number and percentage of samples with the specified CN, as obtained from Conrad et al. (.
Figure 3. Comparison of copy number assignments from high-throughput sequencing-based methods with those from PRT-based methods. (A) CCL3L1 on 180 samples [only 111 samples measured by Sudmant et al. (2010) overlapped]. (B) DEFB103A on 104 samples.
Figure 4. Read lengths and mapping qualities (top), mapping qualities (middle) and average read depth (bottom). Data for the 2 Mb CCL3L1 region are on the left and the 2 Mb DEFB103A region on the right.
Figure 5. Concordance between CNVrd2 and microarray-based results. The x-axis shows the values of testThreshold2Merge (0.15, 0.25, 0.35, 0.45). The y-axis shows the percentage of identical results.
Figure 6. Concordance between CNVrd and microarray-based results. Window sizes are shown on the x-axis, while the y-axis shows the percentage of identical results.
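The "percentage of identical results" plotted in Figures 5 and 6 can be computed as a simple concordance over the samples typed by both methods. The following Python sketch assumes each method's calls are held in a dictionary keyed by sample ID; the sample IDs and copy numbers shown are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Percentage of shared samples with identical copy number calls."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("no samples in common")
    identical = sum(calls_a[s] == calls_b[s] for s in shared)
    return 100.0 * identical / len(shared)


# Hypothetical calls: S4 and S5 are typed by only one method and are ignored.
read_depth_calls = {"S1": 2, "S2": 3, "S3": 1, "S4": 2}
reference_calls = {"S1": 2, "S2": 2, "S3": 1, "S5": 4}
pct = concordance(read_depth_calls, reference_calls)  # 2 of 3 shared samples agree
```

Restricting the comparison to shared samples matters here because the reference data sets cover only a subset of the sequenced samples.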
Figure 7. Plots of polymorphic regions encompassing . The plots show standard deviation (top) and different percentiles (bottom) across 2 Mb sub-regions (for all 2535 samples).
Figure 8. Correlation between segmentation scores of .
Figure 9. Segmentation scores and CN groups at the CCL3L1 and DEFB103A loci. Segmentation scores of all populations (bottom). Small populations (European for CCL3L1 and South Asian for DEFB103A) are in red. The top panels show segmentation scores and their CN groups.
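Grouping segmentation scores into CN classes, as shown in Figure 9, can be illustrated with a one-dimensional normal mixture. CNVrd2 uses a Bayesian normal mixture model; the sketch below substitutes a plain maximum-likelihood EM fit because it is short and dependency-free, so it should be read as an analogy rather than the published method. The initialization scheme, the `min_sd` floor and the function names are all assumptions.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(scores, k, n_iter=200, min_sd=1e-3):
    """Fit a k-component 1-D normal mixture by EM (not a Bayesian fit)."""
    xs = sorted(scores)
    n = len(xs)
    # Deterministic start: means at evenly spaced order statistics.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    sds = [max((xs[-1] - xs[0]) / (2.0 * k), min_sd)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: update weight, mean and spread of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, scores)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, scores)) / nj
            sds[j] = max(math.sqrt(var), min_sd)
    return weights, means, sds


def assign_groups(scores, weights, means, sds):
    """Assign each score to the component with the highest weighted density."""
    groups = []
    for x in scores:
        p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        groups.append(p.index(max(p)))
    return groups


# Hypothetical segmentation scores forming three well-separated clusters.
scores = [-1.05, -0.95, -1.0, 0.02, -0.02, 0.0, 0.98, 1.0, 1.02]
weights, means, sds = fit_mixture(scores, k=3)
groups = assign_groups(scores, weights, means, sds)
```

The component index is only a cluster label; mapping clusters to actual copy numbers requires external information, such as a population in which the modal CN is known.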
Figure 10. Observed read-count ratios of samples at the .
Read counts at the .
| HG00240 | 7.3 | 1 | 75 | GBR |
| HG00290 | 7.4 | 0 | 36 | FIN |
| HG00336 | 3.4 | 0 | 35 | FIN |
| HG00410 | 10.1 | 0 | 69 | CHS |
| HG00553 | 3.3 | 0 | 29 | PUR |
| HG01112 | 4.8 | 0 | 37 | CLM |
| HG01204 | 4.8 | 0 | 109 | PUR |
| HG01260 | 7.2 | 1 | 40 | CLM |
| HG01280 | 8.1 | 0 | 23 | CLM |
| HG01286 | 4.6 | 4 | 68 | PUR |
| HG01302 | 5.4 | 0 | 23 | PUR |
| HG01474 | 5.1 | 0 | 40 | CLM |
| HG01489 | 5.2 | 0 | 58 | CLM |
| HG01504 | 5.3 | 0 | 42 | IBS |
| HG01506 | 6.4 | 0 | 30 | IBS |
| HG01550 | 4.7 | 0 | 38 | CLM |
| HG01767 | 8.3 | 0 | 42 | IBS |
| HG01864 | 8.4 | 0 | 64 | KHV |
| HG01873 | 20.2 | 0 | 103 | KHV |
| HG02122 | 6.2 | 0 | 58 | KHV |
| HG02385 | 4.6 | 0 | 19 | CDX |
| HG02604 | 7.1 | 2 | 78 | PJL |
| HG02648 | 6.9 | 0 | 37 | PJL |
| HG02652 | 6.9 | 0 | 45 | PJL |
| HG02658 | 8.8 | 3 | 127 | PJL |
| HG03589 | 5.8 | 0 | 48 | BEB |
| HG03673 | 8.6 | 2 | 68 | STU |
| HG03968 | 5.4 | 2 | 34 | ITU |
| HG04019 | 5.6 | 0 | 39 | ITU |
| HG04062 | 4.0 | 0 | 21 | ITU |
| HG04156 | 5.4 | 0 | 32 | BEB |
| NA07056 | 5.8 | 0 | 54 | CEU |
| NA11831 | 5.8 | 1 | 67 | CEU |
| NA18574 | 4.7 | 0 | 66 | CHB |
| NA20507 | 6.1 | 0 | 49 | TSI |
| NA20540 | 3.9 | 0 | 67 | TSI |
| NA20589 | 7.0 | 0 | 41 | TSI |
| NA20754 | 8.2 | 0 | 33 | TSI |
| NA20762 | 7.0 | 0 | 42 | TSI |
| NA20764 | 8.3 | 1 | 44 | TSI |
| NA20778 | 8.3 | 0 | 78 | TSI |
| NA20850 | 4.1 | 0 | 30 | GIH |
| NA20903 | 7.7 | 0 | 43 | GIH |
The average coverage is calculated for the 2 Mb region (chr17:33670000-35670000).
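As a rough guide to how an average coverage figure for a fixed region can be obtained, the Python sketch below divides the total number of aligned bases by the region length. It assumes every read is counted in full and treats the coordinates as a half-open interval; the read count and read length in the example are hypothetical and are not taken from the table.

```python
def average_coverage(read_lengths, region_start, region_end):
    """Mean depth: total aligned bases divided by the region length."""
    region_length = region_end - region_start
    if region_length <= 0:
        raise ValueError("region_end must be greater than region_start")
    return sum(read_lengths) / region_length


# Hypothetical example: 100,000 reads of 100 bp over the 2 Mb region
# chr17:33670000-35670000 give 10,000,000 bases / 2,000,000 bp.
cov = average_coverage([100] * 100_000, 33_670_000, 35_670_000)
```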
Figure 11. Correlation between segmentation scores of the DEFB103A gene-containing region and the enlarged region encompassing . On the right are histograms of segmentation scores of the enlarged region (top) and DEFB103B region (bottom).