Dayne L Filer, Fengshen Kuo, Alicia T Brandt, Christian R Tilley, Piotr A Mieczkowski, Jonathan S Berg, Kimberly Robasky, Yun Li, Chris Bizon, Jeffery L Tilson, Bradford C Powell, Darius M Bost, Clark D Jeffries, Kirk C Wilhelmsen.
Abstract
BACKGROUND: As exome sequencing (ES) integrates into clinical practice, we should make every effort to utilize all information generated. Copy-number variation can lead to Mendelian disorders, but small copy-number variants (CNVs) often get overlooked or obscured by under-powered data collection. Many groups have developed methodology for detecting CNVs from ES, but existing methods often perform poorly for small CNVs and rely on large numbers of samples not always available to clinical laboratories. Furthermore, methods often rely on Bayesian approaches requiring user-defined priors in the setting of insufficient prior knowledge. This report first demonstrates the benefit of multiplexed exome capture (pooling samples prior to capture), then presents a novel detection algorithm, mcCNV ("multiplexed capture CNV"), built around multiplexed capture.
Keywords: Capture; Copy number variation; Exome sequencing
Year: 2021 PMID: 34284719 PMCID: PMC8293537 DOI: 10.1186/s12859-021-04246-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary of whole-exome sequencing. ‘pool’ indicates the name of the pool of samples; ‘capture’ indicates the capture platform for the pool; ‘N’ gives the number of samples in the pool; ‘medExon’ gives the pool median of the subject median mapped molecule count per exon; ‘medTotal’ gives the median by pool of total mapped molecule counts per subject; ‘minTotal’ and ‘maxTotal’ give the minimum and maximum total mapped molecules; ‘rsdTotal’ gives the relative standard deviation (SD/mean*100) of total mapped molecules
| Pool | Capture | N | medExon | medTotal | minTotal | maxTotal | rsdTotal |
|---|---|---|---|---|---|---|---|
| IDT-IC^a | IDT | 16 | 143 | 55,149,058 | 37,453,015 | 85,138,915 | 22.4 |
| IDT-MC | IDT | 16 | 93 | 29,772,684 | 16,674,468 | 118,147,912 | 64.2 |
| IDT-RR | IDT | 16 | 272 | 79,079,629 | 61,289,322 | 120,147,888 | 22.9 |
| NCGENES^a | Agilent | 112 | 93 | 24,451,245 | 12,749,793 | 68,565,471 | 27.6 |
| Pool1 | Agilent | 16 | 56 | 13,265,614 | 8,911,132 | 17,324,903 | 18.5 |
| Pool2 | Agilent | 16 | 86 | 21,076,056 | 4,585,195 | 27,846,146 | 27.6 |
| SMA1 | Agilent | 8 | 56 | 12,256,002 | 11,051,840 | 13,600,697 | 6.2 |
| SMA2 | Agilent | 8 | 25 | 5,622,040 | 4,904,000 | 6,545,360 | 10.4 |
| WGS | Agilent | 16 | 196 | 46,406,224 | 36,496,097 | 65,200,410 | 16.4 |
^a Indicates that captures were performed independently on each sample within the pool; otherwise, captures were multiplexed across all samples within the pool
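The 'rsdTotal' column follows the formula given in the table caption (SD/mean*100). A minimal sketch of that calculation is shown below. The counts are hypothetical and not taken from the table, and the use of the sample (n − 1) standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev

def relative_sd(totals):
    """Relative standard deviation in percent: SD / mean * 100.

    Assumption: sample (n - 1) standard deviation; the source does not
    specify the estimator.
    """
    return stdev(totals) / mean(totals) * 100

# Hypothetical per-subject total mapped molecule counts (not from the paper).
totals = [10_000_000, 12_000_000, 11_000_000, 9_000_000]
rsd = relative_sd(totals)
print(f"rsdTotal = {rsd:.1f}")
```

A lower value indicates that total mapped molecules are spread more evenly across the subjects in a pool.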
Fig. 1 Multiplexed capture (MC) decreases variance with respect to independent captures (IC), as estimated by fitting the Dirichlet distribution. Total counts/sample given on the horizontal axis; mean α given on the vertical axis. α is inversely proportional to inter-sample variance. Each line/point represents a single pool. The point indicates the median total counts across the pool, with the range given by the line. Orange indicates a multiplexed capture; blue indicates independent captures. Triangles indicate pools using Agilent (AGL) capture; squares indicate Integrated DNA Technologies (IDT)
Fig. 2 Mean-variance relationship demonstrates less dispersion in multiplexed capture. a Agilent (AGL) capture pools; b Integrated DNA Technologies (IDT) capture pools. Mean counts per exon given on the horizontal axis; mean variance per exon given on the vertical axis. Contours show the distribution of points by pool. Dotted lines show the ordinary least squares regression fit. Orange indicates multiplexed capture pools; blue indicates independently captured pools. The dashed gray line represents the 1:1 relationship expected under a Poisson process. Lines above the plot show the density of mean values by pool; lines to the right of the plot show the density of variance values by pool
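The mean-variance comparison in this caption can be illustrated with a small sketch that computes a per-exon mean and variance across samples, fits an ordinary least squares line, and reports the variance-to-mean ratio (1 under a Poisson process). The function names and counts are hypothetical, and fitting on the raw (untransformed) scale is an assumption; the caption does not state the axis scaling or any normalization the authors applied.

```python
from statistics import mean, variance

def exon_mean_variance(counts):
    """Per-exon (mean, sample variance) across samples.

    counts: one row per exon, one column per sample.
    """
    return [(mean(row), variance(row)) for row in counts]

def ols_fit(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical counts for four exons across four samples (not from the paper).
counts = [
    [48, 52, 50, 50],
    [95, 105, 100, 100],
    [190, 210, 205, 195],
    [20, 30, 25, 25],
]
stats = exon_mean_variance(counts)
means = [m for m, _ in stats]
variances = [v for _, v in stats]
slope, intercept = ols_fit(means, variances)

# Variance-to-mean ratio: 1 on the Poisson 1:1 line, above 1 when overdispersed.
ratios = [v / m for m, v in stats]
print(slope, intercept, ratios)
```

Points lying above the 1:1 line (ratio greater than 1) correspond to exons whose counts vary more between samples than a Poisson process would predict.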
Fig. 3 ExomeDepth control selection. a median count per exon; b estimated phi parameter from ExomeDepth; c proportion of available samples selected as a control; d total number of controls selected. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indicate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms
Fig. 4 Algorithm performance comparing mcCNV and ExomeDepth. a–c mcCNV versus ExomeDepth with default parameters, 1/10,000 transition probability and 50 kb expected variant length. d–f mcCNV versus ExomeDepth with simulation-matched parameters, 1/1000 transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthews correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship
Number of CNV calls by subject and algorithm for the ‘WGS’ pool
| Subject | Total (MC) | Total (ED) | Total (WG) | Duplications (MC) | Duplications (ED) | Duplications (WG) | Deletions (MC) | Deletions (ED) | Deletions (WG) |
|---|---|---|---|---|---|---|---|---|---|
| NCG_00012 | 90 | 106 | 143 | 61 | 73 | 121 | 29 | 33 | 22 |
| NCG_00237 | 82 | 101 | 165 | 50 | 64 | 129 | 32 | 37 | 36 |
| NCG_00525 | 68 | 74 | 151 | 30 | 33 | 110 | 38 | 41 | 41 |
| NCG_00593 | 45 | 58 | 142 | 22 | 28 | 81 | 23 | 30 | 61 |
| NCG_00676 | 66 | 78 | 112 | 38 | 46 | 92 | 28 | 32 | 20 |
| NCG_00790 | 5156 | 2204 | 121 | 19 | 37 | 92 | 5137 | 2167 | 29 |
| NCG_00819 | 68 | 76 | 134 | 30 | 41 | 100 | 38 | 35 | 34 |
| NCG_00840 | 78 | 92 | 157 | 44 | 52 | 115 | 34 | 40 | 42 |
| NCG_00851 | 1151 | 859 | 141 | 28 | 51 | 102 | 1123 | 808 | 39 |
| NCG_00857 | 59 | 75 | 119 | 10 | 15 | 81 | 49 | 60 | 38 |
| NCG_00976 | 46 | 58 | 114 | 25 | 37 | 93 | 21 | 21 | 21 |
| NCG_01023 | 59 | 95 | 143 | 32 | 60 | 113 | 27 | 35 | 30 |
| NCG_01043 | 73 | 94 | 128 | 40 | 64 | 105 | 33 | 30 | 23 |
| NCG_01076 | 36 | 57 | 105 | 7 | 22 | 78 | 29 | 35 | 27 |
| NCG_01077 | 135 | 157 | 230 | 103 | 121 | 184 | 32 | 36 | 46 |
| NCG_01117 | 95 | 101 | 154 | 72 | 78 | 129 | 23 | 23 | 25 |
‘MC’ indicates the mcCNV algorithm; ‘ED’ indicates the ExomeDepth algorithm; ‘WG’ indicates the overlap of ERDS/cnvpytor calls from matched whole-genome sequencing. Exons with any overlap of the repetitive and low-complexity regions, as defined in the Trost et al. manuscript, were omitted from the analysis
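The exclusion rule in this note (drop any exon that overlaps a repetitive or low-complexity region at all) amounts to an interval-overlap filter. The sketch below is illustrative only: the function names, the half-open coordinate convention, and the coordinates are assumptions, and it does not reproduce the region definitions from the Trost et al. manuscript.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

def filter_exons(exons, excluded):
    """Keep exons that do not overlap any excluded region on the same chromosome.

    exons, excluded: lists of (chrom, start, end) tuples.
    """
    kept = []
    for chrom, start, end in exons:
        hit = any(
            c == chrom and overlaps(start, end, s, e)
            for c, s, e in excluded
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept

# Hypothetical coordinates (not from the paper).
exons = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
excluded = [("chr1", 150, 160), ("chr2", 200, 250)]
kept = filter_exons(exons, excluded)
print(kept)
```

This linear scan is adequate for a small example; for exome-scale inputs, a sorted sweep or an interval tree would be the more usual choice.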
mcCNV (MC)/ExomeDepth (ED) calls for ‘WGS’ pool (used as prediction) versus the ERDS/cnvpytor calls from matched genome sequencing (used as truth)
| Calls | Samples | Algorithm | MCC | TPR | FDR | PPV | BalAcc |
|---|---|---|---|---|---|---|---|
| DUP + DEL | Full | MC | 0.18 | 0.34 | 0.90 | 0.10 | 0.67 |
| DUP + DEL | Full | ED | 0.26 | 0.36 | 0.81 | 0.19 | 0.68 |
| DUP + DEL | Sub | MC | 0.49 | 0.34 | 0.31 | 0.69 | 0.67 |
| DUP + DEL | Sub | ED | 0.48 | 0.38 | 0.38 | 0.62 | 0.69 |
| DUP | Full | MC | 0.40 | 0.24 | 0.33 | 0.67 | 0.62 |
| DUP | Full | ED | 0.35 | 0.24 | 0.50 | 0.50 | 0.62 |
| DUP | Sub | MC | 0.40 | 0.25 | 0.33 | 0.67 | 0.62 |
| DUP | Sub | ED | 0.38 | 0.27 | 0.45 | 0.55 | 0.63 |
| DEL | Full | MC | 0.18 | 0.64 | 0.95 | 0.05 | 0.82 |
| DEL | Full | ED | 0.22 | 0.56 | 0.91 | 0.09 | 0.78 |
| DEL | Sub | MC | 0.68 | 0.66 | 0.29 | 0.71 | 0.83 |
| DEL | Sub | ED | 0.54 | 0.55 | 0.47 | 0.53 | 0.78 |
Calls are subdivided by duplications (DUP) and deletions (DEL). ‘Full’ gives performance across the full pool; ‘Sub’ gives the performance excluding the poorly correlated samples NCG_00790 and NCG_00851 (gray rows). ‘MCC’ is Matthews correlation coefficient, ‘TPR’ is true positive rate/sensitivity, ‘FDR’ is false discovery rate, ‘PPV’ is positive predictive value, ‘BalAcc’ is balanced accuracy. Exons with any overlap of the repetitive and low-complexity regions, as defined in the Trost et al. manuscript, were omitted from the analysis
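For readers less familiar with the metrics named in this note, the sketch below computes them from a confusion matrix using their standard definitions. The function name and the counts are hypothetical; the source does not state the unit at which calls were scored, so the inputs here are simply assumed to be counts of true/false positives and negatives with non-zero denominators.

```python
from math import sqrt

def call_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    tpr = tp / (tp + fn)          # true positive rate (sensitivity)
    tnr = tn / (tn + fp)          # true negative rate (specificity)
    ppv = tp / (tp + fp)          # positive predictive value
    fdr = fp / (tp + fp)          # false discovery rate, equal to 1 - PPV
    bal_acc = (tpr + tnr) / 2     # balanced accuracy
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                             # Matthews correlation coefficient
    return {"MCC": mcc, "TPR": tpr, "FDR": fdr, "PPV": ppv, "BalAcc": bal_acc}

# Hypothetical counts (not from the paper).
metrics = call_metrics(tp=40, fp=10, tn=900, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because FDR and PPV share a denominator, they always sum to 1, which matches every row of the table. When true negatives vastly outnumber true positives, as is typical for CNV calling, balanced accuracy can stay high even while FDR is poor, so MCC and FDR are the more informative columns to compare.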
Fig. 5 Copy number variant call concordance for the WGS pool, excluding subjects NCG_00790 and NCG_00851 due to poor correlation to the rest of the pool. a predicted duplications; b predicted deletions. mcCNV (MC) in grey; ExomeDepth (ED) in blue; ERDS/cnvpytor (WG) in orange. Values within overlaps give the number of variants