| Literature DB >> 23275535 |
Jin P Szatkiewicz1, WeiBo Wang, Patrick F Sullivan, Wei Wang, Wei Sun.
Abstract
Structural variation is an important class of genetic variation in mammals. High-throughput sequencing (HTS) technologies promise to revolutionize copy-number variation (CNV) detection but present substantial analytic challenges. Converging evidence suggests that multiple types of CNV-informative data (e.g. read-depth, read-pair, split-read) need be considered, and that sophisticated methods are needed for more accurate CNV detection. We observed that various sources of experimental biases in HTS confound read-depth estimation, and note that bias correction has not been adequately addressed by existing methods. We present a novel read-depth-based method, GENSENG, which uses a hidden Markov model and negative binomial regression framework to identify regions of discrete copy-number changes while simultaneously accounting for the effects of multiple confounders. Based on extensive calibration using multiple HTS data sets, we conclude that our method outperforms existing read-depth-based CNV detection algorithms. The concept of simultaneous bias correction and CNV detection can serve as a basis for combining read-depth with other types of information such as read-pair or split-read in a single analysis. A user-friendly and computationally efficient implementation of our method is freely available.Entities:
Mesh:
Year: 2012 PMID: 23275535 PMCID: PMC3561969 DOI: 10.1093/nar/gks1363
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1.Relationship between read-depth and mappability in high-confidence CNVs (1). (a) The boxplot of read-depth from windows mapped to the 610 high-confidence deletions (red) and 261 high-confidence duplications (blue), suggesting a similar read-depth distribution between deletions and duplications and no power in detecting CNVs. (b) The boxplot of read-depth stratified by mappability classes, color-coded such that darker shades reflect higher mappability. The labels of the x-axis indicate the CNV class (DEL: deletions; DUP: duplications) and mappability class. For example, label (DEL: 0.2–0.3) indicates windows from the high-confidence deletions and with mappability score ranging from 0.2 to 0.3. Within each mappability class, duplications show higher mean read-depth than deletions, suggesting that correction for mappability improves the ability to detect CNVs. Furthermore, when mappability falls below 0.3, read-depth distribution becomes increasingly similar between deletions and duplications, suggesting that the ability to detect CNVs in those regions is limited. For example, for windows with mappability ranging from 0.2 to 0.3, ∼50% of windows in the duplication regions had read-depths equal to or lower than the average read-depth from compatible copy-normal regions; ∼20% of windows in the deletion regions had read-depths equal to or higher than the average read-depth from compatible copy-normal regions.
Figure 2.Overview of GENSENG inference framework. The required input contains two parts: the triplet data (read-depth, GC content and mappability score) and the initial parameter values. The input is passed to the GENSENG engine for parameter training based on the Baum–Welsh algorithm. To update the emission probability and the parameters for the negative binomial regression model, the weighted generalized linear model (GLM) fitting algorithm is applied iteratively, which uses the updated posterior probability of the copy-number state as the regression weights in each iteration. At the convergence of parameter training, GENSENG identifies the state with the largest posterior probability and assigns the associated copy number to the corresponding window. Finally, GENSENG outputs the coordinates of CNV segments and the confidence scores.
Figure 3.Example high-confidence CNVs predicted by GENSENG from the NA12878 HTS data. Each subfigure (a–e) has four panels from top to bottom, and the x-axis of each subfigure indicates genomic position in base pairs. In the first panel, the black dots on the y-axis indicate read-depth signal; red dashed lines are boundaries from GENSENG prediction; green solid lines are boundaries reported in the high-confidence CNV set (1); and grey lines are the median read-depth of the chromosome. The GC content and mappability of the region are plotted in the second and the third panels respectively. The fourth panel shows the locations of segmental duplication (purple, from the UCSC hg19 segmental duplication track) and repetitive DNAs (orange, from the UCSC hg19 repeatmask track). Shown here are a homozygous deletion (a), a heterozygous deletion (b), a simple and large duplication (c) and a complex duplication (d) that was predicted to be copy number 6+ and was right flanked by a large region with a median mappability of 0.2. Finally, (e) shows a duplication predicted to be copy number 4 from a noisy region with a median mappability of 0.58, illustrating good sensitivity for detecting duplications using simultaneous bias correction and copy-number inference.
Performance assessment based on simulated sequencing data for a chromosome
| Detection method | Post-detection CNV filter | Deletions | Duplications | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Total simulated true CNVs | Total predicted CNV calls | Sensitivitya (number of true CNVs detected) | FDRb (number of false prediction) | Number of simulated true CNVs | Total predicted CNV calls | Sensitivitya (number of true CNVs detected) | FDRb (number of false prediction) | ||
| CNVnator | None | 43 | 2171 | 0.94(40) | 0.98(2130) | 21 | 40 | 1.00(21) | 0.23(9) |
| GENSENG | None | 43 | 690 | 0.77(33) | 0.95(657) | 21 | 26 | 0.90(19) | 0.12(4) |
| CNVnator | q0-filterc | 43 | 1560 | 0.70(30) | 0.98(1530) | 21 | 19 | 0.57(12) | 0(0) |
| CNVnator | q03+(RDA+map)d | 43 | 48 | 0.53(23) | 0.52(25) | 21 | 18 | 0.52(11) | 0(0) |
| GENSENG | RDAe | 43 | 235 | 0.77(33) | 0.86(202) | 21 | 22 | 0.90(19) | 0(0) |
| GENSENG | (RDA+map)d | 43 | 32 | 0.67(29) | 0.09(3) | 21 | 16 | 0.67(14) | 0(0) |
Simulation study permits precise assessment of both sensitivity and specificity. A total of 64 true CNVs were implanted into the reference chromosome 1 (NCBI37) from which sequencing reads were simulated (see ‘Methods’ section). A true CNV is considered detected if it had >50% reciprocal overlap with the predicted CNVs.
aSensitivity is computed as the total true CNVs detected divided by the total true CNVs. Note that a single true duplication could be overlapped by >1 predicted calls. A false prediction is any predicted CNV call that does not have >50% reciprocal overlap with any true CNVs.
bFDR is computed as the total falsely predicted CNV calls divided by the total predicted CNV calls.
cCNVnator filter: the default q0 filters removes any predicted calls that have >50% reads with zero-valued MAPQ (i.e. reads with multiple mapping locations).
dOur stringent filter removes any predicted calls that have RDA values ranging between 0.5 and 1.25 or mappablility <0.3. As all simulated deletions were homozygous deletions, 0.5 was used as the lower threshold of RDA.
eOur RDA filter removes any predicted calls that have RDA values ranging between 0.5 and 1.25.
Performance assessment based on NA12878 HTS data
| Detection method | Post-detection CNV filter | Deletions | Duplications | ||||
|---|---|---|---|---|---|---|---|
| Total high-confidence CNVsa | Total predicted CNV calls (Mb spanned) | Sensitivityb (number of high-confidence CNVs detected) | Total high-confidence CNVsa | Total predicted CNV calls (Mbp spanned) | Sensitivityb (number of high-confidence CNVs detected) | ||
| CNVnator | q0 filterc | 610 | 4105(142.1) | 0.64(393) | 261 | 788(9.5) | 0.16(42) |
| GENSENG | RDAd | 610 | 5370(48.8) | 0.73(446) | 261 | 1577(68.1) | 0.17(45) |
| GENSENG | (RDA+map)e | 610 | 4087(11.7) | 0.7(427) | 261 | 984(38.5) | 0.15(39) |
aThe high-confidence CNV set (Mills et al. 2011) includes 610 deletions and 261 duplications, but it does not provide information on the true negatives needed to assess specificity. Thus, we focused on calibrating sensitivity and used the total number and total base pairs of the predicted CNV calls as surrogate measure of specificity. A high-confidence CNV is considered detected if it had >50% reciprocal overlap with the predicted CNVs.
bSensitivity is computed as the total detected high-confidence CNVs divided by the total high-confidence CNVs.
cCNVnator filter: the default q0 filters removes any predicted calls that have >50% reads with zero-valued MAPQ (i.e. reads with multiple mapping locations).
dGENSENG filter removes any predicted calls that have RDA values ranging between 0.75 and 1.25.
eStringent GENSENG filter removes any predicted calls that have RDA values ranging between 0.75 and 1.25 or mappablility <0.3. As deletions were either homozygous or heterozygous, 0.75 was used as the lower threshold of RDA.