Prashanthi Dharanipragada, Sriharsha Vogeti, Nita Parekh.
Abstract
Discovery of copy number variations (CNVs), a major category of structural variations, has dramatically changed our understanding of differences between individuals and provides an alternate paradigm for the genetic basis of human diseases. CNVs include both copy gain and copy loss events, and their genome-wide detection is now possible using high-throughput, low-cost next-generation sequencing (NGS) methods. However, accurate detection of CNVs from NGS data is not straightforward because various systematic biases produce non-uniform read coverage. We have developed an integrated platform, iCopyDAV, to handle some of these issues in CNV detection in whole-genome NGS data. It has a modular framework comprising five major modules: data pre-treatment, segmentation, variant calling, annotation and visualization. An important feature of iCopyDAV is the functional annotation module, which enables the user to identify and prioritize CNVs encompassing various functional elements, genomic features and disease associations. Parallelization of the segmentation algorithms makes the iCopyDAV platform accessible even on a desktop. Here we show the effect of sequencing coverage, read length, bin size, data pre-treatment and segmentation approaches on accurate detection of the complete spectrum of CNVs. Performance of iCopyDAV is evaluated on both simulated and real data at different sequencing depths. It is an open-source integrated pipeline available at https://github.com/vogetihrsh/icopydav and as a Docker image at http://bioinf.iiit.ac.in/icopydav/.
Year: 2018 PMID: 29621297 PMCID: PMC5886540 DOI: 10.1371/journal.pone.0195334
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Some of the popular data pre-treatment and segmentation methods used in depth-of-coverage (DoC)-based approaches are listed.
| Step | Method | Tool |
|---|---|---|
| GC bias correction | Filter reads based on GC content score | Control-FREEC |
| GC bias correction | Matched-control based ratio | CNV-seq |
| GC bias correction | Loess regression | ReadDepth |
| GC bias correction | Median approach | RDXplorer |
| GC bias correction | Mean fragment count-based correction | GCcorrect |
| GC bias correction | Quantile normalization | GROM-RD |
| GC bias correction | Rolling median approach | CNVkit |
| Mappability bias correction | Ignore all multi-reads | BIC-seq2 |
| Mappability bias correction | Randomly assign reads | CNVnator |
| Mappability bias correction | Filter reads based on mappability cut-off threshold | Control-FREEC |
| Mappability bias correction | Normalize using mappability score | CNAseg |
| Segmentation | Circular binary segmentation (Top-down) | ReadDepth |
| Segmentation | Mean-shift (Top-down) | CNVnator |
| Segmentation | Event-wise testing (Bottom-up) | RDXplorer |
| Segmentation | Hidden Markov model (Bottom-up) | HMMcopy |
| Segmentation | Total variation minimization (Bottom-up) | CNV-TV |
| Segmentation | LASSO-regression (Top-down) | Control-FREEC |
| Segmentation | Shifting-level model (Bottom-up) | JointSLM |
| Segmentation | Convex hull-based model (Bottom-up) | CNV-CH |
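Two of the pre-treatment steps listed above, filtering by a mappability cut-off and the median approach to GC bias correction, can be sketched in a few lines. This is a minimal illustration and not the iCopyDAV implementation: the bin dictionaries, the function names `filter_by_mappability` and `median_gc_correct`, and the 1% GC grouping step are all assumptions made for the example. The correction shown is the commonly used form in which each bin's read count is scaled by the ratio of the overall median count to the median count of bins with similar GC content.

```python
from collections import defaultdict
from statistics import median


def filter_by_mappability(bins, threshold=0.8):
    """Keep only bins whose mappability score is at least `threshold`."""
    return [b for b in bins if b["mappability"] >= threshold]


def median_gc_correct(bins, gc_step=0.01):
    """Scale each bin count by overall_median / median(count of bins with similar GC).

    Bins are grouped by GC fraction in steps of `gc_step` (hypothetical choice).
    """
    overall = median(b["count"] for b in bins)

    # Collect read counts per GC group.
    by_gc = defaultdict(list)
    for b in bins:
        by_gc[round(b["gc"] / gc_step)].append(b["count"])
    gc_median = {k: median(v) for k, v in by_gc.items()}

    corrected = []
    for b in bins:
        m_gc = gc_median[round(b["gc"] / gc_step)]
        scale = overall / m_gc if m_gc > 0 else 0.0
        corrected.append({**b, "corrected": b["count"] * scale})
    return corrected


# Toy bins (illustrative values only): GC fraction, read count, mappability.
bins = [
    {"gc": 0.40, "count": 100, "mappability": 1.0},
    {"gc": 0.40, "count": 100, "mappability": 0.9},
    {"gc": 0.60, "count": 50, "mappability": 1.0},
    {"gc": 0.60, "count": 50, "mappability": 0.85},
    {"gc": 0.60, "count": 80, "mappability": 0.3},
]

kept = filter_by_mappability(bins, threshold=0.8)
result = median_gc_correct(kept)
for b in result:
    print(b["gc"], b["count"], b["corrected"])
```

In this toy input the GC-rich bins have systematically lower counts; after correction all retained bins have the same value, which is the intended effect when the only difference between bins is GC bias.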
Some of the popular annotation and visualization methods for CNVs are listed.
| Step | Tool | Features |
|---|---|---|
| Annotation | CNVannotator | Functional elements, Structural elements (SD), Known CNVs (DGV, dbVar), Clinical features (CNVD) |
| Annotation | DeAnnCNV | Functional elements, Known CNVs (dbVar), Clinical features (ClinVar) |
| Annotation | SG-ADVISER | Functional elements, Known CNVs (DGV, Scripps Wellderly Genome, 1000 Genomes Project), Clinical features (OMIM, ClinVar, HGMD) |
| Annotation | cnvScan | Functional elements, Known CNVs (DGV, 1000 Genomes Project), Clinical features (OMIM, ClinVar, DECIPHER, ExAC) |
| Annotation | ANNOVAR | Functional elements, Structural elements (SD) |
| Visualization | Control-FREEC | Complete chromosome view |
| Visualization | CNAnorm | Complete chromosome view |
| Visualization | ReadDepth | Complete chromosome view, GC bias correction |
| Visualization | CNView (GenVisR) | Complete chromosome view, specific coordinates along with the gene annotations from UCSC genome browser |
| Visualization | cnvCurator | Interactive visualizer, editing of CNV breakpoints |
| Visualization | CNVkit | Complete chromosome view, user-defined coordinates, across samples |
SD: Segmental duplications, DGV: Database of Genomic Variants, CNVD: Copy number variation in Disease, dbVar: Database of Genomic Structural Variants, HGMD: Human Gene Mutation Database, OMIM: Online Mendelian Inheritance in Man, DECIPHER: Database of Genomic Variation and Phenotype In Humans using Ensembl Resources, ExAC: Exome Aggregation Consortium
Fig 1. Flowchart of the iCopyDAV pipeline (input from user: ‘green’, computational steps: ‘yellow’, output: ‘blue’).
Structural and functional annotations provided by the annotate module in iCopyDAV.
| Annotation | Feature | Method/source | Database | Overlap | Priority |
|---|---|---|---|---|---|
| Functional | Protein-coding gene | RefSeq | UCSC genome browser | >1 bp | Medium |
| Functional | Gene elements (exon, UTR, start/stop) | Gencode | Gencode | >1 bp | Medium |
| Functional | Enhancers | Vista | UCSC genome browser | >1 bp | Medium |
| Functional | lincRNA | Cufflinks | UCSC genome browser | >1 bp | Medium |
| Functional | miRNA target sites | TargetScanS | UCSC genome browser | >1 bp | Medium |
| Clinical | Pathogenicity & disease | ClinVar | ClinVar | 50% with CNV | High |
| Clinical | OMIM disease ID | OMIM | OMIM | 50% with CNV | High |
| Clinical | Haploinsufficiency index | DECIPHER | DECIPHER | >1 bp | High |
| Clinical | Genic intolerance of rare CNV | ExAC | ExAC | >1 bp | High |
| Known | DGV Id | DGV | DGV | 50% with CNV | Low |
| Structural | Heterochromatin and telomeric regions | Cytoband | UCSC genome browser | 50% with CNV | NA |
| Structural | Segmental duplication | Whole-genome assembly comparison method (WGAC) | UCSC genome browser | 50% with CNV | NA |
| Structural | Tandem repeats | TRFinder | UCSC genome browser | 50% with CNV | NA |
| Structural | Interspersed repeats | RepeatMasker | UCSC genome browser | 50% with CNV | NA |
OMIM: Online Mendelian Inheritance in Man, DECIPHER: Database of Genomic Variation and Phenotype in Humans using Ensembl Resources, ExAC: Exome Aggregation Consortium, DGV: Database of Genomic Variants
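The two overlap criteria in the table (">1 bp" and "50% with CNV") can be illustrated with a small interval check. This is a sketch under stated assumptions rather than the annotate module's actual logic: the helper names `overlap_bp` and `meets_criterion` are hypothetical, coordinates are treated as half-open intervals, ">1 bp" is read literally as an overlap of more than one base, and "50% with CNV" is assumed to mean that at least half of the CNV's length is covered by the feature.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def meets_criterion(cnv, feature, criterion):
    """Check whether a (start, end) feature annotates a (start, end) CNV.

    criterion is one of the two labels used in the table:
      ">1 bp"        -> overlap of more than 1 base (literal reading)
      "50% with CNV" -> overlap covers >= 50% of the CNV length (assumed reading)
    """
    ov = overlap_bp(cnv[0], cnv[1], feature[0], feature[1])
    if criterion == ">1 bp":
        return ov > 1
    if criterion == "50% with CNV":
        return ov / (cnv[1] - cnv[0]) >= 0.5
    raise ValueError(f"unknown criterion: {criterion}")


# Illustrative coordinates only.
cnv = (1000, 2000)     # 1 kb CNV
gene = (1990, 5000)    # overlaps the CNV by 10 bp
segdup = (1400, 3000)  # overlaps the CNV by 600 bp

gene_hit = meets_criterion(cnv, gene, ">1 bp")
gene_half = meets_criterion(cnv, gene, "50% with CNV")
segdup_half = meets_criterion(cnv, segdup, "50% with CNV")
print(gene_hit, gene_half, segdup_half)
```

Under these assumptions, a gene touching only the edge of a CNV is still reported under the permissive ">1 bp" rule, while the stricter 50% rule requires the feature to cover most of the call.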
Fig 2. (a) F-score plots and (b) recall and precision plots as a function of sequencing depth. In (b), recall and precision values are depicted by ‘closed’ and ‘open’ symbols, respectively. Performance of TVM (‘circle’) and CBS (‘triangle’) is shown for the two CNV sets: small (≤ 1 Kb), shown as dashed lines, and large (> 1 Kb), shown as solid lines. Error bars represent the standard deviation in each group. Bin size = 50 bp.
Fig 3. Box plots representing breakpoint error in CNV detection as a function of sequencing coverage using (a) TVM and (b) CBS segmentation approaches in simulated data. Bin size = 50 bp.
Fig 4. Performance of TVM (‘circle’) and CBS (‘triangle’) approaches in predicting copy gain (‘closed’ symbol) and copy loss (‘open’ symbol) events in simulated data.
Dashed lines correspond to small CNVs (≤ 1 Kb) and solid lines to large CNVs (> 1 Kb). Bin size = 50 bp.
Fig 5. Performance (F-score) of iCopyDAV compared with three other DoC-based tools (using default parameters for data pre-treatment and segmentation approaches), shown as a function of sequencing coverage.
Error bars represent standard deviation in each group.
Fig 6. Size distribution of copy gain (solid) and copy loss (striped) events for various combinations of data pre-treatment and segmentation approaches in low sequence coverage data (6×) for Chr 1 of the NA12878 sample.
Bin size = 300 bp.
Fig 7. Performance of iCopyDAV for low sequence coverage data (6×) of Chr 1 of the NA12878 sample for mappability threshold values (a) 0.5 and (b) 0.8. Recall and precision values for various combinations of GC bias correction and segmentation algorithms are computed with respect to the six studies reported in DGV. Bin size = 300 bp.
Fig 8. Size distribution of copy gain (solid) and copy loss (striped) events for various combinations of data pre-treatment and segmentation approaches in high sequence coverage data (35×) for Chr 1 of the NA12878 sample.
Bin size = 300 bp.
Fig 9. Performance of iCopyDAV on high sequence coverage data (35×) of Chr 1 of the NA12878 sample for mappability threshold values (a) Mth = 0.5 and (b) Mth = 0.8. Recall and precision values for various combinations of GC bias correction and segmentation algorithms are computed independently for the six studies reported in DGV. Bin size = 300 bp.
Performance of various combinations of GC bias correction and segmentation approaches in detecting CNVs of different size (small/large) and type (copy gain/loss) is summarized for high sequence coverage data (Mth = 0.8).
| Method | F-score (Small) | F-score (Large) | F-score (Gain) | F-score (Loss) | Overall Recall | Overall Precision | Overall F-score |
|---|---|---|---|---|---|---|---|
| Loess + TVM | 0.26 | 0.59 | 0.34 | 0.47 | 0.34 | 0.59 | 0.43 |
| Loess + CBS | 0.39 | 0.17 | 0.25 | 0.40 | 0.26 | 0.49 | 0.34 |
| Median + TVM | 0.20 | 0.49 | 0.35 | 0.41 | 0.26 | 0.56 | 0.35 |
| Median + CBS | 0.42 | 0.57 | 0.33 | 0.58 | 0.50 | 0.46 | 0.48 |
| Median + (TVM + CBS) | 0.44 | 0.60 | 0.33 | 0.59 | 0.51 | 0.47 | 0.49 |
| Loess + (TVM + CBS) | 0.43 | 0.57 | 0.38 | 0.59 | 0.43 | 0.58 | 0.49 |
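The overall F-score column can be related to the recall and precision columns through the standard definition of the F-score as the harmonic mean of the two, F = 2PR / (P + R). The short sketch below recomputes it from the table values; the function name `f_score` and the data layout are illustrative choices, and small differences from the reported column are expected because the published recall and precision are themselves rounded to two decimals.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (F1); 0.0 if both are zero."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


# (method, recall, precision, reported overall F-score) taken from the table above.
rows = [
    ("Loess + TVM", 0.34, 0.59, 0.43),
    ("Loess + CBS", 0.26, 0.49, 0.34),
    ("Median + TVM", 0.26, 0.56, 0.35),
    ("Median + CBS", 0.50, 0.46, 0.48),
    ("Median + (TVM + CBS)", 0.51, 0.47, 0.49),
    ("Loess + (TVM + CBS)", 0.43, 0.58, 0.49),
]

recomputed = {name: f_score(r, p) for name, r, p, _ in rows}
for name, r, p, reported in rows:
    print(f"{name:22s} recomputed={recomputed[name]:.3f} reported={reported:.2f}")
```

Recomputing from the rounded inputs agrees with the reported column to within 0.01 for every row.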
Comparison of CNVs detected by iCopyDAV (combined approach, Median + (TVM + CBS)) with those detected by ReadDepth, Control-FREEC and CNVnator (Mth = 0.8, window size 300 bp).
| | iCopyDAV | ReadDepth | Control-FREEC | CNVnator |
|---|---|---|---|---|
| Total | 120 | 177 | 64 | 266 |
| Total* | 120 | 177 | 63 | 265 |
| Gain | 70 (0.4) | 108 (6.0) | 51 | 76 (3.2) |
| Loss | 50 (0.2) | 69 (0.4) | 12 (1.9) | 189 |
*Numbers after removing large CNV spanning centromere region in the chromosome.
Fig 10. Performance of iCopyDAV compared with three DoC-based methods, ReadDepth, Control-FREEC, and CNVnator (using default parameters), for high sequence coverage data (35×) of Chr 1 of the NA12878 sample.
Fig 11. Location of CNVs (gain in ‘red’, loss in ‘blue’) shown (a) along Chr 1 of the NA12878 sample and (b) at the locus 1q21.1 spanning the NBPF gene family, which is rich in segmental duplications.
Comparative analysis of various parameters affecting variant calling in TVM and CBS segmentation approaches.
| Parameter | TVM | CBS |
|---|---|---|
| Bin size | Small, no dependence on sequencing depth | Dependent on sequencing depth, requires larger bin sizes at low coverage |
| Breakpoint error | Small | Large, dependent on sequencing depth |
| Sequencing depth (6× → 35×) | Reduction in CNVs predicted on increasing sequencing depth (both gain and loss), No significant difference in gain events for any data pre-treatments | Large reduction in Gain events (from low to high coverage) |
| Mappability cutoff (0.5 → 0.8) | Loss (reduction), Gain (increase) in Low coverage, no significant difference in High coverage | Large reduction in Loss events (for No GC, Loess), No difference (Median) |
| Loess | Reduction in no. of CNVs predicted with Median, No Loss events (low coverage) | Large reduction in number of CNVs predicted (Loess), No difference (Median) |
| No GC | No difference in case of High coverage (Mth = 0.8), indicating no need for GC bias correction in this case | Large difference observed, clearly indicating the need for GC bias correction |