Celine S Hong, Larry N Singh, James C Mullikin, Leslie G Biesecker.
Abstract
BACKGROUND: Reproducibility is receiving increased attention across many domains of science, and genomics is no exception. Efforts to identify copy number variations (CNVs) from exome sequence (ES) data have been increasing. Many algorithms have been published to discover CNVs from exomes, and a major challenge is the reproducibility of their calls in other datasets. Here we test exome CNV calling reproducibility under three conditions: data generated by different sequencing centers; varying sample sizes; and varying capture methodology.
Keywords: CNV predictions; Copy number variations (CNV); Exomes; Reproducibility
Year: 2016 PMID: 27503473 PMCID: PMC4976506 DOI: 10.1186/s13073-016-0336-6
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Summary of methods used in CNV callers
| CNV caller | Pre-processing quality control | Approach to discovering CNVs | Published validation rate |
|---|---|---|---|
| CoNIFER | RPKM for each target (filter targets with median RPKM <1), ZRPKM, SVD-PCA transformation. Filter samples >0.5 SVD-ZRPKM | ±1.5 SVD-ZRPKM threshold values | 94 % PPV |
| CONTRA | Removes base coverage <10, library-size correction by removing linear dependency between log-coverage and log-ratio | Base-level log-ratios using adjusted coverage, followed by region-level log-ratios using the mean of base-level log-ratios | 86.8 % SPE, 95.4 % SEN |
| EXCAVATOR | Data correction using the medians of exon mean read count values with respect to GC content, mappability, and exon size. Log-transformed ratio, LOWESS scatter plot normalization | HMM to discover five states of CNVs (double loss, loss, neutral, gain, or multiple gain) | ~50 % PPV |
| XHMM | Filter extreme GC content (<0.1 or >0.9), low complexity (>10 %), target size (<10 bp or >10 kb), samples (mean RD <25 or >500), targets (mean RD <10 or >500). SVD-PCA normalization, remove K components = 0.7/n s | Z-score calculation as input for three-state HMM | 67–92 % SEN |
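The CoNIFER row above can be read as a short pipeline: compute RPKM per target, drop targets with median RPKM below 1, standardize to ZRPKM, remove the strongest singular components, and flag values beyond ±1.5. The following is a minimal sketch of that reading, not the CoNIFER implementation. The function names, the choice of median and standard deviation for standardization, and the number of components removed are all assumptions made for illustration.

```python
import numpy as np

def rpkm(read_counts, target_lengths_bp):
    """Reads per kilobase per million mapped reads.

    read_counts: array of shape (n_samples, n_targets).
    target_lengths_bp: array of shape (n_targets,).
    """
    counts = np.asarray(read_counts, dtype=float)
    lengths_kb = np.asarray(target_lengths_bp, dtype=float) / 1e3
    total_millions = counts.sum(axis=1, keepdims=True) / 1e6
    return counts / lengths_kb / total_millions

def zrpkm(rpkm_matrix, min_median_rpkm=1.0):
    """Drop low-coverage targets, then standardize each target across samples.

    Assumption: standardization uses the per-target median and standard
    deviation; targets with zero spread become NaN.
    """
    rpkm_matrix = np.asarray(rpkm_matrix, dtype=float)
    keep = np.median(rpkm_matrix, axis=0) >= min_median_rpkm
    kept = rpkm_matrix[:, keep]
    med = np.median(kept, axis=0)
    sd = kept.std(axis=0)
    sd = np.where(sd > 0, sd, np.nan)
    return (kept - med) / sd, keep

def remove_top_components(z, n_remove):
    """Zero the n_remove largest singular values and rebuild the matrix."""
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    s = s.copy()
    s[:n_remove] = 0.0
    return (u * s) @ vt

def flag_targets(svd_zrpkm, threshold=1.5):
    """Return +1 (gain), -1 (loss) or 0 per sample and target."""
    calls = np.zeros(svd_zrpkm.shape, dtype=int)
    calls[svd_zrpkm >= threshold] = 1
    calls[svd_zrpkm <= -threshold] = -1
    return calls
```

In practice a caller would also merge adjacent flagged targets into CNV segments; that step is left out here because the table does not describe it.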
Summary of CNV runs on different callers
| Dataset | XHMM | CoNIFER | CONTRA | EXCAVATOR |
|---|---|---|---|---|
| BI (167) | O<sup>a</sup> | O | –<sup>b</sup> | O |
| WUGSC (116) | O | O | – | O |
| BI GIH (48) | O | O | O | O |
| ClinSeq® (54) | O | O | O | O |
| Sample size analysis (ClinSeq® in triplicates by random sampling) | | | | |
| 10 | O | O | – | O |
| 30 | O | O | – | O |
| 75 | O | O | – | O |
| 100 | O | O | – | O |
| 300 | O | O | – | O |
| Capture kit analysis (48 samples from ClinSeq® in triplicates by random sampling) | | | | |
| SS HAE | O | O | – | O |
| SS ICGC | O | O | – | O |
| TruSeq v2 | O | O | – | O |
| Mix capture<sup>c</sup> | O | O | – | O |
BI, Broad Institute; WUGSC, Washington University Genome Sequencing Center; BI GIH, Broad Institute Gujarati Indians in Houston, Texas

<sup>a</sup> O denotes that a caller was run on a given dataset

<sup>b</sup> – denotes that a caller was not run on a given dataset

<sup>c</sup> Indicates data comprising 48 samples, 16 samples from each of the three capture kits
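The triplicate random sampling and the mixed capture set described in the table can be sketched as follows. This is only an illustration of the design. The function names, the seeds, and sampling without replacement within each replicate are assumptions, and the record does not say how the original draws were made.

```python
import random

def draw_replicates(sample_ids, size, n_replicates=3, seed=0):
    """Draw n_replicates random subsets of `size` samples each.

    Assumption: sampling is without replacement within a replicate,
    and replicates are drawn independently of one another.
    """
    rng = random.Random(seed)
    pool = list(sample_ids)
    return [sorted(rng.sample(pool, size)) for _ in range(n_replicates)]

def draw_mixed_capture(samples_by_kit, per_kit=16, seed=0):
    """Build one mixed set by taking `per_kit` samples from each capture kit."""
    rng = random.Random(seed)
    mixed = []
    for kit in sorted(samples_by_kit):
        mixed.extend(rng.sample(list(samples_by_kit[kit]), per_kit))
    return mixed

# Hypothetical sample identifiers, used only to show the call pattern.
pool = [f"S{i:03d}" for i in range(500)]
runs = {n: draw_replicates(pool, n) for n in (10, 30, 75, 100, 300)}
```

Three kits with 16 samples each give the 48-sample mixed set noted in the table footnote.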
Fig. 1. Examining the number of CNVs called, sizes, and correlations. a Boxplot of the average number of CNVs per sample across the datasets. b The significance of association between the exome study attributes (X-axis) and varying data input. Each row shows association values for the given caller (Y-axis). Reliability = PPV. c Boxplots of the predicted CNV median size distribution

Fig. 2. Examining CNV calls by type. a Percentages of deletion and duplication calls. b PPVs categorized by duplications and deletions. c PPVs of duplication calls. d PPVs of deletion calls. FdelR (Fdel), deletion calls that were not verified; FdupR (Fdup), duplication calls that were not verified; TDelR (Tdel), deletion calls that were verified; TdupR (Tdup), duplication calls that were verified

Fig. 3. PPV and SEN for triplicate runs for sample size and capture kit analysis. a PPVs for sample size analysis. Each dot represents a single run. b SEN for sample size analysis. c PPV for capture kit analysis. d SEN for capture kit analysis. The mean and the standard error for the triplicate runs are graphed for b–d. HAE, SureSelect HAE kit; ICGC, SureSelect ICGC capture kit; mix, simulated mixed capture kit data; TSV2, Illumina TruSeq v2 capture kit

Fig. 4. Boxplots of the number of exons spanning the CNV regions. X-axis, category for each boxplot; Y-axis, the number of exons spanning CNVs
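The PPV and SEN values summarized in Fig. 2 and Fig. 3 follow the standard definitions, and the triplicate runs are reported as a mean with a standard error. A minimal sketch of those calculations is given below. The function names are hypothetical, and the counts of verified and unverified calls are assumed to come from whatever validation set is used.

```python
import math
import statistics

def ppv(verified_calls, unverified_calls):
    """Positive predictive value: verified calls / all calls made."""
    total = verified_calls + unverified_calls
    return verified_calls / total if total else float("nan")

def sen(verified_calls, missed_true_cnvs):
    """Sensitivity: verified calls / all true CNVs in the reference set."""
    total = verified_calls + missed_true_cnvs
    return verified_calls / total if total else float("nan")

def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    values = list(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

For a triplicate, `mean_and_sem` would be called on the three per-run PPV or SEN values for one condition.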