Migle Gabrielaite1, Mathias Husted Torp1, Malthe Sebro Rasmussen1, Sergio Andreu-Sánchez1, Filipe Garrett Vieira1, Christina Bligaard Pedersen1,2, Savvas Kinalis1, Majbritt Busk Madsen1, Miyako Kodama1, Gül Sude Demircan1, Arman Simonyan1, Christina Westmose Yde1, Lars Rønn Olsen1,2, Rasmus L Marvig1, Olga Østrup1, Maria Rossing1,3, Finn Cilius Nielsen1, Ole Winther1,4,5, Frederik Otzen Bagger1,6,7.
Abstract
Copy-number variations (CNVs) have important clinical implications for several diseases and cancers. Relevant CNVs are hard to detect because common structural variations define large parts of the human genome. CNV calling from short-read sequencing would allow single-protocol full genomic profiling. We reviewed 50 popular CNV calling tools and included 11 tools for benchmarking in a reference cohort encompassing 39 whole genome sequencing (WGS) samples paired with the current clinical standard, SNP-array-based CNV calling. Additionally, for nine samples we performed whole exome sequencing (WES) to address the effect of sequencing protocol on CNV calling. Furthermore, we included the gold standard reference sample NA12878 and tested 12 samples with CNVs confirmed by multiplex ligation-dependent probe amplification (MLPA). Tool performance varied greatly in the number of called CNVs and in bias for CNV lengths. Some tools had near-perfect recall of CNVs from arrays for some samples, but poor precision. Several tools performed better on NA12878, which could be a result of overfitting. We suggest combining the best-performing tools that are also based on different methodologies: GATK gCNV, Lumpy, DELLY, and cn.MOPS. Reducing the total number of called variants could potentially be assisted by the use of background panels for filtering of frequently called variants.
Keywords: benchmark; bioinformatics; copy-number variation (CNV); structural variant; whole exome sequencing (WES); whole genome sequencing (WGS)
Year: 2021 PMID: 34944901 PMCID: PMC8699073 DOI: 10.3390/cancers13246283
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.639
Figure 1. Schematic visualization of different approaches for calling CNVs from NGS data. RD detects local differences in read depth, SR detects unmatched read pairs, RP detects decreased insert size or swapped read directions between read pairs, and AS performs de novo assembly to best explain the read distribution.
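The read-depth (RD) idea in Figure 1 can be illustrated with a toy sketch. This is not how any of the benchmarked tools is implemented: the function name `call_rd_windows`, the fixed windows, the median baseline, the pseudocount, and the `log2_threshold` of 0.4 are all assumptions chosen for illustration.

```python
import math
from statistics import median

def call_rd_windows(depths, log2_threshold=0.4):
    """Label each fixed-size window 'dup', 'del', or 'neutral' from its read depth.

    Hypothetical helper: real RD callers also correct for GC content,
    mappability, and batch effects, none of which is modeled here.
    """
    baseline = median(depths)  # assumed to reflect the normal copy-number depth
    labels = []
    for depth in depths:
        # The 0.5 pseudocount avoids log2(0) for windows without reads.
        ratio = math.log2((depth + 0.5) / (baseline + 0.5))
        if ratio >= log2_threshold:
            labels.append("dup")
        elif ratio <= -log2_threshold:
            labels.append("del")
        else:
            labels.append("neutral")
    return labels

# Made-up per-window mean depths for one sample.
depths = [30, 31, 29, 45, 46, 30, 15, 14, 30, 30]
labels = call_rd_windows(depths)
```

With these made-up depths, the two windows near 45x are labeled as duplications and the two near 15x as deletions; adjacent windows with the same label would then be joined into one candidate CNV.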
Figure 2. Overview of the methods each CNV calling tool applies, the input NGS data, the citation count from Google Scholar, and the latest available version of each tool as of March 2019. Tools in bold are included in the benchmark; the horizontal red line shows the citation-count cutoff.
Datasets used in this benchmark study.
| Name | Number of Samples | Whole Exome Sequencing | Whole Genome Sequencing | Reference Copy Number Variations |
|---|---|---|---|---|
| NA12878 | 1 | Yes | Yes | Haraksingh et al., 2017 |
| GB01-GB08 | 8 | Yes | Yes | CytoScan HD SNP-array |
| GB09-GB38 | 30 | No | Yes | CytoScan HD SNP-array |
| GB40-GB45 | 6 | No | Yes | MLPA |
| GB46-GB51 | 6 | Yes | No | MLPA |
Figure 3. (A) Number of duplications and deletions called by CNV calling tools in WES and WGS data for the NA12878 sample. (B) Number of CNVs called by all tools in WES and WGS data for the NA12878 sample, colored by length. (C) Box plots and scatter plots of recall and precision for the 11 CNV calling tools.
Figure 4. Recall and precision curves for the GB01-GB08 and NA12878 whole exome sequencing samples, and for the GB01-GB38 and NA12878 whole genome sequencing samples.
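Recall and precision against a reference call set can be computed along the following lines. The matching rule is an assumption: the text shown here does not state the overlap criterion used in the study, so this sketch counts a call as correct if it shares chromosome, CNV type, and at least one base with a reference CNV. The function names and the example coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, cnv_type) records match in chromosome
    and type and share at least one base (half-open coordinates)."""
    return a[0] == b[0] and a[3] == b[3] and a[1] < b[2] and b[1] < a[2]

def precision_recall(called, truth):
    """Precision: fraction of calls that hit a reference CNV.
    Recall: fraction of reference CNVs hit by at least one call."""
    correct_calls = sum(any(overlaps(c, t) for t in truth) for c in called)
    found_truth = sum(any(overlaps(t, c) for c in called) for t in truth)
    precision = correct_calls / len(called) if called else 0.0
    recall = found_truth / len(truth) if truth else 0.0
    return precision, recall

# Made-up reference CNVs and tool calls for one sample.
truth = [("chr1", 1000, 5000, "DEL"), ("chr2", 200, 900, "DUP")]
called = [
    ("chr1", 4000, 6000, "DEL"),    # overlaps the first reference CNV
    ("chr1", 10000, 12000, "DEL"),  # no reference CNV here
    ("chr2", 100, 300, "DEL"),      # overlaps in position but wrong type
    ("chr3", 0, 50, "DUP"),         # no reference CNV here
]
precision, recall = precision_recall(called, truth)
```

In this example one of four calls is correct and one of two reference CNVs is recovered, which mirrors the pattern described in the abstract where a tool can reach high recall while calling many unsupported variants. A stricter rule, such as a minimum reciprocal overlap, would lower both numbers.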
Figure 5. Heatmap showing all called CNVs across all samples (A,B) and the overlap of called CNVs with the true CNVs (C–E). (A) Whole genome sequencing (WGS; n = 407,671) and (B) whole exome sequencing (WES; n = 9944). Each row represents a tool, and a blue field denotes that the given CNV was called by the tool. All CNVs from each sample were merged across tools, such that any overlapping calls of either duplications or deletions were combined into one. The order of rows/columns for WES data and of rows for WGS data was determined using complete-linkage hierarchical clustering with Euclidean distance, while the order of columns for WGS data was determined using a combination of k-means and hierarchical clustering due to memory restrictions. Darker grey coloring (WGS only) indicates that the tool was not run for the sample which contained the CNV. (C) 2076 WGS-based and (D) 81 WES-based true CNVs in the NA12878 sample. The order of rows/columns was determined using complete-linkage hierarchical clustering with Euclidean distance. (E) CNV calling heatmap for 471 true CNVs at WGS level in 38 samples (GB01-GB38). The column dendrogram shows clustering to the level of 20 clusters to reduce complexity. The Quality annotation represents the probe median score from the CytoScan HD SNP-array, and Man.annot. indicates whether the CNV was independently manually confirmed. A positive quality score corresponds to duplications, and negative scores denote deletions. Darker grey coloring indicates that the tool was not run for the sample which contained the CNV. The order of rows/columns was determined using complete-linkage hierarchical clustering with Euclidean distance.
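The merging step described in the caption, where overlapping calls from different tools are combined into one CNV regardless of type, can be sketched as an interval merge that also records which tools contributed. This is a minimal sketch and not the authors' code; `merge_calls`, the half-open coordinates, and the tool names `toolA` to `toolC` are assumptions.

```python
def merge_calls(calls_by_tool):
    """Merge overlapping calls from all tools for one sample.

    calls_by_tool: {tool: [(chrom, start, end), ...]}; CNV type is ignored,
    following the caption. Returns the merged regions and a
    {tool: [0/1, ...]} matrix with one column per merged region.
    """
    tagged = sorted(
        (chrom, start, end, tool)
        for tool, calls in calls_by_tool.items()
        for chrom, start, end in calls
    )
    merged = []  # each entry: [chrom, start, end, set_of_tools]
    for chrom, start, end, tool in tagged:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start < last[2]:
            # Overlaps the current region: extend it and note the tool.
            last[2] = max(last[2], end)
            last[3].add(tool)
        else:
            merged.append([chrom, start, end, {tool}])
    regions = [(c, s, e) for c, s, e, _ in merged]
    matrix = {
        tool: [int(tool in tools) for _, _, _, tools in merged]
        for tool in calls_by_tool
    }
    return regions, matrix

# Made-up calls from three hypothetical tools.
calls = {
    "toolA": [("chr1", 100, 500), ("chr2", 50, 80)],
    "toolB": [("chr1", 400, 900)],
    "toolC": [("chr1", 850, 1000), ("chr1", 5000, 6000)],
}
regions, matrix = merge_calls(calls)
```

Each row of `matrix` corresponds to one heatmap row (a tool) and each column to one merged CNV. Note that chained overlaps are merged transitively: `toolA` and `toolC` do not overlap each other directly, yet both fall into the same region through `toolB`.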
Figure 6. (A) CNV calling heatmap for 7 tools and 107 true CNVs at whole exome sequencing level in 8 samples (GB01-GB08). The Quality annotation represents the probe median score from the CytoScan HD SNP-array, and Man.annot. indicates whether the CNV was independently manually confirmed. A positive quality score corresponds to duplications, and negative scores denote deletions. The order of rows/columns was determined using complete-linkage hierarchical clustering with Euclidean distance. (B) MLPA-confirmed CNV calling results for 11 CNV calling tools. GATK gCNV is labeled as GermlineCNVCaller.
Figure 7. Maximum memory used by each tool, in megabytes, and total CPU time, in hours, when processing NA12878 on 28-core machines with 128 GB RAM. Some tools can distribute tasks over nodes, and total RAM usage is reported as the total maximum.