CNVfilteR: an R/bioconductor package to identify false positives produced by germline NGS CNV detection tools.

José Marcos Moreno-Cabrera^1,2,3, Jesús Del Valle^2,3, Elisabeth Castellanos^1,4, Lidia Feliubadaló^2,3, Marta Pineda^2,3, Eduard Serra^1,3, Gabriel Capellá^2,3, Conxi Lázaro^2,3, Bernat Gel^1.

Abstract

Germline copy-number variants (CNVs) are relevant mutations for multiple genetics fields, such as the study of hereditary diseases. However, available benchmarks show that all next-generation sequencing (NGS) CNV calling tools produce false positives. We developed CNVfilteR, an R package that uses the single nucleotide variant calls usually obtained in germline NGS pipelines to identify those false positives. The package can detect both false deletions and false duplications. We evaluated CNVfilteR performance on callsets generated by 13 CNV calling tools on 3 whole-genome sequencing and 541 panel samples, showing a decrease of up to 44.8% in false positives and consistent F1-score increase. Using CNVfilteR to detect false-positive calls can improve the overall performance of existing CNV calling pipelines. AVAILABILITY: CNVfilteR is released under Artistic-2.0 License. Source code and documentation are freely available at Bioconductor (http://www.bioconductor.org/packages/CNVfilteR). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 33983414 PMCID: PMC9502136 DOI: 10.1093/bioinformatics/btab356

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

Copy-number variants (CNVs) are a type of structural variant that has been of interest in multiple genetic fields. In the research and diagnosis of hereditary diseases, where CNVs are relevant contributors (Zhang ), the analysis of germline CNVs plays a key role. Recent improvements in next-generation sequencing (NGS) have resulted in the release of several tools for germline CNV detection on whole-genome sequencing (WGS), whole-exome sequencing and panel data (Mason-Suares ; Roca ; Zhao ). Nevertheless, CNV detection in NGS is challenging because of technology-related factors such as short read lengths or GC-content bias (Teo ). Available benchmarks show that all germline CNV calling tools produce false positives (Kim ; Moreno-Cabrera ; Zhang ), frequently reaching high false discovery rates (FDRs). These false-positive calls impact downstream analysis. In a clinical setting, where an orthogonal method is required to validate a CNV, false-positive calls force laboratories to spend considerable effort on validation. A tool able to identify these false-positive calls could help in this regard. Most NGS CNV callers are based on one or more of these strategies: read-pair, split-read, read-depth and assembly-based (Pirooznia ). However, information from single-nucleotide variants (SNVs), usually available in NGS pipelines, is rarely used in CNV detection strategies, even though SNV allele frequency can provide evidence to confirm or discard CNV calls. Here, we present CNVfilteR, an R/Bioconductor package that uses SNVs to identify false positives in the output of CNV calling tools.

2 False-positive identification strategy

CNVfilteR uses two different strategies to identify false-positive CNV calls in diploid genomes. Heterozygous deletions are loss-of-heterozygosity regions and cannot overlap with heterozygous SNVs, since only one allele remains. If a heterozygous SNV is detected within a deleted region, either the SNV or the deletion is a false positive (Fig. 1a). To account for errors in SNV calling, a CNV deletion is identified as a false positive only if at least a given percentage (30% by default) of the SNVs overlapping that CNV are heterozygous. CNV duplications, on the other hand, are evaluated using a fuzzy-logic-inspired model which scores all heterozygous SNVs overlapping the CNV. If the duplication were a true positive, the expected allele frequency of heterozygous SNVs would be either 33% or 66% (one or two of the three copies carrying the variant), while it would be 50% if the duplication were a false positive (Fig. 1b). Therefore, each SNV is scored with a value between −1 and 1 depending on how close its allele frequency is to the nearest expected allele frequency (Fig. 1c). If the sum of the scores of all the SNVs in the CNV is greater than the duplication threshold value, the CNV duplication is identified as a false positive. Further details of the scoring model can be found in Supplementary File S1.
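The two rules described above can be illustrated with a minimal Python sketch. It is not the package's implementation: the actual scoring curve is defined in Supplementary File S1, so the linear `duplication_score` function below is a simplified stand-in, and the names `Snv`, `deletion_is_false_positive`, `duplication_is_false_positive` and the threshold value `0.5` are assumptions made for illustration only. Only the 30% default and the expected allele frequencies come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snv:
    """One SNV overlapping a CNV call (hypothetical container)."""
    allele_freq: float   # alternate-allele frequency in [0, 1]
    heterozygous: bool


def deletion_is_false_positive(snvs: List[Snv], min_het_fraction: float = 0.30) -> bool:
    """Flag a deletion when at least `min_het_fraction` of its SNVs are heterozygous."""
    if not snvs:
        return False  # no SNV evidence, so the call is left untouched
    het = sum(1 for s in snvs if s.heterozygous)
    return het / len(snvs) >= min_het_fraction


# Expected allele frequencies of heterozygous SNVs, as stated in the text.
EXPECTED_TRUE_DUP = (1 / 3, 2 / 3)   # duplication is real (three copies)
EXPECTED_FALSE_DUP = 0.5             # duplication is a false positive (two copies)


def duplication_score(allele_freq: float) -> float:
    """Toy linear score in [-1, 1]; an assumption, not the published model.

    +1 when the frequency sits at 50% (evidence against the duplication),
    -1 when it sits at 33% or 66% (evidence for the duplication).
    """
    d_false = abs(allele_freq - EXPECTED_FALSE_DUP)
    d_true = min(abs(allele_freq - e) for e in EXPECTED_TRUE_DUP)
    return (d_true - d_false) / (d_true + d_false)


def duplication_is_false_positive(snvs: List[Snv], threshold: float = 0.5) -> bool:
    """Sum the scores of heterozygous SNVs and compare with a threshold.

    The default threshold here is a placeholder chosen for the example.
    """
    total = sum(duplication_score(s.allele_freq) for s in snvs if s.heterozygous)
    return total > threshold


if __name__ == "__main__":
    balanced = [Snv(0.50, True), Snv(0.48, True), Snv(0.52, True)]
    skewed = [Snv(0.33, True), Snv(0.66, True), Snv(0.35, True)]
    print(duplication_is_false_positive(balanced))  # frequencies near 50%
    print(duplication_is_false_positive(skewed))    # frequencies near 33%/66%
```

Summing the scores lets SNVs that support the duplication offset SNVs that contradict it, so a single noisy allele frequency is unlikely to decide the outcome on its own.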

Fig. 1.

(A) CNV deletion example, adapted from CNVfilteR output. (B) CNV duplication example, adapted from CNVfilteR output. (C) Scoring model for CNV duplications, plotted by CNVfilteR. (D–F) F1-score differences before (light blue) and after (dark blue) removing the false-positive CNVs identified by CNVfilteR in the HuRef, AK1 and NA12878 WGS samples

3 Features

3.1 Input formats

VCF is the most common output format of SNV callers, and its interpretation is challenging because of the flexibility allowed by the format specification. CNVfilteR provides a function that automatically interprets VCFs produced by VarScan2, Strelka/Strelka2, freeBayes, HaplotypeCaller (GATK) and UnifiedGenotyper (GATK). Output from other tools can also be loaded if adequate parameters are provided.
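To make the interpretation problem concrete, the following Python sketch derives an allele frequency from a single VCF data line. It assumes that the sample column carries a `GT` field and an `AD` field laid out as "reference depth, alternate depth"; callers that encode depths or frequencies under other keys would need different handling, which is why caller-specific interpretation is needed. The function name `allele_frequency_from_vcf_line` is hypothetical and is not part of CNVfilteR.

```python
from typing import Optional, Tuple


def allele_frequency_from_vcf_line(
    line: str, sample_index: int = 0
) -> Optional[Tuple[str, int, float, bool]]:
    """Return (chrom, pos, alt allele frequency, is_heterozygous), or None.

    Assumes FORMAT contains GT and AD, with AD given as "ref,alt[,...]".
    Only the first alternate allele is considered.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10 + sample_index:
        return None  # no FORMAT/sample columns to read
    chrom, pos = fields[0], int(fields[1])

    # Pair FORMAT keys with the values of the chosen sample column.
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    sample = dict(zip(keys, values))

    # Treat phased ("|") and unphased ("/") genotypes alike.
    genotype = sample.get("GT", "").replace("|", "/").split("/")
    if len(genotype) != 2 or "." in genotype:
        return None

    depths = sample.get("AD", "").split(",")
    if len(depths) < 2 or not all(d.isdigit() for d in depths[:2]):
        return None
    ref_depth, alt_depth = int(depths[0]), int(depths[1])
    if ref_depth + alt_depth == 0:
        return None  # no supporting reads, frequency undefined

    freq = alt_depth / (ref_depth + alt_depth)
    return chrom, pos, freq, genotype[0] != genotype[1]


if __name__ == "__main__":
    record = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:18,12:30"
    print(allele_frequency_from_vcf_line(record))
```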

3.2 Visual output

Results can be plotted and customized through plotting functions based on karyoploteR (Gel and Serra, 2017) and CopyNumberPlots (https://github.com/bernatgel/CopyNumberPlots) packages (Supplementary Fig. S1).

4 Performance evaluation

CNVfilteR was evaluated on 3 WGS samples and 541 gene-panel samples. The default parameters were chosen based on their performance in a WGS sample (HuRef sample) and a gene-panel dataset (HiSeq-panel) (Supplementary File S1).

4.1 Evaluation on WGS data

We evaluated CNVfilteR performance on three reference WGS samples: the HuRef/Venter genome (Zhou ), the AK1 genome (Seo ) and the NA12878 genome. The HuRef and AK1 samples were evaluated using a published reference CNV callset and the results of six CNV calling tools (Canvas, cn.MOPS, CNVnator, ERDS, Genome_STRiP, RDXplorer) (Trost ). For these two samples, we also ran an additional CNV calling tool, LUMPY (Layer ). The NA12878 sample, in turn, was evaluated with a reference callset and the output of ten CNV calling tools (Canvas, cn.MOPS, CNVnator, RDXplorer, iCopyDAV, GROM-RD, Rsicnv, Control-FREEC, ReadDepth) from a previous work (MacDonald ; Parikh ; Zhang ). For the three WGS samples, SNV calls were obtained using Strelka2 (Kim ). Further details are available in Supplementary File S1. CNVfilteR identified between 15.3% and 44.8% of the false positives, and the FDR decreased for all tool-sample evaluations (up to 10.4%). Additionally, the F1-score improved in 19 out of the 24 tool-sample evaluations, with an increase of up to 20.7% (Fig. 1d–f). Sensitivity, however, decreased slightly: the absolute sensitivity decrease across tool-sample evaluations ranged between 0.001 and 0.035. Metrics details are available in Supplementary File S2 and Figures S2–S7. Moreover, additional evaluations were performed to show CNVfilteR performance on different CNV size ranges, on different numbers of SNVs overlapping each CNV, and on different parameter values (Supplementary Figs S8–S25 and Files S5–S7).
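The relationship between the reported metrics can be sketched in Python as follows. The definitions of FDR, sensitivity and F1-score are the standard ones; the counts used in the example are invented for illustration and do not correspond to any tool or sample in the evaluation, and the helper names `metrics` and `metrics_after_filtering` are hypothetical.

```python
def metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard callset metrics from true-positive, false-positive and false-negative counts."""
    called = tp + fp
    precision = tp / called if called else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    return {
        "sensitivity": sensitivity,
        "fdr": fp / called if called else 0.0,
        "f1": 2 * precision * sensitivity / denom if denom else 0.0,
    }


def metrics_after_filtering(tp: int, fp: int, fn: int,
                            removed_fp: int, removed_tp: int) -> dict:
    """Recompute metrics after discarding the calls flagged as false positives.

    A true CNV that is wrongly discarded becomes a false negative.
    """
    return metrics(tp - removed_tp, fp - removed_fp, fn + removed_tp)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    before = metrics(tp=80, fp=60, fn=20)
    after = metrics_after_filtering(tp=80, fp=60, fn=20, removed_fp=20, removed_tp=1)
    for name in ("sensitivity", "fdr", "f1"):
        print(f"{name}: {before[name]:.3f} -> {after[name]:.3f}")
```

The example shows the trade-off described in the text: removing many false positives lowers the FDR and raises the F1-score, while each wrongly removed true CNV costs a small amount of sensitivity.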

4.2 Evaluation on gene-panel data

We also evaluated CNVfilteR performance on two gene-panel targeted datasets: one containing 411 samples from different Illumina HiSeq runs (HiSeq-panel dataset) and another with 130 samples from different Illumina MiSeq runs (MiSeq-panel dataset). All samples were captured with a 135-gene panel (Castellanos ). To evaluate CNVfilteR, previous MLPA results for a subset of genes were used as the gold standard, CNVs were called using DECoN (Fowler ), and SNVs were called using VarScan2 (Koboldt ) (Supplementary Files S1, S3 and S4). In the HiSeq-panel and MiSeq-panel datasets, CNVfilteR identified 15% of the false-positive calls (3 out of 20 false positives) and 12.5% of the false-positive calls (2 out of 16), respectively. On both datasets, no true CNV was misidentified as a false positive (Supplementary File S1), so sensitivity did not change.
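The percentages above follow directly from the reported counts, as this short Python check shows. The helper name `percent_identified` is hypothetical. Because no true CNV was removed, the true-positive and false-negative counts are unchanged, which is why sensitivity stays the same.

```python
def percent_identified(identified: int, total: int) -> float:
    """Percentage of false-positive calls that were flagged."""
    return 100.0 * identified / total


# Counts reported in the text for the two gene-panel datasets.
panel_counts = {"HiSeq-panel": (3, 20), "MiSeq-panel": (2, 16)}

if __name__ == "__main__":
    for dataset, (identified, total) in panel_counts.items():
        remaining = total - identified
        print(f"{dataset}: {percent_identified(identified, total)}% identified, "
              f"{remaining} false positives remaining")
```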

4.3 Runtime

Runtime was evaluated on a subset of 79 gene-panel samples and the HuRef WGS sample. The median runtime per sample was 0.85 s for the gene-panel samples and 3.53 min for the HuRef sample (Supplementary File S1).

5 Conclusion

We developed CNVfilteR, an R/Bioconductor package that uses SNV allele frequencies to identify false-positive calls generated by CNV calling tools from germline NGS data. CNVfilteR identified false-positive calls in all tested tools and datasets, from gene-panel to WGS, and the F1-score improved in most tool-sample combinations. CNVfilteR can be plugged into most existing CNV calling pipelines to improve calling performance at virtually no cost.

1. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.

Authors: Daniel C Koboldt; Qunyuan Zhang; David E Larson; Dong Shen; Michael D McLellan; Ling Lin; Christopher A Miller; Elaine R Mardis; Li Ding; Richard K Wilson
Journal: Genome Res Date: 2012-02-02 Impact factor: 9.043

2. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives.

Authors: Min Zhao; Qingguo Wang; Quan Wang; Peilin Jia; Zhongming Zhao
Journal: BMC Bioinformatics Date: 2013-09-13 Impact factor: 3.169

3. Free-access copy-number variant detection tools for targeted next-generation sequencing data.

Authors: Iria Roca; Lorena González-Castro; Helena Fernández; Mª Luz Couce; Ana Fernández-Marmiesse
Journal: Mutat Res Rev Mutat Res Date: 2019-02-23 Impact factor: 5.657

4. De novo assembly and phasing of a Korean human genome.

Authors: Jeong-Sun Seo; Arang Rhie; Junsoo Kim; Sangjin Lee; Min-Hwan Sohn; Chang-Uk Kim; Alex Hastie; Han Cao; Ji-Young Yun; Jihye Kim; Junho Kuk; Gun Hwa Park; Juhyeok Kim; Hanna Ryu; Jongbum Kim; Mira Roh; Jeonghun Baek; Michael W Hunkapiller; Jonas Korlach; Jong-Yeon Shin; Changhoon Kim
Journal: Nature Date: 2016-10-05 Impact factor: 49.962

5. A Comprehensive Workflow for Read Depth-Based Identification of Copy-Number Variation from Whole-Genome Sequence Data.

Authors: Brett Trost; Susan Walker; Zhuozhi Wang; Bhooma Thiruvahindrapuram; Jeffrey R MacDonald; Wilson W L Sung; Sergio L Pereira; Joe Whitney; Ada J S Chan; Giovanna Pellecchia; Miriam S Reuter; Si Lok; Ryan K C Yuen; Christian R Marshall; Daniele Merico; Stephen W Scherer
Journal: Am J Hum Genet Date: 2018-01-04 Impact factor: 11.025