
VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications.

John C Mu¹, Marghoob Mohiyuddin², Jian Li², Narges Bani Asadi², Mark B Gerstein², Alexej Abyzov², Wing H Wong¹, Hugo Y K Lam².

Abstract

SUMMARY: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing.
AVAILABILITY AND IMPLEMENTATION: Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim.
CONTACT: rd@bina.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2014; PMID: 25524895; PMCID: PMC4410653; DOI: 10.1093/bioinformatics/btu828

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Because ground truth is lacking for real data, simulation is a common approach for evaluating the secondary analysis of high-throughput sequencing data, from alignment to variant calling. An early attempt to perform validation without simulation is described by Zook et al. However, that attempt involved extensive biological experiments and does not cover the full spectrum of variants. We present the first integrated pipeline that provides complete validation of secondary analysis through simulation as well as analysis with real data. Most tools simulate variants, but no single tool simulates the full spectrum of variants, from small variants to all types of structural variations (SVs). RSVSim (Bartenhagen and Dugas, 2013) simulates SVs, but does not simulate SNVs or small indels, and it does not generate reads. SMASH (Talwalkar et al.) only considers SV deletions and insertions. Other variant simulation tools exist (see Supplementary Material); however, VarSim is the only one able to simulate SNVs, small indels and many types of SVs. This completeness allows VarSim to be closely representative of real sequencing studies. Furthermore, among the aforementioned tools, only a few simulate both variants and reads. VarSim goes further with the ability to validate the correctness of read alignments even near complex SVs.

2 Methods

VarSim works in two steps. The first step is simulation. A perturbed diploid genome is generated by inserting variants into a user-provided reference genome (e.g. GRCh37). Reads are then simulated from this perturbed genome and processed using the secondary analysis pipeline under consideration [e.g. BWA + GATK (Lam et al.)]. The second step is validation. The aligned reads and called variants are validated against the true alignments and variants, respectively. The reporting tools then generate detailed interactive plots showing the accuracy of alignment and variant calling. It is also possible to compare the accuracy of multiple tools. Figure 1 provides an overview of the basic germline workflow.

Fig. 1. VarSim simulation and validation workflow. The germline workflow can be run with or without the somatic workflow.

The basic workflow can also be adapted to simulate tumor/normal pairs and to validate somatic variant callers (Fig. 1). VarSim is run twice: once with somatic variants from the COSMIC database (Forbes et al.) and/or a somatic variant VCF, and once without any somatic variants. The two resulting sets of reads can optionally be mixed to simulate normal contamination at various allele frequencies. After somatic variant analysis is run on the two sets of reads, the somatic variants can be validated in the same way as in the standard germline workflow. See the Supplementary Material for more details.
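The read-mixing step can be illustrated with a minimal Python sketch. This is not VarSim's implementation: the function names `mix_reads` and `expected_het_vaf`, the in-memory lists of reads, and the fixed random seed are all assumptions made for illustration. The sketch draws a chosen fraction of reads from the normal run and the rest from the tumor run, and gives the allele frequency one would expect for a heterozygous somatic variant in a diploid region at that contamination level.

```python
import random


def mix_reads(tumor_reads, normal_reads, normal_fraction, total, seed=0):
    """Return `total` reads, of which round(total * normal_fraction) come
    from the normal sample and the remainder from the tumor sample.

    Hypothetical helper: reads are plain Python objects held in lists.
    """
    if not 0.0 <= normal_fraction <= 1.0:
        raise ValueError("normal_fraction must be between 0 and 1")
    n_normal = round(total * normal_fraction)
    n_tumor = total - n_normal
    if n_normal > len(normal_reads) or n_tumor > len(tumor_reads):
        raise ValueError("not enough reads to build the requested mixture")
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixed = rng.sample(tumor_reads, n_tumor) + rng.sample(normal_reads, n_normal)
    rng.shuffle(mixed)
    return mixed


def expected_het_vaf(normal_fraction):
    """Expected allele frequency of a heterozygous somatic variant in a
    diploid region when `normal_fraction` of the reads are from normal cells."""
    return (1.0 - normal_fraction) / 2.0


if __name__ == "__main__":
    tumor = [("T", i) for i in range(1000)]
    normal = [("N", i) for i in range(1000)]
    mixture = mix_reads(tumor, normal, normal_fraction=0.3, total=500)
    print(sum(1 for source, _ in mixture if source == "N"), "normal reads")
    print("expected het VAF:", expected_het_vaf(0.3))
```

Sampling without replacement keeps every read unique in the mixture; a real pipeline would more likely subsample FASTQ or BAM files on disk than hold reads in memory.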

2.1 Simulation

To generate a perturbed genome, VarSim samples small variants and SVs from existing databases (e.g. dbSNP, DGV) and/or a provided VCF file. For SV insertions without a known novel sequence, VarSim generates a new insertion sequence from a database of known human insertion sequences (e.g. the Venter genome insertion sequences). It then generates a diploid genome containing the sampled variants with an enhanced version of vcf2diploid (Rozowsky et al.) (see Supplementary Material). Specifically, we added support for more types of SVs (inversions, duplications) and improved VCF reading. We also added the ability to generate a map file (MFF, see Supplementary Material) between the perturbed genome and the reference genome. This map is used to convert locations on the perturbed genome to locations on the reference genome. It is more flexible than the chain file in the original vcf2diploid, as it can handle complex SVs such as translocations, which will be simulated by VarSim in a future version.

For read simulation, VarSim currently supports DWGSIM and ART (Huang et al.). It uses ART as the default since ART learns an error profile based on real sequencing reads. VarSim is designed so that any type of read simulator can be supported with minimal work. This is important because, unlike the structure of the human genome, sequencing technology will continue to evolve and change. Because the reads are generated from the perturbed genome, the read simulator cannot report their true alignment locations on the reference genome. To determine these locations, VarSim uses the map file generated in the genome simulation step. In addition, VarSim parallelizes the read generation of any read simulator to greatly reduce simulation time.
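The idea of converting a position on the perturbed genome back to the reference can be sketched as a lookup over sorted blocks. The sketch below is a simplified stand-in, not the MFF format itself (which is defined in the Supplementary Material): the `Block` fields, the `lift_to_reference` function and the single-chromosome assumption are hypothetical. Each block records where a stretch of the perturbed genome came from; inserted sequence has no reference coordinate, and inverted blocks are read backwards.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """One contiguous stretch of the perturbed genome (hypothetical layout)."""
    pert_start: int            # 0-based start on the perturbed genome
    length: int                # block length in bases
    ref_start: Optional[int]   # 0-based start on the reference; None for inserted sequence
    reverse: bool = False      # True if the block is an inverted copy of the reference


def lift_to_reference(blocks, pos):
    """Map a perturbed-genome position to a reference position.

    `blocks` must be sorted by `pert_start`. Returns None when the position
    falls in inserted sequence or outside every block.
    """
    starts = [b.pert_start for b in blocks]
    i = bisect_right(starts, pos) - 1      # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.pert_start
    if offset >= block.length or block.ref_start is None:
        return None
    if block.reverse:
        # An inverted block maps its first base to the last reference base.
        return block.ref_start + block.length - 1 - offset
    return block.ref_start + offset


if __name__ == "__main__":
    blocks = [
        Block(0, 100, 0),                    # unchanged sequence
        Block(100, 10, None),                # 10 bp insertion
        Block(110, 50, 100, reverse=True),   # inversion of reference 100..149
        Block(160, 40, 170),                 # reference 150..169 deleted
    ]
    for p in (50, 105, 110, 159, 160):
        print(p, "->", lift_to_reference(blocks, p))
```

A block list of this kind can represent rearrangements that a strictly collinear mapping cannot, which is the flexibility the text attributes to the map file.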

2.2 Validation

VarSim validates alignments via meta-data stored in the read name. All possible true read alignment locations are stored in the meta-data. This allows VarSim to validate alignments overlapping the breakpoints of SVs. Furthermore, each alignment is annotated with the type of region it was generated from, which allows validating only the alignments overlapping specific types of variants. An alignment is called correct if it is close to any of the true locations. For instance, if a read overlaps the edge of an inversion, the read could either be aligned partially outside the inversion with the rest soft-clipped or partially inside the inversion and similarly soft-clipped. VarSim validates against all of these possible alignments. VarSim validates variants by comparing them to the true set of variants inserted into the perturbed reference genome. VarSim handles the variety of possible encodings for a VCF record by normalizing each record to a canonical form before comparison. The accuracy of variant calling is reported based on sensitivity (TPR) and precision (PPV). VarSim reports TPR and PPV broken down into bins by variant type and also variant size. For details of the computation, please see Supplementary methods.
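The alignment check described above (an alignment is correct if it is close to any of the true locations) can be sketched as follows. The read-name encoding used here (`name|chrom:pos,chrom:pos`), the function names and the 20 bp tolerance are assumptions for illustration only; VarSim's actual read-name meta-data and its definition of "close" are given in the Supplementary methods.

```python
def parse_true_locations(read_name):
    """Extract (chrom, pos) pairs from a read name of the hypothetical form
    'read7|chr1:1000,chr1:5200'. Returns an empty list if none are encoded."""
    _, _, encoded = read_name.partition("|")
    if not encoded:
        return []
    locations = []
    for item in encoded.split(","):
        chrom, pos = item.split(":")
        locations.append((chrom, int(pos)))
    return locations


def is_correct(chrom, pos, true_locations, wiggle=20):
    """An alignment counts as correct if it lies within `wiggle` bases of ANY
    true location on the same chromosome (the tolerance is an assumed value)."""
    return any(
        chrom == true_chrom and abs(pos - true_pos) <= wiggle
        for true_chrom, true_pos in true_locations
    )


if __name__ == "__main__":
    # A read spanning an inversion breakpoint has two acceptable locations.
    name = "read7|chr1:1000,chr1:5200"
    truth = parse_true_locations(name)
    print(is_correct("chr1", 1005, truth))   # near the first true location
    print(is_correct("chr1", 5190, truth))   # near the second true location
    print(is_correct("chr2", 1000, truth))   # wrong chromosome
```

Accepting any of the stored locations is what lets a read that overlaps an SV breakpoint be scored fairly, whichever side the aligner chose to soft-clip.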

2.3 Analysis output

The resulting analysis output for alignment and variant validation is a JSON file that can be visually analyzed as a single interactive HTML document with SVG plots. The plots are generated using the D3 library. Validation metrics include sensitivity, precision and F1 score, which is the harmonic mean of precision and sensitivity (see Supplementary methods). The HTML document can also compare multiple analysis outputs. This platform-agnostic format makes sharing and comparing results relatively simple.
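The metrics named above follow directly from counts of true positives (TP), false positives (FP) and false negatives (FN). The sketch below computes them and shows one way to assign a variant to a size bin. The bin edges and the function names are hypothetical; the bins VarSim actually uses are described in the Supplementary methods.

```python
from bisect import bisect_right

# Hypothetical size-bin edges in bases; the last bin is open-ended.
BIN_EDGES = [1, 10, 50, 1000]


def size_bin(size, edges=BIN_EDGES):
    """Index of the half-open bin [edges[i], edges[i+1]) that contains `size`."""
    return bisect_right(edges, size) - 1


def metrics(tp, fp, fn):
    """Sensitivity (TPR), precision (PPV) and F1 from raw counts.

    F1 is the harmonic mean of TPR and PPV: 2 * TPR * PPV / (TPR + PPV).
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tpr * ppv / (tpr + ppv) if tpr + ppv else 0.0
    return {"tpr": tpr, "ppv": ppv, "f1": f1}


if __name__ == "__main__":
    print(size_bin(5), size_bin(10), size_bin(5000))
    print(metrics(tp=90, fp=10, fn=30))
```

Computing the metrics separately per bin, rather than once over all variants, keeps the many small variants from masking poor performance on the fewer large ones.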

3 Results

We demonstrated VarSim’s completeness in both simulation and validation by simulating NA12878’s personal genome with small variants from Genome in a Bottle (GiaB) high-confidence regions (Zook et al.), and with SVs from the 1000 Genomes Project (Mills et al.) and DGV (MacDonald et al.). Reads were generated at 50× coverage. The accuracy on the simulated reads was similar to the accuracy on the Illumina Platinum Genomes reads of NA12878 (see Supplementary methods).

Figure 2a and b present some benchmarking results on SNVs and small deletions. For all variant calling comparisons, we used the alignments from BWA-MEM (Li, 2013) after realignment and recalibration with GATK unless otherwise specified. Novoalign’s alignments were used directly as input to Haplotype Caller without realignment and recalibration, as recommended by the authors. In this case, Novoalign performed slightly worse than BWA-MEM. For small deletions, Haplotype Caller outperformed both Unified Genotyper (McKenna et al.) and FreeBayes (Garrison and Marth, 2012), especially for larger deletions.

The results on SV deletions from several popular SV calling tools (Abyzov et al.; Chen et al.; Ye et al.) are shown in Figure 2c. The three tools represent three different methods for SV calling: split-read, read-depth and paired-end. All tools performed well for moderate-sized deletion SVs. Only BreakDancer (paired-end mapping) was able to recover larger SV deletions; however, it was not able to recover exact breakpoints. All tools failed to adequately recover deletion SVs in the smaller range.

Among the somatic analysis tools compared, MuTect (Cibulskis et al.) was superior to the other tools, especially when the tumor allele frequency was low. Additional analyses of secondary and somatic analysis tools based on the simulated NA12878 genome are provided in the Supplementary Material.

Fig. 2. Validation results for some popular secondary analysis tools.

4 Conclusions and future work

VarSim is the most comprehensive pipeline for simulation and validation of secondary analysis, covering both small variants and SVs on a diploid genome. Future work on VarSim will add support for translocations, as well as interspersed duplications. We envision VarSim will become an invaluable tool in the evaluation of new secondary analysis methods.

References

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing.

Authors: Alexej Abyzov; Alexander E Urban; Michael Snyder; Mark Gerstein
Journal: Genome Res Date: 2011-02-07 Impact factor: 9.043

3. RSVSim: an R/Bioconductor package for the simulation of structural variations.

Authors: Christoph Bartenhagen; Martin Dugas
Journal: Bioinformatics Date: 2013-04-25 Impact factor: 6.937

4. ART: a next-generation sequencing read simulator.

Authors: Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal: Bioinformatics Date: 2011-12-23 Impact factor: 6.937

5. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.

Authors: Kai Ye; Marcel H Schulz; Quan Long; Rolf Apweiler; Zemin Ning
Journal: Bioinformatics Date: 2009-06-26 Impact factor: 6.937

6. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.

Authors: Ken Chen; John W Wallis; Michael D McLellan; David E Larson; Joelle M Kalicki; Craig S Pohl; Sean D McGrath; Michael C Wendl; Qunyuan Zhang; Devin P Locke; Xiaoqi Shi; Robert S Fulton; Timothy J Ley; Richard K Wilson; Li Ding; Elaine R Mardis
Journal: Nat Methods Date: 2009-08-09 Impact factor: 28.547

7. Mapping copy number variation by population-scale genome sequencing.

Authors: Ryan E Mills; Klaudia Walter; Chip Stewart; Robert E Handsaker; Ken Chen; Can Alkan; Alexej Abyzov; Seungtai Chris Yoon; Kai Ye; R Keira Cheetham; Asif Chinwalla; Donald F Conrad; Yutao Fu; Fabian Grubert; Iman Hajirasouliha; Fereydoun Hormozdiari; Lilia M Iakoucheva; Zamin Iqbal; Shuli Kang; Jeffrey M Kidd; Miriam K Konkel; Joshua Korn; Ekta Khurana; Deniz Kural; Hugo Y K Lam; Jing Leng; Ruiqiang Li; Yingrui Li; Chang-Yun Lin; Ruibang Luo; Xinmeng Jasmine Mu; James Nemesh; Heather E Peckham; Tobias Rausch; Aylwyn Scally; Xinghua Shi; Michael P Stromberg; Adrian M Stütz; Alexander Eckehart Urban; Jerilyn A Walker; Jiantao Wu; Yujun Zhang; Zhengdong D Zhang; Mark A Batzer; Li Ding; Gabor T Marth; Gil McVean; Jonathan Sebat; Michael Snyder; Jun Wang; Kenny Ye; Evan E Eichler; Mark B Gerstein; Matthew E Hurles; Charles Lee; Steven A McCarroll; Jan O Korbel
Journal: Nature Date: 2011-02-03 Impact factor: 49.962

8. AlleleSeq: analysis of allele-specific expression and binding in a network framework.

Authors: Joel Rozowsky; Alexej Abyzov; Jing Wang; Pedro Alves; Debasish Raha; Arif Harmanci; Jing Leng; Robert Bjornson; Yong Kong; Naoki Kitabayashi; Nitin Bhardwaj; Mark Rubin; Michael Snyder; Mark Gerstein
Journal: Mol Syst Biol Date: 2011-08-02 Impact factor: 11.429

9. The Database of Genomic Variants: a curated collection of structural variation in the human genome.

Authors: Jeffrey R MacDonald; Robert Ziman; Ryan K C Yuen; Lars Feuk; Stephen W Scherer
Journal: Nucleic Acids Res Date: 2013-10-29 Impact factor: 16.971

10. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples.

Authors: Kristian Cibulskis; Michael S Lawrence; Scott L Carter; Andrey Sivachenko; David Jaffe; Carrie Sougnez; Stacey Gabriel; Matthew Meyerson; Eric S Lander; Gad Getz
Journal: Nat Biotechnol Date: 2013-02-10 Impact factor: 54.908
