simuG: a general-purpose genome simulator.

Abstract

SUMMARY: Simulated genomes with pre-defined and random genomic variants can be very useful for benchmarking genomic and bioinformatics analyses. Here we introduce simuG, a lightweight tool for simulating the full spectrum of genomic variants (single nucleotide polymorphisms, Insertions/Deletions, copy number variants, inversions and translocations) for any organism (including human). The simplicity and versatility of simuG make it a unique general-purpose genome simulator for a wide range of simulation-based applications.
AVAILABILITY AND IMPLEMENTATION: Code in Perl along with user manual and testing data is available at https://github.com/yjx1217/simuG. This software is free for use under the MIT license.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 31116378 PMCID: PMC6821417 DOI: 10.1093/bioinformatics/btz424

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Along with the rapid progress of genome sequencing technologies, many bioinformatics tools have been developed for characterizing genomic variants based on genome sequencing data. While experimentally validated gold-standard genome sequencing datasets from real biological samples are increasingly available, in silico simulation remains a powerful approach for gauging and comparing the performance of bioinformatics tools. Correspondingly, many read simulators have been developed for different sequencing technologies, such as ART (Huang ) for Illumina and 454, SimLoRD (Stöcker ) for PacBio and DeepSimulator (Li ) for Oxford Nanopore. However, when it comes to tools for simulating genome sequences with embedded variants, the choices are much more limited. The currently available tools are either too simple or too specialized. For example, SInC (Pattnaik ) can introduce random single nucleotide polymorphisms (SNPs), Insertions/Deletions (INDELs) and copy number variants (CNVs) into a user-provided reference genome but cannot simulate known variants, a capability that is highly relevant in some simulation applications. Simulome (Price ) is another random variant simulator that provides finer control options, but it is designed for prokaryote genomes only. More sophisticated tools exist, such as VarSim (Mu ) and Xome-Blender (Semeraro ), but these tools are mostly tailored for human cancer genome simulation and often require additional third-party databases. Therefore, we feel there is a need for a genome simulator that strikes a balance between simplicity and versatility. With this in mind, we developed a general-purpose genome simulator, simuG, which is versatile enough to simulate both small (i.e. SNPs and INDELs) and large (i.e. CNVs, inversions and translocations) genomic variants while staying lightweight, with no extra dependencies and minimal input requirements. In addition, simuG provides a rich array of fine-grained controls, such as simulating SNPs in different coding partitions (e.g. coding sites, non-coding sites, 4-fold degenerate sites or 2-fold degenerate sites); simulating CNVs with different formation mechanisms (e.g. segmental deletions, dispersed duplications and tandem duplications); and simulating inversions and translocations with specific types of breakpoints. These features together make simuG highly amenable to a wide range of application scenarios.
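To illustrate what a coding partition such as "4-fold degenerate sites" means, the following minimal Python sketch classifies a codon position by its degeneracy under the standard genetic code. It is not taken from simuG: the function name `site_degeneracy` is hypothetical, and simuG's own partition definitions may differ in detail.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def site_degeneracy(codon: str, position: int) -> int:
    """Count how many of the four bases at `position` (0-based) keep the amino acid.

    4 indicates a 4-fold degenerate site, 2 a 2-fold degenerate site,
    and 1 a site where every substitution changes the encoded amino acid.
    """
    codon = codon.upper()
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        mutated = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutated] == original:
            count += 1
    return count
```

For instance, `site_degeneracy("GCT", 2)` returns 4, because all four GCN codons encode alanine, whereas `site_degeneracy("GAT", 2)` returns 2.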

2 Description and feature highlights

simuG is a command-line tool written in Perl and supports all mainstream operating systems. It takes the user-supplied reference genome (in FASTA format) as the working template to introduce non-overlapping genomic variants of all major types (i.e. SNPs, INDELs, CNVs, inversions and translocations). SNPs and INDELs can be introduced simultaneously, whereas CNVs (implemented as segmental duplications and deletions), inversions and translocations can be introduced in separate runs. For each variant type, simuG can simulate pre-defined or random variants depending on specified options. For pre-defined variants, a user-supplied Variant Call Format (VCF) file that specifies all desired variants is needed, based on which simuG will operate on the input reference genome to introduce the corresponding variants. For random variants, simuG supports a wide spectrum of fine control options, such as ‘-titv_ratio’ for specifying the transition/transversion ratio of SNPs, ‘-indel_size_powerlaw_alpha’ and ‘-indel_size_powerlaw_constant’ for specifying the size distribution of INDELs, ‘-cnv_gain_loss_ratio’ for specifying the ratio of segmental duplication versus segmental deletion, ‘-duplication_tandem_dispersed_ratio’ for specifying the ratio of tandem versus dispersed duplications and ‘-centromere_gff’ for specifying the location of centromeres so that simulated random CNVs, inversions and translocations will not disrupt the specified centromeres. An ancillary script vcf2model.pl is further provided to directly calculate the best parameter combinations for the random SNP/INDEL simulation based on real data. Moreover, given the strong association between gross chromosomal rearrangement breakpoints and repetitive sequences [e.g. transposable elements (TEs)] observed in empirical studies (Yue ; Zhang ), simuG can restrict random inversions and translocations to only use user-defined breakpoints (by specifying the ‘-inversion_breakpoint_gff’ or ‘-translocation_breakpoint_gff’ option). The specific feature type and strand information of these user-defined breakpoints will be considered during the breakpoint sampling. For example, the breakpoint pairs that can trigger an inversion should belong to the same feature type but lie on opposite strands (e.g. inverted repeats). Also, when specified, centromeres will be given special consideration in random translocation simulation so that translocations leading to dicentric chromosomes will not be sampled. Finally, when needed, users can also define a list of chromosomes (e.g. mtDNA) to be excluded from variant introduction. Upon the completion of the simulation, three files will be produced: (i) a simulated genome bearing introduced variants in FASTA format, (ii) a tabular file showing the genomic locations of all introduced variants relative to both the reference genome and the simulated genome and (iii) a VCF file showing the genomic locations of all introduced variants relative to the reference genome. Since simuG’s major input/output formats (e.g. FASTA, VCF and GFF3) are all widely used in the field, it should be fairly straightforward to connect simuG with other computational tools both upstream and downstream. Please note that when comparing the VCF outputs from simuG and other tools, all VCF files used for such comparison should be normalized by tools like vt (Tan ) beforehand.
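The pairing rule for inversion breakpoints described above can be sketched as follows. This is an illustrative Python sketch, not simuG's Perl implementation: the `Feature` record and the `candidate_inversion_pairs` function are hypothetical names, and requiring both features to lie on the same chromosome is an assumption made here because an inversion rearranges a single chromosome.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Feature:
    """One user-defined breakpoint feature (hypothetical record, e.g. one GFF3 line)."""
    chrom: str
    start: int
    end: int
    feature_type: str
    strand: str  # "+" or "-"


def candidate_inversion_pairs(features):
    """Return feature pairs that could serve as the two breakpoints of an inversion.

    A pair qualifies when both features lie on the same chromosome (assumption),
    share the same feature type, and sit on opposite strands (inverted repeats).
    """
    pairs = []
    for a, b in combinations(features, 2):
        if (
            a.chrom == b.chrom
            and a.feature_type == b.feature_type
            and {a.strand, b.strand} == {"+", "-"}
        ):
            pairs.append((a, b))
    return pairs
```

A random inversion could then be drawn from the returned list, for example with `random.choice`, although how simuG samples among valid pairs is not specified here.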

3 Application demonstration

To demonstrate the application of simuG in a real-case scenario, we ran simuG with the budding yeast Saccharomyces cerevisiae (version R64-2-1) and human (version GRCh38) reference genomes to generate nine simulated genomes for each organism: (i) with 10 000 SNPs, (ii) with 1000 random INDELs, (iii) with 10 random CNVs due to segmental deletions, (iv) with 10 random CNVs due to dispersed duplications, (v) with 10 random CNVs due to tandem duplications, (vi) with 5 random inversions, (vii) with 5 random inversions triggered by breakpoints sampled from pre-specified TEs, (viii) with 5 random translocations, (ix) with 5 random translocations triggered by breakpoints sampled from pre-specified TEs. Based on each simulated genome, 50X 150-bp Illumina paired-end reads and 25X PacBio reads were simulated with ART (Huang ) and SimLoRD (Stöcker ), respectively, and subsequently mapped to the yeast and human reference genomes. The read mapping was performed by BWA (Li and Durbin, 2009) for Illumina reads and by minimap2 (Li, 2018) for PacBio reads. With this setup, we evaluated the performance of different variant callers for both small and large variants (Table 1 and Supplementary Note). For small variants (i.e. SNPs and INDELs), we found that freebayes (Garrison and Marth, 2012) and the GATK4 HaplotypeCaller (Poplin ) both performed well, with the latter marginally winning out in INDEL calling. For large structural variants like CNVs, inversions and translocations, we found that both the short-read-based callers Delly (Rausch ) and Manta (Chen ) and the long-read-based caller Sniffles (Sedlazeck ) were able to identify most simulated events, especially when no TEs were associated with the breakpoints. By taking advantage of the longer read length, the long-read caller Sniffles was more accurate than the short-read-based callers in resolving exact breakpoints at base-pair resolution, even with half the sequencing coverage. Between the two short-read-based callers, Manta outperformed Delly in terms of breakpoint accuracy at the base-pair level.
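One simple way to score a small-variant caller against a simulated truth set is to compare normalized variant records as sets. The Python sketch below assumes each variant has already been reduced to a `(chrom, pos, ref, alt)` tuple taken from normalized VCF files; the function name `count_matches` is hypothetical, and this is not necessarily the comparison procedure used for Table 1.

```python
def count_matches(truth, calls):
    """Count true positives, false positives and false negatives.

    `truth` and `calls` are iterables of (chrom, pos, ref, alt) tuples.
    A call is a true positive only if it matches a truth record exactly,
    which is why both inputs should be normalized beforehand.
    """
    truth_set = set(truth)
    call_set = set(calls)
    true_positives = len(truth_set & call_set)
    false_positives = len(call_set - truth_set)
    false_negatives = len(truth_set - call_set)
    return true_positives, false_positives, false_negatives
```

Exact matching suits SNPs and normalized INDELs; structural variant breakpoints would typically need a position tolerance instead, which this sketch does not cover.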

Table 1.

Benchmarking popular variant callers with the small and large genomic variants simulated by simuG

		Yeast			Human
Variant type	Variant caller	Precision	Recall	F₁ score	Precision	Recall	F₁ score
SNP (n=10000)	freebayes	1.000	0.971	0.985	0.999	0.981	0.990
SNP (n=10000)	GATK4	1.000	0.970	0.985	1.000	0.977	0.988
INDEL (n=1000)	freebayes	0.954	0.931	0.942	0.939	0.930	0.935
INDEL (n=1000)	GATK4	1.000	0.969	0.984	1.000	0.976	0.988
CNV:segmental deletion (n=10)	Delly	1.000	1.000	1.000	1.000	1.000	1.000
	Manta	1.000	1.000	1.000	1.000	1.000	1.000
	Sniffles	1.000	1.000	1.000	1.000	1.000	1.000
CNV:dispersed duplication (n=10)	Delly	1.000	0.875	0.933	1.000	0.906	0.951
	Manta	1.000	0.906	0.951	1.000	0.906	0.951
	Sniffles	1.000	0.875	0.933	1.000	0.906	0.951
CNV:tandem duplication (n=10)	Delly	1.000	1.000	1.000	1.000	0.700	0.824
	Manta	1.000	1.000	1.000	1.000	0.700	0.824
	Sniffles	1.000	1.000	1.000	1.000	0.800	0.889
INV (n=5)	Delly	1.000	1.000	1.000	1.000	1.000	1.000
	Manta	1.000	1.000	1.000	1.000	1.000	1.000
	Sniffles	1.000	1.000	1.000	1.000	1.000	1.000
INV with TE breakpoints (n=5)	Delly	1.000	0.200	0.333	1.000	1.000	1.000
	Manta	1.000	0.200	0.333	1.000	1.000	1.000
	Sniffles	1.000	0.200	0.333	1.000	1.000	1.000
TRA (n=5)	Delly	1.000	1.000	1.000	0.800	0.800	0.800
	Manta	1.000	1.000	1.000	1.000	1.000	1.000
	Sniffles	1.000	1.000	1.000	1.000	1.000	1.000
TRA with TE breakpoints (n=5)	Delly	NA	0.000	NA	1.000	1.000	1.000
	Manta	NA	0.000	NA	1.000	1.000	1.000
	Sniffles	NA	0.000	NA	1.000	1.000	1.000

For each variant type, the number of introduced variants is shown in parentheses. INV: inversion. TRA: translocation. TE: transposable elements (full-length Ty1 for S. cerevisiae and full-length intact L1 for human). Precision = true positive/(true positive + false positive). Recall = true positive/(true positive + false negative). F1 score = 2 * (recall * precision)/(recall + precision). For a single CNV derived from dispersed duplication, there could be multiple duplicated copies inserted into different genomic locations, making it tricky to calculate precision, recall and F1 score by counting recovered CNV events. Therefore, in this case we calculated these values based on the number of recovered breakpoints instead.
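The three formulas in the table note can be written out as a short Python sketch. The function name `precision_recall_f1` is hypothetical, and returning `None` for an undefined ratio is a choice made here to mirror the NA entries in Table 1.

```python
from typing import Optional, Tuple


def precision_recall_f1(
    tp: int, fp: int, fn: int
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Compute precision, recall and F1 from true/false positive and false negative counts.

    A value is None when its denominator is zero (reported as NA in the table).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1
```

For example, 1 true positive, 0 false positives and 4 false negatives give a precision of 1.000, a recall of 0.200 and an F1 score of 0.333, which is consistent with the yeast "INV with TE breakpoints" rows; the underlying counts are inferred here rather than reported in the table. When a caller reports nothing at all (0 true positives and 0 false positives), precision and F1 are undefined while recall is 0.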

4 Conclusions

We developed simuG, a simple, flexible and powerful tool to simulate genome sequences with both pre-defined and random genomic variants. Simple as it is, simuG is versatile enough to handle the full spectrum of genomic variants, which makes it useful for a wide variety of simulation studies.

15 in total

1. Transposable elements as catalysts for chromosome rearrangements.

Authors: Jianbo Zhang; Chuanhe Yu; Lakshminarasimhan Krishnaswamy; Thomas Peterson
Journal: Methods Mol Biol Date: 2011

2. Unified representation of genetic variants.

Authors: Adrian Tan; Gonçalo R Abecasis; Hyun Min Kang
Journal: Bioinformatics Date: 2015-02-19 Impact factor: 6.937

3. ART: a next-generation sequencing read simulator.

Authors: Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal: Bioinformatics Date: 2011-12-23 Impact factor: 6.937

4. VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications.

Authors: John C Mu; Marghoob Mohiyuddin; Jian Li; Narges Bani Asadi; Mark B Gerstein; Alexej Abyzov; Wing H Wong; Hugo Y K Lam
Journal: Bioinformatics Date: 2014-12-17 Impact factor: 6.937

5. SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data.

Authors: Swetansu Pattnaik; Saurabh Gupta; Arjun A Rao; Binay Panda
Journal: BMC Bioinformatics Date: 2014-02-05 Impact factor: 3.169

6. Simulome: a genome sequence and variant simulator.

Authors: Adam Price; Cynthia Gibas
Journal: Bioinformatics Date: 2017-02-10 Impact factor: 6.937

7. DeepSimulator: a deep simulator for Nanopore sequencing.

Authors: Yu Li; Renmin Han; Chongwei Bi; Mo Li; Sheng Wang; Xin Gao
Journal: Bioinformatics Date: 2018-09-01 Impact factor: 6.937

8. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

9. DELLY: structural variant discovery by integrated paired-end and split-read analysis.

Authors: Tobias Rausch; Thomas Zichner; Andreas Schlattl; Adrian M Stütz; Vladimir Benes; Jan O Korbel
Journal: Bioinformatics Date: 2012-09-15 Impact factor: 6.937

10. Accurate detection of complex structural variations using single-molecule sequencing.

Authors: Fritz J Sedlazeck; Philipp Rescheneder; Moritz Smolka; Han Fang; Maria Nattestad; Arndt von Haeseler; Michael C Schatz
Journal: Nat Methods Date: 2018-04-30 Impact factor: 28.547
