As new data types and methodologies for analysis of biological data are developed, simulation tools are becoming increasingly necessary to the development, testing and benchmarking of bioinformatics research. Simulations provide a valuable control case in many contexts, such as the identification of read mapping bias (Degner ), correction of read bias in RNA-seq mapping (Satya ) and analysis of the accuracy of gene expression profiling (Hirsch ). It is therefore necessary to develop simulation software that is flexible, accessible and able to model a wide range of different genomic conditions.Currently, several tools exist for simulating read data, such as ART (Huang ) and Mason (http://seqan.de/projects/mason.html). Simulation tools such as these approach the problem of simulation from the perspective of reads, and are capable of producing impressive read sets that model variation and sequence errors for a variety of sequencing platforms. Simulome, however, approaches the problem of simulation from the perspective of the reference genome, which expands on the set of potential problems that can be addressed using genomic simulation. When Simulome is used in used in combination with read simulation tools like ART and Mason, entirely simulated experimental scenarios become possible.
2 Features and methods
Simulome is a python-based tool that incorporates biopython (Cock ) to generate synthetic prokaryotic reference genomes by sampling and restructuring existing genomic sequence data. Simulome provides to the ability to create pseudo-reference genomes with a specified gene set and controlled intergenic regions, as well as versions of the simulated genome that contain user controlled mutation events. This functionality makes it possible to analyze the effect of specific mutation types on a large scale, providing researchers with the ability to investigate the efficacy of analysis methodologies on a large number of genes that contain similar mutation events, while providing a control genome to which comparisons can be made. Simulome’s variant simulations can also be applied to whole prokaryotic genomes, allowing researchers to create variants of existing genomes with mutations introduced according to user specifications.Simulome features four different run modes for simulating mutation events: SNP mode, synonymous/nonsynonymous mutation mode, insertion/deletion mode and duplication mode. These run modes can be combined to produce variant genomes containing any combination of mutation events. Further customization of both the control and variant genome is possible through additional optional arguments.
2.1 Reference genome simulation
Creation of a simulated reference genome requires a nucleotide FASTA file containing the sequence of an existing organism, and an associated annotation file in GTF/GFF3 format. Basing the simulation on the properties of a real genome ensures that simulated genomes accurately model natural sequence variation. and allows genomes with different properties to serve as the base for simulation. For example, a researcher studying the effects of GC bias in RNA-seq data may be interested in simulating genomes with low or high GC content, and could do so by providing a low or high GC content genome to Simulome as input. Users can either use the provided genomic data to create a pseudo-genome of arbitrary size and its mutated variant, or simply create a mutated variant of the entire existing genome. If a pseudo-genome is to be created, gene data is extracted from the base genome and a user-specified number of genes are randomly selected to be used in the simulation. These genes are sampled so as to follow a normal distribution of gene length. The mean length and the standard deviation of all genes in the base genome are calculated, and genes are then sampled such that the mean and standard deviation of the simulated reference genome approximates that of the base genome. Randomness is determined by a seed value, and in the event that a user wishes to repeat a simulation, an exact replication can be produced by reusing the same random seed. Once target genes are selected, the simulated genome is created by interspersing intergenic regions and coding regions. Intergenic regions can be simulated in a variety of ways. Users can either specify the use of randomly generated intergenic regions, or can use real intergenic sequence data from the base genome. Random sequences are generated such that each base has a 25% chance of being selected for any given position. When real intergenic regions are used, all intergenic regions for the base genome are extracted and segments are randomly selected when needed. In both cases, users can specify intergenic length or allow randomly sized intergenic regions to be selected. BLAST searches can optionally be performed on each simulated intergenic region to ensure that there are no unintended duplicate regions in the simulation (Altschul ). The simulated genome will be output in FASTA and GFF/GTF3 format. Additional properties, such as operon inclusion and frequency, can be simulated by specifying optional arguments, which are covered in more depth in the Simulome manual.
2.2 Variant genome simulation
Four run modes are available for simulating SNPs, synonymous/nonsynonymous mutations, indels and/or duplication events. These can be used in any combination and are applied to each gene in the simulated genome. For example, it is possible to simulate a reference genome and/or a variant genome in which each gene contains a specific number of SNPs and deletions. Each run mode can be configured to introduce either an exact number of mutations in each gene, or otherwise to simulate variants in a range based on a Gaussian distribution with user-defined means and standard deviations. Once mutation events are simulated, sequence and annotation files representing the variant simulated genome are written, as well as an additional file containing meta-data for the introduced variants.
SNP run mode
In SNP run mode, a specified number of SNP events are introduced at random locations for each gene in the variant genome. By default, polymorphisms can occur at any position in a gene and no base will mutated more than once. However, additional control is provided for SNP mode to allow for the control of SNP density. Users can specify a window size in which SNP events can occur. To simulate the effect of SNPs on read alignment, a user might create simulations with regions of increasingly dense SNPs. By specifying the same number of SNPs in a decreasingly small window size, it would be possible to quantify the effect of locally clustered SNPs on read alignment on a large number of samples.
Synonymous/nonsynonymous run mode
In Synonymous/Nonsynonymous run mode, mutations are performed such that a specified percentage of mutations will be synonymous mutations. This run mode assumes that the first position of each gene begins the open reading frame and requires that the user provide distribution parameters to determine the number of mutations that will occur in each gene.
Indel run mode
Insertion and deletion run mode allows users to simulate indel events for each gene in the simulated reference genome. Users can choose to include insertion events, deletion events, or both, and can specify the number and length of each event. When both insertions and deletion events are specified, deletion events are performed first in order to preserve mutation integrity across all genes.
Duplication run mode
Duplication run mode allows the user to simulate scenarios where multiple possible mapping locations for a read are present in a genome. In duplication mode, a duplication percentage is specified by the user. A duplicate region of the specified size (e.g. 1000 nt if 10% of a base genome of 10000 nt is specified) is simulated and added to the base genome. Genes are randomly selected for duplication, along with appropriate intergenic regions, until the desired duplication level is reached.
2.3 Performance
Simulations were performed for reference genomes and mutated variants using Escherichia coli strain K12. These tests were performed on the Red Hat Enterprise Linux 7.2 operating system, with an Intel Xeon 2.53 GHz processor. Results of performance testing can be seen in Table 1.
Table 1.
Simulome execution time. Speed for various run mode combinations for various simulations based on the E.coli genome
Genes
Simulation modes
Time
1000
Reference only
3 m 5 s
1000
Reference + SNP variant
3 m 12 s
1000
Reference + SNP/Indel variant
3 m 18 s
1000
Reference + Synonymous/Indel variant
3 m 24 s
1000
Reference +
3 m 41 s
SNP/Indel/duplication variant
2000
Reference +
10 m 5 s
3000
SNP/Indel/duplication variant
16 m 40 s
E.coli (Full)
Reference +
E.coli (Full)
SNP/Indel/duplication variant
0 m 48 s
Synonymous/Indel variant
0 m 40 s
SNP/Indel variant
E.coli (Full)
SNP/Indel/duplication variant
0 m 41 s
Simulome execution time. Speed for various run mode combinations for various simulations based on the E.coli genome
Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937
Authors: Peter J A Cock; Christopher J Fields; Naohisa Goto; Michael L Heuer; Peter M Rice Journal: Nucleic Acids Res Date: 2009-12-16 Impact factor: 16.971
Authors: Jacob F Degner; John C Marioni; Athma A Pai; Joseph K Pickrell; Everlyne Nkadori; Yoav Gilad; Jonathan K Pritchard Journal: Bioinformatics Date: 2009-10-06 Impact factor: 6.937
Authors: Stephen J Bush; Dona Foster; David W Eyre; Emily L Clark; Nicola De Maio; Liam P Shaw; Nicole Stoesser; Tim E A Peto; Derrick W Crook; A Sarah Walker Journal: Gigascience Date: 2020-02-01 Impact factor: 6.524