Literature DB >> 28186240

Simulome: a genome sequence and variant simulator.

Abstract

Entities: Disease Species

Year: 2017 PMID： 28186240 PMCID： PMC5870732 DOI： 10.1093/bioinformatics/btx091

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

× No keyword cloud information.

Introduction

As new data types and methodologies for analysis of biological data are developed, simulation tools are becoming increasingly necessary to the development, testing and benchmarking of bioinformatics research. Simulations provide a valuable control case in many contexts, such as the identification of read mapping bias (Degner ), correction of read bias in RNA-seq mapping (Satya ) and analysis of the accuracy of gene expression profiling (Hirsch ). It is therefore necessary to develop simulation software that is flexible, accessible and able to model a wide range of different genomic conditions. Currently, several tools exist for simulating read data, such as ART (Huang ) and Mason (http://seqan.de/projects/mason.html). Simulation tools such as these approach the problem of simulation from the perspective of reads, and are capable of producing impressive read sets that model variation and sequence errors for a variety of sequencing platforms. Simulome, however, approaches the problem of simulation from the perspective of the reference genome, which expands on the set of potential problems that can be addressed using genomic simulation. When Simulome is used in used in combination with read simulation tools like ART and Mason, entirely simulated experimental scenarios become possible.

2 Features and methods

Simulome is a python-based tool that incorporates biopython (Cock ) to generate synthetic prokaryotic reference genomes by sampling and restructuring existing genomic sequence data. Simulome provides to the ability to create pseudo-reference genomes with a specified gene set and controlled intergenic regions, as well as versions of the simulated genome that contain user controlled mutation events. This functionality makes it possible to analyze the effect of specific mutation types on a large scale, providing researchers with the ability to investigate the efficacy of analysis methodologies on a large number of genes that contain similar mutation events, while providing a control genome to which comparisons can be made. Simulome’s variant simulations can also be applied to whole prokaryotic genomes, allowing researchers to create variants of existing genomes with mutations introduced according to user specifications. Simulome features four different run modes for simulating mutation events: SNP mode, synonymous/nonsynonymous mutation mode, insertion/deletion mode and duplication mode. These run modes can be combined to produce variant genomes containing any combination of mutation events. Further customization of both the control and variant genome is possible through additional optional arguments.

2.1 Reference genome simulation

Creation of a simulated reference genome requires a nucleotide FASTA file containing the sequence of an existing organism, and an associated annotation file in GTF/GFF3 format. Basing the simulation on the properties of a real genome ensures that simulated genomes accurately model natural sequence variation. and allows genomes with different properties to serve as the base for simulation. For example, a researcher studying the effects of GC bias in RNA-seq data may be interested in simulating genomes with low or high GC content, and could do so by providing a low or high GC content genome to Simulome as input. Users can either use the provided genomic data to create a pseudo-genome of arbitrary size and its mutated variant, or simply create a mutated variant of the entire existing genome. If a pseudo-genome is to be created, gene data is extracted from the base genome and a user-specified number of genes are randomly selected to be used in the simulation. These genes are sampled so as to follow a normal distribution of gene length. The mean length and the standard deviation of all genes in the base genome are calculated, and genes are then sampled such that the mean and standard deviation of the simulated reference genome approximates that of the base genome. Randomness is determined by a seed value, and in the event that a user wishes to repeat a simulation, an exact replication can be produced by reusing the same random seed. Once target genes are selected, the simulated genome is created by interspersing intergenic regions and coding regions. Intergenic regions can be simulated in a variety of ways. Users can either specify the use of randomly generated intergenic regions, or can use real intergenic sequence data from the base genome. Random sequences are generated such that each base has a 25% chance of being selected for any given position. When real intergenic regions are used, all intergenic regions for the base genome are extracted and segments are randomly selected when needed. In both cases, users can specify intergenic length or allow randomly sized intergenic regions to be selected. BLAST searches can optionally be performed on each simulated intergenic region to ensure that there are no unintended duplicate regions in the simulation (Altschul ). The simulated genome will be output in FASTA and GFF/GTF3 format. Additional properties, such as operon inclusion and frequency, can be simulated by specifying optional arguments, which are covered in more depth in the Simulome manual.

2.2 Variant genome simulation

Four run modes are available for simulating SNPs, synonymous/nonsynonymous mutations, indels and/or duplication events. These can be used in any combination and are applied to each gene in the simulated genome. For example, it is possible to simulate a reference genome and/or a variant genome in which each gene contains a specific number of SNPs and deletions. Each run mode can be configured to introduce either an exact number of mutations in each gene, or otherwise to simulate variants in a range based on a Gaussian distribution with user-defined means and standard deviations. Once mutation events are simulated, sequence and annotation files representing the variant simulated genome are written, as well as an additional file containing meta-data for the introduced variants.

SNP run mode

In SNP run mode, a specified number of SNP events are introduced at random locations for each gene in the variant genome. By default, polymorphisms can occur at any position in a gene and no base will mutated more than once. However, additional control is provided for SNP mode to allow for the control of SNP density. Users can specify a window size in which SNP events can occur. To simulate the effect of SNPs on read alignment, a user might create simulations with regions of increasingly dense SNPs. By specifying the same number of SNPs in a decreasingly small window size, it would be possible to quantify the effect of locally clustered SNPs on read alignment on a large number of samples.

Synonymous/nonsynonymous run mode

In Synonymous/Nonsynonymous run mode, mutations are performed such that a specified percentage of mutations will be synonymous mutations. This run mode assumes that the first position of each gene begins the open reading frame and requires that the user provide distribution parameters to determine the number of mutations that will occur in each gene.

Indel run mode

Insertion and deletion run mode allows users to simulate indel events for each gene in the simulated reference genome. Users can choose to include insertion events, deletion events, or both, and can specify the number and length of each event. When both insertions and deletion events are specified, deletion events are performed first in order to preserve mutation integrity across all genes.

Duplication run mode

Duplication run mode allows the user to simulate scenarios where multiple possible mapping locations for a read are present in a genome. In duplication mode, a duplication percentage is specified by the user. A duplicate region of the specified size (e.g. 1000 nt if 10% of a base genome of 10000 nt is specified) is simulated and added to the base genome. Genes are randomly selected for duplication, along with appropriate intergenic regions, until the desired duplication level is reached.

2.3 Performance

Simulations were performed for reference genomes and mutated variants using Escherichia coli strain K12. These tests were performed on the Red Hat Enterprise Linux 7.2 operating system, with an Intel Xeon 2.53 GHz processor. Results of performance testing can be seen in Table 1.

Table 1.

Simulome execution time. Speed for various run mode combinations for various simulations based on the E.coli genome

Genes	Simulation modes	Time
1000	Reference only	3 m 5 s
1000	Reference + SNP variant	3 m 12 s
1000	Reference + SNP/Indel variant	3 m 18 s
1000	Reference + Synonymous/Indel variant	3 m 24 s
1000	Reference +	3 m 41 s
1000	SNP/Indel/duplication variant
2000	Reference +	10 m 5 s
3000	SNP/Indel/duplication variant	16 m 40 s
E.coli (Full)	Reference +
E.coli (Full)	SNP/Indel/duplication variant	0 m 48 s
	Synonymous/Indel variant	0 m 40 s
	SNP/Indel variant
E.coli (Full)	SNP/Indel/duplication variant	0 m 41 s

Simulome execution time. Speed for various run mode combinations for various simulations based on the E.coli genome

7 in total

1. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

2. Genomic limitations to RNA sequencing expression profiling.

Authors: Cory D Hirsch; Nathan M Springer; Candice N Hirsch
Journal: Plant J Date: 2015-10-06 Impact factor: 6.417

3. ART: a next-generation sequencing read simulator.

Authors: Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal: Bioinformatics Date: 2011-12-23 Impact factor: 6.937

4. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

Review 5. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Authors: Peter J A Cock; Christopher J Fields; Naohisa Goto; Michael L Heuer; Peter M Rice
Journal: Nucleic Acids Res Date: 2009-12-16 Impact factor: 16.971

6. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data.

Authors: Jacob F Degner; John C Marioni; Athma A Pai; Joseph K Pickrell; Everlyne Nkadori; Yoav Gilad; Jonathan K Pritchard
Journal: Bioinformatics Date: 2009-10-06 Impact factor: 6.937

7. A new strategy to reduce allelic bias in RNA-Seq readmapping.

Authors: Ravi Vijaya Satya; Nela Zavaljevski; Jaques Reifman
Journal: Nucleic Acids Res Date: 2012-05-14 Impact factor: 16.971