
Simulating Illumina metagenomic data with InSilicoSeq.

Hadrien Gourlé¹, Oskar Karlsson-Lindsjö², Juliette Hayer¹, Erik Bongcam-Rudloff¹.

Abstract

Motivation: The accurate in silico simulation of metagenomic datasets is of great importance for benchmarking bioinformatics tools as well as for experimental design. Users depend on large-scale simulation not only to design experiments and new projects but also to estimate accurately the computational needs of a project. Unfortunately, most current read simulators are either not suited for metagenomics, out of date or relatively poorly documented. In this article, we describe InSilicoSeq, a software package to simulate metagenomic Illumina sequencing data. InSilicoSeq has a simple command-line interface and extensive documentation.
Results: InSilicoSeq is implemented in Python and capable of simulating realistic Illumina (meta)genomic data in a parallel fashion with sensible default parameters.
Availability and implementation: Source code and documentation are available under the MIT license at https://github.com/HadrienG/InSilicoSeq and https://insilicoseq.readthedocs.io/.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30016412 PMCID: PMC6361232 DOI: 10.1093/bioinformatics/bty630

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

With the release of a growing number of bioinformatics tools, it has become challenging to know which tool performs best or is best suited for a particular experiment. The simulation of genomics and metagenomics data plays a prominent role both in the planning of an experiment and in the development of new methods. In contrast to real data, simulated data can be produced with controlled parameters such as, in the case of metagenomics, the abundance of the species present in a sample. Fixing such parameters allows new tools to be benchmarked and tested under controlled conditions, and provides researchers with mock data for testing new tools or pipelines. Such a testing environment is especially important for fast-growing sub-fields such as metagenomics (Escalona ). Additionally, simulated data have proven very useful in the classroom, where teachers often need mock datasets that are small enough to be analysed quickly and that yield clear, meaningful results that students can interpret easily (Halley, 1991). Surprisingly, only a few such simulators exist for metagenomics, and the existing solutions are often difficult or inconvenient to use as well as poorly maintained. Here, we describe InSilicoSeq, a software package that simulates realistic Illumina reads from (meta)genomes. InSilicoSeq is multi-threaded, well-documented and easily installed via Python’s package manager pip. InSilicoSeq aims to make the benchmarking and testing of (meta)genomics software easier.

2 Implementation and benchmarks

2.1 Implementation and features

InSilicoSeq is written in Python. It accurately models PHRED scores and supports substitution, insertion and deletion errors, as well as insert size distribution and GC bias. InSilicoSeq uses Kernel Density Estimation (KDE) to model base quality and insert size. Briefly, KDE is a non-parametric class of estimators that generally produces a smoother estimate of a distribution than histograms do (Silverman, 1986). Insertions, deletions, substitutions and GC bias are modelled empirically from aligned reads. In the current release, InSilicoSeq comes with pre-built error models for MiSeq, HiSeq and NovaSeq instruments, and provides a command to generate error models from any bam file containing aligned reads. The provided error models were calculated from aligned reads in bam format, generated from three datasets: PRJEB20178 for the MiSeq instrument and public datasets from Illumina Basespace for the HiSeq and NovaSeq instruments. The three datasets were assembled with megahit (Li ) and the reads were mapped back to the assembly using bowtie2 (Langmead and Salzberg, 2012) with default parameters. Because InSilicoSeq is designed for metagenomics, it generates reads from multiple genomes according to a log-normal abundance distribution by default. Other distributions are built in, and users can also supply the exact abundance of each input genome.
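To make the KDE idea concrete, the following is a minimal sketch of a Gaussian kernel density estimate fitted to insert sizes and then sampled. It is not InSilicoSeq's implementation: the function names, the normal-reference bandwidth rule (1.06·σ·n^(-1/5)) and the observed insert sizes are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def reference_bandwidth(samples):
    """Normal-reference rule-of-thumb bandwidth: 1.06 * sd * n^(-1/5)."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-1 / 5)


def kde_pdf(x, samples, bandwidth):
    """Gaussian kernel density estimate evaluated at x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / norm


def kde_sample(samples, bandwidth, rng):
    """Draw one value from the KDE: pick an observation, add kernel noise."""
    return rng.gauss(rng.choice(samples), bandwidth)


# Hypothetical observed insert sizes in bp (made-up numbers).
observed = [280, 295, 300, 305, 310, 312, 320, 330, 345, 360]
h = reference_bandwidth(observed)

# Simulate a few insert sizes from the smoothed distribution.
rng = random.Random(42)
simulated = [round(kde_sample(observed, h, rng)) for _ in range(5)]
```

Unlike a histogram, the estimate assigns non-zero density to insert sizes that were never observed exactly, so simulated values are not restricted to the bins of the training data.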

2.2 Usability

Existing software for simulating metagenomic data includes MetaSim (Richter ), NeSSM (Jia ), BEAR (Johnson ), FASTQSim (Shcherbina, 2014), GemSim (McElroy ), Grinder (Angly ), pIRS (Hu ) and FunctionSim (Lingling, 2014; https://cals.arizona.edu/∼anling/software/FunctionSIM.htm). We attempted to install and run all of the aforementioned software as well as ART (Huang ), a popular single-genome simulator; of the nine tools tested, only ART, Grinder and pIRS could be installed and run without issues. This is symptomatic of software development in several areas of science, including biology (List ; Rother ; Wilson ), and was one of the main drivers behind the development of InSilicoSeq (refer to the Supplementary Material for more information on the usability issues of the other simulators).

2.3 Benchmarks

InSilicoSeq can simulate half a million reads in under 10 min (Supplementary Fig. S1) using 4 CPUs and less than 1 GB of RAM, and produces more realistic datasets than the other tested simulators. While pIRS ran in under 1 min, Grinder took on average more than 13 h to generate half a million reads. Figure 1 shows the per-base quality distribution of forward reads simulated with InSilicoSeq, ART and pIRS compared to real data. InSilicoSeq and ART closely model the base quality of the MiSeq dataset, while pIRS reports all bases with a PHRED score of 40. For a figure including Grinder as well as reverse reads, refer to Supplementary Figure S2.
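For readers less familiar with PHRED scores, the short sketch below shows the standard relationship Q = -10·log10(p) between a score and the probability p that a base call is wrong, and how scores are decoded from a FASTQ quality string. The function names are illustrative, and the Phred+33 offset is an assumption that holds for current Illumina FASTQ files.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into scores."""
    return [ord(char) - offset for char in qual]


# A read reported entirely at Q40 claims one error per 10,000 bases.
p_q40 = phred_to_error_prob(40)
# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
scores = decode_quality_string("II55")
```

A simulator that assigns Q40 to every base therefore reports a uniform, very low error probability, which real MiSeq reads do not show towards their 3' ends.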

Fig. 1.

Per-base PHRED score distribution of simulated data (forward reads). The grey lines indicate the 10% and 90% quantiles, the orange lines indicate the lower and upper quartiles and the blue dot is the median. InSilicoSeq and ART are the most faithful to the real data, while pIRS assigns a PHRED score of 40 to all bases. For a figure including the forward and reverse reads as well as the qualities from Grinder, refer to Supplementary Figures S2 and S3.

One difficult part of generating realistic Illumina data is generating low-quality sequences. InSilicoSeq models this by clustering the sequences by mean quality before modelling the per-base quality distribution. Supplementary Figure S2 shows that our approach outperforms ART, Grinder and pIRS: InSilicoSeq is the only simulator to produce low-quality sequences with a mean quality below 20.
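The clustering step can be illustrated with a small sketch that groups reads by mean quality and then collects per-position scores separately for each group. This is only an illustration of the idea under stated assumptions, not InSilicoSeq's code: the fixed bin width of 10, the function names and the toy reads are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def bin_reads_by_mean_quality(reads, bin_width=10):
    """Group per-read PHRED score lists into bins of mean quality.

    The key is the lower edge of the bin, e.g. 30 for means in [30, 40).
    """
    bins = defaultdict(list)
    for scores in reads:
        lower_edge = int(mean(scores) // bin_width) * bin_width
        bins[lower_edge].append(scores)
    return dict(bins)


def per_position_scores(reads):
    """Collect the scores observed at each position (equal-length reads)."""
    return [list(column) for column in zip(*reads)]


# Toy reads with made-up PHRED scores, four bases each.
reads = [
    [38, 37, 36, 30],  # mean 35.25, falls in bin 30
    [35, 34, 33, 30],  # mean 33, falls in bin 30
    [22, 18, 15, 12],  # mean 16.75, falls in bin 10
]
binned = bin_reads_by_mean_quality(reads)
# One per-position score table per bin; a density could be fitted to each column.
models = {edge: per_position_scores(group) for edge, group in binned.items()}
```

Keeping a separate per-position distribution for each group means that low-quality reads are not averaged away by the more numerous high-quality reads.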

3 Conclusion

We developed a simulator that is free, open-source, well-tested, easy to install, sufficiently documented and driven by a single command (iss). InSilicoSeq produces realistic Illumina data with error models based on recent Illumina machines and chemistry. New models can be produced from bam files in less than an hour, making it easy to keep them up to date. InSilicoSeq produces more realistic data than existing metagenomics simulation methods and is useful for planning experiments and benchmarking new methods.

Funding

This work was supported by the Swedish Research Council, grant number 2015-03443_VR.

Conflict of Interest: none declared.

14 in total

1. pIRS: Profile-based Illumina pair-end reads simulator.

Authors: Xuesong Hu; Jianying Yuan; Yujian Shi; Jianliang Lu; Binghang Liu; Zhenyu Li; Yanxiang Chen; Desheng Mu; Hao Zhang; Nan Li; Zhen Yue; Fan Bai; Heng Li; Wei Fan
Journal: Bioinformatics Date: 2012-04-15 Impact factor: 6.937

2. A toolbox for developing bioinformatics software.

Authors: Kristian Rother; Wojciech Potrzebowski; Tomasz Puton; Magdalena Rother; Ewa Wywial; Janusz M Bujnicki
Journal: Brief Bioinform Date: 2011-07-29 Impact factor: 11.622

3. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.

Authors: Dinghua Li; Chi-Man Liu; Ruibang Luo; Kunihiko Sadakane; Tak-Wah Lam
Journal: Bioinformatics Date: 2015-01-20 Impact factor: 6.937

4. Fast gapped-read alignment with Bowtie 2.

Authors: Ben Langmead; Steven L Salzberg
Journal: Nat Methods Date: 2012-03-04 Impact factor: 28.547

5. ART: a next-generation sequencing read simulator.

Authors: Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal: Bioinformatics Date: 2011-12-23 Impact factor: 6.937

6. Grinder: a versatile amplicon and shotgun sequence simulator.

Authors: Florent E Angly; Dana Willner; Forest Rohwer; Philip Hugenholtz; Gene W Tyson
Journal: Nucleic Acids Res Date: 2012-03-19 Impact factor: 16.971

7. NeSSM: a Next-generation Sequencing Simulator for Metagenomics.

Authors: Ben Jia; Liming Xuan; Kaiye Cai; Zhiqiang Hu; Liangxiao Ma; Chaochun Wei
Journal: PLoS One Date: 2013-10-04 Impact factor: 3.240

8. FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.

Authors: Anna Shcherbina
Journal: BMC Res Notes Date: 2014-08-15

9. A better sequence-read simulator program for metagenomics.

Authors: Stephen Johnson; Brett Trost; Jeffrey R Long; Vanessa Pittet; Anthony Kusalik
Journal: BMC Bioinformatics Date: 2014-09-10 Impact factor: 3.169

10. Ten Simple Rules for Developing Usable Software in Computational Biology.

Authors: Markus List; Peter Ebert; Felipe Albrecht
Journal: PLoS Comput Biol Date: 2017-01-05 Impact factor: 4.475
