Grinder: a versatile amplicon and shotgun sequence simulator.

Florent E Angly, Dana Willner, Forest Rohwer, Philip Hugenholtz, Gene W Tyson.

Abstract

We introduce Grinder (http://sourceforge.net/projects/biogrinder/), an open-source bioinformatic tool to simulate amplicon and shotgun (genomic, metagenomic, transcriptomic and metatranscriptomic) datasets from reference sequences. This is the first tool to simulate amplicon datasets (e.g. 16S rRNA) widely used by microbial ecologists. Grinder can create sequence libraries with a specific community structure, α and β diversities and experimental biases (e.g. chimeras, gene copy number variation) for commonly used sequencing platforms. This versatility allows the creation of simple to complex read datasets necessary for hypothesis testing when developing bioinformatic software, benchmarking existing tools or designing sequence-based experiments. Grinder is particularly useful for simulating clinical or environmental microbial communities and complements the use of in vitro mock communities.

Substances: RNA, Ribosomal, 16S

Year: 2012 PMID: 22434876 PMCID: PMC3384353 DOI: 10.1093/nar/gks251

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The rapid development of high-throughput sequencing technologies such as 454 and Illumina has made large-scale sequencing projects both feasible and affordable. Bioinformatic tools are constantly being developed to manage and analyze data generated by these new sequencing platforms. Rigorous evaluation of the accuracy of these tools requires either the sequencing of synthetic communities of known composition created in vitro or the generation of simulated datasets in silico, which can account for both community structure and technical aspects of sequencing such as read length and errors. The construction of artificial in vitro communities and nucleic acid pools in the laboratory is both expensive and labor intensive, which limits the number of sequence libraries that can be produced (1–6). Mavromatis et al. (7) circumvented the need for in vitro manipulations when assembling the FAMES artificial metagenomes using DNA reads from existing single-genome shotgun sequencing projects. While both approaches produce realistic datasets, they are limited by confounding factors such as genome length bias (8,9), DNA amplification bias (10,11) and sequencing artifacts (12,13), the extent of which is generally unknown and can compromise interpretation of the sequence data. In contrast, bioinformatic tools that produce simulated reads based on reference sequences in silico allow users to rapidly generate large numbers of sequence libraries with controlled and predefined parameters. Recently, characterization of microbial communities by 16S rRNA gene amplicon sequencing has experienced a renaissance, largely owing to the advent of high-throughput sequencing (14). This has spurred the development of an unprecedented number of tools and pipelines for the analysis of 16S rRNA amplicon sequences, but microbial ecologists lack a read simulator capable of generating synthetic amplicon libraries to validate existing and upcoming bioinformatic tools. 
To address this limitation and also to expand upon existing shotgun sequence simulators, we present Grinder, an open-source software package that generates in silico simulated amplicon and shotgun (genomic, metagenomic, transcriptomic and metatranscriptomic) libraries from reference sequences. Grinder incorporates error models for a variety of sequencing platforms, can generate paired-end reads with variable insert size, and libraries with a user-specified species composition. Grinder libraries can also be designed based on α diversity metrics and model-based community structures, while sets of related libraries can be created by providing their β diversity. Unlike existing read simulators, Grinder can simulate the multiplexed PCR process to produce barcoded amplicon reads for any gene of interest, while also introducing experimental artifacts such as chimeras and biological biases due to variations in gene copy number between different species.

MATERIALS AND METHODS

Grinder implementation

Overview

Grinder is a platform-independent software package implemented in Perl and uses the Bioperl toolkit (15). Grinder is designed to run on a standard desktop computer and can be installed using a Perl module installer or a Debian package. Grinder includes a full test suite that automatically validates all components during installation. Grinder uses the Mersenne Twister algorithm (16) to generate random numbers because the default random number generation routines in many packages, such as Java, are below simulation grade (17). The read simulation in Grinder generates amplicon (Figure 1A) or shotgun (Figure 1B) reads. While most steps in read simulation are common to shotgun and amplicon libraries, there is an additional initial step in amplicon simulation that identifies and extracts full-length amplicons in the input reference sequences based on the provided PCR primers (Figure 1, Step i). For both amplicon and shotgun read simulations, species relative abundance (which defines community structure) is calculated from rank-abundance models, α and β diversity (Figure 1, Step ii). Reads are selected from the community either from the beginning of the full-length reference amplicon (for amplicon datasets) or randomly in the reference shotgun sequences (for shotgun datasets) (Figure 1, Step iii). Finally, sequencing errors (indels, substitutions, homopolymers) are introduced in the reads in a position-specific manner (Figure 1, Step iv). An exhaustive list of options that affect these steps can be obtained at the command line using the standard help function (grinder --help), and all specific parameters used for a particular execution of Grinder can be put in a profile file to allow the easy reuse of complex custom configurations. A subset of the available options and features is described in detail below.

Figure 1.

Flowchart of the processes and parameters used by Grinder to generate related (A) amplicon and (B) shotgun libraries.

Input and output sequences for simulated datasets

Publicly available FASTA-formatted databases can readily be used in Grinder. For example, the curated microbial and viral genome sequences in the NCBI RefSeq collection (18) are suitable to produce artificial genomic, metagenomic or amplicon libraries. While reads can be taken from a reference sequence and its reverse complement, for example to simulate (meta)genomic data, strand-specific datasets such as some transcriptomes (19) can be put together by taking reads from only one strand, either forward or reverse, of the reference sequences. Curated gene-specific sequence databases such as Greengenes (20), Silva (21) and PseudoMLSA (22) can also be used to simulate amplicon datasets. Simulated read libraries are output as FASTA files with optional QUAL and FASTQ files as well as accompanying text files describing library content and community rank-abundances. Grinder offers many options to adjust the read characteristics. For example, read length can have a fixed value or follow a uniform or normal distribution and insert length for mate pairs or paired-end datasets can be specified in the same way. Detailed information for each read including its source, location on the reference sequence and introduced errors are provided in its description line, making reads entirely traceable for downstream analyses and applications (Supplementary Figure S1).
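The read-selection options described above (fixed, uniform or normal read lengths, and reads taken from one or both strands) can be illustrated with a minimal Python sketch. It is not Grinder's Perl implementation: the function names, the interpretation of the uniform model as mean ± spread and the 50% strand choice are assumptions made for illustration.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def draw_length(rng, mean, model="fixed", spread=0):
    """Draw a read length (hypothetical helper).

    fixed   : always `mean`
    uniform : integer in [mean - spread, mean + spread] (assumed meaning)
    normal  : normal with standard deviation `spread`, rounded
    """
    if model == "fixed":
        length = mean
    elif model == "uniform":
        length = rng.randint(mean - spread, mean + spread)
    elif model == "normal":
        length = round(rng.gauss(mean, spread))
    else:
        raise ValueError("unknown length model: " + model)
    return max(1, length)


def sample_shotgun_read(rng, reference, mean, model="fixed", spread=0,
                        both_strands=True):
    """Pick a random read; return (sequence, start, strand)."""
    length = min(draw_length(rng, mean, model, spread), len(reference))
    start = rng.randint(0, len(reference) - length)
    read = reference[start:start + length]
    strand = 1
    # Assumption: each strand is equally likely when both are allowed.
    if both_strands and rng.random() < 0.5:
        read = reverse_complement(read)
        strand = -1
    return read, start, strand
```

Returning the start position and strand alongside the sequence mirrors the idea that each simulated read stays traceable to its source; setting both_strands to False gives a strand-specific library.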

PCR simulation

A unique feature of Grinder relative to other read simulators is that a PCR simulation is performed when an amplicon read library is requested. The forward and reverse primers provided in a FASTA file by the user can contain degenerate residues following the IUPAC convention. In cases where PCR primers match different positions of a genome, several full-length amplicons will be extracted, except if these amplicons overlap, in which case only the smallest one will be extracted to mimic the PCR process (Figure 2). In subsequent Grinder steps, simulated amplicon reads are taken from the start of each full-length PCR amplicon, forward primer included.

Figure 2.

Grinder PCR amplicon selection process. All possible combinations of degenerate primer matches on the template DNA are considered. By default, Grinder will extract the shortest amplicon.
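The amplicon extraction step can be sketched in Python as follows. This is one possible reading of the description above, not Grinder's actual code: it searches only the given strand of the template, expands IUPAC degenerate residues into regular-expression character classes, and resolves overlapping candidates by keeping the shortest first. All function names are hypothetical.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement that also handles degenerate residues."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def primer_pattern(primer):
    """Compile a primer into a regex; degenerate bases become classes."""
    parts = []
    for base in primer.upper():
        choices = IUPAC[base]
        parts.append(choices if len(choices) == 1 else "[" + choices + "]")
    # A lookahead reports every site, including overlapping ones.
    return re.compile("(?=(" + "".join(parts) + "))")


def extract_amplicons(template, forward, reverse):
    """Return non-overlapping amplicons, preferring the shortest."""
    template = template.upper()
    fwd_starts = [m.start() for m in primer_pattern(forward).finditer(template)]
    rev_site = primer_pattern(reverse_complement(reverse))
    rev_ends = [m.start() + len(reverse) for m in rev_site.finditer(template)]
    # Every forward site followed by a non-overlapping reverse site.
    candidates = [
        (start, end)
        for start in fwd_starts
        for end in rev_ends
        if start + len(forward) <= end - len(reverse)
    ]
    candidates.sort(key=lambda span: span[1] - span[0])
    kept = []
    for start, end in candidates:
        # Accept a candidate only if it overlaps no shorter accepted one.
        if all(end <= s or start >= e for s, e in kept):
            kept.append((start, end))
    return [template[s:e] for s, e in sorted(kept)]
```

With the degenerate forward primer AARG matching two sites upstream of a single reverse-primer site, only the shorter of the two overlapping amplicons is returned, forward primer included.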

Community structure, diversity and multiplexed identifiers

Community structure for simulated shotgun or amplicon libraries can be specified in a text file listing species and their relative abundances. Unlike most read simulators, Grinder can alternatively generate community structures based on a specified community richness (α diversity) and a deterministic rank-abundance model (uniform, linear, power law, logarithmic or exponential), with species selected randomly during library construction. Another novel feature of Grinder is the simultaneous production of multiple read libraries (shotgun or amplicon) with related characteristics, allowing the user to vary the percentage of species shared between libraries and the percent of dominant species with different rank abundances (β diversity) (23). Multiplexed libraries consisting of individual barcoded samples pooled and sequenced on the same sequencing run can also be simulated by appending multiplexed identifiers (MIDs) given in a FASTA file to the beginning of each read. Optional MIDs are added to the reads prior to applying sequencing errors, so that MIDs may contain errors, as in real read libraries.
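To make the rank-abundance models concrete, the following Python sketch converts a richness and a model name into relative abundances. The functional forms and the meaning of the single parameter are assumptions chosen for illustration; the text above names the models but does not give their equations, so Grinder's own parameterization may differ.

```python
import math


def rank_abundance(richness, model="powerlaw", param=1.0):
    """Relative abundance of species ranked 1..richness.

    The formulas below are illustrative assumptions, not taken
    from the Grinder source.
    """
    ranks = range(1, richness + 1)
    if model == "uniform":
        weights = [1.0 for _ in ranks]
    elif model == "linear":
        # Assumed: weight decreases by a constant step with rank.
        weights = [float(richness - r + 1) for r in ranks]
    elif model == "powerlaw":
        weights = [r ** -param for r in ranks]
    elif model == "logarithmic":
        weights = [math.log(r + 1) ** -param for r in ranks]
    elif model == "exponential":
        weights = [math.exp(-param * r) for r in ranks]
    else:
        raise ValueError("unknown model: " + model)
    total = sum(weights)
    # Normalize so that the abundances sum to 1.
    return [w / total for w in weights]
```

For example, rank_abundance(100, "powerlaw", 2) describes a community of 100 species in which, under this assumed form, the most abundant species is four times as abundant as the second.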

Simulation of biological and experimental biases

Sequencing errors such as substitutions, indels (insertions and deletions) or homopolymers can be introduced in Grinder-simulated reads by specifying position-specific models (uniform, linear or polynomial). Sanger reads can be simulated by increasing the number of substitutions and indels linearly along the reads, from 1% at its 5′ end to 2% at its 3′ end (24,25) (Supplementary Figure S2A). A fourth-degree polynomial model was implemented to reflect the accrued error rate (e) of substitutions at the 3′ end of Illumina reads (26): $e = 3 \times 10^{-3} + 3.3 \times 10^{-8}\, i^{4}$, where i is the position from the 5′ end (in bp) (Supplementary Figure S2A). Grinder also uses several deterministic models to simulate the homopolymer errors typical of 454 pyrosequencing (25,27,28). The recent empirical homopolymer model described by Balzer et al. inserts more errors as the length of the homopolymeric region increases (27). This is achieved by assigning each homopolymer a new length (n′) that is normally distributed around the actual length n, but with a standard deviation that increases linearly with homopolymer length: $n' \sim N(n,\ 0.03494 + 0.06856\,n)$, for $n \geq 6$ (Supplementary Figure S2B). Quality files (FASTQ or QUAL) can be generated based on two user-specified values, one for low (e.g. 10) and one for high (e.g. 30) quality bases. Grinder assigns the low-quality score to introduced errors and the high-quality score to all other bases. Users requiring 454 pyrosequencing libraries with more realistic quality files (in native SFF format) can run Flowsim (27) on the reads generated by Grinder.

A known issue with amplicon sequencing is the formation of chimeras, spurious sequences formed during co-amplification of homologous genes (1,29). The most common type of chimera is a bimera, which results from the fusion of two amplicon template sequences. Higher order chimeras such as trimeras and quadrameras can also occur in amplicon read datasets, albeit at lower frequencies (30).
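As a rough illustration of the position-specific error models described in this section, the Python sketch below evaluates the linear Sanger model, the fourth-degree Illumina polynomial and the Balzer standard deviation. The function names are hypothetical. Two points are assumptions rather than statements from the text: the units of the polynomial are not given (its value exceeds 1 near position 100, which suggests a percentage), and homopolymers shorter than 6 bp are left unchanged here because the formula is only stated for n ≥ 6.

```python
import random


def sanger_error_percent(pos, read_length, start=1.0, end=2.0):
    """Linear model: `start` % at the 5' end rising to `end` % at the 3' end.

    `pos` is 1-based.
    """
    if read_length <= 1:
        return start
    return start + (end - start) * (pos - 1) / (read_length - 1)


def illumina_error(i):
    """Fourth-degree polynomial e = 3e-3 + 3.3e-8 * i^4 (units as in the text)."""
    return 3e-3 + 3.3e-8 * i ** 4


def balzer_sd(n):
    """Standard deviation of the simulated homopolymer length."""
    return 0.03494 + 0.06856 * n


def simulate_homopolymer_length(n, rng):
    """Draw a new homopolymer length n' ~ N(n, balzer_sd(n)) for n >= 6."""
    if n < 6:
        # Assumption: the text gives no formula below 6 bp, so keep n.
        return n
    return max(0, round(rng.gauss(n, balzer_sd(n))))
```

Because the standard deviation grows with n, a 6 bp homopolymer is drawn with a standard deviation of about 0.45 bp, whereas a 12 bp one is drawn with about 0.86 bp.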
In Grinder, chimeras are generated in one of two ways. In the first method, amplicon sequences and breakpoints are randomly selected in frame. The chimeras are generated by appending consecutive amplicon segments at the breakpoint. The second method is similar to that used by CHsim (31), i.e. chimeras are produced by concatenating two or more amplicon sequences, split at particular break points. The chosen breakpoints are k-mers, or short sequence stretches of k bp, shared by two amplicons and are more likely to be chosen if the amplicons are abundant and more similar to each other. Finally, biological bias affects sequence libraries. Similar to the bias described in metagenomes arising from genome length differences (8), the presence of several gene copies in a genome may affect the composition of an amplicon library (32). When complete genomes are used as input, the effect of variable gene copy number in different genomes is modeled in Grinder by sampling species proportionally to their relative abundance and to the number of copies of the amplicon in its genome, instead of proportionally to their relative abundance only.
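The gene copy number bias described above amounts to re-weighting each species before sampling. The Python sketch below is a minimal illustration under that reading; the function names and the example species are hypothetical, and it is not Grinder's implementation.

```python
import random


def sampling_weights(abundance, copy_number=None):
    """Probability of drawing a read from each species.

    abundance   : dict of species -> relative abundance
    copy_number : dict of species -> amplicon copies per genome;
                  None disables the copy number bias.
    """
    weights = {}
    for species, rel_ab in abundance.items():
        copies = copy_number.get(species, 1) if copy_number else 1
        weights[species] = rel_ab * copies
    total = sum(weights.values())
    return {species: w / total for species, w in weights.items()}


def sample_species(weights, n_reads, rng):
    """Draw the source species of n_reads reads, with replacement."""
    species = list(weights)
    probs = [weights[s] for s in species]
    return rng.choices(species, weights=probs, k=n_reads)
```

For two equally abundant species carrying one and three copies of the gene, the weights become 0.25 and 0.75, so the multi-copy species is over-represented in the simulated amplicon library.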

User interfaces

Grinder provides a command-line interface (CLI), graphical user interface (GUI) and application programming interface (API). The CLI can be used in a terminal and permits the automated generation of the many replicate datasets needed for statistical validation of bioinformatic tools. We have also implemented a GUI for Grinder on the Galaxy platform (Supplementary Figure S3) (33), which makes it possible to run Grinder through a web browser on any local desktop, remote server equipped with Galaxy or even on distributed computers (34). Unlike previous read simulators, Grinder also provides an object-oriented Perl API, which technical users can take advantage of when writing Perl pipelines. When using the API (Supplementary Figure S4), a Grinder factory has to be created first by using the new() method, which accepts the same options as the CLI. From there, the next_lib() method allows the user to proceed to the next sequence library and next_read() generates the next simulated read of that library. Each read produced is a Bio::Seq::SimulatedRead object (implemented in a Perl module written for Grinder and contributed to Bioperl) that has methods to query its nucleotide sequence, position, errors and other tracking information (Supplementary Figure S2).
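The automated generation of replicate datasets through the CLI could be scripted along the following lines. This Python sketch only builds the argument lists; it does not run anything. The option names (-reference_file, -total_reads, -read_dist, -random_seed, -base_name) are recalled from the Grinder documentation rather than taken from this article, so they should be treated as assumptions and checked against the help output of the installed version. The helper names and file names are hypothetical.

```python
def grinder_command(reference_fasta, total_reads, read_length, seed, base_name):
    """Build one Grinder invocation as an argument list.

    Option names are assumptions to verify against the installed
    version's help output.
    """
    return [
        "grinder",
        "-reference_file", reference_fasta,
        "-total_reads", str(total_reads),
        "-read_dist", str(read_length),
        "-random_seed", str(seed),
        "-base_name", base_name,
    ]


def replicate_commands(reference_fasta, n_replicates,
                       total_reads=1000, read_length=100):
    """One command per replicate, each with its own seed and output name."""
    return [
        grinder_command(reference_fasta, total_reads, read_length,
                        seed=i, base_name="replicate_%d" % i)
        for i in range(1, n_replicates + 1)
    ]
```

On a machine where Grinder is installed, each list can be passed to subprocess.run(cmd, check=True). Giving every replicate a distinct seed keeps the runs independent yet reproducible.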

16S rRNA amplicon case study

Eight amplicon libraries, each with a unique MID sequence, were generated from the Greengenes database of named isolates (http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Isolated_named_strains_16S_aligned.fasta) using the universal primer set for 16S rRNA: 926F and 1492R (35). For each library, 454 GS-FLX Titanium pyrosequencing was simulated by requesting 5000 reads with normally distributed lengths (mean: 400 bp, standard deviation: 50 bp) and homopolymer errors (Balzer model) (27). Two additional libraries were constructed without homopolymer errors. All libraries were designed to contain 100 unique phylotypes following a power law rank-abundance curve (with parameter value of 2) and to have 80% of their phylotypes in common. The resulting Grinder files are available in Supplementary Dataset S1. FASTA and QUAL files for all libraries were concatenated prior to analysis to mimic the output of multiplexed sequencing. QIIME (36) was used to separate the libraries based on their MID, to cluster the reads at 100% and 97% identity and to assign taxonomy by comparing sequences to the Greengenes database using BLAST. A normal distribution was also fit to the empirical distribution of sequence lengths in each library using the R function fitdistr (37).

RESULTS AND DISCUSSION

Recent advances in DNA sequencing technology have allowed for the rapid generation of large sequence datasets, ushering in the age of genomics and metagenomics. Platforms and chemistries evolve quickly, engendering newer generations of sequencing that rapidly replace old methods and require the development and refinement of bioinformatic tools for analysis. Proper algorithm design and implementation requires large amounts of sequence data. However, such data may not be publicly accessible or exist in the volume necessary for rigorous testing. In silico simulated datasets overcome these limitations and also allow study parameters (e.g. sample size), which may depend on sequencing depth and quality, to be optimized in advance.

Grinder for shotgun dataset simulation

Grinder incorporates many common features of existing read simulators (Table 1) including deterministic error profiles, support for paired-end reads and the generation of sequences characteristic of particular sequencing technologies. Similar to other modern read simulators, Grinder simulates sequencing errors, allowing users to flexibly specify their own error models or use preset values corresponding to known error profiles for the Sanger, 454 and Illumina platforms (Table 1). For example, Grinder was used with the Balzer error model (27) to test different short read alignment methods to improve PaPaRa (38).

Table 1.

Comparison of existing sequencing read simulators

Name	References	Lic.	Homepage	Lang.	Interf.	Dataset types	Paired-end	Sequencing technologies	Qual. scores	Distinguishing features
Grinder	Angly et al. 2012 (this article)	GPL	sf.net/projects/biogrinder	Perl	CLI, API, GUI	Amplicon, (meta)genomic, (meta)transcriptomic	Yes	Sanger, 454, Illumina	Yes	Species abundance models, α and β diversity, MIDs, FASTQ output, multimeras, genome length and gene copy number bias
GemSIM	McElroy KE (unpublished data)	GPL	sf.net/projects/gemsim	Python	CLI	(Meta)genomic	Yes	Sanger, 454, Illumina	Yes	Haplotypes, FASTQ and SAM output
Mason	Holtgrewe (44)	GPL	www.seqan.de/projects/mason.html	C++	CLI	Genomic	Yes	Sanger, 454, Illumina	Yes	Haplotypes, speed-focused
Flowsim	Balzer et al. (27)	GPL	biohaskell.org/Applications/FlowSim	Haskell	CLI	Genomic	No	454	Yes	Targets 454 simulation: SFF flowgram output, artificial replicates
MetaSim	Richter et al. (25)	Prop.	ab.inf.uni-tuebingen.de/software/metasim	Java	CLI, GUI	(Meta)genomic	Yes	Sanger, 454, Illumina	No	Genome evolution model
FASIM	Hur et al. (45)	Prop.	www.gem.re.kr/fasim	C	CLI	Genomic	No	Sanger	No	Biased sampling model, chimeras, chromatograms
CelSim	Myers (24)	Prop.	–	Awk, Perl	CLI	Genomic	No	Sanger	No	Repeat and variants generation
GenFrag	Engle and Burks (46,47)	Prop.	–	C	CLI	Genomic	No	Sanger	No	First read simulator

Lic, License; Prop, proprietary; Lang, Programming language; Interf, Interfaces; Sim, Simulation; Qual, Quality.

Grinder also includes unique features such as the ability to specify a community structure based on a given richness (number of species) and ecologically-realistic species-abundance models (39). Multiple libraries representing communities with a specified structure and α and β diversity can be generated simultaneously. The β diversity feature in Grinder was recently used to establish empirical cutoffs for statistically significant differences between viral metagenomes (40). Grinder also provides parameters to introduce sampling biases inherent in metagenomic studies into sequence libraries. The development and benchmarking of GAAS (8) relied on the unique capability of Grinder to account for how the different length of genomes in a microbial or viral community affects the number of reads obtained from these genomes in a metagenome.

Grinder for amplicon dataset simulation

Grinder is the first read simulator to generate amplicon datasets (Table 1). Amplicon sequencing has most commonly been used for the characterization of bacterial and archaeal communities, but its applications are rapidly expanding to include characterization of fungal (41) and viral populations (42) as well as HLA class I genotyping (43). Amplicon libraries can be created in Grinder both with and without copy number bias, i.e. correction for the presence of multiple amplicons in a single reference sequence, and also with and without multiplex identifiers. Grinder uses an input set of PCR primers to find amplicons in reference sequences (Figure 2), and thus can be applied to any desired target gene or sequence. To demonstrate the use of Grinder for amplicon reads, MID-barcoded 16S rRNA libraries with and without pyrosequencing errors were simulated. Grinder faithfully produced 5000 simulated amplicon reads with MIDs per library, in accordance with the input specifications: normally distributed read lengths (Figure 3A), power law rank-abundance and richness (Figure 3B) and β diversity (Figure 3C). All libraries were processed with QIIME, which yielded a total of 22 411 operational taxonomic units (OTUs) at 100% identity clustering, nearly 100 times the expected number. Kunin et al. (48) reported similar results for 454 amplicon pyrosequencing of Escherichia coli, demonstrating a 40- to 150-fold increase in the expected number of 100% OTUs depending on the type of quality filtering used. An approximately 100-fold increase in 100% OTUs due to homopolymer errors was also observed by Quince et al. (4). Consistent with the empirical observation of Kunin et al., 97% identity clustering reduced the number of OTUs, resulting in a rank-abundance distribution approaching the theoretical values (Figure 3B).

Figure 3.

Analysis of 10 MID-containing 16S rRNA gene amplicon libraries generated by Grinder that share 80% of their phylotypes. (A) Histogram of read lengths for the libraries and curve representing their expected normal distribution. (B) Log-log plot of phylotype rank-abundance in the MID1 amplicon library, with and without simulated sequencing errors, using 97% and 100% identity for OTU clustering in QIIME. (C) Heatmap comparison of the OTU distribution in the amplicon libraries analyzed with QIIME at 97% identity OTU clustering.

Comparison of the error-free libraries with their counterparts demonstrated changes in relative abundance for some OTUs, the introduction of 21 novel OTUs and the elimination of 15 others due to homopolymer errors (Figure 3C). While most of the discrepancies occurred for OTUs at a low abundance level (<1%), as previously reported (4,48,49), the decrease of two OTUs from a medium abundance level (1–25%) to a low abundance level (<1%) shows that care should be taken when analyzing amplicon data that contain sequencing errors. The simulated errors mostly affected low-abundance OTUs, artificially inflating the size of the rare biosphere (4,48,50). Overall, this example illustrates that Grinder is capable of creating realistic amplicon libraries and modeling the effects of 454 homopolymer errors on microbial community profiling using the 16S rRNA gene.

CONCLUSION

Grinder is a read simulator that generates shotgun and amplicon libraries for software benchmarking, algorithm development, statistical testing and educational purposes. Grinder has been used in this capacity to simulate large volumes of environmental and clinical sequence data (8,38,40,51). Grinder libraries can be given a variety of community structures by specifying an ecological species-abundance distribution and α diversity or β diversity and MIDs when multiple libraries are created simultaneously. As demonstrated here, Grinder has the unique ability to generate realistic 16S rRNA amplicon reads in silico with 454 homopolymer errors. The errors of current sequencing technologies can be flexibly specified in Grinder by combining several deterministic models. Sequencing technologies evolve rapidly, but the open-source nature of Grinder will facilitate the addition of new technologies such as IonTorrent (52) as their error profiles become available. By helping test hypotheses, create better bioinformatic tools and enhance data interpretation, the more systematic use of read simulators has the potential to accelerate the rate of biological discoveries. In this context, we believe that Grinder will be a valuable tool for bioinformaticians and biologists alike.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Figures 1–4 and Supplementary Dataset 1.

FUNDING

QEII Fellowship from the Australian Research Council, [DP1093175 (to G.W.T.)]; University of Queensland strategic funding of the Australian Centre for Ecogenomics. Funding for open access charge: F.E.A's Discovery Early Career Research Award.

REFERENCES (10 of 45 listed)

1. Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins.

Authors: Peter J Turnbaugh; Christopher Quince; Jeremiah J Faith; Alice C McHardy; Tanya Yatsunenko; Faheem Niazi; Jason Affourtit; Michael Egholm; Bernard Henrissat; Rob Knight; Jeffrey I Gordon
Journal: Proc Natl Acad Sci U S A Date: 2010-04-02 Impact factor: 11.205

2. Multiple displacement amplification compromises quantitative analysis of metagenomes.

Authors: Suzan Yilmaz; Martin Allgaier; Philip Hugenholtz
Journal: Nat Methods Date: 2010-12 Impact factor: 28.547

3. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.

Authors: T Z DeSantis; P Hugenholtz; N Larsen; M Rojas; E L Brodie; K Keller; T Huber; D Dalevi; P Hu; G L Andersen
Journal: Appl Environ Microbiol Date: 2006-07 Impact factor: 4.792

Review 4. A renaissance for the pioneering 16S rRNA gene.

Authors: Susannah G Tringe; Philip Hugenholtz
Journal: Curr Opin Microbiol Date: 2008-10-08 Impact factor: 7.934

5. Accurate determination of microbial diversity from 454 pyrosequencing data.

Authors: Christopher Quince; Anders Lanzén; Thomas P Curtis; Russell J Davenport; Neil Hall; Ian M Head; L Fiona Read; William T Sloan
Journal: Nat Methods Date: 2009-08-09 Impact factor: 28.547

6. Aligning short reads to reference alignments and trees.

Authors: Simon A Berger; Alexandros Stamatakis
Journal: Bioinformatics Date: 2011-06-02 Impact factor: 6.937

7. An integrated semiconductor device enabling non-optical genome sequencing.

Authors: Jonathan M Rothberg; Wolfgang Hinz; Todd M Rearick; Jonathan Schultz; William Mileski; Mel Davey; John H Leamon; Kim Johnson; Mark J Milgrew; Matthew Edwards; Jeremy Hoon; Jan F Simons; David Marran; Jason W Myers; John F Davidson; Annika Branting; John R Nobile; Bernard P Puc; David Light; Travis A Clark; Martin Huber; Jeffrey T Branciforte; Isaac B Stoner; Simon E Cawley; Michael Lyons; Yutao Fu; Nils Homer; Marina Sedova; Xin Miao; Brian Reed; Jeffrey Sabina; Erika Feierstein; Michelle Schorn; Mohammad Alanjary; Eileen Dimalanta; Devin Dressman; Rachel Kasinskas; Tanya Sokolsky; Jacqueline A Fidanza; Eugeni Namsaraev; Kevin J McKernan; Alan Williams; G Thomas Roth; James Bustillo
Journal: Nature Date: 2011-07-20 Impact factor: 49.962

8. The frequency of chimeric molecules as a consequence of PCR co-amplification of 16S rRNA genes from different bacterial species.

Authors: G C Wang; Y Wang
Journal: Microbiology (Reading) Date: 1996-05 Impact factor: 2.777

9. Analysis of high-throughput sequencing and annotation strategies for phage genomes.

Authors: Matthew R Henn; Matthew B Sullivan; Nicole Stange-Thomann; Marcia S Osburne; Aaron M Berlin; Libusha Kelly; Chandri Yandava; Chinnappa Kodira; Qiandong Zeng; Michael Weiand; Todd Sparrow; Sakina Saif; Georgia Giannoukos; Sarah K Young; Chad Nusbaum; Bruce W Birren; Sallie W Chisholm
Journal: PLoS One Date: 2010-02-05 Impact factor: 3.240

10. The marine viromes of four oceanic regions.

Authors: Florent E Angly; Ben Felts; Mya Breitbart; Peter Salamon; Robert A Edwards; Craig Carlson; Amy M Chan; Matthew Haynes; Scott Kelley; Hong Liu; Joseph M Mahaffy; Jennifer E Mueller; Jim Nulton; Robert Olson; Rachel Parsons; Steve Rayhawk; Curtis A Suttle; Forest Rohwer
Journal: PLoS Biol Date: 2006-11 Impact factor: 8.029