
In Silico Whole Genome Sequencer and Analyzer (iWGS): a Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies.

Xiaofan Zhou¹, David Peris², Jacek Kominek², Cletus P Kurtzman³, Chris Todd Hittinger⁴, Antonis Rokas⁵.

Abstract

The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to give the user great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.


Keywords: de novo assembly; experimental design; genome sequencing; high-throughput sequencing; nonmodel organism; simulation

Year: 2016 PMID: 27638685 PMCID: PMC5100864 DOI: 10.1534/g3.116.034249

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154

Whole genome sequences are rich sources of information about organisms that are superbly useful for addressing a wide variety of evolutionary questions, such as measuring mutation rates (Kumar and Subramanian 2002), characterizing the genomic basis of adaptation (Roux ), and building the tree of life (Rokas ; Salichos and Rokas 2013). Until now, however, organismal diversity has been highly unevenly covered, and most sequenced genomes correspond to model organisms, organisms of medical or economic importance, or ones that have relatively small and simple genomes (Reddy ). The rapid advance of DNA sequencing technologies has dramatically reduced the labor and cost required for genome sequencing, which is evidenced by the burst of large-scale genome projects in recent years that includes, for example, the 1000 Fungal Genomes (1KFG) Project (Grigoriev ), the Yeast 1000 Plus (Y1000+) Project (Hittinger ), the Insect 5K Project (Robinson ), and the Genome 10K Project (Genome 10K Community of Scientists 2009). Some of these projects have already begun to fuel important discoveries in evolution and other fields (Zhang ). Equally importantly, high-throughput DNA sequencing has made it possible for single investigators to perform de novo genome sequencing in virtually any organism they are interested in (Rokas and Abbot 2009). Such sequencing efforts may target various organisms with a large diversity of genome architectures. Therefore, to achieve optimal results, the choice of sequencing strategy (i.e., the combination of sequencing technology [e.g., Illumina or Pacific Biosciences (PacBio)], sequencing assay [e.g., paired-end or mate-pair], and other variables, such as sequencing depth) and assembly protocols (e.g., assemblers and the associated parameters) should ideally be tailored to the characteristics of a given genome, such as size and GC/repeat content (Nagarajan and Pop 2013).
The vast majority of de novo sequenced genomes have been generated using the Illumina technology, either solely or in combination with other technologies (Reddy ). This is largely due to the Illumina technology’s ability to quickly generate tens to hundreds of millions of highly accurate short sequence reads of up to 300 bases per run at very low per base cost (Glenn 2011). Additionally, the Illumina technology offers two powerful sequencing assays, paired-end (PE) and mate-pair (MP), which generate sequence read pairs that span short (hundreds of base-pairs) and relatively long (thousands of base-pairs) genomic regions, respectively. Mixing multiple PE and MP libraries with different insert sizes allows for highly flexible sequencing strategies, and several state-of-the-art assembly algorithms have been developed that exploit all these advantages. For instance, the de novo genome assembler ALLPATHS-LG can generate high quality draft assemblies for mammalian-size genomes using only Illumina short-read data by including both MP and overlapping PE libraries (Gnerre ). On its own, however, the Illumina technology performs less well for more complex genomes, mainly due to the short lengths of Illumina sequence reads and the technology’s bias against certain genomic regions (e.g., GC-rich regions) (Ross ). The PacBio technology generates sequence reads that are substantially longer and have much less sequencing bias, albeit at the cost of a substantially lower per-read accuracy; the average read length increases to above 10 kb with the latest chemistry but displays only ∼87% accuracy (Koren and Phillippy 2015). Thus, this technology is particularly useful for the sequencing of complex genomes (Koren and Phillippy 2015). 
Recent developments in both sequencing chemistry and assembly algorithms have enabled PacBio-only de novo assembly for microbial genomes (Koren ), but the high sequence coverage required for this approach remains cost-prohibitive for large eukaryotic genomes. Nevertheless, in combination with more affordable Illumina short-read data, PacBio long reads—even at low coverage—can lead to significantly improved assemblies (Utturkar ; McIlwain ). De novo genome sequencing projects are further complicated by the large array of assembly software tools, which differ in many aspects, such as algorithmic design, supported/required data types, and computational efficiency (Nagarajan and Pop 2013; Simpson and Pop 2015). Systematic evaluations of assembly programs show that no single assembler is the best across all circumstances; rather, an assembler’s performance critically depends on genome complexity and the sequencing strategy adopted (Earl ; Bradnam ). Moreover, many assemblers use adjustable parameters (e.g., the k-mer size for de Bruijn assemblers), the values of which can critically affect the assembly quality. In practice, such parameters are often selected intuitively or through the time-consuming process of testing multiple values. The great number of possible ways to combine sequencing technologies, assays, and assembly algorithms poses a great challenge for the experimental design and data analysis in de novo genome sequencing projects, which in turn can sometimes lead to poor quality or downright incorrect assemblies (Denton ). As a consequence, several pipelines have been developed to automate specific steps in the process; for example, the recently developed iMetAMOS (Koren ) and RAMPART (Mapleson ) have been specifically designed to automate genome assembly. 
However, as de novo genome sequencing is increasingly adopted by single investigator laboratories, there is an urgent need for streamlined approaches that enable investigators to not only efficiently generate high-quality draft genome assemblies but also to predict (via simulation) and identify the most suitable design(s) [i.e., the most suitable combination(s) of sequencing strategy and assembly protocol] currently available for a specific genome. To address this need, we have developed an automated pipeline for the design and execution of de novo genome sequencing projects that we name iWGS (in silico Whole Genome Sequencer and Analyzer). To approximate the performance of different sequencing strategies and assembly protocols, iWGS simulates high-throughput genome sequencing on user-provided reference genomes (e.g., genomes that closely represent the characteristics of the real targets), facilitating the identification of optimal experimental designs. iWGS allows users to experiment with various combinations of sequencing technologies, assays, assembly tools, and relevant parameters in a single run. iWGS is also designed to work with real data and can be used as a convenient tool for automated selection of the best assembly or genome assembler. Finally, using three case studies, each one focused on specific challenges frequently encountered in de novo genome sequencing studies (e.g., high repeat content and biased nucleotide composition), we illustrate how iWGS can be applied to guide the design and analysis of de novo genome sequencing studies.

Results

The design of iWGS

iWGS encompasses all major steps of a typical de novo genome sequencing study, including the generation of sequence reads, data quality control, de novo assembly, and evaluation of assemblies (Figure 1).

Figure 1

iWGS workflow. A typical iWGS analysis consists of four steps: (1) data simulation (optional); (2) preprocessing (optional); (3) de novo assembly; and (4) assembly evaluation. iWGS supports both Illumina short reads and PacBio long reads, and a wide selection of assemblers to enable de novo assembly using either or both types of data. Users can start the analysis simulating data drawn from a reference genome assembly or, alternatively, use real sequencing data as input and skip the simulation step. iWGS, in silico Whole Genome Sequencer and Analyzer; MP, mate-pair; PacBio, Pacific Biosciences; PE, paired-end.

Simulation:

iWGS uses the realistic high-throughput sequencing (HTS) read simulators ART (Huang ), pIRS (Hu ), and PBSIM (Ono ) to generate Illumina and PacBio sequence reads from a given user-specified genome. These programs can simulate all popular data types, including Illumina PE and MP sequence reads, as well as PacBio continuous long sequence reads. The distributions of read quality and read length are easily adjustable for both Illumina and PacBio data. Furthermore, these simulators mimic sequencing errors and nucleotide composition biases in real data by using empirical profiles of these artifacts, which can be easily customized to stay current with upgrades in sequencing technologies. For instance, we have created a quality-score frequency profile learned from sequence reads generated by the latest PacBio chemistry to better reflect the improved sequence read accuracy. This simulation step can be omitted when the goal is the analysis of real data. Alternatively, the users may choose to perform only the simulation and use the simulated data for other analyses.
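To make concrete what paired-end read simulation involves, the following is a minimal toy sketch in Python. It is not ART, pIRS, or PBSIM and does not use their empirical error or GC-bias profiles: fragments are sampled uniformly from the forward strand, insert sizes are drawn from a normal distribution, and errors are uniform substitutions. All function names are hypothetical, and the library settings in the usage lines are taken from LIB1 in Table 1.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def add_substitutions(read, error_rate, rng):
    """Replace each base by a different base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_pairs(genome, depth, read_len, insert_mean, insert_sd,
                   error_rate=0.0, seed=0):
    """Sample paired-end reads until the requested depth is reached.

    Depth is taken as total read bases divided by genome length
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_pairs = int(depth * len(genome) / (2 * read_len))
    pairs = []
    while len(pairs) < n_pairs:
        insert = round(rng.gauss(insert_mean, insert_sd))
        if insert < read_len or insert > len(genome):
            continue  # resample fragments that cannot hold a full read
        start = rng.randint(0, len(genome) - insert)
        fragment = genome[start:start + insert]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((add_substitutions(forward, error_rate, rng),
                      add_substitutions(reverse, error_rate, rng)))
    return pairs


# Toy 5 kb random "genome"; depth, read length, and insert size follow LIB1.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(5000))
pairs = simulate_pairs(toy_genome, depth=50, read_len=100,
                       insert_mean=180, insert_sd=9)
print(len(pairs), "read pairs simulated")
```

A realistic simulator additionally models position-dependent quality scores, indels, and coverage bias, which is why iWGS delegates this step to dedicated tools.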

Quality control:

HTS data generated by all technologies contain errors and artifacts, which may sometimes substantially compromise the quality of the assembly (Zhou and Rokas 2014). Therefore, iWGS includes an optional step to perform preprocessing of the data, including trimming of low-quality bases, removal of adapter contamination, and correction of sequencing errors. Since some assemblers [e.g., ALLPATHS-LG (Ribeiro )] have their own preprocessing modules, iWGS automatically determines for each assembly protocol whether to use the original or the processed data.
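As an illustration of the simplest of these preprocessing operations, the sketch below trims low-quality bases from the 3' end of a read. It is a toy example with hypothetical function names, not the tool that iWGS invokes; it assumes Phred+33 encoded quality strings and a quality cutoff of 20, both of which are choices made here for illustration.

```python
def phred_from_ascii(qual_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in qual_string]


def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose Phred score is below min_q.

    Only the 3' end is examined; low-quality bases inside the read
    are kept, so this is weaker than sliding-window trimming.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2 under Phred+33.
seq, quals = trim_3prime("ACGTACGT", phred_from_ascii("IIIII5##"))
print(seq, quals)
```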

Assembly:

To maximize users’ flexibility in experimental design, iWGS supports 15 de novo genome assembly tools [ABySS (Simpson ), ALLPATHS-LG (Ribeiro ), Celera Assembler (Myers ; Berlin ), Canu (Koren ), DBG2OLC (Ye et al. 2014), DISCOVAR (Weisenfeld ), Falcon (Chin ), MaSuRCA (Zimin ), Meraculous (Chapman ), Minia (Salikhov ), Platanus (Kajitani ), SGA (Simpson and Durbin 2012), SOAPdenovo2 (Luo ), SPAdes (and a diploid-aware version called dipSPAdes) (Bankevich ), and Velvet (Zerbino and Birney 2008)], most of which have been included in recent large-scale assembler comparisons (Bradnam ; Magoc ). These supported assemblers allow users to carry out de novo assembly using only Illumina short-read data (e.g., SOAPdenovo2) or only PacBio long-read data (e.g., Canu and Falcon), or to perform hybrid assembly that uses both (e.g., SPAdes and DBG2OLC). To achieve the best possible results while avoiding the computationally expensive process of testing multiple combinations of parameters, iWGS takes advantage of successful assembly recipes (i.e., recommended settings for each assembler) established in studies such as Assemblathon 2 (Bradnam ) and GAGE-B (Magoc ), and uses KmerGenie to determine the optimal k-mer size (Chikhi and Medvedev 2014). In addition, assemblies generated from different underlying data and/or assembly algorithms can be merged using Metassembler (Wences and Schatz 2015) to achieve a potentially better final assembly.
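The idea of choosing a k-mer size from the reads themselves can be illustrated with a toy sketch. The code below is not the KmerGenie algorithm, which fits a statistical model to k-mer abundance histograms; as a crude stand-in, it counts the distinct k-mers observed at least twice (a rough proxy for error-free k-mers) and picks the candidate k with the most of them. The function names and the `min_count` threshold are assumptions made for illustration, and reverse complements are not merged.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_count=2):
    """Number of distinct k-mers seen at least min_count times."""
    return sum(1 for c in count_kmers(reads, k).values() if c >= min_count)


def pick_k(reads, candidates, min_count=2):
    """Return the candidate k with the most solid k-mers (toy criterion)."""
    return max(candidates, key=lambda k: solid_kmers(reads, k, min_count))


reads = ["ACGT", "CGTA"]
print({k: solid_kmers(reads, k) for k in (2, 3)}, "chosen k:", pick_k(reads, [2, 3]))
```

On real data, the trade-off is that larger k resolves more repeats but leaves fewer error-free k-mers at a given depth, which is why the choice depends on both the genome and the sequencing strategy.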

Evaluation:

iWGS uses QUAST (Gurevich ) to evaluate all generated assemblies. In addition to providing basic statistics like N50 (the largest contig/scaffold size wherein half of the total assembly size is contained in contigs/scaffolds no shorter than this value), QUAST compares each assembly against the reference genome (in the case of simulations) and generates a number of highly informative quality metrics, such as misassemblies, assembled sequences not present in the reference (and vice versa), and genes recovered in the assembly if the reference genome is annotated. At the end, iWGS ranks all assemblies based on selected metrics in the QUAST report using a previously described weighting strategy (Abbas ). This ranking, along with the detailed QUAST report, helps users to identify the best overall assembly, as well as the corresponding combination of sequencing strategy and assembly protocol. REAPR, which utilizes the sequence data itself for assembly evaluation, is also implemented to better suit real data analysis (Hunt ).

iWGS is designed with flexibility and ease-of-use in mind to allow users to readily examine various experimental designs; each data set may be used multiple times in different assembly protocols, and each assembler may be run repeatedly with different input data sets. Multiple sequencing strategies and assembly protocols can be specified in a straightforward fashion in a single configuration file; only a few parameters are required for each strategy/protocol, while other settings (e.g., quality profiles for read simulation) are globally shared across strategies/protocols of the same type. Alternatively, advanced users can opt to customize the strategies/protocols so that, for example, each sequencing data set is simulated with different quality settings. Furthermore, iWGS rigorously checks the configurations for issues such as the compatibility between sequencing strategies and assembly protocols.

iWGS is a lightweight pipeline written in Perl. The source code, detailed documentation, and example test sets are freely available at https://github.com/zhouxiaofan1983/iWGS. Like many other bioinformatics pipelines, iWGS inevitably relies on a number of third-party software tools to carry out individual analyses such as data simulation and genome assembly. However, most of the tools, including at least one for each of the four major steps described above, either have precompiled executables or can be compiled locally with ease. For the convenience of users, we also include in the package scripts to automate the acquisition and installation of most software dependencies. The users can also customize the selection of tools to install according to their own needs and computational environments.
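The N50 statistic defined in the Evaluation step can be computed in a few lines. The Python sketch below follows that definition directly (the function name is our own, and this is not QUAST code): contigs are visited from longest to shortest, and the length at which the running total first reaches half of the assembly size is returned.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


contigs = [80, 70, 50, 40, 30, 20, 10]  # total = 300
print(n50(contigs))  # 80 + 70 = 150 reaches half of 300, so N50 = 70
```

Because N50 rewards contiguity regardless of correctness, it is best read together with the reference-based metrics mentioned above, such as misassemblies.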

Case studies

To demonstrate the use of iWGS and provide examples of its utility, we developed three case studies where iWGS was used to guide the selection of sequencing strategy for genomes representing a wide range of sizes and complexity levels (Supplemental Material, Table S1). The competing strategies were selected to enable both Illumina-only and PacBio-only assemblies, as well as hybrid assembly of the two data types (Table 1). To examine the effectiveness of the simulation step of our approach, we also analyzed real sequencing data that largely match our simulation settings.

Table 1

Sequencing strategies (top) and assembly protocols (bottom) evaluated in the three case studies

Name	Read Type	Parameters for Read Simulation
LIB1	Illumina PE	Depth: 50×; read length: 100 bp; insert size: 180 ± 9 bp
LIB2	Illumina MP	Depth: 50×; read length: 100 bp; insert size: 8000 ± 400 bp
LIB3	Illumina PE	Depth: 50×; read length: 250 bp; insert size: 450 ± 23 bp
LIB4	PacBio CLR	Depth: 60×; read accuracy: 0.87 ± 0.03; read length: 11,500 ± 8000 bp
LIB5	PacBio CLR	Depth: 10×; read accuracy: 0.87 ± 0.03; read length: 11,500 ± 8000 bp

Name	Assembler	Sequencing Strategies Used for Assembly
ILMN1	ABySS	LIB1, LIB2 (Illumina-only)
ILMN2	ALLPATHS-LG	LIB1, LIB2 (Illumina-only)
ILMN3	MaSuRCA	LIB1, LIB2 (Illumina-only)
ILMN4	SGA	LIB1, LIB2 (Illumina-only)
ILMN5	SOAPdenovo2	LIB1, LIB2 (Illumina-only)
ILMN6	SPAdes	LIB1, LIB2 (Illumina-only)
ILMN7	Velvet	LIB1, LIB2 (Illumina-only)
META	Metassembler	LIB1, LIB2 (Illumina-only)
ILMN8	DISCOVAR	LIB3 (Illumina-only)
PACB1	Celera Assembler	LIB4 (PacBio-only)
PACB2	Canu	LIB4 (PacBio-only)
PACB3	FALCON	LIB4 (PacBio-only)
HYBR1	SPAdes	LIB1, LIB2, LIB5 (Hybrid)
HYBR2	DBG2OLC^a	LIB1, LIB5 (Hybrid)

PE, paired-end; MP, mate pair; PacBio, Pacific Biosciences; CLR, continuous long-read.

^a SparseAssembler (Ye ) was used to assemble LIB1 into contigs, which were then used as input for DBG2OLC.

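The depth settings in Table 1 translate into read counts once a genome size is fixed. The short Python sketch below shows the arithmetic, assuming that depth is defined as total sequenced bases divided by genome size and that both reads of a pair count toward it; the function name is hypothetical. The usage lines apply it to the D. melanogaster genome size listed in Table 3 (approximately 137.55 Mb).

```python
import math


def reads_needed(genome_size, depth, read_len, paired=True):
    """Number of reads (or read pairs, if paired) needed for a given depth.

    Assumes depth = total read bases / genome size.
    """
    total_bases = genome_size * depth
    bases_per_unit = read_len * (2 if paired else 1)
    return math.ceil(total_bases / bases_per_unit)


genome_size = 137_550_000  # D. melanogaster, from Table 3

# LIB1: 50x of 100 bp paired-end reads
print(reads_needed(genome_size, 50, 100))  # 34387500 read pairs

# LIB4: 60x of PacBio reads with a mean length of 11,500 bp
print(reads_needed(genome_size, 60, 11_500, paired=False))  # 717653 reads
```

The same calculation, combined with a per-read or per-run price, gives a first estimate of the cost of each sequencing strategy.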

Case study I (repeat-content issue):

We first compared the sequencing of two fungi, Zymoseptoria tritici (synonym: Mycosphaerella graminicola) (Goodwin ) and Pseudocercospora fijiensis (synonym: M. fijiensis) (Ohm ), which both belong to the class Dothideomycetes yet have dramatically different repeat contents; the estimated repeat contents are ∼15 and ∼50% for the two genomes, respectively. Our simulations showed that, while good quality assemblies can be obtained for Z. tritici using either data type, the PacBio-only assembly for Ps. fijiensis vastly outperforms assemblies based on Illumina data alone or in combination with low-coverage PacBio data (Figure 2). The results are consistent with the notion that PacBio long reads are particularly powerful in resolving repeats (Koren ). We then further tested if these results are informative for guiding the sequencing of another highly repetitive Dothideomycetes genome, Cenococcum geophilum, which has a repeat content of ∼76% (http://genome.jgi.doe.gov/Cenge3). For C. geophilum, the PacBio-only assembly was again found to be the best, while the hybrid assembly using DBG2OLC and the Illumina-only assembly using ALLPATHS-LG were next in rank (Figure 2 and Table S2), nicely recapitulating the results of Ps. fijiensis. We also performed meta-assembly of Illumina-only assemblies ILMN1 to ILMN7 (Table 1) on all three genomes using Metassembler. While the meta-assembly approach substantially improved the assembly continuity for Z. tritici, no improvement was observed for Ps. fijiensis and C. geophilum (Figure 2 and Table S2). These results suggest that the use of iWGS would provide critical information to help end users choose a successful sequencing strategy for highly repetitive genomes that share similar characteristics. Importantly, since simulated assemblies are recoverable, the likely impact of the different assembly strategies on genes, gene families, or pathways of interest could also be examined in detail.

Figure 2

Performance comparison of five representative experimental designs on three Dothideomycetes genomes. The five designs shown include three Illumina-only designs (ILMN2: ALLPATHS-LG, META: Metassembler, and ILMN8: DISCOVAR), the best performing PacBio-only design (PACB2: Canu), and the best performing hybrid design (HYBR2: DBG2OLC) for each genome. The statistics on the assembled fraction of the reference genome, scaffold N50, and largest scaffold size are all after correction for assembly errors using the reference genome as reported by QUAST in GAGE mode. By default, QUAST (in GAGE mode) corrects contigs/scaffolds by breaking them at assembly errors larger than 5 bp. Scaffold N50 and largest scaffold size are shown in log10 scale.

Case study II (GC-content and mtDNA assembly issue):

We next examined the de novo assembly of mitochondrial genomes from whole genome sequencing data of Saccharomyces cerevisiae (Mewes ; Foury ). Yeast mitochondrial genomes are valuable resources for evolutionary and functional studies (Freel ), yet the acquisition of finished mitochondrial genome assemblies is not trivial because of their very low GC-content (∼17%). We simulated a genome sequencing experiment using the nuclear and mitochondrial genomes of S. cerevisiae. We tested two ratios of nuclear to mitochondrial genome copy numbers representing low (1:50) and high (1:200) mitochondrial contents, respectively (Solieri 2010). iWGS analysis showed that the S. cerevisiae mitochondrial genome was fully recovered at both low and high mitochondrial contents using Illumina data (Table 2). Consistent with recent observations made during the assembly of the S. eubayanus genome, only certain assemblers performed well; for example, ALLPATHS-LG performed surprisingly poorly, while SPAdes performed quite well (Baker ). Importantly, the complete mitochondrial genome can be obtained as a single contig using only Illumina or only PacBio data, or using both data types (Table 2). Similarly, both Illumina and PacBio data resulted in good quality assemblies of the nuclear genome (Table S2). At the same time, different assemblers exhibited widely different performances even with the same input data (Table 2).

Table 2

Performance of all experimental designs evaluated in case study II

Nuclear:Mitochondrial Genome Ratio	Performance of Strategies^a
Nuclear:Mitochondrial Genome Ratio	Complete, Single Contig Assembly of the Mitochondrial Genome	Assembled Fraction of Mitochondrial Genome ≥ 99%	20% ≤ Assembled Fraction of Mitochondrial Genome < 99%	Assembled Fraction of Mitochondrial Genome < 20%
1:50 (low mitochondrial content)	ILMN1, ILMN6, ILMN8, PACB2, HYBR1, HYBR2	ILMN7	ILMN2, ILMN4, ILMN5, PACB1, PACB3	ILMN3
1:200 (high mitochondrial content)	PACB2, HYBR1, HYBR2	ILMN6, ILMN7	ILMN1, ILMN8, PACB1, PACB3	ILMN2, ILMN3, ILMN4, ILMN5

^a The de novo assembly generated by each strategy was compared against the reference mitochondrial genome of S. cerevisiae using both QUAST and BLASTN. Unless a single contig was found to represent the complete mitochondrial genome, the assembled fraction of mitochondrial genome was determined based on the number of “missing reference bases” reported by QUAST, and further confirmed by the BLASTN result.
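The practical meaning of the two nuclear:mitochondrial ratios can be seen with a little arithmetic. The sketch below assumes that reads are sampled in proportion to the amount of DNA from each genome, so that a 1:200 ratio gives the mitochondrial genome 200 times the nuclear depth; how iWGS apportions depth internally is not specified here. The function names are hypothetical, and the genome lengths in the usage lines (about 12.1 Mb nuclear and about 86 kb mitochondrial for S. cerevisiae) are approximate values supplied for illustration.

```python
def organelle_depth(nuclear_depth, copies_per_nuclear_genome):
    """Depth of the organelle genome given its copy number per nuclear genome."""
    return nuclear_depth * copies_per_nuclear_genome


def organelle_read_fraction(nuclear_len, organelle_len, copies):
    """Fraction of sequenced bases expected to come from the organelle."""
    organelle_bases = organelle_len * copies
    return organelle_bases / (nuclear_len + organelle_bases)


NUCLEAR_LEN = 12_100_000  # approximate, assumed for illustration
MITO_LEN = 86_000         # approximate, assumed for illustration

for copies in (50, 200):
    depth = organelle_depth(50, copies)  # 50x nuclear depth, as in LIB1
    frac = organelle_read_fraction(NUCLEAR_LEN, MITO_LEN, copies)
    print(f"1:{copies} -> mitochondrial depth {depth}x, "
          f"{frac:.1%} of bases mitochondrial")
```

Under these assumptions the mitochondrial genome is sequenced at thousands-fold depth, a very different regime from the nuclear genome, which may be one reason why assemblers behave so differently on it.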

Case study III (genomic architecture issue):

Lastly, we applied iWGS to three model eukaryotic genomes from different kingdoms and with different genomic architectures. Specifically, we analyzed Drosophila melanogaster (Adams ) and Arabidopsis thaliana (Arabidopsis Genome Initiative 2000), which are medium-sized animal and plant genomes, respectively, as well as Plasmodium falciparum 3D7 (Gardner ), a smaller protist genome with extremely low GC-content (∼19%). For all three genomes, the best assembly was generated by using only PacBio data (Table 3). In D. melanogaster and A. thaliana, several Illumina-only assemblies were of relatively high-quality (i.e., corrected scaffold N50 ≥ 100 kb; Table S2), among which the best two were generated by ALLPATHS-LG and DISCOVAR (Table 3). However, all Illumina-only assemblies of Pl. falciparum 3D7 had considerably lower corrected scaffold N50 values, except for DISCOVAR whose sequencing strategy is unique in requiring a PE library with a limited insert size.

Table 3

Summary of top-ranking assemblies generated in case study III

Organism (Genome Size)	Best Assembly from Each Sequencing Strategy	Assembly Statistics^a
Organism (Genome Size)	Best Assembly from Each Sequencing Strategy	Scaffold N50 (kb)	Largest Scaffold (kb)	Assembled Fraction of the Reference Genome
D. melanogaster (137.55 Mb)	ILMN2	169.7	1,007.9	89.1%
	ILMN8	155.0	1,007.7	91.8%
	PACB2	5,107.5	13,108.3	99.3%
	HYBR2	279.3	1,536.8	89.7%
A. thaliana (119.15 Mb)	ILMN2	307.0	1,789.4	97.3%
	ILMN8	266.6	2,533.4	98.5%
	PACB2	2,065.7	8,552.9	99.7%
	HYBR2	289.3	1,412.2	97.4%
Pl. falciparum 3D7 (23.29 Mb)	ILMN2	28.0	146.2	96.7%
	ILMN8	222.0	729.9	98.4%
	PACB2	282.9	1,378.8	99.6%
	HYBR2	15.5	91.5	97.4%
Pl. falciparum IT (real data)	ILMN2	167.1	641.7	>100%
	ILMN8	141.2	631.4	97.6%
	PACB3	1,574.7	3,355.3	97.7%
	HYBR2	198.4	602.1	92.2%

^a The statistics for simulation-based analysis of D. melanogaster, A. thaliana, and Pl. falciparum 3D7 are after correction for assembly errors using the reference genome, as reported by QUAST in GAGE mode. By default, QUAST (in GAGE mode) corrects contigs/scaffolds by breaking them at assembly errors larger than 5 bp. The statistics for real data based analysis of Pl. falciparum IT are calculated from the original de novo assemblies.

To examine how well the simulation-based predictions made by iWGS are supported by empirical data, we collected four real genome sequencing data sets from a previous study of Pl. falciparum IT [one overlapping 100 bp PE library, one overlapping 250 bp PE library, one MP library, and one PacBio library from (Otto ); Table S1] that were a good match to our simulated data sets, and ran the same set of assembly protocols. The best assembly was again generated by PacBio data alone, and the assemblies generated by ALLPATHS-LG, DISCOVAR (both are Illumina-only), and DBG2OLC (hybrid) were ranked next, while all other Illumina-only assembly protocols performed poorly (Table 3 and Table S2). The results are largely consistent with our simulation study, suggesting that our simulation-based approach is indeed informative.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Discussion

The design and analysis of de novo genome sequencing experiments is not trivial. On the design front, one has to balance the complexity of the target genome, the strengths and weaknesses of each sequencing technology, and, importantly, the cost. Analysis is also challenging, as one is faced with multiple different algorithms and dozens of parameters. Although substantial efforts have been made to benchmark different approaches for genome assembly (Earl ; Salzberg ; Bradnam ; Magoc ), much less attention has been paid to investigating start-to-finish optimal sequencing strategies for a given genome [see (Chakraborty ) for one example]. iWGS is an automated tool that allows users to explicitly compare alternative experimental designs by using simulated sequencing data, even allowing users to estimate costs when these are known for the generation of each data type.

We have illustrated the utility of iWGS in several case studies on mitochondrial and nuclear genomes with varying levels of complexity. For instance, our simulations suggest that Illumina-only sequencing strategies may be economical choices for the sequencing of relatively simple genomes (e.g., Z. tritici; Table S2), whereas PacBio data would be highly desirable for genomes of greater complexity (e.g., Ps. fijiensis, C. geophilum, and Pl. falciparum). Although not done here, iWGS could also be used to evaluate different combinations of sequencing assays (e.g., PE and MP libraries), read quality, read lengths, and sequencing depths. Empirical studies of both short- and long-read data have shown that these parameters are critical determinants of the quality of de novo genome assemblies (Utturkar ; Chakraborty ).

One key function of iWGS is the use of simulation data generated from a related reference genome to inform the experimental design for organisms lacking genomic data. A similar concept was previously used to evaluate sequencing strategies for cacao by using the rice genome as the reference (Haiminen ). In principle, one could apply iWGS on one or more related reference genomes that resemble the characteristics (e.g., genome size, repeat content, and sequence composition) of the sequencing target. However, if such a reference genome is lacking, one solution is to start with a closely related reference genome and tune it toward the target (e.g., adjust GC- and repeat contents) by using third-party tools that simulate genome-wide evolution (Arenas and Posada 2014) before running iWGS. Alternatively, one may simply use iWGS with reference genomes that are of comparable complexity (e.g., similar in size and repeat content) regardless of the evolutionary relatedness. As suggested by previous studies, these factors not only influence the difficulty of genome assembly, but can also be excellent predictors of the assembly quality (Lee ). Therefore, iWGS could also be informative in evaluating the performance of alternative experimental designs on genomes with similar characteristics to the sequencing target.

Other important features of iWGS include the support for both Illumina short reads and PacBio long reads and, correspondingly, a wide selection of software tools compatible with these data types, as well as the ability to analyze real data. In comparison, the support for third generation sequencing data is relatively limited in iMetAMOS and currently lacking in RAMPART. Given the increasing importance of long sequence reads in de novo genome assembly, iWGS aims to allow users to fully exploit the strength of long-read data and explore alternative ways of data analysis.

Along these lines, several further developments can be envisioned. First, support for additional sequencing technologies, such as Oxford Nanopore, can be added as technologies become commercially available. In fact, the Celera Assembler, Canu, and SPAdes assemblers, which are supported by iWGS, can already utilize nanopore reads (Bankevich ; Berlin ). Similarly, realistic simulation of nanopore data will be possible once the patterns of errors and biases are better characterized using real data. Second, iWGS will continue to expand its functionality to achieve better assemblies. For instance, a number of assembly polishing tools can be integrated in iWGS to improve the quality of the final output, including Pilon (Walker ), Quiver (Chin ), and Nanopolish (Loman ), which use Illumina, PacBio, and nanopore data, respectively. In addition, iWGS currently uses Metassembler for meta-assembly; in the future, other meta-assembly tools that support assemblies based on PacBio data alone, such as quickmerge (Chakraborty ), could be added. Lastly, it would be beneficial to enable users to add new software tools to iWGS in order to stay up-to-date with the rapid advances in genome assembly and other aspects of HTS data analysis. We intend to provide periodic updates, and the expert user can edit iWGS on their own.

In summary, iWGS is a flexible, expandable, and easy-to-use pipeline that will aid in the design and execution of genome assembly experiments across the tree of life.

References (68 in total)

1. Genome-scale approaches to resolving incongruence in molecular phylogenies.

Authors: Antonis Rokas; Barry L Williams; Nicole King; Sean B Carroll
Journal: Nature Date: 2003-10-23 Impact factor: 49.962

2. Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors: Daniel R Zerbino; Ewan Birney
Journal: Genome Res Date: 2008-03-18 Impact factor: 9.043

Review 3. Harnessing genomics for evolutionary insights.

Authors: Antonis Rokas; Patrick Abbot
Journal: Trends Ecol Evol Date: 2009-02-07 Impact factor: 17.712

4. ABySS: a parallel assembler for short read sequence data.

Authors: Jared T Simpson; Kim Wong; Shaun D Jackman; Jacqueline E Schein; Steven J M Jones; Inanç Birol
Journal: Genome Res Date: 2009-02-27 Impact factor: 9.043

5. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species.

Authors:
Journal: J Hered Date: 2009-11-05 Impact factor: 2.645

6. A whole-genome assembly of Drosophila.

Authors: E W Myers; G G Sutton; A L Delcher; I M Dew; D P Fasulo; M J Flanigan; S A Kravitz; C M Mobarry; K H Reinert; K A Remington; E L Anson; R A Bolanos; H H Chou; C M Jordan; A L Halpern; S Lonardi; E M Beasley; R C Brandon; L Chen; P J Dunn; Z Lai; Y Liang; D R Nusskern; M Zhan; Q Zhang; X Zheng; G M Rubin; M D Adams; J C Venter
Journal: Science Date: 2000-03-24 Impact factor: 47.728

7. Mutation rates in mammalian genomes.

Authors: Sudhir Kumar; Sankar Subramanian
Journal: Proc Natl Acad Sci U S A Date: 2002-01-15 Impact factor: 11.205

8. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Authors:
Journal: Nature Date: 2000-12-14 Impact factor: 49.962

9. Genome sequence of the human malaria parasite Plasmodium falciparum.

Authors: Malcolm J Gardner; Neil Hall; Eula Fung; Owen White; Matthew Berriman; Richard W Hyman; Jane M Carlton; Arnab Pain; Karen E Nelson; Sharen Bowman; Ian T Paulsen; Keith James; Jonathan A Eisen; Kim Rutherford; Steven L Salzberg; Alister Craig; Sue Kyes; Man-Suen Chan; Vishvanath Nene; Shamira J Shallom; Bernard Suh; Jeremy Peterson; Sam Angiuoli; Mihaela Pertea; Jonathan Allen; Jeremy Selengut; Daniel Haft; Michael W Mather; Akhil B Vaidya; David M A Martin; Alan H Fairlamb; Martin J Fraunholz; David S Roos; Stuart A Ralph; Geoffrey I McFadden; Leda M Cummings; G Mani Subramanian; Chris Mungall; J Craig Venter; Daniel J Carucci; Stephen L Hoffman; Chris Newbold; Ronald W Davis; Claire M Fraser; Bart Barrell
Journal: Nature Date: 2002-10-03 Impact factor: 49.962

10. The genome sequence of Drosophila melanogaster.

Authors: M D Adams; S E Celniker; R A Holt; C A Evans; J D Gocayne; P G Amanatides; S E Scherer; P W Li; R A Hoskins; R F Galle; R A George; S E Lewis; S Richards; M Ashburner; S N Henderson; G G Sutton; J R Wortman; M D Yandell; Q Zhang; L X Chen; R C Brandon; Y H Rogers; R G Blazej; M Champe; B D Pfeiffer; K H Wan; C Doyle; E G Baxter; G Helt; C R Nelson; G L Gabor; J F Abril; A Agbayani; H J An; C Andrews-Pfannkoch; D Baldwin; R M Ballew; A Basu; J Baxendale; L Bayraktaroglu; E M Beasley; K Y Beeson; P V Benos; B P Berman; D Bhandari; S Bolshakov; D Borkova; M R Botchan; J Bouck; P Brokstein; P Brottier; K C Burtis; D A Busam; H Butler; E Cadieu; A Center; I Chandra; J M Cherry; S Cawley; C Dahlke; L B Davenport; P Davies; B de Pablos; A Delcher; Z Deng; A D Mays; I Dew; S M Dietz; K Dodson; L E Doup; M Downes; S Dugan-Rocha; B C Dunkov; P Dunn; K J Durbin; C C Evangelista; C Ferraz; S Ferriera; W Fleischmann; C Fosler; A E Gabrielian; N S Garg; W M Gelbart; K Glasser; A Glodek; F Gong; J H Gorrell; Z Gu; P Guan; M Harris; N L Harris; D Harvey; T J Heiman; J R Hernandez; J Houck; D Hostin; K A Houston; T J Howland; M H Wei; C Ibegwam; M Jalali; F Kalush; G H Karpen; Z Ke; J A Kennison; K A Ketchum; B E Kimmel; C D Kodira; C Kraft; S Kravitz; D Kulp; Z Lai; P Lasko; Y Lei; A A Levitsky; J Li; Z Li; Y Liang; X Lin; X Liu; B Mattei; T C McIntosh; M P McLeod; D McPherson; G Merkulov; N V Milshina; C Mobarry; J Morris; A Moshrefi; S M Mount; M Moy; B Murphy; L Murphy; D M Muzny; D L Nelson; D R Nelson; K A Nelson; K Nixon; D R Nusskern; J M Pacleb; M Palazzolo; G S Pittman; S Pan; J Pollard; V Puri; M G Reese; K Reinert; K Remington; R D Saunders; F Scheeler; H Shen; B C Shue; I Sidén-Kiamos; M Simpson; M P Skupski; T Smith; E Spier; A C Spradling; M Stapleton; R Strong; E Sun; R Svirskas; C Tector; R Turner; E Venter; A H Wang; X Wang; Z Y Wang; D A Wassarman; G M Weinstock; J Weissenbach; S M Williams; K C Worley; D Wu; S Yang; Q A Yao; J Ye; R F Yeh; J S Zaveri; M 
Zhan; G Zhang; Q Zhao; L Zheng; X H Zheng; F N Zhong; W Zhong; X Zhou; S Zhu; X Zhu; H O Smith; R A Gibbs; E W Myers; G M Rubin; J C Venter
Journal: Science Date: 2000-03-24 Impact factor: 47.728
