RAMPART: a workflow management system for de novo genome assembly.

Daniel Mapleson, Nizar Drou, David Swarbreck.

Abstract

MOTIVATION: The de novo assembly of genomes from whole-genome shotgun sequence data is a computationally intensive, multi-stage task, and it is not known a priori which methods and parameter settings will produce optimal results. In current de novo assembly projects, a popular strategy involves trying many approaches, using different tools and settings, and then comparing the results in order to select a final assembly for publication.
RESULTS: Herein, we present RAMPART, a configurable workflow management system for de novo genome assembly, which helps the user identify combinations of third-party tools and settings that provide good results for their particular genome and sequenced reads. RAMPART is designed to exploit high-performance computing environments, such as clusters and shared memory systems, where available.
AVAILABILITY AND IMPLEMENTATION: RAMPART is available under the GPLv3 license at: https://github.com/TGAC/RAMPART.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 25637556      PMCID: PMC4443680          DOI: 10.1093/bioinformatics/btv056

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

De novo assembly of whole-genome sequence data is a complex task that typically involves testing multiple tools, parameters and approaches to produce the best possible assembly for downstream analysis. This is necessary because it is not always known a priori which tools and settings will work best on the available sequence data, given the organism’s specific genomic properties such as size, ploidy and repetitive content. Despite advances in computing hardware and sequencing technologies, de novo assembly, particularly for more complex eukaryotic genomes, remains a non-trivial task and an ongoing challenge. Recently, several tools, such as iMetAMOS (Koren et al., 2014) and A5 (Tritt et al., 2012), have approached this problem by exhaustively testing many tools in parallel and then identifying and selecting the best assembly. However, these pipelines focus on prokaryote assemblies, where the computational demands are manageable and the genomes are easier to assemble. The complexities of eukaryotic genomes prohibit exhaustive testing of all tools and parameters with current computing hardware. For these projects, the user must rely on the literature and their own experience to decide which avenues are worth considering.

2 RAMPART

This article presents a workflow management system for de novo genome assembly called RAMPART, which allows the user to design and execute their own assembly workflows using a set of third-party open-source tools. This reduces human error and relieves the burden of organizing data files and executing tools manually. Frequently, this helps to produce better assemblies in less time than is possible otherwise. RAMPART gives the user the freedom to compare tools and parameters and to identify the effect these have on the given data sets. The flexibility to roll their own workflow enables the user to tackle both prokaryotic and eukaryotic assembly projects, tailoring the amount of work to be done to the availability of computing resources, the quantity of sequence data and the complexity of the genome. In addition, RAMPART produces logs, metrics and reports throughout the workflow, which allow users to identify, and subsequently rectify, any problems.

2.1 Workflow design

Input to RAMPART consists of one or more sequenced whole-genome shotgun libraries and a configuration file describing the properties of those libraries and the workflow through which the libraries should be processed. The workflow comprises a number of configurable stages, as depicted in Figure 1. This design allows the user to answer project-specific questions such as: whether raw or error-corrected sequence data works best; which assembler works best; or which parameter value is optimal for a specific tool. The final output from RAMPART is the assembled sequences, although plots, statistics, reports and log files are also produced as the pipeline progresses.
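The project-specific questions above amount to comparing every combination of input library, assembler and parameter value. The following is a minimal Python sketch of how such a comparison grid could be enumerated. It is not RAMPART's own code or configuration format; the function name, the library labels and the placeholder assembler names are assumptions made for illustration.

```python
from itertools import product


def enumerate_assembly_runs(libraries, assemblers, kmer_values):
    """Return one run description per (library, assembler, k) combination."""
    return [
        {"library": lib, "assembler": asm, "k": k}
        for lib, asm, k in product(libraries, assemblers, kmer_values)
    ]


# Hypothetical inputs: raw versus error-corrected reads, two placeholder
# assemblers and three candidate k-mer sizes.
runs = enumerate_assembly_runs(
    libraries=["raw", "error_corrected"],
    assemblers=["assembler_a", "assembler_b"],
    kmer_values=[31, 51, 71],
)
print(len(runs))  # 2 * 2 * 3 = 12 candidate assemblies
```

The grid grows multiplicatively with each added option, which is why the amount of work has to be tailored to the available computing resources for larger genomes.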
Fig. 1. A simplified representation of RAMPART’s architecture. Although the user’s workflow must conform to the linear structure depicted here, each stage is optional and highly configurable. Most stages allow the user to select which third-party tool(s) and parameters are used, although primary input and output parameters to all tools are managed automatically. The most important pipeline stage, MASS, allows the user to execute multiple assemblers with varying parameters. In the subsequent step, the resultant assemblies are analyzed before a single assembly is selected for use in the second half of the pipeline. Input to the MASS and AMP stages can be selected from any raw input library or from any modified libraries generated during the MECQ stage.

RAMPART connects standardized inputs and outputs from adjacent pipeline stages, which in some cases requires translation in order to drive specific third-party tools. Designing the software this way has three advantages. First, the user only needs to install those tools required for their specific project. Second, the user does not have to manually specify many input and output parameters for the tools, particularly library properties and file locations. Finally, RAMPART developers can add new tools without changing the pipeline logic. RAMPART is an open-source project, so any user with the right skillset can add their own tools to their pipeline, provided those tools can be made to comply with the appropriate interfaces.
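One way to picture the translation between standardized stage inputs and tool-specific invocations is an adapter per tool behind a shared interface. The Python sketch below is illustrative only and does not reproduce RAMPART's actual interfaces; `StageInput`, `ToolAdapter`, the tools `tool_a` and `tool_b` and all of their command-line flags are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageInput:
    """Standardized description handed from one pipeline stage to the next."""
    reads_path: str
    output_dir: str
    threads: int


class ToolAdapter(ABC):
    """Translates a StageInput into the argument list of one specific tool."""

    @abstractmethod
    def command(self, stage_input: StageInput) -> List[str]:
        ...


class ToolAAdapter(ToolAdapter):
    # Hypothetical tool that uses GNU-style long options.
    def command(self, stage_input: StageInput) -> List[str]:
        return ["tool_a", "--reads", stage_input.reads_path,
                "--out", stage_input.output_dir,
                "--threads", str(stage_input.threads)]


class ToolBAdapter(ToolAdapter):
    # Hypothetical tool that uses key=value arguments.
    def command(self, stage_input: StageInput) -> List[str]:
        return ["tool_b", f"in={stage_input.reads_path}",
                f"outdir={stage_input.output_dir}",
                f"cpus={stage_input.threads}"]


# Supporting a new tool means registering one more adapter; the code that
# calls build_command does not change.
ADAPTERS = {"tool_a": ToolAAdapter(), "tool_b": ToolBAdapter()}


def build_command(tool_name: str, stage_input: StageInput) -> List[str]:
    return ADAPTERS[tool_name].command(stage_input)
```

With this arrangement the user supplies library properties and locations once, and each adapter decides how they are spelled on its tool's command line.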

2.2 Assembly comparison and selection

To compare assemblies, RAMPART uses third-party tools to measure properties of each assembly relating to contiguity, conservation and assembly problems. The user can control which analysis tools, if any, are executed in their pipeline. To function as a fully automated pipeline, RAMPART must be capable, at particular stages, of selecting the best assembly to proceed with. We address this using a method similar to that described by Abbas et al. (2014), which groups and weights individual assembly metrics before assigning a single score to each assembly. The user has the option to override the default weightings for automatic selection, or can select an assembly manually at their discretion. Please see Supplementary Material Section 2 for more information.
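The idea of grouping and weighting metrics into a single score can be sketched as follows. This is an assumed scheme, not the exact method of Abbas et al. or of RAMPART: each metric is min-max scaled across the candidate assemblies, metrics are averaged within their group, and the group means are combined with user-supplied weights. The metric names, values and weights are invented for illustration.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1] so that 1 is always the best observed value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # metric does not discriminate
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def score_assemblies(assemblies, metrics, group_weights):
    """assemblies: name -> {metric: value}
    metrics: metric -> (group, higher_is_better)
    group_weights: group -> weight"""
    names = list(assemblies)
    group_sums = {n: {} for n in names}
    group_counts = {}
    for metric, (group, higher) in metrics.items():
        scaled = min_max([assemblies[n][metric] for n in names], higher)
        group_counts[group] = group_counts.get(group, 0) + 1
        for n, s in zip(names, scaled):
            group_sums[n][group] = group_sums[n].get(group, 0.0) + s
    total_weight = sum(group_weights[g] for g in group_counts)
    return {
        n: sum(group_weights[g] * group_sums[n][g] / group_counts[g]
               for g in group_counts) / total_weight
        for n in names
    }


# Hypothetical metric values for two candidate assemblies.
assemblies = {
    "run_1": {"n50": 50000, "complete_genes": 0.90, "misassemblies": 12},
    "run_2": {"n50": 80000, "complete_genes": 0.85, "misassemblies": 30},
}
metrics = {
    "n50": ("contiguity", True),
    "complete_genes": ("conservation", True),
    "misassemblies": ("problems", False),  # fewer is better
}
weights = {"contiguity": 0.4, "conservation": 0.3, "problems": 0.3}

scores = score_assemblies(assemblies, metrics, weights)
best = max(scores, key=scores.get)
```

Changing the entries of `weights` corresponds to overriding the default weightings: giving contiguity a larger share would favor `run_2` in this example.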

2.3 High performance computing support

Experimenting with de novo assembly for large, complex genomes is a computationally intensive process. Therefore, RAMPART is designed to exploit high-performance computing environments, such as clusters or shared memory machines, by executing tools in parallel where possible via the system’s job scheduler. RAMPART can nevertheless run sequentially on desktop and server machines that have sufficient resources. RAMPART currently supports both the Platform Load Sharing Facility and Portable Batch System schedulers, with plans to support Sun Grid Engine in the future.
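Dispatching the same tool invocation either through a scheduler or directly can be sketched as a small wrapper that builds the submission command. This Python sketch is an assumption about how such a wrapper might look, not RAMPART's implementation. It uses the `bsub` options `-J`, `-n` and `-o` and the `qsub` options `-N`, `-l` and `-o`; the PBS resource string `nodes=1:ppn=N` is a site-specific assumption and differs between PBS variants.

```python
import shlex


def scheduler_command(scheduler, job_name, threads, log_path, tool_argv):
    """Return (argv, stdin_text) for running tool_argv via the given scheduler.

    scheduler is "lsf", "pbs" or None (run directly, no scheduler).
    """
    tool_line = shlex.join(tool_argv)
    if scheduler == "lsf":
        # bsub takes the command to run as its trailing argument.
        argv = ["bsub", "-J", job_name, "-n", str(threads),
                "-o", log_path, tool_line]
        return argv, None
    if scheduler == "pbs":
        # qsub reads the job script from standard input when no script
        # file is given. The resource request below is an assumption.
        argv = ["qsub", "-N", job_name, "-l", f"nodes=1:ppn={threads}",
                "-o", log_path]
        return argv, tool_line + "\n"
    if scheduler is None:
        # Sequential fallback for desktop or server machines.
        return list(tool_argv), None
    raise ValueError(f"unsupported scheduler: {scheduler!r}")


# Hypothetical tool invocation submitted to LSF.
argv, stdin_text = scheduler_command(
    "lsf", "mass_run_1", 8, "mass_run_1.log", ["tool_a", "--k", "31"]
)
```

The returned pair could be passed to `subprocess.run(argv, input=stdin_text, text=True)`; independent jobs submitted this way run in parallel as far as the scheduler allows.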

3 Concluding remarks

RAMPART is a workflow management system for de novo genome assembly that provides an effective means of producing quality prokaryotic and eukaryotic assemblies by reducing the amount of manual work required in such projects. In addition, it offers a way for users to better understand differences in their genomic sequence data, assemblies and assembly tools. RAMPART is already used in production workflows at The Genome Analysis Centre, is under active development and is updated regularly to adapt to the latest challenges, tools and data.

As sequencing costs have come down, it has become possible to sequence multiple isolates of the same species in parallel. These kinds of projects present additional challenges for the bioinformatician in terms of managing the number of files and comparing the results of de novo assemblies across isolates. RAMPART contains some preliminary scripts for managing these kinds of projects. It also enables the rapid functional annotation of prokaryote genomes via Prokka (Seemann, 2014). In the future, we would like to improve these scripts and workflows and to provide the ability to annotate eukaryotic genomes.

Over time, the community will develop a better understanding of which assembly workflows are appropriate for certain types of genomes with certain types of sequence data. For example, the ALLPATHS-LG ‘recipe’ (Gnerre et al., 2011) has been shown to produce high-quality assemblies of mammalian genomes. We plan to encourage this process in the future by allowing users to share their own RAMPART workflows, and metrics describing their results, on a website for appraisal by the community.
References

1.  High-quality draft assemblies of mammalian genomes from massively parallel sequence data.

Authors:  Sante Gnerre; Iain Maccallum; Dariusz Przybylski; Filipe J Ribeiro; Joshua N Burton; Bruce J Walker; Ted Sharpe; Giles Hall; Terrance P Shea; Sean Sykes; Aaron M Berlin; Daniel Aird; Maura Costello; Riza Daza; Louise Williams; Robert Nicol; Andreas Gnirke; Chad Nusbaum; Eric S Lander; David B Jaffe
Journal:  Proc Natl Acad Sci U S A       Date:  2010-12-27       Impact factor: 11.205

2.  Prokka: rapid prokaryotic genome annotation.

Authors:  Torsten Seemann
Journal:  Bioinformatics       Date:  2014-03-18       Impact factor: 6.937

3.  An integrated pipeline for de novo assembly of microbial genomes.

Authors:  Andrew Tritt; Jonathan A Eisen; Marc T Facciotti; Aaron E Darling
Journal:  PLoS One       Date:  2012-09-13       Impact factor: 3.240

4.  Assessment of de novo assemblers for draft genomes: a case study with fungal genomes.

Authors:  Mostafa M Abbas; Qutaibah M Malluhi; Ponnuraman Balakrishnan
Journal:  BMC Genomics       Date:  2014-12-08       Impact factor: 3.969

5.  Automated ensemble assembly and validation of microbial genomes.

Authors:  Sergey Koren; Todd J Treangen; Christopher M Hill; Mihai Pop; Adam M Phillippy
Journal:  BMC Bioinformatics       Date:  2014-05-03       Impact factor: 3.169
