RUbioSeq: a suite of parallelized pipelines to automate exome variation and bisulfite-seq analyses.

Miriam Rubio-Camarillo, Gonzalo Gómez-López, José M Fernández, Alfonso Valencia, David G Pisano.

Abstract

MOTIVATION: RUbioSeq has been developed to facilitate the primary and secondary analysis of re-sequencing projects by providing an integrated software suite of parallelized pipelines to detect exome variants (single-nucleotide variants and copy number variations) and to perform bisulfite-seq analyses automatically. RUbioSeq's variant analysis results have already been validated and published. AVAILABILITY: http://rubioseq.sourceforge.net/.

Year:  2013        PMID: 23630175      PMCID: PMC3694642          DOI: 10.1093/bioinformatics/btt203

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Primary and secondary data analyses of next-generation sequencing (NGS) studies consist of a set of successive stages that are routinely and repeatedly executed using a wide collection of tools (e.g. quality control tools, read aligners, variant callers and so forth). These tools have different origins and usually lack straightforward interoperability. This issue has driven computational biologists to demand intuitive, efficient and integrated pipelines to facilitate routine NGS analysis and improve the reproducibility of the results. Several remarkable efforts have been made in this direction. Prominent examples include NARWHAL, a recent proposal to automate Illumina’s primary analysis (Brouwer ), and HugeSeq, a powerful pipeline designed to cover primary and secondary analysis of single-nucleotide variant (SNV) and copy number variation (CNV) experiments (Lam ). HugeSeq uses FASTQ files as input to detect and annotate genomic variants by running GATK (DePristo ) and SAMtools; however, the current version of HugeSeq supports neither sample quality control tools nor bisulfite-seq (BS-Seq) analysis methods. Galaxy, a large and flexible web-based platform, also provides an NGS toolbox (Blankenberg ). Despite its potential, Galaxy’s NGS tools are still in beta and support neither CNV nor BS-Seq analysis. We present RUbioSeq, an automated and parallelized software suite for primary and secondary analysis of Illumina and SOLiD experiments. Using standard input and output file formats and an intuitive XML configuration file, the application offers an integrated framework to run parallelized pipelines for variant detection in exome enrichment and methylation studies. RUbioSeq results have been experimentally validated and accepted for publication (Domenech ).

2 FEATURES AND METHODS

RUbioSeq is highly configurable. The parameters of the analysis are specified in an intuitive XML configuration file, which allows customization of the pipeline. Every RUbioSeq workflow accepts single- and paired-end experiments and detects Illumina’s CASAVA version automatically. We have included additional quality control steps to check the integrity of the inputs and the BAM files generated. RUbioSeq workflows are divided into functional modules that may be executed independently. The results are saved in a project directory tree maintaining a structured organization for the output files. Further details are available in the user manual at http://rubioseq.sourceforge.net/.

2.1 SNV detection pipeline

The primary input files accepted by RUbioSeq are reads in FASTQ (Illumina) or CSFASTA/QUAL (SOLiD) format. Alternatively, BAM alignment files are supported as input (Fig. 1). The SNV pipeline is divided into three main modules: (i) short-read alignment with a combination of the BWA and BFAST aligners (Li and Durbin, 2009; Homer ) and quality control analysis using FastQC, (ii) duplicate marking using Picard tools, realignment and recalibration using GATK, and quality control using TEQC and (iii) GATK variant calling, detection of somatic indels in tumor/control pairs and advanced filtering using GATK’s VariantFiltration walker. Finally, variants are annotated using the Ensembl Variant Effect Predictor (VEP, McLaren ). All the output files are generated in standard formats, such as BAM and VCF (Danecek ; Li ).
Fig. 1.

RUbioSeq pipelines for exome variant detection and BS-Seq analyses. Dark gray boxes correspond to the main steps of the pipelines. Light gray boxes indicate optional steps

2.2 CNV detection pipeline

RUbioSeq’s CNV detection pipeline uses modules (i) and (ii) described in Section 2.1 to generate GATK-recalibrated BAM files. The CONTRA software then uses the recalibrated BAM files to perform the CNV analyses for case–control comparisons (Li ). CONTRA calls copy number gains and losses based on normalized depth of coverage, generating output files in standard VCF format (Fig. 1).
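
To illustrate the idea of calling gains and losses from normalized depth of coverage, the following minimal Python sketch computes a per-region log2 ratio between a case and a control sample after scaling each by its total coverage. It is not CONTRA's actual algorithm: the function name, the region names, the pseudocount and the 0.5 threshold are all assumptions chosen for illustration only.

```python
import math

def call_cnv(case_depth, control_depth, threshold=0.5, pseudocount=1.0):
    """Label each region as gain, loss or neutral from normalized depth.

    case_depth / control_depth: dict mapping region name -> mean read depth.
    threshold and pseudocount are illustrative values, not CONTRA defaults.
    """
    # Scale by total coverage so that library-size differences cancel out.
    case_total = sum(case_depth.values())
    control_total = sum(control_depth.values())
    calls = {}
    for region in case_depth:
        case_norm = (case_depth[region] + pseudocount) / case_total
        control_norm = (control_depth[region] + pseudocount) / control_total
        log_ratio = math.log2(case_norm / control_norm)
        if log_ratio >= threshold:
            calls[region] = ("gain", log_ratio)
        elif log_ratio <= -threshold:
            calls[region] = ("loss", log_ratio)
        else:
            calls[region] = ("neutral", log_ratio)
    return calls

# Hypothetical mean depths for three regions in a case/control pair.
cnv_case = {"geneA": 200.0, "geneB": 100.0, "geneC": 50.0}
cnv_control = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}
cnv_calls = call_cnv(cnv_case, cnv_control)
```

In this toy input, `geneA` is labeled a gain, `geneC` a loss and `geneB` neutral. A real caller would additionally model coverage variance and assess statistical significance.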

2.3 Bisulfite-seq pipeline

RUbioSeq requires bisulfite-converted reads in FASTQ format as input. The software accepts input data generated with the Cokus and Lister protocols (Krueger ). This pipeline is structured in three analysis modules: (i) read filtering, FastQC quality control, bisulfite sequence alignment and methylation calling using Bismark, (ii) depth filtering and output file generation and (iii) an optional calculation of methylation percentages per interval (Fig. 1). The lack of a standard output format for methylation calls has encouraged us to adapt this output to the widely established VCF format. See RUbioSeq’s documentation for further details.
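
The depth filtering and per-interval percentage steps can be pictured with a small Python sketch. It assumes that each methylation call is available as a tuple of chromosome, position, methylated read count and unmethylated read count, and it pools reads across the cytosines of an interval. The function name, the tuple layout and the `min_depth` cutoff are hypothetical and are not taken from RUbioSeq or Bismark.

```python
def interval_methylation(calls, intervals, min_depth=5):
    """Return the methylation percentage per interval.

    calls: list of (chrom, position, methylated_reads, unmethylated_reads).
    intervals: list of (chrom, start, end), with start and end inclusive.
    min_depth is an illustrative cutoff, not a RUbioSeq default.
    """
    results = {}
    for chrom, start, end in intervals:
        methylated = 0
        total = 0
        for c_chrom, pos, meth, unmeth in calls:
            depth = meth + unmeth
            # Depth filtering: skip cytosines covered by too few reads.
            if depth < min_depth:
                continue
            if c_chrom == chrom and start <= pos <= end:
                methylated += meth
                total += depth
        # None marks intervals where no cytosine passed the depth filter.
        results[(chrom, start, end)] = (
            100.0 * methylated / total if total else None
        )
    return results

bs_calls = [
    ("chr1", 100, 8, 2),   # depth 10, kept
    ("chr1", 150, 1, 1),   # depth 2, filtered out
    ("chr1", 180, 5, 5),   # depth 10, kept
    ("chr1", 900, 0, 3),   # depth 3, filtered out
]
bs_intervals = [("chr1", 50, 200), ("chr1", 800, 1000)]
bs_result = interval_methylation(bs_calls, bs_intervals)
```

Here the first interval pools 13 methylated reads out of 20 and the second interval has no cytosine left after filtering, so it is reported as `None` rather than 0%.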

2.4 Implementation details

RUbioSeq is written in Perl. Its modular design provides the flexibility needed to include additional functionality in future versions of the tool. RUbioSeq has been implemented to run on UNIX HPC systems scheduled by SGE or PBS. The software also allows pipelines to be launched on a UNIX workstation. We have also implemented parallelized and multithreaded execution of the analysis process, enabling different levels of execution. RUbioSeq’s workflows are prepared to process multiple samples simultaneously on an HPC system. Under this parallelized design, the real execution time for N samples, which would be N * t if the samples were processed sequentially, is reduced to approximately t when all samples can run concurrently, where t represents the execution time for one sample. This feature can be executed in two ways: standalone multisample, where every sample generates an independent result, and joint multisample, where all samples contribute to a single final result.
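
The reduction from N * t to t is an idealized best case. A minimal sketch of the arithmetic, assuming that every sample takes the same time t, that scheduling overhead is negligible and that `n_slots` samples can run at once (a hypothetical parameter, not a RUbioSeq option), is:

```python
import math

def wall_clock_time(n_samples, time_per_sample, n_slots):
    """Idealized wall-clock time when samples run in parallel batches."""
    if n_samples < 0 or n_slots < 1:
        raise ValueError("n_samples must be >= 0 and n_slots >= 1")
    # Samples are processed in batches of at most n_slots at a time.
    batches = math.ceil(n_samples / n_slots)
    return batches * time_per_sample

sequential = wall_clock_time(10, 3.0, 1)       # one sample at a time
fully_parallel = wall_clock_time(10, 3.0, 10)  # one slot per sample
partly_parallel = wall_clock_time(10, 3.0, 4)  # fewer slots than samples
```

With one slot the time is N * t, with at least N slots it drops to t, and with fewer slots than samples it falls in between.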

2.4.1 Analysis protocols

All the implemented code and programs used in RUbioSeq are open-source. Our modules use state-of-the-art software, such as the BWA and BFAST aligners, the GATK variant caller and Ensembl’s VEP. We have set RUbioSeq’s parameters to the defaults established in the best practice recommendations provided by the developers for each of the analysis tasks and platforms supported. We have also set up platform-specific alignment protocols. For instance, for Illumina exome variation analysis, the software takes advantage of BWA’s efficiency and BFAST’s sensitivity by first performing a BWA alignment step and then a BFAST alignment of the reads left unmapped by the first step. Next, RUbioSeq generates the output BAM file containing all the mapped reads, which is then passed to RUbioSeq’s downstream execution module.
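
The two-step alignment protocol follows a general pattern: run a fast aligner on all reads, then retry only the unmapped reads with a more sensitive one. The Python sketch below shows that control flow with toy stand-in aligners. It does not call BWA or BFAST, and every name in it (`two_stage_align`, `exact_match`, `one_mismatch_match`, `REFERENCE`) is hypothetical.

```python
def two_stage_align(reads, fast_aligner, sensitive_aligner):
    """Align with a fast aligner first, then retry the unmapped reads.

    reads: dict mapping read name -> sequence.
    Each aligner is a callable that takes a sequence and returns a
    position (int), or None when the read cannot be mapped.
    """
    mapped = {}
    unmapped = {}
    for name, seq in reads.items():
        pos = fast_aligner(seq)
        if pos is None:
            unmapped[name] = seq
        else:
            mapped[name] = ("fast", pos)
    # Second pass: only reads that the first aligner failed to place.
    still_unmapped = []
    for name, seq in unmapped.items():
        pos = sensitive_aligner(seq)
        if pos is None:
            still_unmapped.append(name)
        else:
            mapped[name] = ("sensitive", pos)
    return mapped, still_unmapped

# Toy stand-ins: exact matching versus matching with one mismatch allowed.
REFERENCE = "ACGTACGTTTGACCA"

def exact_match(seq):
    pos = REFERENCE.find(seq)
    return pos if pos >= 0 else None

def one_mismatch_match(seq):
    for start in range(len(REFERENCE) - len(seq) + 1):
        window = REFERENCE[start:start + len(seq)]
        if sum(a != b for a, b in zip(window, seq)) <= 1:
            return start
    return None

toy_reads = {"r1": "TTGACC", "r2": "TTGTCC", "r3": "GGGGGG"}
toy_mapped, toy_unmapped = two_stage_align(
    toy_reads, exact_match, one_mismatch_match
)
```

In this toy run, `r1` is placed by the fast pass, `r2` only by the sensitive pass and `r3` stays unmapped. The merged `toy_mapped` dictionary plays the role of the combined output that is handed to the downstream module.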

2.4.2 Benchmarking

RUbioSeq has been executed on a 24-node Intel Nehalem cluster with 16 cores (2.67 GHz each) and 48 GB of random access memory per node. The variant detection workflow generated full lists of genomic variants in 3 h for an Illumina paired-end experiment carried out on 10 chronic lymphocytic leukemia (CLL) samples and their corresponding healthy controls (SRA ID: SRA049097). This study covered coding and regulatory regions belonging to 301 genes (1.36 Mb) associated with CLL (Domenech ). We additionally tested our software with BS-Seq data available from the NIH Roadmap Epigenomics consortium. We used the Illumina H1 cell line sample (SRS004212) from the UCSD Human Reference Epigenome Mapping Project (SRP000941). We analyzed 10 FASTQ files (∼1.5 GB per file) using the joint multisample execution mode and the default parameters. The final results (without bowtie-build) were generated in ∼3.5 h.

3 CONCLUSIONS

We have developed RUbioSeq, an integrated and parallelized workflow for DNA-Seq and BS-Seq studies. As RUbioSeq depends on >20 different software packages, we have created a customized 64-bit LiveDVD (based on Ubuntu 12.10 Desktop LiveCD), which bundles RUbioSeq plus all its dependencies, ready to be used on any computer. The results generated by RUbioSeq have been validated and accepted for publication. RUbioSeq source code and full documentation are accessible under a Creative Commons license at http://rubioseq.sourceforge.net.

REFERENCES (14 in total; first 10 listed)

1.  NARWHAL, a primary analysis pipeline for NGS data.

Authors:  R W W Brouwer; M C G N van den Hout; F G Grosveld; W F J van Ijcken
Journal:  Bioinformatics       Date:  2011-11-08       Impact factor: 6.937

2.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

3.  Human DNA methylomes at base resolution show widespread epigenomic differences.

Authors:  Ryan Lister; Mattia Pelizzola; Robert H Dowen; R David Hawkins; Gary Hon; Julian Tonti-Filippini; Joseph R Nery; Leonard Lee; Zhen Ye; Que-Minh Ngo; Lee Edsall; Jessica Antosiewicz-Bourget; Ron Stewart; Victor Ruotti; A Harvey Millar; James A Thomson; Bing Ren; Joseph R Ecker
Journal:  Nature       Date:  2009-10-14       Impact factor: 49.962

4.  Galaxy: a web-based genome analysis tool for experimentalists.

Authors:  Daniel Blankenberg; Gregory Von Kuster; Nathaniel Coraor; Guruprasad Ananda; Ross Lazarus; Mary Mangan; Anton Nekrutenko; James Taylor
Journal:  Curr Protoc Mol Biol       Date:  2010-01

5.  A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors:  Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal:  Nat Genet       Date:  2011-04-10       Impact factor: 38.330

6.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937

7.  New mutations in chronic lymphocytic leukemia identified by target enrichment and deep sequencing.

Authors:  Elena Doménech; Gonzalo Gómez-López; Daniel Gzlez-Peña; Mar López; Beatriz Herreros; Juliane Menezes; Natalia Gómez-Lozano; Angel Carro; Osvaldo Graña; David G Pisano; Orlando Domínguez; José A García-Marco; Miguel A Piris; Margarita Sánchez-Beato
Journal:  PLoS One       Date:  2012-06-01       Impact factor: 3.240

8.  BFAST: an alignment tool for large scale genome resequencing.

Authors:  Nils Homer; Barry Merriman; Stanley F Nelson
Journal:  PLoS One       Date:  2009-11-11       Impact factor: 3.240

9.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

10.  Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning.

Authors:  Shawn J Cokus; Suhua Feng; Xiaoyu Zhang; Zugen Chen; Barry Merriman; Christian D Haudenschild; Sriharsa Pradhan; Stanley F Nelson; Matteo Pellegrini; Steven E Jacobsen
Journal:  Nature       Date:  2008-02-17       Impact factor: 49.962
