
SeqBox: RNAseq/ChIPseq reproducible analysis on a consumer game computer.

Marco Beccuti1, Francesca Cordero1, Maddalena Arigoni2, Riccardo Panero2, Elvio G Amparore1, Susanna Donatelli1, Raffaele A Calogero2.   

Abstract

Summary: Short-read sequencing technology has been in use for more than a decade. However, the analysis of RNAseq and ChIPseq data is still computationally demanding, and simple access to raw data does not guarantee that results are reproducible between laboratories. To address these two aspects, we developed SeqBox, a cheap, efficient and reproducible RNAseq/ChIPseq hardware/software solution based on the NUC6I7KYK mini-PC (an Intel consumer game computer with a fast processor and a high-performance SSD) and the Docker container platform. In SeqBox, the analysis of RNAseq and ChIPseq data is supported by a friendly GUI, which makes fast and reproducible analysis accessible to scientists with or without scripting experience. Availability and implementation: Docker container images, the docker4seq package and the GUI are available at http://www.bioinformatica.unito.it/reproducibile.bioinformatics.html. Contact: beccuti@di.unito.it. Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2017. Published by Oxford University Press.

Year:  2018        PMID: 29069297      PMCID: PMC6030956          DOI: 10.1093/bioinformatics/btx674

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Whole transcriptome sequencing (WTS) and ChIPseq have made the corresponding array hybridization-based technologies obsolete. Short-read sequencing technology has been in use for more than a decade, and experience shows that the main bottleneck in sequencing workflows is the time spent analyzing and interpreting data (Wong ). The primary analysis of the data, i.e. mapping short read sequences to the reference genome, is still computationally demanding and requires computing performance that is not commonly available in laptops. In particular, WTS requires a significant amount of RAM and multicore processors. The need for high-performance computing infrastructure for the analysis of sequencing data has led to the development of cloud-based analysis tools, e.g. Illumina BaseSpace (https://basespace.illumina.com/home/index), Galaxy (https://usegalaxy.org/), etc. However, cloud-based solutions suffer from some critical issues, e.g. data upload speed, limited storage space and significant computing and data transfer costs. Moreover, although all available data analysis platforms guarantee a certain level of reproducibility, typically by storing the version of the software being used, they do not track changes in the system libraries, which might lead to subtle reproducibility issues. To combine reproducible data generation with cost-effective but efficient hardware, we have developed SeqBox, a software/hardware ecosystem providing the most common analyses of RNAseq and ChIPseq data (i.e. genomic mapping, experimental power evaluation, differential expression, transcription factor/histone-mark peak identification, etc.) on a consumer game computer (Fig. 1A).
Fig. 1

(A) SeqBox framework: depicting the structure of SeqBox and its functionalities from a user point of view. The analysis flows from left to right. (B) Characteristics of the hardware used to evaluate the SeqBox performances


2 Materials and methods

The SeqBox ecosystem (Fig. 1A) is the union of the SeqBox software and the SeqBox target hardware. A user can access the system either through a Java-based graphical user interface (GUI, see Supplementary Material) or through the R console (see Supplementary Material). Regardless of the access type, the user can run three different workflows: RNAseq, miRNA and ChIPseq, which are managed by a Controlling Engine written in R (Fig. 1A). The functions that implement the workflows are either standard analysis algorithms or a set of supporting functions that have been developed and included in Bioconductor packages (e.g. DESeq2, ChIPpeakAnno). The algorithms used for sequencing data analysis include STAR (Dobin ) for RNAseq genomic mapping, DESeq2 (Love ) for differential expression analysis, BWA (Li and Durbin, 2009) for ChIPseq genomic mapping, and MACS (Zhang ) and SICER (Xu ) for ChIP peak detection (see Supplementary Material). All of them are encapsulated in Docker images. A Docker image is a lightweight, stand-alone, executable package that includes everything needed to run a specific piece of software. A runtime instance of an image, called a container, runs completely isolated from the host environment except for user-specified host files. The advantage of using Docker images is that the whole environment is fixed, the images are available in the Docker repository, and the identity of the images is the only element needed to reproduce the results. The Docker images implementing the workflow chosen by the user are executed by docker4seq, an R package that embeds a set of functions providing the running parameters to the mapping and counting engine.
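The container isolation described above can be sketched as follows. This is a minimal illustration in Python, not the docker4seq implementation: the helper `docker_run_command`, the image name `example/aligner:2017.01`, the host folder and the container arguments are all hypothetical. The sketch only assembles a `docker run` command line in which a single host folder is exposed to the container through a `-v` volume mount; nothing is executed.

```python
import shlex
from pathlib import PurePosixPath


def docker_run_command(image, host_dir, container_dir="/data", args=()):
    """Build (but do not execute) a `docker run` command line.

    Only `host_dir` is visible inside the container, mounted at
    `container_dir`; the rest of the host file system stays isolated.
    """
    mount = f"{PurePosixPath(host_dir)}:{PurePosixPath(container_dir)}"
    # --rm removes the container when it exits; -v mounts the host folder.
    return ["docker", "run", "--rm", "-v", mount, image, *args]


cmd = docker_run_command(
    image="example/aligner:2017.01",       # hypothetical image name
    host_dir="/home/user/fastq",           # hypothetical host folder
    args=["align", "/data/sample.fastq"],  # hypothetical container arguments
)
print(shlex.join(cmd))
```

Because the image tag is part of the command, recording the command together with the image identity is, under these assumptions, enough to rerun the same environment on another machine.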
SeqBox provides six Docker images: (i) skewer.2017.01, which uses skewer (Jiang ) for adapter trimming; (ii) rsemstar.2017.01, which uses STAR (Dobin ) to map short reads to the reference genome and RSEM (Li and Dewey, 2011) for gene- and isoform-level quantification; (iii) annotate.2017.01, which is used to associate RSEM output ids with gene symbols; (iv) mirnaseq.2017.01, which implements the miRNAseq analysis workflow described in (Cordero, ); (v) r332.2017.01, which allows differential expression analysis via the Bioconductor package DESeq2; (vi) chipseq.2017.01, which uses BWA (Li and Durbin, 2009) to map short reads to the reference genome, MACS (Zhang ) to detect transcription factor binding sites, and SICER (Xu ) to define histone marks. The GUI provides graphical access to the docker4seq functions, making the tools usable by biologists without scripting experience.
SeqBox hardware: The parameter settings of the algorithms (in terms of memory size versus number of assigned cores) are optimized for execution on the game computer NUC6I7KYK (Fig. 1A), which is based on an Intel Core i7 with 4 cores running up to eight threads that share a common memory of 32 GB, and a 256 GB SSD.
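The six images listed above can be summarized as a small lookup table. The sketch below is illustrative only: the image names and tool descriptions come from the text, while the `WORKFLOWS` step order and the helper `describe` are assumptions introduced here, since the text does not spell out which images each workflow chains together.

```python
# Image names and the tools they wrap, as listed in the text.
IMAGES = {
    "skewer.2017.01": "adapter trimming (skewer)",
    "rsemstar.2017.01": "read mapping (STAR) and quantification (RSEM)",
    "annotate.2017.01": "RSEM output id to gene symbol annotation",
    "mirnaseq.2017.01": "miRNAseq analysis workflow",
    "r332.2017.01": "differential expression (DESeq2)",
    "chipseq.2017.01": "read mapping (BWA) and peak detection (MACS, SICER)",
}

# ASSUMPTION: the composition and order of steps per workflow is an
# illustrative guess, not something stated in the text.
WORKFLOWS = {
    "RNAseq": [
        "skewer.2017.01",
        "rsemstar.2017.01",
        "annotate.2017.01",
        "r332.2017.01",
    ],
    "miRNA": ["mirnaseq.2017.01"],
    "ChIPseq": ["chipseq.2017.01"],
}


def describe(workflow):
    """Return one numbered line per step of the given workflow."""
    steps = WORKFLOWS[workflow]
    return [f"{i}. {img}: {IMAGES[img]}" for i, img in enumerate(steps, start=1)]


for line in describe("RNAseq"):
    print(line)
```

Keeping the version suffix (`2017.01`) in every key mirrors the idea that the image identity alone pins down the software environment.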

3 Results

We benchmarked SeqBox against a high-end server (SGI UV2000, Fig. 1B). The performance comparison was done for the three workflows (see Supplementary Figs S1–S3). In brief, we ran the workflows with increasing numbers of reads (see Supplementary Material) on SeqBox, using eight threads, and on the SGI server, increasing the number of threads from 8 to 160 (Supplementary Figs S1–S3). The parallelization provided by the SGI server did not substantially improve the overall performance of the RNAseq workflow (Supplementary Fig. S1). SeqBox significantly outperformed the server because its SSD delivers high I/O performance, which compensates for the limited parallelism of SeqBox. In the miRNA and ChIPseq workflows, parallelization is available only for the read mapping procedures. The limited parallelization of these two workflows, combined with the higher I/O performance of the SSD with respect to the SATA array, makes SeqBox extremely effective even when a very high number of reads has to be processed (Supplementary Figs S2 and S3).
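One way to see why going from 8 to 160 threads brings little benefit is Amdahl's law, which bounds the speedup of a job in which only a fraction p of the work can run in parallel. The law is not invoked in the original text, and the value p = 0.5 below is purely hypothetical, not a measurement of any SeqBox workflow; the sketch only illustrates the shape of the effect.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# HYPOTHETICAL parallel fraction, chosen only for illustration.
P = 0.5

for n in (8, 160):
    print(f"{n:>3} threads: speedup {amdahl_speedup(P, n):.2f}x")

# Relative gain from a 20-fold increase in thread count.
gain = amdahl_speedup(P, 160) / amdahl_speedup(P, 8)
print(f"160 vs 8 threads: {gain:.2f}x")
```

Under this assumption the speedup can never exceed 1 / (1 - p) = 2, so most of the attainable gain is already reached at eight threads.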

4 Conclusion

The majority of the algorithms used in the considered bioinformatics workflows are strongly I/O bound and make limited use of parallelism. Our experiments show that a consumer computer combined with fast storage is able to outperform a high-end server. The integration of Docker technology within a consumer mini-PC such as the Intel NUC6I7KYK therefore provides small biology laboratories with a solution for Next Generation Sequencing (NGS) analysis that is cheap, efficient and reproducible.
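The argument that fast storage can matter more than thread count for an I/O-bound job can be made concrete with a toy timing model. Everything in the sketch below is an assumption: the function `wall_time`, the additive form of the model, and all numeric values are hypothetical and are not taken from the benchmarks in the Supplementary Material.

```python
def wall_time(io_gb, disk_gb_per_s, serial_s, parallel_s, threads):
    """Toy model: I/O time + serial compute + parallel compute / threads."""
    return io_gb / disk_gb_per_s + serial_s + parallel_s / threads


# HYPOTHETICAL job: 200 GB moved to and from disk, 600 s of serial work,
# 2400 s of work that scales with the number of threads.
job = dict(io_gb=200.0, serial_s=600.0, parallel_s=2400.0)

# HYPOTHETICAL machines: fast disk with few threads vs slow disk with many.
fast_disk_few_threads = wall_time(disk_gb_per_s=2.0, threads=8, **job)
slow_disk_many_threads = wall_time(disk_gb_per_s=0.25, threads=160, **job)

print(f"fast disk,   8 threads: {fast_disk_few_threads:.0f} s")
print(f"slow disk, 160 threads: {slow_disk_many_threads:.0f} s")
```

In this model the extra threads only shrink the parallel term, while the I/O term depends on storage throughput alone, so the machine with the faster disk finishes first despite having far fewer threads.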

Funding

This work has been supported by the EPIGEN FLAG PROJECT. Conflict of Interest: none declared.
References

1.  STAR: ultrafast universal RNA-seq aligner.

Authors:  Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras
Journal:  Bioinformatics       Date:  2012-10-25       Impact factor: 6.937

2.  Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells.

Authors:  Shiliyang Xu; Sean Grullon; Kai Ge; Weiqun Peng
Journal:  Methods Mol Biol       Date:  2014

3.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors:  Bo Li; Colin N Dewey
Journal:  BMC Bioinformatics       Date:  2011-08-04       Impact factor: 3.307

4.  Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis.

Authors:  Francesca Cordero; Marco Beccuti; Maddalena Arigoni; Susanna Donatelli; Raffaele A Calogero
Journal:  PLoS One       Date:  2012-02-20       Impact factor: 3.240

5.  Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads.

Authors:  Hongshan Jiang; Rong Lei; Shou-Wei Ding; Shuifang Zhu
Journal:  BMC Bioinformatics       Date:  2014-06-12       Impact factor: 3.169

6.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

Authors:  Michael I Love; Wolfgang Huber; Simon Anders
Journal:  Genome Biol       Date:  2014       Impact factor: 13.583

7.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

8.  Model-based analysis of ChIP-Seq (MACS).

Authors:  Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal:  Genome Biol       Date:  2008-09-17       Impact factor: 13.583
