NanoDJ: a Dockerized Jupyter notebook for interactive Oxford Nanopore MinION sequence manipulation and genome assembly.

Héctor Rodríguez-Pérez, Tamara Hernández-Beeftink, José M Lorenzo-Salazar, José L Roda-García, Carlos J Pérez-González, Marcos Colebrook, Carlos Flores

Abstract

BACKGROUND: The Oxford Nanopore Technologies (ONT) MinION portable sequencer makes it possible to use cutting-edge genomic technologies in the field and the academic classroom.
RESULTS: We present NanoDJ, a Jupyter notebook integration of tools for simplified manipulation and assembly of DNA sequences produced by ONT devices. It integrates basecalling, read trimming and quality control, simulation and plotting routines with a variety of widely used aligners and assemblers, including procedures for hybrid assembly.
CONCLUSIONS: Through Jupyter-facilitated access to self-explanatory application contents and interactive visualization of results, and through its distribution as a Docker software container, NanoDJ aims to simplify ONT DNA sequence analysis and make it more reproducible. The NanoDJ package code, documentation and installation instructions are freely available at https://github.com/genomicsITER/NanoDJ.

Keywords:  Docker; Genome analysis; Jupyter; Nanopore sequencing

Year: 2019; PMID: 31072312; PMCID: PMC6509807; DOI: 10.1186/s12859-019-2860-z

Source DB: PubMed; Journal: BMC Bioinformatics; ISSN: 1471-2105; Impact factor: 3.169


Background

It has never been so easy and affordable to access and utilize the genetic variation of any organism for any purpose. This has been driven by the continuous development of high-throughput DNA sequencing technologies, most commonly known as Next Generation Sequencing (NGS). A key improvement is the possibility of obtaining long single-molecule sequences with the fast and cost-efficient technology released by Oxford Nanopore Technologies (ONT), which in 2014 began marketing the MinION, a portable, pocket-size, nanopore-based NGS platform [1]. Since then, several algorithms and software tools designed specifically for ONT sequence data have flourished. Despite its size, the MinION provides multi-kilobase reads with a throughput comparable to that of other benchtop sequencers on the market (1–10 Gbases by 2017), and therefore still requires efficient and integrated bioinformatics tools to facilitate the widespread use of the technology. The MinION has shown promise in distinct applications [2]. Because of its low cost, laptop operability, and USB-powered compact design, cutting-edge NGS technology is no longer necessarily linked to the established idea of a large, expensive machine that must be located in a centralized sequencing center or on a laboratory bench. As a consequence, the utility of the MinION in field experiments to move from sample to answer on site has been demonstrated in infectious disease studies [3, 4], off-Earth genome sequencing [5], and species identification in extreme environments [6-8], among others. Leveraging MinION capabilities in the academic classroom is a natural extension of these field studies to facilitate genomics education for undergraduate and graduate students [9].
To date, no software solution specifically aims to facilitate ONT sequence analysis by integrating capabilities for data manipulation, sequence comparison, and assembly, whether in field experiments or for teaching genomics [9]. We have developed NanoDJ, an interactive collection of Jupyter notebooks that integrates a variety of software, advanced computer code, and plain contextual explanations. In addition, NanoDJ is distributed as a Docker software container to simplify the installation of dependencies and improve the reproducibility of results.

Implementation

NanoDJ is distributed as a Docker container running Jupyter notebooks, a format that is increasingly popular in the life sciences because it greatly facilitates the interactive exploration of data [10] and that has recently been integrated into the widely used Galaxy portal [11]. The Docker container allows NanoDJ to run as an isolated, self-contained package that can be executed seamlessly across a wide range of computing platforms [12], with a negligible impact on execution performance [13]. NanoDJ integrates diverse applications (Additional file 1: Table S1) organized into 12 notebooks grouped into three sections (Fig. 1; Table 1). Main results are presented as embedded objects. In addition, one of the notebooks was conceived for educational purposes: it sets a particularly simple problem and includes low-level explanations. To allow the educational notebook to be used without installing Docker and NanoDJ, a lightweight version of this notebook and small sets of ONT reads can be run from a web browser using Binder (https://mybinder.org), launched from the NanoDJ GitHub repository. In addition, as part of the CyVerse project (https://www.cyverse.org/), NanoDJ has been incorporated into VICE, a visual and interactive computing environment that facilitates training in ONT data analysis. We illustrate the versatility of NanoDJ in distinct scenarios by providing results from four case studies (Additional file 1: Text S1).
Fig. 1. Simplified scheme of all NanoDJ functionalities

Table 1. Summary of NanoDJ notebooks

| Name | Functionality |
| --- | --- |
| 0.0_QualityControl.ipynb | Evaluate the quality control and sequence handling |
| 1.0_Basecalling.ipynb | Translates the events or the raw electrical signal from an ONT sequencer (FAST5 format) to a DNA sequence to obtain a FASTA or a FASTQ file |
| 1.1_Trim+Demux.ipynb | Perform sequence trimming and demultiplexing |
| 2.0_DeNovo_Canu-Miniasm.ipynb | De novo assembly with Canu or Miniasm, and polish with Racon and Pilon |
| 3.0_DeNovo_Canu+polish.ipynb | Nanopolish modules to improve the Canu assembly |
| 4.0_DeNovo_Flye.ipynb | De novo assembly with Flye software |
| 5.0_DeNovo_Hybrid.ipynb | Perform de novo assembly of Nanopore reads in conjunction with Illumina reads using MaSuRCA and/or Unicycler software |
| 6.0_AssemblyCompare.ipynb | Compare distinct assembly results based on QUAST software |
| 7.0_SimulateReads.ipynb | Obtain simulated reads made with Nanosim software and the Nanosim-h fork with precomputed models |
| 8.0_Alignment.ipynb | Reference-based assembly using either BWA, BLAST or Rebaler software |
| 9.0_AssemblyGraph.ipynb | Assembly graph visualization |
| Educational.ipynb | Performs basecalling (with Albacore), quality control steps, and a BLAST-based classification of the reads (for educational purposes) |

Input, basecalling, and simulations

Input data can be a list of FAST5 files from previously basecalled runs (e.g. a Metrichor output) or event-level signal data to be basecalled using the latest ONT caller. The user can also simulate reads with NanoSim and pre-computed model parameters. This possibility is useful in different scenarios, for example to help design an experiment or to bypass technical difficulties in academic settings [9].
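To give a feel for what read simulation involves, the following is a minimal toy sketch in Python. It is not NanoSim and does not use its trained error models: the function name `simulate_reads`, the exponential length distribution, and the substitution-only error model are all assumptions made purely for illustration.

```python
import random

def simulate_reads(reference, n_reads, mean_len=200, error_rate=0.1, seed=0):
    """Draw toy reads from `reference` and add random substitution errors.

    Illustrative assumptions (NOT NanoSim's model): read lengths follow an
    exponential distribution around `mean_len`, and the only error type is
    a uniform base substitution applied with probability `error_rate`.
    """
    rng = random.Random(seed)  # seeded for reproducible simulations
    reads = []
    for _ in range(n_reads):
        # Clamp the read length to [1, len(reference)].
        length = int(rng.expovariate(1.0 / mean_len))
        length = max(1, min(len(reference), length))
        start = rng.randrange(len(reference) - length + 1)
        bases = []
        for base in reference[start:start + length]:
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        reads.append("".join(bases))
    return reads

# Hypothetical 1 kb reference made of a repeated motif.
reference = "ACGT" * 250
reads = simulate_reads(reference, n_reads=5, mean_len=200, error_rate=0.1, seed=42)
for i, read in enumerate(reads):
    print(f"read_{i}\tlength={len(read)}")
```

Fixing the seed makes a simulated dataset reproducible, which is convenient when the same reads need to be shared across a classroom.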

Summary, quality control and filtering

For either a simulated or an empirical run, the user obtains summary data and plots showing the read length distribution, GC content vs. length, and read length vs. quality score (when available). If barcodes were used in the experiment, Porechop can be used for demultiplexing, barcode trimming, and read filtering.
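The per-read quantities behind these plots can be computed with a few lines of standard-library Python. The sketch below is illustrative only and is not the code used in the NanoDJ notebooks; the helper names (`parse_fastq`, `gc_fraction`, `mean_phred`, `summarize`) are hypothetical, the parser assumes simple four-line FASTQ records, and the quality summary is a plain arithmetic mean of Phred+33 scores.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ with 4 lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file (or a blank line) stops parsing
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def gc_fraction(seq):
    """Fraction of G or C bases (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GCgc" for base in seq) / len(seq)

def mean_phred(qual, offset=33):
    """Arithmetic mean of Phred scores, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)

def summarize(handle):
    """Return one summary row (dict) per read."""
    return [
        {"name": name, "length": len(seq),
         "gc": gc_fraction(seq), "mean_q": mean_phred(qual)}
        for name, seq, qual in parse_fastq(handle)
    ]

# Two made-up reads, kept in memory so the example is self-contained.
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCC\n+\n!!!!\n")
for row in summarize(example):
    print(f"{row['name']}\tlength={row['length']}\t"
          f"GC={row['gc']:.2f}\tmeanQ={row['mean_q']:.1f}")
```

The resulting rows can be passed to any plotting library to reproduce length, GC content, and quality scatter plots of the kind described above.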

Genome assembly and comparison

Depending on the application, sequence data can be aligned against reference sequences or used for genome assembly with diverse methods. Alignment is performed against either one (BWA and Rebaler) or multiple (BLAST) reference sequences, generating BAM files for downstream applications (e.g., variant identification) or information on species composition. Alternatively, the user may opt for a de novo assembly. NanoDJ allows the use of some of the best-performing algorithms (Canu, Flye, and Miniasm), or the combination of ONT reads with reads from second-generation NGS platforms for a hybrid assembly (Unicycler and MaSuRCA). Hybrid assembly provides more effective assemblies and a reduced error rate compared to assemblies based only on ONT reads [14]. NanoDJ also supports contig correction (Racon, Nanopolish, and Pilon). Assemblies can be evaluated with the embedded version of QUAST and visualized with Bandage.
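Assembly comparison typically relies on contiguity statistics such as the N50, which is among the metrics QUAST reports. As a minimal sketch of the idea (not QUAST's implementation), the Python snippet below computes the N50 from a list of contig lengths; the assembler names and contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of length
    >= L together cover at least half of the total assembly length."""
    if not contig_lengths:
        return 0
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Hypothetical contig lengths (bp) from two assemblies of the same sample.
assemblies = {
    "assembler_a": [100, 80, 50, 20, 10],
    "assembler_b": [200, 30, 20, 10],
}

for name, lengths in assemblies.items():
    print(f"{name}\tcontigs={len(lengths)}\ttotal={sum(lengths)}\t"
          f"largest={max(lengths)}\tN50={n50(lengths)}")
```

With these made-up lengths, both assemblies have the same total size (260 bp), but the N50 is 80 for the first and 200 for the second, reflecting the more contiguous second assembly.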

Limitations and future directions

For non-expert users, NanoDJ would have been easier to use had it been conceived as an online application. However, our main objective was to integrate major tools for the analysis of ONT sequences in an interactive software environment that facilitates learning the basics of ONT sequence analysis while providing a useful tool for professionals. Providing it as a Dockerized solution keeps the focus on using the tool by relieving the user of the burden of installing all dependencies. At the moment, NanoDJ is set up for the analysis of small genomes and targeted NGS studies, and focuses on the primary and secondary analysis of DNA sequences. The integration of tools for variant identification and tertiary analysis (annotation of variants or sequence elements, interpretation, etc.) [15, 16], as well as for epigenetics [17] and direct RNA sequencing [18], will be the focus of further developments of NanoDJ.

Conclusions

We present NanoDJ as an integrated Jupyter-based toolbox distributed as a Docker software container to facilitate ONT sequence analysis. NanoDJ is best suited for the analyses of small genomes and targeted NGS studies. We anticipate that the Jupyter notebook-based structure will simplify further developments in other applications.

Availability and requirements

Project name: NanoDJ
Project home page: https://github.com/genomicsITER/NanoDJ
Operating system(s): Windows, Linux, Mac OS
Programming language: Bash/Python
Other requirements: Docker installation
License: GPL
Any restrictions to use by non-academics: None

Additional file 1:
Table S1. Applications integrated in NanoDJ.
Text S1. Testing on case study datasets.
Table S2. Datasets for illustrative uses of NanoDJ.
Table S3. Comparison of de novo assemblies using different inputs or with an assembly corrector.
Table S4. Comparison of three de novo assemblers in a high-coverage ONT dataset.
Table S5. Comparison of results from two hybrid de novo assemblers.
Figure S1. Human mitochondrial DNA variant representation against the reference sequence.
Table S6. Source of mitochondrial DNA genomes, simulations and classification results.
(DOCX 1544 kb)

References

1.  Nanopore development at Oxford Nanopore.

Authors:  Clive G Brown; James Clarke
Journal:  Nat Biotechnol       Date:  2016-08-09       Impact factor: 54.908

2.  Reproducible Bioconductor workflows using browser-based interactive notebooks and containers.

Authors:  Reem Almugbel; Ling-Hong Hung; Jiaming Hu; Abeer Almutairy; Nicole Ortogero; Yashaswi Tamta; Ka Yee Yeung
Journal:  J Am Med Inform Assoc       Date:  2018-01-01       Impact factor: 4.497

3.  Real-Time DNA Sequencing in the Antarctic Dry Valleys Using the Oxford Nanopore Sequencer.

Authors:  Sarah S Johnson; Elena Zaikova; David S Goerlitz; Yu Bai; Scott W Tighe
Journal:  J Biomol Tech       Date:  2017-03-22

4.  Establishment and cryptic transmission of Zika virus in Brazil and the Americas.

Authors:  N R Faria; J Quick; I M Claro; J Thézé; J G de Jesus; M Giovanetti; M U G Kraemer; S C Hill; A Black; A C da Costa; L C Franco; S P Silva; C-H Wu; J Raghwani; S Cauchemez; L du Plessis; M P Verotti; W K de Oliveira; E H Carmo; G E Coelho; A C F S Santelli; L C Vinhal; C M Henriques; J T Simpson; M Loose; K G Andersen; N D Grubaugh; S Somasekar; C Y Chiu; J E Muñoz-Medina; C R Gonzalez-Bonilla; C F Arias; L L Lewis-Ximenez; S A Baylis; A O Chieppe; S F Aguiar; C A Fernandes; P S Lemos; B L S Nascimento; H A O Monteiro; I C Siqueira; M G de Queiroz; T R de Souza; J F Bezerra; M R Lemos; G F Pereira; D Loudal; L C Moura; R Dhalia; R F França; T Magalhães; E T Marques; T Jaenisch; G L Wallau; M C de Lima; V Nascimento; E M de Cerqueira; M M de Lima; D L Mascarenhas; J P Moura Neto; A S Levin; T R Tozetto-Mendoza; S N Fonseca; M C Mendes-Correa; F P Milagres; A Segurado; E C Holmes; A Rambaut; T Bedford; M R T Nunes; E C Sabino; L C J Alcantara; N J Loman; O G Pybus
Journal:  Nature       Date:  2017-05-24       Impact factor: 49.962

5.  The impact of Docker containers on the performance of genomic pipelines.

Authors:  Paolo Di Tommaso; Emilio Palumbo; Maria Chatzou; Pablo Prieto; Michael L Heuer; Cedric Notredame
Journal:  PeerJ       Date:  2015-09-24       Impact factor: 2.984

6.  Using mobile sequencers in an academic classroom.

Authors:  Sophie Zaaijer; Yaniv Erlich
Journal:  Elife       Date:  2016-04-07       Impact factor: 8.140

7.  The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community.

Authors:  Miten Jain; Hugh E Olsen; Benedict Paten; Mark Akeson
Journal:  Genome Biol       Date:  2016-11-25       Impact factor: 13.583

8.  Jupyter and Galaxy: Easing entry barriers into complex data analyses for biomedical researchers.

Authors:  Björn A Grüning; Eric Rasche; Boris Rebolledo-Jaramillo; Carl Eberhard; Torsten Houwaart; John Chilton; Nate Coraor; Rolf Backofen; James Taylor; Anton Nekrutenko
Journal:  PLoS Comput Biol       Date:  2017-05-25       Impact factor: 4.475

9.  On site DNA barcoding by nanopore sequencing.

Authors:  Michele Menegon; Chiara Cantaloni; Ana Rodriguez-Prieto; Cesare Centomo; Ahmed Abdelfattah; Marzia Rossato; Massimo Bernardi; Luciano Xumerle; Simon Loader; Massimo Delledonne
Journal:  PLoS One       Date:  2017-10-04       Impact factor: 3.240

10.  Real-time, portable genome sequencing for Ebola surveillance.

Authors:  Joshua Quick; Nicholas J Loman; Sophie Duraffour; Jared T Simpson; Ettore Severi; Lauren Cowley; Joseph Akoi Bore; Raymond Koundouno; Gytis Dudas; Amy Mikhail; Nobila Ouédraogo; Babak Afrough; Amadou Bah; Jonathan Hj Baum; Beate Becker-Ziaja; Jan-Peter Boettcher; Mar Cabeza-Cabrerizo; Alvaro Camino-Sanchez; Lisa L Carter; Juiliane Doerrbecker; Theresa Enkirch; Isabel Graciela García Dorival; Nicole Hetzelt; Julia Hinzmann; Tobias Holm; Liana Eleni Kafetzopoulou; Michel Koropogui; Abigail Kosgey; Eeva Kuisma; Christopher H Logue; Antonio Mazzarelli; Sarah Meisel; Marc Mertens; Janine Michel; Didier Ngabo; Katja Nitzsche; Elisa Pallash; Livia Victoria Patrono; Jasmine Portmann; Johanna Gabriella Repits; Natasha Yasmin Rickett; Andrea Sachse; Katrin Singethan; Inês Vitoriano; Rahel L Yemanaberhan; Elsa G Zekeng; Racine Trina; Alexander Bello; Amadou Alpha Sall; Ousmane Faye; Oumar Faye; N'Faly Magassouba; Cecelia V Williams; Victoria Amburgey; Linda Winona; Emily Davis; Jon Gerlach; Franck Washington; Vanessa Monteil; Marine Jourdain; Marion Bererd; Alimou Camara; Hermann Somlare; Abdoulaye Camara; Marianne Gerard; Guillaume Bado; Bernard Baillet; Déborah Delaune; Koumpingnin Yacouba Nebie; Abdoulaye Diarra; Yacouba Savane; Raymond Bernard Pallawo; Giovanna Jaramillo Gutierrez; Natacha Milhano; Isabelle Roger; Christopher J Williams; Facinet Yattara; Kuiama Lewandowski; Jamie Taylor; Philip Rachwal; Daniel Turner; Georgios Pollakis; Julian A Hiscox; David A Matthews; Matthew K O'Shea; Andrew McD Johnston; Duncan Wilson; Emma Hutley; Erasmus Smit; Antonino Di Caro; Roman Woelfel; Kilian Stoecker; Erna Fleischmann; Martin Gabriel; Simon A Weller; Lamine Koivogui; Boubacar Diallo; Sakoba Keita; Andrew Rambaut; Pierre Formenty; Stephan Gunther; Miles W Carroll
Journal:  Nature       Date:  2016-02-03       Impact factor: 69.504
