Balázs Brankovics, Hao Zhang, Anne D van Diepeningen, Theo A J van der Lee, Cees Waalwijk, G Sybren de Hoog.
Abstract
GRAbB (Genomic Region Assembly by Baiting) is a new program dedicated to assembling specific genomic regions from NGS data. This approach is especially useful for multi-copy regions, such as the mitochondrial genome and the rDNA repeat region, which are often neglected or poorly assembled although they contain interesting information from phylogenetic or epidemiologic perspectives. Single-copy regions can be assembled as well. The program is capable of targeting multiple regions within a single run. Furthermore, GRAbB can be used to extract specific loci from NGS data based on homology, such as sequences used for barcoding. To make the assembly specific, a known part of the region, such as the sequence of a PCR amplicon or a homologous sequence from a related species, must be specified. Because only the region of interest is assembled, the assembly process is computationally much less demanding and may lead to assemblies of better quality. In this study the different applications and functionalities of the program are demonstrated, including exhaustive assembly (rDNA region and mitochondrial genome), extraction of homologous regions or genes (IGS, RPB1, RPB2 and TEF1a), and extraction of multiple regions within a single run. The program is also compared with MITObim, which is meant for the exhaustive assembly of a single target based on a similar query sequence. GRAbB is shown to be more efficient than MITObim in terms of speed, memory and disk usage. The other functionalities of the new program (handling multiple targets simultaneously and extracting homologous regions) are not matched by other programs. The program is available with explanatory documentation at https://github.com/b-brankovics/grabb. GRAbB has been tested on Ubuntu (12.04 and 14.04), Fedora (23), CentOS (7.1.1503) and Mac OS X (10.7). Furthermore, GRAbB is available as a docker repository: brankovics/grabb (https://hub.docker.com/r/brankovics/grabb/).
Year: 2016 PMID: 27308864 PMCID: PMC4911045 DOI: 10.1371/journal.pcbi.1004753
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Schematic workflow of GRAbB.
a: The input files for GRAbB. b: The main loop of GRAbB. c: Internal loops for the individual threads. In multi-mode the number of threads is based on the number of sequences in the reference file; otherwise there is only a single thread. d: Check whether any incomplete threads are left. e: Output files for each thread. Dashed arrows indicate optional files (only in exonerate-mode). Green arrows indicate the termination signal of a given thread or run. If GRAbB is not run in multi-mode, the first step within the internal loop is the de novo assembly and the preceding steps are skipped.
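The main loop in the caption above can be illustrated with a toy example. The following is a minimal sketch of iterative baiting in general, not GRAbB's actual implementation: the function names, the k-mer matching rule and the parameter `k` are all assumptions made for illustration. Where a real pipeline would run a de novo assembler on the baited reads and use the resulting contigs as the next bait, this sketch simply adds the k-mers of the selected reads to the bait set.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bait_reads(reads, bait_kmers, k):
    """Return the indices of reads that share at least one k-mer with the bait."""
    return {i for i, read in enumerate(reads) if kmers(read, k) & bait_kmers}


def iterative_baiting(reads, reference, k=4, max_iterations=50):
    """Grow a read selection from a reference until no new reads are found.

    Hypothetical stand-in for a bait/assemble loop: the bait is extended with
    the k-mers of the selected reads instead of with assembled contigs.
    Returns the selected read indices and the number of iterations run.
    """
    bait = kmers(reference, k)
    selected = set()
    for iteration in range(1, max_iterations + 1):
        hits = bait_reads(reads, bait, k)
        if hits == selected:
            # No new reads were baited: this plays the role of the termination signal.
            return selected, iteration
        selected = hits
        for i in selected:
            bait |= kmers(reads[i], k)
    return selected, max_iterations


if __name__ == "__main__":
    genome = "ACGTTGCAAGCTTAGGCTCA"
    # Three overlapping reads from the toy genome plus one unrelated read.
    reads = [genome[0:10], genome[5:15], genome[10:20], "TTTTTTTTTT"]
    selected, n = iterative_baiting(reads, reference=genome[0:6], k=4)
    print(sorted(selected), n)
```

The short reference only overlaps the first read, so the selection grows by one read per iteration until the unrelated read is the only one left out. This mirrors why a short known fragment, such as a PCR amplicon, can be enough to seed the assembly of a much longer region.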
Comparison between GRAbB and MITObim using our in silico generated paired-end read library.
| Program | Reference | N | CPU usage (%)* | Time elapsed (m:s) | Maximum memory usage (Mb) | Output folder size (Mb) |
|---|---|---|---|---|---|---|
| GRAbB | | 2 | 157 | 00:15.52 | 285.09 | 70 |
| GRAbB | | 10 | 161 | 01:49.24 | 286.05 | 527 |
| MITObim | | 3 | 150 | 09:42.15 | 1,354.86 | 4,126 |
| MITObim | | 9 | 150 | 25:37.71 | 1,358.81 | 10,667 |
Data were generated by /usr/bin/time on Ubuntu 14.04.1. Both programs were run on the same simulated read library derived from the F. graminearum mitogenome. The Reference column shows which mitogenome was the input for the run. N is the number of iterations it took to complete the assembly. Output folder size refers to the disk space occupied by the output files of the run, as assessed using /usr/bin/du on Ubuntu 14.04.1.
* The program, /usr/bin/time, assigns 100% for each of the cores of the processor, and the computer used has four cores, thus the maximal CPU usage is 400%.
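Because GNU time derives its CPU percentage as (user time + system time) divided by elapsed time, the approximate CPU time of a run can be recovered from the two table columns. The helper names below are hypothetical, and the sketch assumes the `m:s` elapsed format used in the table (no hours field). The reported percentages are rounded, so the result is only an estimate.

```python
def elapsed_to_seconds(elapsed):
    """Convert an 'm:s' string such as '09:42.15' to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


def cpu_seconds(elapsed, cpu_percent):
    """Estimate CPU time (user + system) from elapsed time and CPU usage.

    100% corresponds to one fully used core, so 157% means that on average
    about 1.57 cores were busy during the run.
    """
    return elapsed_to_seconds(elapsed) * cpu_percent / 100.0


if __name__ == "__main__":
    # Values taken from the first GRAbB row and first MITObim row of the table.
    print(round(cpu_seconds("00:15.52", 157), 2))
    print(round(cpu_seconds("09:42.15", 150), 2))
```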
Overview of the multi-query run.
| Thread number | Target | Reference region | Reference species | Reference accession number | Reference size (bp) | Result size (bp) | Assembly type | Specific iterations | General iterations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | mt | mt | | NC_017930 | 34,477 | 49,697 | exhaustive | 18 | 3 |
| 2 | rDNA | IGS | | FD_00403_IGS | 1,449 | 7,872 | exhaustive | 2 | 10 |
| 3 | IGS | IGS | | FD_00403_IGS | 1,449 | 1,446 | exonerate | 1 | 1 |
| 4 | TEF1a | TEF1a | | FD_00403_EF-1a | 689 | 684 | exonerate | 1 | 1 |
| 5 | TEF1a | TEF1a | | FD_00403_EF-1a* | 652 | 647 | exonerate | 1 | 1 |
| 6 | TEF1a | TEF1a | | FD_00001_EF-1a | 643 | 647 | exonerate | 3 | 1 |
| 7 | TEF1a | TEF1a | | FD_01036_EF-1a | 677 | 647 | exonerate | 4 | 1 |
| 8 | RPB1 | RPB1 | | FD_02003_RPB1 | 1,606 | 1,606 | exonerate | 1 | 1 |
| 9 | RPB2 | RPB2 | | FD_02003_RPB2 | 1,762 | 1,899 | exonerate | 1 | 1 |
| 10 | RPB2 | RPB2 | | FD_00120_RPB2-57 | 879 | 1,876 | exonerate | 1 | 1 |
The accession numbers are either GenBank accessions or FusariumID (http://isolate.fusariumdb.org) accessions. The assembly type specifies which assembly mode was selected for the given thread during the GRAbB run. Multiple specific iterations are run only during the first general iteration.
*: This sequence was truncated to span the same region as FD_00001_EF-1a and FD_01036_EF-1a.