Diego C B Mariano, Felipe L Pereira, Edgar L Aguiar, Letícia C Oliveira, Leandro Benevides, Luís C Guimarães, Edson L Folador, Thiago J Sousa, Preetam Ghosh, Debmalya Barh, Henrique C P Figueiredo, Artur Silva, Rommel T J Ramos, Vasco A C Azevedo.
Abstract
BACKGROUND: The evolution of Next-Generation Sequencing (NGS) has considerably reduced the cost per sequenced base, allowing a significant rise in sequencing projects, mainly in prokaryotes. However, the range of available NGS platforms requires different strategies and software to assemble genomes correctly. Completing an assembly project properly also requires installing or modifying various software. Users therefore need significant expertise in these tools and command-line scripting experience on Unix platforms, in addition to basic knowledge of the methodologies and techniques for genome assembly. These difficulties often delay the completion of genome assembly projects.
Keywords: Bacterial genome; Bioinformatics; Genome assembly; Genome finishing; Ion Torrent PGM; Web tool
Year: 2016 PMID: 28105921 PMCID: PMC5249034 DOI: 10.1186/s12859-016-1344-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Workflow representing data flows in SIMBA’s process. a Project management: creation and management of projects, and raw data conversion. b De novo assembly (Mira software, with support for the assemblers Minia, Newbler and SPAdes). In this step the user can also insert output files of de novo assemblies produced by external software. c Curation. This step has five subdivisions: 1) contig orientation using the CONTIGuator2 software, by reference or by optical mapping report; 2) start DNA correction based on the reference genome; 3) merging of neighbor contigs with overlapping flank regions using a PHP parser and BLAST (blastn); 4) mapping of 3,000 bp of the flank regions of neighbor contigs (Contig A and Contig B) against the reference genome; BLAST is used to determine the start and end positions (targets A and B); the reference genome is trimmed at the target A and B positions; the raw data are then mapped using Mira4; contigs A and B are mapped onto the consensus sequence obtained from the raw data mapping; targets C and D are used to detect the start and end positions of the specific gap; the region is extracted, and the gap is closed; 5) display of statistics about unknown nucleotides present in the genome, with the option to download the genome for manual curation.
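Curation step 3 (merging neighbor contigs whose flank regions overlap) can be illustrated with a minimal Python sketch. This is not SIMBA's implementation: according to the caption, SIMBA uses a PHP parser and BLAST (blastn), which tolerates mismatches, whereas the sketch below swaps in exact suffix/prefix string matching to stay self-contained. The function name `merge_overlapping_contigs` and the parameters `min_overlap` and `flank` are hypothetical, chosen only for illustration; the 3,000 bp default merely echoes the flank size mentioned for step 4.

```python
from typing import Optional


def merge_overlapping_contigs(left: str, right: str,
                              min_overlap: int = 20,
                              flank: int = 3000) -> Optional[str]:
    """Merge two neighbor contigs when the end of `left` exactly matches
    the start of `right`. Returns the merged sequence, or None if no
    overlap of at least `min_overlap` bases is found.

    Hypothetical helper: exact matching stands in for the blastn
    alignment described in the workflow.
    """
    left, right = left.upper(), right.upper()
    # Only the flank regions are compared, so the overlap can never be
    # longer than `flank` or than either contig.
    longest = min(len(left), len(right), flank)
    # Try the longest candidate overlap first, then shrink.
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            # Keep one copy of the shared region.
            return left + right[size:]
    return None


contig_a = "ACGTACGTTTGACCA"
contig_b = "TTGACCAGGGTACC"
merged = merge_overlapping_contigs(contig_a, contig_b, min_overlap=5)
no_merge = merge_overlapping_contigs("AAAAAAAA", "CCCCCCCC", min_overlap=5)
```

With the toy contigs above, the shared 7-base flank `TTGACCA` is kept once in the merged sequence, and the second call returns `None` because the contig ends do not overlap. Real reads contain sequencing errors, which is one reason an alignment tool such as blastn is preferable to exact matching in practice.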
Comparison between SIMBA and other software
| | SIMBA | Galaxy | CLC | Lasergene |
|---|---|---|---|---|
| User-friendly interface | X | X | X | X |
| | X | X | X | X |
| | X | X | | |
| Scaffolding by reference | X | X | X | X |
| Assembly quality evaluation | X | | | |
| Scaffolding by optical mapping | X | | | |
| Gap closing | X | X | X | X |
| Genome visualization | X | | | |
| Free and open source | X | X | | |
| Web tool | X | X | | |
| Multiple users in parallel | X | X | | |
| Free sequence edition | X | X | | |
| Annotation support | X | X | X |
Assemblies using SIMBA
| Genome | NGS | Software | Contigs | Scaffolding method | Genome length (bp) | Total scaffolds | Reference |
|---|---|---|---|---|---|---|---|
| | Ion PGM | Mira3 | 41 | Optical mapping (enzyme | 2,369,817 | 15 | Data not published. |
| | | Mira4 | 56 | | | | |
| | | Minia | 675 | | | | |
| | | Newbler | 27 | | | | |
| | | SPAdes | 58 | | | | |
| | Ion PGM | Mira3 | 9 | Optical mapping (enzyme | 2,335,107 | 5 | [ |
| | | Mira4 | 12 | | | | |
| | | Minia | 2,425 | | | | |
| | | Newbler | 10 | | | | |
| | | SPAdes | 15 | | | | |
| | Ion PGM | Mira3 | 25 | Reference ( | 2,337,451 | 9 | [ |
| | | Mira4 | 62 | | | | |
| | | Minia | - | | | | |
| | | Newbler | 17 | | | | |
| | | SPAdes | 631 | | | | |
| | Ion PGM | Mira3 | 11 | Reference ( | 2,337,177 | 6 | [ |
| | | Mira4 | 15 | | | | |
| | | Minia | 1,146 | | | | |
| | | Newbler | 9 | | | | |
| | | SPAdes | 10 | | | | |
| | Ion PGM | Mira3 | - | Reference ( | 2,442,826 | 6 | [ |
| | | Mira4 | 30 | | | | |
| | | Minia | 21a | | | | |
| | | Newbler | 9 | | | | |
| | | SPAdes | 8 | | | | |
| | Ion PGM | Mira3 | 12 | Reference ( | 2,484,335 | 2 | [ |
| | | Mira4 | - | | | | |
| | | Minia | 343 | | | | |
| | | Newbler | 8 | | | | |
| | | SPAdes | 18 | | | | |
| | Ion PGM | Mira3 | 661 | Reference ( | 2,554,693 | 18 | [ |
| | | Mira4 | - | | | | |
| | | Minia | - | | | | |
| | | Newbler | 43 | | | | |
| | | SPAdes | - | | | | |
a Minia assembled only 8,097 bp of the genome.
- The assembly failed for an unknown reason.
Fig. 2 MapSolver and SIMBA visualizations of the comparison between the new and old assemblies of Cp258. a MapSolver alignment visualization of the whole-genome optical map (enzyme KpnI; the central barcode, in red), the old assembly (NC_017945; the barcode above, in blue), and the new assembly (performed by SIMBA; the barcode below, also in blue). Dark blue lines in the barcodes represent restriction sites. Lines connecting barcodes represent regions of similarity. b The old assembly (above) is ~60 kbp smaller than the restriction map (center), whose length is close to that of the new assembly (below). c The SIMBA visualization compares the old Cp258 assembly (red horizontal line, above) with the new assembly (light blue horizontal line, below). Red lines connecting the upper and lower lines represent syntenic regions. The visualization shows: d regions undetected in the old assembly; e mis-assemblies in the old assembly; and f the length difference between the genomes. The visualization shown by SIMBA agrees with the MapSolver results and, in addition, gives more detailed information about the genome differences.