Dieval Guizelini, Roberto T Raittz, Leonardo M Cruz, Emanuel M Souza, Maria B R Steffens, Fabio O Pedrosa.
Abstract
Despite advances in DNA sequencing technology that have increased the number and length of reads, reconstructing complete genome sequences, the so-called genome assembly, remains complex. Only 13% of prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented into contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences, together with a priori information, can allow misassembled regions to be identified and the overall de novo genome assembly to be reorganized and improved. GFinisher starts by generating a Fuzzy GC skew graph for each contig in an assembly, then breaks the contigs at critical points in order to reassemble and close them using jFGap. It has been successfully applied to a dataset of 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.
Year: 2016 PMID: 27721396 PMCID: PMC5056350 DOI: 10.1038/srep34963
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. GFinisher workflows.
A target assembly, some alternative assemblies and a reference genome sequence are inputs for the program. Numbered boxes represent the seven main steps in the analysis. Letters A–F represent intermediate assemblies generated during the analysis. Solid lines show the process flows and dotted lines show the assembly flow. In step 5, intermediate assemblies B and D are compared to recover the true critical points (dashed line).
Figure 2. Example of the difference in sensitivity between the GC-Skew and Fuzzy-GC-Skew curves in detecting critical points.
The curves were calculated along a 1.1 Mbp contig in an assembly of Aeromonas hydrophila. Cumulative GC skew graphs were obtained with the classical equation (red line) or with the fuzzy method (green line); a 10 kbp window was applied for the calculation. Critical points on the GC skew graphs are shown for the classical (blue diamonds) and fuzzy (yellow diamonds) calculations. Regions of divergence in the dotplot-like graph (pink and cyan lines) may be the cause of variations in the GC skew curve.
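As an illustration of the classical curve in this caption, the sketch below computes the standard GC skew, (G − C)/(G + C), over consecutive windows and accumulates it. It is a minimal sketch, not GFinisher's code: the function names are hypothetical, non-overlapping windows are an assumption, and a turning point of the cumulative curve is used as a stand-in for a critical point. The fuzzy variant is not described in this record and is not reproduced here.

```python
def gc_skew_windows(seq, window=10_000):
    """Classical GC skew (G - C) / (G + C) for consecutive, non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window with no G or C has an undefined skew; treat it as 0.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative(values):
    """Running sum of the per-window skews (the accumulated GC skew curve)."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def turning_points(curve):
    """Indices where the curve switches between rising and falling.

    Assumption: used here as a simple proxy for 'critical points'.
    """
    points = []
    for i in range(1, len(curve) - 1):
        left = curve[i] - curve[i - 1]
        right = curve[i + 1] - curve[i]
        if left * right < 0:
            points.append(i)
    return points


# Toy contig: a G-rich half followed by a C-rich half, with 4 bp windows.
toy = "GGGA" * 3 + "CCCA" * 3
skews = gc_skew_windows(toy, window=4)
curve = cumulative(skews)
points = turning_points(curve)
```

On the toy sequence the cumulative curve rises over the G-rich half and falls over the C-rich half, so the single turning point sits at the junction between them. For a real contig the window would be set to 10 kbp, as in the caption.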
Figure 3. Eulerian path dilemma and GC skew context in genome sequence assembly using a de Bruijn graph.
The de Bruijn graph represents a collapsed repeated region (red box) and the dilemma the assembler faces in choosing a path through this region. The orange line shows the trend, which is not used by assemblers. The red dashed line may correspond to misassemblies.
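The path dilemma can be reproduced on a toy sequence. The sketch below builds a small de Bruijn graph and lists the nodes with more than one outgoing edge, which is where an assembler must choose a path. It is an illustrative sketch only: the function names and the toy genome are hypothetical and do not come from GFinisher or any particular assembler.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. points of path ambiguity."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Toy genome in which the repeat "ACGT" occurs twice with different flanks.
genome = "TTACGTGGCAACGTCC"
graph = de_bruijn([genome], k=4)
branches = branching_nodes(graph)
```

Both copies of the repeat collapse onto the same nodes, so the graph leaves the repeat through `CGT` along two edges (to `GTG` and to `GTC`). The graph alone does not say which exit follows which entry, and that ambiguity is the dilemma shown in the figure.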
Figure 4. GFinisher improvement in the average number of contigs for 12 prokaryotic genome sequence assemblies available in GAGE-B.
Average number of contigs as assembled by GAGE-B (blue) and after reassembly by GFinisher (green).
Average number of contigs and reduction ratio obtained by GFinisher.
| Assembler | Contig/Scaffold | Average number of contigs, GAGE-B (N) | Average number of contigs, GFinisher intermediate assembly (B) | Average number of contigs, GFinisher intermediate assembly (F) | Reduction rate before error detection, (N − B)/N (%) | Reduction rate after error correction, (N − F)/N (%) | Correction rate |
|---|---|---|---|---|---|---|---|
| ABySS | Contig | 176.17 | 40.33 | 27.67 | 77.11 | 84.30 | 7.19 |
| ABySS | Scaffold | 164.00 | 42.75 | 28.67 | 73.93 | 82.52 | 8.59 |
| CABOG | Contig | 219.08 | 26.33 | 18.00 | 87.98 | 91.78 | 3.80 |
| CABOG | Scaffold | 187.25 | 26.42 | 18.33 | 85.89 | 90.21 | 4.32 |
| MaSuRCA | Contig | 102.75 | 43.33 | 15.83 | 57.83 | 84.59 | |
| MaSuRCA | Scaffold | 99.50 | 43.33 | 15.83 | 56.45 | 84.09 | |
| MIRA | Contig | 215.58 | 69.83 | 40.58 | 67.61 | 81.18 | 13.57 |
| SGA | Contig | 344.92 | 31.25 | 25.33 | 90.94 | 92.66 | 1.72 |
| SGA | Scaffold | 305.50 | 26.83 | 21.27 | 91.22 | 93.04 | 1.82 |
| SOAPdenovo2 | Contig | 163.42 | 32.25 | 24.17 | 80.27 | 85.21 | 4.95 |
| SOAPdenovo2 | Scaffold | 130.75 | 32.25 | 24.17 | 75.33 | 81.52 | 6.18 |
| SPAdes | Contig | 91.67 | 29.08 | 20.64 | 68.27 | 77.49 | 9.21 |
| SPAdes | Scaffold | 79.75 | 30.08 | 22.75 | 62.28 | 71.47 | 9.20 |
| Velvet | Contig | 201.00 | 35.75 | 24.08 | 82.21 | 88.02 | 5.80 |
| Velvet | Scaffold | 112.92 | 35.75 | 24.08 | 68.34 | 78.67 | 10.33 |
| Average | | 172.95 | 36.37 | 23.43 | 78.97 | 86.45 | 7.48 |
Columns B and F give the average number of contigs obtained by GFinisher after the second and last steps, respectively, of the flowchart in Fig. 1. The sixth and seventh columns show the rate of reduction in the average number of contigs between GAGE-B and GFinisher, before and after error detection by the Fuzzy GC Skew algorithm. The last column shows the correction rates.
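The reduction rates can be checked directly from the formulas in the column headers. The sketch below applies (N − B)/N and (N − F)/N to the ABySS contig row. The function name is hypothetical. Treating the correction rate as the difference between the two reduction rates is an inference from the reported numbers, not something the source states.

```python
def reduction_rate(n, after):
    """Percentage reduction in contig count relative to the starting count n."""
    return 100.0 * (n - after) / n


# ABySS contig row: N (GAGE-B), B and F (GFinisher intermediate assemblies).
n, b, f = 176.17, 40.33, 27.67

before = reduction_rate(n, b)      # (N - B) / N, as a percentage
after_corr = reduction_rate(n, f)  # (N - F) / N, as a percentage

# Assumption: correction rate = difference between the two reduction rates.
correction = after_corr - before

print(f"{before:.2f} {after_corr:.2f} {correction:.2f}")
```

Computed this way, the three values match the reported row to within about 0.01. A discrepancy of that size is what rounding would produce if the published rates were derived from unrounded averages.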