Frederico Schmitt Kremer, Alan John Alexander McBride, Luciano da Silva Pinto.
Abstract
The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations of short-read sequencing platforms, most of these newly sequenced genomes remained as "drafts", incomplete representations of the whole genetic content. Previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even a bacterial one, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing.
Year: 2017 PMID: 28898352 PMCID: PMC5596377 DOI: 10.1590/1678-4685-GMB-2016-0230
Source DB: PubMed Journal: Genet Mol Biol ISSN: 1415-4757 Impact factor: 1.771
Overview of the tools described in the present review
| Category | Tool | Main features | Dependences | Reference | Download link / webserver |
|---|---|---|---|---|---|
| Scaffolding | ABySS | Paired-end scaffolding. Scaffolding feature already integrated in ABySS. Uses the estimated distances generated by the program DistanceEst (from the same package) as input. Allows scaffolding with long reads, such as those generated by the PacBio and Oxford Nanopore platforms. | boost libraries | | |
| Scaffolding | Bambus 2 | Paired-end scaffolding. Can be easily integrated with assembly projects that are built on top of the AMOS package. Supports the scaffolding of metagenomes. Requires experience with the AMOS package and its data formats. | AMOS package | | |
| Scaffolding | MIP | Paired-end scaffolding. Supports both paired-end and mate-pair (long range) reads. | lpsolve library | | |
| Scaffolding | OPERA | Paired-end scaffolding. Identifies potential spurious connections caused by chimeric reads and repetitive genomic elements that may affect the reliability of the scaffolding. Contigs identified as misassembled may be used in the construction of more than one scaffold, although this may sometimes lead to new assembly errors. | BWA | | |
| Scaffolding | SCARPA | Paired-end scaffolding. Uses only contigs longer than the N50 of the assembly for scaffolding. Allows multiple libraries to be used in the same scaffolding project. | None | | |
| Scaffolding | SGA | Paired-end scaffolding. Scaffolding feature already integrated in the SGA assembly pipeline, which is optimized for Illumina data and large genomes. Uses the estimated distances generated by the program DistanceEst (from the ABySS package) as input, along with the read mapping file in .BAM format. Allows multiple libraries to be used in the same scaffolding project. | Bamtools | | |
| Scaffolding | SOPRA | Paired-end scaffolding. Developed to improve the assemblies generated by Velvet and SSAKE, and requires the .AFG files. Supports data from early Illumina and ABI SOLiD platforms, including paired-end and mate-pair reads. Is not fully automated, so different scripts must be run for each step of the scaffolding. | None | | |
| Scaffolding | SSPACE | Paired-end scaffolding. Trims the edges of the contigs, as they are more prone to assembly errors. Requires information about the paired-end library, including the mean insert size, its standard deviation and the relative orientation of the mates. | None | | |
| Scaffolding | SSPACE-LongRead | Paired-end scaffolding. Allows scaffolding with long reads, such as those generated by the PacBio and Oxford Nanopore platforms. | None | | |
| Scaffolding | MUMmer | Single reference-based scaffolding. The result of the alignment must be post-processed to obtain the scaffolds. | | | |
| Scaffolding | ABACAS | Single reference-based scaffolding. Useful when the reference and the target genome are closely related, and the genome to be scaffolded is not larger than the reference genome. Not optimized for bacteria with two or more replicons/chromosomes. Allows the design of primers for gap-closing. | MUMmer | | |
| Scaffolding | CONTIGuator | Single reference-based scaffolding. Useful when the target genome is composed of more than one chromosome/replicon. Allows a more sensitive identification of syntenic regions than ABACAS, as it applies a BLAST search after MUMmer. | ABACAS | | |
| Scaffolding | Mauve | Single reference-based scaffolding. Can be used through both a command-line interface (CLI) and a graphical user interface (GUI). Allows the identification of genomic inversions and translocations. Not optimized for bacteria with two or more replicons/chromosomes. | Java | | |
| Scaffolding | FillScaffolds | Single reference-based scaffolding. Not optimized for bacteria with two or more replicons/chromosomes. Results may require post-processing to reconstruct the sequence of the scaffold. | Java | | Supplementary data of |
| Scaffolding | SIS | Single reference-based scaffolding. Allows the identification of genomic inversions. Not optimized for bacteria with two or more replicons/chromosomes. | MUMmer | | |
| Scaffolding | CAR | Single reference-based scaffolding. Allows the identification of genomic inversions and translocations. Also available as a webserver. Not optimized for bacteria with two or more replicons/chromosomes. | MUMmer | | |
| Scaffolding | RACA | Multiple reference-based scaffolding. Optimized for large genomes with multiple chromosomes. Can also use paired-end data. | None | | |
| Scaffolding | Ragout | Multiple reference-based scaffolding. Uses phylogenetic information to identify the most probable orientation of the contigs/scaffolds. | Networkx (Python package) | | |
| Scaffolding | MeDuSa | Multiple reference-based scaffolding. Accepts both finished and draft genomes as reference. | BioPython (Python package) | | |
| Assembly integration | Minimus | Can be easily integrated with assembly projects that are built on top of the AMOS package. Requires experience with the AMOS package and its data formats. | AMOS package | | |
| Assembly integration | Reconciliator | Corrects the misassembled regions in a target assembly by comparing it to an alternative assembly of the same genome. Identifies repetitive regions that suffered compressions or expansions. | MUMmer | | |
| Assembly integration | MAIA | Allows the integration of two or more assemblies. Accepts a reference genome to perform scaffolding, which is useful for contigs without correspondence in the other assemblies. | Matlab | | |
| Assembly integration | CISA | Allows the integration of three or more assemblies. Corrects misassembled regions and compressed/expanded repeated regions. | BLAST+ | | |
| Assembly integration | GAA | Uses the alignment between the different contigs in the set of assemblies to generate an assembly graph, which is explored to identify the minimal set of independent paths. | BLAT | | |
| Assembly integration | Mix | Generates an extension graph that represents the connections between the contigs. Filters the alignment to reduce the ambiguities caused by repetitive sequences. | Networkx (Python package) | | |
| Assembly integration | GAM / GAM-NGS | Requires the read files to perform the assembly integration. One of the assemblies to be merged is defined as “master”, while the others are defined as “slaves”. Allows the identification of misassembled regions in the master, which are corrected before the generation of the final assembly. | cmake | | |
| Assembly integration | Zorro | Requires the read files to perform the assembly integration. Remaps the reads back to the contigs and identifies misassembled and repetitive regions based on the coverage. Splits the misassembled contigs and performs the assembly integration using Minimus. | AMOS | | |
| Gap closing | GapCloser | Gap-closing feature already integrated in SOAPdenovo. Performs a local reassembly in the gap region using the reads located at the edges of the surrounding contigs. | None | | |
| Gap closing | IMAGE | Iteratively remaps the reads to the contigs, selects those that overlap the gap region and performs a local reassembly. | None | | |
| Gap closing | GapFiller | Iteratively remaps the reads to the contigs, selects those that overlap the gap region and performs a local reassembly. Requires information about the paired-end library, including the mean insert size, its standard deviation and the relative orientation of the mates. | None | | |
| Gap closing | Enly | Iteratively remaps the reads to the contigs, selects those that overlap the gap region and performs a local reassembly. If a reference genome is provided, a new scaffolding step can be performed to improve the assembly. | BioPython (Python package) | | |
| Gap closing | FGAP | Uses alternative assemblies of the target genome to identify regions that overlap the gap. | Matlab | | |
| Gap closing | Sealer | Performs a local reassembly of the gap regions using different k-mer sizes, which may help to resolve regions with repetitive sequences. | boost libraries | | |
| Gap closing | GMCLoser | May use both paired-end reads and alternative assemblies to perform the gap-closing. Applies a likelihood analysis to avoid the effect of misassemblies in the alternative assemblies. | MUMmer | | |
| Gap closing | MapRepeat | Performs a reference-based scaffolding using a closely related genome provided by the user. Uses a reference-guided assembly to perform the gap-closing process. | BLAST+ | | |
| Gap closing | GapBlaster | Allows manual gap-closing using an alternative assembly of the target genome. | BLAST and BLAST+ | | |
| Assembly evaluation | REAPR | Calculates the accuracy of the assembly based on the coverage after remapping the reads back to the scaffolds. Misassembled regions can be identified as they usually present a discrepant coverage. A new set of scaffolds is generated by splitting the regions identified as misassembled. | File::Basename, File::Copy, File::Spec, File::Spec::Link, Getopt::Long and List::Util (Perl modules) | | |
| Assembly evaluation | QUAST | Calculates several assembly metrics, such as C+G%, N50 and L50. Can be used to compare different assemblies of the same genome, and/or compare them to a reference genome. | boost libraries | | |
| Assembly evaluation | ALE | Calculates the accuracy of the assembly based on the k-mers and C+G% distribution along the scaffolds. Doesn't require a reference genome. | Matplotlib (Python package) | | |
| Assembly evaluation | CGAL | Calculates the accuracy of the assembly based on the coverage after remapping the reads back to the scaffolds. | None | | |
| Assembly evaluation | GMvalue | Aligns the assembly to a reference genome (or alternative assembly) to identify misassembled regions. A new set of scaffolds is generated by splitting the regions identified as misassembled. | MUMmer | | |
| Assembly correction | iCORN | Requires paired-end reads. Iteratively identifies and corrects short misassemblies, such as base substitutions and short INDELs. | SNP-o-matic | | |
| Assembly correction | SEQuel | Requires paired-end reads. Interactively identifies and corrects short misassemblies, such as base substitutions and short INDELs. Performs a local reassembly of the misassembled regions using information from k-mers and paired-end reads. | Java | | |
| Assembly correction | GFinisher | Doesn't require paired-end reads. Integrates a reference-guided scaffolding step and gap-closing procedures, along with the assembly correction process. Identifies misassembled regions based on the GC-skew distribution. | Java | | |
Note: dependences are listed considering a computer running UNIX, Linux or Mac OS operating systems (OSs). As Make, sed, awk, GCC, Perl, Bash, Python and the GNU/Unix standard utility set are already included in most distributions and versions of these OSs, these programs were not listed as dependences.
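The assembly evaluation rows above mention the C+G%, N50 and L50 metrics. As a minimal sketch of how these can be computed from a list of contig sequences (the function name `assembly_metrics` is hypothetical, this is not QUAST's implementation, and C+G% is taken here over all characters, including any `N`s):

```python
def assembly_metrics(contigs):
    """Return (n50, l50, gc_percent) for a list of contig sequences.

    N50: length of the contig at which the cumulative length of the
    contigs, sorted from longest to shortest, first reaches half of
    the total assembly length. L50: number of contigs needed to get there.
    """
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly is empty")

    running = 0
    n50 = l50 = 0
    for rank, length in enumerate(lengths, start=1):
        running += length
        if 2 * running >= total:  # reached at least 50% of the assembly
            n50, l50 = length, rank
            break

    # Count G and C over every character in the assembly (assumption).
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return n50, l50, 100.0 * gc / total
```

A larger N50 together with a smaller L50 generally indicates a more contiguous assembly, but neither metric says anything about correctness, which is why the table lists read-based and reference-based evaluation tools as well.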
Figure 1. A flowchart demonstrating how and when the different genome finishing approaches can be combined according to the data available to the user. (a) Scaffolding using paired-end reads or long reads, which depends directly on how the genome was sequenced (platform, library) and is sometimes performed as part of the de novo assembly process. (b) Assembly integration, which consists of combining different de novo assemblies to generate a consensus/extended assembly. Some programs use only the assemblies as input, while others also use the sequencing reads. (c) The standard contig-ordering approach based on a single reference genome, which consists of identifying synteny blocks that guide the orientation of the contigs in the draft genome, without taking into account the occurrence of genome inversions or other rearrangements. (d) Rearrangement-aware contig ordering, which identifies potential sites of inversions and translocations based on signatures in the alignment against the reference genome. (e) Multiple-reference contig ordering, which may be more appropriate when there is no finished reference genome but there is a relatively high number of closely related drafts, or when there is no apparent closest reference to be used. (f) Assembly correction, which consists of removing short misassemblies, including base substitutions and short insertions and deletions. (g) Gap-closing, which consists of joining adjacent contigs that were separated by a gap. (h) Assembly evaluation, which may help to assess the reliability of the assembly.
Figure 2. Reference-based contig ordering. (a) The program takes a set of contigs (or scaffolds) and (b) aligns them to a reference genome to identify the most probable relative orientation of the sequences in the draft genome. (c) Regions not covered by the contigs represent gaps and may be sequencing/assembly artifacts or natural deletions. Based on the relative position of each contig, a scaffold is created.
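The ordering step in Figure 2 can be illustrated with a minimal sketch. It assumes the alignment has already been reduced to one best hit per contig, given as a hypothetical tuple `(name, ref_start, ref_end, strand)` with 0-based, half-open reference coordinates; real tools handle multiple hits, repeats and rearrangements, which this sketch ignores.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def order_contigs(contigs, hits, min_gap=1):
    """Build one scaffold from contigs placed on a reference.

    contigs: dict mapping contig name -> sequence.
    hits: list of (name, ref_start, ref_end, strand), one per contig.
    Uncovered reference regions between consecutive contigs become
    runs of 'N'; overlapping placements get a gap of min_gap 'N's.
    """
    pieces = []
    prev_end = None
    for name, start, end, strand in sorted(hits, key=lambda h: h[1]):
        seq = contigs[name]
        if strand == "-":
            # Contig aligned to the reverse strand: flip it.
            seq = reverse_complement(seq)
        if prev_end is not None:
            pieces.append("N" * max(start - prev_end, min_gap))
        pieces.append(seq)
        prev_end = end
    return "".join(pieces)
```

Sizing each gap from the reference distance is only an estimate, since the uncovered region may reflect a genuine deletion in the target genome rather than missing sequence.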
Figure 3. Example of a gap-closing approach using paired-end reads. (a) Taking as an example a scaffold consisting of two contigs joined by an assembly gap (a run of 'N's), (b) remapping the reads back to the contigs makes it possible to identify read pairs that have at least one of the mates in the gap region. Finally, (c) the reads identified inside the gap can be de novo assembled to fill the region, resulting in (d) a closed gap.
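Steps (a) and (b) of Figure 3 can be sketched as follows. This is an illustration only: the tuple layout `(read_id, mapped_pos, mapped_strand, mate_is_mapped)`, the function names and the single `insert_size` cutoff are assumptions, and a forward-reverse library orientation is assumed. The local reassembly in step (c) is left to an assembler and is not shown.

```python
import re


def find_gaps(scaffold):
    """Return (start, end) half-open coordinates of every run of N/n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", scaffold)]


def mates_near_gap(pairs, gap, insert_size):
    """Select read pairs whose unmapped mate probably lies in the gap.

    pairs: list of (read_id, mapped_pos, mapped_strand, mate_is_mapped).
    A pair is kept when its mate did not map and the mapped read points
    towards the gap from within insert_size bases of the gap edge:
    a forward read on the left flank or a reverse read on the right flank.
    """
    gap_start, gap_end = gap
    selected = []
    for read_id, pos, strand, mate_is_mapped in pairs:
        if mate_is_mapped:
            continue
        left_flank = strand == "+" and gap_start - insert_size <= pos < gap_start
        right_flank = strand == "-" and gap_end <= pos < gap_end + insert_size
        if left_flank or right_flank:
            selected.append(read_id)
    return selected
```

The unmapped mates of the selected pairs are the candidates for the local assembly that fills the gap.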
Figure 4. Example of a simplified assembly correction approach for base substitutions and insertion/deletion misassemblies. The steps are to (a) map the reads to the assembly, (b) identify variants (e.g., SNPs and INDELs) in a similar way to common variant calling pipelines, and finally (c) correct the regions in the assembly that show discrepancies. These steps may be repeated until no further change improves the assembly.
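The map, call and correct loop in Figure 4 can be sketched for the substitution case only. This is a toy illustration, not the method of any tool in the table: alignments are assumed to be ungapped `(start, read_sequence)` pairs, INDELs are not handled, the thresholds `min_depth` and `min_fraction` are arbitrary, and `map_reads` is a hypothetical callable standing in for a real read mapper.

```python
from collections import Counter


def correct_once(assembly, alignments, min_depth=3, min_fraction=0.7):
    """Apply one round of majority-vote substitution correction.

    alignments: list of (start, read_sequence), ungapped, 0-based.
    Returns (corrected_assembly, number_of_changed_bases).
    """
    piles = [Counter() for _ in assembly]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(assembly):
                piles[pos][base] += 1

    corrected = list(assembly)
    changes = 0
    for pos, pile in enumerate(piles):
        depth = sum(pile.values())
        if depth < min_depth:
            continue  # too little evidence to call a variant here
        base, count = pile.most_common(1)[0]
        if base != assembly[pos] and count / depth >= min_fraction:
            corrected[pos] = base
            changes += 1
    return "".join(corrected), changes


def polish(assembly, map_reads, max_rounds=10):
    """Repeat map -> call -> correct until a round changes nothing."""
    for _ in range(max_rounds):
        assembly, changes = correct_once(assembly, map_reads(assembly))
        if changes == 0:
            break
    return assembly
```

Remapping on every round matters in practice because a corrected base can let previously unmappable reads align, exposing further discrepancies nearby.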