Diego Cb Mariano, Felipe L Pereira, Preetam Ghosh, Debmalya Barh, Henrique Cp Figueiredo, Artur Silva, Rommel Tj Ramos, Vasco Ac Azevedo.
Abstract
The newest technologies for DNA sequencing have made it possible to determine the primary structure of genomes, mainly those of prokaryotes, with high efficiency and at lower cost. However, regions with repetitive sequences, combined with the short reads produced by Next-Generation Sequencing (NGS) platforms, make it difficult to reconstruct the original genome in silico. Genome assembly therefore remains one of the major challenges in bioinformatics, particularly where repetitive sequences are concerned. In this paper, we present an approach to assemble repetitive regions in prokaryotic genomes. Our methodology enables (i) the identification of these regions through visual tools, (ii) the characterization of the sequences at the extremities of gaps and (iii) the extraction of consensus sequences based on the mapping of raw data to a reference genome. We also present a case study on the assembly of regions that encode ribosomal RNAs (rRNA) in the genome of Corynebacterium ulcerans FRC11 to demonstrate the efficiency of the strategies presented here. The proposed methods and tools will help in finishing genome assemblies while reducing running time and associated costs. AVAILABILITY: All scripts are available at http://github.com/dcbmariano/maprepeat.
Keywords: bioinformatics; finishing assemblies; genome assembly; repetitive sequences
Year: 2015 PMID: 26229287 PMCID: PMC4512001 DOI: 10.6026/97320630011276
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Figure 1. Pipeline flowchart. The pipeline receives as input a contigs file, reference files (Fasta and GenBank files) and NGS raw data (reads file). The first step is the scaffolding of the contigs. This step can be performed with a modified version of the CONTIGuator software, which outputs a scaffolds file and a synteny graphic with colored targets indicating the repetitive regions in the reference file and the positions of the gaps in the scaffolds file. Using this output, one can manually choose the names of the two contigs that neighbor a gap. Note that the scaffolds file no longer contains contigs as such (oriented contigs are called scaffolds); we keep the term "contigs" in this flowchart to aid comprehension. For the next step, we developed the movednaa.py script, which corrects the start of the scaffolds file for circular genomes by searching for the dnaA gene. We also developed the script cut_left.pl to remove barcodes from the raw data when needed. The assembly of repetitive regions can then be completed by extracting the consensus sequence from the mapping of raw data to the reference genome. To automate this step, we developed a software tool called MapRepeat. It receives as input the names of the two contigs and the paths of the scaffolds file, the reference Fasta file and the folder containing the NGS raw data file. MapRepeat outputs a new scaffolds file in which the gap indicated in the previous step is closed. To analyze the result, we developed the scripts mcontig.py (to divide scaffold files into Multi-Fasta files by breaking them at regions of Ns) and contiginfo.py (to report the number of gaps, the length of the genome, the lengths of the largest and smallest contigs, and the N50 value).
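The result-analysis step described in the Figure 1 caption (breaking a scaffold at runs of Ns and reporting gap count, lengths and N50) can be illustrated with a minimal Python sketch. This is not the code of mcontig.py or contiginfo.py; the function names, the `min_gap` parameter and the choice to count gap characters in the genome length are assumptions made for illustration only.

```python
import re


def split_on_gaps(scaffold, min_gap=1):
    """Split a scaffold into contigs at every run of at least min_gap Ns.

    Hypothetical helper: min_gap is an assumed parameter, not taken from
    the published scripts.
    """
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    return [piece for piece in pieces if piece]


def n50(lengths):
    """Return the length L of the contig at which the cumulative length,
    summed from the longest contig downwards, first reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(scaffold):
    """Collect the statistics named in the caption for one scaffold string."""
    lengths = [len(contig) for contig in split_on_gaps(scaffold)]
    return {
        "gaps": len(re.findall(r"[Nn]+", scaffold)),
        # Assumption: genome length here includes the gap (N) characters.
        "genome_length": len(scaffold),
        "largest": max(lengths, default=0),
        "smallest": min(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    toy = "ACGTACGTAC" + "NNNNN" + "ACGT" + "NN" + "ACGTAC"
    print(summarize(toy))
```

For a real assembly, the scaffold string would first be read from the Fasta file; a closed gap shows up as a lower gap count and, usually, a higher N50.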
Figure 2. (A) Synteny graphic generated by CONTIGuator. The figure shows the reference genome on top and, below it, the aligned contigs with the locations of the gaps, which may be used as input parameters for MapRepeat; (B) First step of running MapRepeat. The software uses BLAST to detect whether there are similarities between the two contigs that neighbor a gap. If there are, targets are used to delimit the initial position of similarity with the left contig and the final position of similarity with the right contig. MapRepeat analyzes a region of up to 3,000 bp before the gap in the left contig and of up to 3,000 bp after the gap in the right contig; (C) The region between the targets is extracted, and the raw sequencing data are then mapped against the extracted sequence using the Mira assembler. A consensus sequence is generated where the coverage confirms that this region exists both in the reference genome and in the sequenced genome; (D) The consensus sequence is aligned against the fragments of the two contigs. New targets (C and D) are used to identify the unknown regions from the contigs file mapped in the consensus; (E) The sequence contained between the targets C and D is extracted and used to close the gap.
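The window selection described in panel (B) can be sketched as follows: locate a gap in the scaffold and take up to 3,000 bp on each side of it as the fragments to compare. This is a simplified illustration on a plain sequence string, not MapRepeat's implementation; `find_gaps`, `gap_flanks` and the `flank` parameter are hypothetical names, and the BLAST and Mira steps are not reproduced here.

```python
import re

FLANK = 3000  # window size in bp, as stated in the caption


def find_gaps(scaffold):
    """Return (start, end) spans of every run of Ns, 0-based, end exclusive."""
    return [match.span() for match in re.finditer(r"[Nn]+", scaffold)]


def gap_flanks(scaffold, gap_start, gap_end, flank=FLANK):
    """Return the sequence up to `flank` bp before and after one gap.

    Windows are truncated at the scaffold ends rather than wrapping around,
    which is a simplifying assumption for circular genomes.
    """
    left = scaffold[max(0, gap_start - flank):gap_start]
    right = scaffold[gap_end:gap_end + flank]
    return left, right


if __name__ == "__main__":
    toy = "A" * 5 + "N" * 3 + "C" * 4
    for start, end in find_gaps(toy):
        left, right = gap_flanks(toy, start, end, flank=2)
        print(start, end, left, right)
```

The two returned fragments correspond to the left and right regions that would then be compared against the reference to place the targets.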