
MapRepeat: an approach for effective assembly of repetitive regions in prokaryotic genomes.

Diego Cb Mariano¹, Felipe L Pereira², Preetam Ghosh³, Debmalya Barh⁴, Henrique Cp Figueiredo², Artur Silva⁵, Rommel Tj Ramos⁵, Vasco Ac Azevedo¹.

Abstract

The newest technologies for DNA sequencing have led to the determination of the primary structure of the genomes of organisms, mainly prokaryotes, with high efficiency and at lower costs. However, the presence of regions with repetitive sequences, combined with the short reads produced by Next-Generation Sequencing (NGS) platforms, makes it difficult to reconstruct the original genome in silico. Thus, even today, genome assembly continues to be one of the major challenges in bioinformatics, specifically when repetitive sequences are considered. In this paper, we present an approach to assemble repetitive regions in prokaryotic genomes. Our methodology enables (i) the identification of these regions through visual tools, (ii) the characterization of sequences on the extremities of gaps and (iii) the extraction of consensus sequences based on mapping of raw data to a reference genome. We also present a case study on the assembly of regions that encode ribosomal RNAs (rRNA) in the genome of Corynebacterium ulcerans FRC11, in order to show the efficiency of the strategies presented here. The proposed methods and tools will help in finishing genome assemblies, besides reducing the running time and associated costs. AVAILABILITY: All scripts are available at http://github.com/dcbmariano/maprepeat.


Keywords: bioinformatics; finishing assemblies; genome assembly; repetitive sequences

Year: 2015 PMID: 26229287 PMCID: PMC4512001 DOI: 10.6026/97320630011276

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Recently, Next-Generation Sequencing (NGS) platforms have made it possible to determine the primary structure of DNA with high efficiency and accuracy and at a lower cost, mainly for prokaryotic genomes such as those of bacteria. Despite these advances, the new sequencing platforms are still unable to read a whole genome with precision in a single run. The DNA molecules must be fragmented before sequencing, after which an in silico strategy has to be employed to reassemble these fragments based on the orientation of the individual reads. This process is known as genome assembly [1]. Several algorithms, models and tools have been used for the reconstruction of genomes after sequencing. However, such genomes may contain sequences that repeat several times over a chromosome, e.g., regions encoding ribosomal RNA (rRNA), transposases, and regions of phages and plasmids. The assembly of these regions is a significant and highly complex challenge for assembler software [2]. To resolve the problem of repetitive sequences and to finish genome assemblies within a reasonable time, it becomes necessary to employ manual curation of gaps or even new sequencing of regions next to the gaps, which increases the cost of the assembly process. Thus, the steps required to finish the genome assembly are the major hurdles in terms of both cost and time [3]. Here, we propose a pipeline for the assembly of repetitive regions, mainly in bacterial genomes, using as reference the genome of an organism phylogenetically close to the one under study. The proposed strategy enables the scaffolding of the contigs obtained by de novo assembly, including repetitive regions, based on the extraction of the consensus sequence from the reads mapped to the reference genome (Figure 1).

Figure 1

Pipeline flowchart. The pipeline receives as input a contigs file, reference files (FASTA and GenBank files) and NGS raw data (reads file). The first step is the scaffolding of the contigs. This step can be carried out with a modified version of the CONTIGuator software, and its output is a scaffolds file and a synteny graphic with colored targets indicating repetitive regions in the reference file and the positions of the gaps in the scaffolds file. Using this output, it is possible to conduct a manual analysis to choose the names of the two contigs neighboring a gap. Note that, strictly speaking, the scaffolds file no longer contains contigs (once oriented, they are called scaffolds); however, we preserve this denomination in the flowchart to facilitate comprehension. After this step, the movednaa.py script, which we developed, corrects the beginning of the scaffolds file for circular genomes by searching for the dnaA gene. We also developed the script cut_left.pl to remove barcodes from raw data, when needed. One can then complete the assembly of repetitive regions based on the extraction of the consensus sequence from the mapping of raw data to the reference genome. To automate this step, we developed a software tool called MapRepeat. It receives as input the names of the two contigs and the paths of the scaffolds file, the reference FASTA file and the folder containing the NGS raw data file. MapRepeat outputs a new scaffolds file in which the gap indicated in the previous step is closed. To analyze the result, we developed the scripts mcontig.py (to divide scaffold files into Multi-FASTA files by breaking at regions of Ns) and contiginfo.py (to report the number of gaps, the length of the genome, the lengths of the largest and smallest contigs, and the N50 value).
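The two analysis steps named at the end of the caption (breaking a scaffold at runs of N, then reporting gap count, lengths and N50) can be illustrated with a short sketch. This is not the code of mcontig.py or contiginfo.py: the function names are hypothetical, any run of one or more N is assumed to be a gap, and the reported total length is assumed to exclude the Ns.

```python
import re

GAP_PATTERN = re.compile(r"[Nn]+")  # assumption: any run of N marks a gap


def split_at_gaps(scaffold):
    """Split a scaffold sequence into contigs at runs of N."""
    return [piece for piece in GAP_PATTERN.split(scaffold) if piece]


def count_gaps(scaffold):
    """Count the runs of N in a scaffold sequence."""
    return len(GAP_PATTERN.findall(scaffold))


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(scaffold):
    """Collect the statistics listed in the caption for one scaffold string."""
    lengths = [len(contig) for contig in split_at_gaps(scaffold)]
    return {
        "gaps": count_gaps(scaffold),
        "total_length": sum(lengths),  # assumption: Ns are not counted
        "largest": max(lengths) if lengths else 0,
        "smallest": min(lengths) if lengths else 0,
        "n50": n50(lengths),
    }
```

Sorting the lengths in descending order and stopping at the first contig that brings the running sum to half of the total is the usual way to obtain N50 in a single pass.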

Methodology

Inputs:

(1) A Multi-FASTA (Multiple FASTA format) file with contigs obtained by assembler software, such as Mira Assembler [4]; (2) a raw data file obtained by sequencing on NGS platforms, in FASTQ, FASTQ/XML or FASTA/QUAL format (we recommend a depth of coverage of approximately 50-fold for lower run time); (3) two files from an organism of the same species or genus to be used as reference: the first must contain a complete genome (nucleotide sequences in FASTA format), and the second must contain gene annotation information in GBK format (GenBank Flat File Format). Both can be obtained from public databases, such as the NCBI FTP site [5].
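The 50-fold recommendation can be turned into a simple calculation. The sketch below is illustrative only and is not part of the MapRepeat scripts: the function names are hypothetical, and subsampling reads is an assumed way of reaching the recommended depth, not a step the pipeline is stated to perform.

```python
def depth_of_coverage(total_read_bases, genome_length):
    """Average depth of coverage = total sequenced bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_read_bases / genome_length


def subsample_fraction(current_depth, target_depth=50.0):
    """Fraction of reads to keep to reach target_depth (never more than 1.0)."""
    if current_depth <= 0:
        raise ValueError("current_depth must be positive")
    return min(1.0, target_depth / current_depth)
```

For a data set sequenced at roughly 179-fold, `subsample_fraction(179)` is about 0.28, i.e. a little over a quarter of the reads would be kept.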

Determination of repetitive regions next to gaps:

We propose an approach for scaffolding the contigs with the software CONTIGuator v2.7 [6] (Figure 2A). The source code of CONTIGuator was modified to accept a GBK file as input. These modifications also allowed the insertion of colored targets in the synteny graphic generated by CONTIGuator. Blue targets indicate regions encoding ribosomal RNA; light blue targets indicate regions encoding transposases; green targets indicate phages; and yellow targets indicate plasmids. Through the synteny graphic of CONTIGuator, it is possible to identify a contig's neighbors when they are ordered and oriented properly; consequently, it is also possible to infer the existence of a repetitive region based on a similar region in the reference genome.
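The color scheme above amounts to a lookup from annotation to color. The following sketch shows one way such a lookup could work; it is not the modified CONTIGuator code. The colors come from the text, but the matching rules (GenBank feature type `rRNA`, and the keywords `transposase`, `phage` and `plasmid` in the product description) and the function name are assumptions made for illustration.

```python
def target_color(feature_type, product=""):
    """Return the target color for an annotated feature, or None if unmarked.

    Colors follow the scheme described in the text; the matching rules
    are illustrative assumptions, not the rules used by the modified tool.
    """
    product = product.lower()
    if feature_type == "rRNA":
        return "blue"
    if "transposase" in product:
        return "light blue"
    if "phage" in product:
        return "green"
    if "plasmid" in product:
        return "yellow"
    return None
```

Matching rRNA on the feature type rather than on the product text avoids marking protein-coding genes whose description merely mentions ribosomal RNA.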

Figure 2

(A) Synteny graphic generated by CONTIGuator. The figure shows the reference genome on top and, below it, the aligned contigs with the location of gaps, which may be used as input parameters for MapRepeat; (B) First step of running MapRepeat. The software uses BLAST to detect whether the two neighboring contigs show similarity to the reference. If there are similarities, targets (A and B) are used to delimit the initial position of similarity with the left contig and the final position of similarity with the right contig. MapRepeat analyzes a region of the left contig extending up to 3,000 bp before the gap and a region of the right contig extending up to 3,000 bp after the gap; (C) The region between the targets is extracted, and the raw sequencing data are then mapped against the extracted sequence using the Mira assembler. A consensus sequence is generated depending on whether there is coverage showing that this region exists in both the reference genome and the sequenced genome; (D) The consensus sequence is aligned against the fragments of the two contigs. New targets (C and D) are used to identify the unknown regions from the contigs file mapped in the consensus; (E) The sequence contained between targets C and D is extracted and used to close the gap.

Assembly of repetitive regions:

We developed a software tool, called MapRepeat, that can: (i) infer the position of repetitive regions in the contigs file based on the reference genome; (ii) assemble these regions after the scaffolding process performed by CONTIGuator; and (iii) close gaps. The software was implemented in the high-level programming language Python using the Biopython library. MapRepeat takes as input: (i) a FASTA file with the complete genome of the reference organism, (ii) a Multi-FASTA file with the contigs, (iii) the name of the directory containing all raw data files, and (iv) the names of the two contigs neighboring a given gap. MapRepeat uses the software BLAST (Basic Local Alignment Search Tool) [7] to determine, in the reference genome, the position of syntenic regions and of the regions at the extremities of a gap (the input neighboring contigs), and it stores this information as targets A and B (Figure 2B). Then, the sequence between targets A and B is extracted and used for the mapping of raw data with the software Mira version 4.0, which generates a consensus sequence (Figure 2C). BLAST is used once more to determine the values of targets C and D, which indicate the positions of the beginning and end of the gap in the consensus sequence obtained (Figure 2D). Finally, the sequence contained between targets C and D is extracted and used to close the gap (Figure 2E).
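The coordinate handling behind Figure 2B to 2E can be sketched with plain string slicing. This is a minimal sketch under stated assumptions, not the MapRepeat source: the BLAST and Mira steps are left out, target positions are assumed to be already known as 0-based, end-exclusive coordinates, and all function names are hypothetical.

```python
FLANK = 3000  # bases analysed on each side of the gap (Figure 2B)


def gap_flanks(left_contig, right_contig, flank=FLANK):
    """Return the end of the left contig and the start of the right contig."""
    left_start = max(0, len(left_contig) - flank)
    return left_contig[left_start:], right_contig[:flank]


def extract_between(sequence, start, end):
    """Return sequence[start:end] after checking that both targets are valid."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("targets fall outside the sequence")
    return sequence[start:end]


def close_gap(left_contig, right_contig, consensus, target_c, target_d):
    """Join two neighboring contigs with the consensus slice between C and D."""
    fill = extract_between(consensus, target_c, target_d)
    return left_contig + fill + right_contig
```

In this sketch, `extract_between` serves both extraction steps: applied to the reference with targets A and B it yields the sequence used for read mapping, and applied to the consensus with targets C and D it yields the sequence that fills the gap.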

Case study:

To evaluate the efficacy of the proposed method, the genome of Corynebacterium ulcerans FRC11 (CuFRC11), accession number CP009622, was used as a model. CuFRC11 was sequenced on the Ion Torrent™ Personal Genome Machine® (PGM) System (Life Technologies, USA) with a 200 bp fragment library kit. The de novo assembly was performed with Mira 4.0 and produced a total of 30 contigs, with an N50 value of 236,335 and a depth of coverage for mapped reads of ~179× [9]. As the reference genome, we used Corynebacterium ulcerans 0102 (Cu0102), accession number NC_018101.1. The scaffolding was performed using our modified version of CONTIGuator. The synteny graphic showed four regions marked in blue, i.e., four clusters encoding rRNA. MapRepeat was used to close the gaps between the contigs: (i) frc11_c6 and frc11_c10; (ii) frc11_c7 and frc11_c8; (iii) frc11_c1 and frc11_c3; and (iv) frc11_c4 and frc11_c2. The extraction of a consensus from the mapping to the reference Cu0102 was successful for all four gaps. These were filled with sequence insertions of 5,402 bp, 6,101 bp, 4,042 bp, and 4,606 bp, respectively. The BLAST online tool was used to confirm that the inserted regions contained sequences encoding rRNA.

Discussion

The results of the case study illustrate the efficacy of our proposed method and of the developed pipeline for resolving gaps in assemblies of bacterial genomes. They represent alternatives for the finishing of assemblies without additional costs, while also allowing the code to be modified and adapted to the needs of the pipeline. We point out that the strategies presented here can be performed with other software without great changes in the final results. For example, the scaffolding process can be performed using the software Mauve [9], together with the BLAST web tool to detect the repetitive regions. The proprietary tool CLC Genomics Workbench (Qiagen, USA) can also be used to extract the consensus from the raw data mapping (this tool was used for the finalization of CuFRC11 in [8]).

Conclusion

The tools and methods proposed here are good alternatives for improving the process of finishing bacterial genomes, reducing costs and accelerating the process. However, the pipeline currently runs only through a command-line interface, which may present some difficulties to users with fewer informatics skills. All the software developed or modified, in addition to the scripts that can help in the genome assembly process, has been made available for download at http://github.com/dcbmariano/maprepeat. The corresponding documentation on the usage and installation of these tools has been included in the supplementary materials.

Prospects for the future

We aim to improve MapRepeat in the future by including (i) the automation of the pipeline, (ii) the integration with other steps of the assembly process, and (iii) the construction of a user-friendly web interface.


Review 1. Assembly algorithms for next-generation sequencing data.

Authors: Jason R Miller; Sergey Koren; Granger Sutton
Journal: Genomics Date: 2010-03-06 Impact factor: 5.736

Review 2. High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity.

Authors: Nicholas J Loman; Chrystala Constantinidou; Jacqueline Z M Chan; Mihail Halachev; Martin Sergeant; Charles W Penn; Esther R Robinson; Mark J Pallen
Journal: Nat Rev Microbiol Date: 2012-08-06 Impact factor: 60.633

3. Finished bacterial genomes from shotgun sequence data.

Authors: Filipe J Ribeiro; Dariusz Przybylski; Shuangye Yin; Ted Sharpe; Sante Gnerre; Amr Abouelleil; Aaron M Berlin; Anna Montmayeur; Terrance P Shea; Bruce J Walker; Sarah K Young; Carsten Russ; Chad Nusbaum; Iain MacCallum; David B Jaffe
Journal: Genome Res Date: 2012-07-24 Impact factor: 9.043

4. CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes.

Authors: Marco Galardini; Emanuele G Biondi; Marco Bazzicalupo; Alessio Mengoni
Journal: Source Code Biol Med Date: 2011-06-21

5. Genome Sequence of Corynebacterium ulcerans Strain FRC11.

Authors: Leandro de Jesus Benevides; Marcus Vinicius Canário Viana; Diego César Batista Mariano; Flávia de Souza Rocha; Priscilla Carolinne Bagano; Edson Luiz Folador; Felipe Luiz Pereira; Fernanda Alves Dorella; Carlos Augusto Gomes Leal; Alex Fiorini Carvalho; Siomar de Castro Soares; Adriana Carneiro; Rommel Ramos; Edgar Badell-Ocando; Nicole Guiso; Artur Silva; Henrique Figueiredo; Vasco Azevedo; Luis Carlos Guimarães
Journal: Genome Announc Date: 2015-03-12

