Literature DB >> 24267787

Combining de novo and reference-guided assembly with scaffold_builder.

Genivaldo Gz Silva1, Bas E Dutilh, T David Matthews, Keri Elkins, Robert Schmieder, Elizabeth A Dinsdale, Robert A Edwards.   

Abstract

Genome sequencing has become routine, however genome assembly still remains a challenge despite the computational advances in the last decade. In particular, the abundance of repeat elements in genomes makes it difficult to assemble them into a single complete sequence. Identical repeats shorter than the average read length can generally be assembled without issue. However, longer repeats such as ribosomal RNA operons cannot be accurately assembled using existing tools. The application Scaffold_builder was designed to generate scaffolds - super contigs of sequences joined by N-bases - based on the similarity to a closely related reference sequence. This is independent of mate-pair information and can be used complementarily for genome assembly, e.g. when mate-pairs are not available or have already been exploited. Scaffold_builder was evaluated using simulated pyrosequencing reads of the bacterial genomes Escherichia coli 042, Lactobacillus salivarius UCC118 and Salmonella enterica subsp. enterica serovar Typhi str. P-stx-12. Moreover, we sequenced two genomes from Salmonella enterica serovar Typhimurium LT2 G455 and Salmonella enterica serovar Typhimurium SDT1291 and show that Scaffold_builder decreases the number of contig sequences by 53% while more than doubling their average length. Scaffold_builder is written in Python and is available at http://edwards.sdsu.edu/scaffold_builder. A web-based implementation is additionally provided to allow users to submit a reference genome and a set of contigs to be scaffolded.

Entities:  

Year:  2013        PMID: 24267787      PMCID: PMC4177539          DOI: 10.1186/1751-0473-8-23

Source DB:  PubMed          Journal:  Source Code Biol Med        ISSN: 1751-0473


Background

Second generation sequencing remains the most cost-effective and readily available technique for complete genome sequencing. While read lengths are increasing, assembly and scaffolding of complete genome sequences often remains a challenge [1]. Paired-end sequencing can greatly improve this by creating scaffolds [2], but if paired-end information is not available or has been exhausted, the similarity provided by a closely related reference genome can provide independent information to assist with scaffolding of the contigs [3]. Some assemblers, MIRA [4] for example, can create a reference-based assembly, where the genome is scaffolded during the assembly process, and impose the complete genome structure of the reference on the assembly [5,6]. While this restriction may not be problematic for genomes with low rearrangement rates, some bacterial genomes are highly plastic with mobile regions that can be located in different genomic locations even in closely related species [3,7]. In addition to mobile genetic elements, bacterial genomes frequently have large-scale rearrangements via recombination between multicopy sequences such as IS elements and rrn operons. These rearrangements decrease synteny between related genomes by either inverting or translocating the intervening region between the multicopy sequences [8,9]. Most scaffolding programs that are currently available use the information provided by mate-pair sequencing to combine contigs into longer scaffolds [2,10]. Also, there is software for manual genome scaffolding [11] based on ordering the contigs. Here, we present the program Scaffold_builder that provides a complementary approach, exploiting the homology of a reference genome to order contigs and build scaffolds. After composing an initial de novo assembly from reads and possible paired-end data, a scaffold of the contigs is built using a reference genome. Thus, we accept the contig sequences in regions where the de novo assembly is certain, and allow the reference genome to add structure to the composed sequence by ordering and orienting the contigs. An additional feature of Scaffold_builder compared to tools like CONTIGuator [12], Projector 2 [13], and ABACAS [14], e-RGA [15] which take a similar approach of contig mapping, is that sequential contigs can be merged if their terminal sequences are highly similar. In such cases, the reference genome helps resolve cases where the de novo assembly program broke the sequence into separate contigs, e.g. because the overlapping region was too short or because the assembly was ambiguous. Scaffolds provide a better insight into the sequencing coverage, and generate a more accurate estimation of the number and sizes of the remaining gaps in the sequence. Moreover, the map of the scaffolded contigs to the reference genome allows an exploration of whether those gaps lie in toxic regions that may prevent cloning, or in complex regions that may hinder accurate sequencing with one or more current sequencing approaches. Thus, the scaffolds can direct closing efforts, the most difficult part of completing microbial genomes.

Methods

Real data

Two laboratory derived derivatives of Salmonella enterica serovar Typhimurium LT2 (S. Typhimurium LT2), SDT1291 and G455 were sequenced on the 454 GS FLX Titanium. Sequences were deposited in the Sequence Read Archive (ERP000942). An initial assembly was constructed using Newbler 2.7 [16] with default parameters. All contigs were used for scaffolding. The complete S. Typhimurium LT2 genome sequence (NC_003197) in the SEED database [17] was used as reference.

Simulated data

Simulated reads were sampled from three complete bacterial genomes: Escherichia coli 042 (FN554766), Lactobacillus salivarius UCC118 (NC_007929) and Salmonella enterica subsp. enterica sv Typhi P-stx-12 (NC_016832) using GemSIM 1.6 [18]. 400,000 reads with a length of 395 nt ± 116 nt were generated per genome using the supplied error model for Roche/454; these simulated reads were assembled using Newbler 2.7 [16] as above. Moreover, 1,600,000 paired 2 × 101 nt reads were generated per genome using supplied error model for Illumina GA IIx with TrueSeq SBS Kit v5‒GA and assembled using MIRA 3.19 [4]. Ten simulated datasets were generated for each strain.

Scaffolding

The Scaffold_builder program performs several analysis steps (Figure 1). First, Nucmer (with default parameters) is run to map contigs to the reference genome and the hits are parsed with Show-coords [19]. Contigs mapped to more than one location over at least 95% of their length (a default value) are considered ambiguously mapped and reported separately. Then, Scaffold_builder uses the location of the longest hit to place the entire contig, while extending any unmapped “overhangs” along the reference. Although this potentially extends the alignment beyond similar region, we trust the sequence of the contig to be correct. Any contig that aligns entirely within a region that already contains a longer contig is not scaffolded and reported separately (as a duplicate region). The algorithm then proceeds along the reference sequence and inserts an appropriate number of Ns between every pair of non-overlapping contigs. For contigs that are mapped with an overlap, Scaffold_builder checks whether those contigs could be joined, broken, or placed end-to-end, by determining the sequence identity between their terminal regions using the Needleman–Wunsch algorithm [20]. Initially, the length of the terminal sequences tested for 300 nt of similarity, or less if the overlapping region is shorter. Then, the program elongates these regions in increments of 10 nt and re-aligns them to a maximum of 400 nt, testing if the similarity between the terminal sequences exceeds the minimum identity threshold of 80%. If the minimum identity is not reached, then Scaffold_builder either joins the contigs end-to-end (default) or breaks the scaffold. All the values mentioned above are default values and can be optionally adjusted by the user.
Figure 1

Flow diagram of the Scaffold_builder script.

Flow diagram of the Scaffold_builder script.

Output

The program outputs a log file with all the decisions that were made, a map of the original and scaffolded locations of the sequence against the reference genome, and a summary of the statistics associated with the assembly before and after scaffolding.

Web-based version

As an alternative to the command line version of the program, we have created a user-friendly web version of Scaffold_builder which provides a tutorial and example of input and output file. The web server is available at http://edwards.sdsu.edu/scaffold_builder.

Results and discussion

Scaffold_builder is a versatile tool that allows ordering and merging of de novo assembled contigs by using a homologous reference genome. Using three simulated datasets, as well as two newly sequenced S. Typhimurium LT2 genomes, we show that Scaffold_builder greatly reduced the number of contigs of the draft genomes. As shown in Additional file 1: Table S1, the number of contigs was decreased by 53% ± 12% while increasing their average length by 114% ± 65%. An average of 38% ± 13% of the contigs were not scaffolded, including sequences that could not be mapped, sequences that mapped in a location that was occupied by another contig with a longer hit, and contigs that were mapped ambiguously. Notably, an average of more than 63% ± 31% of contigs that overlapped after mapping to the reference sequence could be merged using the default identity threshold of 80% (see Additional file 1: Table S1). The average length of these merged regions was 56 nt ± 24 nt. Although these short regions could not unambiguously be assembled by the Newbler or MIRA assembler, the high sequence identity combined with the homology of the reference genome nevertheless enabled Scaffold_builder to join them into a single scaffold. Genes that overlapped with the break points between contigs include rrn operons, transposases and other known multi-copy genes (Additional file 2: Table S2). This illustrates that a reference sequence provides the structure needed to bridge many of the repeat regions that the de novo assembler was unable to join. The increase in sequence length obtained by using Scaffold_builder compares well with the results obtained with CONTIGuator [12] when using similar mapping parameters (CONTIGuator uses blastn [21] to map the contigs). In all of the simulated and real cases examined, Scaffold_builder scaffolded the same number or more sequences than CONTIGuator (Additional file 3: Table S3). Although Scaffold_builder was written and tested using bacterial genomes, the tool can also be used with smaller or larger genomes. For eukaryotic genomes Nucmer requires more memory; for example, for the largest human chromosome Nucmer requires 15.4 bytes per base pair [19]. Scaffold_builder has been used to create scaffolds of the 3.1 Gbp genome of the California Sea lion, Zalophus californianus, that was scaffolded against its closest relative, the dog Canis familiaris[22]. Scaffold_builder greatly extended the length of the contigs compared with the initial assembly. Even though the tool was tested with second-generation sequencing data, Scaffold_builder does not rely on a particular platform. The third-generation sequencing platforms, e.g. PacBio RS, provide longer reads and contigs than the previous generations [23]. Long contigs facilitate the mapping, ordering and scaffolding of the sequences and reduces the number of ambiguous sequences.

Limitations

Scaffold_builder depends on an available genome sequence of a closely related organism. Mapping success depends on sequence similarity [24]. One limitation of Scaffold_builder is the inability to detect large-scale genomic rearrangements relative to the reference if the end points of the rearrangements fall within contig gaps. Multi-copy sequences such as rRNA operons may more frequent in rearrangement breakpoints due to similar recombination, and they are also more difficult to assemble due to ambiguous read mapping. The presence of large-scale rearrangements relative to the reference can be resolved by either using mate-pair sequencing or long-read sequencing across the gaps, or by using PCR to determine the correct sequences flanking the gap [25]. Scaffold_builder tries to join contigs if their overlapping region is highly similar in sequence. In order to test the program limitations, we selected one simulated dataset of Lactobacillus salivarius UCC118, and using Mauve [26] we evaluated 30 sequential contig pairs that exceeded the identity threshold of 80% and were merged. Only 1 pair of contigs was joined incorrectly. Contigs mapped to more than one location over at least 95% of their length are not scaffolded. These contigs are labeled as ambiguous in the output file, and probably the result of a duplicated region in the reference genome. Conversely, contigs that are mapped to the reference within a region that already contains a longer contig are duplicated regions in the query genome, and also reported separately.

Conclusions

Here we present Scaffold_builder as a solution to scaffolding pre-assembled contigs against a reference sequence. Scaffold_builder enables contigs derived from draft genome sequencing to be sorted and similar contig termini to be merged where the de novo assembly program broke the contigs, for example, in a repeat region. While generating draft genomes remains considerably faster and cheaper than generating complete genome sequences, Scaffold_builder both increases the value of these drafts by predicting global genomic context, and brings down the cost of gap closure by suggesting targets for PCR validation.

Availability and requirements

Project name: Scaffold_builder Project and webserver home page: http://edwards.sdsu.edu/scaffold_builder. Operating system: the program was developed for Linux but should also run on Windows or OS X command line interpreters. Programming language: Python. Other requirements: Nucmer (http://mummer.sourceforge.net) and Python programming language (http://www.python.org). License: GNU GPL3. Any restrictions to use by non-academics: no special restrictions.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

GGZS wrote the Scaffold_builder script. GGZS, BED, TDM and RAE designed the algorithm. TDM, KE and EAD validated the tool. GGZS, BED, RS, TDM and RAE drafted the manuscript. RAE conceived of the study. All authors read and approved the final manuscript.

Additional file 1: Table S1

Details of scaffolding statistics for real and simulated genomes. Click here for file

Additional file 2: Table S2

Details of the gaps between scaffolded contigs. These details include the locations and lengths of the gaps (negative gap lengths indicate insertions relative to the reference) and identifiers and annotation of any overlapping genes. The file contains 4 tabs: one each for the chromosomes and plasmids of both S. Typhimurium strains. Click here for file

Additional file 3: Table S3

Comparison between Scaffold_builder and CONTIGuator. Click here for file
  25 in total

1.  Rapid approach to determine rrn arrangement in Salmonella serovars.

Authors:  R A Helm; S Maloy
Journal:  Appl Environ Microbiol       Date:  2001-07       Impact factor: 4.792

2.  Comparative genome assembly.

Authors:  Mihai Pop; Adam Phillippy; Arthur L Delcher; Steven L Salzberg
Journal:  Brief Bioinform       Date:  2004-09       Impact factor: 11.622

3.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

4.  Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences.

Authors:  Song Gao; Wing-Kin Sung; Niranjan Nagarajan
Journal:  J Comput Biol       Date:  2011-09-19       Impact factor: 1.479

5.  progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement.

Authors:  Aaron E Darling; Bob Mau; Nicole T Perna
Journal:  PLoS One       Date:  2010-06-25       Impact factor: 3.240

6.  Versatile and open software for comparing large genomes.

Authors:  Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal:  Genome Biol       Date:  2004-01-30       Impact factor: 13.583

7.  Projector: automatic contig mapping for gap closure purposes.

Authors:  Sacha A F T van Hijum; Aldert L Zomer; Oscar P Kuipers; Jan Kok
Journal:  Nucleic Acids Res       Date:  2003-11-15       Impact factor: 16.971

8.  Local mobile gene pools rapidly cross species boundaries to create endemicity within global Vibrio cholerae populations.

Authors:  Yan Boucher; Otto X Cordero; Alison Takemura; Dana E Hunt; Klaus Schliep; Eric Bapteste; Philippe Lopez; Cheryl L Tarr; Martin F Polz
Journal:  MBio       Date:  2011-04-12       Impact factor: 7.867

9.  Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies.

Authors:  Sacha A F T van Hijum; Aldert L Zomer; Oscar P Kuipers; Jan Kok
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

10.  Microbes, metagenomes and marine mammals: enabling the next generation of scientist to enter the genomic era.

Authors:  Robert Alan Edwards; John Matthew Haggerty; Noriko Cassman; Julia Christine Busch; Kristen Aguinaldo; Sowmya Chinta; Meredith Houle Vaughn; Robert Morey; Timothy T Harkins; Clotilde Teiling; Karin Fredrikson; Elizabeth Ann Dinsdale
Journal:  BMC Genomics       Date:  2013-09-04       Impact factor: 3.969

View more
  25 in total

1.  Genome Sequence of Geobacillus sp. Strain ZGt-1, an Antibacterial Peptide-Producing Bacterium from Hot Springs in Jordan.

Authors:  Rawana N Alkhalili; Rajni Hatti-Kaul; Björn Canbäck
Journal:  Genome Announc       Date:  2015-07-23

2.  Genomic Comparison of the Closely-Related Salmonella enterica Serovars Enteritidis, Dublin and Gallinarum.

Authors:  T David Matthews; Robert Schmieder; Genivaldo G Z Silva; Julia Busch; Noriko Cassman; Bas E Dutilh; Dawn Green; Brian Matlock; Brian Heffernan; Gary J Olsen; Leigh Farris Hanna; Dieter M Schifferli; Stanley Maloy; Elizabeth A Dinsdale; Robert A Edwards
Journal:  PLoS One       Date:  2015-06-03       Impact factor: 3.240

3.  Contig-Layout-Authenticator (CLA): A Combinatorial Approach to Ordering and Scaffolding of Bacterial Contigs for Comparative Genomics and Molecular Epidemiology.

Authors:  Sabiha Shaik; Narender Kumar; Aditya K Lankapalli; Sumeet K Tiwari; Ramani Baddam; Niyaz Ahmed
Journal:  PLoS One       Date:  2016-06-01       Impact factor: 3.240

4.  Assembling the Setaria italica L. Beauv. genome into nine chromosomes and insights into regions affecting growth and drought tolerance.

Authors:  Kevin J Tsai; Mei-Yeh Jade Lu; Kai-Jung Yang; Mengyun Li; Yuchuan Teng; Shihmay Chen; Maurice S B Ku; Wen-Hsiung Li
Journal:  Sci Rep       Date:  2016-10-13       Impact factor: 4.379

5.  Virus Adaptation Following Experimental Infection of Chickens with a Domestic Duck Low Pathogenic Avian Influenza Isolate from the 2017 USA H7N9 Outbreak Identifies Polymorphic Mutations in Multiple Gene Segments.

Authors:  Klaudia Chrzastek; Karen Segovia; Mia Torchetti; Mary Lee Killian; Mary Pantin-Jackwood; Darrell R Kapczynski
Journal:  Viruses       Date:  2021-06-18       Impact factor: 5.048

6.  Comparative genomics of 274 Vibrio cholerae genomes reveals mobile functions structuring three niche dimensions.

Authors:  Bas E Dutilh; Cristiane C Thompson; Ana C P Vicente; Michel A Marin; Clarence Lee; Genivaldo G Z Silva; Robert Schmieder; Bruno G N Andrade; Luciane Chimetto; Daniel Cuevas; Daniel R Garza; Iruka N Okeke; Aaron Oladipo Aboderin; Jessica Spangler; Tristen Ross; Elizabeth A Dinsdale; Fabiano L Thompson; Timothy T Harkins; Robert A Edwards
Journal:  BMC Genomics       Date:  2014-08-05       Impact factor: 3.969

7.  Sequencing at sea: challenges and experiences in Ion Torrent PGM sequencing during the 2013 Southern Line Islands Research Expedition.

Authors:  Yan Wei Lim; Daniel A Cuevas; Genivaldo Gueiros Z Silva; Kristen Aguinaldo; Elizabeth A Dinsdale; Andreas F Haas; Mark Hatay; Savannah E Sanchez; Linda Wegley-Kelly; Bas E Dutilh; Timothy T Harkins; Clarence C Lee; Warren Tom; Stuart A Sandin; Jennifer E Smith; Brian Zgliczynski; Mark J A Vermeij; Forest Rohwer; Robert A Edwards
Journal:  PeerJ       Date:  2014-08-19       Impact factor: 2.984

8.  Genome Sequences of Two Pseudorabies Virus Strains Isolated in Greece.

Authors:  Konstantinos V Papageorgiou; Nicolás M Suárez; Gavin S Wilkie; George Filioussis; Nikolaos Papaioannou; Hans J Nauwynck; Andrew J Davison; Spyridon K Kritas
Journal:  Genome Announc       Date:  2016-01-21

Review 9.  Metagenomic Assembly: Overview, Challenges and Applications.

Authors:  Jay S Ghurye; Victoria Cepeda-Espinoza; Mihai Pop
Journal:  Yale J Biol Med       Date:  2016-09-30

10.  HHV-8-unrelated primary effusion-like lymphoma associated with clonal loss of inherited chromosomally-integrated human herpesvirus-6A from the telomere of chromosome 19q.

Authors:  Enjie Zhang; Victoria E Cotton; Alberto Hidalgo-Bravo; Yan Huang; Adam J Bell; Ruth F Jarrett; Gavin S Wilkie; Andrew J Davison; Ellie P Nacheva; Reiner Siebert; Aneela Majid; Inga Kelpanides; Sandrine Jayne; Martin J Dyer; Nicola J Royle
Journal:  Sci Rep       Date:  2016-03-07       Impact factor: 4.379

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.