Employing whole genome mapping for optimal de novo assembly of bacterial genomes.

Basil Britto Xavier, Julia Sabirova, Moons Pieter, Jean-Pierre Hernalsteens, Henri de Greve, Herman Goossens, Surbhi Malhotra-Kumar.

Abstract

BACKGROUND: De novo genome assembly can be challenging due to inherent properties of the reads, even when using current state-of-the-art assembly tools based on de Bruijn graphs. Often users are not bio-informaticians and, in a black box approach, utilise assembly parameters such as contig length and N50 to generate whole genome sequences, potentially resulting in mis-assemblies.
FINDINGS: Utilising several assembly tools based on de Bruijn graphs like Velvet, SPAdes and IDBA, we demonstrate that at the optimal N50, mis-assemblies do occur, even when using the multi-k-mer approaches of SPAdes and IDBA. We demonstrate that whole genome mapping can be used to identify these mis-assemblies and can guide the selection of the best k-mer size which yields the highest N50 without mis-assemblies.
CONCLUSIONS: We demonstrate the utility of whole genome mapping (WGM) as a tool to identify mis-assemblies and to guide k-mer selection and higher quality de novo genome assembly of bacterial genomes.

Year:  2014        PMID: 25077983      PMCID: PMC4118782          DOI: 10.1186/1756-0500-7-484

Source DB:  PubMed          Journal:  BMC Res Notes        ISSN: 1756-0500


Findings

Genome assembly is often the first step toward the biological interpretation of sequence data, so sub-optimally assembled genomes can lead to faulty conclusions [1]. Factors that lower assembly quality include sequence quality, repetitive sequences, base composition, genome size and low genome coverage [2,3], all of which complicate downstream data analysis with the available tools [4]. Currently, de novo assemblers based on de Bruijn graphs are considered to yield the best results, provided that sequence quality and coverage are sufficient. Such assemblers, like Velvet [5] and SPAdes [6], use k-mers as building blocks. Because most users are not bio-informaticians, these tools are often treated as a black box, and the quality of an assembly is usually judged by summary statistics such as the N50 and the size and number of contigs or scaffolds produced [7]. However, the choice of k-mer size is crucial, as k-mer sizes that are too low or too high lead to sub-optimal assemblies. Low quality reads can produce false positive vertices, repeats lead to branching, and an uneven distribution of reads results in gaps. Smaller k-mers reduce the problems associated with low quality reads and their uneven distribution, while larger k-mers help to bridge repeat regions and so reduce branching [8]. In a balancing exercise, various k-mer sizes are usually tried, and the assembly is optimized by aiming for high N50 values and fewer, longer contigs. Whole Genome Mapping (WGM; Opgen Inc, Gaithersburg, MD, USA) is a relatively novel technique that generates high-resolution restriction maps of a genome based on the alignment of single DNA molecules that are cut with restriction enzymes and ordered with high resolution and accuracy [9].
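The effect of k-mer size on repeat-induced branching can be illustrated with a toy example. The sketch below is not taken from Velvet, SPAdes or IDBA; it builds a simplified, single-strand de Bruijn graph (no reverse complements, no sequencing errors) directly from a made-up sequence and counts the nodes that have more than one successor. The function name and the toy sequence are assumptions chosen for illustration only.

```python
from collections import defaultdict


def de_bruijn_branching_nodes(sequence: str, k: int) -> int:
    """Count (k-1)-mer nodes with more than one distinct successor.

    Each k-mer in the sequence is an edge from its (k-1)-mer prefix
    to its (k-1)-mer suffix. A node with two or more distinct
    successors is a branch point in the graph.
    """
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


# Hypothetical toy sequence: the 5-base repeat "ACGTA" occurs twice,
# each time with a different base before and after it.
toy_genome = "TTGACGTACCGGAACGTAGGCT"

for k in (5, 7):
    print(k, de_bruijn_branching_nodes(toy_genome, k))
# 5 1   (the repeat is longer than the 4-base nodes, so the graph branches)
# 7 0   (6-base nodes span the whole repeat, so the branch disappears)
```

In real data the picture is less clean, because larger k-mers also require more coverage and are more sensitive to read errors, which is the other side of the balancing exercise described above.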
WGM has proven helpful in mapping de novo assembled contigs against previously sequenced related genomes [10]. In this paper, we evaluated the utility of WGM for selecting the proper k-mer size and for optimizing the parameters of de novo genome assembly. The whole genome sequence of a methicillin-resistant Staphylococcus aureus (MRSA) strain (E-MRSA15-CC22-SCCmecIV) was generated on an Illumina HiSeq-2000 via 2 × 150 bp paired-end sequencing [11]. Reads were de novo assembled using Velvet, SPAdes and IDBA-UD with a range of k-mer sizes. Velvet, which uses a single k-mer size per run, was tested with k-mer sizes from 81 to 123. With increasing k-mer size, the N50 (up to k-mer size 115) and the longest contig size initially increased while the total number of contigs decreased, trends that would normally be read as improvements in the assembly (Table 1). Next, a whole genome map of S. aureus EMRSA-15 was generated using WGM and aligned, using MapSolver, with the assembly files corresponding to the different k-mer sizes. Although the percentage of the genome covered by contigs increased with increasing k-mer size, a mis-assembly (spanning 119 kb) was identified among the mapped contigs (>40 kb) for higher k-mer sizes (Table 1). The best assembly, meaning the one without mis-assemblies, was therefore obtained with a k-mer size of 93, even though, for example, a k-mer size of 115 gave a higher N50 and fewer contigs (Table 1, Figure 1A, B). In contrast, SPAdes, which combines a range of k-mer sizes in a multi-k-mer approach, did not yield any mis-assemblies on this sequence for the best assemblies as judged by N50 (Figure 1C). The same was true for IDBA, which similarly uses an iterative process over multiple k-mer sizes, removing assembled sequences in subsequent rounds of analysis.
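Because the N50 is central to the comparison above, a minimal sketch of how it is commonly computed may be useful: the N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the contig lengths below are hypothetical and are not taken from the assemblies in this study. The second call shows why a high N50 alone can be misleading: incorrectly joining two contigs into one raises the N50 without making the assembly any more correct.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the
    length of the contig at which the running total first reaches
    at least half of the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical assembly with six contigs (lengths in kb).
print(n50([80, 70, 50, 40, 30, 20]))   # 70

# Same total length, but the 80 kb and 70 kb contigs were joined
# (correctly or not) into a single 150 kb contig.
print(n50([150, 50, 40, 30, 20]))      # 150
```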
Table 1

Assembly statistics of Velvet applied on Staphylococcus aureus (MRSA) strain E-MRSA15-CC22-SCCmecIV showing an increase in contig size and N50 when using higher k-mer sizes, but revealing a mis-assembly starting from k-mer size 97 using whole genome mapping

K-mer size  N50     Total number of contigs  Longest contig size  Mis-assemblies on mapped contigs*  Approx. nts involved in mis-assemblies
Velvet
81          162295  40                       340060               1 (10)                             122303
83          170447  38                       351373               1 (9)                              122303
85          170449  37                       351321               0 (10)
87          173763  33                       351326               0 (10)
89          173765  33                       351394               0 (10)
91          173767  33                       351330               0 (10)
93          173769  35                       340092               0 (10)
97          175770  33                       365247               1 (9)                              130273
99          175776  33                       365260               1 (10)                             130273
101         187438  32                       365623               1 (9)                              130273
103         187448  32                       365625               1 (9)                              130273
105         187458  32                       365638               1 (9)                              130273
107         187465  33                       365647               1 (9)                              130273
109         212189  32                       365656               1 (9)                              130273
111         212287  33                       349286               2 (8)                              93632 & 153207
113         212292  34                       349288               1 (10)                             93634
115         212294  34                       349290               1 (10)                             118928
117         174074  35                       349419               1 (11)                             93634
119         174076  35                       349423               1 (11)                             93638
121         170642  37                       349435               0 (11)
123         170654  38                       340456               0 (10)

*Number of mapped contigs indicated between brackets.
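The selection rule applied to Table 1 can be sketched in a few lines of Python. The k-mer sizes, N50 values and mis-assembly counts below are copied from Table 1; the function names are hypothetical and this is only an illustration of the reasoning, not the authors' workflow or a MapSolver feature. Choosing by N50 alone picks k-mer size 115, while restricting the choice to assemblies without mis-assemblies on the whole genome map picks k-mer size 93.

```python
# (k-mer size, N50, mis-assemblies on mapped contigs), copied from Table 1.
TABLE_1 = [
    (81, 162295, 1), (83, 170447, 1), (85, 170449, 0), (87, 173763, 0),
    (89, 173765, 0), (91, 173767, 0), (93, 173769, 0), (97, 175770, 1),
    (99, 175776, 1), (101, 187438, 1), (103, 187448, 1), (105, 187458, 1),
    (107, 187465, 1), (109, 212189, 1), (111, 212287, 2), (113, 212292, 1),
    (115, 212294, 1), (117, 174074, 1), (119, 174076, 1), (121, 170642, 0),
    (123, 170654, 0),
]


def best_by_n50(rows):
    """Pick the row with the highest N50, ignoring mis-assemblies."""
    return max(rows, key=lambda row: row[1])


def best_validated(rows):
    """Pick the highest-N50 row among those with zero mis-assemblies."""
    clean = [row for row in rows if row[2] == 0]
    return max(clean, key=lambda row: row[1])


print(best_by_n50(TABLE_1))     # (115, 212294, 1)
print(best_validated(TABLE_1))  # (93, 173769, 0)
```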

Figure 1

Alignment of contigs to the corresponding whole genome map: A) Velvet-derived assembly using k-mer size 93, revealing no mis-assemblies; B) Velvet-derived assembly using k-mer size 115, corresponding to the highest N50, but revealing mis-assemblies; C) SPAdes-derived assembly using a multi-k-mer approach up to k-mer size 83, yielding the optimal N50 for this sequence and showing no mis-assemblies; D) SPAdes-derived assembly using a multi-k-mer approach up to k-mer size 77, yielding the optimal N50 for this sequence, but showing mis-assemblies.

The general applicability of these results was investigated using two additional, similarly obtained S. aureus sequences [UA-S391 (accession # CP007690) and Mu50-CC5-SCCmecII (ATCC700699; previously sequenced and available under accession # NC_002758)]. These again revealed mis-assemblies for Velvet at the highest N50 values, while error-free assemblies could be obtained at lower k-mer sizes (data not shown). In addition, the sequence of Klebsiella pneumoniae ST258 was similarly generated on an Illumina HiSeq-2000 via 2 × 150 bp paired-end sequencing and was assembled using all three assembly tools. In this case, in addition to the mis-assemblies seen for Velvet, SPAdes and IDBA also produced mis-assemblies for certain k-mer sizes (Figure 1D), further demonstrating the potential of WGM to identify mis-assemblies, even for assemblers that use multi-k-mer approaches (Additional file 1: Table S2).

Conclusion

Genome assembly based on de Bruijn graphs can yield mis-assemblies when assemblies are judged only by standard parameters such as the total number and length of the contigs and the N50. However, Whole Genome Mapping provides a powerful tool to identify such mis-assemblies and to select the k-mer sizes that produce optimally assembled genomes. Despite its additional cost, the biological need for error-free and complete genomes makes WGM an indispensable technique during the process of genome assembly and its validation.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BBX participated in the design of the study and performed the data analysis; JS participated in the design of the study and was involved in whole genome mapping; PM participated in conceiving and drafting the manuscript and performed the image processing; JPH and HdG provided the strains and were involved in sequencing for this study; HG and SMK participated in conceiving the study and revised the manuscript. All authors read and approved the final manuscript.

Additional file 1: Table S2

Assembly statistics of Velvet, SPAdes and IDBA-UD applied on Staphylococcus aureus (MRSA) strain E-MRSA15-CC22-SCCmecIV and assembly statistics of Velvet for Klebsiella pneumoniae showing an initial increase in contig size and N50 when using higher k-mer sizes, but revealing mis-assemblies associated with higher k-mer sizes in some cases.

1.  SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.

Authors:  Anton Bankevich; Sergey Nurk; Dmitry Antipov; Alexey A Gurevich; Mikhail Dvorkin; Alexander S Kulikov; Valery M Lesin; Sergey I Nikolenko; Son Pham; Andrey D Prjibelski; Alexey V Pyshkin; Alexander V Sirotkin; Nikolay Vyahhi; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal:  J Comput Biol       Date:  2012-04-16       Impact factor: 1.479

2.  Beware of mis-assembled genomes.

Authors:  Steven L Salzberg; James A Yorke
Journal:  Bioinformatics       Date:  2005-12-15       Impact factor: 6.937

3.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors:  Daniel R Zerbino; Ewan Birney
Journal:  Genome Res       Date:  2008-03-18       Impact factor: 9.043

4.  Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly.

Authors:  Yingrui Li; Hancheng Zheng; Ruibang Luo; Honglong Wu; Hongmei Zhu; Ruiqiang Li; Hongzhi Cao; Boxin Wu; Shujia Huang; Haojing Shao; Hanzhou Ma; Fan Zhang; Shuijian Feng; Wei Zhang; Hongli Du; Geng Tian; Jingxiang Li; Xiuqing Zhang; Songgang Li; Lars Bolund; Karsten Kristiansen; Adam J de Smith; Alexandra I F Blakemore; Lachlan J M Coin; Huanming Yang; Jian Wang; Jun Wang
Journal:  Nat Biotechnol       Date:  2011-07-24       Impact factor: 54.908

5.  Enhanced de novo assembly of high throughput pyrosequencing data using whole genome mapping.

Authors:  Fatma Onmus-Leone; Jun Hang; Robert J Clifford; Yu Yang; Matthew C Riley; Robert A Kuschner; Paige E Waterman; Emil P Lesho
Journal:  PLoS One       Date:  2013-04-17       Impact factor: 3.240

6.  Genome assembly forensics: finding the elusive mis-assembly.

Authors:  Adam M Phillippy; Michael C Schatz; Mihai Pop
Journal:  Genome Biol       Date:  2008-03-14       Impact factor: 13.583

7.  Optical mapping discerns genome wide DNA methylation profiles.

Authors:  Gene E Ananiev; Steve Goldstein; Rod Runnheim; Dan K Forrest; Shiguo Zhou; Konstantinos Potamousis; Chris P Churas; Veit Bergendahl; James A Thomson; David C Schwartz
Journal:  BMC Mol Biol       Date:  2008-07-30       Impact factor: 2.946

Review 8.  Whole-genome sequencing in bacteriology: state of the art.

Authors:  Michael J Dark
Journal:  Infect Drug Resist       Date:  2013-10-08       Impact factor: 4.003

9.  REAPR: a universal tool for genome assembly evaluation.

Authors:  Martin Hunt; Taisei Kikuchi; Mandy Sanders; Chris Newbold; Matthew Berriman; Thomas D Otto
Journal:  Genome Biol       Date:  2013-05-27       Impact factor: 13.583

10.  Complete Genome Sequences of Two Prolific Biofilm-Forming Staphylococcus aureus Isolates Belonging to USA300 and EMRSA-15 Clonal Lineages.

Authors:  J S Sabirova; B B Xavier; J-P Hernalsteens; H De Greve; M Ieven; H Goossens; S Malhotra-Kumar
Journal:  Genome Announc       Date:  2014-06-26