SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information.

Marten Boetzer, Walter Pirovano.

Abstract

BACKGROUND: The recent introduction of the Pacific Biosciences RS single molecule sequencing technology has opened new doors to scaffolding genome assemblies in a cost-effective manner. The long read sequence information promises to enhance the quality of incomplete and inaccurate draft assemblies constructed from Next Generation Sequencing (NGS) data.
RESULTS: Here we propose a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone. On a test set comprising six bacterial draft genomes, assembled using either a single Illumina MiSeq or Roche 454 library, we show that even a 50× coverage of uncorrected PacBio RS long reads is sufficient to drastically reduce the number of contigs. Comparisons to the AHA scaffolder indicate that our strategy is better able to produce (nearly) complete bacterial genomes.
CONCLUSIONS: The current work describes our SSPACE-LongRead software, which is designed to upgrade incomplete draft genomes using single molecule sequences. We conclude that the recent advances of the PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, make it possible to scaffold genomes in a fast and reliable manner.

Year:  2014        PMID: 24950923      PMCID: PMC4076250          DOI: 10.1186/1471-2105-15-211

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Since the introduction of Next Generation Sequencing (NGS) much attention has been given to the development of de novo assembly software. The aim is to combine the short sequencing reads into a minimum number of linear stretches, though this goal is only partially achieved by draft assembly methods such as Velvet [1], SOAPdenovo [2] or ABySS [3]. Consequently more emphasis has been placed on so-called “genome finishing” tools which aim to reduce the number of contiguous sequences, for instance by the use of distance information between short paired reads. Indeed draft assemblies can be significantly enhanced when applying a scaffolding routine [4-6]. Nonetheless these protocols can still not overcome major hurdles such as (large) repeats and low-coverage regions. Alternatively the use of long read sequences as offered by the PacBio RS methodology can potentially solve complex genomic situations, yet the algorithmic implementation still suffers from a relatively high error rate. At present so-called Continuous Long Reads (CLR) may even exceed a size of 20 kbp at the cost of an error rate of approximately 15%, whereas the shorter Circular Consensus Reads (CCS) can span maximally 3 kbp though at a 2.5% error rate [7]. It is a common thought that the quality of the CLR reads is yet insufficient for high quality assemblies unless the genome coverage is very high. Recently, Chin et al. [8] have presented a novel non-hybrid method called HGAP. Here only CLR reads are used to finish bacterial genomes giving the advantage of a single library preparation. Although a sequence coverage of approximately 50× is sufficient to correct the error rate, a higher coverage is needed to span repeated elements. Also a significant manual intervention is needed to polish the genomes (i.e. manual error correction). From a cost perspective it means that a relatively large budget is needed to close a single genome using only CLR reads, especially when (larger) eukaryotes are studied. 
In parallel, a hybrid assembly approach has been proposed to enhance error-prone CLR reads. In principle this is possible using either PacBio CCS reads or short read NGS data (or a combination of both). To date a few algorithms have been released that are capable of upgrading PacBio CLR data with high-accuracy data from CCS or short read NGS data, among which PacBioToCA [9] and LSC [10]. These are further incorporated into hybrid assembly methods such as Celera [11], MIRA [12] and ALLPATHS-LG [13]. Even though promising results have been obtained, the error-correction step with short reads requires a sufficient read length (>75 bp) and sequencing depth, and it is computationally demanding. Notably the PacBioToCA (PBcR) correction pipeline also supports non-hybrid PacBio assemblies in case C2 or newer sequence reads are used. As regards scaffolding, the AHA (A Hybrid Assembler) strategy is currently the most widely used approach: the method uses the CLR reads only for scaffolding of pre-assembled contigs [14], yet it generally results in incomplete assemblies and is not designed for large-scale genome assembly applications given the long runtime and input restrictions (the current draft genome size limit for AHA in the SMRT Analysis suite v2.0 is 160 Mb and 20,000 contigs). Recently a novel hybrid assembly tool has been released, Cerulean [15], which uses ABySS [3] contig graph information and uncorrected CLR reads to create genome scaffolds. Despite the promising results obtained, the requirement to use ABySS for the draft assembly is a limiting factor as other methods might generate a better draft [16]. Finally dedicated tools have been developed to close gaps within scaffolds using PacBio reads, among which PBJelly [17]. A future release of this software aims to also incorporate a scaffolding module.
Given the limitations encountered for short read sequencing (in terms of length) and PacBio long read sequencing (in terms of quality), the complete closure of prokaryotic and eukaryotic genomes is still a relatively expensive and difficult task. Indeed it can be observed that most genome submissions consist of hundreds (or thousands) of unconnected contigs. Here we present a hybrid approach to scaffold draft genomes in a cheap, fast and reliable manner. In brief the algorithm consists of three steps: 1) alignment of long reads against the pre-assembled contigs (or scaffolds), 2) computation of contig linkage from the alignment order and 3) scaffolding of contigs (or scaffolds) including placement of repeated elements. On a number of test datasets we show that our method, called SSPACE-LongRead, outperforms the AHA method in terms of genome completeness. Importantly the input draft assemblies were constructed using only a single Illumina MiSeq or Roche 454 library and a complementary PacBio RS C2 long read library. The combination of barcoding possibilities on the PacBio RS instrument and the relatively low coverage needed by the SSPACE-LongRead algorithm opens novel ways of scaffolding genomes at reduced costs compared, for instance, to the use of NGS mate paired-end libraries. Our software is set up in a user-friendly manner and is suited for analysis on small computing systems given its very fast runtime. Academics can request a free copy at http://www.baseclear.com/bioinformatics-tools/.

Results and discussion

In order to test the performance of SSPACE-LongRead and compare it to PacBio’s AHA scaffolds, we have performed an in-depth analysis on six bacterial datasets. These include B. trehalosi, two E. coli strains (K12-MG1655 and O157:H7), F. tularensis, M. haemolytica and S. enterica. Details are described in Table 1 and the Methods section. For each organism paired-end Illumina MiSeq reads (on average 90-100× coverage) were used to create two draft assemblies: one with Ray [18] and one with CLCbio de novo assembly software (CLC bio, Aarhus, Denmark). An alternative draft assembly was made with Newbler (Roche) using only Roche 454 reads (on average 45-50× coverage), except for M. haemolytica for which no 454 reads were available. Subsequently the draft contigs were scaffolded using uncorrected PacBio RS long CLR reads (up to 200× coverage) with both SSPACE-LongRead and AHA. The results are displayed in Table 2.
Table 1

Statistics of the sequence datasets used for the comparative study

| Organism | MiSeq total reads (K) | MiSeq total bases (Mbp) | MiSeq mean read length (bp) | 454 total reads (K) | 454 total bases (Mbp) | 454 mean read length (bp) | PacBio RS total reads (K) | PacBio RS total bases (Mbp) | PacBio RS mean read length (bp) |
|---|---|---|---|---|---|---|---|---|---|
| E. coli K12 MG1655 | 3,046.4 | 460.0 | 151 | 44.94 | 232.0 | 516 | 383.5 | 929.1 | 2,422 |
| E. coli O157:H7 | 3,794.9 | 548.5 | 144 | 946.2 | 219.1 | 231 | 403.9 | 1,100.3 | 2,724 |
| B. trehalosi | 1,718.2 | 249.2 | 145 | 351.9 | 190.7 | 542 | 205.1 | 499.9 | 2,437 |
| M. haemolytica | 1,724.4 | 249.4 | 144 | NA | NA | NA | 176.0 | 531.2 | 3,019 |
| F. tularensis | 926.7 | 199.2 | 214 | 178.9 | 74.5 | 416 | 176.4 | 399.8 | 2,266 |
| S. enterica | 1,943.8 | 279.8 | 143 | 351.5 | 127.1 | 361 | 394.7 | 1,000.2 | 2,534 |

Overall a 90-100× Illumina MiSeq and 45-50× Roche 454 genome coverage is used. For the PacBio dataset an input coverage of ~200× is available.
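
The coverage values quoted above follow from dividing the total number of sequenced bases by the genome size. The short sketch below illustrates this arithmetic for E. coli K12 MG1655 using the total bases from Table 1. The genome size is an assumption made for illustration (the 4,642,513 bp single-scaffold assembly reported in Table 2), and the function name is hypothetical.

```python
# Approximate fold coverage = total sequenced bases / genome size.
# Total bases come from Table 1; the genome size is an assumption
# (the 4,642,513 bp single-scaffold E. coli K12 assembly in Table 2).

def fold_coverage(total_bases_mbp: float, genome_size_bp: int) -> float:
    """Return the average depth of coverage for a read set."""
    return total_bases_mbp * 1_000_000 / genome_size_bp


ECOLI_K12_GENOME_BP = 4_642_513  # assumed genome size, taken from Table 2

# Total bases (Mbp) per platform for E. coli K12 MG1655, from Table 1.
ecoli_k12_bases_mbp = {
    "Illumina MiSeq": 460.0,
    "Roche 454": 232.0,
    "PacBio RS": 929.1,
}

for platform, mbp in ecoli_k12_bases_mbp.items():
    depth = fold_coverage(mbp, ECOLI_K12_GENOME_BP)
    print(f"{platform}: {depth:.0f}x")
```

Under these assumptions the three read sets come out at roughly 99×, 50× and 200×, in line with the coverage ranges stated for the datasets.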

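Table 2

Genome reconstruction of 6 bacterial genomes using different sequencing platforms and assembly strategies

| Organism | Assembler | Scaffolder | Expected scaffolds | Final scaffolds | Unaligned scaffolds | Sum (bp) | N50 | Gap size (bp) | Indels | Rearrangements | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| B. trehalosi | Ray | - | Unknown | 34 | - | 2,384,099 | 212,852 | 0 | - | - | - |
| B. trehalosi | Ray | AHA | Unknown | 21 | - | 2,390,466 | 245,559 | 6,367 | - | - | 110 min |
| B. trehalosi | Ray | SSPACE LongRead | Unknown | 7 | - | 2,410,351 | 1,215,562 | 8,899 | - | - | 16 min |
| B. trehalosi | CLC | - | Unknown | 62 | - | 2,361,409 | 146,347 | 0 | - | - | - |
| B. trehalosi | CLC | AHA | Unknown | 36 | - | 2,389,684 | 222,352 | 16,915 | - | - | 118 min |
| B. trehalosi | CLC | SSPACE LongRead | Unknown | 6 | - | 2,395,822 | 1,361,277 | 8,650 | - | - | 19 min |
| B. trehalosi | Newbler | - | Unknown | 58 | - | 2,362,898 | 117,742 | 0 | - | - | - |
| B. trehalosi | Newbler | AHA | Unknown | 21 | - | 2,391,876 | 505,738 | 12,781 | - | - | 117 min |
| B. trehalosi | Newbler | SSPACE LongRead | Unknown | 5 | - | 2,393,982 | 1,317,689 | 7,692 | - | - | 16 min |
| E. coli K12 | Ray | - | 1 | 99 | 0 | 4,583,740 | 95,924 | 0 | 0 | 2 | - |
| E. coli K12 | Ray | AHA | 1 | 57 | 0 | 4,632,207 | 220,952 | 32,147 | 2 | 2 | 194 min |
| E. coli K12 | Ray | SSPACE LongRead | 1 | 11 | 0 | 4,636,946 | 570,605 | 30,741 | 1 | 9 | 28 min |
| E. coli K12 | CLC | - | 1 | 126 | 0 | 4,554,695 | 88,183 | 0 | - | - | - |
| E. coli K12 | CLC | AHA | 1 | 57 | 0 | 4,636,666 | 497,336 | 34,587 | 2 | 6 | 214 min |
| E. coli K12 | CLC | SSPACE LongRead | 1 | 1 | 0 | 4,642,513 | 4,642,513 | 18,788 | 3 | 8 | 28 min |
| E. coli K12 | Newbler | - | 1 | 80 | 0 | 4,567,139 | 117,490 | 0 | - | - | - |
| E. coli K12 | Newbler | AHA | 1 | 12 | 0 | 4,652,318 | 3,320,126 | 45,090 | 6 | 14 | 201 min |
| E. coli K12 | Newbler | SSPACE LongRead | 1 | 2 | 0 | 4,635,316 | 3,716,545 | 7,793 | 7 | 10 | 32 min |
| E. coli O157:H7 | Ray | - | 10 | 144 | 1 | 5,432,073 | 112,112 | 0 | - | - | - |
| E. coli O157:H7 | Ray | AHA | 10 | 110 | 1 | 5,475,255 | 227,802 | 34,035 | 1 | 2 | 226 min |
| E. coli O157:H7 | Ray | SSPACE LongRead | 10 | 38 | 1 | 5,845,919 | 348,040 | 58,068 | 2 | 23 | 31 min |
| E. coli O157:H7 | CLC | - | 10 | 293 | 13 | 5,335,444 | 105,156 | 0 | - | - | - |
| E. coli O157:H7 | CLC | AHA | 10 | 238 | 8 | 5,437,860 | 201,528 | 42,214 | 4 | 9 | 312 min |
| E. coli O157:H7 | CLC | SSPACE LongRead | 10 | 33 | 2 | 5,539,369 | 1,172,184 | 51,676 | 13 | 17 | 32 min |
| E. coli O157:H7 | Newbler | - | 10 | 279 | 14 | 5,322,767 | 142,438 | 0 | - | - | - |
| E. coli O157:H7 | Newbler | AHA | 10 | 209 | 8 | 5,471,954 | 254,465 | 65,936 | 5 | 9 | 297 min |
| E. coli O157:H7 | Newbler | SSPACE LongRead | 10 | 39 | 3 | 5,565,065 | 703,452 | 75,126 | 11 | 34 | 37 min |
| F. tularensis | Ray | - | 3 | 100 | 0 | 1,806,660 | 25,623 | 0 | - | - | - |
| F. tularensis | Ray | AHA | 3 | 38 | 0 | 1,859,591 | 82,151 | 47,651 | 1 | 5 | 95 min |
| F. tularensis | Ray | SSPACE LongRead | 3 | 8 | 0 | 1,886,509 | 279,967 | 27,386 | 1 | 8 | 14 min |
| F. tularensis | CLC | - | 3 | 110 | 1 | 1,780,141 | 25,117 | 0 | - | - | - |
| F. tularensis | CLC | AHA | 3 | 53 | 1 | 1,844,586 | 63,063 | 50,494 | 0 | 6 | 104 min |
| F. tularensis | CLC | SSPACE LongRead | 3 | 7 | 1 | 1,877,533 | 444,696 | 19,639 | 2 | 6 | 18 min |
| F. tularensis | Newbler | - | 3 | 316 | 0 | 1,653,291 | 8,912 | 0 | - | - | - |
| F. tularensis | Newbler | AHA | 3 | 61 | 0 | 1,965,997 | 69,167 | 255,189 | 7 | 7 | 95 min |
| F. tularensis | Newbler | SSPACE LongRead | 3 | 7 | 0 | 1,867,474 | 480,062 | 160,504 | 16 | 13 | 14 min |
| M. haemolytica | Ray | - | Unknown | 80 | - | 2,639,260 | 75,015 | 0 | - | - | - |
| M. haemolytica | Ray | AHA | Unknown | 44 | - | 2,676,952 | 108,006 | 25,336 | - | - | 148 min |
| M. haemolytica | Ray | SSPACE LongRead | Unknown | 14 | - | 2,682,588 | 703,034 | 29,889 | - | - | 21 min |
| M. haemolytica | CLC | - | Unknown | 129 | - | 2,630,768 | 63,442 | 0 | - | - | - |
| M. haemolytica | CLC | AHA | Unknown | 41 | - | 2,769,108 | 239,432 | 73,082 | - | - | 166 min |
| M. haemolytica | CLC | SSPACE LongRead | Unknown | 8 | - | 2,742,871 | 1,996,208 | 33,032 | - | - | 25 min |
| S. enterica | Ray | - | 4 | 119 | 2 | 4,972,739 | 90,542 | 0 | - | - | - |
| S. enterica | Ray | AHA | 4 | 40 | 2 | 5,012,323 | 203,631 | 34,496 | 0 | 4 | 190 min |
| S. enterica | Ray | SSPACE LongRead | 4 | 20 | 2 | 5,112,337 | 488,483 | 27,988 | 0 | 6 | 28 min |
| S. enterica | CLC | - | 4 | 238 | 5 | 4,974,534 | 43,328 | 0 | - | - | - |
| S. enterica | CLC | AHA | 4 | 62 | 4 | 5,064,555 | 376,354 | 68,292 | 3 | 7 | 200 min |
| S. enterica | CLC | SSPACE LongRead | 4 | 7 | 3 | 5,038,082 | 3,235,544 | 21,588 | 6 | 2 | 34 min |
| S. enterica | Newbler | - | 4 | 101 | 12 | 4,990,994 | 372,513 | 0 | - | - | - |
| S. enterica | Newbler | AHA | 4 | 69 | 12 | 5,040,830 | 787,589 | 30,907 | 2 | 6 | 193 min |
| S. enterica | Newbler | SSPACE LongRead | 4 | 4 | 12 | 5,036,244 | 3,729,047 | 10,430 | 3 | 11 | 29 min |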

In italic-bold the platform/strategy that leads to the lowest number of assembled scaffolds is highlighted. The number of expected scaffolds refers to the number of chromosomes plus the number of plasmids present in the reference genome (if available). Generally the combination of 1) draft assembly using CLCbio for Illumina MiSeq reads or Newbler for Roche 454 reads and 2) scaffolding using SSPACE-LongRead for PacBio CLR reads gives the best results in terms of closure and time. Notably some draft assembly contigs are not covered by PacBio reads (such as PhiX control or bacterial host sequences). The number of errors introduced during scaffolding is limited, and these errors are often a consequence of true variations between the sequenced library and the earlier deposited reference genome.

SSPACE-LongRead effectively reconstructs (nearly) complete genomes

The assembly statistics show that both AHA and SSPACE-LongRead are able to significantly reduce the number of input draft contigs using error-prone PacBio RS CLR reads as guidance. Overall the total assembly length remains relatively stable. It is apparent that SSPACE-LongRead is better able to reconstruct continuous genome segments, given that the final number of scaffolds is generally lower than 10. In practice this generally means a reduction of the initial number of contigs by at least 90%. It should also be remarked that the runtime of these tools differs by roughly a factor of 7, making our software more suited for scaffolding genomes on smaller computing systems. In terms of accuracy (assessed through comparison with the corresponding reference genomes, which are not available for B. trehalosi and M. haemolytica) our software introduces somewhat more errors than AHA, but this is explained by the more conservative approach of AHA (leading to fewer contig connections). It should also be remarked that some of the apparent errors are actually true variations between the sequenced strain and the reference genome, an issue which is also explained by Koren et al. [19], as for each sequenced strain a close (but not necessarily the exact) reference was selected for quality assessment. Not all input contigs were covered by PacBio reads (i.e. no significant alignment could be found between the draft assembly and the PacBio reads). This may be explained by the possible presence of short sequences such as plasmids, which cannot be captured by the long insert PacBio libraries. For this reason a hybrid assembly approach provides more complete assemblies than experiments that include only one PacBio library (as a compromise needs to be made between a large insert library, which generates larger reads and better genome closure, and a short insert library, which also captures short plasmids). Another explanation is the presence of different DNA sources in the NGS library.
For instance, in the case of S. enterica a relatively high number of sequences remains after scaffolding the Newbler draft assembly. However 12 of these input contigs are not covered by PacBio reads and surprisingly most of these show a high similarity with B. taurus after a BLAST [20] analysis on the NCBI nr database. Other contigs not covered by PacBio reads show similarity with the 5,386 bp genome of phi X 174 (phiX), which is used by Illumina as a control to validate the quality of a run. In principle the corresponding reads should be removed from the MiSeq runs prior to downstream analysis.
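
The contig reduction and runtime figures discussed in this section can be checked directly against Table 2. The sketch below does so for a single example, the E. coli K12 CLC draft; the function names are hypothetical and the example was chosen only for illustration, so other rows of Table 2 give different values.

```python
# Contig reduction and runtime ratio for one row group of Table 2
# (E. coli K12, CLC draft). Function names are illustrative only.

def contig_reduction(draft_contigs: int, final_scaffolds: int) -> float:
    """Percentage reduction in the number of sequences after scaffolding."""
    return 100.0 * (draft_contigs - final_scaffolds) / draft_contigs


def runtime_ratio(aha_minutes: float, sspace_minutes: float) -> float:
    """How many times longer AHA ran than SSPACE-LongRead."""
    return aha_minutes / sspace_minutes


draft_contigs = 126          # CLC draft assembly
aha_scaffolds = 57           # after AHA
sspace_scaffolds = 1         # after SSPACE-LongRead
aha_minutes, sspace_minutes = 214, 28

print(f"AHA reduction:    {contig_reduction(draft_contigs, aha_scaffolds):.1f}%")
print(f"SSPACE reduction: {contig_reduction(draft_contigs, sspace_scaffolds):.1f}%")
print(f"Runtime ratio:    {runtime_ratio(aha_minutes, sspace_minutes):.1f}x")
```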

Genomes become less fragmented also at a low PacBio RS CLR read coverage

In Figure 1 it can be observed that a higher PacBio RS long read coverage leads to less fragmented genomes. Nonetheless a coverage of 50× is generally sufficient to reduce the number of genome fragments to fewer than 10 scaffolds. The best assembly results are obtained at a coverage between 110× and 160×. A higher coverage does not seem to improve the outcomes; instead it leads to more fragmented genomes. From a cost perspective, the PacBio XL-C2 chemistry specifications (about 300 Mbases per SMRT cell) imply that for a small bacterial genome closure can be achieved using one SMRT cell in combination with e.g. one MiSeq paired-end library. More recent improvements of the sequencing chemistry (P4-C2 and P5-C3) aim to further enhance the data throughput (>500 Mbases) and the read length (>20 kbp). Nonetheless a relatively high coverage is still needed to completely finish a bacterial genome, which is partly explained by the high error rate that needs to be corrected for. Generally the alignment of PacBio RS long reads to a set of contigs yields low alignment scores, and a non-conservative approach (based on only a few reads that span consecutive contigs) will likely lead to misassemblies. Moreover it should be considered that after sequencing the number of long reads (e.g. more than 5 or 10 kbp), which are essential to overcome large repeats, is relatively small compared to the number of short reads (the median read length is currently between 2 and 5 kbp). Thus the contribution of long reads to the overall sequencing coverage is limited. It is to be expected though that the total data yield per SMRT cell, as well as the mean read length, will significantly increase in the near future. Costs can then be further reduced if samples are barcoded and sequenced together on the same SMRT cell.
At this point a hybrid approach consisting of one short read paired-end and one long read PacBio library may eventually be the method of choice in terms of accuracy and costs compared to an alternative approach involving mate paired-end libraries. Last but not least, the computational time involved to close bacterial genomes allows high-throughput assemblies even on small compute systems. It should be underscored that the current study was performed using uncorrected PacBio RS CLR reads, thus bypassing a time-consuming error-correction step with short reads.
Figure 1

The effect of PacBio RS long read coverage on genome closure. Results are displayed for SSPACE-LongRead based on the CLCbio draft assembly for 5 organisms. For all samples the addition of PacBio reads has a positive effect and leads to a significant contig reduction. In general a 50× coverage is sufficient to scaffold over most gaps, though ideally a 110-160× coverage is required to guarantee an optimal performance of our software. Arguably a higher coverage (>160×) leads to more fragmented genomes, which is likely due to the increased complexity of the assembly graph.

Both the quality of the draft assembly and the selection of the assembly strategy are crucial for the success of an experiment

From Table 2 it can be observed that the assembly results are also largely dependent on the strategy chosen for constructing the draft assembly. Common criteria used to judge a draft assembly are the number of contigs (should be low) and the N50 value (should be high, as it indicates that at least 50% of the assembly is in contigs of at least this length). Nonetheless these quantitative measures do not necessarily guarantee the choice of the best assembly. It is therefore advised to assess the quality of an assembly using a dedicated tool, such as QUAST [21], which gives a complete overview of different quality metrics. Yet other factors can also play an important role in the selection of the most appropriate input assembly. For example, assemblies constructed with Ray do show a higher N50 value and a lower number of contigs compared to those constructed with CLCbio or Newbler. Therefore it might be intuitive to choose the Ray contigs as input for SSPACE-LongRead scaffolding. Nonetheless from the SSPACE-LongRead statistics it appears to be easier to reconstruct complete genomes from CLCbio or Newbler contigs. From further investigations (data not shown) we noticed that assemblies constructed with Ray contain repeated segments which are especially present at the contig edges. Similar observations were made for draft assemblies constructed from Illumina MiSeq reads with SPAdes [22] and MaSuRCA [23] (data not shown): the number of contigs was lower than or similar to that observed for CLCbio and Newbler, though the errors in the draft assembly led to more (erroneous) scaffolds. Although SSPACE-LongRead performs well on all tested draft assembly strategies, the user should preferably choose an assembly method that splits contigs at repeat boundaries. Consequently the assessment of the most appropriate draft assembly should not be solely based on the assembly quality metrics, as the extent of the genome fragment reduction also depends on the draft assembly strategy.
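
For readers unfamiliar with the N50 metric mentioned above, the following minimal sketch shows one common way to compute it from a list of contig lengths. The function name is hypothetical and the lengths in the example are invented for illustration; dedicated tools such as QUAST report this value together with many other metrics.

```python
# N50: sort contigs from longest to shortest and walk down the list until
# the running total reaches half of the assembly size; the length of the
# contig at which this happens is the N50.

def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Invented example: total = 300, half = 150, reached at the second contig.
print(n50([100, 80, 60, 40, 20]))

# A single-scaffold assembly has an N50 equal to its total length,
# as seen for the E. coli K12 CLC / SSPACE-LongRead row of Table 2.
print(n50([4_642_513]))
```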

Conclusions

We propose a novel tool for scaffolding pre-assembled contigs using long read information. We show that even error-prone PacBio RS CLR reads can be used to connect contigs, place repeats and consequently reconstruct bacterial genomes in fewer than 10 segments when using SSPACE-LongRead. Importantly only two libraries are needed for the hybrid assembly of a bacterial genome. One Illumina MiSeq or Roche-454 paired-end library is sufficient for the construction of a proper draft assembly using a state-of-the-art De Bruijn-graph or Overlap-consensus layout assembler, although the final quality is influenced by the exact method chosen. Importantly the number of contigs (or other metrics such as the N50 value) may not be the best criterion for selecting the best draft assembly. Indeed we show that, although draft assemblies created with Ray yield fewer contigs, draft assemblies created with CLCbio or Newbler software lead to more closed genomes after scaffolding. It can therefore be argued that ideally the user should choose a draft assembly method that places repeated elements into separate contigs: whereas CLCbio and Newbler tend to automatically break reads and contigs at repeat boundaries, Ray and SPAdes tend to merge unique and repeated sequences together in the same contigs. We argue that one PacBio RS library is sufficient to nearly finish the bacterial assembly. In this paper we show promising results using a PacBio coverage of 50×, a CLR error rate of 15% and a mean read length between 2 and 3 kb. It is likely that further improvements of the sequencing platforms and chemistry will bring additional benefits, i.e. a complete genome closure at lower costs. In this regard the introduction of a barcoding system for PacBio libraries will also help to use SMRT cells more efficiently and enlarge the capacity per sequencing run.
Our method can also be used for possible new sequencing platforms, such as the Illumina Moleculo and Oxford Nanopore systems, given that the input format (FASTA or FASTQ) is standardized and not specifically bound to a platform. At the same time it can be foreseen that additional methods will appear that perform assemblies using only long read information. At present the HGAP assembly method seems to be the only such available strategy; however there are some major issues with the overlap-consensus implementation (which requires an accurate prior estimation of the expected genome size) and the introduction of erroneous duplications (which need to be manually removed [24]). Moreover our SSPACE-LongRead method is optimized for high-throughput experiments (given the simple running mode of the script and short runtime) and can be the method of choice for upgrading existing draft genomes in a cost-effective manner. We also expect benefits for (larger) eukaryotic organisms for which genome submissions generally consist of highly fragmented chromosomes. We feel the current study opens new ways to address the genome assembly question and can positively contribute to the reconstruction of more complete genomic contexts.

Availability and requirements

• Project name: SSPACE-LongRead
• Project home page: http://www.baseclear.com/bioinformatics-tools/
• Operating systems: All major Linux platforms
• Programming languages: Perl, C++ (the latter is required for BLASR, see below)
• Other requirements: BLASR for the alignment of long reads [25]
• License: BaseTools software license
• Any restrictions to use by non-academics: commercial licence needed

Methods

The SSPACE-LongRead methodology can be summarized in a few steps which are described below and summarized in Figure 2. The pseudocode is summarized in Additional file 1: Figure S1.
Figure 2

Overview of the SSPACE-LongRead scaffolding algorithm. A) The input consists of a set of pre-assembled contigs (or scaffolds) in FASTA format and a set of PacBio CLR reads (in FASTA or FASTQ format). B) The PacBio CLR reads are aligned against the contigs using BLASR and only the best alignment matches are kept. In red a repeated element is indicated. C) Contig pairings and multi-contig linkage information is stored, from this information also repeated elements are detected. D) Based on the pairing and linkage information, contigs are ordered, oriented and connected into scaffolds. E) A post-processing step performs the final linearization and circularization.

Software input

The user needs to create a draft assembly using a de novo assembly method of choice (e.g. Velvet [1], SOAPdenovo [2], Ray [18], CLCbio (CLC bio, Aarhus, Denmark) or Newbler (Roche)). Optionally the user may also provide scaffold sequences generated with dedicated software (e.g. SSPACE [5] or SOPRA [4]). The resulting contigs or scaffolds (in FASTA format) are to be provided as input to SSPACE-LongRead software together with a set of long reads in FASTQ or FASTA format (e.g. PacBio CLR reads). Note that in our study we observe that SSPACE-LongRead obtains the best results if the draft assembly is constructed with CLCbio or Newbler as these tend to better split contigs at repeat boundaries (see the Results and Discussion section for additional explanations).

Alignment of long reads against the pre-assembled contigs (or scaffolds)

Each long read is aligned against the pre-assembled contigs with BLASR [25] resulting in a local alignment and corresponding similarity score. For gap-estimation purposes, the local alignments are extended to generate a full contig match. In order to remove false-positive alignments, contigs that display a (partial) overlap with a contig that has obtained a higher alignment score are iteratively removed from the dataset (Figure 2, step B). The minimum overlap (in bp) required to remove a contig from the alignment is defined by parameter -g (default value = 200).
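
The filtering rule described above can be illustrated with a small sketch. This is not the SSPACE-LongRead implementation: the `Hit` record, the function names and the example coordinates are hypothetical, and the sketch assumes that overlap is measured on the long read coordinates. Only the idea (drop a contig hit that overlaps a better-scoring hit by at least the -g threshold, 200 bp by default) is taken from the text.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One contig aligned to one long read (hypothetical record layout)."""
    contig: str
    read_start: int  # start of the extended contig match on the read
    read_end: int    # end (exclusive) of the extended contig match
    score: float     # alignment score; higher is better


def overlap(a: Hit, b: Hit) -> int:
    """Number of read positions shared by two hits (0 if they are disjoint)."""
    return max(0, min(a.read_end, b.read_end) - max(a.read_start, b.read_start))


def filter_hits(hits: List[Hit], min_overlap: int = 200) -> List[Hit]:
    """Keep the best-scoring hits; drop any hit overlapping a kept one."""
    kept: List[Hit] = []
    # Visit hits from best to worst so that higher scores always win.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if all(overlap(hit, k) < min_overlap for k in kept):
            kept.append(hit)
    # Return the survivors in their order along the read.
    return sorted(kept, key=lambda h: h.read_start)


hits = [
    Hit("A", 0, 3000, 900),
    Hit("R", 2500, 4000, 300),   # overlaps A by 500 bp with a lower score
    Hit("B", 3100, 6000, 800),
    Hit("C", 5900, 8000, 700),   # overlaps B by only 100 bp, so it is kept
]
print([h.contig for h in filter_hits(hits)])
```

Sorting by score first is a design choice of this sketch; it guarantees that a hit is only ever removed in favour of a better-scoring one, which mirrors the iterative removal described in the text.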

Computation of contig linkage from the alignment order

The remaining contigs are sorted based on their alignment position on the long reads. Subsequently the distance between the contigs and their orientation are computed. Each contig pairing and multi-contig linkage is stored in a hash: the preferred pairings are retained by removing contig links which are also found within a multi-contig path of another contig link. For example, if there are two paths, A -> B -> C -> D and A -> C -> D, the linkage A -> C is removed. Such ambiguous paths mainly arise because the BLASR alignment tool generally cannot align low-quality regions of PacBio reads. If multiple paths remain, a ratio is calculated between the two best alternatives. Ambiguous pairings are solved using the multi-contig linkage information; the unsolved pairings are flagged as repeated elements.
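
The removal of links that are spanned by a longer path can be sketched as follows, using the A/B/C/D example from the text. This is a simplified illustration under stated assumptions, not the authors' code: paths are plain lists of contig names, orientation and distance are ignored, and the function name is hypothetical.

```python
# A link (X, Y) is "direct" when X and Y are neighbours in some path and
# "spanned" when some path places at least one other contig between them.
# Preferred links are the direct links that are never spanned.

def preferred_links(paths):
    """Return the direct contig links that no multi-contig path spans."""
    direct = set()
    spanned = set()
    for path in paths:
        for i, left in enumerate(path):
            for j in range(i + 1, len(path)):
                pair = (left, path[j])
                if j == i + 1:
                    direct.add(pair)
                else:
                    spanned.add(pair)
    return direct - spanned


paths = [
    ["A", "B", "C", "D"],
    ["A", "C", "D"],      # a read on which contig B could not be aligned
]
print(sorted(preferred_links(paths)))
```

In this example the link A -> C is discarded because the first path shows that B lies between A and C, leaving A -> B, B -> C and C -> D.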

Scaffolding contigs into scaffolds

The contigs are now connected into linear stretches where repeated elements are placed based on the multi-contig linkage information. Repeated elements on the edges of the linear stretches are removed. As a post-processing step, the non-repeated edges of the preliminary scaffolds are reused to find connections between other preliminary scaffolds. Finally the gap-size between each contig is calculated: if this value is negative, and an overlap is found, contigs are merged; if this value is positive, a gap is inserted between contigs (the gap is represented by one or more undefined ‘N’ nucleotides depending on the gap-size).
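
The final gap-handling rule can be illustrated with a short sketch. It is a simplification under stated assumptions: the overlap check is an exact string comparison (a real implementation would have to tolerate mismatches), and the fallback of inserting a single 'N' when no overlap is found is an assumption of this sketch, since the text does not specify that case. The function name is hypothetical.

```python
def join_contigs(left: str, right: str, gap: int) -> str:
    """Join two oriented contigs given an estimated gap size in bp."""
    if gap > 0:
        # Positive estimate: fill the gap with undefined nucleotides.
        return left + "N" * gap + right
    expected_overlap = -gap
    if expected_overlap > 0 and left[-expected_overlap:] == right[:expected_overlap]:
        # Negative estimate and the contig ends really overlap: merge them.
        return left + right[expected_overlap:]
    # Assumption of this sketch: no usable overlap, keep a 1 bp placeholder.
    return left + "N" + right


print(join_contigs("ACGT", "TTGA", 3))     # gap of 3 bp
print(join_contigs("ACGTAC", "ACGG", -2))  # 2 bp overlap, contigs merged
print(join_contigs("ACGT", "GGGG", -2))    # no overlap found
```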

Software output

The final linear assembly is represented in an easily interpretable FASTA file. In addition an AGP (Accessioned Golden Path) file is generated which describes the contig order within the scaffolds. These files can be readily used for NCBI genome submissions. A summary file containing statistics of the assembly process and the final assembly structure is also provided in plain text format.

Datasets

In total six bacterial datasets were used for testing the performance of the software. These comprise Illumina MiSeq, Roche-454 and PacBio RS reads from Escherichia coli (K12 MG1655), Escherichia coli (O157:H7 F8092B), Bibersteinia trehalosi (USDA-ARS-USMARC-192), Mannheimia haemolytica (USDA-ARS-USMARC-2286), Francisella tularensis (99A-2628) and Salmonella enterica (Newport SN31241). Datasets were downloaded from http://www.cbcb.umd.edu/software/PBcR/closure/index.html and are further described in Koren et al. [19]. Dataset statistics are displayed in Table 1. To assess the assembly correctness we used close reference genomes deposited in the NCBI database (E. coli K12 MG1655 = NC_000913, E. coli O157:H7 = NC_002127, NC_002128, NC_002695, F. tularensis = NC_008369, S. enterica = NC_011079, NC_011080, NC_009140). For B. trehalosi and M. haemolytica no reference genome is currently available.

Assembly procedure

Draft assemblies of Illumina MiSeq data were constructed using Ray version 2.3.0 [18] and the CLCbio de novo assembler version 6.5.1 (CLC bio, Aarhus, Denmark), with a k-mer setting of 31 for each program. Draft assemblies of Roche-454 data were constructed using Newbler version 2.8 (Roche) as described in Koren et al. [19]. Scaffolding was performed using AHA [14], which is part of the SMRT Analysis Package version 2.0, and SSPACE-LongRead. For the latter software we required a minimal estimated overlap of 200 bp (option -g) between the contigs in order to avoid false positive alignments. For Ray the minimal estimated overlap was set to 500 bp since Ray tends to include repeated elements on contig edges: as a result a larger overlap between the contigs is observed.

System

All analyses were performed on a Linux machine with 48 GB of memory (Intel Xeon X5650, 2.67 GHz).

Source code

SSPACE-LongRead is written in Perl and runs on all major Linux platforms. Pseudocode of the algorithm is given in Additional file 1: Figure S1.

Competing interests

The authors declare that they have no competing interests.

Authors’ contribution

MB and WP contributed to the design of the study and the interpretation of the results. MB performed the implementation and WP wrote the manuscript. Both authors read and approved the final manuscript.

Additional file 1

Pseudocode of SSPACE-LongRead. This figure gives an overview of the operating principle of the SSPACE-LongRead algorithm.