Andrey D Prjibelski1, Irina Vasilinetc1, Anton Bankevich1, Alexey Gurevich1, Tatiana Krivosheeva1, Sergey Nurk1, Son Pham1, Anton Korobeynikov2, Alla Lapidus2, Pavel A Pevzner2. 1. Algorithmic Biology Lab, St. Petersburg Academic University, St. Petersburg, Russia, Department of Mathematics and Mechanics, Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia and Department of Computer Science and Engineering, University of California, San Diego, USA. 2. Algorithmic Biology Lab, St. Petersburg Academic University, St. Petersburg, Russia, Department of Mathematics and Mechanics, Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia and Department of Computer Science and Engineering, University of California, San Diego, USAAlgorithmic Biology Lab, St. Petersburg Academic University, St. Petersburg, Russia, Department of Mathematics and Mechanics, Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia and Department of Computer Science and Engineering, University of California, San Diego, USA.
Abstract
UNLABELLED: Next-generation sequencing (NGS) technologies have raised a challenging de novo genome assembly problem that is further amplified in recently emerged single-cell sequencing projects. While various NGS assemblers can use information from several libraries of read-pairs, most of them were originally developed for a single library and do not fully benefit from multiple libraries. Moreover, most assemblers assume uniform read coverage, condition that does not hold for single-cell projects where utilization of read-pairs is even more challenging. We have developed an exSPAnder algorithm that accurately resolves repeats in the case of both single and multiple libraries of read-pairs in both standard and single-cell assembly projects. AVAILABILITY AND IMPLEMENTATION: http://bioinf.spbau.ru/en/spades
UNLABELLED: Next-generation sequencing (NGS) technologies have raised a challenging de novo genome assembly problem that is further amplified in recently emerged single-cell sequencing projects. While various NGS assemblers can use information from several libraries of read-pairs, most of them were originally developed for a single library and do not fully benefit from multiple libraries. Moreover, most assemblers assume uniform read coverage, condition that does not hold for single-cell projects where utilization of read-pairs is even more challenging. We have developed an exSPAnder algorithm that accurately resolves repeats in the case of both single and multiple libraries of read-pairs in both standard and single-cell assembly projects. AVAILABILITY AND IMPLEMENTATION: http://bioinf.spbau.ru/en/spades
Most of existing next-generation sequencing (NGS) platforms generate read-pairs—pairs of reads (called mates) that are sequenced from different ends of a genomic fragment with approximately known length (called the insert size). Because the insert size usually exceeds the length of a single read, read-pairs may match up unique regions surrounding repeats that are longer than the read length. A combination of several libraries of read-pairs with different insert sizes is often used to produce high-quality assemblies (Butler ; Bresler ). Paired-end libraries usually have insert size <1 kb and are used for resolving relatively short repeats. Jumping libraries are characterized by an average insert size of anywhere from 1 to 20 kb and are helpful in resolving longer repeats and contig scaffolding (inferring the order of contigs in the genome). However, because of a high insert size variation, information from jumping libraries is rather difficult to use for the purpose of assembly.The problem of using multiple read-pair libraries was previously addressed by ALLPATHS-LG (Gnerre ), Ray (Boisvert ), Velvet (Zerbino and Birney, 2008) and some other assemblers. However, these tools are designed for standard (mutlicell) assemblies and do not perform well on single-cell datasets. On the other hand, the single-cell assemblers ESC (Chitsaz ), IDBA-UD (Peng ) and SPAdes (Bankevich ) are designed for a single read-pair library. In addition, the recently proposed Paired de Bruijn Graph algorithms for repeat resolution (Medvedev ; Pham ; Vyahhi ) also focus on a single library and it remains unclear how to extend them to multiple libraries.We present exSPAnder algorithm that works with both single and multiple libraries in standard and single-cell assembly projects. exSPAnder uses a simple path extension approach for repeat resolution that was originally proposed in the Ray assembler [and later used in Telescoper (Bresler )] and combines it with some ideas from the Rectangle Graph approach (Bankevich ; Vyahhi ). Given a set of paths in the assembly graph (Bankevich ) (i.e. simplified de Bruijn graph (Compeau ; Pevzner ) of k-mers in reads after removal of bulges, tips and chimeric edges), exSPAnder attempts to extend each path with the goal to generate longer paths. For a path P ending in a vertex v, we consider all edges starting at v (referred to as extension edges) and compute Score(e) for each extension edge. To compute Score(e) we analyze all reads that map to path P and whose mates map to e. Thus, Score(e) reflects our confidence that an extension of the path P by the edge e is correct. We note that to properly map read-pairs and calculate Score(e), the total length of path P and edge e should be longer than the insert size.In addition to function Score(e), exSPAnder uses a decision rule Extend(P) that either chooses one of the extension edges to extend the path P or makes the decision to stop growing this path beyond the ending vertex of P. The procedure is iterated over all the paths until no path can be further extended. To initiate this algorithm one can start with a set of single-edge paths formed by all sufficiently long edges in the assembly graph. The resulting paths are output as contigs after removing the paths that are contained within other paths as well as removing non-informative overlaps (i.e. suffixes of paths that represent prefixes of other paths).This simple approach is merely a framework and, depending on the specifics of the scoring function and the decision rule, it can be either efficient (like in the Ray assembler) or disastrous. The authors of Telescoper made an attempt to improve on Ray’s scoring function and to substantiate it with rigorous statistical analysis. However, scoring functions in both Ray and Telescoper are not universal, e.g. they assume the uniform genome coverage by reads, condition that does not hold for single-cell data.We demonstrate that exSPAnder works well on single-cell datasets with multiple libraries. We also show that exSPAnder [implemented as a part of SPAdes (Bankevich )] improves on existing assemblers on standard bacterial datasets and outperforms such popular assemblers as ABySS (Simpson ), Ray (Boisvert ), SOAPdenovo (Li ) and Velvet (Zerbino and Birney, 2008). ALLPATHS-LG (Gnerre ) is an excellent assembler whose applications, however, are limited to specially constructed read-pair libraries. On such libraries exSPAnder and ALLPATHS-LG generate comparable results (exSPAnder generates longer contigs but ALLPATHS-LG generates longer scaffolds).
2 ANALYSIS OF READ-PAIR LIBRARIES
As we mentioned in the introduction, jumping libraries present additional challenges for genome assembly owing to their high variations in the insert size. Additionally, jumping libraries have high rate of chimeric read-pairs—read-pairs that either have abnormal insert size or incorrect orientation. Chimeric read-pairs further complicate utilization of such libraries while resolving repeats.Below we present analysis of the insert size distributions for the following data: Brachybacterium faecium isolate dataset (read length 150 bp) and Staphylococcus aureus single-cell dataset (read length 101 bp). Both datasets contain one paired-end and one jumping library. To analyze read-pair libraries we aligned reads to the B.faecium and S.aureus reference genomes using Bowtie 2 (Langmead and Salzberg, 2012) and computed the chimeric read-pair rates and insert size distributions.As Figure 1 and Table 1 demonstrate, both isolate and single-cell paired-end libraries have small variations of the insert size. In contrast, the jumping library from the B.faecium dataset has a much higher variation in the insert size and higher rate of chimeric read-pairs (Table 1). At the same time, in addition to the large insert size variations, S.aureus jumping library contains only 22% of all read-pairs aligned with the proper reverse–forward orientation. Thirty-six percent of all read-pairs have incorrect forward–reverse orientation (correspond to the left peak in Fig. 1d) and 14% are classified as chimeric read-pairs of other types. Forward–reverse read-pairs in jumping libraries represent an artifact of the sample preparation and are common for datasets of different types.
Fig. 1.
Plots of the insert size distributions for B.faecium isolate (a) paired-end and (b) jumping library, and S.aureus single-cell dataset with (c) paired-end and (d) jumping library. The distributions were computed by mapping reads to the B.faecium str. DSM4810 (Lapidus ) and S. aureus str. USA300 substr. FPR3757 (Diep ) reference genomes, respectively. All plots are in the logarithmic scale
Table 1.
Information on the B.faecium isolate dataset and the S.aureus single-cell dataset
Dataset
B.faecium
S.aureus
Library
Paired-end
Jumping
Paired-end
Jumping
Number of reads
13 M
41 M
38 M
41 M
Average coverage
400×
1100×
1050×
1050×
Coverage span
210–570×
0–3000×
0–3500×
0–3500×
Insert size
270 bp
7.5 kb
210 bp
1.8 kb
Insert span
150–400 bp
6–10 kb
180–230 bp
0.5–4 kb
Chimeric read-pairs (%)
1
9
3
50
Unaligned read-pairs (%)
16
10
6
28
Note: Insert span is the shortest insert size interval that contains at least 95% of properly aligned read-pairs. Unaligned reads refer to the percentage of read-pairs that have at least one read unaligned. Chimeric read-pairs refer to the percentage of chimeric read-pairs among all read-pairs. All statistics was obtained using Bowtie 2 (Langmead and Salzberg, 2012). Coverage span is the smallest coverage interval that includes a least 95% of all genomic positions
Plots of the insert size distributions for B.faecium isolate (a) paired-end and (b) jumping library, and S.aureus single-cell dataset with (c) paired-end and (d) jumping library. The distributions were computed by mapping reads to the B.faecium str. DSM4810 (Lapidus ) and S. aureus str. USA300 substr. FPR3757 (Diep ) reference genomes, respectively. All plots are in the logarithmic scaleInformation on the B.faecium isolate dataset and the S.aureus single-cell datasetNote: Insert span is the shortest insert size interval that contains at least 95% of properly aligned read-pairs. Unaligned reads refer to the percentage of read-pairs that have at least one read unaligned. Chimeric read-pairs refer to the percentage of chimeric read-pairs among all read-pairs. All statistics was obtained using Bowtie 2 (Langmead and Salzberg, 2012). Coverage span is the smallest coverage interval that includes a least 95% of all genomic positionsDespite the fact that various artifacts of jumping libraries make it difficult to incorporate them into existing assembly tools, exSPAnder uses jumping libraries to generate high-quality assemblies.
3 EXSPANDER ALGORITHM
exSPAnder uses an assembly graph constructed by SPAdes (Bankevich ) and a set of read-pair libraries. For each library, we map read-pairs to the long edges of the assembly graph and estimate the average insert size along with its confidence interval—a shortest insert size interval that contains at least 80% of properly aligned read-pairs. These estimates are used as parameters of the scoring function and the decision rule.
3.1 The decision rule
3.1.1 Single library
Given a path P, we define a winner as an edge e with the maximal score Score(e) among all extension edges for P. Similarly, a contender is defined as an extension edge with the second best score. The winner edge is called the strong winner if (i) Score(winner) >Θ and (ii) Score(winner) > C · Score(contender), where Θ and C are parameters of the algorithm, which are discussed below. If the path P has a single extension edge (which is obviously the winner), only the first condition is used. The decision rule is defined as follows:
3.1.2 Multiple libraries
The decision rule described above can be generalized for several read-pair libraries. Consider M read-pair libraries, which are sorted in the order of increasing insert sizes and the associated decision rules Extend(P) for 1 ≤ i ≤ M. We process the libraries in this order because our analysis revealed that the smaller is the insert size of a library (and its variation), the more reliable is the decision rule for this individual library. We thus select the library with the smallest index i that has the strong winner and define the decision rule for multiple libraries Extend(P) as simply Extend(P). If neither library has a strong winner, we define .
3.2 The scoring function
3.2.1 The support function
We first consider an idealized case when the genome defines a genomic path in the assembly graph. We say that an edge e′ follows edge e at a distance D if the distance between starts of these edges in the genomic path is D. We define a boolean function Support(e,e′) that reflects our confidence that edge e′ follows edge e in the genome at distance D. Below we describe how to calculate Support(e,e′).(a) Reads r and r′ form a read-pair mapping to consecutive edges e and e′ in the assembly graph at positions x0 and y0, respectively. (b) Representation of a read-pair (r,r′) as a point in a rectangle (e,e′). (c) ‘Ideal read-pairs’ with the exact insert size I connecting edges e and e′ form a 45° line within a rectangle. (d) Read-pairs from the real sequencing data with variations in the insert size represented as points within a rectangle. Most points are located within the confidence strip providing the evidence that edges e and e′ are supported by the read-pairs and are genome-consecutive. (e) A rectangle formed by a pair of edges that has few points falling into the confidence strip revealing that e and e′ are not genome-consecutive edgesLet I and [I,I] be the mean and the confidence interval of the insert size for a read-pair library formed by reads of length ReadLength. Consider consecutive edges e and e′ in the assembly graph and a read-pair (r,r′) such that read r maps to e at position x0 and read r′ maps to e′ in position y0 (Fig. 2a). We say that the read-pair (r,r′) connects edges e and e′. Figure 2b shows a rectangle formed by the edges e and e′ [further simply referred to as rectangle (e,e′)] with the read-pair (r,r′) represented as a point (x0,y0) within the rectangle. If edges e and e′ represent consecutive regions in the genome, then the genomic distance from the start of read r to the start of read r′ equals to , where Length(e) stands for the length of edge e. Therefore, in the case of an ‘ideal read-pair’ (r,r′) (e.g. a read-pair with the exact insert size I), , where . Thus all ‘ideal read-pairs’ mapping to edges e and e′ form a set of integer points on the 45° line y = x– d within the rectangle (Fig. 2c). Because the read-pairs from the real sequencing data have variations in the insert size, their corresponding points are typically scattered in the strip between the 45° lines y = x – d and y = x – d, where
This strip in the rectangle is further referred to as the confidence strip (Fig. 2d).
Fig. 2.
(a) Reads r and r′ form a read-pair mapping to consecutive edges e and e′ in the assembly graph at positions x0 and y0, respectively. (b) Representation of a read-pair (r,r′) as a point in a rectangle (e,e′). (c) ‘Ideal read-pairs’ with the exact insert size I connecting edges e and e′ form a 45° line within a rectangle. (d) Read-pairs from the real sequencing data with variations in the insert size represented as points within a rectangle. Most points are located within the confidence strip providing the evidence that edges e and e′ are supported by the read-pairs and are genome-consecutive. (e) A rectangle formed by a pair of edges that has few points falling into the confidence strip revealing that e and e′ are not genome-consecutive edges
Let F(x) be the empirical distribution of the insert size and S be a set of all integer points within the confidence strip in the rectangle (e,e′). We define the expected number of read-pairs within the confidence strip (under the assumption of the uniform coverage) as
where represents the insert size of a read-pair that corresponds to the point (x, y). We also define Points(e,e′) as the total number of read-pairs (from the real dataset) that correspond to the points within the confidence strip. The notion of density is defined as
We set Density(e,e′) = 0, if Expected(e,e′) = 0. The points outside the confidence strip may represent read-pairs with somewhat larger deviations from the mean insert sizes or chimeric read-pairs. Our analysis revealed that being conservative (e.g. limiting analysis to the confidence strip) allows one to avoid most of the assembly errors caused by chimeric read-pairs, particularly in single-cell projects.We distinguish between notions of genome-consecutive and graph-consecutive edges and emphasize that graph-consecutive edges are not necessarily genome-consecutive. The decision about which graph-consecutive edges are genome-consecutive is an important part of any assembler. Figure 2d and e illustrate how rectangles help us to make such decisions: both rectangles correspond to graph-consecutive edges, but only rectangle in Figure 2d is formed by the pair of genome-consecutive edges.The described notions of Expected(e,e′), Points(e,e′) and Density(e,e′) (defined for the case when edges e and e′ are genome-consecutive) can be generalized for the case when e and e′ are not consecutive genomic edges under the assumption that genomic distance between them is D. In this case the confidence strip [further referred to as Strip(e,e′)] is bounded by the lines y = x – d and y = x – d, where
Points(e,e′) similarly represents the number of points within strip Strip(e,e′). Expected(e,e′) and Density(e,e′) are defined using the same formulas and the corresponding confidence strip. Clearly, for two genome-consecutive edges Expected(e,e′) = Expected(e,e′) and Density(e,e′) = Density(e,e′), where L = Length(e).The support function reflects whether the number of read-pairs connecting edges e and e′ supports the conjecture that e′ follows e in the genome at distance D:
where Ψ is a parameter of the algorithm, which is automatically computed for each read-pair library based on the chimeric read-pair rate (see below). For the standard isolate datasets this parameter corresponds to the coverage cutoff for read-pairs. For single-cell datasets this parameter is usually set to be very low to retain the regions with low coverage, which are typical for single-cell projects. If Support(e,e′) = 1, we say that the rectangle (e,e′) is supported by the read-pairs.
3.2.2 The naive scoring function
To explain the intuition behind exSPAnder, we first introduce the naive scoring function. We further modify the naive scoring function to arrive to the advanced scoring function used in the real exSPAnder implementation.(a) An example of an assembly graph with the genomic paths (p1, p2, p3, e1) and . (b, e) The composite rectangles for correct genomic extension of each path: in these cases the points are evenly distributed within the confidence strip and the resulting score is equal to 1. (c, d) The composite rectangles that correspond to incorrect extensions edges of these two paths. In each of these cases, at least one simple rectangle contains few points within the confidence stripA path P = (p1, … ,p) and its extension edge e can be represented as a composite rectangle formed by m simple rectangles (p,e) containing points that correspond to read-pairs connecting edges of P and e. Figure 3b shows an example of a composite rectangle, which is formed by a path (p1, p2, p3) and its extension edge e1 and consists of three simple rectangles. The notion of the confidence strip remains (it now consists of up to m substrips within simple rectangles), except that it is bounded by the lines y = x – d and y = x – d, where
For an edge p from the path P we define the expected number of points in the confidence substrip within the simple rectangle (p,e) as , where D is the distance between start of p and start of e according to the path P (i.e. ). We consider rectangles such that and introduce the function Score(e) as the fraction of the total number of expected read-pairs in these rectangles with :
We set Score(e) = 0 if all simple rectangle have zero expected read-pairs.
Fig. 3.
(a) An example of an assembly graph with the genomic paths (p1, p2, p3, e1) and . (b, e) The composite rectangles for correct genomic extension of each path: in these cases the points are evenly distributed within the confidence strip and the resulting score is equal to 1. (c, d) The composite rectangles that correspond to incorrect extensions edges of these two paths. In each of these cases, at least one simple rectangle contains few points within the confidence strip
Figure 3a shows paths P = (p1, p2, p3) and P′ = (p1,p′2, p3) and its extension edges e1,e2. Let (p1, p2, p3, e1) and (p1,p′2,p3, e2) be the true (but unknown) genomic paths. Figure 3b shows the composite rectangle for path P and its correct extension e1, in which points within the confidence strip are rather evenly distributed resulting in Score(e1) = 1. Figure 3c shows the composite rectangle for path P = (p1, p2, p3) and its incorrect extension edge e2. Because (p1, p′2, p3, e2) is a genomic path, density of the points in the sectors of the confidence strip corresponding to edges p1 and p3 is high. However, edge p2 of the path P does not support extension edge e2 because there are few points in the rectangle (p2, e2). Additionally, Figure 3d and e shows composite rectangles for all possible extension edges for path P′ = (p1, p′2, p3).Because the defined scoring function does not linearly depend on read coverage, it is well suitable for both single-cell and standard sequencing projects. At the same time, considering only read-pairs with insert size in [I,I] (which correspond to points within the confidence strip) allows one to filter out most of the chimeric read-pairs (common for single-cell datasets) and to minimize their influence on the scoring function.
3.2.3 The advanced scoring function
The naive scoring function Score(e) described above works well in many cases but may be too conservative when the path P contains repetitive edges (edges that are visited more than once by the genomic traversal). Figure 4 illustrates the case when the path P has a repetitive edge and motivates the need for further improvements in the scoring function.
Fig. 4.
Scoring a path that contains repetitive edges. (a) An example of the assembly graph with a repetitive edge p. (b) A composite rectangle for the correct extension e of path (p,p). (c) A composite rectangle for the incorrect extension e′ of the path (p,p)
Scoring a path that contains repetitive edges. (a) An example of the assembly graph with a repetitive edge p. (b) A composite rectangle for the correct extension e of path (p,p). (c) A composite rectangle for the incorrect extension e′ of the path (p,p)Figure 4a shows an assembly graph with four unique edges (p, p′, e and e′) and a single repetitive edge p with multiplicity 2. We assume that paths (p,p,e) and (p′,p,e′) are genomic and paths (p,p,e′) and (p′,p,e) are non-genomic. Our goal is to design an algorithm that correctly extends genomic paths (p,p) and (p′,p) into longer genomic paths (p,p,e) and (p′,p,e′), respectively.Consider a path P = (p,p) and composite rectangles (P,e) and (P,e′) (Fig. 4b and c). As Figure 4 illustrates, Score(e) is similar to Score(e′), implying that there is no strong winner for path P and preventing us from extending the path P by edge e. However, because the repetitive edge p supports both extension edges e and e′, it does not provide any valuable information about the correct extension of the path P. Therefore, to make a decision about extending the path P by an extension edge e, we should have excluded p from the consideration as a repetitive edge. Because we do not know in advance which edges of the assembly graph correspond to repeats in the genome, we classify p as repetitive because it supports both extension edges e and e′.Below we present the exSPAnder algorithm that allows us to exclude repetitive edges from contributing to scoring. An extension edge e of path P is called an active edge if . At the first step of the algorithm we score all extension edges and form a set of active edges . An edge p in path P is classified as repetitive if it supports all active edges, i.e. for all . At the second step we mark all repetitive edges and recalculate scores of all edges in ignoring these repetitive edges. We then update by removing all non-active edges and iterate the process. The process continues until yet another iteration does not change the set of active edges . If and Score(e) >Θ (which means that e satisfies both conditions in the decision rule) the extension edge e is considered to be a strong winner and added to the path P. Otherwise, we stop extending path P.We further demonstrate the work of the exSPAnder algorithm using a simple assembly graph shown in Figure 5. The paths and are genomic paths, which means that edges p2 and p3 are repetitive and have multiplicities 2 and 3, respectively.
Fig. 5.
An example of the assembly graph with repetitive edges p2 and p3
An example of the assembly graph with repetitive edges p2 and p3Let P = (p1, p2, p3) be a path we aim to extend. We first calculate scores of all extension edges using the composite rectangles (Fig. 6a–c) and form a set of active edges based on their scores (marked red in Fig. 6d). Because for i = 1, 2, 3, edge p3 is classified as repetitive and is removed from further consideration (Fig. 6e). We now recalculate scores for the extension edges in ignoring repetitive edge p3 (Fig. 6f–h) and remove non-active edge e3 from (Fig. 6i). Using the updated set we again proceed to the repeat detection step and mark edge p2 as repetitive because for i = 1, 2 (Fig. 6j). Finally, we once again recalculate scores of the extension edges in (Fig. 6k–m) and remove e2 as non-active (Fig. 6n). The extension edge e1 remains the only active edge in and is used to extend path P.
Fig. 6.
A step-by-step example of the exSPAnder algorithm. (a–c) Forming a set of active edges {e1, e2, e3} (marked red) for the path P = (p1, p2, p3) using the corresponding composite rectangles. (d, e) Classifying of edge p3 as repetitive and removing it from further consideration (marking gray). Edges that are not classified as repetitive are colored in blue. (f–h) Recalculating scores of the extension edges and updating the set of active edges. (i, j) Removing repetitive edge p2. (k–m) Recalculating scores for the remaining active edges {e1, e2} and removing e2 as non-active. (n) Selecting the only active edge e1 as an extension for the path P
A step-by-step example of the exSPAnder algorithm. (a–c) Forming a set of active edges {e1, e2, e3} (marked red) for the path P = (p1, p2, p3) using the corresponding composite rectangles. (d, e) Classifying of edge p3 as repetitive and removing it from further consideration (marking gray). Edges that are not classified as repetitive are colored in blue. (f–h) Recalculating scores of the extension edges and updating the set of active edges. (i, j) Removing repetitive edge p2. (k–m) Recalculating scores for the remaining active edges {e1, e2} and removing e2 as non-active. (n) Selecting the only active edge e1 as an extension for the path PExtensive tests of the advanced scoring function revealed that it works well across diverse datasets including single-cell jumping libraries with high variations in the insert size, extremely non-uniform coverage and large number of chimeric reads and chimeric read-pairs (see Section 4).
3.3 Scaffolding
After all paths are constructed, we consider all pairs of paths that form composite rectangles with non-zero number of points (Fig. 7). For each such pair of paths P and P′ we can check whether points in the corresponding composite rectangle are scattered around a certain 45° line using SPAdes distance estimation procedure (Nurk ). When SPAdes provides the estimated distance D between P and P′, we use exSPAnder to verify the conjecture that P′ follows P at distance D. If this conjecture is supported and does not contradict to any other conjectures about these paths, we extend the path P by P′ (the scaffolding step). We estimate the gap length between the paths as and insert the appropriate number of ‘N’ symbols (unspecified nucleotide) between end of P and start of P′. If paths P and P′ overlap, we construct their overlap alignment to correct distance D.
Fig. 7.
An example of a composite rectangle formed by paths (p1, p2, p3) and
An example of a composite rectangle formed by paths (p1, p2, p3) and
3.4 Choice of the parameters
3.4.1 The scoring function
We select the parameter Ψ as a threshold for the density of the read-pairs within the confidence strip. We therefore assume that rectangles with the density below Ψ contain mostly chimeric read-pairs and should be ignored while calculating the score of an extension edge. To select Ψ for a particular read-pair library, we estimate the distribution of the densities for rectangles that contain only chimeric read-pairs (false rectangles) and for rectangles that contain only non-chimeric read-pairs (true rectangles).To partition all read-pairs into chimeric and non-chimeric, one needs the complete genome that is unavailable. To get around this, we identify a subset of chimeric reads using the long edges in the assembly graph (e.g. edges longer than N50) that can be viewed as subgenomes of the complete genome. By mapping a read-pair to a long edge we can compute its insert size and thus classify read-pairs with abnormally large insert sizes as chimeric.To estimate the parameter Ψ, we generate a large number of equally sized rectangles by artificially ‘splitting’ each long edge into shorter edges of equal length (e.g. 100 bp) called uni-edges. We then consider rectangles formed by all pairs of uni-edges within the same long edge. Because we know the exact distance between such uni-edges, we can calculate the expected number of read-pairs Expected(e1,e2) for all pairs of uni-edges within the same long edge. If uni-edges e1 and e2 come from the same long edge and Expected(e1,e2) = 0, then all points in the confidence strip Strip(e1,e2) represent chimeric read-pairs connecting e1 and e2. Such edge-pair (e1,e2) is classified as a false edge-pair. Otherwise, if , we assume that all points within represent correct read-pairs and classify (e1,e2) as a true edge-pair.Edges e1 and e2 and a parameter D define the strip Strip(e1,e2) with the number of points Points(e1,e2). For a pair of edges e1 and e2, we define
When (e1,e2) is a false edge-pair, D* defines a confidence strip with the maximum number of chimeric read-pairs. To compute the threshold Ψ, we assume this worst-case scenario for all pairs of uni-edges (within the same long edge) by using distance D* (rather than the known genomic distance) for calculating the densities .For a certain value of Ψ we define false positives (false negatives) as the false (true) edge-pairs that have density higher (lower) than . Figures 8a and b illustrate how false-positive rate (green) and false-negative rate (blue) depend on the parameter Ψ. exSPAnder selects Ψ that corresponds to the intersection point of the false-positive and false-negative plots. Our benchmarking revealed that such choice of the parameter Ψ allows one to filter out the rectangles containing only chimeric reads-pairs based on their densities. Additional analysis revealed that estimating the parameter Ψ is an important step in the exSPAnder algorithm because varying this parameter may significantly affect the assembly quality (see Section 3 in Appendix).
Fig. 8.
Plots of the false-positive (green) and false-negative (blue) rates for (a) B.facium and (b) S.aureus paired-end libraries
Plots of the false-positive (green) and false-negative (blue) rates for (a) B.facium and (b) S.aureus paired-end libraries
3.4.2 The decision rule
Our analysis revealed that varying parameters C and Θ within specified ranges (see Section 3 of the Appendix) hardly affects the quality of the resulting assemblies. However, selecting inappropriate C and Θ may result in a deteriorated performance of exSPAnder. Thus, we arbitrarily select these parameters within the ranges specified in Section 3 of Appendix (analysis of diverse sequencing datasets supports the default values C = 1.5 and Θ = 0.5).
4 RESULTS
4.1.1 Datasets
We have compared exSPAnder (coupled with SPAdes assembler) with several popular assemblers on the B.faecium isolate dataset (genome size 3.6 Mb) and the S.aureus single-cell dataset (genome size 2.9 Mb). For each dataset we have generated assemblies of (i) only paired-end library and (ii) both paired-end and jumping libraries. Section 1 of Appendix provides a detailed description of both datasets.
4.1.2 Benchmarking assemblies
ABySS 1.3.6, Ray 2.0.0, Velvet, Velvet-SC and SOAPdenovo 2.0.4 were run with k-mer size 55. IDBA-UD 1.1.1 was run in its default iterative mode. The authors of (Peng ) released this new version of IDBA-UD that is capable of using several paired-end libraries, but there is no manuscript yet covering this new development. ALLPATHS-LG was run with the default parameters; however, we down-sampled jumping library for the B.faecium dataset to generate 100× coverage required by ALLPATHS-LG. SPAdes 2.4 (previous version of SPAdes that did not include exSPAnder and did not support multiple libraries) was run in its default iterative mode with k = 21, 33, 55, 77 for the B.faecium dataset and k = 21, 33, 55 for the S.aureus dataset. exSPAnder (coupled with SPAdes) was run using the default parameters.To analyze the resulting assemblies we used QUAST 2.2 (Gurevich ) that reports various parameters including NG50 (similar to N50, but is calculated with respect to the reference genome size), the total number of contigs/scaffolds, the length of the longest assembled contig/scaffold, the number of misassemblies and the fraction of genome mapped. QUAST defines a misassembly breakpoint as a position in the contig/scaffold, such that its left and right flanking sequences either align to the reference genome over 1 kb away from each other, or overlap by > 1 kb, or align on opposite strands or different chromosomes (Gurevich ). To compare assemblers we used both contigs and scaffolds of length exceeding 500 bp.Tables 2 and 3 show the benchmarking results for the B.faecium isolate dataset. Interestingly, the single-cell assemblers (IDBA-UD and exSPAnder coupled with SPAdes) as well as ABySS performed well on the B.faecium isolate dataset and produced contigs with the largest NG50 in the case of a single library. While AbySS generated the assembly with the maximal genome fraction, manual inspection revealed that it reflects the specifics of ABySS and QUAST reporting (mapping each repeat to a single position in the genome) rather than real superiority of ABySS by this metric.
Table 2.
Comparison of contigs for the B.faecium isolate dataset
Assembler
NG50
Number of contigs
Largest
Number of mis
GF
Only paired-end library
ABySS
203
40
672
0
99.9
Ray
114
51
436
1
98.9
SOAPdenovo
20
333
61
0
98.8
Velvet
144
47
550
0
99.4
Velvet-SC
163
46
550
0
99.4
IDBA-UD
202
39
483
0
99.4
SPAdes 2.4
361
24
635
1
99.7
exSPAnder
380
22
672
1
99.5
Both paired-end and jumping libraries
ABySS
203
40
672
0
99.9
ALLPATHS-LG
313
21
686
0
99.5
Ray
87
88
416
2
96.8
SOAPdenovo
20
333
61
0
98.8
Velvet
103
75
242
11
99.0
Velvet-SC
253
40
545
15
99.8
IDBA-UD
207
41
483
0
99.4
exSPAnder
3268
2
3268
1
99.9
Note: NG50 is given in kb; number of contigs is the total number of contigs >500 bp; largest stands for the length (in kb) of the longest contig assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.
Table 3.
Comparison of scaffolds for the B.faecium isolate dataset
Assembler
NG50
Number of scaffolds
Largest
Number of mis
GF
Only paired-end library
ABySS
383
24
676
0
99.9
Ray
204
31
553
1
98.9
SOAPdenovo
477
26
724
0
99.3
Velvet
477
28
724
0
99.4
Velvet-SC
477
28
724
0
99.4
IDBA-UD
250
30
671
0
99.4
SPAdes 2.4
361
22
671
1
99.7
exSPAnder
380
22
672
1
99.5
Both paired-end and jumping libraries
ABySS
250
30
739
1
99.9
ALLPATHS-LG
3610
7
3610
1
99.5
Ray
106
75
416
2
96.8
SOAPdenovo
480
28
810
2
99.4
Velvet
2651
14
2651
78
99.1
Velvet-SC
945
102
1381
500
98.9
IDBA-UD
1002
9
1692
0
99.4
exSPAnder
3268
2
3268
1
99.9
Note: NG50 is given in kb; number of scaffolds is the total number of scaffolds >500 bp; largest stands for the length (in kb) of the longest scaffold assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.
Comparison of contigs for the B.faecium isolate datasetNote: NG50 is given in kb; number of contigs is the total number of contigs >500 bp; largest stands for the length (in kb) of the longest contig assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.Comparison of scaffolds for the B.faecium isolate datasetNote: NG50 is given in kb; number of scaffolds is the total number of scaffolds >500 bp; largest stands for the length (in kb) of the longest scaffold assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.In the case of two libraries, exSPAnder produced the best contigs while ALLPATHS-LG produced the best scaffolds. The complexity of using jumping libraries is reflected in a deteriorated performance of ABySS and Ray (reduction in NG50) as well as Velvet and Velvet-SC (dramatic increase in the number of misassemblies).Tables 4 and 5 compare various assemblers on the S.aureus single-cell dataset. This comparison highlights the complexity of both (i) assembling single-cell datasets and (ii) using jumping libraries. For example, SOAPdenovo produced assemblies of poor quality for single-cell data (we decided not to include it in Tables 4 and 5). Similarly, ABySS produced assemblies with high number of misassemblies for the single-cell data. Velvet and Velvet-SC are not included in the benchmark experiment for jumping libraries testing because they also produce low-quality assemblies when both paired-end and jumping libraries are used simultaneously. IDBA-UD performed well on a single paired-end library, but produced an assembly of lower quality when both libraries were provided (decreased NG50). exSPAnder produced assemblies with the highest NG50 and largest assembled contig/scaffold.
Table 4.
Comparison of contigs for the S.aureus single-cell dataset
Assembler
NG50
Number of contigs
Largest
Number of mis
GF
Only paired-end library
ABySS
27
914
91
262
98.0
Ray
21
306
108
14
88.7
Velvet
10
538
56
2
93.2
Velvet-SC
9
616
56
4
94.2
IDBA-UD
75
390
161
7
98.6
SPAdes 2.4
98
400
230
8
99.1
exSPAnder
148
366
275
3
98.6
Both paired-end and jumping libraries
ABySS
27
914
91
262
98.0
ALLPATHS-LG
15
283
75
26
79.9
Ray
100
178
486
21
93.5
IDBA-UD
47
415
161
7
98.6
exSPAnder
314
322
603
9
99.3
Note: NG50 is given in kb; number of contigs is the total number of contigs >500 bp; largest stands for the length (in kb) of the longest contig assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.
Table 5.
Comparison of scaffolds for the S.aureus single-cell dataset
Assembler
NG50
Number of scaffolds
Largest
Number of mis
GF
Only paired-end library
ABySS
28
910
91
270
98.2
Ray
21
306
108
14
88.7
Velvet
10
538
56
2
93.2
Velvet-SC
10
620
56
5
94.2
IDBA-UD
88
382
161
8
98.2
SPAdes 2.4
99
391
230
8
99.2
exSPAnder
148
357
426
4
98.6
Both paired-end and jumping libraries
ABySS
30
852
91
275
98.0
ALLPATHS-LG
40
165
132
69
79.9
Ray
100
169
486
25
93.5
IDBA-UD
55
397
161
9
98.6
exSPAnder
314
302
603
9
99.3
Note: NG50 is given in kb; number of scaffolds is the total number of scaffolds >500 bp; largest stands for the length (in kb) of the longest scaffold assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.
Comparison of contigs for the S.aureus single-cell datasetNote: NG50 is given in kb; number of contigs is the total number of contigs >500 bp; largest stands for the length (in kb) of the longest contig assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.Comparison of scaffolds for the S.aureus single-cell datasetNote: NG50 is given in kb; number of scaffolds is the total number of scaffolds >500 bp; largest stands for the length (in kb) of the longest scaffold assembled; number of mis is the number of misassemblies; GF stands for the fraction of genome mapped given in percent. In each column, the best value is indicated in bold.Using only paired-end library IDBA-UD, SPAdes 2.4 and exSPAnder recovered the largest fraction of the genome (>98.5%). However, the highest genome fraction of the assembly generated by SPAdes 2.4 reflects the specifics of SPAdes 2.4 and QUAST reporting (some artifacts with reporting of repetitive regions) rather than real advantage of SPAdes 2.4 with respect to this parameter.When using both libraries simultaneously, exSPAnder produced assemblies with the highest genome fraction exceeding 99%, the largest genome fraction we saw across dozens of single-cell datasets assembled with SPAdes in the past 2 years. Moreover, Tables 4 and 5 show that exSPAnder successfully deals with the high rate of the chimeric read-pairs and relatively high variations in the insert size.
5 CONCLUSION
We have presented exSPAnder algorithm for resolving repeats using either a single or multiple read-pair libraries with different insert sizes, which is applicable for both single-cell and isolate bacterial datasets. Benchmarks across eight popular assemblers demonstrate that exSPAnder produces high-quality assemblies for datasets of different types. Additionally, as illustrated by recent integration of Illumina and PacBio reads in SPAdes 3.0, exSPAnder is a flexible approach that can be easily modified to work with diverse types of sequencing data.
Authors: Sante Gnerre; Iain Maccallum; Dariusz Przybylski; Filipe J Ribeiro; Joshua N Burton; Bruce J Walker; Ted Sharpe; Giles Hall; Terrance P Shea; Sean Sykes; Aaron M Berlin; Daniel Aird; Maura Costello; Riza Daza; Louise Williams; Robert Nicol; Andreas Gnirke; Chad Nusbaum; Eric S Lander; David B Jaffe Journal: Proc Natl Acad Sci U S A Date: 2010-12-27 Impact factor: 11.205
Authors: Binh An Diep; Steven R Gill; Richard F Chang; Tiffany HaiVan Phan; Jason H Chen; Matthew G Davidson; Felice Lin; Jessica Lin; Heather A Carleton; Emmanuel F Mongodin; George F Sensabaugh; Françoise Perdreau-Remington Journal: Lancet Date: 2006-03-04 Impact factor: 79.321
Authors: Sergey Nurk; Anton Bankevich; Dmitry Antipov; Alexey A Gurevich; Anton Korobeynikov; Alla Lapidus; Andrey D Prjibelski; Alexey Pyshkin; Alexander Sirotkin; Yakov Sirotkin; Ramunas Stepanauskas; Scott R Clingenpeel; Tanja Woyke; Jeffrey S McLean; Roger Lasken; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner Journal: J Comput Biol Date: 2013-10 Impact factor: 1.479
Authors: Jonathan Butler; Iain MacCallum; Michael Kleber; Ilya A Shlyakhter; Matthew K Belmonte; Eric S Lander; Chad Nusbaum; David B Jaffe Journal: Genome Res Date: 2008-03-13 Impact factor: 9.043
Authors: Hamidreza Chitsaz; Joyclyn L Yee-Greenbaum; Glenn Tesler; Mary-Jane Lombardo; Christopher L Dupont; Jonathan H Badger; Mark Novotny; Douglas B Rusch; Louise J Fraser; Niall A Gormley; Ole Schulz-Trieglaff; Geoffrey P Smith; Dirk J Evers; Pavel A Pevzner; Roger S Lasken Journal: Nat Biotechnol Date: 2011-09-18 Impact factor: 54.908
Authors: Alla Lapidus; Rüdiger Pukall; Kurt Labuttii; Alex Copeland; Tijana Glavina Del Rio; Matt Nolan; Feng Chen; Susan Lucas; Hope Tice; Jan-Fang Cheng; David Bruce; Lynne Goodwin; Sam Pitluck; Manfred Rohde; Markus Göker; Amrita Pati; Natalia Ivanova; Konstantinos Mavrommatis; Amy Chen; Krishna Palaniappan; Patrik D'haeseleer; Patrick Chain; Jim Bristow; Jonathan A Eisen; Victor Markowitz; Philip Hugenholtz; Nikos C Kyrpides; Hans-Peter Klenk Journal: Stand Genomic Sci Date: 2009-07-20
Authors: Yu Lin; Jeffrey Yuan; Mikhail Kolmogorov; Max W Shen; Mark Chaisson; Pavel A Pevzner Journal: Proc Natl Acad Sci U S A Date: 2016-12-12 Impact factor: 11.205
Authors: Nathan D Olson; Todd J Treangen; Christopher M Hill; Victoria Cepeda-Espinoza; Jay Ghurye; Sergey Koren; Mihai Pop Journal: Brief Bioinform Date: 2019-07-19 Impact factor: 11.622
Authors: Scott Quainoo; Jordy P M Coolen; Sacha A F T van Hijum; Martijn A Huynen; Willem J G Melchers; Willem van Schaik; Heiman F L Wertheim Journal: Clin Microbiol Rev Date: 2017-10 Impact factor: 26.132