Jared T Simpson, Richard Durbin.
Abstract
MOTIVATION: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.
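As a rough illustration of finding suffix-prefix overlaps by lookup in a sorted suffix list, the sketch below uses a plain Python list of suffixes with binary search. This is a naive stand-in, not the compressed index the abstract refers to. The read names, read sequences, and `min_overlap` value are made up for the example, and only the forward strand is considered.

```python
from bisect import bisect_left, bisect_right

def suffix_prefix_overlaps(reads, min_overlap):
    """Return {(x, y): n} where the last n bases of read x equal the first n bases of read y."""
    # Every proper suffix (length >= min_overlap) of every read, sorted lexicographically.
    suffixes = sorted(
        (read[i:], name)
        for name, read in reads.items()
        for i in range(1, len(read) - min_overlap + 1)
    )
    keys = [suffix for suffix, _ in suffixes]
    overlaps = {}
    for name, read in reads.items():
        for n in range(min_overlap, len(read)):
            prefix = read[:n]
            # Binary search for the block of suffixes exactly equal to this prefix.
            lo, hi = bisect_left(keys, prefix), bisect_right(keys, prefix)
            for _, other in suffixes[lo:hi]:
                if other != name:
                    key = (other, name)
                    overlaps[key] = max(n, overlaps.get(key, 0))
    return overlaps

# Hypothetical reads sampled from one 16-base sequence at offsets 0, 3 and 6.
reads = {"R1": "ACGTTGCATG", "R2": "TTGCATGGAC", "R3": "CATGGACTCA"}
overlaps = suffix_prefix_overlaps(reads, min_overlap=3)
print(overlaps)
```

A real implementation would avoid materializing every suffix as a separate string; the list here only makes the lookup idea visible.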
Year: 2010 PMID: 20529929 PMCID: PMC2881401 DOI: 10.1093/bioinformatics/btq217
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Diagram of a simple string graph. Three overlapping reads (R1, R2, R3) are shown in (A). (B) shows the string graph constructed from the overlaps between the reads. The arrowheads pointing into the nodes depict an edge of type B and arrowheads pointing away from the nodes depict edges of type E. The edge R1 ↔ R3 is transitive.
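The transitive edge in the figure can be illustrated with a small sketch. It simplifies the string graph to a directed graph on the forward strand only, so the B/E edge types are not modeled. The overlap lengths and read lengths are hypothetical values chosen to mimic three staggered reads, and `irreducible_edges` is a name invented for this example.

```python
def irreducible_edges(overlaps, read_len):
    """Drop every edge x -> z that is implied by a two-step path x -> y -> z."""
    nodes = {name for edge in overlaps for name in edge}
    kept = {}
    for (x, z), n_xz in overlaps.items():
        transitive = any(
            (x, y) in overlaps
            and (y, z) in overlaps
            # The two hops must place z at the same offset as the direct edge does.
            and overlaps[(x, y)] + overlaps[(y, z)] - read_len[y] == n_xz
            for y in nodes - {x, z}
        )
        if not transitive:
            kept[(x, z)] = n_xz
    return kept

# Hypothetical 10-base reads, each shifted 3 bases from the previous one.
read_len = {"R1": 10, "R2": 10, "R3": 10}
overlaps = {("R1", "R2"): 7, ("R2", "R3"): 7, ("R1", "R3"): 4}
reduced = irreducible_edges(overlaps, read_len)
print(reduced)
```

Only R1 → R2 and R2 → R3 survive; the R1 → R3 edge is dropped because the path through R2 spells the same sequence.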
Fig. 2. The running time of the direct and exhaustive overlap algorithms for simulated E. coli data with sequence depth from 5× to 100×. The direct overlap algorithm scales linearly with sequence depth. As the number of overlaps grows quadratically with sequence depth, the exhaustive overlap algorithm exhibits above-linear scaling.
Simulation results for human chromosomes 22, 15, 7 and 2
| | chr 22 | chr 15 | chr 7 | chr 2 | ratio |
|---|---|---|---|---|---|
| Chr. size (Mb) | 34.9 | 81.7 | 155.4 | 238.2 | 6.8 |
| Number of reads (M) | 7.0 | 16.3 | 31.1 | 47.6 | 6.8 |
| Contained reads (k) | 684 | 1668 | 3103 | 4709 | 6.9 |
| Contained (%) | 9.8 | 10.2 | 10.0 | 9.9 | – |
| Transitive edges (M) | 38.0 | 93.0 | 177.7 | 274.6 | 7.2 |
| Irreducible edges (M) | 6.3 | 14.9 | 28.7 | 44.4 | 7.0 |
| Assembly N50 (kbp) | 4.0 | 4.6 | 4.2 | 4.7 | – |
| Longest contig (kbp) | 31.9 | 47.7 | 53.1 | 48.6 | – |
| Index time (s) | 2606 | 9743 | 19 779 | 30 866 | 11.8 |
| Overlap -e time (s) | 2657 | 6572 | 12 970 | 18 060 | 6.8 |
| Overlap -d time (s) | 2885 | 6750 | 13 271 | 19 437 | 6.7 |
| Assemble -e time (s) | 1836 | 4043 | 8112 | 13 095 | 7.1 |
| Assemble -d time (s) | 423 | 1161 | 2044 | 3226 | 7.6 |
| Index memory (GB) | 8.0 | 18.6 | 35.4 | 54.5 | 6.8 |
| Overlap -e mem. (GB) | 2.4 | 5.5 | 10.5 | 16.1 | 6.7 |
| Overlap -d mem. (GB) | 2.4 | 5.5 | 10.4 | 16.1 | 6.7 |
| Assemble -e mem. (GB) | 5.9 | 14.2 | 27.2 | 41.9 | 7.1 |
| Assemble -d mem. (GB) | 2.7 | 6.3 | 12.1 | 18.6 | 6.9 |
For the overlap and assemble rows, -e and -d indicate the exhaustive and direct algorithms, respectively. The last column is the ratio of the chromosome 2 value to the chromosome 22 value.
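The ratio column can be recomputed from the table, which makes it easy to compare how each stage scales against the 6.8-fold growth in chromosome size. The snippet below is a small check using a subset of rows copied from the table; the dictionary layout is just a convenience for this example.

```python
# (chr 22 value, chr 2 value) for selected rows of the table above.
rows = {
    "Chr. size (Mb)": (34.9, 238.2),
    "Index time (s)": (2606, 30866),
    "Overlap -e time (s)": (2657, 18060),
    "Overlap -d time (s)": (2885, 19437),
    "Assemble -e time (s)": (1836, 13095),
    "Assemble -d time (s)": (423, 3226),
}

ratios = {name: round(chr2 / chr22, 1) for name, (chr22, chr2) in rows.items()}
size_ratio = ratios["Chr. size (Mb)"]

for name, ratio in ratios.items():
    # Values above 1.0 in the last column grew faster than the chromosome size.
    print(f"{name:<22} ratio = {ratio:>4}  relative to size = {ratio / size_ratio:.2f}")
```

On these numbers the overlap and assemble times grow roughly in step with chromosome size, while the index time ratio (11.8) is noticeably larger than the size ratio (6.8).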