Chengxi Ye, Zhanshan Sam Ma.
Abstract
Motivation. Third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated to be in the range of 15-40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences, which must be extracted from these long but erroneous 3GS sequences.

Results. We describe Sparc, a versatile and efficient linear-complexity consensus algorithm that facilitates de novo genome assembly. Sparc builds a sparse k-mer graph from a collection of sequences from a targeted genomic region. The consensus sequence is obtained by searching a sparsity-induced reweighted graph for the heaviest path, which approximates the most likely genome sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments show that Sparc can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with an error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can achieve a similar error rate when combined with NGS data. Compared with existing approaches, Sparc calculates the consensus with higher accuracy and uses approximately 80% less memory and time.

Availability. The source code is available for download at https://github.com/yechengxi/Sparc.
Keywords: Consensus algorithm; Genome assembly; Single molecular sequencing; Third generation sequencing technology; Variant discovery
Year: 2016 PMID: 27330851 PMCID: PMC4906657 DOI: 10.7717/peerj.2016
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. A standard Overlap-Layout-Consensus genome assembly pipeline.
Figure 2. The pseudo-code of Sparc.
Figure 3. A toy example of constructing the position specific sparse k-mer graph.
(A) The initial k-mer graph of the backbone. (B) Adding two sequences to the graph. (C) The heaviest path representing the consensus is found by graph traversal (original weights are used in this example).
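The three panels of Figure 3 can be imitated with a small Python sketch. This is not the Sparc implementation. It assumes every read is already placed on the backbone at a known offset and contains no insertions or deletions, and it assumes `g <= k` so that consecutive k-mers cover every base. The names `build_graph`, `heaviest_path`, `spell`, and the `penalty` parameter (a simple stand-in for reweighting, left at 0 so the original weights are used as in panel C) are hypothetical.

```python
from collections import defaultdict


def build_graph(backbone, reads, k=2, g=1):
    """Count edge support between position-specific k-mer nodes.

    A node is (position, k-mer), sampled every g bases along the backbone.
    Each read is (offset, sequence) and is assumed to be gap-free relative
    to the backbone, which is a simplification made for this sketch.
    """
    edges = defaultdict(int)  # (node_u, node_v) -> weight
    for offset, seq in [(0, backbone)] + list(reads):
        start = (-offset) % g  # stay on the global grid 0, g, 2g, ...
        nodes = [(offset + i, seq[i:i + k])
                 for i in range(start, len(seq) - k + 1, g)]
        for u, v in zip(nodes, nodes[1:]):
            edges[(u, v)] += 1
    return edges


def heaviest_path(edges, penalty=0):
    """Return the maximum-weight path; positions give a topological order."""
    best = {}  # node -> (best score of a path ending here, predecessor)
    # Process edges by source position so a node's score is final
    # before any edge leaving it is relaxed.
    for (u, v), w in sorted(edges.items(), key=lambda e: e[0][0][0]):
        best.setdefault(u, (0, None))
        score = best[u][0] + w - penalty
        if v not in best or score > best[v][0]:
            best[v] = (score, u)
    node = max(best, key=lambda n: best[n][0])
    path = []
    while node is not None:  # backtrack through predecessors
        path.append(node)
        node = best[node][1]
    path.reverse()
    return path


def spell(path, g=1):
    """Concatenate the first k-mer with the last g bases of each later node."""
    return path[0][1] + "".join(kmer[-g:] for _, kmer in path[1:])


backbone = "ACGTTGCT"                      # contains one substitution error
reads = [(0, "ACGTAGCT"), (2, "GTAGCT")]   # two reads supporting 'A' at position 4
edges = build_graph(backbone, reads, k=2, g=1)
consensus = spell(heaviest_path(edges), g=1)
print(consensus)
```

In this toy input the branch created by the two reads carries more weight than the backbone branch, so the printed consensus is `ACGTAGCT` and not the backbone itself.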
Results on an E. coli dataset using PacBio sequencing.
| Program | Coverage | N50 | # | Time | Memory | Err1 | Err2 | Err4 |
|---|---|---|---|---|---|---|---|---|
| Sparc | 10× PB | 1.06 MB | 11 | 0.5 m | 308 MB | 1.95% | 1.51% | 1.50% |
| PBdagcon | 10× PB | 1.06 MB | 11 | 3.0 m | 1.10 GB | 1.95% | 1.52% | 1.51% |
| Sparc | 10× Hybrid | 1.06 MB | 11 | 0.5 m | 237 MB | 0.19% | 0.09% | 0.06% |
| PBdagcon | 10× Hybrid | 1.06 MB | 11 | 3.0 m | 1.23 GB | 1.02% | 0.64% | 0.58% |
| Sparc | 30× PB | 4.74 MB | 2 | 1.3 m | 2.30 GB | 0.41% | 0.16% | 0.11% |
| PBdagcon | 30× PB | 4.74 MB | 2 | 9.3 m | 7.70 GB | 0.49% | 0.23% | 0.18% |
| Sparc | 30× Hybrid | 4.74 MB | 2 | 1.3 m | 2.14 GB | 0.17% | 0.02% | 0.02% |
| PBdagcon | 30× Hybrid | 4.74 MB | 2 | 9.7 m | 9.58 GB | 0.49% | 0.18% | 0.13% |
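The relative time and memory savings implied by the E. coli PacBio table can be recomputed with a few lines of Python. The numbers are copied from the rows above. The conversion of 308 MB and 237 MB to 0.308 GB and 0.237 GB assumes decimal units, and the helper name `saving` is hypothetical.

```python
# (time in minutes, memory in GB) copied from the E. coli PacBio table
rows = {
    "10x PB":     {"Sparc": (0.5, 0.308), "PBdagcon": (3.0, 1.10)},
    "10x Hybrid": {"Sparc": (0.5, 0.237), "PBdagcon": (3.0, 1.23)},
    "30x PB":     {"Sparc": (1.3, 2.30),  "PBdagcon": (9.3, 7.70)},
    "30x Hybrid": {"Sparc": (1.3, 2.14),  "PBdagcon": (9.7, 9.58)},
}


def saving(sparc, other):
    """Percentage by which the Sparc value is lower than the other value."""
    return 100.0 * (1.0 - sparc / other)


savings = {}
for coverage, r in rows.items():
    t = saving(r["Sparc"][0], r["PBdagcon"][0])
    m = saving(r["Sparc"][1], r["PBdagcon"][1])
    savings[coverage] = (t, m)
    print(f"{coverage}: {t:.0f}% less time, {m:.0f}% less memory")
```

For these four rows the computed savings fall between roughly 70% and 87%.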
Results on an A. thaliana dataset using PacBio sequencing.
| Program | Coverage | N50 | # | Time | Memory | Err1 | Err2 | Err4 |
|---|---|---|---|---|---|---|---|---|
| Sparc | 20× Hybrid | 2.02 MB | 469 | 21 m | 1.7 GB | 0.36% | 0.19% | 0.17% |
| PBdagcon | 20× Hybrid | 2.02 MB | 469 | 123 m | 8.9 GB | 0.81% | 0.53% | 0.47% |
Results on an E. coli dataset using Oxford Nanopore sequencing.
| Program | Coverage | N50 | # | Time | Memory | Err1 | Err2 | Err4 |
|---|---|---|---|---|---|---|---|---|
| Sparc | 30× ON | 4.61 MB | 1 | 2.3 m | 1.89 GB | 11.96% | 9.22% | 7.47% |
| PBdagcon | 30× ON | 4.61 MB | 1 | 10.0 m | 8.38 GB | 13.70% | 12.96% | 12.86% |
| Sparc | 30× Hybrid | 4.61 MB | 1 | 3.3 m | 1.86 GB | 0.72% | 0.59% | 0.46% |
| PBdagcon | 30× Hybrid | 4.61 MB | 1 | 13.2 m | 9.56 GB | 11.20% | 10.01% | 9.96% |
Memory and quality comparisons using different k, g values.
| k | g | Time | Memory | Error rate |
|---|---|---|---|---|
| 1 | 1 | 43 s | 2.3 GB | 0.16% |
| 2 | 1 | 55 s | 3.5 GB | 0.14% |
| 1 | 2 | 59 s | 1.6 GB | 0.18% |
| 2 | 2 | 68 s | 2.3 GB | 0.13% |
Performance of using different b values.
| Dataset | b | Err1 | Err4 |
|---|---|---|---|
| 30× PB Hybrid | 0 | 0.34% | 0.05% |
| 30× PB Hybrid | 5 | 0.17% | 0.02% |
| 30× PB Hybrid | 10 | 0.11% | 0.02% |
| 30× PB Hybrid | 15 | 0.08% | 0.02% |
| 30× ON Hybrid | 0 | 6.87% | 6.69% |
| 30× ON Hybrid | 5 | 0.88% | 0.70% |
| 30× ON Hybrid | 10 | 0.72% | 0.46% |
| 30× ON Hybrid | 15 | 0.69% | 0.48% |