Chunyu Wang, Maozu Guo, Xiaoyan Liu, Yang Liu, Quan Zou.
Abstract
DNA sequencing technology is evolving rapidly and produces short reads in fast-growing volumes. This has led to a resurgence of research in whole-genome shotgun assembly algorithms. Our assembly algorithm begins by clustering the short reads in a cloud computing framework; the clustering process groups fragments according to their original consensus long-sequence similarity. We condense each group of reads into a chain of seeds, a seed being a kind of substring with reads aligned to it, and then build a graph from these chains. Finally, we analyze the graph to find Euler paths, assemble the reads along those paths into contigs, and lay out the contigs into scaffolds using mate-pair information. The results show that the algorithm is efficient and feasible for large read sets such as those produced by next-generation sequencing.
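The Euler-path step can be illustrated with a minimal sketch. It is not the seed-chain graph of the paper: as a stated simplification, it builds a plain de Bruijn-style graph from k-mers and walks it with Hierholzer's algorithm. The function names `build_graph`, `euler_path`, and `assemble` and the parameter `k` are hypothetical, and the sketch assumes error-free reads whose graph admits a single Euler path.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-mer prefix to its suffix.

    Assumption: a k-mer graph stands in for the paper's seed graph.
    """
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the walk deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def euler_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Euler path."""
    edges = {u: list(vs) for u, vs in graph.items()}  # work on a copy
    in_deg = defaultdict(int)
    for targets in edges.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in edges if len(edges[u]) - in_deg[u] == 1),
        next(iter(edges)),
    )
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit node
    return path[::-1]


def assemble(reads, k):
    """Spell the contig along the Euler path: first node plus last base of each next node."""
    path = euler_path(build_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ATGGCG", "GGCGTG", "CGTGCA"], 4))
```

In this toy input the three overlapping reads share no repeated 3-mer, so the path is unique; real data with repeats and sequencing errors needs the additional graph simplification the abstract alludes to.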
Year: 2015 PMID: 26044652 PMCID: PMC4460749 DOI: 10.1186/1755-8794-8-S2-S13
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Figure 1. MapReduce framework.
Figure 2. Generating the .
Figure 3. Extension from .
Figure 4. Methods to resolve graph complexity. (a) Split sharing in-path nodes by a mate-pair; (b) split sharing starting and ending nodes by a mate-pair.
Details of next-generation sequencing datasets used for experiments
| Species | Staphylococcus aureus | Rhodobacter sphaeroides |
|---|---|---|
| Size (Mb) | 2.9 | 4.6 |
| Read length (bp) | 101 | 101 |
| Insert size (bp) | 180 | 180 |
| Number of reads | 1,294,101 | 2,050,868 |
Assembly result of next-generation sequencing data for Staphylococcus aureus and Rhodobacter sphaeroides
| Dataset | Staphylococcus aureus | | Rhodobacter sphaeroides | |
|---|---|---|---|---|
| SeedsGraph | SeedsGraph | |||
| Number of contigs | 274 | 754 | 3,067 | 7,033 |
| N50 of contigs (kb) | 24 | 43 | 27 | 42 |
| Number of scaffolds | 122 | 323 | 2,096 | 4,291 |
| N50 of scaffolds (kb) | 205 | 174 | 95 | 46 |
| Assembly size (%) | 94.2 | 87.4 | 97.2 | 86.3 |
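For readers unfamiliar with the N50 rows above, the following is a minimal sketch of the common definition: the length of the contig (or scaffold) at which the running sum of lengths, taken from longest to shortest, first reaches half of the total assembly length. The record does not say which variant the authors used (some tools normalize by genome size instead), so this definition and the function name `n50` are assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: the length L such that sequences of length >= L
    together cover at least half of the total assembled length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Total is 290, half is 145; 80 + 70 = 150 reaches it at the 70-length contig.
    print(n50([80, 70, 50, 40, 30, 20]))
```

Because N50 depends only on the length distribution, it should be read alongside the assembly size and contig count rather than on its own.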