Xiao-Long Wu, Yun Heo, Izzat El Hajj, Wen-Mei Hwu, Deming Chen, Jian Ma.
Abstract
BACKGROUND: With the falling cost of next-generation sequencing (NGS) technologies, genomics offers an unprecedented opportunity to answer fundamental questions in biology and elucidate human diseases. De novo genome assembly is one of the most important steps in reconstructing the sequenced genome. However, most de novo assemblers require an enormous amount of computational resources, which are not accessible to most research groups and medical personnel.
Year: 2012 PMID: 23281792 PMCID: PMC3526431 DOI: 10.1186/1471-2105-13-S19-S18
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic view of our iterative framework for genome assembly.
Figure 2. Contig clustering algorithm. Words are extracted from contigs. The number of common words between two contigs is used as the edge weight in the graph. Contig lengths are modeled as vertex weights. The contig connectivity graph is thus built, followed by the METIS partitioning process. The partitioned sub-graphs are clustered contig sets.
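The graph construction described in the Figure 2 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the word length, and the choice to ignore reverse complements are all assumptions made for the example. The METIS partitioning step is left out; the returned vertex and edge weights are the inputs such a partitioner would need.

```python
from collections import defaultdict
from itertools import combinations


def extract_words(contig, word_len):
    """Return the set of distinct substrings of length word_len in a contig."""
    return {contig[i:i + word_len] for i in range(len(contig) - word_len + 1)}


def build_connectivity_graph(contigs, word_len=12):
    """Build a contig connectivity graph (hypothetical helper).

    Vertex weight = contig length.
    Edge weight   = number of distinct words shared by two contigs.
    word_len=12 is an assumed default, not a value taken from the paper.
    """
    vertex_weights = [len(c) for c in contigs]

    # Inverted index: word -> ids of the contigs that contain it.
    index = defaultdict(set)
    for cid, contig in enumerate(contigs):
        for word in extract_words(contig, word_len):
            index[word].add(cid)

    # Every word shared by a pair of contigs adds 1 to that pair's edge weight.
    edge_weights = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            edge_weights[(a, b)] += 1

    return vertex_weights, dict(edge_weights)


if __name__ == "__main__":
    toy_contigs = ["ACGTACGT", "GTACGTTT", "CCCCCCCC"]
    weights, edges = build_connectivity_graph(toy_contigs, word_len=4)
    print(weights, edges)
```

The inverted index avoids comparing every pair of contigs directly: only contigs that actually share a word produce an edge, so unrelated contigs stay unconnected and can end up in different partitions.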
Details of the human chromosome 14 read libraries.
| Genome size (bp) | Library 1: # of reads | Library 1: insert length (bp) | Library 2: # of reads | Library 2: insert length (bp) | Library 3: # of reads | Library 3: insert length (bp) |
|---|---|---|---|---|---|---|
| 88,289,540 | 32,621,862 | 155 | 14,054,994 | 2,280-2,800 | 2,009,674 | 35,000 |
The human chromosome 14 assembly results in terms of continuity, accuracy, and statistics.
| Evaluations | Continuity: contig # | Continuity: NG50 (kbp) | Continuity: NG50 corr. (kbp) | Accuracy: SNP | Accuracy: indels | Accuracy: misjoins | Statistics: Asm. (%) | Statistics: unaligned ref. (%) | Statistics: duplicated ref. (%) |
|---|---|---|---|---|---|---|---|---|---|
| Velvet 61k | 28,974 | 5.2 | 4.7 | 82,235 | 17,755 | 601 | 96.69 | 2.09 | 0.43 |
| Tiger-Velvet-R 125i | 20,189 | 11.6 | 9.3 | 84,577 | 21,847 | 533 | 97.90 | 1.98 | 1.50 |
| Tiger-Velvet-I 7i | 21,623 | 10.9 | 8.9 | 84,811 | 21,470 | 654 | 98.43 | 1.53 | 1.48 |
| SOAPdenovo 55k | 50,094 | 3.0 | 3.0 | 67,956 | 11,866 | 36 | 95.91 | 3.13 | 0.28 |
| Tiger-Soap-R 120i | 60,134 | 3.6 | 3.4 | 68,881 | 12,839 | 185 | 99.40 | 3.01 | 2.79 |
| Tiger-Soap-I 7i | 55,173 | 3.8 | 3.6 | 69,215 | 13,390 | 205 | 98.68 | 2.43 | 1.46 |
The columns report the number of contigs, the NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), the numbers of indels and misjoins in contigs, the total assembly length (Asm.), the genome coverage (100 - Unaligned ref.), and duplications. The best k-mer sizes for Velvet and SOAPdenovo are 61 and 55, respectively. In the evaluation names, "#k" stands for the applied k-mer size and "#i" stands for the iteration number.
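Two of the quantities in this note can be made concrete with a short sketch. It assumes the commonly used definition of NG50 (the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the reference genome size); the source does not spell the definition out, and the error-corrected variant is not covered here. The function names are hypothetical.

```python
def ng50(contig_lengths, genome_size):
    """NG50 under the commonly used definition (an assumption here).

    Walk the contigs from longest to shortest and return the length of the
    contig at which the running total first reaches half the genome size.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0  # the assembly covers less than half of the genome


def genome_coverage(unaligned_ref_pct):
    """Genome coverage as given in the table note: 100 - Unaligned ref. (%)."""
    return round(100.0 - unaligned_ref_pct, 2)


if __name__ == "__main__":
    # Toy contig lengths, not data from the paper.
    print(ng50([50, 40, 30, 20, 10], genome_size=200))
    # Unaligned ref. (%) of the Velvet 61k row in the table above.
    print(genome_coverage(2.09))
```

Unlike N50, which is measured against the total assembly length, NG50 is measured against the genome size, so assemblies of different total length stay comparable.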
E. coli (SRR001665) 24-tile assembly results in terms of continuity, accuracy, and statistics.
| Evaluations | Continuity: contig # | Continuity: NG50 (kbp) | Continuity: NG50 corr. (kbp) | Accuracy: SNP | Accuracy: indels | Accuracy: misjoins | Statistics: Asm. (%) | Statistics: unaligned ref. (%) | Statistics: duplicated ref. (%) |
|---|---|---|---|---|---|---|---|---|---|
| Velvet 25k | 147 | 87.0 | 67.3 | 238 | 37 | 2 | 97.89 | 0.56 | 0.01 |
| Tiger-Velvet-R 51i | 281 | 95.6 | 87.2 | 190 | 35 | 3 | 100.92 | 0.14 | 2.30 |
| Tiger-Velvet-I 7i | 276 | 95.4 | 87.2 | 211 | 33 | 12 | 100.40 | 0.12 | 1.68 |
| SOAPdenovo 27k | 450 | 17.9 | 17.9 | 12 | 4 | 1 | 97.56 | 1.31 | 0.00 |
| Tiger-Soap-R 80i | 524 | 25.6 | 25.6 | 31 | 11 | 3 | 98.67 | 1.20 | 0.78 |
| Tiger-Soap-I 7i | 509 | 25.8 | 25.8 | 23 | 6 | 2 | 98.78 | 0.80 | 0.64 |
The columns report the number of contigs, the NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), the numbers of indels and misjoins in contigs, the total assembly length (Asm.), the genome coverage (100 - Unaligned ref.), and duplications. The best k-mer sizes for Velvet and SOAPdenovo are 25 and 27, respectively. In the evaluation names, "#k" stands for the applied k-mer size and "#i" stands for the iteration number. The Tiger-Velvet-I and Tiger-Soap-I evaluations use the best results from Velvet and SOAPdenovo, respectively, as input.
The runtime and memory usage of the assemblies on the human chromosome 14 genome.
| Evaluations | Wall-clock time (Hr.) | Peak memory usage (GB) | Thread # in total | K-mer size # | Tile # |
|---|---|---|---|---|---|
| Velvet 61k | 0.95 | 8.26 | 1 | 1 | 1 |
| Tiger-Velvet-R 1i | 1.49 | 0.16 | 1 | 8 | 150 |
| Tiger-Velvet-I 1i | 1.96 | 0.29 | 1 | 3 | 150 |
| SOAPdenovo 55k | 0.43 | 8.31 | 1 | 1 | 1 |
| Tiger-Soap-R 1i | 1.35 | 1.8 | 1 | 8 | 150 |
| Tiger-Soap-I 1i | 1.67 | 1.9 | 1 | 3 | 150 |
All evaluations use 1 thread. "#k" stands for the applied k-mer size. "#i" stands for the iteration number. Note: The runtime and memory usage reported for Tiger cover the read assembly (Step 5) only.
Comparison of the runtime and memory usage on the human chromosome 14 assembly.
| Evaluations | Wall-clock time (Hr.) | Speedup against 1 thread | Peak memory usage (GB) | Thread # in total | Machine # | K-mer size # | Tile # |
|---|---|---|---|---|---|---|---|
| Velvet 61k | 0.95 | 1x | 8.26 | 1 | 1 | 1 | 1 |
| Velvet 61k | 0.47 | 2.02x | 8.40 | 4 | 1 | 1 | 1 |
| SOAPdenovo 55k | 0.43 | 1x | 8.31 | 1 | 1 | 1 | 1 |
| SOAPdenovo 55k | 0.25 | 1.72x | 8.50 | 4 | 1 | 1 | 1 |
| Tiger-Velvet-I 1i | 4.69 | 1x | 1.87 | 1 | 1 | 1 | 150 |
| Tiger-Velvet-I 1i | 1.58 | 2.98x | 2.44 | 4 | 1 | 1 | 150 |
| Tiger-Velvet-I 1i | 0.83 | 5.69x | N/A+ | 12 | 3 | 1 | 150 |
| Tiger-Velvet-I 1i | 0.66 | 7.16x | N/A+ | 20 | 5 | 1 | 150 |
"#k" stands for the applied k-mer size. "#i" stands for the iteration number.
+ The memory usage across machines cannot be measured in our environment.
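The "Speedup against 1 thread" column is the single-thread wall-clock time divided by the multi-thread wall-clock time. The sketch below recomputes it from the rounded times printed in the table, so its results can differ slightly from the published column, which was presumably derived from unrounded measurements. Parallel efficiency (speedup divided by thread count) is an additional derived metric that does not appear in the source, and the function names are hypothetical.

```python
def speedup(base_hours, hours):
    """Speedup relative to the single-thread run: T(1 thread) / T(n threads)."""
    return base_hours / hours


def parallel_efficiency(base_hours, hours, threads):
    """Speedup per thread; 1.0 would be perfect linear scaling.

    This metric is not reported in the source; it is included only as an aid
    for reading the speedup column.
    """
    return speedup(base_hours, hours) / threads


# (total threads, wall-clock hours) for the Tiger-Velvet-I 1i rows of the table.
TIGER_VELVET_I = [(1, 4.69), (4, 1.58), (12, 0.83), (20, 0.66)]

if __name__ == "__main__":
    base = TIGER_VELVET_I[0][1]
    for threads, hours in TIGER_VELVET_I:
        s = speedup(base, hours)
        e = parallel_efficiency(base, hours, threads)
        print(f"{threads:>2} threads: speedup {s:.2f}x, efficiency {e:.2f}")
```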
Figure 3. Tiger-Velvet-I 1i runtime comparison using the human chromosome 14 data. Different numbers of threads across machines are used. The speedup baseline is labeled as 1x for the other corresponding columns. The k-mer size 61 is used in all tests to avoid runtime variation caused by different k-mer sizes. Step 5* excludes the SSPACE result because SSPACE does not execute across machines.