Aarti Desai, Veer Singh Marwah, Akshay Yadav, Vineet Jha, Kishor Dhaygude, Ujwala Bangar, Vivek Kulkarni, Abhay Jere.
Abstract
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly from these short reads is computationally very intensive. Because of the lower cost and higher throughput, NGS systems can now sequence genomes at high depth. However, no report is currently available on the impact of high sequence depth on genome assembly using real datasets and multiple assembly algorithms. Some recent studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms, but these evaluations were performed on simulated datasets. One limitation of simulated datasets is that variables known to affect genome assembly, such as error rate, read length and coverage, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly of different-sized genomes using graph-based assembly algorithms and real datasets. Illumina reads for E.coli (4.6 MB), S.kudriavzevii (11.18 MB) and C.elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes with all assemblers except Meraculous, which requires 100X read depth. Moreover, de novo assembly from 50X read data requires only 6-40 GB RAM, depending on the genome size and assembly algorithm used. We believe this information can be extremely valuable to researchers in designing experiments and multiplexing, enabling optimum utilization of sequencing and analysis resources.
Year: 2013 PMID: 23593174 PMCID: PMC3625192 DOI: 10.1371/journal.pone.0060204
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Assembly metrics for E.coli genome assembled from Illumina paired end data with Velvet, SOAPdenovo, ABySS, Meraculous and IDBA-UD.
| Depth | Velvet | SOAPdenovo | ABySS | Meraculous | IDBA-UD | ||
| Number of Contigs | |||||||
| 20X | 284 | 280 | 448 | 1510 | 215 | ||
| 35X | 251 | 198 | 278 | 362 | 202 | ||
| 50X | 209 | 253 | 185 | 292 | 233 | ||
| 100X | 157 | 279 | 153 | 148 | 258 | ||
| 150X | 120 | 267 | 146 | 124 | 305 | ||
| 200X | 158 | 311 | 146 | 115 | 289 | ||
| Average Contig Length | |||||||
| 20X | 16156.8 | 16231.5 | 10202.9 | 1419.23 | 21189.7 | ||
| 35X | 18221.6 | 23034.6 | 16627.7 | 11939.2 | 22628.4 | ||
| 50X | 21895 | 18078.4 | 25156.3 | 15031.7 | 19647.6 | ||
| 100X | 29133.9 | 16432.1 | 30395.1 | 30373.8 | 17780.1 | ||
| 150X | 38100.8 | 17176.9 | 31889.4 | 36368.4 | 15067.2 | ||
| 200X | 28964.4 | 14778.4 | 31921.8 | 39274.2 | 15895.1 | ||
| Maximum Contig Length | |||||||
| 20X | 178348 | 221554 | 100991 | 13023 | 173604 | ||
| 35X | 221572 | 314900 | 222650 | 74711 | 203084 | ||
| 50X | 284242 | 268507 | 222853 | 125799 | 221677 | ||
| 100X | 293696 | 312097 | 222915 | 178290 | 224008 | ||
| 150X | 221612 | 221639 | 222915 | 178333 | 268172 | ||
| 200X | 242677 | 315391 | 222915 | 221617 | 224008 | ||
| Genome coverage (bases) | |||||||
| 20X | 4505595 (97.11%) | 4521190 (97.44%) | 4556108 (98.19%) | 2008137 (43.28%) | 4554438 (98.16%) | ||
| 35X | 4600283 (99.15%) | 4549085 (98.04%) | 4613734 (99.44%) | 4294176 (92.55%) | 4570413 (98.5%) | ||
| 50X | 4561865 (98.32%) | 4561639 (98.31%) | 4649731 (100.21%) | 4368129 (94.14%) | 4577197 (98.65%) | ||
| 100X | 4567690 (98.44%) | 4576959 (98.64%) | 4649792 (100.21%) | 4482341 (96.6%) | 4586801 (98.86%) | ||
| 150X | 4567362 (98.44%) | 4582652 (98.77%) | 4655357 (100.33%) | 4500522 (97%) | 4594933 (99.03%) | ||
| 200X | 4570779 (98.51%) | 4592154 (98.97%) | 4660225 (100.44%) | 4509122 (97.18%) | 4593102 (98.99%) | ||
The expected genome size for E.coli MG1655 is 4639675 bases.
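The percentages in the genome coverage rows appear to be the covered bases divided by the expected genome size. The following is a minimal sketch of that arithmetic, not the authors' script; the function name `coverage_percent` is hypothetical, and the table values seem to be truncated rather than rounded, so the last digit may differ.

```python
def coverage_percent(covered_bases: int, genome_size: int) -> float:
    """Return covered bases as a percentage of the expected genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return 100.0 * covered_bases / genome_size


# Expected genome size for E.coli MG1655, as stated above.
ECOLI_MG1655_SIZE = 4639675

# Velvet at 20X and ABySS at 50X, with base counts taken from the table.
velvet_20x = coverage_percent(4505595, ECOLI_MG1655_SIZE)
abyss_50x = coverage_percent(4649731, ECOLI_MG1655_SIZE)

print(f"Velvet 20X: {velvet_20x:.2f}%")
print(f"ABySS 50X: {abyss_50x:.2f}%")
```

A value above 100% only means that the reported base count exceeds the expected genome size.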
Assembly metrics for S.kudriavzevii genome assembled from Illumina paired end data with Velvet, SOAPdenovo, ABySS, Meraculous and IDBA-UD.
| Depth | Velvet | SOAPdenovo | ABySS | Meraculous | IDBA-UD |
| Number of Contigs | |||||
| 20X | 3647 | 3421 | 5524 | 8323 | 1203 |
| 35X | 1424 | 1167 | 3080 | 2414 | 1773 |
| 50X | 661 | 2145 | 2243 | 2097 | 1879 |
| 100X | 524 | 3102 | 1916 | 744 | 1991 |
| 150X | 573 | 3763 | 1850 | 594 | 1960 |
| 200X | 711 | 4689 | 2200 | 573 | 2012 |
| Average Contig Length (bases) | |||||
| 20X | 3221.18 | 3377.79 | 2103.14 | 328.911 | 9674.6 |
| 35X | 8177.24 | 10267.7 | 3828.97 | 4717.08 | 6624.67 |
| 50X | 17617.6 | 5472.39 | 5269.26 | 5386.86 | 6262.4 |
| 100X | 22199.2 | 3838.7 | 6180.79 | 15417.2 | 5929.76 |
| 150X | 20328.5 | 3198.61 | 6415.42 | 19336.2 | 6024.7 |
| 200X | 16432.7 | 2599.94 | 5407.71 | 20063.2 | 5875.02 |
| Maximum Contig Length (bases) | |||||
| 20X | 32734 | 47695 | 27667 | 5483 | 215790 |
| 35X | 320896 | 220365 | 175589 | 48857 | 215367 |
| 50X | 207865 | 478251 | 129807 | 60475 | 188478 |
| 100X | 396838 | 377320 | 391375 | 175552 | 329736 |
| 150X | 404654 | 458019 | 391364 | 360534 | 348439 |
| 200X | 381309 | 348264 | 350371 | 293610 | 348443 |
| Genome Coverage (bases) | |||||
| 20X | 11137714 (99.42%) | 11099527 (99.08%) | 11149205 (99.53%) | 2613522 (23.33%) | 11148577 (99.52%) |
| 35X | 11221514 (100.17%) | 11146583 (99.50%) | 11227483 (100.23%) | 11123809 (99.30%) | 11268339 (100.59%) |
| 50X | 11176252 (99.77%) | 11219577 (100.15%) | 11245724 (100.39%) | 11002676 (98.22%) | 11271642 (100.62%) |
| 100X | 11200115 (99.98%) | 11260111 (100.52%) | 11236212 (100.30%) | 11167029 (99.69%) | 11273408 (100.64%) |
| 150X | 11207895 (100.05%) | 11292045 (100.80%) | 11237111 (100.31%) | 11180363 (99.80%) | 11274406 (100.64%) |
| 200X | 11222467 (100.18%) | 11339559 (101.23%) | 11240746 (100.34%) | 11182471 (99.82%) | 11274323 (100.64%) |
The assembled ZP_591 genome used as the reference is 11201698 bases long.
Assembly metrics for C.elegans genome assembled from Illumina paired end data with Velvet, SOAPdenovo, ABySS, Meraculous and IDBA-UD.
| Depth | Velvet | SOAPdenovo | ABySS | Meraculous | IDBA-UD |
| Number of Contigs | |||||
| 20X | 64450 | 124333 | 158419 | 127344 | 45380 |
| 35X | 50223 | 54175 | 130592 | Data Not Generated | 26174 |
| 50X | 20369 | 58430 | 113521 | 5130 | 26049 |
| 100X | 12852 | 54936 | 83051 | 76890 | 31682 |
| 150X | 10362 | 55090 | 73487 | 40161 | 41191 |
| 200X | 9547 | 56810 | 70275 | 25002 | 49908 |
| Average Contig size | |||||
| 20X | 1581.05 | 805.639 | 656.37 | 201.977 | 2090.02 |
| 35X | 1988.3 | 1842.2 | 803.407 | Data Not Generated | 3837.33 |
| 50X | 4983.68 | 1756.63 | 940.162 | 335.264 | 3855.73 |
| 100X | 7928.07 | 1897.26 | 1298.32 | 644.795 | 3240.16 |
| 150X | 9863.32 | 1907.95 | 1476.72 | 2445.47 | 2529.5 |
| 200X | 10715 | 1860.29 | 1549.11 | 3692.64 | 2112 |
| Maximum Contig length | |||||
| 20X | 39794 | 27376 | 35861 | 11562 | 58626 |
| 35X | 90829 | 135554 | 102365 | Data Not Generated | 154577 |
| 50X | 395204 | 115246 | 125515 | 13606 | 232754 |
| 100X | 406962 | 116690 | 192510 | 15948 | 270543 |
| 150X | 403179 | 142926 | 203282 | 125692 | 200042 |
| 200X | 442196 | 120281 | 384612 | 86439 | 270543 |
| Genome Coverage (bases) | |||||
| 20X | 95125613 (94.85%) | 95426999 (95.15%) | 96843084 (96.57%) | 24649177 (24.58%) | 90021226 (89.76%) |
| 35X | 94709449 (94.44%) | 94868880 (94.60%) | 97717867 (97.44%) | Data Not Generated | 95559209 (95.29%) |
| 50X | 95784402 (95.51%) | 97746194 (97.47%) | 99358917 (99.08%) | 1620321 (1.61%) | 95548487 (95.28%) |
| 100X | 96294494 (96.02%) | 99077027 (98.79%) | 101338283 (101.05%) | 48129645 (47.99%) | 97789517 (97.51%) |
| 150X | 96576131 (96.35%) | 99904107 (99.62%) | 103018960 (102.02%) | 93574066 (93.31%) | 99334562 (99.05%) |
| 200X | 96671928 (96.40%) | 100428367 (100.14%) | 103378677 (103.08%) | 88472503 (88.22%) | 100530260 (100.2%) |
The expected genome size for C.elegans is 100281427 bases.
Figure 1. N50 value for the genomes assembled by Velvet, SOAPdenovo, ABySS, Meraculous and IDBA-UD.
A) N50 for the assembled E.coli genome: N50 is the length of the smallest contig which, when added to the set of all larger contigs, yields at least 50% of the genome. The N50 values for IDBA-UD, Velvet and SOAPdenovo appeared to plateau at 35X, and ABySS at 50X depth of coverage. In contrast, the N50 value of the Meraculous assembly increased until 150X depth of coverage. B) N50 for the assembled S.kudriavzevii genome: IDBA-UD and SOAPdenovo attained peak N50 values at 35X and 100X depth of coverage respectively, whereas the N50 values of the Velvet, ABySS and Meraculous assemblies increased until 150X depth of coverage. C) N50 for the assembled C.elegans genome: SOAPdenovo, ABySS and IDBA-UD reached peak N50 values at 100X depth of coverage, whereas the N50 value of the Velvet assembly increased approximately 1.5 fold until 150X, with no change thereafter. The Velvet assembly had the best N50 values of the 4 assemblers.
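The N50 definition in the caption can be illustrated with a short sketch. This is an assumption-laden illustration rather than the code used in the study: the function name `n50` and the optional `genome_size` parameter are hypothetical, and when no genome size is given the total assembled length is used as the denominator, which is the more common convention. The contig lengths are made-up toy values.

```python
from typing import Iterable, Optional


def n50(contig_lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Length of the smallest contig in the set of largest contigs that
    together reach at least 50% of the total (or of genome_size, if given)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    # Assumption: fall back to the assembly's own total length.
    total = genome_size if genome_size is not None else sum(lengths)
    threshold = total / 2
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length
    # Reached only when the contigs sum to less than half of genome_size.
    raise ValueError("contigs cover less than 50% of the given genome size")


# Toy contig lengths for illustration only, not data from the study.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))                   # denominator: assembled length (300)
print(n50(toy_contigs, genome_size=400))  # denominator: assumed genome size
```

With a fixed genome size as the denominator, assemblies of the same genome are compared against the same threshold, whereas the default compares each assembly against its own total length.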
Figure 2. Memory requirement for genome assembly.
Memory required to assemble E.coli (A), S.kudriavzevii (B) and C.elegans (C) genomes increased, although not proportionately, with increasing depth of sequencing.