Yesesri Cherukuri, Sarath Chandra Janga.
Abstract
BACKGROUND: Improved DNA sequencing methods have transformed the field of genomics over the last decade. This became possible through the development of inexpensive short-read sequencing technologies, which have now resulted in three generations of sequencing platforms. More recently, a fourth generation of nanopore-based single-molecule sequencing technology was developed, based on the MinION® sequencer, which is portable, inexpensive and fast. It is capable of generating reads longer than 100 kb. Although it has many specific advantages, the two major limitations of MinION reads are high error rates and the need to develop downstream pipelines. Algorithms for error correction have already emerged, while the development of pipelines is still at a nascent stage.
Keywords: Contigs; De Bruijn; De novo assembly; Greedy Extension graph; MinION®; N50; Nanopore; Oxford Nanopore
Year: 2016 PMID: 27556636 PMCID: PMC5001211 DOI: 10.1186/s12864-016-2895-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The pipeline implemented in this study for benchmarking various assembler algorithms on Nanopore-sequenced datasets
Fig. 2 Each pair of plots gives an overview of the quality of the assemblies across assemblers for the E. coli and yeast datasets. a, b: Histograms with error bars of the assembly N50 value against the % of 2D reads, showing how N50 varies among assembler algorithms and with data size. c, d: Histograms with error bars of the number of contigs generated against the % of 2D reads, showing how the number of contigs varies with respect to the mean contig length for each assembler algorithm across the bins of each dataset. e, f: Histograms of the average contig length obtained with each algorithm against the percentage of 2D reads employed (X-axis). g, h: Histograms of the sum of the lengths of all contigs generated by an assembler as a function of the percentage of total reads employed in the assembly. In each pair of plots, the left panel corresponds to the E. coli dataset and the right panel to the yeast dataset. In all plots, the numeric labels on the histograms give the value of the metric, in the color representing each tool
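The four assembly metrics named in the Fig. 2 caption (N50, number of contigs, mean contig length, total assembled length) can be sketched as follows. This is a minimal illustration using the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length); the function name `assembly_stats` and the example lengths are hypothetical and are not taken from the study.

```python
def assembly_stats(contig_lengths):
    """Return basic contiguity metrics for a list of contig lengths (in bases).

    Assumes the common N50 definition: walk the contigs from longest to
    shortest and report the length at which the running sum first reaches
    half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    total = sum(lengths)
    running = 0
    n50 = lengths[-1]
    for length in lengths:
        running += length
        if 2 * running >= total:  # integer test for running >= total / 2
            n50 = length
            break

    return {
        "num_contigs": len(lengths),
        "total_length": total,
        "mean_length": total / len(lengths),
        "n50": n50,
    }


# Hypothetical contig lengths, for illustration only.
stats = assembly_stats([80, 70, 50, 40, 30, 20])
print(stats)
```

For these example lengths the total is 290, so the running sum first reaches half of it (145) at the second-longest contig.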
Fig. 3 Each pair of plots gives an overview of the computational requirements of each assembler for assembling the E. coli and yeast datasets. a, b: Histograms with error bars of the log of wall time (the actual elapsed time each assembler takes to complete the task) against the % of 2D reads, as data size gradually increases. c, d: Histograms with error bars of the log of CPU time (the time the CPU spends actually executing instructions for each assembler) against the % of 2D reads, as data size varies. e, f: Histograms with error bars of the log of wall time against the amount of allotted memory (X-axis), showing the influence of memory allocation on the wall time consumed by the various assembler algorithms. g, h: Histograms with error bars of the log of CPU time against the amount of memory, illustrating the influence of memory allocation on the CPU time consumed by the various assembler algorithms. In each pair of plots, the left panel corresponds to the E. coli dataset and the right panel to the yeast dataset
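The distinction between wall time and CPU time drawn in the Fig. 3 caption can be illustrated with a short sketch. It times a Python callable inside the current process using the standard-library clocks `time.perf_counter` (elapsed time) and `time.process_time` (CPU time of the current process); the helper `measure` is hypothetical, and this is not how the study timed the assemblers, which run as external programs.

```python
import time


def measure(task):
    """Run task() and return (result, wall_seconds, cpu_seconds).

    Wall time is elapsed real time; CPU time counts only the time this
    process spent executing instructions, so waiting (sleep, I/O) adds
    to wall time but not to CPU time.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = task()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


# A task that mostly waits: wall time grows, CPU time stays near zero.
_, wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

A compute-bound task would instead show CPU time close to wall time on a single thread, which is why the two quantities are reported separately.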
Fig. 4 Each pair of plots shows the accuracy of the assembly generated by various assembler algorithms for the E. coli (panels a and c) and yeast (panels b and d) datasets. a, b: Line graphs of the % of genome covered against the % of 2D reads, showing the extent of the genome assembled by each assembler algorithm. c, d: Line graphs of the % of alignment against the % of 2D reads, showing the confidence level of the contigs assembled by the various assembler algorithms