
LASER: Large genome ASsembly EvaluatoR.

Nilesh Khiste, Lucian Ilie

Abstract

BACKGROUND: Genome assembly is a fundamental problem with multiple applications. Current technological limitations do not allow assembling of entire genomes and many programs have been designed to produce longer and more reliable contigs. Assessing the quality of these assemblies and comparing those produced by different tools is essential in choosing the best ones. The QUAST program has become the current state-of-the-art in quality assessment of genome assemblies. The only drawback of QUAST is high time and memory usage for large genomes, e.g., over 4 days and 120 GB of RAM for a single human genome assembly.
RESULTS: We introduce LASER, a new tool for assembly evaluation that greatly improves on the speed and memory requirements of QUAST. For a human genome assembly, LASER is 5.6 times faster than QUAST while using only half the memory; one human genome assembly is evaluated in 17 hours instead of 4 days. The code of LASER is based on that of QUAST and therefore inherits all its features.
CONCLUSIONS: Genome assembly evaluation is an essential step in assessing the quality of an assembly that is too often done improperly, in part due to significant resource consumption. With the introduction of LASER, proper evaluation can be performed efficiently.

Year:  2015        PMID: 26601933      PMCID: PMC4657217          DOI: 10.1186/s13104-015-1682-y

Source DB:  PubMed          Journal:  BMC Res Notes        ISSN: 1756-0500


Background

Current sequencing technologies produce short pieces of DNA, called reads, that need to be assembled together to reconstruct the original genome. Usually, whole genomes cannot be produced and instead the assembling programs produce longer DNA pieces, called contigs. High quality assemblies require longer and more accurate contigs. Genome assembly is a difficult problem that is far from being solved. A multitude of assemblers have been designed, see, e.g., [1-11]. Comparing the quality of two assemblies is already nontrivial; one may have longer contigs while the other may have fewer misassemblies. Given the large number of tools available, choosing the best one for, say, building a new pipeline, becomes a difficult problem. Evaluating the assembly quality of an assembler during its design stage is essential as well. Therefore, fast and effective evaluation of genome assembly quality is of crucial importance and a number of solutions have been proposed [12-17].

The most comprehensive evaluation is currently provided by the QUAST program [17]. QUAST quickly became the state of the art in assembly evaluation. Its thorough evaluation, new metrics, and useful visualizations led to widespread use. Its only drawback is the high time and memory usage for large genome assemblies. In most cases, it requires over 4 days and 120 GB of RAM to assess the quality of a single human genome assembly. To remedy this problem we have designed LASER: Large genome ASsembly EvaluatoR. LASER’s code is based on that of QUAST, inheriting all its features and advantages. We describe below the essential improvements implemented in LASER and compare its performance with that of QUAST on several human datasets.

Methods

The most time-consuming stage of QUAST is, by far, the maximal exact match (MEM) computation step of the alignment process, performed using the NUCmer aligner from MUMmer v3.23 [18]. Our recent E-MEM tool [19] clearly outperforms not only MUMmer but also the currently best tools for MEM computation in large genomes [20-24]. It was therefore a natural choice for replacing MUMmer. Besides using E-MEM, we made a number of other improvements. A large number of redundant copy operations on large strings in the ‘show-snp’ utility program of the MUMmer toolkit have been avoided. The memory usage and performance of the Python code were improved by replacing class objects with tuples. The rest of the QUAST code has been reused in LASER. MUMmer and GlimmerHMM [25] are open source and the authors of GeneMarkS [26] have kindly allowed us to use their code in LASER.
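To make the notion of a maximal exact match concrete, the following is a minimal sketch of a naive quadratic-time MEM finder. It is not the algorithm used by E-MEM or MUMmer, which rely on far more efficient methods; the function name `maximal_exact_matches` and the toy sequences are assumptions made for illustration only.

```python
def maximal_exact_matches(ref, query, min_len):
    """Return (ref_pos, query_pos, length) for every MEM of length >= min_len.

    Naive O(len(ref) * len(query)) scan, for illustration only
    (hypothetical helper, not taken from LASER, QUAST, or E-MEM).
    """
    mems = []
    for i in range(len(ref)):
        for j in range(len(query)):
            if ref[i] != query[j]:
                continue
            # Left-maximal: skip starts that could be extended to the left.
            if i > 0 and j > 0 and ref[i - 1] == query[j - 1]:
                continue
            # Extend to the right for as long as the characters agree.
            length = 0
            while (i + length < len(ref) and j + length < len(query)
                   and ref[i + length] == query[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


# Toy example: the only shared substring of length >= 3 is "ATTA".
print(maximal_exact_matches("GATTACA", "CATTAGA", 3))  # [(1, 1, 4)]
```

A match is reported only if it cannot be extended in either direction, which is what distinguishes a MEM from an arbitrary common substring. The quadratic scan is impractical for genome-sized inputs, which is why specialized tools matter at this scale.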

Results

As mentioned before, all features of QUAST have been preserved and LASER has been designed to be used exactly the same way as QUAST. That is, LASER produces exactly the same output. The advantage of LASER consists of greatly increased speed and reduced memory usage. To prove these claims, we have compared LASER and QUAST on several datasets, presented in Table 1. As we are interested in improvement when it really matters, that is, for large genomes, all datasets are human. They were all produced by Illumina HiSeq2000 machines. All datasets were assembled using SOAPdenovo2 [6]. We used SOAPdenovo2 because of its good speed. The k-mer size producing the best assembly (as indicated by the aligned N50 size) was used. This was for and for the other datasets. The assemblies are available for download from the website of LASER.
Table 1

The datasets used for comparison; accession numbers are included for the datasets and for the corresponding reference genomes

| Dataset | Organism | Accession number | Read length | Number of reads | Total bp | Depth of coverage | Reference genome | Genome length |
|---|---|---|---|---|---|---|---|---|
| H1 | Homo sapiens | SRR1302280 | 101 | 1,287,175,558 | 130,004,731,358 | 41 | Build 38 | 3,209,286,105 |
| H2 | Homo sapiens | ERR194146 | 101 | 1,626,361,156 | 164,262,476,756 | 51 | Build 38 | 3,209,286,105 |
| H3 | Homo sapiens | ERR194147 | 101 | 1,574,530,218 | 159,027,552,018 | 50 | Build 38 | 3,209,286,105 |
| H4 | Homo sapiens | ERR324433 | 101 | 1,614,713,636 | 163,086,077,236 | 51 | Build 38 | 3,209,286,105 |
| H5 | Homo sapiens | ERX069505 | 101 | 1,708,169,546 | 172,525,124,146 | 54 | Build 38 | 3,209,286,105 |
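The derived columns of Table 1 can be cross-checked with a short calculation. The sketch below assumes that total bp is the number of reads times the read length and that depth of coverage is total bp divided by the reference genome length, rounded to the nearest integer; the source does not state these formulas, but they reproduce the tabulated values. The helper names are hypothetical.

```python
GENOME_LENGTH = 3_209_286_105  # reference genome length from Table 1

# (read length, number of reads) per dataset, copied from Table 1
DATASETS = {
    "H1": (101, 1_287_175_558),
    "H2": (101, 1_626_361_156),
    "H3": (101, 1_574_530_218),
    "H4": (101, 1_614_713_636),
    "H5": (101, 1_708_169_546),
}


def total_bp(read_length, n_reads):
    """Total sequenced bases, assuming all reads have the same length."""
    return read_length * n_reads


def depth_of_coverage(bases, genome_length=GENOME_LENGTH):
    """Average number of times each reference base is covered (assumed formula)."""
    return bases / genome_length


for name, (read_length, n_reads) in DATASETS.items():
    bases = total_bp(read_length, n_reads)
    print(name, bases, round(depth_of_coverage(bases)))
```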
All tests were performed on a DELL PowerEdge R620 computer with 12 Intel Xeon cores at 2.0 GHz and 256 GB of RAM, running Linux Red Hat, CentOS 6.3. Figure 1 gives the time and memory comparison between QUAST and LASER on the SOAPdenovo2 assemblies produced from the datasets in Table 1. LASER is 5.6 times faster than QUAST while using half the memory.
Fig. 1

Comparison. Visual comparison of the time (left plot) and memory (right plot) between QUAST and LASER

Conclusions

We hope that the improvement in genome assembly evaluation provided by LASER will further boost the use of thorough quality evaluation. N50 is still used as the most important parameter. (N50 is the largest length l such that the sum of the lengths of all contigs of length l or more is at least half of the total length of all contigs.) An aggressive assembler will produce a high N50 but at the cost of many misassemblies, thus lowering the overall quality. Therefore, a combination of parameters, as provided by QUAST or LASER, gives a much better evaluation of the actual assembly quality.
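The N50 definition above translates directly into a few lines of code. This is a minimal sketch, not the implementation used in QUAST or LASER; the function name `n50` and the contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Walk the contigs from longest to shortest and return the length at
    which the running total first reaches half of the assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical assembly: total length 290, so half is 145.
# 80 + 70 = 150 >= 145, hence N50 = 70.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Note that N50 depends only on contig lengths, not on their correctness, which is why joining contigs incorrectly can raise it.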

Availability and requirements

Project name: LASER
Project home page: http://www.csd.uwo.ca/~ilie/LASER/
Operating system(s): UNIX, Linux, Mac OS X
Programming language: C++, OpenMP
License: see web page
Any restrictions to use by non-academics: licence needed.
  24 in total

1.  GAGE: A critical evaluation of genome assemblies and assembly algorithms.

Authors:  Steven L Salzberg; Adam M Phillippy; Aleksey Zimin; Daniela Puiu; Tanja Magoc; Sergey Koren; Todd J Treangen; Michael C Schatz; Arthur L Delcher; Michael Roberts; Guillaume Marçais; Mihai Pop; James A Yorke
Journal:  Genome Res       Date:  2012-01-06       Impact factor: 9.043

2.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors:  Daniel R Zerbino; Ewan Birney
Journal:  Genome Res       Date:  2008-03-18       Impact factor: 9.043

3.  A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.

Authors:  Zia Khan; Joshua S Bloom; Leonid Kruglyak; Mona Singh
Journal:  Bioinformatics       Date:  2009-04-23       Impact factor: 6.937

4.  slaMEM: efficient retrieval of maximal exact matches using a sampled LCP array.

Authors:  Francisco Fernandes; Ana T Freitas
Journal:  Bioinformatics       Date:  2013-12-11       Impact factor: 6.937

5.  The MaSuRCA genome assembler.

Authors:  Aleksey V Zimin; Guillaume Marçais; Daniela Puiu; Michael Roberts; Steven L Salzberg; James A Yorke
Journal:  Bioinformatics       Date:  2013-08-29       Impact factor: 6.937

6.  Assemblathon 1: a competitive assessment of de novo short read assembly methods.

Authors:  Dent Earl; Keith Bradnam; John St John; Aaron Darling; Dawei Lin; Joseph Fass; Hung On Ken Yu; Vince Buffalo; Daniel R Zerbino; Mark Diekhans; Ngan Nguyen; Pramila Nuwantha Ariyaratne; Wing-Kin Sung; Zemin Ning; Matthias Haimel; Jared T Simpson; Nuno A Fonseca; İnanç Birol; T Roderick Docking; Isaac Y Ho; Daniel S Rokhsar; Rayan Chikhi; Dominique Lavenier; Guillaume Chapuis; Delphine Naquin; Nicolas Maillet; Michael C Schatz; David R Kelley; Adam M Phillippy; Sergey Koren; Shiaw-Pyng Yang; Wei Wu; Wen-Chi Chou; Anuj Srivastava; Timothy I Shaw; J Graham Ruby; Peter Skewes-Cox; Miguel Betegon; Michelle T Dimon; Victor Solovyev; Igor Seledtsov; Petr Kosarev; Denis Vorobyev; Ricardo Ramirez-Gonzalez; Richard Leggett; Dan MacLean; Fangfang Xia; Ruibang Luo; Zhenyu Li; Yinlong Xie; Binghang Liu; Sante Gnerre; Iain MacCallum; Dariusz Przybylski; Filipe J Ribeiro; Shuangye Yin; Ted Sharpe; Giles Hall; Paul J Kersey; Richard Durbin; Shaun D Jackman; Jarrod A Chapman; Xiaoqiu Huang; Joseph L DeRisi; Mario Caccamo; Yingrui Li; David B Jaffe; Richard E Green; David Haussler; Ian Korf; Benedict Paten
Journal:  Genome Res       Date:  2011-09-16       Impact factor: 9.043

7.  E-MEM: efficient computation of maximal exact matches for very large genomes.

Authors:  Nilesh Khiste; Lucian Ilie
Journal:  Bioinformatics       Date:  2014-10-17       Impact factor: 6.937

8.  ALLPATHS: de novo assembly of whole-genome shotgun microreads.

Authors:  Jonathan Butler; Iain MacCallum; Michael Kleber; Ilya A Shlyakhter; Matthew K Belmonte; Eric S Lander; Chad Nusbaum; David B Jaffe
Journal:  Genome Res       Date:  2008-03-13       Impact factor: 9.043

9.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors:  Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal:  Gigascience       Date:  2012-12-27       Impact factor: 6.524

10.  SAGE: String-overlap Assembly of GEnomes.

Authors:  Lucian Ilie; Bahlul Haider; Michael Molnar; Roberto Solis-Oba
Journal:  BMC Bioinformatics       Date:  2014-09-15       Impact factor: 3.169
