
Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs.

Amatur Rahman, Paul Medvedev.

Abstract

Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from misassemblies (joining sequences that should not be adjacent) and from underassemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is at the core of most assemblers. We prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e., they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in underassembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. In particular, when coverage is low, even error-free data result in unsafe unitigs; in addition, unitigs may unnecessarily split palindromes in half if special care is not taken. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and to confirm and measure the extent of their occurrence in practice.
© 2022 Rahman and Medvedev; Published by Cold Spring Harbor Laboratory Press.
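
The first claim of the abstract (error-free reads at low coverage can produce an unsafe unitig) can be illustrated with a small toy sketch. This is not the authors' construction or code. It assumes a simplified forward-strand-only, node-centric de Bruijn graph with k = 3, and the genome, the reads, and the helper name `unitigs` are all made up for illustration.

```python
def unitigs(reads, k):
    """Maximal non-branching paths of a forward-strand, node-centric
    de Bruijn graph (a simplification: reverse complements are ignored)."""
    nodes = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    succ = {u: [u[1:] + c for c in "ACGT" if u[1:] + c in nodes] for u in nodes}
    pred = {u: [c + u[:-1] for c in "ACGT" if c + u[:-1] in nodes] for u in nodes}

    def mergeable(u, v):
        # Edge u -> v is compacted only if it is u's sole out-edge
        # and v's sole in-edge.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def is_start(u):
        return not (len(pred[u]) == 1 and mergeable(pred[u][0], u))

    result, seen = [], set()
    starts = [u for u in sorted(nodes) if is_start(u)]
    # Second pass over all nodes picks up isolated cycles, which have no start.
    for u in starts + sorted(nodes):
        if u in seen:
            continue
        seen.add(u)
        path = u
        while (len(succ[u]) == 1 and mergeable(u, succ[u][0])
               and succ[u][0] not in seen):
            u = succ[u][0]
            seen.add(u)
            path += u[-1]
        result.append(path)
    return sorted(result)


# Hypothetical genome in which the 3-mer TCG occurs twice.
genome = "AATCGGACCTCGTT"

# Full coverage: every k-mer is observed, so the repeat shows up as a branch.
full = unitigs([genome], 3)

# Low coverage: three error-free reads that miss the k-mers CGG and CTC,
# which are exactly the ones that would reveal the branching at the repeat.
low_cov_reads = ["AATCG", "GGACCT", "TCGTT"]
low = unitigs(low_cov_reads, 3)

for u in low:
    print(u, "safe" if u in genome else "UNSAFE (not a substring of the genome)")
```

In this toy case every read is an exact substring of the genome, yet the unitig `AATCGTT` joins the left flank of the first repeat copy to the right flank of the second, which is the kind of misassembly the abstract describes. With all k-mers present, the same procedure stops at the repeat and every unitig is a substring of the genome.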


Year:  2022        PMID: 35896425      PMCID: PMC9528984          DOI: 10.1101/gr.276601.122

Source DB:  PubMed          Journal:  Genome Res        ISSN: 1088-9051            Impact factor:   9.438


