Maryam Ghareghani (1,2,3), David Porubský (1,2), Ashley D Sanders (4), Sascha Meiers (4), Evan E Eichler (5,6), Jan O Korbel (4), Tobias Marschall (1,2).
1. Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.
2. Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany.
3. Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany.
4. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany.
5. Department of Genome Sciences, University of Washington, Seattle, WA, USA.
6. Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
Abstract
Motivation: Current sequencing technologies can produce reads that are orders of magnitude longer than was previously possible. Such long reads have sparked renewed interest in de novo genome assembly, which removes the reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.
Results: We show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it makes it possible to assess the uncertainty inherent to sparse Strand-seq data at the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.
Availability and implementation: https://github.com/daewoooo/SaaRclust.
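As a rough illustration of what a per-read posterior distribution provides, the sketch below turns per-cluster log-likelihoods for a single read into posterior probabilities with Bayes' rule and keeps the assignment only when the top posterior clears a cutoff. It is not the SaaRclust model or its code: the function names, the toy log-likelihood values, the uniform prior and the 0.99 cutoff are all assumptions made for this example.

```python
import math

def posterior(log_likelihoods, log_priors):
    """Posterior over clusters for one read via Bayes' rule.

    Uses the log-sum-exp trick so very small likelihoods do not underflow.
    """
    joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    m = max(joint)
    log_norm = m + math.log(sum(math.exp(j - m) for j in joint))
    return [math.exp(j - log_norm) for j in joint]

def confident_assignment(post, threshold=0.99):
    """Index of the most probable cluster, or None if it is below the cutoff.

    The threshold is a hypothetical choice for this sketch.
    """
    best = max(range(len(post)), key=post.__getitem__)
    return best if post[best] >= threshold else None

# Toy input: three hypothetical clusters with a uniform prior.
log_priors = [math.log(1 / 3)] * 3
clear_read = posterior([-2.0, -15.0, -14.0], log_priors)
ambiguous_read = posterior([-5.0, -5.2, -9.0], log_priors)

print(confident_assignment(clear_read))      # one cluster dominates
print(confident_assignment(ambiguous_read))  # two clusters are nearly tied
```

Under these toy values the first read is assigned to a cluster and the second is left unassigned, which is the practical benefit of reporting a full posterior rather than a hard label.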