Can Firtina (1), Jeremie S Kim (1,2), Mohammed Alser (1), Damla Senol Cali (2), A Ercument Cicek (3), Can Alkan (3), Onur Mutlu (1,2,3)
1. Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.
2. Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
3. Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey.
Abstract
MOTIVATION: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of the base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by fixing errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can polish an assembly only with reads from a particular sequencing technology, or can polish only a small assembly. These technology and assembly-size dependencies force researchers to (i) run multiple polishing algorithms to use all available read sets and (ii) split a large genome into small chunks to polish it.
RESULTS: We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts.
AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/CMU-SAFARI/Apollo.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
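To make the train-then-decode idea in steps (ii) and (iii) concrete, the following is a minimal sketch of Forward-Backward (Baum-Welch) training followed by Viterbi decoding on a generic two-state discrete HMM. It is not Apollo's implementation and it does not build the pHMM graph that Apollo derives from an assembly. The two hidden states, the initial probabilities, the toy read and all function names are assumptions chosen only for illustration.

```python
import math

def forward(obs, start, trans, emit):
    """alpha[t][s] = P(obs[0..t], state at t is s)."""
    n = len(start)
    alpha = [[start[s] * emit[s][obs[0]] for s in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[s][o] * sum(prev[r] * trans[r][s] for r in range(n))
                      for s in range(n)])
    return alpha

def backward(obs, trans, emit):
    """beta[t][s] = P(obs[t+1..] | state at t is s)."""
    n = len(trans)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[s][r] * emit[r][o] * nxt[r] for r in range(n))
                        for s in range(n)])
    return beta

def baum_welch_step(obs, start, trans, emit):
    """One EM update; returns (start, trans, emit, likelihood before the update)."""
    n, m, T = len(start), len(emit[0]), len(obs)
    alpha, beta = forward(obs, start, trans, emit), backward(obs, trans, emit)
    like = sum(alpha[-1])
    # gamma[t][s]: posterior probability of being in state s at time t
    gamma = [[alpha[t][s] * beta[t][s] / like for s in range(n)] for t in range(T)]
    # xi[r][s]: expected number of r -> s transitions
    xi = [[sum(alpha[t][r] * trans[r][s] * emit[s][obs[t + 1]] * beta[t + 1][s]
               for t in range(T - 1)) / like
           for s in range(n)] for r in range(n)]
    new_trans = [[xi[r][s] / sum(xi[r]) for s in range(n)] for r in range(n)]
    new_emit = [[sum(gamma[t][s] for t in range(T) if obs[t] == k) /
                 sum(gamma[t][s] for t in range(T))
                 for k in range(m)] for s in range(n)]
    return gamma[0], new_trans, new_emit, like

def _log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, start, trans, emit):
    """Most likely hidden-state path, computed in log space."""
    n = len(start)
    score = [_log(start[s]) + _log(emit[s][obs[0]]) for s in range(n)]
    back = []
    for o in obs[1:]:
        step, ptr = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: score[r] + _log(trans[r][s]))
            ptr.append(best)
            step.append(score[best] + _log(trans[best][s]) + _log(emit[s][o]))
        score = step
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy data (hypothetical): one short read, bases encoded as indices into BASES.
BASES = "ACGT"
read = [BASES.index(b) for b in "ACGTTGCAAC"]
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.4, 0.1, 0.1, 0.4], [0.1, 0.4, 0.4, 0.1]]

likelihoods = []
for _ in range(5):  # a few training iterations
    start, trans, emit, like = baum_welch_step(read, start, trans, emit)
    likelihoods.append(like)
path = viterbi(read, start, trans, emit)  # decode with the trained parameters
```

In this sketch, training re-estimates the parameters from posterior state probabilities, and decoding then picks the single most likely state path under the trained parameters. A polishing tool would additionally need to map the decoded path back to a corrected sequence, which is not shown here.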