Fabian Gärtner (1,2), Lydia Müller (1,3,4), Peter F Stadler (1,2,3,5,6,7,8,9)
1. Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany.
2. Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany.
3. Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany.
4. Natural Language Processing Group, Department of Computer Science, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany.
5. Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany.
6. Fraunhofer Institute for Cell Therapy and Immunology, Perlickstraße 1, 04103 Leipzig, Germany.
7. Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090 Vienna, Austria.
8. Center for Non-coding RNA in Technology and Health, Grønegårdsvej 3, 1870 Frederiksberg C, Denmark.
9. Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA.
Abstract
BACKGROUND: Superbubbles are distinctive subgraphs in directed graphs that play an important role in assembly algorithms for high-throughput sequencing (HTS) data. Their practical importance derives from the fact that they are connected to their host graph by a single entrance and a single exit vertex, which allows them to be handled independently. Efficient algorithms for the enumeration of superbubbles are therefore important for the processing of HTS data. Superbubbles can be identified within the strongly connected components of the input digraph after transforming them into directed acyclic graphs (DAGs). The algorithm by Sung et al. (IEEE ACM Trans Comput Biol Bioinform 12:770-777, 2015) achieves this task in O(m log(m)) time. The extraction of superbubbles from the transformed components was later improved by Brankovic et al. (Theor Comput Sci 609:374-383, 2016), resulting in an overall O(m + n)-time algorithm.
RESULTS: A re-analysis of the mathematical structure of superbubbles showed that the construction of auxiliary DAGs from the strongly connected components in the work of Sung et al. missed some details, which can lead to the reporting of false positive superbubbles. We propose an alternative, even simpler auxiliary graph that solves the problem and retains the linear running time for general digraphs. Furthermore, we describe a simpler, space-efficient O(m + n)-time algorithm for detecting superbubbles in DAGs that uses only simple data structures.
IMPLEMENTATION: We present a reference implementation of the algorithm that accepts many commonly used formats for the input graph and provides convenient access to the improved algorithm. It is available at https://github.com/Fabianexe/Superbubble.
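To make the "single entrance, single exit" property concrete, the following is a minimal brute-force sketch that tests whether a vertex pair (s, t) is a superbubble. It is not the linear-time algorithm described above and is not taken from the reference implementation. It assumes the commonly used definition, which the abstract does not spell out: t is reachable from s, the vertices reachable from s without passing through t coincide with the vertices that reach t without passing through s, the induced subgraph is acyclic, and no inner vertex closes a smaller such structure with s. All function names are hypothetical.

```python
from collections import defaultdict


def _reach(adj, start, stop):
    """Vertices reachable from `start`; `stop` is recorded but never expanded."""
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        if v == stop:
            continue  # do not pass through the stop vertex
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def _is_acyclic(adj, nodes):
    """Kahn's algorithm restricted to the subgraph induced by `nodes`."""
    indeg = {v: 0 for v in nodes}
    for v in nodes:
        for w in adj.get(v, ()):
            if w in nodes:
                indeg[w] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for w in adj.get(v, ()):
            if w in nodes:
                indeg[w] -= 1
                if indeg[w] == 0:
                    queue.append(w)
    return removed == len(nodes)


def _is_candidate(fwd, bwd, s, t):
    """Reachability, matching and acyclicity, without the minimality check."""
    if s == t:
        return False
    out = _reach(fwd, s, t)   # reachable from s without passing through t
    into = _reach(bwd, t, s)  # reaching t without passing through s
    return t in out and out == into and _is_acyclic(fwd, out)


def is_superbubble(edges, s, t):
    """Brute-force test of the assumed superbubble definition for (s, t)."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    if not _is_candidate(fwd, bwd, s, t):
        return False
    inner = _reach(fwd, s, t) - {s, t}
    # Minimality: no inner vertex may already close a candidate with s.
    return not any(_is_candidate(fwd, bwd, s, x) for x in inner)
```

For example, in the graph with edges 0→a, a→b, a→c, b→d, c→d, d→e, the pair (a, d) passes the test, while (0, d) fails the minimality check because (0, a) already satisfies the other three conditions. Each call costs time linear in the graph size, so testing all pairs this way is far slower than the O(m + n) enumeration the abstract refers to; the sketch is meant only to illustrate the definition.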
Keywords:
Genome assembly; Linear time algorithm; Superbubble; de Bruijn graph