Wei Zheng (1), Le Yang (1), Robert J Genco (2,3), Jean Wactawski-Wende (4), Michael Buck (5), Yijun Sun (1,3)
1. Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA.
2. Department of Oral Biology, University at Buffalo, The State University of New York, Buffalo, NY, USA.
3. Department of Microbiology and Immunology, University at Buffalo, The State University of New York, Buffalo, NY, USA.
4. Department of Epidemiology and Environmental Health, University at Buffalo, The State University of New York, Buffalo, NY, USA.
5. Department of Biochemistry, University at Buffalo, The State University of New York, Buffalo, NY, USA.
Abstract
MOTIVATION: Sequence analysis is arguably a foundation of modern biology. Classic approaches to sequence analysis are based on sequence alignment, which scales poorly to large-scale sequence data. A dozen alignment-free approaches have been developed to provide computationally efficient alternatives to alignment-based approaches. However, existing methods define sequence similarity based on various heuristics and can only provide rough approximations to alignment distances.
RESULTS: In this article, we developed a new approach, referred to as SENSE (SiamEse Neural network for Sequence Embedding), for efficient and accurate alignment-free sequence comparison. The basic idea is to use a deep neural network to learn an explicit embedding function from a small training dataset. This function projects sequences into an embedding space so that the mean squared error between alignment distances and pairwise distances defined in the embedding space is minimized. To the best of our knowledge, this is the first attempt to use deep learning for alignment-free sequence analysis. A large-scale experiment demonstrated that our method significantly outperformed state-of-the-art alignment-free methods in terms of both efficiency and accuracy.
AVAILABILITY AND IMPLEMENTATION: Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SENSE.html.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
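The objective described under RESULTS can be illustrated with a minimal sketch. This is not the SENSE implementation: the one-hot encoding, the single shared linear layer standing in for the deep network, the four toy sequences, the learning rate, and the use of normalized Hamming distance in place of true alignment distances are all assumptions made for illustration. Only the general idea comes from the abstract: both sequences of a pair pass through the same embedding function, and training minimizes the mean squared error between embedding distances and target distances.

```python
import math
import random

ALPHABET = "ACGT"

def one_hot(seq, length):
    """Flatten a DNA sequence into a fixed-length one-hot vector (zero-padded)."""
    vec = [0.0] * (length * len(ALPHABET))
    for i, base in enumerate(seq[:length]):
        vec[i * len(ALPHABET) + ALPHABET.index(base)] = 1.0
    return vec

def embed(weights, x):
    """Shared embedding function: one linear layer (assumed stand-in for a deep network)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def embedding_distance(weights, xa, xb):
    """Euclidean distance between two embeddings computed with the same weights."""
    ea, eb = embed(weights, xa), embed(weights, xb)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ea, eb)))

def mse_loss(weights, pairs):
    """Mean squared error between embedding distances and target distances."""
    return sum((embedding_distance(weights, xa, xb) - d) ** 2
               for xa, xb, d in pairs) / len(pairs)

def train(weights, pairs, lr=0.01, steps=200):
    """Plain gradient descent on the MSE loss, using the analytic gradient."""
    for _ in range(steps):
        grad = [[0.0] * len(weights[0]) for _ in weights]
        for xa, xb, d in pairs:
            delta = [a - b for a, b in zip(xa, xb)]
            diff = embed(weights, delta)  # linear layer: embed(xa) - embed(xb)
            dist = math.sqrt(sum(v * v for v in diff))
            if dist == 0.0:
                continue  # gradient of the distance is undefined at zero
            scale = 2.0 * (dist - d) / (dist * len(pairs))
            for r, dr in enumerate(diff):
                for c, dc in enumerate(delta):
                    grad[r][c] += scale * dr * dc
        weights = [[w - lr * g for w, g in zip(wr, gr)]
                   for wr, gr in zip(weights, grad)]
    return weights

def toy_distance(a, b):
    """ASSUMPTION: normalized Hamming distance as a placeholder for alignment distance."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical toy training set of equal-length sequences.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACTT", "TTGTACGG"]
pairs = [(one_hot(a, 8), one_hot(b, 8), toy_distance(a, b))
         for i, a in enumerate(seqs) for b in seqs[i + 1:]]

random.seed(0)
init = [[random.uniform(-0.1, 0.1) for _ in range(32)] for _ in range(4)]
loss_before = mse_loss(init, pairs)
trained = train(init, pairs)
loss_after = mse_loss(trained, pairs)
print(f"MSE before training: {loss_before:.4f}, after: {loss_after:.4f}")
```

Once such a function is trained, comparing two new sequences needs only two embedding evaluations and one Euclidean distance, with no alignment step, which is where the efficiency gain of an embedding-based approach would come from.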