Chia-Hua Chang (1,2), Min-Te Chou (1), Yi-Chung Wu (3), Ting-Wei Hong (1), Yun-Lung Li (1), Chia-Hsiang Yang (3), Jui-Hung Hung (1,2,4).
1. Institute of Bioinformatics and Systems Biology. 2. Institute of Biomedical Engineering. 3. Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan. 4. Department of Biological Science and Technology, National Chiao Tung University, Hsin-Chu, Taiwan.
Abstract
MOTIVATION: The Full-text index in Minute space (FM-index), derived from the Burrows-Wheeler transform (BWT), is broadly used for fast string matching in large genomes or huge sets of sequencing reads. Several graphics processing unit (GPU)-accelerated aligners based on the FM-index have been proposed recently; however, construction of the index is still handled by the central processing unit (CPU), is parallelized only at the data level (e.g. by performing blockwise suffix sorting on the GPU), or does not scale to large genomes.
RESULTS: To fulfill the need for a more practical, hardware-parallelizable indexing and matching approach, we herein propose sBWT, which is based on a BWT variant (the Schindler transform) that can be built with highly simplified, hardware-acceleration-friendly algorithms and still suffices for accurate and fast string matching in repetitive references. In our tests, the implementation achieves significant speedups in indexing and searching compared with other BWT-based tools and can be applied to a variety of domains.
AVAILABILITY AND IMPLEMENTATION: sBWT is implemented in C++ with CPU-only and GPU-accelerated versions. sBWT is open-source software and is available at http://jhhung.github.io/sBWT/
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
CONTACT: chyee@ntu.edu.tw or jhhung@nctu.edu.tw (also juihunghung@gmail.com).
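To make the BWT variant named above concrete, the following is a minimal sketch of an order-k Schindler transform. It is an illustration only, not the sBWT implementation: the function name `schindler_transform`, the use of cyclic rotations, and the comparison-based `std::stable_sort` are all assumptions chosen for brevity. The point it shows is that rotations are ordered by their first k characters only, with ties kept in original text order, so the sort depth is bounded by k rather than by the text length.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper (not part of sBWT): order-k Schindler transform of
// `text`, computed over its cyclic rotations.
// Rotations are compared on their first k characters only; rotations with
// identical k-character prefixes keep their original relative order.
std::string schindler_transform(const std::string& text, std::size_t k) {
    const std::size_t n = text.size();

    // rot[i] is the start position of the i-th rotation.
    std::vector<std::size_t> rot(n);
    std::iota(rot.begin(), rot.end(), std::size_t{0});

    // Stable sort on a bounded-depth key: at most min(k, n) characters.
    std::stable_sort(rot.begin(), rot.end(),
                     [&](std::size_t a, std::size_t b) {
                         for (std::size_t i = 0; i < k && i < n; ++i) {
                             const char ca = text[(a + i) % n];
                             const char cb = text[(b + i) % n];
                             if (ca != cb) return ca < cb;
                         }
                         return false;  // equal prefixes: keep text order
                     });

    // Output the character that precedes each sorted rotation
    // (the last column of the sorted rotation matrix).
    std::string out(n, '\0');
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = text[(rot[i] + n - 1) % n];
    }
    return out;
}
```

For the text `banana$` (with `$` sorting before the letters), `schindler_transform("banana$", 1)` gives `abnn$aa`, and any k of at least 7 gives `annb$aa`, which is the ordinary BWT of that text. A bounded k is what makes the sort amenable to simple, fixed-depth parallel sorting, at the cost of leaving rotations with equal k-prefixes in text order rather than fully sorted.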