Sijie Chen (1), Yixin Chen (1), Fengzhu Sun (2,3), Michael S Waterman (1,2,3), Xuegong Zhang (1,4).
1. Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China.
2. Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.
3. Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China.
4. School of Life Sciences, Tsinghua University, Beijing 100084, China.
Abstract
MOTIVATION: Detecting sequences that contain repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for specific types of repeat detection, but an efficient generic method that detects most types of repetitive sequences is still desirable. Inspired by the favorable properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic, D2R, that can efficiently discriminate between sequences with and without repetitive regions.
RESULTS: Using this statistic, we developed an algorithm with linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats (CRISPR) regions in bacterial genomic or metagenomic sequences. Simulation and real-data experiments show that the method works well on both assembled sequences and unassembled short reads.
AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/XuegongLab/D2R_codes under the GPL 3.0 license.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
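For readers unfamiliar with the D2 family mentioned above, the following is a minimal sketch of the classical D2 statistic only: the sum, over all k-mers, of the product of that k-mer's counts in two sequences. It is not the D2R statistic, whose definition is not given in this abstract, and the function names `kmer_counts` and `d2` are hypothetical names chosen for illustration. Comparing a sequence with itself hints at why k-mer count statistics are sensitive to repeats: repeated k-mers inflate the score.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (one pass, linear in len(seq) for fixed k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Classical D2: sum over k-mers w of count_x(w) * count_y(w).

    Illustrative only; this is NOT the D2R statistic from the paper.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Only k-mers present in both sequences contribute a non-zero product.
    return sum(n * cy[w] for w, n in cx.items() if w in cy)


if __name__ == "__main__":
    repetitive = "ACACACAC"   # tandem repeat of "AC"
    diverse = "ACGTTGCA"      # all seven 2-mers are distinct
    print(d2(repetitive, repetitive, 2))  # self-comparison of a repetitive sequence
    print(d2(diverse, diverse, 2))        # self-comparison of a non-repetitive sequence
```

In this toy case the repetitive sequence scores much higher against itself than the non-repetitive one of the same length, which is the intuition, under the stated assumptions, behind using k-mer count statistics to flag repeats.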