MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance.

Jingbo Shang, Jian Peng, Jiawei Han.

Abstract

Consecutive pattern mining, which aims to find sequential patterns that are substrings, is a special case of frequent pattern mining and has played a crucial role in many real-world applications, especially biological sequence analysis, time series analysis, and network log mining. Approximations between strings, including insertions, deletions, and substitutions, are widely used in biological sequence comparison. However, most existing string pattern mining methods consider only Hamming distance, which does not allow insertions or deletions (indels). Little attention has been paid to general approximate consecutive frequent pattern mining under edit distance, potentially because of its high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem, which identifies substring patterns under edit distance in a long query sequence. We then propose a novel linear-time algorithm that checks whether the support of a substring pattern in the query sequence is above a predefined threshold, greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem using several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a case study on a cancer genomics application demonstrate the effectiveness and efficiency of our algorithm compared with several existing methods.
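To make the decision problem in the abstract concrete, the following is a minimal sketch of a support check under edit distance. It is not the linear-time algorithm proposed in the paper, which the abstract does not describe. It uses the classical Sellers dynamic program, which runs in time proportional to the pattern length times the query length. The function names `approx_end_positions` and `support_at_least` are hypothetical, and the definition of support used here (the number of end positions in the query at which some substring within edit distance `k` of the pattern ends) is an assumption; the paper's own definition may differ, for example by counting only non-overlapping occurrences.

```python
def approx_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Return every index j such that some substring of `text` ending at
    text[j] is within edit distance k of `pattern` (Sellers' DP)."""
    m = len(pattern)
    # Column for the empty text prefix: pattern[:i] needs i deletions.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        # cur[0] = 0 because an occurrence may start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (pattern[i - 1] != c),  # match or substitution
                prev[i] + 1,                          # extra character in text
                cur[i - 1] + 1,                       # pattern character missing
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


def support_at_least(pattern: str, text: str, k: int, min_support: int) -> bool:
    """Decide whether the (assumed) support of `pattern` reaches `min_support`."""
    return len(approx_end_positions(pattern, text, k)) >= min_support


if __name__ == "__main__":
    print(approx_end_positions("AB", "ABxAB", 0))
    print(support_at_least("AB", "ABxAB", 0, 2))
```

With `k > 0`, a single approximate occurrence can produce several adjacent end positions, so this count tends to overstate support relative to definitions that require distinct or non-overlapping occurrences.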

Year:  2016        PMID: 28174677      PMCID: PMC5292242          DOI: 10.1137/1.9781611974348.63

Source DB:  PubMed          Journal:  Proc SIAM Int Conf Data Min


