
All hits all the time: parameter-free calculation of spaced seed sensitivity.

Denise Y F Mak, Gary Benson.

Abstract

MOTIVATION: Standard search techniques for DNA repeats start by identifying small matching words, or seeds, that may inhabit larger repeats. Recent innovations in seed structure include spaced seeds and indel seeds, which are more sensitive than contiguous seeds. Evaluating seed sensitivity requires (i) specifying a homology model for alignments and (ii) assigning probabilities to those alignments. Optimal seed selection is resource-intensive because all alternative seeds must be tested. Current methods require that the model and its probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun.
RESULTS: We show how to eliminate the need for prior parameter specification by exploiting a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe an efficient preprocessing step, which is computed once per seed. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to identify optimal seeds and the boundaries within probability space where they are optimal.
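
The central observation can be illustrated with a small sketch. It is not the authors' preprocessing algorithm: it enumerates every alignment by brute force, which is only feasible for short lengths, and it assumes the simplest homology model, a Bernoulli model in which each alignment position is a match with a single probability p. The function names (`seed_hits`, `hit_counts`, `sensitivity`) are hypothetical and chosen for illustration. The hit counts are computed once per seed; the sensitivity for any p is then a cheap polynomial evaluation over those counts.

```python
from itertools import product


def seed_hits(seed: str, alignment: tuple) -> bool:
    """Return True if the seed hits the alignment at some offset.

    seed: string over {'1', '0'}; '1' marks a position that must match.
    alignment: tuple of 0/1 values; 1 is a match, 0 is a mismatch.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    return any(
        all(alignment[offset + i] for i in required)
        for offset in range(len(alignment) - len(seed) + 1)
    )


def hit_counts(seed: str, length: int) -> list:
    """Parameter-free step (brute force, assumed for illustration only).

    counts[k] is the number of alignments of the given length that
    contain exactly k matches and are hit by the seed.
    """
    counts = [0] * (length + 1)
    for alignment in product((0, 1), repeat=length):
        if seed_hits(seed, alignment):
            counts[sum(alignment)] += 1
    return counts


def sensitivity(counts: list, p: float) -> float:
    """Weight the stored hit counts under a Bernoulli model with match
    probability p. No alignment is re-enumerated when p changes."""
    length = len(counts) - 1
    return sum(c * p**k * (1 - p) ** (length - k) for k, c in enumerate(counts))


# Hit counts are computed once per seed ...
counts_contiguous = hit_counts("11", 3)  # [0, 0, 2, 1]
counts_spaced = hit_counts("101", 3)     # [0, 0, 1, 1]

# ... and reused for every probability parameter of interest.
for p in (0.5, 0.7, 0.9):
    print(p, sensitivity(counts_contiguous, p), sensitivity(counts_spaced, p))
```

Because the counts do not depend on p, comparing seeds across a range of parameters reduces to comparing polynomials. This is the sense in which knowing all the hits makes the search for optimal seeds fast once the per-seed preprocessing is done.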

Year:  2008        PMID: 19095701     DOI: 10.1093/bioinformatics/btn643

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  6 in total

1.  Global, highly specific and fast filtering of alignment seeds.

Authors:  Matthis Ebel; Giovanna Migliorelli; Mario Stanke
Journal:  BMC Bioinformatics       Date:  2022-06-10       Impact factor: 3.307

2.  Cgaln: fast and space-efficient whole-genome alignment.

Authors:  Ryuichiro Nakato; Osamu Gotoh
Journal:  BMC Bioinformatics       Date:  2010-04-30       Impact factor: 3.169

3.  VNTRseek - a computational tool to detect tandem repeat variants in high-throughput sequencing data.

Authors:  Yevgeniy Gelfand; Yozen Hernandez; Joshua Loving; Gary Benson
Journal:  Nucleic Acids Res       Date:  2014-07-23       Impact factor: 16.971

4.  Hit integration for identifying optimal spaced seeds.

Authors:  Won-Hyoung Chung; Seong-Bae Park
Journal:  BMC Bioinformatics       Date:  2010-01-18       Impact factor: 3.169

5.  Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds.

Authors:  Laurent Noé
Journal:  Algorithms Mol Biol       Date:  2017-02-14       Impact factor: 1.405

6.  SANS: high-throughput retrieval of protein sequences allowing 50% mismatches.

Authors:  J Patrik Koskinen; Liisa Holm
Journal:  Bioinformatics       Date:  2012-09-15       Impact factor: 6.937
