Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search.

Yong Kong

Abstract

The Goulden-Jackson cluster method is a powerful method to calculate the probability of occurrences of a pattern or set of patterns in a sequence. If the patterns contain wildcard characters, however, the size of the connector matrix grows exponentially with the number of wildcards. Here we show that the average correlation c(z) is a good predictor of the hitting probability q(n), and that the generalized correlation function ĉ(z) can be used to approximate c(z) efficiently. We apply the method to the problem of optimal multiple spaced seed selection for homology search. We reexamine the concept of optimal sensitivity of spaced seeds and show that it is better to select optimal seeds based on some average properties, such as c(1), which is the expectation of the first hitting length. Higher order approximations can also be constructed easily. Tests on arbitrary large genomic data with multiple seeds show that the optimal multiple seeds selected by the methods are indeed more sensitive. The methods provide a theoretical background on which various empirical observations can be unified and further heuristic search methods can be developed.
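
To make the quantity q(n) concrete, the following is a minimal sketch of how the hitting probability of a single spaced seed could be computed exactly for small seeds. It is not the Goulden-Jackson cluster method or the correlation-function approximation described in the abstract; it is a plain dynamic program over the last few alignment positions, and its state space grows exponentially with the seed span, which is the kind of cost the abstract's approximation is meant to avoid. The function name `hit_probability`, the seed notation ('1' for a required match, '0' for a wildcard), and the assumption that each alignment position matches independently with probability `p` are all illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def hit_probability(seed: str, n: int, p: float) -> float:
    """Probability that `seed` hits at least once in a random alignment.

    Assumptions (illustrative, not from the paper): the alignment is a
    string of n independent positions, each a match with probability p;
    the seed is written with '1' = must match and '0' = wildcard.
    """
    span = len(seed)
    if n < span:
        return 0.0
    mask = int(seed, 2)              # required positions, oldest = highest bit
    full = (1 << span) - 1           # window of the last `span` positions
    keep = (1 << (span - 1)) - 1     # state remembers the last span-1 positions

    # alive[state] = probability of reaching `state` with no hit so far
    alive = {0: 1.0}
    for i in range(n):
        nxt = defaultdict(float)
        for state, prob in alive.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                window = ((state << 1) | bit) & full
                # A hit is only possible once a full window has been read.
                if i + 1 >= span and (window & mask) == mask:
                    continue         # hit: remove this mass from the no-hit set
                nxt[window & keep] += prob * pb
        alive = nxt
    return 1.0 - sum(alive.values())


if __name__ == "__main__":
    # Compare a contiguous and a spaced seed of equal weight (values not asserted).
    for s in ("111", "1101"):
        print(s, hit_probability(s, 20, 0.7))
```

Under the same assumptions, the expected first hitting length could be estimated by summing 1 - q(n) over n, truncated once the terms become negligible.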

Year:  2007        PMID: 17456017     DOI: 10.1089/cmb.2006.0008

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  1 in total

1.  Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds.

Authors:  Laurent Noé
Journal:  Algorithms Mol Biol       Date:  2017-02-14       Impact factor: 1.405
