
Robust k-mer frequency estimation using gapped k-mers.

Mahmoud Ghandi, Morteza Mohammad-Noori, Michael A Beer.

Abstract

Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
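The sparsity problem and the gapped alternative described in the abstract can be illustrated with a minimal Python sketch. It only counts ungapped and gapped words in a toy sequence set; it does not implement the minimum norm estimator derived in the paper. The function names, the `-` gap symbol, and the parameters `l` (window length) and `k` (number of informative positions) are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seqs, k):
    """Count every contiguous (ungapped) k-mer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def count_gapped_kmers(seqs, l, k):
    """Count gapped k-mers: windows of length l in which k positions are
    kept and the remaining l - k positions are masked with '-'.

    Assumption for this sketch: every choice of k positions out of l is
    used, including choices that mask the ends of the window.
    """
    masks = [set(m) for m in combinations(range(l), k)]
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - l + 1):
            window = s[i:i + l]
            for keep in masks:
                gapped = "".join(
                    c if j in keep else "-" for j, c in enumerate(window)
                )
                counts[gapped] += 1
    return counts


def fraction_observed(counts, n_possible):
    """Fraction of all possible words that were seen at least once."""
    return len(counts) / n_possible


if __name__ == "__main__":
    # Toy input; real use would read binding-site sequences from a file.
    seqs = ["ACGTACGTTGCAACGT", "TTGCAACGTACGGTCA"]
    for k in (2, 4, 6):
        counts = count_kmers(seqs, k)
        # The observed fraction of the 4**k possible k-mers drops as k grows.
        print(k, len(counts), fraction_observed(counts, 4 ** k))
```

Each window of length `l` contributes to several gapped words at once, so gapped counts accumulate faster than counts of full-length words from the same data. This is the intuition behind using gapped counts to stabilize frequency estimates.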

Year:  2013        PMID: 23861010      PMCID: PMC3895138          DOI: 10.1007/s00285-013-0705-3

Source DB:  PubMed          Journal:  J Math Biol        ISSN: 0303-6812            Impact factor:   2.259


