Efficient detection of unusual words.

A Apostolico, M E Bock, S Lonardi, X Xu.

Abstract

Words that are, by some measure, over- or underrepresented in the context of larger sequences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detection, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. Here we take the global approach of annotating the suffix tree of a sequence with some such values and scores, with the aim of using it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to undergo more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree annotations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of significance. This result is achieved by an ad hoc embedding in statistical expressions of the combinatorial structure of the periods of a string. Specifically, we show that the expected value and variance of all substrings in a given sequence of n symbols can be computed and stored in (optimal) O(n^2) overall worst-case, O(n log n) expected time and space. The O(n^2) time bound constitutes an improvement by a linear factor over direct methods. Moreover, we show that under several accepted measures of deviation from expected frequency, the candidate over- or underrepresented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the Θ(n^2) possible substrings. This surprising fact is a consequence of properties of the form that if a word that ends in the middle of an arc is, say, overrepresented, then its extension to the nearest node of the tree is even more so. Based on this, we design global detectors of favored and unfavored words for our probabilistic framework in overall linear time and space, discuss related software implementations and display the results of preliminary experiments.
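
As a rough illustration of the quantities named in the abstract, the following Python sketch computes the mean and variance of a word's occurrence count under the independent-symbol model, using the word's periods to handle overlapping occurrences, and then extends a word until its occurrences stop agreeing on the next symbol. It is not the paper's suffix-tree algorithm: every function name here is hypothetical, the counting is naive rather than linear-time, and `extend_to_node` is only a brute-force stand-in for walking to the end of a suffix-tree arc.

```python
from math import prod, sqrt

def word_prob(w, probs):
    """Probability that an i.i.d. source emits exactly the word w."""
    return prod(probs[c] for c in w)

def periods(w):
    """All d in 1..len(w)-1 such that w shifted by d matches itself."""
    return [d for d in range(1, len(w)) if w[d:] == w[:len(w) - d]]

def expected_count(w, n, probs):
    """Mean number of occurrences of w in a random text of length n."""
    return max(n - len(w) + 1, 0) * word_prob(w, probs)

def variance_count(w, n, probs):
    """Variance of the occurrence count; overlaps matter only at periods of w."""
    m, slots = len(w), n - len(w) + 1
    if slots <= 0:
        return 0.0
    p = word_prob(w, probs)
    var = slots * p * (1 - p)
    per = set(periods(w))
    for d in range(1, min(m, slots)):
        # Two occurrences at distance d < m can coexist only if d is a period;
        # together they spell w followed by the last d symbols of w.
        joint = p * word_prob(w[m - d:], probs) if d in per else 0.0
        var += 2 * (slots - d) * (joint - p * p)
    return var

def count(text, w):
    """Observed number of (possibly overlapping) occurrences of w in text."""
    return sum(text.startswith(w, i) for i in range(len(text) - len(w) + 1))

def z_score(text, w, probs):
    """(observed - expected) / standard deviation; assumes nonzero variance."""
    n = len(text)
    return (count(text, w) - expected_count(w, n, probs)) / sqrt(variance_count(w, n, probs))

def extend_to_node(text, w):
    """Longest extension of w shared by every occurrence of w in text."""
    starts = [i for i in range(len(text) - len(w) + 1) if text.startswith(w, i)]
    if not starts:
        return w
    while True:
        ends = [i + len(w) for i in starts]
        if any(e >= len(text) for e in ends):
            return w  # an occurrence reaches the end of the text
        following = {text[e] for e in ends}
        if len(following) != 1:
            return w  # occurrences disagree on the next symbol
        w += following.pop()
```

In the text `abracadabra`, for example, `br` and its extension `bra` occur equally often, while the expected count of `bra` is smaller whenever the symbol probabilities are below 1, so the difference between observed and expected count can only grow along the extension. That is the kind of monotonicity the abstract appeals to when restricting candidates to words ending at tree nodes.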

Year:  2000        PMID: 10890389     DOI: 10.1089/10665270050081397

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  8 in total

1.  On avoided words, absent words, and their application to biological sequence analysis.

Authors:  Yannis Almirantis; Panagiotis Charalampopoulos; Jia Gao; Costas S Iliopoulos; Manal Mohamed; Solon P Pissis; Dimitris Polychronopoulos
Journal:  Algorithms Mol Biol       Date:  2017-03-14       Impact factor: 1.405

2.  Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers.

Authors:  Dmitri A Papatsenko; Vsevolod J Makeev; Alex P Lifanov; Mireille Régnier; Anna G Nazina; Claude Desplan
Journal:  Genome Res       Date:  2002-03       Impact factor: 9.043

3.  WordSeeker: concurrent bioinformatics software for discovering genome-wide patterns and word-based genomic signatures.

Authors:  Jens Lichtenberg; Kyle Kurz; Xiaoyu Liang; Rami Al-ouran; Lev Neiman; Lee J Nau; Joshua D Welch; Edwin Jacox; Thomas Bitterman; Klaus Ecker; Laura Elnitski; Frank Drews; Stephen Sauchi Lee; Lonnie R Welch
Journal:  BMC Bioinformatics       Date:  2010-12-21       Impact factor: 3.169

4.  Detecting seeded motifs in DNA sequences.

Authors:  Cinzia Pizzi; Stefania Bortoluzzi; Andrea Bisognin; Alessandro Coppe; Gian Antonio Danieli
Journal:  Nucleic Acids Res       Date:  2005-09-01       Impact factor: 16.971

5.  Computational identification of transcriptional regulatory elements in DNA sequence. (Review)

Authors:  Debraj GuhaThakurta
Journal:  Nucleic Acids Res       Date:  2006-07-19       Impact factor: 16.971

6.  An algorithm for identifying novel targets of transcription factor families: application to hypoxia-inducible factor 1 targets.

Authors:  Yue Jiang; Bojan Cukic; Donald A Adjeroh; Heath D Skinner; Jie Lin; Qingxi J Shen; Bing-Hua Jiang
Journal:  Cancer Inform       Date:  2009-03-04

7.  The word landscape of the non-coding segments of the Arabidopsis thaliana genome.

Authors:  Jens Lichtenberg; Alper Yilmaz; Joshua D Welch; Kyle Kurz; Xiaoyu Liang; Frank Drews; Klaus Ecker; Stephen S Lee; Matt Geisler; Erich Grotewold; Lonnie R Welch
Journal:  BMC Genomics       Date:  2009-10-08       Impact factor: 3.969

8.  A survey of DNA motif finding algorithms. (Review)

Authors:  Modan K Das; Ho-Kwok Dai
Journal:  BMC Bioinformatics       Date:  2007-11-01       Impact factor: 3.169
