
Discovering simple regions in biological sequences associated with scoring schemes.

Honghui Wan, Lugang Li, Scott Federhen, John C Wootton.

Abstract

Let A denote an alphabet consisting of n types of letters. Given a sequence S of length L on A containing $v_i$ letters of type i, we propose a new complexity function, called the reciprocal complexity of S, to describe the compositional properties and combinatorial structure of S: $C(S) = \prod_{i=1}^{n} \left( \frac{L}{n v_i} \right)^{v_i}$. Based on this complexity measure, an efficient algorithm is developed for classifying and analyzing simple segments of protein and nucleotide sequence databases associated with scoring schemes. The running time of the algorithm is nearly proportional to the sequence length. The program DSR, which implements the algorithm, was written in C++ and takes two parameters (window length and cutoff value) and a scoring matrix. Some examples on protein sequences illustrate how the method can be used to find such regions. The first application of DSR is the masking of simple sequences for database searches. Queries masked by DSR returned a manageable set of hits below the E-value cutoff score, which contained all true positive homologues. The second application is the study of simple regions detected by DSR that correspond to known structural features of proteins. An extensive computational analysis has been made of protein sequences with known, physicochemically defined nonglobular segments. For the SWISS-PROT amino acid sequence database (Release 40.2 of 02-Nov-2001), we determine that the best settings for automatic segmentation of amino acid sequences into nonglobular and globular regions by DSR are window length k = 35, cutoff value b = 0.46, and the BLOSUM 62.5 matrix. The average "agreement accuracy (sensitivity)" of DSR segmentation for the SWISS-PROT database is 97.3%.
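
The reciprocal complexity formula can be illustrated with a minimal Python sketch. This is not the DSR program: it covers only the composition-based formula C(S) and ignores the scoring matrix, whose role the abstract does not specify. The function names, the log-space evaluation, and the naive window scan (which recomputes each window from scratch rather than running in near-linear time) are assumptions made for illustration.

```python
import math
from collections import Counter


def log_reciprocal_complexity(seq, n):
    """Natural log of C(S) = prod_i (L / (n * v_i)) ** v_i.

    seq is the sequence S, n is the alphabet size. Working in log space
    avoids floating-point underflow for long sequences.
    """
    L = len(seq)
    counts = Counter(seq)
    # Letters with v_i = 0 contribute a factor of 1 (x ** 0), so they are skipped.
    return sum(v * math.log(L / (n * v)) for v in counts.values())


def reciprocal_complexity(seq, n):
    """C(S) itself, obtained by exponentiating the log-space value."""
    return math.exp(log_reciprocal_complexity(seq, n))


def window_complexities(seq, k, n):
    """C for every window of length k (hypothetical helper, naive recomputation)."""
    return [reciprocal_complexity(seq[i:i + k], n) for i in range(len(seq) - k + 1)]
```

Under this formula a perfectly uniform composition (every $v_i = L/n$) gives C(S) = 1, while a homopolymer of length L gives $n^{-L}$, so lower values indicate simpler segments. How DSR compares window values against the cutoff b is not stated in the abstract and is left out of the sketch.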


Year:  2003        PMID: 12804090     DOI: 10.1089/106652703321825955

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  4 in total

1.  Complexity: an internet resource for analysis of DNA sequence complexity.

Authors:  Y L Orlov; V N Potapov
Journal:  Nucleic Acids Res       Date:  2004-07-01       Impact factor: 16.971

2.  ProtRepeatsDB: a database of amino acid repeats in genomes.

Authors:  Mridul K Kalita; Gowthaman Ramasamy; Sekhar Duraisamy; Virander S Chauhan; Dinesh Gupta
Journal:  BMC Bioinformatics       Date:  2006-07-07       Impact factor: 3.169

3.  Understanding and identifying amino acid repeats.

Authors:  Hong Luo; Harm Nijveen
Journal:  Brief Bioinform       Date:  2014-07       Impact factor: 11.622

4.  Novel transglutaminase-like peptidase and C2 domains elucidate the structure, biogenesis and evolution of the ciliary compartment.

Authors:  Dapeng Zhang; L Aravind
Journal:  Cell Cycle       Date:  2012-09-14       Impact factor: 4.534
