The relationship between n-gram patterns and protein secondary structure.

John K Vries, Xiong Liu, Ivet Bahar.

Abstract

An n-gram pattern (NP{n,m}) in a protein sequence is a set of n residues and m wildcards in a window of size n+m. Each window of n+m amino acids is associated with a collection of NP{n,m} patterns based on the combinatorics of n+m objects taken m at a time. NP{n,m} patterns that are shared between sequences reflect evolutionary relationships. Recently the authors developed an alignment-independent protein classification algorithm based on shared NP{4,2} patterns that compared favorably to PSI-BLAST. Theoretically, NP{4,2} patterns should also reflect secondary structure propensity since they contain all possible n-grams for 1 ≤ n ≤ 4 and a window of 6 residues is wide enough to capture periodicities in the 2 ≤ n ≤ 5 range. This sparked interest in differentiating the information content in NP{4,2} patterns related to evolution from the content related to local propensity. The probability of alpha-, beta-, and coil components was determined for every NP{4,2} pattern over all the chains in the Protein Data Bank (PDB). An algorithm exclusively based on the Z-values of these distributions was developed, which accurately predicted 71-76% of alpha-helical segments and 62-67% of beta-sheets in rigorous jackknife tests. This provided evidence for the strong correlation between NP{4,2} patterns and secondary structure. By grouping PDB chains into subsets with increasing levels of sequence identity, it was also possible to separate the evolutionary and local propensity contributions to the classification process. The results showed that information derived from evolutionary relationships was more important for beta-sheet prediction than alpha-helix prediction. © 2007 Wiley-Liss, Inc.
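
The pattern definition in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `*` wildcard symbol, and the choice to allow wildcards at any position in the window (including the ends) are assumptions made for illustration, read from "n+m objects taken m at a time". Under that reading, each 6-residue window yields C(6, 2) = 15 NP{4,2} patterns.

```python
from itertools import combinations

WILDCARD = "*"  # hypothetical wildcard symbol; the abstract does not name one


def window_patterns(window, m):
    """Return every pattern obtained by masking m positions of one window."""
    patterns = []
    for masked in combinations(range(len(window)), m):
        masked = set(masked)
        patterns.append(
            "".join(WILDCARD if i in masked else aa for i, aa in enumerate(window))
        )
    return patterns


def sequence_patterns(sequence, n=4, m=2):
    """Slide a window of n+m residues along a sequence and collect its patterns."""
    width = n + m
    found = set()
    for start in range(len(sequence) - width + 1):
        found.update(window_patterns(sequence[start:start + width], m))
    return found


patterns = window_patterns("ACDEFG", 2)
print(len(patterns))  # 15, i.e. C(6, 2)
print(patterns[0])    # **DEFG
```

Under the same assumptions, the patterns shared between two sequences would be the set intersection `sequence_patterns(a) & sequence_patterns(b)`.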

Year:  2007        PMID: 17523186     DOI: 10.1002/prot.21480

Source DB:  PubMed          Journal:  Proteins        ISSN: 0887-3585


  7 in total

1.  A frequency-based linguistic approach to protein decoding and design: Simple concepts, diverse applications, and the SCS Package. (Review)

Authors:  Kenta Motomura; Morikazu Nakamura; Joji M Otaki
Journal:  Comput Struct Biotechnol J       Date:  2013-03-29       Impact factor: 7.271

2.  Quantiprot - a Python package for quantitative analysis of protein sequences.

Authors:  Bogumił M Konopka; Marta Marciniak; Witold Dyrka
Journal:  BMC Bioinformatics       Date:  2017-07-17       Impact factor: 3.169

3.  Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank.

Authors:  Ryohei Kondo; Kota Kasahara; Takuya Takahashi
Journal:  Biophys Physicobiol       Date:  2022-02-08

4.  GAIA: a gram-based interaction analysis tool--an approach for identifying interacting domains in yeast.

Authors:  Kelvin X Zhang; B F Francis Ouellette
Journal:  BMC Bioinformatics       Date:  2009-01-30       Impact factor: 3.169

5.  Word decoding of protein amino acid sequences with availability analysis: a linguistic approach.

Authors:  Kenta Motomura; Tomohiro Fujita; Motosuke Tsutsumi; Satsuki Kikuzato; Morikazu Nakamura; Joji M Otaki
Journal:  PLoS One       Date:  2012-11-21       Impact factor: 3.240

6.  Subfamily specific conservation profiles for proteins based on n-gram patterns.

Authors:  John K Vries; Xiong Liu
Journal:  BMC Bioinformatics       Date:  2008-01-30       Impact factor: 3.169

7.  Sequence and structure based models of HIV-1 protease and reverse transcriptase drug resistance.

Authors:  Majid Masso; Iosif I Vaisman
Journal:  BMC Genomics       Date:  2013-10-01       Impact factor: 3.969
