Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics.

A C May

Abstract

Hierarchical classification is probably the most popular approach to grouping related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree, which shows a nested sequence of groups, may not be the most suitable representation of the data. Another is that visual inspection is the most common method of deciding the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled by the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach that clusters related proteins without the need for an a priori threshold and that, through its use of dynamic programming, is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences, while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme is put forward that combines the information in all the multiply aligned positions of an alignment in a novel way. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
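
The abstract does not spell out the dynamic programme, so the following is only an illustrative sketch of the general idea: partitioning one weight per sequence into k groups, optimally for every k at once. It assumes the objective is the total within-group sum of squared deviations and that groups are contiguous runs of the sorted weights (the classic exact one-dimensional clustering recurrence). The function name optimal_partitions and both of these choices are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def optimal_partitions(weights: Sequence[float], max_groups: int) -> List[List[List[float]]]:
    """Return, for k = 1..max_groups, the split of the sorted weights into k
    contiguous groups that minimizes the within-group sum of squares.

    The objective is an assumed stand-in; the abstract does not define one.
    """
    xs = sorted(weights)
    n = len(xs)
    if not 1 <= max_groups <= n:
        raise ValueError("max_groups must be between 1 and len(weights)")

    # Prefix sums let us evaluate the cost of any group xs[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i: int, j: int) -> float:
        """Sum of squared deviations from the mean for xs[i:j] (j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting xs[:j] into exactly k groups.
    # cut[k][j]: start index of the last group in that optimal split.
    best = [[inf] * (n + 1) for _ in range(max_groups + 1)]
    cut = [[0] * (n + 1) for _ in range(max_groups + 1)]
    best[0][0] = 0.0
    for k in range(1, max_groups + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    cut[k][j] = i

    # Trace back the optimal split for every k from the same tables.
    results = []
    for k in range(1, max_groups + 1):
        groups, j = [], n
        for level in range(k, 0, -1):
            i = cut[level][j]
            groups.append(xs[i:j])
            j = i
        groups.reverse()
        results.append(groups)
    return results


if __name__ == "__main__":
    # Hypothetical sequence weights, for illustration only.
    for groups in optimal_partitions([1.0, 1.1, 0.9, 5.0, 5.2, 9.0], 3):
        print(groups)
```

Because a single pass fills the table for every k up to max_groups, the optimal partition at each level of granularity comes from one computation, and no similarity threshold has to be chosen in advance. Which groups count as 'core' and which as 'periphery' depends on the weighting scheme and is not modelled here.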

Year:  2001        PMID: 11391012     DOI: 10.1093/protein/14.4.209

Source DB:  PubMed          Journal:  Protein Eng        ISSN: 0269-2139


  6 in total

1.  Definition of the tempo of sequence diversity across an alignment and automatic identification of sequence motifs: Application to protein homologous families and superfamilies.

Authors:  Alex C W May
Journal:  Protein Sci       Date:  2002-12       Impact factor: 6.725

2.  The relative inefficiency of sequence weights approaches in determining a nucleotide position weight matrix.

Authors:  Lee A Newberg; Lee Ann McCue; Charles E Lawrence
Journal:  Stat Appl Genet Mol Biol       Date:  2005-06-01

3.  Simplifying complex sequence information: a PCP-consensus protein binds antibodies against all four Dengue serotypes.

Authors:  David M Bowen; Jessica A Lewis; Wenzhe Lu; Catherine H Schein
Journal:  Vaccine       Date:  2012-07-31       Impact factor: 3.641

4.  Prediction of beta-barrel membrane proteins by searching for restricted domains.

Authors:  Oliver Mirus; Enrico Schleiff
Journal:  BMC Bioinformatics       Date:  2005-10-14       Impact factor: 3.169

5.  A functional hierarchical organization of the protein sequence space.

Authors:  Noam Kaplan; Moriah Friedlich; Menachem Fromer; Michal Linial
Journal:  BMC Bioinformatics       Date:  2004-12-14       Impact factor: 3.169

6.  Physicochemical property consensus sequences for functional analysis, design of multivalent antigens and targeted antivirals.

Authors:  Catherine H Schein; David M Bowen; Jessica A Lewis; Kyung Choi; Aniko Paul; Gerbrand J van der Heden van Noort; Wenzhe Lu; Dmitri V Filippov
Journal:  BMC Bioinformatics       Date:  2012-08-24       Impact factor: 3.169
