Global self-organization of all known protein sequences reveals inherent biological signatures.

M Linial, N Linial, N Tishby, G Yona.

Abstract

A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acid residues, and a dynamic programming distance is calculated between each pair of segments. This space of segments is first embedded into Euclidean space. The embedding algorithm that we apply maps any finite metric space into Euclidean space so that (1) the dimension of the host space is small and (2) the metric distortion is small. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. We monitor the validity of our clustering by randomly splitting the data into two parts and running a hierarchical clustering algorithm independently on each part. At every level of the hierarchy we cross-validate the clusters in one part against the clusters in the other. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most up-to-date classifications based on functional and structural data about proteins. Some of the known families fall into well-separated clusters. Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like and others are correctly identified automatically, and relations between protein families are revealed by examining the splits along the tree. This clustering leads to a novel representation of protein families, from which the functional biological kinship of protein families can be deduced, as demonstrated for the transporter family. Finally, we introduce a new concise representation for complete proteins that is very useful for presenting multiple alignments and for searching for close relatives in the database. The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items.
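The first steps of the pipeline described in the abstract (segmentation, pairwise dynamic programming distances, embedding into a low-dimensional Euclidean space) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance function or the embedding algorithm, so a unit-cost edit distance stands in for the dynamic programming distance, and a simple Fréchet-style embedding (each coordinate is the distance to a random reference subset) stands in for the embedding step. All function names, the handling of trailing short segments, and the toy sequences are hypothetical.

```python
import random

SEGMENT_LENGTH = 50  # segment size stated in the abstract


def split_into_segments(sequence, length=SEGMENT_LENGTH):
    """Cut a sequence into consecutive, non-overlapping windows.

    Assumption: a trailing window shorter than `length` is kept.
    """
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def edit_distance(a, b):
    """Unit-cost edit distance by dynamic programming.

    Stand-in for the (unspecified) dynamic programming distance.
    """
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def distance_matrix(segments):
    """Symmetric matrix of pairwise distances between segments."""
    n = len(segments)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = edit_distance(segments[i], segments[j])
    return matrix


def frechet_embedding(matrix, dimensions, seed=0):
    """Map each point to its distances from random reference subsets.

    Coordinate k of point i is the smallest distance from i to any
    member of the k-th random subset. Assumed stand-in embedding.
    """
    rng = random.Random(seed)
    n = len(matrix)
    subsets = [rng.sample(range(n), rng.randint(1, n)) for _ in range(dimensions)]
    return [[min(matrix[i][j] for j in subset) for subset in subsets]
            for i in range(n)]


# Toy usage with made-up sequences (not real proteins).
toy_sequences = ["ACDEFGHIKL" * 12, "ACDEFGHIKV" * 7]
segments = [s for seq in toy_sequences for s in split_into_segments(seq)]
points = frechet_embedding(distance_matrix(segments), dimensions=4)
```

Because edit distance satisfies the triangle inequality, no single coordinate of this embedding can separate two segments by more than their original distance. Clustering and the split-half cross-validation described in the abstract would then operate on `points` using ordinary Euclidean distances.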

Year: 1997    PMID: 9159489    DOI: 10.1006/jmbi.1997.0948

Source DB: PubMed    Journal: J Mol Biol    ISSN: 0022-2836    Impact factor: 5.469


  6 in total

1.  Similar cases retrieval from the database of laboratory test results.

Authors:  Zhenjun Yang; Yasushi Matsumura; Shigeki Kuwata; Hideo Kusuoka; Hiroshi Takeda
Journal:  J Med Syst       Date:  2003-06       Impact factor: 4.460

2.  Data clustering in life sciences. (Review)

Authors:  Ying Zhao; George Karypis
Journal:  Mol Biotechnol       Date:  2005-09       Impact factor: 2.695

3.  A limited universe of membrane protein families and folds.

Authors:  Amit Oberai; Yungok Ihm; Sanguk Kim; James U Bowie
Journal:  Protein Sci       Date:  2006-07       Impact factor: 6.725

4.  Entropy-scaling search of massive biological data.

Authors:  Y William Yu; Noah M Daniels; David Christian Danko; Bonnie Berger
Journal:  Cell Syst       Date:  2015-08-26       Impact factor: 10.304

5.  Sequence embedding for fast construction of guide trees for multiple sequence alignment.

Authors:  Gordon Blackshields; Fabian Sievers; Weifeng Shi; Andreas Wilm; Desmond G Higgins
Journal:  Algorithms Mol Biol       Date:  2010-05-14       Impact factor: 1.405

6.  Geometric aspects of biological sequence comparison.

Authors:  Aleksandar Stojmirović; Yi-Kuo Yu
Journal:  J Comput Biol       Date:  2009-04       Impact factor: 1.479

