Warning: Undefined array key "mm" in /www/wwwroot/www.ai-bt.com/si.php on line 10 Deprecated: trim(): Passing null to parameter #1 ($string) of type string is deprecated in /www/wwwroot/www.ai-bt.com/si.php on line 10 Global self-organization of all known protein sequences reveals inherent biological signatures.

Literature DB >> 9159489

Global self-organization of all known protein sequences reveals inherent biological signatures.

Abstract

A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acid residues and a dynamic programming distance is calculated between each pair of segments. This space of segments is initially embedded into Euclidean space. The algorithm that we apply embeds every finite metric space into Euclidean space so that (1) the dimension of the host space is small, (2) the metric distortion is small. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. We monitor the validity of our clustering by randomly splitting the data into two parts and performing an hierarchical clustering algorithm independently on each part. At every level of the hierarchy we cross-validate the clusters in one part with the clusters in the other. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most updated classifications based on functional and structural data about proteins. Some of the known families clustered into well distinct clusters. Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like and others are automatically correctly identified, and relations between protein families are revealed by examining the splits along the tree. This clustering leads to a novel representation of protein families, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporter family. Finally, we introduce a new concise representation for complete proteins that is very useful in presenting multiple alignments, and in searching for close relatives in the database. The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items.

Mesh：

Substances：

Year: 1997 PMID： 9159489 DOI： 10.1006/jmbi.1997.0948

Source DB: PubMed Journal: J Mol Biol ISSN： 0022-2836 Impact factor: 5.469

Keyword Cloud
Cited

6 in total

1. Similar cases retrieval from the database of laboratory test results.

Authors: Zhenjun Yang; Yasushi Matsumura; Shigeki Kuwata; Hideo Kusuoka; Hiroshi Takeda
Journal: J Med Syst Date: 2003-06 Impact factor: 4.460

Review 2. Data clustering in life sciences.

Authors: Ying Zhao; George Karypis
Journal: Mol Biotechnol Date: 2005-09 Impact factor: 2.695

Global self-organization of all known protein sequences reveals inherent biological signatures.

1. Similar cases retrieval from the database of laboratory test results.

Review 2. Data clustering in life sciences.

3. A limited universe of membrane protein families and folds.

4. Entropy-scaling search of massive biological data.

5. Sequence embedding for fast construction of guide trees for multiple sequence alignment.

6. Geometric aspects of biological sequence comparison.