Sondes Fayech, Nadia Essoussi, Mohamed Limam.
Abstract
BACKGROUND: Genome-sequencing projects are producing an enormous number of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families, known as clustering, has become one of the principal research objectives in structural and functional genomics. Computer programs that automatically and accurately classify sequences into families have become a necessity. Many methods have addressed the clustering of protein sequences, and most of them fall into three major groups: hierarchical, graph-based and partitioning methods. Of these, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are used extensively in other fields, few applications have been reported for protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data, or whether they are efficient compared with published clustering methods.
Year: 2009 PMID: 19341454 PMCID: PMC2678123 DOI: 10.1186/1756-0381-2-3
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Summary of symbols and definitions
| D | Data set of protein sequences to be clustered |
| K | Number of clusters |
| n | Number of proteins in D |
| Oi | A protein sequence |
| q | Number of iterations |
Figure 1. Pseudo code for the Pro-Kmeans algorithm.
Figure 2. Pseudo code for the Pro-LEADER algorithm.
Figure 3. Pseudo code for the Pro-PAM algorithm.
Figure 4. Pseudo code for the Pro-CLARA algorithm.
Figure 5. Pseudo code for the Pro-CLARANS algorithm.
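The pseudo code in these figures is not reproduced in this record. As a rough illustration of what a medoid-based partitioning method over sequences can look like, the sketch below runs a generic k-medoids style loop using the symbols D, K and q from the table of symbols. It is not the authors' Pro-PAM or any other proposed algorithm. The function names are hypothetical, and the similarity measure (Jaccard index of 2-mer sets) and the choice of the first K sequences as starting medoids are assumptions made only to keep the example self-contained.

```python
def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=2):
    """Placeholder similarity (an assumption): Jaccard index of k-mer sets."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def k_medoids(D, K, q=10):
    """Partition the sequences in D into K clusters in at most q iterations.

    Returns a dict mapping each medoid index to the indices of its members.
    """
    medoids = list(range(K))  # assumption: start from the first K sequences
    clusters = {}
    for _ in range(q):
        # Assignment step: every sequence joins its most similar medoid.
        clusters = {m: [] for m in medoids}
        for i, seq in enumerate(D):
            if i in clusters:
                clusters[i].append(i)  # a medoid stays in its own cluster
                continue
            best = max(medoids, key=lambda m: similarity(seq, D[m]))
            clusters[best].append(i)
        # Update step: the new medoid is the member with the highest
        # total similarity to the other members of its cluster.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(D[c], D[j]) for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # medoids are stable, so the partition has converged
        medoids = new_medoids
    return clusters


if __name__ == "__main__":
    toy = ["MKVLAAGIVG", "MKVLAAGIVA", "GGSSPPGGSS", "GGSSPPGGSA"]
    for medoid, members in k_medoids(toy, K=2).items():
        print(toy[medoid], [toy[i] for i in members])
```

In practice the placeholder similarity would be replaced by an alignment-based score, and the starting medoids would normally be chosen at random or by a seeding heuristic.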
Performance of the three other tools (ProClust, TribeMCL and JACOP) and our four proposed methods on DS1, DS2, DS3 and DS4 data sets with respect to two clustering quality measurements: Sensitivity (Sens.) and Specificity (Spec.)
| Algorithms | DS1 Sens. | DS1 Spec. | DS2 Sens. | DS2 Spec. | DS3 Sens. | DS3 Spec. | DS4 Sens. | DS4 Spec. |
| ProClust | 50.64 | 56.77 | 48.71 | 61.86 | 46.09 | 55.14 | 46.39 | 51.07 |
| TribeMCL | 46.09 | 52.89 | 41.42 | 52.14 | 41.04 | 47.48 | 51.22 | 56.46 |
| JACOP | 99.92 | 66.27 | 99.96 | 70.06 | 99.96 | 73.96 | 99.92 | 94.42 |
| Pro-Kmeans | 92.38 | 99.90 | 55.32 | 98.01 | 63.30 | 96.92 | 56.06 | 99.56 |
| Pro-LEADER | 90.21 | 91.40 | 53.15 | 91.24 | 52.96 | 74.06 | 23.34 | 95.70 |
| Pro-CLARA | 93.60 | 99.92 | 73.28 | 99.26 | 81.53 | 98.60 | 77.84 | 99.66 |
| Pro-CLARANS | 93.10 | 99.90 | 78.62 | 98.70 | 76.24 | 97.34 | 62.06 | 99.09 |
DS4 is a very large data set which contains all sequences of DS1 (HLA protein family), DS2 (Hydrolases protein family) and DS3 (Globins protein family).
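To compare the methods at a glance, the values in the table can be averaged per algorithm. The sketch below copies the sensitivity and specificity values from the table and computes an unweighted mean over the four data sets. The averaging is an illustration added here, not a statistic reported by the authors, and it is a crude summary because DS4 contains the sequences of the other three data sets.

```python
# (sensitivity, specificity) for DS1, DS2, DS3, DS4, copied from the table.
results = {
    "ProClust":    [(50.64, 56.77), (48.71, 61.86), (46.09, 55.14), (46.39, 51.07)],
    "TribeMCL":    [(46.09, 52.89), (41.42, 52.14), (41.04, 47.48), (51.22, 56.46)],
    "JACOP":       [(99.92, 66.27), (99.96, 70.06), (99.96, 73.96), (99.92, 94.42)],
    "Pro-Kmeans":  [(92.38, 99.90), (55.32, 98.01), (63.30, 96.92), (56.06, 99.56)],
    "Pro-LEADER":  [(90.21, 91.40), (53.15, 91.24), (52.96, 74.06), (23.34, 95.70)],
    "Pro-CLARA":   [(93.60, 99.92), (73.28, 99.26), (81.53, 98.60), (77.84, 99.66)],
    "Pro-CLARANS": [(93.10, 99.90), (78.62, 98.70), (76.24, 97.34), (62.06, 99.09)],
}


def mean_scores(pairs):
    """Return the unweighted mean (sensitivity, specificity) over data sets."""
    sens = sum(p[0] for p in pairs) / len(pairs)
    spec = sum(p[1] for p in pairs) / len(pairs)
    return sens, spec


summary = {name: mean_scores(pairs) for name, pairs in results.items()}

if __name__ == "__main__":
    for name, (sens, spec) in summary.items():
        print(f"{name:<12} mean sens. {sens:6.2f}   mean spec. {spec:6.2f}")
```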