Ling-Hong Hung, Ram Samudrala.
Abstract
MOTIVATION: fast_protein_cluster is a fast, parallel and memory-efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project.
Year: 2014 PMID: 24532722 PMCID: PMC4058946 DOI: 10.1093/bioinformatics/btu098
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Mean TM-score of centroids relative to native structure
| Clustering method | Centroid of largest cluster | Best centroid of five largest clusters |
|---|---|---|
| Spicker | 0.584 | 0.607 |
| Clusco/k-means/RMSD | 0.585 | 0.612 |
| Multi-k-means/RMSD | 0.590 | |
| Multi-k-means/TM-score | 0.592 | |
| Hierarchical/RMSD | 0.588 | |
| Hierarchical/TM-score | | |
fast_protein_cluster k-means values are the average of five separate runs to control for different starting seeds. Distance matrices were calculated using CA-atom coordinates. TM-score means that are significantly better (paired z-test with P < 0.05) than Spicker are in bold, and those significantly better than Clusco are underlined. The quality of the best model among the centroids of the five largest clusters is significantly improved when fast_protein_cluster is used as the clustering method.
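The paired z-test mentioned in the note above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `paired_z_test`, the one-sided alternative (method A better than method B), and the sample scores are all assumptions made for the example. The test works on per-target differences in TM-score, so each target serves as its own control.

```python
import math
from statistics import mean, stdev


def paired_z_test(scores_a, scores_b):
    """One-sided paired z-test for mean(scores_a - scores_b) > 0.

    Hypothetical helper: returns (z, p), where p is the upper-tail
    probability of a standard normal at z. The one-sided alternative
    is an assumption; the source only states "paired z-test".
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    # Per-target differences: each target is compared with itself.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    # Standard error of the mean difference (sample standard deviation).
    std_err = stdev(diffs) / math.sqrt(n)
    if std_err == 0:
        raise ValueError("differences have zero variance")
    z = mean(diffs) / std_err
    # Upper-tail normal probability: P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


# Made-up TM-scores for four targets, purely illustrative.
method_a = [0.60, 0.62, 0.58, 0.65]
method_b = [0.55, 0.60, 0.57, 0.60]
z, p = paired_z_test(method_a, method_b)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

With only a handful of targets a paired t-test would usually be the safer choice; the normal approximation used here is reasonable when the number of paired targets is large.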
Fig. 1. Performance of fast_protein_cluster. The speeds of all-atom RMSD and TM-score matrix calculations over the entire Spicker test set are shown relative to qcprot and the original TM-score for the different methodologies on a 4-core i7 CPU and two different GPUs. The times for k-means and hierarchical partitioning are shown as a function of the number of models. For RMSD calculations, the parallel SSE2 and AVX SIMD code on the laptop CPU outperforms the Clusco GPU code. For partitioning, fast_protein_cluster is up to 250× and 2000× faster for k-means and hierarchical clustering, respectively.