Rafal Adamczak, Jarek Meller.
Abstract
BACKGROUND: Advances in computing have enabled current protein and RNA structure prediction and molecular simulation methods to dramatically increase their sampling of conformational spaces. The quickly growing number of experimentally resolved structures, and databases such as the Protein Data Bank, also implies large-scale structural similarity analyses to retrieve and classify macromolecular data. Consequently, the computational cost of structure comparison and clustering for large sets of macromolecular structures has become a bottleneck that necessitates further algorithmic improvements and development of efficient software solutions.
Keywords: Hierarchical clustering; Macromolecular structure analysis; Model quality assessment; Profile hashing; Protein structure; RNA structure
Year: 2016 PMID: 28031034 PMCID: PMC5198500 DOI: 10.1186/s12859-016-1381-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic representation of approximate hierarchical clustering with profile hashing to generate ‘micro-clusters’ (lower level in the figure, with hashing keys in terms of consensus structure) that are subsequently hierarchically clustered, starting from the representative structures (1D-jury centroids shown above) in each micro-cluster, using an applicable distance measure, such as Hamming, cosine or RMSD (if 3D structures are available).
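The two-stage scheme in the caption can be illustrated with a minimal Python sketch. It is not the uQlust implementation: the hash key used here (a bit string marking where a profile agrees with the overall consensus), the column-frequency score used to pick a 1D-jury-style centroid, the average-linkage merging, and the toy secondary-structure profiles are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def column_counts(profiles):
    """Count the symbols observed at each profile position."""
    return [Counter(p[i] for p in profiles) for i in range(len(profiles[0]))]


def hash_key(profile, consensus):
    """Assumed key: '1' where the profile agrees with the consensus, else '0'."""
    return "".join("1" if a == b else "0" for a, b in zip(profile, consensus))


def micro_clusters(profiles):
    """Stage 1: bucket structures whose hash keys are identical (one pass)."""
    consensus = "".join(c.most_common(1)[0][0] for c in column_counts(profiles))
    buckets = defaultdict(list)
    for idx, profile in enumerate(profiles):
        buckets[hash_key(profile, consensus)].append(idx)
    return list(buckets.values())


def jury_centroid(profiles, members):
    """Pick the member whose symbols are most frequent within its micro-cluster."""
    counts = column_counts([profiles[i] for i in members])

    def score(i):
        return sum(counts[pos][sym] for pos, sym in enumerate(profiles[i]))

    return max(members, key=score)


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(profiles, reps, k):
    """Stage 2: naive average-linkage merging of representatives down to k groups."""
    groups = [[r] for r in reps]

    def linkage(g, h):
        total = sum(hamming(profiles[i], profiles[j]) for i in g for j in h)
        return total / (len(g) * len(h))

    while len(groups) > k:
        a, b = min(
            combinations(range(len(groups)), 2),
            key=lambda ab: linkage(groups[ab[0]], groups[ab[1]]),
        )
        groups[a] += groups[b]  # a < b, so deleting b keeps index a valid
        del groups[b]
    return groups


# Hypothetical secondary-structure profiles (H = helix, E = strand, C = coil).
profiles = [
    "HHHHEEEE", "HHHHEEEE", "HHHHEEEC", "CCCCEEEE",
    "CCCCEEEE", "HHHHEEEC", "HHHHEEEE",
]
clusters = micro_clusters(profiles)
reps = [jury_centroid(profiles, members) for members in clusters]
groups = agglomerate(profiles, reps, k=2)
print(clusters, reps, groups)
```

Only the representatives enter the pairwise stage, so the quadratic distance computations are restricted to the number of micro-clusters rather than the number of structures.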
Evaluation of protein model quality assessment approaches
| Method | CASP10 | TASSER |
|---|---|---|
| PconsD | 0.68 / 0.43 | 4.3 / 0.46 |
| ClusCo (10) | 0.68 / 0.37 | 3.2 / 0.49 |
| Pleiades (10) | 0.67 / 0.38 | 3.1 / 0.45 |
Average MaxSub similarity score between the top-ranking and best models (left) and the fraction of good models (right) are reported for both CASP and TASSER targets. The fraction of good models is defined as the fraction of targets for which the top-ranking model is less than 0.2 MaxSub score from the best model for CASP, and less than 2 Å RMSD from the best model for TASSER. Centroids of the 5 largest (out of K = 10) clusters are considered for clustering methods, and F = 60% of the data is used for uQlust.
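As a worked illustration of the "fraction of good models" definition for the CASP case, the following Python sketch counts the targets whose top-ranking model falls within 0.2 MaxSub of the best available model. The function name and the per-target scores are hypothetical, and reading "less than 0.2 from the best model" as a difference of MaxSub scores is an assumption.

```python
def fraction_good(top_scores, best_scores, max_gap=0.2):
    """Fraction of targets whose top-ranked model is within max_gap of the best.

    Assumes higher MaxSub is better, so the gap is best minus top.
    """
    good = sum(1 for top, best in zip(top_scores, best_scores) if best - top < max_gap)
    return good / len(top_scores)


# Hypothetical MaxSub scores for four targets (not taken from the paper).
top_ranked = [0.70, 0.40, 0.55, 0.30]
best_available = [0.75, 0.65, 0.60, 0.62]

frac = fraction_good(top_ranked, best_available)
print(frac)
```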
Running times for model ranking on TASSER target 256b_A
| N_struct | 2000 | 4000 | 8000 | 16,000 |
|---|---|---|---|---|
| | 13.8 | 51.6 | 132.0 | 231.0 |
| | 0.6 | 1.2 | 3.0 | 6.6 |
| PconsD | 23.6 | 64.8 | 260.7 | 901.7 |
Time in CPU sec on a server with 8 Intel(R) Core(TM)2 Q6600 @ 2.0 GHz CPUs, 4 GB of memory, and Linux version Ubuntu 12.04. PconsD was allowed to use all 8 CPUs and the Tesla C2075 graphics card with 448 GPU cores, while times for uQlust are for 1 CPU only to demonstrate its linear scaling.
Time and memory usage for hierarchical clustering methods
| N_struct | 9000 | 18,000 | 36,000 | 72,000 | 144,000 |
|---|---|---|---|---|---|
| Time (ClusCo) | 360 | 3080 | 24818 | 209072 | --- |
| Time (MaxClust) | 7140 | 50540 | --- | --- | --- |
| Memory (ClusCo) | 0.4 | 1.6 | 6.5 | 25 | --- |
| Memory (MaxClust) | 1.9 | 5.7 | 19.0 | --- | --- |
CPU times (sec) and memory usage (GB) for approximate uQlust:Tree vs. full hierarchical clustering, obtained by using ClusCo [6] or MaxClust [22]. All calculations were performed on a server with 16 Intel(R) Xeon(R) E5-2680-0 @ 2.70 GHz CPUs, 132 GB of memory, and Linux version 2.6.32-504.1.3.el6.centos.plus.x86_64.
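The growth of cost with data size can be read off the ClusCo rows with a log-log fit. The Python sketch below takes the four completed ClusCo measurements from the table and estimates the scaling exponent by least squares; the helper name `loglog_slope` is an assumption for illustration. On these values the fit comes out close to cubic for time and close to quadratic for memory.

```python
import math


def loglog_slope(sizes, values):
    """Least-squares slope of log(values) against log(sizes)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# ClusCo measurements as reported in the table (the 144,000 column has no value).
n_struct = [9000, 18000, 36000, 72000]
clusco_time_sec = [360, 3080, 24818, 209072]
clusco_memory_gb = [0.4, 1.6, 6.5, 25]

time_exponent = loglog_slope(n_struct, clusco_time_sec)
memory_exponent = loglog_slope(n_struct, clusco_memory_gb)
print(f"time ~ N^{time_exponent:.2f}, memory ~ N^{memory_exponent:.2f}")
```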
Fig. 2 Hierarchical clustering of 98,000 protein chains from the Protein Data Bank, using the fragment-based FragBag profile and the uQlust:Tree algorithm. The initial micro-clusters of structures deemed closely related (i.e. those with identical hash keys, including large “micro-clusters” of nearly identical structures such as those of globins or lysozymes) constitute the leaves in the tree. CATH assignments at the class level for majority alpha, alpha/beta (or alpha + beta) and beta clusters are shown as red, blue and yellow bars, respectively. The uQlust graphical user interface enables interactive exploration of such dendrograms and other representations of large data sets.
Clustering-based RNA model quality assessment for FARNA
| Target | Best RMSD | 10-means (3D) | Rpart (10,60) |
|---|---|---|---|
| 2a43 | 4.5 | 5.3 | 4.6 |
| 1a4d | 3.8 | 11.9 | 6.0 |
| 1esy | 2.9 | 3.4 | 3.3 |
| 1kka | 3.6 | 4.5 | 4.5 |
| 1l2x | 3.9 | 4.8 | 4.0 |
| 1q9a | 4.1 | 4.4 | 4.7 |
RMSD (Å) between the native structure and the closest of the top 5 centroids, obtained using uQlust:K-means with RMSD distance (third column) or uQlust:Rpart with Hamming distance and the RNA-SS-LW profile (last column), compared with the best possible prediction, i.e., the RMSD for the best model in a subset of 500 decoys for each target from [19].