Bong-Hyun Kim, Bhadrachalam Chitturi, Nick V Grishin.
Abstract
BACKGROUND: Numerous clustering methods, such as single linkage and K-means, have been widely studied and applied to a variety of scientific problems. However, the existing methods are not readily applicable to problems that demand high stringency.
Year: 2012 PMID: 23320864 PMCID: PMC3426801 DOI: 10.1186/1471-2105-13-S13-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Execution traces for SCG algorithms. The two matrices, rmatrix (upper matrix) and smatrix (lower matrix), are shown for a dataset of seven objects. We trace A1, A2 and A3 on this input. Note that the ranks are asymmetric.
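The remark that ranks are asymmetric can be illustrated with a small sketch. It assumes that a rank is the position of one object in another object's neighbour list ordered by increasing distance; the source does not spell out how rmatrix is built, so the function name `rank_matrix`, the tie-breaking rule, and the three-point example are all hypothetical.

```python
def rank_matrix(dist):
    """Rank every object j by increasing distance from each object i.

    Assumption (not from the source): ranks[i][j] is the 1-based position
    of j in i's sorted neighbour list, and ranks[i][i] is 0.
    Ties are broken by index, which is an arbitrary choice.
    """
    n = len(dist)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (dist[i][j], j),
        )
        for rank, j in enumerate(neighbours, start=1):
            ranks[i][j] = rank
    return ranks


# Hypothetical example: three points on a line at positions 0, 1 and 3.
positions = [0.0, 1.0, 3.0]
dist = [[abs(a - b) for b in positions] for a in positions]
ranks = rank_matrix(dist)

# The distance between objects 1 and 2 is symmetric, but the ranks are not:
# object 2 is only the second-nearest neighbour of object 1,
# while object 1 is the nearest neighbour of object 2.
print(dist[1][2], dist[2][1])    # same distance in both directions
print(ranks[1][2], ranks[2][1])  # different ranks
```

Even with a perfectly symmetric distance matrix, the rank of j seen from i generally differs from the rank of i seen from j, because each object has its own set of competing neighbours.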
Execution time measurements of SCG algorithms
| | A1 | A2 | A3 |
|---|---|---|---|
| 8 | 0.0011 | 0.00055 | 0.0009 |
| 16 | 0.0061 | 0.0025 | 0.0042 |
| 32 | 0.025 | 0.0039 | 0.01 |
| 64 | 0.131 | 0.009 | 0.051 |
| 128 | 0.594 | 0.016 | 0.276 |
| 256 | 2.584 | 0.049 | 1.295 |
| 512 | 12.54 | 0.118 | 4.34 |
| 1024 | 58.429 | 0.27 | 25.47 |
Figure 2. Comparison of execution times of SCG algorithms. A1 is the slowest, followed by A3 and then A2. Note that A2 is guaranteed to identify the independent clusters but might not identify the complete structure.
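One way to read the execution time table is to estimate how each algorithm's time grows with input size. The sketch below fits a least-squares slope on a log-log scale. It assumes that the unlabeled first column is the number of objects and that all times share one unit; neither is stated in the table. The helper name `loglog_slope` is hypothetical.

```python
import math

# Values copied from the execution time table.
# Assumption: the first column is the number of objects in the dataset.
sizes = [8, 16, 32, 64, 128, 256, 512, 1024]
times = {
    "A1": [0.0011, 0.0061, 0.025, 0.131, 0.594, 2.584, 12.54, 58.429],
    "A2": [0.00055, 0.0025, 0.0039, 0.009, 0.016, 0.049, 0.118, 0.27],
    "A3": [0.0009, 0.0042, 0.01, 0.051, 0.276, 1.295, 4.34, 25.47],
}


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    A slope of k suggests that y grows roughly like x**k over this range.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


slopes = {name: loglog_slope(sizes, ts) for name, ts in times.items()}
for name, slope in slopes.items():
    print(f"{name}: empirical growth exponent ~ {slope:.2f}")
```

The fitted exponents are empirical summaries of eight measurements, not complexity bounds; they only quantify the ordering described in the Figure 2 caption.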
Figure 3. Effect of random errors introduced in distance measurements.
(a) Dataset: We randomly generated a dataset of 80 points around four centers, (0, 8), (0, -8), (8, 0) and (-8, 0), with 20 points for each center. Each point was offset from its center in both the X and Y directions by a random amount following a normal distribution (µ = 0 and SD = 1).
(b) Effect of random error on average cluster sizes: For the given dataset of 80 points, the Euclidean distances were calculated. We then perturbed each pairwise distance with a random value following a Gaussian distribution with µ = 0 and the SD shown on the X axis. Note that SD = 0 implies that there are no perturbations. These distances were used to build clusters using SCG (cyan line), complete linkage (CL, blue line), average linkage (AL, green line) and single linkage (SL, red line). Since CL, AL and SL require score cut-offs, we measured the clustering with distance cut-off values of 2 (conservative) and 4 (less conservative), denoted by the numbers following "/" (solid lines and dotted lines, respectively). Finally, the number of clusters, including singletons, was measured per method per cut-off. Thus, the maximum possible value is 80 (all singletons) and the minimum possible value is 1. The ideal number is 4 by design (Fig. 3(a)). The error bars shown at different points of the curves (each representing a method) are derived from 100 perturbations for a given SD. Note that SCG shows the steepest rise in the number of clusters.
(c) Effect of random error on cluster qualities: Legends and the unit of the X axis are the same as in (b). After each method identifies clusters, we enumerate all pairs within a cluster, e.g. the cluster {1,2,3} is decomposed into three pairs: 1-2, 1-3 and 2-3. If the two objects of a pair do not belong to the same reference cluster, we increment the number of incorrect pairs by one. Note that the number of incorrect pairs is intrinsically related to the number of clusters. If the number of clusters is 1, meaning all objects are grouped into one cluster, the number of incorrect pairs is large (see SL/4). If the number of clusters is 80 (similar to SCG at high errors), then no incorrect pairs exist.
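The dataset construction, distance perturbation and incorrect-pair count described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the fixed seed, and the choice to apply one symmetric perturbation per pair without clamping negative values are assumptions. No clustering method is implemented here; the two extreme clusterings from the caption are used as inputs.

```python
import itertools
import math
import random

CENTERS = [(0, 8), (0, -8), (8, 0), (-8, 0)]


def make_dataset(rng, points_per_center=20):
    """Generate points around each center with N(0, 1) offsets in X and Y."""
    points, labels = [], []
    for label, (cx, cy) in enumerate(CENTERS):
        for _ in range(points_per_center):
            points.append((cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)))
            labels.append(label)
    return points, labels


def perturbed_distances(points, sd, rng):
    """Euclidean distances plus Gaussian noise with mean 0 and the given SD.

    Assumptions: one noise value per unordered pair (the matrix stays
    symmetric) and no clamping of negative results.
    """
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        d = math.dist(points[i], points[j])
        if sd > 0:
            d += rng.gauss(0, sd)
        dist[i][j] = dist[j][i] = d
    return dist


def count_pairs(clusters, labels):
    """Return (incorrect, correct) pair counts against the reference labels."""
    incorrect = correct = 0
    for cluster in clusters:
        for a, b in itertools.combinations(cluster, 2):
            if labels[a] == labels[b]:
                correct += 1
            else:
                incorrect += 1
    return incorrect, correct


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
points, labels = make_dataset(rng)
dist = perturbed_distances(points, sd=1.0, rng=rng)

one_big_cluster = [list(range(len(points)))]
all_singletons = [[i] for i in range(len(points))]
print(count_pairs(one_big_cluster, labels))  # many incorrect pairs
print(count_pairs(all_singletons, labels))   # no pairs at all
```

With a single cluster of 80 points there are 80 × 79 / 2 = 3160 pairs, of which 4 × (20 × 19 / 2) = 760 join points from the same center, leaving 2400 incorrect pairs. With 80 singletons no pairs exist, so the incorrect-pair count is zero, which is why this measure has to be read together with the number of clusters.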
Comparisons of clusters built by different methods to the reference SCOP fold classification (total number of domains clustered: 9528)
| | SCG | CL | AL | SL |
|---|---|---|---|---|
| Total number of clusters* | 4965 | 4965 | 4965 | 4965 |
| Number of non-singleton clusters | 1926 | 1561 | 1263 | 975 |
| Number of incorrect pairs | 102 | 214 | 2952 | 6440 |
| Percentage of incorrect pairs | (0.2) | (0.4) | (3.5) | (3.7) |
| Number of correct pairs | 46938 | 50948 | 81280 | 166386 |
| Percentage of correct pairs | (99.8) | (99.6) | (96.5) | (96.3) |
*Total number of clusters was fixed at the number of clusters determined by SCG for a fair comparison of different methods.
Figure 4. SCG, CL, AL, and SL clustering results of SCOP domains based on structural similarity score. The color scheme is the same as in Fig. 3. The graph shows the cluster sizes (X axis) and the corresponding number of clusters (Y axis). All clustering methods yielded many small clusters or singletons and few big clusters. Note that clusters with sizes greater than 5 are lumped together.
Similarities of clusters built by different methods
| | SCG | CL | AL | SL |
|---|---|---|---|---|
| SCG | 1.0 | 0.78 | 0.59 | 0.36 |
| CL | | 1.0 | 0.74 | 0.44 |
| AL | | | 1.0 | 0.65 |
| SL | | | | 1.0 |
F-measures are used to quantify the similarity between the clusterings produced by different methods. The F-measure is formally defined as the harmonic mean of precision and recall [13].
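As a rough illustration of the harmonic-mean definition, the sketch below computes an F-measure between two clusterings by pair counting. The source does not state how precision and recall were derived from two clusterings, so the pair-counting variant, the function names, and the toy clusterings are assumptions rather than the method used for the table.

```python
import itertools


def co_clustered_pairs(clusters):
    """Set of unordered object pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in itertools.combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_measure(clusters_a, clusters_b):
    """Harmonic mean of pair-counting precision and recall.

    Assumption: precision is the fraction of pairs in A also found in B,
    and recall is the fraction of pairs in B also found in A.
    """
    pairs_a = co_clustered_pairs(clusters_a)
    pairs_b = co_clustered_pairs(clusters_b)
    if not pairs_a and not pairs_b:
        return 1.0  # two all-singleton clusterings agree trivially
    shared = len(pairs_a & pairs_b)
    if shared == 0:
        return 0.0
    precision = shared / len(pairs_a)
    recall = shared / len(pairs_b)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy clusterings over the objects 1..5.
coarse = [[1, 2, 3], [4, 5]]
fine = [[1, 2], [3], [4, 5]]
f_score = pairwise_f_measure(coarse, fine)
print(round(f_score, 3))
```

In this toy case the coarse clustering has four co-clustered pairs and the fine one has two, both of which are shared, so precision is 0.5, recall is 1.0, and the F-measure is 2/3. Swapping the arguments swaps precision and recall but leaves the harmonic mean unchanged, which is consistent with the table being reported as a symmetric, upper-triangular matrix.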