William Wong1,2,3, Naotsugu Tsuchiya1,2,3. 1. School of Psychological Sciences and Turner Institute for Brain and Mental Health, Monash University. 2. Center for Information and Neural Networks (CiNet), National Institute of Information and Communications Technology (NICT), Suita, Osaka, Japan. 3. Advanced Telecommunications Research Computational Neuroscience Laboratories, Seika-cho, Soraku-gun, Kyoto, Japan.
Abstract
Evidence accumulation clustering (EAC) is an ensemble clustering algorithm that can cluster data for arbitrary shapes and numbers of clusters. Here, we present a variant of EAC in which we aimed to better cluster data with a large number of features, many of which may be uninformative. Our new method builds on the existing EAC algorithm by populating the clustering ensemble with clusterings based on combinations of fewer features than the original dataset at a time. Our method also calls for prewhitening the recombined data and weighting the influence of each individual clustering by an estimate of its informativeness. We provide code of an example implementation of the algorithm in Matlab and demonstrate its effectiveness compared to ordinary evidence accumulation clustering with synthetic data.
• The clustering ensemble is made by clustering on subset combinations of features from the data
• The recombined data may be prewhitened
• Evidence accumulation can be improved by weighting the evidence with a goodness-of-clustering measure
We developed a novel clustering method for a study [15] where we were faced with the problem of clustering a set of data, composed of 54 observations of over 50,000 raw variables, in an unsupervised manner. A smaller number of attributes (approx. 2,500) or “features” were extracted for each observation, but a majority of the features were still unlikely to be informative for clustering; we will call these “dubious feature sets”. The lack of ground truth combined with the stipulation to not disregard any of the features a priori, for reasons specific to our study, meant that there was a high risk of producing clusters conditioned on the noise rather than on any underlying structure. Our proposed clustering method, which we call combinatorial evidence accumulation clustering (or combination clustering for short in this article), is a variant of evidence accumulation clustering (EAC) that attempts to mitigate the influence of dubious features in order to obtain better results in unsupervised clustering.
EAC is an algorithm for finding data clusters of arbitrary shapes and numbers [3]. It belongs to the class of consensus, or ensemble, clustering algorithms [4,13] in that it combines the results of multiple individual clusterings of the same data to produce a partitioning solution that can be more accurate than those of the individuals. In EAC, an individual clustering is the result of applying some clustering algorithm to the data with variation imparted on either the algorithm itself or the data representation. Each of these individual clusterings is therefore considered a piece of evidence of how the data are organised, which the EAC algorithm uses to determine the optimal clustering solution. Here, we shall call the clusterings making up the evidence “sub-clusterings” so as not to confuse them with the ultimate clustering step of EAC, which we will call “evidence accumulation”.
In Fred & Jain's implementation [3], henceforth referred to as ordinary EAC, sub-clustering consisted of multiple, randomly-seeded runs of k-means clustering [7]. The evidence is combined and expressed in a similarity matrix called the co-association matrix (C), which records how frequently each pair of data points occurs in the same sub-cluster across the pieces of evidence. The co-association matrix is then hierarchically clustered using the single- or average-linkage criterion, giving the final EAC result: this is the evidence accumulation step.
To address the specific problem of clustering a data set with many dubious features, our combination clustering method incorporated the following improvements to Fred & Jain's [3] EAC method:
a) Sub-clusterings are performed on subset combinations of features of data.
b) Evidence is weighted by a goodness-of-clustering measure of the originating sub-clustering and by the inverse number of total sub-clusterings within its dimensionality.
c) Data in each sub-clustering are prewhitened.
In the next section, we explain the motivation behind these improvements using simple examples.
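As an illustration of the co-association counting used by ordinary EAC, the following is a minimal Python sketch (not the Matlab implementation of this article). It assumes that each sub-clustering is available as a list of integer labels, one per object; the function name `coassociation` and the toy ensemble are hypothetical.

```python
def coassociation(labelings):
    """Average co-association over a list of label vectors.

    Each element of `labelings` is one sub-clustering: a list of cluster
    labels, one per object. Entry (i, j) of the result is the fraction of
    sub-clusterings in which objects i and j share a sub-cluster.
    """
    n = len(labelings[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    m = len(labelings)
    return [[c / m for c in row] for row in counts]


# Three hypothetical sub-clusterings of four objects.
ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first, with label names swapped
    [0, 1, 1, 1],
]
C = coassociation(ensemble)
```

Only label equality within a sub-clustering matters, so the arbitrary naming of sub-clusters across runs does not affect the result.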
Motivation
The 1-dimensional case
Given a data set with only one feature (or dimension, as we use here interchangeably), improvement b) is the only applicable factor to consider. Both ordinary EAC and combination clustering produce a clustering ensemble composed of the results of other clusterings (i.e., sub-clusterings). Let us take an ensemble to be composed of $L$ number of k-means clusterings, with variation between them given by different parameters and/or initialisations of the clustering algorithm as in Fred & Jain's [3] original description.
When sub-clustering n number of objects, we will represent the result of the lth sub-clustering as an n × n similarity matrix,
$$S_l = \left[ s_l(i, j) \right] \qquad (1)$$
where $S_l$ signifies the entire matrix, and $s_l(i, j)$ (for $i = 1, \ldots, n$ and $j = 1, \ldots, n$) is the entry for object pair i, j in the ith row of the jth column. When there is a co-association between the ith and jth objects due to being sub-clustered together, the corresponding entries of the similarity matrix are given a value of 1 (i.e., $s_l(i, j) = s_l(j, i) = 1$), and otherwise 0 when the objects do not co-associate. The result $S_l$ from sub-clustering l constitutes one piece of evidence. The total number of sub-clusterings is given by $L$ (thus, $l = 1, \ldots, L$). We obtain the clustering ensemble E as the set of all similarity matrices resulting from sub-clustering:
$$E = \{ S_1, S_2, \ldots, S_L \} \qquad (2)$$
In ordinary EAC, the co-association matrix is produced by taking the average of the evidence in E. With improvement b), we propose to weight each piece of evidence by a measure of its sub-clustering's goodness-of-clustering. The goodness-of-clustering for sub-clustering l shall be given by $g_l$. Thus, to implement improvement b), the co-association matrix entry at the ith row and jth column will be given by
$$C(i, j) = \frac{1}{L} \sum_{l=1}^{L} g_l \, s_l(i, j) \qquad (3)$$
We would expect that, compared to ordinary EAC, our weighted EAC should better cluster the data as long as $g_l$ reflects the true goodness-of-clustering. To understand how this might be the case, consider that some parameters or initialisations of sub-clustering may produce evidence from local minima that are non-conducive to the optimal solution. For such cases, we would want to excise or penalise the evidence depending on how poor it is. In other words, if we can estimate the quality of the evidence, we can downweight poor-quality evidence to boost the quality of the clustering ensemble. We illustrate this concept in Fig. 1.
Fig. 1
Application of improvement b) in the 1-dimensional case. A set of data is illustrated as circles, whose coordinates along the real number line represent its value magnitudes for a single feature. In this example, two different sub-clusterings are produced due to using different initial parameters for the sub-clustering subroutine. The clusters are represented by groups of empty or filled circles. The sub-clustering on the left is an example of a relatively poor sub-clustering result, which is estimated by a goodness-of-clustering measure. The clustering ensemble, which is the collection of all sub-clusterings, may therefore be improved by weighting the contribution of individual sub-clusterings by their goodness-of-clustering.
The choice of method to measure goodness-of-clustering depends on the nature of the clustering problem, and is a discussion outside the scope of this paper. Here, we provisionally present a simple method to measure goodness-of-clustering based on the averaging of silhouette values. The silhouette value for object $i$ is defined by Rousseeuw [11] as
$$\mathrm{sil}(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (4)$$
where $a(i)$ is the average dissimilarity (e.g., the Euclidean distance in feature space) of object $i$ to all other members of its cluster, and $b(i)$ is the average dissimilarity of object $i$ to all members of the next nearest cluster. $\mathrm{sil}(i)$ takes a value on the interval $[-1, 1]$, with higher values corresponding to closer association with its own cluster, and negative values corresponding to closer association with the other cluster. Thus, the value of $\mathrm{sil}(i)$ averaged over all objects will be high when well-separated clusters exist and cluster memberships are correctly identified. It will be low when clusters are poorly separated, do not exist, or when cluster memberships are incorrectly assigned.
We measure goodness-of-clustering by taking the mean of these silhouette values over all objects of the data, indexed by $i$, and then applying a ramp function to the result (to ensure that sub-clusterings with negative means do not contribute to evidence accumulation), as follows:
$$g_l = \max\!\left( 0,\; \frac{1}{n} \sum_{i=1}^{n} \mathrm{sil}_l(i) \right) \qquad (5)$$
With $g_l$ and $s_l(i, j)$, we can now find the co-association matrix $C$ via Eq. 3. Any clustering method may then be applied to $C$ for the evidence accumulation step: this will produce the algorithm's final clustering result.
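The goodness-of-clustering measure described above (mean silhouette value passed through a ramp function) can be sketched in Python for the 1-dimensional case as follows. This is an illustrative sketch only: the function name, the use of the absolute difference as the dissimilarity, and the convention of assigning a silhouette of 0 to singleton clusters are our assumptions here, and the toy data are hypothetical.

```python
def ramped_mean_silhouette(values, labels):
    """Mean silhouette value over all objects, clipped below at zero.

    `values` holds one feature value per object (1-D case) and `labels`
    the sub-cluster label of each object. Dissimilarity is taken to be
    the absolute difference between values (an assumption of this sketch).
    """
    n = len(values)
    cluster_ids = set(labels)
    silhouettes = []
    for i in range(n):
        # Mean dissimilarity from object i to the members of each cluster,
        # excluding object i itself.
        mean_dist = {}
        for c in cluster_ids:
            d = [abs(values[i] - values[j])
                 for j in range(n) if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        a = mean_dist.get(labels[i])
        others = [v for c, v in mean_dist.items() if c != labels[i]]
        if a is None or not others or max(a, min(others)) == 0:
            silhouettes.append(0.0)  # singleton cluster or degenerate case
            continue
        b = min(others)  # the next nearest cluster
        silhouettes.append((b - a) / max(a, b))
    return max(0.0, sum(silhouettes) / n)  # ramp function


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
g_good = ramped_mean_silhouette(values, [0, 0, 0, 1, 1, 1])
g_poor = ramped_mean_silhouette(values, [0, 1, 0, 1, 0, 1])
```

In this toy example the well-separated labelling receives a weight close to 1, whereas the interleaved labelling has a negative mean silhouette and is therefore given zero weight by the ramp.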
The 2-dimensional case
With data represented by two features, we can explain improvement a) of combination clustering.
Let us first represent the set of all features in the data as $\Phi = \{f_1, f_2\}$, where $f_m$ is one of those features for $m = 1, 2$. Improvement a) permits sub-clusterings to be performed on subset combinations of features: that is, on $f_1$ or $f_2$ alone, or both $f_1$ and $f_2$. All possible subsets are given by the power set of $\Phi$, ignoring the null set; in this case, they are $\{(f_1), (f_2), (f_1, f_2)\}$.
To understand the motivation behind improvement a), consider a data set with two clusters that are well separated in the first dimension ($f_1$), but randomly mixed in the second ($f_2$). The second dimension is essentially noise, which adds variance to the data. Since the expected distance separating the clusters in $f_1$ is not changed by the addition of $f_2$, the signal-to-noise ratio decreases overall in $(f_1, f_2)$. For ordinary EAC, whose sub-clusterings are all performed within the whole feature space, clustering effectiveness is thereby reduced. In contrast, combination clustering may collect evidence based on subset combinations of the features, which include the more informative $(f_1)$ combination. This information boost can then be leveraged by the weighting procedure proposed earlier, which, by means of a goodness-of-clustering metric, effectively selects for sub-clusterings with combinations of informative features and against those with uninformative features. In our hypothetical 2-D example, we expect the weighting procedure to automatically put more weight on sub-clusterings using only $(f_1)$ than on those using $(f_1, f_2)$, which may only be partly informative. We furthermore expect it to put much less weight on sub-clusterings using only $(f_2)$, as they would only produce spurious, uninformative sub-clusters associated with an ideally low goodness-of-clustering measure. We illustrate this concept in Fig. 2.
Fig. 2
Application of improvement a) in the 2-dimensional case. A set of data is illustrated as circles in 2-D space. In the application of improvement a), Feature 1, Feature 2, and Feature (1, 2) are used separately by different sub-clusterings. In this example, Feature 1 is informative of the underlying groups in the data while Feature 2 is uninformative. Concordantly, the sub-clustering results from only Feature 1 tend to reflect the underlying groups, while results from only Feature 2 tend to be uninformative. Results from both features tend to be intermediately informative. This improvement contrasts with ordinary EAC in which Feature (1, 2) would be used by all sub-clusterings. As per improvement b) (see Fig. 1), the poor sub-clustering result based on Feature 2 (middle sub-clustering) produces a relatively low goodness-of-clustering measure, which correspondingly downweights the sub-clustering's contribution to the clustering ensemble. Similarly, the sub-clustering based on Feature 1 (left sub-clustering) has a better goodness-of-clustering measure than the other two's, hence it has more influence in the clustering ensemble.
Recall that the possible combinations of $\Phi = \{f_1, f_2\}$ are $\{(f_1), (f_2), (f_1, f_2)\}$. You will notice that there are twice as many combinations of one feature as there are of two. Looking beyond the 2-dimensional case, as the number of features in $\Phi$ increases, the number of possible combinations of features grows exponentially (there are $2^N - 1$ non-empty subsets of $N$ features). Among $N$ candidate features, the number of ways to choose $k$ features without repetition is given by the binomial coefficient $\binom{N}{k}$, which is maximal when $k$ approaches half of $N$. Thus, there can be very large imbalances in the number of sub-clusterings for each dimensionality of features if they are exhaustively sampled. We can mitigate this problem by decreasing the contribution of each sub-clustering in proportion to the total number of sub-clusterings that share its order of dimensionality; this becomes part of the weighting procedure.
To be methodical, we shall sort the sub-clusterings into sets according to the number of features they use: that is, their order of dimensionality. Say that our clustering ensemble is composed of all possible combinations of two features: $\{(f_1), (f_2), (f_1, f_2)\}$. We should group together $(f_1)$ and $(f_2)$ due to their being combinations of one feature, making the set $F_1 = \{(f_1), (f_2)\}$. We call these “first-order combinations” of features. Combination $(f_1, f_2)$ would be called a “second-order combination”, and should be put in a separate set as its only member: $F_2 = \{(f_1, f_2)\}$. More generally, we would say that all k-order combinations used for sub-clustering belong to set $F_k$, where
$$F_k \subseteq \{\, c \subseteq \Phi : |c| = k \,\} \qquad (6)$$
Thus, when we denote the total number of combinations in $F_k$ as $L_k = |F_k|$, we may balance the contribution of each order of dimensionality by weighting the contribution of all k-order sub-clusterings by $1/L_k$.
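The grouping of feature combinations by order, and the imbalance in their numbers, can be illustrated with a short Python sketch using only the standard library. The function name `combinations_by_order` and the feature names are hypothetical; the per-order weight is simply the inverse of the number of combinations of that order, as described above.

```python
from itertools import combinations
from math import comb


def combinations_by_order(features):
    """Group all non-empty feature combinations by their order (size)."""
    return {
        k: list(combinations(features, k))
        for k in range(1, len(features) + 1)
    }


# The 2-dimensional example: two first-order combinations, one second-order.
groups = combinations_by_order(["f1", "f2"])
order_weights = {k: 1.0 / len(members) for k, members in groups.items()}

# Number of possible k-order combinations among 10 features, k = 1..10.
counts_10 = [comb(10, k) for k in range(1, 11)]
```

With 10 features, the counts range from 1 (order 10) to 252 (order 5), which shows how strongly mid-order combinations would dominate an exhaustively sampled ensemble without the per-order weighting.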
Sub-clustering using the k-means algorithm with prewhitening
Fred & Jain [3] explored the use of k-means clustering for sub-clustering in EAC, and provided recommendations on the number of sub-clusterings and choice of k to achieve better convergence. We make an additional recommendation to prewhiten the data for each combination of features before sub-clustering.
k-means clustering achieves better results when clusters are described by multiple features that are uncorrelated with each other. However, the empirical data that we aimed to cluster were usually correlated among their features. Prewhitening data is known to result in better clustering with the k-means algorithm [5,8], as it effectively reveals any data relationships arising from the combination of features that are otherwise obscured by their strong correlation. We applied it in our method for each sub-clustering using zero-phase component analysis [1]. This method is also equivalent to performing k-means clustering using Mahalanobis distances [6]. We provide experimental results demonstrating the effectiveness of prewhitening later in this paper.
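For reference, the standard formulation of zero-phase component analysis (ZCA) whitening and its link to the Mahalanobis distance can be written as below. The notation is ours and is introduced only for this illustration: $X$ is the n × k data matrix for one combination of k features, $\bar{X}$ holds the column means, and $\Sigma$ is the sample covariance matrix, assumed here to be invertible.

```latex
\begin{aligned}
\Sigma &= \tfrac{1}{n-1}\,(X-\bar{X})^{\top}(X-\bar{X}) = U \Lambda U^{\top},\\
X_{\mathrm{ZCA}} &= (X-\bar{X})\,U \Lambda^{-1/2} U^{\top} = (X-\bar{X})\,\Sigma^{-1/2},\\
\lVert \Sigma^{-1/2}(x-y) \rVert^{2} &= (x-y)^{\top}\,\Sigma^{-1}\,(x-y).
\end{aligned}
```

The last line states that the squared Euclidean distance between two whitened observations equals the squared Mahalanobis distance between the original observations, which is why k-means on whitened data behaves like k-means with Mahalanobis distances.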
Description
Now, we describe our method for the general case of data with any number of features. Given a data set of n objects with N number of features, the set of all available features is
$$\Phi = \{ f_1, f_2, \ldots, f_N \}$$
We construct a clustering ensemble (E) by sub-clustering on the data in many k-order combinations of $\Phi$ (for various chosen k). Among all possible k-order combinations of features, we may use a subset of them, numbering $L_k$, which forms a set $F_k$ (Eq. 6). For each combination, we prewhiten the data before performing sub-clustering on them. Out of practicality, we may consider only feature combinations of chosen order N′ and lower (i.e., $k = 1, \ldots, N'$), where N′ < N if N is infeasibly large.
Each sub-clustering result, or evidence, is encoded as an n × n similarity matrix,
$$S_{k,l} = \left[ s_{k,l}(i, j) \right]$$
where $l = 1, \ldots, L_k$ (i.e., the lth sub-clustering of the kth order of feature combinations) for $k = 1, \ldots, N'$, and $i, j = 1, \ldots, n$ (i.e., $s_{k,l}(i, j)$ is the co-association between objects i and j of the data, which is 1 if they belong to the same sub-cluster, and 0 if not). The set of all evidence is the clustering ensemble
$$E = \{\, S_{k,l} : k = 1, \ldots, N';\; l = 1, \ldots, L_k \,\}$$
Note that E is not a matrix, since the $L_k$ are not necessarily the same.
Next, we measure the goodness-of-clustering of each sub-clustering in E by its ramped average silhouette value,
$$g_{k,l} = \max\!\left( 0,\; \frac{1}{n} \sum_{i=1}^{n} \mathrm{sil}_{k,l}(i) \right)$$
at corresponding indices k, l.
Finally, we find the co-association matrix as the average of all evidence in E, weighted by their goodness-of-clustering and the inverse of the number of sub-clusterings of the same order:
$$C(i, j) = \frac{1}{N'} \sum_{k=1}^{N'} \frac{1}{L_k} \sum_{l=1}^{L_k} g_{k,l} \, s_{k,l}(i, j)$$
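The weighted evidence accumulation described in this section can be sketched in Python as follows. This is an illustrative sketch rather than the Matlab implementation: the function name, the dictionary layout of the evidence, and the toy values are hypothetical, and the final division by the total weight is an assumption of this sketch (a positive rescaling of the co-association matrix does not change the ordering of its entries).

```python
def weighted_coassociation(evidence, goodness):
    """Combine evidence matrices grouped by order of dimensionality.

    `evidence[k]` is a list of n x n 0/1 matrices for the k-order
    sub-clusterings and `goodness[k]` the matching list of
    goodness-of-clustering values. Each matrix is weighted by its
    goodness and by the inverse of the number of sub-clusterings of
    its order.
    """
    first_order = next(iter(evidence.values()))
    n = len(first_order[0])
    C = [[0.0] * n for _ in range(n)]
    total_weight = 0.0
    for k, matrices in evidence.items():
        order_weight = 1.0 / len(matrices)
        for S, g in zip(matrices, goodness[k]):
            w = g * order_weight
            total_weight += w
            for i in range(n):
                for j in range(n):
                    C[i][j] += w * S[i][j]
    if total_weight > 0:
        # Normalisation by total weight is an assumption of this sketch.
        C = [[c / total_weight for c in row] for row in C]
    return C


same_01 = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # objects 0 and 1 together
same_12 = [[1, 0, 0], [0, 1, 1], [0, 1, 1]]  # objects 1 and 2 together
evidence = {1: [same_01, same_12], 2: [same_01]}
goodness = {1: [0.8, 0.2], 2: [0.4]}
C = weighted_coassociation(evidence, goodness)
```

Here the poorly rated first-order sub-clustering contributes little, so the pair (0, 1) ends up with a much higher co-association than the pair (1, 2).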
An implementation
We provide code of an example implementation of combination clustering as a function for the Matlab software platform [14] in Table 1. A summary overview of the algorithm is shown in Fig. 3 with some omissions. Notably, our implementation declares variables for all selected combinations prior to sub-clustering, preallocates memory for ensemble-related variables (e.g., E and g), and parallelises the sub-clustering loop for efficiency.
Table 1
Example Matlab implementation of combination clustering. The subfunction “Whiten” by original author Colorado Reed is included.
Image, table 1
Fig. 3
Conceptual flowchart of a combination clustering algorithm. Each pass through the primary loop adds another sub-clustering to the ensemble (E) with a corresponding goodness-of-clustering measure (g). This loop may be parallelised.
The function, CombClust, takes as arguments X, k, and combinations. Argument X should be a matrix describing the data, consisting of n objects with N features, with dimensions n × N. Argument k should be the number of clusters for the desired final clustering result. Argument combinations is a logical matrix, with N columns and O rows, specifying the combinations of features used by each sub-clustering. O is the total number of sub-clusterings over the whole ensemble. The N columns correspond to the N features extracted from data X. Each row of this matrix is a logical vector that specifies which features of X constitute a combination to be used for one sub-clustering. For example, a first-order combination of features would be specified by a row containing one entry that is True, and all other entries that are False.
CombClust takes three further arguments: Boolean flags doGood, doWhiten, and doWeightOrder. They respectively control whether to perform weighting by goodness-of-clustering, prewhitening, and weighting by the number of sub-clusterings within each order of dimensionality of their combination of features.
CombClust returns variables clusters, C, E, g, and Ns. Variable clusters is a vector of size n that assigns an integer label to each object, representing the final clustering result of the algorithm. C is the final co-association matrix (sized n × n) resulting from evidence accumulation and weighting. Ns is a vector listing all orders of dimensionality used in the clustering ensemble; let us use N’ to denote its length. E contains the evidence resulting from the ensemble of all sub-clusterings. It is an N’ × 1 cell array where the lth cell corresponds to an order of dimensionality given by Ns(l) and contains the evidence from all sub-clusterings of that order. This evidence is conditioned as an n × n × L logical array, where L is the number of sub-clusterings of that order, formed as the concatenation of all applicable n × n similarity matrices of logical type in the third dimension. Variable g is also an N’ × 1 cell array, containing the goodness-of-clustering values from the ensemble for each order of dimensionality, conditioned as vectors.
The sub-clustering routine uses Matlab's kmeans function for the k-means clustering with initial seeds randomly chosen from the sample. The number of sub-clusters is also chosen at random as either the user-given k or the user-given k + 1; according to Fred & Jain [3], such variation increased the robustness of the ensemble solution.
We also set a default behaviour for combination selection based on exhaustive/random selection. This particular strategy was used because sub-clustering of all possible combinations of features is computationally infeasible for large numbers of features. Thus, the strategy selected all possible k-order combinations for tractable values of k, and a limited number of combinations for intractable values of k. To prevent a systematic bias in the latter case, those combinations were selected randomly. By default, all possible combinations of features would be selected up to and including the 9th order, with a maximum of 1,000 combinations per order and a minimum of 50. For orders whose number of possible combinations exceeds the maximum, 1,000 would be selected randomly. For those whose number falls below the minimum, additional combinations would be selected at random to make up 50. For comparison, Fred & Jain [3] reported being able to achieve convergent results for simple data sets using 50 sub-clusterings, and for complex data sets using 200 sub-clusterings.
In the evidence accumulation step, the co-association matrix is converted to a distance matrix and clustered using hierarchical clustering with the average linkage criterion [12] until k clusters are produced.
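The default combination-selection strategy described above can be sketched in Python as follows. This is our reading of the description, not a translation of the Matlab code in Table 1: the function name and parameter names are hypothetical, and we assume that topping up to the minimum is done by repeating randomly chosen combinations of the same order.

```python
import random
from itertools import combinations
from math import comb


def select_combinations(n_features, max_order=9, max_per_order=1000,
                        min_per_order=50, seed=0):
    """Return rows of booleans, one row per sub-clustering.

    Each row marks which features form the combination, analogous to the
    logical `combinations` matrix taken by CombClust.
    """
    rng = random.Random(seed)
    rows = []
    for k in range(1, min(max_order, n_features) + 1):
        total = comb(n_features, k)
        if total <= max_per_order:
            # Tractable order: take every combination exhaustively.
            chosen = list(combinations(range(n_features), k))
            # Assumption: top up with random repeats to reach the minimum.
            while len(chosen) < min_per_order:
                chosen.append(rng.choice(chosen[:total]))
        else:
            # Intractable order: draw distinct combinations at random.
            seen = set()
            while len(seen) < max_per_order:
                seen.add(tuple(sorted(rng.sample(range(n_features), k))))
            chosen = sorted(seen)
        for combo in chosen:
            row = [False] * n_features
            for f in combo:
                row[f] = True
            rows.append(row)
    return rows


rows = select_combinations(15)
per_order = {}
for row in rows:
    per_order[sum(row)] = per_order.get(sum(row), 0) + 1
```

For 15 features, this yields 50 first-order rows (15 combinations plus repeats), all 105 and 455 combinations of orders 2 and 3, and 1,000 random combinations for each of orders 4 to 9.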
Experimental results
We tested the effectiveness of our proposed method in four experiments. In the first experiment, we compared the clustering performance of combination clustering to ordinary EAC; in the second experiment, we made a broader comparison between the two algorithms for a range of signal-to-noise ratio (SNR) distributions among features of the data; in the third experiment, we contrasted the performance of combination clustering with and without the prewhitening procedure; and in the fourth experiment, we contrasted the performance of combination clustering with and without the evidence weighting factors.
Our experiments were performed with synthetic data, generated to represent data sets of varying numbers of features and SNR distributions. We used the Matlab implementation of combination clustering, given in Table 1, and its default parameters throughout the experiments unless stated otherwise. Clustering performance was measured as normalised mutual information, as also used by Fred & Jain [3], but here computed between the clustered groups and the true groups. Its formula is given in Eq. 4 of their paper.
For the first experiment, we compared the performance of combination clustering to ordinary EAC with respect to the number of features of the data set being clustered. We wrote a Matlab implementation of ordinary EAC based on Fred & Jain's [3] description, whose evidence also derived from k-means clustering. Here, the data were sub-clustered in their entirety (using all features at once) over 1,000 runs with randomised seeds; the values of k were chosen between the user-given k and user-given k + 1, the same as in our combination clustering implementation. The evidence accumulation step was also identical. The total number of sub-clusterings performed is about an order of magnitude larger for combination clustering than for ordinary EAC. This difference was also reflected in their computational times (see supplementary material Table S2). In our environment (see footnote 1), the typical computational times for data of 50 objects with 300 features were 0.5 seconds for ordinary EAC and 6.7 seconds for combination clustering.
Data were generated as consisting of one informative feature that separated two underlying groups of objects with a 10:1 SNR, and the remaining features having zero SNR. All the features were normalised to have zero mean and variance of one; noise components were generated as independent Gaussian processes with zero covariance between the components. We tested the performance of both clustering algorithms over 11 feature set sizes ranging from 1 to 2,000, each randomly generated 10 times. We plot the average of their performance in Fig. 4. The curves here and throughout the experiments were generalised logistic functions, fitted using the trust-region method [9], with a fixed upper asymptote of 1, and where the independent variable was the number of features (log-transformed).
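The performance measure used throughout the experiments, normalised mutual information between a clustering and the true grouping, can be sketched in Python as below. We assume here the normalisation by the mean of the two partition entropies, i.e. 2I/(H_a + H_b); readers should check this against Eq. 4 of Fred & Jain [3]. The function name and the convention for two trivial partitions are our own.

```python
from collections import Counter
from math import log


def normalised_mutual_information(labels_a, labels_b):
    """NMI between two labelings, normalised as 2*I / (H_a + H_b)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        n_ab / n * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions are a single cluster (our convention)
    return 2.0 * mutual / (h_a + h_b)


truth = [0, 0, 1, 1]
nmi_match = normalised_mutual_information(truth, [1, 1, 0, 0])
nmi_unrelated = normalised_mutual_information(truth, [0, 1, 0, 1])
```

The measure is invariant to how clusters are named, so a clustering that recovers the true groups with swapped labels still scores 1, while a labelling that is independent of the true groups scores 0.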
Fig. 4
Comparison of the performance between the original EAC and combination clustering. As the number of noisy features increased, both the original method (red squares, dotted line) and our method (blue circles, solid line) degraded in performance, measured by normalised mutual information. Our proposed method performed significantly better than the old method for numbers of features in the range 3–50 (one-tailed sign tests, W(10) ≥ 9, padj < .024, Benjamini–Hochberg correction). Error bars are bootstrapped 95% confidence intervals; curves are fitted generalised logistic functions [10].
As may be expected, clustering performance generally decreased with increasing number of features carrying no clusterable information. Our proposed method outperformed ordinary EAC over a large range of set sizes: statistically significantly over the range 3–50 features (one-tailed, paired sample sign tests, W(10) ≥ 9, padj < .024, Benjamini–Hochberg correction).
In the next experiment, we repeated our initial experiment with differently generated data sets. Here, we generated data sets with a distribution of SNRs among the features following Zipf's law. Zipf's law originally described the frequency of a word in a corpus of natural language as inversely proportional to its rank, but has since been found to describe many other real-world observations [2,16]. Its discrete probability distribution may be given by
$$Z(k; s, N) = \frac{1 / k^{s}}{\sum_{m=1}^{N} 1 / m^{s}}$$
where, in this context, k is the SNR rank of the features, N is the total number of features, and the term of the denominator is a normalisation factor. Parameter s controls the steepness of the SNR distribution as a function of k. When $s = 0$, all features are equally informative. When $s = \infty$, only 1 feature is informative, equivalent to the situation in the first experiment. Using this probability mass function, we fixed the feature-wise SNR distribution of our generated data sets to Zipfian distributions with varied s and N.
As in the first experiment, we explored 11 values of N ranging from 1 to 2,000. In addition, we also systematically varied $s \in \{0, 1, 2, 3, 4, \infty\}$. Thus, 66 parameter combinations were used to generate the data sets.
The results of the second experiment are summarised in Fig. 5. Note that the first experiment's results are replotted in the curves where $s = \infty$. In general, we observed that combination clustering performance was superior to ordinary EAC's for larger numbers of features (statistics given in Table 2), and this advantage diminished for smaller s. These results are consistent with our intention of making EAC more robust to data that have many uninformative features.
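The Zipfian probability mass function used to shape the SNR distribution can be sketched in Python as follows. The function name is hypothetical, and the sketch only evaluates the distribution itself; how its values are mapped onto feature-wise SNRs in the data generator is not shown here.

```python
from math import inf


def zipf_pmf(k, s, n_features):
    """Zipfian probability of rank k among n_features ranks, exponent s."""
    if s == inf:
        # Limiting case: all probability mass on the top-ranked feature.
        return 1.0 if k == 1 else 0.0
    norm = sum(1.0 / m ** s for m in range(1, n_features + 1))
    return (1.0 / k ** s) / norm


profile_s1 = [zipf_pmf(k, 1, 4) for k in range(1, 5)]
profile_s0 = [zipf_pmf(k, 0, 4) for k in range(1, 5)]
profile_inf = [zipf_pmf(k, inf, 4) for k in range(1, 5)]
```

With four features, s = 1 gives the profile 0.48, 0.24, 0.16, 0.12; s = 0 gives a uniform profile; and s = ∞ places all the mass on the first-ranked feature, as in the first experiment.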
Fig. 5
The performance of combination clustering and ordinary EAC for various SNR distributions via the Zipf exponent (s). The results are identical to Fig. 4 where s = ∞. Error bars are bootstrapped 95% confidence intervals; curves are fitted generalised logistic functions [10]. The bottom panel plots the number of features where the fitted curves reached 0.5 normalised mutual information, for each clustering algorithm and Zipf exponent.
Table 2
Combination clustering advantage over ordinary EAC for varying Zipf exponent (s). For each tested s, the proportion of data sets that resulted in higher normalised mutual information by the combination clustering algorithm is given, after collapsing over all feature set sizes and discarding tied performances.
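| s | Proportion better (%) | Sign test statistic (paired sample, two-tailed) | p |
|---|---|---|---|
| 0 | 52 | W(50) = 26 | .888 |
| 1 | 49 | W(49) = 24 | 1 |
| 2 | 60 | W(60) = 36 | .155 |
| 3 | 66 | W(73) = 48 | .010* |
| 4 | 65 | W(81) = 53 | .007* |
| ∞ | 64 | W(87) = 56 | .010* |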
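As a check on how the p values in Table 2 relate to the sign test statistics, the following Python sketch computes an exact two-tailed sign test from the binomial distribution. It assumes that W(n) = w means combination clustering performed better in w out of n non-tied data sets; the function name is hypothetical.

```python
from math import comb


def sign_test_two_tailed(wins, n):
    """Exact two-tailed sign test p value for `wins` successes out of n."""
    tail = max(wins, n - wins)
    upper = sum(comb(n, i) for i in range(tail, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper)


p_s0 = sign_test_two_tailed(26, 50)
p_s1 = sign_test_two_tailed(24, 49)
p_s3 = sign_test_two_tailed(48, 73)
```

Under this assumption, the first two rows of the table are reproduced (approximately .888 and exactly 1), which suggests that the tabulated values are consistent with an exact binomial sign test.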
The performance of combination clustering and ordinary EAC for various SNR distributions, controlled via the Zipf exponent (s). For the value of s used in Fig. 4, the results are identical to that figure. Error bars are bootstrapped 95% confidence intervals; curves are fitted generalised logistic functions [10]. The bottom panel plots the number of features at which the fitted curves reached 0.5 normalised mutual information, for each clustering algorithm and Zipf exponent.
Combination clustering advantage over ordinary EAC for varying Zipf exponent (s). For each tested s, the proportion of data sets for which combination clustering achieved higher normalised mutual information is given, after collapsing over all feature set sizes and discarding tied performances.
We additionally observed that the rate of clustering failure with respect to the number of features was slower for combination clustering than for ordinary EAC, as indicated by the shallower slopes of the fitted curves in the combination clustering condition. We numerically computed and compared the maximum slope of each curve. For all but one of the six tested values of s, the slope was shallower for combination clustering than for ordinary EAC, and the mean slope difference was statistically significant (one-tailed, paired-sample t-test). This suggests that combination clustering still recovered some of the true underlying groups at numbers of features where ordinary EAC had practically reached its asymptotic minimum (e.g., at 500 features; Fig. 5).
For the third experiment, we looked at the effect of prewhitening the recombined data of each sub-clustering within the combination clustering method. In the previous experiments there was no noise covariance between the features, so we would not expect prewhitening to have much effect on clustering performance (see supplementary material Figure S1). Therefore, for this experiment, we generated data sets that do have correlated noise between the features.
We controlled the covariance in our data sets with a five-step procedure. First, we generated N independent noise components, as in the previous experiments. Second, we scaled the relative variances of the noise components to follow a Zipfian distribution. At this step, the resulting noise covariance matrix has diagonal values given by Z(k; 1, N) and zeroes in all other entries. Third, we applied a random rotation to the matrix to introduce non-zero covariance between the features. Fourth, we normalised the feature means to zero and variances to one. Fifth, we added the signal component as in the first experiment, and then normalised this final set of features.
We compared the clustering performance of combination clustering with and without prewhitening, over the same range of feature set sizes as in the previous experiments, by setting the doWhiten flag of our implementation accordingly. We randomly generated 30 data sets for each number of features, instead of 10 as in the previous experiments. We plot their average clustering performance in Fig. 6.
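The five-step procedure described above can be sketched in NumPy as follows. This is an illustrative reconstruction rather than the Matlab code accompanying the method: the function names, the QR-based random rotation, and the form of the `signal` array (how the signal was constructed in the first experiment is not restated here) are all assumptions made for the example.

```python
import numpy as np

def zipf_weights(n, s=1.0):
    """Normalised Zipf weights Z(k; s, n) = k**-s / sum_j j**-s for k = 1..n."""
    w = np.arange(1, n + 1, dtype=float) ** -s
    return w / w.sum()

def random_rotation(n, rng):
    """Random n-by-n rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    q = q * np.sign(np.diag(r))      # remove the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # force det = +1 so q is a proper rotation
    return q

def zscore(x):
    """Normalise each feature (column) to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def correlated_noise_features(n_samples, n_features, signal, rng, s=1.0):
    noise = rng.standard_normal((n_samples, n_features))      # step 1: independent noise
    noise = noise * np.sqrt(zipf_weights(n_features, s))      # step 2: Zipfian variances
    noise = noise @ random_rotation(n_features, rng).T        # step 3: random rotation
    noise = zscore(noise)                                      # step 4: normalise noise
    return zscore(noise + signal)                              # step 5: add signal, normalise

# Hypothetical usage: two groups whose signal sits in the first feature only.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
signal = np.zeros((200, 8))
signal[:, 0] = 2.0 * labels          # assumed signal shape and strength
X = correlated_noise_features(200, 8, signal, rng)
```

Rotating the rows by an orthogonal matrix Q turns the diagonal covariance D of step 2 into Q D Qᵀ, which is what introduces the off-diagonal covariance between features.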
Fig. 6
Effect of prewhitening within combination clustering for dubious data sets with correlated noise. Prewhitening generally improved clustering performance; this was statistically significant for feature sizes 2–10 (one-tailed, paired sample sign tests, padj ≤ .014, Benjamini–Hochberg correction).
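For readers unfamiliar with prewhitening, the following sketch shows one common way to whiten a recombined subset of features before clustering it. The eigendecomposition-based (ZCA-style) transform, the function name `prewhiten`, and the toy data are assumptions for illustration; the transform applied when the doWhiten flag is set may differ.

```python
import numpy as np

def prewhiten(x, eps=1e-12):
    """Centre x, then rescale it so that its sample covariance becomes the identity."""
    xc = x - x.mean(axis=0)
    cov = np.atleast_2d(np.cov(xc, rowvar=False))
    evals, evecs = np.linalg.eigh(cov)                  # cov = evecs @ diag(evals) @ evecs.T
    scale = 1.0 / np.sqrt(np.maximum(evals, eps))       # guard against tiny eigenvalues
    return xc @ evecs @ np.diag(scale) @ evecs.T

# Hypothetical usage: correlated toy features, whitening one two-feature combination.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
x = rng.standard_normal((500, 3)) @ mixing
sub = x[:, [0, 2]]                  # one combination of features
sub_white = prewhiten(sub)          # decorrelated, unit-variance version of `sub`
```

After whitening, distance-based clustering of the combination is no longer dominated by whichever direction happens to carry the most correlated noise variance.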
For dubious feature sets with correlated noise, we found that prewhitening the sub-clustering data generally improved clustering performance, most notably for feature set sizes of 2–20. The difference was statistically significant for sizes of 2–10 features (one-tailed, paired sample sign tests, padj ≤ .014, Benjamini–Hochberg correction).
For our fourth and final experiment, we examined the effect of evidence weighting on clustering performance. We generated data sets as in the first experiment, but randomly generated 20 sets for each number of features rather than 10. By setting the doGood and doWeightOrder flags of our combination clustering implementation, we clustered under four conditions: no weighting (“none”), weighting by “order of dimensionality”, weighting by “goodness-of-clustering”, and weighting by both order of dimensionality and goodness-of-clustering (“both”). We plot their average performance in Fig. 7.
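The four weighting conditions can be thought of as different per-clustering weights in the evidence accumulation step. The sketch below is a minimal illustration under stated assumptions: the argument names `do_good` and `do_weight_order` only mirror the doGood and doWeightOrder flags, the rule of multiplying the two weights in the “both” condition is an assumption, and the example weights are made-up numbers rather than the measures used by the method.

```python
def combine_weights(goodness, order, do_good, do_weight_order):
    """Per-clustering weights for one condition (product rule assumed for "both")."""
    return [(g if do_good else 1.0) * (o if do_weight_order else 1.0)
            for g, o in zip(goodness, order)]

def weighted_coassociation(labelings, weights):
    """Entry (i, j) is the weighted fraction of clusterings that put samples
    i and j in the same cluster."""
    n = len(labelings[0])
    total = sum(weights)
    co = [[0.0] * n for _ in range(n)]
    for labels, w in zip(labelings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += w / total
    return co

# Hypothetical ensemble of three sub-clusterings of four samples.
labelings = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]]
goodness = [0.9, 0.1, 0.8]   # made-up goodness-of-clustering scores
order = [1.0, 1.0, 2.0]      # made-up order-of-dimensionality weights

co_none = weighted_coassociation(labelings, combine_weights(goodness, order, False, False))
co_good = weighted_coassociation(labelings, combine_weights(goodness, order, True, False))
```

In this toy ensemble the second clustering disagrees with the other two about samples 0 and 1; down-weighting it by its low goodness score raises their co-association relative to the unweighted case.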
Fig. 7
Effect of evidence weighting on clustering performance. Weighting by “goodness-of-clustering” and “both” resulted in better clustering performance, most notably for data sets consisting of 10 features (Friedman test, p < .001). The right panel plots the number of features at which the fitted curves reached 0.5 normalised mutual information, for each weighting condition.
We observed a small difference between the weighting conditions after factoring out feature set size (Friedman test). The largest apparent difference was seen for data sets of 10 features (Friedman test, p < .001). In both views, the “none” condition had the lowest median performance, and two conditions, “goodness-of-clustering” and “both”, had higher median performances than the remaining two, “none” and “order of dimensionality”. These results suggest that weighting by goodness-of-clustering improves clustering performance, and that weighting by order of dimensionality neither improves nor worsens it, at least for these generated data sets. Given our original reasoning for weighting by order of dimensionality, we suspect it may still be useful for other types of data sets; however, this matter is left for future investigation.
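For reference, the Friedman test used above ranks the weighting conditions within each data set and asks whether the rank sums differ more than chance would allow. The sketch below computes the classical statistic without a tie correction, together with its chi-square p-value for the special case of four conditions (three degrees of freedom). The function names and the example scores are hypothetical, and a statistics package may apply a tie correction that this sketch omits.

```python
from math import erfc, exp, pi, sqrt

def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def friedman_statistic(rows):
    """Friedman chi-square; each row holds one data set's score per condition."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

# Made-up scores for five data sets under the conditions
# (none, order of dimensionality, goodness-of-clustering, both).
scores = [[0.40, 0.42, 0.55, 0.56],
          [0.31, 0.30, 0.47, 0.45],
          [0.52, 0.53, 0.60, 0.62],
          [0.20, 0.22, 0.35, 0.33],
          [0.44, 0.43, 0.58, 0.59]]
q = friedman_statistic(scores)
p = chi2_sf_df3(q)
```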
Final remarks
We have proposed, and given an implementation for, a variant of EAC that was designed to handle dubious data consisting of many features that are uninformative to clustering. Our experiments have shown that our proposed method is superior to ordinary EAC for a range of data set characteristics—particularly when informative components are concentrated in only a few of the features, and for larger feature set sizes.