
Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure.

Joshua P Kilborn, David L Jones, Ernst B Peebles, David F Naar.

Abstract

Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.


Keywords:  Monte Carlo; PRIMER‐E; SIMPROF; constrained clustering; data simulation; permutation testing

Year:  2017        PMID: 28405271      PMCID: PMC5383504          DOI: 10.1002/ece3.2760

Source DB:  PubMed          Journal:  Ecol Evol        ISSN: 2045-7758            Impact factor:   2.912


Introduction

In data‐rich scientific studies, it is often necessary to apply a clustering algorithm to detect groups of homogenous objects with respect to a set of descriptors (i.e., measured variables). Detection of groups is useful in ecology, economics, genetics, and other disciplines that analyze large, multidimensional datasets. Clustering techniques for multivariate datasets are diverse and can be drawn from methods derived from one or more of the following approaches: sequential versus simultaneous, agglomerative versus divisive, monothetic versus polythetic, hierarchical versus nonhierarchical, probabilistic versus nonprobabilistic, and constrained versus unconstrained (Legendre & Legendre, 2012). In many cases, these methods are sensitive to the sequence of the steps within the algorithm, to random decisions enforced by the algorithm, or to arbitrary assignment of stopping rules, numbers of clusters, or levels of resemblance that define homogeneity.

Resemblance profiles and clustering criterion

Multivariate studies of complex datasets are often analyzed statistically using distance‐based (db) methods. These db‐methods begin with a series of pairwise comparisons between all objects to determine their relative resemblances with respect to a set of descriptors, and these resemblance values can be interpreted as either similarity or dissimilarity. The selection of a resemblance measure is discretionary and varies with the type of data being analyzed as well as the method of analysis (Batagelj & Bren, 1995; Clarke, Somerfield, & Chapman, 2006; Faith, Minchin, & Belbin, 1987). Clarke, Somerfield, and Gorley (2008) developed the SIMPROF routine based on the concept of a “similarity profile,” which represents the matrix of pairwise similarity values between any set of objects. SIMPROF was implemented as a clustering solution in v‐6 of the PRIMER software package and was first used to describe community structure in marine nematodes (Liu, Zhang, & Huang, 2007) and larval marine fishes (Muhling, Beckley, Koslow, & Pearce, 2008). Over the last decade, the number of peer‐reviewed publications that incorporate SIMPROF in some portion of their methodologies has grown. A search of Web of Science© for the term “SIMPROF” (searched 20 November 2016) returned 32 publications since 2007 and indicated the original Clarke et al. (2008) paper had 279 citations. Publications utilizing SIMPROF tend to come from marine ecology, with studies focusing on beta‐diversity in reef corals (Huang et al., 2015), diatoms (Hernandez Almeida & Siqueiros Beltrones, 2012), fishes (Macedo‐Soares, Freire, & Muelbert, 2012; Selleslagh et al., 2009), fish gut contents (French, Clarke, Platell, & Potter, 2013), macrofauna (Rehm, Hooke, & Thatje, 2011), and sediment microbes (Gilbert et al., 2009). 
SIMPROF‐based studies have also been conducted on dinoflagellates and ciguatera poisoning (Parsons, Settlemier, & Ballauer, 2011), food webs (Kelly & Scheibling, 2012), habitat classifications (Gonzalez‐Mirelis & Buhl‐Mortensen, 2015; Valesini, Hourston, Wildsmith, Coen, & Potter, 2010), species/environment relationships (Travers, Potter, Clarke, & Newman, 2012), metagenomics (Khodakova, Smith, Burgoyne, Abarno, & Linacre, 2014), and otolith elemental microchemistry (Moore & Simpfendorfer, 2014). While the preceding literature review reflects the recent use of the algorithm in ecological applications, it is likely that the method has uses in other disciplines as well. Clarke et al. (2008) demonstrated the use of SIMPROF in conjunction with agglomerative hierarchical clustering via the unweighted pair group method with arithmetic mean (UPGMA; Figure 1), and they also described two theoretical corollaries to the functional dynamics of their algorithm. They proposed that (1) the test for multivariate structure would become more powerful as the number of descriptors increased and (2) that the resolution of any structure identified (i.e., number of groups, G) might be far finer (greater) than is meaningfully interpreted (Clarke et al., 2008). It is our understanding that these corollaries have yet to be tested empirically with numerical simulations, and given recent inconsistencies in the performance of other permutation‐ and distance‐based hypothesis tests (e.g., ANOSIM and MANTEL tests; Anderson & Walsh, 2013; Legendre & Fortin, 2010), we felt this action was warranted.
Figure 1

Theoretical diagram of the process flow for DISPROF clustering with UPGMA: (1) Data are pretreated and configured. (2) An appropriate resemblance metric is applied to the pretreated dataset. (3) The UPGMA site‐connection linkage is assembled. (4) DISPROF is employed in an iterative process to identify the grouping structure in the data and create breaks in the associated linkage tree. (5) DISPROF settles on a final solution, and a two‐dimensional dendrogram visualization is created

The present paper intends to improve our understanding of the proposed corollaries to the Clarke et al. (2008) approach, to help users of SIMPROF avoid potential pitfalls during analysis and interpretation, and to encourage use of the method outside of the ecological focus. We tested the SIMPROF method by estimating and describing the type I and type II error rates for the hypothesis test for multivariate structure while varying the datasets’ distribution type, dimensionality, data‐cloud overlap between adjacent clusters, and data‐cloud shape or overdispersion. We also elucidated the effects of dataset configuration variability on the quality of the solution achieved by examining the level of correspondence between the algorithm's clustering solutions and the known grouping partitions for datasets with structure.

Review of the SIMPROF approach

For a set of objects, a similarity profile is created by plotting the rank‐ordered similarity values versus each value's rank (Figure 2a). This profile is ultimately checked against the mean rank‐ordered similarity values for many randomized profiles (i.e., ≥1,000) created via permuting the original descriptor measurements across objects. The π statistic is created by summing the absolute deviations of the observed profile from the mean of the set of permuted profiles. Intuitively, one can see that if an observed profile has many more high and/or low similarity values than would be expected under the null conditions, then multivariate structure would be deemed present (Figure 2b). The null hypothesis (H₀) of “no multivariate structure among objects, with respect to the descriptors” in the original dataset, is formally tested by examining the placement of the observed π statistic relative to the null distribution of all permuted π statistics. To model the null distribution of the π statistic, an additional set of permuted similarity profiles (i.e., ≥1,000 iterations) is created, and their associated π statistics are calculated with respect to the same mean profile used to calculate the original observed π statistic. The p‐value for the observed π statistic is calculated as the proportion of π statistics that are at least as large as the observed statistic versus the total number of π statistics calculated via permutation (Clarke et al., 2008).
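The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the PRIMER or Fathom Toolbox implementation: the function names (`sorted_profile`, `permute_descriptors`, `profile_test`), the use of Euclidean distance, the small default permutation counts, and the choice to count the observed π among the permuted values when forming the p‐value are all assumptions made for the example.

```python
import math
import random


def sorted_profile(data):
    """Rank-ordered pairwise Euclidean distances between objects (rows)."""
    n = len(data)
    return sorted(
        math.dist(data[i], data[j]) for i in range(n) for j in range(i + 1, n)
    )


def permute_descriptors(data, rng):
    """Shuffle each descriptor (column) independently across objects."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def profile_test(data, n_mean=200, n_null=199, seed=0):
    """Return (pi_observed, p_value) for the profile-based test of structure."""
    rng = random.Random(seed)
    observed = sorted_profile(data)

    # First set of permutations: estimate the mean profile expected under H0.
    permuted = [sorted_profile(permute_descriptors(data, rng)) for _ in range(n_mean)]
    mean_profile = [sum(values) / n_mean for values in zip(*permuted)]

    def pi_statistic(profile):
        # Sum of absolute deviations from the mean permuted profile.
        return sum(abs(a - b) for a, b in zip(profile, mean_profile))

    pi_observed = pi_statistic(observed)

    # Second, independent set of permutations: null distribution of pi.
    null_pis = [
        pi_statistic(sorted_profile(permute_descriptors(data, rng)))
        for _ in range(n_null)
    ]

    # Assumption: the observed value is counted as one member of the null set.
    at_least_as_large = 1 + sum(1 for value in null_pis if value >= pi_observed)
    return pi_observed, at_least_as_large / (n_null + 1)
```

With two well‐separated groups of objects, the observed profile contains many more small and large distances than the permuted profiles, so the returned p‐value falls at or near its minimum of 1/(n_null + 1).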
Figure 2

Two examples of Euclidean‐dissimilarity profiles: Resemblance value sort order is increasing along the x‐axis, and the sorted pairwise dissimilarity values are increasing along the y‐axis. (a) A dissimilarity profile for a simulated unstructured dataset drawn from the exponential probability distribution with [N × P] = [50 × 50]. The observed profile is within the 99% confidence envelope based on 999 permutations of the observed data. (b) A dissimilarity profile for a simulated structured dataset drawn from the normal distribution with two groups having equal variance, [N × P] = [50 × 50], and Ov = 0.01. The observed profile has many dissimilarity values that are above and below the expected mean permuted profile, and its associated 99% confidence envelope, thereby signifying the presence of structure in the dataset

Resemblance profile consideration is inserted into UPGMA clustering as a clustering decision criterion in an iterative process (Figure 1). The data are required to be in [N × P] matrix format, where the N rows represent individual objects (sampling units) and the P columns of the matrix represent the descriptors (measured variables). In many real‐world, large datasets, there are often some objects where certain descriptor measurements are missing due to either technical failure or human error. When compiling these data, we must remove objects that do not contain an accurate measurement for all descriptors of interest (zero‐value measurements may be appropriate, but missing measurements are not). Once the data are assembled and checked for quality, user‐defined pretreatments are applied (e.g., standardization and/or normalization) and an appropriate resemblance measure is employed. One advantage to the approach considered here is the use of distribution‐free statistics, which releases the analyst from the often‐unrealistic assumption of Gaussian data distributions, and decreases the need for data transformations to satisfy those assumptions.
Another advantage to using distribution‐free significance tests is that they are often generalized to accept any of the potential pool of resemblance measures available to researchers (Legendre & Legendre, 2012). After a square, symmetric distance‐matrix is produced, an UPGMA clustering solution is constructed to reflect the magnitude of apparent resemblance between the objects with respect to the descriptors. SIMPROF can be used as an iterative decision criterion to assess each node of the UPGMA dendrogram to determine whether the objects connected by any node are clusters of relative homogeneity, or whether there is additional multivariate structure present in those remaining objects (Clarke et al., 2008). Recall that the H₀ tested by SIMPROF is of “no multivariate structure among objects with respect to the descriptors.” When assessing an UPGMA dendrogram, SIMPROF begins hypothesis testing at the node that has the smallest similarity value and that contains all objects. If H₀ is rejected and structure is deemed present in the objects connected by the top‐level node, the SIMPROF routine repeats independently on the two sets of objects joined at that node. SIMPROF iteratively assesses the presence of structure for all newly identified subsets within the original top‐level subsets until a stopping point is reached and all possible subsets have been identified. The stopping point for the algorithm is when either a nonsignificant p‐value (i.e., p‐value ≥ α) for all remaining subsets is obtained (failure to reject H₀), or when the number of objects that remain connected within untested subsets is no greater than two (Clarke et al., 2008). Due to the multiple‐testing aspect of the algorithm, a p‐value correction method can be employed when determining significance for tests between sets of objects (Clarke et al., 2008). The primary output of UPGMA clustering with SIMPROF is a grouping partition containing a cluster assignment for each object. 
Using this decision framework creates immediate advantages when interpreting the clustering dendrogram in that (1) the researcher is no longer required to arbitrarily assign a single level of similarity that defines all clusters and (2) the clusters can be defined by varying levels of similarity. To obtain a two‐dimensional ordination of the identified groups in hyperdimensional space, a Euclidean embedding can be produced via principal coordinates analysis (PCoA; Gower, 1966). This ordination is based on the same symmetric resemblance matrix used in the clustering process, and the group assignments can be overlain in place of the object labels to present a final clustering diagram.
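The iterative decision process described above can be sketched as a short recursion over a binary dendrogram. In this hypothetical Python example the tree is represented as nested tuples with integer leaves, and `p_value` is any caller‐supplied function that returns a p‐value for a set of objects; both the representation and the names are assumptions for illustration, and no multiple‐testing correction is applied.

```python
def leaves(node):
    """Collect the object indices under a dendrogram node (nested tuples)."""
    if isinstance(node, int):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def partition(node, p_value, alpha=0.05):
    """Split the tree top-down while the test for structure is significant.

    A subset becomes a final group when it holds no more than two objects,
    or when the test fails to reject the null hypothesis (p >= alpha).
    """
    members = leaves(node)
    if len(members) <= 2 or p_value(members) >= alpha:
        return [members]
    left, right = node
    return partition(left, p_value, alpha) + partition(right, p_value, alpha)


# Toy stand-in for the profile test: "structure" means a wide spread of values.
VALUES = {0: 0.0, 1: 0.1, 2: 0.2, 3: 10.0, 4: 10.1, 5: 10.2}


def toy_p_value(members):
    spread = max(VALUES[m] for m in members) - min(VALUES[m] for m in members)
    return 0.001 if spread > 5 else 0.5


tree = (((0, 1), 2), ((3, 4), 5))
groups = partition(tree, toy_p_value)  # [[0, 1, 2], [3, 4, 5]]
```

Because each branch is tested independently, the final groups can sit at different resemblance levels in the dendrogram, which is the advantage noted in the text.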

Methods

Rationale

The only modification we made to the original Clarke et al. (2008) algorithm was to use dissimilarities (or distance) for the computation of the resemblance profile; this convention is consistent with the Fathom Toolbox for MATLAB (Jones, 2015), which was used for our testing and evaluations, and is advantageous because dissimilarity measures span a broad range of types (i.e., metric, nonmetric, or semi‐metric) that can be applied to a diversity of potential research disciplines. These types of resemblance measures also allow ordination of the objects via multidimensional methods, which require db‐resemblance measures, and are intuitively interpreted with two objects’ spatial “closeness” in ordination space as being more similar (i.e., less dissimilar). Because similarity profiles and dissimilarity profiles are analogous, we refer to “DISPROF” hereafter.
To test the effectiveness of DISPROF at detecting the presence of multivariate structure among objects, we used simulated datasets with both unstructured and structured sets of descriptors, under four different simulation scenarios (Table 1). We attempted to simulate data that would be applicable to a range of numerical studies including, but not limited to, the ecological type of data that SIMPROF was initially developed for (Table 2). The unstructured data were simulated with a single grouping structure present and were used for estimating type I error rates for DISPROF; the structured data were simulated with known groups among objects and were used to estimate type II error rates and the power of the hypothesis test. Structured data were also used to examine the effects of descriptor overdispersion in ecological count data, as well as the effects of increasing numbers of descriptors and the type of correlation structure among them. We retained the grouping partitions from the structured data simulations, and doing so allowed us to test the correspondence between the clustering solutions achieved by the UPGMA with DISPROF algorithm and these baseline partitions. The criterion for rejecting H₀ in this simulation study was set at α = .05, and we opted to use a progressive Bonferroni p‐value correction (Legendre & Legendre, 2012) for instances where repeated hypothesis testing was conducted (i.e., simulated structured data testing).
Table 1

Detail of the simulation scenarios used for the study listed as Sim 1–Sim 4

Probability distribution | G | Parameter 1 | Parameter 2 | N | P
Sim 1. Unstructured data
a. Binomial | 1 | T = 1 | 0 ≤ q ≤ 1 | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
b. Chi‐square | 1 | 1 ≤ df ≤ N − 1 | – | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
c. Exponential | 1 | 0 ≤ μ ≤ 5 | – | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
d. Log‐normal | 1 | 0 ≤ μ ≤ 50 | 0 ≤ σ² ≤ 5 | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
e. Negative binomial | 1 | 0 ≤ T ≤ 10 | 0 ≤ q ≤ 1 | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
f. Negative binomial/Poisson (a) | 1 | 1 ≤ μ ≤ 100 | 0 ≤ θ ≤ 1 | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
g. Normal | 1 | −100 ≤ μ ≤ 100 | 0 ≤ σ ≤ 5 | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
h. Poisson | 1 | 0 ≤ λ ≤ 1,000 | – | {10, 25, 50, 150, 300} | {2, 3, 10, 25, 50, 150, 225, 300}
Sim 2. Structured data: overlapping groups
a. Normal (OCLUS) | 2 | σ₁² = σ₂² = 1 | Ov = {0.01, 0.02, … 0.49, 0.5} | n₁ = n₂ = 25, N = 50 | {2, 3, 5, 10, 25, 50, 150, 225, 300}
Sim 3. Structured data: overdispersed descriptors
a. Negative binomial/Poisson (a) | 2 | μ₁ = μ₂ = 10 | θ₁ = 0, θ₂ = {0, 0.1, 0.4, 0.9} | n₁ = n₂ = 25, N = 50 | {2, 3, 5, 10, 25, 50, 150, 225, 300}
b. Negative binomial/Poisson (a) | 2 | μ₁ = 10, μ₂ = 30 | θ₁ = 0, θ₂ = {0, 0.1, 0.4, 0.9} | n₁ = n₂ = 25, N = 50 | {2, 3, 5, 10, 25, 50, 150, 225, 300}
Sim 4. Structured data: correlated descriptors
a. Normal | 2 | μ₁ = 10, μ₂ = 30 | Σ₁ = 0, Σ₂ = {0, 0.6, 0.9} | n₁ = n₂ = 25, N = 50 | {2, 3, 5, 10, 25, 50, 150, 225, 300}
b. Normal | 2 | μ₁ = 10, μ₂ = 30 | Σ₁ = Σ₂ = {0.6, 0.9} | n₁ = n₂ = 25, N = 50 | {2, 3, 5, 10, 25, 50, 150, 225, 300}

For each scenario, S = 1,000 datasets were simulated, and mean dissimilarity profiles (DISPROF) were obtained with 1,000 permutations and the p‐values for the test were calculated with 999 permutations (α = .05). Variables are as follows: G, total number of groups; N, total number of objects; P, total number of descriptors; T, number of successful trials; df, degrees of freedom; μᵢ, mean for all descriptors in group i; λ, Poisson rate parameter; σᵢ², variance for all descriptors in group i; q, probability of success for a trial; θᵢ, overdispersion parameter for all descriptors in group i; Σᵢ, correlation among descriptors in group i; Ov, average overlap per axis between data clouds for G₁ and G₂.

(a) Where θ = 0, then μ = σ², and the negative binomial distribution reduces to the Poisson.

Table 2

Probability distributions used in Sim 1–Sim 4: The representative data type and the resemblance measure used to determine the pairwise distance between objects

Probability distribution | Data type | Resemblance
Binomial | Binary, presence/absence | Jaccard
Chi‐square | Rational, continuous | Euclidean
Exponential | Rational, continuous | Euclidean
Log‐normal | Rational, continuous | Euclidean
Negative binomial | Integer, frequency with many 0's | Bray–Curtis
Negative binomial/Poisson | Overdispersed ecological count data | Bray–Curtis
Normal | Rational, continuous | Euclidean
Poisson | Integer, frequency with many 0's | Bray–Curtis

No data were transformed prior to subjection to the resemblance measure.

All data simulations were coded in MATLAB using the Fathom Toolbox (Jones, 2015), the OCLUS routine (Steinley & Henson, 2005), and the Darkside Toolbox (Kilborn, 2015). To complete the algorithm testing described below, we used the University of South Florida Research Computing high‐performance computing hardware running MATLAB v. 2016 and used an experimental MATLAB module from the Fathom Toolbox called “ClustX.”

Data simulation methods

In all simulations, varying size conditions for the resultant data matrices were used, and this allowed us to investigate the effects of changing the numbers of objects (N) and dataset dimensionalities (P, number of descriptors) on DISPROF's performance, and also the quality of the clustering solutions achieved by the algorithm. S = 1,000 datasets were simulated for each combination of [N × P] under additional simulation scenarios described in Table 1. The simulation scenarios allowed further investigation of DISPROF's performance regarding variation in (1) the underlying probability distribution of the data; (2) the amount of overlap between groups’ data clouds; (3) the location and dispersion among groups of objects representing ecological abundance data; and (4) correlation structures among descriptors within groups of objects.

Unstructured data (Sim 1)

The first set of simulations were used to estimate type I error rates for the DISPROF routine for data drawn from eight different probability distributions (Table 1). Each probability distribution was used to simulate a specific data type, and the properties of the simulated data informed the choice of resemblance measure (Table 2). Each statistical distribution had S = 40,000 unstructured datasets across all combinations of [N × P]. A total of 320,000 independently generated unstructured datasets were used to complete the type I error rate estimations. Within each of the S = 1,000 equally sized datasets, the columns were individually parameterized at random from a set range of values specific to the underlying probability distribution (Table 1). The instances where random processes produced objects with all zero‐value entries were allowed to persist in the data, and they were treated as a special case during the calculation of Bray–Curtis and Jaccard dissimilarity matrices. In this special case, any comparison of two objects with all zero‐value entries would be assigned a dissimilarity value of one (i.e., perfectly dissimilar), as they share no common variability (Anderson & Walsh, 2013; Warton & Hudson, 2004). This convention was upheld for all simulation scenarios where it was appropriate to do so (Sim 1e, 1f, 1h; Sim 3). Each probability distribution was tested in batches of S = 1,000 according to their [N × P] configurations. The S independent datasets were each tested with the DISPROF routine one time to determine whether the null was rejected at α = .05. The resultant p‐value for each DISPROF hypothesis test was collected, and the proportion of all S datasets where the associated p‐value was significant was calculated for each [N × P] configuration.
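The all‐zero special case and the rejection‐rate calculation described above can be illustrated with a short Python sketch. The function names are hypothetical and this is not the Fathom Toolbox code; it only encodes the convention stated in the text, namely that two objects with all zero‐value entries receive a dissimilarity of one.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(x) + sum(y)
    if total == 0:
        # Convention from the text: two all-zero objects are perfectly dissimilar.
        return 1.0
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def jaccard(x, y):
    """Jaccard dissimilarity between two presence/absence (0/1) vectors."""
    shared = sum(1 for a, b in zip(x, y) if a and b)
    union = sum(1 for a, b in zip(x, y) if a or b)
    if union == 0:
        # Same convention for two objects with no presences at all.
        return 1.0
    return 1.0 - shared / union


def rejection_rate(p_values, alpha=0.05):
    """Proportion of tests in which the null hypothesis was rejected."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)
```

Applied to the S p‐values collected for one [N × P] configuration of unstructured data, `rejection_rate` gives the estimated type I error rate for that configuration.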

Structured data—overlapping groups (Sim 2)

The second set of simulations were designed to examine the effects of dataset configuration, as well as the average amount of overlap per dimension between the data clouds that represent grouped objects, on the DISPROF routine and its grouping solutions. We used an established data simulation routine described by Steinley and Henson (2005), called OCLUS, to produce a total of 450,000 datasets with overlapping grouping structures. The OCLUS routine implementation in MATLAB allowed the configuration of the probability distribution type, the number of groups (G) and whether or not they overlap, the number of objects per group (nᵢ), and the average amount of group overlap across all dimensions (Ov) between groups of objects in hyperdimensional space. Note that Ov for the entire dataset is evenly distributed across all dimensions, and two major assumptions of the OCLUS routine are (1) that all dimensions are independent; and (2) that all groups are independent (Steinley & Henson, 2005). For our purposes, when simulating all structured data with multiple groups (Sim 2–Sim 4), a simple simulation design was employed where two groups (G = 2) with n₁ = n₂ = 25 (N = 50) objects were simulated. In Sim 2, for each [N × P] configuration the average overlap between the two groups was increased progressively from Ov = 0.01 to 0.50, in 0.01 increments. S = 1,000 datasets were simulated for each [N × P × Ov] configuration. Descriptor data were drawn from the multivariate normal distribution with equal variances (σ₁² = σ₂² = 1) for both groups (Anderson & Walsh, 2013; Steinley & Henson, 2005). Normally distributed data were used to examine the type II error because the concern that the underlying probability distribution of the data would impart some sort of unknown structure was negligible as the data were simulated in a known grouping configuration. 
As cluster analysis falls into the category of “exploratory” data analysis, it should be obvious that the amount of overlap between objects in a sampling data set, or any inherent grouping structure, is unknown at the time of testing. Therefore, it is important to understand the empirical effects of group location and overlap on clustering solutions if we are to put any faith in the solutions provided by the algorithm.

Structured data—overdispersed descriptors (Sim 3)

The third simulation scenario also indirectly dealt with group location, but the main focus of these simulations was on determining the effect on DISPROF from increasing the overdispersion of one group while holding the other group constant, and to do so for ecological frequency data (i.e., abundances or counts). We used the Fathom Toolbox for MATLAB to implement ecological‐data simulation scenarios similar to those used by Anderson and Walsh (2013), and in Sim 3, we simulated ecological abundance data drawn from the overdispersed negative binomial and/or Poisson distribution (Tables 1 and 2). These data were simulated where the variance σ² >> mean (μ), and σ² is related to μ such that σ² = μ + θμ², where θ is the overdispersion parameter. In cases where σ² = μ, the data were drawn from the Poisson distribution, and the data were drawn from the negative binomial distribution otherwise. In Sim 3a, we simulated a total of 36,000 datasets with G = 2, μ₁ = μ₂ = 10 (collocated groups), and we induced heterogeneity between the groups by increasing the overdispersion for the descriptors in G₂. In Sim 3b, we maintained the group heterogeneity from increasing θ₂ when we simulated an additional 36,000 datasets with G = 2, but in this scenario, we set μ₁ = 10 and μ₂ = 30 (separated groups). For all [N × P] configurations, four different combinations of θ₁ and θ₂ were used to simulate S = 1,000 datasets for all [N × P × (θ₁ and θ₂)] configurations (Table 1). In Sim 3, we simulated ecological count datasets with no overdispersion in G₁ and increasing θ in G₂, and where the groups were collocated in hyperdimensional space (Sim 3a) or where they existed in separate locations (Sim 3b). It should be noted, however, that this method does not account for data‐cloud overlap, and it is possible that two simulated groups that do not share a mean value could still overlap if the θ parameter were extremely high. 
We tested values ranging from zero overdispersion, to low (θ = 0.1), to medium (θ = 0.4), to high (θ = 0.9).
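One common way to draw counts with variance μ + θμ² is a gamma‐Poisson mixture, sketched below in Python. This is an illustrative assumption, not necessarily how the Fathom Toolbox generates the data; the function names are hypothetical, and the simple Poisson sampler is only suitable for the modest rates considered here.

```python
import math
import random


def poisson(rate, rng):
    """Knuth's multiplication method; adequate for modest rates (well below ~700)."""
    if rate <= 0:
        return 0
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_count(mu, theta, rng):
    """Draw one count with mean mu and variance mu + theta * mu**2."""
    if theta == 0:
        return poisson(mu, rng)  # no overdispersion: plain Poisson
    # Gamma-distributed rate with mean mu and variance theta * mu**2.
    rate = rng.gammavariate(1.0 / theta, theta * mu)
    return poisson(rate, rng)


def expected_variance(mu, theta):
    """Variance implied by the mean and the overdispersion parameter."""
    return mu + theta * mu ** 2


rng = random.Random(42)
group_1 = [simulate_count(10, 0.0, rng) for _ in range(4000)]  # Poisson-like
group_2 = [simulate_count(10, 0.4, rng) for _ in range(4000)]  # overdispersed
```

For μ = 10, the implied variance rises from 10 at θ = 0 to 20, 50, and 100 at θ = 0.1, 0.4, and 0.9, which shows how quickly the data cloud of the second group spreads even though both groups share the same mean.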

Structured data—increasing correlation (Sim 4)

The fourth set of simulations was used to examine the effects of correlated descriptors within a group of objects on DISPROF and its clustering outputs. We simulated data with different correlation structures (Σ) between descriptors in G₁ and G₂, and where Σ₂ increased in G₂ (Sim 4a), and also with Σ₁ = Σ₂, but still increasing Σ (Sim 4b, Table 1). In both cases, we simulated data drawn from the multivariate normal distribution with μ₁ = 10, μ₂ = 30 and σ₁² = σ₂² = 1. The square, symmetric correlation‐matrices Σ were built such that each descriptor would be correlated with all other descriptors in the dataset by the proportion listed in Σ. Sim 4 examines data with correlated descriptors whose level of correlation varies from no correlation (Σ = 0), to medium (Σ = 0.6), to high correlation (Σ = 0.9).
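The equal‐correlation structure described above can be sketched in Python as follows. The shared‐factor construction used here is an assumption chosen because it needs no matrix factorization; it is not taken from the original MATLAB code, and the function names are hypothetical. It yields unit‐variance normal descriptors in which every pair has correlation ρ (for 0 ≤ ρ ≤ 1).

```python
import math
import random


def correlation_matrix(p, rho):
    """Square, symmetric matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def equicorrelated_row(mu, rho, p, rng):
    """One object with p normal descriptors, each with mean mu and variance 1.

    Every descriptor shares one common standard-normal factor, so each pair
    of descriptors has covariance rho: Var = rho + (1 - rho) = 1.
    """
    shared = rng.gauss(0.0, 1.0)
    return [
        mu + math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        for _ in range(p)
    ]


rng = random.Random(7)
# Two groups of 25 objects as in Sim 4a: uncorrelated G1, correlated G2.
group_1 = [equicorrelated_row(10.0, 0.0, 5, rng) for _ in range(25)]
group_2 = [equicorrelated_row(30.0, 0.6, 5, rng) for _ in range(25)]
```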

Power, resolution, and correspondence estimation

As all datasets in Sim 2–Sim 4 had G = 2, we estimated the proportion of type II errors for each [N × P × Ov], [N × P × (θ 1 and θ 2)], and [N × P × (Σ 1 and Σ 2)] configuration by finding the number of instances, per S = 1,000, where the H o was retained at α = .05 (i.e., no multivariate structure deemed present). Type II error estimates were converted to power, and values ≥0.80 were considered acceptable at our selected confidence level (Cohen, 2013). As our primary interest was in exploring the efficacy of using DISPROF as a clustering criterion, we examined the first iteration of sequential testing of H o (to record type II error rates), but we also allowed for all subsequent DISPROF iterations to run until the clustering implementation was completed. This unconstrained approach allowed the UPGMA clustering with DISPROF algorithm to settle on complete clustering solutions with the maximum number of groups that could be discovered of G  = N − 2. The final result of each DISPROF clustering attempt was a partition for the simulated objects that identified each object's group membership. In all cases, G and the generated grouping partition were retained for further analysis. The number of groups identified was used to examine the effective resolution of the clustering solution, with larger values of G being indicative of fine resolution and smaller G values being coarse. The grouping partitions were used to compare the computed results against the known reference partition for each structured dataset simulated. The measure of correspondence between the clustering solutions’ partitions and their reference partitions was calculated using the Hubert–Arabie adjusted Rand index (ARI ). This effort was undertaken due to the importance of a clustering algorithm being able to find “correct” structure in the data. 
The absolute value of ARI HA ranges from 0 to 1, requires a probabilistic interpretation, and measures the likelihood of agreement between one randomly chosen pair of objects represented in both partitions, corrected for chance (Hubert & Arabie, 1985). Negative ARI HA values can be interpreted as a probability of agreement that is less than what would be expected by chance alone. We interpreted ARI HA values ≥0.80 as “good” correspondence, with anything above 0.90 being “excellent.” Likewise, ARI HA values <0.80 were interpreted as “moderate” correspondence, and values below 0.65 were interpreted as “poor” correspondence (Steinley, 2004).
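To make the correspondence measure concrete, the following Python sketch computes the Hubert–Arabie adjusted Rand index from two label vectors using the standard pair-counting formula. It is an illustration rather than the code used in the study; the function name `adjusted_rand_index` and the example partitions are hypothetical, and returning 1.0 for two identical trivial partitions is a convention we assume here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")
    # Pairs of objects placed together in both partitions, in A, and in B
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Both partitions are the same trivial partition (assumed convention)
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


reference = [1] * 25 + [2] * 25                 # two simulated groups, n1 = n2 = 25
perfect = list(reference)                       # clustering recovers both groups
one_group = [1] * 50                            # H o retained: a single group
split_g2 = [0] * 25 + list(range(1, 26))        # G 1 intact, G 2 split into singletons

ari_perfect = adjusted_rand_index(reference, perfect)
ari_one_group = adjusted_rand_index(reference, one_group)
ari_split = adjusted_rand_index(reference, split_g2)
```

With these inputs, a perfect recovery scores 1, a single-group solution scores 0, and the 26-group solution scores 50/99 (about 0.505). The last value coincides with the plateau reported for Sim 3a in Table 6, although that reading is our own observation and not an explanation given by the authors.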

Results

Data simulation scenarios

The mean estimated type I error rates for DISPROF were within the confidence interval that would be expected for the chosen level of α = .05 for all simulated unstructured data, regardless of the base probability distribution that the data were drawn from (Table 3). There was also no apparent effect of the number of objects or descriptors on the type I error rates for DISPROF (Figure 3).
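The comparison against an expected interval can be sketched as follows. This is a hedged illustration that assumes a normal-approximation binomial interval around α for S = 1,000 datasets; the text does not state which interval was used, and the helper names `rejection_rate` and `expected_interval` are hypothetical.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of tests that reject H o.

    For unstructured data this estimates the type I error rate; for structured
    data it estimates power (1 minus the type II error proportion).
    """
    return sum(p <= alpha for p in p_values) / len(p_values)


def expected_interval(alpha=0.05, n_datasets=1000, z=1.96):
    """Approximate 95% interval for the rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_datasets)
    return alpha - z * se, alpha + z * se


lower, upper = expected_interval()                   # roughly 0.036 to 0.064
example_rate = rejection_rate([0.01, 0.20, 0.04, 0.50])  # toy p-values: 2 of 4 rejected
```

Under this assumption, per-configuration estimates between roughly 0.036 and 0.064 are consistent with a nominal α = .05, which is the range spanned by most of the minimum and maximum values in Table 3.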
Table 3

Descriptive statistics for DISPROF type I error based on Sim 1

Sim 1. Type I error, S = 40,000. For every distribution, N = {10, 25, 50, 150, 300} and P = {2, 3, 10, 25, 50, 150, 225, 300}.

Probability distribution | Minimum | Mean | Mode | Maximum | σ | SE
a. Binomial | 0.008 | 0.046 | 0.055 | 0.068 | 0.013 | .002
b. Chi-square | 0.032 | 0.050 | 0.050 | 0.067 | 0.007 | .001
c. Exponential | 0.037 | 0.049 | 0.049 | 0.067 | 0.006 | .001
d. Log-normal | 0.033 | 0.050 | 0.047 | 0.070 | 0.008 | .001
e. Negative binomial | 0.034 | 0.049 | 0.050 | 0.064 | 0.006 | .001
f. Negative binomial/Poisson | 0.028 | 0.048 | 0.045 | 0.063 | 0.008 | .001
g. Normal | 0.035 | 0.051 | 0.050 | 0.066 | 0.008 | .001
h. Poisson | 0.036 | 0.049 | 0.043 | 0.062 | 0.007 | .001

Unstructured data: Type I error rate estimates and statistics were obtained from S = 40,000 datasets across all configurations of [N × P] for each probability distribution simulated. Error rate estimates for each configuration were based on S = 1,000 datasets, and all p‐values were obtained via 999 permutations with significance assessed at α = .05. N, total number of objects; P, total number of descriptors; σ, standard deviation of the mean; SE, standard error of the mean.

Figure 3

Ratio of P:N versus the proportion of type I error: The type I error rates (α = .05) for the DISPROF hypothesis test for multivariate structure of S = 1,000 simulated unstructured datasets from eight different probability distributions simulated in scenario Sim 1. Data points represent each of the 40 different [N × P] configurations; the dotted vertical line indicates the mean type I error rate for all 40 configurations. All data were randomly parameterized and drawn from the (a) binomial, (b) chi‐square, (c) exponential, (d) log‐normal, (e) negative binomial, (f) negative binomial/Poisson, (g) normal, and (h) Poisson probability distributions. The σ and standard error for all probability distributions tested were ≤0.01 and .002, respectively

The mean power values for each P-dimension, calculated from the 50 proportions of type II errors estimated for each [N × P × Ov] configuration (S = 1,000), showed an increase in the power of DISPROF to detect the presence of multivariate structure as the overall dimensionality of the dataset increased (Table 4). A closer look at each P-dimension's power values (Figure 4) showed that, for P ≤ 10, as Ov decreased, the statistical power of DISPROF increased asymptotically from unacceptable levels toward 1. For all values of P ≥ 25, the power was estimated to equal 1 for all Ov. Furthermore, for any given Ov the power increased as P increased.
The average number of groups (G¯) per S = 50,000 datasets from all [N × P] configurations across all 50 Ov levels was similar across all P, ranging from a minimum G¯ = 1.81 (P = 2) to a maximum G¯ = 2.16 (P = 5; Table 4). Closer inspection of each [P × Ov] combination (S = 1,000) revealed that DISPROF clustering solutions where P ≤ 3 displayed an increase in G¯ as Ov decreased: G¯ increased from values below 2 and asymptotically approached the mean of G¯ for all clustering solutions within a given [P × Ov] combination. For all P ≥ 5, G¯ values remained above 2 for all Ov and were much more tightly bound around their respective means (Figure 5a, Table 4). The mean correspondence values (ARI¯HA) for each S = 50,000 datasets from all [N × P] configurations across all Ov increased as P increased (Table 4), and for any single Ov level, ARI¯HA also increased with P (Figure 5b). A more detailed view of ARI¯HA within each P-dimension (Figure 5b) indicated that for P ≤ 5 the mean ARI HA values persisted below 0.8 for the majority of Ov scenarios, but had a generally increasing trend as Ov decreased, eventually reaching high correspondence values at low levels of Ov. All P ≥ 10 clustering solutions had ARI¯HA values that were considerably less variable across all levels of Ov than those for P ≤ 5. These solutions’ correspondence values were tightly bound around their respective mean values (Table 4) and displayed good or excellent correspondence (Figure 5b).
Table 4

Descriptive statistics for power, G¯, and ARI¯HA for DISPROF based on Sim 2

All rows: Ov = {0.01, 0.02, … 0.49, 0.5}, σ 1² = σ 2² = 1, n 1 = n 2 = 25, S = 50,000.

Sim 2. Power
P | Minimum | Mean | Mode | Maximum | σ | SE
2 | 0.342 | 0.626 | 0.476 | 1.000 | 0.221 | .004
3 | 0.491 | 0.713 | 0.629 | 1.000 | 0.164 | .003
5 | 0.770 | 0.877 | 0.760 | 1.000 | 0.068 | .001
10 | 0.990 | 0.997 | 0.999 | 1.000 | 0.002 | <.001
≥25 | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | .000

Sim 2. G¯
P | Minimum | Mean | Mode | Maximum | σ | SE
2 | 1.46 | 1.81 | 1.66 | 2.14 | 0.23 | <.01
3 | 1.70 | 1.95 | 2.16 | 2.19 | 0.16 | <.01
5 | 2.07 | 2.16 | 2.13 | 2.22 | 0.03 | <.01
10 | 2.08 | 2.15 | 2.15 | 2.21 | 0.02 | <.01
25 | 2.05 | 2.06 | 2.06 | 2.09 | 0.01 | <.01
50 | 2.03 | 2.06 | 2.06 | 2.09 | 0.01 | <.01
150 | 2.03 | 2.06 | 2.06 | 2.09 | 0.01 | <.01
225 | 2.04 | 2.07 | 2.06 | 2.09 | 0.01 | <.01
300 | 2.04 | 2.06 | 2.07 | 2.09 | 0.01 | <.01

Sim 2. ARI¯HA
P | Minimum | Mean | Mode | Maximum | σ | SE
2 | 0.116 | 0.347 | 0.116 | 0.927 | 0.232 | .005
3 | 0.198 | 0.407 | 0.198 | 0.897 | 0.190 | .004
5 | 0.447 | 0.591 | 0.447 | 0.883 | 0.111 | .002
10 | 0.846 | 0.875 | 0.846 | 0.934 | 0.019 | <.001
25 | 0.984 | 0.988 | 0.984 | 0.991 | 0.001 | <.001
50 | 0.995 | 0.997 | 0.995 | 0.998 | 0.001 | <.001
150 | 0.995 | 0.997 | 0.995 | 0.998 | 0.001 | <.001
225 | 0.996 | 0.997 | 0.996 | 0.998 | 0.001 | <.001
300 | 0.995 | 0.997 | 0.995 | 0.998 | 0.001 | <.001

Structured data (overlapping groups): Power estimates for each [N × P × Ov] configuration were based on S = 1,000 datasets with mean values based on 50 [P × Ov] configurations at each P; all p-values were obtained via 999 permutations with significance assessed at α = .05. Mean number of groups (G¯) and average clustering solution correspondence (ARI¯HA) estimations and statistics were obtained from S = 50,000 datasets across all Ov for each configuration of [N × P]. N, total number of objects (n i = number of objects in group i); P, total number of descriptors; Ov, average overlap per axis between data clouds for G 1 and G 2; σ i², variance of group i; σ, standard deviation of the mean; SE, standard error of the mean.

Figure 4

Power of the DISPROF test versus the proportion of group overlap: Statistical power of DISPROF versus Ov for all P tested under Sim 2. Each line plot represents the 50 power values for S = 1,000 datasets at each Ov level for a given P. The horizontal dashed line at power = 0.8 is the lower limit of acceptable power values

Figure 5

The relationship of G¯ and ARI¯HA with Ov for DISPROF clustering: (a) The mean number of groups identified (G¯) versus the average data cloud overlap (Ov) for all P tested under Sim 2. Each line plot represents the 50 G¯ values for S = 1,000 datasets at each Ov level for a given P. The optimal grouping solution (G = 2) is represented by the horizontal dashed line. (b) The mean correspondence of the grouping solution (ARI¯HA) versus the average data cloud overlap (Ov) for all P tested under Sim 2. Each line plot is configured as in panel (a); the horizontal black dashed line represents the lower bound for excellent correspondence (ARI¯HA = 0.9), and the red dashed line represents the lower bound for good correspondence (ARI¯HA = 0.8). Boxplots to the right represent the distribution of standard errors for each estimate of G¯ and ARI¯HA for all Ov within a noted dimensionality for P. The horizontal red line in each boxplot represents the median standard error value in the distribution, with the lower and upper edges of the box being the 25th and 75th percentiles. Whiskers extend to encompass the most extreme data points, and outliers are plotted individually as crosses.

The performance of DISPROF across all 36 combinations of [N × P × (θ 1 and θ 2)] (S = 1,000) was more consistent when μ 1 = 10, μ 2 = 30 (Sim 3b) than when μ 1 = μ 2 = 10 (Sim 3a) (Table S1). Sim 3a displayed increasing power to detect groups as the amount of overdispersion in G 2 increased, even when the groups’ centroids overlapped and the only distinction between the groups was their respective θ structures. Sim 3b maintained power values of 1 for all configurations except three (P = {2, 3}, θ 1 = 0, θ 2 = 0.4; P = 3, θ 1 = 0, θ 2 = 0.9), whose power values were all above 0.85. The power of DISPROF within all [P × (θ 1 and θ 2)] configurations where θ 2 > 0 increased with P until a threshold value of P was met, and for the remaining dimensions where P ≥ P threshold, the power was 1. The value of P threshold decreased as θ 2 increased and the difference in spread of the two groups became more pronounced (Table S1).
The mean number of groups identified in Sim 3b across all [P × (θ 1 and θ 2)] configurations where θ 2 < 0.9 was approximately 2 (the correct number), and there was no apparent effect of increasing P or θ 2 when the two groups were sufficiently separated in hyperdimensional space (Table 5). For simulations where θ 2 = 0.9, G¯ (averaged over 1,000 datasets) increased from ~2.5 groups at P = 2 to ~4 groups at P = {5, 10}, after which the value of G¯ tapered off to around 2 starting at P = 150 (Table 5). The mean correspondence values for scenarios where θ 2 = {0, 0.1} remained excellent for all P; where θ 2 ≥ 0.4, ARI¯HA increased with P (Table 6).
In Sim 3a, where μ 1 = μ 2, DISPROF clustering, on average, never settled on the solution of G = 2. When θ 1 = θ 2 = 0, all P returned G¯ ≈ 1 (as the two groups were effectively identical), but for all other [P × (θ 1 and θ 2)] configurations where θ 2 > 0, as P increased so did the value of G¯ (maximum G = 28 groups, Table 5). The same pattern was observed in the ARI¯HA values for Sim 3a as was seen for G¯; for all θ 1 = θ 2 = 0 scenarios, ARI¯HA = 0, and for all other levels of θ 2 the ARI¯HA values increased along with P, leveling off at about 0.5 at high P (Table 6).
Table 5

Descriptive statistics for G¯ for DISPROF based on Sim 3

Sim 3a. G¯: μ 1 = μ 2 = 10, n 1 = n 2 = 25, S = 1,000
Sim 3b. G¯: μ 1 = 10, μ 2 = 30, n 1 = n 2 = 25, S = 1,000

P | θ 1, θ 2 | 3a Minimum | 3a Mean | 3a Mode | 3a Maximum | 3a σ | 3a SE | 3b Minimum | 3b Mean | 3b Mode | 3b Maximum | 3b σ | 3b SE
2 | 0, 0 | 1.00 | 1.06 | 1.00 | 4.00 | 0.28 | .01 | 2.00 | 2.07 | 2.00 | 5.00 | 0.32 | .01
2 | 0, 0.1 | 1.00 | 1.10 | 1.00 | 5.00 | 0.35 | .01 | 2.00 | 2.07 | 2.00 | 5.00 | 0.30 | .01
2 | 0, 0.4 | 1.00 | 1.32 | 1.00 | 5.00 | 0.62 | .02 | 1.00 | 2.16 | 2.00 | 5.00 | 0.62 | .02
2 | 0, 0.9 | 1.00 | 1.75 | 1.00 | 6.00 | 0.86 | .03 | 1.00 | 2.51 | 2.00 | 6.00 | 0.98 | .03
3 | 0, 0 | 1.00 | 1.07 | 1.00 | 4.00 | 0.30 | .01 | 2.00 | 2.06 | 2.00 | 5.00 | 0.29 | .01
3 | 0, 0.1 | 1.00 | 1.13 | 1.00 | 5.00 | 0.42 | .01 | 2.00 | 2.05 | 2.00 | 5.00 | 0.27 | .01
3 | 0, 0.4 | 1.00 | 1.84 | 1.00 | 6.00 | 0.99 | .03 | 2.00 | 2.36 | 2.00 | 6.00 | 0.62 | .02
3 | 0, 0.9 | 1.00 | 3.18 | 3.00 | 8.00 | 1.44 | .05 | 1.00 | 3.45 | 3.00 | 7.00 | 1.03 | .03
5 | 0, 0 | 1.00 | 1.07 | 1.00 | 6.00 | 0.34 | .01 | 2.00 | 2.05 | 2.00 | 4.00 | 0.24 | .01
5 | 0, 0.1 | 1.00 | 1.25 | 1.00 | 6.00 | 0.58 | .02 | 2.00 | 2.06 | 2.00 | 5.00 | 0.27 | .01
5 | 0, 0.4 | 1.00 | 3.93 | 3.00 | 10.00 | 1.73 | .05 | 2.00 | 2.34 | 2.00 | 5.00 | 0.55 | .02
5 | 0, 0.9 | 3.00 | 7.27 | 7.00 | 13.00 | 1.71 | .05 | 2.00 | 4.23 | 4.00 | 8.00 | 1.23 | .04
10 | 0, 0 | 1.00 | 1.06 | 1.00 | 4.00 | 0.31 | .01 | 2.00 | 2.07 | 2.00 | 6.00 | 0.35 | .01
10 | 0, 0.1 | 1.00 | 1.94 | 1.00 | 8.00 | 1.14 | .04 | 2.00 | 2.05 | 2.00 | 4.00 | 0.24 | .01
10 | 0, 0.4 | 4.00 | 9.71 | 10.00 | 16.00 | 1.96 | .06 | 2.00 | 2.24 | 2.00 | 6.00 | 0.50 | .02
10 | 0, 0.9 | 8.00 | 12.91 | 12.00 | 18.00 | 1.65 | .05 | 2.00 | 3.94 | 4.00 | 10.00 | 1.22 | .04
25 | 0, 0 | 1.00 | 1.11 | 1.00 | 7.00 | 0.57 | .02 | 2.00 | 2.06 | 2.00 | 5.00 | 0.28 | .01
25 | 0, 0.1 | 1.00 | 6.01 | 6.00 | 14.00 | 2.30 | .07 | 2.00 | 2.06 | 2.00 | 6.00 | 0.32 | .01
25 | 0, 0.4 | 12.00 | 17.93 | 18.00 | 23.00 | 1.66 | .05 | 2.00 | 2.05 | 2.00 | 6.00 | 0.28 | .01
25 | 0, 0.9 | 14.00 | 19.70 | 20.00 | 24.00 | 1.58 | .05 | 2.00 | 2.64 | 2.00 | 7.00 | 0.81 | .03
50 | 0, 0 | 1.00 | 1.10 | 1.00 | 8.00 | 0.51 | .02 | 2.00 | 2.09 | 2.00 | 6.00 | 0.41 | .01
50 | 0, 0.1 | 5.00 | 12.73 | 13.00 | 20.00 | 2.37 | .08 | 2.00 | 2.06 | 2.00 | 7.00 | 0.35 | .01
50 | 0, 0.4 | 18.00 | 23.12 | 23.00 | 26.00 | 1.40 | .04 | 2.00 | 2.07 | 2.00 | 6.00 | 0.32 | .01
50 | 0, 0.9 | 19.00 | 23.55 | 24.00 | 26.00 | 1.30 | .04 | 2.00 | 2.17 | 2.00 | 6.00 | 0.45 | .01
150 | 0, 0 | 1.00 | 1.10 | 1.00 | 10.00 | 0.61 | .02 | 2.00 | 2.07 | 2.00 | 9.00 | 0.41 | .01
150 | 0, 0.1 | 18.00 | 22.75 | 23.00 | 27.00 | 1.41 | .04 | 2.00 | 2.05 | 2.00 | 6.00 | 0.27 | .01
150 | 0, 0.4 | 24.00 | 25.91 | 26.00 | 27.00 | 0.31 | .01 | 2.00 | 2.05 | 2.00 | 7.00 | 0.28 | .01
150 | 0, 0.9 | 24.00 | 25.92 | 26.00 | 27.00 | 0.29 | .01 | 2.00 | 2.05 | 2.00 | 7.00 | 0.31 | .01
225 | 0, 0 | 1.00 | 1.11 | 1.00 | 9.00 | 0.67 | .02 | 2.00 | 2.07 | 2.00 | 6.00 | 0.36 | .01
225 | 0, 0.1 | 21.00 | 24.83 | 25.00 | 27.00 | 0.95 | .03 | 2.00 | 2.07 | 2.00 | 5.00 | 0.32 | .01
225 | 0, 0.4 | 25.00 | 25.99 | 26.00 | 27.00 | 0.12 | <.01 | 2.00 | 2.09 | 2.00 | 7.00 | 0.40 | .01
225 | 0, 0.9 | 25.00 | 25.99 | 26.00 | 28.00 | 0.12 | .00 | 2.00 | 2.07 | 2.00 | 6.00 | 0.35 | .01
300 | 0, 0 | 1.00 | 1.10 | 1.00 | 10.00 | 0.60 | .02 | 2.00 | 2.07 | 2.00 | 6.00 | 0.34 | .01
300 | 0, 0.1 | 23.00 | 25.65 | 26.00 | 27.00 | 0.58 | .02 | 2.00 | 2.06 | 2.00 | 6.00 | 0.32 | .01
300 | 0, 0.4 | 25.00 | 26.00 | 26.00 | 27.00 | 0.05 | <.01 | 2.00 | 2.08 | 2.00 | 6.00 | 0.37 | .01
300 | 0, 0.9 | 25.00 | 26.00 | 26.00 | 27.00 | 0.07 | <.01 | 2.00 | 2.08 | 2.00 | 8.00 | 0.41 | .01

Structured data (overdispersed descriptors): Estimates of the mean number of groups identified (G¯) for each [N × P × (θ 1, θ 2)] configuration were based on S = 1,000 datasets. N, total number of objects (n i = number of objects in group i); P, total number of descriptors; θ i, overdispersion for descriptors in group i; μ i, mean value of descriptors in group i; σ, standard deviation of the mean; SE, standard error of the mean.

Table 6

Descriptive statistics for ARI¯HA for DISPROF based on Sim 3

Sim 3a. ARI¯HA: μ 1 = μ 2 = 10, n 1 = n 2 = 25, S = 1,000
Sim 3b. ARI¯HA: μ 1 = 10, μ 2 = 30, n 1 = n 2 = 25, S = 1,000

P | θ 1, θ 2 | 3a Minimum | 3a Mean | 3a Mode | 3a Maximum | 3a σ | 3a SE | 3b Minimum | 3b Mean | 3b Mode | 3b Maximum | 3b σ | 3b SE
2 | 0, 0 | −0.013 | 0.000 | 0.000 | 0.060 | 0.003 | <.001 | 0.676 | 0.988 | 1.000 | 1.000 | 0.042 | .001
2 | 0, 0.1 | −0.013 | 0.000 | 0.000 | 0.077 | 0.004 | <.001 | 0.399 | 0.909 | 1.000 | 1.000 | 0.100 | .003
2 | 0, 0.4 | −0.004 | 0.004 | 0.000 | 0.310 | 0.019 | .001 | 0.000 | 0.563 | 0.000 | 1.000 | 0.269 | .009
2 | 0, 0.9 | −0.006 | 0.014 | 0.000 | 0.326 | 0.035 | .001 | 0.000 | 0.316 | 0.000 | 1.000 | 0.232 | .007
3 | 0, 0 | −0.021 | 0.000 | 0.000 | 0.021 | 0.002 | <.001 | 0.721 | 0.994 | 1.000 | 1.000 | 0.030 | .001
3 | 0, 0.1 | −0.007 | 0.000 | 0.000 | 0.038 | 0.002 | <.001 | 0.615 | 0.970 | 1.000 | 1.000 | 0.059 | .002
3 | 0, 0.4 | −0.007 | 0.012 | 0.000 | 0.312 | 0.032 | .001 | 0.000 | 0.780 | 0.920 | 1.000 | 0.161 | .005
3 | 0, 0.9 | −0.003 | 0.063 | 0.000 | 0.555 | 0.086 | .003 | 0.000 | 0.539 | 0.770 | 1.000 | 0.170 | .005
5 | 0, 0 | −0.019 | 0.000 | 0.000 | 0.028 | 0.002 | <.001 | 0.701 | 0.996 | 1.000 | 1.000 | 0.025 | .001
5 | 0, 0.1 | −0.011 | 0.001 | 0.000 | 0.109 | 0.006 | <.001 | 0.727 | 0.992 | 1.000 | 1.000 | 0.029 | .001
5 | 0, 0.4 | 0.000 | 0.065 | 0.000 | 0.422 | 0.075 | .002 | 0.527 | 0.915 | 1.000 | 1.000 | 0.088 | .003
5 | 0, 0.9 | 0.002 | 0.264 | 0.151 | 0.573 | 0.112 | .004 | 0.256 | 0.705 | 0.882 | 1.000 | 0.121 | .004
10 | 0, 0 | −0.017 | 0.000 | 0.000 | 0.017 | 0.001 | <.001 | 0.701 | 0.995 | 1.000 | 1.000 | 0.030 | .001
10 | 0, 0.1 | −0.003 | 0.005 | 0.000 | 0.125 | 0.014 | <.001 | 0.747 | 0.997 | 1.000 | 1.000 | 0.018 | .001
10 | 0, 0.4 | 0.026 | 0.260 | 0.219 | 0.533 | 0.097 | .003 | 0.708 | 0.984 | 1.000 | 1.000 | 0.035 | .001
10 | 0, 0.9 | 0.247 | 0.451 | 0.452 | 0.558 | 0.054 | .002 | 0.589 | 0.860 | 0.961 | 1.000 | 0.097 | .003
25 | 0, 0 | −0.019 | 0.000 | 0.000 | 0.106 | 0.004 | <.001 | 0.676 | 0.997 | 1.000 | 1.000 | 0.020 | .001
25 | 0, 0.1 | −0.003 | 0.059 | 0.012 | 0.310 | 0.056 | .002 | 0.656 | 0.996 | 1.000 | 1.000 | 0.022 | .001
25 | 0, 0.4 | 0.328 | 0.460 | 0.476 | 0.535 | 0.034 | .001 | 0.626 | 0.997 | 1.000 | 1.000 | 0.021 | .001
25 | 0, 0.9 | 0.467 | 0.515 | 0.515 | 0.533 | 0.011 | <.001 | 0.673 | 0.966 | 1.000 | 1.000 | 0.049 | .002
50 | 0, 0 | −0.017 | 0.000 | 0.000 | 0.029 | 0.002 | <.001 | 0.676 | 0.995 | 1.000 | 1.000 | 0.027 | .001
50 | 0, 0.1 | 0.028 | 0.236 | 0.266 | 0.481 | 0.080 | .003 | 0.626 | 0.996 | 1.000 | 1.000 | 0.025 | .001
50 | 0, 0.4 | 0.430 | 0.506 | 0.510 | 0.523 | 0.012 | <.001 | 0.665 | 0.995 | 1.000 | 1.000 | 0.028 | .001
50 | 0, 0.9 | 0.430 | 0.509 | 0.508 | 0.520 | 0.004 | <.001 | 0.727 | 0.992 | 1.000 | 1.000 | 0.024 | .001
150 | 0, 0 | −0.018 | 0.000 | 0.000 | 0.035 | 0.002 | <.001 | 0.631 | 0.995 | 1.000 | 1.000 | 0.027 | .001
150 | 0, 0.1 | 0.352 | 0.454 | 0.468 | 0.517 | 0.028 | .001 | 0.792 | 0.997 | 1.000 | 1.000 | 0.015 | <.001
150 | 0, 0.4 | 0.395 | 0.505 | 0.505 | 0.508 | 0.004 | <.001 | 0.633 | 0.997 | 1.000 | 1.000 | 0.019 | .001
150 | 0, 0.9 | 0.428 | 0.505 | 0.505 | 0.508 | 0.003 | <.001 | 0.689 | 0.997 | 1.000 | 1.000 | 0.021 | .001
225 | 0, 0 | −0.010 | 0.000 | 0.000 | 0.013 | 0.001 | <.001 | 0.699 | 0.996 | 1.000 | 1.000 | 0.026 | .001
225 | 0, 0.1 | 0.424 | 0.484 | 0.464 | 0.513 | 0.021 | .001 | 0.714 | 0.996 | 1.000 | 1.000 | 0.022 | .001
225 | 0, 0.4 | 0.465 | 0.505 | 0.505 | 0.507 | 0.002 | <.001 | 0.646 | 0.995 | 1.000 | 1.000 | 0.029 | .001
225 | 0, 0.9 | 0.391 | 0.505 | 0.505 | 0.507 | 0.004 | <.001 | 0.663 | 0.996 | 1.000 | 1.000 | 0.024 | .001
300 | 0, 0 | −0.010 | 0.000 | 0.000 | 0.019 | 0.001 | <.001 | 0.607 | 0.996 | 1.000 | 1.000 | 0.023 | .001
300 | 0, 0.1 | 0.464 | 0.502 | 0.505 | 0.510 | 0.011 | <.001 | 0.739 | 0.997 | 1.000 | 1.000 | 0.020 | .001
300 | 0, 0.4 | 0.465 | 0.505 | 0.505 | 0.507 | 0.001 | <.001 | 0.611 | 0.996 | 1.000 | 1.000 | 0.026 | .001
300 | 0, 0.9 | 0.465 | 0.505 | 0.505 | 0.507 | 0.002 | <.001 | 0.610 | 0.995 | 1.000 | 1.000 | 0.027 | .001

Structured data (overdispersed descriptors): Estimates of mean correspondence (ARI¯HA) for each [N × P × (θ 1, θ 2)] configuration were based on S = 1,000 datasets, where correspondence is measured between the clustering solution achieved via DISPROF with UPGMA and the simulated grouping partition. N, total number of objects (n i = number of objects in group i); P, total number of descriptors; θ i, overdispersion for descriptors in group i; μ i, mean value of descriptors in group i; σ, standard deviation of the mean; SE, standard error of the mean. ARI HA values estimate the likelihood of agreement between one randomly selected pair of objects represented in both partitions, corrected for chance, and negative values represent probabilities that are less than would be expected by random chance alone.


Structured data—correlated descriptors (Sim 4)

For all P, when both groups had no correlation structure, G¯ was consistently ~2 and ARI¯HA values were excellent; where only one group had no correlation structure, G¯ increased and ARI¯HA decreased as P increased (Table 7). For all P where the correlation structure for either group was Σ ≥ 0.6 (medium to high), DISPROF produced clustering solutions where G¯ increased with P (Table 7). However, in those same scenarios, ARI¯HA decreased as P increased, and none of the simulation scenarios in Sim 4a or 4b that included any amount of within-group descriptor correlation returned clustering solutions with ARI¯HA ≥ 0.8 for any P ≥ 5.
Table 7

Descriptive statistics for G¯ and ARI¯HA for DISPROF based on Sim 4

Sim 4. G¯ and ARI¯HA: μ 1 = 10, μ 2 = 30, n 1 = n 2 = 25, S = 1,000

P | Σ 1, Σ 2 | G¯ Minimum | G¯ Mean | G¯ Mode | G¯ Maximum | G¯ σ | G¯ SE | ARI¯HA Minimum | ARI¯HA Mean | ARI¯HA Mode | ARI¯HA Maximum | ARI¯HA σ | ARI¯HA SE
2 | 0, 0 | 2.000 | 2.058 | 2.000 | 5.000 | 0.294 | .009 | 0.691 | 0.994 | 1.000 | 1.000 | 0.033 | .001
2 | 0.6, 0.6 | 2.000 | 3.620 | 3.000 | 7.000 | 0.974 | .031 | 0.345 | 0.769 | 1.000 | 1.000 | 0.153 | .005
2 | 0.9, 0.9 | 4.000 | 6.515 | 6.000 | 10.000 | 0.986 | .031 | 0.254 | 0.411 | 0.353 | 0.752 | 0.071 | .002
2 | 0, 0.6 | 2.000 | 2.844 | 3.000 | 6.000 | 0.740 | .023 | 0.413 | 0.881 | 1.000 | 1.000 | 0.111 | .004
2 | 0, 0.9 | 3.000 | 4.343 | 4.000 | 7.000 | 0.743 | .023 | 0.398 | 0.699 | 0.684 | 0.892 | 0.053 | .002
3 | 0, 0 | 2.000 | 2.056 | 2.000 | 4.000 | 0.247 | .008 | 0.731 | 0.995 | 1.000 | 1.000 | 0.027 | .001
3 | 0.6, 0.6 | 2.000 | 4.553 | 4.000 | 10.000 | 1.017 | .032 | 0.326 | 0.637 | 0.505 | 1.000 | 0.136 | .004
3 | 0.9, 0.9 | 5.000 | 7.601 | 8.000 | 11.000 | 1.074 | .034 | 0.193 | 0.341 | 0.306 | 0.562 | 0.058 | .002
3 | 0, 0.6 | 2.000 | 3.349 | 3.000 | 6.000 | 0.744 | .024 | 0.505 | 0.812 | 1.000 | 1.000 | 0.096 | .003
3 | 0, 0.9 | 3.000 | 4.899 | 5.000 | 8.000 | 0.798 | .025 | 0.381 | 0.668 | 0.650 | 0.830 | 0.044 | .001
5 | 0, 0 | 2.000 | 2.064 | 2.000 | 5.000 | 0.307 | .010 | 0.691 | 0.996 | 1.000 | 1.000 | 0.025 | .001
5 | 0.6, 0.6 | 3.000 | 5.335 | 5.000 | 10.000 | 0.988 | .031 | 0.311 | 0.537 | 0.588 | 0.923 | 0.101 | .003
5 | 0.9, 0.9 | 6.000 | 8.943 | 9.000 | 13.000 | 1.077 | .034 | 0.168 | 0.284 | 0.257 | 0.473 | 0.047 | .001
5 | 0, 0.6 | 2.000 | 3.731 | 4.000 | 7.000 | 0.746 | .024 | 0.492 | 0.766 | 0.777 | 1.000 | 0.074 | .002
5 | 0, 0.9 | 4.000 | 5.535 | 5.000 | 9.000 | 0.823 | .026 | 0.365 | 0.640 | 0.630 | 0.783 | 0.036 | .001
10 | 0, 0 | 2.000 | 2.066 | 2.000 | 5.000 | 0.316 | .010 | 0.709 | 0.996 | 1.000 | 1.000 | 0.024 | .001
10 | 0.6, 0.6 | 4.000 | 6.248 | 6.000 | 10.000 | 1.034 | .033 | 0.259 | 0.446 | 0.482 | 0.823 | 0.076 | .002
10 | 0.9, 0.9 | 8.000 | 10.540 | 10.000 | 15.000 | 1.221 | .039 | 0.136 | 0.234 | 0.222 | 0.388 | 0.036 | .001
10 | 0, 0.6 | 3.000 | 4.196 | 4.000 | 8.000 | 0.795 | .025 | 0.462 | 0.719 | 0.731 | 0.925 | 0.053 | .002
10 | 0, 0.9 | 4.000 | 6.407 | 6.000 | 11.000 | 0.908 | .029 | 0.309 | 0.615 | 0.616 | 0.727 | 0.030 | .001
25 | 0, 0 | 2.000 | 2.056 | 2.000 | 6.000 | 0.266 | .008 | 0.729 | 0.997 | 1.000 | 1.000 | 0.014 | .000
25 | 0.6, 0.6 | 5.000 | 7.640 | 7.000 | 12.000 | 1.133 | .036 | 0.205 | 0.355 | 0.326 | 0.588 | 0.059 | .002
25 | 0.9, 0.9 | 8.000 | 12.723 | 13.000 | 17.000 | 1.282 | .041 | 0.120 | 0.185 | 0.161 | 0.309 | 0.029 | .001
25 | 0, 0.6 | 3.000 | 4.911 | 5.000 | 9.000 | 0.788 | .025 | 0.402 | 0.676 | 0.666 | 0.925 | 0.042 | .001
25 | 0, 0.9 | 5.000 | 7.505 | 7.000 | 10.000 | 0.905 | .029 | 0.455 | 0.593 | 0.583 | 0.679 | 0.021 | .001
50 | 0, 0 | 2.000 | 2.068 | 2.000 | 5.000 | 0.302 | .010 | 0.775 | 0.996 | 1.000 | 1.000 | 0.018 | .001
50 | 0.6, 0.6 | 6.000 | 8.792 | 9.000 | 12.000 | 1.197 | .038 | 0.185 | 0.303 | 0.287 | 0.518 | 0.052 | .002
50 | 0.9, 0.9 | 10.000 | 14.368 | 14.000 | 21.000 | 1.468 | .046 | 0.098 | 0.156 | 0.146 | 0.264 | 0.024 | .001
50 | 0, 0.6 | 3.000 | 5.499 | 5.000 | 9.000 | 0.878 | .028 | 0.517 | 0.650 | 0.626 | 0.823 | 0.036 | .001
50 | 0, 0.9 | 5.000 | 8.405 | 8.000 | 14.000 | 1.078 | .034 | 0.393 | 0.578 | 0.573 | 0.646 | 0.021 | .001
150 | 0, 0 | 2.000 | 2.054 | 2.000 | 4.000 | 0.247 | .008 | 0.889 | 0.998 | 1.000 | 1.000 | 0.011 | .000
150 | 0.6, 0.6 | 7.000 | 10.652 | 10.000 | 16.000 | 1.316 | .042 | 0.137 | 0.235 | 0.218 | 0.371 | 0.038 | .001
150 | 0.9, 0.9 | 12.000 | 17.067 | 17.000 | 24.000 | 1.578 | .050 | 0.073 | 0.122 | 0.119 | 0.237 | 0.019 | .001
150 | 0, 0.6 | 4.000 | 6.476 | 6.000 | 10.000 | 0.973 | .031 | 0.492 | 0.616 | 0.616 | 0.731 | 0.027 | .001
150 | 0, 0.9 | 6.000 | 9.766 | 10.000 | 14.000 | 1.166 | .037 | 0.453 | 0.562 | 0.555 | 0.626 | 0.015 | .000
225 | 0, 0 | 2.000 | 2.052 | 2.000 | 6.000 | 0.282 | .009 | 0.716 | 0.997 | 1.000 | 1.000 | 0.016 | .000
225 | 0.6, 0.6 | 8.000 | 11.348 | 11.000 | 16.000 | 1.357 | .043 | 0.131 | 0.217 | 0.208 | 0.328 | 0.035 | .001
225 | 0.9, 0.9 | 14.000 | 18.052 | 18.000 | 23.000 | 1.550 | .049 | 0.076 | 0.112 | 0.110 | 0.186 | 0.017 | .001
225 | 0, 0.6 | 4.000 | 6.769 | 7.000 | 10.000 | 0.963 | .030 | 0.443 | 0.609 | 0.603 | 0.712 | 0.027 | .001
225 | 0, 0.9 | 7.000 | 10.169 | 10.000 | 14.000 | 1.139 | .036 | 0.405 | 0.558 | 0.552 | 0.608 | 0.014 | .000
300 | 0, 0 | 2.000 | 2.053 | 2.000 | 6.000 | 0.317 | .010 | 0.646 | 0.997 | 1.000 | 1.000 | 0.018 | .001
300 | 0.6, 0.6 | 8.000 | 11.973 | 12.000 | 17.000 | 1.342 | .042 | 0.124 | 0.203 | 0.188 | 0.321 | 0.031 | .001
300 | 0.9, 0.9 | 14.000 | 18.726 | 19.000 | 24.000 | 1.659 | .052 | 0.070 | 0.107 | 0.104 | 0.218 | 0.017 | .001
300 | 0, 0.6 | 4.000 | 7.107 | 7.000 | 10.000 | 1.001 | .032 | 0.412 | 0.602 | 0.597 | 0.717 | 0.026 | .001
300 | 0, 0.9 | 7.000 | 10.588 | 11.000 | 14.000 | 1.175 | .037 | 0.378 | 0.555 | 0.552 | 0.616 | 0.015 | .000

Structured data (correlated descriptors): Estimates of the mean number of groups identified (G¯) and mean correspondence (ARI¯HA) for each [N × P × (Σ 1, Σ 2)] configuration were based on S = 1,000 datasets, where correspondence is measured between the clustering solution achieved via DISPROF with UPGMA and the simulated partition. N, total number of objects (n i = number of objects in group i); P, total number of descriptors; Σ i, correlation among descriptors in group i; μ i, mean value of descriptors in group i; σ, standard deviation of the mean; SE, standard error of the mean. ARI HA values estimate the likelihood of agreement between one randomly selected pair of objects represented in both partitions, corrected for chance.

Discussion

The DISPROF algorithm is designed to test the H o that there is “no multivariate structure among objects, with respect to a set of descriptors” in a dataset. The utility of deploying the algorithm with a clustering technique such as UPGMA is in (1) the reduction of arbitrary decision criteria (i.e., dissimilarity thresholds for group identification); (2) the ability to assess multivariate structure at multiple levels of resemblance; (3) the inclusion of the frequentist approach to hypothesis testing; and (4) the application of db multivariate statistical techniques. As such, it is important to determine where UPGMA clustering, with DISPROF implemented as a decision criterion, is affected by changes in data configuration, distribution, dispersion, and correlation. We were particularly interested in statistical error rates associated with DISPROF and the resolution and correspondence of the grouping solutions provided by DISPROF with UPGMA under a variety of potential data scenarios.

Type I error and power of DISPROF

Type I error

When assessing the DISPROF algorithm's H o, there appears to be no effect of distribution type or [N × P] configuration on type I error rates. The mean type I error rates for all [N × P] within each probability distribution type fell within acceptable ranges for the expected number of rejections (α = .05). As DISPROF retained H o at the expected rate for unstructured data, it is reasonable to assume that there is a low likelihood that the underlying probability distribution will impart some sort of unknown grouping structure to the dataset (e.g., where some unwanted noise structure might elevate false positives). This is notable given that these techniques were developed for ecological datasets such as those tested in Sim 1f, but they appear to be applicable to many common data types collected by different lines of scientific inquiry (Tables 1 and 2). However, the behavior of DISPROF in Sim 3a and Sim 4 suggests that further investigation may be required for datasets with high levels of overdispersion or correlation among descriptors. In these cases, misclassification appears to increase along with both θ and Σ, and is exacerbated by increases in P (Tables 6 and 7). These findings are also notable as overdispersion and correlation are two common qualities of ecological datasets.

Power

The power of DISPROF to detect structure in data is generally poor with low‐dimensional (P ≤ 5) multivariate normal data, and with low‐dimensional (P ≤ 10) ecological count data where μ 1 = μ 2, the latter being expected as this configuration can be interpreted as G = 1. As DISPROF performed decidedly better when μ 1 = 10 and μ 2 = 30, it follows that the hypothesis test relies heavily on the location parameter when assigning group membership, and when heterogeneity of groups is only defined by overdispersion the two groups are confounded by the algorithm. A similar response to collocated sets of heterogeneous objects was observed during empirical investigation of ANOSIM and the Mantel test (Anderson & Walsh, 2013). The power of DISPROF improves dramatically once P ≥ 25, and increases with greater separation between groups in hyperdimensional space. With group separation in hyperspace, the power of DISPROF to evaluate H o is unaffected by increasing the overdispersion in ecological data, and the test for structure is able to correctly identify the presence of groups in virtually all simulated datasets where μ 1 = 10 and μ 2 = 30. The presence of correlation structure among the descriptors within any group also has no noticeable effect on the power of DISPROF to detect structure. The power of DISPROF is excellent in most cases and, as Clarke et al. (2008) predicted, its ability to detect structure becomes more powerful as the dimensionality of the descriptors increases, and so we have found their corollary (1) to be supported. A potential explanation for the increase in power observed along with the increases in P may be related to the idea of a group's identity, or the unique combination of numerical values that quantitatively represent a set of objects (i.e., their “fingerprint”). The more descriptors used to quantify an object, the less likely the unique fingerprint that describes that group of similar objects could be re‐created by chance. 
Therefore, during the randomization process of the DISPROF test, and with a large enough P, breaking the structure in the original data is relatively easy to do in order to create the null distribution for the test statistic. This is essentially the overfitting problem in reverse (Babyak, 2004; Hawkins, 2004). This overfitting is appropriate because it essentially creates highly unique observed resemblance profiles to test against for structure, and because no extrapolation or interpolation is based on the overfitted identity. Any unique group identity exposed in the dataset will be similarly overfitted because all objects are represented in the same space of descriptors.

Table caption (Sim 4, structured data with correlated descriptors): Descriptive statistics for the number of groups (G) and the correspondence (ARI) for DISPROF. Estimates of the mean number of groups identified and the mean correspondence for each [N × P × (Σ 1, Σ 2)] configuration were based on S = 1,000 datasets, where correspondence is measured between the clustering solution achieved via DISPROF with UPGMA and the simulated partition. N, total number of objects (n i = number of objects in group i); P, total number of descriptors; Σ i, correlation among descriptors in group i; μ i, mean value of descriptors in group i; σ, standard deviation of the mean; SE, standard error of the mean. ARI values estimate the likelihood of agreement between one randomly selected pair of objects represented in both partitions, corrected for chance.

Resolution and correspondence of DISPROF

If either of the theoretical corollaries presented by Clarke et al. (2008) were to be considered cautionary, it would be corollary (2), which regards the resolution of DISPROF solutions being finer than ecologists (or any professional) utilizing the method could interpret meaningfully. We further contend that the correspondence between these grouping partitions and any known grouping structure in the simulated datasets is informative and is indicative of the DISPROF clustering method's ability to settle on “meaningful” solutions. Therefore, any discussion of the issues surrounding the resolution of the grouping solutions is incomplete without also discussing their correspondence with reality (i.e., “correctness”).

Effect of group locations

The structured data were simulated as either two groups whose location in hyperspace was defined by the progressively decreasing amount of average overlap between the groups’ data clouds (Sim 2), or as two stationary groups whose location was predefined to be the same (Sim 3a) or different (Sim 3b, Sim 4). In all cases, we have demonstrated that when the two groups have higher overlap in hyperspace, the DISPROF algorithm has a tendency to underestimate the number of groups, and often settles on solutions where only a single large group exists. When clustering multivariate normal data, as in Sim 2, the effects of the amount of overlap are overridden by increases in the dimensionality of the dataset (Figure 5a), potentially because of the increase in complexity of the fingerprint for the groups that coincides with the extra dimensions. The result of this override is that even at levels of data overlap that reach as much as 50%, DISPROF clustering is able to detect the correct number of groups in data that have P ≥ 5. However, the correspondence values for those correct numbers of groups do not reach acceptable levels (mean ARI ≥ 0.80) until P ≥ 10 (Figure 5b). Therefore, when clustering multivariate normal data with equal variances, the most reliable resolution and correspondence levels will be achieved with P ≥ 10. The simulated ecological count data showed a profound effect of group location on the resolution and correspondence of the clustering solutions provided by DISPROF. Particularly in cases where the two sets of objects had the same central tendency but different overdispersion structures, and regardless of the number of descriptors in the dataset, DISPROF either underestimated the number of groups (e.g., G mode = 1), or greatly overestimated it (e.g., G mode = 26). This directly contrasts with the performance of DISPROF with ecological count data whose groups are separated in hyperspace. 
In these cases, once again regardless of the number of descriptors, DISPROF performed optimally and identified the correct number of groups, on average, in ecological data, even with high levels of overdispersion. This finding is consistent with those for the multivariate normal data, in that low Ov improved DISPROF's performance as a clustering criterion. High group overlap may negatively affect DISPROF in the same manner as having low numbers of descriptors (P), where the high‐overlap situation allows for group fingerprints that are not unique enough when compared to one another. In this case, the randomization process is unable to break the structure in the datasets and the differences between the mean resemblance profile (representing H o) and the observed profile are negligible (i.e., no structure present); thus, the routine returns a solution that identifies the entire data cloud as one group.
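The randomization logic described above can be sketched in a few lines. The following is a minimal illustration of a resemblance-profile permutation test in the spirit of Clarke et al. (2008), not the authors' implementation: it assumes Euclidean distance as the dissimilarity measure, permutes each descriptor independently across objects to break structure, and uses hypothetical function names and permutation counts.

```python
import numpy as np

def dissimilarity_profile(X):
    """Sorted vector of all pairwise Euclidean distances between objects (rows of X)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    upper = np.triu_indices(X.shape[0], k=1)
    return np.sort(D[upper])

def permute_descriptors(X, rng):
    """Shuffle each descriptor (column) independently across objects."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def profile_test(X, n_mean=99, n_null=99, seed=0):
    """Return the statistic pi and a permutation p-value for 'no multivariate structure'."""
    rng = np.random.default_rng(seed)
    observed = dissimilarity_profile(X)
    # Mean profile under the null, built from one set of permuted datasets.
    mean_profile = np.mean(
        [dissimilarity_profile(permute_descriptors(X, rng)) for _ in range(n_mean)], axis=0)
    pi_obs = np.abs(observed - mean_profile).sum()
    # Null distribution of pi, built from a second, independent set of permutations.
    pi_null = np.array([
        np.abs(dissimilarity_profile(permute_descriptors(X, rng)) - mean_profile).sum()
        for _ in range(n_null)])
    p_value = (1 + np.sum(pi_null >= pi_obs)) / (n_null + 1)
    return pi_obs, p_value

# Illustrative data: two well-separated groups of 10 objects with P = 25 descriptors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (10, 25)), rng.normal(5.0, 1.0, (10, 25))])
pi_obs, p_value = profile_test(X)
print(f"pi = {pi_obs:.1f}, p = {p_value:.3f}")
```

When the groups overlap heavily, the permuted profiles resemble the observed one, π stays small, and the test fails to reject, which is the single-group outcome described in the text.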

Effects of overdispersion among descriptors within groups

The ecological count data used here were simulated so that we could examine the effects of increasing the overdispersion (θ) of G 2 while holding θ 1 = 0. The purpose of this exercise was to increase the relatability of the results to ecological data, as many species composition and abundance datasets are highly overdispersed. Our results indicate that when the groups do not overlap in hyperspace, the effects of the overdispersion of the second group are negligible when considering the resolution of the clustering solutions, but the correspondence of those solutions with reality is unacceptable when P ≤ 10 for data with high overdispersion (θ 2 = 0.9). When the groups are defined by different levels of overdispersion and share a location, the effects of increasing overdispersion become more pronounced and are seemingly amplified by increasing the dimensionality of the dataset being tested. In these cases, the resolution of the solutions is as described previously, but the correspondence levels for the resultant partitions are all inadequate. The point of interest, however, is that the mean ARI values tended to be around 0.5 for clustering scenarios where the overdispersion among descriptors is medium or high (i.e., θ 2 = {0.4, 0.9}) and P ≥ 25 (and for θ 2 = 0.1, the P threshold = 150). This indicates that one group is being identified fairly well and the other is being completely misrepresented by the grouping algorithm. We suspect that the increase in θ 2 causes the numerical fingerprint of the objects within the group to be too dissimilar when only compared to one another, and the result is a series of singleton groups, as the clustering algorithm iteratively works through the UPGMA connection of the overdispersed nodes. It seems as though the effects of overdispersion among ecological count data are secondary to the effects of group location in hyperspace, but supersede those of dataset dimensionality (dimension < overdispersion < location).
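The interpretation of correspondence values near 0.5 can be checked with a small worked example. The sketch below implements the Hubert-Arabie adjusted Rand index from pair counts and applies it to a hypothetical partition in which one group is recovered intact and the other is split entirely into singletons; the group size (n = 50 per group) and label vectors are illustrative assumptions, not values from the simulations.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two partitions of the same objects."""
    n = len(labels_a)
    # Pairs of objects placed together in both partitions, in A only, and in B only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)   # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:                    # degenerate case, e.g. one group in both
        return 1.0
    return (pairs_both - expected) / (max_index - expected)

n = 50                                           # hypothetical objects per group
truth = [0] * n + [1] * n                        # simulated two-group partition
solution = [0] * n + list(range(1, n + 1))       # group 1 intact, group 2 all singletons
ari_half = adjusted_rand_index(truth, solution)
print(round(ari_half, 3))
```

Under these assumptions the index comes out close to 0.5, consistent with the reading that one group is recovered while the other is fragmented.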

Effects of correlation structure among descriptors within groups

Our simulation studies that incorporated different correlation structures among descriptors within groups were also undertaken in an effort to relate our investigations to studies incorporating ecological datasets, which often contain descriptors that are correlated with one another to some degree. We used multivariate normal data in our simulations to ensure that the observed effects of different correlation scenarios were not confounded by some other distributional assumptions. It appears as though medium to high levels of correlation (Σ = {0.6, 0.9}) among descriptors within a group will strongly impact the number of groups identified, and that number tends to increase as Σ increases. Drawing inferences from these clustering results may be dubious, however, because for virtually all clustering solutions that had medium or high correlation among descriptors, regardless of dimension, the mean correspondence was well below acceptable limits. Correlation structure among groups affects the shape of the data cloud in hyperspace. It is interesting to note that DISPROF seems to have an improved ability to detect more “correct” structure in data where the shapes (i.e., correlation structures) of the groups are the same (Σ 1 = Σ 2), as opposed to one group having no correlation structure (i.e., spherical data cloud) and the second group having medium‐to‐large correlations among descriptors (i.e., data cloud distortion). As our simulations only explore medium‐to‐high correlation among all descriptors, it would be of interest to examine low, negative, and mixed correlation structures to describe DISPROF's performance variability under a full range of correlation conditions. The control scenarios, where Σ 1 = Σ 2 = 0, were among the only scenarios that returned reasonable resolution or correspondence results; however, these scenarios effectively recreate a simplified version of those data simulated under Sim 2. 
The overall results suggest that increasing the correlation between descriptors in one group and not the other tends to produce increasingly unreliable grouping partitions, and these results are in line with those from Sim 2, where low P results in low correspondence. One explanation for this might be that as the level of correlation between descriptors increases, the effective size of P decreases. When considering the pairwise dissimilarity between objects, the variability across all correlated descriptors in a group is essentially the same, so datasets with high P and Σ tend to have similar DISPROF clustering dynamics as datasets with low P and no correlation structure.
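One way to make the "effective size of P" idea concrete is to summarize a correlation matrix by the participation ratio of its eigenvalues. This measure is an illustrative assumption introduced here, not a quantity used in the study; the equicorrelation structure (a single correlation value shared by all descriptor pairs) mirrors the Σ settings described above, and the function names are hypothetical.

```python
import numpy as np

def equicorrelation(P, rho):
    """P x P correlation matrix with the same correlation rho between every pair of descriptors."""
    return (1.0 - rho) * np.eye(P) + rho * np.ones((P, P))

def effective_dimension(R):
    """Participation ratio of the eigenvalues: equals P when uncorrelated, approaches 1 when collinear."""
    eigenvalues = np.linalg.eigvalsh(R)
    return eigenvalues.sum() ** 2 / (eigenvalues ** 2).sum()

for rho in (0.0, 0.6, 0.9):
    p_eff = effective_dimension(equicorrelation(25, rho))
    print(f"P = 25, correlation = {rho}: effective dimension ~ {p_eff:.2f}")
```

For an equicorrelation matrix this ratio reduces to P / (1 + (P − 1)ρ²), so with strong correlation a nominally high-dimensional group carries little more independent information than a dataset with only a few descriptors.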

Conclusions

DISPROF as a clustering decision criterion

Strengths of using resemblance profiles as a hypothesis test for multivariate structure are that the type I error rates (1) are within the range of acceptability for α = .05, (2) tend to be binomially distributed around 5%, and (3) are resistant to the effects of both the underlying probability density function and (4) the [N × P] configuration of the data. Additional strengths include the facts that, when μ 1 ≠ μ 2, the power of DISPROF (5) is within the acceptable range for P ≥ 10 and is unaffected (6) by up to 50% average group overlap, (7) by increasing overdispersion among ecological count data, and (8) by increasing correlation structures among descriptors. Finally, (9) the first theoretical corollary proposed by Clarke et al. (2008), that the power of the test for multivariate structure increases as P increases, was confirmed. From a traditional statistical error perspective, it appears that using resemblance profiles is a very effective method for identifying multivariate structure; it rarely identifies structure that is not present and it almost always identifies structure that is present. The weaknesses of using this hypothesis test are mostly related to the second Clarke et al. (2008) corollary, where the resolution of any grouping structure identified may be too fine to interpret meaningfully. The realized power of the resemblance profile hypothesis test comes when it is implemented as a clustering criterion, and success is based upon the partition returned by the algorithm. The resolution of the partition and the solution's correspondence with interpretable multivariate structure in the dataset are ultimately what the researchers will use to explain their theories. The second Clarke et al. (2008) corollary appears to be valid, but it manifests differently depending on the type, configuration, and hyperdimensional structure of the dataset being considered. 
However, if we constrain our analysis to relatively high‐dimensional, low‐correlation datasets where the group locations are separated, then the resolution‐versus‐interpretability concern wanes greatly. The power to detect structure is very high, even with P as low as 10 descriptors, and so it follows that any additional resolution imparted on the solution (which may account for any reduction in correspondence) is likely the result of an actual numerical signal in the dataset, and can be manifest from random (or unmeasured) processes, or error. An alternative explanation may be related to the construction of the null distribution for the test statistic π, where group properties such as location and hyperdimensional shape may preclude the permutation procedure from accurately depicting the null scenario.

Recommendations for using DISPROF (SIMPROF)

The results presented for type I error, power, resolution, and correspondence suggest that using resemblance profiles as a test for multivariate structure, and as a clustering decision criterion, has strengths and weaknesses. The results also highlight pitfalls that can be avoided if particular care is taken prior to implementation of these clustering techniques. The complex interactions between the data type/configuration and the hyperdimensional structure and overlap between groups strongly affect the results achieved when clustering with DISPROF. The method is nonetheless an improvement over traditional UPGMA clustering, most notably due to the removal of the arbitrary and static assignments of resemblance thresholds that define groups of objects. Because the realized power of using resemblance profiles as clustering decision criteria cannot be maximized without making tradeoffs between resolution and correspondence with interpretable structure, we make the following recommendations. Exploratory analysis, such as principal coordinates analysis (PCoA), should be performed to determine, at a minimum, if any hypothesized grouping structures might have high amounts of overlap (i.e., Ov > 50%) in hyperdimensional space, and DISPROF should be avoided in high‐overlap situations. Data clouds that appear to overlap greatly could produce unreliable results and should not be clustered using these methods. Medium‐to‐high correlation (i.e., ≥0.6) among all descriptors should be avoided, and efforts should be made to either reduce or remove the correlated descriptors in a dataset. In an effort to create more parsimonious models, priority should be given to descriptors that are indicative of independent processes, whenever possible. 
In the case of ecological abundance data, where many species are often both of interest and highly correlated, it may be of benefit to use a dimension reduction technique (e.g., PCoA) that produces new orthogonal descriptors, with no correlation structures, prior to clustering with DISPROF. The data dimensionality should be restricted to P ≥ 25 descriptors in order to achieve solutions with ideal resolution and “excellent” correspondence (mean ARI ≥ 0.90) to meaningfully interpretable structure. A less conservative guideline would be to restrict the number of descriptors to P ≥ 10. This new limit retains power, increases the potential for higher resolution solutions, and reduces correspondence from “excellent” to “good” (0.80 ≤ mean ARI < 0.90). Since its initial development and addition to PRIMER‐E (Clarke & Gorley, 2015), the use of resemblance profiles has been gaining traction as a clustering criterion, mostly in the ecological literature. Our results provide recommendations for ecologists to use when applying these methods, and demonstrate the methods’ transferability to other numerical analyses, data types, and disciplines. With a better understanding of the dynamic performance of resemblance profiles as clustering criteria and the potential variability in the results they produce, researchers can more confidently deploy SIMPROF and interpret the results with respect to beta‐diversity, species/environment relationships, or any other complex multivariate model and/or associated hypotheses. While there appear to be clear advantages imparted by the use of resemblance profiles as clustering criteria, there are still many questions that deserve additional attention that were beyond the scope of this evaluation.
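The recommendation to derive orthogonal descriptors with PCoA before clustering can be sketched as follows. This is a minimal classical implementation (double-centering of squared dissimilarities followed by an eigendecomposition), offered as an illustration rather than the authors' MATLAB routine; the function name, the tolerance used to discard non-positive eigenvalues, and the Euclidean test data are assumptions.

```python
import numpy as np

def pcoa(D, n_axes=None):
    """Principal coordinates from a symmetric dissimilarity matrix D (objects x objects)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # Gower double-centering
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Keep only axes with clearly positive eigenvalues (non-Euclidean measures can yield negatives).
    keep = eigenvalues > 1e-10 * abs(eigenvalues[0])
    coords = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])
    if n_axes is not None:
        coords = coords[:, :n_axes]
    return coords, eigenvalues[keep]

# Illustrative data: 12 objects described by 3 strongly correlated descriptors.
rng = np.random.default_rng(0)
base = rng.normal(size=(12, 1))
X = base + 0.3 * rng.normal(size=(12, 3))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
coords, eigenvalues = pcoa(D)
print(np.round(np.corrcoef(coords, rowvar=False), 6))   # off-diagonal values are ~0
```

The resulting coordinate axes are mutually uncorrelated and, for Euclidean input, reproduce the original inter-object distances, so they could stand in for the correlated descriptors ahead of the DISPROF step.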

Conflict of Interest

None declared.

Data Availability

All simulated datasets and analyses performed in MATLAB are publicly available upon request.
References

1. Douglas M Hawkins. The problem of overfitting. J Chem Inf Comput Sci. 2004 Jan-Feb.
2. Michael A Babyak. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med. 2004 May-Jun.
3. Douglas Steinley. Properties of the Hubert-Arabie adjusted Rand index. Psychol Methods. 2004-09.
4. Michael L Parsons; Chelsie J Settlemier; Josh M Ballauer. An examination of the epiphytic nature of Gambierdiscus toxicus, a dinoflagellate involved in ciguatera fish poisoning. Harmful Algae. 2011-09-01.
5. Jack A Gilbert; Dawn Field; Paul Swift; Lindsay Newbold; Anna Oliver; Tim Smyth; Paul J Somerfield; Sue Huse; Ian Joint. The seasonal structure of microbial communities in the Western English Channel. Environ Microbiol. 2009-07-31.
6. Pierre Legendre; Marie-Josée Fortin. Comparison of the Mantel test and alternative approaches for detecting complex multivariate relationships in the spatial analysis of genetic data. Mol Ecol Resour. 2010-05-17.
7. Anastasia S Khodakova; Renee J Smith; Leigh Burgoyne; Damien Abarno; Adrian Linacre. Random whole metagenomic sequencing for forensic discrimination of soils. PLoS One. 2014-08-11.
8. Joshua P Kilborn; David L Jones; Ernst B Peebles; David F Naar. Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure. Ecol Evol. 2017-02-26.
