Michael C Thrun, Alfred Ultsch.
Abstract
The Fundamental Clustering Problems Suite (FCPS) offers a variety of clustering challenges that any algorithm should be able to handle given real-world data. The FCPS consists of datasets with known a priori classifications that are to be reproduced by the algorithm. The datasets are intentionally created to be visualized in two or three dimensions under the hypothesis that objects can be grouped unambiguously by the human eye. Each dataset represents a certain problem that can be solved by known clustering algorithms with varying success. In the R package "Fundamental Clustering Problems Suite" on CRAN, user-defined sample sizes can be drawn for the FCPS. Additionally, the distances of two high-dimensional datasets called Leukemia and Tetragonula are provided here. This collection is useful for investigating the shortcomings of clustering algorithms and the limitations of dimensionality reduction methods in the case of three-dimensional or higher datasets. This article is a simultaneous co-submission with Swarm Intelligence for Self-Organized Clustering [1].
Keywords: Cluster analysis; Dimensionality reduction; Pattern recognition; Projection methods
Year: 2020 PMID: 32373681 PMCID: PMC7195520 DOI: 10.1016/j.dib.2020.105501
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Summary of the descriptions and challenges of the 12 datasets for cluster analysis and, for the datasets that are not two-dimensional, for projection methods.
| Name of Dataset | Short Description of Shape or Content | Challenge |
|---|---|---|
| Atom | Core enclosed by hull | Completely overlapping convex hull |
| Chainlink | Two intertwined chains | Linearly nonseparable entanglements |
| EngyTime | Two Gaussian mixtures with different variance | Overlapping clusters separable only by density |
| GolfBall | Empty sphere | No distance-based cluster structures |
| Hepta | Six balls, each centered at one of the six corners of a large octahedron, and a seventh ball with a higher density at the center | Nonoverlapping convex hulls with varying intracluster distances |
| Lsun3D | One full sphere, two bricks at a perpendicular angle to each other, and outliers | Varying geometric shapes with noise defined by one group of outliers |
| Target | Circular disk enclosed by a circle with outliers in four corners | Overlapping convex hulls combined with noise defined by four groups of outliers |
| Tetra | Four close full spheres at the four corners of a tetrahedron | Narrow distances between the clusters |
| TwoDiamonds | Two rhombs with one touching corner | Identification of the weak link in chain-like connected clusters |
| WingNut | Two rectangles, each having a density that increases towards one corner in the direction of the other rectangle | Short intercluster distances combined with vast intracluster distances |
| Tetragonula | Distance matrix that is easily associated with the geographic origins of the cases | Smooth transition between clusters and outliers; clusters have to be coherent with the geographic origins |
| Leukemia | Distance matrix that is easily associated with the patient diagnoses of the cases | Reproducing highly unbalanced class sizes |
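
Each dataset in the table comes with a predefined classification that a clustering is expected to reproduce. Because cluster labels are arbitrary, a comparison first has to match clusters to classes. The following is a minimal sketch of one possible score, assuming integer labels and a small number of classes; the function name `best_match_accuracy` is hypothetical and is not part of the FCPS package.

```python
from itertools import permutations


def best_match_accuracy(prior, found):
    """Share of cases that agree with the prior classification under the
    best one-to-one relabeling of the clusters (brute force)."""
    if len(prior) != len(found):
        raise ValueError("label vectors must have the same length")
    if not prior:
        raise ValueError("label vectors must not be empty")

    classes = sorted(set(prior))
    clusters = sorted(set(found))

    # Pad the class list with None so that every cluster can be assigned
    # something even if there are more clusters than classes.
    size = max(len(classes), len(clusters))
    padded = classes + [None] * (size - len(classes))

    best_hits = 0
    # Try every one-to-one assignment of clusters to (padded) classes.
    for assignment in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        hits = sum(1 for p, f in zip(prior, found) if mapping[f] == p)
        best_hits = max(best_hits, hits)
    return best_hits / len(prior)
```

The exhaustive search grows factorially with the number of labels, so it is only reasonable for a handful of classes; a larger problem would call for an assignment solver instead.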
Fig. 1. Visualization of the Atom dataset of a core enclosed by a hull. The predefined classification is indicated by color.
Fig. 2. Visualization of the Chainlink dataset of two intertwined chains. The predefined classification is indicated by color.
Fig. 3. Visualization of the EngyTime dataset of two Gaussian mixtures with different variance. The predefined classification is indicated by color.
Fig. 4. Visualization of the GolfBall dataset of an empty sphere. The predefined classification is indicated by color.
Fig. 5. Visualization of the Hepta dataset of six balls with their centers at the six corners of a large octahedron and a 7th ball with a higher density at the center in magenta. The predefined classification is indicated by color.
Fig. 6. Visualization of the Lsun3D dataset of one full sphere, two bricks at a perpendicular angle to each other, and outliers in red. The predefined classification is indicated by color.
Fig. 7. Visualization of the Target dataset of a circular disk enclosed by a circle with outliers in four corners. The predefined classification is indicated by color.
Fig. 8. Visualization of the Tetra dataset of four large full spheres close to each other centering at the four corners of a tetrahedron. The predefined classification is indicated by color.
Fig. 9. Visualization of the TwoDiamonds dataset of two rhombs with one touching corner. The predefined classification is indicated by color.
Fig. 10. Visualization of the WingNut dataset of two rectangles, each having a density that increases towards one corner in the direction of the other rectangle. The predefined classification is indicated by color.
Fig. 11. Heatmap of the distances in the Tetragonula dataset. The distances are not sorted. A high-dimensional distance structure is visible. Any clustering should have smaller intracluster than intercluster distances while remaining coherent with the geographic origins.
Fig. 12. Heatmap of the distances in the Leukemia dataset with four highly unbalanced class sizes. The prior classification defines the order of the distances. The high-dimensional distance structure is defined by the classification and two outliers are visible.
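
The requirement that a clustering should have smaller intracluster than intercluster distances can be checked directly on a distance matrix. The sketch below is one simple way to do so, assuming a symmetric square matrix given as a list of lists and one label per case; the function name `mean_intra_inter` is hypothetical, and comparing plain means is an illustrative choice, not the criterion used by the authors.

```python
def mean_intra_inter(dist, labels):
    """Return (mean intracluster distance, mean intercluster distance)
    over all unordered pairs of distinct cases."""
    n = len(labels)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("dist must be an n x n matrix matching labels")

    intra_sum, intra_count = 0.0, 0
    inter_sum, inter_count = 0.0, 0
    # Visit the upper triangle only, since the matrix is assumed symmetric.
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                intra_sum += dist[i][j]
                intra_count += 1
            else:
                inter_sum += dist[i][j]
                inter_count += 1

    # NaN signals that no pair of that kind exists (e.g. a single cluster).
    intra = intra_sum / intra_count if intra_count else float("nan")
    inter = inter_sum / inter_count if inter_count else float("nan")
    return intra, inter
```

A clustering that satisfies the requirement in this averaged sense returns a first value smaller than the second. Coherence with the geographic origins is a separate check that this sketch does not cover.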
| Subject | Computer Science |
| Specific subject area | Unsupervised Machine Learning |
| Type of data | All files are ASCII text files with TAB-separated columns. Headers are included. *.lrn files contain the data, including a unique key for each case; *.cls files contain the keys and class labels. Each class is indicated by a positive number. For Tetragonula, the geographic coordinates are included as a separate *.lrn file. |
| How data were acquired | Artificially, except for the two high-dimensional datasets Leukemia and Tetragonula. For these two datasets, the distance matrices and Databionic swarm clusterings are included. |
| Data format | FCPS: Raw; High-dimensional datasets: Preprocessed. |
| Parameters for data collection | For artificial datasets, none; for Leukemia and Tetragonula, please see below. |
| Description of data collection | For artificial datasets, none; for Leukemia and Tetragonula, please see below. |
| Data source location | For artificial datasets, none; for Leukemia and Tetragonula, please see below. |
| Data accessibility | FCPS In R: |
| Related research article | Co-submission of the revision of M. C. Thrun and A. Ultsch, “Swarm Intelligence for Self-Organized Clustering,” Journal of Artificial Intelligence, in press, DOI: |
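
The file layout described under "Type of data" (TAB-separated text, a unique key per case in the *.lrn file, keys and class labels in the *.cls file) suggests joining the two files on the key. The following is a minimal sketch of such a join. It rests on several assumptions that the description does not confirm: header lines start with `%`, the key is an integer in the first column, and the class label is the second column of the *.cls file. The function names are hypothetical.

```python
def read_keyed_rows(lines, header_prefix="%"):
    """Parse TAB-separated lines into {key: [remaining fields]}.
    Lines starting with header_prefix are skipped (assumed header format)."""
    rows = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith(header_prefix):
            continue
        fields = line.split("\t")
        rows[int(fields[0])] = fields[1:]  # first column is the case key
    return rows


def join_data_and_classes(lrn_lines, cls_lines, header_prefix="%"):
    """Join data rows and class labels on the case key.
    Returns {key: (list of float features, int class label)}."""
    data = read_keyed_rows(lrn_lines, header_prefix)
    classes = read_keyed_rows(cls_lines, header_prefix)
    joined = {}
    for key, values in data.items():
        if key not in classes:
            raise KeyError(f"no class label for key {key}")
        joined[key] = ([float(v) for v in values], int(classes[key][0]))
    return joined
```

Both functions accept any iterable of lines, so an open file handle or `text.splitlines()` works equally well. Joining on the key rather than on row position keeps the result correct even if the two files list the cases in a different order.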