Aliyu Usman Ahmad, Andrew Starkey.
Abstract
The effective modelling of high-dimensional data with hundreds to thousands of features remains a challenging task in machine learning. It is a manually intensive process that requires skilled data scientists to apply exploratory data analysis techniques and statistical methods when pre-processing datasets for meaningful analysis with machine learning methods. However, the massive growth of data has created a need for fully automated data analysis methods. One of the key challenges is the accurate selection of relevant features, which can be buried in high-dimensional data alongside irrelevant, noisy features: the aim is to choose a subset of the complete set of input features that predicts the output with accuracy comparable to, or higher than, that of the complete input set. Kohonen's self-organising neural network map (SOM) has been utilised in various ways for this task, for example in the weighted self-organising map (WSOM) approach, and this method is reviewed for its efficacy. The study demonstrates that the WSOM approach can produce different results on different runs on a given dataset, owing to the inappropriate use of the steepest descent optimisation method to minimise the weighted SOM's cost function. An alternative feature weighting approach based on analysis of the SOM after training is presented. The proposed approach allows the SOM to converge before the input relevance is analysed, unlike the WSOM, which applies weighting to the inputs during training; this distorts the SOM's cost function and produces multiple local minima, so the SOM does not consistently converge to the same state. We demonstrate the superiority of the proposed method over the WSOM and a standard SOM in feature selection with improved clustering analysis.
Keywords: Automation; Clustering; Feature selection; Self-organising neural network map
Year: 2017 PMID: 29576689 PMCID: PMC5857284 DOI: 10.1007/s00521-017-3005-9
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.606
Fig. 1 A 2-dimensional self-organising map architecture
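
As a rough illustration of the idea of training a SOM to convergence first and inspecting input relevance only afterwards, the sketch below trains a small 3 × 3 map with NumPy and then measures how much each feature's codebook weight varies across the map nodes. The record does not spell out the post-training analysis used in the paper, so the per-feature spread used here is only a stand-in and should not be read as the authors' method. The helper names (`make_toy_data`, `train_som`, `codebook_spread`), the toy data, and all training parameters are assumptions made for this example.

```python
import numpy as np

def make_toy_data(n_per_class=50, n_noise=3, seed=0):
    """Hypothetical toy data: 5 classes defined by the first 4 features,
    followed by n_noise irrelevant uniform-noise columns."""
    rng = np.random.default_rng(seed)
    centres = np.array([
        [0.1, 0.1, 0.1, 0.1],
        [0.9, 0.9, 0.1, 0.1],
        [0.1, 0.9, 0.9, 0.1],
        [0.9, 0.1, 0.9, 0.9],
        [0.1, 0.1, 0.9, 0.9],
    ])
    relevant = np.repeat(centres, n_per_class, axis=0)
    relevant = relevant + rng.normal(0.0, 0.03, relevant.shape)
    noise = rng.random((len(relevant), n_noise))
    return np.hstack([relevant, noise])

def train_som(data, rows=3, cols=3, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Plain online SOM on a rectangular grid (no input weighting).
    Returns the codebook, one row of weights per map node."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.1    # neighbourhood radius shrinks
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))          # Gaussian neighbourhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def codebook_spread(weights):
    """Assumed relevance proxy: variance of each feature's weight across nodes."""
    return weights.var(axis=0)

if __name__ == "__main__":
    data = make_toy_data()
    spread = codebook_spread(train_som(data))
    for i, value in enumerate(spread):
        print(f"feature {i}: spread {value:.3f}")
```

In this toy setting the four class-defining columns would be expected to show a larger spread than the noise columns, because the codebook averages out inputs that carry no cluster structure. The point of the sketch is only that the analysis runs on a map that has already converged, so the training cost function itself is left untouched.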
Fig. 2 Steepest descent method for a quadratic function
Fig. 3 Problems beyond quadratic functions: the Schwefel function [35]
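
The difficulty that the last two captions point at can be sketched with a one-dimensional Schwefel function. On a quadratic, steepest descent reaches the single minimum from any start; on a multimodal function the end point depends on where the search begins. The code below is a minimal illustration under stated assumptions: the 1-D form of the function, the fixed step size `lr`, the step count, and the starting points are all choices made for this example, and it is not the optimisation routine used in the paper.

```python
import math

def schwefel_1d(x):
    """1-D Schwefel function; its global minimum lies near x = 420.9687."""
    return 418.9829 - x * math.sin(math.sqrt(abs(x)))

def schwefel_1d_grad(x):
    """Analytic derivative of schwefel_1d (taken as 0 at x = 0)."""
    s = math.sqrt(abs(x))
    if s == 0.0:
        return 0.0
    return -(math.sin(s) + 0.5 * s * math.cos(s))

def steepest_descent(x0, lr=1.0, steps=500):
    """Fixed-step steepest descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * schwefel_1d_grad(x)
    return x

if __name__ == "__main__":
    for start in (250.0, 350.0):
        end = steepest_descent(start)
        print(f"start {start:6.1f} -> x = {end:8.3f}, f(x) = {schwefel_1d(end):8.3f}")
```

The two starting points settle in different basins, and only one of them is the global minimum. This is the same kind of start-dependent outcome that the abstract attributes to minimising the weighted SOM's cost function by steepest descent.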
Synthetic and real datasets definition
| Dataset name | Samples | Input features | Classes | Description |
|---|---|---|---|---|
| Synthetic_Data01 | 100 | 4 | 5 | All classes defined by first 4 related features. This is a simple dataset with no irrelevant inputs and outliers, created mainly for exploring the cost functions of the two self-organising algorithms. |
| Synthetic_Data02 | 1220 | 7 | 5 | All classes defined by first 4 related features. Irrelevant inputs are clearly separated from the relevant inputs for easy identification by the algorithms. |
| Synthetic_Data03 | 1220 | 10 | 5 | Classes defined by features independently with equal distribution. In addition to Synthetic_Data02, the definition of classes was distributed among variables, to identify the self-organising method’s ability to identify the degree of relevance of the input features for classification. |
| Synthetic_Data04 | 1220 | 9 | 5 | Classes defined by features independently with unequal distribution. |
| Synthetic_Data05 | 1220 | 7 | 5 | All classes defined by first 4 related features. This dataset was created to evaluate the self-organising system’s performance in identifying irrelevant inputs from |
| WaveForm dataset | 5000 | 40 | 3 | As described by [ |
Performance of clustering methods on Synthetic_Data01
| Clustering Synthetic_Data01 | ||||||
|---|---|---|---|---|---|---|
| Training parameters | Map dimension, 3 × 3 rectangular grid topology | |||||
| Weighted SOM | Standard SOM | |||||
| RUNS | Identified important inputs | Correct classes | Correct classes | Identified important attributes | Correct classes | Correct classes |
| Run 1 | 1/4 | 1/5 | 0/5 | 4/4 | 5/5 | 5/5 |
| Run 2 | 1/4 | 1/5 | 0/5 | 4/4 | 5/5 | 4/5 |
| Run 3 | 1/4 | 1/5 | 1/5 | 4/4 | 5/5 | 5/5 |
| Run 4 | 1/4 | 2/5 | 1/5 | 4/4 | 5/5 | 5/5 |
| Run 5 | 2/4 | 1/5 | 2/5 | 4/4 | 4/5 | 5/5 |
| Run 6 | 1/4 | 0/5 | 0/5 | 4/4 | 5/5 | 5/5 |
| Run 7 | 1/4 | 1/5 | 0/5 | 4/4 | 5/5 | 5/5 |
| Run 8 | 1/4 | 0/5 | 1/5 | 4/4 | 5/5 | 5/5 |
| Run 9 | 1/4 | 2/5 | 1/5 | 4/4 | 5/5 | 5/5 |
| Run 10 | 1/4 | 0/5 | 0/5 | 4/4 | 4/5 | 5/5 |
Performance of clustering methods on Synthetic_Data02
| Clustering Synthetic_Data02 | ||||||
|---|---|---|---|---|---|---|
| Training parameters | Map dimension, 3 × 3 rectangular grid topology | |||||
| Weighted SOM | Standard SOM | |||||
| RUNS | Identified important inputs | Correct classes | Correct classes | Identified important attributes | Correct classes | Correct classes |
| Run 1 | 1/4 | 0/5 | 0/5 | 4/4 | 2/5 | 5/5 |
| Run 2 | 0/4 | 0/5 | – | 4/4 | 1/5 | 5/5 |
| Run 3 | 2/4 | 1/5 | 2/5 | 4/4 | 2/5 | 5/5 |
| Run 4 | 1/4 | 0/5 | 1/5 | 3/4 | 1/5 | 4/5 |
| Run 5 | 2/4 | 0/5 | 1/5 | 4/4 | 1/5 | 5/5 |
| Run 6 | 3/4 | 1/5 | 1/5 | 4/4 | 2/5 | 5/5 |
| Run 7 | 1/4 | 1/5 | 1/5 | 4/4 | 3/5 | 5/5 |
| Run 8 | 2/4 | 0/5 | 2/5 | 4/4 | 2/5 | 5/5 |
| Run 9 | 0/4 | 0/5 | – | 4/4 | 3/5 | 5/5 |
| Run 10 | 1/4 | 0/5 | 1/5 | 4/4 | 2/5 | 5/5 |
Performance of clustering methods on Synthetic_Data03
| Clustering Synthetic_Data03 | ||||||
|---|---|---|---|---|---|---|
| Training parameters | Map dimension, 3 × 3 rectangular grid topology | |||||
| Weighted SOM | Standard SOM | |||||
| RUNS | Identified important inputs | Correct classes | Correct classes | Identified important attributes | Correct classes | Correct classes |
| Run 1 | 0/8 | 0/5 | – | 2/8 | 3/5 | 2/5 |
| Run 2 | 0/8 | 0/5 | – | 4/8 | 1/5 | 3/5 |
| Run 3 | 1/8 | 1/5 | 0/5 | 2/8 | 1/5 | 2/5 |
| Run 4 | 0/8 | 0/5 | – | 3/8 | 0/5 | 2/5 |
| Run 5 | 1/8 | 0/5 | 0/5 | 1/8 | 1/5 | 1/5 |
| Run 6 | 2/8 | 0/5 | 1/5 | 1/8 | 0/5 | 1/5 |
| Run 7 | 1/8 | 1/5 | 1/5 | 5/8 | 1/5 | 4/5 |
| Run 8 | 1/8 | 0/5 | 0/5 | 1/8 | 2/5 | 2/5 |
| Run 9 | 0/8 | 0/5 | – | 2/8 | 1/5 | 2/5 |
| Run 10 | 1/8 | 0/5 | 0/5 | 2/8 | 1/5 | 2/5 |
Performance of clustering methods on Synthetic_Data04
| Clustering Synthetic_Data04 | ||||||
|---|---|---|---|---|---|---|
| Training parameters | Map dimension, 3 × 3 rectangular grid topology | |||||
| Weighted SOM | Standard SOM | |||||
| RUNS | Identified important inputs | Correct classes | Correct classes | Identified important attributes | Correct classes | Correct classes |
| Run 1 | 2/7 | 1/5 | 1/5 | 4/7 | 3/5 | 2/5 |
| Run 2 | 1/7 | 1/5 | 0/5 | 4/7 | 2/5 | 3/5 |
| Run 3 | 1/7 | 1/5 | 0/5 | 2/7 | 1/5 | 1/5 |
| Run 4 | 1/7 | 1/5 | 0/5 | 4/7 | 4/5 | 2/5 |
| Run 5 | 1/7 | 1/5 | 0/5 | 3/7 | 2/5 | 2/5 |
| Run 6 | 0/7 | 0/5 | – | 4/7 | 2/5 | 3/5 |
| Run 7 | 2/7 | 1/5 | 1/5 | 4/7 | 2/5 | 3/5 |
| Run 8 | 1/7 | 1/5 | 0/5 | 2/7 | 1/5 | 1/5 |
| Run 9 | 1/7 | 1/5 | 0/5 | 2/7 | 1/5 | 1/5 |
| Run 10 | 1/7 | 1/5 | 0/5 | 4/7 | 2/5 | 3/5 |
Performance of clustering methods on Synthetic_Data05
| Clustering Synthetic_Data05 | ||||||
|---|---|---|---|---|---|---|
| Training parameters | Map dimension, 3 × 3 rectangular grid topology | |||||
| Weighted SOM | Standard SOM | |||||
| RUNS | Identified important inputs | Correct classes | Correct classes | Identified important attributes | Correct classes | Correct classes |
| Run 1 | 2/4 | 0/5 | 1/5 | 3/4 | 3/5 | 5/5 |
| Run 2 | 0/4 | 0/5 | – | 4/4 | 2/5 | 5/5 |
| Run 3 | 1/4 | 1/5 | 2/5 | 4/4 | 2/5 | 5/5 |
| Run 4 | 1/4 | 0/5 | 1/5 | 4/4 | 2/5 | 5/5 |
| Run 5 | 1/4 | 0/5 | 2/5 | 4/4 | 2/5 | 5/5 |
| Run 6 | 0/4 | 1/5 | – | 4/4 | 2/5 | 5/5 |
| Run 7 | 2/4 | 1/5 | 0/5 | 4/4 | 2/5 | 5/5 |
| Run 8 | 1/4 | 1/5 | 1/5 | 4/4 | 2/5 | 5/5 |
| Run 9 | 0/4 | 1/5 | – | 4/4 | 3/5 | 5/5 |
| Run 10 | 1/4 | 1/5 | 0/5 | 3/4 | 2/5 | 5/5 |
Performance of clustering methods for waveform data
| Clustering waveform data | |||||
|---|---|---|---|---|---|
| Training parameters | Map dimension, 26 × 14 rectangular grid topology | ||||
| Weighted SOM | Standard SOM | ||||
| RUNS | Identified important inputs | Correct classes | Identified important attributes | Correct classes | Correct classes |
| Run 1 | 2/20 | 1/3 | 18/20 | 1/3 | 2/3 |
| Run 2 | 9/20 | 1/3 | 18/20 | 1/3 | 3/3 |
| Run 3 | 19/20 | 2/3 | 15/20 | 2/3 | 2/3 |
| Run 4 | 11/20 | 1/3 | 18/20 | 1/3 | 3/3 |
| Run 5 | 5/20 | 1/3 | 18/20 | 1/3 | 3/3 |
| Run 6 | 19/20 | 2/3 | 18/20 | 2/3 | 2/3 |
| Run 7 | 15/20 | 2/3 | 19/20 | 1/3 | 3/3 |
| Run 8 | 2/20 | 1/3 | 18/20 | 1/3 | 2/3 |
| Run 9 | 16/20 | 2/3 | 17/20 | 1/3 | 2/3 |
| Run 10 | 11/20 | 1/3 | 18/20 | 1/3 | 3/3 |
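
The run-to-run variability in the waveform table can be summarised with a few lines of arithmetic. The sketch below parses the `k/20` cells of the "Identified important inputs" and "Identified important attributes" columns, copied from the ten rows above, and reports the mean and population standard deviation of each. The helper names `to_floats` and `summarise` are made up for this example, and the choice of population standard deviation is an assumption rather than a statistic reported in the source.

```python
from fractions import Fraction
from statistics import mean, pstdev

def to_floats(cells):
    """Parse table cells such as '9/20' into floats."""
    return [float(Fraction(cell)) for cell in cells]

def summarise(cells):
    """Mean and population standard deviation of a column of 'k/n' cells."""
    values = to_floats(cells)
    return {"mean": mean(values), "std": pstdev(values)}

# Columns copied from the waveform table, runs 1 to 10.
identified_inputs = ["2/20", "9/20", "19/20", "11/20", "5/20",
                     "19/20", "15/20", "2/20", "16/20", "11/20"]
identified_attributes = ["18/20", "18/20", "15/20", "18/20", "18/20",
                         "18/20", "19/20", "18/20", "17/20", "18/20"]

if __name__ == "__main__":
    for name, column in [("Identified important inputs", identified_inputs),
                         ("Identified important attributes", identified_attributes)]:
        stats = summarise(column)
        print(f"{name}: mean {stats['mean']:.3f}, std {stats['std']:.3f}")
```

A large standard deviation across runs on the same data is one simple way to quantify the inconsistency between runs that the abstract describes.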
Fig. 4 Correlation matrix for multiple runs of standard SOM on Synthetic_Data01
Fig. 5 Correlation matrix for multiple runs of weighted SOM on Synthetic_Data01