Abstract
Understanding the inner behaviour of multilayer perceptrons during and after training is a goal of paramount importance for many researchers worldwide. This article experimentally shows that relevant patterns emerge upon training, which are typically related to the underlying problem difficulty. The occurrence of these patterns is highlighted by means of ⟨φ, δ⟩ diagrams, a 2D graphical tool originally devised to support the work of researchers on classifier performance evaluation and on feature assessment. Under the assumption that multilayer perceptrons are powerful engines for feature encoding, hidden layers have been inspected as if they were in fact hosting new input features. Interestingly, there are problems that appear difficult if dealt with using a single hidden layer, whereas they turn out to be easier upon the addition of further layers. The experimental findings reported in this article give further support to the standpoint according to which implementing neural architectures with multiple layers may help to boost their generalisation ability. A generic training strategy inspired by some relevant recommendations of deep learning has also been devised. A basic implementation of this strategy has been used extensively in the experiments aimed at identifying relevant patterns inside multilayer perceptrons. Further experiments performed in a comparative setting have shown that it could be adopted as a viable alternative to the classical backpropagation algorithm.
Year: 2020 PMID: 33288773 PMCID: PMC7721750 DOI: 10.1038/s41598-020-76517-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Class signature of the dataset optdigits, downloaded from the UC Irvine machine learning repository (UCI, hereinafter). Each sample is encoded with an image of B/W pixels, for a total of 1024 binary features. The multiclass problem has been binarized considering the digit 0 as positive category and the remaining digits as negative category. Each point in the diagram represents the “performance” of a feature, considered as an elementary classifier. Feature importance is highlighted by a scale of colours: from red (not relevant) to blue (highly relevant). Intermediate values are represented with yellow, green and light blue, depending on the corresponding feature importance (from lower to higher). Due to the presence of several points with a high value of |δ|, the problem is expected to be easy.
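As an illustration of how a single point of a class signature might be obtained, the sketch below treats one binary feature as an elementary classifier. It assumes δ = sensitivity + specificity − 1 and φ = sensitivity − specificity; these definitions agree with the values tabulated later in this article but are not spelled out in the captions. The function name `phi_delta` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def phi_delta(feature: Sequence[int], labels: Sequence[int]) -> Tuple[float, float]:
    """Treat a binary feature as an elementary classifier and return (phi, delta).

    Assumed definitions: delta = sensitivity + specificity - 1,
    phi = sensitivity - specificity.
    """
    pairs = list(zip(feature, labels))
    tp = sum(1 for f, y in pairs if f == 1 and y == 1)
    fn = sum(1 for f, y in pairs if f == 0 and y == 1)
    tn = sum(1 for f, y in pairs if f == 0 and y == 0)
    fp = sum(1 for f, y in pairs if f == 1 and y == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the labels")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity - specificity, sensitivity + specificity - 1.0


# Toy data (hypothetical): label 1 = positive category, 0 = negative category.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative_pixel = [1, 1, 1, 0, 0, 0, 0, 1]    # mostly agrees with the label
uninformative_pixel = [1, 0, 1, 0, 1, 0, 1, 0]  # independent of the label

# One (phi, delta) point per feature; plotting them gives the class signature.
signature = [phi_delta(f, labels) for f in (informative_pixel, uninformative_pixel)]
```

Under these assumptions, a feature with a large |δ| separates the two categories well on its own, which is why many such points suggest an easy problem.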
Figure 2. Two typical class signatures, for easy and difficult classification problems. In particular, the toy problem kidney-disease is reported at the left-hand side, whereas the (expected to be) difficult problem dota2 is reported at the right-hand side. Both datasets have been downloaded from UCI.
Figure 3. Hidden layer class signatures of MLPs trained on kidney-disease and on dota2 (left- and right-hand side). The diagrams highlight that the neuron outputs follow very different patterns (i.e., success and failure, respectively).
Figure 4. Hidden layer of an MLP trained with different values of momentum on the CNAE-9 dataset (from UCI). The hidden layer at the left-hand side shows that the MLP has been able to generalise, despite the fact that some neurons (i.e., those located at the left- and right-hand corners) operate in saturation, whereas the one at the right-hand side highlights a clear pattern of failure, as all neurons operate in saturation.
Figure 5. Class signature of the arrhythmia dataset (from UCI). The problem is expected to be difficult, as almost all features lie close to the axis; the only exception is the blue point in the upper-right part of the diagram, which shows a small, but not negligible, correlation with the occurrence of arrhythmia. That point corresponds to the binary feature sex={M,F}, which is in accordance with the existing statistics about this disease.
Figure 6. Inner behaviour of an MLP, trained with PT, on the arrhythmia dataset (from UCI). The two hidden layers (with 50 and 20 hidden units, respectively) are entrusted with very different tasks: the one whose class signature is shown at the left-hand side basically performs feature extraction, whereas the other (right-hand side) is responsible for generalisation.
Figure 7. Success patterns of an MLP equipped with three hidden layers (embedding 10, 4 and 2 neurons) trained with BP and PT (left- and right-hand side), on the synthetic xor dataset. Note that the rise of a success pattern is clearer for PT.
Figure 8. Feature extraction performed by an MLP equipped with three hidden layers (with 10, 4, and 2 neurons) and trained with PT on the synthetic xor dataset. The sequence highlights the role of feature extractor played by the MLP. Note that, by construction, each feature taken in isolation (including those used to generate the labelling) is almost completely independent of the class label (upper left-hand diagram).
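The remark that every xor feature is individually uninformative can be checked with a small sketch. The construction below is a stand-in, not the article's generator: it enumerates all binary vectors of a hypothetical length and labels each one with the xor of its first two bits. It also assumes δ = sensitivity + specificity − 1. Because the enumeration is exhaustive, each feature has δ exactly equal to zero here, whereas a random sample would give values only close to zero.

```python
from itertools import product
from typing import List, Sequence


def delta(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Assumed definition: delta = sensitivity + specificity - 1."""
    pos = [f for f, y in zip(feature, labels) if y == 1]
    neg = [f for f, y in zip(feature, labels) if y == 0]
    sensitivity = sum(pos) / len(pos)        # fraction of positives with f == 1
    specificity = 1.0 - sum(neg) / len(neg)  # fraction of negatives with f == 0
    return sensitivity + specificity - 1.0


n_features = 4  # hypothetical size; the first two bits generate the label
samples = list(product([0, 1], repeat=n_features))
labels = [x[0] ^ x[1] for x in samples]

# delta of each feature taken in isolation, including the two generating bits
deltas: List[float] = [
    delta([x[j] for x in samples], labels) for j in range(n_features)
]
```

All entries of `deltas` sit on the δ = 0 line, so any separation visible at the hidden layers has to come from features built by the network rather than from the raw inputs.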
Comparison, with focus on accuracy, between BP and PT applied to non-trivial datasets (all from UCI). For each dataset, 100 training and test sets have been generated by random splitting. On the backpropagation side, experiments have been performed on two kinds of MLPs: one equipped with one hidden layer and the other with a shape identical to the one selected for PT. The best result obtained on the backpropagation side has been retained for each dataset. Results in favour of/against PT are highlighted with black/white circles, whereas results with no significant difference are highlighted with an equal sign. A two-sample Welch’s t-test has been used to check the similarity between the outcomes of different kinds of classifiers. The significance level for p-values has been set to 0.05. Standard deviation is also reported for accuracy. Legend: 1HL/nHL = MLP with one/more than one hidden layer; for each classifier, the three columns report specificity, sensitivity, and accuracy, in this order.
| Dataset | BP (best 1HL/nHL) spec. | BP sens. | BP acc. | PT (nHL) spec. | PT sens. | PT acc. | p-value | Outcome |
|---|---|---|---|---|---|---|---|---|
| Autos | 0.77 | 0.73 | 0.76 ± 0.06 | 0.81 | 0.68 | 0.78 ± 0.04 | 0.0030 | • |
| Bank | 0.78 | 0.74 | 0.77 ± 0.10 | 0.76 | 0.68 | 0.75 ± 0.07 | 0.0515 | = |
| Breast-cancer | 0.76 | 0.45 | 0.67 ± 0.05 | 0.74 | 0.51 | 0.67 ± 0.04 | 0.2024 | = |
| Census | 0.81 | 0.73 | 0.75 ± 0.05 | 0.82 | 0.75 | 0.77 ± 0.04 | 0.0010 | • |
| Connect-4 | 0.71 | 0.69 | 0.70 ± 0.07 | 0.73 | 0.71 | 0.72 ± 0.04 | 0.0049 | • |
| Credit-approval | 0.84 | 0.85 | 0.84 ± 0.02 | 0.85 | 0.85 | 0.85 ± 0.02 | 0.0042 | • |
| Credit-cards | 0.73 | 0.59 | 0.70 ± 0.18 | 0.72 | 0.62 | 0.70 ± 0.16 | 0.4837 | = |
| Heart-disease | 0.78 | 0.82 | 0.80 ± 0.04 | 0.79 | 0.82 | 0.80 ± 0.04 | 0.5000 | = |
| Sonar | 0.83 | 0.81 | 0.82 ± 0.05 | 0.83 | 0.82 | 0.83 ± 0.04 | 0.1426 | = |
| SPECT | 0.77 | 0.71 | 0.72 ± 0.07 | 0.75 | 0.73 | 0.74 ± 0.04 | 0.0272 | • |
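For readers who want to see how such a comparison could be run from the summary figures alone, here is a minimal sketch of Welch's t statistic and the Welch–Satterthwaite degrees of freedom. The function names are hypothetical. The p-value uses a one-sided normal approximation, which is an assumption made only to keep the example dependency-free: the caption does not state whether the test was one- or two-sided, and an exact computation would use the t distribution.

```python
import math
from typing import Tuple


def welch_from_summary(mean_a: float, std_a: float, n_a: int,
                       mean_b: float, std_b: float, n_b: int) -> Tuple[float, float]:
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a  # squared standard error of sample a
    var_b = std_b ** 2 / n_b  # squared standard error of sample b
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


def one_sided_p_normal_approx(t: float) -> float:
    """Upper-tail p-value under a normal approximation (reasonable for large df)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


# Autos row: BP accuracy 0.76 +/- 0.06, PT accuracy 0.78 +/- 0.04, 100 splits each.
t_stat, dof = welch_from_summary(0.76, 0.06, 100, 0.78, 0.04, 100)
p_approx = one_sided_p_normal_approx(t_stat)
significant = p_approx < 0.05  # significance level stated in the caption
```

Welch's variant is the natural choice when the two classifiers show different spreads across splits, since it does not assume equal variances.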
Comparison, with focus on δ (and on φ), between BP and PT applied to non-trivial datasets (all from UCI).
| Dataset | BP (best 1HL/nHL) spec. | BP sens. | BP φ | BP δ | PT (nHL) spec. | PT sens. | PT φ | PT δ | p-value | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| Autos | 0.77 | 0.73 | − 0.04 | 0.50 | 0.81 | 0.68 | − 0.13 | 0.49 | 0.0582 | = |
| Bank | 0.78 | 0.74 | − 0.04 | 0.52 | 0.76 | 0.68 | − 0.08 | 0.44 | 0.0129 | ◦ |
| Breast-cancer | 0.76 | 0.45 | − 0.31 | 0.21 | 0.74 | 0.51 | − 0.24 | 0.25 | 0.0224 | • |
| Census | 0.81 | 0.73 | − 0.08 | 0.54 | 0.82 | 0.75 | − 0.06 | 0.57 | 0.0000 | • |
| Connect-4 | 0.71 | 0.69 | − 0.02 | 0.40 | 0.73 | 0.71 | − 0.02 | 0.44 | 0.0001 | • |
| Credit-approval | 0.84 | 0.85 | 0.01 | 0.69 | 0.85 | 0.85 | − 0.00 | 0.70 | 0.0794 | = |
| Credit-cards | 0.73 | 0.59 | − 0.14 | 0.32 | 0.72 | 0.62 | − 0.10 | 0.34 | 0.1032 | = |
| Heart-disease | 0.78 | 0.82 | 0.04 | 0.60 | 0.79 | 0.82 | 0.02 | 0.61 | 0.4008 | = |
| Sonar | 0.83 | 0.81 | − 0.02 | 0.64 | 0.83 | 0.82 | − 0.01 | 0.65 | 0.1229 | = |
| SPECT | 0.77 | 0.71 | − 0.06 | 0.48 | 0.75 | 0.73 | − 0.01 | 0.48 | 0.4231 | = |
The comparison is in fact focused on unbiased accuracy, i.e., on the accuracy measured as if the datasets were in fact balanced. Results in favour of/against PT are highlighted with black/white circles, whereas results with no significant difference are highlighted with an equal sign. Standard deviation is reported for both δ and φ.
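The relation between unbiased accuracy and the diagram coordinates can be written out explicitly. The derivation below assumes δ = sensitivity + specificity − 1 (consistent with the tabulated values) and introduces two symbols that do not appear in the captions: p for the fraction of positive samples and a_u for unbiased accuracy.

```latex
\begin{align}
a   &= p \cdot \mathrm{sens} + (1 - p) \cdot \mathrm{spec} \\
a_u &= a \,\big|_{p = 1/2}
     = \frac{\mathrm{sens} + \mathrm{spec}}{2}
     = \frac{\delta + 1}{2}
\end{align}
```

For instance, the BP entry for Autos has specificity 0.77 and sensitivity 0.73, hence δ = 0.50 and an unbiased accuracy of 0.75. Comparing classifiers on δ is therefore equivalent to comparing them on unbiased accuracy.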
Comparison, with focus on accuracy, between BP and PT applied to medium-size datasets (all from the Kaggle ML repository). The comparison has been performed using the train-and-test strategy, as often done with datasets of significant size. Non-binary datasets have been binarized beforehand (for the sake of brevity, only the best and the worst cases have been reported for the dataset MNIST). Experimental runs have been repeated 4 times. Standard deviation is also reported for accuracy.
| Dataset | #Samples | #Features | BP (nHL) spec. | BP (nHL) sens. | BP (nHL) acc. | PT (nHL) spec. | PT (nHL) sens. | PT (nHL) acc. |
|---|---|---|---|---|---|---|---|---|
| MNIST[1] (worst) | 70,000 | 784 | 0.96 | 0.96 | | 0.98 | 0.95 | |
| MNIST[2] (best) | 70,000 | 784 | 0.98 | 0.96 | | 0.98 | 0.96 | |
| MNIST[8] (worst) | 70,000 | 784 | 0.98 | 0.96 | | 0.95 | 0.97 | |
| Credit card fraud detection | 284,000 | 30 | 0.99 | 0.88 | | 0.99 | 0.88 | |
| Heart beat categorization | 109,446 | 187 | 0.87 | 0.95 | | 0.83 | 0.96 | |
| Diabetic retinopathy[No_DR] | 3,662 | 150,528 | 0.87 | 0.88 | | 0.91 | 0.88 | |
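The binarization step mentioned in the caption can be sketched as a one-vs-rest relabelling. The helper name `binarize` and the sample labels are hypothetical, and reading the bracket notation (e.g. MNIST[1]) as "the class in brackets is the positive category" is an assumption.

```python
from typing import Hashable, List, Sequence


def binarize(labels: Sequence[Hashable], positive: Hashable) -> List[int]:
    """One-vs-rest: 1 for the chosen positive class, 0 for every other class."""
    return [1 if y == positive else 0 for y in labels]


digits = [0, 1, 8, 1, 2, 7, 1]          # hypothetical multiclass labels
mnist_1 = binarize(digits, positive=1)  # analogous to the MNIST[1] row
```

Each choice of positive class yields a separate binary problem, which is why a multiclass dataset such as MNIST contributes several rows to the table.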
Comparison, with focus on δ (and on φ), between BP and PT applied to medium-size datasets (all from the Kaggle ML repository). The comparison has been performed using the train-and-test strategy. Non-binary datasets have been binarized beforehand (for the sake of brevity, only the best and the worst cases have been reported for the dataset MNIST). Experimental runs have been repeated 4 times. Standard deviation is also reported for both δ and φ.
| Dataset | BP (nHL) spec. | BP (nHL) sens. | BP (nHL) φ | BP (nHL) δ | PT (nHL) spec. | PT (nHL) sens. | PT (nHL) φ | PT (nHL) δ |
|---|---|---|---|---|---|---|---|---|
| MNIST[1] (worst) | 0.96 | 0.96 | | 0.92 | 0.98 | 0.96 | | 0.94 |
| MNIST[2] (best) | 0.98 | 0.96 | | 0.94 | 0.98 | 0.96 | | 0.94 |
| MNIST[8] (worst) | 0.98 | 0.96 | | 0.94 | 0.95 | 0.97 | | 0.92 |
| Credit card fraud detection | 0.99 | 0.88 | | 0.87 | 0.99 | 0.88 | | 0.87 |
| Heart beat categorization | 0.87 | 0.95 | | 0.82 | 0.83 | 0.96 | | 0.79 |
| Diabetic retinopathy[No_DR] | 0.87 | 0.88 | | 0.75 | 0.91 | 0.88 | | 0.79 |
Figure 9. Feature extraction performed by an MLP equipped with four hidden layers (with 40, 30, 20, and 10 neurons) and trained with PT on the Credit Cards Fraud Detection dataset (from the Kaggle ML repository). The sequence highlights that a pattern of success occurs at the first hidden layer and that it is slightly improved at the subsequent layers. The last layer is not reported for the sake of brevity.
Figure 10. Hidden layers of an MLP trained on the dataset WBC (Wisconsin Breast Cancer) from UCI. To give a flavour of the underlying process, PT has been performed on an MLP architecture equipped with four hidden layers, all with the same number of neurons (i.e., 10). The corresponding signatures highlight that the same pattern of generalisation success is duplicated along the hidden layers.
Figure 11. Snapshot of PT applied to an MLP with 240 inputs, one output, and three hidden layers (with 80, 40, and 20 neurons).