Mohamed Fakhfakh, Lotfi Chaari, Bassem Bouaziz, Faiez Gargouri.
Abstract
Artificial neural networks (ANNs) are widely used in supervised machine learning to analyze signals or images in many applications. Given an annotated learning database, one of the main challenges is to optimize the network weights. A large body of work has addressed this optimization problem or improved optimization methods for machine learning, including gradient-based, Newton-type, and meta-heuristic methods. For the sake of efficiency, regularization is generally used. When non-smooth regularizers such as the ℓ1 norm are used, especially to promote sparse networks, this optimization becomes challenging because the target criterion is non-differentiable. In this paper, we propose an MCMC-based optimization scheme formulated in a Bayesian framework. The proposed scheme solves the above-mentioned sparse optimization problem using an efficient sampling scheme and Hamiltonian dynamics. The designed optimizer is evaluated on four datasets, and the results are verified through a comparative study with two CNNs. Promising results show that the proposed method allows ANNs, even with low complexity levels, to reach high accuracy rates of up to 94%. The proposed method is also faster and more robust against overfitting. More importantly, its training step is much faster than that of all competing algorithms.
Keywords: Artificial neural networks; Hamiltonian dynamics; Machine learning; Optimization
Year: 2022 PMID: 35789599 PMCID: PMC9244188 DOI: 10.1007/s12652-022-04073-8
Source DB: PubMed Journal: J Ambient Intell Humaniz Comput
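The abstract describes weight optimization via MCMC sampling with Hamiltonian dynamics. As background, the following is a minimal sketch of one generic Hamiltonian Monte Carlo (leapfrog) update for a parameter vector; it illustrates the underlying mechanism only and is not the authors' ns-HMC scheme for the non-smooth ℓ1 term. The function names, step size, and number of leapfrog steps are our own choices.

```python
import numpy as np

def hmc_step(theta, potential, grad_potential, step_size=0.01, n_leapfrog=20, rng=None):
    """One generic HMC update: sample a momentum, simulate Hamiltonian
    dynamics with the leapfrog integrator, then accept/reject."""
    rng = rng or np.random.default_rng()
    momentum = rng.standard_normal(theta.shape)

    # Current Hamiltonian = potential energy + kinetic energy
    current_h = potential(theta) + 0.5 * np.sum(momentum ** 2)

    # Leapfrog integration of the Hamiltonian dynamics
    q, p = theta.copy(), momentum.copy()
    p -= 0.5 * step_size * grad_potential(q)
    for _ in range(n_leapfrog - 1):
        q += step_size * p
        p -= step_size * grad_potential(q)
    q += step_size * p
    p -= 0.5 * step_size * grad_potential(q)

    # Metropolis accept/reject based on the change in the Hamiltonian
    proposed_h = potential(q) + 0.5 * np.sum(p ** 2)
    if rng.random() < np.exp(current_h - proposed_h):
        return q      # accept the proposal
    return theta      # reject: keep the current state
```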
Summary of optimization methods
| Categories | Method | Years | Purpose | Advantages | Disadvantages |
|---|---|---|---|---|---|
| First-order | SGD (Robbins and Monro) | 2013 | The update parameters are calculated using a randomly sampled mini-batch. The method converges at a sublinear rate | The computational time per update does not depend on the total number of training samples | Setting an appropriate learning rate is difficult. The solution may be trapped at a saddle point in some cases |
| | Adam (Kingma and Ba) | 2014 | Dynamically adjusts the learning rate of each parameter using first- and second-order moment estimates of the gradient | Stable gradient descent process. Suitable for most non-convex optimization problems with large datasets and high-dimensional spaces | The method may not converge in some cases |
| | Adadelta (Zeiler) | 2012 | Replaces total gradient accumulation with an exponential moving average | Improves the ineffective learning problem in the late stage of AdaGrad. Suitable for optimizing non-stationary and non-convex problems | In the late training stage, the update process may oscillate around a local minimum |
| | Adamax (Kingma and Ba) | 2017 | Generalization of Adam based on adaptive lower-order moment estimation | The infinite-order norm makes the algorithm stable | The penalty parameter is related to both the original and dual residuals, whose value is difficult to determine |
| High-order | Newton's method (Avriel) | 2003 | Computes the inverse of the Hessian matrix to obtain faster convergence than first-order approaches | Faster convergence than first-order gradient methods. Quadratic convergence under certain conditions | Long computing time and large storage space at each iteration |
| | Quasi-Newton method (Nocedal and Wright) | 2006 | Uses an approximation of the Hessian matrix or of its inverse | No need to compute the inverse of the Hessian matrix, which reduces the computing time. Superlinear convergence in most cases | Large storage space: not suitable for large-scale problems |
| | Hessian-free (HF) method (Martens) | 2010 | Sub-optimization with the conjugate gradient: avoids computing the inverse Hessian matrix | Second-order gradient information can be used without computing Hessian matrices directly. Suitable for high-dimensional optimization | The cost of the matrix-vector product grows linearly with the training data. Not appropriate for large-scale problems |
| | Stochastic Quasi-Newton method (Bottou et al.) | 2018 | Employs stochastic optimization techniques, e.g. online-LBFGS (Schraudolph et al.) | Can handle large-scale problems | More complex than the stochastic gradient method |
| Derivative-free "meta-heuristic" | IWT (Khishe and Mosavi) | 2019 | Uses a suitable spiral shape inspired by the humpback whale to improve the exploitation phase of the standard whale optimization algorithm | Stronger global search ability. Can effectively solve complex constrained optimization problems | Slow convergence and easily falls into local optima |
| | SSA (Khishe and Mohammadi) | 2019 | A bio-inspired optimization algorithm based on the swarming mechanism of salps, aimed at enhancing the accuracy and reliability of the solution | Faster to execute because of its lower complexity. Improved capability of avoiding local minima | May get stuck in a local region, failing to reach the global optimum |
| | DA (Khishe and Safari) | 2019 | Inspired by the dynamic and static swarming behaviors of dragonflies; resolves local-optima stagnation when solving challenging problems | Simple and easy to implement. Few parameters to tune | No internal memory, which can lead to premature convergence to a local optimum |
| | ABGSA (Mosavi et al.) | 2019 | Addresses poor classification accuracy, local minima, and low convergence speed for the Multi-Layer Perceptron Neural Network | Reduced complexity and processing time | Unaffordable sampling rate. Difficult because of randomness. Strongly influenced by the initial solution |
| Federated optimization | Fuzzy consensus (Połap) | 2021 | FL tasks learn a single global model that minimizes the empirical risk function over the entire training dataset. The authors extend FL with a fuzzy consensus method to improve large-scale group decision-making (LSGDM) | In practice, quick implementation and classification of samples even during the training process. Capable of providing effective solutions to complex problems | Adapting centralized training workflows such as hyperparameter tuning and interpretability tasks to the federated learning setting presents roadblocks to the widespread adoption of FL in practical settings |
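As a concrete reference for the first-order entries in the table above, the following sketch implements the textbook Adam update with bias-corrected first- and second-moment estimates; the hyperparameter defaults shown are the commonly used ones, not values taken from the paper.

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: exponential moving averages of the gradient (m)
    and of its element-wise square (v), with bias correction at step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```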
Details of the datasets used.
| Dataset | Training set | Test set | # Classes |
|---|---|---|---|
| CT images for simple classification | 1210 | 430 | 2 |
| CT images for challenging classification | 566 | 180 | 2 |
| Fashion-MNIST | 48,000 | 12,000 | 10 |
| CIFAR-10 | 50,000 | 10,000 | 10 |
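For orientation, the sketch below loads the two public datasets listed above, assuming tensorflow.keras.datasets and scikit-learn are available; the 48,000/12,000 Fashion-MNIST figures are consistent with an 80/20 split of the standard 60,000-image training set, which is our reading rather than a detail stated in this record. The CT image sets are the authors' own data and are not loaded here.

```python
from tensorflow.keras.datasets import fashion_mnist, cifar10
from sklearn.model_selection import train_test_split

# Fashion-MNIST ships with 60,000 training images; holding out 20% for
# evaluation reproduces the 48,000 / 12,000 split listed in the table.
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=0)

# CIFAR-10 already matches the table: 50,000 training / 10,000 test images.
(x_train_c, y_train_c), (x_test_c, y_test_c) = cifar10.load_data()
```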
Parameter settings for the benchmark algorithms
| Algorithm | Parameters | Description | Values |
|---|---|---|---|
| MH | σ | Standard deviation of the proposal normal distribution | 3 |
| rw-MH | σ | Standard deviation of the proposal normal distribution | 5 |
| Adam | lr | Learning rate | |
| | β₁ | Exponential decay rate for the 1st moment estimates | 0.9 |
| | β₂ | Exponential decay rate for the 2nd moment estimates | 0.999 |
| | ε | Numerical stability constant | 1e |
| SGD | lr | Learning rate | |
| | momentum | Acceleration rate | 0.8 |
| | decay | Learning rate decay over each update | 1e |
| Adadelta | lr | Learning rate | |
| | rho | Decay rate | 0.95 |
| | ε | Numerical stability constant | 1e |
| Adamax | lr | Learning rate | |
| | β₁ | Exponential decay rate for the 1st moment estimates | 0.9 |
| | β₂ | Exponential decay rate for the 2nd moment estimates | 0.999 |
| | ε | Numerical stability constant | 1e |
| IWT | minv | Lower bound | |
| | maxv | Upper bound | 2 |
| | size | Number of particles | 30 |
| | p_s | Spiral parameter | 3 |
| DA | minv | Lower bound | |
| | maxv | Upper bound | 2 |
| | size | Number of particles | 15 |
| SSA | minv | Lower bound | |
| | maxv | Upper bound | 2 |
| | size | Number of particles | 30 |
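The gradient-based settings above map onto standard Keras optimizer constructors; the sketch below is an illustration assuming TensorFlow/Keras, with a placeholder learning rate because the learning-rate values (and the truncated stability constants) did not survive in the table as extracted.

```python
from tensorflow.keras import optimizers

# Placeholder: the learning-rate values in the table above are missing.
lr = 1e-3

adam     = optimizers.Adam(learning_rate=lr, beta_1=0.9, beta_2=0.999)
sgd      = optimizers.SGD(learning_rate=lr, momentum=0.8)  # the table also lists a per-update decay
adadelta = optimizers.Adadelta(learning_rate=lr, rho=0.95)
adamax   = optimizers.Adamax(learning_rate=lr, beta_1=0.9, beta_2=0.999)
```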
ConvNet architectures with regularization techniques
| CNN_1 | CNN_2 |
|---|---|
| Conv3 | Conv3 |
| BatchNormalization | BatchNormalization |
| MaxPool 2 | Conv3 |
| Dropout(0.2) | BatchNormalization |
| | MaxPool 2 |
| | Dropout(0.2) |
| Conv3 | Conv3 |
| BatchNormalization | BatchNormalization |
| MaxPool 2 | Conv3 |
| Dropout(0.3) | BatchNormalization |
| | MaxPool 2 |
| | Dropout(0.3) |
| Conv3 | Conv3 |
| BatchNormalization | BatchNormalization |
| MaxPool 2 | Conv3 |
| Dropout(0.4) | BatchNormalization |
| | MaxPool 2 |
| | Dropout(0.4) |
| Flattening | Flattening |
| FC-64 | FC-128 |
| Dropout(0.3) | Dropout(0.3) |
| | FC-64 |
| | Dropout(0.2) |
| FC-softmax | FC-softmax |
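A minimal Keras sketch of CNN_1 as listed in the table above; the table specifies only the layer types, pooling size, and dropout rates, so the filter counts, activation, input shape, and number of classes used here are assumptions for illustration.

```python
from tensorflow.keras import layers, models, Input

def build_cnn_1(input_shape=(32, 32, 3), n_classes=2, filters=(32, 64, 128)):
    """CNN_1 from the table: three Conv3-BatchNorm-MaxPool-Dropout blocks,
    then a 64-unit dense layer and a softmax output.  Filter counts and
    input shape are assumed, not taken from the paper."""
    model = models.Sequential()
    model.add(Input(shape=input_shape))
    for n_filters, rate in zip(filters, (0.2, 0.3, 0.4)):
        model.add(layers.Conv2D(n_filters, 3, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(2))
        model.add(layers.Dropout(rate))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```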
Accuracy, sensitivity, specificity, computational time (in min), and norms of the estimated weights for CNN_1 using Adam and the proposed method, with different values of the regularization parameter
| Optimizer | Weight norm (1) | Weight norm (2) | Acc. (%) | Time (min) | Sens. (%) | Spec. (%) |
|---|---|---|---|---|---|---|
| ns-HMC | 149,113 | 48,988 | 89.68 | 37.28 | 88.21 | 86.95 |
| | 152,904 | 51,232 | 89.42 | 38.47 | 87.81 | 86.11 |
| Adam | 180,513 | 51,727 | 86.91 | 52.12 | 84.36 | 81.25 |
| | 191,229 | 67,732 | 85.34 | 54.91 | 83.14 | 80.66 |
| | 189,075 | 58,823 | 85.49 | 53.85 | 84.09 | 82.49 |
Bold values highlight the best results obtained by our approach compared with the competing algorithms
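For reference, the accuracy, sensitivity, and specificity reported throughout these tables can be computed from a binary confusion matrix as in the sketch below; the function and variable names are ours, not from the paper.

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives) and specificity
    (recall on negatives) from binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity
```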
Experiment 1: Results for CT image classification using CNN_1 and CNN_2 (Computational time in min, accuracy, loss, sensitivity and specificity)
| Optimizers | CNN_1 Time (min) | CNN_1 Acc. | CNN_1 Loss | CNN_1 Sens. | CNN_1 Spec. | CNN_2 Time (min) | CNN_2 Acc. | CNN_2 Loss | CNN_2 Sens. | CNN_2 Spec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MH | 79.2 | 0.84 | 0.18 | 0.81 | 0.77 | 133.8 | 0.86 | 0.16 | 0.85 | 0.80 |
| rw-MH | 64.8 | 0.85 | 0.17 | 0.84 | 0.79 | 95.4 | 0.87 | 0.14 | 0.86 | 0.82 |
| Adam | 52 | 0.87 | 0.12 | 0.86 | 0.83 | 85.2 | 0.88 | 0.11 | 0.87 | 0.85 |
| SGD | 53 | 0.88 | 0.13 | 0.85 | 0.80 | 87.1 | 0.86 | 0.15 | 0.84 | 0.81 |
| Adadelta | 56 | 0.86 | 0.12 | 0.84 | 0.81 | 90.6 | 0.87 | 0.11 | 0.85 | 0.83 |
| Adamax | 53 | 0.87 | 0.13 | 0.86 | 0.84 | 87.6 | 0.87 | 0.12 | 0.86 | 0.85 |
| IWT | 59 | 0.84 | 0.21 | 0.81 | 0.78 | 70.2 | 0.86 | 0.19 | 0.84 | 0.83 |
| DA | 61 | 0.83 | 0.25 | 0.82 | 0.79 | 73.2 | 0.85 | 0.22 | 0.83 | 0.81 |
| SSA | 57 | 0.86 | 0.18 | 0.85 | 0.83 | 61.2 | 0.88 | 0.17 | 0.87 | 0.86 |
Bold values highlight the best results obtained by our approach compared with the competing algorithms
Fig. 1 Experiment 1: Train and test curves using CNN_1
Fig. 2 Experiment 1: Train and test curves using CNN_2
Experiment 2: Results for CT image classification - challenging case - using CNN_1 and CNN_2 (computational time in min, accuracy, loss, sensitivity and specificity)
| Optimizers | CNN_1 Time (min) | CNN_1 Acc. | CNN_1 Loss | CNN_1 Sens. | CNN_1 Spec. | CNN_2 Time (min) | CNN_2 Acc. | CNN_2 Loss | CNN_2 Sens. | CNN_2 Spec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MH | 71.4 | 0.73 | 0.38 | 0.71 | 0.69 | 92.4 | 0.76 | 0.34 | 0.74 | 0.72 |
| rw-MH | 59 | 0.76 | 0.36 | 0.75 | 0.72 | 94.8 | 0.77 | 0.32 | 0.75 | 0.74 |
| Adam | 58 | 0.71 | 0.43 | 0.69 | 0.68 | 81 | 0.73 | 0.36 | 0.72 | 0.71 |
| SGD | 59 | 0.65 | 0.45 | 0.64 | 0.62 | 82.2 | 0.68 | 0.42 | 0.67 | 0.65 |
| Adadelta | 61.8 | 0.67 | 0.42 | 0.65 | 0.63 | 87.6 | 0.70 | 0.38 | 0.69 | 0.67 |
| Adamax | 60.6 | 0.69 | 0.41 | 0.67 | 0.66 | 90 | 0.74 | 0.36 | 0.72 | 0.71 |
| IWT | 54 | 0.75 | 0.38 | 0.74 | 0.72 | 90 | 0.78 | 0.35 | 0.77 | 0.75 |
| DA | 57 | 0.78 | 0.36 | 0.77 | 0.76 | 87 | 0.81 | 0.33 | 0.80 | 0.76 |
| SSA | 51 | 0.76 | 0.37 | 0.76 | 0.75 | 83 | 0.79 | 0.36 | 0.78 | 0.77 |
Bold values highlight the best results obtained by our approach compared with the competing algorithms
Experiment 3: Results for Fashion-MNIST image classification—challenging case—using CNN_1 and CNN_2 (computational time in min, accuracy, loss, sensitivity and specificity)
| Optimizers | CNN_1 Time (min) | CNN_1 Acc. | CNN_1 Loss | CNN_1 Sens. | CNN_1 Spec. | CNN_2 Time (min) | CNN_2 Acc. | CNN_2 Loss | CNN_2 Sens. | CNN_2 Spec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MH | 166.2 | 0.86 | 0.35 | 0.85 | 0.81 | 745.8 | 0.87 | 0.33 | 0.84 | 0.82 |
| rw-MH | 183.6 | 0.88 | 0.33 | 0.86 | 0.83 | 797.4 | 0.88 | 0.31 | 0.85 | 0.84 |
| Adam | 156.6 | 0.90 | 0.46 | 0.85 | 0.82 | 444 | 0.92 | 0.32 | 0.88 | 0.87 |
| SGD | 164.4 | 0.88 | 0.71 | 0.71 | 0.67 | 452.4 | 0.89 | 0.56 | 0.84 | 0.83 |
| Adadelta | 169.8 | 0.70 | 1.20 | 0.66 | 0.64 | 439.8 | 0.78 | 0.96 | 0.71 | 0.70 |
| Adamax | 149 | 0.91 | 0.49 | 0.88 | 0.82 | 448.2 | 0.91 | 0.26 | 0.88 | 0.87 |
| IWT | 180.2 | 0.82 | 0.37 | 0.79 | 0.73 | 486 | 0.83 | 0.36 | 0.80 | 0.79 |
| DA | 174 | 0.79 | 0.40 | 0.76 | 0.74 | 469 | 0.82 | 0.35 | 0.78 | 0.78 |
| SSA | 165.7 | 0.84 | 0.33 | 0.81 | 0.75 | 453 | 0.86 | 0.24 | 0.86 | 0.85 |
Bold values highlight the best results obtained by our approach compared with the competing algorithms
Experiment 4: Results for CIFAR-10 image classification—challenging case—using CNN_1 and CNN_2 (computational time in min, accuracy, loss, sensitivity and specificity)
| Optimizers | CNN_1 Time (min) | CNN_1 Acc. | CNN_1 Loss | CNN_1 Sens. | CNN_1 Spec. | CNN_2 Time (min) | CNN_2 Acc. | CNN_2 Loss | CNN_2 Sens. | CNN_2 Spec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MH | 172 | 0.83 | 0.41 | 0.80 | 0.78 | 763.2 | 0.84 | 0.36 | 0.83 | 0.81 |
| rw-MH | 192.2 | 0.85 | 0.36 | 0.84 | 0.81 | 814.1 | 0.86 | 0.35 | 0.84 | 0.83 |
| Adam | 161 | 0.89 | 0.42 | 0.83 | 0.81 | 429 | 0.90 | 0.36 | 0.87 | 0.86 |
| SGD | 169 | 0.86 | 0.75 | 0.69 | 0.65 | 459.7 | 0.86 | 0.60 | 0.83 | 0.80 |
| Adadelta | 174.7 | 0.75 | 0.92 | 0.68 | 0.65 | 453.3 | 0.79 | 0.81 | 0.72 | 0.70 |
| Adamax | 155 | 0.90 | 0.33 | 0.88 | 0.85 | 507.8 | 0.91 | 0.24 | 0.87 | 0.85 |
| IWT | 186 | 0.80 | 0.35 | 0.78 | 0.74 | 531 | 0.81 | 0.34 | 0.79 | 0.78 |
| DA | 179 | 0.82 | 0.35 | 0.77 | 0.75 | 519 | 0.84 | 0.30 | 0.78 | 0.76 |
| SSA | 172.7 | 0.83 | 0.31 | 0.81 | 0.77 | 503 | 0.85 | 0.27 | 0.83 | 0.80 |
Bold values highlight the best results obtained by our approach compared with the competing algorithms
Results for Fashion-MNIST image classification using a deep CNN (computational time in min, accuracy, loss, sensitivity and specificity)
| Optimizers | Time (min) | Acc. | Loss | Sens. | Spec. |
|---|---|---|---|---|---|
| MH | 977 | 0.84 | 0.41 | 0.80 | 0.76 |
| rw-MH | 986 | 0.85 | 0.37 | 0.83 | 0.78 |
| Adam | 701 | 0.91 | 0.55 | 0.88 | 0.85 |
| SGD | 705 | 0.88 | 0.46 | 0.82 | 0.79 |
| Adadelta | 707 | 0.80 | 0.63 | 0.70 | 0.72 |
| Adamax | 706 | 0.92 | 0.44 | 0.90 | 0.88 |
| IWT | 681 | 0.88 | 0.33 | 0.82 | 0.79 |
| DA | 677 | 0.85 | 0.37 | 0.81 | 0.77 |
| SSA | 694 | 0.90 | 0.31 | 0.86 | 0.84 |
Bold values highlight the best results obtained by our approach compared with the competing algorithms