Nuria Nievas1,2, Adela Pagès-Bernaus2, Francesc Bonada1, Lluís Echeverria1, Albert Abio1, Danillo Lange1, Jaume Pujante1.
Abstract
Hot stamping is a hot metal forming technology, increasingly in demand, that produces ultra-high-strength parts with complex shapes. A major concern in these systems is how to shorten production times to improve production Key Performance Indicators. In this work, we present a Reinforcement Learning approach that obtains an optimal behavior strategy for dynamically managing the cycle time in hot stamping, optimizing manufacturing production while maintaining the quality of the final product. Results are compared with the business-as-usual cycle time control approach and with the optimal solution obtained by executing a dynamic programming algorithm. Reinforcement Learning control outperforms the business-as-usual behavior by reducing the cycle time and the total batch time in non-stable temperature phases.
Keywords: autonomous control; hot stamping; reinforcement learning
Year: 2022 PMID: 35888292 PMCID: PMC9322736 DOI: 10.3390/ma15144825
Source DB: PubMed Journal: Materials (Basel) ISSN: 1996-1944 Impact factor: 3.748
Figure 1. Hot stamping direct method process phases.
Figure 2. The pilot plant at Eurecat Manresa, on which the experiment considered for simulation and control is defined.
Parameter range settings in the hot stamping process case study.
| Parameters | Use Case Values Definition |
|---|---|
| Batch size | 50 parts |
| Die initial temperature at the beginning of the batch | 25 °C |
| Die temperature allowed | |
| Blank initial constant temperature | 800 °C |
| Formed part closing time | 20–30 s |
| Constant recovery time | 10 s |
| Formed part cooling temperature | |
| Cycle time | 30–40 s |
Figure 3. Reinforcement Learning interaction process scheme.
Description of the MDP state space.
| State Variables | Range of Acceptable Values |
|---|---|
| Die temperature | 25–550 °C |
| Current forming time setting | 20–30 s |
| Remaining parts from batch | 0–50 parts |
Description of the MDP action space.
| Discrete Actions | Forming Time Variation at Each Step |
|---|---|
| 0 | 0 s |
| 1 | +0.5 s |
| 2 | −0.5 s |
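The state and action tables above can be sketched as a minimal data structure. This is an illustrative assumption, not the authors' implementation; the class, function names, and clipping behavior are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class State:
    die_temp: float        # die temperature, 25-550 degC
    forming_time: float    # current forming time setting, 20-30 s
    remaining_parts: int   # parts left in the batch, 0-50

# Discrete actions: forming-time variation applied at each step (s)
ACTIONS = {0: 0.0, 1: +0.5, 2: -0.5}

def apply_action(state: State, action: int) -> State:
    """Apply a forming-time adjustment, clipped to the allowed 20-30 s range."""
    new_time = min(30.0, max(20.0, state.forming_time + ACTIONS[action]))
    return State(state.die_temp, new_time, state.remaining_parts - 1)
```

One part is formed per step, so the remaining-parts counter decreases by one on every transition.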
Description of the MDP reward function.
| Cost Function Components | Range of Penalty Values | Condition |
|---|---|---|
| Cycle time | 30–40 | Always |
| Part cooling temperature not achieved | 100 | |
| Maximum die temperature exceeded | | |
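A hedged sketch of the stepwise cost implied by the table above. The cycle-time term and the 100-point cooling penalty come from the table; the penalty for exceeding the maximum die temperature is not given in this record, so `MAX_DIE_TEMP_PENALTY` below is a placeholder assumption.

```python
MAX_DIE_TEMP_PENALTY = 100.0  # placeholder; value not stated in this record

def step_cost(cycle_time: float,
              cooling_reached: bool,
              die_temp_exceeded: bool) -> float:
    """Per-part cost: cycle time always counts; quality violations add penalties."""
    cost = cycle_time                      # 30-40, applied at every step
    if not cooling_reached:
        cost += 100.0                      # part cooling temperature not achieved
    if die_temp_exceeded:
        cost += MAX_DIE_TEMP_PENALTY
    return cost
```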
Figure 4. Workflow of the control model.
Figure 5. Data-driven prediction models for the continuous environment simulation.
Description of the experiments presented.
| Experiment | Environment | Algorithm | Iterations |
|---|---|---|---|
| Experiment 1 | Data-driven models-based | DP | 50 |
| Experiment 2 | Data-driven models-based | Q-Learning | |
Dynamic Programming parameters setup values.
| DP Parameters | Setup Values |
|---|---|
| Gamma | 1 |
| Epsilon | |
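With gamma = 1 and an epsilon-based stopping criterion, the DP step can be sketched as value iteration. The epsilon value is blank in this record, so `1e-3` below is an assumption, as are the toy transition matrices and cost vectors; this is a generic sketch, not the authors' implementation.

```python
import numpy as np

def value_iteration(P, C, gamma=1.0, epsilon=1e-3):
    """P[a] is an (S, S) transition matrix, C[a] an (S,) cost vector.

    Iterates Bellman backups until the largest value change drops
    below epsilon, then returns the converged value function.
    """
    n_states = P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman backup: minimise expected cost over the discrete actions
        Q = np.stack([C[a] + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.min(axis=0)
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new
        V = V_new
```

With gamma = 1, convergence relies on the episode being finite (the batch terminates after 50 parts), so an absorbing terminal state with zero cost is assumed.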
Q-Learning parameters setup values.
| Q-Learning Parameters | Setup Values |
|---|---|
| Episodes | |
| Gamma | 1 |
| Alpha | 0.1 |
| Initial epsilon value | 1 |
| Epsilon decay | Linear |
| Minimum epsilon value | 0.01 |
Figure 6. Cost function definition to evaluate the hot stamping process.
Figure 7. Q-Learning agent's policy improvement during training.
Batch production penalty for all experiments under the BAU, DP, and Q-Learning policies. Non-feasible solutions are marked in red, and the best solution within each policy is marked in green.
| Initial Forming Time (s) | BAU Penalty | DP Penalty | Q-Learning Penalty | Savings |
|---|---|---|---|---|
| 20 | | | | 2760 |
| 20.5 | | | | 3084 |
| 21 | | | | 4023 |
| 21.5 | | −1832.5 | −1832.5 | 4142.5 |
| 22 | | | | 4074.5 |
| 22.5 | | | | 4099.5 |
| 23 | | | | 4024.5 |
| 23.5 | | −1826 | −1826 | 3949 |
| 24 | | −1826.5 | −1826.5 | 3973.5 |
| 24.5 | | −1827.5 | −1827.5 | 3997.5 |
| 25 | | −1828.5 | −1828.5 | 3921.5 |
| 25.5 | | −1830 | −1830 | 3745 |
| 26 | | −1831.5 | −1831.5 | 3768.5 |
| 26.5 | | −1833 | −1833 | 3692 |
| 27 | | −1835 | −1835 | 15 |
| 27.5 | −1875 | −1837 | −1837 | 38 |
| 28 | −1900 | −1839.5 | −1839.5 | 60.5 |
| 28.5 | −1925 | −1842 | −1842 | 83 |
| 29 | −1950 | −1843.5 | −1843.5 | 106.5 |
| 29.5 | −1975 | −1847 | −1847 | 128 |
| 30 | −2000 | −1851 | −1851 | 149 |
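For the fully populated rows of the table above, the Savings column is consistent with the difference between the learned policy's penalty and the BAU penalty. The check below is an observation on the reported numbers, not a formula stated in the paper.

```python
rows = [  # (initial forming time, BAU penalty, DP/Q-Learning penalty, reported savings)
    (27.5, -1875.0, -1837.0, 38.0),
    (28.0, -1900.0, -1839.5, 60.5),
    (30.0, -2000.0, -1851.0, 149.0),
]
# Savings = policy penalty minus BAU penalty (penalties are negative,
# so a less negative policy value yields a positive saving)
for t, bau, policy, savings in rows:
    assert policy - bau == savings
```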
Figure 8. Batch execution with the best RL (Q-Learning) solution.
Figure 9. Benchmark of the best BAU, DP, and Q-Learning solutions for a batch execution: (a) forming time setting benchmark; (b) final part temperature benchmark.
Figure 10. Models of cycle-time adjustment and their connections considered in this work.
Challenges in DP and Q-Learning implementations.
| Challenge | Solution |
|---|---|
| Curse of dimensionality | Definition of discrete action and state spaces |
| Reward function definition | Evaluation of different reward functions |
| Surrogate model error | Evaluation of different algorithms |