Muhammad Junaid¹, Saad Arslan², TaeGeon Lee¹, HyungWon Kim¹.
Abstract
The convergence of artificial intelligence (AI) with the Internet of Things (AIoT) is one of the critical technologies of the fourth industrial revolution and is expected to enable rapid, secure data processing at the edge. The success of AIoT demands low-power neural network processors, yet most recent research has focused on accelerator designs for inference only. The growing interest in self-supervised and semi-supervised learning now calls for processors that offload the training process in addition to inference. Training with high accuracy goals requires floating-point operators, but higher-precision floating-point arithmetic in neural networks consumes a large area and much energy, so an energy-efficient, compact accelerator is required. The proposed architecture supports training in 32-bit, 24-bit, 16-bit, and mixed precisions to find the optimal floating-point format for low-power, small-area edge devices. The proposed accelerator engines have been verified on an FPGA for both inference and training on the MNIST image dataset. The combination of a 24-bit custom floating-point format with the 16-bit brain floating-point format achieves an accuracy of more than 93%. An ASIC implementation of this optimized mixed-precision accelerator in TSMC 65 nm occupies an active area of 1.036 × 1.036 mm² and consumes 4.445 µJ per training of one image. Compared with the 32-bit architecture, the area and energy are reduced by 4.7 and 3.91 times, respectively. Therefore, a CNN structure using floating-point numbers with an optimized data path will contribute significantly to the AIoT field, which requires small area, low energy, and high accuracy.
Keywords: IEEE 754; MNIST dataset; convolutional neural network (CNN); floating-points
Year: 2022 PMID: 35161975 PMCID: PMC8840430 DOI: 10.3390/s22031230
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. General architecture of CNN.
Figure 2. SoftMax function.
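Figure 2 shows the SoftMax stage at the classifier output. For reference only (not the accelerator's custom floating-point datapath), a numerically stable software version of the function over the MNIST class scores can be sketched as follows; the function name and the use of single-precision C math are assumptions for illustration.

```c
#include <math.h>
#include <stddef.h>

/* Numerically stable softmax: subtract the maximum score before
 * exponentiating so that expf() cannot overflow. Software reference
 * only; the accelerator evaluates this stage in custom FP hardware. */
void softmax(const float *scores, float *probs, size_t n)
{
    float max = scores[0];
    for (size_t i = 1; i < n; i++)
        if (scores[i] > max)
            max = scores[i];

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        probs[i] = expf(scores[i] - max);
        sum += probs[i];
    }
    for (size_t i = 0; i < n; i++)
        probs[i] /= sum;   /* probabilities for the 10 MNIST classes when n == 10 */
}
```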
Figure 3. (a) General operation process. (b) Detailed operation process of floating point.
Formats evaluated in CNN.
| Total Bits | Common Name | Significand Bits | Exponent Bits | Exponent Bias |
|---|---|---|---|---|
| 16 | Custom | 10 | 6 | |
| 16 | Brain Floating | 8 | 8 | |
| 24 | Custom | 16 | 8 | |
| 32 | Single-Precision | 24 | 8 | |
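The brain floating-point and custom 24-bit formats above share the 8-bit exponent of IEEE-754 single precision, so converting between them and 32-bit words amounts to keeping or dropping low-order mantissa bits. The minimal sketch below assumes simple truncation (round toward zero) and skips the 16-bit custom format, whose 6-bit exponent needs rebiasing; it is an illustration, not the paper's conversion hardware.

```c
#include <stdint.h>
#include <string.h>

/* Narrow an IEEE-754 single to a format that keeps the 8-bit exponent and
 * truncates the mantissa (round toward zero for brevity; the hardware may
 * round differently). total_bits = 16 gives the brain floating-point word,
 * total_bits = 24 the custom 24-bit format listed above. */
static uint32_t fp32_to_narrow(float x, int total_bits)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the IEEE-754 encoding */
    return bits >> (32 - total_bits);   /* keep sign, exponent, top mantissa bits */
}

/* Widen back to a 32-bit single by zero-filling the dropped mantissa bits. */
static float narrow_to_fp32(uint32_t bits, int total_bits)
{
    uint32_t widened = bits << (32 - total_bits);
    float x;
    memcpy(&x, &widened, sizeof x);
    return x;
}
```

For instance, fp32_to_narrow(1.0f, 16) returns 0x3F80, the brain floating-point encoding of 1.0, and narrow_to_fp32(0x3F80, 16) recovers 1.0 exactly.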
Figure 4. Representation of the floating-point formats: (a) 16-bit custom floating point; (b) 16-bit brain floating point; (c) 24-bit custom floating point; (d) 32-bit single precision.
Figure 5. The architecture of the floating-point divider using a reciprocal.
Figure 6. The structure of the processing unit for Signed Array division.
Figure 7. Structure of the floating-point division operator using a Signed Array.
Comparison of division calculation using reciprocal and Signed Array.
| Clock Frequency | 50 MHz | | 100 MHz | |
|---|---|---|---|---|
| | Reciprocal | Signed Array | Reciprocal | Signed Array |
| Area (µm²) | 38,018.24 | 6253.19 | 38,039.84 | 8254.21 |
| Processing delay (ns) | 70.38 | 21.36 | 64.23 | 10.79 |
| Total Energy (pJ) a | 78.505 | 4.486 | 112.927 | 5.019 |
a Total energy is the energy per division operation.
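Figure 5 realizes division through a reciprocal unit, which the Signed Array divider of Figures 6 and 7 replaces because of its smaller area and energy (table above). Purely as a software analogue of the reciprocal approach, division can be expressed as a × (1/b) with the reciprocal refined by Newton-Raphson iterations; the seed choice and iteration count below are illustrative assumptions, not the Figure 5 datapath (hardware would typically seed from a small lookup table).

```c
#include <math.h>

/* Reciprocal-based division a/b = a * (1/b): the reciprocal is seeded with a
 * coarse power-of-two estimate and refined with Newton-Raphson iterations
 * r <- r * (2 - b*r), whose error roughly squares each pass. The seed and the
 * iteration count are illustrative; zero and non-finite divisors are not handled. */
static float div_by_reciprocal(float a, float b)
{
    /* seed: 0.75 * 2^-exponent(|b|), with the sign of b, so that |1 - b*r| <= 0.5 */
    float r = copysignf(ldexpf(0.75f, -ilogbf(b)), b);
    for (int k = 0; k < 5; k++)
        r = r * (2.0f - b * r);
    return a * r;
}
```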
Figure 8. The architecture of the proposed floating-point multiplier.
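As a behavioural reference for the kind of datapath shown in Figure 8 (not the RTL itself), a floating-point multiply on the (1 sign, 8 exponent, 7 mantissa) brain floating-point layout follows the usual steps: XOR the signs, add the biased exponents and subtract the bias, multiply the hidden-bit significands, then renormalize. Zeros, subnormals, infinities, NaN, exponent overflow, and rounding are deliberately ignored in this sketch.

```c
#include <stdint.h>

/* Behavioural model of a multiply on the (1,8,7) brain floating-point word:
 * XOR the signs, add the biased exponents minus the bias (127), multiply the
 * 8-bit significands including the hidden leading 1, renormalize, truncate.
 * Zeros, subnormals, infinities, NaN and exponent overflow are ignored. */
static uint16_t bf16_mul(uint16_t a, uint16_t b)
{
    uint16_t sign = (a ^ b) & 0x8000;
    int ea = (a >> 7) & 0xFF, eb = (b >> 7) & 0xFF;
    uint32_t ma = 0x80u | (a & 0x7F);      /* significand with hidden 1: 1.mmmmmmm */
    uint32_t mb = 0x80u | (b & 0x7F);

    uint32_t prod = ma * mb;               /* 16-bit product in [2^14, 2^16) */
    int exp = ea + eb - 127;               /* biased exponent of the product */

    if (prod & 0x8000u) {                  /* product >= 2.0: shift right, bump exponent */
        prod >>= 1;
        exp += 1;
    }
    uint16_t mant = (prod >> 7) & 0x7F;    /* drop the hidden 1, truncate low bits */
    return sign | (uint16_t)(exp << 7) | mant;
}
```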
Figure 9. Overall architecture of the proposed CNN accelerator.
Comparison of accuracy and dynamic power for different precisions.
| S. No | Precision Format | Formats for Individual Layers | Mantissa | Exponents | Training Accuracy | Test Accuracy | Dynamic Power |
|---|---|---|---|---|---|---|---|
| 1 | IEEE-32 | All 32-bits | 24 | 8 | 96.42% | 96.18% | 36 mW |
| 2 | Custom-24 | All 24-bits | 16 | 8 | 94.26% | 93.15% | 30 mW |
| 3 | IEEE-16 | All 16-bits | 11 | 5 | 12.78% | 11.30% | 19 mW |
Figure 10. Arithmetic optimization algorithm.
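Figure 10 presents the arithmetic optimization algorithm used to settle on the mixed-precision configuration. Purely as a hypothetical illustration of the general idea of sweeping per-layer mantissa widths and keeping the narrowest one that meets an accuracy target, a search loop might look like the sketch below; the helper names, candidate list, and threshold handling are assumptions, not the paper's procedure.

```c
#include <stddef.h>

/* Hypothetical per-layer precision sweep: try candidate mantissa widths for the
 * convolution layers (the remaining layers stay at 16 bits) and keep the
 * narrowest one whose test accuracy reaches the target. The eval callback
 * stands in for an FPGA training-plus-inference run; it is not part of the paper. */
typedef float (*eval_fn)(int conv_mantissa_bits);   /* returns test accuracy in [0, 1] */

static int search_conv_precision(eval_fn train_and_eval, float target_accuracy)
{
    /* candidate widths follow the CONV Mixed rows of the results table below */
    static const int candidates[] = { 8, 10, 12, 15, 16 };

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
        if (train_and_eval(candidates[i]) >= target_accuracy)
            return candidates[i];        /* narrowest width that meets the target */
    return 24;                           /* nothing met it: keep full single-precision width */
}
```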
Comparison of N-bit floating-point adder/subtracter.
| N-Bits | Common Name | Area (µm²) | Processing Delay (ns) | Total Energy (pJ) |
|---|---|---|---|---|
| 16 (1,8,7) | Brain Floating | 1749.96 | 10.79 | |
| 24 (1,8,15) | Custom | 2610.44 | 10.80 | |
| 32 (1,8,23) | Single-Precision | 3895.16 | 10.75 | |
Comparison of N-bit floating point multiplier.
| N-Bits | Common Name | Area (µm²) | Processing Delay (ns) | Total Energy (pJ) |
|---|---|---|---|---|
| 16 (1,8,7) | Brain Floating | 1989.32 | 10.80 | 0.8751 |
| 24 (1,8,15) | Custom | 2963.16 | 10.74 | 1.5766 |
| 32 (1,8,23) | Single-Precision | 5958.07 | 10.76 | 3.3998 |
Comparison of N-bit floating-point divider.
| N-Bits | Common Name | Area (µm²) | Processing Delay (ns) | Total Energy (pJ) |
|---|---|---|---|---|
| 16 (1,8,7) | Brain Floating | 1442.16 | 10.80 | 0.6236 |
| 24 (1,8,15) | Custom | 3624.12 | 10.79 | 1.9125 |
| 32 (1,8,23) | Single-Precision | 8254.21 | 10.85 | 5.019 |
Figure 11. Hardware validation platform (FPGA ZCU102 and Host CPU Board).
Comparison of accuracy and dynamic power using the algorithm.
| S. No | Precision Format | Formats for Individual Layers | Mantissa Bits | Exponent Bits | Training Accuracy | Test Accuracy | Dynamic Power |
|---|---|---|---|---|---|---|---|
| 1 | IEEE-16 | All 16-bits | 11 | 5 | 11.52% | 10.24% | 19 mW |
| 2 | Custom-16 | All 16-bits | 10 | 6 | 15.78% | 13.40% | 19 mW |
| 3 | Custom-16 | All 16-bits | 9 | 7 | 45.72% | 32.54% | 19 mW |
| 4 | Brain-16 | All 16-bits | 8 | 8 | 91.85% | 90.73% | 20 mW |
| 5 | CONV Mixed-18 | Conv/BackConv-18 Rest 16-bits a | 10/8 | 8 | 92.16% | 91.29% | 21 mW |
| 6 | CONV Mixed-20 | Conv/BackConv-20 Rest 16-bits a | 12/8 | 8 | 92.48% | 91.86% | 22 mW |
| 7 | CONV Mixed-23 | Conv/BackConv-23 Rest 16-bits a | 15/8 | 8 | 92.91% | 92.75% | 22 mW |
| 8 | CONV Mixed-24 | Conv/BackConv-24 Rest 16-bits a | 16/8 | 8 | 93.32% | 93.12% | 23 mW |
| 9 | FC1 Mixed-32 | FC1/BackFC1-32 Rest 20-bits b | 24/12 | 8 | 93.01% | 92.53% | 26 mW |
| 10 | FC2 Mixed-32 | FC2/BackFC2-32 Rest 22-bits c | 24/14 | 8 | 93.14% | 92.71% | 27 mW |
a Rest 16-bit modules are Pooling, FC1, FC2, Softmax, Back FC1, Back FC2 and Back Pooling. b Rest 20-bit modules are Convolution, Pooling, FC2, Softmax, Back FC2, Back Pooling and Back Conv. c Rest 22-bit modules are Convolution, Pooling, FC1, Softmax, Back FC1, Back Pooling and Back Conv.
Figure 12. 16-bit adder with output in 32 bits.
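Figure 12 shows a 16-bit adder whose result is kept in 32 bits. Because a brain floating-point word is the upper half of an IEEE-754 single-precision encoding, such an adder can be modelled in software by widening both operands losslessly and performing the sum in single precision; the sketch below is a behavioural analogue, not the RTL of Figure 12.

```c
#include <stdint.h>
#include <string.h>

/* Behavioural analogue of a 16-bit (brain floating-point) adder whose output
 * stays in 32 bits: a bfloat16 word is the upper half of an IEEE-754 single,
 * so each input widens losslessly by zero-filling the low 16 bits, and the
 * addition itself is performed and returned in single precision. */
static float bf16_add_to_fp32(uint16_t a, uint16_t b)
{
    uint32_t wa = (uint32_t)a << 16;
    uint32_t wb = (uint32_t)b << 16;
    float fa, fb;
    memcpy(&fa, &wa, sizeof fa);   /* reinterpret as IEEE-754 singles */
    memcpy(&fb, &wb, sizeof fb);
    return fa + fb;                /* 32-bit result, matching the adder's widened output */
}
```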
Comparison with other related work.
| Criteria | [ ] | [ ] | [ ] | [ ] | Proposed |
|---|---|---|---|---|---|
| Precision | FP 32 | FP 32 | Fixed Point 16 | FP 32 | Mixed |
| Training dataset | MNIST | MNIST | MNIST | MNIST | MNIST |
| Device | Maxeler MPC-X | Artix 7 | Spartan-6 LX150 | Xilinx XCZU7EV | XILINX XCZU9EG |
| Accuracy | - | 90% | 92% | 96% | 93.32% |
| LUT | 69,510 | 7986 | - | 169,143 | 33,404 |
| FF | 87,580 | 3297 | - | 219,372 | 61,532 |
| DSP | 23 | 199 | - | 12 | 0 |
| BRAM | 510 | 8 | 200 | 304 | 7.5 |
| Operations (OPs) | 14,149,798 | - | 16,780,000 | 114,824 | 114,824 |
| Time Per Image (µs) | 355 | 58 | 236 | 26.17 | 13.398 |
| Power (W) | 27.3 | 12 | 20 | 0.67 | 0.635 a |
| Energy Per Image (µJ) | 9691.5 | 696 | 4720 | 17.4 | 8.5077 |
a Calculated by Xilinx Vivado (Power = Static power + Dynamic power).