Xuefu Sui, Qunbo Lv, Yang Bai, Baoyu Zhu, Liangjie Zhi, Yuanbo Yang, Zheng Tan.
Abstract
Convolutional neural networks (CNNs) consume substantial hardware resources (such as DSPs and RAMs on FPGAs), and their accuracy, efficiency, and resource usage are difficult to balance, so they often cannot meet the requirements of industrial applications. To address these problems, we propose an innovative low-bit power-of-two quantization method: global sign-based network quantization (GSNQ). GSNQ designs different quantization ranges according to the sign of the weights, which provides a larger range of quantization values. Combined with the fine-grained, multi-scale global retraining method proposed in this paper, it effectively reduces the accuracy loss of low-bit quantization. We also propose a novel convolutional algorithm that replaces multiplication with shift operations, which helps deploy GSNQ-quantized models on FPGAs. Quantization comparison experiments on LeNet-5, AlexNet, VGG-Net, ResNet, and GoogLeNet showed that GSNQ achieves higher accuracy than most existing methods and, in most cases, "lossless" low-bit quantization (i.e., the quantized CNN model is more accurate than the baseline). FPGA comparison experiments showed that our convolutional algorithm occupies no on-chip DSPs and has a low combined occupancy of on-chip LUTs and FFs, which effectively improves computational parallelism and demonstrates that GSNQ has good hardware-adaptation capability. This study provides theoretical and experimental support for the industrial application of CNNs.
Keywords: convolutional neural networks; hardware resource occupancy; hardware-friendly; low-bit quantization; shift operation
Year: 2022 PMID: 36081072 PMCID: PMC9460272 DOI: 10.3390/s22176618
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Weight distributions of CNNs. (a) Weight distribution of the first convolutional layer in GoogLeNet on Mini-ImageNet. (b) Weight distributions of the first and second convolutional layers in LeNet-5 on MNIST.
Figure 2. (a) The quantization strategies of INQ with different quantization bit-widths. (b) The quantization strategies of GSNQ with different quantization bit-widths.
Figure 3. Comparison of two retraining approaches. (a) Horizontal retraining in INQ. (b) Global retraining in GSNQ. W_l denotes the weight set of layer l, the green parts denote the quantized low-precision weight sets, the blue parts denote the full-precision weight sets used for retraining, and the arrows denote the loss-compensation process.
Figure 4. Global retraining process. W_l denotes the weight set of layer l, the green areas denote the quantized low-precision weight sets, and the blue areas denote the full-precision weight sets used for retraining.
Recoding weights (4-bit).
| Recoded Weight | Quantized Weight | Recoded Weight | Quantized Weight |
|---|---|---|---|
| 0001 | 2^−1 | 1001 | −2^−1 |
| 0010 | 2^−2 | 1010 | −2^−2 |
| 0011 | 2^−3 | 1011 | −2^−3 |
| 0100 | 2^−4 | 1100 | −2^−4 |
| 0101 | 2^−5 | 1101 | −2^−5 |
| 0110 | 2^−6 | 1110 | −2^−6 |
| 0111 | 2^0 | 1111 | −2^−7 |
| 0000 | 0 | Null | Null |
Recoding weights (3-bit).
| Recoded Weight | Quantized Weight | Recoded Weight | Quantized Weight |
|---|---|---|---|
| 001 | 2^−1 | 101 | −2^−1 |
| 010 | 2^−2 | 110 | −2^−2 |
| 011 | 2^−3 | 111 | −2^−3 |
| 000 | 0 | Null | Null |
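To make the sign-dependent ranges concrete, here is a minimal Python sketch (our own illustration; the function name and the nearest-value rounding rule are assumptions, not from the paper) that maps a full-precision weight to its 4-bit GSNQ code and quantized value. The 3-bit case is analogous with exponents −1 to −3.

```python
import math

# Sign-dependent exponent ranges from the 4-bit recoding table:
# positive codes cover {2^0, 2^-1, ..., 2^-6}, negative codes cover
# {-2^-1, ..., -2^-7}; code 0000 is reserved for zero.
POS_EXPONENTS = [0, -1, -2, -3, -4, -5, -6]
NEG_EXPONENTS = [-1, -2, -3, -4, -5, -6, -7]

def quantize_gsnq_4bit(w: float) -> tuple[str, float]:
    """Return (4-bit code, quantized value) for a single weight."""
    if w == 0.0:
        return "0000", 0.0
    exps = POS_EXPONENTS if w > 0 else NEG_EXPONENTS
    # Assumed rounding rule: pick the power-of-two level closest to |w|.
    e = min(exps, key=lambda k: abs(abs(w) - 2.0 ** k))
    # Low three bits: 2^0 is coded as 111 (see table); 2^-k as binary k.
    idx = 7 if (w > 0 and e == 0) else -e
    sign_bit = "0" if w > 0 else "1"
    return sign_bit + format(idx, "03b"), math.copysign(2.0 ** e, w)

print(quantize_gsnq_4bit(0.3))   # ('0010', 0.25):  0.3 -> 2^-2
print(quantize_gsnq_4bit(-0.9))  # ('1001', -0.5): -0.9 -> -2^-1
```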
Figure 5. The multiplication-based computation processing element.
Figure 6. The shift-based multiplication processing element.
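Since every nonzero quantized weight is ±2^−k, the multiplier in this processing element reduces to an arithmetic shift plus an optional negation. A sketch of that idea in Python (our own illustration of the principle, not the paper's hardware design), reusing the 4-bit codes from the recoding table:

```python
def shift_multiply(x: int, code: str) -> int:
    """Multiply an integer activation x by a 4-bit recoded weight."""
    if code == "0000":                    # zero weight: output is zero
        return 0
    sign, idx = code[0], int(code[1:], 2)
    k = 0 if (sign == "0" and idx == 7) else idx  # code 0111 means 2^0
    y = x >> k                            # x * 2^-k as an arithmetic shift
    return -y if sign == "1" else y

assert shift_multiply(64, "0010") == 16   # 64 * 2^-2
assert shift_multiply(64, "1001") == -32  # 64 * (-2^-1)
```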
Figure 7. The shift-based convolutional computation module.
Figure 8. The designed accumulator.
Figure 9. An example: (a) output data before truncation; (b) output data after truncation.
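The accumulator in Figure 8 keeps partial sums at a wider word length, and Figure 9 shows the output being truncated afterwards. A hypothetical sketch of such a truncation step (the bit widths here are our assumptions; the paper's actual word lengths are not given in this record):

```python
def truncate(acc: int, drop_bits: int = 4, out_bits: int = 8) -> int:
    """Drop low-order bits from a wide accumulator, then saturate."""
    y = acc >> drop_bits                            # discard low-order bits
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, y))                      # clamp to out_bits range

print(truncate(1000))  # 62:  1000 >> 4 fits in 8 signed bits
print(truncate(5000))  # 127: 5000 >> 4 = 312 saturates to the 8-bit max
```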
Training parameters for different CNNs.
| Network | Weight Partition | Weight Decay | Momentum | Learning Rate | Batch Size |
|---|---|---|---|---|---|
| LeNet-5 | (0.3, 0.6, 0.8, 1) | 0.0001 | 0.9 | 0.1 | 256 |
| AlexNet | (0.3, 0.6, 0.8, 1) | 0.0005 | 0.9 | 0.01 | 256 |
| VGG-16 | (0.5, 0.75, 0.875, 1) | 0.0005 | 0.9 | 0.01 | 128 |
| ResNet-18 | (0.5, 0.75, 0.875, 1) | 0.0005 | 0.9 | 0.01 | 128 |
| ResNet-20 | (0.2, 0.4, 0.6, 0.8, 1) | 0.0001 | 0.9 | 0.1 | 256 |
| ResNet-56 | (0.2, 0.4, 0.6, 0.8, 1) | 0.0001 | 0.9 | 0.1 | 128 |
| GoogLeNet | (0.2, 0.4, 0.6, 0.8, 1) | 0.0002 | 0.9 | 0.01 | 128 |
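The weight-partition column lists the cumulative fraction of weights quantized and frozen at each step; the remaining full-precision weights are retrained to compensate for the quantization loss before the next step. A minimal PyTorch-style sketch of that schedule (our illustration; the magnitude-ordered partition rule and all names here are assumptions based on the INQ-style procedure the paper builds on):

```python
import torch

def freeze_and_quantize(w: torch.Tensor, frac: float, quantize_fn):
    """Quantize the largest-magnitude `frac` of weights; return a freeze mask."""
    k = max(1, int(frac * w.numel()))
    # Threshold = k-th largest magnitude, i.e. (numel - k + 1)-th smallest.
    thresh = w.abs().flatten().kthvalue(w.numel() - k + 1).values
    mask = w.abs() >= thresh
    w.data = torch.where(mask, quantize_fn(w.data), w.data)
    return mask

# Schedule: e.g. (0.3, 0.6, 0.8, 1.0) for LeNet-5/AlexNet. After each call
# to freeze_and_quantize, retrain the model while zeroing the gradients at
# masked positions so the frozen weights keep their quantized values.
```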
Ablation experiments on LeNet-5 on the MNIST dataset.
| Network | Method | Bit Width | Top-1 Accuracy | Top-5 Accuracy | Top-1/Top-5 Accuracy Change |
|---|---|---|---|---|---|
| LeNet-5 | Baseline | 32-bit | 98.84% | 100.00% | |
| | INQ | 4-bit | 97.86% | 99.97% | −0.98%/−0.03% |
| | GSNQ (only S) | 4-bit | 98.01% | 99.97% | −0.83%/−0.03% |
| | GSNQ (only G) | 4-bit | 98.48% | 99.99% | −0.36%/−0.01% |
| | GSNQ | 4-bit | 98.86% | 100.00% | 0.02%/0.00% |
| | INQ | 3-bit | 96.11% | 99.84% | −2.73%/−0.16% |
| | GSNQ (only S) | 3-bit | 97.67% | 99.97% | −1.17%/−0.03% |
| | GSNQ (only G) | 3-bit | 97.01% | 99.88% | −1.83%/−0.12% |
| | GSNQ | 3-bit | 98.33% | 99.99% | −0.51%/−0.01% |
Ablation experiments on 4-bit quantization on the CIFAR10 dataset.
| Network | Method | Bit Width | Top-1 Accuracy | Top-5 Accuracy | Top-1/Top-5 Accuracy Change |
|---|---|---|---|---|---|
| AlexNet | Baseline | 32-bit | 81.26% | 98.90% | |
| | INQ | 4-bit | 79.92% | 98.64% | −1.34%/−0.26% |
| | GSNQ (only S) | 4-bit | 81.61% | 99.07% | 0.35%/0.17% |
| | GSNQ (only G) | 4-bit | 82.83% | 99.11% | 1.57%/0.21% |
| | GSNQ | 4-bit | 84.62% | 99.26% | 3.36%/0.36% |
| VGG-16 | Baseline | 32-bit | 85.33% | 99.39% | |
| | INQ | 4-bit | 85.94% | 99.49% | 0.61%/0.10% |
| | GSNQ (only S) | 4-bit | 86.17% | 99.51% | 0.84%/0.12% |
| | GSNQ (only G) | 4-bit | 87.92% | 99.55% | 2.59%/0.16% |
| | GSNQ | 4-bit | 88.12% | 99.59% | 2.79%/0.20% |
| ResNet-18 | Baseline | 32-bit | 88.88% | 99.65% | |
| | INQ | 4-bit | 89.18% | 99.66% | 0.30%/0.01% |
| | GSNQ (only S) | 4-bit | 89.20% | 99.68% | 0.32%/0.03% |
| | GSNQ (only G) | 4-bit | 90.09% | 99.65% | 1.21%/0.00% |
| | GSNQ | 4-bit | 90.46% | 99.75% | 1.58%/0.10% |
| GoogLeNet | Baseline | 32-bit | 89.00% | 99.60% | |
| | INQ | 4-bit | 89.37% | 99.62% | 0.37%/0.02% |
| | GSNQ (only S) | 4-bit | 89.49% | 99.61% | 0.49%/0.01% |
| | GSNQ (only G) | 4-bit | 90.46% | 99.66% | 1.46%/0.06% |
| | GSNQ | 4-bit | 90.72% | 99.66% | 1.72%/0.06% |
Ablation experiments on 3-bit quantization on the CIFAR10 dataset.
| Network | Method | Bit Width | Top-1 Accuracy | Top-5 Accuracy | Top-1/Top-5 Accuracy Change |
|---|---|---|---|---|---|
| AlexNet | Baseline | 32-bit | 81.26% | 98.90% | |
| | INQ | 3-bit | 65.85% | 95.95% | −15.41%/−2.95% |
| | GSNQ (only S) | 3-bit | 78.89% | 98.73% | −2.37%/−0.17% |
| | GSNQ (only G) | 3-bit | 74.15% | 98.18% | −7.11%/−0.72% |
| | GSNQ | 3-bit | 82.14% | 99.09% | 0.88%/0.19% |
| VGG-16 | Baseline | 32-bit | 85.33% | 99.39% | |
| | INQ | 3-bit | 76.33% | 98.50% | −9.00%/−0.89% |
| | GSNQ (only S) | 3-bit | 85.18% | 99.44% | −0.15%/0.05% |
| | GSNQ (only G) | 3-bit | 77.47% | 98.53% | −7.86%/−0.86% |
| | GSNQ | 3-bit | 87.59% | 99.58% | 2.26%/0.19% |
| ResNet-18 | Baseline | 32-bit | 88.88% | 99.65% | |
| | INQ | 3-bit | 85.56% | 99.58% | −3.32%/−0.07% |
| | GSNQ (only S) | 3-bit | 88.84% | 99.69% | −0.04%/0.04% |
| | GSNQ (only G) | 3-bit | 87.64% | 99.61% | −1.24%/−0.04% |
| | GSNQ | 3-bit | 90.01% | 99.58% | 1.13%/−0.07% |
| GoogLeNet | Baseline | 32-bit | 89.00% | 99.60% | |
| | INQ | 3-bit | 86.08% | 99.38% | −2.92%/−0.22% |
| | GSNQ (only S) | 3-bit | 89.07% | 99.63% | −0.07%/0.03% |
| | GSNQ (only G) | 3-bit | 89.85% | 99.53% | 0.85%/−0.07% |
| | GSNQ | 3-bit | 90.70% | 99.60% | 1.70%/0.00% |
Ablation experiments on 4-bit quantization on the Mini-ImageNet dataset.
| Network | Method | Bit Width | Top-1 Accuracy | Top-5 Accuracy | Top-1/Top-5 Accuracy Change |
|---|---|---|---|---|---|
| AlexNet | Baseline | 32-bit | 58.05% | 83.50% | |
| | INQ | 4-bit | 58.33% | 83.43% | 0.28%/−0.07% |
| | GSNQ (only S) | 4-bit | 59.31% | 83.98% | 1.26%/0.48% |
| | GSNQ (only G) | 4-bit | 64.27% | 86.82% | 6.22%/3.32% |
| | GSNQ | 4-bit | 65.19% | 87.43% | 7.14%/3.84% |
| VGG-16 | Baseline | 32-bit | 73.68% | 91.03% | |
| | INQ | 4-bit | 73.48% | 90.94% | −0.20%/−0.09% |
| | GSNQ (only S) | 4-bit | 73.91% | 91.06% | 0.23%/0.03% |
| | GSNQ (only G) | 4-bit | 78.33% | 92.70% | 4.65%/1.67% |
| | GSNQ | 4-bit | 78.96% | 92.89% | 5.28%/1.86% |
| ResNet-18 | Baseline | 32-bit | 74.43% | 92.11% | |
| | INQ | 4-bit | 74.37% | 91.96% | −0.06%/−0.15% |
| | GSNQ (only S) | 4-bit | 74.83% | 91.96% | 0.40%/−0.15% |
| | GSNQ (only G) | 4-bit | 78.30% | 92.92% | 3.87%/0.81% |
| | GSNQ | 4-bit | 78.49% | 92.92% | 4.06%/0.81% |
| GoogLeNet | Baseline | 32-bit | 77.76% | 93.40% | |
| | INQ | 4-bit | 77.83% | 93.42% | 0.07%/0.02% |
| | GSNQ (only S) | 4-bit | 78.37% | 93.56% | 0.61%/0.16% |
| | GSNQ (only G) | 4-bit | 79.49% | 93.77% | 1.73%/0.37% |
| | GSNQ | 4-bit | 79.96% | 93.52% | 2.20%/0.12% |
Ablation experiments on 3-bit quantization on the Mini-ImageNet dataset.
| Network | Method | Bit Width | Top-1 Accuracy | Top-5 Accuracy | Top-1/Top-5 Accuracy Change |
|---|---|---|---|---|---|
| AlexNet | Baseline | 32-bit | 58.05% | 83.50% | |
| | INQ | 3-bit | 49.48% | 77.26% | −8.57%/−6.24% |
| | GSNQ (only S) | 3-bit | 57.29% | 83.29% | −0.76%/−0.21% |
| | GSNQ (only G) | 3-bit | 52.87% | 81.26% | −5.18%/−2.24% |
| | GSNQ | 3-bit | 64.11% | 86.70% | 6.06%/3.20% |
| VGG-16 | Baseline | 32-bit | 73.68% | 91.03% | |
| | INQ | 3-bit | 71.27% | 90.17% | −2.41%/−0.86% |
| | GSNQ (only S) | 3-bit | 73.60% | 91.04% | −0.08%/0.01% |
| | GSNQ (only G) | 3-bit | 74.44% | 91.45% | 0.76%/0.42% |
| | GSNQ | 3-bit | 75.69% | 92.10% | 2.01%/1.07% |
| ResNet-18 | Baseline | 32-bit | 74.43% | 92.11% | |
| | INQ | 3-bit | 69.81% | 89.62% | −4.62%/−2.49% |
| | GSNQ (only S) | 3-bit | 73.82% | 91.38% | −0.61%/−0.73% |
| | GSNQ (only G) | 3-bit | 76.51% | 92.26% | 2.08%/0.15% |
| | GSNQ | 3-bit | 77.69% | 92.86% | 3.26%/0.75% |
| GoogLeNet | Baseline | 32-bit | 77.29% | 93.14% | |
| | INQ | 3-bit | 67.70% | 88.95% | −9.59%/−4.19% |
| | GSNQ (only S) | 3-bit | 76.98% | 93.12% | −0.31%/−0.02% |
| | GSNQ (only G) | 3-bit | 74.76% | 91.57% | −2.53%/−1.57% |
| | GSNQ | 3-bit | 79.24% | 93.49% | 1.95%/0.35% |
Comparison with state-of-the-art quantization methods on CIFAR10.
| Network | Method | Bit Width | Top-1 Accuracy | Top-5 Accuracy | Top-1/Top-5 Accuracy Change |
|---|---|---|---|---|---|
| ResNet-18 | Baseline | 32-bit | 88.88% | 99.65% | |
| | DSQ | 4-bit | 90.38% | 99.66% | 1.50%/0.01% |
| | APoT | 4-bit | 90.20% | 99.69% | 1.32%/0.04% |
| | GSNQ | 4-bit | 90.46% | 99.75% | 1.58%/0.10% |
| | DSQ | 3-bit | 90.01% | 99.68% | 1.13%/0.03% |
| | APoT | 3-bit | 90.18% | 99.58% | 1.30%/−0.07% |
| | GSNQ | 3-bit | 90.01% | 99.58% | 1.13%/−0.07% |
| GoogLeNet | Baseline | 32-bit | 89.00% | 99.60% | |
| | DSQ | 4-bit | 90.47% | 99.63% | 1.47%/0.03% |
| | APoT | 4-bit | 89.91% | 99.49% | 0.91%/−0.11% |
| | GSNQ | 4-bit | 90.72% | 99.66% | 1.72%/0.06% |
| | DSQ | 3-bit | 89.99% | 99.57% | 0.99%/−0.03% |
| | APoT | 3-bit | 90.78% | 99.61% | 1.78%/0.01% |
| | GSNQ | 3-bit | 90.70% | 99.60% | 1.70%/0.00% |
Comparison between ResNet-20 and ResNet-56 on CIFAR10 (Top-1 accuracy).
| Network | Method | 4-bit | 3-bit | 2-bit |
|---|---|---|---|---|
| ResNet-20 | DoReFa-Net | 90.5% | 89.9% | 88.2% |
| | PACT | 91.7% | 91.1% | 89.7% |
| | LQ-Net | — | 91.6% | 90.2% |
| | ProxQuant | — | — | 90.6% |
| | APoT | 92.3% | 92.2% | 91.0% |
| | GSNQ | 92.42% | 91.96% | 90.91% |
| ResNet-56 | APoT | 94.0% | 93.9% | 92.9% |
| | GSNQ | 94.00% | 93.62% | 92.92% |
On-chip resource occupation of a single convolution computation module based on different methods.
| Method | Bit Width | LUTs | FFs | DSPs |
|---|---|---|---|---|
| Method 1: Using on-chip DSPs | 4-bit | 1036 | 790 | 25 |
| Method 2: Using on-chip LUTs | 4-bit | 1373 | 690 | 0 |
| Our method | 4-bit | 1094 | 734 | 0 |
| Method 1: Using on-chip DSPs | 3-bit | 680 | 691 | 25 |
| Method 2: Using on-chip LUTs | 3-bit | 1103 | 585 | 0 |
| Our method | 3-bit | 722 | 612 | 0 |
On-chip resource occupation for convolutional computation at 32-channel parallelism.
| Method | Bit Width | LUTs | FFs | DSPs |
|---|---|---|---|---|
| Method 1: Using on-chip DSPs | 4-bit | 38,356 (22.31%) | 28,171 (8.19%) | 800 (88.89%) |
| Method 2: Using on-chip LUTs | 4-bit | 54,383 (31.64%) | 24,971 (7.26%) | 0 (0.00%) |
| Our method | 4-bit | 41,923 (24.39%) | 28,151 (8.19%) | 0 (0.00%) |
| Method 1: Using on-chip DSPs | 3-bit | 23,303 (13.56%) | 24,968 (7.26%) | 800 (88.89%) |
| Method 2: Using on-chip LUTs | 3-bit | 37,719 (21.94%) | 22,568 (6.56%) | 0 (0.00%) |
| Our method | 3-bit | 30,163 (17.55%) | 23,330 (6.79%) | 0 (0.00%) |