Xin Wei, Wenchao Liu, Lei Chen, Long Ma, He Chen, Yin Zhuang.
Abstract
Recently, extensive convolutional neural network (CNN)-based methods have been applied to remote sensing tasks such as object detection and classification, and have achieved significant improvements in performance. Moreover, there is strong demand for hardware implementations of real-time remote sensing processing. However, the operation and storage costs of floating-point models hinder the deployment of networks on hardware platforms with limited resource and power budgets, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). To solve this problem, this paper focuses on optimizing the hardware design of CNNs with low bit-width integers through quantization. First, a hybrid-type inference method based on a symmetric quantization scheme is proposed, which replaces floating-point arithmetic with low bit-width integer arithmetic. Then, a training approach for the quantized network is introduced to reduce accuracy degradation. Finally, a processing engine (PE) with a low bit-width is proposed to optimize the FPGA hardware design for remote sensing image classification. In addition, a fused-layer PE is presented for state-of-the-art CNNs equipped with Batch-Normalization and LeakyReLU. Experiments performed on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset using a graphics processing unit (GPU) demonstrate that the accuracy of the 8-bit quantized model drops by only about 1%, an acceptable loss. The accuracy measured on the FPGA is consistent with that on the GPU. In terms of FPGA resource consumption, the Look-Up Table (LUT), Flip-Flop (FF), Digital Signal Processor (DSP), and Block Random Access Memory (BRAM) usage is reduced by 46.21%, 43.84%, 45%, and 51%, respectively, compared with the floating-point implementation.
Keywords: FPGA; convolutional neural network; hybrid-type inference; remote sensing; symmetric quantization
Year: 2019 PMID: 30813259 PMCID: PMC6412419 DOI: 10.3390/s19040924
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
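The symmetric quantization scheme named in the abstract maps floating-point values to low bit-width signed integers with a zero zero-point, so dequantization is a single multiply by a scale. A minimal NumPy sketch of one standard symmetric scheme; the per-tensor max-abs scale and the rounding rule here are assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def symmetric_quantize(x, num_bits=8):
    """Quantize a float tensor to signed integers over a symmetric range.

    Symmetric quantization uses zero-point = 0, so dequantization is a
    single multiply by the scale. The per-tensor max-abs scale below is
    an assumption; the paper may derive its scales differently.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8-bit
    scale = np.abs(x).max() / qmax            # map max |x| onto qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Example: 8-bit quantization of random weights
w = np.random.randn(3, 3).astype(np.float32)
q, s = symmetric_quantize(w, num_bits=8)
print(np.abs(dequantize(q, s) - w).max())     # small quantization error
```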
Figure 1. (a) Integer/floating-point hybrid-type inference in FPGA. (b) Architecture of the quantized convolutional layer during training.
Figure 2. Frameworks of the fundamental network (a), modified network (b), and quantized network (c).
Figure 3. Hardware implementation architecture on the FPGA platform.
Figure 4. Efficient calculation method for the convolutional and fully connected layers.
Figure 5. Fusing floating-point multiplications to optimize the hardware design.
Figure 6. (a) Structure of the floating-point-type PE. (b) Structure of the hybrid-type PE.
Logical resource consumption of PEs in six formats.
| Logical Resource | Float | 4-bit | 6-bit | 8-bit | 10-bit | 12-bit |
|---|---|---|---|---|---|---|
| LUT | 2936 | 396 | 404 | 410 | 417 | 424 |
| FF | 2267 | 767 | 824 | 867 | 916 | 961 |
| DSP | 38 | 20 | 20 | 20 | 20 | 20 |
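The table above shows why the hybrid-type PE of Figure 6b is cheaper: integer multipliers consume far fewer LUTs and DSP slices than floating-point ones. A sketch of the arithmetic such a PE performs, assuming int8 operands, an int32 accumulator, and a single floating-point rescale at the output; the datapath widths are assumptions, as this record does not specify them.

```python
import numpy as np

def hybrid_mac(q_w, q_x, s_w, s_x):
    """Integer multiply-accumulate with one floating-point rescale.

    q_w, q_x : int8 quantized weights / activations
    s_w, s_x : their float scales
    Products and sums stay in int32 (cheap in FPGA DSP slices); only
    the final rescale is floating-point, hence "hybrid-type".
    """
    acc = np.sum(q_w.astype(np.int32) * q_x.astype(np.int32))  # integer path
    return float(acc) * (s_w * s_x)                            # float rescale

q_w = np.array([12, -37, 90], dtype=np.int8)
q_x = np.array([5, 44, -21], dtype=np.int8)
print(hybrid_mac(q_w, q_x, s_w=0.01, s_x=0.02))
```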
Figure 7. (a) Structure of the fused-layer floating-point-type PE. (b) Structure of the fused-layer hybrid-type PE.
Logical resource consumption of fused-layer PEs in six formats.
| Logical Resource | Float | 4-bit | 6-bit | 8-bit | 10-bit | 12-bit |
|---|---|---|---|---|---|---|
| LUT | 3319 | 828 | 836 | 842 | 849 | 856 |
| FF | 5517 | 1512 | 1569 | 1612 | 1661 | 1706 |
| DSP | 42 | 24 | 24 | 24 | 24 | 24 |
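A common way to realize the fused-layer PE's Batch-Normalization and LeakyReLU support is to fold BN's affine transform into a per-channel scale and bias applied after the accumulator, then apply the LeakyReLU slope. A sketch using the standard BN-folding identities; the folding and the slope value alpha=0.1 are assumptions, since the record does not give the PE's exact datapath.

```python
import numpy as np

def fold_bn(gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm y = gamma*(x-mean)/sqrt(var+eps) + beta into scale/bias."""
    scale = gamma / np.sqrt(var + eps)
    bias = beta - scale * mean
    return scale, bias

def fused_layer(acc, scale, bias, alpha=0.1):
    """Apply the folded BN then LeakyReLU to a PE's accumulator output."""
    y = acc * scale + bias
    return np.where(y >= 0, y, alpha * y)   # LeakyReLU with slope alpha

acc = np.array([-12.0, 30.0])               # per-channel convolution sums
s, b = fold_bn(gamma=np.array([1.0, 0.5]),
               beta=np.array([0.0, 0.1]),
               mean=np.array([2.0, -1.0]),
               var=np.array([4.0, 1.0]))
print(fused_layer(acc, s, b))
```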
Figure 8. Samples of each type of vehicle in MSTAR with corresponding optical images.
Number of training and testing images per class.
| Class | BMP-2 | BRDM-2 | BTR-60 | BTR-70 | D7 | T-62 | T-72 | ZIL-131 | ZSU-234 | 2S1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Training | 233 | 298 | 256 | 233 | 299 | 299 | 232 | 299 | 299 | 299 |
| Testing | 195 | 274 | 195 | 196 | 274 | 273 | 196 | 274 | 274 | 274 |
Classification accuracies of the floating-point model and the quantized models.
| No. | Float | 4-bit | 6-bit | 8-bit | 10-bit | 12-bit |
|---|---|---|---|---|---|---|
| 1 | 98.14% | 94.51% | 97.07% | 97.64% | 97.31% | 97.15% |
| 2 | 98.22% | 93.97% | 96.08% | 97.77% | 97.15% | 97.23% |
| 3 | 98.55% | 95.09% | 97.56% | 97.19% | 97.23% | 97.11% |
| 4 | 98.47% | 93.89% | 96.82% | 97.31% | 97.48% | 97.03% |
| 5 | 98.76% | 94.68% | 97.64% | 97.19% | 97.36% | 97.11% |
| Mean | 98.43% | 94.43% | 97.03% | 97.42% | 97.31% | 97.13% |
| Std. dev. | 0.22% | 0.49% | 0.57% | 0.24% | 0.11% | 0.06% |
Weight storage and compression statistics of the quantized models.
| | Float | 4-bit | 6-bit | 8-bit | 10-bit | 12-bit |
|---|---|---|---|---|---|---|
| Weights Storage (MB) | 6.638 | 0.830 | 1.245 | 1.660 | 2.074 | 2.489 |
| Compression Rate | - | 8.0× | 5.3× | 4.0× | 3.2× | 2.7× |
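The storage figures and compression rates follow from the bit-width ratio alone: an n-bit model needs n/32 of the float model's 6.638 MB, and the compression rate is 32/n. A quick check against the table:

```python
float_mb = 6.638
for bits in (4, 6, 8, 10, 12):
    # storage shrinks by bits/32; compression rate is the inverse ratio
    print(bits, round(float_mb * bits / 32, 3), f"{32 / bits:.1f}x")
# 4-bit: 0.83 MB, 8.0x ... matching the table (rates rounded to one decimal)
```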
Figure 9. Training processes of the fundamental floating-point-type and quantized-type networks.
Accuracies of directly quantized models and quantized models with training.
| | 4-bit | 6-bit | 8-bit | 10-bit | 12-bit |
|---|---|---|---|---|---|
| Quantized with training | 94.43% | 97.03% | 97.42% | 97.31% | 97.13% |
| Quantized directly | 29.27% | 20.70% | 42.22% | 43.34% | 41.17% |
| Difference | 65.16% | 76.33% | 55.20% | 53.97% | 55.96% |
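The large gap between direct quantization and quantization with training is typical: retraining lets the weights adapt to the coarse quantization grid. One common way to implement such training (an assumption here; the record does not detail the paper's exact procedure) is fake quantization with a straight-through estimator, where the forward pass sees quantized values but gradients flow as if quantization were the identity.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Forward pass of quantization-aware training: quantize-dequantize.

    In a full framework the backward pass would copy the upstream
    gradient through this op (straight-through estimator), so the
    underlying float weights keep receiving useful updates.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

w = np.random.randn(4)
# The loss is computed on fake_quantize(w); the optimizer updates the
# float w, which gradually adapts to (here) the 4-bit grid.
print(w, fake_quantize(w, num_bits=4))
```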
Classification accuracies under the proposed scheme and the scheme in [32].
| | 4-bit | 6-bit | 8-bit | 10-bit | 12-bit |
|---|---|---|---|---|---|
| [32] | 95.79% | 98.11% | 97.98% | 97.75% | 98.18% |
| Ours | 94.43% | 97.03% | 97.42% | 97.31% | 97.13% |
Figure 10. (a) Real dynamic range and quantized range in [32]. (b) Real dynamic range and quantized range in ours.
Number of multiplications and additions in the proposed scheme and the scheme in [32].
| Layer | [32] Multiplications | [32] Additions | Ours Multiplications | Ours Additions |
|---|---|---|---|---|
| Conv1 | 922560 | 2583168 | 922560 | 830304 |
| Conv2 | 921600 | 1900800 | 921600 | 806400 |
| FC1 | 1728120 | 5184120 | 1728120 | 1728000 |
| FC2 | 10164 | 30324 | 10164 | 10080 |
| FC3 | 850 | 2530 | 850 | 840 |
| Total | 3583294 | 9700942 | 3583294 | 3375624 |
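The addition savings stem from the symmetric scheme's zero zero-point. With an asymmetric scheme such as [32]'s, each real value is v = s(q - z), so every product expands into cross terms involving the zero-points and costs extra additions; with z = 0 those terms vanish and only the accumulation additions remain. This reading is an inference from the table rather than an explicit derivation in this record, but it is consistent throughout: the proposed scheme's additions equal multiplications minus outputs (e.g., FC3: 850 - 10 class outputs = 840).

```latex
% Asymmetric quantization (nonzero zero-points z_w, z_x):
v_w v_x = s_w s_x (q_w - z_w)(q_x - z_x)
        = s_w s_x \left( q_w q_x - z_w q_x - z_x q_w + z_w z_x \right)
% Symmetric quantization (z_w = z_x = 0): the cross terms vanish,
v_w v_x = s_w s_x \, q_w q_x
```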
Experimental results of the floating-point model and the 8-bit quantized model in FPGA.
| | Available | [ ] | Ours | Ours |
|---|---|---|---|---|
| Format | - | 32-bit float | 32-bit float | 8-bit fixed |
| Frequency | - | 100 MHz | 100 MHz | 100 MHz |
| LUT | 203800 | 55745 (27.35%) | 36725 (18.02%) | 19753 (9.69%) |
| FF | 407600 | 45561 (11.18%) | 37283 (9.19%) | 20938 (5.14%) |
| DSP | 840 | - | 220 (26.19%) | 121 (14.40%) |
| BRAM (36 Kb) | 445 | 150.5 (33.82%) | 150 (33.71%) | 73.5 (16.52%) |
| DDR bandwidth | 100 Gbps | - | 47.62 Gbps | 11.91 Gbps |
| Processing time | - | 2.29 ms | 2.29 ms | 2.29 ms |
| Classification accuracy | - | 98.18% | 98.76% | 97.77% |
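The abstract's reduction percentages are computed against the authors' own 32-bit floating-point implementation (third data column), not the cited reference design, e.g. LUT: 1 - 19753/36725 ≈ 46.21%. A quick check of all four figures from the table:

```python
pairs = {"LUT": (36725, 19753), "FF": (37283, 20938),
         "DSP": (220, 121), "BRAM": (150, 73.5)}
for name, (flt, fixed) in pairs.items():
    # relative saving of the 8-bit design vs. the float design
    print(name, f"{(1 - fixed / flt) * 100:.2f}%")
# LUT 46.21%  FF 43.84%  DSP 45.00%  BRAM 51.00%
```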