Attila Fejér, Zoltán Nagy, Jenny Benois-Pineau, Péter Szolgay, Aymar de Rugy, Jean-Philippe Domenger.
Abstract
The present paper proposes an implementation of a hybrid hardware-software system for the visual servoing of prosthetic arms. We focus on the most critical part of the system: vision analysis. The prosthetic system comprises a glasses-worn eye tracker and a video camera, and the task is to recognize the object to grasp. The lightweight architecture for gaze-driven object recognition has to be implemented as a wearable device with low power consumption (less than 5.6 W). The algorithmic chain comprises gaze-fixation estimation and filtering, generation of candidates, and recognition, with two backbone convolutional neural networks (CNNs). The time-consuming parts of the system, such as the SIFT (Scale-Invariant Feature Transform) detector and the backbone CNN feature extractor, are implemented in FPGA, and a new reduction layer is introduced in the object-recognition CNN to reduce the computational burden. The proposed implementation is compatible with the real-time control of the prosthetic arm.
Keywords: FPGA; computer vision; convolutional neural network; image processing
Year: 2022 PMID: 35200746 PMCID: PMC8878618 DOI: 10.3390/jimaging8020044
Source DB: PubMed Journal: J Imaging ISSN: 2313-433X
Figure 1. Example of the residual block in the ResNet.
Figure 2. The visually guided prosthetic arm system.
Figure 3. Example of gaze-point alignment for BowlPlace4Subject2. The points are the gaze points.
Figure 4. Example of KDE gaze-point estimation for BowlPlace4Subject2. The points are the gaze points, and the white point is the estimated gaze point.
Figure 5. The gaze-driven, object-recognition CNN; the number of output channels of the reduction layer is a tunable parameter.
Figure 6. Example of a bounding box generated for BowlPlace1Subject1. The bounding boxes are generated around the red bowl.
Hybridization of the preliminary steps in the pipeline, which contains two main blocks, the Gaze-Point Alignment Block and the Gaze-Point Noise Reduction Block, and their submodules.
| Module | CPU | FPGA |
|---|---|---|
| Gaze-Point Alignment Block | | |
| SIFT Detection | - | X |
| SIFT Matching | X | - |
| Homography estimation | X | - |
| Gaze-point projection | X | - |
| Gaze-Point Noise Reduction Block | | |
| KDE estimation | X | - |
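In the Gaze-Point Alignment Block, the gaze point measured in one frame is carried to another frame by applying the homography estimated from matched SIFT keypoints. A minimal numpy sketch of that projection step (the matrix values here are illustrative, not taken from the paper):

```python
import numpy as np

def project_gaze_point(H, point):
    """Apply a 3x3 homography H to a 2D point in homogeneous coordinates."""
    x, y = point
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]  # dehomogenize

# Example: a pure-translation homography shifts the gaze point by (5, -3).
H_shift = np.array([[1.0, 0.0, 5.0],
                    [0.0, 1.0, -3.0],
                    [0.0, 0.0, 1.0]])
print(project_gaze_point(H_shift, (10.0, 20.0)))  # [15. 17.]
```

In the full pipeline, H would come from a robust estimator (e.g. RANSAC over the FLANN-matched SIFT keypoints) rather than being hand-written.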
Hybridization of the gaze-driven CNN.
| Module | CPU | FPGA |
|---|---|---|
| Resnet50 | - | X |
| Reduction layer | X | - |
| Faster R-CNN | X | - |
| MIL aggregation | X | - |
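The two CPU-side stages around the backbone can be illustrated with numpy. The reduction layer is a 1×1 convolution, i.e. a per-pixel linear map that shrinks the 2048 ResNet50 output channels to a smaller count (128 in the final system); MIL aggregation then fuses per-candidate-box class scores into one image-level score. Max-over-instances is used below as a common MIL choice for illustration, and the weights and scores are random or made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def reduction_layer(features, weight):
    """1x1 convolution: a linear map over channels, applied at every pixel.
    features: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', weight, features)

features = rng.standard_normal((2048, 10, 10))    # ResNet50-like feature map
weight = rng.standard_normal((128, 2048)) * 0.01  # reduce 2048 -> 128 channels
reduced = reduction_layer(features, weight)
print(reduced.shape)  # (128, 10, 10)

def mil_aggregate(box_scores):
    """Combine per-box class scores into image-level scores (max over boxes)."""
    return box_scores.max(axis=0)

box_scores = np.array([[0.1, 0.7],
                       [0.4, 0.2],
                       [0.9, 0.3]])
print(mil_aggregate(box_scores))  # [0.9 0.7]
```

Because the downstream ROI heads operate on the reduced feature map, shrinking the channel count directly cuts their computational load, which is the trade-off the timing tables below quantify.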
Comparison between the Intel i5 7300HQ and the Xilinx ZCU102 ARM Cortex-A53.

Xilinx ZCU102 ARM Cortex-A53:

| Sequence | | SIFT (ms) | FLANN matching (ms) | Homography estimation (ms) | Gaze-point projection (ms) |
|---|---|---|---|---|---|
| BowlPlace1Subject1 | 119 ± 25 | 875.504 ± 12.123 | 23.471 ± 5.203 | 2.200 ± 0.540 | 0.089 ± 0.004 |
| BowlPlace1Subject2 | 106 ± 16 | 875.282 ± 9.504 | 20.036 ± 3.704 | 1.900 ± 0.398 | 0.088 ± 0.001 |
| BowlPlace1Subject3 | 153 ± 50 | 873.072 ± 7.283 | 17.626 ± 3.276 | 2.539 ± 0.621 | 0.089 ± 0.001 |
| BowlPlace1Subject4 | 120 ± 25 | 873.545 ± 9.062 | 22.244 ± 5.938 | 2.160 ± 0.464 | 0.092 ± 0.009 |
| BowlPlace4Subject1 | 158 ± 55 | 855.947 ± 6.583 | 16.011 ± 3.053 | 2.883 ± 1.188 | 0.088 ± 0.001 |
| BowlPlace4Subject2 | 117 ± 24 | 861.933 ± 5.821 | 16.276 ± 2.623 | 1.997 ± 0.449 | 0.089 ± 0.004 |
| BowlPlace4Subject3 | 108 ± 19 | 867.649 ± 8.894 | 15.679 ± 4.620 | 2.136 ± 0.350 | 0.089 ± 0.005 |
| BowlPlace4Subject4 | 147 ± 49 | 857.271 ± 9.468 | 16.762 ± 4.186 | 2.240 ± 0.516 | 0.088 ± 0.001 |
| BowlPlace5Subject1 | 120 ± 33 | 861.481 ± 8.012 | 17.875 ± 2.176 | 2.018 ± 0.505 | 0.088 ± 0.001 |
| BowlPlace5Subject2 | 133 ± 42 | 858.547 ± 6.232 | 17.944 ± 3.024 | 2.354 ± 0.880 | 0.088 ± 0.001 |
| BowlPlace5Subject3 | 126 ± 33 | 859.774 ± 6.384 | 15.742 ± 2.836 | 2.007 ± 0.524 | 0.087 ± 0.001 |
| BowlPlace6Subject1 | 120 ± 25 | 867.344 ± 10.950 | 19.026 ± 3.862 | 1.965 ± 0.306 | 0.088 ± 0.001 |
| BowlPlace6Subject2 | 129 ± 35 | 862.750 ± 9.731 | 19.737 ± 4.973 | 3.681 ± 3.456 | 0.090 ± 0.008 |
| BowlPlace6Subject3 | 127 ± 31 | 864.429 ± 6.931 | 17.555 ± 3.806 | 2.588 ± 0.823 | 0.087 ± 0.001 |
| BowlPlace6Subject4 | 112 ± 22 | 867.962 ± 9.579 | 17.368 ± 4.725 | 2.710 ± 0.649 | 0.089 ± 0.004 |
Intel i5 7300HQ:

| Sequence | | SIFT (ms) | FLANN matching (ms) | Homography estimation (ms) | Gaze-point projection (ms) |
|---|---|---|---|---|---|
| BowlPlace1Subject1 | 151 ± 67 | 74.205 ± 5.611 | 3.891 ± 0.853 | 0.259 ± 0.051 | 0.015 ± |
| BowlPlace1Subject2 | 156 ± 37 | 75.062 ± 5.640 | 3.304 ± 0.579 | 0.228 ± 0.040 | 0.014 ± |
| BowlPlace1Subject3 | 86 ± 50 | 72.217 ± 2.572 | 3.011 ± 0.476 | 0.282 ± 0.055 | 0.014 ± |
| BowlPlace1Subject4 | 138 ± 69 | 72.979 ± 2.853 | 3.717 ± 0.940 | 0.252 ± 0.044 | 0.015 ± 0.002 |
| BowlPlace4Subject1 | 94 ± 50 | 70.068 ± 2.405 | 2.747 ± 0.565 | 0.313 ± 0.113 | 0.014 ± |
| BowlPlace4Subject2 | 121 ± 28 | 72.280 ± 3.538 | 2.778 ± 0.407 | 0.233 ± 0.040 | 0.015 ± |
| BowlPlace4Subject3 | 126 ± 39 | 73.402 ± 3.406 | 2.678 ± 0.728 | 0.256 ± 0.047 | 0.014 ± |
| BowlPlace4Subject4 | 95 ± 50 | 70.394 ± 2.349 | 2.872 ± 0.695 | 0.259 ± 0.051 | 0.014 ± |
| BowlPlace5Subject1 | 129 ± 39 | 71.990 ± 2.691 | 3.027 ± 0.369 | 0.244 ± 0.050 | 0.015 ± |
| BowlPlace5Subject2 | 120 ± 56 | 71.587 ± 2.526 | 3.077 ± 0.573 | 0.272 ± 0.087 | 0.014 ± |
| BowlPlace5Subject3 | 108 ± 36 | 71.359 ± 2.500 | 2.684 ± 0.448 | 0.234 ± 0.049 | 0.015 ± 0.001 |
| BowlPlace6Subject1 | 132 ± 48 | 72.150 ± 2.891 | 3.213 ± 0.645 | 0.237 ± 0.031 | 0.015 ± |
| BowlPlace6Subject2 | 129 ± 59 | 71.790 ± 3.934 | 3.348 ± 0.823 | 0.390 ± 0.316 | 0.015 ± |
| BowlPlace6Subject3 | 114 ± 47 | 72.042 ± 2.883 | 2.976 ± 0.617 | 0.287 ± 0.076 | 0.015 ± 0.001 |
| BowlPlace6Subject4 | 138 ± 44 | 74.585 ± 4.431 | 3.089 ± 0.849 | 0.303 ± 0.075 | 0.015 ± |
Comparison of the processing time of the kernel density estimation module between the Intel i5 7300HQ and the Xilinx ZCU102 ARM Cortex-A53.
| Object | | ARM Cortex-A53 (ms) | ARM max (ms) | Intel i5 7300HQ (ms) | Intel max (ms) |
|---|---|---|---|---|---|
| Bowl | 22 ± 8 | 49.27 ± 82.83 | 307.34 | 4.94 ± 7.68 | 27.90 |
| CanOfCocaCola | 26 ± 11 | 75.54 ± 95.89 | 395.08 | 7.46 ± 8.80 | 36.70 |
| FryingPan | 24 ± 9 | 59.09 ± 50.06 | 206.76 | 5.86 ± 4.51 | 18.98 |
| Glass | 29 ± 10 | 148.22 ± 265.60 | 943.19 | 14.89 ± 26.23 | 92.21 |
| Jam | 27 ± 12 | 132.75 ± 319.01 | 1365.65 | 13.34 ± 31.39 | 134.68 |
| Lid | 29 ± 16 | 247.21 ± 718.32 | 3835.30 | 23.92 ± 70.97 | 379.64 |
| MilkBottle | 28 ± 10 | 114.95 ± 148.60 | 647.86 | 11.20 ± 13.99 | 61.92 |
| Mug | 28 ± 11 | 109.88 ± 218.40 | 1087.39 | 11.03 ± 21.26 | 106.63 |
| OilBottle | 30 ± 12 | 235.15 ± 477.79 | 2117.26 | 22.86 ± 46.23 | 205.83 |
| Plate | 32 ± 14 | 203.39 ± 406.91 | 1837.70 | 19.59 ± 39.46 | 178.97 |
| Rice | 29 ± 13 | 90.34 ± 95.16 | 372.93 | 8.64 ± 8.92 | 35.80 |
| SaucePan | 25 ± 12 | 139.07 ± 261.08 | 1286.11 | 13.68 ± 25.82 | 126.92 |
| Sponge | 24 ± 10 | 50.05 ± 49.79 | 207.89 | 5.10 ± 4.76 | 20.46 |
| Sugar | 27 ± 14 | 146.60 ± 271.58 | 1165.44 | 14.46 ± 26.70 | 117.57 |
| VinegarBottle | 28 ± 13 | 122.32 ± 178.37 | 683.56 | 12.23 ± 17.71 | 70.01 |
| WashLiquid | 28 ± 12 | 102.93 ± 183.02 | 880.47 | 10.42 ± 18.45 | 89.25 |
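The KDE module timed above can be sketched as follows: evaluate a Gaussian kernel density over the collected gaze points and keep the densest sample as the estimated fixation. The bandwidth value and the sample-based evaluation are simplifying assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def kde_gaze_estimate(points, bandwidth=15.0):
    """Return the gaze sample with the highest Gaussian kernel density.
    Evaluating the KDE only at the samples themselves avoids a full grid."""
    pts = np.asarray(points, dtype=float)                    # (N, 2)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    density = np.exp(-d2 / (2.0 * bandwidth ** 2)).sum(axis=1)
    return pts[np.argmax(density)]

# A tight cluster of fixations plus one outlier: the estimate stays in the cluster.
gaze = [(100, 100), (102, 101), (99, 98), (101, 100), (400, 300)]
estimate = kde_gaze_estimate(gaze)
```

The pairwise evaluation is quadratic in the number of gaze points, which is consistent with the wide per-object spread of the KDE timings in the table.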
Measurements of the gaze-driven, object-recognition CNN on the Intel i5 7300HQ CPU. The first column contains the remaining number of channels after the reduction layer. Each subsequent column shows the elapsed computation time in milliseconds.
| Number of Channels | Backbone (ms) | Reduction Layer (ms) | ROI Heads (ms) | Aggregation (ms) |
|---|---|---|---|---|
| 32 | 90.000 ± 0.250 | 0.336 ± | 1.107 ± | 0.137 ± |
| 64 | 97.307 ± 1.613 | 0.531 ± 0.002 | 2.262 ± 0.004 | 0.138 ± |
| 96 | 87.441 ± 0.508 | 0.557 ± 0.003 | 2.956 ± 0.003 | 0.241 ± |
| 128 | 89.952 ± 2.568 | 0.646 ± 0.001 | 3.356 ± 0.001 | 0.142 ± |
| 256 | 85.287 ± 0.375 | 0.908 ± | 6.592 ± 0.002 | 0.150 ± |
| 512 | 94.505 ± 2.100 | 2.485 ± 0.002 | 12.276 ± 0.002 | 0.159 ± |
| 1024 | 95.515 ± 7.285 | 3.204 ± 0.007 | 23.718 ± 0.010 | 0.164 ± |
Measurements of the gaze-driven, object-recognition CNN on the ARM A53 CPU. The first column contains the remaining number of channels after the reduction layer. Each subsequent column shows the elapsed computation time in milliseconds.
| Number of Channels | Backbone (ms) | Reduction Layer (ms) | ROI Heads (ms) | Aggregation (ms) |
|---|---|---|---|---|
| 32 | 1863.300 ± 11.433 | 6.949 ± 0.001 | 13.843 ± 0.002 | 0.643 ± 0.001 |
| 64 | 1768.616 ± 15.615 | 8.156 ± 0.001 | 21.859 ± 0.006 | 0.708 ± |
| 96 | 1787.737 ± 15.903 | 10.178 ± 0.001 | 30.705 ± 0.001 | 0.758 ± |
| 128 | 1800.327 ± 17.915 | 12.140 ± 0.001 | 39.371 ± 0.002 | 0.727 ± |
| 256 | 1797.798 ± 16.372 | 22.061 ± 0.011 | 73.750 ± 0.002 | 0.714 ± |
| 512 | 1733.458 ± 14.429 | 33.723 ± 0.001 | 142.231 ± 0.001 | 0.752 ± |
| 1024 | 1761.748 ± 16.305 | 63.319 ± 0.001 | 285.121 ± 0.002 | 0.714 ± |
The results of the training and testing after 30 epochs.
| Number of Channels | 32 | 64 | 96 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|
| avg loss on training set | 7.235 | 6.318 | 6.642 | 4.778 | 3.920 | 3.115 | 2.623 |
| avg acc on training set | 0.815 | 0.877 | 0.827 | 0.963 | 0.988 | 1.000 | 1.000 |
| avg acc on test set | 0.793 ± 0.261 | 0.926 ± 0.120 | 0.853 ± 0.161 | 0.952 ± 0.083 | 1.000 | 1.000 | 1.000 |
| avg AP on test set | 0.978 ± 0.043 | 0.985 ± 0.030 | 0.964 ± 0.041 | 0.995 ± 0.012 | 1.000 | 1.000 | 1.000 |
Figure 7. Training accuracy (in blue) and loss (in red) during 30 epochs. The top-left panel shows the run with 32 output channels in the reduction layer, and the panel next to it, 64. The vertical axis is the training loss or the accuracy, depending on the curve; the horizontal axis is the epoch number.
Comparison of different object-recognition CNNs. All measurements were taken with Vitis AI 1.4. The gaze-driven, object-recognition CNN used 128 channels in the reduction layer.
| Name | Gaze-Driven, Object-Recognition CNN | SSD MobileNet V2 | YOLO V3 |
|---|---|---|---|
| Dataset | GITW | COCO | VOC |
| Framework | PyTorch | TensorFlow | TensorFlow |
| Input size | 300 × 300 | 300 × 300 | 416 × 416 |
| Running device | ZCU 102 + ARM A53 | ZCU 102 | ZCU 102 |
| FPS | 12.64 | 78.8 | 13.2 |
The average computational time of the whole system on different hardware. The number of output channels of the reduction layer after Resnet50 is 128.
| Module | Intel i5 7300HQ (ms) | ARM Cortex-A53 (ms) | Hybrid ZCU102 (ms) |
|---|---|---|---|
| SIFT | 72.407 ± 3.349 | 865.499 ± 8.437 | 7.407 |
| FLANN matcher | 3.094 ± 0.638 | 18.223 ± 3.867 | 18.223 ± 3.867 |
| Homography estimation | 0.270 ± 0.075 | 2.359 ± 0.778 | 2.359 ± 0.778 |
| Gaze point projection | 0.015 ± | 0.089 ± 0.003 | 0.089 ± 0.003 |
| KDE estimation | 12.477 ± 23.306 | 126.672 ± 238.900 | 126.672 ± 238.900 |
| Bounding Box generation | 0.424 ± 0.020 | 2.659 ± 0.027 | 2.659 ± 0.027 |
| Resnet50 | 89.952 ± 2.568 | 1800.327 ± 17.915 | 26.860 |
| Reduction Layer | 0.645 ± 0.001 | 12.140 ± 0.001 | 12.140 ± 0.001 |
| Faster R-CNN | 3.356 ± 0.001 | 39.371 ± 0.002 | 39.371 ± 0.002 |
| MIL Aggregation | 0.142 ± | 0.727 ± | 0.727 ± |
| Total time (ms) | 182.782 ± 29.957 | 2868.066 ± 269.930 | 236.507 ± 243.578 |
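As a quick sanity check on the table, the per-module speedups from offloading to the FPGA can be computed directly from the reported means (values copied from the rows above):

```python
# Mean times (ms) from the table: ARM Cortex-A53 software vs. FPGA-offloaded hybrid.
sift_arm, sift_fpga = 865.499, 7.407
resnet_arm, resnet_fpga = 1800.327, 26.860

sift_speedup = sift_arm / sift_fpga        # SIFT detector offloaded to FPGA
resnet_speedup = resnet_arm / resnet_fpga  # ResNet50 backbone offloaded to FPGA
print(f"SIFT: {sift_speedup:.0f}x, ResNet50: {resnet_speedup:.0f}x")  # SIFT: 117x, ResNet50: 67x
```

These two modules dominate the ARM-only total (about 2666 ms of the 2868 ms), which is why offloading just them brings the hybrid total down to roughly 237 ms.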