Guang Chen, Hu Cao, Canbo Ye, Zhenyan Zhang, Xingbo Liu, Xuhui Mo, Zhongnan Qu, Jörg Conradt, Florian Röhrbein, Alois Knoll.
Abstract
Neuromorphic vision sensors are bio-inspired cameras that naturally capture the dynamics of a scene with ultra-low latency, filtering out redundant information with low power consumption. Few works have addressed object detection with this sensor. In this work, we develop pedestrian detectors that unlock the potential of event data by leveraging multi-cue information and different fusion strategies. To make the best of the event data, we introduce three event-stream encoding methods based on Frequency, Surface of Active Events (SAE), and Leaky Integrate-and-Fire (LIF). We further integrate them into state-of-the-art neural network architectures with two fusion approaches: channel-level fusion in the raw feature space and decision-level fusion over the probability assignments. We present a qualitative and quantitative analysis of why the different encoding methods were chosen for evaluating pedestrian detection and which method performs best. We demonstrate the advantages of decision-level fusion of multi-cue event information and show that our approach performs well on a self-annotated event-based pedestrian dataset with 8,736 event frames. This work paves the way for more fascinating perception applications with neuromorphic vision sensors.
Keywords: convolutional neural network; event-stream encoding; multi-cue event information fusion; neuromorphic vision sensor; object detection
Year: 2019 PMID: 31001104 PMCID: PMC6454154 DOI: 10.3389/fnbot.2019.00010
Source DB: PubMed Journal: Front Neurorobot ISSN: 1662-5218 Impact factor: 2.650
Figure 1. Overview of the pedestrian detection system with multi-cue event information fusion. Frequency, SAE (Surface of Active Events), and LIF (Leaky Integrate-and-Fire) are the three methods we use to encode the event stream, and we encode Positive, Negative, and Positive + Negative events respectively. For channel-level fusion, we map the three encodings to the R, G, and B channels respectively and train one detector on the merged data. For decision-level fusion, we train three detectors on the separately encoded data and combine them with DBF into one fused detector.
Figure 2. (A) Representation of the spatiotemporal data from the neuromorphic vision sensor in 3D (x, y, t); (B) event frame based on Frequency; (C) event frame based on SAE; (D) event frame based on LIF.
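As a concrete illustration of the Frequency and SAE encodings in Figure 2, here is a minimal sketch, assuming events arrive as (x, y, t, polarity) tuples sorted by time within one accumulation window; the function names, normalization, and frame handling are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def encode_frequency(events, height, width):
    """Frequency encoding (sketch): per-pixel event count, normalized to [0, 1]."""
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, t, p in events:
        frame[y, x] += 1.0  # count every event at this pixel
    if frame.max() > 0:
        frame /= frame.max()  # scale counts into a grayscale image
    return frame

def encode_sae(events, height, width, t_start, t_end):
    """SAE encoding (sketch): timestamp of the most recent event per pixel,
    normalized over the window [t_start, t_end]."""
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, t, p in events:  # sorted by time, so later events overwrite
        frame[y, x] = (t - t_start) / (t_end - t_start)
    return frame
```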
Figure 3. The encoding procedure of the LIF neuron model. The top shows an asynchronous event stream; at time t, the LIF neuron fires a spike.
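A matching sketch of the LIF encoding in Figure 3, under the assumption of a linear leak between a pixel's successive events and a unit charge per event; the threshold, leak rate, and spike-count output are illustrative assumptions rather than the authors' exact parameters.

```python
import numpy as np

def encode_lif(events, height, width, threshold=1.0, leak_rate=0.1):
    """LIF encoding (sketch): each event charges its pixel's membrane
    potential, the potential leaks over time, and a spike is emitted
    (and the potential reset) whenever the threshold is crossed."""
    potential = np.zeros((height, width), dtype=np.float32)
    last_t = np.zeros((height, width), dtype=np.float32)
    spikes = np.zeros((height, width), dtype=np.float32)
    for x, y, t, p in events:  # sorted by time
        # Linear leak since this pixel's previous event (assumption).
        potential[y, x] = max(potential[y, x] - leak_rate * (t - last_t[y, x]), 0.0)
        potential[y, x] += 1.0  # unit charge per incoming event
        last_t[y, x] = t
        if potential[y, x] >= threshold:  # the neuron fires, as at time t in Figure 3
            spikes[y, x] += 1.0
            potential[y, x] = 0.0
    if spikes.max() > 0:
        spikes /= spikes.max()  # normalized spike counts as a grayscale frame
    return spikes
```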
Figure 4. The model of the Merged-Three-Channels method: the three pre-merging event frames are colorized here for better visualization; the actual frames used in this work are grayscale.
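The Merged-Three-Channels construction itself reduces to stacking the three grayscale encodings as color planes. A minimal sketch, assuming the encoding functions sketched above and the R = Frequency, G = SAE, B = LIF ordering implied by the "respectively" in the Figure 1 caption:

```python
import numpy as np

def encode_mtc(freq_frame, sae_frame, lif_frame):
    """MTC (sketch): stack Frequency, SAE, and LIF frames as R, G, B planes."""
    return np.stack([freq_frame, sae_frame, lif_frame], axis=-1)  # (H, W, 3)

# Hypothetical usage with the sketches above, for one event window:
# mtc = encode_mtc(encode_frequency(ev, H, W),
#                  encode_sae(ev, H, W, t0, t1),
#                  encode_lif(ev, H, W))
```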
Nomenclature of the groups.
| Group | Description |
| P | All events contained in the frame are positive. |
| N | All events contained in the frame are negative. |
| PN | All events contained in the frame are positive or negative. |
| Frequency | Encoded by the Frequency method. |
| SAE | Encoded by the Surface of Active Events method. |
| LIF | Encoded by the Leaky Integrate-and-Fire method. |
| MTC | Encoded by the Merged-Three-Channels method. |
The performance for the polarity dataset using different event-stream encoding methods (Frequency, SAE, and LIF) based on YOLO (YOLOv3, AP at IoU 0.5).
| Method | Polarity | AP |
| YOLO-F | Positive | **74.48%** |
| YOLO-SAE | Positive | 71.76% |
| YOLO-LIF | Positive | 63.05% |
| YOLO-F | Negative | **78.62%** |
| YOLO-SAE | Negative | 74.25% |
| YOLO-LIF | Negative | 71.05% |
| YOLO-F | Positive + Negative | **81.04%** |
| YOLO-SAE | Positive + Negative | 76.47% |
| YOLO-LIF | Positive + Negative | 72.72% |
The bold values indicate the best result among the encoding methods within the same polarity group.
The performance for the Positive-Negative combination datasets using different event-stream encoding methods (Frequency, SAE, and LIF) based on YOLO and the two fusion strategies.
| Method | P | N | PN |
| YOLO-F | 74.48% | 78.62% | 81.04% |
| YOLO-SAE | 71.76% | 74.25% | 76.47% |
| YOLO-LIF | 63.05% | 71.05% | 72.72% |
| YOLO-MTC | 76.06% | 77.26% | 78.98% |
| DBF | 78.53% | 80.86% | 82.28% |
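The DBF row fuses the probability assignments of the three single-cue detectors at decision level. The sketch below is not the paper's DBF algorithm but a simple stand-in illustrating the idea: boxes from the individual detectors are matched by IoU and their confidences combined, so detections that several cues agree on receive higher fused scores.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def fuse_decisions(per_detector: List[List[Box]], iou_thr: float = 0.5) -> List[Box]:
    """Greedy decision-level fusion (sketch): pool all boxes, group them by
    IoU overlap, and average scores over the number of detectors, so boxes
    confirmed by multiple cues score higher than single-cue detections."""
    pool = sorted((b for dets in per_detector for b in dets),
                  key=lambda b: b[4], reverse=True)
    used = [False] * len(pool)
    fused = []
    for i, anchor in enumerate(pool):
        if used[i]:
            continue
        used[i] = True
        group = [anchor]
        for j in range(i + 1, len(pool)):
            if not used[j] and iou(anchor, pool[j]) >= iou_thr:
                used[j] = True
                group.append(pool[j])
        score = sum(b[4] for b in group) / len(per_detector)
        fused.append((*anchor[:4], score))
    return fused
```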
Figure 5. Precision-Recall curves of different event-stream encoding methods (Frequency, SAE, and LIF) based on YOLO with the different Positive-Negative combination datasets: (A) YOLO-F; (B) YOLO-SAE; (C) YOLO-LIF.
Figure 6. Predicted results: the outputs of the three individual detectors (YOLO-F, YOLO-SAE, and YOLO-LIF) are shown in the upper three rows, while the outputs of YOLO-MTC are shown in the bottom row; results on the same Positive-Negative combination dataset are shown in the same column. (A) YOLO-F_P; (B) YOLO-F_N; (C) YOLO-F_PN; (D) YOLO-SAE_P; (E) YOLO-SAE_N; (F) YOLO-SAE_PN; (G) YOLO-LIF_P; (H) YOLO-LIF_N; (I) YOLO-LIF_PN; (J) YOLO-MTC_P; (K) YOLO-MTC_N; (L) YOLO-MTC_PN.