Clément Farabet, Rafael Paz, José Pérez-Carrasco, Carlos Zamarreño-Ramos, Alejandro Linares-Barranco, Yann LeCun, Eugenio Culurciello, Teresa Serrano-Gotarredona, Bernabé Linares-Barranco.
Abstract
Most scene segmentation and categorization architectures for the extraction of features in images and patches make exhaustive use of 2D convolution operations for template matching, template search, and denoising. Convolutional Neural Networks (ConvNets) are one example of such architectures that can implement general-purpose bio-inspired vision systems. On standard digital computers, 2D convolutions are usually expensive in terms of resource consumption and impose severe limitations on efficient real-time applications. Nevertheless, neuro-cortex-inspired solutions, such as dedicated frame-based or frame-free spiking ConvNet convolution processors, are advancing real-time visual processing. The two approaches share the same neural inspiration, but each solves the problem in a different way. Frame-based ConvNets process video information frame by frame in a very robust and fast way that requires using and sharing the available hardware resources (such as multipliers and adders). Hardware resources are fixed and time-multiplexed by fetching data in and out, so memory bandwidth and size are important for good performance. Spike-based convolution processors, on the other hand, are a frame-free alternative able to perform convolutions on a spike-based source of visual information with very low latency, which makes them ideal for very high-speed applications. However, their hardware resources must be available at all times and cannot be time-multiplexed, so the hardware should be modular, reconfigurable, and expandable. Hardware implementations in both custom VLSI integrated circuits (digital and analog) and FPGAs have already been used to demonstrate the performance of these systems. In this paper we present a comparative study of these two neuro-inspired solutions: a brief description of both systems, followed by a discussion of their differences, pros, and cons.
Keywords: FPGA; VHDL; address-event-representation; convolutional neural network; frame-free vision; image convolutions; spike-based convolutions
Year: 2012 PMID: 22518097 PMCID: PMC3324817 DOI: 10.3389/fnins.2012.00032
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
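The 2D convolution the abstract describes is the operation both architectures implement. A minimal pure-Python sketch of the frame-based (per-frame) version, with an illustrative frame and kernel not taken from the paper:

```python
def conv2d(frame, kernel):
    """Valid-mode 2D convolution over a full frame (list of lists)."""
    fh, fw = len(frame), len(frame[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(fh - kh + 1):
        row = []
        for x in range(fw - kw + 1):
            # Multiply-accumulate over the kernel window: this inner loop
            # is what dedicated hardware time-multiplexes across the frame.
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += frame[y + j][x + i] * kernel[j][i]
            row.append(acc)
        out.append(row)
    return out

frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d(frame, kernel))  # [[-4, -4], [-4, -4]]
```

Because every output pixel repeats the same multiply-accumulate window, frame-based hardware can share a small pool of multipliers and adders across the whole frame, at the cost of the memory traffic the abstract highlights.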
Figure 1Architecture of a typical convolutional network for object recognition. This implements a convolutional feature extractor and a linear classifier for generic N-class object recognition. Once trained, the network can be computed on arbitrarily large input images, producing a classification map as output.
Figure 2Example netlist and its ASCII file netlist description.
Figure 3Architecture of the convolution chip.
Figure 4Block diagram of the FPGA AER-based convolution processor (left) and its State Machine (right).
Frame-free FPGA resource consumption.
| Resources of a Virtex6 LX240T | #Used |
|---|---|
| 128 × 8-bit single-port block RAM | 64 |
| 16 × 1-bit single-port read-only distributed RAM | 64 |
| 16 × 16-bit dual-port distributed RAM | 64 |
| 4096 × 8-bit single-port block RAM | 64 |
| 4 × 4-bit single-port read-only distributed RAM | 1 |
| 64 × 64-bit single-port read-only distributed RAM | 1 |
| 2- to 33-bit adders/subtractors | 2752 |
| 2- to 14-bit counters | 1487 |
| Flip-flops | 91397 |
| Finite-state-machines | 1557 |
| 2- to 33-bit comparators | 3274 |
| 1- to 32-bit multiplexers | 25801 |
| Slice registers | 74987 out of 301440 (24%) |
| Slice LUTs | 83521 out of 150720 (55%) |
| Occupied Slices | 32720 out of 37680 (86%) |
| Block RAM36E1/FIFO | 64 out of 416 (15%) |
| Block RAM18E1/FIFO | 68 out of 832 (8%) |
Figure 5Illustration of pseudo-simultaneity in fast event-driven recognition. (A) Feed-forward Two-Convolution system. (B) Photograph with commercial camera at 1/60 s. (C) Five milliseconds event capture from AER motion retina. (D) Event rate computed using 10 μs bins. (E) First pre-filtering Kernel. (F) Second template-matching kernel. (G) Events from real retina (red dots), simulated output of first filter (green circles), and simulated output of second filter (blue stars). (H) y/time zoom out. (I) x/y zoom out.
Figure 6Illustration of pseudo-simultaneity concept extrapolated to multiple layers. (A) Vision system composed of Vision Sensor and five sequential processing stages, like in a ConvNet. (B) Timing in a Frame-constraint system with 1 ms frame time for sensing and per stage processing. (C) Timing in an Event-driven system with micro-second delays for sensor and processor events.
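The timing contrast in Figure 6 reduces to simple arithmetic: a frame-constrained pipeline accumulates one frame time per stage, while an event-driven pipeline adds only a per-stage event delay. A back-of-envelope sketch, where the per-stage event delay is an assumed order of magnitude rather than a measured figure from the paper:

```python
stages = 6            # sensor plus five processing stages, as in Figure 6A
frame_time_ms = 1.0   # per-stage frame time in the frame-constrained case (Figure 6B)
event_delay_us = 3.0  # assumed per-stage event latency (microsecond scale, Figure 6C)

# Frame-constrained: each stage must wait for a full frame from the previous one.
frame_latency_ms = stages * frame_time_ms

# Event-driven: each event flows through all stages with only per-stage delays.
event_latency_ms = stages * event_delay_us / 1000.0

print(frame_latency_ms)  # 6.0 ms end-to-end
print(event_latency_ms)  # 0.018 ms end-to-end
```

This is the pseudo-simultaneity argument in miniature: output events from the last stage appear microseconds after the sensor events that caused them, instead of several frame times later.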
Figure 7(A) A data-flow computer. A set of run-time configurable processing tiles are connected on a 2D grid. They can exchange data with their 4 neighbors and with an off-chip memory via global lines. (B) The grid is configured for a more complex computation that involves several tiles: the 3 top tiles perform a 3 × 3 convolution, the 3 intermediate tiles another 3 × 3 convolution, the bottom left tile sums these two convolutions, and the bottom center tile applies a function to the result.
Figure 8Computing time for a typical ConvNet, versus the number of connections used for training the network.
Frame-free vs. frame-constrained.
| | Frame-free | Frame-constrained |
|---|---|---|
| Data processing | Per-event, resulting in pseudo-simultaneity | Per frame/patch |
| Hardware multiplexing | Not possible | Possible |
| Hardware up-scaling | By adding modules | |
| Speed | Determined by statistics of input stimuli | Determined by number and type of operations, available hardware resources and their speed |
| Power consumption | Determined by module power per-event, and inter-module communication power per-event | Determined by power of processor(s) and memory fetching requirements |
| Feedback | Instantaneous. No need to iterate | Need to iterate until convergence for each frame |
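The per-event processing in the frame-free column can be sketched in a few lines: each incoming address event projects the kernel onto a neighborhood of pixel accumulators, and an output event fires as soon as an accumulator crosses threshold. This is a minimal integrate-and-fire sketch of that scheme; the function name, array layout, and threshold value are illustrative, not from the paper:

```python
def process_event(state, x, y, kernel, threshold):
    """Project kernel centered at (x, y) onto state; return output events."""
    kh, kw = len(kernel), len(kernel[0])
    cy, cx = kh // 2, kw // 2
    out_events = []
    for j in range(kh):
        for i in range(kw):
            px, py = x + i - cx, y + j - cy
            if 0 <= py < len(state) and 0 <= px < len(state[0]):
                state[py][px] += kernel[j][i]
                if state[py][px] >= threshold:
                    out_events.append((px, py))  # fire...
                    state[py][px] = 0            # ...and reset
    return out_events

state = [[0] * 5 for _ in range(5)]   # pixel accumulators
kernel = [[1, 1, 1],
          [1, 2, 1],
          [1, 1, 1]]
# Two input events at the same address: the first leaves the center
# accumulator at 2 (below threshold 3), the second pushes it to 4 and fires.
evts = process_event(state, 2, 2, kernel, threshold=3)
evts += process_event(state, 2, 2, kernel, threshold=3)
print(evts)  # [(2, 2)]
```

Note why multiplexing is impossible here: every pixel accumulator must hold its state between events, so the full array of neurons has to exist in hardware at all times, which is the modular, expandable design the table calls for.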
Performance comparison.
| | Purdue/NYU | IMSE/US | 3D ASIC | Grid 40 nm |
|---|---|---|---|---|
| Input scene size | 512 × 512 | 128 × 128 | 512 × 512 | 512 × 512 |
| Delay | 5.2 ms/frame | 3 μs/event | 1.3 ms/frame | 10 ns/event |
| Gabor array | 16 convs 10 × 10 kernels | 64 convs 11 × 11 kernels | 16 convs 10 × 10 kernels | 100 convs 32 × 32 kernels |
| Neurons | 4.05 × 106 | 2.62 × 105 | 4.05 × 106 | 108 |
| Synapses | 4.05 × 108 | 3.20 × 107 | 4.05 × 108 | 1011 |
| Conn/s | 7.8 × 1010 | 2.6 × 109 | 3 × 1011 | 4 × 1013 |