Manuel Rodríguez, Eduardo Magdaleno, Fernando Pérez, Cristhian García.
Abstract
The non-equispaced fast Fourier transform (NFFT) is an important algorithm in several technological and scientific areas, such as synthetic aperture radar, computational photography, medical imaging, telecommunications and seismic analysis. However, its computational complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that combines an Advanced RISC Machine (ARM) processor as the Processing System with Programmable Logic for high-performance digital signal processing through parallelism and pipelining techniques. The algorithm has been coded in C with pragma directives to optimize the architecture of the system. We have used the recently introduced Software-Defined System-on-Chip (SDSoC) development tool, which simplifies the interfacing and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results show that hardware acceleration significantly outperformed the software-based implementation.
Keywords: NFFT; SDSoC; Zynq; parallelism techniques; software acceleration
Year: 2017 PMID: 28350358 PMCID: PMC5421654 DOI: 10.3390/s17040694
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Simplified architecture model of a Zynq-7020 device.
Figure 2. The programmable logic of the Zynq-7020 device and its constituent elements.
Figure 3. The traditional hardware/software co-design flow.
Figure 4. Design flow using SDSoC.
Figure 5. Example of two function selections to implement the same application.
Figure 6. Design flow for the NFFT algorithm implementation in Zynq.
Figure 7. NFFT HW/SW C-code acceleration using Zynq. SDSoC is a flexible tool that lets us independently select one, two or three functions of our algorithm to be implemented in HW.
Figure 8. Loop pipelining scheme. The loop pragma reduces the loop latency by two cycles.
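
The loop pipelining mentioned in the Figure 8 caption can be illustrated with a minimal C sketch. It does not reproduce the authors' code: the function name `scale_samples`, the bound `MAX_SAMPLES` and the loop body are hypothetical, and the directive shown is the Vivado HLS `PIPELINE` pragma, assumed here to be the kind of loop pragma the paper refers to. An ordinary C compiler ignores the pragma (possibly with a warning), so the function also runs unchanged on the ARM core.

```c
#include <assert.h>
#include <stddef.h>

/* Largest sample size reported in the tables; used here only as an
 * assumed upper bound so the arrays have a fixed size. */
#define MAX_SAMPLES 1024

/* Hypothetical helper: multiply each sample by a per-sample factor.
 * Each iteration is independent, so the loop is a natural candidate
 * for pipelining in the programmable logic. */
void scale_samples(const float in[MAX_SAMPLES],
                   const float factor[MAX_SAMPLES],
                   float out[MAX_SAMPLES],
                   size_t n)
{
    assert(n <= MAX_SAMPLES);

    for (size_t i = 0; i < n; i++) {
/* Ask the HLS tool to start a new iteration every clock cycle
 * (initiation interval of 1) instead of waiting for the previous
 * iteration to finish. */
#pragma HLS PIPELINE II=1
        out[i] = in[i] * factor[i];
    }
}
```

With pipelining, successive iterations overlap, so the total loop latency approaches the iteration count plus the pipeline depth rather than the iteration count times the per-iteration latency.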
Acceleration of the deconvolution function.
| Size of Samples | Software: CPU Cycles | Software: % PS ¹ | Software: % PL ² | Hardware: CPU Cycles | Hardware: % PS ¹ | Hardware: % PL ² | Speed-Up Improvement |
|---|---|---|---|---|---|---|---|
| 32 (2^5) | 265,328 | 24.6% | 0.0% | 14,652 | 4.1% | 19.7% | 18.11 |
| 64 (2^6) | 278,355 | 24.9% | 0.0% | 14,763 | 4.3% | 20.0% | 18.85 |
| 128 (2^7) | 304,407 | 25.3% | 0.0% | 14,983 | 4.5% | 20.6% | 20.32 |
| 256 (2^8) | 356,512 | 26.2% | 0.0% | 15,423 | 4.9% | 21.7% | 23.12 |
| 512 (2^9) | 460,722 | 27.8% | 0.0% | 16,303 | 5.7% | 23.9% | 28.26 |
| 1024 (2^10) | 669,142 | 31.1% | 0.0% | 18,064 | 7.4% | 28.3% | 37.04 |

¹ Percentage of the Processing System (ARM) used. ² Percentage of the Programmable Logic (FPGA) used.
Acceleration of the FFT function.
| Size of Samples | Software: CPU Cycles | Software: % ARM (PS) | Software: % PL | Hardware: CPU Cycles | Hardware: % ARM (PS) | Hardware: % PL | Speed-Up Improvement |
|---|---|---|---|---|---|---|---|
| 32 (2^5) | 252,298 | 25.2% | 0.0% | 7,844 | 4.1% | 23.4% | 32.16 |
| 64 (2^6) | 342,447 | 25.5% | 0.0% | 7,849 | 4.3% | 23.8% | 43.63 |
| 128 (2^7) | 522,744 | 25.9% | 0.0% | 7,859 | 4.5% | 24.6% | 66.52 |
| 256 (2^8) | 883,339 | 26.7% | 0.0% | 7,880 | 4.9% | 26.2% | 112.10 |
| 512 (2^9) | 1,604,528 | 28.4% | 0.0% | 7,919 | 5.7% | 29.4% | 202.62 |
| 1024 (2^10) | 3,046,906 | 31.9% | 0.0% | 7,998 | 7.4% | 35.7% | 380.96 |

Acceleration of the convolution function.
| Size of Samples | Software: CPU Cycles | Software: % ARM (PS) | Software: % PL | Hardware: CPU Cycles | Hardware: % ARM (PS) | Hardware: % PL | Speed-Up Improvement |
|---|---|---|---|---|---|---|---|
| 32 (2^5) | 159,932 | 61.3% | 0.0% | 143,336 | 4.1% | 54.5% | 1.12 |
| 64 (2^6) | 348,085 | 62.3% | 0.0% | 311,964 | 4.3% | 55.1% | 1.12 |
| 128 (2^7) | 724,390 | 64.1% | 0.0% | 649,220 | 4.5% | 55.9% | 1.12 |
| 256 (2^8) | 1,476,999 | 67.8% | 0.0% | 1,323,732 | 4.9% | 57.7% | 1.12 |
| 512 (2^9) | 2,982,218 | 75.2% | 0.0% | 2,682,756 | 5.7% | 61.2% | 1.11 |
| 1024 (2^10) | 5,992,656 | 89.9% | 0.0% | 5,370,804 | 7.4% | 68.3% | 1.12 |

Acceleration of the whole system.
| Size of Samples | Software: CPU Cycles | Software: % ARM (PS) | Software: % PL | Hardware: CPU Cycles | Hardware: % ARM (PS) | Hardware: % PL | Speed-Up Improvement |
|---|---|---|---|---|---|---|---|
| 32 (2^5) | 678,094 | 69.9% | 0.0% | 166,582 | 4.1% | 61.5% | 4.07 |
| 64 (2^6) | 969,428 | 70.7% | 0.0% | 335,397 | 4.3% | 62.5% | 2.90 |
| 128 (2^7) | 1,552,094 | 72.2% | 0.0% | 673,027 | 4.5% | 64.4% | 2.31 |
| 256 (2^8) | 2,717,426 | 75.3% | 0.0% | 1,348,287 | 4.9% | 68.6% | 2.02 |
| 512 (2^9) | 5,048,090 | 81.4% | 0.0% | 2,698,807 | 5.7% | 76.9% | 1.87 |
| 1024 (2^10) | 9,709,418 | 93.6% | 0.0% | 5,399,846 | 7.4% | 91.3% | 1.80 |
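
The speed-up column in these tables is consistent with the ratio of software to hardware CPU cycles. The short C sketch below reproduces that arithmetic and also estimates how much of the whole-system hardware run is spent in the convolution. The helper names `speed_up` and `cycle_share` are illustrative, and the share estimate assumes that the per-function cycle counts carry over to the whole-system measurement.

```c
#include <assert.h>

/* Speed-up as the ratio of software cycles to hardware cycles. */
double speed_up(unsigned long sw_cycles, unsigned long hw_cycles)
{
    assert(hw_cycles > 0);
    return (double)sw_cycles / (double)hw_cycles;
}

/* Fraction of a total cycle count attributable to one function. */
double cycle_share(unsigned long part_cycles, unsigned long total_cycles)
{
    assert(total_cycles > 0);
    return (double)part_cycles / (double)total_cycles;
}
```

For 1024 samples, `speed_up(9709418, 5399846)` gives about 1.80 for the whole system, while `speed_up(3046906, 7998)` gives about 380.96 for the FFT alone. Under the stated assumption, `cycle_share(5370804, 5399846)` is about 0.995, so the convolution, which gains only about 1.12 from hardware, accounts for nearly all of the accelerated run time. This is why the whole-system speed-up stays far below the FFT and deconvolution figures and falls as the sample size grows.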