High Performance Implementation of 3D Convolutional Neural Networks on a GPU
Qiang Lan, Zelong Wang, Mei Wen, Chunyuan Zhang, Yijie Wang.
Abstract
Convolutional neural networks have proven highly successful in applications such as image classification, object tracking, and many other tasks based on 2D inputs. Recently, researchers have started to apply convolutional neural networks to video classification, which involves 3D inputs and requires far larger amounts of memory and much more computation. FFT-based methods can reduce the amount of computation, but this generally comes at the cost of an increased memory requirement. The Winograd Minimal Filtering Algorithm (WMFA), on the other hand, reduces the number of operations required and thus speeds up the computation without increasing the required memory; this strategy has been shown to be successful for 2D neural networks. We implement the algorithm for 3D convolutional neural networks, apply it to a popular 3D convolutional neural network used to classify videos, and compare it to cuDNN. For our highly optimized implementation of the algorithm, we observe a twofold speedup for most of the 3D convolution layers of our test network compared to the cuDNN version.
Year: 2017 | PMID: 29250109 | PMCID: PMC5698830 | DOI: 10.1155/2017/8348671
Source DB: PubMed | Journal: Comput Intell Neurosci
Algorithm 1: 3D Winograd transformation.
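The algorithm's listing was not extracted with this record. Below is a minimal NumPy sketch of a 3D Winograd transformation for F(2 × 2 × 2, 3 × 3 × 3), using the standard F(2, 3) transform matrices (Bᵀ, G, Aᵀ) from Lavin and Gray applied along each of the three axes; function and variable names are ours, not the paper's.

```python
import numpy as np

# Standard F(2, 3) Winograd matrices; tile size alpha = m + r - 1 = 4.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def transform_3d(x, M):
    """Apply the same 1D transform matrix M along all three axes of x."""
    return np.einsum('ai,bj,ck,ijk->abc', M, M, M, x)

def winograd_3d(d, g):
    """F(2x2x2, 3x3x3): d is a 4x4x4 input tile, g a 3x3x3 filter;
    returns the 2x2x2 output tile of the (valid) 3D convolution."""
    U = transform_3d(g, G)           # filter transform -> 4x4x4
    V = transform_3d(d, BT)          # input transform  -> 4x4x4
    return transform_3d(U * V, AT)   # elementwise product, then inverse transform

# Sanity check against a direct sliding-window computation.
d = np.random.rand(4, 4, 4).astype(np.float32)
g = np.random.rand(3, 3, 3).astype(np.float32)
ref = np.array([[[np.sum(d[i:i+3, j:j+3, k:k+3] * g)
                  for k in range(2)] for j in range(2)] for i in range(2)])
assert np.allclose(winograd_3d(d, g), ref, atol=1e-4)
```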
Algorithm 2: 3D convolutional layer implemented with WMFA F(m × m × m, r × r × r).
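Algorithm 2's listing is likewise missing from the extraction. The following is a hedged NumPy sketch of how a full layer can be built from the transforms above for F(2 × 2 × 2, 3 × 3 × 3): the input is cut into overlapping tiles, every tile and filter is transformed, the pointwise stage is batched into one small matrix multiply per transform point (α³ = 64 of them), and the accumulators are inverse-transformed. It reuses BT, G, AT, and transform constants from the previous sketch; the layer interface and all names are our assumptions, not the paper's kernels.

```python
import numpy as np
# BT, G, AT as defined in the Algorithm 1 sketch above.

def conv3d_wmfa(data, filters):
    """Valid 3D convolution via WMFA F(2x2x2, 3x3x3).
    data: (C, D, H, W); filters: (K, C, 3, 3, 3).
    This sketch assumes D - 2, H - 2, W - 2 are all even."""
    C, D, H, W = data.shape
    K = filters.shape[0]
    m = 2                                    # output tile size per axis
    oD, oH, oW = D - 2, H - 2, W - 2         # valid-convolution output size
    tD, tH, tW = oD // m, oH // m, oW // m   # tiles per axis
    P = tD * tH * tW                         # total number of tiles

    # Filter transform: U[k, c] is a 4x4x4 transformed filter.
    U = np.einsum('ai,bj,ck,qrijk->qrabc', G, G, G, filters)

    # Input transform: overlapping 4x4x4 tiles with stride m = 2.
    V = np.empty((C, P, 4, 4, 4), dtype=data.dtype)
    p = 0
    for td in range(tD):
        for th in range(tH):
            for tw in range(tW):
                tile = data[:, td*m:td*m+4, th*m:th*m+4, tw*m:tw*m+4]
                V[:, p] = np.einsum('ai,bj,ck,rijk->rabc', BT, BT, BT, tile)
                p += 1

    # Pointwise stage: for each of the 4*4*4 = 64 transform points,
    # one (K x C) @ (C x P) matrix multiply -- the batched-GEMM core of WMFA.
    M = np.einsum('qrabc,rpabc->qpabc', U, V)

    # Inverse transform each 4x4x4 accumulator to a 2x2x2 output tile.
    Y = np.einsum('xa,yb,zc,qpabc->qpxyz', AT, AT, AT, M)

    # Scatter the output tiles back into the output volume.
    out = np.empty((K, oD, oH, oW), dtype=data.dtype)
    p = 0
    for td in range(tD):
        for th in range(tH):
            for tw in range(tW):
                out[:, td*m:td*m+m, th*m:th*m+m, tw*m:tw*m+m] = Y[:, p]
                p += 1
    return out
```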
Figure 1: The computing flow of 3D WMFA.
Figure 2: For an input matrix of size (M, N · α · α · α), where α is the tile size (equal to 4 here), the reshape kernel generates many small submatrices with new layouts.
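We read Figure 2 as describing the regrouping of one large (M, N · α³) matrix into α³ small (M, N) submatrices, one per transform point, so that each can be fed to its own GEMM. A minimal CPU model of that reshape (the sizes, names, and element order are illustrative assumptions; the actual GPU layout is the paper's):

```python
import numpy as np

# M rows (e.g., channels); for each of N tiles, its alpha^3 transformed
# values are assumed to be stored contiguously along the row.
M_rows, N_tiles, alpha = 64, 1000, 4
big = np.random.rand(M_rows, N_tiles * alpha**3).astype(np.float32)

# Regroup into alpha^3 = 64 submatrices of shape (M, N), one per
# transform point, so each submatrix feeds one GEMM.
sub = big.reshape(M_rows, N_tiles, alpha**3).transpose(2, 0, 1)
assert sub.shape == (alpha**3, M_rows, N_tiles)  # sub[t] is the t-th (M, N) matrix
```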
Table 1: Properties of the GeForce GTX 1080.
| Parameters | Values |
|---|---|
| CUDA capability major/minor version number | 6.1 |
| Total amount of global memory | 8 GB |
| CUDA cores | 2560 |
| L2 cache size | 2 MB |
| Warp size | 32 |
| Total number of registers available per block | 64 KB |
Table 2: Convolution layers of the 3D network. The filter size in all layers is 3 × 3 × 3; the GFLOPS column gives the number of floating-point operations in each convolutional layer, assuming a batch size of 32. A worked check of these values follows the table.
| Layer | Input size (C × D × H × W × N) | Filters (K) | GFLOPS |
|---|---|---|---|
| conv1 | 3 × 16 × 112 × 112 × 32 | 32 | 16.65 |
| conv2 | 32 × 16 × 56 × 56 × 32 | 64 | 88.8 |
| conv3 | 64 × 8 × 28 × 28 × 32 | 256 | 88.8 |
| conv4 | 256 × 4 × 14 × 14 × 32 | 256 | 44.4 |
| conv5 | 256 × 2 × 7 × 7 × 32 | 256 | 5.55 |
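The GFLOPS column can be reproduced by counting one multiply-accumulate as a single operation: FLOPs = C · K · 3³ · D · H · W · N, with output dimensions equal to input dimensions ("same" padding is our assumption; it matches the table's values). A quick arithmetic check:

```python
# Layer shapes from Table 2: (C_in, D, H, W, N_batch, K_filters).
layers = {
    'conv1': (3,   16, 112, 112, 32, 32),
    'conv2': (32,  16,  56,  56, 32, 64),
    'conv3': (64,   8,  28,  28, 32, 256),
    'conv4': (256,  4,  14,  14, 32, 256),
    'conv5': (256,  2,   7,   7, 32, 256),
}
for name, (C, D, H, W, N, K) in layers.items():
    gflops = C * K * 27 * D * H * W * N / 1e9   # 27 = 3*3*3 filter taps
    print(f'{name}: {gflops:.2f} GFLOPS')
# prints 16.65, 88.79, 88.79, 44.39, 5.55 -- matching the table to rounding
```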
Figure 3: Speedup with different optimizations on 3D convolution layers.
Figure 4: Time percentage distribution of each kernel in each implementation version for a specific convolution layer.
Figure 5: Execution time of different methods on 3D convolution layers.
Table 3: Performance of cuDNN SGEMM versus the 3D WMFA on 3D convolution layers, measured in effective TFLOPS.
| Layer | Input size (C × D × H × W × N) | Filters (K) | cuDNN SGEMM (TFLOPS) | 3D WMFA (TFLOPS) | Speedup |
|---|---|---|---|---|---|
| conv2 | 32 × 16 × 56 × 56 × 32 | 64 | 1.21 | 1.28 | 1.05 |
| conv3 | 64 × 8 × 28 × 28 × 32 | 256 | 2.38 | 3.31 | 1.39 |
| conv4 | 256 × 4 × 14 × 14 × 32 | 256 | 2.4 | 4.72 | 1.96 |
| conv5 | 256 × 2 × 7 × 7 × 32 | 256 | 1.46 | 2.1 | 1.44 |
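The Speedup column appears to be the ratio of the two TFLOPS figures; a quick check of the table above (in two cases the table seems to truncate rather than round):

```python
# Speedup = 3D WMFA TFLOPS / cuDNN SGEMM TFLOPS, values from Table 3.
for layer, cudnn, wmfa in [('conv2', 1.21, 1.28), ('conv3', 2.38, 3.31),
                           ('conv4', 2.40, 4.72), ('conv5', 1.46, 2.10)]:
    print(f'{layer}: {wmfa / cudnn:.2f}x')
# prints 1.06x, 1.39x, 1.97x, 1.44x -- the table's 1.05, 1.39, 1.96, 1.44
# agree up to truncation vs. rounding
```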