Baohan Xu, Hao Ye, Yingbin Zheng, Heng Wang, Tianyu Luwang, Yu-Gang Jiang.
Abstract
The ability to recognize actions throughout a video is essential for surveillance, self-driving, and many other applications. Although many researchers have investigated deep neural networks to improve video action recognition, these networks usually require a large amount of well-labeled data to train. In this paper, we introduce a dense dilated network that collects action information from the snippet level to the global level. The dense dilated network is composed of blocks of densely connected dilated convolutional layers. Our proposed framework fuses the outputs of each layer to learn high-level representations, and these representations are robust even with only a few training snippets. We study different configurations for fusing the spatial and temporal modalities and introduce a novel temporal guided fusion on top of the dense dilated network, which further boosts performance. We conduct extensive experiments on two popular video action datasets, UCF101 and HMDB51. The experiments demonstrate the effectiveness of our proposed framework.
Year: 2019 PMID: 31135363 DOI: 10.1109/TIP.2019.2917283
Source DB: PubMed Journal: IEEE Trans Image Process ISSN: 1057-7149 Impact factor: 10.856
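The abstract describes blocks of densely connected dilated convolutional layers whose per-layer outputs are fused into a video-level representation. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The kernel size, dilation rates (1, 2, 4), growth rate, random weights, ReLU activation, mean pooling over time, and the names `dilated_conv1d`, `dense_dilated_block`, and `make_weights` are all assumptions made for illustration; the abstract specifies none of them.

```python
import numpy as np


def dilated_conv1d(x, weight, dilation):
    """Temporal dilated convolution with zero 'same' padding.

    x: (T, C_in) array of snippet-level features.
    weight: (K, C_in, C_out) array, K odd (assumed kernel layout).
    """
    k = weight.shape[0]
    pad = dilation * (k // 2)
    t = x.shape[0]
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((t, weight.shape[2]))
    for j in range(k):
        # Tap j reads the input shifted by j * dilation time steps.
        out += padded[j * dilation : j * dilation + t] @ weight[j]
    return out


def dense_dilated_block(x, weights, dilations):
    """Each layer sees the input plus all earlier layer outputs (dense links)."""
    features = [x]
    for w, d in zip(weights, dilations):
        layer_in = np.concatenate(features, axis=1)
        features.append(np.maximum(dilated_conv1d(layer_in, w, d), 0.0))  # ReLU
    # Fuse the input and the output of every layer by concatenation (assumed).
    return np.concatenate(features, axis=1)


def make_weights(c_in, growth, num_layers, kernel=3, seed=0):
    """Random placeholder weights; a real model would learn these."""
    rng = np.random.default_rng(seed)
    return [
        rng.standard_normal((kernel, c_in + i * growth, growth)) * 0.1
        for i in range(num_layers)
    ]


# 8 snippets with 16-dimensional features each (made-up sizes).
snippets = np.random.default_rng(1).standard_normal((8, 16))
weights = make_weights(c_in=16, growth=4, num_layers=3)
fused = dense_dilated_block(snippets, weights, dilations=(1, 2, 4))

# Pool over time to go from snippet-level to a global-level representation.
video_repr = fused.mean(axis=0)
print(fused.shape, video_repr.shape)  # (8, 28) (28,)
```

With growing dilation rates, later layers cover a wider temporal span without extra parameters per layer, which is one plausible reading of how such a block moves from snippet-level to global-level context.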