| Literature DB >> 34152986 |
Wenfei Yang, Tianzhu Zhang, Zhendong Mao, Yongdong Zhang, Qi Tian, Feng Wu.
Abstract
Weakly supervised temporal action detection has better scalability and practicability than fully supervised action detection in reality deployment. However, it is difficult to learn a robust model without temporal action boundary annotations. In this paper, we propose an en-to-end Multi-Scale Structure-Aware Network (MSA-Net) for weakly supervised temporal action detection by exploring both the global structure information of a video and the local structure information of actions. The proposed SA-Net enjoys several merits. First, to localize actions with different durations, each video is encoded into feature representations with different temporal scales. Second, based on the multi-scale feature representation, the proposed model has designed two effective structure modeling mechanisms including global structure modeling and local structure modeling, which can effectively learn discriminative structure aware representations for robust and complete action detection. To the best of our knowledge, this is the first work to fully explore the global and local structure information in a unified deep model for weakly supervised action detection. And extensive experimental results on two benchmark datasets demonstrate that the proposed MSA-Net performs favorably against state-of-the-art methods.Year: 2021 PMID: 34152986 DOI: 10.1109/TIP.2021.3089361
Source DB: PubMed Journal: IEEE Trans Image Process ISSN: 1057-7149 Impact factor: 10.856