Fengda Zhao, Jiuhan Zhao, Xianshan Li, Yinghui Zhang, Dingding Guo, Wenbai Chen.
Abstract
Analyzing and understanding human actions in long-range videos has promising applications, such as video surveillance, autonomous driving, and efficient human-computer interaction. Most research focuses on short-range videos, either predicting a single action in an ongoing video or forecasting an action a few seconds before it occurs. In this work, a novel method is proposed to forecast a series of actions and their durations after observing a partial video. The method extracts features from both frame sequences and label sequences. A retentive memory module is introduced to extract rich features at salient time steps and pivotal channels. Extensive experiments are conducted on the Breakfast and 50 Salads data sets. Compared with state-of-the-art methods, the method achieves comparable performance in most cases.
Year: 2022 PMID: 35615551 PMCID: PMC9126708 DOI: 10.1155/2022/4260247
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1: The architecture of 2S-RLSTM. Given a frame sequence and a label sequence, 2S-RLSTM predicts a series of actions and their durations iteratively.
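The iterative prediction described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `predict_next` callable is a hypothetical stand-in for the trained network, assumed here to return the next action label, the remaining length of the current action, and the length of the next action. The sketch also assumes that the remaining length is only used on the first step and that each predicted action is fed back as if fully observed.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, int]  # (action label, length in frames)
# Hypothetical predictor: history -> (next label, remaining length, next length)
Predictor = Callable[[List[Segment]], Tuple[str, int, int]]


def anticipate(observed: List[Segment], predict_next: Predictor,
               horizon: int) -> List[Segment]:
    """Roll the predictor forward until `horizon` future frames are covered."""
    history = list(observed)
    future: List[Segment] = []
    first_step = True
    while sum(length for _, length in future) < horizon:
        next_label, remaining, next_length = predict_next(history)
        if first_step and remaining > 0:
            # Finish the partially observed current action first.
            label, length = history[-1]
            history[-1] = (label, length + remaining)
            future.append((label, remaining))
        first_step = False
        # Guard against a zero-length prediction that would stall the loop.
        step = (next_label, max(1, next_length))
        history.append(step)  # feed the prediction back as the new input
        future.append(step)
    # Clip the final segment so the forecast covers exactly `horizon` frames.
    clipped: List[Segment] = []
    budget = horizon
    for label, length in future:
        if budget <= 0:
            break
        take = min(length, budget)
        clipped.append((label, take))
        budget -= take
    return clipped


def toy_predictor(history: List[Segment]) -> Tuple[str, int, int]:
    """Toy stand-in for the network: cycles a -> b -> c with fixed lengths."""
    cycle = {"a": "b", "b": "c", "c": "a"}
    return cycle[history[-1][0]], 2, 3
```

With the toy predictor, `anticipate([("a", 4)], toy_predictor, 7)` first completes the current action and then appends predicted actions until seven future frames are covered. A real model would replace `toy_predictor` and would also consume frame features, which this sketch omits.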
Figure 2: Training samples are generated by cutting each action segmentation at a random split point. Each input sequence is a matrix in which each row is a pair: the label and the length of an observed action. Each target vector is a triple: the label of the next action, the remaining length of the current action, and the length of the next action before the cut line.
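The sample generation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `make_training_sample`, the `(label, length)` segment representation, and the choice to measure lengths in frames are all hypothetical. The sketch also assumes the split point falls inside a non-final action so that a next action always exists.

```python
import random
from typing import List, Tuple

Segment = Tuple[str, int]  # (action label, length in frames)


def make_training_sample(
    segments: List[Segment], rng: random.Random
) -> Tuple[List[Segment], Tuple[str, int, int]]:
    """Cut an annotated video at a random split point.

    Returns (inputs, target), where inputs holds the observed (label, length)
    pairs and target is (next label, remaining length of the current action,
    length of the next action).
    """
    if len(segments) < 2:
        raise ValueError("need at least two actions to form a sample")
    # Pick the action that contains the cut; the last action is excluded
    # so that a following action always exists.
    i = rng.randrange(len(segments) - 1)
    label, length = segments[i]
    # Observe between 1 frame and the whole of the current action.
    observed = rng.randint(1, length)
    inputs = list(segments[:i]) + [(label, observed)]
    next_label, next_length = segments[i + 1]
    target = (next_label, length - observed, next_length)
    return inputs, target


video = [("take_cup", 10), ("pour_coffee", 5), ("stir", 7)]
inputs, target = make_training_sample(video, random.Random(0))
```

Calling the function repeatedly on the same video yields different split points, so a single annotated sequence can supply many training samples.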
Figure 3: The architecture of the memory neural network. The features c are extracted in the compose operation after all time steps have been analyzed.
Dense action anticipation performance comparison on the Breakfast data set.
| Observation (%) | Prediction (%) | 2S-RLSTM | CNN | RNN | Grammar |
|---|---|---|---|---|---|
| 20 | 10 | 65.04 | 57.59 | 60.35 | 48.92 |
| 20 | 20 | 52.67 | 49.12 | 50.44 | 40.33 |
| 20 | 30 | 50.42 | 44.03 | 45.28 | 36.24 |
| 20 | 50 | 45.42 | 39.26 | 40.02 | 31.46 |
| 30 | 10 | 65.44 | 60.23 | 61.45 | 52.66 |
| 30 | 20 | 54.59 | 50.14 | 50.25 | 42.15 |
| 30 | 30 | 49.24 | 45.18 | 44.90 | 38.44 |
| 30 | 50 | 46.03 | 40.51 | 41.75 | 33.09 |
Dense action anticipation performance comparison on the 50 Salads data set.
| Observation (%) | Prediction (%) | 2S-RLSTM | CNN | RNN | Grammar |
|---|---|---|---|---|---|
| 20 | 10 | 46.67 | 36.08 | 42.30 | 28.69 |
| 20 | 20 | 33.32 | 27.62 | 31.19 | 21.65 |
| 20 | 30 | 31.14 | 21.43 | 25.22 | 18.32 |
| 20 | 50 | 19.76 | 15.48 | 16.82 | 10.37 |
| 30 | 10 | 39.96 | 37.36 | 44.19 | 26.71 |
| 30 | 20 | 27.40 | 24.78 | 29.51 | 14.59 |
| 30 | 30 | 21.23 | 20.78 | 19.96 | 11.69 |
| 30 | 50 | 10.03 | 14.05 | 10.38 | 9.25 |
Comparison of different architectures that are composed of different components.
| Observation (%) | Prediction (%) | Baseline | 2S-LSTM | L-CLSTM | L-RLSTM |
|---|---|---|---|---|---|
| 20 | 10 | 54.26 | 59.45 | 57.24 | 58.35 |
| 20 | 20 | 43.96 | 48.98 | 45.24 | 47.38 |
| 20 | 30 | 41.86 | 46.19 | 42.81 | 46.01 |
| 20 | 50 | 40.96 | 44.75 | 41.56 | 43.46 |
| 30 | 10 | 56.79 | 63.24 | 57.26 | 61.25 |
| 30 | 20 | 51.87 | 52.69 | 52.67 | 53.61 |
| 30 | 30 | 46.14 | 47.26 | 46.64 | 47.63 |
| 30 | 50 | 42.69 | 44.85 | 43.79 | 44.51 |