Jongkwang Hong, Bora Cho, Yong Won Hong, Hyeran Byun.
Abstract
In action recognition research, the two primary types of information are appearance and motion, learned from RGB images captured by visual sensors. However, depending on the characteristics of the action, contextual information, such as the presence of specific objects or globally shared information in the image, becomes vital to defining the action. For example, the presence of a ball is vital information for distinguishing "kicking" from "running". Furthermore, some actions share typical global abstract poses, which can serve as a key to classifying actions. Based on these observations, we propose a multi-stream network model that incorporates spatial, temporal, and contextual cues in the image for action recognition. We evaluated the proposed method using C3D or inflated 3D ConvNet (I3D) as a backbone network on two different action recognition datasets. As a result, we observed an overall improvement in accuracy, demonstrating the effectiveness of the proposed method.
Keywords: action recognition; contextual information; multi-stream fusion
Year: 2019 PMID: 30897792 PMCID: PMC6471330 DOI: 10.3390/s19061382
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
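The abstract describes a multi-stream model whose spatial, temporal, and contextual streams are combined for a final prediction. A minimal sketch of the general idea is late score fusion: each stream produces per-class scores and the streams' softmax outputs are averaged. Uniform averaging and the function names below are illustrative assumptions, not the paper's exact fusion scheme.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(stream_logits: list) -> np.ndarray:
    """Average the softmax scores of all streams (uniform weights assumed)."""
    probs = [softmax(l) for l in stream_logits]
    return np.mean(probs, axis=0)

# Toy example: three streams (e.g. RGB, flow, context), four classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=4) for _ in range(3)]
fused = fuse_streams(logits)
pred = int(np.argmax(fused))
```

Weighted averaging or a learned fusion layer would be drop-in alternatives; the paper's tables report only the fused accuracies, not the weighting.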
Figure 1. Examples from the UCF Sports [11] dataset, each row representing the image sequence of a video. (a) RGB images from the “Kicking” class. (b) Optical flow images corresponding to (a). (c) RGB images from the “Run” class. (d) Optical flow images corresponding to (c).
Figure 2. Overall architecture of the proposed method.
Figure 3. Example of RGB images with corresponding pairwise inputs. (a) “HorseRiding”. (b) “SoccerJugglings”. (c) “HorseRiding”. (d) “SoccerJugglings”.
Figure 4. Details of the pairwise stream inputs; the images in each row are from the same video at different points in time. Solid red boxes mark the “actors” of the action. Dotted red boxes mark a “person” who is regarded as an “object”. Solid blue boxes mark the “object” of the “actors”. (a) “Biking”. (b) “SoccerJugglings”. (c) “Basketball”. (d) “IceDancing”.
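Figure 4 suggests that the pairwise stream input keeps only the actor and object regions of a frame. A plausible sketch of such an input, assuming axis-aligned detection boxes, is to zero out every pixel outside the actor/object boxes; the box coordinates and masking policy here are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def pairwise_input(frame: np.ndarray, boxes: list) -> np.ndarray:
    """Keep pixels inside the (x1, y1, x2, y2) actor/object boxes, zero the rest."""
    out = np.zeros_like(frame)
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]
    return out

# Toy 8x8 white RGB frame with an "actor" box and an "object" box.
frame = np.full((8, 8, 3), 255, dtype=np.uint8)
masked = pairwise_input(frame, [(1, 1, 3, 3), (5, 5, 7, 7)])
```

A segmentation-mask variant (cf. the Bounding Box vs. Mask columns in the pairwise table below) would replace the rectangular crop with a per-pixel instance mask.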
Figure 5. Example of an RGB image with corresponding keypoint inputs. (a) “Basketball”. (b) “Biking”. (c) “IceDancing”. (d) “HandstandPushups”.
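Figure 5 shows keypoint-based inputs for the pose stream. A common way to feed body joints to a ConvNet, sketched below under the assumption of one Gaussian heatmap channel per joint, is to render each (x, y) keypoint as a small Gaussian peak; the Gaussian width and channel layout are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def keypoint_heatmaps(keypoints, height, width, sigma=1.5):
    """keypoints: list of (x, y) joint coordinates.
    Returns a (num_joints, height, width) stack of Gaussian heatmaps,
    each peaking at 1.0 on its joint location."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for x, y in keypoints:
        d2 = (xs - x) ** 2 + (ys - y) ** 2
        maps.append(np.exp(-d2 / (2 * sigma ** 2)))
    return np.stack(maps)

# Toy example: two joints on an 8x8 canvas.
hm = keypoint_heatmaps([(2, 3), (5, 5)], height=8, width=8)
```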
Pairwise stream performance comparison.

| Accuracy (%) | Bounding Box | Mask |
|---|---|---|
| Pairwise stream | 49.80 | |
| Fusion | 97.83 | |
The results of using C3D [12] and inflated 3D ConvNet (I3D) [13] as backbone networks on the UCF-101 and HMDB-51 datasets.

| Method | UCF-101 | HMDB-51 |
|---|---|---|
| C3D-RGB—(our implementation) | 84.17 | - |
| C3D-Pose | 80.35 | - |
| C3D-Pairwise | 79.87 | - |
| C3D-(RGB and Pairwise and Pose) | | - |
| I3D-RGB | 94.69 | 74.84 |
| I3D-Flow | 94.14 | 77.52 |
| I3D-Pose | 69.15 | 51.57 |
| I3D-Pairwise | 76.02 | 51.83 |
| I3D-(RGB and Flow)—(our implementation) | 97.33 | 80.07 |
| I3D-(RGB and Flow and Pairwise) | 97.46 | 80.33 |
| I3D-(RGB and Flow and Pose) | 97.89 | 80.85 |
| I3D-(RGB and Flow and Pairwise and Pose) | | |
Per-class accuracy on the UCF-101 dataset for the baseline (I3D using RGB and optical flow) and the proposed method.
| Class | Baseline | Proposed (Improved) |
|---|---|---|
| HandstandPushups | 82.14 | 96.43 (+14.29) |
| HandstandWalking | 82.35 | 91.18 (+8.82) |
| CricketShot | 89.80 | 95.92 (+6.12) |
| FrontCrawl | 91.89 | 97.30 (+5.41) |
| Punch | 89.74 | 94.87 (+5.13) |
| Shotput | 93.48 | 97.83 (+4.35) |
| BoxingPunchingBag | 73.47 | 77.55 (+4.08) |
| PullUps | 96.43 | 100.00 (+3.57) |
| BodyWeightSquats | 96.67 | 100.00 (+3.33) |
| HammerThrow | 82.83 | 85.86 (+3.03) |
| FloorGymnastics | 91.67 | 94.44 (+2.78) |
| WalkingWithDog | 94.44 | 97.22 (+2.78) |
| Archery | 95.12 | 97.56 (+2.44) |
| SoccerPenalty | 97.56 | 100.00 (+2.44) |
| BaseballPitch | 90.70 | 93.02 (+2.33) |
| PlayingCello | 97.73 | 100.00 (+2.27) |
Comparison with other models.

| Model | UCF-101 | HMDB-51 |
|---|---|---|
| LSTM (as reported in [ ]) | 86.8 | 49.7 |
| 3D-ConvNet (as reported in [ ]) | 79.9 | 49.4 |
| Convolutional Two-Stream Network [ ] | 90.4 | 58.63 |
| 3D-Fused (as reported in [ ]) | 91.5 | 66.5 |
| Temporal Segment Networks [ ] | 93.5 | - |
| Spatiotemporal Multiplier Networks [ ] | 94.0 | 69.02 |
| Two-Stream I3D [ ] | 97.6 | |
| Multi-stream I3D (Proposed) | | 80.92 |
| LSTM | 91.0 | 53.4 |
| Two-Stream | 94.2 | 66.6 |
| 3D-Fused | 94.2 | 71.0 |
| Two-Stream I3D | 98.0 | 81.2 |