Xiang Yan, Syed Zulqarnain Gilani, Mingtao Feng, Liang Zhang, Hanlin Qin, Ajmal Mian.
Abstract
Detecting key frames in videos is a common problem in many applications such as video classification, action recognition and video summarization. These tasks can be performed more efficiently using only a handful of key frames rather than the full video. Existing key frame detection approaches are mostly designed for supervised learning and require manual labelling of key frames in a large corpus of training data to train the models. Labelling requires human annotators from different backgrounds to annotate key frames in videos, which is not only expensive and time-consuming but also prone to subjective errors and inconsistencies between the labelers. To overcome these problems, we propose an automatic self-supervised method for detecting key frames in a video. Our method comprises a two-stream ConvNet and a novel automatic annotation architecture able to reliably annotate key frames in a video for self-supervised learning of the ConvNet. The proposed ConvNet learns deep appearance and motion features to detect frames that are unique. The trained network is then able to detect key frames in test videos. Extensive experiments on the UCF101 human action dataset and the VSUMM video summarization dataset demonstrate the effectiveness of our proposed method.
Keywords: convolutional networks; key frames; self-supervised learning; two-stream ConvNets
Year: 2020 PMID: 33291759 PMCID: PMC7731244 DOI: 10.3390/s20236941
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Conceptual overview of our approach. The proposed method automatically annotates key frames as the discriminant frames in a video, avoiding time-consuming and subjective manual labelling. Discriminant analysis is performed on the frame-wise features extracted using a pretrained CNN. This example is of “Cricket Shot” with 92 frames. Our method marks four frames as key frames. Note that these four key frames can still describe the class of the human action.
Figure 2. An overview of the deep key frame detection framework. The appearance network operates on RGB frames, while the motion network operates on the optical flow represented as images. The feature maps from the appearance and motion ConvNets are aggregated to form a spatio-temporal feature representation.
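As a rough illustration of this two-stream aggregation, the sketch below fuses appearance and motion features by concatenation and regresses a per-frame score. This is our reading of the caption, not the authors' released code; the backbone choice (VGG-16 up to fc7), layer sizes and the scoring head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamKeyFrameNet(nn.Module):
    """Illustrative two-stream network: one VGG-16 backbone for RGB frames,
    one for optical-flow images, fused into a spatio-temporal feature."""
    def __init__(self):
        super().__init__()
        self.appearance = models.vgg16(weights=None)
        self.motion = models.vgg16(weights=None)
        # Drop the final classification layer so each stream outputs
        # the 4096-d fc7 feature.
        self.appearance.classifier = nn.Sequential(
            *list(self.appearance.classifier.children())[:-1])
        self.motion.classifier = nn.Sequential(
            *list(self.motion.classifier.children())[:-1])
        # Hypothetical per-frame scoring head on the fused 8192-d feature.
        self.score = nn.Sequential(
            nn.Linear(8192, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, rgb, flow):
        # rgb, flow: (batch, 3, 224, 224); flow rendered as an image.
        f_app = self.appearance(rgb)           # (batch, 4096)
        f_mot = self.motion(flow)              # (batch, 4096)
        f = torch.cat([f_app, f_mot], dim=1)   # spatio-temporal feature
        return self.score(f).squeeze(1)        # per-frame key-frame score
```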
Figure 3. Block diagram of automatically generating labels to train the deep key frame annotation framework. The appearance and motion information (the outputs of the fc7 layer of VGG-16 [58]) are concatenated into a spatio-temporal feature vector. Linear discriminant analysis (LDA) is then applied to the feature vectors of all classes of training videos to project them into a low-dimensional feature space (the LDA space) and obtain the projection vectors for each class of videos. Finally, the projection vectors are used to calculate the frame-level video labels.
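A minimal sketch of this labelling stage, assuming pre-extracted fc7 features and scikit-learn's LDA. The scoring rule here, rating each frame by its closeness to its video's mean in the LDA space, is our plausible reading of how the projection vectors yield frame-level labels, not the paper's exact formulation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def frame_level_labels(features, frame_class_ids, frame_video_ids):
    """features: (n_frames, 8192) concatenated appearance+motion fc7 vectors;
    frame_class_ids: action class of each frame's video;
    frame_video_ids: which video each frame belongs to."""
    # Project all training-frame features into the discriminant (LDA) space.
    lda = LinearDiscriminantAnalysis()
    z = lda.fit_transform(features, frame_class_ids)  # (n_frames, n_classes-1)

    labels = np.zeros(len(features))
    for vid in np.unique(frame_video_ids):
        idx = frame_video_ids == vid
        # Projection vector for this video: mean of its frames in LDA space.
        centre = z[idx].mean(axis=0)
        # Assumed scoring: frames nearest the projection vector (i.e. the
        # most class-discriminant ones) receive the highest labels.
        d = np.linalg.norm(z[idx] - centre, axis=1)
        labels[idx] = 1.0 - d / (d.max() + 1e-8)
    return labels
```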
Figure 4. Key frame labelling score results and their respective automatically annotated ground truth. The blue curve represents the ground-truth annotation scores, while the red curve represents the labelling results from our two-stream network. The temporal locations of ground-truth and detected key frames are shown as blue and purple solid circles, respectively.
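How discrete key frames are read off such a score curve is not spelled out in the caption; one plausible post-processing step (our assumption, including the threshold and spacing parameters) is simple peak picking on the per-frame scores:

```python
import numpy as np
from scipy.signal import find_peaks

def pick_key_frames(scores, min_gap=10, min_score=0.5):
    """Select local maxima of the per-frame labelling scores as key frames.
    min_gap keeps detections temporally separated; min_score drops weak peaks."""
    peaks, _ = find_peaks(scores, distance=min_gap, height=min_score)
    return peaks  # frame indices of detected key frames

# Example: a 92-frame score curve typically yields a handful of peaks.
scores = np.random.rand(92)
print(pick_key_frames(scores))
```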
Our technique can detect more than one key frame in a video. The table shows the average difference between the number of key frames detected and the automatically annotated ground truth, as well as the positions of these key frames relative to the ground truth, for 20 classes of the UCF101 dataset (a sketch of how such errors can be computed follows the table).
Average Error in Key Frame Detection
| Class Name | Number Detected | Key Frame Position |
|---|---|---|
| Baseball Pitch | ±1.73 | ±0.455 |
| Basket. Dunk | ±1.64 | ±0.640 |
| Billiards | ±2.18 | ±1.300 |
| Clean and Jerk | ±2.27 | ±1.202 |
| Cliff Diving | ±2.45 | ±0.627 |
| Cricket Bowl. | ±2.45 | ±1.14 |
| Cricket Shot | ±1.27 | ±0.828 |
| Diving | ±2.00 | ±0.907 |
| Frisbee Catch | ±1.73 | ±0.546 |
| Golf Swing | ±2.45 | ±0.752 |
| Hamm. Throw | ±1.73 | ±1.223 |
| High Jump | ±1.73 | ±0.434 |
| Javelin Throw | ±2.45 | ±0.555 |
| Long Jump | ±2.27 | ±0.611 |
| Pole Vault | ±2.64 | ±1.139 |
| Shotput | ±2.00 | ±0.564 |
| Soccer Penalty | ±2.09 | ±0.712 |
| Tennis Swing | ±1.64 | ±0.554 |
| Throw Discus | ±2.09 | ±0.642 |
| Volley. Spike | ±1.54 | ±0.633 |
| Average error | ±2.02 | ±0.773 |
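The table's two metrics could be computed per video roughly as below; the nearest-frame matching rule is our assumption, since the paper's exact protocol is not reproduced in this entry.

```python
import numpy as np

def key_frame_errors(detected, ground_truth):
    """Error in key-frame count and position for one video.
    detected, ground_truth: sorted arrays of key-frame indices."""
    # Error in the number of key frames detected.
    count_error = abs(len(detected) - len(ground_truth))
    # Assumed position error: distance from each ground-truth key frame
    # to its nearest detection, averaged over the video.
    position_errors = [np.min(np.abs(detected - g)) for g in ground_truth]
    return count_error, float(np.mean(position_errors))

# Example: two detections against three annotated key frames.
det = np.array([10, 55])
gt = np.array([12, 50, 80])
print(key_frame_errors(det, gt))  # (1, mean nearest-frame distance)
```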
Figure 5. Examples of key frames detected, along with the automatically annotated ground truth. Note that an empty red box denotes a missing key frame in the detected frames, which implies that our method has detected more key frames than the automatically annotated ground truth. It can be observed that the key frames detected by our method are very similar to the ground truth. (Best viewed in colour.)
Comparison of our proposed key frame detection approach with several state-of-the-art methods (F-score, %).
| Method | F-Score (%) |
|---|---|
| VSUMM [ ] | 67 |
| DDC [ ] | 71 |
| Gong et al. [ ] | 60.3 |
| Zhang et al. [ ] | 61.0 |
| SUM-GAN [ ] | 62.5 |
| Fu et al. [ ] | 69.7 |
| AVS [ ] | 66.2 |
| Ours | 72.1 |
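F-scores of this kind are conventionally derived from precision and recall over matched key frames. A minimal sketch, under the assumed convention that a detection within a small temporal tolerance of an unmatched ground-truth key frame counts as correct (the tolerance value and greedy matching are our assumptions, not the paper's protocol):

```python
import numpy as np

def key_frame_f_score(detected, ground_truth, tolerance=15):
    """F-score with a temporal tolerance (in frames)."""
    remaining = list(ground_truth)
    tp = 0
    for d in detected:
        if not remaining:
            break
        # Greedily match each detection to the nearest unmatched ground truth.
        j = int(np.argmin([abs(d - g) for g in remaining]))
        if abs(d - remaining[j]) <= tolerance:
            tp += 1
            remaining.pop(j)
    precision = tp / max(len(detected), 1)
    recall = tp / max(len(ground_truth), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(key_frame_f_score([10, 52, 81], [12, 50, 80]))  # 1.0 within tolerance
```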
Figure 6. Examples of key frame detection on the VSUMM dataset, along with the ground-truth key frames.