Baojun Zhao, Boya Zhao, Linbo Tang, Yuqi Han, Wenzheng Wang.
Abstract
With the development of deep neural networks, many object detection frameworks have achieved great success in smart surveillance, self-driving cars, and facial recognition. However, the data sources are usually videos, whereas most object detection frameworks are built on still images and use only spatial information; because the training procedure discards temporal information, feature consistency across frames cannot be ensured. To address this problem, we propose a single, fully convolutional neural network-based object detection framework that incorporates temporal information by using Siamese networks. In the training procedure, the prediction network first combines multiscale feature maps to handle objects of various sizes. Second, we introduce a correlation loss by using the Siamese network, which provides the features of neighboring frames. This correlation loss represents object co-occurrences across time and aids consistent feature generation. Since the correlation loss requires both track ID and detection label information, our video object detection network is evaluated on the large-scale ImageNet VID dataset, where it achieves a 69.5% mean average precision (mAP).
Keywords: Siamese network; deep neural network; multiscale feature representation; temporal information; video object detection
Year: 2018 PMID: 29510529 PMCID: PMC5876594 DOI: 10.3390/s18030774
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The architecture of the proposed method. The orange part is the training procedure, which uses neighboring frames. The green part is the testing procedure for frame n.
Figure 2. The architecture of VGG16, showing the convolutional and pooling layers. As the feed-forward procedure proceeds, the size of the feature map decreases. s and p refer to the stride and padding size, respectively.
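The shrinking feature map mentioned in the Figure 2 caption follows the usual output-size rule for convolution and pooling layers. The sketch below is only illustrative: the function name `output_size`, the 300-pixel input, and the specific layer settings are assumptions and are not taken from the paper.

```python
import math

def output_size(n, kernel, stride=1, padding=0, ceil_mode=False):
    """Spatial size after a conv/pool layer: floor or ceil of (n + 2p - k) / s, plus 1."""
    span = (n + 2 * padding - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1

# A 3x3 convolution with s=1, p=1 keeps the spatial size (assumed 300-pixel input).
same = output_size(300, kernel=3, stride=1, padding=1)

# Each 2x2 pooling with s=2 roughly halves the size; ceil rounding is assumed here.
sizes = [300]
for _ in range(3):
    sizes.append(output_size(sizes[-1], kernel=2, stride=2, ceil_mode=True))

print(same, sizes)
```

With floor rounding instead of ceil rounding, the last pooling step would give 37 rather than 38, which is why the rounding mode matters for the final feature map size.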
Details of the multiscale feature representation.
| Stage | Conv Kernel Size | Feature Map Size | Usage | Options |
|---|---|---|---|---|
| Conv4_3 | | | Detection for scale 1 | |
| Conv6_1 | | | Enlarge receptive field | |
| Conv6_2 | | | Detection for scale 2 | |
| Conv7_1 | | | Reduce channels | |
| Conv7_2 | | | Detection for scale 3 | |
| Conv8_1 | | | Reduce channels | |
| Conv8_2 | | | Detection for scale 4 | |
| Conv9_1 | | | Reduce channels | |
| Conv9_2 | | | Detection for scale 5 | |
| Conv10_1 | | | Reduce channels | |
| Conv10_2 | | | Detection for scale 6 | |
Figure 3. Anchor generation flow. For each pixel on the feature map, six anchor shapes are generated with the scales of 0.2 and 0.27; they share the same center, which is the pixel center. In the figure, the center is (0.5, 0.5).
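The anchor placement described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function `generate_anchors`, the choice of aspect ratios (2, 1/2, 3, 1/3), the split of the six shapes into two squares plus four non-square boxes, and the 19 × 19 example map are all assumptions.

```python
def generate_anchors(fm_size, shapes):
    """Return (cx, cy, w, h) anchors in normalized [0, 1] image coordinates."""
    anchors = []
    for row in range(fm_size):
        for col in range(fm_size):
            # Every shape at this pixel shares the pixel center.
            cx = (col + 0.5) / fm_size
            cy = (row + 0.5) / fm_size
            for w, h in shapes:
                anchors.append((cx, cy, w, h))
    return anchors

# Hypothetical shape set: two squares (scales 0.2 and 0.27) plus four
# non-square shapes at scale 0.2 with aspect ratios 2, 1/2, 3 and 1/3.
SCALE, EXTRA_SCALE = 0.2, 0.27
RATIOS = (2.0, 0.5, 3.0, 1.0 / 3.0)
shapes = [(SCALE, SCALE), (EXTRA_SCALE, EXTRA_SCALE)]
shapes += [(SCALE * r ** 0.5, SCALE / r ** 0.5) for r in RATIOS]

single = generate_anchors(1, shapes)   # 1x1 map: one pixel, center (0.5, 0.5)
grid = generate_anchors(19, shapes)    # assumed 19x19 map
print(len(single), len(grid))
```

On a 1 × 1 map the only center is (0.5, 0.5), as in the figure, and every additional pixel contributes six more anchors.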
Anchor details in multiscale feature representation.
| Feature | Feature Map Size | Anchor Height | Anchor Width | Number |
|---|---|---|---|---|
| Conv4_3 | | 0.0707, 0.0577, 0.1414 | 0.1414, 0.1732, 0.0707 | 8664 |
| Conv6_2 | | 0.1414, 0.1155, 0.2828 | 0.2828, 0.3464, 0.1414 | 2166 |
| Conv7_2 | | 0.2740, 0.2237, 0.5480 | 0.5480, 0.6712, 0.2740 | 600 |
| Conv8_2 | | 0.4066, 0.3320, 0.8132 | 0.8132, 0.9959, 0.4066 | 150 |
| Conv9_2 | | 0.5392, 0.4402, 1.0783 | 1.0783, 1.3207, 0.5392 | 54 |
| Conv10_2 | | 0.6718, 0.5485, 1.3435 | 1.3435, 1.6454, 0.6718 | 6 |
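The heights and widths in the anchor table are consistent with the common rule width = scale · √ratio and height = scale / √ratio. The sketch below checks this for the first two rows; the helper `anchor_shape` and the base scales 0.1 and 0.2 are inferred from the tabulated values and are assumptions, not statements from the text.

```python
import math

def anchor_shape(scale, ratio):
    """Width and height of an anchor with area scale**2 and width/height = ratio."""
    return scale * math.sqrt(ratio), scale / math.sqrt(ratio)

# Scales 0.1 and 0.2 are inferred from the table values, not given in the text.
for scale in (0.1, 0.2):
    for ratio in (2.0, 3.0, 0.5):
        w, h = anchor_shape(scale, ratio)
        print(f"scale={scale} ratio={ratio}: height={h:.4f} width={w:.4f}")
```

Under this rule every anchor at a given scale has the same area, and only its elongation changes with the aspect ratio.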
Figure 4. Confidence and bounding box prediction from an anchor. The red point is the center of the example anchor. Two kernels are applied for this anchor: one for the confidence prediction and one for the bounding box prediction. The red kernel is the confidence kernel and the blue kernel is the bounding box prediction kernel.
Details of the prediction kernel.
| Feature | Feature Map Size | Confidence Kernel | Location Kernel |
|---|---|---|---|
| Conv4_3 | | | |
| Conv6_2 | | | |
| Conv7_2 | | | |
| Conv8_2 | | | |
| Conv9_2 | | | |
| Conv10_2 | | | |
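In single-shot detectors of this kind, the number of output channels of the two prediction kernels usually follows directly from the number of anchors per pixel. The sketch below shows that bookkeeping under explicit assumptions: the helper `prediction_channels` is hypothetical, a background class is assumed, and the values 6 anchors and 30 classes are taken from the anchor description and the ImageNet VID class list. It does not reproduce the kernel sizes of the table.

```python
def prediction_channels(anchors_per_pixel, num_classes, include_background=True):
    """Output channels of the confidence and location kernels (SSD-style assumption)."""
    scores_per_anchor = num_classes + (1 if include_background else 0)
    confidence = anchors_per_pixel * scores_per_anchor
    location = anchors_per_pixel * 4  # four box offsets per anchor
    return confidence, location

# Six anchors per pixel and 30 object classes; the extra background class is assumed.
conf_ch, loc_ch = prediction_channels(anchors_per_pixel=6, num_classes=30)
print(conf_ch, loc_ch)
```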
Figure 5. Coordinate loss computation flow. The red and green points are the positive anchors of frame t and frame t + 1, respectively. A score heat map is computed by correlating a kernel taken from frame t, centered on the red point, with the feature map of frame t + 1, similar to tracking on frame t + 1. After the max point coordinate is obtained, the loss can be computed from the max point coordinate and the green point coordinate.
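The flow in the Figure 5 caption can be illustrated on a toy example. The sketch below is not the authors' implementation: it uses single-channel feature maps stored as nested lists, a 3 × 3 kernel, zero padding, and a Euclidean distance as the penalty between the two coordinates. All of these choices, and all function names, are assumptions made for illustration.

```python
import math

def correlation_heat_map(feat_t, feat_t1, center, half=1):
    """Correlate a (2*half+1)^2 patch of feat_t around `center` with feat_t1.

    `center` must lie at least `half` cells inside feat_t; feat_t1 is
    treated as zero outside its borders.
    """
    rows, cols = len(feat_t1), len(feat_t1[0])
    offsets = range(-half, half + 1)
    kernel = [[feat_t[center[0] + dr][center[1] + dc] for dc in offsets]
              for dr in offsets]
    heat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for i, dr in enumerate(offsets):
                for j, dc in enumerate(offsets):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        heat[r][c] += kernel[i][j] * feat_t1[rr][cc]
    return heat

def max_point(heat):
    """(row, col) of the largest heat-map score."""
    _, r, c = max((v, r, c) for r, row in enumerate(heat) for c, v in enumerate(row))
    return r, c

def coordinate_penalty(max_rc, target_rc):
    """Assumed penalty: Euclidean distance between the two coordinates."""
    return math.hypot(max_rc[0] - target_rc[0], max_rc[1] - target_rc[1])

def blob_map(size, center):
    """Toy single-channel feature map: a 3x3 block of ones around `center`."""
    grid = [[0.0] * size for _ in range(size)]
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            grid[center[0] + dr][center[1] + dc] = 1.0
    return grid

feat_t = blob_map(8, (3, 3))    # red point at (3, 3) in frame t
feat_t1 = blob_map(8, (4, 5))   # the object has moved in frame t + 1
found = max_point(correlation_heat_map(feat_t, feat_t1, (3, 3)))
print(found, coordinate_penalty(found, (4, 5)))
```

When the features of the two frames are consistent, the heat-map peak coincides with the positive anchor of frame t + 1 and the penalty is zero; inconsistent features move the peak away and increase the penalty.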
Figure 6. The number of ground truths in each class.
Figure 7. The object area characteristics of the ImageNet VID dataset.
Figure 8. The small object area characteristics of the ImageNet VID dataset.
Average precision of each class and mAP. The bold values in the table are the best results among single models for a certain class.
| Class | [ | [ | [ | [ | Baseline | Our | [ |
|---|---|---|---|---|---|---|---|
| Airplane | 64.5 | 82.1 | 72.7 | 79.3 | 81.2 | 83.7 | |
| Antelope | 71.4 | 78.4 | 75.5 | 73.2 | 73.5 | 85.7 | |
| Bear | 42.6 | 66.5 | 42.2 | 65.0 | 70.2 | 84.4 | |
| Bicycle | 36.4 | 65.6 | 39.5 | 67.2 | 72.3 | 74.5 | |
| Bird | 18.8 | 66.1 | 25 | 68 | 70.9 | 73.8 | |
| Bus | 62.4 | 77.2 | 64.1 | 76.8 | 78.6 | 75.7 | |
| Car | 37.3 | 52.3 | 36.3 | 49.2 | 50.1 | 57.1 | |
| Cattle | 47.6 | 49.1 | 51.1 | 61.2 | 63.8 | 58.7 | |
| Dog | 15.6 | 57.1 | 24.4 | 56.6 | 60.4 | 72.3 | |
| Domestic_cat | 49.5 | 72.0 | 48.6 | 72.6 | 70.1 | 69.2 | |
| Elephant | 66.9 | 68.1 | 65.6 | 71.6 | 78.9 | 80.2 | |
| Fox | 66.3 | 76.8 | 73.9 | 83.2 | 85.6 | 83.4 | |
| Giant_panda | 58.2 | 71.8 | 61.7 | 78.1 | 79.8 | 80.5 | |
| Hamster | 74.1 | 89.7 | 82.4 | 86.5 | 87.5 | 93.1 | |
| Horse | 25.5 | 65.1 | 30.8 | 66.8 | 73.5 | 84.2 | |
| Lion | 29 | 20.1 | 34.4 | 21.6 | 46.5 | 67.8 | |
| Lizard | 68.7 | 63.8 | 54.2 | 69.4 | 71.5 | 80.3 | |
| Monkey | 1.9 | 34.7 | 1.6 | 36.6 | 50.3 | 54.8 | |
| Motorcycle | 50.8 | 74.1 | 61.0 | 70.8 | 72.5 | 80.6 | |
| Rabbit | 34.2 | 45.7 | 36.6 | 51.4 | 59.1 | 63.7 | |
| Red_panda | 29.4 | 55.8 | 19.7 | 70.6 | 67.8 | 85.7 | |
| Sheep | 59.0 | 54.1 | 55.0 | 38.7 | 40.0 | 60.5 | |
| Snake | 43.7 | 57.2 | 38.9 | 61.2 | 59.2 | 72.9 | |
| Squirrel | 1.8 | 29.8 | 2.6 | 42.3 | 83.4 | 52.7 | |
| Tiger | 33.0 | 81.5 | 42.8 | 76.8 | 78.1 | 89.7 | |
| Train | 56.6 | 72.0 | 54.6 | 69.3 | 71.2 | 81.3 | |
| Turtle | 66.1 | 74.4 | 66.1 | 72.9 | 74.5 | 73.7 | |
| Watercraft | 61.1 | 55.7 | 61.5 | 63.4 | 65.6 | 69.5 | |
| Whale | 24.1 | 43.2 | 26.5 | 46.8 | 51.8 | 33.5 | |
| Zebra | 64.2 | 89.4 | 68.6 | 74.9 | 75.2 | 90.2 | |
| mAP | 45.3 | 63.0 | 47.5 | 68.4 | 67.9 | 69.5 | 73.8 |
Figure 9. Multiscale feature representation of the network. The first row is the multiscale feature representation of the proposed method and the second row is the feature representation of the baseline.
Figure 10. The feature similarity of the proposed method. The horizontal axis is the index of the multiscale feature map and the vertical axis is the similarity index.
Figure 11. Results on the validation dataset. (a) Blue bounding boxes are the ground truths. (b) Yellow bounding boxes are the baseline results. (c) Red bounding boxes are our results.
Detection performance for different numbers of anchor shapes. The bold values in the table are the fastest detection speed and the best performance.
| Settings | Anchor-6 | Anchor-4 | Anchor-4 and 6 |
|---|---|---|---|
| Anchor number | 11,640 | 7,760 | 8,732 |
| Detection speed | 32 fps | 46 fps | |
| Mean AP | 67.9 | 68.3 |
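The three anchor totals in this table can be reproduced from the number of anchors placed at each feature map pixel. The sketch below is a consistency check under assumptions: the feature map side lengths (38, 19, 10, 5, 3, 1) are inferred rather than stated, and the per-map assignment used for the mixed "Anchor-4 and 6" setting is an assumed SSD300-style pattern.

```python
FEATURE_MAP_SIDES = (38, 19, 10, 5, 3, 1)  # inferred, not given in the text

def total_anchors(anchors_per_pixel):
    """Sum side * side * k over all feature maps, with k anchors per pixel on each map."""
    return sum(side * side * k
               for side, k in zip(FEATURE_MAP_SIDES, anchors_per_pixel))

anchor_6 = total_anchors([6] * 6)
anchor_4 = total_anchors([4] * 6)
anchor_4_and_6 = total_anchors([4, 6, 6, 6, 4, 4])  # assumed SSD300-style assignment
print(anchor_6, anchor_4, anchor_4_and_6)
```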
Detection performance for different numbers of feature maps. The bold values in the table are the fastest detection speed and the best performance.
| Settings | Feature-6 | Feature-5 | Feature-4 | Feature-3 |
|---|---|---|---|---|
| Anchor number | 11,640 | 11,634 | 11,580 | 11,430 |
| Detection speed | 32 fps | 32 fps | 32 fps | |
| Mean AP | 69.3 | 69.1 | 66.8 |
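The anchor numbers in this table change very little because the deepest feature maps carry only a handful of anchors. The following sketch makes the arithmetic explicit, assuming six anchors per pixel and inferred feature map side lengths of 38, 19, 10, 5, 3, and 1; `anchors_for_first` is a hypothetical helper.

```python
# Six anchors per pixel on each map; the side lengths are inferred, not stated.
PER_MAP_ANCHORS = [6 * side * side for side in (38, 19, 10, 5, 3, 1)]

def anchors_for_first(k):
    """Total anchors when only the first k (shallowest) feature maps are used."""
    return sum(PER_MAP_ANCHORS[:k])

for k in (6, 5, 4, 3):
    print(f"Feature-{k}: {anchors_for_first(k)} anchors")
```

Dropping the three deepest maps removes only 210 of the 11,640 anchors, which is consistent with the nearly unchanged detection speed reported in the table.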
Detection performance on the YTO dataset. The bold values in the table are the best results among single models for a certain class.
| Class | [ | [ | [ | [ | [ | Base | Our |
|---|---|---|---|---|---|---|---|
| Airplane | 56.5 | 76.6 | 76.1 | 78.9 | 80.2 | 85.2 | |
| Bird | 66.4 | 89.5 | 87.6 | 69.7 | 79.5 | 83.6 | |
| Boat | 58.0 | 57.6 | 62.1 | 65.9 | 75.8 | 79.5 | |
| Car | 76.8 | 65.5 | 80.7 | 84.8 | 79.3 | 86.9 | |
| Cat | 39.9 | 43.0 | 62.4 | 65.2 | 76.6 | 76.5 | |
| Cow | 69.3 | 53.4 | 78.0 | 81.4 | 18.6 | 82.3 | |
| Dog | 50.4 | 55.8 | 58.7 | 61.9 | 67.3 | 71.7 | |
| Horse | 56.3 | 37.0 | 81.8 | 83.2 | 85.2 | 88.1 | |
| Motorbike | 53.0 | 24.6 | 41.5 | 43.9 | 58.6 | 65.8 | |
| Train | 31.0 | 62.0 | 58.2 | 61.3 | 75.3 | 71.7 | |
| Mean AP | 55.7 | 56.5 | 68.7 | 72.1 | 76.8 | 76.4 |
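The Mean AP row is the unweighted average of the per-class AP values in each column. As a small check, the sketch below averages the second numeric column of the YTO table exactly as it appears here; the helper name `mean_ap` is an assumption, and no claim is made about which method that column belongs to.

```python
def mean_ap(per_class_ap):
    """Unweighted mean of per-class average precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)

# Values copied from the second numeric column of the YTO table above.
second_column = {
    "Airplane": 76.6, "Bird": 89.5, "Boat": 57.6, "Car": 65.5, "Cat": 43.0,
    "Cow": 53.4, "Dog": 55.8, "Horse": 37.0, "Motorbike": 24.6, "Train": 62.0,
}
print(round(mean_ap(second_column), 1))
```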