
Improving object detection quality with structural constraints.

Zihao Rong1, Shaofan Wang1, Dehui Kong1, Baocai Yin1.   

Abstract

Recent research has revealed that object detection networks trained with the simple "classification loss + localization loss" objective are often not effectively optimized, while additional constraints on network features can effectively improve detection quality. In particular, some works successfully learned discriminative network features by constraining the relations between training samples. Based on these observations, we propose structural constraints for improving object detection quality. Structural constraints supervise feature learning in the classification and localization network branches with a Fisher loss and an equi-proportion loss respectively, by requiring the feature similarities of training sample pairs to be consistent with the similarities of their ground truth labels. Structural constraints can be applied to all object detection network architectures with the aid of our proxy feature design. Our experiment results show that the structural constraint mechanism optimizes the distribution of object class instances in network feature space, and consequently the detection results. Evaluations on the MSCOCO2017 and KITTI datasets show that our structural constraint mechanism enables baseline networks to outperform modern counterpart detectors in terms of object detection quality.


Year:  2022        PMID: 35584103      PMCID: PMC9116628          DOI: 10.1371/journal.pone.0267863

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


1 Introduction

Object detection is a fundamental computer vision technology with a broad range of application scenarios, such as autonomous driving. It is a compound task of object classification and localization. Modern object detectors are trained by matching their detection results with ground truth labels and then minimizing a loss that measures the differences within these label-prediction matches. Each match's loss consists of two terms, measuring classification error and localization error respectively, and the complete loss is the sum of the two terms over all matches. In such a loss, each detection result is evaluated independently and is only required to fit its matched ground truth label. Though this loss form is simple, recent research has revealed that object detection networks often cannot be effectively trained by directly minimizing it [1], while other research has shown that detection quality can be effectively improved with additional constraints on intermediate network features [2]. Specifically, recent work on network-based clustering [3] showed that feature learning can be effectively guided for the benefit of the main task under constraints on the mutual relations between training samples. This indicates that it is possible to optimize object class distributions in network feature space for the benefit of object recognition. Thus, it is reasonable to expect detection quality improvements from complementing the basic loss form of modern detectors with additional constraints on training sample relations in intermediate feature space. This work presents such a training-sample-relation-based constraint on object detection network training. We name it the Structural Constraint, because it shapes the structure of the training sample distribution in the object detection network's feature space (as shown later in Fig 3).
Structural constraints append two terms to the basic loss, Fisher loss and equi-proportion loss, constraining the relations of training samples in the classification branch space and the localization branch space respectively. For an arbitrary pair of training samples, the Fisher loss measures the difference between the pairwise sample feature similarity and the pairwise classification target similarity, while the equi-proportion loss measures the difference between the pairwise sample feature similarity and the pairwise localization target similarity. Under these two terms, the training sample feature distributions in the classification and localization branches come to resemble the ground truth label distributions more closely, so the features of these branches can be more easily transformed into accurate detection results. Structural constraints can be applied to object detection networks of various architectures, including single-stage, two-stage and multi-stage networks, without changing the original network structures or affecting detection rates. In our experiments, we evaluated the effect of structural constraints on representative object detection networks of various architectures on different image datasets; the results demonstrate that structural constraints improve object detection quality noticeably across a broad range of detectors. To summarize, the novel contributions of this work are: the Fisher loss function, a structural constraint on training sample feature relations for improving the classification performance of object detection networks; the equi-proportion loss function, a structural constraint on training sample feature relations for improving localization performance; and a mechanism for applying structural constraints to various object detection network architectures.
The rest of this paper is organized as follows: Section 2 reviews related works, Section 3 describes the structural constraint and the mechanism of applying it to networks in detail, Section 4 presents our experiment results and analysis, and Section 5 concludes this work.

2 Related works

In this section, we review previous works closely related to the structural constraints proposed in this work, confining the scope to works based on neural networks. First, we review deep learning models for image recognition with feature learning constraints; then, we review representative object detection networks of various architectures.

2.1 Feature learning constraints

Feature learning constraints are widely adopted in the deep-learning-based image recognition domain. Some works on object detection use feature learning constraints to improve detection quality. RIFD-CNN [2] used two types of constraints on its network's intermediate layer features, one for rotation invariance and one for Fisher discrimination. The rotation invariance constraint requires the intermediate feature representation of each training image to be similar to the average intermediate representation of rotated versions of the image, so that subsequent classification based on these features is robust against the influence of object rotation. The Fisher discrimination constraint requires each class's training sample features to lie close to the class mean, and each class mean to lie far from the global mean of all classes, so that the subsequent classification layer can easily and accurately separate the classes from each other. Using these two constraints, RIFD-CNN achieved significant object detection accuracy improvements. DETR [4] is another object detection network using feature learning constraints. DETR uses a transformer to process feature maps from its backbone into detection results. Its transformer decoder consists of multiple attention layers, and the detection results are produced by the last layer. However, the other attention layers' intermediate features are also required to be transformed into accurate detection results through the same detection head shared with the last layer. This deep supervision is in essence a type of feature learning constraint: the supervision on the intermediate attention layers constrains their output features to facilitate subsequent inference for better detection accuracy. Feature learning constraints have also been used to solve image clustering problems.
Deep Self-Evolution Clustering (DSEC) [3] and Deep Adaptive Clustering (DAC) [5] constrain their output features' pairwise relationships to make these features directly express cluster identities. These clustering networks' constraints require the dot products of arbitrary pairs of output features to be close to corresponding pseudo labels, which reflect the cosine similarities of the feature pairs. As a result, training these networks under this type of constraint gradually drives the output features toward one-hot vectors that express cluster identities directly. In fact, this type of constraint on pairwise feature relationships is the only content of these two clustering networks' training objectives. Compared with the feature learning constraints in the works described above, the structural constraints in this work exhibit both similarities and differences. Like RIFD-CNN, structural constraints are applied over the intermediate layer features of object detection networks; like DSEC and DAC, structural constraints are based on pairwise training sample feature relations. However, the combination of these two characteristics is absent in all these works. Besides, as constraints for object detection networks, RIFD-CNN's constraints are applied for classification only, while structural constraints are applied for both classification and localization. Furthermore, RIFD-CNN did not constrain inter-class training sample relations, while our Fisher loss constrains both inter- and intra-class relations over all training sample pairs.
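To make the pairwise-constraint idea concrete, here is a minimal numpy sketch in the spirit of the DAC/DSEC training objectives. The fixed similarity thresholds and the squared-error form are simplifications of ours; the original methods use adaptively updated thresholds and a cross-entropy-style loss.

```python
import numpy as np

def dac_style_loss(features, upper=0.95, lower=0.455):
    """Pairwise clustering constraint (simplified sketch of DAC/DSEC).

    features: (N, K) L2-normalized cluster-probability vectors."""
    sim = features @ features.T            # pairwise cosine similarities
    pseudo = np.full_like(sim, np.nan)
    pseudo[sim >= upper] = 1.0             # confidently same cluster
    pseudo[sim <= lower] = 0.0             # confidently different clusters
    mask = ~np.isnan(pseudo)               # ambiguous pairs are ignored
    return np.mean((sim[mask] - pseudo[mask]) ** 2)
```

Training under such a loss pushes confident pairs' similarities toward their 0/1 pseudo labels, which is what gradually drives the output features toward one-hot cluster indicators.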

2.2 Object detection network architectures

To date, object detection networks have exhibited two types of architectures: networks generating detections in a single stage, and networks generating detections through several stages of refinement. We review these architectures below.

2.2.1 Single-stage object detection networks

Single-stage object detection networks transform input images' backbone feature maps into detection results directly, through a single detection head. SSD [6] is the forerunner of this architecture. It scatters boxes of various sizes and aspect ratios over the input image's feature maps and infers classes and adjustments for these boxes to form detection results; results across feature pyramid levels are synthesized into the final detections. The initially scattered boxes came to be known as anchors. YOLO [7] is another single-stage detection network that is fast at inference. It additionally infers a confidence value for each bounding box, representing the probability that an object exists within the box, and these confidence values participate in deciding the final detection results. However, YOLO's detection quality is not satisfactory. RetinaNet [1] is a high-detection-quality single-stage object detection network. It focuses on the imbalance between foreground and background training samples, a crucial cause of the poor detection quality of many other single-stage networks, and proposes Focal Loss to replace the widely adopted cross-entropy loss for the classification task. By using focal loss, RetinaNet allocates more weight to poorly classified hard samples during training, making the trained network generalize better to test data.
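The focal loss itself has a simple closed form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t); a minimal numpy sketch of the binary case follows (γ = 2 and α = 0.25 are the defaults reported for RetinaNet):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights well-classified examples.

    p: predicted foreground probabilities; y: 1 for foreground, 0 for background."""
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

The (1 − p_t)^γ factor is what suppresses the contribution of the abundant, easily classified background samples, letting hard samples dominate the gradient.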

2.2.2 Object detection networks with several stages

Another kind of object detection network is constituted of more than one stage. These networks can be further divided into two groups according to the number of stages: two-stage networks and multi-stage (more than two stages) networks. The first stage of all these networks is responsible for generating region proposals, also known as RoIs (regions of interest). Two-stage networks then refine the region proposals with a detection head to produce final detection results, while multi-stage networks refine the region proposals with several detection heads in sequence. We review representatives of these architectures below.
Two-stage object detection networks. Two-stage object detection networks appeared early among all architectures and usually produce better detection quality than single-stage networks. Faster RCNN [8] is the forerunner of this architecture. It introduced the RPN (Region Proposal Network) on the basis of Fast RCNN [9]. The RPN takes backbone feature maps as input and infers RoIs and corresponding confidence values. These RoIs are then used to extract features from the backbone feature maps through the RoI pooling operation, and these features are passed into a fully connected detection head to infer detection results. R-FCN [10] focuses on accelerating the inference of Faster RCNN by reducing redundant computation in the detection head. R-FCN's detection head is constituted of convolutional layers and generates a special feature map whose different channels are sensitive to different parts of target objects. RoI pooling over this feature map, which fills each RoI part with features from the corresponding channel, can then easily decide whether an RoI accurately localizes an object and which class it belongs to. Since most of the necessary computation is done by the convolutional detection head and the remaining RoI pooling operations cost little computation, R-FCN's inference is time-efficient.
Double-head RCNN [11] is another two-stage network; its second stage is composed of two detection heads in parallel, one fully connected head and one convolutional head. This design is based on the observation that fully connected layers are sensitive to the spatial completeness of objects, while convolutional layers are robust against occlusion and deformation. Thus Double-head RCNN uses its fully connected head to infer classification scores that reflect localization quality, and uses its convolutional head to infer bounding boxes that better generalize to various object appearances and interfering content.
Multi-stage object detection networks. Multi-stage object detection networks extend the two-stage architecture by appending additional detection heads, refining RoIs with more stages of inference. Cascade RCNN [12] is a typical multi-stage object detection network. During Cascade RCNN training, each stage's detection head is trained on the detection results of its previous stage. At inference, each stage's detection head takes features from RoI pooling based on its previous stage's detection boxes and generates new detection results. The final detection results take the last detection head's output boxes as localization results and the averages of all detection heads' class scores as classification results. The increased number of network stages improved detection quality noticeably, making Cascade RCNN one of the most accurate object detectors of its time. Hybrid Task Cascade [13] is a multi-stage network capable of both object detection and instance segmentation. It inherited the network structure of Cascade RCNN and introduced additional components and links, including a semantic segmentation convolutional branch that provides helpful inputs to its detection heads and mask heads. The detection quality of Hybrid Task Cascade is outstanding in the multi-stage group, but its whole network is cumbersome.
All the representative object detection networks mentioned above, and many others, lack constraints on the relationships between training samples in feature space, so the structural constraints proposed in this work can complement them in this respect. We will show in the next section that structural constraints are applicable to all these architectures through a unified mechanism.

3 Structural constraint mechanism

In this section, we describe the structural constraint mechanism for object detection in detail. First, we explain the motivation of structural constraints. Then, we present their definition. Finally, we describe the mechanism of combining structural constraints with object detection networks.

3.1 Motivation

We propose structural constraints based on two observations: first, the lack of constraints on training sample relationships in modern object detection networks; second, the importance of feature learning exhibited in many other image recognition tasks. As described in Section 1, most modern object detection networks' loss functions usually take the form

L = Σ_i [ Lcls(p_i, p_i*) + Lloc(b_i, b_i*) ],   (1)

where Lcls and Lloc are two loss terms measuring classification error and localization error respectively, and the sum runs over all label-prediction matches i. For each match, Lcls calculates the difference between the estimated class probability vector p_i and the corresponding ground truth vector p_i*, and Lloc measures the difference between the estimated bounding box b_i and the corresponding ground truth box b_i*. Loss functions like this only force each detection result to fit its matched ground truth. They are simple in form, but cannot be effectively minimized in many cases, since the supervision on object classification can be severely diluted by the large number of background training samples [1]. We observe that additional supervision on one training sample can come from the other training samples, since one training sample can be represented by its relative differences from the others. This can be understood by looking at works on image clustering, such as DSEC [3], where the clustering network was effectively trained by supervising the similarity of each pair of training samples. Thus, structural constraints are designed to supervise the differences between each pair of sample detections. Because object detection consists of classification and localization, structural constraints use two loss functions to measure sample pairs' classification differences and localization differences, namely Fisher loss and equi-proportion loss.
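A minimal numpy sketch of the basic per-match objective of Eq (1); cross-entropy and smooth-L1 are common instantiations of Lcls and Lloc in modern detectors, not choices specified by this paper:

```python
import numpy as np

def detection_loss(cls_probs, cls_targets, boxes, box_targets):
    """Basic 'classification + localization' loss over matched pairs.

    cls_probs/cls_targets: (N, C) predicted probabilities and one-hot labels.
    boxes/box_targets:     (N, 4) predicted and ground-truth box parameters."""
    # cross-entropy classification term (one value per match)
    l_cls = -np.sum(cls_targets * np.log(cls_probs + 1e-12), axis=1)
    # smooth-L1 localization term: quadratic near zero, linear for large errors
    diff = np.abs(boxes - box_targets)
    l_loc = np.sum(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5), axis=1)
    return np.mean(l_cls + l_loc)
```

Note that each match contributes independently here; nothing in this objective relates one training sample to another, which is the gap structural constraints target.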
We also observed that proper supervision on object detection networks' intermediate features can effectively improve detection quality. Examples are RIFD-CNN [2]'s rotation invariance and Fisher discrimination constraints on its intermediate layers, and DETR [4]'s auxiliary supervision on multiple transformer decoder layers. At the same time, we try to avoid disrupting optimization of the main objective in Eq (1). Thus, instead of being applied over the networks' final outputs, structural constraints are applied over the networks' intermediate features to guide feature learning.

3.2 Definition

Structural constraints take training samples' intermediate features as input. To supervise training samples' relations during classification and localization, structural constraints use Fisher loss and equi-proportion loss to constrain pairwise feature differences respectively. These losses and the basic object detection objective in Eq (1) together form the new training objective. The Fisher loss calculates the similarity between an arbitrary pair of training samples' intermediate features and supervises it with the similarity of the corresponding pair of class labels. It is expressed as

LFisher = Σ_{i,j} ( σ(f_i) ⋅ σ(f_j) − y_i ⋅ y_j )²,   (2)

where σ(⋅) is the sigmoid function, f_i is a transformed intermediate feature vector of training sample i, and y_i ∈ {0, 1}^C is the corresponding one-hot class label, with C being the number of object classes. The Fisher loss calculates the similarity between f_i and f_j and the similarity between y_i and y_j, both in terms of dot products, and uses the squared difference between these two similarities as the loss value. To make the comparison between these similarities fair, f_i is obtained by linearly transforming the intermediate feature into the same dimensionality as y_i. Since f_i acts as a proxy of the intermediate feature, we name it the Proxy Feature. Before calculating the similarity, the proxy feature vectors' elements are transformed by σ(⋅) into the same range [0, 1] as y_i. By supervising the similarity between proxy feature vectors, the Fisher loss drives the similarity between the underlying intermediate features to be consistent with the similarity of the corresponding class labels. As a result, the Fisher loss reduces intra-class variance and increases inter-class separation of the training sample distribution, which benefits object classification. The equi-proportion loss is the other loss term in structural constraints.
It also measures the similarity between an arbitrary pair of intermediate features, but supervises it with the corresponding pair of localization labels. It is expressed as

Lequip = Σ_{i,j} ‖ g_i ⊘ g_j − b_i ⊘ b_j ‖²,   (3)

where g_i is the proxy feature of training sample i, b_i is the corresponding localization label, and ⊘ denotes element-wise division. g_i is linearly transformed from the intermediate feature into the same dimensionality as b_i, to facilitate the comparison between training sample differences and localization label differences. Since the b_i are not bounded, we measure their relative difference in terms of element-wise ratios, and the difference between g_i and g_j is measured in the same way. The squared magnitude of the difference between these two sets of ratios is used as the value of Lequip. Under the guidance of the equi-proportion loss, the intermediate features of training samples become sensitive enough to reflect the differences between their localization labels, which benefits bounding box regression. After applying the structural constraint constituted by Fisher loss and equi-proportion loss, the object detection network training objective is rewritten as

L′ = L + LFisher + Lequip,   (4)

where the Fisher loss and equi-proportion loss are evaluated over all pairs of training samples. This sum of the original loss and the structural constraint terms is used to calculate back-propagation during end-to-end training. Thus, training with this new objective not only optimizes the main objective of object detection, but also optimizes the structure of the training sample distribution in intermediate feature space, which benefits the main objective in return.
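Under our reading of the definitions above, the two structural constraint terms can be sketched in numpy as follows; the per-pair averaging and the eps stabilizer in the ratio computation are our assumptions, not details stated by the paper:

```python
import numpy as np

def fisher_loss(f_proxy, labels):
    """Pairwise Fisher loss (sketch).

    f_proxy: (N, C) classification proxy features; labels: (N, C) one-hot."""
    p = 1.0 / (1.0 + np.exp(-f_proxy))     # sigmoid into [0, 1], like the labels
    feat_sim = p @ p.T                     # pairwise dot-product similarities
    label_sim = labels @ labels.T          # 1 for same-class pairs, 0 otherwise
    return np.mean((feat_sim - label_sim) ** 2)

def equiprop_loss(g_proxy, boxes, eps=1e-6):
    """Pairwise equi-proportion loss (sketch).

    g_proxy: (N, 4) localization proxy features; boxes: (N, 4) box labels."""
    n, loss = len(boxes), 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            feat_ratio = g_proxy[i] / (g_proxy[j] + eps)   # element-wise ratios
            box_ratio = boxes[i] / (boxes[j] + eps)
            loss += np.sum((feat_ratio - box_ratio) ** 2)
    return loss / max(n * (n - 1), 1)

def total_loss(base, f_proxy, labels, g_proxy, boxes):
    """Combined objective: base detection loss plus the two constraint terms."""
    return base + fisher_loss(f_proxy, labels) + equiprop_loss(g_proxy, boxes)
```

Both terms vanish when pairwise feature relations already mirror the corresponding label relations, so they act purely as a regularizer on the feature-space structure.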

3.3 Combination with various object detection architectures

Structural constraints supervise the intermediate features of object detection networks, that is, they are applied over intermediate network layers; how they are combined with a network therefore depends on the forms of these layers, which differ among architectures. We describe below how structural constraints are combined with single-stage, two-stage and multi-stage object detection networks respectively.

Single-stage case

Single-stage object detection networks' detection heads transform backbone feature maps with two-dimensional convolutions (Conv2D) to generate classification outputs and localization outputs. Because the dimensionality of the proxy features used in Fisher loss and equi-proportion loss calculation must match the dimensionality of the classification outputs and localization outputs respectively, structural constraints in single-stage networks use Conv2D layers to transform training samples' intermediate features into the needed proxy features. This can be expressed as

{f_i} = Conv2DFisher(F),  {g_i} = Conv2Dequip(F),   (5)

where Conv2DFisher and Conv2Dequip are convolution layers generating proxy features for Fisher loss and equi-proportion loss respectively, and F is the intermediate feature collection. Conv2DFisher and Conv2Dequip take F as input and generate the proxy feature collections {f_i} and {g_i}. It should be noted that F, {f_i} and {g_i} take the form of feature tensors in this case. With the proxy features obtained, the rest of the structural constraint evaluation is exactly the same as described in Section 3.2. The complete mechanism for the single-stage case is illustrated in Fig 1a.
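As a sketch of this projection step: a 1 × 1 convolution reduces to a per-location linear map over channels, so the proxy-feature layers can be modeled as below. The channel counts (256-channel head feature, 80 classes, 4 box coordinates) and the 8 × 8 map size are hypothetical placeholders:

```python
import numpy as np

def conv1x1(feat_map, weight, bias):
    """A 1x1 convolution: a linear projection of each location's channel vector.

    feat_map: (Cin, H, W); weight: (Cout, Cin); bias: (Cout,)."""
    c_in, h, w = feat_map.shape
    flat = feat_map.reshape(c_in, h * w)          # channels x locations
    out = weight @ flat + bias[:, None]           # project channels at every location
    return out.reshape(-1, h, w)

rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 8, 8))           # hypothetical head feature tensor
f_proxy = conv1x1(feat, rng.standard_normal((80, 256)) * 0.01, np.zeros(80))
g_proxy = conv1x1(feat, rng.standard_normal((4, 256)) * 0.01, np.zeros(4))
```

Each spatial location thus yields one classification proxy vector and one localization proxy vector, matching the per-anchor outputs of the detection head.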
Fig 1

Illustration of structural constraint mechanisms in object detection networks of various architectures.

(a) single-stage, (b) two-stage, and (c) multi-stage.


Two-stage case

Two-stage object detection networks first generate RoIs with their RPNs, and then their detection heads infer detection results from these RoIs. The detection heads usually consist of fully connected (FC) layers. Thus, for the same reason as in the single-stage case, we set up dedicated FC layers that transform intermediate features into proxy features whose dimensionality matches the detection head outputs. This can be expressed as

{f_i} = FCFisher(F),  {g_i} = FCequip(F),   (6)

where FCFisher and FCequip are the FC layers that generate proxy features for Fisher loss and equi-proportion loss respectively. In this case, the intermediate feature collection F comes from RoI pooling. The rest of the structural constraint evaluation is still the same as described in Section 3.2. Apart from the detection heads, structural constraints can also be applied to the RPNs of two-stage networks, because these RPNs are essentially identical to single-stage networks' detection heads; the mechanism for the single-stage case can be directly applied to them. The complete structural constraint mechanism for the two-stage case is illustrated in Fig 1b.
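The two-stage variant is even simpler to sketch: two parallel fully connected layers project pooled RoI features to the class-label and box-label dimensionalities. The shapes below (16 RoIs, 1024-dim pooled features, 80 classes) are hypothetical placeholders:

```python
import numpy as np

def fc_proxy(roi_feats, w_fisher, w_equip):
    """Two parallel FC projections of pooled RoI features into proxy features.

    roi_feats: (N, D); w_fisher: (C, D); w_equip: (4, D)."""
    return roi_feats @ w_fisher.T, roi_feats @ w_equip.T

rng = np.random.default_rng(1)
rois = rng.standard_normal((16, 1024))            # hypothetical pooled RoI features
f, g = fc_proxy(rois,
                rng.standard_normal((80, 1024)),  # projection to class dimensionality
                rng.standard_normal((4, 1024)))   # projection to box dimensionality
```

The resulting per-RoI proxy vectors then feed the same pairwise Fisher and equi-proportion losses as in the single-stage case.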

Multi-stage case

Multi-stage object detection networks extend the two-stage architecture by using multiple detection heads to refine detection results sequentially. Thus, compared with two-stage networks, the constituting modules of multi-stage networks remain unchanged, and structural constraints are applied to their detection heads and RPNs exactly as in the two-stage case. For structural constraints on detection heads, the proxy features are generated in the same manner as Eq (6); on RPNs, they are generated in the same manner as Eq (5). All the rest of the structural constraint evaluation still follows Section 3.2. The complete mechanism for the multi-stage case is illustrated in Fig 1c. In all the cases above, the structural constraint mechanisms exist only during the training period, guiding intermediate feature learning through the proxy features. At inference time, all calculations related to structural constraints are absent, as are all the dedicated network layers (Conv2DFisher/equip, FCFisher/equip), so the detection rates and deployment sizes of these networks are not influenced.

4 Experiments

To verify the effectiveness of structural constraints, we experimented with multiple object detection networks over several image datasets, and examined the training processes and network behaviors. In this section, we present these experiment results.

4.1 Experiment settings

We first describe the experiment settings, including the settings of the networks, training, and testing. All hyper-parameters listed below are set to the default values of the MMDetection [14] configuration files.

Networks

The default settings of object detection networks used in the experiments are: ResNet-101 [15] as backbone, FPN [16] as neck, and Greedy NMS [17] for post-processing. All multi-stage networks use 3 stages of detection heads. All object detection networks are implemented with MMDetection toolbox [14] and PyTorch deep learning library [18].

Training and testing

All networks’ optimizers are SGD (Stochastic Gradient Descent). The default length of training is 12 epochs. For single-stage networks, their detection head training samples’ positive and negative IoU thresholds are 0.5 and 0.4 respectively. For two-stage networks, their detection head training samples’ positive and negative IoU thresholds are both 0.5, and positive training samples cover 25%. For multi-stage networks, 3 stages of detection heads’ positive and negative IoU thresholds are 0.5, 0.6 and 0.7 respectively. Besides, for two- and multi-stage networks, their RPN training samples’ positive and negative IoU thresholds are 0.7 and 0.3 respectively, and positive samples cover 50%. All training samples are randomly obtained. At test time, the default NMS IoU threshold is 0.5 for detection heads, and 0.7 for RPNs. All networks are trained and tested on GPU servers.
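For orientation, the sampling settings above map onto MMDetection-style config fields roughly as follows. This is an illustrative sketch using MMDetection's naming conventions, not the authors' actual configuration; the sampler sizes (num=256/512) are the toolbox defaults rather than values stated here:

```python
# Hypothetical MMDetection-style training/testing config fragment for a
# two-stage detector, mirroring the IoU thresholds and positive fractions above.
train_cfg = dict(
    rpn=dict(
        assigner=dict(type='MaxIoUAssigner', pos_iou_thr=0.7, neg_iou_thr=0.3),
        sampler=dict(type='RandomSampler', num=256, pos_fraction=0.5),   # 50% positives
    ),
    rcnn=dict(
        assigner=dict(type='MaxIoUAssigner', pos_iou_thr=0.5, neg_iou_thr=0.5),
        sampler=dict(type='RandomSampler', num=512, pos_fraction=0.25),  # 25% positives
    ),
)
test_cfg = dict(
    rpn=dict(nms=dict(type='nms', iou_threshold=0.7)),
    rcnn=dict(nms=dict(type='nms', iou_threshold=0.5)),
)
```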

4.2 Experiment results

We present experiment results on structural constraint mechanism in this subsection. Firstly, we present ablation evaluation results to show influences of different factors in the mechanism. Then, we compare object detection quality of our structural-constraint-applied networks with other modern detectors. Finally, we analyze behaviors of structural constraint mechanism through visualization.

4.2.1 Ablation evaluation

We performed ablation evaluations on the structural constraint mechanism to investigate different factors' influences on object detection quality, including the constituting loss terms LFisher and Lequip as well as different combination manners. We report our evaluation results on two widely used image datasets, MSCOCO2017 [19] and KITTI [20], respectively.
Evaluations on MSCOCO2017. For the ablation on MSCOCO2017, all object detection networks are trained on the train2017 subset and tested on the val2017 subset. We choose RetinaNet as the evaluation subject for the single-stage architecture, Faster RCNN for two-stage, and Cascade RCNN for multi-stage. The ablation evaluation results are shown in Table 1. Network names containing "+LFisher" or "+Lequip" indicate that Fisher loss or equi-proportion loss is applied to the detection heads of those networks, and names with the superscript 2 (e.g. "+LFisher2") indicate that the loss is applied to both the detection heads and the RPNs (in the two- or multi-stage case). It can be observed that the structural constraint mechanism improves the object detection quality of all network subjects on this general object detection task. Specifically, the complete structural constraint mechanism that includes both Fisher loss and equi-proportion loss produced the most obvious improvement in some cases, like Faster RCNN. We also evaluated the influence of batch size; the networks marked with "*" are trained with smaller (halved) batch sizes. It can be observed that the structural constraint mechanism is robust against batch size changes.
Table 1

Ablation evaluations of structural constraint mechanism on MSCOCO2017.

detector | AP | AP0.5 | AP0.75 | APsmall | APmed | APlarge | AR(MD=1) | AR(MD=10) | AR | ARsmall | ARmed | ARlarge
RetinaNet | 0.379 | 0.572 | 0.409 | 0.203 | 0.420 | 0.498 | 0.321 | 0.513 | 0.544 | 0.335 | 0.591 | 0.699
RetinaNet+LFisher | 0.370 | 0.559 | 0.398 | 0.194 | 0.410 | 0.502 | 0.317 | 0.504 | 0.535 | 0.321 | 0.581 | 0.701
Faster RCNN | 0.376 | 0.589 | 0.412 | 0.213 | 0.419 | 0.481 | 0.315 | 0.502 | 0.526 | 0.337 | 0.569 | 0.669
Faster RCNN+LFisher | 0.377 | 0.589 | 0.413 | 0.213 | 0.414 | 0.485 | 0.315 | 0.500 | 0.524 | 0.329 | 0.571 | 0.657
Faster RCNN+LFisher2 | 0.378 | 0.588 | 0.410 | 0.217 | 0.419 | 0.488 | 0.317 | 0.505 | 0.530 | 0.335 | 0.575 | 0.673
Faster RCNN* | 0.383 | 0.606 | 0.414 | 0.223 | 0.427 | 0.496 | 0.314 | 0.496 | 0.521 | 0.332 | 0.564 | 0.662
Faster RCNN+Lequip* | 0.384 | 0.607 | 0.417 | 0.224 | 0.427 | 0.501 | 0.317 | 0.501 | 0.524 | 0.336 | 0.568 | 0.664
Faster RCNN+Lequip2* | 0.383 | 0.607 | 0.412 | 0.224 | 0.429 | 0.496 | 0.314 | 0.495 | 0.519 | 0.332 | 0.569 | 0.656
Faster RCNN+LFisher+Lequip* | 0.384 | 0.607 | 0.415 | 0.227 | 0.426 | 0.500 | 0.314 | 0.498 | 0.522 | 0.335 | 0.567 | 0.661
Faster RCNN+LFisher2+Lequip2* | 0.385 | 0.607 | 0.418 | 0.225 | 0.427 | 0.505 | 0.318 | 0.501 | 0.525 | 0.330 | 0.567 | 0.671
Cascade RCNN | 0.412 | 0.590 | 0.447 | 0.227 | 0.447 | 0.550 | 0.337 | 0.530 | 0.554 | 0.337 | 0.595 | 0.718
Cascade RCNN+LFisher | 0.412 | 0.592 | 0.451 | 0.231 | 0.448 | 0.552 | 0.337 | 0.529 | 0.554 | 0.346 | 0.596 | 0.719
Cascade RCNN+LFisher2 | 0.412 | 0.592 | 0.449 | 0.233 | 0.449 | 0.545 | 0.337 | 0.531 | 0.554 | 0.341 | 0.599 | 0.706
Cascade RCNN* | 0.396 | 0.570 | 0.433 | 0.218 | 0.428 | 0.524 | 0.332 | 0.523 | 0.546 | 0.334 | 0.585 | 0.700
Cascade RCNN+Lequip* | 0.395 | 0.567 | 0.432 | 0.214 | 0.426 | 0.522 | 0.332 | 0.521 | 0.544 | 0.321 | 0.586 | 0.702
Cascade RCNN+Lequip2* | 0.396 | 0.568 | 0.433 | 0.224 | 0.427 | 0.520 | 0.334 | 0.525 | 0.548 | 0.336 | 0.586 | 0.711

Note: the values where improvements happen are in bold face.

“*” indicates that the network is trained using smaller batch size.

Evaluations on KITTI. We use the 2D object detection subset of KITTI, which contains 7481 labeled driver-view images, to perform ablation evaluations. For all evaluated networks, the first 6000 images are used for training and the remaining 1481 images for testing. We adopted Pascal-VOC-style metrics, which evaluate class-wise average precisions and the global mean average precision (MAP). We choose RetinaNet and SSD as evaluation subjects for the single-stage architecture, Faster RCNN for two-stage, and Cascade RCNN for multi-stage. The evaluation results are shown in Table 2. It can be observed that the structural constraint mechanism produces object detection quality improvements for all these network architectures. It is also observable that the improvements happened on multiple classes simultaneously, as in the case of Faster RCNN. Besides, the structural constraint mechanism still exhibits robustness against batch size settings, which can be observed from the evaluations on Cascade RCNN.
Table 2

Ablation evaluations of structural constraint mechanism on KITTI.

detector  car  pedestrian  van  truck  person sitting  cyclist  tram  misc  don't care  MAP
RetinaNet  0.977 0.925 0.989 1.00 0.927 0.985 0.997 0.966 0.828 0.955
RetinaNet+LFisher  0.977 0.902 0.987 1.00 0.942 0.976 0.997 0.969 0.805 0.950
SSD  0.856 0.395 0.685 0.826 0.231 0.408 0.806 0.509 0.123 0.538
SSD+Lequip  0.853 0.409 0.704 0.827 0.219 0.417 0.835 0.499 0.118 0.542
SSD+LFisher+Lequip  0.856 0.388 0.715 0.805 0.207 0.396 0.820 0.506 0.129 0.536
Faster RCNN  0.978 0.932 0.994 1.00 0.816 0.979 1.00 0.990 0.845 0.948
Faster RCNN+LFisher  0.979 0.928 0.996 1.00 0.874 0.995 1.00 0.990 0.844 0.956
Faster RCNN+LFisher2  0.979 0.932 0.996 1.00 0.884 0.986 1.00 0.985 0.849 0.957
Cascade RCNN  0.976 0.928 0.993 1.00 0.853 0.983 1.00 0.990 0.871 0.955
Cascade RCNN+LFisher  0.976 0.922 0.994 1.00 0.836 0.986 1.00 0.995 0.878 0.954
Cascade RCNN+LFisher2  0.976 0.917 0.994 1.00 0.873 0.982 0.991 0.995 0.882 0.957
Cascade RCNN*  0.943 0.792 0.943 0.971 0.599 0.904 0.977 0.920 0.354 0.822
Cascade RCNN+LFisher+Lequip *  0.939 0.789 0.936 0.979 0.646 0.898 0.946 0.925 0.343 0.822
Cascade RCNN+LFisher2+Lequip2 *  0.939 0.804 0.934 0.987 0.608 0.900 0.918 0.885 0.354 0.814

Note: the values where improvements happen are in bold face;

“*” indicates that the network is trained using smaller batch size.


4.2.2 Comparison with other object detectors

We present object detection quality comparisons between modern object detectors and our networks with structural constraints in this subsection. These comparisons were carried out on MSCOCO2017 and KITTI, and we describe them in turn. Comparison on MSCOCO2017. The training and testing sets for this comparison are the same as in the last subsection. The evaluation results are presented in Table 3. SCM-Two and SCM-Multi are our two-stage and multi-stage object detection networks with the structural constraint mechanism applied; SCM-Two is configured as Faster RCNN+LFisher2+Lequip2 and SCM-Multi as Cascade RCNN+LFisher2. SSD300 and SSD512 are SSD networks with input image sizes of 300 × 300 and 512 × 512 respectively. It can be observed that our SCM-Two network produced detection quality comparable to many other detectors, and our SCM-Multi network achieved top values under most metrics.
Table 3

Object detection quality comparison between structural-constraint-applied networks and other detectors on MSCOCO2017.

detector  AP  AP0.5  AP0.75  APsmall  APmed  APlarge  ARMD=1  ARMD=10  AR  ARsmall  ARmed  ARlarge
FCOS [21]  0.391 0.585 0.418 0.220 0.435 0.511 - - - - - -
Mask Scoring RCNN [22]  0.400 0.614 0.437 0.232 0.442 0.523 - - - - - -
GA-RetinaNet [23]  0.389 0.591 0.418 0.220 0.426 0.519 - - - - - -
RetinaNet-GHM [24]  0.390 0.577 0.413 0.218 0.432 0.518 - - - - - -
Libra Faster RCNN [25]  0.403 0.612 0.439 0.233 0.443 0.522 - - - - - -
SSD300 [6]  0.254 0.428 0.264 0.059 0.279 0.428 0.238 0.348 0.368 0.094 0.413 0.588
SSD512 [6]  0.292 0.481 0.307 0.105 0.347 0.456 0.262 0.392 0.415 0.138 0.492 0.614
Mask RCNN [26]  0.387 0.597 0.424 0.226 0.427 0.501 0.322 0.512 0.537 0.349 0.582 0.674
Double-head RCNN [11]  0.386 0.583 0.420 0.225 0.422 0.496 0.326 0.522 0.549 0.350 0.590 0.700
DETR [4]  0.401 0.606 0.420 0.183 0.433 0.595 - - - - - -
YOLOX [27]  0.403 0.591 0.434 0.235 0.445 0.531 - - - - - -
Dynamic R-CNN [28]  0.389 0.576 0.427 0.221 0.419 0.517 - - - - - -
SCM-Two (ours)  0.385 0.607 0.418 0.225 0.427 0.505 0.318 0.501 0.525 0.330 0.567 0.671
SCM-Multi (ours)  0.412 0.592 0.449 0.233 0.449 0.545 0.337 0.531 0.554 0.341 0.599 0.706

Note: the top value under each metric is in bold face.

Comparison on KITTI. In this comparison, the training setting of our SCM-Multi network is the same as in the last subsection, and it is configured as Cascade RCNN+LFisher2+Lequip2. The other detectors' evaluation results are obtained from KITTI's official website. The comparison is shown in Table 4. Since KITTI's leaderboard publishes detection precisions on car, pedestrian and cyclist, we compare performance on these three classes and the global mean average precision (MAP). It can be observed that our SCM-Multi network achieved top values on all these metrics.
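All of the precision metrics in these comparisons rest on IoU-based matching between predicted and ground truth boxes. As a reference point (a standard formulation, not taken from the paper's code), a minimal IoU helper looks like this:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection typically counts as a true positive when its IoU with an unmatched ground truth box of the same class exceeds a threshold (0.5 in Pascal-VOC-style evaluation; COCO additionally averages over thresholds from 0.5 to 0.95).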
Table 4

Object detection quality comparison of our structural-constraint-applied networks and other detectors on KITTI.

detector  car  pedestrian  cyclist  MAP
TuSimple [29]  0.908 0.770 0.814 0.831
RRC [30]  0.906 0.753 0.850 0.836
UberATG-MMF [31]  0.918 - - -
PC-CNN-V2 [32]  0.908 - - -
SJTU-HW [33]  0.908 0.742 - -
SenseKITTI [34]  0.908 0.673 0.818 0.800
F-PointNet [35]  0.908 0.773 0.849 0.843
HRI-VoxelFPN [36]  0.907 - - -
F-ConvNet [37]  0.904 0.724 0.848 0.825
Regionlet [38]  0.848 0.612 0.704 0.721
DPM-VOC+VP [39]  0.750 0.449 0.424 0.541
3DVP [40]  0.875 - - -
SubCat [41]  0.841 - - -
CompACT-Deep [42]  - 0.587 - -
DeepParts [43]  - 0.587 - -
Fast RCNN+VGG16 [9]  0.860 0.625 0.688 0.724
SCM-Multi (ours)  0.939 0.804 0.900 0.881

Note: the top value under each metric is in bold face.

According to these ablation evaluations and comparisons with other modern detectors on different datasets, the structural constraint mechanism is shown to improve object detection quality across various network architectures, and to help some prototype networks achieve advanced performance.

4.3 Visualization analysis

We analyze the behavior of the structural constraint mechanism during training and testing in this subsection. For this purpose, we visualize the changes of the loss terms in the structural constraint, their influence on the feature space, and some final detection results.

Changes of loss values

We plotted the curves of the Fisher loss and the equi-proportion loss during training for object detection networks of different architectures. The observation subjects include RetinaNet, SSD, Faster RCNN and Cascade RCNN, all with structural constraints applied. These loss curves are shown in Fig 2. Both losses clearly decrease throughout all these training processes. This indicates that the loss terms of the structural constraint are effectively minimized, so they are indeed guiding the networks' training.
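The paper's exact loss formulations are not reproduced in this excerpt. As a rough illustration only of a pairwise structural constraint in the spirit of the Fisher loss (same-class pairs pulled together in feature space, different-class pairs pushed apart), one could sketch a contrastive-style surrogate; the `margin` parameter and the hinge form are assumptions, not the authors' definition:

```python
import numpy as np

def pairwise_structural_loss(features, labels, margin=1.0):
    """Illustrative surrogate: feature distance of each sample pair should
    agree with its label similarity -- same-class pairs are penalized for
    being far apart, different-class pairs for being closer than `margin`."""
    f = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    n = len(y)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(f[i] - f[j])
            if y[i] == y[j]:
                loss += d ** 2                         # intra-class compactness
            else:
                loss += max(0.0, margin - d) ** 2      # inter-class separation
            pairs += 1
    return loss / max(pairs, 1)
```

The double loop also makes the conclusion's caveat concrete: the constraint touches all O(N²) sample pairs, which is where the high GPU memory demand comes from.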
Fig 2

The curves of Fisher and equi-proportion losses (LFisher and Lequip) during the training of object detection networks of different architectures.

Upper row: Fisher losses; lower row: equi-proportion losses. “s#” in legends indicates the loss corresponds to stage # in the case of multi-stage networks.


Influence on network feature space

To observe the influence of the structural constraint mechanism on the feature spaces of object detection networks, we adopted t-SNE [44] to project high-dimensional backbone features to 2D space for visualization. These backbone features were obtained by feeding the networks object images sampled from KITTI according to its bounding box labels, of class Car or Pedestrian (Ped). The extracted backbone features are then resized to a uniform size for the convenience of the t-SNE transform. The visualization results are shown in Fig 3. The network subjects are Faster RCNN and Cascade RCNN. It can be observed that, with a greater extent of structural constraint application, the distributions of Car and Ped are less mixed and easier to separate. This behavior benefits object classification and is consistent with the intention of structural constraints.
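The projection step described above can be sketched as follows, with synthetic stand-ins for the backbone features; the real features, sampling procedure and resizing are as described in the text, and the array shapes here are illustrative assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for flattened backbone features of Car / Ped crops,
# already resized to a uniform length (here 64 dimensions, 15 per class).
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (15, 64)),   # stand-in "Car" features
                   rng.normal(3.0, 1.0, (15, 64))])  # stand-in "Ped" features
labels = np.array([0] * 15 + [1] * 15)

# Project to 2D; perplexity must stay below the number of samples.
emb = TSNE(n_components=2, perplexity=5.0, init="random",
           random_state=0).fit_transform(feats)
# `emb` has shape (30, 2); scatter-plot it colored by `labels`
# to inspect how mixed the two class distributions are.
```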
Fig 3

t-SNE visualization of Car and Pedestrian (Ped) instance distributions in feature spaces of object detection networks with and without structural constraints applied.

Detection result visualization

In Fig 4, we visualize some detection results on MSCOCO2017 images (val2017), comparing Faster RCNN with and without structural constraints applied. It can be observed that applying structural constraints made the detector more accurate at localization and produced fewer false positives.
Fig 4

Detection results of Faster RCNN (left in each pair) and Faster RCNN with structural constraints (right in each pair).

Green boxes: correct detection results; red boxes: incorrect detection results. Each box is marked with its estimated class name and confidence score.


5 Conclusion

In this work, we introduced our structural constraint mechanism for improving object detection quality. The structural constraint mechanism supervises object detection networks' intermediate feature spaces and guides the training process to optimize the distributions of object class instances within those spaces, by constraining feature similarities of training sample pairs to be consistent with the corresponding ground truth label similarities. With the aid of the proxy feature design, the structural constraint can be applied to all types of object detection network architectures. Experiment results indicate that our structural constraint mechanism is able to optimize networks' intermediate features and, consequently, final detection results. It should be pointed out that the structural constraint is calculated over all possible pairs of training samples, which has a high GPU memory demand. We will address this issue in future work.

1.  Object detection with discriminatively trained part-based models.

Authors:  Pedro F Felzenszwalb; Ross B Girshick; David McAllester; Deva Ramanan
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2010-09       Impact factor: 6.226

2.  Multi-view and 3D deformable part models.

Authors:  Bojan Pepik; Michael Stark; Peter Gehler; Bernt Schiele
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2015-11       Impact factor: 6.226

3.  Deep Self-Evolution Clustering.

Authors:  Jianlong Chang; Gaofeng Meng; Lingfeng Wang; Shiming Xiang; Chunhong Pan
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2018-12-27       Impact factor: 6.226

4.  Voxel-FPN: Multi-Scale Voxel Feature Aggregation for 3D Object Detection from LIDAR Point Clouds.

Authors:  Hongwu Kuang; Bei Wang; Jianping An; Ming Zhang; Zehan Zhang
Journal:  Sensors (Basel)       Date:  2020-01-28       Impact factor: 3.576
