Jia Zhao1, Bingfei Mao1, Hengran Meng1, Liping Wu2, Jingpeng Li3.
Abstract
Because thermal infrared sport targets carry rich and complex semantic information, different types of features are highly coupled. To address this limitation, we propose a non-global decoupled attention method, the local U-shaped attention decoupling network (LUANets), which aims to decompose the coupling between different sport target features in thermal infrared images and to establish effective spatial dependence between them. The method first takes the captured multi-scale initial features, grouped by level, and feeds them into a local decoupling module with a U-shaped attention structure to decompose semantic details. Because different targets are correlated, prior knowledge is used repeatedly as guiding information during feature decomposition to establish effective spatial dependence. Second, we design a bidirectional cross-aggregation FPN module that cross-aggregates information flows in the forward and backward directions, achieving feature interaction while further reducing the coupling between different types of features. Evaluation on the TIIs, SportFCs and FLIR datasets shows that LUANets achieves the best detection performance, with mAP of 68.72%, 59.51% and 65.29%, respectively.
Year: 2022 PMID: 35793275 PMCID: PMC9258899 DOI: 10.1371/journal.pone.0270376
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. (a) The overall structure of the LUANets detection framework. (b) The LUADM of C5. (c) The bidirectional cross-aggregation FPN (BCFPN) module. f represents the features extracted by Res2Net-101 [35]. C3, C4 and C5 represent convolution operations with different hole (dilation) coefficients, and the output features can be defined as f, f, f. EN and De represent the encoder and decoder, respectively. GA(⋯) represents the global attention operation. f, f, f represent different levels of feature information, such as low, middle, and high. Conv1 × 1 denotes a convolution with a 1 × 1 kernel, followed by ReLU + BN and other operations. represents the feature map connection. MP2,3,5(⋅) represents max pooling operations at separate scales. ⊕ denotes a skip connection. {P5, ⋯, P7} represents a pyramid structure of different levels.
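The MP2,3,5(⋅) and feature map connection steps named in the caption can be illustrated with a minimal sketch. It assumes stride-1 pooling with windows clipped at the border, so that every scale keeps the input size, and it models the connection as stacking the pooled maps as channels. The stride, padding, and tensor layout actually used in LUANets are not given here, and `max_pool_same` and `multi_scale_pool` are hypothetical names introduced only for this illustration.

```python
from typing import List, Sequence

Grid = List[List[float]]


def max_pool_same(feature: Grid, k: int) -> Grid:
    """Stride-1 max pooling with a k x k window clipped at the borders.

    The stride and border handling are assumptions made for this sketch.
    """
    h, w = len(feature), len(feature[0])
    before, after = (k - 1) // 2, k // 2  # window extent around each position
    out: Grid = []
    for i in range(h):
        row = []
        for j in range(w):
            r0, r1 = max(0, i - before), min(h, i + after + 1)
            c0, c1 = max(0, j - before), min(w, j + after + 1)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        out.append(row)
    return out


def multi_scale_pool(feature: Grid, kernels: Sequence[int] = (2, 3, 5)) -> List[Grid]:
    """Pool at each kernel size and stack the results as channels."""
    return [max_pool_same(feature, k) for k in kernels]


feature = [
    [1.0, 2.0, 0.0],
    [0.0, 5.0, 1.0],
    [3.0, 0.0, 4.0],
]
stacked = multi_scale_pool(feature)  # three 3 x 3 maps, one per kernel size
```

On this 3 × 3 input the 5 × 5 window covers the whole map at every position, so the coarsest channel is constant, while the 2 × 2 channel still preserves local structure. That contrast is the reason for pooling at several scales before connecting the maps.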
Experimental results of different detection methods.
| Model | Backbone | TIIs R (%) | TIIs mAP (%) | SportFCs R (%) | SportFCs mAP (%) | FLIR R (%) | FLIR mAP (%) |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | X-101-64-4d-FPN | 66.85 | 65.08 | 58.90 | 55.85 | 62.33 | 60.15 |
| Cascade R-CNN | X-101-64-4d-FPN | 67.86 | 66.91 | 59.26 | 58.19 | 64.67 | 62.09 |
| DETR | R50-8-2 | 64.02 | 62.84 | 55.96 | 53.07 | 59.99 | 56.84 |
| YOLOF | R50-c5-8-8 | 66.75 | 64.56 | 57.87 | 55.08 | 61.79 | 58.15 |
| AutoAssign | R50-FPN-8-2 | 65.31 | 63.96 | 56.72 | 54.61 | 60.32 | 57.62 |
| PVTv2 | pvtv2-b2-FPN | 68.56 | 67.93 | 59.45 | 58.63 | 65.55 | 62.97 |
| VFNet | X-101-64-4d-FPN-C3-C5 | 68.47 | 67.84 | 59.39 | 58.23 | 65.18 | 62.36 |
LUANets represents our proposed detection framework. MGUN represents our proposed local embedded U-shaped module. BCFPN represents our proposed bidirectional cross-aggregation FPN.
Detection results of the different components.
| Model | Backbone | R (%) | mAP (%) |
|---|---|---|---|
| | ResNet-50 | 64.40 | 62.97 |
| | ResNet-101 | 65.59 | 63.48 |
| | ResNet-50-DCN | 65.99 | 63.76 |
| | ResNet-101-DCN | 66.52 | 65.17 |
| | | | |
| | MGUN | 66.79 | 65.44 |
| | BCFPN | 67.15 | 65.74 |
| | | | |
| | | 67.58 | 66.27 |
| | | 68.07 | 66.55 |
| | | 66.98 | 65.82 |
| | MGUN+FPN | 68.44 | 66.96 |
| | MGUN+PCN | 68.87 | 67.24 |
| | | | |
| | NoMGUN | 65.12 | 63.88 |
| | NoBCFPN | 62.99 | 60.01 |
| | NoDCN | 68.33 | 66.95 |
| | MGUN+BCFPN | 69.91 | 68.72 |
MGUN represents our multilevel local embedding U-shaped model. BCFPN represents our proposed bidirectional cross-aggregation FPN module. FPN stands for Feature Pyramid Network. MLF stands for a multilevel feature extractor, which mainly includes low-level, middle-level and high-level semantic information. IS represents the internal structure of the MGUN and BCFPN modules. In IFE, MGUN and BCFPN remain unchanged, and only the initial feature extractor is changed. In MLF, the initial feature extractor Res2Net-101+DCN remains unchanged, and only the representation of the multilevel information is changed. In IS, the original feature extractor remains unchanged, and the internal structure of the multilevel information extraction component is changed. For example, MFCN1,2 means that only low-level and middle-level multiscale information is used. PCN stands for pyramid convolution structure. DCN indicates C3-C5.
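The last block of the component table can be read as percentage-point drops relative to the full MGUN+BCFPN model. The sketch below does that subtraction using only values copied from the table. It assumes that the NoMGUN, NoBCFPN and NoDCN rows denote the full model with that one component removed, which the source does not state explicitly; `drops` is a hypothetical helper name.

```python
# (R %, mAP %) values copied from the component table.
FULL = ("MGUN+BCFPN", 69.91, 68.72)
ABLATIONS = {
    "NoMGUN": (65.12, 63.88),
    "NoBCFPN": (62.99, 60.01),
    "NoDCN": (68.33, 66.95),
}


def drops(full, ablations):
    """Return {variant: (recall drop, mAP drop)} in percentage points."""
    _, full_r, full_map = full
    return {
        name: (round(full_r - r, 2), round(full_map - m, 2))
        for name, (r, m) in ablations.items()
    }


result = drops(FULL, ABLATIONS)
for name, (d_r, d_map) in result.items():
    print(f"{name}: R -{d_r:.2f} pts, mAP -{d_map:.2f} pts")
# NoMGUN: R -4.79 pts, mAP -4.84 pts
# NoBCFPN: R -6.92 pts, mAP -8.71 pts
# NoDCN: R -1.58 pts, mAP -1.77 pts
```

Under that reading, removing BCFPN costs the most (8.71 mAP points), followed by MGUN (4.84) and DCN (1.77).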
Fig 2. Visualization of different components.
(a) The original image. (b) MFCN2,3 + BCFPN. (c) and (d) MFCN1,2 + BCFPN and MFCN1,3 + BCFPN, respectively. (e) The BDFPN component. (f) The proposed LUANets (MGUN + BCFPN), which includes three levels of detailed information: low-level, middle-level, and high-level.
Detection results of the different components.
| Model | 10% | 20% | 30% | 40% |
|---|---|---|---|---|
| Faster R-CNN | 17.43 | 20.14 | 25.94 | 38.78 |
| Cascade R-CNN | 18.07 | 20.51 | 27.34 | 40.15 |
| DETR | 15.65 | 17.98 | 22.56 | 34.99 |
| YOLOF | 16.96 | 20.01 | 25.38 | 37.95 |
| AutoAssign | 16.31 | 19.46 | 24.77 | 36.88 |
| PVTv2 | 18.86 | 20.78 | 27.42 | 40.53 |
| VFNet | 18.32 | 20.41 | 26.99 | 40.02 |
| LUANets | 18.09 | 21.56 | 30.19 | 42.77 |
Fig 3. Detection efficiency of the different models.
M indicates the model parameter size in megabytes, and s indicates the time in seconds taken to detect 100 images.
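Because the figure reports time per 100 images, a small conversion helps when comparing against throughput figures quoted elsewhere. The sketch below assumes s is measured in seconds; the function names and the 12.5 s input are hypothetical and are not taken from the figure.

```python
def images_per_second(seconds_per_100: float) -> float:
    """Convert time per 100 images (seconds) to throughput in images/s."""
    if seconds_per_100 <= 0:
        raise ValueError("seconds_per_100 must be positive")
    return 100.0 / seconds_per_100


def ms_per_image(seconds_per_100: float) -> float:
    """Convert time per 100 images (seconds) to latency per image in ms."""
    if seconds_per_100 <= 0:
        raise ValueError("seconds_per_100 must be positive")
    return seconds_per_100 * 10.0  # (s / 100 images) * 1000 ms/s


# Hypothetical reading of 12.5 s per 100 images, for illustration only.
example = 12.5
fps = images_per_second(example)   # 8.0 images/s
latency = ms_per_image(example)    # 125.0 ms per image
```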