Jianfang Cao, Lichao Chen, Chenyan Wu, Zibang Zhang.
Abstract
With the rapid development of the Internet and the growing popularity of mobile devices, digital image resources are increasing exponentially, and rapidly and effectively retrieving and organizing image information has become a pressing problem. Within image retrieval, automatic image annotation remains a basic and challenging task. To address the low accuracy and high memory consumption of current multilabel annotation methods, this study proposes a CM-supplement network model that combines the merits of cavity convolutions, Inception modules and a supplement network. Replacing common convolutions with cavity convolutions enlarges the receptive field without increasing the number of parameters; incorporating Inception modules lets the model extract image features at multiple scales with less memory consumption than before; and the supplement network allows the model to capture the negative features of images. After 100 training iterations on the PASCAL VOC 2012 dataset, the proposed model achieved an overall annotation accuracy of 94.5%, an improvement of 10.0 and 1.1 percentage points over a traditional convolutional neural network (CNN) and a double-channel CNN (DCCNN), respectively. After stabilization, the model reached an accuracy of up to 96.4%. Moreover, the DCCNN uses more than 1.5 times as many parameters as the CM-supplement network. Without consuming additional memory resources, the proposed CM-supplement network therefore achieves comparable or even better annotation performance than a DCCNN.
Year: 2020 PMID: 32479515 PMCID: PMC7263637 DOI: 10.1371/journal.pone.0234014
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
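The abstract's central claim, that replacing common convolutions with cavity convolutions (the paper's term for what the wider literature calls dilated or atrous convolutions) enlarges the receptive field without adding parameters, can be checked with a short sketch. The kernel sizes and channel counts below are illustrative, not taken from the paper.

```python
def effective_kernel(k: int, d: int) -> int:
    """Receptive-field extent of a k x k kernel applied with dilation d."""
    return k + (k - 1) * (d - 1)

def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus one bias per output channel for a k x k convolution."""
    return (k * k * c_in + 1) * c_out

# A 3x3 kernel with dilation 2 covers the same 5x5 region as a plain
# 5x5 kernel, but keeps the parameter count of a 3x3 kernel:
print(effective_kernel(3, 2))      # 5
print(conv_params(3, 16, 32))      # 4640 (dilated 3x3)
print(conv_params(5, 16, 32))      # 12832 (plain 5x5)
```

Dilation changes only the sampling stride inside the kernel, so the memory for weights is unchanged; this is the mechanism the abstract credits for the enlarged receptive field.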
Fig 1. Structure of the supplement CNN.
Fig 2. The CM-supplement model.
Fig 3. Schematic diagram of the image annotation part of the CM-supplement network.
Comparisons of the annotation accuracy rates for each category of the PASCAL VOC 2012 dataset based on different algorithms after 100 training iterations.
| Image category | Annotation accuracy rate | ||||||
|---|---|---|---|---|---|---|---|
| CNN | DCCNN [ | Method from reference [ | Method from reference [ | Method from reference [ | Method from reference [ | Method proposed in this study | |
| plane | 0.983 | 0.999 | 0.919 | 0.985 | 0.993 | 0.981 | 1.0 |
| bike | 0.877 | 0.973 | 0.448 | 0.902 | 0.973 | 0.452 | 0.978 |
| bird | 0.918 | 0.984 | 0.912 | 0.925 | 0.981 | 0.911 | 0.991 |
| boat | 0.920 | 0.972 | 0.639 | 0.919 | 0.971 | 0.647 | 0.976 |
| bottle | 0.722 | 0.892 | 0.750 | 0.822 | 0.889 | 0.720 | 0.892 |
| bus | 0.920 | 0.980 | 0.928 | 0.925 | 0.978 | 0.919 | 0.985 |
| car | 0.819 | 0.939 | 0.874 | 0.832 | 0.939 | 0.873 | 0.943 |
| cat | 0.916 | 0.970 | 0.921 | 0.934 | 0.965 | 0.922 | 0.978 |
| chair | 0.668 | 0.804 | 0.345 | 0.708 | 0.803 | 0.408 | 0.823 |
| cow | 0.999 | 1.0 | 0.851 | 0.999 | 0.999 | 0.862 | 1.0 |
| dining table | 0.570 | 0.757 | 0.693 | 0.695 | 0.748 | 0.627 | 0.803 |
| dog | 0.894 | 0.971 | 0.871 | 0.898 | 0.976 | 0.883 | 0.992 |
| horse | 0.927 | 0.978 | 0.886 | 0.938 | 0.969 | 0.876 | 0.990 |
| motorbike | 0.849 | 0.931 | 0.743 | 0.866 | 0.942 | 0.746 | 0.977 |
| person | 0.871 | 0.957 | 0.772 | 0.894 | 0.956 | 0.791 | 0.969 |
| potted plant | 0.729 | 0.881 | 0.654 | 0.805 | 0.878 | 0.639 | 0.880 |
| sheep | 0.960 | 0.993 | 0.851 | 0.963 | 0.991 | 0.849 | 0.991 |
| sofa | 0.618 | 0.827 | 0.332 | 0.666 | 0.824 | 0.307 | 0.828 |
| train | 1.0 | 0.999 | 0.813 | 1.0 | 1.0 | 0.825 | 1.0 |
| tv monitor | 0.746 | 0.866 | 0.721 | 0.765 | 0.867 | 0.738 | 0.896 |
| MAP value | 0.845 | 0.934 | 0.746 | 0.872 | 0.932 | 0.749 | 0.945 |
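The MAP row above is the arithmetic mean of the 20 per-category rates; a quick check with the last column (the proposed method) transcribed from the table confirms the reported 0.945:

```python
# Per-category accuracy rates of the proposed method after 100 training
# iterations (last column of the table above, plane through tv monitor).
proposed = [
    1.0, 0.978, 0.991, 0.976, 0.892, 0.985, 0.943, 0.978, 0.823, 1.0,
    0.803, 0.992, 0.990, 0.977, 0.969, 0.880, 0.991, 0.828, 1.0, 0.896,
]
map_value = sum(proposed) / len(proposed)
print(round(map_value, 3))  # 0.945, matching the table's MAP row
```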
Fig 4. Comparisons of the MAPs of the CM-supplement model and other methods based on convolutional neural networks (CNN, DCCNN and references [29] and [32]).
The horizontal axis represents the number of iterations implemented during network training, and the vertical axis represents the MAP value. MAP, mean average precision.
Comparisons of the annotation accuracy rates for each category in the PASCAL VOC 2012 dataset after stabilization training.
| Image category | Annotation accuracy rate | ||||||
|---|---|---|---|---|---|---|---|
| CNN | DCCNN [ | Method from reference [ | Method from reference [ | Method from reference [ | Method from reference [ | Method proposed in this study | |
| plane | 0.983 | 0.999 | 0.924 | 0.989 | 0.992 | 0.981 | 1.0 |
| bike | 0.877 | 0.973 | 0.451 | 0.902 | 0.969 | 0.455 | 0.979 |
| bird | 0.977 | 0.984 | 0.946 | 0.978 | 0.980 | 0.916 | 0.996 |
| boat | 0.920 | 0.972 | 0.652 | 0.935 | 0.968 | 0.653 | 0.977 |
| bottle | 0.879 | 0.919 | 0.758 | 0.886 | 0.903 | 0.727 | 0.920 |
| bus | 0.971 | 0.980 | 0.951 | 0.925 | 0.978 | 0.928 | 0.986 |
| car | 0.949 | 0.987 | 0.891 | 0.949 | 0.985 | 0.890 | 0.987 |
| cat | 0.955 | 0.970 | 0.923 | 0.967 | 0.964 | 0.922 | 0.979 |
| chair | 0.794 | 0.893 | 0.390 | 0.825 | 0.883 | 0.415 | 0.917 |
| cow | 0.999 | 1.0 | 0.857 | 0.999 | 0.999 | 0.864 | 1.0 |
| dining table | 0.825 | 0.885 | 0.704 | 0.827 | 0.889 | 0.699 | 0.919 |
| dog | 0.905 | 0.971 | 0.886 | 0.909 | 0.970 | 0.895 | 0.996 |
| horse | 0.927 | 0.978 | 0.894 | 0.940 | 0.968 | 0.893 | 0.995 |
| motorbike | 0.849 | 0.931 | 0.761 | 0.871 | 0.941 | 0.757 | 0.979 |
| person | 0.897 | 0.957 | 0.794 | 0.899 | 0.945 | 0.799 | 0.973 |
| potted plant | 0.829 | 0.881 | 0.658 | 0.838 | 0.873 | 0.668 | 0.910 |
| sheep | 0.960 | 0.993 | 0.862 | 0.963 | 0.991 | 0.864 | 0.996 |
| sofa | 0.719 | 0.816 | 0.339 | 0.782 | 0.810 | 0.379 | 0.864 |
| train | 0.989 | 1.0 | 0.818 | 1.0 | 0.993 | 0.825 | 1.0 |
| tv monitor | 0.816 | 0.856 | 0.729 | 0.835 | 0.852 | 0.741 | 0.903 |
| MAP value | 0.901 | 0.947 | 0.759 | 0.911 | 0.943 | 0.764 | 0.964 |
Fig 5. Comparisons of the annotation accuracy rates of the three networks for the PASCAL VOC 2012 dataset after stabilization.
The number of parameters in each layer of the DCCNN.
| DCCNN | Parameters corresponding to each layer of the channel | Number of parameters | |
|---|---|---|---|
| 256*256 input image | Channel 1 | Channel 2 | |
| Convolution layer 1 | Convolution kernel [10, 10, 3, 20] | Convolution kernel [12, 12, 3, 20] | ((10*10*3+1)+(12*12*3+1))*20 = 14680 |
| Convolution layer 2 | Convolution kernel [5, 5, 20, 40] | Convolution kernel [5, 5, 20, 40] | ((5*5+1)+(5*5+1))*40 = 2080 |
| Convolution layer 3 | Convolution kernel [6, 6, 40, 60] | Convolution kernel [5, 5, 40, 60] | ((6*6+1)+(5*5+1))*60 = 3780 |
| Fully connected layer 1 | [6*6*60, 1000] | [5*5*60, 1000] | (6*6*60+5*5*60)*1000+2000 = 3662000 |
| Fully connected layer 2 | [1000, 1000] | [1000, 1000] | 1000*1000*2+2000 = 2002000 |
| Output | [1000, 20] | [1000, 20] | 1000*20*2 = 40000 |
| Total number of parameters | 14680+2080+3780+3662000+2002000+40000 = 5724540 | ||
The number of parameters of each layer of the CM-supplement network.
| CM-supplement network | Parameters in each layer | Number of parameters |
|---|---|---|
| 256*256 input image | ||
| Cavity convolution layer 1 | Convolution kernel [10, 10, 3, 20] | (10*10*3+1)*20 = 6020 |
| Cavity convolution layer 2 | Convolution kernel [5, 5, 20, 40] | (5*5+1)*40 = 1040 |
| Cavity convolution layer 3 | Convolution kernel [5, 5, 40, 60] | (5*5+1)*60 = 1560 |
| Inception v1 | (1*1+1)*4+3*3+5*5 | (1*1+1)*4+3*3+5*5+2 = 44 |
| Fully connected layer 1 | [10*1*256, 1000] | 10*1*256*1000+1000 = 2561000 |
| Fully connected layer 2 | [1000, 1000] | 1000*1000+1000 = 1001000 |
| Output | [1000, 20] | 1000*20 = 20000 |
| Total number of parameters | 6020+1040+1560+44+2561000+1001000+20000 = 3590664 | |
For the parameters and the parameter counts:
(1) Convolutional layer: the kernel is specified as [c_height, c_width, channel_input, channel_output], i.e., kernel height, kernel width, number of input channels and number of output channels.
Number of parameters in the convolutional layer = (c_height*c_width*channel_input+1)*channel_output
(2) Number of parameters in the Inception v1 layer = number of feature maps * size of the convolution kernel
(3) Fully connected layer: [fc_inputfeature, fc_outputfeature], i.e., the numbers of input and output feature maps.
Number of parameters in the fully connected layer = fc_inputfeature*fc_outputfeature+fc_outputfeature
(4) Output layer: [out_inputfeature, cata_num], i.e., the number of input feature maps and the number of image categories.
Number of parameters in the output layer = out_inputfeature*cata_num
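As a sanity check, formulas (1), (3) and (4) can be coded directly and compared against rows of the CM-supplement table above (a minimal sketch; formula (2) for the Inception layer is omitted because the table applies it with an extra additive term):

```python
def conv_layer_params(c_height, c_width, channel_input, channel_output):
    # Formula (1): one bias per output channel
    return (c_height * c_width * channel_input + 1) * channel_output

def fc_layer_params(fc_inputfeature, fc_outputfeature):
    # Formula (3): weights plus one bias per output feature
    return fc_inputfeature * fc_outputfeature + fc_outputfeature

def output_layer_params(out_inputfeature, cata_num):
    # Formula (4): weights only, no bias term
    return out_inputfeature * cata_num

# Checks against rows of the CM-supplement network table:
print(conv_layer_params(10, 10, 3, 20))  # 6020   (cavity convolution layer 1)
print(fc_layer_params(1000, 1000))       # 1001000 (fully connected layer 2)
print(output_layer_params(1000, 20))     # 20000   (output layer)
```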
Comparisons of the annotation effects from different algorithms for high-frequency words.
| Image example | Auto-annotation outcomes | ||
|---|---|---|---|
| DCCNN [ | Reference [ | Method proposed in this study | |
| people | people | people, horse | |
| dog, people, cat | dog, people | dog, people, cat | |
| boat | boat | boat, people | |
| train | train, people | train, people | |
| bird | bird, people | bird, people | |
| people, train | people | people, train | |
| dog, potted plant | dog, potted plant | dog, potted plant | |
| boat, people | boat, people | boat, cow, people | |
| plane, people | plane, people | plane, people | |
| train | train | train, bus | |
All images in this table are photographs taken by the authors' team. For copyright reasons, they are similar but not identical to the original PASCAL VOC 2012 images and are shown for illustrative purposes only.
Comparisons of the annotation effects from different algorithms for low-frequency words.
| Image example | Auto-annotation outcomes | ||
|---|---|---|---|
| DCCNN [ | Reference [ | Method proposed in this study | |
| people | people | people, sofa, chair | |
| sofa | sofa | sofa, potted plant, chair, tv monitor | |
| chair, sofa | sofa | chair, sofa, dining table | |
| people, car | people, car | people, car, tv monitor | |
| chair, dining table | chair, dining table | potted plant, chair, dining table, sofa | |
| dining table, chair | dining table | dining table, chair, potted plant, sofa | |
| motorbike | motorbike | people, motorbike, chair | |
| people, bus | people, bus | people, bus, car, bike | |
| chair, tv monitor | chair, tv monitor | sofa, chair, tv monitor, dining table | |
| chair, people | chair, people | dog, chair, people, sofa, potted plant | |
All images in this table are photographs taken by the authors' team. For copyright reasons, they are similar but not identical to the original PASCAL VOC 2012 images and are shown for illustrative purposes only.