Shuyu Wang, Mingxin Zhao, Runjiang Dou, Shuangming Yu, Liyuan Liu, Nanjian Wu.
Abstract
Image demosaicking is an essential and challenging step of the image processing pipeline behind image sensors. With the rapid development of deep-learning-based intelligent processors, several demosaicking methods based on convolutional neural networks (CNNs) have been proposed. However, their large numbers of model parameters make it difficult for these networks to run in real time on edge computing devices. This paper presents a compact demosaicking neural network based on the UNet++ structure. The network inserts densely connected layer blocks and adopts Gaussian smoothing layers instead of down-sampling operations before the backbone network. The densely connected blocks extract mosaic-image features efficiently by exploiting the correlation between feature maps, and they adopt depthwise separable convolutions to reduce the number of model parameters; the Gaussian smoothing layers expand the receptive fields without down-sampling the image or discarding image information. The size constraints on the input and output images can also be relaxed, and the quality of the demosaicked images is improved. Experimental results show that the proposed network improves running speed by 42% compared with the fastest CNN-based method while achieving comparable reconstruction quality on four mainstream datasets. Moreover, when inference is performed on the demosaicked images with typical deep CNNs, MobileNet v1 and SSD, the accuracy reaches 85.83% (top-5) and 75.44% (mAP), comparable to existing methods. The proposed network has the highest computing efficiency and the lowest parameter count among all compared methods, demonstrating that it is well suited to applications on modern edge computing devices.
Keywords: U-Net; bayer color filter array; convolutional neural network; edge computing; image demosaicking; image sensor
Year: 2021 PMID: 34066794 PMCID: PMC8125912 DOI: 10.3390/s21093265
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The UNet++ structure. It contains the basic U-Net backbone, shown as gray circles in the figure. First, the input images are down-sampled through pooling layers at each level; the feature maps at each node are then up-sampled and cross-connected with later nodes. In addition, UNet++ can be pruned during inference if trained with deep supervision. The sub-network at each level is shown on the left.
Figure 2. Network structure: the yellow box indicates the image feature extraction part, and the gray box indicates the image reconstruction part.
Figure 3. The first two cross-layer connections in each Dense Unit concatenate feature maps along the channel dimension, while the cross-layer connection before the final output sums the feature-map values. Since a normal dense block is composed of a large number of ordinary convolutions, its parameter count can be large. Therefore, all the convolutions in our densely connected layers use depthwise separable convolutions [40], which further reduces the number of network parameters.
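As a rough illustration of the saving, a depthwise separable convolution factorizes a standard convolution into a per-channel spatial filter plus a 1 × 1 pointwise mixing step. The sketch below compares parameter counts; the channel sizes are illustrative choices, not taken from the paper:

```python
def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution mixes space and channels in one step.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel (spatial filtering only),
    # followed by a 1 x 1 pointwise convolution that mixes the channels.
    return k * k * c_in + c_in * c_out

# Illustrative sizes only (not taken from the paper).
k, c_in, c_out = 3, 32, 32
print(standard_conv_params(k, c_in, c_out))        # 9216
print(depthwise_separable_params(k, c_in, c_out))  # 1312
```

For a 3 × 3 kernel with 32 input and output channels, the factorized form needs roughly one seventh of the parameters, which is where most of the network's size reduction comes from.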
Details of the parameters nf and gc in each Dense Unit used in our network.
| Node | nf | gc |
|---|---|---|
| 0–0 | 8 | 4 |
| 1–0 | 16 | 8 |
| 2–0 | 32 | 16 |
| 3–0 | 64 | 32 |
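One plausible reading of the nf/gc bookkeeping above (the exact number of layers inside a Dense Unit is an assumption made here for illustration) is that each densely connected layer emits gc new feature maps and receives the concatenation of the unit input and all earlier outputs. A NumPy shape sketch for node 0–0:

```python
import numpy as np

# Channel bookkeeping for node 0-0 (nf = 8, gc = 4). The layer count inside
# a Dense Unit is an assumption; the point is how concatenation grows the
# channel dimension by gc per layer.
nf, gc, H, W = 8, 4, 16, 16
x = np.zeros((H, W, nf))

f1 = np.zeros((H, W, gc))
in2 = np.concatenate([x, f1], axis=-1)      # nf + gc   = 12 channels
f2 = np.zeros((H, W, gc))
in3 = np.concatenate([x, f1, f2], axis=-1)  # nf + 2*gc = 16 channels
print(in2.shape[-1], in3.shape[-1])  # 12 16
```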
Network details of each node in Figure 2, including the convolutional kernel size of the reconstruction nodes, the input size, and the output size of feature maps.
| Node | Kernel Size | Input Size | Output Size |
|---|---|---|---|
| 0–0 | - | H × W × 4 | H × W × 8 |
| 0–1 | 3 × 3 × 16 × 16; 1 × 1 × 16 × 8 | H × W × 16 | H × W × 8 |
| 0–2 | 3 × 3 × 16 × 16; 1 × 1 × 16 × 8 | H × W × 16 | H × W × 8 |
| 0–3 | 3 × 3 × 16 × 16; 1 × 1 × 16 × 8 | H × W × 16 | H × W × 8 |
| 1–0 | - | H × W × 4 | H × W × 16 |
| 1–1 | 3 × 3 × 32 × 32; 1 × 1 × 32 × 16 | H × W × 32 | H × W × 16 |
| 1–2 | 3 × 3 × 32 × 32; 1 × 1 × 32 × 16 | H × W × 32 | H × W × 16 |
| 2–0 | - | H × W × 4 | H × W × 32 |
| 2–1 | 3 × 3 × 64 × 64; 1 × 1 × 64 × 32 | H × W × 64 | H × W × 32 |
| 3–0 | - | H × W × 4 | H × W × 64 |
Figure 4. (a) The channel (or color) position chosen in a 2 × 2 window. (b) The process of generating the network input from an original RGB image. Note that the two shades of green in the middle of (b) both represent the green channel; they are drawn with different intensities only to make the image-generation process clearer.
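The packing step in (b) can be sketched as follows. An RGGB layout is assumed here (the paper specifies only a Bayer CFA); each position of the 2 × 2 window becomes one input channel:

```python
import numpy as np

def pack_bayer_rggb(mosaic):
    """Split a single-channel Bayer mosaic (RGGB layout assumed) into a
    4-channel tensor, one channel per position in the 2x2 CFA window."""
    r  = mosaic[0::2, 0::2]  # top-left: red samples
    g1 = mosaic[0::2, 1::2]  # top-right: green samples
    g2 = mosaic[1::2, 0::2]  # bottom-left: green samples
    b  = mosaic[1::2, 1::2]  # bottom-right: blue samples
    return np.stack([r, g1, g2, b], axis=-1)

mosaic = np.arange(16, dtype=np.float32).reshape(4, 4)
packed = pack_bayer_rggb(mosaic)
print(packed.shape)  # (2, 2, 4)
```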
The PSNR values (dB) of each channel (R, G, B) and the CPSNRs of the whole image (RGB) for 9 demosaicking algorithms on the Kodak, McMaster, Urban100, and Manga109 datasets, together with their averages. The two methods bolded in each section are those that achieve the best performance among all algorithms.
| Dataset | Ch | AHD [ | DLMMSE [ | RI [ | MLRI [ | ARI [ | Tan [ | Kok 1 [ | Cui [ | Ours(L1) | Ours(L2) | Ours(L3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kod | R | 36.88 | 38.47 | 37.83 | 38.87 | 39.11 | 41.11 | 41.30 | 41.98 | 40.29 | 40.88 | 41.22 |
| | G | 39.59 | 42.65 | 41.03 | 41.86 | 42.33 | 44.86 | 45.96 | 45.10 | 43.44 | 44.00 | 44.34 |
| | B | 37.37 | 38.53 | 37.80 | 38.86 | 38.77 | 40.80 | 41.29 | 41.04 | 39.68 | 40.24 | 40.59 |
| | RGB | 37.74 | 39.36 | 38.57 | 39.58 | 39.75 | 41.82 | | | 40.82 | 41.38 | 41.72 |
| McM | R | 33.01 | 33.13 | 36.12 | 36.38 | 37.44 | 38.54 | 39.93 | 39.70 | 36.64 | 37.56 | 38.01 |
| | G | 36.99 | 38.00 | 40.00 | 39.91 | 40.73 | 41.95 | 42.65 | 42.63 | 39.61 | 40.31 | 40.74 |
| | B | 32.16 | 31.84 | 35.37 | 35.38 | 36.07 | 37.14 | 38.00 | 37.72 | 35.25 | 35.90 | 36.33 |
| | RGB | 33.50 | 33.55 | 36.50 | 36.65 | 37.54 | 38.67 | | | 36.75 | 37.48 | 37.91 |
| Urb | R | 32.63 | 33.91 | 33.72 | 36.38 | 34.63 | 37.14 | 38.68 | 37.71 | 35.54 | 36.30 | 36.84 |
| | G | 35.62 | 37.65 | 36.67 | 39.91 | 38.03 | 40.94 | 42.33 | 41.42 | 39.31 | 39.99 | 40.46 |
| | B | 32.87 | 33.92 | 33.90 | 35.38 | 34.79 | 37.13 | 38.54 | 37.70 | 35.59 | 36.30 | 36.90 |
| | RGB | 33.42 | 34.73 | 34.48 | 36.65 | 35.49 | 38.00 | | | 36.44 | 37.16 | 37.70 |
| Man | R | 32.01 | 32.71 | 34.68 | 36.38 | 35.58 | 37.31 | 38.00 | 38.16 | 36.11 | 36.90 | 37.27 |
| | G | 38.14 | 39.45 | 40.31 | 39.91 | 40.30 | 43.23 | 43.35 | 43.60 | 40.55 | 41.36 | 41.93 |
| | B | 33.10 | 33.23 | 35.10 | 35.38 | 35.34 | 37.37 | 36.16 | 37.68 | 35.97 | 36.62 | 37.01 |
| | RGB | 33.55 | 34.11 | 35.88 | 36.65 | 36.43 | | 38.17 | | 37.04 | 37.77 | 38.17 |
| Ave. | R | 33.63 | 34.55 | 35.59 | 37.00 | 36.69 | 38.53 | 39.48 | 39.39 | 37.14 | 37.91 | 38.33 |
| | G | 37.59 | 39.44 | 39.50 | 40.40 | 40.35 | 42.75 | 43.57 | 43.19 | 40.73 | 41.41 | 41.87 |
| | B | 33.88 | 34.38 | 35.54 | 36.25 | 36.24 | 38.11 | 38.50 | 38.54 | 36.62 | 37.27 | 37.71 |
| | RGB | 34.55 | 35.44 | 36.36 | 37.38 | 37.30 | 39.24 | | | 37.76 | 38.45 | 38.88 |
1 Due to the limitations of the table format, we abbreviate ‘Kokkinos’ to ‘Kok’.
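For reference, the PSNR and CPSNR figures above are both computed from a pooled mean squared error; a minimal NumPy sketch (passing a single channel yields the per-channel PSNR, passing the full H × W × 3 image yields the CPSNR):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """PSNR in dB over whatever array is passed in: one channel gives the
    per-channel R/G/B figure, the full H x W x 3 image gives the CPSNR
    (a single MSE pooled across all three channels)."""
    err = ref.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(err ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(8, 8, 3)).astype(np.float64)
test = ref + 1.0  # uniform error of 1 LSB -> MSE = 1 -> PSNR ~ 48.13 dB
print(round(psnr(ref, test), 2))  # 48.13
```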
Figure 5. Visual comparison of the demosaicking results on 5 representative images from Kodak, McMaster, Urban100, and Manga109. (a) ‘kodim19’ from Kodak [44]. (b) ‘kodim08’ from Kodak [44]. (c) ‘1’ from McMaster [45]. (d) ‘img_008’ from Urban100 [46]. (e) ‘BEMADER_P’ from Manga109 [47].
The average running time on a 500 × 500 image from McMaster, and the number and size (in single-precision floating-point format) of parameters for the CNN-based models. The two methods bolded in each section are those that achieve the best performance among all algorithms.
| Algorithm | Running Time (s) | Param. Number | Param. Size (MB) |
|---|---|---|---|
| AHD [ | 0.48 | - | - |
| DLMMSE [ | 234.78 | - | - |
| RI [ | | - | - |
| MLRI [ | 0.20 | - | - |
| ARI [ | 3.66 | - | - |
| Tan [ | 0.42 | 528,518 | 2.02 |
| Kokkinos [ | 0.87 | 380,356 | 1.45 |
| Cui [ | 1.19 | 1,793,032 | 6.84 |
| Ours (L1) | | | 0.04 |
| Ours (L2) | 0.17 | | 0.18 |
| Ours (L3) | 0.24 | 183,628 | 0.70 |
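The listed sizes are consistent with 4 bytes (one single-precision float) per parameter; a quick arithmetic check against the rows with both figures:

```python
def model_size_mb(num_params, bytes_per_param=4):
    # Storage in MB (2**20 bytes) at 4 bytes per single-precision weight.
    return num_params * bytes_per_param / 2**20

# Figures from the table above.
print(round(model_size_mb(528_518), 2))    # 2.02 (Tan)
print(round(model_size_mb(1_793_032), 2))  # 6.84 (Cui)
print(round(model_size_mb(183_628), 2))    # 0.7  (listed as 0.70, Ours L3)
```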
Figure 6. Scatter plot of the PSNRs against the running time (log scale) for the 9 demosaicking methods.
The accuracy of classification and detection tasks run on the outputs of each demosaicking method. The ‘Origin’ row indicates the original accuracy of the pre-trained MobileNet v1 and SSD300 models. The two methods bolded in each section are those that achieve the best performance among all algorithms.
| Algorithm | MobileNet v1 Top1 (%) | MobileNet v1 Top5 (%) | SSD300 mAP (%) |
|---|---|---|---|
| Origin | 71.11 | 89.84 | 75.77 |
| AHD [ | | 85.67 | 75.41 |
| DLMMSE [ | 64.06 | 85.44 | 75.14 |
| RI [ | 64.25 | 85.65 | 75.16 |
| MLRI [ | 64.36 | 85.70 | 75.21 |
| ARI [ | 64.40 | 85.74 | 75.06 |
| Tan [ | | | |
| Kokkinos [ | 64.43 | 85.76 | |
| Cui [ | 64.50 | 85.80 | 75.49 |
| Ours (L1) | 64.11 | 85.49 | 75.16 |
| Ours (L2) | 64.43 | 85.78 | 75.22 |
| Ours (L3) | 64.56 | 85.83 | 75.44 |
Testing results on the four datasets for network structures with different pooling layers. Only the results for the L3 network are presented, for a clearer comparison of the different structures.
| Dataset | Ch | Avg Pooling (L = 3) | Max Pooling (L = 3) | Gaussian Pooling (L = 3) | Gaussian Smoothing (L = 3) |
|---|---|---|---|---|---|
| Kodak24 | R | 40.95 | 40.74 | 40.86 | 41.22 |
| | G | 44.09 | 43.86 | 44.01 | 44.34 |
| | B | 40.35 | 40.11 | 40.28 | 40.59 |
| | RGB | 41.47 | 41.25 | 41.39 | 41.72 |
| McMaster | R | 37.83 | 37.79 | 37.71 | 38.01 |
| | G | 40.58 | 40.53 | 40.56 | 40.74 |
| | B | 36.07 | 36.03 | 35.97 | 36.33 |
| | RGB | 37.70 | 37.66 | 37.61 | 37.91 |
| Urban100 | R | 36.67 | 36.34 | 36.58 | 36.84 |
| | G | 40.29 | 39.99 | 40.22 | 40.46 |
| | B | 36.68 | 36.32 | 36.56 | 36.90 |
| | RGB | 37.51 | 37.17 | 37.41 | 37.70 |
| Manga109 | R | 36.93 | 36.72 | 36.90 | 37.27 |
| | G | 41.53 | 41.30 | 41.54 | 41.93 |
| | B | 36.75 | 36.54 | 36.74 | 37.01 |
| | RGB | 37.86 | 37.65 | 37.84 | 38.17 |
| Ave. | R | 38.10 | 37.90 | 38.01 | 38.33 |
| | G | 41.62 | 41.42 | 41.58 | 41.87 |
| | B | 37.46 | 37.25 | 37.39 | 37.71 |
| | RGB | 38.64 | 38.43 | 38.56 | 38.88 |
Avg pooling, Max pooling, Gaussian pooling, and Gaussian smoothing denote 2 × 2 average pooling, 2 × 2 max pooling, Gaussian smoothing followed by 2 × 2 down-sampling, and the Gaussian smoothing layer used in this paper, respectively. Since pooling changes the image size, in each image-reconstruction node of the original network we replace the 1 × 1 convolution with up-sampling implemented by a transposed convolution. The network adopting Gaussian smoothing layers extracts image features more efficiently than the other pooling approaches and thus obtains slightly better accuracy, owing to its multi-scale receptive fields and its preservation of the entire image information.
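A Gaussian smoothing layer of the kind described, i.e., a blur that keeps the spatial size instead of being followed by 2 × 2 down-sampling, can be sketched in NumPy as follows; the kernel radius and sigma are illustrative choices, not values from the paper:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    # Normalized 1-D Gaussian kernel of length 2*radius + 1.
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_smooth(img, sigma=1.0, radius=2):
    """Gaussian smoothing that preserves the spatial size: the separable
    kernel is applied along each axis with edge padding, and no 2x2
    down-sampling step follows."""
    k = gaussian_kernel1d(sigma, radius)
    out = img.astype(np.float64)
    for axis in (0, 1):
        out = np.apply_along_axis(
            lambda m: np.convolve(np.pad(m, radius, mode="edge"), k, mode="valid"),
            axis, out)
    return out

img = np.random.default_rng(1).random((16, 16))
print(gaussian_smooth(img).shape)  # (16, 16)
```

Unlike the pooling variants in the table, the output keeps the full H × W resolution, so no image information is discarded before the backbone network.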