Haipeng Xiong1, Zhiguo Cao1, Hao Lu1, Simon Madec2, Liang Liu1, Chunhua Shen3.
Abstract
BACKGROUND: Grain yield of wheat is closely associated with the population of wheat spikes, i.e., spike number m⁻². To obtain this index reliably and efficiently, wheat spikes must be counted accurately and automatically. Computer vision technologies have shown great potential to automate this task effectively in a low-cost manner. In particular, counting wheat spikes is a typical visual counting problem, which has been studied extensively under the name of object counting in computer vision. TasselNet, one of the state-of-the-art counting approaches, is a convolutional neural network-based local regression model that currently holds the best record on counting maize tassels. However, when TasselNet is applied to wheat spikes, it cannot predict accurate counts when spikes are only partially present.
Keywords: Context fusion; Convolutional models; Local regression networks; Object counting; Wheat spikes
Year: 2019 PMID: 31857821 PMCID: PMC6905110 DOI: 10.1186/s13007-019-0537-2
Source DB: PubMed Journal: Plant Methods ISSN: 1746-4811 Impact factor: 4.993
Fig. 1 Challenges of counting wheat spikes in the wild. a different planting regions, b various growth stages, c degraded image quality due to blurring, d visual differences caused by changing illumination, e extremely dense spatial distributions and severe occlusions, f size and pose variations
Fig. 2 Three examples of incomplete objects when only looking at the local patches. White parts are invisible contextual regions for the current visible patches. Black dots mark spikes that lie only partly within the visible area, and red dots mark spikes with severe occlusions. In both cases, accurate spike counts are hard to obtain without the help of local visual context
Fig. 3 A high-level overview of the approach utilizing local visual context information. The red dashed box indicates a local patch ready for counting, and the part outside the box refers to the context
Fig. 4 Imaging device in Zhengzhou, Henan Province. The main components include a high-resolution CCD digital camera (E450 Olympus) and low-resolution monitoring equipment. The camera is mounted 5 m above the ground
Constitution of the WSC dataset
| Sequence | Images | Spikes | Min | Max |
|---|---|---|---|---|
| Hebei Gucheng (2011–2012) | 324 | 82,578 | 0 | 661 |
| Henan Zhengzhou (2011–2012) | 234 | 118,022 | 0 | 1462 |
| Henan Zhengzhou (2012–2013) | 171 | 104,847 | 0 | 1331 |
| Shandong Taian (2011–2012 Camera 1) | 279 | 97,695 | 0 | 1010 |
| Shandong Taian (2011–2012 Camera 2) | 261 | 78,887 | 0 | 908 |
| Shandong Taian (2012–2013 Camera 1) | 234 | 94,454 | 0 | 1090 |
| Shandong Taian (2012–2013 Camera 2) | 261 | 98,839 | 0 | 971 |
| Total | 1764 | 675,322 | 0 | 1462 |
Images denote the number of images in each sequence. Spikes refer to the number of wheat spikes in each sequence. Min and Max indicate the minimum and maximum number of wheat spikes per image
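The totals in the last row can be cross-checked against the per-sequence rows. The following is a minimal sketch of that arithmetic; the dictionary name `WSC_SEQUENCES` and the helper `summarize` are illustrative names introduced here, and the numbers are copied from the table above.

```python
# Per-sequence (images, spikes, max spikes per image), copied from the table above.
# The names WSC_SEQUENCES and summarize are illustrative, not from the source.
WSC_SEQUENCES = {
    "Hebei Gucheng (2011-2012)": (324, 82_578, 661),
    "Henan Zhengzhou (2011-2012)": (234, 118_022, 1462),
    "Henan Zhengzhou (2012-2013)": (171, 104_847, 1331),
    "Shandong Taian (2011-2012 Camera 1)": (279, 97_695, 1010),
    "Shandong Taian (2011-2012 Camera 2)": (261, 78_887, 908),
    "Shandong Taian (2012-2013 Camera 1)": (234, 94_454, 1090),
    "Shandong Taian (2012-2013 Camera 2)": (261, 98_839, 971),
}


def summarize(sequences):
    """Return (total images, total spikes, largest per-image count)."""
    images = sum(v[0] for v in sequences.values())
    spikes = sum(v[1] for v in sequences.values())
    max_per_image = max(v[2] for v in sequences.values())
    return images, spikes, max_per_image


total_images, total_spikes, max_spikes = summarize(WSC_SEQUENCES)
# Average number of annotated spikes per image over the whole dataset.
mean_spikes_per_image = total_spikes / total_images
```

The sums reproduce the Total row (1764 images, 675,322 spikes, maximum 1462), which works out to roughly 383 spikes per image on average.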
Training set (train), validation set (val) and test set (test) settings of the WSC dataset
| Sequence | Train | Val | Test |
|---|---|---|---|
| Hebei Gucheng (2011–2012) | | | |
| Henan Zhengzhou (2011–2012) | | | |
| Henan Zhengzhou (2012–2013) | | | |
| Shandong Taian (2011–2012 Camera 1) | | | |
| Shandong Taian (2011–2012 Camera 2) | | | |
| Shandong Taian (2012–2013 Camera 1) | | | |
| Shandong Taian (2012–2013 Camera 2) | | | |
Fig. 5 An example of dotted annotation. A red dot is marked at the location of each wheat spike
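To illustrate how dotted annotations can serve a local regression model, the sketch below turns a list of dot coordinates into per-patch ground-truth counts. It is a simplified illustration under stated assumptions: the function names, the `(row, col)` coordinate convention, and the patch size and stride are hypothetical, and the Gaussian smoothing listed in the configuration table is omitted.

```python
import numpy as np


def dots_to_map(points, shape):
    """Place one unit of mass at each annotated (row, col) location."""
    dot_map = np.zeros(shape, dtype=np.float64)
    for row, col in points:
        dot_map[row, col] += 1.0
    return dot_map


def local_counts(dot_map, patch_size, stride):
    """Sum the dots inside each sliding patch to get local counts."""
    height, width = dot_map.shape
    rows = (height - patch_size) // stride + 1
    cols = (width - patch_size) // stride + 1
    counts = np.zeros((rows, cols), dtype=np.float64)
    for i in range(rows):
        for j in range(cols):
            top, left = i * stride, j * stride
            counts[i, j] = dot_map[top:top + patch_size,
                                   left:left + patch_size].sum()
    return counts


# Hypothetical 8x8 image with three annotated spikes.
example_map = dots_to_map([(1, 1), (2, 6), (6, 6)], (8, 8))
example_counts = local_counts(example_map, patch_size=4, stride=4)
```

With non-overlapping patches, the local counts sum to the number of dots in the image; with overlapping patches, each dot is counted once per patch that contains it.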
Fig. 7 Feature maps and the corresponding receptive fields of TasselNet and TasselNetv2. a TasselNet, b TasselNet with context added by removing zero-padding, and c TasselNetv2. The top row shows the feature maps of each layer in the network; the numbers below the feature maps are in the format: . The bottom row shows the corresponding receptive fields, where black dotted boxes represent the target local area to be counted, blue rectangles represent the input area, and pink areas represent the receptive field of the bottom-left element in the feature map (the part of the receptive field beyond the input area denotes a zero-valued area). Since the last few layers have receptive fields of the same size, orange lines point to the corresponding receptive fields
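The receptive field sizes discussed in this caption follow from the standard recurrence for stacked convolution and pooling layers: each layer enlarges the field by (kernel − 1) times the accumulated stride. The sketch below applies that recurrence; the layer list is a hypothetical stack chosen for illustration, not the configuration shown in the figures.

```python
def receptive_field(layers):
    """Compute the receptive field of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input
    to output. Returns (receptive_field_size, accumulated_stride).
    Padding does not change the size of the receptive field; it only
    determines how much of the field falls outside the input.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth scaled by the current stride
        jump *= stride             # distance between adjacent outputs
    return rf, jump


# Hypothetical stack: three 3x3 convolutions, each followed by 2x2 pooling.
HYPOTHETICAL_LAYERS = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
rf_size, total_stride = receptive_field(HYPOTHETICAL_LAYERS)
```

For this hypothetical stack, each output element sees a 22 × 22 input region and adjacent outputs are 8 pixels apart.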
Fig. 6 The structure of TasselNet, TasselNet with added context, and TasselNetv2. All of the networks adopt AlexNet-like architectures. Convolutional and pooling layers are defined in the format: filter size + layer name, number of channels, padding, /stride. Fully connected layers are defined in the format: layer name, number of nodes. The differing settings are highlighted in red
Comparison of floating point operations (FLOPs) when processing images with the resolution of . Only single-precision floating point multiplications are taken into account
| | TasselNet (non-overlap) | TasselNet (dense sample) | TasselNetv2 |
|---|---|---|---|
| conv1 | | | |
| conv2 | | | |
| conv3 | | | |
| conv4 | | | |
| conv5 | | | |
| conv6(fc1) | | | |
| conv7(fc2) | | | |
| conv8(fc3) | | | |
| Total | | | |
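The table above counts only multiplications. As a rough illustration of how such a count arises, and why a single pass over the whole image can be cheaper than densely sampled patches, here is a minimal sketch. The image size, patch size, stride, and channel numbers are hypothetical values chosen for the example, not the settings used in the paper.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_multiplications(h_in, w_in, c_in, c_out, kernel, stride=1, padding=0):
    """Number of multiplications in one convolutional layer."""
    h_out = conv_output_size(h_in, kernel, stride, padding)
    w_out = conv_output_size(w_in, kernel, stride, padding)
    # Each output element needs kernel * kernel * c_in multiplications.
    return h_out * w_out * c_out * c_in * kernel * kernel


# Hypothetical setting: 64x64 image, 32x32 patches sampled every 8 pixels,
# one 3x3 convolution with 3 input and 16 output channels, padding 1.
IMAGE, PATCH, STRIDE = 64, 32, 8
patches_per_side = (IMAGE - PATCH) // STRIDE + 1
per_patch = conv_multiplications(PATCH, PATCH, 3, 16, kernel=3, padding=1)
dense_patch_total = patches_per_side ** 2 * per_patch
whole_image_total = conv_multiplications(IMAGE, IMAGE, 3, 16, kernel=3, padding=1)
```

In this toy setting, overlapping patches recompute the same pixels many times, so the patch-based total is several times larger than the whole-image total.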
Fig. 8 The processing pipeline of TasselNetv2 at the test stage. Unlike TasselNet, TasselNetv2 directly processes the whole input image and outputs all local counts. The final density map is then obtained by merging and normalizing all local counts
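The merge-and-normalize step in this caption can be sketched as follows. This is one plausible reading, assuming each local count is spread uniformly over its patch and overlapping contributions are averaged; the function name and arguments are illustrative and are not taken from the authors' code.

```python
import numpy as np


def merge_local_counts(count_map, image_shape, patch_size, stride):
    """Merge a grid of local counts into a per-pixel density map.

    Assumption: every local count is redistributed evenly across the
    pixels of its patch, and each pixel is divided by the number of
    patches that cover it.
    """
    height, width = image_shape
    density = np.zeros((height, width), dtype=np.float64)
    hits = np.zeros((height, width), dtype=np.float64)
    for i in range(count_map.shape[0]):
        for j in range(count_map.shape[1]):
            top, left = i * stride, j * stride
            window = (slice(top, top + patch_size),
                      slice(left, left + patch_size))
            density[window] += count_map[i, j] / patch_size ** 2
            hits[window] += 1.0
    covered = hits > 0
    density[covered] /= hits[covered]  # normalize by overlap count
    return density


# Hypothetical check: a uniform density of 0.5 per pixel gives a local
# count of 8 in every 4x4 patch, and merging should recover 0.5 everywhere.
uniform_counts = np.full((3, 3), 8.0)
recovered = merge_local_counts(uniform_counts, (8, 8), patch_size=4, stride=2)
```

Summing the resulting density map gives the image-level count, which is how the count above each density map would be obtained under this reading.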
TasselNet configurations on the WSC dataset
| Patch size | Gaussian size | 4 | |
|---|---|---|---|
| Backbone of TasselNet | AlexNet-like in Fig. | ||
The effect of context on the test set of the WSC dataset. “Train” denotes adding context to TasselNet from the training phase onward, as in Fig. 7b, while “Test” denotes adding context to TasselNet only in the testing phase
| Method | Context | MAE | RMSE | Train (s) |
|---|---|---|---|---|
| TasselNet | | 61.35 | 99.27 | 3495.29 |
| TasselNet | Test | 79.42 | 126.18 | 3495.29 |
| TasselNet | Train | | 82.16 | 4026.68 |
| TasselNetv2 | | 50.79 | | 333.27 |
All networks are trained from scratch. Training time for one epoch is reported. The best performance is in italics
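MAE and RMSE in these tables are the standard mean absolute error and root mean squared error computed over per-image counts. A minimal sketch, with made-up predicted and ground-truth counts used only as an example:

```python
import math


def mae(predicted, ground_truth):
    """Mean absolute error between predicted and true per-image counts."""
    errors = [abs(p - g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)


def rmse(predicted, ground_truth):
    """Root mean squared error between predicted and true per-image counts."""
    squared = [(p - g) ** 2 for p, g in zip(predicted, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up counts for three images (illustrative only).
example_pred = [10, 20, 30]
example_gt = [12, 18, 33]
example_mae = mae(example_pred, example_gt)
example_rmse = rmse(example_pred, example_gt)
```

RMSE is never smaller than MAE and penalizes large per-image errors more heavily, which is why the two metrics are reported together.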
Fig. 9 The distribution of absolute errors for local patches and test images. Left: histogram of absolute errors for local patches. Right: histogram of absolute errors for test images. All networks are trained from scratch. “TasselNet (add-c)” denotes TasselNet with context added from the training phase onward, as per Fig. 6
The necessity of adding context on the test set of the WSC dataset
| Method | MAE | RMSE |
|---|---|---|
| TasselNet | 61.35 | 99.27 |
| TasselNetv2 | | |
| TasselNetv2(del-c) | 66.96 | 113.20 |
All networks are trained from scratch and with the same hyper parameters. The best performance is in italics
Comparison with state-of-the-art counting approaches on the test set of the WSC dataset. TasselNetv2 adopts the AlexNet-like architecture in Fig. 6 and is trained from scratch
| Method | Henan Zhengzhou (2012–2013) | Shandong Taian (2012–2013 Camera1) | Overall | #Parameters | |||
|---|---|---|---|---|---|---|---|
| MAE | RMSE | MAE | RMSE | MAE | RMSE | ||
| Segmentation method in [ | 387.09 | 436.84 | 268.03 | 345.78 | 317.19 | 386.22 | |
| CCNN [ | 168.41 | 214.41 | 52.40 | 72.78 | 101.39 | 149.91 | |
| MCNN [ | 149.44 | 188.34 | 58.83 | 75.50 | 97.08 | 135.17 | |
| CSRNet | 64.19 | 88.96 | 33.26 | 46.32 | 67.63 | ||
| TasselNet [ | 94.97 | 137.24 | 36.79 | 57.37 | 61.35 | 99.27 | |
| TasselNetv2 | 74.97 | 113.21 | 33.12 | 49.26 | 50.79 | 80.66 | |
| TasselNetv2 | 47.55 | ||||||
means the model is fine-tuned from the pretrained VGG16, and layer-by-layer settings can be found in Additional file. The best performance is in italics
Evaluations of different methods on the MTC [9] dataset
| Method | MAE | RMSE |
|---|---|---|
| JointSeg [ | 24.2 | 31.6 |
| mTASSEL [ | 19.6 | 26.1 |
| GlobalReg [ | 19.7 | 23.3 |
| DensityReg [ | 11.9 | 14.8 |
| CCNN [ | 21.0 | 25.5 |
| TasselNet [ | 6.6 | 9.6 |
| TasselNetv2 | 5.4 | |
| TasselNetv2 | 9.4 |
means the model is fine-tuned from the pretrained VGG16. The best performance is in italics
Evaluations on the ShanghaiTech [5] dataset
| Method | Part A | Part B | ||
|---|---|---|---|---|
| MAE | RMSE | MAE | RMSE | |
| MCNN [ | 110.2 | 173.2 | 26.4 | 41.3 |
| CP-CNN [ | 73.6 | 106.4 | 20.1 | 30.1 |
| ACSCP [ | 75.7 | 17.2 | 27.4 | |
| CSRNet | 68.2 | 115.0 | 10.6 | |
| TasselNet [ | 87.0 | 138.9 | 16.7 | 28.1 |
| TasselNetv2 | 84.1 | 140.1 | 15.3 | 27.8 |
| TasselNetv2 | 112.1 | 17.5 | ||
means the model is fine-tuned from the pretrained VGG16. The best performance is in italics
Fig. 10 Ground truth density maps overlaid on original images from the test set of the WSC dataset, and count maps generated by TasselNetv2 (fine-tuned from the pretrained VGG16). The number above each original image denotes the ground truth count of wheat spikes, while the number above each density map denotes the predicted count. The last row shows some unsuccessful predictions, together with error maps of these images. An error map denotes the difference between the ground truth and predicted density maps. Over-estimates are shown in red, under-estimates in blue, and minor differences in gray. The darker the color, the greater the error. Some local areas with high counting errors are also shown zoomed in. 'GT' denotes ground-truth counts and 'Error' denotes the difference compared to the ground truth. Further visualizations can be found in Additional file 1.