Liang Jia1,2, Ye Tian1, Junguo Zhang1. 1. School of Technology, Beijing Forestry University, Beijing 100083, China. 2. School of Microelectronics and Control Engineering, Changzhou University, Changzhou 213164, China.
Abstract
Camera traps provide a feasible way for ecological researchers to observe wildlife, and they often produce millions of images of diverse species requiring classification. This classification can be automated via edge devices installed with convolutional neural networks, but networks may need to be customized per device because edge devices are highly heterogeneous and resource-limited. This can be addressed by a neural architecture search capable of automatically designing networks. However, search methods are usually developed based on benchmark datasets differing widely from camera trap images in many aspects including data distributions and aspect ratios. Therefore, we designed a novel search method conducted directly on camera trap images with lowered resolutions and maintained aspect ratios; the search is guided by a loss function whose hyper parameter is theoretically derived for finding lightweight networks. The search was applied to two datasets and led to lightweight networks tested on an edge device named NVIDIA Jetson X2. The resulting accuracies were competitive in comparison. Conclusively, researchers without knowledge of designing networks can obtain networks optimized for edge devices and thus establish or expand surveillance areas in a cost-effective way.
Camera traps provide a feasible way for ecological researchers to observe wildlife, and they often produce millions of images of diverse species requiring classification. This classification can be automated via edge devices installed with convolutional neural networks, but networks may need to be customized per device because edge devices are highly heterogeneous and resource-limited. This can be addressed by a neural architecture search capable of automatically designing networks. However, search methods are usually developed based on benchmark datasets differing widely from camera trap images in many aspects including data distributions and aspect ratios. Therefore, we designed a novel search method conducted directly on camera trap images with lowered resolutions and maintained aspect ratios; the search is guided by a loss function whose hyper parameter is theoretically derived for finding lightweight networks. The search was applied to two datasets and led to lightweight networks tested on an edge device named NVIDIA Jetson X2. The resulting accuracies were competitive in comparison. Conclusively, researchers without knowledge of designing networks can obtain networks optimized for edge devices and thus establish or expand surveillance areas in a cost-effective way.
Entities:
Keywords:
camera trap images; convolutional neural network; neural architecture search
Visual data are a rich source of information about wildlife and can provide strong support for wildlife conservation and ecological research. One cost-effective way to obtain visual data of wildlife is via camera traps that work in a non-disturbing [1] and round-the-clock manner [2], thus making them ideal for observing wild animals otherwise difficult to monitor [3], e.g., nocturnal mammals [4] and large animals [5]. Because camera traps are noninvasive [6], a single deployment may record a diverse range of species [7]. Consequently, the recorded images have to be processed before being adopted in ecological research [8]. The images may go through several processing stages determined by the research process, a fundamental stage of which is species identification that is usually implemented as automatically and centrally classifying camera trap images at a data center installed with very deep convolutional neural networks (CNNs) [9,10,11,12,13]. In practice, there may be millions of images produced by camera traps [7,12,14,15], so image transfer and processing at a data center is often computationally intensive and costly. Furthermore, the scale of the surveillance area may also be restricted by the processing capability of the data center.Edge computing [16] was ideally developed for such cases, i.e., intensive computation centralized at a data center can be split and localized by edge devices near camera traps [17,18]. Thus, fundamental processing steps such as removing images without animals [6,7,12,19,20,21] and classifying images with animals [9,10,11,12,13] can be automatically conducted on edge devices. However, edge devices are not only heterogeneous [22] but also resource constrained [23]. These limitations of edge devices narrow down the range of available neural networks [23,24]. Hence, lightweight networks [25] designed for edge devices are critical in edge computing for camera trap images. Even so, “deep neural network design is very difficult, and it requires the experience and knowledge of experts, a lot of trial and error, and even inspiration” [26]. Luckily, network design can be automated through neural architecture search (NAS) [27]. However, NAS is often developed regardless of domain knowledge [28] regarding camera trap images of wildlife [29]. Specifically, NAS is often designed based on benchmark datasets such as CIFAR-10 [30] and ImageNet [31], which differ from camera trap images in many aspects, especially data distribution and aspect ratios as described below.The data distribution of benchmark datasets may differ from camera trap images, e.g., in classes, image foregrounds and backgrounds. Camera trap images purely contain animals, but only partial classes in benchmark datasets are relevant to animals. For instance, six out of ten classes in CIFAR-10 are related to animals and 233 out of 1000 classes in ImageNet are relevant to vertebrate [32]. Consequently, NAS based on benchmark datasets may waste resources on designing networks optimized for data irrelevant to animals. In addition to classes, animal images in benchmark datasets also differ from camera trap images in foregrounds and backgrounds. For benchmark datasets [30,31], images are usually artificially preprocessed to guarantee that foreground animals are large and centered and their backgrounds are relatively small and may differ from animal habitats in the wild. In contrast, animals in habitats are photographed by camera traps under various conditions, so the animals may appear at random locations in images and are often closely related with image backgrounds.The image aspect ratios of benchmark datasets differ from camera trap images, e.g., the aspect ratios of CIFAR-10 and ImageNet are both 1:1 (the image width and height are the same), though this ratio may not hold for camera trap images. For instance, the resolutions of camera trap images range from 2048 × 1536 (aspect ratio: 4:3) to 2616 × 1472 (16:9) in North American Camera Trap Images, i.e., NACTI [13], and the resolutions range from 1920 × 1080 (16:9) to 2048 × 1536 (4:3) in Missouri Camera Trap Images, i.e., MCTI [33]. Therefore, networks found by NAS based on benchmark datasets may require that camera trap images be resized to satisfy the aspect ratio 1:1. However, resizing images may alter their aspect ratios and introduce interpolated pixels, often resulting in either misshaped animals or memory waste.In short, images from benchmark datasets adopted by NAS often differ from camera trap images, and this difference potentially implies domain shift [34]. Additionally, it may be hard to modify existing networks in line with the applications [29]. These issues inspired us to develop NAS based on the domain knowledge of camera trap images for edge devices. We used the proposed method to conduct searches directly on camera trap images rather than images of benchmark datasets. The aspect ratios of camera trap images are maintained during the search, which is guided by a loss function particularly derived for finding the lightweight networks. The hyper parameter of loss function was theoretically analyzed and carefully chosen, and lightweight networks found by the search were tested on the NVIDIA Jetson X2 edge device. The experimental results confirmed the validity of the proposed method. The main contributions of this paper are as follows.A method named Domain-Aware Neural Architecture Search (DANAS) was developed regarding the domain knowledge of camera trap images. The proposed method directly searches networks on camera trap images, thus avoiding negative effects such as the domain shift incurred by benchmark datasets in conventional search methods.Aspect ratios of camera trap images are maintained during the search. As part of domain knowledge, the changes of aspect ratio may not be automatically tackled by neural networks. Therefore, the changes are manually eliminated by first finding the most frequent aspect ratio and then padding images whose aspect ratios differ from the most frequent one.A loss function was derived to guide DANAS to find lightweight networks applicable for edge devices. A theoretical analysis of the proposed loss function was conducted, and the analysis revealed the value of hyper parameter in the loss function to boost its guiding effect on the search.
2. Materials and Methods
2.1. Datasets
Two datasets were employed in this study: MCTI and NACTI, containing 24 thousand and 3.7 million camera trap images, respectively, with varying resolutions. Since label errors are found in NACTI and its millions of images require too much computational resources, NACTI was selectively adopted in this study in the form of a subset named NACTI-a containing 29 thousand images with varying resolutions. The species data in NACTI-a and MCTI are illustrated in Table 1.
Table 1
Dataset details.
Species in NACTI-a 1
2048 × 1536(4:3)
1920 × 1080(16:9)
2616 × 1472(16:9)
Species in MCTI
2048 × 1536(4:3)
1920 × 1080(16:9)
Black bear 2
2420/534
10/1
Agouti
499/107
279/65
Marten 2
72/16
Bird
584/120
70/20
Red squirrel 2
313/75
Coiban Agouti
1135/245
18/2
Jackrabbit 3
594/135
55/8
Collared Peccary
372/83
398/85
Bobcat
2040/453
15/2
Opossum
454/94
295/73
California quail
277/60
European Hare
578/122
Cougar
2380/527
Great Tinamou
681/148
380/66
Coyote
1416/322
55/14
6/1
Mouflon
1940/425
Gray squirrel
811/186
1/0
Ocelot
256/64
184/35
Elk
1754/393
8/1
Paca
772/162
200/62
Gray fox
1253/279
5/2
Red Brocket Deer
425/94
384/78
Moose
978/216
Red Deer
2321/509
Mule deer
1761/397
Red Fox
410/91
Armadillo 4
521/113
Red Squirrel
343/78
182/36
Raccoon
1126/250
Roe Deer
1038/233
Red deer
1754/374
Spiny Rat
383/91
201/37
Red fox
266/59
White-nosed Coati
883/192
179/41
Snowshoe hare
1183/263
White-tailed Deer
1363/287
452/106
Striped skunk
1080/243
Wild Boar
1538/345
Virginia opossum
91/19
Wood Mouse
1105/245
Wild boar
1548/340
4/2
Wild turkey
643/155
17/0
1 Numbers before and after slashes, respectively, refer to the training and testing image numbers; 2 American animals; 3 Black-tailed jackrabbit 4 Nine-banded armadillo.
2.2. Method
DANAS was developed within the framework of reinforcement learning [35,36], i.e., the search is implemented on sampling candidate networks from a search space through a sampler [29], as shown in Figure 1. In DANAS, the sampler is long short-term memory (LSTM) [37]. The reason to use LSTM as the sampler is that this sampler does not rely on parameter sharing [38], which may not be helpful for finding high-performance networks (as reported by [39]). Around the sampler, there are five conceptual search steps (from ① to ⑤ in Figure 1). By repeating these steps, the quality of the sampled network is gradually improved via updates of the learnable parameters of the sampler. Starting from the first step, all five steps are introduced sequentially next.
Figure 1
Flowchart of DANAS.
In Step ① shown in Figure 1, LSTM samples candidate networks from the search space defined by a meta architecture, i.e., a prototype from which all candidate networks are derived. The meta architecture is similar to the ones defined in [35,36,37,38], i.e., a pipeline segmented to groups of layers called cells. There are two types of cells, normal and reduction cells, and the cells of the same type share the same inner structure. Besides the inner structures, the normal and reduction cells differ in the way they process data, i.e., the width and height dimensions of data remain the same before and after normal cells while the width and height of the input are halved through reduction cells. There are N normal cells in the pipeline, and each normal cell is adjacent to two reduction cells. At the end of the pipeline, a global average [40] is appended. In this study, the reduction cell was simplified to a single pooling layer, i.e., an average pooling or a max pooling with a kernel size of 5 × 5 or 3 × 3, and the normal cells were sampled based on the meta cell shown in Figure 2.
Figure 2
Meta normal cell.
As shown in Figure 2, a normal cell is a group of blocks whose inputs come from blocks in the same cell or previous B cells. For blocks not serving as inputs of any other blocks, their outputs are concatenated to produce the cell output. Each normal cell has B (a constant) blocks, and each block has M (determined by the sampler) operations. The operation is sampled from the same set of operations as that in [38], e.g., a stack of 3 × 3 depth-wise-separable convolution [41], batch normalization [42], and ReLU [43]. Accordingly, the sampler first determines the operation number M of a block by sampling an integer from some predefined integers, and then it samples inputs and operations for the block, and the sampling repeats for B blocks to form a normal cell. Once the normal cell has been sampled, the sampler samples a pooling layer to form a reduction cell. Both the sampled normal cell and the reduction cell are employed to build the candidate network.In Step ② shown in Figure 1, the candidate network is built based on the sampled cells and the meta architecture, i.e., assembling the cells according to the cell pipeline. The building process is identical with the one introduced in [44], i.e., we applied the adaptive meta-architecture [44] to build candidate networks. Once the candidate network is properly built, its performance is evaluated based on the camera trap images with maintained aspect ratios.In Step ③ shown in Figure 1, the candidate network is trained and validated on camera trap images with the most frequent aspect ratio, i.e., the occurrences of unique aspect ratios of camera trap images are counted and the aspect ratio with the maximal count is chosen as the most frequent one. Images with aspect ratios different the most frequent one are padded by zero pixels. In practice, camera trap images are processed to have the same aspect ratio before the search starts, and the processed images are employed to train the candidate network. The trained network is then validated to yield validation accuracy to compute the loss.In Step ④ shown in Figure 1, an accuracy reward [44] is generated based on accuracies obtained by training and validating a candidate network, and both the produced accuracy and the network parameter number [25] are employed to generate the loss . The purpose of this step is to train LSTM to sample “good” networks via gradient-based optimization algorithms such as stochastic gradient descent (SGD). The meaning of “good” is twofold, i.e., the parameter number of the network should be close to the desired parameter number ( million in our case) and the accuracy reward of the network should be close to the ideal accuracy (, i.e., 100% accuracy). Since the reward is twofold, we need a bivariate reward function so that the gradient of the total loss synchronizes with the reward. According to the case of the unary loss function in studies of reinforcement learning [45], we defined as
where represents learnable parameters associated with the sampler, is the accuracy reward involving the training and the validation accuracies of the candidate network, and is the parameter number of the network in millions. The bivariate function provides the reward based on , and summarizes probabilities of sampling the candidate network through the sampler, i.e.,
where the notations are similar to [44], i.e., is the number of the cells in the candidate network, denotes the block number of the th cell, is the operation number of the th block, and is the probability of sampling under condition . The details of can be found in Appendix A. Since
and yields a scalar, the direction of is solely determined by . However, remains unknown due to the unknown probability distributions of and , which means the direction of is out of our control, i.e., we cannot change the direction of to point to promising positions of high rewards. However, we can change its magnitude via so that synchronizes with the reward. For example, suppose the sampler sampled a network of close to ; we expect the sampler to sample networks alike, which requires that should not be largely updated by SGD involving . However, is partially determined by , so may not remain small when is close to . In this case, should scale to ensure that the resulting is relatively small. This requires the reward surface defined by to be similar to a whirlpool with vortex . We chose Witch of Agnesi [46] to build on account of its bell-like curve and the simple mathematical form that only introduces one hyper parameter. Therefore, is defined as
where is the hyper parameter introduced by Witch of Agnesi. In practice, usually equals 100 (100% accuracy) [44], is determined by the application, and only remains unknown. The value of may be discovered by assuming both and are restrained within some range, and this assumption may be reasonable under certain search conditions. Specifically, let and ; then can be written asAssuming and , the volume of within the assumed ranges is given bywhere and . Suppose and , then, the formula above can be simplified by substituting and by their Taylor series of order three, i.e.,
which is equivalent to
which is a special case of monic cubic polynomials, i.e., the depressed cubic: . According to Cardano’s formula, the solution of the depressed cubic is
where and areThe solution of the depressed cubic requires
which holds if . The numerator of is and its determinant is , so holds regardless of . The denominator of is , so is equivalent to , which leads to .In practice, , where may take a small value such as . Figure 3 illustrates the surface of parameterized by , and within the ranges and . As expected, does have a whirlpool-like surface with the vortex , and the sampler may be guided by involving to find lightweight networks.
Figure 3
Surface of .
In step ⑤ shown in Figure 1, a selecting and training strategy is employed to find the optimal network. The idea behind this strategy is concentrating computational resources on promising networks found during the search, as the method first samples a relatively large number of candidate networks with small training epochs, e.g., 2 epochs in our case, and then finds the promising ones based on the sampled networks with large training epochs. In practice, we ran a single search to sample 1500 networks, and then networks with parameter numbers ranging from 1 to 1.5 million (the ideal parameter number in our case) were sorted decreasingly by their validation accuracies. If there are more than 150 networks, then only top 150 networks are retained for retraining through 5 epochs, and then the trained networks are sorted based on accuracies. If there are networks with accuracies >90%, then 15 networks are retained and retrained through 10 epochs; otherwise, half of the networks are retained and retrained. We stopped this procedure at 15 epochs and selected the top-1 network. If the difference between the accuracies between the top-1 and the top-2 networks was not large, e.g., less than 1%, then we would increase the epoch number and continue the training.
3. Results
The performance evaluation of DANAS was individually conducted on the NACTI-a and MCTI datasets. As shown in Table 1, the most frequent resolution of both NACTI-a and MCTI is 2048 × 1536 (aspect ratio 4:3). Accordingly, the images of the two datasets were resized to have the resolutions 85 × 64 (4:3) [44] for the search and 224 × 168 (4:3) for the test; for each dataset, the search was conducted on 85 × 64 images, and then the optimal network discovered by the search was trained and tested on 224 × 168 images. Each dataset was split to three subsets, i.e., the training set, the validation set and the test set, and the search was conducted on the first two subsets. The split was implemented by randomly sampling images from the dataset at a ratio of 0.64:0.16:0.2 of the sample numbers of three subsets, namely, 20% images were randomly sampled from the dataset to build the test set, then 20% images were randomly sampled from the remaining images to build the validation set, and the rest of the images served as the training set. The candidate networks found by the search were trained on the training set and then tested on the validation set, so the test set remained unknown to the search.In searches on NACTI-a and MCTI, the pipeline shown in Figure 2 had three pairs of one reduction cell and five normal cells () at most. The normal cell had five blocks (), and each block may have had five operations () at most. The input channels of the normal cell and reduction cell were, respectively, fixed to 20 and 40. The output channel of the reduction cell was fixed to 40, while the output channel of the normal cell was automatically determined by its operations. The candidate network was trained by using AMSGrad [47] with a batch size of 32, two epochs, and a learning rate of 0.005. The training was conducted on 85 × 64 training images via a PyTorch module named Distributed Data Parallel (DDP) that loaded the network and the batches to available GPUs, individually trained networks on GPUs, collected the resulting gradients from all GPUs and synchronized networks based on the collected gradients. The trained network was then tested on 85 × 64 images of the validation set on each GPU, and the resulting accuracies were retrieved via PyTorch module named Manager. The retrieved accuracies were then averaged to yield the training and the validation accuracies that were used to generate the accuracy reward. Finally, the loss was computed based on the accuracy reward through the loss function whose hyper parameters were set as , and . All searches were done on a workstation installed with 4 GPUs of NVIDIA TITAN Xp, Ubuntu 20.04, PyTorch 1.7.0 and MySQL 8.0.13.In tests, several networks famous for their lightweight designs or performance were chosen for comparison with DANAS, i.e., MobileNet-v2 [48], EfficientNet [49], DenseNet [50], Resnet-18 [51], ResNext [52] and Wide ResNet [53]. Each network was trained by using SGD [54] of Nesterov momentum [55] with a batch size of 10, 20 epochs, and a learning rate ranging from 0.005 to 0.0001. The learning rate was changed by cosine schedule [54]. The training was conducted on 224 × 168 images from both the training and the validation sets via DDP, and the weights of the network at the last epoch were saved on the hard disk. During tests, the weights were read from the disk and employed to populate the network, and the network was tested on 224 × 168 images of the test set. All networks in comparison were trained and tested on the workstation, and the optimal networks found by DANAS were additionally tested on an NVIDIA Jetson X2 edge device installed with Ubuntu 18.06 and PyTorch 1.1.0.Since the camera trap images differ widely between MCTI and NACTI-a, DANAS found different networks, which led to distinct accuracies and misclassifications for two datasets. The detailed results are discussed in the following sections.
3.1. Search and Test on NACTI-a
The search on NACTI-a consumed roughly 74 hours and found a network with 1.36 million parameters. The search performance was compared with a random search via steps like those shown in Figure 1. Specifically, the sampler in step 1 of Figure 1 was replaced by random sampling, and both memory constriction [44] in step 2 and sampler updating in step 4 of Figure 1 were removed. However, the memory constriction could not truly be removed due to the limited physical GPU memory, and the constriction was thus alleviated by resampling the networks until the pipeline shrinkage [44] did not happen. The training and test configurations of networks explored by the random search were the same as those in DANAS. The search procedures of the random search and DANAS are visualized in Figure 4 and Figure 5, respectively.
Figure 4
Scatter plot of parameter numbers and accuracies associated with the networks explored by a random search on NACTI-a.
Figure 5
Scatter plot of parameter numbers and accuracies associated with the networks discovered by DANAS on NACTI-a.
As shown in Figure 4 regarding the random search, there were 57 networks with parameter numbers exceeding 2.5 million and 79 networks with validation accuracies exceeding 60%.As shown in Figure 5 regarding DANAS, there were 32 networks with parameter numbers exceeding 2.5 million and 140 networks with validation accuracies exceeding 60%, and one of them was chosen as the optimal network according to step 5 in Figure 2. The optimal network is highlighted by a yellow star in Figure 5 and its normal cell is depicted in Figure 6; its reduction cell was simply a max pooling with a 3-by-3 kernel.
Figure 6
Normal cell found by DANAS on NACTI-a.
The detailed network structure based on the normal cell shown in Figure 6 is illustrated in Figure 7, which shows how the data flowed through the normal and the reduction cells. The connections between cells are denoted by arrows. In Figure 7, cells labeled “normal cell ”, “normal cell ” …“normal cell ” correspond to cells labeled “Cell ”, “Cell ” …“Cell ” in Figure 6.
Figure 7
Network built based on the cell found by DANAS on NACTI-a.
If we rotate Figure 6 clockwise by 90°, then cell labels and arrow colors in Figure 6 will match labels and arrow colors in Figure 7. For instance, yellow arrows between “cell ” and “normal cell” in Figure 6 correspond to the yellow arrow between “ max pool” and “normal cell ” in Figure 7, purple arrows between “cell ” and “normal cell” in Figure 6 correspond to the purple arrow between “normal cell ” and “normal cell ” in Figure 7, and so forth. For each normal cell shown in Figure 7, its inputs are signified by “a direct arrow running from the previous cell” and “three curved arrows running from another three previous cells”, and each arrow in Figure 7 corresponds to a group of arrows with the same color in Figure 6.The input channels of the normal and reduction cells were fixed to 20 and 40, respectively. The output channel of the reduction cell was fixed to 40, and the output channel of the normal cell was automatically determined by its operations. The fixed channel numbers served as element-wise additions within blocks, i.e., only tensors of the same dimensions could be added element-wise. Therefore, channels of any block input were assumed to be 20. If the input channel differed from this constant, then the input was fed to an additional stack of 1-by-1 convolution, batch normalization and ReLU for changing the channel number to 20. Accordingly, the channel numbers of all block outputs were the same, i.e., 20, due to the fact that no operation within a block affected input data dimensions. Besides the channel numbers of inputs, if an input to a block differed in widths or heights, then all inputs were resized to have the minimal width and height found among the block inputs. Therefore, all block inputs shared the same dimensions, and element-wise additions worked in any block. As shown in Figure 6, the cell output was obtained by concatenating block outputs, which required that outputs to concatenate had the same width and height. If the outputs differed in width or height, then they were resized to the maximal width and height found among outputs to concatenate. The number of cell output channels could thus be easily derived by counting the number of outputs to concatenate, e.g., for the normal cell in Figure 6, its output channel number was , i.e., three block outputs were concatenated to yield the cell output. For reduction cell, since the pooling layer only halved the width and height dimensions of inputs, the output channel was the same as the input channel, i.e., 40. If inputs of a reduction cell had different channel numbers other than 40, then the inputs were fed to the channel-changing stack the same as the normal cell. All convolutions in normal cells had strides set to 1 and paddings set to 1 or 2, respectively, for or convolutions. All poolings had strides set to 2 and paddings set to 1 or 2, respectively, for or poolings.As shown in Figure 7, the output of the last normal cell was fed to a global average, i.e., a average pooling where and refer to the input width and height, respectively. Here, a tensor was pooled to scalars via the global average where denotes the class number. If the input to the global average had channels other than channels, then the input was fed to an additional 1-by-1 convolution of stride set to 1 and padding set to 0 before the input was fed to the global average. The results of the network shown in Figure 7 are illustrated in Table 2, and the best accuracy within each row is highlighted by bold texts.
Table 2
NACTI-a accuracy comparison.
Species or Parameter Number
DANAS(Ours)
MobileNet-v2 [48]
EfficientNet [49]
DenseNet [50]
Resnet-18 [51]
ResNext [52]
Wide_ResNet [53]
RandomSearch
Para. num.
1.36
2.25
4.04
6.98
11.19
23.02
66.88
0.52
Black bear 1
98.32
97.57
96.07
99.44
98.13
98.88
98.88
98.69
Marten 1
25.00
6.25
25.00
62.50
37.50
31.25
37.50
0.00
Red squirrel1
98.67
97.33
100
96.00
100
33.33
20.00
100
Jackrabbit 2
99.30
99.30
98.60
99.30
100
99.30
98.60
98.60
Bobcat
97.58
96.92
96.26
97.36
96.26
96.48
95.60
97.14
Quail 3
96.67
95.00
98.33
96.67
100
90.00
83.33
96.67
Cougar
98.10
96.20
96.20
99.05
98.48
95.26
95.83
97.53
Coyote
95.55
94.07
95.85
92.88
93.77
81.90
78.34
93.18
Gray squirrel 4
100
97.31
97.85
96.24
98.92
93.01
97.85
96.77
Elk
99.24
99.24
97.97
99.75
99.49
98.98
98.98
99.49
Gray fox
99.64
97.15
96.09
98.22
97.86
97.51
97.15
98.58
Moose
96.76
96.76
95.83
93.52
95.83
58.80
62.96
95.37
Mule deer
98.49
97.48
96.98
98.49
98.24
94.71
94.96
98.49
Armadillo 5
100
98.23
100
100
97.35
97.35
100
100
Raccoon
99.20
96.80
97.20
98.00
98.00
96.00
93.20
97.20
Red deer
92.25
91.98
91.44
95.19
95.72
87.97
86.36
92.78
Red fox
62.30
62.30
60.66
75.41
62.30
40.98
36.07
47.54
Hare 6
99.62
98.86
97.34
96.96
98.48
98.48
97.34
98.10
Skunk 7
99.18
99.18
98.77
99.59
100
98.77
98.35
99.59
Opossum 8
94.74
89.47
89.47
94.74
100
94.74
94.74
94.74
Wild boar
95.32
96.78
95.61
96.49
97.08
86.84
87.13
94.44
Wild turkey
98.06
96.13
98.71
99.35
99.35
87.74
82.58
96.77
Average
92.91
90.92
91.83
94.78
93.76
84.47
83.44
90.53
1 American animals; 2 Black-tailed jackrabbit; 3 California quail; 4 Eastern gray squirrel; 5 Nine-banded armadillo; 6 Snowshoe hare; 7 Striped skunk; 8 Virginia opossum.
As shown in Table 2, although the parameter number of the optimal network discovered by DANAS was small, the average test accuracy associated with DANAS was the third best of the compared networks. However, there were eight species accuracies in DANAS that were the best (bold digits in Table 2) compared to other networks, and there were eight best species accuracies in Resnet-18, which demonstrated the best average test accuracy.There were 155 images misclassified by DANAS. Among all misclassifications, 78 were color images and the rest were night-vision images, i.e., about half misclassified images were night-vision images. The image samples of typical misclassifications are illustrated in Figure 8, i.e., the partial animal body in the left sample, the small region occupied by the animal in the middle left sample, and visually similar animals in the right and the middle right samples.
Figure 8
Examples of misclassified images from NACTI-a.
Among all misclassifications, about 64% (99 samples) were misclassified due to the visual similarity of animals, and these misclassifications mainly originated from the deer and canine species. Samples of deer and canine misclassifications are shown in Figure 9. The misclassifications were mainly made among red deer (29 samples) and red fox (23 samples). For red deer samples, 14 samples were grayscale images without colors (the left sample in Figure 9), and the remaining color samples always contained red deer whose heads were obscured due to camera view limitations, body orientations (the middle-left sample in Figure 9), etc. For red fox samples, 11 samples were grayscale images (the middle-right sample in Figure 9), and the remaining color samples always contained foxes occupying small image regions (the right sample in Figure 9).
Figure 9
Samples of misclassified deer images in NACTI-a.
Samples of misclassifications other than deer are shown in Figure 10. The misclassifications were made among bobcat, cougar, coyote, moose, etc., due to reasons similar to those of the deer and red fox misclassifications. Additionally, misclassification samples only containing animal heads are shown in Figure 10.
Figure 10
Samples of misclassified images of similar animals in NACTI-a.
3.2. Search and Test on MCTI
The search on MCTI consumed roughly 62 hours and found a network with 1.43 million parameters. The search performance was compared with a random search whose configuration was the same as the one introduced in the previous section. The search procedures of DANAS and the random search are visualized in Figure 11 and Figure 12, respectively.
Figure 11
Scatter plot of parameter numbers and accuracies associated with the networks explored by a random search on MCTI.
Figure 12
Scatter plot of parameter numbers and accuracies associated with the networks discovered by DANAS on MCTI.
As shown in Figure 11 regarding the random search, there were 66 networks with parameter numbers exceeding 2.5 million (62 points on the right of vertical line at 2.5 in the figure; four points are not shown due to limited space) and 16 networks with validation accuracies exceeding 50%.As shown in Figure 12 regarding DANAS, there were 15 networks with parameter numbers exceeding 2.5 million (13 points on the right of vertical line at 2.5 in the figure; two are not shown due to limited space) and 93 networks with validation accuracies exceeding 50%; one of them was chosen as the optimal network according to step 5 in Figure 2. The optimal network is highlighted by a yellow star in Figure 12, its normal cell is depicted in Figure 13; its reduction cell was simply a max pooling with a 3-by-3 kernel.
Figure 13
Normal cell found by DANAS on MCTI.
The network structure based on the cell in Figure 13 was the same as the one shown in Figure 7 because normal cells found on both NACTI-a and MCTI involved all previous cells, and the pipeline in Figure 7 illustrates how data flowed at the cell level (in contrast to the data flow at the block level that is shown in Figure 6 and Figure 13). The test results are shown in Table 3, and the best accuracy within each row is highlighted by bold texts.
Table 3
MCTI accuracy comparison.
Species or Parameter Number
DANAS(Ours)
MobileNet-v2 [48]
EfficientNet [49]
DenseNet [50]
Resnet-18 [51]
ResNext [52]
Wide_ResNet [53]
RandomSearch
Para. num.
1.43
2.25
4.04
6.98
11.19
23.02
66.88
0.70
Agouti
91.86
91.86
94.19
91.86
93.60
93.02
86.05
83.72
Bird
97.02
88.10
87.50
91.07
92.26
89.29
92.26
88.69
Agouti 1
97.77
86.16
86.61
92.41
87.05
90.18
91.96
92.86
Peccary 2
90.12
86.63
81.98
88.95
77.91
83.72
86.63
90.12
Opossum
94.42
96.14
93.56
97.00
97.00
94.42
96.14
93.13
Hare 3
95.31
66.41
75.78
88.28
92.19
82.03
86.72
70.31
Tinamou 4
73.74
65.66
74.75
75.76
75.76
68.69
81.82
41.41
Mouflon
93.86
88.60
81.58
94.74
89.47
89.47
88.60
76.32
Ocelot
96.41
88.02
89.82
89.22
92.22
91.62
90.42
92.81
Paca
90.71
89.29
90.71
92.14
90.00
91.43
92.14
78.57
Deer 5
99.07
96.26
96.26
98.60
98.60
97.20
96.73
94.39
Red Deer
97.46
91.86
95.93
96.69
97.46
96.95
96.69
94.40
Red Fox
99.76
99.29
99.06
99.76
100
62.59
61.18
97.88
Red Squirrel
99.80
98.04
98.82
100
99.80
97.45
99.61
99.21
Roe Deer
94.85
97.00
95.71
97.85
97.00
97.00
98.71
94.42
Spiny Rat
98.26
96.81
97.39
98.84
98.55
95.36
96.23
95.07
Coati 6
79.12
71.43
81.32
72.53
75.82
80.22
79.12
68.13
Deer 7
96.72
89.34
88.52
91.80
90.16
92.62
87.70
92.62
Wild Boar
100
100
100
100
100
100
100
93.47
Mouse 8
100
100
99.19
97.57
100
97.57
98.38
99.19
Average
94.31
89.34
90.43
92.75
92.24
89.54
90.35
86.84
1 Coiban agouti; 2 Collared peccary; 3 European hare; 4 Great Tinamou; 5 Red brocket deer; 6 White-nosed coati; 7 White-tailed deer; 8 Wood mouse.
As shown in Table 3, the parameter number of the optimal network discovered by DANAS was small, and the average test accuracy associated with DANAS was the best throughout the networks in comparison. There were 167 images misclassified by DANAS. Among all misclassifications, 45 were color images, and the rest were grayscale images. Samples of typical misclassifications are shown in Figure 14, i.e., vagueness due to dirty camera lens in the left sample, similar backgrounds and species in the middle samples, and partial animal body in the right sample.
Figure 14
Examples of misclassified images from MCTI.
3.3. Tests on Jetson X2
The optimal networks discovered by DANAS with the MCTI and NACTI-a datasets were tested on the NVIDIA Jetson X2 edge device shown in Figure 15. Because the versions of PyTorch installed in the workstation and the Jetson X2 are different, the format of network weights saved in the workstation was incompatible with Jetson X2. This issue was tackled by loading and resaving weights in Pickle-based format through PyTorch’s built-in function torch.save() with the parameter “_use_new_zipfile_serialization” set to False. The resaved network weights and 224 × 168 test images were copied to Jetson X2 through secure copy protocol as in [44]. The test results are shown in Table 4.
Figure 15
NVIDIA Jetson X2.
Table 4
Test results on Jetson X2.
Species in NACTI-a
DANAS
Species in MCTI
DANAS
American black bear
97.94
Agouti
92.44
American marten
25.00
Bird
97.62
American red squirrel
98.67
Coiban Agouti
97.77
Black-tailed jackrabbit
98.60
Collared Peccary
90.12
Bobcat
97.36
Opossum
93.56
California quail
96.67
European Hare
95.31
Cougar
98.29
Great Tinamou
69.70
Coyote
95.55
Mouflon
93.86
Eastern Gray squirrel
100
Ocelot
94.01
Elk
99.24
Paca
91.43
Gray fox
99.29
Red Brocket Deer
99.07
Moose
97.22
Red Deer
97.96
Mule deer
97.98
Red Fox
99.76
Nine-banded armadillo
100
Red Squirrel
100
Raccoon
98.80
Roe Deer
94.85
Red deer
91.44
Spiny Rat
98.26
Red fox
63.93
White-nosed Coati
78.02
Snowshoe hare
99.24
White-tailed Deer
96.72
Striped skunk
99.59
Wild Boar
100
Virginia opossum
94.74
Wood Mouse
100
Wild boar
95.32
Average
94.02
Wild turkey
98.06
Average
92.86
As shown in Table 4, the average accuracies on Jetson X2 were 92.91% and 94.31% for NACTI-a and MCTI, respectively. The average accuracies on Jetson X2 were slightly lower than the corresponding accuracies obtained on the workstation, i.e., 92.86% and 94.02% for NACTI-a and MCTI, respectively.
3.4. Comparisons between DANAS and other Search Methods
Since comparisons of search methods based on custom-defined search space and various hardware may introduce bias, we compared our method with other methods via Nasbench-201 [39]. Nasbench-201 provides a database and application programming interfaces (APIs) for comparing search methods with the same search space and hardware. In Nasbench-201, all candidate networks in a specific search space were trained, validated and tested on the CIFAR-10, CIFAR-100 and ImageNet-16-120 datasets [39]. The training, validating and testing accuracies were saved in databases and could be programmatically retrieved via an API, a network code encoding the network architecture. There were five operations scattering in three cells, i.e., the first cell containing one operation, the second containing two, and the last containing three. For each operation, there were five options available for sampling, i.e., “nor_conv_3 × 3”, “none”, “nor_conv_1 × 1”, “avg_pool_3 × 3” and “skip_connect” [39]. Nasbench-201 does not distinguish networks of different operation inputs (i.e., for all networks of operations arranged in the same encoding order, only one network is trained, validated and tested on the aforementioned datasets.), so there are [39] networks in Nasbench-201, and a sampler tested on Nasbench-201 is restricted to sample operations only. Accordingly, we simplified our sampler and applied Bayesian optimization [21] to automatically find values of the sampler hyper parameters, i.e., the embedding dimension, the hidden unit number, the layer number, and the learning rate were set to 19, 33, 1, and 0.005, respectively. The rest of the configuration was the same as that of DANAS.According to [39], there are two types of search methods tested on Nasbench-201, i.e., methods dependent on or independent of parameter sharing. Parameter sharing often means weights of a newly-sampled network are initialized by using weights from the previously-sampled networks trained on the dataset, so the weights of previously trained networks are not abandoned during the search. In [39], parameter-sharing-dependent methods were repeated three times and other methods were repeated 500 times. For each run of the method independent of parameter sharing, the method continued to run until the simulated training time [39] of its sampled networks reached a predefined limit called time budget, i.e., 12,000 s. [39]. The simulated training time of the sampled network was obtained by adding its training and validation time saved in Nasbench-201.Since our method (DANAS) is independent of parameter sharing, DANAS was tested according to the configuration of search methods independent of parameter sharing, i.e., the search based on DANAS was repeated 500 times and each search automatically stopped once the time budget was reached. Different from methods in [39], our method requires an additional hyper parameter, i.e., the ideal parameter number . This parameter is set to the parameter number of the candidate network with the optimal validation accuracy. Accordingly, DANAS was tested against three datasets available in Nasbench-201, i.e., CIFAR-10 ( in millions), CIFAR-100 ( in millions), and ImageNet-16-120 ( in millions). Because network weights required by parameter-sharing-dependent methods were not available at the time of paper submission, we only tested parameter-sharing-independent methods with Nasbench-201 on our own hardware. Specifically, all search steps except for training, validating and testing sampled networks were conducted on our hardware, and network accuracies and parameter numbers were directly retrieved from Nasbench-201. The configurations of all methods except DANAS were the same as [39]. The results are illustrated in Table 5.
Table 5
Comparisons with other search methods.
Method
Search(Seconds)
CIFAR-10
CIFAR-100
ImageNet-16-120
Validation
Test
Validation
Test
Validation
Test
REA [56]
0.03
91.56 ± 0.13
94.35 ± 0.18
73.15 ± 0.49
73.05 ± 0.56
46.08 ± 0.77
46.08 ± 0.78
RS [57]
1.00
91.48 ± 0.12
94.08 ± 0.26
72.63 ± 1.09
72.44 ± 0.70
45.90 ± 0.58
45.64 ± 0.85
REINFORCE [45]
1.00
91.70 ± 0.06
94.35 ± 0.19
73.52 ± 0.30
73.43 ± 0.52
46.49 ± 0.41
45.98 ± 0.72
BOHB [58]
6.12
88.52 ± 1.39
91.77 ± 1.30
62.62 ± 9.73
62.74 ± 9.79
33.43 ± 9.18
33.22 ± 9.51
DANAS (ours)
4.24
91.58 ± 0.17
94.28 ± 0.21
72.85 ± 0.64
72.71 ± 0.87
45.99 ± 0.56
45.75 ± 0.83
Bold text: optimal mean accuracies; underlined text: suboptimal mean accuracies.
As shown in Table 5, five search methods were compared on three benchmark datasets, i.e., CIFAR-10, CIFAR-100 and ImageNet-16-120 [39]. Among methods in comparison, i.e., REA [56], RS [57], REINFORCE [45] and BOHB [58], our method (DANAS) achieved the second best test accuracy on CIFAR-10 and the third best test accuracy on both CIFAR-100 and ImageNet-16-120.
4. Discussion
DANAS was evaluated on two datasets, NACTI-a and MCTI. For both datasets, the random searches significantly differed from DANAS in changes of validation accuracy and parameter number over time.In the case of NACTI-a, the number of networks with parameter numbers exceeding 2.5 million in the random search was almost twice that of DANAS, and the number of networks with validation accuracies exceeding 50% in the random search was roughly half that of DANAS. More importantly, the distribution of points from DANAS in Figure 5 illustrates a growing trend towards networks with few parameter numbers and high validation accuracies, i.e., the search tended to find Pareto solutions good for both accuracy and parameter number, while no such trend can be seen in Figure 4 regarding the random search.In the case of MCTI, the ratio between the numbers of networks with 2.5 million parameter numbers or above for random search and DANAS was higher than the case of NACTI-a, i.e., about 4:1, and the ratio between the numbers of networks with validation accuracies exceeding 50% for the random search and DANAS was lower than the case of NACTI-a, i.e., about 1:8. The distribution of points from DANAS in Figure 12 illustrates the same trend as the case of NACTI-a, and the random search showed no such trend, as depicted in Figure 11.The performance of the networks found by DANAS was evaluated by comparing the test accuracies with seven CNNs with parameter numbers ranging from 0.7 to 66.8 million on two datasets, NACTI-a and MCTI. Although the parameter numbers of networks found on both datasets were lower than 1.5 M, the test accuracy was the third best for NACTI-a and the best for MCTI. These results reveal the benefit of designing CNNs with structures highly customized for studied data and used device. Generally, the experimental results confirmed the validity of DANAS.The search efficiency of DANAS was compared with search methods reported in [39] based on Nasbench-201, and the search methods with parameter sharing were retested on our hardware. For all benchmark datasets of Nasbench-201, our method outperformed all parameter-sharing-dependent methods reported in [39] and most of parameter-sharing-independent methods including the random search. Generally, DANAS outperformed NAS methods with parameter sharing and was competitive compared with NAS methods without parameter sharing.
5. Conclusions
In this study, DANAS is proposed to automatically design lightweight CNNs for ecological research powered by camera traps and edge computing. DANAS was developed based on domain knowledge of camera trap images, i.e., the search is conducted on camera trap images whose resolutions are lowered while the original aspect ratios are maintained. Therefore, the data distribution of the original dataset is preserved during the search, so the data distribution difference incurred by using benchmark datasets in traditional NAS is reduced in DANAS. Furthermore, the search in DANAS is guided by a loss function designed based on Witch of Agnesi whose hyper parameter was theoretically derived. In experiments, DANAS was shown to successfully find lightweight networks for two datasets of wildlife camera trap images. The found networks were then trained on a workstation and tested on both the workstation and an edge device. In comparison with CNNs of classical lightweight designs and good performance, the networks found by DANAS had low parameter numbers and competitive test accuracies. Generally, researchers without knowledge of designing CNNs can obtain lightweight CNNs optimized for edge devices through DANAS and thus expand surveillance areas in a cost-effective way.
Authors: Mohammad Sadegh Norouzzadeh; Anh Nguyen; Margaret Kosmala; Alexandra Swanson; Meredith S Palmer; Craig Packer; Jeff Clune Journal: Proc Natl Acad Sci U S A Date: 2018-06-05 Impact factor: 11.205
Authors: Zhongqi Miao; Kaitlyn M Gaynor; Jiayun Wang; Ziwei Liu; Oliver Muellerklein; Mohammad Sadegh Norouzzadeh; Alex McInturff; Rauri C K Bowie; Ran Nathan; Stella X Yu; Wayne M Getz Journal: Sci Rep Date: 2019-05-31 Impact factor: 4.379