Seungwoo Kum, Seungtaek Oh, Jeongcheol Yeom, Jaewon Moon.
Abstract
As deep learning technology matures, real-world applications that use it are becoming popular. Edge computing is one of the service architectures for realizing deep learning based services; it makes use of resources near the data source or client. In an edge computing architecture, managing resource usage becomes important. Existing research addresses this through optimization of deep learning models, such as pruning or binarization, which makes the models more lightweight, and through efficient distribution of workloads across cloud and edge resources. Both lines of work aim to reduce the workload on edge resources. This paper proposes a usage optimization method based on batch and model management. The proposed method increases the utilization of GPU resources by modifying the batch size of the input to an inference application. To this end, the inference pipelines are analyzed to see how the different kinds of resources are used, and then the effect of batch inference on the GPU is measured. The proposed method consists of several modules, including a batch size management tool that can change the batch size according to the available resources, and a model management tool that supports on-the-fly update of a model. The proposed methods are implemented in a real-time video analysis application and deployed in a Kubernetes cluster as a Docker container. The results show that the proposed method can optimize the usage of edge resources for real-time video analysis deep learning applications.
Keywords: batch inference; deep learning application framework; edge computing; edge optimization
Year: 2022 PMID: 36081177 PMCID: PMC9460489 DOI: 10.3390/s22176717
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Cloud and Edge resources for a deep learning service.
Figure 2. Real-time video analysis application. If the inference processing delay is longer than the time between two consecutive frames, the output will lag.
Figure 3. Inference pipelines. (a) Sequential processing of a video input. (b) Pipeline of sequential processing. Each process is separated by a queue. (c) Batch inference pipeline. Input data are stacked on a queue, and inference fetches a batch of images. Lined boxes refer to processes that use the GPU, while dotted boxes refer to those that use the CPU.
Figure 4. Structure of the proposed architecture.
Figure 5. Testbed configuration.
Figure 6. GUI of CDS.
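
The batch inference pipeline of Figure 3(c), where inputs are stacked on a queue and inference fetches a batch, can be sketched with a standard-library queue. This is a minimal illustration under stated assumptions, not the authors' implementation: `fetch_batch`, `max_wait`, and the integers standing in for frames are hypothetical names chosen for this example.

```python
import queue
import time


def fetch_batch(frame_queue, batch_size, max_wait=0.05):
    """Collect up to batch_size items from frame_queue.

    Waits at most max_wait seconds in total, so a slow source yields a
    smaller batch instead of stalling the inference stage.
    (Hypothetical helper; names are assumptions for illustration.)
    """
    batch = []
    deadline = time.monotonic() + max_wait
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(frame_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


# Stand-in for the pre-processing stage: ten "frames" already queued.
frames = queue.Queue()
for frame_id in range(10):
    frames.put(frame_id)

first = fetch_batch(frames, batch_size=4)
second = fetch_batch(frames, batch_size=4)
third = fetch_batch(frames, batch_size=4)  # only two frames remain
```

Bounding the wait is one possible design choice: it trades a full batch for bounded latency when the input rate drops.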
Delays and real-time constraints on a Xavier.
| Batch Size | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Pre-processing | 0.004 | 0.007 | 0.013 | 0.026 | 0.051 |
| Inference | 0.021 | 0.031 | 0.058 | 0.107 | 0.211 |
| Post-processing | 0.003 | 0.004 | 0.008 | 0.014 | 0.027 |
| Real-time condition | 0.033 | 0.067 | 0.133 | 0.267 | 0.533 |
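
The real-time condition row is consistent with the batch size divided by a 30 frames-per-second input rate (for example, 16 / 30 ≈ 0.533), which suggests the values are in seconds. Under that assumption, and assuming the three stages run one after another, the check can be sketched as follows. The 30 fps rate, the units, and the summing of stages are inferences from the table, not statements from the source; `real_time_report` is a hypothetical helper.

```python
# Delays from the Xavier table, assumed to be in seconds.
BATCH_SIZES = [1, 2, 4, 8, 16]
XAVIER = {
    "Pre-processing": [0.004, 0.007, 0.013, 0.026, 0.051],
    "Inference": [0.021, 0.031, 0.058, 0.107, 0.211],
    "Post-processing": [0.003, 0.004, 0.008, 0.014, 0.027],
}
ASSUMED_FPS = 30  # assumption: reproduces the "Real-time condition" row


def real_time_report(delays, batch_sizes, fps):
    """Return {batch_size: (total_delay, budget, meets_budget)}."""
    report = {}
    for i, b in enumerate(batch_sizes):
        total = sum(stage[i] for stage in delays.values())
        budget = b / fps  # time needed for b frames to arrive
        report[b] = (round(total, 3), round(budget, 3), total <= budget)
    return report


report = real_time_report(XAVIER, BATCH_SIZES, ASSUMED_FPS)
for b, (total, budget, ok) in report.items():
    print(f"batch={b:2d} total={total:.3f}s budget={budget:.3f}s ok={ok}")
```

With these figures, and under the assumptions above, every listed batch size stays within its budget on the Xavier.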
Delays and real-time constraints on a 2080 Ti.
| Batch Size | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Pre-processing | 0.003 | 0.006 | 0.013 | 0.027 | 0.053 |
| Inference | 0.003 | 0.005 | 0.01 | 0.017 | 0.035 |
| Post-processing | 0.001 | 0.002 | 0.003 | 0.006 | 0.009 |
| Real-time condition | 0.033 | 0.067 | 0.133 | 0.267 | 0.533 |
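
One way to read the 2080 Ti figures is to amortize each delay over the batch and to see which stage is slowest. The sketch below does that with the table values; the helper names `per_frame` and `slowest_stage` are hypothetical, and treating the values as seconds is an assumption.

```python
BATCH_SIZES = [1, 2, 4, 8, 16]
RTX_2080_TI = {
    "Pre-processing": [0.003, 0.006, 0.013, 0.027, 0.053],
    "Inference": [0.003, 0.005, 0.01, 0.017, 0.035],
    "Post-processing": [0.001, 0.002, 0.003, 0.006, 0.009],
}


def per_frame(delays, batch_sizes):
    """Divide each batch delay by its batch size."""
    return {
        stage: [d / b for d, b in zip(values, batch_sizes)]
        for stage, values in delays.items()
    }


def slowest_stage(delays, index):
    """Name of the stage with the largest delay in the given column."""
    return max(delays, key=lambda stage: delays[stage][index])


amortized = per_frame(RTX_2080_TI, BATCH_SIZES)
bottleneck_at_16 = slowest_stage(RTX_2080_TI, BATCH_SIZES.index(16))
```

At batch size 16, the inference delay per frame falls from 0.003 to about 0.0022, while the pre-processing delay (0.053) exceeds the inference delay (0.035).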
Figure 7. GPU resource usage with different batch sizes.
Figure 8. Memory consumption of GPU with different batch sizes.