Jilin Zhang, Hangdi Tu, Yongjian Ren, Jian Wan, Li Zhou, Mingwei Li, Jue Wang, Lifeng Yu, Chang Zhao, Lei Zhang.
Abstract
To exploit the distributed nature of sensors, distributed machine learning has become the mainstream approach, but differences in the computing capability of sensors and network delays strongly affect the accuracy and convergence rate of the machine learning model. This paper describes a parameter communication optimization strategy that balances training overhead against communication overhead. We extend the fault tolerance of iterative-convergent machine learning algorithms and propose Dynamic Finite Fault Tolerance (DFFT). Based on DFFT, we implement a parameter communication optimization strategy for distributed machine learning, named the Dynamic Synchronous Parallel Strategy (DSP), which uses a performance monitoring model to dynamically adjust the parameter synchronization strategy between worker nodes and the Parameter Server (PS). This strategy makes full use of the computing power of each sensor, ensures the accuracy of the machine learning model, and prevents model training from being disturbed by tasks unrelated to the sensors.
Keywords: distributed machine learning; dynamic synchronous parallel strategy (DSP); parameter server (PS); sensors
Year: 2017; PMID: 28934163; PMCID: PMC5677357; DOI: 10.3390/s17102172
Source DB: PubMed; Journal: Sensors (Basel); ISSN: 1424-8220; Impact factor: 3.576
Figure 1. The first-generation parameter server system.
Figure 2. The second-generation parameter server system.
Figure 3. The third-generation parameter server system.
Figure 4. Bulk Synchronous Parallel Strategy (BSP).
Figure 5. Stale Synchronous Parallel (SSP) is not optimized for clusters whose worker nodes have similar performance.
Figure 6. SSP cannot cope with external interference during model training.
Figure 7. Flow diagram of our optimization strategy, shown for one worker node and the PS.
Figure 8. We solve the problem that training a distributed machine learning model with SSP is inefficient in a cluster composed of worker nodes with similar performance.
Figure 9. We solve the problem that the stale threshold s becomes ineffective when the computing performance of a worker node changes during distributed machine learning training.
Figure 10. A distributed machine learning system based on Caffe that supports the Dynamic Synchronous Parallel Strategy (DSP).
Figure 11. Comparison of model accuracy under each parameter communication optimization strategy.
Figure 12. Comparison of model training time under each parameter communication optimization strategy.
Figure 13. Comparison of the effects of SSP with different stale thresholds and of DSP on the accuracy of the distributed machine learning model.
Figure 14. Comparison of training time for the distributed machine learning model using SSP with different stale thresholds versus DSP.
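
The synchronization ideas named in the abstract and in the captions for BSP and SSP can be illustrated with a small sketch. It is not the authors' DSP or DFFT implementation, which this record does not describe in detail. The class `StalenessGate`, the function `adjust_threshold`, and the rule for deriving a threshold from measured iteration times are all hypothetical, chosen only to show how a stale threshold s could be gated on the server and re-derived from performance monitoring. With s = 0 the gate behaves like BSP; with s > 0 it behaves like SSP.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StalenessGate:
    """Server-side bookkeeping for an SSP-style staleness bound (illustrative)."""
    threshold: int  # stale threshold s: max lead over the slowest worker
    clocks: Dict[str, int] = field(default_factory=dict)

    def register(self, worker: str) -> None:
        # Every worker starts at iteration (clock) 0.
        self.clocks[worker] = 0

    def commit(self, worker: str) -> None:
        # Called when a worker has pushed the update for one iteration.
        self.clocks[worker] += 1

    def may_proceed(self, worker: str) -> bool:
        # A worker may start its next iteration only if it is at most
        # `threshold` iterations ahead of the slowest worker.
        slowest = min(self.clocks.values())
        return self.clocks[worker] - slowest <= self.threshold


def adjust_threshold(iter_times: Dict[str, float], max_threshold: int) -> int:
    """Hypothetical monitoring rule, NOT taken from the paper.

    The fastest worker finishes about floor(slowest / fastest) iterations
    while the slowest finishes one, so it leads by that count minus one.
    Similar workers give 0 (BSP-like); the result is capped at max_threshold.
    """
    fastest = min(iter_times.values())
    slowest = max(iter_times.values())
    if fastest <= 0:
        raise ValueError("iteration times must be positive")
    return min(max_threshold, int(slowest / fastest) - 1)
```

In this sketch the server would call `adjust_threshold` periodically with measured per-iteration times and assign the result to `gate.threshold`, so that a worker slowed by an unrelated task does not leave a fixed s either too strict or too loose.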