More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.

Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B Gibbons, Garth A Gibson, Gregory R Ganger, Eric P Xing.

Abstract

We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes.
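The bounded-staleness idea in the abstract can be illustrated with a minimal sketch. It assumes each worker keeps an integer clock counting its completed iterations, and that a worker may begin another iteration only while it is at most `staleness` clocks ahead of the slowest worker. The class name `SSPClock`, its methods, and the `staleness` parameter are hypothetical names chosen for illustration and are not taken from the record; the sketch models only the clock rule, not the parameter server or its local cache.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SSPClock:
    """Toy model of a bounded-staleness rule (illustrative, not the paper's code)."""

    num_workers: int
    staleness: int  # hypothetical name: maximum allowed lead over the slowest worker
    clocks: List[int] = field(init=False)

    def __post_init__(self) -> None:
        # Every worker starts with zero completed iterations.
        self.clocks = [0] * self.num_workers

    def can_proceed(self, worker: int) -> bool:
        # A worker may begin its next iteration only if its completed count
        # exceeds the slowest worker's count by at most `staleness`.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker: int) -> None:
        # Record one completed iteration; refuse if the worker should be waiting.
        if not self.can_proceed(worker):
            raise RuntimeError(f"worker {worker} must wait for slower workers")
        self.clocks[worker] += 1


if __name__ == "__main__":
    clock = SSPClock(num_workers=2, staleness=2)
    for _ in range(3):
        clock.tick(0)            # worker 0 runs ahead of worker 1
    print(clock.can_proceed(0))  # worker 0 is now 3 clocks ahead and must wait
    clock.tick(1)                # worker 1 catches up by one clock
    print(clock.can_proceed(0))  # the gap is back within the bound
```

Under this assumed rule, setting `staleness` to 0 forces every worker to wait for all others before each iteration, which corresponds to a fully synchronous scheme, while a very large value never blocks and behaves like an asynchronous one.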

Year:  2013        PMID: 25400488      PMCID: PMC4230489

Source DB:  PubMed          Journal:  Adv Neural Inf Process Syst        ISSN: 1049-5258


  4 in total

1.  Efficient Privacy-preserving Machine Learning in Hierarchical Distributed System.

Authors:  Qi Jia; Linke Guo; Yuguang Fang; Guirong Wang
Journal:  IEEE Trans Netw Sci Eng       Date:  2018-07-24

2.  Dynamic Allocation Method of Economic Information Integrated Data Based on Deep Learning Algorithm.

Authors:  Zhitao Cao
Journal:  Comput Intell Neurosci       Date:  2022-05-16

3.  A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors.

Authors:  Jilin Zhang; Hangdi Tu; Yongjian Ren; Jian Wan; Li Zhou; Mingwei Li; Jue Wang; Lifeng Yu; Chang Zhao; Lei Zhang
Journal:  Sensors (Basel)       Date:  2017-09-21       Impact factor: 3.576

4.  Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system.

Authors:  Jia Wei; Xingjun Zhang; Zeyu Ji; Jingbo Li; Zheng Wei
Journal:  Sci Rep       Date:  2021-10-12       Impact factor: 4.379
