
BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark.

Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim.

Abstract

Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massively parallel computations that run in today's data centers is time-consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next-generation data-intensive scalable cloud computing platform. This requires rethinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay, and naively inspecting millions of records using a watchpoint is too time-consuming for an end user. First, BIGDEBUG's simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BIGDEBUG scales to terabytes and that its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BIGDEBUG supports debugging at interactive speeds with minimal performance impact.
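The idea behind guarded watchpoints and crash-culprit capture can be illustrated with a minimal single-process sketch. This is not BIGDEBUG's API and does not use Spark; the names `DebugReport`, `map_with_debug`, and `guard` are hypothetical, chosen only for illustration. The sketch applies a function to each record, keeps a copy of the records that satisfy a user-supplied guard predicate (rather than inspecting every record), and records any crash-inducing record instead of aborting the whole run.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Optional, Tuple


@dataclass
class DebugReport:
    """Hypothetical container for the results of one instrumented pass."""
    outputs: List[Any] = field(default_factory=list)
    watched: List[Any] = field(default_factory=list)               # records matching the guard
    culprits: List[Tuple[Any, str]] = field(default_factory=list)  # (record, error description)


def map_with_debug(
    records: Iterable[Any],
    fn: Callable[[Any], Any],
    guard: Optional[Callable[[Any], bool]] = None,
) -> DebugReport:
    """Apply fn to every record, capturing watched records and crash culprits."""
    report = DebugReport()
    for record in records:
        # Watchpoint: retain only the records the user asked to see.
        if guard is not None and guard(record):
            report.watched.append(record)
        try:
            report.outputs.append(fn(record))
        except Exception as exc:
            # Crash culprit: remember the offending record and keep going.
            report.culprits.append((record, repr(exc)))
    return report


# Example: parse strings as integers; "x3" is the crash-inducing record.
report = map_with_debug(["1", "22", "x3", "4444"], int, guard=lambda r: len(r) > 2)
```

In this example `report.outputs` holds the three successfully parsed integers, `report.watched` holds only the record longer than two characters, and `report.culprits` holds the single record that raised an error. A distributed system would additionally have to ship the guard to workers and collect captured records across nodes, which this sketch does not attempt.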


Keywords:  Debugging; big data analytics; data-intensive scalable computing (DISC); fault localization and recovery; interactive tools

Year:  2016        PMID: 27390389      PMCID: PMC4933307          DOI: 10.1145/2884781.2884813

Source DB:  PubMed          Journal:  Proc Int Conf Softw Eng        ISSN: 0270-5257


1.  Titian: Data Provenance Support in Spark.

Authors:  Matteo Interlandi; Kshitij Shah; Sai Deep Tetali; Muhammad Ali Gulzar; Seunghyun Yoo; Miryung Kim; Todd Millstein; Tyson Condie
Journal:  Proceedings VLDB Endowment       Date:  2015-11

2.  Optimizing Interactive Development of Data-Intensive Applications.

Authors:  Matteo Interlandi; Sai Deep Tetali; Muhammad Ali Gulzar; Joseph Noor; Tyson Condie; Miryung Kim; Todd Millstein
Journal:  Proc ACM Symp Cloud Comput       Date:  2016-10
