
Fine-Grained Provenance for Matching & ETL.

Nan Zheng, Abdussalam Alawini, Zachary G Ives.

Abstract

Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but track provenance only at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization but cover only a limited subset of data science tasks. None of these solutions is well suited to tracing errors introduced during common ETL, record alignment, and matching tasks over data types such as strings and images. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.


Year:  2019        PMID: 31595143      PMCID: PMC6783128          DOI: 10.1109/ICDE.2019.00025

Source DB:  PubMed          Journal:  Proc Int Conf Data Eng        ISSN: 1084-4627


