Stephen H Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alex Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Chris Ré, Rob Malkin.
Abstract
Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for this setting. Snorkel DryBell builds on the Snorkel framework, extending it in three critical aspects: flexible, template-based ingestion of diverse organizational knowledge, cross-feature production serving, and scalable, sampling-free execution. On three classification tasks at Google, we find that Snorkel DryBell creates classifiers of comparable quality to ones trained with tens of thousands of hand-labeled examples, converts non-servable organizational resources to servable models for an average 52% performance improvement, and executes over millions of data points in tens of minutes.
Keywords: Systems for machine learning; weak supervision
Year: 2019; PMID: 31777414; PMCID: PMC6879379; DOI: 10.1145/3299869.3314036
Source DB: PubMed; Journal: Proc ACM SIGMOD Int Conf Manag Data; ISSN: 0730-8078