Snorkel: Rapid Training Data Creation with Weak Supervision.

Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré.

Abstract

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance by an average of 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to a 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides a 132% average improvement in predictive performance over prior heuristic approaches and comes within an average of 3.60% of the predictive performance of large hand-curated training sets.
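
To make the idea of labeling functions concrete, the following is a minimal sketch in plain Python, not Snorkel's actual interface. The task, the three heuristics, and every name in it (`lf_contains_causes`, `apply_lfs`, `majority_vote`, and so on) are hypothetical. The abstract describes denoising labeling-function outputs with data programming, which accounts for their unknown accuracies and correlations; the sketch substitutes a simple unweighted majority vote as a stand-in for that step.

```python
from collections import Counter

# Label values; ABSTAIN means the heuristic has no opinion on this example.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical labeling functions for a toy task:
# "does this sentence say that something causes a condition?"
def lf_contains_causes(text):
    return POSITIVE if "causes" in text.lower() else ABSTAIN

def lf_contains_treats(text):
    return NEGATIVE if "treats" in text.lower() else ABSTAIN

def lf_contains_induced(text):
    return POSITIVE if "induced" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_causes, lf_contains_treats, lf_contains_induced]

def apply_lfs(texts, lfs):
    """Build a label matrix: one row per example, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]

def majority_vote(row):
    """Stand-in for denoising: unweighted vote over non-abstaining functions.

    Returns ABSTAIN when every function abstains or the top two labels tie.
    """
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    top = votes.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]

texts = [
    "Aspirin causes stomach bleeding in some patients.",
    "Metformin treats type 2 diabetes.",
    "Drug-induced liver injury causes concern, though the drug treats pain.",
    "The study enrolled 40 volunteers.",
]

label_matrix = apply_lfs(texts, LABELING_FUNCTIONS)
labels = [majority_vote(row) for row in label_matrix]

for text, row, label in zip(texts, label_matrix, labels):
    print(row, "->", label, "|", text)
```

The third sentence shows why combining heuristics matters: the functions disagree, and some rule has to resolve the conflict. A majority vote treats every function as equally reliable, which is exactly the assumption that a learned model of labeling-function accuracies is meant to relax.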

Year:  2017        PMID: 29770249      PMCID: PMC5951191          DOI: 10.14778/3157794.3157797

Source DB:  PubMed          Journal:  Proceedings VLDB Endowment        ISSN: 2150-8097


  30 in total

1.  Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices.

Authors:  Vincent S Chen; Sen Wu; Zhenzhen Weng; Alexander Ratner; Christopher Ré
Journal:  Adv Neural Inf Process Syst       Date:  2019-12

2.  Training Complex Models with Multi-Task Weak Supervision.

Authors:  Alexander Ratner; Braden Hancock; Jared Dunnmon; Frederic Sala; Shreyash Pandey; Christopher Ré
Journal:  Proc Conf AAAI Artif Intell       Date:  2019 Jan-Feb

3.  Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale.

Authors:  Stephen H Bach; Daniel Rodriguez; Yintao Liu; Chong Luo; Haidong Shao; Cassandra Xia; Souvik Sen; Alex Ratner; Braden Hancock; Houman Alborzi; Rahul Kuchhal; Chris Ré; Rob Malkin
Journal:  Proc ACM SIGMOD Int Conf Manag Data       Date:  2019 Jun-Jul

4.  Snuba: Automating Weak Supervision to Label Training Data.

Authors:  Paroma Varma; Christopher Ré
Journal:  Proceedings VLDB Endowment       Date:  2018-11

5.  TAX-Corpus: Taxonomy based Annotations for Colonoscopy Evaluation.

Authors:  Shorabuddin Syed; Adam Jackson Angel; Hafsa Bareen Syeda; Carole Franc Jennings; Joseph VanScoy; Mahanazuddin Syed; Melody Greer; Sudeepa Bhattacharyya; Shaymaa Al-Shukri; Meredith Zozus; Fred Prior; Benjamin Tharian
Journal:  Biomed Eng Syst Technol Int Jt Conf BIOSTEC Revis Sel Pap       Date:  2022-02

Review 6.  Machine learning in human movement biomechanics: Best practices, common pitfalls, and new opportunities.

Authors:  Eni Halilaj; Apoorva Rajagopal; Madalina Fiterau; Jennifer L Hicks; Trevor J Hastie; Scott L Delp
Journal:  J Biomech       Date:  2018-09-13       Impact factor: 2.712

7.  A corpus-driven standardization framework for encoding clinical problems with HL7 FHIR.

Authors:  Kevin J Peterson; Guoqian Jiang; Hongfang Liu
Journal:  J Biomed Inform       Date:  2020-08-16       Impact factor: 6.317

8.  Fonduer: Knowledge Base Construction from Richly Formatted Data.

Authors:  Sen Wu; Luke Hsiao; Xiao Cheng; Braden Hancock; Theodoros Rekatsinas; Philip Levis; Christopher Ré
Journal:  Proc ACM SIGMOD Int Conf Manag Data       Date:  2018-06

9.  A deep database of medical abbreviations and acronyms for natural language processing.

Authors:  Lisa Grossman Liu; Raymond H Grossman; Elliot G Mitchell; Chunhua Weng; Karthik Natarajan; George Hripcsak; David K Vawdrey
Journal:  Sci Data       Date:  2021-06-02       Impact factor: 6.444

10.  Pain Recognition With Electrocardiographic Features in Postoperative Patients: Method Validation Study.

Authors:  Emad Kasaeyan Naeini; Ajan Subramanian; Michael-David Calderon; Kai Zheng; Nikil Dutt; Pasi Liljeberg; Sanna Salantera; Ariana M Nelson; Amir M Rahmani
Journal:  J Med Internet Res       Date:  2021-05-28       Impact factor: 5.428
