
Building an Evaluation Scale using Item Response Theory.

John P Lalor, Hao Wu, Hong Yu.

Abstract

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT can describe characteristics of individual items - their difficulty and discriminating power - and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that the model compares NLP systems against the performance of a human population and provides more insight into system performance than standard evaluation metrics. We also show that a high accuracy score does not always imply a high IRT score, because the IRT score depends on the item characteristics and the response pattern.
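
To illustrate the last point, here is a minimal sketch using the two-parameter logistic (2PL) model, a standard IRT form. The abstract does not say which IRT model the authors fit, so the model choice, the item parameters, the response patterns, and the helper names (`p_correct`, `log_likelihood`, `estimate_ability`) are all assumptions made for illustration, not the paper's method or data. Ability is estimated by a simple grid search over the likelihood.

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve: P(correct | ability theta).
    a = discrimination, b = difficulty (both hypothetical here)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    ll = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        ll += math.log(p) if y else math.log(1.0 - p)
    return ll

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate on a fixed grid (step 0.01)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical item bank: (discrimination a, difficulty b).
items = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Two hypothetical systems with identical accuracy (2 of 4 correct).
system_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
system_b = [0, 0, 1, 1]  # correct only on the strongly discriminating items

acc_a = sum(system_a) / len(system_a)
acc_b = sum(system_b) / len(system_b)
theta_a = estimate_ability(items, system_a)
theta_b = estimate_ability(items, system_b)

print(f"accuracy A={acc_a:.2f}, B={acc_b:.2f}")
print(f"ability  A={theta_a:.2f}, B={theta_b:.2f}")
```

Under these assumed parameters both systems have the same accuracy, yet system B receives the higher ability estimate, because in the 2PL model correct answers on highly discriminating items carry more weight than correct answers on weakly discriminating ones.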


Year:  2016        PMID: 28004039      PMCID: PMC5167538          DOI: 10.18653/v1/d16-1062

Source DB:  PubMed          Journal:  Proc Conf Empir Methods Nat Lang Process


