Parnia Samimi, Sri Devi Ravana.
Abstract
Test collections are used to evaluate information retrieval systems in laboratory-based evaluation experiments. In the classic setting, generating relevance judgments involves human assessors and is a costly and time-consuming task. Researchers and practitioners are still challenged to perform reliable and low-cost evaluation of retrieval systems. Crowdsourcing, as a novel method of data acquisition, is broadly used in many research fields. It has been shown to be an inexpensive and quick solution as well as a reliable alternative for creating relevance judgments. One application of crowdsourcing in IR is judging the relevance of query-document pairs. For a crowdsourcing experiment to succeed, the relevance judgment tasks should be designed carefully, with an emphasis on quality control. This paper explores the factors that influence the accuracy of relevance judgments produced by workers and how to improve the reliability of judgments in crowdsourcing experiments.
Year: 2014 PMID: 24977172 PMCID: PMC4055211 DOI: 10.1155/2014/135641
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1. Classification of IR evaluation methods.
Figure 2. Typical IR evaluation process.
User-based evaluation methods.
| User-based methods | Description |
|---|---|
| Human in the lab | This method involves human experimentation in the lab to evaluate the user-system interaction |
| Side-by-side panels | The top-ranked answers generated by two IR systems for the same search query are collected and presented side by side to users. A human assessor then makes a simple judgment of which side retrieves the better results |
| A/B testing | A/B testing exposes a preselected subset of a website's users to a specific modification and analyses their reactions to determine whether the change is positive or negative |
| Using clickthrough data | Clickthrough data are used to observe how frequently users click on the retrieved documents for a given query |
| Crowdsourcing | Crowdsourcing is defined as outsourcing tasks that were formerly performed inside a company or institution by employees to a large, heterogeneous mass of potential workers in the form of an open call over the Internet |
Different applications of crowdsourcing.
| Application | Description |
|---|---|
| Natural language processing | Crowdsourcing technology was used to investigate linguistic theory and language processing [ |
| Machine learning | Automatic translation using active learning and crowdsourcing was proposed to reduce the reliance on costly language experts [ |
| Software engineering | Crowdsourcing was investigated as a way of recruiting the right type and number of subjects to evaluate a software engineering technique [ |
| Network event monitoring | Crowdsourcing was used to detect, isolate, and report service-level network events, an approach called CEM (crowdsourcing event monitoring) [ |
| Sentiment classification | The issues in training a sentiment analysis system on data collected through crowdsourcing were analysed [ |
| Cataloguing | The application of crowdsourcing to libraries and archives was assessed [ |
| Transportation planning | Crowdsourcing was argued to enable the citizen participation process in public planning projects [ |
| Information retrieval | Crowdsourcing was suggested as a feasible alternative for creating relevance judgments [ |
Figure 3. Crowdsourcing scheme.
Figure 4. Flow of submitting and completing tasks via crowdsourcing.
Statistics for calculating the interrater agreement.
| Methods | Description |
|---|---|
| Joint probability of agreement (percentage agreement) [ | The simplest measure: the number of items on which the assessors assign the same rating (e.g., 1, 2, ..., 5), divided by the total number of ratings |
| Cohen's kappa [ | A statistical measure of interrater agreement between two raters. It is more robust than percentage agreement because it accounts for agreement occurring by chance (see the sketch after this table) |
| Fleiss' kappa [ | An extension of Cohen's kappa that measures agreement among any number of raters (not only two) |
| Krippendorff's alpha [ | A measure based on the overall distribution of judgments, regardless of which assessors produced them |
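As a rough illustration of how two of these statistics differ, the sketch below computes both the percentage agreement and Cohen's kappa for two assessors over the same set of binary relevance judgments. The assessor labels and values are hypothetical examples, not data from the paper.

```python
from collections import Counter

def percentage_agreement(a, b):
    """Joint probability of agreement: fraction of items rated identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement between two raters."""
    n = len(a)
    p_o = percentage_agreement(a, b)                  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # expected chance agreement from each rater's marginal rating distribution
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary relevance judgments (1 = relevant, 0 = not relevant)
assessor_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
assessor_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(percentage_agreement(assessor_1, assessor_2))  # 0.8
print(cohens_kappa(assessor_1, assessor_2))          # ~0.58 once chance agreement is discounted
```

The gap between the two numbers shows why kappa-style measures are preferred: part of the raw 0.8 agreement would be expected even if both assessors rated at random according to their own label frequencies.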
Worker types based on their behavioral observation.
| Workers | Description |
|---|---|
| Diligent | Completed tasks precisely, with high accuracy and a longer time spent on tasks |
| Competent | Skilled workers with high accuracy and fast work |
| Sloppy | Completed tasks quickly without considering quality |
| Incompetent | Spent time on tasks but completed them with low accuracy |
| Spammer | Did not deliver useful work |
Worker types based on their average precision.
| Workers | Description |
|---|---|
| Proper | Completed tasks precisely |
| Random spammer | Gave worthless, random answers |
| Semirandom spammer | Answered most questions incorrectly while answering a few correctly, hoping to avoid detection as a spammer |
| Uniform spammer | Repeated the same answer |
| Sloppy | Not precise enough in their judgments |
Design-time methods.
| Method | Description | Platform | Example |
|---|---|---|---|
| Qualification test | A qualification test is a set of questions that workers must answer to qualify for the tasks | AMT | IQ test [ |
| Honey pots or gold standard data | Honey pots are predefined questions with known answers inserted among the tasks (see the sketch after this table) [ | Crowdflower [ | — |
| Qualification settings | Certain qualification settings are specified when creating HITs | AMT, Crowdflower | Using approval rate |
| Trap questions | HITs are designed along with a set of questions with known answers in order to identify unreliable workers [ | — | — |
| CAPTCHAs and reCAPTCHA | CAPTCHAs are an antispamming technique that separates humans from computers in order to filter out automatic answers. In reCAPTCHA, text from a scanned page is used to verify that a human, rather than a spam bot, is entering the data, as only a human can pass the test [ | — | — |
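To make the honey-pot (gold standard) idea concrete, the sketch below filters out workers whose accuracy on a small set of known-answer questions falls below a threshold. The worker identifiers, labels, and the 0.7 threshold are illustrative assumptions rather than values from the paper.

```python
# Hypothetical gold-standard answers: question id -> correct label
gold = {"q1": "relevant", "q2": "not_relevant", "q3": "relevant"}

# Hypothetical worker responses: worker id -> {question id -> label}
responses = {
    "w1": {"q1": "relevant", "q2": "not_relevant", "q3": "relevant"},
    "w2": {"q1": "relevant", "q2": "relevant", "q3": "not_relevant"},
}

def gold_accuracy(answers, gold):
    """Accuracy of a worker on the gold-standard (honey-pot) questions only."""
    scored = [qid for qid in gold if qid in answers]
    if not scored:
        return 0.0
    correct = sum(answers[qid] == gold[qid] for qid in scored)
    return correct / len(scored)

# Keep only workers who pass an (assumed) 70% accuracy threshold on the honey pots
trusted = {w for w, answers in responses.items() if gold_accuracy(answers, gold) >= 0.7}
print(trusted)  # {'w1'}
```

In practice the honey-pot questions are interleaved with ordinary tasks so that workers cannot tell which items are being used to check them.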
Run-time methods.
| Method | Description |
|---|---|
| Majority voting (MV) | MV is a straightforward and common method that eliminates wrong results by taking the majority decision (see the sketches after this table) [ |
| Expectation maximization (EM) algorithm | The EM algorithm measures worker quality by estimating the correct answer for each task from the labels supplied by different workers, using maximum likelihood. The algorithm has two phases: (i) the correct answer for each task is estimated from the multiple labels submitted by different workers, accounting for the quality of each worker, and (ii) each worker's quality is estimated by comparing the worker's responses with the inferred correct answers (see the sketches after this table) [ |
| Naive Bayes (NB) | Following EM, NB models the biases and reliability of individual workers and corrects for them in order to improve the quality of the workers' results. A small amount of gold standard training data labeled by experts is used to correct the individual biases of workers, recalibrating workers' answers so that they match the experts more closely. On average, four non-expert labels per example are needed to emulate expert-level label quality. This idea helps to improve annotation quality [ |
| Observation of the pattern of responses | Inspecting the pattern of answers is another effective way of filtering unreliable responses, since some untrustworthy workers follow a regular pattern, for example, selecting the first choice for every question |
| Probabilistic matrix factorization (PMF) | PMF induces a latent feature vector for each worker and each example and uses them to infer unobserved worker assessments for all examples [ |
| Expert review | Expert review uses experts to evaluate workers [ |
| Contributor evaluation | Workers are evaluated according to quality factors such as reputation, experience, or credentials; if a worker meets them, the requester accepts the worker's tasks. For instance, Wikipedia accepts articles written by administrators without evaluation [ |
| Real-time support | Requesters give workers feedback about the quality of their work in real time, while the task is being completed. This helps workers to amend their work, and results showed that self-assessment and external feedback improve the quality of the task [ |
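As a minimal sketch of majority voting over crowdsourced relevance labels, the snippet below keeps, for each query-document pair, the label chosen by the most workers. The label values are assumed for illustration, and ties are broken arbitrarily.

```python
from collections import Counter

# Hypothetical labels: (query, document) -> list of worker judgments
labels = {
    ("q1", "d1"): ["relevant", "relevant", "not_relevant"],
    ("q1", "d2"): ["not_relevant", "not_relevant", "relevant"],
}

def majority_vote(worker_labels):
    """Return the label given by the largest number of workers (ties broken arbitrarily)."""
    return Counter(worker_labels).most_common(1)[0][0]

consensus = {pair: majority_vote(votes) for pair, votes in labels.items()}
print(consensus)  # {('q1', 'd1'): 'relevant', ('q1', 'd2'): 'not_relevant'}
```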
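The EM-based approach can be illustrated with a heavily simplified loop that alternates between (i) estimating each item's correct label by an accuracy-weighted vote and (ii) re-estimating each worker's accuracy against those estimates. This is only a sketch of the idea, assuming binary labels and a single accuracy value per worker rather than a full confusion-matrix model; the data and the initial 0.8 accuracy are hypothetical.

```python
def em_worker_quality(labels, iterations=10):
    """Alternate between estimating true labels and estimating worker accuracy."""
    workers = {w for votes in labels.values() for w in votes}
    quality = {w: 0.8 for w in workers}   # assumed initial accuracy for every worker
    truth = {}
    for _ in range(iterations):
        # E-step: estimate each item's label by an accuracy-weighted vote
        for item, votes in labels.items():
            score = sum(quality[w] if lab == 1 else -quality[w] for w, lab in votes.items())
            truth[item] = 1 if score >= 0 else 0
        # M-step: a worker's accuracy is their agreement with the current estimates
        for w in workers:
            judged = [(item, votes[w]) for item, votes in labels.items() if w in votes]
            quality[w] = sum(lab == truth[item] for item, lab in judged) / len(judged)
    return truth, quality

# Hypothetical data: item (query-document pair) -> {worker: binary relevance label}
labels = {
    "q1-d1": {"w1": 1, "w2": 1, "w3": 0},
    "q1-d2": {"w1": 0, "w2": 0, "w3": 1},
    "q1-d3": {"w1": 1, "w2": 1, "w3": 0},
}
truth, quality = em_worker_quality(labels)
print(truth)    # {'q1-d1': 1, 'q1-d2': 0, 'q1-d3': 1}
print(quality)  # w1 and w2 near 1.0, w3 near 0.0
```

Unlike plain majority voting, this kind of iteration downweights workers who consistently disagree with the inferred answers, which is how the EM-style methods in the table separate reliable workers from spammers.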