Fabienne Lind, Maria Gruber, Hajo G Boomgaarden.
Abstract
Crowdsourcing platforms are commonly used for research in the humanities, social sciences and informatics, including the use of crowdworkers to annotate textual material or visuals. Drawing on two empirical studies, this article systematically assesses the potential of crowdcoding for less manifest contents of news texts, here focusing on political actor evaluations. Specifically, Study 1 compares the reliability and validity of crowdcoded data to that of manual content analyses; Study 2 investigates the effects of material presentation, different types of coding instructions and answer option formats on data quality. We find that the performance of the crowd recommends crowdcoded data as a reliable and valid alternative to manually coded data, even for less manifest contents. While scale manipulations affected the results, minor modifications of the coding instructions or material presentation did not significantly influence data quality. In sum, crowdcoding appears to be a robust instrument for collecting quantitative content data.
Year: 2017 PMID: 29118893 PMCID: PMC5652897 DOI: 10.1080/19312458.2017.1317338
Source DB: PubMed Journal: Commun Methods Meas ISSN: 1931-2458
Overview of the data sources crowdworkers, offline coders and AUTNES coders [study 1].
| Data source | Number of coders | Number of sentences | Judgments per sentence | Setting | Training |
|---|---|---|---|---|---|
| Crowdworker | 158 | 500 | 10 | Online, no control a | Indirect via test questions, short introduction b |
| Offline coders | 5 | 500 | 5 | Offline, control | No training, short introduction b |
| AUTNES coders | 7 | 250 | 1 | Offline, control | Task specific training, comprehensive instruction |
Note. For more details on AUTNES please refer to Kleinen-von Königslöw et al. (2016).
a “No control” means that, as the coding was done online, we have no information about the diligence or surroundings of the crowdworkers while they were completing the task.
b The short introduction included explanations on the two-step evaluation task as well as on the fact that coders should focus on the evaluation of the actor marked with X. It was identical for the crowdworkers and the offline coders and is available on request.
Intercoder reliability of crowdworkers, offline coders and AUTNES coders (Krippendorff’s alpha) [study 1].
| Data source | Evaluation dichotomous | Evaluation tendency 5-point scale | Evaluation tendency 3-point scale |
|---|---|---|---|
| Crowdworkers a | 0.27 | 0.60 | |
| Crowdworkers Top 5 b | 0.31 | 0.66 | |
a Coders per sentence = 10; in total 158 different coders contributed; each of them coded up to 40 different sentences; hence, the coders were not identical for each sentence, which is an uncommon setup for ICR measures.
b Coders per sentence = 5; in total 116 different coders contributed; the Crowdworkers Top 5 is a subsample of the full crowdcoded sample; here we selected the five ‘best’ answers (those coded by the workers with the highest trust scores) per sentence.
c Coders per sentence = 3; in total 97 different coders contributed; the Crowdworkers Top 3 is a subsample of the full crowdcoded sample; here we selected the three ‘best’ answers (those coded by the workers with the highest trust scores) per sentence.
d Coders per sentence = 5; five different coders contributed, each coder coded all 500 sentences.
e Coders per sentence = 7; reliability test procedure and results for variable V31 “object evaluation” reported in the AUTNES documentation (Kleinen-von Königslöw et al., 2016). For AUTNES overall, more than 50,000 sentences were coded by seven coders; the ICR measures were calculated based on their coding of 790 sentences.
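Because the crowdworkers were not identical for each sentence, an ICR measure for this setup has to tolerate a varying set and number of coders per unit. The following is a minimal sketch of Krippendorff’s alpha computed from a coincidence matrix, which handles such data. It is an illustration under stated assumptions, not the procedure used in the study: the function name `krippendorff_alpha`, the two supported metrics, and the toy ratings are all hypothetical.

```python
from collections import Counter
from itertools import product


def krippendorff_alpha(units, metric="nominal"):
    """Krippendorff's alpha for a list of units, each a list of codes.

    Units may contain different numbers of codes (different coders per
    sentence). Units with fewer than two codes are not pairable and are
    skipped. `metric` is "nominal" or "interval".
    """
    if metric == "nominal":
        def delta(c, k):
            return 0.0 if c == k else 1.0
    elif metric == "interval":
        def delta(c, k):
            return float(c - k) ** 2
    else:
        raise ValueError("metric must be 'nominal' or 'interval'")

    # Coincidence matrix: each unit contributes its ordered value pairs,
    # weighted by 1 / (number of codes in the unit - 1).
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue
        counts = Counter(values)
        for c, k in product(counts, repeat=2):
            pairs = counts[c] * (counts[k] - (1 if c == k else 0))
            coincidences[(c, k)] += pairs / (m - 1)

    # Marginal totals per value and overall number of pairable codes.
    totals = Counter()
    for (c, _k), o in coincidences.items():
        totals[c] += o
    n = sum(totals.values())
    if n < 2:
        return float("nan")

    observed = sum(o * delta(c, k) for (c, k), o in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * delta(c, k) for c in totals for k in totals
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # no variation in the data, alpha is undefined
    return 1.0 - observed / expected


# Toy data (invented for illustration, not from the study):
# each inner list holds the codes that one sentence received.
toy_dichotomous = [[1, 1, 0], [0, 0], [1, 1, 1, 1], [1, 0, 0]]
print(krippendorff_alpha(toy_dichotomous, metric="nominal"))
```

The interval metric is only one possible choice for the 5-point tendency scale; an ordinal metric would need a different distance function and is not covered by this sketch.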
Ratings of crowdworkers, offline coders, and AUTNES coders [study 1].
| | Evaluation dichotomous a | | Evaluation tendency 5-point scale b | | Evaluation tendency 3-point scale c | |
|---|---|---|---|---|---|---|
| Data source | Mean (SD) | n | Mean (SD) | n | Mean (SD) | n |
| Crowdworkers | 0.74 (0.44) | 5059 | −0.30 (1.21) | 3744 | −0.20 (0.77) | 1981 |
| Crowdworkers weighted according to their trust score | 0.74 (0.44) | 4538 | −0.30 (1.21) | 3361 | −0.20 (0.77) | 1768 |
| Offline coders | 0.62 (0.49) | 2489 | −0.60 (1.28) | 1533 | −0.31 (0.74) | 979 |
| AUTNES coders | | | | | −0.14 (0.69) | 196 |
a Coded as: 0 = neutral-no evaluation, 1 = evaluation.
b Coded as: -2 = explicitly negative, -1 = rather negative, 0 = mixed (both positive and negative), 1 = rather positive, 2 = explicitly positive.
c Coded as: -1 = negative (refers to “criticism” in AUTNES), 0 = neutral or mixed (both positive and negative) (refers to “neutral” in AUTNES), 1 = positive (refers to “approval” in AUTNES).
We ordered 10 judgments per sentence from CrowdFlower. We received 10 to 13 assessments per sentence, in sum 5,059 instead of 5,000. We included all to exhaust the crowd’s potential.
We asked the 5 offline coders to evaluate 500 sentences. Due to missing data (11 annotations) we received 2,489 evaluations instead of 2,500.
The second item (evaluation tendency 5-point scale) was only displayed to those that selected ‘evaluation’ with regard to the first item (evaluation dichotomous).
To compare crowdworkers, offline coders and AUTNES coders we selected only the ratings for the 196 sentences with concurring targets of opinion for this analysis.
Experimental manipulations of material presentation, coding instructions, and answer option formats [study 2].
| Material presentation | Coding instructions | Answer option formats | ||
|---|---|---|---|---|
| Question formats | Guiding comment | |||
| Baseline condition: | ||||
| Actor_X/Party_X | Look for an explicit evaluation; take the perspective of the evaluated subject; | Please do not overuse neutral-no evaluation | ||
| Other conditions: introduced in the column of the factor to be contrasted with the baseline, while the respective two other factors are held constant with the baseline | ||||
| M1: | I1: | I4: | A1: | |
| M2: | I2: | A2: | ||
| I3: | A3: | |||
| A4: | ||||
| A5: | ||||
Ratings of crowdworkers (N = 510) for 12 conditions that vary with regard to material presentation, coding instructions and answer option formats [study 2].
| | Evaluation dichotomous a | | Evaluation tendency 5-point scale b | | Evaluation tendency 3-point scale c | |
|---|---|---|---|---|---|---|
| Condition | Mean (SD) | n | Mean (SD) | n | Mean (SD) | n |
| Baseline | .60 (0.49) | 1380 | −0.37 (1.20) | 821 | −0.16 (0.68) | 1380 |
| M1 | .67 (0.47) | 1440 | −0.38 (1.25) | 970 | −0.20 (0.72) | 1440 |
| M2 | .65 (0.48) | 1350 | −0.40 (1.21) | 873 | −0.19 (0.71) | 1350 |
| I1 | .62 (0.49) | 1296 | −0.30 (1.28) | 797 | −0.14 (0.71) | 1296 |
| I2 | .63 (0.48) | 1320 | −0.39 (1.22) | 834 | −0.19 (0.69) | 1320 |
| I3 | .66 (0.48) | 1306 | −0.29 (1.23) | 857 | −0.15 (0.71) | 1306 |
| I4 | .58 (0.49) | 1303 | −0.36 (1.24) | 755 | −0.17 (0.69) | 1303 |
| A1 d | .59 (0.49) | 1230 | −0.51 (1.25) | 729 | −0.22 (0.69) | 1230 |
| A2 | .64 (0.48) | 1049 | −0.36 (1.21) | 672 | −0.17 (0.69) | 1049 |
| A3 | .67 (0.47) | 1020 | −0.49 (1.26) | 688 | −0.22 (0.73) | 1020 |
| A4 d,e | .76 (0.42) | 1200 | −0.32 (1.26) | 917 | −0.19 (0.77) | 1200 |
| A5 | | | | | −0.11 (0.82) | 1330 |
a Coded as: 1 = evaluation, 0 = neutral-no evaluation.
b Coded as: -2 = explicitly negative, -1 = rather negative, 0 = mixed (both positive and negative), 1 = rather positive, 2 = explicitly positive.
c Coded as: -1 = negative, 0 = neutral or mixed (both positive and negative), 1 = positive.
d Recoded to a 3-point scale for a comparison with A5.
e Recoded to two scales for a comparison with A1.
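The recoding to a 3-point scale mentioned in the table notes can be illustrated with a small helper. This is a sketch under an assumption: the source gives the category labels of both scales but not the exact mapping, so the collapse used here (negative codes to -1, positive codes to 1, “mixed” and “neutral-no evaluation” to 0) is inferred from those labels. The function name `recode_to_three_point` is hypothetical.

```python
def recode_to_three_point(is_evaluation, tendency=None):
    """Collapse the two-step answer into a single 3-point code.

    is_evaluation: 1 = evaluation, 0 = neutral-no evaluation.
    tendency: -2..2 on the 5-point scale; only present if is_evaluation == 1.
    Assumed mapping (not confirmed by the source):
      -2, -1 -> -1 (negative); 0 or no evaluation -> 0; 1, 2 -> 1 (positive).
    """
    if is_evaluation == 0:
        return 0
    if tendency is None:
        raise ValueError("an evaluation needs a tendency code")
    if tendency < 0:
        return -1
    if tendency > 0:
        return 1
    return 0


# Every possible two-step answer and its recoded 3-point value.
for tendency in (-2, -1, 0, 1, 2):
    print(tendency, recode_to_three_point(1, tendency))
print("no evaluation", recode_to_three_point(0))
```

Treating “neutral-no evaluation” as 0 matches the observation that the 3-point columns report the same number of judgments as the dichotomous columns in this table.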
Initial best practice recommendations for the use of crowdsourcing platforms for quantitative content analysis [study 1 and 2].
| Validity and Reliability |
|---|
| Assess data reliability for all cases. If reliability scores are low, assess empirical validity by comparing aggregated data to a manually coded gold-standard subsample. |
| Data aggregation weighted by workers’ trust scores should be preferred over non-weighted aggregation. |
| Crowdcoded content data may produce more reliable and valid results for scale ratings than for nominal answer options. |
| Do not anonymize targets of opinion; variation in your crowd will prevent non-valid responses. |
| Keep tasks separate; try not to save money by demanding different judgments in one step. |
| Quality control |
| Apply test questions to monitor crowdworkers’ performance on your job and to be able to sort out those who work poorly. |
| Apply test questions that are representative of your task to train your crowdworkers while they are working on the job. |
| Check if there are workers who constantly select the same combination of answers (straightliners). |
Note. For the many further questions that might come up when working with CrowdFlower, we recommend the CrowdFlower guides, documentation, and customer service.
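Two of the recommendations above, trust-weighted aggregation and the check for straightliners, can be sketched in a few lines. This is a hedged illustration, not the platform’s own aggregation routine: the function names, the input layout, and the `min_judgments` threshold are assumptions made for the example.

```python
from collections import defaultdict


def weighted_majority(judgments):
    """Aggregate the judgments for one sentence, weighted by trust score.

    judgments: iterable of (answer, trust_score) pairs.
    Returns the answer with the largest summed trust; on a tie, the
    answer that appeared first wins.
    """
    weight = defaultdict(float)
    for answer, trust in judgments:
        weight[answer] += trust
    return max(weight, key=weight.get)


def find_straightliners(answers_by_worker, min_judgments=10):
    """Flag workers who always chose the same combination of answers.

    answers_by_worker: dict mapping a worker id to the list of answer
    combinations (e.g. tuples) that worker submitted.
    min_judgments is a hypothetical cut-off so that workers with very
    few judgments are not flagged by chance.
    """
    flagged = []
    for worker, answers in answers_by_worker.items():
        if len(answers) >= min_judgments and len(set(answers)) == 1:
            flagged.append(worker)
    return flagged


# Toy data (invented for illustration): three judgments for one sentence.
print(weighted_majority([(1, 0.9), (0, 0.6), (0, 0.5)]))
```

A flagged worker is only a candidate for closer inspection; identical answers can also be correct when a worker happens to receive very similar sentences.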