Romena Yasmin, Md Mahmudulla Hassan, Joshua T Grassel, Harika Bhogaraju, Adolfo R Escobedo, Olac Fuentes.
Abstract
This work investigates how different forms of input elicitation obtained from crowdsourcing can be utilized to improve the quality of inferred labels for image classification tasks, where an image must be labeled as either positive or negative depending on the presence/absence of a specified object. Five types of input elicitation methods are tested: binary classification (positive or negative); the (x, y)-coordinate of the position participants believe a target object is located; level of confidence in binary response (on a scale from 0 to 100%); what participants believe the majority of the other participants' binary classification is; and participants' perceived difficulty level of the task (on a discrete scale). We design two crowdsourcing studies to test the performance of a variety of input elicitation methods and utilize data from over 300 participants. Various existing voting and machine learning (ML) methods are applied to make the best use of these inputs. In an effort to assess their performance on classification tasks of varying difficulty, a systematic synthetic image generation process is developed. Each generated image combines items from the MPEG-7 Core Experiment CE-Shape-1 Test Set into a single image using multiple parameters (e.g., density, transparency, etc.) and may or may not contain a target object. The difficulty of these images is validated by the performance of an automated image classification method. Experiment results suggest that more accurate results can be achieved with smaller training datasets when both the crowdsourced binary classification labels and the average of the self-reported confidence values in these labels are used as features for the ML classifiers. Moreover, when a relatively larger properly annotated dataset is available, in some cases augmenting these ML algorithms with the results (i.e., probability of outcome) from an automated classifier can achieve even higher performance than what can be obtained by using any one of the individual classifiers. Lastly, supplementary analysis of the collected data demonstrates that other performance metrics of interest, namely reduced false-negative rates, can be prioritized through special modifications of the proposed aggregation methods.
Keywords: crowdsourcing; human computation; image classification; input elicitations; machine learning
Year: 2022 PMID: 35845435 PMCID: PMC9276979 DOI: 10.3389/frai.2022.848056
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
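The aggregation approach the abstract describes, using the crowd's binary classification labels together with the average self-reported confidence as features for an ML classifier, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the simulated votes and confidences, the feature construction, and the choice of logistic regression are all assumptions.

```python
# Illustrative sketch only (not the authors' code): per-image features are the
# mean of the crowd's binary votes (BCE) and the mean self-reported confidence
# (CE); a logistic-regression classifier then infers the image label.
# All data below are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_images, n_workers = 60, 15
true_labels = rng.integers(0, 2, n_images)          # 1 = target object present

# Simulated worker responses: votes flipped with 25% probability,
# confidences self-reported on a 0-100% scale.
flips = rng.random((n_images, n_workers)) < 0.25
votes = np.abs(true_labels[:, None] - flips.astype(int))
confidence = rng.uniform(50, 100, (n_images, n_workers))

# Feature matrix: [mean vote, mean confidence].
X = np.column_stack([votes.mean(axis=1), confidence.mean(axis=1) / 100.0])

clf = LogisticRegression().fit(X[:40], true_labels[:40])
print("held-out accuracy:", clf.score(X[40:], true_labels[40:]))
```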
Figure 1. Object/shape templates from the MPEG-7 Core Experiment CE-Shape-1 Test Set.
Figure 2. Image classification task UI for the balanced dataset; the image contains a bat (lower right).
Figure 3. Image classification task UI for the imbalanced dataset; the image contains a bat (center left).
Summary of experiment image parameters. (Column headings and some cell values did not survive extraction: blank cells mark lost or row-spanned values, and "…" marks truncated entries.)

| Set | Exp. | | | | | Target |
|---|---|---|---|---|---|---|
| SetA | #1 | 16 | {100, 120, …} | {…} | Discrete: {4} | Bat |
| | #2 | | | | | Butterfly |
| | #3 | | | | | Apple |
| | #4 | | | | | Stingray |
| SetB | #5 | 24 | {80} | {…} | Discrete: {1,…,6} | Bat |
| | #6 | | {80, 100, 120} | {…} | | Turtle |
| | #7 | | {100, 150} | {…} | | Various-7 |
| SetC | #8 | 40 | {90, 100, 115, 150} | {…} | Discrete: {4} | Bat |
| | #9 | | | | | |
| | #10 | | | | | |
| SetD | #11 | | | | | |
| | #12 | | | | | |
| | #13 | | | | | |
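The generation process this table parameterizes, compositing shape templates into a single image at varying sizes and transparency levels, with or without the target shape, might look roughly like the sketch below. The stand-in ellipses, canvas size, and parameter ranges are assumptions; the actual generator draws from the MPEG-7 CE-Shape-1 template files, which are not reproduced here.

```python
# Minimal sketch of the synthetic-image idea: composite several shapes into
# one image with varying size and transparency, optionally including the
# target shape. Stand-in ellipses replace the real MPEG-7 templates.
from PIL import Image, ImageDraw
import random

random.seed(0)

def paste_shape(canvas, size, alpha, xy):
    """Draw one stand-in shape on an RGBA layer and alpha-composite it."""
    layer = Image.new("RGBA", canvas.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(layer)
    x, y = xy
    draw.ellipse([x, y, x + size, y + size], fill=(0, 0, 0, alpha))
    return Image.alpha_composite(canvas, layer)

def make_image(n_distractors=6, sizes=(80, 100, 120), with_target=True):
    canvas = Image.new("RGBA", (500, 500), (255, 255, 255, 255))
    for _ in range(n_distractors):
        canvas = paste_shape(canvas,
                             random.choice(sizes),
                             random.randint(90, 255),   # transparency level
                             (random.randint(0, 380), random.randint(0, 380)))
    if with_target:  # the target template (e.g., the bat shape) would go here
        canvas = paste_shape(canvas, random.choice(sizes), 255,
                             (random.randint(0, 380), random.randint(0, 380)))
    return canvas.convert("RGB")

make_image().save("synthetic_example.png")
```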
Figure 4. Distribution of binary classification results from crowdsourced data. (A) Balanced dataset. (B) Imbalanced dataset.
Performance analysis of voting methods for balanced dataset. (Column headings did not survive extraction.)

| | | | | | | |
|---|---|---|---|---|---|---|
| Experiment Set A | 0.73 | 0.53 | 0.81 | 0.34 | 0.45 | 0.94 |
| Experiment Set B | 0.71 | 0.53 | 0.74 | 0.47 | 0.53 | 0.92 |
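For reference, two aggregation rules of the kind compared above, plain majority voting over the binary responses and a confidence-weighted variant, can be sketched as follows; the exact voting schemes and weightings evaluated in the paper may differ.

```python
# Sketch of two crowd-aggregation rules; illustrative, not the paper's exact schemes.
import numpy as np

def majority_vote(votes):
    """votes: (n_images, n_workers) array of 0/1 -> one 0/1 label per image."""
    return (votes.mean(axis=1) >= 0.5).astype(int)

def confidence_weighted_vote(votes, conf):
    """Weight each vote by its self-reported confidence (0-100%)."""
    signed = 2 * votes - 1                     # map {0, 1} -> {-1, +1}
    return ((signed * conf).sum(axis=1) >= 0).astype(int)

votes = np.array([[1, 1, 0], [0, 1, 0]])
conf = np.array([[90, 60, 55], [80, 51, 95]])
print(majority_vote(votes))                    # -> [1 0]
print(confidence_weighted_vote(votes, conf))   # -> [1 0]
```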
Performance analysis of crowdsourcing-based ML methods for balanced dataset. (Column headings, bold highlighting, and some cell values were lost in extraction; blank cells mark lost values. The 12 value columns correspond to three metrics for each of the four ML classifiers evaluated.)

| Input combination | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Experiment Set A** | | | | | | | | | | | | |
| BCE | 0.83 | | 0.87 | | | | 0.83 | | 0.86 | | | 0.90 |
| BCE-CE | | 0.22 | | 0.86 | 0.19 | 0.89 | | 0.22 | | 0.86 | 0.19 | 0.91 |
| BCE-SE | 0.84 | 0.22 | 0.85 | 0.88 | | 0.91 | 0.83 | 0.22 | 0.87 | 0.86 | 0.19 | 0.92 |
| BCE-GME | | 0.19 | 0.87 | 0.86 | | 0.91 | 0.83 | | 0.83 | 0.88 | | 0.91 |
| BCE-CE-SE | 0.81 | 0.31 | 0.86 | 0.88 | 0.19 | 0.88 | | 0.22 | 0.91 | | | 0.91 |
| BCE-CE-GME | 0.80 | 0.25 | 0.82 | 0.84 | 0.19 | 0.90 | 0.83 | 0.22 | 0.89 | 0.84 | 0.19 | 0.90 |
| BCE-CE-SE-GME | | 0.25 | 0.82 | 0.83 | 0.19 | 0.90 | 0.83 | 0.22 | 0.89 | 0.86 | 0.19 | 0.89 |
| **Experiment Set B** | | | | | | | | | | | | |
| BCE | 0.75 | 0.28 | 0.79 | 0.81 | 0.28 | 0.74 | 0.75 | 0.31 | 0.76 | 0.74 | 0.42 | 0.85 |
| BCE-CE | 0.78 | 0.28 | | 0.81 | 0.25 | 0.88 | 0.75 | | | | | 0.85 |
| BCE-SE | | | 0.81 | 0.68 | 0.42 | 0.55 | 0.74 | 0.31 | 0.78 | 0.74 | 0.44 | 0.80 |
| BCE-GME | 0.75 | 0.31 | 0.78 | 0.76 | 0.31 | | 0.68 | 0.33 | 0.74 | 0.72 | 0.42 | |
| BCE-CE-SE | 0.76 | 0.22 | 0.79 | | | | 0.74 | 0.25 | 0.80 | 0.72 | 0.47 | 0.85 |
| BCE-CE-GME | 0.76 | 0.31 | 0.81 | 0.78 | 0.28 | 0.80 | | | 0.79 | 0.78 | 0.31 | 0.86 |
| BCE-CE-SE-GME | 0.72 | 0.36 | 0.82 | 0.78 | 0.31 | 0.87 | 0.72 | 0.31 | 0.79 | 0.74 | 0.47 | 0.83 |

Bold values denote best performance among the different input elicitation combinations for each crowdsourcing-based ML method.
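The row labels above are concatenations of elicitation types used as feature blocks (reading the abbreviations against the abstract's five elicitation types: BCE = binary classification, CE = confidence, SE = spatial (x, y) position, GME = guess of the majority's answer, and, in later tables, PDE = perceived difficulty). A minimal sketch of assembling such a combination into a feature matrix, with hypothetical per-image aggregates, follows.

```python
# Sketch: turning an input-elicitation combination (e.g., "BCE-CE-SE") into a
# feature matrix. The per-image aggregates below are illustrative assumptions,
# not the authors' exact features.
import numpy as np

n = 50
aggregates = {
    "BCE": np.random.rand(n),      # fraction of positive binary votes
    "CE":  np.random.rand(n),      # mean self-reported confidence
    "SE":  np.random.rand(n, 2),   # mean reported (x, y) position
    "GME": np.random.rand(n),      # fraction guessing a positive majority
}

def build_features(combo):
    """combo: e.g., 'BCE-CE-SE' -> stacked feature matrix of shape (n, d)."""
    return np.column_stack([aggregates[k] for k in combo.split("-")])

X = build_features("BCE-CE-SE")    # shape (50, 4)
```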
Performance analysis of voting methods for imbalanced dataset. (Column headings did not survive extraction.)

| | | | | |
|---|---|---|---|---|
| Experiment Set C | 0.77 | 0.38 | 0.77 | 0.25 |
| Experiment Set D | 0.53 | 0.58 | 0.52 | 0.50 |
Performance analysis of crowdsourcing-based ML methods for imbalanced datasets. (Column headings, bold highlighting, and some cell values were lost in extraction; blank cells mark lost values. The 12 value columns correspond to three metrics for each of the four ML classifiers evaluated.)

| Input combination | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Experiment Set C** | | | | | | | | | | | | |
| BCE | 0.73 | 0.38 | 0.82 | 0.78 | 0.33 | 0.79 | 0.73 | 0.33 | 0.81 | 0.73 | 0.38 | 0.92 |
| BCE-CE | 0.75 | 0.38 | 0.89 | 0.81 | 0.29 | 0.90 | 0.78 | 0.33 | 0.86 | 0.80 | 0.33 | 0.90 |
| BCE-SE | | | 0.83 | | | | 0.76 | | 0.87 | | | 0.86 |
| BCE-PDE | | | 0.83 | 0.76 | 0.33 | 0.94 | 0.68 | 0.38 | 0.81 | 0.77 | 0.38 | 0.90 |
| BCE-CE-SE | 0.76 | 0.33 | | 0.81 | 0.29 | 0.92 | 0.77 | | | 0.81 | 0.29 | 0.88 |
| BCE-CE-PDE | | | 0.90 | 0.81 | 0.29 | 0.86 | 0.79 | | 0.86 | 0.80 | 0.33 | 0.90 |
| BCE-CE-SE-PDE | | | | 0.81 | 0.29 | 0.86 | | | | 0.81 | 0.29 | 0.86 |
| **Experiment Set D** | | | | | | | | | | | | |
| BCE | 0.53 | | 0.59 | 0.55 | | | 0.36 | 0.58 | 0.64 | 0.61 | | 0.85 |
| BCE-CE | | | | 0.54 | 0.42 | 0.83 | | | | | 0.50 | |
| BCE-SE | | | 0.62 | 0.46 | 0.50 | 0.84 | 0.36 | 0.58 | 0.65 | 0.63 | 0.50 | 0.80 |
| BCE-PDE | 0.50 | 0.67 | 0.67 | | | 0.85 | 0.47 | 0.67 | 0.78 | 0.56 | | 0.86 |
| BCE-CE-SE | | | 0.72 | 0.52 | 0.42 | | 0.53 | 0.58 | 0.77 | | 0.50 | 0.84 |
| BCE-CE-PDE | 0.44 | 0.67 | 0.68 | 0.56 | 0.42 | 0.73 | 0.53 | 0.58 | | 0.63 | 0.50 | |
| BCE-CE-SE-PDE | 0.56 | | 0.74 | 0.52 | 0.42 | 0.84 | 0.44 | 0.67 | 0.78 | | 0.50 | 0.85 |

Bold values denote best performance among the different input elicitation combinations for each crowdsourcing-based ML method.
Figure 5. Change in FNR/FPR of different aggregation methods under varying thresholds. (A) Experiment Set A, (B) Experiment Set B, (C) Experiment Set C, (D) Experiment Set D.
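The threshold analysis in Figure 5 follows from a standard trade-off: sweeping the decision threshold of an aggregated score moves errors between false negatives and false positives, so false-negative rates can be prioritized by lowering the threshold for a positive call. A small illustration with toy scores (all names and data below are hypothetical):

```python
# Sketch of the FNR/FPR threshold trade-off behind Figure 5.
import numpy as np

def fnr_fpr(scores, labels, threshold):
    """Compute false-negative and false-positive rates at a given threshold."""
    pred = (scores >= threshold).astype(int)
    pos, neg = labels == 1, labels == 0
    fnr = (pred[pos] == 0).mean() if pos.any() else 0.0
    fpr = (pred[neg] == 1).mean() if neg.any() else 0.0
    return fnr, fpr

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 200)
# Toy aggregated crowd scores, noisy around the true label.
scores = np.clip(labels + rng.normal(0, 0.35, 200), 0, 1)

for t in (0.3, 0.5, 0.7):
    fnr, fpr = fnr_fpr(scores, labels, t)
    print(f"threshold={t:.1f}  FNR={fnr:.2f}  FPR={fpr:.2f}")
```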
Performance analysis of crowdsourcing-based ML methods with expanded inputs from ResNet-50. (Column headings, bold highlighting, some cell values, and the third block's set label were lost in extraction; blank cells mark lost values and "–" marks not-applicable cells. Rows labeled RC report the stand-alone ResNet-50 classifier trained on the stated number of images; the 12 rightmost value columns correspond to three metrics for each of the four ML classifiers, as in the preceding tables.)

| Inputs | Training images | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Experiment Set C** | | | | | | | | | | | | | | | | |
| BCE-CE-SE-PDE* | – | – | – | – | 0.81 | 0.29 | 0.92 | 0.81 | 0.29 | 0.86 | 0.81 | 0.29 | 0.90 | 0.81 | 0.29 | 0.86 |
| RC | 10k | 0.36 | 0.21 | 0.67 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.73 | 0.38 | 0.85 | | 0.25 | | 0.70 | 0.38 | 0.89 | 0.75 | 0.38 | 0.92 |
| BCE-CE-RC | | – | – | – | 0.70 | 0.42 | 0.89 | 0.77 | 0.25 | 0.88 | 0.74 | 0.33 | 0.86 | 0.80 | 0.33 | 0.91 |
| RC | 30k | 0.71 | 0.29 | 0.92 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.77 | 0.38 | 0.83 | 0.75 | | 0.92 | 0.78 | 0.33 | 0.89 | 0.76 | 0.33 | 0.92 |
| BCE-CE-RC | | – | – | – | 0.75 | 0.38 | 0.81 | 0.73 | | 0.91 | 0.78 | 0.33 | 0.88 | 0.80 | 0.33 | |
| RC | 50k | 0.87 | 0.04 | 0.99 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.80 | 0.25 | 0.95 | | 0.04 | 0.97 | 0.84 | 0.21 | 0.97 | | 0.13 | 0.98 |
| BCE-CE-RC | | – | – | – | 0.82 | 0.25 | 0.92 | | 0.04 | 0.98 | 0.84 | 0.21 | 0.97 | | 0.13 | 0.97 |
| RC | 70k | 0.90 | 0.08 | 0.99 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | | 0.13 | 0.98 | | | 0.99 | | 0.13 | 0.98 | | | 0.99 |
| BCE-CE-RC | | – | – | – | | 0.13 | 0.98 | 0.88 | | | | 0.13 | 0.98 | | | 0.99 |
| RC | 90k | 0.96 | 0.00 | 1.00 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | | 0.04 | 0.98 | 0.94 | 0.04 | 0.96 | | 0.04 | 0.97 | | 0.04 | 0.99 |
| BCE-CE-RC | | – | – | – | | 0.04 | 0.98 | 0.90 | 0.04 | 0.97 | | 0.04 | 0.97 | | 0.04 | 0.99 |
| **Experiment Set D** | | | | | | | | | | | | | | | | |
| BCE-CE* | – | – | – | – | 0.59 | 0.58 | 0.76 | 0.54 | 0.42 | 0.83 | 0.63 | 0.50 | 0.79 | 0.67 | 0.50 | 0.87 |
| RC | 10k | 0.17 | 0.33 | 0.62 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.59 | 0.58 | 0.67 | 0.44 | | 0.87 | 0.44 | 0.67 | 0.73 | 0.11 | 0.42 | 0.78 |
| BCE-CE-RC | | – | – | – | 0.56 | 0.58 | 0.69 | 0.43 | 0.33 | 0.84 | 0.63 | 0.50 | 0.78 | 0.63 | 0.50 | 0.86 |
| RC | 30k | 0.50 | 0.42 | 0.87 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.50 | 0.67 | 0.67 | 0.43 | | 0.87 | 0.59 | 0.58 | 0.74 | 0.63 | 0.50 | 0.85 |
| BCE-CE-RC | | – | – | – | 0.50 | 0.67 | 0.64 | 0.47 | 0.42 | | 0.56 | 0.58 | 0.80 | 0.67 | 0.50 | 0.87 |
| RC | 50k | 0.79 | 0.08 | 0.96 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.74 | 0.42 | 0.90 | 0.71 | 0.17 | 0.91 | 0.70 | 0.42 | 0.88 | | 0.17 | 0.96 |
| BCE-CE-RC | | – | – | – | 0.70 | 0.42 | 0.91 | 0.69 | 0.17 | 0.90 | 0.74 | 0.42 | 0.86 | | 0.17 | 0.91 |
| RC | 70k | 0.83 | 0.17 | 0.98 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | | 0.17 | 0.96 | | | 0.92 | | 0.17 | 0.94 | | | 0.92 |
| BCE-CE-RC | | – | – | – | | 0.17 | 0.96 | | | 0.92 | | 0.17 | 0.93 | | | 0.92 |
| RC | 90k | 0.96 | 0.08 | 0.98 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.96 | 0.08 | 0.96 | 0.92 | 0.08 | 0.94 | 0.91 | 0.17 | 0.94 | 0.96 | 0.08 | 0.95 |
| BCE-CE-RC | | – | – | – | 0.96 | 0.08 | 0.96 | 0.92 | 0.08 | 0.95 | 0.91 | 0.17 | 0.94 | 0.96 | 0.08 | 0.92 |
| ***(set label lost)*** | | | | | | | | | | | | | | | | |
| BCE-CE* | – | – | – | – | 0.68 | 0.47 | 0.83 | 0.73 | 0.33 | 0.90 | 0.72 | 0.42 | 0.85 | 0.76 | 0.39 | 0.90 |
| RC | 10k | 0.27 | 0.25 | 0.65 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.67 | 0.50 | 0.83 | 0.64 | 0.25 | 0.88 | 0.67 | 0.39 | 0.86 | 0.71 | 0.39 | |
| BCE-CE-RC | | – | – | – | 0.69 | 0.44 | 0.86 | 0.68 | 0.25 | 0.90 | 0.69 | 0.44 | 0.84 | 0.76 | 0.39 | |
| RC | 30k | 0.63 | 0.33 | 0.90 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | 0.71 | 0.42 | 0.87 | 0.66 | | | | | | 0.72 | 0.42 | |
| BCE-CE-RC | | – | – | – | 0.72 | 0.42 | 0.84 | 0.64 | | | 0.75 | 0.39 | | 0.72 | 0.42 | |
| RC | 50k | 0.84 | 0.06 | 0.98 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | | 0.19 | 0.96 | | 0.08 | 0.97 | | 0.11 | 0.96 | | 0.14 | 0.97 |
| BCE-CE-RC | | – | – | – | | 0.22 | 0.94 | | 0.08 | 0.96 | | 0.22 | 0.96 | | 0.14 | 0.98 |
| RC | 70k | 0.88 | 0.11 | 0.99 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | | | 0.97 | | | 0.94 | | | 0.96 | | | 0.97 |
| BCE-CE-RC | | – | – | – | | | 0.97 | | | 0.97 | | | 0.96 | | | 0.97 |
| RC | 90k | 0.96 | 0.03 | 0.99 | – | – | – | – | – | – | – | – | – | – | – | – |
| BCE-RC | | – | – | – | | 0.06 | 0.97 | 0.93 | 0.06 | 0.98 | | 0.06 | 0.96 | | 0.06 | 0.98 |
| BCE-CE-RC | | – | – | – | | 0.06 | 0.97 | 0.92 | 0.06 | 0.94 | | 0.06 | 0.97 | | 0.06 | 0.98 |

*Denotes the input combinations that achieved the best performance among the crowdsourcing-based ML methods. Bold values denote cases where the hybrid method outperforms both the ResNet-50 classifier and the crowdsourcing-based ML methods.
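The hybrid rows above (BCE-RC, BCE-CE-RC) augment the crowdsourced features with the automated classifier's output. Below is a minimal sketch of that construction, appending a ResNet-style predicted probability to the crowd features before fitting the crowd-side model; the simulated arrays and logistic regression stand in for the paper's actual features and classifiers.

```python
# Sketch of the hybrid setup: the automated classifier's predicted probability
# (RC, e.g., from a trained ResNet-50) is appended to the crowdsourced
# features. `resnet_prob` stands in for real model outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 120
labels = rng.integers(0, 2, n)

bce = np.clip(labels + rng.normal(0, 0.3, n), 0, 1)      # crowd vote fraction
ce = rng.uniform(0.5, 1.0, n)                            # mean confidence
resnet_prob = np.clip(labels + rng.normal(0, 0.2, n), 0, 1)

X_crowd = np.column_stack([bce, ce])                     # BCE-CE
X_hybrid = np.column_stack([bce, ce, resnet_prob])       # BCE-CE-RC

for name, X in [("BCE-CE", X_crowd), ("BCE-CE-RC", X_hybrid)]:
    clf = LogisticRegression().fit(X[:80], labels[:80])
    print(name, "accuracy:", round(clf.score(X[80:], labels[80:]), 2))
```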