Peter Washington1, Haik Kalantarian2, Jack Kent2, Arman Husic2, Aaron Kline2, Emilie Leblanc2, Cathy Hou3, Cezmi Mutlu4, Kaitlyn Dunlap5, Yordan Penev2, Nate Stockham6, Brianna Chrisman1, Kelley Paskov7, Jae-Yoon Jung2, Catalin Voss3, Nick Haber8, Dennis P Wall9. 1. Department of Bioengineering, Stanford University. 2. Department of Pediatrics (Systems Medicine), Stanford University. 3. Department of Computer Science, Stanford University. 4. Department of Electrical Engineering, Stanford University. 5. Department of Pediatrics (Systems Medicine). 6. Department of Neuroscience, Stanford University. 7. Department of Biomedical Data Science, Stanford University. 8. Graduate School of Education, Stanford University. 9. Departments of Pediatrics (Systems Medicine), Biomedical Data Science, and Psychiatry and Behavioral Sciences, Stanford University.
Abstract
Background/Introduction: Emotion detection classifiers traditionally predict discrete emotions. However, emotion expressions are often subjective, thus requiring a method to handle compound and ambiguous labels. We explore the feasibility of using crowdsourcing to acquire reliable soft-target labels and evaluate an emotion detection classifier trained with these labels. We hypothesize that training with labels that are representative of the diversity of human interpretation of an image will result in predictions that are similarly representative on a disjoint test set. We also hypothesize that crowdsourcing can generate distributions which mirror those generated in a lab setting.
Methods: We center our study on the Child Affective Facial Expression (CAFE) dataset, a gold standard collection of images depicting pediatric facial expressions along with 100 human labels per image. To test the feasibility of crowdsourcing to generate these labels, we used Microworkers to acquire labels for 207 CAFE images. We evaluate both unfiltered workers and workers selected through a short crowd filtration process. We then train two versions of a ResNet-152 neural network using the original 100 annotations provided with the dataset: (1) a classifier trained with traditional one-hot encoded labels, and (2) a classifier trained with vector labels representing the distribution of CAFE annotator responses. We compare the resulting softmax output distributions of the two classifiers with a two-sample independent t-test of L1 distances between each classifier's output probability distribution and the distribution of human labels.
Results: While agreement with CAFE is weak for unfiltered crowd workers, the filtered crowd agrees with the CAFE labels 100% of the time for "happy", "neutral", "sad", and "fear + surprise", and 88.8% of the time for "anger + disgust". While the F1-score of the one-hot encoded classifier is much higher (94.33% vs. 78.68%) with respect to the ground truth CAFE labels, the output probability vector of the crowd-trained classifier more closely resembles the distribution of human labels (t=3.2827, p=0.0014).
Conclusions: For many applications of affective computing, reporting an emotion probability distribution that accounts for the subjectivity of human interpretation can be more useful than an absolute label. Crowdsourcing, with a sufficient filtering mechanism for selecting reliable crowd workers, is a feasible solution for acquiring soft-target labels.
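The label construction and evaluation described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the helper names and the example annotation counts are assumptions, not the authors' code): annotator counts are normalized into a soft-target probability vector, contrasted with the traditional one-hot label, and model outputs are compared to the human distribution via L1 distance, with a two-sample (Welch) t statistic over the per-image distances.

```python
# Sketch of soft-target label construction and L1-distance evaluation.
# Emotion categories and annotation counts below are illustrative only.
import math
from statistics import mean, variance

EMOTIONS = ["happy", "sad", "anger", "disgust", "fear", "surprise", "neutral"]

def soft_target(counts):
    """Normalize raw annotator counts (e.g., 100 per CAFE image) into a probability vector."""
    total = sum(counts)
    return [c / total for c in counts]

def one_hot(counts):
    """Traditional hard label: all probability mass on the modal emotion."""
    top = counts.index(max(counts))
    return [1.0 if i == top else 0.0 for i in range(len(counts))]

def l1_distance(p, q):
    """L1 distance between a classifier's softmax output and the human label distribution."""
    return sum(abs(a - b) for a, b in zip(p, q))

def welch_t(xs, ys):
    """Two-sample t statistic between two lists of per-image L1 distances."""
    vx, vy = variance(xs), variance(ys)
    return (mean(xs) - mean(ys)) / math.sqrt(vx / len(xs) + vy / len(ys))

# Hypothetical example: 100 annotations for one image, mostly "happy"
# with some confusion toward "surprise".
counts = [80, 0, 0, 0, 0, 15, 5]
print(soft_target(counts))  # [0.8, 0.0, 0.0, 0.0, 0.0, 0.15, 0.05]
print(one_hot(counts))      # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The point of the comparison is that a one-hot-trained classifier is rewarded only for matching the modal label, while the soft-target classifier is rewarded for reproducing the full distribution of human interpretations, which is what the L1-distance t-test measures.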