Hajer Guerdelli, Claudio Ferrari, Walid Barhoumi, Haythem Ghazouani, Stefano Berretti.
Abstract
Automatic facial expression recognition is essential for many potential applications. Thus, having a clear overview of the existing datasets that have been investigated within the framework of facial expression recognition is of paramount importance for designing and evaluating effective solutions, notably for training neural networks. In this survey, we provide a review of more than eighty facial expression datasets, taking into account both macro- and micro-expressions. The proposed study focuses mostly on spontaneous and in-the-wild datasets, given the common trend in the research of considering contexts where expressions are shown spontaneously and in real settings. We also provide instances of potential applications of the investigated datasets, while highlighting their pros and cons. The proposed survey can help researchers to better understand the characteristics of the existing datasets, thus facilitating the choice of the data that best suits the particular context of their application.
Keywords: applications of facial expression datasets; facial expression recognition; macro-expressions datasets; micro-expressions datasets
Year: 2022 PMID: 35214430 PMCID: PMC8879817 DOI: 10.3390/s22041524
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The valence–arousal continuous emotional space.
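To make the valence–arousal representation concrete, the sketch below places a few emotion labels in the 2D continuous space and reports their quadrant. The specific coordinates are illustrative assumptions, not values taken from the survey:

```python
# Illustrative placement of basic emotions in the valence-arousal plane.
# Coordinates are hypothetical examples for demonstration only; they are
# not annotations from any of the surveyed datasets.
EMOTIONS = {
    # emotion: (valence in [-1, 1], arousal in [-1, 1])
    "happiness": (0.8, 0.5),
    "surprise":  (0.3, 0.8),
    "anger":     (-0.6, 0.7),
    "fear":      (-0.7, 0.6),
    "disgust":   (-0.6, 0.3),
    "sadness":   (-0.7, -0.4),
    "neutral":   (0.0, 0.0),
}

def quadrant(emotion: str) -> str:
    """Return the valence-arousal quadrant of an emotion label."""
    v, a = EMOTIONS[emotion]
    v_side = "positive-valence" if v >= 0 else "negative-valence"
    a_side = "high-arousal" if a >= 0 else "low-arousal"
    return f"{v_side}/{a_side}"
```

For example, `quadrant("anger")` yields `negative-valence/high-arousal`, while `quadrant("sadness")` yields `negative-valence/low-arousal`; datasets such as AffectNet, DEAP, Aff-Wild, and AVEC'14 annotate expressions with continuous values of this kind rather than discrete categories.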
Proposed categorization of macro- and micro-expression datasets.
| Macro-Expression Datasets (Spontaneous) | Macro-Expression Datasets (In-the-wild) | Micro-Expression Datasets (Spontaneous) | Micro-Expression Datasets (In-the-wild) |
|---|---|---|---|
| EB+, TAVER, RAVDESS, GFT, SEWA, BP4D+ (MMSE), BioVid Emo, 4D CCDb, MAHNOB Mimicry, OPEN-EmoRec-II, AVEC’14, BP4D-Spontaneous, DISFA, RECOLA, AVEC’13, CCDb, DynEmo, DEAP, SEMAINE, MAHNOB-HCI, UNBC-McMaster, CAM3D, B3D(AC), CK+, AvID, AVIC, DD, SAL, HUMAINE, EmoTABOO, ENTERFACE, UT-Dallas, RU-FACS, MIT, UA-UIUC, AAI, Smile dataset, iSAFE, ISED | RAF-DB, Aff-Wild2, AM-FED+, AffectNet, AFEW-VA, Aff-Wild, EmotioNet, FER-Wild, Vinereactor, CHEAVD, HAPPEI, AM-FED, FER-2013, AFEW, Belfast induced, SFEW, VAM-faces, FreeTalk, EmoTV, BAUM-2 | SAMM, CAS(ME)2, Silesian deception, CASME II, CASME, SMIC-E, SMIC, Canal9, YorkDDT | MEVIEW |
Figure 2. Sample frames from the CAM3D spontaneous dataset.
Figure 3. Structure of the BP4D+ dataset.
Figure 4. Sample frames from the AFEW-VA in-the-wild dataset.
Classification of macro-expression datasets according to their content.
| Expression Representation | Macro-Expression Datasets |
|---|---|
| Six basic expressions | MMI, USTC-NVIE, MMI-V, SFEW |
| Six basic expressions + neutral | iSAFE, AFEW, FER-2013 |
| Six basic expressions + neutral, pain | Hi4D-ADSIP |
| Six basic expressions + neutral, contempt | BAUM-2 |
| Six basic expressions (happiness or amusement, sadness, surprise or startle, fear or nervous, anger or upset, disgust) + embarrassment, pain | BP4D-Spontaneous |
| 23 categories of emotion | EmotioNet |
| Nine categories of emotions (no-face, six basic expressions, neutral, none, and uncertain) | FER-Wild |
| 13 emotional and mental states (six basic emotions, boredom, and contempt, plus the mental states confusion, neutral, thinking, concentrating, and bothered) | BAUM-1 |
| Four emotions (sadness, surprise, happiness, and disgust) | ISED |
| One emotion (smile) | AM-FED, Smile dataset |
| Valence–arousal | AffectNet, DEAP, Aff-Wild, AVEC’14 |
Classification of macro-expression datasets according to their number of subjects.
| Number of Subjects | Macro-Expression Datasets |
|---|---|
| ≤50 | TAVER, RAVDESS, BAUM-1, OPEN-EmoRec-II, BP4D-Spontaneous, DISFA, RECOLA, CCDb, MAHNOB Laughter, DEAP, SEMAINE, MAHNOB-HCI, UNBC-McMaster, CAM3D, B3D(AC), MMI-V, AVLC, AvID, AVIC, VAM-faces, ENTERFACE, MMI, MIT, EmoTV, UA-UIUC, 4D CCDb, FreeTalk, IEMOCAP, SAL, iSAFE, ISED |
| ∈[50, 100] | GFT, SEWA, BioVid Emo, MAHNOB Mimicry, AVEC’14, PICS-Stirling ESRC 3D Face Database, Belfast induced (Set2 and Set3), Hi4D-ADSIP, DD, RU-FACS, AAI, Smile dataset |
| ∈[100, 250] | EB+, 4DFAB, AFEW-VA, BP4D+ (MMSE), Vinereactor, CHEAVD, AM-FED, Belfast induced (Set1), USTC-NVIE, CK+ |
| ∈[250, 500] | SFEW, Aff-Wild2, AM-FED+, BAUM-2, AVEC’13 AViD-Corpus, DynEmo, AFEW, UT-Dallas |
| ≥500 | RAF-DB, AffectNet, Aff-Wild, EmotioNet, FER-Wild, FER-2013, HAPPEI, HUMAINE |
Classification of micro-expression datasets according to their number of subjects.
| Number of Subjects | Micro-Expression Datasets |
|---|---|
| ≤50 | SAMM, CAS(ME)2, MEVIEW, CASME II, CASME, SMIC-E, SMIC, YorkDDT |
| ≥100 | Silesian deception, Canal9 |
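The categorizations above lend themselves to programmatic filtering when shortlisting datasets for a given application. A minimal sketch, where the small catalog hard-codes a few entries transcribed from the survey's tables and the field names are our own assumptions:

```python
from dataclasses import dataclass

@dataclass
class FacialDataset:
    name: str
    expression_type: str  # "macro" or "micro"
    content: str          # "spontaneous" or "in-the-wild"
    num_subjects: int

# A few entries transcribed from the survey's dataset tables.
CATALOG = [
    FacialDataset("CK+", "macro", "spontaneous", 123),
    FacialDataset("AFEW", "macro", "in-the-wild", 330),
    FacialDataset("CASME II", "micro", "spontaneous", 26),
    FacialDataset("SAMM", "micro", "spontaneous", 32),
    FacialDataset("MEVIEW", "micro", "in-the-wild", 16),
]

def shortlist(expression_type: str, min_subjects: int = 0) -> list[str]:
    """Names of catalog datasets matching the given type and subject count."""
    return [d.name for d in CATALOG
            if d.expression_type == expression_type
            and d.num_subjects >= min_subjects]
```

For instance, `shortlist("micro", min_subjects=30)` returns only SAMM among the entries above, mirroring how the subject-count tables partition the datasets.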
Macro-expression datasets. The columns report: the dataset name (Dataset); the year of release; the number of subjects; the range of the subjects' ages (Age); the number of frames captured per second (FPS); the ethnicity; and the amount of data/frames. In the table cells, a '-' indicates that no information is available, while a '*' following the dataset name indicates that the dataset is publicly available.
| Dataset | Year | Number of Subjects | Age | FPS | Ethnicity | Amount of Data/Frames |
|---|---|---|---|---|---|---|
| | 2020 | 200 | 18–66 | 25 | Five ethnicities (Latino/Hispanic, White, African American, Asian, and Others) | 1216 videos, with 395 K frames in total |
| | 2020 | 44 | 17–22 | 60 | Two ethnicities (Indo-Aryan and Dravidian (Asian)) | 395 clips |
| | 2019 | Thousands | - | - | Image URLs collected from Flickr | 30,000 facial images |
| | 2019 | 17 | 21–38 | 10 | One ethnicity (Korean) | 17 videos of 1–4 min |
| | 2018 | 180 | 5–75 | 60 | Three ethnicities (Caucasian (Europeans and Arabs), Asian (East-Asian and South-Asian), and Hispanic/Latino) | Two million frames. The vertex number of the reconstructed 3D meshes ranges from 60 k to 75 k |
| | 2018 | 258 | Infants, young, and elderly | 30 | Five ethnicities (Caucasian, Hispanic or Latino, Asian, Black, or African American) | Extension with 260 more subjects and 1,413,000 new video frames |
| | 2018 | 24 | 21–33 | 30 | Caucasian, East-Asian, and Mixed (East-Asian Caucasian, and Black-Canadian First Nations Caucasian) | 7356 recordings composed of 4320 speech recordings and 3036 song recordings |
| | 2018 | 416 | - | 14 | Participants from around the world | 1044 videos of naturalistic facial responses to online media content recorded over the Internet |
| | 2017 | 96 | 21–28 | - | Participants were randomly selected | 172,800 frames |
| | 2017 | 450,000 | Average age 33.01 years | - | More than 1,000,000 facial images from the Internet | 1,000,000 images with facial landmarks; 450,000 images annotated manually |
| | 2017 | 240 | 8–76 | - | Movie actors | 600 video clips |
| | 2017 | 398 | 18–65 | 20–30 | Six ethnicities (British, German, Hungarian, Greek, Serbian, and Chinese) | 1990 audio-visual recording clips |
| | 2016 | 140 | 18–66 | 25 | Five ethnicities (Latino/Hispanic, White, African American, Asian, and Others) | 1.4 million frames. Over 10 TB of high-quality data generated for the research community |
| | 2016 | 500 | - | - | - | 500 videos from YouTube |
| | 2016 | 1,000,000 | - | - | One million images of facial expressions downloaded from the Internet | Images queried from the web: 100,000 images annotated manually, 900,000 images annotated automatically |
| | 2016 | 24,000 | - | - | - | 24,000 images from the web |
| | 2016 | 31 | 19–65 | 30 | One ethnicity (Turkish) | 1184 multimodal facial video clips containing spontaneous facial expressions and speech of 13 emotional and mental states |
| | 2016 | 86 | 18–65 | - | - | 15 standardized film clips |
| | 2016 | 222 | - | Web-cam | Mechanical Turk workers | 6029 video responses from 343 unique Mechanical Turk workers in response to 200 video stimuli. Total number of 1,380,343 video frames |
| CHEAVD * | 2016 | 238 | 11–62 | 25 | - | Extracted from 34 films, two TV series, and four other television shows. In the wild |
| | 2016 | 50 | 18–22 | 50 | One ethnicity (Indian) | 428 videos |
| | 2015 | 4 | 20–50 | 60 | - | 34 audio-visuals |
| | 2015 | 60 | 18–34 | 25 | Staff and students at Imperial College London | Over 54 sessions of dyadic interactions between 12 confederates and their 48 counterparts |
| | 2015 | 30 | Mean age: women 37.5 years; men 51.1 years | - | - | Video, audio, physiology (SCL, respiration, BVP, EMG Corrugator supercilii, EMG Zygomaticus Major), and facial reaction annotations |
| | 2015 | 8500 faces | - | - | - | 4886 images |
| | 2014 | 84 | 18–63 | - | German | 300 audio-visuals |
| | 2014 | 286 | 5–73 | - | Two ethnicities (Turkish, English) | 1047 video clips |
| | 2013 | 41 | 18–29 | 25 | Four ethnicities (Asian, African-American, Hispanic, and Euro-American) | 368,036 frames |
| | 2013 | 27 | 18–50 | 20 | Four ethnicities (Asian, Euro-American, Hispanic, and African-American) | 130,000 frames |
| | 2013 | 46 | Mean age: 22 years, standard deviation: three years | - | Four ethnicities (French, Italian, German, and Portuguese) | 27 videos |
| | 2013 | 242 | Range of ages | 14 | Viewers from a range of ages and ethnicities | 168,359 frames/242 facial videos |
| | 2013 | 35,685 | - | - | - | Images queried from the web |
| | 2013 | 292 | 18–63 | 30 | One ethnicity (German) | 340 audio-visuals |
| | 2013 | 16 | 25–56 | - | All participants were fully fluent in the English language | 30 audio-visuals |
| | 2013 | 22 | Average age: 27 and 28 years | 25 | 12 different countries and of different origins | 180 sessions; 563 laughter episodes, 849 speech utterances, 51 posed laughs, 67 speech–laugh episodes, and 167 other vocalizations annotated in the dataset |
| | 2013 | 358 | 25–65 | 25 | One ethnicity (Caucasian) | Two sets of 233 and 125 recordings of EFE of ordinary people |
| | 2013 | 99 | - | - | - | 2D images, video sequences, and 3D face scans |
| | 2012 | 32 | 19–37 | - | Mostly European students | 40 one-minute-long videos shown to subjects |
| | 2012 | 330 | 1–70 | - | Extracted from movies | 1426 sequences with length from 300 to 5400 ms; 1747 expressions |
| | 2012 | 24 | 22–60 | - | Undergraduate and postgraduate students | 130,695 frames |
| | 2012 | Set 1: 114 | Undergraduate students | - | Undergraduate students | 570 audio-visuals |
| | | Set 2: 82 | Mean age of participants 23.78 | - | Undergraduate students, postgraduate students, or employed professionals | 650 audio-visuals |
| | | Set 3: 60 | Mean age of participants 32.54 | - | (Peru, Northern Ireland) | 180 audio-visuals |
| | 2012 | 27 | 19–40 | 60 | Different educational backgrounds, from undergraduate students to postdoctoral fellows, with English proficiency from intermediate to native | 756 data sequences |
| | 2011 | 80 | 18–60 | 60 | Undergraduate students from the Performing Arts Department at the University; undergraduate students, postgraduate students, and members of staff from other departments | 3360 images/sequences |
| | 2011 | 25 | - | - | Participants self-identified as having shoulder pain | 48,398 frames/200 video sequences |
| | 2011 | 16 | 24–50 | 25 | Three ethnicities (Caucasian, Asian, and Middle Eastern) | 108 videos of 12 mental states |
| | 2011 | 95 | - | - | - | 700 images: 346 images in Set 1 and 354 images in Set 2 |
| | 2010 | 14 | 21–53 | 25 | Native English speakers | 1109 sequences, 4.67 s long |
| | 2010 | 215 | 17–31 | 30 | Students | 236 apex images |
| | 2010 | 123 | 18–50 | - | Three ethnicities (Euro-American, Afro-American, and other) | 593 sequences |
| | 2010 | 25 | 20–32 | 25 | Three ethnicities (European, South American, Asian) | 1 h and 32 min of data; 392 segments |
| | 2010 | 24 | Average ages were respectively 30, 28, and 29 years | 25 | Eleven countries (Belgium, France, Italy, UK, Greece, Turkey, Kazakhstan, India, Canada, USA, and South Korea) | 1000 spontaneous laughs and 27 acted laughs |
| | 2009 | 15 | 19–37 | - | Native Slovenian speakers | Approximately one hour of video for each subject |
| | 2009 | 21 | ≤30 and ≥40 | 25 | Two ethnicities (Asian and European) | 324 episodes |
| | 2009 | 57 | - | 30 | 19% non-Caucasian | 238 episodes |
| | 2008 | 20 | 16–69 (70% ≤ 35) | 25 | One ethnicity (German) | 1867 images (93.6 images per speaker on average) |
| | 2008 | 4 | - | 60 | Originating from different countries, each speaking a different native language (Finnish, French, Japanese, and English) | 300 episodes |
| | 2008 | 10 | - | 120 | Actors (fluent English speakers) | Two hours of audiovisual data, including video, speech, motion capture of face, and text transcriptions |
| | 2008 | 4 | - | - | - | 30 min sessions for each user |
| | 2007 | Multiple | - | - | - | 50 clips from naturalistic and induced data |
| | 2007 | - | - | - | French dataset | 10 clips |
| | 2006 | - | - | 25 | - | A multi-modal dataset consisting of 100 h of meeting recordings |
| | 2006 | 16 | Average age 25 | - | - | - |
| | | 5 | 22–38 | - | - | - |
| | | 16 | Average age 25 | - | - | - |
| | 2005 | 100 | 18–30 | 24 | Two ethnicities (African-American and Asian or Latino) | 400–800 min dataset |
| | 2005 | 19 | 19–62 | 24 | Three ethnicities (European, Asian, or South American) | Subjects portrayed 79 series of facial expressions; image sequences of frontal and side views were captured; 740 static images/848 videos |
| | 2005 | 284 | 18–25 | 29.97 | One ethnicity (Caucasian) | 1540 standardized clips |
| | 2005 | 17 | - | - | - | Over 25,000 frames were scored |
| | 2005 | 48 | - | - | French | 51 video clips |
| | 2004 | 28 | Students | - | Students | One video clip for each subject |
| | 2004 | 60 | 18–30 | - | Two ethnicities (European American and Chinese American) | One audiovisual per subject |
| | 2001 | 95 | - | 30 | - | 195 spontaneous smiles |
Micro-expression datasets. Number Subjects: number of subjects. Age: age range of the subjects. FPS: frames captured per second. '-': no information available. Samples: number of micro-expressions. Content: spontaneous or in-the-wild.
| Dataset | Year | Number Subjects | Age | FPS | Ethnicity | # of Data/Frames | FACS Coded | Samples | Lights | Resolution | Emotions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 2018 | 32 | Average 33.24 | 200 | Thirteen ethnicities (White British and other) | 338 micro-movements | Yes | 159 | Two lights as array of LEDs | 2040 | Seven emotions. Macro/Micro |
| | 2018 | 22 | Average 22.59 | 30 | One ethnicity | 250 macro, 53 micro | No | 53 | Two light-emitting diode (LED) lights | - | Four emotions. Macro/Micro |
| | 2017 | 16 | - | 25 | - | 31 videos | Yes | 31 | - | - | Five emotions. Macro/Micro |
| | 2015 | 101 | Students | 100 | Third- and fourth-year students | 101 videos, 1.1 M frames | Yes | 183 micro-tensions | Proper illumination | - | Macro/Micro |
| | 2014 | 26 | Average 22.03 | 200 | One ethnicity | Among 3000 facial movements | Yes | 247 | Four selected LED lamps under umbrella reflectors | - | Five emotions |
| | 2013 | 35 (19 valid) | Average 22.03 | 60 | One ethnicity | More than 1500 elicited facial movements | Yes | 195 (100 in Class A, 95 in Class B) | Class A: natural light, Class B: room with two LED lights | Class A: | Seven emotions |
| | 2013 | HS: 16 | 22–34 | 100 | Three ethnicities (Asian, Caucasian, and African) | Longest micro-expression clips: 50 frames | No | 164 | 4 lights at the four upper corners of the room | - | 3 emotions (positive, negative, and surprise) |
| | | VIS: 8 | | 25 | | The longest micro-expression clips: 13 frames | | 71 | | | |
| | | NIR: 8 | | 25 | | Same as VIS | | 71 | | | |
| | 2011 | 6 | - | 100 | - | 1,260,000 frames | No | 77 | Indoor bunker environment resembling an interrogation room | - | Five emotions. Micro |
| | 2009 | 190 | - | - | - | 70 debates for a total of 43 h and 10 min of material | - | 24 | - | - | Political debates recorded by the Canal 9 local Switzerland TV station |
| | 2009 | 9 | - | 25 | - | 20 videos for a deception detection test (DDT); seven frames | No | 18 | - | 320 × 240 | Two emotion classes |