Xiaoming Zhao, Zhiwei Tang, Shiqing Zhang
Abstract
Automatic personality trait recognition has attracted increasing interest in psychology, neuropsychology, computer science, and related fields. Motivated by the great success of deep learning methods in various tasks, a variety of deep neural networks have increasingly been employed to learn high-level feature representations for automatic personality trait recognition. This paper presents a systematic and comprehensive survey of existing personality trait recognition methods from a computational perspective. First, we describe the personality trait data sets available in the literature. Then, we review the principles and recent advances of typical deep learning techniques, including deep belief networks (DBNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). Next, we detail state-of-the-art personality trait recognition methods, with a specific focus on hand-crafted and deep learning-based feature extraction. These methods are analyzed and summarized for both single modalities and multiple modalities, such as audio, visual, text, and physiological signals. Finally, we analyze the challenges and opportunities in this field and point out its future directions.
Keywords: deep learning; multimodal; personality computing; personality trait recognition; survey
Year: 2022 PMID: 35645923 PMCID: PMC9136483 DOI: 10.3389/fpsyg.2022.839619
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Figure 1. The evolution of personality trait recognition with feature extraction algorithms and databases. From 2012 to 2022, feature extraction algorithms have changed from hand-crafted to deep learning. Meanwhile, the developed databases have evolved from single modality (audio or visual) to multiple modalities (audio, visual, text, etc.).
Comparisons of representative personality trait recognition databases.
| Data set | Year | Brief description | Central issues | Labels | Modality | Environment |
|---|---|---|---|---|---|---|
| SSPNet | 2012 | 640 audio clips from 322 speakers | Personality trait assessment from speech | BFI-10 personality assessment questionnaire, Big-Five impressions | Audio | Uncontrolled |
| SEMAINE | 2012 | 959 conversations from 150 participants | Face-to-face conversations with sensitive artificial listener agents | Five affective dimensions and 27 associated categories | Audio-visual | Controlled |
| YouTube Vlogs | 2012 | 2,269 videos from 469 different vloggers | Conversational vlogs and apparent personality trait analysis | Big-Five impressions | Audio-visual | Uncontrolled |
| ELEA | 2013 | 40 meeting sessions with about 10 h of recordings (148 participants) | Small group interactions and emergent leadership | Big-Five impressions | Audio-visual | Controlled |
| ChaLearn First Impression V1 | 2016 | 10,000 videos from 2,762 YouTube users | Apparent personality trait analysis | Big-Five impressions | Audio-visual | Uncontrolled |
| ChaLearn First Impression V2 | 2017 | An extended version of ChaLearn First Impression V1, with newly added hirability impressions and audio transcripts | Apparent personality trait and hirability impressions | Big-Five impressions, job interview variable, and transcripts | Multimodal | Uncontrolled |
| MHHRI | 2017 | 12 interaction sessions (about 4 h) from 18 participants | Personality and engagement during HHI and HCI | Self-/acquaintance-assessed Big-Five, and engagement | Multimodal | Controlled |
| UDIVA | 2021 | 188 dyadic sessions (90.5 h) from 147 participants | Context-aware personality inference in dyadic scenarios | Big-Five scores, sociodemographics, mood, fatigue, relationship type | Multimodal | Controlled |
Comparisons of deep CNN models and their configurations.
| Configuration | AlexNet | VGGNet | GoogLeNet | ResNet | DenseNet |
|---|---|---|---|---|---|
| Year | 2012 | 2015 | 2015 | 2016 | 2017 |
| Layers (Conv. + FC) | 5 + 3 | 19 + 3 | 21 + 1 | 151 + 1 | 264 + 1 |
| Conv. kernel | 11,5,3 | 3 | 7,1,3,5 | 7,1,3,5 | 7,1,3 |
| Dropout | √ | √ | √ | √ | √ |
| Inception | × | × | √ | × | × |
| DA | √ | √ | √ | √ | √ |
| BN | × | × | × | √ | √ |
Conv., convolution; DA, data augmentation; BN, batch normalization. The number of layers is the maximum used in each deep model.
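As a concrete illustration of how these pretrained backbones are typically reused in the surveyed methods, the sketch below loads one of the tabulated CNNs as a frozen deep feature extractor. This is a minimal PyTorch/torchvision example under our own assumptions (the ResNet-152 variant, the 224×224 input size, and the truncation point are illustrative), not code from any of the surveyed works.

```python
import torch
import torchvision.models as models

# Load one of the tabulated backbones pretrained on ImageNet; swapping in
# vgg16, googlenet, resnet152, or densenet201 covers the other columns.
backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the 1000-way ImageNet classifier
backbone.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)  # a batch of face/scene crops
    features = backbone(frames)           # (8, 2048) deep feature vectors
print(features.shape)
```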
A brief summary of audio-based personality trait recognition methods.
| Year | Authors | Feature descriptions |
|---|---|---|
| 2012 | Mohammadi et al. | Pitch, formants, energy, and speaking rate |
| 2016 | An et al. | Interspeech-2013 ComParE feature set |
| 2017 | Su et al. | Wavelet-based multiresolution analysis and CNNs for feature extraction |
| 2019 | Hayat et al. | Fine-tuning the pretrained AudioSet for feature extraction |
| 2020 | Carbonneau et al. | Learning feature dictionary from the extracted patches in speech spectrograms |
Figure 2. The CNN scheme used for personality trait perception from speech signals (Su et al., 2017).
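The table and Figure 2 share a common pipeline: convert speech into a time-frequency "image" and let a CNN regress the Big-Five scores. The sketch below is a minimal, hypothetical version of that pipeline; the log-mel front end and the two-layer CNN are our assumptions (Su et al. use wavelet-based multiresolution analysis instead).

```python
import torch
import torch.nn as nn
import torchaudio

# A 1-second mock utterance at 16 kHz -> log-mel spectrogram "image"
wave = torch.randn(1, 16000)
spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)(wave)
spec = spec.log1p().unsqueeze(0)  # (batch=1, channel=1, 64 mels, time frames)

# A small CNN regressing the five trait scores into [0, 1]
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 5), nn.Sigmoid(),
)
print(cnn(spec))  # shape (1, 5): one score per Big-Five trait
```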
A brief summary of visual-based personality trait recognition methods.
| Visual type | Year | Authors | Feature descriptions |
|---|---|---|---|
| Static images | 2015 | Guntuku et al. | LBP, GIST, aesthetic features |
| | 2016 | Yan et al. | HOG, SIFT, LBP |
| | 2017 | Zhang et al. | Fine-tuning the pretrained VGG-face model for facial feature extraction |
| | 2017 | Segalin et al. | Fine-tuning the pretrained AlexNet and VGG-16 for aesthetic attributes |
| | 2020 | Rodríguez et al. | Training a ResNet-50 to derive personality representations from posted images |
| | 2021 | Fu et al. | An improved ASM model for facial feature extraction, followed by a DBN |
| Dynamic video sequences | 2012 | Biel et al. | Facial activity statistics based on frame-by-frame estimation |
| | 2013 | Aran et al. | Statistical information derived from weighted motion energy images |
| | 2014 | Teijeiro-Mosquera et al. | Four sets of behavioral cues: statistic, THR, HMM, and WTA cues |
| | 2016 | Gürpınar et al. | Fine-tuning the pretrained VGG-19 to extract deep facial and scene features |
| | 2017 | Ventura et al. | An extension of DAN for facial feature extraction in videos |
| | 2019 | Beyan et al. | Deep visual activity-based features derived from key-dynamic images in videos |
Figure 3. The flowchart of personality trait prediction using deep facial and scene feature representations (Gürpınar et al., 2016).
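Several of the tabulated visual methods (Zhang et al., Segalin et al., Gürpınar et al.) share the fine-tuning recipe sketched below: freeze a pretrained VGG's convolutional layers and retrain only a new Big-Five regression head. Using ImageNet VGG-16 weights and a sigmoid head here is an illustrative assumption, not any one author's exact configuration.

```python
import torch.nn as nn
import torchvision.models as models

# ImageNet VGG-16 stands in for the VGG-face / VGG-19 models in the table
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in model.features.parameters():  # freeze the convolutional layers
    p.requires_grad = False
model.classifier[6] = nn.Sequential(   # replace the final 1000-way FC layer
    nn.Linear(4096, 5), nn.Sigmoid()   # five apparent Big-Five trait scores
)
# Training then minimizes a regression loss (e.g., nn.L1Loss()) between the
# predictions and annotator-provided Big-Five impressions on face crops.
```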
Figure 4. Class activation maps (CAM) used to interpret the CNN models when learning facial features (Ventura et al., 2017).
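For readers unfamiliar with CAM, the sketch below computes a class activation map in its standard form: the final convolutional feature maps weighted by the FC weights of the predicted class. Using torchvision's ResNet-18 as the backbone is an assumption made for self-containedness; it is not Ventura et al.'s exact model.

```python
import torch
import torchvision.models as models

net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()

# Capture the last conv block's feature maps via a forward hook
feats = {}
net.layer4.register_forward_hook(lambda m, i, o: feats.update(map=o))

img = torch.randn(1, 3, 224, 224)  # a face crop would go here
with torch.no_grad():
    logits = net(img)
cls = logits.argmax(dim=1).item()

# CAM: weight the final conv maps by the FC weights of the predicted class
w = net.fc.weight[cls]                                    # (512,)
cam = torch.einsum('c,chw->hw', w, feats['map'][0])       # (7, 7)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0,1]
print(cam.shape)  # upsample to 224x224 to overlay on the input face
```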
A brief summary of text- and physiological-based personality trait recognition.
| Input type | Year | Authors | Feature descriptions |
|---|---|---|---|
| Text | 2013 | Bazelli et al. | Predicting the personality traits of Stack Overflow authors with LIWC |
| | 2016 | Golbeck et al. | The Receptiviti API providing personality score predictions with LIWC |
| | 2017 | Majumder et al. | A CNN with injection of document-level Mairesse features |
| | 2017 | Hernandez et al. | RNNs and their variants (GRU, LSTM, and Bi-LSTM) for text features |
| | 2018 | Xue et al. | A hierarchical deep neural network for learning deep semantic features |
| | 2018 | Sun et al. | A 2CLSTM integrating a Bi-LSTM with a CNN for feature extraction |
| | 2020 | Mehta et al. | Psycholinguistic features combined with BERT embeddings |
| | 2021 | Ren et al. | BERT for text feature extraction, followed by GRU, LSTM, and CNN |
| Physiological signals | 2014 | Wache et al. | Measurements of ECG, EEG, and GSR |
| | 2018 | Subramanian et al. | Measurements of ECG, EEG, GSR, and facial activity data |
| | 2020 | Taib et al. | Eye-tracking and skin-conductivity sensors for capturing physiological responses |
Figure 5. The flowchart of CNN-based document-level personality prediction from text (Majumder et al., 2017).
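The sketch below mirrors the Figure 5 idea in a few lines: a word-level CNN with max-over-time pooling, whose output is concatenated with document-level features before the per-trait classifier. The vocabulary size, embedding dimension, and 84-dimensional document-feature count are illustrative assumptions, not Majumder et al.'s exact configuration.

```python
import torch
import torch.nn as nn

class TextPersonalityCNN(nn.Module):
    def __init__(self, vocab=20000, emb=300, n_doc_feats=84):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, 128, kernel_size=3, padding=1)
        self.head = nn.Linear(128 + n_doc_feats, 5)  # one output per trait

    def forward(self, tokens, doc_feats):
        x = self.emb(tokens).transpose(1, 2)             # (B, emb, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values   # max-over-time pool
        x = torch.cat([x, doc_feats], dim=1)             # inject doc features
        return torch.sigmoid(self.head(x))               # per-trait scores

model = TextPersonalityCNN()
out = model(torch.randint(0, 20000, (2, 50)), torch.randn(2, 84))
print(out.shape)  # (2, 5)
```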
A brief summary of multimodal fusion for personality trait recognition.
| Year | Authors | Modalities | Fusion methods | Feature descriptions |
|---|---|---|---|---|
| 2016 | Güçlütürk et al. | Audio, visual | Feature-level | A deep residual network for audio and visual feature extraction |
| 2016, 2017 | Zhang et al. | Audio, visual | Decision-level | A DBR method integrating the audio and visual (scene and face) modalities |
| 2016 | Gürpınar et al. | Audio, visual | Score-level | Fine-tuning a pretrained VGG model to derive facial emotion and ambient features. The INTERSPEECH-2009 feature set for the audio modality |
| 2016 | Subramaniam et al. | Audio, visual | Feature-level | A volumetric (3D) convolution network for visual feature extraction. The statistics of zero-crossing rate, energy, MFCCs for audio features |
| 2021 | Curto et al. | Audio, visual | Model-level | The pretrained VGGish for audio feature extraction, and the pretrained R(2 + 1)D for video feature extraction |
| 2016 | Xianyu et al. | Text, visual | Model-level | A heterogeneity-entropy (HE) neural network (HENN) consisting of an HE-DBN, an HE-AE, and a common DBN for common feature representations across the text, image, and behavior-statistics modalities |
| 2019 | Principi et al. | Audio, visual | Model-level/Feature-level | A multimodal deep learning model (ResNet-50 for visual modality and 14-layer 1D CNN for audio modality) for feature extraction |
| 2020 | Li et al. | Audio, visual, text | Feature-level | A deep CR-Net to predict the multimodal Big-Five personality traits based on video, audio, and text cues |
| 2017 | Güçlütürk et al. | Audio, visual, text | Feature-level | Deep residual networks for audio-visual feature extraction. A bag-of-words and a skip-thought vector model for text feature extraction |
| 2017, 2018 | Gorbova et al. | Audio, visual, text | Decision-level | Acoustic LLD features (MFCCs, ZCR, speaking rate), facial action unit features, as well as negative and positive word scores |
| 2018 | Kampman et al. | Audio, visual, text | Decision-level/Model-level | A trimodal deep CNN method for audio, visual, and text feature extraction |
| 2020 | Escalante et al. | Audio, visual, text | Feature-level | A bag-of-words model and a skip-thought vector model for text feature extraction, and the ResNet18 model for audio-visual feature extraction |
| 2022 | Suman et al. | Audio, visual, text | Feature-level/Decision-level | An MTCNN and a ResNet for facial and ambient feature extraction, respectively. A VGGish model for audio feature extraction and an |
Figure 6. The flowchart of the proposed DBR method for audio-visual personality trait prediction (Wei et al., 2017).
Figure 7. The flowchart of integrating audio, vision, and language for first-impression personality analysis (Gorbova et al., 2018).
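To make the fusion taxonomy used throughout the table concrete, the sketch below contrasts feature-level (early) and decision-level (late) fusion for precomputed audio and visual embeddings; the embedding sizes and the equal-weight score averaging are illustrative assumptions rather than any surveyed system's design.

```python
import torch
import torch.nn as nn

audio_emb = torch.randn(4, 128)   # e.g., VGGish-style audio features
visual_emb = torch.randn(4, 256)  # e.g., ResNet-style visual features

# Feature-level (early) fusion: concatenate modality features, then predict
early_head = nn.Sequential(nn.Linear(128 + 256, 5), nn.Sigmoid())
early_pred = early_head(torch.cat([audio_emb, visual_emb], dim=1))

# Decision-level (late) fusion: per-modality predictors, then average scores
audio_head = nn.Sequential(nn.Linear(128, 5), nn.Sigmoid())
visual_head = nn.Sequential(nn.Linear(256, 5), nn.Sigmoid())
late_pred = 0.5 * audio_head(audio_emb) + 0.5 * visual_head(visual_emb)

print(early_pred.shape, late_pred.shape)  # both (4, 5): Big-Five scores
```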