Nikodem Rybak, Daniel J. Angus.
Abstract
Accurate inference of the emotional state of conversation participants can be critical in shaping the analysis and interpretation of conversational exchanges. In qualitative analyses of discourse, most labelling of the perceived emotional state of conversation participants is performed by hand, and is limited to selected moments where an analyst believes that emotional information is valuable for interpretation. This reliance on manual labelling has implications for repeatability and objectivity, not only in terms of accuracy but also in terms of changes in emotional state that might otherwise go unnoticed. In this paper we introduce a qualitative discourse analytic method intended to support the labelling of the emotional state of conversational participants over time. We demonstrate the utility of the technique on a suite of well-studied broadcast interviews, with a particular focus on identifying instances of inter-speaker conflict. Our findings indicate that this two-step machine learning approach can help decode how moments of conflict arise, are sustained, and are resolved through the mapping of emotion over time. We show how such a method can provide useful evidence of changes in the emotional state of interlocutors, which could prompt and support further in-depth study.
Year: 2021 PMID: 33983978 PMCID: PMC8118557 DOI: 10.1371/journal.pone.0251186
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. System overview.
A summary of our hierarchical approach to automatic conflict measurement. The conflict prediction model was trained on the SSPNet Conflict Corpus. Each input video frame is classified across the seven dimensions of emotion, and the resulting emotion vectors are then used as the basis for predicting the level of conflict.
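The two-step pipeline described above can be sketched as follows. This is a minimal NumPy illustration only: `classify_emotions` and `predict_conflict` are hypothetical stand-ins for the authors' trained VGG classifier and conflict model, and the weights are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def classify_emotions(frame):
    """Stand-in for the VGG-based classifier: returns a softmax
    distribution over the 7 emotion dimensions for one frame."""
    logits = rng.normal(size=7)            # placeholder for network output
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def predict_conflict(emotion_vectors):
    """Stand-in for the second-stage model: maps per-frame emotion
    vectors to a scalar conflict level via hypothetical weights."""
    w = np.linspace(-1.0, 1.0, 7)          # illustrative weights only
    return float(np.mean(emotion_vectors @ w))

frames = range(30)                          # e.g. one second of 30 fps video
emotions = np.stack([classify_emotions(f) for f in frames])  # shape (30, 7)
conflict = predict_conflict(emotions)       # one conflict score per window

assert emotions.shape == (30, 7)
assert np.allclose(emotions.sum(axis=1), 1.0)  # each row is a distribution
```

The key design point is the decoupling: the first stage produces a per-frame emotion vector, and the second stage consumes only those vectors, never raw pixels.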
Modified VGG network used for emotion recognition.
| Layer/Block | Type of Layer | Shape | Filter |
|---|---|---|---|
| input_1 | Input Layer | 224 x 224 | None |
| block1_conv1 | Conv2D | 224 x 224 | 3 x 3 |
| block1_conv2 | Conv2D | 224 x 224 | 3 x 3 |
| block1_pool | MaxPooling2D | 112 x 112 | 2 x 2 |
| block2_conv1 | Conv2D | 112 x 112 | 3 x 3 |
| block2_conv2 | Conv2D | 112 x 112 | 3 x 3 |
| block2_pool | MaxPooling2D | 56 x 56 | 2 x 2 |
| block3_conv1 | Conv2D | 56 x 56 | 3 x 3 |
| block3_conv2 | Conv2D | 56 x 56 | 3 x 3 |
| block3_pool | MaxPooling2D | 28 x 28 | 2 x 2 |
| block4_conv1 | Conv2D | 28 x 28 | 3 x 3 |
| block4_conv2 | Conv2D | 28 x 28 | 3 x 3 |
| block4_pool | MaxPooling2D | 14 x 14 | 2 x 2 |
| block5_fc1 | Fully Connected | 56 | None |
| block5_fc2 | Fully Connected | 7 | None |
| block5_softmax | Softmax | 7 | None |
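The spatial shapes in the table follow directly from 'same'-padded 3 x 3 convolutions (which preserve spatial size) and 2 x 2 max pooling (which halves it). A short calculation confirms the progression 224 → 112 → 56 → 28 → 14 across the four blocks:

```python
size = 224                      # input resolution from the table
shapes = []
for _ in range(4):              # blocks 1-4: conv, conv, pool
    shapes += [size, size]      # 'same'-padded convolutions keep the size
    size //= 2                  # 2x2 max pooling halves it
    shapes.append(size)

print(shapes)  # [224, 224, 112, 112, 112, 56, 56, 56, 28, 28, 28, 14]
print(size)    # 14 -> matches block4_pool in the table
```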
Summary of previous research on video-based emotion recognition using IEMOCAP database.
| Ref | Method | Accuracy |
|---|---|---|
| […] | Deep Belief Networks (DBN) | 73.78% |
| […] | ARTMAP Classifier | 72.20% |
| […] | Stacked CNN for MoCap data combined with bidirectional long short-term memory (LSTM) with attention | 71.04% |
| […] | Emotion-profiled Support Vector Machine (SVM) | 71.00% |
| […] | Informed Segmentation and Labelling Approach (ISLA) with the addition of audio information | 67.22% |
| […] | SVM with Restricted Boltzmann Machine | 60.71% |
| […] | SVM, speaker-inclusive (Sp-In) video (V) data | 60.30% |
Fig 2. Accuracy of the polynomial model without CPLE or dimensionality reduction.
Fig 3. Accuracy of the polynomial model with CPLE but without dimensionality reduction.
Fig 4. Accuracy of the polynomial model with both the CPLE algorithm and the dimensionality reduction method applied.
Fig 5. Results for Interview 1: Jeremy Paxman interviews Michael Howard on BBC Newsnight.
The figure shows the automated system's classification of emotion and conflict (y-axes); the x-axes represent time points within the video.
Fig 6. Results for Interview 2: Krishnan Guru-Murthy interviews film director Quentin Tarantino on Channel 4 News.
The figure shows the automated system's classification of emotion and conflict (y-axes); the x-axes represent time points within the video.
Fig 7. Results for Interview 3: Cathy Newman interviews controversial alt-right figurehead Milo Yiannopoulos on Channel 4 News.
The figure shows the automated system's classification of emotion and conflict (y-axes); the x-axes represent time points within the video.
start model
1: enter input
2: kernel_size ← 3              // 3x3 kernel
3: stride ← 2                   // 2x2 stride
4: activation ← 'PReLU'         // Parametric Rectified Linear Unit (PReLU)
5: padding ← 'same'             // output has the same size as the input
6: pool_size ← 2                // 2x2 pooling kernel
7: dropout_rate ← 0.3           // drop 30% of nodes
// Block 1: first CONV => PReLU => CONV => PReLU => POOL
filters_0 ← matrix of size 32
input ← padding(input)          // pad
conv_0 ← input * filters_0      // convolve
conv_0 ← activation(conv_0)     // apply activation function
conv_0 ← normalization(conv_0)  // apply batch normalization
filters_1 ← matrix of size 32
conv_1 ← padding(conv_0)        // pad
conv_1 ← conv_1 * filters_1     // convolve
conv_1 ← activation(conv_1)     // apply activation function
conv_1 ← normalization(conv_1)  // apply batch normalization
conv_1 ← max_pooling(conv_1)    // apply max pooling
conv_1 ← dropout(conv_1)        // apply dropout
// Block 2: second CONV => PReLU => CONV => PReLU => POOL
filters_2 ← matrix of size 64
conv_2 ← padding(conv_1)        // pad
conv_2 ← conv_2 * filters_2     // convolve
conv_2 ← activation(conv_2)     // apply activation function
conv_2 ← normalization(conv_2)  // apply batch normalization
filters_3 ← matrix of size 64
conv_3 ← padding(conv_2)        // pad
conv_3 ← conv_3 * filters_3     // convolve
conv_3 ← activation(conv_3)     // apply activation function
conv_3 ← normalization(conv_3)  // apply batch normalization
conv_3 ← max_pooling(conv_3)    // apply max pooling
conv_3 ← dropout(conv_3)        // apply dropout
// Block 3: third CONV => PReLU => CONV => PReLU => POOL
filters_4 ← matrix of size 128
conv_4 ← padding(conv_3)        // pad
conv_4 ← conv_4 * filters_4     // convolve
conv_4 ← activation(conv_4)     // apply activation function
conv_4 ← normalization(conv_4)  // apply batch normalization
filters_5 ← matrix of size 128
conv_5 ← padding(conv_4)        // pad
conv_5 ← conv_5 * filters_5     // convolve
conv_5 ← activation(conv_5)     // apply activation function
conv_5 ← normalization(conv_5)  // apply batch normalization
conv_5 ← max_pooling(conv_5)    // apply max pooling
conv_5 ← dropout(conv_5)        // apply dropout
// Block 4: first fully connected (FC) layer
filters_6 ← matrix of size 64
activation ← None
bias ← 0
conv_6 ← flatten(conv_5)                        // flatten
conv_6 ← activation(conv_6 · filters_6 + bias)  // dense (linear) layer
activation ← 'PReLU'
dropout_rate ← 0.5                              // drop 50% of nodes
conv_6 ← activation(conv_6)                     // apply activation function
conv_6 ← normalization(conv_6)                  // apply batch normalization
conv_6 ← dropout(conv_6)                        // apply dropout
// Block 5: second fully connected (FC) layer
filters_7 ← matrix of size 56
activation ← None
bias ← 0
conv_7 ← activation(conv_6 · filters_7 + bias)  // dense (linear) layer
activation ← 'PReLU'
dropout_rate ← 0.3                              // drop 30% of nodes
conv_7 ← activation(conv_7)                     // apply activation function
conv_7 ← normalization(conv_7)                  // apply batch normalization
conv_7 ← dropout(conv_7)                        // apply dropout
// Block 6: softmax classifier
input classes                                   // number of output classes
activation ← None
bias ← 0
output ← activation(conv_7 · classes + bias)    // dense (linear) layer
activation ← softmax                            // softmax activation for output
output ← activation(output)
return model
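For concreteness, one convolution block of the pseudocode above (pad → convolve → PReLU → pool) can be exercised in plain NumPy. This is an illustrative single-channel sketch, not the authors' implementation: batch normalization and dropout are omitted, and the toy 8 x 8 input stands in for a real frame.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d_same(x, kernel):
    """Naive 'same'-padded 2-D convolution for a single channel:
    output has the same spatial size as the input."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.empty_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out

def prelu(x, alpha=0.25):
    """Parametric ReLU: identity for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def max_pool2(x):
    """2x2 max pooling with stride 2: halves each spatial dimension."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = rng.normal(size=(8, 8))          # toy 8x8 single-channel "frame"
kernel = rng.normal(size=(3, 3))     # one 3x3 filter
y = max_pool2(prelu(conv2d_same(x, kernel)))
assert y.shape == (4, 4)             # 'same' conv keeps 8x8, pooling halves it
```

The same pad/convolve/activate/pool pattern repeats in each of the three convolutional blocks, with the filter count doubling (32, 64, 128) as the spatial size halves.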