Abstract
Traditional video recommendation provides viewers with customized media content according to their historical records (e.g., ratings, reviews). However, such systems tend to perform poorly when these data are insufficient, which leads to the cold-start problem. An affective video recommender system (AVRS) is a multidisciplinary, multimodal human-robot interaction (HRI) system: it draws on physical, physiological, neuroscience, and computer science knowledge and on multimedia resources, including text, audio, and video. As a promising research domain, AVRS applies advanced affective analysis techniques to video resources and can therefore alleviate the cold-start problem. In AVRS, the viewers' emotional responses can be obtained through various techniques, including physical signals (e.g., facial expression, gestures, and speech) and internal signals (e.g., physiological signals). Changes in these signals can be detected when viewers face specific situations. Physiological signals are responses of the central and autonomic nervous systems; they are mostly activated involuntarily and cannot easily be controlled, which makes them suitable for reliable emotion analysis. Physical signals can be recorded by a webcam or recorder, whereas physiological signals are collected by various equipment, e.g., heart rate (HR) signals derived from the electrocardiogram (ECG), electrodermal activity (EDA), brain activity (BA) from electroencephalography (EEG) signals, skin conductance response (SCR) from galvanic skin response (GSR), and photoplethysmography (PPG) estimating users' pulse. This survey aims to provide a comprehensive overview of the AVRS domain. To analyze recent efforts in the field of affective video recommendation, we collected 92 relevant published articles from Google Scholar and summarized the articles and their key findings. In this survey, we examine these articles on AVRS from different perspectives, including traditional recommendation algorithms and advanced deep learning-based algorithms, the commonly used affective video recommendation databases, audience response categories, and evaluation methods. Finally, we summarize the challenges of AVRS and outline potential future research directions.
Keywords: affective analysis; affective video recommender system; deep learning; multidiscipline; multimodal; neuroscience; physiological signals; video recommendation
Year: 2022 PMID: 36090291 PMCID: PMC9459336 DOI: 10.3389/fnins.2022.984404
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 5.152
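To make the affect-driven recommendation loop described in the abstract concrete, the sketch below maps two hypothetical physiological features (heart rate and EDA) to a valence-arousal estimate and ranks videos by their annotated emotional labels, sidestepping the need for historical ratings. Every name, threshold, and the linear mapping here is an illustrative assumption, not the method of any surveyed paper.

```python
# Minimal sketch: physiological features -> valence-arousal estimate -> ranking
# of videos by annotated affect. All mappings below are toy assumptions.
from dataclasses import dataclass


@dataclass
class Video:
    title: str
    valence: float  # annotated emotional valence in [-1, 1]
    arousal: float  # annotated emotional arousal in [-1, 1]


def estimate_affect(hr_bpm: float, eda_us: float) -> tuple[float, float]:
    """Map two physiological features (HR from ECG/PPG, EDA/GSR) to a crude
    valence-arousal point; real systems learn this mapping from data."""
    arousal = max(-1.0, min(1.0, (hr_bpm - 70.0) / 30.0))  # faster HR -> higher arousal
    valence = max(-1.0, min(1.0, 0.5 - eda_us / 10.0))     # toy heuristic only
    return valence, arousal


def recommend(videos: list[Video], valence: float, arousal: float, k: int = 3) -> list[Video]:
    """Rank videos by squared distance to the viewer's estimated affect, so no
    historical ratings are required (one route around cold start)."""
    return sorted(
        videos,
        key=lambda v: (v.valence - valence) ** 2 + (v.arousal - arousal) ** 2,
    )[:k]


catalog = [
    Video("calm documentary", valence=0.3, arousal=-0.6),
    Video("action thriller", valence=0.1, arousal=0.9),
    Video("feel-good comedy", valence=0.8, arousal=0.4),
]
v, a = estimate_affect(hr_bpm=92.0, eda_us=3.0)
print([video.title for video in recommend(catalog, v, a)])
```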
Comparisons between this survey and existing reviews.
| Main concerns | Review 1 | Review 2 | Review 3 | Review 4 | Our survey |
| --- | --- | --- | --- | --- | --- |
| Multimodal feature | × | ✓ | ✓ | ✓ | ✓ |
| Multimodal data sources | × | × | Few | × | ✓ |
| Deep learning methods | ✓ | ✓ | ✓ | ✓ | ✓ |
| Affective computing | ✓ | × | ✓ | ✓ | ✓ |
| Multidiscipline knowledge | × | × | ✓ | × | ✓ |
| Video content | ✓ | ✓ | Few | ✓ | ✓ |
Publications from different sources.
| Databases | Number of publications | Percentage |
| --- | --- | --- |
| ACM | 11 | 11.96% |
| IEEE | 35 | 38.04% |
| Elsevier | 11 | 11.96% |
| Springer | 15 | 16.30% |
| Others | 20 | 21.74% |
| Total | 92 | 100% |
FIGURE 1 Distribution of publications in AVRS.
FIGURE 2 The evolution of AVRS with different algorithms and databases.
Publications based on different techniques.
| Categories | Algorithm/Model | Publications |
| --- | --- | --- |
| Traditional methods | SVM/SVR | |
| | Clustering | |
| | AdaBoost | |
| | MA | |
| | CF | |
| | CBF | |
| | KG | |
| | GA | |
| | HRS | |
| Deep learning-based methods | RL | |
| | CNN | |
| | LSTM | |
| | MLP | |
| | DHM | |
FIGURE 3 The framework of the improved AdaBoost learning algorithm (Zhao et al., 2013).
FIGURE 4 The recommendation process based on KG (Breitfuss et al., 2021).
FIGURE 5 The RL sequence of states and actions (Tripathi et al., 2018).
FIGURE 6 The framework of the proposed method (Zhu et al., 2019).
FIGURE 7 The architecture of VECERM (Cao et al., 2022).
FIGURE 8 The MLP model (Ðorđević Čegar et al., 2020).
FIGURE 9 The framework of EmoWare (Tripathi et al., 2019).
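Among the traditional methods listed above, collaborative filtering (CF) is a common building block. The following is a minimal, generic sketch of user-based CF with cosine similarity; the ratings, user IDs, and choice of similarity measure are illustrative assumptions and do not reproduce any surveyed system.

```python
# Generic user-based collaborative filtering: predict a user's rating of an
# unseen video as the similarity-weighted average of neighbors' ratings.
import math

# user -> {video_id: rating}; a tiny hypothetical rating matrix
ratings = {
    "u1": {"v1": 5.0, "v2": 3.0, "v3": 4.0},
    "u2": {"v1": 4.0, "v2": 3.0, "v4": 5.0},
    "u3": {"v2": 1.0, "v3": 2.0, "v4": 4.0},
}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm


def predict(user: str, video: str) -> float:
    """Similarity-weighted average of other users' ratings for `video`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other != user and video in theirs:
            s = cosine(ratings[user], theirs)
            num += s * theirs[video]
            den += abs(s)
    return num / den if den else 0.0


print(round(predict("u1", "v4"), 2))  # rating prediction for an unseen video
```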
The databases for affective computing.
| Name | Details | Publication |
| --- | --- | --- |
| The affective feedback database | Questionnaires of 24 participants on tasks, the search process, and the emotional experience of the information-seeking process. | |
| Cohn–Kanade expression database | The database has 2,105 digitized image sequences of 182 adult subjects and includes multiple tokens of most primary FACS action units, making it suitable for comparative studies. | |
| Moviepilot mood track | It consists of 4.5M ratings assigned by 105K users to 25K movies. Various contextual information is provided, e.g., gender, age, production year, the audience of each movie, and movie-mood tags. | |
| The Hollywood movie video clips database | Contains 155 video clips from Hollywood movies, annotated by 40 participants with more than 1,300 annotations. | |
| The Tellyads and YouTube video clips database | Contains 15 videos, 165 min in duration, from various genres, e.g., TV shows, movie clips, and news broadcasts. | |
| The affective property movie database | The database contains more than 2,000 videos; movie affective properties are measured by arousal and valence. | |
| Nvidia 3D Vision database | The database contains nine stereoscopic sequences of nearly 2 min duration. | |
| The movie profile database | It contains an item profile of various attributes describing the movie content. | |
| The five emotional reactions database | Two standard webcams operating in real time capture the users' facial expressions and estimate their pulse. The users' reactions are classified into five categories: happiness, sadness, anger, fear, and surprise. | |
| Cohn–Kanade database | Consists of 100 students of different ethnicities, i.e., African–American, Asian, and Latino. Each subject performs a series of 23 facial displays. The selected sequences are labeled with six emotions: anger, disgust, fear, happiness, sadness, and surprise. | |
| The clicker and emotional reaction database | It consists of 30 subjects aged 18–35. Each subject watches five videos while two webcams monitor their behavior. The subjects are also surveyed about their watching experience and ratings. | |
| DEAP | A multimodal database using EEG and physiological signals for emotion analysis. It records the physiological signals of 32 subjects while they watch 1-min music videos. | |
| Algebra video field test database | The data are collected from a field experiment with 18,925 school students and 152 teachers in 149 schools. | |
| Cohn–Kanade database | It contains photos of different emotions, from a neutral state to an explicit one. | |
| The 0-MOOD, 7-MOOD, 16-MOOD | They contain 0, 7, and 16 mood states, respectively. | |
| The user action session database | Affivir continuously crawls video data from the Internet, and user preference features are extracted. | |
| The format video database | It contains 1,000 mp4 videos ranging from 30 s to 10 min, collected from various websites, e.g., Youku.com and YouTube.com. | |
| The footwear advertising videos database | Contains facial features and ratings of 52 subjects; the movement of key facial points is recorded continuously. | |
| The NEAR database | The NEAR database consists of a wide range of databases, i.e., the Property Video Clip Ads Database, a text database of video clips. | |
| LIRIS-ACCEDE | It contains 9,800 video clips extracted from 160 feature films and short films. It is the largest video database with emotional labels and can be used for video indexing, summarization, and browsing. | |
| PM-SZU | A new database for affective video analysis. It consists of 386 video clips extracted from 8 films. | |
| The metacritic.com and imdb.com database | It consists of 2,627,476 movie reviews. | |
| Danmu database | It contains a large number of user-generated comments from Bilibili. | |
| LDOS-PerAff-1 Corpus | It consists of subjects' affective responses to video clips; responses are annotated in the continuous valence-arousal-dominance space, and subjects are annotated with personality information. | |
| Mechanical Turk setup | It contains affective annotations for the corpus to evaluate viewers' reported boredom. | |
| Multidimensional sentiment dictionary from Ren CE | It includes 1,487 blogs and many emotional words, labeled as vectors of 8 dimensions. | |
| YouTube video clips | Contains 600 videos, 480 of which have transcripts. | |
| LDOS-CoMoDa | It consists of contextual information and ratings on the movies users consumed, together with personality profiles. | |
| The IMDB movie scenes | 240 users viewed videos of 25 movie scenes on IMDb; the viewing duration is recorded. | |
| The AFEW database | A dynamic, temporal facial-expression data corpus containing short video clips of facial expressions close to the real world. | |
| The SFEW database | A static facial-expression database captured under harsh conditions, consisting of seven facial expression classes. | |
The audience responses in different publications.
| Audience responses | Publications |
| --- | --- |
| Facial expressions/features | |
| Skin-estimated pulse/heart rate | |
| Mood | |
| EDA | |
| BA | |
| User interactions | |
| GSR | |
| Body gestures | |
| Perceived connotative properties | |
| Movie reviews/comments/web recordings | |
| Questionnaire/survey/quiz | |
The evaluation metrics of different publications.
| Metrics | Related research papers |
| --- | --- |
| Pearson's chi-square test and the dependent t-test | |
| Mean accuracy | |
| Precision/recall/F1 | |
| MAE | |
| MSE/RMSE | |
| ROC | |
| CTR | |
| Session length | |
| Confusion matrix | |
| CACE | |
| Sparsity impact, granularity of emotions, extensibility, recommendation quality, additional characteristics | |
| Valence, arousal | |
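For reference, the standard definitions of the rating-error and classification metrics listed above, where $\hat{r}_i$ is the predicted and $r_i$ the true rating over $n$ test items, and TP/FP/FN are counts taken from a confusion matrix:

```latex
% Standard metric definitions for recommender evaluation.
\begin{align}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert \hat{r}_i - r_i \rvert \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{r}_i - r_i)^2} \\
\mathrm{Precision} &= \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
\end{align}
```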