Kaito Hirasawa, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama.
Abstract
In this study, a novel method for predicting important scenes in baseball videos using a time-lag aware latent variable model (Tl-LVM) is proposed. Tl-LVM adopts a multimodal variational autoencoder that takes tweets and videos as its latent variable model. It calculates latent features from these tweets and videos and predicts important scenes from those latent features. Since time lags exist between events and the tweets posted about them, Tl-LVM introduces a loss that considers time lags by incorporating the feature correlation into the loss function of the multimodal variational autoencoder. Furthermore, this loss function allows Tl-LVM to train the encoder, decoder, and important scene predictor simultaneously. This is the novelty of Tl-LVM: to the best of our knowledge, this work is the first end-to-end model for important scene prediction that considers time lags. The contribution of Tl-LVM is to realize high-quality prediction using latent features that account for the time lags between tweets and the multiple corresponding previous events. Experimental results on actual tweets and baseball videos show the effectiveness of Tl-LVM.
Keywords: Twitter; latent variable model; prediction of important scenes; sports video; time lags
Year: 2022 PMID: 35408079 PMCID: PMC9002476 DOI: 10.3390/s22072465
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Overview of Tl-LVM. The encoder extracts latent features from features of tweets and baseball videos, and the decoder reconstructs the original features from the latent features. The important scene predictor outputs the probability that the target scene is important and determines whether the scene is important or normal from this probability. The encoder, decoder, and important scene predictor each consist of multiple fully connected layers. Unlike the conventional method [14], which trains only part of the encoder considering time lags, Tl-LVM can train the entire network considering time lags based on the new loss function; therefore, this figure does not show a time-lag aware transformation inside the encoder, as the figure of the conventional method does. While Tl-LVM is a model for prediction, the conventional method is a model for detection; thus, Tl-LVM has a predictor network, whereas the conventional method has a detector network.
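The three trainable parts in Figure 1 share one joint loss, which is what allows end-to-end training. The sketch below is a minimal, dependency-free illustration, not the paper's actual formulation: the single-scalar latent, the weights `alpha` and `beta`, and the generic `lag_term` placeholder (standing in for the paper's time-lag-aware correlation loss, whose exact form is not given here) are all simplifying assumptions.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    # Sample z = mu + sigma * eps (VAE reparameterization trick)
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ) for a 1-D latent variable
    return 0.5 * (mu ** 2 + math.exp(log_var) - 1.0 - log_var)

def binary_cross_entropy(p, y):
    # Prediction loss: p is the predicted probability of "important", y is 0/1
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def joint_loss(recon_err, mu, log_var, p_important, label,
               lag_term=0.0, alpha=1.0, beta=1.0):
    # One loss for encoder, decoder, and predictor: reconstruction + KL
    # + prediction, plus a placeholder for the time-lag-aware term.
    return (recon_err
            + alpha * kl_to_standard_normal(mu, log_var)
            + beta * binary_cross_entropy(p_important, label)
            + lag_term)
```

Because every term is differentiable with respect to the shared latent parameters, a single optimizer step can update all three networks at once, which is the end-to-end property the abstract emphasizes.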
Figure 2. Relationship between multiple events and a tweet affected by them. These events are weighted by a degree of influence defined using the Poisson distribution. This figure is adapted from Reference [14].
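The Poisson weighting in Figure 2 can be sketched in a few lines. The assumptions here are illustrative: lags are indexed k = 0, ..., L−1 over the L most recent events, and the pmf values are renormalized to sum to one; the paper's exact indexing and normalization may differ.

```python
import math

def poisson_weights(lam, L):
    # Poisson pmf P(k) = exp(-lam) * lam^k / k! evaluated at lags k = 0..L-1,
    # then renormalized so the L weights sum to 1.
    pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(L)]
    total = sum(pmf)
    return [p / total for p in pmf]
```

Each of the L previous events then contributes to the tweet's correlation term in proportion to its weight, so recent events dominate for small lam while influence spreads to earlier events as lam grows.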
Details of the baseball videos used in the experiment.
| Games | Game Time | Number of Important Scenes | Average Time Length of Important Scenes |
|---|---|---|---|
| 1 | 3 h 23 min | 12 | 2 min 28 s |
| 2 | 2 h 59 min | 14 | 2 min 31 s |
| 3 | 3 h 32 min | 22 | 2 min 03 s |
| 4 | 3 h 08 min | 18 | 1 min 58 s |
| 5 | 2 h 44 min | 11 | 1 min 40 s |
Features used in Comps. 1–6.
| Features | Comp. 1 | Comp. 2 | Comp. 3 | Comp. 4 | Comp. 5 | Comp. 6 |
|---|---|---|---|---|---|---|
| Tweet | ✓ | ✓ | ✓ | | | |
| Video | ✓ | ✓ | ✓ | | | |
| Audio | ✓ | ✓ | ✓ | | | |
F-measure of important scene prediction in Tl-LVM and Comps. 1–11. Bold values represent the highest F-measure.
| Games | Tl-LVM | Comp. 1 | Comp. 2 | Comp. 3 | Comp. 4 | Comp. 5 |
|---|---|---|---|---|---|---|
| 1 | | 0.469 | 0.457 | 0.456 | 0.502 | 0.481 |
| 2 | | 0.355 | 0.366 | 0.334 | 0.385 | 0.378 |
| 3 | | 0.402 | 0.412 | 0.404 | 0.408 | 0.412 |
| 4 | | 0.330 | 0.325 | 0.323 | 0.345 | 0.345 |
| 5 | | 0.334 | 0.314 | 0.298 | 0.334 | 0.314 |
| Average | | 0.378 | 0.375 | 0.363 | 0.395 | 0.386 |

| Games | Comp. 6 | Comp. 7 | Comp. 8 | Comp. 9 | Comp. 10 | Comp. 11 |
|---|---|---|---|---|---|---|
| 1 | 0.527 | 0.440 | 0.502 | 0.517 | 0.525 | 0.552 |
| 2 | 0.378 | 0.385 | 0.393 | 0.385 | 0.387 | 0.400 |
| 3 | 0.408 | 0.394 | 0.451 | 0.456 | 0.465 | |
| 4 | 0.344 | 0.345 | 0.357 | 0.352 | 0.353 | 0.360 |
| 5 | 0.306 | 0.358 | 0.364 | 0.364 | 0.357 | 0.381 |
| Average | 0.393 | 0.384 | 0.413 | 0.415 | 0.417 | 0.436 |
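The F-measure reported in these tables is the standard harmonic mean of precision and recall over predicted important scenes. A minimal reference implementation is below; how predicted scenes are matched to ground truth is not specified in this record, so true/false positive and false negative counts are assumed to be given.

```python
def f_measure(tp, fp, fn):
    # Precision: fraction of predicted important scenes that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of ground-truth important scenes that were predicted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Because important scenes are rare (11-22 per roughly three-hour game in the table above), F-measure is a more informative summary than accuracy, which a trivial "always normal" predictor would dominate.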
Figure 3. Example of an important scene correctly predicted by Tl-LVM, together with tweets whose content refers to this scene. The horizontal axis represents time. The ground truth for this scene and the labels predicted by Tl-LVM and Comp. 11 are shown with respect to time. There is a time lag between when this important scene occurs and when these tweets are posted.
Average F-measure of Tl-LVM over all games when the parameters λ and L of the Poisson distribution are changed. Bold values represent the highest F-measure.
| | | | | | | Average |
|---|---|---|---|---|---|---|
| | 0.417 | 0.436 | 0.415 | 0.403 | 0.392 | 0.413 |
| | 0.436 | | 0.431 | 0.417 | 0.403 | 0.426 |
| | 0.415 | 0.427 | 0.403 | 0.407 | 0.386 | 0.408 |
| | 0.403 | 0.409 | 0.403 | 0.392 | 0.378 | 0.397 |
| | 0.395 | 0.413 | 0.386 | 0.378 | 0.367 | 0.388 |
| Average | 0.413 | 0.425 | 0.408 | 0.399 | 0.385 | 0.406 |