Hanne Stenzel, Jon Francombe, Philip J B Jackson.
Abstract
The ventriloquism effect describes the phenomenon of audio and visual signals with common features, such as a voice and a talking face, merging perceptually into one percept even if they are spatially misaligned. The boundaries of the fusion of spatially misaligned stimuli are of interest for the design of multimedia products to ensure a perceptually satisfactory product. They have mainly been studied using continuous judgment scales and forced-choice measurement methods, and the results vary greatly between studies. The current experiment aims to evaluate audio-visual fusion using reaction time (RT) measurements as an indirect method of measurement to overcome this large variance. A two-alternative forced-choice (2AFC) word recognition test was designed and tested with noise and multi-talker speech background distractors. Visual signals were presented centrally, and audio signals were presented at audio-visual offsets between 0° and 31° in azimuth. RT data were analyzed separately for the underlying Simon effect and for attentional effects. For the attentional effects, three models were identified, but no single model could explain the observed RTs for all participants, so the data were grouped and analyzed accordingly. The results show that significant differences in RTs are measured from 5° to 10° onwards for the Simon effect. The attentional effect varied at the same audio-visual offset for two of the three defined participant groups. In contrast with prior research, these results suggest that, even for speech signals, small audio-visual offsets influence spatial integration subconsciously.
Keywords: Simon effect; audio-visual; reaction times; spatial correspondence; ventriloquism
Year: 2019 PMID: 31191211 PMCID: PMC6538976 DOI: 10.3389/fnins.2019.00451
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
Summary of papers on the limit of ventriloquism in audio-visual application settings.
| Reference | Tr | Stimuli | Reproduction setup | Type of test | Results |
| --- | --- | --- | --- | --- | --- |
| de Bruijn and Boone | X | Synchr. speech (AV) | 3D video projection, WFS, loudspeakers | Absolute 5-point impairment scale | No values given |
| Melchior et al. | X | Pink noise (A) with 3D object (V) | WFS, VR device | 5-point impairment scale with hidden anchor | 4°–8° |
| Bertelson and Aschersleben | X | 2 kHz pulses (A) with LED light flashes (V) | Phase panning between two loudspeakers, central LED | Staircase paradigm, JND | ~5° |
| Sporer et al. | T, U | “Meaningless speech” (A), pink noise (A), 10 cm white dot (V) | Wall of loudspeakers, interpolated panning, video projection | Staircase paradigm, JND | 4°–7° |
| Melchior et al. | T | Synchr. speech (AV) | WFS, 2D projection | 5-point impairment scale with hidden anchor | 5°–7° |
| Komiyama | T, U | Synchr. speech (AV), synchr. singing voice (AV) | Loudspeakers at every 5°, HDTV | Absolute 5-point impairment scale | 11° (T), 20° (U) |
| Stenzel et al. | T, U | Synchr. speech (AV) | Loudspeakers at every 5°, video projection | PF on coherent location, PSE | 10° (T), 19° (U) |
| André et al. | U | Synchr. speech (AV) | WFS, 3D projection | PF on coherent location, PSE | 18° |
| Bishop and Miller | U | Synchr. speech (AV); McGurk signals (AV); speech with still face (AV) | Individualized HRTFs for loudspeakers at every 6°, TV | PF on coherent location, PSE | ~19°; ~16°; ~10° |
| Lewald and Guski | U | 1 kHz pure tones (A), white diode (V) | Loudspeakers, diodes | 9-point scale on common cause; 9-point scale on spatial coincidence | ~15°; ~10° |
| Godfroy et al. | U | Burst of pink noise (A), white flashing circle (V) | Loudspeakers, 2D projection | PF on fusion of sound and vision | ~6° |
The “Tr” column details listener training (T, trained; U, untrained; X, unknown). The column “Type of test” lists the applied methods (PF, psychometric function). The “Results” column shows the maximum angle of accepted audio-visual offset.
Figure 1. Schematic description of the close link between the midbrain areas that process spatial information and direct movement, and the two processing streams. The first stage of combined spatial processing is found in the tectum, with the visual spatial information in the superior colliculus (SC), the auditory spatial information in the inferior colliculus (IC), and the direction of head and eye movement in the tectospinal tract (TT). In the direct neighborhood, the cerebral aqueduct (CA) controls the eye, the eye focus, and eyelid movements. The tegmentum (T), on the other side, is responsible for reflexive movement, alertness, and muscle tone in the limbs (Waldman, 2009). Following the processing in the midbrain, visual and auditory spatial information is forwarded to the lateral geniculate nucleus (LGN), the pulvinar, and the medial geniculate nucleus (MGN), respectively, followed by the corresponding visual and auditory cortices (VC, AC). Within the two cortices, spatial and feature information is separated into the ventral stream (green) across the temporal cortex (TC) and the dorsal stream (red) across the parietal cortex (PC). Decisions on motor reaction are then executed by the motor cortex (MC), the premotor cortex (PM), and the basis pedunculi (BP) in the midbrain (Stein et al., 2004; Malmierca and Hackett, 2010).
Figure 2. Time course of each of the discussed RT models, showing the relation between unimodal and bimodal RTs. In speech recognition, the faster RT is typically measured in the A condition and the slower RT in the V condition.
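Race and co-activation accounts of bimodal RTs are commonly told apart with the race model inequality, F_AV(t) ≤ F_A(t) + F_V(t), where F is the cumulative RT distribution in each condition. The record does not state whether the authors applied this test, so the following is only a minimal sketch under that assumption; the function names and the RT values in the usage note are hypothetical.

```python
import numpy as np

def ecdf(samples, t):
    """Empirical cumulative distribution of `samples`, evaluated at times `t`."""
    samples = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(samples, t, side="right") / samples.size

def race_model_violation(rt_a, rt_v, rt_av, t):
    """Return F_AV(t) - min(F_A(t) + F_V(t), 1) for each time in `t`.

    Positive values mean the bimodal RTs are faster than a race between
    independent unimodal processes would allow, which is usually read as
    evidence for co-activation. This is the generic inequality test,
    not necessarily the analysis used in the paper.
    """
    t = np.asarray(t, dtype=float)
    bound = np.minimum(ecdf(rt_a, t) + ecdf(rt_v, t), 1.0)
    return ecdf(rt_av, t) - bound
```

With made-up RTs in milliseconds, `race_model_violation([400, 450, 500], [500, 550, 600], [300, 320, 340], [350])` is positive, because every bimodal response precedes the fastest unimodal one.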
Figure 3. Hypothesis assuming the existence of two separate effects influencing RTs. (A) The first graph shows the assumed even, or axis-symmetric, effect of spatial attention on RTs, differentiating between the co-activation, race, and Colavita models. (B) The second graph depicts the odd, or point-symmetric, course of RTs due to the Simon effect (SE) for spatially congruent and incongruent responses. The size of the Simon effect is given as the difference in response time between congruent and incongruent responses. (C) The last graph combines both effects and shows the expected RT over offset when congruent responses are plotted at negative offset values and incongruent responses at positive offset values.
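The split into an even and an odd component can be illustrated with the standard decomposition of a function into its symmetric and antisymmetric parts. The sketch below assumes mean RTs are available at matched offsets, with congruent responses stored at −x and incongruent responses at +x as in panel (C). The function name and the numbers in the usage note are hypothetical, and the authors' actual estimation procedure may differ.

```python
import numpy as np

def even_odd_decomposition(rt_congruent, rt_incongruent):
    """Split RTs over signed offset into even and odd components.

    rt_congruent[i]   : mean RT at offset -x_i (congruent responses)
    rt_incongruent[i] : mean RT at offset +x_i (incongruent responses)

    even[i] is the axis-symmetric part, attributed here to spatial attention.
    odd[i] is the point-symmetric part; the Simon effect size
    (incongruent minus congruent RT) equals 2 * odd[i].
    """
    neg = np.asarray(rt_congruent, dtype=float)
    pos = np.asarray(rt_incongruent, dtype=float)
    even = (pos + neg) / 2.0
    odd = (pos - neg) / 2.0
    return even, odd
```

For example, congruent means of 480 and 470 ms with incongruent means of 520 and 530 ms give an even component of 500 ms at both offsets and odd components of 20 and 30 ms, that is, Simon effects of 40 and 60 ms.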
Word pairs used in the perceptual test.
| Word 1 | Word 2 | Consonant 1 | Consonant 2 | Phonetic group 1 | Phonetic group 2 | Viseme group 1 | Viseme group 2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pong | Song | /p/ | /s/ | Plosive (U) | Fricative (U) | Bilabial | Palatal |
| Pen | Den | /p/ | /d/ | Plosive (U) | Plosive (V) | Bilabial | Palatal |
| Sin | Fin | /s/ | /f/ | Fricative (U) | Fricative (U) | Palatal | Labiodental |
| Can | Fan | /k/ | /f/ | Plosive (U) | Fricative (U) | Palatal | Labiodental |
| Cog | Log | /k/ | /l/ | Plosive (U) | Liquid (V) | Palatal | Approximant |
| Food | Rude | /f/ | /r/ | Fricative (U) | Liquid (V) | Labiodental | Approximant |
| Beef | Reef | /b/ | /r/ | Plosive (V) | Liquid (V) | Bilabial | Approximant |
| Bus | Fuss | /b/ | /f/ | Plosive (V) | Fricative (U) | Bilabial | Labiodental |
| Gong | Wrong | /ɡ/ | /r/ | Plosive (V) | Liquid (V) | Palatal | Approximant |
| Man | Than | /m/ | /ð/ | Nasal (V) | Fricative (V) | Bilabial | Interdental |
Each word pair consists of two monosyllabic words differing in the first consonant only. The words are grouped according to the viseme and phonetic group of the first consonant (U, unvoiced; V, voiced).
Figure 4. Test setup showing the screen (I), the area covered by the loudspeakers and the video projection (II), and the area covered by the face (III). Written informed consent for publication of the personal image was obtained (CC BY-NC 4.0).
Figure 5. Representation of the trial sequence. Illustrated is a standard trial with audio presented at 10° left. The keywords are presented for 0.5 s, followed by a video of the sentence. The keyword is spoken at 1.0 s within the video. The feedback is presented once a response is given. After an interval of 1.0 s the next trial starts. The gray circles indicate the position of the loudspeakers and the symbol denotes the currently active one. Written informed consent for publication of all personal images was obtained (CC BY-NC 4.0).
Figure 6. RT distribution of two participants from the first experiment when pooled across left and right offsets. The graphs show the estimated difference in RT from the mean RT at 0°; that is, the y-axis indicates the difference in RT between responses given with no offset (0°) and those given with an offset. The bars indicate the standard error. These two datasets exemplify the large differences between participants: one shows the pattern typical of the co-activation model (A), the other the pattern typical of the Colavita effect (B).
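The quantity plotted in Figure 6 can be approximated from raw trial data by pooling left and right offsets by absolute value and subtracting the mean RT at 0°. The sketch below does this with plain per-offset means and the standard error of each mean. The caption speaks of an estimated difference, which suggests a model-based estimate, so treat this as an illustrative simplification with a hypothetical function name and hypothetical data.

```python
import numpy as np

def rt_difference_by_offset(offsets_deg, rts_ms):
    """Mean RT difference from the 0° baseline, pooled over left/right.

    Returns {abs_offset: (mean_rt - baseline_mean, sem)}, where sem is the
    standard error of the mean of the raw RTs at that offset (not the
    standard error of the difference itself).
    """
    offsets = np.abs(np.asarray(offsets_deg, dtype=float))  # pool -x with +x
    rts = np.asarray(rts_ms, dtype=float)
    baseline = rts[offsets == 0].mean()
    result = {}
    for off in np.unique(offsets):
        sel = rts[offsets == off]
        # SEM needs at least two trials; otherwise report NaN.
        sem = sel.std(ddof=1) / np.sqrt(sel.size) if sel.size > 1 else float("nan")
        result[float(off)] = (float(sel.mean() - baseline), float(sem))
    return result
```

Calling it with offsets `[0, 0, -10, 10, -10, 10]` and RTs `[500, 520, 530, 550, 540, 560]` pools the four ±10° trials into one entry with a 35 ms slowdown relative to the 0° mean of 510 ms.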
Figure 7. Experiment one: pink noise. This figure shows the decomposition of the original RT distribution into the underlying even and odd components per participant group, which were possibly caused by changes in spatial attention and by the Simon effect, respectively. The third graph depicts the original distribution of the data sorted into congruent and incongruent responses. Panels (A, C) show the change in mean RT between data at 0° and data at offsets; panel (B) shows the difference in RT between congruent and incongruent responses. The bars indicate the standard error.
Figure 8. Experiment two: multi-talker speech interference. This figure shows the decomposition of the original RT distribution into the underlying even and odd components per participant group, which were possibly caused by changes in spatial attention and by the Simon effect, respectively. The third graph depicts the original distribution of the data sorted into congruent and incongruent responses. Panels (A, C) show the change in mean RT between data at 0° and data at offsets; panel (B) shows the difference in RT between congruent and incongruent responses. The bars indicate the standard error.