Matthieu Dubois, David Poeppel, Denis G Pelli.
Abstract
To understand why human sensitivity for complex objects is so low, we study how word identification combines eye and ear or parts of a word (features, letters, syllables). Our observers identify printed and spoken words presented concurrently or separately. When researchers measure threshold (energy of the faintest visible or audible signal) they may report either sensitivity (one over the human threshold) or efficiency (ratio of the best possible threshold to the human threshold). When the best possible algorithm identifies an object (like a word) in noise, its threshold is independent of how many parts the object has. But, with human observers, efficiency depends on the task. In some tasks, human observers combine parts efficiently, needing hardly more energy to identify an object with more parts. In other tasks, they combine inefficiently, needing energy nearly proportional to the number of parts, over a 60∶1 range. Whether presented to eye or ear, efficiency for detecting a short sinusoid (tone or grating) with few features is a substantial 20%, while efficiency for identifying a word with many features is merely 1%. Why? We show that the low human sensitivity for words is a cost of combining their many parts. We report a dichotomy between inefficient combining of adjacent features and efficient combining across senses. Joining our results with a survey of the cue-combination literature reveals that cues combine efficiently only if they are perceived as aspects of the same object. Observers give different names to adjacent letters in a word, and combine them inefficiently. Observers give the same name to a word's image and sound, and combine them efficiently. The brain's machinery optimally combines only cues that are perceived as originating from the same object. Presumably such cues each find their own way through the brain to arrive at the same object representation.
Year: 2013 PMID: 23734220 PMCID: PMC3667182 DOI: 10.1371/journal.pone.0064803
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. A survey of summation.
| Senses (number) | Senses | k, ‘same’ | k, ‘different’ | ‘Same’ object | ‘Different’ objects | Report | Cues |
| 2 | Vision & hearing | 0.76 | | A word, seen and heard | | Word | Spoken and printed word |
| 2 | Vision & hearing | 0.68 | 0.48 | Point-light tap dancer, seen and heard | Two different point-light tap-dance movies, one seen and one heard | Dancing presence | Dot display and sound of tap dancing plus random dots and taps |
| 2 | Vision & hearing | 0.7 | 0.25 | A moving flash/click, seen and heard | A moving flash and a moving click, displaced | Motion | 31 lamps and loudspeakers along the horizontal meridian presenting flashes and clicks that move |
| 2 | Vision & hearing | 1 | | A blob-click, seen and heard: “a ball thudding onto the screen” | | Location | A brief bright visual blob and a sound click |
| 2 | Vision & tactile | 1 | | Moving grating, seen and felt | | Velocity | Visual and tactile motion of metal gratings |
| 2 | Vision & haptic | 0.74 | 0.17 | Like a thick dusty sheet of glass, seen and felt | Two glass sheets, displaced, one seen and one touched | Thickness (distance between the two parallel surfaces) | Seeing random dot visual stereogram of front and back surfaces and squeezing the two surfaces between thumb and fingertip |
| 2 | Vision & haptic | 1 | | A bar, seen and felt | | Thickness | A raised bar in a visual stereogram and a bar squeezed by finger and thumb |
| 1 | Hearing | | 0.46 | | The syllables of a word | Word | Spoken syllables, successive |
| 1 | Hearing | | 0.54 | | The syllables of a word | Word | Spoken syllables, successive |
| 1 | Hearing | | 0.53 | | 16 different simultaneous tones | Tone presence | Brief tones of 16 different frequencies |
| 1 | Vision | | 0.1 | | The letters of a word | Word | Printed letters, adjacent |
| 1 | Vision | | 0.1 | | The “features” of a letter | Letter | One of 26 letters of a given font or alphabet. The number of “features” is assumed to be proportional to letter complexity. |
| 1 | Vision | | 0.48 | | The “features” of a letter | Letter presence | A band-pass filtered letter. The number of “features” is assumed to be proportional to letter complexity. |
| 1 | Vision | 1 | | A letter, filtered to remove all but a band of high or low spatial frequencies | | Letter | High- and low-frequency bands of a letter |
| 1 | Vision | | 0 | | High- and low-frequency gratings | Grating presence | High- and low-frequency gratings |
| 1 | Vision | | 0.57 | | High- and low-frequency gratings | Grating presence | High- and low-frequency gratings |
| 1 | Vision | | 0.57 | | Grating patches, adjacent | Grating presence | Grating patches, adjacent |
| 1 | Vision | | 0.43 | | The many brief gratings that make up a prolonged grating | Grating presence | A grating of various durations |
| 1 | Vision | | 0.30 | | Successive cycles over time | Flicker presence | A circular uniform field flickering at various temporal frequencies |
| 1 | Vision | 0.95 | 0.58 | A “stationary” grating, actually two gratings moving very slowly in opposite directions | Two gratings moving quickly in opposite directions | Grating presence | Two gratings moving in opposite directions |
| 1 | Vision | 1 | | A slanted plane, with appropriate gradients in texture and binocular disparity | | Surface slant discrimination | A plane whose slant is cued by the perspective gradient in texture, or by the gradient in binocular disparity, or both |
| 1 | Vision | 1 | | A slanted plane, with appropriate linear perspective, or perspective-caused gradient in texture, or both | | Surface slant | Linear perspective and texture gradient |
| 1 | Vision | 0.89 | | The border between two areas | | Location | Boundary contour between two areas differing in luminance, color, and/or texture. |
Taken from 23 papers, these are summation experiments: the observer’s thresholds for a compound stimulus are compared with the thresholds for the components of the compound [31, Sec. 1.11.2]. The index of summation, k, ranges from none (k = 0) to optimal (k = 1). This table shows that, whether using one or several senses, and whether the task is high- or low-level, summation is strong (k near 1) only if the two components are both perceived as aspects of the “same” thing. When cues are “different” things, summation is weak, 0 ≤ k ≤ 0.58 with a mean of 0.37; when cues are the “same” thing, summation is strong, 0.68 ≤ k ≤ 1 with a mean of 0.89. The table includes all the perceptual summation efficiencies for adults that we found (or could calculate from published results) in a quick survey of the literature. It includes only summation of cues that are consistent and informative. It omits cue conflict studies, in which the cues provide conflicting information about the quality to be reported. It also omits the facilitation paradigm, in which one of the cues provides no information. At an n-component threshold that equates the energy of the components, if the summation index is k then the summation efficiency is $n^{k-1}$. Conservation of energy, i.e. the optimal algorithm, has k = 1. Independence of successes, i.e. probability summation, has k ≈ 0.57. In most cases the papers do not report whether their observers perceived the two components as the “same” object, so we have guessed, based on our experience and a close reading of what the papers do say. Another two papers turned up by our search could not be included in the table because we were unable to classify their stimuli, with any confidence, as “same” or “different” [45], [65]. Obviously our guesses must yield to better assessments. Table S1 explains how k was estimated for each paper.
Note that the predicted value for the summation index of roughly 0.6 for probability summation is for detection, whereas most of the experiments in Table 1 are identification.
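The two means quoted in the caption can be re-derived from the k values listed in the table. The Python sketch below does so, splitting the values by the caption's own ranges. It also evaluates a summation efficiency of the form n^(k−1). That form is an assumption, inferred from Figure 3 (threshold energy grows as n^(1−k) while the ideal threshold stays flat); the list and function names are illustrative only.

```python
from statistics import mean

# k values transcribed from Table 1, split by the caption's ranges
# ("same": 0.68 <= k <= 1; "different": 0 <= k <= 0.58).
K_SAME = [0.76, 0.68, 0.7, 1, 1, 0.74, 1, 1, 0.95, 1, 1, 0.89]
K_DIFFERENT = [0.48, 0.25, 0.17, 0.46, 0.54, 0.53, 0.1, 0.1,
               0.48, 0, 0.57, 0.57, 0.43, 0.30, 0.58]


def summation_efficiency(k, n):
    """Assumed form n**(k - 1): equals 1 when k = 1 and 1/n when k = 0."""
    return n ** (k - 1)


mean_same = mean(K_SAME)
mean_different = mean(K_DIFFERENT)

print(f"mean k, 'same' objects:      {mean_same:.2f}")
print(f"mean k, 'different' objects: {mean_different:.2f}")
print(f"two-cue efficiency at those means: "
      f"{summation_efficiency(mean_same, 2):.2f} vs "
      f"{summation_efficiency(mean_different, 2):.2f}")
```

Under this assumed form, a two-component stimulus with no summation (k = 0) has efficiency 1/2, and the cost grows with the number of parts n.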
Figure 1. Materials and Methods.
(A) A sentence (or a word) is presented as two concurrent streams: text and speech in visual and audio white noise. The observer identifies the words. In Experiments 1, 3, and 4, the visual stream includes only one word. In Experiment 2, the visual stream is a rapid serial visual presentation [35] of a sentence, presented one word at a time. The audio stream presents the same words as the visual stream. (B & C) The critical difference between models B and C is whether the two streams converge before or after detecting the signal. This dichotomy has been called “pre- and post-labelling” in speech recognition [36]. A neural receptive field computes a weighted average of the stimulus, i.e. the cross correlation of the stimulus and the receptive field’s weighting function [37]. In fact, if the noise is white, taking the weighting function to be a known signal, the receptive field is computing the log likelihood ratio of the presence of that signal in the stimulus, relative to zero signal. When the possible signals are equally probable, the best performance is attained by the maximum likelihood choice. (B) In probability summation, there is a receptive field for each possible signal. Detection occurs independently in each stream and the detections are combined logically to yield the overt decision. This is practically optimal when there is uncertainty among the known signals [38]. (C) In linear summation there is just one receptive field. The signals are linearly combined by a single audio-visual receptive field, followed by a single detector, which emits the final decision [22]. This is optimal for a known audiovisual signal.
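The receptive-field computation described in this caption can be sketched in a few lines of Python. The sketch assumes white Gaussian noise of known standard deviation and equally probable known signals; in that case the log likelihood ratio for a template is its cross correlation with the stimulus, minus half the template energy, divided by the noise variance. The templates, sizes, and function names are hypothetical and are not the stimuli used in the experiments.

```python
import math
import random


def log_likelihood_ratio(stimulus, template, noise_sd):
    """Log likelihood ratio of 'template present' vs. 'zero signal'
    for a stimulus in white Gaussian noise of standard deviation noise_sd."""
    cross_correlation = sum(x * s for x, s in zip(stimulus, template))
    template_energy = sum(s * s for s in template)
    return (cross_correlation - 0.5 * template_energy) / noise_sd ** 2


def maximum_likelihood_choice(stimulus, templates, noise_sd):
    """Index of the most likely template, assuming equal prior probabilities."""
    scores = [log_likelihood_ratio(stimulus, t, noise_sd) for t in templates]
    return max(range(len(templates)), key=scores.__getitem__)


# Hypothetical example: three 8-sample "signals", the second shown in noise.
rng = random.Random(0)
templates = [[math.sin(2 * math.pi * f * i / 8) for i in range(8)]
             for f in (1, 2, 3)]
noise_sd = 0.1
stimulus = [s + rng.gauss(0, noise_sd) for s in templates[1]]
choice = maximum_likelihood_choice(stimulus, templates, noise_sd)
print("chosen template index:", choice)
```

Concatenating the audio and visual samples into one stimulus vector and one template per word would correspond to the single audio-visual receptive field of model C; the sketch does not implement the separate per-stream detections of model B.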
Figure 2. Predictions and Results.
(A) Audio-visual summation is summarized by the summation index k of a smooth curve (Eq. 8) fitted to the threshold energies. The horizontal and vertical scales represent the normalized visual and audio energy components v = V/V_uni and a = A/A_uni of the bimodal signal at threshold. Each audio:visual ratio, including the two unimodal conditions (V_uni, 0) and (0, A_uni), is a condition. All conditions are randomly interleaved, trial by trial (with one exception, described at the end of this caption). The noise is always present in both streams. For a given audio:visual ratio A/V, we measure the threshold (V, A) radially, along a line from the origin (0, 0). The curves represent degrees of summation ranging from none (k = 0) to complete (k = 1). The special case of k = 0 is to be understood as the limit as k approaches 0, which is max(v, a) = 1. (B) Averaging k over our ten observers, we find the same summation for reporting either a single word (k = 0.75, red, Experiment 1) or a sentence (k = 0.76, blue, Experiment 2). The error bars indicate mean ± standard error. The curves obtained for each individual observer are shown in Figure S2. The virtue of randomly interleaving conditions (a:v ratios) is that the observer begins every trial in the same state, which enhances the comparability of the conditions plotted above. However, one might wonder how much better the observer would perform when the whole block is devoted to one condition. Random interleaving produces uncertainty; blocking each condition does not. Testing one observer (MD) on three conditions (audio, visual, and audiovisual signal; noise always present in both streams), we find no significant difference between thresholds measured with and without uncertainty (i.e. interleaved vs. blocked conditions). Furthermore, ideal observer thresholds for the same conditions are negligibly different with and without uncertainty. This indicates that the results presented in this figure, found with uncertainty, also apply to performance without uncertainty.
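Eq. 8 is not reproduced in this caption, so the following Python sketch assumes one family of curves that matches the limits the caption states: v^(1/k) + a^(1/k) = 1, which gives v + a = 1 at k = 1 and approaches max(v, a) = 1 as k approaches 0. This is an assumed stand-in for the fitted curve, not a quotation of Eq. 8, and the function names are hypothetical.

```python
import math


def threshold_on_curve(k, ratio_a_to_v):
    """Normalized threshold (v, a) where the ray a = ratio_a_to_v * v meets
    the assumed curve v**(1/k) + a**(1/k) = 1, for 0 < k <= 1."""
    p = 1.0 / k
    v = (1.0 + ratio_a_to_v ** p) ** (-1.0 / p)
    return v, ratio_a_to_v * v


def index_from_balanced_threshold(v):
    """Recover k from a balanced threshold (v == a), where 2 * v**(1/k) = 1."""
    return -math.log2(v)


for k in (1.0, 0.75, 0.01):
    v, a = threshold_on_curve(k, 1.0)
    print(f"k = {k}: balanced threshold v = a = {v:.3f}")
```

On this assumed curve, the balanced condition alone determines k, which is why a single radial measurement at v = a is already informative about the degree of summation.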
Figure 3. Assessing efficiency for combining the parts of a word: energy threshold as a function of word length.
The summation index k is 1 minus the slope. Ideal thresholds, not shown, are independent of word length, with slope zero. (A) For a written word [24], the summation index is k = 0.1. (B) For a spoken word [48], [49], the summation index is k = 0.5 or 0.7. See Methods for details.
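Reading k off as 1 minus the slope presumes a straight-line fit of log threshold energy against log word length. A minimal Python sketch of that fit follows, using ordinary least squares. The threshold values are synthetic, generated to grow as n^0.9, and are not data from the paper; the function name is hypothetical.

```python
import math


def summation_index_from_thresholds(lengths, threshold_energies):
    """Least-squares slope of log E vs. log n, returned as k = 1 - slope."""
    xs = [math.log10(n) for n in lengths]
    ys = [math.log10(e) for e in threshold_energies]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return 1.0 - slope


# Synthetic thresholds growing as n**0.9 (illustrative, not measured data).
lengths = [1, 2, 4, 8, 16]
energies = [n ** 0.9 for n in lengths]
k_est = summation_index_from_thresholds(lengths, energies)
print(f"k = {k_est:.2f}")
```

A slope near 1 (threshold energy proportional to length) gives k near 0, and a flat line, like the ideal observer's, gives k = 1.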
Figure 4. Efficiency for identifying letters and words as a function of their complexity.
Efficiency is nearly inversely proportional to complexity over a nearly hundred-fold range. The horizontal scale is the perimetric complexity (perimeter squared over ink area) of the letter or word. Each + is efficiency for identifying one of 26 words of a given length (1 to 17) in Courier [24]. Each Courier letter has a complexity of 100 (averaging a-z), and the complexity of a word is proportional to its length. Each △ is efficiency for identifying one letter of one of 14 traditional fonts and alphabets by native or highly trained readers, in order of increasing complexity [25]: Braille, bold Helvetica, bold Bookman, Sloan, Helvetica, Hebrew, Devanagari, Courier, Armenian, Bookman, Arabic, uppercase Bookman, Chinese, Künstler. The outlying □ is efficiency for a letter in an untraditional alphabet: 4×4 random checkerboards, after extended training [25]. The outlying ○ is efficiency for identifying the location of a disk. (See ‘Experiment 5. Identifying disks’ at the end of Materials and Methods.) A disk has the lowest possible perimetric complexity K = 4π = 12.6. A linear regression of log efficiency vs. log complexity for the traditional letters (13 fonts and alphabets) and words (13 lengths), excluding the untraditional alphabet and disk, has a slope of −0.92 and R² = 0.99. The regression line and its equation are shown.
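Perimetric complexity, as defined in this caption, is easy to check for simple shapes. The Python sketch below computes it for a disk, confirming the 4π value, and for a square as a comparison. It also applies the caption's statement that word complexity is proportional to length, with 100 per Courier letter. The function names are illustrative.

```python
import math


def perimetric_complexity(perimeter, ink_area):
    """Perimeter squared over ink area, as defined in the caption."""
    return perimeter ** 2 / ink_area


def disk_complexity(radius):
    return perimetric_complexity(2 * math.pi * radius, math.pi * radius ** 2)


def square_complexity(side):
    return perimetric_complexity(4 * side, side ** 2)


def courier_word_complexity(n_letters, per_letter=100):
    """Word complexity taken as proportional to length (100 per Courier letter)."""
    return per_letter * n_letters


print(f"disk:   {disk_complexity(1.0):.1f}")
print(f"square: {square_complexity(1.0):.1f}")
print(f"5-letter Courier word: {courier_word_complexity(5)}")
```

Because the measure is a ratio of squared length to area, it does not depend on the size of the shape, only on its form.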
Figure 5. Histogram of the values of the summation index k reported in Table 1.
Efficiency for identifying a disk “letter”.
| Observer | Resolution (pix/deg) | Check size (pix) | Check size (deg) | Threshold (log contrast) | Threshold (log | Log efficiency |
| DGP | 42 | 6×6 | 0.14×0.14 | −1.11±0.03 | 0.99±0.05 | −0.38±0.05 |
| MD | 35.5 | 4×4 | 0.11×0.11 | −1.18±0.02 | 1.05±0.06 | −0.45±0.06 |
Thresholds and efficiencies are reported as mean ± se.
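The table's derived columns can be cross-checked with a short Python sketch. It recomputes check size in degrees from the pixel size and display resolution, and converts log efficiency to a linear efficiency. It assumes the logarithms are base 10, which the table does not state; the dictionary layout and variable names are illustrative.

```python
# Values transcribed from the table; error terms are omitted.
ROWS = {
    "DGP": {"pix_per_deg": 42.0, "check_pix": 6, "log_efficiency": -0.38},
    "MD": {"pix_per_deg": 35.5, "check_pix": 4, "log_efficiency": -0.45},
}

check_deg = {}
efficiency = {}
for observer, row in ROWS.items():
    # Check size in degrees = size in pixels / (pixels per degree).
    check_deg[observer] = row["check_pix"] / row["pix_per_deg"]
    # Assumption: "log" means log base 10.
    efficiency[observer] = 10 ** row["log_efficiency"]
    print(f"{observer}: check = {check_deg[observer]:.2f} deg, "
          f"efficiency = {efficiency[observer]:.2f}")
```

Under the base-10 assumption, both observers' efficiencies for the disk come out at roughly 0.4, well above the efficiencies for words shown in Figure 4.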