Thomas Hannagan, James S Magnuson, Jonathan Grainger.
Abstract
How do we map the rapid input of spoken language onto phonological and lexical representations over time? Attempts at psychologically tractable computational models of spoken word recognition tend either to ignore time or to transform the temporal input into a spatial representation. TRACE, a connectionist model with broad and deep coverage of speech perception and spoken word recognition phenomena, takes the latter approach, using exclusively time-specific units at every level of representation. TRACE reduplicates featural, phonemic, and lexical inputs at every time step in a large memory trace, with rich interconnections (excitatory forward and backward connections between levels and inhibitory links within levels). As the length of the memory trace is increased, or as the phoneme and lexical inventory of the model is increased to a realistic size, this reduplication of time-specific (that is, temporal-position-specific) units leads to a dramatic proliferation of units and connections, raising the question of whether a more efficient approach is possible. Our starting point is the observation that models of visual object recognition, including visual word recognition, have grappled with the problem of spatial invariance and arrived at solutions other than a fully reduplicative strategy like that of TRACE. This inspires a new model of spoken word recognition that combines time-specific phoneme representations similar to those in TRACE with higher-level representations based on string kernels: temporally independent (time-invariant) diphone and lexical units. This reduces the number of necessary units and connections by several orders of magnitude relative to TRACE. Critically, we compare the new model to TRACE on a set of key phenomena, demonstrating that the new model inherits much of the behavior of TRACE and that the drastic computational savings do not come at the cost of explanatory power.
Keywords: TRACE model; spoken word recognition; string kernels; symmetry networks; time-invariance
Year: 2013 PMID: 24058349 PMCID: PMC3759031 DOI: 10.3389/fpsyg.2013.00563
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Figure 1. One time-slice of the TRACE model of spoken word recognition.
Figure 2. The detailed structure of the TRACE model of spoken word recognition (adapted from McClelland and Elman).
Figure 3. The TISK model: a time-invariant architecture for spoken word recognition.
Figure 4. A symmetry network for time-invariant nphone recognition that can distinguish anadromes. The units in the center of the diagram (e.g., /a/1) represent time-specific input nodes for phonemes /a/ and /b/ at time steps 1–4. The /ba/ and /ab/ nodes represent time-invariant diphone units.
Figure 5. Response times in TISK.
Figure 6. Comparison between TISK (left panel) and TRACE (right panel) on the average time-course of activation for different competitors of a target word. Cohort: initial phonemes shared with the target. Rhymes (1 mismatch): all phonemes except the first shared with the target. Embeddings: words that embed in the target. The average time course for all words (Mean of all words) is presented as a baseline.
Figure 7. An overview of how recognition cycles correlate with other lexical variables in TRACE (left column) and in TISK (right column). Length: target length. Embedded words: number of words that embed in the target. Onset competitors (Cohorts): number of words that share two initial phonemes with the target. Neighbors (DAS): count of deletion/addition/substitution neighbors of the target. Embeddings: logarithm of the number of words the target embeds in. Rhymes: logarithm of the number of words that overlap with the target with first phoneme removed.
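The idea behind the time-invariant diphone units in the Figure 4 caption can be illustrated with a small sketch. This is not the symmetry network itself, only a set-based analogy: the function name `diphone_set` and the representation of time-specific input as `(phoneme, time_step)` tuples are assumptions made for illustration. The sketch shows that ordered phoneme pairs are unchanged when the input is shifted in time, yet still distinguish anadromes such as /ab/ and /ba/.

```python
from itertools import combinations


def diphone_set(timed_phonemes):
    """Return the set of ordered phoneme pairs (earlier + later).

    `timed_phonemes` is an iterable of (phoneme, time_step) tuples, a
    hypothetical stand-in for time-specific phoneme units. Time steps are
    used only to order the phonemes and are then discarded, so the result
    does not depend on when the sequence starts.
    """
    ordered = sorted(timed_phonemes, key=lambda pt: pt[1])
    # combinations() preserves input order, so the first element of each
    # pair always occurs earlier in time than the second.
    return {first + second for (first, _), (second, _) in combinations(ordered, 2)}


ab_early = [("a", 1), ("b", 2)]  # /ab/ starting at time step 1
ab_late = [("a", 3), ("b", 4)]   # /ab/ starting at time step 3
ba_early = [("b", 1), ("a", 2)]  # the anadrome /ba/

print(diphone_set(ab_early) == diphone_set(ab_late))   # True: time-invariant
print(diphone_set(ab_early) == diphone_set(ba_early))  # False: order-sensitive
```

Discarding absolute time while keeping relative order is what lets a single /ab/ unit serve every temporal position, instead of one copy per time step.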
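Estimates of the number of units and connections required in TRACE and TISK for 212 or 20,000 words, 14 or 40 phonemes, an average of four phonemes per word, and assuming 2 s of input stream.
| | TRACE (212 words, 14 phonemes) | TISK (212 words, 14 phonemes) | TRACE (212 words, 40 phonemes) | TISK (212 words, 40 phonemes) | TRACE (20,000 words, 40 phonemes) | TISK (20,000 words, 40 phonemes) |
|---|---|---|---|---|---|---|
| Units | 15,067 | 3,222 | 16,800 | 9,852 | 1,336,000 | 29,640 |
| Connections | 45,049,733 | 3,737,313 | 45,401,600 | 31,718,357 | >4E+11 | 348,783,175 |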
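As a rough check on the unit counts above, the following sketch gives one decomposition that reproduces the three smaller unit counts in the table (3222, 9852, and 29,640), which correspond to TISK. The breakdown itself (time-specific phonemes, plus diphones, plus single phonemes, plus words) and the figure of 200 time slots for 2 s of input are assumptions inferred from the numbers, not stated in this record; the function name `tisk_unit_estimate` is hypothetical.

```python
def tisk_unit_estimate(n_phonemes, n_words, n_slots=200):
    """Estimate the number of units in a TISK-like architecture.

    Assumed breakdown (inferred, not taken from the source):
      - one time-specific phoneme unit per phoneme per time slot
      - one time-invariant diphone unit per ordered phoneme pair
      - one time-invariant single-phoneme unit per phoneme
      - one time-invariant unit per word
    n_slots=200 is an assumption for a 2 s input stream.
    """
    time_specific_phonemes = n_phonemes * n_slots
    diphones = n_phonemes ** 2
    single_phonemes = n_phonemes
    return time_specific_phonemes + diphones + single_phonemes + n_words


print(tisk_unit_estimate(14, 212))     # 3222
print(tisk_unit_estimate(40, 212))     # 9852
print(tisk_unit_estimate(40, 20_000))  # 29640
```

Under this reading, only the word term grows with the lexicon, so moving from 212 to 20,000 words adds 19,788 units, whereas the TRACE unit count in the table rises from 16,800 to 1,336,000.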
Figure 8. Number of connections.
| Parameter | Value | Description |
|---|---|---|
| Times | 10 | Number of time-specific slots (for input and time-specific phonemes) |
| Istep | 10 | Pace of input stream (a new input is introduced every “istep” cycles) |
| Deadline | 100 | Deadline |
| DecayP | 0.01 | Decay rate for time-specific phonemes |
| DecayNP | 0.01 | Decay rate for time-invariant nphones |
| DecayW | 0.05 | Decay rate for time-invariant words |
| Gap | max | Authorized gap between phonemes in time-invariant nphones (e.g., if gap = 1, “/bark/” = “/ba/”, “/ar/”, “/rk/”; if gap = 2, “/bark/” = “/ba/”, “/br/”, “/ar/”, “/ak/”, “/rk/”) |
| PtoNPexc | 0.1 | Time-specific phoneme to time-invariant nphone excitation |
| PtoNPthr | 6 | Time-invariant nphone activation threshold |
| NPtoNPinh | 0 | Lateral inhibition between nphones |
| NPtoWexc | 0.05 | Excitation from time-invariant nphone (“/ba/”) to words (“/bark/”) |
| NPtoWscale | Wordlength | Scaling factor for NPtoW connections (here, set to word length) |
| WtoNPexc | 0 | Excitation from words (“/bark/”) to time-invariant nphone (“/ba/”) |
| 1PtoWexc | 0.01 | Excitation from 1-phone (“/a/”) to words (“/bark/”) |
| Wto1PExc | 0 | Excitation from words (“/bark/”) to 1-phone (“/a/”) |
| WtoWinh | −0.005 | Lateral inhibition between words |
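The effect of the Gap parameter can be sketched in a few lines. This is a minimal illustration, not the model's implementation: the function name `nphones` is hypothetical, and treating each character of the string as one phoneme is a simplifying assumption. A pair is kept when the positions of its two phonemes differ by at most `gap`.

```python
def nphones(word, gap):
    """List the ordered phoneme pairs of `word` allowed by `gap`.

    Assumes one character per phoneme (a simplification). A pair
    (word[i], word[j]) with i < j is included when j - i <= gap.
    """
    pairs = []
    last = len(word) - 1
    for i in range(len(word)):
        # j runs over later positions, up to `gap` steps ahead of i
        for j in range(i + 1, min(i + gap, last) + 1):
            pairs.append(word[i] + word[j])
    return pairs


print(nphones("bark", 1))  # ['ba', 'ar', 'rk']
print(nphones("bark", 2))  # ['ba', 'br', 'ar', 'ak', 'rk']
print(nphones("bark", 3))  # ['ba', 'br', 'bk', 'ar', 'ak', 'rk']
```

With a gap of at least the word length minus one, every ordered pair is produced, which is presumably what the table's value of “max” denotes.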