Literature DB >> 33636119

Audiovisual integration in macaque face patch neurons.

Amit P Khandhadia¹, Aidan P Murphy², Lizabeth M Romanski³, Jennifer K Bizley⁴, David A Leopold⁵.

Abstract

Primate social communication depends on the perceptual integration of visual and auditory cues, reflected in the multimodal mixing of sensory signals in certain cortical areas. The macaque cortical face patch network, identified through visual, face-selective responses measured with fMRI, is assumed to contribute to visual social interactions. However, whether face patch neurons are also influenced by acoustic information, such as the auditory component of a natural vocalization, remains unknown. Here, we recorded single-unit activity in the anterior fundus (AF) face patch, in the superior temporal sulcus, and anterior medial (AM) face patch, on the undersurface of the temporal lobe, in macaques presented with audiovisual, visual-only, and auditory-only renditions of natural movies of macaques vocalizing. The results revealed that 76% of neurons in face patch AF were significantly influenced by the auditory component of the movie, most often through enhancement of visual responses but sometimes in response to the auditory stimulus alone. By contrast, few neurons in face patch AM exhibited significant auditory responses or modulation. Control experiments in AF used an animated macaque avatar to demonstrate, first, that the structural elements of the face were often essential for audiovisual modulation and, second, that the temporal modulation of the acoustic stimulus was more important than its frequency spectrum. Together, these results identify a striking contrast between two face patches and specifically identify AF as playing a potential role in the integration of audiovisual cues during natural modes of social communication. Published by Elsevier Inc.

Entities: Chemical

Keywords: audition; electrophysiology; face patches; multisensory integration; primate; vision

Mesh：

Year: 2021 PMID： 33636119 PMCID： PMC8521527 DOI： 10.1016/j.cub.2021.01.102

Source DB: PubMed Journal: Curr Biol ISSN： 0960-9822 Impact factor: 10.834

Introduction

In humans and other primate species, audiovisual integration plays an important role in social communication, for example, during the perception of a conspecific’s vocalization and concomitant facial behavior., The temporal cortex, and particularly the superior temporal sulcus (STS), contain zones of convergence for high-level sensory signals.3, 4, 5, 6 In the macaque, the STS fundus borders high-level visual and auditory cortex7, 8, 9, 10 and exchanges connections with other multisensory areas, including ventrolateral prefrontal cortex (VLPFC) and intraparietal cortex.,, At the single-cell level, neurons within portions of the STS respond to visual and auditory stimuli, as well as their combination.,13, 14, 15, 16 Functional MRI (fMRI) investigation of the macaque temporal cortex has also revealed a number of operationally defined regions named according to their visual category selectivity, such as face and body patches.17, 18, 19, 20, 21 In macaques, face patches are replete with cells that respond more strongly to faces than to other categories of images22, 23, 24 and form an interconnected network., A subset of these patches lies along the STS and is coextensive with known multisensory regions in the fundus., However, despite intensive study of neurons within the visually defined face patches, it is presently unknown whether or not they participate in multisensory integration. Here, we investigated audiovisual single-unit responses in two fMRI-defined face patches. The anterior fundus (AF) and anterior medial (AM) patches were selected as key candidate regions for investigation, as they are both thought to occupy high-level positions in the face-processing hierarchy but are situated in distinct portions of the temporal cortex., Area AF is located in the STS fundus, within regions known to contain multisensory neurons although AM is located on the undersurface of the temporal lobe surface adjacent to, and interconnected with, the perirhinal and parahippocampal cortices, which also receive multisensory information., After identifying these patches based on their selective visual fMRI responses to faces, we recorded the activity of individual neurons within each patch to brief movie clips of macaque vocalizations, including the full audiovisual stimulus as well as the visual and auditory components alone. The results demonstrate that auditory information prominently influences the responses of AF neurons but has little effect on the responses of AM neurons. We then further evaluated the audiovisual modulation in AF with control experiments. These control experiments demonstrate that auditory modulation is specific to faces and depends on the temporal, rather than spectral, structure of the acoustic stimulus. We discuss the findings in relation to the layout of the macaque face patch network and its intersection with known audiovisual cortical areas.

Results

We conducted extracellular recordings in fMRI-defined face patches in four adult macaque monkeys. Based on an initial fMRI mapping of face patches (see STAR methods), we targeted a single, chronic 64-channel microwire electrode bundle into the centers of the AF or AM face patches (Figures 1A and 1B). Each macaque received a single implant into a recorded face patch. We have previously demonstrated that this recording method supports longitudinal, stable recordings from the same cells over multiple sessions., We recorded from 295 neurons in face patches of four monkey subjects: 240 from AF (125 from monkey SP, 115 from monkey SR) and 55 neurons from AM (49 from monkey W, 6 from monkey M). In addition to the main experimental conditions featured in the study, subjects viewed a short “fingerprinting” stimulus set of static images each day, which included human and monkey faces, objects, and scenes. This daily dataset allowed us both to determine each neuron’s face selectivity index (see STAR methods) and to verify its identity across successive sessions.

Figure 1

Localization of recording sites in the STS and IT cortex

(A and B) Functional overlays of AF and AM from monkey SP and monkey W, respectively, of an fMRI contrast of faces versus objects. The tract of the electrode is indicated with the red arrow targeted to the desired areas of recording.

(D) The same vocalization as performed by the avatar, both including the 500-ms still frame indicated by the frames labeled 1.

(E) Presentation timeline of stimulus indicating the onset of the still frame indicated in red and the onset of the movie in green as well as the auditory stimulus, including a spectrogram.

Localization of recording sites in the STS and IT cortex (A and B) Functional overlays of AF and AM from monkey SP and monkey W, respectively, of an fMRI contrast of faces versus objects. The tract of the electrode is indicated with the red arrow targeted to the desired areas of recording. (C) Pant-threat vocalization from an unfamiliar macaque. (D) The same vocalization as performed by the avatar, both including the 500-ms still frame indicated by the frames labeled 1. (E) Presentation timeline of stimulus indicating the onset of the still frame indicated in red and the onset of the movie in green as well as the auditory stimulus, including a spectrogram. Consistent with previous studies, the majority of neurons in both AF and AM were face selective. Specifically, 84.1% of all neurons (198/240 of AF neurons and 50/55 of AM neurons) responded to flashed faces with a face selectivity index (FSI) absolute value of greater than 0.333, a criterion that has previously been used to categorize neurons as face selective,, and both face patches show a distribution of FSI greater than zero (t(118) = 11.375, p = 1 × 10−20 for AF; t(54) = 11.702, p = 2 × 10−16 for AM). During the main electrophysiological experiments, the animals were required to maintain their gaze anywhere within the visual stimulus or, in the case of auditory-only presentation, upon a small fixation marker. The dynamic component of the video was always preceded by a 500-ms static image of the face, corresponding to the first frame of the movie video. This presentation was incorporated to diminish the contribution of abrupt visual transients during the period of audiovisual integration under study (Figures 1C–1E). Subjects experienced 20–40 repetitions of each stimulus, receiving a juice reward after completion of each presentation.

Experiment 1: multisensory responses of AF and AM face patch neurons

The goal of the first experiment was to determine whether the addition of the auditory component of the vocalization influences the responses of neurons in the two face patches. Subjects were presented with fifteen dynamic natural movie clips of three unfamiliar monkeys issuing five different call varieties of differing emotional valence. The call types included eight affiliative coos, two agonistic tonal screams, two aggressive pant-threats, two barks, and one bark-growl (Figure 1C).31, 32, 33, 34 The acoustic structure ranged broadly, with coos and agonistic calls having more tonal elements and barks, pant-threats, and bark-growls having a broadband and atonal structure. Trial sequences consisted of randomly interleaved presentations of each original audiovisual movie, the visual component only (i.e., silent movie), and auditory component only.

AF face patch neurons

The majority of neurons in AF exhibited a significant auditory modulation of their visual responses in response to one or more of the vocalization stimuli. In addition, some AF neurons responded to the auditory component alone. The influence of acoustic information on AF responses took multiple different forms, which we qualitatively separated based on the characteristics of their response. The most commonly observed pattern was multisensory enhancement of the visual response (Figures 2A, 2B, and 2D). For neurons in this category, the auditory stimulus alone did not elicit a significant response but did elevate the neurons’ response to the visual movie. The prominence of this pattern across the population was evident in the auditory enhancement observed in the grand average activity across all AF cells and all stimuli (Figure S1A). A smaller number of neurons exhibited multisensory suppression (Figure 2E), where the auditory stimulus diminished the neurons’ visual response. Finally, a relatively small subset of neurons did respond to one or more auditory stimuli alone (Figures 2F and 2G). These neurons were generally bimodal, meaning that they responded to both the auditory stimuli and the visual stimuli. For such neurons, the magnitude of their response to an audiovisual stimulus typically matched that to the visual stimulus alone, though a few matched that of the auditory stimulus alone.

Figure 2

Example responses from AF and AM

The dark gray panel indicates the static frame although the light gray indicates the audiovisual movie stimulus (magenta), silent movie (blue), or vocalization (red).

(A and B) Typical enhancement of AF neuron’s response for two different stimuli (two-way ANOVA; A, p Vis < 0.0001, p Aud = 0.3401, p Int < 0.0001; B, p Vis < 0.0001, p Aud = 0.8686, p Int < 0.0001). The horizontal black line within the rasters delineates the different recording sessions for the presented neurons.

(D–G) Additional example AF neuron responses.

(D) Another typical AF non-linear multisensory enhanced response (p Vis < 0.0001; p Aud = 0.7110; p Int < 0.001).

(E–G) Different profiles of audiovisual integration also expressed by neurons in AF.

(E) A cell with a non-linear suppression of spiking in response to the audiovisual condition compared to the visual-only condition (p Vis < 0.0001; p Aud = 0.0086; p Int = 0.001).

(F and G) Bimodal responses, where the response to the audiovisual movie mirrored the response to a unimodal condition (visual in F, p Vis = 0.0396, p Aud < 0.0001, p Int = 0.05, and auditory in G, p Vis = 0.7750, p Aud < 0.0001, p Int = 0.2888) along with a response to the other unimodal stimulus. Response types were determined by two-way ANOVA considering the presence or absence of the audio and visual stimulus components and their interaction. The index of modulation is shown in the corner of each spike density function.

AM face patch neurons

We performed the same analyses for neurons recorded from the AM face patch, which is known to receive direct input from AF as well as from other multisensory regions. In stark contrast to AF, few AM neurons were affected by the auditory component of the vocalization movies, and the grand average across all AM cells and stimuli showed little if any audiovisual modulation (Figures 2C and S1B). This contrast was most clearly reflected in the ANOVA analysis across all stimuli of AM neurons, revealing no significant auditory modulation of any neuron (Figure 3D). The audiovisual index across the population had a median of −0.02, which was not significantly different from zero (t(54) = −0.6271; p = 0.5332; Figures 3E and 3F). For individual movies, only a few AM neurons (n = 14) showed significant auditory modulation. Of these neurons, 11/14 showed such modulation to a single stimulus, with the remaining 3 cells showing significant modulation to two stimuli (Figure 3G). These results indicate auditory modulation in area AM is rare and, when present, highly selective for particular stimuli or very weak. The difference in auditory contribution to the AF and AM face patches was underscored by the results of a linear mixed-effects model that included the face patch and modality as its variables (Table 1).

Table 1

Display of the results of linear mixed-effects model, evaluating the effect of each factor on the average spike rate

Factor	Beta coefficient (a.u.)	T-stat	Degrees of freedom	p value
Presence of auditory stim.	0.0186	1.9288	518	p = 0.0543
Presence of visual stim.	0.1321	14.4616	518	p < 0.0001
Interaction of face patch and auditory stim.	0.0339	3.87636	518	p = 0.0001
Face patch	−0.0674	−3.8036	518	p = 0.0002

The model shows a significant effect for the interaction of face patch and auditory stimuli, with the positive beta indicating that AF neurons respond more strongly than AM neurons to the addition of auditory stimulus. See also Figure S3.

Display of the results of linear mixed-effects model, evaluating the effect of each factor on the average spike rate The model shows a significant effect for the interaction of face patch and auditory stimuli, with the positive beta indicating that AF neurons respond more strongly than AM neurons to the addition of auditory stimulus. See also Figure S3. In summary, the results indicate that two high-level anterior face patches, AF and AM, differ sharply in their modulation by the auditory component of macaque vocalizations. The auditory influence in AF was conspicuous, widespread, and often extended to multiple stimuli, whereas that in AM was virtually nonexistent in our recordings. We next focused on the observed audiovisual modulation in face patch AF and, in particular, the requisite auditory and visual components of our stimuli.

Experiment 2: investigation of multisensory responses using macaque avatar

To examine audiovisual processing in the AF face patch further, we used a realistic macaque avatar stimulus, whose facial movements were programmed to mimic real facial actions during the specific vocalizations. The macaque avatar allowed for the investigation of particular aspects of audiovisual integration while maintaining the same face identity, head angle, and other visual stimulus properties. For experiment 2, the avatar was animated to match five different calls (coo, agonistic, pant-threat, bark, and bark-growl) based on the original macaque movie clips (see STAR methods). Now, with this more-controlled visual component of the stimulus, we investigated two questions related to the specific features important to the observed audiovisual responses of AF neurons.

Critical role of the face

We first asked whether the observed auditory modulation would differ if the visual stimulus were a face, now in avatar form, versus a surrogate non-face stimulus. Specifically, we compared responses elicited by a vocalizing avatar (Figure 4A) to those found when the face was replaced by expanding and contracting dynamic disk, whose movements were matched to the changing mouth size and synchronized with the auditory track (Figure 4B).

Figure 4

Responses to visual control stimuli

(A and B) Single-cell example of responses to the different versions of the agonistic call stimulus. (A) portrays the average responses to the avatar producing an agonistic call although (B) shows the response of the same cell to the expanding disk stimulus matched to the same vocalization.

(C and D) Population response of AF to audiovisual avatar stimuli comparing (C) the selectivity of cell responses to the audiovisual avatar stimuli to (D) the selectivity of cell responses to the audiovisual expanding disk control stimuli.

Responses to visual control stimuli (A and B) Single-cell example of responses to the different versions of the agonistic call stimulus. (A) portrays the average responses to the avatar producing an agonistic call although (B) shows the response of the same cell to the expanding disk stimulus matched to the same vocalization. (C and D) Population response of AF to audiovisual avatar stimuli comparing (C) the selectivity of cell responses to the audiovisual avatar stimuli to (D) the selectivity of cell responses to the audiovisual expanding disk control stimuli. Neural responses to the avatar, including the modulation by the corresponding auditory stimulus, were broadly similar to the original movies, albeit with a smaller fraction of neurons demonstrating multisensory responses (Figure 4C). We recorded 121 AF neurons in this experiment, an independent population from those recorded in experiment 1, and again used a two-way ANOVA to establish significant responses to each sensory modality. Of these neurons, 99/121 (81.8%) responded to at least one of the visual or auditory stimuli, although the remaining 22 were unresponsive to the experimental stimuli and excluded from further analysis. 26/99 (26.3%) neurons exhibited a significant response to the auditory component or significant auditory modulation to at least one of the five vocalization-movie call types (Figure 4C). The reduced proportion compared to the original faces likely reflects the imposition of a single avatar facial identity, as well as the lower overall number of stimuli. Importantly, very few neurons showed significant auditory modulation when the dynamic face was replaced with the dynamic disk (Figure 4B). Of the 99 neurons that responded to the avatar movie stimuli, only 5/99 (5.1%) neurons showed any linear or non-linear multisensory interaction with the disk movie control (Figure 4D). These responses suggest that the observed auditory modulation does not reflect a general temporal synchronization with visual movement but instead depends upon viewing facial structure.

Critical acoustic parameters

We next used the same avatar stimulus to investigate the relative importance of spectral versus temporal acoustic parameters in the modulation of visual responses. To this end, we repeated the experiment by pairing the avatar stimuli with temporally patterned broadband noise (BBN) by applying the temporal envelopes of the original calls to carrier noise (1–20,000 Hz frequency range), such that the temporal structure was preserved, but the spectral content was disrupted. Most AF neurons responded similarly to both normal audiovisual avatar stimuli and the matching audiovisual noise avatar stimuli (example shown in Figures 5A and 5B). Cells still responded to or were modulated by matched auditory noise despite the lack of detailed spectral information and did not differ significantly from the response to the normal vocalization (Figure 5C). A similar percentage (25/99; 25.3%) of neurons showed linear or non-linear modulation to the matched noise, and 5.1% responded to the noise stimulus alone. To directly compare these different auditory conditions, we conducted an ANOVA with the natural vocalization and matched noise as a factor and performed a post hoc pairwise comparison with a Tukey-Kramer test. Only 7/121 (5.8%) of neurons exhibited a significant difference between the matching noise and the natural vocalization.

Figure 5

Responses to acoustic control stimuli

(A and B) A single-cell example of responses to the different versions of the bark stimulus with (A) the average response of a single cell to the avatar bark stimulus and (B) the response to the avatar when a temporally modulated broadband noise (BBN) stimulus replaced the bark vocalization.

(C) The distribution of the index of modulation for all calls across the population for both the avatar audiovisual stimuli and the avatar BBN control stimuli.

(D and E) A single-cell example of the response to different versions of the agonistic call with (D) the cell response to the avatar agonistic stimulus and (E) the response to the avatar agonistic BBN stimulus.

(F) The distribution of index of modulation to the tonal coos and agonistic calls.

Responses to acoustic control stimuli (A and B) A single-cell example of responses to the different versions of the bark stimulus with (A) the average response of a single cell to the avatar bark stimulus and (B) the response to the avatar when a temporally modulated broadband noise (BBN) stimulus replaced the bark vocalization. (C) The distribution of the index of modulation for all calls across the population for both the avatar audiovisual stimuli and the avatar BBN control stimuli. (D and E) A single-cell example of the response to different versions of the agonistic call with (D) the cell response to the avatar agonistic stimulus and (E) the response to the avatar agonistic BBN stimulus. (F) The distribution of index of modulation to the tonal coos and agonistic calls. To ensure this similarity was not driven solely by sensitivity to broadband vocalizations, we compared responses to the tonal coo and agonistic calls with their corresponding broadband controls. Even in cases of harmonic calls, the neurons continued to show similar responses between the audiovisual conditions and the matched BBN conditions (example in Figures 5D and 5E), at similar levels across the population (Figure 5F). Similar proportions of neurons exhibited multisensory modulation to both the harmonic calls (29.4%) and their matched noise controls (35.4%). These results suggest that, despite sensitivity to fine visual features, multisensory modulation in AF face patch is principally determined not by the fine spectral details of a vocalization but by its temporal structure.

Discussion

Audiovisual modulation in face patches

Our results indicate that most AF face patch neurons are affected by concomitant auditory stimulation during the viewing of macaque vocalizations. Although previous research indicates that the anterior STS is a multisensory region,,, that contains cells with selective audiovisual responses to faces,36, 37, 38 our data demonstrate, for the first time, this pattern is observed within a visually defined face patch. Though the predominant responses of the recorded AF cells were visual, most showed some level of auditory modulation to movies within our limited stimulus set, and some cells responded to one or more auditory vocalizations in the absence of any visual stimulus. Previous explorations of the STS organization in monkeys and humans have indicated a patchy spatial organization across primate species, with unisensory regions for each modality and audiovisual regions clustering together., In this context, the AF face patch might have been a good candidate for a visual-only region. However, our results instead suggest that this face patch may participate in audiovisual integration. The diverse expression of multisensory responses was striking. For example, some neurons responded to both auditory and visual stimuli alone but, when presented with the combined audiovisual stimulus, responded as if only one or the other unimodal stimulus had been presented. Other cells responded to static faces and then responded only to the vocalization presented, but not the moving face. Some of these results might be due to a high selectivity for individual identities, expressions, or other parameters of the movie, suggesting that estimates of multisensory responses would be greater with larger testing sets. The near absence of auditory modulation observed among AM neurons suggests that audiovisual modulation is expressed differentially among face patches. The contrast between AF and AM is particularly striking given that AM receives direct anatomical projections from AF and responds when AF receives electrical microstimulation., It bears mention, however, that the focal nature of our electrophysiological sampling means that our sampling of AM was limited and that we therefore cannot rule out a stronger multisensory component in other portions of AM that were missed in the two monkeys tested. It is also possible that AM neurons may be responsive to other sensory stimuli through the inputs they receive from neighboring perirhinal and parahippocampal areas, which are known to carry somatosensory information. In contrast to AM, the connections of the AF patch have not been directly assessed with retrograde tracers, so its specific connections are unknown. In general, the STS fundus receives input from multisensory areas, such as intraparietal and prefrontal regions, as well as unisensory association areas, including high-level auditory belt and parabelt cortex as well as visual inferior temporal TE and TEO cortex, both directly and indirectly through lateral regions of the STS.,,,39, 40, 41, 42, 43 Our results, combined with previous anatomical and electrophysiological finding, thus suggest that the AF face patch participates in multisensory integration that is typical for neighboring areas of the STS fundus. Whether neurons in other face patches integrate auditory information in a manner similar to AF remains to be seen. The specific pattern of interconnections among face patches, and their arrangement into one or more hierarchies, is presently a matter of inquiry. Auditory sensitivity adds a new property to the response selectivity of AF neurons, whose response profiles and covariation with other brain areas is already quite varied. Based on the known layout of the temporal cortex and its relationship to audiovisual responses, one might guess that other face patches lying in the fundus (middle fundus [MF]) or upper bank (middle dorsal [MD]] and anterior dorsal [AD]) of the STS might be good candidates for audiovisual integration. Notably, these face patches, like AF, generally exhibit a sensitivity to facial motion, which may be central to the synchronization of visual and auditory information during a vocalization. By contrast, area AM on the ventral surface of the temporal lobe is more commonly associated with processing of individual facial identities. It, like the recently described perirhinal (PR) and temporal pole (TP) areas involved in face familiarity, may be less governed by dynamic facial behaviors and more by facial features. Frontal lobe face patches, prefrontal orbital (PO), prefrontal arcuate (PA), and prefrontal lateral (PL), are known to respond to expressive faces similar to fundus patches, and are coextensive with prefrontal cortical areas that have been shown to be responsive to vocal stimuli and to their combination with facial gestures, including many of the same stimuli used in the present study., These patches may also integrate audiovisual signals in a way that is yet to be elucidated. Further study of these functionally defined regions is needed, including investigation of their specific anatomical interconnections, their participation in multisensory integration, and their roles in reciprocal social communication.

Audiovisual selectivity

Given AF neurons’ selectivity for particular movies, the macaque avatar allowed for a controlled examination of key variables. The virtual absence of auditory modulation for the temporally synchronized dynamic disk stimulus is consistent with the assumed specialization for faces within the face patch network. These responses indicate that AF neurons specifically combine auditory stimuli with facial information, rather than any temporally synchronized visual object. In previous studies, the temporal congruence between a visual stimulus and its auditory pair has been an important feature in multisensory integration in the STS although call type or spectral detail has shown little effect., Indeed, we found that the temporal structure alone, even when applied to broadband noise, was sufficient to elicit auditory modulation of visual responses to a face. Thus, the relative unimportance of auditory spectral content compared to visual input may thus be a characteristic of multisensory integration in the fundus of the STS. Interestingly, nearly the converse was observed in an fMRI-defined voice-specific area, a high-level auditory area on the supratemporal plane of the macaque temporal lobe. In that area, audiovisual neurons expressed selectivity for acoustic vocalizations although visual modulation of the acoustic response exhibited little selectivity to the visual stimulus.,, Together, these results suggest an overall principle of multisensory integration in high-level sensory areas, wherein highly stimulus-selective responses for the primary modality can be modulated by a relatively broad range of temporally synchronized stimuli presented in the secondary modality.

Broader implications

The face-dependent multisensory responses of AF neurons to vocalizations indicate that this area may participate in a larger cortical network in the service of social communication. For example, regions of the macaque auditory cortex also exhibit multisensory modulation tied to visual presentation of faces and face information, with the visual response component likely arising from well-known reciprocal connections with the STS.,,,53, 54, 55 Our results suggest that, within the STS, the AF face patch may be an important region involved in this processing. Projections between these areas may reflect a conserved multisensory pathway. Anatomical and physiological interaction between face-voice areas in humans are thought to mediate vocal communication., The STS also feeds forward to and receives feedback projections from the VLPFC, which also contains high proportions of face-selective audiovisual neurons,, and is itself thought to draw upon visual, auditory, and multisensory areas., Further studies are required to establish the specific anatomical and functional connections of AF with other multimodal, affective, and voice-selective regions and, more importantly, what role it plays within the larger range of multisensory areas. At present, the results draw attention to a well-known face-selective area, whose integration of vocal auditory signals into its visual analysis makes it a likely contributor to primates’ advanced skills in the domain of multisensory social perception.

STAR★Methods

Key resources table

Resource availability

The raw data supporting this study is not available in a public repository because of complex custom data formats and the size of the files but are available from the lead contact upon request.

Lead contact

Further information and requests for reagent should be directed to lead contact David A. Leopold (leopoldd@mail.nih.gov)

Materials availability

This study did not generate any new materials or reagents.

Data and code availability

The raw data supporting this study is not available in a public repository because of complex custom data formats and the size of the files but are available from the lead contact upon request.

Experimental model and subject details

Subjects

Four rhesus macaque monkeys designated SP (Monkey 1, 9 kg), SR (Monkey 2, 10 kg), W (Monkey 3, 11kg), and M (Monkey 4, 9 kg), were implanted with a single chronic microwire bundles fixed within a custom MRI-compatible chambers and microdrive assembly. The electrode bundles were advanced post-surgically to achieve proper depth of recording. The electrodes in Monkey SP were located in the right hemisphere, whereas those in monkeys SR, M, W were in the left hemisphere. All procedures were approved by the Animal Care and Use Committee of the National Institute of Mental Health.

Method details

fMRI

Functional and anatomical magnetic resonance imaging (MRI) was conducted in the Neurophysiology Imaging Facility Core (NEI, NIMH, NINDS) using a vertical 4.7T Bruker Biospin scanner. For all subjects, hemodynamic responses were enhanced by injection with monocrystalline iron-oxide nanoparticles (MION). Details of scanning and stimulus presentation are described further in. Briefly, Monkey SP underwent a standard block design localizer consisting of 24 s blocks of images of static macaque faces contrasted with blocks of images of non-face objects. In monkeys SR, M, and W, the blocks consisted of short movie clips of macaques making facial expressions contrasted with short movie clips of moving scenes and moving objects. Subjects received a juice reward for maintaining fixation every 2 s. All fMRI data was analyzed with AFNI and custom software developed in MATLAB (Mathworks, Natwick, MA).

Experiment design

All subjects performed a viewing task. The subject initiated the trial by fixating on a 0.7° crosshair within a window of 2° visual angle for between 200-300ms. A stimulus was then presented in either an audiovisual, visual only, or audio only format. For the audio only condition, the fixation marker remained on the screen and the subject was required to maintain fixation. For the audiovisual and visual conditions, a visual stimulus appeared in a square 10° visual angle window, and the subject was allowed to freely view any part of the stimulus. An infrared camera (Eyelink II, SR Research) monitored the subject’s gaze as it performed this task and trials were aborted if the subject looked outside the window for longer than 100ms. All visual stimuli were presented on an OLED 4k Monitor 95cm from the subject using a graphical user interface (GUI) derived from PLDAPS (further described here) in MATLAB. All auditory stimuli were presented in mono from two Tannoy Reveal Speakers placed on the edges of the monitor to create the percept the sound originated from the center of the screen. All auditory stimuli were projected at 65-80 dB SPL, verified using a Brüel and Kjaer (Denmark) sound level meter and at the full frequency range available.

Stimuli

In Experiment 1, stimuli consisted of short movies of monkeys vocalizing. The short movie clips featured macaque calls of varied acoustic structure and with a range of referential meaning and valence including affiliative coos, aggressive pant-threats, barks, and bark-growls, and agonistic/submissive screams31, 32, 33, 34. Of these calls, the agonistic and coo calls used here had tonal/harmonic elements, while the remaining calls were broadband.These movies featured three individual monkeys at a variety of head positions (Figure 1C). For Experiment 2, we selected five vocalization movie exemplars that represented each of the different call types described above. We then matched the mouth movements of the computer-generated animated macaque avatar to the vocalization onset and envelope in each call to create new audiovisual movies (Figure 1D). The avatar enabled us to hold the basic visual appearance of a macaque face constant, including its identity and 3D head orientation, while its facial actions and mouth movements were animated and synchronized with the true macaque vocalizations. For this, we independently controlled features such as size of the mouth opening and amount of lip motion from the original movies using a GUI developed in MATLAB and animated the macaque avatar to follow the same patterns of motion. Movies of these avatar-vocalizations were rendered and compiled using the software Blender (the Blender Foundation). Frames were added to the avatar clips to extrapolate starting from or returning to a neutral facial expression before or after the vocalization. In addition to the avatar stimuli, the monkey subject was presented with two categories of other control stimuli to determine the selectivity of audiovisual responses. Both sets of controls maintained the original temporal dynamics of the call structures. The first set controlled for visual motion. We replaced the macaque movie video with a dynamic disc whose instantaneous size was matched to the amplitude of the movements of the monkey’s mouth in the original movie,. This control was designed to determine whether auditory modulation is face-specific or would be found with any temporally synchronized visual stimulus. In the second set of controls the audio track was manipulated by replacing the original spectral content with broadband noise convolved with the envelope of the original vocalization. This control evaluated the contribution of the spectral information on the acoustic modulation of neural responses. To remove the contribution of known transient responses following the initial onset of a visual stimulus, all stimuli were introduced with a 500 ms static image prior to the onset of the movie movement and acoustic vocalization. Thus, the onset of the vocalization movie began after the face had already been on the display for 500 ms. This paradigm enabled us to deconvolve visual transient effects from the response to motion and addition of audio as well as approach a more naturalistic paradigm.

Electrophysiology recording

Following fMRI localization of the relevant face patches, subjects were implanted with 64 channel NiCr microwire bundle arrays fabricated by Microprobes for extracellular recording. Monkeys SR and SP received implants in face patch AF (an overlay of functional activation in Monkey SP is shown in Figure 1A). Monkeys M and W were implanted in face patch AM (functional overlay of Monkey W is shown in Figure 1B). All recordings were conducted in a radio shielded room (ETS-Lingreen) with a RZ2 BioAmp processor (Tucker-Davis Technologies) with a 128-channel capacity collecting a broadband signal of 0.5Hz-20KHz.

Quantification and statistical analysis

Longitudinal recording

One feature of the microwire arrays is the capacity for long-term longitudinal recordings of individual neurons, verified through similarity in waveform and selectivity fingerprints across days. All spike sorting was performed offline. Spikes were sorted using the wave_clus spike sorting package and utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov). To ensure cells were consistent across days, monkeys viewed a “fingerprinting” stimulus set consisting of 60 images containing sets of face categories (monkey faces and human faces) and non-face categories (objects and scenes). Face cells maintain selective responses across days; therefore, by evaluating selectivity to a consistent stimulus set we can combine cell responses across days and months for the same cell. Responses were matched principally based on the pattern of selectivity to the “fingerprinting” stimulus set as well as the spike waveform and basic distribution of interspike intervals. Responses to these stimuli were also used to compute a face selectivity index (FSI) to quantify the strength of selectivity (Equation 1): FSI = (response_face – response_nonface)/ (response_face + response_nonface)

Data analysis

Following spike sorting and concatenation of individual cell responses across days, data were analyzed using custom software also created in MATLAB. We isolated a response window between 500ms (the end of the still frame and the beginning of motion) and 100ms after the conclusion of the movie stimulus and a baseline window between −300 and 0ms before the onset of the still frame. These windows were used for all further stimulus analysis. We conducted a two-way analysis of variance test (ANOVA) comparing response for the presence and absence of each component stimulus against the baseline response (further described in) Neurons were classed as visual only or auditory only if they showed a significant response to either of the component stimuli alone, linear multisensory if they showed a significant main effect of both modalities, and non-linear multisensory if they had a significant interaction term. We also combined all stimuli into a single combined ANOVA again with each modality as a factor to assess the effect of each modality overall. Additionally, for the spectral controls, we conducted a pairwise comparison using a Tukey-Kramer test between the spectral controls, natural vocalization, and the silent condition to directly compare the effect of different auditory components. For both Experiments 1 and 2, the main analysis compared the response to the audiovisual stimulus to the corresponding response to the visual stimuli alone. To this end, we calculated an index of modulation (Equation 2): Index of modulation = AV-V/AV+V. All rates were computed following baseline subtraction. Here, AV is the mean response to an audiovisual stimulus for a particular cell, whereas V is the mean baseline subtracted response to the visual only counterpart of that stimulus. This index enabled us to quantify the magnitude of modulation induced by auditory stimuli with an index between 1 and −1 with a positive index indicating that the addition of acoustic stimuli enhanced the response and a negative response indicating that acoustic stimuli suppressed the response. Finally, we created a linear mixed-effect model to determine the effects of face patch independent of individual cell responses. Taking the average across all stimuli for each modality, we examined the effect of face patch, presence or absence of visual stimulus, and presence or absence of auditory stimulus on the average spiking rate. Each of these factors served a fixed-effect variable whereas the different cells were classed as a random effect. Through this model, we could isolate the exact effect of face patch and its interaction with stimulus type independent of variance between cells.

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Experimental models: organisms/strains

Rhesus Macaque (Macacca Mulatta)	NIMH/NIH	N/A

Software and algorithms

MATLAB	MathWorks	RRID: SCR_001622
AFNI	AFNI	RRID: SCR_005927
Psychophysics Toolbox	Psychtoolbox	RRID: SCR_002881
PLDAPS	HukLab	N/A
Blender	Blender Foundation	RRID: SCR_008606

Other

DataPixx	VPixx Technologies	RRID: SCR_009648
EyeLink	SR Research	RRID: SCR_009602

58 in total

1. Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex.

Authors: L M Romanski; B Tian; J Fritz; M Mishkin; P S Goldman-Rakic; J P Rauschecker
Journal: Nat Neurosci Date: 1999-12 Impact factor: 24.884

2. Subdivisions of auditory cortex and processing streams in primates.

Authors: J H Kaas; T A Hackett
Journal: Proc Natl Acad Sci U S A Date: 2000-10-24 Impact factor: 11.205

3. Neural representation of vocalizations in the primate ventrolateral prefrontal cortex.

Authors: Lizabeth M Romanski; Bruno B Averbeck; Mark Diltz
Journal: J Neurophysiol Date: 2004-09-15 Impact factor: 2.714

4. Integration of auditory and visual information about objects in superior temporal sulcus.

Authors: Michael S Beauchamp; Kathryn E Lee; Brenna D Argall; Alex Martin
Journal: Neuron Date: 2004-03-04 Impact factor: 17.173

5. Functional imaging reveals visual modulation of specific fields in auditory cortex.

Authors: Christoph Kayser; Christopher I Petkov; Mark Augath; Nikos K Logothetis
Journal: J Neurosci Date: 2007-02-21 Impact factor: 6.167

6. The fusiform face area: a module in human extrastriate cortex specialized for face perception.

Authors: N Kanwisher; J McDermott; M M Chun
Journal: J Neurosci Date: 1997-06-01 Impact factor: 6.167

7. Polysensory properties of neurons in the anterior bank of the caudal superior temporal sulcus of the macaque monkey.

Authors: K Hikosaka; E Iwai; H Saito; K Tanaka
Journal: J Neurophysiol Date: 1988-11 Impact factor: 2.714

8. Subdivisions of auditory cortex and ipsilateral cortical connections of the parabelt auditory cortex in macaque monkeys.

Authors: T A Hackett; I Stepniewska; J H Kaas
Journal: J Comp Neurol Date: 1998-05-18 Impact factor: 3.215

Review 5. Faces and Voices Processing in Human and Primate Brains: Rhythmic and Multimodal Mechanisms Underlying the Evolution and Development of Speech.

Authors: Maëva Michon; José Zamorano-Abramson; Francisco Aboitiz
Journal: Front Psychol Date: 2022-03-30

Review 6. Levels of naturalism in social neuroscience research.

Authors: Siqi Fan; Olga Dal Monte; Steve W C Chang
Journal: iScience Date: 2021-06-12

6 in total