Nicholas Huang, Malcolm Slaney, Mounya Elhilali.
Abstract
Deep neural networks have recently been shown to capture the intricate transformation of signals from sensory profiles to semantic representations that facilitate recognition or discrimination of complex stimuli. In this vein, convolutional neural networks (CNNs) have been used very successfully in image and audio classification. Designed to imitate the hierarchical structure of the nervous system, CNNs produce activations of increasing complexity that transform the incoming signal into object-level representations. In this work, we employ a CNN trained for large-scale audio object classification to gain insight into the contribution of various audio representations that guide sound perception. The analysis contrasts the activation of different layers of a CNN with acoustic features extracted directly from the scenes, perceptual salience obtained from behavioral responses of human listeners, and neural oscillations recorded by electroencephalography (EEG) in response to the same natural scenes. All three measures are tightly linked quantities believed to guide percepts of salience and object formation when listening to complex scenes. The results paint a picture of an intricate interplay between low-level and object-level representations in guiding auditory salience, one that depends strongly on context and sound category.
Keywords: audio classification; auditory salience; convolutional neural network; deep learning; electroencephalography; natural scenes
Year: 2018 PMID: 30154688 PMCID: PMC6102345 DOI: 10.3389/fnins.2018.00532
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
Dimensions of the input and each layer of the neural network.
| Layer type | Abbreviation | Dimensions | Total number of outputs |
|---|---|---|---|
| Input spectrogram | | 96 × 64 | 6,144 |
| Convolutional layer | Conv1 | 96 × 64 × 64 | 393,216 |
| Pooling layer | Pool1 | 48 × 32 × 64 | 98,304 |
| Convolutional layer | Conv2 | 48 × 32 × 128 | 196,608 |
| Convolutional layer | Conv3 | 24 × 16 × 256 | 98,304 |
| Output layer/predictions | Predic | 4,923 | 4,923 |
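
For the network layers listed in the table, the total number of outputs is simply the product of the layer's dimensions. The short Python sketch below checks that arithmetic; the dictionary name `LAYER_DIMS` and the helper `total_outputs` are illustrative names introduced here, not part of the original work, and the dimensions are copied from the table rows.

```python
from math import prod

# Dimensions copied from the table rows for the network layers.
# LAYER_DIMS and total_outputs are hypothetical names used only for this check.
LAYER_DIMS = {
    "Conv1": (96, 64, 64),
    "Pool1": (48, 32, 64),
    "Conv2": (48, 32, 128),
    "Conv3": (24, 16, 256),
    "Predic": (4923,),
}


def total_outputs(dims):
    """Return the number of output units as the product of the dimensions."""
    return prod(dims)


totals = {name: total_outputs(dims) for name, dims in LAYER_DIMS.items()}

for name, count in totals.items():
    print(f"{name:>6}: {count:,}")
```

Note that Pool1 and Conv3 have the same number of outputs: halving both spatial dimensions twice reduces the spatial size by a factor of 16, while the channel count grows only fourfold, from 64 to 256.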