Björn Schuller, Alice Baird, Alexander Gebhard, Shahin Amiriparian, Gil Keren, Maximilian Schmitt, Nicholas Cummins.
Abstract
Computer audition (i.e., intelligent audio) has made great strides in recent years; however, it is still far from achieving holistic hearing abilities that more appropriately mimic human-like understanding. Within an audio scene, a human listener can quickly interpret several layers of sound at a single time-point, with each layer varying in characteristics such as location, state, and trait. Current integrated machine listening approaches, on the other hand, mainly recognise only single events. In this context, this contribution aims to provide key insights and approaches that can be applied in computer audition to achieve a more holistic intelligent understanding system, and to identify the challenges in reaching this goal. We first summarise the state-of-the-art in traditional signal-processing-based audio pre-processing and feature representation, as well as in automated learning, such as by deep neural networks. This concerns, in particular, audio interpretation, decomposition, understanding, and ontologisation. We then present an agent-based approach for integrating these concepts into a holistic audio understanding system. Based on this, we conclude with avenues towards reaching the ambitious goal of 'holistic human-parity' machine listening abilities.
Keywords: audio intelligence; computer audition; machine learning
Year: 2021 PMID: 34751066 PMCID: PMC8581779 DOI: 10.1177/23312165211046135
Source DB: PubMed Journal: Trends Hear ISSN: 2331-2165 Impact factor: 3.293
Figure 1. Example of an iterative approach to decomposing and interpreting audio on different semantic levels of ‘understanding’, leading to an optimal ‘holistic’ audio understanding. The example (audio) scene is a garage in which two people are working on a car while listening to music.
Figure 2. Overview of the agents (decomposition, interpretation, ontologisation, understanding), their given tasks, and their interactions, as well as additional dissemination outputs.
Figure 3. (Left) Example attribute vectors for zero-shot learning. (Right) Structure of an encoder–decoder neural network with an attention mechanism.
Figure 4. Example of an ontology that consistently attributes states and traits to audio sources, not only to speech, as is currently usual in the literature. In this depiction, the audio source is decomposed into three sub-sources (speech, music, and sound), each of which is then further decomposed. For example, one of the ‘sound’ sources is labelled as mechanical, vehicle, car, and the car is further labelled with its brand and its current action (e.g., speed).
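
The hierarchy described in the Figure 4 caption can be sketched as a small tree, shown here in Python. This is only an illustrative sketch, not the authors' implementation: the `Node` class, the `build_example` helper, and the placeholder attribute values are assumptions introduced here, and only the branch that the caption spells out (sound, mechanical, vehicle, car) is expanded.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical audio ontology (illustrative only)."""
    label: str
    attributes: dict = field(default_factory=dict)  # states and traits
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield the label path from the root to every leaf."""
        here = prefix + (self.label,)
        if not self.children:
            yield here
        for child in self.children:
            yield from child.paths(here)


def build_example():
    """Build the part of the ontology that the caption names explicitly."""
    root = Node("audio source")
    root.add(Node("speech"))
    root.add(Node("music"))
    sound = root.add(Node("sound"))
    car = sound.add(Node("mechanical")).add(Node("vehicle")).add(Node("car"))
    # Placeholder values: the caption names the attribute kinds, not values.
    car.attributes = {"brand": "unknown", "action": "unknown"}
    return root


if __name__ == "__main__":
    for path in build_example().paths():
        print(" > ".join(path))
```

Keeping states and traits in a per-node dictionary means the same structure could carry attributes for speech and music sources as well, which is the point the caption makes about not restricting such labels to speech.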