Jacob Andreas1,2, Gašper Beguš3,2, Michael M Bronstein4,5,6,2, Roee Diamant7,2, Denley Delaney8,2, Shane Gero9,10,2, Shafi Goldwasser11, David F Gruber12,2, Sarah de Haas13,2, Peter Malkin13,2, Nikolay Pavlov2, Roger Payne2, Giovanni Petri14,2, Daniela Rus1,2, Pratyusha Sharma1,2, Dan Tchernov7,2, Pernille Tønnesen15,2, Antonio Torralba1,2, Daniel Vogt16,2, Robert J Wood16,2.
Abstract
Machine learning has been advancing dramatically over the past decade. Most strides have come in human-centered applications, driven by the availability of large-scale datasets; however, opportunities are ripe to apply this technology to more deeply understand non-human communication. We detail a scientific roadmap for advancing the understanding of whale communication that can serve as a template for deciphering other forms of animal and non-human communication. Sperm whales, with their highly developed neuroanatomical features, cognitive abilities, social structures, and discrete click-based encoding, make an excellent model for developing advanced tools that can be applied to other animals in the future. We outline the key elements required for collecting and processing massive datasets, detecting basic communication units and language-like higher-level structures, and validating models through interactive playback experiments. The technological capabilities developed by such an undertaking hold potential for cross-application in broader communities investigating non-human communication and behavioral research.
Keywords: Artificial intelligence; Ethology; Linguistics; Natural language processing
Year: 2022 PMID: 35663036 PMCID: PMC9160774 DOI: 10.1016/j.isci.2022.104393
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Figure 1. An approach to sperm whale communication that integrates biology, robotics, machine learning, and linguistics expertise and comprises the following key steps
Record: collect a large-scale, longitudinal, multimodal dataset of whale communication and behavioral data from a variety of sensors. Process: reconcile and process the multi-sensor data. Decode: use machine learning techniques to create a model of whale communication, characterize its structure, and link it to behavior. Encode & Playback: conduct interactive playback experiments and refine the whale language model. Illustration © 2021 Alex Boersma.
Figure 2. Sperm whale bioacoustic system
(A) The sperm whale head contains the spermaceti organ (c), a cavity filled with almost 2,000 L of wax-like liquid, and the junk compartment (f), comprising a series of wafer-like bodies believed to act as acoustic lenses. Together, the spermaceti organ and junk act as two connected tubes, forming a bent, conical horn about 10 m long with a 0.8 m aperture in large mature males. Sound emitted by the phonic lips (i) at the front of the head is focused as it travels through the bent horn, producing a flat wavefront at the exit surface.
(B) Typical temporal structure of sperm whale echolocation and coda clicks. Echolocation clicks are produced with consistent inter-click intervals (approximately 0.4 s), whereas coda clicks are arranged in stereotyped sequences called “codas” lasting less than 2 s. Codas are characterized by their number of constituent clicks and the intervals between them (inter-click intervals, or ICIs). Codas are typically produced in multi-party exchanges that can last from about 10 s to over half an hour. Each click, in turn, consists of a sequence of equally spaced pulses, with an inter-pulse interval (IPI) on the order of 3–4 ms in an adult female, which results from the sound reflecting within the spermaceti organ. Illustration © 2021 Alex Boersma.
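The ICI-based distinction between echolocation trains and codas described above can be sketched in a few lines. This is a toy illustration, not the paper's method: the ~0.4 s echolocation ICI and the <2 s coda duration come from the caption, while the function names and the 0.1 s tolerance are assumptions made here.

```python
import numpy as np

def inter_click_intervals(click_times):
    """Return the intervals (s) between successive click onsets."""
    t = np.asarray(sorted(click_times), dtype=float)
    return np.diff(t)

def label_sequence(click_times, echolocation_ici=0.4, coda_max_duration=2.0):
    """Toy heuristic: a short burst of clicks (< 2 s total) looks like a
    coda; a long train of near-constant ~0.4 s ICIs looks like
    echolocation."""
    t = np.asarray(sorted(click_times), dtype=float)
    duration = t[-1] - t[0]
    icis = np.diff(t)
    if duration < coda_max_duration:
        return "coda"
    if np.allclose(icis, echolocation_ici, atol=0.1):
        return "echolocation"
    return "unknown"

# Echolocation train: regular 0.4 s spacing sustained over many seconds.
train = [i * 0.4 for i in range(20)]
# A five-click coda packed into well under 2 s.
coda = [0.0, 0.25, 0.5, 0.62, 0.74]
print(label_sequence(train))  # echolocation
print(label_sequence(coda))   # coda
```

A real classifier would of course work from detected clicks in noisy audio and learned rhythm templates rather than hand-set thresholds.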
Figure 3. Comparative size of datasets used for training NLP models (represented by circle area)
GPT-3 is only partially visible, while the DSWP dataset is a tiny dot on this plot (located at the center of the dashed circle). Shown in red is the estimated size of a new dataset planned to be collected in Dominica by Project CETI, an interdisciplinary initiative for cetacean communication interpretation. The estimate assumes nearly continuous monitoring of 50–400 whales, with 75%–80% of their vocalizations being echolocation clicks and 20%–25% coda clicks. A typical Caribbean whale coda has five clicks and lasts 4 s (including the silence before the next coda), yielding a rate of 1.25 clicks/s. Overall, we estimate it would be possible to collect between 400M and 4B clicks per year as a longitudinal, continuous recording of bioacoustic signals alongside detailed behavioral and environmental data.
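The 400M–4B figure is an order-of-magnitude estimate that can be checked with back-of-envelope arithmetic. In the sketch below, the 50–400 whale population and the 1.25 clicks/s coda rate come from the caption; the vocally active hours per day and the 2.5 clicks/s echolocation rate (one click per ~0.4 s ICI) are illustrative assumptions made here, not figures from the paper.

```python
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365

def clicks_per_year(n_whales, clicks_per_sec, vocal_hours_per_day):
    """Total clicks recorded per year under continuous monitoring."""
    return (n_whales * clicks_per_sec
            * vocal_hours_per_day * SECONDS_PER_HOUR * DAYS_PER_YEAR)

# Assumed: ~6 vocally active hours per whale per day; rates span the
# 1.25 clicks/s coda rate up to ~2.5 clicks/s for echolocation trains.
low = clicks_per_year(50, 1.25, 6)    # ~4.9e8
high = clicks_per_year(400, 2.5, 6)   # ~7.9e9
print(f"{low:.2e} to {high:.2e} clicks/year")
```

Under these assumed parameters the range lands in the same order of magnitude as the paper's 400M–4B estimate.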
Figure 4. Schematic of whale bioacoustic data collection from multiple data sources by several classes of assets
These include tethered buoy arrays (b), which track the whales over a large area in real time by continuously transmitting their data to shore (g), floaters (e), and robotic fishes (d). Tags (c) attached to whales can potentially provide the most detailed bioacoustic and behavioral data. Aerial drones (a) can assist with tag deployment (a1) and recovery (a2) and provide visual observation of the whales (a3). The collected multimodal data (1) must be processed to reconstruct a social network of sperm whales. The raw acoustic data (2) must be analyzed by ML algorithms to detect (3) and classify (4) clicks. Source separation and identification (5) algorithms would allow multi-party conversations to be reconstructed by attributing clicks to the whales producing them. Illustration © 2021 Alex Boersma.
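As a minimal sketch of the click-detection step (3) above, a simple energy-threshold detector on a waveform might look as follows. This is a toy illustration under assumed parameters (a 20 dB threshold over the median noise floor, a 10 ms merge gap), not Project CETI's actual detection pipeline.

```python
import numpy as np

def detect_clicks(signal, fs, threshold_db=20.0, min_gap_s=0.01):
    """Toy click detector: flag samples whose magnitude exceeds the
    median noise floor by `threshold_db`, then group detections
    separated by less than `min_gap_s` into single clicks."""
    env = np.abs(signal)
    noise_floor = np.median(env) + 1e-12
    mask = 20 * np.log10(env / noise_floor + 1e-12) > threshold_db
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return []
    # Split detected samples into clicks wherever the gap is too large.
    breaks = np.flatnonzero(np.diff(idx) > int(min_gap_s * fs))
    groups = np.split(idx, breaks + 1)
    return [g[0] / fs for g in groups]  # onset time (s) of each click

# Synthetic example: low-level noise with two brief impulses.
fs = 48_000
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(fs)          # 1 s of background noise
for t in (0.2, 0.7):
    x[int(t * fs):int(t * fs) + 48] += 1.0  # ~1 ms impulsive "clicks"
onsets = detect_clicks(x, fs)
print([round(t, 2) for t in onsets])  # [0.2, 0.7]
```

Production detectors for sperm whale clicks typically add band-pass filtering and matched-filter or learned models; the threshold-and-merge structure shown here is only the skeleton of the idea.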