USVSEG: A robust method for segmentation of ultrasonic vocalizations in rodents.

Ryosuke O Tachibana¹, Kouta Kanno², Shota Okabe³, Kohta I Kobayasi⁴, Kazuo Okanoya¹.

Abstract

Rodents' ultrasonic vocalizations (USVs) provide useful information for assessing their social behaviors. Despite previous efforts in classifying subcategories of time-frequency patterns of USV syllables to study their functional relevance, methods for detecting vocal elements from continuously recorded data have remained sub-optimal. Here, we propose a novel procedure for detecting USV segments in continuous sound data containing background noise recorded during the observation of social behavior. The proposed procedure utilizes a stable version of the sound spectrogram and additional signal processing for better separation of vocal signals by reducing the variation of the background noise. Our procedure also provides precise time tracking of spectral peaks within each syllable. We demonstrated that this procedure can be applied to a variety of USVs obtained from several rodent species. Performance tests showed this method had greater accuracy in detecting USV syllables than conventional detection methods.

Year: 2020; PMID: 32040540; PMCID: PMC7010259; DOI: 10.1371/journal.pone.0228907

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Various species in the rodent superfamily Muroidea (which includes mice, rats, and gerbils) have been reported to produce ultrasonic sounds in a wide range of frequencies up to around 100 kHz [1]. Such ultrasonic vocalizations (USVs) are thought to be associated with specific social behaviors. For several decades, laboratory mice (Mus musculus domesticus and Mus musculus musculus) have been reported to produce USVs as part of courtship behaviors [2,3]. Their vocalizations are known to form a sequential structure [4] which consists of various sound elements, or ‘syllables’. Almost all USV syllables in mice exhibit spectral peaks between 50 and 90 kHz with a duration of 10–40 ms, though slight differences in the syllable spectrotemporal pattern have been observed among strains [5]. On the other hand, it has also been well described that laboratory rats (Rattus norvegicus domesticus) produce USV syllables that fall into two predominant categories: one has a relatively higher frequency (around 50 kHz) with a short duration (a few tens of milliseconds), and the other has a lower frequency (~22 kHz) but a much longer duration. These two types of USV syllables are here named ‘pleasant’ and ‘distress’ syllables since they are generally considered to be indicators of positive and negative emotional states, respectively [6-9]. This categorization appears to be preserved in different strains of rats, though a slight difference in duration has been reported [10]. In another rodent family, the vocalizations of Mongolian gerbils (Meriones unguiculatus) have also been extensively studied, as this species serves as an animal model for both the audio-vocal system and social communication [11-14]. They produce various types of USV syllables with a frequency range up to ~50 kHz and distinct spectrotemporal patterns [15,16].
In general, rodent USVs have been thought to have ecological functions for male-to-female sexual display [2,3,17-20], emotional signal transmission [21-26], and mother-infant interactions [27-30]. Mouse USVs can be discriminated into several subcategories according to their spectrotemporal patterns [31-35], and these patterns could predict mating success [35,36], though subcategories are not consistent between studies. Their USV patterns are innately acquired rather than a learned behavior [34,37], though sociosexual experience can slightly enhance the vocalization rate [38]. In rat USVs, the pleasant (~50 kHz) and distress (~22 kHz) calls have been suggested to have a communicative function since these calls can transmit the emotional states of the vocalizer to the listener and can modify the listener’s behavior such as mating [26,39], approaching [40], or defensive behavior [41,42]. It has been suggested that perception of these calls can also modulate the listeners’ affective state [21]. Further discrimination of subcategories within the pleasant call has been studied to better understand their functional differences in different situations [6-9]. From these characteristics and functions, rodent USVs are expected to provide a good window for studying sociality and communication in animals. Mouse USVs have been used for studying disorders of social behavior, with a particular focus on autism spectrum disorder [43-46]. Thanks to recent genetic manipulation techniques, social disorders can be modeled in mice and then studied directly through USV analysis to quantify social behavior. On the other hand, studies utilizing rat USVs have focused on elucidating the neural mechanisms for the emotional system [6-9], maternal behavior [47-49], and social interactions [50]. 
USVs of other species in the same superfamily of rodents have also been studied as a variety of research models, including parental behaviors, auditory perception and vocal motor control in gerbils [11,51,52]. Thus, a unified analysis tool for analyzing rodent USVs is helpful to transfer knowledge obtained across different species. Previous studies have proposed analysis toolkits such as VoICE [53], MUPET [54] or DeepSqueak [55], which are successful when the recorded sounds have a sufficiently high signal-to-noise ratio. These analysis tools can be less effective when recordings are contaminated with background noise introduced during recording. Noise can be short and transient (e.g., scratching sounds) or stationary (e.g., noise produced by fans or air compressors). Such noise greatly deteriorates the segmentation of USV syllables, and smears acoustical features (e.g., peak frequency) of the segmented syllables, possibly reducing the reliability of classification of vocal categories and quantification of acoustical features of vocalizations. Despite a variety of behaviors and functions among species, rodent USVs generally tend to exhibit a single salient peak in the spectrum, with few weak harmonic components, if any (see Fig 1A for example). This tendency is associated with a whistle-like sound production mechanism [56]. From a sound analysis point of view, this characteristic provides a simple rule for isolating USV sounds from background noise; that is, narrow-band spectral peaks can be categorized as vocalized sounds whereas broadband spectral components can be categorized as the background. Thus, emphasizing the spectral peaks while flattening the noise floor should improve discrimination of vocalized signals from background noise.

Fig 1

Spectrogram of a rodent ultrasonic vocalization (USV) and proposed method for detection of vocal elements in continuously recorded data.

(A) Example spectrogram of a mouse vocalization. A brief segment of vocalization (a few tens to a few hundred milliseconds) is defined as a ‘syllable’, and the time interval between two syllables is called a ‘gap’. (B) Schematic diagram of the proposed signal processing procedure. (C, D) Example of a multitaper spectrum and a flattened spectrum, respectively. The flattening process subtracts an estimated noise floor (green line), and the segmentation process detects spectral components above the defined threshold (blue line) as syllables. Example multitaper (E) and flattened (F) spectrograms obtained from a recording of a male mouse performing courtship vocalizations to a female mouse. (G) Processed result of USV data (same as E, F) showing detected syllable periods and spectral peak traces of the syllables (blue highlighted dots). Dark gray zones show non-syllable periods and light gray zones indicate a margin inserted before and after a syllable period.

Here, we propose a signal processing procedure for robust detection of USV syllables in recorded sound data by reducing acoustic interference from background noise. Additionally, this procedure is able to track multiple spectral peaks of the segmented syllables. This procedure consists of five steps (see Fig 1B): (i) make a stable spectrogram via the multitaper method, which reduces interaction of sidelobes between signal and stochastic background noise, (ii) flatten the spectrogram by liftering in the cepstral domain, which eliminates both pulse-like transient noise and constant background noise, (iii) perform thresholding, (iv) estimate onset/offset boundaries, and (v) track spectral peaks of segmented syllables on the flattened spectrogram.
The proposed procedure is implemented in a GUI-based software (“USVSEG”, implemented as MATLAB scripts; available from https://sites.google.com/view/rtachi/resources), and it outputs segmented sound files, image files, and spectral peak feature data after receiving original sound files. These output files can be used for further analyses, for example, clustering, classification, or behavioral assessment by using other toolkits. The present study demonstrated that the proposed procedure can be successfully applied to a variety of USV syllables produced by a wide range of rodent species (see Table 1). It achieves nearly perfect performance for segmenting syllables in a mouse USV dataset. Further, we confirmed that our procedure was more accurate in segmenting USVs, and more robust against elevated background noise than conventional methods.

Table 1

Rodent USV dataset for performance tests.

Species	Strain and condition	Data ID	File duration (s)	# syllables
Mouse (Mus musculus)	C57BL/6J, male courtship	A / Aco59_2	113.4	333
		B / Aco59_2	115.5	85
		C / Can15-1	97.6	371
		D / Can15-1	97.0	217
		E / Can16-1	135.9	394
		F / Can16-1	118.7	171
		G / Can9_2	104.7	420
		H / Can9_2	205.9	306
		I / Aco65-1	98.6	383
		L / Can3_1	104.9	119
	BALB/c, male courtship	BALB128-4	60.0	168
	BALB/c, male courtship	ClnBALB124-4	53.0	203
	Shank2-, male courtship	Shank2_S2-4-65	65.1	261
		Shank2_S2-4-103	65.1	193
		Shank2_S2-4-108	65.1	326
	C57BL/6J, isolated pup call	Rin3-1pup	175.0	62
		Rin3-3pup	175.0	41
		Rin3-7pup	175.0	117
		Rin3-8pup	175.0	112
		Rin3-9pup	175.0	74
Rat (Rattus norvegicus domesticus)	LEW/CrlCrlj, pleasant call	a14	33.2	100
		a18	51.6	100
		a42	28.7	100
	LEW/CrlCrlj, distress call	a55	110.0	37
		a57	142.0	57
		a58	100.0	29
		a60	100.0	32
Gerbil (Meriones unguiculatus)	male courtship	uFMdata	27.5	124
Gerbil (Meriones unguiculatus)	male courtship	uSFMdata	59.0	112

Results

Overview of USV segmentation

Rodent USVs consist of a series of brief vocal elements, or ‘syllables’ (Fig 1A), in a variety of frequency ranges, depending on the species and situation. For instance, almost all mouse strains vocalize in a wide frequency range of 20–100 kHz, while rat USVs are concentrated around 20–30 kHz when the animals are in distress. These vocalizations are sometimes difficult to detect visually in a spectrogram because of unavoidable background noise. Such situations make it challenging to detect and segment each USV syllable in recorded sound data. Here, we assessed a novel procedure consisting of several signal processing methods for segmentation of USV syllables (Fig 1B). A smooth spectrogram of recorded sound was obtained using the multitaper method (Fig 1C and 1E) and was flattened by cepstral filtering and median subtraction (Fig 1D and 1F). The flattened spectrogram was binarized with a threshold that was determined in relation to the estimated background noise level. Finally, the signals that exceeded the threshold were used to determine the vocalization period (Fig 1G). Additionally, our procedure detects spectral peaks at every timestep within the segmented syllable periods. In this procedure, users only need to adjust the threshold value based on the signal-to-noise ratio of the recording, and they do not need to adjust any other parameters (e.g., maximum and minimum limits of syllable duration and frequency) once appropriate values for individual animals have been determined. Note that we provide heuristically determined parameter sets as reference values (see Table 2).

Table 2

Chosen parameter sets for different species and situations.

Species/Strain/Condition	min.frequency(kHz)	max.frequency(kHz)	min.duration(ms)	max.duration(ms)	min.gap(ms)	threshold value(σ)
Mice (C57BL/6J)	40	120	3	300	30	4.0
Mice (BALB/c)	40	120	3	300	30	3.5
Mice (Shank2-)	40	120	3	300	30	3.5
Mice (C57BL/6J pup)	40	120	3	300	30	3.0
Rats (pleasant call)	20	100	3	500	40	3.0
Rats (distress call)	12	40	100	3000	40	6.0
Gerbils	20	60	5	300	30	4.0

Searching for an optimal threshold

To assess the relationship between the threshold parameter and segmentation performance, we validated the segmentation performance of our procedure on a mouse USV dataset (Fig 2). The actual threshold was defined as the product of a weighting factor (or “threshold value”) and the background noise level, which was quantified as the standard deviation (σ) of an amplitude histogram of the flattened spectrogram (Fig 2A; see Methods for details). With low or high threshold values, the segmentation procedure could mistakenly detect noise as syllables or miss weak vocalizations, respectively (Fig 2B). To find an optimal threshold value for normal recording conditions, we conducted a performance test on a dataset including 10 recorded sound files obtained from 6 mice that had onset/offset timing information defined manually by a human expert (Table 1). We calculated hit and correct rejection (CR) rates to quantify and compare the consistency of segmentation by the proposed procedure with that of manual processing (see Methods). The results showed a tendency for the hit rate to decrease and the correct rejection rate to increase as the threshold value increased (Fig 2C). We also quantified the consistency by an inter-rater consistency index, Cohen’s κ [57]. As a result, we found that the κ index tended to increase with increasing threshold values, and was sufficiently high (> 0.8) [58] when the threshold value was 3.5 or more (Fig 2D). Note that we used an identical parameter set for the performance tests in this section (see Table 2), varying only the threshold value.

Fig 2

Relationship between thresholding and segmentation performance for mouse USVs.

(A) Computation scheme of the threshold value. All data points (pixels) of the flattened spectrogram were pooled and used to make a histogram as a function of amplitude in dB (black line). The background noise distribution was parameterized by the standard deviation (σ) of a Gaussian curve (green broken line). The threshold value was defined as a weighting factor of σ (shown as blue vertical lines for example values: 2.5, 4.0 and 5.5). (B) Example segmentation results for threshold values of 2.5, 4.0 and 5.5 (σ). The uppermost panel shows the original flattened spectrogram with the syllable periods segmented manually by a human expert (blue shaded areas). The lower three panels depict thresholded binary images and results of our automatic segmentation for three different threshold values (orange shaded areas). Typically, the false detection rate decreases and the miss rate increases as the threshold value increases. (C) Segmentation performance of our procedure as a function of threshold value. The hit rate (blue line) tends to decrease, while the correct rejection rate (red) tends to increase as the threshold value increases. The accuracy score (black) provides a balanced index of the performance. (D) Cohen’s κ index as a function of threshold value, showing consistency of detections between the automatic and the manual segmentations. The κ index tends to increase as the threshold value increases.

Segmentation performance for various USVs

To demonstrate the applicability of the proposed procedure to a wide variety of rodent USVs, we tested segmentation performance on the USVs of two strains of laboratory mice (C57BL/6J and BALB/c), two different call types from laboratory rats (pleasant calls, PC, and distress calls, DC), and USV syllables of gerbils. We conducted performance tests for each dataset using manually detected onset/offset information (Table 1). Note that PC and DC in rats show a remarkable difference in both duration and frequency range even when produced by the same animal; thus we treated the two calls independently and used different parameter sets for them. As in the threshold optimization, we calculated hit and CR rates to quantify how well the automatic segmentation matched the manual segmentation by human experts. The results show that, when using heuristically chosen parameter sets (Table 2), our procedure has over 0.95 accuracy in segmenting various USV syllables (Fig 3). Slight variability in the accuracy and κ index was observed across conditions, and this can be explained by differences in the background noise level during recording (as shown in the spectrogram for gerbil vocalizations containing scratch noises).

Fig 3

Example results of segmentation and spectral peak tracking on various rodent species.

(A-C) Mouse courtship calls obtained from three different strains: (A) C57BL/6, (B) BALB/c, (C) a disease model for mutation in ProSAP1/Shank2 proteins (“Shank2-”). (D) Pup calls obtained from an isolated juvenile mouse (C57BL/6). (E-F) Rat calls in the context of pleasant (E) and distress (F) situations. (G) Representative USV sounds of gerbils, named upward sinusoidal frequency modulated (uSFM) calls. (H-I) Detection performance on all seven dataset conditions. All conditions showed accuracy scores of more than 0.95 (H) and sufficiently high κ-index values (I) on the dataset (see Methods for details).

Comparison with conventional methods

We compared our procedure with conventional signal processing methods (Fig 4), namely a single-window (or “singletaper”) method for generating the spectrogram and long-term spectral subtraction (“whitening”), both of which have been previously used in other segmentation procedures [59]. As in this previous study, we used the Hanning window as a typical singletaper to generate the spectrogram. The performance test was carried out with four conditions consisting of combinations of two windows (multitaper vs. singletaper) and two noise reduction methods (flattening vs. whitening) (see “Comparison with conventional methods” in Methods). The dataset used for searching for the optimal threshold was also used for this performance test. The results demonstrated greater performance with flattening than with whitening, and slightly higher performance with multitaper than with singletaper spectrograms (Fig 4A). A statistical test showed a significant effect of noise reduction method (two-way ANOVA; F(1,36) = 19.02, p < 0.001), but not of windowing method (F(1,36) = 0.17, p = 0.681), and there was no significant interaction between them (F(1,36) = 0.00, p = 0.960). Further, to determine the robustness of the segmentation methods against noise, we added white noise to the sound dataset at levels of −12, −6, 0, and 6 dB relative to the original sound (see “Noise addition” in Methods), and ran the performance test again. In particular, we compared performance between the multitaper and singletaper spectrograms using only flattening for the noise reduction. The result of this test clearly showed that the multitaper method was more robust to degraded signal-to-noise conditions than the singletaper method (Fig 4B). The statistical test showed significant main effects of both additive noise level and windowing method, with a significant interaction between them (two-way ANOVA; noise level: F(4,90) = 51.54, p < 0.001; window: F(1,90) = 19.59, p < 0.001; interaction: F(4,90) = 6.42, p < 0.001).
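The noise-addition step of the robustness test can be illustrated with a short sketch. The details of the authors' “Noise addition” procedure are not reproduced here, so this example assumes that the noise level is defined relative to the root-mean-square (RMS) amplitude of the whole recording; the function name add_white_noise is hypothetical.

```python
import numpy as np

def add_white_noise(x, level_db, rng=None):
    """Add Gaussian white noise whose RMS is `level_db` dB relative to the RMS of x.

    Assumption: the level is referenced to the RMS of the entire recording.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    signal_rms = np.sqrt(np.mean(x ** 2))
    noise = rng.standard_normal(len(x))
    # Rescale the noise so that its RMS is exactly the requested level.
    target_rms = signal_rms * 10.0 ** (level_db / 20.0)
    noise *= target_rms / np.sqrt(np.mean(noise ** 2))
    return x + noise
```

Under this assumption, a level of 0 dB yields noise with the same RMS as the recording, and −12 dB yields noise at roughly one quarter of its RMS amplitude.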

Fig 4

Segmentation performance of various combinations of signal processing methods.

(A) Accuracy scores of segmentation on flattened or whitened spectrogram produced by a multitaper (blue) or singletaper (red) method. We used the hanning window as the single taper condition. (B) Performance sensitivity to additive noise. Segmentation with multitaper (blue) and single-taper (red) spectrograms against experimentally added background noise. White noise was added to the original data at levels of −12, −6, 0, or 6 dB before processing.

Discussion

The proposed procedure showed nearly perfect segmentation performance for variable USV syllables of a variety of species and strains in the rodent superfamily. The procedure was designed to emphasize vocal components in the spectral domain while reducing variability of background noise, which inevitably occurs during observation of social behavior, and usually interferes with the segmentation process. This process helped to discriminate vocal signals from the background by thresholding according to the signal-to-noise ratio. Additionally, our procedure provides precise tracking of spectral peaks within each vocalized sound. The proposed method was more robust than the conventional method for syllable detection, in particular under elevated background noise levels. These results demonstrate that this procedure can be generally applied to segment USVs of several rodent species.
Our procedure was designed to emphasize distribution differences between vocal signals and background noise, under the assumption that rodent USV signals generally tend to have narrow-band sharp spectral peaks. In this process, we employed the multitaper method, which uses multiple windows for performing spectral analysis (Thomson 1982) and has been used in vocal sound analyses for other species, e.g., songbirds [56]. We also introduced the spectral flattening process in which the broadband spectrum in each timestep was flattened by cepstral filtering. As we demonstrated in the performance comparison tests, a combination of multitaper windowing with flattening showed better performance than the conventional method of single-taper windowing with long-term spectral subtraction that has been used in a previous mouse USV analysis [59]. In particular, the difference in performance was seen under degraded signal-to-noise conditions. Note that our experimental results did not focus on applicability to lower-frequency (i.e., <20 kHz), harmonic-rich, or harsh noise-like vocalizations since these types of sounds are outside the scope of the processing algorithm.
Here, we employed a redundant way to represent the spectral features of USVs, exporting up to three candidates for spectral peaks at every timestep. This provides additional information about harmonics, as well as an appropriate way to treat “jumps,” which are sudden changes in the spectral peak tracks [4]. Researchers have attempted to classify syllables into several subcategories according to their spectrotemporal features so that they can analyze sequential patterns to understand their syntax [32,35,36,60]. The procedure proposed in the present study will allow for better categorization of USV syllable subtypes.

Methods

Proposed procedure

Our procedure consists of five steps: multitaper spectrogram generation, flattening, thresholding, detecting syllable onset/offset, and spectral peak tracking. In particular, the multitaper method and flattening were core processes for suppressing the variability of background noise as described below in detail.

Multitaper spectrogram generation

We used the multitaper method [61] for obtaining the spectrogram to improve signal salience against background noise distribution. Multiple time windows (or tapers) were designed as a set of 6 series of discrete prolate spheroidal sequences with the time half bandwidth parameter set to 3 [62]. The length of these windows was set to 512 samples (~2 ms for 250 kHz sampling rate). In each time step, the original waveform was multiplied by all six windows and transformed into the frequency domain. The six derived spectrograms were averaged into one to obtain a stable spectrotemporal representation. This multitaper method reduces background noise compared to a typical single-taper spectrogram, while widening the bandwidth of signal spectral peaks.
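The multitaper step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation: it uses scipy.signal.windows.dpss for the discrete prolate spheroidal sequences, and the hop size of 128 samples is an assumption, since the step size between time frames is not stated here. The function name multitaper_spectrogram is hypothetical.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrogram(x, fs, n_fft=512, hop=128, nw=3.0, n_tapers=6):
    """Average the power spectrograms obtained with several DPSS tapers.

    n_fft, nw and n_tapers follow the values given in the text;
    hop is an assumed step size.
    """
    x = np.asarray(x, dtype=float)
    tapers = dpss(n_fft, nw, Kmax=n_tapers)            # shape: (n_tapers, n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Index matrix that slices the waveform into overlapping frames.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]                                    # (n_frames, n_fft)
    # Multiply every frame by every taper, then transform to the frequency domain.
    tapered = tapers[:, None, :] * frames[None, :, :]  # (n_tapers, n_frames, n_fft)
    power = np.abs(np.fft.rfft(tapered, axis=-1)) ** 2
    spec = power.mean(axis=0).T                        # (n_freqs, n_frames)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + n_fft / 2) / fs
    return freqs, times, spec
```

Averaging across tapers lowers the variance of the noise floor at the cost of a wider spectral peak, which matches the trade-off noted in the paragraph above.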

Flattening

To emphasize spectral peaks for detectability of vocalization events, we reduced the variability of background noise by flattening the spectrogram. This flattening consists of two processes. First, transient broadband (or impulse-like) noises were reduced by liftering in every time step, in which gradual fluctuation in the frequency domain, or spectral envelope was filtered out by replacing the first three cepstral coefficients with zero. This process can emphasize spectral peaks of rodent USVs since they are very narrow-band and have few or no harmonics. Then, we calculated a grand median spectrum that had median values of each frequency channel, and subtracted it from the liftered spectrogram.
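The two flattening steps can be illustrated with the sketch below. It assumes a power spectrogram of shape (frequency, time) with an odd number of frequency bins (as produced by a real FFT of even length), computes a real cepstrum of each frame with an inverse FFT of the log spectrum, and zeroes the first three cepstral coefficients together with their mirrored counterparts. Whether the original software computes the cepstrum in exactly this way is an assumption; the function name flatten_spectrogram is hypothetical.

```python
import numpy as np

def flatten_spectrogram(power_spec, n_lifter=3, eps=1e-12):
    """Flatten a (n_freqs, n_frames) power spectrogram; returns values in dB."""
    log_spec = 10.0 * np.log10(np.asarray(power_spec, dtype=float) + eps)
    n_full = 2 * (log_spec.shape[0] - 1)
    # Real cepstrum of every time frame (inverse FFT along the frequency axis).
    cep = np.fft.irfft(log_spec, n=n_full, axis=0)
    # Liftering: remove the slowly varying spectral envelope by zeroing the
    # first n_lifter coefficients and their mirror images, which keeps the
    # cepstrum symmetric so that the result is a real spectrum.
    cep[:n_lifter, :] = 0.0
    if n_lifter > 1:
        cep[-(n_lifter - 1):, :] = 0.0
    liftered = np.fft.rfft(cep, axis=0).real
    # Subtract the grand median spectrum (median over time per frequency channel).
    return liftered - np.median(liftered, axis=1, keepdims=True)
```

A smooth spectral tilt is removed by the liftering step, whereas a narrow-band peak loses only a small fraction of its height and therefore stands out against a flat floor.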

Thresholding and detection

After flattening, we binarized the flattened spectrogram image at a threshold which was determined based on the estimated background noise level. The threshold was calculated as the multiplication of a weighting factor (or “threshold value”) and the standard deviation (σ) of a background distribution (Fig 2A). The σ value was estimated from a pooled amplitude histogram of the flattened spectrogram as described in a previous study for determining onsets and offsets of birdsong syllables [63]. The threshold value is normally chosen from 3.5–5.5 and can be manually adjusted depending on the background noise level. After binarization, we counted the maximum number of successive pixels along the frequency axis whose amplitude exceeded the threshold in each time frame, and considered the time frame to include vocalized sounds when the maximum number counted was 5 or more (corresponding to a half bandwidth of the multitaper window).
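A minimal sketch of the thresholding and frame-detection logic is given below. The σ estimate in the text comes from a pooled amplitude histogram; here a scaled median absolute deviation is swapped in as a simple robust estimator, which is an assumption and not the authors' method. The function names are hypothetical.

```python
import numpy as np

def estimate_sigma(flat):
    """Stand-in background sigma: scaled median absolute deviation (assumption)."""
    med = np.median(flat)
    return 1.4826 * np.median(np.abs(flat - med))

def longest_run(column):
    """Length of the longest run of True values in a 1-D boolean sequence."""
    best = run = 0
    for above in column:
        run = run + 1 if above else 0
        best = max(best, run)
    return best

def detect_frames(flat, threshold_value=4.0, min_run=5):
    """Mark time frames whose longest above-threshold run along frequency is >= min_run."""
    binary = flat > threshold_value * estimate_sigma(flat)
    return np.array([longest_run(binary[:, t]) >= min_run
                     for t in range(binary.shape[1])])
```

Requiring several successive pixels along the frequency axis rejects isolated noise pixels that happen to exceed the threshold.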

Timing correction

A pair of detected elements split by a silent period (or gap) with a duration less than the predefined lower limit (“gap min”) was integrated to omit unwanted segmentation within syllables. We usually set this lower limit for a gap around 3–30 ms according to specific animal species or strains. Then, sound elements with a duration of more than the lower limit (“dur min”) were judged as syllables. If the duration of an element exceeded the upper limit (“dur max”), then the element was excluded. These two parameters (dur min and max) were differentially determined for different species, strains, and situations. The heuristically determined values of these parameters are shown in a table (Table 2) for reference.
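The merging and duration rules above can be sketched as a small function. Onsets and offsets are assumed to be in milliseconds and sorted by onset; how elements exactly at the duration limits are treated is not specified in the text, so the inclusive comparison below is an assumption. The function name correct_timing is hypothetical.

```python
def correct_timing(segments, gap_min, dur_min, dur_max):
    """Merge elements separated by short gaps, then filter by duration.

    segments: list of (onset_ms, offset_ms) tuples sorted by onset.
    """
    merged = []
    for onset, offset in segments:
        if merged and onset - merged[-1][1] < gap_min:
            # The gap is shorter than gap_min: join with the previous element.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    # Keep elements whose duration lies within the limits (inclusive, by assumption).
    return [(on, off) for on, off in merged if dur_min <= off - on <= dur_max]
```

For example, with gap_min = 30, dur_min = 3 and dur_max = 300 (the values listed for mice in Table 2), two elements separated by a 5-ms gap are joined into one syllable, while a 1-ms element and a 500-ms element are both discarded.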

Spectral peak tracking

We also implemented an algorithm for tracking multiple spectral peaks as an additional analysis after segmentation. Although the focus of our study is temporal segmentation of syllables, we briefly explain this algorithm as follows. First, we calculated the degree of salience of spectral peaks by convolving a second-order differential spectrum of the multitaper window itself into the flattened spectrum along the frequency axis. This process emphasizes the steepness of spectral peaks in each time frame. Then, the strongest four local maxima of spectral saliency were detected as candidates. We grouped the four peak candidates to form a continuous spectral object according to their time-frequency continuity (within 5% frequency change per time frame). If the length of the grouped object was less than 10 time points, the spectral peak data in the object was excluded as a candidate for a vocalized sound. At the final step, the algorithm outputs up to three peaks in each time step.
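The candidate-selection part of this algorithm can be illustrated for a single time frame. In the sketch below, the salience kernel is a plain negative second difference, [-1, 2, -1], used as a stand-in for the second-order differential spectrum of the multitaper window; that substitution is an assumption, and the grouping of candidates by time-frequency continuity is not shown. The function name peak_candidates is hypothetical.

```python
import numpy as np

def peak_candidates(frame, freqs, n_candidates=4):
    """Return frequencies and saliences of the strongest local maxima in one frame."""
    kernel = np.array([-1.0, 2.0, -1.0])   # stand-in kernel (assumption)
    salience = np.convolve(np.asarray(frame, dtype=float), kernel, mode="same")
    inner = salience[1:-1]
    # Interior local maxima with positive salience.
    is_peak = (inner > salience[:-2]) & (inner >= salience[2:]) & (inner > 0)
    idx = np.where(is_peak)[0] + 1
    # Keep the n_candidates most salient peaks, strongest first.
    idx = idx[np.argsort(salience[idx])[::-1][:n_candidates]]
    return freqs[idx], salience[idx]
```

Because the kernel responds to curvature rather than absolute level, a sharp narrow-band peak receives a higher salience than a broad bump of the same height.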

Output files

The software can output a variety of processed data in multiple forms (Fig 1B). Segmented syllables are saved as sound files (WAV format), and image files of either flattened or original spectrograms (JPG format). A summary file (CSV) contains onset and offset time points and an additional three acoustical features for each segmented syllable: duration, max-frequency (maxfreq), and max-amplitude (maxamp). Maxfreq is defined as the peak frequency of the time frame that has the highest amplitude in that syllable, and maxamp is the amplitude value of the maxfreq. These features are widely used in the field of USV studies [35-37]. Furthermore, the software generates the peak frequency trace of each syllable so that users can perform post-processing after segmentation to obtain additional features.
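The three summary features can be computed from a spectrogram and a pair of onset/offset times as sketched below. Whether the software reads the amplitude from the flattened or the original spectrogram is not stated here, so the input is simply called spec; the function name syllable_features and the dictionary keys are hypothetical.

```python
import numpy as np

def syllable_features(spec, freqs, times, onset, offset):
    """Duration, max-frequency and max-amplitude of one segmented syllable.

    spec: (n_freqs, n_frames) array; freqs and times label its two axes.
    """
    cols = np.where((times >= onset) & (times <= offset))[0]
    seg = spec[:, cols]
    # The global maximum identifies both the loudest time frame and its peak frequency.
    f_idx, t_idx = np.unravel_index(np.argmax(seg), seg.shape)
    return {
        "duration": offset - onset,
        "maxfreq": freqs[f_idx],
        "maxamp": seg[f_idx, t_idx],
    }
```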

Dataset

For testing the segmentation performance of our procedure, we prepared datasets consisting of recorded sounds and manually detected onset/offset timing of syllables for three species in the rodent superfamily (Table 1). The manual segmentation for each species was performed by a different human expert. These experts segmented sound materials by visual inspection of a spectrogram, independently of any automatic segmentation system. They were not informed about any of the results of our procedure beforehand. Finally, we collected segmented data for each condition (species, strains, or contexts) as described below. For all species/strain/context conditions, ultrasonic sounds were recorded using a commercial condenser microphone and an A/D converter (Ultra-SoundGate, Avisoft Bioacoustics, Berlin, Germany; SpectoLibellus2D, Katou Acoustics Consultant Office, Kanagawa, Japan). All data were resampled at 250 kHz to have the same sampling rates before starting performance tests for consistency across datasets, though our procedure can be applied to data with much higher sampling rates. The whole dataset is available online (https://doi.org/10.5281/zenodo.3428024).

Mice

We obtained 10 recording sessions of courtship vocalizations from 6 mice (Mus musculus; C57BL/6J, adult males), under the same condition and recording environment as described in our previous work [38]. Briefly, the microphone was set 16 cm above the floor with a sampling rate of 400 kHz. Latency to the first call was measured after introducing an adult female of the same strain into the cage and then ultrasonic recording was performed for one additional minute. The data recorded during the first minute after the first ultrasound call was analyzed for the number of calls. For all recording tests, the bedding and cages for the males were changed one week before the recording tests, and these home-cage conditions were maintained until the tests were completed. These sound data files were originally recorded as part of other experiments (in preparation), and shared on mouseTube [64] and Koseisouhatsu Data Sharing Platform [65]. Note that we chose 10 files from the shared data and excluded two files (“J” and “K”) which did not contain enough syllables for the present study. To assess the applicability of our procedure for a wider range of strains and situations, we also obtained data from another strain (“BALB/c”), a disease model (“Shank2-”), and an isolated juvenile’s pup call (“Pup”). For the disease model, we used the dataset of ProSAP1/Shank2-/- mice [33], which was also available on mouseTube. These mice have mutated ProSAP1/Shank2, which is one of the synaptic scaffolding proteins mutated in patients with autism spectrum disorders (ASD). The experimental procedure used for BALB/c mice was the same as for C57BL/6J mice. For Shank2- mice, the procedure was similar but see reference for details [33].

Mouse pups

For recording pup USVs, we used C57BL/6J mice at postnatal day 5–6. The microphone was set 16 cm above the floor with a sampling rate of 384 kHz. After introducing a pup into a clean cage from their nest, ultrasonic recording was performed for three minutes. These sound data files were originally recorded as part of other experiments (in preparation).

Rats

The pleasant call (PC) or distress call (DC) was recorded from an adult female rat (Rattus norvegicus domesticus; LEW/CrlCrlj, Charles River Laboratories Japan). For the recording of PC, the animal was stroked by hand on the experimenter’s lap for around 5 minutes. To elicit DC, a different animal was transferred to a wire-topped experimental cage and habituated to the cage for 5 minutes. Then, the animal received air-puff stimuli (0.3 MPa) with an inter-stimulus interval of 2 s to the nape from a distance of approximately 5–10 cm. Immediately after 30 air-puff stimuli were delivered, USVs were recorded for 5 min. These vocalizations were detected by a microphone placed at a distance of approximately 15–20 cm from the target animal. The detected sound was digitally recorded at a sampling rate of 384 kHz.

Gerbils

Vocalizations of the Mongolian gerbil (Meriones unguiculatus) were recorded via a microphone positioned 35 cm above an animal cage that was positioned in the center of a soundproof room. The sound was digitized at a sampling rate of 250 kHz. This sound data was originally obtained as part of a previous study [15]. Here, we targeted only calls with fundamental frequencies in the ultrasonic range (20 kHz or more; i.e., upward FMs and upward sinusoidal FMs), which are often observed under conditions that appear to be mating and non-conflict contexts [15].

Performance tests

Segmentation performance score

We quantified the segmentation performance of our software by calculating accuracy scores. First, the onset and offset timestamps of detected syllables in each data file were converted into a boxcar function that indicated the detection status as 1 (detected) or 0 (rejected) at every 1-ms time step. We counted the number of time frames containing true-positive or true-negative detections as hit and correct-rejection counts, respectively. The accuracy score was then calculated as the average of the hit and correct-rejection rates. We additionally calculated an inter-rater agreement score, Cohen’s κ [58], to assess the degree of agreement between our software and human experts, using the formula κ = (pa − pc) / (1 − pc), where pa is the accuracy and pc is the expected probability that the two raters agree by chance.
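The scoring described above can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and it assumes that pa in the κ formula is the raw frame-wise agreement between the two boxcar functions (the standard definition of Cohen's κ), which may differ from the averaged accuracy score. It also assumes both boxcars contain at least one detected and one rejected frame.

```python
def to_boxcar(segments, duration_s, step_s=0.001):
    """Turn (onset, offset) pairs in seconds into a 0/1 list, one value per time step."""
    n = int(round(duration_s / step_s))
    box = [0] * n
    for onset, offset in segments:
        start = max(0, int(round(onset / step_s)))
        stop = min(n, int(round(offset / step_s)))
        for i in range(start, stop):
            box[i] = 1
    return box


def segmentation_scores(reference, detected):
    """Return (accuracy, kappa) for two equal-length boxcar lists.

    accuracy: average of hit rate and correct-rejection rate.
    kappa:    (pa - pc) / (1 - pc), with pa taken as raw frame-wise
              agreement (an assumption of this sketch).
    """
    pairs = list(zip(reference, detected))
    tp = sum(1 for r, d in pairs if r == 1 and d == 1)  # hits
    tn = sum(1 for r, d in pairs if r == 0 and d == 0)  # correct rejections
    fp = sum(1 for r, d in pairs if r == 0 and d == 1)  # false alarms
    fn = sum(1 for r, d in pairs if r == 1 and d == 0)  # misses
    n = len(pairs)

    hit_rate = tp / (tp + fn)
    cr_rate = tn / (tn + fp)
    accuracy = (hit_rate + cr_rate) / 2

    p_a = (tp + tn) / n
    # Chance agreement from the marginal rates of each rater.
    p_c = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_a - p_c) / (1 - p_c)
    return accuracy, kappa


# Toy example (not real data): a 100-ms file with two reference syllables;
# the detector finds the second one 5 ms late.
ref = to_boxcar([(0.010, 0.030), (0.050, 0.070)], duration_s=0.1)
det = to_boxcar([(0.010, 0.030), (0.055, 0.070)], duration_s=0.1)
accuracy, kappa = segmentation_scores(ref, det)
```

Averaging the hit and correct-rejection rates keeps the score from being dominated by the silent frames, which usually far outnumber the vocal ones in continuous recordings.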

Threshold optimization

When the threshold is set too high, the segmentation procedure misses weak vocal sounds; when it is set too low, it mistakenly detects noise as syllables. To find an optimal threshold value for normal recording conditions, we assessed segmentation performance on a dataset of mouse USVs while changing the threshold value. For this test, we used 10 files of C57BL/6J mice from the dataset. We varied the threshold value from 3.0 to 6.0 in steps of 0.5. The optimal value was defined as the value that showed a peak in the accuracy index.
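A threshold sweep of this kind might be organized as in the following sketch. The helper names and the `score_fn` callback are hypothetical; in practice `score_fn` would run the segmentation at a given threshold and return the accuracy averaged over the files. The scoring curve used in the example is an arbitrary toy function, not measured data, and the location of its peak carries no meaning.

```python
def candidate_thresholds(start=3.0, stop=6.0, step=0.5):
    """List the threshold values from start to stop (inclusive) in equal steps."""
    n = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(n)]


def pick_optimal_threshold(thresholds, score_fn):
    """Evaluate score_fn at each threshold and return (best_threshold, all_scores)."""
    scores = {t: score_fn(t) for t in thresholds}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-in for "mean accuracy over the 10 files at threshold t".
def toy_score(t):
    return 1.0 - 0.01 * (t - 5.0) ** 2


thresholds = candidate_thresholds()
best, scores = pick_optimal_threshold(thresholds, toy_score)
```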

Comparison with conventional methods

To determine to what extent our procedure improved detection performance over conventional methods, we compared the performance of four conditions in which two processing steps were swapped with conventional ones. For conventional methods, we employed a normal windowing method (“singletaper” condition) using the Hann window to replace the multitaper method for making the spectrogram. We also used long-term spectral subtraction (“whitening” condition) to replace the flattening process. These two methods have been used as standard processing methods for signal detection algorithms [59]. Here, we swapped one or both methods (windowing and noise reduction) between ours and the conventional ones to make four conditions: multitaper+flattening, multitaper+whitening, singletaper+flattening, and singletaper+whitening. Then, the performance of each condition was tested on the mouse USV dataset that was used for the threshold optimization test. Note that the bandwidth threshold used in the detection process after binarization was adjusted to 3 for the singletaper method (it was 5 for the multitaper method) to correspond to its half bandwidth. A two-way repeated measures ANOVA was performed on the windowing factor (singletaper vs multitaper) and the noise-reduction factor (flattening vs whitening) with a significance threshold of α = 0.05.

Noise addition

As an additional analysis to assess robustness against noise, we carried out the performance test again on the multitaper+flattening and singletaper+flattening conditions after adding white noise to the original sound data at levels of −12, −6, 0, and 6 dB relative to the root-mean-square of the original sound amplitude. We did not test the whitening method here since it showed clearly lower performance than the flattening method in the original performance test. A two-way repeated measures ANOVA was performed on the windowing factor (singletaper vs multitaper) and the noise-level factor (none, −12, −6, 0, 6 dB) with a significance threshold of α = 0.05.
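Adding noise at a level defined relative to the signal's root-mean-square could look like the sketch below. It assumes Gaussian white noise and an amplitude (20·log10) decibel convention, neither of which is stated in the text; the function names and the test tone are hypothetical.

```python
import math
import random


def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def add_white_noise(signal, level_db, seed=0):
    """Add Gaussian white noise whose RMS is level_db relative to the signal RMS.

    Assumes an amplitude dB convention: noise_rms = signal_rms * 10**(level_db / 20).
    Returns (noisy_signal, noise).
    """
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in signal]
    target_rms = rms(signal) * 10 ** (level_db / 20)
    scale = target_rms / rms(raw)  # rescale so the noise hits the target RMS exactly
    noise = [scale * v for v in raw]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


# Toy 60-kHz tone sampled at 400 kHz for 5 ms (illustrative values only).
fs = 400_000
tone = [math.sin(2 * math.pi * 60_000 * i / fs) for i in range(2000)]

noisy_by_level = {}
for level in (-12, -6, 0, 6):
    noisy_by_level[level], _ = add_white_noise(tone, level)
```

At 0 dB the noise has the same RMS as the original recording, and at 6 dB it is roughly twice as large in amplitude.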

Ethical information

All procedures for recording vocalizations were approved by the Ethics Committee of Azabu University (#130226–04) or Kagoshima University (L18005) for mice, and the Animal Experiment Committee of Jichi Medical University (#17163–02) for rats.
