
Measuring context dependency in birdsong using artificial neural networks.

Takashi Morita, Hiroki Koda, Kazuo Okanoya, Ryosuke O Tachibana.

Abstract

Context dependency is a key feature of the sequential structure of human language: generating an appropriate word can require reference to words far apart in the produced sequence. Assessing how far back past context affects current behavior provides crucial information for understanding the mechanisms behind complex sequential behaviors. Birdsongs serve as a representative model for studying context dependency in sequential signals produced by non-human animals, but previously reported dependency lengths were upper-bounded by methodological limitations. Here, we estimated the context dependency in birdsongs in a more scalable way, using a modern neural-network-based language model whose accessible context length is sufficiently long. The detected context dependency was beyond the order of traditional Markovian models of birdsong, but was consistent with previous experimental investigations. We also studied the relation between the assumed/auto-detected vocabulary size of birdsong (i.e., fine- vs. coarse-grained syllable classifications) and the context dependency, and found that the larger the assumed vocabulary (i.e., the more fine-grained the classification), the shorter the detected context dependency.


Year:  2021        PMID: 34962915      PMCID: PMC8746767          DOI: 10.1371/journal.pcbi.1009707

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

Making behavioral decisions based on past information is a crucial task in the life of humans and animals [1, 2]. Thus, it is an important question in biology how far back past events affect animal behaviors. Such past records are not limited to observations of the external environment; they also include one's own behavioral history. A typical example is human language production: the appropriate choice of words to utter depends on previously uttered words/sentences. For example, we can tell whether 'was' or 'were' is the grammatical option after the phrase 'The photographs that were taken in the cafe and sent to Mary ____' only if we keep track of the previous words sufficiently long, at least up to 'photographs', and successfully recognize the two closer nouns (cafe and Mary) as modifiers rather than the main subject. Similarly, semantically plausible words are selected based on the topic of the preceding sentences, as exemplified by the appropriateness of 'olive' over 'cotton' after 'sugar' and 'salt' are used in the same speech/document. Such dependence on the production history is called context dependency and is considered a characteristic property of human language [3-6].

Birdsongs serve as a representative case study of context dependency in sequential signals produced by non-human animals. Their songs are sound sequences that consist of brief vocal elements, or syllables [7, 8]. Previous studies have suggested that these songs exhibit non-trivially long dependency on previous outputs [9-11]. Complex sequential patterns of syllables have been discussed in comparison with human language syntax from the viewpoint of formal linguistics [8, 12]. Neurological studies have also revealed homologous network structures for vocal production, recognition, and learning in songbirds and humans [13-15].

Along these lines, assessing whether birdsongs exhibit long context dependency is an important question for comparative studies, and several previous studies have addressed it using computational methods [9, 11, 16–18]. However, the reported lengths of context dependency were often measured with a limited language model (the Markov/n-gram model) that could only access a few recent syllables in the context. Thus, it is unclear whether those numbers reflected real dependency lengths in the birdsongs or merely the models' limitations. Moreover, there is accumulating evidence that birdsong sequencing is not precisely modeled by a Markov process [16, 17].

The present study aimed to assess the context dependency in songs of Bengalese finches (Lonchura striata var. domestica) using modern techniques for natural language processing. Recent advancements in machine learning, particularly in artificial neural networks, provide powerful language models [6, 19] that can simulate various time series data without hypothesizing a particular generative process behind them. Neural network-based models can also effectively use information from 200–900 syllables in the past (when the data include such long dependency) [5, 6], so the proposed analysis no longer suffers from the model limitations of the previous studies.

We performed the context dependency analysis in two steps: unsupervised classification of song syllables, followed by context-dependent modeling of the classified syllable sequences. The classification enabled flexible modeling of statistical ambiguity among upcoming syllables, which are not necessarily similar to one another in acoustics.
Moreover, it is preferable to have a common set of syllable categories shared across all birds, both to represent general patterns in the sequences and to provide the language model with as much data as possible. Conventional classification methods that depend on manual labeling by human experts could spoil such generality because of the arbitrariness involved in integrating the category sets across different birds. To satisfy these requirements, we employed a novel, end-to-end, unsupervised clustering method ("seq2seq ABCD-VAE", see Fig 1). Then, we assessed the context dependency in sequences of the classified syllables by measuring the effective context length [5, 6], which represents how much of the song production history impacts the prediction performance of a language model. The language model we used ("Transformer") behaves as a simulator of birdsong production and exploits the longest context among currently available models [6, 19].
Fig 1

Schematic diagram of newly proposed syllable classification.

(A) Each sound waveform segment was converted into a time-frequency representation (spectrogram) and assigned to one of the syllable categories by the unsupervised classification. (B) The unsupervised classification was implemented as a sequence-to-sequence version of the variational autoencoder, consisting of attention-based categorical sampling with the Dirichlet prior ("seq2seq ABCD-VAE"). The ABCD-VAE encoded syllables into discrete categories between the encoder and the decoder. A statistically optimal number of categories was detected under an arbitrarily specified upper bound, thanks to the Dirichlet prior. The identity of the syllable-uttering individual was provided to the decoder in addition to the syllable categories; accordingly, individual-specific patterns need not have been encoded in the discrete syllable representation.


Results

Unsupervised, individual-invariant classification of Bengalese finch syllables

We first converted birdsong syllables into discrete representations, or "labels". When predicting an upcoming syllable from previous outputs, probable candidates can have dissimilar acoustic profiles. For example, "bag" and "beg" in English are phonologically similar to each other but have different syntactic and semantic distributions, belonging to different grammatical categories (noun and verb, respectively). An appropriate language model must assign more similar probabilities to syntactically/semantically similar words like "bag" and "wallet" than to acoustically similar ones like "bag" and "beg". Likewise, it is desirable to perform the context dependency analysis of birdsong based on a flexible model of sequence processing that can handle ambiguity about possible upcoming syllables that do not necessarily resemble one another acoustically. Categorizing continuous-valued signals and predicting the assigned discrete labels based on a categorical distribution is a simple but effective way of achieving such flexible models, especially when paired with deep neural networks [20-22]. Syllable classification has also been adopted widely in previous studies of birdsong syntax [7, 11, 18, 23].

Recent studies have explored fully unsupervised classification of animal vocalization based on acoustic features extracted by an artificial neural network called the variational autoencoder, or VAE [24-26]. We extended this approach and propose a new end-to-end unsupervised clustering method named the ABCD-VAE, which utilizes attention-based categorical sampling with the Dirichlet prior (see also [27]). This method automatically classifies syllables into an unspecified number of categories in a statistically principled way. It also allowed us to exploit the speaker-normalization technique developed for unsupervised learning of human language from speech recordings [28, 29], yielding a syllable classification modulo individual variation. Having common syllable categories across different individuals helps us build a unified model of syllable sequence processing. Individual-invariant classification of syllables is also crucial for deep learning-based analysis, which requires a substantial amount of data; it is hard to collect sufficient data for training separate models on each individual.

We used a dataset of Bengalese finches' songs that was originally recorded for previous studies [30, 31]. Song syllables in the recorded waveform data were detected and segmented by amplitude thresholding. We collected 465,310 syllables in total from 18 adult male birds. Some of these syllables were broken off at the beginning/end of recordings. We filtered out these incomplete syllables and fed the remaining 461,994 syllables to the unsupervised classifier (Fig 1A). The classifier consisted of two concatenated recurrent neural networks (RNNs, see Fig 1B). We jointly trained the entire network such that the first RNN represented the entirety of each input syllable in its internal state ("encoding" in Fig 1B) and the second RNN restored the original syllable from the internal representation as precisely as possible ("decoding"). The encoded representation of the syllable was mapped to a categorical space ("embedding") before the decoding process. The number of syllable categories was detected automatically owing to the Dirichlet prior [32], which introduced an inductive bias favoring fewer categories and prevented overclassification.
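For readers who want a concrete picture of this architecture, the following is a minimal PyTorch sketch of a sequence-to-sequence autoencoder with a categorical bottleneck and decoder-side speaker conditioning. It is not the authors' implementation: the category count is fixed rather than auto-detected, and a Gumbel-softmax relaxation stands in for the attention-based categorical sampling with the Dirichlet prior; all names and hyperparameters are illustrative.

```python
# Minimal PyTorch sketch of a seq2seq autoencoder with a categorical
# bottleneck and decoder-side speaker conditioning, loosely following the
# description above. NOT the authors' implementation: a Gumbel-softmax
# relaxation stands in for the attention-based categorical sampling with
# the Dirichlet prior, and the category count K is fixed, not auto-detected.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqCategoricalVAE(nn.Module):
    def __init__(self, n_freq_bins, hidden=256, n_categories=128, n_birds=18):
        super().__init__()
        self.encoder = nn.LSTM(n_freq_bins, hidden, batch_first=True)
        self.to_logits = nn.Linear(hidden, n_categories)        # category scores
        self.category_emb = nn.Embedding(n_categories, hidden)  # discrete code
        self.bird_emb = nn.Embedding(n_birds, hidden)           # speaker info (decoder only)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.to_spectrum = nn.Linear(hidden, n_freq_bins)

    def forward(self, spectrogram, bird_id, tau=1.0):
        # spectrogram: (batch, time, n_freq_bins); "encoding" = final hidden state
        _, (h, _) = self.encoder(spectrogram)
        logits = self.to_logits(h[-1])                          # (batch, K)
        # Differentiable draw of a (near-)one-hot category assignment
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        code = onehot @ self.category_emb.weight                # (batch, hidden)
        # The decoder sees the category AND the bird identity, so the
        # category itself need not encode individual-specific detail.
        seed = (code + self.bird_emb(bird_id)).unsqueeze(1)
        out, _ = self.decoder(seed.repeat(1, spectrogram.size(1), 1))
        recon_loss = F.mse_loss(self.to_spectrum(out), spectrogram)
        # KL toward a uniform categorical prior; a Dirichlet prior over the
        # category probabilities would additionally shrink unused categories.
        q = F.softmax(logits, dim=-1)
        kl = (q * (q.clamp_min(1e-9).log() + math.log(q.size(-1)))).sum(-1).mean()
        return recon_loss + kl, onehot.argmax(-1)               # loss, category IDs
```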
As a result of this training, the classifier detected 37 syllable categories in total for all the birds (Fig 2B). Syllables that exhibited similar acoustic patterns tended to be classified into the same category across different birds (Fig 2A). Each bird produced only a subset of the syllable categories in its songs (Fig 2C): the syllable repertoire of each bird covered 24 to 36 categories (mean±SD: 32.39±3.35). The detected syllable vocabulary size was greater than the number of annotation labels used by a human expert (5–14; provided in [30]; see also [33] for parallel results based on different clustering methods). Conversely, each category consisted of syllables produced by 7 to 18 birds (mean±SD: 15.76±2.91). The detected categories appeared to align with major differences in the spectrotemporal pattern (Fig 2B).
Fig 2

Clustering results of Bengalese finch syllables based on the ABCD-VAE.

(A) Syllable spectrograms and their classification across individuals. Syllables in each of the first to third rows (orange box) were sampled from the same individual. Each column (blue frame) corresponds to the syllable categories assigned by the ABCD-VAE. The bottom row provides the spectrogram of each category with the greatest classification probability (MAP: maximum-a-posteriori) over all the individuals. The individual-specific examples also had the greatest classification probability (> 0.999) among the syllables of the same individual and category. (B) Spectrogram of the MAP syllable in each category. (C) Syllable counts per individual bird (rows) and category (columns). The number of non-zero entries is also reported in the line plots. (D) Comparison between syllable embeddings by the canonical continuous-valued VAE with the Gaussian noise (scatter points) and classification by the ABCD-VAE (grouped by the dotted lines). The continuous representation originally had 16 dimensions and was embedded into the 2-dimensional space by t-SNE. The continuous embeddings included notable individual variations represented by colors, whereas the ABCD-VAE classification ignored these individual variations.


Quantitative evaluation of syllable classification for Bengalese finch

Speaker-invariant clustering of birdsong syllables should meet at least two desiderata: (i) the resulting classification must remain consistent with the conventional bird-specific classification (i.e., clustered syllables must belong to the same bird-specific class), and (ii) the discovered syllable categories should be anonymized.

Regarding (i), we evaluated the alignment of the detected classification with manual annotations by a human expert [30]. We scored the alignment using two metrics. One was Cohen's kappa coefficient [34], which has been used to evaluate syllable classifications in previous studies [9, 30]. A problem with this metric is that it requires the two classifications to use the same set of categories, while our model predictions and the human annotations had different numbers of categories; thus, we needed to force-align each of the model-predicted categories to the most common human-annotated label in order to use the metric [9]. For example, suppose that the model classified 300 syllables into a category named "X". If 200 of the syllables in "X" are labeled "a" by the human annotator and the other 100 are labeled "b", then all the syllables in "X" receive "a" as their force-aligned label (Fig 3). This force-alignment makes the 100 syllables misaligned with their original label "b". In effect, force-aligned kappa scores the uniformity of syllables within the model-predicted categories with respect to the manual annotations.

To avoid the force-alignment and any other post-processing, we also evaluated the classification using a more recently developed metric called homogeneity [35]. Homogeneity checks whether the category-mate syllables according to the ABCD-VAE were annotated with the same manual label (see the Materials and methods section for its mathematical definition). Note that homogeneity does not penalize overclassification (see the supporting information S1 Text for additional evaluation that takes overclassification into account). For example, suppose that the ABCD-VAE classified 300 syllables into a category named "X" and another 300 into "Y". Homogeneity is maximized even if all the 300 syllables in "X" are labeled "a" and all the 300 in "Y" are also labeled "a", because all the category-mate syllables receive the same label. Instead, homogeneity penalizes label mismatches within the model-detected categories, as in the case where 200 of the "X" syllables are labeled "a" and the other 100 are labeled "b" (Fig 3). Thus, homogeneity can be seen as a unified version of Cohen's kappa plus force-alignment.
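A minimal sketch of the two alignment metrics, assuming scikit-learn is available. The force-alignment helper is our own illustration of the procedure described above (map each model-predicted category to its majority manual label, then score kappa); homogeneity_score implements the homogeneity metric directly.

```python
# Sketch of the two alignment metrics using scikit-learn. The
# force-alignment helper is our own illustration of the procedure above:
# map each model-predicted category to its majority manual label, then
# score Cohen's kappa on the realigned labels. Homogeneity needs no such
# post-processing.
from collections import Counter, defaultdict
from sklearn.metrics import cohen_kappa_score, homogeneity_score

def force_aligned_kappa(manual_labels, predicted_categories):
    by_category = defaultdict(list)
    for m, p in zip(manual_labels, predicted_categories):
        by_category[p].append(m)
    majority = {p: Counter(ms).most_common(1)[0][0] for p, ms in by_category.items()}
    aligned = [majority[p] for p in predicted_categories]
    return cohen_kappa_score(manual_labels, aligned)

# Toy example mirroring Fig 3: "X" mixes "a" and "b"; "b" is split over "X" and "Y".
manual = ["a", "a", "b", "b", "b", "b"]
pred   = ["X", "X", "X", "Y", "Y", "Y"]
print(force_aligned_kappa(manual, pred))   # penalizes the mixed category "X"
print(homogeneity_score(manual, pred))     # penalizes "X" too, but not the split of "b"
```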
Fig 3

Illustration of clustering evaluation by Cohen’s kappa coefficient and homogeneity.

The left-hand side shows the original clustering results, wherein the letters "a" and "b" represent manual annotations of individual data points, and the rounded rectangles covering the data points (labeled "X" and "Y") represent model-predicted categories. The right-hand side shows the result of the force-alignment used for computing Cohen's kappa coefficient, which replaces minority annotation labels with the majority label within each model-predicted category. The model-predicted category "X" contains both "a" and "b", and thus is not homogeneous. It is also penalized by Cohen's kappa: "X" is force-aligned with the majority annotation, "a", by relabeling all its data points as "a", and Cohen's kappa penalizes the replacement of the original "b" annotations with "a". On the other hand, the other model-predicted category, "Y", is homogeneous because it only contains "b" data. It incurs no penalty in the Cohen's kappa evaluation because the force-alignment does not change its original annotations. Note that the "b" data are split into "X" and "Y", but neither Cohen's kappa nor homogeneity penalizes this overclassification.

To assess fulfillment of the second desideratum for ideal clustering (ii), we quantified the speaker-normalization effect of the ABCD-VAE by measuring the perplexity of speaker identification. We built a simple speaker identification model that predicts the speaker from a syllable category uttered by the target bird, fitting the conditional categorical distribution to 90% of all the syllables by the maximum likelihood criterion and then evaluating the prediction probabilities on the other 10%. The log prediction probabilities of the test data were averaged (yielding a cross-entropy), and the negative of this mean was exponentiated to yield the perplexity (a code sketch of this computation is given below). Intuitively, the perplexity indicates the number of birds among whom we would still have to guess at chance to identify the target speaker, even given the syllable category uttered by that bird. Thus, greater perplexity is an index of successful speaker-normalization.

We compared the performance of the ABCD-VAE with baseline scores provided by the combination of the canonical continuous-valued VAE (which we call the Gauss-VAE) [24-26] and the Gaussian mixture model (GMM) [32, 36, 37]. This baseline model can be seen as a non-end-to-end version of our clustering method, with separate optimizations for feature extraction and clustering. The Gauss-VAE was trained on the same datasets and by the same procedure as the ABCD-VAE. The GMM, on the other hand, was trained in several ways. First, we built both bird-specific and common models: the former consisted of multiple models, each trained on data collected from a single individual bird, whereas the latter was a single model trained on the entire data collected from all the birds. The bird-specific clusterings provide "topline" scores because the gold-standard annotations by the human expert were also defined in a bird-specific way, and hence they do not suffer from the individual variations included in the Gauss-VAE features. The all-birds-together classifications, by contrast, tell us how difficult the clustering is without end-to-end optimization or speaker normalization and thus serve as a baseline. Another kind of variation in the GMMs we tested was the number of syllable categories.
We tested three ways of determining the number: (i) equal to the number automatically detected by the ABCD-VAE, (ii) equal to the number of labels in the manual annotations by the human expert, and (iii) automatically detected from the distribution of the syllable features defined by the Gauss-VAE. For (i) and (ii), we specified the number of mixture components of the GMM and trained the GMM by the maximum likelihood criterion. For (iii), we used Bayesian estimation of the active mixture components under the Dirichlet distribution prior [32].

As a result, the ABCD-VAE achieved a greater kappa coefficient on average than the baseline models without subject-specific training (Table 1). Moreover, the comparison of the worst-bird scores ("min" in the table) showed that the ABCD-VAE was more robust than the topline models that were optimized to each bird separately. The ABCD-VAE achieved "almost perfect agreement" with the human expert (κ > 0.8) for sixteen of the eighteen birds and "substantial" agreement (0.6 < κ ≤ 0.8) for the other two [38]. Similarly, the ABCD-VAE outperformed the baseline classifications in the average and worst-bird homogeneity scores. This result was also competitive with the topline models, especially regarding the worst-bird score. These results suggest that the syllable categories discovered by the ABCD-VAE remained consistent with the conventional subject-specific classifications, whereas this consistency was lost in the other all-birds-together classifications without speaker-normalization. In the meantime, the ABCD-VAE scored the greatest individual perplexity, indicating that the discovered syllable categories were more anonymized and individual-invariant than the baselines (see also Fig 2D).
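The speaker-identification perplexity described above can be sketched as follows; this is an illustrative reimplementation, not the authors' code. The smoothing constant `alpha` is our own addition to avoid zero probabilities on held-out pairs.

```python
# Illustrative reimplementation (not the authors' code) of the
# speaker-identification perplexity described above: fit P(bird | category)
# by normalized counts on 90% of syllables, then exponentiate the negative
# mean log-probability on the held-out 10%.
import numpy as np

def speaker_perplexity(categories, bird_ids, train_frac=0.9, alpha=1e-9, seed=0):
    categories, bird_ids = np.asarray(categories), np.asarray(bird_ids)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(categories))
    n_train = int(train_frac * len(idx))
    train, test = idx[:n_train], idx[n_train:]
    cats, birds = np.unique(categories), np.unique(bird_ids)
    c_ix = {c: i for i, c in enumerate(cats)}
    b_ix = {b: j for j, b in enumerate(birds)}
    counts = np.full((len(cats), len(birds)), alpha)   # rows: categories, cols: birds
    for i in train:
        counts[c_ix[categories[i]], b_ix[bird_ids[i]]] += 1.0
    p_bird_given_cat = counts / counts.sum(axis=1, keepdims=True)
    log_p = [np.log(p_bird_given_cat[c_ix[categories[i]], b_ix[bird_ids[i]]]) for i in test]
    return float(np.exp(-np.mean(log_p)))   # higher = better anonymized categories
```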
Table 1

Quantitative evaluation of the clustering by the ABCD- vs. Gauss-VAE for Bengalese finch syllables.

Cohen’s kappa coefficient and homogeneity evaluated the alignment of the discovered clusters with manual annotations by a human expert. These scores for each individual bird were computed separately and their mean, maximum, and minimum over the individuals were reported since the manual annotation was not shared across individuals (see Materials and methods). Additionally, the perplexity of individual identification scored the amount of individuality included in the syllable categories yielded by the ABCD- and Gauss-VAE. The best scores are in boldface (results under the all-birds-together and bird-specific settings were ranked separately).

Method                               | # of clusters (source) | Cohen's Kappa mean [min, max] | Homogeneity mean [min, max] | Speaker Perplexity
ABCD-VAE                             | 37                     | 0.8990 [0.7740, 0.9929]       | 0.9084 [0.7635, 0.9868]     | 8.0434
Gauss-VAE + GMM (All-Birds-Together) | 37 (ABCD-VAE)          | 0.7446 [0.5956, 0.8912]       | 0.7844 [0.6004, 0.9086]     | 4.0783
                                     | 14 (manual)            | 0.6057 [0.4250, 0.8972]       | 0.6718 [0.5053, 0.8536]     | 6.7212
                                     | ≥128 (auto-detected)   | 0.8475 [0.5725, 0.9911]       | 0.8773 [0.6666, 0.9869]     | 1.7112
Gauss-VAE + GMM (Bird-Specific)      | 37 (ABCD-VAE)          | 0.9304 [0.6619, 0.9906]       | 0.9292 [0.6479, 0.9893]     | —
                                     | 5–14 (manual)          | 0.7888 [0.5012, 0.9328]       | 0.8090 [0.4732, 0.9254]     | —
                                     | 50–109 (auto-detected) | 0.9516 [0.7629, 0.9982]       | 0.9505 [0.7687, 0.9962]     | —


Unsupervised classification of zebra finch syllables

To further assess the effectiveness and limitations of the ABCD-VAE, the same clustering was performed on syllables of zebra finches (Taeniopygia guttata). We collected 237,610 syllables from 20 adult male zebra finches. Again, the data included incomplete syllables that were broken off at the beginning/end of recordings; after filtering out those incomplete syllables, we fed the remaining 231,792 to the ABCD-VAE.

Speaker-normalized classification of zebra finch syllables was not as successful (or interpretable) as that of Bengalese finch syllables. While the syllables were classified into 17 categories in total (8 to 14 categories covered by a single bird, mean±SD: 11.2±1.77), most of the classifications were not confident: 10 of the 17 detected categories had a low mean classification probability under 30%, whereas all but two categories of Bengalese finch syllables had a mean classification probability over 75% (Fig 4C). Syllables with seemingly major spectral differences were forced into the same category across individuals (Fig 4A). Specifically, syllables consisting of multiple segments with distinct spectral patterns (or notes) seem to lack correspondents in different birds' repertoires (e.g., Categories 14 and 16).
Fig 4

Clustering results of zebra finch syllables based on the ABCD-VAE.

(A) Syllable spectrograms and their classification across individuals. Syllables in each of the first to third rows (orange box) were sampled from the same individual. Each column (blue frame) corresponds to the syllable categories assigned by the ABCD-VAE. The bottom row provides the spectrogram of each category with the greatest classification probability (MAP: maximum-a-posteriori) over all the individuals. The individual-specific examples had a top-5 classification probability among the syllables of the same individual and category. (B) Syllable counts per individual bird (rows) and category (columns). The number of non-zero entries is also reported in the line plots. (C) Mean classification probability of Bengalese finch (left) and zebra finch (right) syllables per category.

Quantitative evaluation also indicates that the speaker-normalized clustering of zebra finch syllables by the ABCD-VAE was not as well-aligned with the bird-specific human annotations as that of Bengalese finch syllables (Table 2). While the topline bird-specific models scored about 0.9 in Cohen's kappa coefficient and homogeneity, the scores of the ABCD-VAE stayed around 0.7. Nevertheless, it is of note that the ABCD-VAE outperformed the baseline all-birds-together models, except the one that automatically detected the number of categories (reaching the upper bound of 128). This auto-detection model achieved high Cohen's kappa and homogeneity by specializing its categories to individual birds (i.e., by resorting to individual-specific classifications); as a result, the model scored a low individual perplexity, indicating that each individual was almost completely identifiable from the model-predicted category of a syllable. By contrast, the ABCD-VAE used only 17 categories, and its high individual perplexity indicates that those categories were anonymized. Looking at each individual bird, the ABCD-VAE yielded "almost perfect agreement" with the manual annotations (κ > 0.8) for seven of the twenty birds, "substantial" agreement (0.6 < κ ≤ 0.8) for another seven, and "moderate" agreement (0.4 < κ ≤ 0.6) for the remaining six.
Table 2

Quantitative evaluation of the clustering by the ABCD- vs. Gauss-VAE for zebra finch syllables.

Cohen’s kappa coefficient and homogeneity evaluated the alignment of the discovered clusters with manual annotations by a human expert. These scores for each individual bird were computed separately and their mean, maximum, and minimum over the individuals were reported since the manual annotation was not shared across individuals (see Materials and methods). Additionally, the perplexity of individual identification scored the amount of individuality included in the syllable categories yielded by the ABCD- and Gauss-VAE. The best scores are in boldface (results under the all-birds-together and bird-specific settings were ranked separately).

Method                               | # of clusters (source) | Cohen's Kappa mean [min, max] | Homogeneity mean [min, max] | Speaker Perplexity
ABCD-VAE                             | 17                     | 0.7097 [0.4413, 0.9288]       | 0.6793 [0.4972, 0.8718]     | 12.2834
Gauss-VAE + GMM (All-Birds-Together) | 17 (ABCD-VAE)          | 0.6012 [0.2845, 0.9274]       | 0.6177 [0.3030, 0.8942]     | 4.3094
                                     | 13 (manual)            | 0.6102 [0.0401, 0.9741]       | 0.6315 [0.0433, 0.9609]     | 5.7021
                                     | ≥128 (auto-detected)   | 0.8938 [0.6843, 0.9915]       | 0.9016 [0.7643, 0.9894]     | 1.3092
Gauss-VAE + GMM (Bird-Specific)      | 17 (ABCD-VAE)          | 0.9579 [0.8847, 0.9938]       | 0.9545 [0.8828, 0.9905]     | —
                                     | 4–13 (manual)          | 0.8762 [0.7915, 0.9744]       | 0.8623 [0.7056, 0.9607]     | —
                                     | 18–47 (auto-detected)  | 0.9812 [0.9360, 1.0000]       | 0.9782 [0.9274, 1.0000]     | —


Analysis of context dependency

The classification described above provided us with sequences of categorically represented syllables. To assess the context dependency in these sequences, we then measured the difference between syllables predicted from full-length contexts and from truncated contexts. This difference grows as the truncated context gets shorter and carries less information, and it should be larger when the original sequence has longer context dependency (Fig 5A). Thus, the context dependency can be quantified as the minimum length of the truncated contexts at which the difference becomes undetectable [5, 6]. For the context-dependent prediction, we employed the Transformer language model [6, 19]. Transformer is known to capture long-distance dependency more easily than RNNs, since it can directly refer to any data point in the past at any time, while RNNs can only indirectly access past information through their internal memory [19, 39]. There is also accumulating evidence that Transformer successfully represents latent structures behind data, such as the hierarchies of human language sentences [19, 40, 41].
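As a rough illustration of the model class, here is a minimal causal Transformer language model over syllable-category sequences in PyTorch. The hyperparameters and the learned positional encoding are illustrative choices, not the configuration used in the study.

```python
# Minimal causal Transformer language model over syllable-category
# sequences, in PyTorch. Hyperparameters are illustrative only.
import torch
import torch.nn as nn

class SyllableTransformerLM(nn.Module):
    def __init__(self, n_categories, d_model=128, n_heads=8, n_layers=4, max_len=1024):
        super().__init__()
        self.embed = nn.Embedding(n_categories, d_model)
        self.pos = nn.Embedding(max_len, d_model)      # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, n_categories)

    def forward(self, tokens):
        # tokens: (batch, time) integer syllable categories
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask: position t may only attend to positions <= t
        mask = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.out(self.encoder(x, mask=mask))    # next-syllable logits

# Training minimizes the cross-entropy of each syllable given its predecessors:
#   loss = F.cross_entropy(logits[:, :-1].reshape(-1, K), tokens[:, 1:].reshape(-1))
```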
Fig 5

(A) Schematic diagram of the evaluation metric. The predictive probability of each categorized syllable (denoted by x) was computed using the trained language model, conditioned on the full and truncated contexts consisting of preceding syllables (highlighted in blue and orange, respectively). The logarithmic difference of the two predictive probabilities was evaluated, and the SECL was defined as the minimum length of the truncated context at which the prediction difference is not statistically significantly greater than a canonical threshold. (B) The differences in the mean loss (negative log probability) between the truncated- and full-context predictions of Bengalese finch songs. The x-axis corresponds to the length of the truncated context. The error bars show the 90% confidence intervals estimated from 10,000 bootstrapped samples. The loss difference is statistically significant if the lower side of the interval is above the threshold indicated by the horizontal dashed line.

Each sequence included syllables from a single recording. In this section, we only report the analysis of Bengalese finch songs; since the classification of zebra finch syllables was not as reliable as that of Bengalese finch syllables, we report the analysis of zebra finch songs as supplementary results in S3 Text. We obtained a total of 7,879 sequences of Bengalese finch syllables (each containing 8–338 syllables, 59.06 syllables on average), and used 7,779 of them to train the Transformer (see Table 3). The remaining 100 sequences were used to score its predictive performance, from which the dependency was calculated. The model predictions were given as the log conditional probability of each test syllable (x) given the preceding syllables in the same sequence. We compared the model predictions between the full-context ("Full", Fig 5A) and the truncated-context ("Truncated") conditions. The context dependency was then quantified by a statistical measure of the effective context length [5, 6]: the minimum length of the truncated context at which the mean prediction difference between the two conditions was not significantly greater than the canonical 1% threshold in perplexity [42].
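A sketch of the SECL computation as described above. `model.log_prob(context, target)` is a hypothetical interface to the trained language model; the 1% perplexity threshold enters as log(1.01), and significance is judged by the lower bound of a 90% bootstrap confidence interval, following the text.

```python
# Sketch of the SECL computation described above. `model.log_prob(context,
# target)` is a HYPOTHETICAL interface to the trained language model.
import numpy as np

def secl(model, test_sequences, max_context=15, n_boot=10_000, alpha=0.10, seed=0):
    threshold = np.log(1.01)                             # 1% threshold in perplexity
    rng = np.random.default_rng(seed)
    for k in range(1, max_context + 1):
        diffs = []
        for seq in test_sequences:
            for t in range(max_context, len(seq)):       # syllables with enough history
                full = model.log_prob(seq[:t], seq[t])
                trunc = model.log_prob(seq[t - k:t], seq[t])
                diffs.append(full - trunc)               # truncated-minus-full loss
        diffs = np.asarray(diffs)
        # Bootstrap the mean loss difference and take the lower CI bound
        boot = np.array([rng.choice(diffs, diffs.size, replace=True).mean()
                         for _ in range(n_boot)])
        if np.quantile(boot, alpha / 2) <= threshold:    # no longer significantly worse
            return k
    return max_context
```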
Table 3

The size of the training and test data used in the neural language modeling of Bengalese finch and zebra finch songs.

The “SECL” portion of the test syllables was used to estimate the SECL. The numbers of syllables in parentheses report the incomplete syllables that were broken off at the start/end of recordings, which were labeled with a distinct symbol.

Species         | Usage                 | # of sequences | # of syllables: Total | # of syllables: SECL
Bengalese Finch | Training (incomplete) | 7,779          | 458,992 (3,275)       | —
                | Test (incomplete)     | 100            | 6,557 (41)            | 4,657 (36)
Zebra Finch     | Training (incomplete) | 11,722         | 234,674 (5,763)       | —
                | Test (incomplete)     | 100            | 2,936 (55)            | 1,536 (49)

To see the relation between the number of syllable categories and the context dependency, we also performed the same analysis based on coarser/finer-grained syllable classifications into 10 to 80, 160, and 320 categories. These classifications were derived from k-means clustering on the L2-normalized feature vectors of the syllables given by the ABCD-VAE (a code sketch is given below).

The statistically effective context length (SECL) of the Bengalese finch song was eight based on the 37 syllable categories automatically detected by the ABCD-VAE (Fig 5B). In other words, restricting the available context to seven or fewer preceding syllables significantly decreased the prediction accuracy compared with the full-context baseline, while the difference became marginal when eight or more syllables were included in the truncated context. When syllables were classified into more fine-grained categories, the difference between the model predictions based on the truncated and full contexts became smaller (Fig 5B; p < 0.001 according to a linear regression of the loss difference on the number of syllable categories and the length of the truncated contexts, both in the log scale). That is, the context dependency traded off with the number of syllable categories. When 160 or 320 categories were assumed, the SECL of the Bengalese finch songs decreased to five.
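A minimal sketch of the coarse/fine-grained reclassification referenced above: k-means on L2-normalized syllable feature vectors, assumed here to be an (n_syllables, feature_dim) array extracted by the ABCD-VAE.

```python
# k-means on L2-normalized syllable feature vectors, as described above.
import numpy as np
from sklearn.cluster import KMeans

def regrain_syllables(features, n_categories):
    normalized = features / np.linalg.norm(features, axis=1, keepdims=True)
    return KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit_predict(normalized)

# e.g., labels = regrain_syllables(features, 160)  # one of 10-80, 160, or 320
```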

Discussion

This study assessed the context dependency in Bengalese finch’s song to investigate how long individual birds must remember their previous vocal outputs to generate well-formed songs. We addressed this question by fitting a state-of-the-art language model, Transformer, to the syllable sequences, and evaluating the decline in the model’s performance upon truncation of the context. We also proposed an end-to-end clustering method of Bengalese finch syllables, the ABCD-VAE, to obtain discrete inputs for the language model. In the section below, we discuss the results of this syllable clustering and then move to consider context dependency.

Clustering of syllables

The clustering of syllables into discrete categories played an essential role in our analysis of context dependency in Bengalese finch songs. Various studies have observed how fundamental the classification of vocal elements is to the study of animal vocalization [7, 11, 18, 43–45]. Our syllable clustering is based on the ABCD-VAE [27] and features the following advantages over previous approaches. First, the ABCD-VAE works in a completely unsupervised fashion. The system finds a classification of syllables from scratch instead of generalizing manual labeling of syllables by human annotators [30]; thus, the obtained results are more objective and reproducible [46]. Second, the ABCD-VAE automatically detects the number of syllable categories in a statistically grounded way (following Bayesian optimality under the Dirichlet prior) rather than pushing syllables into a pre-specified number of classes [28, 29, 47]. This update is of particular importance when we know little about the ground-truth classification (as in the case of animal song studies) and need a more non-parametric analysis. Third, the ABCD-VAE adopts the speaker-normalization technique used for human speech analysis and finds individual-invariant categories of syllables [28, 29]. Finally, the end-to-end clustering by the ABCD-VAE is more statistically principled than the previous two-step approach (acoustic feature extraction followed by clustering), because distinct feature extractors are not optimized for clustering, and the clustering algorithms are often blind to the optimization objective of the feature extractors [25, 26]. We consider that such a mismatch led the combination of the Gauss-VAE and GMM to detect greater numbers of syllable categories than the ABCD-VAE and the manual annotations, even when the clustering was specialized for each individual bird and not disturbed by individual variations (see Table 1). Chorowski et al. [29] also showed that a similar end-to-end clustering is better at finding speaker-invariant categories in human speech than the two-step approach.

We acknowledge that discrete representation of data is not the only way of removing individual variations; previous studies have also explored individual normalization on continuous-valued features using deep neural networks. Variational fair autoencoders (VFAE), for example, use speaker embeddings as background information for the VAE (in both the encoder and decoder, whereas the ABCD-VAE feeds the speaker information only to the decoder) [48]. As the authors note, however, the use of background information does not completely remove individual variations from the extracted features, because continuous-valued features can (in principle) distinguish infinitely many patterns and do not have a strong bottleneck effect like discrete categories, leaving the V(F)AE with little pressure to remove individual variations from the features (see also our supporting information S1 Text). Accordingly, the VFAE has an additional learning objective that minimizes the distances between feature vectors averaged within each speaker. More recently, researchers have started to use adversarial training to remove individual and other undesirable variations [49]. In adversarial training, an additional classifier module is installed in the model, and this classifier attempts to identify the individual from the corresponding feature representation. The rest of the model is trained to deceive the individual classifier into misclassification by anonymizing the encoded features.
Both the VFAE and adversarial training are compatible with the ABCD-VAE, and future studies may combine these methods to achieve stronger speaker-normalization effects. Note, however, that those normalization techniques would not yield speaker-invariant categories if no such categories exist; different individuals may exhibit completely different syllable repertoires, and forced alignment across individuals can be inappropriate in such cases. Specifically, we suspect that simply adopting other normalization methods would not lead to a more reliable classification of zebra finch syllables modulo speaker variations, unless we find more appropriate segmentation.

It should be noted that the classical manual classification of animal vocalization was often based on visual inspection of waveforms and/or spectrograms rather than auditory inspection [9, 30, 43]. Similarly, previous VAE analyses of animal vocalization often used a convolutional neural network that processed spectrograms as images of a fixed size [25, 26]. By contrast, the present study adopted an RNN [50] to process syllable spectra frame by frame, as time series data. Owing to the lack of ground truth as well as empirical limitations on experimental validation, it is difficult to adjudicate on the best neural network architecture for auto-encoding Bengalese finch syllables and other animals' vocalizations. Nevertheless, the RNN deserves close attention as a neural/cognitive model of vocal learning. A version of the RNN called the reservoir computer has been developed to model computations in cortical microcircuits [51, 52]. Future studies may replace the LSTM in the ABCD-VAE with a reservoir computer to build a more biologically plausible model of vocal learning [53]. Similarly, we may filter some frequency bands in the input sound spectra to simulate the auditory perception of the target animal [29], and/or adopt more anatomically/bio-acoustically realistic articulatory systems for the decoder module [54]. Such embodied VAEs would allow constructive investigation of vocal learning beyond mere acoustic analysis.

A visual inspection of the classification results shows that the ABCD-VAE can discover individual-invariant categories of Bengalese finch syllables (Fig 2), which was also supported by their alignment with the human annotations and the low individuality of the classified syllables (Table 1). This speaker-normalization effect is remarkable because the syllables exhibit notable individual variations in the continuous feature space produced by the canonical VAE, where cross-individual clustering is difficult [25, 26, 55] (see Fig 2D and the supporting information S1 Text). Previous studies on Bengalese finches and other songbirds often assigned distinct sets of categories to syllables of different individuals, presumably because of similar individual variations in the feature spaces they adopted [9, 11, 30, 45]. By contrast, the speaker-normalized clustering of zebra finch syllables was less successful, as evidenced by the lower classification probability (Fig 4C) and lower consistency with the speaker-specific manual annotations (Table 2) compared with Bengalese finch syllables. A visual inspection of category-mate syllables across individuals suggests that one major challenge for finding individual-invariant categories is the complex syllables that contain multiple elements, or 'notes', without clear silent intervals (gaps; Fig 4A).
Such complex syllables may be better analyzed by segmenting them into smaller vocal units [12, 56–59]. The prerequisite of appropriate voice segmentation is a major limitation of the proposed method, because the unclarity of segment boundaries in low-level acoustic spaces is a common problem in analyses of vocalization, especially of mammalian vocalization [45], including human speech [60, 61]. A possible solution to this problem (in accordance with our end-to-end clustering) is to categorize sounds frame by frame (e.g., by spectra and MFCCs) and merge contiguous classmate frames to define a syllable-like span [27, 29, 62, 63].

Context dependency

According to our analysis of context dependency, Bengalese finches are expected to keep track of up to eight previously uttered syllables (not just one or two) during their singing. This is evidenced by the relatively poor performance of the song simulator conditioned on truncated contexts of one to seven syllables compared with the full-context condition.

Our findings add a new piece of evidence for the long context dependency in Bengalese finch songs found in previous studies. Katahira et al. [9] showed that the dependent context length was at least two: they compared first-order and second-order Markov models, which can only access the one or two preceding syllable(s), respectively, and found significant differences between them. A similar analysis was performed on canary songs by Markowitz et al. [11], with an extended Markovian order (up to seventh; note, however, that chunks of homogeneous syllable repeats, called phrases, were used as song units rather than individual syllables). The framework in these studies cannot scale up to assess longer context dependency, owing to the empirical difficulty of training higher-order Markov models [64, 65]. By contrast, the present study exploited a state-of-the-art neural language model (Transformer) that can effectively combine information from much longer contexts than previous Markovian models and potentially refer to up to 900 tokens [6]. Thus, the dependency length reported in this study is less likely to be upper-bounded by model limitations, and it provides a more precise estimation (or at least a tighter lower bound) of the real dependency length in birdsong than previous studies.

The long context dependency on eight previous syllables in Bengalese finch songs is also evidenced by experimental studies. Bouchard and Brainard [66] found that the activities of Bengalese finches' HVC neurons in response to listening to a syllable x_t encoded the probability of the preceding syllable sequence x_{t-L}, …, x_{t-1} (i.e., the context) given x_t, or P(x_{t-L}, …, x_{t-1} | x_t). They reported that the length L of the context encoded by HVC neurons (those that exhibited strong activities in response to the bird's own song) reached 7–10 syllables, which is consistent with the dependency length of eight syllables estimated in the present study.

Warren et al. [10] also provided evidence for long context dependency from a behavioral experiment. They reported that several pairs of syllable categories of Bengalese finch songs had different transitional probabilities depending on whether or not the same transition pattern had occurred at the previous opportunity; in other words, P(AB⋯AB̲) ≠ P(AC⋯AB̲), where A, B, C are distinct syllable categories, the dots represent intervening syllables of an arbitrary length (∌ A), and the underline indicates the position of B whose probability is measured. Moreover, they found that the probability of such history-dependent transition patterns is harder to modify through reinforcement learning than that of more locally dependent transitions. These results are consistent with our findings. It often takes more than two transitions for syllables to recur (12.24 syllables on average with an SD of 11.02 according to our own Bengalese finch data, excluding consecutive repetitions); therefore, the dependency on the previous occurrence cannot be captured by memorizing just one or two previously uttered syllables.

There is also a previous study that suggests a longer context dependency in Bengalese finch songs than estimated in this study (i.e., ≫ 8). Sainburg et al.
[18] studied the mutual information between birdsong syllables (including Bengalese finch ones) appearing at each discrete distance. They analyzed patterns in the decay of the mutual information to diagnose the generative model behind the birdsong data, and reported that birdsongs were best modeled by a combination of a hierarchical model, often adopted for human language sentences, and a Markov process: subsequences of the songs were generated by a Markov process, and those subsequences were structured into a hierarchy. Mutual information decayed exponentially in the local Markov domain, but the decay slowed down and followed a power law as the inter-syllable distance became large. Sainburg et al. estimated that this switch in the decay pattern occurred at an inter-syllable distance of around 24 syllables, substantially longer than our estimated context dependency of eight syllables.

The difference between the two results might be attributed to several factors. First, the long-distance mutual information may not be useful for the specific task of predicting upcoming syllables that defined the context dependency here and in previous studies based on language modeling. It is possible that all the information necessary for the task is available locally, even though the mutual information does not asymptote in the local domain (see S4 Text for concrete examples). Another possible factor responsible for the longer context dependency detected by Sainburg et al. is that their primary analysis was based on long-sequence data concatenating all syllables recorded in a single day (amounting to 2,693–34,588 syllables, 11,985.56 on average, manually annotated with 16–26 labels per individual). Importantly, they also showed that the bimodality of the mutual information decay in the Bengalese finch song became less clear when the analysis was performed on bouts (consisting of 8–398 syllables, 80.98 on average). Since our data were more akin to the latter, potential long dependency in the hierarchical domain might have been too weak to be detected by the language modeling-based analysis.

We also found that the greater the number of assumed syllable categories, the shorter the context length sufficient for predicting upcoming syllables. We attribute this result to the minor acoustic variations among syllables that are ignored as noise in the standard clustering or manual classification but encoded in the fine-grained classifications. When predicting upcoming syllables based on the fine-grained categories, the model has to identify the minor acoustic variations encoded by those categories. Indeed, it has been reported that category-mate syllables (defined by manual annotations) exhibit systematically different acoustic variations depending on the type of the surrounding syllables [67]. Thus, the identification of fine-grained categories should improve by referring to the local context rather than to syllables far from the prediction target. This increases the importance of the local context compared with predictions of more coarse-grained categories.

The reported context dependency on previous syllables also has an implication for possible models of birdsong syntax: feasible models should be able to represent the long context efficiently.
For example, the simplest and most traditional model of birdsong and the vocal sequences of other animals (including human language before the deep learning era) is the n-gram model, which exhaustively represents all the possible contexts of length n − 1 as distinct conditions [7, 64, 65]. This approach, however, requires an exponential number of contexts to be represented in the model. In the worst case, the number of possible contexts in Bengalese finch songs is 37^8 = 3,512,479,453,921 when there are 37 syllable types and the context length is eight, as detected in this study. While the effective context length can be shortened if birds have a larger vocabulary size, the number of logically possible contexts remains huge (e.g., 160^5 = 104,857,600,000). Such an exhaustive representation is not only hard to store and learn (for both real birds and simulators) but also uninterpretable to researchers. Thus, a more efficient representation of the context syllables is required [68].

Katahira et al. [9] assert that the song syntax of the Bengalese finch can be better described with a lower-order hidden Markov model (HMM) [69] than with the n-gram model. Moreover, hierarchical language models used in computational linguistics (e.g., the probabilistic context-free grammar) are known to allow a more compact description of human language [70] and animal vocal sequences [71] than sequential models like the HMM. Another possibility for compression is to represent consecutive repetitions of the same syllable category differently from transitions between heterogeneous syllables [16, 17] (see also [72] for neurological evidence for different treatments of heterosyllabic transitions and homosyllabic repetitions). This idea is essentially equivalent to the run-length encoding of digital signals (e.g., AAABBCDDEEEEE can be represented as 3A2B1C2D5E, where the numbers count the repetitions of the following letter; see the toy sketch at the end of this section) and is effective for data including many repetitions, like the Bengalese finch's song. For the actual implementation in birds' brains, long contexts can be represented in a distributed way [73]: activation patterns of a neuronal ensemble can encode a larger amount of information than the simple sum of the information representable by individual neurons, as demonstrated by the achievements of artificial neural networks [51, 52, 74].

We conclude the present paper by noting that the analysis of context dependency via neural language modeling is not limited to the songs of Bengalese/zebra finches. Since neural networks are universal approximators and can potentially fit any kind of data [75, 76], the same analytical method is applicable to other animals' vocal sequences [11, 43, 71], given reasonable segmentation and classification of the sequence components, like syllables. Moreover, the analysis of context dependency can in principle be performed on other sequential behavioral data besides vocalization, including dance [77, 78] and gestures [79, 80]. Hence, our method provides a crossmodal research paradigm for inquiry into the effect of past behavioral records on future decision-making.
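The toy sketch below illustrates the run-length encoding idea mentioned in the discussion; it is purely expository.

```python
# Toy illustration of run-length encoding: consecutive repetitions of a
# syllable category compress to (count, category) pairs.
from itertools import groupby

def run_length_encode(syllables):
    return [(len(list(group)), cat) for cat, group in groupby(syllables)]

print(run_length_encode("AAABBCDDEEEEE"))
# [(3, 'A'), (2, 'B'), (1, 'C'), (2, 'D'), (5, 'E')]  i.e., 3A2B1C2D5E
```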

Materials and methods

Recording and preprocessing

We used the same recordings of Bengalese finch songs that were originally reported in our earlier studies [30, 31]. The data were collected from 18 Bengalese finches, each isolated in a birdcage placed inside a soundproof chamber. All the birds were adult males (>140 days after hatching). All but two birds were obtained from commercial breeders; the other two (bird IDs b10 and b20) were raised in laboratory cages. Note that one bird (b20) was a son of another (b03) and learned its song from the father bird. No other birds had any known family relationship. The microphone (Audio-Technica PRO35) was installed above the birdcages. The output of the microphone was amplified using a mixer (Mackie 402-VLZ3) and digitized through an audio interface (Roland UA-1010/UA-55) at 16 bits with a sampling rate of 44.1 kHz. The recordings were then down-sampled to 32 kHz [30, 31]. The recording process was automatically started upon detection of vocalization and terminated when no voice was detected for 500–1000 msec (the threshold was adjusted for individual birds). Thus, the resulting recordings roughly corresponded to bout-level sequences, and we used them as the sequence unit for the analysis of context dependency. An additional dataset of song recordings from 20 zebra finches was kindly provided by Prof. Kazuhiro Wada (Hokkaido University). These recordings were made following the same procedure as previously reported [81, 82].

Song syllables were segmented from the continuous recordings using the thresholding algorithm proposed in the previous studies [30, 31]. The original waveforms were first bandpass-filtered at 1–8 kHz. Then, we obtained their amplitude envelope via full-wave rectification and lowpass-filtered it at 200 Hz. Syllable onsets and offsets were detected by thresholding this amplitude envelope at a predefined level, which was set at 6–10 SD above the mean of the background noise level (the exact coefficient of the SD was adjusted for individual birds). The mean and SD of the background noise were estimated from the sound level histogram. Sound segments detected by this thresholding algorithm were sometimes too close to their neighbors (typically separated by a <5 msec interval), and such coalescent segments were reidentified as a single syllable by lower-bounding possible inter-syllable gaps at 3–13 msec for Bengalese finches and 3–10 msec for zebra finches (both adjusted for individual birds). Finally, extremely short sound segments were discarded as noise by setting a lower bound on possible syllable durations at 10–30 msec for Bengalese finches and 5–30 msec for zebra finches (adjusted for individual birds). These segmentation processes yielded 465,310 Bengalese finch syllables (≈ 10.79 hours) and 237,610 zebra finch syllables (≈ 7.72 hours) in total. (A code sketch of this thresholding pipeline is given at the end of this subsection.)

To perform an analysis parallel to that of discrete human language data, we classified the segmented syllables into discrete categories in an unsupervised way. Specifically, we used an end-to-end clustering method, named the seq2seq ABCD-VAE, that combined (i) neural network-based extraction of syllable features and (ii) Bayesian classification, both of which worked in an unsupervised way (i.e., without top-down selection of acoustic features or manual classification of the syllables). This section provides an overview of our method, with a brief, high-level introduction to the two components. Interested readers are referred to S1 Text in the supporting information, where we provide more detailed information.
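For illustration, the segmentation procedure described above can be sketched as follows. This is a minimal reading of the text, not the authors' actual code; the filter orders, the threshold coefficient, and the gap/duration bounds are placeholders within the reported ranges:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def segment_syllables(wave, sr=32000, noise_mean=0.0, noise_sd=1.0,
                      n_sd=8.0, min_gap=0.005, min_dur=0.02):
    """Threshold-based syllable segmentation sketched from the description
    in the text; returns (onset, offset) sample indices. Assumes the
    recording starts and ends below the threshold."""
    # Bandpass the waveform at 1-8 kHz.
    band = butter(4, [1000, 8000], btype="band", fs=sr, output="sos")
    filtered = sosfiltfilt(band, wave)
    # Full-wave rectification, then lowpass the envelope at 200 Hz.
    low = butter(4, 200, btype="low", fs=sr, output="sos")
    envelope = sosfiltfilt(low, np.abs(filtered))
    # Threshold at n_sd SDs above the mean background noise level.
    above = envelope > noise_mean + n_sd * noise_sd
    edges = np.diff(above.astype(int))
    onsets = np.where(edges == 1)[0] + 1
    offsets = np.where(edges == -1)[0] + 1
    # Merge segments separated by less than the minimum inter-syllable gap.
    merged = []
    for on, off in zip(onsets, offsets):
        if merged and on - merged[-1][1] < min_gap * sr:
            merged[-1] = (merged[-1][0], off)
        else:
            merged.append((on, off))
    # Discard extremely short segments as noise.
    return [(on, off) for on, off in merged if off - on >= min_dur * sr]
```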
One of the challenges in clustering syllables is their variable duration, as many existing clustering methods require their input to be a fixed-dimensional vector; it is thus convenient to represent the syllables in such a format [83, 84]. Previous studies on animal vocalization often used acoustic features such as syllable duration, mean pitch, spectral entropy/shape (centroid, skewness, etc.), mean spectrum/cepstrum, and/or Mel-frequency cepstral coefficients at some representative points for the fixed-dimensional representation [9, 30, 71]. In this study, we took a non-parametric approach based on a sequence-to-sequence (seq2seq) autoencoder [85]. The seq2seq autoencoder is an RNN that first reads the whole spectral sequence of an input syllable frame by frame (encoding; the spectral sequence was obtained by the short-time Fourier transform with an 8 msec Hanning window and a 4 msec stride), and then reconstructs the input spectra (decoding; see the schematic diagram of the system in the upper half of Fig 1B). Improving the precision of this reconstruction is the training objective of the seq2seq autoencoder. For successful reconstruction, the RNN must store the information about the entire syllable in its internal state—represented by a fixed-dimensional vector—when it transitions from the encoding phase to the decoding phase. This internal state of the RNN served as the fixed-dimensional representation of the syllables. We implemented the encoder and decoder RNNs with the LSTM [50]. (A minimal code sketch of this autoencoder is given at the end of this subsection.)

One problem with the auto-encoded features of the syllables is that the encoder does not guarantee their interpretability. The only thing the encoder is required to do is pack the information about the entire syllable into a fixed-dimensional vector, and the RNN decoder is so flexible that it can map two neighboring points in the feature space to completely different sounds. A widely adopted solution to this problem is to introduce Gaussian noise into the features, turning the network into a variational autoencoder (VAE) [24, 85, 86]. Abstracting away from the mathematical details, the Gaussian noise prevents the encoder from representing two dissimilar syllables close to each other; otherwise, the noisy representations of the two syllables would overlap and the decoder could not reconstruct appropriate sounds for each.

The Gaussian VAE represents the syllables as real-valued vectors of an arbitrary dimension, and researchers need to apply a clustering method to these vectors in order to obtain discrete categories. This two-step analysis has several problems: (i) the VAE is not trained for the sake of clustering, and the entire distribution of the encoded features may not be friendly to existing clustering methods; and (ii) the encoded features often include individual differences and do not exhibit an inter-individually clusterable distribution (see Fig 2D and the supporting information S1 Text). To solve these problems, this study adopted the ABCD-VAE, which encoded data into discrete categories with a categorical noise under the Dirichlet prior and performed end-to-end clustering of syllables within the VAE (Fig 1B). The ABCD-VAE married discrete autoencoding techniques [28, 29, 47] and the Bayesian clustering popular in computational linguistics and cognitive science [36, 37].
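The following PyTorch sketch illustrates the seq2seq autoencoder idea described above. It is a minimal illustration under assumed dimensions (128 frequency bins, 256 hidden units), and it omits the variational (Gaussian or ABCD) noise discussed in the text; the actual architecture is detailed in S1 Text:

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """Minimal LSTM-based seq2seq autoencoder over syllable spectrograms
    (batch x frames x frequency bins); dimensions are illustrative."""
    def __init__(self, n_freq_bins=128, hidden_size=256):
        super().__init__()
        self.encoder = nn.LSTM(n_freq_bins, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(n_freq_bins, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, n_freq_bins)

    def forward(self, spectra):
        # Encoding: read the whole spectral sequence frame by frame; the
        # final state (h, c) is the fixed-dimensional syllable representation.
        _, (h, c) = self.encoder(spectra)
        # Decoding: reconstruct the input frames from that state, feeding
        # the right-shifted target frames as decoder inputs.
        shifted = torch.cat(
            [torch.zeros_like(spectra[:, :1]), spectra[:, :-1]], dim=1)
        out, _ = self.decoder(shifted, (h, c))
        return self.readout(out)

model = Seq2SeqAutoencoder()
batch = torch.randn(4, 50, 128)  # 4 syllables, 50 spectral frames each
loss = nn.functional.mse_loss(model(batch), batch)  # reconstruction objective
```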
The ABCD-VAE has the following advantages over the Gaussian VAE followed by independent clustering (the indices, except (iii), correspond to the problems with the Gaussian VAE listed above): (i) unlike the Gaussian VAE, the ABCD-VAE includes clustering in its learning objective, aiming at a statistically grounded discrete encoding of the syllables; (ii) the ABCD-VAE can exploit a speaker-normalization technique that has proven effective for discrete VAEs: the "Speaker Info." is fed directly to the decoder (Fig 1B), so individual-specific patterns need not be encoded in the discrete features [28, 29]; and (iii) thanks to the Dirichlet prior, the ABCD-VAE can detect a statistically grounded number of categories on its own [32]. This last point is the major update from the previous discrete VAEs, which eat up all the categories available [28, 29, 47]. Note that the ABCD-VAE can still measure the similarity/distance between two syllables by the cosine similarity of their latent representations immediately before the computation of the classification probability (i.e., the logits). The original category indices assigned by the ABCD-VAE were arbitrarily picked from 128 possible integers and were not contiguous; accordingly, the category indices reported in this paper were renumbered for better visualization.
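To make this similarity measure concrete, a minimal sketch of the cosine comparison between two syllables' pre-logit latent features (the function name and feature dimension are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def syllable_similarity(feat_a, feat_b):
    """Cosine similarity between the latent feature vectors of two
    syllables, taken immediately before the classification logits."""
    return F.cosine_similarity(feat_a, feat_b, dim=-1)

a, b = torch.randn(64), torch.randn(64)  # illustrative 64-dim features
print(float(syllable_similarity(a, b)))  # in [-1, 1]; 1 = same direction
```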

Other clustering methods

Clustering results of the ABCD-VAE were evaluated against baselines and toplines provided by the combination of feature extraction with the Gaussian VAE [24-26] and clustering of the VAE features with a Gaussian mixture model (GMM) [32, 36, 37]. The number K of GMM clusters was either predetermined or auto-detected: the former fitted K multivariate Gaussian distributions by the expectation-maximization algorithm, while the latter was implemented by Bayesian inference with a Dirichlet distribution prior, approximated by mean-field variational inference. Since a single run of expectation-maximization or variational inference only achieves a local optimum, the best of 100 runs with random initialization was adopted as the clustering result. We used the scikit-learn implementations of the GMMs (GaussianMixture and BayesianGaussianMixture) [87]; the default parameter values were used unless otherwise specified above. In the analysis of context dependency, we obtained fine-/coarse-grained classifications of syllables based on the features extracted immediately before the computation of the classification logits by the ABCD-VAE. The ABCD-VAE computes the classification probability based on the inner product of those features and the reference vector of each category. Thus, we can compute the similarity among syllables by their cosine in the feature space, and accordingly, we applied k-means clustering to the L2-normalized features. We again adopted the scikit-learn implementation of k-means clustering [87].
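A minimal scikit-learn sketch of this procedure follows. The best-of-100 restarts and the L2 normalization come from the description above; the function names and the use of the average log-likelihood as the selection criterion are our own reading:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def best_gmm_labels(features, k, n_runs=100):
    """Fit K-component GMMs with random restarts and keep the best-scoring
    run, mirroring the best-of-100 procedure described above.
    (BayesianGaussianMixture would be used instead to auto-detect K.)"""
    best_gmm, best_ll = None, -np.inf
    for seed in range(n_runs):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(features)
        ll = gmm.score(features)  # average log-likelihood of the data
        if ll > best_ll:
            best_gmm, best_ll = gmm, ll
    return best_gmm.predict(features)

def kmeans_on_cosine(features, k, seed=0):
    """k-means on L2-normalized features: Euclidean distance on the unit
    sphere is monotonically related to cosine similarity."""
    return KMeans(n_clusters=k, random_state=seed).fit_predict(
        normalize(features))
```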

Evaluation metrics of syllable clustering

The syllable classification yielded by the ABCD-VAE was evaluated by its alignment with manual annotation by a human expert. We used two metrics to score the alignment: Cohen's kappa coefficient [34] and homogeneity [35]. Cohen's kappa coefficient is a normalized index of the agreement rate between two classifications and has been used to evaluate syllable classifications in previous studies [9, 30]. One drawback of this metric is that it only works when the two classifications use the same set of categories. This requirement was not met in our case, as the model-predicted classification and the human annotation had different numbers of categories, so we needed to force-align each of the model-predicted categories to its most common human-annotated label before computing Cohen's kappa [9]. The second metric, homogeneity, can instead score the alignment between any pair of classifications, even with different numbers of categories. Homogeneity is defined based on the desideratum that each of the predicted clusters should only contain members of a single ground-truth class. Mathematically, violation of this desideratum is quantified by the conditional entropy of the distribution of ground-truth classes C given the predicted clusters K, H(C | K) = −Σ_k Σ_c (|c ∩ k| / N) log(|c ∩ k| / |k|), where N denotes the total number of data points, |c ∩ k| is the number of data points that belong to both the ground-truth class c and the model-predicted category k, and |k| is the size of cluster k. The unconditional entropy H(C) normalizes this quantity, yielding the homogeneity 1 − H(C | K)/H(C), which ranges between 0 and 1. As we noted in the Results section, homogeneity does not penalize overclassification, so it is often combined with another evaluation metric that scores overclassification, called completeness, to constitute a more comprehensive metric named the V-measure [35]. We report the completeness and V-measure scores of the syllable clustering results in the supporting information S1 Text.
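For concreteness, both metrics and the force-alignment step can be computed with scikit-learn; a minimal sketch (the helper name and the toy labels are ours):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, homogeneity_score

def force_aligned_kappa(pred, truth):
    """Map each model-predicted category to its most common human label,
    then compute Cohen's kappa, following the force-alignment above."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    mapping = {}
    for k in np.unique(pred):
        labels, counts = np.unique(truth[pred == k], return_counts=True)
        mapping[k] = labels[np.argmax(counts)]
    aligned = np.array([mapping[k] for k in pred])
    return cohen_kappa_score(aligned, truth)

pred = [0, 0, 1, 1, 2, 2]                 # model-predicted categories
truth = ["a", "a", "a", "b", "b", "b"]    # human annotations
print(force_aligned_kappa(pred, truth))
print(homogeneity_score(truth, pred))     # 1.0: clusters 0 and 2 are pure
```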

Language modeling

After the clustering of the syllables, each sequence was represented as a string of discrete symbols, x ≔ (x_1, …, x_T). We performed the analysis of context dependency on these discrete data. The analysis of context dependency made use of a neural language model based on the current state-of-the-art architecture, the Transformer [6, 19]. We trained the language model on 7,779 sequences of Bengalese finch syllables (amounting to 458,753 syllables in total; see Table 3). These training data were defined as the complement of 100 test sequences, which were selected so that (i) they were long enough and (ii) at least one sequence per individual singer was included: the sequences containing 15 or more syllables were selected as candidates; for each of the 18 Bengalese finches, one sequence was uniformly randomly sampled from the candidates uttered by that finch; and the other 82 sequences (80 in the corresponding zebra finch analysis, which involved 20 birds) were uniformly randomly sampled from the remaining candidates. The training objective was to estimate the probability of each whole sequence x conditioned on the information about the individual s uttering x, that is, P(x | s). Thanks to the background information s, the model did not need to infer the singer on its own. Hence, the estimated context dependency did not comprise correlations among syllables that merely reflect individuality, which would not count as a major factor, especially from a generative point of view. The joint probability P(x | s) was factorized as P(x | s) = ∏_{t=1}^{T} P(x_t | ⟨sos⟩, x_1, …, x_{t−1}, s); that is, the model took the form of a left-to-right processor, predicting each syllable x_t conditioned on the preceding context ⟨sos⟩, x_1, …, x_{t−1}, where ⟨sos⟩ stands for the special category marking the start of the sequence. See the supporting information S2 Text for details on the model parameters and training procedure. While the VAE training excluded incompletely recorded syllables positioned at the beginning/end of recordings, we included them in the language modeling by assigning them a distinct category. This corresponds to the replacement of infrequent words with the "unk(nown)" label in natural language processing.
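The factorization above translates directly into a causal (left-to-right) Transformer. The PyTorch sketch below is a minimal illustration with made-up hyperparameters, not the configuration documented in S2 Text:

```python
import torch
import torch.nn as nn

class SyllableLM(nn.Module):
    """Minimal causal Transformer LM over syllable categories, conditioned
    on singer identity s so that P(x_t | <sos>, x_1..x_{t-1}, s) need not
    infer the singer; all dimensions here are illustrative."""
    def __init__(self, n_syllables=37, n_singers=18, d_model=128,
                 n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.sos = n_syllables  # extra index for the start-of-sequence mark
        self.syl_emb = nn.Embedding(n_syllables + 1, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.singer_emb = nn.Embedding(n_singers, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.readout = nn.Linear(d_model, n_syllables)

    def forward(self, x, singer):
        # Shift right: position t sees <sos>, x_1, ..., x_{t-1}.
        sos = torch.full((x.size(0), 1), self.sos, dtype=torch.long)
        inputs = torch.cat([sos, x[:, :-1]], dim=1)
        T = inputs.size(1)
        h = (self.syl_emb(inputs)
             + self.pos_emb(torch.arange(T))[None]
             + self.singer_emb(singer)[:, None, :])
        # Causal mask: True above the diagonal blocks attention to the future.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.readout(self.encoder(h, mask=mask))

model = SyllableLM()
x = torch.randint(0, 37, (2, 15))       # two sequences of 15 syllables
logits = model(x, torch.tensor([0, 5])) # singer IDs
nll = nn.functional.cross_entropy(logits.reshape(-1, 37), x.reshape(-1))
```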

Measuring context dependencies

After training the language model, we estimated how much of the context x_1, …, x_{t−1} was used effectively by the model to predict the upcoming syllable x_t in the test data. Specifically, we wanted to know the longest length L of the truncated context x_{t−L}, …, x_{t−1} such that the prediction of x_t conditioned on the truncated context was worse (with at least 1% greater perplexity) than the prediction based on the full context (Fig 5A). This context length L is called the effective context length (ECL) of the trained language model [5]. One potential problem with the ECL estimation on the birdsong data was that the test data were much smaller than the human language corpora used in the previous study; in other words, the perplexity, from which the ECL is estimated, was more likely to be affected by sampling error. To obtain a more reliable result, we bootstrapped the test data (10,000 samples) and used the fifth percentile of the bootstrapped differences between the truncated- and full-context predictions. Note that the bootstrapping was performed after the predictive probabilities of the test syllables had been computed, so there was no perturbation in the available contexts or any other factor affecting the language model. We call this bootstrapped version of the ECL the statistically effective context length (SECL). It is more appropriate to estimate the SECL by evaluating the same set of syllables across the different lengths of truncated contexts; accordingly, only the syllables that were preceded by 15 or more syllables (including ⟨sos⟩) in the test sequences were used for the analysis (4,918 syllables of Bengalese finches; see Table 3).
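A minimal sketch of this bootstrap, assuming the per-syllable log-probabilities have already been computed under the full context and under each truncation length (the function name, data layout, and the use of the perplexity ratio as the test statistic are our reading of the description above):

```python
import numpy as np

def secl(logp_full, logp_by_trunc_len, n_boot=10_000, seed=0):
    """Statistically effective context length: the longest truncation L
    whose prediction is reliably worse (>=1% greater perplexity) than the
    full-context prediction, judged by the 5th bootstrap percentile.
    logp_full: (n,) log-probabilities of test syllables, full context.
    logp_by_trunc_len: {L: (n,) log-probabilities, context truncated to L}."""
    rng = np.random.default_rng(seed)
    n = len(logp_full)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample the same syllables
    longest = 0
    for L, logp_trunc in sorted(logp_by_trunc_len.items()):
        # Per-sample perplexity ratio (truncated / full context).
        ratio = np.exp((logp_full[idx] - logp_trunc[idx]).mean(axis=1))
        if np.percentile(ratio, 5) >= 1.01:  # worse by >=1% with ~95% confidence
            longest = L
    return longest
```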

Supporting information

S1 Text. Details on syllable clustering by VAE.

(PDF)

S2 Text. Details on the Transformer language model.

(PDF)

S3 Text. Analysis of context dependency in zebra finch song.

(PDF)

S4 Text. Detailed comparison with the mutual information analysis.

(PDF)

S5 Text. Detailed comparison with the Markovian analysis of context dependency.

(PDF) Click here for additional data file. 23 Mar 2021 Dear Dr. Tachibana, Thank you very much for submitting your manuscript "Birdsong sequence exhibits long context dependency comparable to human language syntax" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Dear Authors, Your paper has now been reviewed by 3 experts in the field. As you will see Reviewers 2 and 3 have raised significant issues that need a serious rewrite (at a minimum). In particular, your conclusions about similarities with human language go to far and are unfounded. You might want to remove that part from your manuscript or greatly change its emphasis. You will also want to add additional controls (see the suggestions of the reviewers) to strengthen the presentation of your methodology. Best wishes, Frederic Theunissen We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Frédéric E. Theunissen Associate Editor PLOS Computational Biology Natalia Komarova Deputy Editor PLOS Computational Biology *********************** Dear Authors, Your paper has now been reviewed by 3 experts in the field. As you will see Reviewers 2 and 3 have raised significant issues that need a serious rewrite (at a minimum). In particular, your conclusions about similarities with human language go to far and are unfounded. ou might want to remove that part from your manuscript or greatly change its emphasis. You will also want to add additional controls (see the suggestions of the reviewers) to strengthen the presentation of your methodology. Best wishes, Frederic Theunissen Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. 
Reviewer #1: This paper by Morita and colleagues attempt to quantify long-range statistical dependencies in Bengalese finch songs, comparing this dependency structure to English sentences. They find that, using higher-capacity neural network-based models in place of traditional n-gram and Markov models, they are able to successfully measure dependencies in the eight syllable range, still much less than English sentences but comparable to English syntatical dependencies. This is an interesting contribution to a pair of very long-running scientific questions: 1) to what extent are sequential behaviors like birdsong valid precursors or models for human speech? and 2) how should these complex sequences be modeled mathematically. In my view, the paper makes a somewhat tangential contribution to the former and a solid contribution to the latter. I found the modeling and its interpretation reasonable, though I have a few clarifying questions. Unfortunately, the full packet I downloaded did not contain the supplementary information with modeling details, so while I suspect that the methods are all reasonable, I would like to review that material before I sign off on publication. Apart from this, I have only minor questions and concerns: Minor points: - It's a bit of a misnomer to say that a Dirichlet prior is "statistically optimal" (l. 91). There are many possible definitions for what constitutes optimality. I might suggest "statistically principled." It is possible to both under-cluster and over-cluster with the Dirichlet prior, depending on what one wants to do, and there's no right answer. Thus, I expected to see the choice of Dirichlet prior defended a little more robustly. I also realize that some of this information was relegated to the supplement, which I will be glad to read. - One thing I'm a bit unclear on: I like the choice of model architecture and the dissociation between bird ID and syllables, but it's unclear to me why the sequence of syllables needs to be discrete tokens and not continuous embedding vectors. Other than alignment with natural human language, why do the syllables need to be indexed by integers and not locations in R^d? - Another robustness question: How strong is the clustering? Might long-term dependencies be indicative of correlated variability within the bout? Could a faster or slower bout, a softer or louder bout, lead to long-term dependency if the data are over-clustered? Some of these nuisance variables may represent long-range correlations even when bird ID is controlled for. If I were to ask for any additional controls to be run, it would be these. Even more minor points: - Figure 2b: It's hard to tell much from this figure. Might the rows and columns be ordered in some way (most to least used for categories?) in a manner that makes it more apparent which categories are most typical and which individuals show most variety? - Figure 2c: It's pretty hard to make sense of some of these spectrograms as syllables (e.g., 2, 3). Could the authors perhaps plot the nearest real-data neighbor of these median generated spectrograms? - Figure 2e: This is a nice comparison, and it seems clear that giving individual IDs to the seq2seq model does a decent job of correcting for this information. The authors might also consider citing some of the work on similar ideas from the literature, including, e.g., the variational fair autoencoder or domain adaptation, which often want to model some commonality across individuals. 
- There is a typo in the last line of the caption to Table 1. - The authors note (ll. 247-48) that their estimates of dependency lenght are unlikely to be model-limited. It is hard to assess this is a rigorous way, but their results, even if they are only a lower bound, are still valuable as such. - Why is the train/test split so different between the English and finch data sets in Table 2? - Details on the bootstrap method for ll. 482-484? This is also in the supplement? Reviewer #2: This paper seeks to compare context dependency in the songs of Bengalese finches and human speech. The authors provide a method for unsupervised clustering of BF songs, then use a transformer network approach to predict song element categories based on context from sequences of different lengths. Deviations from full-length context are used to define the “statistically effective context length” (SECL). The authors report that BF song has and SECL of about 8 elements, beyond which context no longer exerts an effect. English lemma sentences have an SECL greater than 10, whereas English sentences relabeled with parts-of-speech tags have an SECL of 5. Overall, I think the paper can be an interesting contribution to the growing literature on long-range dependencies in birdsong, but there are a number of concerns that need to be addressed in any revision. First, the methods need to be more explicitly documented. Both the ABCD-VAE and Transformer network approach are potentially attractive methods for understanding birdsong. However, we don’t learn very much about it either in terms of their performance and robustness in the context of different datasets, or their performance relative to other published methods with similar goals. Second, the ties to human language need to be tempered significantly, if not dropped entirely from the paper. There are a range of conceptual problems in the comparison between songs and speech as presented, and the authors overstate several of the comparisons. Finally, it is not entirely clear how to interpret the present results in the context of past work on the explicit modelling of songs using markov-based and information theoretic approaches. Ideally, contrasts with the published work would done be in quantitatively rigorous ways, befitting the PLOS Comp bio venue. Major concerns: The ABCD-VAE is very nice and potentially a useful novel method, but its utility as a general method relative to other available techniques is not explored. In particular, recent work shows that BF is not a particularly challenging species to cluster using spectro-temporal acoustics compared to other songbirds and especially to mammals (paper ref 37). Understanding how this method fares in other species, or in more challenging cases in Bengalese finches (if they exist), would be useful. Likewise, a comparison to at least one other unsupervised method would be useful. The critique of Sainburg 2019 (ref 37) in the discussion is not accurate. While it’s true that that study used the MI between pairs of song elements to estimate long range dependencies, the explicit form of the MI decay as a function of dependency distance gives insight to the role of the intervening elements. Specifically, if the decay falls off exponentially, consistent with a Markov model, long distance dependencies require intervening elements (i.e., “everything in the middle”) by definition. 
It is true that at distances where a power law is exclusively governing decay, the intervening elements can (in principle) be skipped, but the fit of either model and/or their combination is determined by the data, not by an a priori assumption about the role of intervening elements. More importantly, Sainburg et al. 2019 estimate that dependencies shift from being primarily Markovian to primarily power-law distributed at distances of about 24 elements for BFs, which is consistent with the idea that contextual effects as considered in the current paper extend throughout the bout, i.e., over distances considerably longer than the SECL of 8 reported here. The source for this difference is never considered. The authors spend considerable effort in trying to differentiate their results from Sainburg’s on conceptual grounds, but a more convincing approach would be to provide the technical detail necessary to understand how their approach is, as they claim, measuring different components of sequential dependencies in BF songs (as opposed to measuring the same components with different sensitivity). The authors reference a supplementary methods and discussion section, but I could not find either. In particular, more detail for the Transformer model is required. I am not an expert in these attention-based approaches, but my understanding is that their primary advantage over recurrent networks in seq2seq tasks is their parallel-izability for training. Since RNN’s can model hierarchical structure in sequences (and thus “skip” -or not- intervening elements to find dependencies) whether or not the Transformer network used here behaves similarly needs to be made clear. To do this, it may be necessary to create synthetic datasets that vary in known ways along the dimensions that they argue are relevant to their results. It would also be very helpful to know more about how changing network parameters effects the results. The analogy to human language/speech is quite weak. To begin with, the two signals are parsed in much different ways: BF song on the basis of acoustic similarity, and speech on the basis of word boundaries. It’s not clear what the analog of a word is in BF song, or if it even exists. The speech exemplars are then further organized and relabeled by heavily supervised methods (either for the lemma sequences or PoS). Thus categories (and category elements) in the two datasets are very different things and any relationships between them are difficult to interpret. To this point, its not clear that the bootstrapping methods applied to estimate ECL from the small BF dataset were effective. Specific claims in the abstract are overly strong and not supported by the data: (1) “…birdsongs have a long context dependency comparable to grammatical structure in human language”, (2) “… birdsong is more homologous to human language syntax…”. With reference to the first, the SECL as they define it here for human language is one of many emergent properties of grammatical structure. It is a mistake to conflate grammatical structure with a property of that structure. With reference to the second, I think they mean to imply analogy not homology. In any case, the same confusion of logical type exists. Relatedly, it is not clear why the authors introduce the idea of “memory durability” in the discussion (p 18), as nothing in the BF results justifies the notion that either the singers or recipients of song are actively aware or perceptually sensitive to the dependencies reported here. 
I would not be surprised if birds were sensitive to these dependencies, but it’s also possible they are not. For example, dependencies might reflect some other, non-cognitive, aspect of the system such as lower-level constraints on motor production. Appealing to the memory components of language involved in syntax and semantics implies an equivalency that is not supported by the results. Reviewer #3: Review of Morita et al. (PCOMPBIOL-D-21-00210) Summary: The sequencing of elements in communication structures, including language and birdsong, can have long-range dependencies. Here, Morita et al. integrate various machine learning techniques to automatically annotate a large corpus of Bengalese finch song and model the long-range dependencies of syllable sequencing. A similar computational model was also used to quantify the context-dependency of English. Consistent with previous studies, the authors demonstrate that syllable sequencing in Bengalese finch song integrates information across many past syllables to influence subsequent syllable sequencing. The manuscript is well-written overall and the end-to-end, unsupervised clustering approach is a useful addition to the range of techniques used for automatic labelling of birdsong. Major concerns: 1. The authors repeatedly try to link their findings dealing with the context-dependency of syllable sequencing in birdsong to word sequencing in English. For example, in the Abstract they write: “We found that the context dependency in the birdsong was much shorter than that in the sentence, but was comparable to the grammatical structure when semantic factors were removed.” It is completely reasonable to compute the number of elements in the past that are required to make accurate predictions about the future for birdsong and language, but it is inappropriate to equate the results from the two because it assumes equivalency between the units of analysis. Consequently, such direct comparisons about context length between birdsong and language should be removed. That being said, comparisons between the current results on the English language corpus and other analyses of language can be informative and interesting. In its current form, there is not much contextualization of the analyses of language. 2. There is insufficient information to evaluate the quality of the automated annotation. As the authors are aware, the inferences from their language model critically depend on the symbols put into the model. While the authors provide some measures of similarity between human and machine classification (e.g., Figure 2 and Table 1), more data is required. For example, it would be useful to provide estimates of the number of unique syllable types in the songs of each individual Bengalese finch as estimated by humans vs. algorithm. If a human rater is asked to annotate the songs of two birds simultaneously, do they come up with the same number of shared vs. unique syllables between birds as estimated by the algorithm? Ultimately, estimates of the extent of context-dependency depend on the number of classes fed into the model. If one used a broader classification scheme (leading to fewer unique syllable types), estimates of effective context length would decrease. Given the spectrograms in Figure 2C, I suspect that modeling data based on human labelling would lead to fewer syllable classes within the corpus; for example, I suspect that a human would cluster together classes 13 & 20, or even all of classes 1-5. 
While this does not mean that the classifier is “wrong” (just that machine and human classification can lead to different numbers of classes), this could ultimately affect the number of syllables in the past one might need to analyze to accurately predict the identity of the subsequent syllable. It would be useful for the authors to provide some raw examples of how measures of similarity between human and machine classification were computed. I suspect that a number of readers might not be familiar with how the V-measure is calculated, so providing a concrete example of how this is computed in their dataset would be useful. (see also Minor comment on Cohen’s kappa) 3. Related to point #2, one of the fundamental challenges in bioacoustics analyses that is often ignored is segmentation; classification algorithms are only as good as the elements fed into it. I appreciate the authors acknowledgement of this issue in the Discussion section. However, given their approach and use of segmented syllables, more information about how amplitude thresholds were computed and verified are essential here. References to previous work are insufficient, since understanding the accuracy and reliability of this approach is important to interpreting the data. 4. I do not see how the current data “provide a new piece of evidence for the hypothesis that human language modules, such as syntax and semantics, evolved from different precursors that are shared with other animals.” The analyses presented here do not reveal any separate modules for semantics vs. syntax, and many previous studies have already discussed sequence complexities in vocalizations that lack the semantic content of language. This should be removed. Minor concerns: Beginning from the abstract, the authors repeated make the claim that “birdsong is more homologous to human language syntax than the entirety of human language including semantics.” The basis of this statement becomes more evident as one reads the paper, but this is almost impossible to understand in the Abstract. Therefore, the authors need to explain this a little more or rephrase in the Abstract. Line 23-25. There should be a more extensive list of papers cited here since many studies have used computational methods to assess the long-range dependency of syllable sequencing in birdsong (e.g., Sainburg et al., 2019 and Jin & Kozhevnikov, 2011 to name a few that are missing from this list). Lines 94-96: The classifier used by the authors detected 39 syllable categories. The authors then write that “(t)he syllable repertoire of each bird covered 26 to 38 categories (34.78 ± 3.19)”. If I understand this correctly, this indicates that, on average 35 unique labels would be used to annotate an individual Bengalese finches song. My impression is that this is a substantially higher number than one would normally ascribe to an individual Bengalese finch’s song. This gives me pause and further emphasizes the need for additional information and concrete examples about the similarity between human and computer classifications. As indicated above, understanding syllable classification is of utmost importance to understanding the results, so the authors should provide more information about this approach. 
lines 103-106: “A problem with this metric is that it requires two classifications to use the same set of categories while our model predictions and human annotations had different numbers of categories and, thus, we needed to force-align each of the model-predicted categories to the most common human-annotated label to use the metric.” Because Cohen’s kappa is a widely used metric that is familiar to many readers, the authors should flesh out this approach more an provide more information on the method of forced alignment. Inclusion of a concrete example would be very useful. Lines 113-114: “Hence, our unsupervised clustering of syllables is as reliable as the manual classification by the expert.” The analyses in this paragraph do not speak to the RELIABILITY of labelling (which could be interpreted as the consistency with which an approach labels a particular class of syllables as belonging to that class: e.g., the “a” syllable is always classified as an “a” and never as any other syllable class). Please reword or clarify how these measures speak to the reliability of labelling (as opposed to the similarity of classifications across approaches). Line 115-119: The performance of a classifier on identifying individual birds depends on how distinct the birds’ songs are. Given that the acoustic structure of birdsong is learned, the authors should provide a comprehensive summary of the relationship between birds (e.g., were birds siblings or offspring of another bird in the dataset?) Related to the Major Concern #1, the title of the section “Birdsong sequence more context-dependent than English syntax” should be changed because of concerns about equating units of analyses across birdsong and language. Line 137-139. Information about minimum number of syllables per bout should be included here. Generally speaking, I suggest that more information in Methods should be incorporated into this part of the results section, since this would help readers assess the validity of the approach. Line 147: the summary of English analysis should be in a different paragraph. Line 156: the statistically effective context length of 8 syllables is surprisingly similar to estimates of temporal integration by HVC neurons in Bouchard and Brainard (2013). The authors should read and reference this paper to draw parallels. Related to this point, including some discussion of neural mechanisms of temporal integration would be useful in the Discussion section to broaden the scope of the paper and provide some experimental “validation” of temporal integration estimates. Line 180-183: The authors discuss how the automated approach is more reliable than human labelling. Can the authors please explain this more? (see point above about assessments of reliability). Relatedly, how is “optimality” defined in this context? Line 250: Fujimoto et al. (2011; Neural Coding of Syntactic Structure in Learned Vocalizations in the Songbird”) should also be cited somewhere in this manuscript because it provides concrete examples of the history-dependence of sequencing. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: No: Authors claim that data will be available upon publication. 
Reviewer #2: No: supporting information, specifically methods and discussion, are reference but not provided. Not sure who's error this is. Reviewer #3: None ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols 29 Jul 2021 Submitted filename: response_letter.pdf Click here for additional data file. 27 Sep 2021 Dear Dr. Tachibana, Thank you very much for submitting your manuscript "Measuring context dependency in birdsong using artificial neural networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Dear authors, Please look carefully at the new comments from the same reviewers as in the first round. You will see that they agree that the paper has improved by focusing on songbirds and eliminating the comparison with human language. There are still some concerns on the biological relevance of some of your claims that need to be addressed. Also the reviewers are sorry that their re-review took so long and I am transferring their apologies to you. Best wishes, Frederic Theunissen Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 
When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Frédéric E. Theunissen Associate Editor PLOS Computational Biology Natalia Komarova Deputy Editor PLOS Computational Biology *********************** A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Dear authors, Please look carefully at the new comments from the same reviewers as in the first round. You will see that they agree that the paper has improved by focusing on songbirds and eliminating the comparison with human language. There are still some concerns on the biological relevance of some of your claims that need to be addressed. Also the reviewers are sorry that their re-review took so long and I am transferring their apologies to you. Best wishes, Frederic Theunissen Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: I appreciate the authors' thoughtful responses to my questions. I think the paper benefits by removing the comparison with English sentences and focusing on the birdsong structure. I was glad to be able to read the supplementary information, which I think is well done and adds much value to the paper. I have only minor points that I'd ask the authors to consider: ll. 97-98: Again, this language is problematic. ll. 110-140: The authors might consider a small schematic figure illustrating this. I think the text description is fine, but a visual illustration would, I suspect, make it easier to grasp. I appreciated the analysis of Section S1.4/Figure S1.3, since I had wondered reading the main text whether the authors had tried to give identity information to the standard VAE as well. (It might be nice to mention that this failed.) Two questions: 1) On line 111, when the text says that the "speaker ID" was given to the standard VAE, is this the same embedding vector used in the ABCD-VAE? 2) Granted, the t-SNE plots do not look clustered at all, but how true is this in the actual latent space? Do clustering metrics also show that clustering performs poorly, or is this just an artifact of the visualization? Table 2: Is the reason for the difference here that zebra finch syllables are just not similar enough to be "speaker de-identified"? That is, are individual birds' repertoires so distinct that it's simply more accurate to use per-bird syllable labels in many cases? SECL results: I remain a bit confused about one aspect here. 
I appreciate the authors considering these predictions as function of number of syllable categories used, but this clearly complicates the computation of "the" SECL. Indeed, the number of categories used appears to diminish the effectiveness of context, as they report. The authors discuss one possible explanation in ll. 447-455, which is that the larger number of categories simply makes prediction harder, so long-range context matters less. I'm curious about another possibility, though: could it be that using a finer-grained quantization of the existing syllabless, retains more information about previous context (particularly if syllable characteristics like speed or mean pitch are autocorrelated within a sequence), and so I need fewer steps into the past to retain the same amount of predictive information? Reviewer #2: I appreciate the authors attention to my earlier concerns. The revised manuscript is improved. I continued to struggle, however, with some remaining points. Absent the link to language, the paper concerns the long-standing question of how to best characterize temporal structure in birdsongs. The paper introduces two novel applications of methods. The first method (called ABCD-VAE) deals with specific questions of how to cluster song element into singer-invariant categories. The second method (using a Transformer model) deals with how to detect long-range dependencies in sequences comprising those categories. The fundamental result is that context dependencies in BF song can be measured quantitatively out to ranges/distances longer than those observed (previously) with other methods. Unfortunately, the use of the singer-invariant categories is still not well-justified or supported by the analyses, and the alleged improvement enabled by the Transformer model is not justified by direct comparison to other methods on the present (or a standardized) dataset. My specific concerns are detailed below. Justification for the use of individually-invariant categories is not well supported. From the standpoint of wanting to approximate the syntactic relationships for words, I can appreciate the desire to find categories for BF song elements that generalize across singers. But this seems like a holdover from the first iteration of the paper with its (misplaced) focus on comparisons to language. BF song elements are not words, and it is less clear that the categories derived through the ABCD-VAE algorithm are biologically justified or functionally valid, even if they are somehow optimized statistically. Repertoire sharing in general is a complicated issue in songbirds and varies tremendously between species. If there are data on repertoire sharing in BF it should be cited. Absent such, it is equally plausible that there is no such thing as an individual-invariant category for BF song elements, or that there exists a mixture of shared and unique elements (as has been reported in other species) and that sharing varies on an individual bird basis. As the authors now report, larger vocabularies lead to lower SECL. This is not surprising given that total entropy is limited by the finite number of transitions in the training set, but it nonetheless highlights the fact that forcing categorization in any arbitrary way could alter the results. The sample spectrograms show significant acoustic variation within some of the ABCD-VAE derived classes and the actual repertoire size within birds is not clear. 
Since the set of individually-invariant categories (i.e the intersection of all individual categories) is a subset of any individual’s categories, it seems likely that Transformer models on bird-specific categories that don’t overfit trivial acoustic differences would give similar (or perhaps) even higher SECL. (SECL for individual’s shouldn’t be lower, since the invariant categories have to exist in an individual’s repertoire and sequential structure is ultimately implemented by individual singers.). If this is the case, its not clear what the invariant categorization adds to the paper. If a different result is observed (after controlling for trivial vocabulary size effects), then the use of the invariant categories might be justified. I should add that I think the existence of individually-invariant categories is interesting, I’m just not convinced that it matters for these sequence analysis here. Transformer model benefits: With respect to the sequence analysis, I do think the Transformer model is likely to be a very useful tool and its efficiency in other contexts has been shown. In the present case, however, its not clear whether the reported benefits relative to prior work reflects the model itself or the specific dataset (or, as noted above, the novel classification scheme). To show that the Transformer modle is an improvement, a direct comparison is necessary. This should involve either the implementation of a contemporary Markov-based analyses (e.g. as in Markowitz et al 2013) on the current dataset, or the application of the current Transformer model to a the dataset from the Markowitz or a similar study. Minor point: The discussion (line 287-288) still references a comparison to human text. Reviewer #3: The authors have done an excellent job addressing my concerns and revising their manuscript. I am glad to see that direct parallels to English were removed and that analyses of a different species have been included in the revised draft. While the results about content-dependency (history-dependence) are qualitatively similar to previous studies in Bengalese finches, the approach used by the authors is novel to birdsong and potentially powerful (variational autoencoders combined with speaker normalization and Transformer models). I hope that these scripts will be made available to the scientific community for use in their studies. Medium concerns: The number of categories determined by the unsupervised approach is ~3-5X as large as those determined by humans. I believe a previous paper (Sainburg et al., 2020?) similarly report that humans overlook context-dependent variation in syllable structure and indicate smaller estimates of repertoire size than other computational approaches. However, given that human labelling has been considered the “gold standard” for so long, running the Transformer model on human annotated Bengalese finch song would be useful (I don’t recall seeing these data in the manuscript, maybe because of the limited number of human annotated songs). Given that they find an inverse relationship between the number of syllable classes and SECL, it seems like a greater SECL would be found in the human annotated data. It’s unfortunate that the author’s current attempts at unsupervised classification of zebra finch songs were not particularly successful. Zebra finches are the most commonly studied songbird so optimizing an approach for this species would be particularly impactful. 
And spectral complexity is not an uncommon feature of animal communication signals, suggesting limited applicability of the current approach. That being said, testing this approach on canary song syntax would be useful given the relative spectral simplicity of canary song syllables and previous studies of context-dependency in canary song. (NOTE: this is not to say that an analysis of canary song is imperative for the publication of this manuscript, just that it would be a welcome addition.) Much of the results from the Transformer model of zebra finch song should be excluded from the main text. If syllable classification is unreliable for zebra finches, modelling the sequence structure of unreliably annotated songs can be misleading. It runs the risk of some readers concluding that 4 syllables are required to make accurate predictions about upcoming syllables in zebra finch song. The authors already word this section of the manuscript carefully, but I think they should prune this down even more to simply indicate that (similar to Bengalese finch song) SECL becomes smaller as the number of categories increases for zebra finch song. Related to this point, Figure 4C can be moved to supplementary information and references to an SECL of 4 should be removed from the Discussion. Minor concerns: Lines 218-224: only 16 of the 20 birds are accounted for in this description. Please revise. Line 388-389. Can the authors confirm whether the canary study cited here analyzed canary syllables or phrases. The time scales of these two types of song descriptions are very different (7 syllables in the past is much shorter than 7 phrases in the past), so this should be clarified and indicated in the manuscript. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No: Responses from authors say that data will be made available upon publication. I did not see a statement about code availability. Reviewer #2: No: I didn't see links to data and/or code but suspect theses would be provided later. Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. 
To use PACE, you must first register as a user. Then log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list, and also include a citation and full reference for the retraction notice.

25 Nov 2021
Submitted filename: responseletter.pdf

1 Dec 2021

Dear Dr. Tachibana,

We are pleased to inform you that your manuscript 'Measuring context dependency in birdsong using artificial neural networks' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted, you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be coordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,
Frédéric E. Theunissen
Associate Editor
PLOS Computational Biology

Natalia Komarova
Deputy Editor
PLOS Computational Biology

***********************************************************

Dear authors,

Thank you for your revision, and congratulations on a nice contribution. A few minor edits:

1. Intro, line 11: you mention "station", but I don't see "station" in that sentence. Did you mean "Mary"?

2. Results, line 98: Please specify the gist of a Dirichlet prior and what effect it has on the number of syllables.

Frederic Theunissen

20 Dec 2021

PCOMPBIOL-D-21-00120R2
Measuring context dependency in birdsong using artificial neural networks

Dear Dr Tachibana,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department, and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,
Katalin Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
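[Editorial note: the editor's question about the Dirichlet prior (point 2 above) can be unpacked with a small simulation. A symmetric Dirichlet / Dirichlet-process prior expresses how willing a clustering model is to open new syllable categories: its concentration parameter alpha trades off reusing existing categories against creating new ones. The Chinese-restaurant-process sketch below is a generic way to see that effect on the number of detected categories; it is not the paper's actual clustering model, and the function name is invented here.]

# Minimal sketch (assumptions noted above): how the concentration parameter
# alpha of a Dirichlet-process prior shapes the number of categories.
import random

def crp_num_categories(n_syllables, alpha, seed=0):
    """Sample one Chinese-restaurant-process partition of n_syllables items
    and return the number of occupied categories."""
    rng = random.Random(seed)
    counts = []  # counts[k] = number of syllables assigned to category k
    for i in range(n_syllables):
        # Open a new category with probability alpha / (i + alpha);
        # otherwise join an existing category in proportion to its size.
        r = rng.uniform(0, i + alpha)
        if r < alpha or not counts:
            counts.append(1)
            continue
        r -= alpha
        for k, c in enumerate(counts):
            if r < c:
                counts[k] += 1
                break
            r -= c
        else:
            counts[-1] += 1  # guard against floating-point edge cases
    return len(counts)

for alpha in (0.1, 1.0, 10.0):
    ks = [crp_num_categories(5000, alpha, seed=s) for s in range(20)]
    print(f"alpha = {alpha:>4}: mean number of categories ~ {sum(ks) / len(ks):.1f}")

With 5,000 syllables, alpha = 0.1 typically yields only a few categories, alpha = 1 around ten, and alpha = 10 several dozen (the expected count grows roughly as alpha * log(1 + n/alpha)). This is the sense in which the prior pushes the auto-detected repertoire size up or down.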
References: 42 in total (the first 10 are listed below).

1.  Neural networks that learn temporal sequences by selection.

Authors:  S Dehaene; J P Changeux; J P Nadal
Journal:  Proc Natl Acad Sci U S A       Date:  1987-05       Impact factor: 11.205

2.  Pitfalls in the categorization of behaviour: a comparison of dolphin whistle classification methods.

Authors: 
Journal:  Anim Behav       Date:  1999-01       Impact factor: 2.844

3.  Linked control of syllable sequence and phonology in birdsong.

Authors:  Melville J Wohlgemuth; Samuel J Sober; Michael S Brainard
Journal:  J Neurosci       Date:  2010-09-29       Impact factor: 6.167

4.  Projections of the dorsomedial nucleus of the intercollicular complex (DM) in relation to respiratory-vocal nuclei in the brainstem of pigeon (Columba livia) and zebra finch (Taeniopygia guttata).

Authors:  J M Wild; D Li; C Eagleton
Journal:  J Comp Neurol       Date:  1997-01-20       Impact factor: 3.215

5.  Audition-independent vocal crystallization associated with intrinsic developmental gene expression dynamics.

Authors:  Chihiro Mori; Kazuhiro Wada
Journal:  J Neurosci       Date:  2015-01-21       Impact factor: 6.167

6.  Experimental determination of a unit of song production in the zebra finch (Taeniopygia guttata).

Authors:  J Cynx
Journal:  J Comp Psychol       Date:  1990-03       Impact factor: 2.231

7.  Syllable chunking in zebra finch (Taeniopygia guttata) song.

Authors:  H Williams; K Staples
Journal:  J Comp Psychol       Date:  1992-09       Impact factor: 2.231

8.  Songs of humpback whales.

Authors:  R S Payne; S McVay
Journal:  Science       Date:  1971-08-13       Impact factor: 47.728

9.  A compact statistical model of the song syntax in Bengalese finch.

Authors:  Dezhe Z Jin; Alexay A Kozhevnikov
Journal:  PLoS Comput Biol       Date:  2011-03-17       Impact factor: 4.475

10.  Parallels in the sequential organization of birdsong and human speech.

Authors:  Tim Sainburg; Brad Theilman; Marvin Thielk; Timothy Q Gentner
Journal:  Nat Commun       Date:  2019-08-12       Impact factor: 14.919
