
Communicating artificial neural networks develop efficient color-naming systems.

Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, Marco Baroni.

Abstract

Words categorize the semantic fields they refer to in ways that maximize communication accuracy while minimizing complexity. Focusing on the well-studied color domain, we show that artificial neural networks trained with deep-learning techniques to play a discrimination game develop communication systems whose distribution on the accuracy/complexity plane closely matches that of human languages. The observed variation among emergent color-naming systems is explained by different degrees of discriminative need, of the sort that might also characterize different human communities. Like human languages, emergent systems show a preference for relatively low-complexity solutions, even at the cost of imperfect communication. We demonstrate next that the nature of the emergent systems crucially depends on communication being discrete (as is human word usage). When continuous message passing is allowed, emergent systems become more complex and eventually less efficient. Our study suggests that efficient semantic categorization is a general property of discrete communication systems, not limited to human language. It suggests moreover that it is exactly the discrete nature of such systems that, acting as a bottleneck, pushes them toward low complexity and optimal efficiency.


Keywords:  color-naming systems; efficiency of human language; language emergence in artificial neural networks

Year:  2021        PMID: 33723064      PMCID: PMC8000426          DOI: 10.1073/pnas.2016569118

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   12.779


Words partition our world into semantic categories. Converging evidence indicates that, while these categories differ widely across languages, they are shaped by universal constraints (1–3). In particular, it has been suggested that semantic categorization evolves to support efficient communication (4). Humans develop naming systems to talk about their experience under two competing pressures: “accuracy maximization” (words should encode precise information about their referents) and “complexity avoidance” (preventing unwieldy languages). At an extreme, a maximally accurate system would have a different term for each perceptual or mental experience. At the other, a maximally simple system would use only one term to refer to all experiences, completely hindering communication. Actual human naming systems are efficient in the sense that they optimize the accuracy/complexity trade-off. More generally, since the foundational work of Zipf (5), a similar trade-off between precision and simplicity has been observed in many areas of language (6). Zaslavsky et al. (7) formalized the measurement of naming-system efficiency within the general information–theoretic framework of the Information Bottleneck (IB) (8) (see also the closely related rate-distortion theory framework in ref. 9). A system is deemed efficient if it reaches the maximum possible accuracy for a given complexity. In the IB framework, both accuracy and complexity are computed in a communication model where an idealized Speaker aims to communicate a meaning to an idealized Listener. Accuracy is then inversely related to the cost of a misinterpreted meaning, while complexity measures the quantity of information needed to convey the meaning. The IB efficiency of a system is effectively visualized in plots (see Fig. 3). The black curve in Fig. 3 represents the theoretical limit: no system of a certain complexity (horizontal axis) can have accuracy (vertical axis) above the curve. 
Hence, according to IB, a system is optimal if it lies on the curve. Equipped with this framework, Zaslavsky et al. (7) demonstrated that color-naming systems (4, 10, 11) are notably close to the theoretical limit and hence efficient in a quantifiable way.
Fig. 3.

Human (blue circles) and NN (orange circles) color-naming systems on the information plane. English (light blue circle) is not in WCS, but it is approximated relying on Zaslavsky et al. (SI Appendix, figure S7 in ref. 7). The IB curve (black line) defines the theoretical limit on accuracy given complexity. All color-naming systems achieve near-optimal efficiency.

IB theory is agnostic about where on the theoretical-limit curve a system should lie. Degenerate systems lying at the extremes of the curve, and expressing each referent with a different term or all referents with a unique term, are also efficient according to this theory. However, such systems are not attested. Instead, real color-naming systems approximate a small range of possible optimal solutions, avoiding the extremes, and in particular high-complexity trade-offs (7). This avoidance of complexity extremes has been observed more broadly in studies of categorization and naming across many semantic domains (4, 12–14). We study the efficiency of color naming from a different perspective. We compare natural language systems with those emerging from the interaction of modern neural networks (NNs) faced with a color-communication task. Artificial NNs trained with deep-learning methods (15) have recently been used to study human (neuro)cognition in many fields (e.g., refs. 16–19), including color naming (20, 21). Traditional simulations in cognitive science are specifically designed to assess how certain factors of interest affect system behavior by developing ad hoc models, an approach illustrated by Baronchelli et al. (22) and Loreto et al. (23), in the domain of color naming, and applied by Carr et al. (24) to the study of complexity/accuracy trade-offs in semantic categorization. Deep networks, however, are high-performance general-purpose learners, independently developed for engineering purposes, with no claims of cognitive plausibility concerning their architecture or learning process. In this respect, they might be best seen as complex “animal models” (25, 26). The main interest lies in whether the emergent behavior of these powerful mechanisms mirrors nontrivial properties of human behavior (27). If it does, we can entertain the intriguing hypothesis that the specific converging human and deep-network patterns we observe have common roots. 
We can moreover directly intervene on the artificial organisms (more easily than we can on humans) in order to causally assess how different components affect their emergent behavior. Specifically, we show that, when two deep-learning-trained NNs play a simple color-discrimination game, they develop naming systems that closely match the distribution of human languages on the IB plane, showing both efficiency maximization and complexity control (Fig. 3). These human-like artificial systems emerge without imposing on the training procedure any ad hoc constraints favoring efficiency or limiting complexity. Having observed the systematic emergence of efficiency and complexity reduction in the NN systems, we proceed to test the hypothesis that these properties crucially depend on the bottleneck imposed by the discrete communication channel. Indeed, as we let NNs exchange increasingly continuous messages, their naming systems become more complex and, eventually, no longer efficient. Varying the degree of color-discrimination granularity required to play the game affects the complexity of the emergent systems, but not their efficiency, and only within the range of attested human variation. NN capacity affects the complexity of the system only as a function of the discreteness of communication. The emergence of efficient and reasonably simple semantic categorization is thus not specific to human language but might generally arise in cognitive devices exchanging discrete messages about their world. Discreteness of communication plays a central role in the emergence of efficient and low-complexity naming systems among our artificial agents, raising intriguing questions about the role of discreteness in human language.

Color-Naming Task

Stimuli.

Following prior work (4, 7, 28), we use the World Color Survey (WCS). The WCS contains the names of 330 color chips (Fig. 1) in 110 languages of nonindustrialized societies (29). We represent each color stimulus as a three-dimensional vector in CIELAB space (a color space designed to approximate human vision). In particular, we measure color similarity based on Euclidean distance in CIELAB, as it correlates with human perceptual sensitivity (7).
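As an illustration of the distance measure (a sketch in Python; the CIELAB coordinates below are made up for the example, not actual WCS chip values):

```python
import numpy as np

# Hypothetical CIELAB coordinates (L*, a*, b*) for three color chips.
# Real WCS chips are converted from Munsell coordinates; these values
# are illustrative only.
chips = np.array([
    [96.0,  -0.5,    2.0],   # near-white
    [53.0,  80.0,   67.0],   # saturated red
    [32.0,  79.0, -108.0],   # saturated blue
])

def cielab_distance(c1, c2):
    """Euclidean distance in CIELAB, used as a proxy for perceptual
    dissimilarity between two color chips."""
    return float(np.linalg.norm(np.asarray(c1) - np.asarray(c2)))

# Pairwise distance matrix over the toy palette.
dists = np.array([[cielab_distance(a, b) for b in chips] for a in chips])
```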
Fig. 1.

The 330 WCS color chips. Rows correspond to equally spaced lightness values and columns to equally spaced Munsell hues. Each stimulus is at the maximum available saturation for that hue/lightness combination.


Discrimination Game.

We implement a classic discrimination game (30) played by two NN agents, Speaker and Listener. Speaker receives a target color t from the palette and sends one word w to Listener, chosen from a fixed vocabulary of size V = 1,024. As V is larger than the number of colors (330), it is always possible, in principle, for Speaker to use a unique word to denote each distinct color. Given w and two distinct colors, the target t and a distractor d, Listener must predict which one is the target. The agents succeed if Listener guesses the correct target (as in Fig. 2).
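A single round of the game can be sketched as follows (an illustrative Python toy with untrained, randomly initialized agents; the linear `speaker` and `listener` maps and all parameter values are simplifications of the paper's actual networks):

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 1024  # as in the paper, larger than the 330 chips

def speaker(target_color, weights):
    """Toy Speaker: a linear map from a CIELAB vector to a categorical
    distribution over the vocabulary (the paper uses a deeper network)."""
    logits = weights @ target_color
    e = np.exp(logits - logits.max())
    return e / e.sum()

def listener(word, candidates, embeddings):
    """Toy Listener: scores each candidate color by its dot product with
    the embedding of the received word, then normalizes over positions."""
    scores = candidates @ embeddings[word]
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Hypothetical parameters and colors (illustrative, not trained).
W_speaker = rng.normal(size=(VOCAB_SIZE, 3))
E_listener = rng.normal(size=(VOCAB_SIZE, 3))
target = np.array([53.0, 80.0, 67.0]) / 100.0
distractor = np.array([32.0, 79.0, -108.0]) / 100.0

probs = speaker(target, W_speaker)          # distribution over words
word = rng.choice(VOCAB_SIZE, p=probs)      # sampled discrete message
guess = listener(word, np.stack([target, distractor]), E_listener)
success = guess.argmax() == 0               # position 0 holds the target
```

With untrained weights, success is at chance; training adjusts both agents so that word choices become informative about the target.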
Fig. 2.

A successful round of the discrimination game. A chip c is drawn from a uniform distribution and fed to Speaker. Speaker outputs a probability distribution s over its vocabulary of size V. Here, each probability is mapped to a gray level (with darker shades representing higher probabilities). A word w is sampled from s and fed to Listener. Finally, Listener—given w, the target chip (in position 1 in this illustration), and a distractor chip (in position 2 in this illustration)—assigns a probability to both positions, representing its guess about the position of the target (in this illustration, Listener correctly assigns the higher probability to the target position).

As in previous work (4, 28), we assume a uniform prior distribution over target colors. In SI Appendix, we test an alternative nonuniform prior (31), and the results still hold. The game is implemented in EGG (32). Further details are given in SI Appendix.

Discriminative Need.

Despite the presence of universal tendencies (10, 33, 34), color-naming variance is also observed (35, 36). Prior studies hypothesized that such variance depends on distinct frequencies of occurrence of colors across communities (31, 37). Lacking data capturing these differences, we explore a complementary source of variation, one that is easier to model computationally. We hypothesize that different cultures have different discriminative needs. Intuitively, highly industrialized societies might need to distinguish between subtly different color shades characterizing different goods, whereas nonindustrialized societies can rely on coarser distinctions. As indirect evidence, Gibson et al. (31) reported that, in the nonindustrialized Tsimané community, color terms are "only used to discriminate between familiar artificial objects." Since a nonindustrialized community has relatively little variety of artificial objects, its discriminative need will be low. In English, instead, speakers systematically use color terms to discriminate between objects of all kinds (31). Concretely, we define discriminative need via the minimum allowed Euclidean distance between targets and distractors in CIELAB space. Agents trained with a small minimum target-distractor distance simulate communities with high discriminative need; those trained with a large one represent communities with low discriminative need. We quantify this distance as the nth percentile of the list of pairwise distances between the 330 distinct color chips. For example, with percentile = 50, for a given target color t, a distractor d is sampled uniformly among candidate colors c such that

Δ(t, c) ≥ median({Δ(x, y) : x ≠ y}),

where median(·) is the median function over the list of pairwise chip distances and Δ is the Euclidean distance in CIELAB space. Note that larger percentiles correspond to games requiring less granular discrimination. We provide examples in SI Appendix.
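The percentile-based sampling of distractors can be sketched as follows (Python; the 50-chip random palette stands in for the 330 WCS chips, and the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy palette of CIELAB vectors standing in for the 330 WCS chips.
palette = rng.uniform([0, -100, -100], [100, 100, 100], size=(50, 3))

def min_distance(percentile, palette):
    """The minimum allowed target-distractor distance: the nth percentile
    of the list of pairwise CIELAB distances between distinct chips."""
    diffs = palette[:, None, :] - palette[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    pairwise = dists[np.triu_indices(len(palette), k=1)]
    return np.percentile(pairwise, percentile)

def sample_distractor(target_idx, percentile, palette):
    """Sample a distractor uniformly among chips at least the minimum
    distance away from the target (larger percentile = coarser game)."""
    d_min = min_distance(percentile, palette)
    dists = np.linalg.norm(palette - palette[target_idx], axis=-1)
    candidates = np.flatnonzero(dists >= d_min)
    return rng.choice(candidates)

d_idx = sample_distractor(0, 50, palette)
```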

Speaker Word Distributions.

Just as in natural language, we allow fuzzy naming: the same color might, on different occasions, be denoted by different words. To estimate the probability distribution over words associated with a color chip c, we sample 25 words with replacement from Speaker after convergence. For instance, since Speaker's outputs form a categorical distribution over the vocabulary, if this distribution is a Dirac delta, the resulting set of 25 samples will contain a single unique word. At the other extreme, if Speaker has no confidence about c's category, we might get 25 distinct words, each equiprobably naming c.
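The sampling-based estimate of a chip's naming distribution might look like this (a Python sketch; the 16-word vocabulary and the near-Dirac Speaker distribution are invented for illustration):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def estimate_naming_distribution(speaker_probs, n_samples=25):
    """Estimate the fuzzy naming distribution for one chip by sampling
    words with replacement from the (converged) Speaker, as in the text."""
    vocab = len(speaker_probs)
    samples = rng.choice(vocab, size=n_samples, p=speaker_probs)
    counts = Counter(samples.tolist())
    return {w: c / n_samples for w, c in counts.items()}

# A near-Dirac Speaker: almost all mass on word 7.
peaked = np.full(16, 0.001)
peaked[7] = 0.985
peaked = peaked / peaked.sum()   # normalize to a proper distribution

dist = estimate_naming_distribution(peaked)
```

A uniform `speaker_probs` would instead spread the 25 samples over many distinct words, signaling an uncertain naming.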

Evaluating the Accuracy/Complexity Trade-Off.

To compare NN and human naming systems, we adopt the communication model of Zaslavsky et al. (7), keeping the same notation. U represents the set of world's objects, in our case, the set of colors; W represents the set of words; and M represents the set of Speaker's meanings. We assume that an NN Speaker, similarly to what is conjectured for humans (4, 7), internally represents each target color chip c as a Gaussian distribution m_c over U, centered at c and defined upon CIELAB color similarity. That is, for a given target chip c, Speaker constructs an internal representation m_c reflecting its belief about the color chip it wishes to communicate to Listener. The Gaussian m_c, of mean c, is then parameterized only by its variance, which captures Speaker's (un)certainty about its belief. Concretely, an m_c with low variance, reflecting a certain belief, would only cover c and a few neighboring chips in CIELAB space (e.g., slightly darker and lighter chips). Following Zaslavsky et al. (7), we use the same variance for all target chips. Note that m_c is only introduced to compute the accuracy and complexity measures below; it plays no direct role in the discrimination game. In the framework of Zaslavsky et al. (7), the complexity of a naming system is quantified by the number of bits of information required for expressing the intended meanings. As shown by Zaslavsky et al. (7), this is measured by the mutual information I(M; W) between M and W. Also following Zaslavsky et al. (7), we use I(W; U) to measure the accuracy of a naming system. The latter measure is inversely related to the Kullback–Leibler divergence between Speaker and Listener meanings. That is, the better Listener is at reconstructing Speaker's meaning, the larger I(W; U) is. The theoretically optimal trade-offs between complexity and accuracy are approximated by minimizing the IB objective function

F_β = I(M; W) − β · I(W; U),

where β is the trade-off parameter determining the relative weight a system attributes to complexity avoidance vs. accuracy maximization.
Both complexity and accuracy are quantified by mutual information terms. However, the IB objective minimizes the first term (lowering complexity) and maximizes the second (increasing accuracy; note the minus sign preceding the second term), two constraints that are in tension. To minimize the objective for a fixed β, we optimize over the stochastic encoders q(w | m), the variable being optimized. To get the theoretical-limit curve shown in Fig. 3, we repeat this procedure for each β, as described in SI Appendix. Refer to Zaslavsky et al. (in particular, Bounds on Semantic Efficiency in the main text of ref. 7 and SI Appendix, section S1.3 in ref. 7) for more details about definitions and derivations. The farther a system is from the theoretical-limit curve, the less efficient it is. To quantify the inefficiency of a system ℓ, characterized as a point (C_ℓ, A_ℓ) on the accuracy/complexity plane (Fig. 3), we introduce the score

e(ℓ) = min_β √((C_ℓ − C*_β)² + (A_ℓ − A*_β)²),

where (C*_β, A*_β) are the coordinates of the optimal naming system (on the theoretical-limit curve) for a fixed β.
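The two mutual-information quantities and the IB objective can be computed for a toy naming system as follows (a Python sketch under the notation above; the 4-meaning/2-word encoder and the precomputed accuracy value are illustrative, not from the paper):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution p(x, y)."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

def ib_objective(p_m, q_w_given_m, accuracy, beta):
    """F_beta = I(M;W) - beta * accuracy. In the full framework the
    accuracy term is itself I(W;U); here it is passed in precomputed
    for brevity."""
    joint = p_m[:, None] * q_w_given_m          # p(m, w)
    complexity = mutual_information(joint)      # I(M;W), in bits
    return complexity - beta * accuracy, complexity

# Toy system: 4 equiprobable meanings, 2 words, deterministic encoder.
p_m = np.full(4, 0.25)
q = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
f_beta, complexity = ib_objective(p_m, q, accuracy=0.9, beta=1.05)
```

Here the encoder transmits exactly one bit (which half of the meaning space the target is in), so `complexity` is 1.0 bit.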

Experiments and Results

Human vs. NN Naming Systems.

To simulate communities with different needs, we run the discrimination game varying the minimum target-distractor distance, defined in terms of the percentile of the nearest distractor (see Discriminative Need). With percentile < 20, target-distractor pairs are too close, and agents fail to converge. Above percentile = 80, there are no distractors sufficiently distant from any given target. We hence played the game with percentile ∈ {20, 30, 40, 50, 60, 70, 80}, resulting in 60 successful games in total. Control experiments are reported in SI Appendix. Looking at how human and NN naming systems spread along the IB curve in Fig. 3, we can make two striking observations. First, NN systems lie near the theoretical IB limit, just like human languages do. SI Appendix shows that NN system inefficiency (the score introduced above) falls within the human range. Second, both human and NN systems lie on a narrow segment of the curve. While this segment does not include the minimal complexity values, it is still clearly tilted toward the low-complexity end of the curve. Note that minimum complexity would be achieved by a system with a single color term. As this makes no communicative sense, we do not expect minimum-complexity systems to emerge. Intriguingly, neither WCS nor NN systems include two-word codes (which are exceedingly rare in natural languages in general) (39). SI Appendix shows that, even when we manipulate the game so that agents could achieve perfect discrimination with two words only, they minimally converge to a three-word system. We have no explanation for why two-word systems are avoided. Still, both WCS and NN systems come much closer to the lower end of the complexity scale than to the upper bound. In sum, standard NNs trained on the discrimination game develop systems that support efficient communication (i.e., are close to the IB curve) while preferring low complexity, similarly to human color-naming systems. Our focus here is on the IB trade-off. However, the way in which NN systems accomplish this trade-off is not radically different from that of human languages.
SI Appendix presents a detailed comparison of color partitioning in human and NN naming systems, highlighting partial differences but also important commonalities, in particular in terms of the convexity of regions corresponding to distinct color names (see also SI Appendix for a qualitative comparison between the two kinds of systems).

Effect of discriminative need.

Fig. 3 shows the NN systems resulting from exploring the full range of viable percentile values (the parameter controlling discriminative need). While all systems are efficient, we observe some variability in complexity (within the human range) that might be due to different discriminative needs. This is confirmed by Fig. 4, which shows NN naming-system complexity as a function of percentile. Smaller percentile values (requiring more granular discrimination) make systems more complex. Still, this trend is gradual, with no significant pairwise differences, suggesting that widely separated discriminative needs are required to observe a significant difference in systems' complexity. Furthermore, NN systems' complexity remains within the human range across the full range of percentile values. Interestingly, Fan et al. (14) showed, in the context of visual communication, that humans are also sensitive to discriminative need and adapt the complexity of their communicative system accordingly.
Fig. 4.

Complexity distributions of NN systems across different discriminative needs (human distribution included for comparison). There is a decreasing trend in complexity when increasing percentile (P = 0.004; Kruskal–Wallis). Pairwise differences are not significant when evaluated with Bonferroni-corrected Mann–Whitney–Wilcoxon.

Thus, discriminative need (or related environmental/societal pressures toward more or less granular distinctions) could account for the range of complexity variation we observe in NN and human naming systems (a range that might be somewhat underestimated by the WCS sample). However, discriminative need alone does not explain why the range of observed systems is so narrow.

Preference for low complexity.

Both human and NN systems show much lower complexity than what could be found in an optimal naming system by systematically varying the trade-off parameter β ∈ [1, +∞). The attested systems all occur within a small segment corresponding to β ∈ (1, 1.14]. One might conjecture that more complex codes do not evolve simply because the attested ones are sufficiently granular to support all required discriminations. For our NN agents at least, this is not the case, as they systematically fail to achieve 100% success in the discrimination game, which would instead be possible with more complex systems. To illustrate the latter, we generate additional naming systems by partitioning the color space using the "fuzzy c-means" (FCM) soft clustering algorithm (40), treating cluster labels as color names. We obtain different systems by varying the number-of-clusters hyperparameter. We then play the discrimination game with Speakers and Bayesian Listeners that use distributions derived from the soft clustering solutions. The FCM-based agents can reach 100% communication success at all percentiles. However, this comes at the cost of higher complexity. Table 1 compares, for each percentile, the 100%-successful FCM system with the lowest complexity to the NN system with the highest success rate. In all cases, NNs came up with systems that are considerably less complex but that also fail to reach perfect discrimination success. We conclude that the low complexity of NN systems cannot be explained by a lack of sufficient communicative pressure toward more complex solutions. We next explore other possible sources of the preference for low complexity.
Table 1.

Complexity and success rate (game accuracy after training) of FCM-based and NN systems as a function of the game percentile parameter

Percentile | FCM min complexity | Best NN complexity | Best NN success rate
20         | 5.39               | 2.50               | 95.45%
30         | 4.34               | 2.28               | 96.97%
40         | 4.01               | 2.23               | 95.76%
50         | 3.75               | 2.68               | 98.79%
60         | 3.44               | 2.17               | 96.97%
70         | 3.39               | 2.30               | 97.56%
80         | 3.12               | 2.24               | 98.78%

FCM success rate is always 100%. For FCM, we report minimal complexity among fully successful solutions. For NN, we report complexity and success rate of the system achieving highest success rate.

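A minimal version of the FCM baseline can be sketched as follows (Python/NumPy; for determinism we seed the cluster centers at the extremes of the first axis, whereas standard FCM uses random initialization, and the two-blob "color space" is invented for illustration):

```python
import numpy as np

def fuzzy_c_means(points, n_clusters=2, m=2.0, n_iter=100):
    """Minimal fuzzy c-means: returns the soft membership matrix u
    (points x clusters), which the FCM baseline reads as fuzzy color
    names. Centers are seeded at the two extremes of the first axis
    for a deterministic sketch."""
    idx = [int(points[:, 0].argmin()), int(points[:, 0].argmax())]
    centers = points[idx][:n_clusters]
    for _ in range(n_iter):
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)).
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        d = np.maximum(d, 1e-12)
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Center update: weighted mean with weights u^m.
        w = u ** m
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
    return u

# Toy "color space": two well-separated blobs should yield two crisp names.
rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(0, 0.5, (20, 3)),
                      rng.normal(10, 0.5, (20, 3))])
u = fuzzy_c_means(pts)
labels = u.argmax(axis=1)   # hard color names from soft memberships
```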

Roots of Efficiency and Complexity Avoidance.

Building on recent work (41), we explore the idea that the discrete nature of the communication channel acts as a bottleneck on the amount of information that the agents are able to transmit, leading them to establish efficient and low-complexity naming systems. Another natural bottleneck could be the agents' capacity: perhaps the "neural power" of our NNs does not suffice to develop more complex languages. We show next that channel discreteness plays a fundamental role in complexity reduction, whereas NN capacity only matters insofar as it allows the agents to further simplify the code in the presence of a discrete channel.

Effect of channel discreteness.

We fix percentile = 50 and explore different training regimes, ranging from a fully discrete setup to a virtually continuous one, relying on two methods commonly used to train deep networks in language-emergence scenarios (e.g., refs. 42 and 43; see also SI Appendix). The REINFORCE (RF) algorithm uses fully discrete symbol transmission during both training and evaluation. The Gumbel-Softmax (GS) method is fully discrete at evaluation time, but it estimates symbol probabilities through a smooth approximation during training. At training time, discrete symbols are approximated by continuous vectors with most of the mass concentrated around a single value. The "peakiness" (and thus discreteness) of this approximation is controlled by the temperature parameter τ. The lower the τ, the peakier the vector (practically converging to a discrete one-hot encoding for low τ). We explore increasing values of τ, corresponding to increasingly smoother communication channels. Settings with less smooth channels, and in particular fully discrete RF, are harder to train. Hence, we launch 60 runs for each GS setting and 180 for RF. In SI Appendix, we discuss the relation between channel smoothness and successful convergence, arguing that the high failure rate of the more discrete settings is due to a stronger complexity-reduction pressure. Fig. 5 shows that agents trained with RF (thus, in the completely discrete setting) develop significantly less complex systems than those trained with GS. Within GS, lower τ (more discreteness) leads to simpler codes. As complexity grows, smoother systems also become less efficient, an effect that is clear at the highest τ (Fig. 5). In SI Appendix, we study one concrete way in which these systems are inefficient, comparing them with complex but still efficient NN systems resulting from high discriminative need.
Fig. 5.

Complexity and inefficiency of NN color-naming systems trained with REINFORCE or GS with different temperatures τ. Pairwise differences evaluated with Bonferroni-corrected Mann–Whitney–Wilcoxon tests. *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001. Differences that are not significant are not marked.


Effect of agent capacity.

Only Speaker capacity has a significant impact on complexity, and only with a discrete communication channel. Interestingly, larger Speakers further reduce the complexity of the emerging naming system (SI Appendix). As further discussed in SI Appendix, a reasonable interpretation of this pattern is that, when the channel is discrete, transmitting information is difficult. Consequently, a "smarter" Speaker will use its extra computational power to come up with an encoding that allows it to transmit even fewer bits through the channel while maintaining reasonable accuracy. Thus, the agent-capacity experiments further confirm the importance of the discrete-channel bottleneck for complexity minimization.

Discussion

We have shown that NNs trained to play a color-discrimination game develop naming systems whose distribution on the accuracy/complexity trade-off plane strikingly resembles that of human languages. We obtained this result using game success as the sole training signal, without imposing any constraint on the emergent code, except that it had to consist of single discrete symbols. A very recent study by Kågebäck et al. (21) reports that deep NN agents trained with generic techniques to play a color-naming game strike a similarly human-like complexity/accuracy trade-off, despite important differences between their game and ours (in their setup, Listener receives only the message as input and has to reconstruct the color chip seen by Speaker), different methods to derive a discrete protocol, and different factors modulating the trade-off (the complexity cline, in their experiments, depends on different amounts of noise added to the communication channel). This constitutes important converging evidence that deep-network communication tends to naturally optimize the accuracy/complexity trade-off, independently of the specifics of the simulations. We observed, in particular, that the networks developed "low-complexity" systems, again in accordance with natural-language data. We then looked for the source of this low-complexity pressure in NN systems. Building on a recent study reporting similar results in artificial tasks (41), we showed that the presence of a discrete communication bottleneck plays a crucial role. As we relax discreteness, the emergent naming systems become complex beyond what is attested in human language and, eventually, significantly inefficient. In the last few years, much evidence has accumulated for the efficiency of human languages in general (6) and of semantic categorization in particular (4). Yet, we still lack a full scientific understanding of "why" language is efficient.
Our results provide two contributions relevant to this question. First, since efficiency and complexity avoidance also characterize the code evolved by communicating NNs, these factors cannot be explained away by least-effort factors specific to biological agents. Second, the fact that NNs exchange a discrete signal is crucial. Discreteness is a striking, possibly unique characteristic of human language (44, 45), often adduced as a precondition for the combinatorial infinity of expression that characterizes it (46). Our finding suggests that it might also be responsible for the efficient nature of semantic categorization (and possibly language at large). We do not have direct evidence on how the language of our ancestors became discrete and on how this affected the structure of semantic categorization. However, our computational results pave the way for experiments with contemporary humans, exploring how a continuous/discrete transition in communication systems affects the nature of information exchange. With human subjects, we might not have a direct equivalent of the GS temperature parameter. We can however build on a strong tradition of experimental semiotics studies using continuous signals, such as drawings, whistles, and nonconventionalized gestures, and sometimes reporting a tendency to discretize the signals as systematic communication strategies emerge (14, 47, 48). By using this framework, we should be able to design experiments that probe a causal relation between discreteness and communicative efficiency, ultimately strengthening our understanding of the roots of efficiency in language.

Materials and Methods

Human Languages.

We used the WCS database (www1.icsi.berkeley.edu/wcs/). Two languages with extremely sparse information (judgments from 1 speaker only for at least some chips) were removed, resulting in 108 analyzed languages. English, which is not in WCS, was approximated based on the relevant figures from the study by Zaslavsky et al. (7).

Agent Architecture and Training.

Both agents are feed-forward NNs. Speaker contains three hidden layers, each of size 1,000, with leaky-ReLU (rectified linear unit) activations. For each color, Speaker's output layer defines a Categorical distribution over its vocabulary. Listener is modeled as a linear layer of hidden size 5. The impact of the agents' capacity on the results is discussed in SI Appendix. Training NNs to communicate through a discrete channel is nontrivial, as we cannot backpropagate into Speaker through this bottleneck. We use two methods commonly employed in the deep-agent language-emergence literature: 1) GS relaxation (e.g., ref. 43) and 2) REINFORCE (e.g., ref. 42) (in both cases, Listener's gradients are obtained with standard backpropagation). We plug the obtained gradient estimates into the Adam optimizer (49).
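Speaker's forward pass, as described above, can be sketched like this (Python/NumPy with untrained, randomly initialized weights; the initialization scheme is our choice for the sketch, not the paper's):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def init_speaker(in_dim=3, hidden=1000, vocab=1024, seed=0):
    """Speaker as described: three leaky-ReLU hidden layers of size 1,000
    followed by a softmax output over the vocabulary (untrained weights,
    scaled by 1/sqrt(fan_in) as an illustrative initialization)."""
    r = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, hidden, vocab]
    return [(r.normal(0, 1 / np.sqrt(a), (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def speaker_forward(params, color):
    """Map a CIELAB color vector to a categorical distribution over words."""
    x = np.asarray(color, dtype=float)
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = leaky_relu(x)
    e = np.exp(x - x.max())        # stable softmax over the vocabulary
    return e / e.sum()

params = init_speaker()
probs = speaker_forward(params, [53.0, 80.0, 67.0])
```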

GS.

Samples from the GS distribution (50, 51) approximate those from a Categorical distribution through a reparameterization trick, thus enabling gradient-based training. Let σ denote the standard softmax function. To get a sample y that approximates a V-dimensional Categorical distribution with probability vector p, we draw g, where each component g_i ~ Gumbel(0, 1), and use it to calculate y such that

y = σ((log p + g) / τ),

where τ is the temperature hyperparameter. As τ tends to 0, the samples get closer to one-hot, making communication more discrete; as τ tends to infinity, the samples tend to uniform, resulting in smooth communication. At training time only, we use the relaxed samples as messages from Speaker, making the entire Speaker/Listener setup differentiable. We examine the impact of τ on Speaker's output distribution in SI Appendix.
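The GS sampling step can be sketched in a few lines of NumPy (a minimal single-vector sketch; practical implementations, e.g. in deep-learning frameworks, work on logits in batches, and the probabilities below are made up):

```python
import numpy as np

def gumbel_softmax_sample(p, tau, rng):
    """Relaxed sample approximating Categorical(p).

    p   : probability vector over the vocabulary
    tau : temperature; low tau -> near one-hot, high tau -> near uniform
    """
    # Gumbel(0, 1) noise via inverse-transform sampling
    g = -np.log(-np.log(rng.uniform(size=p.shape)))
    logits = (np.log(p) + g) / tau
    e = np.exp(logits - logits.max())   # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])
hot = gumbel_softmax_sample(p, tau=1e-3, rng=rng)    # nearly discrete
smooth = gumbel_softmax_sample(p, tau=1e3, rng=rng)  # nearly uniform
```

At low temperature the returned vector concentrates almost all mass on one word, which is what makes the trained channel effectively discrete.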

REINFORCE.

Following Schulman et al. (52), we sample Speaker's words w and estimate its gradients as follows:

∇̂ = ⊥(ℓ(ŷ, y)) ∇ log P(w | x_S),

where x_a are agent a's inputs, with a = S if the agent is Speaker and a = L if it is Listener; ŷ denotes Listener's prediction, y denotes the ground truth, and ℓ denotes the cross-entropy loss function; ⊥ refers to the "stop-gradient" operation. We use the standard running-mean baseline (53, 54) to reduce the variance of the estimate. To achieve more robust convergence, we also adopt the common trick of adding an entropy-maximization term (55, 56) on Speaker's words. This could favor higher code complexity, which makes our low-complexity result even more striking. When not stated otherwise, results are based on GS training with temperature τ = 1. Training consists of letting the agents play the game until their performance converges (this happens, on average, after about 6 million interactions). For each considered setting, we repeat the experiment with 20 different random initializations and focus the analysis only on the successful runs. We consider a run successful if, after convergence, the agents' success rate exceeds a fixed threshold. Following standard practice, success rates are computed in games in which the most likely word is deterministically selected from the Speaker distribution.
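A score-function (REINFORCE) estimate with a running-mean baseline can be sketched in NumPy. This is our own toy illustration on a made-up three-word problem, not the authors' training code; the per-word losses are hypothetical stand-ins for the Listener's cross-entropy:

```python
import numpy as np

def reinforce_grad(logits, loss_fn, baseline, rng):
    """Score-function gradient estimate for a categorical 'Speaker'.

    logits   : unnormalized log-probabilities over the vocabulary
    loss_fn  : maps a sampled word index to a scalar loss (treated as
               a constant, i.e. 'stop-gradient' is applied to it)
    baseline : running mean of past losses, subtracted to reduce variance
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()
    w = rng.choice(len(p), p=p)          # sample a word
    loss = loss_fn(w)
    # gradient of log p[w] w.r.t. the logits: one_hot(w) - p
    grad_logp = -p.copy()
    grad_logp[w] += 1.0
    return (loss - baseline) * grad_logp, loss

# usage: average many single-sample estimates, updating the baseline
rng = np.random.default_rng(1)
logits = np.zeros(3)                  # uniform Speaker over 3 words
losses = np.array([1.0, 0.0, 0.0])    # hypothetical per-word losses
baseline, grads = 0.0, np.zeros(3)
n = 20000
for _ in range(n):
    g, loss = reinforce_grad(logits, lambda w: losses[w], baseline, rng)
    grads += g
    baseline = 0.9 * baseline + 0.1 * loss   # running-mean baseline
grads /= n
```

Because the baseline only uses past samples, the estimate stays unbiased: the average converges to the true gradient of the expected loss.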

IB Curve.

We use the Agglomerative IB method (57), annealing over the trade-off parameter β. At each step of the annealing process, we evaluate the IB solution, i.e., the encoder q_β(w | m), for each value of β using Iterative IB (57). The latter is an iterative method that alternates between evaluating q_β(w | m) and m̂_w (Speaker's meaning for each word w) until convergence. We refer readers to Zaslavsky et al. (SI Appendix, section 1.4 in ref. 7) for more details about this two-step process. When annealing according to Agglomerative IB, the IB solution is initialized with the one found for the previous value of β. Optimization ends when the annealing schedule over β is exhausted.
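The alternating fixed-point updates behind Iterative IB can be sketched in NumPy. This is a toy single-β implementation on a made-up problem (four meanings whose referent distributions form two clusters), not the authors' code or the full Agglomerative IB annealing:

```python
import numpy as np

def iterative_ib(p_m, p_u_given_m, n_words, beta, n_iter=500, seed=0):
    """Iterative IB fixed-point equations for a single beta.

    p_m         : prior over meanings, shape (M,)
    p_u_given_m : referent distributions p(u|m), shape (M, U)
    Alternates between the encoder q(w|m) and the decoder q(u|w)
    ('Speaker's meaning' for each word w); returns q(w|m), shape (M, W).
    """
    eps = 1e-12
    rng = np.random.default_rng(seed)
    q_w_given_m = rng.dirichlet(np.ones(n_words), size=len(p_m))
    for _ in range(n_iter):
        q_w = p_m @ q_w_given_m                          # marginal q(w)
        # decoder: q(u|w) = sum_m q(m|w) p(u|m)
        q_u_given_w = (q_w_given_m * p_m[:, None]).T @ p_u_given_m
        q_u_given_w /= np.maximum(q_w[:, None], eps)
        # pairwise KL[ p(u|m) || q(u|w) ] for every (m, w)
        log_ratio = np.log(p_u_given_m[:, None, :] + eps) \
                  - np.log(q_u_given_w[None, :, :] + eps)
        kl = (p_u_given_m[:, None, :] * log_ratio).sum(-1)
        # encoder update: q(w|m) proportional to q(w) exp(-beta * KL)
        q_w_given_m = q_w[None, :] * np.exp(-beta * kl)
        q_w_given_m /= q_w_given_m.sum(axis=1, keepdims=True)
    return q_w_given_m

# toy problem: meanings {0,1} and {2,3} cluster around different referents
p_m = np.full(4, 0.25)
p_u_given_m = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
q = iterative_ib(p_m, p_u_given_m, n_words=2, beta=10.0)
```

At this fairly high β (accuracy weighted heavily), the encoder assigns one word per cluster; lowering β trades accuracy for a simpler, lower-complexity code.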