
Win-Stay-Lose-Shift as a self-confirming equilibrium in the iterated Prisoner's Dilemma.

Minjae Kim¹, Jung-Kyoo Choi², Seung Ki Baek¹.

Abstract

Evolutionary game theory assumes that players replicate a highly scored player's strategy through genetic inheritance. However, when learning occurs culturally, it is often difficult to recognize someone's strategy just by observing the behaviour. In this work, we consider players with memory-one stochastic strategies in the iterated Prisoner's Dilemma, with an assumption that they cannot directly access each other's strategy but only observe the actual moves for a certain number of rounds. Based on the observation, the observer has to infer the resident strategy in a Bayesian way and chooses his or her own strategy accordingly. By examining the best-response relations, we argue that players can escape from full defection into a cooperative equilibrium supported by Win-Stay-Lose-Shift in a self-confirming manner, provided that the cost of cooperation is low and the observational learning supplies sufficiently large uncertainty.


Keywords: Bayesian inference; Win-Stay-Lose-Shift; evolution of cooperation; observational learning; reciprocity


Year: 2021 PMID: 34187189 PMCID: PMC8242928 DOI: 10.1098/rspb.2021.1021

Source DB: PubMed Journal: Proc Biol Sci ISSN: 0962-8452 Impact factor: 5.349

Introduction

Evolutionary game theorists often assume that behavioural traits can be genetically transmitted across generations [1]. Along this line, researchers have investigated the genetic basis of cooperative behaviour [2,3]. However, humans learn many culture-specific behavioural rules through observational learning [4], and this mechanism mediates ‘cultural’ transmission, which has been shown to exist among a number of non-human animals as well [5,6]. Mirror neuron research suggests that the primate brain may even have a specialized circuit for imitating each other’s behaviour, which facilitates social learning [7-9]. In comparison with direct genetic transmission, non-genetic inheritance through social learning can provide better adaptability by responding faster to environmental changes [10]. In contrast with genetic inheritance, however, observational learning may lead to imperfect mimicry if observation is not sufficiently informative or is subject to a systematic bias. The notion of self-confirming equilibrium (SCE) has been proposed to incorporate such imperfection of observation in learning [11]: when an SCE strategy is played, some of the possible information sets may not be reached, so the players do not have exact knowledge but only untested beliefs about what their co-players would do at those unreached sets. It is nevertheless sustained as an equilibrium in the sense that no player can expect a better payoff by unilaterally deviating from it given such beliefs, and that the beliefs do not conflict with observed moves. The dynamics of learning based on a limited set of information have been investigated in the context of the coordination game [12,13], in which the opponent’s observed decision is assumed to be his or her strategy. However, the subtlety of cultural transmission manifests itself clearly when a strategy is regarded as a decision rule, hidden from the observer, rather than the decision itself.
In this work, we investigate the iterated Prisoner’s Dilemma (PD) game among players with memory-one strategies, who infer the resident strategy from observation and optimize their own strategies against it. By memory-one, we mean that a player refers to the previous round to choose a move between cooperation and defection [14]. If we restrict ourselves to memory-one strategies, it is already well known in evolutionary game theory that ‘Win-Stay-Lose-Shift (WSLS)’ [15-17] can appear through mutation and take over the population from defectors if the cost of cooperation is low [14]. Compared with such an evolutionary approach, we will impose ‘less bounded’ rationality in that our players are assumed to be capable of computing the best response to a given strategy within the memory-one pure-strategy space. We will identify the best-response dynamics in this space and examine how the dynamics should be modified when observational learning introduces uncertainty in Bayesian inference about strategies. If every player exactly replicated each other’s strategy, full defection would be a Nash equilibrium (NE) for any cost of cooperation. Under uncertainty in observation, however, our finding is that defection is not always an SCE, so that the population can move to a cooperative equilibrium supported by WSLS, which is both an SCE and an NE and can thus be called a SCENE.

Method and result

Best-response relations without observational uncertainty

Let us define the one-shot PD game by the following payoff matrix, where each row is the focal player’s move, each column is the co-player’s move, and each entry is the focal player’s payoff:

	C	D
C	1 − c	−c
D	1	0
(2.1)

Here we abbreviate cooperation and defection as C and D, respectively, and c is the cost of cooperation, assumed to satisfy 0 < c < 1. In this work, the game of equation (2.1) will be repeated indefinitely. Furthermore, the environment is noisy: even if a player intends to cooperate, the move can be misimplemented as defection, or vice versa, with probability ε. In the analysis below, we will take ε as an arbitrarily small positive number. We will restrict ourselves to the space of memory-one (M1) pure strategies. By an M1 pure strategy, we mean one that chooses a move between C and D as a function of the two players’ moves in the previous round. We thus describe such a strategy as [p_CC, p_CD, p_DC, p_DD], where p_XY = 1 means that C is prescribed when the player and the co-player did X and Y, respectively, in the previous round, and p_XY = 0 means that D is prescribed in the same situation. Note that the initial move in the first round is irrelevant to the long-term average payoff in the presence of error, so it has been discarded in the description of a strategy. The set of M1 pure strategies, denoted by Δ, contains 16 elements from d0 ≡ [0, 0, 0, 0] to d15 ≡ [1, 1, 1, 1]. Let us assume that a player, say, Alice, takes an M1 pure strategy d_i ∈ Δ as her strategy. The noisy environment effectively modifies her behaviour to (1 − ε) d_i + ε (𝟏 − d_i), as if she were playing a mixed strategy, where 𝟏 ≡ [1, 1, 1, 1]. Likewise, Alice’s co-player Bob chooses d_j ∈ Δ, and his effective behaviour is described by (1 − ε) d_j + ε (𝟏 − d_j). The repeated interaction between Alice and Bob is Markovian, and it is straightforward to obtain the stationary probability distribution v = (v_CC, v_CD, v_DC, v_DD), where v_XY means the long-term average probability to observe Alice and Bob choosing X and Y, respectively [18-20] (see appendix A for more details). The presence of ε > 0 guarantees the uniqueness of v. Alice’s long-term average payoff against Bob is then calculated as the inner product v · P (equation (2.5)), where P ≡ (1 − c, −c, 1, 0) is a payoff vector corresponding to equation (2.1).
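The construction above can be made concrete with a short numerical sketch. The code is not taken from the paper: the function names (`noisy`, `stationary`, `payoff`) are hypothetical, the state order (CC, CD, DC, DD) with the focal player's move first is an assumed convention, and the stationary distribution is obtained by repeatedly squaring the transition matrix rather than by the analytic treatment of appendix A.

```python
# Sketch only: names and numerical method are illustrative assumptions,
# not the authors' implementation.

def noisy(d, eps):
    """Effective cooperation probabilities when each intended move flips with probability eps."""
    return [(1 - eps) * x + eps * (1 - x) for x in d]

def stationary(d_alice, d_bob, eps, n_squarings=50):
    """Stationary distribution (v_CC, v_CD, v_DC, v_DD), Alice's move written first."""
    p = noisy(d_alice, eps)
    q_own = noisy(d_bob, eps)
    # Bob indexes states by his own move first, so CD and DC are swapped for him.
    q = [q_own[0], q_own[2], q_own[1], q_own[3]]
    # T[s][t]: probability of moving from state s to state t in one round.
    T = [[p[s] * q[s], p[s] * (1 - q[s]), (1 - p[s]) * q[s], (1 - p[s]) * (1 - q[s])]
         for s in range(4)]
    for _ in range(n_squarings):
        T = [[sum(T[i][k] * T[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
        T = [[x / sum(row) for x in row] for row in T]  # keep rows normalized
    return T[0]  # every row converges to the stationary distribution

def payoff(d_alice, d_bob, c, eps):
    """Alice's long-term average payoff: inner product of v with P = (1 - c, -c, 1, 0)."""
    v = stationary(d_alice, d_bob, eps)
    P = (1 - c, -c, 1.0, 0.0)
    return sum(vi * Pi for vi, Pi in zip(v, P))

ALLD, WSLS, ALLC = [0, 0, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1]

if __name__ == "__main__":
    c, eps = 0.3, 1e-3
    print("WSLS vs WSLS:", payoff(WSLS, WSLS, c, eps))  # close to 1 - c
    print("AllD vs AllD:", payoff(ALLD, ALLD, c, eps))  # (1 - c) * eps up to rounding
```

Repeated squaring was chosen only because it needs no external library; any linear solver for the stationary equation would serve equally well.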
As long as Alice can exactly identify Bob’s strategy with no observational uncertainty, she can find the best response to Bob within the set of M1 pure strategies by evaluating equation (2.5) for every candidate strategy in Δ. In table 1, we list the best response to each strategy in Δ in the limit of small ε (see also figure 1 for its graphical representation). In most cases, the best-response dynamics ends up with d0 = [0, 0, 0, 0], which is the best response to itself and often called Always-Defect (AllD). For example, if we start with Tit-for-Tat (TFT), represented as d10 = [1, 0, 1, 0], table 1 shows that the best response to TFT within Δ is Always-Cooperate (AllC), represented as d15 = [1, 1, 1, 1], to which AllD is the best response for obvious reasons.
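The enumeration described above can be sketched numerically as follows. This is a hedged illustration rather than the authors' code: `strategy`, `payoff` and `best_response` are hypothetical names, the noise level `eps=1e-3` merely stands in for the small-ε limit, and states are ordered (CC, CD, DC, DD) with the focal player's move first.

```python
# Sketch only: brute-force best response within the 16 memory-one pure strategies.

def strategy(k):
    """d_k as [p_CC, p_CD, p_DC, p_DD], i.e. the binary digits of k (d9 = [1, 0, 0, 1])."""
    return [(k >> 3) & 1, (k >> 2) & 1, (k >> 1) & 1, k & 1]

def payoff(d_alice, d_bob, c, eps, n_squarings=50):
    """Alice's long-term average payoff against Bob under implementation error eps."""
    p = [(1 - eps) * x + eps * (1 - x) for x in d_alice]
    qo = [(1 - eps) * x + eps * (1 - x) for x in d_bob]
    q = [qo[0], qo[2], qo[1], qo[3]]  # Bob sees CD and DC swapped
    T = [[p[s] * q[s], p[s] * (1 - q[s]), (1 - p[s]) * q[s], (1 - p[s]) * (1 - q[s])]
         for s in range(4)]
    for _ in range(n_squarings):  # T -> T^2, rows converge to the stationary distribution
        T = [[sum(T[i][k] * T[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
        T = [[x / sum(row) for x in row] for row in T]
    v = T[0]
    P = (1 - c, -c, 1.0, 0.0)
    return sum(vi * Pi for vi, Pi in zip(v, P))

def best_response(k_opponent, c, eps=1e-3):
    """Index of the pure strategy with the highest payoff against d_{k_opponent}."""
    scores = {k: payoff(strategy(k), strategy(k_opponent), c, eps) for k in range(16)}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    for k in range(16):
        print(f"d{k} -> d{best_response(k, c=0.3)}")
```

Because the comparison is made at a finite ε, entries that are separated only at order ε are resolved numerically rather than by the series expansion used in the text.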

Table 1

Best response among M1 pure strategies. Against each strategy in the first column, we obtain the best response (the second column), and the resulting average payoff (equation (2.5)) earned by the best response is given as a power series of ε in the third column. In the second column, we have placed a dagger next to a strategy when it is the best response to itself.

opponent strategy	best response	payoff of the best response to the opponent strategy	common name
d₀	d₀†	(1 − c)ε	AllD
d₁	d₀	1/2 − (1/4 + c)ε + O(ε²)
d₂	d₁₁	(1 − c)/2 − (1 + c)ε/2 + O(ε²)
d₃	d₀	1/2 − cε + O(ε³)
d₄	d₀	1/3 + (2/9 − c)ε + O(ε²)
d₅	d₀	1 − (2 + c)ε + O(ε²)
d₆	d₉	1 − 3(1 + c)ε + O(ε²)
d₇	d₀	1 − (2 + c)ε + 4ε² + O(ε³)
d₈	d₈† if c > 1/3; d₁₅ if c < 1/3	3(1 − c)ε/2 + O(ε²) if c > 1/3; 1/3 − c + O(ε) if c < 1/3	GT₁
d₉	d₀ if c > 1/2; d₉† if c < 1/2	1/2 + O(ε) if c > 1/2; 1 − c + O(ε) if c < 1/2	WSLS
d₁₀	d₁₅	(1 − c) − (2 − c)ε + O(ε²)	TFT
d₁₁	d₀ if c > 1/2; d₁₃ if c < 1/2	1/2 + (1/4 − c)ε + O(ε²) if c > 1/2; (1 − c) − (2 − c)ε + O(ε²) if c < 1/2
d₁₂	d₀	1/2 + O(ε)
d₁₃	d₀	1 − (1 + c)ε + O(ε²)
d₁₄	d₁	1 − 2(1 + c)ε + O(ε²)
d₁₅	d₀	1 − (1 + c)ε + O(ε³)	AllC

Figure 1

Graphical representation of best-response relations in table 1. If d_j is the best response to d_i, we represent it as an arrow from d_i to d_j. The blue node (Win-Stay-Lose-Shift) means an efficient NE with 1 − v_CC ∼ O(ε), whereas the red nodes (Always-Defect and M1 Grim Trigger) mean inefficient ones with 1 − v_DD ∼ O(ε), as shown in table 2. (Online version in colour.)

Table 2

Stationary probability distribution v = (v_CC, v_CD, v_DC, v_DD) when the strategy plays against itself, where we have retained only the leading-order term in the ε-expansion for each v_XY. When we describe a strategy in binary, the digits at states with v_XY ∼ O(1) (set in boldface in the original table) are the ones that are frequently observed and thus readily identifiable as long as M ≫ 1. In this table, the eight strategies in Category I have three or four such digits, so if the population is using one of these strategies, Alice can tell which one is being played after M (≫ 1) observations. As for Category II, the member strategies d1 and d7 would be indistinguishable if M ≪ ε⁻¹ because they differ only at digits that are rarely observed. Still, Alice can find the best response d0, which is common to both of them (table 1). In Category III, each member strategy has just one frequently observed digit, so the strategies as well as the best responses can be identified only if M ≫ ε⁻¹.

category	strategy	v_CC	v_CD	v_DC	v_DD
I	d3=[0,0,1,1]	1/4	1/4	1/4	1/4
I	d5=[0,1,0,1]	1/4	1/4	1/4	1/4
I	d10=[1,0,1,0]	1/4	1/4	1/4	1/4
I	d12=[1,1,0,0]	1/4	1/4	1/4	1/4
I	d2=[0,0,1,0]	ε/2	1/4	1/4	1/2
I	d4=[0,1,0,0]	ε/2	1/4	1/4	1/2
I	d11=[1,0,1,1]	1/2	1/4	1/4	ε/2
I	d13=[1,1,0,1]	1/2	1/4	1/4	ε/2
II	d1=[0,0,0,1]	1/2	ε	ε	1/2
II	d7=[0,1,1,1]	1/2	ε	ε	1/2
III	d0=[0,0,0,0]	ε²	ε	ε	1
III	d6=[0,1,1,0]	2ε	ε	ε	1
III	d8=[1,0,0,0]	ε/2	ε	ε	1
III	d9=[1,0,0,1]	1	ε	ε	2ε
III	d14=[1,1,1,0]	1	ε	ε	ε/2
III	d15=[1,1,1,1]	1	ε	ε	ε²

However, two exceptions exist: The first one is d8 = [1, 0, 0, 0], which we may call M1 Grim Trigger (GT1). If c > 1/3, this strategy is the best response to itself, and it is an inefficient equilibrium giving each player an average payoff of O(ε). The other exception is WSLS, represented by d9 = [1, 0, 0, 1], which is the best response to itself when c ≤ 1/2. It is an efficient NE, at which each player earns 1 − c + O(ε) per round on average.

Observational learning

Now, let us imagine a monomorphic population of players who have adopted a strategy d_i ∈ Δ in common. The population is in equilibrium in the sense that a large ensemble of their states XY ∈ {CC, CD, DC, DD} can represent the stationary probability distribution v = (v_CC, v_CD, v_DC, v_DD). We have an observer, say, Alice, with a potential strategy d_j ∈ Δ. As we learn social norms in childhood, it is assumed that Alice does not yet participate in the game but has a learning period to observe M (≫ 1) pairs of players, all of whom have used the resident strategy d_i. How their minds work is a black box to her: just by observing their states XY and subsequent moves, Alice has to form a belief about d_i, based on which she chooses her own strategy d_j to maximize the expected payoff. If Alice’s optimal strategy d_j turns out to be identical to the resident strategy d_i, it constitutes an SCE. To see how Alice can specify d_i from observation, let us consider an example in which the observed probability distribution over states XY is best described as v ≈ (0, 1/4, 1/4, 1/2). If Alice has computed v for every strategy in Δ as listed in table 2, the observation suggests that the resident strategy is unlikely to be TFT (d10 = [1, 0, 1, 0]) because the corresponding stationary distribution would be v = (1/4, 1/4, 1/4, 1/4). She finds that d_i can be either d2 = [0, 0, 1, 0] or d4 = [0, 1, 0, 0]. To distinguish between them, she has to check how people react to CD or DC. According to table 2, these states will be observed frequently because v_CD = v_DC = 1/4. Thus, in this example, Alice succeeds in identifying d_i as long as M ≫ 1. Eight strategies have this property, constituting Category I in Δ (table 2). As another example, if v ≈ (1/2, 0, 0, 1/2), Alice sees that d_i must be either d1 = [0, 0, 0, 1] or d7 = [0, 1, 1, 1]. To resolve the uncertainty, she has to further check how people react to CD or DC, but she may actually save this effort because the best response turns out to be d0 in either case (table 1).
This is the case of Category II in Δ (table 2). In general, the first important piece of information to infer is the stationary distribution v because it heavily depends on the resident strategy d_i (table 2). However, the information of v may be insufficient to single out the answer: suppose that v gives multiple candidate strategies which prescribe different moves at a certain state XY and thus have different best responses. Alice then needs to observe what players actually choose at XY, and such observations should be performed sufficiently many times, i.e. M v_XY ≫ 1, for the sake of statistical power. If we check every strategy in Δ one by one in this way, we see that the best response to the resident strategy can readily be identified as long as M ≫ ε⁻¹, in which case the result of observational learning would be the same as that of exact identification of strategies. If M ≪ ε⁻¹, on the other hand, Alice cannot fully resolve such uncertainty through observation. Still, note that M should be taken as far greater than O(1) for statistical inference to be meaningful.
Furthermore, ε has been introduced as a regularization parameter whose exact magnitude is irrelevant, so we look at the behaviour in the limit of small ε. When 1 ≪ M ≪ ε⁻¹, uncertainty in the best response remains only when v ≈ (0, 0, 0, 1) or (1, 0, 0, 0), both of which are characteristic of Category III in table 2. In the former case, d0, d6 and d8 are the candidate strategies for the resident strategy, whereas in the latter case, the candidates are d9, d14 and d15. From the Bayesian perspective, it is reasonable to assign equal prior probability to each of the candidate strategies. However, if Mε ≪ 1, the number of observations cannot be enough to update this prior probability (see appendix B for a detailed discussion). Therefore, when v ≈ (0, 0, 0, 1), so that the resident strategy is d0, d6 or d8, Alice tries to maximize the expected payoff over the three possibilities, and the calculation shows that this is achieved by playing the strategy given in equation (2.7) in the limit of ε → 0. Likewise, when v ≈ (1, 0, 0, 0), so that the resident strategy is d9, d14 or d15, Alice tries to maximize her expected payoff over the three possibilities, which is achieved by playing the strategy given in equation (2.8) as ε → 0. Now, AllD ceases to be the best-looking response to itself (figure 2): against a population that is observed to defect almost always, the expected payoff is highest when WSLS is played, if c < 16/33. On the other hand, if we consider a WSLS population with c < 2/9, its cooperative equilibrium is protected from invasion of defectors because Alice under observational uncertainty will keep choosing WSLS, which is truly the best response to itself.
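The expected-payoff maximization under a uniform prior can be sketched numerically as below. This is an illustrative reconstruction, not the paper's calculation: the names `strategy`, `payoff` and `best_looking_response` are hypothetical, a finite `eps=1e-4` stands in for the limit ε → 0, and states are ordered (CC, CD, DC, DD) with the focal player's move first.

```python
# Sketch only: best-looking response when the resident strategy is known only
# up to a set of candidates that share the same observed stationary distribution.

def strategy(k):
    """d_k as [p_CC, p_CD, p_DC, p_DD], the binary digits of k."""
    return [(k >> 3) & 1, (k >> 2) & 1, (k >> 1) & 1, k & 1]

def payoff(d_alice, d_bob, c, eps, n_squarings=50):
    """Alice's long-term average payoff against Bob under implementation error eps."""
    p = [(1 - eps) * x + eps * (1 - x) for x in d_alice]
    qo = [(1 - eps) * x + eps * (1 - x) for x in d_bob]
    q = [qo[0], qo[2], qo[1], qo[3]]  # Bob sees CD and DC swapped
    T = [[p[s] * q[s], p[s] * (1 - q[s]), (1 - p[s]) * q[s], (1 - p[s]) * (1 - q[s])]
         for s in range(4)]
    for _ in range(n_squarings):
        T = [[sum(T[i][k] * T[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
        T = [[x / sum(row) for x in row] for row in T]
    P = (1 - c, -c, 1.0, 0.0)
    return sum(vi * Pi for vi, Pi in zip(T[0], P))

def best_looking_response(candidates, c, eps=1e-4, prior=None):
    """Strategy index maximizing the prior-weighted payoff over candidate residents."""
    if prior is None:
        prior = [1.0 / len(candidates)] * len(candidates)  # uniform prior
    def expected(k):
        return sum(f * payoff(strategy(k), strategy(r), c, eps)
                   for f, r in zip(prior, candidates))
    return max(range(16), key=expected)

if __name__ == "__main__":
    print(best_looking_response([0, 6, 8], c=0.3))    # population looks like full defection
    print(best_looking_response([9, 14, 15], c=0.1))  # population looks like full cooperation
```

Passing a non-uniform `prior` list to the same function gives a way to explore how the choice depends on the weights assigned to the candidates.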

Figure 2

Best-looking responses to maximize the expected payoff under uncertainty in observation, when 1 ≪ M ≪ ε⁻¹. Compared with figure 1, the first difference is that Alice uses equation (2.7) against d0, d6 and d8. In addition, she will use equation (2.8) against d9, d14 and d15. (Online version in colour.)

The above analysis concerns the uniform prior among three candidate strategies in each case. Let f_i denote the fraction of d_i in the prior. For an observer who almost always sees defection from the population, the prior in equation (2.6) can be written as (f0, f6, f8) = (1/3, 1/3, 1/3). For a general prior (f0, f6, f8) with 0 < f_i < 1 and f8 = 1 − f0 − f6, the condition for WSLS to give the highest expected payoff is summarized as the intersection of two inequalities on f6 (figure 3a). The inequalities are written for f6 because it is d6 that has WSLS as the best response (table 1). If c > 1/3, the first of the two inequalities becomes trivial because of the positivity of f6. Note that WSLS still gives the highest expected payoff for a significant part of the simplex even when the cost of cooperation is as high as c = 0.9 (figure 3b).
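As a consistency check on the threshold c < 16/33, the comparison for a defecting population can be worked out at leading order in ε. The payoffs below were computed here from the Markov chain of the previous section and are not quoted from the paper, so the two inequalities may differ in form from the paper's own; they compare WSLS only with GT₁ and AllC, which is an assumption about which rivals matter.

```latex
% Leading-order payoffs (\epsilon \to 0) of the row strategy against d_0, d_6, d_8
\begin{array}{l|ccc}
                      & d_0  & d_6     & d_8      \\ \hline
\text{WSLS } (d_9)    & -c/2 & 1       & (1-3c)/5 \\
\text{GT}_1\ (d_8)    & 0    & 2/3     & 0        \\
\text{AllC } (d_{15}) & -c   & 1/2 - c & 1/3 - c
\end{array}

% WSLS earns more than GT_1:
-\frac{c}{2}\, f_0 + \frac{1}{3}\, f_6 + \frac{1-3c}{5}\, f_8 > 0

% WSLS earns more than AllC (every coefficient is positive once c > 1/3):
\frac{c}{2}\, f_0 + \left(\frac{1}{2} + c\right) f_6 + \frac{2(3c-1)}{15}\, f_8 > 0

% Uniform prior f_0 = f_6 = f_8 = 1/3 in the first inequality:
\frac{8}{15} - \frac{11}{10}\, c > 0 \iff c < \frac{16}{33}
```

With f8 = 1 − f0 − f6 substituted, both conditions become linear constraints on f6 for given f0 and c.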

Figure 3

Effect of the prior on the observer’s choice. A point in the triangle represents three fractions, which sum up to one, and its distance to an edge is proportional to the fraction of the strategy at the opposite vertex [21]. (a) When the observer sees nearly defection only, the prior takes the form of (f0, f6, f8), for which we can find the strategy that gives the best expected payoff as written in each region. When c is low, d9 (WSLS) gives the highest expected payoff for most of the prior. (b) Even when the cost increases to c = 0.9, the observer should choose WSLS if the prior contains a sufficiently high fraction of d6. (c) If the observer sees cooperation almost all the time, the prior can be expressed as (f9, f14, f15). If c is low, WSLS can be the observer’s choice when f9 is high enough. (d) The region of WSLS disappears as c exceeds 1/2, and the only possible choice is between d1 and d0 (AllD).

Similarly, we can check what an observer would conclude after observing nearly cooperation only, although it is of less importance compared with the above case of a defecting population (figure 2). For a general prior represented by (f9, f14, f15), where f14 = 1 − f9 − f15, WSLS gives the highest expected payoff when f9 is sufficiently high, as can be seen in figure 3c. This condition can be satisfied only if c ≤ 1/2: otherwise, it is better to be a defector by playing d0 or d1 (figure 3d).
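The same leading-order bookkeeping can be sketched for a cooperating population. Again, the payoffs were computed here from the Markov chain rather than quoted from the paper, and comparing WSLS only with d1 is an assumption motivated by figure 3d; the resulting condition reproduces both the threshold c < 2/9 for the uniform prior and the requirement c < 1/2.

```latex
% Leading-order payoffs (\epsilon \to 0) of the row strategy against d_9, d_14, d_15
\begin{array}{l|ccc}
                   & d_9     & d_{14} & d_{15} \\ \hline
\text{WSLS } (d_9) & 1 - c   & 1 - c/3 & 1 - c/2 \\
d_1                & (2-c)/3 & 1       & 1
\end{array}

% WSLS earns more than d_1:
\frac{1-2c}{3}\, f_9 > \frac{c}{3}\, f_{14} + \frac{c}{2}\, f_{15}
\iff 2(1-2c)\, f_9 > c\,(2 f_{14} + 3 f_{15})

% Uniform prior f_9 = f_{14} = f_{15} = 1/3:
2(1-2c) > 5c \iff c < \frac{2}{9}
```

The left-hand side is positive only for c < 1/2, so no prior with positive f14 or f15 can favour WSLS once the cost exceeds that value.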

Summary and discussion

In summary, we have investigated the iterated PD game in terms of best-response relations and checked how they are modified by observational learning. Thereby we have addressed the question of how cooperation is affected by cultural transmission, which may systematically involve observational uncertainty. The notion of SCE takes this systematic uncertainty into account, and its intersection with NE can be an equilibrium refinement. It is worth pointing out the following: if everyone plays a certain strategy d_i with the belief that everyone else does the same, the whole situation is self-consistent in the sense that observation will always confirm the belief, which in turn agrees with the actual behaviour. The importance of SCENE becomes clear when someone happens to play a different strategy or begins to doubt the belief: if d_i is not an NE, the player will benefit from the deviant behaviour and reinforce it. If d_i is not an SCE, the player may fail to dispel the doubt, which will undermine the prevailing culture. Therefore, the strategy has to be a SCENE to be transmitted in a stable manner through observational learning.

As a reference point, we have started with the conventional assumption that one can identify a strategy without uncertainty, and checked the best-response relations within the set of M1 pure strategies. Our finding is that a symmetric NE is possible if one uses one of the following three strategies: AllD, GT1 and WSLS (figure 1). Only the last one is efficient. Although we have restricted ourselves to pure strategies, we can discuss the idea behind this restriction as follows: let us consider a monomorphic population playing a mixed strategy q = [q_CC, q_CD, q_DC, q_DD], where each element means the probability to cooperate in a given circumstance. Such a mixed strategy can be represented as a point inside a four-dimensional unit hypercube. The observer seeks the best response to it, say, p = [p_CC, p_CD, p_DC, p_DD].
Suppose that p also turns out to be a mixed strategy, say, containing d_k and d_l with k ≠ l. According to the Bishop–Cannings theorem [22], it implies that d_k and d_l earn the same payoff against q, and this equality imposes a set of constraints on q, rendering the dimensionality of the solution manifold lower than four. Therefore, for almost all q in the four-dimensional hypercube, only one pure strategy will be found as the best response. In appendix C, we provide an explicit proof of this argument in the case of reactive strategies.

Even if our theoretical framework of Bayesian best-response dynamics is an idealization, we believe that it captures certain aspects of animal behaviour. For example, although the best-response dynamics per se performs poorly in explaining learning behaviour because of its deterministic character [23], its modified versions can provide a reasonable description of experimental results [24,25]. In addition, some studies show that Bayesian updating yields results consistent with the observed behaviour of animals, including mammals, birds, a fish and an insect, in foraging and reproduction activities [26]. These studies support the Bayesian brain hypothesis, which argues that the brain has to successfully simulate the external world in which Bayes’ theorem holds [27]. We also point out that the posterior can be calculated correctly even if the observer has short-term memory as implied by the M1 assumption: as long as input observations are exchangeable with each other, Bayesian updating can be done in a sequential manner (i.e. by modifying the prior little by little every time a new observation arrives), and it is mathematically equivalent to a batch update that uses all the observations at once.

To conclude, if we take observational learning into consideration, our result suggests that WSLS can be a SCENE to a Bayesian observer, whereas AllD cannot under observational uncertainty.
That is, if the number of observations is too small to reveal how players behave after an error, the uncertainty provides a way to escape from full defection, whereas WSLS can still maintain cooperation: the point is that AllD is not easy to learn by observing defectors because it is difficult to tell what they would choose if someone actually cooperated. WSLS is also difficult to learn, but the uncertainty works in an asymmetric way because one can expect more from mutual cooperation than from full defection by the very definition of the PD game.
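The equivalence between sequential and batch Bayesian updating mentioned in the discussion can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the encoding of an observation as a (state index, move) pair with states ordered (CC, CD, DC, DD) from the observed player's viewpoint, and the likelihood model in which an intended move is flipped with probability ε.

```python
import math

# Sketch only: posterior over the candidate resident strategies d0, d6 and d8.
CANDIDATES = {"d0": [0, 0, 0, 0], "d6": [0, 1, 1, 0], "d8": [1, 0, 0, 0]}

def likelihood(d, state, move, eps):
    """Probability that a player using d plays `move` (1 = C, 0 = D) after `state`."""
    p_coop = (1 - eps) * d[state] + eps * (1 - d[state])
    return p_coop if move == 1 else 1 - p_coop

def normalize(weights):
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def sequential_posterior(prior, observations, eps):
    """Update the belief after every single observation, keeping only the current belief."""
    belief = dict(prior)
    for state, move in observations:
        belief = normalize({k: belief[k] * likelihood(CANDIDATES[k], state, move, eps)
                            for k in belief})
    return belief

def batch_posterior(prior, observations, eps):
    """Use all observations at once."""
    return normalize({k: prior[k] * math.prod(likelihood(CANDIDATES[k], s, m, eps)
                                              for s, m in observations)
                      for k in prior})

if __name__ == "__main__":
    prior = {k: 1 / 3 for k in CANDIDATES}
    # Twenty defections after DD carry no information; one cooperation after CD does.
    obs = [(3, 0)] * 20 + [(1, 1)]
    print(sequential_posterior(prior, obs, eps=0.01))
    print(batch_posterior(prior, obs, eps=0.01))
```

The sequential version stores only the current belief, which is the sense in which a short-memory observer can still arrive at the same posterior.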

