Paula Parpart1, Matt Jones2, Bradley C Love3. 1. University College London, United Kingdom. 2. University of Colorado Boulder, United States. 3. University College London, United Kingdom; The Alan Turing Institute, United Kingdom. Electronic address: b.love@ucl.ac.uk.
Abstract
Simple heuristics are often regarded as tractable decision strategies because they ignore a great deal of information in the input data. One puzzle is why heuristics can outperform full-information models, such as linear regression, which make full use of the available information. These "less-is-more" effects, in which a relatively simpler model outperforms a more complex model, are prevalent throughout cognitive science, and are frequently argued to demonstrate an inherent advantage of simplifying computation or ignoring information. In contrast, we show at the computational level (where algorithmic restrictions are set aside) that it is never optimal to discard information. Through a formal Bayesian analysis, we prove that popular heuristics, such as tallying and take-the-best, are formally equivalent to Bayesian inference under the limit of infinitely strong priors. Varying the strength of the prior yields a continuum of Bayesian models with the heuristics at one end and ordinary regression at the other. Critically, intermediate models perform better across all our simulations, suggesting that down-weighting information with the appropriate prior is preferable to entirely ignoring it. Rather than because of their simplicity, our analyses suggest heuristics perform well because they implement strong priors that approximate the actual structure of the environment. We end by considering how new heuristics could be derived by infinitely strengthening the priors of other Bayesian models. These formal results have implications for work in psychology, machine learning and economics.
Simple heuristics are often regarded as tractable decision strategies because they ignore a great deal of information in the input data. One puzzle is why heuristics can outperform full-information models, such as linear regression, which make full use of the available information. These "less-is-more" effects, in which a relatively simpler model outperforms a more complex model, are prevalent throughout cognitive science, and are frequently argued to demonstrate an inherent advantage of simplifying computation or ignoring information. In contrast, we show at the computational level (where algorithmic restrictions are set aside) that it is never optimal to discard information. Through a formal Bayesian analysis, we prove that popular heuristics, such as tallying and take-the-best, are formally equivalent to Bayesian inference under the limit of infinitely strong priors. Varying the strength of the prior yields a continuum of Bayesian models with the heuristics at one end and ordinary regression at the other. Critically, intermediate models perform better across all our simulations, suggesting that down-weighting information with the appropriate prior is preferable to entirely ignoring it. Rather than because of their simplicity, our analyses suggest heuristics perform well because they implement strong priors that approximate the actual structure of the environment. We end by considering how new heuristics could be derived by infinitely strengthening the priors of other Bayesian models. These formal results have implications for work in psychology, machine learning and economics.
Many real-world prediction problems involve binary classification based on available information, such as predicting whether Germany or England will win a soccer match based on the teams’ statistics. A relatively simple decision procedure would use a rule to combine available information (i.e., cues), such as the teams’ league position, the result of the last game between Germany and England, which team has scored more goals recently, and which team is home versus away. One such decision procedure, the tallying heuristic, simply checks which team is better on each cue and chooses the team that has more cues in its favor, ignoring any possible differences among cues in magnitude or predictive value (Czerlinski et al., 1999, Dawes, 1979). In the scenario depicted in Fig. 1A this heuristic would choose England. Another algorithm, take-the-best (TTB), would base the decision on the best single cue that differentiates the two options. TTB works by ranking the cues according to their cue validity (i.e., predictive value), then sequentially proceeding from the most valid to least valid until a cue is found that favors one team over the other (Gigerenzer & Goldstein, 1996). Thus TTB terminates at the first discriminative cue, discarding all remaining cues.
Fig. 1
Illustrative example of a binary prediction task. (A) Predicting whether Team Germany or England will win is based on four cues: league position, last game result, home vs. away match, and recent goal scoring. Cue validities (v) reflect the relative frequency with which each cue makes correct inferences across many team comparisons (formula in Appendix A). Smiley and frowning faces indicate which team is superior on each cue, whereas a grey face indicates the two teams are equal on that cue. For modeling, a cue is coded +1 when it favors the team on the left (Germany), −1 when it favors the team on the right (England), and 0 when the teams are equal along that cue. (B) Irrespective of cue validity, cues can co-vary (illustrated by overlap) with the criterion variable but also with each other. The heuristics considered here ignore this covariance among cues.
Illustrative example of a binary prediction task. (A) Predicting whether Team Germany or England will win is based on four cues: league position, last game result, home vs. away match, and recent goal scoring. Cue validities (v) reflect the relative frequency with which each cue makes correct inferences across many team comparisons (formula in Appendix A). Smiley and frowning faces indicate which team is superior on each cue, whereas a grey face indicates the two teams are equal on that cue. For modeling, a cue is coded +1 when it favors the team on the left (Germany), −1 when it favors the team on the right (England), and 0 when the teams are equal along that cue. (B) Irrespective of cue validity, cues can co-vary (illustrated by overlap) with the criterion variable but also with each other. The heuristics considered here ignore this covariance among cues.In contrast to these heuristic algorithms, a full-information model such as linear regression would make use of all the cues, their magnitudes, their predictive values, and observed covariation among them. For example, league position and number of goals scored are highly correlated, and this correlation influences the weights obtained from a regression model (Fig. 1B). Although such covariances naturally arise and can be meaningful, the cue validities used by the tallying and TTB heuristics completely ignore them (Martignon & Hoffrage, 1999). Instead, cue validities assess only the probability with which a single cue can identify the correct alternative, as the proportion of correct inferences made by that cue alone across a set of binary comparisons (formal definition in Appendix A). When two cues co-vary highly, they essentially provide the same information, but heuristics ignore this redundancy and treat the related cues as independent information sources. In the heuristic literature, the learner is usually assumed to learn cue validities from past experiences (i.e., the training data) (Gigerenzer and Goldstein, 1996, Gigerenzer and Todd, 1999).Heuristics have a long history of study in cognitive science, where they are often viewed as more psychologically plausible than full-information models, because ignoring data makes the calculation easier and thus may be more compatible with inherent cognitive limitations (Bobadilla-Suarez and Love, 2018, Kahneman, 2003, Simon, 1990, Tversky and Kahneman, 1974). This view suggests that heuristics should underperform full-information models, with the loss in performance compensated by reduced computational cost. This prediction is challenged by observations of less-is-more effects, wherein heuristics sometimes outperform full-information models, such as linear regression, in real-world prediction tasks (Chater et al., 2003, Czerlinski et al., 1999, Dawes, 1979, Gigerenzer and Goldstein, 1996, Goldstein and Gigerenzer, 2002, Hogarth and Karelaia, 2007, Katsikopoulos et al., 2010). These findings have been used to argue that ignoring information can actually improve performance, even in the absence of processing limitations. For example, Gigerenzer and Todd (1999) write, “There is a point where too much information and too much information processing can hurt” (p. 21). Likewise, Gigerenzer and Brighton (2009) conclude, “A less-is-more effect, however, means that minds would not gain anything from relying on complex strategies, even if direct costs and opportunity costs were zero” (p. 111).Less-is-more arguments also arise in other domains of cognitive science, such as in claims that learning is more successful when processing capacity is (at least initially) restricted (Elman, 1993, Newport, 1990). Contrary to existing claims, we argue there is no inherent computational advantage to simplicity of information processing. Less-is-more effects can arise only when the space of models under consideration is limited to a particular family or architecture. At a computational level of analysis, where restrictions on algorithms are set aside (Marr, 1982), more information is always better.We cast our argument in a Bayesian framework, wherein additional information (input data) is always helpful but must be correctly combined with appropriate prior knowledge. We first prove that the tallying and TTB heuristics are equivalent to Bayesian inference under the limit of an infinitely strong prior. This connection suggests that heuristics perform well because their relative inflexibility amounts to a strong inductive bias, one that is suitable for many real-world learning and decision problems.We then use this connection to define a continuum of Bayesian models, determined by parametric variation in the strength of the prior. At one end of the continuum (infinitely diffuse prior), the Bayesian model is equivalent to a variant of linear regression, and at the other end (infinitely strong prior) it is equivalent to a heuristic. Although the Bayesian models mimic the heuristics perfectly in the limit, a crucial difference is that the Bayesian account regulates cue weights but never discards any information. The models are tested on classic datasets that have been used to demonstrate superiority of the heuristics over linear regression, and in all cases we find that best performance comes from intermediate models on the continuum, which do not entirely ignore cue weights or cue covariance but that nonetheless down-weight this information via the influence of their priors. These results suggest that the success of heuristics, and findings of less-is-more effects more broadly in cognitive science, are due not to a computational advantage of simplicity per se, but rather to the fact that simpler models can approximate strong priors that are well-suited to the true structure of the environment.
Bias, variance, and Bayesian inference
The current explanation for less-is-more effects in the heuristics literature is based on the bias-variance dilemma (Gigerenzer & Todd, 1999). The present paper extends this Frequentist concept into a Bayesian framework that formally links heuristics and full-information models. From a statistical perspective, every model, including heuristics, has an inductive bias, which makes it best-suited to certain learning problems (Geman, Bienenstock, & Doursat, 1992). A model’s bias and the training data are responsible for what the model learns. In addition to differing in bias, models can also differ in how sensitive they are to sampling variability in the training data, which is reflected in the variance of the model’s parameters after training (i.e., across different training samples).A core tool in machine learning and psychology for evaluating the performance of learning models, cross-validation, assesses how well a model can apply what it has learned from past experiences (i.e., the training data) to novel test cases (Kohavi, 1995). From a psychological standpoint, a model’s cross-validation performance can be understood as its ability to generalize from past experience to guide future behavior. How well a model classifies test cases in cross-validation is jointly determined by its bias and variance. Higher flexibility can in fact hurt performance because it makes the model more sensitive to the idiosyncrasies of the training sample. This phenomenon, commonly referred to as overfitting, is characterized by high performance on experienced cases from the training sample but poor performance on novel test items. Overfitted models have high goodness of fit but low generalization performance (Fig. 2A; see Pitt & Myung, 2002).
Fig. 2
The concept of overfitting. (A) More flexible models can fit the training sample better (goodness of fit), accounting for most of the variability. However, these models can fare poorly in generalization tasks that test on novel samples (generalizability) (Pitt & Myung, 2002). (B) Our re-analysis of a dataset (Czerlinski et al., 1999) used to evaluate heuristics (predicting house prices) finds that TTB outperforms ordinary linear regression at generalization when the training sample is small (20 training cases). However, the pattern reverses when the training sample is enlarged (100 training cases). Error bars represent SEM. Details are in Appendix A.
The concept of overfitting. (A) More flexible models can fit the training sample better (goodness of fit), accounting for most of the variability. However, these models can fare poorly in generalization tasks that test on novel samples (generalizability) (Pitt & Myung, 2002). (B) Our re-analysis of a dataset (Czerlinski et al., 1999) used to evaluate heuristics (predicting house prices) finds that TTB outperforms ordinary linear regression at generalization when the training sample is small (20 training cases). However, the pattern reverses when the training sample is enlarged (100 training cases). Error bars represent SEM. Details are in Appendix A.Bias and variance tend to trade off with one another such that models with low bias suffer from high variance and vice versa (Geman et al., 1992). With small training samples, more flexible (i.e., less biased) models will overfit and can be bested by simpler (i.e., more biased) models such as heuristics. As the size of the training sample increases, variance becomes less influential and the advantage shifts to the complex models (Chater et al., 2003). Indeed, in a reanalysis of a dataset used to evaluate heuristics (Czerlinski et al., 1999), we find that the advantage for the heuristic over linear regression disappears when training sample size is increased (Fig. 2B).The Bayesian framework offers a different perspective on the bias-variance dilemma. Provided a Bayesian model is correctly specified, it always integrates new data optimally, striking the perfect balance between prior and data. Thus using more information can only improve performance. From the Bayesian standpoint, a less-is-more effect can arise only if a model uses the data incorrectly, for example by weighting it too heavily relative to prior knowledge (e.g., with ordinary linear regression, where there effectively is no prior). In that case, the data might indeed increase estimation variance to the point that ignoring some of the information could improve performance. However, that can never be the best solution. One can always obtain superior predictive performance by using all of the information but tempering it with the appropriate prior. The results in the remainder of this paper demonstrate this conclusion explicitly.
Tallying as a limiting case of regularized regression
The first Bayesian model we develop is conceptually related to ridge regression (Hoerl & Kennard, 1970), a successful regularized regression approach in machine learning. Ridge regression extends ordinary linear regression by incorporating a penalty term that adjusts model flexibility to improve weight estimates and avoid overfitting (Fig. 2A). The types of tasks we model in this paper are binary comparisons, where each input represents a comparison between two alternatives on a set of cues, and the output represents which alternative has the greater value on some outcome variable. Consider a training set of input-output pairs with and . An example is Fig. 1A, where the explanatory variables () encode which soccer team is superior on each cue, and the outcome variable (y) indicates which team won each comparison (match). The aim in any linear regression problem is to estimate the weights, i.e., a vector of regression coefficients , such that prediction error between y and is minimized. The weights estimated by ridge regression are defined bywhere the penalty parameter is nonnegative. denotes the square of the Euclidean norm, is the outcome variable defined over all n binary comparisons in the training sample, and is an matrix with one column for each of the m predictor variables . When the penalty parameter equals zero, ridge regression is concerned only with goodness of fit (i.e., minimizing squared error on the training set). For this special case, ridge regression is equivalent to ordinary linear regression, which is highly sensitive to sampling variability in the training set. As the penalty parameter increases, the pressure to shrink the weights increases, reducing them to zero as . Thus larger values of lead to stronger inductive bias, which can reduce overfitting by reducing sensitivity to noise in the training sample. However, the optimal setting of will always depend on the environment from which the weights, cues, and outcomes were sampled.The ridge penalty term is mathematically equivalent to a Gaussian Bayesian prior on the weights, where is inversely proportional to the prior variance of each (i.e., , where is the variance of the error in y, also assumed to be Gaussian). In the Bayesian interpretation, the strength of the prior is thus reflected by , growing stronger as . This prior distribution is combined with observations from the training sample to form a posterior distribution (also Gaussian) over the weights. Like ordinary linear regression, ridge regression provides a point estimate for the weights, equal to the mean (and also the mode) of the full Bayesian posterior distribution (Marquaridt, 1970, Ripley, 2007). The conceptual relationships among ridge regression, ordinary linear regression, and the Bayesian model are illustrated in Appendix A, Fig. A2.
Fig. A2
Formal relationships among full Bayesian regression, ridge regression, ordinary least-squares linear regression, the Bayesian half-ridge model, and the directed tallying heuristic. The lower-right arrow represents the main contribution of this paper, that a heuristic is a limiting case of Bayesian inference (here, the half-ridge model) with an infinitely strong prior.
Half-ridge model and tallying
Our Bayesian derivation of the tallying heuristic extends ridge regression by assuming the directionalities of the cues (i.e., the signs of the true weights) are known in advance. For example, being higher in the league standings will, if anything, make a team more likely (not less) to win a given match. This assumption is concordant with how the tallying heuristic was originally proposed in the literature (Dawes, 1979). We refer to this definition of the tallying heuristic as directed tallying in order to differentiate it from the version of the tallying heuristic that learns cue directionalities from the training data (Czerlinski et al., 1999). Thus we define the prior for each weight as half-Gaussian, truncated at zero (right-hand side in Fig. A2, Appendix A), and we refer to this Bayesian model as the half-ridge model. Formally, the joint prior is defined bywhere is the covariance matrix among the weights (prior to truncation) and determines the variance for each weight. The restriction notation, , indicates we truncate the distribution to one orthant , defined by the predetermined directionalities of the cues. For example, if the cues were assumed all to have positive (or null) effects on the outcome, then would equal . Under this assumption, the posterior distribution inherits the same truncation (see Appendix A for derivations). The important question is what happens to this posterior as the prior becomes arbitrarily strong, that is, as . Just as with increasing the penalty parameter in regular ridge regression, strengthening the prior in the half-ridge model shrinks the posterior weights toward zero (Eq. (3)). However, the ratios of the weights—that is, the relative inferred strengths of the cues—all converge to unity. This result can be seen through a simple rescaling of the weights, which has no impact on a binary comparison task. In particular, we show in the Appendix A that the posterior distribution for obeysconditional on and (where indicates convergence in distribution). Consequently, the rescaled weights all have the same posterior mean in the limit:with signs determined by each cue’s assumed directionality. Therefore, the optimal decision-making strategy under the Bayesian half-ridge model converges to a simple summation of the predictors—that is, the directed tallying heuristic. Note that, under this limit, the model becomes completely invariant to the training data. In particular, it ignores how strongly each cue is associated with the outcome in the training set (i.e., magnitudes of cue validities). At the other extreme, as the prior becomes extremely weak (), the Bayesian half-ridge model converges to a full regression model akin to ordinary linear regression in that it differentially weights the cues (e.g., more predictive cues receive higher weights than less predictive cues), the only difference being that the weights are constrained to have their predetermined signs. In conclusion, the half-ridge model demonstrates how the directed tallying heuristic arises as an extreme case of a Bayesian prior on the distributions of weights in the environment, and it shows that tallying and linear regression can be related by a continuum of models that differ only in the strength of this prior.
Heuristics vs. intermediate models
From a Bayesian perspective, the model that fares best on a given decision task should be the one with a prior most closely matching the data’s generating process. In many decision environments, cues differ in their predictiveness, but these differences are not arbitrarily large (i.e., the cue weights are not drawn uniformly from all real numbers). An advantage of the Bayesian half-ridge framework is that it specifies a continuum of models between the extremes of linear regression and the directed tallying heuristic. For many environments, the best-performing model should lie somewhere between these two extremes. Furthermore, the best-performing model should not change with different training set sizes (cf. Fig. 2), because—unlike the Frequentist phenomenon of bias-variance tradeoff—a correctly specified Bayesian model is guaranteed to find the optimal tradeoff between prior and likelihood, for any sample size.The Bayesian half-ridge model was simulated on 20 datasets that have been used to compare heuristic and regression approaches (Czerlinski et al., 1999). The key finding is that intermediate models perform best in all cases for all training sample sizes (see Fig. 3). Interestingly, the ordinary regression model (i.e., the limit of ) outperforms the tallying heuristic (i.e., the limit of ). This discrepancy from past less-is-more results arises because cue directions are not learned in these simulations, and therefore there is no opportunity for the more flexible regression model to misestimate the cue directions. We do demonstrate less-is-more results in the 20 datasets (Czerlinski et al., 1999, Katsikopoulos et al., 2010) when comparing heuristics and regression models that estimate cue directions from the training set (Appendix A, Fig. A5, Fig. A6). The main finding, that intermediate half-ridge models outperform tallying in all 20 datasets, suggests that ignoring information is never the best solution. The best-performing model uses all the information in the training data, combining it with the appropriate prior.
Fig. 3
Generalization performance of the Bayesian half-ridge model by training sample size and as a function of the strength of the prior for 20 datasets for which heuristics have been previously evaluated (Czerlinski et al., 1999). The abscissa represents the strength of the prior, and the ordinate represents the predictive accuracy of the model on test comparisons. Note that an approximately infinitely strong prior on the far right of each graph (small values of ) corresponds to the directed tallying heuristic. Intermediate models (i.e., with a medium-strength prior) performed best in all datasets regardless of training sample size. Error bars represent SEM. Because the Oxygen and Ozone datasets contain less than 115 object pairs in total, training size 115 is not included for them. See Appendix A for details.
Fig. A5
Generalization performance of the Bayesian COR model with the Tallying decision rule by training sample size in all 20 datasets that heuristics have been extensively tested on Czerlinski et al. (1999). The abscissa represents an increasing prior strength from left to right, and the ordinate represents the predictive accuracy of the model. Note that an approximately infinitely strong prior (e.g., = 1e+06) corresponds to the tallying heuristic, and a prior strength of zero ( = 0) corresponds to ordinary linear regression. In 11 out of the 20 datasets, a less-is-more effect can be observed, where the tallying heuristic outperformed ordinary linear regression, with 10 and 20 training cases. For example, in the City Size, Car Accidents, and Mammals datasets, the tallying heuristic outperformed ordinary linear regression for training samples sizes of 10 or 20 training cases. However, the optimal performance could be found in the middle, i.e., for medium-strength priors. The optimal performance peak was robust across training sample sizes of 10, 20, and 115 training cases. In other datasets, such as Homelessness, Fish Fertility, and Women’s Attractiveness, ordinary linear regression outperformed tallying. However, the optimal performance for all datasets was found for intermediate COR models, i.e., for medium-strength priors. Error bars represent SEM.
Fig. A6
Generalization performance of the Bayesian COR model with the TTB decision rule by training sample size in all 20 datasets that heuristics have been extensively tested on Czerlinski et al. (1999). The abscissa represents an increasing prior strength from left to right, and the ordinate represents the predictive accuracy of the model. Note that an approx. infinitely strong prior (e.g., = 1e+06) corresponds to the TTB heuristic, and a prior strength of zero ( = 0) corresponds to ordinary linear regression. In 18 out of the 20 datasets, a less-is-more effect can be observed, where the TTB heuristic outperformed ordinary linear regression, with 10 and 20 training cases. For example, in the House Prices, Mortality, City Size, and Professor Salaries datasets, the TTB heuristic outperformed ordinary linear regression for training samples sizes of 10 or 20, but the optimal performance could be found in the middle, i.e., for medium-strength priors. The optimal performance peak was roughly in the same place across training sample sizes of 10, 20, and 115 training cases. In other datasets, such as the Cloud Rainfall or the Ozone levels dataset, ordinary linear regression outperformed the TTB heuristic, but the optimal performance can still be found in the intermediate COR models, i.e., for medium-strength priors. Error bars represent SEM.
Generalization performance of the Bayesian half-ridge model by training sample size and as a function of the strength of the prior for 20 datasets for which heuristics have been previously evaluated (Czerlinski et al., 1999). The abscissa represents the strength of the prior, and the ordinate represents the predictive accuracy of the model on test comparisons. Note that an approximately infinitely strong prior on the far right of each graph (small values of ) corresponds to the directed tallying heuristic. Intermediate models (i.e., with a medium-strength prior) performed best in all datasets regardless of training sample size. Error bars represent SEM. Because the Oxygen and Ozone datasets contain less than 115 object pairs in total, training size 115 is not included for them. See Appendix A for details.
A covariance-based Bayesian model and heuristics
In this section, we consider a second Bayesian model that, unlike the half-ridge model, learns cue directions from the training set and provides a unification of TTB, tallying, and linear regression. Given that ridge regression (L2 regularization) yields tallying, one might wonder whether a strong prior of a different functional form might yield the TTB heuristic. In particular, lasso regression (L1 regularization) (Ripley, 2007) is known to produce sparsity in cue selection (i.e., many weights are estimated as zero), and thus might be expected to yield TTB in the limit. Instead, derivations show that lasso regression also converges to tallying in the limit when the cue directionalities are known a priori. This result highlights the robustness of the conclusions of the previous sections, with tallying arising as a limiting case of Bayesian inference under a variety of different priors.Given this formal result, we take a different approach. One key observation is that, unlike linear regression, both TTB and tallying rely on isolated cue-outcome relationships (i.e., cue validity) that disregard covariance information among cues. We use this insight to construct our second Bayesian model, with a prior that suppresses information about cue covariance but leaves information about cue validity unaffected. We refer to this model as Covariance Orthogonalizing Regularization (COR), because our regularization method essentially makes cues appear more orthogonal to each other. The strength of the prior yields a continuum of models (Fig. 4) defined by sensitivity to covariation among cues, which smoothly vary in their mean posterior weight estimates from those of ordinary linear regression to weights that are linear transforms of the heuristics’ cue validities (see Appendix A derivations).
Fig. 4
The prior of the COR model influences the posterior solution (i.e., the mean of the posterior on ) such that the model encompasses linear regression and the heuristics as extreme cases. In this example, there are cues, represented as vectors , where one set of entries pertains to the ith binary comparison. In order to establish a continuum of covariation sensitivity, the criterion variable is multiplexed as many times as there are cues (i.e., m times). The result is a multivariate regression problem with a dependent matrix of m columns of identical criterion variables. We refer to the dashed arrows as cross-weights, and the solid arrows as direct weights, corresponding respectively to the off-diagonal and diagonal entries of the weight matrix . In an ordinary linear regression model, the estimated weights depend on the cue covariances. In contrast, a model structure without any of the cross-weights would revert to three simple regressions with exactly one predictor each (, , or ). Therefore, in the limit , the prior does not penalize the cross-weights, and the set of mean posterior weights to each copy of the criterion variable is equal to the ordinary linear regression solution (leftmost network). At the other extreme, when , the cross-weights are shrunk to zero, and the knowledge captured in the direct weights becomes equivalent to that embodied by cue validities in heuristics that ignore covariation information (rightmost network). Between these two extreme values of lie models that are sensitive to covariation to varying degrees (middle network).
The prior of the COR model influences the posterior solution (i.e., the mean of the posterior on ) such that the model encompasses linear regression and the heuristics as extreme cases. In this example, there are cues, represented as vectors , where one set of entries pertains to the ith binary comparison. In order to establish a continuum of covariation sensitivity, the criterion variable is multiplexed as many times as there are cues (i.e., m times). The result is a multivariate regression problem with a dependent matrix of m columns of identical criterion variables. We refer to the dashed arrows as cross-weights, and the solid arrows as direct weights, corresponding respectively to the off-diagonal and diagonal entries of the weight matrix . In an ordinary linear regression model, the estimated weights depend on the cue covariances. In contrast, a model structure without any of the cross-weights would revert to three simple regressions with exactly one predictor each (, , or ). Therefore, in the limit , the prior does not penalize the cross-weights, and the set of mean posterior weights to each copy of the criterion variable is equal to the ordinary linear regression solution (leftmost network). At the other extreme, when , the cross-weights are shrunk to zero, and the knowledge captured in the direct weights becomes equivalent to that embodied by cue validities in heuristics that ignore covariation information (rightmost network). Between these two extreme values of lie models that are sensitive to covariation to varying degrees (middle network).In contrast to ridge regression, we express the regression problem in multivariate terms by multiplexing the outcome m times (the number of predictors), which allows the model to capture the sequential nature of TTB. As shown in Fig. 4, every copy of the output receives input from every cue, and thus the weights can be represented as an weight matrix . Unlike in ridge regression, where the Gaussian prior shrinks all model weights toward zero, only the cross-weights (i.e., the off-diagonal elements) are penalized. In the limiting case, when the precision of the prior, , approaches , the cross-weights reduce to zero and the posterior estimates for the direct (diagonal) weights are equivalent to cue validities as used by the heuristics (i.e., neglecting covariance information), up to a linear transformation. At the other extreme, when , every copy of has the same posterior for its set of weights, and the mean (and mode) of this posterior is equal to the ordinary linear regression solution. In particular, the covariance information is reflected in the posterior weights as it is in the ordinary regression solution.The model weights are paired with a decision rule to classify test items. First, the vector is multiplied by the mean posterior weight matrix to generate an output vector :Note that using the posterior mean is equivalent to integrating over the full posterior distribution, due to the linearity of Eq. (5). The TTB decision rule is then applied to the resulting asandThus, the TTB decision rule selects the maximum absolute output (Eq. (6)) and takes the valence of that output as its choice (Eq. (7)). When (and the cross-weights are thus zero), the decision rule exhibits the exact sequential nature of the TTB heuristic, because then each output in Eq. (5) equals the value of the corresponding cue, , times its cue validity. The largest output will correspond to the most valid cue that is not equal to zero (i.e., indifferent) for the particular test comparison. Thus, when the TTB decision rule is adopted, the COR model converges to the TTB heuristic as . In Appendix A, Fig. A3 shows simulations of an artificial binary prediction task similar to Fig. 1, demonstrating that the COR model (with TTB decision rule) and the TTB heuristic reach perfect agreement in their predictions as the prior becomes strong enough.
Fig. A3
Agreement between the COR model (with TTB decision rule) and the TTB heuristic, as well as ordinary linear regression, as a function of the strength of the prior. As expected, agreement (i.e., proportion of equal predictions on test items) between the Bayesian COR model and TTB heuristic increased with a stronger prior, reaching an asymptote of perfect agreement as approached infinity. The opposite pattern held for ordinary linear regression, with agreement being perfect at and declining as the prior strength increases. Parallel results hold for the tallying decision rule (Fig. A4). The ordinate indicates the percentage agreement on test item choices in a simulated binary decision task with three cues. The simulated task contained 20 objects (e.g., an object represents a soccer team in Fig. 1), and the training set comprised all 190 possible pairwise comparisons of the objects. The test set represented all possible combinations of three cues which can take values of {−1, +1, 0} (coding scheme followed pattern in Fig. 1), and contained 26 cue combinations. The simulation process was repeated 1000 times, and error bars represent SEM across simulation runs. More details are provided in the Simulation 3 text.
Notably, the tallying heuristic can also be derived from the COR model, in its undirected version that uses cue validities in the training data to infer cue directionalities. The tallying decision rule is defined byThe tallying decision rule chooses the option with a majority of outputs in its favor (conveyed by their valences indicated by the sign function), irrespective of the magnitudes of the outputs. The choice is determined by Eq. (7), as in the TTB decision rule. When the tallying decision rule is adopted by the COR model, the model converges to the tallying heuristic in the limit as (Fig. A4, Appendix A). Lastly, in the limit of , either decision rule will yield decisions equivalent to ordinary linear regression. Under this limit, the outputs produced according to Eq. (5) are all equal to the ordinary linear regression prediction (as outlined above), and both the TTB and tallying decision rules will yield a choice equal to the valence of that prediction.
Fig. A4
Agreement between the COR model (with tallying decision rule) and the tallying heuristic, as well as ordinary linear regression, as a function of the prior strength. The ordinate reflects the percentage agreement on test item choices in an artificial dataset. The artificial dataset used for this simulation is equivalent to the one displayed in Fig. A3 and described in the Simulation 2 text. Error bars represent SEM across simulation runs.
The COR model demonstrates how ordinary linear regression and both TTB and the tallying heuristic can be derived as extreme cases of a Bayesian prior defined by covariance expectation. Importantly, the only element varying across the continuum is the prior’s strength, and the prior is responsible for recovering the heuristics in the limit. The model converges to ordinary regression as the strength of the prior goes to zero regardless of the decision rule, and these model properties also hold under other forms of regularization (e.g. lasso regularization). As with the half-ridge model, we find that COR’s performance peaks for intermediate priors for all 20 datasets (Czerlinski et al., 1999) (Appendix A, Fig. A5, Fig. A6). Thus once again less is not more, as the heuristics are outperformed by a prior of finite strength that uses all information in the training data but nonetheless down-weights that information.
Discussion
A central message of this work is that, in contrast to less-is-more claims, ignoring information is rarely, if ever optimal (Gigerenzer and Brighton, 2009, Gigerenzer and Todd, 1999, Tsetsos et al., 2016). Heuristics may work well in practice because they correspond to infinitely strong priors that make them oblivious to aspects of the training data, but they will usually be outperformed by a prior of finite strength that leaves room for learning from experience (Fig. 3, and Fig. A5, Fig. A6 in Appendix A). That is, the strong form of less-is-more, that one can do better with heuristics by throwing out information rather than using it, is false. The optimal solution always uses all relevant information, but it combines that information with the appropriate prior. In contrast, no amount of data can overcome the heuristics’ inductive biases. The tallying heuristic is defined to entirely ignore differences in cue magnitude and predictiveness, unlike the intermediate half-ridge models, and cue validities are defined to entirely ignore covariation information, unlike the intermediate COR models.Although the current contribution is formal in nature, it nevertheless has implications for psychology. In the psychological literature, heuristics have been repeatedly pitted against full-information algorithms (Chater et al., 2003, Czerlinski et al., 1999, Katsikopoulos et al., 2010) that differentially weight the available information or are sensitive to covariation among cues. The current work indicates that the best-performing model will usually lie between the extremes of ordinary linear regression and fast-and-frugal heuristics, i.e., at a prior of intermediate strength. Between these extremes lie a host of models with different sensitivity to cue-outcome correlations in the environment.One question for future research is whether heuristics give an accurate characterization of psychological processing, or whether actual psychological processing is more akin to these more complex intermediate models. On the one hand, it could be that implementing the intermediate models is computationally intractable, and thus the brain uses heuristics because they efficiently approximate these more optimal models. This case would coincide with the view from the heuristics-and-biases tradition of heuristics as a tradeoff of accuracy for efficiency (Tversky & Kahneman, 1974). On the other hand, it could be that the brain has tractable means for implementing the intermediate models (i.e., for using all available information but down-weighting it appropriately). This case would be congruent with the view from ecological rationality where the brain’s inferential mechanisms are adapted to the statistical structure of the environment. However, this possibility suggests a reinterpretation of the empirical evidence used to support heuristics: heuristics might fit behavioral data well only because they closely mimic a more sophisticated strategy used by the mind.Although we focused on explaining the success of two popular decision heuristics through a Bayesian analysis, our approach also suggests one could start with a Bayesian model and attempt to derive a novel heuristic by strengthening the prior. For example, in Gaussian process regression with a radial-basis kernel, the length-scale parameter determines how similar a training example needs to be to a test item to significantly influence the model’s prediction. Taking the limit as the length scale approaches zero might yield a heuristic akin to the nearest neighbor algorithm, in which the prediction is based solely on the most similar training item, ignoring all other training data. Such a solution would be algorithmically simple, but likely would be bested by models with intermediate prior strength. Whether this approach to deriving new heuristics would prove fruitful is an open question for future research.There have been various recent approaches looking at the compatibility between psychologically plausible processes and probabilistic models of cognition (Bramley et al., 2017, Daw and Courville, 2008, Griffiths et al., 2015, Jones and Love, 2011, Lee and Cummins, 2004, Sanborn et al., 2010, Scheibehenne et al., 2013). These investigations are interlinked with our own, and while most of that work has focused on finding algorithms that approximate Bayesian models, we have taken the opposite approach. This contribution reiterates the importance of applying fundamental machine learning concepts to psychological findings (Gigerenzer & Brighton, 2009). In doing so, we provide a formal understanding of why heuristics can outperform full-information models by placing all models in a common probabilistic inference framework, where heuristics correspond to extreme priors that will usually be outperformed by intermediate models that use all available information.
Table A1
Computation of cue validities: A binary prediction task where five object comparisons are made on the basis of three cues.
Comparison
Cue x1
Cue x2
Cue x3
y
r1
w1
x1Ty
x1Tx1
1
−1
−1
0
−1
1
0
1
1
2
1
−1
1
−1
0
1
−1
1
3
0
−1
0
1
0
0
0
0
4
1
1
1
1
1
0
1
1
5
1
1
−1
1
1
0
1
1
v1=34
v2=45
v3=13
R1=3
W1=1
R1-W1=2
R1+W1=4
The cue columns represent cue difference values, respectively, and are coded in the same way as the coding column in Fig. 1 in the main text. The criterion variable contains the outcome of each comparison. and indicate whether cue predicted the outcome correctly or incorrectly (r = right, w = wrong) on each comparison, and and are the sums across all comparisons, and . Then, the cue validity for cue is computed as . The validities for and are defined similarly.
Table A2
Parameters of the simulation presented in Fig. 2B.
Parameter
Value
Number of objects
22
Number of pairwise comparisons
N=231
Number of cues
m=10
Class variable (which house had the higher actual sales price)
Binary, ±1
Absolute correlation between cues averaged over cue pairs
Authors: Konstantinos Tsetsos; Rani Moran; James Moreland; Nick Chater; Marius Usher; Christopher Summerfield Journal: Proc Natl Acad Sci U S A Date: 2016-02-29 Impact factor: 11.205