
Bayesian regression explains how human participants handle parameter uncertainty.

Jannes Jegminat^1,2, Maya A Jastrzębowska^3,4, Matthew V Pachai^3,5, Michael H Herzog^3, Jean-Pascal Pfister^1,2.

Abstract

Accumulating evidence indicates that the human brain copes with sensory uncertainty in accordance with Bayes' rule. However, it is unknown how humans make predictions when the generative model of the task at hand is described by uncertain parameters. Here, we tested whether and how humans take parameter uncertainty into account in a regression task. Participants extrapolated a parabola from a limited number of noisy points, shown on a computer screen. The quadratic parameter was drawn from a bimodal prior distribution. We tested whether human observers take full advantage of the given information, including the likelihood of the quadratic parameter value given the observed points and the quadratic parameter's prior distribution. We compared human performance with Bayesian regression, which is the (Bayes) optimal solution to this problem, and three sub-optimal models, which are simpler to compute. Our results show that, under our specific experimental conditions, humans behave in a way that is consistent with Bayesian regression. Moreover, our results support the hypothesis that humans generate responses in a manner consistent with probability matching rather than Bayesian decision theory.

Year: 2020 PMID: 32421708 PMCID: PMC7259793 DOI: 10.1371/journal.pcbi.1007886

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

The brain evolved in an environment that requires fast decisions to be made based on noisy, ambiguous and sparse sensory information, using noisy information processing and noisy effectors. Hence, decisions are typically made under substantial uncertainty. The main idea behind the Bayesian brain hypothesis is that the brain uses the framework of Bayesian probabilistic computation to make optimal decisions in the presence of uncertainty [1-3]. Despite various counterexamples, e.g., [4], a large body of research has established that many aspects of cognition are indeed well described by Bayesian statistics. These include magnitude estimation [5], color discrimination [6], cue combination [7], cross-modal integration [8, 9], integration of prior knowledge [10, 11] and motor control [12-14]. Some experimental studies have considered more complex tasks, including visual search [15, 16], same-different discrimination [17] and change detection [18], but most can be cast into the problem of estimating a hidden quantity from sensory input. Far fewer experimental studies have been performed on regression tasks (but see [19] for an overview, and, e.g., [20-22]). In a regression task, the aim is to learn the mapping from a stimulus x to an output y after having been exposed to a training dataset D of N associations between a stimulus x and its corresponding output y. Since the mapping from x to y can be probabilistic, the aim of regression is to find an expression for p(y|x, D). Classification tasks, such as object recognition, or self-supervised tasks, such as estimating the future position of an object from past observations, are just a few examples of the many regression tasks performed by humans on a daily basis. The machine learning literature contains many solutions to the regression problem, including nonlinear regression, support vector machines, Gaussian processes and deep neural networks (see [23] for an introduction). It is unclear, however, how humans perform regression tasks.
Most of the machine learning solutions rely on the assumption that the mapping from x to y is parametrized by a set of parameters w, such that the original regression problem of finding the posterior predictive distribution p(y|x, D) is replaced by a parameter estimation problem, i.e., finding the best set of parameters w* for the parametrized mapping p(y|x, w*). However, this approach is not Bayesian since no uncertainty over the parameters w is included in the regression model. The Bayesian approach to regression proceeds in two steps [24]. First, the posterior distribution over the parameters p(w|D) is computed from the observed data D. Then, this posterior is used to compute the posterior predictive distribution by integrating over the parameters: p(y|x, D) = ∫ p(y|x, w) p(w|D) dw (Eq 1). Taking into account the uncertainty over parameters is particularly relevant for predictions when the size N of the dataset is small compared to the number of parameters. Indeed, taking into account the uncertainty helps to generalize to unknown data and thereby alleviates overfitting. Parameter uncertainty also plays a key role in computing prediction uncertainty, as estimated, e.g., by the variance of the predictive distribution, Eq (1). In Bayesian decision theory, the predictive distribution is used to minimize the expected cost with respect to the predicted variable. This is important when rewards are unequally distributed, as is the case in many behavioural tasks [25-27]. Some recent work supports the notion that humans make simple decisions in a way that conforms to Bayesian decision theory [12, 28]. In more complex tasks, it has been shown that humans respond suboptimally, which can be largely attributed to noisy inference rather than noisy decision making [29]. A competing decision model to that of Bayesian decision theory is probability matching, wherein random samples are drawn from the predictive distribution.
Several studies support the idea that humans use probability matching in cognitive [30, 31] and perceptual tasks [32]. Despite the differences in how prediction uncertainty is used in Bayesian decision theory and probability matching, uncertainty is nevertheless an integral part of the decision making process in both cases. Both of the aforementioned potential pitfalls of the regression problem—overfitting to small datasets and lack of prediction uncertainty—currently limit the power of deep neural network models [33, 34]. These models have millions of parameters and their performance improves with the number of layers [35, 36]. To prevent overfitting, training requires ever larger and more expensive training sets. It is interesting to note that classic Deep Neuronal Networks (DNNs) do not use weight uncertainty and are therefore limited in their ability to compute prediction uncertainty. Recently, the idea of computing the probability distribution over weights in DNNs and using the distribution for prediction has gained traction and has given rise to the so-called Bayesian Neuronal Network (BNN), for example [37, 38]. Thus, the proposal of BNNs is simply to apply Bayesian regression to DNNs. BNNs promise better performance in the low data regime. Here, we ask the question whether human observers process parameter uncertainty in accordance with Bayesian regression. We conducted psychophysical experiments in the low data regime with a simple generative model and compared Bayesian regression to other regression models without fitting any hyperparameters other than participant-specific noise. The experimental design made use of the fact that Bayesian regression predicts an uncertainty-modulated transition from a unimodal to a bimodal response distribution. In each trial, we presented participants with 4 points from a hidden, noisy parabola. 
The task was to correctly extrapolate the parabola, i.e., to find the vertical position at which the parabola intersects a given horizontal location. The quadratic parameter of each parabola was drawn from a bimodal prior distribution, designed to make the parabolas face either upwards or downwards. After recording the participant's response, we showed the parabola from which the stimulus dots were generated as feedback. This feedback enabled the participants to learn both the prior and the generative model. Because we wanted to test to what extent participants make decisions in accordance with Bayesian regression, we varied the level of noise of the parabola. The rationale is that the higher the noise level, the higher the uncertainty about the correct parameter and, according to Bayesian regression, the more participants should rely on the prior and produce a bimodal response distribution. We found that Bayesian regression indeed explains participants' responses better than maximum likelihood regression and maximum a posteriori regression. Moreover, we compared a loss-based decision model with a sampling-based decision model and found clear evidence for the latter. Indeed, a loss-based model with exact inference cannot explain the bimodality of participants' response distributions.

Results

A novel paradigm to test regression

We designed a novel psychophysical experiment in which participants had to extrapolate a noisy parabola displayed on a computer screen. In each trial, we chose the parameter w of the parabola y = wx^2 from a bimodal prior distribution π(w) whose two modes are centered at w = 1 and w = −1 and have variance σ_π^2 (see Eq 5). The parameter w was either positive (parabola facing upwards) or negative (parabola facing downwards), with the same probability, i.e., 0.5. We selected four dots on the parabola with x-positions close to the parabola's vertex and added zero-mean Gaussian generative noise with standard deviation σ_g to the dots' y-positions (see Eq 4). We then presented a fifth dot to the right of the stimulus, always at the same x-position x⋆ = 2. Participants could move the fifth dot up and down along the y-axis by using the up and down arrow keys. Participants were asked to adjust the y-position so that the dot correctly extrapolated the parabola. During the adjustment task, participants saw only the 4-dot stimulus but not the generating parabola. After the participant had validated his/her response, we showed the generating parabola and the adjusted point as feedback. Participants were naive to the purpose of the study. They were not informed about the existence of a prior distribution of the parabola's quadratic parameter, its bimodality or the level of generative noise. In our main experiment, we set the standard deviation of each prior mode to σ_π = 0.1 (if not specified otherwise, assume this value throughout this work) and fixed the values of the x-positions to x1 = −0.3, x2 = −0.1, x3 = 0.1 and x4 = 0.3. We generated 20 unique stimuli, indexed by j ∈ {1, …, 20}, at a low (0.03), medium (0.1) and high (0.4) value of the generative noise σ_g. The rationale is that the higher the noise level, the higher the uncertainty (the broader the likelihood) and the more participants rely on the prior if they act consistently with a Bayesian regression model.
At each noise level, we ran 400 trials, repeating each unique stimulus D 20 times. The stimulus presentation order was randomized within each noise level (Fig 1B). Thus, we obtained a set of 20 responses for each noise level and for each of the 20 stimuli. The advantage of observing several responses for the same exact stimulus is that we can compare the observed response distribution to the response distributions predicted by the different models.
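
The stimulus design described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the seed handling and the use of Python's standard `random` module are assumptions, while the x-positions, prior modes, prior standard deviation and repetition counts are taken from the text.

```python
import random

X_POSITIONS = (-0.3, -0.1, 0.1, 0.3)  # fixed x-positions of experiment 1


def sample_w(sigma_prior, rng):
    """Draw w from an equal-weight mixture of Gaussians centred at +1 and -1."""
    mode = 1.0 if rng.random() < 0.5 else -1.0
    return rng.gauss(mode, sigma_prior)


def make_stimulus(sigma_gen, sigma_prior, rng):
    """Return (w, dots): four noisy dots on the parabola y = w * x^2."""
    w = sample_w(sigma_prior, rng)
    dots = [(x, w * x ** 2 + rng.gauss(0.0, sigma_gen)) for x in X_POSITIONS]
    return w, dots


def make_block(sigma_gen, sigma_prior=0.1, n_unique=20, n_repeats=20, seed=0):
    """Hypothetical helper: unique stimuli plus a shuffled presentation order."""
    rng = random.Random(seed)
    unique = [make_stimulus(sigma_gen, sigma_prior, rng) for _ in range(n_unique)]
    order = [j for j in range(n_unique) for _ in range(n_repeats)]
    rng.shuffle(order)  # randomize presentation order within the noise level
    return unique, order
```

Calling `make_block(0.4)` would give 20 unique high-noise stimuli and a randomized order of 400 trial indices, matching the trial counts stated above.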

Fig 1

Experimental protocol.

(A): Procedure of a single trial. First, a fixation dot was presented for 1 s before the 4-dot stimulus appeared. Observers then had unlimited time to adjust the fifth dot with the up and down arrow keys. They then pressed the space bar to confirm the final position of the adjustable dot. After the response, the generative parabola was shown for 1 s as feedback. (B): Experiment 1: The experiment consisted of two sessions on two separate days. Both sessions began with 10 practice trials with virtually no noise (σ_g = 10^−5), followed by 4 blocks of 50 trials of low noise (σ_g = 0.03). In session 1, the low noise blocks were followed by 8 blocks of 50 trials of medium noise (σ_g = 0.1), while in session 2, the low noise blocks were followed by 8 blocks of 50 trials of high noise (σ_g = 0.4). In total, each participant completed 400 trials per noise level, with 20 repetitions of 20 unique stimuli. In this experiment, the prior standard deviation σ_π was set to 0.1. (C): Experiment 2: the experiment consisted of a single session which began with 20 practice trials with very low noise (σ_g = 10^−2), followed by 10 blocks of medium noise (σ_g = 0.1) trials. Each block consisted of 20 trials, with the generative parabola shown as feedback, as in Experiment 1. Half of the 200 trials consisted of stimuli which were presented just once, while the remaining 100 trials consisted of 10 repetitions of 10 unique stimuli. In this experiment, σ_π was set to 0.5. See Materials and methods for more details.

Seven naive participants took part in the experiment. Responses were recorded separately for each participant k. The stimulus presentation order was identical for each participant. At the beginning of each experimental session, we showed a virtually noiseless 4-dot stimulus (σ_g = 10^−5) to familiarise the participants with the task and to estimate their internal noise sources (explained in more detail below). In a second experiment, we set the prior standard deviation to σ_π = 0.5 and the generative noise to the medium level of σ_g = 0.1.
Instead of using fixed x-positions for the stimuli, we added weak Gaussian noise to the x-positions (see Materials and methods). We repeated 10 unique stimuli 10 times each, which yielded a total of 100 repeated-stimulus trials. Four naive participants (different from those recruited for the first experiment) completed the experiment. The key difference to the σ_g = 0.1 condition in the first experiment was that here, the prior provided much less information about the curvature of the parabolas. In total, we studied four different conditions: three with σ_π = 0.1 and σ_g ∈ {0.03, 0.1, 0.4} in experiment 1, and one with σ_π = 0.5 and σ_g = 0.1 in experiment 2.

The regression models

We considered five regression models (see Materials and methods). The maximum likelihood regression (ML-R) model computes only the point estimate of w that maximizes the likelihood p(D|w) and does not make use of the prior distribution at all. The maximum a posteriori regression model (MAP-R) combines the likelihood with the prior distribution to compute the mode of the posterior distribution p(w|D). Despite the fact that it uses the bimodal prior, MAP-R cannot produce bimodal responses because it relies on a point estimate of w. The Bayesian regression (B-R) model takes the entire posterior distribution into account: p(y|x, D) = ∫ p(y|x, w) p(w|D) dw. In B-R, the generative noise level σ_g plays the crucial role of modulating the relative strength of the unimodal likelihood and the bimodal prior, and hence determines the transition between a unimodal and a bimodal response distribution. Relaxing the assumption that the true generative noise is known, we included an additional variant of B-R that (deterministically) estimates the generative noise for the current stimulus D. We indicate this variant of B-R with a subscript: B-R_σ̂, where σ̂ denotes the estimated generative noise. Note that technically, this trial-by-trial noise estimation could also be applied to the other models such as MAP-R. However, MAP-R reduces the posterior distribution over the quadratic parameter to a point estimate. Therefore, an additional trial-by-trial noise estimation would not change its prediction substantially, i.e., it would only shift the unimodal prediction but would not induce a bimodal predictive distribution. Thus, we did not consider a corresponding "MAP-R_σ̂" variant. As a null model, we included prior regression (P-R), which replaces the posterior with the prior, i.e., it does not use the likelihood. For all models considered here, the predictive distribution depends deterministically on the 4-dot stimulus. In this sense, they rely on exact inference. Noisy inference is an alternative which assumes that the inference process is corrupted by noise [29].
This alternative would require an additional noise parameter which governs the level of inference noise (see S1 Text). Here, we restrict ourselves to models with exact inference to remain fitting-free, i.e., model predictions for a given stimulus D have no free parameter. We used the true values of the hyperparameters because we assume that the participants learned the generative model within a few trials (see S1 Text). Thus, the model predictions require no fitting. For more details, see Materials and methods. In the plots, we denote the models by the arguments of their predictive distributions, i.e., y|x, w_ML for maximum likelihood regression (ML-R); y|x, w_MAP for maximum a posteriori regression (MAP-R); y|x, D for Bayesian regression (B-R); y|x, D, σ̂ for Bayesian regression with noise estimation (B-R_σ̂) and y|x for prior regression (P-R).
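
For the one-parameter model y = wx^2 with Gaussian noise and a two-mode Gaussian prior, the posterior and the predictive distribution are Gaussian mixtures with closed-form parameters. The following Python sketch illustrates ML-R, an approximate MAP-R and B-R under these assumptions. It is not the authors' implementation: all function names are hypothetical, and the MAP estimate is approximated by the mean of the heavier posterior component, which is only accurate when the two posterior modes are well separated.

```python
import math

X = (-0.3, -0.1, 0.1, 0.3)  # stimulus x-positions of experiment 1


def ml_estimate(ys, sigma_g, xs=X):
    """Least-squares estimate of w for y = w * x^2 and its sampling variance."""
    s4 = sum(x ** 4 for x in xs)
    w_ml = sum(x ** 2 * y for x, y in zip(xs, ys)) / s4
    return w_ml, sigma_g ** 2 / s4


def posterior_mixture(ys, sigma_g, sigma_pi=0.1, xs=X):
    """Posterior over w as a list of (weight, mean, variance) components."""
    w_ml, s2 = ml_estimate(ys, sigma_g, xs)
    v = 1.0 / (1.0 / s2 + 1.0 / sigma_pi ** 2)  # shared component variance
    modes = (1.0, -1.0)
    # Log evidence of each prior mode, N(w_ml; m, s2 + sigma_pi^2), up to a
    # constant shared by both modes (equal prior weights of 0.5).
    log_ev = [-(w_ml - m) ** 2 / (2.0 * (s2 + sigma_pi ** 2)) for m in modes]
    top = max(log_ev)
    raw = [math.exp(value - top) for value in log_ev]
    total = sum(raw)
    return [(r / total, v * (w_ml / s2 + m / sigma_pi ** 2), v)
            for r, m in zip(raw, modes)]


def predictive_mixture(ys, sigma_g, x_star=2.0, sigma_pi=0.1, xs=X):
    """B-R predictive distribution over y at x_star, as mixture components."""
    return [(c, mu * x_star ** 2, var * x_star ** 4 + sigma_g ** 2)
            for c, mu, var in posterior_mixture(ys, sigma_g, sigma_pi, xs)]


def map_estimate(ys, sigma_g, sigma_pi=0.1, xs=X):
    """Approximate MAP (assumption): mean of the heaviest posterior component."""
    return max(posterior_mixture(ys, sigma_g, sigma_pi, xs))[1]
```

With low generative noise, nearly all posterior weight falls on one component and the B-R prediction is effectively unimodal; with high noise, the weights approach the prior weights and the prediction becomes bimodal.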

The decision models

We considered two decision models that turn the predictive distributions into a response distribution: probability matching and Bayesian decision theory. In the case of probability matching, it is assumed that participants draw random samples from the predictive distribution. If not stated otherwise, we use sampling-based decisions throughout this work. Meanwhile, according to Bayesian decision theory, participants select a response by minimizing the expected loss. Here, we considered only the square loss, which is equivalent to choosing the mean of the predictive distribution. Independently of the form of the loss function, Bayesian decision theory generates responses from the predictive distribution deterministically. When we use loss-based decision models, we indicate this by adding the prefix "L:" to the model, e.g., L: y|x, D for a loss-based decision model applied to Bayesian regression. To model participants' responses, we also accounted for internal sources of noise, i.e., noise which is inherent to neural processing, decision making and the execution of motor action [29, 39]. We call the sum of these noise components motor noise for brevity. The motor noise is not a model parameter but a participant-specific parameter. We computed the motor noise σ_m for each participant from the 20 responses to the noiseless stimulus. To ensure robustness to outliers, we used the average value between the 16% and 84% percentiles of the response distribution. The motor noise was computed individually for each of the seven participants of the main experiment, while for the second experiment, we used the average of these values, i.e., 0.48, because the noise-free responses of experiment 2 were not available. The motor noise was included in the models by convolving the predictive distribution with a Gaussian of variance σ_m^2. In the case of loss-based decision models, motor noise was the only source of response variability.
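
The two decision models and the motor noise can be illustrated with a short Python sketch. It assumes the predictive distribution is given as a list of (weight, mean, variance) Gaussian components; the function names are hypothetical. The percentile-based estimate is one possible reading of the robust procedure described above (half the distance between the 16% and 84% percentiles, which equals the standard deviation for a Gaussian) and should be treated as an assumption.

```python
import math
import random
import statistics


def probability_matching(mixture, sigma_m, rng):
    """Sample a response from the predictive mixture, then add motor noise."""
    u = rng.random()
    acc = 0.0
    chosen = mixture[-1]  # fallback guards against rounding in the weights
    for comp in mixture:
        acc += comp[0]
        if u < acc:
            chosen = comp
            break
    _, mean, var = chosen
    # Convolving with Gaussian motor noise adds sigma_m^2 to the variance.
    return rng.gauss(mean, math.sqrt(var + sigma_m ** 2))


def squared_loss_response(mixture, sigma_m, rng):
    """Respond with the predictive mean; motor noise is the only variability."""
    mean = sum(c * mu for c, mu, _ in mixture)
    return rng.gauss(mean, sigma_m)


def robust_sigma(responses):
    """Assumed robust spread estimate: half the 16%-84% percentile range."""
    q = statistics.quantiles(responses, n=100, method="inclusive")
    return 0.5 * (q[83] - q[15])
```

For a symmetric bimodal prediction with modes near ±4, probability matching yields responses near either mode, whereas the squared-loss rule places all responses around zero, between the modes.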

Modality of predicted and observed response distributions

Fig 2(A) shows the responses of a representative participant along with the predicted response distributions of the different models. Both ML-R and MAP-R ignore one of the modes (here, the mode corresponding to a downward-facing parabola). In addition, the parabola predicted by ML-R has lower curvature (i.e., a quadratic parameter of lower absolute value) than the parabolas predicted by the other models and than the parabola that the participant responded with. A potential explanation for this finding is that, while ML-R does not take the prior into consideration, humans do make use of the prior. In the low noise regime (σ_g = 0.03, Fig 2(B)), the discrepancy in the number of modes between the participant's response distribution (unimodal) and the response distribution predicted by the prior regression model (bimodal) rules out the explanatory validity of the latter. In the higher noise regimes (σ_g ∈ {0.1, 0.4}, Fig 2(C) and 2(D)), MAP-R and ML-R fail to account for the fact that the participant's responses are distributed across both modes. The finding that, at σ_g = 0.03, MAP-R often matches the participant's responses with high accuracy provides implicit evidence that participants used the prior and had learned the parabola's generative model.

Fig 2

Example responses.

B-R is the only model that can explain the transition from a unimodal response distribution (at low noise, (B)) to a bimodal response distribution (at high noise, (D)). (A) A sample stimulus (green dots) at the high noise level (σ_g = 0.4). For this specific stimulus, contours indicate the response distributions predicted by ML-R, MAP-R, B-R and B-R_σ̂ (not shown to the participant) at various x⋆. At x⋆ = 2, we recorded the participant's responses (gray dots). The cross section at x⋆ = 2 is shown in (D). (B-E) The predicted response distributions at x⋆ = 2 of ML-R (blue), MAP-R (orange), B-R (red), B-R_σ̂ (dark red), P-R (green) and the observed responses (gray). As σ_g increases (B-D), the data become less informative. Consequently, and in accordance with B-R, the response distribution becomes more bimodal. (E) Due to the weak prior, the predictions of B-R and B-R_σ̂ respond more strongly to the data and diverge from the modes of P-R more strongly than in the previous conditions. The skewness of B-R results from the mixture of both Gaussian components.

Fig 2(E) illustrates the participant's responses in the condition with σ_π = 0.5. The participant's responses cover a wider range of values than in the conditions of experiment 1, where σ_π is smaller (i.e., σ_π = 0.1). While the generative noise σ_g = 0.1 is the same as in Fig 2(C), this condition is more difficult because the prior is less reliable. As a consequence, participants rely more strongly on the noisy stimulus and produce more response variability. In this example, the responses are closely clustered around the center. B-R_σ̂ is attracted more strongly to the center than B-R because the former is more driven by the stimulus due to underestimating the noise.

B-R outperforms the other models

In order to formally assess model performance, we next conducted a quantitative model comparison across all participants. For each of the seven participants individually, we computed the log probability that the participant's responses arise from the given model. We summed these log probabilities over all of the unique stimuli D as a measure of the quality of the model. Fig 3(A) shows these values relative to the B-R baseline value for each noise level, averaged across participants. Negative values indicate poor performance relative to B-R. A subject-level analysis showed that the model comparison results were not driven by any single participant's data (see S1 Text). Because our model comparison is fitting-free, we do not need to account for different levels of model complexity. Indeed, in the present case, the log likelihood comparison is equivalent to using the Bayesian Information Criterion.
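
The fitting-free comparison described above amounts to summing log densities of the observed responses under each model's predicted response distribution and reporting differences relative to B-R. A minimal Python sketch follows, assuming each prediction is a list of (weight, mean, variance) Gaussian components and that motor noise enters by adding σ_m^2 to each component variance. The function names and the density floor of 1e-300 are illustrative assumptions, not part of the original analysis.

```python
import math


def normal_pdf(z, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_density(r, mixture, sigma_m):
    """Log density of response r under the mixture convolved with motor noise."""
    dens = sum(c * normal_pdf(r, mean, var + sigma_m ** 2)
               for c, mean, var in mixture)
    return math.log(max(dens, 1e-300))  # floor (assumption) avoids log(0)


def summed_log_likelihood(responses_by_stimulus, mixtures_by_stimulus, sigma_m):
    """Sum log densities over all responses to all unique stimuli."""
    return sum(log_density(r, mix, sigma_m)
               for responses, mix in zip(responses_by_stimulus, mixtures_by_stimulus)
               for r in responses)


def relative_to_baseline(scores, baseline="B-R"):
    """Express each model's score relative to the baseline model."""
    return {name: score - scores[baseline] for name, score in scores.items()}
```

In this convention a negative relative score means that the model explains the responses worse than the baseline.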

Fig 3

Model comparison.

The model comparison shows that the B-R model best explains the data (A, B) and that sampling-based decision models outperform loss-based decision models (C). (A) Difference in log likelihood with respect to B-R averaged over participants for different experimental conditions. Negative values mean that B-R wins the comparison. B-R is either winning (σ_g ∈ {0.03, 0.1}) or equivalent to P-R because the two coincide at high levels of parameter uncertainty (σ_g = 0.4 and σ_π = 0.5). (B) The expected likelihood of each model for a randomly selected participant shows what fraction of participants are best described by a model. Overall, B-R and B-R_σ̂ describe the population best. (C) Log likelihood difference between a sampling and a loss-based decision model. Negative values favour sampling. In all conditions and for all regression models, sampling explains the data better than loss-based decision models with exact inference. For B-R, B-R_σ̂ and P-R, loss-based models do not predict bimodal responses. At low noise (σ_g = 0.03), loss-based models underestimate the response variance. Error bars represent the SEM across participants.

As Fig 3(A) shows, B-R is among the highest performing models for all conditions. As the task difficulty increases (left to right), P-R performance approaches that of B-R. This is because the parameter uncertainty encoded in the prior becomes more important and the response distribution becomes bimodal. Neither MAP-R nor ML-R can capture this and therefore perform poorly. These results are consistent across participants (see S1 Text for a subject-level analysis). At low noise (σ_g = 0.03), participants give unimodal answers and the mean predictions of B-R and MAP-R are indistinguishable. Then the model that better captures the response variability wins. In general, B-R explains the variability of the responses better. The variability is also the reason why P-R performs relatively well under the more difficult conditions, i.e., σ_g = 0.4 and σ_π = 0.5.
The averaged results are largely consistent with a subject-level analysis. A notable exception is that at σ_g = 0.03, MAP-R emerges as the best model (closely followed by B-R) for participants 3, 4 and 7 (see S1 Text). A Bayesian random effects analysis confirms this. Specifically, we used the model posterior averaged over a randomly selected participant k. This measure reflects the fraction of participants for which a given model wins. Fig 3(B) shows that the responses of the majority of participants are best modelled by B-R or B-R_σ̂. Since B-R interpolates between MAP-R (at low noise) and P-R (at high noise), as expected, at σ_g = 0.03, i.e., the easiest condition, the responses of some participants are also well modelled by MAP-R, while at the most difficult condition, i.e., when σ_π = 0.5, the responses of half of the participants are best described by P-R (and the other half by B-R). Next, we investigated whether sampling or the loss function perspective explains the responses better. Fig 3(C) depicts the log likelihood of loss-based decision making compared to sampling for each model. Negative values indicate that sampling wins. Sampling explains the data better for all models and in all experimental conditions. One explanation is that the loss mechanism turns bimodal predictive distributions into unimodal response distributions. Here, we use the square loss such that the (unimodal) response distribution is centered on the mean of the predictive distribution. In the case of P-R, the mean of the predictive distribution lies midway between the two modes. Clearly, this method does not capture bimodal responses. This is why the performance difference between the two decision models is smallest at σ_g = 0.03, where all models except for P-R make predictions which are close to unimodal. The second explanation for the better performance of sampling is that the loss function approach with exact inference underestimates response variability.
The response variability differs from one stimulus to another and is often higher than the motor noise σ_m. This explains the better performance of sampling for MAP-R and ML-R, for which the effect of turning a bimodal response distribution into a unimodal one is absent. In these cases, the sampling-based decision model has the effect of increasing the variance of the predicted response distribution by the variance of the predictive distribution. This leads to better model performance on variable response data, even in the experimental condition σ_g = 0.03, where participants respond unimodally. In conclusion, from the models considered here, B-R with sampling best explains participants' responses.

B-R explains the generative noise-dependent increase in response variance

A key characteristic of B-R is the transition of the model's posterior predictive distribution from unimodality to bimodality as σ_g increases, i.e., as the data become less informative. To analyse this transition, we used the participants' response variances at the different levels of generative noise. The variance of the response distributions is sensitive to bimodality. For example, if all responses are distributed evenly across both modes (which lie near y = ±4 at x⋆ = 2), the variance is close to 16, which corresponds to the variance predicted by P-R. If all responses are located in a single mode, the variance is typically smaller by a factor of ten (e.g., see Fig 2(B)). We explain this in more detail below. For each stimulus D and for each participant, we computed the variance of the 20 observed responses. Because we have 7 participants, this yields a distribution over 7 × 20 = 140 empirical variance values at each value of σ_g. We compared this distribution with the response variance distribution predicted by the models. To achieve a higher resolution and show the dynamics of the variance as a function of the generative noise, we generated 5000 unique stimuli from a densely spaced range of σ_g values instead of relying on the small number of stimuli and noise levels used in the experiment (see Materials and methods). The empirical variance distribution (gray) and its median (black) are shown in Fig 4 along with the predicted median of the variance distribution for each model (color coded). For B-R and B-R_σ̂, we plotted the distribution in Fig 4(A) and 4(B), respectively.

Fig 4

Response variances of predicted and empirical distributions, as a function of generative noise.

B-R best explains the increase in response variance as a function of the generative noise σ. Variances of the empirical response distributions from all participants (gray dots, median: gray line) and predicted response distributions, corresponding to the two B-R variants (median: red line, log probability: heatmap). B-R (A) Interpolating between MAP-R and P-R, only the B-R variants capture the upward trend in the data. At σ = 0.4, B-R fails to account for the empirical responses with close-to-zero variances. (B) At σ = 0.4, B-R predicts a bimodal variance distribution because, in trials with low noise estimates, the predicted response distribution is unimodal and thus variance is low. Because of these low-variance trials, the median of B-R increases slower than the median of B-R and captures the empirical median better. Because ML-R and MAP-R behaved identically, the MAP-R represents both regression models. The median of the B-R variance distribution increases with the noise level. This is due to the fact that B-R’s predicted response distribution transitions from unimodal to bimodal; this transition is modulated by the generative noise, which determines the relative contribution of the prior and likelihood to the response distribution. Consequently, the B-R model is the only one for which the variance values smoothly transition from the MAP variance at the low noise level (σ = 0.03) to the P-R variance at the high noise level (σ = 0.4). The P-R variance remains constant and the MAP-R variance increases very weakly as a function of the generative noise. The variance analysis provides further evidence for the superiority of both B-R variants over the other models. While B-R captures the general trend in the data, it fails to account for two key characteristics. 
First, the median variance increases slower than B-R would predict, and secondly, at high levels of generative noise, B-R fails to reproduce the lower part of the distribution (where response variances are close to zero). A potential explanation for this discrepancy is that participants estimate the noise on a trial-by-trial basis. When the noise added to the 4-dot stimulus was, by chance, such that the dots appeared to be well-aligned on a parabola, participants would, presumably, underestimate the generative noise and respond in a way which was consistent with a unimodal distribution. The fact that the B-R model captures the empirical variance better than B-R provides some evidence for this idea. On some trials, B-R underestimates the true σ and applies the B-R formalism with high confidence in the stimulus data. In these cases, the model relies strongly on the likelihood and bimodality, which normally enters through the prior, is not achieved. Rather, the resulting response distribution is unimodal and has low variance. Despite the fact that B-R describes the qualitative features of the variance distribution better than B-R, it performs worse in terms of log likelihood. This shows that the low variance responses of humans and of B-R do not always coincide on a trial-by-trial basis. To better understand the relation between response variance and bimodality, we dissect the variance of a bimodal response distribution to stimulus D into its components: where μ1 and μ2 are the means of the modes of the posterior predictive distribution, c is the mixture coefficient and corresponds to the variance of both modes. The unimodal contribution is not mode-specific because we chose a symmetrical prior. The first two terms constitute a unimodal contribution and the last term a bimodal contribution. The latter is controlled by the mean dispersion (μ1 − μ2)2 and a prefactor c(1 − c) that is equal to zero for c ∈ {0, 1} and is maximal for c = 1/2. 
To determine to what extent each component of this dissection is present in the response data, we defined the empirical counterparts of μ1, μ2 as the means of the upper and lower modes of the response distribution and c as the mixture coefficient, corresponding to the fraction of positive responses r > 0. For the unimodal variance contribution, we used the variance of the mode which contains the majority of responses (see Methods for more details). The comparison between data and models shows that both B-R variants correctly predict the driver of the observed variance to be the transition to bimodality. Fig 5(A) shows the predicted positive coefficient c (median) of the models as a function of the empirically-observed coefficient across all participants and stimuli (in the main experiment). As further evidence for the validity of the B-R perspective, both B-R variants correctly predict the fraction of positive responses. Indeed, the smooth transition from a unimodal to a bimodal distribution is nicely captured by B-R. In contrast, MAP-R transitions sharply and is more reminiscent of a step function while P-R predicts equally strong modes across all noise level conditions.

Fig 5

Median of each of the bimodal response distribution variance components across all participants and stimuli.

(A) Predicted coefficient of positive mode as a function of the empirical coefficient (across all noise levels). ML-R behaves identically to MAP-R. Thus, the MAP-R curve represents both models. The shaded area shows the 40% and 60% quantiles. (B) Prefactor of bimodal contribution as a function of generative noise. Data jittered for visibility. (C) Unimodal contribution to the variance. Empirical variance computed on mode with majority of responses. (D) Mean dispersion. Only trials with bimodal responses included. As the stimulus becomes more noisy, human responses and B-R variants conform to the prior.

The bimodal contribution to the response variance depends on the prefactor c(1 − c) and the mean dispersion. Fig 5(B) shows the median value of the prefactor as a function of generative noise (across all participants and stimuli). Data and model predictions qualitatively match the behaviour of the variance in Fig 4. Indeed, the other contributions to the variance are less important. The mean dispersion, shown in Fig 5(D), plays the role of a large constant. The unimodal contribution to the variance, shown in Fig 5(C), is small compared to the bimodal contribution. In conclusion, the coefficient c plays the dominant role in determining the variance of the response distribution. Because the two B-R variants estimate c sufficiently well, they best match the empirical variance distribution. Interestingly, all models overestimate the unimodal variance, with the exception of MAP-R in the low noise condition (Fig 5C). The B-R variants predict larger variance than MAP-R because they translate the posterior parameter uncertainty into response uncertainty. P-R predicts even larger response variance because it uses the prior parameter uncertainty, which is generally larger than the posterior one. Despite the fact that MAP-R best describes the median variance, it performs worse than B-R in terms of log likelihood.
Fig 5(C) reveals that one factor contributing to the poorer performance of MAP-R is the occurrence of unimodal, high variance responses. In summary, the variance analysis provides further evidence that B-R captures the way in which generative noise induces a transition from unimodality to bimodality in participant responses. However, B-R overestimates response variance. Trial-by-trial estimation of the noise offers a potential explanation for why participants cluster their responses more unimodally than predicted.

Unimodal responses are overall best explained by B-R

Thus far, the main factor behind the superiority of the B-R model’s performance relative to the other models is the ability of B-R to capture the bimodality of responses, i.e., to correctly set the mixture coefficient. However, the previous analysis showed that unimodal variance decreases as a function of generative noise while B-R predicts an increase. It remains unclear whether B-R still wins the model comparison in a unimodal setting where performance is independent of the mixture coefficient. To address this question, we conducted a model comparison on a unimodally conditioned dataset. For each stimulus D, we considered only responses in which the dominant response mode and the dominant mode of the model predictions coincided (see Methods for details). The conditioning yields a unimodal dataset in the sense that all predictions and responses belong to the same mode. To make the model comparison fair for the bimodal predictive distributions of P-R and the B-R variants, we removed the inferior mode and normalised the remaining probability mass to one. At the group level, B-R wins the model comparison across all conditions, as shown in Fig 6(A). B-R clearly outperforms the other models in two conditions in particular: at generative noise σ_g = 0.03 and at prior standard deviation σ_π = 0.5.

Fig 6

Unimodal model comparison.

The unimodal analysis confirms previous results: overall B-R with sampling wins the model comparison. (A) Differences in log likelihood on unimodal data, averaged over participants. Negative values mean that B-R wins. ML-R is omitted because its poor performance complicates visualisation. B-R wins at σ_g = 0.03 and σ_π = 0.5, but not in the other conditions. (B) All models use the quadratic loss function to select responses, with response variance given by the motor noise. B-R with sampling explains the unimodal data best for most participants. High subject-level variability results in large errors (see S1 Text for a subject-level analysis). (C, D) The fraction of participants best described by a given model. At σ_g = 0.4, several models perform well. Error bars indicate SEM across participants.

Interestingly, in the bimodal dataset, B-R did not emerge as a clear winner at σ_π = 0.5 because P-R performed similarly well. Thus, B-R is not better than P-R at modelling how participants balance the two modes, but once the mode is chosen, it performs better. At σ_g = 0.1, the two B-R variants outperform the other models. B-R wins by a small margin (see axis scaling). However, a subject-level analysis (see S1 Text) shows that the average is mostly driven by participant 1, while in the case of other participants all models perform similarly well. At σ_g = 0.4, no clear winner emerges. Intuitively, this makes sense because the stimulus is not informative and all models rely mostly on the prior information about w. Here, B-R wins or performs similarly to other models. The Bayesian random effects analysis (results shown in Fig 6(C)) confirms the previous results. The ratio of participants whose responses are best described by a given model indicates that B-R describes the population at σ_g = 0.03 and at σ_π = 0.5 well. As in the bimodal dataset, MAP-R reflects the responses of some participants well at σ = 0.3.
At σ_g = 0.1, the log likelihood performance (Fig 6A) of all models is similar but B-R and MAP-R win by a small margin (see S1 Text for a subject-level analysis). Hence, B-R and MAP-R perform best in the Bayesian random effects analysis (Fig 6C). No clear winner emerges at σ_g = 0.4.

Next, we revisit the question of whether participants sample or use a loss function. In the bimodal data, the loss function approach was at a disadvantage because it could only produce unimodal response distributions. This disadvantage is not present in the unimodal dataset. To make the performance of models in Fig 6(A) and 6(B) comparable, we use B-R with sampling as the baseline in both plots. Fig 6(B) shows that, averaging across participants and conditions, B-R with sampling outperforms the loss-based models. The large errors in (B) reflect large intersubject variability. The Bayesian random effects analysis in Fig 6(D) confirms that B-R also wins at the subject level at σ_g = 0.03 and at σ_π = 0.5. One exception is L:B-R at σ_g = 0.1. Indeed, the subject-level analysis (S1 Text) shows that, in terms of the averaged log likelihood at middle and high noise, σ_g ∈ {0.1, 0.4}, participant 1 is an outlier. For other participants, the performance of B-R with sampling and the loss-based models is very similar. Despite the higher intersubject variability in the unimodal dataset than in the bimodal dataset, the unimodal analysis provides convincing evidence of the superiority of B-R with sampling over the other models considered here. In contrast to the model comparison in the bimodal analysis, in the unimodal case B-R clearly wins the model comparison at σ_π = 0.5.

Discussion

In our experiment, participants adjusted a dot such that it lay on a parabola determined by four other dots. We used the log likelihood to compare participants’ responses to the predictions of ten models: five regression models (ML-R, MAP-R, P-R, B-R and B-R with noise estimate) combined with two decision models, i.e., probability matching (sampling) and Bayesian decision theory (loss-based). B-R with sampling best explained the responses across various experimental conditions. An analysis of the observed and predicted response variance showed that the model comparison results were mainly driven by the transition from unimodal to bimodal responses. Only the B-R variants were able to capture this aspect of the data. However, participants clustered their responses in one of the modes more often than B-R predicted. This resulted in a discrepancy between the predicted and empirical response variances. For B-R with noise estimate, this discrepancy was smaller. Thus, one possible explanation for the discrepancy is that participants were estimating noise on a trial-by-trial basis. Since in the variance analysis we considered the response variance from all trials, the relatively better performance of B-R with noise estimate here did not translate into superior performance in the log likelihood analysis, in which we analyzed responses on a trial-by-trial basis. B-R without noise estimation was more accurate in predicting the mean and variance of the data trial-by-trial. In a final analysis, we conditioned the responses to a single mode to eliminate the effects of bimodality, which was the driving factor behind the model comparison results in the first two analyses. This allowed us to study the performance of B-R based on its mean and variance. The analysis of the unimodal dataset generally confirmed the previous results. B-R with sampling either outperformed or performed similarly to the other models.
Our results suggest that humans turn the posterior predictive distribution into a response via probability matching rather than Bayesian decision theory. The loss function approach fails to explain the bimodality of responses to repeated identical stimuli. One way to interpolate between Bayesian decision theory and probability matching is to represent distributions by samples [40]. The number of samples used to approximate the (predictive) distribution interpolates between both decision models. If the number of samples is sufficiently large, the approximated distribution converges to the true distribution, and we enter the domain of standard Bayesian decision theory. However, if only a single sample is used for the approximation, probability matching is recovered. This is because applying Bayesian decision theory to a one-sample distribution returns the location of this sample as a response. The number of samples takes the role of a transition parameter between classical Bayesian decision theory and probability matching. In the context of a categorical decision task, Drugowitsch et al. [29] showed that noisy inference (rather than noisy decision making or noisy perception) explains the largest fraction of participants’ response variability. Indeed, noisy inference offers an interesting way to reconcile Bayesian decision theory with bimodal responses. Conceptualising the choice between two modes as noisy inference over two unimodal models leads to a bimodal response distribution (see S1 Text). At low generative noise, the noisy inference procedure yields a unimodal response distribution because the difference between the two model evidences is large. At high generative noise, the evidences for both models are similar such that the inference noise becomes the decisive factor in the participant’s response. In this case, noisy inference predicts a bimodal response distribution.
As in Bayesian regression, the transition from unimodal to bimodal response distribution depends on generative noise. However, the speed of the transition also depends on the inference noise, i.e., a free parameter. In Bayesian regression, this transition speed is computed as a function of the stimulus and the parameters of the generative model, and no fitting is required. Because we wanted to study how humans process parameter uncertainty in a fitting-free context, we did not test the noisy inference model quantitatively. Future work is required to further explore the relationship between Bayesian regression and noisy inference. Throughout this study, we assumed that the generative model is known. In real world regression tasks, this assumption is typically not justified. Instead, subjects must simultaneously learn the generative model and its parameters. For example, in the context of our experiment this would translate to not informing participants ahead of time that the 4-dot stimuli were generated from parabolas. Bayesian regression extends naturally to tasks with model uncertainty. The Bayesian approach to making predictions makes use not only of the expectation over the posterior over the parameters but also of the expectation over the posterior over the models. Thus, Bayesian regression with model uncertainty requires subjects to infer the posterior over models and to average over this posterior. Compared to Bayesian regression with a known generative model, this multiplies the computational burden by the number of relevant models. It is an interesting question whether subjects solve regression tasks with model uncertainty by taking advantage of Bayesian regression or whether they rely on point estimates such as the MAP-estimator of the model posterior. A study in the context of sensory fusion suggests the latter [41] but it is unclear to what extent this is also the case in the domain of function fitting. 
Compared to toy examples, real-world tasks typically involve complex models with high-dimensional parameter spaces. This makes the evaluation of the integral in Bayesian regression particularly difficult. Sampling offers a potential solution because it scales well to high dimensions and integrals reduce to the evaluation of a sum. Recent advances in neuronal algorithms [42, 43] suggest that, in theory, the brain can efficiently encode probability distributions via samples. Thus, sampling provides an intriguing direction to further explore potential links between psychophysical experiments and neuronal implementation of uncertainty. For a given generative model, Bayesian regression and other regression models prescribe how to make predictions when parameter uncertainty is present. For example, MAP-R uses a point estimate of the posterior while B-R uses the entire posterior. Thus, the performance of a regression model in terms of its ability to model human responses depends on two factors: the ability of the regression model to describe how humans handle uncertainty and the degree to which the theoretically-chosen generative model is true to the generative model inferred by the observer. The predictions of the regression models in our study are limited in that they assume a parabolic generative model. A previous study reiterated the formal equivalence of Bayesian regression and Gaussian processes and demonstrated the flexibility with which Gaussian processes can model human responses in a complex function fitting task [19]. In the case of the Bayesian regression model, the authors fit various hyperparameters, and it was unclear how they controlled for the complexity of the fit. Thus, the study could not answer whether the Bayesian regression model performed well because of its flexibility in representing different generative models or because it captured how humans process parameter uncertainty.
In our work, we removed the confounding factor by enforcing a simple generative model through feedback in every trial. Instead we remained fitting-free and could, thus, study directly how participants processed parameter uncertainty. The rationale of simplicity rather than complexity has advantages for the analysis as well. The different models are analytically tractable and thus can be studied systematically. Additionally, the one-dimensional response space was easy to visualize and the amount of data needed to compare the predictive and empirical distributions was limited. To remain fitting-free, we assumed that participants know the generative model, including the prior over the parameters. Without this assumption, we would have had to account for potential temporal dynamics of learning with a participant-specific, time-dependent prior. For instance, it might take participants a non-negligible amount of time to learn the generative model or their responses could be influenced by immediately preceding trials. To avoid such complications, we showed the generative parabola after each trial and we chose a function that humans can learn [21], i.e., a parabola. Indeed, after having run the experiments, we found that there was no substantial learning taking place between the first and the last trials except for some mild learning at σ = 0.1 (see S1 Text). To extend our study to continuous learning, it would be interesting to relax the i.i.d. assumption of the stimuli in the generative model, as in [44], and investigate if a Bayesian framework models the evolution of posterior parameter uncertainty as well. We presented and analysed our experimental task within the framework of regression. After seeing the training data, i.e., the 4-dot stimuli, participants were asked to make predictions. Then, one way of making predictions is to compute the posterior predictive distribution by marginalising over the posterior of the model parameters. 
Alternatively, the task can be interpreted as inference of the point where the parabola intersects a vertical line at a chosen x-position given the 4-dot stimulus. The posterior predictive corresponds to the posterior of the response location given the data. Indeed, there is a formal equivalence between Bayesian regression with linear Gaussian generative models and Gaussian processes with a kernel that encodes the generative model (e.g. [19]). Algorithmically, however, Bayesian regression and Gaussian process inference differ. B-R focuses on the compression of training data into model parameters or a distribution of model parameters, e.g., the MAP estimator or posterior. The training data does not need to be stored to make new predictions. In contrast, Gaussian process inference requires that the training data is stored. Thus, the memory requirements grow linearly with the size of the training data, which constitutes an important drawback of Gaussian process models. To distinguish between the B-R and Gaussian process inference perspectives, one would need to design regression tasks that cannot be reformulated as inference because observers can neither see nor remember the entire stimulus when they make predictions. One could achieve this by sequentially presenting many training data such that memorization is not a viable option but sequential updates of the posterior are. Our work was inspired by the growing emphasis on parameter uncertainty in the machine learning community; however, it is important to highlight that function learning and extrapolation have been studied before. The function learning literature has addressed which types of functions humans can learn [45], how batch or sequential data representation affects learning [22], to what extent human behaviour can be modelled by parametric functions [46] and how well humans extrapolate [21]. 
However, to the best of our knowledge, these studies have so far failed to conduct a minimal experiment to establish that humans process parameter uncertainty in accordance with Bayesian regression. Our contribution will help to better understand the brain’s remarkable ability to learn and generalise from very little data and underpins the power of Bayesian regression as a framework in psychophysical modelling.
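The Discussion's point that the number of samples interpolates between probability matching and Bayesian decision theory can be illustrated with a small simulation. This is a hedged sketch under stated assumptions, not the authors' analysis: the predictive distribution is taken to be a symmetric two-mode Gaussian mixture with made-up values (modes at ±4, standard deviation 0.4), and the quadratic-loss response to an n-sample approximation is taken to be the sample mean. All function names are hypothetical.

```python
import random
import statistics


def draw_predictive(rng, c=0.5, mu=4.0, sd=0.4):
    """One sample from an assumed symmetric two-mode predictive distribution."""
    return rng.gauss(mu if rng.random() < c else -mu, sd)


def respond(rng, n_samples):
    """Quadratic-loss response to an n-sample approximation: the sample mean."""
    return statistics.fmean(draw_predictive(rng) for _ in range(n_samples))


def bimodal_fraction(n_samples, n_trials=2000, seed=0):
    """Fraction of simulated responses that land near one of the two modes."""
    rng = random.Random(seed)
    responses = [respond(rng, n_samples) for _ in range(n_trials)]
    return sum(abs(r) > 2.0 for r in responses) / n_trials


if __name__ == "__main__":
    # n = 1 reproduces probability matching; large n approaches the posterior mean.
    for n in (1, 2, 10, 200):
        print(n, bimodal_fraction(n))
```

With a single sample, essentially every response falls near a mode, so repeated trials reproduce the bimodal predictive distribution. With many samples, the responses collapse onto the mean between the modes, which is the unimodal behaviour of the loss-based models.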

Methods and materials

Stimulus generation from the bimodal prior

Here, we describe in detail how stimuli are generated. On the j-th trial, participants are presented with a stimulus consisting of N = 4 points in a 2-dimensional space: $D_j = \{(x_i, y_i)\}_{i=1}^{N}$. For the main experiment (with prior standard deviation σ_π = 0.1, see below), we fixed the x-values to (−0.3, −0.1, 0.1, 0.3) respectively. For the additional experiment (with σ_π = 0.5, see below), we drew the x-values from Gaussians with means (−0.18, −0.09, 0, 0.09) and standard deviation 0.09 but resampled if the minimal distance was less than 0.1 between any two points. In both cases, we then generated the y-coordinates from a Gaussian generative model with a parabolic non-linearity and the generative parameter, w:

$$p(y_i \mid x_i, w) = \mathcal{N}(y_i;\, w x_i^2,\, \sigma_g^2).$$

The parameter w is drawn from a mixed Gaussian prior

$$\pi(w) = c\,\mathcal{N}(w;\, \mu,\, \sigma_\pi^2) + (1 - c)\,\mathcal{N}(w;\, -\mu,\, \sigma_\pi^2),$$

where the parameter set consists of the mean μ = 1, mixing coefficient c = 1/2 and the standard deviation σ_π = 0.1 for the main experiment and σ_π = 0.5 for the additional experiment. The hyperparameters of the prior and of the generative probability are suppressed in the notation for clarity. Each parameter w corresponds to a generative parabola. Given this model and given a stimulus D, we asked participants to predict the y-component y⋆ at x⋆ = 2, which is equivalent to mentally fitting a parabola to the four stimulus points and estimating the point of intersection with a vertical line at x⋆. To train participants on the generative model and the prior, we showed participants the generative parabola after each trial. In the main experiment, we showed a set of 20 unique stimuli for each of the three noise levels σ_g ∈ {0.03, 0.1, 0.4}, and each unique stimulus was repeated 20 times, yielding a set of 20 responses per stimulus. This amounts to a total of 400 trials per noise level. The order of the stimuli was randomized. For the additional experiment, we set σ_g = 0.1 and showed 10 unique stimuli 10 times. Fig 1 shows the experimental paradigm.
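The stimulus generation for the main experiment can be sketched as follows. This is an illustration rather than the authors' MATLAB code, and it rests on two assumptions read off the description above: the y-values follow y = w·x² plus Gaussian noise, and the two prior modes sit at +μ and −μ. All function names are hypothetical.

```python
import random

X_MAIN = (-0.3, -0.1, 0.1, 0.3)  # fixed x-values of the main experiment


def sample_w(rng, mu=1.0, c=0.5, sigma_pi=0.1):
    """Draw the quadratic parameter from a two-mode Gaussian prior at +mu and -mu."""
    return rng.gauss(mu if rng.random() < c else -mu, sigma_pi)


def make_stimulus(rng, sigma_g, xs=X_MAIN, sigma_pi=0.1):
    """Return (w, points) with y_i = w * x_i**2 plus Gaussian noise of sd sigma_g."""
    w = sample_w(rng, sigma_pi=sigma_pi)
    points = [(x, rng.gauss(w * x * x, sigma_g)) for x in xs]
    return w, points


def make_block(sigma_g, n_unique=20, n_repeats=20, seed=0):
    """Unique stimuli, each repeated, in shuffled order (400 trials by default)."""
    rng = random.Random(seed)
    unique = [make_stimulus(rng, sigma_g) for _ in range(n_unique)]
    trials = [(j, unique[j]) for j in range(n_unique) for _ in range(n_repeats)]
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    for j, (w, points) in make_block(sigma_g=0.1)[:3]:
        print(j, round(w, 3), [(x, round(y, 3)) for x, y in points])
```

The additional experiment would differ in `sigma_pi`, in the random x-values with the minimum-distance resampling rule, and in the number of stimuli; those parts are left out to keep the sketch short.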

Regression models

In each trial, we model the participant’s computation by a consecutive inference and prediction step. During the inference step, the model assumes that the participant infers information about the quadratic parameter w based on the presented data (i.e., stimulus) D. The inferred information is then used for a subsequent prediction y⋆. We describe the participant’s overall task as computing the predictive distribution $p(y^\star \mid x^\star, D)$.

Prior regression (P-R) is our null model. P-R assumes that participants make predictions based on their prior belief but disregard information from the stimulus:

$$p_{\text{P-R}}(y^\star \mid x^\star) = \int p(y^\star \mid x^\star, w)\, \pi(w)\, dw.$$

Maximum likelihood regression (ML-R) relies only on the likelihood-maximizing parameter:

$$p_{\text{ML-R}}(y^\star \mid x^\star, D) = p(y^\star \mid x^\star, \hat{w}_{\mathrm{ML}}), \qquad \hat{w}_{\mathrm{ML}} = \arg\max_w\, p(D \mid w).$$

Maximum a posteriori regression (MAP-R) uses the parameter that maximizes the posterior p(w|D) = p(D|w)π(w)/p(D):

$$p_{\text{MAP-R}}(y^\star \mid x^\star, D) = p(y^\star \mid x^\star, \hat{w}_{\mathrm{MAP}}), \qquad \hat{w}_{\mathrm{MAP}} = \arg\max_w\, p(w \mid D).$$

Bayesian regression (B-R) uses the entire posterior for making predictions by marginalizing over it:

$$p_{\text{B-R}}(y^\star \mid x^\star, D) = \int p(y^\star \mid x^\star, w)\, p(w \mid D)\, dw.$$

Bayesian regression with noise estimate loosens the assumption that participants treat σ_g as a hyperparameter and instead assumes they use an estimate on a trial-by-trial basis, obtained with the maximum likelihood estimator from the M = 4 stimulus points. After substituting the estimate for the hyperparameter σ_g in Eq (9), the posterior predictive distribution is computed analogously to B-R.
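To make the difference between the point-estimate models and B-R concrete, the sketch below evaluates the posterior over w on a grid. It is an illustration under assumptions, not the authors' implementation (the paper states that the models are analytically tractable): it assumes y = w·x² with Gaussian noise, a prior with modes at ±μ, and a grid over w in [−3, 3]. The residual-based `estimate_sigma_g` is one plausible reading of the maximum likelihood noise estimate and is likewise an assumption.

```python
import math


def log_normal(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def log_prior(w, mu=1.0, c=0.5, sigma_pi=0.1):
    """Log of the assumed two-mode Gaussian prior with modes at +mu and -mu."""
    a = math.log(c) + log_normal(w, mu, sigma_pi)
    b = math.log(1.0 - c) + log_normal(w, -mu, sigma_pi)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # log-sum-exp


def w_ml(points):
    """Least-squares (maximum likelihood) estimate of w for y = w * x**2 + noise."""
    return sum(y * x * x for x, y in points) / sum(x ** 4 for x, _ in points)


def estimate_sigma_g(points):
    """Assumed noise estimate: root mean squared residual about the ML parabola."""
    w_hat = w_ml(points)
    return math.sqrt(sum((y - w_hat * x * x) ** 2 for x, y in points) / len(points))


def posterior_on_grid(points, sigma_g, grid, sigma_pi=0.1):
    """Normalised posterior weights p(w | D) on a grid of candidate w values."""
    logp = [log_prior(w, sigma_pi=sigma_pi)
            + sum(log_normal(y, w * x * x, sigma_g) for x, y in points)
            for w in grid]
    top = max(logp)
    weights = [math.exp(v - top) for v in logp]
    total = sum(weights)
    return [v / total for v in weights]


def summarise(points, sigma_g, x_star=2.0, sigma_pi=0.1):
    """Point estimates (ML, MAP) and B-R predictive moments for one stimulus."""
    grid = [-3.0 + 6.0 * i / 6000 for i in range(6001)]
    post = posterior_on_grid(points, sigma_g, grid, sigma_pi)
    mean_w = sum(p * w for p, w in zip(post, grid))
    var_w = sum(p * (w - mean_w) ** 2 for p, w in zip(post, grid))
    return {
        "w_ml": w_ml(points),
        "w_map": grid[max(range(len(grid)), key=post.__getitem__)],
        "c_positive": sum(p for p, w in zip(post, grid) if w > 0),
        "br_mean": x_star ** 2 * mean_w,               # B-R predictive mean
        "br_var": x_star ** 4 * var_w + sigma_g ** 2,  # B-R predictive variance
    }


if __name__ == "__main__":
    stimulus = [(x, x * x) for x in (-0.3, -0.1, 0.1, 0.3)]  # noise-free, w = 1
    for sigma_g in (0.03, 0.1, 0.4):
        print(sigma_g, summarise(stimulus, sigma_g))
```

For the same stimulus, the posterior mass on the positive mode moves from nearly 1 at low assumed noise towards 1/2 at high assumed noise, while the MAP estimate stays on one mode. This is the smooth versus step-like transition that separates B-R from MAP-R in the Results.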

Participants’ internal noise

To predict the participants’ responses r from the regression models’ output y⋆, we had to account for the internal noise of the participants. We did this by showing a noise-free stimulus 20 times and fitting a Gaussian with variance σ_m² to each participant’s response distribution. To be robust against outliers, we took the average of the 16% and 84% percentiles of the response distribution as motor noise. The predicted response distribution is then the model’s predictive distribution convolved with this Gaussian motor noise:

$$p(r \mid D) = \int \mathcal{N}(r;\, y^\star,\, \sigma_m^2)\, p(y^\star \mid x^\star, D)\, dy^\star.$$

Model comparison

We used the log-likelihood and the variance to compare the predicted and empirical response distributions.

Log likelihood

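To compute the log likelihood for a model across all responses at a given noise level σ_g, we summed the individual log likelihoods of each response r (the log of Eq (11)) across all stimuli D:

$$\mathrm{LL} = \sum_{j} \sum_{k} \log p(r_{jk} \mid D_j),$$

where r_jk denotes the k-th response to the j-th stimulus D_j.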
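The summation over responses and stimuli can be sketched as below. This is a minimal illustration, assuming that each model's predictive distribution is available as a Gaussian mixture given by (weight, mean, variance) triples and that motor noise is additive and Gaussian, so that it simply adds to each component's variance. The function names, the data layout and the floor of 1e-300 inside the logarithm are assumptions made for this sketch.

```python
import math


def response_density(r, components, sigma_m):
    """Density of response r when Gaussian motor noise is added to the prediction.

    components: list of (weight, mean, variance) triples describing a
    Gaussian-mixture predictive distribution. Convolving each component with
    N(0, sigma_m**2) adds the variances.
    """
    density = 0.0
    for weight, mean, var in components:
        sd = math.sqrt(var + sigma_m ** 2)
        z = (r - mean) / sd
        density += weight * math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return density


def log_likelihood(responses_by_stimulus, predictions_by_stimulus, sigma_m):
    """Sum of log densities over all responses to all stimuli of one condition."""
    total = 0.0
    for responses, components in zip(responses_by_stimulus, predictions_by_stimulus):
        for r in responses:
            # The floor avoids log(0) for responses far from every mode.
            total += math.log(max(response_density(r, components, sigma_m), 1e-300))
    return total


if __name__ == "__main__":
    responses = [[3.9, 4.1, -4.0]]                    # one stimulus, three responses
    bimodal = [[(0.5, 4.0, 0.1), (0.5, -4.0, 0.1)]]   # prediction with two modes
    unimodal = [[(1.0, 4.0, 0.1)]]                    # prediction with one mode
    print(log_likelihood(responses, bimodal, sigma_m=0.2))
    print(log_likelihood(responses, unimodal, sigma_m=0.2))
```

A single response in the mode that a unimodal prediction ignores is penalised heavily, which is why a model's ability to place mass on both modes matters so much for the log likelihood comparison.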

Bayesian random effects

The model that wins in terms of the participant-averaged log likelihood need not win the model comparison for each participant. The Bayesian random effects analysis quantifies what fraction of participants are described by a model [47]. Specifically, we report the expected likelihood of each model for a random participant (Eq. (15) in [47]), i.e., the normalised Dirichlet parameter α_k / Σ_k′ α_k′, where α_k is the Dirichlet parameter of model k.

Variance prediction

As an independent comparison of the data and the predicted response distribution, we used the variance of the responses. For each of the 20 stimuli D we obtained a single empirical value, the variance of the 20 responses recorded. High variance values reflect ambiguous and difficult stimuli while low values indicate easy stimuli, prompting participants to give very similar responses across repetitions. Hence, at each noise level σ_g, we have an empirical variance distribution that corresponds to the 20 stimuli. For the predicted variance distribution, we use the variance predicted by a model in response to a stimulus D, where we used Eq (3) for an analytical computation of the variance. To improve the resolution, we increased the number of stimulus samples D to 5000 for the theoretical prediction. We use the resulting distribution over predicted variances to compute the median in Fig 4 and the log density in the background.

Determining the components of the variance in the response data

To compare the components of the predictive variance in Eq 3 to data, we make the following definitions for the set of responses to a given stimulus. The empirical mixing coefficient is the fraction of positive responses. If only one of the modes is present in the data (c ∈ {0, 1}), the bimodal contribution vanishes and we do not require the means for the total variance. If both modes are present, we compute their means separately over the positive and the negative responses. We define the unimodal variance contribution as the variance of the responses in the dominant mode, i.e., the mode that has the majority of responses. If no dominant mode exists, we omit the stimulus. We did not use the inferior mode so that the variance estimate is based on sufficient samples (at least 11).
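These definitions can be sketched for a single stimulus as follows. This is an illustrative reading, not the authors' code: responses equal to zero are counted with the lower mode, and the population variance is used for the dominant mode because the source does not state which estimator was applied. The function and key names are hypothetical.

```python
import statistics


def variance_components(responses):
    """Empirical c, prefactor, mean dispersion and unimodal variance for one stimulus."""
    upper = [r for r in responses if r > 0]
    lower = [r for r in responses if r <= 0]
    c = len(upper) / len(responses)  # fraction of positive responses
    result = {"c": c, "prefactor": c * (1.0 - c),
              "dispersion": None, "unimodal_var": None}
    if upper and lower:
        # Mode means are only needed when both modes occur in the data.
        result["dispersion"] = (statistics.fmean(upper) - statistics.fmean(lower)) ** 2
    if len(upper) != len(lower):
        # A dominant mode exists; on a tie the stimulus is omitted (value stays None).
        dominant = upper if len(upper) > len(lower) else lower
        result["unimodal_var"] = statistics.pvariance(dominant)  # assumed estimator
    return result


if __name__ == "__main__":
    responses = [3.0, 5.0] * 6 + [-4.0] * 8  # 12 upper and 8 lower responses
    print(variance_components(responses))
```

With 20 responses per stimulus, a dominant mode always contains at least 11 responses, which matches the minimum sample size mentioned above.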

The unimodal dataset

To obtain a unimodal dataset from the full dataset, we consider only responses and model predictions if they have the same dominant mode, i.e., parabolas facing either upwards or downwards. We define the dominant response mode as the one containing more than half of the responses and the dominant mode of the model as the one carrying more than half of the probability mass. For example, if 11 responses fall into the upper mode but the models predict a downward parabola, all responses are disregarded. However, if the models predict an upward parabola, the 11 responses enter the unimodal dataset. Note that the symmetric prior ensures that the two B-R variants and MAP-R share a dominant mode. Because the models use the same likelihood term, they process the stimulus as evidence for the same mode and break the symmetry in the same direction. Averaged over participants, the fraction of trials per condition retained for the unimodal dataset is 0.994, 0.849 and 0.596 for σ_g ∈ {0.03, 0.1, 0.4} with σ_π = 0.1, and 0.703 for σ_g = 0.1 with σ_π = 0.5.
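The conditioning rule can be written down in a few lines. The sketch below follows the worked example in the text (11 of 20 responses in the upper mode); the function names and the convention of passing the model's probability mass on the upper mode as a single number are assumptions made for illustration.

```python
def dominant_mode(n_upper, n_lower):
    """+1 or -1 for the mode holding more than half of the responses, else 0."""
    if n_upper > n_lower:
        return 1
    if n_lower > n_upper:
        return -1
    return 0


def unimodal_subset(responses, model_mass_upper):
    """Responses retained for the unimodal dataset for one stimulus.

    model_mass_upper: probability mass the model assigns to the upper mode.
    """
    upper = [r for r in responses if r > 0]
    lower = [r for r in responses if r <= 0]
    data_mode = dominant_mode(len(upper), len(lower))
    if model_mass_upper > 0.5:
        model_mode = 1
    elif model_mass_upper < 0.5:
        model_mode = -1
    else:
        model_mode = 0
    if data_mode == 0 or data_mode != model_mode:
        return []  # no dominant mode, or model and data disagree: drop all responses
    return upper if data_mode == 1 else lower


if __name__ == "__main__":
    responses = [4.0] * 11 + [-4.0] * 9
    print(len(unimodal_subset(responses, model_mass_upper=0.8)))  # responses kept
    print(len(unimodal_subset(responses, model_mass_upper=0.2)))  # all disregarded
```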

Participants

Seven naive participants (3 females, 4 males, ages 21-27) participated in the main experiment and four naive participants (all males, ages 21-30) took part in the second experiment. The experiments were programmed using custom software implemented in MATLAB. Stimuli were presented on a 1920x1080 (36 pixels/cm) monitor with a refresh rate of 120 Hz. Participants viewed the display binocularly. Each trial comprised a fixation dot presented for 1 s followed immediately by presentation of the stimulus (with 5 arcmin point diameter). Participants moved a red point up or down using the up and down arrow keys to indicate the vertical position of the parabola at the given horizontal location. See S1 Text for more details.

Ethics statement

All participants gave informed consent in accordance with protocol 384/2011 “Commission cantonale d’éthique de la recherche sur l’être humain”. Participants provided written consent prior to the experiment.

Derivations and additional details.

(PDF)

The file contains one folder for each participant N ∈ {1, …11}.

Within each folder, the name of the text file indicates the parameters. Participants 1…7 completed four conditions of generative noise σ_g ∈ {0, 0.03, 0.1, 0.4} and the variance parameter was σ_π = 0.1. For example, the file subj1_sig_g = 0.1.txt contains all trials of the first participant with generative noise σ_g = 0.1. Participants 8…11 completed only one condition: σ_g = 0.1 and σ_π = 0.5. Since the variance parameter is different from its default value, we indicate it explicitly in the file name, e.g. subj8_sig_pi = 0.5_sig_g = 0.1.txt. Each data file contains 11 columns. The first eight columns describe the x and y coordinates of the stimulus points. The last three columns contain (in that order) the stimulus index j ∈ {1, …20}, the generating quadratic parameter w and the vertical location of the observed response. (ZIP)

9 Sep 2019

Dear Dr Jegminat,

Thank you very much for submitting your manuscript 'Bayesian regression explains how human participants handle parameter uncertainty' for review by PLOS Computational Biology. Your manuscript has been evaluated by three independent peer reviewers and myself. As you will see in the comments below, the reviewers believed that the study addresses a relevant and interesting research question. However, they also raised substantial concerns about the manuscript as it currently stands. For example, Reviewer #1 is concerned about the framing of the paper and also wonders, just as Reviewer #2, whether the work is as novel as presented. Also, both Reviewers #2 and #3 are concerned that the use of the bimodal distribution over w may have favored some models over others, which may have implications for the generalizability of the results and conclusions. Please see the reviews below for more detailed comments.
While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time. Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Ronald van den Berg, Associate Editor, PLOS Computational Biology

Samuel Gershman, Deputy Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:
Editor: I have read the paper with great interest and have a few points in addition to the concerns raised by the reviewers. My main comment concerns the statement at the very end of Results that “The two additional variants of B-R perform slightly worse than the original B-R in terms of the log-likelihood”. Perhaps I misunderstood, but isn’t this in direct contradiction with the conclusion that the data support “..the idea that the generative noise is indeed jointly estimated with the quadratic parameters”? The BR_f model may provide a better fit to the variance curves (a summary statistic), but this apparently comes at the cost of not explaining other parts of the data as well as BR. I agree with Reviewer #3 that the presentation of the model comparison results can be improved. I suggest either going with the solution proposed by this reviewer (compute ratios w.r.t. a baseline model) or using relative values (subtract, for each subject, the log likelihood of the overall best model from the log likelihood of all models, such that the best model has by definition a Δ log likelihood of 0). Also, I would suggest presenting these results using a bar graph rather than lines (Fig. 3) and adding error bars that indicate variability across subjects. Since there are only 7 subjects, you may even want to consider reporting the individual model comparison results (possibly in the Supplement if it would make the main figures too cluttered). On p. 7 it is mentioned that “The data is pooled across participants”. I assume that this is only a presentation issue and that the models were still fitted to individual data sets? Please clarify. Also, *if* the models were fitted to pooled data, please justify this choice as that would be a bit strange. Finally, in the introduction it is mentioned that “Most […] experimental studies can be cast into the problem of estimating a hidden quantity from sensory input”.
It may be worth noting that we have found evidence for Bayesian inference in various perceptual tasks that go beyond simple estimation, including visual search (e.g. Ma et al. 2011, Nature Neuroscience; Stengård & Van den Berg 2019), same/different discrimination (Van den Berg et al 2012, PNAS), and change detection (Keshvari et al 2012, PLoS One).
Minor: I think that the abbreviation “DNNs” on p. 2 was never introduced. Sigma_m is referred to in the supplement as “motor noise”, but in the main text it is explained that it more broadly captures “internal sources of noise arising during neural processing, decision making and the execution of motor action”. I prefer the broader interpretation of sigma_m and believe it would be useful to add a reference to Drugowitsch 2016 (Neuron) and Beck, Ma, et al 2012 (Neuron) here.
Reviewer #1: The authors ask if human participants feature behavior that is better explained by Bayesian regression rather than simpler alternatives/heuristics. The specific feature of Bayesian regression is that predictions need to take into account the uncertainty in the inferred regression coefficients. The authors test this by providing subjects with noisy point-wise observations of a quadratic function, and ask for the participants' best guess when extrapolating the (unobserved) function to more extreme values. They demonstrate that the participants' extrapolations are best explained by a Bayesian regression model rather than alternative models, such as maximum likelihood or maximum a-posteriori. This is, to my knowledge, the first demonstration that humans are able to take account of their uncertainty in model parameters when making predictions. Thus, it is novel, original, and should be of interest to a wide audience.
Unfortunately, the current manuscript leaves multiple, potentially confounding questions unanswered: First, the authors seem to implicitly assume that predictions are made by drawing samples from the posterior predictive density (after which some motor noise is added). This is just one possible way of turning posteriors into decisions. Bayesian decision theory posits that decisions ought to be made by picking an estimate that minimizes a particular expected loss under the posterior. The squared loss, for example, would result in deciding according to the posterior mean (see Bishop (2016) and Kording & Wolpert (2004)). There exists no loss function, however, that makes posterior sampling the optimal strategy. Therefore, Bayesian decision theory does not currently justify the performed analysis. One could suggest that the participants do not act according to Bayesian decision theory, and instead sample from this posterior. However, this does not conform to recent work (Drugowitsch et al. (2016)) showing that behavioral variability seems to arise from noisy inference rather than noisy decisions. Thus, I urge the authors to elaborate in more detail how they believe posteriors are turned into choices, and potentially investigate if noise in the inference itself could result in the behavior that they aim to describe. Second, the model comparison results seem to be driven by the bimodality of participants' responses (i.e., convex vs. concave function), which appears only supported by Bayesian regression. That by itself would be an interesting result, but needs further investigation. If you would restrict responses to the single, dominant mode (e.g., ignoring all trials in which the other mode was chosen), would the Bayesian regression model still win the model comparison? Would a simpler model that only considers two possible functions (i.e., convex or concave?) + some motor noise perform worse than the Bayesian model? 
Third, model comparison is performed on the mean log-likelihoods. Even though Fig. 3 might suggest otherwise, this doesn't per se exclude the possibility that the model comparison is driven by a few participants. I would suggest two modifications. First, the models should be compared in terms of log-likelihood ratios, using the preferred model as baseline. Then, all measures are relative to the model fit of this preferred model, and negative values indicate a worse model fit of alternative models. Averaging across those would correspond to a more informative within-subject comparison. Second, to exclude the possibility of few participants driving the model comparison I would suggest additionally performing a Bayesian random effects model comparison (see Stephan, Penny et al. (2009)). More generally, the presentation of the results could be improved. Some statements are ambiguous or simply wrong (see below), and I would encourage the authors to carefully revise the manuscript to improve its clarity. The SI derivations are lengthy. They rely on partially known results and can be shortened (see below).
Detailed comments:
Abstract: "The quadratic parameter was drawn from a prior distribution, unknown to the observer" - what was unknown to the observer? The quadratic parameter? Or that this parameter was drawn from a prior distribution?
p3, A novel paradigm to test regression: I found this section generally confusing, and I encourage the authors to revise the presentation order to make stimuli and experiment clearer. I understand that you have 20 4-dot stimuli with the same underlying wj. For a fixed noise level, did they also have the exact same noise instantiation - i.e., was the stimulus the same on the screen? Were they the same across participants? Furthermore, you repeat each stimulus 20 times, which would imply 400 stimuli overall. How were these 400 stimuli distributed across the different blocks of different noise level? What about the 10 practice trials?
p4, Fig 2 caption: "Contours indicate the equiprobable responses predicted by ..." - not precise as responses only happen at x=2. Is this the predicted response distribution if subjects were asked to respond at different x's? For (D), are those the response distributions corresponding to the dot pattern in (A)?
p5, "To complement the log-likelihood analysis [...], we used the variance" - which variance?
p6, Fig. 4 caption: "The errors of the other models are too small to show" - errors? Or variance? SEM? SD?
p9, "In each trial, the participants carried out [...]" - you only know that their behavior was compatible with the proposed computations, but not the exact computations participants performed to feature this behavior.
p9, "in Bayesian regression generalizes Eq (8" - ")" missing
p9, "as in ordinary B-R (Eq (8)" - ")" missing
p10, "(the log of Eq (10)" - ")" missing
Fig 4 B/C/D: please consider horizontal jitter of the grey dots instead of blurring them.
SI, general: many of the derivations are standard in the Bayesian literature, and their details can be skipped or they can be simplified (e.g., Bishop (2016), Appendix B, "Gaussian")
SI, p2: "see ??" - broken ref
SI, p3, "Fig 2(A) / Fig 2(B)" - should this refer to Fig. S2?
SI, p4, "[...] because our mixed Gaussian prior is the conjugate prior of the Gaussian likelihood" - "is A conjugate prior", as there are multiple.
SI, p6, "[...], we the posterior maintains" - the "we" should be removed
SI, p8, Derivation of full Bayesian regression: wouldn't a normal inverse-gamma mixture provide a conjugate prior for this case? This would lead to analytical posteriors.
Reviewer #2: The authors asked how humans handle parameter uncertainty in a regression task. Observers had to extrapolate from 4 points generated from a noisy quadratic function to predict the y-coordinate of a 5th point whose x-coordinate was given.
They compared Bayesian regression to prior regression, ML regression, and MAP regression, and found the best performance for Bayesian regression models fit to the human data. The question is important, as function estimation is a basic and general task. The paper was very clearly written and technically proficient. The evaluation of the models by two separate methods (log likelihood and response variance) was a strength. All methods details were provided. However, I wondered whether the claim that humans use Bayesian regression depended critically on the specifics of the task and noise assumptions. I also wondered how this paper goes beyond previous arguments that humans use Bayesian regression.
Major comments:
1. Novelty: Lucas et al. 2015, Psychonomic Bulletin & Review, argue that previous models of how humans do function estimation can be recast as Bayesian regression. They also compare performance of various models and show that Gaussian process models, which are closely related to Bayesian regression models, fit human data well. How does the current work go beyond these previous arguments for Bayesian regression?
2. It seems that the MAP model failed primarily because it did not capture the bimodal distribution of observers’ responses. In this task, the parameter of interest, the coefficient on the quadratic term of the function, was drawn from a bimodal prior distribution, so that the parabola sometimes opened upward and sometimes opened downward. For the same 4-dot stimulus, human observers sometimes extrapolated for a downward parabola and sometimes for an upward parabola, whereas the MAP model always extrapolated in just one direction. But if I understood correctly, all models knew the generative noise, unlike the human observers. How much do the unimodal predictions of the MAP (and ML) models rely on this assumption? That is, if the noise were estimated trial by trial, would this stochasticity lead to bimodal predictions even for MAP and ML models? This concern seems especially relevant given that the authors show better performance for versions of the Bayesian model that estimate the generative noise trial by trial, and it seems that this possibility was not tested for the non-Bayesian models.
3. Relatedly, is there any aspect of the data that argues for Bayesian regression that is not dependent on the bimodal pattern of extrapolation reports? It seems that both the log likelihood results and the variance results were driven by the bimodal pattern, which raises the question of whether the result would generalize to a case where the prior is unimodal. This seems important, as unimodal priors on parameters are probably more typical in natural tasks.
4. The experiment consisted of 20 repetitions of each of 20 4-dot stimuli. Feedback was presented after every trial, which created the possibility that the observers could learn a direct mapping between a certain dot configuration and an extrapolated point over the course of the experiment. I wondered whether such learning could create the appearance of Bayesian performance, and more generally whether there was any evidence for learning over the course of the experiment.
5. The B-R3 model seemed a bit ad-hoc. Is there a theoretical reason for estimating the noise based on two sets of 3 points rather than all 4 points (the B-R4 model reported in the supplement, which did not perform as well)?
Minor comments:
1. The significance statement is quite vague. It seems like it could apply to almost any Bayesian modeling paper.
2. To motivate the study, it may be worth pointing out in the intro that while “many aspects of cognition are indeed well described by Bayesian statistics” (p. 2), many are not (Rahnev & Denison 2018).
3. “This feedback enabled the participants to learn both the prior and the generative model.” (p. 3) What is the evidence that these were learned successfully?
4. It may be helpful to explain why the MAP model does not predict a bimodal distribution even though it uses the prior.
5. For the variance analysis, why was the variance within a given stimulus (observers) compared to the variance generated across a large number of stimuli (models), rather than comparing variance from the same source?
6. For B-R3, how was the ML estimate of the generative noise obtained? (p. 9) Does the procedure assume that w is known?
7. Some figure references were missing or in a different format (e.g. “Bottom right” vs. “D”).
Reviewer #3: This is a very neat paper which made me wish I’d thought of doing that experiment. Essentially, the manuscript shows some reasonably convincing evidence that people can complete a simple regression problem in a Bayesian manner.
Major points
I felt that the manuscript treated the inferential tasks of model comparison, regression, prediction, parameter estimation, and posterior prediction as too sharply divided. While these _concepts_ are distinct, once we start thinking about people completing real behavioural tasks, then I think it becomes a bit less clear to claim the completion of a task only involves one inferential concept and not others. I think the authors could make their claims more robust by adding more reasoning why this task should be considered as regression as opposed to prediction. Someone could make a reasonable case that the participants are doing posterior prediction here, so the case for regression could be firmed up somewhat. Someone may argue that regression is not a 'core' inferential task in that there is just model learning, parameter estimation, and posterior prediction. If so, then the paper demonstrates that people can learn a parabolic model and do parameter estimation and/or posterior prediction. I think the authors could perhaps update the manuscript to make their claims more robust to someone with these views.
Minor points
The model comparison is nice, but I think it would also be useful very early on to include a short intuitive statement about what kind of results would really support or disconfirm that people are doing this regression task in a Bayesian manner.
Clarify if participants were naive to the goals of the study.
This is touched upon in the manuscript, but it might be worth elaborating or speculating a little more about what might happen outside of a toy scenario where the participants know the data generating model. Presumably people would have to simultaneously conduct parameter estimation (of multiple models) and do inference over models.
Unless I've misunderstood, a core part of this manuscript is the bimodality of the parameter distribution. I felt that this was not given much attention in the manuscript so perhaps there is some scope for clarification.
Along the lines of the previous point, I was a bit unsure how to interpret the density plot for B-R in figure 2-A. The methods spell out that the mixture of up and downwards parabolas was 50%. This is not the case in this density plot, so presumably this plot is the best fit to a given participant? Perhaps this point of confusion on my part can help the authors with clarifications in the main text and/or figure legend.
Possibly a pedantic point, but Panels B-D in Figure D could feasibly be rotated so the y coordinate is on the y axis, to make it a little more intuitive. But this is just a suggestion for consideration.
**********
Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes
**********
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No Reviewer #2: No Reviewer #3: No
2 Dec 2019
Submitted filename: response_to_reviewers.docx
17 Jan 2020
Dear Mr. Jegminat, Thank you very much for submitting your manuscript "Bayesian regression explains how human participants handle parameter uncertainty" for consideration at PLOS Computational Biology. Your manuscript was reviewed by myself and the three original independent reviewers. As you will see in the reviews attached below, no major concerns were raised, but there were still a few minor concerns. Reviewer #1 believes that the writing is too strong at several places and somewhat imprecise at other places. For example, in the abstract you write "Our results show that humans use Bayesian regression.". I agree with the reviewer that it would be good to soften statements like this one, because this is much stronger than what is justified by the results. Moreover, Reviewer #2 has a suggestion for a minor addition or change in the analyses. Since the remaining issues are minor, I will probably move to a decision without sending the manuscript back to external reviewers. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Ronald van den Berg Associate Editor PLOS Computational Biology Samuel Gershman Deputy Editor PLOS Computational Biology *********************** A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: I appreciate the authors' changes to the manuscript. I think that the model comparison, as well as the analysis of the response variance is much improved. I don't have any major comments on the details of the analysis and its description. Two issues nonetheless remain, as I will describe in turn. The first is that not all statements in the manuscript are immediately supported by the performed analyses. 
This is a thread that runs through the whole manuscript, such that pointing out every single instance would be too tedious. Some of the most glaring instances are:
* Probability matching/sampling: the manuscript states that human participants perform sampling/probability matching, but the authors haven't tested alternative models of noisy inference. That the latter would require fitting parameters might be a reason for the authors not to test them, but it is not a reason for ruling out the possibility that the observed variability is due to noisy inference. Thus, all statements that humans choose by probability matching/sampling need to be toned down/relativized. This starts with the abstract ("We further add evidence that humans use probability matching rather than Bayesian decision theory..." - a noisy inference model might explain the data equally well), continues through the Results (pages 6/7), and extends to the Discussion. The discussion at least mentions the possibility of noisy inference as an open question, but the rest of the manuscript should word its conclusions more carefully - in particular in the light of increasing evidence (e.g., from Wyart lab) that behavioral variability that has been previously ascribed to posterior sampling can be equally well (or better) explained by noisy inference. Thus, strongly stating that participants sampled might, in fact, be wrong.
* Abstract: "Our results show that humans use Bayesian regression" - they suggest, but don't show
* p8: "An explanation for this discrepancy is that participants estimated the noise on a trial-by-trial basis" - this sounds as if you _knew_ that participants performed this estimation. Wouldn't it be better to call this "A potential explanation"?
The second is an overall lack of precision in parts of the manuscript. There are multiple ambiguous/imprecise statements, and leaving it up to the reader to deal with them might lead to misunderstandings.
Thus, I urge the authors to resolve any potential ambiguities in their writing. Examples include:
* p3: "The proposal is to apply Bayesian regression to DDNs." - who proposes this? Are you? Do others?
* p3: "The presentation order was randomized" - this is imprecise. It was randomized within each noise level, but blocked across noise levels.
* p3: "The advantage of a response distribution over single responses" - this is contradictory, as a single response doesn't give a distribution. Should this be "The advantage of observing several responses for the same exact stimulus is that we can compare the set of responses to the response distribution predicted by the different models"?
* p5: "[...] motor noise is the only source of uncertainty" - do you mean that it is the only source of response variability? Uncertainty is never measured.
* p7: "we use the model likelihood p(M|Rk)" - p(M|Rk) is the model posterior, not the model likelihood, which would be p(Rk|M).
* p7: "shows that the majority of the population" - do you mean the majority of the participants' responses?
* p7: "the transition from unimodality to bimodality" - of what? The response distribution? The model posterior?
* p7: "we use the response variance" - which variance? The variance of responses in trials in which the same stimulus was shown? Please be precise.
* p12, "regression and inference differ" - too general use of "regression" and "inference" (in particular, as regression is a form of inference), please relativize, e.g. "Algorithmically, however, linear regression and Gaussian Process inference differ" - also for later instances of "regression" and "inference" in this paragraph.
* p15, "if they have the same dominant mode, i.e. either up or down" - too colloquial and imprecise.
Additional details:
p3: "In our main experiment, we set the standard deviation of each prior model to sigma_pi = 0.1" - this is unclear, as sigma_pi hasn't been defined.
Overall, this section would benefit from adding the stimulus equations (4) and (5). This would make it significantly easier to understand the generation of the stimulus. "We fixed the values of x positions ..." - do you want to say that those are the locations where the dots are shown?
p4, footnote 1 - shouldn't that footnote appear earlier, when you introduce the MAP-R model?
p5, "Fig 2(E) illustrates that participant's responses cover a wide range of values" - would be useful to state upfront that Fig 2(E) relates to increased sig_pi.
p6, Fig. 3 "Negative value favour sampling" - should be "valueS"
p8, "because because" - repetition
p11, "It is an interesting question whether subjects solve regression tasks with model uncertainty by taking advantage ..." - resolving model uncertainty has previously been studied under the name of "causal inference". See, for example, Koerding et al. (2007), PLoS ONE.
p12, "through constant feedback" - "constant feedback" makes it sound as if the feedback was the same across trials. "continuous feedback", or "feedback in every trial" might be better.
Supplement page 8: "... we disregarded that model" - "disregard" sounds as if you don't consider this model to be valid. Please rephrase to state that you don't consider it in the model comparison, but that it is nonetheless a valid model.
Fig. 2A: please change colors to make sure that the 4 dots are visible.
Fig. 4: green line is not visible
Reviewer #2: The authors have revised the manuscript thoroughly and addressed all of my comments. The addition of the unimodal analysis has strengthened the paper. Overall, this is a clear and thoughtful manuscript addressing an important question. I have one concern about the new learning analyses: I am not sure it is appropriate to include all noise conditions in a single regression (Figure S3).
Different amounts of learning could be expected for the different noise conditions, because there is more ambiguity for high noise, and thus observers could benefit more from learning those specific stimuli across trials. This does seem to be the case in the data: sigma_g = 0.1 and 0.4 data points fall slightly below the unity line. I agree with the authors that the magnitude of learning is small and is unlikely to be a serious methodological concern (also considering Figure S4). Still, I would recommend performing separate statistical tests for the different noise conditions and, depending on the outcome, perhaps modifying the claim in the manuscript of no learning.
Reviewer #3: Major point: The authors adequately respond to this point, with updated text on page 12. Minor points: The authors have given attentive responses to all my minor points. I have read the revised manuscript - I think that this version is significantly improved compared to the original manuscript. Overall, I think this is a rigorous study which the readers of PLoS Computational Biology will be interested in reading.
**********
Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: Yes Reviewer #2: None Reviewer #3: None
**********
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No Reviewer #2: No Reviewer #3: Yes: Dr Benjamin T. Vincent
Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future.
7 Apr 2020
Submitted filename: Response to Reviewers Revision2.pdf
19 Apr 2020
Dear Mr. Jegminat, My apologies for the slight delay in handling the revision of your manuscript 'Bayesian regression explains how human participants handle parameter uncertainty'. Since the revision was rather minor, I decided to not send it back to the reviewers, as I indicated in my previous decision letter. I believe that you adequately addressed all remaining issues that were raised by the reviewers and am pleased to inform you that your manuscript has been provisionally accepted for publication in PLOS Computational Biology.
Please note that there is a small typo in one of the changes you made ("to what extend" should be "to what extent"). Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Ronald van den Berg
Associate Editor
PLOS Computational Biology

Samuel Gershman
Deputy Editor
PLOS Computational Biology

***********************************************************

11 May 2020

PCOMPBIOL-D-19-01115R2

Bayesian regression explains how human participants handle parameter uncertainty

Dear Dr Jegminat,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production.
Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
