Choosing between AR(1) and VAR(1) models in typical psychological applications.

Fabian Dablander¹, Oisín Ryan², Jonas M B Haslbeck¹.

Abstract

Time series of individual subjects have become a common data type in psychological research. The Vector Autoregressive (VAR) model, which predicts each variable by all variables including itself at previous time points, has become a popular modeling choice for these data. However, the number of observations in typical psychological applications is often small, which puts the reliability of VAR coefficients into question. In such situations it is possible that the simpler AR model, which only predicts each variable by itself at previous time points, is more appropriate. Bulteel et al. (2018) used empirical data to investigate in which situations the AR or VAR models are more appropriate and suggest a rule to choose between the two models in practice. We provide an extended analysis of these issues using a simulation study. This allows us to (1) directly investigate the relative performance of AR and VAR models in typical psychological applications, (2) show how the relative performance depends both on n and characteristics of the true model, (3) quantify the uncertainty in selecting between the two models, and (4) assess the relative performance of different model selection strategies. We thereby provide a more complete picture for applied researchers about when the VAR model is appropriate in typical psychological applications, and how to select between AR and VAR models in practice.

Year: 2020 PMID: 33119716 PMCID: PMC7595444 DOI: 10.1371/journal.pone.0240730

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Time series of individual subjects have become a common data type in psychological research since collecting them has become feasible due to the ubiquity of mobile devices. First-order Vector Autoregressive (VAR) models, which predict each variable by all variables including itself at the previous time point, are a natural starting point for the analysis of dependencies across time in such data and are already used extensively in applied research [1-5]. A key question that arises when using these models is: how reliable are the estimates of the single-subject VAR model, given the typically short time series in psychological research (i.e., n ∈ [30, 200])? To be more precise, we would like to know how large the estimation error is in this setting. Estimation error is defined as the distance between the estimated parameters and the parameters in the true model, assuming that the true model has the same parametric form as the estimated model. If estimation error is large, it might be possible to obtain a smaller estimation error by choosing a simpler model, even though it is less plausible than the more complex model [6]. A possible simpler model in this setting is the first-order Autoregressive (AR) model, in which each variable is predicted only by itself at the previous time point. While the AR model introduces a strong bias by setting all interactions between variables to zero, it can have a lower estimation error when the number of available observations is small. When analyzing time series in psychological research it is therefore important to know (a) in which settings the AR or the VAR model has a lower estimation error, and (b) how to choose between the two models in practice. Bulteel et al. [7] identified these important and timely questions, and offered answers to both. 
They investigated question (a) regarding the relative performance of AR and VAR models by selecting three empirical time series data sets, each consisting of a number of individual time series with the same data structure. For each of these data sets, they approximate the out-of-sample prediction error with out-of-bag cross-validation error for both the AR and the VAR model and their mixed model versions. The authors make a valuable contribution by assessing which of the many cross-validation schemes available for time series approximates prediction error best in this context. Using the approximated prediction error obtained via cross-validation, they find that the prediction error for AR is smaller than for VAR, and that the prediction error of mixed AR and mixed VAR is similar. In a last step, they link prediction and estimation error by stating that “[…] the number of observations T [here n] that is needed for the VAR to become better than the AR is the same for the prediction MSE [mean squared error] as well as for the parameter accuracy [estimation error]” [7, p. 10]. Although the latter statement implies that the estimation error of mixed AR and mixed VAR models are similar, Bulteel et al. [7] conclude that “[…] it is not meaningful to analyze the presented typical applications with a VAR model” (p. 14) when discussing both mixed effects (i.e., multilevel models with random effects) and single-subject models. Using their statement about the link between prediction error and estimation error together with a preference towards parsimony, Bulteel et al. [7] also offer an answer to question (b) on how to choose between the AR and VAR models in practice: they suggest using the “1 Standard Error Rule”, according to which one should select the AR model if its prediction error is not more than one standard error above the prediction error of the VAR model, and select the model with lowest prediction error otherwise [8, p. 244]. 
In this paper, we provide an extended analysis of the problems studied by Bulteel et al. [7]. First, regarding question (a) on the relative performance of the AR and VAR models: when the goal is to determine the estimation error in a given setting, one can obtain it directly with a simulation study. A simulation study allows for a more extensive analysis of this problem for three reasons. First, we do not need to make any claim about the relation between prediction error and estimation error, which—as we will show—turns out to be non-trivial. Second, in a simulation study we can average over sampling variance which allows us to compute the expected value of estimation (and prediction) error. While the approach of Bulteel et al. [7] in using three empirical datasets has the benefit of ensuring the models considered mirror data from psychological applications, these empirical datasets are naturally subject to sampling variation. And third, a simulation study allows us to map out the space of plausible VAR models and base our conclusions on this large set of VAR models instead of the VAR models estimated from the three data sets used by Bulteel et al. [7]. We perform such a simulation study, which allows us to give a direct answer to the question of how large the estimation errors of AR and VAR models are in typical psychological applications. Regarding question (b) on choosing between AR and VAR models in practice, Bulteel et al. [7] base their “1 Standard Error Rule” (1SER) on the idea that the n at which the estimation errors of the AR and VAR models are equal is (approximately) the same n at which the prediction errors of those models are equal, combined with a preference towards the more parsimonious model. While the 1SER is used as a heuristic in the statistical learning literature [8], it is not clear whether this heuristic would perform better in the present problem than simply selecting the model with the lowest prediction error. 
We show that when choosing between AR and VAR models, the n at which the prediction errors become equal is not necessarily the same as the n at which estimation errors become equal: in fact, there is a substantial degree of variation in how the prediction and estimation errors of both models cross. Using the relationship between estimation and prediction error we are able to show via simulation when the 1SER is expected to perform better than selecting the model with lowest prediction error. This extended analysis of the problem studied by Bulteel et al. [7] provides a more complete picture for applied researchers about when the VAR model is appropriate in typical psychological applications, and how to select between AR and VAR models in practice.

When does VAR outperform AR?

In this section we report a simulation study that directly answers the question of how large the estimation errors of AR and VAR models are in typical psychological applications. This gives the reader an idea of how many observations ne one needs, on average, for the VAR model to outperform the AR model. In addition, we decompose the variance around those averages into sampling variation and variation due to differences in the VAR parameter matrix Φ. Finally, explaining the latter type of variation allows us to obtain ne conditional on characteristics of Φ. The analysis code for the simulation study is available from https://github.com/jmbh/ARVAR.

Simulation setup

Since the AR model is nested under the more complex VAR model, we focus solely on the VAR as the true data-generating model. To obtain realistic VAR models, we use the following approach: first, we fit a mixed VAR model to the “MindMaastricht” data [9], which consists of 52 individual time series with on average n = 41 measurements on p = 6 variables, and is the only publicly available data set used by Bulteel et al. [7]. In a second step, we sample stationary VAR models with a diagonal error covariance matrix from this mixed model. We expect that the estimation (and prediction) errors of the AR and VAR model depend not only on the number of observations n, but also on the characteristics of the underlying p × p VAR model matrix Φ. We therefore stratify the sampling process from the mixed model by two characteristics of Φ. This procedure allows us to obtain a better picture of how the performance of AR and VAR may differ depending on the characteristics of the data-generating model. The first characteristic is based on the size of the auto-regressive effects, that is, the absolute values of the diagonal elements of the lagged parameter matrix ($\Phi_{ii}$), which encode the relationship between a variable and itself at the next time point. We summarize the information contained in these diagonal elements by taking the mean of their absolute values, denoted D and given as

$$D = \frac{1}{p} \sum_{i=1}^{p} |\Phi_{ii}|.$$

Note here that taking the sum of auto-regressive parameters is equivalent to taking the sum of the eigenvalues of Φ, denoted λ. To ensure stationarity, only Φ matrices whose eigenvalues all satisfy |λ| < 1 are included in our analysis [10]. The second characteristic is based on the size of the cross-lagged parameters ($\Phi_{ij}$, i ≠ j), encoding the relationships between different processes.
We again summarize this information by taking the mean absolute value of these parameters, denoted O and given as

$$O = \frac{1}{p(p-1)} \sum_{i \neq j} |\Phi_{ij}|.$$

We expect that true VAR models with a high D value and small O value (i.e., large auto-regressive effects and small cross-lagged effects) result in a low estimation error for AR models, since these VAR models are very similar to an AR model. In contrast, if O is high, we expect that the estimation error of the AR model is large, because it sets the large cross-lagged effects in the true VAR model to zero. Ideally, we would stratify by sampling a fully crossed grid of D and O values. However, this is not possible since some combinations have an extremely small probability: for example, if a matrix has auto-regressive parameters close to one, it is unlikely to describe a stationary process if it also contains large positive cross-lagged parameters. We therefore adopt the following approach: we divide the D-O space into a grid by dividing each dimension into 15 equally spaced intervals (see S1 Fig). We then include only those cells in the design in which at least one VAR model has been sampled. This procedure returned 74 non-empty cells. We then sample those 74 cells until each of them contains 100 VAR models. We keep the cell size constant to render the results comparable across cells (see Supporting Information for a detailed description of this procedure). This procedure returns a set of 74 × 100 = 7400 VAR models that includes essentially any stationary VAR model with p = 6 variables, and allows us to describe each model in the dimensions O and D. For each of these VAR models, we generate 100 independent time series, each with n = 500 observations and with a burn-in period of nburn = 100. We then estimate both the AR and the VAR model on the first n ∈ {8, 9, …, 499, 500} observations of those time series. This yields a simulation study with 7400 × 493 (parameters × sample size) conditions, and for each of those conditions we have 100 replications.
For each model, and each n, we compute the expected estimation error for both the AR model (EEAR) and the VAR model (EEVAR) by averaging over the 100 replications. This means that while EEAR and EEVAR take different values depending on n and the underlying model, the sampling variation has been averaged out.
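The two summary characteristics and the stationarity check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `summarize_phi` and the 3 × 3 example matrix are hypothetical, and D and O are computed as the mean absolute diagonal and mean absolute off-diagonal entries, as the text describes.

```python
import numpy as np

def summarize_phi(phi):
    """Return (D, O, is_stationary) for a square VAR(1) coefficient matrix."""
    phi = np.asarray(phi, dtype=float)
    p = phi.shape[0]
    abs_phi = np.abs(phi)
    diag_sum = np.trace(abs_phi)
    d = diag_sum / p                                # mean |auto-regressive effect|
    o = (abs_phi.sum() - diag_sum) / (p * (p - 1))  # mean |cross-lagged effect|
    # A VAR(1) process is stationary when all eigenvalues lie inside the unit circle.
    is_stationary = bool(np.max(np.abs(np.linalg.eigvals(phi))) < 1.0)
    return d, o, is_stationary

# Hypothetical example matrix (the study itself uses p = 6).
phi_example = np.array([[0.4, 0.1, 0.0],
                        [0.0, 0.3, -0.2],
                        [0.1, 0.0, 0.5]])
d, o, ok = summarize_phi(phi_example)
print(d, o, ok)
```

A matrix that fails the eigenvalue check would simply be discarded before being assigned to a cell of the D-O grid.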

Simulation results

The simulation described above allows us to investigate the relative performance of AR and VAR models across different samples, sample sizes, and data-generating models. We define the estimation error as the mean squared error of the estimated parameters relative to the true parameters, and quantify the relative performance with two measures: the difference between the estimation errors of the AR and VAR models at a particular sample size, EEDiff = EEAR − EEVAR; and ne, the sample size at which the VAR model outperforms the AR model (EEAR > EEVAR). In the following we examine the mean and variance of EEDiff and subsequently study ne and its dependence on the characteristics of the true VAR model. Fig 1(a) shows the mean and standard deviation of EEDiff as a function of n, across all 7400 VAR models and 100 replications. The dashed line at EEDiff = 0 indicates the point at which the estimation errors of the two models are equal. Below that line, the AR model performs better, that is, its parameter estimates are closer to the parameters of the true VAR model than the parameter estimates of the VAR model. We see that, across all models, we obtain a median ne = 89. Note that, out of all 740,000 simulated data sets, in only 23 cases had the estimation error curves not yet crossed at n = 500. Notably, the variance around the difference in estimation error is substantial for all n. In the following we decompose this variance into variance due to sampling error and variance due to differences in VAR matrices.
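To make the two quantities concrete, the sketch below simulates one VAR(1) series, fits both models by ordinary least squares, and computes EEDiff at a few sample sizes. Several details are assumptions rather than the authors' implementation: the helper names, the 2 × 2 example matrix, unit-variance Gaussian errors, models fitted without an intercept, and the AR estimation error being computed over the full matrix with off-diagonal entries fixed at zero.

```python
import numpy as np

def simulate_var1(phi, n, burn_in=100, rng=None):
    """Simulate n observations from x_t = phi @ x_{t-1} + e_t with e_t ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    p = phi.shape[0]
    x = np.zeros(p)
    out = np.empty((n, p))
    for t in range(burn_in + n):
        x = phi @ x + rng.standard_normal(p)
        if t >= burn_in:            # discard the burn-in period
            out[t - burn_in] = x
    return out

def fit_var1(series):
    """OLS estimate of the full lag-1 matrix (no intercept; assumes a zero-mean process)."""
    past, future = series[:-1], series[1:]
    coef, *_ = np.linalg.lstsq(past, future, rcond=None)
    return coef.T                   # row i holds the lagged predictors of variable i

def fit_ar1(series):
    """OLS estimate of p separate AR(1) slopes, returned as a diagonal matrix."""
    past, future = series[:-1], series[1:]
    slopes = (past * future).sum(axis=0) / (past ** 2).sum(axis=0)
    return np.diag(slopes)

def estimation_error(phi_hat, phi_true):
    """Mean squared difference between estimated and true parameters."""
    return float(np.mean((phi_hat - phi_true) ** 2))

rng = np.random.default_rng(1)
phi_true = np.array([[0.4, 0.1],
                     [0.0, 0.3]])   # hypothetical data-generating matrix
series = simulate_var1(phi_true, n=500, rng=rng)
for n in (20, 100, 500):
    ee_ar = estimation_error(fit_ar1(series[:n]), phi_true)
    ee_var = estimation_error(fit_var1(series[:n]), phi_true)
    print(n, ee_ar - ee_var)        # EEDiff > 0 means VAR is closer to the truth
```

A single series like this shows only one draw; the expected errors in the text come from averaging the same computation over 100 replications per model.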

Fig 1

Difference in estimation error of AR and VAR models (EEDiff) across n on three different levels of aggregation.

Panel (a) shows EEDiff averaged over replications and models, and the band shows the standard deviation over replications and models; panel (b) shows EEDiff for each model averaged across replications; and panel (c) shows the EEDiff averaged over replications for three specific models, and the bands show the standard deviation across 100 replications (sampling variation).

Panel (b) of Fig 1 displays the mean EEDiff for each of the 7400 VAR models, averaged across 100 replications. We see that the lines differ considerably and that ne substantially depends on the characteristics of the true VAR model. This shows that one cannot expect reliable recommendations with respect to ne that ignore the characteristics of the generating model. To illustrate the extent of the sampling variation of the models, we have chosen three particular VAR models (see coloured lines). Fig 1(c) shows that they exhibit considerable sampling variation. Note that, as the variance in (b) is due to differences in mean performance across VAR models, it does not decrease with n. In contrast, the variance in (c) depends on n as it pertains to the sampling variance of a single VAR model, which decreases with the square root of the number of observations. While the mean EEDiff (shown in Fig 1(a)) gives a clear answer to the question of which n is required for the VAR model to outperform the AR model on average, both types of variation (see Fig 1(b) and 1(c)) show that for any particular VAR model it is difficult to determine which model performs better with the sample sizes typically available in psychological applications. However, we see that the sampling variation across replications is smaller than the variation across VAR models for most n. This means that if one has information about the parameters of the data-generating model, one can make much more precise statements about the sample size necessary for the VAR model to outperform the AR model.
The large degree of variation around EEDiff also highlights the potential pitfalls of generalizing the findings of Bulteel et al. [7] beyond the empirical data sets, which consist of 28, 52, and 95 individual time-series with an average number of 41, 70 and 70 time points, analyzed by the original authors. This is because (i) it is unlikely that their (in total) 175 time series appropriately represent the population of all plausible VAR matrices, (ii) their sample is subject to a substantial amount of sampling variation, and (iii) the absence of systematic variations of n does not allow a comprehensive answer to how relative performance relates to sample sizes. Above we suggested that the relative performance of AR and VAR models (quantified by EEDiff) depends on the characteristics D and O of the true VAR parameter matrix. In Fig 2(a) we show the median (across models in cells) ne at which the estimation error of VAR becomes smaller than the estimation error of AR (i.e., EEDiff > 0), depending on the characteristics D and O. We see that the larger the average off-diagonal elements O, the lower the ne at which VAR outperforms AR. This is what one would expect: when O is small (as indicated by the lowest rows of cells in Fig 2(a)), the true VAR model is actually very close to an AR model. In such a situation, the bias introduced by the AR model by setting the off-diagonal elements to zero leads to a relatively small estimation error. This trade-off between a simple model with high bias but low variance and a more complex model with low bias but high variance is well-known in the statistical literature as the bias-variance trade-off [8]. It therefore takes a considerable number of observations until the variance of the VAR estimates becomes small enough for it to outperform the AR model. When O is large (indicated by the upper rows of cells), the bias of the AR model leads to a comparatively larger estimation error.
Finally, we can also see that the size of the diagonal elements D is not as critical in determining ne as the size of the off-diagonal elements: picking any row of cells in Fig 2(a), we can see that there is only a very small variation across columns, with larger D values appearing to lead to very slight decreases in ne in general. Note that the O characteristic also largely explains the vertical variation of the estimation error curves shown in Fig 1(b): the curves on top (small ne) have high O, while the curves at the bottom (large ne) have low O. Fig 2(b) collapses across these values and illustrates the sampling distribution of ne, taking into account the likelihood of any particular VAR matrix (as specified by the mixed model estimated from the “MindMaastricht” data).

Fig 2

Left: ne, the n at which estimation error becomes lower for the VAR than for the AR model, as a function of D and O. Right: Sampling distribution of ne, the n at which the expected estimation error of the VAR model becomes lower than the expected estimation error of the AR model. The dashed line indicates the median of 89.

In summary, we used a simulation study to investigate the relative performance of AR and VAR models in a much larger space of plausible data-generating VAR models in psychological applications than considered by Bulteel et al. [7]. Next to investigating the average relative performance as a function of n, we also looked into the variation around averages. We showed that there is substantial variation both due to sampling error and differences in VAR matrices, which means that for a particular time series at hand it is difficult to select between AR and VAR with the n available in typical psychological applications. Finally, we found that the size of the off-diagonal elements influences the relative performance of the VAR model more strongly than the size of the diagonal elements.

Choosing between VAR and AR based on prediction error

In the previous section, we directly investigated the estimation errors of the AR and the VAR model in typical psychological applications and showed that the n at which VAR becomes better than AR depends substantially on the characteristics of the true model. In practice, the true model is unknown, so we can neither look up the n at which VAR outperforms AR in the above simulation study, nor can we compute the estimation error on the data at hand. Thus, to select between these models in practice, we may choose to use the prediction error which we can approximate using the data at hand, for instance by using a cross-validation scheme as suggested by Bulteel et al. [7]. However, since we are interested in estimation error, we require a link between prediction error and estimation error. In the remainder of this section we investigate this link and discuss the implications of this link for the model selection strategy suggested by Bulteel et al. [7], who use the “1 Standard Error Rule” (1SER) to select the model with lowest estimation error. Finally, we use our simulation study from above to directly compare the performance of the 1SER with model selection based only on the minimum prediction error.

The relation between prediction error and estimation error

Bulteel et al. [7] suggest that the link between prediction error and estimation error is relatively straightforward: “[…] the number of observations T [here n] that is needed for the VAR to become better than the AR is the same for the prediction MSE [mean squared error] as well as for the parameter accuracy [estimation error]” [7, p. 10]. More formally, this claim states that if ne is the number of observations at which the estimation errors of the AR and VAR model are equal, and if np is the number of observations at which the prediction errors of the AR and VAR model are equal, and ngap = ne − np, then ngap = 0. Bulteel et al. [7] do not specify the exact conditions under which this statement should hold, and elsewhere in the text suggest that this should be considered an approximate rather than an exact relationship. Even if this relationship is only approximate, it is still of interest to study in which settings ngap > 0 or ngap < 0, as this bears on model selection. We therefore focus our investigation on quantifying ngap and investigating any potential systematic deviations from zero through simulation. Clearly, it would be unreasonable to expect that ngap = 0 for any data set, since the observations in a given data set are subject to sampling error. We therefore interpret the statement of Bulteel et al. [7] as a statement about the expectation over errors of any given VAR model.
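As a rough illustration of how the gap between the two crossing points could be read off averaged error curves, consider the sketch below. The function names and the toy curves are hypothetical. Two choices are assumptions: a crossing is taken as the first n at which the VAR error falls below the AR error, and the gap is taken as the estimation crossing minus the prediction crossing, which matches the text's reading that a negative gap means VAR already has the lower estimation error when the prediction errors are equal.

```python
def crossing_n(sample_sizes, err_ar, err_var):
    """Smallest n at which the VAR error drops below the AR error, or None."""
    for n, e_ar, e_var in zip(sample_sizes, err_ar, err_var):
        if e_var < e_ar:
            return n
    return None

def n_gap(sample_sizes, ee_ar, ee_var, pe_ar, pe_var):
    """Estimation-error crossing minus prediction-error crossing (None if no crossing)."""
    n_e = crossing_n(sample_sizes, ee_ar, ee_var)
    n_p = crossing_n(sample_sizes, pe_ar, pe_var)
    if n_e is None or n_p is None:
        return None
    return n_e - n_p

# Toy curves, invented for illustration only.
sizes = [10, 20, 30, 40, 50]
ee_ar, ee_var = [0.05] * 5, [0.20, 0.08, 0.04, 0.03, 0.02]   # crosses at n = 30
pe_ar, pe_var = [1.10] * 5, [1.50, 1.20, 1.12, 1.08, 1.05]   # crosses at n = 40
print(n_gap(sizes, ee_ar, ee_var, pe_ar, pe_var))            # prints -10
```

Returning `None` when either pair of curves never crosses keeps such models out of any summary of the gap instead of assigning them an arbitrary value.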

Assessing ngap through simulation

We now use the results of the simulation study from the previous section to check whether indeed ngap = 0 on average for all VAR models. To compute prediction error, we generate a test-set time series consisting of ntest = 2000 observations (using a burn-in period of nburn = 100) for each of the 7400 VAR models described in the previous section. For each of the 100 replications of model and sample size condition, we average over the prediction errors which are obtained when estimated model parameters are evaluated on the test set. This is the out-of-sample prediction error (i.e., the expected generalization error) that Bulteel et al. [7] approximate with out-of-bag cross-validation error. We define prediction error as the mean squared error (MSE) of the predicted values relative to the true values in the test data set. Fig 3 shows the estimation (solid lines) and prediction (dashed lines) errors for both the AR (black lines) and VAR (red lines) models as a function of n, averaged across the replications, for model A with D = 0.068 and O = 0.092 (left panel) and model B with D = 0.337 and O = 0.051 (right panel). For model A, we see that ngap < 0, which shows that ngap = 0 for all VAR models is incorrect. What consequences does this gap have for model selection? The negative gap implies that if the prediction errors for the AR and VAR model are the same, the VAR model should be selected, because its estimation error is smaller. In contrast, for model B we observe ngap > 0. In this situation, if the prediction errors are equal, one should select the AR model because it incurs smaller estimation error. Clearly, ngap differs between the two models, and this difference matters for model selection.
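The out-of-sample prediction error defined above can be sketched as a one-step-ahead mean squared error on a held-out series. The function name, the 2 × 2 matrix, and the unit-variance Gaussian errors are assumptions made for illustration; for brevity the true matrix and its AR-restricted version stand in for the estimated models.

```python
import numpy as np

def prediction_mse(phi_hat, test_series):
    """Mean squared one-step-ahead prediction error of phi_hat on a held-out series."""
    predicted = test_series[:-1] @ phi_hat.T   # row t predicts the observation at t + 1
    return float(np.mean((test_series[1:] - predicted) ** 2))

# Hypothetical 2-variable model; the study uses p = 6 variables.
rng = np.random.default_rng(7)
phi_true = np.array([[0.5, 0.2],
                     [0.0, 0.4]])
n_test, n_burn = 2000, 100
x = np.zeros(2)
rows = []
for t in range(n_burn + n_test):
    x = phi_true @ x + rng.standard_normal(2)
    if t >= n_burn:                            # keep only post-burn-in observations
        rows.append(x)
test_series = np.vstack(rows)

phi_ar = np.diag(np.diag(phi_true))            # AR restriction: cross-lagged effects set to zero
print("PE with full matrix:   ", prediction_mse(phi_true, test_series))
print("PE with AR restriction:", prediction_mse(phi_ar, test_series))
```

In the simulation design described in the text, `phi_hat` would instead be the AR or VAR estimate obtained from the first n training observations, and the resulting errors would be averaged over the 100 replications.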

Fig 3

Scaled Mean Squared Error (MSE) of estimation (solid lines) and prediction errors (dashed lines) for both the AR (black lines) and VAR (red lines) models as a function of n, separately for model A with D = 0.068 and O = 0.092 (left panel) and model B with D = 0.337 and O = 0.051 (right panel). The red and green shaded area indicates the median ngap, and the grey shaded area shows the 20% and 80% quantiles across the 100 replications per model.

So far we only investigated ngap for two individual VAR models. Fig 4(a) shows the distribution of the expected ngap across all VAR models, computed by averaging over 100 replications. Note that for 31 out of 7400 models the curves of prediction errors and estimation errors did not cross within n ∈ {8, 9, …, 499, 500}. The results in Fig 4 are therefore computed on 7369 models.

Fig 4

Panel (a) displays the distribution of the expected ngap across all 7369 VAR models, computed by averaging over 100 replications, and weighted by the probability defined by the original mixed model. Panel (b) shows the distribution of non-zero EEcomp across all n, 7369 VAR models, averaged across replications and weighted by the probability defined by the original mixed model.

Each of the data points in the histogram in Fig 4(a) corresponds to the expected ngap of one of the 7369 models. We see that the expected ngap has a right-skewed distribution with a mode at zero. This allows us to make a precise statement regarding the crossing of estimation and prediction errors described above: while the most common value of ngap is zero, most expected ngap are not zero. In fact, ngap shows substantial variation across different VAR models. Explaining the variance of ngap is interesting, because ngap has direct consequences for model selection. If we can relate ngap to characteristics of the Φ matrix, it is possible to make more specific statements about when to apply a bias towards the AR or VAR model when the prediction errors are the same or very similar. Note that such a function from Φ to ngap must exist, because the only way the 7400 models differ is in the entries of their VAR parameter matrix Φ. However, this function may be very complicated. For example, the Pearson correlations of ngap with D and O are 0.21 and −0.02, respectively. Predicting ngap from D and O, including their interaction term, with linear regression achieves R² = 0.048. This shows that a simple linear model including D and O is not sufficient to describe the relationship between ngap and Φ. Future research could look into better approximations of this relationship. If successful, one could build new model selection strategies on reliable predictions of ngap from empirical data.

Performance of the “1 Standard Error Rule”

Bulteel et al. [7] propose, in the words of Hastie et al., to “[…] choose the most parsimonious model whose error is no more than one standard error above the error of the best model.” [8, p. 244]. This model selection criterion is known as the “1 Standard Error Rule” (1SER) and is suggested by Hastie and colleagues as a method of choosing a model with the minimal out-of-sample prediction error (which is typically unknown), on the basis of out-of-bag prediction error (acquired with cross-validation techniques). Making inferences from prediction error to estimation error requires a link between the two. Bulteel et al. [7] provide this link by suggesting that ngap = 0 (or ngap ≈ 0). However, they do not provide justification for why the 1SER should outperform simply selecting the model with the lowest prediction error. Above we showed that ngap = 0 does not hold for all VAR models. In fact, it is this result that explains why the 1SER can perform better than selecting the model with the lowest prediction error. Specifically, this is the case when ngap > 0, which characterizes the situation that the prediction error for VAR is lower than for AR while at the same time the estimation error of VAR is higher than for AR. In such a situation, a bias towards the AR model can be favorable. In contrast, if ngap < 0 and the prediction error of AR is lower than for VAR, even though the estimation error of VAR is lower than for AR, such a bias would be unfavorable. In the following, we assess the relative performance of the 1SER and simply selecting the model with lowest prediction error, both on average and as a function of n.
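For the two-model case considered here, the two selection strategies reduce to a pair of comparisons, sketched below. The function names and numbers are hypothetical; which standard error is used (here passed in by the caller) and the choice to break exact ties in favour of the more parsimonious AR model are assumptions.

```python
def select_min_error(pe_ar, pe_var):
    """Pick the model with the lower prediction error (ties go to the simpler AR model)."""
    return "AR" if pe_ar <= pe_var else "VAR"

def select_one_se(pe_ar, pe_var, se):
    """1SER: pick AR unless its prediction error exceeds the VAR error by more than one SE."""
    return "AR" if pe_ar <= pe_var + se else "VAR"

# AR is slightly worse in prediction, but within one standard error of VAR.
print(select_min_error(1.05, 1.00))        # prints VAR
print(select_one_se(1.05, 1.00, se=0.08))  # prints AR
# AR is worse by more than one standard error, so both rules agree.
print(select_one_se(1.20, 1.00, se=0.08))  # prints VAR
```

The rules can only disagree when AR has the higher prediction error but lies within one standard error of VAR, which is why the 1SER acts as a bias towards the AR model.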
In order to quantify the relative performance of both model selection strategies, we take the prediction and estimation errors of the 7400 VAR models estimated on n ∈ {8, 9, …, 499, 500} and for each model, and each n, select between the AR and VAR model in two different ways: (1) by simply selecting the model with the lowest prediction error, and (2) by applying the 1SER (using the standard deviation of the out-of-sample prediction error across 100 training sets). For each of the two strategies, we then subtract the estimation error of the selected model (EEsel) from the estimation error of the model with the lowest estimation error (EEbest). The difference EEdiff = EEbest − EEsel equals zero if the model with lower estimation error has been selected, and is negative if the model with higher estimation error has been selected. Subsequently, we compute

$$EE_{\mathrm{comp}} = EE_{\mathrm{diff}}^{(2)} - EE_{\mathrm{diff}}^{(1)},$$

where $EE_{\mathrm{diff}}^{(2)}$ is the difference obtained using (2), and $EE_{\mathrm{diff}}^{(1)}$ is the difference obtained using (1). The resulting value of EEcomp allows us to compare the performance of the two model selection strategies. That is, if EEcomp < 0, simply selecting the model with lowest prediction error performs better, and if EEcomp > 0, the 1SER performs better. Fig 4(b) shows the distribution of non-zero EEcomp across all 7400 VAR models, averaged over replications, and weighted by the probability given by the original mixed model. The only interesting cases when comparing model selection procedures are the cases in which they disagree. Therefore, we analyze only those cases for which EEcomp ≠ 0. Note that for all but 2 of the 7400 models there is some n at which the two decision rules in question choose a different model. We find that using the 1SER is better in 50.1% of cases (where each case is weighted by the probability of the corresponding model). This would suggest that it makes essentially no difference whether we use the 1SER or select the model with lowest prediction error.
However, these proportions average over the number of observations n and therefore cannot reveal differences in relative performance for different sample sizes. Fig 5(a) shows EEcomp as a function of n, averaged across all 7400 models. Because the VAR prediction error is huge for very small n, both model selection strategies choose the same model, resulting in EEcomp = 0 for those n. However, from around n = 10 on until around n = 60, EEcomp is substantially positive, indicating that the 1SER outperforms simply selecting the model with the lowest prediction error by a large margin. However, for n > 60 we see that EEcomp approaches zero and then becomes slightly negative. The latter is also illustrated in panel (b), which displays the weighted proportion of models in which the 1SER is better (i.e., EEcomp > 0). The explanation of this curve has three parts. First, ngap tends to be larger if the gap is located at a small n (Pearson correlation r = −0.15). If ngap is large (and therefore positive), the AR model has lower estimation error than the VAR model, even though the prediction errors are the same (compare Fig 5(b)). In such situations, biasing model selection towards selecting the AR model is advantageous. Since the 1SER constitutes a bias towards the AR model, it performs better for small n. Second, this also explains why the 1SER performs worse than simply selecting the model with lowest prediction error for large n: here the gap is small (negative), indicating that if the prediction errors are the same, the VAR model performs better. Clearly, in such a situation, providing a bias towards AR is disadvantageous. Therefore, the 1SER performs worse. Finally, why does the curve get closer and closer to zero? The reason is that the standard error converges to zero with (the square root of) the number of observations, and therefore the probability that both rules select the same model approaches 1 as n goes to infinity.
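The comparison measure used in this section can be sketched for a single model and sample size as follows. The function names and the numbers are hypothetical, and the sign convention is taken from the text: EEcomp > 0 means the 1SER selected the model with lower estimation error while the minimum-prediction-error rule did not.

```python
def ee_diff(ee_ar, ee_var, selected):
    """EEbest - EEsel: zero if the better-estimated model was selected, negative otherwise."""
    ee_best = min(ee_ar, ee_var)
    ee_sel = ee_ar if selected == "AR" else ee_var
    return ee_best - ee_sel

def ee_comp(ee_ar, ee_var, pe_ar, pe_var, se):
    """EEdiff under the 1SER minus EEdiff under minimum prediction error."""
    sel_min = "AR" if pe_ar <= pe_var else "VAR"
    sel_1se = "AR" if pe_ar <= pe_var + se else "VAR"
    return ee_diff(ee_ar, ee_var, sel_1se) - ee_diff(ee_ar, ee_var, sel_min)

# The rules disagree and AR has the lower estimation error: the 1SER wins.
print(ee_comp(ee_ar=0.02, ee_var=0.05, pe_ar=1.05, pe_var=1.00, se=0.08) > 0)
# The rules disagree and VAR has the lower estimation error: the 1SER loses.
print(ee_comp(ee_ar=0.05, ee_var=0.02, pe_ar=1.05, pe_var=1.00, se=0.08) < 0)
# The rules agree, so the comparison is exactly zero.
print(ee_comp(ee_ar=0.05, ee_var=0.02, pe_ar=1.20, pe_var=1.00, se=0.08))
```

Averaging this quantity over models at each n would give a curve of the kind described for Fig 5(a), and the share of positive values would give the proportion described for panel (b).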

Fig 5

Panel (a) displays EEcomp averaged across 7400 models as a function of n (black line) and the standard deviation around the average (blue line). Panel (b) displays, for each n, the proportion of times that EEcomp > 0 across 7400 models (i.e., the proportion of 1SER performing better).

To summarize, we found that the 1SER is better than simply selecting the model with the lowest prediction error only in 50.1% of the cases in which the two rules do not select the same model. However, when looking at the relative performance as a function of n, we found that the 1SER is better than selecting the model with lowest prediction error until around n = 60, and worse above. Finally, we were able to explain the dependence of the relative performance on n with the fact that ngap is larger when it occurs at a smaller n. For applied researchers these results suggest that, for VAR models with p = 6 variables, the 1SER should be applied for n < 60.

Discussion

In this paper we provided an extended analysis of the problem studied by Bulteel et al. [7] by using a simulation study to (a) map out the relative performance of AR and VAR models in typical psychological applications as a function of the number of observations n, and (b) investigate how to choose between AR and VAR models in practice. We found that, averaged over all models considered in our simulation, the VAR model outperforms the AR model for n > 89 observations in terms of estimation error. In addition, we showed that, and explained why, the 1SE rule proposed by Bulteel et al. [7] performs better than selecting the model with the lowest prediction error when n is small.

Next to the average estimation errors of AR and VAR models, we also investigated the variance around those averages. We decomposed this variance into variance due to different true VAR models, and variance due to sampling. The variance across different VAR models showed that the relative performance, that is, the n at which VAR becomes better than AR (ne), depends on the characteristics of the true VAR parameter matrix Φ. For example, if the true VAR model is very close to an AR model, it takes more observations until the VAR model outperforms the AR model. This shows that one cannot expect reliable recommendations with respect to ne that ignore the characteristics of the generating model: ne critically depends on the size of the off-diagonal elements present in the data-generating model. The size of the sampling variation also indicates that, for many of the considered sample sizes, whether the VAR or AR model will have lower estimation error largely depends on the specific sample at hand. This implies that it is difficult to select the model with lowest estimation error with the sample sizes available in typical psychological applications.

The second question we investigated was: how should one choose between the AR and VAR model for a given data set? Bulteel et al. 
[7] suggest that, for any VAR model, the n at which the prediction errors of both models are equal is, in expectation, (approximately) the same n at which their estimation errors are equal (i.e., ngap ≈ 0). Combining this claim with a preference towards the more parsimonious AR model, they proposed using the “1 Standard Error Rule”, according to which one should select the AR model if its prediction error is not more than one standard error above the prediction error of the VAR model, and choose the model with lowest prediction error otherwise. We showed that the expected ngap varies as a function of the parameter matrix of the true VAR model. Using the relationship between estimation and prediction error we were able to explain when the 1SER is expected to perform better than selecting the model with lowest prediction error. In addition, we showed via simulation that the 1SER performs better than selecting the model with the lowest prediction error for n < 60, in cases where those decision rules select conflicting models. Our simulations also showed that as n → ∞, both decision rules converge to selecting the same model. This means that there is a relatively small range of sample sizes in which these decision rules lead to contradictory model selections for a given data-generating system. We recommend that researchers wishing to use prediction error to choose between these models examine both the 1SER and lowest prediction error rules, and in cases of conflict between the two, use the 1SER for low (n < 60) sample sizes.

The relative performance of the AR and VAR model shown in our simulations can be understood in terms of the bias-variance trade-off. Because the AR model sets all off-diagonal elements to zero, it has a bias that is constant and independent of n. In contrast, the VAR model has a bias of zero, since the true model is a VAR model. 
This is why a VAR model will always perform better than (or at least as well as, if all off-diagonal elements of the true VAR model are zero) an AR model as n → ∞. However, for finite sample sizes the variances of the estimates of the two models differ: while both variances converge to zero as n → ∞, for finite samples the variance of VAR parameters is much larger than the variance of AR parameters, especially for small n. This allows for situations in which the biased simpler model shows a smaller error, even though the true model is in the class of the more complex model. This trade-off between bias and variance also explains the relative performance of AR and VAR models: In Fig 3 we saw that for small n, the variance of the VAR estimates is so large that the error is larger than the error of the AR model, despite the bias of the AR model. However, with increasing n, the variance of the estimates of both models approaches zero. This means that the larger n, the more the bias of the AR model contributes to its error. Thus, at some n the error of the VAR model becomes smaller than the error of the AR model. We agree with Bulteel et al. [7] that the fact that a simple (and possibly implausible) model can outperform a complex (and more plausible) model, even though the true model is in the class of the more complex model, is underappreciated in the psychological literature.

An interesting question we did not discuss in our paper is: which model should we choose if the AR and VAR models have equal estimation error? Since we defined the quality of a model by its estimation error, we could simply pick one of the two models at random. However, their model parameters are likely to be very different. The estimation error of the AR model comes mostly from setting off-diagonal elements incorrectly to zero, while the estimation error of the VAR model comes mostly from incorrectly estimating off-diagonal elements. 
In terms of the types of errors produced by the two models, the AR model will almost exclusively produce false negatives, while the VAR model will produce almost exclusively false positives. A specification of the cost of false positives/negatives in a given analysis may allow one to choose between models when the estimation errors are the same or very similar. For example, in an exploratory analysis one might accept more false positives in order to avoid false negatives.

Throughout the paper we compared the AR model to the VAR model. However, we believe that it is unnecessarily restrictive to choose only between those extremes (all off-diagonal elements zero vs. all off-diagonal elements nonzero). The AR model, by imposing independence between processes, presents a theoretically implausible model for many psychological processes. Applied researchers who estimate the VAR model may be primarily interested in the recovery of cross-lagged effects rather than auto-regressive parameters, for example to determine which processes are dependent on one another (as evidenced by frequent discussions of Granger causality [11] in these settings). In such settings, one could estimate VAR models with a constraint that limits the number of nonzero parameters or penalizes their size [12, 13]. This would allow the recovery of large off-diagonal elements without the high variance of estimates in the standard VAR model. Similarly, one could estimate a VAR model and, instead of comparing it to an AR model and thus testing the nullity of the off-diagonal elements jointly, test the nullity of the off-diagonal elements of the VAR matrix individually. Further investigation of these alternatives in future studies would provide a more complete picture to applied researchers.

It is important to keep the following limitations of our simulation study in mind. First, we claimed that the 7400 models we sampled from the mixed model obtained from the “MindMaastricht” data represent typical applications in psychology. 
One could argue that there are sets of VAR models that are plausible in psychological applications that are not included in our set of models. While this is a theoretical possibility, we consider this extremely unlikely, since we heavily sampled the mixed model stratified by O and D. Any VAR model that is not similar to a model in our set of considered VAR models is therefore most likely non-stationary. When presenting our results we weighted all models by their frequency given the estimated mixed model in order to avoid giving too much weight to unusual VAR models. It could be, however, that the weighting obtained from the mixed model does not represent the frequency of VAR models in psychological applications well. While we consider this unlikely, we also used a uniform weighting across VAR models as a robustness check, which left all main conclusions unchanged.

A second limitation is that we only considered VAR models with p = 6 variables. While this is not a shortcoming compared to Bulteel et al. [7], who use VAR models with 6, 6, and 8 variables, the results shown in the present paper would likely change when considering more or fewer than six variables. Specifically, we expect that the n at which VAR outperforms AR becomes larger when more variables are included in the model, and smaller when fewer variables are included. This change may be nonlinear in nature: As we add variables to the model, we would expect the variance of the VAR model to grow much more quickly than the variance of the AR model, since in the former case we need to estimate p² parameters, and in the latter only p. However, the bias of the AR model also grows with each new variable added, with p² − p elements set to zero in each case, and so again, this will largely depend on the data-generating system at hand. Similarly, we would expect that for models with more variables the 1SER outperforms selecting the model with lowest prediction error for sample sizes larger than 60. 
While the exact values will change for larger p, we expect that the general relationships between n, O, and D extend to any number of variables p.

Although Bulteel et al. [7] also consider mixed VAR and AR models, in the simulation studies presented above we focus exclusively on single-subject time series for simplicity. Mixed models can be seen as a form of regularization, in which individual parameter estimates are shrunk towards the group-level mean if the number of observations n is small. One would expect that for small n, the use of mixed models would improve the estimation and prediction errors of both models, which is also what Bulteel et al. [7] report in their results. Indeed, mixed models are expected to improve the performance of VAR methods relative to AR, and thus may be a solution to the relatively poor performance of the VAR model we observe in sample sizes realistic for psychological applications. The reason is that the differential performance of AR and VAR models can be understood in terms of a bias-variance trade-off, where AR models are biased but have lower variance than VAR methods. The use of mixed VAR models should decrease this variance through shrinkage in small n settings [14, 15]. The precise effect of using mixed models depends on the variance of parameters across individuals; however, we do not expect the general pattern of results reported here to change when moving from single-subject to mixed settings.

Future research could extend the analysis shown here to VAR models with fewer or more than six variables, which would allow the simulation results to be generalized to more situations encountered in psychological applications. Another interesting avenue for future research would be to investigate the link between ngap and the VAR parameter matrix Φ. Since ngap has direct implications for model selection, such a link could possibly be used to construct improved model selection procedures. 
It would be useful to extend the simulation study in this paper to constrained estimation such as the LASSO, especially since those methods are already applied in practice [16]. Finally, it would be useful to study the performance of mixed VAR models in a simulation setting, and perhaps compare this approach to alternative methods of using group-level information in individual time-series analysis, such as GIMME, an approach originally developed for the analysis of neuroimaging data [17]. Early simulation studies have assessed the performance of mixed AR models in recovering fixed effects using Bayesian estimation techniques [18], but these analyses have yet to be extended to mixed VAR models or the recovery of individual-specific random effects.

To sum up, we used simulations to study the relative performance of AR and VAR models in settings typical for psychological applications. We showed that, on average, we need sample sizes approaching n = 89 for single-subject VAR models to outperform AR models. While this may seem like a relatively large sample size requirement, such longer time series are becoming more common in psychological research [19, 20]. Decomposing the variance around this average showed that (i) one cannot expect reliable statements with respect to the relative performance of the AR and VAR models that ignore the characteristics of the generating model, and (ii) that choosing reliably between AR and VAR models is difficult for most sample sizes typically available in psychological research. Finally, we provided a theoretical explanation for when the “1 Standard Error Rule” outperforms simply selecting the model with lowest prediction error, and showed that the 1SER performs better when n is small.
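
The bias-variance argument made in the Discussion can be illustrated with a short simulation. The sketch below is not the simulation design of the paper (which samples 7400 models from a mixed model fitted to empirical data). It makes several assumptions: a single hypothetical parameter matrix `PHI` with p = 6, diagonal 0.3 and off-diagonal 0.1, standard normal innovations, no intercept, and estimation error measured as the sum of squared differences between estimated and true parameters.

```python
import numpy as np

# Hypothetical true VAR(1) matrix: p = 6, eigenvalues 0.8 and 0.2 (stationary).
PHI = np.full((6, 6), 0.1)
np.fill_diagonal(PHI, 0.3)


def simulate_var1(phi, n, rng, burn_in=100):
    """Simulate n observations of x_t = phi @ x_{t-1} + e_t, e_t ~ N(0, I)."""
    p = phi.shape[0]
    eps = rng.standard_normal((n + burn_in, p))
    x = np.zeros(p)
    out = np.empty((n, p))
    for t in range(n + burn_in):
        x = phi @ x + eps[t]
        if t >= burn_in:
            out[t - burn_in] = x
    return out


def fit_var1(y):
    """Least-squares VAR(1) fit without intercept; returns the p x p matrix."""
    coef, *_ = np.linalg.lstsq(y[:-1], y[1:], rcond=None)
    return coef.T  # y[1:] is approximated by y[:-1] @ coef


def fit_ar1(y):
    """Separate AR(1) fits per variable; off-diagonal elements are fixed at 0."""
    x_prev, x_next = y[:-1], y[1:]
    slopes = (x_prev * x_next).sum(axis=0) / (x_prev ** 2).sum(axis=0)
    return np.diag(slopes)


def estimation_error(phi_hat, phi):
    """Assumed error measure: sum of squared parameter differences."""
    return float(np.sum((phi_hat - phi) ** 2))


def mean_errors(phi, n, reps=200, seed=0):
    """Average AR and VAR estimation error over `reps` simulated series."""
    rng = np.random.default_rng(seed)
    err_ar, err_var = [], []
    for _ in range(reps):
        y = simulate_var1(phi, n, rng)
        err_ar.append(estimation_error(fit_ar1(y), phi))
        err_var.append(estimation_error(fit_var1(y), phi))
    return float(np.mean(err_ar)), float(np.mean(err_var))


if __name__ == "__main__":
    for n in (20, 50, 100, 500):
        ar, var = mean_errors(PHI, n)
        print(f"n={n:4d}  AR error={ar:.3f}  VAR error={var:.3f}")
```

In this setup the AR error cannot fall below the squared size of the omitted off-diagonal elements (30 × 0.1² = 0.3), whereas the VAR error shrinks towards zero as n grows, so the two curves must cross at some n. Where they cross depends on the chosen `PHI`, in line with the dependence of ne on the data-generating model described above.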

D and O values for the initially sampled 10000 VAR models.

(PDF)

14 Apr 2020

PONE-D-20-03592
Choosing between AR(1) and VAR(1) Models in Typical Psychological Applications
PLOS ONE

Dear Mr Haslbeck,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please, take into account all the considerations raised by the reviewers. We would appreciate receiving your revised manuscript by May 22 2020 11:59PM.

We look forward to receiving your revised manuscript.

Kind regards,
Miguel Angel Sánchez Granero
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. 
The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Partly ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors conducted comprehensive simulations to examine the performance of AR vs. 
VAR model using typical psychological time series data, and compared their performance in estimation error and prediction error as related to the length of time series and the characteristics of the true model. The study extended from Bulteel et al. (2018) and its results make a major contribution to the literature. In general, the manuscript is well written and clearly organized. I have a few comments and suggestions for the authors to consider when revising their manuscript. My first general concern is on the “typical” part of the study. The authors, and the Bulteel et al. (2018) as well, fail to elaborate the main reason for the application of VAR model. More than often for applied researchers, they choose to use the VAR model because they are interested to know whether one variable A is related to another variable B at a later time (i.e., cross-lagged paths), after controlling for B at previous time. In other words, one is interested to know whether A has added value in terms of the prediction of B, and the choice of A and B are theoretically derived. From this point of view, it is theoretically meaningful to adopt VAR model rather than AR model. In such situation, the research question becomes whether VAR can accurately recover the cross-lagged links between variables, rather than whether AR outperforms VAR, under some conditions. Relatedly, the authors initially claimed that the length of most applied psychological time series data fall between 30 to 200. It is important to note that the MindMaastricht dataset, where the current simulations are based on, in my mind are not typically psychological time series data (52 individuals with an average of 41 measurements on 6 variables). All three data used by Bulteel et al. (2018) face the same issue as well (individuals fewer than 100, lengths between 41 and 70). 
From my reading of the applied literature, most studies tend to have a lot more participants with shorter time series and fewer variables (at least those examined in the VAR model). Whether the mean number of 92 based on estimation error, or the number of 60 for prediction performance, they are all beyond the length of most typical psychological time series data. Does it mean that applied researchers should just always go with the AR model? The authors should discuss this point. The authors encouraged future studies with more than 6 variables. However, with fewer than 6 variables considered, how would the current findings hold (I reckon n for both estimation and prediction errors likely will go down)? It is likely that it may take fewer n for VAR to outperform AR. For each VAR model (R and D) condition, 100 independent time series were simulated. These are more referred to as “replications” for each model design condition, rather than “iterations” (e.g., page 5 line 148). The authors should revise the term where applicable throughout the manuscript. The authors simulated n = 500 for estimation simulation but n = 2000 for prediction simulation. From the results and discussions, it appears that 2000 does not matter too much. Discussions are needed regarding this point. Figure 4b and on page 11 line 350, the authors should state how many cases have EEcomp unequal to zero. The authors mentioned mixed models – some recent simulation work on DSEM should be cited, which have shown satisfying estimation results for VAR. Furthermore, the authors should briefly discuss the subgroup/mixture approach when there are distinct subgroups of time series patterns (e.g., GIMME). Minor comments When referring to the mixed effects examined in Bulteel et al. (2018), at least for the first time (page 2 line 40), it would be helpful to clarify it refers to multilevel model with random effects. On page 4 line 123, it should be Figure 6 in the supplementary materials. 
On page 7 line 201, two “have”s; line 202, two “the”s. Reviewer #2: The authors present results from a series of simulation studies examining the performance of AR and VAR models. Results assist the reader in determining which model structure (i.e., AR versus VAR) to use when modeling n=1 time series data. I appreciate and admire the clarity with which the authors describe complex methodology and present their results. I believe that this paper will be a valuable contribution to the field of psychological time series. Below I have outlined suggestions to facilitate the connection of the theoretical nature of this manuscript to applied psychological data. 1) Page 3 and 4: I appreciate the novel methods the authors used to generate their simulated data through the use of parameters, R & D. However, I am concerned that this method introduces artifacts into the sampling scheme, due to the fact that there is a correlation between R & D (as shown in Figure 6). Thus, it seems that there would be bias in the models generated with this technique. In general, although the authors provide some justification for using R & D, it would be helpful for the author to provide further explanation of their parameterization methods in light of this correlation. In particular, it seems that this correlation may be artificially induced by the authors’ definition of R & D. For example, a theorem from linear algebra states that the sum of the eigenvalues of a matrix (i.e., D) is equal to the sum of its diagonal elements (i.e., it’s trace, in this case the AR parameters included in the numerator of R). Hence, the numerator of R is essentially D. This suggests that the R-D parameterization is likely responsible for the correlation in the simulation samples. I recommend that the authors acknowledge this in their description of their parameterization methods. 
Additionally, I recommend that they examine the correlation between R & D to demonstrate that this correlation is sufficiently low so as to not overly bias the simulation data. Finally, I strongly suggest that authors reformulate R so that it is free from the influences of this correlation, such as by using the current denominator of R. This would allow for the modeling of autoregressive effects (i.e., D) and cross-lagged effects (i.e., denominator of R), independently. 2) I think it may be useful for the authors to provide more recommendations for the design of psychological time series studies based on their data. In other words, are there suggestions for how applied researchers should implement these findings? 3a) For example, do these results support the recommendation of collecting more observations in general? 3b) Lines 443-455 refer to several theoretical points about choosing between VAR and AR models under the condition of equal estimation error. Given that applied researchers may want to select one model over the other for hypothesis-testing reasons (e.g., testing the AR effect of mood versus including the cross-lagged effect of anxiety on mood), could you provide clarification on whether an applied researcher would be able to test for estimation error equivalence using empirical data? If that is not possible, I believe it may be helpful to state this explicitly. 3c) Line 385: In regards to comparing the 1SER rule versus selecting the model with the lower prediction error, what should applied researchers take away from these results if they are working with data with n > 60? 4) Line 173: Could you clarify what is meant by specifying the data generating model and how a researcher would do this using empirical data? 5) Line 509: I recommend rephrasing this sentence to specify that the relative performance of AR and VAR models were studied using simulations of data generated from typical psychological applications. 6) Line 24 = missing the word, “the”? 
Overall, I appreciate the authors’ contribution the field of time series psychometrics. I hope that the authors find my comments helpful in assisting them with revising the draft for publication. ********** 6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. Do you want your identity to be public for this peer review? Reviewer #1: Yes: Yao Zheng Reviewer #2: No

28 Jul 2020

We have uploaded a file that responds in detail to the reviewer and editor comments. However, we have pasted them here as well:

Dear Editor, Thank you for sending the comments of the reviewer and the Associate Editor. The comments and the close reading especially of the Associate Editor helped us again to make important improvements to our manuscript. 
We append our responses to the reviewer's and Associate Editor's comments at the end of this letter. Note that based on the comments of Reviewer 2, we re-ran the main simulation part of our study, using a slightly different sampling scheme, based on the size of the off-diagonal elements and diagonal elements (O and D) respectively. This was largely done to aid the interpretation of our results as depicted in Figure 2. This new simulation has not changed our main results in any way, and we note it here mainly to draw attention to slight numerical differences that appear in the new manuscript. For instance, the median sample size requirement of ne = 92 discussed by Reviewer 1 has decreased slightly to ne = 89 in the new manuscript. A full discussion of these changes is given in our reply to Reviewer 2's comments.

Kind regards,

Reviewer 1

Comment 1

"My first general concern is on the “typical” part of the study. The authors, and the Bulteel et al. (2018) as well, fail to elaborate the main reason for the application of VAR model. More than often for applied researchers, they choose to use the VAR model because they are interested to know whether one variable A is related to another variable B at a later time (i.e., cross-lagged paths), after controlling for B at previous time. In other words, one is interested to know whether A has added value in terms of the prediction of B, and the choice of A and B are theoretically derived. From this point of view, it is theoretically meaningful to adopt VAR model rather than AR model. In such situation, the research question becomes whether VAR can accurately recover the cross-lagged links between variables, rather than whether AR outperforms VAR, under some conditions."

We agree with the reviewer on this point, and have added a clarification on the theoretical choice of VAR over AR models in the discussion (lines 447 - 452, below). As a follow-up to Bulteel et al., a full investigation of cross-lagged parameter recovery was beyond the scope of the current paper. However, the sample size requirement for when the VAR outperforms the AR model in estimation error is likely to be a lower bound on the sample size requirement for accurate recovery of cross-lagged parameters, as it indicates at what sample size the cross-lagged parameter estimation performs better in approximating the true parameter set than guessing zero for all cross-lagged parameters.

Added text: “Throughout the paper we compared the AR model to the VAR model. However, we believe that it is unnecessarily restricting to choose only between those extremes (all off-diagonal elements zero vs. all off-diagonal elements nonzero). The AR model, by imposing independence between processes, presents a theoretically implausible model for many psychological processes. Applied researchers who estimate the VAR model may be primarily interested in the recovery of cross-lagged effects rather than auto-regressive parameters, for example to determine which processes are dependent on one another (as evidenced by frequent discussions of Granger causality [11] in these settings). In such settings, one could estimate VAR models with a constraint that limits the number of nonzero parameters or penalizes their size [12,13]. This would allow the recovery of large off-diagonal elements without the high variance of estimates in the standard VAR model. Similarly, one could estimate a VAR model and, instead of comparing it to an AR model and thus testing the nullity of the off-diagonal elements jointly, test the nullity of the off-diagonal elements of the VAR matrix individually. 
Further investigation of these alternatives would provide a more complete picture to applied researchers in future studies.” \\pagebreak \\textbf{Comment 2} \\begin{displayquote} Relatedly, the authors initially claimed that the length of most applied psychological time series data fall between 30 to 200. It is important to note that the MindMaastricht dataset, where the current simulations are based on, in my mind are not typically psychological time series data (52 individuals with an average of 41 measurements on 6 variables). All three data used by Bulteel et al. (2018) face the same issue as well (individuals fewer than 100, lengths between 41 and 70). From my reading of the applied literature, most studies tend to have a lot more participants with shorter time series and fewer variables (at least those examined in the VAR model). Whether the mean number of 92 based on estimation error, or the number of 60 for prediction performance, they are all beyond the length of most typical psychological time series data. Does it mean that applied researchers should just always go with the AR model? The authors should discuss this point. \\end{displayquote} We agree with the reviewer that sample size requirements of 89/92 or so repeated measurements may be a tall order in many settings, and that many studies use multiple-subject rather than single-subject designs. With regard to the sample size requirements, it is important to note an additional finding of our study: although the average sample size requirement for VAR to outperform AR is 89, there is a very large degree of variation around this value. This variation is largely determined by the absolute size of the off-diagonal elements: From Figure 2 we can now see clearly that data-generating mechanisms with larger off-diagonal elements may require as few as half that many observations. 
We have emphasised this more clearly in the discussion section (lines 392 - 395): “This shows that one cannot expect reliable recommendations with respect to $n_\\text{e}$ that ignore the characteristics of the generating model: $n_e$ critically depends on the size of the off-diagonal elements present in the data-generating model.” While the reviewer states that even sample sizes of 40 observations per person appear unrealistic to them, it should be noted that more and more psychological studies are collecting longer and longer time series, particularly in the domain of clinical psychology. Two recent examples include Wichers et al. (2016), who collected a single-subject time series dataset of 1478 repeated measurements, and Helmich et al. (2020), who used data consisting of 100 repeated measurements from each of 329 individuals. Finally, the availability of relatively long multiple-subjects data also opens up the possibility of using mixed effects / multilevel models. The use of these models will certainly decrease the number of measurements per person needed to recover model parameters, and although the study of those models was beyond the scope of the current paper, we believe that this is an important topic for future research. In order to reflect this we have added extra detail on this point in the discussion section, on lines 504 - 507: “Indeed, mixed models are expected to improve the performance of VAR methods relative to AR, and thus may be a solution to the relatively poor performance of the VAR model we observe in sample sizes realistic for psychological applications.” And in addition on lines 529 - 542: “To sum up, we studied the relative performance of AR and VAR models in simulations of typical psychological applications. We were able to make clear statements about the average performance of VAR models, which showed that, on average, we need sample sizes approaching $n = 89$ for single-subject VAR models to outperform AR models. 
While this may seem like a relatively large sample size requirement, such longer time series are becoming more common in psychological research \\cite{wichers2016critical, helmich2020sudden} and mixed models may allow for acceptable performance for shorter time series, though much research on that topic is still required. Importantly, we also found the variance around this average sample size to be considerable, with the variation largely a function of the average absolute value of the off-diagonal (i.e. cross-lagged) effects. Decomposing this variance showed that (i) one cannot expect reliable statements with respect to the relative performance of the AR and VAR models that ignore the characteristics of the generating model, and (ii) that choosing reliably between AR and VAR models is difficult for most sample sizes typically available in psychological research.” Finally, we do not agree with the reviewer that our results suggest that researchers should necessarily choose the AR model when sample sizes are low. Rather the choice between these models should be largely informed by theoretical considerations: The AR model presents a theoretically implausible model in imposing independence between all processes. We have added a discussion of this point to lines 454 - 469: “Throughout the paper we compared the AR model to the VAR model. However, we believe that it is unnecessarily restricting to choose only between those extremes (all off-diagonal elements zero vs. all off-diagonal elements nonzero). The AR model, by imposing independence between processes, presents a theoretically implausible model for many psychological processes. Applied researchers who estimate the VAR model may be primarily interested in the recovery of cross-lagged effects rather than auto-regressive parameters, for example to determine which processes are dependent on one another (as evidenced by frequent discussions of Granger causality \\cite{granger1969investigating} in these settings). 
In such settings, one could estimate VAR models with a constraint that limits the number of nonzero parameters or penalizes their size \\cite{fan2001variable, hastie2015statistical}. This would allow the recovery of large off-diagonal elements without the high variance of estimates in the standard VAR model. Similarly, one could estimate a VAR model and, instead of comparing it to an AR model and thus testing the nullity of the off-diagonal elements jointly, test the nullity of the off-diagonal elements of the VAR matrix individually. Further investigation of these alternatives would provide a more complete picture to applied researchers in future studies.” \\pagebreak \\textbf{Comment 3} \\begin{displayquote} The authors encouraged future studies with more than 6 variables. However, with fewer than 6 variables considered, how would the current findings hold (I reckon n for both estimation and prediction errors likely will go down)? It is likely that it may take fewer n for VAR to outperform AR. \\end{displayquote} We agree with the reviewer and would also predict that the errors go down when decreasing the number of variables $p$. While we in the previous version only focused on the case where more variables are included, we now have broadened this discussion to include our predictions for what would happen when fewer variables are included (lines 478 - 485): “Specifically, we expect that the $n$ at which VAR outperforms AR becomes larger when more variables are included in the model, and smaller when less variables are included. This change may be nonlinear in nature: As we add variables to the model, we would expect the variance of the VAR model to grow much quicker than the variance of the AR model, since in the former case we need to estimate $p^2$ parameters, and in the latter only $p$. 
However, the bias of the AR model also grows with each new variable added, with $p^2 - p$ elements set to zero in each case, and so again, this will largely depend on the data-generating system at hand. Similarly, we would expect that for models with more variables the 1SER outperforms selecting the model with lowest prediction error for sample sizes larger than 60. While the exact values will change for larger $p$, we expect that the general relationships between $n$, $O$, and $D$ extend to any number of variables $p$.” \\textbf{Comment 4} \\begin{displayquote} For each VAR model (R and D) condition, 100 independent time series were simulated. These are more referred to as “replications” for each model design condition, rather than “iterations” (e.g., page 5 line 148). The authors should revise the term where applicable throughout the manuscript. \\end{displayquote} We agree and have changed this term to “replications” throughout. \\textbf{Comment 5} \\begin{displayquote} The authors simulated n = 500 for estimation simulation but n = 2000 for prediction simulation. From the results and discussions, it appears that 2000 does not matter too much. Discussions are needed regarding this point. \\end{displayquote} N = 2000 refers to the size of the test set, which is only used to compute the out-of-sample prediction error. 2000 was chosen to be sufficiently large to yield an accurate estimate of the out-of-sample prediction error (i.e., not subject to sampling variation due to a small test set). This is the quantity which Bulteel et al. approximate using a cross-validation scheme, which adds another source of potential error to their findings - choosing a small test set may yield unreliable estimates of the true out-of-sample prediction error. 
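The role of the large test set can be illustrated with a minimal Python sketch. This is not the code used in the study: it assumes a zero-mean VAR(1) process with standard normal innovations, least-squares estimation without intercepts, and the mean squared one-step-ahead error as the prediction error. The function names, the example matrix and the training length of 50 are hypothetical choices made for illustration only.

```python
import numpy as np

def simulate_var1(Phi, n, burn=100, rng=None):
    """Simulate n observations from x_t = Phi x_{t-1} + e_t, e_t ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    p = Phi.shape[0]
    x = np.zeros(p)
    out = np.empty((n, p))
    for t in range(burn + n):
        x = Phi @ x + rng.standard_normal(p)
        if t >= burn:                      # discard the burn-in period
            out[t - burn] = x
    return out

def fit_var1(Y):
    """Least-squares VAR(1) estimate (no intercept; the process is zero-mean)."""
    X, Z = Y[:-1], Y[1:]
    B, *_ = np.linalg.lstsq(X, Z, rcond=None)   # solves Z ~ X B
    return B.T                                   # so that x_t ~ Phi_hat x_{t-1}

def fit_ar1(Y):
    """Separate AR(1) regressions; returned as a diagonal parameter matrix."""
    X, Z = Y[:-1], Y[1:]
    phi = (X * Z).sum(axis=0) / (X * X).sum(axis=0)
    return np.diag(phi)

def prediction_error(Phi_hat, Y_test):
    """Mean squared one-step-ahead prediction error on a held-out series."""
    resid = Y_test[1:] - Y_test[:-1] @ Phi_hat.T
    return float(np.mean(resid ** 2))

rng = np.random.default_rng(2020)
Phi = np.full((3, 3), 0.15)                # hypothetical true VAR(1) matrix
np.fill_diagonal(Phi, 0.3)

train = simulate_var1(Phi, n=50, rng=rng)              # short training series
test = simulate_var1(Phi, n=2000, burn=100, rng=rng)   # large test set

pe_ar = prediction_error(fit_ar1(train), test)
pe_var = prediction_error(fit_var1(train), test)
print(f"PE(AR) = {pe_ar:.3f}, PE(VAR) = {pe_var:.3f}")
```

Because the test series is long, the two prediction errors vary little from one test set to the next, so differences between them mainly reflect the estimated parameters rather than test-set noise.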
For clarity, we have added a simpler description of this to the main text (lines 257 - 265): “To compute prediction error, we generate a test-set time series consisting of $n_{\\text{test}} = 2000$ observations (using a burn-in of $n_{\\text{burn}} = 100$) for each of the 6000 VAR models described in the previous section. For each of the 100 replications of model and sample size condition, we average over the prediction errors which are obtained when estimated model parameters are evaluated on the test set.” \\textbf{Comment 6} \\begin{displayquote} Figure 4b and on page 11 line 350, the authors should state how many cases have EEcomp unequal to zero. \\end{displayquote} It should be noted that the answer to this question depends on how we define “cases”: If we take “cases” to mean all models and sample size conditions (so, 7400 x 493 cases), there is a very low proportion of “cases” in which the two methods pick different models (around 0.5 percent). However, this is not a particularly meaningful metric, since at a large enough sample size, both methods essentially always pick the same (correct) model. We note this latter point explicitly on lines 363 - 365. “Finally, why does the curve get closer and closer to zero? The reason is that the standard error converges to zero with (the square root of) the number of observations, and therefore the probability that both rules select the same model approaches 1 as $n$ goes to infinity. “ If we instead take “cases” to mean only models, we would ask: For how many of the 7400 models do the 1SE and lowest PE rules choose different models at some n? The answer to this question is that in all but 2 “cases” there is some value of EEcomp unequal to zero. We now note this on line 339-340. 
“Note that for all but 2 of the 7400 models there is some $n$ at which the two decision rules in question choose a different model.” To reflect this discussion we now more clearly specify the state of affairs regarding cases where the 1SER and lowest prediction error rules differ in the discussion section, and use this to offer additional advice to researchers in practice (lines 413 - 419): “Our simulations also showed that as $n \\to \\infty$, both decision rules converged to selecting the same model. This means that there is a relatively small range of sample sizes in which these decision rules lead to contradictory model selections for a given data-generating system. We recommend that researchers wishing to use prediction error to choose between these models utilize both the 1SER and lowest prediction error rules, and in cases of conflict between the two, use the 1SER for low ($n<60$) sample sizes.” \\textbf{Comment 7} \\begin{displayquote} The authors mentioned mixed models – some recent simulation work on DSEM should be cited, which have shown satisfying estimation results for VAR. Furthermore, the authors should briefly discuss the subgroup/mixture approach when there are distinct subgroups of time series patterns (e.g., GIMME). \\end{displayquote} Unfortunately, no simulation studies that we are aware of have examined the performance of DSEM in recovering mixed VAR models. The only relevant paper we know of \\cite{schultzberg2018number} considers only AR(1) models, and the focus of their investigation is on the recovery of fixed effects, rather than individual-specific parameters as is the focus in the n=1 analyses examined in this paper. We agree that GIMME is an interesting approach, but it posits a much more general model than considered here (including contemporaneous directed relationships) and again a group-level structure not present in n=1 analyses. 
We agree however that simulation studies using mixed VAR models and comparing this approach to GIMME would be an interesting future line of research, particularly when the target of inference is the individual-specific parameters. We have extended our discussion of future studies to include this, (lines 513 - 519): “Finally, it would be useful to study the performance of mixed VAR models in a simulation setting, and perhaps compare this approach to alternative methods of using group-level information in individual time-series analysis, such as GIMME, an approach originally developed for the analysis of brain data [17]. Early simulation studies have assessed the performance of mixed AR models in recovering fixed effects using Bayesian estimation techniques [18], but these analyses have yet to be extended to mixed VAR models or the recovery of individual-specific random effects.” \\textbf{Comment 8} \\begin{displayquote} When referring to the mixed effects examined in Bulteel et al. (2018), at least for the first time (page 2 line 40), it would be helpful to clarify it refers to multilevel model with random effects. \\end{displayquote} This is now clarified in text: “Although the latter statement implies that the estimation error of mixed AR and mixed VAR models are similar, Bulteel et al.[1] conclude that ``[...] it is not meaningful to analyze the presented typical applications with a VAR model'' (p. 14) when discussing both mixed effects (i.e., multilevel models with random effects) and single-subject models.” \\textbf{Comment 9} \\begin{displayquote} On page 4 line 123, it should be Figure 6 in the supplementary materials. \\end{displayquote} This has been changed to refer to the Supporting Information throughout \\textbf{Comment 10} \\begin{displayquote} On page 7 line 201, two “have”s; line 202, two “the”s. 
\\end{displayquote} This has been fixed \\newpage \\Large \\textbf{Reviewer 2} \\normalsize \\textbf{Comment 1} \\begin{displayquote} Page 3 and 4: I appreciate the novel methods the authors used to generate their simulated data through the use of parameters, R & D. However, I am concerned that this method introduces artifacts into the sampling scheme, due to the fact that there is a correlation between R & D (as shown in Figure 6). Thus, it seems that there would be bias in the models generated with this technique. In general, although the authors provide some justification for using R & D, it would be helpful for the author to provide further explanation of their parameterization methods in light of this correlation. In particular, it seems that this correlation may be artificially induced by the authors’ definition of R & D. For example, a theorem from linear algebra states that the sum of the eigenvalues of a matrix (i.e., D) is equal to the sum of its diagonal elements (i.e., it’s trace, in this case the AR parameters included in the numerator of R). Hence, the numerator of R is essentially D. This suggests that the R-D parameterization is likely responsible for the correlation in the simulation samples. I recommend that the authors acknowledge this in their description of their parameterization methods. Additionally, I recommend that they examine the correlation between R & D to demonstrate that this correlation is sufficiently low so as to not overly bias the simulation data. Finally, I strongly suggest that authors reformulate R so that it is free from the influences of this correlation, such as by using the current denominator of R. This would allow for the modeling of autoregressive effects (i.e., D) and cross-lagged effects (i.e., denominator of R), independently. \\end{displayquote} We thank the reviewer for this comment. 
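The linear-algebra identity invoked in this comment can be checked numerically. The following minimal Python sketch is an illustration only: the matrix is randomly generated rather than taken from the study, and the summaries `D` and `O` follow the verbal definitions given in the reply below (mean absolute diagonal and mean absolute off-diagonal element of the lagged parameter matrix). The value ranges are hypothetical and chosen so that the matrix is stationary.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6                                             # number of variables

# Hypothetical lagged parameter matrix: small cross-lagged effects,
# positive auto-regressive effects on the diagonal.
Phi = rng.uniform(-0.08, 0.08, size=(p, p))
np.fill_diagonal(Phi, rng.uniform(0.1, 0.5, size=p))

eigvals = np.linalg.eigvals(Phi)

# The trace (sum of diagonal elements) equals the sum of the eigenvalues.
trace_equals_eigsum = bool(np.isclose(np.trace(Phi), eigvals.sum().real))

# Stationarity of a VAR(1) process requires all |lambda| < 1.
is_stationary = bool(np.max(np.abs(eigvals)) < 1)

off_mask = ~np.eye(p, dtype=bool)                 # selects off-diagonal entries
D = float(np.mean(np.abs(np.diag(Phi))))          # mean absolute auto-regressive effect
O = float(np.mean(np.abs(Phi[off_mask])))         # mean absolute cross-lagged effect

print(trace_equals_eigsum, is_stationary, round(D, 3), round(O, 3))
```

Note that `D` coincides with the mean eigenvalue here only because all diagonal elements in this example are positive; `O` uses none of the diagonal elements, which is what removes the dependence the reviewer describes.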
On reflection we agree that these were not the optimal dimensions to choose when sampling lagged parameter matrices, for the reasons outlined. We have changed the R dimension to refer to the average absolute cross-lagged parameter value as suggested (now denoted O), and re-ran the simulations accordingly. We have also clarified in text that D should be interpreted as the average auto-regressive parameter (which is equivalent to the average eigenvalue). See lines 102 - 111 (pages 3-4) for changes to the definition, and other changes to the results of our simulation throughout: “The first characteristic is based on the size of the auto-regressive effects, that is, the absolute values of the diagonal elements of the lagged parameter matrix ($\\Phi_{ii}$) which encode the relationship between a variable and itself at the next time point. We summarize the information contained in these diagonal elements by taking the mean of their absolute values $D$, given as [...] Note here that taking the sum of auto-regressive parameters is equivalent to taking the sum of the eigenvalues of $\\Phi$, denoted $\\lambda$. To ensure stationarity, only $\\Phi$ matrices with $|\\lambda| < 1$ are included in our analysis [10]. The second characteristic is based on the size of the cross-lagged parameters ($\\Phi_{ij}, i \\neq j$), encoding the relationships between different processes. We again summarize this information by taking the mean of the absolute values of these parameters, denoted $O$ and given as [...] We expect that true VAR models with a high $D$ value and small $O$ value (i.e., large auto-regressive effects and small cross-lagged effects) result in a low estimation error for AR models, since these VAR models are very similar to an AR model. 
In contrast, if $O$ is high, we expect that the estimation error of the AR model is large, because it sets the large cross-lagged effects in the true VAR model to zero.” The main results of our paper do not change, though it is now clearer that the mean absolute off-diagonal element ($O$) largely determines the size of $n_e$. The weighted median $n_e$ is now 89, slightly lower than the value of 92 obtained in the previous simulation. We have updated Figure 2 accordingly, and now describe the results as follows (lines 185 - 208): ``Above we suggested that the relative performance of AR and VAR models (quantified by $\\text{EE}_\\text{Diff}$) depends on the characteristics $D$ and $O$ of the true VAR parameter matrix. In Figure 2 (a) we show the median (across models in cells) $n$ at which the estimation error of VAR becomes smaller than the estimation error of AR (i.e., $\\text{EE}_\\text{Diff} > 0$). We see that the larger the average off-diagonal elements $O$, the lower the $n$ at which VAR outperforms AR. This is what one would expect: when $O$ is small (as indicated by the lowest rows of cells in Figure 2 (a)), the true VAR model is actually very close to an AR model. In such a situation, the bias introduced by the AR model by setting the off-diagonal elements to zero leads to a relatively small estimation error. This trade-off between a simple model with high bias but low variance and a more complex model with low bias but high variance is well-known in the statistical literature as the \\textit{bias-variance trade-off} \\cite{hastie2009elements}. It therefore takes a considerable number of observations until the variance of the VAR estimates becomes small enough to outperform the AR model. When $O$ is large (indicated by the upper rows of cells), the bias of the AR model leads to comparatively larger estimation error. 
Finally, we can also see that the size of the diagonal elements $D$ is not as critical in determining $n_e$ as the size of the off-diagonal elements: Picking any row of cells in Figure 2 (a), we can see that there is only a very small variation across columns, with larger $D$ values appearing to lead to very slight decreases in $n_e$ in general. Note that the $O$ characteristic also largely explains the vertical variation of the estimation error curves shown in Figure 1 (b): the curves on top (small $n_\\text{e}$) have high $O$, while the curves at the bottom (large $n_\\text{e}$) have low $O$. Figure 2 (b) collapses across these values and illustrates the sampling distribution of $n_e$, taking into account the likelihood of any particular VAR matrix (as specified by the mixed model estimated from the ``MindMaastricht'' data).'' \\textbf{Comment 2} \\begin{displayquote} I think it may be useful for the authors to provide more recommendations for the design of psychological time series studies based on their data. In other words, are there suggestions for how applied researchers should implement these findings? For example, do these results support the recommendation of collecting more observations in general? [...] Line 385: In regards to comparing the 1SER rule versus selecting the model with the lower prediction error, what should applied researchers take away from these results if they are working with data with n $>$ 60? \\end{displayquote} We should note that our paper largely focuses on the distinction between AR and VAR models in single-subject time series, and the design of psychological time series studies is of course a much broader topic than we can hope to comprehensively address in this paper. However, we can make some rather specific recommendations within the scope of what we have examined. 
First, the average sample size needed for VAR to outperform AR models is $n = 89$, but this provides only a very rough guideline for the sample sizes researchers should aim for. Crucially, we see a very large degree of variation around this value, depending on the size of the off-diagonal elements. Thus, knowledge or researcher expectations about the underlying system play a crucial role in choosing a sufficient sample size. Second, based on our analysis of the 1SER and lowest prediction error decision rules, we can recommend that, in cases where the two decision rules pick different models, researchers should use the 1SER for low sample sizes. We have made these recommendations more explicit in text, both on lines 411-419: “In addition, we show via simulation that the 1SER performs better than selecting the model with the lowest prediction error for $n<60$, in cases where those decision rules select conflicting models. Our simulations also showed that as $n \\to \\infty$, both decision rules converge to selecting the same model. This means that there is a relatively small range of sample sizes in which these decision rules lead to contradictory model selections for a given data-generating system. We recommend that researcher wishing to use prediction error to choose between these models utilize both the 1SER and lowest prediction error rules, and in cases of conflict between the two, use the 1SER for low ($n<60$) sample sizes.” And in addition on lines 529- 545: “To sum up, we used simulations to study the relative performance of AR and VAR models in settings typical for psychological applications. We were able to make clear statements about the average performance of VAR models, which showed that, on average, we need sample sizes approaching $n = 89$ for single-subject VAR models to outperform AR models. 
While this may seem like a relatively large sample size requirement, such longer time series are becoming more common in psychological research \\cite{wichers2016critical, helmich2020sudden} and mixed models may allow for acceptable performance for shorter time series, though much research on that topic is still required. Importantly, we also found the variance around this average sample size to be considerable, with the variation largely a function of the average absolute value of the off-diagonal (i.e. cross-lagged) effects. Decomposing this variance showed that (i) one cannot expect reliable statements with respect to the relative performance of the AR and VAR models that ignore the characteristics of the generating model, and (ii) that choosing reliably between AR and VAR models is difficult for most sample sizes typically available in psychological research. Finally, we provided a theoretical explanation for when the ``1 Standard Error Rule'' outperforms simply selecting the model with lowest prediction error, and showed that the 1SER performs better when $n$ is small.” \\textbf{Comment 3} \\begin{displayquote} Lines 443-455 refer to several theoretical points about choosing between VAR and AR models under the condition of equal estimation error. Given that applied researchers may want to select one model over the other for hypothesis-testing reasons (e.g., testing the AR effect of mood versus including the cross-lagged effect of anxiety on mood), could you provide clarification on whether an applied researcher would be able to test for estimation error equivalence using empirical data? If that is not possible, I believe it may be helpful to state this explicitly. \\end{displayquote} We agree with the point that applied researchers may not necessarily be primarily interested in general estimation error, but instead in, for instance, the ability to correctly identify non-zero cross-lagged effects. This was also a point raised by Reviewer 1. 
To address this we have added a clarification in the discussion (paragraph on choosing between AR and VAR as extremes, lines 447 - 452, see Reviewer 1 comment #1 response for added text). We suggest alternative approaches if researchers are interested primarily in cross-lagged effects, and possibilities for future studies to investigate this issue. With regard to testing for estimation error equivalence using empirical data: indeed that is not possible, as one can only try to evaluate prediction error equivalence. We clarify that this is the reason we investigate prediction error in the first place by making changes to lines 225 - 228: “In the previous section, we directly investigated the estimation errors of the AR and the VAR model in typical psychological applications and showed that the n at which VAR becomes better than AR depends substantially on the characteristics of the true model. In practice, the true model is unknown, so we can neither look up the n at which VAR outperforms AR in the above simulation study, nor can we compute the estimation error on the data at hand. Thus, to select between these models in practice, we may choose to use the prediction error which we can approximate using the data at hand, for instance by using a cross-validation scheme as suggested by Bulteel et al. [1].” We also address what researchers should do in practice in different sample size conditions at the end of the discussion, which we have outlined in the response to the previous comment of this reviewer. \\textbf{Comment 4} \\begin{displayquote} Line 173: Could you clarify what is meant by specifying the data generating model and how a researcher would do this using empirical data? 
\\end{displayquote} In this statement we are referring to the results of our simulation study, which show a) that $n_e$ depends substantially on the particular set of lagged parameter values in the data-generating model, b) that the variation in EE across data-generating models is much larger than the variation across replications of the same data-generating model. As such, although it is difficult to make statements about the sample size necessary for the VAR model to outperform the AR model in general, if one has information about the parameters of the data-generating model, one can make much more precise statements about the sample size necessary for the VAR model to outperform the AR model. We have changed this statement to more clearly communicate this (lines 172 - 176): “However, we see that the sampling variation across replications is smaller than the variation across VAR models for most n. This means that if one has information about the parameters of the data-generating model, one can make much more precise statements about the sample size necessary for the VAR model to outperform the AR model” Of course, it is not possible to specify the data generating model based on a given empirical dataset: But if researchers are trying to determine an acceptable minimum sample size before data collection, it is probably necessary for them to specify their beliefs about the structure of the data-generating model (such as the expected size of auto-regressive and cross-lagged parameters) to do so in any meaningful way. We explore this further in the analysis which follows the aforementioned statement, for instance in Figure 2 (a). \\textbf{Comment 5} \\begin{displayquote} Line 509: I recommend rephrasing this sentence to specify that the relative performance of AR and VAR models were studied using simulations of data generated from typical psychological applications. \\end{displayquote} We agree and this has been changed. 
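The reply to Comment 4 above notes that a researcher who specifies beliefs about the data-generating model can make more precise statements about the required sample size. A minimal Python sketch of that idea follows. It is an assumption-laden illustration, not the study's code: estimation error is taken to be the squared Frobenius distance between estimated and true parameter matrices, $\text{EE}_\text{Diff}$ is taken to be the AR error minus the VAR error, models are fitted by least squares without intercepts, and the example matrix, sample-size grid and number of replications are hypothetical.

```python
import numpy as np

def simulate_var1(Phi, n, rng, burn=100):
    """Simulate n observations from x_t = Phi x_{t-1} + e_t, e_t ~ N(0, I)."""
    p = Phi.shape[0]
    x = np.zeros(p)
    out = np.empty((n, p))
    for t in range(burn + n):
        x = Phi @ x + rng.standard_normal(p)
        if t >= burn:
            out[t - burn] = x
    return out

def fit_var1(Y):
    """Least-squares VAR(1) estimate without intercept."""
    B, *_ = np.linalg.lstsq(Y[:-1], Y[1:], rcond=None)
    return B.T

def fit_ar1(Y):
    """Separate AR(1) regressions, returned as a diagonal matrix."""
    X, Z = Y[:-1], Y[1:]
    return np.diag((X * Z).sum(axis=0) / (X * X).sum(axis=0))

def estimation_error(Phi_hat, Phi):
    """Squared Frobenius distance to the true matrix (assumed error metric)."""
    return float(np.sum((Phi_hat - Phi) ** 2))

def mean_ee_diff(Phi, n, reps, rng):
    """Average of EE(AR) - EE(VAR) over replications; > 0 means VAR is better."""
    diffs = []
    for _ in range(reps):
        Y = simulate_var1(Phi, n, rng)
        diffs.append(estimation_error(fit_ar1(Y), Phi)
                     - estimation_error(fit_var1(Y), Phi))
    return float(np.mean(diffs))

def first_n_where_var_wins(Phi, n_grid, reps, rng):
    """Smallest n on the grid at which the mean EE difference is positive."""
    for n in n_grid:
        if mean_ee_diff(Phi, n, reps, rng) > 0:
            return n
    return None   # VAR did not outperform AR anywhere on the grid

rng = np.random.default_rng(89)
Phi_assumed = np.full((4, 4), 0.15)       # hypothetical researcher beliefs
np.fill_diagonal(Phi_assumed, 0.3)

n_star = first_n_where_var_wins(Phi_assumed, n_grid=(30, 60, 120, 240),
                                reps=100, rng=rng)
print("approximate n at which VAR starts to outperform AR:", n_star)
```

Repeating this for matrices with smaller or larger off-diagonal elements gives a rough sense of how strongly the answer depends on the assumed cross-lagged effects.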
Submitted filename: Response_to_Reviewers.pdf 21 Sep 2020 PONE-D-20-03592R1 Choosing between AR(1) and VAR(1) Models in Typical Psychological Applications PLOS ONE Dear Dr. Haslbeck, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please, attend the minor suggestion from both reviews. Please submit your revised manuscript by Nov 05 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. 
Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Miguel Angel Sánchez Granero Academic Editor PLOS ONE [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed Reviewer #2: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. 
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors are very responsive to my previous comments and have addressed them well. I thank the authors for another contribution to the literature. One tiny new comment: The authors said on page 8 that "for each of the 6000 VAR models described in the previous section" below "Assessing ngap through simulation." I may have missed it but I only recall the 7400 models the authors mentioned previously. Reviewer #2: The authors present results from a series of simulation studies examining the performance of AR and VAR models. Results assist the reader in determining which model structure (i.e., AR versus VAR) to use when modeling n=1 time series data. I appreciate the efforts the authors have undertaken to revise the manuscript. My very minor suggestion is to change “researcher” to “researchers” in line 416. No further recommendations. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). 
If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 30 Sep 2020 Dear Editor, We are happy to submit a revised version of our manuscript, in which we addressed the two minor comments of the two reviewers. We also made a number of small textual improvements, which did not change any of the content of the manuscript. Kind regards, Jonas Haslbeck 2 Oct 2020 Choosing between AR(1) and VAR(1) Models in Typical Psychological Applications PONE-D-20-03592R2 Dear Dr. Haslbeck, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. 
When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Miguel Angel Sánchez Granero Academic Editor PLOS ONE Additional Editor Comments (optional): Please, follow Reviewer 2 suggestion: My very minor suggestion is to change “researcher” to “researchers” in line 416 (now line 425). 19 Oct 2020 PONE-D-20-03592R2 Choosing between AR(1) and VAR(1) Models in Typical Psychological Applications Dear Dr. Haslbeck: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. 
If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Miguel Angel Sánchez Granero Academic Editor PLOS ONE

11 in total

1. Critical Slowing Down as a Personalized Early Warning Signal for Depression.

Authors: Marieke Wichers; Peter C Groot
Journal: Psychother Psychosom Date: 2016-01-26 Impact factor: 17.659

2. The Gaussian Graphical Model in Cross-Sectional and Time-Series Data.

Authors: Sacha Epskamp; Lourens J Waldorp; René Mõttus; Denny Borsboom
Journal: Multivariate Behav Res Date: 2018-04-16 Impact factor: 5.923

3. VAR(1) based models do not always outpredict AR(1) models in typical psychological applications.

Authors: Kirsten Bulteel; Merijn Mestdagh; Francis Tuerlinckx; Eva Ceulemans
Journal: Psychol Methods Date: 2018-05-10

4. Mindfulness training increases momentary positive emotions and reward experience in adults vulnerable to depression: a randomized controlled trial.

Authors: Nicole Geschwind; Frenk Peeters; Marjan Drukker; Jim van Os; Marieke Wichers
Journal: J Consult Clin Psychol Date: 2011-10

5. Capturing the risk of persisting depressive symptoms: A dynamic network investigation of patients' daily symptom experiences.

Authors: Robin N Groen; Evelien Snippe; Laura F Bringmann; Claudia J P Simons; Jessica A Hartmann; Elisabeth H Bos; Marieke Wichers
Journal: Psychiatry Res Date: 2018-12-09 Impact factor: 3.222

6. Exploring the idiographic dynamics of mood and anxiety via network analysis.

Authors: Aaron J Fisher; Jonathan W Reeves; Glenn Lawyer; John D Medaglia; Julian A Rubel
Journal: J Abnorm Psychol Date: 2017-11

7. Emotion-Network Density in Major Depressive Disorder.

Authors: Madeline Lee Pe; Katharina Kircanski; Renee J Thompson; Laura F Bringmann; Francis Tuerlinckx; Merijn Mestdagh; Jutta Mata; Susanne M Jaeggi; Martin Buschkuehl; John Jonides; Peter Kuppens; Ian H Gotlib
Journal: Clin Psychol Sci Date: 2014-08-04

8. Sudden gains in day-to-day change: Revealing nonlinear patterns of individual improvement in depression.

Authors: Marieke A Helmich; Marieke Wichers; Merlijn Olthof; Guido Strunk; Benjamin Aas; Wolfgang Aichhorn; Günter Schiepek; Evelien Snippe
Journal: J Consult Clin Psychol Date: 2020-02

9. The Impact of Treatments for Depression on the Dynamic Network Structure of Mental States: Two Randomized Controlled Trials.

Authors: Evelien Snippe; Wolfgang Viechtbauer; Nicole Geschwind; Annelie Klippel; Peter de Jonge; Marieke Wichers
Journal: Sci Rep Date: 2017-04-20 Impact factor: 4.379

10. A network approach to psychopathology: new insights into clinical longitudinal data.

Authors: Laura F Bringmann; Nathalie Vissers; Marieke Wichers; Nicole Geschwind; Peter Kuppens; Frenk Peeters; Denny Borsboom; Francis Tuerlinckx
Journal: PLoS One Date: 2013-04-04 Impact factor: 3.240

1 in total

Review 1. A Review of Explicit and Implicit Assumptions When Providing Personalized Feedback Based on Self-Report EMA Data.

Authors: IJsbrand Leertouwer; Angélique O J Cramer; Jeroen K Vermunt; Noémi K Schuurman
Journal: Front Psychol Date: 2021-12-08
