
Evaluating Model Fit in Bayesian Confirmatory Factor Analysis With Large Samples: Simulation Study Introducing the BRMSEA.

Huub Hoofs¹,², Rens van de Schoot³,⁴, Nicole W H Jansen¹, IJmert Kant¹.

Abstract

Bayesian confirmatory factor analysis (CFA) offers an alternative to frequentist CFA based on, for example, maximum likelihood estimation for the assessment of reliability and validity of educational and psychological measures. For increasing sample sizes, however, the applicability of current fit statistics evaluating model fit within Bayesian CFA is limited. We propose, therefore, a Bayesian variant of the root mean square error of approximation (RMSEA), the BRMSEA. A simulation study was performed with variations in model misspecification, factor loading magnitude, number of indicators, number of factors, and sample size. This showed that the 90% posterior probability interval of the BRMSEA is valid for evaluating model fit in large samples (N≥ 1,000), using cutoff values for the lower (<.05) and upper limit (<.08) as guideline. An empirical illustration further shows the advantage of the BRMSEA in large sample Bayesian CFA models. In conclusion, it can be stated that the BRMSEA is well suited to evaluate model fit in large sample Bayesian CFA models by taking sample size and model complexity into account.


Keywords: Bayesian procedures; factor analysis; model fit; simulation; validity

Year: 2017 PMID: 30034027 PMCID: PMC6041765 DOI: 10.1177/0013164417709314

Source DB: PubMed Journal: Educ Psychol Meas ISSN: 0013-1644 Impact factor: 2.821

Introduction

Educational and psychological measures often include multiple indicators consisting of items from a questionnaire, a set of observations, or results from an interactive application. These indicators are believed to represent (multiple) latent factor(s) which are not directly observable. The Classroom Assessment Scoring System Toddler (CLASS; Pianta, Hamre, & La Paro, 2008), for example, combines observations on different domains to provide an indication of the educational and emotional quality in the classroom. Confirmatory factor analysis (CFA) plays an important role in the assessment of the reliability and validity of such measures (DiStefano & Hess, 2005). With CFA, the underlying theoretical framework of an instrument can be assessed providing a transparent and theoretical description of its (psychometric) properties (e.g., Kline, 2011). As such CFA gives insight in, for example, the relation between indicators and the latent factor(s), the (hierarchical) factor structure, and potential interdependencies between indicators of educational and psychological measures. Besides these aspects, CFA can also assess the validity of an instrument across groups and over time. This aspect, known as measurement equivalence/invariance (ME/I), indicates if an instrument measures the same (latent) construct across different populations or settings (Millsap, 2011; Van de Schoot, Schmidt, De Beuckelaer, Lek, & Zondervan-Zwijnenburg, 2015). As such CFA plays an important role within the development, validation, and application of most measurement instruments. While CFA is classically performed within a frequentist framework, recent decades have seen a strong increase in the use of the Bayesian framework to estimate CFA (Van de Schoot, Winter, Ryan, Zondervan-Zwijnenburg, & Depaoli, 2016). 
Within large samples with normally distributed data that are not affected by a high proportion of outliers or missingness, Bayesian CFA and frequentist CFA have roughly the same results (Scheines, Hoijtink, & Boomsma, 1999). Bayesian CFA can however offer several advantages over the frequentist approach such as computational advantages and intuitive interpretation of the results (Muthén & Asparouhov, 2012; Van de Schoot et al., 2014). Bayesian CFA also enables new modeling approaches (Muthén & Asparouhov, 2012), such as approximate invariance (i.e., alignment; Muthén & Muthén, 2013; Van de Schoot et al., 2013). Researchers can, furthermore, incorporate background knowledge into their analyses, through the specification of prior information (e.g., Van de Schoot et al., 2014). As such, Bayesian CFA can “simply” be used as a different estimator, but it can also provide access to CFA models that are not feasible within a frequentist framework (Kaplan & Depaoli, 2012). While the application of Bayesian CFA is on the rise, some issues warrant further research. One of these aspects is the objective assessment of overall model fit within large samples. While current measures for model fit within Bayesian CFA show positive properties within studies with small samples, within large samples, surpassing 1,000 subjects, the sensitivity of the overall fit statistic to detect negligible differences between the observed data and the hypothesized model is high (Hoijtink & Van de Schoot, 2017). Within empirical settings, in which negligible deviations from the hypothesized model are always expected, an increase in sample size inevitably leads therefore to a deterioration of model fit (MacCallum, 2003). That is, acceptance rates of models with a “small” misspecification (e.g., nonspecified negligible cross-loading) decrease with increasing sample size (Asparouhov & Muthén, 2010). 
For applied research this makes it difficult to objectively assess, interpret, and communicate the quality of the model. Consider, for example, that the CLASS would be compared across different countries with a large number of measurements per country in the study. As within empirical studies discrepancy between the hypothesized and observed model is expected, this would result in a deteriorated model fit. This could result in false conclusions with regard to the validity and application of the instrument across countries. Although overall model fit is not synonymous with model quality, it constitutes an important and integral part of it (Bentler, 2007; Millsap, 2007). The current study introduces and validates a fit measure, the Bayesian root mean square error of approximation (BRMSEA), which is less sensitive to large sample sizes. This could improve assessment of overall model fit within Bayesian CFA with large samples, enhancing application of this framework to provide insight regarding the reliability and validity of measurement instruments. Evaluation of the model fit within Bayesian CFA relies on the validity of the model for future observations (Kaplan & Depaoli, 2012). To simulate such future observations, replications of the observed data are generated (Levy, 2011). The χ² for the observed and replicated (or updated) data is subsequently computed for each iteration within the Markov chain (Levy, 2011). Within Bayesian CFA the posterior predictive p value (ppp) is the proportion of iterations for which the replicated χ² exceeds the observed χ² (for other implementations of the ppp, see Gelman, Carlin, Stern, & Rubin, 2014; Lee, 2007). A “good” fit is indicated if the ppp is around .50 (Gelman et al., 2014; Muthén & Asparouhov, 2012). The ppp is found to be robust for assessing model fit within small samples (Asparouhov & Muthén, 2010; Lee & Song, 2004; Rupp, Dey, & Zumbo, 2004). 
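The ppp described above can be illustrated with a minimal Python sketch. It is not the authors' implementation; the function name and the χ² values are hypothetical, and it assumes that one observed and one replicated χ² are available for every post-burn-in iteration.

```python
def posterior_predictive_p(chi2_obs, chi2_rep):
    """Proportion of iterations in which the replicated chi-square
    exceeds the observed chi-square (hypothetical helper)."""
    if len(chi2_obs) != len(chi2_rep) or not chi2_obs:
        raise ValueError("need one observed and one replicated value per iteration")
    exceed = sum(1 for obs, rep in zip(chi2_obs, chi2_rep) if rep > obs)
    return exceed / len(chi2_obs)


# Hypothetical chi-square values for four post-burn-in iterations.
chi2_obs = [20.1, 18.4, 22.7, 19.9]
chi2_rep = [21.0, 17.2, 25.3, 18.0]
ppp = posterior_predictive_p(chi2_obs, chi2_rep)
print(ppp)
```

In this toy example the replicated χ² exceeds the observed χ² in two of the four iterations, so the ppp lies at the value that the text associates with “good” fit.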
It is especially through these characteristics, including the use of priors, that Bayesian CFA works so well in small samples as it is not based on large-sample theory. For large samples, it seems however that the ppp becomes sensitive to trivial deviations from the hypothesized model (Hoijtink & Van de Schoot, 2017). A simulation study by Asparouhov and Muthén (2010) showed, for example, that although the ppp is more robust than the p value within frequentist CFA for models with a “minor” misspecification, its rejection rates still increase with sample size. In this study, a “minor” misspecification was defined as the omission of standardized cross-loadings smaller than .1 within a CFA. Rejection rates increased with increasing sample sizes (N = 300, 500, and 1,000) both for frequentist CFA (19%, 21%, and 44%, respectively) and Bayesian CFA (6%, 12%, and 29%, respectively). While studies within Bayesian CFA regarding this phenomenon, or the functioning of model fit in general, are underrepresented (Levy, 2011; Rindskopf, 2012), it seems that the sensitivity of the ppp to detect negligible differences within large samples approaches 1.0. As such, it seems the ppp is well suited for studies with small to moderate samples, but loses its salience within studies using large samples. To resolve this problem within frequentist CFA, fit indices are frequently used (Bentler, 1990; Kline, 2011). Fit indices provide, on a continuous scale, a quantitative measure of model fit. In general terms, it can be stated that they provide a credibility check of models while taking into account the overall and specific discrepancy between the model and the population (MacCallum, 2003). The first criterion for such fit indices is that they should not be penalized for an increasing sample size (Marsh, Balla, & McDonald, 1988). 
The second criterion is the correction for model complexity to ensure that there is no free lunch regarding the inclusion of extra parameters, which always improves model fit (Browne & Cudeck, 1992). Fit indices express either goodness or badness of fit. In the former, a higher value (often toward 1) indicates a better fitting model while in the latter, a lower value (often toward 0) indicates a better fitting model (West, Taylor, & Wu, 2012). To facilitate the interpretation of fit indices, cutoff values have been proposed that indicate “good,” “acceptable,” and “poor” fit (Browne & Cudeck, 1992; Hu & Bentler, 1999). It should be noted that there is a long-standing and ongoing discussion about fit indices (e.g., Barrett, 2007). This debate not only focuses on the reliance on indicative thresholds (or cutoff points) as golden rules but also on the neglect of the predictive quality of the models and the negligence with respect to a significant χ² (e.g., Fan & Sivo, 2007; Marsh, Hau, & Wen, 2004; McDonald & Ho, 2002). In line with Lai and Green (2016), quotation marks are therefore used in the present article for quantifications of model fit (e.g., “good” model) and misspecification (e.g., “large” misspecification) to indicate the ambiguity of such qualifications. Notwithstanding theoretical and statistical criticisms, fit indices can, however, play a crucial, but not solitary, role in the assessment of model quality as qualitative judgment about the overall model fit (e.g., Bentler, 2007; Kline, 2011; Millsap, 2007; Yuan, 2005). Without such quantifications, the judgment of model quality within large sample Bayesian CFA models relies almost solely on subjective measures. Thresholds provide a standard (which is ambiguous by nature) enabling transparent assessment and communication of model quality. 
A fit index which is robust to an increased sample size is therefore crucial as it would lead to an improved understanding of model fit and accessibility for Bayesian CFA within large samples (e.g., Cieciuch, Davidov, Schmidt, Algesheimer, & Schwartz, 2014; Milojev, Osborne, Greaves, Barlow, & Sibley, 2013; Lung, Chiang, Lin, Shu, & Lee, 2011). Assessing model quality in such samples would be greatly enhanced by a fit index which is informative within large samples. The present article is the first to explore whether the rationale of such a fit index (i.e., the RMSEA) can be applied within Bayesian CFA (i.e., the BRMSEA) to provide a valid evaluation of model fit within large samples. The motivation to implement the rationale of the RMSEA within Bayesian CFA is threefold. First, within frequentist CFA the RMSEA has been shown to work especially well with large samples (Chen, Curran, Bollen, Kirby, & Paxton, 2008; Curran, Bollen, Chen, Paxton, & Kirby, 2003; MacCallum, Browne, & Sugawara, 1996), which is exactly the area in which the ppp becomes less useful. Second, the RMSEA is an absolute fit index and therefore does not require a baseline, or empty, model (Steiger & Lind, 1980; West et al., 2012). Such a baseline model would be at odds with the Bayesian framework regarding the inclusion of prior knowledge of the model. Third, the RMSEA enables the computation of a confidence interval (CI) which provides information regarding the trustworthiness of the model fit (Browne & Cudeck, 1992; Steiger, 1990, 2000). This enhances comparability as this corresponds to the approach within the Bayesian framework of reporting posterior probability intervals (PPIs; Van de Schoot et al., 2014). Although not mathematically equivalent, the PPI and the CI serve related inferential goals. These aspects support the implementation of the BRMSEA as a fit index within Bayesian CFA. 
Additionally, the BRMSEA should also function in accordance with the prior specification of a model, as this influences the overall fit and complexity of a model (Spiegelhalter, Best, Carlin, & Van Der Linde, 2002). Correct and informative priors should therefore positively affect the BRMSEA and vice-versa. It is hypothesized that the BRMSEA accurately assesses model fit in Bayesian CFA within large samples while the ppp in contrast loses its salience for such samples.

Technical Background of the RMSEA and the BRMSEA

Background of the RMSEA

Throughout the Technical Background a parameter with a hat (^) indicates the estimation of a population parameter. The RMSEA stems from the work by Steiger and Lind (1980) who explored the fit of a model, derived from a sample, in relation to the fit of the model in the true population. The fit (statistic) of a model within the population is defined as $F_0$. If a model does not show perfect fit, which is to be expected in empirical settings, an estimate of $F_0$ has to be derived ($\hat{F}_0$). Browne and Cudeck (1992) argue that the sample fit ($\hat{F}$) of a model can be used to estimate the fit statistic ($\hat{F}_0$):

$$\hat{F}_0 = \hat{F} - \frac{d}{N} \qquad (1)$$

in which d is the number of free parameters and N the sample size. Equation (1) is under the assumption that $\hat{F}_0$ indicates the degree of lack of fit taking into account the lack of fit arising due to sampling error. As such this takes the number of free parameters and the sample size into account to estimate the misfit of a model in the population. Browne and Cudeck (1992) further state that the model fit of a population decreases if free parameters (q) are added. These two premises result in a measure of discrepancy of the model per free parameter (ε; Browne & Cudeck, 1992), defined as

$$\varepsilon = \sqrt{\frac{F_0}{d}} \qquad (2)$$

which prefers parsimonious models. That is, if two models have the same fit within the population the model with fewer estimated parameters will yield a smaller value (MacCallum et al., 1996). To estimate ε, the $F_0$ in Equation (2) can be substituted with $\hat{F}_0$ from Equation (1) as

$$\hat{\varepsilon} = \sqrt{\frac{\hat{F} - d/N}{d}} \qquad (3)$$

As it is possible that the numerator is negative, an added condition is that if $\hat{F} < d/N$ the $\hat{\varepsilon}$ is set to zero. This results in a theoretical range of $\hat{\varepsilon}$ from zero to infinity in which a value of zero denotes a perfect fitting model, whereas larger values reflect a poor model fit (badness-of-fit).
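The estimator described in this section, in the form given by Browne and Cudeck (1992), can be sketched in a few lines of Python. The function name and the input values are hypothetical; d is used as in the text above and plays the role that df takes in the frequentist RMSEA.

```python
import math


def epsilon_hat(f_hat, d, n):
    """Estimated discrepancy per d, truncated at zero.

    f_hat : sample discrepancy (sample fit)
    d     : the d of the text (takes the role of df in the RMSEA)
    n     : sample size N
    """
    f0_hat = max(f_hat - d / n, 0.0)  # estimated population misfit, not below 0
    return math.sqrt(f0_hat / d)


# Hypothetical values: sample fit of 0.30, d = 54, N = 1,000.
print(round(epsilon_hat(0.30, 54, 1000), 4))
# A sample fit smaller than d / N is truncated to a perfect fit of zero.
print(epsilon_hat(0.01, 54, 1000))
```

The truncation at zero mirrors the added condition in the text: sampling error alone is expected to produce a discrepancy of about d/N, so smaller values carry no evidence of misfit.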

Implementation of the RMSEA Within Frequentist Confirmatory Factor Analysis

Within the frequentist framework the $\hat{\varepsilon}$ from Equation (3) is referred to as the RMSEA which uses the χ² to reflect the degree of misfit ($\hat{F} = \chi^2/N$; Equation 3):

$$\text{RMSEA} = \sqrt{\frac{\chi^2 - df}{df \cdot N}} \qquad (4)$$

In Equation (4), df (degrees of freedom) reflects the number of free parameters in the model:

$$df = p - q \qquad (5)$$

with p being the number of observations, defined as the number of unique elements within the sample variance–covariance matrix (ν[ν + 1]/2), and q the number of free (estimated) parameters. If the mean structure is included, p is increased by the number of observed variables (ν; Kline, 2011). The $\hat{F}$ from Equation (3) can also be replaced with the misfit from the generalized least squares or asymptotically distribution free estimator instead of the maximum likelihood (ML)–based χ² (Browne & Cudeck, 1992). Commonly used cutoff points for the RMSEA are values less than .05 denoting good model fit and values less than .08 denoting adequate model fit. Hu and Bentler (1999) suggested that for a good model fit a cutoff point of .06 could also be used. A key strength of the RMSEA is that the sampling distribution is known under certain assumptions. Support for this notion is based on the fact that the asymptotic distribution of RMSEA is a rescaled χ² for a given sample size, df, and a noncentrality parameter λ (Browne & Cudeck, 1992). The lower (LL) and upper limit (UL) of the RMSEA CI are given as

$$\text{RMSEA}_{LL} = \sqrt{\frac{\hat{\lambda}_{LL}}{df \cdot N}}, \qquad \text{RMSEA}_{UL} = \sqrt{\frac{\hat{\lambda}_{UL}}{df \cdot N}} \qquad (6)$$

in which $\hat{\lambda}_{LL}$ and $\hat{\lambda}_{UL}$ are the lower and upper limits of the noncentrality parameter λ. This CI enables the test whether a model exhibits close or worse fit, which is achieved when the lower limit is below or above a certain threshold (Browne & Cudeck, 1992).
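As a worked illustration of the df and RMSEA computations described above, the following Python sketch uses hypothetical function names and a hypothetical χ² value. It assumes the N-based scaling used in this section; some software divides by N − 1 instead. The CI limits are omitted because they require the noncentral χ² distribution.

```python
import math


def degrees_of_freedom(n_indicators, n_free_params, mean_structure=True):
    """df = p - q, with p the number of unique sample moments."""
    p = n_indicators * (n_indicators + 1) // 2  # variances and covariances
    if mean_structure:
        p += n_indicators                       # plus one mean per indicator
    return p - n_free_params


def rmsea(chi2, df, n):
    """Point estimate of the RMSEA, truncated at zero."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))


# One-factor model, 12 indicators, factor variance fixed at 1:
# 12 loadings + 12 residual variances + 12 intercepts = 36 free parameters.
df = degrees_of_freedom(12, 36)
print(df)
print(round(rmsea(300.0, df, 1000), 4))  # hypothetical chi-square of 300
```

With these inputs the model has 54 degrees of freedom, and a χ² below that value would yield an RMSEA of exactly zero.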

Implementation of the BRMSEA Within Bayesian Confirmatory Factor Analysis

Inspired by Browne and Cudeck (1992) who stated that different measures of discrepancy (i.e., χ²) can be used for the estimation of $\hat{\varepsilon}$ from Equation (3), we propose that it can also be applied within Bayesian CFA. Hence, the fact that the RMSEA was developed and applied within a frequentist framework does not hinder the implementation of its rationale within the Bayesian framework. The notion that the degree of misfit ($\hat{F}$) should be rescaled according to the number of estimated parameters (d) and sample size (N) is therefore implemented within Bayesian CFA. Within a Bayesian framework there is, however, no classical discrepancy function or df. This section illustrates the parameters from a Bayesian CFA framework which are implemented in Equation (3) to achieve a Bayesian variant of the RMSEA, the BRMSEA. With regard to model misfit ($\hat{F}$), for which the χ² is used within the frequentist framework (Equation 5), the difference between the observed and replicated χ² ($\chi^2_{obs,i} - \chi^2_{rep,i}$) for each iteration (i; after burn-in) is used for the BRMSEA. Within Bayesian CFA this difference can be regarded as the degree of misfit ($\hat{F}$) in Equation (3). Similar to a classical discrepancy function, such as the χ² within frequentist CFA, $\chi^2_{obs,i} - \chi^2_{rep,i}$ will positively increase with increasing levels of misfit. In contrast to a classical discrepancy function, such as the χ² within frequentist CFA, $\chi^2_{obs,i} - \chi^2_{rep,i}$ can be negative for an iteration. Across multiple iterations, however, the difference will average approximately 0 for a perfect fitting model and will positively increase with increasing levels of misfit, similar to a classical discrepancy function. To control for model complexity, it is important to include the effect that prior information has on the estimation process, as prior information can alter the “effective” number of estimated parameters. A prior with a mean of zero and a very small variance is, for example, nearly equal to a parameter which is fixed to zero (Asparouhov, Muthén, & Morin, 2015). 
Especially if a multitude of such priors are used, the difference between the number of estimated parameters and the effective number of estimated parameters can become substantial. To correct for this effect within Bayesian CFA the effective number of parameters (pD; Spiegelhalter et al., 2002) is used. The pD parameter was developed in conjunction with the deviance information criterion (DIC) as penalty term for complexity. Subtracting the pD, instead of q (Equation 5), from the number of observations (p) gives a fair estimation of the effective model complexity within Bayesian CFA. Equivalent models with differing prior information will, therefore, have a different model complexity which is in line with the Bayesian framework. Combining the model fit of Bayesian CFA ($\chi^2_{obs,i} - \chi^2_{rep,i}$) with the effective number of parameters (pD) results in the following equation for the BRMSEA:

$$\text{BRMSEA}_i = \sqrt{\frac{(\chi^2_{obs,i} - \chi^2_{rep,i}) - (p - pD)}{(p - pD) \cdot N}} \qquad (7)$$

As such, the BRMSEA results in a set of (i) rescaled differences between the observed and replicated χ², taking into account the (effective) number of estimated parameters and sample size. By doing so it provides an estimation of the validity of the model for the population while taking into account the lack of fit arising due to sampling error. The numerator of the BRMSEA will be set at 0 for an iteration if it is negative. As $\chi^2_{obs,i} - \chi^2_{rep,i}$ will on average be 0 in a perfect fitting model, the BRMSEA will also be zero for perfectly fitting models, and positively increase toward infinity for increasing levels of misspecification. In contrast to the frequentist framework, in which the CI of the RMSEA is commonly computed on the basis of asymptotic theory, the PPI of the BRMSEA should be derived, as any posterior measure within Bayesian CFA, from the posterior density. The PPI (e.g., 90%) of the BRMSEA is extracted from the total set of iterations. In the present study, the lower limit is the 5th percentile and the upper limit the 95th percentile, as the used PPI of the BRMSEA is 90%. 
This 90% is in line with the 90% CI often used for the RMSEA (Browne & Cudeck, 1992). Because of the (theoretical) comparability of the RMSEA and BRMSEA, it is hypothesized that their functioning regarding the assessment of overall model fit is equivalent. A simulation study is proposed to empirically test this hypothesized functioning of the BRMSEA within Bayesian CFA.
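The procedure described in this section can be sketched in Python as follows. This is an illustrative sketch and not the R code of Supplement A. It assumes that the BRMSEA is obtained by inserting the observed minus replicated χ² and the effective df (p − pD) into the RMSEA formula, as the text describes. The function names, the percentile interpolation rule, and all input values are hypothetical.

```python
import math


def brmsea_draws(chi2_obs, chi2_rep, p, p_d, n):
    """One BRMSEA value per post-burn-in iteration."""
    df_eff = p - p_d  # observations minus effective number of parameters
    draws = []
    for obs, rep in zip(chi2_obs, chi2_rep):
        numerator = max((obs - rep) - df_eff, 0.0)  # set to 0 if negative
        draws.append(math.sqrt(numerator / (df_eff * n)))
    return draws


def percentile(values, q):
    """Percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    low, high = math.floor(pos), math.ceil(pos)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def brmsea_ppi(draws, level=0.90):
    """Lower and upper limit of the posterior probability interval."""
    tail = (1.0 - level) / 2.0
    return percentile(draws, tail), percentile(draws, 1.0 - tail)


# Hypothetical chi-square values for four iterations; p = 90, pD = 36, N = 1,000.
chi2_obs = [400.0, 350.0, 300.0, 60.0]
chi2_rep = [90.0, 100.0, 80.0, 70.0]
draws = brmsea_draws(chi2_obs, chi2_rep, p=90, p_d=36, n=1000)
lower, upper = brmsea_ppi(draws)
print([round(x, 4) for x in draws], round(lower, 4), round(upper, 4))
```

In practice the interval would be computed over thousands of iterations; four are used here only to keep the example readable.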

Simulation Study

In this article, the validity of the BRMSEA within a Bayesian CFA is evaluated (see Supplement A for R-code). The characteristics of the BRMSEA and the ppp are examined within various conditions in a simulation study. It is hypothesized that for large samples the ppp rejects all models with any form of (“small”) misspecification while the BRMSEA only rejects models with a “large” misspecification and accepts models with a “small” misspecification. The comparison with the RMSEA is made to see whether its characteristics are analogous with those of the BRMSEA. The frequentist χ²-based p value and the Bayesian ppp are expected to reject all models with any form of misspecification. Implementation of the BRMSEA will be further facilitated and evaluated by the implementation of cutoff points.

Method

Data Generation

The simulation study consisted of two sections. In the first section, different population covariance matrices (conditions) were tested against a common one-factor CFA model. In the second section, a partly different set of conditions was tested against a two-factor CFA model. The different population covariance matrices (conditions) in the first section, which were tested against a common one-factor CFA model (Figure 1A), were specified varying the following four aspects: (1) specification of the population factor model (Models A-E; see Figure 1), (2) number of indicators (6 and 12), (3) magnitude of factor loadings (.5 and .7), and (4) sample size (50, 100, 250, 500, 1,000, 5,000, and 10,000). The specifications of the population factor models (partly based on Shevlin & Miles, 1998), which were used to generate the data, were increasingly different compared with a common one-factor model. Specifications B and C were regarded as “small” misspecifications as the residual correlation was .1 and the salient pattern of the factor loadings corresponded with that of the reference model (Heene, Hilbert, Freudenthaler, & Bühner, 2012). The number of residual correlations, especially for Specification B, was furthermore limited. Specification D and especially Specification E were seen as models with more substantial (“large”) amounts of misspecification, primarily because of the difference in the salient pattern of the factor structure and the moderate correlation between these factors.
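To make the construction of a population covariance matrix concrete, the sketch below builds the model-implied matrix for a one-factor model (Specification A) and adds one residual covariance for Specification B. It is a hypothetical Python illustration, not the Mplus syntax of Supplement B. It assumes a factor variance of 1, residual variances of 1 − λ², and that the .1 of Specification B enters as a residual covariance between the first two indicators.

```python
def one_factor_covariance(n_indicators, loading, error_covariances=None):
    """Implied covariance matrix: loading * loading' + Theta.

    error_covariances maps zero-based index pairs (i, j) to a
    residual covariance that is added symmetrically.
    """
    common = loading * loading   # shared variance (factor variance = 1)
    theta = 1.0 - common         # residual variance
    sigma = [[common] * n_indicators for _ in range(n_indicators)]
    for i in range(n_indicators):
        sigma[i][i] = common + theta  # unit variance on the diagonal
    for (i, j), cov in (error_covariances or {}).items():
        sigma[i][j] += cov
        sigma[j][i] += cov
    return sigma


# Specification A: six indicators with loadings of .7.
sigma_a = one_factor_covariance(6, 0.7)
# Specification B: same model plus a .1 residual covariance (assumed form)
# between the first and second indicator.
sigma_b = one_factor_covariance(6, 0.7, {(0, 1): 0.1})
print(round(sigma_a[0][1], 2), round(sigma_b[0][1], 2))
```

Fitting the one-factor model to data generated from `sigma_b` leaves the added covariance unexplained, which is the “small” misspecification the study examines.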

Figure 1.

The different specifications for the population factor models used to generate the population covariance matrices for each condition.

Note. Factor loadings (λ = .3, .5, or .7; λ = .07, .35) and number of indicators (6 or 12) varied between the conditions. Residuals (θ) were computed on the basis of the factor loadings (1 − λ²). Intercepts and factor means are not displayed as they were estimated to be zero in all models. Models A through C were only used in the first section and Models F1 and F2 only in the second section. Model A was the reference model in the first section and Model D was the reference model in the second section.

For the second section the reference model was a two-factor model (Specification D; Figure 1D). In this section, the number of indicators (i.e., 12) and the magnitude of factor loadings (i.e., .7) were not varied and were based on the findings in the first section. Sample size variation was equal to that in the first section. Specification of the population factor model, partly based on Asparouhov and Muthén (2010), consisted of four models (Models D, E, F1, and F2). Models F1 and F2 were similar to Model D except for the inclusion of cross-loadings between the sixth indicator and the second factor and the seventh indicator and the first factor (Figure 1). These cross-loadings were “small” and 10% of the salient factor loadings in Model F1 (.07) and “moderate” in Model F2 (.35). The “small” cross-loadings in Model F1 should result in acceptance of the model while the standardized cross-loadings above .3 in Model F2 should result in a majority of the models being rejected (Asparouhov & Muthén, 2010; Saris, Satorra, & Van der Veld, 2009). Specifications A through C were not tested against the reference model in the second section as this would be complicated by the freely estimated covariance between the two factors, which would approach 1 in these models, resulting in a bias in parameter estimates but not in overall model fit. 
All models were identified through constraining the factor variance(s) to 1. Intercepts of all indicators and latent factor means were specified to be zero. Residual variances were computed by subtracting the squared factor loading from 1. The different variations (i.e., specification, number of indicators, magnitude of factor loadings, and sample size) resulted in a total of 140 (5 × 2 × 2 × 7) different conditions in the first section and 28 (4 × 1 × 1 × 7) in the second section. For each condition 500 samples were generated. Cumulative average plots indicated that the number of samples was sufficient as estimates were fully stabilized by 500 samples. Population RMSEA values for the various conditions, in both sections, are presented in Table 1.
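The number of conditions follows directly from crossing the design factors, which the Python sketch below reproduces. The variable names are hypothetical; the factor levels are those listed in the Data Generation section.

```python
from itertools import product

sample_sizes = [50, 100, 250, 500, 1000, 5000, 10000]

# Section 1: specification x number of indicators x loading x sample size.
section_1 = list(product(["A", "B", "C", "D", "E"], [6, 12], [0.5, 0.7], sample_sizes))
# Section 2: only specification and sample size vary.
section_2 = list(product(["D", "E", "F1", "F2"], [12], [0.7], sample_sizes))

samples_per_condition = 500
n_datasets = (len(section_1) + len(section_2)) * samples_per_condition
print(len(section_1), len(section_2), n_datasets)
```

Each generated data set is then analyzed with every estimator and prior variation, so the number of fitted models is a multiple of the number of data sets.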

Table 1. Population Parameters (Root Mean Square Error of Approximation) of Each Condition for the Two Different Reference Models, on the Basis of the Number of Indicators, Magnitude of Factor Loadings (Rows), and Specification (Columns).

Loading  Indicators = 6                  Indicators = 12
         A     B     C     D     E       A     B     C     D     E     F1    F2
Reference model: One-factor (Specification A)
.5       .000  .034  .070  .089  .106    .000  .017  .042  .070  .091  -     -
.7       .000  .052  .103  .204  .234    .000  .025  .063  .149  .188  -     -
Reference model: Two-factor (Specification D)
.7       -     -     -     -     -       -     -     -     .000  .141  .013  .061

Note. Specification A is a common one-factor model. Specification B is identical to Specification A except for the inclusion of a small error covariance (.1) between the first and second indicators. Specification C is a one-factor model with small error covariances (.1) between each subsequent pair of indicators. Specification D is a two-factor model with a covariance of .5 between the two factors. Specification E is a three-factor model with a covariance of .25 between the factors. Model F1 was similar to Model D except for the inclusion of small (.07) cross-loadings between the sixth indicator and the second factor and the seventh indicator and the first factor. Model F2 was similar to Model F1 except that the two cross-loadings were moderate (.35). For Specifications D through F2 the number of indicators is equally distributed across the factors. Residuals were computed by subtracting the squared factor loading from 1. Intercepts and factor means were estimated to be zero in all models. A hyphen indicates a condition that was not included.

Estimation and Prior Specification

In both sections two estimators were used, ML for the frequentist CFA and Bayesian estimation for the Bayesian CFA. For the Bayesian estimation, three variations regarding the specification of priors were examined. Prior specification was varied to examine the effect of priors on the characteristics of the BRMSEA. The first variation included the default, diffuse priors of Mplus which are N(0, ∞) for the intercepts and factor loadings, and IG(−1, 0) for residual variances (Asparouhov & Muthén, 2010; Muthén & Muthén, 1998). For the second variation the prior means of the factor loadings and intercepts of the indicators had the “correct” parameter of the current condition (e.g., .7 for a factor loading). As the priors furthermore had a variance of 0.05 (SD = 0.22), these priors were regarded as conservative (weakly informative). The third variation of prior specification was only applied in the second section. This variation included wrong prior specifications for the factor loadings (.9 instead of .7) and factor covariance (.3 instead of .5). These priors furthermore had a variance of 0.005 (SD = 0.07), which was 10 times smaller than in the conservative prior variation, to ensure that the prior distribution deviated from the reference model. It should be noted that these prior variations were only used for the model estimation and not for the simulation of the underlying data. As such, each (single) sample was estimated using different prior variations for the Bayesian CFA (and a single frequentist CFA model using the ML estimator). All the models were estimated as a common one-factor model (Figure 1A) in the first section, with either 6 or 12 indicators, or as a two-factor model (Figure 1D), with 12 indicators, in the second section. The estimated model was identified through the specification of the latent factor variance at 1 and its mean at 0. For the model in the second section the covariance between the two factors was freely estimated. 
The hypothesized models corresponded with the conditions in which the specification of the reference model was used. The other specifications differed from the hypothesized model (Table 1). In these instances, the hypothesized models did not reflect the pattern of the underlying factor structure of the population used to generate the data. For the models, all default estimation settings were used except for the convergence criteria of the Bayesian CFA models. See Supplement B and Muthén and Muthén (1998) for default settings. The default Bayesian CFA convergence criterion (BCONVERGENCE) of 0.05 was set to 0.01. Mplus multiplies this criterion by the multiplicity factor of the model, which can range from 1 (in a model with one parameter) to 2 (in a model with a large number of parameters), to compute the potential scale reduction factor (PSR) criterion for each parameter of a model (for more details, see Asparouhov & Muthén, 2010). It is argued, however, that a stringent PSR criterion is preferable (Brown, 2015). A BCONVERGENCE of 0.01 will, as such, result in the requirement that PSR values are below 1.02 instead of 1.10 with the default convergence criterion of 0.05 (Depaoli & Van de Schoot, 2017). Convergence was furthermore facilitated by a fixed minimum of 5,000 iterations for each model, with a maximum of 20,000. That is, if by the 20,000th iteration the highest PSR was not below the convergence criterion, the model did not converge. Random checks indicated that further increasing the number of iterations did not alter the results.
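The relation between BCONVERGENCE and the resulting PSR bound can be illustrated with a short Python sketch. The form 1 + criterion × multiplicity factor is an assumption inferred from the two bounds reported above (1.10 and 1.02 for a multiplicity factor of 2). It is not taken from the Mplus documentation, and the function names and PSR values are hypothetical.

```python
def psr_threshold(bconvergence, multiplicity=2.0):
    """Upper bound on the PSR (assumed form: 1 + criterion * multiplicity)."""
    if not 1.0 <= multiplicity <= 2.0:
        raise ValueError("multiplicity factor is described as ranging from 1 to 2")
    return 1.0 + bconvergence * multiplicity


def converged(psr_values, bconvergence=0.01, multiplicity=2.0):
    """A model converges when its highest PSR is below the bound."""
    return max(psr_values) < psr_threshold(bconvergence, multiplicity)


print(round(psr_threshold(0.05), 2))  # default criterion
print(round(psr_threshold(0.01), 2))  # stricter criterion used in this study
print(converged([1.005, 1.015]), converged([1.005, 1.050]))
```

Under the stricter criterion a model with a single parameter at a PSR of 1.05 would still be iterating, whereas it would already count as converged under the default.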

Analytic Strategy

For both sections the same analytic strategy was used, and the results are therefore reported together. First, the convergence of the models was inspected (detailed tables are provided in Supplement C). Models that did not converge were excluded from the analysis. The means of the relevant outcomes, that is, the p value and 90% CI RMSEA for frequentist CFA and the ppp and 90% PPI BRMSEA for Bayesian CFA, were (visually) inspected for the different settings. The applicability of these outcomes for the evaluation of model acceptance was, furthermore, quantified by implementing cutoff values. For the χ2 p value the conventional cutoff value of .05 was used (α = .05). For the ppp a value of .05 was used, based on the recommendations by Muthén and Asparouhov (2012). For the CI of the RMSEA, acceptable fit required a lower limit below .05 and an upper limit below .08 (Browne & Cudeck, 1992; Kenny, 2015). These cutoff points were also applied to the BRMSEA, as preliminary results showed striking similarities between the RMSEA and the BRMSEA, especially for large samples. The software package Mplus (Version 7; Muthén & Muthén, 1998) was used for the data simulation based on the population covariance matrices and for the model estimation (see Supplement B for the syntax of both procedures). R (Version 3.1.1; R Development Core Team, 2014) was used to program the simulation and analyze the results. MplusAutomation (Version 0.6-2; Hallquist & Wiley, 2013) was used to facilitate the exchange between both programs.
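The cutoff rules above, and the way they yield a proportion of rejected models per condition, can be sketched as follows. The function names and the example intervals are hypothetical; the study itself used R with MplusAutomation, and the values below are not taken from the simulation.

```python
def interval_fit_ok(lower, upper, lower_cut=0.05, upper_cut=0.08):
    """Accept when the lower limit is below .05 and the upper limit below .08."""
    return lower < lower_cut and upper < upper_cut


def p_value_ok(p, cutoff=0.05):
    """Accept when the (posterior predictive) p value exceeds the cutoff."""
    return p > cutoff


def proportion_rejected(intervals):
    """Share of replications whose 90% interval fails the cutoff rule."""
    rejected = sum(not interval_fit_ok(lo, hi) for lo, hi in intervals)
    return rejected / len(intervals)


# Hypothetical 90% intervals from four replications of one condition.
replications = [(0.00, 0.04), (0.01, 0.07), (0.03, 0.09), (0.06, 0.10)]
print(proportion_rejected(replications))
```

The same rule applies to the 90% CI of the RMSEA and the 90% PPI of the BRMSEA, so a single function serves both frameworks.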

Results

Convergence

In Supplemental Tables C1 and C2 the convergence of the models is shown. The convergence rate of the frequentist models was below 90% for some conditions with the one-factor reference model, especially for small samples and “large” misspecification. For the Bayesian CFA models, almost all models converged. In the first section, no single parameter was specifically associated with nonconvergence in the Bayesian models. In the second section, however, the covariance parameter between the two latent factors most often had the highest PSR when a model did not converge (87%). The 1,218 models (0.45%) that did not converge were excluded from further analysis.

BRMSEA and RMSEA

Figures 2 to 4 show the mean values of the 90% CI RMSEA within each condition for the frequentist CFA models. For the Bayesian estimation procedures, with diffuse, conservative, and wrong priors, the mean values of the 90% PPI BRMSEA are shown for each condition. As indicated in the analytic strategy, the performance of the 90% CI RMSEA and the 90% PPI BRMSEA was also quantified by the implementation of cutoff points to indicate whether a model showed an acceptable fit (Tables 2-4). For the RMSEA a cutoff point of .08 for the upper limit and of .05 for the lower limit was used; values below these limits indicated “acceptable” fit (Hu & Bentler, 1999). As the average PPI of the BRMSEA showed striking similarities with the average CI of the RMSEA (Figures 2-4), especially for large samples (N≥ 1,000), it seems that the properties of the BRMSEA and RMSEA are analogous for large samples. The cutoff points from the RMSEA were, therefore, also applied to the BRMSEA. These cutoff values were also included in Figures 2 to 4 to compare them with the mean values for each condition. As the differences between conservative and diffuse priors were marginal in the first section, especially for large samples (Figures 2 and 3), only the results for the diffuse priors are presented in Tables 2 and 3.

Table 2.

Proportion of Rejected Models With Six Indicators of the First Section, With the One-Factor Model as Reference Model, Using a Cutoff Point for the 90% Confidence Interval and 90% Posterior Probability Intervals of the Root Mean Square Error of Approximation (RMSEA) and BRMSEA for the Upper Limit of .08 and for the Lower Limit of .05 and of .05 for the Posterior Predictive p Value and p Value for the Bayesian Confirmatory Factor Analysis (CFA), With Diffuse Priors, and Frequentist CFA.

		Factor loadings = .5				Factor loadings = .7
N	Model	BRMSEA	ppp	RMSEA	p	BRMSEA	ppp	RMSEA	p
50	A (ref)	.92	.01	.90	.09	.90	.01	.90	.10
	B	.93	.01	.91	.11	.92	.01	.91	.13
	C	.96	.01	.95	.16	.98	.07	.98	.33
	D	.98	.02	.96	.20	1	.51	1	.84
	E	.95	.02	.96	.12	1	.61	1	.91
100	A (ref)	.72	.01	.80	.07	.67	.01	.80	.07
	B	.78	.01	.83	.11	.79	.03	.87	.17
	C	.91	.05	.94	.23	.97	.20	.99	.52
	D	.96	.11	.97	.39	1	.95	1	1.00
	E	.94	.1	.98	.37	1	.99	1	1
250	A (ref)	.08	.00	.36	.06	.07	.00	.34	.07
	B	.20	.02	.57	.18	.38	.08	.75	.35
	C	.66	.23	.90	.62	.96	.77	1.00	.95
	D	.84	.51	.97	.81	1	1	1	1
	E	.93	.68	1.00	.93	1	1	1	1
500	A (ref)	0	.01	.03	.08	0	.01	.50	.08
	B	.01	.08	.18	.31	.13	.31	1	.65
	C	.46	.69	.81	.90	.98	1.00	1	1
	D	.81	.93	.97	.99	1	1	1	1
	E	.97	1	1	1	1	1	1	1
1,000	A (ref)	0	.00	0	.07	0	.00	0	.06
	B	0	.22	.00	.58	.02	.74	.20	.93
	C	.33	.99	.74	1.00	1.00	1	1	1
	D	.89	1	.99	1	1	1	1	1
	E	1.00	1	1	1	1	1	1	1
5,000	A (ref)	0	0	0	.06	0	0	0	.06
	B	0	1.00	0	1	.20	1	.10	1
	C	1	1	1.00	1	1	1	1	1
	D	1	1	1	1	1	1	1	1
	E	1	1	1	1	1	1	1	1
10,000	A (ref)	0	.00	0	.04	0	0	0	.05
	B	0	1	0	1	.33	1	.16	1
	C	1	1	1	1	1	1	1	1
	D	1	1	1	1	1	1	1	1
	E	1	1	1	1	1	1	1	1

Note. RMSEA = root mean square error of approximation; BRMSEA = Bayesian root mean square error of approximation; ppp = posterior predictive p value; p = p value; ref = reference model.

Table 3.

Proportion of Rejected Models With 12 Indicators of the First Section, With the One-Factor Model as Reference Model, Using a Cutoff Point for the 90% Confidence Interval and 90% Posterior Probability Intervals of the Root Mean Square Error of Approximation (RMSEA) and BRMSEA for the Upper Limit of .08 and for the Lower Limit of .05 and of .05 for the Posterior Predictive p Value and p Value for the Bayesian Confirmatory Factor Analysis (CFA), With Diffuse Priors, and Frequentist CFA.

		Factor loadings = .5				Factor loadings = .7
N	Model	BRMSEA	ppp	RMSEA	p	BRMSEA	ppp	RMSEA	p
50	A (ref)	.00	.04	.82	.21	0	.03	.82	.22
	B	.00	.04	.84	.22	.00	.05	.85	.23
	C	.02	.09	.90	.31	.03	.17	.96	.45
	D	.04	.19	.96	.51	.74	.94	1	.99
	E	.09	.28	.98	.64	.99	1.00	1	1
100	A (ref)	0	.02	.25	.10	0	.02	.26	.10
	B	0	.02	.30	.13	0	.04	.35	.17
	C	0	.10	.56	.31	.01	.33	.86	.65
	D	.02	.42	.87	.71	.97	1	1	1
	E	.15	.76	.98	.93	1	1	1	1
250	A (ref)	0	.01	0	.08	0	.01	0	.08
	B	0	.03	0	.13	0	.07	.00	.23
	C	0	.38	.02	.66	0	.93	.51	.98
	D	.01	.97	.73	1.00	1	1	1	1
	E	.37	1	1.00	1	1	1	1	1
500	A (ref)	0	.01	0	.06	0	.00	0	.05
	B	0	.07	0	.18	0	.23	.00	.42
	C	0	.92	.01	.97	.05	1	.61	1
	D	.27	1	.88	1	1	1	1	1
	E	.98	1	1	1	1	1	1	1
1,000	A (ref)	0	.02	0	.06	0	.02	0	.06
	B	0	.20	0	.41	0	.63	0	.84
	C	0	1	.00	1	.47	1	.92	1
	D	.88	1	.99	1	1	1	1	1
	E	1	1	1	1	1	1	1	1
5,000	A (ref)	0	.01	0	.05	0	.01	0	.05
	B	0	1	0	1	0	1	0	1
	C	0	1	0	1	1	1	1	1
	D	1	1	1	1	1	1	1	1
	E	1	1	1	1	1	1	1	1
10,000	A (ref)	0	.00	0	.05	0	.00	0	.05
	B	0	1	0	1	0	1	0	1
	C	0	1	0	1	1	1	1	1
	D	1	1	1	1	1	1	1	1
	E	1	1	1	1	1	1	1	1

Note. RMSEA = root mean square error of approximation; BRMSEA = Bayesian root mean square error of approximation; ppp = posterior predictive p value; p = p value; ref = reference model.

Table 4.

Proportion of Rejected Models of the Second Section, With the Two-Factor Model as Reference Model, Using a Cutoff Point for the 90% Confidence Interval and 90% Posterior Probability Intervals of the Root Mean Square Error of Approximation (RMSEA) and BRMSEA for the Upper Limit of .08 and for the Lower Limit of .05 and of .05 for the Posterior Predictive p Value and p Value for the Bayesian Confirmatory Factor Analysis (CFA), With Diffuse, Informative, and Wrong Priors, and Frequentist CFA.

		Bayesian CFA (priors)						Frequentist CFA
		Diffuse		Conservative		Wrong		—
N	Model	BRMSEA	ppp	BRMSEA	ppp	BRMSEA	ppp	RMSEA	p
50	D (ref)	.00	.03	0	.01	0	.11	.82	.21
	E	.67	.90	.45	.90	.93	1	1	.98
	F1	.00	.03	0	.01	0	.11	.84	.21
	F2	.03	.12	.00	.10	.02	.40	.94	.45
100	D (ref)	0	.01	0	.01	0	.14	.26	.09
	E	.96	1	.94	1	1	1	1	1
	F1	0	.01	0	.01	0	.15	.30	.11
	F2	.00	.26	.00	.24	.02	.80	.83	.62
250	D (ref)	0	.02	0	.01	0	.38	0	.09
	E	1	1	1	1	1	1	1	1
	F1	0	.03	0	.01	0	.46	0	.12
	F2	0	.91	0	.90	.01	1	.51	.97
500	D (ref)	0	.01	0	.01	0	.50	.00	.05
	E	1	1	1	1	1	1	1	1
	F1	0	.03	0	.03	0	.64	0	.10
	F2	.02	1	.02	1	.13	1	.53	1
1,000	D (ref)	0	.02	0	.02	0	.30	0	.06
	E	1	1	1	1	1	1	1	1
	F1	0	.08	0	.08	0	.61	0	.23
	F2	.32	1	.29	1	.48	1	.79	1
5,000	D (ref)	0	.02	0	.02	0	.04	0	.04
	E	1	1	1	1	1	1	1	1
	F1	0	.87	0	.88	0	.95	0	.96
	F2	1	1	1	1	1	1	1	1
10,000	D (ref)	0	.01	0	.00	0	.01	0	.05
	E	1	1	1	1	1	1	1	1
	F1	0	1	0	1	0	1	0	1
	F2	1	1	1	1	1	1	1	1

Note. RMSEA = root mean square error of approximation; BRMSEA = Bayesian root mean square error of approximation; ppp = posterior predictive p value; p = p value; ref = reference model.

Figure 2.

Mean values of the 500 replications for the 90% posterior probability interval (PPI) of the Bayesian root mean square error of approximation (BRMSEA) and the posterior predictive p value (ppp) for the Bayesian confirmatory factor analysis (CFA) models, with both informative and diffuse priors, and for the 90% confidence interval (CI) RMSEA and p value for the frequentist CFA models of the first section, with the one-factor model as reference (ref) model, for each sample size (as ordinal variable) and specification condition in which the magnitude of the factor loadings was .5. Cutoff values for the BRMSEA and RMSEA (.05 for the lower limit and .08 for the upper limit) and for the posterior predictive p value and p value (.05) are indicated with the dashed lines. Values within these cutoff points have (blue) circles, those outside (red) squares.

Figure 3.

Mean values of the 500 replications for the 90% posterior probability interval (PPI) of the Bayesian root mean square error of approximation (BRMSEA) and the posterior predictive p value (ppp) for the Bayesian confirmatory factor analysis (CFA) models, with both informative and diffuse priors, and for the 90% confidence interval (CI) RMSEA and p value for the frequentist CFA models of the first section, with the one-factor model as reference (ref) model, for each sample size (as ordinal variable) and specification condition in which the magnitude of the factor loadings was .7. Cutoff values for the BRMSEA and RMSEA (.05 for the lower limit and .08 for the upper limit) and for the posterior predictive p value and p value (.05) are indicated with the dashed lines. Values within these cutoff points have (blue) circles, those outside (red) squares.

Figure 4.

Mean values of the 500 replications for the 90% posterior probability interval (PPI) of the Bayesian root mean square error of approximation (BRMSEA) and the posterior predictive p value (ppp) for the Bayesian structural equation modeling (BSEM) models, with informative, diffuse, and wrong priors, and for the 90% confidence interval (CI) RMSEA and p value for the frequentist CFA models of the second section, with the two-factor model as reference (ref) model, for each sample size (as ordinal variable) and specification condition. Cutoff values for the BRMSEA and RMSEA (.05 for the lower limit and .08 for the upper limit) and for the posterior predictive p value and p value (.05) are indicated with the dashed lines. Values within these cutoff points have (blue) circles, those outside (red) squares.

For large samples the 90% CI RMSEA showed lower values for models with lower levels of misspecification than for models with higher levels of misspecification. This pattern was also found for the conditions in the second section (Figure 4), and it was reflected when the performance of the RMSEA was inspected based on model acceptance using the cutoff values (Tables 2-4). Table 5 summarizes the acceptance rates for large samples (N≥ 1,000). For large samples the 90% CI RMSEA proved to assess model fit successfully.

Table 5.

Summary of Model Acceptance, for Each Model Specification and Magnitude of the Factor Loadings, Indicating If Models Were Commonly Accepted (+), Rejected (−), or a Mixed Pattern Emerged (0) for the Conditions With a Large Sample Size (N≥ 1,000) With a Cutoff Point for the Lower Limit of .05 and for the Upper Limit of .08 for the 90% Posterior Probability Intervals of the BRMSEA and 90% Confidence Interval of the RMSEA and of .05 for the Posterior Predictive p Value and p Value.

	Bayesian confirmatory factor analysis (CFA)				Frequentist confirmatory factor analysis (CFA)
	BRMSEA		ppp		RMSEA		p value
	.5	.7	.5	.7	.5	.7	.5	.7
Reference model: One-factor (Specification A)
A (ref)	+	+	+	+	+	+	+	+
B	+	+	0	−	+	0/+[a]	0	−
C	−/+[a]	−	−	−	−/+[a]	−	−	−
D	−	−	−	−	−	−	−	−
E	−	−	−	−	−	−	−	−
Reference model: Two-factor (Specification D)
D (ref)		+		+/0[b]		+		+
E		−		−		−		−
F1		+		0		+		0
F2		−		−		−		−
Conclusion (N≥ 1,000)	Using the BRMSEA, models with no or “small” amounts of misspecification (e.g., B & F1) were generally accepted, whereas models with a “moderate” or “large” misspecification (e.g., E) were mostly rejected, irrespective of whether the reference model had one or two factors, if sample size increased. If factor loadings were smaller, model acceptance became more liberal as models with a “large” misspecification were increasingly accepted. There were no noteworthy differences between priors (for large samples).		With increasing sample sizes the ppp rejects all models regardless of the level of misspecification. Only a small proportion of models with a “small” amount of misspecification were accepted if the sample size was 1,000. These models were, however, always rejected if the sample size was 5,000 or 10,000. There were no noteworthy differences between priors (for large samples) except that some two-factor reference models were rejected if the wrong prior specification was used. This effect diminished, however, in the largest sample sizes.		Using the RMSEA, models with no or “small” amounts of misspecification (e.g., B & F1) were generally accepted, whereas models with a “moderate” or “large” misspecification (e.g., E) were mostly rejected, irrespective of whether the reference model had one or two factors, if sample size increased. If factor loadings were smaller, model acceptance became more liberal as models with a “large” misspecification were increasingly accepted.		With increasing sample sizes the χ² test rejects all models regardless of the level of misspecification. Some models with a “small” amount of misspecification were accepted if the sample size was 1,000. These models were, however, always rejected if the sample size was 5,000 or 10,000.

Note. RMSEA = root mean square error of approximation; BRMSEA = Bayesian root mean square error of approximation; ppp = posterior predictive p value; ref = reference model. If no superscripts are given, no noteworthy differences were found for the different prior variations or the number of indicators and a joint summary was given.

[a] Model acceptance differed for the number of indicators: The result for the six indicators was provided first, followed by the result for the 12 indicators.
[b] Model acceptance differed for the different prior variations: The result for the diffuse and informative prior variations was provided first, followed by the result for the wrong prior variation.

In the conditions with 12 indicators the BRMSEA seems invalid for small samples, as both the lower and the upper bound of the 90% PPI BRMSEA were zero, regardless of the level of misspecification. Compared with the estimation procedure using diffuse priors, the Bayesian CFA estimation procedure using conservative priors showed a narrower PPI when sample size was small, indicating the effect of prior information on the BRMSEA. The wrong prior variation, in contrast, resulted in a broader and somewhat higher BRMSEA (Figure 4). This effect was also visible for conditions with larger sample sizes. For large samples the 90% PPI BRMSEA approached the same values regardless of the prior variation (Figures 2-4).
These findings were also reflected when the performance of the BRMSEA was inspected based on model acceptance using cutoff values (Tables 2-4). Table 5 summarizes the acceptance rates for large samples (N≥ 1,000). The 90% PPI BRMSEA successfully assessed model fit within large samples and showed the same characteristics as the RMSEA. The most noteworthy difference, with regard to model acceptance, was within the condition with six indicators and large (.7) factor loadings (Table 2). Figure 3 shows, however, that the absolute difference between the RMSEA and BRMSEA in this condition was marginal. As the BRMSEA is not derived from asymptotic theory, its form differed from that of the RMSEA. For large samples, however, the BRMSEA showed a striking similarity with the RMSEA (Figures 2-4). For large samples the characteristics of the RMSEA and BRMSEA seem, therefore, comparable. That is, both the values of the 90% CI RMSEA and the 90% PPI BRMSEA were low for models with no or “small” misspecification and high for models with “large” misspecification.

ppp and p Value

As sample size increased, the ppp moved toward 0, except for the condition with Specification A. The move toward 0 occurred “faster” if the factor loadings were larger, if the misspecification was larger, or if the wrong prior variation was used (Figures 2 and 3). The “dip” in the average ppp of the two-factor reference model with the wrong prior variation was, furthermore, noteworthy. Regardless of the priors and the condition, however, the ppp reached zero when sample size increased toward 10,000 for any level of misspecification. This finding was also supported by the implementation of the cutoff point (.05) for the ppp (Table 5). These findings for the ppp also hold, as expected, for the χ2-based p value (Figures 2-4). It has to be noted, however, that while the ppp had the same pattern as the χ2-based p value for large samples, the ppp proved superior to the χ2-based p value for the smallest samples (N = 50; Tables 2-4). Using the χ2 for the reference models in the largest samples within frequentist CFA, furthermore, resulted in a rejection rate of ~5%. This corresponds with the Type I error rate implied by the nominal α (.05).
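The reason the χ2-based p value drifts toward 0 while the RMSEA stays put can be illustrated with a small calculation. This sketch rests on several assumptions that are not taken from the article: a fixed population misfit F0 chosen to correspond to an RMSEA of .03, the textbook approximation that the expected χ2 equals df + (N − 1)·F0, the common RMSEA form with N − 1 in the denominator (the article's Equation 4 is not reproduced here and may differ), and df = 9 for a one-factor model with six indicators (21 unique variances and covariances minus 12 free parameters). The critical value 16.919 is the tabled .95 quantile of the χ2 distribution with 9 degrees of freedom.

```python
from math import sqrt

DF = 9                        # one-factor model, six indicators (assumed count)
CHI2_CRIT = 16.919            # tabled critical value for df = 9, alpha = .05
TARGET_RMSEA = 0.03           # hypothetical "small" misspecification
F0 = DF * TARGET_RMSEA ** 2   # population misfit implied by that RMSEA


def expected_chi2(n, f0=F0, df=DF):
    """Approximate expected chi-square for a model with fixed misfit f0."""
    return df + (n - 1) * f0


def rmsea(chi2, n, df=DF):
    """Point estimate of the RMSEA (textbook form with N - 1)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


for n in (50, 100, 250, 500, 1_000, 5_000, 10_000):
    chi2 = expected_chi2(n)
    # Columns: N, expected chi-square, exceeds critical value?, RMSEA
    print(n, round(chi2, 2), chi2 > CHI2_CRIT, round(rmsea(chi2, n), 3))
```

With the misfit held constant, the expected χ2 grows linearly in N and passes the critical value between N = 500 and N = 1,000, whereas the RMSEA computed from the same statistic remains at .03 for every sample size.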

Empirical Illustration

The goal of the empirical illustration was to demonstrate what happens if different sample sizes, from the same population, are used. For this illustration, the proposed factor structure of the skill discretion subscale of the Job Content Questionnaire (JCQ; Karasek, 1985) was examined. This section is, however, by no means a comprehensive overview of an actual Bayesian CFA analysis. Data from the ongoing Maastricht Cohort Study on fatigue at work were used (see Kant et al., 2003). The longitudinal study gathers data from employees of 45 companies by means of self-administered questionnaires. The baseline questionnaires in May 1998 were sent together with an invitation letter to the participants. In total, 26,978 employees received the baseline questionnaire, of whom 12,161 responded. Twenty-one participants were excluded due to technical reasons, resulting in a baseline population of 12,140. The skill discretion subscale of the JCQ was used for the factor model. This subscale assesses the level of skill and creativity required on the job and the flexibility permitted in deciding what skills to use. This subscale included six items (e.g., “My job requires that I learn new things”), which were answered on a 4-point Likert-type scale (strongly disagree to strongly agree). All items were hypothesized to load on a single factor reflecting skill discretion. Preliminary analyses showed, however, a strong dependency between the second and fourth item. Therefore, a residual covariance between these items was modelled. The structure of the hypothesized model reflected the model in Figure 1B, except that the residual covariance was not fixed to .1 but was freely estimated. To illustrate the effect of sample size on the estimation of such a factor model within Bayesian and frequentist CFA, random samples of various sizes were extracted from the original data. The selected sample sizes were equal to the ones used in the simulation study (50, 100, 250, 500, 1,000, 5,000, and 10,000).
To control for a possible difference between the samples regarding the overall score on skill discretion, the caret package (Version 6.0-41; Kuhn, 2015) was used to extract training sets that were matched on the sum score of the skill discretion subscale. No differences were, therefore, expected between the samples regarding their average skill discretion score. The model was tested for each data set using the same three estimation procedures as in the first section of the simulation study. Information from three articles investigating the factor structure of the skill discretion subscale was used for the Bayesian analysis using conservative priors (Cheng, Luh, & Guo, 2003; De Araújo & Karasek, 2008; Pelfrene et al., 2003). The mean values of the factor loadings of the three articles were Item 1 (Develop own abilities) = .68, Item 2 (Requires creativity) = .67, Item 3 (Variety) = .54, Item 4 (High skill level) = .57, Item 5 (Learn new things) = .50, and Item 6 (Repetitive work) = .39. These articles used, however, exploratory factor analysis, and the language of the questionnaires differed. Therefore, a conventional prior mean of 0.4 was chosen with a variance of 0.1 for all factor loadings. Priors for other parameters were not specified.
Table 6 shows that for large samples the RMSEA indicated adequate model fit. For small samples, in contrast, the upper limit exceeded the cutoff point of .08. The same pattern emerged for the BRMSEA, both with conservative and diffuse priors. The ppp rejected the model for the largest sample sizes (N≥ 5,000), whereas it accepted the model when sample size was small to moderate (N≤ 1,000). Parameter estimates were nearly identical when sample size was N≥ 5,000.
If the sample size was 10,000 the factor loading for the first item was .40 (95% PPI = [.38, .41]) in the Bayesian CFA model with conservative priors, .40 (95% PPI = [.38, .41]) with diffuse priors, and .40 (95% CI = [.38, .41]) in the frequentist CFA model, showing comparability of parameter estimates (see also Scheines et al., 1999).
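The idea of drawing subsamples that are matched on the sum score can be sketched with proportional stratified sampling. This is not the caret implementation used in the study; `stratified_sample`, the number of strata, and the simulated sum scores are all hypothetical and serve only to illustrate why the subsample means are expected to agree.

```python
import random


def stratified_sample(sum_scores, n, n_bins=5, seed=1):
    """Draw n indices so that each sum-score stratum keeps its population share."""
    rng = random.Random(seed)
    order = sorted(range(len(sum_scores)), key=lambda i: sum_scores[i])
    # Quantile-style strata: split the rank-ordered cases into n_bins groups.
    strata = [order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
              for b in range(n_bins)]
    # Proportional allocation; largest-remainder rounding makes sizes sum to n.
    quotas = [n * len(s) / len(order) for s in strata]
    sizes = [int(q) for q in quotas]
    by_remainder = sorted(range(n_bins), key=lambda b: quotas[b] - sizes[b],
                          reverse=True)
    for b in by_remainder[: n - sum(sizes)]:
        sizes[b] += 1
    picked = []
    for stratum, k in zip(strata, sizes):
        picked.extend(rng.sample(stratum, k))
    return picked


# Simulated stand-in for the baseline data: six items scored 1 to 4.
rng = random.Random(0)
population = [sum(rng.randint(1, 4) for _ in range(6)) for _ in range(12_140)]

sample_idx = stratified_sample(population, 1_000)
sample_mean = sum(population[i] for i in sample_idx) / len(sample_idx)
population_mean = sum(population) / len(population)
print(round(population_mean, 2), round(sample_mean, 2))
```

Because every stratum contributes in proportion to its size, differences between subsamples in the average sum score are limited to within-stratum sampling variation.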

Table 6.

Results of the Empirical Illustration for the Different Sample Sizes With the 90% Confidence Interval of the Root Mean Square Error of Approximation (RMSEA) and p Values for the Frequentist Confirmatory Factor Analysis (CFA) Models and 90% Posterior Probability Intervals of the Bayesian RMSEA (BRMSEA) and Posterior Predictive p Value for the Bayesian CFA Models With Diffuse and Informative Priors.

	Bayesian CFA (diffuse priors)		Bayesian CFA (informative priors)		Frequentist CFA
N	BRMSEA₉₀	ppp	BRMSEA₉₀	ppp	RMSEA₉₀	p value
50	[.000, .206]	.15	[.000, .189]	.16	[.053, .254]	.02
100	[.000, .114]	.40	[.000, .109]	.42	[.000, .136]	.25
250	[.000, .065]	.45	[.000, .063]	.46	[.000, .078]	.36
500	[.000, .055]	.31	[.000, .055]	.31	[.000, .066]	.15
1,000	[.000, .046]	.15	[.000, .046]	.15	[.012, .057]	.02
5,000	[.039, .048]	.00	[.039, .048]	.00	[.038, .055]	< .01
10,000	[.043, .047]	.00	[.043, .047]	.00	[.039, .051]	< .01

Note. RMSEA = root mean square error of approximation; BRMSEA = Bayesian root mean square error of approximation; ppp = posterior predictive p value. Boldfaced BRMSEA and RMSEA intervals have a lower limit below .05 and an upper limit below .08; Boldfaced p values and posterior predictive p values are greater than .05.
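As an illustration of how the two Bayesian criteria diverge across sample sizes, the cutoff rules can be applied to the diffuse-prior columns of Table 6. The values are copied from the table; the variable names are hypothetical, and the decision rules are those stated in the table note (lower limit below .05 and upper limit below .08 for the BRMSEA; ppp greater than .05).

```python
# N: (90% PPI of the BRMSEA, ppp), Bayesian CFA with diffuse priors (Table 6).
TABLE_6_DIFFUSE = {
    50: ((0.000, 0.206), 0.15),
    100: ((0.000, 0.114), 0.40),
    250: ((0.000, 0.065), 0.45),
    500: ((0.000, 0.055), 0.31),
    1_000: ((0.000, 0.046), 0.15),
    5_000: ((0.039, 0.048), 0.00),
    10_000: ((0.043, 0.047), 0.00),
}

accepted_by_brmsea = []
accepted_by_ppp = []
for n, ((lower, upper), ppp) in TABLE_6_DIFFUSE.items():
    if lower < 0.05 and upper < 0.08:   # interval rule for the BRMSEA
        accepted_by_brmsea.append(n)
    if ppp > 0.05:                      # cutoff rule for the ppp
        accepted_by_ppp.append(n)

print("BRMSEA accepts at N =", accepted_by_brmsea)
print("ppp accepts at N =", accepted_by_ppp)
```

The two criteria agree only for the intermediate sample sizes: the ppp accepts the model in the two smallest samples where the BRMSEA interval is too wide, and rejects it in the two largest samples where the BRMSEA indicates acceptable fit.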

Conclusion

At the moment, there is no appropriate summary statistic within Bayesian CFA protecting against an undesirably high sensitivity to detect negligible differences within large samples. The present article confirms that such a statistic is needed as the posterior predictive p value (ppp) rejects models with only a “small” deviation from the hypothesized model within such large samples, in accordance with previous studies (e.g., Asparouhov & Muthén, 2010). Our (simulation) study shows that the newly proposed Bayesian root mean square error of approximation (BRMSEA; Equation 7), inspired on the rationale of RMSEA (Equation 4; Browne & Cudeck, 1992), is a valid fit index for these large sample studies. As such the credibility of large sample Bayesian CFA models can be evaluated with this new BRMSEA which adjusts the model fit for model complexity and, most important, sample size. This enhances application of the Bayesian framework within CFA to assess the validity and reliability of (educational and psychological) measures (DiStefano & Hess, 2005). Cutoff points were used to aid the evaluation of the BRMSEA and assess its validity. It seems that these cutoff points are fruitful for successful model selection using the 90% PPI of the BRMSEA within Bayesian CFA when investigating large samples. The BRMSEA could be facilitated with a cutoff value of .05 for the lower limit in conjunction with a cutoff value of .08 for the upper limit as an indication of “good” fit. In the present simulation study these cutoff points resulted in the acceptance of models with none or “small” amounts of misspecification whereas “strongly” misspecified models were mostly rejected. The findings with respect to the cutoff points hold for models in which the sample size surpasses 1,000. This reliance on large samples is not regarded as a shortcoming of the BRMSEA. 
It is, after all, for these large sample sizes that a fit index was sought as within these samples the ppp is too sensitive for “trivial” misspecifications. As previous and the current simulation studies show, however, characteristics of the (B)RMSEA depend on a wide variety of model and data characteristics (Savalei, 2012). Researchers should therefore use cutoff points as a supportive guideline for interpretation of the quality of the model in conjunction with aspects such as, substantive theory, parameter estimates, cross-validation, and predictive quality (e.g., Bentler, 2007; Kaplan & Depaoli, 2012; Marsh et al., 2004; Millsap, 2007; Steiger, 2007; Yuan, 2005). As such, fit indices are not a panacea for the assessment of model quality (e.g., Marsh et al., 2004; Millsap, 2007; Steiger, 2014), nor should a low ppp be outrightly ignored simply because the sample size is large. A promising approach to use more informative cutoff points is the use of equivalence testing (Yuan, Chan, Marcoulides, & Bentler, 2016). This method takes into account the minimum tolerable size (T-size) of misspecification for fit indices (i.e., RMSEA). This approach could also be fruitful for further development of the BRMSEA and its cutoff points. Within the current study this method was, however, not taken into account to limit the number of “moving-parts” within the simulation. That is, the primary goal of this study was to demonstrate the validity of the BRMSEA as such, not to establish ground-truth for specific cutoff points. For a more informative selection of cutoff points, however, implementation of the equivalence testing approach would be recommended (Marcoulides & Yuan, 2017). Still, the cutoff points used in the current study seem to provide a valid first step for applied researchers for accessible and transparent assessment of overall model quality within Bayesian CFA models. 
The current analyses again illustrate the sensitivity of the ppp for any form of misspecification when sample size increases. These findings with respect to the ppp are important for an improved understanding of model diagnostics within Bayesian CFA, and Bayesian structural equation modeling in general (Levy, 2011; MacCallum, Edwards, & Cai, 2012; Rindskopf, 2012). Although the quantification of misspecification remains subjective, the main rationale entails that even the most marginal deviations eventually lead to a deterioration of the ppp when the sample size increases. While this enhanced precision is informative, it can also hinder the practical application within large samples. Within large samples the BRMSEA can, therefore, be seen as complementary to the ppp. While the BRMSEA provides an indication of overall model fit, it does not provide information regarding the source and form of misspecification. To gain such insights the method proposed by Muthén and Asparouhov (2012) can be used. Leaving aside the possible threats of post hoc model tinkering this method provides valuable information for researchers regarding the model as it quantifies the (marginal) deviations of the model (e.g., Bentler, 2007; McDonald & Ho, 2002; Stromeyer, Miller, Sriramachandramurthy, & DeMartino, 2015). Even these “enhanced” models will, however, be rejected on the basis of the ppp with increasing sample sizes. Specification F1 in the second section, for example, would eventually have been rejected even if informative priors were used for the cross-loadings. Further development of the BRMSEA is therefore recommended, as is the development of fit indices within Bayesian CFA in general. The comparative fit index (CFI) and Tucker–Lewis index (TLI) would seem to be good candidates, based on their implementation within frequentist CFA (e.g., Hu & Bentler, 1999). 
As indicated in the Introduction section, however, defining an independence model within a Bayesian framework could be difficult. That is, if prior information is provided, an empty model would be difficult to define. Within a frequentist framework such a model is simply a model without any relation between any of the variables. Such an absence of relations contradicts the inclusion of prior knowledge. Estimating the CFI and TLI within a Bayesian framework would, therefore, require a theoretical discussion and examination of an independence model within Bayesian CFA.
The parameter estimates of the empirical illustration in the present study show the approximate equivalence between Bayesian and frequentist CFA models within large samples for equivalent models. There are, however, specific models that are only possible within Bayesian CFA and therefore have no equivalent within frequentist CFA. An example of such a model is the assessment of approximate measurement invariance (Muthén & Muthén, 2013; Van de Schoot et al., 2013). Currently, however, it seems impossible for large samples to reach a satisfactory “baseline” model, as it is likely that almost all models will be rejected based on the ppp. The empirical illustration shows that the ppp approaches zero when the sample size is large, even when the model seems credible. In effect, researchers are “penalized” too much when investigating a large sample. In contrast to the ppp, the BRMSEA does not incur this “penalty” when assessing model fit within large samples (Steiger, 2000). Within the empirical illustration, for example, the BRMSEA indicated a satisfactory model fit for the large samples, which could enable specific analyses such as the assessment of approximate measurement invariance.
Some limitations of the current study and the BRMSEA as a fit index in general should be addressed.
It remains, foremost, important to test alternative models even if the model fit is satisfactory (Kline, 2011). As indicated by Browne and Cudeck (1992), model fit does not provide a measure of plausibility but merely indicates the lack of fit within a model. Researchers should remain alert to whether alternative models could better describe the data, or whether the good fit is a result of overfitting. The assumptions regarding the level of misspecification could, furthermore, be debated and are always subject to “substantive and theoretical issues that are likely to be idiosyncratic to a particular study” (Marsh et al., 2004, p. 340). As with each simulation study, the number of conditions is limited. The BRMSEA is, furthermore, not applicable to models with categorical indicators due to constraints on the evaluation of the likelihood in such models (Asparouhov, 2010). For a valid BRMSEA, it is vital that the model shows adequate convergence and adheres to all other assumptions within Bayesian CFA (e.g., Depaoli & Van de Schoot, 2017). The finding that the BRMSEA is sensitive to prior information supports its embedding within the Bayesian framework (Rupp et al., 2004). It should be noted, however, that the BRMSEA, like the ppp, is by no means designed to evaluate prior specifications.
This first introduction of the BRMSEA shows that all bodes well for its application within large sample Bayesian CFA studies. Such empirical studies have to prove the actual value of the BRMSEA in the evaluation of model fit. The proof of the pudding is, after all, in the eating.
The assessment of model fit within Bayesian CFA using the new BRMSEA could be seen as contradictory to a “true” Bayesian approach (Kaplan & Depaoli, 2012). To cite Spiegelhalter et al. (2002): “In conclusion, it is clear that our pragmatic aims are muddying otherwise pure Bayesian waters” (p. 637).
The BRMSEA is, however, embedded within the Bayesian framework as it includes the observed and replicated χ² and the (effective) number of parameters. As such, the BRMSEA is not directly derived from the RMSEA but inspired by its notion that a general fit statistic can be rescaled taking into account the sample size and model complexity (Steiger, 2000). In this way the BRMSEA resolves the sensitivity of the current Bayesian CFA summary statistics to negligible differences within large samples. The BRMSEA will, therefore, result in a more accessible and transparent application of Bayesian CFA within large sample models, an area in which it is currently applied only sporadically compared with small sample models (Muthén & Asparouhov, 2012; Rupp et al., 2004). It is probably because of this focus on small samples, and the accompanying exploration of the properties of the summary statistics, that these properties received less attention for large samples (Lee & Song, 2004). With the growing interest in and usage of Bayesian theory within the field of CFA and the growing number of large data sets (e.g., Cieciuch et al., 2014; Lung et al., 2011; Milojev et al., 2013), however, the need for a valid fit statistic within such conditions is evident and cannot be ignored. The data used for the empirical illustration are a clear example, as many studies within the field of educational and psychological measurement use large samples in which oversensitivity to negligible deviations is a legitimate issue. The BRMSEA, with accompanying cutoff points for its 90% PPI, is a valid and intelligible fit index, which can be used to evaluate model fit within large sample size Bayesian CFA models.
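As a minimal sketch of how the 90% PPI guideline might be applied in practice, the Python code below summarizes a set of posterior BRMSEA draws and checks them against the cutoff values named in the abstract (lower limit < .05, upper limit < .08). The function names, the equal-tailed interval, and the simulated draws are assumptions made for illustration only. The sketch does not compute the BRMSEA itself, which would require the per-iteration observed and replicated χ² from a fitted Bayesian CFA model.

```python
import random
import statistics

def ppi90(draws):
    """Equal-tailed 90% posterior probability interval from posterior draws."""
    cuts = statistics.quantiles(draws, n=20, method="inclusive")
    return cuts[0], cuts[-1]  # 5th and 95th percentiles

def brmsea_guideline(draws, lower_cut=0.05, upper_cut=0.08):
    """Check the PPI limits against the guideline cutoffs (one possible reading)."""
    lower, upper = ppi90(draws)
    return {
        "lower": lower,
        "upper": upper,
        "meets_guideline": lower < lower_cut and upper < upper_cut,
    }

# Hypothetical posterior draws; real draws would come from the MCMC output.
rng = random.Random(2017)
draws = [max(0.0, rng.gauss(0.04, 0.01)) for _ in range(5_000)]

print(brmsea_guideline(draws))
```

In line with the discussion above, such a check is best read as a supportive guideline alongside substantive theory and parameter estimates, not as a mechanical accept or reject rule.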

21 in total

1. Principles and practice in reporting structural equation analyses.

Authors: Roderick P McDonald; Moon-Ho Ringo Ho
Journal: Psychol Methods Date: 2002-03

2. Reliability and validity of the Chinese version of the Job Content Questionnaire in Taiwanese workers.

Authors: Yawen Cheng; Wei-Ming Luh; Yue-Liang Guo
Journal: Int J Behav Med Date: 2003

3. Structural Model Evaluation and Modification: An Interval Estimation Approach.

Authors: J H Steiger
Journal: Multivariate Behav Res Date: 1990-04-01

4. An Empirical Evaluation of the Use of Fixed Cutoff Points in RMSEA Test Statistic in Structural Equation Models.

Authors: Feinian Chen; Patrick J Curran; Kenneth A Bollen; James Kirby; Pamela Paxton
Journal: Sociol Methods Res Date: 2008-01-01

5. Testing small variance priors using prior-posterior predictive p values.

Authors: Herbert Hoijtink; Rens van de Schoot
Journal: Psychol Methods Date: 2017-04-03

6. A systematic review of Bayesian articles in psychology: The last 25 years.

Authors: Rens van de Schoot; Sonja D Winter; Oisín Ryan; Mariëlle Zondervan-Zwijnenburg; Sarah Depaoli
Journal: Psychol Methods Date: 2017-06

7. Next steps in Bayesian structural equation models: comments on, variations of, and extensions to Muthén and Asparouhov (2012).

Authors: David Rindskopf
Journal: Psychol Methods Date: 2012-09

8. An epidemiological approach to study fatigue in the working population: the Maastricht Cohort Study.

Authors: I J Kant; U Bültmann; K A P Schröer; A J H M Beurskens; L G P M Van Amelsvoort; G M H Swaen
Journal: Occup Environ Med Date: 2003-06

9. A gentle introduction to bayesian analysis: applications to developmental research.

Authors: Rens van de Schoot; David Kaplan; Jaap Denissen; Jens B Asendorpf; Franz J Neyer; Marcel A G van Aken
Journal: Child Dev Date: 2013-10-09

10. Editorial: Measurement Invariance.

Authors: Rens Van De Schoot; Peter Schmidt; Alain De Beuckelaer; Kimberley Lek; Marielle Zondervan-Zwijnenburg
Journal: Front Psychol Date: 2015-07-28
