
On the Choice of the Item Response Model for Scaling PISA Data: Model Selection Based on Information Criteria and Quantifying Model Uncertainty.

Alexander Robitzsch

Abstract

In educational large-scale assessment studies such as PISA, item response theory (IRT) models are used to summarize students' performance on cognitive test items across countries. In this article, the impact of the choice of the IRT model on the distribution parameters of countries (i.e., mean, standard deviation, percentiles) is investigated. Eleven different IRT models are compared using information criteria. Moreover, model uncertainty is quantified by estimating model error, which can be compared with the sampling error associated with the sampling of students. The PISA 2009 dataset for the cognitive domains mathematics, reading, and science is used as an example of the choice of the IRT model. It turned out that the three-parameter logistic IRT model with residual heterogeneity and a three-parameter IRT model with a quadratic effect of the ability θ provided the best model fit. Furthermore, model uncertainty was relatively small compared to sampling error regarding country means in most cases but was substantial for country standard deviations and percentiles. Consequently, it can be argued that model error should be included in the statistical inference of educational large-scale assessment studies.

Keywords:  PISA; item response model; model uncertainty; scaling

Year:  2022        PMID: 35741481      PMCID: PMC9223051          DOI: 10.3390/e24060760

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.738


1. Introduction

Item response theory (IRT) models [1] are central to analyzing dichotomous random variables. IRT models can be regarded as a factor-analytic multivariate technique to summarize a high-dimensional contingency table by a few latent factor variables of interest. Of particular interest is the application of an IRT model in educational large-scale assessment (LSA; [2]), such as the Programme for International Student Assessment (PISA; [3]), which summarizes the ability of students on test items in different cognitive domains. In the official reporting of outcomes of LSA studies such as PISA, the set of test items is represented by a unidimensional summary measure extracted by applying a unidimensional IRT model. Across different LSA studies, there is no consensus on which particular IRT model should be utilized [4,5,6]. In previous research, there are a few attempts to quantify the impact of IRT model choice on distribution parameters of interest such as country means, standard deviations, or percentiles. However, previous research did not systematically study a large number of competing IRT models [7,8,9]. Our research fills this gap because it conducts an empirical comparison involving 11 different IRT models for scaling PISA 2009 data in three ability domains. Moreover, we compare the model fit of these different IRT models and quantify model uncertainty using the model error. We compare the model error with the standard error associated with the uncertainty due to the sampling of students. The rest of the article is structured as follows. In Section 2, we discuss different IRT models used for scaling. Section 3 introduces the concepts of model selection and model uncertainty. Section 4 describes the method used to analyze PISA 2009 data. In Section 5, we discuss the empirical results for the PISA 2009 dataset. Finally, the paper closes with a discussion in Section 6.

2. Item Response Models for Scaling Cognitive Test Items

In this section, we present an overview of different IRT models that are used for scaling cognitive test data to obtain a unidimensional summary score [10,11,12]. In the rest of the article, we restrict ourselves to the treatment of dichotomous items. However, the principle can similarly be applied to polytomous items. Let $X = (X_1, \ldots, X_I)$ be the vector of $I$ dichotomous items $X_i \in \{0, 1\}$. A unidimensional IRT model [11,12] is a statistical model for the probability distribution $P(X = x)$ for $x = (x_1, \ldots, x_I) \in \{0, 1\}^I$, where

$$P(X = x; \gamma) = \int \prod_{i=1}^{I} P_i(\theta; \gamma_i)^{x_i} \, [1 - P_i(\theta; \gamma_i)]^{1 - x_i} \, f(\theta) \, d\theta \tag{1}$$

In Equation (1), a latent variable $\theta$ is involved that can be interpreted as a unidimensional summary of the test items $X$. The distribution of $\theta$ is modeled using a (semi)parametric distribution $F$ with density function $f$. In the rest of the article, we fix this distribution to be standard normal, but this can be weakened [13,14,15]. The item response functions (IRF) $P_i(\theta; \gamma_i) = P(X_i = 1 | \theta)$ model the relationship of the dichotomous item $X_i$ with the latent variable $\theta$, and we collect all item parameters in the vector $\gamma = (\gamma_1, \ldots, \gamma_I)$. In most cases, a parametric model is utilized in the estimation of the IRF (but see [16] for a nonparametric identification), which is indicated by the item parameter $\gamma_i$ in Equation (1). Note that in (1), item responses $X_i$ are conditionally independent given $\theta$; that is, after controlling for the latent ability $\theta$, pairs of items $X_i$ and $X_j$ are conditionally uncorrelated. This property is also known as the local independence assumption, which can be statistically tested [12,17]. The item parameters of the IRFs in Equation (1) can be estimated by (marginal) maximum likelihood (ML) using an EM algorithm [18,19,20]. The estimation can involve sampling weights for students [21] and a multi-matrix design in which only a subset of items is administered to each student [22]. In the likelihood formulation of (1), non-administered items are skipped in the multiplication term. In practice, the IRT model (1) is likely to be misspecified because the unidimensionality assumption is implausible. 
Moreover, the parametric assumption of the IRF might be incorrect. In addition, in educational LSA studies involving a large number of countries, there will typically be country differential item functioning [23,24,25]; that is, item parameters will vary across countries. In this case, applying ML using country-invariant item parameters defines the best approximation with respect to the Kullback–Leibler distance between the true distribution and a model-implied distribution. In this sense, an IRT model is selected by purpose and not for reasons of model fit because it will not even approximately fit the data (see also [26]). If country means are computed based on a particular IRT model, the parameter of interest should rather be interpreted as a descriptive statistic [27]. Using a particular model does not mean that we believe that the model (approximately) fits the data. Rather, we think that a vector of country means and item parameters summarizes the high-dimensional contingency table $P(X = x)$. Locally optimal weights [28] can be used to discuss the consequences for scoring when using a particular IRT model. A local scoring rule for the ability $\theta$ can be defined by a weighted sum $\sum_{i=1}^{I} \nu_i(\theta_0) X_i$ for abilities $\theta$ near $\theta_0$. The ability is determined by ML estimation using previously estimated item parameters. The locally optimal weights can be derived as (see [27,28,29]):

$$\nu_i(\theta) = \frac{P_i'(\theta)}{P_i(\theta) \, [1 - P_i(\theta)]} \tag{2}$$

where $P_i(\theta)$ denotes the IRF of item $i$ and $P_i'(\theta)$ its derivative with respect to $\theta$. If the local weight $\nu_i(\theta)$ (also referred to as the local item score) varies across different $\theta$ values, the impact of single items on the ability differs across the ability range. This property can be viewed critically, particularly for country comparisons in LSA studies [29]. Subsequently, we will discuss the properties of different IRT models regarding the optimal weights $\nu_i(\theta)$. In this article, several competing functional forms of the IRF are compared, and their consequences for distribution parameters (e.g., means, standard deviations, and percentiles) for the prominent LSA study PISA are discussed. 
Performing such a fit index contest [30,31] does not necessarily mean that we favor model selection based on model fit. In the next Section 2.1, we discuss several IRFs later utilized for model comparisons. In Section 2.2, we investigate the behavior of the estimated ability distribution under misspecified IRFs. Finally, we conclude this section with some thoughts on the choice of the IRT model (see Section 2.3).
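The marginal probability of a response pattern in Equation (1) can be illustrated numerically. The following is a minimal sketch, not the estimation routine used in the article: it assumes a logistic IRF with hypothetical item parameters, replaces the integral over the standard normal ability distribution by a sum over an equally spaced grid, and marks non-administered items with `None` so that they are skipped in the product. All function names and parameter values are assumptions made for illustration.

```python
import math
from itertools import product


def logistic(x):
    """Logistic link function."""
    return 1.0 / (1.0 + math.exp(-x))


def irf(theta, a, b):
    """Illustrative logistic IRF with discrimination a and difficulty b."""
    return logistic(a * (theta - b))


def normal_grid(n_nodes=201, lo=-6.0, hi=6.0):
    """Equally spaced nodes with normalized standard-normal weights."""
    step = (hi - lo) / (n_nodes - 1)
    nodes = [lo + k * step for k in range(n_nodes)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]


def pattern_probability(responses, items, nodes, weights):
    """Marginal probability of a response pattern.

    Entries equal to None mark non-administered items and are skipped
    in the product, as in a multi-matrix design.
    """
    prob = 0.0
    for theta, w in zip(nodes, weights):
        lik = 1.0
        for x, (a, b) in zip(responses, items):
            if x is None:
                continue
            p = irf(theta, a, b)
            lik *= p if x == 1 else 1.0 - p
        prob += w * lik
    return prob


# Hypothetical item parameters (a_i, b_i), chosen only for illustration.
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]
nodes, weights = normal_grid()

# The probabilities of all 2^I response patterns add up to one.
all_patterns = list(product([0, 1], repeat=len(items)))
total = sum(pattern_probability(x, items, nodes, weights) for x in all_patterns)

# Skipping item 2 equals summing over both of its possible responses.
p_skip = pattern_probability((1, None, 0), items, nodes, weights)
p_sum = (pattern_probability((1, 0, 0), items, nodes, weights)
         + pattern_probability((1, 1, 0), items, nodes, weights))
```

The grid approximation is a deliberate simplification; operational software would typically use Gauss-Hermite or adaptive quadrature inside an EM algorithm.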

2.1. Different Functional Forms for IRT Models

In this section, we discuss several parametric specifications of the IRF that appear in the unidimensional IRT model defined in Equation (1). The one-parameter logistic model (1PL; also known as the Rasch model; [32,33]) employs a logistic link function and parametrizes an item with a single parameter $b_i$ that is called item difficulty. The model is defined by

$$P(X_i = 1 | \theta) = \Psi(a(\theta - b_i)), \quad \Psi(x) = [1 + \exp(-x)]^{-1} \tag{3}$$

where $a$ is the common item discrimination parameter. Alternatively, one can fix the parameter $a$ to 1 and estimate the standard deviation of the latent variable $\theta$. Notably, the sum score $\sum_{i=1}^{I} X_i$ is a sufficient statistic for $\theta$ in the 1PL model. The 1PL model has wide applicability in educational assessment [34,35]. The 1PL model uses a symmetric link function. However, asymmetric link functions could also be used for choosing an IRF. The cloglog link function is used in the one-parameter cloglog (1PCL) model [36,37]:

$$P(X_i = 1 | \theta) = 1 - \exp(-\exp(a(\theta - b_i))) \tag{4}$$

Consequently, items are differentially weighted in the estimation of $\theta$ at each $\theta$ location, and the sum score is not a sufficient statistic. The cloglog link function has similar behavior to the logistic link function in the 1PL model in the lower tail (i.e., for negative values of $\theta$), but differs from it in the upper tail. The one-parameter loglog (1PLL) IRT model is defined by

$$P(X_i = 1 | \theta) = \exp(-\exp(-a(\theta - b_i))) \tag{5}$$

In contrast to the cloglog link function, the loglog function is similar to the logistic link function in the upper tail (i.e., for positive $\theta$ values), but different from it in the lower tail. Figure 1 compares the 1PL, 1PCL, and 1PLL models regarding the IRF and the locally optimal weight $\nu_i(\theta)$. The loglog IRT model (1PLL) stretches more in the lower tail than the logistic link function. The converse is true for the cloglog IRT model (1PCL), which is considerably stretched in the upper tail. In the right panel of Figure 1, locally optimal weights are displayed. The 1PL model has a constant weight of 1, while the local contribution of the item score for $\theta$ differs across the $\theta$ range for the 1PCL and the 1PLL model. 
The 1PCL model provides a higher local item score for higher values than for lower values. Hence, more difficult items receive lower local item scores than easier items. In contrast, the 1PLL model results in higher local item scores for difficult items compared to easier items. This idea is reflected in the D-scoring method [38,39].
Figure 1

Item response functions (left panel) and locally optimal weights (right panel) for the 1PL, 1PCL and 1PLL models.
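The qualitative pattern described for Figure 1 can be checked numerically. The sketch below assumes the standard forms of the logistic, cloglog, and loglog links with difficulty 0 and discrimination 1, and it assumes that the locally optimal weight is the slope of the IRF divided by P(1 − P) (Birnbaum's form). The function names are illustrative, and the slope is approximated by a central difference rather than derived analytically.

```python
import math


def irf_1pl(theta, b=0.0, a=1.0):
    """Logistic IRF (1PL)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def irf_1pcl(theta, b=0.0, a=1.0):
    """Complementary log-log IRF (1PCL)."""
    return 1.0 - math.exp(-math.exp(a * (theta - b)))


def irf_1pll(theta, b=0.0, a=1.0):
    """Log-log IRF (1PLL)."""
    return math.exp(-math.exp(-a * (theta - b)))


def local_weight(irf, theta, h=1e-5):
    """Assumed locally optimal weight: P'(theta) / (P(theta) * (1 - P(theta))).

    The derivative is approximated by a central difference.
    """
    p = irf(theta)
    slope = (irf(theta + h) - irf(theta - h)) / (2.0 * h)
    return slope / (p * (1.0 - p))


grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
models = {"1PL": irf_1pl, "1PCL": irf_1pcl, "1PLL": irf_1pll}
weights = {name: [local_weight(f, t) for t in grid] for name, f in models.items()}

for name, values in weights.items():
    print(name, [round(v, 3) for v in values])
```

Under these assumptions, the 1PL weights stay at 1 across the grid, the 1PCL weights increase with θ, and the 1PLL weights decrease with θ, which matches the description of the right panel of Figure 1.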

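Notably, the 1PCL and 1PLL models use asymmetric IRFs. One can try to estimate the extent of asymmetry in IRFs by using a generalized logistic link function (also called the Stukel link function; [40]):

$$P(X_i = 1 | \theta) = \Psi_{\alpha_1, \alpha_2}(a(\theta - b_i)) \tag{6}$$

where the generalized logit link function is defined as

$$\Psi_{\alpha_1, \alpha_2}(x) = \Psi(S_{\alpha_1, \alpha_2}(x)), \quad S_{\alpha_1, \alpha_2}(x) = \begin{cases} \alpha_1^{-1} [\exp(\alpha_1 x) - 1] & \text{if } x \ge 0 \text{ and } \alpha_1 > 0 \\ x & \text{if } x \ge 0 \text{ and } \alpha_1 = 0 \\ -\alpha_1^{-1} \log(1 - \alpha_1 x) & \text{if } x \ge 0 \text{ and } \alpha_1 < 0 \\ -\alpha_2^{-1} [\exp(-\alpha_2 x) - 1] & \text{if } x < 0 \text{ and } \alpha_2 > 0 \\ x & \text{if } x < 0 \text{ and } \alpha_2 = 0 \\ \alpha_2^{-1} \log(1 + \alpha_2 x) & \text{if } x < 0 \text{ and } \alpha_2 < 0 \end{cases} \tag{7}$$

and $\Psi$ denotes the logistic link function. In this 1PGL model, common shape parameters $\alpha_1$ and $\alpha_2$ for the IRFs are additionally estimated. The 1PL, 1PCL and 1PLL models can be obtained as special cases of (6). The four models 1PL, 1PCL, 1PLL, and 1PGL have in common that they only estimate one parameter per item. The assumption of a common item discrimination is weakened in the two-parameter logistic (2PL) IRT model [28], a generalization of the 1PL model in which the discriminations $a_i$ are made item-specific:

$$P(X_i = 1 | \theta) = \Psi(a_i(\theta - b_i)) \tag{8}$$

Note that $\sum_{i=1}^{I} a_i X_i$ is a sufficient statistic for $\theta$. Hence, items are differentially weighted by the weight $a_i$, which is determined within the statistical model. Further, the assumption of a symmetric logistic link function might be weakened, and a four-parameter generalized logistic (4PGL) model can be estimated:

$$P(X_i = 1 | \theta) = \Psi_{\alpha_{1i}, \alpha_{2i}}(a_i(\theta - b_i)) \tag{9}$$

In the IRT model (9), the shape parameters $\alpha_{1i}$ and $\alpha_{2i}$ are made item-specific. Hence, the extent of asymmetry of the IRF is estimated for each item. The 2PL model (8) can be generalized to the three-parameter logistic (3PL; [41]) IRT model that assumes an item-specific lower asymptote $g_i$ larger than 0 for the IRF:

$$P(X_i = 1 | \theta) = g_i + (1 - g_i) \, \Psi(a_i(\theta - b_i)) \tag{10}$$

Parameter $g_i$ is often referred to as a (pseudo-)guessing parameter [42,43]. The 3PL model might be reasonable if multiple-choice items are used in the test. The 3PL model can be generalized in the four-parameter logistic (4PL; [44,45,46]) model such that it also contains upper asymptotes $1 - s_i$ smaller than 1 for the IRF:

$$P(X_i = 1 | \theta) = g_i + (1 - s_i - g_i) \, \Psi(a_i(\theta - b_i)) \tag{11}$$

The parameter $s_i$ is often referred to as a slipping parameter, which characterizes careless (incorrect) item responses [47]. In contrast to the 1PL, 2PL, or the 3PL model, the 4PL model has not yet been applied in the operational practice of LSA studies. However, there are a few research papers that apply the 4PL model to LSA data [48,49]. 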
It should be mentioned that the 3PL or the 4PL model might suffer from empirical nonidentifiability [45,50,51,52]. This is why prior distributions for guessing (3PL and 4PL) and slipping (4PL) parameters are often required for stabilizing model estimation. As pointed out by an anonymous reviewer, the use of prior distributions changes the meaning of the IRT model. However, we think that identifiability issues are of less concern in the large-sample-size situations that are present in educational LSA studies. If item parameters are obtained in a pooled sample of students comprising all countries, sample sizes are typically above 10,000. In this case, the empirical data will typically dominate prior distributions, and prior distributions are therefore not needed. In Figure 2, IRFs and locally optimal weights for the 4PL, 3PL, and 2PL models are displayed. The item parameters for the 4PL model were , , , and . The parameters of the displayed 2PL and 3PL models were obtained by minimizing the weighted squared distance between the IRF of the 4PL model and the simpler model under the constraint that the model-implied item means coincide under the normal distribution assumption of θ. Importantly, it can be seen in the right panel that the 2PL model has a constant local item score, while the local item score is increasing for the 3PL model and inversely U-shaped for the 4PL model. Hence, under the 4PL model, a correct response yields a high local item score only if the item is neither too easy nor too difficult for the student.
Figure 2

Item response functions (left panel) and locally optimal weights (right panel) for the 4PL, 3PL and 2PL models.
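The shapes of the local item scores in Figure 2 can be reproduced in a small sketch. It assumes the usual 4PL parametrization with a lower asymptote g and an upper asymptote 1 − s, of which the 3PL (s = 0) and 2PL (g = s = 0) models are special cases, and it assumes the locally optimal weight P′/(P(1 − P)). The parameter values below are hypothetical and are not the values used for Figure 2.

```python
import math


def local_weight_4pl(theta, a, b, g=0.0, s=0.0):
    """Assumed locally optimal weight P'/(P(1-P)) for a 4PL item.

    g is the lower asymptote (guessing), s the slipping parameter,
    so the upper asymptote is 1 - s. Setting s = 0 gives the 3PL model,
    and g = s = 0 gives the 2PL model.
    """
    psi = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    p = g + (1.0 - s - g) * psi
    slope = (1.0 - s - g) * a * psi * (1.0 - psi)
    return slope / (p * (1.0 - p))


# Hypothetical item parameters chosen only for illustration.
a, b, g, s = 1.2, 0.0, 0.2, 0.1
grid = [-4.0, -2.0, 0.0, 2.0, 4.0]

w_2pl = [local_weight_4pl(t, a, b) for t in grid]
w_3pl = [local_weight_4pl(t, a, b, g=g) for t in grid]
w_4pl = [local_weight_4pl(t, a, b, g=g, s=s) for t in grid]

for name, values in (("2PL", w_2pl), ("3PL", w_3pl), ("4PL", w_4pl)):
    print(name, [round(v, 3) for v in values])
```

With these hypothetical values, the 2PL weight is constant and equal to a, the 3PL weight increases with θ, and the 4PL weight is largest in the middle of the grid and small at both ends.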

A different strand of model extensions also starts from the 2PL model but introduces more item parameters to model asymmetry or nonlinearity while retaining the logistic link function. The three-parameter logistic model with quadratic effects (3PLQ) includes additional item-specific quadratic effects $q_i$ of $\theta$ in the 2PL model [42,50]:

$$P(X_i = 1 | \theta) = \Psi(a_i(\theta - b_i) + q_i \theta^2) \tag{12}$$

where $\Psi$ denotes the logistic link function. Due to the presence of the $q_i$ parameter, asymmetric IRFs can be modeled. As a disadvantage, the IRF in (12) need not be monotone, although a monotonicity constraint can be incorporated in the estimation [53,54]. The three-parameter model with residual heterogeneity (3PLRH; Equation (13)) extends the 2PL model by including an asymmetry parameter [55,56]. The 3PLRH model has been successfully applied to LSA data and often resulted in superior model fit compared to the 3PL model [57,58]. In Figure 3, IRFs and locally optimal weights are displayed for three parameter specifications in the 3PLRH model (i.e., , , and ). One can see that the introduced asymmetry parameter governs the behavior of the IRF in the lower or upper tails. The displayed IRFs mimic the 1PL, 1PCL, and 1PLL models. Moreover, with asymmetry parameters different from zero, different locally optimal weights across the $\theta$ range are introduced. Notably, a positive asymmetry parameter is associated with a larger local item score in the lower tail. The opposite is true for a negative asymmetry parameter.
Figure 3

Item response functions (left panel) and locally optimal weights (right panel) for different IRFs of the 3PLRH model.

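Finally, the 3PL model is extended in the four-parameter logistic model with quadratic effects (4PLQ), in which additional item-specific quadratic effects $q_i$ for $\theta$ are included [50]:

$$P(X_i = 1 | \theta) = g_i + (1 - g_i) \, \Psi(a_i(\theta - b_i) + q_i \theta^2) \tag{14}$$

with $\Psi$ denoting the logistic link function.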

2.2. Ability Estimation under Model Misspecification

In this section, we study the estimation of θ when working with a misspecified IRT model. In the treatment, we assume that there is a true IRT model with unknown IRFs. We study the bias in estimated abilities for a fixed value of θ if misspecified IRFs are utilized. This situation refers to the empirical application in an LSA study, in which a misspecified IRF is estimated based on data comprising all countries, and the distribution of θ is evaluated at the level of countries. The misspecification emerges due to incorrectly assumed functional forms of the IRF or the presence of differential item functioning at the level of countries [24,59]. We assume that there are true but unknown IRFs with a continuously differentiable function and denotes the logistic link function. We assume that the local independence assumption holds in the IRT model. For estimation, we use a misspecified IRT model with IRFs with a continuously differentiable function . Notably, there exists a misspecification if . In Appendix A, we derive an estimate under the misspecified IRT model if  is the data-generating ability value under the true IRT model. Hence, we derive a transformation function , where is the bias function that indicates the bias in the estimated ability due to the application of the misspecified IRT model. We assume that the item parameters under the misspecified IRT model are known (i.e., the IRFs are known). Then, the ML estimate is determined based on the misspecified IRT model taking into account that solves the maximum likelihood equation under the true IRT model. It is assumed that the number of items I is large. Moreover, we apply two Taylor approximations that rely on the assumption that is sufficiently small. The derivation in Appendix A (see Equation (A10)) provides where the bias term B is defined by and A is determined by item information functions (see Appendix A). Equation (15) clarifies how the misspecified IRFs enter the computation of . 
Interestingly, the extent of misspecification is weighted by . Equation (15) provides practical consequences when applying misspecified IRT models. For instance, might be the true country percentile, referring to a true IRT model. If the transformation is monotone, the percentile with the misspecified model is and Equation (15) quantifies a bias for the estimated percentile. Moreover, let be the density of the ability under the true IRT model for country c; then, one can determine the bias in the country means by using (15). The true country mean of country c is given by . The estimated country mean under the misspecified model is given by Note that the bias term will typically be country-specific because the true IRF are country-specific due to differential item functioning at the level of countries. Hence, item-specific relative country effects regarding the IRF that are uniformly weighted in Equation (15) can be considered a desirable property. In the case of a fitted 2PL model, it holds that , and deviations are weighted by in the derived bias in (15). For the 1PL model, the deviations are equally weighted due to . This property might legitimate the use of the often ill-fitting 1PL model because model deviations are equally weighted across items (see [27]). We elaborate on this discussion in the following Section 2.3.
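The idea of an ability transformation under a misspecified model can be illustrated numerically. The sketch below does not implement the closed-form approximation of Equation (15). Instead, it assumes a large number of items, so that observed responses can be replaced by their expectations under the true model, and it solves the ML equation of a misspecified 1PL model by bisection. The true model is taken to be a 2PL model with hypothetical parameters, and the misspecified 1PL model is assumed to use the same difficulties with a discrimination fixed at 1; in practice, its item parameters would be estimated.

```python
import math


def logistic(x):
    """Logistic link function."""
    return 1.0 / (1.0 + math.exp(-x))


def theta_under_1pl(theta0, true_items, lo=-10.0, hi=10.0, tol=1e-10):
    """Ability value implied by a misspecified 1PL model.

    true_items holds hypothetical (a_i, b_i) pairs of a true 2PL model.
    For the 1PL model, the ML equation reduces to matching the expected
    sum score, which is monotone in theta, so bisection is sufficient.
    """
    target = sum(logistic(a * (theta0 - b)) for a, b in true_items)
    difficulties = [b for _, b in true_items]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        score = sum(logistic(mid - b) for b in difficulties)
        if score < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical true item parameters (a_i, b_i).
items_equal = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]   # true model is a 1PL model
items_steep = [(1.5, -1.0), (1.5, 0.0), (1.5, 1.0)]   # true discriminations are larger

no_bias = theta_under_1pl(1.0, items_equal) - 1.0
bias_upper = theta_under_1pl(1.0, items_steep) - 1.0
bias_lower = theta_under_1pl(-1.0, items_steep) - (-1.0)
```

In this constructed example, the transformation is the identity when the fitted model is correct, and abilities are pushed away from zero when the true discriminations exceed the value assumed by the 1PL model.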

2.3. A Few Remarks on the Choice of the IRT Model

In Section 2.1, we introduced several IRT models and it might be asked which criteria should be used for selecting one among these models. We think that model-choice principles depend on the purpose of the scaling models. Pure research purposes (e.g., understanding cognitive processes underlying item response behavior; modeling item complexity) must be distinguished from policy-relevant reporting practice (e.g., country rankings in educational LSA studies). Several researchers have argued that model choice should be primarily a matter of validity and not based on purely statistical criteria [27,60,61,62,63,64]. Myung et al. [63] discussed several criteria for model selection with a focus on cognition science. We would like to emphasize that these criteria might be differently weighted if applied to educational LSA studies that are not primarily conducted for research purposes. The concept of the interpretability of a selected IRT model means that the model parameters must be linked to psychological processes and constructs. We think that simple unidimensional IRT models in LSA studies are not used because one believes a unidimensional underlying (causal) variable exists. The chosen IRT model is used for summarizing item response patterns and for providing simple and interpretable descriptive statistics. In this sense, we have argued elsewhere [27] that model fit should not have any relevance for model selection in LSA studies. However, it seems in the official LSA publications such as those from PISA that information criteria are also used for justifying the use of scaling models [5]. We would like to note that these model comparisons are often biased in the sense that the personally preferred model is often the winner of this fit contest, and other plausible IRT models are excluded from these contests because they potentially could provide a better model fit. 
Information-criteria-based model selection falls into the criterion of generalizability according to Myung et al. [63]. These criteria are briefly discussed in Section 3.1. Notably, different IRT models imply a differential weighting of items in the summary variable [29,65]. This characteristic is quantified with locally optimal weights (see Section 2.1). The differential item weighting might impair the comparison of subgroups. More critically, the weighting of items is, in most applications, determined by statistical models and might, hence, have undesirable consequences because practitioners have a different, implicitly defined weighting of items in mind when composing a test from a single set of items. Nevertheless, our study investigates the consequences of using different IRT models for LSA data. To sum up, which of the models should be chosen in operational practice is a difficult question that should not be (entirely) determined by statistical criteria.

3. Model Selection and Model Uncertainty

3.1. Model Selection

It is of particular interest to conduct model comparisons of the different scaling models that involve different IRFs (see Section 2.1). The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are used for conducting model comparisons in this article (see [66,67,68,69]). Moreover, the Gilula–Haberman penalty (GHP; [70,71,72]) is used as an effect size that is relatively independent of the sample size and the number of items. The GHP is defined as $\mathrm{GHP} = \mathrm{AIC} / (2 \sum_{p=1}^{N} I_p)$, where $I_p$ is the number of items administered to person $p$. The GHP can be seen as a normalized variant of the AIC. A difference in GHP larger than 0.001 is a notable difference regarding global model fit [72,73].
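The three criteria can be computed from a fitted model's log-likelihood as sketched below. The sketch uses the standard definitions of AIC and BIC and assumes that the GHP divides the AIC by twice the total number of observed item responses. The log-likelihoods, parameter counts, and sample sizes are hypothetical and are not taken from Table 1.

```python
import math


def aic(loglik, n_par):
    """Akaike information criterion."""
    return -2.0 * loglik + 2.0 * n_par


def bic(loglik, n_par, n_persons):
    """Bayesian information criterion."""
    return -2.0 * loglik + n_par * math.log(n_persons)


def ghp(loglik, n_par, n_responses):
    """Gilula-Haberman penalty: AIC divided by twice the number of
    observed item responses (assumed normalization)."""
    return aic(loglik, n_par) / (2.0 * n_responses)


# Hypothetical fits of two competing models to the same data.
n_persons = 13000
n_responses = 200000
fit_a = {"loglik": -107700.0, "n_par": 70}
fit_b = {"loglik": -107500.0, "n_par": 105}

aic_a = aic(fit_a["loglik"], fit_a["n_par"])
aic_b = aic(fit_b["loglik"], fit_b["n_par"])
bic_a = bic(fit_a["loglik"], fit_a["n_par"], n_persons)
bic_b = bic(fit_b["loglik"], fit_b["n_par"], n_persons)
delta_ghp = (ghp(fit_a["loglik"], fit_a["n_par"], n_responses)
             - ghp(fit_b["loglik"], fit_b["n_par"], n_responses))

# Rule of thumb from the text: differences above 0.001 are notable.
notable = delta_ghp > 0.001
```

In this hypothetical comparison, the more complex model has the smaller AIC and BIC, but the GHP difference stays below 0.001 and would therefore not count as a notable gain in global model fit.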

3.2. Model Uncertainty

Country comparisons in LSA studies such as PISA can depend on the chosen IRT model. In this case, choosing a single best-fitting model might be questionable [74,75]. To investigate the impact of model dependency, we discuss the framework of model uncertainty [76,77,78,79,80,81,82,83,84,85,86] in this section and quantify it by a statistic that characterizes model error. To quantify model uncertainty, each model $m = 1, \ldots, M$ is associated with a weight $w_m \ge 0$, and we assume $\sum_{m=1}^{M} w_m = 1$ [87]. To adequately represent the diversity of findings from different models, an equal weighting of models has been criticized [88]. Instead, particular models in the set of all models are downweighted if they are highly dependent and produce similar results [89,90,91]. We believe that model fit should not influence model weights [92]. The goal is to represent differences between models in the model error. If the model weights were determined by model fit, plausible but non-fitting models such as the 1PL model would receive a model weight of zero, which is not preferred because the 1PL model should not be excluded from the set of specified models. Moreover, if model weights are computed based on information criteria [80], only one or a few models receive weights that differ from zero, and all other models do not impact the statistical inference. This property is why we do not prefer Bayesian model averaging in our application [82,93,94]. Let $\gamma = (\gamma_1, \ldots, \gamma_M)$ be the vector of a statistical parameter of all models. We can define a composite parameter as

$$\gamma_{\mathrm{comp}} = \sum_{m=1}^{M} w_m \gamma_m \tag{18}$$

We can also define a population-level model error (ME) as

$$\mathrm{ME} = \sqrt{\sum_{m=1}^{M} w_m (\gamma_m - \gamma_{\mathrm{comp}})^2} \tag{19}$$

Now, assume that data is available and $\hat{\gamma} = (\hat{\gamma}_1, \ldots, \hat{\gamma}_M)$ is estimated. The estimate $\hat{\gamma}$ is multivariate normally distributed with mean $\gamma$ and a covariance matrix $V$. Typically, estimates of different models using the same dataset will be (strongly) positively correlated. An estimate of the composite parameter is given as

$$\hat{\gamma}_{\mathrm{comp}} = \sum_{m=1}^{M} w_m \hat{\gamma}_m \tag{20}$$

Due to $E(\hat{\gamma}_m) = \gamma_m$, we obtain that $\hat{\gamma}_{\mathrm{comp}}$ is an unbiased estimate of $\gamma_{\mathrm{comp}}$. The empirical model error is defined as

$$\widehat{\mathrm{ME}} = \sqrt{\sum_{m=1}^{M} w_m (\hat{\gamma}_m - \hat{\gamma}_{\mathrm{comp}})^2} \tag{21}$$

Now, it can be shown that $\widehat{\mathrm{ME}}^2$ is a positively biased estimate of $\mathrm{ME}^2$ because the former also contains sampling variability. Let $e_m$ be the $m$-th unit vector of length $M$ that has an entry of 1 at the $m$-th entry and 0 otherwise. This notation enables the representation $\gamma_m = e_m^\top \gamma$. Define $w = (w_1, \ldots, w_M)$. From (18), we obtain $\gamma_{\mathrm{comp}} = w^\top \gamma$ and, hence, $\gamma_m - \gamma_{\mathrm{comp}} = (e_m - w)^\top \gamma$. Similarly, we can write $\hat{\gamma}_m - \hat{\gamma}_{\mathrm{comp}} = (e_m - w)^\top \hat{\gamma}$. Furthermore, we can then rewrite the expected value of $\widehat{\mathrm{ME}}^2$ as (see Equation (20))

$$E(\widehat{\mathrm{ME}}^2) = \mathrm{ME}^2 + \sum_{m=1}^{M} w_m (e_m - w)^\top V (e_m - w) \tag{22}$$

where the second term is a biasing term that is the estimated variation across models due to sampling error. This term can be estimated if an estimate of the covariance matrix $V$ of the vector of model estimates is available. As an alternative, the bias in $\widehat{\mathrm{ME}}^2$ can be removed by estimating the bias term in (22) with resampling techniques such as bootstrap, jackknife or (balanced) half sampling [21,95]. Let $\hat{B}$ be an estimate of the bias; a bias-corrected model error can be estimated by

$$\mathrm{ME}_{\mathrm{bc}} = \sqrt{\max(\widehat{\mathrm{ME}}^2 - \hat{B}, 0)} \tag{23}$$

One can define a total error (TE) that includes the sampling error (SE) due to person sampling and the model error estimate $\mathrm{ME}_{\mathrm{bc}}$:

$$\mathrm{TE} = \sqrt{\mathrm{SE}^2 + \mathrm{ME}_{\mathrm{bc}}^2} \tag{24}$$

This total error also takes the variability in the model choice into account and allows for broader inference. Confidence intervals constructed from the TE will be wider than ordinary confidence intervals that are only based on the SE.
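The quantities of this section can be sketched in a few lines. The sketch assumes that the composite parameter is the weighted mean of the model-specific estimates, that the empirical model error is the weighted root mean squared deviation from the composite, that the sampling-induced bias is a weighted sum of quadratic forms in the covariance matrix of the estimates, and that the total error combines sampling error and bias-corrected model error in quadrature. Truncating the corrected squared model error at zero is an additional assumption. All numbers are hypothetical.

```python
import math


def composite(estimates, weights):
    """Weighted composite of model-specific estimates."""
    return sum(w * g for w, g in zip(weights, estimates))


def model_error_sq(estimates, weights):
    """Squared empirical model error (weighted variance across models)."""
    comp = composite(estimates, weights)
    return sum(w * (g - comp) ** 2 for w, g in zip(weights, estimates))


def sampling_bias(cov, weights):
    """Bias of the squared empirical model error due to sampling error:
    sum_m w_m (e_m - w)' V (e_m - w)."""
    n = len(weights)
    total = 0.0
    for m in range(n):
        contrast = [(1.0 if j == m else 0.0) - weights[j] for j in range(n)]
        quad = sum(contrast[j] * cov[j][k] * contrast[k]
                   for j in range(n) for k in range(n))
        total += weights[m] * quad
    return total


def bias_corrected_me(estimates, weights, cov):
    """Bias-corrected model error, truncated at zero (assumption)."""
    return math.sqrt(max(model_error_sq(estimates, weights)
                         - sampling_bias(cov, weights), 0.0))


def total_error(se, me_bc):
    """Total error combining sampling error and model error."""
    return math.sqrt(se ** 2 + me_bc ** 2)


# Hypothetical estimates of one country mean under three models.
estimates = [500.0, 503.0, 498.0]
weights = [0.5, 0.3, 0.2]
cov = [[4.0, 3.8, 3.7],
       [3.8, 4.2, 3.9],
       [3.7, 3.9, 4.5]]   # strongly positively correlated estimates

comp = composite(estimates, weights)
me_sq = model_error_sq(estimates, weights)
bias = sampling_bias(cov, weights)
me_bc = bias_corrected_me(estimates, weights, cov)
te = total_error(se=2.0, me_bc=me_bc)
```

If all models were estimated with the same variance and correlated perfectly, the contrasts would cancel and the bias term would vanish; with less than perfect correlation, part of the empirical variation across models is attributed to sampling error and removed.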

4. Method

In our empirical application, we used data from PISA 2009 to assess the influence of the choice of different scaling models. Similar research with substantially fewer IRT modeling alternatives was conducted in [8,96,97].

4.1. Data

PISA 2009 data was used in this empirical application [3]. The impact of the choice of the scaling model was investigated for the three cognitive domains mathematics, reading, and science. In total, 35, 101, and 53 items were included in our analysis for the domains mathematics, reading, and science, respectively. All polytomous items were dichotomously recoded, with only the highest category being recoded as correct. A total number of 26 countries were included in the analysis. The median sample sizes at the country level were Med = 5398 (M = 8578.0, Min = 3628, Max = 30,905) for reading, Med = 3761 (M = 5948.2, Min = 2510, Max = 21,379) for mathematics, and Med = 3746.5 (M = 5944.2, Min = 2501, Max = 21,344) for science. For all analyses at the country level, student weights were taken into account. Within a country, student weights were normalized to a sum of 5000, so that all countries contributed equally to the analyses.

4.2. Analysis

We compared the fit of 11 different scaling models (see Section 2.1) in an international calibration sample [98]. To this end, 500 students were randomly sampled from each of the 26 countries and each of the three cognitive domains. Model comparisons were conducted based on the resulting samples involving 13,000 students. In the next step, the item parameters obtained from the international calibration sample were fixed in the country-specific scaling models. In this step, plausible values for the θ distribution in each of the countries were drawn [99,100]. We did not include student covariates when drawing plausible values. Note that sampling weights were taken into account in this scaling step. The resulting plausible values were subsequently linearly transformed such that a weighted mean of 500 and a weighted standard deviation of 100 hold in the total sample of students comprising all countries. Weighted descriptive statistics of the θ distribution and their standard errors were computed according to the Rubin rules of multiple imputation [3]. The only difference from the original PISA approach is that we applied balanced half sampling instead of balanced repeated replication for computing standard errors (see [21,101]). Balanced half sampling has the advantage of easy computation of the bias for model error (see Equation (23)). For quantifying model uncertainty, model weights were assigned prior to analysis based on the principles discussed in Section 3.2. First, because the 1PL, 2PL, and the 3PL are the most frequently used models in LSA studies, we decided that the sum of their model weights should exceed 0.50. Second, the weights of models with similar behavior (i.e., models that result in similar country means) should be decreased. These considerations resulted in the following weights: 1PL: 0.273; 2PL: 0.136; 3PL: 0.136; 1PCL: 0.061; 1PLL: 0.061; 1PGL: 0.061; 3PLQ: 0.068; 3PLRH: 0.068; 4PGL: 0.045; 4PL: 0.045; 4PLQ: 0.045. 
It is evident that a different choice of model weights will change the composite parameter of interest and the associated model error. We did not opt for a sensitivity analysis employing an alternative set of model weights in order to ease the presentation of results in this paper. In order to study the importance of sampling error (SE) and the bias-corrected model error ($\mathrm{ME}_{\mathrm{bc}}$), we computed an error ratio (ER) that is defined by $\mathrm{ER} = \mathrm{ME}_{\mathrm{bc}} / \mathrm{SE}$. Moreover, we computed the total error as $\mathrm{TE} = \sqrt{\mathrm{SE}^2 + \mathrm{ME}_{\mathrm{bc}}^2}$. All analyses were carried out with the statistical software R [102]. The different IRT models were fitted using the xxirt() function in the R package sirt [103]. Plausible value imputation was conducted using the R package TAM [104].
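The model weights listed above can be applied directly to the model-specific country means. The following sketch uses the PISA 2009 reading means for Australia (AUS) reported in Table 2 and assumes that the composite parameter is the weighted mean and that the empirical model error is the weighted root mean squared deviation from it. Because the rounded weights add up to 0.999, they are renormalized, which is an assumption of this sketch. The bias correction is omitted because it requires replicate estimates that are not available here.

```python
import math

# Model weights as reported in the text (rounded to three decimals).
weights = {
    "1PL": 0.273, "2PL": 0.136, "3PL": 0.136,
    "1PCL": 0.061, "1PLL": 0.061, "1PGL": 0.061,
    "3PLQ": 0.068, "3PLRH": 0.068,
    "4PGL": 0.045, "4PL": 0.045, "4PLQ": 0.045,
}

# PISA 2009 reading country means for Australia (AUS) from Table 2.
aus_means = {
    "1PL": 515.1, "1PCL": 515.8, "1PLL": 514.8, "1PGL": 515.2,
    "2PL": 515.7, "4PGL": 515.2, "3PLQ": 515.2, "3PLRH": 515.5,
    "3PL": 515.0, "4PL": 515.0, "4PLQ": 514.5,
}

# Renormalize because the rounded weights sum to 0.999 (assumption).
total_weight = sum(weights.values())
norm = {m: w / total_weight for m, w in weights.items()}

# Share of the three most frequently used models.
core_share = weights["1PL"] + weights["2PL"] + weights["3PL"]

# Composite country mean and empirical (not bias-corrected) model error.
composite_mean = sum(norm[m] * aus_means[m] for m in norm)
me_uncorrected = math.sqrt(
    sum(norm[m] * (aus_means[m] - composite_mean) ** 2 for m in norm)
)

print(round(composite_mean, 1), round(me_uncorrected, 2))
```

The composite agrees with the weighted mean reported for Australia in Table 2 after rounding, and the uncorrected model error is slightly larger than the reported bias-corrected value, as expected when part of the variation across models is due to sampling error.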

5. Results

5.1. Model Comparisons Based on Information Criteria

The 11 different scaling models were compared for the three cognitive domains mathematics, reading, and science for the PISA 2009 dataset. Table 1 displays model comparisons based on AIC, BIC, and ΔGHP, which is defined as the difference between the GHP values of a particular model and the best-fitting model.
Table 1

Model comparisons based on information criteria for the three ability domains—mathematics, reading and science—in PISA 2009.

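| Model | Mathematics AIC | Mathematics BIC | Mathematics ΔGHP | Reading AIC | Reading BIC | Reading ΔGHP | Science AIC | Science BIC | Science ΔGHP |
|---|---|---|---|---|---|---|---|---|---|
| 1PL | 217510 | 217779 | 0.0059 | 413555 | 414317 | 0.0055 | 347819 | 348222 | 0.0062 |
| 1PCL | 220022 | 220291 | 0.0122 | 414757 | 415519 | 0.0070 | 348756 | 349160 | 0.0077 |
| 1PLL | 216882 | 217151 | 0.0043 | 416988 | 417751 | 0.0098 | 348984 | 349388 | 0.0081 |
| 1PGL | 216784 | 217068 | 0.0041 | 413369 | 414146 | 0.0053 | 347804 | 348223 | 0.0062 |
| 2PL | 215621 | 216144 | 0.0012 | 410032 | 411541 | 0.0011 | 344597 | 345389 | 0.0009 |
| 4PGL | 215142 | 216188 | 0.0000 | 409163 | 412182 | 0.0000 | 344064 | 345648 | 0.0000 |
| 3PLQ | 215153 | 215938 | 0.0000 | 409327 | 411591 | 0.0002 | 344097 | 345285 | 0.0001 |
| 3PLRH | 215174 | 215959 | 0.0001 | 409275 | 411539 | 0.0001 | 344083 | 345271 | 0.0000 |
| 3PL | 215486 | 216099 | 0.0009 | 409767 | 411605 | 0.0008 | 344420 | 345362 | 0.0006 |
| 4PL | 215179 | 216060 | 0.0001 | 409296 | 411852 | 0.0002 | 344105 | 345368 | 0.0001 |
| 4PLQ | 215168 | 216102 | 0.0001 | 409245 | 411913 | 0.0001 | 344089 | 345464 | 0.0000 |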

Note. AIC = Akaike information criterion; BIC = Bayesian information criterion; ΔGHP = difference in Gilula–Haberman penalty (GHP) between a particular model and the best-fitting model in terms of GHP. For model descriptions, see Section 2.1 and Equations (3) to (14). For AIC and BIC, the best-fitting model and models whose information criteria did not deviate from the minimum value by more than 100 are printed in bold. For ΔGHP, the model with the smallest value and models with ΔGHP values smaller than 0.0005 are printed in bold.

Based on the AIC or ΔGHP, one of the models 4PGL, 3PLQ, 3PLRH, 3PL, 4PL, or 4PLQ was preferred in each of the domains. If the BIC were used as a selection criterion, the 3PLQ or the 3PLRH would always be chosen among the models. Notably, the operationally used 2PL model had a satisfactory ΔGHP only for the reading domain. By inspecting ΔGHP, it is evident that the largest gain in model fit is obtained by switching from one- to two-, three- or four-parameter models. However, the gain in model fit from the 2PL to the 3PL model is not noteworthy. In contrast, the gains in fitting the 3PLQ or 3PLRH can be significant. Among the one-parameter models, it is interesting that the loglog link function resulted in a better model fit for mathematics compared to the logistic or the cloglog link functions. This was not the case for reading or science. Overall, the model comparison for PISA 2009 demonstrated that the 3PLQ or 3PLRH should be preferred over the 2PL model for reasons of model fit.

5.2. Model Uncertainty for Distribution Parameters

To obtain visual insight into the similarity of the different scaling models, we computed pairwise absolute differences in the country means between models. The average of these differences across countries defined a distance matrix that served as the input of a hierarchical cluster analysis based on the Ward method. Figure 4 shows the dendrogram of this cluster analysis. It can be seen that the 2PL and 3PL provided similar results. Another cluster of models was formed by the more complex models 3PLQ, 3PLRH, 4PGL, 4PL, and 4PLQ. Finally, the different one-parameter models 1PLL, 1PCL, 1PL (and 1PGL) provided relatively distinct findings.
Figure 4

Dendrogram of cluster analysis using the Ward method for 11 different scaling models based on the distance matrix defined as average absolute differences between country means of models for PISA 2009 reading data.
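
The distance matrix underlying the dendrogram can be illustrated with a short Python sketch. It is a simplified illustration, not the original analysis code: it uses only four models and four countries (AUS, DNK, JPN, KOR) with country means copied from the reading results, and the names `means`, `mean_abs_diff`, and `dist` are hypothetical. The resulting distances could then be passed to any hierarchical clustering routine that supports the Ward method.

```python
from itertools import combinations

# Country means for AUS, DNK, JPN, KOR under four scaling models
# (illustrative subset of the PISA 2009 reading results).
means = {
    "1PL": [515.1, 495.0, 522.3, 541.3],
    "2PL": [515.7, 492.6, 520.4, 538.7],
    "3PL": [515.0, 493.5, 519.8, 538.5],
    "4PLQ": [514.5, 492.1, 522.2, 537.6],
}


def mean_abs_diff(x, y):
    """Average absolute difference between two vectors of country means."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Distance for every unordered pair of models.
dist = {
    (m1, m2): mean_abs_diff(means[m1], means[m2])
    for m1, m2 in combinations(means, 2)
}

closest = min(dist, key=dist.get)
print(closest, round(dist[closest], 2))
```

Even in this small subset, the 2PL and 3PL models form the closest pair, in line with the cluster structure described above.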

In Table 2, detailed results for 11 different scaling models for country means in PISA 2009 reading are shown. The largest numbers of substantial deviations of country means from the weighted mean (i.e., the composite parameter), that is, deviations of at least 1, were obtained for the 1PCL model (10), 1PLL (9), and 4PLQ (9). At the level of countries, there were 11 countries in which none of the scaling models substantially differed from the weighted mean. In contrast, there was a large number of deviations for Denmark (DNK; 9) and South Korea (KOR; 10). The ranges in country means across different scaling models varied between 0.3 (SWE; Sweden) and 7.7 (JPN; Japan), with a mean of 2.4.
Table 2

Detailed results for all 11 different scaling models for country means in PISA 2009 reading.

CNT M rg MEbc 1PL 1PCL 1PLL 1PGL 2PL 4PGL 3PLQ 3PLRH 3PL 4PL 4PLQ
AUS 515.2 1.25 0.29 515.1 515.8 514.8 515.2 515.7 515.2 515.2 515.5 515.0 515.0 514.5
AUT 470.8 2.36 0.65 470.2 469.6 470.6 470.1 470.9 472.0 471.6 471.7 470.6 471.6 471.9
BEL 509.5 2.91 0.78 508.9 507.8 509.4 508.8 509.7 510.7 510.4 510.5 509.4 510.7 510.6
CAN 525.0 1.79 0.43 525.1 525.6 525.2 525.1 525.4 524.3 524.5 524.8 524.9 524.0 523.8
CHE 501.7 1.27 0.39 501.3 501.3 501.0 501.4 501.5 502.3 502.3 502.2 501.8 502.3 502.3
CZE 479.9 0.89 0.27 479.5 480.2 479.5 479.6 480.1 480.0 480.0 479.8 480.4 480.1 480.0
DEU 498.5 1.83 0.39 498.2 499.3 497.5 498.5 498.4 499.0 498.9 498.9 498.7 498.8 499.1
DNK 493.7 5.46 1.58 495.0 497.3 492.9 495.6 492.6 491.9 492.0 491.8 493.5 492.1 492.1
ESP 480.1 1.43 0.43 480.0 480.7 479.5 480.1 480.3 479.6 479.8 479.6 480.9 479.7 479.7
EST 501.5 2.43 0.75 501.2 502.8 500.4 501.4 502.0 500.9 501.0 501.0 502.8 500.7 500.8
FIN 539.0 1.66 0.41 539.0 538.7 539.2 538.9 538.7 539.8 539.2 539.6 538.4 539.7 540.1
FRA 498.0 4.54 1.13 497.4 495.1 499.0 497.0 497.7 499.4 499.4 499.5 497.7 499.6 499.3
GBR 494.0 1.29 0.20 494.0 494.7 493.4 494.1 494.0 494.0 494.1 494.0 494.2 493.8 493.8
GRC 480.6 3.42 0.96 481.7 479.6 482.8 481.1 480.3 479.4 480.0 479.7 480.0 479.6 479.6
HUN 494.2 1.74 0.40 494.4 495.0 493.8 494.4 494.5 493.5 493.6 493.7 494.3 493.3 493.4
IRL 496.8 2.04 0.51 496.5 497.7 495.7 496.8 497.4 496.4 496.6 496.6 497.5 496.5 496.4
ISL 501.2 0.78 0.15 501.3 501.6 501.5 501.2 501.3 501.1 500.8 501.0 500.8 501.3 501.2
ITA 486.5 1.37 0.32 486.3 485.6 486.6 486.2 486.8 486.7 487.0 486.9 486.6 486.8 486.9
JPN 521.3 7.70 1.60 522.3 517.7 525.4 521.4 520.4 521.6 521.0 520.7 519.8 522.2 522.2
KOR 539.7 4.03 1.45 541.3 541.4 541.5 541.2 538.7 538.2 538.5 538.7 538.5 537.4 537.6
LUX 472.7 4.38 1.22 471.7 470.0 473.0 471.3 473.2 474.4 474.2 474.4 472.5 474.0 474.2
NLD 509.0 1.57 0.28 509.1 509.8 508.2 509.4 508.6 508.9 509.1 508.7 508.8 509.2 509.1
NOR 503.3 0.89 0.14 503.3 503.6 503.7 503.1 503.2 503.3 503.2 503.0 503.3 503.7 503.9
POL 501.7 2.24 0.72 501.0 501.2 500.4 501.3 502.2 502.0 502.5 502.2 502.7 502.2 502.1
PRT 489.2 2.79 0.70 489.4 490.8 488.0 489.8 489.3 488.3 488.5 488.4 489.9 488.3 488.3
SWE 497.0 0.34 0.00 496.9 497.0 497.0 496.9 496.9 497.2 497.0 497.1 496.9 497.1 497.2

Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; rg = range of estimates across models; MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)). For model descriptions, see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean of country means of the 11 different models by more than 1 are printed in bold.
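
The two summaries used in the text, the range of country means across models and the number of models deviating from the weighted mean by more than 1, can be reproduced from a table row with a few lines of Python. This is a minimal sketch under stated assumptions: it works with the rounded values printed in the table (the original computations presumably used unrounded estimates, so borderline cases may differ), and the function name `summarize` is hypothetical.

```python
def summarize(estimates, composite, threshold=1.0):
    """Return (range across models, number of models deviating by more than threshold)."""
    rg = max(estimates) - min(estimates)
    n_dev = sum(abs(e - composite) > threshold for e in estimates)
    return round(rg, 1), n_dev


# Reading country means for the 11 models (order as in the table header).
dnk = [495.0, 497.3, 492.9, 495.6, 492.6, 491.9, 492.0, 491.8, 493.5, 492.1, 492.1]
swe = [496.9, 497.0, 497.0, 496.9, 496.9, 497.2, 497.0, 497.1, 496.9, 497.1, 497.2]

dnk_summary = summarize(dnk, composite=493.7)
swe_summary = summarize(swe, composite=497.0)
print(dnk_summary, swe_summary)
```

For Denmark, this gives a range of 5.5 and nine deviating models, matching the values reported in the text; for Sweden, the range is 0.3 and no model deviates substantially.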

In Table A1 in Appendix C, detailed results for 11 different scaling models for country means in PISA 2009 mathematics are shown. The largest numbers of substantial deviations from the weighted mean were obtained for the 1PCL (12), the 1PLL (11), and the 1PGL (9) models. The ranges of the country means across models varied between 0.5 and 7.9, with a mean of 2.8.
Table A1

Detailed results for all 11 different scaling models for country means in PISA 2009 mathematics.

CNT M rg MEbc 1PL 1PCL 1PLL 1PGL 2PL 4PGL 3PLQ 3PLRH 3PL 4PL 4PLQ
AUS 511.2 0.72 0.02 511.3 510.9 510.8 511.1 511.4 511.4 511.2 511.2 511.5 511.2 511.3
AUT 492.5 2.90 0.71 492.7 491.2 494.1 493.9 492.7 492.4 492.3 492.9 491.2 492.1 492.1
BEL 512.4 2.99 0.86 513.0 511.3 514.2 514.2 511.6 512.2 512.4 512.1 511.5 512.3 512.2
CAN 523.0 2.17 0.62 522.5 521.9 522.7 522.9 523.8 523.1 523.1 523.2 524.0 523.0 523.0
CHE 533.5 6.22 1.44 532.5 529.0 535.0 535.2 533.9 534.5 534.4 534.4 533.4 534.9 534.6
CZE 488.1 1.21 0.20 488.2 488.9 487.8 487.7 488.5 487.8 487.8 488.0 488.0 487.7 487.7
DEU 508.9 2.46 0.89 509.7 508.9 510.4 510.3 508.1 508.3 507.9 508.2 508.0 508.1 508.0
DNK 497.4 3.52 0.93 498.0 499.7 496.3 496.6 497.6 496.2 496.4 496.2 497.9 496.4 496.4
ESP 478.9 0.53 0.06 479.1 479.0 478.8 478.8 478.6 479.1 478.9 479.0 478.6 478.9 478.9
EST 508.1 5.35 1.35 507.6 510.8 505.5 505.4 508.9 507.8 507.9 507.7 509.9 507.9 507.9
FIN 538.1 5.13 1.27 539.3 541.1 538.2 537.9 537.9 536.5 536.4 536.8 538.2 536.0 536.2
FRA 490.8 1.79 0.50 491.3 490.0 491.6 491.8 490.0 490.4 490.7 490.6 490.4 490.5 490.5
GBR 486.9 2.30 0.53 486.6 486.9 485.3 485.9 487.1 487.1 487.3 487.1 487.6 487.3 487.3
GRC 458.0 3.95 0.97 458.6 457.6 459.9 459.2 457.3 458.3 458.0 458.2 456.0 457.9 457.8
HUN 483.4 1.11 0.00 483.5 484.1 483.1 483.0 483.5 483.5 483.2 483.4 483.1 483.3 483.4
IRL 482.6 1.97 0.55 482.1 482.1 481.6 482.0 483.1 483.0 483.0 482.7 483.6 483.2 483.2
ISL 501.0 3.02 0.74 501.5 503.0 500.1 500.2 500.7 500.0 500.4 500.1 501.3 500.3 500.4
ITA 478.0 0.88 0.18 478.1 478.6 478.1 477.8 477.7 478.2 478.2 478.2 477.8 478.2 478.2
JPN 529.9 3.06 1.11 528.4 529.1 529.1 528.9 530.5 531.3 531.1 531.0 530.5 531.4 531.3
KOR 544.7 7.87 2.45 541.6 540.0 546.4 545.8 545.6 546.7 547.5 547.1 545.6 547.7 547.8
LUX 483.4 1.55 0.46 483.8 482.8 484.1 484.0 482.7 483.7 483.3 483.7 482.5 483.4 483.5
NLD 521.5 1.98 0.51 522.0 522.6 521.4 521.5 521.2 520.8 520.8 520.8 521.5 520.6 520.7
NOR 493.3 4.11 0.87 493.4 495.6 491.5 491.6 493.5 492.9 493.0 492.8 493.9 493.0 493.0
POL 487.0 1.22 0.15 487.1 488.0 486.8 486.7 487.1 486.9 486.9 486.8 486.8 487.0 486.9
PRT 480.1 2.26 0.49 479.8 478.7 479.7 480.0 480.3 480.7 480.8 481.0 480.2 480.8 480.7
SWE 487.4 1.44 0.47 488.1 488.3 487.4 487.6 486.8 487.2 487.0 487.0 486.9 487.0 487.1

Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; rg = range of estimates across models; MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)). For model descriptions, see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean of country means of the 11 different models by more than 1 are printed in bold.

In Table A2 in Appendix C, detailed results for 11 different scaling models for country means in PISA 2009 science are shown. For science, many models showed a large number of deviations. This demonstrates large model uncertainty. The ranges of the country means across models varied between 0.6 and 7.8, with a mean of 2.8.
Table A2

Detailed results for all 11 different scaling models for country means in PISA 2009 science.

CNT M rg MEbc 1PL 1PCL 1PLL 1PGL 2PL 4PGL 3PLQ 3PLRH 3PL 4PL 4PLQ
AUS 517.6 2.73 0.83 518.4 518.1 519.2 518.3 516.7 517.3 517.1 517.2 516.5 517.2 517.1
AUT 488.1 1.11 0.18 487.9 488.6 487.6 488.0 488.4 488.3 488.4 488.7 487.9 488.3 488.2
BEL 498.1 2.37 0.55 497.8 496.6 498.9 497.7 498.5 498.6 498.5 498.5 498.2 498.7 498.6
CAN 519.6 0.65 0.09 519.6 519.5 520.0 519.6 519.4 519.6 519.6 519.5 519.6 519.4 519.6
CHE 509.2 0.96 0.35 508.7 508.8 508.9 508.7 509.5 509.6 509.7 509.7 509.4 509.6 509.6
CZE 494.1 2.89 0.98 495.1 495.7 494.5 495.2 493.5 493.0 492.8 492.9 493.6 492.9 492.9
DEU 513.9 2.13 0.53 514.2 514.9 514.7 514.2 514.0 513.3 513.1 513.5 513.7 513.1 512.8
DNK 488.3 4.70 1.89 490.3 490.9 489.6 490.4 486.2 486.6 486.5 486.4 486.8 486.7 486.6
ESP 478.2 2.07 0.42 478.2 479.0 477.0 478.4 478.1 477.8 477.9 477.7 478.7 478.0 478.0
EST 517.5 1.00 0.23 517.4 517.2 517.9 517.3 517.4 517.6 517.2 517.4 518.2 517.4 517.2
FIN 546.5 3.54 0.79 547.1 546.3 549.0 546.9 546.0 546.4 546.0 546.1 545.5 546.3 546.1
FRA 488.2 3.74 1.02 487.2 485.9 488.3 487.1 488.9 489.3 489.6 489.5 488.8 489.3 489.5
GBR 505.0 1.12 0.28 504.7 504.8 505.2 504.7 504.9 505.8 505.4 505.4 504.7 505.5 505.6
GRC 461.4 4.51 1.26 460.3 458.3 461.6 460.0 462.4 462.8 462.5 462.5 462.1 462.5 462.5
HUN 494.6 5.05 1.36 495.8 498.0 493.5 496.1 493.9 492.9 493.0 493.0 494.5 493.0 493.1
IRL 497.0 0.95 0.27 497.3 497.4 497.4 497.3 496.7 496.8 496.5 496.7 496.7 496.5 496.6
ISL 487.6 3.34 1.09 486.5 487.4 485.5 486.6 488.8 488.4 488.2 488.4 488.8 488.1 488.2
ITA 479.7 0.57 0.17 479.9 479.5 479.5 479.9 479.8 479.5 479.4 479.3 479.7 479.3 479.3
JPN 534.6 7.85 2.29 532.4 530.2 534.6 532.1 536.1 536.3 537.6 536.9 535.0 538.1 537.6
KOR 530.6 3.57 1.42 529.1 529.0 529.1 529.2 531.0 532.0 532.5 532.4 531.5 532.3 532.4
LUX 474.8 3.49 0.87 474.2 472.6 475.1 474.0 475.3 476.1 476.0 476.1 474.6 475.7 475.7
NLD 514.2 2.63 0.93 515.2 515.6 514.8 515.2 513.6 513.1 513.0 513.2 513.4 513.0 513.1
NOR 491.0 3.24 1.10 492.2 492.6 491.2 492.3 490.5 489.4 489.6 489.4 490.6 489.6 489.7
POL 499.6 3.08 0.70 500.0 501.7 498.6 500.2 499.3 499.0 498.9 499.0 499.7 498.7 498.9
PRT 483.4 4.41 0.88 483.2 485.3 480.9 483.5 483.8 483.0 483.1 482.9 484.4 483.1 483.1
SWE 487.3 1.54 0.34 487.1 486.3 487.2 487.0 487.5 487.6 487.9 487.7 487.5 487.9 487.9

Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; rg = range of estimates across models; MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)). For model descriptions, see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean of country means of the 11 different models by more than 1 are printed in bold.

In Table 3, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 reading are shown. The unadjusted model error had an average of M = 0.66. The bias-corrected model error was slightly smaller, with M = 0.62. On average, the error ratio was 0.24, indicating that a larger portion of the uncertainty was due to sampling error than to model error.
Table 3

Results and model uncertainty of 11 different scaling models for country means and country standard deviations in PISA 2009 reading.

CNT N | Country Mean: M rg SE ME MEbc ER TE | Country Standard Deviation: M rg SE ME MEbc ER TE
AUS 14,247 | 515.2 1.2 2.51 0.32 0.29 0.12 2.52 | 104.7 2.6 1.45 0.68 0.64 0.44 1.59
AUT 6585 | 470.8 2.4 3.34 0.69 0.65 0.19 3.40 | 104.6 6.8 2.16 1.66 1.64 0.76 2.71
BEL 8500 | 509.5 2.9 2.49 0.80 0.78 0.32 2.61 | 107.5 3.1 1.92 0.69 0.65 0.34 2.02
CAN 23,200 | 525.0 1.8 1.49 0.45 0.43 0.29 1.55 | 95.6 4.6 1.12 1.18 1.18 1.05 1.62
CHE 11,801 | 501.7 1.3 2.72 0.42 0.39 0.14 2.75 | 99.7 0.8 1.67 0.23 0.00 0.00 1.67
CZE 6059 | 479.9 0.9 3.17 0.32 0.27 0.09 3.18 | 95.2 1.3 1.86 0.39 0.20 0.11 1.87
DEU 4975 | 498.5 1.8 3.05 0.42 0.39 0.13 3.08 | 100.1 1.3 2.01 0.30 0.00 0.00 2.01
DNK 5920 | 493.7 5.5 2.10 1.58 1.58 0.75 2.63 | 88.0 3.5 1.31 0.70 0.68 0.52 1.48
ESP 25,828 | 480.1 1.4 2.12 0.44 0.43 0.20 2.17 | 91.9 4.6 1.18 1.16 1.13 0.96 1.64
EST 4726 | 501.5 2.4 2.70 0.77 0.75 0.28 2.80 | 85.5 3.8 1.71 0.85 0.82 0.48 1.89
FIN 5807 | 539.0 1.7 2.27 0.43 0.41 0.18 2.30 | 91.5 9.8 1.31 2.68 2.68 2.05 2.98
FRA 4280 | 498.0 4.5 3.92 1.16 1.13 0.29 4.08 | 112.2 1.8 2.92 0.55 0.41 0.14 2.95
GBR 12,172 | 494.0 1.3 2.47 0.25 0.20 0.08 2.47 | 99.6 2.8 1.34 0.77 0.73 0.55 1.53
GRC 4966 | 480.6 3.4 4.26 1.01 0.96 0.23 4.37 | 99.8 5.4 2.09 1.46 1.38 0.66 2.50
HUN 4604 | 494.2 1.7 3.62 0.46 0.40 0.11 3.64 | 94.8 2.7 2.78 0.67 0.58 0.21 2.84
IRL 3931 | 496.8 2.0 3.24 0.55 0.51 0.16 3.28 | 98.8 4.2 2.63 1.24 1.19 0.45 2.89
ISL 3628 | 501.2 0.8 1.67 0.23 0.15 0.09 1.68 | 102.0 3.5 1.40 1.03 0.96 0.68 1.69
ITA 30,905 | 486.5 1.4 1.61 0.33 0.32 0.20 1.64 | 101.4 3.7 1.35 0.81 0.77 0.57 1.55
JPN 6082 | 521.3 7.7 3.71 1.62 1.60 0.43 4.04 | 107.3 8.0 3.16 1.59 1.52 0.48 3.50
KOR 4989 | 539.7 4.0 3.10 1.51 1.45 0.47 3.42 | 84.2 8.4 1.76 2.23 2.02 1.15 2.68
LUX 4622 | 472.7 4.4 1.19 1.23 1.22 1.02 1.70 | 109.3 8.0 1.21 2.01 1.99 1.65 2.33
NLD 4760 | 509.0 1.6 5.58 0.35 0.28 0.05 5.59 | 95.1 4.1 1.89 1.12 1.01 0.54 2.14
NOR 4660 | 503.3 0.9 2.61 0.22 0.14 0.06 2.61 | 96.8 3.7 1.55 0.98 0.93 0.60 1.81
POL 4917 | 501.7 2.2 2.72 0.72 0.72 0.26 2.81 | 92.8 3.6 1.32 0.90 0.84 0.63 1.56
PRT 6298 | 489.2 2.8 3.17 0.71 0.70 0.22 3.25 | 91.8 3.2 1.75 0.74 0.71 0.40 1.89
SWE 4565 | 497.0 0.3 3.00 0.09 0.00 0.00 3.00 | 103.6 1.7 1.63 0.42 0.27 0.17 1.66

Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error (see Equation (24)).
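
The relationship between the tabulated quantities can be checked with a short Python sketch. The error ratio follows the definition in the table note (MEbc/SE). For the total error, the sketch assumes the form TE = sqrt(SE² + MEbc²); this is an assumption made here because the formula is not spelled out in the note, although it reproduces the tabulated values up to rounding. The function names are hypothetical, and small discrepancies arise because the inputs are rounded table entries.

```python
import math


def error_ratio(se, me_bc):
    """Error ratio as defined in the table note: bias-corrected model error over SE."""
    return me_bc / se


def total_error(se, me_bc):
    """Total error, ASSUMED here to combine both error sources in quadrature."""
    return math.sqrt(se**2 + me_bc**2)


# (SE, MEbc, tabulated ER, tabulated TE) for country means in reading.
rows = {
    "AUT": (3.34, 0.65, 0.19, 3.40),
    "DNK": (2.10, 1.58, 0.75, 2.63),
    "LUX": (1.19, 1.22, 1.02, 1.70),
}

for cnt, (se, me_bc, er_tab, te_tab) in rows.items():
    er = error_ratio(se, me_bc)
    te = total_error(se, me_bc)
    print(cnt, round(er, 2), er_tab, round(te, 2), te_tab)
```

An error ratio above 1, as for Luxembourg, means that the bias-corrected model error exceeds the sampling error for that country.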

The estimated country standard deviations for reading were much more model-dependent. The bias-corrected model error had an average of 0.96 (ranging between 0.00 and 2.68). This was also reflected in the error ratio, which had an average of 0.60. The maximum error ratio was 2.05 for Finland (FIN; with a range of 9.8 across models and a bias-corrected model error of 2.68), indicating that the model error was twice as large as the sampling error. Overall, model error turned out to be much more important for the standard deviation than for the mean.

In Table 4, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 reading are shown. For the 10th percentile Q10, the error ratio was on average 0.60, with a range between 0.13 and 2.61. The average error ratio was even larger for the 90th percentile Q90 (M = 0.84, Min = 0.23, Max = 2.16). Hence, quantile comparisons across countries can be sensitive to the choice of the IRT scaling model.
Table 4

Results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 reading.

CNT N | Country 10th Percentile: M rg SE ME MEbc ER TE | Country 90th Percentile: M rg SE ME MEbc ER TE
AUS 14,247 | 379.5 5.5 2.98 1.52 1.49 0.50 3.33 | 646.8 11.2 3.33 3.10 3.04 0.91 4.51
AUT 6585 | 332.9 20.5 4.82 5.37 5.32 1.10 7.18 | 602.8 4.8 3.64 1.26 1.07 0.30 3.79
BEL 8500 | 369.0 7.7 4.09 2.15 2.08 0.51 4.59 | 644.7 16.8 2.78 4.24 4.24 1.52 5.07
CAN 23,200 | 400.8 4.9 2.40 1.42 1.41 0.59 2.78 | 646.7 11.9 1.92 3.00 3.00 1.56 3.56
CHE 11,801 | 370.5 7.5 3.68 1.83 1.77 0.48 4.09 | 627.7 10.9 3.36 3.11 3.09 0.92 4.56
CZE 6059 | 357.5 8.4 4.67 2.19 2.13 0.46 5.14 | 603.3 6.2 3.18 1.58 1.53 0.48 3.53
DEU 4975 | 366.0 7.5 4.79 1.95 1.81 0.38 5.12 | 624.4 9.2 2.73 2.64 2.58 0.95 3.76
DNK 5920 | 378.2 4.1 2.82 0.96 0.91 0.32 2.96 | 604.0 4.7 2.57 1.45 1.43 0.56 2.94
ESP 25,828 | 359.0 8.7 3.24 2.18 2.12 0.66 3.87 | 595.1 3.0 1.86 0.78 0.74 0.40 2.00
EST 4726 | 390.9 7.3 3.83 1.81 1.76 0.46 4.21 | 610.7 6.2 3.17 1.50 1.46 0.46 3.49
FIN 5807 | 419.2 10.0 2.90 2.45 2.45 0.85 3.80 | 653.3 21.6 2.66 5.75 5.75 2.16 6.34
FRA 4280 | 350.5 13.8 5.93 3.68 3.59 0.60 6.93 | 638.6 16.3 4.92 3.88 3.82 0.78 6.23
GBR 12,172 | 365.9 9.9 3.00 2.57 2.57 0.86 3.95 | 621.7 5.0 3.01 1.45 1.39 0.46 3.31
GRC 4966 | 350.5 16.2 6.24 3.51 3.29 0.53 7.05 | 607.5 3.6 3.06 1.03 0.97 0.32 3.21
HUN 4604 | 368.6 7.0 6.08 1.56 1.40 0.23 6.24 | 613.4 4.5 4.08 1.21 1.12 0.28 4.23
IRL 3931 | 370.0 9.6 5.61 2.45 2.38 0.43 6.09 | 619.7 5.7 2.84 1.31 1.24 0.44 3.10
ISL 3628 | 366.3 6.0 2.67 1.40 1.28 0.48 2.96 | 628.2 11.2 2.33 2.84 2.76 1.18 3.62
ITA 30,905 | 352.4 12.2 2.65 2.67 2.65 1.00 3.75 | 613.7 7.7 1.86 2.01 2.00 1.07 2.73
JPN 6082 | 381.0 4.8 7.46 1.17 1.01 0.14 7.52 | 652.9 25.9 3.39 5.73 5.67 1.68 6.60
KOR 4989 | 430.5 13.8 4.18 3.53 3.31 0.79 5.33 | 644.5 14.7 3.51 3.68 3.60 1.02 5.03
LUX 4622 | 328.3 24.5 2.42 6.36 6.31 2.61 6.76 | 609.8 5.9 1.83 1.63 1.55 0.85 2.40
NLD 4760 | 386.8 3.5 5.84 0.91 0.73 0.13 5.89 | 632.7 12.9 5.35 3.47 3.36 0.63 6.31
NOR 4660 | 377.1 3.5 3.47 0.85 0.77 0.22 3.55 | 625.7 13.7 3.28 3.45 3.45 1.05 4.76
POL 4917 | 381.9 5.0 3.25 1.25 1.24 0.38 3.48 | 620.5 12.8 3.18 3.46 3.43 1.08 4.68
PRT 6298 | 369.9 6.6 4.51 1.43 1.34 0.30 4.70 | 606.8 3.5 3.20 0.83 0.74 0.23 3.29
SWE 4565 | 363.1 8.8 3.97 2.19 2.13 0.54 4.51 | 627.6 7.9 3.60 2.13 2.06 0.57 4.15

Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error (see Equation (24)).

In Table A3 in Appendix C, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 mathematics are shown. As for reading, the error ratio was on average smaller for country means (M = 0.24, Max = 0.66) than for country standard deviations (M = 0.77, Max = 1.58). Nevertheless, the additional uncertainty due to model choice is too large to be ignored in statistical inference. For example, South Korea (KOR) had a range of 15.7 for the standard deviation across models, which corresponds to a bias-corrected model error of 3.75 and an error ratio of 1.58.
Table A3

Results and model uncertainty of 11 different scaling models for country means and country standard deviations in PISA 2009 mathematics.

CNT N | Country Mean: M rg SE ME MEbc ER TE | Country Standard Deviation: M rg SE ME MEbc ER TE
AUS 9889 | 511.2 0.7 2.75 0.19 0.02 0.01 2.75 | 101.5 2.7 1.82 0.89 0.83 0.45 2.00
AUT 4575 | 492.5 2.9 3.17 0.80 0.71 0.22 3.25 | 105.1 6.0 2.05 1.76 1.68 0.82 2.65
BEL 5978 | 512.4 3.0 2.39 0.88 0.86 0.36 2.54 | 111.5 4.2 2.20 1.36 1.32 0.60 2.56
CAN 16,040 | 523.0 2.2 1.70 0.62 0.62 0.37 1.81 | 93.5 5.5 1.28 1.73 1.73 1.35 2.16
CHE 8157 | 533.5 6.2 3.59 1.45 1.44 0.40 3.87 | 105.2 7.2 1.85 2.33 2.29 1.23 2.94
CZE 4223 | 488.1 1.2 3.16 0.32 0.20 0.06 3.16 | 98.9 2.8 2.10 0.93 0.86 0.41 2.27
DEU 3503 | 508.9 2.5 3.45 0.91 0.89 0.26 3.56 | 104.6 2.3 2.27 0.86 0.73 0.32 2.38
DNK 4088 | 497.4 3.5 2.86 0.95 0.93 0.33 3.01 | 91.9 1.8 1.78 0.36 0.08 0.05 1.79
ESP 17,920 | 478.9 0.5 2.21 0.20 0.06 0.03 2.21 | 95.4 6.1 1.64 1.63 1.60 0.98 2.29
EST 3279 | 508.1 5.3 2.82 1.37 1.35 0.48 3.13 | 83.5 5.9 1.96 1.60 1.56 0.80 2.50
FIN 4019 | 538.1 5.1 2.22 1.32 1.27 0.57 2.56 | 87.8 8.4 1.82 2.61 2.59 1.42 3.17
FRA 2965 | 490.8 1.8 3.67 0.59 0.50 0.14 3.71 | 104.7 4.6 2.77 1.34 1.26 0.45 3.05
GBR 8431 | 486.9 2.3 2.77 0.59 0.53 0.19 2.82 | 94.2 3.1 1.75 0.90 0.82 0.47 1.93
GRC 3445 | 458.0 3.9 4.13 1.03 0.97 0.23 4.24 | 97.6 9.6 2.38 2.88 2.82 1.18 3.69
HUN 3177 | 483.4 1.1 4.04 0.26 0.00 0.00 4.04 | 97.8 5.4 3.42 1.69 1.69 0.49 3.82
IRL 2745 | 482.6 2.0 2.89 0.61 0.55 0.19 2.94 | 88.3 5.0 2.02 1.41 1.36 0.67 2.44
ISL 2510 | 501.0 3.0 2.14 0.76 0.74 0.35 2.26 | 95.0 2.5 2.09 0.69 0.61 0.29 2.18
ITA 21,379 | 478.0 0.9 2.09 0.24 0.18 0.09 2.10 | 98.0 5.5 1.40 1.32 1.32 0.94 1.92
JPN 4207 | 529.9 3.1 3.77 1.15 1.11 0.29 3.93 | 101.7 7.9 2.61 2.61 2.54 0.97 3.64
KOR 3447 | 544.7 7.9 3.71 2.52 2.45 0.66 4.45 | 94.0 15.7 2.38 3.90 3.75 1.58 4.45
LUX 3197 | 483.4 1.6 1.88 0.53 0.46 0.24 1.94 | 103.6 5.1 1.78 1.36 1.30 0.73 2.21
NLD 3318 | 521.5 2.0 5.19 0.56 0.51 0.10 5.22 | 96.4 4.5 2.06 1.57 1.49 0.73 2.54
NOR 3230 | 493.3 4.1 2.76 0.88 0.87 0.32 2.89 | 92.6 2.8 1.47 0.85 0.74 0.50 1.65
POL 3401 | 487.0 1.2 2.99 0.28 0.15 0.05 2.99 | 95.4 5.9 1.90 2.46 2.44 1.28 3.10
PRT 4391 | 480.1 2.3 2.99 0.54 0.49 0.16 3.03 | 97.7 4.7 1.93 1.53 1.49 0.77 2.44
SWE 3139 | 487.4 1.4 3.02 0.53 0.47 0.15 3.06 | 99.3 3.5 1.91 1.14 1.08 0.57 2.19

Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error (see Equation (24)).

In Table A4 in Appendix C, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 mathematics are shown. The error ratios for the 10th and the 90th percentiles were similar (Q10: M = 0.66; Q90: M = 0.65). In general, the relative increase in uncertainty due to model error for percentiles was similar to the standard deviation.
Table A4

Results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 mathematics.

CNT N | Country 10th Percentile: M rg SE ME MEbc ER TE | Country 90th Percentile: M rg SE ME MEbc ER TE
AUS 9889 | 380.2 2.2 3.12 0.76 0.61 0.20 3.18 | 641.9 8.4 4.06 2.65 2.56 0.63 4.80
AUT 4575 | 355.5 16.2 4.22 4.74 4.60 1.09 6.25 | 627.1 7.3 3.86 2.24 2.09 0.54 4.39
BEL 5978 | 367.0 5.8 4.46 1.69 1.53 0.34 4.71 | 654.9 14.9 2.94 4.67 4.66 1.59 5.51
CAN 16,040 | 402.3 4.6 2.75 1.50 1.50 0.54 3.13 | 643.7 10.2 2.01 3.15 3.15 1.57 3.73
CHE 8157 | 393.9 3.6 4.29 1.14 0.97 0.23 4.40 | 666.6 20.3 4.16 5.76 5.71 1.37 7.07
CZE 4223 | 361.9 9.7 4.80 2.91 2.85 0.59 5.58 | 617.3 2.2 3.91 0.63 0.30 0.08 3.92
DEU 3503 | 371.8 6.7 5.12 1.89 1.85 0.36 5.45 | 642.6 11.6 3.75 3.83 3.75 1.00 5.30
DNK 4088 | 379.2 4.5 3.49 1.66 1.51 0.43 3.80 | 616.1 3.5 3.69 1.19 1.14 0.31 3.86
ESP 17,920 | 354.4 13.3 3.44 3.75 3.70 1.08 5.05 | 600.8 4.9 2.73 1.12 1.06 0.39 2.92
EST 3279 | 401.7 8.8 4.28 2.31 2.24 0.52 4.83 | 616.7 6.6 3.66 1.80 1.68 0.46 4.03
FIN 4019 | 425.0 8.7 3.39 2.66 2.61 0.77 4.28 | 650.9 13.8 3.12 4.60 4.57 1.47 5.54
FRA 2965 | 354.3 10.9 5.45 2.82 2.72 0.50 6.09 | 623.9 7.8 4.85 3.40 3.28 0.68 5.86
GBR 8431 | 366.8 6.3 3.32 2.16 2.09 0.63 3.92 | 609.5 2.0 3.93 0.58 0.26 0.07 3.94
GRC 3445 | 332.4 22.8 5.63 6.55 6.44 1.14 8.55 | 584.0 6.7 4.64 1.77 1.69 0.36 4.94
HUN 3177 | 356.7 12.3 6.07 3.57 3.57 0.59 7.04 | 608.5 6.4 6.03 1.63 1.51 0.25 6.21
IRL 2745 | 368.0 8.0 4.45 2.36 2.22 0.50 4.97 | 594.6 5.1 3.39 1.54 1.47 0.43 3.70
ISL 2510 | 378.3 3.4 3.70 1.25 1.05 0.28 3.84 | 622.2 4.7 3.46 1.70 1.60 0.46 3.82
ITA 21,379 | 351.5 11.3 2.47 3.41 3.41 1.38 4.21 | 604.3 4.4 2.89 0.91 0.83 0.29 3.01
JPN 4207 | 397.8 7.2 6.31 2.15 2.02 0.32 6.63 | 658.7 16.7 4.17 4.97 4.85 1.16 6.40
KOR 3447 | 424.9 10.6 4.52 3.22 2.92 0.65 5.38 | 666.7 32.2 5.09 8.04 7.88 1.55 9.38
LUX 3197 | 348.5 14.1 3.54 4.12 4.03 1.14 5.36 | 615.8 3.3 2.51 1.27 1.14 0.46 2.75
NLD 3318 | 396.9 4.6 5.92 1.18 0.92 0.16 6.00 | 645.8 10.1 5.02 3.31 3.22 0.64 5.96
NOR 3230 | 373.7 5.1 3.44 1.86 1.72 0.50 3.85 | 612.9 4.1 3.26 0.90 0.75 0.23 3.35
POL 3401 | 364.0 13.2 3.60 4.71 4.71 1.31 5.93 | 610.3 7.2 4.09 2.61 2.50 0.61 4.79
PRT 4391 | 354.5 12.0 3.47 3.68 3.64 1.05 5.02 | 607.0 2.6 4.22 0.74 0.49 0.12 4.25
SWE 3139 | 359.6 10.0 3.74 3.31 3.25 0.87 4.95 | 616.0 3.5 3.99 1.13 1.00 0.25 4.11

Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error (see Equation (24)).

In Table A5 in Appendix C, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 science are shown. As for reading and mathematics, the importance of model error was relatively small for country means (M = 0.27 for the error ratio). However, it reached 0.72 for Denmark with a bias-corrected model error of 1.89. For country standard deviations, the error ratio was larger (M = 0.53, Min = 0.00, Max = 1.50).
Table A5

Results and model uncertainty of 11 different scaling models for country means and country standard deviations in PISA 2009 science.

CNT N | Country Mean: M rg SE ME MEbc ER TE | Country Standard Deviation: M rg SE ME MEbc ER TE
AUS 9864 | 517.6 2.7 2.72 0.84 0.83 0.30 2.84 | 104.9 3.4 1.75 0.65 0.58 0.33 1.84
AUT 4577 | 488.1 1.1 3.64 0.29 0.18 0.05 3.64 | 105.7 2.2 2.91 0.63 0.53 0.18 2.96
BEL 5938 | 498.1 2.4 2.51 0.55 0.55 0.22 2.57 | 106.7 2.4 1.98 0.61 0.57 0.29 2.06
CAN 16,075 | 519.6 0.7 1.81 0.15 0.09 0.05 1.81 | 93.8 3.6 1.24 0.94 0.91 0.74 1.54
CHE 8215 | 509.2 1.0 3.01 0.40 0.35 0.12 3.03 | 98.9 2.1 1.82 0.48 0.35 0.19 1.86
CZE 4252 | 494.1 2.9 3.43 1.00 0.98 0.29 3.57 | 99.1 1.1 2.66 0.30 0.00 0.00 2.66
DEU 3477 | 513.9 2.1 3.08 0.55 0.53 0.17 3.12 | 103.3 5.3 2.25 1.09 1.05 0.47 2.48
DNK 4101 | 488.3 4.7 2.62 1.92 1.89 0.72 3.23 | 95.2 3.6 1.98 1.11 1.09 0.55 2.26
ESP 17,876 | 478.2 2.1 2.18 0.46 0.42 0.19 2.22 | 87.9 4.0 1.64 1.00 0.97 0.59 1.90
EST 3272 | 517.5 1.0 2.75 0.31 0.23 0.08 2.76 | 87.3 4.1 1.91 1.09 1.06 0.56 2.18
FIN 4016 | 546.5 3.5 2.48 0.84 0.79 0.32 2.61 | 92.8 10.9 1.55 2.35 2.33 1.50 2.80
FRA 2960 | 488.2 3.7 3.91 1.10 1.02 0.26 4.04 | 105.3 4.1 3.09 1.27 1.15 0.37 3.29
GBR 8413 | 505.0 1.1 2.78 0.36 0.28 0.10 2.79 | 102.6 1.9 1.85 0.64 0.58 0.31 1.94
GRC 3452 | 461.4 4.5 4.10 1.26 1.26 0.31 4.29 | 96.8 8.8 2.22 2.05 2.00 0.90 2.99
HUN 3193 | 494.6 5.0 3.46 1.43 1.36 0.39 3.72 | 89.8 2.5 2.92 0.59 0.50 0.17 2.97
IRL 2738 | 497.0 1.0 3.31 0.36 0.27 0.08 3.32 | 99.4 1.7 2.81 0.50 0.33 0.12 2.83
ISL 2501 | 487.6 3.3 2.01 1.09 1.09 0.54 2.28 | 99.5 5.1 1.89 1.17 1.13 0.60 2.20
ITA 21,344 | 479.7 0.6 1.82 0.21 0.17 0.09 1.83 | 99.1 5.9 1.49 1.20 1.20 0.81 1.91
JPN 4222 | 534.6 7.8 3.76 2.29 2.29 0.61 4.40 | 106.7 10.3 3.15 2.72 2.69 0.85 4.14
KOR 3451 | 530.6 3.6 3.30 1.42 1.42 0.43 3.59 | 86.9 7.9 1.93 2.41 2.34 1.21 3.04
LUX 3195 | 474.8 3.5 1.94 0.91 0.87 0.45 2.12 | 107.9 6.5 1.53 1.63 1.58 1.03 2.20
NLD 3323 | 514.2 2.6 5.77 0.98 0.93 0.16 5.85 | 99.7 4.7 2.32 1.20 1.11 0.48 2.57
NOR 3204 | 491.0 3.2 2.67 1.15 1.10 0.41 2.88 | 93.2 3.2 1.65 0.81 0.74 0.45 1.81
POL 3397 | 499.6 3.1 2.72 0.73 0.70 0.26 2.81 | 92.7 2.3 1.93 0.58 0.52 0.27 2.00
PRT 4336 | 483.4 4.4 3.06 0.89 0.88 0.29 3.19 | 86.0 4.2 1.54 0.89 0.85 0.55 1.76
SWE 3157 | 487.3 1.5 2.85 0.39 0.34 0.12 2.87 | 102.4 2.2 1.58 0.50 0.38 0.24 1.63

Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error (see Equation (24)).

In Table A6 in Appendix C, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 science are shown. The influence of model error on percentiles was slightly smaller in science than in reading or mathematics. The average error ratios were M = 0.44 (Q10) and M = 0.57 (Q90), but the maximum error ratios of 1.53 (Q10) and 2.04 (Q90) indicated that model error was more important than sampling error for some countries.
Table A6

Results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 science.

CNT N | Country 10th Percentile: M rg SE ME MEbc ER TE | Country 90th Percentile: M rg SE ME MEbc ER TE
AUS 9864 | 383.3 3.8 3.19 1.09 1.01 0.32 3.34 | 650.3 12.7 4.07 2.85 2.76 0.68 4.92
AUT 4577 | 350.9 7.9 5.68 2.39 2.29 0.40 6.13 | 621.7 3.8 4.29 1.05 0.90 0.21 4.38
BEL 5938 | 358.7 7.1 4.18 1.96 1.96 0.47 4.62 | 632.9 11.5 2.89 2.28 2.25 0.78 3.66
CAN 16,075 | 398.5 1.4 2.59 0.51 0.36 0.14 2.62 | 638.7 10.2 2.25 2.43 2.39 1.06 3.29
CHE 8215 | 379.4 3.4 3.95 1.11 0.94 0.24 4.06 | 634.2 8.2 3.83 2.01 2.00 0.52 4.32
CZE 4252 | 366.8 6.1 5.49 1.49 1.36 0.25 5.66 | 621.4 4.0 4.26 1.03 0.93 0.22 4.36
DEU 3477 | 379.8 2.3 4.87 0.83 0.36 0.07 4.88 | 645.4 12.5 3.50 2.43 2.38 0.68 4.23
DNK 4101 | 366.8 6.5 3.64 1.98 1.92 0.53 4.12 | 610.9 6.4 3.59 2.48 2.45 0.68 4.34
ESP 17,876 | 365.2 7.1 3.45 1.68 1.65 0.48 3.82 | 590.3 3.8 2.49 0.98 0.90 0.36 2.65
EST 3272 | 404.3 3.1 4.05 1.06 0.99 0.24 4.16 | 629.1 9.5 3.32 2.06 2.02 0.61 3.89
FIN 4016 | 426.8 8.9 3.41 2.11 2.07 0.61 3.98 | 665.1 21.2 3.05 4.61 4.56 1.49 5.48
FRA 2960 | 349.7 14.1 6.26 3.65 3.43 0.55 7.14 | 619.2 6.0 4.70 1.39 1.26 0.27 4.87
GBR 8413 | 372.5 6.5 3.56 1.74 1.69 0.47 3.94 | 635.8 9.4 3.86 2.70 2.62 0.68 4.67
GRC 3452 | 336.7 20.4 6.02 4.52 4.43 0.74 7.47 | 584.8 5.3 4.01 1.43 1.31 0.33 4.22
HUN 3193 | 378.8 2.5 6.41 0.82 0.37 0.06 6.42 | 609.8 4.9 3.77 1.19 1.11 0.30 3.93
IRL 2738 | 370.3 7.3 5.60 2.08 1.93 0.34 5.93 | 623.2 4.3 4.01 1.09 1.01 0.25 4.13
ISL 2501 | 357.8 10.4 3.77 2.56 2.48 0.66 4.51 | 613.3 3.7 2.78 1.12 1.04 0.37 2.97
ITA 21,344 | 350.7 14.0 2.87 2.85 2.85 1.00 4.04 | 605.7 2.2 2.13 0.61 0.57 0.27 2.21
JPN 4222 | 390.5 5.6 7.55 1.48 1.26 0.17 7.66 | 663.4 27.8 3.40 6.97 6.94 2.04 7.72
KOR 3451 | 417.4 6.3 3.83 2.02 1.86 0.49 4.26 | 639.9 16.7 4.45 4.91 4.91 1.10 6.63
LUX 3195 | 334.6 18.6 2.98 4.62 4.55 1.53 5.44 | 612.2 1.6 2.67 0.49 0.18 0.07 2.68
NLD 3323 | 385.4 3.7 6.36 1.34 1.08 0.17 6.45 | 642.4 11.2 5.48 2.52 2.44 0.45 6.00
NOR 3204 | 371.1 5.8 3.31 1.42 1.32 0.40 3.56 | 611.3 3.5 3.58 1.23 1.07 0.30 3.74
POL 3397 | 380.6 3.7 3.81 0.93 0.86 0.22 3.90 | 619.7 3.6 3.51 1.06 0.93 0.26 3.63
PRT 4336 | 373.7 5.4 3.67 1.31 1.17 0.32 3.85 | 595.4 6.9 3.46 1.27 1.24 0.36 3.68
SWE 3157 | 355.5 10.4 3.42 2.27 2.20 0.64 4.07 | 617.5 7.3 3.62 1.81 1.75 0.48 4.02

Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error (see Equation (24)).

To investigate the impact of the choice of model weights in our analysis (see Section 4.2), we additionally conducted a sensitivity analysis for the reading domain by using uniform model weights (weighting scheme W2). That is, we weighted each of the 11 scaling models by 1/11. We studied changes in country means and country standard deviations regarding the composite mean (M), standard errors (SE), and bias-corrected model errors (MEbc). The results are displayed in Table 5.
Table 5

Sensitivity analysis for country means and country standard deviations for original and uniform model weighting for PISA 2009 reading.

CNT | Country Mean: M(W1) M(W2) SE(W1) SE(W2) MEbc(W1) MEbc(W2) | Country Standard Deviation: M(W1) M(W2) SE(W1) SE(W2) MEbc(W1) MEbc(W2)
AUS | 515.2 515.2 2.51 2.51 0.29 0.33 | 104.7 104.7 1.45 1.46 0.64 0.74
AUT | 470.8 471.0 3.34 3.33 0.65 0.74 | 104.6 104.3 2.16 2.18 1.64 1.90
BEL | 509.5 509.7 2.49 2.49 0.78 0.90 | 107.5 107.6 1.92 1.91 0.65 0.74
CAN | 525.0 524.8 1.49 1.49 0.43 0.53 | 95.6 95.8 1.12 1.13 1.18 1.34
CHE | 501.7 501.8 2.72 2.73 0.39 0.43 | 99.7 99.7 1.67 1.68 0.00 0.00
CZE | 479.9 479.9 3.17 3.16 0.27 0.20 | 95.2 95.2 1.86 1.86 0.20 0.15
DEU | 498.5 498.7 3.05 3.04 0.39 0.44 | 100.1 100.1 2.01 2.00 0.00 0.03
DNK | 493.7 493.4 2.10 2.10 1.58 1.75 | 88.0 87.8 1.31 1.33 0.68 0.84
ESP | 480.1 480.0 2.12 2.11 0.43 0.44 | 91.9 91.5 1.18 1.16 1.13 1.34
EST | 501.5 501.4 2.70 2.70 0.75 0.77 | 85.5 85.3 1.71 1.72 0.82 0.99
FIN | 539.0 539.2 2.27 2.31 0.41 0.46 | 91.5 92.4 1.31 1.31 2.68 3.14
FRA | 498.0 498.3 3.92 3.93 1.13 1.35 | 112.2 112.1 2.92 2.92 0.41 0.49
GBR | 494.0 494.0 2.47 2.47 0.20 0.25 | 99.6 99.4 1.34 1.35 0.73 0.82
GRC | 480.6 480.3 4.26 4.23 0.96 1.00 | 99.8 99.4 2.09 2.06 1.38 1.55
HUN | 494.2 494.0 3.62 3.61 0.40 0.47 | 94.8 94.6 2.78 2.78 0.58 0.66
IRL | 496.8 496.7 3.24 3.21 0.51 0.52 | 98.8 98.3 2.63 2.60 1.19 1.38
ISL | 501.2 501.2 1.67 1.68 0.15 0.14 | 102.0 102.3 1.40 1.41 0.96 1.07
ITA | 486.5 486.6 1.61 1.61 0.32 0.36 | 101.4 101.5 1.35 1.34 0.77 0.87
JPN | 521.3 521.3 3.71 3.71 1.60 1.79 | 107.3 107.7 3.16 3.16 1.52 1.96
KOR | 539.7 539.4 3.10 3.13 1.45 1.48 | 84.2 84.7 1.76 1.78 2.02 2.33
LUX | 472.7 473.0 1.19 1.19 1.22 1.38 | 109.3 108.9 1.21 1.23 1.99 2.29
NLD | 509.0 509.0 5.58 5.62 0.28 0.32 | 95.1 95.5 1.89 1.90 1.01 1.17
NOR | 503.3 503.4 2.61 2.63 0.14 0.20 | 96.8 97.2 1.55 1.56 0.93 1.14
POL | 501.7 501.8 2.72 2.73 0.72 0.67 | 92.8 93.0 1.32 1.34 0.84 0.96
PRT | 489.2 489.0 3.17 3.16 0.70 0.83 | 91.8 91.5 1.75 1.74 0.71 0.88
SWE | 497.0 497.0 3.00 3.00 0.00 0.00 | 103.6 103.4 1.63 1.64 0.27 0.32

Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); W1 = model weighting used in the main analysis (see Section 4.2 and results in other tables); W2 = uniform weighting of models.

For the composite estimate of the country mean, we only observed tiny differences between the proposed model weighting W1 and the uniform weighting W2. The absolute difference in country means was 0.14 on average (SD = 0.11) and ranged between 0.01 and 0.36 (South Korea, KOR). The average absolute difference for the change in country standard deviations was also small (M = 0.26; SD = 0.20). Notably, there were almost no changes in the standard errors of country means and country standard deviations between the two weighting methods. However, the bias-corrected model error slightly increased with uniform weighting, from M = 0.62 to M = 0.68 for country means and from M = 0.96 to M = 1.12 for country standard deviations. In conclusion, employing a different weighting scheme might not strongly change the composite estimate or the standard error, but it can matter for the model uncertainty quantified by the model error.
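
The uniform weighting scheme W2 can be illustrated with a brief Python sketch. It computes the composite country mean as a weighted mean of the 11 model-specific estimates with weights 1/11, using the rounded reading country means for Australia and South Korea from Table 2. The function name `weighted_mean` is hypothetical, and the weights of scheme W1 are not reproduced here because they are defined in Section 4.2. Small differences from Table 5 can arise because the inputs are rounded.

```python
def weighted_mean(estimates, weights):
    """Composite estimate as a weighted mean of model-specific estimates."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("model weights must sum to 1")
    return sum(w * e for w, e in zip(weights, estimates))


# Reading country means for the 11 scaling models (order as in Table 2).
aus = [515.1, 515.8, 514.8, 515.2, 515.7, 515.2, 515.2, 515.5, 515.0, 515.0, 514.5]
kor = [541.3, 541.4, 541.5, 541.2, 538.7, 538.2, 538.5, 538.7, 538.5, 537.4, 537.6]

uniform = [1 / 11] * 11  # weighting scheme W2

aus_w2 = weighted_mean(aus, uniform)
kor_w2 = weighted_mean(kor, uniform)
print(round(aus_w2, 1), round(kor_w2, 1))
```

The resulting composites agree with the W2 column of Table 5 (515.2 for AUS and 539.4 for KOR) up to rounding.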

6. Discussion

Overall, our findings demonstrate that uncertainty regarding the IRT scaling model influences country means. This kind of uncertainty is too large to be neglected in reporting. For some of the countries, the model error exceeded the sampling error. In this case, confidence intervals based on standard errors for the sampling of students might be overly narrow. A different picture emerged for standard deviations and percentiles. In this case, the choice of the IRT model turned out to be much more important. Estimated error ratios were, on average, between 0.40 and 0.80, indicating that the model error introduced a non-negligible amount of uncertainty in the parameters of interest. Moreover, the importance of model error compared to sampling error was even larger for some of the countries. In particular, distribution parameters for high- and low-performing countries were substantially affected by the choice of the IRT model.
In our analysis, we only focused on 11 scaling models studied in the literature. However, semi- or nonparametric IRT models could alternatively be utilized [16,53,105,106,107], and their impact on distribution parameters could be an exciting topic for future research. If more parameters were included in an IRT model, we would expect an even larger impact of model choice on distribution parameters. In our analysis, we did not use student covariates for drawing plausible values [100,108]. It could be that the impact of the choice of the IRT model would be smaller if relevant student covariates were included [109]. Future research can provide answers to this important question.
As a summary of our research (see also Section 2.3), we would like to argue that model uncertainty should also be reported in educational LSA studies. This could be particularly interesting because the 1PL, 2PL, or 3PL models are applied in these studies.
In model comparisons, we have shown that the 3PL with residual heterogeneity (3PLRH) and the 3PL with a quadratic effect of θ (3PLQ) were superior to alternatives. If the 2PL model is preferred over the 1PL model for reasons of model fit, three-parameter models must be preferred for the same reason. However, a central question might be whether the 3PLRH should be implemented in the operational practice of LSA. Technically, it would certainly be feasible, and there is no practical added complexity compared to the 2PL or the 3PL model. Interestingly, some specified IRT models have the same number of item parameters but a different ability to fit the item response data. For example, the 3PL and the 3PLRH models have the same number of parameters, but the 3PLRH is often preferred in terms of model fit. This underlines that the choice of the functional form is also relevant, not only the number of item parameters [30]. Frequently, the assumed IRT models will be grossly misspecified for educational LSA data. The misspecification could lie in the functional form of the IRFs or in the assumption of invariant item parameters across countries. The reliance on ML estimation of misspecified IRT models might therefore be questioned. As an alternative, (robust) limited-information (LI) estimation methods [110] can be used. Notably, ML and LI methods result in a different weighting of model errors [111]. If differential item functioning (DIF) across countries is critical, IRT models can also be separately estimated in each country, and the results brought onto a common international metric through linking methods [112,113]. In the case of a small sample size at the country level, regularization approaches for more complex IRT models can be employed to stabilize estimation [114,115]. Linking methods have the advantage of a clear definition of model loss regarding country DIF [116,117,118] compared to joint estimation with ML or LI estimation [119].
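The point that two models with the same number of parameters can still differ in fit can be sketched with the standard definitions of AIC and BIC. When the parameter counts are equal, the penalty terms cancel, so the ranking depends only on the log-likelihood, that is, on the functional form. The log-likelihood values, parameter count, and sample size below are hypothetical and are not results from the article.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: -2 log L + 2 p."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: -2 log L + log(n) p."""
    return -2.0 * log_lik + math.log(n_obs) * n_params


# Hypothetical fits: both models use the same number of item parameters
n_obs = 5000
models = {
    "3PL": {"log_lik": -10050.0, "n_params": 90},
    "3PLRH": {"log_lik": -10020.0, "n_params": 90},
}

for name, fit in models.items():
    a = aic(fit["log_lik"], fit["n_params"])
    b = bic(fit["log_lik"], fit["n_params"], n_obs)
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")

# With equal parameter counts, both criteria differ by twice the
# log-likelihood difference between the two models.
```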
As pointed out by an anonymous reviewer, applied psychometric researchers seem to have a tendency to choose the best-fitting model with little care for whether that choice is appropriate in the particular research context. We have argued elsewhere that the 1PL model is more valid than IRT models with more parameters because of its equal weighting of items [27]. If Pandora’s box is opened via the argument of choosing a more complex IRT model due to improved model fit, we argue for a specification of different IRT models and an integrated assessment of model uncertainty, as has been proposed in this article. In this approach, however, the a priori choice of model weights has to be made carefully.