Literature DB >> 32156195

A study of alternative approaches to non-normal latent trait distributions in item response theory models used for health outcome measurement.

Niels Smits¹, Oğuzhan Öğreden², Mauricio Garnier-Villarreal³, Caroline B Terwee², R Philip Chalmers⁴.

Abstract

Entities: Chemical Disease Gene Species

Keywords: Item response theory; normality; patient-reported outcomes; skewness; zero-inflation

Mesh：

Year: 2020 PMID： 32156195 PMCID： PMC7221458 DOI： 10.1177/0962280220907625

Source DB: PubMed Journal: Stat Methods Med Res ISSN： 0962-2802 Impact factor: 3.021

× No keyword cloud information.

Introduction

Health outcomes research and clinical practice often require measurement of directly unobservable variables, such as pain and depression. A common strategy is measuring these variables by means of questionnaires. For instance, the Patient-Reported Outcomes Measurement Information System (PROMIS) Depression Item Bank is a collection of questions, or “items,” which was developed to measure depression.[1] One of the items of the item bank asks patients to report how often they felt like nothing could cheer them up in the past seven days. Patients can choose among five response options: never, rarely, sometimes, often, and always. In order to assign a score to an individual based on his or her observed item responses, a metric for the measurement instrument is needed; statistical models can be used to create such metrics. For instance, the graded response model (GRM)[2] is often used for items with multiple response options. The GRM is a member from a family of statistical models known as item response theory (IRT).[3] Historically one of the central tools of ability measurement and educational testing, IRT models have recently gained popularity in health outcomes research.[4-8] For instance, the metric of all PROMIS instruments, including the PROMIS Depression Item Bank, were created using the GRM.[9] Under the GRM, the observed score X on item j with ordinal response options is expressed using K cumulative response probabilities, given by which gives the probability that the response will be category k or higher. Given that and , the probability that the observed response will be the kth response option can be obtained by subtracting cumulative probabilities In these equations, θ represents the latent score of a respondent on the metric of the instrument, a determines the rate of increase of the logistic curves, and b is the location on the metric of the instrument where respondents have a probability of 0.50 of choosing category k or higher. Parameters a and b are conventionally called “slope” (or “discrimination”) and “threshold” (or “difficulty”) parameters, respectively.[10] Since the latent score θ is unobservable, the metric of the instrument is not defined by the model, but must be identified using model constraints. In most cases, the GRM is estimated using marginal maximum likelihood (MML),[11,12] where the scale is fixed using a latent density function in which the mean and variance are constrained. Conventionally, is assumed to be the standard normal density (a mean of zero and a standard deviation of one). Following the initial applications of IRT in health outcomes research, challenges of this new venue were identified.[13] One of these challenges is related to the nature of the variables of interest and the assumption of standard normality of the latent score distribution (shown in Figure 1, left-hand panel). In health outcomes research, constructs are often operationalized in such a way that there is a lower end of the scale which merely expresses the absence of the underlying construct (e.g., “no depression”) and a higher end which expresses severity using a substantially finer gradation (e.g., “mild,” “moderate,” or “severe” depression). Such constructs are called “quasi-traits,”[13] and one of their characteristics is that the underlying latent variable may be non-normally distributed in the population used for calibrating the instrument.

Figure 1.

Normal, zero-inflated, and skewed latent trait distribution examples.

Normal, zero-inflated, and skewed latent trait distribution examples. The non-normality of quasi-traits may be the consequence of various processes.[14-16] First, the population may be composed of two or more subpopulations. For instance, the general population may be composed of individuals with and without depression. Then one could assume a normal distribution for each subpopulation, allowing for differences in severity between populations, like in a multiple-group model.[17] However, in the case of quasi-traits, with all respondents from the healthy group selecting the lowest response options, it is not realistic to assume normality within each group. In fact, a latent score cannot be assigned to respondents from this population, since a complete zero pattern does not contain information about the specific latent score. One way to deal with this is to assume an additional response process leading to the surplus of zero-patterns. More specifically, a normal distribution for the latent variable has been suggested for the regular response process and a constant distribution for the latent variable under the other response process. For example, when measuring depression severity, a normal distribution is assumed for the subpopulation suffering from depression, and the healthy subpopulation would be assumed to have a very low latent trait score with no variance (Figure 1, middle panel). This scenario has been referred to as “zero-inflation,” since there is an excess of complete zero patterns under the IRT model.[7,16] An alternative scenario leading to non-normality would be one where the latent trait distribution in the population is skewed (Figure 1, right-hand panel). This situation may arise when the majority of the population has low to medium latent trait scores but a small fraction with much higher scores is also present.[18] This process will be referred to as “skewness.” Previous research suggests that the violation of the normality assumption is detrimental to item parameter and score estimates. For instance, a study examining the robustness of MML estimation of the Rasch model revealed that deviations from normality caused bias in item parameter estimates, with bias increasing with more severe skewness.[19] Another study suggested that when the underlying distribution is positively skewed, assuming a normal distribution would introduce negative bias in slope estimates and negative bias for lower category threshold estimates, which means that the estimates would be too low.[15] In addition, the bias in parameter estimates has been shown to be more outspoken for extreme values of the parameter.[20] Recently, similar results were reported,[21] although one study reported smaller biases.[22] It has also been suggested that parameter estimates are biased in the presence of zero-inflation;[16,23] for instance, the slope parameters would be overestimated. Biased parameter estimates may lead to biased latent score estimates, by which conclusions with reference to a patient’s relative standing or a patient’s progress may be false.[24] Also, biased parameter estimates may lead to biased estimates of reliability, by which the measurement precision of a patient’s score estimate may be under- or overestimated.[17] To deal with the bias stemming from non-normality, alternative IRT models have been developed. Recently, the use of mixtures of normal distributions was suggested as an approximation to non-normal distributions.[25] A special case of the mixtures of normal distributions can be used when zero-inflation is expected and is called zero-inflated mixture IRT (ZIM-IRT).[16,23] Also, solutions have been suggested for estimating IRT models for data resulting from skewed latent distributions, such as incorporating a skewed-normal density,[26] and augmenting the model with an extra step in which the density is estimated empirically. Two recent examples of the latter approach are Ramsay Curve IRT (RC-IRT)[15] and Davidian Curve IRT (DC-IRT).[27] Recently, the use of ZIM-IRT and RC-IRT, among others, was exemplified using the PROMIS Anger item bank,[28] which consists of 29 polytomous items. The authors reported that, in comparison to the GRM, the slope estimates were higher for RC-IRT and lower for ZIM-IRT, whereas the threshold estimates had a smaller range for RC-IRT and a larger range for ZIM-IRT. This study was illustrative of the considerable impact of model choice on item parameters. However, because it used empirical data, which obviously lack an unequivocal criterion (e.g., true values of item parameters), the relative performance of the competing approaches could not be assessed. For a systematical and objective comparison of methods, Monte Carlo simulations are more appropriate. Several simulation studies have shown promising results for methods dealing with non-normality, but the conditions under which these methods were evaluated were not in line with the settings commonly encountered while developing item banks for health outcomes. For instance, ZIM-IRT was evaluated using simulated data sets which consisted of 10 or 11 items with dichotomous response options,[16,23] whereas the PROMIS item banks, for instance, typically contain many more items and more than two response options.[29] Likewise, the parameter recovery performance of DC-IRT has not yet been evaluated in a simulation study where polytomous items are considered. Therefore, the available series of simulation studies is incomplete, and additional studies are required to examine the performance of these new methods in settings typical for health outcome measurement. The aim of the current study was to investigate two models, each corresponding to one of the possible causes of non-normality discussed previously, focusing on the settings typically encountered in the assessment of health outcomes, where the measured constructs often consist of quasi-traits and the item pool commonly consists of a large number of polytomous questions. The method for tackling non-normality caused by zero-inflation, ZIM-IRT, is the most popular method in the literature. The method for dealing with skewness, DC-IRT, is the simplest method available allowing for flexible estimation of various forms of the latent distribution, including skewed distributions. Moreover, both methods are readily available in popular software packages, ZIM-IRT in Mplus[11] and DC-IRT in the R package mirt.[12] Polytomous extensions of ZIM-IRT and DC-IRT were compared to the GRM, in order to identify the conditions under which these models improve parameter estimation.

Methods

Models for non-normal latent trait distributions

The first method that was examined was the ZIM-GRM, which is the extension of ZIM-IRT[23] to deal with zero-inflation in polytomous items. ZIM-IRT was inspired by the models developed for dealing with excess zeros in count data with correlated observations[30] and combines the idea of using latent classes to identify a group with extreme responses with a method for weakening the normality assumption through a mixture distribution.[23] ZIM-GRM replaces the standard normal density used in MML estimation with a mixture of distributions The first term (where I is an indicator function) states that with probability π, the latent trait is assigned a score of −100, and the second term (where g represents a normal density) states that with probability , the latent trait is normally distributed. Parameter π is the proportion of respondents in the population who belong to the zero-inflation state. As such, represents a mixture distribution of a normal and a degenerate component.[30] Alternatively, the model may also be explained through a latent class model.[16] It contains two latent classes, the first class being represented by a normal latent trait distribution and the zero-inflation class by a degenerate distribution for . The second method was the DC-GRM, the extension of DC-IRT[27] for polytomous items. DC-GRM modifies the standard normal density used in estimation of the GRM, , with a Davidian curve (a semi-nonparametric method for approximating densities[31]) which is given by where g represents a normal density, t is the number of smoothing parameters, and the term is used for relaxing normality, where is the vector of smoothing parameters and . Since R is a polynomial of degree t, higher values of t makes the density more flexible. During model estimation, is constrained to have a zero mean and unit variance to maintain the GRM’s usual scale.[27,31] Under these constraints, t = 0 and t = 1 give the standard normal density. During the analysis, the best fitting t is chosen among a set of candidates. Lower values of the Hannan–Quinn information criterion (HQC),[32] which is a function of the model’s likelihood, number of parameters, and sample size, have been suggested for model selection.[27] In the current study, DC-GRMs were estimated using . The model with the smallest HQC was selected for obtaining item parameter estimates. Finally, to illustrate the impact of ignoring non-normality and to provide a benchmark for the new methods, the standard version of the GRM was examined as well.

Simulation design

In the simulation study, the following four factors were varied: number of respondents, number of items, ratio of zero-respondents, and skewness of the latent score distribution. The levels of these factors were based on a review of the literature, as described below. Regarding the number of respondents, simulation studies have suggested to use a sample of at least 500 observations for estimating the GRM with sufficient precision,[33] but is common in health outcome research to use larger samples for calibration,[1,34] and therefore, levels of 500 and 1000 respondents were used. Regarding the number of items, calibration studies in the health outcomes field have been published with number of items ranging from 10 to 124 items;[29] to cover a large part of this range, levels of 5, 25, and 50 were used. The levels of rate of respondents in the zero-inflation group were fixed at 0%, 10%, and 25% to include the absence of zero-inflation, a ratio representative of published studies,[18,28] and a ratio closer to extreme cases in the literature, respectively.[16,23] Since no calibration studies were found that reported on the estimated skewness of the latent trait, skewness levels were based on the range encountered in previous simulation studies[19,20,22]; levels of skewness of 0, 0.50, and 0.75 were used. Because in the simulation study two separate scenarios for causing non-normality were examined, its design had a hierachical nature. The factors number of respondents and number of items were nested within factor rate of zero-inflation and within factor skewness, producing a design for each scenario; within each design cell, 100 replications were produced yielding a total of 1800 + 1800 = 3600 simulated data sets.

Data generation

Although it is common in simulation studies to use parameter values typically encountered in the field of interest,[35,36] it was chosen not to use calibration results from the health outcomes field, as many studies[1,18,37,38] show signs of non-normality themselves which likely resulted in biased parameter estimates. By contrast, it was chosen to use parameter values from a classic simulation study using the GRM,[33] in which the values for item parameters of a “good” test were specified. Discrimination parameters were drawn from a uniform distribution between 0.75 and 1.33, and threshold parameters were drawn as follows: b1 uniform between −2.0 and −1.0, b2 uniform between −1.0 and 0.0, b3 uniform between 0.0 and 1.0, and b4 uniform between 1.0 and 2.0. For each data set, a new draw was taken from these distributions, by which the true item parameter values varied over replications. For each data set, two subsets were created. The first was a “calibration set” for studying the impact of non-normality on the item parameter estimates of the various methods. The second was a “validation set” which was used to illustrate the impact of potential deviations in the estimation of item parameter on estimates of the latent trait score. For the calibration sets, the latent trait scores were created as follows. Under the skewness scenario, the method introduced by Fleishman[39] was used to provide the skewed latent trait distribution with a mean of zero and standard deviation of one. Under the zero-inflation scenario, the non-zero-inflated part of the latent trait distribution followed a standard normal distribution. The zero-inflation part consisted of latent trait scores of −100, and the rate of zero-inflation was varied by varying the fraction of respondents drawn from the zero-inflated part. For all validation sets, 51 true latent score values between −2.5 and 2.5 with increments of 0.1 were used. In the creation of each data set, using the sampled values of item parameters and latent trait scores, equations (1) and (2) were used to calculate the five response category probabilities of each item for each simulated respondent. Next, item responses were obtained by randomly drawing from the resulting multinomial distributions. If not all five item categories were observed for all items in the calibration set, item parameters were re-sampled and item responses were created anew.

Simulation outcomes

Of each data file, the calibration set was used to examine item parameter estimates, and the validation set was used to examine latent trait scores. Although other methods are available for estimating latent traits,[40] it was chosen to use maximum likelihood (ML) in order to keep the estimation in agreement with the modeling approach. The primary outcome of the simulation study was the bias of both item and person parameter estimates, which was calculated as the mean difference between the true values and estimates where x and are the true parameter value, and the estimate of the parameter, respectively, and N is the total number of values used for aggregation. N differs for item and person parameters: for latent trait values (e.g., ), N is equal to the number of replications (100), whereas for each of the five item parameters (e.g., the a-parameter), N is equal to as the outcomes are aggregated across all items within a data set. Note that since true item parameter values vary within data sets and over replications, like , x is expected to vary. Bias will be larger than zero when the parameters are systematically overestimated, equal to zero if estimates are on average equal to the true value and smaller than zero if parameters are underestimated. As secondary outcome measure, the root mean squared error (RMSE) was used, which was calculated as and which is a function of both bias and sampling variance;[41] it was chosen to report on this outcome only when it provided information over and above the bias measure.

Data analysis

For analyzing the impact of non-normality on item parameters, both tabulations and graphical displays of the average bias and RMSE in each cell of the design were used. For analyzing the impact on latent trait estimates, graphical displays depicting the outcomes as a function of true latent values were used. For the simulation and data analysis, R (version 3.5.1)[42] was used. The R packages mirt (version 1.29)[12] and SimDesign (version 1.13)[43] were used for generating data, estimating GRM and DC-GRM, and managing simulations. DC-GRM estimation was implemented as part of the mirt package during the study (see Appendix A for details of the implementation). The ZIM-GRM was estimated using the MplusAutomation package (version 0.7.3)[44] for running Mplus version 7.4 from R.[11] The original Mplus script published by Wall et al.[23] for dichotomous item responses was adapted for the current settings.

Results

Throughout the analyses, the outcomes for sample sizes of 500 and 1000 were highly similar. Therefore, in this section, we mainly focus on results for the condition with 500 respondents and only discuss the condition with 1000 respondents when results clearly deviate. Tables 1 to 4 and 6 to 9 in Appendix B provide detailed results for both levels of sample size. Also, as the secondary outcome measure RMSE provided no additional information compared to the bias it was decided to leave it out of the report.

Zero-inflated latent trait distributions

A first inspection of the ZIM-GRM results showed that the estimate of the rate of zero-inflation (π in equation (3)) was very close to the true rate: in all conditions with 25 and 50 items, for every data set, the estimates were within 0.2 and 0.1 percentages points of the true values, respectively. In the five-item condition, estimates were always within two percentage points of the true value. Figure 2 shows the average bias in the estimation of the a, b1, and b4 parameter as a function of number of items and rate of zero-inflation for both the GRM and ZIM-GRM. In the absence of zero-inflation (see the left-hand panel of each of the three plots), both the GRM and the ZIM-GRM had an average bias close to zero for all levels of number of items. In the zero-inflation conditions, whereas the ZIM-GRM mostly showed unbiased parameter estimates, the GRM showed substantial bias, with bias increasing with increasing zero-inflation and increasing number of items. In the case of 10% zero-inflation, the average bias in a parameter estimates was 0.54, 0.80, and 0.98, respectively, for numbers of items of 5, 25, and 50; in the case of 25% zero-inflation, these values were more than twice as high. Parameter b1 showed a similar pattern: in the 10% zero-inflation condition, the average bias was 0.75, 0.92, and 1.01, respectively, for numbers of items of 5, 25, and 50; in the 25% zero-inflation condition, these respective values were at least 60% higher. The b4 parameter showed an opposite but less extreme pattern: it was somewhat overestimated with less bias for lower numbers of items, and with differences between the rate of zero-inflation conditions that were much smaller. Turning to the GRM estimates of parameters b2 and b3 (see Table 1 in Appendix B), average bias showed a similar pattern as for b1, but bias was smaller with b3 showing considerably less bias than b2. In short, in case of zero-inflation, the GRM overestimated the a parameter, and it yielded b parameters with both a biased center (estimates were too high) and a biased spread (the estimates of the respective thresholds were too close to one another).

Figure 2.

Table 1.

Bias of GRM parameter estimates averaged over the zero-inflated conditions.

			Parameter
Respondents	Items	Zero-inflation	a	b ₁	b ₂	b ₃	b ₄
500	5	0%	0.01	−0.03	−0.01	0.00	0.02
		10%	0.54	0.75	0.39	0.07	−0.23
		25%	1.29	1.33	0.75	0.23	−0.29
	25	0%	0.00	−0.03	−0.01	0.00	0.02
		10%	0.80	0.92	0.48	0.07	−0.35
		25%	2.00	1.51	0.86	0.21	−0.44
	50	0%	0.00	−0.02	−0.01	0.00	0.01
		10%	0.98	1.01	0.53	0.05	−0.43
		25%	2.32	1.62	0.93	0.25	−0.43
1000	5	0%	0.01	−0.01	0.00	0.00	0.01
		10%	0.53	0.75	0.39	0.08	−0.23
		25%	1.27	1.33	0.77	0.23	−0.28
	25	0%	0.01	−0.01	−0.01	0.00	0.00
		10%	0.79	0.92	0.48	0.05	−0.37
		25%	1.98	1.51	0.86	0.21	−0.44
	50	0%	0.00	−0.01	−0.01	−0.01	0.00
		10%	0.97	1.01	0.53	0.05	−0.42
		25%	2.31	1.61	0.93	0.24	−0.45

Bias in item parameter estimates of the GRM and ZIM-GRM per level of number of items and of zero-inflation (0%, 10%, and 25%) for a sample size of 500. GRM: graded response model; ZIM-GRM: zero-inflated mixture graded response model. Figure 3 shows the average bias in the estimation of the latent score as a function of true latent score, number of items, and rate of zero-inflation for both the GRM and ZIM-GRM (also see Tables 3 and 4 in Appendix B, for average bias values of aggregated values of the true latent trait). In the absence of zero-inflation (see the data points depicted by a square), both the GRM and ZIM-GRM showed no bias except for true latent scores smaller than −2 (small positive bias) and larger than 2 (small negative bias) in case of a number of items of five. This is an artifact originating from the use of ML for estimating the latent scores: for these regions, extreme response patterns (0 0 0 0 0, and 4 4 4 4 4, respectively) were frequently encountered, and ML’s inability to scale these resulted in missing values, which were excluded from the analysis. As a result, the average estimate for the remaining response patterns of high (low) true latent scores were artificially low (high). In the two zero-inflation conditions, whereas the ZIM-GRM showed unbiased latent score estimates, the GRM showed substantial bias, with bias increasing with increasing zero-inflation and increasing number of items. For example, in the case of five items and 10% zero-inflation, a true value of −2.5 was overestimated by more than one unit. The plots show that both the spread and center of the latent scale were affected by the GRM: For more extreme true latent scores, estimates were more biased toward zero, but the size of shrinkage was larger on the left-hand side of the scale than on its right-hand side, yielding a positively valued overall average bias.

Figure 3.

Bias in latent trait estimates for the GRM and ZIM-GRM per level of number of items and zero-inflation for a sample size of 500. GRM: graded response model; ZIM-GRM: zero-inflated mixture graded response model.

Table 3.

Bias of GRM estimates of latent trait scores averaged in the zero-inflation conditions.

			True latent trait values
Respondents	Items	Zero-inflation	−2.5≤θ<−1.5	−1.5≤θ<−0.5	−0.5≤θ≤0.5	0.5<θ≤1.5	1.5<θ≤2.5
500	5	0%	0.08	0.03	−0.02	0.06	−0.12
		10%	0.99	0.55	0.23	−0.08	−0.48
		25%	1.66	1.06	0.51	−0.05	−0.66
	25	0%	−0.07	0.00	−0.01	0.00	0.01
		10%	1.12	0.72	0.27	−0.17	−0.59
		25%	1.87	1.22	0.54	−0.15	−0.79
	50	0%	−0.01	0.01	0.00	0.00	0.00
		10%	1.27	0.79	0.28	−0.22	−0.70
1000	5	25%	2.00	1.30	0.59	−0.13	−0.81
		0%	0.06	0.00	0.01	−0.02	−0.09
		10%	1.00	0.56	0.23	−0.06	−0.49
		25%	1.67	1.07	0.50	−0.06	−0.62
	25	0%	−0.04	−0.01	0.02	−0.01	0.02
		10%	1.13	0.70	0.27	−0.17	−0.57
		25%	1.85	1.22	0.53	−0.15	−0.78
	50	0%	−0.02	0.00	0.00	−0.01	0.00
		10%	1.26	0.78	0.29	−0.21	−0.68
		25%	1.98	1.30	0.58	−0.13	−0.82

Table 4.

Bias of ZIM-GRM estimates of latent trait scores averaged in the zero-inflation conditions.

			True latent trait values
Respondents	Items	Zero-inflation	−2.5≤θ<−1.5	−1.5≤θ<−0.5	−0.5≤θ≤0.5	0.5<θ≤1.5	1.5<θ≤2.5
500	5	0%	0.05	0.02	−0.02	0.06	−0.10
		10%	0.08	−0.05	−0.02	0.02	−0.09
		25%	0.09	−0.02	0.02	0.00	−0.12
	25	0%	−0.06	0.01	0.00	0.00	0.02
		10%	−0.05	0.00	0.00	0.00	0.01
		25%	−0.01	0.02	0.00	−0.01	−0.01
	50	0%	0.00	0.02	0.00	0.00	0.01
		10%	0.00	0.00	−0.01	−0.01	−0.01
		25%	0.01	−0.01	0.01	0.00	0.01
1000	5	0%	0.05	−0.01	0.00	−0.01	−0.08
		10%	0.13	−0.04	−0.02	0.04	−0.12
		25%	0.12	0.01	0.01	−0.02	−0.06
	25	0%	−0.03	−0.01	0.03	−0.01	0.03
		10%	−0.03	−0.03	0.00	0.00	0.03
		25%	−0.05	0.01	−0.01	−0.01	0.04
	50	0%	−0.02	0.00	0.00	−0.01	0.01
		10%	0.00	−0.02	0.00	0.01	0.02
		25%	−0.01	0.00	0.01	0.02	0.03

Skewed latent trait distributions

As DC-IRT was not yet available in publicly available software, it was implemented as part of the mirt package during the study. Appendix A shows a validation study in which the simulation results of the original procedure by Woods and Lin[27] were successfully replicated. In a first inspection of the estimated DC-GRMs, the number of selected smoothing parameters (t in equation (4)) was examined (see Table 5 in Appendix B, which shows its average for each cell of the simulation design). When skewness was absent, the average number of selected smoothing parameters was very close to the minimally required number of two parameters for all conditions. When skewness was present, t increased (indicating responsiveness to non-normality) with increasing skewness and increasing number of items; this pattern applied to both numbers of respondents, but with somewhat higher values in the condition with 1000 respondents.

Table 5.

Average number of smoothing parameters used by the DC-GRM in the simulation conditions.

Respondents	Items	Skewness	Mean	SD
500	5	0.00	2.07	0.29
		0.50	2.43	0.54
		0.75	2.72	0.49
	25	0.00	2.04	0.20
		0.50	3.13	0.54
		0.75	4.50	0.87
	50	0.00	2.11	0.37
		0.50	3.35	0.72
		0.75	4.64	0.77
1000	5	0.00	2.04	0.20
		0.50	2.59	0.62
		0.75	3.19	0.77
	25	0.00	2.05	0.22
		0.50	3.47	0.85
		0.75	4.94	0.34
	50	0.00	2.08	0.31
		0.50	3.68	0.91
		0.75	5.00	0.00

Note: While fitting DC-GRM, t values from 2 to 5 were considered, and HQC was used for selecting the best value. SD: standard deviation.

Figure 4 presents the average bias in the estimates of a, b1, and b4 as a function of number of items and level of skewness for both the GRM and DC-GRM. In the absence of skewness (see the left-hand panel of each of the three plots), the GRM and DC-GRM had very similar outcomes, with average bias within 0.02 of each other and no values exceeding 0.04 in absolute value. In the two conditions with skewed latent traits (a skewness of 0.50, and 0.75, respectively), the GRM and DC-GRM mostly showed similar outcomes as well, both with average bias very close to zero for the a parameter, and for the b parameters mostly values within 0.03 of each other, and most values not exceeding 0.05, with the exception of the estimates of b4. For this parameter, whereas for five items, the average bias of the DC-GRM was very close to zero, for 25 and 50 items, its average bias was somewhat higher (0.06 and 0.08, respectively) than that of the GRM (0.02 and 0.03, respectively); for both methods, however, this bias was lower in the case of 1000 observations. In short, whereas the GRM was hardly affected in the two conditions with skewed latent traits, the DC-GRM showed a small bias in conditions with the highest level of skewness and the largest studied item set. To illustrate the effect of bias in item parameters on latent trait estimates, Figure 5 shows the average bias in latent trait estimates as a function of true latent score, level of skewness, and number of items for both the GRM and DC-GRM (also see Tables 8 and 9 in Appendix B, for the average bias of aggregated values of the true latent trait). In the absence of skewness (see the data points depicted by a square), both the GRM and DC-GRM showed no bias except for extreme true latent scores, an artifact due to using ML for estimating latent scores, as described in the previous section. In the presence of skewness, the results of the GRM and DC-GRM were mostly very similar, showing little bias, with the exception of the condition with a skewness of 0.75 and 50 items, in which the DC-GRM showed somewhat more extreme bias than the GRM, with the highest mean difference between the methods being 0.07 for trait scores above 1.5.

Figure 4.

Figure 5.

Bias in latent trait estimates for the GRM and DC-GRM per level of number of items and skewness for a sample size of 500. GRM: graded response model; DC-GRM: Davidian curve graded response model.

Table 8.

Bias of GRM estimates of latent trait scores averaged in the skewness conditions.

			True latent trait values
Respondents	Items	Skew	−2.5≤θ<−1.5	−1.5≤θ<−0.5	−0.5≤θ≤0.5	0.5<θ≤1.5	1.5<θ≤2.5
500	5	0.00	0.08	0.03	−0.02	0.06	−0.12
		0.50	0.10	0.00	0.02	−0.01	−0.09
		0.75	0.09	0.03	0.01	0.02	−0.09
	25	0.00	−0.07	0.00	−0.01	0.00	0.01
		0.50	−0.03	0.00	0.02	0.02	0.04
		0.75	−0.06	0.00	0.02	0.01	0.06
	50	0.00	−0.01	0.01	0.00	0.00	0.00
		0.50	−0.01	−0.01	0.00	0.00	0.00
		0.75	0.00	0.00	0.00	0.00	0.01
1000	5	0.00	0.06	0.00	0.01	−0.02	−0.09
		0.50	0.10	−0.01	0.01	0.01	−0.07
		0.75	0.08	−0.02	0.02	0.02	0.00
	25	0.00	−0.04	−0.01	0.02	−0.01	0.02
		0.50	−0.03	0.00	0.00	0.00	0.05
		0.75	−0.04	0.01	−0.02	0.01	0.01
	50	0.00	−0.02	0.00	0.00	−0.01	0.00
		0.50	−0.02	−0.01	−0.01	0.01	0.01
		0.75	−0.01	0.01	0.01	0.01	0.02

Table 9.

Bias of DC-GRM estimates of latent trait scores averaged in the skewness conditions.

			True latent trait values
Respondents	Items	Skew	−2.5≤θ<−1.5	−1.5≤θ<−0.5	−0.5≤θ<0.5	0.5≤θ<1.5	1.5≤θ<2.5
500	5	0.00	0.07	0.03	−0.02	0.05	−0.12
		0.50	0.12	0.00	0.01	−0.03	−0.11
		0.75	0.11	0.04	0.00	0.00	−0.11
	25	0.00	−0.06	0.01	−0.01	0.00	0.01
		0.50	−0.04	0.00	0.02	0.02	0.05
		0.75	−0.07	−0.01	0.03	0.04	0.10
	50	0.00	0.00	0.01	0.00	0.00	0.01
		0.50	−0.02	−0.02	0.00	0.01	0.02
		0.75	−0.03	0.00	0.02	0.04	0.08
1000	5	0.00	0.06	0.00	0.01	−0.02	−0.09
		0.50	0.11	−0.01	0.01	0.00	−0.08
		0.75	0.10	−0.01	0.01	0.00	−0.02
	25	0.00	−0.03	−0.01	0.02	−0.01	0.02
		0.50	−0.03	0.00	0.00	0.01	0.06
		0.75	−0.05	0.01	−0.01	0.03	0.05
	50	0.00	−0.02	0.00	0.00	−0.01	0.00
		0.50	−0.02	−0.01	0.00	0.02	0.02
		0.75	−0.02	0.01	0.03	0.04	0.07

Bias in item parameter estimates of the GRM and DC-GRM per level of number of items and of skewness (0, 0.50, and 0.75) for a sample size of 500. GRM: graded response model; DC-GRM: Davidian curve graded response model. Bias in latent trait estimates for the GRM and DC-GRM per level of number of items and skewness for a sample size of 500. GRM: graded response model; DC-GRM: Davidian curve graded response model.

Discussion

In the current study, two IRT models for polytomous items designed to deal with two different processes leading to non-normally distributed latent traits were evaluated. The ZIM-GRM was used to deal with zero-inflation, and the DC-GRM was used to deal with skewness. Both methods were compared with the standard GRM in a simulation study with conditions representative of the settings, wherein IRT is commonly used to develop item banks for health outcomes. The impact of the simulation conditions was assessed for both item parameter estimates and latent trait estimates. In the case of zero-inflated latent trait distributions, the GRM showed substantial bias, with larger bias for distributions with a higher rate of zero-inflation, and for larger number of items, whereas the ZIM-GRM showed very little bias. The GRM overestimated the a parameter to a high extent and resulted in b parameters that were too high, with the exception of the highest threshold, which was somewhat underestimated; moreover, the estimates of the respective thresholds were too close to one another. The latent trait estimates of the GRM showed a similar pattern, largely overestimating low latent trait values, and somewhat underestimating high values. Although the conditions and outcomes are not directly comparable, some of the current outcomes were in line with those of the simulation study by Wall et al.[23] for binary items. First, in the binary case, zero-inflation was also associated with overestimating the discrimination parameter. Second, the pattern within polytomous items (of showing an overestimation of the center and a decrease in the spread of the scale) was also present over the collection of binary items in that thresholds were overestimated on average with more overestimation for low thresholds than underestimation for high thresholds. In the case of skewed latent trait distributions, whereas the GRM mostly showed no bias, the DC-GRM showed a small bias with both high skewness and large item sets. In other words, the GRM outperformed its extension, the DC-GRM, which was included in the study to deal with skewness. This is not in line with previous research which suggested that ignoring skewness causes bias, although the size of bias varies across studies.[15,19-22] This discrepancy may have emerged from the method used for creating skewed distributions. Whereas we used the method by Fleishman[39] which resulted in distributions with zero kurtosis, the studies by Woods[15] and Woods and Lin,[27] for example, used mixtures of normals, which resulted in substantial kurtosis. Evidently, a distribution with skewness only is more easily approximated by a normal distribution than one with both skewness and kurtosis. The plausibility of this explanation is supported by the current outcomes being more similar to those of a study using a more similar method for generating skewness[22] than to those of a study using a different method for generating skewness.[15] The somewhat biased results of the DC-GRM may have resulted from the way it was implemented here. Inspection of the number of smoothing parameters (see Table 5) shows that the highest values were selected in the conditions with the highest deviations in parameter estimates, which suggests that the method may have suffered from overfitting.[45] The choice for using HQC for model selection was based on the recommendations from the simulation study that examined DC-IRT for dichotomous items,[27] but possibly this index is not optimal in the case of polytomous items; it is advised that future research compares the index with more conservative indices like the Bayesian information criterion. In their classic article on quasi-traits in clinical measurement, Reise and Waller[46] suggested that high discrimination parameter estimates originated from conceptually narrow constructs and consequently homogeneous item content leading to highly correlated items. For unipolar and skewed constructs, they are correct that improper latent trait shapes are selected and therefore that the resulting parameters may be on a suboptimal scale. However, concerning the bias that arises when using the GRM in case of zero-inflation, it may be claimed that the outcome is both good and bad. It is bad because in reality the items do not discriminate as much among respondents as the parameter estimates suggest. By contrast, it is good because the GRM does not change the ordering of the respondents (see Figure 3, which shows a linear transformation), and because it becomes very efficient at discriminating between the clinical and non-clinical respondents. Thus, the mixing of samples results in a “population discrimination” item-level artifact rather than the originally desired “person discrimination” goal of the IRT measurement model. For future studies on zero-inflation, it may be instructive to discuss the impact of the degenerate part of ZIM-IRT on simulation results. In the current and previous evaluations[23] of ZIM-IRT, the model specification was perfectly in line with the data generating process, both using a mean of zero and a standard deviation of one for the non-degenerate component, and a mean equal to an extreme negative value and a variance of zero in the degenerate component. However, when a model is specified in any other way, the scale of estimated item and person parameters will, by definition, be different than the data generating scale, resulting in artificially biased outcomes. For example, when using DC-IRT for estimating the zero-inflated distribution, the estimation algorithm tries to pick up its shape, including both mixtures in a single smooth curve while constraining the distribution to have a mean of zero and a standard deviation of one, which will lead to failure because it is mathematically impossible to include both the zero-inflated part and the proper scale for non-zero-inflated part in a single distribution.[1] In such cases, since the scales are incompatible, a solution would be to use the correlation between true and estimated parameters as outcome, instead of exploring bias or RMSE. In the current study, we focused on methods available in software, which is possibly only a small selection of the methods that have been proposed to deal with non-normality. In the case of skewness, several other methods have been developed,[26,47,48] and an examination of their performance in the current conditions may seem interesting, but the negligible bias that was observed for GRM reduces its necessity. In the case of ZIM-IRT, the studied approach can be easily extended to not only deal with zero-inflation but also with maximum-inflation (an excess of patterns containing only the highest item category for each item).[16,23] More recently, similar approaches have been taken for questionnaires that include items eliciting count responses that suffer from heaping (an excess of patterns with multiples of 5 in reporting days of having experienced a specific symptom).[49,50] These methods are characterized by modeling extreme response patterns by specifying one or more classes with a degenerate distribution, which may be difficult to interpret for practitioners, and therefore, it seems appropriate to look at other approaches to mixture modeling as well.[2] It is also noted that other types of IRT models have been developed for explicitly dealing with extreme response behavior,[51,52] and finally, it is acknowledged that methods for dealing with other deviations from normality, such as unipolar traits,[50,53] are available. It is recommended that future research examines the usefulness of such methods in the calibration of item banks for measuring health outcomes. When comparing the parameter estimates of the standard GRM under both zero-inflation and skewness, it was striking that the values of the former were very similar to what is commonly encountered in calibration studies of health outcome item banks. For example, like in most studies,[1,18,37,38] in case of zero-inflation, the standard GRM produced very high estimates of discrimination parameters and threshold estimates that were mainly located at the positive side of the latent trait scale. Given that in many calibration studies, excessive zeros were reported and that the parameter estimates were so similar, it seems plausible that the true data generation process is similar to the one that we used for creating zero-inflation. This would imply that using the standard GRM for calibrating health outcome item banks leads to biased parameter estimates. Overestimation of discrimination parameters leads to standard errors of latent trait estimates that are too optimistic,[54] which in computerized adaptive testing could lead to assessments that stop too early. Moreover, if one blindly interprets the latent trait estimates under the metric of the standard normal distribution (e.g., using percentile ranks), then respondents may seem to score less extreme than they actually do. Therefore, when calibrating item banks for the assessment of health outcomes, it seems like a good practice to consider the possibility of subpopulations. Although calibration samples often consist of commingled populations,[17] such as a clinical and non-clinical population, it is often hard to determine for the individual respondent to what population she belongs, which does not allow for a multi-group approach.[55] However, when using the ZIM-GRM, instead of (potentially incorrectly) assigning each individual to a subpopulation prior to fitting the GRM, the uncertainty is built into the model directly, where individuals are assigned probabilistically rather than deterministically. The appropriateness of the ZIM-GRM compared to the GRM for calibration data can be evaluated using model fit indices.[23]

Table 2.

Bias of ZIM-GRM parameter estimates averaged over the zero-inflated conditions.

			Parameter
Respondents	Items	Zero-inflation	a	b ₁	b ₂	b ₃	b ₄
500	5	0%	0.00	−0.05	−0.02	0.00	0.03
		10%	0.01	−0.05	−0.03	0.00	0.02
		25%	0.01	−0.04	−0.03	0.00	0.02
	25	0%	0.00	−0.02	−0.01	0.01	0.02
		10%	0.01	−0.02	0.00	0.01	0.02
		25%	0.02	−0.02	−0.01	−0.01	0.00
	50	0%	0.01	−0.01	0.00	0.01	0.01
		10%	0.01	−0.01	0.00	0.00	0.01
		25%	0.01	−0.02	−0.01	0.01	0.03
1000	5	0%	0.00	−0.02	0.00	0.00	0.02
		10%	0.01	0.00	0.00	0.00	0.01
		25%	0.00	−0.01	0.00	0.01	0.02
	25	0%	0.01	−0.01	0.00	0.00	0.01
		10%	0.00	−0.01	−0.01	−0.01	0.00
		25%	0.01	−0.02	−0.01	−0.01	0.00
	50	0%	0.01	0.00	0.00	0.00	0.00
		10%	0.00	0.00	0.00	0.00	0.01
		25%	0.00	0.00	0.00	0.01	0.02

Table 6.

Bias of GRM parameter estimates averaged in the skewed data conditions.

			Parameter
Respondents	Items	Skewness	a	b ₁	b ₂	b ₃	b ₄
500	5	0.00	0.01	−0.03	−0.01	0.00	0.02
		0.50	0.01	−0.04	0.00	0.01	0.01
		0.75	0.00	−0.03	0.01	0.02	0.01
	25	0.00	0.00	−0.03	−0.01	0.00	0.02
		0.50	0.00	−0.02	0.01	0.02	0.03
		0.75	0.00	−0.03	0.00	0.01	0.02
	50	0.00	0.00	−0.02	−0.01	0.00	0.01
		0.50	0.00	−0.02	0.00	0.00	0.01
		0.75	0.00	−0.01	0.01	0.02	0.03
1000	5	0.00	0.01	−0.01	0.00	0.00	0.01
		0.50	0.00	−0.01	0.01	0.02	0.02
		0.75	−0.01	−0.03	0.01	0.03	0.02
	25	0.00	0.01	−0.01	−0.01	0.00	0.00
		0.50	0.00	−0.01	0.00	0.00	0.00
		0.75	0.01	−0.01	0.00	0.00	0.00
	50	0.00	0.00	−0.01	−0.01	−0.01	0.00
		0.50	0.00	−0.01	0.00	0.01	0.01
		0.75	0.00	−0.01	0.00	0.01	0.01

Table 7.

Bias of DC-GRM parameter estimates averaged in the skewed data conditions.

			Parameter
Respondents	Items	Skewness	a	b ₁	b ₂	b ₃	b ₄
500	5	0.00	0.01	−0.03	−0.01	0.00	0.02
		0.50	0.02	−0.03	0.00	0.00	0.00
		0.75	0.01	−0.01	0.00	0.00	0.00
	25	0.00	0.00	−0.03	−0.01	0.00	0.02
		0.50	0.00	−0.02	0.01	0.02	0.04
		0.75	−0.01	−0.04	0.00	0.03	0.06
	50	0.00	0.00	−0.01	0.00	0.01	0.01
		0.50	−0.01	−0.02	−0.01	0.01	0.02
		0.75	−0.02	−0.03	0.01	0.05	0.08
1000	5	0.00	0.01	0.00	0.00	0.00	0.01
		0.50	0.01	0.00	0.01	0.01	0.01
		0.75	0.01	−0.01	0.01	0.01	0.01
	25	0.00	0.01	−0.01	−0.01	0.00	0.00
		0.50	0.00	−0.01	0.00	0.01	0.01
		0.75	−0.01	−0.02	0.00	0.01	0.03
	50	0.00	0.00	−0.01	−0.01	0.00	0.00
		0.50	−0.01	−0.01	0.00	0.01	0.02
		0.75	−0.02	−0.02	0.01	0.03	0.05

26 in total

1. Linear mixed models with flexible distributions of random effects for longitudinal data.

Authors: D Zhang; M Davidian
Journal: Biometrics Date: 2001-09 Impact factor: 2.571

2. Mixture Factor Analysis for Approximating a Nonnormally Distributed Continuous Latent Factor With Continuous and Dichotomous Observed Variables.

Authors: Melanie M Wall; Jia Guo; Yasuo Amemiya
Journal: Multivariate Behav Res Date: 2012-03-30 Impact factor: 5.923

3. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005-2008.

Authors: David Cella; William Riley; Arthur Stone; Nan Rothrock; Bryce Reeve; Susan Yount; Dagmar Amtmann; Rita Bode; Daniel Buysse; Seung Choi; Karon Cook; Robert Devellis; Darren DeWalt; James F Fries; Richard Gershon; Elizabeth A Hahn; Jin-Shei Lai; Paul Pilkonis; Dennis Revicki; Matthias Rose; Kevin Weinfurt; Ron Hays
Journal: J Clin Epidemiol Date: 2010-08-04 Impact factor: 6.437

4. Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement.

Authors: Maria Orlando Edelen; Bryce B Reeve
Journal: Qual Life Res Date: 2007-03-21 Impact factor: 4.147

5. A diagnostic procedure to detect departures from local independence in item response theory models.

Authors: Michael C Edwards; Carrie R Houts; Li Cai
Journal: Psychol Methods Date: 2017-04-03

6. A zero- and K-inflated mixture model for health questionnaire data.

Authors: Matthew D Finkelman; Jennifer Greif Green; Michael J Gruber; Alan M Zaslavsky
Journal: Stat Med Date: 2011-03-01 Impact factor: 2.373

7. Impact of IRT item misfit on score estimates and severity classifications: an examination of PROMIS depression and pain interference item banks.

Authors: Yue Zhao
Journal: Qual Life Res Date: 2016-12-01 Impact factor: 4.147

8. Development of a PROMIS item bank to measure pain interference.

Authors: Dagmar Amtmann; Karon F Cook; Mark P Jensen; Wen-Hung Chen; Seung Choi; Dennis Revicki; David Cella; Nan Rothrock; Francis Keefe; Leigh Callahan; Jin-Shei Lai
Journal: Pain Date: 2010-07 Impact factor: 6.961

9. DSM criteria for major depression: evaluating symptom patterns using latent-trait item response models.

Authors: Steven H Aggen; Michael C Neale; Kenneth S Kendler
Journal: Psychol Med Date: 2005-04 Impact factor: 7.723

10. Mixture randomized item-response modeling: a smoking behavior validation study.

Authors: J-P Fox; M Avetisyan; J van der Palen
Journal: Stat Med Date: 2013-06-10 Impact factor: 2.373

5 in total

1. Commentary: Matching IRT Models to PRO Constructs- Modeling Alternatives, and Some Thoughts on What Makes a Model Different.

Authors: Matthias von Davier
Journal: Psychometrika Date: 2021-08-03 Impact factor: 2.500

2. Assessing Dimensionality in Dichotomous Items When Many Subjects Have All-Zero Responses: An Example From Psychiatry and a Solution Using Mixture Models.

Authors: William F Christensen; Melanie M Wall; Irini Moustaki
Journal: Appl Psychol Meas Date: 2022-03-01

3. Design and methodology for an integrative data analysis of coping power: Direct and indirect effects on adolescent suicidality.

Authors: Antonio A Morgan-López; Heather L McDaniel; Catherine P Bradshaw; Lissette M Saavedra; John E Lochman; Chelsea A Kaihoi; Nicole P Powell; Lixin Qu; Anna C Yaros
Journal: Contemp Clin Trials Date: 2022-02-14 Impact factor: 2.261

4. Psychometric properties of the Patient-Reported Outcomes Measurement Information System (PROMIS®) pediatric item bank peer relationships in the Dutch general population.

Authors: Michiel A J Luijten; Raphaële R L van Litsenburg; Caroline B Terwee; Martha A Grootenhuis; Lotte Haverman
Journal: Qual Life Res Date: 2021-02-19 Impact factor: 4.147

5. Psychometric properties of the PROMIS-57 questionnaire, Norwegian version.

Authors: Stein Arne Rimehaug; Aaron J Kaat; Jan Egil Nordvik; Mari Klokkerud; Hilde Stendal Robinson
Journal: Qual Life Res Date: 2021-06-18 Impact factor: 4.147

5 in total