On the Performance of Semi- and Nonparametric Item Response Functions in Computer Adaptive Tests.

Abstract

Large-scale assessments often use a computer adaptive test (CAT) for selection of items and for scoring respondents. Such tests often assume a parametric form for the relationship between item responses and the underlying construct. Although semi- and nonparametric response functions could be used, there is scant research on their performance in a CAT. In this work, we compare parametric response functions versus those estimated using kernel smoothing and a logistic function of a monotonic polynomial. Monotonic polynomial items can be used with traditional CAT item selection algorithms that use analytical derivatives. We compared these approaches in CAT simulations with a variety of item selection algorithms. Our simulations also varied the features of the calibration and item pool: sample size, the presence of missing data, and the percentage of nonstandard items. In general, the results support the use of semi- and nonparametric item response functions in a CAT.

Keywords: computer adaptive test; large-scale testing; monotonic polynomial; nonparametric IRT

Year: 2021 PMID: 34987268 PMCID: PMC8721622 DOI: 10.1177/00131644211014261

Source DB: PubMed Journal: Educ Psychol Meas ISSN: 0013-1644 Impact factor: 2.821

Introduction

Despite Stout (2001) declaring that nonparametric item response theory (IRT) is viable for the scaling of educational and psychological tests, significant barriers remain to the use of these approaches. For example, large-scale assessments are more likely to use models such as the two-parameter logistic (2PL), three-parameter logistic (3PL; Birnbaum, 1968), or generalized partial credit model (Muraki, 1992) when modeling the relationship between a latent trait and item responses. These parametric models are often used despite the fact that nonparametric and flexible IRT approaches typically make fewer restrictive assumptions.

One challenge to the use of more flexible modeling approaches is the desire to use estimated item response functions (IRFs) in a computer adaptive test (CAT). The performance of flexibly estimated response functions when used in a CAT is largely unknown, as many techniques are not easily tractable with existing testing programs. Classical CAT item selection algorithms, such as maximum Fisher information (MFI; Lord, 1980) or maximum posterior weighted information (MPWI; van der Linden, 1998), are still heavily used in operational CAT settings. These techniques typically require derivatives of the IRF with respect to the latent trait, which are not always readily available for flexibly estimated IRFs. Thus, previous applications of CATs to nonparametric IRFs have used alternative item selection algorithms, as otherwise numerical derivatives may be unstable (Y.-P. Chang et al., 2019; Xu & Douglas, 2006).

A novel monotonic polynomial (MP) approach to IRF estimation could be more easily used with traditional item selection algorithms but has not yet been evaluated with a CAT. In brief, the MP approach consists of replacing the linear predictor of parametric IRT models with an MP (Falk & Cai, 2016a, 2016b; Feuerstahler, 2016; Liang, 2007; Liang & Browne, 2015). Although the approach technically has additional parameters and results in more flexibly estimated IRFs like those from nonparametric approaches, these parameters are not easily interpretable. Thus, the MP approach has been called “semiparametric” or “quasi-parametric” (Liang, 2007; Liang & Browne, 2015).

In addition, the MP approach has some distinct practical advantages that we believe merit its further study. In particular, the MP approach can allow calibration of IRFs by maximizing the marginal likelihood or posterior using the expectation-maximization algorithm (EM-MML; Bock & Aitkin, 1981; Mislevy, 1986). This feature is distinct from kernel smoothing (KS; Ramsay, 1991) and smoothed isotonic regression (Lee, 2002, 2007), in which estimation was developed by relying on a proxy of the latent trait that is computed from observed scores. The MP approach is therefore more readily usable in settings where a planned missing data design is used for field testing (i.e., there are missing item responses; Falk, 2019, 2020), in multiple group settings (Falk & Cai, 2016a), and for linking (Feuerstahler, 2019). The MP approach can also be used in conjunction with parametrically modeled items on the same test, which may facilitate a more seamless integration into operational settings. However, simulation studies have also shown that the MP approach is most suitable for large-scale testing, as good estimation typically requires larger sample sizes and many items.

Little prior research has studied the performance of flexible or nonparametric IRFs with a CAT. Xu and Douglas (2006) developed two item selection algorithms for the KS approach based on Kullback–Leibler (K-L; H.-H. Chang & Ying, 1996) information and Shannon entropy (Shannon, 1948). These item selection algorithms avoid the need to compute derivatives of the IRF in order to perform item selection. These authors evaluated the item selection algorithms with a fixed 50-item CAT. Other details of simulations included a 500-item test bank (with true IRFs from a 2PL) calibrated using KS with 1,000 subjects’ complete responses. Although the item selection algorithms with KS were found to perform well, the scope of simulations was limited. In particular, KS with both item selection algorithms was compared only with a “random” item selection algorithm, and comparisons of this approach with traditional IRT models (2PL, 3PL, etc.) were not made. Furthermore, it is unlikely that calibration sample students would be able to complete 500 items each. In later research, Y.-P. Chang et al. (2019) used similar item selection algorithms to evaluate the performance of a nonparametric technique with a cognitive diagnosis model. They found that a nonparametric approach performed better than parametrically estimated IRFs, yet the focus of their study was on smaller sample educational testing contexts.

We therefore have little knowledge of the performance of nonparametric or flexible IRF estimation techniques versus parametric techniques in a CAT under conditions that may be typical of a large-scale test, and no prior research evaluating MP-based models in a CAT. On the one hand, simulations do suggest that flexible IRF estimation techniques such as the MP can improve recovery of the true IRFs, which in turn often leads to better recovery of latent traits (e.g., Falk & Cai, 2016a; Feuerstahler, 2016). However, these previous studies have typically only studied the case where calibration and latent trait estimation are performed on the same sample, and all items on the test are used for all respondents. Use of one calibration sample followed by a CAT using a separate sample is perhaps more similar to cross-validation of the estimated IRFs, and we may not be able to easily extrapolate based on the results of limited previous simulations.

In this article, we present a simulation study that compares the performance of the MP approach, KS, and 2PL in a CAT. We begin by describing each IRF estimation technique, followed by item selection algorithms. Then, we present the method and results of a Monte Carlo simulation study. We finally make concluding remarks.

Method

Studied Item Response Function Estimation Techniques

For the purpose of notation for calibration, consider $i = 1, \ldots, N$ respondents who complete some subset of $j = 1, \ldots, n$ items in the item pool. The 2PL is one of the most commonly used item response models for dichotomous items on both educational and psychological tests. The functional form is essentially that of a logistic regression, in which the item response is regressed on the latent construct, $\theta$. For item $j$, one way to write the 2PL is as follows:

$$P_j(\theta) = \frac{1}{1 + \exp[-(c_j + a_j\theta)]}, \tag{1}$$

where $c_j$ is an intercept and $a_j$ is a slope. Conceptually this function traces the probability of a “correct” (or 1; as opposed to an incorrect or 0) response at different levels of the latent trait, $\theta$.

The basic idea behind the MP version of this dichotomous item model is to replace the term $c_j + a_j\theta$ with a monotonic polynomial, $m_j(\theta)$:

$$P_j(\theta) = \frac{1}{1 + \exp[-m_j(\theta)]}. \tag{2}$$

Here, the MP is a function of $\theta$:

$$m_j(\theta) = b_{0j} + b_{1j}\theta + b_{2j}\theta^2 + \cdots + b_{2k_j+1,j}\,\theta^{2k_j+1}, \tag{3}$$

where $k_j$ is a nonnegative integer for item $j$ that controls the order of the polynomial; with $k_j = 0$, Equation (2) reduces to the 2PL. Due to the form of Equation (2) being a logistic function of $m_j(\theta)$, sometimes this approach is called a “filtered” MP (Feuerstahler, 2016, 2019). Typically coefficients of the polynomial are not directly estimated but are a function of other parameters (e.g., Falk & Cai, 2016a; Feuerstahler, 2016; Liang, 2007). In this way, $m_j(\theta)$ is monotonically increasing in $\theta$ but has additional flexibility beyond the 2PL. For both the 2PL and MP approaches, we obtained parameter estimates via EM-MML (e.g., see Falk & Cai, 2016a) with polynomial order selection for the MP done using simulated annealing (Falk, 2019).

KS (Ramsay, 1991) is one of the most popular nonparametric techniques for IRF estimation, possibly due to its availability in programs such as TESTGRAF (Ramsay, 2000) and now in KernSmoothIRT (Mazza et al., 2014). In the context of dichotomously scored items, KS estimated IRFs at any given point along $\theta$ resemble a weighted sum of the following form:

$$\hat{P}_j(\theta) = \sum_{i=1}^{N} w_i(\theta)\, y_{ij}, \tag{4}$$

where $w_i(\theta)$ are weights, $y_{ij}$ is examinee $i$’s response to item $j$, assuming complete data, and $\tilde{\theta}_i$, on which the weights depend, is typically a surrogate estimate of the examinee’s score on the latent trait. Values for $\tilde{\theta}_i$ are often chosen based on a (weighted) sum score of all item responses for respondent $i$. Weights are then often Nadaraya–Watson (Nadaraya, 1964; Watson, 1964) weights as follows,

$$w_i(\theta) = \frac{K\left[(\theta - \tilde{\theta}_i)/h\right]}{\sum_{i'=1}^{N} K\left[(\theta - \tilde{\theta}_{i'})/h\right]}, \tag{5}$$

where $K(\cdot)$ is a kernel function and $h$ is a bandwidth. In addition, although Equation (4) appears to be continuous along $\theta$, typically a discrete set of evaluation points along $\theta$ is determined, and this may have an impact on scoring and any additional calculations that are performed as part of a CAT. Thus, the grid that is chosen for KS can have implications for how any follow-up scoring or CAT is performed, as an item bank would typically need to store such information and would not be able to recompute $\hat{P}_j(\theta)$ at new points on the fly from the original calibration data.
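The three response functions described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the software used in the study: the function names, the Gaussian kernel, and the direct use of polynomial coefficients (instead of the reparameterization that guarantees monotonicity during estimation) are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def irf_2pl(theta, c, a):
    """2PL: logistic function of the linear predictor c + a * theta."""
    theta = np.asarray(theta, dtype=float)
    return 1.0 / (1.0 + np.exp(-(c + a * theta)))

def irf_mp(theta, b):
    """Filtered MP: logistic function of a polynomial in theta.

    b holds coefficients in ascending order (b[0] + b[1]*theta + ...).
    The coefficients are assumed to describe a monotonically increasing
    polynomial; this sketch does not enforce that constraint.
    """
    theta = np.asarray(theta, dtype=float)
    return 1.0 / (1.0 + np.exp(-npoly.polyval(theta, b)))

def irf_ks(theta, y, theta_proxy, h):
    """Kernel-smoothed IRF with Nadaraya-Watson weights (Gaussian kernel).

    y           : 0/1 responses of N examinees to one item (complete data)
    theta_proxy : surrogate trait values for the same N examinees
    h           : bandwidth
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    z = (theta[:, None] - np.asarray(theta_proxy)[None, :]) / h
    k = np.exp(-0.5 * z**2)
    w = k / k.sum(axis=1, keepdims=True)  # weights sum to 1 at each theta
    return w @ np.asarray(y, dtype=float)

# Hypothetical usage: a cubic MP (k = 1) whose derivative 1 + 1.5*theta^2 is positive.
grid = np.linspace(-2.0, 2.0, 9)
p_mp = irf_mp(grid, [0.0, 1.0, 0.0, 0.5])
```

With `b = [c, a]` the filtered MP coincides with the 2PL, which is why the two models can share item selection code.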

Studied Item Selection Algorithms

In the present article, we consider a fixed-length, item-by-item CAT with $j = 1, \ldots, n$ indexing items in the item pool, $t = 1, \ldots, T$ indexing iterations of the CAT algorithm, and therefore $j_t$ serving as the index of the item administered at iteration $t$. Then, $y_{ij_1}, \ldots, y_{ij_{t-1}}$ are the item responses to administered items prior to iteration $t$. Let $\hat{\theta}_i^{(t-1)}$ be the score estimate for examinee $i$ at the beginning of iteration $t$ and $\hat{\theta}_i^{(t)}$ be the updated score estimate after administering item $j_t$. At iteration $t$, item selection algorithms try to choose the next item that will most improve $\hat{\theta}_i^{(t)}$.

To briefly cover the logic behind MFI (Lord, 1980) and MPWI (van der Linden, 1998), suppose that the true latent trait score for examinee $i$, $\theta_i$, were known and we wished to administer the item(s) that would most reduce the sampling variability of their score estimate, $\hat{\theta}_i$, under maximum likelihood or expected a posteriori (EAP; Bock & Mislevy, 1982) scoring, for example. Items that have the highest Fisher information at $\theta_i$ would be optimal to choose. Assuming item parameters implied by a parametric model or the exact shape of the true IRFs were known, Fisher information for dichotomously scored item $j$ is often written as

$$I_j(\theta) = \frac{[P_j'(\theta)]^2}{P_j(\theta)\,Q_j(\theta)}, \tag{6}$$

where $Q_j(\theta) = 1 - P_j(\theta)$ and $P_j'(\theta) = \partial P_j(\theta)/\partial\theta$. Note that this expression is generally given when defining expected Fisher information, yet for exponential family item response models such as those in Equations (1) and (2) it is equivalent to observed information. Note also that this requires partial derivatives of $P_j(\theta)$ with respect to $\theta$, which are easily obtained for the 2PL and MP approaches but not in general with KS. As $\theta_i$ is unknown, MFI uses the current interim estimate $\hat{\theta}_i^{(t-1)}$ as a substitute for $\theta$ in Equation (6) and then the top item is chosen. Due in part to instability in $\hat{\theta}_i^{(t-1)}$, especially in the early stages of a CAT, the item chosen under MFI may not be optimal for the examinee.

MPWI quantifies uncertainty in $\theta_i$ by computing information for each item integrating across the posterior distribution for $\theta_i$. Let $g(\theta)$ be the prior density and $L(\theta \mid y_{ij_1}, \ldots, y_{ij_{t-1}})$ be the likelihood for examinee $i$ at the start of iteration $t$. Then, the posterior weighted information is defined as follows (see also Magis & Raiche, 2012):

$$\mathrm{PWI}_j = \int I_j(\theta)\, g(\theta)\, L(\theta \mid y_{ij_1}, \ldots, y_{ij_{t-1}})\, d\theta,$$

where the integral may be computed using numerical methods (e.g., rectangular quadrature). MPWI then proceeds at any given iteration by computing $\mathrm{PWI}_j$ for all items under the current posterior, and choosing the item with the largest value.

Finally, K-L information (H.-H. Chang & Ying, 1996) resembles the likelihood ratio between the unknown but true value $\theta_i$, and some other value, $\theta$:

$$\mathrm{KL}_j(\theta \,\|\, \theta_i) = E\left[\log\frac{L(\theta_i \mid Y_j)}{L(\theta \mid Y_j)}\right] = P_j(\theta_i)\log\frac{P_j(\theta_i)}{P_j(\theta)} + Q_j(\theta_i)\log\frac{Q_j(\theta_i)}{Q_j(\theta)}, \tag{9}$$

with the expectation over possible responses for the item (values for the random variable $Y_j$), which in this case is $\{0, 1\}$. Conceptually, K-L determines which newly administered item would allow us to tell the difference between $\theta_i$ and some other value on the latent scale using a likelihood ratio. While MFI and MPWI computations for each item rely on some estimate or posterior for $\theta_i$, computation of K-L in Equation (9) relies on a stand-in for both $\theta_i$ and some other possible value(s) for $\theta$. To achieve this, K-L information is computed using the current interim estimate for $\theta_i$, and a range of values for $\theta$ in the vicinity of $\hat{\theta}_i^{(t-1)}$ are considered via an integral:

$$\mathrm{KL}_j\left(\hat{\theta}_i^{(t-1)}\right) = \int_{\hat{\theta}_i^{(t-1)} - \delta_t}^{\hat{\theta}_i^{(t-1)} + \delta_t} \mathrm{KL}_j\left(\theta \,\|\, \hat{\theta}_i^{(t-1)}\right) d\theta, \tag{10}$$

with a typical choice $\delta_t = c/\sqrt{t}$. In simulations, a fixed value of $c$ is often chosen (e.g., H.-H. Chang & Ying, 1996), though $c$ could be chosen to determine the range of integration corresponding to a desired coverage probability for $\theta_i$ based on an interval around $\hat{\theta}_i^{(t-1)}$. The term $\delta_t$ thus controls the range of integration and is wide in the early stages of the CAT, to effectively obtain a “global” index of information. At later stages of the CAT (i.e., a larger value for $t$), the range of integration decreases and Equation (10) would become more local, properly reflecting better information regarding the latent trait estimate.

Note that computation of Equation (10) does not require any derivative computations, only the ability to evaluate $P_j(\theta)$ at particular values of $\theta$. Thus, it is well-suited for use also with KS (in addition to 2PL and MP) approaches to IRF estimation. However, evaluation of the integral may require numerical integration, which still requires choosing quadrature nodes or grid points along $\theta$. These nodes will typically not match the exact evaluation points under KS obtained during calibration. Thus, to obtain the values of the IRF under KS at each quadrature node to compute Equation (10), we used linear interpolation based on the already estimated IRF. We suppose this is the most likely situation if KS were to be used in an item bank.
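To make the three selection criteria concrete, the sketch below computes each of them for a single item. It is an illustration under stated assumptions, not the study's code: items are represented as a logistic function of a polynomial with ascending coefficients `b` (so `b = [c, a]` is a 2PL item), for which the chain rule gives $P'(\theta) = m'(\theta)P(\theta)Q(\theta)$ and hence Fisher information $m'(\theta)^2 P(\theta) Q(\theta)$. The default `c=3.0`, the number of quadrature nodes, and all function names are hypothetical choices.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def irf(theta, b):
    """Logistic function of a polynomial with ascending coefficients b."""
    return 1.0 / (1.0 + np.exp(-npoly.polyval(theta, b)))

def fisher_info(theta, b):
    """Fisher information via the analytic derivative: m'(theta)^2 * P * Q."""
    p = irf(theta, b)
    dm = npoly.polyval(theta, npoly.polyder(b))
    return dm**2 * p * (1.0 - p)

def posterior_weighted_info(b, grid, posterior):
    """Rectangular-quadrature approximation of posterior weighted information.

    posterior holds prior * likelihood values on an equally spaced grid; it
    need not be normalized because only the ranking of items matters.
    """
    step = grid[1] - grid[0]
    return float(np.sum(fisher_info(grid, b) * posterior) * step)

def kl_pointwise(p_fn, theta, theta_hat, eps=1e-10):
    """K-L information between theta_hat and theta for a dichotomous item."""
    p0 = np.clip(p_fn(theta_hat), eps, 1.0 - eps)
    p1 = np.clip(p_fn(theta), eps, 1.0 - eps)
    return p0 * np.log(p0 / p1) + (1.0 - p0) * np.log((1.0 - p0) / (1.0 - p1))

def kl_index(p_fn, theta_hat, t, c=3.0, n_nodes=41):
    """Integrate K-L information over theta_hat +/- c / sqrt(t)."""
    delta = c / np.sqrt(t)
    nodes = np.linspace(theta_hat - delta, theta_hat + delta, n_nodes)
    step = nodes[1] - nodes[0]
    return float(np.sum(kl_pointwise(p_fn, nodes, theta_hat)) * step)

# A KS item stored as grid values can be passed to kl_index through
# linear interpolation; ks_grid and ks_values are hypothetical inputs.
def ks_irf(ks_grid, ks_values):
    return lambda theta: np.interp(theta, ks_grid, ks_values)
```

Only `kl_index` works with an arbitrary callable IRF, which mirrors the point that K-L information needs function evaluations but no derivatives. Note that `np.interp` holds the end values constant outside the stored grid, which is one simple convention among several.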

Study Design and Simulation Details

For clarity, we separate the study design details into two sections. First, we describe how calibration data were simulated and how models were estimated. On the basis of IRF estimation from calibration data, we describe how CAT simulations were then conducted.

Calibration and Data Generating Models

To attempt to construct realistic field testing conditions, we generated calibration sample data under two broad conditions: complete data and missing data. Within each of these two conditions, we also varied sample size and the percentage of items that had an IRF that departed drastically from the traditional 2PL model (nonstandard items). This resulted in a total of eight different data generating conditions for the calibration samples.

For the complete data conditions, all respondents completed all items. The number of items was fixed at $n = 100$, as we expect this to be approximately the largest number of items that respondents may reasonably complete, especially if such items represent educational test items. The number of respondents (N = 1,000 or N = 3,000) was fully crossed with the percentage of nonstandard items (30% or 70%).

For the missing data conditions, each respondent completed only a random subset of 40 items, which is similar to some recent large-scale educational tests (e.g., Smarter Balanced Assessment Consortium, 2017). Since the number of items under this condition was fixed at $n = 200$, this represents 80% missing data. To compensate for missing data, we would expect research teams under similar testing conditions to utilize more respondents. Thus, the sample size conditions were larger. Sample size (N = 5,000 or N = 10,000) was again crossed with the percentage of nonstandard items (30% or 70%).

All items were dichotomous, and standard items were generated using a normal cumulative distribution function (CDF) as the IRF, $P_j(\theta) = \Phi[(\theta - \mu_j)/\sigma_j]$, with $\mu_j$ and $\sigma_j$ drawn randomly across items. Although this does not strictly follow the 2PL, it may be reconceived as following a normal ogive model for which the 2PL provides a very close approximation. Nonstandard items were generated with the following mixture of normal CDFs: $P_j(\theta) = \sum_k \pi_{jk}\,\Phi[(\theta - \mu_{jk})/\sigma_{jk}]$. Proportions $\pi_{jk}$ were generated randomly across items, as were standard deviations $\sigma_{jk}$, with any values less than .2 winsorized at .2. To provide variation in overall difficulty, the means of the CDFs, $\mu_{jk}$, were pieced together from an item-level parameter that provided some overall control of difficulty and remaining parameters that controlled the centers of the individual CDFs. For each data generating condition, a single calibration sample was generated, with a standard normal distribution assumed for $\theta$. Although it may be preferable to have several calibration samples per data generation condition, and then repeatedly conduct CAT simulations with each calibration sample, such an approach is still very computationally intensive if conducting thousands of CAT simulations for each calibration sample. We therefore decided against this approach and comment further on this issue in the Discussion section.

To each calibration sample, several models were fit to obtain IRF estimates for later use in CAT simulations. The exact models depended on the data generating conditions. The KS approach using KernSmoothIRT (Mazza et al., 2014) with default settings was utilized for all complete data generating conditions. As noted, this method is possible but not ideal when there are missing item responses and so was used with only complete data conditions. To obtain $\tilde{\theta}_i$ for smoothing, the default behavior is to compute sum scores (equal weight for each item), obtain percentile ranks (with ties broken in order of appearance), and then use a normalizing transformation using quantiles of a standard normal distribution. Although multiple choices for a kernel function are possible, a Gaussian kernel is the default and a reasonable choice. The bandwidth defaults to the so-called Silverman rule of thumb (Silverman, 1986), which was .266 when N = 1,000 and .214 when N = 3,000. Finally, the evaluation grid defaults to 51 equally spaced points, over a range that differed between the N = 1,000 and N = 3,000 conditions. In addition, a 2PL model was fit to all data sets. Finally, an MP-based model was fit to each data set.

For the MP models, the order of the polynomial was determined through use of simulated annealing as described by Falk (2019) with values of $k_j$ up to a fixed maximum considered, a starting temperature of 5, a logarithmic temperature schedule as described by Stander and Silverman (1994), and the Akaike information criterion as the optimization criterion. The number of iterations was set to 800 for complete data and 1,600 for missing data, since the latter had a larger item bank and may require more iterations to find a good solution. Simulated annealing was used because not all items follow a nonstandard IRF and a relatively large number of items is not easily amenable to a step-wise approach for selecting polynomial order for each item. OpenMx and rpf packages were used for fitting the 2PL and MP approaches with custom R code for simulated annealing (Neale et al., 2016; Pritikin, 2016; R Core Team, 2017). For both 2PL and MP models, EM-MML was used with integrals evaluated by rectangular quadrature with 101 equally spaced nodes along $\theta$, and separate convergence tolerances were set for the M-step and E-step.
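The data-generating steps above can be sketched as follows. This is a hedged illustration: the item parameters in the usage lines are hypothetical placeholders, not the distributions used in the study, and the helper names are invented for the example. The bandwidth helper uses the common form of Silverman's rule, $1.06\,\sigma N^{-1/5}$, which reproduces the reported values of .266 and .214 when $\sigma = 1$.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF evaluated elementwise using the standard library erf."""
    z = (np.asarray(x, dtype=float) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + _erf(z))

def standard_irf(theta, mu, sigma):
    """Standard item: a single normal CDF as the response function."""
    return norm_cdf(theta, mu, sigma)

def mixture_irf(theta, props, mus, sigmas):
    """Nonstandard item: mixture of normal CDFs (props should sum to 1)."""
    return sum(p * norm_cdf(theta, m, s) for p, m, s in zip(props, mus, sigmas))

def simulate_responses(theta, irfs, n_per_person=None, rng=None):
    """Simulate 0/1 responses; optionally keep a random subset per person.

    irfs is a list of callables, one per item. When n_per_person is given,
    every respondent keeps that many randomly chosen items and the rest are
    set to NaN, mimicking a planned missing data design.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.column_stack([f(theta) for f in irfs])  # N x n matrix
    y = (rng.random(probs.shape) < probs).astype(float)
    if n_per_person is not None:
        n_items = probs.shape[1]
        for i in range(probs.shape[0]):
            y[i, rng.permutation(n_items)[n_per_person:]] = np.nan
    return y

def silverman_bandwidth(n, sigma=1.0):
    """Rule-of-thumb bandwidth 1.06 * sigma * n^(-1/5)."""
    return 1.06 * sigma * n ** (-0.2)

print(round(silverman_bandwidth(1000), 3))  # 0.266
print(round(silverman_bandwidth(3000), 3))  # 0.214
```

Administering 40 of 200 items per respondent leaves 160 of 200 cells empty, which is the 80% missingness described above.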

Computer Adaptive Test Simulations

We first describe simulation conditions that were fixed across all CAT simulations, followed by manipulated factors. First, $\theta$ for simulees was generated in one of two ways: from a standard normal distribution and at discrete points along $\theta$ (−2 to 2 in 0.5 increments). Under each of these conditions and each manipulated condition described below, simulees were generated and the true IRFs were used to generate their hypothetical item responses. Such a data generation technique allowed us to tell whether overall some IRF estimation techniques resulted in better recovery of $\theta$ as well as whether there were certain locations along $\theta$ where recovery was better or worse. For all simulees, a fixed CAT length of 25 items was utilized, with interim and final $\theta$ estimates using the EAP scoring method. Depending on the item selection algorithm, a starting $\hat{\theta}$ of 0 (MFI and K-L), or a starting prior of a standard normal distribution (MPWI), was used. To make the simulations as “fair” as possible for KS, the grid points used for both interim EAP scoring (with MFI and K-L) and for representing the posterior distribution under MPWI were chosen as the same 51 grid points used for KS estimation as defined in the Calibration and Data Generating Models section. Programming for CAT simulations was based on custom R code with analytical derivatives for the MP and 2PL provided by rpf.

Manipulated factors for CAT simulations involved (1) the source of IRF estimation, and (2) the item selection algorithm. The IRFs used in the CAT involved up to four conditions: those from the calibration samples (2PL, KS for complete data only, and the MP approach), as well as use of the true IRF. Use of the 2PL allowed us to gauge whether KS or MP have much of an advantage over a parametric model, and use of the true IRF allows a benchmark for the best possible IRF that could be used. Three possible item selection algorithms were crossed with the available IRF estimation techniques (where possible): MFI and MPWI (for the 2PL and MP only) and K-L information.
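The overall flow of one simulated CAT (select an item, generate a response from the true IRF, update the EAP estimate from the estimated IRF) can be sketched as below. This is a simplified illustration using MFI selection only, not the custom R code used in the study. The grid range of −4 to 4, the polynomial item representation, and the names `run_cat`, `true_bank`, and `est_bank` are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def irf(theta, b):
    """Logistic function of a polynomial with ascending coefficients b."""
    return 1.0 / (1.0 + np.exp(-npoly.polyval(theta, b)))

def fisher_info(theta, b):
    """Information for a logistic-of-polynomial item: m'(theta)^2 * P * Q."""
    p = irf(theta, b)
    dm = npoly.polyval(theta, npoly.polyder(b))
    return dm**2 * p * (1.0 - p)

def run_cat(theta_true, true_bank, est_bank, length=25, grid=None, rng=None):
    """Fixed-length CAT with MFI selection and EAP scoring on a grid.

    Responses come from true_bank; selection and scoring use est_bank,
    mirroring calibration on one sample and testing on another.
    """
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(-4.0, 4.0, 51) if grid is None else grid  # assumed range
    posterior = np.exp(-0.5 * grid**2)  # unnormalized standard normal prior
    available = np.ones(len(est_bank), dtype=bool)
    theta_hat, administered = 0.0, []   # starting estimate of 0
    for _ in range(length):
        crit = np.array([fisher_info(theta_hat, b) for b in est_bank])
        j = int(np.argmax(np.where(available, crit, -np.inf)))
        y = rng.random() < irf(theta_true, true_bank[j])
        p_grid = irf(grid, est_bank[j])
        posterior = posterior * (p_grid if y else 1.0 - p_grid)
        theta_hat = float(np.sum(grid * posterior) / np.sum(posterior))
        available[j] = False
        administered.append(j)
    return theta_hat, administered

# Hypothetical usage: a 100-item 2PL bank whose estimates are slightly off.
rng = np.random.default_rng(1)
true_bank = [[rng.normal(), rng.uniform(0.8, 2.0)] for _ in range(100)]
est_bank = [[c + rng.normal(scale=0.1), a] for c, a in true_bank]
theta_hat, items = run_cat(0.5, true_bank, est_bank, rng=rng)
```

Swapping the selection line for a K-L or posterior weighted criterion changes only how `crit` is computed; the response generation and EAP update stay the same.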

Results

In what follows, we first briefly present results pertaining to the IRF recovery for the eight calibration samples. These results are presented to frame understanding of how different IRF estimation techniques may recover true IRFs, which may indirectly affect CAT performance. Following such results, we will turn to the primary results of interest regarding performance of each type of IRF and item selection method in CAT simulations. As the amount of data collected for the study is vast, this represents our best attempt at understanding the pattern of results.

Calibration Results

Recovery of IRFs was assessed using root integrated mean square error (RIMSE), which is computed as a squared discrepancy between the true, $P_j(\theta)$, and estimated IRF, $\hat{P}_j(\theta)$, integrated across the latent distribution with the square root taken of the final quantity. Here the integral was approximated using rectangular quadrature with $Q$ nodes $\theta_q$ between −5 and 5:

$$\mathrm{RIMSE}_j = \sqrt{\frac{\sum_{q=1}^{Q} \left[\hat{P}_j(\theta_q) - P_j(\theta_q)\right]^2 \phi(\theta_q)}{\sum_{q=1}^{Q} \phi(\theta_q)}},$$

where $\phi(\cdot)$ is the standard normal density function. For KS items, the sum was taken over the evaluation points defined by the KS model. RIMSE was computed for each individual item using a standard normal latent trait distribution.

Averaging over items within each cell of the design, MP tended to outperform the 2PL under most conditions (Tables 1 and 2). This was especially true when there was a larger percentage of nonstandard items (70%) or a larger sample size (N = 3,000 with complete data or N = 10,000 under missing data). Otherwise, the MP and 2PL performed similarly, and the 2PL performed better under one condition (N = 5,000, missing data, 30% nonstandard items). Recall that KS was used only in complete data conditions, where it performed worse than or equal to the MP and worse than or equal to the 2PL in all but the N = 3,000, 70% nonstandard item condition.
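As a small illustration of the RIMSE computation, the sketch below approximates the weighted integral with rectangular quadrature and standard normal weights. The default of 101 nodes and the function name are assumptions made for the example; the weights are normalized so that a constant discrepancy of .1 yields an RIMSE of .1.

```python
import numpy as np

def rimse(p_true, p_hat, lo=-5.0, hi=5.0, n_nodes=101):
    """Root integrated mean square error between two response functions.

    p_true and p_hat are callables returning probabilities on an array of
    theta values. The squared discrepancy is weighted by the standard
    normal density at equally spaced nodes between lo and hi.
    """
    nodes = np.linspace(lo, hi, n_nodes)
    w = np.exp(-0.5 * nodes**2)
    w = w / w.sum()  # normalized quadrature weights
    return float(np.sqrt(np.sum(w * (p_hat(nodes) - p_true(nodes)) ** 2)))

# Hypothetical usage: a 2PL-type curve compared with a slightly steeper one.
truth = lambda th: 1.0 / (1.0 + np.exp(-th))
steeper = lambda th: 1.0 / (1.0 + np.exp(-1.3 * th))
value = rimse(truth, steeper)
```

For a KS item, `p_hat` would be evaluated only at the stored evaluation points, so the node grid would be replaced by that set of points.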

Table 1.

Mean RIMSE for Item Banks Under Complete Data Collection Design.

Sample size (N)	Proportion nonstandard (%)	2PL	MP	KS
1,000	30	.07	.07	.10
1,000	70	.11	.10	.11
3,000	30	.04	.03	.04
3,000	70	.09	.04	.05

Note. RIMSE = root integrated mean square error; 2PL = two-parameter logistic; MP = monotonic polynomial; KS = kernel smoothing.

Table 2.

Mean RIMSE for Item Banks Under Missing Data Collection Design.

Sample size (N)	Proportion nonstandard (%)	2PL	MP
5,000	30	.07	.09
5,000	70	.12	.09
10,000	30	.05	.05
10,000	70	.09	.07

Note. RIMSE = root integrated mean square error; 2PL = two-parameter logistic; MP = monotonic polynomial.

Examining the distribution for RIMSE for standard and nonstandard items separately, it was apparent that gains with the MP were due mainly to better estimation for nonstandard items (Figure 1). The MP did not perform better than the 2PL for standard items, and sometimes performed slightly worse. The KS approach performed on par with MP for nonstandard items but clearly worse than the 2PL and MP approaches for standard items.

Figure 1.

Recovery of response functions for standard (std) and nonstandard (nonstd) items for each calibration.

We also compared the recovery of item-level information among the estimated 2PL and MP items. As shown in the Supplemental Material, available online, MP and 2PL items provided equally accurate information, although nonstandard items tended to have worse information accuracy than standard items when fit to both models.

Computer Adaptive Test Simulations

The primary outcome of interest from CAT simulations was recovery of true latent trait scores. For this purpose, we examined root mean square error: $\mathrm{RMSE} = \sqrt{S^{-1}\sum_{s=1}^{S}(\hat{\theta}_s - \theta_s)^2}$, where $\hat{\theta}_s$ is the estimated latent trait score and $\theta_s$ is the true latent trait for simulee $s$. When latent traits were drawn from a standard normal distribution and when complete data were available for calibration, both MP and KS tended to outperform the 2PL approach, regardless of the item selection algorithm (Table 3). Under complete data calibration, for example, the 2PL only led to lower average RMSEs than the MP in a single cell of Table 3: MPWI item selection with N = 1,000 and 70% nonstandard items.
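The RMSE outcome, both overall and conditional on the discrete true trait values used in part of the design, can be computed as in the following sketch. The function names are hypothetical and the inputs are assumed to be arrays of final estimates and true values, one entry per simulee.

```python
import numpy as np

def rmse(theta_hat, theta_true):
    """Root mean square error of estimated versus true latent trait scores."""
    d = np.asarray(theta_hat, dtype=float) - np.asarray(theta_true, dtype=float)
    return float(np.sqrt(np.mean(d**2)))

def rmse_by_point(theta_hat, theta_true):
    """RMSE computed separately at each distinct true trait value."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta_true = np.asarray(theta_true, dtype=float)
    return {float(v): rmse(theta_hat[theta_true == v], v)
            for v in np.unique(theta_true)}

# Hypothetical usage with three simulees at two true trait values.
print(rmse_by_point([0.5, -0.5, 2.0], [0.0, 0.0, 2.0]))  # {0.0: 0.5, 2.0: 0.0}
```

The conditional version corresponds to the simulations at fixed points from −2 to 2, where an overall average could hide regions of the trait in which one method does better or worse.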

Table 3.

Average RMSE of Latent Trait Scores for CAT Simulations Under Complete Data Calibration, Standard Normal Latent Traits.

Item selection	Sample size (N)	Proportion nonstandard (%)	True	2PL	MP	KS
KL	1,000	30	.363	.381	.379	.386
KL	1,000	70	.341	.385	.384	.378
KL	3,000	30	.375	.396	.396	.395
KL	3,000	70	.360	.412	.378	.380
MFI	1,000	30	.359	.383	.377
MFI	1,000	70	.343	.384	.380
MFI	3,000	30	.373	.396	.395
MFI	3,000	70	.361	.412	.361
MPWI	1,000	30	.361	.383	.375
MPWI	1,000	70	.349	.384	.390
MPWI	3,000	30	.375	.398	.395
MPWI	3,000	70	.364	.407	.375

Note. Excluding the true model condition, the best performing method in each row appears in bold. Sample size refers to that used in calibration. RMSE = root mean square error; KL = Kullback–Leibler information; MFI = maximum Fisher information; MPWI = maximum posterior weighted information; True = true model; 2PL = two-parameter logistic; MP = monotonic polynomial; KS = kernel smoothing.

When the calibration data had missing data, the MP and 2PL often led to similar average RMSEs (Table 4). MP always led to equal or lower average RMSEs with MFI item selection, but these patterns were more mixed with KL and MPWI item selection. As an example with KL and MPWI, with 70% nonstandard items and N = 5,000, MP had an average RMSE that was better than the 2PL by .026 and .031, respectively. With 70% nonstandard items and N = 10,000, the 2PL was better but by a much smaller amount (.003 and .004). In summary, based on examination of average performance with normally distributed latent traits, use of MP calibrated items performed as well as or better than the 2PL and similarly to KS.

Table 4.

Average RMSE of Latent Trait Scores for CAT Simulations Under Missing Data Calibration and Standard Normal Latent Traits.

Item selection	Sample size (N)	Proportion nonstandard (%)	True	2PL	MP
KL	5,000	30	.331	.370	.374
KL	5,000	70	.286	.361	.335
KL	10,000	30	.327	.352	.350
KL	10,000	70	.311	.347	.350
MFI	5,000	30	.318	.373	.372
MFI	5,000	70	.284	.364	.339
MFI	10,000	30	.327	.351	.348
MFI	10,000	70	.310	.347	.347
MPWI	5,000	30	.331	.369	.367
MPWI	5,000	70	.288	.360	.329
MPWI	10,000	30	.327	.350	.343
MPWI	10,000	70	.312	.346	.350

Note. Excluding the true model condition, the best performing method in each row appears in bold. Sample size refers to that used in calibration. RMSE = root mean square error; KL = Kullback–Leibler information; MFI = maximum Fisher information; MPWI = maximum posterior weighted information; True = true model; 2PL = two-parameter logistic; MP = monotonic polynomial; KS = kernel smoothing.

Turning to RMSE at discrete points along $\theta$, we concentrate primarily on results using KL item selection as this allows comparison with KS estimated IRFs (Figures 2 and 3). Based on such results, it is clear that any performance advantage of one method of IRF estimation versus another is not necessarily consistent across the entire range of the latent trait. For example, the MP had an advantage over the 2PL for much of the latent trait for conditions with a larger proportion of nonstandard items (70%), yet in some regions of the latent trait these differences were small or the 2PL outperformed the MP. Differences among approaches with 30% nonstandard items were more difficult to visualize, thus indicating similar performance, except perhaps for KS performing slightly worse than other methods in the middle of the distribution at one of the two complete data sample sizes. It is also possible that this pattern of results may vary across calibrated item banks.

Figure 2.

RMSE of latent trait scores at discrete points along $\theta$ with KL item selection, complete data calibration.

Note. KL = Kullback–Leibler information; RMSE = root mean square error; non-std = nonstandard.

Figure 3.

RMSE of latent trait scores at discrete points along $\theta$ with KL item selection, missing data calibration.

Note. KL = Kullback–Leibler information; RMSE = root mean square error; non-std = nonstandard.

Discussion

The present simulation study examined the performance of MP and KS IRF estimation techniques for use with a CAT and compared them with a standard 2PL approach using item selection techniques based on KL information, MFI, and MPWI. Our results demonstrate that the MP and KS approaches lead to latent trait recovery that is comparable to or better than that of the 2PL. Despite the promise of the MP and KS approaches, it is difficult to pinpoint exact conditions under which such approaches are universally preferable to standard approaches such as the 2PL. In retrospect, different IRFs can still result in very similar latent trait estimates (e.g., Yen, 1981). More substantial departures from the 2PL, in the form of more extreme IRFs (including nonmonotonic ones), may need to be present in the item bank for the various methods to differ much in performance. Our manipulated conditions mainly varied features of the calibration sample to mimic conditions that might be used for a large test (item banks of 100 or 200, and complete or planned missing data collection). While prior simulation studies have found that MP and KS can on average recover IRFs better than the 2PL when nonstandard items are in an item bank (e.g., Falk & Cai, 2016a; Feuerstahler, 2016), in any given calibration sample such gains may not clearly materialize or may not lead to subsequent gains in scoring or CAT performance. With some exceptions, the MP tended to perform better with larger calibration samples and with a larger proportion of nonstandard items. Future research could focus on features of the CAT itself (test length, stopping criteria) that may affect performance, as well as on identifying the conditions under which flexible IRF estimation provides a clear advantage over the standard approaches.

Our study utilized a realistic calibration phase prior to conducting CAT simulations. We thought this necessary for studying the relative performance of IRF estimation techniques in a CAT. However, this approach also makes it somewhat difficult to know whether the relative performance observed in any given cell of the CAT simulation design was due in part to random sampling fluctuation from doing only a single calibration per cell. Although this issue could be addressed by doing multiple calibrations per cell and then multiple CAT simulations, such analyses demand a large amount of computational time and space. In addition, given the size of the item banks (100 and 200 items, depending on the condition) relative to the length of the CAT (25 items), we expected that doing only a single calibration would still be informative.

To our knowledge, this study was also the first to utilize a flexible IRF estimation technique (the MP) in conjunction with item selection algorithms that require analytical derivatives and are often used in operational settings (MFI and MPWI). We found little difference among item selection algorithms, though we did not present any prior theory to favor any particular technique. This result holds promise for operational programs that may consider a nonparametric or semiparametric approach to IRF estimation but may prefer to use familiar, derivative-based item selection algorithms or would prefer to implement changes in stages to better ensure quality control.

In closing, we believe that the MP approach may be particularly well suited to applications in CAT because it allows for both flexibly estimated IRFs and analytic derivatives. The initial results presented in this article indicate that the MP approach estimates latent traits as well as or better than a standard approach when a substantial proportion of nonstandard items exists and in the context of a planned missing data field test design.
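
The role of analytic derivatives in derivative-based item selection can be illustrated with a minimal sketch. It assumes a dichotomous item whose IRF is a logistic function of a polynomial m(θ) given directly by its coefficients, and it assumes those coefficients already define a monotonic polynomial (the constraint is not enforced here). The function names, the three-item bank, and its coefficient values are hypothetical and chosen only for illustration; this is not the estimation or CAT code used in the study.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly


def mp_irf(theta, coefs):
    """P(theta) = logistic(m(theta)), m(theta) = c0 + c1*theta + c2*theta**2 + ..."""
    m = npoly.polyval(theta, coefs)
    return 1.0 / (1.0 + np.exp(-m))


def mp_information(theta, coefs):
    """Fisher information of a dichotomous item via the analytic derivative.

    Because P'(theta) = m'(theta) * P * (1 - P), the usual expression
    P'(theta)**2 / (P * (1 - P)) reduces to m'(theta)**2 * P * (1 - P).
    """
    p = mp_irf(theta, coefs)
    dm = npoly.polyval(theta, npoly.polyder(coefs))  # m'(theta), exact
    return dm ** 2 * p * (1.0 - p)


def select_item_mfi(theta_hat, bank, administered):
    """Index of the not-yet-administered item with maximum information at theta_hat."""
    candidates = [j for j in range(len(bank)) if j not in administered]
    return max(candidates, key=lambda j: mp_information(theta_hat, bank[j]))


# Hypothetical bank: two linear (2PL-like) items and one cubic item.
# The cubic has m'(theta) = 0.5 + 0.9 * theta**2 > 0, so it is monotonic.
bank = [
    [0.0, 1.0],
    [0.0, 2.0],
    [0.0, 0.5, 0.0, 0.3],
]

if __name__ == "__main__":
    for theta_hat in (0.0, 2.0):
        print(theta_hat, select_item_mfi(theta_hat, bank, administered=set()))
```

With a linear m(θ) the information function reduces to the familiar 2PL form, so the same selection routine covers both standard and MP items; only the polynomial degree changes.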
Supplemental material, sj-pdf-1-epm-10.1177_00131644211014261 for On the Performance of Semi- and Nonparametric Item Response Functions in Computer Adaptive Tests by Carl F. Falk and Leah M. Feuerstahler in Educational and Psychological Measurement
