
A Multilevel Mixture IRT Framework for Modeling Response Times as Predictors or Indicators of Response Engagement in IRT Models.

Abstract

Disengaged item responses pose a threat to the validity of the results provided by large-scale assessments. Several procedures for identifying disengaged responses on the basis of observed response times have been suggested, and item response theory (IRT) models for response engagement have been proposed. We outline that response time-based procedures for classifying response engagement and IRT models for response engagement are based on common ideas, and we propose the distinction between independent and dependent latent class IRT models. In all IRT models considered, response engagement is represented by an item-level latent class variable, but the models assume that response times either reflect or predict engagement. We summarize existing IRT models that belong to each group and extend them to increase their flexibility. Furthermore, we propose a flexible multilevel mixture IRT framework in which all IRT models can be estimated by means of marginal maximum likelihood. The framework is based on the widespread Mplus software, thereby making the procedure accessible to a broad audience. The procedures are illustrated on the basis of publicly available large-scale data. Our results show that the different IRT models for response engagement provided slightly different adjustments of item parameters and individuals' proficiency estimates relative to a conventional IRT model.


Keywords: item response theory; multilevel mixture modeling; response engagement; response time

Year: 2021 PMID: 35989730 PMCID: PMC9386881 DOI: 10.1177/00131644211045351

Source DB: PubMed Journal: Educ Psychol Meas ISSN: 0013-1644 Impact factor: 3.088

Introduction

Large-scale assessments of achievement attempt to assess what individuals know and can do. In order to provide unbiased results, individuals should ideally answer all items with full engagement (Wise & Smith, 2016). However, this requirement is often not fulfilled, especially in low-stakes assessments that have no consequences for individuals (List et al., 2017). Disengagement is considered a threat to the validity of results, because it may lead to lowered proficiency estimates (Wise, 2015) and biased item parameter estimates (Schnipke & Scrams, 1997). Since the advent of computerized testing, item response times are most often used to identify disengaged responses. In recent years, a number of item response theory (IRT) models for response engagement that augment item responses by response times have been proposed (Pokropek, 2016; Ulitzsch et al., 2020; Wang & Xu, 2015; Wang et al., 2018). These models have the desirable properties of not requiring explicit classifications of engaged and disengaged responses, of reducing bias in item parameter estimates without discarding responses, and of providing estimates of proficiencies that are adjusted for disengagement. The IRT models build on different assumptions and might therefore give different results. However, in real applications, decisions for one specific model are complicated by the fact that many models have not been implemented in widely accessible statistical software. This limits the applicability of IRT models in practice, although many large-scale programs have made item-level data including response times available (e.g., Goldhammer et al., 2016; OECD, 2017). 
The aims of the present article are: (1) to summarize the assumptions underlying the classification of response engagement, (2) to outline how these assumptions are embedded in existing IRT models for response engagement, (3) to present some extensions of these models, and (4) to introduce a flexible multilevel mixture IRT (MM-IRT) framework in which IRT models for response engagement can be specified and estimated. The MM-IRT framework builds upon the widespread Mplus program (Muthén & Muthén, 1998), thereby making IRT models for response engagement accessible to applied researchers. All models can be estimated by marginal maximum likelihood (MML) with the expectation maximization algorithm (EM; Bock & Lieberman, 1970), including those models that have previously been estimated only in a Bayesian framework (e.g., Ulitzsch et al., 2020; Wang et al., 2018). Although Bayesian estimation has many advantages, it comes at the price of long estimation times. This renders Bayesian estimation impractical in the case of large sample sizes, which are typical for large-scale assessments. In contrast, MML estimation is mainly limited by the number of dimensions that need to be numerically integrated, and it can be applied in fairly large samples.
The remainder of this article is structured as follows. In the next section, we outline the rationales underlying traditional procedures for classifying response engagement. We then describe the IRT models for response engagement that have been suggested in the literature, discuss their assumptions, and provide some extensions that make the models more flexible. Following that, we introduce the MM-IRT framework and show how the models presented are specified in this framework. The framework is illustrated with publicly available data taken from the Programme for the International Assessment of Adult Competencies (PIAAC; http://www.oecd.org/skills/piaac/publicdataandanalysis/).

Indicators of Response Engagement Based on Item Response Times

The assessment of response engagement has predominantly focused on rapid guessing (Wise, 2017) as a manifestation of disengagement (for other aspects see van Barneveld et al., 2013). Responses given in a time that falls below a certain threshold are considered to be too fast to reflect solution behavior (SB). Item responses are classified as being engaged or disengaged by means of the SB-index, which has a value of 1 for individual i’s (i = 1, 2, …, N) response to item j (j = 1, 2, …, J) if i’s response time to item j ($RT_{ij}$) exceeds the time threshold $T_j$ ($SB_{ij} = 1$ if $RT_{ij} > T_j$), and is 0 otherwise ($SB_{ij} = 0$ if $RT_{ij} \le T_j$). The SB-index varies within individuals because individuals may be engaged on some items and disengaged on others. It also varies between individuals because individuals could differ in the proportion of engaged responses (Wise & Kong, 2005). Different methods for setting response time thresholds that are based on different rationales have been suggested. Following common approaches for modeling response times (e.g., van der Linden, 2007), we outline the methods while assuming log-transformed response times throughout this article, to which we refer with $t_{ij}$ (for alternative transformations see De Boeck & Jeon, 2019). A popular method rests on the assumption of a bimodal response time distribution and looks for the lowest density in the region between the modes (e.g., Wise, 2015). We will refer to this method as the visual inspection of bimodality (VIB) method. As shown in Figure 1, the VIB method assumes a mixture distribution due to engaged and disengaged response processes and aims to identify the lowest response time in engaged responses. The conceptual idea underlying the VIB procedure is visualized in Figure 1 by means of a schematic path diagram. Here, $C_{ij}$ denotes an individual i’s latent engagement state on item j that affects her or his response time $t_{ij}$. Statistically, SB is represented as a latent class variable and response times as indicators thereof.
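
The threshold-based classification described above can be illustrated with a minimal Python sketch. The response times, the thresholds, and the function name `sb_index` are hypothetical and chosen for illustration only; they are not taken from the article or from any dataset.

```python
def sb_index(response_times, thresholds):
    """Return 1 where a response time exceeds the item's threshold, else 0.

    response_times: one list per individual, one entry per item (seconds).
    thresholds: one assumed time threshold per item (seconds).
    """
    return [
        [1 if rt > thr else 0 for rt, thr in zip(person, thresholds)]
        for person in response_times
    ]


# Hypothetical raw response times (rows = individuals, columns = items)
rts = [
    [2.1, 35.0, 18.4],
    [25.3, 3.0, 40.2],
]
thresholds = [5.0, 5.0, 8.0]  # assumed item-specific thresholds

sb = sb_index(rts, thresholds)

# The index varies within individuals (across items) and between
# individuals (in the proportion of engaged responses).
engaged_share = [sum(row) / len(row) for row in sb]
```

Because the logarithm is monotone, applying the same rule to log response times with log-transformed thresholds yields the same classification.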

Figure 1.

Upper panels: examples of procedures for setting response time thresholds based on artificial data. Lower panels: graphical depiction of the conceptual ideas underlying the procedures for setting response time thresholds.

A second approach combines response times with item responses (correct vs. incorrect), which are denoted as $y_{ij}$. This method, referred to here as the VICS method, visually inspects the proportion of correct scores conditional on response times. In the VICS method, the threshold is set according to the response time below which the proportion of correct responses is not higher than the chance level of success (Wise & Ma, 2012; for more elaborate versions see Guo et al., 2016; Lee & Jia, 2014). That is, for engaged responses the probability of a correct response is expected to be higher than for disengaged responses. Therefore, as represented by the path diagram in Figure 1, the VICS procedure assumes that item responses are impacted by the latent engagement state, which, in turn, is predicted by response times. A third method assumes that disengaged responses do not reflect proficiency (i.e., they are uninformative) (Wise, 2019). The method uses response times and correlations of item responses with provisional proficiency estimates (denoted as $\hat{\theta}_i$). Thresholds are based on a visual inspection of plots of the item score’s correlation with $\hat{\theta}_i$, so that they refer to the response time at which the correlation exceeds a reference value. We refer to this method as the visual inspection of information (VII) method. An example that uses a reference correlation of 0.20 (Wise, 2019) is given in Figure 1. As shown in the path diagram, the VII method assumes that the latent engagement variable determines the impact of proficiency $\theta_i$ on item responses, whereas response times serve as predictors of the engagement state. Wise (2019) suggested a method in which the VICS and the VII procedures are combined.
Conceptually, in the combined VICS–VII method, success rates and item information are simultaneously considered as outcomes of response engagement, whereas response times serve as predictors of engagement (Figure 1). A different possibility of combining some aspects of the VIB, the VICS, and the VII methods is to consider response times, success rates, and item information as indicators of engagement.

IRT Models for Response Engagement

IRT models for response engagement build upon assumptions that parallel some assumptions of the VICS (the proportion of correct disengaged responses corresponds to the chance level) and the VII procedures (disengaged responses are not related to proficiency). The IRT models considered here rely on one dichotomous latent class variable $C_{ij}$ that is allowed to vary within individuals, which means that individuals could be engaged on some items and disengaged on others. All models have in common that, for engaged responses ($C_{ij}$ = 1), they assume an IRT model in which a response to item j, $y_{ij}$, reflects an individual i’s standing on the proficiency variable $\theta_i$ (for an extension to multidimensional proficiencies see Lu et al., 2020). To this end, different IRT models have been used, including the one-parameter (Ulitzsch et al., 2020), the two-parameter (2PL; Pokropek, 2016), and the three-parameter logistic (Wang & Xu, 2015) IRT model. In the case of disengagement ($C_{ij}$ = 0) it is assumed that responses do not depend on proficiency and that the probability of a correct response equals the chance level of success. In the IRT models, the probability of a correct response is structured as
$$P(y_{ij} = 1 \mid \theta_i, C_{ij}) = C_{ij}\,P(y_{ij} = 1 \mid \theta_i, C_{ij} = 1) + (1 - C_{ij})\,P(y_{ij} = 1 \mid C_{ij} = 0), \tag{1}$$
where $P(y_{ij} = 1 \mid \theta_i, C_{ij} = 1)$, the probability of a correct response in the engaged state, is parameterized according to an IRT model, whereas $P(y_{ij} = 1 \mid C_{ij} = 0)$ is modeled by a single item parameter to reflect the chance probability of success. When the latent class variable is replaced with the SB-index, the structure corresponds to the effort-moderated IRT model of Wise and DeMars (2006), which estimates the IRT parameters and the individuals’ proficiency levels solely on the basis of responses that, through the use of threshold procedures, have been classified as engaged. The IRT models for response engagement enable researchers to bypass the cumbersome procedures of classifying the engagement status of each item response (Figure 1) and make it possible to deal with responses for which the engagement status cannot be univocally classified.
Similar to the procedures for setting response time thresholds, different kinds of IRT models for response engagement have been suggested that treat response times either as indicators or as predictors of the engagement states. The most commonly used IRT models (Ulitzsch et al., 2020; Wang & Xu, 2015; Wang et al., 2018) combine Schnipke and Scrams’s (1997) mixture model for response times with an IRT part and consider both response times and item responses as being influenced by the class variable. Therefore, we will refer to this group of models as independent latent class IRT (ILC-IRT) models. In a second type of mixture IRT models, the within-individual latent class variable is treated as a dependent variable that is predicted by response times (e.g., Pokropek, 2016). We will refer to this group of models as dependent latent class IRT (DLC-IRT) models.
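
The response structure of Equation (1) can be sketched in a few lines of Python. The sketch assumes a 2PL model for the engaged state; the function names and all parameter values (`a`, `b`, `chance`) are hypothetical illustrations, not estimates from the article.

```python
import math


def p_correct_engaged(theta, a, b):
    """2PL probability of a correct response in the engaged state."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def p_correct(theta, a, b, chance, engaged):
    """Success probability given the engagement state (1 = engaged, 0 = disengaged).

    Engaged responses follow the IRT model; disengaged responses succeed
    with the item's chance level, independent of proficiency.
    """
    if engaged == 1:
        return p_correct_engaged(theta, a, b)
    return chance


# Hypothetical item: discrimination 1.0, difficulty 0.0, four response options
p_engaged_avg = p_correct(theta=0.0, a=1.0, b=0.0, chance=0.25, engaged=1)
p_disengaged_high = p_correct(theta=2.0, a=1.0, b=0.0, chance=0.25, engaged=0)
```

In the disengaged state the success probability stays at the chance level even for a highly proficient individual, which is the sense in which disengaged responses are uninformative about proficiency.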

DLC-IRT Models

DLC-IRT models where response times are used as predictors of response engagement have not often been used in practice. An exception is a model suggested by Pokropek (2016). In the following, we review this model and suggest some extensions to increase its flexibility.

DLC-IRT Models with Single-Level Relationships

Pokropek (2016) suggested a DLC-IRT model in which engagement states are predicted by within-individual level response times. Because relationships between response times, response engagement, and item responses are located solely on one level in Pokropek’s (2016) model, we refer to this type of model as a DLC-IRT model with single-level relationships (DLC-SL-IRT). In contrast to Pokropek’s (2016) original model, we use log-transformed predictors, and we start with a slightly different but statistically equivalent parameterization. In this setup, the probability of an engaged response ($C_{ij}$ = 1) given the log response time $t_{ij}$ is expressed as
$$P(C_{ij} = 1 \mid t_{ij}) = \frac{\exp[\alpha (t_{ij} - \tau)]}{1 + \exp[\alpha (t_{ij} - \tau)]}, \tag{2}$$
where $\tau$ is a response time threshold that is common to all items and individuals so that the probability of engaged behavior is 0.5 for log response times equal to $\tau$. In addition, $\alpha$ ($\alpha$ > 0) is a logistic regression weight that can be understood as a process discrimination parameter. The higher the values of $\alpha$, the better response times close to $\tau$ discriminate between engaged and disengaged responses. Pokropek’s (2016) model implies that within an individual i, a response to item j is more likely to be engaged than a response to another item k when $t_{ij} > t_{ik}$. Such comparisons are problematic because, typically, items require different amounts of time, and the prevalence of engaged responses differs between items (e.g., Wise, 2019). These differences can be accounted for by making the thresholds item-specific, so that
$$P(C_{ij} = 1 \mid t_{ij}) = \frac{\exp[\alpha (t_{ij} - \tau_j)]}{1 + \exp[\alpha (t_{ij} - \tau_j)]}. \tag{3}$$
Item-specific thresholds imply that, within an individual i, a response to item j is more likely to be engaged than a response to item k when $t_{ij} - \tau_j > t_{ik} - \tau_k$. The DLC-SL-IRT model is sketched as a schematic path diagram in Figure 2. Response engagement is specified as a within-individual latent class variable $C_{ij}$. The arrow leading from $C_{ij}$ to the item response $y_{ij}$ means that success probabilities differ between engaged and disengaged responses.
The proficiency variable $\theta_i$ is located on the between-individual level, and its impact on the observed item response is moderated by $C_{ij}$ (i.e., disengaged item responses do not reflect proficiency). In the DLC-SL-IRT model, $C_{ij}$ is predicted by response times. The gray-shaded latent variable stands for a node variable that is a function of the item-specific response time and thresholds ($t_{ij} - \tau_j$). This variable is related to $C_{ij}$ by the process discrimination parameter $\alpha$.
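
A minimal Python sketch of the logistic link between log response times and engagement in the DLC-SL-IRT model follows. It assumes the logistic form described above; the function name, the 5-second threshold, and the two discrimination values are hypothetical.

```python
import math


def prob_engaged(log_rt, threshold, alpha):
    """Probability of an engaged response given a log response time.

    threshold: item-specific log response time threshold.
    alpha: process discrimination parameter (assumed > 0).
    """
    return 1.0 / (1.0 + math.exp(-alpha * (log_rt - threshold)))


threshold = math.log(5.0)  # assumed threshold of 5 seconds on the log scale

# At the threshold the probability of engagement is 0.5 by construction.
p_at_threshold = prob_engaged(threshold, threshold, alpha=2.0)

# A larger alpha separates response times near the threshold more sharply.
p_weak = prob_engaged(math.log(7.0), threshold, alpha=2.0)
p_sharp = prob_engaged(math.log(7.0), threshold, alpha=8.0)
```

With item-specific thresholds, what matters for comparing two items within an individual is the distance of each log response time from its own threshold, not the raw ordering of the two response times.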

Figure 2.

Schematic representation of different IRT models for response engagement. The dotted horizontal line separates the within-individual (lower part) from the between-individual (upper part) level of analysis. Gray filled circles represent node variables that are functions of other observed and latent variables. Triangles are used to indicate intercepts and thresholds.

DLC-IRT Models with Two-Level Relationships

The DLC-SL-IRT model implies that for each individual, the proportion of engaged responses can be fully explained by their item-specific log response times. This is because the model does not include between-individual differences in the latent class variable. However, Pokropek (2016) (see also Rios et al., 2017) noted that when between-individual differences that are conditional on response times exist in the proportion of engaged responses and these are correlated with proficiency, proficiency estimates are likely to be biased. From a conceptual point of view, such relationships are plausible. More proficient individuals have been found to need less time to complete an item conscientiously (e.g., Goldhammer et al., 2013), which means that, for such individuals, a response to item j that is given slightly below the response time threshold could still be engaged. Conversely, for individuals of lower proficiency, response times that fall slightly above the response time threshold might be more likely to reflect disengaged behavior. Therefore, response time thresholds could be person-specific and correlated with proficiency. Based on these considerations, we suggest an extension of the DLC-SL-IRT model that includes between-individual differences in response time thresholds. As these components extend the model by between-individual level aspects, we refer to the resulting model as the DLC-IRT model with two-level relationships (DLC-TL-IRT). In the DLC-TL-IRT model, we represent the impact of response times as
$$P(C_{ij} = 1 \mid t_{ij}, \zeta_i) = \frac{\exp[\alpha (t_{ij} - \tau_j - \zeta_i)]}{1 + \exp[\alpha (t_{ij} - \tau_j - \zeta_i)]}, \tag{4}$$
where $\tau_j$ is the response time threshold for item j, and $\zeta_i$ is a normally distributed individual-specific threshold disturbance [$\zeta_i \sim N(0, \sigma^2_{\zeta})$] that accounts for between-individual differences in response time thresholds. Individuals with higher values on $\zeta_i$ have higher than average response thresholds, which means that their responses are more likely to be disengaged than the responses of individuals who respond in the same time but have lower values on $\zeta_i$.
Figure 2 includes a graphical depiction of the DLC-TL-IRT model. The model extends the DLC-SL-IRT model by including the between-individual level threshold disturbance $\zeta_i$. The node variable evaluates to $t_{ij} - \tau_j - \zeta_i$, and the corresponding term is related to the latent class variable $C_{ij}$ via $\alpha$. The proficiency variable $\theta_i$ is allowed to correlate with $\zeta_i$. A negative correlation between $\theta_i$ and $\zeta_i$ indicates that more proficient individuals are more likely to provide thorough responses even when responses are relatively fast.

Posterior Engagement Probabilities in DLC-IRT Models

Equations (3) and (4) refer to the probabilities of an engaged response that are expected on the basis of response times and, therefore, do not provide a full insight into the probability that a response was actually engaged. To better understand the classification of responses in DLC-IRT models, it is instructive to consider the posterior probability of individual i’s response to item j being engaged, which we refer to as $P(C_{ij} = 1 \mid y_{ij}, t_{ij})$. The posterior engagement probabilities depend on the response times and the observed item responses and can be expressed as
$$P(C_{ij} = 1 \mid y_{ij}, t_{ij}) = \frac{P(C_{ij} = 1 \mid t_{ij})\,P(y_{ij} \mid C_{ij} = 1)}{P(C_{ij} = 1 \mid t_{ij})\,P(y_{ij} \mid C_{ij} = 1) + P(C_{ij} = 0 \mid t_{ij})\,P(y_{ij} \mid C_{ij} = 0)}, \tag{5}$$
where $P(y_{ij} \mid C_{ij} = 1)$ and $P(y_{ij} \mid C_{ij} = 0)$ refer to the probability of the observed item response in the engaged and the disengaged response state, respectively (conditioning on the person variables is suppressed for readability). These probabilities depend on the item responses (correct vs. wrong). For example, $P(y_{ij} = 1 \mid C_{ij} = 0)$ refers to the chance level of success of item j, whereas $P(y_{ij} = 0 \mid C_{ij} = 0)$ reflects j’s chance level of failure. Typically, the chance level of success is low or even zero (e.g., in the case of open responses), which means that the chance level of failure is close to one or even one. In such typical situations, the posterior engagement probabilities of wrong responses will mainly reflect $P(C_{ij} = 1 \mid t_{ij})$, which is monotonically related to $t_{ij}$ (Equations 3 and 4). In contrast, correct responses are likely to receive high posterior engagement probabilities even when they are given in a short amount of time.
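
The contrast between wrong and correct fast responses can be made concrete with a small Python sketch that applies Bayes' rule to a single response. The function name and all numbers (a prior engagement probability of 0.2 for a fast response, an engaged success probability of 0.7, chance levels of 0 and 0.25) are hypothetical.

```python
def posterior_engaged(prior, p_obs_engaged, p_obs_disengaged):
    """Posterior probability that a response was engaged.

    prior: probability of engagement expected from the response time.
    p_obs_engaged: probability of the observed response if engaged.
    p_obs_disengaged: probability of the observed response if disengaged.
    """
    numerator = prior * p_obs_engaged
    return numerator / (numerator + (1.0 - prior) * p_obs_disengaged)


prior_fast = 0.2        # assumed prior for a fast response
p_success_engaged = 0.7  # assumed IRT-based success probability

# Open-response item: chance level of success is zero.
post_correct_open = posterior_engaged(prior_fast, p_success_engaged, 0.0)
post_wrong_open = posterior_engaged(prior_fast, 1.0 - p_success_engaged, 1.0)

# Four-option multiple choice item: chance level of success is 0.25.
post_correct_mc = posterior_engaged(prior_fast, p_success_engaged, 0.25)
post_wrong_mc = posterior_engaged(prior_fast, 1.0 - p_success_engaged, 0.75)
```

Under these assumed values, a fast correct response to the open-response item is classified as engaged with certainty, whereas a fast wrong response keeps a posterior even lower than its response time-based prior.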

Interim Summary of DLC-IRT Models

In DLC-IRT models, the meaning of the latent class variable is defined by the item responses, so that disengaged responses have a solution probability that corresponds to the chance level of success and are not related to proficiency. Response times are used as predictors of latent class membership, which means that response times are the main source of information used to separate engaged from disengaged responses. In the typical case of a low chance level of success, the posterior engagement probabilities in DLC-IRT models reduce the contribution of wrong responses coupled with fast response times in the estimation of IRT item parameters and proficiencies. Therefore, DLC-IRT models are very similar to the combined VICS-VII procedure for setting response time thresholds (Figure 1). Of the two DLC-IRT models, the DLC-SL-IRT model is most similar to the VICS-VII procedure because it assumes that the response times contain all information necessary for predicting the engagement status of item responses. The DLC-TL-IRT model relaxes some restrictive assumptions, namely, that response times unequivocally predict the engagement status of each response (random threshold) and that conditional on response times engagement is not related to proficiency (correlation between random threshold and proficiency).

ILC-IRT Models

In ILC-IRT models, response times are considered to be indicators of response engagement. The basis of the most common type of ILC-IRT model was laid out by Schnipke and Scrams (1997). Their approach was extended by Wang and Xu (2015), Wang et al. (2018), and Ulitzsch et al. (2020), who incorporated item responses into the models by using the structure shown in Equation (1) and imposing a similar structure on the log response times $t_{ij}$:
$$t_{ij} = C_{ij}\,t_{ij}^{(1)} + (1 - C_{ij})\,t_{ij}^{(0)}. \tag{6}$$
ILC-IRT models include an additional latent variable $\eta_i$ that underlies response times. Typically, this variable is specified to reflect individual differences in working speed (higher values are associated with shorter response times; see Klein Entink et al., 2009; van der Linden, 2006). Here, we use a statistically equivalent parameterization in which $\eta_i$ reflects individuals’ typical time expenditure (higher values on $\eta_i$ are associated with longer response times). We assume a congeneric measurement model that corresponds to
$$t_{ij}^{(1)} = \nu_j + \lambda_j \eta_i + \varepsilon_{ij} \tag{7}$$
for engaged responses and to
$$t_{ij}^{(0)} = \tilde{\nu}_j + \tilde{\lambda}_j \eta_i + \tilde{\varepsilon}_{ij} \tag{8}$$
for disengaged responses. The tilde over the intercepts ($\tilde{\nu}_j$), loadings ($\tilde{\lambda}_j$), and residual variables ($\tilde{\varepsilon}_{ij}$) indicates that the intercepts, loadings, and residual variances ($\sigma^2_{\varepsilon_j}$ and $\sigma^2_{\tilde{\varepsilon}_j}$) can differ between engaged and disengaged responses. ILC-IRT models differ in how the class probabilities are expressed. Ulitzsch et al. (2020) provided the most comprehensive parameterization, which corresponds to
$$P(C_{ij} = 1 \mid \xi_i) = \frac{\exp(\xi_i - \kappa_j)}{1 + \exp(\xi_i - \kappa_j)}, \tag{9}$$
where $\xi_i$ is a normally distributed person characteristic [$\xi_i \sim N(0, \sigma^2_{\xi})$], and $\kappa_j$ is a parameter specific to item j. The $\xi$-variable captures the individuals’ propensities of being engaged, whereas the $\kappa$-parameters govern the degree to which the items evoke disengagement. The discrepancy between an individual i’s engagement propensity and an item j’s engagement difficulty determines the probability of engagement of individual i on item j. Equation (9) implies unidimensional engagement propensities and conditional independence of item-specific engagement states.
ILC-IRT models that include the $\xi$-variable make it possible to examine the relationships of engagement propensity with proficiency $\theta_i$ and time expenditure $\eta_i$ and could be beneficial for reducing biases in proficiency estimates (Pokropek, 2016; Rios et al., 2017). Wang and Xu (2015) and Wang et al. (2018) did not include an individual difference variable in their models. Wang and Xu (2015) incorporated only one fixed effect for modeling $P(C_{ij} = 1)$, whereas Wang et al. (2018) included item effects ($\kappa_j$).
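
The data-generating logic of an ILC-IRT model can be sketched in Python as follows. The sketch assumes a logistic class probability driven by the difference between a person's engagement propensity and an item's engagement difficulty, and a normal log response time whose intercept, loading, and residual standard deviation depend on the drawn engagement state. All names and values are hypothetical.

```python
import math
import random


def prob_engaged(propensity, difficulty):
    """Engagement probability from person propensity minus item difficulty."""
    return 1.0 / (1.0 + math.exp(-(propensity - difficulty)))


def simulate_log_rt(propensity, time_expenditure, difficulty,
                    engaged_params, disengaged_params, rng):
    """Draw an engagement state and a log response time for one response.

    engaged_params / disengaged_params: (intercept, loading, residual_sd)
    for the state-specific measurement model of log response times.
    """
    engaged = 1 if rng.random() < prob_engaged(propensity, difficulty) else 0
    intercept, loading, sd = engaged_params if engaged else disengaged_params
    log_rt = intercept + loading * time_expenditure + rng.gauss(0.0, sd)
    return engaged, log_rt


rng = random.Random(2021)  # fixed seed so the sketch is reproducible
state, log_rt = simulate_log_rt(
    propensity=1.0, time_expenditure=0.3, difficulty=-1.0,
    engaged_params=(3.5, 1.0, 0.4),     # assumed engaged intercept, loading, sd
    disengaged_params=(1.5, 0.0, 0.8),  # assumed disengaged values, loading 0
    rng=rng,
)
```

Setting the disengaged loading to zero, as in this example, means that the person's time expenditure only shapes engaged response times.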

ILC-IRT Models With Random Effects of Response Engagement

Up until today, applications of ILC-IRT models have employed a constrained version of the structure provided in Equations (7) and (8). For engaged responses (Equation 7), the $\lambda$-parameters are constrained to be equal and the residual variances to be the same for all items [$\varepsilon_{ij} \sim N(0, \sigma^2_{\varepsilon})$], whereas for disengaged responses (Equation 8), the loadings are fixed to zero ($\tilde{\lambda}_j$ = 0 for all j = 1, 2, …, J), and the intercepts and residuals are set to be equal across all items [$\tilde{\nu}_j = \tilde{\nu}$ and $\tilde{\varepsilon}_{ij} \sim N(0, \sigma^2_{\tilde{\varepsilon}})$ for all j = 1, 2, …, J]. The constraints imposed on disengaged responses mean that their distribution is assumed to be the same for all items and individuals [$t_{ij}^{(0)} \sim N(\tilde{\nu}, \sigma^2_{\tilde{\varepsilon}})$]. This specification was introduced by Schnipke and Scrams (1997), who argued that faster response times are a result of examinees skimming the items in search of cues without thoroughly processing the problem statement. As such, item and person characteristics are expected to have negligible effects on disengaged response times. The restrictions imposed on disengaged response times offer a substantive and a statistical interpretation of the latent variable $\eta_i$. The substantive interpretation is that only engaged responses allow the assessment of the typical time expenditure. According to the statistical interpretation, $\eta_i$ is an individual-specific random effect of response engagement on response times. Therefore, we refer to these kinds of models as ILC-RE-IRT models. To outline the random effect interpretation, we express the effect of the latent class variable $C_{ij}$ on the log response time $t_{ij}$ by inserting Equations (7) and (8) into Equation (6). After some rearrangement, and with $\tilde{\lambda}_j$ = 0, that gives
$$t_{ij} = (\delta_j + \lambda_j \eta_i)\,C_{ij} + \tilde{\nu}_j + [C_{ij}\,\varepsilon_{ij} + (1 - C_{ij})\,\tilde{\varepsilon}_{ij}], \tag{10}$$
where $\delta_j$ corresponds to $\nu_j - \tilde{\nu}_j$. In Equation (10), the term in the first parentheses, $\delta_j + \lambda_j \eta_i$, refers to the effect of $C_{ij}$ on $t_{ij}$, $\tilde{\nu}_j$ to a regression intercept, and the remaining terms capture the possibly heteroscedastic residual variance of $t_{ij}$. The effect of the latent class variable comprises two parts: (1) an average difference in response times $\delta_j$ and (2) systematic individual deviations thereof $\lambda_j \eta_i$.
Therefore, in ILC-RE-IRT models, the $\eta$-variable represents individual differences in the separation of engaged and disengaged response times. Figure 2 includes a schematic path diagram of the ILC-RE-IRT model. The latent class variable $C_{ij}$ reflects the between-individual component $\xi_i$ and includes an item-specific threshold $\kappa_j$ that governs the engagement difficulty of item j. The engagement state affects the log response time to item j with an average effect $\delta_j$ (Equation 10). The between-individual level variable $\eta_i$ captures individual differences in this effect because its loadings are zero when $C_{ij}$ = 0 and $\lambda_j$ when $C_{ij}$ = 1. The latent variables are allowed to correlate. The correlation between $\xi_i$ and $\eta_i$ indicates whether higher engagement propensities are related to higher or lower effects of response engagement on response times.

ILC-IRT Models with Random Intercepts of Response Engagement

The assumption that disengaged response times do not reflect typical time expenditure could be questioned when items elicit time-consuming actions even when responses are disengaged. For example, many items contain long reading passages, figures, and tables that require examinees to spend some time even if they examine the material only superficially. In addition, items that have an open response format require examinees to write down an answer, which is more time-consuming than randomly crossing one response option for a multiple choice item (Wise, 2019). In such cases, it is reasonable to assume that both engaged and disengaged response times depend on individuals’ typical time expenditure. This assumption can be incorporated into ILC-IRT models by allowing the loadings $\tilde{\lambda}_j$ in Equation (8) to differ from zero. We suggest imposing the invariance constraint $\lambda_j = \tilde{\lambda}_j$ (Equations 7 and 8), which leads to a model with a clear interpretation. Based on this constraint, log response times can be expressed as an effect of the latent class variable as
$$t_{ij} = \delta_j\,C_{ij} + (\tilde{\nu}_j + \lambda_j \eta_i) + [C_{ij}\,\varepsilon_{ij} + (1 - C_{ij})\,\tilde{\varepsilon}_{ij}]. \tag{11}$$
Here, the $\delta$-parameters ($\delta_j = \nu_j - \tilde{\nu}_j$) represent the effect of the latent class variable that does not vary between individuals and the sum $\tilde{\nu}_j + \lambda_j \eta_i$ can be interpreted as a random intercept, with $\tilde{\nu}_j$ denoting the average intercept and $\lambda_j \eta_i$ the systematic individual deviations thereof. Similar to Equation (10), the remaining terms are used to capture the possibly heteroscedastic residual variances. Therefore, we refer to the model based on Equation (11) as the ILC-IRT model with a random intercept of response times (ILC-RI-IRT). In ILC-RI-IRT models, the $\eta$-variable represents overall individual differences in response times, so that engaged and disengaged response times are defined relative to $\eta_i$. The structure of ILC-RI-IRT models that also include the engagement propensity $\xi_i$ is shown in Figure 2 with a path diagram. The only difference to the ILC-RE-IRT model is that the response times’ loadings are not affected by the latent class variable.
The engagement state of an individual response is affected by $\xi_i$, which means that $\xi_i$ affects the log response times via the latent class variable. Therefore, the correlation of $\xi_i$ with $\eta_i$ indicates whether between-individual differences in engagement propensity are related to individual differences in time expenditure over and above the effects of engagement states. For example, a positive correlation between $\xi_i$ and $\eta_i$ indicates that individuals with high engagement propensities are expected to spend more time in either engagement state.
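
The rearrangement behind the random effect and the random intercept formulations can be written out in one step. The notation in the following sketch is assumed for illustration and may differ from the symbols of the published article: $C_{ij}$ is the engagement state, $\eta_i$ the time expenditure variable, $\nu_j$, $\lambda_j$, $\varepsilon_{ij}$ the engaged intercept, loading, and residual, and the tilde marks their disengaged counterparts.

```latex
\begin{align*}
t_{ij} &= C_{ij}\,(\nu_j + \lambda_j \eta_i + \varepsilon_{ij})
        + (1 - C_{ij})\,(\tilde{\nu}_j + \tilde{\lambda}_j \eta_i + \tilde{\varepsilon}_{ij}) \\
       &= \tilde{\nu}_j + \tilde{\lambda}_j \eta_i
        + C_{ij}\,\bigl[(\nu_j - \tilde{\nu}_j) + (\lambda_j - \tilde{\lambda}_j)\,\eta_i\bigr]
        + C_{ij}\,\varepsilon_{ij} + (1 - C_{ij})\,\tilde{\varepsilon}_{ij}
\end{align*}
```

Setting $\tilde{\lambda}_j = 0$ leaves $\eta_i$ only inside the bracket multiplied by $C_{ij}$, which is the random effect reading. Setting $\tilde{\lambda}_j = \lambda_j$ removes $\eta_i$ from the bracket and leaves it in the intercept term, which is the random intercept reading.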

Posterior Engagement Probabilities in ILC-IRT Models

In ILC-IRT models the posterior engagement probabilities can be represented as
$$P(C_{ij} = 1 \mid y_{ij}, t_{ij}) = \frac{P(C_{ij} = 1)\,P(y_{ij} \mid C_{ij} = 1)\,f(t_{ij} \mid C_{ij} = 1)}{P(C_{ij} = 1)\,P(y_{ij} \mid C_{ij} = 1)\,f(t_{ij} \mid C_{ij} = 1) + P(C_{ij} = 0)\,P(y_{ij} \mid C_{ij} = 0)\,f(t_{ij} \mid C_{ij} = 0)}, \tag{12}$$
where $P(y_{ij} \mid C_{ij} = 1)$ and $P(y_{ij} \mid C_{ij} = 0)$ are as defined in the context of Equation (5) and $P(C_{ij} = 1)$ and $P(C_{ij} = 0)$ are as specified in Equation (9). In ILC-IRT models the posterior engagement probabilities account for response times by referring to their normal probability density function (PDF) values, defined with respect to the distribution of engaged [$f(t_{ij} \mid C_{ij} = 1)$] and disengaged response times [$f(t_{ij} \mid C_{ij} = 0)$]. The PDF values indicate how typical an observation is for the engaged or disengaged state, respectively. Similar to the posterior probabilities in DLC-IRT models, in the face of low chance levels of success, correct responses are likely to receive high posterior engagement probabilities, which means that the response times’ PDF values have a stronger impact on the posterior probabilities of wrong responses. Typically, the means of disengaged response times are lower than the means of engaged response times. Therefore, long response times might be expected to be more typical for the engaged than the disengaged state (i.e., a higher PDF value in the engaged than in the disengaged response time distribution). However, when the disengaged residual variance $\sigma^2_{\tilde{\varepsilon}}$ is much larger than the engaged residual variance $\sigma^2_{\varepsilon}$, the PDF value of a long response time derived with respect to the disengaged state could be larger than the PDF value in the engaged state. In this case, ILC-IRT models could indicate that short and very long response times are both more typical of the disengaged than the engaged state, which means that they both receive low posterior engagement probabilities. We observed that, in applications of ILC-IRT models, estimates of $\sigma^2_{\tilde{\varepsilon}}$ tended to be much larger than estimates of $\sigma^2_{\varepsilon}$ (e.g., Schnipke & Scrams, 1997; Ulitzsch et al., 2020). For this reason, the specification of ILC-IRT models should be carefully thought through to prevent biased estimates of residual variances.
For example, the assumption that a single measurement intercept applies to all disengaged response times should be reconsidered, because erroneously fixing intercepts to be equal could artificially inflate the estimates of $\sigma^2_{\tilde{\varepsilon}}$.
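
The non-monotonic pattern described above can be reproduced with a short Python sketch that weights each engagement state by the normal density of the observed log response time. The function names and all values are hypothetical: a wrong response, a prior engagement probability of 0.8, engaged log response times with mean 3.5 and standard deviation 0.4, and disengaged log response times with mean 1.5 and a much larger standard deviation of 1.5.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_engaged_ilc(prior, p_obs_engaged, p_obs_disengaged,
                          log_rt, mean_e, sd_e, mean_d, sd_d):
    """Posterior engagement probability using response and response time."""
    num = prior * p_obs_engaged * normal_pdf(log_rt, mean_e, sd_e)
    den = num + (1.0 - prior) * p_obs_disengaged * normal_pdf(log_rt, mean_d, sd_d)
    return num / den


# Assumed setting for a wrong response (all values hypothetical)
settings = dict(prior=0.8, p_obs_engaged=0.4, p_obs_disengaged=0.9,
                mean_e=3.5, sd_e=0.4, mean_d=1.5, sd_d=1.5)

post_short = posterior_engaged_ilc(log_rt=1.0, **settings)
post_typical = posterior_engaged_ilc(log_rt=3.5, **settings)
post_long = posterior_engaged_ilc(log_rt=6.0, **settings)
```

Under these assumed values, only the response time near the engaged mean receives a high posterior engagement probability; both the short and the very long response time are attributed to the wide disengaged distribution.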

Interim Summary of ILC-IRT Models

In ILC-IRT models, responses and response times jointly define the meaning of the latent class variable. Engaged item responses are assumed to be indicative of proficiency and to be accompanied by a specific distribution of response times. In contrast, disengaged responses are assumed to not reflect proficiency and to be accompanied by a different distribution of response times. We presented two alternative ways of specifying ILC-IRT models for response engagement. The commonly used ILC-RE-IRT models (Ulitzsch et al., 2020; Wang & Xu, 2015; Wang et al., 2018) specify the relationships of the latent class variable with response times by means of a random effect regression, whereas ILC-RI-IRT models replace the random effect with a random intercept. These two specifications imply a different definition of typical time expenditure, so that the time expenditure variable $\eta_i$ is defined solely on the basis of engaged responses in ILC-RE-IRT models and on the basis of both kinds of responses in ILC-RI-IRT models. In ILC-RE-IRT models, $\eta_i$ represents individual differences in the separation of engaged response times from an unstructured distribution that is assumed to be the same for all individuals. In ILC-RI-IRT models, $\eta_i$ represents overall individual differences in response times, so that engaged and disengaged response time distributions are defined relative to $\eta_i$. ILC-IRT models are conceptually related to the VIB method (Wise, 2015), but this connection is not as close as the one between DLC-IRT models and the VICS-VII method. Clearly, continuous response time variables carry more information than categorical item responses, so that latent class membership is more strongly determined by response times than by item responses (Wang & Xu, 2015). However, ILC-IRT models also account for item responses, whereas the VIB method does not.
In addition, the VIB procedure aims to separate fast disengaged responses from engaged responses on the basis of a response time threshold (monotonic relationship), whereas ILC-IRT models allow for more complex relationships between response times and engagement.

DLC-IRT versus ILC-IRT Models

DLC-IRT and ILC-IRT models provide two different ways to include response times in IRT models for response engagement. Table 1 summarizes how response times are used in the different models considered in this article. In ILC-IRT models, response times are specified as reflections of response engagement, whereas in DLC-IRT models, response engagement is predicted by the individuals’ response times. Therefore, DLC-IRT and ILC-IRT models differ in how dependencies between item responses and response times are modeled. DLC-IRT models consider the distribution of item responses conditional on response times, which means that they take response times as given, without incorporating specific assumptions about their distribution. In contrast, in ILC-IRT models, the joint distribution of item responses and response times is modeled on the basis of assumptions about the structure and the distribution of response engagement and response times. These assumptions make it possible to estimate complex relationships between response engagement and response times that could be non-monotonic (Equation 12) and could differ in strengths between items (Equations 10 and 11). In contrast, in DLC-IRT models, item-specific response times are assumed to have monotonic relationships of equal strength with response engagement. Given the fundamental differences between ILC-IRT and DLC-IRT models, decisions in favor of one type of model should be based on substantive considerations. Decisions cannot rely on goodness-of-fit indices based on the model likelihood (e.g., Akaike information criterion and Bayesian information criterion) because DLC-IRT and ILC-IRT models consider different aspects of the data. However, such indices can be used to decide between variants of either type of model.

Table 1.

Summary of the Key Aspects of the Treatment of Item Response Times in Dependent Latent Class (DLC-IRT) and Independent Latent Class (ILC-IRT) Models for Response Engagement.

Dependent latent class IRT models
Common aspects	• Response times predict engagement. • Conditional on item responses, longer response times are monotonically related to higher probabilities of engaged responses.
DLC-SL-IRT	• Estimates response times at which an equal probability of engaged and disengaged responses is expected (response time thresholds). • Response time thresholds could be item-specific.
DLC-TL-IRT	• Allows for individual differences in response time thresholds that could be correlated with proficiency.
Independent latent class IRT models
Common aspects	• Response times reflect engagement. • Response times contribute to the classification of engagement by their relative typicality for the engaged and disengaged state. • Response times can be non-monotonically related to the probability of engaged responses.
ILC-RE-IRT	• Disengaged response times are uncorrelated with each other and have a distribution that is the same for all individuals. • The mean separation of engaged and disengaged response times differs between individuals (random effect) and is reflected by the individuals’ typical time expenditure.
ILC-RI-IRT	• Disengaged and engaged response times both reflect individuals’ typical time expenditure (random intercept). • The mean separation of engaged and disengaged response times is the same for all individuals.

Note. DLC-SL-IRT = dependent latent class IRT model with single-level relationships of response times; DLC-TL-IRT = dependent latent class IRT model with two-level relationships of response times; ILC-RE-IRT = independent latent class IRT model with a random effect of the latent class variable on response times; ILC-RI-IRT = independent latent class IRT model with a random intercept of response times.

The most important aspects to consider when deciding on a model type are the assumptions about the role of response times and the definition of response engagement. DLC-IRT models take a perspective that is typical for the rapid guessing literature, according to which response engagement can be expressed as a monotonic function of response times (the longer the response time, the more likely it is that the response is engaged; Wise, 2019). Furthermore, in this stream of research, it is not common to make structural and distributional assumptions about response times. Therefore, researchers favoring the perspective pursued in research on rapid guessing might find DLC-IRT models to be most attractive. Indeed, DLC-IRT models provide many benefits compared to the typically time-consuming and cumbersome two-step procedures employed in this field (threshold-based classification of response engagement followed by the IRT analyses of responses classified as engaged). DLC-IRT models are one-step procedures that are easy to apply. Here, response time thresholds are estimated within the model, and the model-implied posterior engagement probabilities are accounted for in the estimation of IRT item parameters and individual proficiency levels. 
Furthermore, DLC-IRT models have the appealing feature that they can examine individual differences in response time thresholds (DLC-TL-IRT), whereas this is not possible on the basis of traditional methods.

ILC-IRT models contain the key ingredients of van der Linden’s (2007) speed-accuracy IRT model, which is frequently used in psychometrically inspired research. This model is based on the assumption that individuals have their own typical working speed (or typical time expenditure), which they maintain over the course of a test. One key research question examined on the basis of this model refers to the relationships between proficiency and typical time expenditure (e.g., Goldhammer et al., 2013). Disengaged item responses violate the constant time expenditure assumption, which means that ILC-IRT models can be seen as extensions of the speed-accuracy IRT model that account for such violations. As such, ILC-IRT models are a suitable choice when researchers are interested in the relationships between proficiency and typical time expenditure and wish to account for disengaged responses defined by atypical response times. In this vein, the two variants of ILC-IRT models, which differ in whether the log response times’ loadings interact (ILC-RE-IRT) or do not interact (ILC-RI-IRT) with the latent class variable, might prove useful in deriving accurate representations of individuals’ typical time expenditure and engagement states.

The MM-IRT for Response Engagement

In this section, the MM-IRT framework, which allows the previously presented IRT models to be estimated with Mplus (Muthén & Muthén, 2017), is introduced. We first describe how the input data are structured and then move to the specification and estimation of DLC-IRT and ILC-IRT models. The section ends with an illustration of the framework.

Data Setup

IRT models can be formulated and estimated as multilevel models with item responses organized in a long format (e.g., Van Den Noortgate et al., 2003). This setup can be extended to MM-IRT models by including an item-level latent class variable (e.g., Asparouhov & Muthén, 2008; Pokropek, 2016). However, in our experience, the estimation of MM-IRT models on the basis of a long data format is computationally very demanding because the models require the estimation of a large number of interactions. As a solution to this problem, we propose a rearrangement of the input data that greatly reduces estimation times when employing MML estimation. In our setup, the item responses are organized in a diagonal format. Each individual i’s item responses are represented in a matrix of order J × J. The elements of this matrix are denoted as $y_{ijk}$, with observed responses on the diagonal (j = k) and missing values in the off-diagonals (j ≠ k). Individuals’ arrays of item responses are augmented by a J × J matrix that contains item indicators $d_{ijk}$. In this matrix, off-diagonal entries are zero, and diagonal entries are one. Finally, the organization of response times differs between DLC-IRT and ILC-IRT models. In DLC-IRT models, response times are entered in a long format, which means that they are organized in a J-valued vector with entries $l_{ij}$. In the case of ILC-IRT models, response times are organized in a J × J diagonal matrix. In this case, entries are denoted as $l_{ijk}$, and only diagonal entries (j = k) contain nonmissing values. Table 2 provides an example of a data array for one individual, where response times were entered twice to exemplify their organization in the case of DLC-IRT and ILC-IRT models.

Table 2.

Example of a Data Array for One Individual in the Diagonal Format. Variables in the Columns are Indexed by Item (k = 1, 2, …, K with K = J). The Example includes Two Alternative Organizations of Log Response Times as Used in DLC-IRT and ILC-IRT Models.

	Item responses					Item indicators					Log response times
	DLC-IRT and ILC-IRT					DLC-IRT and ILC-IRT					DLC-IRT	ILC-IRT
Item	y1	y2	⋯	yK−1	yK	d1	d2	⋯	dK−1	dK	l	l1	l2	⋯	lK−1	lK
1	1	−99	⋯	−99	−99	1	0	⋯	0	0	3.2	3.2	−99	⋯	−99	−99
2	−99	1	⋯	−99	−99	0	1	⋯	0	0	2.7	−99	2.7	⋯	−99	−99
⋮	⋮	⋮	⋱	⋮	⋮	⋮	⋮	⋱	⋮	⋮	⋮	⋮	⋮	⋱	⋮	⋮
J − 1	−99	−99	⋯	0	−99	0	0	⋯	1	0	1.9	−99	−99	⋯	1.9	−99
J	−99	−99	⋯	−99	1	0	0	⋯	0	1	2.5	−99	−99	⋯	−99	2.5

Note. “−99” denotes missing values.
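The diagonal arrangement in Table 2 can be illustrated with a short sketch. The function name `to_diagonal_format`, the dictionary keys, and the example values are hypothetical and only serve the illustration; the missing-value code −99 follows Table 2. The sketch is written in Python with NumPy and is not part of the Mplus input files mentioned in the article.

```python
import numpy as np

MISSING = -99  # missing-value code used in Table 2

def to_diagonal_format(responses, log_rts):
    """Arrange one individual's J responses and log response times as in Table 2.

    Hypothetical helper: returns the J x J response matrix, the J x J item
    indicator matrix, the long-format log response times (DLC-IRT), and the
    J x J diagonal log response time matrix (ILC-IRT).
    """
    y = np.asarray(responses, dtype=float)
    l = np.asarray(log_rts, dtype=float)
    J = y.size
    on_diag = np.eye(J, dtype=bool)
    # Row j keeps its own response in column k = j; all other cells are missing.
    y_diag = np.where(on_diag, y[:, None], MISSING)
    # Item indicators: one on the diagonal, zero elsewhere.
    d = np.eye(J)
    # Same diagonal layout for the log response times (ILC-IRT models).
    l_diag = np.where(on_diag, l[:, None], MISSING)
    return {"y": y_diag, "d": d, "l_long": l, "l_diag": l_diag}

example = to_diagonal_format([1, 1, 0, 1], [3.2, 2.7, 1.9, 2.5])
print(example["y"])
```

Each row of the resulting arrays corresponds to one item-level record of the individual, so that a data set for N individuals would stack N such blocks.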

Specification of IRT Models for Response Engagement in the MM-IRT Framework

Specification of the IRT Part for Item Responses

We first outline the specification of the IRT part that is used to model item responses in all models. In our MM-IRT framework, item responses are represented by a 2PL model that provides a good compromise between flexibility and parsimony and is well suited for the open responses considered in this article. To include the 2PL part in the MM-IRT framework, we specify the item responses to be located on the within-individual level and the proficiency variable to be located on the between-individual level. For engaged item responses, the 2PL structure (Equation 13) is parameterized in terms of the discrimination and threshold parameters of item j = k and of individual i’s scores on a node variable. The node variable is fully determined by a J × 1 unit vector, which means that the proficiency variable is represented by a random effect located at the between-individual level. For model identification, we specify the proficiency variable to follow a standard normal distribution. Because the node variable is determined by a unit vector, it can be substituted with the proficiency variable in Equation (13), which leads to the typical notation of the 2PL model. For disengaged responses, the success probability does not depend on proficiency, and the probability of a correct response corresponds to the chance level of success. In this case, Equation (13) is altered by fixing the discrimination parameters to 0 for all j = k = 1, 2, …, K, so that the remaining threshold parameter governs the chance level of success of item j = k. This threshold parameter might be specified as being item-specific or invariant across items and can be freely estimated or fixed to meaningful values. For example, in the case of open responses, the chance level of success is essentially zero, which means that the threshold parameters of disengaged responses might be fixed to a large value (e.g., 15).
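The two response states can be made concrete with a minimal numerical sketch. It assumes a logit of the form discrimination × proficiency − threshold, a sign convention that is consistent with the statement that a large threshold (e.g., 15) implies a chance level of essentially zero but that is not spelled out here. The function name and all parameter values other than the threshold of 15 are hypothetical.

```python
import math

def success_probability(theta, discrimination, threshold, engaged=True):
    """P(correct response) under an assumed logit = discrimination * theta - threshold.

    For disengaged responses the discrimination is set to zero, so the
    probability depends on the threshold only (chance level of success).
    """
    slope = discrimination if engaged else 0.0
    logit = slope * theta - threshold
    return 1.0 / (1.0 + math.exp(-logit))

# Engaged response: hypothetical item parameters and proficiency.
p_engaged = success_probability(theta=0.5, discrimination=1.2, threshold=-0.3)

# Disengaged response with the threshold fixed to 15, as suggested for open responses.
p_disengaged = success_probability(theta=0.5, discrimination=1.2, threshold=15.0, engaged=False)

print(round(p_engaged, 3), p_disengaged)
```

Under this convention, a threshold of 15 yields a success probability below one in a million regardless of proficiency, which is why such a value serves as a practical stand-in for a zero chance level.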

Specification and Estimation of DLC-IRT Models

The estimation of the variants of DLC-IRT models relies on alternative but statistically equivalent parameterizations of Equations (3) and (4). We begin with the more complex DLC-TL-IRT model because the DLC-SL-IRT model is a constrained version thereof. In the DLC-TL-IRT model, the latent class variable is specified to depend on item response times that are organized in a long format (Table 2) and on the K = J item indicators via a logistic regression (Equation 16), in which the slope of the response times corresponds to the process discrimination parameter in Equation (3). In contrast to Equations (3) and (4), the coefficients of the item indicators can be interpreted as item-specific logistic intercepts. The logistic regression also includes a random disturbance that represents a rescaled counterpart of the DLC-TL-IRT model’s threshold disturbance in Equation (4). The DLC-SL-IRT model can be derived from Equation (16) by omitting this random disturbance from the model. The parameters provided by the MM-IRT specification can be converted into the parameters of Equations (3) and (4), that is, into the process discrimination parameter, the item-specific response time thresholds (for j = k), the variance of the random disturbance, and the correlation of the random disturbance with proficiency. The standard errors of the converted parameters are provided by the delta method. The DLC-IRT models can be estimated by MML through the EM algorithm. MML estimation follows from the procedures outlined by Vermunt (2003). The log likelihood to be maximized is given in Equation (17). As only the diagonal entries (j = k) of the matrix of item responses include information, Equation (17) is equivalent to the log-likelihood expression in which item responses are represented as a J × 1 vector. Therefore, the probability density of the data remains the same regardless of how the data are organized. In the probability density of the DLC-TL-IRT model, individual responses are indexed so as to highlight that only diagonal entries (j = k) contribute to the estimation. 
When the DLC-SL-IRT model is estimated, the random disturbance is excluded. As a consequence, the DLC-SL-IRT model requires only a one-dimensional integration.
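The link between logistic intercepts and response time thresholds can be sketched as follows. The sketch assumes that the engagement logit is linear in the log response time, so that the threshold is the log response time at which engaged and disengaged responses are equally likely (Table 1). The function names are hypothetical, the sign convention of the logistic regression is an assumption, and the random disturbance of the DLC-TL-IRT model is left out. The slope of 10.28 and the threshold of 1.96 are the DLC-SL-IRT estimates reported in the application (item 1); the intercept is derived from them for illustration only.

```python
import math

def engagement_probability(log_rt, discrimination, threshold):
    """P(engaged) as a monotonically increasing function of the log response time."""
    return 1.0 / (1.0 + math.exp(-discrimination * (log_rt - threshold)))

def threshold_from_intercept(intercept, slope):
    """Log response time at which logit = intercept + slope * log_rt equals zero."""
    return -intercept / slope

slope = 10.28                   # process discrimination reported for the DLC-SL-IRT model
threshold = 1.96                # response time threshold of item 1 (Table 4, DLC-SL)
intercept = -slope * threshold  # implied logistic intercept (illustrative assumption)

print(threshold_from_intercept(intercept, slope))
for log_rt in (1.5, 1.96, 2.5):
    print(log_rt, round(engagement_probability(log_rt, slope, threshold), 3))
```

With a slope of this size, the engagement probability moves from close to zero to close to one within roughly half a unit of log response time around the threshold, which is why the model behaves much like a threshold-based classification.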

Specification and Estimation of ILC-IRT Models

The estimation of ILC-IRT models in the MM-IRT framework requires response times to be organized in a diagonal format (Table 2). On the basis of this data setup, the measurement structure of typical time expenditure is expressed by modifying Equations (7) and (8) into Equations (19) and (20). The index k in Equations (19) and (20) replaces the item index j, which means that the parameters indexed by k equal their counterparts indexed by j for j = k. The measurement structure includes individual i’s scores on a node variable that is fully determined by the J × 1 unit vector with effects that are allowed to vary on the between-individual level, where the between-individual random effect variable is specified to follow a standard normal distribution for model identification. The difference between the two ILC-IRT model variants is whether the log response times’ loadings interact (ILC-RE-IRT) or do not interact (ILC-RI-IRT) with the latent class variable. In the case of the ILC-RE-IRT model, constraints that let the loadings differ between the engaged and disengaged states are applied for all k = 1, 2, …, K, whereas in the case of the ILC-RI-IRT model, constraints that equate the loadings across the two states are imposed for all k = 1, 2, …, K. The MML estimation of ILC-IRT models is more challenging. The log likelihood to be maximized (e.g., Vermunt, 2008) is based on the joint distribution of item responses and response times, whose probability density (Equation 23) includes the density of the continuous log response times conditional on the latent variables. Equation (23) means that, conditional on the between-individual variables, item responses and response times are independent, which allows the model to be estimated by the EM algorithm (Vermunt, 2008).
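How response times contribute to the classification in ILC-IRT models can be illustrated with a simplified two-component normal mixture. The sketch below considers the response time part only: it ignores the item response and the between-individual variables, so it is a stand-in for the idea of relative typicality rather than the full model. The function names, the prior engagement probability, and the two intercepts are hypothetical; the residual variances (0.20 and 0.74) are chosen within the range of the estimates reported in the application.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def posterior_engaged(log_rt, prior_engaged, mean_e, var_e, mean_d, var_d):
    """Posterior probability of the engaged state given a log response time only."""
    num = prior_engaged * normal_pdf(log_rt, mean_e, var_e)
    den = num + (1.0 - prior_engaged) * normal_pdf(log_rt, mean_d, var_d)
    return num / den

# Hypothetical intercepts; variances in the range of the reported estimates.
params = dict(prior_engaged=0.9, mean_e=3.5, var_e=0.20, mean_d=1.5, var_d=0.74)

for log_rt in (1.0, 3.5, 6.5):
    print(log_rt, round(posterior_engaged(log_rt, **params), 3))
```

Because the disengaged component has the larger variance, its density eventually exceeds that of the engaged component in both tails. Very short and very long response times are therefore both more typical of the disengaged state, which is the non-monotonic pattern that distinguishes ILC-IRT from DLC-IRT models.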

Illustrative Application to the PIAAC Data

Illustrative Data

PIAAC is an international large-scale assessment of proficiencies in adults. Here, we consider the literacy test administered in the French-speaking samples (France and Canada). We chose this sample because previous research suggests that it exhibits a typical level of disengagement (Goldhammer et al., 2016), and it offers a sample size that is large enough to estimate the models. In PIAAC, a two-stage booklet design is used. In the first stage, examinees are randomly assigned to one of three booklets. Each booklet has a different level of difficulty and consists of nine items. In stage two, individuals are then assigned to one of four different booklets that contain 11 items each. All items have an open response format, which means that the success probability of disengaged responses can be expected to be essentially zero. In order to derive a sample with a decent number of items without introducing additional complications caused by the design, we selected a subgroup of examinees who had worked on a specific combination of stage-one and stage-two booklets (“low” and “intermediate one”). To keep the examples simple, we used only examinees who answered all 20 items (N = 637).

Model Specification and Estimation

All IRT models for response engagement were estimated with Mplus 8.4 (Muthén & Muthén, 2017) using the accelerated EM algorithm and employing 15 integration points per dimension. As models that include latent class variables could be prone to local optima (Lubke & Muthén, 2005), we employed multiple random starting values to check whether the best log likelihood had been replicated. The Mplus input files for all models are provided in the Supplemental Material. In our application of DLC-IRT models, the threshold parameters of disengaged responses were fixed to imply a zero chance probability (i.e., a value of 15 for all k = 1, 2, …, K). We included this constraint because there were no correct responses that were classified as disengaged. In specifying the ILC-IRT models, we constrained the threshold parameters of disengaged responses to be equal to each other, and we assumed a homoscedastic residual variance of disengaged log response times. The reason for these constraints was that the data set appeared too small to provide solid item-specific parameter estimates given the rather low prevalence of disengaged responses. In line with the previous arguments according to which equal measurement intercepts of disengaged response times could artificially inflate residual variances, we left all measurement intercepts in both response states unconstrained. The estimated threshold parameters of disengaged responses indicated a chance probability close to zero (ILC-RE-IRT: 6.04, SE = 0.86; ILC-RI-IRT: 5.60, SE = 0.79).

Model-Data Fit

Decisions between DLC-IRT and ILC-IRT models cannot be based on information criteria, but these indices can be used to examine the merits of accounting for response engagement by comparing the fit of each type of IRT model for response engagement with the fit of a suitable reference model. We chose the 2PL IRT model as a reference for the DLC-IRT models, whereas ILC-IRT models were compared to van der Linden’s (2007) speed-accuracy IRT model (denoted as 2PL + RT). Information criteria were furthermore used to examine the different specifications of each type of IRT model for response engagement. Table 3 summarizes the goodness-of-fit statistics of all IRT models. The DLC-SL-IRT model fitted the data better than the 2PL model, and the fit was further improved by extending the model to the DLC-TL-IRT model. Compared to the 2PL + RT model, the model-data fit was improved by the ILC-IRT models. The ILC-RI-IRT model fitted clearly better than the ILC-RE-IRT model.

Table 3.

Goodness-of-Fit Measures of IRT Models for Response Engagement.

	# Par.	LL	AIC	BIC	SBIC
2PL	40	−6342.1	12764.3	12942.5	12815.5
DLC-SL-IRT	61	−5995.5	12113.0	12384.8	12191.2
DLC-TL-IRT	63	−5970.0	12065.9	12346.7	12146.7
2PL IRT + RT	101	−18037.3	36276.3	36726.7	36406.0
ILC-RE-IRT	146	−16428.0	33147.9	33798.6	33335.1
ILC-RI-IRT	146	−16308.2	32908.3	33559.0	33095.5

Note. LL = model log likelihood; AIC = Akaike information criterion; BIC = Bayesian information criterion; SBIC = sample size adjusted BIC.
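The information criteria in Table 3 can be recomputed from the log likelihoods and the numbers of parameters. The sketch below uses the standard definitions of the AIC, the BIC, and the sample size adjusted BIC (with the adjusted sample size (N + 2)/24) and assumes that the number of individuals (N = 637) enters as the sample size. The function name is hypothetical.

```python
import math

def information_criteria(log_likelihood, n_parameters, n_individuals):
    """Return (AIC, BIC, sample size adjusted BIC) from standard definitions."""
    deviance = -2.0 * log_likelihood
    aic = deviance + 2.0 * n_parameters
    bic = deviance + n_parameters * math.log(n_individuals)
    sbic = deviance + n_parameters * math.log((n_individuals + 2.0) / 24.0)
    return aic, bic, sbic

# 2PL model from Table 3: 40 parameters, LL = -6342.1, N = 637 (assumed sample size).
aic, bic, sbic = information_criteria(-6342.1, 40, 637)
print(round(aic, 1), round(bic, 1), round(sbic, 1))
```

The values agree with the tabled entries for the 2PL model (12764.3, 12942.5, 12815.5) up to the rounding of the log likelihood to one decimal place.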


Response Time and Response Engagement

In both DLC-IRT models, the process discrimination parameters documented a strong effect of response times (DLC-SL-IRT: estimate = 10.28, SE = 1.69, p < .001; DLC-TL-IRT: estimate = 10.26, SE = 2.34, p < .001). The response time thresholds estimated in the two models were very similar and differed between items (Table 4). The DLC-TL-IRT model indicated individual differences in response time thresholds (estimate = 0.32, SE = 0.05, p < .001) that were correlated with proficiency (correlation = −.80, SE = 0.10, p < .001). Table 4 reports the average posterior engagement probabilities for each item. The patterns of item-specific probabilities were similar for both DLC-IRT models, but in the DLC-TL-IRT model, a smaller proportion of responses was classified as engaged (90.4% vs. 92.5%). This difference was due to the higher response time thresholds of less proficient individuals, which caused a larger proportion of responses to be classified as disengaged.

Table 4.

Estimated Response Thresholds in DLC-IRT Models and Engagement Difficulties in ILC-IRT Models, and Estimated Proportion of Engaged Responses (Average Posterior Latent Class Probabilities).

	Response time thresholds ($\hat{\tau}_j$)^A		Engagement difficulties ($\hat{\kappa}_j$)^A		Proportion of engaged responses
Item	DLC-SL	DLC-TL	ILC-RE	ILC-RI	DLC-SL	DLC-TL	ILC-RE	ILC-RI
1	1.96	1.81	−11.00	−10.41	.987	.986	.987	.984
2	2.01	1.67	−11.52	−11.43	.990	.990	.989	.990
3	2.63	2.47	−9.23	−8.32	.966	.955	.973	.962
4	1.93	1.88	−9.21	−7.52	.980	.974	.972	.947
5	2.94	3.04	−4.93	−4.35	.905	.862	.849	.825
6	2.24	2.24	−7.18	−6.73	.944	.934	.936	.927
7	3.41	3.54	−5.45	−4.72	.836	.779	.874	.845
8	2.65	2.73	−8.03	−6.42	.959	.942	.955	.918
9	2.55	2.56	−7.10	−6.12	.934	.914	.934	.908
10	2.79	2.71	−8.10	−6.66	.958	.953	.956	.925
11	1.95	1.93	−7.88	−7.43	.956	.950	.952	.945
12	2.57	2.59	−7.47	−6.19	.949	.942	.943	.932
13	2.24	2.24	−7.25	−6.87	.945	.932	.938	.931
14	3.16	3.27	−5.86	−5.26	.914	.894	.891	.872
15	1.44	1.48	−6.11	−5.47	.944	.938	.901	.882
16	3.62	3.70	−3.89	−3.51	.821	.781	.790	.774
17	2.87	3.05	−3.93	−3.51	.839	.804	.792	.774
18	3.80	3.83	−3.79	−3.19	.815	.764	.784	.752
19	2.38	2.51	−5.27	−4.58	.921	.887	.865	.838
20	2.10	2.21	−5.74	−5.23	.929	.895	.886	.871

Note. DLC-SL = dependent latent class IRT model with single-level relationships; DLC-TL = dependent latent class IRT model with two-level relationships; ILC-RE = independent latent class IRT model with random effects; ILC-RI = independent latent class IRT model with random intercepts; A = all estimates are statistically significant at the p ≤ .001 level.

In the ILC-IRT models, the relationships of response engagement with response times are represented by the separation of engaged and disengaged response time distributions. The upper panel in Figure 3 reports the results of the ILC-RE-IRT model on the basis of the engaged and disengaged response time intercepts and the regions around the intercepts that comprised 80% of the values (±1.282 residual standard deviations). Therefore, in the case of engaged responses, the vertical bars indicate the region in which the response times of individuals with an average level of typical time expenditure typically occur. Because, in this model, engaged response times reflect between-individual differences due to typical time expenditure, the engaged response time bands are accompanied by dotted lines. The dotted lines indicate the lower and upper values of the engaged response time bands expected for individuals who scored at either the 10th or the 90th percentile of the distribution of typical time expenditure. The response time bands of disengaged responses were wide because their residual variance (0.74, SE = 0.05) was larger than the residual variances of engaged response times (ranging from 0.10, SE = 0.01, to 0.35, SE = 0.02). The response time distributions were clearly separated, even at low levels of typical time expenditure.
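The 80% regions described above follow from the normal quantile at the 90th percentile (about 1.282) and the residual standard deviation. The sketch below shows the computation for one engaged and one disengaged response time distribution. The function name and the two intercepts are hypothetical; the residual variances are the smallest engaged estimate (0.10) and the disengaged estimate (0.74) reported for the ILC-RE-IRT model.

```python
from statistics import NormalDist

def central_band(intercept, residual_variance, coverage=0.80):
    """Lower and upper bound of the central region that contains `coverage` of the values."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 1.282 for 80% coverage
    half_width = z * residual_variance ** 0.5
    return intercept - half_width, intercept + half_width

# Hypothetical intercepts on the log response time scale.
engaged_band = central_band(intercept=3.5, residual_variance=0.10)
disengaged_band = central_band(intercept=1.5, residual_variance=0.74)

print(engaged_band)
print(disengaged_band)
```

With these variances, the disengaged band is a little less than three times as wide as the narrowest engaged band, which is what makes the disengaged bars in Figure 3 stand out.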

Figure 3.

Separation of log response times of engaged and disengaged responses provided by the independent latent class IRT model with random effects (ILC-RE-IRT) and the independent latent class IRT model with random intercepts (ILC-RI-IRT). Circles represent intercepts of engaged (filled black) and disengaged (open) log response times. Vertical bars represent the range between the 10th and 90th percentile of the residual distributions around the intercepts. Dotted lines visualize the impact of individual differences in typical time expenditure.

The lower panel in Figure 3 describes the response time distributions given by the ILC-RI-IRT model. Because the separation of engaged and disengaged response times is specified to be the same for all individuals, dotted lines that indicate the impact of the random intercept enclose both mixture components. For engaged responses, the intercepts and residual variances (ranging from 0.10, SE = 0.01, to 0.32, SE = 0.02) were almost identical to the results of the ILC-RE-IRT model, but the intercepts of disengaged response times were higher. Because the residual variance of disengaged responses was large (0.71, SE = 0.04), the response time distributions had a larger overlap. This result means that, compared to the ILC-RE-IRT model, in the ILC-RI-IRT model longer response times were less unequivocally classified as engaged. The posterior engagement probabilities given by the ILC-IRT models are reported together with the engagement difficulties in Table 4. Overall, the engagement difficulty parameters and the posterior probabilities were similar in both ILC-IRT models. However, the ILC-RI-IRT model indicated a lower proportion of engaged responses (89.0% vs. 90.8%). 
This difference was reflected in the larger overlap of response time distributions in the ILC-RI-IRT model (Figure 3), which means that, in this model, relatively long response times were more likely to be classified as disengaged than in the ILC-RE-IRT model. Both ILC-IRT models provided very similar results with respect to the latent characteristics: They estimated a large variability in engagement propensities (ILC-RE-IRT: 4.48, SE = 0.30; ILC-RI-IRT: 4.36, SE = 0.29), and the correlations between typical time expenditures, engagement propensities, and proficiencies were very similar (ILC-RE-IRT: .25, SE = 0.06; .72, SE = 0.04; −.01, SE = 0.05; ILC-RI-IRT: .31, SE = 0.05; .69, SE = 0.04; .01, SE = 0.05).

Adjustments of IRT Item Parameters

The estimates of the IRT item parameters are summarized in Figure 4. Compared to the 2PL model, both DLC-IRT models estimated lower item discriminations, but discriminations were slightly lower in the DLC-TL-IRT model. This result suggested that the variability in proficiencies represented by the 2PL model might be inflated by individual differences in response engagement. Figure 4 includes estimates of the item difficulties (i.e., the ratio of the threshold to the discrimination parameter). The item difficulties estimated by the DLC-IRT models were lower than those estimated by the 2PL model. The only exception was item 15, for which item difficulties need to be interpreted with caution due to a low item discrimination. Overall, the DLC-TL-IRT model resulted in slightly stronger adjustments of item difficulties than the DLC-SL-IRT model. As shown in the lower panels of Figure 4, which refer to the item parameters of the IRT parts of the ILC-IRT models, the item discriminations estimated by the ILC-IRT models were lower than in the 2PL + RT model and were lowest in the ILC-RI-IRT model. With the exception of item 15, which discriminated weakly between proficiency levels, difficulties were estimated to be lower than in the 2PL + RT model. Difficulties were slightly lower in the ILC-RI-IRT model than in the ILC-RE-IRT model.

Figure 4.

Estimates of item discriminations and difficulties derived from DLC-IRT and ILC-IRT models. Results from a conventional 2PL IRT model and a 2PL IRT model including a congeneric measurement model for time expenditure based on log response times (2PL + RT) are included as a comparison standard.

Adjustments of Proficiency Estimates

The adjustments of the proficiency estimates provided by the IRT models for response engagement relative to the 2PL model were based on EAP estimates of proficiency that were, however, given on different metrics. In order to convert the proficiency estimates provided by the DLC-IRT and ILC-IRT models to the metric of the 2PL IRT model, we examined the relationships between the proficiency estimates of individuals who were expected to be engaged throughout the test (i.e., the average posterior latent class probabilities were higher than .99). These estimates correlated almost perfectly (r > .99) and were used to derive the coefficients of the linear conversion equations, which allowed us to approximate the proficiency metrics of the DLC-IRT and ILC-IRT models to the metric of the 2PL IRT model. Figure 5 presents the scatter plots of the EAP proficiency estimates derived from the 2PL IRT model (x-axis) and converted EAP estimates given by the DLC-IRT models (y-axis). For individuals who were expected to be partially disengaged, both DLC-IRT models provided higher proficiency estimates than the 2PL IRT model. Moreover, the pattern of relationships turned out to be highly similar between the DLC-IRT models, so that, relative to the 2PL IRT model, the proficiency estimates were, on average, d = 0.14 and d = 0.15 standard deviations higher in the DLC-SL-IRT and the DLC-TL-IRT model, respectively. However, in the case of individuals who were expected to be disengaged on essentially all items of the test, the adjustments of the proficiency estimates were extreme because their EAP estimates were strongly shrunken toward the mean of the proficiency distribution. Apart from these few extreme cases (N = 10), the adjustments of the proficiency estimates were reasonable.
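The linear conversion described above can be sketched as a regression that is fitted only to individuals with an average posterior engagement probability above .99 and then applied to everyone. The function name `linking_coefficients` and the simulated data are hypothetical; the sketch uses an ordinary least squares line via NumPy, which is one plausible way to obtain such coefficients and not necessarily the procedure used in the article.

```python
import numpy as np

def linking_coefficients(theta_model, theta_reference, mean_posterior_engaged, cutoff=0.99):
    """Slope and intercept that map model estimates onto the reference (2PL) metric.

    Only individuals whose average posterior engagement probability exceeds
    `cutoff` are used to fit the line.
    """
    keep = np.asarray(mean_posterior_engaged) > cutoff
    slope, intercept = np.polyfit(
        np.asarray(theta_model)[keep], np.asarray(theta_reference)[keep], deg=1
    )
    return slope, intercept

# Hypothetical simulated data: for fully engaged individuals, the two sets of
# estimates differ only by a linear change of metric.
rng = np.random.default_rng(0)
theta_ref = rng.normal(size=500)
engaged = rng.uniform(0.8, 1.0, size=500)
theta_model = (theta_ref - 0.1) / 0.9
theta_model[engaged <= 0.99] += 0.3  # partially disengaged: higher adjusted estimates

slope, intercept = linking_coefficients(theta_model, theta_ref, engaged)
converted = slope * theta_model + intercept  # model estimates on the reference metric
print(round(slope, 3), round(intercept, 3))
```

After the conversion, differences between `converted` and `theta_ref` reflect the adjustment for disengagement rather than the arbitrary difference in metrics.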

Figure 5.

Scatter plots of EAP proficiency estimates provided by the 2PL IRT model (x-axis) and of equated EAP proficiency estimates provided by the DLC-IRT (y-axis in upper panels) and the ILC-IRT models (y-axis in lower panels) for individuals with different levels of engagement (average posterior probabilities of engagement).

Both ILC-IRT models provided proficiency estimates adjusted for response engagement that differed from the estimates of the 2PL IRT model (Figure 5). The marginal distribution of EAP proficiency estimates given by the ILC-RE-IRT model was only d = 0.08 standard deviations higher than that given in the 2PL IRT model. The adjustment was stronger in the case of the ILC-RI-IRT model (d = 0.12). Interestingly, the adjustment of proficiency estimates relative to the estimates of the 2PL IRT model in very poorly engaged individuals was not as extreme as in the DLC-IRT models. This is because the engagement propensity variable, which exhibited a large variability and was strongly correlated with proficiency in both ILC-IRT models, added extra information that prevented the proficiency estimates from being overly shrunken to the mean of their distribution.

Summary

All IRT models for response engagement supported the suspicion that the PIAAC literacy assessment was affected by disengaged responses (Goldhammer et al., 2016). In addition, the results indicate that the details of the models’ specifications matter. The preferred DLC-TL-IRT and ILC-RI-IRT models provided stronger adjustments of IRT item parameters and proficiency estimates than the DLC-SL-IRT and ILC-RE-IRT models. However, the information criteria do not tell whether the DLC-TL-IRT or the ILC-RI-IRT model should be chosen as the final model. Nevertheless, we can conclude that DLC-IRT and ILC-IRT models fit the data better than the reference models that do not separate response states and that, within each type of IRT model for response engagement, one parameterization fits the data better than the other. We do not consider this aspect of model selection to be problematic because, in real applications, researchers should decide on one type of IRT model for response engagement on the basis of the perspective they have on response times. Interestingly, in the present applications, the results pertaining to the item responses and the proficiencies provided by these two preferred DLC-IRT and ILC-IRT models were remarkably similar. Given the fundamental differences between the two types of models, such results should not be taken for granted. Indeed, a closer look at the relationships between response times and posterior engagement probabilities revealed non-monotonic relationships in the ILC-RI-IRT model (i.e., some responses associated with very long response times were classified as disengaged; details available upon request). However, in the present application, only a small number of responses was affected by the non-monotonic relationships, which means that the model estimates were not overly affected by this phenomenon.

Summary and Discussion

IRT models for response engagement are tools that allow item responses and response times to be considered simultaneously, while avoiding the necessarily error-prone procedures for classifying the engagement status of item responses. IRT models for response engagement build upon different assumptions, which means that different IRT models can give different results when they are applied to the same data. Therefore, the objectives of the present article were: (1) to outline the rationale of the commonly employed methods for setting response time thresholds, (2) to describe the assumptions they share with different types of IRT models for response engagement, (3) to suggest model extensions that increase the IRT models’ flexibility, and (4) to provide a flexible MM-IRT framework that makes it possible to estimate various IRT models for response engagement.

Types of IRT Models for Response Engagement

We have shown that IRT models for response engagement and observed-variable procedures are built on common ideas, and we have proposed the distinction between ILC-IRT models, which consider response times as indicators of response engagement, and DLC-IRT models, in which response times serve as predictors of engagement.

To date, most IRT models for response engagement belong to the class of ILC-IRT models, in which response engagement is defined jointly on the basis of item responses and response times. However, because the continuous log response times carry more information (Wang & Xu, 2015), they are likely to determine the classification of response engagement more strongly than item responses. As such, ILC-IRT models are connected to the VIB method, which uses only response times, but this connection is loose because ILC-IRT models use the complete data and are based on additional structural assumptions. In our view, ILC-IRT models can be best understood as extensions of van der Linden’s (2007) speed-accuracy IRT model, which make it possible to account for violations of the constant time expenditure assumption due to disengaged responses. Disengaged item responses are specified to not depend on proficiency and to be accompanied by response times that stem from a different distribution than that of engaged responses. In ILC-IRT models, response times contribute to the classification of response engagement according to their typicality for each engagement state (Equation 12), which means that these models could classify very slow responses as disengaged responses. From the perspective of the typical time expenditure assumption, such non-monotonic relationships make sense because both fast and very slow responses could be considered to be atypical and disengaged. The variants of ILC-IRT models presented here use response times differently for classifying response engagement.
ILC-RE-IRT models assume that engaged response times differ from an unstructured reference distribution, which means that the more strongly an item’s response time deviates from the reference distribution, the more likely it is that the response is engaged. In contrast, ILC-RI-IRT models state that both engaged and disengaged responses reflect individuals’ typical time expenditure and that the separation between engaged and disengaged response times is the same for all individuals. As shown in the exemplary application, the variants of ILC-IRT models can provide different results. Therefore, researchers need to carefully consider which type of ILC-IRT model is more plausible. The ILC-RI-IRT model might be a better choice when tests consist of items that require nontrivial time expenditure even when responses are not engaged.

By relating response times to response engagement based on the theoretical consideration that faster responses can, with higher certainty, be considered to be disengaged, and without resorting to assumptions about the structure of response times, DLC-IRT models closely mirror the perspectives taken in the research on rapid guessing behavior. Therefore, although DLC-IRT models have not attracted much attention in the literature (but see Pokropek, 2016), we believe that they have much to offer. In this article, we have proposed two extensions of Pokropek’s (2016) DLC-IRT model that both include item-specific response time thresholds that are based on a rationale similar to that of the combined VICS and VII methods (Wise, 2019). As such, DLC-IRT models can be conceived as model-based, one-step counterparts of the VICS-VII procedure. Of the two variants of DLC-IRT models presented, the DLC-SL-IRT model is closest to the typical approach of classifying each individual’s response engagement on the basis of the same response time thresholds.
The DLC-TL-IRT model is an extension thereof that allows for individual differences in response time thresholds that could be correlated with proficiency. The failure to include such a component in DLC-IRT models has been discussed as a potential source of bias (Pokropek, 2016; Rios et al., 2017), and we have demonstrated that including individual differences in response time thresholds could indeed affect item parameter and proficiency estimates.

Taken together, the DLC-IRT and ILC-IRT models are conceptually different and fit different theoretical perspectives on response times as reflections or causes of response engagement. Therefore, decisions in favor of one type of model should ideally be theoretically justified, or the substantive considerations underlying the decision should at least be clearly laid out. Such explanations should be self-evident in research. In the present context, they gain additional importance because DLC-IRT and ILC-IRT models use the input data in different ways, so that decisions cannot be guided by information criteria.
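To make the predictor perspective of DLC-IRT models more concrete, the following minimal sketch treats the log response time as a predictor of the engagement state through a logistic function with an item-specific threshold. The functional form, the slope, the constant success probability for disengaged responses, and all function and parameter names are assumptions made for illustration; they are not taken from the article's model equations.

```python
import math

def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def p_engaged(log_rt, item_threshold, slope=2.0, person_shift=0.0):
    """Probability of an engaged response, increasing in log response time.

    person_shift = 0 applies the same threshold to every individual
    (in the spirit of the DLC-SL-IRT model); a nonzero person_shift moves
    the threshold for one individual (in the spirit of the DLC-TL-IRT model).
    slope and person_shift are hypothetical illustration parameters.
    """
    return logistic(slope * (log_rt - (item_threshold + person_shift)))

def p_correct(theta, log_rt, a, b, item_threshold, guess=0.25,
              slope=2.0, person_shift=0.0):
    """Marginal probability of a correct response.

    Engaged responses follow a 2PL model with discrimination a and
    difficulty b; disengaged responses succeed with a probability `guess`
    that does not depend on proficiency theta (hypothetical value).
    """
    pe = p_engaged(log_rt, item_threshold, slope, person_shift)
    return pe * logistic(a * (theta - b)) + (1.0 - pe) * guess

# Faster responses are weighted toward the proficiency-independent component.
for log_rt in (1.0, 3.0, 5.0):
    print(log_rt, round(p_correct(1.0, log_rt, a=1.2, b=0.0, item_threshold=3.0), 3))
```

In contrast to the indicator perspective, the engagement probability in this sketch is monotonic in response time by construction: longer response times can only make an engaged response more likely.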

The MM-IRT Framework

A further objective of this article was to describe an MM-IRT framework that allows a variety of IRT models for response engagement to be estimated by means of the Mplus software. Our framework is based on a diagonal arrangement of the latent variables’ indicators. This setup combines the strengths of traditional wide and long data formats. An advantage of the wide data format is that all measurement parameters can differ between indicators and latent classes. However, this flexibility comes at the price of requiring a large number of latent class variables, which limits the application of IRT models for response engagement to rather small numbers of items (e.g., Molenaar et al., 2016). In the long data format, multiple latent class variables can be replaced with one class variable located on the within-individual level. The measurement parameters of the indicators of the latent continuous variables are modeled by the effects of the item indicators and their interactions with the latent class variable (Pokropek, 2016). The drawbacks of this arrangement are that it is less flexible in modeling residual structures and that the estimation of item-specific discrimination and loading parameters is computationally demanding. The diagonal data arrangement provides the same flexibility as the wide format and allows a large number of items to be included in the models, while still supporting computationally efficient MML estimation. The MM-IRT framework provides great flexibility because, in principle, all parameters can be specified to be item-specific. The IRT models applied in this article include fewer constraints than the models employed in the literature (e.g., Ulitzsch et al., 2020), and the constraints can be further relaxed. An important aspect of the MM-IRT framework is that it is not limited to item response times, and it allows alternative predictors or indicators of response engagement to be incorporated into IRT models for response engagement. 
This feature makes it possible to investigate the utility of other within-individual variables, such as interest ratings (e.g., Lindner et al., 2019) or repeating response patterns (e.g., Cui, 2020), as alternative or additional predictors or indicators of response engagement in DLC-IRT and ILC-IRT models, respectively. Similarly, the proposed models can be extended by between-individual variables, which makes it possible to study their relationships with individual differences in thresholds or engagement (e.g., Nagy et al., 2019). These possibilities should be explored in more detail in further research.
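One plausible reading of the diagonal data arrangement described above is sketched below: the data are stacked so that each row belongs to one person-item pair, but each item keeps its own response column, so that the observed responses form a diagonal pattern within each person's block of rows. This layout, the helper name `to_diagonal`, and the column names are assumptions for illustration; the exact arrangement used in the article's Mplus inputs is documented in the Supplemental Material and is not reproduced here.

```python
def to_diagonal(wide_rows, n_items):
    """Rearrange wide-format records into a 'diagonal' stacked layout.

    wide_rows: list of dicts with keys 'id', 'resp' (list of item scores)
    and 'log_rt' (list of log response times), one dict per person.
    Returns one row per person-item pair. The response to item i is stored
    in column y{i}; all other response columns are None (missing).
    """
    rows = []
    for person in wide_rows:
        for i in range(n_items):
            row = {"id": person["id"], "log_rt": person["log_rt"][i]}
            for j in range(n_items):
                row[f"y{j + 1}"] = person["resp"][i] if i == j else None
            rows.append(row)
    return rows

# Hypothetical example: one person, three items.
example = [{"id": 1, "resp": [1, 0, 1], "log_rt": [3.2, 1.1, 2.7]}]
for row in to_diagonal(example, n_items=3):
    print(row)
```

Under this reading, item-specific parameters can be attached to the item-specific columns (as in the wide format), while a single within-individual latent class variable applies to each row (as in the long format).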

Further Research

The models presented here are, in some respects, more constrained than some other models suggested in the literature. The IRT model applied to engaged responses does not include pseudo-guessing parameters, which might be desired in the case of multiple-choice items. We plan to explore this modeling option in future studies. A concern in the context of ILC-IRT models could be the assumption that item-specific engagement states are conditionally independent. An alternative is to consider that item-specific engagement states depend on previous states. Molenaar et al. (2016) suggested a model in which response processes are modeled by a hidden Markov process, but their model has not yet been applied to response engagement. Future research could investigate possibilities to include such dependencies in IRT models for response engagement. Alternatively, an interesting line of research would be to study whether serial dependencies can be accounted for by means of DLC-IRT models, which are computationally less demanding. In this vein, more research could be devoted to DLC-IRT models to increase their flexibility, for example, by including item-specific effects of response times and accounting for measurement error in response times (Molenaar & de Boeck, 2018).

Conclusion

The MM-IRT framework makes it possible to specify and estimate a variety of ILC-IRT and DLC-IRT models that extend the arsenal of latent variable models for response engagement. Furthermore, the MM-IRT framework makes the IRT models for response engagement accessible to applied researchers. Indeed, as shown in the Supplemental Material, the Mplus input code for specifying the models is not complicated. We hope that the ideas and the procedures outlined here will prove useful to researchers studying the impact of response engagement (Wise, 2015) and will stimulate research that extends our knowledge about test-taking motivation (e.g., Nagy et al., 2018).

Supplemental material, sj-docx-1-epm-10.1177_00131644211045351 for A Multilevel Mixture IRT Framework for Modeling Response Times as Predictors or Indicators of Response Engagement in IRT Models by Gabriel Nagy and Esther Ulitzsch in Educational and Psychological Measurement

10 in total

1. Investigating population heterogeneity with factor mixture models.

Authors: Gitta H Lubke; Bengt Muthén
Journal: Psychol Methods Date: 2005-03

2. Evaluating cognitive theory: a joint modeling approach using responses and response times.

Authors: Rinke H Klein Entink; Jörg-Tobias Kuhn; Lutz F Hornke; Jean-Paul Fox
Journal: Psychol Methods Date: 2009-03

3. A mixture hierarchical model for response times and response accuracy.

Authors: Chun Wang; Gongjun Xu
Journal: Br J Math Stat Psychol Date: 2015-04-15 Impact factor: 3.380

4. Hidden Markov Item Response Theory Models for Responses and Response Times.

Authors: Dylan Molenaar; Daniel Oberski; Jeroen Vermunt; Paul De Boeck
Journal: Multivariate Behav Res Date: 2016-08-11 Impact factor: 5.923

5. A mixture model for responses and response times with a higher-order ability structure to detect rapid guessing behaviour.

Authors: Jing Lu; Chun Wang; Jiwei Zhang; Jian Tao
Journal: Br J Math Stat Psychol Date: 2019-08-06 Impact factor: 3.380

6. Response Mixture Modeling: Accounting for Heterogeneity in Item Characteristics across Response Times.

Authors: Dylan Molenaar; Paul de Boeck
Journal: Psychometrika Date: 2018-02-01 Impact factor: 2.500

7. A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response.

Authors: Esther Ulitzsch; Matthias von Davier; Steffi Pohl
Journal: Br J Math Stat Psychol Date: 2019-11-10 Impact factor: 3.380

8. On a New Algorithm for Removing Repeating Patterns in Similarity Analysis.

Authors: Zhongmin Cui
Journal: Educ Psychol Meas Date: 2019-10-21 Impact factor: 2.821

9. An Overview of Models for Response Times and Processes in Cognitive Tests.

Authors: Paul De Boeck; Minjeong Jeon
Journal: Front Psychol Date: 2019-02-06

10. The Onset of Rapid-Guessing Behavior Over the Course of Testing Time: A Matter of Motivation and Cognitive Resources.

Authors: Marlit Annalena Lindner; Oliver Lüdtke; Gabriel Nagy
Journal: Front Psychol Date: 2019-07-23
