
Precision and Sample Size Requirements for Regression-Based Norming Methods for Change Scores.

Zhengguo Gu¹, Wilco H M Emons¹, Klaas Sijtsma¹.

Abstract

To interpret a person's change score, one typically transforms the change score into, for example, a percentile, so that one knows a person's location in a distribution of change scores. Transformed scores are referred to as norms and the construction of norms is referred to as norming. Two often-used norming methods for change scores are the regression-based change approach and the T Scores for Change method. In this article, we discuss the similarities and differences between these norming methods, and use a simulation study to systematically examine the precision of the two methods and to establish the minimum sample size requirements for satisfactory precision.


Keywords: T Scores for Change method; change assessment; norming; regression-based change approach; regression-based norming


Year: 2020 PMID: 32336114 PMCID: PMC7885019 DOI: 10.1177/1073191120913607

Source DB: PubMed Journal: Assessment ISSN: 1073-1911

This article concerns the practical use of change scores (e.g., posttest score minus pretest score) obtained from psychological tests or questionnaires for drawing inferences about change at the level of the individual. For example, if a patient is in treatment, the clinician may want to evaluate how this patient responds to the treatment in terms of change of mental health, distress, quality of life, or general functioning. Important questions include the following: Has the patient improved at all? Is the patient’s improvement on track? Is the observed change practically important? The answers to these questions may be used to tailor future treatment of the patient. Change scores often are based on counts of correct answers or sums of scores on rating scales observed at two measurement occasions. Without a frame of reference, scores are not directly usable for practical measurement (Angoff, 1984). For example, knowing that John had 18 out of 28 arithmetic items correct or that Mary scored 37 scale points out of 60 on an introversion scale means little if anything as long as their test scores cannot be related to a distribution of test scores or a performance standard with a well-established meaning. The same goes for change scores. Without an interpretative context, it is hard to say whether observed change of an individual is small or large, consistent with natural recovery, or lagging behind compared with the change of other patients undergoing the same treatment. To provide a frame of reference, one needs transformed scores, known as norms (Allen & Yen, 2002). A well-known type of transformed scores is norm-referenced norms (Allen & Yen, 2002), locating an individual amid a norm group. Another type of transformed scores, criterion-referenced norms (Allen & Yen, 2002), refers to diagnostic cutoffs that patients have to pass to be admitted to a course or a therapy, or to qualify levels of severity (e.g., mild vs. strong depression). 
In change assessment, criterion-referenced norms include the minimal clinical important difference, which is the minimum change a patient must show to qualify change as having practical impact. In this article, we focus on norm-referenced norms and study two often-used regression-based norming methods for change scores. Study of criterion-referenced norms deserves full attention in a separate article. The construction of norms is known as norming. Norming methods for change scores have received surprisingly little attention so far. In this study, we consider the simplest change score possible, which is the difference between the test score obtained after a treatment and the test score obtained prior to the treatment, known as posttest and pretest scores, respectively. Norming change scores can be challenging for at least two reasons. The first challenge is that, pretest scores, posttest scores, and consequently change scores contain measurement errors and therefore may be unreliable (Bereiter, 1963; Cronbach & Furby, 1970; Linn & Slinde, 1977; O’Connor, 1972). In practice, interindividual change typically has small variance (Gu et al., 2018), which may also cause low change-score reliability. The second challenge, in particular in the context of mental health care, is the heterogeneity of a group of patients. Often one cannot simply compare the change of one patient to the change of all other patients. To monitor a patient’s change, ideally the patient should be compared to patients of the same age, the same gender, the same comorbidity, or other relevant background variables. This means that researchers may need to take such information into account when constructing norms. Ideally, data are collected within all relevant subpopulations based on the relevant background variables, but because this approach may require a huge sample, it may be practically infeasible. 
A solution to the problem is continuous norming (Gorsuch, 1983), which uses statistical control to produce norms for subgroups from a relatively small sample. The goals of this study were to gain more insight into the precision of norms for change scores and to establish the sample sizes needed to obtain sufficiently precise norms. The structure of this article is the following. First, we present a general, regression-based framework for norming change scores, which includes two popular norming methods for change scores as special cases. The two methods are the regression-based change approach (Van der Elst et al., 2008) and the T Scores for Change method (McSweeny et al., 1993). Second, using a simulation, we study the precision of norms developed by means of the regression-based change approach and the T Scores for Change method. Finally, we discuss the results and provide recommendations for the minimum sample size needed to produce sufficiently precise norms.

Norming Methods for Change Scores

First, to provide an overview of the field and also as a precursor for the two methods we study in this research, we briefly discuss two approaches to quantifying individual change in the pretest–posttest design, which are the classical change-score approach (i.e., using the classical change score to quantify individual change) and the residual change-score approach (i.e., using the residual change score resulting from a regression model). Second, we present a general framework of norming change scores, which includes the regression-based change approach and the T Scores for Change method as special cases.

Two Approaches to Quantifying Individual Change

The classical change-score approach and the residual change-score approach (Willett, 1988) are widely adopted methods. Caruso (2004) discussed other methods for quantifying change that were used less frequently. For the classical change-score approach, we use the observable classical change score $D$, which is defined as the difference between the observable posttest score $X_{post}$ and the observable pretest score $X_{pre}$; that is, $D = X_{post} - X_{pre}$. Notice that $X_{pre}$, $X_{post}$, and $D$ are random variables, whose realizations are denoted by $x_{pre}$, $x_{post}$, and $d$, respectively. In classical test theory (Lord & Novick, 1968), an observable test score $X$ is assumed to be the sum of a true score $T$ and a random measurement error $E$ (i.e., $X = T + E$). A person's true score is defined as the expectation of $X$ across hypothetical independent administrations of the test, so that $T = \mathcal{E}(X)$ and $\mathcal{E}(E) = 0$. Let $\Delta$ denote the person's true change score, defined as $\Delta = T_{post} - T_{pre}$. Then, we can write $D = \Delta + (E_{post} - E_{pre})$. Because $\mathcal{E}(E_{post} - E_{pre}) = 0$, the classical change score $D$ is an unbiased estimate of the true change $\Delta$ (Rogosa et al., 1982), but this has not kept several researchers from questioning $D$ as a useful measure of individual change (e.g., Bereiter, 1963; Cronbach & Furby, 1970; Linn & Slinde, 1977; O’Connor, 1972). Others have supported the use of $D$ (e.g., Overall & Woodward, 1975; Rogosa et al., 1982; Williams & Zimmerman, 1977, 1996; Zimmerman & Williams, 1982a, 1982b). Gu et al. (2018) suggested that many negative beliefs about $D$, such as its alleged low reliability, rest on, for example, inappropriate assumptions, which mitigates the criticism of classical change scores. The residual change-score approach, also known as residual gains or base-free measurement of change, intends to correct for the correlation between pretest score and change score (Cronbach & Furby, 1970; Manning & Dubois, 1962; Willett, 1988) by means of the residualized posttest score. Let $\hat{X}_{post}$ be the predicted posttest score obtained by regressing $X_{post}$ on $X_{pre}$. Then, the residualized posttest score is defined as $X_{post} - \hat{X}_{post}$.
The residualized posttest score for a person shows how much more or less he or she has changed compared with the predicted average change of others with the same observed pretest score. For example, a positive residualized posttest score means that the person’s individual change is larger than the expected change of patients with the same observed pretest scores. A residualized posttest score of zero means that the person’s change equals the average change given the same observed pretest scores, but it does not mean that the person has not changed at all. Residualized posttest scores have been used in studies that search for predictors of interindividual differences in change (e.g., Castro-Schilo & Grimm, 2018). These predictors may be of theoretical interest, for example, to know whether recovery after brain injury can be explained by age. Variables that have been shown to predict interindividual change may also help the interpretation of individual scores. However, for individual assessment one should take into account that residualized scores have a very specific interpretation, which, as we will argue, coincides with a normative interpretation of change scores.
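The contrast between the two ways of quantifying change can be made concrete with a small sketch. The data below are hypothetical (a simulated pretest and a posttest that regresses toward the mean), and the helper names `simple_ols` and `pearson` are illustrative only. The point is that the residualized posttest score is uncorrelated with the pretest by construction, whereas the classical change score usually is not.

```python
import random
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def pearson(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
# Hypothetical data: pretest around 50, posttest regressing toward the mean.
pre = [rng.gauss(50, 10) for _ in range(500)]
post = [55 + 0.6 * (x - 50) + rng.gauss(0, 8) for x in pre]

# Classical change score: posttest minus pretest.
change = [b - a for a, b in zip(pre, post)]

# Residualized posttest score: observed posttest minus posttest predicted from pretest.
b0, b1 = simple_ols(pre, post)
resid = [b - (b0 + b1 * a) for a, b in zip(pre, post)]

r_pre_change = pearson(pre, change)   # typically negative in data like these
r_pre_resid = pearson(pre, resid)     # zero up to rounding error
```

A residual of zero for a given person therefore signals average change among people with the same pretest score, not absence of change.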

Regression-Based Norming for Change Scores: A General Framework

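Regression-based norming for test scores (Van Breukelen & Vlaeyen, 2005; Zachary & Gorsuch, 1985) generates the reference distribution of test scores by means of regression analysis. Let the random vector $\mathbf{Z} = (Z_1, \ldots, Z_K)$ contain relevant covariates, whose realization is denoted by $\mathbf{z} = (z_1, \ldots, z_K)$. Let $k$ index covariates ($k = 1, \ldots, K$). Regression-based norming assumes that the average test score can be predicted by relevant covariates, so that

$$\mathcal{E}(X \mid \mathbf{Z} = \mathbf{z}) = \beta_0 + \sum_{k=1}^{K} \beta_k z_k, \tag{1}$$

where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_K$ are regression coefficients for the $K$ covariates. It may be noted that Equation (1) may also include quadratic terms such as $z_k^2$ and include interactions between covariates. An observable test score, denoted by $x$, deviates from $\mathcal{E}(X \mid \mathbf{Z} = \mathbf{z})$ by residual $\varepsilon = x - \mathcal{E}(X \mid \mathbf{Z} = \mathbf{z})$. Computing $\varepsilon$ for the persons belonging to the subpopulation satisfying $\mathbf{Z} = \mathbf{z}$, we obtain a distribution of $\varepsilon$s, reflecting the relative position of persons within the same subpopulation (i.e., $\mathbf{Z} = \mathbf{z}$) on $X$. When deriving norm statistics in practice, one first estimates $\beta_0, \ldots, \beta_K$ in Equation (1), estimates denoted by $\hat{\beta}_0, \ldots, \hat{\beta}_K$, and then computes the residuals for all persons in the sample. Let $x_i$ denote the test score for person $i$, let $\mathbf{z}_i = (z_{i1}, \ldots, z_{iK})$ denote person $i$’s scores on the covariates, and let $e_i$ denote the residual for person $i$. Then, $e_i = x_i - (\hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k z_{ik})$. The distribution of $e_i$ is used to compute norm statistics (e.g., percentiles). Regression-based norming for test scores can readily be extended to norming change scores. Replacing $X$ in Equation (1) with an observable change score $D$, one obtains the population model

$$\mathcal{E}(D \mid \mathbf{Z} = \mathbf{z}) = \beta_0 + \sum_{k=1}^{K} \beta_k z_k. \tag{2}$$

An observable change score, denoted by $d$, deviates from $\mathcal{E}(D \mid \mathbf{Z} = \mathbf{z})$ by residual $\varepsilon = d - \mathcal{E}(D \mid \mathbf{Z} = \mathbf{z})$. The distribution of $\varepsilon$s reflects the relative position of persons among other persons within the same subpopulation defined by $\mathbf{Z} = \mathbf{z}$. Specifically, $\mathcal{E}(D \mid \mathbf{Z} = \mathbf{z})$ is the average change in the subpopulation satisfying $\mathbf{Z} = \mathbf{z}$. If $\varepsilon > 0$, then the person’s change score, $d$, is larger than the average change (i.e., $\mathcal{E}(D \mid \mathbf{Z} = \mathbf{z})$) in the corresponding subpopulation.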
In practice, when computing norm statistics, one first estimates the parameters in Equation (2), then computes the residuals for all the persons in the sample, and finally uses the distribution of the residuals to generate norm statistics. One may notice that Equation (2) is a general framework for norming change scores, and that we have not discussed which covariates should be included. In practice, covariates should be selected on substantive grounds, such as theory and domain knowledge. In addition, Oosterhuis et al. (2016) suggested that the selection might also be supported by statistical procedures, such as stepwise regression. Interestingly, depending on whether the pretest score is used as a covariate, the literature has identified two regression-based norming methods based on the general framework (Equation 2) that are frequently used in practice. They are the regression-based change approach (Van der Elst et al., 2008) and the T Scores for Change method (McSweeny et al., 1993). In the remainder of this section, we present the two norming methods and discuss their similarities and differences. The regression-based change approach (Van der Elst et al., 2008) assumes that change scores can be predicted by means of a few relevant covariates, and that the pretest score is not included as one of the covariates in Equation (2). The model used is

$$\mathcal{E}(D \mid \mathbf{Z} = \mathbf{z}) = \beta_0 + \sum_{k=1}^{K} \beta_k z_k. \tag{3}$$

Because Equation (3) directly models the change score, the regression-based change approach follows the classical change-score approach to quantifying individual change. The T Scores for Change method (McSweeny et al., 1993) requires that the pretest score is included as a covariate in Equation (2), so that

$$\mathcal{E}(D \mid X_{pre} = x_{pre}, \mathbf{Z} = \mathbf{z}) = \beta_0 + \beta_{pre} x_{pre} + \sum_{k=1}^{K} \beta_k z_k. \tag{4}$$

Adding $x_{pre}$ to both sides of Equation (4), we obtain

$$\mathcal{E}(D \mid X_{pre} = x_{pre}, \mathbf{Z} = \mathbf{z}) + x_{pre} = \beta_0 + (\beta_{pre} + 1) x_{pre} + \sum_{k=1}^{K} \beta_k z_k, \tag{5}$$

that is,

$$\mathcal{E}(X_{post} \mid X_{pre} = x_{pre}, \mathbf{Z} = \mathbf{z}) = \beta_0 + \beta^{*}_{pre} x_{pre} + \sum_{k=1}^{K} \beta_k z_k, \quad \beta^{*}_{pre} = \beta_{pre} + 1, \tag{6}$$

which is the model McSweeny et al. (1993) proposed.
Equation (6) shows that the T Scores for Change method follows the residual change-score approach to quantifying change, which at first glance appears to be different from the regression-based change approach. However, as we have shown from Equations (4) to (6), the T Scores for Change method and the regression-based change approach are both special cases of the general framework for norming change scores (Equation 2), and the only difference between the two methods resides in the inclusion of $X_{pre}$ in Equation (6). Here, we notice a special case of including $X_{pre}$ as a covariate when the model also includes a categorical variable, such as gender. Suppose that one of the covariates in the model of the regression-based change approach (i.e., Equation 3) is a categorical variable, denoted by $G$, such that

$$\mathcal{E}(D \mid G = g, \mathbf{Z} = \mathbf{z}) = \beta_0 + \beta_G g + \sum_{k} \beta_k z_k. \tag{7}$$

Equation (7) often is referred to as CHANGE, which is a method using the classical change score as the dependent variable to analyze the pretest–posttest control-group design (Van Breukelen, 2013). On the other hand, suppose that the T Scores for Change method (Equation 6) includes a categorical variable, so that Equation (6) becomes

$$\mathcal{E}(X_{post} \mid X_{pre} = x_{pre}, G = g, \mathbf{Z} = \mathbf{z}) = \beta_0 + \beta^{*}_{pre} x_{pre} + \beta_G g + \sum_{k} \beta_k z_k. \tag{8}$$

Equation (8) is the analysis of covariance (ANCOVA) model. CHANGE and ANCOVA may produce contradictory results with respect to group-mean differences in nonrandomized studies when groups differ on average at pretest, where ANCOVA indicates a mean effect, whereas CHANGE does not, which is known as Lord’s paradox (Lord, 1967; Van der Elst et al., 2008). Van Breukelen (2013) formally examined this issue and showed that applying ANCOVA and CHANGE to the same data could result in completely different conclusions, because ANCOVA assumes absence of a group effect at pretest and CHANGE assumes presence of such a group effect.
The implication of Van Breukelen’s research for our study is that, when using the T Scores for Change method, one may find that the regression coefficient of the categorical variable in Equation (8) is significant, but when using the regression-based change approach, one may find that the corresponding coefficient in Equation (7) is not significant. Before concluding the section, we remind the reader that regression-based norming requires the assumptions associated with regression analysis to hold for the application of interest. For detailed discussions on this topic, we refer the reader to Oosterhuis (2017).
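The divergence between CHANGE and ANCOVA can be illustrated with a minimal simulation sketch. Everything below is hypothetical: two groups with different pretest means, a pretest-posttest correlation of .5 within each group, and no average change in either group. The ANCOVA group effect is computed with the textbook adjusted-mean-difference formula (posttest mean difference minus the pooled within-group slope times the pretest mean difference), which equals the group coefficient in a regression of posttest on pretest and group.

```python
import random
from statistics import mean

rng = random.Random(7)

def simulate_group(mu, n, rho=0.5, sd=10.0):
    """Hypothetical group with a stable mean mu: no average change from pre to post."""
    pre = [rng.gauss(mu, sd) for _ in range(n)]
    noise_sd = sd * (1 - rho ** 2) ** 0.5
    post = [mu + rho * (x - mu) + rng.gauss(0, noise_sd) for x in pre]
    return pre, post

pre_a, post_a = simulate_group(mu=50, n=2000)
pre_b, post_b = simulate_group(mu=60, n=2000)

# CHANGE: group difference in the mean change score (posttest minus pretest).
change_effect = (mean(post_b) - mean(pre_b)) - (mean(post_a) - mean(pre_a))

def within_sums(pre, post):
    """Within-group cross-product and sum of squares around the group means."""
    mp, mq = mean(pre), mean(post)
    sxy = sum((x - mp) * (y - mq) for x, y in zip(pre, post))
    sxx = sum((x - mp) ** 2 for x in pre)
    return sxy, sxx

# ANCOVA: posttest mean difference adjusted with the pooled within-group slope.
sxy_a, sxx_a = within_sums(pre_a, post_a)
sxy_b, sxx_b = within_sums(pre_b, post_b)
b_within = (sxy_a + sxy_b) / (sxx_a + sxx_b)
ancova_effect = (mean(post_b) - mean(post_a)) - b_within * (mean(pre_b) - mean(pre_a))
```

In this setup the CHANGE effect is close to zero, whereas the ANCOVA effect is close to (1 − .5) × 10 = 5. This mirrors the situation in which the coefficient of a categorical covariate is significant under one norming method and not under the other.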

Deriving Norm Statistics

The two norming methods produce norm statistics in the same manner. To describe the procedure, we use the regression-based change approach as an example. The steps are the following.
Step 1: Compute for each person $i$ the predicted change score, denoted by $\hat{d}_i$, by means of $\hat{d}_i = \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k z_{ik}$, where $\hat{\beta}_0$ denotes the sample intercept and $\hat{\beta}_1, \ldots, \hat{\beta}_K$ denote the sample regression coefficients.
Step 2: Compute residual $e_i$, for person $i$, as $e_i = d_i - \hat{d}_i$, where $d_i$ is the observed change score for person $i$.
Step 3: One may use the distribution of the $e_i$s to obtain norm statistics, such as percentiles.
Sometimes, researchers transform the $e_i$s into standardized scores, when, for example, using the T Scores for Change method:
Step 3*: Compute the standardized residual $\tilde{e}_i = e_i / S_e$, where $S_e$ is the standard deviation of the residuals in the norm sample.
A few remarks are in order. First, when the T Scores for Change method is used, in Step 1, one computes the predicted posttest score $\hat{x}_{post,i}$ instead of $\hat{d}_i$ based on Equation (6), and then computes the residuals. Second, the T Scores for Change method owes its name to the fact that standardized residuals are rescaled to scores with M = 50 and SD = 10, which produces T scores (Allen & Yen, 2002): $T_i = 50 + 10\tilde{e}_i$. However, residuals may also be transformed to other well-known scales, such as Wechsler scores (i.e., IQ scores) by using the formula $100 + 15\tilde{e}_i$. Which scale is convenient depends on the application envisaged. Third, in practice, when a new patient arrives and is measured, the practitioner uses the fitted regression model and the distribution of residuals from the norm sample to obtain the normed scores for the new patient using tables or dedicated software (e.g., De Vroege et al., 2018). In this article, we used a simulation study to examine the two norming methods under various conditions and investigate the precision of estimated norms and the minimum sample size needed to obtain norms of high precision.
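The steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' implementation: the norm sample is simulated, the covariates (a 0/1 gender code and age) and all coefficient values are hypothetical, the helpers `ols_fit`, `predict`, and `norm_new_person` are illustrative names, and the residual standard deviation uses the denominator N − K − 1, which is one common choice.

```python
import random
from statistics import mean

def ols_fit(rows, y):
    """Least-squares coefficients [b0, b1, ...] for predictor rows and outcome y."""
    X = [[1.0] + list(r) for r in rows]           # prepend the intercept column
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for col in range(p):                           # Gauss-Jordan, partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
                c[r] -= f * c[col]
    return [c[i] / A[i][i] for i in range(p)]

def predict(coefs, row):
    """Predicted score for one person's covariate values."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))

rng = random.Random(3)
n = 400
# Hypothetical norm sample: gender (0/1), age, and an observed change score.
covs = [(rng.randint(0, 1), rng.uniform(4, 12)) for _ in range(n)]
change = [2.0 + 1.5 * g - 0.3 * a + rng.gauss(0, 4) for g, a in covs]

coefs = ols_fit(covs, change)                                    # Step 1
resid = [d - predict(coefs, z) for d, z in zip(change, covs)]    # Step 2
k = len(covs[0])
s_e = (sum(e * e for e in resid) / (n - k - 1)) ** 0.5           # assumed denominator
std_resid = [e / s_e for e in resid]                             # Step 3*
t_scores = [50 + 10 * z for z in std_resid]                      # T-score scale

def norm_new_person(d_new, z_new):
    """T score and percentile rank of a new person relative to the norm sample."""
    z = (d_new - predict(coefs, z_new)) / s_e
    pct = 100 * sum(zi <= z for zi in std_resid) / n
    return 50 + 10 * z, pct

t_new, pct_new = norm_new_person(d_new=6.0, z_new=(1, 8.0))
```

For the T Scores for Change method the same code applies with the posttest score as the outcome and the pretest score appended to each person's covariate row.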

Method

Data Generation

Population Model

We assumed that pretest score, , was determined by latent variable representing the attribute scale of interest at pretest. To identify the scale, we assumed that followed a standard normal distribution, . We assumed that the change on the attribute scale, denoted by , was partly predicted by (1) , (2) a dichotomous covariate, for example, gender, denoted by , and (3) a continuous covariate, age, denoted by , and assumed that , , and were independent of one another. We also assumed that the unexplained part of was subsumed under random residual , so that Thus, we assumed that the variance of , denoted by , was partly explained by the variances of , , and , and partly explained by the variance of . Including in Equation (13) is important, because it is exactly the residuals that we intended to norm. We further assumed that, if , , and exert no effect on (i.e., ), then Gu et al. (2018) showed that the choice of corresponds to 75% of the group showing a minimal important difference (MID; Norman et al., 2003; Schünemann & Guyatt, 2005; see supplementary Appendix A, available online). MID refers to the minimal change that clinicians consider important. Gu et al. (2018) showed that the larger the , the higher the change-score reliability. Therefore, manipulating the value of enabled us to examine how change-score reliability influenced the norms. In cases where , , and exerted effect on , we also assumed and , respectively. In the remainder of this section, we use to show how to obtain , , and in situations where , , and exerted effect on . Appendix B (supplementary table available online) presents the parameter values used in the simulation study. We further assumed that posttest score, , was determined by latent variable representing the attribute scale of interest at posttest, and was computed by taking the sum of and (i.e., ). Alternatively, one may first simulate and , and then define , but, unlike our approach, this approach does not allow directly manipulating . 
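The correlation between $\theta_{pre}$ and $\theta_D$, denoted by $\rho_{\theta_{pre}\theta_D}$, is of interest in psychological and educational research (e.g., Bryk & Raudenbush, 1987; Gu et al., 2018; Hertzog et al., 2008; Linn & Slinde, 1977; Raykov, 1993; Rogosa et al., 1982; Werker & Lalonde, 1988). We considered $\rho_{\theta_{pre}\theta_D} = 0$, $\rho_{\theta_{pre}\theta_D} = .1$ (small effect size, choice explained in Appendix C; Cohen, 1992), and $\rho_{\theta_{pre}\theta_D} = -.1$. One may notice that $\rho_{\theta_{pre}\theta_D} = \beta_1 \sigma_{\theta_{pre}} / \sigma_{\theta_D}$, where $\beta_1$ is the regression coefficient of $\theta_{pre}$ in Equation (13) and $\sigma_{\theta_{pre}} = 1$. Thus, when $\sigma^2_{\theta_D} = 0.14$, $\rho_{\theta_{pre}\theta_D} = 0$, .1, and −.1 correspond to $\beta_1 = 0$, .037, and −.037, respectively. Consistent with Oosterhuis et al. (2016), we assumed that the dichotomous covariate (gender) followed a Bernoulli distribution with probability .5, and the continuous covariate (age) followed a uniform distribution on the interval [4, 12]. We chose the regression coefficients of gender and age in Equation (13) such that the proportion of variance of $\theta_D$ explained by $\theta_{pre}$, gender, and age, denoted by $R^2$, corresponded to small effect size (i.e., $R^2 = .065$), medium effect size (i.e., $R^2 = .13$), and large effect size (i.e., $R^2 = .26$; Cohen, 1992), respectively. The two covariates in Equation (13) explained equal proportions of the variance of $\theta_D$. We refer to Appendix C (supplementary material available online) where we show how to obtain the regression coefficients of gender and age in Equation (13), and why correlation $\rho_{\theta_{pre}\theta_D}$ is restricted to have small effect size.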
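The conversion between the population correlation and the regression weight of the pretest latent variable can be checked numerically. The sketch below assumes the relation β = ρ · σ(θ change) / σ(θ pretest) with a standard normal pretest latent variable, uses the variance 0.14 reported in the design, and, as a simplifying assumption that is not part of the source, sets the effects of gender and age to zero so that the remaining variance is pure residual. The names `beta_pre`, `resid_sd`, and `pearson` are illustrative.

```python
import random
from math import sqrt
from statistics import mean

rho = 0.1                 # target correlation between theta_pre and theta_D
var_theta_d = 0.14        # variance of theta_D
sd_theta_pre = 1.0        # theta_pre is standard normal

# Regression weight implied by the target correlation: about 0.037.
beta_pre = rho * sqrt(var_theta_d) / sd_theta_pre

# Simulation check (assumption: no covariate effects, so the rest is residual).
rng = random.Random(11)
resid_sd = sqrt(var_theta_d - beta_pre ** 2)
theta_pre = [rng.gauss(0, 1) for _ in range(20000)]
theta_d = [beta_pre * t + rng.gauss(0, resid_sd) for t in theta_pre]
theta_post = [p + d for p, d in zip(theta_pre, theta_d)]   # posttest = pretest + change

def pearson(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r_hat = pearson(theta_pre, theta_d)    # should be close to 0.1
```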

Test Characteristics and Item Parameters

We considered tests containing 10, 20, and 40 items (Jabrayilov et al., 2016; Kruyen et al., 2013). Polytomous items were simulated using the graded response model (GRM; Samejima, 1969), which is a common choice in simulation research, because the GRM enables the easy manipulation of a test with desirable features. Because the GRM assumes that latent variable is a nonlinear function of test score , and the regression models we use for norming are linear functions of the covariates, one might object to the use of the GRM. However, it is well-known that for most tests and correlate high (Macdonald & Paunonen, 2002). For example, in an empirical study, Fan (1998) found that the correlation between and could be higher than .9. Suppose item has ordered scores, so that realization has values . Consistent with Likert-type scales, we chose . Let be the slope parameter and let be the threshold parameter; then, the GRM models the probability of obtaining a score as Slope parameter was sampled from a uniform distribution , and an average threshold was sampled from (Emons et al., 2007; Jabrayilov et al., 2016). Individual s were chosen as: , , , and . For , dichotomous item scores were generated. Parameter was sampled from a uniform distribution , and parameter was sampled from .
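A minimal sketch of generating polytomous item scores under the graded response model follows. It uses the usual logistic form for the cumulative probability of scoring in category x or higher and obtains category probabilities as differences of adjacent cumulative probabilities. The slope and threshold values are hypothetical placeholders, not the values used in the study, and the function names are illustrative.

```python
import random
from math import exp

def grm_probs(theta, a, thresholds):
    """Category probabilities P(X = 0..M | theta) under the graded response model."""
    # Cumulative probabilities P(X >= x): 1 for x = 0, logistic for x = 1..M, 0 beyond M.
    cum = [1.0] + [1.0 / (1.0 + exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(thresholds) + 1)]

def sample_item_score(theta, a, thresholds, rng):
    """Draw one item score by inverting the cumulative category distribution."""
    u, acc = rng.random(), 0.0
    probs = grm_probs(theta, a, thresholds)
    for score, p in enumerate(probs):
        acc += p
        if u < acc:
            return score
    return len(probs) - 1          # guard against rounding in the last category

def expected_score(theta, a, thresholds):
    """Expected item score at a given latent value."""
    return sum(x * p for x, p in enumerate(grm_probs(theta, a, thresholds)))

rng = random.Random(2)
a = 1.5                                   # hypothetical slope
thresholds = [-1.5, -0.5, 0.5, 1.5]       # hypothetical ordered thresholds (5 scores)

probs = grm_probs(0.0, a, thresholds)
# Test score on a hypothetical 10-item test with identical items.
test_score = sum(sample_item_score(0.0, a, thresholds, rng) for _ in range(10))
```

With a single threshold the same functions produce dichotomous item scores, which corresponds to the two-score condition.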

Simulation Design

The completely crossed design had 3 × 2 × 15 × 3 × 3 × 2 = 1,620 cells and included the following factors:
Test length: 10, 20, and 40 items.
Number of item scores: 2 and 5.
Sample size: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, and 1500.
Effect size of covariates: $R^2$ = .065, .13, and .26.
Correlation between $\theta_{pre}$ and $\theta_D$: 0, .1, and −.1.
Variance of θ change: 0.14 and 1.14.
For each cell, we generated data using the following steps.
Step 1: Item parameters were sampled.
Step 2: A sample of person parameters at pretest, $\theta_{pre}$, was randomly drawn from the standard normal distribution ($N(0, 1)$), and a sample of person-change parameters, $\theta_D$, was randomly drawn based on Equation (13). Based on these samples, the sample of person parameters at posttest was obtained using $\theta_{post} = \theta_{pre} + \theta_D$. Item-score data sets of pretest and posttest were generated given $\theta_{pre}$ and $\theta_{post}$.
Step 3: For each design cell, we repeated Step 2 1,000 times, and as a result, 1,000 item-score data sets of pretest and posttest administrations were generated.
Step 4: For each data set, we computed standardized residuals based on the regression-based change approach and the T Scores for Change method.
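The size of the fully crossed design can be verified by enumerating the factor levels. The sketch below takes the levels from the factor list and Table 1; the dictionary keys are illustrative names.

```python
from itertools import product

design = {
    "test_length": [10, 20, 40],
    "n_item_scores": [2, 5],
    "sample_size": list(range(100, 1501, 100)),   # 100, 200, ..., 1500
    "r2_covariates": [0.065, 0.13, 0.26],
    "rho_pre_change": [0.0, 0.1, -0.1],
    "var_theta_change": [0.14, 1.14],
}

# One dictionary per design cell, covering every combination of factor levels.
cells = [dict(zip(design, combo)) for combo in product(*design.values())]
n_cells = len(cells)
```

Enumerating the cells this way yields 1,620 combinations, the number of cells referred to in the Results section.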

Dependent Variables and Data Analysis

Rank Correlation

Using Kendall’s tau (Kendall, 1938), we computed three different rank correlations. They are (a) Rank correlation (denoted by $r_{E,T}$) between the residual in Equation (13) and the standardized residuals produced by the T Scores for Change method; (b) Rank correlation (denoted by $r_{E,D}$) between the residual in Equation (13) and the standardized residuals produced by the regression-based change approach; and (c) Rank correlation (denoted by $r_{D,T}$) between the standardized residuals produced by the regression-based change approach and the T Scores for Change method. It may be noted that $r_{E,T}$ almost equal to 1 means that the relative position in the sample norm-distribution produced by the T Scores for Change method preserves the relative position in the distribution of the residual in Equation (13). Rank correlations $r_{E,D}$ and $r_{D,T}$ are interpreted similarly.
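For readers who want to reproduce this type of rank correlation, a minimal sketch of Kendall's tau follows. It implements the tau-b variant, which handles ties and reduces to the original coefficient when there are none. Whether the authors used this variant is not stated, so it is an assumption here. The quadratic pair loop is chosen for clarity, not speed.

```python
def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long sequences of numbers."""
    conc = disc = tie_x = tie_y = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                 # tied on both variables: ignored
            if dx == 0:
                tie_x += 1               # tied on x only
            elif dy == 0:
                tie_y += 1               # tied on y only
            elif dx * dy > 0:
                conc += 1                # concordant pair
            else:
                disc += 1                # discordant pair
    denom = ((conc + disc + tie_x) * (conc + disc + tie_y)) ** 0.5
    return (conc - disc) / denom

# Small illustration with hypothetical values.
tau_same = kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4])
tau_reversed = kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1])
tau_one_swap = kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4])
```

In the third illustration five of the six pairs are concordant and one is discordant, so the coefficient equals (5 − 1) / 6.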

Precision

We considered the precision of the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles of the standardized residuals generated by the regression-based change approach and the T Scores for Change method. Precision expressed by the standard deviation of the sampling distribution is not suitable for percentile estimates that typically are not normally distributed. Alternatively, we used the 95% interpercentile range (IPR), which is a distribution-free measure defined as the difference between the 97.5th percentile and the 2.5th percentile of the distribution of standardized residuals, based on 1,000 data sets in each design cell. The higher the IPR, the lower the precision.
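The idea behind the IPR can be sketched as follows. The code reads the IPR as the spread of a percentile estimate over replicated samples, which is an interpretation of the description above, and it uses standard normal draws as a stand-in for standardized residuals. The number of replications is reduced to 300 to keep the example fast, and the function names are illustrative.

```python
import random

def percentile(values, p):
    """p-th percentile (0 to 100) by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def ipr_of_percentile(n, p, reps, rng):
    """95% interpercentile range of the p-th percentile estimate over replications."""
    estimates = [
        percentile([rng.gauss(0, 1) for _ in range(n)], p)   # one replicated sample
        for _ in range(reps)
    ]
    return percentile(estimates, 97.5) - percentile(estimates, 2.5)

rng = random.Random(5)
ipr_small = ipr_of_percentile(n=100, p=90, reps=300, rng=rng)
ipr_large = ipr_of_percentile(n=1000, p=90, reps=300, rng=rng)
```

A smaller IPR for the larger sample indicates a more precise percentile estimate.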

Results

Rank Correlations

Figure 1 presents boxplots of rank correlations against sample size $N$. Because rank correlations remained approximately the same as sample size $N$ increased, we singled out one sample size and presented in Table 1 the median values of rank correlations $r_{E,T}$, $r_{E,D}$, and $r_{D,T}$, estimated change-score reliability (denoted by $r_{DD'}$) using coefficient $\alpha$ (Cronbach, 1951), sample variances of pretest and posttest scores (denoted by $S^2_{X_{pre}}$ and $S^2_{X_{post}}$), sample correlation between pretest and posttest scores (denoted by $r_{X_{pre}X_{post}}$), and sample correlation between pretest scores and change scores (denoted by $r_{X_{pre}D}$). To interpret the results, we use the first row in Table 1 (i.e., test length equal to 10 items) as an example. Recall that 1,000 item-score data sets were generated in each cell. First, for each cell with a test length of 10 items, we computed the mean rank correlation $r_{E,T}$ across the 1,000 item-score data sets. Then, the median of these mean rank correlations was determined across the levels of the other four design factors, including the number of item scores, the effect size of covariates, the population correlation between $\theta_{pre}$ and $\theta_D$, and the variance of $\theta_D$, and was found to equal .35.

Figure 1.

Boxplots of (panel a) and (panel b), when .

Table 1.

	Rank correlation
	$r_{E,T}$	$r_{E,D}$	$r_{D,T}$	$r_{DD'}$*	$S^2_{X_{pre}}$	$S^2_{X_{post}}$	$r_{X_{pre}X_{post}}$	$r_{X_{pre}D}$
Test length
10 items	.35	.32	.77	.68	55.17	55.21	.63	−.41
20 items	.41	.37	.82	.80	228.32	181.05	.67	−.32
40 items	.44	.39	.80	.88	909.76	667.98	.69	−.36
Number of item scores
2	.37	.33	.78	.78	31.84	36.51	.63	−.40
5	.45	.39	.81	.84	426.62	474.97	.68	−.32
Effect size of covariates
$R^2$ = .065	.48	.46	.83	.81	125.14	146.47	.67	−.30
$R^2$ = .13	.44	.39	.78	.81	121.63	130.34	.64	−.38
$R^2$ = .26	.37	.31	.71	.83	118.73	115.34	.61	−.50
Population correlation between $\theta_{pre}$ and $\theta_D$
$\rho_{\theta_{pre}\theta_D}$ = 0	.42	.37	.79	.82	124.95	129.43	.63	−.37
$\rho_{\theta_{pre}\theta_D}$ = −.1	.43	.37	.77	.82	124.95	124.90	.61	−.39
$\rho_{\theta_{pre}\theta_D}$ = .1	.41	.37	.82	.82	124.95	136.81	.65	−.32
Variance of $\theta_D$
$\sigma^2_{\theta_D}$ = 0.14	.34	.32	.85	.70	124.95	131.52	.79	−.25
$\sigma^2_{\theta_D}$ = 1.14	.52	.45	.73	.89	124.95	125.77	.51	−.47

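Note. *$r_{DD'}$ was computed based on Lord and Novick (1968, p. 76) using coefficient $\alpha$. Let $\alpha_{pre}$ denote the estimated reliability at pretest using coefficient $\alpha$. Let $\alpha_{post}$ denote the estimated reliability at posttest using coefficient $\alpha$. Then, $r_{DD'} = \dfrac{\alpha_{pre} S^2_{X_{pre}} + \alpha_{post} S^2_{X_{post}} - 2 r_{X_{pre}X_{post}} S_{X_{pre}} S_{X_{post}}}{S^2_{X_{pre}} + S^2_{X_{post}} - 2 r_{X_{pre}X_{post}} S_{X_{pre}} S_{X_{post}}}$. As an aside, the reader may notice that, for the last 5 rows, the $S^2_{X_{pre}}$s are exactly the same, which is due to the same seed used (please see the R script in the supplementary material, available online).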

The Median Values of Estimated Rank Correlations, Estimated Change-Score Reliability ($r_{DD'}$) Using Coefficient $\alpha$ (Cronbach, 1951), Sample Variances of Pretest and Posttest ($S^2_{X_{pre}}$ and $S^2_{X_{post}}$), Sample Correlation Between the Pretest and the Posttest ($r_{X_{pre}X_{post}}$), and Sample Correlation Between the Pretest and Change ($r_{X_{pre}D}$).

Based on Table 1, we made the following observations. First, the median of $r_{E,T}$ was slightly higher than the median of $r_{E,D}$. In general, both $r_{E,T}$ and $r_{E,D}$ were lower than .5, suggesting that, given the current simulation setup, the relative position in the sample norm-distribution produced by the two norming methods largely differed from the relative position in the distribution of the residual in Equation (13). The low $r_{E,T}$ and $r_{E,D}$ might be attributed to the measurement errors randomly generated in the data generating procedure. Second, the median of $r_{D,T}$ in general was higher than .7. This suggests that the regression-based change approach and the T Scores for Change method generated fairly comparable norm statistics. Third, increasing test length and number of item scores was generally accompanied by higher rank correlations. Finally, larger variance of $\theta_D$ and smaller effect sizes of covariates were associated with higher rank correlations $r_{E,T}$ and $r_{E,D}$.

Relationship Between IPR and Sample Size

For each percentile and for the T Scores for Change method, Figure 2 shows the box plots of IPR against sample size across all 1,620 cells. We chose not to plot the percentiles generated by the regression-based change approach because the results were similar to Figure 2. The figure shows that, as sample size grew, the norms were more precise, which reflects the well-known, inverse relation between sample size and sampling variance for all statistics based on a sample of independent observations. Specifically, estimation precision sharply increased as sample size increased from 100 to 500. As sample size reached 1,500, the increase of precision leveled off. The figure suggests that, when designing a study for normative data, a sample size between 500 and 1,500 may be desirable, but based on this study suggestions for exact sample sizes do not seem to be realistic.

Figure 2.

Relationship between sample size (N) and interpercentile range (IPR) for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles generated by the T Scores for Change method.

Relationship Between IPR and the Other Five Design Factors

Figures (3) through (7) present the relationship between IPR generated by means of the T Scores for Change method and the other five design factors, which were test length, number of item scores, effect size of covariates, correlation between $\theta_{pre}$ and $\theta_D$, and variance of θ change. In general, the figures suggest that the five design factors did not noticeably influence the precision by which percentiles of the change-score distribution were estimated, especially for the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles. An interesting result is the absence of an effect of test length on the IPRs, and thus on the precision by which norms were obtained. This may be surprising because shorter tests generate less reliable test scores, and therefore one would expect lower precision for shorter tests. However, the IPR reflects variability in the estimated norms across samples, and this variability is explained by both random sampling of persons and random measurement errors. Because we see a high impact of sample size on the IPRs but not of test length, the results suggest that the cross-sample variability in the percentiles arising from random person sampling outweighs the variability from measurement errors. This result is consistent with a study by Sijtsma and Emons (2011), who found that the power of an independent samples t test changed only a little when test length was manipulated but greatly when sample size was varied.

Figure 3.

Relationship between test length and interpercentile range (IPR) for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles generated by the T Scores for Change method.

Figure 4.

Relationship between number of item scores and interpercentile range (IPR) for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles generated by the T Scores for Change method.

Figure 5.

Relationship between effect size of covariates and interpercentile range (IPR) for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles generated by the T Scores for Change method.

Figure 6.

Relationship between the correlation between $\theta_{pre}$ and $\theta_D$ and interpercentile range (IPR) for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles generated by the T Scores for Change method.

Figure 7.

Relationship between variance of θ change and interpercentile range (IPR) for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles generated by the T Scores for Change method.

Although the simulation was a fully crossed factorial design, we chose not to use analysis of variance (ANOVA) to analyze IPR results. The reasons are that, first, the normality assumption of ANOVA was severely violated, and second, although log-transforming the IPR to some extent reduced the problem of nonnormality, ANOVA results showed that almost all the design factors were significant. The reasons for significance were the considerable sample size, hence large power, and outliers, of which there were quite a few; see Figures (3) through (7).

Discussion

Because of their simplicity and popularity, we limited attention to change scores obtained from pretest–posttest designs, but change scores are not limited to such designs. For example, to monitor cognitive decline in elderly people, norms developed for repeated administration of neuropsychological tests can offer diagnostic assistance. We showed that the regression-based change approach and the T Scores for Change method originated from the same general, regression-based framework for norming change scores. We advise test constructors to make critical decisions about the covariates, such as the pretest score, that they decide to include in the model.

Norming change scores is a challenging task. Our simulation study showed that the relative position of persons in the sample norm-distribution produced by the two norming methods largely differed from the relative position of persons in the error distribution, as witnessed by the low rank correlations. This suggests that decision-making (e.g., whether a patient’s health condition has improved compared with a normative sample) based solely on norm statistics may lead to biased conclusions. More studies are needed to understand the cause of the low rank correlations and to improve the norming methods.

Increasing sample size greatly improved the precision of norms, but the benefit of a larger sample size diminished quickly as the sample grew larger than 1,500 observations. Our simulation study suggested that a sample size of 500 is a reasonable minimum for norming change scores. Increasing the sample size to about 1,000 is still beneficial, but samples larger than 1,500 offer little improvement.

In the simulation study, we assumed that , , and were independent from each other, but that was dependent on and . It is common that change due to treatment is dependent on gender and age. For example, treatment effectiveness of substance use disorders is associated with gender (Polak et al., 2015). Older people are less responsive to medication and psychotherapy for anxiety disorders than younger people (Wetherell et al., 2013). may also depend on and . The presence of correlation between pretest score and covariates such as gender and age does not affect comparisons between individuals with the same covariate values, but it does complicate comparisons between individuals differing in gender or age. Specifically, the two norming methods discussed in this manuscript give different regression weights for the covariates if they correlate with the pretest score. That in turn can result in different orderings of individuals with respect to their standardized residual, even if the individuals have the same pretest score and also have the same posttest score. However, the two methods can also give different orderings of individuals of the same gender and age if these individuals differ in pretest score. This is because the two methods give a different regression weight to the pretest score when adjusting the posttest score.

The two regression-based norming methods for change scores assume a linear relationship between observed change scores and covariates. In the simulation study, we used a linear model to model the relation between and covariates and , and we generated the item-score data by means of the GRM, which posits a nonlinear relation between item scores and latent variables. Several authors have noticed that the resulting test score correlates highly with the latent variable, in the .90s, and, as a result, the use of the GRM in generating item-score data sets is common practice in psychometrics and has also been used in studying regression-based norming methods (Oosterhuis et al., 2016). However, we believe that it may be interesting to examine the potential influence of the nonlinearity caused by the GRM in simulation studies on norming methods.

In recent years, new methods have been applied to norming test scores. For example, the Box-Cox Power Exponential model, which is based on the Generalized Additive Model for Location, Scale, and Shape (GAMLSS; Rigby & Stasinopoulos, 2005), has been used to norm IQ scores (Voncken et al., 2017). GAMLSS-based models allow for the flexible specification of a raw test score distribution, including the mean, variance, skewness, and kurtosis, and therefore may be better suited in practice where the empirical item-score data set does not show desirable features such as a normal distribution of residuals. Thus, for future research, it might be interesting to investigate how new methods, such as the GAMLSS-based models, can be used to norm change scores.

Supplemental material, Appendix_A for Precision and Sample Size Requirements for Regression-Based Norming Methods for Change Scores by Zhengguo Gu, Wilco H. M. Emons and Klaas Sijtsma in Assessment
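The point made in the Discussion about different regression weights can be made concrete with a small numerical sketch. It is a simplified illustration and is not claimed to reproduce the exact specification of either norming method: variant A regresses the posttest score on the pretest score and a covariate, so the pretest weight is estimated, whereas variant B regresses the raw change score on the covariate only, which fixes the pretest weight at 1. The data-generating values, the single covariate `age`, and the helper `standardized_residuals` are all hypothetical.

```python
import numpy as np


def standardized_residuals(y, predictors):
    """Ordinary least squares fit; returns standardized residuals and weights."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid / resid.std(ddof=X.shape[1]), beta


def rank_correlation(a, b):
    """Spearman rank correlation for continuous data without ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Hypothetical data: the covariate (age, standardized) correlates with pretest.
rng = np.random.default_rng(1)
n = 5000
age = rng.normal(0.0, 1.0, n)
pre = 20.0 + 2.0 * age + rng.normal(0.0, 4.0, n)
post = 8.0 + 0.7 * pre + 1.0 * age + rng.normal(0.0, 3.0, n)

# Variant A: posttest on pretest and covariate (pretest weight is estimated).
z_a, beta_a = standardized_residuals(post, [pre, age])
# Variant B: change score on covariate only (pretest weight is fixed at 1).
z_b, beta_b = standardized_residuals(post - pre, [age])

print("pretest weight, variant A:", round(float(beta_a[1]), 2))
print("age weight, variant A:", round(float(beta_a[2]), 2))
print("age weight, variant B:", round(float(beta_b[1]), 2))
print("rank correlation of standardized residuals:",
      round(rank_correlation(z_a, z_b), 3))
```

Because the covariate correlates with the pretest score in these hypothetical data, the two variants assign it different weights, and the rank correlation between the two sets of standardized residuals falls below 1, so some individuals are ordered differently.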
