Calibration and validation of an item bank for measuring general physical function of patients in medical rehabilitation settings.

Karon F Cook, Michael A Kallen, Deanna Hayes, Daniel Deutscher, Julie M Fritz, Mark W Werneke, Jerome E Mioduski.

Abstract

OBJECTIVE: The objective of this study was to report the item response theory (IRT) calibration of an 18-item bank to measure general physical function (GPF) in a wide range of conditions and evaluate the validity of the derived scores.
METHODS: All 18 items were administered to a large sample of patients (n=2337) who responded to the items in the context of their outpatient rehabilitation care. The responses, collected 1997-2000, were modeled using the graded response model, an IRT model appropriate for items with two or more response options. Inter-item consistency was evaluated based on Cronbach's alpha and item to total correlations. Validity of scores was evaluated based on known-groups comparisons (age, number of health problems, symptom severity). The strength of a single, general factor was evaluated using a bi-factor model. Results were used to evaluate IRT assumptions and as an indicator of construct validity. Local independence of item responses was also evaluated.
RESULTS: Response data met the assumptions of unidimensionality and local independence. Explained common variance of a single general factor was 0.88 (omega hierarchical =0.86). Only two of the 153 pairs of item residuals were flagged for local dependence. Inter-item consistency was high (0.93) as were item to total correlations (mean =0.61). Substantial variation was found in both IRT location (difficulty) and discrimination parameters. All omnibus known-groups comparisons were statistically significant (p<0.001).
CONCLUSION: Item responses fit the IRT unidimensionality assumptions and were internally consistent. The usefulness of GPF scores in discriminating among patients with different levels of physical function was confirmed. Future studies should evaluate the validity of GPF scores based on an adaptive administration of items.

Keywords:  computerized adaptive testing; functional status; item response theory; patient-reported outcomes; rehabilitation

Year:  2017        PMID: 29343994      PMCID: PMC5749388          DOI: 10.2147/PROM.S148788

Source DB:  PubMed          Journal:  Patient Relat Outcome Meas        ISSN: 1179-271X


Introduction

The Institute of Medicine has advocated,1 and a number of legislative efforts have supported,2–4 incentivizing performance instead of volume for the US health care delivery system. The envisioned future of a responsive, effective, and efficient health care delivery system that incentivizes performance requires the existence of psychometrically sound patient-reported outcomes measures (PROMs). Increasingly, PROMs are being administered using a tailored approach, known as computer adaptive testing (CAT).5,6 CAT has been developed for use in health outcomes,7,8 rehabilitation,9,10 and clinical applications.11,12 Adaptive item administration is attractive because it reduces respondent burden with little erosion of measurement precision.13,14 Focus On Therapeutic Outcomes, Inc. (FOTO) is an international measurement system that has provided data collection and reporting of medical rehabilitation outcomes since 1994.15,16 In 2001, FOTO began administering PROMs using CAT. The use of CAT requires the development of a bank of items that measure the targeted outcome and whose items have been calibrated using an item response theory (IRT) model.17 Most item banks developed by FOTO have targeted specific body parts.18–23 The purpose of this paper is to report on the calibration and evaluation of an item bank that is domain- rather than body-part-specific – the general physical function (GPF) scale.

Methods

Participants

Study data were drawn from a convenience sample of 2337 adult patients who were treated in clinical facilities participating with FOTO. These participants responded to all 18 items of the GPF item bank and to demographic and clinical questions. Data were collected from 1997 to 2000 in 20 different states in the USA. The study was ruled exempt from human subjects review by the institutional review board of Northwestern University, Chicago, IL, because the research involved the study of existing data that were recorded by the investigator in such a manner that participants could not be identified.

Instrumentation

GPF item bank

The GPF item bank includes 18 items originally developed to measure functional status. Eleven of the items were adapted from the RAND 36-Item Short Form Health Survey.24 The remaining seven items were developed by FOTO clinician scientists to extend the effective measurement range of the measure. These items targeted lower levels of physical functioning to ensure good discrimination at the “floor” of the measure.

Demographics and clinical characteristics

In addition to responses to GPF items, patients reported their sex, age, impairment category, comorbidity, and symptom acuity (“0” = Asymptomatic, no treatment needed at this time; “1” = Symptoms well controlled with current therapy; “2” = Symptoms controlled with difficulty, needs ongoing monitoring and affects daily functioning; “3” = Symptoms poorly controlled, needs frequent adjustment in treatment monitoring; and “4” = Symptoms poorly controlled, history of re-hospitalization).

Analyses

Item analyses, calibration, and scoring

Tests of IRT assumptions

Samejima’s logistic graded response model (GRM)31 was used to calibrate item responses. Like most IRT models, the GRM assumes response data are unidimensional and locally independent.17,25 Typically, the unidimensionality assumption is tested with a confirmatory factor analysis that posits a single-factor model and evaluates the fit of that model against standard fit criteria. Newer approaches fit a bifactor model, which allows a more direct evaluation of the relevant statistical question: whether item responses are unidimensional enough to warrant calibration using a unidimensional IRT model.26 The bifactor model posits that all items load on a single general factor and that each subset of items also loads on its own group factor. From such a model, the proportions of total variance (omega hierarchical) and common variance (explained common variance) accounted for by the general factor are estimated. To obtain these values, we fit a bifactor model using the psych package in R.27 Reise et al recommended a “tentative” minimum criterion for omega hierarchical of greater than 0.50 (with >0.75 being preferred)26 and for explained common variance of ≥0.60.28 Local independence was evaluated by extracting the residuals remaining after responses were fit to a unidimensional confirmatory factor model using MPlus.29 IRT models assume that these residuals are not correlated. Standards for evaluating local independence vary. Reeve et al recommended flagging and considering the deletion of items whose residuals correlate >0.20 with residuals of other items.30
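As a rough illustration of how the two bifactor indices are defined, the sketch below computes explained common variance and omega hierarchical from the standardized loadings of an orthogonal bifactor solution. The loadings and group assignments are hypothetical, not the GPF estimates, and the function name is an assumption made for this example; the study itself obtained these values with the psych package in R.

```python
def ecv_and_omega_h(general, group, membership):
    """Return (ECV, omega hierarchical) for an orthogonal bifactor solution.

    general    -- standardized general-factor loading of each item
    group      -- standardized group-factor loading of each item
    membership -- label of the group factor each item belongs to
    """
    # ECV: share of common variance explained by the general factor.
    general_sq = sum(g * g for g in general)
    group_sq = sum(s * s for s in group)
    ecv = general_sq / (general_sq + group_sq)

    # Unique variance of each item under standardized, orthogonal factors.
    unique = sum(1.0 - g * g - s * s for g, s in zip(general, group))

    # Sum the group loadings within each group factor before squaring.
    group_sums = {}
    for label, s in zip(membership, group):
        group_sums[label] = group_sums.get(label, 0.0) + s

    # Omega hierarchical: share of total score variance due to the general factor.
    general_var = sum(general) ** 2
    group_var = sum(v * v for v in group_sums.values())
    omega_h = general_var / (general_var + group_var + unique)
    return ecv, omega_h


# Hypothetical six-item example with two group factors of three items each.
general_loadings = [0.7] * 6
group_loadings = [0.3] * 6
group_labels = ["A", "A", "A", "B", "B", "B"]

ecv, omega_h = ecv_and_omega_h(general_loadings, group_loadings, group_labels)
print(round(ecv, 3), round(omega_h, 3))
```

Both indices rise as the general-factor loadings grow relative to the group-factor loadings, which is why they are read as evidence that responses are unidimensional enough for a unidimensional IRT model.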

Item level analyses

To estimate inter-item consistency, we calculated Cronbach’s alpha. We also estimated the correlation between each item score and the total score on the remaining items. For Cronbach’s alpha, a range of 0.70 to 0.80 has been recommended as a standard for group-level measurement.
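The two item-level statistics can be sketched as follows, using only the Python standard library. The response matrix is a small hypothetical example (six respondents, four items scored 0 to 2), not study data, and the function names are assumptions made for this illustration.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item-score lists."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(rows):
    """Correlate each item with the total score on the remaining items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]  # total excluding item j
        result.append(pearson(item, rest))
    return result


# Hypothetical responses: one row per respondent, one column per item.
responses = [
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 2],
    [2, 1, 2, 2],
    [2, 2, 2, 2],
    [0, 1, 0, 0],
]

alpha = cronbach_alpha(responses)
item_rest = corrected_item_total(responses)
print(round(alpha, 3), [round(r, 2) for r in item_rest])
```

Excluding the item from its own total avoids inflating the correlation, which matters most when the number of items is small.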

IRT calibration and scoring

Responses to the 18 GPF items were calibrated to the GRM31 using Parscale software.32 The GRM is appropriate for items with ordered polytomous responses, which is the format of the GPF items. The GRM allows item discrimination parameters (a) to vary, which is common for functional status items.33,34 After the GRM was fit, a linear transformation was performed so that GPF scores ranged from 0 to 100.
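To make the model concrete, the sketch below computes GRM category probabilities for a single three-category item. It assumes a logistic metric without a scaling constant and assumes that higher categories indicate less limitation; the text does not state either convention, so both are illustrative choices. The discrimination and threshold values are those reported for the toileting item in Table 2.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    theta      -- person location on the latent trait
    a          -- item discrimination
    thresholds -- ordered thresholds b1 < b2 < ... for the item
    """
    # Cumulative probability of responding in category k or higher;
    # it is 1 for the lowest category and 0 beyond the highest.
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-a * (theta - b))))
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Toileting item parameters as listed in Table 2 (a = 2.22, b1 = -1.23, b2 = -0.13).
a_toileting = 2.22
b_toileting = [-1.23, -0.13]

# A person located exactly at the first threshold.
probs = grm_category_probs(-1.23, a_toileting, b_toileting)
print([round(p, 3) for p in probs])
```

At a person location equal to the first threshold, the probability of the lowest category is one half by construction, which is a quick check that the thresholds are being read correctly.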

Construct validation

Known-groups construct validity

We hypothesized that lower GPF scores would be observed for those who were older, reported greater symptom severity, and had a higher number of health conditions. Participant ages were grouped into the ranges 18–44, 45–65, and >65. The five symptom severity categories were placed into four comparison groups: because few participants endorsed the most severe category (“4”), categories “3” and “4”, both of which include the descriptor “poorly controlled”, were combined into a single group. Comorbidity groups were those with none, one, two, three, and greater than three comorbidities. Known-groups hypotheses were tested first at the omnibus level (groups are significantly different overall) using analysis of variance (ANOVA). Comparisons between pairs of levels were made using Dunnett’s T3 post hoc test.35

Unidimensionality

The evaluation of unidimensionality described previously served dual purposes. Unidimensionality is an assumption of the IRT model used to calibrate the item responses. A finding of unidimensionality also supports the construct validity of the measure in that it indicates that, as hypothesized, GPF is a single construct.

Results

Table 1 summarizes the demographic and clinical characteristics of the sample. The majority of respondents were female (63.8%). Mean age in years was 61 (SD =18.3; range 18 to 99); 79.0% were 45 or older. The most common impairment category was stroke (24.2%), followed by orthopedic conditions (18.6%) and pain syndrome (14.4%). Just over half of the sample had experienced symptoms for more than 90 days (50.4%).

Table 1

Sample characteristics

| Characteristics | Values | n (total sample, N=2337) | % |
|---|---|---|---|
| Sex | Male | 843 | 36.2 |
| | Female | 1484 | 63.8 |
| | Missing (percentage of full sample) | 10 | 0.4 |
| Age (years) | 18–44 | 488 | 20.9 |
| | 45–65 | 785 | 33.6 |
| | ≥66 | 1060 | 45.4 |
| | Missing (percentage of full sample) | 4 | 0.2 |
| Impairment category | Stroke | 515 | 24.2 |
| | Brain dysfunction | 121 | 5.7 |
| | Neurologic condition | 220 | 10.3 |
| | Non-traumatic spinal cord dysfunction | 55 | 2.6 |
| | Traumatic spinal cord dysfunction | 55 | 2.6 |
| | Amputation | 40 | 1.9 |
| | Arthritis | 117 | 5.5 |
| | Pain syndrome | 307 | 14.4 |
| | Orthopedic conditions | 395 | 18.6 |
| | Cardiac pulmonary | 103 | 4.8 |
| | Congenital deformities | 9 | 0.4 |
| | Other disabling impairments | 192 | 9.0 |
| | Missing (percentage of full sample) | 208 | 9.8 |
| Acuity/onset (days) | 0–21 | 515 | 22.5 |
| | 22–90 | 617 | 27.0 |
| | ≥91 | 1152 | 50.4 |
| | Missing (percentage of full sample) | 53 | 2.0 |
| Severity index | Asymptomatic, no treatment needed at this time | 7 | 0.5 |
| | Symptoms well controlled with current therapy | 235 | 16.5 |
| | Symptoms controlled with difficulty, needs ongoing monitoring | 743 | 52.1 |
| | Symptoms poorly controlled, needs frequent adjustment in treatment | 407 | 28.6 |
| | Symptoms poorly controlled, history of re-hospitalization | 33 | 2.3 |
| | Missing (percentage of full sample) | 912 | 39.0 |
| Number of comorbidities | 0 | 739 | 31.6 |
| | 1 | 840 | 35.9 |
| | 2 | 457 | 19.6 |
| | ≥3 | 301 | 12.9 |
| | Missing (percentage of full sample) | 0 | 0 |

Based on a bifactor model of responses to the 18 GPF items, we obtained an omega hierarchical value of 0.86 and an explained common variance of 0.88. These values are substantially higher than Reise et al’s suggested criteria for omega hierarchical (ie, >0.75 preferred)26 and explained common variance (ie, ≥0.60), supporting the unidimensionality of the item responses.28 Assessment of local independence involved 153 possible paired comparisons between item residuals. Of these, only two had correlations >0.20. The residuals of the items, “How much does your health limit vigorous activities like running, lifting heavy objects, sports?” and “How much does your health limit participating in recreation?” had a correlation of 0.29. The residuals of the items, “How much does your health limit going on vacation?” and “How much does your health limit attending social events?” had a correlation of 0.26.
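The count of 153 comparisons is simply the number of distinct pairs among 18 items, and the flagging rule is a threshold on each pairwise residual correlation. The sketch below illustrates both. Only the two flagged correlations (0.29 and 0.26) come from the text; every other entry is set to zero as a placeholder rather than a real residual correlation, and the item indices and function name are assumptions made for this example.

```python
from itertools import combinations


def flag_local_dependence(resid_corr, cutoff=0.20):
    """Return (number of item pairs, list of pairs whose residual r exceeds cutoff)."""
    n_items = len(resid_corr)
    pairs = list(combinations(range(n_items), 2))
    flagged = [(i, j, resid_corr[i][j]) for i, j in pairs if resid_corr[i][j] > cutoff]
    return len(pairs), flagged


n_items = 18
# Placeholder matrix: zeros everywhere except the two pairs reported in the text.
resid = [[0.0] * n_items for _ in range(n_items)]


def set_pair(i, j, r):
    """Fill both symmetric cells of the residual correlation matrix."""
    resid[i][j] = r
    resid[j][i] = r


# Hypothetical zero-based positions following the row order of Table 2.
set_pair(16, 17, 0.29)  # recreation and vigorous activities
set_pair(8, 10, 0.26)   # social events and vacation

n_pairs, flagged_pairs = flag_local_dependence(resid)
print(n_pairs, flagged_pairs)
```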

Item analyses

Cronbach’s alpha for the GPF item responses was very high (0.93). This result indicated very high inter-item consistency. The mean item score to total score correlation was 0.61. Correlation values ranged from 0.34 for the two-response item (“Do you limit the kind of work or other regular daily activities as a result of your physical health?”) to 0.74 (two items: “How much does your health limit climbing one flight of stairs/walking several blocks?”). Table 2 presents the item parameter estimates obtained in the GRM calibration of the GPF items. Items varied in discrimination (a; slope) confirming the need for use of a two-parameter IRT model that accounts both for item location and item discrimination (one-parameter models’ slopes are equal across items). The average location (ie, difficulty) of items on the logit metric ranged from −0.68 (“How much does your health limit completing your toileting?”) to 2.24 (“How much does your health limit vigorous activities like running, lifting heavy objects, sports?”).

Table 2

Item parameters for the general physical function scale

| Item | Average location | a (discrimination) | b1 (threshold 1) | b2 (threshold 2) |
|---|---|---|---|---|
| Do you limit the kind of work or other regular daily activities as a result of your physical health?* | 2.06 | 1.05 | 2.06 | N/A |
| How much does your health limit completing your toileting? | −0.68 | 2.22 | −1.23 | −0.13 |
| How much does your health limit getting in and out of bed? | −0.42 | 2.31 | −0.97 | 0.13 |
| How much does your health limit walking around a room? | −0.41 | 2.32 | −0.97 | 0.14 |
| How much does your health limit getting in and out of a chair? | −0.32 | 2.88 | −0.88 | 0.23 |
| How much does your health limit bathing or dressing? | −0.23 | 2.16 | −0.78 | 0.32 |
| How much does your health limit walking one block? | 0.29 | 2.59 | −0.26 | 0.84 |
| How much does your health limit climbing one flight of stairs? | 0.45 | 2.82 | −0.11 | |
| How much does your health limit attending social events? | 0.58 | 1.77 | 0.03 | 1.14 |
| How much does your health limit walking several blocks? | 0.74 | 2.77 | 0.19 | 1.3 |
| How much does your health limit going on vacation? | 0.75 | 1.7 | 0.2 | 1.31 |
| How much does your health limit bending, kneeling, or stooping? | 0.78 | 2.54 | 0.22 | 1.33 |
| How much does your health limit lifting or carrying items like groceries? | 0.89 | 2.22 | 0.34 | 1.44 |
| How much does your health limit moderate activities like moving a table or pushing a vacuum cleaner? | 1.00 | 2.30 | 0.45 | 1.55 |
| How much does your health limit climbing several flights of stairs? | 1.04 | 2.42 | 0.49 | 1.59 |
| How much does your health limit walking more than a mile? | 1.47 | 2.06 | 0.92 | 2.02 |
| How much does your health limit participating in recreation? | 1.82 | 1.46 | 1.27 | 2.37 |
| How much does your health limit vigorous activities like running, lifting heavy objects, sports? | 2.24 | 1.56 | 1.69 | 2.8 |

Notes:

*Response categories for this item were “yes” and “no”. For all other items, responses were: “yes, limited a lot”, “yes, limited a little”, and “no, not limited at all”.

All omnibus known-groups comparisons were statistically significant (p<0.001) (Table 3). All but one pairwise post hoc group comparison was significant at this level. Those with two comorbidities did not have scores that were significantly greater than those with three or more (p=0.144). The results related to unidimensionality supported the hypothesis that functional status was a single construct when measured in patients in this context.

Table 3

Known-groups validity results (sample N=2337)

| Analysis of variance | Groups | Patients (n) | Mean | SD | p-value (omnibus F test) | F value |
|---|---|---|---|---|---|---|
| General physical function scores by age (years) | 18–44 | 488 | 47.8 | 21.0 | 0.000 | 52.1 |
| | 45–65 | 785 | 40.6 | 18.4 | | |
| | ≥66 | 1060 | 36.9 | 19.6 | | |
| General physical function scores by severity index | Symptoms well controlled | 235 | 46.6 | 19.9 | 0.000 | 16.6 |
| | Symptoms controlled with difficulty | 743 | 41.4 | 20.6 | | |
| | Symptoms poorly controlled (both poorly controlled categories combined) | 440 | 37.4 | 19.1 | | |
| General physical function scores by number of comorbidities | 0 | 739 | 45.1 | 20.6 | 0.000 | 31.7 |
| | 1 | 840 | 40.9 | 20.0 | | |
| | 2 | 457 | 36.6 | 18.2 | | |
| | ≥3 | 301 | 33.6 | 17.6 | | |
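
For readers who want to see how an omnibus F value relates to group summaries, the sketch below recomputes a one-way ANOVA F statistic from group sizes, means, and standard deviations. It uses the age-group values shown in Table 3 and is not the authors' analysis, which was run on the raw scores; the function name is an assumption made for this example. The result lands close to the reported F value of 52.1, with small differences expected because the published means and SDs are rounded.

```python
def anova_f_from_summary(groups):
    """One-way ANOVA F statistic from (n, mean, sd) summaries of each group.

    Returns (F, between-groups df, within-groups df).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n

    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups sum of squares: recovered from each group's sample SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# (n, mean GPF score, SD) for ages 18-44, 45-65, and 66 or older, from Table 3.
age_groups = [(488, 47.8, 21.0), (785, 40.6, 18.4), (1060, 36.9, 19.6)]

f_age, df_b, df_w = anova_f_from_summary(age_groups)
print(round(f_age, 1), df_b, df_w)
```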

Limitation

A limitation of this study is that the items were presented to respondents as a full bank, which is convenient for item calibration and evaluation but differs from administering them using CAT. Future studies should evaluate the validity of GPF scores based on an adaptive administration of items.

Conclusion

We examined an item bank with the purpose of assessing GPF of patients receiving care in a rehabilitation setting. Based on the factor analytic results, we concluded that a dominant general factor drove responses to items in this large and medically diverse sample, supporting the unidimensionality of the scale. The assumption of local independence was largely upheld. Inter-item consistency was very high (0.93) and, if the GPF items were intended as a single, 18-item measure, would warrant concerns about redundancy. However, the items were developed as an item bank for CAT administration. Because Cronbach’s alpha values are a function of the number of items in the scale as well as the covariances between item pair responses and the variance in total score, values are typically high in item banks, where the number of items tends to be large. The usefulness of GPF scores in discriminating among patients with different levels of functional status was confirmed by the results of the known-groups analyses. The GPF scores effectively distinguished groups expected to have different score levels.