
Observed agreement problems between sub-scales and summary components of the SF-36 version 2 - an alternative scoring method can correct the problem.

Graeme Tucker, Robert Adams, David Wilson.

Abstract

PURPOSE: A number of previous studies have shown inconsistencies between sub-scale scores and component summary scores using traditional scoring methods of the SF-36 version 1. This study addresses the issue in Version 2 and asks if the previous problems of disagreement between the eight SF-36 Version 1 sub-scale scores and the Physical and Mental Component Summary persist in version 2. A second study objective is to review the recommended scoring methods for the creation of factor scoring weights and the effect on producing summary scale scores.
METHODS: The 2004 South Australian Health Omnibus Survey dataset was used for the production of coefficients. There were 3,014 observations with full data for the SF-36. Data were analysed in LISREL V8.71. Confirmatory factor analysis models were fit to the data producing diagonally weighted least squares estimates. Scoring coefficients were validated on an independent dataset, the 2008 South Australian Health Omnibus Survey.
RESULTS: Problems of agreement were observed with the recommended orthogonal scoring methods which were corrected using confirmatory factor analysis.
CONCLUSIONS: Confirmatory factor analysis is the preferred method to analyse SF-36 data, allowing for the correlation between physical and mental health.

Year:  2013        PMID: 23593428      PMCID: PMC3625168          DOI: 10.1371/journal.pone.0061191

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The SF-36 and the shorter form SF-12 health status questionnaires have been used extensively in international studies to obtain summary measures of health status. The instruments have an extensive and well-founded methodological history deriving from the Medical Outcomes Study conducted by the RAND Corporation [1]. However, international concern has been raised about the validity of the recommended orthogonal scoring methods of Version 1 of the SF-36 for producing Physical and Mental Component Summary scores (PCS and MCS) [2]–[9]. Despite this, these scoring methods remain in widespread use; indeed, they are the default scoring approach around the world. Given that the instrument's subscales and summary scores are used by national agencies to guide policy [10] and by medical authorities to guide treatment and intervention decisions [11], it is important that questions of validity are addressed so that the best investment decisions can be made. The creation of Version 2 of the instrument led to a number of refinements to question item response categories, layout and norming of the questionnaire. The response options for the role physical and role emotional items, which contribute substantially to the PCS and MCS summary scores, were expanded from dichotomous yes/no responses to five-point Likert scales. New norms were derived from the 1998 US population, and have since been updated to 2009 [12]. No substantial changes were made to the recommended scoring methods [12], so the question remains as to whether or not the commercial Version 2 still produces summary scores that are at variance with the underlying sub-scale scores [5]. The major putative problem with the recommended scoring methods is that they do not allow for a correlation between physical and mental health in creating the summary scores, an assumption that is not consistent with the health literature. Epidemiological and clinical studies have shown a strong connection between physical and mental health [13]–[18].
People with depression often have worse physical health, as well as a worse perception of their health [16], a characteristic that would affect their reporting of self-rated health. Tucker et al. [5] acknowledged this connection in the SF-36 version 1 by demonstrating that the use of the recommended orthogonal scoring methods, which do not allow for the correlation, created important discrepancies between the PCS and MCS and their underlying sub-scale scores, and that this could be corrected by use of confirmatory factor analysis (CFA). Given the extensive use of Version 2 [12], it is important to again compare the recommended orthogonal scoring methods with CFA, to assess whether the problems found in Version 1 persist, and to resolve which methods best analyse Version 2 to produce summary scores consistent with the sub-scales. A second important question relating to the use of the SF-36 is whether or not cross-country comparisons of health status are valid using the recommended United States (US) factor scoring coefficients in the development of the PCS and MCS. The developers of the SF-36 Version 2 advocate use of US factor score weights in creating the PCS and MCS in other countries [19]. This has the effect of artificially inflating or deflating these components for local decision making, which could confuse investment decisions in health for other countries. Given the potential differences in health status, the distribution of health and the perception of health in different countries, the question arises as to whether or not PCS and MCS scores should be based on country specific weights and, therefore, be free to vary from country to country, in order to accurately reflect the sub-scale scores generated. Using US factor score coefficients standardises the scores of each country to the US sub-scale score profile [20], which is possibly different to the sub-scale score profile of the country conducting the study.
The important question to be answered is whether or not comparisons across countries are best made on the basis of country specific weighting coefficients. Our aim was to assess whether previous problems of disagreement between the eight SF-36 Version 1 sub-scale scores and the Physical and Mental Component Summary scales (PCS and MCS) persist in version 2 of the instrument. A second study objective was to review the recommended scoring methods for the creation of factor scoring weights and the effect on producing summary scale scores.

Methods

Statistical background and methodological issues

In producing the SF-36 component summaries (PCS and MCS) from the SF-36 data there are two main options for rotation of factors. This is done depending on whether or not the investigator believes the factors to be correlated (oblique) or uncorrelated (orthogonal). The recommended scoring methods for the SF-36 are based on orthogonal rotations, but we will argue that this creates data agreement problems and that there is strong support for adopting an oblique approach. The items of the SF-36 are set out in Table 1.
Table 1

Detailed items of the SF-36 version 2.

| Sub-scale | Item | Short description | Question |
|---|---|---|---|
| Physical Functioning | a3a | Vigorous activities | The following questions are about activities that you might do during a typical day. As I read each item, please tell me if your health now limits you a lot, limits you a little, or does not limit you at all, in these activities. 1 = Yes, limited a lot; 2 = Yes, limited a little; 3 = No, not limited at all |
| | a3b | Moderate activities | |
| | a3c | Lift/Carry groceries | |
| | a3d | Climb several flights | |
| | a3e | Climb one flight | |
| | a3f | Bend, Kneel | |
| | a3g | Walk kilometre | |
| | a3h | Walk half a kilometre | |
| | a3i | Walk 100 metres | |
| | a3j | Bathe, Dress | |
| Role Physical | a4a | Cut down time | The following four questions ask you about your physical health and your daily activities. During the past four weeks, how much of the time have you...? 1 = All of the time; 2 = Most of the time; 3 = Some of the time; 4 = A little of the time; 5 = None of the time |
| | a4b | Accomplished less | |
| | a4c | Limited in kind | |
| | a4d | Had difficulty | |
| Bodily Pain | a7 | Pain-magnitude | How much bodily pain have you had during the past four weeks? 1 = None; 2 = Very mild; 3 = Mild; 4 = Moderate; 5 = Severe; 6 = Very severe |
| | a8 | Pain-interfere | During the past four weeks, how much did pain interfere with your normal work, including both work outside the home and housework? 1 = Not at all; 2 = Slightly; 3 = Moderately; 4 = Quite a bit; 5 = Extremely |
| General Health | a1 | EVGFP rating | These first questions are about your health now and your current daily activities. Please try to answer every question as accurately as you can. In general, would you say your health is: 1 = Excellent; 2 = Very good; 3 = Good; 4 = Fair; 5 = Poor |
| | a11a | Sick easier | Now I'm going to read you a list of statements. After each one, please tell me if it's definitely true, mostly true, mostly false, or definitely false. If you don't know just tell me. 1 = Definitely true; 2 = Mostly true; 3 = Don’t know; 4 = Mostly false; 5 = Definitely false |
| | a11b | As healthy | |
| | a11c | Health to get worse | |
| | a11d | Health excellent | |
| Vitality | a9a | Full of life | The following questions are about how you feel and how things have been with you in the past four weeks. As I read each statement, please give me the one answer that comes closest to the way you have been feeling. Would you say all of the time, most of the time, some of the time, a little of the time or none of the time? 1 = All of the time; 2 = Most of the time; 3 = Some of the time; 4 = A little of the time; 5 = None of the time |
| | a9e | Energy | |
| | a9g | Worn out | |
| | a9i | Tired | |
| Social Functioning | a6 | Social-extent | During the past four weeks, to what extent has your physical health or emotional problems interfered with your normal social activities with family, friends, neighbours or groups? Has it interfered: 1 = Not at all; 2 = Slightly; 3 = Moderately; 4 = Quite a bit; 5 = Extremely |
| | a10 | Social-time | During the past four weeks, how much of the time has your physical health and emotional problems interfered with your social activities like visiting friends and relatives? Would you say: 1 = All of the time; 2 = Most of the time; 3 = Some of the time; 4 = A little of the time; 5 = None of the time |
| Role Emotional | a5a | Cut down time | The following three questions ask about your emotions and your daily activities. During the past four weeks, how much of the time have you...? 1 = All of the time; 2 = Most of the time; 3 = Some of the time; 4 = A little of the time; 5 = None of the time |
| | a5b | Accomplished less | |
| | a5c | Not careful | |
| Mental Health | a9b | Nervous | The following questions are about how you feel and how things have been with you in the past four weeks. As I read each statement, please give me the one answer that comes closest to the way you have been feeling. Would you say all of the time, most of the time, some of the time, a little of the time or none of the time? 1 = All of the time; 2 = Most of the time; 3 = Some of the time; 4 = A little of the time; 5 = None of the time |
| | a9c | Down in dumps | |
| | a9d | Calm | |
| | a9f | Felt down | |
| | a9h | Happy | |

A hypothetical factor structure has already been documented for the SF-36 [21]. This formed the basis of the model we evaluated, except that we allowed physical and mental health to be correlated (see Figure 1). It was therefore possible to fit a second order confirmatory factor analysis (CFA). The model fit was the full measurement model, using items re-coded as detailed in the SF-36 scoring manual [20], with the exception that integer values of the items were retained so that they could be modeled using polychoric and tetrachoric correlations in LISREL V8.7. The above model was fit on 3,014 observations with no missing data for any items. The results produced using the CFA were compared with an analysis using the recommended orthogonal scoring methods [22].
Figure 1

Hypothesised structure of SF-36 Health Dimensions and the Summary Physical (PCS) and Mental (MCS) Health Measures.

Exploratory factor analysis (EFA) based on z-scores of the sub-scales, employing a principal components (PCA) extraction and an orthogonal rotation of factors, was used by the developers to produce the SF-36 scoring coefficients for the component summary scores. This model cannot be directly fit using CFA software as the model is unidentified. However, using MacDonald's “echelon form” [23], where one non-significant path is constrained to zero, fit measures for the EFA model were generated in Stata [24]. It should be pointed out that the EFA model uses Pearson correlations of z-scored normally distributed data for the eight sub-scale scores, whereas the CFA model uses polychoric correlations of the 35 data items involved in the calculation of the SF-36 scores. Also, the Akaike Information Criterion (AIC) value from the CFA model fit in LISREL V8.7 [25] is based on the Satorra-Bentler chi-squared value, whereas the AIC from the EFA model fit in Stata SE V12 [24] is based on the model chi-square, which is -2*log likelihood. To produce a fair comparison of the two models, the AIC was re-calculated for the CFA model based on the value of -2*log likelihood. Hawthorne et al. [22] have published population norms for the transformed subscale scores from the 2004 SA Health Omnibus Survey [26], and they used the traditional scoring approach of Ware et al. to produce factor score weights for the calculation of the Australian SF-36 summary scores. We also used these published norms and weights to produce subscale and summary PCS and MCS scales, distributed N(50,100), based on the traditional orthogonal method, for comparison with the CFA, using the 2008 SA Health Omnibus Survey data set. Given the complexity of decisions made in the process of the CFA analysis, the following methodological explanations are provided.
First, Rigdon & Ferguson [27] have shown that Maximum Likelihood (ML) estimation based on a polychoric correlation matrix is insufficient to correct for the problems associated with the type of data in this study. For this reason weighted least squares (WLS) estimation is preferred. Further, Mindrilla [28] concluded that Diagonally Weighted Least Squares (DWLS) is superior to ML for the analysis of ordinal data. Nye & Drasgow [29] consider that WLS and DWLS are both from the Asymptotically Distribution Free (ADF) family of estimators, and require similarly large samples. They investigated sample sizes from 400 to 1600. Flora & Curran contradict this paper, concluding that DWLS (they call it robust WLS) is superior to WLS in almost all situations, especially when the model is complex or the sample is small (n = 100). The largest sample size they considered was 1000 [30]. Forero et al. [31] compared unweighted least squares (ULS) and DWLS as alternatives to WLS for estimating CFA models with ordinal indicators in a Monte Carlo study, and concluded that ULS was preferable, but that if it did not converge then DWLS should be used, even in small samples (they examined sample sizes of 200, 500, and 2000). WLS was eliminated from consideration due to the requirement for very large sample sizes. For our analysis, we have a moderate sample size of 3014. We attempted to use ULS as recommended by Forero et al. [31], but this did not converge for the SF-36 model. We therefore chose to use DWLS to fit the model for SF-36. The model for SF-12 converged using ULS. For maximum likelihood estimation of multivariate normal data, fit measure cutoffs have been set out by Hu and Bentler [32] as: Root Mean Square Error of Approximation (RMSEA) ≤ 0.06, Standardised Root Mean Square Residual (SRMSR) ≤ 0.08, Tucker Lewis Index (TLI) ≥ 0.95, Comparative Fit Index (CFI) ≥ 0.95.
TLI is also known as the Non-Normed Fit Index (NNFI). Nye & Drasgow [29] concluded that the fit measures and cutoffs in use for ML estimation of multivariate normal data do not apply to ADF estimators. They based their proposals for interpretation of fit measures on DWLS estimators of dichotomous indicators in CFA via tetrachoric correlations. They used Monte Carlo computer simulation to study the effects of model misspecification, sample size, and non-normality on fit indices generated from DWLS estimation on dichotomous data. The study consisted of a 3 (model misspecification) × 3 (degree of non-normality) × 3 (sample size) design. This is based on simulations of sample sizes of 400, 800, and 1600, using values of 0, 0.5, and 1.75 for skewness, and 0, 1.0, and 3.75 for kurtosis. The reader is indirectly invited to extend the results to ordinal data and polychoric correlations, but this is an assumption. They have set out how to calculate cutoffs for fit measures for different situations (i.e. different levels of skewness, kurtosis, sample size, and required type I error rates). They only considered positive skewness in their calculations. They found that CFI and TLI were almost always near 1, and did not provide any discrimination regarding the fit of these models. Therefore, they recommend judging fit for these models based on their calculated cutoffs for RMSEA and SRMSR. Flora & Curran [30] found that “there were few to no differences found in any empirical results as a function of two category versus five category ordinal distributions.” This conclusion supports the generalisation of Nye & Drasgow's work from tetrachoric to polychoric correlations. They also found that DWLS produced more accurate estimates of the model chi-square, and therefore of all the fit measures that are based on it.
In WLS estimation, the “inflation of the test statistic increases Type I error rates for the chi-square goodness-of-fit test, thereby causing researchers to reject correctly specified models more often than expected.” In this sense, Flora and Curran argue the opposite of Nye & Drasgow [29], who proffer the advice that goodness-of-fit criteria need to be tightened up to avoid accepting inadequate models. Nye and Drasgow [29] considered sample sizes up to 1600, and the formulae they provide produce complex roots when applied to our dataset, despite our skewness and kurtosis parameters lying within the ranges used in their simulations. We consider that this is because our sample size is much greater than the sample sizes covered by their simulations. Since the Nye and Drasgow [29] formulae fail to provide real valued cutoffs in our dataset, and Flora and Curran [30] argue for less stringent rather than more stringent fit criteria, we are comfortable using the maximum likelihood criteria advanced by Hu and Bentler [32] to assess model fit in this analysis, with the exception that Nye and Drasgow's advice regarding the non-discrimination of the TLI and CFI fit indices is accepted. We have therefore based our acceptance of the model on an RMSEA ≤ 0.06 and an SRMSR ≤ 0.08.
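The acceptance rule adopted above (RMSEA at most 0.06 and SRMSR at most 0.08, with CFI and TLI set aside) can be written as a small check. This is only an illustrative sketch: the names `FitMeasures` and `acceptable_fit` are hypothetical, and the two sets of fit values are the ones reported in the Results for the CFA and orthogonal EFA models.

```python
from dataclasses import dataclass

# Cutoffs adopted in the text. CFI and TLI are deliberately not checked,
# following the argument that they do not discriminate for these models.
RMSEA_MAX = 0.06
SRMSR_MAX = 0.08


@dataclass(frozen=True)
class FitMeasures:
    rmsea: float
    srmsr: float


def acceptable_fit(fit: FitMeasures) -> bool:
    """Return True when both indices meet the adopted cutoffs."""
    return fit.rmsea <= RMSEA_MAX and fit.srmsr <= SRMSR_MAX


# Fit values as reported in the Results section of this article.
cfa_model = FitMeasures(rmsea=0.049, srmsr=0.053)
efa_model = FitMeasures(rmsea=0.104, srmsr=0.022)

print(acceptable_fit(cfa_model))  # True
print(acceptable_fit(efa_model))  # False: RMSEA exceeds 0.06
```

Note that the rule requires both indices to pass, so the orthogonal EFA model is rejected on RMSEA even though its SRMSR is small.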

Statistical analysis

The 2004 South Australian Health Omnibus Survey dataset was used as the basis for the production of scoring coefficients [26]. This is the earliest Australian population survey available which included version 2 of the SF-36 health status questionnaire. In this representative population survey n = 3,014 adults aged 15 years or older were interviewed, all of whom provided full information for the SF-36. This is the same dataset as used by Hawthorne et al. [22]. The data items were recoded as per the instructions of the SF-36 scoring manual [20]. The confirmatory factor analyses were fit on polychoric correlations in LISREL V8.7 [25] software. The model for SF-36 is a second order confirmatory factor analysis model. Unfortunately LISREL does not produce factor score weights for second order factors. The AMOS package [33] does produce these coefficients, but does not model polychoric correlations. Therefore we applied the AMOS formula for the generation of factor score weights to the outputs provided by LISREL to calculate factor score weights for version 2 of the SF-36. The AMOS formula is given by $W = B S^{-1}$, where $W$ is the matrix of factor score weights, $S$ is the fitted variance covariance matrix of the observed variables in the model, and $B$ is the matrix of covariances between the observed and unobserved variables [33]. As pointed out by Joreskog [34], latent variable scores should be independent of the estimation method used to fit the model. The use of this formula satisfies this requirement. The existence of factor score weights for all of the 35 items in the calculation of the summary scores based on the model is explained by the fact that all variables have an effect on both physical and mental health by virtue of the correlation between them, which is allowed for in the model. A similar approach was used to model the SF-12 variables (see Figure 2). Models were again fit to produce the factor score weights in a confirmatory factor analysis.
The data were recoded as per the instructions of the SF-36 scoring manual [20], with the exception that question eight of the SF-36 was recoded according to the instructions where question seven is not answered. This is because question seven is not asked in collecting the SF-12 data items. This resulted in 3,014 records being available for the analysis. In the model, correlations were allowed among the error terms for items from the same SF-36 sub-scale, because items from the same sub-scale could reasonably be expected to be more closely correlated with each other than with the other items of the SF-12.
Figure 2

Hypothesised structure of SF-12 Summary Physical (PCS) and Mental (MCS) Health Measures.
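The factor score weight formula quoted earlier, W = B S^{-1}, can be illustrated numerically. The following is a minimal sketch on a toy one-factor model, not the second order SF-36 model; the loadings and variances are made-up illustrative numbers, and `factor_score_weights` is a hypothetical helper name.

```python
import numpy as np


def factor_score_weights(B, S):
    """Return W = B S^{-1} by solving S^T W^T = B^T (avoids an explicit inverse)."""
    B = np.atleast_2d(np.asarray(B, dtype=float))
    S = np.asarray(S, dtype=float)
    return np.linalg.solve(S.T, B.T).T


# Toy one-factor model with three indicators (illustrative values only).
loadings = np.array([0.8, 0.7, 0.6])
factor_var = 1.0
unique_var = np.array([0.36, 0.51, 0.64])

# Model-implied covariance matrix of the observed variables.
S = factor_var * np.outer(loadings, loadings) + np.diag(unique_var)
# Covariances between the latent factor (row) and the observed variables (columns).
B = factor_var * loadings[np.newaxis, :]

W = factor_score_weights(B, S)
print(W.shape)  # (1, 3): one row of weights per latent variable
```

Because the weights depend only on the fitted matrices S and B, the same calculation applies whichever estimator produced them, which is the property the text relies on when combining LISREL output with the AMOS formula.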

Comparisons of the PCS and MCS mean scores were based on agreement with the underlying subscales for both the orthogonal rotation and CFA. It was postulated that any sub-group summary score that was higher or lower than average should be in statistical agreement with the underlying subscales that contribute to that summary score. For comparison we used four age groups (<30 years, 30–49 years, 50–69 years and 70+ years) and four medication groups (no medication, physical health medication, mental health medication and both physical and mental health medication). Both sets of scores were based on the 2008 SA Health Omnibus Survey data. Since all scores were hypothesised to be distributed normally with a mean of 50 and a standard deviation of 10, comparisons were made assuming equal variances. Mean scores for four age groups and four medication groups were compared with the complementary groups to determine which age and medication groups had scores which were higher or lower than average scores. Similar comparisons were also made for the eight sub-scale scores. For each age and medication group comparisons of summary scores were made with the underlying sub-scale scores using independent groups t-tests. These analyses were carried out using SPSS Version 19 [35].
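The group-versus-complement comparison described above rests on the equal-variance independent groups t-test. A minimal sketch of the pooled statistic follows; the function name and the six scores are hypothetical, and the p-value lookup (done in SPSS in the study) is left out.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group, complement):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group), len(complement)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(group) + (n2 - 1) * variance(complement)) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group) - mean(complement)) / standard_error, df


# Hypothetical T-scores for one sub-group and for everyone else.
sub_group = [52.0, 54.0, 56.0]
everyone_else = [48.0, 50.0, 52.0]

t_value, degrees_of_freedom = pooled_t_statistic(sub_group, everyone_else)
print(round(t_value, 4), degrees_of_freedom)
```

A positive statistic indicates that the sub-group scores above its complement; in the study the same comparison is repeated for every sub-scale and summary score so that their directions can be checked against each other.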

Results

The traditional orthogonal EFA model had an RMSEA = 0.104, SRMSR = 0.022, CFI = 0.972, TLI = 0.940, and AIC = 58497.72. This can be compared with our CFA model with RMSEA = 0.049, SRMSR = 0.053, CFI = 0.995, TLI = 0.9908, and AIC = 50495.37. From these fit measures it can be seen that the CFA model provides a much superior fit to the data than the EFA model with an orthogonal rotation. We bear in mind the view of Nye and Drasgow [29] that the CFI and TLI are constrained to be near unity in the analysis of polychoric correlations for ordinal data. Table 3.5 of SF-36 Physical and Mental Health Summary Scales: A User's Manual [21] provides the Pearson product-moment correlations of the sub-scales for the general US population. This table provides sufficient information to test the fit of the original orthogonal EFA model employed by the developers of the scale. Using the same methods as above, the orthogonal EFA of the original US data had an RMSEA = 0.092, SRMSR = 0.028, CFI = 0.971, TLI = 0.938, and AIC = 47130.90. The original US model therefore shows a similar degree of lack of fit to the same model fit to Australian data by Hawthorne [22].
Table 3

Australian weighting coefficients for the SF-12 version 2.

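| Item | PCS | MCS |
|---|---|---|
| A1 | 1.3019 | 0.2044 |
| A3B | 1.2625 | 0.1984 |
| A3D | 0.6006 | 0.0943 |
| A4B | 3.0028 | 0.4730 |
| A4C | 2.9809 | 0.4693 |
| A8 | 2.0033 | 0.3157 |
| A5B | 0.1863 | 1.8531 |
| A5C | 0.0953 | 0.9532 |
| A9D | 0.0800 | 0.7996 |
| A9E | 0.2422 | 2.4132 |
| A9F | 0.1370 | 1.3584 |
| A10 | 0.4964 | 4.9376 |
| Constant term | 0.3833 | −9.0891 |
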
The coefficients generated by the CFA analysis for the SF-36 are set out in Table 2. The model had a Chi-square of 53511.3 on 551 degrees of freedom, the size of which is explained by the large sample size. The Satorra-Bentler [36] scaled chi-square was 4648.5. The model had an RMSEA of 0.050 (90% confidence interval 0.048 to 0.051), a probability of close fit of 0.6522, and a standardised root mean square residual of 0.076. The Non-Normed Fit Index was 0.9904 and the Comparative Fit Index was 0.9911. The estimate of the correlation between physical and mental health was 0.73 (p<0.001).
Table 2

Australian weighting coefficients for the SF-36 version 2.

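| Item | PCS | MCS |
|---|---|---|
| A3A | 0.0258 | 0.0025 |
| A3B | 0.0623 | 0.0120 |
| A3C | 0.0445 | 0.0025 |
| A3D | 0.0680 | 0.0070 |
| A3E | 0.2366 | 0.0263 |
| A3F | 0.0268 | 0.0031 |
| A3G | 0.1044 | 0.0087 |
| A3H | 1.0457 | 0.1367 |
| A3I | 0.1675 | 0.0262 |
| A3J | 0.0169 | 0.0021 |
| A4A | 0.5621 | 0.0697 |
| A4B | 1.2488 | 0.1658 |
| A4C | 1.9280 | 0.2391 |
| A4D | 2.2187 | 0.2757 |
| A7 | 0.2566 | 0.0352 |
| A8 | 0.9124 | 0.1121 |
| A1 | 0.4297 | 0.0523 |
| A11A | 0.1698 | 0.0212 |
| A11B | 0.2881 | 0.0368 |
| A11C | 0.0895 | 0.0119 |
| A11D | 1.2084 | 0.1553 |
| A9A | 0.1653 | 1.6617 |
| A9E | 0.1882 | 1.8847 |
| A9G | 0.0817 | 0.8083 |
| A9I | 0.0839 | 0.8228 |
| A6 | 0.1478 | 1.4785 |
| A10 | 0.2389 | 2.4014 |
| A5A | 0.1022 | 1.0694 |
| A5B | 0.1017 | 1.0106 |
| A5C | 0.0393 | 0.3615 |
| A9B | 0.0125 | 0.1300 |
| A9C | 0.0503 | 0.4939 |
| A9D | 0.0263 | 0.2478 |
| A9F | 0.0601 | 0.5955 |
| A9H | 0.0288 | 0.2983 |
| Constant term | −0.1097 | −9.6528 |
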
Based on these weights the theoretical range of the SF-36 version 2 PCS is (12.3279,59.6503), and the observed range was (13.5313,59.6503). For the SF-36 version 2 MCS the theoretical range is (5.0138,63.3733), and the observed range was (5.5778,63.3733). The coefficients generated by the CFA analysis for the SF-12 are set out in Table 3. The model had a Chi-square of 2646.6 on 49 degrees of freedom. The Satorra-Bentler scaled chi-square was 588.4. The model had an RMSEA of 0.060 (90% confidence interval 0.056 to 0.065), a probability of close fit of 0.000, and a standardised root mean square residual of 0.075. The Non-Normed Fit Index was 0.9874 and the Comparative Fit Index was 0.9906. The estimate of the correlation between physical and mental health was 0.71 (p<0.001). Based on these weights the theoretical range of the SF-12 version 2 PCS is (12.7725,58.6031), and the observed range was (12.7725,58.6031). For the SF-12 version 2 MCS the theoretical range is (4.9811,60.6765), and the observed range was (4.9811,60.6765). In comparing the effect of orthogonal rotation methods with confirmatory factor analysis we compared the summary scale scores with their underlying sub-scale scores for different age groups in Table 4 and for medication groups in Table 5. The tables show clear discrepancies between the traditional summary scores and their sub-scales, which are not evident using scoring coefficients derived from confirmatory factor analysis.
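The ranges (12.7725, 58.6031) and (4.9811, 60.6765) reported above can be reproduced from the Table 3 coefficients if each summary score is taken as the constant term plus the weighted sum of the recoded items. The sketch below makes two assumptions that are not stated in the text: that this linear form is how the weights are applied, and that the recoded items run from 1 to 3 for A3B and A3D and from 1 to 5 for all other items. The function names are hypothetical.

```python
# Table 3 coefficients: item -> (PCS weight, MCS weight, assumed min, assumed max).
# The min/max of each recoded item are assumptions, not values given in the text.
SF12_ITEMS = {
    "A1":  (1.3019, 0.2044, 1, 5),
    "A3B": (1.2625, 0.1984, 1, 3),
    "A3D": (0.6006, 0.0943, 1, 3),
    "A4B": (3.0028, 0.4730, 1, 5),
    "A4C": (2.9809, 0.4693, 1, 5),
    "A8":  (2.0033, 0.3157, 1, 5),
    "A5B": (0.1863, 1.8531, 1, 5),
    "A5C": (0.0953, 0.9532, 1, 5),
    "A9D": (0.0800, 0.7996, 1, 5),
    "A9E": (0.2422, 2.4132, 1, 5),
    "A9F": (0.1370, 1.3584, 1, 5),
    "A10": (0.4964, 4.9376, 1, 5),
}
CONSTANTS = {"PCS": 0.3833, "MCS": -9.0891}
COLUMN = {"PCS": 0, "MCS": 1}


def summary_score(responses, component):
    """Constant term plus the weighted sum of recoded item responses."""
    col = COLUMN[component]
    weighted = sum(SF12_ITEMS[item][col] * value for item, value in responses.items())
    return CONSTANTS[component] + weighted


def theoretical_range(component):
    """Score at the assumed floor and ceiling (valid because all weights are positive)."""
    lowest = {item: spec[2] for item, spec in SF12_ITEMS.items()}
    highest = {item: spec[3] for item, spec in SF12_ITEMS.items()}
    return summary_score(lowest, component), summary_score(highest, component)


for component in ("PCS", "MCS"):
    low, high = theoretical_range(component)
    print(component, round(low, 4), round(high, 4))
# PCS 12.7725 58.6031
# MCS 4.9811 60.6765
```

That the assumed item ranges recover the published limits to four decimals supports, but does not prove, this reading of how the weights are applied.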
Table 4

Comparison of subscale scores and summary scores using different scoring methods, by age groups.

| | <30 | 30–49 | 50–69 | 70+ | Total |
|---|---|---|---|---|---|
| n | 515 | 991 | 939 | 472 | 2917 |
| Physical function scale - Aust normed T-score | 54.7 | 52.7 | 47.5 | 39.9 | 50.0 |
| Role physical scale - Aust normed T-score | 52.2 | 50.9 | 47.2 | 43.6 | 49.2 |
| Bodily pain scale - Aust normed T-score | 52.5 | 49.0 | 45.7 | 45.5 | 48.5 |
| General health scale - Aust normed T-score | 51.6 | 50.4 | 47.7 | 46.1 | 49.3 |
| Vitality scale - Aust normed T-score | 51.3 | 49.1 | 48.9 | 47.6 | 49.4 |
| Social function scale - Aust normed T-score | 50.4 | 49.3 | 48.4 | 48.3 | 49.2 |
| Role emotion scale - Aust normed T-score | 49.7 | 48.6 | 48.7 | 48.7 | 48.9 |
| Mental health scale - Aust normed T-score | 49.1 | 48.8 | 49.0 | 50.0 | 49.1 |
| SF-36 PCS - scored using Aust weighted T-score | 54.1 | 51.6 | 46.6 | 41.5 | 49.5 |
| SF-36 MCS - scored using Aust weighted T-score | 48.3 | 47.9 | 49.6 | 52.0 | 49.0 |
| SF-36 PCS - scored using SEM coefficients | 52.6 | 50.8 | 47.2 | 43.9 | 49.3 |
| SF-36 MCS - scored using SEM coefficients | 51.1 | 49.3 | 48.2 | 47.0 | 49.1 |
| SF-12 PCS - scored using SEM coefficients | 52.6 | 50.8 | 47.0 | 43.6 | 49.2 |
| SF-12 MCS - scored using SEM coefficients | 51.2 | 49.5 | 48.2 | 47.0 | 49.2 |

Table 5

Comparison of subscale scores and summary scores using different scoring methods, by medication status.

| | No medication | Physical only | Mental only | Both | Total |
|---|---|---|---|---|---|
| n | 1549 | 1120 | 95 | 153 | 2917 |
| Physical function scale - Aust normed T-score | 53.5 | 45.6 | 48.7 | 41.0 | 50.0 |
| Role physical scale - Aust normed T-score | 52.2 | 45.8 | 45.9 | 40.6 | 49.2 |
| Bodily pain scale - Aust normed T-score | 51.6 | 44.9 | 43.7 | 38.7 | 48.5 |
| General health scale - Aust normed T-score | 52.4 | 45.8 | 44.1 | 40.6 | 49.3 |
| Vitality scale - Aust normed T-score | 51.4 | 47.9 | 41.9 | 40.7 | 49.4 |
| Social function scale - Aust normed T-score | 51.1 | 47.8 | 41.2 | 40.7 | 49.2 |
| Role emotion scale - Aust normed T-score | 50.7 | 48.5 | 37.4 | 38.2 | 48.9 |
| Mental health scale - Aust normed T-score | 50.3 | 49.2 | 39.0 | 40.1 | 49.1 |
| SF-36 PCS - scored using Aust weighted T-score | 53.1 | 44.6 | 48.5 | 40.9 | 49.5 |
| SF-36 MCS - scored using Aust weighted T-score | 49.8 | 50.0 | 37.0 | 40.3 | 49.0 |
| SF-36 PCS - scored using SEM coefficients | 52.7 | 45.6 | 44.4 | 39.3 | 49.3 |
| SF-36 MCS - scored using SEM coefficients | 51.6 | 47.4 | 39.2 | 37.9 | 49.1 |
| SF-12 PCS - scored using SEM coefficients | 52.5 | 45.6 | 44.6 | 39.1 | 49.2 |
| SF-12 MCS - scored using SEM coefficients | 51.7 | 47.5 | 39.4 | 38.0 | 49.2 |

Table 4 shows several discrepancies between the summary component scores and their underlying sub-scale scores when scored using orthogonal methods, as set out by Hawthorne [22]. The score for the SF-36 mental health sub-scale for those aged under thirty years is not significantly different to the overall sub-scale average (p = 0.918). The remaining three sub-scale scores that comprise the SF-36 mental component are all significantly higher than average (role emotional (p = 0.026), vitality (p<0.001), social functioning (p = 0.005)), as are the mental component summary scores (MCS) from CFA coefficients for both SF-36 (p<0.001) and SF-12 (p<0.001), yet the MCS score based on the original orthogonal scoring algorithm is significantly lower than average (p = 0.035). For those aged 30–49 years, none of the mental health sub-scales are significantly different to average (vitality (p = 0.272), social functioning (p = 0.650), role emotional (p = 0.295), and mental health (p = 0.264)), yet the MCS was significantly lower than average (p<0.001) using orthogonal scoring, whereas there was no significant difference for the SF-36 MCS score using CFA coefficients (p = 0.561) or the SF-12 MCS score using CFA coefficients (p = 0.294). For those aged 50–69 years, three of the mental health scales were not significantly different to average (vitality (p = 0.120), role emotional (p = 0.466), and mental health (p = 0.795)) and social functioning was significantly lower than average (p = 0.012), yet the MCS was significantly higher than average (p = 0.044) using orthogonal scoring but significantly lower than average for both SF-36 (p = 0.003) and SF-12 (p = 0.001) using CFA coefficients. For those aged 70 years or more, the vitality scale was significantly lower than average (p<0.001), whilst the social functioning (p = 0.083), role emotional (p = 0.711), and mental health (p = 0.069) scores were not significantly different to average.
The MCS scores from CFA coefficients for both SF-36 (p<0.001) and SF-12 (p<0.001) were significantly lower than average, yet the MCS score based on the original orthogonal scoring method was significantly higher than average (p<0.001). There were no inconsistencies evident by age for physical health summary scores when compared to their subscales. Similar discrepancies arise in comparison of the component summary scores with their underlying sub-scale scores for those taking medications for either or both physical and mental health conditions. Table 5 shows that for those not taking medications no inconsistencies between sub-scales and summary scores were evident. For those taking medications for physical ailments the vitality (p<0.001) and social functioning (p<0.001) sub-scale scores were both significantly lower than average, while the role emotional score (p = 0.155) and the mental health score (p = 0.789) were not significantly different to average. This is consistent with the mental health summary scores (MCS) from CFA coefficients, which were significantly lower than average for both SF-36 and SF-12 (p<0.001), yet the MCS score based on the original orthogonal scoring method was significantly higher than average (p<0.001). Similarly, three of the physical health subscale scores are significantly lower than average for those taking medications for mental health reasons (role physical (p = 0.002), bodily pain (p<0.001), and general health (p<0.001)), while the physical functioning scale is not significantly different to average (p = 0.196). This is consistent with the physical health summary scores (PCS) from CFA coefficients, which are significantly lower than average for both SF-36 (p<0.001) and SF-12 (p<0.001), yet the PCS score calculated using the original orthogonal scoring coefficients is not significantly different to average (p = 0.380).
For those taking medication for both physical and mental health problems, no inconsistencies were evident between the physical or mental health summary scores and their subscales. In summary, the CFA produced a superior fit to the SF-36 data, provided acceptable fit measures and solved agreement problems observed in the orthogonal analyses.
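
The kind of disagreement reported in Tables 4 and 5 can be checked mechanically. The sketch below is illustrative only. It encodes the direction of each difference reported above for those aged under thirty (+1 significantly higher than average, -1 significantly lower, 0 not significant), and the helper name `contradicts_subscales` is a hypothetical label, not something used in the study. The rule is deliberately narrow: it flags only a summary score that moves against every significant sub-scale, so it would not flag cases such as the 30–49 group, where the summary score is significant but no sub-scale is.

```python
def contradicts_subscales(subscale_dirs, summary_dir):
    """Return True when a summary score differs significantly from average
    in the opposite direction to every significant sub-scale."""
    significant = [d for d in subscale_dirs.values() if d != 0]
    if summary_dir == 0 or not significant:
        return False
    return all(d == -summary_dir for d in significant)


# Directions for those aged under thirty, as reported in the text above.
under_30 = {
    "vitality": +1,            # p < 0.001
    "social_functioning": +1,  # p = 0.005
    "role_emotional": +1,      # p = 0.026
    "mental_health": 0,        # p = 0.918, not significant
}

mcs_orthogonal = -1  # significantly lower than average (p = 0.035)
mcs_cfa_sf36 = +1    # significantly higher than average (p < 0.001)

print(contradicts_subscales(under_30, mcs_orthogonal))  # True
print(contradicts_subscales(under_30, mcs_cfa_sf36))    # False
```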

Discussion

We raise two points of difference with the developers regarding the development of scoring norms and weights. First, that PCS and MCS summary scores should be based on a model that allows correlation of physical and mental health, to preserve consistency of summary scores with their underlying sub-scales. We thank an anonymous reviewer who has also pointed out that “this issue is probably more of a concern with the SF12 than the SF36. The SF36 generates subscale scores, so users can notice and evaluate the potential problems caused by orthogonally-derived summary scores. But the SF12 generates only summary scores, so the problem will be hidden from users.” Second, that scoring norms and weights should be produced on country specific data, so that all scores are based on the same data items and have the same distributions (normal with mean 50 and standard deviation 10). This is essential for country decision making, especially from summary scales for sub-groups; further, in this way all countries will produce T-scores for all sub-scales and summary scales that allow accurate international comparisons, without the need to standardise to USA factor weights. The use of US factor score weights in the calculation of summary scores seems inappropriate for other countries, because the linear combination of z-scored sub-scales using US weights results in the emphasis being placed on those sub-scales which have higher US weights. Hawthorne [22] has analysed Australian SF-36 version 2 data from the 2004 Health Omnibus Survey. His analysis replicated precisely the methods used by the developers, but included allowances for the production of Australian norms for use in calculating the z-scores for the sub-scales, and for the calculation of Australian factor score weights from an orthogonal EFA. His analysis showed that the factor score weights produced based on Australian data were significantly different to those produced using USA data. 
None of the USA weights were in the 95% CI of the Australian weights. Thus the profile of locally calculated weights can be very different to the US weights and therefore the summary scores produced by locally produced weights would emphasise different sub-scales than the US weights. This results in the calculation of inaccurate summary scores when using US weights. In principle, therefore, calculation of summary scores should be based on locally calculated weights. In the present study we used the Australian norms and factor score weights based on Australian data developed by Hawthorne [22] to produce the component summary scores for the traditional orthogonal scoring method. Table 2 of Hawthorne's paper also demonstrated the shortcomings of applying US norms and weights to Australian data, in that the 95% CI for all subscale T-scores and the MCS T-score excluded 50. So even if we stick to orthogonal analyses there is important and increasing evidence that strictly applying US factor score weights in the creation of summary scores is a problem for local interpretation and use of data. It is argued that the profile of locally calculated weights can be very different, as demonstrated by Hawthorne [22], and often for the valid reasons of differences in health. The aim of measuring health status should primarily be for the production of valid local scores based on country specific norms and not for the primary purpose of standardising to US data for comparison purposes. Further, if we need to compare with the US or with any other country it would best be done on the basis of subscale T-scores and summary scores based on individual data items and local population norms for the creation of factor score weights in a second order confirmatory factor analysis, so that scores are all based on the same data items and have the same distribution. 
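
The norming argument above rests on simple arithmetic: a sub-scale score is z-scored against a reference mean and standard deviation and then rescaled as T = 50 + 10z. The following is a minimal sketch with made-up raw scores and made-up external norms (none of the numbers come from this study, from Hawthorne [22], or from the developers' manuals). It shows that T-scores average 50 with standard deviation 10 only when the norms come from the population being scored.

```python
from statistics import mean, pstdev


def t_scores(raw, norm_mean, norm_sd):
    """Convert raw sub-scale scores to T-scores against the given norms."""
    return [50 + 10 * (x - norm_mean) / norm_sd for x in raw]


# Hypothetical 0-100 sub-scale scores for a small local sample.
local_raw = [55.0, 70.0, 80.0, 85.0, 90.0, 100.0]

# Normed against the local sample itself.
local_t = t_scores(local_raw, mean(local_raw), pstdev(local_raw))

# Normed against hypothetical norms from another population.
foreign_t = t_scores(local_raw, norm_mean=84.0, norm_sd=23.0)

print(round(mean(local_t), 2), round(pstdev(local_t), 2))  # 50.0 10.0
print(round(mean(foreign_t), 2))                           # 48.26
```

With the external norms the local mean no longer sits at 50, which is the pattern Hawthorne reports for US norms applied to Australian data.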
In fairness to the authors of the SF-36, they have produced a leading generic quality of life instrument and there is little or no criticism of the long-term historical development of question items. The main points of contention concern the scoring of the summary scores. The questions that other interested researchers must answer are whether the proposed CFA fixes the underlying problems identified with the PCS and MCS, and whether US factor score weights should be used for anything other than academic comparison with US data, rather than for country specific estimates, which may be skewed by US coefficients. The CFA used in this analysis is based on the original data items and the orthogonal analysis on the underlying subscales. It is argued this is a reasonable comparative approach of the two methods as the data items are used to create the subscales. The main difference in the comparisons is therefore based on the methodological difference of orthogonal or oblique rotation and not on data differences. We argue the oblique rotation method is an improved way of handling the data. We further argue that the approach recommended by the developers is unsustainable in Australia, and possibly elsewhere, because the factor score weights should be free to vary from country to country in order to accurately reflect the sub-scale scores generated by the SF-36 data in each country. This point is supported by Hawthorne's analysis of the Australian data [22]. We accept that Hawthorne's findings contradict the findings of the IQOLA project [19]. Australia appears to offer divergent results to the other mainly European countries included in the IQOLA study, and we note that these analyses were conducted on different datasets. The critical point is the existence of the dataset that produced Hawthorne's results. Hawthorne's analysis satisfactorily demonstrates the need for an Australian country specific scoring algorithm. 
The question of the need for country specific scoring algorithms elsewhere has not been covered by our analysis, and should be the subject of further research. We are aware that demonstration of the inconsistencies between the sub-scales and the component summary scores in two tables (4 & 5) is not a comprehensive validation of the scoring coefficients, but we suggest there are limits to how much analysis can be squeezed into one paper.

Conclusion

The conclusion of the study is that the problems of agreement between PCS and MCS summary scores and their underlying sub-scales identified in Version 1 of the SF-36 persist in Version 2. As identified in the Version 1 analyses [4], this occurs when a negative Z-score is multiplied by a negative coefficient, resulting in a positive score. This mathematical difficulty is compounded by the orthogonal method used, and it is not clearly understood why the developers continue to promote the method in the face of international concerns and a real world correlation between mental and physical health. In a defence of the SF-36 scoring methods and the instrument's accuracy, Ware and Kosinski [37] discuss the question of the PCS and MCS being rotated by orthogonal or oblique methods and ask how much physical health should be in mental health and vice versa. If, however, exploratory factor analysis using maximum likelihood extraction and oblique rotation were used, this would estimate the hypothetical factor structure and the data would determine how much mental health is contained in physical health and vice versa. In Ware and Kosinski's [37] defence of the SF-36 they also contend “results based on summary measures should be thoroughly compared with the SF-36 profile….,” before drawing any conclusions. If we followed this advice for the above analyses of Version 2 data (and also for Version 1) we would conclude the disagreement between scales and summary scores is consistent using orthogonal modeling and is based on a mathematical artefact.
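
The sign artefact described above can be reproduced with a few lines of arithmetic. The sketch below assumes illustrative MCS weights that follow the sign pattern of orthogonal scoring (negative on the physical sub-scales, positive on the mental ones). The values are hypothetical round numbers, not the Australian coefficients of Hawthorne [22] or the published US coefficients. It scores a respondent whose physical sub-scales are one standard deviation below average and whose mental sub-scales are exactly average.

```python
# Hypothetical orthogonal-style MCS weights (illustrative values only).
MCS_WEIGHTS = {
    "PF": -0.23, "RP": -0.12, "BP": -0.10, "GH": -0.02,  # physical sub-scales
    "VT": 0.24, "SF": 0.27, "RE": 0.43, "MH": 0.49,      # mental sub-scales
}


def mcs_t_score(z):
    """Weighted sum of sub-scale z-scores, rescaled to a T-score."""
    raw = sum(MCS_WEIGHTS[name] * z[name] for name in MCS_WEIGHTS)
    return 50 + 10 * raw


# Physical sub-scales 1 SD below average, mental sub-scales exactly average.
poor_physical = {"PF": -1, "RP": -1, "BP": -1, "GH": -1,
                 "VT": 0, "SF": 0, "RE": 0, "MH": 0}

print(round(mcs_t_score(poor_physical), 1))  # 54.7
```

Each negative z-score multiplied by a negative weight adds a positive term, so under these assumed weights the respondent receives an above-average MCS although no mental sub-scale is above average.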
  20 in total

1.  The SF-36 summary scales: problems and solutions.

Authors:  D Wilson; J Parsons; G Tucker
Journal:  Soz Praventivmed       Date:  2000

2.  Interpreting SF-36 summary health measures: a response.

Authors:  J E Ware; M Kosinski
Journal:  Qual Life Res       Date:  2001       Impact factor: 4.147

3.  New Australian population scoring coefficients for the old version of the SF-36 and SF-12 health status questionnaires.

Authors:  Graeme Tucker; Robert Adams; David Wilson
Journal:  Qual Life Res       Date:  2010-05-04       Impact factor: 4.147

4.  The equivalence of SF-36 summary health scores estimated using standard and country-specific algorithms in 10 countries: results from the IQOLA Project. International Quality of Life Assessment.

Authors:  J E Ware; B Gandek; M Kosinski; N K Aaronson; G Apolone; J Brazier; M Bullinger; S Kaasa; A Leplège; L Prieto; M Sullivan; K Thunedborg
Journal:  J Clin Epidemiol       Date:  1998-11       Impact factor: 6.437

5.  SF-36 summary scores: are physical and mental health truly distinct?

Authors:  G E Simon; D A Revicki; L Grothaus; M Vonkorff
Journal:  Med Care       Date:  1998-04       Impact factor: 2.983

6.  The RAND 36-Item Health Survey 1.0.

Authors:  R D Hays; C D Sherbourne; R M Mazel
Journal:  Health Econ       Date:  1993-10       Impact factor: 3.046

7.  The SF-36 Health Survey as a generic outcome measure in clinical trials of patients with osteoarthritis and rheumatoid arthritis: relative validity of scales in relation to clinical measures of arthritis severity.

Authors:  M Kosinski; S D Keller; J E Ware; H T Hatoum; S X Kong
Journal:  Med Care       Date:  1999-05       Impact factor: 2.983

Review 8.  Clinical and health services relationships between major depression, depressive symptoms, and general medical illness.

Authors:  Wayne J Katon
Journal:  Biol Psychiatry       Date:  2003-08-01       Impact factor: 13.382

9.  Association between physical activity and mental disorders among adults in the United States.

Authors:  Renee D Goodwin
Journal:  Prev Med       Date:  2003-06       Impact factor: 4.018

10.  Is the SF-36 a valid measure of change in population health? Results from the Whitehall II Study.

Authors:  H Hemingway; M Stafford; S Stansfeld; M Shipley; M Marmot
Journal:  BMJ       Date:  1997-11-15