Godfred O Boateng, Torsten B Neilands, Edward A Frongillo, Hugo R Melgar-Quiñonez, Sera L Young.
Abstract
Scale development and validation are critical to much of the work in the health, social, and behavioral sciences. However, the constellation of techniques required for scale development and evaluation can be onerous, jargon-filled, unfamiliar, and resource-intensive. Further, it is often not a part of graduate training. Therefore, our goal was to concisely review the process of scale development in as straightforward a manner as possible, both to facilitate the development of new, valid, and reliable scales, and to help improve existing ones. To do this, we have created a primer for best practices for scale development in measuring complex phenomena. This is not a systematic review, but rather the amalgamation of technical literature and lessons learned from our experiences spent creating or adapting a number of scales over the past several decades. We identified three phases that span nine steps. In the first phase, items are generated and the validity of their content is assessed. In the second phase, the scale is constructed. Steps in scale construction include pre-testing the questions, administering the survey, reducing the number of items, and understanding how many factors the scale captures. In the third phase, scale evaluation, the number of dimensions is tested, reliability is tested, and validity is assessed. We have also added examples of best practices to each step. In sum, this primer will equip both scientists and practitioners to understand the ontology and methodology of scale development and validation, thereby facilitating the advancement of our understanding of a range of health, social, and behavioral outcomes.
Keywords: content validity; factor analysis; item reduction; psychometric evaluation; scale development; tests of dimensionality; tests of reliability; tests of validity
Year: 2018 PMID: 29942800 PMCID: PMC6004510 DOI: 10.3389/fpubh.2018.00149
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Figure 1. An overview of the three phases and nine steps of scale development and validation.
Description of model fit indices and thresholds for evaluating scales developed for health, social, and behavioral research.
| Model fit index | Description | Recommended threshold | References |
|---|---|---|---|
| Chi-square test | The chi-square value is a test statistic of the goodness of fit of a factor model. It compares the observed covariance matrix with a theoretically proposed covariance matrix | Chi-square test of model fit has been assessed to be overly sensitive to sample size and to vary when dealing with non-normal variables. Hence, the use of non-normal data, a small sample size ( | ( |
| Root Mean Squared Error of Approximation (RMSEA) | RMSEA is a measure of the estimated discrepancy between the population and model-implied population covariance matrices per degree of freedom ( | Browne and Cudeck recommend RMSEA ≤ 0.05 as indicative of close fit, 0.05 ≤ RMSEA ≤ 0.08 as indicative of fair fit, and values >0.10 as indicative of poor fit between the hypothesized model and the observed data. However, Hu and Bentler have suggested RMSEA ≤ 0.06 may indicate a good fit | ( |
| Tucker Lewis Index (TLI) | TLI is based on the idea of comparing the proposed factor model to a model in which no interrelationships at all are assumed among any of the items | Bentler and Bonett suggest that models with overall fit indices of <0.90 are generally inadequate and can be improved substantially. Hu and Bentler recommend TLI ≥ 0.95 | ( |
| Comparative Fit Index (CFI) | CFI is an incremental relative fit index that measures the relative improvement in the fit of a researcher's model over that of a baseline model | CFI ≥ 0.95 is often considered an acceptable fit | ( |
| Standardized Root Mean Square Residual (SRMR) | SRMR is a measure of the mean absolute correlation residual, the overall difference between the observed and predicted correlations | Threshold for acceptable model fit is SRMR ≤ 0.08 | ( |
| Weighted Root Mean Square Residual (WRMR) | WRMR uses a “variance-weighted approach especially suited for models whose variables measured on different scales or have widely unequal variances” ( | Yu recommends a threshold of WRMR < 1.0 for assessing model fit. This index is used for confirmatory factor analysis and structural equation models with binary and ordinal variables | ( |
| Standard of Reliability for scales | A reliability of 0.90 is the minimum recommended threshold that should be tolerated while a reliability of 0.95 should be the desirable standard. While the ideal has rarely been attained by most researchers, a reliability coefficient of 0.70 has often been accepted as satisfactory for most scales | Nunnally recommends a threshold of ≥0.90 for assessing internal consistency for scales | ( |
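
The thresholds in the table above can be applied once a model's chi-square statistics are in hand. The following is a minimal sketch, not the authors' procedure: it uses the commonly cited textbook formulas for RMSEA, CFI, and TLI, and assumes the chi-square and degrees of freedom for both the fitted model and the baseline (independence) model have already been obtained from a factor-analysis package. The function names and all numeric inputs are hypothetical, chosen only for illustration. Some software divides by n rather than n - 1 in the RMSEA formula, so small differences from package output are expected.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA point estimate from model chi-square, degrees of freedom, and sample size.

    Assumes the (n - 1) convention; some packages use n instead.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index: relative improvement of the model (m) over the baseline (b)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 if d_b == 0 else 1.0 - d_m / d_b


def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker Lewis Index; undefined if the baseline chi-square equals its df."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)


def describe_rmsea(value):
    """Label an RMSEA value using the Browne and Cudeck bands quoted in the table."""
    if value <= 0.05:
        return "close fit"
    if value <= 0.08:
        return "fair fit"
    if value > 0.10:
        return "poor fit"
    return "between fair and poor (not classified in the table)"


if __name__ == "__main__":
    # Hypothetical values for illustration only; not taken from any real study.
    chi2_model, df_model = 85.0, 40
    chi2_base, df_base = 1200.0, 55
    n_respondents = 300

    r = rmsea(chi2_model, df_model, n_respondents)
    c = cfi(chi2_model, df_model, chi2_base, df_base)
    t = tli(chi2_model, df_model, chi2_base, df_base)

    print(f"RMSEA = {r:.3f} ({describe_rmsea(r)})")
    print(f"CFI   = {c:.3f} (meets >= 0.95: {c >= 0.95})")
    print(f"TLI   = {t:.3f} (meets >= 0.95: {t >= 0.95})")
```

Because the cutoffs differ by author (for example, RMSEA ≤ 0.05 versus ≤ 0.06), it is usually more informative to report the index values themselves alongside whichever threshold is being applied.
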
The three phases and nine steps of scale development and validation.
| Activity | Purpose | How to explore or estimate | References |
|---|---|---|---|
| Domain identification | To specify the boundaries of the domain and facilitate item generation | 1.1 Specify the purpose of the domain | ( |
| Item generation | To identify appropriate questions that fit the identified domain | 1.6 Deductive methods: literature review and assessment of existing scales | ( |
| Evaluation by experts | To evaluate each of the items constituting the domain for content relevance, representativeness, and technical quality | 2.1 Quantify assessments of 5–7 expert judges using formalized scaling and statistical procedures including content validity ratio, content validity index, or Cohen's kappa coefficient | ( |
| Evaluation by target population | To evaluate each item constituting the domain for representativeness of actual experience from target population | 2.3 Conduct cognitive interviews with end users of scale items to evaluate face validity | ( |
| Cognitive interviews | To assess the extent to which questions reflect the domain of interest and that answers produce valid measurements | 3.1 Administer draft questions to 5–15 interviewees in 2–3 rounds while allowing respondents to verbalize the mental process entailed in providing answers | ( |
| Survey administration | To collect data with minimum measurement errors | 4.1 Administer potential scale items on a sample that reflects range of target population using paper or device | ( |
| Establishing the sample size | To ensure the availability of sufficient data for scale development | 4.2 Recommended sample size is 10 respondents per survey item and/or 200-300 observations | ( |
| Determining the type of data to use | To ensure the availability of data for scale development and validation | 4.3 Use cross-sectional data for exploratory factor analysis | – |
| Item difficulty index | To determine the proportion of correct answers given per item (CTT); to determine the probability of a particular examinee correctly answering a given item (IRT) | 5.1 Proportion can be calculated for CTT and item difficulty parameter estimated for IRT using statistical packages | ( |
| Item discrimination test | To determine the degree to which an item or set of test questions are measuring a unitary attribute (CTT); to determine how steeply the probability of correct response changes as ability increases (IRT) | 5.2 Estimate biserial correlations or item discrimination parameter using statistical packages | ( |
| Inter-item and item-total correlations | To determine the correlations between scale items, as well as the correlations between each item and sum score of scale items | 5.3 Estimate inter-item/item communalities, item-total, and adjusted item-total correlations using statistical packages | ( |
| Distractor efficiency analysis | To determine the distribution of incorrect options and how they contribute to the quality of items | 5.4 Conduct distractor analysis using statistical packages | ( |
| Deleting or imputing missing cases | To ensure the availability of complete cases for scale development | 5.5 Delete items with many cases that are permanently missing, or handle missing data using multiple imputation or full information maximum likelihood | ( |
| Factor analysis | To determine the optimal number of factors or domains that fit a set of items | 6.1 Use scree plots, exploratory factor analysis, parallel analysis, minimum average partial procedure, and/or the Hull method | ( |
| Test dimensionality | To address queries on the latent structure of scale items and their underlying relationships, i.e., to validate whether the previous hypothetical structure fits the items | 7.1 Estimate independent cluster model confirmatory factor analysis, cf. Table | ( |
| Score scale items | To create scale scores for substantive analysis including reliability and validity of scale | 7.4 Calculate scale scores using an unweighted approach, which includes summing standardized item scores and raw item scores, or computing the mean for raw item scores | ( |
| Calculate reliability statistics | To assess the internal consistency of the scale, i.e., the degree to which the set of items in the scale co-vary, relative to their sum score | 8.1 Estimate using Cronbach's alpha | ( |
| Test–retest reliability | To assess the degree to which the participant's performance is repeatable; i.e., how consistent their scores are across time | 8.3 Estimate the strength of the relationship between scale items over two or three time points; variety of measures possible | ( |
| Predictive validity | To determine if scores predict future outcomes | 9.1 Use bivariate and multivariable regression; stronger and significant associations or causal effects suggest greater predictive validity | ( |
| Concurrent validity | To determine the extent to which scale scores have a stronger relationship with criterion measurements made near the time of administration | 9.2 Estimate the association between scale scores and “gold standard” of scale measurement; stronger significant association in Pearson product-moment correlation suggests support for concurrent validity | ( |
| Convergent validity | To examine if the same concept measured in different ways yields similar results | 9.3 Estimate the relationship between scale scores and similar constructs using multi-trait multi-method matrix, latent variable modeling, or Pearson product-moment coefficient; higher/stronger correlation coefficients suggest support for convergent validity | ( |
| Discriminant validity | To examine if the concept measured is different from some other concept | 9.4 Estimate the relationship between scale scores and distinct constructs using multi-trait multi-method matrix, latent variable modeling, or Pearson product-moment coefficient; lower/weaker correlation coefficients suggest support for discriminant validity | ( |
| Differentiation by “known groups” | To examine if the concept measured behaves as expected in relation to “known groups” | 9.5 Select known binary variables based on theoretical and empirical knowledge and determine the distribution of the scale scores over the known groups; use | ( |
| Correlation analysis | To determine the relationship between existing measures or variables and newly developed scale scores | 9.6 Correlate scale scores and existing measures or, preferably, use linear regression, intraclass correlation coefficient, and analysis of standard deviations of the differences between scores | ( |
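
Two of the estimates named in the table, adjusted (corrected) item-total correlations (step 5.3) and Cronbach's alpha (step 8.1), are simple enough to compute by hand. The sketch below is illustrative only and is not the authors' code: it uses the standard formula for alpha, k/(k - 1) × (1 - Σ item variances / variance of the sum score), with sample variances, and correlates each item with the sum of the remaining items. The function names and the response matrix are hypothetical, and the code assumes complete data (no missing responses) with every item scored in the same direction.

```python
import math
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(rows):
    """Correlation of each item with the sum of all OTHER items."""
    k = len(rows[0])
    result = []
    for i in range(k):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]
        result.append(pearson(item, rest))
    return result


if __name__ == "__main__":
    # Hypothetical responses: 5 respondents x 3 items on a 1-5 scale.
    responses = [
        [1, 2, 1],
        [2, 2, 3],
        [3, 4, 3],
        [4, 4, 5],
        [5, 5, 5],
    ]

    alpha = cronbach_alpha(responses)
    print(f"Cronbach's alpha = {alpha:.3f}")
    # Thresholds quoted in the reliability row of the fit-index table.
    print(f"  meets 0.70 (often accepted): {alpha >= 0.70}")
    print(f"  meets 0.90 (Nunnally):       {alpha >= 0.90}")

    for idx, r in enumerate(corrected_item_total(responses), start=1):
        print(f"Item {idx}: corrected item-total r = {r:.3f}")
```

A sample this small is far below the 10 respondents per item suggested in step 4.2 and is used here only so the arithmetic can be checked by hand.
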