| Conceptions of construct validity |
| Two definitions of validity | See Section 1: “Conceptualizing Construct Validity.” Key papers: Borsboom et al. (2004) and Messick (1989) |
| Validity is “one” | See Section 1: “Conceptualizing Construct Validity.” Key paper: Newton and Shaw (2013) |
| Construct validity since Cronbach and Meehl (1955) | Smith (2005) The author reviews developments in construct validity in the 50 years since Cronbach and Meehl (1955). The paper begins with developments in the philosophy of science and then centers on a five-step model of construct validation, from carefully specifying the target constructs to revising one’s theory and constructs. Also included is a critical review of several more recent statistical approaches for testing validity (e.g., methods for multitrait–multimethod matrices, generalizability theory). |
| Defining constructs | |
| Developing clear definitions | See Section 2: “Creating Clearer Construct Definitions.” Key paper: Podsakoff et al. (2016) |
| Specifying the latent continuum | See Section 2: “Creating Clearer Construct Definitions.” Key paper: Tay and Jebb (2018) |
| Creating scale items | |
| Readability tests | See Section 3: “Readability Tests for Items.” Key paper: Calderón et al. (2006) |
| Modern readability measures | Peter et al. (2018) Two newer readability tools can supplement traditional tests for scale items. First, Coh-Metrix computes a syntactic simplicity score based on multiple variables (e.g., clauses within sentences, conditionals, negations). Second, the Question Understanding Aid (QUAID) was designed specifically to examine the readability of survey instruments and can identify potential issues like vague wording, jargon, and working memory overload. Both are freely available at websites listed in the paper. |
| Respondent comprehension | Hardy and Ford (2014) Good survey data require that respondents interpret the survey items as the scale developer intended. However, the authors describe how both (a) specific words and (b) sentence construction in items can contribute to respondent miscomprehension. The authors provide evidence for this in popular scales and then discuss remedies, such as reducing words and phrases with multiple or vague meanings and collecting qualitative data from respondents about their interpretations of items. |
| Number of response options and labels | Weng (2004) and Simms et al. (2019) Examining the Big Five Inventory, Simms et al. (2019) found that more Likert response options resulted in higher internal consistency and test-retest reliability (but not convergent validity). These benefits stopped after six response options, and 0–1,000 visual analog scales did not show benefits, either. Including (or removing) a middle point (e.g., “neither agree nor disagree”) did not show any psychometric effects. Weng (2004) also found higher internal consistency and test-retest reliability when all response options had labels compared to when only endpoints of the scale had labels. |
| Item format | Zhang and Savalei (2016) The authors advance research on the expanded scale format as a way to gain the benefit of including reverse-worded items in a scale (i.e., controlling for acquiescence bias) without the common downside (i.e., introducing method variance into scores, leading to the emergence of method factors). Each Likert-type item has its response options rewritten as a set of full statements (e.g., the options for “I like my job” become statements ranging from “I definitely like my job” to “I definitely do not like my job”); respondents select one statement from each set. |
| Item stability | Knowles and Condon (2000) The stability of item properties should not be assumed when items are placed in different testing contexts. Methods from classical test theory, factor analysis, and item response theory are available for examining the stability of items applied to new conditions or test revisions. |
| Presentation of items in blocks | Weijters et al. (2014) When putting a survey together, there are many ways to present the scale items. For instance, items from different scales can all be randomized and presented in the same block, or each scale can be given its own block. The authors showed the effects of splitting a unidimensional scale into two blocks with other scales administered in between: scale items in different blocks had lower intercorrelations, and two factors emerged that corresponded to the two blocks. The authors recommend that researchers assessing discriminant validity be mindful of scale presentation and that the presentation of scales in surveys be reported consistently. |
| Content validation | |
| Guidelines for reporting | Colquitt et al. (2019) Two common methods for content validation are reviewed and compared: Anderson and Gerbing (1991) and Hinkin and Tracey (1999). Both approaches ask subjects to rate how well each proposed item matches the construct definition, as well as the definitions of similar constructs. The authors also offer several new statistics for indexing content validity, provide standards for conducting content validation (e.g., participant instructions, scale anchors), and norms for evaluating these statistics. |
| Guidelines for assessment | Haynes et al. (1995) Provides an overview of content validation and its issues (e.g., how it can change over time if the construct changes). The authors also provide guidelines for assessing content validity, such as using multiple judges of scales, examining the proportionality of item content in scales, and using subsequent psychometric analyses to indicate the degree of evidence for content coverage. |
| Consulting focus groups | Vogt et al. (2004) Communicating with the target population is valuable in content validation but is rarely done. One method to do this is to use focus groups, moderator-facilitated discussions that generate qualitative data. This technique can (a) identify the important areas of a construct’s domain, (b) identify appropriate wordings for items, and (c) corroborate or revise conceptualization of the target construct. |
| Analyzing rating/matching data as item similarity data | Li and Sireci (2013) The authors argue that, compared to traditional content validation ratings/matching data, item similarity ratings are (a) less affected by social desirability and expectancy biases because no content categories are offered and (b) can provide more information about how items group together in multidimensional space. However, having subject matter experts engage in pairwise item similarity comparisons is labor-intensive. The authors offer an innovative method of dummy coding traditional content validation ratings/matching data to essentially derive item similarity data, which is conducive to multidimensional scaling. |
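To illustrate the dummy-coding idea, here is a minimal base-R sketch of our reading of the method (not the authors’ own code; `ratings` is a hypothetical items-by-experts matrix of construct-category assignments):

```r
# Dummy code each expert's item-to-construct assignments so that items
# placed in the same categories become similar, then scale the resulting
# inter-item distances into low-dimensional space.
ratings_df <- as.data.frame(ratings)
dummy <- do.call(cbind, lapply(ratings_df, function(x) 1 * outer(x, unique(x), "==")))
coords <- cmdscale(dist(dummy), k = 2)   # classical multidimensional scaling
plot(coords, xlab = "Dimension 1", ylab = "Dimension 2")
```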
| Conducting pilot studies | |
| Sample size considerations | Johanson and Brooks (2010) Provides a cost-benefit analysis of the trade-off between increasing sample size and narrowing confidence intervals for correlation, proportion, and internal consistency (i.e., coefficient alpha) estimates. Most of the reduction in confidence interval width occurred at sample sizes between 24 and 36. |
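To make the cost-benefit pattern concrete, the width of a correlation’s confidence interval can be traced across sample sizes with the standard Fisher z approximation (an illustrative sketch, not the authors’ code):

```r
# 95% CI width for a correlation r at sample size n (Fisher z approximation)
ci_width <- function(r, n, conf = 0.95) {
  z    <- atanh(r)                       # Fisher z transform
  se   <- 1 / sqrt(n - 3)                # standard error on the z scale
  crit <- qnorm(1 - (1 - conf) / 2)
  diff(tanh(z + c(-1, 1) * crit * se))   # back-transform and take the width
}

sapply(c(12, 24, 36, 100), function(n) ci_width(r = .30, n = n))
# Width drops steeply through the mid-20s to mid-30s and flattens thereafter.
```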
| Measurement precision | |
| Limits of reliability coefficients | Cronbach and Shavelson (2004) Although coefficient alpha is the most widely used index of measurement precision, the authors argue that any coefficient is a crude marker that lacks the nuance necessary to support interpretations in current assessment practice. Instead, they detail a reliability analysis approach whereby observed score variance is decomposed into population (or true score), item, and residual variance, the latter two of which comprise error variance. The authors argue that the standard error of measurement should be reported along with all variance components rather than a coefficient. Given that testing applications often use cut scores, the standard error of measurement offers an intuitive understanding to all stakeholders regarding the precision of each score when making decisions based on absolute rather than comparative standing. |
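For intuition, a minimal sketch of the classical test theory standard error of measurement (the authors themselves derive SEMs from variance components rather than from a reliability coefficient):

```r
# Classical SEM: observed-score SD times sqrt(1 - reliability)
sem <- function(sd_x, rel) sd_x * sqrt(1 - rel)

sem(sd_x = 10, rel = .80)   # 4.47; an observed score of 50 spans roughly
                            # 45.5 to 54.5 within +/- 1 SEM
```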
| Omega/alternatives to alpha | See Section 4: “Alternative Estimates of Measurement Precision.” Key paper: McNeish (2018) |
| | Zhang and Yuan (2016) Both coefficient alpha and omega are often estimated using a sample covariance matrix, and traditional estimation methods are likely biased by outliers and missing observations in the data. The authors offer a software package in the R statistical computing language that allows for estimates of both alpha and omega that are robust against outliers and missing data. |
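A sketch of how the robust estimators might be invoked; the package name, function names, and the varphi tuning argument reflect our reading of the authors’ R software and should be verified against the paper:

```r
# Assumed interface of the authors' package (verify before use)
library(coefficientalpha)
# mydata: an n x p matrix of item responses (placeholder)
alpha(mydata, varphi = 0.1)   # robust alpha; larger varphi downweights outliers more
omega(mydata, varphi = 0.1)   # robust omega; missing data handled internally
```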
| Confidence intervals | Kelley and Pornprasertmanit (2016) Because psychologists are interested in the reliability of the population, not just the sample, estimates should be accompanied by confidence intervals. The authors review the many methods for computing these confidence intervals and run simulations comparing their efficacies. Ultimately, they recommend using hierarchical omega as a reliability estimator and bootstrapped confidence intervals, all of which can be computed in R using the ci.reliability() function of the MBESS package (Kelley, 2016). |
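A minimal sketch of the recommended call (argument names per the MBESS documentation; `mydata` is a placeholder item-response data frame):

```r
library(MBESS)
# Hierarchical omega with a bias-corrected and accelerated (BCa) bootstrap CI
ci.reliability(data = mydata, type = "hierarchical",
               interval.type = "bca", B = 1000, conf.level = 0.95)
```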
| IRT information | See Section 4: “Alternative Estimates of Measurement Precision.” Key paper: Reise et al. (2005) |
| Controlling for transient error | Green (2003) and Schmidt et al. (2003) Whereas random response error comes from factors that vary moment-to-moment (e.g., variations in attention), transient errors come from factors that differ only across testing occasions (e.g., mood). Because coefficient alpha is computed from a single time point, it cannot correct for transient error and may overestimate reliability. Each article provides an alternative reliability statistic that controls for transient error: test-retest alpha (Green, 2003) and the coefficient of equivalence and stability (Schmidt et al., 2003). |
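A rough sketch of the logic behind a coefficient of equivalence and stability for one scale administered on two occasions (our paraphrase of the general approach; see Schmidt et al., 2003, for the exact estimator):

```r
# time1, time2: n x p matrices of the same items at two occasions (placeholders)
half_a <- seq(1, ncol(time1), by = 2)   # split items into two halves
half_b <- seq(2, ncol(time1), by = 2)

# Cross-half, cross-occasion correlations exclude both transient and
# item-specific error from the shared variance
r1 <- cor(rowSums(time1[, half_a]), rowSums(time2[, half_b]))
r2 <- cor(rowSums(time1[, half_b]), rowSums(time2[, half_a]))
r_half <- (r1 + r2) / 2

ces <- 2 * r_half / (1 + r_half)        # Spearman-Brown step-up to full length
```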
| Test-retest reliability | DeSimone (2015) Test-retest correlations between scale scores are limited as assessments of temporal stability. The author introduces several new statistical approaches: (a) computing test-retest correlations among individual scale items, (b) comparing the stability of interitem correlations (SRMRTC) and component loadings (CLTC), and (c) assessing the scale instability that is due to respondents (D2pct) rather than to the scale itself. |
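Approach (a) is simple to implement directly; a minimal sketch, assuming `time1` and `time2` are data frames of the same items administered twice:

```r
# Per-item test-retest correlations rather than one scale-score correlation
item_rtt <- mapply(cor, time1, time2)
sort(round(item_rtt, 2))   # items with low values are temporally unstable
```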
| | Barchard (2012) Test-retest correlations do not capture absolute agreement between scores and can mislead about consistency. The author discusses several statistics for test-retest reliability based on absolute agreement: the root mean square difference [RMSD(A,1)] and concordance correlation coefficient [CCC(A,1)]. These measures are used in other scientific fields (e.g., biology, genetics) but not in psychology, and a supplemental Excel sheet for calculation is provided. |
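Both statistics follow textbook formulas for absolute agreement; a minimal base-R sketch (Barchard’s (A,1) variants may differ in computational details):

```r
# Root mean square difference between two administrations x and y
rmsd <- function(x, y) sqrt(mean((x - y)^2))

# Lin's concordance correlation coefficient: penalizes both low correlation
# and shifts in mean or variance between administrations
ccc <- function(x, y) {
  2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
}
```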
| Item-level reliability | Zijlmans et al. (2018) Reliability is typically calculated for entire scales but can also be computed for individual items. This can help identify unreliable items for removal. The authors investigate four methods for calculating item-level reliability and find that the correction for attenuation and Molenaar–Sijtsma methods performed best, estimating item reliability with very little bias and a reasonable amount of variability. |
| Assessing factor structure | |
| Factor analysis practices | Sellbom and Tellegen (2019) The authors provide a timely review of the issues and “pitfalls” in current factor analysis practices in psychology. Guidance is provided for (a) selecting proper indicators (e.g., analyzing item distributions, parceling), (b) estimation (e.g., alternatives to maximum likelihood), and (c) model evaluation and comparison. The authors conclude with a discussion of two alternatives to traditional factor analysis: exploratory structural equation modeling and bifactor modeling. |
| Exploratory factor analysis | Henson and Roberts (2006) The authors briefly review four main decisions to be made when conducting exploratory factor analysis. Then they offer seven best practice recommendations for reporting how an exploratory factor analysis was conducted after reviewing reporting deficiencies found in four journals. |
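A minimal sketch of such a workflow with the psych package (our example, not the authors’; `mydata` is a placeholder item data frame):

```r
library(psych)
fa.parallel(mydata, fa = "fa")      # parallel analysis to choose the number of factors
efa <- fa(mydata, nfactors = 3,     # extraction and rotation decisions made explicit
          rotate = "oblimin", fm = "ml")
print(efa$loadings, cutoff = .30)   # report the full pattern matrix, not just high loadings
```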
| Exploratory factor analysis for scale revision | Reise et al. (2000) The authors provide guidance on EFA procedures when revising a scale. Specifically, they offer guidance on (a) introducing new items, (b) sample selection, (c) factor extraction, (d) factor rotation, and (e) evaluating the revised scale. However, researchers first need to articulate why the revision is needed and pinpoint where the construct resides in the conceptual hierarchy. |
| Cluster analysis for dimensionality | Cooksey and Soutar (2006) The authors revive Revelle’s (1978) ICLUST clustering technique as a way to explore the dimensional structure of scale items. The end product is a tree-like graphic that represents the relations among the scale items, which the authors argue is easier to interpret than alternatives (e.g., tables of factor loadings). |
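ICLUST is implemented in the psych package for R; a minimal sketch (`mydata` is a placeholder item data frame):

```r
library(psych)
ic <- iclust(mydata)   # hierarchical clustering of items; draws the tree diagram
iclust.diagram(ic)     # redraw the dendrogram-style graphic on demand
```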
| Unidimensionality | Raykov and Pohl (2013) Some measures may not demonstrate unidimensionality when assessed by fitting a one-factor model to the data, owing to method or substantive specific factors. The article offers a way to estimate how much of the observed variance in the overall instrument is explained by a common factor and thus whether the instrument can be treated as essentially homogeneous. Mplus and R code are provided to create point and interval estimates of the variance explained by common and specific factors and to calculate the difference between these proportions. |
| | Ferrando and Lorenzo-Seva (2019) Measures are often intended to be unidimensional, but the obtained data are found to be better described by multiple correlated factors (or vice versa). Standard goodness-of-fit assessments (a) are arguably insufficient to adjudicate between the competing solutions and (b) use only internal (i.e., item score) information. The authors propose using external variables (e.g., criteria) to provide evidence for unidimensionality: a procedure is described to (a) derive primary factor score estimates, (b) derive a second-order factor score estimate, and (c) regress criteria on both. A lack of differential or incremental prediction of criteria by the primary factor score estimates beyond the second-order factor score estimate would constitute evidence for unidimensionality. |
| | Ferrando and Lorenzo-Seva (2018) The authors introduce a program for determining construct replicability, the degree of factor indeterminacy, the reliability of factor score estimates, and explained common variance as an index of unidimensionality. In turn, this has implications for deriving individual scores (i.e., factor score estimates) using exploratory rather than confirmatory factor analysis, the latter of which they argue rests on the unrealistic assumption of simple structure. |
| Influence of item wording | McPherson and Mohr (2005) Including both positively and negatively worded items in scales is common practice but can produce artifactual factors in dimensionality assessments. The authors show that items with more extreme wording (e.g., “I’m always optimistic about the future” vs. “I’m usually optimistic about the future”) can result in greater multidimensionality for the same target construct. The authors urge scale developers to be aware of these issues and provide recommendations for addressing them. |
| Creating short forms | |
| Using IRT information | See Section 4: “Alternative Estimates of Measurement Precision.” Key paper: Edelen and Reeve (2007) |
| Ant colony optimization | See Section 5: “Maximizing Validity in Short Forms Using Ant Colony Optimization.” Key paper: Leite et al. (2008) |
| Empirical relations with variables (e.g., nomological network, criterion-related validity) |
| Construct proliferation | Shaffer et al. (2016) Constructs proliferate when discriminant validity is not sufficiently tested. This can happen when (a) important pre-existing constructs are left out of the test or (b) measurement error falsely implies distinct constructs by artificially lowering observed correlations. Remedies for this include (a) making sure all relevant pre-existing constructs have been included, (b) using statistical techniques that account for measurement error (CFA, coefficient of equivalence and stability), and (c) carefully interpreting the results of discriminant validation tests. |
| | Raykov et al. (2016) The authors challenge the traditional way of assessing construct “congruence” or redundancy, namely fitting a one-factor model to data from measures purportedly measuring two constructs and examining overall fit. Instead, they recommend comparing nested models, where one- and two-factor solutions are fitted and corrected chi-square difference tests are conducted (see the sketch below). How evidence of construct congruence should be interpreted, the authors note, is best left to subject matter experts in the substantive domain. |
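A minimal lavaan sketch of the recommended comparison (our example; item names x1–x6 and y1–y6 and the data frame `mydata` are placeholders):

```r
library(lavaan)

two_factor <- 'F1 =~ x1 + x2 + x3 + x4 + x5 + x6
               F2 =~ y1 + y2 + y3 + y4 + y5 + y6'
one_factor <- 'F1 =~ x1 + x2 + x3 + x4 + x5 + x6 +
                     y1 + y2 + y3 + y4 + y5 + y6'

fit2 <- cfa(two_factor, data = mydata, estimator = "MLR")
fit1 <- cfa(one_factor, data = mydata, estimator = "MLR")
anova(fit1, fit2)   # scaled (corrected) chi-square difference test
```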
| Incremental validation | Smith et al. (2003) The authors discuss five principles of incremental validation pertinent to scale construction: “(a) careful, precise articulation of each element or facet within the content domain; (b) reliable measurement of each facet through use of multiple, alternate-form items; (c) examination of incremental validity at the facet level rather than the broad construct level; (d) use of items that represent single facets rather than combinations of facets; and (e) empirical examination of whether there is a broad construct or a combination of separate constructs” (p. 467). |
| Hunsley and Meyer (2003) The authors review theoretical, design, and statistical issues when conducting incremental validation. Of key importance is the choice of criterion. The criterion should be reliable, and researchers should also be wary of the variety of methodological artifacts that can influence incremental validation results (e.g., criterion contamination, “source overlap”). |