Ulrich Schroeders, Christoph Schmidt, Timo Gnambs.
Abstract
Careless responding is a bias in survey responses that disregards the actual item content, constituting a threat to the factor structure, reliability, and validity of psychological measurements. Different approaches have been proposed to detect aberrant responses, such as probing questions that directly assess test-taking behavior (e.g., bogus items), auxiliary or paradata (e.g., response times), or data-driven statistical techniques (e.g., Mahalanobis distance). In the present study, gradient boosted trees, a state-of-the-art machine learning technique, are introduced to identify careless respondents. The performance of the approach was compared with established techniques previously described in the literature (e.g., statistical outlier methods, consistency analyses, and response pattern functions) using simulated data and empirical data from a web-based study, in which diligent versus careless response behavior was experimentally induced. In the simulation study, gradient boosting machines outperformed traditional detection mechanisms in flagging aberrant responses. However, this advantage did not transfer to the empirical study. In terms of precision, the results of both traditional and the novel detection mechanisms were unsatisfactory, although the latter incorporated response times as additional information. The comparison between the results of the simulation and the online study showed that responses in real-world settings seem to be much more erratic than can be expected from simulation studies. We critically discuss the generalizability of currently available detection methods and provide an outlook on future research on the detection of aberrant response patterns in survey research.
Keywords: careless responding; data cleaning; gradient boosted trees; outlier detection; response times
Year: 2021 PMID: 34992306 PMCID: PMC8725053 DOI: 10.1177/00131644211004708
Source DB: PubMed Journal: Educ Psychol Meas ISSN: 0013-1644 Impact factor: 2.821
Overview of Data-Driven Mechanisms to Detect Careless Respondents.
| Index (Abbr.) | Description | Strengths | Weaknesses | Key references |
|---|---|---|---|---|
| Statistical outlier functions | | | | |
| Mahalanobis distance (Maha.) | Multivariate distance between a respondent’s response vector and the vector of sample means | • Easy to calculate and understand | • Effective only for truly random responses | |
| Consistency analysis | | | | |
| Psychometric synonym/antonym score (Ant.) | Within-person correlation between highly correlated item pairs (e.g., semantic synonyms/antonyms) | • Good, sensitive detection of careless respondents | • Requires similarly/contrastingly worded items | |
| Even–odd consistency (EvenOdd) | Within-person correlation across unidimensional subscales formed by even–odd split halves | | • High scale dependency • Relies on unidimensional scales | |
| Intraindividual response variability (IRV) | Intraindividual standard deviation across a set of consecutive item responses | • Very easy to calculate and understand | • Should be calculated across multiple constructs and reverse-coded items | |
| Response pattern functions | | | | |
| Longstring (Long.) | Maximum (or average) number of consecutive items answered with the same response option | • Easy to calculate and understand | • Requires a uniform response scale | |
| Number of Guttman errors | Number of item pairs that behave contrary to expectations regarding the solution probabilities | • Nonparametric version | • Relies on large sample sizes | |
| Polytomous person fit (nonparametric) | Extent to which a person’s nonparametric polytomous IRT estimate matches the probability of correctly solving items | • Nonparametric version | • Relies on large sample sizes | |
| Polytomous IRT person-fit statistic (Zh) | Extent to which a person’s polytomous IRT estimate corresponds to the probability of correctly solving items | • Uses information on the structure of the measure | • Relies on large sample sizes | |
Note. References marked with an asterisk (*) are key references. The intraindividual response variability is also known as inter-item standard deviation and can also be classified as a response pattern function. The abbreviation in parentheses is used in the Results section. IRT = item response theory.
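Several of the indices in the table can be computed in a few lines. A minimal sketch of the longstring index, the IRV, and the Mahalanobis distance; function names and the numpy-based implementation are illustrative, not the paper's code:

```python
# Illustrative implementations of three careless-responding indices
# from the table above (not the authors' original code).
import numpy as np

def longstring(responses):
    """Maximum run of consecutive identical response options."""
    longest = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def irv(responses):
    """Intraindividual standard deviation across item responses."""
    return float(np.std(responses))

def mahalanobis(data):
    """Distance of each response vector from the vector of sample means."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(data, rowvar=False))
    # d_i = sqrt( c_i^T  Sigma^-1  c_i ) for each centered row c_i
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))
```

In practice each index is compared against a cutoff (e.g., a chi-square quantile for the Mahalanobis distance) to flag a respondent as careless.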
Figure 1. Example of a decision tree with two nodes and three leaves.
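A tree of the shape shown in Figure 1 (two decision nodes, three leaves) can be sketched as nested conditionals; the features and thresholds below are hypothetical illustrations, not the splits fitted in the paper:

```python
# A decision tree with two decision nodes and three leaves, as in Figure 1.
# The features (IRV score, longstring count) and thresholds are hypothetical.
def classify(irv_score, longstring_count):
    if irv_score < 0.5:           # node 1: very low response variability
        return "careless"         # leaf 1: e.g., straightlining
    if longstring_count >= 6:     # node 2: long run of identical answers
        return "careless"         # leaf 2
    return "diligent"             # leaf 3
```

A gradient boosting machine fits many such shallow trees sequentially, each one trained on the residual errors of the ensemble built so far.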
Classification Accuracy of Traditional and Machine Learning Algorithms With Simulated Data.
| | Maha. | Antonyms | EvenOdd | Longstring | IRV | Zh | GBM |
|---|---|---|---|---|---|---|---|
| Random respondents | | | | | | | |
| Accuracy | 1.00 (.00) | .87 (.02) | .88 (.02) | .83 (.02) | .96 (.01) | .92 (.02) | .96 (.01) |
| Sensitivity | 1.00 (.01) | .11 (.07) | .49 (.12) | .01 (.03) | .80 (.07) | 1.00 (.00) | .74 (.11) |
| Specificity | 1.00 (.00) | .96 (.02) | .93 (.02) | .93 (.02) | .98 (.01) | .91 (.02) | .99 (.01) |
| Precision | .97 (.04) | .22 (.15) | .43 (.09) | .02 (.04) | .80 (.07) | .55 (.05) | .90 (.08) |
| Balanced accuracy | 1.00 (.01) | .53 (.04) | .71 (.06) | .47 (.02) | .89 (.04) | .95 (.01) | .86 (.05) |
| Midpoint respondents | | | | | | | |
| Accuracy | .91 (.01) | .90 (.02) | .88 (.02) | .87 (.02) | .94 (.02) | .74 (.03) | .98 (.01) |
| Sensitivity | .18 (.08) | .27 (.10) | .50 (.12) | .37 (.11) | .68 (.09) | .51 (.12) | .88 (.08) |
| Specificity | .99 (.01) | .97 (.02) | .93 (.02) | .93 (.02) | .96 (.01) | .77 (.02) | .99 (.01) |
| Precision | .65 (.22) | .48 (.16) | .43 (.09) | .36 (.10) | .68 (.09) | .20 (.04) | .90 (.07) |
| Balanced accuracy | .59 (.04) | .62 (.05) | .71 (.06) | .65 (.06) | .82 (.05) | .64 (.06) | .94 (.04) |
| Pattern respondents | | | | | | | |
| Accuracy | .87 (.01) | .80 (.06) | .86 (.02) | .85 (.02) | .95 (.02) | .72 (.03) | .96 (.02) |
| Sensitivity | .03 (.04) | .08 (.13) | .10 (.08) | .20 (.09) | .76 (.10) | .50 (.11) | .93 (.08) |
| Specificity | .96 (.01) | .88 (.06) | .93 (.02) | .93 (.02) | .97 (.01) | .75 (.03) | .96 (.02) |
| Precision | .09 (.11) | .06 (.10) | .11 (.08) | .23 (.10) | .76 (.10) | .18 (.04) | .74 (.10) |
| Balanced accuracy | .50 (.02) | .48 (.07) | .51 (.04) | .56 (.05) | .87 (.06) | .63 (.06) | .95 (.04) |
Note. Results are means and standard deviations across 1,000 simulated data sets (ntest = 180 with 10% careless respondents). Maha. = Mahalanobis distance; Antonyms = psychometric antonyms; EvenOdd = even–odd consistency; Longstring = Longstring Index; IRV = intraindividual response variability; Zh = polytomous IRT person-fit statistic; GBM = gradient boosting machine. IRT = item response theory.
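The three careless styles compared in the table could be simulated along these lines, assuming a 5-point Likert scale; seed and item count are arbitrary choices for illustration:

```python
# Hypothetical generators for the three careless response styles in the
# simulation study: random, midpoint, and pattern responding (1-5 scale).
import numpy as np

rng = np.random.default_rng(42)
n_items = 20

random_resp = rng.integers(1, 6, size=n_items)         # uniform picks from 1..5
midpoint_resp = np.full(n_items, 3)                    # always the scale midpoint
pattern_resp = np.tile([1, 2, 3, 4, 5], n_items // 5)  # repeating fixed sequence
```

The contrast between the styles explains the table's pattern of results: random responding is easy for outlier indices such as the Mahalanobis distance, whereas midpoint and pattern responding produce internally consistent vectors that evade them.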
Classification Accuracy of Traditional and Machine Learning Algorithms With Empirical Data (10% Prevalence).
| | Maha. | Antonyms | EvenOdd | Longstring | IRV | Zh | GBM (Res) | GBM (RT) | GBM (Res + RT) |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | .80 (.02) | .86 (.02) | .84 (.02) | .82 (.02) | .83 (.01) | .67 (.02) | .59 (.04) | .70 (.05) | .70 (.04) |
| Sensitivity | .18 (.08) | .11 (.08) | .19 (.09) | .18 (.09) | .14 (.07) | .38 (.11) | .61 (.12) | .56 (.12) | .60 (.12) |
| Specificity | .87 (.02) | .94 (.02) | .91 (.02) | .90 (.02) | .90 (.01) | .70 (.02) | .58 (.04) | .71 (.05) | .71 (.05) |
| Precision | .13 (.06) | .17 (.11) | .19 (.08) | .16 (.07) | .14 (.07) | .12 (.03) | .14 (.02) | .18 (.04) | .19 (.04) |
| Balanced accuracy | .53 (.04) | .53 (.04) | .55 (.04) | .54 (.04) | .52 (.04) | .54 (.06) | .60 (.06) | .64 (.06) | .66 (.06) |
Note. Results are means and standard deviations across 1,000 random test samples of the empirical data (ntest = 180 with 10% careless respondents). Maha. = Mahalanobis distance; Antonyms = psychometric antonyms; EvenOdd = even–odd consistency; Longstring = Longstring Index; IRV = intraindividual response variability; Zh = polytomous IRT person-fit statistic; GBM = gradient boosting machine; Res = responses; RT = response times. IRT = item response theory.
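The metrics reported in both tables follow directly from the confusion matrix of binary careless flags; a pure-Python sketch (function name is illustrative):

```python
# Classification metrics used in the tables, computed from binary flags
# (1 = careless, 0 = diligent).
def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)   # share of careless cases flagged
    specificity = tn / (tn + fp)   # share of diligent cases passed
    precision = tp / (tp + fp)     # share of flagged cases truly careless
    balanced = (sensitivity + specificity) / 2
    return sensitivity, specificity, precision, balanced
```

With the 10% careless prevalence used here, accuracy alone is misleading (flagging nobody already yields .90), which is why the tables emphasize balanced accuracy and precision.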
Figure 2. Specificity, sensitivity, and balanced accuracy across detection mechanisms. Maha. = Mahalanobis distance; Ant. = psychometric antonyms; EvenOdd = even–odd consistency; Long. = Longstring Index; IRV = intraindividual response variability; Zh = polytomous IRT person-fit statistic; GBM = gradient boosting machine; Res = responses; RT = response times. Left side: The boxplot reflects the interquartile range, the solid line represents the median, and the whiskers represent minimum/maximum values within 1.5 times the interquartile range. Right side: Jittered point plot of 100 randomly drawn values. IRT = item response theory.