Alene K Rhea1,2, Kelsey Markey1,2, Lauren D'Arinzo1,2,3, Hilke Schellmann4, Mona Sloane2, Paul Squires5, Falaah Arif Khan1,2, Julia Stoyanovich1,2,6.
Abstract
Automated hiring systems are among the fastest-developing of all high-stakes AI systems. Among these are algorithmic personality tests that use insights from psychometric testing, and promise to surface personality traits indicative of future success based on job seekers' resumes or social media profiles. We interrogate the validity of such systems using stability of the outputs they produce, noting that reliability is a necessary, but not a sufficient, condition for validity. Crucially, rather than challenging or affirming the assumptions made in psychometric testing - that personality is a meaningful and measurable construct, and that personality traits are indicative of future success on the job - we frame our audit methodology around testing the underlying assumptions made by the vendors of the algorithmic personality tests themselves. Our main contribution is the development of a socio-technical framework for auditing the stability of algorithmic systems. This contribution is supplemented with an open-source software library that implements the technical components of the audit, and can be used to conduct similar stability audits of algorithmic systems. We instantiate our framework with the audit of two real-world personality prediction systems, namely, Humantic AI and Crystal. The application of our audit framework demonstrates that both these systems show substantial instability with respect to key facets of measurement, and hence cannot be considered valid testing instruments.
Keywords: Algorithm Audit; Hiring; Personality; Reliability; Stability; Validity
Year: 2022 PMID: 36161238 PMCID: PMC9483468 DOI: 10.1007/s10618-022-00861-0
Source DB: PubMed Journal: Data Min Knowl Discov ISSN: 1384-5810 Impact factor: 5.406
Fig. 1 Socio-technical framework for stability auditing, discussed in detail in Sect. 3
Fig. 2 Overview of the technical framework, implemented by our open-source library
Resume versions used as input
| Version | File Format | Pre-Processing |
|---|---|---|
| Original | Various | None |
| De-Identified | PDF | Remove identifiers (name, phone, email, social media links, usernames). Save as PDF. |
| Raw Text | Raw Text | Copy text. |
| PDF | PDF | Save as PDF (if original in other format). |
| DOCX | DOCX | Remove identifiers (name, phone, email, social media links, usernames). Save as DOCX. |
| URL-Embedded | PDF | Remove identifiers (name, phone, email, social media accounts, LinkedIn URL). Insert hyperlinked LinkedIn URL into beginning of document. Save as PDF. |
Summary of stability results for Crystal and Humantic AI, with respect to facets of measurement from Sect. 4.3.
| Facet | Crystal | Humantic AI | Details |
|---|---|---|---|
| Resume file format | | | Sect. |
| LinkedIn URL in resume | ? | | Sect. |
| Source context | | | Sect. |
| Algorithm-time / immediate | | | Sect. |
| Algorithm-time / 31 days | | | Sect. |
| Participant-time / LinkedIn | | | Sect. |
| Participant-time / Twitter | N/A | | Sect. |
“✓” indicates both sufficient rank-order stability (r ≥ 0.90) and sufficient locational stability (p ≥ α) in all traits, “✗” indicates either insufficient rank-order stability (r < 0.90) or significant locational instability (p < α) in at least one trait, and “?” indicates the facet was not tested in our audit.
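As a rough illustration of how the marks in this legend could be assigned, the sketch below applies the two criteria trait by trait. The 0.90 cutoff is borrowed from the notes under the rank-order tables ("Reliabilities below 0.90"). `TraitResult`, `facet_mark`, and the `alpha_corrected` argument are hypothetical names introduced here, not part of the authors' library, and the corrected threshold passed in the example is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional

R_MIN = 0.90  # rank-order cutoff, taken from the "below 0.90" table notes


@dataclass
class TraitResult:
    trait: str
    spearman_r: float          # rank-order stability for this trait
    p_value: Optional[float]   # Wilcoxon p-value; None when no score changed ("N/A")


def facet_mark(results: List[TraitResult], alpha_corrected: float) -> str:
    """Return "✓", "✗", or "?" for one facet (hypothetical helper)."""
    if not results:
        return "?"  # facet was not tested
    for res in results:
        if res.spearman_r < R_MIN:
            return "✗"  # insufficient rank-order stability in some trait
        if res.p_value is not None and res.p_value < alpha_corrected:
            return "✗"  # significant locational instability in some trait
    return "✓"


# Crystal, algorithm-time / 31 days: all correlations are 1.0000 and the
# p-values are "N/A" because no score changed. The threshold is a placeholder.
crystal_algorithm_time = [TraitResult(t, 1.0, None) for t in "DISC"]
print(facet_mark(crystal_algorithm_time, alpha_corrected=0.001))  # prints ✓
```

A single unstable trait is enough to fail the whole facet, which mirrors the "in at least one trait" wording of the legend.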
Fig. 3 Screenshots of the Humantic AI “opt out” feature
Fig. 4 Comparison of Crystal output across the resume file format facet. Note evidence of discontinuous measurement in DiSC Steadiness and Conscientiousness, with some participants’ scores moving between clusters with different file formats
Fig. 5 (a) Humantic AI Dominance scores from de-identified and URL-embedded resumes. (b) Humantic AI Extraversion scores produced by de-identified resumes and LinkedIn profiles
Fig. 6 Normalized L1 distances between Humantic AI DiSC and Big Five scores produced from pairs of treatments that vary with respect to their input source
Fig. 7 Normalized L1 distances between Crystal DiSC scores produced from LinkedIn profiles scored 8–10 months apart
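The captions above do not spell out how the L1 distances are normalized. One plausible reading, sketched below as an assumption only, divides the absolute score difference by the width of the score scale so that distances are comparable across instruments. The function name `normalized_l1` and the 0 to 100 scale in the example are hypothetical.

```python
def normalized_l1(score_a: float, score_b: float,
                  scale_min: float, scale_max: float) -> float:
    """Absolute difference between two scores, divided by the scale width.

    Assumption: normalization is by the range of possible scores.
    """
    span = scale_max - scale_min
    if span <= 0:
        raise ValueError("scale_max must be greater than scale_min")
    return abs(score_a - score_b) / span


# Hypothetical example: one participant scored 70 from a resume and 55 from
# a LinkedIn profile on an assumed 0-100 scale.
distance = normalized_l1(70, 55, scale_min=0, scale_max=100)
print(distance)
```

Under this reading a distance of 0 means identical scores and 1 means the two treatments landed at opposite ends of the scale.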
Humantic AI runs (i.e., sets of score-generating calls to Humantic AI models)
| Input type | Run ID | Run dates | # Inputs | # Outputs |
|---|---|---|---|---|
| Original Resume | HRo1 | 11/23/2020 - 01/14/2021 | 89 | 88 |
| De-Identified Resume | HRi1 | 03/20/2021 - 03/28/2021 | 89 | 89 |
| De-Identified Resume | HRi2 | 04/20/2021 - 04/28/2021 | 89 | 89 |
| De-Identified Resume | HRi3 | 04/20/2021 - 04/28/2021 | 89 | 89 |
| DOCX Resume | HRd1 | 03/20/2021 - 03/28/2021 | 89 | 89 |
| URL-Embedded Resume | HRu1 | 04/09/2021 - 04/11/2021 | 86 | 86 |
| LinkedIn | HL1 | 11/23/2020 - 01/14/2021 | 92 | 88 |
| LinkedIn | HL2 | 08/10/2021 - 08/11/2021 | 92 | 91 |
| Twitter | HT1 | 11/23/2020 - 01/14/2021 | 32 | 21 |
| Twitter | HT2 | 08/10/2021 - 08/11/2021 | 32 | 21 |
Demographics of study participants: gender and race
| | Gender: Male | Gender: Female | Gender: Other | Race: Asian | Race: White | Race: Other | Race: No answer |
|---|---|---|---|---|---|---|---|
| N | 56 | 36 | 2 | 57 | 24 | 12 | 1 |
| % | 59.6 | 38.3 | 2.1 | 60.6 | 25.5 | 12.8 | 1.0 |
Demographics of study participants: birth country and primary language
| | Birth Country: India | Birth Country: USA | Birth Country: China | Birth Country: Other | Birth Country: No answer | Primary Language: English | Primary Language: Other |
|---|---|---|---|---|---|---|---|
| N | 34 | 28 | 12 | 18 | 2 | 60 | 34 |
| % | 36.2 | 29.8 | 12.8 | 19.1 | 2.1 | 63.8 | 36.2 |
Crystal runs (i.e., sets of score-generating calls to Crystal models)
| Input type | Run ID | Run dates | # Inputs | # Outputs |
|---|---|---|---|---|
| Raw Text Resume | CRr1 | 03/31/2021 - 04/02/2021 | 89 | 89 |
| Raw Text Resume | CRr2 | 05/01/2021 - 05/03/2021 | 89 | 89 |
| Raw Text Resume | CRr3 | 05/01/2021 - 05/03/2021 | 89 | 89 |
| PDF Resume | CRp1 | 11/23/2020 - 01/14/2021 | 89 | 89 |
| LinkedIn | CL1 | 11/23/2020 - 01/14/2021 | 92 | 91 |
| LinkedIn | CL2 | 09/13/2021 - 09/16/2021 | 89 | 89 |
Rank-order stability of Crystal DiSC scores, as measured by Spearman’s rank correlations. Columns are labeled D (Dominance), I (Influence), S (Steadiness), and C (Conscientiousness / Calculativeness).
| Facet | Input Versions | N | D | I | S | C |
|---|---|---|---|---|---|---|
| File Format | Raw Text vs. PDF Resume (CRr1 vs. CRp1) | 89 | ||||
| Source Context | PDF Resume vs. LinkedIn (CRp1 vs. CL1) | 86 | ||||
| Immediate Rep. | Raw Text Resume back-to-back (CRr2 vs. CRr3) | 89 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Algorithm-Time | Raw Text Resume 31 days apart (CRr1 vs. CRr2) | 89 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Participant-Time | LinkedIn 8–10 months apart (CL1 vs. CL2) | 89 |
Reliabilities below 0.90 highlighted in bold; those between 0.90 and 0.95 highlighted in italic. Results are discussed in Sects. 5.3, 5.4, 5.5, 5.6, and 5.7
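To make the rank-order measure concrete, here is a minimal standard-library sketch of Spearman's rank correlation between two runs over the same participants, with ties given average ranks. The function names and the score lists are illustrative, not taken from the authors' library or data. The 0.90 cutoff mirrors the note above.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one run has no variation
    return cov / (vx * vy) ** 0.5


# Illustrative scores for five participants under three treatments.
run_a = [10, 20, 30, 40, 50]
run_b = [12, 18, 33, 41, 49]   # same ordering as run_a
run_c = [50, 20, 30, 40, 10]   # first and last participants swapped

for name, other in [("run_b", run_b), ("run_c", run_c)]:
    rho = spearman(run_a, other)
    print(name, round(rho, 4), "stable" if rho >= 0.90 else "unstable")
```

Because only ranks enter the calculation, a run that shifts every score by the same amount still yields a correlation of 1, which is why the audit pairs this measure with a separate locational test.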
Rank-order stability of Humantic AI DiSC scores, as measured by Spearman’s rank correlations. Columns are labeled D (Dominance), I (Influence), S (Steadiness), and C (Conscientiousness / Calculativeness).
| Facet | Input Versions | N | D | I | S | C |
|---|---|---|---|---|---|---|
| File Format | De-Identified Resume vs. DOCX Resume (HRi1 vs. HRd1) | 89 | 0.9956 | 0.9924 | 0.9978 | 0.9959 |
| URL Embedding | URL-Embedded Resume vs. De-Identified Resume (HRu1 vs. HRi1) | 86 | ||||
| URL Embedding | URL-Embedded Resume vs. LinkedIn (HRu1 vs. HL1) | 83 | ||||
| Source Context | De-Identified Resume vs. LinkedIn (HRi1 vs. HL1) | 84 | ||||
| Source Context | Original Resume vs. LinkedIn (HRo1 vs. HL1) | 84 | ||||
| Source Context | Original Resume vs. Twitter (HRo1 vs. HT1) | 20 | ||||
| Source Context | LinkedIn vs. Twitter (HL1 vs. HT1) | 18 | ||||
| Immediate Rep. | De-Identified Resume back-to-back (HRi2 vs. HRi3) | 89 | 0.9999 | 1.0000 | 1.0000 | 1.0000 |
| Algorithm-Time | De-Identified Resume 31 days apart (HRi1 vs. HRi2) | 89 | 0.9726 | 0.9948 | 0.9925 | 0.9980 |
| Participant-Time | LinkedIn 7–9 months apart (HL1 vs. HL2) | 88 | ||||
| Participant-Time | Twitter 7–9 months apart (HT1 vs. HT2) | 21 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Reliabilities below 0.90 highlighted in bold. Results are discussed in Sects. 5.3, 5.4, 5.5, 5.6, and 5.7
Rank-order stability of Humantic AI Big Five scores, as measured by Spearman’s rank correlations. Columns are labeled O (Openness), C (Conscientiousness), E (Extraversion), A (Agreeableness), and S (Emotional Stability).
| Facet | Input Versions | N | O | C | E | A | S |
|---|---|---|---|---|---|---|---|
| File Format | De-Identified vs. DOCX Resume (HRi1 vs. HRd1) | 89 | 0.9891 | 0.9936 | 0.9939 | 0.9927 | 0.9816 |
| URL Embedding | URL-Embedded vs. De-Identified Resume (HRu1 vs. HRi1) | 86 | |||||
| URL Embedding | URL-Embedded vs. LinkedIn (HRu1 vs. HL1) | 83 | |||||
| Source Context | De-Identified Resume vs. LinkedIn (HRi1 vs. HL1) | 84 | |||||
| Source Context | Original Resume vs. LinkedIn (HRo1 vs. HL1) | 84 | |||||
| Source Context | Original Resume vs. Twitter (HRo1 vs. HT1) | 20 | - | - | - | ||
| Source Context | LinkedIn vs. Twitter (HL1 vs. HT1) | 18 | - | - | - | ||
| Immediate Rep. | De-Identified Resume back-to-back (HRi2 vs. HRi3) | 89 | 1.0000 | 1.0000 | 1.0000 | 0.9999 | 1.0000 |
| Algorithm-Time | De-Identified Resume 31 days apart (HRi1 vs. HRi2) | 89 | 0.9954 | 0.9969 | 0.9618 | 0.9921 | 0.9854 |
| Participant-Time | LinkedIn 7–9 months apart (HL1 vs. HL2) | 88 | |||||
| Participant-Time | Twitter 7–9 months apart (HT1 vs. HT2) | 21 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Reliabilities below 0.90 highlighted in bold. Results are discussed in Sects. 5.3, 5.4, 5.5, 5.6, and 5.7
Significance of locational instability in Crystal DiSC scores, as measured by two-tailed Wilcoxon signed-rank test p-values. Columns are labeled D (Dominance), I (Influence), S (Steadiness), and C (Conscientiousness / Calculativeness).
| Facet | Input Versions | N | D | I | S | C |
|---|---|---|---|---|---|---|
| File Format | Raw Text vs. PDF Resume (CRr1 vs. CRp1) | 89 | 0.5026 | 0.4208 | 0.0173 | 0.0370 |
| Source Context | PDF Resume vs. LinkedIn (CRp1 vs. CL1) | 86 | 0.4190 | 0.0012 | 0.7010 | 0.8421 |
| Immediate Rep. | Raw Text Resume back-to-back (CRr2 vs. CRr3) | 89 | N/A | N/A | N/A | N/A |
| Algorithm-Time | Raw Text Resume 31 days apart (CRr1 vs. CRr2) | 89 | N/A | N/A | N/A | N/A |
| Participant-Time | LinkedIn 8–10 months apart (CL1 vs. CL2) | 89 | 0.7299 | 0.6518 | 0.3305 | 0.2870 |
The absence of bold highlighting indicates that all values are above both the Benjamini–Hochberg and Bonferroni-corrected thresholds based on an α of 0.05. “N/A” values reflect experiments where there was zero change across the facet. Results are discussed in Sects. 5.3, 5.4, 5.5, 5.6, and 5.7
Significance of locational instability in Humantic AI DiSC scores, as measured by two-tailed Wilcoxon signed-rank test p-values. Columns are labeled D (Dominance), I (Influence), S (Steadiness), and C (Conscientiousness / Calculativeness).
| Facet | Input Versions | N | D | I | S | C |
|---|---|---|---|---|---|---|
| File Format | De-Identified vs. DOCX Resume (HRi1 vs. HRd1) | 89 | 0.2510 | 0.2940 | 0.4574 | 0.2539 |
| URL Embedding | URL-Embedded vs. De-Identified Resume (HRu1 vs. HRi1) | 86 | 0.3194 | |||
| URL Embedding | URL-Embedded Resume vs. LinkedIn (HRu1 vs. HL1) | 83 | 0.1825 | 0.5324 | 0.1213 | |
| Source Context | De-Identified Resume vs. LinkedIn (HRi1 vs. HL1) | 84 | 0.0580 | 0.3259 | ||
| Source Context | Original Resume vs. LinkedIn (HRo1 vs. HL1) | 84 | 0.2299 | 0.5911 | ||
| Source Context | Original Resume vs. Twitter (HRo1 vs. HT1) | 20 | 0.5706 | 0.3118 | 0.1975 | 0.6874 |
| Source Context | LinkedIn vs. Twitter (HL1 vs. HT1) | 18 | 0.0342 | 0.3247 | 0.6095 | 0.5539 |
| Immediate Rep. | De-Identified Resume back-to-back (HRi2 vs. HRi3) | 89 | 0.3173 | 0.3173 | N/A | N/A |
| Algorithm-Time | De-Identified Resume 31 days apart (HRi1 vs. HRi2) | 89 | 0.1416 | 0.5971 | 0.5690 | 0.0307 |
| Participant-Time | LinkedIn 7–9 months apart (HL1 vs. HL2) | 88 | 0.0709 | 0.0800 | 0.3457 | 0.2969 |
| Participant-Time | Twitter 7–9 months apart (HT1 vs. HT2) | 21 | N/A | N/A | N/A | N/A |
Bold highlighting indicates a value below the Bonferroni-corrected threshold based on an α of 0.05. Italic indicates a p-value below the Benjamini–Hochberg corrected threshold and above the Bonferroni-corrected threshold. “N/A” values reflect experiments where there was zero change across the facet. Results are discussed in Sects. 5.3, 5.4, 5.5, 5.6, and 5.7
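The highlighting scheme in this note can be sketched as follows: Bonferroni compares each p-value against α divided by the number of tests, while the Benjamini–Hochberg step-up procedure finds the largest rank k whose sorted p-value is at most (k / m) · α. The p-values below are made up for illustration, and the family of tests the authors actually corrected over is not stated here, so the family size in this sketch is an assumption.

```python
def classify_p_values(p_values, alpha=0.05):
    """Label each p-value "bold", "italic", or "plain" per the table note.

    bold   : below the Bonferroni-corrected threshold (alpha / m)
    italic : rejected by Benjamini-Hochberg but not by Bonferroni
    plain  : not significant under either correction
    """
    m = len(p_values)
    bonferroni_cutoff = alpha / m

    # Benjamini-Hochberg: largest rank k with p_(k) <= (k / m) * alpha.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    bh_rejected = set(order[:k_max])

    labels = []
    for i, p in enumerate(p_values):
        if p < bonferroni_cutoff:
            labels.append("bold")
        elif i in bh_rejected:
            labels.append("italic")
        else:
            labels.append("plain")
    return labels


# Hypothetical family of six tests (not values from the tables).
example = [0.30, 0.001, 0.04, 0.012, 0.80, 0.02]
print(classify_p_values(example))
# prints ['plain', 'bold', 'plain', 'italic', 'plain', 'italic']
```

In this example 0.04 is below the uncorrected α of 0.05 but survives neither correction, which is the kind of value that appears without highlighting in the tables.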
Significance of locational instability in Humantic AI Big Five scores, as measured by two-tailed Wilcoxon signed-rank test p-values. Columns are labeled O (Openness), C (Conscientiousness), E (Extraversion), A (Agreeableness), and S (Emotional Stability).
| Facet | Input Versions | N | O | C | E | A | S |
|---|---|---|---|---|---|---|---|
| File Format | De-Identified vs. DOCX Resume (HRi1 vs. HRd1) | 89 | 0.7193 | 0.9248 | 0.5306 | 0.3003 | 0.9771 |
| URL Embedding | URL-Embedded vs. De-Identified Resume (HRu1 vs. HRi1) | 86 | 0.2214 | ||||
| URL Embedding | URL-Embedded Resume vs. LinkedIn (HRu1 vs. HL1) | 83 | 0.7352 | 0.3603 | 0.7167 | ||
| Source Context | De-Identified Resume vs. LinkedIn (HRi1 vs. HL1) | 84 | 0.3997 | 0.1730 | 0.6718 | ||
| Source Context | Original Resume vs. LinkedIn (HRo1 vs. HL1) | 84 | 0.5300 | 0.0221 | 0.4553 | ||
| Source Context | Original Resume vs. Twitter (HRo1 vs. HT1) | 20 | 0.0121 | 0.0826 | 0.8983 | ||
| Source Context | LinkedIn vs. Twitter (HL1 vs. HT1) | 18 | |||||
| Immediate Rep. | De-Identified Resume back-to-back (HRi2 vs. HRi3) | 89 | 0.1797 | 0.3173 | 0.3173 | 0.6547 | 0.6547 |
| Algorithm-Time | De-Identified Resume 31 days apart (HRi1 vs. HRi2) | 89 | 0.5314 | 0.2540 | 0.0516 | 0.2424 | |
| Participant-Time | LinkedIn 7–9 months apart (HL1 vs. HL2) | 88 | 0.6487 | 0.9615 | 0.6011 | ||
| Participant-Time | Twitter 7–9 months apart (HT1 vs. HT2) | 21 | N/A | N/A | N/A | N/A | N/A |
Bold highlighting indicates a value below the Bonferroni-corrected threshold based on an α of 0.05. Italic indicates a p-value below the Benjamini–Hochberg corrected threshold and above the Bonferroni-corrected threshold. “N/A” values reflect experiments where there was zero change across the facet. Results are discussed in Sects. 5.3, 5.4, 5.5, 5.6, and 5.7