Abstract
OBJECTIVE: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database.
Keywords: accuracy; gender determination; genderize.io; misclassification; name; name-to-gender; performance
Year: 2021 PMID: 34858090 PMCID: PMC8608220 DOI: 10.5195/jmla.2021.1252
Source DB: PubMed Journal: J Med Libr Assoc ISSN: 1536-5050
Confusion matrices for genderize.io (n=6,131 physicians)
| Database / actual gender | Classified as women n (%) | Classified as men n (%) | Nonclassified n (%) |
|---|---|---|---|
| **Original database (file #1)** | | | |
| Women | 2,519 (81.7) | 59 (1.9) | 507 (16.4) |
| Men | 17 (0.6) | 2,529 (83.0) | 500 (16.4) |
| **Database without diacritic marks for first names (file #2)** | | | |
| Women | 2,663 (86.3) | 66 (2.2) | 356 (11.5) |
| Men | 18 (0.6) | 2,670 (87.7) | 358 (11.7) |
| **Database without diacritic marks for first names and with only the first term for compound first names (file #3)** | | | |
| Women | 2,987 (96.8) | 86 (2.8) | 12 (0.4) |
| Men | 26 (0.8) | 3,005 (98.7) | 15 (0.5) |
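The two name manipulations behind files #2 and #3 (removing diacritic marks, then keeping only the first term of a compound first name) could be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, and treating hyphens and whitespace as the separators of compound names is an assumption the record does not confirm.

```python
import re
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove combining marks (accents, cedillas, ...) from a name.

    The name is decomposed with NFD so that an accented letter becomes a
    base letter followed by combining marks, which are then dropped.
    Letters with no canonical decomposition (e.g. "ø") are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_term(name: str) -> str:
    """Keep only the first term of a compound first name.

    Assumption: terms are separated by hyphens and/or whitespace.
    """
    return re.split(r"[\s\-]+", name.strip())[0]


def prepare_for_lookup(name: str) -> str:
    """Apply both manipulations, as described for file #3."""
    return first_term(strip_diacritics(name))
```

For example, `prepare_for_lookup("Marie-Hélène")` yields `"Marie"`, while `strip_diacritics("François")` alone yields `"Francois"`.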
Performance metrics for genderize.io (n=6,131 physicians)
| Database | errorCoded | errorCodedWithoutNA | naCoded |
|---|---|---|---|
| Original database (file #1) | 0.1766 | 0.0148 | 0.1643 |
| Database without diacritic marks for first names (file #2) | 0.1302 | 0.0155 | 0.1165 |
| Database without diacritic marks for first names and with only the first term for compound first names (file #3) | 0.0227 | 0.0184 | 0.0044 |
errorCoded = the proportion of misclassifications (i.e., wrong gender assigned to a first name) and nonclassifications (i.e., no gender assigned)
errorCodedWithoutNA = the proportion of misclassifications excluding nonclassifications
naCoded = the proportion of nonclassifications
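As a worked check, the three metrics can be recomputed from the confusion-matrix counts. The sketch below reads the definitions as: errorCoded and naCoded are divided by all names, and errorCodedWithoutNA is divided by classified names only. That choice of denominators is an interpretation of the definitions above, and the function and variable names are illustrative.

```python
def gender_metrics(women, men):
    """Compute the three error metrics from one confusion matrix.

    Each argument is a tuple of counts for one actual gender:
    (classified as women, classified as men, nonclassified).
    """
    w_as_w, w_as_m, w_na = women
    m_as_w, m_as_m, m_na = men

    total = sum(women) + sum(men)
    misclassified = w_as_m + m_as_w      # wrong gender assigned
    nonclassified = w_na + m_na          # no gender assigned
    classified = total - nonclassified   # assumed denominator for errorCodedWithoutNA

    return {
        "errorCoded": (misclassified + nonclassified) / total,
        "errorCodedWithoutNA": misclassified / classified,
        "naCoded": nonclassified / total,
    }


# Counts copied from the confusion matrices (n=6,131 physicians).
FILES = {
    "file #1": ((2519, 59, 507), (17, 2529, 500)),
    "file #2": ((2663, 66, 356), (18, 2670, 358)),
    "file #3": ((2987, 86, 12), (26, 3005, 15)),
}

for label, (women, men) in FILES.items():
    metrics = gender_metrics(women, men)
    print(label, {key: round(value, 4) for key, value in metrics.items()})
```

Under this reading, the recomputed values agree with the reported metrics to within 0.0001.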