Kai On Wong1, Osmar R Zaïane2, Faith G Davis1, Yutaka Yasui1,3.
Abstract
BACKGROUND: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study implemented a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features.
Year: 2020 PMID: 33206667 PMCID: PMC7673495 DOI: 10.1371/journal.pone.0241239
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of the machine learning framework.
Extraction and groupings of name features.
| Full name | First name | Middle name | Last name | ||
|---|---|---|---|---|---|
| (“Wing Sun Lee”) | (“Wing”) | (“Sun”) | (“Lee”) | ||
| “wing”, “sun”, “lee” | n/a | n/a | n/a | ||
| n/a | “w” | “s” | “l” | ||
| n/a | “g” | “n” | “e” | ||
| “w”, “i”, “n”, “g”, “s”, “u”, “l”, “e” | “w”, “i”, “n”, “g” | “s”, “u”, “n” | “l”, “e” | ||
| n/a | “wi”, “in”, “ng” | “su”, “un” | “le”, “ee” | ||
| n/a | “win”, “ing” | “sun” | “lee” | ||
| n/a | “wing” | None | None | ||
| n/a | None | None | None | ||
| n/a | None | None | None | ||
| “ANK”, “FNK”, “SN”, “L” | n/a | n/a | n/a | ||
| 3 (“wing”, “sun”, “lee”) | n/a | n/a | n/a | ||
| 4 (“wing”) +3 (“sun”) +3 (“lee”) = 10 | n/a | n/a | n/a | ||
| (4+3+3)/3 = 3.3 | n/a | n/a | n/a | ||
| 4 (“i”, “u”, “e”, “e”) | n/a | n/a | n/a | ||
| 4/10 = 0.4 | n/a | n/a | n/a |
n/a, not applicable.
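The name features in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `char_ngrams` and `name_features`, the dictionary keys, and the choice to drop repeated substrings within a name entity are assumptions made here to reproduce the "Wing Sun Lee" example. The phonetic-style codes in the table are not covered. An empty list corresponds to "None" in the table.

```python
VOWELS = set("aeiou")


def char_ngrams(token, n):
    """Return the distinct contiguous n-letter substrings of a token, in order."""
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    return list(dict.fromkeys(grams))  # de-duplicate while keeping order


def name_features(full_name, max_n=6):
    """Derive illustrative name features from a whitespace-separated name."""
    entities = full_name.lower().split()
    if not entities:
        raise ValueError("full_name must contain at least one name entity")
    total_len = sum(len(e) for e in entities)
    n_vowels = sum(ch in VOWELS for e in entities for ch in e)
    features = {
        "entities": entities,
        "first_chars": [e[0] for e in entities],
        "last_chars": [e[-1] for e in entities],
        "n_entities": len(entities),
        "total_length": total_len,
        "mean_length": total_len / len(entities),
        "n_vowels": n_vowels,
        "vowel_ratio": n_vowels / total_len,
    }
    # 1- to max_n-letter substrings, one list per name entity.
    for n in range(1, max_n + 1):
        features[f"{n}gram"] = [char_ngrams(e, n) for e in entities]
    return features


example = name_features("Wing Sun Lee")
```

For "Wing Sun Lee" this gives 3 name entities, a total length of 10, 4 vowels, and a vowel-to-length ratio of 0.4, in line with the table.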
Fig 2. Frequency count breakdown by ethnic group in the 1901 census (N = 4,812,958).
Each Aboriginal person represents two counts: one in the Aboriginal label and another in one of the following labels: First Nations, Métis, Inuit, or all-other-Aboriginal.
Feature set selection result for multiclass classification using 5-fold cross-validation in development set.
| Feature set | Sensitivity | PPV | F1-score |
|---|---|---|---|
| | 0.10 | 0.03 | 0.05 |
| | 0.11 | 0.08 | 0.06 |
| | 0.46 | 0.69 | 0.51 |
| | 0.50 | 0.71 | 0.56 |
| | 0.19 | 0.20 | 0.16 |
| | 0.29 | 0.51 | 0.31 |
| | 0.55 | 0.72 | 0.60 |
| | 0.39 | 0.48 | 0.41 |
| | 0.68 | 0.79 | 0.72 |
PPV, positive predictive value. Dummy features only = randomly-generated string feature and randomly-generated numeric feature. Basic name features only = first, middle, and last names and their corresponding first and last characters. Name substring features only = 1- to 6-letter strings for first, middle, and last names. Numeric name features only = number of name entities, total character length, total character length by name entity, number of vowels, and vowel-to-length ratio. All name features = all name-derived features. All location features = all location-derived features including processed location text string, province/territory, district, and sub-district features. All name and location features = “All name features” and “All location features”.
a: The two final feature sets chosen, “All name features” and “All name and location features”, were passed to the subsequent steps of both the multiclass and binary classification pipelines.
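The 5-fold cross-validation used for feature set selection can be illustrated with a small index splitter. This is a sketch under stated assumptions: the source does not say whether the folds were stratified or how records were shuffled, so the plain seeded shuffle below and the name `k_fold_indices` are hypothetical choices for illustration only.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k roughly equal, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    folds = [indices[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        # Training indices are every fold except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# Each record is held out exactly once across the five folds.
fold_sizes = [len(valid) for _, valid in k_fold_indices(23, k=5)]
```

Each candidate feature set would then be scored on the held-out fold and the metric averaged over the five folds.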
Multiclass classification predictive performance in test set with ML models trained in training set.
| Feature set | ML algorithm | Ethnicity | Sensitivity | Specificity | PPV | NPV | F1-score | Accuracy |
|---|---|---|---|---|---|---|---|---|
| Dummy features; Sex | LR, SVC, NB | Overall | 0.31 | 0.69 | 0.12 | 0.79 | 0.14 | 0.86 |
| 0.38 | 1.00 | 0.76 | 0.99 | 0.50 | 0.99 | |||
| 0.89 | 1.00 | 0.94 | 1.00 | 0.92 | 1.00 | |||
| 0.74 | 0.83 | 0.59 | 0.91 | 0.65 | 0.81 | |||
| 0.94 | 0.96 | 0.91 | 0.97 | 0.93 | 0.95 | |||
| 0.54 | 0.93 | 0.63 | 0.90 | 0.58 | 0.85 | |||
| 0.49 | 1.00 | 0.80 | 1.00 | 0.61 | 1.00 | |||
| 0.72 | 1.00 | 0.87 | 1.00 | 0.79 | 1.00 | |||
| 0.54 | 0.98 | 0.70 | 0.96 | 0.61 | 0.94 | |||
| 0.67 | 1.00 | 0.88 | 1.00 | 0.76 | 1.00 | |||
| 0.60 | 0.95 | 0.66 | 0.93 | 0.63 | 0.90 | |||
| 0.72 | 0.92 | 0.72 | 0.94 | 0.71 | 0.89 | |||
| 0.71 | 0.93 | 0.71 | 0.93 | 0.71 | 0.89 | |||
| 0.67 | 0.93 | 0.69 | 0.93 | 0.67 | 0.88 | |||
| 0.82 | 1.00 | 0.86 | 1.00 | 0.84 | 0.99 | |||
| 0.92 | 1.00 | 0.95 | 1.00 | 0.94 | 1.00 | |||
| 0.76 | 0.86 | 0.64 | 0.92 | 0.69 | 0.84 | |||
| 0.95 | 0.97 | 0.94 | 0.98 | 0.95 | 0.97 | |||
| 0.61 | 0.93 | 0.67 | 0.91 | 0.64 | 0.87 | |||
| 0.57 | 1.00 | 0.83 | 1.00 | 0.67 | 1.00 | |||
| 0.86 | 1.00 | 0.94 | 1.00 | 0.90 | 1.00 | |||
| 0.59 | 0.98 | 0.74 | 0.96 | 0.66 | 0.95 | |||
| 0.78 | 1.00 | 0.91 | 1.00 | 0.84 | 1.00 | |||
| 0.63 | 0.95 | 0.70 | 0.94 | 0.66 | 0.90 | |||
| 0.76 | 0.94 | 0.76 | 0.94 | 0.76 | 0.91 | |||
| 0.75 | 0.94 | 0.76 | 0.94 | 0.75 | 0.91 | |||
| 0.69 | 0.93 | 0.70 | 0.93 | 0.70 | 0.89 |
ML, machine learning; PPV, positive predictive value; NPV, negative predictive value; LR, regularised logistic regression classifier; SVC, C-support vector classifier; NB, naïve Bayes classifier.
a: Weighted average across individual ethnic categories.
Multiclass confusion matrix in test set for the “All name and location features” set trained in training set with regularised logistic regression classifier.
| Actual \ Predicted | Ab | Ch | En | Fr | Ir | It | Jp | Others | Ru | Sc |
|---|---|---|---|---|---|---|---|---|---|---|
| Ab | 14575 | 6 | 970 | 994 | 419 | 8 | 9 | 220 | 14 | 553 |
| Ch | 16 | 2596 | 112 | 23 | 20 | 0 | 7 | 22 | 0 | 21 |
| En | 539 | 34 | 176509 | 5697 | 26957 | 38 | 5 | 8002 | 45 | 15961 |
| Fr | 821 | 3 | 6114 | 280663 | 3362 | 59 | 2 | 1877 | 13 | 1554 |
| Ir | 315 | 20 | 43806 | 4455 | 110699 | 35 | 3 | 3693 | 19 | 18132 |
| It | 17 | 1 | 193 | 401 | 80 | 1057 | 2 | 85 | 1 | 35 |
| Jp | 8 | 18 | 34 | 5 | 6 | 4 | 678 | 25 | 2 | 10 |
| Others | 313 | 31 | 20945 | 4188 | 5413 | 55 | 11 | 49759 | 145 | 3207 |
| Ru | 15 | 1 | 147 | 47 | 53 | 2 | 2 | 443 | 2592 | 35 |
| Sc | 352 | 10 | 28155 | 2405 | 18917 | 17 | 3 | 2721 | 5 | 89924 |
Ab, Aboriginal; Ch, Chinese; En, English; Fr, French; Ir, Irish; It, Italian; Jp, Japanese; Ru, Russian; Sc, Scottish.
a: Color gradient is set across each row with darker green indicating higher counts.
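The per-class metrics reported for the multiclass models can be recomputed from this confusion matrix by treating each ethnicity as a one-vs-rest problem. The sketch below assumes that rows are actual classes listed in the same order as the predicted columns, which the dominant diagonal suggests. The function name `one_vs_rest_metrics` and the dictionary keys are illustrative, not taken from the source.

```python
LABELS = ["Ab", "Ch", "En", "Fr", "Ir", "It", "Jp", "Others", "Ru", "Sc"]

# Assumed orientation: rows = actual class, columns = predicted class.
CM = [
    [14575, 6, 970, 994, 419, 8, 9, 220, 14, 553],
    [16, 2596, 112, 23, 20, 0, 7, 22, 0, 21],
    [539, 34, 176509, 5697, 26957, 38, 5, 8002, 45, 15961],
    [821, 3, 6114, 280663, 3362, 59, 2, 1877, 13, 1554],
    [315, 20, 43806, 4455, 110699, 35, 3, 3693, 19, 18132],
    [17, 1, 193, 401, 80, 1057, 2, 85, 1, 35],
    [8, 18, 34, 5, 6, 4, 678, 25, 2, 10],
    [313, 31, 20945, 4188, 5413, 55, 11, 49759, 145, 3207],
    [15, 1, 147, 47, 53, 2, 2, 443, 2592, 35],
    [352, 10, 28155, 2405, 18917, 17, 3, 2721, 5, 89924],
]


def one_vs_rest_metrics(cm, k):
    """Collapse a multiclass confusion matrix to class k versus all others."""
    n = sum(map(sum, cm))
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                    # actual k, predicted something else
    fp = sum(row[k] for row in cm) - tp     # predicted k, actually something else
    tn = n - tp - fn - fp
    sens = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "sensitivity": sens,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "f1": 2 * sens * ppv / (sens + ppv),
        "accuracy": (tp + tn) / n,
        "support": tp + fn,
    }


per_class = {lab: one_vs_rest_metrics(CM, i) for i, lab in enumerate(LABELS)}

# Support-weighted average of per-class sensitivity across ethnic categories.
total = sum(m["support"] for m in per_class.values())
weighted_sens = sum(m["sensitivity"] * m["support"] for m in per_class.values()) / total
```

Under these assumptions the Aboriginal class has a sensitivity of about 0.82 and a PPV of about 0.86, and the support-weighted sensitivity is about 0.76.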
Multiclass confusion matrix in test set for the “All name features” set trained in training set with regularised logistic regression classifier.
| Actual \ Predicted | Ab | Ch | En | Fr | Ir | It | Jp | Others | Ru | Sc |
|---|---|---|---|---|---|---|---|---|---|---|
| Ab | 6686 | 47 | 3975 | 2875 | 1448 | 13 | 37 | 942 | 18 | 1727 |
| Ch | 64 | 2515 | 99 | 56 | 33 | 0 | 5 | 25 | 0 | 20 |
| En | 516 | 27 | 172172 | 7881 | 27580 | 38 | 7 | 8729 | 34 | 16803 |
| Fr | 468 | 10 | 9116 | 276346 | 3965 | 82 | 9 | 2489 | 22 | 1961 |
| Ir | 333 | 14 | 52884 | 5263 | 98675 | 27 | 7 | 3654 | 15 | 20305 |
| It | 20 | 1 | 186 | 524 | 82 | 922 | 4 | 87 | 1 | 45 |
| Jp | 64 | 12 | 30 | 50 | 17 | 4 | 565 | 34 | 2 | 12 |
| Others | 339 | 32 | 23599 | 6106 | 5171 | 48 | 11 | 45207 | 196 | 3358 |
| Ru | 16 | 0 | 203 | 126 | 65 | 3 | 1 | 650 | 2234 | 39 |
| Sc | 260 | 10 | 31015 | 2986 | 19495 | 12 | 3 | 2709 | 14 | 86005 |
Ab, Aboriginal; Ch, Chinese; En, English; Fr, French; Ir, Irish; It, Italian; Jp, Japanese; Ru, Russian; Sc, Scottish.
a: Color gradient is set across each row with darker green indicating higher counts.
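Comparing the two confusion matrices shows how much location features change per-class sensitivity. The sketch below assumes that rows are actual classes in the same order as the predicted columns, and that both matrices describe the same test set so that row totals are shared. `DIAG_NAME_LOC` holds the correctly classified counts from the “All name and location features” matrix. Variable names are illustrative.

```python
LABELS = ["Ab", "Ch", "En", "Fr", "Ir", "It", "Jp", "Others", "Ru", "Sc"]

# "All name features" matrix; assumed rows = actual, columns = predicted.
CM_NAME = [
    [6686, 47, 3975, 2875, 1448, 13, 37, 942, 18, 1727],
    [64, 2515, 99, 56, 33, 0, 5, 25, 0, 20],
    [516, 27, 172172, 7881, 27580, 38, 7, 8729, 34, 16803],
    [468, 10, 9116, 276346, 3965, 82, 9, 2489, 22, 1961],
    [333, 14, 52884, 5263, 98675, 27, 7, 3654, 15, 20305],
    [20, 1, 186, 524, 82, 922, 4, 87, 1, 45],
    [64, 12, 30, 50, 17, 4, 565, 34, 2, 12],
    [339, 32, 23599, 6106, 5171, 48, 11, 45207, 196, 3358],
    [16, 0, 203, 126, 65, 3, 1, 650, 2234, 39],
    [260, 10, 31015, 2986, 19495, 12, 3, 2709, 14, 86005],
]

# Diagonal of the "All name and location features" matrix.
DIAG_NAME_LOC = [14575, 2596, 176509, 280663, 110699, 1057, 678, 49759, 2592, 89924]

# Row totals = number of test records actually in each class.
support = [sum(row) for row in CM_NAME]

sens_name = {lab: CM_NAME[i][i] / support[i] for i, lab in enumerate(LABELS)}
sens_name_loc = {lab: DIAG_NAME_LOC[i] / support[i] for i, lab in enumerate(LABELS)}

# Absolute change in sensitivity when location features are added.
gain = {lab: sens_name_loc[lab] - sens_name[lab] for lab in LABELS}
largest_gain = max(gain, key=gain.get)
```

Under these assumptions the largest gain is for the Aboriginal class, whose sensitivity rises from about 0.38 to about 0.82.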
Binary classification predictive performance in test set based on regularised logistic regression classifiers trained in training set.
| Feature set | Ethnicity | Sensitivity | Specificity | PPV | NPV | F1-score | AUC-ROC | Average PPV | Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| Dummy features; Sex | Across all binary label pairs | 0 | 1.00 | N/A | 0.69–1.00 | N/A | 0.50–0.54 | 0–0.31 | 0.69–1.00 |
| 0.36 | 1.00 | 0.80 | 0.99 | 0.50 | 0.88 | 0.50 | 0.99 | ||
| 0.35 | 1.00 | 0.82 | 0.99 | 0.49 | 0.92 | 0.52 | 0.99 | ||
| 0.20 | 1.00 | 0.69 | 1.00 | 0.31 | 0.89 | 0.30 | 1.00 | ||
| 0.08 | 1.00 | 0.13 | 1.00 | 0.10 | 0.88 | 0.07 | 1.00 | ||
| 0.88 | 1.00 | 0.96 | 1.00 | 0.92 | 1.00 | 0.95 | 1.00 | ||
| 0.84 | 0.78 | 0.55 | 0.94 | 0.67 | 0.89 | 0.71 | 0.80 | ||
| 0.92 | 0.98 | 0.95 | 0.97 | 0.94 | 0.99 | 0.97 | 0.96 | ||
| 0.48 | 0.96 | 0.72 | 0.89 | 0.57 | 0.88 | 0.67 | 0.87 | ||
| 0.50 | 1.00 | 0.81 | 1.00 | 0.62 | 0.95 | 0.59 | 1.00 | ||
| 0.67 | 1.00 | 0.88 | 1.00 | 0.76 | 0.99 | 0.80 | 1.00 | ||
| 0.51 | 0.99 | 0.81 | 0.96 | 0.63 | 0.91 | 0.69 | 0.95 | ||
| 0.72 | 1.00 | 0.90 | 1.00 | 0.80 | 0.98 | 0.81 | 1.00 | ||
| 0.53 | 0.97 | 0.74 | 0.92 | 0.62 | 0.90 | 0.68 | 0.90 | ||
| 0.80 | 1.00 | 0.88 | 1.00 | 0.84 | 0.99 | 0.90 | 0.99 | ||
| 0.78 | 1.00 | 0.84 | 1.00 | 0.81 | 1.00 | 0.88 | 1.00 | ||
| 0.56 | 1.00 | 0.77 | 1.00 | 0.65 | 0.99 | 0.67 | 1.00 | ||
| 0.57 | 1.00 | 0.22 | 1.00 | 0.31 | 1.00 | 0.35 | 1.00 | ||
| 0.89 | 1.00 | 0.96 | 1.00 | 0.93 | 1.00 | 0.96 | 1.00 | ||
| 0.63 | 0.92 | 0.72 | 0.89 | 0.67 | 0.91 | 0.88 | 0.85 | ||
| 0.96 | 0.98 | 0.95 | 0.98 | 0.95 | 0.99 | 0.99 | 0.97 | ||
| 0.55 | 0.96 | 0.74 | 0.90 | 0.63 | 0.91 | 0.73 | 0.88 | ||
| 0.58 | 1.00 | 0.83 | 1.00 | 0.68 | 0.98 | 0.69 | 1.00 | ||
| 0.82 | 1.00 | 0.91 | 1.00 | 0.87 | 1.00 | 0.91 | 1.00 | ||
| 0.61 | 0.99 | 0.82 | 0.96 | 0.70 | 0.94 | 0.77 | 0.95 | ||
| 0.81 | 1.00 | 0.92 | 1.00 | 0.86 | 0.99 | 0.89 | 1.00 | ||
| 0.57 | 0.97 | 0.76 | 0.93 | 0.65 | 0.82 | 0.74 | 0.91 |
ML, machine learning; PPV, positive predictive value; NPV, negative predictive value; AUC-ROC, area under the curve for receiver operating characteristic curve; Ab, Aboriginal; Ab-Fn, First Nations; Ab-Mé, Métis; Ab-In, Inuit; Ch, Chinese; En, English; Fr, French; Ir, Irish; It, Italian; Jp, Japanese; Ru, Russian; Sc, Scottish; N/A, cannot be calculated.
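The two threshold-free columns, AUC-ROC and average PPV, can be illustrated with a small pure-Python sketch. It assumes that "Average PPV" corresponds to the usual average precision (mean PPV at the rank of each true positive), which the source does not confirm. The function names and the toy labels and scores are hypothetical, not data from the study.

```python
def auc_roc(y_true, scores):
    """AUC-ROC via the rank-sum (Mann-Whitney U) formulation, averaging tied ranks."""
    pairs = sorted(zip(scores, y_true))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Mean of PPV evaluated at each true positive, ranking by descending score.

    Assumes distinct scores; ties are broken by input order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy example (hypothetical labels and scores).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_roc(y_true, scores)
ap = average_precision(y_true, scores)

# A classifier that scores every record identically carries no ranking information.
auc_constant = auc_roc(y_true, [0.5] * len(y_true))
```

A constant score yields an AUC-ROC of 0.5, which is the level the dummy-feature rows sit at.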
Binary classification predictive performance in test set based on C-support vector classifiers trained in training set.
| Feature set | Ethnicity | Sensitivity | Specificity | PPV | NPV | F1-score | AUC-ROC | Average PPV | Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| Dummy features; Sex | Across all binary label pairs | 0 | 1.00 | N/A | 0.69–1.00 | N/A | 0.50–0.54 | 0–0.31 | 0.69–1.00 |
| 0.35 | 1.00 | 0.86 | 0.99 | 0.50 | 0.90 | 0.52 | 0.99 | ||
| 0.35 | 1.00 | 0.88 | 0.99 | 0.50 | 0.90 | 0.51 | 0.99 | ||
| 0.00 | 1.00 | 0.90 | 1.00 | 0.00 | 0.87 | 0.19 | 1.00 | ||
| 0.03 | 1.00 | 1.00 | 1.00 | 0.06 | 0.92 | 0.16 | 1.00 | ||
| 0.87 | 1.00 | 0.96 | 1.00 | 0.92 | 1.00 | 0.94 | 1.00 | ||
| 0.40 | 0.94 | 0.67 | 0.83 | 0.50 | 0.85 | 0.63 | 0.81 | ||
| 0.91 | 0.98 | 0.95 | 0.96 | 0.93 | 0.98 | 0.97 | 0.96 | ||
| 0.46 | 0.96 | 0.74 | 0.89 | 0.57 | 0.88 | 0.68 | 0.87 | ||
| 0.42 | 1.00 | 0.88 | 1.00 | 0.57 | 0.95 | 0.61 | 1.00 | ||
| 0.64 | 1.00 | 0.88 | 1.00 | 0.74 | 0.99 | 0.79 | 1.00 | ||
| 0.52 | 0.99 | 0.81 | 0.96 | 0.63 | 0.91 | 0.69 | 0.95 | ||
| 0.67 | 1.00 | 0.91 | 1.00 | 0.77 | 0.98 | 0.78 | 1.00 | ||
| 0.52 | 0.97 | 0.76 | 0.92 | 0.62 | 0.91 | 0.70 | 0.90 | ||
| 0.77 | 1.00 | 0.89 | 1.00 | 0.83 | 0.99 | 0.90 | 0.99 | ||
| 0.75 | 1.00 | 0.82 | 1.00 | 0.78 | 0.99 | 0.84 | 1.00 | ||
| 0.48 | 1.00 | 0.83 | 1.00 | 0.61 | 0.98 | 0.64 | 1.00 | ||
| 0.00 | 1.00 | N/A | 1.00 | N/A | 0.00 | 0.00 | 1.00 | ||
| 0.90 | 1.00 | 0.97 | 1.00 | 0.93 | 1.00 | 0.96 | 1.00 | ||
| 0.66 | 0.91 | 0.71 | 0.89 | 0.68 | 0.91 | 0.74 | 0.85 | ||
| 0.95 | 0.99 | 0.96 | 0.98 | 0.96 | 0.99 | 0.99 | 0.97 | ||
| 0.52 | 0.96 | 0.76 | 0.90 | 0.62 | 0.91 | 0.72 | 0.88 | ||
| 0.53 | 1.00 | 0.88 | 1.00 | 0.66 | 0.98 | 0.72 | 1.00 | ||
| 0.70 | 1.00 | 0.94 | 1.00 | 0.81 | 1.00 | 0.89 | 1.00 | ||
| 0.57 | 0.99 | 0.83 | 0.96 | 0.68 | 0.94 | 0.76 | 0.95 | ||
| 0.77 | 1.00 | 0.94 | 1.00 | 0.85 | 0.99 | 0.89 | 1.00 | ||
| 0.56 | 0.97 | 0.78 | 0.93 | 0.65 | 0.92 | 0.74 | 0.91 |
ML, machine learning; PPV, positive predictive value; NPV, negative predictive value; AUC-ROC, area under the curve for receiver operating characteristic curve; Ab, Aboriginal; Ab-Fn, First Nations; Ab-Mé, Métis; Ab-In, Inuit; Ch, Chinese; En, English; Fr, French; Ir, Irish; It, Italian; Jp, Japanese; Ru, Russian; Sc, Scottish; N/A, cannot be calculated.
Binary classification predictive performance in test set based on naïve Bayes classifiers trained in training set.
| Feature set | Ethnicity | Sensitivity | Specificity | PPV | NPV | F1-score | AUC-ROC | Average PPV | Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| Dummy features; Sex | Across all binary label pairs | 0 | 1.00 | N/A | 0.69–1.00 | N/A | 0.50–0.54 | 0–0.31 | 0.69–1.00 |
| 0.70 | 0.84 | 0.08 | 0.99 | 0.14 | 0.85 | 0.23 | 0.84 | ||
| 0.70 | 0.91 | 0.09 | 1.00 | 0.16 | 0.89 | 0.32 | 0.91 | ||
| 0.46 | 0.95 | 0.04 | 1.00 | 0.08 | 0.85 | 0.05 | 0.95 | ||
| 0.60 | 0.99 | 0.01 | 1.00 | 0.01 | 0.92 | 0.01 | 0.99 | ||
| 0.95 | 1.00 | 0.44 | 1.00 | 0.61 | 0.99 | 0.86 | 1.00 | ||
| 0.87 | 0.67 | 0.46 | 0.94 | 0.60 | 0.84 | 0.58 | 0.72 | ||
| 0.92 | 0.96 | 0.90 | 0.96 | 0.91 | 0.97 | 0.94 | 0.94 | ||
| 0.81 | 0.71 | 0.39 | 0.94 | 0.52 | 0.84 | 0.55 | 0.72 | ||
| 0.78 | 0.96 | 0.03 | 1.00 | 0.06 | 0.94 | 0.17 | 0.96 | ||
| 0.92 | 0.99 | 0.10 | 1.00 | 0.19 | 0.99 | 0.55 | 0.99 | ||
| 0.78 | 0.81 | 0.29 | 0.98 | 0.42 | 0.88 | 0.54 | 0.81 | ||
| 0.81 | 0.98 | 0.11 | 1.00 | 0.20 | 0.97 | 0.31 | 0.98 | ||
| 0.76 | 0.82 | 0.43 | 0.95 | 0.55 | 0.88 | 0.59 | 0.81 | ||
| 0.77 | 0.96 | 0.29 | 1.00 | 0.42 | 0.95 | 0.60 | 0.96 | ||
| 0.79 | 0.97 | 0.27 | 1.00 | 0.40 | 0.96 | 0.59 | 0.97 | ||
| 0.68 | 0.97 | 0.09 | 1.00 | 0.16 | 0.93 | 0.17 | 0.97 | ||
| 0.67 | 0.99 | 0.01 | 1.00 | 0.02 | 0.94 | 0.02 | 0.99 | ||
| 0.95 | 1.00 | 0.54 | 1.00 | 0.69 | 0.99 | 0.89 | 1.00 | ||
| 0.86 | 0.70 | 0.48 | 0.94 | 0.61 | 0.85 | 0.60 | 0.74 | ||
| 0.92 | 0.96 | 0.92 | 0.97 | 0.92 | 0.98 | 0.95 | 0.95 | ||
| 0.77 | 0.73 | 0.40 | 0.93 | 0.52 | 0.83 | 0.54 | 0.74 | ||
| 0.72 | 0.99 | 0.09 | 1.00 | 0.17 | 0.94 | 0.31 | 0.99 | ||
| 0.92 | 1.00 | 0.16 | 1.00 | 0.27 | 0.99 | 0.68 | 1.00 | ||
| 0.79 | 0.84 | 0.32 | 0.98 | 0.45 | 0.89 | 0.58 | 0.83 | ||
| 0.87 | 0.99 | 0.20 | 1.00 | 0.32 | 0.98 | 0.50 | 0.99 | ||
| 0.72 | 0.85 | 0.46 | 0.95 | 0.56 | 0.88 | 0.59 | 0.83 |
ML, machine learning; PPV, positive predictive value; NPV, negative predictive value; AUC-ROC, area under the curve for receiver operating characteristic curve; Ab, Aboriginal; Ab-Fn, First Nations; Ab-Mé, Métis; Ab-In, Inuit; Ch, Chinese; En, English; Fr, French; Ir, Irish; It, Italian; Jp, Japanese; Ru, Russian; Sc, Scottish; N/A, cannot be calculated.