C Nanayakkara1, P Christen1, T Ranbaduge1, E Garrett2.
Abstract
INTRODUCTION: The robustness of record linkage evaluation measures is of high importance since linkage techniques are assessed based on these. However, minimal research has been conducted to evaluate the suitability of existing evaluation measures in the context of linking groups of records. Linkage quality is generally evaluated based on traditional measures such as precision and recall. As we show, these traditional evaluation measures are not suitable for evaluating groups of linked records because they evaluate the quality of individual record pairs rather than the quality of records grouped into clusters.
Year: 2019 PMID: 34095539 PMCID: PMC8142966 DOI: 10.23889/ijpds.v4i1.1127
Source DB: PubMed Journal: Int J Popul Data Sci ISSN: 2399-4908
Figure 1: Examples of different cluster predictions. Node colours represent the five true clusters, solid edges show true matches (i.e. correctly predicted links), and dotted edges show wrong matches (i.e. incorrectly predicted links).

1 Using birth/baptism, death/burial, and marriage records only.
2 Using birth/baptism, death/burial, and marriage records in conjunction with other forms of nominal records.
| Relevant terms as used by different fields |
|---|

| | Ground-truth: Matches | Ground-truth: Non-matches |
|---|---|---|
| Prediction: Positive link | True positives (TP) | False positives (FP) |
| Prediction: Negative link | False negatives (FN) | True negatives (TN) |
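To make the pair-based nature of the traditional measures concrete, the following minimal sketch derives precision and recall from two clusterings by expanding each cluster into its record pairs. A predicted link between two records that truly match counts as a true positive, a predicted link between non-matching records as a false positive, and a true match without a predicted link as a false negative. The function names and the toy record identifiers are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (a, b) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(truth, predicted):
    """Pair-based precision and recall of a predicted clustering."""
    true_pairs = cluster_pairs(truth)
    pred_pairs = cluster_pairs(predicted)
    tp = len(true_pairs & pred_pairs)   # correctly predicted links
    fp = len(pred_pairs - true_pairs)   # wrongly predicted links
    fn = len(true_pairs - pred_pairs)   # missed true matches
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: three true clusters and one possible prediction.
truth = [{"a1", "a2", "a3"}, {"b1", "b2"}, {"c1"}]
predicted = [{"a1", "a2"}, {"a3", "b1", "b2"}, {"c1"}]
precision, recall = pairwise_precision_recall(truth, predicted)
print(precision, recall)
```

Both values describe individual record pairs only. They do not say which records ended up in a correct group, which is the limitation the abstract points out.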
| Category | Description |
|---|---|
| Correct singleton (SS) | These are the records which appear as singletons in both the ground-truth data and the predicted clusters. |
| Wrongly grouped singleton (SG) | These are the records which appear as singletons in the ground-truth but were assigned to a group of records in the prediction. |
| Missed group member (GS) | These are the records which appear in a group in the ground-truth, but were assigned as a singleton in the prediction. |
| Exact group match (GG_E) | These are the records contained in a predicted cluster that exactly matches a ground-truth cluster (i.e. each record in the predicted cluster appears in the ground-truth cluster, and vice versa), where the size of the cluster is larger than one. |
| Majority group match (GG_M) | A majority group match occurs when at least 50% of the records in a predicted cluster (containing at least two records) come from a single ground-truth cluster. For this classification, the best representative predicted cluster of a ground-truth cluster (which contains at least two records from the ground-truth cluster) must be identified. For a majority group match, all the records which appear in both the ground-truth cluster and the predicted cluster are assigned to category GG_M. |
| Minority group match (GG_m) | A minority group match is similar to a majority group match, except that less than 50% of the records in the predicted cluster come from the corresponding ground-truth cluster. |
| Wrongly assigned member (GG_W) | These are all the records from a ground-truth cluster (containing at least two records) which appear in a predicted cluster (a group) different from the majority or minority group match. That is, once we find the best representative cluster for a given ground-truth cluster, all the records which appear in a predicted cluster other than the representative cluster are assigned to this class. |
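The category descriptions above can be read as a per-record labelling procedure. The sketch below is one possible reading, not the authors' implementation. It assumes that the "best representative" is the predicted cluster sharing the most records (at least two) with the ground-truth cluster, that ties are broken arbitrarily but deterministically, that grouped records without any representative fall into GG_W, and that both clusterings cover the same records. The function name `categorise_records` and the record identifiers are hypothetical.

```python
def categorise_records(truth, predicted):
    """Label each record with one of the seven categories (a sketch)."""
    # Map every record to the predicted cluster that contains it.
    pred_of = {rec: frozenset(p) for p in predicted for rec in p}
    labels = {}
    for t in map(frozenset, truth):
        if len(t) == 1:
            (rec,) = t
            labels[rec] = "SS" if len(pred_of[rec]) == 1 else "SG"
            continue
        # Group members predicted as singletons are missed group members.
        grouped = set()
        for rec in t:
            if len(pred_of[rec]) == 1:
                labels[rec] = "GS"
            else:
                grouped.add(rec)
        # Assumption: candidate representatives share >= 2 records with t,
        # and the one with the largest overlap is the best representative.
        candidates = [p for p in {pred_of[rec] for rec in grouped}
                      if len(p & t) >= 2]
        best = max(candidates,
                   key=lambda p: (len(p & t), sorted(p)),
                   default=None)
        for rec in grouped:
            if best is None or rec not in best:
                labels[rec] = "GG_W"   # grouped, but not with the representative
            elif best == t:
                labels[rec] = "GG_E"   # predicted cluster equals the true cluster
            elif len(best & t) / len(best) >= 0.5:
                labels[rec] = "GG_M"   # at least 50% of the cluster comes from t
            else:
                labels[rec] = "GG_m"   # less than 50% of the cluster comes from t
    return labels


# Hypothetical toy data.
truth = [{"a1", "a2", "a3"}, {"b1", "b2", "b3", "b4"},
         {"c1"}, {"d1"}, {"e1", "e2", "e3"}]
predicted = [{"a1", "a2", "a3"}, {"b1", "b2", "c1"}, {"b3", "e3"},
             {"b4"}, {"d1"}, {"e1", "e2"}]
labels = categorise_records(truth, predicted)
for rec in sorted(labels):
    print(rec, labels[rec])
```

Counting the labels per category and dividing by the number of records gives a per-category proportion, which describes the quality of groups rather than of individual links.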

| | True Singleton | True Group / Cluster |
|---|---|---|
| Predicted Singleton | SS | GS |
| Predicted Group / Cluster | SG | GG_E, GG_M, GG_m, GG_W |
Figure 2: Ground-truth cluster X.
Figure 3: Ground-truth cluster Y.
Figure 4: Ground-truth singleton
Figure 5: Ground-truth singleton
Figure 6: Ground-truth cluster
Figure 8: Plots for new evaluation results for the three clustering techniques.

| Attribute name | Number of unique values | Number and percentage of records with a missing value |
|---|---|---|
| Mother’s first name | 97 | 10 (0.06%) |
| Mother’s last name | 286 | 11 (0.06%) |
| Father’s first name | 86 | 955 (5.42%) |
| Father’s last name | 301 | 951 (5.40%) |
| Mother’s occupation | 73 | 16,446 (93.37%) |
| Father’s occupation | 790 | 963 (5.47%) |
| Address | 1,286 | 210 (1.19%) |
| Parent’s marriage date | 5,105 | 2,346 (13.32%) |
Figure 7: Precision-recall curves for three clustering techniques based on six different similarity graphs as described in Section 3 with weighted (W) and unweighted (UW) attribute similarity aggregations.

| Clustering technique | AUC (PR) | AUC (SS) | AUC (GG_E) | AUC (GG_M) | AUC (GG_m) | AUC (SG) | AUC (GS) | AUC (GG_W) | Average AUC for 7 categories |
|---|---|---|---|---|---|---|---|---|---|
| Connected components | 0.744 | 0.036 | 0.206 | 0.077 | 0.01 | 0.087 | 0.017 | 0.567 | -0.141 |
| Star clustering | 0.775 | 0.046 | 0.367 | 0.333 | 0.02 | 0.077 | 0.02 | 0.137 | 0.114 |
| Robust graph clustering | 0.885 | 0.044 | 0.413 | 0.298 | 0.027 | 0.077 | 0.017 | 0.124 | 0.123 |
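The table reports area-under-the-curve (AUC) values, but this record does not state how they were computed. As a hedged illustration only, an AUC for a precision-recall curve can be approximated with the trapezoidal rule over the measured points. The function name `trapezoid_auc` and the three curve points below are hypothetical and are not taken from the paper.

```python
def trapezoid_auc(xs, ys):
    """Approximate the area under a curve with the trapezoidal rule."""
    # Sort the points by x so that consecutive points form trapezoids.
    points = sorted(zip(xs, ys))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical precision-recall points (recall on the x axis).
recall = [0.0, 0.5, 1.0]
precision = [1.0, 0.8, 0.4]
auc_pr = trapezoid_auc(recall, precision)
print(round(auc_pr, 3))
```

The same routine could be applied to a per-category curve, for example the proportion of GG_E records plotted against a similarity threshold, under the assumption that the curve is sampled densely enough for the trapezoids to be a fair approximation.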