John M Finney, A Sarah Walker, Tim E A Peto, David H Wyllie.
Abstract
BACKGROUND: Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.
Year: 2011 PMID: 21284874 PMCID: PMC3039555 DOI: 10.1186/1472-6947-11-7
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Identifiers and Record linkage operation
| Start cluster id | New cluster id | NHS number | hospital number | Surname | Forename | sex | date of birth (ddmmyyyy) | frequency of occurrence |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | NULL | 4496644 | WILSON | DAVID | M | 14061940 | 3 |
| 2 | 2 | 5170231111 | NULL | WILSON | DAVID | M | 01051939 | 1 |
| 3 | 3 | 3319004037 | 4118890 | WILSON | DAVID | M | 20011969 | 2 |
| 4 | 4 | NULL | NULL | WILSON | DAVID | M | 20011969 | 1 |
| 5 | 3 | 3319004037 | NULL | WILSON | DAVID | M | 20011969 | 2 |
| 6 | 6 | NULL | 4118890 | WILSON | DAVID | M | 20011969 | 1 |
An example of identifiers provided for patients with forename and surname 'David Wilson'. The details have been changed to protect patient confidentiality. Null fields indicate there was no information provided in that field.
One cycle of the record linkage is illustrated. Each combination of identifiers is initially considered to belong to its own discrete cluster, identified by a cluster identifier (Start cluster id). All sets in which at least one member shares an NHS number with a member of a different set are combined into a single set (New cluster id). The same operation is then applied for each of the other identifiers.
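The cycle described above can be sketched in a few lines of Python. This is a minimal illustration using a union-find structure, not the authors' implementation; the function name `link_cycle` and the convention that merged clusters keep the smallest cluster id are assumptions made for the example. The records are the six rows of the table above.

```python
from collections import defaultdict

# Rows from the table above: (start cluster id, NHS number, hospital number).
# None stands for NULL.
records = [
    (1, None, "4496644"),
    (2, "5170231111", None),
    (3, "3319004037", "4118890"),
    (4, None, None),
    (5, "3319004037", None),
    (6, None, "4118890"),
]

def link_cycle(cluster_of, values):
    """Merge clusters that share a non-null value of ONE identifier.

    cluster_of: dict record id -> current cluster id
    values:     dict record id -> identifier value (None if missing)
    Merged clusters take the smallest cluster id (assumed convention).
    """
    parent = {c: c for c in set(cluster_of.values())}

    def find(c):
        # Follow parent links to the root, shortening the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Collect the clusters that hold each non-null identifier value.
    by_value = defaultdict(list)
    for rec, value in values.items():
        if value is not None:
            by_value[value].append(cluster_of[rec])
    for clusters in by_value.values():
        for other in clusters[1:]:
            union(clusters[0], other)
    return {rec: find(c) for rec, c in cluster_of.items()}

start = {rec_id: rec_id for rec_id, _, _ in records}
nhs = {rec_id: n for rec_id, n, _ in records}
hospital = {rec_id: h for rec_id, _, h in records}

after_nhs = link_cycle(start, nhs)               # record 5 joins cluster 3
after_hospital = link_cycle(after_nhs, hospital) # record 6 then joins cluster 3
print(after_nhs)
print(after_hospital)
```

Running the cycle on the NHS number reproduces the New cluster id column; a further cycle on the hospital number would also bring record 6 into cluster 3.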
Identifier cleaning
| Field | Cleaning step |
|---|---|
| All fields | Converted to uppercase; blanks (e.g. whitespace) deleted |
| Forename & Surname | Removal of forenames containing baby/infant/twins, or synonyms |
| Sex | Removed unless M, F or U, representing male, female or unknown, respectively |
| Hospital numbers | Check digits removed |
| NHS numbers | Out-of-range values deleted |
| Birthdate & Deathdate | Converted to SQL date format |
The steps taken in cleaning data items are described.
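Some of these cleaning steps can be sketched as small Python functions. This is an illustration only: the function names and the placeholder word list are hypothetical, and the ddmmyyyy input format is assumed from the first table. The hospital-number check digit and the valid NHS number range are not specified here, so those steps are left out.

```python
import re
from datetime import datetime

# Hypothetical placeholder words; the table only says "baby/infant/twins, or synonyms".
PLACEHOLDER_FORENAMES = ("BABY", "INFANT", "TWIN")

def clean_text(value):
    """Uppercase and delete all whitespace; empty results become None."""
    if value is None:
        return None
    cleaned = re.sub(r"\s+", "", value.upper())
    return cleaned or None

def clean_forename(value):
    """Drop forenames that contain a placeholder word such as BABY."""
    value = clean_text(value)
    if value is None or any(word in value for word in PLACEHOLDER_FORENAMES):
        return None
    return value

def clean_sex(value):
    """Keep only M, F or U."""
    value = clean_text(value)
    return value if value in ("M", "F", "U") else None

def clean_date(value, fmt="%d%m%Y"):
    """Convert a ddmmyyyy string to ISO yyyy-mm-dd; invalid dates become None."""
    value = clean_text(value)
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt).date().isoformat()
    except ValueError:
        return None

print(clean_forename(" Baby of Jane "), clean_sex("m"), clean_date("14061940"))
```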
Figure 1. Collision of two clusters. The collision described in Table 6 is illustrated graphically. A single identifier joins two clusters containing records from two patients. When edges are not formed from this identifier (right panel), the clusters are no longer joined.
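The effect shown in the figure can be reproduced on a toy graph. The sketch below treats records as nodes and joins any two records that share an identifier value, then recomputes the connected components with one identifier value excluded. The record ids, identifier values and the `components` function are made up for the example and are not taken from the study.

```python
from collections import defaultdict

def components(records, ignore=frozenset()):
    """Group record ids into connected components.

    records: dict record id -> dict of identifier name -> value (None if missing)
    ignore:  set of (identifier name, value) pairs that must not form edges
    """
    # Which records hold each (identifier, value) pair?
    holders = defaultdict(list)
    for rec_id, fields in records.items():
        for name, value in fields.items():
            if value is not None and (name, value) not in ignore:
                holders[(name, value)].append(rec_id)

    # Link every holder to the first holder; this is enough for connectivity.
    neighbours = defaultdict(set)
    for rec_ids in holders.values():
        for other in rec_ids[1:]:
            neighbours[rec_ids[0]].add(other)
            neighbours[other].add(rec_ids[0])

    # Depth-first search from each unvisited record.
    seen, result = set(), []
    for start in records:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        result.append(group)
    return result

# Toy data: hospital number H1 is wrongly shared by two patients.
toy = {
    "a": {"nhs": "N1", "hospital": "H1"},
    "b": {"nhs": "N1", "hospital": None},
    "c": {"nhs": "N2", "hospital": "H1"},
    "d": {"nhs": "N2", "hospital": "H2"},
}
joined = components(toy)
split = components(toy, ignore={("hospital", "H1")})
print(joined)
print(split)
```

With all edges in place the four records form one cluster; once the shared hospital number is ignored, they fall back into two clusters, one per NHS number.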
High cardinality of combinations of name and date of birth
| Identifier | Cardinality | Average NHS numbers per identifier |
|---|---|---|
| National Health Service Number | 1066339 | 1 |
| Date of birth | 35694 | 29.87 |
| Surname, complete forename | 829650 | 1.285 |
| Surname, first letter of forename, date of birth | 1065027 | 1.001316 |
| Surname, first three letters of forename, date of birth | 1066184 | 1.000234 |
| Surname, complete forename, date of birth | 1066519 | 1.000090 |
For a set of complete identifiers from one data source (PAS), we show the cardinality (number of discrete values) of each of a series of possible identifiers, and the average number of NHS numbers per identifier. The cardinalities of the NHS number and of combinations of name and date of birth are similar.
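The two quantities in the table can be computed as in the following sketch. The field names and the toy records are assumptions for illustration, not data from the study; `key` is any function that builds the candidate identifier from a record.

```python
from collections import defaultdict

def cardinality_and_mean_nhs(records, key):
    """Return (number of distinct identifier values,
               mean number of distinct NHS numbers per identifier value)."""
    nhs_by_identifier = defaultdict(set)
    for rec in records:
        nhs_by_identifier[key(rec)].add(rec["nhs"])
    cardinality = len(nhs_by_identifier)
    mean_nhs = sum(len(s) for s in nhs_by_identifier.values()) / cardinality
    return cardinality, mean_nhs

# Made-up toy records with complete identifiers.
toy = [
    {"nhs": "A", "surname": "WILSON", "forename": "DAVID", "dob": "1940-06-14"},
    {"nhs": "B", "surname": "WILSON", "forename": "DAVID", "dob": "1969-01-20"},
    {"nhs": "C", "surname": "WILSON", "forename": "DANIEL", "dob": "1969-01-20"},
]

by_dob = cardinality_and_mean_nhs(toy, lambda r: r["dob"])
by_full = cardinality_and_mean_nhs(
    toy, lambda r: (r["surname"], r["forename"], r["dob"])
)
print(by_dob, by_full)
```

An identifier whose mean is close to 1 distinguishes individuals almost as well as the NHS number itself, which is the pattern the table shows for surname, forename and date of birth combined.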
Combinations of identifiers available on different record sources
| Hospital Number | NHS Number | Name/date of birth | jonah | lims | micro | pas | pashistory | Total |
|---|---|---|---|---|---|---|---|---|
| + | + | + | 456553 (72.2%) | 246326 (3.6%) | 1494645 (28.5%) | 1205042 (53.3%) | 94935 (83.4%) | 1494645 (16.2%) |
| + | + | - | 223 (0%) | 1510 (0%) | 5520 (0.1%) | 19086 (0.8%) | 161 (0.1%) | 19086 (0.2%) |
| + | - | + | 174874 (27.7%) | 2160475 (31.7%) | 978448 (18.6%) | 860916 (38.1%) | 18372 (16.1%) | 2160475 (23.4%) |
| + | - | - | 550 (0.1%) | 30816 (0.5%) | 36636 (0.7%) | 177518 (7.8%) | 366 (0.3%) | 177518 (1.9%) |
| - | + | + | 7 (0%) | 103420 (1.5%) | 813906 (15.5%) | 0 (0%) | 0 (0%) | 813906 (8.8%) |
| - | + | - | 0 (0%) | 591 (0%) | 2244 (0%) | 0 (0%) | 0 (0%) | 2244 (0%) |
| - | - | + | 95 (0%) | 3883941 (57%) | 1245979 (23.7%) | 1 (0%) | 1 (0%) | 3883941 (42.1%) |
| - | - | - | 3 (0%) | 382490 (5.6%) | 671076 (12.8%) | 0 (0%) | 0 (0%) | 671076 (7.3%) |
Up to three identifiers (hospital number, NHS number, and name & date of birth) are available for each record, but they are not all present in each data set. Shown are the combinations of identifiers (- = absent, + = present) for each dataset contributing to the database.
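A tabulation of this kind can be sketched as follows. The field names are hypothetical, and the rule that name & date of birth counts as present only when surname, forename and date of birth are all filled in is an assumption made for the example.

```python
from collections import Counter

NAME_DOB_FIELDS = ("surname", "forename", "dob")  # assumed field names

def presence_pattern(record):
    """Return e.g. '+-+' for hospital number, NHS number, name & date of birth."""
    has_name_dob = all(record.get(field) for field in NAME_DOB_FIELDS)
    flags = (bool(record.get("hospital")), bool(record.get("nhs")), has_name_dob)
    return "".join("+" if flag else "-" for flag in flags)

def tabulate(records):
    """Map each pattern to (count, percentage of all records)."""
    counts = Counter(presence_pattern(r) for r in records)
    total = sum(counts.values())
    return {pattern: (n, 100.0 * n / total) for pattern, n in counts.items()}

# Made-up toy records for one data source.
toy_records = [
    {"hospital": "H1", "nhs": "N1", "surname": "WILSON", "forename": "DAVID", "dob": "1940-06-14"},
    {"hospital": "H2", "nhs": None, "surname": "WILSON", "forename": "DAVID", "dob": "1969-01-20"},
    {"hospital": None, "nhs": None, "surname": "WILSON", "forename": "DAVID", "dob": "1969-01-20"},
    {"hospital": "H3", "nhs": None, "surname": "WILSON", "forename": None, "dob": "1969-01-20"},
]
print(tabulate(toy_records))
```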
Figure 2. Representation of data as a graph. Above is shown the result of one real cluster generated by the algorithm; to protect patient confidentiality, patient details have been replaced by example details. Here, there are three discrete patients, all called David Wilson, but differing in dates of birth, NHS and hospital numbers. Edges join nodes having shared identifiers.
Multivariate logistic model classifying bad clusters from good
Model: Any females
| Parameter | Example inputs | Distance function | Coefficient | p |
|---|---|---|---|---|
| | | | -3.33 | <1 × 10⁻¹⁶ |
| | 02, 24 | Levenshtein | 0.25 | 5 × 10⁻⁴ |
| | 01, 11 | Levenshtein | 0.43 | 1 × 10⁻⁴ |
| | 1969, 2007 | Levenshtein | 5.12 | <1 × 10⁻¹⁶ |
| | John, Chris | Jaro-Winkler | 2.45 | <1 × 10⁻¹⁶ |
| | 110111 or 223456 | Jaro-Winkler | -0.56 | <1 × 10⁻¹⁶ |
| | Smith, Jones | Jaro-Winkler | not present | - |
| | M, F | Levenshtein | 0.80 | <1 × 10⁻¹⁶ |
A random sample of 25,000 clusters was obtained after initial record linkage. These clusters were divided into those which, on the basis of a series of rules, were thought to represent one individual ('good'), or the others ('uncertain'). The uncertain records were not used in model generation. Good clusters were then combined randomly, creating a new set of clusters ('bad'). Maximal distances were computed by pairwise comparison of good and bad clusters, and a logistic model was fitted modelling bad cluster status relative to good cluster status for clusters without females, or for clusters including at least one record identified as being from a female, with backwards selection based on AIC. In the female model, surname was omitted; in the non-female model, there is only one level for the Sex field, which was therefore omitted. One fitted model is shown; very similar estimates were obtained from a large number of other builds with different random samples. p refers to the null hypothesis that the coefficient is zero.
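The maximal pairwise distance for one field of a cluster can be sketched as below, using a plain Levenshtein edit distance. This is an illustration under assumptions: the function names are hypothetical, any normalisation of distances used in the study is not reproduced, and Jaro-Winkler is not implemented here (any other distance function can be passed in its place).

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]

def max_pairwise_distance(values, distance=levenshtein):
    """Largest distance between any two distinct non-null values in a cluster."""
    distinct = sorted(set(v for v in values if v is not None))
    return max((distance(a, b) for a, b in combinations(distinct, 2)), default=0)

# Years of birth seen within one hypothetical cluster.
print(max_pairwise_distance(["1969", "1969", "2007", None]))
```

A cluster in which every record agrees on a field has a maximal distance of 0 for that field, so large values point to clusters that may mix more than one individual.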
Figure 3. Classification of data into good and bad clusters. A random sample of 25,000 complex clusters was obtained after initial record linkage. Complex clusters are those with more than one variant of at least one identifier. These clusters were divided into those which, on the basis of a series of rules, were thought to represent one individual ('likely good', purple line), or the others (uncertain, blue line). Good clusters were then combined randomly, creating a new set of clusters (bad by simulation, green line). Maximal distances were computed for pairwise distances within all members of 'likely good' and simulated bad clusters. A logistic model was fitted modelling bad cluster status relative to good cluster status for (top) clusters without females, or (bottom) clusters including at least one record identified as being from a female. Here, logistic scores are plotted for each of the three groups. The dashed vertical line is at -1.5 in both models, a position chosen empirically as suitable for discrimination of good from bad clusters.
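Applying such a model with the -1.5 cut-off can be sketched as follows. The score is taken to be the linear predictor (log-odds) of bad cluster status, and scores above the cut-off are assumed to indicate a bad cluster. The coefficient values and feature names below are made up for illustration and are not the fitted estimates.

```python
import math

def logistic_score(distances, coefficients, intercept):
    """Linear predictor (log-odds) of a cluster being 'bad'."""
    return intercept + sum(
        coefficients[name] * distances[name] for name in coefficients
    )

def classify(score, threshold=-1.5):
    """Assumed direction: scores above the threshold are called 'bad'."""
    return "bad" if score > threshold else "not bad"

def probability(score):
    """Convert a log-odds score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients and feature names, for illustration only.
intercept = -3.0
coefficients = {"forename": 2.0, "year_of_birth": 1.0}

consistent = logistic_score({"forename": 0.0, "year_of_birth": 0.0}, coefficients, intercept)
conflicting = logistic_score({"forename": 0.5, "year_of_birth": 1.0}, coefficients, intercept)
print(classify(consistent), classify(conflicting))
```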
Classifier performance on an independent validation set of 25,000 complex clusters
| Status | Predicted not bad | Predicted bad | Total | % predicted bad |
|---|---|---|---|---|
| Unknown | 5623 | 2501 | 8124 | 30.7 |
| Good | 15975 | 901 | 16876 | 5.33 |
| Bad | 337 | 11698 | 12035 | 97.2 |
The logistic classifier derived to identify bad clusters (not bad refers to a single individual within a cluster, bad refers to more than one individual), shown in Table 4, was applied to a further random sample of 25,000 clusters obtained after initial record linkage. These were classified into 'good', 'unknown status' and 'bad' using rules, as described in the Table 4 legend and methods. The classifier performance on this validation set is shown.
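The final column of the table follows directly from the counts, as the short sketch below shows. The percentage for the 'Bad' row is the proportion of simulated bad clusters that the classifier detects, and the percentage for the 'Good' row is the proportion of likely good clusters flagged as bad. The function name is hypothetical, and the computed values may differ from the table in the last digit because of rounding.

```python
# Counts from the table above: status -> (predicted not bad, predicted bad).
counts = {
    "Unknown": (5623, 2501),
    "Good": (15975, 901),
    "Bad": (337, 11698),
}

def percent_predicted_bad(not_bad, bad):
    """Percentage of clusters in a row that the classifier calls bad."""
    return 100.0 * bad / (not_bad + bad)

for status, (not_bad, bad) in counts.items():
    total = not_bad + bad
    print(f"{status}: {total} clusters, "
          f"{percent_predicted_bad(not_bad, bad):.2f}% predicted bad")
```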
Effect of collision resolution
| | Before collision resolution | After collision resolution | % drop |
|---|---|---|---|
| Number of clusters | 3557951 | 3618233 | |
| Clusters with multiple: | | | |
| NHS numbers | 6202 | 2122 | ~66% |
| hospital numbers | 97071 | 94238 | ~3% |
| birthdates | 58293 | 35523 | ~39% |
| deathdates | 830 | 107 | ~87% |
| genders | 81118 | 61337 | ~24% |
| forenames | 59426 | 16873 | ~71% |
| surnames | 189657 | 151593 | ~25% |
After initial linkage, a process of collision resolution is applied (see methods). This causes a decrease in the number of clusters containing multiple identifiers, as detailed above.
Overall performance
| Process | Operation | Timing |
|---|---|---|
| 1 | Identifier cleaning; forename/surname duplication screening | 3 min |
| 2 | Construction of unique identifiers | 1 min |
| 3 | Initial clustering using identifiers | 7 min |
| 4 | Identity collision detection | 10 min |
| 5 | Identity collision resolution | 2 min |
| 6 | Identity collision reassessment | 2 min |