Sean M Randall, Anna M Ferrante, James H Boyd, James B Semmens.
Abstract
BACKGROUND: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality.
Year: 2013 PMID: 23739011 PMCID: PMC3688507 DOI: 10.1186/1472-6947-13-64
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Availability of data cleaning functionality across a sample of linkage packages
| Reformat values | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| Remove punctuation | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| Remove alt. missing values | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| Phonetic encoding | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| Name/Address Standardisation | Yes | Yes | No | No | No | No | Yes | Yes |
| Nickname lookup | Yes | Yes | No | Yes | No | No | No | Yes |
| Sex imputation | Yes | Yes | No | Yes | No | No | No | Yes |
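Of the techniques listed above, phonetic encoding is the most algorithmic. As an illustration only, the following is a minimal sketch of American Soundex, one widely used phonetic code. The table does not say which encoding each package implements, so the choice of Soundex and the function name `soundex` are assumptions made for this example.

```python
# Letter-to-digit groups used by American Soundex.
_CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or '' if no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = [first]
    previous = _CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not break a run of same-coded letters
        code = _CODES.get(c, "")
        if code and code != previous:
            encoded.append(code)
        previous = code  # a vowel (empty code) ends the current run
    # Pad with zeros and truncate to one letter plus three digits.
    return ("".join(encoded) + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))  # both S530
```

Spelling variants such as "Smith" and "Smyth" collapse to the same code, which is how phonetic encoding can bring together records that would otherwise fail to match, at the price of also grouping genuinely different names.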
Figure 1. Road map for measuring overall linkage quality.
A comparison of the most common fields in the created synthetic data and the original data it was based on
| Missing value | 1.98 | | Missing value | 1.99 | |
| Smith | 0.92 | 0.94 | John | 3.44 | 3.47 |
| Jones | 0.55 | 0.55 | David | 3.09 | 3.09 |
| Brown | 0.46 | 0.46 | Michael | 2.95 | 2.95 |
| Williams | 0.46 | 0.46 | Peter | 2.87 | 2.88 |
| Taylor | 0.44 | 0.44 | Robert | 2.47 | 2.47 |
| | | | | | |
| Missing value | 1.99 | | Missing value | 1.01 | |
| Margaret | 1.57 | 1.56 | 6210 | 2.84 | 2.84 |
| Susan | 1.35 | 1.34 | 6163 | 2.33 | 2.34 |
| Patricia | 1.22 | 1.22 | 6027 | 2.06 | 2.05 |
| Jennifer | 1.19 | 1.20 | 6155 | 2.02 | 2.02 |
| Elizabeth | 1.05 | 1.05 | 6065 | 2.00 | 1.98 |
Specific data cleaning techniques used on each dataset
| Not required | Not required | Not required |
| | | |
| Invalid dates of birth removed | Invalid dates of birth removed | |
| Invalid postcode values removed | Invalid postcode values removed | |
| | | |
| Both forename and surname fields had all punctuation and spaces removed | Both forename and surname fields had all punctuation and spaces removed | |
| | | |
| | Nicknames were changed to their more common variant | |
| | | |
| | | Records with missing sex had a value imputed based on their first name |
| Date of birth reformatted | Date of birth reformatted | Date of birth reformatted |
| | | |
| Invalid dates of birth were removed | Invalid dates of birth were removed | |
| Invalid postcode values were removed (‘9999’ etc.) | Invalid postcode values were removed (‘9999’ etc.) | |
| Uninformative address and suburb values removed (‘NO FIXED ADDRESS’, ‘UNKNOWN’ etc.) | Uninformative address and suburb values removed (‘NO FIXED ADDRESS’, ‘UNKNOWN’ etc.) | |
| Birth information encoded in first name removed (‘TWIN ONE OF MARTHA’ etc.) | Birth information encoded in first name removed (‘TWIN ONE OF MARTHA’ etc.) | |
| | | |
| Forename, middle name, surname and suburb fields had all punctuation and spaces removed | Forename, middle name, surname and suburb fields had all punctuation and spaces removed | |
| | | |
| Nicknames were changed to their more common variant | | |
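The cleaning steps listed in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the record layout, the field names, the `dd/mm/yyyy` input date format, the four-digit postcode rule and the tiny lookup tables are all hypothetical, and the removal of birth information embedded in first names is not covered.

```python
import re
from datetime import datetime

# Hypothetical lookup tables for illustration; production tables are much larger.
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT", "LIZ": "ELIZABETH"}
SEX_BY_FORENAME = {"WILLIAM": "M", "JOHN": "M", "MARGARET": "F", "SUSAN": "F"}
UNINFORMATIVE_ADDRESSES = {"NO FIXED ADDRESS", "UNKNOWN"}
INVALID_POSTCODES = {"9999"}


def clean_name(value):
    """Upper-case a name and drop punctuation, spaces and any non A-Z character."""
    return re.sub(r"[^A-Z]", "", (value or "").upper())


def clean_record(record, use_nicknames=True, impute_sex=True):
    """Return a cleaned copy of a record (a dict of strings); '' means missing."""
    out = dict(record)

    # Remove punctuation and spaces from name fields.
    out["forename"] = clean_name(record.get("forename"))
    out["surname"] = clean_name(record.get("surname"))

    # Replace nicknames with their more common variant.
    if use_nicknames:
        out["forename"] = NICKNAMES.get(out["forename"], out["forename"])

    # Blank out invalid postcodes (assumes valid ones are exactly four digits).
    postcode = (record.get("postcode") or "").strip()
    valid = re.fullmatch(r"\d{4}", postcode) and postcode not in INVALID_POSTCODES
    out["postcode"] = postcode if valid else ""

    # Blank out uninformative address values.
    address = (record.get("address") or "").strip().upper()
    out["address"] = "" if address in UNINFORMATIVE_ADDRESSES else address

    # Reformat date of birth to ISO 8601; impossible dates become missing.
    try:
        dob = datetime.strptime(record.get("dob") or "", "%d/%m/%Y")
        out["dob"] = dob.strftime("%Y-%m-%d")
    except ValueError:
        out["dob"] = ""

    # Impute missing sex from the cleaned forename when it is in the lookup.
    if impute_sex and not record.get("sex"):
        out["sex"] = SEX_BY_FORENAME.get(out["forename"], "")

    return out
```

The `use_nicknames` and `impute_sex` switches are there so that each technique can be evaluated separately. That matters here because the results tables in this paper report lower predictive ability for nickname lookup and sex imputation.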
Overall linkage quality results
| | |
| No cleaning | 0.883 |
| Minimal cleaning | 0.882 |
| High cleaning | 0.875 |
| | |
| No cleaning | 0.993 |
| Minimal cleaning | 0.993 |
| High cleaning | 0.992 |
Improvement in predictive ability of data cleaning techniques
| Remove punctuation | −0.08%ᵃ | +0.08% |
| Remove alt. missing values | +0.5% | 0% |
| Nickname lookup | −28% | −33% |
| Sex imputation | NA | −5% |
ᵃ A negative sign (−) indicates a decrease in predictive ability and a positive sign (+) an increase, relative to the baseline.
Examples of single variable changes in predictive ability for individual cleaning techniques in hospital admission data
| Variable | Precision | Recall | F-measure |
|---|---|---|---|
| Given name original | 0.006575 | 0.946085 | 0.013059 |
| Given name with removed punctuation | 0.006573 | 0.947188 | 0.013056 |
| Given name with nicknames removed | 0.004357 | 0.953738 | 0.008675 |
| Surname original | 0.025265 | 0.98824 | 0.049271 |
| Soundex of surname | 0.008845 | 0.994926 | 0.017533 |
| Address original | 0.687066 | 0.669649 | 0.678246 |
| Address with alternate missing values and uninformative values removed | 0.687398 | 0.709426 | 0.698238 |
b Down arrow symbol (↓) refers to decreased percentage change, up arrow (↑) refers to increased percentage change.
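In each row of the table above, the third value is the harmonic mean of the first two, which is consistent with the columns being precision, recall and F-measure. Assuming that reading, the sketch below shows how the F-measure and the percentage change against a baseline could be computed. The function names are illustrative and not taken from the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def percent_change(baseline, cleaned):
    """Relative change of `cleaned` against `baseline`, in percent."""
    return 100.0 * (cleaned - baseline) / baseline


# Values copied from the table rows for the given name field.
original = f_measure(0.006575, 0.946085)
nicknames_removed = f_measure(0.004357, 0.953738)

print(round(original, 6))           # 0.013059
print(round(percent_change(original, nicknames_removed), 1))
```

For the given name field, this gives a drop of roughly a third in F-measure after nicknames are removed: recall rises slightly, but the loss in precision outweighs it.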