Tom Walter Eitelhuber1, James Thackray1, Steve Hodges1, Janine Alan1.
Abstract
The Western Australia Data Linkage System (WADLS) is maintained and operated by the WA Data Linkage Branch (DLB) at the Western Australian Department of Health. DLB has pioneered a number of data linkage innovations, including the facilitation of genealogical research via the Family Connections system and streamlined data delivery via the Custodian Administered Research Extract Server. DLB's latest innovation is a new data linkage system called "DLS3", which improves DLB's capability and capacity to handle the increasing volume and complexity of its routine operations. DLS3 was built entirely in-house and customised to meet the specific challenges that DLB has encountered over more than twenty years of experience with a wide variety of linkages. This article describes the development and rollout of DLS3, including its design, architecture, benefits and limitations.
Year: 2018 PMID: 32935016 PMCID: PMC7299493 DOI: 10.23889/ijpds.v3i3.435
Source DB: PubMed Journal: Int J Popul Data Sci ISSN: 2399-4908
| File-based Linkage Software (FLS) Limitation | Description |
|---|---|
| Discreteness | FLS did not effectively retain or share information about matching decisions, which led to repetition |
| Iterative process | FLS required linkages to run in a set order, with the clerical review of one phase completed prior to the next phase starting. This approach required substantial manual input by the Linkage Officer. Sometimes, this led to a less efficient strategy being used to determine a link, when a better option may have become apparent during a subsequent phase of the linkage process |
| Pairwise matching | FLS could not concurrently consider more than two records for potential matching, which prevented the Linkage Officer from leveraging the breadth and depth of the WADLS without additional bespoke tools (if feasible and available) and manual intervention |
| Change control | FLS lacked integrated change control, so adjustments to linkage protocols over time were not easily captured |
| Proprietary concerns | FLS was subsumed into a new product, which introduced issues associated with cost, compatibility, support and maintenance |
| New Data Linkage System (DLS3) Design Pillar | Description |
|---|---|
| Concurrency | Run all linkage strategies for a dataset concurrently |
| De-duplication | Remove double-handling of potential matches in the clerical review process |
| Streamlining | Improve the end-to-end linkage process by integrating the standardisation, matching and links management stages and removing repeated or unnecessary manual steps |
| Immediacy | Create new links immediately upon identification, rather than waiting to load a whole file of links as a batch job |
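
The Concurrency and De-duplication pillars can be illustrated with a minimal Python sketch. It is not DLS3's implementation: the records, the three strategies, their weights and the helper names (`pairs_agreeing_on`, `run_all`) are all invented for illustration. Each strategy proposes candidate pairs independently, the strategies run concurrently, and the results are merged so that every pair appears once, carrying its highest weight.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy records: (record_id, surname, date_of_birth, postcode). Invented data.
RECORDS = [
    (1, "SMITH", "1980-01-02", "6000"),
    (2, "SMITH", "1980-01-02", "6100"),
    (3, "SMYTH", "1980-01-02", "6000"),
]


def pairs_agreeing_on(fields, weight):
    """Build a hypothetical strategy: pair records that agree on `fields`."""
    def strategy(records):
        found = {}
        for i, a in enumerate(records):
            for b in records[i + 1:]:
                if all(a[f] == b[f] for f in fields):
                    found[(a[0], b[0])] = weight
        return found
    return strategy


# Assumed strategies and weights; field indexes refer to the tuples above.
STRATEGIES = [
    pairs_agreeing_on((1, 2), weight=20.0),  # surname + date of birth
    pairs_agreeing_on((2, 3), weight=12.0),  # date of birth + postcode
    pairs_agreeing_on((2,), weight=5.0),     # date of birth only
]


def run_all(records, strategies):
    """Run every strategy concurrently, then keep one entry per pair."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: s(records), strategies))
    best = {}
    for found in results:
        for pair, weight in found.items():
            # A pair found by several strategies is kept once, at its best weight.
            best[pair] = max(weight, best.get(pair, weight))
    return best


candidates = run_all(RECORDS, STRATEGIES)
```

Here the pair (1, 2) is proposed by two strategies but appears only once in `candidates`, so a reviewer would never handle it twice.
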
| Data Linkage System (DLS3) Service Structure | Description |
|---|---|
| Loading, Cleaning, Importation | These three Services use dataset profiles, created by the user, that provide DLS3 with a tailored ‘recipe’ for how to standardise and store the data. |
| Linkage preparation | Creates subset blocks in the database for comparison, assigns initial chains to new records and processes any deleted records. |
| Linkage | Triggers comparison algorithms and generates match weights. |
| Clerical review | Prompts the user to review a de-duplicated list of all potential (i.e. non-automatic) links. Potential links are viewed as clusters of records (not individual pairs) and will only be reviewed if the cluster: (a) does not match automatically via a higher quality link identified during the linkage service; or (b) triggers a quality assurance check that overrides a higher quality link. |
| Extraction | Enables the user to extract linkage keys from the WADLS according to specified criteria. |
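
The linkage preparation, linkage and clerical review services described above can be sketched roughly as follows. This is an illustrative Python outline under stated assumptions, not DLS3 code: the blocking key, the toy additive weight, the two thresholds and every function name are hypothetical. Records are grouped into blocks, pairs within a block receive a match weight, and potential links are presented as clusters of connected records rather than as individual pairs.

```python
from collections import defaultdict
from itertools import combinations

# Invented records keyed by record id.
RECORDS = {
    "a": {"surname": "NGUYEN", "given": "ANH", "yob": 1975},
    "b": {"surname": "NGUYEN", "given": "ANH", "yob": 1975},
    "c": {"surname": "NGUYEN", "given": "AN", "yob": 1975},
    "d": {"surname": "BROWN", "given": "LEE", "yob": 1990},
}

AUTO_LINK = 12.0  # assumed threshold for automatic links
REVIEW = 5.0      # assumed lower threshold for potential links


def block(records, key):
    """Linkage preparation: group record ids that share a blocking value."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[rec[key]].append(rid)
    return blocks


def weight(a, b):
    """Toy match weight: +5 per agreeing field, -3 per disagreeing field."""
    return sum(5.0 if a[f] == b[f] else -3.0 for f in ("surname", "given", "yob"))


def link(records, key):
    """Linkage: compare pairs within each block and classify them by weight."""
    auto, potential = [], []
    for ids in block(records, key).values():
        for x, y in combinations(sorted(ids), 2):
            w = weight(records[x], records[y])
            if w >= AUTO_LINK:
                auto.append((x, y))
            elif w >= REVIEW:
                potential.append((x, y))
    return auto, potential


def review_clusters(auto, potential):
    """Clerical review: return connected clusters containing a potential link."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in auto + potential:
        parent[find(x)] = find(y)
    needs_review = {find(x) for x, _ in potential}
    clusters = defaultdict(set)
    for rid in list(parent):
        root = find(rid)
        if root in needs_review:
            clusters[root].add(rid)
    return sorted(sorted(c) for c in clusters.values())


auto_links, potential_links = link(RECORDS, "surname")
clusters = review_clusters(auto_links, potential_links)
```

With this toy data, records `a` and `b` link automatically, while the two potential links involving `c` collapse into a single cluster of three records for one review decision.
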
Figure 1: Diagrammatical representation of DLS3 components.

| Feature | Description & purpose | Description of legacy file-based linkage software (FLS) capability | Description of DLS3 capability |
|---|---|---|---|
| Matching functions | Used to compare different types of fields (i.e. string, numeric, date, location, etc.), including parameters for error tolerance. | New match functions cannot be added and existing functions cannot be modified. | Match functions can be designed, added and updated as required. |
| Matching conditions | Uses inexact comparisons and permissible field values to further restrict which records may be considered for matching. | No match conditions in FLS. Limited to sub-setting of entire input file (e.g. linking only to women). | Match conditions allow excluding some match pairs based on conditions other than exact matching (as implemented by blocking). |
| Data preparation functions | Clean, standardise and transform data in order to improve likelihood of matching with other datasets. | None in FLS. All data preparation must be implemented using custom add-ons developed by DLB, prior to running the FLS. | Can apply additional data cleaning or transformation at linkage stage; for example address standardisation or phonetic transforms. |
| Frequency calculations | Allows matching functions to be configured so that uncommon field values (e.g. surnames) carry more weight than common ones. | Only for the 100 most frequent values. | Frequencies for all values. Can use conditional frequencies (e.g. name frequencies for male vs female). Chain-based rather than record-based (to reduce bias in event-based data). |
| Cardinality restrictions | Allows linkage outcomes to be restricted (one-to-one, one-to-many or many-to-one) to meet expectations of the dataset. | 1:1, 1:N, N:1 matching restrictions are possible, but are limited to post-linkage checking (prior to loading links) on a record-to-record basis. | 1:1, 1:N, N:1 matching restrictions are possible on a chain-to-chain basis, and can be triggered as part of a linkage strategy. |
| Linkage metadata | Capture current and historical information about software, systems, data and linkage keys. | No metadata recording built in. Must be done manually by users. | Metadata recording built in. Includes: data changes; data profile changes; link changes; linkage strategy changes; linkage runs and parameters; clerical review decisions. |
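
The frequency calculations row above can be made concrete with a small Python sketch. The formula used here, an agreement weight of log2(1 / relative frequency), is a common textbook approximation and is assumed for illustration only; the source does not state how DLS3 computes its weights. The sample data and the names `build_frequencies` and `agreement_weight` are likewise hypothetical. Frequencies are conditional on sex, so the same given name can carry a different weight for males and females.

```python
import math
from collections import Counter

# Invented sample of (sex, given_name) values.
SAMPLE = [
    ("F", "KIM"), ("F", "KIM"), ("F", "KIM"), ("F", "ZELDA"),
    ("M", "KIM"), ("M", "JOHN"), ("M", "JOHN"), ("M", "JOHN"),
]


def build_frequencies(sample):
    """Relative frequency of each given name, conditional on sex."""
    counts = Counter(sample)
    totals = Counter(sex for sex, _ in sample)
    return {(sex, name): n / totals[sex] for (sex, name), n in counts.items()}


def agreement_weight(freqs, sex, name):
    """Assumed weight for agreement on `name`: rarer values score higher.

    Values absent from the sample are treated as being as rare as the
    rarest observed value, which is a simplifying assumption.
    """
    frequency = freqs.get((sex, name), min(freqs.values()))
    return math.log2(1.0 / frequency)


freqs = build_frequencies(SAMPLE)
rare_weight = agreement_weight(freqs, "F", "ZELDA")
common_weight = agreement_weight(freqs, "F", "KIM")
```

In this sample, agreement on "KIM" is weak evidence for two female records but stronger evidence for two male records, because the name is rarer among males.
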