
Preparation of name and address data for record linkage using hidden Markov models.

Tim Churches, Peter Christen, Kim Lim, Justin Xi Zhu.

Abstract

BACKGROUND: Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs).
METHODS: HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems.
RESULTS: Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse than that of the rule-based system when HMMs were applied to simpler name data. Possible reasons for this poorer performance are discussed.
CONCLUSION: Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.
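
The two-step approach summarised above (lexicon-based tokenisation followed by HMM decoding) can be illustrated with a minimal Python sketch. This is not the authors' software: the lexicon entries, tag names (NU, N4, UN, WT, TR), state names and all probabilities below are hypothetical and hand-set for illustration, whereas the paper trains its HMMs from example data. The sketch tags each word using a small lexicon, then uses the Viterbi algorithm to find the most likely sequence of output fields.

```python
import math

# Hypothetical lexicon mapping known words to observation tags.
LEXICON = {"street": "WT", "st": "WT", "road": "WT", "rd": "WT",
           "nsw": "TR", "vic": "TR"}

def tokenise(address):
    """Split an address into words and assign each an observation tag."""
    words = address.lower().replace(",", " ").split()
    tags = []
    for w in words:
        if w in LEXICON:
            tags.append(LEXICON[w])          # wayfare type or territory
        elif w.isdigit():
            tags.append("N4" if len(w) == 4 else "NU")  # 4-digit vs other number
        else:
            tags.append("UN")                # unknown word
    return words, tags

# Hypothetical, hand-set HMM parameters (a trained model would estimate these).
STATES = ["number", "name", "type", "locality", "territory", "postcode"]
START = {"number": 0.9, "name": 0.1}
TRANS = {
    "number": {"name": 1.0},
    "name": {"name": 0.2, "type": 0.8},
    "type": {"locality": 1.0},
    "locality": {"locality": 0.3, "territory": 0.6, "postcode": 0.1},
    "territory": {"postcode": 1.0},
    "postcode": {},
}
EMIT = {
    "number": {"NU": 0.9, "N4": 0.1},
    "name": {"UN": 0.9, "WT": 0.1},
    "type": {"WT": 1.0},
    "locality": {"UN": 0.9, "WT": 0.1},
    "territory": {"TR": 1.0},
    "postcode": {"N4": 1.0},
}

def _log(p):
    """Log probability, with log(0) represented as negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(tags):
    """Return the most likely state sequence for a list of observation tags."""
    if not tags:
        return []
    # best[s] = (log probability of the best path ending in s, that path)
    best = {s: (_log(START.get(s, 0)) + _log(EMIT[s].get(tags[0], 0)), [s])
            for s in STATES}
    for tag in tags[1:]:
        new = {}
        for s in STATES:
            e = _log(EMIT[s].get(tag, 0))
            score, prev = max(
                ((best[p][0] + _log(TRANS[p].get(s, 0)) + e, p) for p in STATES),
                key=lambda t: t[0],
            )
            new[s] = (score, best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])[1]

def standardise(address):
    """Group the words of an address into named output fields."""
    words, tags = tokenise(address)
    fields = {}
    for word, state in zip(words, viterbi(tags)):
        fields[state] = (fields.get(state, "") + " " + word).strip()
    return fields

print(standardise("42 Miller Street, North Sydney NSW 2060"))
```

With these toy parameters, the example address is segmented into number "42", name "miller", type "street", locality "north sydney", territory "nsw" and postcode "2060". Working in log space avoids numerical underflow on longer inputs; transitions or emissions absent from the tables are treated as impossible.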

Year: 2002; PMID: 12482326; PMCID: PMC140019; DOI: 10.1186/1472-6947-2-9

Source DB: PubMed; Journal: BMC Med Inform Decis Mak; ISSN: 1472-6947; Impact factor: 2.796


