| Literature DB >> 28693435 |
Duong Thuy Tran1, Alys Havard2, Louisa R Jorm2.
Abstract
BACKGROUND: Data cleaning is an important quality assurance in data linkage research studies. This paper presents the data cleaning and preparation process for a large-scale cross-jurisdictional Australian study (the Smoking MUMS Study) to evaluate the utilisation and safety of smoking cessation pharmacotherapies during pregnancy.Entities:
Keywords: Data cleaning methods; Data consistency; Perinatal; Record linkage
Mesh:
Year: 2017 PMID: 28693435 PMCID: PMC5504784 DOI: 10.1186/s12874-017-0385-6
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1Data linkage and examples of data set layouts
Descriptions of data sets
| Data type | Jurisdiction | Description | Number of records and persons prior to cleaning | Dates covered |
|---|---|---|---|---|
| Perinatala | ||||
| Perinatal Data Collection (PDC) | NSW | Notifications of all livebirths and stillbirths of at least 20 weeks of gestation or at least 400 g birthweight in NSW and WA. | N records: 1155,207b
| 1/1/2003–31/12/2012b |
| Midwives Notification System (MNS) | WA | N records: 371155b
| 1/1/2003–31/12/2012b | |
| Hospital admissiona | ||||
| Admitted Patient Data Collection (APDC) | NSW | All hospital discharges from all public and private hospitals in NSW and WA. | Mother: 2,708,607 | Mother: 1/7/2001–30/6/2014 |
| Hospital Morbidity Data Collection (HMDC) | WA | Mother: 1,245,018 | Mother: 1/1/1980–30/6/2013 | |
| Emergency department attendance | ||||
| Emergency Department Data Collection (EDDC) | NSW | ED attendances at all EDs in metropolitan areas and majority of EDs in regional areas in NSW and WA. | Mother: 1,518,745 | 1/1/2005–31/12/2014 |
| Emergency Department Data Collection (EDDC) | WA | Mother: 854,050 | Mother: 1/1/2002–30/11/2013 | |
| Death data | ||||
| NSW Registry of Births, Deaths and Marriages (RBDM) | NSW | Date of deaths registered in NSW | Mother: 1337 | 1/1/2003–31/12/2014 |
| Causes Of Death Unit Record File (COD URF) | NSW | Underlying and contributing causes of death for those registered in NSW | Mother: 881 | 1/1/2003–31/12/2012c |
| WA death registration and COD URF | WA | Date of death, underlying and contributing causes of death for those registered in WA | Mother: 478 | 1/1/2003–31/12/2013 |
| Congenital Conditions | ||||
| Register of Congenital Conditions (RoCC) | NSW | Congenital conditions detected during pregnancy or at birth, or diagnosed in infants up to 1 year of age | Baby: 9976 | 1/1/2004–31/12/2009 |
| PBS links mapping tables and claims data | ||||
| Mapping table between PBS PATIDs and NSW mumPPNs | Federal | All identified matches wherein the weight of the match ≥ 17 d | 1,006,960 matches | |
| Mapping table between PBS PATIDs and WA mumPPNs | Federal | 226,817 matches | ||
| PBS claim records, NSW or Australian Capital Territory (ACT)e is the State of the pharmacies | Federal | Claims made in NSW and ACT for all PATIDs included in the NSW mapping tablee | 17,470,068 claim records | 1/1/2003–31/12/2013 |
| PBS claim records, WA is the State of the pharmacies | Federal | Claims made in WA for all PATIDs included in the WA mapping tablee | 3,364,490 claim records | 1/1/2003–31/12/2013 |
aA child birth generates one birth notification (≥2 notifications if plural births), one hospital separation record for the mother, and one hospital separation record for the newborn (≥2 hospital records for the newborns if plural births)
bRecords of pre-2003 deliveries (from 1994 in NSW, 1980 in WA) were included in the linkage for the mothers
cThe NSW COD URF data are not available for the same duration as the RBDM death registrations due to the time needed for coding causes of death
dThe recommended threshold weight to accept the matches to NSW mumPPNs and WA mumPPNs was 29 and 28, respectively
eThe Australian Capital Territory (population 385,996) is geographically surrounded by NSW. PBS claims data for NSW women included those dispensed in NSW and the ACT and covered a wider range of pharmaceutical items than claims data for WA mothers
Fig. 2Summary of data cleaning steps and results
Steps undertaken to assess consistency of State-based data
| Step Data sets | Explanationa | Findings | |
|---|---|---|---|
| Uniqueness of record | |||
| 1 | Record duplicates | ||
| Perinatal, | 1.1 Identify identical duplicates i.e. all variables contain the same information, except the unique ID of the records; | • Marked the records as “duplicate”: 9 perinatal, 1161 hospital admissions, 659 ED visits, and 49 death records. | |
| 2 | Missing record of death registration (NSW only) | ||
| Death | 2.1 Identify whether a PPN is absent from the death registration but present in the causes of death data set | • No cases found (as expected on the basis of deterministic linkage methodology for death data). | |
| 3 | Uniqueness of babyPPN | ||
| Perinatal | Non-unique babyPPN may be due to linkage errors or multiple data entries; | • 31 mothers were flagged as “exclusion” | |
| Consistency of perinatal information | |||
| 4 | Birthweight and gestational age | ||
| Perinatal | 4.1 Cross-tabulate birthweight (categorised as missing, <400, 400–999, 1000–1999, 2000–4999, and ≥5000 g) with gestational age (categorised as missing, <20, 20–26, 27–36, 37–44, and ≥45 weeks); | • 28 records with both gestation <20 weeks and birthweight <400 g; | |
| 5 | Birth order and pregnancy plurality | ||
| Perinatal | It is necessary to select one record per delivery (where birth order = 1st) when pregnancy is the unit of analysise. | • No record with implausible birth order (given the plurality); | |
| 6 | Interval between the two consecutive pregnancies | ||
| Perinatal | 6.1 Select records where birth order = 1st and sort in ascending order of baby DOB; | • 396 mothers marked as “exclusion”. | |
| 7 | Missing value of parity | ||
| Perinatal | 7.1 Subset to mothers who had a missing value of parity. | • Missing parity was replaced for 1218 out of 1633 records. | |
| 8 | Consistency in parity | ||
| Perinatal | 8.1 Select records where birth order = 1st, sort in ascending baby DOB, then for each mother: | • 422 mothers were flagged as “exclusion”. | |
| Consistency across different data sources | |||
| 9 | Consistency in baby DOB | ||
| Perinatal, | Validation studies reported accuracy of baby DOB in perinatal data (referred to as perinatal DOB), this variable can be used to assess the baby’s DOB and age in other linked records (referred to as patient DOB). Vice versa, when the baby DOBs across data sources differ, patient DOB can be used to verify baby DOB recorded in perinatal data. | • Alternative DOB was created for 667 babies. The original and alternative DOBs were 1 day apart (30%), between 2 and 10 days apart (46%), and share the same day and year (16%). | |
| 10 | Consistency between perinatal and congenital condition data | ||
| Perinatal, | 10.1 Identify baby who had a linked birth defect notification; | • 1 mother flagged as “exclusion” (review indicated linkage errors among her children). | |
| 11 | Consistency in mother’s year of birth (YOB) | ||
| Perinatal, | 11.1 Examine and define the range of mother YOBs according to perinatal data; | • YOBs in perinatal data ranged between 1941 and 1999, with no invalid values. | |
| 12 | Mother’s sex as male | ||
| Perinatal, | 12.1 Combine mother’s perinatal, hospital, ED and death records, then remove records with invalid mother YOBs (<1900 or >2014). For each mother: | • Hospital and ED data of the mothers contain records with sex recorded as male; | |
| 13 | Mother having births after total hysterectomy procedures | ||
| Perinatal | 13.1 Identify mothers who had hospital admissions for hysterectomy procedure(s)j; | • 51 mother flagged as “exclusion” | |
| 14 | Baby DOB being later than date of discharge from hospital or ED | ||
| Perinatal, | 14.1 Combine babies’ hospital admission and ED data sets. | • 10 mothers flagged as “exclusion”; | |
| 15 | Date of death being earlier than episodes of health service use | ||
| Death | 15.1 Identify persons (mothers or babies) who have a linked death record; | • 22 mothers flagged as “exclusion”. | |
| 16 | Date of admission or arrival to ED being later than date of discharge | ||
| Hospital, | 16.1 Identify hospital admission and ED records wherein date of admission to hospital or date of arrival to ED is later than the date of discharge. Mark these records as “deletion”. | • 34 hospital or ED records marked as “deletion” | |
aIn this study, adding a variable to a data set is referred to as “merge” while adding records is referred to as “combine”
bIt is useful to examine status of patient at discharge and date of discharge in hospital or ED data relating these deaths if dates of death differ
cInformation useful for review: baby DOB, plurality, birth order, birthweight, gestational age, Apgar scores, discharge status, mother’s age, postcode, country of birth and hospital
dPerinatal data cover births that gestation ≥20 weeks or birthweight ≥400 g
eBirth order indicates the order each baby was born (coded as 1st, 2nd, 3rd, etc.). Birth order for singletons is 1st. Multi-fetal pregnancies generated two or more perinatal records which contain baby-specific information including order of birth while maternal information is the same
fIn complicated plural pregnancies, it might be possible that babies born days apart, thus the gap between baby DOBs needs to be consistent with the difference in gestational age
gParity as zero in the second record indicates an error, given parity defined as the number of previous pregnancies ≥20 weeks; but this errors is not always identified through the check of parity sequence (e.g. parity values as missing-0-1)
hThe interval (calculated as in Step 6) between the first and the second record
iThe interval between the record with missing parity and the prior record
jHospital procedures were coded according to Australian Classification of Health Interventions (ACHI). See the Additional file 1 for hysterectomy procedure codes
Extract recommended PBS links and steps undertaken to check cross-jurisdictional linkage
| Step Data sets | Explanationa | Findings | |
|---|---|---|---|
| Extract recommended PBS links | |||
| PBS PATID- mumPPN mapping tables | Extract the PBS links from the mapping tables where weight of the match is equal to or greater than the recommended threshold (≥29 for NSW and ≥28 for WA). | • 512,887 (50.9%) PBS links to NSW mumPPNs extracted. | |
| 17 | Detection of clusters of mumPPNs | ||
| PBS mapping tables | 17.1 Among the extracted PBS links, identify cases where a PATID matches to two or more mumPPNs. Send these cases to AIHW data linkage unit for review and obtain advice on reliability of the matches. | • 9190 cases (20,254 matches) detected and reviewed. Of those, there were 3404 mumPPN clusters (including 6819 matches to 6815 mumPPNs). | |
| 18 | Date of pharmaceutical dispensing later than date of death | ||
| PBS mapping tables, | 18.1 For any women who have a death record, identify the matched PATIDs (remaining at completion of Step 17); | • 79 PBS links were disregarded. | |
| 19 | Consistency in clusters of mumPPNs | ||
| PBS mapping tables, | For 3404 clusters (identified in Step 17): | • 81 clusters were rejected, due to either | |
| 20 | Consistency in women who had records from both States | ||
| PBS mapping tables, | 20.8 Among the remaining PBS links (at completion of Step 19), identify PATIDs which concurrently match to NSW mumPPNs and WA mumPPNs; | • 2855 PATIDs concurrently match to mumPPNs in both NSW and WA; | |
| 21 | Integration of mothers’ records | ||
| All datasets relating to mothers | Integration of records is required for mumPPN clusters and women having records in both States. | • As per State linkages, there were 783,471 mumPPNs. | |
| 22 | Consistency in finalPPNmums with multiple PATIDs and finalise mother cohort | ||
| PBS mapping table, | 22.1 Combine the NSW and WA mapping tables (at completion of Step 21), Remove PBS links relating to “exclusion” women. Among the remaining links, identify finalPPNmums that match to ≥2 PATIDs. For those women: | • 4601 finalPPNmums with multiple PATIDs. Of those, 2763 were further flagged as “exclusion” | |
aIn this study, adding a variable to a data set is referred to as “merge” while adding records is referred to as “combine”
bFor example, a NSW mumPPN matches to 2 PATIDs and these 2 PATIDs match to three different WA mumPPNs