Khaled El Emam, Luk Arbuckle, Gunes Koru, Benjamin Eze, Lisa Gaudette, Emilio Neri, Sean Rose, Jeremy Howard, Jonathan Gluck.
Abstract
BACKGROUND: There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.
Year: 2012 PMID: 22370452 PMCID: PMC3374547 DOI: 10.2196/jmir.2001
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Recent examples of public releases of health data for the purpose of competitions.
| Competition | Objective |
| Predict HIV Progression [ | Finding markers in the human immunodeficiency DNA sequence that predict a change in the severity of the infection |
| INFORMS data mining contest [ | Predicting hospitalization outcomes of transfer and death |
| Practice Fusion medical research data [ | Developing an application to manage patients with a focus on chronic diseases |
Description of the fields in the patients data table.
| Field | Description |
| MemberID | Unique identifier for the patient |
| Age^a | Age in years at the time of the first claim in year 1 |
| Sex^a | Patient’s sex |
| DaysInHospital Y2^a | Total number of days the patient was hospitalized in year 2 |
| DaysInHospital Y3^a | Total number of days the patient was hospitalized in year 3 |
^a Quasi-identifier.
Description of the fields for the claims data table.
| Field | Description |
| MemberID | Unique identifier for the patient |
| ProviderID | Unique identifier for the responsible provider giving care |
| Vendor | Unique identifier for the vendor providing the service |
| PCP | Unique identifier for the primary care provider |
| Year | Indicator of claim year (year 1, year 2, or year 3) |
| Specialty^a | Specialty of provider |
| PlaceOfService^a | Place of service |
| CPTCode^a | CPT^b code: these codes provide a means to accurately describe medical, surgical, and diagnostic services, are used for processing claims and for medical review, and are the national coding standard under HIPAA^c |
| LOS^a | Length of stay in hospital |
| DSFC^a | Number of days since first claim computed from the first claim for that patient for each year |
| PayDelay | Number of days of delay between date of service and date of payment of the claim |
| Diagnosis^a | ICD-9-CM^d code |
^a Quasi-identifier.
^b Current Procedural Terminology [22].
^c Health Insurance Portability and Accountability Act.
^d International Classification of Diseases, 9th revision, Clinical Modification [23].
Figure 1. Equations describing how re-identification risk was measured.
Figure 2. The three domain generalization hierarchies for the 3 quasi-identifiers: date of birth (d), gender (g), and visit date (p).
Figure 3. A lattice showing the possible generalizations of the 3 quasi-identifiers: date of birth (d), gender (g), and visit date (p).
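Since the figures are not reproduced here, the following minimal Python sketch illustrates what a generalization lattice over three quasi-identifiers looks like. Each node is a tuple of generalization levels for (d, g, p), and an edge leads from a node to every node that generalizes exactly one quasi-identifier by one step. The hierarchy heights in `HEIGHTS` and the function names are hypothetical placeholders, not values taken from Figure 2.

```python
from itertools import product

# Hypothetical hierarchy heights: the number of generalization steps
# available above the raw value for each quasi-identifier. These are
# illustrative assumptions, not the heights shown in Figure 2.
HEIGHTS = {"d": 2, "g": 1, "p": 2}


def lattice_nodes(heights):
    """Return every combination of generalization levels as a tuple.

    Tuple positions follow the alphabetical order of the names (d, g, p).
    """
    names = sorted(heights)
    ranges = [range(heights[name] + 1) for name in names]
    return [tuple(levels) for levels in product(*ranges)]


def parents(node, heights):
    """Return the nodes that generalize exactly one quasi-identifier by one step."""
    names = sorted(heights)
    result = []
    for i, name in enumerate(names):
        if node[i] < heights[name]:
            result.append(node[:i] + (node[i] + 1,) + node[i + 1:])
    return result


nodes = lattice_nodes(HEIGHTS)
bottom = (0, 0, 0)               # no generalization applied
top = tuple(HEIGHTS[n] for n in sorted(HEIGHTS))  # everything fully generalized
```

Under these assumed heights the lattice has 3 × 2 × 3 = 18 nodes; `parents(bottom, HEIGHTS)` lists the three single-step generalizations of the raw data, and `parents(top, HEIGHTS)` is empty.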
Description of the generalization hierarchies for the quasi-identifiers.
| Quasi-identifier | Description |
| Age | Years → 5-year interval; 80+ → 10-year interval; 80+ → 20-year interval; 80+ |
| Sex | no change |
| DaysInHospital Y2/Y3 | Days → days to 2 weeks; >2 weeks → days to 1 week; 1–2 weeks; >2 weeks |
| Specialty | Original specialty → grouped specialty (see |
| PlaceOfService | Original place of service → grouped place of service (see |
| CPTCode^a | Original CPT code → grouped CPT code |
| LOS^b | Days → days up to 6 days, weeks afterward → days up to 6 days; (1–2] weeks; (2–4] weeks; (4–8] weeks; (8–12] weeks; (12–26] weeks; 26+ weeks → <1 week; (1–2] weeks; (2–4] weeks; (4–8] weeks; (8–12] weeks; (12–26] weeks; 26+ weeks → <4 weeks; (4–8] weeks; (8–12] weeks; (12–26] weeks; 26+ weeks |
| DSFC^c | Days → weeks → 2 weeks → months |
| Diagnosis | ICD-9-CM^d code → primary condition group (see |
^a Current Procedural Terminology.
^b Length of stay in hospital.
^c Days since first claim.
^d International Classification of Diseases, 9th revision, Clinical Modification.
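As an illustration of how one of these hierarchies can be applied, the sketch below generalizes Age through the four levels listed in the table: exact years, then 5-year, 10-year, and 20-year intervals, each with ages of 80 and above collapsed into "80+". The function name, the level numbering, and the interval label format (for example "30-39") are assumptions made for this example; the source does not specify them.

```python
def generalize_age(age, level):
    """Generalize an age in years to a level of the Age hierarchy.

    level 0: exact years (no generalization)
    level 1: 5-year intervals, ages 80 and above reported as "80+"
    level 2: 10-year intervals, ages 80 and above reported as "80+"
    level 3: 20-year intervals, ages 80 and above reported as "80+"

    The level numbering and label format are assumptions for illustration.
    """
    widths = {1: 5, 2: 10, 3: 20}
    if age < 0:
        raise ValueError("age must be non-negative")
    if level == 0:
        return str(age)
    if level not in widths:
        raise ValueError("level must be 0, 1, 2, or 3")
    if age >= 80:
        return "80+"
    width = widths[level]
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

For example, an age of 37 becomes "35-39" at level 1, "30-39" at level 2, and "20-39" at level 3, while an age of 85 becomes "80+" at every level above 0.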
Final generalizations in the dataset.
| Quasi-identifier | Generalization |
| Age | 10-year interval; 80+ |
| Sex | No change |
| DaysInHospital Y2 | Days to 2 weeks; >2 weeks in year 2 |
| DaysInHospital Y3 | Days to 2 weeks; >2 weeks in year 3 |
| Specialty | Grouped specialty (see |
| PlaceOfService | Grouped place of service (see |
| CPTCode^a | Grouped CPT code (see |
| LOS^b | Days up to 6 days; (1–2] weeks; (2–4] weeks; (4–8] weeks; (8–12] weeks; (12–26] weeks; 26+ weeks |
| DSFC^c | 4 weeks |
| Diagnosis | Primary condition group (see |
^a Current Procedural Terminology.
^b Length of stay in hospital.
^c Days since first claim.
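Two of the final generalizations above can be sketched in Python as follows. This is an illustrative reading of the table, not the authors' implementation: it assumes that "Days to 2 weeks; >2 weeks" means day counts up to 14 are kept exactly and longer stays are top-coded, and that "4 weeks" for DSFC means consecutive 4-week (28-day) intervals counted from a day offset of 0. The function names and label strings are hypothetical.

```python
def generalize_days_in_hospital(days):
    """Keep exact day counts up to 14 days; top-code longer stays.

    Assumes "2 weeks" means 14 days (an interpretation of the table).
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    return str(days) if days <= 14 else ">2 weeks"


def generalize_dsfc(days_since_first_claim):
    """Map a day offset to a 4-week interval label.

    Assumes offsets start at 0 and intervals are 28 days wide; the
    label format is hypothetical.
    """
    if days_since_first_claim < 0:
        raise ValueError("days_since_first_claim must be non-negative")
    start_week = (days_since_first_claim // 28) * 4
    return f"{start_week}-{start_week + 4} weeks"
```

With these assumptions, a 9-day hospitalization stays "9", a 20-day hospitalization becomes ">2 weeks", and a claim 100 days after the first claim falls in the "12-16 weeks" interval.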
Estimated proportion of all records in the Heritage Health Prize dataset that would be correctly matched against the State Inpatient Database.
| Age | LOS^a | Sex | Number of visits | PCG^b | CPT^c | Year 1 | Year 2 | Year 3 | All years |
| X | X | X | X | 0.001612 | 0.001478 | 0.001515 | 0.005141 | ||
| X | X | X | X | 0.007105 | 0.005684 | 0.005965 | 0.009735 | ||
| X | X | X | X | 0.013334 | 0.010156 | 0.010928 | 0.013579 | ||
| X | X | X | X | X | 0.017272 | 0.012702 | 0.013797 | 0.015991 |
^a Length of stay in hospital.
^b Primary Condition Group.
^c Current Procedural Terminology.
Percentage of total records correctly matched under simulated attack with different assumptions about the number of claims (power).
| Assumption | Power of adversary = 5 | Power of adversary = 10 | Power of adversary = 15 |
| Original adversary assumptions | 0.84% | 0.94% | 1.17% |
| Multiple quasi-identifiers in the same claim | 3.67% | 3.72% | 3.87% |
| Ordered claims | 0.96% | 1.0% | 1.2% |