Mohammed Khalilia, Sounak Chakraborty, Mihail Popescu.
Abstract
BACKGROUND: We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare.
Year: 2011 PMID: 21801360 PMCID: PMC3163175 DOI: 10.1186/1472-6947-11-51
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
HCUP data elements
| Element Name | Element Description | |
|---|---|---|
| AGE | Age in years at admission | |
| AGEDAY | Age in days (when age < 1 year) | |
| AMONTH | Admission month | |
| ASOURCE | Admission source (uniform) | |
| ASOURCEUB92 | Admission source (UB-92 standard coding) | |
| ASOURCE_X | Admission source (as received from source) | |
| ATYPE | Admission type | |
| AWEEKEND | Admission day is a weekend | |
| DIED | Died during hospitalization | |
| DISCWT | Weight to discharges in AHA universe | |
| DISPUB92 | Disposition of patient (UB-92 standard coding) | |
| DISPUNIFORM | Disposition of patient (uniform) | |
| DQTR | Discharge quarter | |
| DRG | DRG in effect on discharge date | |
| DRG18 | DRG, version 18 | |
| DRGVER | DRG grouper version used on discharge date | |
| DSHOSPID | Data source hospital identifier | |
| DX1 | Principal diagnosis | |
| DX2 | Diagnosis 2 | |
| DX3 | Diagnosis 3 | |
| DX4 | Diagnosis 4 | |
| DX5 | Diagnosis 5 | |
| DX6 | Diagnosis 6 | |
| DX7 | Diagnosis 7 | |
| DX8 | Diagnosis 8 | |
| DX9 | Diagnosis 9 | |
| DX10 | Diagnosis 10 | |
| DX11 | Diagnosis 11 | |
| DX12 | Diagnosis 12 | |
| DX13 | Diagnosis 13 | |
| DX14 | Diagnosis 14 | |
| DX15 | Diagnosis 15 | |
| DXCCS1 | CCS: principal diagnosis | |
| DXCCS2 | CCS: diagnosis 2 | |
| DXCCS3 | CCS: diagnosis 3 | |
| DXCCS4 | CCS: diagnosis 4 | |
| DXCCS5 | CCS: diagnosis 5 | |
| DXCCS6 | CCS: diagnosis 6 | |
| DXCCS7 | CCS: diagnosis 7 | |
| DXCCS8 | CCS: diagnosis 8 | |
| DXCCS9 | CCS: diagnosis 9 | |
| DXCCS10 | CCS: diagnosis 10 | |
| DXCCS11 | CCS: diagnosis 11 | |
| DXCCS12 | CCS: diagnosis 12 | |
| DXCCS13 | CCS: diagnosis 13 | |
| DXCCS14 | CCS: diagnosis 14 | |
| DXCCS15 | CCS: diagnosis 15 | |
| ECODE1 | E code 1 | |
| ECODE2 | E code 2 | |
| ECODE3 | E code 3 | |
| ECODE4 | E code 4 | |
| ELECTIVE | Elective versus non-elective admission | |
| E_CCS1 | CCS: E Code 1 | |
| E_CCS2 | CCS: E Code 2 | |
| E_CCS3 | CCS: E Code 3 | |
| E_CCS4 | CCS: E Code 4 | |
| FEMALE | Indicator of sex | |
| HOSPID | HCUP hospital identification number | |
| HOSPST | Hospital state postal code | |
| KEY | HCUP record identifier | |
| LOS | Length of stay (cleaned) | |
| LOS_X | Length of stay (as received from source) | |
| MDC | MDC in effect on discharge date | |
| MDC18 | MDC, version 18 | |
| MDNUM1_R | Physician 1 number (re-identified) | |
| MDNUM2_R | Physician 2 number (re-identified) | |
| NDX | Number of diagnoses on this record | |
| NECODE | Number of E codes on this record | |
| NEOMAT | Neonatal and/or maternal DX and/or PR | |
| NIS_STRATUM | Stratum used to sample hospital | |
| NPR | Number of procedures on this record | |
| PAY1 | Primary expected payer (uniform) | |
| PAY1_X | Primary expected payer (as received from source) | |
| PAY2 | Secondary expected payer (uniform) | |
| PAY2_X | Secondary expected payer (as received from source) | |
| PL_UR_CAT4 | Patient Location: Urban-Rural 4 Categories | |
| PR1 | Principal procedure | |
| PR2 | Procedure 2 | |
| PR3 | Procedure 3 | |
| PR4 | Procedure 4 | |
| PR5 | Procedure 5 | |
| PR6 | Procedure 6 | |
| PR7 | Procedure 7 | |
| PR8 | Procedure 8 | |
| PR9 | Procedure 9 | |
| PR10 | Procedure 10 | |
| PR11 | Procedure 11 | |
| PR12 | Procedure 12 | |
| PR13 | Procedure 13 | |
| PR14 | Procedure 14 | |
| PR15 | Procedure 15 | |
| PRCCS1 | CCS: principal procedure | |
| PRCCS2 | CCS: procedure 2 | |
| PRCCS3 | CCS: procedure 3 | |
| PRCCS4 | CCS: procedure 4 | |
| PRCCS5 | CCS: procedure 5 | |
| PRCCS6 | CCS: procedure 6 | |
| PRCCS7 | CCS: procedure 7 | |
| PRCCS8 | CCS: procedure 8 | |
| PRCCS9 | CCS: procedure 9 | |
| PRCCS10 | CCS: procedure 10 | |
| PRCCS11 | CCS: procedure 11 | |
| PRCCS12 | CCS: procedure 12 | |
| PRCCS13 | CCS: procedure 13 | |
| PRCCS14 | CCS: procedure 14 | |
| PRCCS15 | CCS: procedure 15 | |
| PRDAY1 | Number of days from admission to PR1 | |
| PRDAY2 | Number of days from admission to PR2 | |
| PRDAY3 | Number of days from admission to PR3 | |
| PRDAY4 | Number of days from admission to PR4 | |
| PRDAY5 | Number of days from admission to PR5 | |
| PRDAY6 | Number of days from admission to PR6 | |
| PRDAY7 | Number of days from admission to PR7 | |
| PRDAY8 | Number of days from admission to PR8 | |
| PRDAY9 | Number of days from admission to PR9 | |
| PRDAY10 | Number of days from admission to PR10 | |
| PRDAY11 | Number of days from admission to PR11 | |
| PRDAY12 | Number of days from admission to PR12 | |
| PRDAY13 | Number of days from admission to PR13 | |
| PRDAY14 | Number of days from admission to PR14 | |
| PRDAY15 | Number of days from admission to PR15 | |
| RACE | Race (uniform) | |
| TOTCHG | Total charges (cleaned) | |
| TOTCHG_X | Total charges (as received from source) | |
| YEAR | Calendar year | |
| ZIPInc_Qrtl | Median household income quartile for patient's ZIP Code |
Complete list of 126 HCUP data elements. The elements marked with "*" (rows 33-47) are the ones used in the classification as input variables.
*HCUP data elements used in the classification
Figure 1. Disease codes and categories hierarchical relationship. This is a snapshot of the hierarchical relationship between the diseases and disease categories. For instance, disease category 49 (diabetes) has children that are represented by disease codes (ICD-9-CM).
Figure 2. Demographics of patients by age, race and sex for the HCUP data set.
The 10 most prevalent disease categories
| Disease Category | Prevalence |
|---|---|
| Hypertension | 29.1% |
| Coronary Atherosclerosis | 27.65% |
| Hyperlipidemia | 14.46% |
| Dysrhythmia | 14.35% |
| Other Circulatory Diseases | 12.02% |
| Diabetes mellitus no complication | 12% |
| Anemia | 11.93% |
The 10 most prevalent disease categories among the 8 million samples, with the percentage of samples having each disease.
Some of the most imbalanced disease categories
| Disease Category | Percent of Active class |
|---|---|
| Male Genital Disease | 0.01% |
| Testis Cancer | 0.046% |
| Encephalitis | 0.059% |
| Aneurysm | 0.74% |
| Breast Cancer | 1.66% |
| Peripheral Atherosclerosis | 3.16% |
| Diabetes Mellitus w/complication | 4.7% |
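The "percent of active class" column above is the share of samples that carry the disease category. As a minimal sketch of that arithmetic, assuming labels are stored as a list of 0/1 values (the function name `active_class_percent` is illustrative and not from the original work):

```python
def active_class_percent(labels):
    """Percentage of samples labelled 1 (active) in a list of 0/1 labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    return 100.0 * sum(labels) / len(labels)
```

For example, one active sample among 10,000 gives 0.01%, the level of imbalance reported for the rarest category in the table.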
Sample dataset. The bolded column (Cat. 50) represents the category to predict.
| | | | .... | .... | Cat. 257 | Cat. 258 | Cat. 259 | Age | Race | Sex |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | .... | .... | 0 | 1 | 1 | 69 | 3 | 0 |
| . | . | . | . | . | . | . | . | . | . | . |
| 1 | 0 | 0 | .... | .... | 1 | 0 | 0 | 55 | 1 | 1 |
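The sample dataset can be read as one binary indicator per disease category, followed by age, race and sex. The sketch below shows one way such a row might be derived from the DXCCS1 to DXCCS15 elements of a discharge record. It rests on several assumptions: the record is a plain dictionary, categories are numbered 1 to 259, the target category is removed from the inputs, and the name `encode_record` is hypothetical. It is not the authors' code.

```python
N_CATEGORIES = 259  # assumption: categories numbered 1..259, as in the sample table


def encode_record(record, target_category):
    """Turn one discharge record into (features, label).

    `record` is a hypothetical dict with keys DXCCS1..DXCCS15 (category codes,
    possibly missing), plus AGE, RACE and FEMALE.
    """
    # Collect the disease categories present on this record.
    present = set()
    for i in range(1, 16):
        code = record.get(f"DXCCS{i}")
        if code is not None and 1 <= code <= N_CATEGORIES:
            present.add(code)

    # The label is whether the category to predict appears on the record.
    label = 1 if target_category in present else 0

    # One 0/1 indicator per remaining category (assumption: the target
    # category is excluded from the inputs so it cannot leak into them).
    features = [
        1 if c in present else 0
        for c in range(1, N_CATEGORIES + 1)
        if c != target_category
    ]
    features += [record["AGE"], record["RACE"], record["FEMALE"]]
    return features, label
```

With category 50 as the target, each record yields 258 category indicators plus the three demographic values.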
Figure 3. Flow diagram of random forest and sub-sampling approach.
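The caption does not spell out the sub-sampling procedure, so the following is only a sketch of one plausible scheme for highly imbalanced data: keep every active sample, split the shuffled inactive samples into disjoint chunks of roughly the same size, train one classifier per balanced sub-sample, and combine the predictions by majority vote. The function names and the tie-breaking rule are assumptions made for illustration.

```python
import random


def balanced_subsamples(labels, seed=0):
    """Return index lists, each pairing all active (1) samples with one
    disjoint chunk of inactive (0) samples of about the same size."""
    active = [i for i, y in enumerate(labels) if y == 1]
    inactive = [i for i, y in enumerate(labels) if y == 0]
    if not active or not inactive:
        raise ValueError("both classes must be present")

    rng = random.Random(seed)
    rng.shuffle(inactive)

    size = len(active)
    chunks = [inactive[k:k + size] for k in range(0, len(inactive), size)]
    return [active + chunk for chunk in chunks]


def majority_vote(votes):
    """Combine 0/1 predictions from the per-sub-sample classifiers.
    Ties resolve to 0 (inactive); this is an arbitrary choice."""
    return 1 if 2 * sum(votes) > len(votes) else 0
```

Each index list returned by `balanced_subsamples` would be used to train one classifier (for instance one random forest), and `majority_vote` would merge their predictions for a test sample.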
Figure 4. RF behaviour when the number of trees (ntree) varies. This plot shows how the sensitivity of RF varies with the number of trees (ntree). We varied ntree from 1 to 1001 in intervals of 25 and measured the sensitivity at every interval. Sensitivity ranged from 0.8457 at ntree = 1 to 0.8984 at ntree = 726. In our experiments we used ntree = 500, since ntree did not have a large effect on accuracy for ntree > 1.
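For reference, sensitivity is the fraction of truly active samples that the classifier labels as active, and the grid of ntree values described in the caption can be written as a simple range. The sketch below assumes 0/1 labels and predictions; the names `sensitivity` and `ntree_grid` are illustrative.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN), for 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no active samples in y_true")
    return tp / (tp + fn)


# ntree values from 1 to 1001 in steps of 25, as described in the caption.
ntree_grid = list(range(1, 1002, 25))
```

The grid contains 41 values and includes ntree = 726, the setting at which the highest sensitivity was reported.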
Figure 5. ROC curve for diabetes mellitus. ROC curve for diabetes mellitus comparing SVM, RF, boosting and bagging.
Figure 6. ROC curve for hypertension. ROC curve for hypertension comparing SVM, RF, boosting and bagging.
Figure 7. ROC curve for breast cancer. ROC curve for breast cancer comparing SVM, RF, boosting and bagging.
RF, SVM, bagging and boosting performance in terms of AUC on eight disease categories
| Disease | RF | SVM | Bagging | Boosting |
|---|---|---|---|---|
| Breast cancer | 0.9063 | 0.905 | 0.8886 | |
| Diabetes no complication | 0.8417 | 0.8568 | 0.8607 | |
| Diabetes with/complication | 0.9239 | 0.9294 | 0.9327 | |
| Hypertension | 0.8592 | 0.8719 | 0.8842 | |
| Coronary Atherosclerosis | 0.8973 | 0.887 | 0.9026 | |
| Peripheral Atherosclerosis | 0.8972 | 0.8967 | 0.9003 | |
| Other Circulatory Diseases | 0.7591 | 0.7669 | 0.7683 | |
| Osteoporosis | 0.867 | 0.8659 | 0.8635 |
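The AUC values in this table summarize each ROC curve as a single number. A minimal sketch of how AUC can be computed from classifier scores follows, using the rank interpretation: the probability that a randomly chosen active sample scores higher than a randomly chosen inactive one, with ties counted as one half. The function name `auc` is illustrative, and this pairwise version is quadratic in the number of samples, so it is meant for explanation rather than for 8 million records.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true holds 0/1 labels; scores holds the classifier's score for class 1.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # active sample ranked above inactive one
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

Under this definition, a classifier that assigns the same score to every sample has an AUC of exactly 0.5.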
Statistical comparison of RF and boosting ROC curves. The lower the value, the more significant the difference.
| Disease | p-value |
|---|---|
| Breast cancer | 0.8057 |
| Diabetes no complication | 0.3293 |
| Diabetes with/complication | 0.6266 |
| Hypertension | 0.2 |
| Coronary Atherosclerosis | 0.2764 |
| Peripheral Atherosclerosis | 0.8203 |
| Other Circulatory Diseases | 0.566 |
| Osteoporosis | 0.908 |
Top four most important variables for the eight disease categories
| Disease | Variable 1 | Variable 2 | Variable 3 |
|---|---|---|---|
| 1. Breast cancer | Age | Sex | Secondary malignant |
| 2. Diabetes no complication | Age | Hypertension | Hyperlipidemia |
| 3. Diabetes with/complication | Age | Normal | Fluid-electrolyte |
| 4. Hypertension | Age | Hyperlipidemia | Diabetes without compl. |
| 5. Coronary atherosclerosis | Age | Hypertension | Hyperlipidemia |
| 6. Peripheral atherosclerosis | Age | Coronary | Hypertension |
| 7. Other circulatory diseases | Age | Dysrhythmia | Anemia |
| 8. Osteoporosis | Age | Race | Hypertension |
RF, SVM, bagging and boosting performance without sub-sampling in terms of AUC on eight disease categories
| Disease | RF | SVM | Bagging | Boosting |
|---|---|---|---|---|
| Breast cancer | 0.5 | 0.5085 | 0.836 | |
| Diabetes no complication | 0.5 | 0.4749 | 0.8175 | |
| Diabetes with/complication | 0.648 | 0.4985 | 0.8278 | |
| Hypertension | 0.6908 | 0.4886 | 0.8515 | |
| Coronary Atherosclerosis | 0.6601 | 0.4945 | 0.8608 | |
| Peripheral Atherosclerosis | 0.5 | 0.4925 | 0.8279 | |
| Other Circulatory Diseases | 0.5 | 0.4829 | 0.6851 | |
| Osteoporosis | 0.7968 | 0.5 | 0.4931 |
Figure 8. ROC curve for breast cancer (sampling vs. non-sampling). ROC curve for breast cancer comparing RF with the sampling and non-sampling approach.
Figure 9. ROC curve for other circulatory diseases (sampling vs. non-sampling). ROC curve for other circulatory diseases comparing RF with the sampling and non-sampling approach.
Figure 10. ROC curve for peripheral atherosclerosis (sampling vs. non-sampling). ROC curve for peripheral atherosclerosis comparing RF with the sampling and non-sampling approach.