Truyen Tran, Wei Luo, Dinh Phung, Sunil Gupta, Santu Rana, Richard Lee Kennedy, Ann Larkins, Svetha Venkatesh.
Abstract
BACKGROUND: Feature engineering is a time-consuming component of predictive modeling. We propose a versatile platform that automatically extracts features for risk prediction, based on a pre-defined and extensible entity schema. The extraction is independent of disease type or risk prediction task. We contrast auto-extracted features with baselines generated from the Elixhauser comorbidities.
Year: 2014 PMID: 25547173 PMCID: PMC4310185 DOI: 10.1186/s12859-014-0425-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the feature extraction framework.
Figure 2. An example subsection of an entity schema. Not all entities were used by the experiments in this paper.
Figure 3. Temporal data represent events in a patient history. Assessment point (AP) defines the timestamp from which future readmissions within a pre-defined period are predicted. Events occurring before the AP were used to construct features. Only APs after the first diagnosis of the disease under study were considered.
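
The role of the assessment point can be illustrated with a minimal sketch. It assumes events are `(date, code)` pairs and that the outcome is "any readmission within a fixed number of days after the AP". The function name, the data layout, and the placeholder codes are hypothetical and are not taken from the paper.

```python
from datetime import date, timedelta

def split_history_and_label(events, readmissions, ap, horizon_days):
    """Split one patient's record around an assessment point (AP).

    events:       list of (date, code) tuples (hypothetical layout)
    readmissions: list of readmission dates
    ap:           assessment point
    horizon_days: length of the prediction window in days
    Returns (history, label): events strictly before the AP, and whether
    any readmission falls in the window (ap, ap + horizon_days].
    """
    history = [(d, code) for d, code in events if d < ap]
    window_end = ap + timedelta(days=horizon_days)
    label = any(ap < d <= window_end for d in readmissions)
    return history, label

# Placeholder record; the codes are illustrative strings only.
events = [
    (date(2005, 1, 10), "dx_A"),
    (date(2005, 6, 2), "dx_B"),
    (date(2005, 9, 1), "dx_C"),   # after the AP, so never used as a feature
]
readmissions = [date(2005, 7, 20)]
ap = date(2005, 7, 1)

history, label_30d = split_history_and_label(events, readmissions, ap, 30)
_, label_10d = split_history_and_label(events, readmissions, ap, 10)
```

Keeping the comparison `d < ap` strict means nothing dated on or after the AP can leak into the features, which is the property the caption describes.
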
Feature sets, baseline and auto-extracted

| Feature set description |
|---|
| Elixhauser comorbidities + demography, present over 1 month history |
| Elixhauser comorbidities + demography, present over 36 months history |
| MR + demography, over 36 months history |
| MR + demography + Elixhauser comorbidities, 36 months history |
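
As a rough illustration of what "present over N months history" could mean in practice, the sketch below builds binary presence indicators over a look-back window ending at the assessment point. It is an assumption-laden example, not the authors' implementation: the function name, the 30-day approximation of a month, and the placeholder comorbidity labels are all hypothetical.

```python
from datetime import date, timedelta

def presence_features(events, ap, history_months, vocabulary):
    """Return {code: 0/1} indicating whether each code in `vocabulary`
    appears in the window [ap - history, ap).

    A month is approximated as 30 days (an assumption made for this sketch).
    """
    window_start = ap - timedelta(days=30 * history_months)
    seen = {code for d, code in events if window_start <= d < ap}
    return {code: int(code in seen) for code in vocabulary}

# Placeholder record and vocabulary; labels are illustrative only.
events = [
    (date(2005, 6, 20), "comorbidity_A"),
    (date(2003, 8, 1), "comorbidity_B"),
]
vocabulary = ["comorbidity_A", "comorbidity_B", "comorbidity_C"]
ap = date(2005, 7, 1)

one_month = presence_features(events, ap, 1, vocabulary)
thirty_six_months = presence_features(events, ap, 36, vocabulary)
```

With the longer window, the older event becomes visible, which is the only difference between the 1-month and 36-month variants in this sketch.
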

Definition of derivation and validation cohorts and the distribution of analysis units in the cohorts (evaluated at discharges following the first diagnosis)

| | Derivation | Validation |
|---|---|---|
| Period | 2003-2007 | 2008-2011 |
| Number of patients | 4,930 | 2,101 |
| Number of analysis units | 11,897 | 4,041 |
| | | |
| Period | 2003-2008 | 2009-2011 |
| Number of patients | 1,816 | 1,816 |
| Number of analysis units | 5,746 | 5,270 |
| | | |
| Period | 2003-2009 | 2010-2011 |
| Number of patients | 3,089 | 1,248 |
| Number of analysis units | 10,728 | 2,232 |
| | | |
| Period | 2003-2008 | 2009-2011 |
| Number of patients | 3,258 | 2,264 |
| Number of analysis units | 7,817 | 4,020 |

Characteristics in patient cohorts

| | Derivation | Validation |
|---|---|---|
| Average Age | 67.6 | 66.1 |
| Gender Distribution (% of females) | 45.3 | 43.1 |
| Median time to readmission (months) | 5.7 | 8.4 |
| | | |
| Average Age | 74.9 | 72.0 |
| Gender Distribution (% of females) | 41.8 | 42.4 |
| Median time to readmission (months) | 4.1 | 4.8 |
| | | |
| Average Age | 48.9 | 49.9 |
| Gender Distribution (% of females) | 50.8 | 48.6 |
| Median time to readmission (months) | 4.9 | 6.4 |
| | | |
| Average Age | 67.0 | 63.9 |
| Gender Distribution (% of females) | 44.8 | 46.2 |
| Median time to readmission (months) | 5.6 | 8.9 |

Performance (AUC) of predicting unplanned readmissions following the unplanned discharges

| Prediction horizon | Elixhauser, 1 month history | Elixhauser, 36 months history | MR, 36 months history | MR + Elixhauser, 36 months history |
|---|---|---|---|---|
| 1 M | 0.57 (0.55,0.60) | 0.60 (0.57,0.63) | 0.730 (0.695,0.766) | 0.730 (0.695,0.766) |
| 2 M | 0.59 (0.56,0.61) | 0.60 (0.57,0.62) | 0.719 (0.689,0.750) | 0.719 (0.689,0.750) |
| 3 M | 0.58 (0.56,0.61) | 0.60 (0.58,0.63) | 0.719 (0.692,0.746) | 0.720 (0.693,0.746) |
| 6 M | 0.59 (0.57,0.61) | 0.61 (0.59,0.64) | 0.724 (0.703,0.746) | 0.724 (0.702,0.745) |
| 12 M | 0.60 (0.57,0.62) | 0.62 (0.59,0.64) | 0.720 (0.701,0.739) | 0.720 (0.701,0.739) |
| | | | | |
| 1 M | 0.60 (0.57,0.62) | 0.60 (0.58,0.63) | 0.708 (0.674,0.741) | 0.704 (0.670,0.738) |
| 2 M | 0.61 (0.59,0.63) | 0.63 (0.61,0.65) | 0.718 (0.692,0.744) | 0.718 (0.692,0.743) |
| 3 M | 0.60 (0.58,0.622) | 0.63 (0.61,0.65) | 0.724 (0.703,0.745) | 0.724 (0.703,0.745) |
| 6 M | 0.62 (0.60,0.633) | 0.64 (0.62,0.66) | 0.714 (0.697,0.731) | 0.715 (0.698,0.732) |
| 12 M | 0.64 (0.62,0.653) | 0.66 (0.64,0.68) | 0.718 (0.705,0.732) | 0.718 (0.704,0.732) |
| | | | | |
| 1 M | 0.56 (0.53,0.59) | 0.57 (0.54,0.60) | 0.748 (0.709,0.787) | 0.747 (0.708,0.786) |
| 2 M | 0.58 (0.55,0.61) | 0.60 (0.57,0.62) | 0.756 (0.727,0.784) | 0.756 (0.728,0.785) |
| 3 M | 0.59 (0.57,0.62) | 0.60 (0.58,0.63) | 0.738 (0.713,0.764) | 0.737 (0.711,0.762) |
| 6 M | 0.61 (0.59,0.64) | 0.63 (0.61,0.65) | 0.718 (0.697,0.740) | 0.718 (0.696,0.739) |
| 12 M | 0.65 (0.63,0.67) | 0.66 (0.64,0.68) | 0.713 (0.694,0.732) | 0.713 (0.694,0.732) |
| | | | | |
| 1 M | 0.58 (0.55,0.60) | 0.61 (0.59,0.63) | 0.749 (0.717,0.782) | 0.750 (0.718,0.782) |
| 2 M | 0.61 (0.59,0.63) | 0.66 (0.64,0.68) | 0.753 (0.729,0.777) | 0.756 (0.733,0.780) |
| 3 M | 0.62 (0.60,0.64) | 0.67 (0.65,0.68) | 0.760 (0.739,0.780) | 0.762 (0.742,0.782) |
| 6 M | 0.64 (0.62,0.66) | 0.68 (0.67,0.70) | 0.748 (0.731,0.764) | 0.749 (0.733,0.765) |
| 12 M | 0.65 (0.63,0.67) | 0.70 (0.68,0.71) | 0.744 (0.731,0.758) | 0.747 (0.733,0.761) |

AUC stands for area under the ROC curve. Feature sets are Elixhauser comorbidities (baselines), features automatically extracted from medical records (MR), and the combination of MR and comorbidities.
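
For readers who want to reproduce numbers of this shape (a point estimate with an interval), the sketch below computes AUC from its rank interpretation and attaches a percentile bootstrap interval. The text here does not state how the reported intervals were obtained, so the bootstrap is an assumed choice for illustration. The function names and the toy labels and scores are hypothetical.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (an assumed method, see lead-in)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data, purely illustrative.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.6]

point = auc(labels, scores)            # 14 of 16 positive/negative pairs are ordered correctly
lower, upper = bootstrap_ci(labels, scores)
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; on cohorts with thousands of analysis units a rank-based implementation would be the more practical choice.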