Lucas J Adams, Ghalib Bello, Gerard G Dumancas.
Abstract
The problem of selecting important variables for predictive modeling of a specific outcome of interest using questionnaire data has rarely been addressed in clinical settings. In this study, we implemented a genetic algorithm (GA) technique to select optimal variables from questionnaire data for predicting five-year mortality. We examined 123 questions (variables) answered by 5,444 individuals in the National Health and Nutrition Examination Survey. The GA iterations selected the top 24 variables, including questions related to stroke, emphysema, and general health problems requiring the use of special equipment, for use in predictive modeling by various parametric and nonparametric machine learning techniques. Using these top 24 variables, gradient boosting yielded the nominally highest performance (area under curve [AUC] = 0.7654), although other techniques had lower but not significantly different AUCs. This study shows how GA, in conjunction with various machine learning techniques, could be used to examine questionnaire data to predict a binary outcome.
Keywords: NHANES; genetic algorithm; machine learning; questionnaire
Year: 2015 PMID: 26604716 PMCID: PMC4639510 DOI: 10.4137/BBI.S29469
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
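
The GA-based variable selection described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the function `run_ga`, the toy fitness function, the `INFORMATIVE` indices, and all GA settings (population size, rates, elitism, tournament selection) are hypothetical and are not taken from the study. In a real analysis the fitness of a variable subset would typically be a model performance measure rather than this toy score.

```python
import random

# Hypothetical indices of "truly useful" variables, for illustration only.
INFORMATIVE = {2, 5, 11}

def toy_fitness(mask):
    """Reward selecting informative variables, lightly penalize subset size."""
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

def run_ga(fitness, n_vars, pop_size=30, generations=40,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve 0/1 masks over n_vars variables.

    Returns the best mask found and the best fitness seen per generation.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(scored[0]))
        next_pop = [scored[0][:]]  # elitism: keep the current best unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_vars)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: toggle each variable with small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best_mask, best_history = run_ga(toy_fitness, n_vars=20)
selected = [i for i, bit in enumerate(best_mask) if bit]
print("selected variable indices:", selected)
```

Because the best mask is carried over unchanged each generation, the best fitness never decreases across generations. Repeating such a run several times and counting how often each variable is selected gives a selection frequency of the kind reported in the tables that follow.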
Demographics of questionnaire respondents with available five-year mortality data.
| ≥ | ||||||||
| Individuals | 568 | 871 | 824 | 763 | 585 | 817 | 594 | 422 |
| Individuals | 2329 | 1035 | 1550 | 337 | 193 |
Note: Non-Hispanic.
Fine-tuning parameters for ANN, gradient boosting, SVM, PLS-DA, elastic net, and random forests to achieve the highest AUC values in the test set (f = frequency).
| TECHNIQUE | TOP 24 QUESTIONS | TOP 13 QUESTIONS | TOP 9 QUESTIONS | TOP 5 QUESTIONS | TOP 3 QUESTIONS | |
|---|---|---|---|---|---|---|
| 3 | 30 | 50 | 3 | 3 | ||
| 0.04 | 0.04 | 0.04 | 0.001 | 0.04 | ||
| 5000 | 10000 | 5000 | 1000 | 10000 | ||
| 4 | 1 | 1 | 3 | 2 | ||
| 10⁻⁶ | 10⁻⁵ | 10 | 10⁻⁴ | 2 | ||
| 1 | 1 | 1 | 1 | 0.01 | ||
| 8 | – | – | – | – | ||
| 0.01 | 0.1 | 0.1 | 0.2 | 0.1 | ||
| 50 | 50 | 50 | 10 | 10 | ||
| 10 | 10 | 10 | 10 | 10 |
Breakdown of variables used in the analyses.
| VARIABLE | NUMBER |
|---|---|
| Initial | 2058 |
| >30% missing | 1929 |
| Zero variance | 1 |
| Perfectly collinear | 5 |
| Final | 123 |
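
The counts in this breakdown are internally consistent, which a short check makes explicit. The variable names below are illustrative; only the numbers come from the table.

```python
# Counts taken from the table above; dictionary keys are descriptive labels.
initial = 2058
removed = {
    ">30% missing": 1929,
    "zero variance": 1,
    "perfectly collinear": 5,
}

# Subtract every exclusion from the initial count.
final = initial - sum(removed.values())
print(final)  # 123
```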
List of the top 24 questions selected by GA in the training set.
| QUESTION NO. | CONTENT |
|---|---|
| 59 | Has a doctor or other health professional ever told {you/SP} that {you/s/he} … had a stroke? |
| 60 | Has a doctor or other health professional ever told {you/SP} that {you/s/he} … had emphysema? |
| 90 | {Do you/does SP} now have any health problem that requires {you/him/her} to use special equipment, such as a cane, a wheelchair, a special bed, or a special telephone? |
| 72 | Including living and deceased, were any of {SP’s/your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had … osteoporosis or brittle bones? |
| 27 | {Were you/Was SP} ever told that {you/s/he/SP} had active tuberculosis or TB? |
| 24 | {Have you/has SP} ever received the hepatitis A vaccine series? This is a two dose vaccine that is given to people who travel outside the United States. It has only been available since 1995. |
| 55 | Has a doctor or other health professional ever told {you/SP} that {you/s/he} … had congestive heart failure? |
| 69 | Including living and deceased, were any of {SP’s/your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had … Alzheimer’s disease? |
| 76 | Has a doctor ever told {you/SP} that {you/s/he} had broken or fractured {your/his/her} … wrist? |
| 42 | Up to the present time, what is the most {you have/SP has} ever weighed? |
| 99 | {Have you/has SP} ever been told by a doctor or other health professional that {you/s/he} had weak or failing kidneys? Do not include kidney stones, bladder infections, or incontinence. |
| 108 | Did {you/SP} have flu, pneumonia, or ear infections that started during those 30 days? |
| 118 | {Are you/Is SP} covered by any single service plan? |
| 6 | Now I’m going to ask a few questions about milk products. Do not include their use in cooking. In the past 30 days, how often did {you/SP} have milk to drink or on {your/his/her} cereal? Please include chocolate and other flavored milks as well as hot cocoa made with milk. Do not count small amounts of milk added to coffee or tea. Would you say … |
| 47 | The next questions are about the food eaten by {you/you and your household}. {When answering these questions, think about all the people who eat here, even if they are not related to you.} Which of these statements best describes the food eaten {by you/in your household} in the last 12 months, that is since {DISPLAY CURRENT MONTH} of last year. 1. {I/We} always have enough to eat and the kinds of food {I/We} want; 2. {I/We} have enough to eat but not always the kinds of food {I/We} want; 3. Sometimes or often {I/We} don’t have enough to eat. |
| 56 | Has a doctor or other health professional ever told {you/SP} that {you/s/he} … had coronary heart disease? |
| 63 | Has a doctor or other health professional ever told {you/SP} that {you/s/he} … was overweight? |
| 71 | Including living and deceased, were any of {SP’s/your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had … arthritis? |
| 82 | The next questions are about alcoholic beverages. When answering think about {your/SP’s} use over the past 30 days. How often did {you/SP} drink beer or lite beer? |
| 88 | [During the past 3 months], did {you/SP} have low back pain? |
| 104 | {Have you/has SP} used snuff, such as SKOAL, SKOAL BANDIT, or COPENHAGEN at least 20 times in {your/his/her} entire life? |
| 107 | Did {you/SP} have a stomach or intestinal illness with vomiting or diarrhea that started during those 30 days? |
| 109 | During the past 12 months, that is, since (DISPLAY CURRENT MONTH, DISPLAY LAST YEAR), a year ago, (have you/has SP) donated blood? |
| 112 | How much did {you/SP} weigh at age 25? [If you don’t know {your/his/her} exact weight, please make your best guess.] |
Selection frequency of questions in 10 GA iterations.
| QUESTION NO. | TRIAL 1 | TRIAL 2 | TRIAL 3 | TRIAL 4 | TRIAL 5 | TRIAL 6 | TRIAL 7 | TRIAL 8 | TRIAL 9 | TRIAL 10 | FREQUENCY (f) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | x | x | x | x | x | x | x | x | x | x | 1.00 |
| 60 | x | x | x | x | x | x | x | x | x | x | 1.00 |
| 90 | x | x | x | x | x | x | x | x | x | x | 1.00 |
| 72 | x | x | x | x | x | 0.50 | |||||
| 27 | x | x | x | x | 0.40 | ||||||
| 24 | x | x | x | 0.30 | |||||||
| 55 | x | x | x | 0.30 | |||||||
| 69 | x | x | x | 0.30 | |||||||
| 76 | x | x | x | 0.30 | |||||||
| 42 | x | x | 0.20 | ||||||||
| 99 | x | x | 0.20 | ||||||||
| 108 | x | x | 0.20 | ||||||||
| 118 | x | x | 0.20 | ||||||||
| 6 | x | 0.10 | |||||||||
| 47 | x | 0.10 | |||||||||
| 56 | x | 0.10 | |||||||||
| 63 | x | 0.10 | |||||||||
| 71 | x | 0.10 | |||||||||
| 82 | x | 0.10 | |||||||||
| 88 | x | 0.10 | |||||||||
| 104 | x | 0.10 | |||||||||
| 107 | x | 0.10 | |||||||||
| 109 | x | 0.10 | |||||||||
| 112 | x | 0.10 |
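
The frequency column is the number of GA trials in which a question was selected divided by 10, and thresholding it reproduces the subset sizes (3, 5, 9, 13, 24) used in the AUC comparison. The sketch below assumes the per-question counts implied by the f column (f × 10). The helper name `subset_at_least` is hypothetical, and the mapping of thresholds to subsets is an inference from the subset sizes rather than something stated in the record.

```python
N_TRIALS = 10

# Number of trials (out of 10) in which each question was selected,
# inferred from the FREQUENCY (f) column above.
selection_counts = {
    59: 10, 60: 10, 90: 10,
    72: 5, 27: 4,
    24: 3, 55: 3, 69: 3, 76: 3,
    42: 2, 99: 2, 108: 2, 118: 2,
    6: 1, 47: 1, 56: 1, 63: 1, 71: 1, 82: 1,
    88: 1, 104: 1, 107: 1, 109: 1, 112: 1,
}

def subset_at_least(counts, min_count):
    """Questions selected in at least min_count trials."""
    return [q for q, c in counts.items() if c >= min_count]

# Subset size at each selection-count threshold.
subset_sizes = {k: len(subset_at_least(selection_counts, k)) for k in (10, 4, 3, 2, 1)}
for k, size in subset_sizes.items():
    print(f"f >= {k / N_TRIALS:.2f}: {size} questions")
```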
AUC values of different algorithms using the top 24 questions generated by GA and selected subsets based on selection frequency (f) by GA. PLS-DA utilized the top 14 questions after removing variables with zero or near zero variance (Table 8).
| TECHNIQUE | AUC (TOP 24) | AUC (TOP 13) | AUC (TOP 9) | AUC (TOP 5) | AUC (TOP 3) |
|---|---|---|---|---|---|
| Gradient boosting | 0.6981 | 0.6659 | 0.6629 | ||
| ANN | 0.7522 | 0.7157 | 0.6629 | ||
| Elastic net | 0.7436 | 0.7216 | 0.7008 | 0.6629 | 0.6629 |
| SVM | 0.7417 | 0.7102 | 0.675 | 0.6637 | 0.6611 |
| Ridge regression | 0.7414 | 0.7169 | 0.6889 | 0.6595 | 0.6629 |
| Logistic regression | 0.7405 | 0.7168 | 0.6985 | 0.6597 | 0.6629 |
| Random forest | 0.7258 | 0.6969 | 0.6191 | 0.5912 | 0.5712 |
| LASSO | 0.7135 | 0.7009 | 0.6882 | 0.6628 | 0.6628 |
| PLS-DA | 0.6756 | – | – | – | – |
| Classification trees | 0.6657 | 0.6657 | 0.647 | 0.647 | 0.647 |
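
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive case (here, a death within five years) receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison on made-up scores; the function name `auc` and the data are illustrative and unrelated to the study's models.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))  # 0.75
```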
DeLong’s test comparing AUCs to that of the top performing technique (gradient boosting, AUC = 0.7654) using the top 24 questions.
| TECHNIQUE | AUC (TOP 24 QUESTIONS) | P-VALUE |
|---|---|---|
| ANN | 0.7522 | 0.4379 |
| Elastic net | 0.7436 | 0.2344 |
| SVM | 0.7417 | 0.3424 |
| Ridge regression | 0.7414 | 0.1973 |
| Logistic regression | 0.7405 | 0.1968 |
| Random forest | 0.7258 | 3.188 × 10⁻² |
| LASSO | 0.7135 | 1.988 × 10⁻³ |
| PLS-DA | 0.6756 | 7.337 × 10⁻⁴ |
| Classification trees | 0.6657 | 8.654 × 10⁻⁷ |
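
DeLong's test compares two correlated AUCs estimated on the same cases. As a rough illustration of the idea, and not a reproduction of the software used in the study, the sketch below implements the standard two-model form in plain Python on made-up scores. The function names and the toy data are assumptions. The code does not handle the degenerate case where the estimated variance is zero (for example, two identical score vectors).

```python
import math

def _psi(x, y):
    """Pairwise kernel: 1 if positive outranks negative, 0.5 on ties."""
    return 1.0 if x > y else 0.5 if x == y else 0.0

def _components(labels, scores):
    """Return the AUC and the per-case structural components V10 and V01."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]
    return sum(v10) / len(v10), v10, v01

def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def delong_test(labels, scores_a, scores_b):
    """Two-sided DeLong test for the difference between two correlated AUCs."""
    auc_a, v10_a, v01_a = _components(labels, scores_a)
    auc_b, v10_b, v01_b = _components(labels, scores_b)
    m, n = len(v10_a), len(v01_a)  # numbers of positives and negatives
    var = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
           + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n)
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value

# Made-up example: two models scored on the same six cases.
toy_labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
model_b = [0.6, 0.4, 0.5, 0.7, 0.3, 0.2]
auc_a, auc_b, z, p_value = delong_test(toy_labels, model_a, model_b)
print(f"AUC A = {auc_a:.4f}, AUC B = {auc_b:.4f}, z = {z:.4f}, p = {p_value:.4f}")
```

A p-value above the chosen significance level, as for ANN through logistic regression in the table above, means the AUC difference from the reference model cannot be distinguished from chance on that test set.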
Fourteen questions used in PLS-DA after removing those with zero or near zero variance from the initial 24 variables (AUC = 0.6756).
| QUESTION NO. | CONTENT |
|---|---|
| 90 | {Do you/Does SP} now have any health problem that requires {you/him/her} to use special equipment, such as a cane, a wheelchair, a special bed, or a special telephone? |
| 72 | Including living and deceased, were any of {SP’s/your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had … osteoporosis or brittle bones? |
| 24 | {Have you/Has SP} ever received the hepatitis A vaccine series? This is a two dose vaccine that is given to people who travel outside the United States. It has only been available since 1995. |
| 69 | Including living and deceased, were any of {SP’s/your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had … Alzheimer’s disease? |
| 76 | Has a doctor ever told {you/SP} that {you/s/he} had broken or fractured {your/his/her} … wrist? |
| 42 | Up to the present time, what is the most {you have/SP has} ever weighed? |
| 6 | Now I’m going to ask a few questions about milk products. Do not include their use in cooking. In the past 30 days, how often did {you/SP} have milk to drink or on {your/his/her} cereal? Please include chocolate and other flavored milks as well as hot cocoa made with milk. Do not count small amounts of milk added to coffee or tea. Would you say … |
| 47 | The next questions are about the food eaten by {you/you and your household}. {When answering these questions, think about all the people who eat here, even if they are not related to you.} Which of these statements best describes the food eaten {by you/in your household} in the last 12 months, that is since {DISPLAY CURRENT MONTH} of last year. 1. {I/We} always have enough to eat and the kinds of food {I/We} want; 2. {I/We} have enough to eat but not always the kinds of food {I/We} want; 3. Sometimes or often {I/We} don’t have enough to eat. |
| 63 | Has a doctor or other health professional ever told {you/SP} that {you/s/he} … was overweight? |
| 71 | Including living and deceased, were any of {SP’s/your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had … arthritis? |
| 82 | The next questions are about alcoholic beverages. When answering think about {your/SP’s} use over the past 30 days. How often did {you/SP} drink beer or lite beer? |
| 88 | [During the past 3 months], did {you/SP} have low back pain? |
| 107 | Did {you/SP} have a stomach or intestinal illness with vomiting or diarrhea that started during those 30 days? |
| 112 | How much did {you/SP} weigh at age 25? [If you don’t know {your/his/her} exact weight, please make your best guess.] |