Xi Shi1, Gorana Nikolic2, Gorka Epelde3,4, Mónica Arrúe3,4, Joseba Bidaurrazaga Van-Dierdonck5, Roberto Bilbao6, Bart De Moor2.
Abstract
BACKGROUND: The increasing prevalence of childhood obesity makes it essential to study its risk factors in a population-representative sample that covers a broad range of health topics, so that preventive policies and interventions can be improved. This study aims to develop an ensemble feature selection framework for large-scale data that identifies risk factors of childhood obesity with good interpretability and clinical relevance.
Keywords: Childhood obesity; Ensemble learning; Feature selection; Policy decision making; Public health
Year: 2021 PMID: 34289843 PMCID: PMC8293582 DOI: 10.1186/s12911-021-01580-0
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 The selection and exclusion criteria for the participants
Fig. 2 The main procedures of BFSMR. There are four main steps: (1) splitting; (2) mapping, which draws bootstrapped samples from each chunk; (3) reducing, which merges sets with the same Set ID, applies one feature selection classifier to each set, and derives one selected feature list; and (4) merging, which combines the selected feature lists into the final output
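The four steps in the caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data generator, and the `mean_gap_ranking` selector (which ranks features by the gap between class means) are all hypothetical stand-ins, and the final merge simply counts how often each feature is selected.

```python
import random
from collections import Counter, defaultdict

def make_toy_rows(n, seed=0):
    """Hypothetical toy data: feature 0 is informative, features 1-3 are noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = (5.0 * y + rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
        rows.append((x, y))
    return rows

def split(rows, n_chunks):
    """Step 1: split the rows into n_chunks roughly equal chunks."""
    return [rows[i::n_chunks] for i in range(n_chunks)]

def map_bootstrap(chunk, n_sets, rng):
    """Step 2: draw one bootstrapped sample (with replacement) per Set ID."""
    return [(set_id, rng.choices(chunk, k=len(chunk))) for set_id in range(n_sets)]

def mean_gap_ranking(rows):
    """Stand-in selector: rank features by |mean(x | y=1) - mean(x | y=0)|."""
    n_features = len(rows[0][0])

    def class_mean(j, label):
        vals = [x[j] for x, y in rows if y == label]
        return sum(vals) / len(vals) if vals else 0.0

    gaps = [abs(class_mean(j, 1) - class_mean(j, 0)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gaps[j], reverse=True)

def bfsmr_sketch(rows, selectors, n_chunks=4, top_k=2, seed=0):
    rng = random.Random(seed)
    merged = defaultdict(list)
    for chunk in split(rows, n_chunks):
        for set_id, sample in map_bootstrap(chunk, len(selectors), rng):
            merged[set_id].extend(sample)  # Step 3a: merge samples sharing a Set ID
    # Step 3b: one selector per merged set, each yielding one feature list
    lists = {sid: selectors[sid](sample)[:top_k] for sid, sample in merged.items()}
    # Step 4: merge the lists (here: count how many lists contain each feature)
    counts = Counter(f for features in lists.values() for f in features)
    return lists, counts

rows = make_toy_rows(200, seed=1)
lists, counts = bfsmr_sketch(rows, [mean_gap_ranking, mean_gap_ranking])
print(lists, counts)
```

In a real setting each Set ID would be paired with a different selector (for example the five methods compared in the tables), and the merge step would use a voting score instead of a plain count.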
Top 10 features selected from different models based on variable importance
| Rank | Filter (MI) | SVM-RFE | Ridge | Lasso | RandomForest |
|---|---|---|---|---|---|
| 1 | Age | MoDietEducation | Age | Age | SystolicPressure |
| 2 | Sleep_Normal (–) | MoRDType_LowSalt | Sex (–) | Sex (–) | MoDiastolicPressure (–) |
| 3 | BFType_Maternal (–) | RDType_2000 cal | Tobacco_No (–) | Tobacco_No (–) | MoSystolicPressure (–) |
| 4 | DiastolicPressure (–) | AdeDKnowledge | DietEducation | DietEducation | Sex |
| 5 | MoSystolicPressure | MoPE_Inadequate (–) | MoTobacco_Yes | MoDietEducation | Birthyear (–) |
| 6 | MoNumberCigarettes | DietCompliesAdvice | BFType_Maternal (–) | BFType_Maternal (–) | Tobacco_No (–) |
| 7 | Birthheight (–) | MoRDType_Free (–) | PE_Inadequate | Birthyear (–) | MoExerciseAdvice (–) |
| 8 | MoBMI | MoPEHour | MoDiabetes_No (–) | MoNumberCigarettes | MoAlcohol_No (–) |
| 9 | Birthweight (–) | DiastolicPressure (–) | PE_Adequate (–) | PE_Inadequate | PE_Inadequate |
| 10 | MoDiastolicPressure (–) | SystolicPressure | MoDietEducation | DCExecution_No | MoTobacco_Ex |
All “Mothers-” in the variables were replaced with “Mo-” for shorter names
RDType, RecommendedDietType; MoRDType, MoRecommendedDietType; BFType, BreastfeedingType; PE, PhysicalExercise; MoPE, MoPhysicalExercise; MoPEHour, MoPhysicalExerciseHour; AdeDKnowledge, AdequateDietaryKnowledge; DCExecution, DietCorrectExecution
Comparison of predictive performance among different models
| Metric | Filter (MI) | SVM-RFE | Ridge | Lasso | RandomForest |
|---|---|---|---|---|---|
| Accuracy | 0.843 | 0.845 | 0.844 | 0.839 | 0.828 |
| F-score | 0.915 | 0.774 | 0.915 | 0.912 | 0.770 |
Accuracy and F-score were used jointly to evaluate performance. Lasso, Ridge, and the filter method performed relatively better, and Random Forest performed worst
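Why the two metrics are read together can be seen from how they are computed. The sketch below assumes that "F-score" means the F1 score (the harmonic mean of precision and recall for the positive class), which this excerpt does not state explicitly; the labels in the example are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels for illustration only
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))   # 0.6
print(f1_score(y_true, y_pred))
```

Because F1 ignores true negatives, two models with nearly equal accuracy can have quite different F-scores, which is the pattern visible between SVM-RFE and Ridge in the table.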
Comparison of the selected variables with high scores calculated from different voting strategies
| Voting1 | Score | Voting2 | Score | Voting3 | Score |
|---|---|---|---|---|---|
| PE_Inadequate | 4 | Age | 30 | Age | 25 |
| Age | 3 | Sex | 25 | Sex | 19.4 |
| BFType_Maternal | 3 | Tobacco_No | 21 | Tobacco_No | 17 |
| MoDietEducation | 3 | BFType_Maternal | 18 | BFType_Maternal | 14 |
| Sex | 3 | MoDietEducation | 17 | DietEducation | 14 |
| Tobacco_No | 3 | DietEducation | 14 | MoDietEducation | 12 |
| Birthyear | 2 | MoSystolicPressure | 14 | PE_Inadequate | 8.4 |
| DiastolicPressure | 2 | SystolicPressure | 11 | MoTobacco_Yes | 6 |
| DietEducation | 2 | Birthyear | 10 | MoNumberCigarettes | 5.5 |
| MoDiastolicPressure | 2 | MoDiastolicPressure | 10 | Birthyear | 5.2 |
| MoNumberCigarettes | 2 | PE_Inadequate | 10 | MoSystolicPressure | 4.6 |
| MoSystolicPressure | 2 | DiastolicPressure | 9 | DiastolicPressure | 4.5 |
| SystolicPressure | 2 | MoRDType_LowSalt | 9 | MoRDType_LowSalt | 4.5 |
| AdeDKnowledge | 1 | Sleep_Normal | 9 | Sleep_Normal | 4.5 |
| Birthheight | 1 | MoNumberCigarettes | 8 | RDType_2.000cal | 4 |
| Birthweight | 1 | RDType_2.000cal | 8 | AdeDKnowledge | 3.5 |
| DietCompliesAdvice | 1 | AdeDKnowledge | 7 | MoDiabetes_No | 3 |
| DCExecution_No | 1 | MoPE_Inadequate | 6 | MoPE_Inadequate | 3 |
| MoAlcohol_No | 1 | MoTobacco_Yes | 6 | DietCompliesAdvice | 2.5 |
| MoBMI | 1 | DietCompliesAdvice | 5 | SystolicPressure | 2.5 |
| MoDiabetes_No | 1 | Birthheight | 4 | MoDiastolicPressure | 2.3 |
| MoExerciseAdvice | 1 | MoExerciseAdvice | 4 | Birthheight | 2 |
| MoPE_Inadequate | 1 | MoRDType_Free | 4 | MoRDType_Free | 2 |
| MoPEHours | 1 | MoAlcohol_No | 3 | MoBMI | 1.5 |
| MoRDType_LowSalt | 1 | MoBMI | 3 | MoPEHours | 1.5 |
| MoRDType_Free | 1 | MoDiabetes_No | 3 | Birthweight | 1 |
| MoTobacco_Yes | 1 | MoPEHours | 3 | DCExecution_No | 1 |
| MoTobacco_Ex | 1 | MoExerciseAdvice | 4 | MoExerciseAdvice | 0.8 |
| RDType_2.000cal | 1 | DCExecution_No | 1 | MoAlcohol_No | 0.6 |
| Sleep_Normal | 1 | MoTobacco_Ex | 1 | MoTobacco_Ex | 0.2 |
The top variables changed and the importance of some variables gradually grew with the change from Voting1 to Voting3
All “Mothers-” in the variables were replaced with “Mo-” for shorter names
RDType, RecommendedDietType; MoRDType, MoRecommendedDietType; BFType, BreastfeedingType; PE, PhysicalExercise; MoPE, MoPhysicalExercise; MoPEHour, MoPhysicalExerciseHour; AdeDKnowledge, AdequateDietaryKnowledge; DCExecution, DietCorrectExecution
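The three score columns can be reproduced from the top-10 lists with a short sketch. The scoring rules here are inferred from the published numbers and are assumptions, not formulas quoted from the paper: Voting1 counts the lists that contain a feature, Voting2 sums a rank weight of 11 minus the rank, and Voting3 multiplies each rank weight by a per-method weight. The method weights in `METHOD_WEIGHTS` are back-calculated from the Voting3 column and are not stated in this excerpt.

```python
from collections import defaultdict

# Top-10 lists transcribed from the table of selected features (signs dropped).
TOP10 = {
    "Filter (MI)": ["Age", "Sleep_Normal", "BFType_Maternal", "DiastolicPressure",
                    "MoSystolicPressure", "MoNumberCigarettes", "Birthheight",
                    "MoBMI", "Birthweight", "MoDiastolicPressure"],
    "SVM-RFE": ["MoDietEducation", "MoRDType_LowSalt", "RDType_2000cal",
                "AdeDKnowledge", "MoPE_Inadequate", "DietCompliesAdvice",
                "MoRDType_Free", "MoPEHour", "DiastolicPressure", "SystolicPressure"],
    "Ridge": ["Age", "Sex", "Tobacco_No", "DietEducation", "MoTobacco_Yes",
              "BFType_Maternal", "PE_Inadequate", "MoDiabetes_No", "PE_Adequate",
              "MoDietEducation"],
    "Lasso": ["Age", "Sex", "Tobacco_No", "DietEducation", "MoDietEducation",
              "BFType_Maternal", "Birthyear", "MoNumberCigarettes", "PE_Inadequate",
              "DCExecution_No"],
    "RandomForest": ["SystolicPressure", "MoDiastolicPressure", "MoSystolicPressure",
                     "Sex", "Birthyear", "Tobacco_No", "MoExerciseAdvice",
                     "MoAlcohol_No", "PE_Inadequate", "MoTobacco_Ex"],
}

# ASSUMPTION: weights back-calculated from the Voting3 column, not given in the text.
METHOD_WEIGHTS = {"Filter (MI)": 0.5, "SVM-RFE": 0.5, "Ridge": 1.0,
                  "Lasso": 1.0, "RandomForest": 0.2}

def vote(top_lists, method_weights):
    """Return {feature: (voting1, voting2, voting3)} under the assumed rules."""
    scores = defaultdict(lambda: [0, 0, 0.0])
    for method, features in top_lists.items():
        for rank, feature in enumerate(features, start=1):
            rank_weight = len(features) + 1 - rank  # rank 1 -> 10, rank 10 -> 1
            entry = scores[feature]
            entry[0] += 1                                      # Voting1: occurrence count
            entry[1] += rank_weight                            # Voting2: rank-weighted
            entry[2] += rank_weight * method_weights[method]   # Voting3: rank x method weight
    return {feature: tuple(entry) for feature, entry in scores.items()}

scores = vote(TOP10, METHOD_WEIGHTS)
print(scores["Age"])  # (3, 30, 25.0)
```

Under these assumptions the sketch matches the table for Age, Sex, Tobacco_No, and the other features checked. The table's PE_Inadequate scores (4, 10, 8.4) match only if the Ridge entry PE_Adequate (–) is counted under PE_Inadequate; the sketch keeps the two names separate.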
Fig. 3 Visualization of the comparison of selected variables with high scores calculated from the 3 voting strategies. The plot was drawn using each feature's score as a percentage of the total score of the whole set; a larger percentage means higher variable importance. Age, Sex, Tobacco_No, DietEducation, and BreastfeedingType_Maternal gradually gained importance from Voting1 to Voting3, while the importance of MothersDiastolicPressure and PhysicalExercise_Inadequate dropped. MothersDietEducation was more stable and took almost the same share of voting scores in all strategies
| Notation | Meaning |
|---|---|
| Feature selection classifier where | |
| The Classifier ID where | |
| Random sample set with Set ID | |
| Feature weights based on the ranking from each classifier where | |
| Method weights based on the model performance where | |
| Feature lists derived from | |
| Feature space with unique features from | |
| Voting score for each unique feature where |