Bingcao Wu, Wing Chow, Monish Sakthivel, Onkar Kakade, Kartikeya Gupta, Debra Israel, Yen-Wen Chen, Aarti Susan Kuruvilla.
Abstract
INTRODUCTION: Administrative claims data provide an important source for real-world evidence (RWE) generation, but incomplete reporting of variables such as body mass index (BMI) limits the sample sizes available to address certain research questions. The objective of this study was to construct models using machine-learning (ML) algorithms to predict BMI classifications (≥ 30, ≥ 35, and ≥ 40 kg/m2) in administrative healthcare claims databases, and then validate them internally and externally.
Keywords: Administrative healthcare claims databases; BMI classification; Body mass index; Machine learning; Predictive models; Real-world evidence generation
Year: 2021 PMID: 33432543 PMCID: PMC7889527 DOI: 10.1007/s12325-020-01605-6
Source DB: PubMed Journal: Adv Ther ISSN: 0741-238X Impact factor: 3.845
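The three outcome thresholds named in the abstract can be illustrated with a minimal sketch; the function and field names below are illustrative, not taken from the paper:

```python
def bmi_labels(bmi: float) -> dict:
    """Binary targets for one BMI reading at the study's three cut-offs."""
    return {
        "ge_30": int(bmi >= 30),
        "ge_35": int(bmi >= 35),
        "ge_40": int(bmi >= 40),
    }

print(bmi_labels(37.2))  # {'ge_30': 1, 'ge_35': 1, 'ge_40': 0}
```

Each threshold defines its own binary classification problem, which is why the tables below report separate models per cut-off.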
Fig. 1 Methodology flow
Results of the different models across the 3 databases
| Database | Optum EHR | | Optum DOD | | IBM CCAE | |
|---|---|---|---|---|---|---|
| No. of patients | 37,011,188 | | 5,280,836 | | 6,332,087 | |
| No. of index BMI readings | 343,711,980 | | 16,316,746 | | 15,147,663 | |
| No. of rows/columns in training and testing datasets | Model 1 | Model 2 | Model 1 | Model 2 | Model 1 | Model 2 |
| | 6,800,000 | 6,800,000/111 | 3,300,000/123 | 3,300,000/111 | 3,400,000/123 | 3,400,000/111 |
| Patient cases considered out of the patient cohort | 2% | | 20% | | 22% | |
| Oversampling ratio in training data | 50/50, 60/40, 70/30 | | | | | |
| Age group | | | | | | |
| < 21 years | 19% | | 7% | | 27% | |
| 21–30 years | 13% | | 4% | | 11% | |
| 31–45 years | 20% | | 12% | | 22% | |
| 46–60 years | 23% | | 22% | | 30% | |
| > 60 years | 25% | | 55% | | 10% | |
| BMI classification | | | | | | |
| ≥ 30 kg/m2 | 51% | | 40% | | 45% | |
| ≥ 35 kg/m2 | 29% | | 20% | | 27% | |
| ≥ 40 kg/m2 | 16% | | 10% | | 16% | |
| US region | | | | | | |
| South | 24% | | 50% | | 51% | |
| Midwest | 50% | | 21% | | 22% | |
| West | 9% | | 20% | | 10% | |
| Northeast | 13% | | 9% | | 16% | |
| Others | 4% | | 0% | | 1% | |
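The 50/50, 60/40, and 70/30 oversampling ratios in the table refer to the case/control mix of the training data. One common way to reach such a ratio is random oversampling of the minority class; the sketch below is a hypothetical illustration (the paper's exact resampling mechanics are not given here), assuming the ratio is expressed as cases/controls:

```python
import random

def oversample_cases(cases, controls, ratio=(50, 50), seed=0):
    """Duplicate randomly chosen cases until the case/control mix
    approaches the target ratio (e.g. 50/50, 60/40, 70/30)."""
    rng = random.Random(seed)
    target = round(len(controls) * ratio[0] / ratio[1])
    extra = [rng.choice(cases) for _ in range(max(0, target - len(cases)))]
    return cases + extra, list(controls)
```

For example, starting from 10 cases and 90 controls, a 50/50 target yields 90 resampled cases against the original 90 controls.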
Fig. 2 Process flow of machine-learning algorithm implementation for feature engineering
Number of features selected for each BMI classification prediction
| BMI classification | Features | No. of features before selection | Of the 379 selected features | Of the top 100 selected features |
|---|---|---|---|---|
| ≥ 30 kg/m2 | Diagnoses | 244 | 100 | 49 |
| | Medications | 739 | 100 | 34 |
| | Procedures | 283 | 179 | 17 |
| ≥ 35 kg/m2 | Diagnoses | 244 | 109 | 60 |
| | Medications | 739 | 108 | 40 |
| | Procedures | 283 | 144 | – |
| ≥ 40 kg/m2 | Diagnoses | 244 | 112 | 65 |
| | Medications | 739 | 101 | 35 |
| | Procedures | 283 | 139 | – |
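The reduction from 1,266 candidate features (244 diagnoses + 739 medications + 283 procedures) to a shortlist can be thought of as ranking features by an importance score and keeping the top k. A hypothetical sketch, assuming a simple score-and-rank criterion (the study's actual selection criteria are not detailed here):

```python
def top_features(importances, k=100):
    """Keep the k feature names with the highest importance scores.

    importances: mapping of feature name -> importance score.
    """
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```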
Fig. 3 Training and testing datasets
Best algorithm of the models and oversampling ratios across all the iterations of BMI classifications
| BMI classification | Model | Algorithm model trained on | Model output | Oversampling ratio |
|---|---|---|---|---|
| ≥ 30 kg/m2 | Model 1 | Super Learner | 1 if BMI ≥ 30; 0 if BMI < 30 | 50/50 |
| ≥ 30 kg/m2 | Model 2 | Super Learner | 1 if BMI ≥ 30; 0 if BMI < 30 | 50/50 |
| ≥ 35 kg/m2 | Model 1 | Super Learner | 1 if BMI ≥ 35; 0 if BMI < 35 | 60/40 |
| ≥ 35 kg/m2 | Model 2 | Super Learner | 1 if BMI ≥ 35; 0 if BMI < 35 | 60/40 |
| ≥ 40 kg/m2 | Model 1 | Super Learner | 1 if BMI ≥ 40; 0 if BMI < 40 | 60/40 |
| ≥ 40 kg/m2 | Model 2 | Super Learner | 1 if BMI ≥ 40; 0 if BMI < 40 | 60/40 |
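The Super Learner is a stacked ensemble: base learners are fit, their cross-validated predictions are collected, and a meta-learner combines them. The toy sketch below illustrates only the combining step for two base learners, substituting a grid search over convex weights for the meta-learner (an assumption for illustration; the study's actual base-learner library and meta-learner are not specified here):

```python
def super_learner_weight(cv_preds, y):
    """Find the convex weight w on base learner 1 (learner 2 gets 1 - w)
    minimizing squared error against held-out labels.

    cv_preds: list of (pred_learner1, pred_learner2) cross-validated pairs.
    y: the corresponding true labels.
    """
    best_w, best_loss = 0.0, float("inf")
    for i in range(101):  # search w in {0.00, 0.01, ..., 1.00}
        w = i / 100
        loss = sum((w * p1 + (1 - w) * p2 - yi) ** 2
                   for (p1, p2), yi in zip(cv_preds, y))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w
```

Because the weights are chosen on cross-validated predictions, the ensemble is scored on data each base learner did not see during fitting, which is the core idea of the method.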
Fig. 4 Predictive performance results of model 1 trained on the Super Learner algorithm and internally validated on the Optum DOD database. ROC AUC, area under the receiver operating characteristic curve; NPV, negative predictive value
Fig. 5 Predictive performance results of model 2 trained on the Super Learner algorithm and internally validated on the Optum DOD database. ROC AUC, area under the receiver operating characteristic curve; NPV, negative predictive value
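Figures 4 and 5 report ROC AUC and NPV among other metrics. NPV and its companion quantities come directly from the confusion matrix, as in this minimal sketch:

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV for binary predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }
```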
| The sizeable under-reporting of body mass index (BMI) data in administrative healthcare claims databases impedes the comprehensive study of the population with obesity, and improved methodology is needed. |
| To address this need for improved methodology, we have harnessed machine-learning techniques to interpolate BMI variable data. |
| Based on this study, machine-learning algorithms can be applied to administrative healthcare claims data to predict BMI classifications with high validity. |
| This novel approach can be leveraged across multiple therapeutic areas to better understand variations in BMI-related disease risk, treatment outcomes, healthcare resource use, and costs in real-world settings. |
| The strategic machine-learning approach undertaken in this study may also be relatively easily applied to the development of similar predictive models for other under-reported clinical variables in administrative healthcare claims databases. |