Meng-Hsuen Hsieh, Li-Min Sun, Cheng-Li Lin, Meng-Ju Hsieh, Chung Y Hsu, Chia-Hung Kao.
Abstract
Objective: Early reports indicate that individuals with type 2 diabetes mellitus (T2DM) may have a greater incidence of breast malignancy than patients without T2DM. The aim of this study was to investigate the effectiveness of three different models for predicting the risk of breast cancer in patients with T2DM of different characteristics.

Study design and methodology: From 2000 to 2012, data on 636,111 newly diagnosed female T2DM patients were available in Taiwan's National Health Insurance Research Database. Using these data, we created a risk prediction model for breast cancer in patients with T2DM. We also collected data on potential predictors of breast cancer so that their effects could be adjusted for in the analysis. The Synthetic Minority Oversampling Technique (SMOTE) was used to increase the amount of data for the under-represented group. Records were randomly assigned to the training and test sets at a ratio of about 39:1. Logistic regression (LR), artificial neural network (ANN), and random forest (RF) models were evaluated using recall, accuracy, F1 score, and area under the receiver operating characteristic curve (AUC).
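
The oversampling and splitting steps described in the abstract can be illustrated with a minimal sketch. It assumes the usual SMOTE recipe (interpolating between a minority sample and one of its nearest minority neighbours) and a simple shuffled 39:1 split; the function names, the toy records, and the choice of `k` are hypothetical, and the abstract does not say whether oversampling was applied before or after the split.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def split_39_to_1(records, seed=0):
    """Shuffle and split so that roughly 1 record in 40 goes to the test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) / 40)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: hypothetical 2-feature records, not values from the study.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6, k=2)
train, test = split_39_to_1(list(range(400)))
print(len(new_points), len(train), len(test))  # prints: 6 390 10
```

A 39:1 ratio corresponds to holding out about 2.5% of the records for testing.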
Keywords: artificial neural network; breast cancer; logistic regression; random forest; type II diabetes mellitus
Year: 2019 PMID: 31717292 PMCID: PMC6895886 DOI: 10.3390/cancers11111751
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.639
Baseline characteristics of T2DM patients with and without breast cancer.
| Variable | No breast cancer, n | (%) | Breast cancer, n | (%) | p-value |
|---|---|---|---|---|---|
| Age group (year) | | | | | <0.001 |
| ≤49 | 171,724 | 27.3 | 1943 | 26.5 | |
| 50–64 | 251,750 | 40.0 | 3716 | 50.6 | |
| 65+ | 205,291 | 32.7 | 1687 | 23.0 | |
| Mean (SD) (year) * | 58.4 | 14.2 | 56.9 | 10.7 | |
| Urbanization level # | | | | | <0.001 |
| 1 (highest) | 183,283 | 29.2 | 2589 | 35.2 | |
| 2 | 185,090 | 29.4 | 2272 | 30.9 | |
| 3 | 100,217 | 15.9 | 1049 | 14.3 | |
| 4 (lowest) | 160,175 | 25.5 | 1436 | 19.6 | |
| Occupation | | | | | <0.001 |
| White collar | 281,372 | 44.8 | 3632 | 49.4 | |
| Blue collar | 294,699 | 46.9 | 3127 | 42.6 | |
| Others ‡ | 52,694 | 8.38 | 587 | 7.99 | |
| Underlying disease | |||||
| Hypertension | 470,048 | 74.8 | 5236 | 71.3 | <0.001 |
| Hyperlipidemia | 435,254 | 69.2 | 5046 | 69.7 | 0.33 |
| Stroke | 88,246 | 14.0 | 606 | 8.25 | <0.001 |
| Congestive heart failure | 95,160 | 15.1 | 645 | 8.78 | <0.001 |
| Benign breast condition | 111,647 | 17.8 | 4899 | 66.7 | <0.001 |
| Obesity | 42,712 | 6.79 | 479 | 6.52 | 0.36 |
| COPD | 164,128 | 26.1 | 1619 | 22.0 | <0.001 |
| CAD | 250,789 | 39.9 | 2574 | 35.0 | <0.001 |
| Asthma | 138,917 | 22.1 | 1256 | 17.1 | <0.001 |
| Stop-smoking clinic | 6107 | 0.97 | 28 | 0.38 | <0.001 |
| Alcohol-related illness | 26,210 | 4.17 | 216 | 2.94 | <0.001 |
| CKD | 188,584 | 30.0 | 1632 | 22.2 | <0.001 |
| Diabetes complication (components of the aDCSI) | |||||
| Retinopathy | 127,829 | 20.3 | 1123 | 15.3 | <0.001 |
| Nephropathy | 222,113 | 35.3 | 1925 | 26.2 | <0.001 |
| Neuropathy | 212,414 | 33.8 | 2025 | 27.6 | <0.001 |
| Cerebrovascular | 168,028 | 26.7 | 1257 | 17.1 | <0.001 |
| Cardiovascular | 383,242 | 61.0 | 3906 | 53.2 | <0.001 |
| Peripheral vascular disease | 179,865 | 28.6 | 1419 | 19.3 | <0.001 |
| Metabolic | 25,411 | 4.04 | 149 | 2.03 | <0.001 |
| Mean aDCSI score (SD) | |||||
| Onset | 1.62 | 1.68 | 1.29 | 1.46 | <0.001 |
| End of follow-up | 3.12 | 2.33 | 2.27 | 1.96 | <0.001 |
| Medications | |||||
| Statin | 349,906 | 55.7 | 3465 | 47.2 | <0.001 |
| Aspirin | 30,561 | 4.86 | 176 | 2.40 | <0.001 |
| Estrogen | 274,204 | 43.6 | 3416 | 46.5 | <0.001 |
| Insulin | 191,580 | 30.5 | 1181 | 16.1 | <0.001 |
| Sulfonylureas | 340,489 | 54.2 | 3698 | 50.3 | <0.001 |
| Metformin | 389,319 | 61.9 | 3897 | 53.1 | <0.001 |
| TZD | 101,370 | 16.1 | 815 | 11.1 | <0.001 |
| Other antidiabetic drugs | 167,166 | 26.6 | 1414 | 19.3 | <0.001 |
# Urbanization level was divided into four categories according to the population of the residential areas; level 1 = “most urbanized” to level 4 = “least urbanized”. ‡ Other occupations, e.g., “retired”, “unemployed”, or “low income populations”. aDCSI, adapted Diabetes Complication Severity Index. Subjects with and without breast cancer were compared using the chi-square test, or the t-test where marked *.
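
As a rough check on how the p-values in this table arise, the sketch below computes a Pearson chi-square test for a 2×2 table using only the standard library. It assumes no continuity correction was applied, which the source does not state; the helper names are hypothetical. The group sizes are the sums of the age-group counts in the table.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


# Group sizes implied by the age-group rows of the table.
N_NO_BC = 171_724 + 251_750 + 205_291   # 628,765 without breast cancer
N_BC = 1943 + 3716 + 1687               # 7,346 with breast cancer


def compare(with_condition_no_bc, with_condition_bc):
    """Chi-square test for one comorbidity row (counts with the condition)."""
    return chi_square_2x2(
        with_condition_bc, N_BC - with_condition_bc,
        with_condition_no_bc, N_NO_BC - with_condition_no_bc,
    )


obesity_stat, obesity_p = compare(42_712, 479)
stroke_stat, stroke_p = compare(88_246, 606)
print(f"obesity p = {obesity_p:.2f}")
print(f"stroke p < 0.001: {stroke_p < 0.001}")
```

Under these assumptions the obesity row gives a p-value of roughly 0.36, in line with the tabulated value, and the stroke row falls well below 0.001.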
Metrics of the ANN, LR, and RF models.
| Dataset | Model | F1 | Precision | Recall | AUROC | AUROC SE | AUROC 95% CI |
|---|---|---|---|---|---|---|---|
| All | ANN | 0.789 | 0.791 | 0.790 | 0.865 | <0.001 | 0.864–0.866 |
| | LR | 0.763 | 0.765 | 0.763 | 0.834 | <0.001 | 0.833–0.835 |
| | RF | 0.892 | 0.892 | 0.892 | 0.959 | <0.001 | 0.959–0.960 |
| Train | ANN | 0.789 | 0.791 | 0.790 | 0.865 | <0.001 | 0.864–0.866 |
| | LR | 0.763 | 0.765 | 0.763 | 0.834 | <0.001 | 0.833–0.835 |
| | RF | 0.892 | 0.892 | 0.892 | 0.960 | <0.001 | 0.959–0.960 |
| Test | ANN | 0.789 | 0.790 | 0.789 | 0.864 | 0.002 | 0.860–0.868 |
| | LR | 0.758 | 0.761 | 0.758 | 0.829 | 0.002 | 0.824–0.833 |
| | RF | 0.890 | 0.890 | 0.890 | 0.955 | 0.003 | 0.948–0.961 |
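
The columns of this table are related in simple ways, which the following sketch illustrates. It assumes F1 is the harmonic mean of precision and recall and that the 95% CI is a normal-approximation interval (AUROC ± 1.96 × SE); the source states neither the averaging scheme nor the interval method, so both are assumptions, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def auroc_ci(auroc, se, z=1.96):
    """Normal-approximation (Wald) 95% interval, clipped to [0, 1]."""
    return max(0.0, auroc - z * se), min(1.0, auroc + z * se)


# Test-set ANN row: precision 0.790, recall 0.789, AUROC 0.864, SE 0.002.
f1 = f1_score(0.790, 0.789)
low, high = auroc_ci(0.864, 0.002)
print(f"F1 = {f1:.3f}, 95% CI = {low:.3f}-{high:.3f}")
```

For the test-set ANN row this reproduces the tabulated interval of 0.860–0.868. The LR and RF test rows agree only to within rounding of the reported SE, which is expected when the SE is given to three decimals.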
The k-fold cross-validation accuracy (k = 10) of all three prediction models.
| Model | ANN | LR | RF |
|---|---|---|---|
| Accuracy | 0.786 | 0.881 | 0.763 |
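
The 10-fold procedure behind these accuracies can be sketched as follows. This is a generic illustration, not the authors' pipeline: the threshold classifier is a hypothetical stand-in for the ANN, LR, and RF models, and the data are toy values.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle range(n) and deal it into k folds whose sizes differ by at most 1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(xs, ys, fit, predict, k=10, seed=0):
    """Mean held-out accuracy over k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


# Hypothetical stand-in classifier: threshold halfway between the class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return int(x > threshold)


# Toy data: one separable feature, not values from the study.
xs = [i / 100 for i in range(100)]
ys = [int(x >= 0.5) for x in xs]
accuracy = cross_val_accuracy(xs, ys, fit_threshold, predict_threshold)
print(round(accuracy, 3))
```

Each record is held out exactly once, so the reported figure is an average over ten models, each trained on about 90% of the data.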
Figure 1. The receiver operating characteristic curves of the artificial neural network (ANN), logistic regression (LR), and random forest (RF) models in predicting breast cancer.
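
For readers unfamiliar with how such a curve and its area are obtained, here is a minimal sketch that builds ROC points by sweeping the decision threshold and integrates them with the trapezoidal rule. The labels and scores are hypothetical toy values, and the function names are illustrative rather than taken from the study.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold
    through every distinct score, starting from (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: 1 = case, 0 = non-case; scores are predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(labels, scores)
print(curve)
print(auc(curve))  # prints: 0.75
```

An area of 0.75 here means that three of the four case/non-case pairs are ranked correctly by the scores.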