Cheuk-Kay Sun, Yun-Xuan Tang, Tzu-Chi Liu, Chi-Jie Lu.
Abstract
This study aimed to identify important predictors of positive mammographic findings from questionnaire-based demographic and obstetric/gynecological parameters using a proposed integrated machine learning (ML) scheme. The scheme combines the benefits of two well-known ML algorithms, least absolute shrinkage and selection operator (Lasso) logistic regression and extreme gradient boosting (XGB), to predict mammographic anomalies in high-risk individuals and to identify significant risk factors. We collected questionnaire data on 18 breast-cancer-related risk factors from women who participated in a national mammographic screening program between January 2017 and December 2020 at a single tertiary referral hospital, and correlated these data with their mammographic findings. The acquired data were retrospectively analyzed using the proposed integrated ML scheme. Based on data from 21,107 valid questionnaires, Lasso logistic regression models built on variable combinations generated by XGB provided more effective prediction. The top five significant predictors of positive mammography results were younger age, breast self-examination, older age at first childbirth, nulliparity, and history of mammography within 2 years, suggesting a need for timely mammographic screening for women with these risk factors.
Keywords: breast cancer; extreme gradient boosting; machine learning; mammography; national mammographic screening program
Year: 2022 PMID: 35955112 PMCID: PMC9368335 DOI: 10.3390/ijerph19159756
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Figure 1. Data preprocessing flow chart.
Demographic and clinical characteristics of participants.
| Characteristics | Metrics |
|---|---|
| Continuous variables | Mean (SD) |
| X1: Age (years) | 55.16 (7.13) |
| X2: Body height (cm) | 157.56 (5.35) |
| X3: Age at menarche (years) | 13.87 (1.61) |
| X4: Body mass index (kg/m²) | 23.65 (3.57) |
| Categorical variables | N (%) |
| X5: Education level | |
| 0: Primary school | 3036 (14%) |
| 1: Lower secondary school | 2668 (13%) |
| 2: Upper secondary school | 7295 (35%) |
| 3: University | 6891 (33%) |
| 4: Postgraduate | 1217 (6%) |
| X6: Major diseases | |
| 0: No | 17,809 (84%) |
| 1: Benign | 2702 (13%) |
| 2: Cancer (other than breast) | 596 (3%) |
| X7: Breast self-examination | |
| 0: Breast self-exam negative | 15,435 (73%) |
| 1: Never breast self-exam | 4540 (22%) |
| 2: Mass or pain or tenderness | 1132 (5%) |
| X8: Mammography within 2 years | |
| 0: No | 8338 (40%) |
| 1: Yes | 12,769 (60%) |
| X9: History of breast surgery | |
| 0: No | 19,345 (91%) |
| 1: Yes | 1762 (9%) |
| X10: Age at first childbirth | |
| 0: Age < 21 | 1635 (8%) |
| 1: 21 ≤ Age < 35 | 15,741 (75%) |
| 2: Age ≥ 35 | 1037 (5%) |
| 3: No childbirth | 2694 (13%) |
| X11: Parity | |
| 0: 0 times | 2694 (13%) |
| 1: 1 time | 3136 (15%) |
| 2: 2 times | 9319 (44%) |
| 3: 3 times | 4688 (22%) |
| 4: ≥4 times | 1270 (6%) |
| X12: Breastfeeding | |
| 0: Nulliparous | 2694 (13%) |
| 1: No | 9554 (45%) |
| 2: Yes | 8859 (42%) |
| X13: Reproductive lifespan | |
| 0: lifespan < 25 | 430 (2%) |
| 1: 25 ≤ lifespan < 30 | 1272 (6%) |
| 2: 30 ≤ lifespan < 35 | 6357 (30%) |
| 3: 35 ≤ lifespan < 40 | 9447 (45%) |
| 4: lifespan ≥ 40 | 3601 (17%) |
| X14: Age at starting hormone replacement therapy | |
| 0: No use | 19,655 (93%) |
| 1: Age ≥ 60 | 50 (<1%) |
| 2: 50 ≤ Age < 60 | 681 (3%) |
| 3: 40 ≤ Age < 50 | 590 (3%) |
| 4: 30 ≤ Age < 40 | 103 (<1%) |
| 5: Age < 30 | 28 (<1%) |
| X15: Duration of hormone replacement therapy | |
| 0: duration = 0 | 19,655 (93%) |
| 1: 0 < duration < 5 | 1027 (5%) |
| 2: duration ≥ 5 | 425 (2%) |
| X16: Age at starting oral contraceptives | |
| 0: No use | 19,903 (94%) |
| 1: Age > 25 | 754 (4%) |
| 2: Age ≤ 25 | 450 (2%) |
| X17: Duration of oral contraceptive use | |
| 0: No use | 19,903 (94%) |
| 1: year ≤ 5 | 929 (4%) |
| 2: year > 5 | 275 (1%) |
| X18: Number of relatives with confirmed breast cancer | |
| 0: 0 | 18,746 (89%) |
| 1: 1 | 2190 (10%) |
| 2: ≥2 | 171 (1%) |
| Mammography result | |
| 0: Negative | 18,089 (86%) |
| 1: Positive | 3018 (14%) |
Figure 2. Heatmap visualization of the correlation matrix between the 18 variables.
Figure 3. Proposed integrated ML scheme.
Comparison of models with and without the balancing technique.
| Data | Method | Accuracy (%), mean (SD) | Sensitivity (%), mean (SD) | Specificity (%), mean (SD) | AUC (%), mean (SD) |
|---|---|---|---|---|---|
| Without balancing | Lasso | 85.70 (0.48) | 0.00 | 100 (0.01) | 63.00 (1.11) |
| Without balancing | XGB | 84.73 (3.17) | 3.08 (8.91) | 98.35 (5.12) | 63.22 (1.04) |
| With undersampling balancing | Lasso | 58.97 (0.94) | 60.97 (2.03) | 58.64 (1.19) | 62.80 (1.05) |
| With undersampling balancing | XGB | 54.30 (16.56) | 63.00 (15.38) | 52.87 (21.83) | 62.32 (1.25) |
| With oversampling balancing | Lasso | 58.23 (0.83) | 60.67 (1.89) | 58.89 (0.99) | 62.66 (1.10) |
| With oversampling balancing | XGB | 25.62 (6.69) | 87.65 (6.72) | 15.26 (8.86) | 59.26 (1.19) |
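The unbalanced rows can be read against the class distribution: with 18,089 negative and 3018 positive results, a model that labels every participant as negative would already reach about 85.70% accuracy with 0% sensitivity, which is consistent with the unbalanced Lasso row. The sketch below reproduces that arithmetic and shows one possible balancing step. The random undersampling shown here is an assumption for illustration, since this extract does not state which undersampling method was used, and the function names are hypothetical.

```python
import random

# Class counts taken from the mammography result rows of the characteristics table.
N_NEGATIVE = 18089
N_POSITIVE = 3018


def confusion_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity as percentages."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
    }


# Hypothetical classifier that predicts "negative" for every participant.
always_negative = confusion_metrics(tp=0, fn=N_POSITIVE, tn=N_NEGATIVE, fp=0)


def random_undersample(labels, seed=0):
    """Return sorted row indices for a balanced subset.

    Assumption: label 1 (positive) is the minority class. All positives are
    kept and an equally sized random sample of negatives is drawn.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    kept_negatives = rng.sample(negatives, len(positives))
    return sorted(positives + kept_negatives)


if __name__ == "__main__":
    print({name: round(value, 2) for name, value in always_negative.items()})
    toy_labels = [1] * 3 + [0] * 17
    print(random_undersample(toy_labels))
```

Balancing trades this trivially high accuracy for sensitivity, which is the pattern seen in the undersampling and oversampling rows.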
Variable importance rankings of Lasso and XGB models.
| Rank | Lasso: variable code | Lasso: variable | XGB: variable code | XGB: variable |
|---|---|---|---|---|
| 1 | X7 | Breast self-examination | X1 | Age |
| 2 | X8 | Mammography within 2 years | X7 | Breast self-examination |
| 3 | X15 | Duration of hormone replacement therapy | X10 | Age at first childbirth |
| 4 | X6 | Major diseases | X11 | Parity |
| 5 | X10 | Age at first childbirth | X8 | Mammography within 2 years |
| 6 | X16 | Age at starting oral contraceptives | X6 | Major diseases |
| 7 | X14 | Age at starting hormone replacement therapy | X5 | Education level |
| 8 | X17 | Duration of oral contraceptive use | X4 | Body mass index |
| 9 | X9 | History of breast surgery | X16 | Age at starting oral contraceptives |
| 10 | X1 | Age | X13 | Reproductive lifespan |
| 11 | X12 | Breastfeeding | X2 | Body height |
| 12 | X18 | Number of relatives with confirmed breast cancer | X14 | Age at starting hormone replacement therapy |
| 13 | X5 | Education level | X12 | Breastfeeding |
| 14 | X11 | Parity | X18 | Number of relatives with confirmed breast cancer |
| 15 | X13 | Reproductive lifespan | X3 | Age at menarche |
| 16 | X3 | Age at menarche | X9 | History of breast surgery |
| 17 | X4 | Body mass index | X17 | Duration of oral contraceptive use |
| 18 | X2 | Body height | X15 | Duration of hormone replacement therapy |
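To make the disagreement between the two rankings concrete, the following sketch compares them using the variable codes from the table above. The helper names are hypothetical and the comparison is only an illustration, not part of the authors' analysis.

```python
# Variable codes in rank order, copied from the ranking table.
LASSO_RANKING = ["X7", "X8", "X15", "X6", "X10", "X16", "X14", "X17", "X9",
                 "X1", "X12", "X18", "X5", "X11", "X13", "X3", "X4", "X2"]
XGB_RANKING = ["X1", "X7", "X10", "X11", "X8", "X6", "X5", "X4", "X16",
               "X13", "X2", "X14", "X12", "X18", "X3", "X9", "X17", "X15"]


def top_k_overlap(ranking_a, ranking_b, k):
    """Variables that appear in the top k of both rankings."""
    return set(ranking_a[:k]) & set(ranking_b[:k])


def rank_differences(ranking_a, ranking_b):
    """Absolute rank difference per variable (1-based ranks)."""
    rank_a = {var: pos for pos, var in enumerate(ranking_a, start=1)}
    rank_b = {var: pos for pos, var in enumerate(ranking_b, start=1)}
    return {var: abs(rank_a[var] - rank_b[var]) for var in rank_a}


if __name__ == "__main__":
    print(sorted(top_k_overlap(LASSO_RANKING, XGB_RANKING, k=5)))
    diffs = rank_differences(LASSO_RANKING, XGB_RANKING)
    most_contested = max(diffs, key=diffs.get)
    print(most_contested, diffs[most_contested])
```

Three variables (X7, X8, X10) are in the top five of both methods, while duration of hormone replacement therapy (X15) shows the largest gap, ranked 3rd by Lasso and 18th by XGB.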
Lasso results with XGB-SRVCs.
| XGB-SRVCs Combinations | Variables | AUC |
|---|---|---|
| C1 | (Age) | 61.19 |
| C2 | (Age) + (Breast self-examination) | 61.98 |
| C3 | (Age) + (Breast self-examination) + (Age at first childbirth) | 62.52 |
| C4 | (Age) + (Breast self-examination) + (Age at first childbirth) + (Parity) | 62.53 |
| C5 | (Age) + (Breast self-examination) + (Age at first childbirth) + (Parity) + (Mammography within 2 years) | 62.72 |
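The combinations C1 to C5 follow a nested pattern: each one adds the next variable from the XGB ranking. A minimal sketch of that construction, together with the AUC gain at each step computed from the values reported above, is given below. The function names are hypothetical, and the model fitting itself (Lasso logistic regression on each combination) is not shown.

```python
# Top five variables in XGB rank order, as listed in the ranking table.
XGB_TOP5 = [
    "Age",
    "Breast self-examination",
    "Age at first childbirth",
    "Parity",
    "Mammography within 2 years",
]

# AUC values copied from the table above.
REPORTED_AUC = {"C1": 61.19, "C2": 61.98, "C3": 62.52, "C4": 62.53, "C5": 62.72}


def nested_combinations(ranked_variables):
    """Build C1..Cn, where Ck holds the k highest-ranked variables."""
    return {
        f"C{k}": ranked_variables[:k]
        for k in range(1, len(ranked_variables) + 1)
    }


def auc_gains(auc_by_combination):
    """AUC change from each combination to the next, rounded to 2 decimals."""
    names = sorted(auc_by_combination, key=lambda name: int(name[1:]))
    return {
        current: round(auc_by_combination[current] - auc_by_combination[previous], 2)
        for previous, current in zip(names, names[1:])
    }


if __name__ == "__main__":
    for name, variables in nested_combinations(XGB_TOP5).items():
        print(name, " + ".join(f"({v})" for v in variables))
    print(auc_gains(REPORTED_AUC))
```

On the reported values, adding parity (C4) changes the AUC by only 0.01, while the full five-variable combination (C5) is 1.53 above age alone (C1).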
Figure 4. Lasso results comparison using XGB-SRVCs and Lasso-SRVCs.
Figure 5. Lasso equation with the selected five important variables.
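The fitted equation in Figure 5 is not reproduced in this extract. As a reminder of its general form only, a Lasso logistic regression on the five selected variables (age X1, breast self-examination X7, mammography within 2 years X8, age at first childbirth X10, parity X11) can be written as below. The coefficients β are placeholders, not the authors' estimates, and treating each categorical variable as a single coded term is an assumption made here for brevity.

$$
\log\frac{p}{1-p} = \eta = \beta_0 + \beta_1 X_1 + \beta_7 X_7 + \beta_8 X_8 + \beta_{10} X_{10} + \beta_{11} X_{11}
$$

$$
\hat{\beta} = \arg\min_{\beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right] + \lambda \sum_{j \ge 1} \lvert \beta_j \rvert \right\}
$$

Here p is the probability of a positive mammography result, y_i is the observed result (0 or 1) for participant i, η_i is that participant's linear predictor, and λ ≥ 0 controls how strongly coefficients are shrunk toward zero.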