Gigi F Stark, Gregory R Hart, Bradley J Nartowt, Jun Deng.
Abstract
Among women, breast cancer is a leading cause of death. Breast cancer risk predictions can inform screening and preventative actions. Previous works found that adding inputs to the widely used Gail model improved its ability to predict breast cancer risk. However, these models used simple statistical architectures and the additional inputs were derived from costly and/or invasive procedures. By contrast, we developed machine learning models that used highly accessible personal health data to predict five-year breast cancer risk. We created machine learning models using only the Gail model inputs and models using both Gail model inputs and additional personal health data relevant to breast cancer risk. For both sets of inputs, six machine learning models were trained and evaluated on the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial data set. The area under the receiver operating characteristic curve metric quantified each model's performance. Since this data set has a small percentage of positive breast cancer cases, we also reported sensitivity, specificity, and precision. We used DeLong tests (p < 0.05) to compare the testing data set performance of each machine learning model to that of the Breast Cancer Risk Prediction Tool (BCRAT), an implementation of the Gail model. None of the machine learning models with only BCRAT inputs were significantly stronger than the BCRAT. However, the logistic regression, linear discriminant analysis, and neural network models with the broader set of inputs were all significantly stronger than the BCRAT. These results suggest that relative to the BCRAT, additional easy-to-obtain personal health inputs can improve five-year breast cancer risk prediction.
Our models could be used as non-invasive and cost-effective risk stratification tools to increase early breast cancer detection and prevention, motivating both immediate actions like screening and long-term preventative measures such as hormone replacement therapy and chemoprevention.
Year: 2019 PMID: 31881042 PMCID: PMC6934281 DOI: 10.1371/journal.pone.0226765
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Input variables.
| Input | Breast Cancer | Non-Breast Cancer |
|---|---|---|
| Age | 62.6 (± 5.3) | 62.5 (± 5.4) |
| Age at Menarche | 12.6 (± 1.5) | 12.7 (± 1.5) |
| Age at First Live Birth | 23.7 (± 4.6) | 23.0 (± 4.3) |
| Number of First-Degree Relatives Who Have Had Breast Cancer | 0.2 (± 0.5) | 0.2 (± 0.4) |
| Race / Ethnicity | | |
| White | 94.8% | 92.8% |
| Black | 3.6% | 5.6% |
| Hispanic (Born in US) | 1.4% | 1.3% |
| Hispanic (Born Abroad) | 0.2% | 0.3% |
| Age at Menopause | 48.6 (± 5.0) | 48.0 (± 5.1) |
| Current Hormone Usage | 58.8% | 50.6% |
| Years of Hormone Usage | 4.7 (± 4.1) | 4.1 (± 4.1) |
| BMI | 27.1 (± 5.3) | 27.3 (± 5.6) |
| Pack Years of Cigarettes Smoked | 14.6 (± 23.9) | 13.3 (± 22.4) |
| Birth Control Usage | 2.9 (± 3.6) | 2.7 (± 3.6) |
| Number of Live Births | 2.8 (± 1.4) | 2.9 (± 1.5) |
| Personal History of Prior Cancer | 4.2% | 3.4% |
We show the means and standard deviations for the continuous variables and the percentages for the indicators.
* Among parous women.
Statistics for machine learning models with only BCRAT inputs and for the BCRAT.
| Model | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | Precision (95% CI) |
|---|---|---|---|---|
| LR | 0.561 (0.525-0.596) | 0.599 (0.540-0.657) | 0.509 (0.501-0.518) | 0.0258 (0.0234-0.0284) |
| NB | 0.559 (0.524-0.595) | 0.602 (0.544-0.661) | 0.504 (0.496-0.513) | 0.0257 (0.0233-0.0282) |
| DT | 0.510 (0.474-0.547) | 0.387 (0.328-0.445) | 0.616 (0.607-0.624) | 0.0213 (0.0184-0.0248) |
| LDA | 0.561 (0.525-0.596) | 0.587 (0.529-0.646) | 0.512 (0.503-0.521) | 0.0254 (0.0230-0.0281) |
| SVM | 0.447 (0.411-0.483) | 0.993 (0.982-1.00) | 0.00718 (0.00571-0.00865) | 0.0212 (0.0210-0.0214) |
| NN | 0.567 (0.532-0.603) | 0.621 (0.563-0.679) | 0.474 (0.465-0.482) | 0.0249 (0.0227-0.0273) |
| BCRAT | 0.563 (0.528-0.597) | 0.647 (0.590-0.704) | 0.461 (0.452-0.470) | 0.0254 (0.0232-0.0277) |
Abbreviations: CI = confidence interval, LR = logistic regression, NB = naive Bayes, DT = decision tree, LDA = linear discriminant analysis, SVM = support vector machine, NN = neural network
* Calculated using sensitivity / specificity values based on the threshold that maximized the sum of testing data set sensitivities and specificities rather than the sum of training data set sensitivities and specificities.
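The footnote above describes choosing a classification threshold that maximizes the sum of sensitivity and specificity. As an illustration only, the following minimal Python sketch applies that selection rule to a list of risk scores. The function names and the toy data are hypothetical and are not taken from the study's code or data set.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Treat score >= threshold as a positive call; return (sensitivity, specificity)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(scores, labels):
    """Return the observed score that maximizes sensitivity + specificity."""
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: sum(sensitivity_specificity(scores, labels, t)),
    )


# Hypothetical toy data (1 = case, 0 = non-case), not from the study.
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80]
toy_labels = [0, 0, 0, 1, 0, 1, 1]

threshold = best_threshold(toy_scores, toy_labels)
sens, spec = sensitivity_specificity(toy_scores, toy_labels, threshold)
print(threshold, sens, spec)
```

Selecting the threshold on the same data used for evaluation, as the footnote describes, tends to give a more favorable sensitivity and specificity than a threshold fixed in advance on training data.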
Comparisons between machine learning models with only BCRAT inputs and the BCRAT.
| Statistic | LR | NB | DT | LDA | SVM | NN |
|---|---|---|---|---|---|---|
| z-score | 0.169 | 0.278 | 2.17 | 0.175 | 3.61 | 0.416 |
| p-value | 0.866 | 0.781 | 0.0302 | 0.861 | 3.08e-4 | 0.678 |
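The two rows of values above are consistent with a test statistic and its two-sided p-value under a standard normal reference distribution. The sketch below, which assumes that interpretation, converts the first row into p-values. Small differences from the reported values are expected because the tabulated statistics are rounded.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal statistic: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# First row of the comparison table above (models with only BCRAT inputs vs. the BCRAT).
z_scores = {"LR": 0.169, "NB": 0.278, "DT": 2.17, "LDA": 0.175, "SVM": 3.61, "NN": 0.416}
p_values = {name: two_sided_p_value(z) for name, z in z_scores.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.3g}")
```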
Comparisons between machine learning models with only BCRAT inputs.
| Statistic | NN / LR | NN / NB | NN / DT | NN / LDA | NN / SVM |
|---|---|---|---|---|---|
| z-score | 0.888 | 1.07 | 2.40 | 1.08 | 3.58 |
| p-value | 0.375 | 0.286 | 0.0163 | 0.278 | 3.47e-4 |
Fig 1. ROC curves for machine learning models with only BCRAT inputs and for the BCRAT.
These are the receiver operating characteristic (ROC) curves for the six machine learning models with only Breast Cancer Risk Prediction Tool (BCRAT) inputs and for the BCRAT. The six machine learning models include a logistic regression (LR), naive Bayes (NB), decision tree (DT), linear discriminant analysis (LDA), support vector machine (SVM), and neural network (NN). We report DeLong 95% confidence intervals for each area under the receiver operating characteristic curve (AUC) value.
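To make the AUC and its confidence interval concrete, here is a minimal pure-Python sketch of the empirical AUC (the fraction of case/non-case pairs that the score ranks correctly, with ties counted as one half) together with a DeLong-style variance estimate. It is an illustration under stated assumptions, not the study's implementation: the function names and toy data are hypothetical, and clipping the interval to [0, 1] is a choice made here.

```python
import math


def _psi(pos_score, neg_score):
    """Pair kernel: 1 if the case outranks the non-case, 0.5 on a tie, else 0."""
    if pos_score > neg_score:
        return 1.0
    if pos_score == neg_score:
        return 0.5
    return 0.0


def _sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def auc_with_delong_ci(scores, labels, z_crit=1.96):
    """Return (auc, lower, upper) using DeLong structural components."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # One component per case and one per non-case.
    v10 = [sum(_psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, n) for p in pos) / len(pos) for n in neg]
    auc = sum(v10) / len(v10)
    variance = _sample_variance(v10) / len(pos) + _sample_variance(v01) / len(neg)
    half_width = z_crit * math.sqrt(variance)
    # Clipping to [0, 1] is an assumption of this sketch.
    return auc, max(0.0, auc - half_width), min(1.0, auc + half_width)


# Hypothetical toy data (1 = case, 0 = non-case), not from the study.
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80]
toy_labels = [0, 0, 0, 1, 0, 1, 1]
auc, lower, upper = auc_with_delong_ci(toy_scores, toy_labels)
print(f"AUC = {auc:.3f} ({lower:.3f}-{upper:.3f} 95% CI)")
```

The sketch needs at least two cases and two non-cases so that both sample variances are defined.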
Statistics for machine learning models with the broader set of inputs and for the BCRAT.
| Model | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | Precision (95% CI) |
|---|---|---|---|---|
| LR | 0.613 (0.579-0.647) | 0.476 (0.416-0.536) | 0.691 (0.683-0.699) | 0.0323 (0.0285-0.0366) |
| NB | 0.589 (0.555-0.623) | 0.639 (0.582-0.697) | 0.523 (0.514-0.531) | 0.0282 (0.0258-0.0308) |
| DT | 0.508 (0.496-0.521) | 0.0446 (0.0199-0.0693) | 0.972 (0.969-0.974) | 0.0328 (0.0190-0.0562) |
| LDA | 0.613 (0.579-0.646) | 0.688 (0.632-0.743) | 0.467 (0.459-0.476) | 0.0272 (0.0251-0.0295) |
| SVM | 0.518 (0.484-0.551) | 0.517 (0.457-0.576) | 0.478 (0.469-0.486) | 0.0210 (0.0187-0.0235) |
| NN | 0.608 (0.574-0.643) | 0.599 (0.540-0.657) | 0.562 (0.553-0.570) | 0.0287 (0.0261-0.0317) |
| BCRAT | 0.563 (0.528-0.597) | 0.647 (0.590-0.704) | 0.461 (0.452-0.470) | 0.0254 (0.0232-0.0277) |
* Calculated using sensitivity / specificity values based on the threshold that maximized the sum of testing data set sensitivities and specificities rather than the sum of training data set sensitivities and specificities.
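Precision in the table above stays near 0.03 even where sensitivity and specificity are moderate, which follows from the small share of positive cases noted in the abstract. The sketch below shows the relationship via Bayes' rule. The prevalence value is an assumption, not a figure stated in the text: it is inferred from the row in the BCRAT-inputs table with sensitivity 0.993 and specificity 0.00718, where nearly every participant is flagged and precision (0.0212) therefore approximates the positive fraction.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity, and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Assumed positive fraction (inferred from the tables, not stated in the text).
ASSUMED_PREVALENCE = 0.0212

# First row of the table above: sensitivity 0.476, specificity 0.691.
first_row_precision = precision_from_rates(0.476, 0.691, ASSUMED_PREVALENCE)
print(f"{first_row_precision:.4f}")
```

Under that assumed prevalence the result agrees with the reported precision of 0.0323 to the precision shown in the table.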
Comparisons between machine learning models with the broader set of inputs and the BCRAT.
| Statistic | LR | NB | DT | LDA | SVM | NN |
|---|---|---|---|---|---|---|
| z-score | 2.50 | 1.31 | 2.99 | 2.62 | 1.86 | 2.50 |
| p-value | 0.0123 | 0.189 | 2.77e-3 | 8.74e-3 | 0.0632 | 0.0123 |
Comparisons between machine learning models with the broader set of inputs.
| Statistic | LR / NB | LR / DT | LR / LDA | LR / SVM | LR / NN | LDA / NB | LDA / DT | LDA / SVM | LDA / NN |
|---|---|---|---|---|---|---|---|---|---|
| z-score | 2.39 | 5.79 | 0.110 | 4.35 | 0.529 | 2.54 | 5.81 | 4.23 | 0.496 |
| p-value | 0.0168 | 6.85e-9 | 0.912 | 1.39e-5 | 0.597 | 0.0112 | 6.19e-9 | 2.35e-5 | 0.620 |
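Pairwise comparisons like those above rest on a test for the difference between two AUCs computed on the same cases. The following is a minimal pure-Python sketch of a paired DeLong test, offered as an illustration under stated assumptions rather than the study's code. The function names and toy scores are hypothetical, and the two-sided normal p-value is assumed.

```python
import math


def _components(scores, labels):
    """DeLong structural components: one value per case, one per non-case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    def psi(p, n):
        return 1.0 if p > n else (0.5 if p == n else 0.0)

    v10 = [sum(psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(psi(p, n) for p in pos) / len(pos) for n in neg]
    return v10, v01


def _covariance(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_paired_test(scores_a, scores_b, labels):
    """Return (auc_a, auc_b, z, p) for two models scored on the same cases."""
    v10_a, v01_a = _components(scores_a, labels)
    v10_b, v01_b = _components(scores_b, labels)
    m, n = len(v10_a), len(v01_a)
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m
    var_a = _covariance(v10_a, v10_a) / m + _covariance(v01_a, v01_a) / n
    var_b = _covariance(v10_b, v10_b) / m + _covariance(v01_b, v01_b) / n
    cov_ab = _covariance(v10_a, v10_b) / m + _covariance(v01_a, v01_b) / n
    # Variance of the AUC difference; it must be positive for the test to be defined.
    z = (auc_a - auc_b) / math.sqrt(var_a + var_b - 2.0 * cov_ab)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return auc_a, auc_b, z, p


# Hypothetical toy data (1 = case, 0 = non-case), not from the study.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores_a = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_scores_b = [0.6, 0.3, 0.8, 0.2, 0.7, 0.5, 0.4, 0.1]
auc_a, auc_b, z, p = delong_paired_test(toy_scores_a, toy_scores_b, toy_labels)
print(f"AUC A = {auc_a:.4f}, AUC B = {auc_b:.4f}, z = {z:.3f}, p = {p:.3f}")
```

The covariance term is what distinguishes the paired test from comparing two independent AUCs: models scored on the same participants tend to make correlated errors, which shrinks the variance of the difference.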
Fig 2. ROC curves for machine learning models with the broader set of inputs and for the BCRAT.
These are the receiver operating characteristic (ROC) curves for the six machine learning models with the broader set of inputs and for the Breast Cancer Risk Prediction Tool (BCRAT). The six machine learning models include a logistic regression (LR), naive Bayes (NB), decision tree (DT), linear discriminant analysis (LDA), support vector machine (SVM), and neural network (NN). We report DeLong 95% confidence intervals for each area under the receiver operating characteristic curve (AUC) value.