Chang Ming1, Valeria Viassolo2, Nicole Probst-Hensch3, Pierre O Chappuis2,4, Ivo D Dinov5,6,7,8, Maria C Katapodi9,8.
Abstract
BACKGROUND: Comprehensive breast cancer risk prediction models enable identifying and targeting women at high risk, while reducing interventions in those at low risk. Breast cancer risk prediction models used in clinical practice have low discriminatory accuracy (0.53-0.64). Machine learning (ML) offers an alternative approach to standard prediction modeling that may address current limitations and improve the accuracy of those tools. The purpose of this study was to compare the discriminatory accuracy of ML-based estimates against a pair of established methods: the Breast Cancer Risk Assessment Tool (BCRAT) and the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA) models.
Keywords: Big data; Breast cancer; Cancer screening; Machine learning; Personalized medicine; Risk prediction
Year: 2019 PMID: 31221197 PMCID: PMC6585114 DOI: 10.1186/s13058-019-1158-4
Source DB: PubMed Journal: Breast Cancer Res ISSN: 1465-5411 Impact factor: 6.466
Variables included in ML for comparison with BCRAT and BOADICEA
| Variables list | Comparison between ML and BCRAT | Comparison between ML and BOADICEA |
|---|---|---|
| Age | ✓ | |
| Age at menarche | ✓ | |
| Age at first live birth | ✓ | |
| Race | ✓ | |
| Number of biopsies | ✓ | |
| Atypical hyperplasia | ✓ | |
| Number of first-degree relatives with breast cancer | ✓ | |
| Breast cancer | ✓ | |
| Family pedigree (beyond second-degree contained affected and unaffected members from both maternal and paternal side) including: | ✓ | |
| Age (or age at death) | ✓ | |
| Gender | ✓ | |
| Deceased status | ✓ | |
| Ashkenazi Jewish | ✓ | |
| Ovary cancer age onset | ✓ | |
| Prostate cancer age onset (male member only) | ✓ | |
| Pancreatic cancer | ✓ | |
| Pancreas cancer age onset | ✓ | |
| Breast cancer age onset | ✓ | |
| Contralateral breast cancer age onset | ✓ | |
| Estrogen receptor | ✓ | |
| Progesterone receptor | ✓ | |
| BRCA mutation | ✓ | |
Sample characteristics of the US population-based sample (n = 1143) and the Swiss clinic-based sample (n = 2481)
| Variables included in BCRAT and BOADICEA models and in ML algorithms | US population-based sample | Swiss clinic-based sample |
|---|---|---|
| Age (range) | 50.86 ± 6.22 (35–64) | 50.78 ± 12.77 (13–89) |
| Age at menarche (range) | 12.56 ± 1.54 (8–18) | 12.91 ± 1.59 (8–18) |
| Age at first live birth (range) | 24.29 ± 5.62 (13–42) | 24.13 ± 5.72 (15–48) |
| Number of biopsies | 1.20 ± 1.21 | – |
| Atypical hyperplasia | 14 (1.65%) | – |
| Breast cancer | 850 (74.37%) | 886 (35.71%) |
| First-ductal carcinoma in situ (DCIS) | 434 (51.06%) | 50 (5.64%) |
| First-invasive breast cancer | 404 (47.52%) | 807 (91.08%) |
| First-breast cancer age onset (range) | 40.03 ± 4.79 (26–54) | 46.07 ± 10.69 (22–84) |
| Bilateral breast cancer | 4 (0.47%) | 160 (18.06%) |
| Estrogen receptor (ER) positive | – | 618 (69.75%) |
| Progesterone receptor (PR) positive | – | 561 (63.32%) |
| Pancreatic cancer | – | 13 (0.52%) |
| Pancreatic cancer age onset (range) | – | 55.10 ± 9.35 (36–75) |
| Ovarian cancer | 9 (0.79%) | 133 (5.36%) |
| Ovarian cancer age onset (range) | 45.83 ± 5.00 (36–50) | 56.44 ± 13.16 (21–85) |
| Having also breast cancer | 4 | 20 |
| Ethnicity (% Black) | 401 (35.08%) | 71 (2.86%) |
| Ashkenazi Jewish origin | 12 (1.05%) | 65 (2.29%) |
| Number of first-degree relatives with breast cancer | 0.98 ± 1.05 | 0.25 ± 0.55 |
| Breast cancer patients | 0.81 ± 1.05 | – |
| Relatives of breast cancer patients | 1.49 ± 0.88 | – |
| | 32 (2.79%) 235 tested | 209 (8.42%) 1052 tested |
– Data not available
AU-ROC performance of BCRAT and ML algorithms (with standard deviation) in predicting breast cancer lifetime risk in simulated datasets (n = 1200) and the US population-based sample (n = 1143)
| Dataset | BCRAT | ML: random forest | ML: logistic regression | ML: adapt boosting | ML: linear model | ML: K-nearest neighbors | ML: linear discriminant | ML: quadratic discriminant | ML: MCMC GLMM |
|---|---|---|---|---|---|---|---|---|---|
| A. Sim_no_signal | 0.5333 | 0.5016 (0.0231) | 0.5133 (0.0271) | 0.5067 (0.0307) | 0.5015 (0.0220) | 0.5054 (0.0211) | 0.5158 (0.0276) | 0.5133 (0.0323) | 0.5090 (0.0210) |
| B. Sim_artificial_signal | 0.5261 | 0.9308 (0.0171) | 0.9417 (0.0103) | 0.9292 (0.0095) | 0.7859 (0.0197) | 0.9125 (0.0109) | 0.9312 (0.0154) | 0.9188 (0.0111) | 0.9329 (0.0087) |
| C. Sim_artificial_signal + 20% missing | 0.5068 | 0.9275 (0.0179) | 0.9217 (0.0259) | 0.9258 (0.0113) | 0.7807 (0.0227) | 0.9012 (0.0120) | 0.9213 (0.0202) | 0.9104 (0.0237) | 0.9191 (0.0210) |
| D. Sim_artificial_signal + 20% missing + imputation | 0.5035 | 0.9167 (0.0184) | 0.9300 (0.0111) | 0.9213 (0.0119) | 0.7824 (0.0200) | 0.9058 (0.0117) | 0.9275 (0.0148) | 0.9121 (0.0081) | 0.9232 (0.0099) |
| US population-based sample | 0.6240 | 0.8889 (0.0201) | 0.7192 (0.0314) | 0.8828 (0.0229) | 0.6813 (0.0378) | 0.8089 (0.0217) | 0.8692 (0.0284) | 0.8675 (0.0241) | 0.8234 (0.0189) |
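
The AU-ROC values with standard deviations in this table can be read as a mean and spread over cross-validation folds. The sketch below is a minimal illustration of that arithmetic, not the authors' pipeline: it computes AU-ROC as the probability that a randomly chosen case scores higher than a randomly chosen non-case, then summarizes it over folds. The function names, the synthetic data, and the choice to evaluate precomputed scores per fold (rather than refitting a model on the remaining folds) are all assumptions made for illustration.

```python
import random
import statistics


def auc_roc(labels, scores):
    """AU-ROC as P(score of a case > score of a non-case); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fold_auc_summary(labels, scores, k=10, seed=0):
    """Shuffle indices into k folds, compute AU-ROC per fold, return (mean, SD)."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    aucs = [
        auc_roc([labels[i] for i in fold], [scores[i] for i in fold])
        for fold in folds
    ]
    return statistics.mean(aucs), statistics.stdev(aucs)


# Hypothetical synthetic data loosely mirroring the "no signal" and
# "artificial signal" rows; these are not the study's datasets.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(1200)]
noise = [rng.random() for _ in labels]            # scores unrelated to labels
signal = [y + rng.gauss(0, 0.7) for y in labels]  # scores that track labels

print("no signal:        mean=%.4f sd=%.4f" % fold_auc_summary(labels, noise))
print("artificial signal: mean=%.4f sd=%.4f" % fold_auc_summary(labels, signal))
```

Scores unrelated to the outcome should give a mean AU-ROC near 0.5, as in row A, while scores that track the outcome move it toward 1.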
AU-ROC performance of the BOADICEA model and ML algorithms (with standard deviation) in predicting breast cancer lifetime risk in simulated datasets (n = 2500) and the Swiss clinic-based sample (n = 112,587 women from 2481 families)
| Dataset | BOADICEA model | ML: random forest | ML: logistic regression | ML: adapt boosting | ML: linear model | ML: K-nearest neighbors | ML: linear discriminant | ML: quadratic discriminant | ML: MCMC GLMM |
|---|---|---|---|---|---|---|---|---|---|
| A. Sim_no_signal | 0.5103 | 0.5020 (0.0197) | 0.5093 (0.0210) | 0.5029 (0.0177) | 0.5151 (0.0190) | 0.5254 (0.0199) | 0.5094 (0.0241) | 0.5002 (0.0216) | 0.5075 (0.0201) |
| B. Sim_artificial_signal | 0.5392 | 0.9101 (0.0148) | 0.9233 (0.0172) | 0.9321 (0.0122) | 0.6659 (0.0164) | 0.9301 (0.0159) | 0.9109 (0.0187) | 0.9244 (0.0166) | 0.9219 (0.0151) |
| C. Sim_artificial_signal + 20% missing | 0.5022 | 0.8977 (0.0183) | 0.9100 (0.0293) | 0.9291 (0.0156) | 0.6407 (0.0257) | 0.9232 (0.0180) | 0.8982 (0.0276) | 0.9209 (0.0297) | 0.9088 (0.0219) |
| D. Sim_artificial_signal + 20% missing + imputation | 0.5115 | 0.9028 (0.0127) | 0.9203 (0.0157) | 0.9299 (0.0110) | 0.6463 (0.0147) | 0.9276 (0.0140) | 0.9035 (0.0159) | 0.9220 (0.0141) | 0.9154 (0.0137) |
| Swiss clinic-based sample | 0.5931 | 0.8535 (0.0214) | 0.8271 (0.0189) | 0.9017 (0.0162) | 0.6921 (0.0202) | 0.8377 (0.0156) | 0.7899 (0.0188) | 0.8369 (0.0192) | 0.8932 (0.0149) |
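
Rows C and D differ only in whether the 20% of missing values were imputed before modeling. The sketch below shows one way such a pair of datasets could be produced. It is an assumption-laden illustration: the table does not say which imputation method was used, so column-mean imputation stands in here, and `mask_missing` and `impute_column_means` are hypothetical helper names.

```python
import random


def mask_missing(rows, fraction=0.2, seed=0):
    """Return a copy of rows with roughly `fraction` of values set to None."""
    rng = random.Random(seed)
    return [[None if rng.random() < fraction else v for v in row] for row in rows]


def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


# Hypothetical simulated predictors (2500 rows, 5 columns), not the study data.
rng = random.Random(1)
complete = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(2500)]
with_gaps = mask_missing(complete, fraction=0.2)   # analogue of dataset C
imputed = impute_column_means(with_gaps)           # analogue of dataset D

n_missing = sum(v is None for row in with_gaps for v in row)
print("share missing:", n_missing / (2500 * 5))
```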
Fig. 1 a The area under the receiver operating characteristic curve (AU-ROC) for BCRAT and the ML random forest approach. b The AU-ROC for the BOADICEA model and the ML adapt boosting approach
Top five important risk factors in descending order for different ML algorithms based on the US population-based training samples in 10-fold internal statistical cross-validations
| ML: random forest | ML: logistic regression | ML: adapt boosting | ML: linear model | ML: K-nearest neighbors | ML: linear discriminant | ML: quadratic discriminant | ML: MCMC GLMM |
|---|---|---|---|---|---|---|---|
| Number of biopsies | Number of first-degree relatives with breast cancer | Number of biopsies | Age | Number of biopsies | Age | Number of first-degree relatives with breast cancer | Number of biopsies |
| Age | Age | Age | Number of biopsies | Number of first-degree relatives with breast cancer | Number of biopsies | Number of biopsies | Age |
| Number of first-degree relatives with breast cancer | Number of biopsies | Number of first-degree relatives with breast cancer | Number of first-degree relatives with breast cancer | Age | Ethnicity | Age | Number of first-degree relatives with breast cancer |
| Age at menarche | Ethnicity | Age at menarche | Age at menarche | Ethnicity | Number of first-degree relatives with breast cancer | Ethnicity | Age at first live birth |
| Ethnicity | Age at first live birth | Ethnicity | Age at first live birth | Age at first live birth | Age at first live birth | Age at menarche | Age at menarche |
Top five important risk factors in descending order for different ML algorithms based on the Swiss clinic-based training samples in 10-fold internal statistical cross-validations
| ML: random forest | ML: logistic regression | ML: adapt boosting | ML: linear model | ML: K-nearest neighbors | ML: linear discriminant | ML: quadratic discriminant | ML: MCMC GLMM |
|---|---|---|---|---|---|---|---|
| Breast cancer age onset | Age | Breast cancer age onset | Age | Family history | Age | Breast cancer age onset | Breast cancer age onset |
| Age | Breast cancer age onset | Age | Breast cancer age onset | Mutation | Breast cancer age onset | Mutation | Age |
| Mutation | Ashkenazi Jewish origin | Mutation | Ashkenazi Jewish origin | Age | Mutation | Age | Mutation |
| Ashkenazi Jewish origin | Ovarian cancer age onset | Ashkenazi Jewish origin | Mutation | Ashkenazi Jewish origin | Ashkenazi Jewish origin | Ashkenazi Jewish origin | Ovarian cancer age onset |
| Ovarian cancer age onset | Mutation | Ovarian cancer age onset | Ovarian cancer age onset | Ovarian cancer age onset | Ovarian cancer age onset | Ovarian cancer age onset | Ashkenazi Jewish origin |
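
The rankings above order risk factors by importance, but the tables do not state how importance was measured for each algorithm. One model-agnostic option is permutation importance: shuffle one predictor at a time and record how much the AU-ROC drops. The sketch below illustrates that idea only; the scoring function, feature layout, and synthetic data are hypothetical and do not reproduce the study's method.

```python
import random


def auc_roc(labels, scores):
    """AU-ROC as P(score of a case > score of a non-case); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(rows, labels, score_fn, n_repeats=5, seed=0):
    """Mean drop in AU-ROC when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = auc_roc(labels, [score_fn(r) for r in rows])
    drops = []
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [r[j] for r in rows]
            rng.shuffle(col)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
            total += baseline - auc_roc(labels, [score_fn(r) for r in shuffled])
        drops.append(total / n_repeats)
    return drops


# Hypothetical data: feature 0 matters most, feature 1 a little, feature 2 not at all.
rng = random.Random(7)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(400)]
labels = [1 if 2 * r[0] + 0.5 * r[1] + rng.gauss(0, 1) > 0 else 0 for r in rows]


def score_fn(r):
    # Stand-in for a fitted model's risk score (an assumption for this sketch).
    return 2 * r[0] + 0.5 * r[1]


drops = permutation_importance(rows, labels, score_fn)
ranking = sorted(range(3), key=lambda j: drops[j], reverse=True)
print("features ranked by importance:", ranking)
```

Sorting features by their mean AU-ROC drop yields a descending-order ranking of the same shape as the columns in the tables above.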