Hisham Hussan, Jing Zhao, Abraham K Badu-Tawiah, Peter Stanich, Fred Tabung, Darrell Gray, Qin Ma, Matthew Kalady, Steven K Clinton.
Abstract
BACKGROUND AND AIMS: The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35-50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors.
Year: 2022 PMID: 35271664 PMCID: PMC9064446 DOI: 10.1371/journal.pone.0265209
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Study plot detailing study flow as well as inclusion and exclusion criteria.
Included predictors and baseline demographics.
| Predictors | Percentages and means | Missing data |
|---|---|---|
| Total number of included patients | 3116 | |
| Mean age (standard deviation or S.D.) | 46.5 (4.73) | 0.0% |
| Female Gender | 55.1% | 0.0% |
| Race | | 1.8% |
| Non-Hispanic White | 67.2% | |
| African American | 18.7% | |
| Hispanic | 2.6% | |
| Asian | 1.5% | |
| Other | 8.1% | |
| Rural-urban commuting area code (RUCA2) | | 0.0% |
| Mean (S.D.) | 1.33 (1.17) | |
| Metropolitan [RUCA 1–3] | 93.3% | |
| Micropolitan [RUCA 4–6] | 5.0% | |
| Small town [RUCA 7–9] | 1.4% | |
| Rural [RUCA 10] | 0.3% | |
| Percentage of returns within income brackets per zip code [mean (S.D.)] | | 0.4% |
| $1 to under $25,000 | 30.60% (9.48) | |
| $25,000 to under $50,000 | 24.59% (6.98) | |
| $50,000 to under $75,000 | 14.64% (2.63) | |
| $75,000 to under $100,000 | 9.50% (2.83) | |
| $100,000 to under $200,000 | 14.70% (7.69) | |
| $200,000 or more | 5.97% (5.88) | |
| Percentage of single tax returns per zip code [mean (S.D.)] | 49.04% (8.67) | 0.4% |
| Adjusted gross income per zip code [mean (S.D.)] | $1,399,101.15 (862,516.76) | 0.4% |
| American Society of Anesthesiologists (ASA) Physical Status Classification System | | 0.0% |
| ASA I (healthy patient) | 25.6% | |
| ASA II (mild systemic disease) | 67.4% | |
| ASA III (severe systemic disease) | 6.9% | |
| ASA IV (life threatening systemic disease) | 0.1% | |
| Colorectal cancer screening indication | 53.0% | 0.0% |
| All diagnostic colonoscopy indications | 47.0% | 0.0% |
| Functional gastrointestinal symptoms: | 32.8% | |
| •Abdominal pain | 11.3% | |
| •Constipation | 5.9% | |
| •Diarrhea | 3.3% | |
| •Rectal pain | 0.6% | |
| •Pelvic pain | 0.3% | |
| •Obstipation | 0.1% | |
| •Irritable bowel syndrome | 0.3% | |
| Weight loss | 1.0% | |
| Gastrointestinal bleeding | 20.1% | |
| Anemia | 3.8% | |
| Change in bowel habits | 2.8% | |
| Change in stool caliber | 0.7% | |
| Personal history of cancer other than CRC | 0.4% | |
| Colorectal neoplasm in distant relative | 3.0% | |
| Family history of cancer other than CRC | 0.0% | |
| Prior diverticulitis | 2.1% | |
| Height in feet [mean (S.D.)] | 5.59 (0.34) | 0.7% |
| Weight in pounds [mean (S.D.)] | 194.58 (53.11) | 6.0% |
| BMI (kg/m2) | | 6.0% |
| Mean (S.D.) | 30.24 (7.42) | |
| ≥ 25 kg/m2 | 70.7% | |
| ≥ 30 kg/m2 | 40.5% | |
| ≥ 35 kg/m2 | 19.4% | |
| ≥ 40 kg/m2 | 9.4% | |
| Median [interquartile range (IQR)] | 28.8 (25–33.9) | |
| Alcohol use | | 0.9% |
| Never | 1.1% | |
| No | 33.8% | |
| Not currently | 3.0% | |
| Yes | 61.2% | |
| Tobacco use | | 0.4% |
| Never | 61.9% | |
| Passive | 0.2% | |
| Quit | 23.1% | |
| Yes | 14.3% | |
| Intravenous drug user | | 1.4% |
| No | 98.4% | |
| Yes | 0.2% | |
| Illicit drug user | | 1.4% |
| Never | 5.5% | |
| No | 84.3% | |
| Not currently | 2.0% | |
| Yes | 6.8% | |
| Total cholesterol (mg/dL) | | 29.2% |
| Mean (S.D.) | 185.57 (41.11) | |
| ≥ 200 mg/dL | 23.9% | |
| < 200 mg/dL | 46.8% | |
| ≥ 170 mg/dL | 45.3% | |
| < 170 mg/dL | 25.5% | |
| Median (IQR) | 183 (159–210) | |
| High Density Lipoprotein (HDL, mg/dL) | | 29.9% |
| Mean (S.D.) | 51.90 (15.82) | |
| ≥35 mg/dL | 63.5% | |
| <35 mg/dL | 6.6% | |
| ≥40 mg/dL | 55.4% | |
| <40 mg/dL | 14.7% | |
| Median (IQR) | 49 (41–60) | |
| Low Density Lipoprotein (LDL, mg/dL) | | 30.3% |
| Mean (S.D.) | 107.01 (34.45) | |
| ≥100 mg/dL | 40.3% | |
| <100 mg/dL | 29.4% | |
| ≥150 mg/dL | 7.2% | |
| <150 mg/dL | 62.5% | |
| Median (IQR) | 106 (84–129) | |
| Triglyceride (TG, mg/dL) | | 29.4% |
| Mean (S.D.) | 143.74 (176.88) | |
| ≥150 mg/dL | 21.9% | |
| <150 mg/dL | 48.7% | |
| Median (IQR) | 110 (76–167) | |
| Triglyceride: High Density Lipoprotein (TG: HDL) ratio | | 29.9% |
| Mean (S.D.) | 3.31 (5.92) | |
| High (ratio ≥3) | 25.0% | |
| Low (ratio <3) | 45.1% | |
| Median (IQR) | 2.24 (1.35–3.77) | |
| Hemoglobin (g/dL) | | 40.3% |
| Mean (S.D.) | 13.75 (1.71) | |
| Females with anemia (<12 g/dL) | 6.3% | |
| Males with anemia (<13.5 g/dL) | 3.2% | |
| Median (IQR) | 13.9 (12.8–14.9) | |
| Reported non-steroidal anti-inflammatory drugs use | 12.5% | 0.0% |
| Statin medications use | 14.3% | 0.0% |
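Several predictors in this table are derived quantities: BMI from height and weight, and the TG:HDL ratio with its cut-off at 3. The record does not say how these were computed, so the following is only a minimal sketch under stated assumptions. The function names are hypothetical, and the conversion constants are the standard foot-to-metre and pound-to-kilogram factors. Because a mean of per-patient BMIs differs from the BMI of the mean height and weight, this will not reproduce the reported mean exactly.

```python
# Illustrative helpers only; names are hypothetical and not taken from the study.

FEET_TO_METRES = 0.3048        # exact definition of the international foot
POUNDS_TO_KG = 0.45359237      # exact definition of the avoirdupois pound


def bmi_from_imperial(height_ft: float, weight_lb: float) -> float:
    """Return BMI in kg/m^2 from height in feet and weight in pounds."""
    height_m = height_ft * FEET_TO_METRES
    weight_kg = weight_lb * POUNDS_TO_KG
    return weight_kg / height_m ** 2


def bmi_threshold_flags(bmi: float) -> dict:
    """Flag the BMI cut-offs listed in the table (>= 25, 30, 35, 40 kg/m^2)."""
    return {threshold: bmi >= threshold for threshold in (25, 30, 35, 40)}


def tg_hdl_category(tg_mg_dl: float, hdl_mg_dl: float) -> str:
    """Classify the TG:HDL ratio with the table's cut-off (High if ratio >= 3)."""
    ratio = tg_mg_dl / hdl_mg_dl
    return "High" if ratio >= 3 else "Low"


if __name__ == "__main__":
    bmi = bmi_from_imperial(5.59, 194.58)  # cohort mean height and weight
    print(round(bmi, 2), bmi_threshold_flags(bmi))
    print(tg_hdl_category(150.0, 40.0))
```

Missing values (for example the roughly 30% of patients without lipid measurements) would need to be handled before applying helpers like these; that step is not shown here.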
Fig 2. Receiver operating characteristic (ROC) curves of the reference and machine learning models in the test set for colorectal cancer (CRC) (top) and CRC or high-risk polyps (bottom).
Fig 3. Area Under the Curve (AUC) of our reference and machine learning models in the test set for colorectal cancer (CRC) and CRC or high-risk polyps.
The p value compares the machine learning models to the reference model using the DeLong test.
Performance metrics of different prediction models.
| Metric/Model | Logistic regression | Regularized discriminant | Random forest | Neural network | Stochastic gradient boosting |
|---|---|---|---|---|---|
| Colorectal cancer | | | | | |
| Accuracy | 0.11 | 0.96 | 0.66 | 0.71 | 0.86 |
| Accuracy (lower) | 0.09 | 0.95 | 0.62 | 0.68 | 0.84 |
| Accuracy (upper) | 0.13 | 0.97 | 0.69 | 0.74 | 0.88 |
| Balanced accuracy | 0.42 | 0.73 | 0.70 | 0.73 | 0.80 |
| Sensitivity | 0.75 | 0.50 | 0.75 | 0.75 | 0.75 |
| Specificity | 0.10 | 0.96 | 0.65 | 0.71 | 0.86 |
| Positive predictive value | 0.00 | 0.05 | 0.00 | 0.01 | 0.02 |
| Negative predictive value | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| Colorectal cancer or high-risk polyps | | | | | |
| Accuracy | 0.52 | 0.61 | 0.61 | 0.62 | 0.54 |
| Accuracy (lower) | 0.49 | 0.58 | 0.58 | 0.59 | 0.51 |
| Accuracy (upper) | 0.55 | 0.64 | 0.65 | 0.65 | 0.58 |
| Balanced accuracy | 0.54 | 0.60 | 0.59 | 0.60 | 0.54 |
| Sensitivity | 0.58 | 0.57 | 0.55 | 0.56 | 0.54 |
| Specificity | 0.51 | 0.62 | 0.63 | 0.63 | 0.55 |
| Positive predictive value | 0.17 | 0.21 | 0.21 | 0.21 | 0.18 |
| Negative predictive value | 0.87 | 0.89 | 0.88 | 0.89 | 0.87 |
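The metrics in this table all follow from a 2×2 confusion matrix. The sketch below shows the standard definitions and why the positive predictive value can stay very low for a rare outcome even when sensitivity and specificity look reasonable. The class and function names are hypothetical, and the example counts are invented for illustration; they are not the study's confusion matrix.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary classifier evaluated on a test set."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(c: ConfusionCounts) -> dict:
    """Compute the metrics reported in the table from confusion counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Hypothetical counts: 4 cases among 1000 patients (a rare outcome).
example = ConfusionCounts(tp=3, fp=139, tn=857, fn=1)

if __name__ == "__main__":
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.2f}")
```

With so few true cases, even a modest false-positive rate produces far more false positives than true positives, which is what drives the PPV toward zero while the NPV stays near one.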
Fig 4. Comparison of the reference Area Under the Curve (AUC) to machine learning models in the test set for colorectal cancer (CRC) and CRC or high-risk polyps.