Jacopo Troisi1,2, Maria Tafuro3, Martina Lombardi2, Giovanni Scala2,4, Sean M Richards5,6, Steven J K Symes5,7, Paolo Antonio Ascierto8, Paolo Delrio8, Fabiana Tatangelo8, Carlo Buonerba3, Biancamaria Pierri3, Pellegrino Cerino3.
Abstract
Colorectal cancer (CRC) is a high-incidence disease, characterized by high morbidity and mortality rates. Early diagnosis remains challenging because fecal occult blood screening tests have performed sub-optimally, especially due to hemorrhoidal, inflammatory, and vascular diseases, while colonoscopy is invasive and requires a medical setting to be performed. The objective of the present study was to determine whether serum metabolomic profiles could be used to develop a novel screening approach for colorectal cancer. Furthermore, the study evaluated the metabolic alterations associated with the disease. Untargeted serum metabolomic profiles were collected from 100 CRC subjects, 50 healthy controls (CTRLs), and 50 individuals with benign colorectal disease. Different machine learning models, as well as an ensemble model based on a voting scheme, were built to discern CRC patients from CTRLs. The ensemble model correctly classified all CRC and CTRL subjects (accuracy = 100%) using a random subset of the cohort as a test set. Relevant metabolites were examined in a metabolite-set enrichment analysis, revealing differences between patients and controls primarily associated with cell glucose metabolism. These results support a potential use of the metabolomic signature as a non-invasive screening tool for CRC. Moreover, metabolic pathway analysis can provide valuable information to enhance understanding of the pathophysiological mechanisms underlying cancer. Further studies with larger cohorts, including blind trials, could potentially validate the reported results.
Keywords: colorectal cancer; ensemble machine learning; fecal occult blood test; metabolomics; screening test
Year: 2022 PMID: 35208185 PMCID: PMC8878838 DOI: 10.3390/metabo12020110
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Enrolled subject characteristics (mean ± standard deviation or %). Abbreviations used are HS: Healthy subjects, BCRT: Benign colon or rectum tumors, CRC: Colorectal cancer affected patients, BMI: Body mass index, HDL: High-density lipoprotein, LDL: Low-density lipoprotein, GGT: Gamma-glutamyltransferase, AST: Aspartate aminotransferase, ALT: Alanine transaminase, LDH: Lactate dehydrogenase.
| | HS (n = 50) | BCRT (n = 50) | CRC (n = 100) |
|---|---|---|---|
| Age (years) | 61.6 ± 7.0 | 62.8 ± 7.1 | 66.2 ± 11.3 * |
| Men (%) | 56 | 59 | 64 |
| Weight (kg) | 76.4 ± 15.5 | 80.0 ± 16.9 | 72.8 ± 15.1 § |
| Height (cm) | 165.0 ± 9.5 | 167.5 ± 8.7 | 167.7 ± 9.4 |
| BMI (kg/m²) | 27.9 ± 4.3 | 28.4 ± 4.8 | 25.7 ± 9.4 *,§ |
| Blood Pressure (mm Hg) | | | |
| Systolic | 135.2 ± 24.4 | 132.3 ± 17.7 | 139.9 ± 17.4 |
| Diastolic | 81.6 ± 11.4 | 81.9 ± 11.1 | 80.7 ± 8.0 |
| Heart rate (bpm) | 79.7 ± 7.7 | 79.8 ± 6.8 | 79.4 ± 7.5 |
| Oxygen saturation (%) | 99.0 ± 1.5 | 98.8 ± 1.6 | 99.7 ± 10.0 |
| Azotemia (mg/dL) | 38.4 ± 10.4 | 40.8 ± 18.8 | 43.2 ± 13.5 * |
| Total Cholesterol (mg/dL) | 191.9 ± 39.1 | 194.9 ± 42.2 | 189.2 ± 40.0 |
| HDL (mg/dL) | 57.2 ± 13.1 | 52.5 ± 14.7 | 62.9 ± 18.8 § |
| LDL (mg/dL) | 114.2 ± 30.1 | 113.7 ± 33.9 | 113.9 ± 33.9 |
| Triglycerides (mg/dL) | 115.4 ± 57.3 | 138.4 ± 95.3 | 116.6 ± 63.5 |
| Creatinine (mg/dL) | 0.8 ± 0.2 | 0.9 ± 0.4 | 0.9 ± 0.3 * |
| Alkaline phosphatase (IU/L) | 53.7 ± 16.2 | 55.2 ± 13.3 | 81.4 ± 59.3 *,§ |
| GGT (U/L) | 26.4 ± 17.5 | 26.7 ± 22.4 | 49.2 ± 118.6 |
| Glycaemia (mg/dL) | 92.1 ± 25.3 | 99.1 ± 29.3 | 101.0 ± 26.7 |
| White blood cells (n/µL) | 6725.6 ± 1917.1 | 8427.6 ± 12315.9 | 6033.3 ± 1928.7 * |
| Red blood cells (n/µL) | 4.97 × 10⁶ ± 4.92 × 10⁶ | 4.95 × 10⁶ ± 7.14 × 10⁶ | 4.66 × 10⁶ ± 6.78 × 10⁶ * |
| AST (mU/mL) | 21.5 ± 7.7 | 23.8 ± 11.6 | 25.2 ± 11.9 * |
| ALT (mU/mL) | 25.1 ± 12.2 | 28.1 ± 14.9 | 27.2 ± 16.9 |
| LDH (U/L) | 169.3 ± 27.1 | 177.6 ± 28.9 | 172.3 ± 34.9 |
| Serum iron (µg/dL) | 95.0 ± 33.6 | 95.5 ± 36.9 | 79.6 ± 42.8 *,§ |
| Uric acid (mg/dL) | 5.4 ± 1.4 | 5.8 ± 1.7 | 5.2 ± 1.4 § |
| Other pathologies (n(%)) | 45 (90%) | 40 (80%) | 98 (98%) |
| Hypertension ¶ | 24 (53%) | 25 (63%) | 49 (50%) |
| Diabetes ¶ | 9 (20%) | 8 (20%) | 14 (14%) |
| Hypertriglyceridemia ¶ | 2 (4%) | 1 (3%) | 0 (0%) |
| Hypercholesterolemia ¶ | 6 (13%) | 6 (15%) | 4 (4%) |
| Heart disease ¶ | 6 (13%) | 5 (13%) | 7 (7%) |
| Cancer in other organ ¶ | 6 (13%) | 3 (8%) | 13 (13%) |
| Other ¶ | 11 (24%) | 10 (25%) | 24 (24%) |
| Pharmacological treatments (n(%)) | 39 (78%) | 37 (74%) | 82 (82%) |
* Indicates statistical difference (p < 0.05) compared to HS; § indicates statistical differences (p < 0.05) compared to BCRT; ¶ indicates the percentage based on the cases with other pathologies.
Performance metrics (value ± standard error) of the individual and the ensemble machine learning classification algorithms when applied to the test set. Abbreviations: NB: Naïve Bayes, GLM: Generalized linear model, LR: Logistic regression, FLM: Fast large margin, DL: Deep learning, DT: Decision tree, RF: Random forest, GBT: Gradient boosted tree, SVM: Support vector machine, PLS-DA: Partial least squares discriminant analysis, EML: Ensemble machine learning, S: Sensitivity, Sp: Specificity, PLR: Positive likelihood ratio, NLR: Negative likelihood ratio, NPV: Negative predictive value, PPV: Positive predictive value, A: Accuracy, ND: Not determinable.
| Model | S | Sp | PLR | NLR | NPV | PPV | A |
|---|---|---|---|---|---|---|---|
| NB | 0.58 ± 0.10 | 1.00 ± 0.00 | ND | 0.42 | 0.67 ± 0.08 | 1.00 ± 0.00 | 0.77 |
| GLM | 0.96 ± 0.04 | 1.00 ± 0.00 | ND | 0.04 | 0.96 ± 0.04 | 1.00 ± 0.00 | 0.98 |
| LR | 0.88 ± 0.06 | 0.95 ± 0.05 | 18.58 | 0.12 | 0.87 ± 0.07 | 0.96 ± 0.04 | 0.91 |
| FLM | 1.00 ± 0.00 | 0.77 ± 0.09 | 4.40 | 0.00 | 1.00 ± 0.00 | 0.83 ± 0.07 | 0.89 |
| DL | 1.00 ± 0.00 | 1.00 ± 0.00 | ND | 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 |
| DT | 1.00 ± 0.00 | 0.82 ± 0.08 | 5.50 | 0.00 | 1.00 ± 0.00 | 0.86 ± 0.06 | 0.91 |
| RF | 0.69 ± 0.09 | 1.00 ± 0.00 | ND | 0.31 | 0.72 ± 0.08 | 1.00 ± 0.00 | 0.83 |
| GBT | 0.46 ± 0.10 | 1.00 ± 0.00 | ND | 0.54 | 0.61 ± 0.08 | 1.00 ± 0.00 | 0.71 |
| SVM | 0.81 ± 0.08 | 1.00 ± 0.00 | ND | 0.19 | 0.81 ± 0.08 | 1.00 ± 0.00 | 0.89 |
| PLS-DA | 0.92 ± 0.05 | 0.87 ± 0.06 | 7.10 | 0.10 | 0.90 ± 0.05 | 0.89 ± 0.05 | 0.90 |
| EML | 1.00 ± 0.00 | 1.00 ± 0.00 | ND | 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 |
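
The metrics in the table above follow from the four confusion-matrix counts. The sketch below shows the standard definitions and why PLR is reported as ND whenever specificity is 1.00 (the denominator, the false-positive rate, is zero). The function names and the example counts are hypothetical, and treating the ± values as binomial standard errors is an assumption; the source does not state which formula was used.

```python
import math


def ratio(num, den):
    """Return num / den, or None when the denominator is zero (reported as ND)."""
    return num / den if den else None


def binomial_se(p, n):
    """Binomial standard error of a proportion p estimated from n subjects (assumed formula)."""
    return math.sqrt(p * (1 - p) / n)


def diagnostic_metrics(tp, fn, tn, fp):
    """Compute the table's metrics from true/false positive/negative counts."""
    s = tp / (tp + fn)     # sensitivity
    sp = tn / (tn + fp)    # specificity
    fpr = fp / (tn + fp)   # false-positive rate, 1 - Sp
    fnr = fn / (tp + fn)   # false-negative rate, 1 - S
    return {
        "S": s,
        "Sp": sp,
        "PLR": ratio(s, fpr),    # ND when no false positives occur
        "NLR": ratio(fnr, sp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "A": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts, not taken from the study.
print(diagnostic_metrics(tp=20, fn=5, tn=15, fp=5))
print(diagnostic_metrics(tp=10, fn=0, tn=10, fp=0)["PLR"])  # None, i.e. ND
```

A classifier with no false positives therefore has an undefined PLR rather than an infinite one in this notation, which matches the ND entries for models with Sp = 1.00.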
Figure 1. Ensemble machine learning (EML) scores calculated for the healthy controls (CTRL) and patients with colorectal cancer (CRC) in the test set; the red dashed line represents the optimized cut-off value (Panel A). Receiver operating characteristic (ROC) curve obtained by varying the cut-off value when applying the EML model to the test set (Panel B); the area under the ROC curve is 1.0. The dotted blue line represents the 95% confidence bounds.
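
As an illustration of how an ROC curve arises from varying a cut-off, the following sketch sweeps the threshold over a set of scores and integrates the resulting curve with the trapezoidal rule. The scores and labels are made up for the example, and the function names are hypothetical; this is not the authors' implementation.

```python
def roc_points(scores, labels):
    """Sweep the cut-off over the observed scores; return (FPR, TPR) pairs.

    labels: 1 for CRC, 0 for CTRL. A subject is called positive when
    its score is greater than or equal to the cut-off.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative scores only: the two classes do not overlap.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(roc_points(scores, labels)))
```

When every CRC score lies above every CTRL score, some cut-off separates the classes completely and the area evaluates to 1.0 (up to floating-point rounding), which is the situation the caption reports for the test set.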
Figure 2. (A) Partial least squares discriminant analysis (PLS-DA) score plot performed to classify CTRL and CRC subjects. For each axis, the percentage of explained variance is reported in parentheses. (B) PLS-DA classification performance using an increasing number of latent variables. The red star indicates the best classifier. (C) Permutation test results, in which models were built after randomly reassigning the class labels and their performance was compared with that of the original model built with the correct class assignment. These were statistically different (based on 2000 permutations), indicating a lack of overfitting in the original model. (D) The metabolites showing a variable importance in projection (VIP) score higher than 2.0. The blue bars represent metabolites increased in CTRL, while the red bars represent the metabolites decreased in CTRL with respect to CRC. *** represents metabolites with a p-value < 0.001. (E) Volcano plot reporting metabolite concentration fold-changes and their statistical significance when comparing CTRL vs. CRC subjects. 1. Galactose, 2. 4-Hydroxybenzyl alcohol, 3. Myristic acid, 4. Hydroxylamine, 5. Arabinose, 6. Guanine, 7. Fructose, 8. Tetraethylene glycol, 9. Glucose, 10. Quinolinic acid, 11. Estradiol, 12. Threonine, 13. Glutamine, 14. Glyceryl-glycoside, 15. Oxoproline, 16. Lactose, 17. Oxoglutaric acid, 18. 2-Ketobutyric acid, 19. Mandelic acid, 20. Creatinine, 21. Glutamic acid, 22. Nicotinic acid, 23. Norepinephrine, 24. Acetic acid. The horizontal dashed grey line shows p = 0.05; the vertical dashed lines represent log2FC = ±1.
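
The permutation test described in this caption can be sketched generically: shuffle the class labels, recompute a performance score, and count how often the shuffled score matches or exceeds the one obtained with the true labels. In the sketch below, `mean_gap` is a toy stand-in for PLS-DA model performance, the data are invented, and the add-one p-value estimate is a common convention rather than something the source specifies.

```python
import random


def permutation_p_value(values, labels, score_fn, n_permutations=2000, seed=0):
    """Estimate how often a label-shuffled score reaches the observed score."""
    rng = random.Random(seed)
    observed = score_fn(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random reassignment of class labels
        if score_fn(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def mean_gap(values, labels):
    """Toy score: absolute difference between the two class means."""
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))


# Invented data: one feature, two well-separated classes of six subjects.
values = [10, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5, 6]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(permutation_p_value(values, labels, mean_gap))
```

A small p-value means that random label assignments rarely perform as well as the true assignment, which is the sense in which the caption speaks of a lack of overfitting.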
Figure 3. UpSet representation showing the metabolites selected as significant by a given classification model (horizontal) in addition to multiple models selecting a given metabolite (vertical).
Figure 4. Metabolite set enrichment analysis establishes, by applying the hypergeometric test, whether compounds implicated in a specific pathway are over-represented compared to chance occurrence. Node centrality, which represents an estimate of node importance, was computed using betweenness centrality, which reflects the number of shortest paths passing through the node. Because the metabolic network is directed, the relative betweenness centrality of a metabolite was used as the importance measure. The betweenness centrality measure is focused on the total network topology. Pathway relevance (represented in terms of circle size) was evaluated as the distance of each point (a metabolic pathway) from the axis origin. Colors represent the matching status of each pathway (number of reported metabolites compared to the total metabolites in the pathway).
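
For orientation, the hypergeometric test mentioned in this caption asks how likely it is to see at least the observed number of pathway members among the selected metabolites if the selection were random. The sketch below computes that upper-tail probability with the standard library only; the function name, parameter names, and the numbers in the example are hypothetical and not taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(universe, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` metabolites are drawn without
    replacement from `universe`, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway_size, i) * comb(universe - pathway_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers: 10 measured metabolites, 4 in the pathway,
# 3 selected as relevant, 2 of which fall in the pathway.
print(hypergeom_enrichment_p(universe=10, pathway_size=4, selected=3, overlap=2))
```

A small probability indicates that the pathway contains more of the selected metabolites than random selection would typically produce.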