Jeremy R Ash, Melaine A Kuenemann, Daniel Rotroff, Alison Motsinger-Reif, Denis Fourches.
Abstract
Developing predictive and transparent approaches to the analysis of metabolite profiles across patient cohorts is of critical importance for understanding the events that trigger or modulate traits of interest (e.g., disease progression, drug metabolism, chemical risk assessment). However, metabolites' chemical structures are still rarely used in the statistical modeling workflows that establish these trait-metabolite relationships. Herein, we present a novel cheminformatics-based approach capable of identifying predictive, interpretable, and reproducible trait-metabolite relationships. As a proof-of-concept, we utilize a previously published case study consisting of metabolite profiles from non-small-cell lung cancer (NSCLC) adenocarcinoma patients and healthy controls. By characterizing each structurally annotated metabolite using both computed molecular descriptors and patient metabolite concentration profiles, we show that these complementary features enhance the identification and understanding of key metabolites associated with cancer. Ultimately, we built multi-metabolite classification models for assessing patients' cancer status using specific groups of metabolites identified based on high structural similarity through chemical clustering. We subsequently performed a metabolic pathway enrichment analysis to identify potential mechanistic relationships between metabolites and NSCLC adenocarcinoma. This cheminformatics-inspired approach relies on the metabolites' structural features and chemical properties to provide critical information about metabolite-trait associations. This method could ultimately facilitate biological understanding and advance research based on metabolomics data, especially with respect to the identification of novel biomarkers.
Keywords: Chemical structure; Cheminformatics; Data mining; Metabolomics; Molecular fragmentation; Statistics; Visualization
Year: 2019 PMID: 31236709 PMCID: PMC6591908 DOI: 10.1186/s13321-019-0366-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Distribution of intensities for metabolites significantly associated with cancer status in the training set. ADC1 (training) and ADC2 (test) set boxplots shown for healthy (blue) and adenocarcinoma (red) patients. Significant plasma and serum metabolites in ADC1 were determined by a paired t test. * (FDR < .075), ** (FDR < .01), *** (FDR < .001). Many of the metabolites that are significant in ADC1 are also significant in ADC2. Some show more significant differences in the ADC2
Fig. 2 PCA of all metabolite and significantly different metabolite profiles. a All plasma, b significant plasma, c all serum, and d significant serum metabolite profiles
Fig. 3 Integrated circular dendrogram generated using MACCS fingerprints with average linkage and Soergel distance. A cell next to a metabolite name is colored green if the metabolite has a significant difference in mean relative abundance between cancer and control patients in one of the data sets (ADC1/ADC2, serum/plasma) after correction for multiple testing. Metabolite names are colored green if they were significant in at least one data set. Fisher exact test for greater probability of significance for metabolites within the highlighted cluster (orange) than those without: * (FDR < .05), ** (FDR < .01), *** (FDR < .001). The metabolites highlighted in blue were selected by our multi-metabolite procedure to form the best classifier without using information about the test set
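The clustering named in this caption (Soergel distance on binary fingerprints, merged with average linkage) can be illustrated with a minimal sketch. For binary fingerprints the Soergel distance equals one minus the Tanimoto similarity. The bit sets and names below are hypothetical toy values, not real MACCS keys or metabolites from the study, and the naive merge loop is only an illustration, not the authors' implementation.

```python
from itertools import combinations


def soergel(a, b):
    """Soergel distance between two binary fingerprints given as sets of on-bits.

    For binary vectors this is 1 - |A & B| / |A | B| (one minus Tanimoto).
    """
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def average_linkage(fps, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise Soergel distance."""
    names = list(fps)
    dist = {frozenset(p): soergel(fps[p[0]], fps[p[1]])
            for p in combinations(names, 2)}

    def linkage(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical toy fingerprints (sets of on-bit indices), for illustration only.
toy_fps = {
    "acid_a": {1, 2, 3, 4},
    "acid_b": {1, 2, 3, 5},
    "acid_c": {1, 2, 4, 5},
    "sugar_a": {10, 11, 12},
    "sugar_b": {10, 11, 13},
}
clusters = average_linkage(toy_fps, n_clusters=2)
print(clusters)
```

Cutting the merge sequence at a chosen number of clusters corresponds to cutting the dendrogram at a fixed height; in practice a library routine for hierarchical clustering would replace the quadratic loop shown here.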
Fig. 4 Metabolite structures from the cluster containing a large proportion of significant metabolites. Mean metabolite abundance fold change for cancer versus healthy patients in serum and plasma ADC1 and ADC2 data sets
Fig. 5 Integrated circular dendrogram generated using MACCS fingerprints with average linkage and Soergel distance. Cells next to metabolite names are colored dark blue if the metabolite belongs to a pathway significantly enriched (hypergeometric test; FDR < .05) for metabolites found to be significant in the differential analysis for ADC1 serum (top band) or ADC1 plasma (bottom band). Metabolite names are colored green if they were significant in at least one data set. Fisher exact test for greater probability of pathway membership for metabolites within the highlighted cluster (orange) than those without: * (FDR < .05), ** (FDR < .01), *** (FDR < .001)
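The pathway enrichment test mentioned in this caption can be sketched as a one-sided hypergeometric tail probability followed by an FDR adjustment. The captions do not say which FDR procedure was used; the sketch below assumes Benjamini-Hochberg. The function names and all numbers are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_pathway, n_signif, n_overlap):
    """P(X >= n_overlap), where X counts pathway members among n_signif
    metabolites drawn without replacement from n_total metabolites,
    n_pathway of which belong to the pathway."""
    upper = min(n_pathway, n_signif)
    denom = comb(n_total, n_signif)
    tail = sum(comb(n_pathway, k) * comb(n_total - n_pathway, n_signif - k)
               for k in range(n_overlap, upper + 1))
    return tail / denom


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 100 annotated metabolites, 8 in a pathway,
# 12 significant overall, 5 of those in the pathway.
p = hypergeom_enrichment_p(n_total=100, n_pathway=8, n_signif=12, n_overlap=5)

# Hypothetical raw p-values for three pathways, adjusted together.
fdr = benjamini_hochberg([0.01, 0.04, 0.03])
print(p, fdr)
```

The one-sided hypergeometric tail is the same quantity as a one-sided Fisher exact test on the corresponding 2x2 table, which is why both tests appear naturally in the same kind of analysis.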
Performance measures for selected serum models predicting cancer status
| Model | ADC1 (training) LOOCV accuracy (%) | ADC1 sensitivity (%) | ADC1 specificity (%) | ADC1 AUC | ADC2 (test) external validation accuracy (%) | ADC2 sensitivity (%) | ADC2 specificity (%) | ADC2 AUC |
|---|---|---|---|---|---|---|---|---|
| Single metabolite classifiers | | | | | | | | |
| Aspartic Acid | | 40.8 | 96.8 | 0.698 | | 62.8 | 95.3 | 0.862 |
| Cystine | 70.0* | 75.5 | 61.3 | 0.685 | 55.8* | 76.7 | 34.9 | 0.677 |
| Glutamic Acid | 62.5 | 42.9 | 93.5 | 0.687 | 76.7 | 65.1 | 88.4 | 0.846 |
| Oxalic Acid | 70.0* | 83.7 | 48.4 | 0.65 | 57.0* | 88.4 | 25.6 | 0.649 |
| Multi-metabolite classifiers: clustered metabolites | | | | | | | | |
| Cluster 1 (a) SVM (b) | | 77.6 | 74.2 | 0.751 | | 72.1 | 97.7 | 0.856 |

Asterisk represents best model accuracies according to LOOCV. Best model accuracies according to external validation accuracy are underlined
(a) Aspartic acid, cystine, glutamic acid
(b) Support vector machines
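As a reading aid for the accuracy, sensitivity, and specificity columns, the following minimal sketch shows how the three measures follow from confusion-matrix counts. The counts are hypothetical: they were chosen only so that the rounded outputs match the Cystine ADC1 row, and cohort sizes are not stated in this excerpt.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity as percentages.

    tp/fn: cancer patients classified correctly/incorrectly
    tn/fp: healthy controls classified correctly/incorrectly
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": round(100 * accuracy, 1),
        "sensitivity": round(100 * sensitivity, 1),
        "specificity": round(100 * specificity, 1),
    }


# Hypothetical counts (an assumption, not reported in the source).
metrics = classification_metrics(tp=37, fn=12, tn=19, fp=12)
print(metrics)
```

Because accuracy is a class-size-weighted mean of sensitivity and specificity, a model with very high specificity but low sensitivity can still show moderate accuracy, which is worth keeping in mind when comparing rows.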
Performance measures for selected plasma models predicting cancer status
| Model | ADC1 (training) LOOCV accuracy (%) | ADC1 sensitivity (%) | ADC1 specificity (%) | ADC1 AUC | ADC2 (test) external validation accuracy (%) | ADC2 sensitivity (%) | ADC2 specificity (%) | ADC2 AUC |
|---|---|---|---|---|---|---|---|---|
| Single metabolite classifiers | | | | | | | | |
| 3-phosphoglycerate | 70.7 | 60.8 | 87.1 | 0.734 | 51.2 | 34.9 | 67.4 | 0.578 |
| Maltose | 74.4* | 82.4 | 61.3 | 0.701 | 57.0* | 62.8 | 51.2 | 0.607 |
| Pyrophosphate | | 66.7 | 74.2 | 0.703 | | 67.4 | 86.0 | 0.811 |
| Multi-metabolite classifiers: clustered metabolites | | | | | | | | |
| Cluster 1 (a) SVM (b) | | 86.3 | 71.0 | 0.713 | | 72.1 | 69.8 | 0.675 |

Asterisk represents best model accuracies according to LOOCV accuracy. Best model accuracies according to external validation accuracy are underlined
(a) 3-Phosphoglycerate, pyrophosphate
(b) Support vector machines