Roman Eisner, Russell Greiner, Victor Tso, Haili Wang, Richard N Fedorak.
Abstract
We report an automated diagnostic test that uses the NMR spectrum of a single spot urine sample to accurately distinguish patients who require a colonoscopy from those who do not. Moreover, our approach can be adjusted to trade off sensitivity against specificity. We developed our system using a group of 988 patients (633 normal and 355 who required colonoscopy) who were all at average or above-average risk for developing colorectal cancer. We obtained a metabolic profile of each subject from a urine sample that was analyzed via ¹H-NMR and quantified using targeted profiling. Each subject then underwent a colonoscopy, the gold standard for determining whether he/she actually had an adenomatous polyp, a precursor to colorectal cancer. The metabolic profiles, colonoscopy outcomes, and medical histories were then analyzed using machine learning to create a classifier that could predict whether a future patient requires a colonoscopy. Our empirical studies show that this classifier has a sensitivity of 64% and a specificity of 65% and, unlike the current fecal tests, allows the administrators of the test to adjust the tradeoff between the two.
Year: 2013 PMID: 24307992 PMCID: PMC3838851 DOI: 10.1155/2013/303982
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
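The adjustable tradeoff mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classifier outputs a numeric score per patient and that a score at or above a chosen threshold means "colonoscopy". The scores, labels, and function name below are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold predicts 'colonoscopy'.

    labels: 1 = patient required a colonoscopy, 0 = normal.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical classifier scores for eight patients (illustration only).
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Lowering the threshold raises sensitivity at the cost of specificity.
for threshold in (0.35, 0.55, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

A single fixed-output test, such as the fecal tests discussed later, corresponds to one point on this curve, whereas a scoring classifier lets the operator pick any threshold.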
Results of colonoscopy for subjects in our study, along with the label we give each group for the purposes of training/evaluating a classifier.
| Result of colonoscopy | Total | Label |
|---|---|---|
| Normal | 633 | Normal |
| Hyperplastic | 110 | Colonoscopy |
| Adenoma | 243 | Colonoscopy |
| Colorectal cancer | 2 | Colonoscopy |
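The labeling in the table above can be checked with a short sketch that collapses the four colonoscopy outcomes into the two training labels. The counts come from the table; the helper name `to_label` is illustrative.

```python
# Colonoscopy outcomes and subject counts, copied from the table above.
outcome_counts = {
    "Normal": 633,
    "Hyperplastic": 110,
    "Adenoma": 243,
    "Colorectal cancer": 2,
}


def to_label(outcome):
    """Map a colonoscopy outcome to the binary label used for training."""
    return "Normal" if outcome == "Normal" else "Colonoscopy"


label_totals = {}
for outcome, count in outcome_counts.items():
    label = to_label(outcome)
    label_totals[label] = label_totals.get(label, 0) + count

print(label_totals)
```

The totals agree with the abstract: 633 normal subjects and 355 who required a colonoscopy, 988 in all.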
Clinical features used for prediction.
| Label | Age | Sex | Smoker | GI Bleeding |
|---|---|---|---|---|
| Colonoscopy | | F = 159; M = 196 | Yes = 57; Ex-smoker = 10; No = 274; Unknown = 14 | Yes = 12; No = 343; Unknown = 0 |
| Normal | | F = 364; M = 269 | Yes = 57; Ex-smoker = 14; No = 540; Unknown = 22 | Yes = 8; No = 623; Unknown = 2 |
Performance of LASSO Classifier using various normalization and transformation methods.
| Method | Average AUC | AUC standard |
|---|---|---|
| None | 0.680 | 0.009 |
| **Log** | | |
| Creatinine + Log | 0.698 | 0.019 |
| Sum + Log | 0.701 | 0.018 |
| Vector Length + Log | 0.703 | 0.014 |
| PQ + Log | 0.670 | 0.024 |
The bolded row shows the approach we have decided to use.
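The normalization and transformation options compared above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes each sample is a mapping from metabolite name to a strictly positive concentration, uses the natural logarithm, and omits the PQ variant. The sample values are hypothetical.

```python
import math


def sum_normalize(profile):
    """Divide every concentration by the sample's total concentration."""
    total = sum(profile.values())
    return {name: value / total for name, value in profile.items()}


def vector_length_normalize(profile):
    """Divide every concentration by the Euclidean length of the sample."""
    length = math.sqrt(sum(value * value for value in profile.values()))
    return {name: value / length for name, value in profile.items()}


def creatinine_normalize(profile, reference="Creatinine"):
    """Divide every concentration by the sample's creatinine concentration."""
    creatinine = profile[reference]
    return {name: value / creatinine for name, value in profile.items()}


def log_transform(profile):
    """Natural-log transform; assumes strictly positive values (an assumption)."""
    return {name: math.log(value) for name, value in profile.items()}


# Hypothetical urine profile in arbitrary units (illustration only).
sample = {"Creatinine": 8.0, "Methanol": 2.0, "Acetone": 4.0, "Tyrosine": 2.0}

print(log_transform(sample))                        # "Log"
print(log_transform(creatinine_normalize(sample)))  # "Creatinine + Log"
print(log_transform(sum_normalize(sample)))         # "Sum + Log"
```

Each normalization rescales a whole sample to correct for urine dilution, while the log transform compresses the skewed concentration range before the classifier sees it.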
Performance of various prediction algorithms across 5 folds of cross-validation.
| Model | Average AUC | AUC standard |
|---|---|---|
| Linear-SVM | 0.691 | 0.016 |
| RBF-SVM | 0.690 | 0.008 |
| Linear-SVM + | 0.695 | 0.014 |
| Naïve Bayes | 0.661 | 0.046 |
| PLS-DA | 0.660 | 0.029 |
| **LASSO** | | |
| Random Forest | 0.692 | 0.015 |
| KNN | 0.611 | 0.033 |
| C4.5 | 0.629 | 0.027 |
The bolded row shows the approach we have decided to use.
Figure 1. Performance of linear SVM classifier using all metabolites and clinical features. ROC curve and convex hull showing tradeoff between true positive rate (sensitivity) and false positive rate (1-specificity) for performance: (a) on the training data (resubstitution error) and (b) the testing (evaluation) data, both during cross-validation. The overall performance of the 3 tested fecal tests is also shown.
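A minimal sketch of how an ROC curve and its upper convex hull can be computed from classifier scores is given below. It assumes higher scores indicate "colonoscopy"; the scores and labels are hypothetical, and the hull is built with a standard monotone-chain scan rather than any method stated in the source.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_convex_hull(points):
    """Upper convex hull of ROC points, scanning from left to right."""
    hull = []
    for point in sorted(set(points)):
        # Drop the last hull point while it lies on or below the new segment.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], point) >= 0:
            hull.pop()
        hull.append(point)
    return hull


# Hypothetical scores and labels (1 = colonoscopy, 0 = normal).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(upper_convex_hull(roc_points(scores, labels)))
```

Points on the hull are the operating points worth considering: any ROC point below the hull is dominated by a mixture of two hull points.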
Sensitivity and specificity for fecal tests for polyp detection. Some tests are labeled as N/A because the subject did not take the test.
| Test | Sensitivity (all data) | Specificity (all data) | N/A |
|---|---|---|---|
| Fecal guaiac HemII | 2.6% | 99.4% | 22 (2.2%) |
| Fecal immune ICT | 9.1% | 96.9% | 28 (2.8%) |
| Fecal immune MagSt | 15.1% | 94.5% | 23 (2.3%) |
Feature selection methods used to select 10 metabolites, while using a LASSO classifier and all 4 clinical features.
| Method | Average AUC | AUC standard |
|---|---|---|
| Random | 0.678 | 0.036 |
| **Correlation (Pearson)** | | |
| mRMR | 0.707 | 0.018 |
| Mutual information | 0.712 | 0.015 |
| SVM weights | 0.699 | 0.020 |
| SVM recursive | 0.697 | 0.016 |
The bolded row shows the approach we have decided to use.
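Correlation-based feature selection, the simplest of the methods listed above, can be sketched as ranking metabolites by the absolute Pearson correlation between their concentrations and the binary label. The feature values below are hypothetical, and the function names are illustrative rather than taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def top_k_by_correlation(features, labels, k):
    """Return the k feature names with the largest absolute correlation to the label."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], labels)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical data: 1 = colonoscopy, 0 = normal.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "Noise": [1, 2, 3, 1, 2, 3],         # unrelated to the label
    "Trigonelline": [3, 2, 3, 2, 1, 2],  # higher in the colonoscopy group
    "Methanol": [1, 2, 1, 4, 5, 4],      # lower in the colonoscopy group
}

print(top_k_by_correlation(features, labels, 2))
```

Ranking by absolute value keeps negatively correlated metabolites, which matter as much for prediction as positively correlated ones.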
Figure 2. Performance of LASSO with a varying number of metabolites. Two graphs showing performance for LASSO as we vary the number of metabolites used in addition to the clinical features. (a) shows sensitivity and specificity at the selected threshold point, and (b) shows AUC for the entire ROC curve along with standard error. Note that the graphs do not show the full range [0, 1] of the y-axis.
Correlation coefficient for top 4 metabolites and 4 clinical features, where a positive correlation coefficient shows that the metabolite was positively correlated with patients that need to receive colonoscopies.
| Feature | PubChem CID | Metabolic pathways | Correlation |
|---|---|---|---|
| Methanol | 887 | Gut flora metabolism | −0.16 |
| Age | | | 0.16 |
| Sex | | | 0.12 |
| Trigonelline | 5570 | Nicotinate and nicotinamide metabolism | 0.12 |
| Acetone | 180 | Degradation of ketone bodies; | −0.11 |
| Smoker | | | 0.11 |
| Tyrosine | 6057 | Many amino acid pathways; | 0.09 |
| GI Bleeding | | | 0.07 |
Figure 3. Histogram of 100 permutation test results. None of the permutation results are greater than our best classifier's performance of AUC = 0.715.
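The idea behind a permutation test of this kind can be sketched as follows. This is a simplification: it shuffles the labels against a fixed set of scores, whereas a full analysis would typically retrain the classifier on each permuted labeling. The scores, labels, seed, and function names are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_test(scores, labels, n_permutations=100, seed=0):
    """Return the observed AUC, the permuted AUCs, and an empirical p-value."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    permuted = []
    shuffled = list(labels)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between scores and labels
        permuted.append(auc(scores, shuffled))
    at_least_as_good = sum(1 for a in permuted if a >= observed)
    # Add-one correction so the p-value is never exactly zero.
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, permuted, p_value


# Hypothetical scores and labels (1 = colonoscopy, 0 = normal).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

observed, permuted, p_value = permutation_test(scores, labels)
print(f"observed AUC = {observed}, p = {p_value:.3f}")
```

If no permuted AUC reaches the observed value, as reported in Figure 3 for 100 permutations, the add-one p-value is 1/101, roughly 0.01.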
Figure 4. Performance of LASSO classifier using 4 metabolites and 4 clinical features. ROC curve and convex hull for our final “4 metabolite + survey questions” predictor, on the (a) training data (resubstitution error) and (b) testing (evaluation) data, both during cross-validation. The overall performance of the 3 tested fecal tests is also shown.