Eric R Kehoe, Bryna L Fitzgerald, Barbara Graham, M Nurul Islam, Kartikay Sharma, Gary P Wormser, John T Belisle, Michael J Kirby.
Abstract
We provide a pipeline for data preprocessing, biomarker selection, and classification of liquid chromatography-mass spectrometry (LCMS) serum samples to generate a prospective diagnostic test for Lyme disease. We utilize tools of machine learning (ML), e.g., sparse support vector machines (SSVM), iterative feature removal (IFR), and k-fold feature ranking to select several biomarkers and build a discriminant model for Lyme disease. We report a 98.13% test balanced success rate (BSR) of our model based on a sequestered test set of LCMS serum samples. The methodology employed is general and can be readily adapted to other LCMS, or metabolomics, data sets.
Year: 2022 PMID: 35087163 PMCID: PMC8795431 DOI: 10.1038/s41598-022-05451-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1 (a) UMAP visualization of log transformed and KNN imputed LC-MS data from training samples. (b) The same visualization post IFR. (c) The same visualization restricted to the features found by IFR. EDL early disseminated Lyme disease, ELL early localized Lyme disease, HCN healthy control non-endemic, HCE1 healthy control endemic site 1.
Figure 2 Magnitude of weights in the SSVM model used in kFFS on training samples. The labels at the bottom indicate the transformation/imputation scheme used on the data, while the numeric ticks indicate the fold in kFFS.
Fivefold LCMS accuracy and standard deviation scores for several transformation and imputation schemes post feature selection.
| Method | Mean fivefold accuracy (%) | Standard deviation (%) |
|---|---|---|
| Raw peak areas/KNN imputed | 95.6 | 2.8 |
| Standardized/KNN imputed | 98.0 | 1.4 |
| Log transformed/KNN imputed | 97.6 | 1.6 |
| Median-fold change normalized/KNN imputed | 92.9 | 2.4 |
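The mean and standard deviation columns summarize the accuracy obtained on each of the five folds. A minimal sketch of that summary follows. The per-fold values are hypothetical (the source reports only the aggregates), the function name is illustrative, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize_folds(fold_accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies in percent."""
    return mean(fold_accuracies), stdev(fold_accuracies)


# Hypothetical per-fold accuracies (%), for illustration only; not from the paper.
example_folds = [94.0, 96.0, 98.0, 96.0, 96.0]
fold_mean, fold_sd = summarize_folds(example_folds)
print(f"mean = {fold_mean:.1f}%, sd = {fold_sd:.1f}%")
```

Swapping `stdev` for `statistics.pstdev` gives the population standard deviation instead, which is slightly smaller for five folds.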
Biomarkers selected by kFFS on training samples. The RT column indicates the retention time in seconds.
| Method | RT (s) | m/z | Percent missing (%) | Occurrence | Notes |
|---|---|---|---|---|---|
| None/KNN | 1103.504 | 481.349 | 0.00 | 5 | Targeted m/z 480.3453 |
| | 1256.933* | 469.389* | 0.00 | 4 | |
| | 255.29* | 227.087* | 0.00 | 3 | |
| | 1172.216 | 746.563 | 14.41 | 3 | |
| | 96.231 | 120.081 | 5.93 | 1 | Targeted m/z 166.0862 |
| | 134.919 | 188.069 | 0.00 | 1 | Targeted m/z 205.09718 |
| | 958.025 | 244.263 | 0.00 | 1 | |
| | 240.743* | 247.142* | 0.00 | 1 | |
| | 684.719 | 314.157 | 0.00 | 1 | Targeted m/z 313.1535 |
| | 1184.953 | 341.248 | 0.00 | 1 | |
| | 1321.413 | 449.266 | 0.00 | 1 | |
| | 710.183 | 472.239 | 0.85 | 1 | Targeted m/z 471.7369 |
| | 1165.713 | 508.377 | 0.00 | 1 | |
| | 845.998* | 831.646* | 0.00 | 1 | |
| Median fold change/KNN | 1018.741 | 174.131 | 0.85 | 4 | |
| | 255.29* | 227.087* | 0.00 | 4 | |
| | 748.564 | 1240.487 | 0.00 | 4 | Not targeted, isotopic peak of m/z 1238.496 |
| | 240.743* | 247.142* | 0.00 | 2 | |
| | 1256.933* | 469.389* | 0.00 | 2 | |
| | 1164.732 | 470.352 | 0.00 | 2 | |
| | 845.772 | 831.846 | 0.85 | 2 | |
| | 959.672 | 286.144 | 0.00 | 1 | |
| | 1195.622 | 331.225 | 0.00 | 1 | |
| | 891.151 | 829.697 | 0.00 | 1 | |
| | 926.235 | 1086.303 | 0.00 | 1 | |
| | 746.33 | 1238.496 | 2.54 | 1 | |
| Log/KNN | 737.416 | 280.151 | 5.08 | 5 | |
| | 739.352* | 152.016* | 0.00 | 4 | |
| | 1129.774 | 803.572 | 22.03 | 4 | |
| | 739.409 | 238.089 | 38.14 | 2 | |
| | 642.845 | 358.242 | 0.85 | 2 | |
| | 721.821* | 504.337* | 0.00 | 2 | |
| | 835.911 | 1042.803 | 7.63 | 2 | Targeted m/z 1042.5782 |
| | 146.315 | 181.07 | 0.85 | 1 | |
| | 1034.796 | 567.402 | 0.85 | 1 | Targeted m/z 566.3996 |
| | 1078.422 | 786.549 | 14.41 | 1 | Targeted m/z 785.5421 |
| | 837.161 | 834.244 | 1.69 | 1 | |
| Standard/KNN | 967.457 | 194.117 | 1.69 | 4 | |
| | 1045.362 | 478.348 | 4.24 | 3 | |
| | 721.821* | 504.337* | 0.00 | 3 | |
| | 739.352* | 152.016* | 0.00 | 2 | |
| | 255.162 | 169.084 | 0.00 | 2 | |
| | 984.816 | 174.127 | 2.54 | 2 | Not targeted, not present in both LCMS runs |
| | 1231.212 | 429.322 | 2.54 | 2 | Targeted m/z 428.3219 |
| | 758.53 | 671.999 | 5.93 | 2 | Targeted m/z 670.9956 |
| | 1179.631 | 293.401 | 0.85 | 1 | |
| | 1192.645 | 317.407 | 1.69 | 1 | Targeted m/z 317.2475 |
| | 1016.034 | 493.353 | 2.54 | 1 | |
| | 1711.489 | 814.687 | 0.00 | 1 | Targeted m/z 813.6872 |
| | 954.18 | 1569.349 | 0.00 | 1 | Not targeted, atypical MS spectra |
| | 845.998* | 831.646* | 0.00 | 1 | |
The m/z column indicates the mass-to-charge ratio of the metabolite. The percent missing column indicates the percentage of samples in which the feature was missing. The occurrence column indicates in how many of the five kFFS folds the feature was selected. The method column indicates the normalization/imputation method used. An asterisk (*) on a feature indicates that it was selected under more than one method.
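The occurrence counts and the asterisk marks can both be derived from per-fold selections. The sketch below shows one way to do that bookkeeping. The function names and the feature labels (`f1`, `f2`, ...) are hypothetical, the example selections are made up for illustration, and this is not presented as the authors' implementation of kFFS.

```python
from collections import Counter


def count_occurrences(selected_per_fold):
    """Count in how many folds each feature was selected."""
    counts = Counter()
    for fold_features in selected_per_fold:
        counts.update(set(fold_features))  # count a feature at most once per fold
    return counts


def picked_by_multiple_methods(occurrences_by_method):
    """Return the features selected under more than one method (the starred rows)."""
    method_counts = Counter()
    for counts in occurrences_by_method.values():
        method_counts.update(counts.keys())
    return {feature for feature, n in method_counts.items() if n > 1}


# Hypothetical selections over five folds for two methods; not data from the paper.
log_folds = [["f1", "f2"], ["f1"], ["f1", "f3"], ["f1", "f2"], ["f1"]]
standard_folds = [["f2"], ["f4"], ["f2"], [], ["f4"]]

occurrences = {
    "Log/KNN": count_occurrences(log_folds),
    "Standard/KNN": count_occurrences(standard_folds),
}
starred = picked_by_multiple_methods(occurrences)
```

Here `f1` has an occurrence of 5 under the first method, and only `f2` would carry an asterisk, since it is the only feature selected under both methods.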
Figure 3 PCA visualization of log transformed and KNN imputed LC-MS data from training samples restricted to the optimal 45 features found by kFFS.
Confusion matrix for classification of test samples restricted to 42 selected biomarkers with LCMS classifier using log normalized features.
| True class | Subgroup | Predicted Lyme (subgroup) | Predicted Lyme (total) | Predicted healthy (subgroup) | Predicted healthy (total) |
|---|---|---|---|---|---|
| True Lyme | ELL | 40 | 77 | 0 | 3 |
| | EDL | 37 | | 3 | |
| True healthy | HCE1 | 0 | 0 | 30 | 38 |
| | HCE2 | 0 | | 8 | |
Statistical scores (Lyme = positive) for classification of test samples restricted to 42 selected biomarkers with LCMS classifier using log transformed features.
| Scoring method | Score (%) |
|---|---|
| Test sensitivity (TPR) | 96.25 |
| Test specificity (TNR) | 100.00 |
| Test false discovery rate (FDR) | 0.00 |
| Test false omission rate (FOR) | 7.32 |
| Test accuracy | 97.46 |
| Test balanced success rate (BSR) | 98.13 |
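Each score follows from the confusion matrix totals. With Lyme as the positive class, the totals are 77 true positives, 3 false negatives, 0 false positives, and 38 true negatives. The sketch below recomputes the scores from those counts using the standard definitions; the function and key names are illustrative.

```python
def binary_scores(tp, fn, fp, tn):
    """Return standard binary classification scores as percentages."""
    sensitivity = tp / (tp + fn)            # true positive rate (TPR)
    specificity = tn / (tn + fp)            # true negative rate (TNR)
    scores = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fdr": fp / (fp + tp),              # false discovery rate
        "for": fn / (fn + tn),              # false omission rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "bsr": (sensitivity + specificity) / 2,  # balanced success rate
    }
    return {name: 100.0 * value for name, value in scores.items()}


# Totals read from the confusion matrix, with Lyme as the positive class.
test_scores = binary_scores(tp=77, fn=3, fp=0, tn=38)
for name, value in test_scores.items():
    print(f"{name}: {value:.3f}")
```

The balanced success rate is the mean of sensitivity and specificity, (96.25 + 100.00) / 2 = 98.125, which the table reports as 98.13.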
Figure 4 (a) Projection of log transformed, health state labeled training and test samples onto the SSVM hyperplane normal, represented as the x-axis. (b) The same projection with disease state labels. In both panels, the y-axis represents the first principal component in the PCA decomposition of the training and test samples projected onto the orthogonal complement of the hyperplane normal. The solid line indicates the hyperplane boundary, or decision boundary, and the dotted lines indicate the hyperplane margins. Distance from the decision boundary indicates how strong the classification is: farther is stronger, closer is weaker. EDL early disseminated Lyme disease, ELL early localized Lyme disease, HCN healthy control non-endemic, HCE1 healthy control endemic site 1, HCE2 healthy control endemic site 2.
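The two plot coordinates described in the Figure 4 caption can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the weight vector `w`, the bias `b`, and the random data are all hypothetical, and the x-axis is scaled here as signed Euclidean distance from the boundary, a choice the caption does not specify.

```python
import numpy as np


def project_for_plot(X, w, b):
    """Return x and y plot coordinates and the first principal direction.

    X is (n_samples, n_features); the hyperplane is w . x + b = 0.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w)
    w_hat = w / norm

    # x-axis: signed distance from the decision boundary along the normal.
    x_coord = (X @ w + b) / norm

    # Remove the component along the normal, leaving the orthogonal complement.
    residual = X - np.outer(X @ w_hat, w_hat)

    # y-axis: score on the first principal component of the residual.
    centered = residual - residual.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    y_coord = centered @ pc1
    return x_coord, y_coord, pc1


# Hypothetical data and hyperplane, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
w_demo = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
b_demo = 0.3
x_coord, y_coord, pc1 = project_for_plot(X_demo, w_demo, b_demo)
```

Under this scaling the decision boundary sits at x = 0 and the margins at x = ±1/‖w‖. The sign of the y-axis is arbitrary, because a principal direction is only defined up to sign.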
MS/MS results for biomarkers selected by kFFS. The RT column indicates the retention time in minutes.
| RT (min) | m/z | MS/MS | Structural ID | Level | Description |
|---|---|---|---|---|---|
| 2.46 | 181.070201 | Yes | Theobromine | 1 | Organoheterocyclic compounds/imidazopyrimidines/purines and purine derivatives |
| 2.78 | 205.09718 | Yes | Tryptophan | 1 | Organoheterocyclic compounds/indoles and derivatives/indolyl carboxylic acids and derivatives |
| 16.09 | 286.143724 | Yes | Piperine | 1 | Alkaloids and derivatives |
| 1.81 | 166.0862 | Yes | Phenylalanine | 1 | Organic acids and derivatives/carboxylic acids and derivatives/amino acids, peptides, and analogues |
| 11.42 | 313.1535 | Yes | Phe–Phe | 2 | Organic acids and derivatives/carboxylic acids and derivatives/amino acids, peptides, and analogues |
| 19.92 | 317.247506 | Yes | 14(15)-Epoxy-5Z,8Z,11Z-eicosatrienoic acid [M-H2O]+ | 2 | Lipids and lipid-like molecules/Fatty acyls/fatty acids and conjugates |
| 19.54 | 508.377209 | Yes | PC(O-18:0/0:0) | 2 | Lipids and lipid-like molecules/glycerophospholipids/glycerophosphocholines |
| 18.47 | 480.3453 | Yes | PC(P-16:0/0:0) | 2 | Lipids and lipid-like molecules/glycerophospholipids/glycerophosphocholines |
| 4.7 | 227.087183 | Yes | Na+ adduct of lactone (similar fragmentation to cis-jasmone) | 3 | Organic oxygen compounds/organooxygen compounds/carbonyl compounds |
| 2.74 | 247.142426 | Yes | Related to tryptophan | 3 | Organoheterocyclic compounds/indoles and derivatives/indolyl carboxylic acids and derivatives |
| 21.13 | 469.389367 | Yes | Unsaturated alkyl chain | 3 | Lipids and lipid-like molecules/fatty acyls/fatty acids and conjugates |
| 14.96 | 829.696851 | Yes | Peptide | 3 | Organic polymers/polypeptides |
| 14.13 | 831.845956 | Yes | Peptide | 3 | Organic polymers/polypeptides |
| 15.46 | 1086.303121 | Yes | Peptide | 3 | Organic polymers/polypeptides |
| 12.55 | 1238.496491 | Yes | Peptide | 3 | Organic polymers/polypeptides |
| 19.63 | 746.563218 | Yes | Peptide | 3 | Organic polymers/polypeptides |
| 14.13 | 831.646014 | Yes | Peptide | 3 | Organic polymers/polypeptides |
| 12.42 | 152.016163 | Yes | | 4 | |
| 4.27 | 169.084118 | Yes | | 4 | |
| 16.99 | 174.130592 | Yes | | 4 | |
| 12.42 | 238.089239 | Yes | | 4 | |
| 19.7 | 293.400601 | Yes | | 4 | |
| 19.58 | 341.248414 | Yes | | 4 | |
| 10.78 | 358.242021 | Yes | | 4 | |
| 20.6 | 428.3219 | Yes | | 4 | |
| 11.79 | 471.7369 | Yes | | 4 | |
| 17.63 | 478.347583 | Yes | | 4 | |
| 16.98 | 493.352828 | Yes | | 4 | |
| 12.12 | 504.336795 | Yes | | 4 | |
| 17.3 | 566.3996 | Yes | | 4 | |
| 12.64 | 670.9956 | Yes | | 4 | |
| 18.04 | 785.5421 | Yes | | 4 | |
| 16.23 | 194.117098 | No | | | |
| 16.03 | 244.263279 | No | | | |
| 12.4 | 280.151108 | No | | | |
| 19.92 | 331.224627 | No | | | |
| 22.23 | 449.266367 | No | | | |
| 19.53 | 470.351806 | No | | | |
| 18.95 | 803.571864 | No | | | |
| 28.49 | 813.6872 | No | | | |
| 14.13 | 834.244267 | No | | | |
| 13.84 | 1042.5782 | No | | | |
The m/z column indicates the mass-to-charge ratio of the metabolite.
Figure 5 (a) Diagram of kFFS. (b) Diagram of building the final model.
Figure 6 Fivefold classification accuracy of the SSVM model for different values of the hyperparameter C. The solid line indicates the mean accuracy across the five folds, while the shaded region indicates one standard deviation of the accuracy.