Heather Desaire, Milani Wijeweera Patabandige, David Hua.
Abstract
One unifying challenge when classifying biological samples with mass spectrometry data is overcoming the obstacle of sample-to-sample variability so that differences between groups, such as between a healthy set and a disease set, can be identified. Similarly, when the same sample is re-analyzed under identical conditions, instrument signals can fluctuate by more than 10%. This signal inconsistency makes it difficult to identify subtle differences across a set of samples, and it weakens the mass spectrometrist's ability to effectively leverage data in domains as diverse as proteomics, metabolomics, glycomics, and imaging. We selected challenging data sets in the fields of glycomics, mass spectrometry imaging, and bacterial typing to study the problem of within-group signal variability, and we adapted a 30-year-old statistical approach to address it. The solution, the "local-balanced model," relies on using balanced subsets of training data to classify test samples. This analysis strategy was assessed on ESI-MS data of IgG-based glycopeptides, MALDI-MS imaging data of endogenous lipids, and MALDI-MS data of bacterial proteins. Two preliminary examples on non-mass spectrometry data sets are also included to show the potential generality of the method outside the field of MS analysis. We demonstrate that this approach is superior to simple normalization methods, generalizable to multiple mass spectrometry domains, and potentially appropriate in fields as diverse as physics and satellite imaging. In some cases, improvements in classification can be dramatic, with accuracy rising from 60% with normalization alone to over 90% with the additional development described herein.
Keywords: Genomics/proteomics; Glycoprotein; Imaging; Machine learning; Mass spectrometry; Software
Year: 2021 PMID: 33580828 PMCID: PMC8516084 DOI: 10.1007/s00216-020-03117-2
Source DB: PubMed Journal: Anal Bioanal Chem ISSN: 1618-2642 Impact factor: 4.142
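The abstract describes the local-balanced model only as classifying each test sample with a balanced subset of the training data. The following is a minimal sketch of one plausible reading of that idea, not the authors' implementation: for each test sample, the same number of nearest training samples is taken from every class, and a base classifier is fit on that subset alone. The function names, the Euclidean distance, the `k_per_class` parameter, and the nearest-centroid stand-in for the base classifier (the paper uses the Aristotle Classifier and SVM) are all assumptions made for illustration, as is the toy data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def balanced_local_subset(train_X, train_y, query, k_per_class):
    """Indices of the k_per_class nearest training samples from EACH class
    (hypothetical selection rule; the source does not specify one)."""
    chosen = []
    for label in sorted(set(train_y)):
        idx = [i for i, y in enumerate(train_y) if y == label]
        idx.sort(key=lambda i: euclidean(train_X[i], query))
        chosen.extend(idx[:k_per_class])
    return chosen

def nearest_centroid_predict(X, y, query):
    """Stand-in base classifier: return the label of the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = euclidean(centroid, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def local_balanced_predict(train_X, train_y, query, k_per_class=3):
    """Fit the base classifier on a balanced local subset, then classify."""
    idx = balanced_local_subset(train_X, train_y, query, k_per_class)
    sub_X = [train_X[i] for i in idx]
    sub_y = [train_y[i] for i in idx]
    return nearest_centroid_predict(sub_X, sub_y, query)

# Toy data (invented): two acquisition batches whose baseline in the second
# feature drifts by more than the class difference.
train_X = [(0.0, 0.0), (0.2, 0.0), (0.0, 1.0), (0.2, 1.0),
           (10.0, 2.0), (10.2, 2.0), (10.0, 3.0), (10.2, 3.0)]
train_y = [0, 0, 1, 1, 0, 0, 1, 1]
query = (10.0, 2.1)  # a class-0 sample from the drifted batch

global_label = nearest_centroid_predict(train_X, train_y, query)
local_label = local_balanced_predict(train_X, train_y, query, k_per_class=2)
print(global_label, local_label)
```

In this toy case the global model is misled by the batch drift, while the local subset contains only samples that share the query's baseline, so the class difference dominates again.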
Figure 1. Initial characterization of the IgG1 data set. A) The percent of each glycopeptide for 21 samples of native IgG1 (red dots) and 21 samples of modified IgG1 (blue dots). Glycopeptide compositions (top) are proposed based on the high-resolution mass and MS/MS data. B) PCA plot, which shows that the main variability in the data is not related to the change in glycosylation. C) Receiver operating characteristic (ROC) curve for supervised classification using the Aristotle Classifier. D) ROC curve for supervised classification using a Support Vector Machine (SVM).
Figure 2. Error rates and AUC (area under the curve) for the IgG data set. The error rates (expressed in percent error) drop precipitously between the base classifier and the local-balanced model. Likewise, the AUC, a key measure of model performance, increases when the local-balanced model is incorporated with either classifier. SVM = Support Vector Machine; AC = Aristotle Classifier.
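As a reminder of what the two quantities in Figure 2 measure, the sketch below computes a percent error rate and an AUC from classifier scores. It uses the standard rank-based definition of AUC (the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as half). The function names, the scores, and the 0.5 decision threshold are invented for illustration and are not taken from the paper.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def error_rate_percent(predictions, labels):
    """Percent of samples whose predicted label differs from the true label."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return 100.0 * wrong / len(labels)

# Invented scores for six samples (three per class).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

auc = roc_auc(scores, labels)                      # 8 of 9 pairs ranked correctly
err = error_rate_percent(predictions, labels)      # 2 of 6 samples misclassified
print(round(auc, 3), round(err, 1))
```

The AUC depends only on how the samples are ranked, whereas the error rate depends on the chosen threshold, which is why the two are reported side by side.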
Classification Results for Imaging Data
| Metric | Aristotle Classifier (base) | Aristotle Classifier (local-balanced) | SVM (base) | SVM (local-balanced) |
|---|---|---|---|---|
| AUC | 0.910 | 0.994 | 0.999 | 0.999 |
| Accuracy | 80.7% | 97.6% | 97.8% | 97.8% |
No improvements were made with smaller data sets; this local-balanced model includes the full data set.
Classification Results for Satellite Data
| Metric | Aristotle Classifier (base) | Aristotle Classifier (local-balanced) | SVM (base) | SVM (local-balanced) | Benchmark |
|---|---|---|---|---|---|
| AUC | 0.894 | 0.967 | 0.714 | 0.957 | 0.962 |
| G-mean | 0.758 | 0.831 | 0 | 0.804 | 0.782 |
| F-measure (minority) | 0.337 | 0.729 | 0 | 0.718 | 0.689 |
| Accuracy | 62.8% | 94.9% | 90.3% | 94.9% | n/a |
Benchmark data from ref 18; the best approach of 15 different decision tree based methods is shown.
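The base SVM column pairs 90.3% accuracy with a G-mean and minority F-measure of 0, a pattern that arises when a classifier assigns every sample to the majority class. The sketch below illustrates this with the definitions commonly used for imbalanced data: G-mean as the square root of sensitivity times specificity, and F-measure as the harmonic mean of precision and recall on the minority class. The source does not spell out its formulas, so these definitions are an assumption, and the confusion-matrix counts are invented to match a 90.3% majority share; they are not the paper's data.

```python
import math

def imbalance_metrics(tp, fn, tn, fp):
    """Accuracy, G-mean, and minority-class F-measure from confusion counts.

    The minority class is treated as the positive class. Undefined ratios
    (zero denominators) are reported as 0.0 by convention.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(recall * specificity)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "g_mean": g_mean, "f_measure": f_measure}

# Hypothetical set of 1000 samples with 97 minority samples, where the
# classifier labels everything as majority.
m = imbalance_metrics(tp=0, fn=97, tn=903, fp=0)
print(m)
```

Accuracy alone hides this failure, which is why G-mean and the minority F-measure are reported alongside it for the imbalanced satellite data.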
Figure 3. Examples from the Hill-Valley data set. These graphs show data for two different samples: a hill (top) and a valley (bottom). Each graph is a continuous line connecting the 100 numeric features for each sample, which are plotted in order along the X axis. The Y axis, in arbitrary units, could be considered “instrument response”.
Classification Results for Hill-Valley
| Metric | Aristotle Classifier (base) | Aristotle Classifier (L-B#1) | Aristotle Classifier (L-B#2) | SVM (base) | SVM (L-B) |
|---|---|---|---|---|---|
| AUC | 0.521 | 0.588 | 0.986 | 0.798 | 0.975 |
| Accuracy | 0.534 | 0.557 | 0.932 | 0.602 | 0.938 |
L-B#1 is the local-balanced model on the original feature set. L-B#2 is the local-balanced model on an expanded feature set that includes all possible ratios of two features.
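The expanded feature set used for L-B#2 can be illustrated with a short sketch. The source does not say whether both a/b and b/a are included for each pair, so this example assumes one ratio per unordered pair; the function name is hypothetical, and the code assumes no feature value is zero.

```python
from itertools import combinations

def expand_with_ratios(row):
    """Append one ratio row[i] / row[j] for every feature pair with i < j.

    Assumes all feature values are nonzero; a real pipeline would need a
    rule for zero denominators.
    """
    ratios = [row[i] / row[j] for i, j in combinations(range(len(row)), 2)]
    return list(row) + ratios

sample = [2.0, 4.0, 8.0]
expanded = expand_with_ratios(sample)
print(expanded)  # → [2.0, 4.0, 8.0, 0.5, 0.25, 0.5]

# Under this one-ratio-per-pair assumption, 100 features give
# 100 * 99 / 2 = 4950 ratios, or 5050 features in total.
print(len(expand_with_ratios([1.0] * 100)))  # → 5050
```

A ratio of two features is unchanged when the whole sample is multiplied by a constant, so ratio features are insensitive to overall differences in signal scale between samples.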
Classification Results for Bacterial Typing with SVM
| Experiment | # samples | Model | AUC | Accuracy |
|---|---|---|---|---|
| Train (LOO) | 571 | Base Model | 0.983 | 0.95 |
| Train (LOO) | 571 | Local-Balanced | 0.998 | 0.98 |
| Test | 248 | Base Model | 0.963 | 0.93 |
| Test | 248 | Local-Balanced | >0.999 | 0.99 |