Francisca Barceló1,2, Rosa Gomila3, Ivan de Paul2,4, Xavier Gili2,4, Jaume Segura2,4, Albert Pérez-Montaña5, Teresa Jimenez-Marco6, Antonia Sampol5, José Portugal7.
Abstract
Monoclonal gammopathy of undetermined significance (MGUS) is a plasma cell dyscrasia that can progress to malignant multiple myeloma (MM). Specific molecular biomarkers to classify the MGUS status and discriminate the initial asymptomatic phase of MM have not been identified. We examined the serum peptidome profile of MGUS patients and healthy volunteers using MALDI-TOF mass spectrometry and developed a predictive model for classifying serum samples. The predictive model was built using a support vector machine (SVM) supervised learning method tuned by applying a 20-fold cross-validation scheme. Predicting class labels in a blinded test set containing randomly selected MGUS and healthy control serum samples validated the model. The generalization performance of the predictive model was evaluated by a double cross-validation method that showed 88% average model accuracy, 89% average sensitivity and 86% average specificity. Our model, which classifies unknown serum samples as belonging to either MGUS patients or healthy individuals, can be applied to clinical diagnosis.
Year: 2018 PMID: 30071092 PMCID: PMC6072114 DOI: 10.1371/journal.pone.0201793
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Patient demographics and disease characteristics.
| Sample set | Number of samples | Male/Female | Age range | Age | M-protein (g/dL) |
|---|---|---|---|---|---|
| Healthy controls | 108 | 67/41 | 21–66 | 46 ± 9 | Below cut-off |
| MGUS patients | 103 | 50/53 | 41–88 | 66 ± 12 | 0.65 ± 0.41 |
a All serum samples were from Caucasian people. Clinical data were collected at the time of diagnosis.
b MGUS encompasses serum samples of the following isotypes: IgG κ (38), IgA κ (12), IgM κ (7), IgG λ (27), IgA λ (13), IgM λ (2), IgG κ + IgM κ (1), IgG κ + IgM λ (1), IgA κ + IgA λ (1), IgM κ + IgM λ (1).
Fig 1. 20-fold cross-validation scheme.
Confusion matrix.
| Predicted class | Biological group: MGUS | Biological group: HC |
|---|---|---|
| MGUS | True Positive (TP) | False Positive (FP) |
| HC | False Negative (FN) | True Negative (TN) |
Each cell represents a count of predictions falling into the corresponding category (MGUS or HC).
Classifier performance measures based on the confusion matrix method.
| Measure | Formula |
|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| Sensitivity (True Positive Rate) | TP / (TP + FN) |
| Specificity (True Negative Rate) | TN / (TN + FP) |
Accuracy was computed as the proportion of correctly classified samples. Sensitivity and specificity were computed as the rate of correctly predicted samples in the positive and negative labeled class, respectively (TP: true positive; TN: true negative; FP: false positive; FN: false negative).
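The three measures defined above can be written as a short function. This is a minimal sketch: the function name `classifier_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classifier_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    MGUS is treated as the positive class and HC as the negative class.
    """
    total = tp + fp + fn + tn
    return {
        # Proportion of correctly classified samples
        "accuracy": (tp + tn) / total,
        # Rate of correctly predicted samples in the positive class
        "sensitivity": tp / (tp + fn),
        # Rate of correctly predicted samples in the negative class
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, for illustration only (not data from the study).
metrics = classifier_metrics(tp=9, fp=1, fn=1, tn=9)
print(metrics)
```

The sketch does not guard against an empty class; a test set without positive or without negative samples would raise a `ZeroDivisionError`.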
Fig 2. Double cross-validation scheme.
The scheme highlights the two nested loops. The outer cross-validation loop provides 10 performance estimates, each obtained by predicting the corresponding test set with the optimized model built in the inner 20-fold cross-validation loop. The data set used to build and tune the model in the inner cross-validation loop is completely independent of the test set used in the outer iteration.
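The nesting of the two loops can be illustrated with index partitioning alone. The sketch below is not the authors' implementation: the helper names, the random seed, the absence of class stratification, and the choice of 211 samples (the total in the demographics table, used here only as a convenient size) are all assumptions. It only shows that the inner 20 folds are drawn from what remains after the outer test fold is set aside.

```python
import random


def k_fold_indices(indices, k, seed):
    """Shuffle `indices` and split them into k disjoint, near-equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def double_cv_splits(n_samples, outer_k=10, inner_k=20, seed=0):
    """Yield (outer_test, inner_splits) for each outer iteration.

    inner_splits is a list of (train, validation) index pairs built only
    from the samples left after removing the outer test fold.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        remaining = [idx for j, fold in enumerate(outer_folds)
                     if j != i for idx in fold]
        inner_folds = k_fold_indices(remaining, inner_k, seed + 1 + i)
        inner_splits = []
        for m, validation in enumerate(inner_folds):
            train = [idx for n, fold in enumerate(inner_folds)
                     if n != m for idx in fold]
            inner_splits.append((train, validation))
        yield outer_test, inner_splits


splits = list(double_cv_splits(n_samples=211))
print(len(splits), "outer iterations,", len(splits[0][1]), "inner folds each")
```

In a full pipeline, the model would be tuned on the inner splits and then scored once on `outer_test`, giving one performance estimate per outer iteration.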
Spectral peaks selected in the analysis of technical replicates mass spectra.
| (A) m/z (Da) | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 2554.40 | 2660.76 | 2755.04 | 3192.90 | 3242.07 | 3263.82 | 3954.33 | 4092.44 | 4211.37 | 5906.25 |
| 6434.47 | 6632.72 | 7767.02 | | | | | | | |

| (B) m/z (Da) | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 2021.69 | 2082.35 | 2114.80 | 2192.65 | 2209.83 | 2378.88 | 2495.96 | 2554.40 | 2604.38 | 2641.45 |
| 2660.76 | 2723.83 | 2755.04 | 2769.70 | 2863.08 | 2884.92 | 2933.20 | 2954.46 | 3159.17 | 3192.90 |
| 3215.50 | 3242.07 | 3263.82 | 3449.01 | 3884.10 | 3954.33 | 4055.61 | 4092.44 | 4211.37 | 4269.32 |
| 4283.17 | 4644.92 | 4965.40 | 5338.46 | 5906.25 | 6434.47 | 6632.72 | 7767.10 | 9133.84 | 9290.47 |
(A) Set of m/z reference peaks with a frequency greater than 90% used for spectra alignment.
(B) Set of m/z spectral features with a frequency greater than 50% used to build the feature matrix for statistical analysis.
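Selecting peaks by detection frequency, as in panels (A) and (B), could look roughly like the sketch below. The record does not say how peaks from different spectra were matched, so the rounding-based binning, the `tolerance_da` parameter, the function name, and the example peak lists are all assumptions made for illustration.

```python
from collections import defaultdict


def select_peaks_by_frequency(spectra_peaks, min_frequency, tolerance_da=1.0):
    """Return mean m/z of peaks found in more than `min_frequency` of spectra.

    spectra_peaks: one list of m/z values per spectrum.
    Peaks are grouped by rounding m/z to a grid of width `tolerance_da`.
    This is a simplistic rule: two peaks close to a bin edge can land in
    different bins even when they are nearer than the tolerance.
    """
    counts = defaultdict(int)
    sums = defaultdict(float)
    for peaks in spectra_peaks:
        per_spectrum = {}
        for mz in peaks:
            # Count each bin at most once per spectrum
            per_spectrum.setdefault(round(mz / tolerance_da), mz)
        for bin_id, mz in per_spectrum.items():
            counts[bin_id] += 1
            sums[bin_id] += mz
    n_spectra = len(spectra_peaks)
    return sorted(sums[b] / counts[b] for b in counts
                  if counts[b] / n_spectra > min_frequency)


# Hypothetical peak lists for four technical replicates (not study data).
replicate_peaks = [
    [2554.40, 2660.76, 3192.90],
    [2554.38, 2660.80],
    [2554.43, 2660.70, 3192.95],
    [2554.41, 3192.88],
]
reference_peaks = select_peaks_by_frequency(replicate_peaks, 0.90)
feature_peaks = select_peaks_by_frequency(replicate_peaks, 0.50)
print(len(reference_peaks), len(feature_peaks))
```

With these made-up lists, only the peak near 2554.4 Da appears in more than 90% of the spectra, while all three peaks pass the 50% threshold, mirroring the stricter criterion for alignment peaks than for features.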
Classifier performance estimates obtained from the 20-fold cross-validation scheme.
| Fold # | gamma | coef0 | cost | Sensitivity | Specificity | Accuracy | p-value |
|---|---|---|---|---|---|---|---|
| 1 | 0.00010 | 0.12 | 150 | 1.00 | 1.00 | 1.00 | 0.0077 |
| 2 | 0.00060 | 0.09 | 150 | 1.00 | 0.86 | 0.93 | 0.0009 |
| 3 | 0.00005 | 0.13 | 175 | 1.00 | 0.50 | 0.67 | 0.6503 |
| 4 | 0.00005 | 0.09 | 175 | 1.00 | 0.75 | 0.88 | 0.0021 |
| 5 | 0.00005 | 0.15 | 185 | 1.00 | 0.80 | 0.92 | 0.0166 |
| 6 | 0.00005 | 0.12 | 160 | 1.00 | 0.67 | 0.83 | 0.1094 |
| 7 | 0.00005 | 0.09 | 160 | 1.00 | 0.40 | 0.67 | 0.3743 |
| 8 | 0.00005 | 0.09 | 185 | 1.00 | 0.44 | 0.58 | 0.9456 |
| 9 | 0.00005 | 0.08 | 190 | 0.70 | 0.80 | 0.73 | 0.4041 |
| 10 | 0.00005 | 0.12 | 180 | 1.00 | 0.86 | 0.94 | 0.0016 |
| 11 | 0.00005 | 0.13 | 180 | 1.00 | 0.50 | 0.63 | 0.8862 |
| 12 | 0.00005 | 0.40 | 240 | 1.00 | 1.00 | 1.00 | 0.0016 |
| 13 | 0.00005 | 0.15 | 120 | 0.83 | 1.00 | 0.88 | 0.3671 |
| 14 | 0.00005 | 0.20 | 170 | 1.00 | 0.60 | 0.80 | 0.0547 |
| 15 | 0.00020 | 0.90 | 240 | 1.00 | 0.86 | 0.92 | 0.0039 |
| 16 | 0.00005 | 0.40 | 120 | 1.00 | 1.00 | 1.00 | 0.0050 |
| 17 | 0.00005 | 0.20 | 90 | 1.00 | 0.80 | 0.89 | 0.0413 |
| 18 | 0.00030 | 1.10 | 240 | 1.00 | 0.83 | 0.93 | 0.0046 |
| 19 | 0.00050 | 0.90 | 200 | 1.00 | 1.00 | 1.00 | 0.0199 |
| 20 | 0.00005 | 0.30 | 120 | 1.00 | 0.83 | 0.92 | 0.0039 |
The tuned parameters (gamma, coef0, cost) and the performance estimates (sensitivity, specificity, accuracy) for each iteration are shown. The parameters corresponding to the best performance are shaded. A p-value from McNemar's Chi-square test was computed, and p < 0.05 was considered statistically significant.
Biological group and predicted class label for serum samples in the blinded test set.
| Sample | 1T | 2T | 3T | 4T | 5T | 6T | 7T | 8T | 9T | 10T | 11T | 12T | 13T | 14T | 15T | 16T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Biological group | HC | HC | HC | MGUS | MGUS | MGUS | MGUS | HC | HC | HC | MGUS | MGUS | MGUS | MGUS | HC | HC |
| Predicted class | HC | HC | HC | MGUS | MGUS | HC | HC | HC | HC | HC | MGUS | MGUS | MGUS | MGUS | HC | HC |
Blinded test samples were identified as nT to mask any information about the biological group before their classification. False negative results (samples 6T and 7T) are shaded.
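The blinded test set can be tallied directly from the two label rows of the table. The sketch assumes the first label row is the biological group and the second the predicted class, following the order in the caption; the resulting figures are derived here and are not reported in this form in the record.

```python
sample_ids = [f"{i}T" for i in range(1, 17)]
biological = ["HC", "HC", "HC", "MGUS", "MGUS", "MGUS", "MGUS", "HC",
              "HC", "HC", "MGUS", "MGUS", "MGUS", "MGUS", "HC", "HC"]
predicted = ["HC", "HC", "HC", "MGUS", "MGUS", "HC", "HC", "HC",
             "HC", "HC", "MGUS", "MGUS", "MGUS", "MGUS", "HC", "HC"]

# Tally the four confusion-matrix cells, with MGUS as the positive class.
pairs = list(zip(biological, predicted))
tp = sum(1 for truth, pred in pairs if truth == "MGUS" and pred == "MGUS")
fn = sum(1 for truth, pred in pairs if truth == "MGUS" and pred == "HC")
tn = sum(1 for truth, pred in pairs if truth == "HC" and pred == "HC")
fp = sum(1 for truth, pred in pairs if truth == "HC" and pred == "MGUS")

misclassified = [sid for sid, (truth, pred) in zip(sample_ids, pairs)
                 if truth != pred]

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(pairs)

print(tp, fn, tn, fp)                       # 6 2 8 0
print(misclassified)                        # ['6T', '7T']
print(sensitivity, specificity, accuracy)   # 0.75 1.0 0.875
```

Under that reading of the rows, both errors are MGUS samples predicted as HC, which agrees with the caption's statement that the shaded cells are false negatives.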
Performance estimates obtained from the double cross-validation method.
| Outer fold # | Sensitivity | Specificity | Accuracy | p-value |
|---|---|---|---|---|
| 1 | 1.00 | 1.00 | 1.00 | 0.0000001 |
| 2 | 0.64 | 0.93 | 0.80 | 0.0111700 |
| 3 | 1.00 | 0.86 | 0.94 | 0.0013510 |
| 4 | 1.00 | 0.75 | 0.89 | 0.0002533 |
| 5 | 0.82 | 1.00 | 0.92 | 0.0001199 |
| 6 | 0.91 | 0.82 | 0.86 | 0.0004277 |
| 7 | 0.62 | 0.69 | 0.65 | 0.0843200 |
| 8 | 1.00 | 0.90 | 0.92 | 0.1618000 |
| 9 | 1.00 | 0.89 | 0.95 | 0.0001114 |
| 10 | 0.94 | 0.80 | 0.88 | 0.0025660 |
| Average | 0.89 | 0.86 | 0.88 | |
| SD | 0.15 | 0.10 | 0.10 | |
The performance estimates (sensitivity, specificity, accuracy) for every outer iteration and the corresponding average values are shown. A p-value from McNemar's Chi-square test was computed and p < 0.05 was considered statistically significant.
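The two summary rows at the bottom of the table can be checked against the ten outer-fold estimates. The sketch below copies the per-fold values from the table and assumes the final row is the sample standard deviation (n − 1 denominator), which is an inference from the numbers rather than something the caption states.

```python
from statistics import mean, stdev

# Per-fold estimates copied from the double cross-validation table.
outer_folds = {
    "sensitivity": [1.00, 0.64, 1.00, 1.00, 0.82, 0.91, 0.62, 1.00, 1.00, 0.94],
    "specificity": [1.00, 0.93, 0.86, 0.75, 1.00, 0.82, 0.69, 0.90, 0.89, 0.80],
    "accuracy":    [1.00, 0.80, 0.94, 0.89, 0.92, 0.86, 0.65, 0.92, 0.95, 0.88],
}

# Mean and sample standard deviation per measure, rounded to two decimals.
summary = {
    name: (round(mean(values), 2), round(stdev(values), 2))
    for name, values in outer_folds.items()
}

for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.2f} ± {sd:.2f}")
# sensitivity: 0.89 ± 0.15
# specificity: 0.86 ± 0.10
# accuracy: 0.88 ± 0.10
```

The means match the 89% sensitivity, 86% specificity and 88% accuracy quoted in the abstract, and the sample standard deviations reproduce the last row of the table.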
Fig 3. Classification process of an unknown serum sample.
The raw mass spectra of n technical replicates from a serum sample are pre-processed and their features are selected. The correlation threshold (r) determines the k technical replicates that pass the intra-experimental quality control (QC). The SVM predictive model classifies the k technical replicates. The majority voter assigns the predicted class of the serum sample. The parameters determined from the processing of the training set and the building of the predictive model are shaded.
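The flow in the caption (QC filter, per-replicate prediction, majority vote) can be sketched as follows. Several details are assumptions: the caption does not say what each replicate is correlated against, so the mean feature vector of the replicates is used here; `predict` is a hypothetical stand-in for the trained SVM; and the replicate values and the threshold of 0.8 are invented for the example.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (non-zero variance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def classify_sample(replicates, predict, r_threshold):
    """Return (label, k): majority-vote label over the k replicates passing QC.

    replicates: list of n feature vectors, one per technical replicate.
    predict: callable mapping a feature vector to a class label.
    QC rule (assumed): keep replicates whose correlation with the mean
    feature vector is at least r_threshold.
    """
    n = len(replicates)
    mean_vector = [sum(column) / n for column in zip(*replicates)]
    passed = [rep for rep in replicates
              if pearson(rep, mean_vector) >= r_threshold]
    if not passed:
        return None, 0
    votes = Counter(predict(rep) for rep in passed)
    # On a tie, most_common keeps the label that was seen first.
    return votes.most_common(1)[0][0], len(passed)


def toy_predict(features):
    """Hypothetical stand-in for the SVM: a threshold on the last feature."""
    return "MGUS" if features[-1] > 4.0 else "HC"


# Invented replicates; the last one is a deliberate outlier.
replicates = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [0.9, 1.8, 3.2, 3.9],
    [4.0, 1.0, 3.5, 0.5],
]
label, k = classify_sample(replicates, toy_predict, r_threshold=0.8)
print(label, k)
```

In this toy run the outlier replicate fails QC, three replicates are classified, and the vote is two to one. An odd number of passing replicates avoids ties in a two-class problem; how ties were handled in the study is not stated here.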