Yun Xu, Howbeer Muhamadali, Ali Sayqal, Neil Dixon, Royston Goodacre.
Abstract
Partial least squares (PLS) is one of the most commonly used supervised modelling approaches for analysing multivariate metabolomics data. PLS is typically employed as either a regression model (PLS-R) or a classification model (PLS-DA). However, metabolomics studies commonly investigate multiple, potentially interacting, factors simultaneously following a specific experimental design. Such data often cannot be considered a "pure" regression or classification problem. Nevertheless, they are often still treated as one or the other, which can lead to ambiguous results. In this study, we investigated the feasibility of designing a hybrid target matrix Y that reflects the experimental design better than the simple regression or binary class membership coding commonly used in PLS modelling. The new Y coding was based on the same principle used by structural modelling in machine learning. Two real metabolomics datasets were used as examples to illustrate how the new Y coding can improve the interpretability of the PLS model compared to classic regression/classification coding.
Keywords: Y coding; experimental design; metabolomics; partial least squares; structural modelling
Year: 2016 PMID: 27801817 PMCID: PMC5192444 DOI: 10.3390/metabo6040038
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Figure 1. Structured output coding for the riboswitch set. For example, a wild-type sample under the IPTG inducer condition would be coded as [1 0 0 0 0 0 1 0 0].
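The coding in the Figure 1 example can be sketched as a concatenation of one indicator block per factor. This is only an illustration under stated assumptions: the level order is taken from the confusion matrix headings in this record, each factor is assumed to be coded as a plain indicator block, and the names `STRAINS`, `INDUCERS`, `one_hot`, and `encode_sample` are hypothetical. The published coding scheme may differ for levels other than the one example given.

```python
# Assumed level order, taken from the confusion matrix headings.
STRAINS = ["Wild-type", "PET", "EGFP", "iL3EGFP", "iL3PET"]
INDUCERS = ["No inducer", "IPTG", "IPTG + PPDA", "PPDA"]


def one_hot(level, levels):
    """Return an indicator list with a 1 at the position of `level`."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if name == level else 0 for name in levels]


def encode_sample(strain, inducer):
    """Concatenate the strain block and the inducer block into one Y row.

    Assumption: each factor is a plain indicator block; this reproduces
    the single example in the caption but is not confirmed for all levels.
    """
    return one_hot(strain, STRAINS) + one_hot(inducer, INDUCERS)


y_row = encode_sample("Wild-type", "IPTG")
print(y_row)  # [1, 0, 0, 0, 0, 0, 1, 0, 0]
```

Stacking one such row per sample would give a nine-column target matrix Y, so a single PLS model sees both factors at once rather than one factor per model.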
PLS predictions of the riboswitch set. Confusion matrix of strain prediction using structured output.
| | Wild-Type | PET | EGFP | iL3EGFP | iL3PET |
|---|---|---|---|---|---|
| Wild-type | 97.20% | 0.38% | 0.08% | 1.73% | 0.63% |
| PET | 10.03% | 71.93% | 8.20% | 6.23% | 3.63% |
| EGFP | 0.00% | 2.88% | 89.33% | 4.83% | 2.98% |
| iL3EGFP | 1.20% | 8.53% | 2.35% | 69.10% | 18.83% |
| iL3PET | 3.35% | 12.55% | 2.85% | 7.78% | 73.48% |
Overall CCR = 80.21% (p < 0.001) (CCR = correct classification rate).
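One way to read the overall CCR is as the mean of the diagonal of the row-normalised confusion matrix. The sketch below assumes equal weighting of the classes and uses a hypothetical helper name, `overall_ccr`; the reported values may instead have been computed from raw prediction counts, so this average need not match every table exactly.

```python
# Strain confusion matrix (structured output), in percent, copied from the table above.
confusion = [
    [97.20, 0.38, 0.08, 1.73, 0.63],    # Wild-type
    [10.03, 71.93, 8.20, 6.23, 3.63],   # PET
    [0.00, 2.88, 89.33, 4.83, 2.98],    # EGFP
    [1.20, 8.53, 2.35, 69.10, 18.83],   # iL3EGFP
    [3.35, 12.55, 2.85, 7.78, 73.48],   # iL3PET
]


def overall_ccr(matrix):
    """Mean of the diagonal entries (assumes equally weighted classes)."""
    diagonal = [row[i] for i, row in enumerate(matrix)]
    return sum(diagonal) / len(diagonal)


print(f"{overall_ccr(confusion):.2f}%")  # 80.21%
```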
PLS predictions of the riboswitch set. Confusion matrix of inducer condition prediction using structured output.
| | No Inducer | IPTG | IPTG + PPDA | PPDA |
|---|---|---|---|---|
| No inducer | 66.62% | 6.56% | 10.92% | 15.90% |
| IPTG | 12.66% | 58.04% | 22.58% | 6.72% |
| IPTG + PPDA | 4.02% | 34.42% | 44.40% | 17.16% |
| PPDA | 19.14% | 6.58% | 10.26% | 64.02% |
Overall CCR = 58.20% (p < 0.01).
PLS predictions of the riboswitch set. Confusion matrix of strain prediction using binary coding.
| | Wild-Type | PET | EGFP | iL3EGFP | iL3PET |
|---|---|---|---|---|---|
| Wild-type | 79.53% | 4.40% | 1.58% | 9.55% | 4.95% |
| PET | 8.93% | 52.88% | 11.53% | 18.25% | 8.43% |
| EGFP | 0.70% | 8.55% | 83.23% | 3.65% | 3.88% |
| iL3EGFP | 6.53% | 8.28% | 5.60% | 62.08% | 17.53% |
| iL3PET | 7.43% | 10.00% | 10.25% | 6.13% | 66.20% |
Overall CCR = 67.78% (p < 0.001).
PLS predictions of the riboswitch set. Confusion matrix of inducer condition prediction using binary coding.
| | No Inducer | IPTG | IPTG + PPDA | PPDA |
|---|---|---|---|---|
| No inducer | 64.60% | 7.10% | 12.16% | 16.14% |
| IPTG | 10.44% | 59.40% | 22.50% | 7.66% |
| IPTG + PPDA | 7.58% | 34.44% | 43.46% | 14.52% |
| PPDA | 15.38% | 5.00% | 9.60% | 70.02% |
Overall CCR = 59.37% (p < 0.01).
Figure 2. Structured output coding for the propranolol set. For example, a sample of S1, D2, and T2 would be coded as [6 0 0 4 6].
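The Figure 2 example mixes a class-like block with numeric targets, which can be sketched as follows. The interpretation is an assumption drawn from the single example: the first three entries are read as a strain indicator scaled by 6, and the last two as numeric codes for dosage and time point. The function `hybrid_code` and its parameters are hypothetical, and only the values for S1, D2, and T2 come from the caption; the codes for other levels are not given here.

```python
def hybrid_code(strain_index, n_strains, strain_scale, dose_value, time_value):
    """Build one hybrid Y row: scaled strain indicator + numeric dose and time.

    Assumption: the strain block is an indicator multiplied by `strain_scale`,
    and dose/time enter as single numeric columns.
    """
    strain_block = [strain_scale if i == strain_index else 0
                    for i in range(n_strains)]
    return strain_block + [dose_value, time_value]


# S1 is the first of three strains; 6, 4, and 6 are the values in the caption.
y_row = hybrid_code(strain_index=0, n_strains=3, strain_scale=6,
                    dose_value=4, time_value=6)
print(y_row)  # [6, 0, 0, 4, 6]
```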
PLS predictions for the propranolol set. Confusion matrix for strain prediction.
| | S1 | S2 | S3 |
|---|---|---|---|
| S1 | 78.03% | 21.26% | 0.71% |
| S2 | 10.40% | 87.67% | 1.93% |
| S3 | 4.78% | 3.22% | 92.00% |
Overall CCR = 85.90% (p < 0.001).
PLS predictions for the propranolol set. Confusion matrix for dosages of propranolol prediction.
| | D0 | D1 | D2 | D3 |
|---|---|---|---|---|
| D0 | 99.08% | 0.91% | 0.01% | 0% |
| D1 | 0.20% | 64.45% | 35.35% | 0% |
| D2 | 0% | 0.05% | 93.23% | 6.72% |
| D3 | 1.67% | 0.10% | 39.35% | 58.88% |
Overall accuracy = 78.91% (p < 0.001).
PLS predictions for the propranolol set. Confusion matrix for time point prediction.
| | T0 | T1 | T2 |
|---|---|---|---|
| T0 | 30.13% | 55.57% | 14.30% |
| T1 | 15.60% | 58.35% | 26.05% |
| T2 | 1.48% | 27.15% | 71.37% |
Overall accuracy = 52.97% (p < 0.01).
PLS predictions for the propranolol set. Confusion matrix for time point prediction using an evenly spaced coding.
| | T0 | T1 | T2 |
|---|---|---|---|
| T0 | 20.64% | 70.40% | 8.96% |
| T1 | 7.15% | 67.78% | 25.07% |
| T2 | 0.78% | 35.72% | 63.50% |
Overall accuracy = 49.30% (p < 0.01).
Figure 3. VIP score plots for the riboswitch data. The variable identifications and their corresponding VIP score values are annotated in the data tips. Note that each metabolite is annotated only once: if a metabolite is significant in both VIP score plots (e.g., variable 27), only the higher score is annotated.
Figure 4. VIP score plots for the propranolol data. The variable identifications and their corresponding VIP score values are annotated in the data tips.
Different inducer conditions.
| Inducer Compound | Final Concentration |
|---|---|
| 0.9% NaCl solution (control, no inducer) | - |
| IPTG | 50 μM |
| PPDA (riboswitch inducer ligand) | 200 μM |
| IPTG + PPDA | 50 μM + 200 μM |