Luke R Sadergaski, Travis J Hager, Hunter B Andrews.
Abstract
Selecting an optimal combination of preprocessing methods is a major bottleneck in chemometric analysis. The analyst decides which method(s) to apply to the data, frequently by highly subjective or inefficient means such as user experience or trial and error. Here, we present a user-friendly method that uses optimal experimental designs to select preprocessing transformations. We applied this strategy to optimize partial least squares regression (PLSR) analysis of Stokes Raman spectra to quantify hydroxylammonium (0-0.5 M), nitric acid (0-1 M), and total nitrate (0-1.5 M) concentrations. The best PLSR model chosen by a determinant (D)-optimal design comprising 26 samples (i.e., combinations of preprocessing methods) was compared with PLSR models built with no preprocessing, a user-selected preprocessing method (i.e., trial and error), and a user-defined design strategy (576 samples). The D-optimal selection strategy improved PLSR prediction performance by more than 50% compared with the raw data and reduced the number of combinations by about 95.5% (26 instead of 576).
Year: 2022 PMID: 35252718 PMCID: PMC8892473 DOI: 10.1021/acsomega.1c07111
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
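The determinant (D)-optimal selection mentioned in the abstract can be illustrated with a small sketch. A D-optimal design picks the subset of candidate runs that maximizes det(FᵀF), where F is the model matrix of the chosen runs. The candidate levels, the main-effects model, and the simple point-exchange search below are all assumptions made for illustration; they are not the candidate set, model, or software used in the study.

```python
import itertools
import numpy as np

# Hypothetical candidate levels (NOT the study's 576-run candidate set).
levels = {
    "scatter": [0, 1],
    "der_order": [0, 1, 2],
    "poly_order": [3, 5, 7],
    "half_width": [2, 9, 16, 23, 30],
    "scaling": [0, 1],
}
candidates = np.array(list(itertools.product(*levels.values())), dtype=float)

def model_matrix(runs):
    """Intercept plus main effects, with every factor coded to [-1, 1]."""
    lo, hi = candidates.min(axis=0), candidates.max(axis=0)
    coded = 2 * (runs - lo) / (hi - lo) - 1
    return np.column_stack([np.ones(len(runs)), coded])

def log_det(idx):
    """log det(F^T F) for the runs in idx; -inf if the design is singular."""
    F = model_matrix(candidates[idx])
    sign, value = np.linalg.slogdet(F.T @ F)
    return value if sign > 0 else -np.inf

def d_optimal(n_runs, n_passes=5, seed=0):
    """Point exchange: swap in any candidate that increases log det(F^T F)."""
    rng = np.random.default_rng(seed)
    idx = [int(i) for i in rng.choice(len(candidates), size=n_runs, replace=False)]
    start = best = log_det(idx)
    for _ in range(n_passes):
        improved = False
        for pos in range(n_runs):
            for cand in range(len(candidates)):
                if cand in idx:
                    continue  # keep the runs distinct
                trial = idx.copy()
                trial[pos] = cand
                value = log_det(trial)
                if value > best + 1e-12:
                    idx, best, improved = trial, value, True
        if not improved:
            break
    return idx, start, best

design_idx, start_logdet, final_logdet = d_optimal(n_runs=12)
print(f"log det: {start_logdet:.3f} -> {final_logdet:.3f}")
print(candidates[design_idx])
```

An exchange search of this kind only finds a local optimum, so in practice it is usually restarted from several random starting designs.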
Figure 1. Stokes Raman spectrum of an aqueous solution containing 0.65 M HNO₃ and 0.5 M hydroxylammonium nitrate (HAN). The most intense peaks are labeled and correspond to the NH₃OH⁺ symmetric N–O stretch at 1007 cm⁻¹, the NO₃⁻ symmetric N–O stretch at 1048 cm⁻¹, and the O–H stretching band from approximately 2700 to 3800 cm⁻¹.
D-Optimal Design Matrix for Preprocessing Transformations
| run | scatter | der. order | poly. order | left/right | scaling | space type | build type |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 2 | 1 | vertex | model |
| 2 | 0 | 1 | 5 | 3 | 0 | interior | model |
| 3 | 1 | 0 | 7 | 4 | 0 | edge | model |
| 4 | 1 | 0 | 3 | 2 | 0 | edge | model |
| 5 | 1 | 2 | 3 | 20 | 0 | vertex | lack of fit |
| 6 | 0 | 0 | 1 | 2 | 0 | vertex | model |
| 7 | 1 | 0 | 5 | 24 | 0 | plane | lack of fit |
| 8 | 0 | 0 | 3 | 17 | 1 | plane | model |
| 9 | 0 | 2 | 7 | 4 | 0 | edge | model |
| 10 | 0 | 2 | 7 | 30 | 1 | vertex | model |
| 11 | 1 | 1 | 7 | 17 | 0 | plane | model |
| 12 | 1 | 2 | 3 | 30 | 1 | vertex | model |
| 13 | 1 | 0 | 7 | 30 | 1 | vertex | model |
| 14 | 0 | 0 | 7 | 4 | 1 | edge | model |
| 15 | 0 | 2 | 3 | 30 | 0 | vertex | model |
| 16 | 0 | 0 | 5 | 13 | 0 | plane | lack of fit |
| 17 | 1 | 2 | 7 | 4 | 1 | edge | model |
| 18 | 1 | 1 | 3 | 13 | 1 | interior | lack of fit |
| 19 | 1 | 2 | 3 | 2 | 0 | vertex | model |
| 20 | 0 | 0 | 1 | 30 | 1 | vertex | model |
| 21 | 0 | 2 | 3 | 2 | 1 | vertex | model |
| 22 | 1 | 0 | 1 | 15 | 0 | vertex | model |
| 23 | 1 | 2 | 7 | 30 | 0 | vertex | model |
| 24 | 1 | 0 | 1 | 30 | 0 | vertex | model |
| 25 | 0 | 1 | 7 | 30 | 1 | edge | model |
| 26 | 0 | 0 | 7 | 30 | 0 | vertex | model |
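Abbreviations used in this table: der. = derivative; poly. = polynomial. Left/right gives the number of smoothing points on each side for the Savitzky-Golay (SG) smoothing or derivative filter. Scatter and scaling refer to standard normal variate (SNV) and mean centering (MC), respectively.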
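To show how one row of the design matrix translates into a preprocessing pipeline, here is a minimal NumPy sketch. It rests on several assumptions that the table does not state: that a scatter or scaling value of 1 means the step is applied, that "left/right" is the number of points on each side of a symmetric SG window, and that the steps run in the order SNV, SG, MC. The function names, the synthetic spectra, and the choice to trim edge points are illustrative only.

```python
import math
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row) by its own mean and std."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def sg_coefficients(half_width, poly_order, deriv_order):
    """Savitzky-Golay weights from a least-squares polynomial fit over 2*half_width + 1 points."""
    z = np.arange(-half_width, half_width + 1, dtype=float)
    A = np.vander(z, poly_order + 1, increasing=True)  # columns: z**0, z**1, ...
    # Row d of pinv(A) yields the fitted coefficient of z**d;
    # multiplying by d! turns it into the d-th derivative at the window center.
    return math.factorial(deriv_order) * np.linalg.pinv(A)[deriv_order]

def sg_filter(X, half_width, poly_order, deriv_order):
    """Apply the SG weights along each row; 'valid' mode drops half_width points at each edge."""
    w = sg_coefficients(half_width, poly_order, deriv_order)
    return np.array([np.correlate(row, w, mode="valid") for row in X])

def mean_center(X):
    """Subtract the mean spectrum (column-wise mean) from every spectrum."""
    return X - X.mean(axis=0, keepdims=True)

def preprocess(X, scatter, der_order, poly_order, half_width, scaling):
    """Assumed order of operations: SNV, then SG, then mean centering."""
    if scatter:
        X = snv(X)
    X = sg_filter(X, half_width, poly_order, der_order)
    if scaling:
        X = mean_center(X)
    return X

# Hypothetical stand-in for Raman spectra: 6 spectra, 200 channels.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(6, 200)).cumsum(axis=1)

# Run 12 of the design matrix: scatter=1, der. order=2, poly. order=3, left/right=30, scaling=1.
run_12 = dict(scatter=1, der_order=2, poly_order=3, half_width=30, scaling=1)
processed = preprocess(spectra, **run_12)
print(processed.shape)
```

The derivative here is taken per channel index; dividing by the channel spacing raised to the derivative order would convert it to wavenumber units.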
Figure 2. Plot of total Y-variable RMSEC (C) and RMSECV (CV) vs the number of factors for PLSR models built (a) without preprocessing and (b) with the optimal strategy selected using D-optimal design.
PLSR Model Calibration and Validation Statistics for Each Analyte Derived from Multiple Preprocessing Strategies
| design | NP | U-S | D-optimal | U-DD |
|---|---|---|---|---|
| no. samples | 1 | undefined | 26 | 576 |
| preprocessing | none | SNV/derivative/MC | derivative/MC | derivative/MC |
| no. factors | 4 | 3 | 3 | 3 |
| Calibration/CV statistics | | | | |
| R² | 0.9992096 | 0.9995188 | 0.9993514 | 0.9994043 |
| RMSEC | 0.0060845 | 0.0042479 | 0.0049315 | 0.0047265 |
| RMSECV | 0.0454421 | 0.0184775 | 0.015596 | 0.0147932 |
| R² | 0.9990751 | 0.9996105 | 0.9991647 | 0.9991879 |
| RMSEC | 0.0123394 | 0.0076351 | 0.0111814 | 0.0110249 |
| RMSECV | 0.0588891 | 0.0204559 | 0.032528 | 0.0315437 |
| R² | 0.9991027 | 0.9995734 | 0.9989014 | 0.9989355 |
| RMSEC | 0.0147092 | 0.0090814 | 0.0145736 | 0.014346 |
| RMSECV | 0.0393324 | 0.0232106 | 0.0384820 | 0.0380667 |
| Validation statistics | | | | |
| RMSEP (HA+) | 0.032912 | 0.018964 | 0.0108452 | 0.0105082 |
| RMSEP% | 12.28 | 7.09 | 4.29 | 3.79 |
| bias | 0.014777 | 0.0143382 | –0.0004326 | –0.0019518 |
| SEP | 0.029910 | 0.0126239 | 0.0110218 | 0.0105019 |
| RMSEP (H+) | 0.0523674 | 0.0362628 | 0.0215038 | 0.0209245 |
| RMSEP% | 10.05 | 6.83 | 4.24 | 4.57 |
| bias | 0.019598 | 0.0296535 | 0.0056263 | –0.0032218 |
| SEP | 0.049392 | 0.0212292 | 0.0211095 | 0.0210284 |
| RMSEP (NO3–) | 0.045651 | 0.0534647 | 0.0244241 | 0.0240554 |
| RMSEP% | 5.78 | 6.69 | 3.21 | 3.22 |
| bias | 0.034373 | 0.0439889 | 0.0051911 | –0.0051765 |
| SEP | 0.030556 | 0.0309078 | 0.0242741 | 0.0238934 |
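R² is the coefficient of determination of the calibration, and CV denotes cross-validation. The user-selected (U-S), D-optimal, and user-defined design (U-DD) strategies used different SG derivatives. NP denotes the model built with no preprocessing.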
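The validation statistics in the table are linked by a standard identity: for n validation samples, RMSEP² = bias² + ((n − 1)/n)·SEP², when SEP is the sample standard deviation of the residuals. The sketch below checks that identity on hypothetical residuals (the sign convention, predicted minus reference, is an assumption) and then recomputes two headline figures from the tabulated NP and D-optimal values.

```python
import math
import statistics

def rmsep(residuals):
    """Root-mean-square error of prediction."""
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

def bias(residuals):
    """Mean residual."""
    return sum(residuals) / len(residuals)

def sep(residuals):
    """Standard error of prediction: sample standard deviation of the residuals about the bias."""
    return statistics.stdev(residuals)

# Hypothetical residuals (predicted minus reference, in M); not data from the study.
residuals = [0.012, -0.008, 0.015, 0.003, -0.011, 0.009, 0.001, -0.004]
n = len(residuals)
lhs = rmsep(residuals) ** 2
rhs = bias(residuals) ** 2 + (n - 1) / n * sep(residuals) ** 2
print(math.isclose(lhs, rhs))

# RMSEP values copied from the table (NP = no preprocessing, D = D-optimal).
rmsep_np = {"HA+": 0.032912, "H+": 0.0523674, "NO3-": 0.045651}
rmsep_d = {"HA+": 0.0108452, "H+": 0.0215038, "NO3-": 0.0244241}
improvement = {k: 100 * (1 - rmsep_d[k] / rmsep_np[k]) for k in rmsep_np}
mean_improvement = sum(improvement.values()) / len(improvement)

# Fewer preprocessing combinations to evaluate: 26 (D-optimal) versus 576 (U-DD).
reduction = 100 * (1 - 26 / 576)

for analyte, pct in improvement.items():
    print(f"{analyte}: RMSEP lowered by {pct:.1f}%")
print(f"mean RMSEP improvement: {mean_improvement:.1f}%")
print(f"combinations reduced by {reduction:.1f}%")
```

With the tabulated values, the relative RMSEP reductions come out near 67%, 59%, and 46% for HA+, H+, and NO3–, averaging roughly 57%, which is one way to read the improvement quoted in the abstract.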
Figure 3. Confidence intervals for HA⁺ (a) bias and (b) SEP for all six comparisons. Prediction performance is statistically similar between designs if the confidence interval crosses the solid vertical line for bias and SEP.
Figure 5. Confidence intervals for NO₃⁻ (a) bias and (b) SEP for all six comparisons. Prediction performance is statistically similar between designs if the confidence interval crosses the solid vertical line for bias and SEP.