Phatham Loahavilai1,2, Sopanant Datta3, Kiattiwut Prasertsuk1, Rungroj Jintamethasawat1, Patharakorn Rattanawan1, Jia Yi Chia1, Cherdsak Kingkan1, Chayut Thanapirom1, Taweetham Limpanuparb3.
Abstract
Caffeine, quinic acid, and nicotinic acid are among the significant chemical determinants of coffee quality. This study develops a chemometric model to quantify these compounds in ternary mixtures analyzed by terahertz time-domain spectroscopy (THz-TDS). A data set of 480 THz spectra was obtained from 80 samples. Combinations of data preprocessing methods, including normalization (Z-score, min-max scaling, Mie baseline removal) and dimensionality reduction (principal component analysis (PCA), factor analysis (FA), independent component analysis (ICA), locally linear embedding (LLE), non-negative matrix factorization (NMF), isomap), and prediction models (partial least-squares regression (PLSR), support vector regression (SVR), multilayer perceptron (MLP), convolutional neural network (CNN), gradient boosting) were analyzed for their prediction performance (4,711,685 combinations in total). The best quantification performance, a root-mean-square error of prediction (RMSEP) of 0.0254 (dimensionless mass ratio), was achieved using min-max scaling and factor analysis for data preprocessing and a multilayer perceptron for prediction. Effects of preprocessing, comparison of prediction models, and linearity of data are discussed.
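For reference, a root-mean-square error over mass ratios can be computed as in the minimal sketch below. It assumes the error is pooled over all samples and all three components; the abstract does not say whether RMSEP is pooled or reported per compound, and the function name `rmse` and the example compositions are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error pooled over all samples and components.

    `actual` and `predicted` are equally sized lists of rows, one row per
    sample, each row holding the mass ratios of the three compounds.
    Pooling over components is an assumption, not taken from the paper.
    """
    squared_errors = [
        (a - p) ** 2
        for row_a, row_p in zip(actual, predicted)
        for a, p in zip(row_a, row_p)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up compositions (caffeine, quinic acid, nicotinic acid); each row sums to 1.
actual = [[0.50, 0.30, 0.20], [0.10, 0.60, 0.30]]
predicted = [[0.52, 0.29, 0.19], [0.12, 0.57, 0.31]]
print(rmse(actual, predicted))
```

Evaluated on held-out spectra this quantity corresponds to RMSEP; evaluated on the calibration (training) spectra it corresponds to RMSEC.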
Year: 2022 PMID: 36249363 PMCID: PMC9558605 DOI: 10.1021/acsomega.2c03808
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. Chemical structure and content of caffeine, d-(−)-quinic acid, and nicotinic acid in coffee.
Recent Studies of THz-TDS Coupled with Chemometric Methods in Food Analysis
| compound | sample matrix | chemometric method | reference |
|---|---|---|---|
| citric acid, | n/a | PLSR, ANN | ref ( |
| imidacloprid | rice powder | PLSR, SVR, iPLS, biPLS | ref ( |
| | cereal (foxtail millet) | PLSR, SVM | ref ( |
| | cereal (foxtail millet) | TM-stepwise regression, PLSR, | ref ( |
| imidacloprid, carbendazim | flour | PLSR, PCA, SVM | ref ( |
| | n/a | PLSR, SVR | ref ( |
| benzoic acid | flour | GRNN, BPNN | ref ( |
| flavonoids | n/a | PLSR, ANN, PCA, SVM | ref ( |
| proteins | soybean | PLSR, PCA-RBFNN, ABC-SVR | ref ( |
| bisphenol A, bisphenol S, bisphenol AF, bisphenol E | n/a | SVR | ref ( |
| caffeine, | n/a | PLSR, SVR, MLP, CNN, gradient boost | this work |
Reported as n/a if pure samples with binder (polyethylene) are used.
PLSR, partial least-squares regression; ANN, artificial neural networks; SVR, support vector regression; iPLS, interval partial least squares; biPLS, backward interval partial least squares; SVM, support vector machine; TM, Tchebichef image moment; N-PLSR, N-way partial least-squares regression; PCA, principal component analysis; GRNN, generalized regression neural network; BPNN, back-propagation neural network; RBFNN, radial basis function neural network; ABC, artificial bee colony; MLP, multilayer perceptron; and CNN, convolutional neural network.
Figure 2. Ternary diagram of 80 sample compositions analyzed by THz-TDS. To read the mixture compositions (dimensionless mass ratio), the key on the top right can be placed on a point, extending the lines to the axes or triangle edges. Preparation methods are described in Section .
Figure 3. THz absorption spectra of pure caffeine, quinic acid, nicotinic acid, and a ternary mixture before (left) and after (right) Mie baseline removal (a.u. = arbitrary unit).
Figure 4. Pipeline of chemometric methods. See the list of abbreviations in Table .
List of Investigated Chemometric Techniques with Brief Description
| process | technique | description |
|---|---|---|
| normalization | Z-score | features are transformed so that the mean is 0 and the standard deviation is 1. |
| | min-max scaling | features are transformed so that the data lie in the range [0, 1]. |
| | Mie baseline removal | Mie scattering caused by the interaction of the HDPE binder with THz waves is corrected. |
| | | Mie efficiency ( |
| | | we regarded this as a convex optimization problem to a valid set of discretized angular frequencies Ω, which excludes the water peaks (with ±25 GHz bandwidth) and the frequencies ranging from 0.3 to 0.55 THz due to negative absorbance residues arising from noise. We denoted absorbance as |
| | | ξ is the solution found after optimization. |
| | | ◦ for single-factor, ξ is a constant scalar value. The minimum ξ computed in the training set is used for all spectra, including the test set. |
| | | ◦ for multifactor, ξ can have different values, and the baseline of each spectrum is removed individually. A special case of multifactor concatenates the ξ parameters after dimensionality reduction as another feature. |
| dimensionality reduction | principal component analysis (PCA) | data is projected onto a low-dimensional linear space formed by principal orthogonal components; cumulative variance in the predictors is explained. |
| | factor analysis (FA) | data is projected onto a low-dimensional linear space formed by factors that correlate with other variables; correlations between variables are explained. |
| | independent component analysis (ICA) | data is projected onto a low-dimensional linear space formed by maximally independent components. |
| | locally linear embedding (LLE) | data is projected onto a low-dimensional nonlinear space, preserving distances between neighboring points. |
| | non-negative matrix factorization (NMF) | data is projected onto a low-dimensional linear space formed by non-negative additive factors. |
| | isomap | data is projected onto a low-dimensional nonlinear space, preserving geodesic distances. |
| prediction model | partial least-squares regression (PLSR) | least-squares regression is performed on dimensionally reduced input data with the retained correlation between data. |
| | support vector regression (SVR) | hyperplane parameters, whose margin of error confines as many training data points as possible, are obtained and used for prediction. Outliers tend to be far away from the margin of error. |
| | multilayer perceptron (MLP) | nonlinear function approximator is obtained for either classification or regression, and parameters are adjusted using “back-propagation”. Hidden and output layers contain neurons using an activation function. Data is fed from one layer to another. |
| | convolutional neural network (CNN) | nonlinear mapping function (residual block[ |
| | gradient boosting | model is developed from a combination of trained base learners. XGBoost is an optimized distributed gradient boosting library that has regularization parameters to prevent overfitting and was selected in this study. |
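The two simple normalizations in the table can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: it assumes each column of the data matrix is one frequency feature and that the scaling statistics are fitted on the training spectra and then reused on the test spectra. The helper names `fit_zscore` and `fit_minmax` and the random data are hypothetical.

```python
import numpy as np

def fit_zscore(train):
    """Return a function that standardizes each feature (column)
    using the mean and standard deviation of the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return lambda x: (x - mean) / std

def fit_minmax(train):
    """Return a function that rescales each feature (column)
    so that the training data spans the range [0, 1]."""
    low = train.min(axis=0)
    span = train.max(axis=0) - low
    span[span == 0] = 1.0  # guard against constant features
    return lambda x: (x - low) / span

# Random stand-in data: 6 training and 2 test "spectra" with 4 features each.
rng = np.random.default_rng(0)
train = rng.random((6, 4))
test = rng.random((2, 4))

zscore = fit_zscore(train)
minmax = fit_minmax(train)
print(zscore(train).mean(axis=0), zscore(train).std(axis=0))
print(minmax(train).min(axis=0), minmax(train).max(axis=0))
print(minmax(test))
```

Because the statistics come from the training set only, transformed test spectra can fall slightly outside [0, 1] under min-max scaling.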
Highest-Performance Preprocessing Technique by RMSEP for Each Prediction Model and Their Respective Optimal Hyperparameters
| model (hyperparameters) | normalization | dimensionality reduction | RMSEC | RMSEP |
|---|---|---|---|---|
| MLP (#neurons: 4, activation fn: logistic, solver: lbfgs) | min-max scaling | FA (27 components) | 0.0213 | 0.0254 |
| SVR (type: NuSVR, nu: 0.5, C: 1.0, iterations: 2000, kernel: rbf, γ: auto) | Mie baseline removal (multifactor) | PCA (27 components) | 0.0262 | 0.0260 |
| CNN (activation fn: sigmoid) | Mie baseline removal (multifactor, with factor concatenation) | none | 0.0276 | 0.0293 |
| gradient boosting (learning rate: 0.01, max depth: 5, min child weight: 2, γ: 0, subsample: 0.3, colsample_bytree: 0.6, num_round: 10,000) | Mie baseline removal (multifactor) | none | 0.0214 | 0.0316 |
| PLSR (with prescale, 3 components) | Mie baseline removal (multifactor) | LLE (modified, with 14 components, 30 neighbors) | 0.0312 | 0.0283 |
Abbreviations are listed in Table . RMSEC and RMSEP values are reported as dimensionless mass ratios.
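The best-performing combination in the table (min-max scaling, FA with 27 components, MLP) could be assembled roughly as in the sketch below. Several things here are assumptions rather than facts from the source: the use of scikit-learn (the hyperparameter names resemble its API, but the library is not named here), the reading of "#neurons: 4" as a single hidden layer of four neurons, the 75/25 random split, and the synthetic spectra, which are random linear mixtures that only stand in for real THz-TDS data.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 480 spectra: random "pure" spectra mixed linearly.
rng = np.random.default_rng(0)
n_spectra, n_freqs = 480, 100
pure = rng.random((3, n_freqs))                # hypothetical pure-compound spectra
y = rng.dirichlet(np.ones(3), size=n_spectra)  # mass ratios, each row sums to 1
X = y @ pure + rng.normal(0.0, 0.01, (n_spectra, n_freqs))

# The split ratio is an assumption; the source does not state it here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(
    MinMaxScaler(),                                   # min-max scaling
    FactorAnalysis(n_components=27, random_state=0),  # FA, 27 components
    MLPRegressor(hidden_layer_sizes=(4,), activation="logistic",
                 solver="lbfgs", max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmsep = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
print(f"RMSEP on synthetic data: {rmsep:.4f}")
```

The number printed for this synthetic data is not comparable to the RMSEP values in the table; the sketch only shows how the three stages chain together.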
Figure 5. Performance of MLP with min-max scaling and FA preprocessing: predicted vs actual (dimensionless) mass ratio of caffeine (red), quinic acid (green), and nicotinic acid (blue) in ternary mixtures (point colors are proportional to their associated predicted composition; see Figure ).
Mean Change in RMSEP for Each Prediction Model by Different Preprocessing Methods
| process | preprocessing technique | PLSR | SVR | MLP | CNN | gradient boost |
|---|---|---|---|---|---|---|
| normalization | Mie baseline removal | −0.3326 | −0.1252 | −0.0063 | −0.0087 | −0.0071 |
| | min-max scaling | −0.0866 | 0.3918 | 0.0088 | 0.0144 | 0.0130 |
| | Z-score | 0.0732 | 1.0256 | 0.0282 | 0.0273 | 0.0240 |
| dimensionality reduction | FA | −0.0047 | −0.3354 | −0.0109 | 0.0249 | 0.0055 |
| | ICA | −0.0037 | −0.3356 | 0.0312 | 0.0443 | 0.0114 |
| | isomap | 0.0229 | 0.6478 | 0.0108 | 0.0316 | 0.0239 |
| | LLE | −0.0021 | −0.4042 | 0.0305 | 0.0080 | 0.0097 |
| | NMF | 0.0001 | −0.4179 | 0.0118 | 0.0129 | 0.0080 |
| | PCA | −0.0026 | 0.9325 | −0.0050 | 0.0133 | 0.0119 |
RMSEP values are reported as dimensionless mass ratios.
Figure 6. Prediction performance of nonlinear (blue) and linear (red) MLP models for (dimensionless) mass ratios of caffeine, quinic acid, and nicotinic acid in ternary mixtures. The left vertical axis is shifted up with respect to the right vertical axis for ease of comparison.
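One way to set up such a linear-versus-nonlinear comparison is sketched below. It assumes that the "linear" MLP means a network with identity activation and the "nonlinear" one a network with logistic activation; the caption does not define either, so this is a guess. It also assumes scikit-learn and uses synthetic linear mixtures in place of the measured spectra.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data: linear mixtures of three random "pure" spectra.
rng = np.random.default_rng(1)
pure = rng.random((3, 40))                       # hypothetical pure spectra
y = rng.dirichlet(np.ones(3), size=200)          # mass ratios, rows sum to 1
X = y @ pure + rng.normal(0.0, 0.01, (200, 40))  # linear mixing plus noise
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

results = {}
for name, activation in [("linear", "identity"), ("nonlinear", "logistic")]:
    # Identity activation makes the whole network a linear map of its inputs.
    model = MLPRegressor(hidden_layer_sizes=(4,), activation=activation,
                         solver="lbfgs", max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    residual = model.predict(X_test) - y_test
    results[name] = float(np.sqrt(np.mean(residual ** 2)))

print(results)
```

If the two errors are close, the relation between spectra and composition is well described by a linear map; a clearly lower error for the nonlinear model would point to nonlinearity in the data.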
Figure 7. Custom-made THz-TDS system at NECTEC, Pathum Thani, Thailand. The photoconductive antenna transmitter and receiver are labeled as Tx and Rx, respectively. Off-axis parabolic mirrors are labeled as #1, #2, #3, and #4.