Literature DB >> 26166772

Optimization of Parameter Selection for Partial Least Squares Model Development.

Na Zhao1, Zhi-sheng Wu1, Qiao Zhang1, Xin-yuan Shi1, Qun Ma1, Yan-jiang Qiao1.   

Abstract

In multivariate calibration using a spectral dataset, it is difficult to optimize nonsystematic parameters in a quantitative model, i.e., spectral pretreatment, latent factors and variable selection. In this study, we describe a novel and systematic approach that uses a processing trajectory to select three parameters including different spectral pretreatments, variable importance in the projection (VIP) for variable selection and latent factors in the Partial Least-Square (PLS) model. The root mean square errors of calibration (RMSEC), the root mean square errors of prediction (RMSEP), the ratio of standard error of prediction to standard deviation (RPD), and the determination coefficient of calibration (Rcal(2)) and validation (Rpre(2)) were simultaneously assessed to optimize the best modeling path. We used three different near-infrared (NIR) datasets, which illustrated that there was more than one modeling path to ensure good modeling. The PLS model optimizes modeling parameters step-by-step, but the robust model described here demonstrates better efficiency than other published papers.

Entities:  

Year:  2015        PMID: 26166772      PMCID: PMC4499800          DOI: 10.1038/srep11647

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Near-infrared (NIR) Spectroscopy combined with Partial Least-Square (PLS) has gained wide acceptance because it is rapid, nondestructive, and environmentally friendly. It has been successfully applied to different fields including agriculture12, food stuffs3, petrochemistry4, and pharmacy567. In addition, NIR technology offers sufficient accuracy and precision for solid and liquid systems without any sample pretreatment. However, establishing a reliable and robust PLS model is a vital procedure for successful application of NIR. There are many parameters that should be optimized in the PLS model. These include spectral pretreatment, variable selection and latent factors. The NIR spectra contain unexpected chemical interferences related to sampling error, instrument error, etc. In the process of establishing a PLS model, spectral pretreatment methods have been applied to extract relevant information and reduce the effect of noise and baseline drift89. Variable selection is then also used to identify highly informative features and eliminate useless variables from the original spectral dataset1011. The latent factors that explain the spectral matrix are sorted in decreasing order according to their contribution to the spectral features. It is crucial to determine the latent factors and avoid over-fitting or under-fitting1213. Most published work about PLS models offer few details about the model development. It is assumed that there is a univariate used to select these parameters one by one according the root mean square error of cross validation (RMSECV) and the root mean square errors of prediction (RMSEP). The inter-influence among the three parameters is rarely considered in model development. This is risky because the modeling path is not necessarily the best approach to step-by-step optimization. Wenlong Li et al. established quantitative methods based on NIR to determine the active compounds of Lonicerae Japonicae Flos14. Ioan Tomuta et al. applied NIR for the determination of active pharmaceutical ingredients (API) and pharmaceutical properties including the crushing strength and disintegration time of meloxicam tablets15. Nádia Reis et al. developed a quantitative method for multiple adulterants in roasted coffee using diffuse reflectance infrared Fourier transform spectroscopy16. Yajie Xi et al. developed a NIR method to determinate the trace dimethyl fumarate in milk17. In the above reports, spectral pretreatment methods were compared according to the predictive performance of the calibration models. Using the optimal pretreatment method, latent factors were selected with a cross validation method. The optimum wavelength range was then tested and selected based on these pretreatment and latent variables. Although quantitative models could be obtained, they did not investigate the trajectory routes of PLS models that consider the combination of different parameters. Based on this consideration, we propose a novel and systematic approach to improve the efficiency and accuracy in the development of PLS models. The tracking procedure of PLS modeling provided a systematic profile that combined different spectral pretreatment methods, latent factors, and the variable importance in the projection (VIP) variable selection. The model performance was assessed using the root mean square errors of calibration (RMSEC), RMSEP, the determination coefficient of calibration (Rcal2) and validation (Rpre2) as well as the ratio of standard error of prediction to standard deviation (RPD)1819. To demonstrate the advantage of this novel approach, three different NIR spectral datasets were analyzed including two standard and open source datasets. The proposed procedure was used to predict water, baicalin, and API in corn, Yinhuang granules, and pharmaceutical tablets, respectively. These specific NIR applications were selected because they are important to quality control.

Result

Near-infrared spectral features

The NIR spectra (1,100–2,498 nm) of typical corn samples are shown in Fig. 1a. There were several broad peaks located near 1,190, 1,450, and 1,940 nm. These represent the characteristic peaks regarding the water in the corn spectral dataset. The strong absorption bands of water appeared around 1,400–1,450 nm and 1,900–1,950 nm due to the first overtone of OH stretching and combination of the OH stretching band with OH bending.
Figure 1

Raw NIR spectra of corn sample (a), Yinhuang granules sample (b) and pharmaceutical tablets sample (c).

On the other hand, the NIR spectra of Yinhuang granules are shown in Fig. 1b. There was a weak absorption in the second overtones region (SCOT, 10,000–7,100 cm−1) of the fundamental CH stretching bands. There was much fluctuation in the region of first combination-overtone (FCOT, 7,100–4,900 cm−1) and combination region (CR, 4,900–4,000 cm−1). The NIR spectra of pharmaceutical tablets are shown in Fig. 1c. As seen in the raw spectra, there were many clear absorption bands of the API from 900 to 1,630 nm. There was also severe spectral overlap and baseline drift in the spectra of corn, Yinhuang granules, and pharmaceutical tablets. Light scattering affected the raw spectra as well; this depended on the physical conditions such as particle size, density and variation among lots.

Reference method of three spectral datasets

The response variable was the relative content of water in the corn (%, w/w). A detailed description of sample conditions can be found at http://www.eigenvector.com/data/Corn/index.html. The other standard dataset of pharmaceutical tablets included 655 samples. The assay values of the API (%, w/w) were used as a reference value. Baicalin in Yinhuang granules was quantitated with a calibration curve based on the concentration range from 25.4 to 50.8 μg mL−1 of baicalin using consecutive injections of different concentrations. The regression equation was y = 20.25x + 35.10 (r = 1.0000) with y being the peak area in mAUs and x being the concentration of injection (μg mL−1). We concluded that high performance liquid chromatography (HPLC) was suitable for quantitation. Baicalin concentrations varied 1.61% to 6.66% (mg mg−1). The reference values of baicalin were accurate and could be used in NIR models.

The selection of sample sets using three datasets

The Kennard-Stone (K-S) algorithm selected the calibration and validation sets. The corn samples included 53 samples for calibration set, and 27 samples for external validation. In addition, the Yinhuang granule dataset used 48 and 24 samples in the calibration and validation sets, respectively. The three different subsets of pharmaceutical tablets were used as the calibration set (155), validation set (460), and test set (40). The statistics for water, baicalin, and API contents in the calibration, validation, and test sets are summarized in Table 1.
Table 1

The statistic of water, baicalin and API contents in the calibration and validation sets.

Analytesample setsample numberMin(%)Max(%)Mean(%)SD(%)
corncalibration539.3810.9810.220.41
validation279.6710.9910.250.34
Yinhuang granulescalibration481.616.433.940.02
validation242.046.663.601.18
pharmaceutical tabletscalibration15540.3063.7050.785.88
validation46041.1063.3049.764.26
test4050.2052.4051.710.54

Processing trajectory of PLS model

Using the corn dataset as an example, the calibration spectra were preprocessed with different methods including Standard Normal Variate (SNV) and Savitzky-Golay smoothing with 9 points (SG(9)), as well as SG(9) combined with the derivative spectra. The latent factor was set from 1 to 10 to avoid over-fitting. VIP was then used to select variables with different latent factors. Finally, the process routines from PLS model development and validation were selected (Fig. 2). The parameters for PLS models in water, baicalin and API are shown in Table. 1s-3s. There are various trends in the model evaluation indexes. In Fig. 2a, we see that the RMSEC and RMSEP decreased and the Rcal2, Rpre2, and RPD increased with increasing latent factor coupled with different pretreatment methods.
Figure 2

Schematic diagram of processing trajectory and assessment of PLS model corn samples (a), Yinhuang granules samples (b) and pharmaceutical tablets sample (c).

The model for Yinhuang granules dataset is shown in Fig. 2b. This was different from the corn dataset model. The Yinhuang granules dataset model used first derivative spectra (1D) combined with SG(9) and SNV to preprocess the spectra. This was superior to other spectra pretreatment methods in which the various trends of the model evaluation indexes were unclear. The model for pharmaceutical tablets is shown in Fig. 2c. Changes in model evaluation indexes were not obvious when the latent factor increased to a certain value. However, this result showed that there was more than one modeling path that can ensure a successful model. There were two fair PLS models with RPD between 2.5 to 3 (Fig. 3a) including: 1) a combinational method of SG(9) spectral pretreatment, VIP, and 10 factors, and 2) a combinational method using the second derivative spectra (2D) combined with SG(9) spectral pretreatment, VIP, and 10 factors. Most of PLS models were fair and there were also some good model paths with RPD values greater than 3 (Fig. 3b). In Fig. 3c, there were many models with good performance that adopted a processing trajectory.
Figure 3

Schematic diagram of processing trajectory of PLS model corn samples (a), Yinhuang granules samples (b) and pharmaceutical tablets sample (c).

In the previous modeling process routine, the parameters included spectral pretreatment methods, variable selection and latent factors were optimized one at a time. Table 4s shows that this is a good approach to path modeling versus step-by-step parameter optimization. The optimal nonsystematic parameters of the water PLS model were the raw spectra and VIP-selecting variables under 7 factors; the model performance was poor. However, processing trajectory showed that two fair model could be obtained through combinational methods using SG(9) pretreatment, VIP, and 10 factors as well as a combinational method of 2D + SG(9) pretreatment, VIP, and 10 factors. The best nonsystematic parameter combination for the baicalin PLS model was SNV pretreatment and VIP-selecting variables under 4 factors. This gave good model performance. However, the processing trajectory showed eight good models with different systematic parameter combinations. The best parameter combination of the baicalin PLS model was a SNV spectra pretreatment with VIP-selecting variables under 3 factors. The parameter combination for the API model used raw spectra with VIP-selecting variables under 4 factors. These were optimized one by one. The model performance was very good. However, there were 18 very good models and 5 excellent models with RPD values above 4 that were obtained by processing trajectory. The best parameter combination for the PLS models were SG(9) spectra pretreatment and VIP selecting variables under 10 factors. The best models were obtained through a step-by-step optimization and processing trajectory that was tested with a test set of 40 independent samples. The RMSEP of the two models were 0.7904% and 0.5681%, respectively. This demonstrated that the model obtained through the processing trajectory was better than the model optimized step-by-step. This illustrates that the best systematic optimal model parameters were obtained via the processing trajectory.

Development and validation of calibration models

The model validity was characterized with RMSEC, RMSEP, Rcal2, Rpre2, and RPD. Taking the calibration model of the corn dataset as an example, Fig. 2a showed that the PLS model of water with SG(9) pretreatment and VIP-selecting variables under 10 factors had the best performance. The RMSEC and Rcal2 of the calibration set were 0.1042% and 0.9321, respectively. The RMSEP was 0.1256% – quite close to RMSEC. The RPD and Rpre2 of the validation set were 2.6387 and 0.8554, respectively. These results demonstrated that the PLS model of water also had a good predictive performance. Similarly, we developed the PLS model of baicalin with SNV spectral pretreatment and VIP-selecting variables under 3 factors. The RMSEC and Rcal2 of the calibration set were 0.5609% and 0.8250, respectively. The RMSEP, Rpre2 and RPD of the baicalin validation set were 0.3524%, 0.9066, and 3.2723, respectively. The model of API with SG(9) pretreatment and VIP-selecting variables under 10 factors was also established. The RMSEC and Rcal2 of the calibration set were 1.0048% and 0.9706. The RMSEP, Rpre2 and RPD of the API validation set were 0.9581%, 0.9493, and 4.4581, respectively. Figure 4 presents data for the PLS models using three datasets. The prediction values were quite close to the wet analysis. The parameters for water, baicalin and API models indicated that NIR could be used for the determination of water, baicalin, and API of corn, Yinhuang granules, and pharmaceutical tablets, respectively.
Figure 4

Correlation between the prediction and reference values of corn samples (a), Yinhuang granules samples (b) and pharmaceutical tablets sample (c).

Discussion

We proposed the use of processing trajectory to develop and optimize multivariate calibration models. The PLS models were established with different spectral pretreatments and VIP variable selection methods with different latent factors. Based chemometric indicators (RMSEP, RMSEC, Rcal2, Rpre2 and RPD), different PLS models were used to assessed the water, baicalin , and API in corn, Yinhuang granules and pharmaceutical tablets, respectively. The present work demonstrates the feasibility and advantages of processing trajectory in the development and optimization of multivariate calibration models. In conclusion, a systematic procedure for model optimization based on the processing trajectory shows excellent results to develop a robust model. The proposed approach should be integrated into PLS software to improve on the available PLS models.

Methods

Datasets

Corn spectral dataset

NIR diffuse spectra of corn were obtained from the website as standard data (http://www.eigenvector.com/data/Corn/index.html)20. The dataset consists of 80 corn samples. The NIR spectra of corn samples were measured with three spectrometers; the moisture content was included (%, w/w). The spectra were measured with a mp5 NIR spectrometer and the water values were used as reference value. The spectral acquisition range was 1,100–2,498 nm at 2 nm intervals resulting in a total of 700 variables per sample.

Yinhuang granules spectral dataset

The dataset consists of diffuse reflectance NIR spectra with 72 Yinhuang granule samples. The response variable was the relative content of active baicalin in the granules (%, w/w). The baicalin content was measured with HPLC as described in the Ch.P. (2010 Edition, Volume I). We used an Agilent 1100 HPLC system (Agilent Technologies, USA) with a vacuum degasser, a quaternary pump, an auto sampler, a thermostatic column compartment, and a diode array detector (DAD). Separation was performed on an ODS column (150 mm, 4.6 mm, 5 mm, Waters, USA) with isocratic elution of the mobile phase consisting of methanol, water and phosphoric acid (50:50:0.2, v/v) at a flow rate of 1.0 mL/min. The column temperature was 30 °C, and the detection wavelength was 274 nm. The NIR spectra were collected in an integrating sphere diffuse mode with an Antaris Nicolet FT-NIR system (Thermo Fisher Scientific Inc., USA). Each sample spectrum was the result of 32 scans from 10,000 to 4,000 cm−1 (a total of 1557 wavenumber variables per sample) at ambient temperature using 8 cm−1 resolution. Every sample was scanned three times, and the final spectrum used for each sample was an average of the three results. All NIR spectra were collected and archived using the Thermo Scientific Result software.

Pharmaceutical tablets spectral dataset

The dataset for the pharmaceutical tablets was available at http://www.eigenvector.com/data/tablets/index.html. It contains 1,308 spectra of 655 pharmaceutical tablets measured on two similar instruments (Foss/NIR Systems Multitab Spectrometers). The spectral region was from 600 to 1,898 nm with 2 nm increments. The data of each instrument were organized into three different subsets. The assay value of the active ingredient, tablet weight, and tablet hardness were provided. In this work, we used the spectra of the first instrument and the assay values of the API (%, w/w). There were three different sets of 155, 460, and 40 spectra that were used as calibration, validation, and test sets in this study, respectively.

Multivariate data analyses

The spectra pretreatment and model development were performed with Unscrambler 9.7 software package (CAMO Software AS, Norway). The VIP-based variable selection methods were implemented with custom routines in MATLAB (MATLAB, The MathWorks, Massachussetts).

Summary of the proposed procedure

The procedure we used to track and evaluate modeling processes with different spectral pretreatment methods, latent factors, and VIP variable selection is summarized in Fig. 2 and is detailed in the following section. Step 1. The modeling parameters and their levels were defined, which contained spectral pretreatment methods, latent factors, and variable selection methods. Step 2. The evaluation indexes of model were established to include RMSEC, Rcal2, RMSEP, and Rpre2. Important parameters for the RPD were also included to assure the model assessment. Step 3. The calibration and validation data sets were selected to ensure that both datasets were representative of the experimental design. Step 4. We ran the PLS models established for all systematic parameters optimization. The results for the evaluation indexes of each model were registered. Step 5. The results were analyzed, and the trajectory routes for PLS modeling were defined. Step 6. PLS model was established at the best systematic optimization of parameters. When necessary, the model was further refined to exclude samples with higher spectral and concentration residuals. Step 7. The model was applied to routine analysis and was continuously monitored to check for needed updates.

Multivariate calibration

The PLS was used to build calibration models. First, the K-S algorithm was used to split the data set into calibration and validation sets (2:1) such that the samples could be selected to represent the entire experimental domain. We used spectral pretreatment to decrease baseline shifts and remove the scattering effect created by diffuse reflectance and overlapping peaks. These caused detrimental effects on the signal-to-noise ratio. We used SNV, SG(9), and SG(9) combined with derivative spectra to remedy this. Meanwhile, VIP was utilized to select variables for identifying highly informative features21. The performance of the regression models was evaluated in terms of the Rcal2 and RMSEC. The prediction ability of the external validation was assessed by RMSEP, Rpre2 and RPD. Generally, the optimal number of latent factors was determined from the result of 10-fold cross validation tests after spectral pretreatment. The VIP variable selection algorithms were then used to select the characteristic variables. The ideal methods for spectral pretreatment, the number of latent factors and models were selected based on the RMSEC, RMSEP, Rcal2, Rpre2, RPD, etc. Unlike the previous method, this approach uses the global optimal strategy and trajectory routes of all possible parameter combinations.

Additional Information

How to cite this article: Zhao, N. et al. Optimization of Parameter Selection for Partial Least Squares Model Development. Sci. Rep. 5, 11647; doi: 10.1038/srep11647 (2015).
  14 in total

1.  Using FTIR spectroscopy to model alkaline pretreatment and enzymatic saccharification of six lignocellulosic biomasses.

Authors:  Deborah L Sills; James M Gossett
Journal:  Biotechnol Bioeng       Date:  2011-11-22       Impact factor: 4.530

2.  Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils.

Authors:  Olivier Devos; Gerard Downey; Ludovic Duponchel
Journal:  Food Chem       Date:  2013-10-14       Impact factor: 7.514

3.  Quantitative evaluation of multiple adulterants in roasted coffee by Diffuse Reflectance Infrared Fourier Transform Spectroscopy (DRIFTS) and chemometrics.

Authors:  Nádia Reis; Adriana S Franca; Leandro S Oliveira
Journal:  Talanta       Date:  2013-06-14       Impact factor: 6.057

4.  Internal and external validation strategies for the evaluation of long-term effects in NIR calibration models.

Authors:  Valeria Sileoni; Frans van den Berg; Ombretta Marconi; Giuseppe Perretti; Paolo Fantozzi
Journal:  J Agric Food Chem       Date:  2011-02-11       Impact factor: 5.279

5.  Reproducibility, complementary measure of predictability for robustness improvement of multivariate calibration models via variable selections.

Authors:  Hae Woo Lee; Andrew Bawn; Seongkyu Yoon
Journal:  Anal Chim Acta       Date:  2012-11-01       Impact factor: 6.558

6.  Fast determination of trace dimethyl fumarate in milk with near infrared spectroscopy following fluidized bed enrichment.

Authors:  Ya-Jie Xie; Zhuan Wang; Wan-Peng Hu; Song Xu
Journal:  Anal Bioanal Chem       Date:  2012-10-03       Impact factor: 4.142

7.  Development and validation of NIR-chemometric methods for chemical and pharmaceutical characterization of meloxicam tablets.

Authors:  Ioan Tomuta; Rares Iovanov; Ede Bodoki; Loredana Vonica
Journal:  Drug Dev Ind Pharm       Date:  2013-04-17       Impact factor: 3.225

8.  A novel model selection strategy using total error concept.

Authors:  Zhisheng Wu; Qun Ma; Zhaozhou Lin; Yanfang Peng; Lu Ai; Xinyuan Shi; Yanjiang Qiao
Journal:  Talanta       Date:  2013-01-10       Impact factor: 6.057

9.  Quality control of Lonicerae Japonicae Flos using near infrared spectroscopy and chemometrics.

Authors:  Wenlong Li; Zhiwei Cheng; Yuefei Wang; Haibin Qu
Journal:  J Pharm Biomed Anal       Date:  2012-09-24       Impact factor: 3.935

10.  Multivariate detection limits of on-line NIR model for extraction process of chlorogenic acid from Lonicera japonica.

Authors:  Zhisheng Wu; Chenglin Sui; Bing Xu; Lu Ai; Qun Ma; Xinyuan Shi; Yanjiang Qiao
Journal:  J Pharm Biomed Anal       Date:  2013-01-02       Impact factor: 3.935

View more
  12 in total

1.  Chemical fingerprinting and quantitative monitoring of the doping drugs bambuterol and terbutaline in human urine samples using ATR-FTIR coupled with a PLSR chemometric tool.

Authors:  Faisal K Algethami; Sherif M Eid; Khadiga M Kelani; Mohamed R Elghobashy; Mohamed K Abd El-Rahman
Journal:  RSC Adv       Date:  2020-02-17       Impact factor: 3.361

2.  Online near-infrared analysis coupled with MWPLS and SiPLS models for the multi-ingredient and multi-phase extraction of licorice (Gancao).

Authors:  Yang Li; Mingye Guo; Xinyuan Shi; Zhisheng Wu; Jianyu Li; Qun Ma; Yanjiang Qiao
Journal:  Chin Med       Date:  2015-12-18       Impact factor: 5.455

3.  High-throughput analysis of chemical components and theoretical ethanol yield of dedicated bioenergy sorghum using dual-optimized partial least squares calibration models.

Authors:  Meng Li; Jun Wang; Fu Du; Boubacar Diallo; Guang Hui Xie
Journal:  Biotechnol Biofuels       Date:  2017-09-04       Impact factor: 6.040

4.  Systematic discovery about NIR spectral assignment from chemical structural property to natural chemical compounds.

Authors:  Lijuan Ma; Yanfang Peng; Yanling Pei; Jingqi Zeng; Haoran Shen; Junjie Cao; Yanjiang Qiao; Zhisheng Wu
Journal:  Sci Rep       Date:  2019-07-01       Impact factor: 4.379

5.  Spatiotemporal predictions of obesity prevalence in Chinese children and adolescents: based on analyses of obesogenic environmental variability and Bayesian model.

Authors:  C Guo; H Wang; G Feng; J Li; C Su; J Zhang; Z Wang; W Du; B Zhang
Journal:  Int J Obes (Lond)       Date:  2018-12-19       Impact factor: 5.095

6.  Process Analytical Technology for Advanced Process Control in Biologics Manufacturing with the Aid of Macroscopic Kinetic Modeling.

Authors:  Martin Kornecki; Jochen Strube
Journal:  Bioengineering (Basel)       Date:  2018-03-16

7.  Novelty application of multi-omics correlation in the discrimination of sulfur-fumigation and non-sulfur-fumigation Ophiopogonis Radix.

Authors:  Shengyun Dai; Zhanpeng Shang; Fei Wang; Yanfeng Cao; Xinyuan Shi; Zhaozhou Lin; Zhibin Wang; Ning Li; Jianqiu Lu; Yanjiang Qiao; Jiayu Zhang
Journal:  Sci Rep       Date:  2017-08-30       Impact factor: 4.379

8.  Novel NIR modeling design and assignment in process quality control of Honeysuckle flower by QbD.

Authors:  Lijuan Ma; Daihan Liu; Chenzhao Du; Ling Lin; Jinyuan Zhu; Xingguo Huang; Yuan Liao; Zhisheng Wu
Journal:  Spectrochim Acta A Mol Biomol Spectrosc       Date:  2020-07-19       Impact factor: 4.098

9.  Pharmaceutical Analysis Model Robustness From Bagging-PLS and PLS Using Systematic Tracking Mapping.

Authors:  Na Zhao; Lijuan Ma; Xingguo Huang; Xiaona Liu; Yanjiang Qiao; Zhisheng Wu
Journal:  Front Chem       Date:  2018-07-06       Impact factor: 5.221

Review 10.  A critical review of recent trends, and a future perspective of optical spectroscopy as PAT in biopharmaceutical downstream processing.

Authors:  Laura Rolinger; Matthias Rüdt; Jürgen Hubbuch
Journal:  Anal Bioanal Chem       Date:  2020-03-07       Impact factor: 4.142

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.