
Predictive Modeling of Lignin Content for the Screening of Suitable Poplar Genotypes Based on Fourier Transform-Raman Spectrometry.

Wenli Gao^1,2, Ting Shu^3, Qiang Liu^3, Shengjie Ling^3, Ying Guan^1, Shengquan Liu^1,2, Liang Zhou^1,2.

Abstract

The quick and non-invasive evaluation of lignin from biomass has been the focus of much attention. Several types of spectroscopies, for example, near-infrared (NIR) and Fourier transform-Raman (FT-Raman), have been successfully applied to build quantitative predictive lignin models based on chemometrics. However, due to the effect of sample moisture content and ambient humidity on its signals, NIR spectroscopy requires sophisticated pre-testing preparation. In addition, the current FT-Raman predictive models require large variations in the independent value inputs as restrictions in the corresponding mathematical algorithms prevent the effective biomass screening of suitable genotypes for lignin contents within a narrow range. In order to overcome the limitations associated with the current methods, in this paper, we employed Raman spectra excited using a 1064 nm laser, thus avoiding the impact of water and auto-fluorescence on NIR signals. The optimal baseline correction method, data type, mathematical algorithm, and internal reference were selected in order to build quantitative lignin models based on the data with limited variation. The resulting two predictive models, constructed through lasso and ridge regressions, respectively, proved to be effective in assessing the lignin content of poplar in large-scale breeding and genetic engineering programs.


Year: 2021 PMID: 33817518 PMCID: PMC8015071 DOI: 10.1021/acsomega.1c00400

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Lignocellulosic material, considered one of the most promising substitutes for fossil energy, originates from the cell walls of plants and has extensive applications in biofuels and bio-based chemicals.[1] To measure the chemical content of lignocellulose (i.e., cellulose, lignin, and hemicellulose), wet chemical methods (WCMs) such as the acetyl bromide, Klason, or acid-insoluble lignin and Van Soest methods are commonly applied.[2] However, the majority of such methods are costly, time-consuming, and employ corrosive reagents (e.g., acetyl bromide and H2SO4).[2] Genetic or breeding programs typically require the assaying of a large number of lignocellulosic samples, and the low throughput of WCMs hinders this work.[3,4] Furthermore, the minimum sample weight requirement of WCMs often exceeds the breeding program resources. Thus, quick and non-invasive methods have been established to replace the WCMs. For example, vibrational spectroscopy, including near-infrared (NIR) and Raman spectroscopies, has been integrated with chemometrics to obtain the chemical content of lignocellulose.[1,4−7] NIR accompanied by specific mathematical models can predict most chemical components in biomass. However, the absorbance signal in the NIR region is around 10–100 times weaker than that of the mid-infrared (MIR) region.[8] Moreover, the moisture content of the samples and the humidity of the ambient atmosphere dramatically influence the NIR signal by forming “water peaks” at ∼1450 and ∼1890 cm–1, corresponding to OH bonds.[9,10] A relatively high sample moisture content can result in overwhelming peaks.
Therefore, in terms of the chemical content assay of lignocellulose, the results obtained by NIR often need to be verified with additional characterization techniques.[5,7] Raman spectroscopy can complement NIR and has also been successfully applied to evaluate the lignin content.[4,5,7] Unlike NIR, Raman spectroscopy is insensitive to the sample moisture content.[35] Suitable mathematical models have recently been employed to predict the ratio of syringyl-to-guaiacyl (S/G) units in lignin as well as to classify lignin.[11,12] However, the auto-fluorescence originating from lignin or other chemicals in lignocellulosic materials can lead to a broad signal in the resultant Raman spectrum.[13] Furthermore, the conjugated lignin substructures can contribute additional scattering to the Raman signals, resulting in a deviation from the prediction.[3] To overcome these limitations, NIR excitation using a 1064 nm laser has been proposed to significantly reduce the spectral background caused by the fluorescence emission.[14−16] In addition, mathematical techniques including spectral data processing and anti-collinearity algorithms have proved to be effective for dealing with the combination bands and higher-order overtones of the fundamental signals in quantitative NIR models.[1] Such mathematical techniques are expected to suppress the negative effect of the conjugated aromatic structures in quantitative Raman modeling.
Poplar has the largest plantation area in the world and has recently been identified as a key potential resource for biomass and biofuels.[1] The high variation in lignin content among different poplar genotypes requires a quick and non-invasive method to detect lignin content in large genetic or breeding programs.[17] Thus, in the current paper, wood samples were obtained from several poplar clones, and both their lignin contents (via traditional WCMs) and their Raman spectra were determined. Different models were then organized according to the baseline correction strategy, data type, mathematical algorithm, and internal reference peak. These models were evaluated using the Pearson correlation coefficient (R) and root-mean-square error (RMSE) in order to determine the optimal model. This work evaluates the ability of Raman spectroscopy to predict the lignin content in genetic breeding selection programs.

Results and Discussion

Lignin Content of Poplar Samples

Table 1 reports the WCM-derived lignin contents of the poplar samples, which range from 20.31 to 25.69 wt % with an average of 22.79 wt %. This is in agreement with previous reports on similar poplar genotypes.[1,18]

Table 1

Lignin Contents of Poplar Samples for the Training and Test Sets

training set sample	lignin content (wt %)	training set sample	lignin content (wt %)	test set sample	lignin content (wt %)
S1	20.31^a ± 0.73^b	S9	22.37 ± 0.16	T1	21.49 ± 0.01
S2	20.72 ± 0.26	S10	22.47 ± 0.20	T2	22.43 ± 1.27
S3	21.22 ± 1.68	S11	22.86 ± 0.74	T3	22.86 ± 0.81
S4	21.28 ± 0.69	S12	22.97 ± 0.63	T4	22.88 ± 0.34
S5	21.74 ± 0.53	S13	23.12 ± 0.58	T5	22.97 ± 1.22
S6	22.04 ± 0.97	S14	23.90 ± 0.46	T6	24.40 ± 0.07
S7	22.10 ± 0.22	S15	24.99 ± 0.78	T7	24.64 ± 0.98
S8	22.16 ± 0.13	S16	25.69 ± 0.82	T8	25.45 ± 0.78

^a Mean (% w/w).

^b Standard deviation (% w/w).

Evaluating Parameters

R and RMSE were selected to evaluate the quality of the mathematical models in the training and test sets. R and Rp represent the correlation between the measured and predicted values in a model, with values closer to 1 or −1 indicating a higher accuracy. Here, R′ and Rp′ denote the absolute values of R and Rp, respectively. The values of R′ and Rp′ ranged from 0.13 to 0.95 and 0.01 to 0.90, respectively (Figures 1, 2, S7, and S8). The evaluation results demonstrate the considerable variation of R′ and Rp′ across different baseline correction methods, data types, and mathematical algorithms. This indicates the ability of R′ and Rp′ to monitor the quality of the predictive models, as is the consensus in the literature.[19−21] RMSE and root-mean-square error of prediction (RMSEP) denote the standard deviation of the residuals and are commonly used in regression analysis to verify experimental results. RMSE and RMSEP values close to 0 theoretically represent a perfect match between predicted and measured values. In the current paper, we calculated RMSE and RMSEP using the Raman peak intensities. The RMSE and RMSEP values ranged from 0.47 to 2.52 and 0.75 to 2.97, respectively (Figures 1, 2, S7, and S8). The values varied with the baseline correction methods, mathematical algorithms, and data types, indicating the effectiveness of RMSE and RMSEP in describing the quality of predictive models.[22]

Figure 1

R′, Rp′, RMSE, and RMSEP values of the principal components regression (PCR), partial least-squares regression (PLSR), ridge regression (RR), and lasso regression (LR) models based on the B2-I dataset. (a) R′ of the models; (b) Rp′ of the models; (c) RMSE of the models; and (d) RMSEP of the models.

Figure 2

R′, Rp′, RMSE, and RMSEP values of the PCR, PLSR, RR, and LR models based on the B2-A dataset. (a) R′ of the models; (b) Rp′ of the models; (c) RMSE of the models; and (d) RMSEP of the models.

Selection of the Baseline Correction Strategy

The baseline correction process is commonly applied to the Raman spectra to reduce the disturbing signals from baseline drift. In general, the number of baseline set points can alter each peak’s extracted value, irrespective of whether the intensity or the area is extracted. The greater the number of baseline set points, the less impact the background will have on the extracted value. However, it will also reduce the signal originating from the molecular vibration of lignocellulosic materials.[23] Therefore, a suitable baseline correction method is required in order to balance these effects. We tested two baseline correction strategies to select the better one. Figure 3 presents the evaluation parameters corresponding to each of the strategies.

Figure 3

Comparison of two baseline correction strategies based on model quality. (a) R′ and Rp′ of the models and (b) RMSE and RMSEP of the models.

The average and maximum values of R′ for the B2 strategy were higher than those for B1, while the opposite was observed for Rp′. Moreover, the average and minimum values of RMSE and RMSEP for B2 were lower than those for B1. This indicates a shorter distance between the predicted and observed values for the models with six baseline set points. In addition, the relationship between the predicted and observed values was enhanced in the training set and weakened in the test set. The maximum value of Rp′ (0.87) for B2 was close to that of B1 (0.90). These results indicate B2 to be a more favorable baseline correction strategy than B1 for reliable and accurate predictive models of lignin content based on the Raman spectra.

Selection of Data Type

The predictive models used to describe the algebraic relationship between the lignin content and its specific scattering peaks were established based on the intensity (I type) and area (A type) of the specific Raman peaks. Figure 4 compares the models based on R′, Rp′, RMSE, and RMSEP.

Figure 4

Comparison of two models based on different data types [area (A) and intensity (I)] extracted from Raman spectra. (a) R′ and Rp′ of the models and (b) RMSE and RMSEP of the models.

The average values of R′ and Rp′ for the A type models exceeded those of the I type models (Figure 4); however, the latter exhibited the highest R′ value (0.95). Moreover, the average values of RMSE and RMSEP were similar for both model types, with minimum values lower for the I type models compared to the A type models. The results of the model parameters suggest that the I type model is suitable for the analysis of lignin content in lignocellulosic materials. This may be attributed to the complex chemical components of the lignocellulosic materials, resulting in the overlap of Raman peaks.[3] These overlapping Raman peaks introduce uncertainty into the peak areas obtained after deconvolution.

Selection of the Mathematical Algorithm

PCR, PLSR, RR, and LR were applied to fit the models predicting the relationship between the lignin content and specific Raman peaks in the lignocellulosic material (Figure 5). The RR algorithm exhibited the highest average (0.74 and 0.31) and maximum (0.95 and 0.90) R′ and Rp′ values among the four models (Figure 5a). The maximum R′ value (0.94) of the LR algorithm was similar to that of the RR algorithm (0.95). However, the average R′ value (0.62) of the LR algorithm was much lower than that of the RR algorithm (0.74). The PCR and PLSR algorithms exhibited lower maximum R′ values of 0.78 and 0.88, respectively. Figure 5a suggests that the RR and LR algorithm models are more suitable for the fitting of the lignin content in the lignocellulosic material compared to PCR and PLSR. The RMSE and RMSEP values of the four algorithm models in Figure 5b confirmed this result.

Figure 5

Comparison of four models based on different algorithms (PCR, PLSR, RR, and LR). (a) R′ and Rp′ of models and (b) RMSE and RMSEP of models.

All four algorithms are considered effective for the analysis of multiple regression data with multi-collinearity.[24] PCR and PLSR reduce the collinearity by reducing the number of variables via principal component extraction, with the threshold value set as 90% in the current paper. In contrast, RR and LR adopt a different technique to handle the collinearity. A penalty term controlled by a regularization parameter (λ, see Section 2.6 in the Supporting Information) takes the place of the principal component analysis used by PCR and PLSR in reducing the model bias without reducing the number of variables.[25] The sum of the absolute penalty terms in LR varies until the Mallows’ Cp value reaches a minimum.[26] The ridge parameter K plays the same role in controlling the penalty term in RR,[27] iteratively reaching its optimal value. Therefore, at the same bias level, both RR and LR can potentially hold more reliable information from the variables compared to PCR and PLSR. Our results reveal higher maximum values of R′ and Rp′ for the RR and LR algorithms compared to the PCR and PLSR algorithms.

Ill-posed and ill-conditioned problems, in which small changes in the input cause disproportionately large changes in the output value of the function, are often reported when building a predictive model.[28,29] The problem is generally evaluated by the condition number, which is calculated from the independent data. In the current paper, we determined the condition number from the data extracted from the Raman spectra (Figure S2). Values higher than 1 suggest a problem in the model.[28,29] In our case, the predicted value of lignin content will be exaggerated by a minimal alteration in the Raman spectra. However, research has suggested the ability of the RR and LR algorithms to suppress the negative effect caused by an ill-posed or ill-conditioned problem.[30−32] This explains the lower maximum values of RMSE and RMSEP in RR and LR compared to PCR and PLSR.
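The link between collinearity, the condition number, and the ridge penalty can be illustrated with a small sketch. It is not the authors' implementation: the data are hypothetical, the condition number is taken as the ratio of the largest to the smallest eigenvalue of XᵀX (one common definition, assumed here), and only two predictors without an intercept are handled so that the closed-form ridge solution β = (XᵀX + λI)⁻¹Xᵀy stays readable. Lasso replaces the squared penalty with an absolute-value penalty and has no closed form, so it is omitted.

```python
import math

def gram_matrix(X):
    """Return X^T X for a matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]

def condition_number_2x2(G):
    """Largest/smallest eigenvalue ratio of a symmetric positive-definite 2x2 matrix."""
    a, b, d = G[0][0], G[0][1], G[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return (mean + radius) / (mean - radius)

def ridge_fit_2d(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two predictors, no intercept."""
    G = gram_matrix(X)
    a, b, d = G[0][0] + lam, G[0][1], G[1][1] + lam
    r0 = sum(row[0] * t for row, t in zip(X, y))
    r1 = sum(row[1] * t for row, t in zip(X, y))
    det = a * d - b * b
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]

# Hypothetical, nearly collinear predictors (not data from the paper).
X = [[1.0, 1.01], [2.0, 1.98], [3.0, 3.02], [4.0, 3.99]]
y = [2.1, 3.9, 6.2, 7.9]

G = gram_matrix(X)
# Adding lam = 1 to the diagonal is what the ridge penalty does to X^T X.
G_ridge = [[G[0][0] + 1.0, G[0][1]], [G[1][0], G[1][1] + 1.0]]

print("condition number without penalty:", condition_number_2x2(G))
print("condition number with lam = 1:   ", condition_number_2x2(G_ridge))
print("least squares (lam = 0):", ridge_fit_2d(X, y, 0.0))
print("ridge (lam = 1):        ", ridge_fit_2d(X, y, 1.0))
```

In this toy case the penalty lowers the condition number by several orders of magnitude and shrinks the coefficient vector, which is the stabilizing behavior the paragraph above attributes to RR.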

Selection of the Internal Standard Peak in the Raman Spectra

The sample roughness, the auto-fluorescence of the lignocellulosic materials, and the input power of the laser source all affect the intensity of the Raman signals. Therefore, an internal standard peak was selected to normalize the Raman spectra prior to the quantitative or semi-quantitative analysis of the chemical components in the lignocellulosic materials.[7,33,34] Raman bands used for the analysis are listed in Table 2. Figure 6 demonstrates that the data array normalized by the peak intensity at 2895 cm–1 (Ilp/2895) has the highest maximum and average values for R′ and Rp′ among all groups. This implies the suitability of the 2895 cm–1 peak intensity as an internal standard for standardizing the Raman spectra prior to quantitative or semi-quantitative analysis. This was also confirmed via the RMSE and RMSEP values, whereby Ilp/2895 exhibits the lowest minimum and average values among all combinations.

Table 2

Raman Peak Position and Band Assignments[3,14,16,35,36]

wavenumber/cm^–1	component	assignment
2939	lignin, glucomannan, and cellulose	CH stretching in OCH₃, asymmetric
2895	cellulose	CH and CH₂ stretching
1660	lignin	ring conjugated C=C str. of coniferyl alcohol; C=O str. of coniferaldehyde
1602	lignin	aryl ring stretching, symmetric
1462	lignin and cellulose	lignin methoxy deformation; CH₂ scissoring
1427	lignin	OCH₃ deformation; CH₂ scissoring; guaiacyl ring vibration
1378	cellulose	HCC, HCO, and HOC bending
1331	cellulose	HCC and HCO bending, aliphatic OH stretch
1275	lignin	Aryl-O of aryl OH and aryl O–CH₃; guaiacyl ring (with C=O group) mode
1149	cellulose	stretching and HCC and HCO bending, CC, CO ring breathing, asymmetric
1122	cellulose, xylan, and glucomannan	CC and CO stretching, COC, glycosidic; ring breathing, symmetric
1095	cellulose, xylan, and glucomannan	CC and CO stretching, COC, glycosidic; ring breathing, symmetric
1043	lignin	OC stretching; ring deformation; CH₃ wagging

Figure 6

Comparative lasso and ridge regression results of 12 models based on different internal standards using the B2-I data groups. (a) R′ of the models; (b) Rp′ of the models; (c) RMSE of the models; and (d) RMSEP of the models.

Previous work has demonstrated the ability of the Raman peak centered at 1096 cm–1, which is assigned to the vibration of CC and CO stretching in carbohydrates, to normalize the scattering peak of lignin at 1600 cm–1.[4,35] Moreover, Agarwal (2011) established a linear model based on the relationship between the standardized relative intensity (intensity of peak 1600 cm–1/intensity of peak 1096 cm–1) and the lignin content determined by the WCMs.[4,37] However, the relative intensity needs to be further “corrected” by normalizing with the carbohydrate content in order to reduce the impact of the scattering contribution at 1096 cm–1 assigned to CC and CO stretching of the carbohydrate.[4] The peak located at approximately 2895 cm–1 in the Raman biomass spectra is generally assigned to the vibration of C–H in CH and CH2 in cellulose.[11,36] This peak was observed to disappear in the Raman spectra of spruce milled wood lignin (MWL). Therefore, the peak at 2895 cm–1 is independent of the lignin content and can be employed to standardize the lignin-related band intensity/area. Note that a “self-absorption” in the C–H stretching vibration of the Raman spectra will occur when water is added to the biomass sample. This may have a negative impact when the peak of the C–H stretching is used to quantify the cellulose content.[33] However, this negative effect is minimal for predictive models of lignin content as the numerator data array includes the peak assigned to the C–H stretching of −OCH3 in lignin. Research has speculated that the “self-absorption” effect will be suppressed by the 2939 cm–1 peaks with the help of anti-collinearity algorithms.[26] These observations suggest the 2895 cm–1 peak intensity to be the best internal standard option for the predictive model.

Description of the Optimal Predictive Models

Following the optimization of the baseline correction strategy, data type, mathematical algorithm, and internal standard peaks, two predictive models were established using the RR and LR mathematical algorithms. The R′, Rp′, RMSE, and RMSEP of these models indicate that both models are suitable for the prediction of lignin content in lignocellulosic materials via their Raman spectra (Table 3). Figure 7 depicts the lignin content determined by the WCMs and the Raman spectra predictive models. The data points of the RR and LR models are distributed close to the line of equality in Figure 7a,b. This is attributed to the similar techniques applied by both models to reduce the multi-collinearity between the actual and predicted values of the lignin content.

Table 3

Statistical Results for the Training, Cross-Validation, and Test Sets of the RR and LR Models

data group	algorithm	R′	RMSE	R_V′	RMSEV	R_p′	RMSEP
B2-I_lp/2895	RR	0.95	0.47	0.95	0.41	0.86	0.88
	LR	0.94	0.52	0.95	0.48	0.87	0.75

Figure 7

Lignin content determined from the WCMs and Raman spectra based on the B2-Ilp/2895 dataset (a) in the RR model and (b) in the LR model.

The evaluation parameters determined for the training and test sets of the two models agree with those reported for the lignin content of wood samples.[1] However, our observed parameters are inferior to those of models predicting the lignin content of kraft pulps or pre-treated biomass.[4,35] Despite this, the aforementioned models do not meet the requirements of genetic breeding scientists, as their input lignin contents (the independent values) vary more widely than those of biomass across genotypes. Quantitative models are generally considered more reliable for predicted values that exhibit a variation equivalent to that of the independent value inputs. This suggests that the proposed model is more suitable than those in the literature for the prediction of lignin content in lignocellulosic materials from different genotypes. With the inclusion of the optimal coefficients and constants, the lasso regression model is described as follows [equation missing in source] and the ridge regression model is determined as [equation missing in source], where Clignin is the lignin content and I1043/2895 is the intensity ratio between the 1043 and 2895 cm–1 peaks. The remaining parameters follow the same definitions. Our results demonstrate the ability of the two models to predict the lignin content for the screening of poplar genotypes based on their Raman spectra, particularly for the optimized building process detailed in the previous sections.

Leave-One-Out Cross-Validation for Evaluating the Optimal Predictive Models

To evaluate the stability and robustness of the quantitative models, leave-one-out cross-validation (LOOCV) was performed to estimate the performance of the optimal predictive models described in the previous section. In each iteration of LOOCV, one data group was set aside and the remaining data groups were used for training. The statistical results of LOOCV are listed in Table 3. The ridge regression model and the lasso regression model for lignin content both performed well in cross-validation: the RV′ value is 0.95 for both, and the root-mean-square error of validation (RMSEV) values are 0.41 and 0.48, respectively. The values calculated from the validation datasets are nearly equivalent to those calculated from the training sets. Therefore, the LOOCV results demonstrate that the two optimal models are stable and robust,[1] which further verifies the reliable predictive performance of the quantitative lignin models.
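The LOOCV loop can be sketched as follows. This is a minimal illustration, not the authors' code: a one-variable least-squares line stands in for the RR and LR models, the intensity ratios are made-up numbers, and only the lignin contents (S1 to S6) are taken from Table 1.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loocv_predictions(x, y):
    """Predict each sample from a model trained on all the other samples."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]   # leave sample i out
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        predictions.append(slope * x[i] + intercept)
    return predictions

def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured values."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)

# Hypothetical intensity ratios paired with the S1-S6 lignin contents of Table 1.
ratios = [0.52, 0.55, 0.58, 0.60, 0.63, 0.66]
lignin = [20.31, 20.72, 21.22, 21.28, 21.74, 22.04]

held_out = loocv_predictions(ratios, lignin)
print("RMSEV of the stand-in model:", rmse(held_out, lignin))
```

Each sample is predicted by a model that never saw it, so the resulting RMSEV is an estimate of out-of-sample error rather than of goodness of fit.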

Conclusions

In the current paper, we established an effective predictive model of lignin content for the screening of poplar genotypes based on their Raman spectra. We employed the intensity of deconvoluted Raman peaks to establish mathematical prediction models for the lignin content of lignocellulosic materials, with the Raman peak centered at 2895 cm–1 considered as the standard internal peak. We reveal the capability of the RR and LR models to fit the lignin content in the lignocellulosic material via the collected Raman spectra. This highlights the potential of these models in assessing the lignin content in large-scale breeding and genetic engineering programs.

Materials and Methods

Materials

Eight poplar clones, namely, Populus deltoides CL. “55/65”, Populus euramericana cv. “Zhonglin46”, P. euramericana cv. Guariento, P. deltoides CL. “2KEN8”, Populus nigra CL. “N179”, P. deltoides CL. “Danhong”, P. euramericana CL. “Sangju”, and P. deltoides CL. “Nanyang”, were planted at the Jiaozuo Forest Farm in Jiaozuo City, Henan Province, China (113°13′E, 35°14′N), in 2010. Three trees of each clone were harvested in November 2018, and the discs at breast height were cut and processed into particles in order to study the chemical components. A total of 24 different wood samples were obtained. All samples were stored at room temperature for more than 1 month at a moisture content of approximately 9.5%. The wood particles were ground into a fine powder using a pulverizer. The 40–60 mesh wood powder was sieved out to measure the lignin content via the WCMs, while the 100–120 mesh wood powder was reserved for the collection of Raman spectra.

Methods

Lignin Content Measurements

Lignin is typically divided into two parts during the compositional analysis of lignocellulosic materials with WCMs, namely, Klason lignin and acid-soluble lignin.[18,38−42] The Klason lignin measurements in this paper employed the official TAPPI (Technical Association of the Pulp and Paper Industry) T222 om-06 test method,[43] while acid-soluble lignin measurements were based on TAPPI UM-250.[39,43] The lignin content was then taken as the sum of Klason and acid-soluble lignins. The average value of three technical replicates was registered as the lignin content of each sample. Therefore, the dataset consisted of 24 lignin content values.
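The arithmetic behind each table entry (total lignin as Klason plus acid-soluble lignin, averaged over three technical replicates) can be sketched as below. The replicate values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev

def total_lignin(klason, acid_soluble):
    """Total lignin (wt %) as the sum of Klason and acid-soluble lignin."""
    return klason + acid_soluble

def summarize_replicates(replicates):
    """Mean and sample standard deviation of total lignin over technical replicates.

    replicates: list of (klason, acid_soluble) pairs in wt %.
    """
    totals = [total_lignin(k, a) for k, a in replicates]
    return mean(totals), stdev(totals)

# Hypothetical replicate measurements for one wood sample (wt %).
example = [(19.6, 1.9), (20.1, 2.0), (19.9, 2.1)]
avg, sd = summarize_replicates(example)
print(f"lignin content: {avg:.2f} ± {sd:.2f} wt %")
```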

Collection of Raman Spectra

A Fourier transform–Raman spectrometer (Bruker, VERTEX 70-RAMII) was used to collect Raman spectra of the poplar samples. The spectrometer was equipped with a 1064 nm, 500 mW Nd:YAG diode laser. During the test, the ambient relative humidity and temperature were 60 ± 10% and 20 ± 2 °C, respectively. Approximately 2 mg of wood powder was pressed into a sample cell and the collection was repeated three times for each sample. The average spectrum was regarded as the sample spectrum. During measurements, the Raman spectra were excited at 1064 nm, the laser defocus mode was used, and the number of scans was increased to obtain a good signal-to-noise ratio.[4,44] Spectra were collected with 64 scans and a spectral resolution of 4 cm–1 across 200–3600 cm–1.

Processing of Raman Spectra

The Raman spectra of the poplar samples exhibit an excellent resolution (Figure S3). Table 2 details the assignments of the specific peaks. PeakFit 4.12 was employed to process the spectral data, including smoothing, background correction, and band peak-fitting.[45] First, Savitzky–Golay (level 1.0%) smoothing was applied prior to the processing of the spectra. Second, the second derivative zero algorithm in PeakFit 4.12 was used for baseline correction.[46] The baseline set points were chosen according to the second-derivative spectra. The baseline correction was then executed via two different strategies by changing the set points: (i) four set points located at 940, 1750, 2750, and 3050 cm–1 divided the spectra into five sections (B1) and (ii) six set points located at 940, 1210, 1540, 1750, 2750, and 3050 cm–1 divided the whole spectra into seven sections (B2). Regions 940–1750 and 2750–3050 cm–1 include the signals closely related to the molecular vibration in lignocellulosic materials and also exhibit a high resolution in all spectra. Thus, these two regions were selected in strategy B1 for the baseline correction and further peak-fitting (Figures S4 and S5). In addition, regions 940–1210, 1210–1540, 1540–1750, and 2750–3050 cm–1 were used for further processing in strategy B2 (Figure 8). Third, the peak-fitting of overlapping bands was performed for each peak based on Gauss and Lorentz peak-fitting. To ensure the validity of the separated peaks, they were identified according to the second-derivative spectra, and the peak positions were fixed during the peak-fitting process.[47] Figure S6 presents the results of the Raman spectra for lignocellulosic materials.
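How the set points partition a spectrum can be sketched with a deliberately simplified baseline. The code below subtracts a straight line joining the two ends of each region; this is a stand-in chosen for illustration, not the second derivative zero algorithm of PeakFit, and the spectrum is synthetic. The region boundaries are the ones listed above for strategies B1 and B2.

```python
# Regions used for peak-fitting in each strategy (cm^-1), as listed in the text.
B1_REGIONS = [(940, 1750), (2750, 3050)]
B2_REGIONS = [(940, 1210), (1210, 1540), (1540, 1750), (2750, 3050)]

def subtract_linear_baseline(wavenumbers, intensities, start, end):
    """Subtract the straight line joining the first and last points in [start, end].

    Assumes wavenumbers are sorted in ascending order.
    Returns a list of (wavenumber, corrected intensity) pairs for the region.
    """
    region = [(w, i) for w, i in zip(wavenumbers, intensities) if start <= w <= end]
    if len(region) < 2:
        raise ValueError("region needs at least two data points")
    (w0, i0), (w1, i1) = region[0], region[-1]
    slope = (i1 - i0) / (w1 - w0)
    return [(w, i - (i0 + slope * (w - w0))) for w, i in region]

# Synthetic spectrum: sloping background plus one sharp band at 1100 cm^-1.
wavenumbers = list(range(940, 1211, 10))
intensities = [0.01 * w + 5.0 + (3.0 if w == 1100 else 0.0) for w in wavenumbers]

start, end = B2_REGIONS[0]
corrected = subtract_linear_baseline(wavenumbers, intensities, start, end)
print("band height after correction:", dict(corrected)[1100])
```

With more set points (B2), each region is shorter, so a straight-line baseline follows the local background more closely; this is the trade-off between background removal and signal loss discussed in the baseline selection section.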

Table 4

Groups of Independent Input Variables

B1-I	B1-A	B2-I	B2-A
B1-I_lp/1095^a	B1-A_lp/1095^b	B2-I_lp/1095	B2-A_lp/1095
B1-I_lp/1122	B1-A_lp/1122	B2-I_lp/1122	B2-A_lp/1122
B1-I_lp/1275	B1-A_lp/1275	B2-I_lp/1275	B2-A_lp/1275
B1-I_lp/1331	B1-A_lp/1331	B2-I_lp/1331	B2-A_lp/1331
B1-I_lp/1378	B1-A_lp/1378	B2-I_lp/1378	B2-A_lp/1378
B1-I_lp/1462	B1-A_lp/1462	B2-I_lp/1462	B2-A_lp/1462
B1-I_lp/1602	B1-A_lp/1602	B2-I_lp/1602	B2-A_lp/1602
B1-I_lp/1660	B1-A_lp/1660	B2-I_lp/1660	B2-A_lp/1660
B1-I_lp/2895	B1-A_lp/2895	B2-I_lp/2895	B2-A_lp/2895
B1-I_lp/2939	B1-A_lp/2939	B2-I_lp/2939	B2-A_lp/2939
B1-I_lp/sum^c	B1-A_lp/sum	B2-I_lp/sum	B2-A_lp/sum
B1-I_lp/carbohydrate^d	B1-A_lp/carbohydrate	B2-I_lp/carbohydrate	B2-A_lp/carbohydrate

^a Ilp/1095 = Ilp/I1095, the same as follows.

^b Alp/1095 = Alp/A1095, the same as follows.

^c Ilp/sum = Ilp/(I1095 + I1122 + I1275 + I1331 + I1378 + I1462 + I1602 + I1660 + I2895 + I2939).

^d Ilp/carbohydrate = Ilp/(I1095 + I1122 + I1331 + I1378 + I2895).

Figure 8

Four spectral regions of the Raman spectra in the B2 baseline correction strategy at (a) 940–1210; (b) 1210–1540; (c) 1540–1750; and (d) 2750–3050 cm–1. The black line indicates the spectrum before smoothing and baseline correction and the red line indicates the spectrum after smoothing and baseline correction.

Selection and Normalization of Input Data from the Raman Spectra

The intensity (I) and area (A) of each peak in the Raman spectra were determined following the peak-fitting process. As both of these parameters are suitable inputs for predictive models,[4,34] they were compared in order to determine the more suitable one based on the model evaluation parameters. Internal standards were then selected to normalize these input values.[4] A data array of the peak values related to lignin (e.g., 1043, 1275, 1427, 1462, 1602, 1660, and 2939 cm–1), denoted as lp, was assigned to the numerator.[2,3] The values of each prominent peak in the Raman spectra were used as the denominators, i.e., the internal standard candidates. The intensities of the peaks assigned to carbohydrates (1095, 1122, 1331, 1378, and 2895 cm–1) were summed,[2,3] as were the intensities of all selected prominent peaks. These sums were also included in the internal standard candidates (Table 4). Therefore, 48 groups of independent variables were collected from the Raman spectra as inputs for the modeling process.
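The construction of the normalized inputs and the count of 48 groups (2 baseline strategies × 2 data types × 12 internal standard candidates) can be sketched as follows. The peak lists come from the text and the footnotes of Table 4; the function name, the dictionary layout, and the example intensities are hypothetical.

```python
from itertools import product

LIGNIN_PEAKS = [1043, 1275, 1427, 1462, 1602, 1660, 2939]   # numerator array "lp"
CARBOHYDRATE_PEAKS = [1095, 1122, 1331, 1378, 2895]
STANDARD_PEAKS = [1095, 1122, 1275, 1331, 1378, 1462, 1602, 1660, 2895, 2939]

def normalize(values, standard):
    """Divide each lignin-related peak value by the chosen internal standard.

    values: dict mapping peak position (cm^-1) to intensity or area.
    standard: a peak position, "sum", or "carbohydrate".
    """
    if standard == "sum":
        denominator = sum(values[p] for p in STANDARD_PEAKS)
    elif standard == "carbohydrate":
        denominator = sum(values[p] for p in CARBOHYDRATE_PEAKS)
    else:
        denominator = values[standard]
    return {p: values[p] / denominator for p in LIGNIN_PEAKS}

# Enumerate the group labels used in Table 4.
standards = STANDARD_PEAKS + ["sum", "carbohydrate"]
groups = [f"{b}-{t}_lp/{s}" for b, t, s in product(["B1", "B2"], ["I", "A"], standards)]
print(len(groups), "groups, e.g.", groups[0])

# Hypothetical peak intensities for one spectrum (arbitrary units).
intensities = {p: 1.0 for p in set(LIGNIN_PEAKS + STANDARD_PEAKS)}
intensities[2895] = 4.0
print(normalize(intensities, 2895))
```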

Mathematical Algorithms

Multi-collinearity was observed in the values determined from the Raman spectra (Figure S1), resulting in a divergence between the predicted and actual values.[21,48] PCR, PLSR, RR, and LR have all been reported to reduce the negative effect of multi-collinearity and enhance the reliability of mathematical models.[25,49−52] In the current paper, we employed all four algorithms to build the mathematical model based on the original lignin content data and to estimate the lignin content from the Raman spectra. The threshold value for the contribution of all principal components was set at 90% for PCR and PLSR. Based on the model grouping, 16 samples were randomly selected as the training set (S1–S16), with the remaining 8 samples as the test set (T1–T8) (Table 1). Furthermore, LOOCV was used to validate the performance of specified models. The validation set was split from the training set according to the LOOCV rules.[53]
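One way to read the 90% threshold is as a cumulative contribution rule: keep the smallest number of principal components whose combined share of the variance reaches 90%. The sketch below implements that reading, which is an assumption about how the threshold was applied; the eigenvalues are hypothetical.

```python
def components_for_threshold(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose cumulative variance share reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues of the covariance matrix of the input variables.
eigenvalues = [7.0, 2.5, 0.3, 0.2]
print("components kept:", components_for_threshold(eigenvalues))
```

Strongly collinear inputs concentrate the variance in a few components, so the rule discards most variables' worth of dimensions; RR and LR instead keep every variable and shrink the coefficients.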

Model Evaluation

The Pearson correlation coefficient (R) and RMSE were used to evaluate the model qualities of the training sets, while the validation set and the test set model qualities were evaluated via the Pearson correlation coefficient of validation (Rv), the RMSEV, the Pearson correlation coefficient of prediction (Rp), and the RMSEP. The Pearson correlation coefficient (R), also known as the Pearson R statistical test, measures the strength of the relationship between the different variables. R is determined as follows[19−21]

$$R = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$

where X and Y are two different variables and n is the sample size. RMSE measures the error between two datasets and is an indicator of how far apart the predicted values are from the observed values.[54−56] RMSE is described as follows[54−58]

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_{i} - M_{i}\right)^{2}}$$

where P_i is the predicted value for the ith observation in the dataset, M_i is the measured value for the ith observation in the dataset, and n is the sample size.
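The two metrics described above can be computed with a few lines of standard-library Python. This is a minimal sketch using the textbook definitions of the Pearson coefficient and RMSE; the measured values are T1 to T4 from Table 1, while the predicted values are made-up numbers for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured values."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)

measured = [21.49, 22.43, 22.86, 22.88]   # T1-T4, Table 1 (wt %)
predicted = [21.9, 22.1, 23.0, 23.3]      # hypothetical model output (wt %)

r_prime = abs(pearson_r(measured, predicted))   # R' is the absolute value of R
print("R':", r_prime, "RMSE:", rmse(predicted, measured))
```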

