Wenli Gao1,2, Ting Shu3, Qiang Liu3, Shengjie Ling3, Ying Guan1, Shengquan Liu1,2, Liang Zhou1,2. 1. School of Forestry and Landscape Architecture, Anhui Agriculture University, Hefei 230036, Anhui, China. 2. Key Lab of State Forest and Grassland Administration on Wood Quality Improvement & High Efficient Utilization, Hefei 230036, Anhui, China. 3. School of Physical Science and Technology, Shanghai Tech University, 393 Middle Huaxia Road, Shanghai 201210, China.
Abstract
The quick and non-invasive evaluation of lignin from biomass has been the focus of much attention. Several types of spectroscopies, for example, near-infrared (NIR) and Fourier transform-Raman (FT-Raman), have been successfully applied to build quantitative predictive lignin models based on chemometrics. However, due to the effect of sample moisture content and ambient humidity on its signals, NIR spectroscopy requires sophisticated pre-testing preparation. In addition, the current FT-Raman predictive models require large variations in the independent value inputs as restrictions in the corresponding mathematical algorithms prevent the effective biomass screening of suitable genotypes for lignin contents within a narrow range. In order to overcome the limitations associated with the current methods, in this paper, we employed Raman spectra excited using a 1064 nm laser, thus avoiding the impact of water and auto-fluorescence on NIR signals. The optimal baseline correction method, data type, mathematical algorithm, and internal reference were selected in order to build quantitative lignin models based on the data with limited variation. The resulting two predictive models, constructed through lasso and ridge regressions, respectively, proved to be effective in assessing the lignin content of poplar in large-scale breeding and genetic engineering programs.
Lignocellulosic
material, considered as one of the most promising
substitutes for fossil energy, originates from the cell wall of natural
plants and has extensive applications in biofuels and bio-based chemicals.[1] To measure the chemical content of lignocellulose
(i.e., cellulose, lignin, and hemicellulose), wet chemical methods
(WCMs) such as the acetyl bromide, Klason, or acid-insoluble lignin
and Van Soest methods are commonly applied.[2] However, the majority of such methods are costly, time-consuming,
and employ corrosive reagents (e.g., acetyl bromide and H2SO4).[2] The assaying of a large
number of lignocellulosic samples is typically required in genetic
or breeding programs, resulting in low throughput and thereby hindering
the working process of these WCMs.[3,4] Furthermore,
the minimum sample weight requirement of WCMs often exceeds the breeding
program resources. Thus, quick and non-invasive methods have been
established to replace the WCMs. For example, vibration spectroscopy,
including near-infrared (NIR) and Raman spectroscopies, has been integrated
with chemometrics to obtain the chemical content of lignocellulose.[1,4−7]

NIR accompanied by specific mathematical models can predict most
chemical components in biomass. However, the absorbance signal at
the NIR region is around 10–100 times weaker than that of the
mid-infrared (MIR) region.[8] Moreover, the
moisture content of the samples and the humidity of the ambient atmosphere
dramatically influence the NIR signal by forming “water peaks”
at ∼1450 and ∼1890 cm–1, corresponding
to OH bonds.[9,10] A relatively high sample moisture
content can result in overwhelming peaks. Therefore, in terms of the
chemical content assay of lignocellulose, the results obtained by
NIR often need to be verified with the additional characterization
techniques.[5,7]

Raman spectroscopy can complement NIR and has also been successfully
applied to evaluate the lignin content.[4,5,7] Unlike NIR, Raman spectroscopy is largely insensitive to
the sample moisture content because water is only a weak Raman scatterer.[35] Suitable mathematical models have recently been
employed to predict the syringyl-to-guaiacyl
(S/G) ratio in lignin and to classify lignin.[11,12] However, the auto-fluorescence originating from lignin or other
chemicals in lignocellulosic materials can lead to a broad signal
in the resultant Raman spectrum.[13] Furthermore,
conjugated lignin substructures can contribute additional scattering
to the Raman signals, causing the prediction to deviate.[3] To overcome these limitations,
NIR excitation using a 1064 nm laser has been proposed to significantly
reduce the spectral background caused by the fluorescence emission.[14−16] Besides, mathematical techniques including spectral data processing
and anti-collinearity algorithms have proved to be effective for dealing
with combination bands and higher-order overtones of the fundamental
signals in quantitative NIR models.[1] Such
mathematical techniques are expected to subdue the negative effect
of the conjugated aromatic structures on quantitative Raman modeling.

Poplar has the largest plantation area in the world and has recently
been identified as a key potential resource for biomass and biofuels.[1] The high variation in lignin content among different
poplar genotypes requires a quick and non-invasive method to detect
lignin content in massive genetic or breeding programs.[17] Thus, in the current paper, wood samples were
obtained from several poplar clones in order to simultaneously determine
their lignin contents via traditional WCMs and the Raman spectra of
the samples. Different models were then organized according to the
baseline correction strategy, data type, mathematical algorithm, and
internal reference peak. These models were evaluated using the Pearson
correlation coefficient (R) and root-mean-square
error (RMSE) in order to determine the optimal model. This work evaluates
the ability of Raman spectroscopy to predict the lignin content in
genetic breeding selection programs.
Results and Discussion
Lignin Content of Poplar Samples
Table 1 reports the
WCM-derived lignin contents of the poplar samples, which range from 20.31
to 25.69 wt % with an average of 22.79 wt %. This is in agreement
with previous reports on similar poplar genotypes.[1,18]
Table 1
Lignin Contents of Poplar Samples for the Training and Test Sets

| training set sample | lignin content (wt %) | training set sample | lignin content (wt %) | test set sample | lignin content (wt %) |
|---|---|---|---|---|---|
| S1 | 20.31a ± 0.73b | S9 | 22.37 ± 0.16 | T1 | 21.49 ± 0.01 |
| S2 | 20.72 ± 0.26 | S10 | 22.47 ± 0.20 | T2 | 22.43 ± 1.27 |
| S3 | 21.22 ± 1.68 | S11 | 22.86 ± 0.74 | T3 | 22.86 ± 0.81 |
| S4 | 21.28 ± 0.69 | S12 | 22.97 ± 0.63 | T4 | 22.88 ± 0.34 |
| S5 | 21.74 ± 0.53 | S13 | 23.12 ± 0.58 | T5 | 22.97 ± 1.22 |
| S6 | 22.04 ± 0.97 | S14 | 23.90 ± 0.46 | T6 | 24.40 ± 0.07 |
| S7 | 22.10 ± 0.22 | S15 | 24.99 ± 0.78 | T7 | 24.64 ± 0.98 |
| S8 | 22.16 ± 0.13 | S16 | 25.69 ± 0.82 | T8 | 25.45 ± 0.78 |

a Mean (% w/w).
b Standard deviation (% w/w).
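The summary figures quoted for the sample set can be checked directly from the tabulated means. The short sketch below simply transcribes the 24 mean values listed in Table 1 and recomputes the range and the overall average; the variable names are illustrative only.

```python
# Mean lignin contents (wt %) transcribed from Table 1.
training = [20.31, 20.72, 21.22, 21.28, 21.74, 22.04, 22.10, 22.16,
            22.37, 22.47, 22.86, 22.97, 23.12, 23.90, 24.99, 25.69]
test = [21.49, 22.43, 22.86, 22.88, 22.97, 24.40, 24.64, 25.45]

all_samples = training + test
mean_content = sum(all_samples) / len(all_samples)
span = max(all_samples) - min(all_samples)

print(f"n = {len(all_samples)}")
print(f"range = {min(all_samples):.2f} to {max(all_samples):.2f} wt %")
print(f"mean = {mean_content:.2f} wt %")
print(f"span = {span:.2f} wt %")
```

The span of roughly 5 wt % illustrates how narrow the variation is that the models discussed in this paper have to resolve.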
Evaluating Parameters
R and RMSE were
selected to evaluate the quality of the mathematical
models in the training and test sets. R and Rp represent the correlation between
the measured and predicted values in the training and test sets, respectively, with values closer to
1 or −1 indicating a higher accuracy. Here, R′ and Rp′ denote the
absolute values of R and Rp, respectively. The values of R′ and Rp′ ranged from 0.13 to 0.95 and from 0.01 to 0.90, respectively
(Figures 1, 2, S7, and S8). The evaluation
results demonstrate the considerable variation of R′ and Rp′ across different baseline
correction methods, data types, and mathematical algorithms. This
indicates the ability of R′ and Rp′ to monitor the quality of the predictive models, as is
the consensus in the literature.[19−21] RMSE and root-mean-square
error of prediction (RMSEP) denote the standard deviation of the residuals
and are commonly used in regression analysis to verify experimental
results. RMSE and RMSEP values close to 0 theoretically represent
a perfect match between predicted and measured values. In the current
paper, we calculated RMSE and RMSEP using the Raman peak intensities.
The RMSE and RMSEP values ranged from 0.47 to 2.52 and 0.75 to 2.97,
respectively (Figures 1, 2, S7, and S8). The values varied with the baseline correction methods, mathematical
algorithms, and data types, indicating the effectiveness of RMSE and
RMSEP to describe the quality of predictive models.[22]
Figure 1
R′, Rp′, RMSE,
and RMSEP values of the principal components regression (PCR), partial
least-squares regression (PLSR), ridge regression (RR), and lasso
regression (LR) models based on the B2-I dataset. (a) R′ of the models; (b) Rp′ of the models;
(c) RMSE of the models; and (d) RMSEP of the models.
Figure 2
R′, Rp′, RMSE,
and RMSEP values of the PCR, PLSR, RR, and LR models based on the
B2-A dataset. (a) R′ of the models; (b) Rp′ of the models; (c) RMSE of the models; and (d)
RMSEP of the models.
Selection of the Baseline Correction Strategy
The baseline correction
process is commonly applied to the Raman
spectra to reduce the disturbing signals from baseline drift. In general,
the number of baseline set points can alter each peak’s extracted
value, irrespective of the intensity and area. The greater the number
of baseline set points, the less impact the background will have on
the extracted value. However, it will also reduce the signal originating
from the molecular vibration of lignocellulosic materials.[23] Therefore, a suitable baseline correcting method
is required in order to balance these effects. We tested two baseline
correction strategies to select the better one. Figure 3 presents the evaluation parameters
corresponding to each of the strategies.
Figure 3
Comparison of two baseline
correction strategies based on model
quality. (a) R′ and Rp′
of the models and (b) RMSE and RMSEP of the models.
The average and maximum values of R′ for
the B2 strategy were higher than those for B1, while the opposite
was observed for Rp′. Moreover, the average
and minimum values of RMSE and RMSEP for B2 were lower than those
for B1. This indicates a smaller error between the predicted and
observed values for the models with six baseline set points (B2). In addition,
the correlation between the predicted and observed values was stronger
in the training set but weaker in the test set for B2. Nevertheless, the maximum value of Rp′ for B2 (0.87) was close to that for B1 (0.90).
These results indicate B2 to be a more favorable baseline correction
strategy for reliable and accurate predictive models of lignin content
based on the Raman spectra compared to B1.
Selection of Data Type
The predictive
models used to describe the algebraic relationship between the lignin
content and its specific scattering peaks were established based on
the intensity (I type) and area (A type) of the specific Raman peaks. Figure 4 compares the models
based on R′, Rp′,
RMSE, and RMSEP.
Figure 4
Comparison of two models based on different data types
[area (A)
and intensity (I)] extracted from Raman spectra. (a) R′ and Rp′ of the models and (b) RMSE
and RMSEP of the models.
The average values of R′ and Rp′ for the A type
models exceeded those of the I type models
(Figure 4); however,
the latter exhibited the highest R′ value
(0.95). Moreover, the average values of RMSE and RMSEP were similar
for both model types, with minimum values lower for the I type models
compared to the A type models. The results of the model parameters
suggest that the I type model is suitable for the analysis of lignin
content in lignocellulosic materials. This may be attributed to the
complex chemical components of the lignocellulosic materials, resulting
in the overlap of Raman peaks.[3] These overlapping
Raman peaks introduce uncertainty into the peak areas
obtained after deconvolution.
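The difference between the two data types can be made concrete with the standard Gaussian and Lorentzian line shapes. The sketch below is illustrative only: the peak position, heights, and widths are hypothetical values, and the functions are not taken from the fitting software used in this work. It shows that two fitted peaks with the same intensity (height) can have very different areas, because the area also depends on the fitted width, which is the quantity most affected when neighbouring bands overlap.

```python
import math

def gaussian(x, height, center, sigma):
    """Gaussian line shape; `height` is the peak intensity."""
    return height * math.exp(-0.5 * ((x - center) / sigma) ** 2)

def lorentzian(x, height, center, hwhm):
    """Lorentzian line shape; `hwhm` is the half width at half maximum."""
    return height * hwhm ** 2 / ((x - center) ** 2 + hwhm ** 2)

def gaussian_area(height, sigma):
    """Analytic area under a Gaussian peak."""
    return height * sigma * math.sqrt(2.0 * math.pi)

def lorentzian_area(height, hwhm):
    """Analytic area under a Lorentzian peak."""
    return math.pi * height * hwhm

def trapezoid_area(func, start, stop, step=0.05):
    """Numerical area by the trapezoid rule, used here as a cross-check."""
    n = int(round((stop - start) / step))
    ys = [func(start + i * step) for i in range(n + 1)]
    return step * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# Hypothetical fits: same intensity, different widths.
narrow_area = gaussian_area(1.0, 6.0)
broad_area = gaussian_area(1.0, 12.0)
numeric = trapezoid_area(lambda x: gaussian(x, 1.0, 1602.0, 6.0), 1502.0, 1702.0)

print(f"narrow peak area = {narrow_area:.3f}")
print(f"broad peak area  = {broad_area:.3f}")
print(f"numerical check  = {numeric:.3f}")
```

Doubling the fitted width doubles the area while leaving the intensity unchanged, so any width uncertainty from deconvolution propagates into A type data but not into I type data.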
Selection of the Mathematical Algorithm
PCR, PLSR, RR, and LR were
applied to fit the models predicting
the relationship between the lignin content and specific Raman peaks
in the lignocellulosic material (Figure 5). The RR algorithm exhibited the highest
average (0.74 and 0.31) and maximum (0.95 and 0.90) R′ and Rp′ values among the four models
(Figure 5a). The maximum R′ value (0.94) of the LR algorithm was similar to
that of the RR algorithm (0.95). However, the average R′ value (0.62) of the LR algorithm was much lower than that
of the RR algorithm (0.74). The PCR and PLSR algorithms exhibited
lower maximum R′ values of 0.78 and 0.88, respectively. Figure 5a suggests that the
RR and LR algorithm models are more suitable for fitting the
lignin content in the lignocellulosic material than PCR and
PLSR. The RMSE and RMSEP values of the four algorithm models in Figure 5b confirmed this
result.
Figure 5
Comparison of four models based on different algorithms (PCR, PLSR,
RR, and LR). (a) R′ and Rp′ of models and (b) RMSE and RMSEP of models.
All four algorithms are considered effective for the analysis of
multiple regression data with multi-collinearity.[24] PCR and PLSR suppress the collinearity by reducing the number
of variables via principal component extraction, with the threshold
value set at 90% in the current paper. In contrast, RR and LR adopt
a different technique to handle the collinearity. A penalty term controlled
by a regularization parameter (λ, see Section 2.6 in the Supporting Information) takes over
the role that principal component analysis plays in PCR and PLSR in reducing
the model bias, without reducing the number of variables.[25] The sum of the absolute penalty terms in LR
is varied until the Mallows’ Cp value reaches a minimum.[26] The ridge parameter K plays the
same role in controlling the penalty term in RR[27] and is iteratively tuned to its optimal value. Therefore, at the
same bias level, both RR and LR can potentially retain more reliable
information from the variables than PCR and PLSR. Our results
reveal higher maximum values of R′ and Rp′ for the RR and LR algorithms than for the PCR
and PLSR algorithms.

Ill-posed and ill-conditioned problems,
in which small changes in the input produce
disproportionately large changes in the output value of the function,
are often reported when building a predictive model.[28,29] The problem is generally evaluated by the condition number, which
is calculated from the independent data. In the current paper, we
determined the condition number by extracting data from the Raman
spectra (Figure S2). Values higher than
1 suggest a problem in the model.[28,29] In our case,
the predicted value of lignin content will be exaggerated by a minimal
alteration in the Raman spectra. However, research has suggested the
ability of the RR and LR algorithms to depress the negative effect
caused by an ill-posed or ill-conditioned problem.[30−32] This explains
the lower maximum values of RMSE and RMSEP in RR and LR compared to
PCR and PLSR.
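The link between collinearity, the condition number, and the ridge penalty can be illustrated with a deliberately small example. The sketch below is an assumption-laden toy case, not the computation used in this work: it uses two hypothetical, nearly identical predictor columns, measures the condition number of the 2 × 2 matrix XᵀX as the ratio of its eigenvalues, and solves the ridge normal equations (XᵀX + λI)β = Xᵀy in closed form. The data values and the penalty λ = 0.1 are arbitrary.

```python
import math

def gram_matrix(x1, x2):
    """Entries (a, b, d) of the symmetric matrix X^T X = [[a, b], [b, d]]."""
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    return a, b, d

def condition_number(a, b, d):
    """Ratio of the largest to the smallest eigenvalue of [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return (mean + radius) / (mean - radius)

def ridge_coefficients(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two predictors."""
    a, b, d = gram_matrix(x1, x2)
    a, d = a + lam, d + lam
    r1 = sum(u * v for u, v in zip(x1, y))
    r2 = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b
    return (d * r1 - b * r2) / det, (a * r2 - b * r1) / det

# Hypothetical, nearly collinear predictors (already centred).
x1 = [-1.5, -0.5, 0.5, 1.5]
x2 = [-1.49, -0.51, 0.49, 1.51]
y = [-3.0, -1.0, 1.0, 3.0]

a, b, d = gram_matrix(x1, x2)
print("condition number, no penalty:", round(condition_number(a, b, d)))
print("condition number, lam = 0.1 :", round(condition_number(a + 0.1, b, d + 0.1)))
print("least squares coefficients  :", ridge_coefficients(x1, x2, y, 0.0))
print("ridge coefficients          :", ridge_coefficients(x1, x2, y, 0.1))
```

In this toy case the penalty lowers the condition number by more than two orders of magnitude and spreads the weight evenly across the two nearly identical predictors instead of assigning it all to one of them.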
Selection of the Internal Standard Peak in the Raman Spectra
The sample roughness, the auto-fluorescence
of the lignocellulosic materials, and the input power of the laser
source all affect the intensity of the Raman signals. Therefore, a
standard internal peak was selected to normalize the Raman spectra
prior to the quantitative or semi-quantitative analysis of the chemical
components in the lignocellulosic materials.[7,33,34] Raman bands used for the analysis are listed
in Table 2. Figure 6 demonstrates that
the data array normalized by the internal standard peak
intensity at 2895 cm–1 (Ilp/2895) has
the highest maximum and average values for R′
and Rp′ among all groups. This implies the
suitability of the 2895 cm–1 peak intensity as an
internal standard for standardizing the Raman spectra prior to quantitative
or semi-quantitative analysis. This was also confirmed via the RMSE
and RMSEP values, whereby Ilp/2895 exhibits
the lowest minimum and average values among all combinations.
Table 2
Raman Peak Position and Band Assignments[3,14,16,35,36]

| wavenumber/cm–1 | component | assignment |
|---|---|---|
| 2939 | lignin, glucomannan, and cellulose | CH stretching in OCH3 asymmetric |
| 2895 | cellulose | CH and CH2 stretching |
| 1660 | lignin | ring conjugated C=C str. of coniferyl alcohol; C=O str. of coniferaldehyde |
|  |  | OCH3 deformation; CH2 scissoring; guaiacyl ring vibration |
| 1378 | cellulose | HCC, HCO, and HOC bending |
| 1331 | cellulose | HCC and HCO bending, aliphatic OH stretch |
| 1275 | lignin | Aryl-O of aryl OH and aryl O–CH3; guaiacyl ring (with C=O group) mode |
| 1149 | cellulose | CC, CO stretching and HCC and HCO bending; ring breathing, asymmetric |
| 1122 | cellulose, xylan, and glucomannan | CC and CO stretching, COC, glycosidic; ring breathing, symmetric |
| 1095 | cellulose, xylan, and glucomannan | CC and CO stretching, COC, glycosidic; ring breathing, symmetric |
| 1043 | lignin | OC stretching; ring deformation; CH3 wagging |
Figure 6
Comparative
Lasso and ridge regression results of 12 models based
on different internal standards using the B2-I data groups. (a) R′ of the models; (b) Rp′
of the models; (c) RMSE of the models; and (d) RMSEP of the models.
Previous work has demonstrated the ability of the
Raman peak centered
at 1096 cm–1, which is assigned to the vibration
of CC and CO stretching in carbohydrates, to normalize the scattering
peak of lignin at 1600 cm–1.[4,35] Moreover,
Agarwal (2011) established a linear model based on the relationship
between the standardized relative intensity (intensity of peak 1600
cm–1/intensity of peak 1096 cm–1) and the lignin content determined by the WCMs.[4,37] However,
the relative intensity needs to be further “corrected”
by normalizing with the carbohydrate content in order to reduce the
impact of scattering contribution at 1096 cm–1 assigned
to CC and CO stretching of the carbohydrate.[4] The peak located at approximately 2895 cm–1 in
the Raman biomass spectra is generally assigned to the vibration of
C–H in CH and CH2 in cellulose.[11,36] This peak is absent from the Raman spectra of spruce
milled wood lignin (MWL). Therefore, the 2895 cm–1 peak is independent of the lignin content and can be employed to standardize
the lignin-related band intensity/area. Note that a “self-absorption”
in the C–H stretching vibration of the Raman spectra will occur
when water is added to the biomass sample. This may have a negative
impact when the peak of the C–H stretching is used to quantify
the cellulose content.[33] However, this
negative effect is minimal for predictive models of lignin content
as the numerator data array includes the peak assigned to the C–H
stretching of −OCH3 in lignin. Research has speculated
that the “self-absorption” effect will be depressed
by the 2939 cm–1 peaks with the help of anti-collinearity
algorithms.[26] These observations suggest
the 2895 cm–1 peak intensity to be the best internal
standard option for the predictive model.
Description of the Optimal Predictive Models
Following the optimization
of the baseline correction strategy,
data type, mathematical algorithm, and internal standard peaks, two
predictive models were established using the RR and LR mathematical
algorithms. The R′, Rp′,
RMSE, and RMSEP of these models indicate that both models are suitable
for the prediction of lignin content in lignocellulosic materials
via their Raman spectra (Table 3). Figure 7 depicts the lignin content determined by the WCMs and by the Raman spectra
predictive models. The data points of the RR and LR models are distributed
close to the line of equality in Figure 7a,b. This is attributed to the similar techniques
applied by both models to reduce the multi-collinearity between the
actual and predicted values of the lignin content.
Table 3
Statistical Results for the Training, Cross-Validation, and Test Sets of the RR and LR Models

| data group | algorithm | R′ | RMSE | RV′ | RMSEV | Rp′ | RMSEP |
|---|---|---|---|---|---|---|---|
| B2-Ilp/2895 | RR | 0.95 | 0.47 | 0.95 | 0.41 | 0.86 | 0.88 |
| B2-Ilp/2895 | LR | 0.94 | 0.52 | 0.95 | 0.48 | 0.87 | 0.75 |
Figure 7
Lignin content determined
from the WCMs and Raman spectra based
on the B2-Ilp/2895 dataset (a) in the RR model and (b)
in the LR model.
The evaluation parameters determined for the training and test
sets of the two models agree with those reported for the lignin content
of wood samples.[1] However, our observed
parameters are inferior to those of models predicting the lignin content
of kraft pulps or pre-treated biomass.[4,35] However,
those literature models do not meet the requirements of genetic
breeding scientists because they were built on input values (i.e.,
lignin contents) that vary far more than the lignin content of
biomass across genotypes. Quantitative models are generally considered
more reliable when the predicted values exhibit a variation equivalent
to that of the input values. This suggests that the proposed
models are more suitable than those in the literature for the prediction
of lignin content in lignocellulosic materials from different genotypes.

With the inclusion of the optimal coefficients and constants, the
lasso regression model is described as follows [equation not available] and
the ridge regression model is determined
as [equation not available], where Clignin is
the lignin content and I1043/2895 is the
intensity ratio between the 1043 and 2895 cm–1 peaks.
The remaining parameters follow the same definitions.

Our results
demonstrate the ability of the two models to predict
the lignin content for the screening of poplar genotypes based on
their Raman spectra, particularly for the optimized building process
detailed in the previous sections.
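To show how a lasso model arrives at a sparse linear combination of predictors, the following is a minimal coordinate-descent sketch. It is not the fitting procedure used in this work: the penalty λ is fixed by hand rather than selected by Mallows' Cp, the objective is written as 0.5·Σ(y − Xβ)² + λ·Σ|β|, and the two predictor columns and responses are hypothetical centred values rather than measured intensity ratios.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(columns, y, lam, sweeps=200):
    """Minimise 0.5 * sum((y - X b)^2) + lam * sum(|b_j|) for centred data."""
    n = len(y)
    beta = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Residual with every predictor except j accounted for.
            partial = [
                y[i] - sum(beta[k] * columns[k][i]
                           for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(col[i] * partial[i] for i in range(n))
            norm = sum(v * v for v in col)
            beta[j] = soft_threshold(rho, lam) / norm
    return beta

# Hypothetical centred predictors: x1 is informative, x2 contributes little.
x1 = [-1.5, -0.5, 0.5, 1.5]
x2 = [1.0, -1.0, -1.0, 1.0]
y = [-2.9, -1.1, 0.9, 3.1]

beta = lasso_coordinate_descent([x1, x2], y, lam=1.0)
print("lasso coefficients:", beta)
```

With these inputs the weakly related predictor is driven exactly to zero while the informative one is retained with a shrunken coefficient, which is the behaviour that distinguishes lasso from ridge regression.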
Leave-One-Out Cross-Validation for Evaluating the Optimal Predictive Models
To evaluate the stability
and robustness of the quantitative models, leave-one-out cross-validation
(LOOCV) was performed to estimate the performance of the optimal predictive
models described in the preceding section. In each iteration of LOOCV, one data group was set
aside and the remaining data groups were used for training. The statistical
results of LOOCV are listed in Table 3.

The ridge regression model and the lasso regression
model for lignin content performed well in cross-validation:
the RV′ value was 0.95 for both models, and the root-mean-square
error of validation (RMSEV) values were 0.41 and 0.48, respectively.
The values calculated from the validation datasets are nearly equivalent
to those calculated from the training sets. Therefore, the results
of LOOCV demonstrate that the two optimal models are stable and robust,[1] which further
supports the reliable predictive performance of the quantitative lignin models.
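The leave-one-out procedure itself is simple to sketch. In the example below, ordinary least squares on a single predictor stands in for the RR and LR models purely to keep the code short, and the intensity ratios and lignin contents are hypothetical numbers; only the hold-one-out loop and the RMSEV calculation are the point.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loocv_rmse(x, y):
    """Hold out each sample once, refit on the rest, and pool the errors."""
    squared_errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        prediction = slope * x[i] + intercept
        squared_errors.append((prediction - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical intensity ratios and lignin contents (wt %).
ratios = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20]
lignin_exact = [20.0 + 25.0 * r for r in ratios]
lignin_noisy = list(lignin_exact)
lignin_noisy[2] += 0.5  # perturb one sample

print("RMSEV, exact linear data:", loocv_rmse(ratios, lignin_exact))
print("RMSEV, one perturbed sample:", loocv_rmse(ratios, lignin_noisy))
```

An RMSEV close to the training RMSE, as reported in Table 3, indicates that no single sample dominates the fit.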
Conclusions
In the current paper, we established an effective
predictive model
of lignin content for the screening of poplar genotypes based on their
Raman spectra. We employed the intensity of deconvoluted Raman peaks
to establish mathematical prediction models for the lignin content
of lignocellulosic materials, with the Raman peak centered at 2895
cm–1 considered as the standard internal peak. We
reveal the capability of the RR and LR models to fit the lignin content
in the lignocellulosic material via the collected Raman spectra. This
highlights the potential of these models in assessing the lignin content
in large-scale breeding and genetic engineering programs.
Materials and Methods
Materials
Eight
poplar clones, namely, Populus deltoides CL. “55/65”, Populus euramericana cv. “Zhonglin46”, P. euramericana cv. Guariento, P.
deltoides CL. “2KEN8”, Populus nigra CL. “N179”, P. deltoides CL. “Danhong”, P. euramericana
CL. “Sangju”, and P. deltoides CL. “Nanyang”, were planted at the Jiaozuo Forest
Farm in Jiaozuo City, Henan Province, China (113°13′E,
35°14′N), in 2010. Three trees of each clone were harvested
in November 2018, and the discs at breast height were cut and processed
into particles in order to study the chemical components. A total
of 24 different wood samples were obtained. All samples were stored
at room temperature for more than 1 month at a moisture content of
approximately 9.5%. The wood particles were ground into a fine powder
using a pulverizer. The 40–60 mesh wood powder was sieved out
to measure the lignin content via the WCMs, while the 100–120
mesh wood powder was reserved for the collection of Raman spectra.
Methods
Lignin Content Measurements
Lignin
is typically divided into two parts during the compositional analysis
of lignocellulosic materials with WCMs, namely, Klason lignin and
acid-soluble lignin.[18,38−42] The Klason lignin measurements in this paper employed
the official TAPPI (Technical Association of the Pulp and Paper Industry)
T222 om-06 test method,[43] while acid-soluble
lignin measurements were based on TAPPI UM-250.[39,43] The lignin content was then taken as the sum of Klason and acid-soluble
lignins. The average value of three technical replicates was registered
as the lignin content of each sample. Therefore, the dataset consisted
of 24 lignin content values.
Collection of Raman Spectra
A Fourier
transform–Raman spectrometer (Bruker, VERTEX 70-RAMII) was
used to collect Raman spectra of the poplar samples. The spectrometer
was equipped with a 1064 nm, 500 mW Nd:YAG diode laser. During the
test, the ambient relative humidity and temperature were 60 ±
10% and 20 ± 2 °C, respectively. Approximately 2 mg of wood
powder was pressed into a sample cell and the collection was repeated
three times for each sample. The average spectrum was regarded as
the sample spectrum. During measurements, the fluorescence was excited
at 1064 nm, the laser defocus mode was used, and the number of scans
was increased to obtain a good signal-to-noise ratio.[4,44] Spectra were collected with 64 scans and a spectral resolution of
4 cm–1 across 200–3600 cm–1.
Processing of Raman Spectra
The
Raman spectra of the poplar samples exhibit an excellent resolution
(Figure S3). Table 2 details the assignments of the specific
peaks. PeakFit 4.12 was employed to process the spectral data, including
smoothing, background correction, and band peak-fitting.[45] First, Savitzky–Golay (level 1.0%) smoothing
was applied prior to further processing of the spectra. Second, the second-derivative
zero algorithm in PeakFit 4.12 was used for baseline correction.[46] The sets of baseline
points were confirmed according to the second derivative spectra.
The baseline correction was then executed via two different strategies
by changing the set points: (i) Four set points located at 940, 1750,
2750, and 3050 cm–1 divided the spectra into five
sections (B1) and (ii) six set points located at 940, 1210, 1540,
1750, 2750, and 3050 cm–1 divided the whole spectra
into seven sections (B2). Regions 940–1750 and 2750–3050
cm–1 include the signals closely related to the
molecular vibration in lignocellulosic materials and also exhibit
a high resolution in all spectra. Thus, these two regions were selected
in strategy B1 for the baseline correction and further peak-fitting
(Figures S4 and S5). In addition, regions
940–1210, 1210–1540, 1540–1750, and 2750–3050
cm–1 were used for further processing in strategy
B2 (Figure ). Third,
the peak-fitting of overlapping bands was performed for each peak
based on Gauss and Lorentz peak-fitting. In order to ensure the effectiveness
of the separated peaks, they were determined according to the second-derivative
spectra, while the peak positions were fixed during the peak-fitting
process.[47] Figure S6 presents the results of the Raman spectra for lignocellulosic materials.
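The effect of the number of set points can be sketched with a simplified stand-in for the baseline step. The code below does not reproduce the second-derivative zero algorithm of the fitting software; it assumes a plain piecewise-linear baseline drawn through the spectrum at the B1 or B2 set points, and it uses a synthetic spectrum (a single hypothetical peak on a gently curved background) rather than measured data.

```python
import math

def subtract_baseline(wavenumbers, intensities, set_points):
    """Subtract a piecewise-linear baseline anchored at the set points.

    Points outside the first and last set point are returned as None.
    """
    anchors = []
    for point in set_points:
        idx = min(range(len(wavenumbers)),
                  key=lambda i: abs(wavenumbers[i] - point))
        anchors.append((wavenumbers[idx], intensities[idx]))

    corrected = []
    for x, y in zip(wavenumbers, intensities):
        value = None
        for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
            if x0 <= x <= x1:
                baseline = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
                value = y - baseline
                break
        corrected.append(value)
    return corrected

B1 = [940, 1750, 2750, 3050]              # four set points
B2 = [940, 1210, 1540, 1750, 2750, 3050]  # six set points

def peak(x):
    """Hypothetical Raman band centred at 1600 cm-1."""
    return 10.0 * math.exp(-0.5 * ((x - 1600) / 8.0) ** 2)

def background(x):
    """Hypothetical curved background."""
    return 5.0 + 1e-6 * (x - 2000) ** 2

wavenumbers = list(range(900, 3101, 10))
intensities = [peak(x) + background(x) for x in wavenumbers]

i1600 = wavenumbers.index(1600)
error_b1 = subtract_baseline(wavenumbers, intensities, B1)[i1600] - peak(1600)
error_b2 = subtract_baseline(wavenumbers, intensities, B2)[i1600] - peak(1600)
print(f"residual background at 1600 cm-1, B1: {error_b1:.4f}")
print(f"residual background at 1600 cm-1, B2: {error_b2:.4f}")
```

In this synthetic case the shorter B2 segments follow the curved background more closely, so less background remains in the extracted peak intensity.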
Four spectral
regions of the Raman spectra in the B2 baseline correction
strategy at (a) 940–1210; (b) 1210–1540; (c) 1540–1750;
and (d) 2750–3050 cm–1. The black line indicates
the spectrum before smoothing and baseline correction and the red
line indicates the spectrum after smoothing and baseline correction.
Ilp/sum = Ilp/(I1095 + I1122 + I1275 + I1331 + I1378 + I1462 + I1602 + I1660 + I2895 + I2939).
Ilp/carbohydrate = Ilp/(I1095 + I1122 + I1331 + I1378 + I2895).
Selection and Normalization of Input Data from the Raman Spectra
The intensity (I) and area (A) of
each peak in the Raman spectra were determined following the peak-fitting
process. As both of these parameters are suitable inputs for predictive
models,[4,34] they were compared in order to determine
the most suitable based on the model evaluation parameters. Internal
standards were then screened to normalize these input
values.[4] A data array of the peak digital
values related to lignin (e.g., 1043, 1275, 1427, 1462, 1602, 1660,
and 2939 cm–1), denoted as lp, was assigned to the
numerator.[2,3] Digital values of each prominent peak in
the Raman spectra were allocated as the denominators of the internal
standard candidates. The intensities of peaks assigned to carbohydrates
(1095, 1122, 1331, 1378, and 2895 cm–1) were summed,[2,3] as were the intensities of all selected prominent peaks. These sums
were also included in the internal standard candidates (Table ). Therefore, 48 groups of independent
variables were collected from the Raman spectra as inputs for the
modeling process.
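The normalization and the count of data groups can be sketched as follows. The peak lists are taken from the text and from the Ilp/sum and Ilp/carbohydrate definitions; the intensity values are hypothetical. The enumeration of twelve internal-standard candidates (ten single peaks plus the two sums) is one reading that is consistent with the twelve models compared in Figure 6 and with the stated total of 48 groups (2 baseline strategies × 2 data types × 12 candidates); it is an assumption, not a listing confirmed by the source.

```python
from itertools import product

LIGNIN_PEAKS = [1043, 1275, 1427, 1462, 1602, 1660, 2939]
CARBOHYDRATE_PEAKS = [1095, 1122, 1331, 1378, 2895]
SUM_PEAKS = [1095, 1122, 1275, 1331, 1378, 1462, 1602, 1660, 2895, 2939]

def denominator(values, reference):
    """Internal-standard value: a single peak, 'sum', or 'carbohydrate'."""
    if reference == "sum":
        return sum(values[p] for p in SUM_PEAKS)
    if reference == "carbohydrate":
        return sum(values[p] for p in CARBOHYDRATE_PEAKS)
    return values[reference]

def normalise(values, reference):
    """Divide every lignin-related peak value by the chosen internal standard."""
    d = denominator(values, reference)
    return {peak: values[peak] / d for peak in LIGNIN_PEAKS}

# Hypothetical peak intensities keyed by wavenumber (cm-1).
values = {1043: 0.8, 1095: 3.0, 1122: 2.5, 1275: 0.9, 1331: 1.2, 1378: 1.1,
          1427: 0.7, 1462: 1.0, 1602: 2.0, 1660: 1.0, 2895: 4.0, 2939: 3.2}

# Assumed candidate list: ten single peaks plus the two summed references.
candidates = SUM_PEAKS + ["sum", "carbohydrate"]
groups = list(product(["B1", "B2"], ["I", "A"], candidates))

print("I(lp)/I(2895):", normalise(values, 2895))
print("number of data groups:", len(groups))
```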
Mathematical Algorithms
Multi-collinearity
was observed in the values determined from the Raman spectra (Figure S1), resulting in a divergence between
the predicted and actual values.[21,48] PCR, PLSR,
RR, and LR have all been reported to reduce the negative effect of
multi-collinearity and enhance the reliability of mathematical models.[25,49−52] In the current paper, we employed all four algorithms to build the
mathematical model based on the original lignin content data and to
estimate the lignin content from the Raman spectra. The threshold
value for the cumulative contribution of the principal components was set at
90% for PCR and PLSR. Based on the model grouping, 16 samples were
randomly selected as the training set (S1–S16), with the remaining
8 samples as the test set (T1–T8) (Table 1). Furthermore, LOOCV was used to validate
the performance of the selected models, with the validation sets
split from the training set according to the LOOCV rules.[53]
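The 90% threshold for PCR and PLSR amounts to keeping the smallest number of components whose cumulative share of the variance reaches 0.90. A minimal sketch of that selection rule is given below; the eigenvalues are hypothetical and the function name is illustrative.

```python
def components_for_threshold(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose cumulative
    share of the total variance reaches `threshold`."""
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)

# Hypothetical eigenvalues of the predictor covariance matrix.
eigenvalues = [4.2, 1.9, 0.5, 0.3, 0.1]
n_components = components_for_threshold(eigenvalues)
print("components kept at the 90% threshold:", n_components)
```

Here the first two components explain about 87% of the variance, so a third is needed to pass the threshold; the remaining variables are discarded, which is the information loss that RR and LR avoid.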
Model Evaluation
The Pearson correlation
coefficient (R) and RMSE were used to evaluate the
model qualities of the training sets, while the validation set and
the test set model qualities were evaluated via the Pearson correlation
coefficient of validation (Rv), the RMSEV,
the Pearson correlation coefficient of prediction (Rp), and the RMSEP.

The Pearson correlation coefficient
(R), also known as the Pearson R statistical test, measures the strength of the linear relationship between
two variables. R is determined as follows[19−21]

$$R = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$

where X and Y are two different variables and n is the sample
size. RMSE measures the error between two datasets and is an indicator
of how far apart the predicted values are from the observed values.[54−56] RMSE is described as follows[54−58]

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_{i} - M_{i}\right)^{2}}$$

where $P_i$ is the predicted value for the ith observation
in the dataset, $M_i$ is
the measured value for the ith observation in the
dataset, and n is the sample size.
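Both evaluation parameters are straightforward to compute. The sketch below implements the standard Pearson correlation coefficient and RMSE in plain Python; the measured values are the first four test-set means from Table 1, while the predicted values are hypothetical and serve only to exercise the functions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured values."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)

measured = [21.49, 22.43, 22.86, 22.88]   # wt %, from Table 1 (T1 to T4)
predicted = [21.90, 22.10, 23.20, 22.60]  # wt %, hypothetical model output

r_prime = abs(pearson_r(measured, predicted))  # R' is the absolute value of R
print(f"R' = {r_prime:.2f}")
print(f"RMSE = {rmse(predicted, measured):.2f} wt %")
```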