Literature DB >> 34283823

Determining soil particle-size distribution from infrared spectra using machine learning predictions: Methodology and modeling.

Elizabeth Jeanne Parent¹, Serge-Étienne Parent¹, Léon Etienne Parent¹.

Abstract

Accuracy of infrared (IR) models to measure soil particle-size distribution (PSD) depends on soil preparation, methodology (sedimentation, laser), settling times and relevant soil features. Compositional soil data may require log ratio (ilr) transformation to avoid numerical biases. Machine learning can relate numerous independent variables that may impact on NIR spectra to assess particle-size distribution. Our objective was to reach high IRS prediction accuracy across a large range of PSD methods and soil properties. A total of 1298 soil samples from eastern Canada were IR-scanned. Spectra were processed by Stochastic Gradient Boosting (SGB) to predict sand, silt, clay and carbon. Slope and intercept of the log-log relationships between settling time and suspension density function (SDF) (R2 = 0.84-0.92) performed similarly to NIR spectra using either ilr-transformed (R2 = 0.81-0.93) or raw percentages (R2 = 0.76-0.94). Settling times of 0.67-min and 2-h were the most accurate for NIR predictions (R2 = 0.49-0.79). The NIR prediction of sand sieving method (R2 = 0.66) was more accurate than sedimentation method(R2 = 0.53). The NIR 2X gain was less accurate (R2 = 0.69-0.92) than 4X (R2 = 0.87-0.95). The MIR (R2 = 0.45-0.80) performed better than NIR (R2 = 0.40-0.71) spectra. Adding soil carbon, reconstituted bulk density, pH, red-green-blue color, oxalate and Mehlich3 extracts returned R2 value of 0.86-0.91 for texture prediction. In addition to slope and intercept of the SDF, 4X gain, method and pre-treatment classes, soil carbon and color appeared to be promising features for routine SGB-processed NIR particle-size analysis. Machine learning methods support cost-effective soil texture NIR analysis.

Entities: Chemical

Mesh：

Substances：
Soil
Carbon

Year: 2021 PMID： 34283823 PMCID： PMC8291647 DOI： 10.1371/journal.pone.0233242

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.752

Introduction

Soil particle-size distribution (PSD) is of prime importance for plant growth and soil management [1]. Mimicking particle sedimentation in natural water bodies, PSD has been traditionally quantified using the sieve-pipette method that determines particle mass, and the sieve-hydrometer or sieve-plummet balance method that measures changes in suspension density [2]. Sedimentation techniques are thought to overestimate the concentration of plate-like clay particles that do not fit into Stokes’ law [3]. Reynolds’ number should be less than 0.05, otherwise, the drag force for sedimentation increases faster than predicted by Stokes’ law [3]. Because organic matter binds soil particles [4], organic matter is often destroyed using a peroxide pre-treatment as method modifier to disperse soil particles. Laser techniques are much faster than sedimentation techniques but tend to underestimate clay-size particles due to insufficient particle dispersion despite sonication pre-treatment, and this may require using conversion equations [5,6]. On the other hand, visible and near infrared (VIS–NIR: 350–2500 nm) spectroscopy is an efficient soil quality and fertility screening tool [7] because spectra correlate well with several chemical, physical and mineralogical properties [8]. The VIS represents the visible light range between 350 and 780 nm, from violet to red. Because most routine soil and plant laboratories are equipped with NIR spectrometers for forage analysis, they may contribute to documenting soil characteristics at low cost. Where VIS is not available, soil RGB (red-green-blue) can be assessed from the Munsell color chart using computer models (“munsell2rgb” from the “aqp” package). Mid-infrared (MIR) is generally more accurate than NIR [9] but requires more sample preparation, limiting its application as routine determination method. Infrared spectroscopy (IRS) more accurately predicts clay than sand and silt contents [10] because the IR spectrum is sensitive to clay mineralogy [10-12] and total reflectance decreases as grain size increases [13,14]. Light absorption is also influenced by soil features such as particle roundness [15-18], soil pH, and the Mehlich-3 soil test for Ca, Mg and Mn [19]. The IRS detects Al-OH (2200 nm) and Fe-OH (2290 nm) [20,21] that in turn have an impact on soil structure [22] and IR reflectance [23]. The VIS-NIR spectra are sensitive to soil moisture and C content [24]. Organic matter, multi-nutrient extraction and reconstituted bulk density from scooped soil samples are also common features quantified in routine laboratories. The sedimentation methods provide percentages of sand, silt, and clay from the log-log relationship between settling time and suspension density. The slope and intercept return proportions of sand, silt and clay at pre-selected settling times that may vary between laboratories [1], thus affecting the accuracy of IRS models. Providing more flexibility using the slope and intercept of the log-log relationship and reducing the arbitrariness of the settling time selection may allow increased reliability of IRS models calibrated against sedimentation methods. There are unattended sources of error in IRS calibration. There is systematic negative covariance between sand, silt and clay fractions due to resonance within the ternary diagram [25]. Indeed, there are D-1 degrees of freedom in a D-part composition [26]. Not considering the problem of closure to 100% in statistical analysis, confidence intervals about means of proportions may take values outside of the compositional space, i.e. < 0 or > 100% [27], and the measures of distance and dissimilarity are non-Euclidian [28]. To return unbiased statistical results, orthonormal balances among subsets of components can be computed as D-1 isometric log ratios (ilr) [29]. The back-transformed ilr values allow recovering proportions of sand, silt, and clay totalling exactly 100% within the limits of the ternary diagram. For both the sedimentation and laser techniques there are several soil pre-treatments (peroxide, sodium hypochlorite, sodium hexametaphosphate, sonication intensity), soil features, calibration techniques, options (NIR 2X, NIR 4X, settling times or suspension density function for sedimentation; pump and stirrer spin, refractive index of the medium, real or imaginary refractive index, density for laser) and expressions (percentages, ratios) that influence results of particle-size distribution. Machine learning (ML) is an emerging data mining technique of artificial intelligence that can unravel patterns and rules in large data sets [30] and predict target variable from input data [31]. Machine learning methods can account for numerous independent variables that may impact on NIR spectra to assess accurately soil particle-size distribution. This paper is presented in two parts, one focusing on methodology and the other on modeling. In the methodology section, we hypothesized that methods and pre-treatments return different results and thus cannot be combined to run a single IRS calibration set without input information on methodologies. In the modeling section, we hypothesized that: 1. IRS cannot accurately predict PSD from laser methods due to underestimation of the clay fraction; 2. IRS accurately predicts sand fraction determined using sand sieving methods; 3. IRS more accurately predicts PSD from sedimentation methods after accounting for soil features; 4. IRS prediction accuracy for PSD is improved using the slope and intercept of the log-log relationship and isometric log ratios compared to raw percentages and pre-determined settling times; and 5. MIR prediction accuracy for PSD is higher compared to the NIR spectra. The objective of this study was to calibrate infrared determination methods against routine PSD methods, methodological modifiers and soil features to reach high IR-ML model accuracy for routine soil texture determination.

Materials and methods

Characterization of soils

The data set of 1298 soil samples collected in the arable layer (0–20 cm) was obtained from several research institutions in Québec, Canada (Fig 1). The main crops were maize (Zea mays) cereals and forages on coarse- to fine-textured soils, potatoes (Solanum tuberosum) on sandy loams and loamy sands, and cranberries (Vaccinium macrocarpon) on sandy soils. The soils were mainly Inceptisols and Spodosols.

Fig 1

Repartition of soil sampling areas on eastern provinces in Canada.

Repartition of soil sampling areas on eastern provinces in Canada.

Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL. Contains information from OpenStreetMap and OpenStreetMap Foundation, which is made available under the Open Database License. http://maps.stamen.com/#toner/12/37.7706/-122.3782. Soil samples were air-dried then passed through a 2-mm sieve [32,33]. The 2-mm sieved soil was 3-mL scooped and then weighed to determine the reconstituted bulk density as performed routinely in soil testing laboratories. Soil color was assessed on dry samples using the Munsell chart and then transformed into RGB percentages using “munsell2rgb” in R. The total carbon (Ct) was quantified using the Leco CNS analyzer (Leco Corporation, St. Joseph, Michigan). For pH determination, 10 g of soil was mixed with 20 mL 0.01M CaCl2. For oxalate extracts, a 0.5 g sample of soil was mixed with 20 mL of oxalate solution (0.2 M of ammonium oxalate and 0.2 M of oxalate acid) and agitated for 4 h in the dark [34]. The mixture was centrifuged at 2000 rpm for 5 min and then filtered through Whatman no. 40 paper. Concentrations of P, Fe, Ca, Al, Mn and Si were quantified by ICP-OES. Soils were also extracted using the routine Mehlich3 method [35]. Concentrations of P, Ca, Mg, Fe, Al, Mn, Zn and Cu were quantified by ICP-OES.

Particle-size analysis

Particle size distribution was analyzed in 50–100 g samples using the sedimentation method [36]. A separate batch of soils was pre-treated with peroxide for comparison with the no-peroxide pre-treatment. Samples were mixed with 0.05 M hexametaphosphate and agitated at 300 rpm for 16 h. The mixture was transferred to a 1 L cylinder and hand-shaken for 30 sec. Suspension density readings (g L-1) taken after 0.75-, 5-, 120-, 420-, and 1440- min were referred to as the sedimentation multi-point method. Samples reporting suspension density after 0.67 and 120 min as originally suggested [37] were assigned to the 2-point 2-h sedimentation method. The clay fraction was also recorded after 7-h settling time, close to the 6-h settling time used by Gee and Bauder (1979) [38]. Although the clay fraction was overestimated as compared to the 7-h sedimentation multi-point method, the 2-point 2-h sedimentation method was selected as our reference because it has been widely used as a proximate method in soil surveys combined with tactile assessment. After taking the last reading, the entire hydrometer contents from sedimentation method were passed through 1-, 0.5-, 0.25-, 0.10- and 0.05- mm sieves under tap water to clean the coarser particles of any adhering finer particles and to determine the sand-size distribution. The suspension density was transformed into particle-size percentages by mass using standard equations [37,38]. The sedimentation curve relating suspension density to settling time was log-transformed (ln) to determine the slope and intercept as model parameters. Samples were also analyzed using the Mastersizer 2000 Laser particle size analyzer (Malvern Instruments, Worcestershire, UK, measurements at 633 nm and 466 nm) combined with Hydro 2000G (800-mL tap water volume, 500-rpm stirrer and 2000-rpm pump) with or without ultrasonic action at nominal 40 kHz frequency for 2 min [39]. Only samples within 10 to 20% of obscuration were retained in the database. The refractive index of the medium was set at 1.5.

Calculation

The three particle-size fractions and carbon content were isometric log ratio (ilr) transformed as follows [29]:

Spectral data acquisition

The air-dried and 2-mm sieved samples [40] were placed into a 5-cm quartz cup then scanned using a Nicolet Antaris FT-NIR analyzer (Thermo Electron Corp., Ann Arbor, Michigan). Absorbance was measured with gains of 2X or 4X. Triplicated spectra were scanned 30 times in the range of 9090 to 4000 cm-1 (1100 to 2500 nm) at a resolution of 2 cm-1 (0.3 nm at 1250 nm) [41]. A library of 3069 NIR spectra with 2X gain and 668 spectra with 4X gain were obtained after probabilistic quotient normalization and first derivation [42]. All samples were stored in white plastic storage containers. The soil and KBr samples were ground to less than 74 μm using a mortar and pestle [43], then oven-dried at 105°C for 3 h [44]. A 2 g sample was weighed and mixed with 200 g KBr powder. The mixture was pressed at 517 MPa for 3 min using a manual hydraulic press (Carver, Model 4350.L, Carver Inc., Wabash, Indiana) to yield clear 13-mm-diameter pellets. The mid-infrared spectra were scanned using a transmission DTGS detector Varian 1000 FT-IR Scimitar series spectrometer (Varian Inc., Palo Alto, California). Spectra were in the range of 4000 to 400 cm-1 (2500 to 25 000 nm) at 4 cm-1 (0.6 nm) resolution. Each pellet was scanned 10 times and rotated once manually at 90°. A library of 413 MIR spectra was obtained.

Spectral data modeling

Statistical analyses were conducted using the R-3.6.1 version [45] with the packages “tidyverse”, “reshape2”, “stringi”, “signal”, “mvoutlier”, “caret”, “compositions”, “soiltexture”, “aqp”, “patchwork”, “broom”, “ggExtra”, “ggmap”, “neuralnet” and “outliers”. The machine learning (ML) method was GBM (Stochastic Gradient Boosting) that outperformed other ML models including neural networks. Neural networks was tested with a hidden of 3 and an activation function settled at “logistic” (S2 Table). The NIR and MIR spectra were normalized and passed through a binning process with 10 cut points and a Savitsky-Golay smoothing filter. After first derivation, spectra compressed into scores using principle component analysis (PCA) [46]. Because of the diversity of the soil samples, we did not remove outliers. Sand, silt and clay percentages, oxalate and Mehlich-3 extracts composed of 0.2N CH3COOH-0.25N NH4N03-0.015NNH4F-0.013NHN03-0.001M EDTA were ilr-transformed. Every set was divided into training (70%) and testing (30%).

Results

Soil characterization

There was a large spectrum of soil properties (Table 1). In comparison, soils of the region have been reported to range from 5 to 137 g organic matter kg-1, 560 to 19 992 mg Feoxalate kg-1 and 270 to 36 638 mg Aloxalate kg-1, and 4–494 g sand kg-1, 40–734 g silt kg-1, and 6–796 g clay kg-1 [47]. Particle-size distribution is presented in Fig 2. Orthonormal balances were normally distributed except for [clay | silt,sand] using the laser method (Fig 3). Indeed, the clay content was systematically underestimated, and the silt content systematically overestimated by the laser method.

Table 1

Ranges of soil properties (0–20 cm) in the data set (particle-size distribution according to the multi-point 7-h sedimentation method using 45-sec settling time for sand and 7-h settling time for clay).

Site	Mean	Standard deviation	Minimum	Maximum
pH (0.01 M CaCl₂)	5.36	0.67	3.37	7.90
	g kg^-1
Sand total	613	24	1	986
• 1–2 mm	41	50	0	282
• 0.5–1.0 mm	108	115	1	541
• 0.25–0.5 mm	191	126	5	580
• 0.1–0.25 mm	235	154	3	766
• 0.25–0.01 mm	118	103	0	587
Silt	256	15	5	816
Clay	131	13	5	839
Total C	25.5	2.4	0.1	443
Total N	1.9	1.6	0.7	19
Total S	0.3	0.4	0	10.3
		mg kg^-1
Fe_oxalate	5599	2449	447	10400
Al_oxalate	4000	2693	528	15346
Mn_oxalate	309	304	7	2287
Ca_oxalate	30	14	3	153
P_oxalate	803	464	68	2266
Si_oxalate	829	578	88	5585
P_Mehlich3	118	62	6	362
Ca_Mehlich3	1776	1437	113	5103
Mg_Mehlich3	191	224	5	1572
Fe_Mehlich3	232	108	72	715
Al_Mehlich3	1265	431	310	2241
Mn_Mehlich3	25	20	5	206
Zn_Mehlich3	8	9	0	130
Cu_Mehlich3	4	3	0.4	30
	%
Green	47	8	28	68
Blue	34	10	11	58
Red	54	7	31	75
	g cm^-3
Reconstituted bulk density	1.06	0.16	0.71	1.46

Fig 2

Particle size distribution of studied soils in the Canadian textural diagram.

Fig 3

Distribution of ilr textural variables across methodologies.

Methodologies

Paired t-tests comparing methods are presented in Table 2. Comparisons involving the peroxide pre-treatment must be interpreted with care due to the smaller number of observations (26 to 38). Mean ilr differences between the reference 2-point 2-h sedimentation method and other methods are illustrated in Fig 4. The difference between the laser and the 2-point 2-h sedimentation methods depended on the ilr coordinate (also shown in Fig 3). After 2-min of sonication, the [silt | sand] balance differed slightly between methods while the [clay | silt,sand] balance differed markedly between methods, indicating insufficient dispersion of clay-size particles.

Table 2

Comparison of methods (method1 minus method2) using paired t-test and confidence intervals (p ≤ 0.05).

Target variable	Method1	Method2	p.value	N
[Carbon \| Clay,Silt,Sand]	Sedimentation No peroxide	Sedimentation peroxide	ns	38
[Carbon \| Clay,Silt,Sand]	Sedimentation No peroxide	Laser 0 min	**	46
[Carbon \| Clay,Silt,Sand]	Sedimentation No peroxide	Laser 2 min	**	106
[Carbon \| Clay,Silt,Sand]	Sedimentation 2-pointpoint No peroxide	Sedimentationmulti-pointpoint No peroxide	**	763
[Carbon \| Clay,Silt,Sand]	Sedimentation 2-pointpoint peroxide	Sedimentation multi-points peroxide	ns	227
[Clay \| Silt,Sand]	Sedimentation No peroxide	Sedimentation peroxide	**	38
[Clay \| Silt,Sand]	Sedimentation No peroxide	Laser 0 min	**	46
[Clay \| Silt,Sand]	Sedimentation No peroxide	Laser 2 min	**	106
[Clay \| Silt,Sand]	Sedimentation 2-point No peroxide	Sedimentation multi-point No peroxide	**	763
[Clay\| Silt,Sand]	Sedimentation2-point peroxide	Sedimentation multi-point peroxide	**	227
[Silt \| Sand]	Sedimentation No peroxide	Sedimentation peroxide	ns	38
[Silt \| Sand]	Sedimentation No peroxide	Laser 0 min		46
[Silt\| Sand]	Sedimentation No peroxide	Laser 2 min	*	106
[Silt \| Sand]	Sedimentation 2-point No peroxide	Sedimentation multi-point No peroxide	**	763
[Silt \| Sand]	Sedimentation 2-point peroxide	Sedimentation multi-point peroxide	**	227
Sand	Sedimentation No peroxide	Sieving	*	746
Sand	Sedimentation peroxide	Sieving	ns	26
Sand	Laser 0 min	Sieving	*8	34
Sand	Laser 2 min	Sieving	*	77
Sand	Sedimentation multi-point No peroxide	Sieving	ns	358
Sand	Sedimentation multi-point peroxide	Sieving	*	84

ns, *,**: Non-significant and significant at the 0.05 and 0.01 levels, respectively.

N: Sample size.

Fig 4

Mean differences (p ≤ 0.025) in textural balances between several methodologies against the reference 2-point 2-h sedimentation method without pre-treatment.

ns, *,**: Non-significant and significant at the 0.05 and 0.01 levels, respectively. N: Sample size. As expected, the multi-point 7-h sedimentation method returned a smaller proportion of clay-size particles compared to the 2-point 2-h sedimentation method due to longer settling time. The peroxide pre-treatment for the multi-point 7-h sedimentation method tended to increase clay and silt fractions. The [silt | sand] balance resulting from the no-peroxide pre-treatment preceding sedimentation methods was comparable across methods, with the exceptions of the laser and pretreated multi-point 7-h sedimentation methods. Excluding the 0-min sonicated laser and comparable methods, the amount of sand was slightly lower compared to sand sieving. Sand contents did not differ significantly between the peroxide pre-treatments or multi-points calculation without pre-treatment. Except for the peroxided pre-treated dataset, the [carbon | clay,silt,sand] balances were differentially influenced by methodologies. To conduct predictions to details of methodologies and features, the data set was divided into eight subsets (Table 3). Prediction models were run for each subset. In general, prediction accuracies were weakest for the silt fraction and were similar whether data were crude or ilr-transformed. As expected, the clay fraction determined by laser in Set1 was poorly related to the NIR spectra (R2 = 0.45–0.64) whether the PSD was expressed as ilr or % (Table 3). Combining laser and sedimentation methods in Set2 improved the clay predictions with R2 values of 0.85–0.91. In Set3, PSD predictions were more accurate with the 2-point 2-h sedimentation method (R2 = 0.49–0.79) than with the multi-point sedimentation method (R2 = 0.44–0.70). Modeling Set4 showed that NIR was less accurate (R2 = 0.40–0.71) compared to MIR (R2 = 0.45–0.80). Accuracy was higher with NIR-4X (R2 = 0.87–0.95) than with NIR-2X (R2 = 0.69–0.92) (Set5). Models for Set6 showed that data expressed as ilr or % (R2 = 0.76–0.94) for interpolated PSD were similarly accurate to the slope and intercept from the relationship between settling time and suspension density (R2 = 0.84–0.92). Several features in Set7 improved model accuracy (R2 = 0.86–0.91) compared to no features at all (R2 = 0.82–0.97). The total carbon content contributed the most to the increase in accuracy, followed by colors and oxalate extracts. Finally, the carbon content showed higher performance with raw data (R2 = 0.63–0.89) than with ilr-transformed data (R2 = 0.28–0.84). Compared to a model without features (R2 = 0.97), the reconstituted bulk density (R2 = 0.99), color (R2 = 0.96), oxalate (R2 = 0.98) and Mehlich-3 (R2 = 0.98) extracts were similar to the accuracy of the NIR model. For the sand fraction, the NIR model calibrated against the sand-sieving method showed higher accuracy (R2 = 0.66) compared to the sedimentation 2-point 2-h method (R2 = 0.53).

Table 3

Accuracy, slope, intercept, and number of observations of models.

Set	Dependent Variables	Independent variables	Laboratory method for calibration	Sand	Silt	Clay	C	Sand	Silt	Clay	C	Value interpretation*	N
				Adjusted R²				RMSE
Set1	Ilr	2X	Laser	0.80	0.79	0.45	0.28	10.80	10.64	1.32	0.67	-	92
Set1	%	2X	Laser	0.86	0.80	0.64	0.63	9.11	10.35	1.06	0.48	-	92
Set2	Ilr	Lab method, PT, NIR-2X	2-point 2-h sedimentation+ Laser	0.90	0.85	0.80	0.58	9.05	6.94	5.27	0.72	+	485
Set2	%	Lab method, PT, NIR-2X	2-point 2-h sedimentation+ Laser	0.89	0.91	0.85	0.71	9.30	7.17	4.21	0.62	++	485
Set3	Ilr	NIR-2X	Multi-point 7-h sedimentation	0.63	0.44	0.70	0.40	16.17	13.22	6.94	0.92	-	667
Set3	Ilr	NIR-2X	2-point 2-h sedimenttion	0.76	0.49	0.79	0.36	13.42	11.50	6.60	0.95	-	667
Set4	Ilr	MIR	2-point 2-h sedimentation	0.75	0.45	0.80	0.21	14.75	11.34	7.88	0.77	-	222
Set4	Ilr	NIR-2X	2-point 2-h sedimentation	0.71	0.40	0.69	0.02	15.78	11.81	9.75	0.86	-	222
Set5	Ilr	NIR-4X	2-point 2-h sedimentation	0.94	0.87	0.95	0.85	7.26	6.15	3.62	0.48	++	311
Set5	Ilr	NIR-2X	2-point 2-h sedimentation	0.83	0.69	0.92	0.76	12.42	9.40	4.78	0.62	+	311
Set6	Ilr	PT, NIR-2X	2-point 2-h sedimentation	0.90	0.81	0.93	0.84	8.75	6.98	3.94	0.73	+	860
	%			0.91	0.76	0.94	0.89	8.40	7.74	3.84	0.61	+
	Slope, Intercept			0.90	0.84	0.92	-	8.63	6.34	4.29	-	++
Set7	Ilr	All features, NIR-2X	2-point 2-h sedimentation	0.91	0.86	0.90	0.69	9.27	5.86	6.06	0.36	++	156
		No feature, NIR-2X		0.94	0.82	0.97	0.97	7.78	7.67	3.07	0.71	++
		Carbon, NIR-2X		0.96	0.88	0.97	0.95	6.59	6.34	3.36	0.97	++
		Bulk density, NIR-2X		0.89	0.63	0.94	0.99	10.59	10.84	4.66	0.52	-
		pH, NIR-2X		0.92	0.78	0.96	0.94	8.84	8.39	3.53	1.06	+
		Color, NIR-2X		0.94	0.86	0.95	0.96	7.55	6.64	4.10	0.84	++
		Oxalate, NIR-2X		0.95	0.84	0.96	0.98	7.19	7.17	3.71	0.54	++
		Mehlich3, NIR-2X		0.89	0.70	0.93	0.98	10.62	9.83	4.71	0.64	+
Set8	%	NIR-2X	Sand sieving	0.66	-	-	-	12.15	-	-	-	+	746
Set8	%	NIR-2X	2-point 2-h sedimentation	0.53	-	-	-	14.73	-	-	-	-	746

C: Carbon; PT: Pre-treatments (no peroxide or peroxide); MIR: MIR scores; RMSE: Root mean square error; N: Sample size.

*: Value interpretation on sand, silt and clay contents.

-: Vulnerable estimations; +: Approximated estimations; ++: Research application estimations; +++: Quality control estimations [48,49].

C: Carbon; PT: Pre-treatments (no peroxide or peroxide); MIR: MIR scores; RMSE: Root mean square error; N: Sample size. *: Value interpretation on sand, silt and clay contents. -: Vulnerable estimations; +: Approximated estimations; ++: Research application estimations; +++: Quality control estimations [48,49].

Discussion

Methodologies

The IR technologies are generally calibrated against a single methodology. In this paper, several methodologies were IR-tested on comparable soil samples. The present paper presented IR-SGB results across several methodologies and features that are rarely addressed altogether in the literature. Machine learning integrated the available information on pre-treatments, methods and features to predict soil texture at low cost in routine laboratories.

Sedimentation methods

The peroxide pre-treatment tended to increase the clay and silt fractions at the expense of the sand fraction, indicating that clay- and silt-size particles were freed from the clay- or silt-organic matter complexes in micro-aggregates [50]. While metals such as Fe are sequestered by soil organic matter [51] or associated with it [52], iron may be freed from organic matter using hydrogen peroxide to form reactive iron hydroxide [53] that may, conversely, positively impact soil aggregation [54]. Peroxide-treated soils may thus require additional dispersion prior to sedimentation [55]. Nevertheless, the 2-point 2-h sedimentationwith and without peroxide pre-treatments showed comparable results or minor differences for the [clay | silt,sand] and [silt | sand] balances.

Laser method

The Mie theory considers soil particles to be spherical (Malvern Instruments Ltd., 2007). Parameters such as the refractive index of the medium, the real or imaginary refractive index and the density could be included to improve accuracy. The 2-min sonicated laser method and the sand sieving method returned closed results. The 40 kHz sonication for a duration of 2 min appeared to be suitable for sandy soils [56]. Variants include 36 kHz for 3 min [57] and 30 kHz for 30 min [2]. Although fragile quartz grains may be broken using ultrasonication [58], the [silt | sand] balance for the sedimentation method showed results close to the 2-min sonicated laser method. According to Storti and Balsamo (2010) [59], the high-strength materials are not as affected by the procedures as the low-strength materials. Using water as the dispersing agent and 1750-, 700- rpm for the pump and stirrer speeds, Sochań et al. (2012) [60] obtained R2 values of between 0.67 and 0.95 for high-strength materials in silt loamy to sandy soils, values close to our raw data results (R2 = 0.64–0.86). For clayed soils, a surfactant or solvent may replace tap water [61]. As small particles increase in number and combine into sand-sized aggregates, pre-treatments may be required and adjusted to soil specificities. There is no standard procedure to disperse soil samples for the laser method because pre-treatment is soil specific. In this study, we used subsets ≥ 92 samples and GBM as the machine learning method for particle-size prediction compared to suggested subsets ≥ 130 samples for cLHS and FCMS [62]. Here, only Set1 for laser samples did not meet that criterion. The Boosted Regression Trees is another suitable ML method for complex predictions [41] and is relatively fast compared to random forest and neural network algorithms. Model accuracies in validation were in the range reported in the literature where R2 values have been found to vary between 0.46 and 0.94 for clay [10,63,64] and between 0.53 and 0.82 for sand. More generally, the R2 values may vary widely between 0.05 and 0.84 [65,66].

Prediction accuracy

Because the 2-point 2-h sedimentation with and without pre-treatment showed comparable results, they could be combined for IRS calibration with method and treatment information inputs. Hence results of soil surveys could provide a large database for IRS calibration purposes. Laser methods produced results with lower accuracy compared to the sedimentation methods, due to underestimation of the clay fraction. The [silt | sand] balance from the laser method, however, was close to that of the 2-point 2-h sedimentation method, despite higher silt and sand percentages for the laser method, indicating advantage for log-ratioing. Higher accuracies from the laser method were obtained by Blott and Pye (2006) [67] across a wide range of soils, sediments and powders. Zobeck (2004) [68] related results from a LS-230 laser diffraction particle size analyzer to those of the pipette method and obtained R2 values of 0.97, 0.99, and 0.99 for the < 2-, < 50-, and < 100 –μm in non-calcareous soils, using a shape factor of 0.2 compared to the default 1.0 shape factor for the laser method. Nevertheless, we found that the laser method did not produce results as consistent as the sedimentation method for clay predictions by IRS, indicating regional, soil-specific, calibration. Mehlich-3 extracts, reconstituted bulk density and pH did not improve prediction accuracy. Soil features such as total carbon content, colors and amorphous materials (oxalate extracts) increased model prediction accuracy. Prediction accuracy of carbon content could perform with features as reconstituted bulk density as well as Mehlich-3, colors and oxalate extracts. Using suspension density function parameters instead of arbitrary settling times did not increase the accuracy of PSD predictions but provided a uniform base to run NIR models as various settling times have direct impact on predictions. Then, in Set3, we observed that 0.67-min and 2-h settling times were more accurate than 0.75-min and 7-h periods for sand and clay NIR predictions. In the present study, MIR spectra were more accurate than NIR for sand, silt and clay determinations. The NIR method can provide accurate prediction for clay as it has been also found to be accurate for cation exchange capacity (CEC) with R2 values of 0.82 [41] and 0.81 [63]. On the same direction, Viscarra Rossel et al. (2006) [9] concluded that MIR was more suitable than NIR for texture and carbon determination, due to higher incidence of spectral bands combined with higher intensity and specificity of the signal compared to NIR. To further support NIR calibration and model accuracy, the R-coded GBM machine learning model used in the present study came across several soil textural classes, carbon contents and features that have not been addressed simultaneously in past research.

Conclusions

In this paper, IRS inaccurately predicted the PSD’s clay fraction using laser methods. However, IRS accurately predicted PSD against sedimentation and sieving methods after adding soil features such as color, total carbon content and concentration of amorphous materials related to soil genesis and classification. Soil pre-treatments and the need for dispersing agents could be adjusted to the nature and concentration of binding agents for silt-, clay-size particles and fine particles adhering to sand particles. Combined with method and treatment as features, post-screening total carbon content and color routinely determined in service laboratories can improve IRS accuracy for mineral soils. Features as reconstituted bulk density and Mehlich-3 extracts could be added as features for higher-C soils. The GBM returned similar results whether particle-size data were analyzed as raw pre-determined settling times (percentages) or as ilr-transformed percentages. The GBM returned similar accuracies using the slope and intercept of the log-log relationship between settling time and suspension density, ilr-transformed percentages or raw percentages. The MIR and NIR 4X gain methods performed better than the NIR 2X gain method. However, additional features increased NIR 2X predictability. Modeling the log-log relationship between settling time and suspension density provided greater flexibility in the choice of soil-specific settling times. The GBM model proved to be a powerful tool to process the results of several analytical methods used worldwide to determine soil grain-size distribution [8]. More methodologies could be included in the future such as the pipette methods, and chemical and sonication methods to disperse microaggregate.

Slope, intercept, and number of observations of models.

(DOCX) Click here for additional data file.

Accuracy of GBM and neural network models using k-fold cross validation (k = 10).

(DOCX) Click here for additional data file. (PDF) Click here for additional data file.

3 in total

1 in total

1. Correction: Determining soil particle-size distribution from infrared spectra using machine learning predictions: Methodology and modeling.

Authors: Elizabeth Jeanne Parent; Serge-Étienne Parent; Léon Etienne Parent
Journal: PLoS One Date: 2022-08-16 Impact factor: 3.752

1 in total

Determining soil particle-size distribution from infrared spectra using machine learning predictions: Methodology and modeling.

Introduction

Materials and methods

Characterization of soils

Repartition of soil sampling areas on eastern provinces in Canada.

Particle-size analysis

Calculation

Spectral data acquisition

Spectral data modeling

Results

Soil characterization

Methodologies

Discussion

Methodologies

Sedimentation methods

Laser method

Prediction accuracy

Conclusions

Slope, intercept, and number of observations of models.

Accuracy of GBM and neural network models using k-fold cross validation (k = 10).

1. Mechanisms of hydrogen peroxide decomposition in soils.

2. Adequacy of laser diffraction for soil particle size analysis.

3. Prediction of Soil Organic Carbon at the European Scale by Visible and Near InfraRed Reflectance Spectroscopy.

1. Correction: Determining soil particle-size distribution from infrared spectra using machine learning predictions: Methodology and modeling.