Literature DB >> 34920352

Prediction of degradability of micropollutants by sonolysis in water with QSPR - a case study on phenol derivates.

Judith Glienke¹, Willy Schillberg¹, Michael Stelter², Patrick Braeutigam³.

Abstract

The increasing quantity and variety of organic contaminants discharged into surface and groundwater increase the necessity of additional and suitable water treatment methods, which can be incorporated into existing wastewater treatment plants. The huge variety of micropollutants and local variability of the composition of the organic load or matrix effects paired with multiple possible degradation processes lead to the requirement of a recommendation tool for the best possible water treatment method under given local conditions. Due to the diversity of physicochemical properties of micropollutants, such predictions are challenging. In this study, a quantitative correlation between the structural properties of certain micropollutants and their degradability using high-frequency sonolysis has been investigated. Therefore, Quantitative Structure-Property Relationship (QSPR) has been applied on a set of phenol derivates. To obtain the kinetic data, all experiments have been conducted in standardized, constant conditions for all 32 investigated phenol derivates. QSPR modelling was then executed using the software PaDEL for descriptor calculation and the software QSARINS for the overall modelling process including genetic algorithm (GA) and multiple linear regression (MLR). The final model consisting of 5 molecular descriptors was selected using a multi-criteria decision-making method based on extensive statistical parameters. The predictive power and robustness of the model was evaluated by means of internal cross validation and external validation using an independent validation set. The final selected model showed very good values for regression abilities, predictive power as well as stability (R2adj = 0.9455, CCCtr = 0.9777, Q2loo = 0.9285, CCCext = 0.9797 and Q2ext-F1 = 0.9711). The applicability domain of the QSPR model was defined based on the Williams plot and Insubria plot. The five OECD principles for the application of QSPR/QSAR modelling in industry and regulation were fulfilled in the whole process to the best of our knowledge, including the collection of the underlying experimental data as well as the entire modelling process. The final QSPR model included the molecular polarity and occurrence of hydrogen bonds as major influences on the reaction rate constants in accordance with previous studies. Nevertheless, potential biases in the selection of these descriptors due to the small size of the dataset were highlighted.

Entities: Chemical

Keywords: Advanced Oxidation Processes; Machine learning; Organic micropollutants; Prediction; Quantitative structure-property relationship; Sonolysis

Year: 2021 PMID： 34920352 PMCID： PMC8799606 DOI： 10.1016/j.ultsonch.2021.105867

Source DB: PubMed Journal: Ultrason Sonochem ISSN： 1350-4177 Impact factor: 7.491

Introduction

In recent years, an increasing quantity and variety of organic micropollutants in wastewater but also in surface and ground water have been observed [1], [2], [3]. Such pollutants, including pharmaceutical active compounds, personal care products, industrial chemicals, pesticides, herbicides, and endocrine disrupting chemicals (EDCs), cause a big challenge to the wastewater treatment facilities [4]. Although micropollutants are present in low concentrations and only account for a minor part of dissolved organic matter (DOM), their potential toxicity paired with persistence and bioaccumulation can have disturbing effects on local water environments and biosystems [1], [3]. This also includes phenol and its derivates, which are widely used in the industry but are suspected to have carcinogenic properties [5]. Due to their risk they provide not only to the environment but also possibly to human health, the occurrence, control, and elimination of these micropollutants are getting more and more attention in science [1], [6], [7], [8]. With ongoing emission of new chemicals, conventional wastewater treatment plants more often become unable to eliminate all observed dissolved organic micropollutants completely. This is why degradation methods need to get more and more adjusted and diverse [6], [9]. Therefore, additional water treatment methods specialized on the removal of organic micropollutants are highly investigated by different research groups and companies around the world. Promising approaches are ozonation and different advanced oxidation processes (AOPs) like electrochemical oxidation, Fenton reaction, photocatalysis or cavitation-based processes [10], [11], [12], [13], [14]. Ultrasound-assisted degradation methods are of particular interest as previous studies demonstrated high effectiveness in removing chlorinated organic compounds [15]. When a liquid is exposed to acoustic ultrasound waves, formation of liquid voids and cavitation bubbles can take place. As these microbubbles expand during the cycles of the pressure waves, they ultimately reach a critical size and undergo implosions, producing high local temperatures (up to several 1000 K) and high pressure (up to several 100 atm) [16]. These effects create the foundation of the hot spot theory [17]. Under such extreme conditions, water molecules can be thermally dissociated to •H and •OH radicals [18]. Organic contaminants can be trapped as gas molecules in the cavitation bubble where they undergo thermal fragmentation by pyrolysis. Otherwise, they can be solved in the bulk near the bubbles where they can react with reactive radicals. Phenol and phenol derivates are typically transformed into inorganic ions and short-chain organic compounds, which eventually are being mineralized to CO, CO2 and H2O [19], [20], [21], [22]. Exact degradation pathways, effects and influences of different parameters have not been clearly enlightened yet due to the high complexity of sonochemical reactions [15]. All AOP have in common, that reactive oxygen species (ROS) like hydroxyl radicals (•OH) are generated in-situ [11], [23]. Advantages of AOP include rapid oxidation reaction rate due to high oxidation potentials of ROS, and small to none secondary pollution problems [24], [25]. Several AOP, such as photocatalysis, also distinguish themselves by high mineralization efficiencies. Cavitation based degradation however are most effective in degrading nonpolar compounds but tend to generate more hydrophilic products, resulting in their accumulation and lower overall mineralization [26], [27]. A recommendation or even prediction of the best elimination approach for different pollutants would be highly supportive. However, accumulating extensive experimental data to characterize the activity and behaviour of all relevant substances in every possible degradation method is highly cost and time consuming [28], [29], [30]. The calculation of mathematical models to predict the chemical behaviour of chemicals without their experimental examination is a promising alternative. Quantitative Structure-Property Relationship (QSPR) (or Quantitative Structure-Activity Relationship (QSAR)) is used to correlate different biological, chemical or physical properties of a compound with its physicochemical characteristics [31]. These studies are not supposed to substitute experimental studies completely but to reduce their extent and gain a deeper and broader understanding of existing systems [32]. The use of such models is also suggested in the European legislation of chemicals REACH [33] to reduce the application of animal tests but also the general experimental costs. Applied to ultrasound-assisted degradation, the tool of QSPR modelling can contribute to a broader understanding of reaction pathways by identifying important molecular sites, which are the most relevant for the degradation. QSPR modelling is based on the assumption that a molecules intrinsic properties like the melting point or the biological activity are defined by its molecular geometric, steric and electronic properties [34]. Two main principles are applied during the use of QSPR models. (1) Chemicals with similar structures show resemblances in molecular behaviour under related environmental conditions. (2) Variable behaviours are a result of differences in structure and composition [35]. The chemical structure is thereby translated into numerical parameters, so called molecular descriptors [36], [37]. To implement QSPR models in industry and regulation, the Organization of Economis Co-operation and Development (OECD) drafted five principles of good practice QSPR models have to meet. According to the OECD, reliable QSPR models have to satisfy the following requirements: (1) a defined endpoint, (2) an unambiguous algorithm, (3) a defined domain of application, (4) appropriate measures of goodness-of-fit, robustness, and predictivity, and (5) a mechanistic interpretation, if possible [38], [39]. Although QSPR modelling has already been applied in water research for a few years, many studies are not consistent with all requirements mentioned above. A fundamental issue is that many studies collect their underlying experimental data from different publications using different experimental setups and varying reaction conditions. Although this may be necessary to obtain a sufficiently amount of data, this may also lead to data heterogeneity, resulting in a potentially compromised integrity of the model [35], [40], [41]. This may interfere with the first OECD principle, which proposes that QSAR models based on defined endpoint values gained through standardized test protocols have the greatest potential for accordance between experimental values and calculated ones [38], [42]. Therefore, collecting data from different resources should be avoided if possible [43]. This can also be seen in previous studies of the degradation of phenol derivates using sonolysis. By applying different experimental setups, parameters and matrices, different reaction rate constants for the same compound can be found in literature [20], [44]. For example, Cost et al. [45] and Tauber et al. [46] found different reaction rate constants for the sonolytic degradation of 4-nitrophenol not only within their own study but also compared with one another due to a different experimental basis. Because of the inconsistency in data curation, combining results of different studies would not lead to datasets suitable for reliable QSPR modelling. Also, according to Awfa et al. [32], many developed QSPR models in wastewater treatment lack sufficient but necessary validation techniques. Further use of extensive validation methods and other statistical quality assurances should be introduced. Cherkasov et al. [47], Cronin et al. [43], Dearden et al. [40], and Gramatica et al. [41] also summarized substantial problems found in recent QSPR/QSAR models, addressing not only inconsistences of sufficient validation and inadequate or missing statistic measures, but also problems in data curation, data splitting and missing pre-treatments, such as descriptor-normalization and redundancy in information. In this study the degradation of a group of phenol derivates using high-frequency sonolysis is investigated using a QSPR workflow based on Gramatica et al.[48]. To the best of our knowledge, this is the first QSPR model applied to sonolysis as Advanced Oxidation Process in wastewater treatment and in water research the first model evaluated with extensive amounts of statistical methods, e.g. multiple validation methods, tests for chance correlation and multicollinearity. To address some problems mentioned in previous publications [40], [41], [43], [47], the experimental data was obtained with a standardized setup and protocol under reproducible condition to ensure the homogeneity of the experimental dataset. The introduced workflow possesses a variety of statistical methods to ensure the stability and reliability of the obtained model to reduce many problems as mentioned before. Overall, this study gives an overview of necessary statistical values and methods, which should be included in further QSPR modelling in water research to meet all OECD principles.

Material and methods

Reagents and materials

All sources of chemicals, including CAS-numbers, molecular weight, structures, SMILES-codes and purity, are described in Tables S3 and S4. All chemicals were used as received and possessed a purity > 97%. Reaction solutions were prepared using freshly filtered ultrapure water (σ ≤ 0.055 µS/cm, TOC < 5 ppb; GenPure Pro, Fisher Scientific).

Experimental data

The reaction rate constant k of the degradation with sonolysis was obtained for 32 phenol derivates in a standardized experimental setup. All experiments were performed in 400 mL of a 10 µmol/L micropollutant solution in a triple determination. The reaction took place in a cylindric quartz glass reactor (Meinhardt Ultrasonics, Germany) with flange-mounted ultrasonic transducer (E/805/T, Meinhardt Ultrasonics), connected to an ultrasonic power generator (K 8, Meinhardt Ultrasonics) (Fig. 1). An ultrasound frequency of 860 kHz was used with a power of around 90 W. The system was kept at a constant temperature of 23 °C ± 1 °C through a water-cooling system. A CAD (computer-aided design) drawing concerning the reactor geometry in 3D can be found in the Supplement Material.

Fig. 1

Experimental setup and reactor geometry of experimental data curation using high-frequency sonolysis.

Experimental setup and reactor geometry of experimental data curation using high-frequency sonolysis. To determine the reaction kinetics, samples were taken at specific intervals during 30 min. To stop the reaction and to make sure, no possible polymerization reactions [49] took place, 800 µL of the sample was blended with 200 µL of a 5 mmol/L sodium thiosulfate solution. The reproducibility of the experimental setup has been ensured with an ANOVA test. Results can be found in Text S2, Figure S2 and Tables S1 and S2. For analysis, the concentration of micropollutants were measured using a high-performance liquid chromatography (HPLC) (LC2000, Jasco). The system included a fluorescence detector (FP-2020Plus, Jasco), a multiwavelength detector (MD-2010Plus, Jasco), an autosampler (AS-2055Plus), a 100 µL injection loop and a RP C18 column (Dr. Maisch GmbH Kromasil 100 C18 10 mm*4.6 mm, 5 µm & 250 mm*4.6 mm, 5 µm) tempered at 40 °C. All HPLC methods are described in Table S5, including dissolvent ratio, retention time, detector, wavelength, and injection volume. Because of the low concentration range of the analyte, a reaction rate of pseudo-first order can be assumed. Therefore, the following equation (1) was used to calculate the reaction rate constant k [14], [50]:With c0 as start concentration, ct as concentration of the analyte at time t and k as the reaction rate constant. As theoretically in QSPR-modelling, every compound should only have one experimental value for the reaction rate constant k. Therefore, the average values of the triple determination were calculated and used for the dataset.

QSPR modelling process

To ensure the reliability of the modelling process in this study, all five OECD-principles were carefully respected. A list of the five principles with the respected modelling part is summarized in Table 1.

Table 1

OECD principles and associated modelling step of the QSPR-process used (LOO: Leave-one-out, LMO: Leave-many-out).

OECD principle	Associated modelling step
(1) Defined endpoint	Data curation in standardized experimental setup
(2) Unambiguous algorithm	Feature Selection with genetic algorithm, Model equation with multiple linear regression, Definition of all modelling parameters
(3) a defined domain of application	Williams plot, Insubria plot
(4) appropriate measures of goodness-of-fit, robustness, and predictivity	Internal validation (LOO, LMO), external validation
(5) a mechanistic interpretation, if possible	Interpretation of model descriptors in context of underlying process
Additional	Pretreatment of descriptors, QUIK test, test for chance correlation (Y-Scrambling, Y-Randomization)

OECD principles and associated modelling step of the QSPR-process used (LOO: Leave-one-out, LMO: Leave-many-out). A flowchart of the entire QSPR modelling process is shown in Fig. 2. Some detail information can be also found in the Supplement Material (Text S3-S8).

Fig. 2

Flowchart of the QSPR process used in this study following Gramatica [39].

Molecular descriptors

Molecular descriptors were calculated using PaDEL-descriptor software version 2.21 [51]. Only one- and bi-dimensional descriptors were selected to avoid further complexity of three-dimensional conformation. Descriptors with missing values were deleted. To avoid problems with redundant information and binary collinearity, the descriptors were filtered for descriptors with pair-wise correlation > 95%, constant and near constant (>80%) descriptors during the implementation step, because highly correlated descriptors cause mathematical problems later in the modelling process and information only relevant for a few molecules might not be good for future prediction ability of the model [39]. The remaining descriptors were then implemented in the QSARINS software, version 2.2.4 [http://www.qsar.it] [48], [52] for further analysis and modelling. A list of these descriptors is given in Text S7. After implementation into QSARINS software, the descriptor values were normalized.

Data analysis and setup

Before starting the modelling step, the underlying dataset was further analysed to ensure good quality. This step is not only useful for optimizing someone’s own modelling process, but also to minimize further problems and misleading results. Therefore, the profile and distribution of the dataset compounds was displayed. This included both the distribution of the experimental endpoint but also the structural domain inspected by principal component analysis (PCA) of the molecular descriptors to search for potential clustering of compounds or possible outliers [41].

Dataset splitting

The dataset was split into a training set and a validation set to ensure a later external validation of the selected QSPR model. To guarantee the significance of the later external validation, the two sets have to represent the whole dataset. Due to the fact that random splitting showed a huge diversity in model quality in trial models depending on the selected subsets, the splitting for this study was performed structure based on the previously performed PCA. After splitting, another round of filtering constant, near constant (>80%) and highly bivariate correlated (>95%) descriptors on basis of the training set was executed.

Model building and descriptor selection

In this study, multiple linear regression was used as the model approach with the following equation (2). The by QSARINS software implemented OLS technique thereby minimizes the sum of squares of the difference between experimental endpoint and its calculated value. After setting up the dataset and its splitting, the descriptor selection step was executed. Therefore, all possible combinations of descriptors up to a number of 3 descriptors were calculated for the training set. This ensures the calculation of all low dimensional models before increasing the number of variables to obtain maximal results. On that basis, higher dimensional models were obtained by applying a genetic algorithm to find the best model combination of a higher number of descriptors. The fitness function, on which the algorithm optimized the model quality, was selected to be Q2loo, which is calculated following equation (3).with PRESS as predictive residual sum of squares, TSS as total sum of squares, as the predicted endpoint, as the experimental endpoint and as the mean of the experimental endpoints. The maximal number of descriptors was set as 1:5 of the number of compounds in the training set to nfeature,max = 5. Only models including descriptors with a significance level ≤ 0.05 were retained. Best modelling output with minimal computational power was found with the GA-parameters set to a population size of 200, 100 generations per size and mutation rate of 20%.

Model analysis and internal validation

After the modelling step, for each number of descriptors 10 final models were stored by the program with different statistical values for regression ability and for the internal LOO-cross validation applied during the model building. To select the best and final model, a few analysis steps were done. First of all, the QUIK rule (Q2 Under Influence of K) was applied with a critical value of 0.05 to filter the models, where the correlation between the block of descriptors and the response is too similar to the inter-correlation between the model descriptors [38], [53]. For the remaining models, the relative importance of the model descriptors was inspected by counting their occurrence among the best 10 models. This helped during the later mechanistical interpretation. To check for any chance correlation between endpoint and variables, Y-scrambling and Y-randomization were applied with 2000 iterations. In addition to the internal validation via Leave-One-Out cross validation, an internal validation through Leave-Many-Out cross validation was applied. In 2000 iterations, 30% of the training set were selected randomly for the internal validation set.

External validation, model selection and interpretation

In addition to the internal validation done during the model calculation, an external validation was performed to analyse the capability to predict chemicals not included in the training set. Therefore, the remaining models were applied on the validation set introduced in section 2.3.3. The predictive performance was evaluated on numerous statistical values. With the calculation of a high number of quality criteria, QSARINS software allows a multi-criteria decision-making (MCDM) to select the final model on basis of the accumulation of all calculated values. In this study, the MCDM tool was applied after evaluating the models on each statistical parameter individually and then the selection of the final model was secured by the overall MCDM value calculated by the program. After selecting the final model, this single model was statistically and graphically analysed, including the Williams plot and Insubria plot for applicability domain. To check for the probability of chance correlation additionally to the Y-randomization and Y-scrambling done on all models before, a permuted endpoint randomization test and a randomized endpoint randomization test was applied on the whole modelling process. In these tests, the overall performance of the best model calculated during variable selection and model building with either permuted or randomized endpoint values are compared to the final selected model. With this comparison, a value for the probability of chance correlation can be calculated [41]. The parameters for these tests were kept exactly how they were set in the original modelling process (chapter 2.3.4) with an iteration of 25 for the permutation and randomization, respectively. After the final model was selected and analysed, an interpretation of the model equation with its included descriptors was performed. This included the verification of the relevance of each descriptor by looking at its standardized coefficients and checking the confidence intervals for each descriptor.

Results and discussion

Experimentally derived k values and calculated descriptors

Sonolytic degradation experiments were performed with 32 different phenol derivates. The analytical data obtained through HPLC analysis allowed to calculate the reaction rate constant for each compound. A full list of all experimental data as well as degradation curves for all experiments are given in Table S6 and Figure S3. As seen in Fig. 3 the reaction rate constants for the examined phenol derivates differ in value in the range from 0.01142876 1/min to 0.033556363 min−1, with the highest value for 4-hexylbenzene-1,3-diol, whereas 2,5-dihydroxybenzoic acid shows the lowest rate constant. Standard variations are all below 10%, indicating good reproducibility in accordance with the results of the executed one-factor ANOVA (Text S2, Tables S1 and S2, Figure S2).

Fig. 3

Reaction rate constants of investigated phenol derivates. Red: (hydroxy-)phenols, yellow: alkylphenols, green: halogen-phenols, blue: hydroxybenzoic acids, gray: nitro-phenols. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) The distribution of the resulting reaction rate constants is shown in Fig. 4. Even a perfect normal distribution could not be achieved, the endpoint distribution still seemed to be suitable for modelling purposes.

Fig. 4

a) Response distribution of the whole dataset with 7 different intervals, b) PCA of dataset plotting the first two principal components PC1(EV%=21.78%) and PC2(EV%=17.65%), which include the highest variance. (EV%: Percentage of explained variances). Fig. 4 displays the first two PCA dimensions of the dataset with the 256 selected and normalized descriptors. As seen, a highly outlying compound has not been found in this analysis step, therefore all 32 compounds were used in the following modelling step.

QSPR model for kUS

Statistical parameters of the final QSPR model

Based on these determined k values and the normalized descriptors, QSPR modelling and model selection was executed as described in chapter 2.3.4. The final QSPR model selected via MCDM consists of the following equation for the unstandardized coefficients: The model equation with standardized coefficients is as followed: The coefficients, standardized coefficients, the confidence intervals (Co. int. 95 %) and the p-values of the descriptors are listed in Table 2. The ratios of confidence interval and descriptor coefficient as well as the p-values are below the critical values, whereas the model does not have to be considered suspect.

Table 2

Standardized coefficients, standardized coefficient, confidence intervals (Co. Int. 95%) and p-values of the model descriptors and intercept.

Variable	Coefficient	Standardized coefficient	Co. Int. 95%	p-value
Intercept	0.0220		0.0018	0.0000
ATSC4i	−0.0034	−0.2190	0.0019	0.0010
GATS2e	−0.0110	−0.6174	0.0024	0.0000
GATS5s	0.0036	0.1777	0.0021	0.0018
CrippenLogP	0.0169	0.8494	0.0022	0.0000
maxHBa	−0.0064	−0.4390	0.0016	0.0000

Standardized coefficients, standardized coefficient, confidence intervals (Co. Int. 95%) and p-values of the model descriptors and intercept. The intercorrelation matrix of the model descriptors is listed in Table S8. The binary correlation between two descriptors is always below 0.7, which is stated as the critical value in most research papers for high collinearity [54]. With the applied value of the QUIK test higher than 0.050, severe multicollinearity of the model descriptors can be precluded with high probability. All other statistical values obtained during the whole modelling process, including fitting criteria, internal validation criteria, and external validation criteria are listed in Table 3. With high values for R2 and R2ext, there is an almost linear regression between predicted and experimental endpoint values seen in Fig. 5 with simultaneous low residuals (Figure S4). This is also supported by the high values of the lack of fit (LOF), and the concordance correlation coefficients for the training set and external validation set, CCCtr and CCCext, which are statistical variables for evaluating the regression ability and correctness of linear regression. With good quality of the internal and external validation, an over-fitting of the model can be rejected. Overall, the model shows very good performance in regression, predictability, and robustness.

Table 3

Calculated statistical parameters of the final QSPR model.

N_tr = 26, N_ext = 6, MCDM_all = 0.9397
Fitting criteria:
R² = 0.9564, R²_adj = 0.9455, R²-R²_adj = 0.0109, LOF = 0.0000, K_x = 0.3230, ΔK = 0.0632, RSME_tr = 0.0010, MAE_tr = 0.0008, RSS_tr = 0.0000, CCC_tr = 0.9777, s = 0.0011, F = 87.7304

Internal validation criteria:
Q²_loo = 0.9285, R²-Q²_loo = 0.0279, RMSE_cv = 0.0012, MAEcv = 0.0010, PRESS_cv = 0.0000, CCC_cv = 0.9632, Q²_LMO = 0.9201, R²_Yscr = 0.2015, Q²_Yscr = -0.3659, RMSE Average_Yscr = 0.0041, R²_Yrnd = 0.2010, Q²_Yrnd = -0.3665

External validation criteria:
RMSE_ext = 0.0008, MAE_ext = 0.0007, PRESS_ext = 0.0000, R²_ext = 0.9767, Q²_F1 = 0.9711, Q²_F2 = 0.9609, Q²_F3 = 0.9684, CCC_ext = 0.9797, rm2- = 0.9562, Δrm2 = 0.0228

Fig. 5

Regression plot for a) the model equation, b) LOO internal validation. Values for the training and validation set chemicals are labelled differently.

Calculated statistical parameters of the final QSPR model. Regression plot for a) the model equation, b) LOO internal validation. Values for the training and validation set chemicals are labelled differently.

Applicability domain

The Williams plot of the diagonal hat elements vs. standardized residual predictions by the model equation with a calculated critical HAT value of h* = 0.6923 are shown in Fig. 6. For the model equation there is neither a structural outlier nor an outlier with high standardized residuals. Therefore, one can say, that both data subsets are inside the applicability domain and can be predicted with high precision. The Insubria graph of the diagonal hat elements vs. the predictions by the model equation is also shown in Fig. 6. This graph verifies the applicability domain for external applied compounds outside the training set without experimental data.

Fig. 6

a) Williams plot of the model equation. The dotted lines are, respectively, the 2.5σ limit and the warning value of hat (h* = 0.6923). b) Insubria graph for applicability domain. The dotted lines mark the minimal and maximal predicted endpoint values and the warning value of hat (h* = 0.6923). For future prediction application of the model equation, this applicability domain has to be considered, because every external compound has to fall within these boundaries. The model can be applied on molecules inside this defined applicability domain. This includes not only the structural domain based on the 5 model descriptors, but also the predicted endpoint value range, which is not included in the Williams plot but in the Insubria graph. The reliable prediction zone is therefore defined by the critical leverage value h* and the minimal and maximal values of the predicted endpoint from the training set. In this study, all compounds of the external validation set were found to be inside this domain.

Additional validation and test for chance correlation

Figure 7 shows the LMO plot, the Y-scramble plot as well the graph for the random responses.

Fig. 7

Plot of a) LMO validation, b) Y-scrambling and c) random response test. Values for the model equation and values for cross validation, scrambling and randomization, respectively, are labelled differently. All these three statistical tests show very good results. The Q2-values for the LMO-cross validation are almost all relatively high. Therefore, one can say, that the model results are not entirely dependent on the whole training set. Hence the stability of the final model can be evaluated as good. The Y-scrambling, a tool for showing possible chance correlation between the descriptors and the endpoint, resulted in very low values for R2 and Q2. The same effect was achieved by the Y-randomization. A possibility of mathematical chance correlation in the final model between the endpoint and the five model descriptors can therefore be seen as very low based on the results. This is also consistent with the result of the permuted endpoint randomization test on the entire dataset. The calculated probability for chance correlation resulted in a p-value for R2 of 0.0067 and for Q2 of 0.0043. For the randomized endpoint randomization test, the calculated probability for chance correlation resulted in a p-value for R2 of 0.0004 and for Q2 also of 0.0004. With the widely used acceptance level of probability of chance correlation ≤ 0.05, the probability for mathematical chance correlation can be dismissed for the original final model. This is especially important, because of the small number of molecules used to develop the model due to higher probability of chance correlation with small datasets combined with large size of descriptor pool.

Mechanistical interpretation

Mechanistical interpretations of QSPR models are quite difficult, as multiple aspects have to be considered. This is why the OECD stated that it only has to be executed if possible [38]. Even though the obtained model is mathematically not overfitted, some potential biases in the descriptor selection due to the small size of the underlying dataset have to be highlighted. Detailed reasons are given below. Five descriptors were selected for the final model. Compared with the relative frequency of descriptors in the best 10 models calculated during the whole modelling process, these descriptors are also present in a lot of other good models (Fig. 8).

Fig. 8

Frequency of molecular descriptors included in the best 10 models.

Frequency of molecular descriptors included in the best 10 models. ATSC4i is included in 30% of the 10 best models, and GATS5s and CrippenLogP are the most often selected descriptors with a 100% and 90% including rate, respectively. This indicates the high importance of the structural information included in these descriptors and their major influence on the endpoint. The descriptor ATSC4i is the centred Broto-Moreau autocorrelation of lag 4 weighted by first ionization potential. Therefore, this descriptor represents the autocorrelation over the topological distance 4 of the first ionization potential as the atomic property [55]. It is calculated as the sum of all products of the first ionization potential of atoms with a topological distance of 4. The descriptor has a negative influence on the value of the calculated reaction rate constant in this model. The descriptor GATS2e is the Geary autocorrelation of lag 2 weighted by Sanderson electronegativity. It represents 2D-autocorrelation classes of descriptors. This descriptor is related to the topology of the structure or its parts that are in association with a given physicochemical property, in this case the Sanderson electronegativity [55]. According to the model equation, an increase in the GATS2e descriptor’s value decreases the reaction rate constant. The descriptor GATS5s, which is the Geary autocorrelation of lag 5 weighted by intrinsic state, has a positive influence on the value of the reaction rate constant in the model of this study. The descriptor CrippenLogP describes the logP calculated based on the approach of Crippen, which assigns additive contributions to the molecular logP value by the individual atoms of the molecule [56]. According to the model equation, higher values of the descriptor result in a higher calculated reaction rate constant. This effect is visible for example for 4-hexylbenzene-1,3-diol, which has the highest value for the CrippenLogP descriptor (Table S7), as it is the most nonpolar molecule in the dataset due to the long alkyl-chain. Simultaneously, it showed the highest reaction rate constant in the experimental investigation. At the same time, as seen in the experimental data (Fig. 3), the benzoic acids in the dataset have the lowest reaction rate constants with concurrently high values for molecular polarity (Table S7). Overall, molecules with an increased molecular polarity showed a lower reaction rate constant. This effect is also observed in other experimental studies. For example, Xu et al. [57] showed, that in a group of phthalate acid esters, molecules with higher polarity (lower logP) were degraded slower with sonolysis. Likewise, Wu et al. [58] found a strong correlation between the logP value and the reaction rate constant of sonolysis, stating a molecule with higher logP value degrades faster. The obtained QSPR model in this study seems to support the observations, that the molecular polarity is indeed responsible for the decreased reactivity of these molecules. Due to the small size of the dataset in this study, this interpretation could however be misleading to an extent, as one cannot be entirely sure if this descriptor is only selected due to a bias in the underlying dataset. A low logP value seems to be equivalent to the sole presence of a carboxyl group, in this dataset even equivalent to a benzoic acid group. The later possesses a negative mesomeric effect due to its electron-withdrawing abilities, which leads to a decreased reactivity of the molecule. The activation energy for secondary substitutions, for example with ROS, is elevated and therefore a degradation with ROS would be decelerated. These two effects, sole polarity on the one hand, and presence of a benzoic acid on the other, might not be distinguished by the model and the real driving force behind the selection of that descriptor cannot be resolved completely. To prove which of these effects is influencing the reaction rate constant in sonolytic degradation, such biases have to be eliminated first. Therefore, a significantly larger dataset is needed in the future, where all key functional groups are represented in various contexts and combinations to reduce the occurrence of such biases. A selection of polar molecules without substituents with mesomeric effects as well as carboxyl groups substituted not directly onto the phenyl ring could provide clarity for that specific case. The descriptor maxHBa is dimensionless and represents the maximum E-states for (strong) hydrogen bond acceptors. With a negative coefficient in the model equation, an increase of the maxHBa descriptor decreases the calculated endpoint of this QSPR model. Strong hydrogen bond acceptors like free pairs of electrons of oxygen atoms or ions therefore contribute negatively to the overall reactivity of the molecule during sonolysis. This could be due to an increased intermolecular interaction with other micropollutant molecules or water, which decreases the overall reactivity of the compound. The high significance of hydrogen bonds in stabilizing phenol derivates was also examined by Mammino et al. [59]. The dihydroxybenzoic acid phenols in the dataset show the lowest reaction rate constants with simultaneous low values for the CrippenLogP and large values for the maxHBa descriptor (Table S7), which could go along with the assumption, that acid groups paired with multiple hydroxy groups lead to higher polarity as well as stronger intermolecular hydrogen bonds. In the dataset, the phenol derivates with nitro groups have also high values for the maxHBa descriptor (Table S7). This could be explained with the meaning of the descriptor maxHBa, because nitro groups as hydrogen bond acceptors result in a decrease of overall reactivity for these molecules as well. But there occurs the same problem as with the CrippenLogP descriptor. In all molecules containing strong hydrogen bond acceptors, these functional groups are directly substituted to the phenyl ring. Both functional groups (–COOH and –NOO) are fragments with negative mesomeric effect, resulting in a stabilization of the molecule. Therefore, the presence of hydrogen bond acceptors in this dataset is equivalent with the presence of a negative mesomeric effect. Thus, the selection of this descriptor might have been biased, as the model can again not distinguish between the stabilizing influence of hydrogen bond acceptors or of the negative mesomeric effect. This could only be differentiated with a larger dataset, including the functional groups substituted not directly to the phenyl ring and molecules with substituents functioning as strong hydrogen bond acceptors without negative mesomeric effect. Due to the experimental background of the underlying dataset, the obtained model is only valid for sonolytic degradation at 860 kHz. As chemical and mechanical effects in sonolytic degradation are highly dependent on the frequency, different degradation pathways are more or less relevant due to for example different ROS concentrations, cavitation bubble sizes and lifetimes and other mechanical processes. Hence, models for other frequencies could be based on different descriptors, as polarity, hydrogen bonds, volatility and other molecular properties could have a varied impact on the degradation abilities of a compound. This study showed, that the mechanistical interpretation of QSPR models is definitely not trivial. To obtain an extensive interpretation without biases, a large and diverse pool of chemicals, including all important functional groups in various contexts and combinations, is necessary as an underlying dataset in the future. Although not every descriptor could be fully interpreted, the two descriptors CrippenLogP and maxHBa allowed a connection of the experimental observations and the results of the QSPR modelling. It was shown that despite the complexity of sonolytic degradation mechanisms, assumptions about the degradation ability of different phenol derivates were possible. Due to the fact that sonolytic degradation does not follow only one reaction pathway, but is an interaction of multiple simultaneous reactions, QSPR is especially useful as it can contribute to mechanistical clarification.

Conclusions

In this study, a QSPR model to predict the reaction rate constant of sonolysis was developed for 32 phenol derivates. The dataset was obtained using a standardized experimental setup with fixed parameters to eliminate the negative influence of anomalies inside the data on the modelling results. All five OECD principles for QSAR/QSPR models were carefully respected to the best of our knowledge. 1D and 2D descriptors calculated with PaDEL software were used as a basis and normalization was performed to eliminate the influence of different scales. The whole modelling process was executed with the software QSARINS. Constant, near constant and correlated descriptors were excluded from further processing. The dataset was split into training set and external validation set in consideration of the chemical structure. 5 relevant descriptors were selected using genetic algorithm applying Q2loo as fitness function. The final model was selected from a pool of generated models using a multi-criteria decision-making tool, finding the best model with an agreement of optimal values for all calculated statistical parameters. The final model shows good performance in regression, predictability, and robustness with R2adj = 0.9455, CCCtr = 0.9777, Q2loo = 0.9285, CCCext = 0.9797 and Q2ext-F1 = 0.9711. It was shown that a very homogenous dataset with a standardized experimental background leads to good modelling results, even with a very small number of compounds. For sufficient model evaluation, multiple internal and external statistical tests were performed, and statistical parameters were calculated. This included not only an external validation for evaluation of the predictability of the model, but also internal cross validation to evaluate the stability. To remove the possibility of chance correlation between endpoint and descriptors, Y-scrambling and Y-randomisation was carried out with good results. Probability of chance correlation could also be dismissed for the final model through permuted and randomized response modelling. This was especially important due to the small size of the underlying dataset. The interpretation of selected descriptors was carried out as best as possible. An influence of the polarity and the occurrence of hydrogen bonds on the degradation ability of phenol derivates was calculated by the obtained QSPR model, which goes along with previous experimental studies. However, due to the small size of the dataset, a biased selection of these descriptors could not be excluded. This is because all molecules with high polarity and strong hydrogen bond acceptors included substituents on the phenyl ring with negative mesomeric effect, which could also be responsible for the stabilization of these compounds. Larger datasets, including multiple variations of functional groups, are necessary to eliminate such biases and to make an extensive interpretation possible. For future application of QSPR modelling in wastewater treatment, we suggest implementing not only the necessary internal and external validation techniques according to the OECD principles, but also methods to minimize the possibility of the occurrence of for example chance correlation and multicollinearity. This prevents mathematical problems on the one hand, but also might help with the comparability of different QSPR studies. The additional values strengthen the validity of the whole model with its calculated predictivity from external validation and stability from internal validation. Also, additional statistical parameters might enhance comparability between different modelling approaches currently applied in different research groups.

CRediT authorship contribution statement

Judith Glienke: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization, Project administration. Willy Schillberg: Investigation, Writing – review & editing. Michael Stelter: Resources, Supervision. Patrick Braeutigam: Conceptualization, Resources, Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

33 in total

1. Sonolysis of natural phenolic compounds in aqueous solutions: degradation pathways and biodegradability.

Authors: Constantinos Vassilakis; Ariadni Pantidou; Elefteria Psillakis; Nicolas Kalogerakis; Dionissios Mantzavinos
Journal: Water Res Date: 2004-07 Impact factor: 11.236

2. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints.

Authors: Chun Wei Yap
Journal: J Comput Chem Date: 2010-12-17 Impact factor: 3.376

3. Evaluation of a model for the removal of pharmaceuticals, personal care products, and hormones from wastewater.

Authors: Benjamin D Blair; Jordan P Crago; Curtis J Hedman; Ronan J F Treguer; Christopher Magruder; L Scott Royer; Rebecca D Klaper
Journal: Sci Total Environ Date: 2013-01-04 Impact factor: 7.963

4. On the development and validation of QSAR models.

Authors: Paola Gramatica
Journal: Methods Mol Biol Date: 2013

5. Sonolysis of aqueous 4-nitrophenol at low and high pH.

Authors: A Tauber; H P Schuchmann
Journal: Ultrason Sonochem Date: 2000-01 Impact factor: 7.491

Review 6. Emerging pollutants in the environment: present and future challenges in biomonitoring, ecological risks and bioremediation.

Authors: Maria Gavrilescu; Kateřina Demnerová; Jens Aamand; Spiros Agathos; Fabio Fava
Journal: N Biotechnol Date: 2014-01-21 Impact factor: 5.079

7. Sonophotolytic degradation of phthalate acid esters in water and wastewater: influence of compound properties and degradation mechanisms.

Authors: L J Xu; W Chu; Nigel Graham
Journal: J Hazard Mater Date: 2015-02-09 Impact factor: 10.588

Review 8. Ultrasonic destruction of phenol and substituted phenols: a review of current research.

Authors: Rana Kidak; Nilsun H Ince
Journal: Ultrason Sonochem Date: 2006-01-03 Impact factor: 7.491

9. Rate constants of hydroxyl radicals reaction with different dissociation species of fluoroquinolones and sulfonamides: Combined experimental and QSAR studies.

Authors: Xiang Luo; Xiaoxuan Wei; Jingwen Chen; Qing Xie; Xianhai Yang; Willie J G M Peijnenburg
Journal: Water Res Date: 2019-09-14 Impact factor: 11.236

Review 10. Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs.

Authors: Lennart Eriksson; Joanna Jaworska; Andrew P Worth; Mark T D Cronin; Robert M McDowell; Paola Gramatica
Journal: Environ Health Perspect Date: 2003-08 Impact factor: 9.031