
Predicting reversed-phase liquid chromatographic retention times of pesticides by deep neural networks.

Julien Parinet

Abstract

Being able to predict reversed-phase liquid chromatographic (RPLC) retention times of contaminants is an asset for solving food contamination issues. The development of quantitative structure-retention relationship (QSRR) models requires selecting the best molecular descriptors and machine-learning algorithms. In the present work, two main approaches were tested and compared: one based on an extensive literature review to select the best set of 16 molecular descriptors, and a second using diverse strategies to select 16 molecular descriptors (MDs) among 1545. In both cases, a deep neural network (DNN) was optimized through a grid search.
© 2021 The Author(s).


Keywords:  Deep neural network; Molecular descriptors; Pesticides; QSRR; Reversed-phase liquid chromatography; Selection of inputs

Year:  2021        PMID: 34950792      PMCID: PMC8671870          DOI: 10.1016/j.heliyon.2021.e08563

Source DB:  PubMed          Journal:  Heliyon        ISSN: 2405-8440


Introduction

Contaminants, and especially pesticides, in food are of growing concern as the general public becomes increasingly aware of their health effects (Dashtbozorgi et al., 2013). Depending on their concentrations, toxicity, and frequency of detection in food and in the environment, pesticides may lead to health impairment, disease and even death (Colosio et al., 2017). Detecting and quantifying these compounds helps guarantee compliance of imported goods with the laws and regulations of the importing country (Chiesa et al., 2016). The high accuracy and mass sensitivity of high-resolution mass spectrometry (HRMS) instruments hyphenated to liquid (LC) or gas (GC) chromatography make it possible to observe thousands of chemical features in food and environmental samples. These features include monoisotopic exact mass, chromatographic retention time (RT), abundance, isotope profiles and MS2 fragmentations. However, data processing and chemical characterization remain difficult despite recent developments. Chemical reference standards and spectral data make it possible to confirm the structure of observed features, but reference standards, especially for metabolites and by-products, are rarely available for the thousands of features encountered in non-target analysis (NTA) and suspect screening analysis (SSA) (McEachran et al., 2018), and acquiring these thousands of standards would also represent a considerable cost. Since the appearance of HRMS, interest has grown in improving confidence in the identification of small molecules such as pesticides, moving from putative detection in positive samples to confirmation (Bade et al., 2015a; Schymanski et al., 2014). SSA studies are those in which observed but unknown features are compared against a database of chemical suspects to identify plausible hits. NTA studies are those in which chemical structures of unknown compounds are postulated without the aid of suspect lists (Sobus et al., 2018).
In both cases, confirming the identification of a contaminant requires its standard, which may be unavailable, expensive, or time-consuming to obtain in the case of food poisoning. This is especially true for pesticides, for which there are a few thousand analytes, metabolites and by-products. To increase confidence in the tentative identification of compounds, especially in SSA, it is conceivable to predict their chromatographic retention time (RT) (Bade et al., 2015b; Barron and McEneff, 2016; Parinet, 2021; Randazzo et al., 2016). To predict RT, different strategies using various molecular descriptor (MD) sets and multiple machine-learning algorithms have been tested and published (Aalizadeh et al., 2019; Bade et al., 2015a; Barron and McEneff, 2016; Goryński et al., 2013; McEachran et al., 2018; Munro et al., 2015; Noreldeen et al., 2018; Parinet, 2021; Randazzo et al., 2016). These strategies range from the use of logKow models (Bade et al., 2015b) to more complex in silico approaches based on quantitative structure-retention relationship (QSRR) modeling, including artificial neural networks (ANNs), support vector machines (SVMs), random forests (RF), partial least squares regression (PLS-R), and multilinear regression (MLR) (Ghasemi and Saaidpour, 2009; Munro et al., 2015; Parinet, 2021). In the first part of this study, two different approaches were tested and compared to build an effective QSRR model dedicated specifically to predicting the RTs of pesticides analyzed by reversed-phase liquid chromatography (RPLC) (C18) in SSA or NTA. The first approach was based on an exhaustive literature review to find the best MD set for predicting pesticide RTs. The second approach started with no preconceived idea as to which of the 1545 available MDs should feed the QSRR.
In this second approach, various strategies were used to select sixteen MDs from the full set: Lasso regression, Pearson correlation feature selection (Pearson), recursive feature elimination (RFE), and principal component analysis (PCA). In both cases, a deep learning algorithm, a multilayer perceptron (MLP), was retained and optimized to predict pesticide RTs, and the two approaches were compared in order to select the best one.

Materials and methods

Dataset

Initially, the dataset included 843 RTs of pesticides collected from the article of Wang et al. (2019). The ultra-high-performance liquid chromatography (UHPLC) gradient conditions, column temperatures, mobile phases, columns, and instruments used to generate the data are presented in detail in Wang et al. (2019). Three software applications were used to compute the pesticides' MDs; they are free, can calculate a large number of descriptors, and are widely available. The ACD software (Advanced Chemistry Development, Toronto, ON, Canada) was used to calculate LogP and LogD. The Toxicity Estimation Software Tool (TEST, Cincinnati, OH, USA) was used to compute Hy, Ui, IB, BEHp1, BEHp2, GATS1m, and GATS2m. The remaining molecular descriptors (1834 MDs) were calculated using the ChemDes online platform (http://scbdd.com/chemdes/). Once the MDs were computed, the dataset was cleaned to remove constant and missing values (Figure 1): constant values are useless for developing QSRR models, and missing values make learning and prediction impossible. The missing values are due to the inability of the software, depending on the molecule, to generate the MD. At the end of this curation process, 792 pesticides, their RTs, and 1545 MDs remained in the final dataset. The dataset containing the MDs for each pesticide was then ready for building QSRR models (Table S1).
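The curation step described above (dropping descriptor columns that are constant or contain missing values) can be sketched with pandas. The toy table and column names below are illustrative, not the study's actual MD matrix:

```python
import numpy as np
import pandas as pd

def clean_descriptor_table(df: pd.DataFrame) -> pd.DataFrame:
    """Drop descriptor columns that are constant or contain missing values."""
    # Columns with any missing value (the software failed to compute the MD)
    df = df.dropna(axis=1)
    # Constant columns carry no information for a QSRR model
    df = df.loc[:, df.nunique() > 1]
    return df

# Toy example: one informative, one constant, one incomplete descriptor
toy = pd.DataFrame({
    "MD1": [1.0, 2.0, 3.0],    # informative descriptor -> kept
    "MD2": [5.0, 5.0, 5.0],    # constant -> dropped
    "MD3": [0.1, np.nan, 0.3]  # missing value -> dropped
})
print(list(clean_descriptor_table(toy).columns))  # ['MD1']
```

Applied to the full 843-compound table, this is the kind of filter that reduced the descriptor set to the 1545 usable MDs.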
Figure 1

QSRR model development and evaluation of performances.


QSRR model development

The dataset constituted previously, containing the pesticides (792), their MDs (1545), and their RTs, was first used with the best MDs inherited from the literature review (Model 1). To find this best set of MDs, a literature review was carried out, selecting the most recent and pertinent papers according to the following criteria: prediction of retention times measured by RPLC, for pesticides or similar compounds (pharmaceuticals, veterinary drugs). Seven articles, their MDs, and their models were selected (shown in Table 1 with their performances) and compared in terms of performance, measured principally through the percentage of error, i.e. the root mean square error (RMSE) divided by the maximum retention time measured on the last eluted compound. To pursue the no a priori approach (Model 2 to Model 8), diverse strategies were used and compared to select the best sixteen MDs among the 1545. Sixteen MDs were retained so that the performances of Models 2 to 8 could be compared with the model inherited from the literature review (Model 1). The first strategy used Lasso regression, a regularized linear regression that constrains coefficients to be close or equal to zero, thereby performing an automatic selection of the features/MDs, here 16 MDs (ATS8m, ATS5i, iedm, SRW10, ATS5v, VR2_Dt, VR1_D, VR1_Dt, VR2_D, ATS8i, ATS7i, ATS3i, ATSC3m, ATS0m, ATS0v, ATS4v). The second strategy was based on the Pearson correlation between the 1545 MDs and the output (pesticide RTs): the stronger the relationship, the more likely the feature/MD is to be selected for modeling; sixteen MDs were selected in this way (LogP, BEHm4, CrippenLogP, ALOGP2, ALOGP, XLOGP2, XLOGP, ATS6p, ATS5p, ATS4p, ATS3p, ATS1p, ATS6v, BEHm8, BEHm5, BEHm7).
The third strategy, recursive feature elimination (RFE), selects features/MDs iteratively: starting from all MDs, a model (here a multi-linear regression) is built, the least important feature is discarded, and the process is repeated until a model with 16 MDs is obtained (maxtsC, MWC2, MWC03, MWC4, MWC5, nN, k2, MDEN-23, MDEN-33, MDEO-11, MDEO-12, MDEC-34, MDEC-44, MAXDP2, MDEN-22, ieadjmm). Finally, the fourth strategy was based on principal component analysis (PCA) and was declined into four sub-strategies (PCA1 to PCA4), all using the same PCA. The PCA was performed on the 1545 MDs measured for the 792 pesticides; the MDs were normalized (centered and reduced) beforehand, and 16 principal components (PCs) were retained. The PCA1 strategy selected the MD most correlated with each PC, yielding 16 MDs (TWC, CIC1, ETA_Epsilon_2, AATS1p, icyce, MLFER_E, MATS2v, nCl, AATSC3p, R, JGI3, StsC, nHCHnX, ATSC6e, MATS6i, MATS6m). The PCA2 strategy selected the 16 MDs most correlated with PC1, as PC1 was the PC most correlated with RT (TWC, Zagreb, nBonds, nBO, MWC01, SRW02, MPC01, ZM1, WTPT-1, SRW04, CID, nHeavyAtom, MPC2, nSK, SRW01, BID). The PCA3 strategy selected the 16 MDs most correlated with PC1 (8 MDs) and PC4 (8 MDs), as PC1 and PC4 were the PCs most correlated with RT (TWC, Zagreb, nBonds, nBO, MWC01, SRW02, MPC01, ZM1, AATS1p, AATS0p, AATS4p, Mp, ETA_AlphaP, AATS3p, AATS5p, AATS2p). Finally, the PCA4 strategy used the 16 PCs themselves, with their corresponding scores as inputs (PC1 to PC16).
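The four selection strategies above can be sketched with scikit-learn. This is a minimal sketch on synthetic data standing in for the 1545-MD table; only the selection size (16) matches the study:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

K = 16  # number of descriptors to retain, matching the study
X, y = make_regression(n_samples=200, n_features=100, noise=5.0, random_state=0)
Xs = StandardScaler().fit_transform(X)  # MDs are centered and reduced first

# 1) Lasso: rank by |coefficient|; in the study the 16 surviving
#    non-zero coefficients identified the selected descriptors
lasso = Lasso(alpha=1.0).fit(Xs, y)
lasso_idx = np.argsort(np.abs(lasso.coef_))[::-1][:K]

# 2) Pearson: keep the K descriptors most correlated with the RT output
corr = np.array([abs(np.corrcoef(Xs[:, j], y)[0, 1]) for j in range(Xs.shape[1])])
pearson_idx = np.argsort(corr)[::-1][:K]

# 3) RFE: iteratively drop the least important feature of a linear model
rfe = RFE(LinearRegression(), n_features_to_select=K).fit(Xs, y)
rfe_idx = np.where(rfe.support_)[0]

# 4) PCA: retain 16 components; their scores can feed the MLP directly
#    (the PCA4 strategy)
scores = PCA(n_components=K).fit_transform(Xs)

print(len(lasso_idx), len(pearson_idx), len(rfe_idx), scores.shape)
```

The PCA1 to PCA3 variants would then pick MDs by their loadings on specific components rather than using the scores themselves.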
Table 1

QSRR models selected from the literature review.

| References | Type of contaminant | Number of contaminants | MDs selected | Best machine-learning algorithm | RT max measured (min) | R2 test set | RMSE test set (min) | Percentage of error |
|---|---|---|---|---|---|---|---|---|
| Aalizadeh et al. (2019) | Emerging contaminants | 1830 | LogD, CIC1, SeigZ, RDF020p, AlogP | SVM | 14.4 | 0.88 | 1.04 | 7% |
| McEachran et al. (2018) | Environmental contaminants | 97 | LogP, LogD, molecular weight, molecular volume, polar surface area, molar refractivity, H_donors, H_acceptors | ACD/ChromGenius® | 40.8 | 0.92 | 2.66 | 6.5% |
| Bade et al. (2015a, b) | Emerging contaminants | 544 | nDB, nTB, nC, nO, nR04-nR09, UI, Hy, MlogP, AlogP, logP, logD | MLP | 16.5 | 0.91 | 0.89 | 5.4% |
| Munro et al. (2015) | Pharmaceuticals | 166 | nDB or nTB, nC or nO, nR04-nR09, UI, Hy, MlogP, AlogP, LogD, nBnz, pKa | GRNN | 23.2 | 0.88 | 1.39 | 5.9% |
| Noreldeen et al. (2018) | Veterinary drugs | 95 | ACDlogP, ALOGP, ALOGP2, Hy, Ui, Ib, BEHp1, BEHp2, GATS1m, GATS2m | MLR | 9.3 | 0.95 | 0.62 | 6.6% |
| Bride et al. (2021) | Environmental contaminants | 274 | logD, DBE, nO, nC, nH, molecular weight, H_donors, logSw | MLR | 14.7 | 0.76 | 1.36 | 9.2% |
| Yang et al. (2020) | Pharmaceuticals | 133 | XlogP, BCUTp.1h, AATS1i, AATS3i, GATS1e, ALogP, AATSC0p, ETA_EtaP_B, AATS4i, AATS5i | MLR | 15.0 | 0.63 | 1.42 | 9.4% |

logD: measure of hydrophobicity for ionizable compounds.
CIC1: Complementary Information Content index (neighborhood symmetry).
SeigZ: eigenvalue sum from a Z-weighted distance matrix of a hydrogen-depleted molecular graph.
RDF020p: radial distribution function weighted by atomic polarizabilities.
AlogP: logP estimated by the Ghose-Crippen method.
LogP (or LogKow): logarithm of the ratio of the concentrations of the test substance in octanol and water; it captures the hydrophilic or hydrophobic (lipophilic) character of a molecule.
Polar surface area: surface sum over all polar atoms, primarily oxygen and nitrogen, including their attached hydrogen atoms.
Molar refractivity: a measure of the total polarizability of a mole of a substance.
H_donors: number of H-bond donors, as descriptors of the H-bonding property.
H_acceptors: number of H-bond acceptor groups, as descriptors of the H-bonding property.
nDB: number of double bonds.
nTB: number of triple bonds.
nC: number of carbon atoms.
nO: number of oxygen atoms.
nR04-nR09: number of 4- to 9-membered rings.
UI: unsaturation index.
Hy: hydrophilic factor.
MlogP: Moriguchi logP.
nBnz: number of benzene groups.
pKa: equilibrium constant of the dissociation reaction of an acid species in acid-base reactions.
ACDlogP: molecular property, octanol-water partition coefficient.
ALOGP2: molecular property, Ghose-Crippen octanol-water coefficient squared.
Ib: information indices, information bond index.
BEHp1: Burden eigenvalue descriptor, highest eigenvalue n. 1 of the Burden matrix weighted by atomic polarizabilities.
BEHp2: Burden eigenvalue descriptor, highest eigenvalue n. 2 of the Burden matrix weighted by atomic polarizabilities.
GATS1m: 2D autocorrelation descriptor, Geary autocorrelation lag 1 weighted by atomic masses.
GATS2m: 2D autocorrelation descriptor, Geary autocorrelation lag 2 weighted by atomic masses.
DBE: double-bond equivalent, the number of unsaturations present in an organic molecule.
logSw: water solubility, the logarithm of water solubility in mg/L at 25 °C.
XlogP: constitutional descriptor describing hydrophobic/hydrophilic properties.
BCUTp.1h: BCUT descriptor, highest polarizability-weighted BCUTS.
AATS1i: autocorrelation descriptor, average Broto-Moreau autocorrelation lag 1 weighted by first ionization potential.
AATS3i: autocorrelation descriptor, average Broto-Moreau autocorrelation lag 3 weighted by first ionization potential.
GATS1e: autocorrelation descriptor, Geary autocorrelation lag 1 weighted by Sanderson electronegativities.
AATSC0p: autocorrelation descriptor, average centered Broto-Moreau autocorrelation lag 0 weighted by first ionization potential.
ETA_EtaP_B: extended topochemical atom descriptor, branching index EtaB relative to molecular size.
AATS4i: autocorrelation descriptor, average Broto-Moreau autocorrelation lag 4 weighted by first ionization potential.
AATS5i: autocorrelation descriptor, average Broto-Moreau autocorrelation lag 5 weighted by first ionization potential.
Regardless of the MD dataset used, the following procedure was applied. The MD datasets and the corresponding pesticide RT values were divided into three subsets: a training, a test and a validation dataset (Figure 1). The training dataset was composed of 445 randomly chosen pesticides, their corresponding MDs (inputs) and experimentally measured RTs (output). The test dataset was composed of 148 randomly chosen pesticides, their corresponding MDs (inputs) and experimentally measured RTs (output); the training and test sets thus have a size ratio of three to one. The validation dataset was composed of 198 randomly chosen pesticides never used before, their corresponding MDs, and experimentally measured RTs.
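The three-way split described above can be sketched with scikit-learn by splitting twice: first holding out the validation set, then dividing the remainder into training and test sets. The arrays here are random placeholders for the real MD/RT table:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(792, 16))  # 792 pesticides x 16 MDs (placeholder values)
y = rng.normal(size=792)        # retention times (placeholder values)

# Hold out the 198-compound validation set first, never touched during tuning
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=198,
                                              random_state=42)
# Then split the remainder into the 445-compound training set and
# the 148-compound test set (roughly a three-to-one ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, train_size=445, test_size=148, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 445 148 198
```

Fixing `random_state` makes the random draw of compounds reproducible across runs.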
Initially, the training dataset was used to train the DNN, here an MLP, by tuning the hyper-parameters through a grid search and a cross-validation process in which the training dataset was divided into five equal-size sub-datasets (cv = 5). The hyper-parameters tuned were:

- the number of hidden layers, each composed of a number of neurons equal to the number of MDs used as inputs (Geron, 2017): from 1 to 5 hidden layers of 16 neurons each;
- the activation function: ReLU, tanh or logistic;
- the alpha value: 10 or 1;
- the solver: Adam, SGD or L-BFGS.

The data were standardized (mean-centered) in order to accelerate and enhance training and prediction, and also to simplify interpretation of the importance of the features/MDs. All the models were developed with Python 3.8 from the Python Software Foundation, available at http://www.python.org. The DNN was optimized and developed with the Scikit-learn library (https://scikit-learn.org), in particular the sklearn.neural_network module.
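A minimal sketch of this grid search with scikit-learn follows; synthetic data stand in for the real 16-MD training table, and the fit call is left commented because it trains 90 × 5 networks:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 16-MD training table (445 pesticides in the study)
X, y = make_regression(n_samples=445, n_features=16, noise=1.0, random_state=0)

# 1 to 5 hidden layers, each with 16 neurons (one per input MD)
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(16,) * n for n in range(1, 6)],
    "mlpregressor__activation": ["relu", "tanh", "logistic"],
    "mlpregressor__alpha": [10, 1],
    "mlpregressor__solver": ["adam", "sgd", "lbfgs"],
}
pipe = make_pipeline(StandardScaler(),  # standardize the MDs
                     MLPRegressor(max_iter=1000, random_state=0))
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")

print(len(ParameterGrid(param_grid)))  # 90 hyper-parameter combinations
# search.fit(X, y)  # trains 90 x 5 = 450 MLPs; search.best_params_ is the winner
```

With five folds, each of the 5 × 3 × 2 × 3 = 90 candidate configurations is trained five times, and the configuration with the best mean cross-validated R2 is retained.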

Model validation

Validation of QSRR models is probably the most significant and critical part of model evaluation, in particular to prevent overfitting. For this reason, the validation step was carried out on the validation dataset, never used for the training and testing parts (Noreldeen et al., 2018) (Figure 1). The coefficient of determination (R2) and the RMSE, measured on the test set, were used to evaluate and compare the models extracted from the literature review (Table 1). These parameters were also used for the models developed in this study to determine the error between experimental and predicted RTs, and especially the models' ability to generalize to new pesticides with unknown RTs. The lower the RMSE and the higher the R2, the better the model. For the models developed in the present study, R2 and RMSE were measured on the training set (n = 445 pesticides), the test set (n = 148 pesticides), and the validation set (n = 198 pesticides) (Table 2).
Table 2

Performances of QSRR models applied to the pesticide dataset.

| N° Model | Number of MDs | Name of the model | Training R2 | Training RMSE (min) | Test R2 | Test RMSE (min) | Validation R2 | Validation RMSE (min) | Percentage of error | Neurons per hidden layer | Activation function | Solver | Alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 | Bade-MLP | 0.95 | 0.43 | 0.90 | 0.63 | 0.82 | 0.67 | 6% | 16-16-16-16-16 | ReLU | Adam | 10 |
| 2 | 16 | Lasso-MLP | 0.60 | 1.19 | 0.50 | 1.27 | 0.49 | 1.36 | 12% | 16 | tanh | SGD | 1 |
| 3 | 16 | Pearson-MLP | 0.79 | 0.86 | 0.79 | 0.83 | 0.78 | 0.88 | 8% | 16-16 | ReLU | SGD | 10 |
| 4 | 16 | RFE-MLP | 0.69 | 1.04 | 0.60 | 1.15 | 0.63 | 1.16 | 10% | 16-16-16-16-16 | ReLU | SGD | 10 |
| 5 | 16 | PCA1-MLP | 0.75 | 0.94 | 0.61 | 1.12 | 0.64 | 1.14 | 10% | 16 | tanh | Adam | 1 |
| 6 | 16 | PCA2-MLP | 0.42 | 1.44 | 0.34 | 1.47 | 0.38 | 1.50 | 13% | 16 | tanh | Adam | 1 |
| 7 | 16 | PCA3-MLP | 0.61 | 1.18 | 0.53 | 1.24 | 0.56 | 1.26 | 11% | 16-16-16 | ReLU | SGD | 10 |
| 8 | 16 | PCA4-MLP | 0.82 | 0.79 | 0.75 | 0.91 | 0.76 | 0.93 | 8% | 16-16-16-16 | ReLU | SGD | 10 |
The percentage of error was used to compare the models. Of note, the gradient durations are not the same between the different studies mentioned in the literature review (Table 1), and an RMSE of 1 min does not have the same meaning for a gradient of 10 min as for a gradient of 40 min. For this reason, the maximum chromatographic retention time (RT max) was systematically recorded (Tables 1 and 2). The RT max, displayed in Table 2, corresponds to the elution time of the last compound analyzed. The following statistics were calculated with Python (version 3.8) for model validation and comparison (McEachran et al., 2018), where ŷi and yi are the predicted and experimental RTs, respectively, and ȳ is the mean experimental RT. The coefficient of determination (R2) between predicted and experimental RTs was calculated as (Eq. (1)):

R2 = 1 − Σi (ŷi − yi)² / Σi (yi − ȳ)²

The root mean square error (RMSE) between predicted and experimental RTs was calculated as (Eq. (2)):

RMSE = √[(1/n) Σi (ŷi − yi)²]

The percentage of error (% error) was calculated as (Eq. (3)):

% error = 100 × RMSE / RT max
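The three statistics above translate directly into a few lines of NumPy. This is a sketch with made-up retention times, not the study's data:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination between experimental and predicted RTs."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in minutes."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def percent_error(y_true, y_pred, rt_max):
    """RMSE relative to the retention time of the last eluted compound."""
    return 100.0 * rmse(y_true, y_pred) / rt_max

# Illustrative experimental vs predicted RTs (minutes)
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])
print(round(r2(y_true, y_pred), 2))                           # 0.95
print(round(rmse(y_true, y_pred), 2))                         # 0.5
print(round(percent_error(y_true, y_pred, rt_max=10.0), 1))   # 5.0
```

Normalizing the RMSE by RT max is what makes models trained on gradients of different lengths comparable.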

Structure of the DNN

A DNN is a computer program inspired by biological neural networks and designed to model complex, non-linear problems (classification or regression). A typical DNN is composed of anywhere from a few to millions of neurons, arranged in a series of layers (Zhong et al., 2020). A neuron is a computational unit with one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection. The input neurons in the input layer receive the data, such as the MDs used here, and the output neurons in the last layer produce the final predictions of the DNN, which are compared with the true target data, here the RTs of the pesticides. Between the input layer and the output layer are hidden layers, often more than one in the case of a DNN (Zhong et al., 2020). The input data enter the DNN through the input layer, are transformed in the hidden layers, and finally become the predictions in the output layer. The value of each neuron in the hidden and output layers is obtained by applying an activation function to the weighted sum of the values of the neurons in the previous layer plus a bias; the weights and biases are updated based on the errors between the predictions and the targets until the errors reach a minimum. This updating is done by back-propagation of the errors between the target (experimental RT) and the prediction (predicted RT); this is the "learning" process of the DNN. DNNs have two main structural hyperparameters: the number of neurons per layer and the number of layers, also called the "width" and "depth" of the DNN, respectively. More layers and neurons mean deeper and wider DNNs, which often have more powerful fitting ability and can achieve better prediction accuracy.
However, too many layers and neurons can lead to overfitting, i.e. accurate prediction on the training set but poorer prediction on the test set. It is crucial for the DNN to be able to generalize to a dataset never seen before; for this reason, we split the dataset into training, test and validation datasets to evaluate the capacity of the DNN to generalize. The model development process therefore consists of finding an optimum DNN architecture with an appropriate fitting ability. In this study, the DNN was composed of an input layer, several hidden layers, and an output layer. In each layer, neurons accept values from the neurons of the neighboring layer. In the input and hidden layers, the number of neurons was equal to the number of selected MDs: with 16 MDs, there were 16 neurons in the input layer and in each hidden layer, as suggested by Geron (2017). The output layer had a single neuron because there was only one RT per pesticide. The number of neurons in the hidden layers was set manually before the learning process began. Here, we focused on the following hyperparameters: the number of hidden layers, the activation function, the alpha value, and the solver. Their effects on the performance of the DNN were investigated through a grid search and cross-validation (cv = 5) on the training set. The R2 and RMSE values were calculated to evaluate the effects of the hyperparameters on model performance and on overfitting. A detailed description of the theory behind DNNs is provided elsewhere (Zhong et al., 2020). Model training was stopped after 1000 epochs (iterations).
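The forward pass described above (each neuron applies an activation function to a weighted sum of the previous layer's values plus a bias) can be made concrete in a few lines of NumPy. The weights here are random placeholders, standing in for values that back-propagation would learn:

```python
import numpy as np

def relu(z):
    """ReLU activation, one of the functions tuned in the grid search."""
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """Forward pass: each layer computes activation(W @ x + b)."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)   # hidden layers apply the activation function
    W, b = layers[-1]
    return W @ x + b          # linear output layer: one predicted RT

rng = np.random.default_rng(0)
n_md = 16  # one input neuron per molecular descriptor
layers = [
    (rng.normal(size=(16, n_md)), np.zeros(16)),  # hidden layer 1 (16 neurons)
    (rng.normal(size=(16, 16)), np.zeros(16)),    # hidden layer 2 (16 neurons)
    (rng.normal(size=(1, 16)), np.zeros(1)),      # output layer: single RT
]
rt_pred = mlp_forward(rng.normal(size=n_md), layers)
print(rt_pred.shape)  # (1,)
```

Training would then adjust every W and b by back-propagating the error between predicted and experimental RTs, which is what MLPRegressor does internally.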

Results and discussion

For a DNN, prediction accuracy depends strongly on its structure (the number of layers and neurons, and other hyperparameters such as the activation function and the solver for weight optimization), and even more on the inputs retained, in our case the MDs.

Comparison of published QSRR models

One of the main bottlenecks in designing QSRR models is selecting the MDs (May et al., 2011; Parinet, 2021; Scotti et al., 2016). The selection of the most suitable MDs among several thousand can follow various strategies (May et al., 2011); this step is particularly complicated because many molecular descriptors can be calculated and used (Aalizadeh et al., 2019; Bade et al., 2015a, 2015b; McEachran et al., 2018; Munro et al., 2015; Noreldeen et al., 2018) and many selection strategies exist. Here, to develop the most accurate QSRR dedicated to pesticides, we used two different approaches. The first was based on an extensive literature review on the prediction of RPLC retention times of compounds similar in structure and properties to pesticides, such as pharmaceuticals and veterinary drugs. Seven articles emerged from this review (Table 1). To select the best set of MDs among these seven papers, the performances of their QSRR models were documented and compared (Table 1). The number of contaminants used to build and optimize the QSRR models ranged from 95 to 1830 compounds, the number of selected MDs from 5 to 16, and the measured RT max values from 9.3 to 40.8 min. The machine-learning algorithms used were SVM, DNNs (MLP and general regression neural networks (GRNN)), and MLR. The performances measured on the test sets ranged from 0.63 to 0.95 for R2 and from 0.62 to 1.42 min for RMSE. Nevertheless, the gradients are not similar, as reflected by the different RT max values, so the RMSE and R2 alone are not sufficient to determine which MD set and QSRR model is the most efficient. For this reason, we calculated the percentage of error (Eq. (3)),
which was not done in the recent article of Parinet (2021), where all the selected references and their corresponding MD sets were applied directly to the pesticide dataset to predict RT. The percentage of error ranged from 5.4% to 9.4%. The lowest value was obtained for the QSRR developed by Bade and colleagues (2015) on 544 emerging contaminants, using 16 MDs (nDB, nTB, nC, nO, nR04-nR09, UI, Hy, MLogP, ALogP, LogP, LogD) and a DNN (MLP). Based on these results, we retained the Bade and colleagues (2015) MD set and the MLP as the best ML algorithm for our QSRR development (Model 1), with a percentage of error equal to 5.4%. We then applied the MDs listed by Bade and colleagues (2015) to our dataset through an MLP (Bade-MLP, Model 1) as described above. With this approach, the R2 on the training and test sets was 0.95 and 0.90, respectively, and the RMSE 0.43 and 0.63 min (Table 2, Figure S1A & S1B). On the validation set, never used for the learning and optimization process, the R2 was 0.82 and the RMSE 0.67 min (Table 2, Figure S1C). These results are similar to those obtained by Parinet (2021) with the McEachran 3 MDs on the validation dataset, using SVM and MLP as machine-learning algorithms, where the R2 values were between 0.85 and 0.89 and the RMSE between 0.64 and 0.69 min. The percentage of error obtained with these molecular descriptors and an MLP was around 6%, close to the 5.4% obtained by Bade and colleagues (2015) on their own compounds.

Comparison between QSRR models developed from the literature review and from the no a priori approaches

To develop the most efficient QSRR model specifically for pesticides, we compared the performance of Model 1 (Bade-MLP) with that of Models 2 to 8 (no a priori approach).

Model 2 (Lasso-MLP) applied to our pesticide dataset gave R2 values on the training and test sets of 0.60 and 0.50, respectively (Table 2, Figures S2A & S2B), with RMSE values of 1.19 and 1.27. On the validation set, the R2 was 0.49 and the RMSE 1.36 (Table 2, Figure S2C). The percentage of error obtained with these molecular descriptors and an MLP was around 12%, twice that of Model 1 (Bade-MLP), which reached 6% on the same compounds.

Model 3 (Pearson-MLP) gave R2 values on the training and test sets of 0.79 and 0.79, respectively (Table 2, Figures S3A & S3B), with RMSE values of 0.86 and 0.83. On the validation set, the R2 was 0.78 and the RMSE 0.88 (Table 2, Figure S3C). The percentage of error was around 8%, worse than the 6% of Model 1 (Bade-MLP) on the same compounds but much better than Model 2.

Model 4 (RFE-MLP) gave R2 values on the training and test sets of 0.69 and 0.60, respectively (Table 2, Figures S4A & S4B), with RMSE values of 1.04 and 1.15. On the validation set, the R2 was 0.63 and the RMSE 1.16 (Table 2, Figure S4C). The percentage of error was around 10%, worse than both Model 1 (Bade-MLP, 6% on the same compounds) and Model 3.

Model 5 (PCA1-MLP) gave R2 values on the training and test sets of 0.75 and 0.61, respectively (Table 2, Figures S5A & S5B), with RMSE values of 0.94 and 1.12. On the validation set, the R2 was 0.64 and the RMSE 1.14 (Table 2, Figure S5C). The percentage of error was around 10%, worse than Model 1 (6%) and quite similar to Model 4.

Model 6 (PCA2-MLP) gave R2 values on the training and test sets of 0.42 and 0.34, respectively (Table 2, Figures S6A & S6B), with RMSE values of 1.44 and 1.47. On the validation set, the R2 was 0.38 and the RMSE 1.50 (Table 2, Figure S6C). The percentage of error was around 13%, making it the worst model developed, with performance quite similar to Model 2 and far from the 6% of Model 1.

Model 7 (PCA3-MLP) gave R2 values on the training and test sets of 0.61 and 0.53, respectively (Table 2, Figures S7A & S7B), with RMSE values of 1.18 and 1.24. On the validation set, the R2 was 0.56 and the RMSE 1.26 (Table 2, Figure S7C). The percentage of error was around 11%, a little better than Model 5 but still worse than Model 1 (6%).

Model 8 (PCA4-MLP) gave R2 values on the training and test sets of 0.82 and 0.75, respectively (Table 2, Figures S8A & S8B), with RMSE values of 0.79 and 0.91. On the validation set, the R2 was 0.76 and the RMSE 0.93 (Table 2, Figure S8C). The percentage of error was around 8%, the best of the models developed with the PCA approach and similar in terms of performance to Model 3, but still worse than Model 1 (Bade-MLP).

Whatever the strategy used, the model offering the best performance is Model 1 (Bade-MLP), inherited from the literature review. Nevertheless, the no a priori approach yields two models (Models 3 and 8) with effective performance. Among the models developed with the PCA approach, Model 8 offers the best performance, followed by Models 5 and 7, with Model 6 the worst.

Optimization of the hyperparameters

The QSRR models were optimized using an MLP through a gridsearch process. Nevertheless, the number of neurons per hidden layer was set manually, following the recommendations of Geron (2017). Importantly, Geron notes that the once-common practice of sizing the hidden layers as a funnel, with an ever-decreasing number of neurons at each layer, is no longer standard; instead, all hidden layers can simply be given the same size, leaving only one hyperparameter to adjust instead of one per layer. Nonetheless, still according to Geron (2017), it is more useful to increase the number of layers rather than the number of neurons per layer. For this reason, the number of hidden layers explored by the gridsearch ranged from 1 to 5, irrespective of the QSRR. Once the number of neurons per hidden layer and the number of hidden layers are set, a large number of hyperparameters remain to be optimized, some of them more important than others, such as the activation function and the solver. The gridsearch for the activation function was therefore performed over the following functions: ReLU, tanh, and logistic. A gridsearch was also carried out to select the best solver among three possible choices (Adam, SGD and Lbfgs). The last hyperparameter optimized through the gridsearch was the alpha value, an L2 regularization parameter, which ranged between 0.01 and 100 (Table 2). The DNN architectures and the hyperparameters retained through the gridsearch for Models 1 to 8 are listed in Table 2. The number of layers ranged from 1 to 5; two of the three activation functions (ReLU and tanh) were retained, the logistic function never being selected by the gridsearch; and two of the three solvers (Adam and SGD) were used. Finally, despite the amplitude of the alpha range, only two alpha values were retained: 1 and 10.
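The search described above maps closely onto scikit-learn's MLPRegressor hyperparameters (activation relu/tanh/logistic, solver adam/sgd/lbfgs, L2 penalty alpha). A minimal sketch of such a gridsearch on synthetic data is shown below; the layer width of 10 neurons, the exact alpha grid points, the iteration budget, and the cross-validation settings are assumptions, since the paper only states that the width was fixed manually and that alpha ranged from 0.01 to 100.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

warnings.simplefilter("ignore")  # silence convergence warnings from short training runs

# Synthetic stand-in for a 16-descriptor QSRR dataset (illustrative only).
X, y = make_regression(n_samples=150, n_features=16, noise=5.0, random_state=0)

n_neurons = 10  # assumption: fixed manually, same width for every hidden layer (Geron, 2017)
param_grid = {
    # 1 to 5 hidden layers, all the same size.
    "hidden_layer_sizes": [(n_neurons,) * n for n in range(1, 6)],
    "activation": ["relu", "tanh", "logistic"],
    "solver": ["adam", "sgd", "lbfgs"],
    "alpha": [0.01, 0.1, 1, 10, 100],  # L2 regularization strength
}

search = GridSearchCV(
    MLPRegressor(max_iter=300, random_state=0),
    param_grid,
    cv=2,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

With this grid, only four hyperparameters are tuned per model, which keeps the search tractable even when repeated for each of the eight descriptor sets.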

Conclusions

We compared a literature review approach with a no a priori approach to select, through diverse strategies, the best set of molecular descriptors among 1545 MD for predicting, through a QSRR model, the RPLC retention times of 792 pesticides. The literature review approach yielded the best results when a DNN was used as the ML algorithm, with an R2 of 0.82 and an RMSE of 0.67 min (Model 1) on the validation set. However, it could be useful in future research to test other no a priori selection strategies to determine new MD datasets, and also to consider reducing the number of MD in order to simplify the models while maintaining good predictions.
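For reference, the summary statistics used throughout (R² and RMSE, here in minutes) can be computed in a few lines with scikit-learn; the retention-time values below are invented for illustration, and expressing the percentage of error relative to the retention-time span is an assumption about its definition.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical retention times (minutes): observed vs. predicted by a QSRR model.
rt_obs = np.array([2.1, 4.8, 7.3, 9.0, 11.6, 13.2])
rt_pred = np.array([2.4, 4.5, 7.9, 8.6, 11.1, 13.8])

r2 = r2_score(rt_obs, rt_pred)
rmse = mean_squared_error(rt_obs, rt_pred) ** 0.5
# "Percentage of error" expressed relative to the retention-time span (assumption).
error_pct = 100 * rmse / (rt_obs.max() - rt_obs.min())
print(f"R2={r2:.2f}  RMSE={rmse:.2f} min  error={error_pct:.0f}%")
```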

Declarations

Author contribution statement

Julien Parinet: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

This work was supported by grant ANR-19-CE21-0002.

Data availability statement

Data included in article/supplementary material/referenced in article.

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.