Abstract
Predicting the reversed-phase liquid chromatographic (RPLC) retention times of contaminants is an asset for solving food contamination issues. Developing quantitative structure-retention relationship (QSRR) models requires selecting the best molecular descriptors and machine-learning algorithms. In the present work, two main approaches were tested and compared: one based on an extensive literature review to select the best set of 16 molecular descriptors (MD), and a second using diverse strategies to select 16 MD among 1545. In both cases, a deep neural network (DNN) was optimized through a grid search.
Keywords: Deep neural network; Molecular descriptors; Pesticides; QSRR; Reversed-phase liquid chromatography; Selection of inputs
Year: 2021 PMID: 34950792 PMCID: PMC8671870 DOI: 10.1016/j.heliyon.2021.e08563
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. QSRR model development and evaluation of performances.
QSRR models selected from the literature review.
| References | Type of contaminant | Number of contaminants | MDs selected | Best machine learning algorithms used | RT max measured (min) | R2 test set | RMSE test set (min) | Percentage of error |
|---|---|---|---|---|---|---|---|---|
|  | Emerging contaminants | 1830 | LogD | SVM | 14.4 | 0.88 | 1.04 | 7% |
|  | Environmental contaminants | 97 | LogP | ACD | 40.8 | 0.92 | 2.66 | 6.5% |
|  | Emerging contaminants | 544 | nDB | MLP | 16.5 | 0.91 | 0.89 | 5.4% |
|  | Pharmaceuticals | 166 | nDB or nTB, nC or nO, nR04-nR09, UI, Hy, | GRNN | 23.2 | 0.88 | 1.39 | 5.9% |
|  | Veterinary drugs | 95 | ACDlogP | MLR | 9.3 | 0.95 | 0.62 | 6.6% |
|  | Environmental contaminants | 274 | logD, DBE | MLR | 14.7 | 0.76 | 1.36 | 9.2% |
|  | Pharmaceuticals | 133 | XlogP | MLR | 15.0 | 0.63 | 1.42 | 9.4% |
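The three performance columns of this table can be illustrated with a short sketch. The definitions of R2 and RMSE are the standard ones. The definition of "percentage of error" is not stated in this extract; the formula below (RMSE divided by the maximum measured RT) is an assumption, chosen because it reproduces the reported percentages to within rounding. The function names are illustrative only.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted retention times."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def percent_error(rmse_min, rt_max_min):
    """ASSUMPTION: percentage of error = 100 * RMSE / maximum measured RT."""
    return 100.0 * rmse_min / rt_max_min


# (RT max in min, RMSE in min, reported percentage) copied from the table rows.
table_rows = [
    (14.4, 1.04, 7.0),
    (40.8, 2.66, 6.5),
    (16.5, 0.89, 5.4),
    (23.2, 1.39, 5.9),
    (9.3, 0.62, 6.6),
    (14.7, 1.36, 9.2),
    (15.0, 1.42, 9.4),
]

for rt_max, err, reported in table_rows:
    computed = percent_error(err, rt_max)
    print(f"computed {computed:.2f}% vs reported {reported}%")
```

Normalizing RMSE by the run length makes models built on gradients of different durations (9.3 to 40.8 min here) easier to compare than the raw RMSE alone.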
logD is the measure of hydrophobicity for ionizable compounds.
CIC1 is the Complementary Information Content index (neighborhood symmetry).
SeigZ is the eigenvalue sum from a Z-weighted distance matrix of a hydrogen-depleted molecular graph.
RDF020p is the radial distribution function weighted by atomic polarizabilities.
AlogP is logP estimated by the Ghose–Crippen method.
LogP (or LogKow) is the logarithm of the ratio of the concentrations of the test substance in octanol and in water. This value indicates the hydrophilic or hydrophobic (lipophilic) character of a molecule.
defined as the surface sum over all polar atoms or molecules, primarily oxygen and nitrogen, also including their attached hydrogen atoms.
is a measure of the total polarizability of a mole of a substance.
the number of H-bond donor groups, a descriptor of the H-bonding property.
the number of H-bond acceptor groups, a descriptor of the H-bonding property.
nDB is the number of double bonds.
nTB is the number of triple bonds.
nC is the number of carbon atoms.
nO is the number of oxygen atoms.
nR04-nR09 are the numbers of 4- to 9-membered rings.
UI is the unsaturation index.
Hy is the hydrophilic factor.
Moriguchi logP.
number of benzene groups.
equilibrium constant of the dissociation reaction of an acid species in acid-base reactions.
ACDlogP (molecular properties): octanol-water partitioning coefficient.
ALOGP2 (molecular properties): Ghose-Crippen octanol-water coefficient, squared.
Ib (information indices): information bond index.
BEHp1 (Burden eigenvalue descriptors): highest eigenvalue n. 1 of the Burden matrix, weighted by atomic polarizabilities.
BEHp2 (Burden eigenvalue descriptors): highest eigenvalue n. 2 of the Burden matrix, weighted by atomic polarizabilities.
GATS1m (2D autocorrelation descriptors): Geary autocorrelation, lag 1, weighted by atomic masses.
GATS2m (2D autocorrelation descriptors): Geary autocorrelation, lag 2, weighted by atomic masses.
DBE, the double-bond equivalent descriptor, is the number of unsaturations present in an organic molecule.
the water solubility described by the logarithm of water solubility in mg/L at 25 °C.
XlogP is a constitutional descriptor that describes hydrophobic/hydrophilic properties.
BCUTp.1h is the BCUT descriptor/nlow highest polarizability weighted BCUTS.
AATS1i is the autocorrelation descriptor/average Broto-Moreau autocorrelation - lag 1/weighted by first ionization potential.
AATS3i is the autocorrelation descriptor/average Broto-Moreau autocorrelation - lag 3/weighted by first ionization potential.
GATS1e is the autocorrelation descriptor/Geary autocorrelation - lag 1/weighted by Sanderson electronegativities.
AATSC0p is the autocorrelation descriptor/average centered Broto-Moreau autocorrelation - lag 0/weighted by polarizabilities.
ETA_EtaP_B is the extended topochemical atom descriptor/branching index EtaB relative to molecular size.
AATS4i is the autocorrelation descriptor/average Broto-Moreau autocorrelation - lag 4/weighted by first ionization potential.
AATS5i is the autocorrelation descriptor/average Broto-Moreau autocorrelation - lag 5/weighted by first ionization potential.
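Two of the descriptors defined in these notes follow directly from simple formulas, which the sketch below illustrates. LogP is computed as stated in the notes, as the base-10 logarithm of the octanol/water concentration ratio. For DBE, the notes only say that it counts unsaturations; the closed-form expression used here, C - (H + X)/2 + N/2 + 1, is the usual textbook formula for C/H/N/O/halogen compounds and is an assumption about how the descriptor was computed. Function names and the concentration values are illustrative.

```python
import math


def log_p(conc_octanol, conc_water):
    """logP = log10 of the octanol/water concentration ratio (same units)."""
    return math.log10(conc_octanol / conc_water)


def double_bond_equivalent(n_c, n_h, n_n=0, n_x=0):
    """DBE = C - (H + X)/2 + N/2 + 1 (ASSUMPTION: textbook formula).

    n_x is the number of halogen atoms; oxygen does not contribute.
    """
    return n_c - (n_h + n_x) / 2 + n_n / 2 + 1


# A compound 100 times more concentrated in octanol than in water.
print(log_p(100.0, 1.0))  # 2.0

# Benzene, C6H6: one ring plus three double bonds.
print(double_bond_equivalent(6, 6))  # 4.0

# Atrazine, C8H14ClN5: one triazine ring with three C=N bonds.
print(double_bond_equivalent(8, 14, n_n=5, n_x=1))  # 4.0
```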
Performances of QSRR models applied to the pesticide dataset.
| N° Model | Number of molecular descriptors | Name of the model | Internal set, training R2 | Internal set, training RMSE | Internal set, test R2 | Internal set, test RMSE | Validation set R2 | Validation set RMSE | Validation set percentage of error | DNN neurons per hidden layer | DNN activation function | DNN solver | DNN alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 |  | 0.95 | 0.43 | 0.90 | 0.63 | 0.82 | 0.67 | 6% | 16-16-16-16-16 | ReLU | Adam | 10 |
| 2 | 16 |  | 0.60 | 1.19 | 0.50 | 1.27 | 0.49 | 1.36 | 12% | 16 | tanh | SGD | 1 |
| 3 | 16 |  | 0.79 | 0.86 | 0.79 | 0.83 | 0.78 | 0.88 | 8% | 16-16 | ReLU | SGD | 10 |
| 4 | 16 |  | 0.69 | 1.04 | 0.60 | 1.15 | 0.63 | 1.16 | 10% | 16-16-16-16-16 | ReLU | SGD | 10 |
| 5 | 16 |  | 0.75 | 0.94 | 0.61 | 1.12 | 0.64 | 1.14 | 10% | 16 | tanh | Adam | 1 |
| 6 | 16 |  | 0.42 | 1.44 | 0.34 | 1.47 | 0.38 | 1.50 | 13% | 16 | tanh | Adam | 1 |
| 7 | 16 |  | 0.61 | 1.18 | 0.53 | 1.24 | 0.56 | 1.26 | 11% | 16-16-16 | ReLU | SGD | 10 |
| 8 | 16 |  | 0.82 | 0.79 | 0.75 | 0.91 | 0.76 | 0.93 | 8% | 16-16-16-16 | ReLU | SGD | 10 |
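The hyperparameter columns of the table above (neurons per hidden layer, activation function, solver, alpha) are the dimensions explored by the grid search. The following is a minimal sketch of such a search, not the authors' implementation. The parameter names mirror the arguments of scikit-learn's `MLPRegressor`, which is an assumption since this extract does not name the library. The candidate values are only those that appear in the table, not the full grid that was searched. `toy_score` is a hypothetical stand-in with arbitrary values; in practice it would train a network with the given settings and return, for example, its cross-validated RMSE.

```python
from itertools import product

# ASSUMPTION: candidate values are limited to those visible in the table.
param_grid = {
    "hidden_layer_sizes": [(16,) * depth for depth in range(1, 6)],
    "activation": ["relu", "tanh"],
    "solver": ["adam", "sgd"],
    "alpha": [1, 10],
}


def iter_grid(grid):
    """Yield one dict per combination of the candidate values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score_fn):
    """Return (best_params, best_score); a lower score is better (e.g. RMSE)."""
    best_params, best_score = None, float("inf")
    for params in iter_grid(grid):
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for 'train the DNN and return its RMSE'.

    The numbers are arbitrary and unrelated to the reported results.
    """
    penalty = abs(len(params["hidden_layer_sizes"]) - 3)
    penalty += 0.5 if params["activation"] != "relu" else 0.0
    penalty += 0.25 if params["solver"] != "adam" else 0.0
    penalty += 0.1 if params["alpha"] != 10 else 0.0
    return penalty


best_params, best_score = grid_search(param_grid, toy_score)
print(len(list(iter_grid(param_grid))), "combinations evaluated")
print(best_params, best_score)
```

An exhaustive grid grows multiplicatively with each added hyperparameter (5 x 2 x 2 x 2 = 40 combinations here), which is why the candidate lists are usually kept short.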