
Predicting the Redox Potentials of Phenazine Derivatives Using DFT-Assisted Machine Learning.

Siddharth Ghule^1,2, Soumya Ranjan Dash^1,2, Sayan Bagchi^1,2, Kavita Joshi^1,2, Kumar Vanka^1,2.

Abstract

This study investigates four machine-learning (ML) models to predict the redox potentials of phenazine derivatives in dimethoxyethane using density functional theory (DFT). A small data set of 151 phenazine derivatives having only one type of functional group per molecule (20 unique groups) was used for the training. Prediction accuracy was improved by a combined strategy of feature selection and hyperparameter optimization, using the external validation set. Models were evaluated on the external test set containing new functional groups and diverse molecular structures. High prediction accuracies of R² > 0.74 were obtained on the external test set. Despite being trained on the molecules with a single type of functional group, models were able to predict the redox potentials of derivatives containing multiple and different types of functional groups with good accuracies (R² > 0.7). This type of performance for predicting redox potential from such a small and simple data set of phenazine derivatives has never been reported before. Redox flow batteries (RFBs) are emerging as promising candidates for energy storage systems. However, new green and efficient materials are required for their widespread usage. We believe that the hybrid DFT-ML approach demonstrated in this report would help in accelerating the virtual screening of phenazine derivatives, thus saving computational and experimental costs. Using this approach, we have identified promising phenazine derivatives for green energy storage systems such as RFBs.


Year: 2022 PMID: 35449912 PMCID: PMC9017108 DOI: 10.1021/acsomega.1c06856

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Today, ∼85% of the world’s energy demand is being fulfilled by fossil fuels.[1,2] The limited supply of fossil fuels and the ever-increasing population have raised concerns that we might run out of fossil fuels sooner than expected.[1,3] Furthermore, electricity production from fossil fuels is one of the major factors responsible for greenhouse gas emissions.[4] Humanity therefore faces the twin challenges of meeting increased energy demand and reducing the environmental impact associated with energy production. In the past decades, investment and research efforts in green technology have increased to overcome these challenges.[5] Significant progress has already been made to access renewable energy sources.[6,7] Renewable energy sources, being intermittent, require efficient energy storage.[4] Improvements in energy storage technology would not only help in the adoption of renewable energy but also help in making efficient use of non-renewable energy sources. Historically, it has been more expensive to store energy than to expand energy generation for handling increased demand.[8] Thus, grid systems employed today are likely to fail when additional energy cannot be generated during peak demand. The massive Texas Blackout in February 2021 is an example of such a failure.[9] This failure suggests that an efficient energy storage technology is urgently required.
Unfortunately, only 1.0% of the energy consumed worldwide can be stored with the energy storage technology accessible today.[10] Furthermore, the contribution of electrochemical batteries to energy storage capacity is less than 2.0%, even though most of the devices we use every day include batteries.[8,10] Li-ion batteries are widely used today due to their high energy density, high specific energy, long cycle life, and fast charge–discharge cycle.[4,8,11] Unfortunately, Li-ion batteries suffer from high production costs, safety issues, and high environmental impact.[2,12] Redox flow batteries (RFBs) have the potential to overcome the drawbacks of Li-ion batteries, owing to their high storage capacity, independent control over storage capacity and power, fast responsiveness, ease of scaling, room-temperature operation, cost-effectiveness, high round trip efficiency, safety, and lower environmental impact.[13−26] RFBs are increasingly being used as energy storage devices in renewable energy systems, thereby helping in the adoption of green energy.[15,22] A schematic diagram of a typical RFB is shown in Figure 1. The RFB consists of two storage tanks containing cathode and anode redox-active species dissolved in an electrolyte solution. The electrolyte solution in the positive and negative compartments is termed catholyte and anolyte, respectively. These storage tanks are connected to an electrochemical cell (or current collector) via pumps. The electrochemical cell consists of porous electrodes separated by an ion-selective membrane. During operation, electrolytes containing redox-active species are pumped to the electrochemical cell, where they undergo oxidation or reduction depending on the charge/discharge cycle. Then, electrolytes are circulated back to their storage tanks.[13,24] So far, transition metal-based RFBs (such as vanadium, iron, and chromium) have found some commercial success.
However, their widespread adoption has been limited mainly due to the high production cost, toxicity, and cell component corrosion associated with the use of transition-metal salts.[27,28] Therefore, RFBs containing organic redox-active species are being heavily investigated due to their low production cost, access to a massive space of electroactive compounds, and low environmental impact.[28,29] Many organic compounds such as quinones, viologens, flavins, thiazines, imides, and their derivatives have been investigated as redox-active species in both aqueous and non-aqueous RFBs.[27,30,31] Of the two, non-aqueous RFBs offer a larger operating voltage.[30] Recently, phenazine derivatives have been shown to be promising redox-active candidates in non-aqueous RFBs, and recent reports have revealed why. Romadina et al. synthesized phenazine derivatives having significantly negative redox potential.[32] RFBs require anolytes with highly negative redox potentials. Romadina et al. showed that the non-aqueous RFB based on the synthesized phenazine derivative is capable of achieving a potential of 2.3 V, high capacities, >95% Coulombic efficiency, and good charge–discharge cycling stability after the initial 20 cycles. Mavrandonakis and co-workers, in their computational investigation, reported the most negative redox-active candidate based on phenazine for non-aqueous RFBs.[27] They showed that tetra-amino-phenazine has a 140 mV more negative potential than N-methylphthalimide (MePht), which has one of the most negative redox potentials reported so far in RFBs.[33] They also proposed an all-phenazine RFB reaching a high potential of 2.83 V. Furthermore, the redox potential of phenazine derivatives can be tuned easily by adding appropriate electron-donating or electron-withdrawing functional groups. The synthesis of phenazine derivatives is also more economical than mining transition metals.
Therefore, phenazine derivatives are currently being investigated as candidates for novel redox-active species.[27,32]

Figure 1

Schematic diagram of a typical RFB.

These investigations remain primarily experimental. Unfortunately, the vast chemical space offered by organic compounds cannot be explored using experimental procedures. Quantum mechanical density functional theory (DFT) computations have been used heavily in materials science research owing to their high accuracy, but they are slow and cannot screen millions of molecules in a reasonable amount of time. Therefore, a fast and reliable method to screen millions of compounds without compromising accuracy is required. In this regard, machine-learning (ML) algorithms have shown excellent predictive accuracies along with short development and prediction times.[34−38] Therefore, ML models have been used extensively to screen millions of molecules in materials science and drug discovery.[39−43] ML models generally require a large amount of data for accurate predictions. When the quantity of data is limited, feature engineering is employed to generate the most informative features. These features are expected to capture the appropriate molecular information necessary to predict the target quantity. Feature engineering requires domain knowledge, relying on having access to experts.[44−46] In small data sets, DFT-based or experimentally determined features have been used due to their high accuracy. However, some reports also explore simple features based on the molecular structure.[47−52] In this work, we investigated four ML models to predict the redox potentials of phenazine derivatives in the dimethoxyethane (DME) solvent.
The training-set containing 151 phenazine derivatives was obtained from the previously reported DFT study having 189 phenazine derivatives with only one type of functional group per molecule (20 unique functional groups).[27] Molecular features were computed from the optimized neutral structures using the RDKit python library.[53] Model accuracy was improved through feature selection and hyperparameter optimization using the external validation set. Then, the model performance was assessed on the external test-set compiled from the literature consisting of new functional groups, multiple functional groups, and diverse structures. Their redox potential was computed using the DFT. The trained models were employed to predict the redox potentials of randomly generated phenazine derivatives with multiple functional groups. We also carried out feature importance analysis and discussed the structure–functional relationship of phenazine derivatives. Finally, promising candidates were identified for the anolyte from the external test-set and multiple functional group test-sets.

Materials and Methods

Computational Details

The redox potentials of phenazine derivatives were computed using the DFT workflow described in the paper by Mavrandonakis et al.[27] All the DFT calculations were performed with Gaussian 09 software.[54] Geometry optimization of the neutral and reduced forms of phenazine and its derivatives was carried out in the gas phase at the B3LYP/6-31+G(d,p) level of theory.[55−58] Harmonic frequency analysis was performed for all the structures to confirm them as minima. Solvation effects of DME were incorporated during the single-point calculations using the M06-2X functional,[59] by employing the SMD solvation model (details in the Supporting Information).[60,61] The term “Redox Potential” in this report corresponds to the “Reduction Potential” with respect to the unsubstituted phenazine molecule (i.e., the parent phenazine). The redox potentials of phenazine derivatives were computed using equations defined in terms of the following quantities: PZ symbolizes the parent phenazine; XPZ represents the substituted phenazine molecules; $E^{0}_{1(\mathrm{ref})}$ is the reported redox potential of parent PZ;[27] $\Delta G_{\mathrm{rxn,sol}}$ corresponds to the free energy change of the reaction; $F$ is the Faraday constant; $n$ is the number of electrons involved in the reduction; and $G^{0}_{\mathrm{sol}}$ represents the final composite free energy of individual species, which was calculated by adding the free energy contribution computed at the B3LYP level of theory, $G^{\mathrm{B3LYP}}_{\mathrm{therm,gas}}$, to the single-point energy calculated at the M06-2X level of theory, $E^{\mathrm{M06\text{-}2X}}_{\mathrm{sol}}$.
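The equations themselves did not survive in this copy. A reconstruction that is consistent with the quantities defined above is sketched below. It assumes that each reduction is referenced to the parent phenazine through an electron-exchange reaction; the exact form of the reaction and the radical-anion notation are assumptions, not taken from the source.

```latex
\begin{align}
\mathrm{XPZ} + \mathrm{PZ}^{\bullet -} &\rightarrow \mathrm{XPZ}^{\bullet -} + \mathrm{PZ} \\
\Delta G_{\mathrm{rxn,sol}} &= \left[ G^{0}_{\mathrm{sol}}(\mathrm{XPZ}^{\bullet -}) + G^{0}_{\mathrm{sol}}(\mathrm{PZ}) \right]
  - \left[ G^{0}_{\mathrm{sol}}(\mathrm{XPZ}) + G^{0}_{\mathrm{sol}}(\mathrm{PZ}^{\bullet -}) \right] \\
E^{0} &= E^{0}_{1(\mathrm{ref})} - \frac{\Delta G_{\mathrm{rxn,sol}}}{nF} \\
G^{0}_{\mathrm{sol}} &= E^{\mathrm{M06\text{-}2X}}_{\mathrm{sol}} + G^{\mathrm{B3LYP}}_{\mathrm{therm,gas}}
\end{align}
```

Under this sign convention, a substituent that stabilizes the reduced form makes $\Delta G_{\mathrm{rxn,sol}}$ negative and therefore raises $E^{0}$ above the reference value.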

Data Generation

Training-Set and Internal Test-Set

These data sets were obtained from work reported by Mavrandonakis and co-workers.[27] In their report, the redox potentials of 189 phenazine derivatives were computed using DFT in the DME solvent. These DFT redox potentials were used as the target property in this work during training and testing. Twenty unique electron-withdrawing and electron-donating functional groups were present in the data set [−N(CH3)2, −NH2, −OH, −OCH3, −P(CH3)2, −SCH3, −SH, −CH3, −C6H5, −CH=CH2, −F, −Cl, −CHO, −COCH3, −CONH2, −COOCH3, −COOH, −CF3, −CN, and −NO2]. It should be noted that phenazine derivatives in this data set contain only one type of functional group per molecule. The optimized 3D structures of the derivatives in neutral and anionic states were also provided. However, only neutral structures were used in this study. Not all compounds were supplied with their neutral structure; those compounds were modeled, and their optimized structures were added to the data set. Next, 208 different types of features were generated using the RDKit Python library.[53] The list of all features is given in Table S1 of the Supporting Information. The features were scaled using the “StandardScaler” class of the scikit-learn library,[62] removing the mean and scaling each feature to unit variance. Finally, the whole data set was shuffled and split randomly into a training-set and test-set in an 8:2 ratio (151 samples in the training-set and 38 samples in the test-set). A few phenazine derivatives from the training-set/internal test-set are shown in Table 1.
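The scaling and splitting steps described above can be sketched as follows. This is a minimal illustration on synthetic numbers, not the authors' script: the random matrix stands in for the RDKit descriptor table, and the random seed is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the descriptor matrix (189 molecules x 208 features).
X = rng.normal(loc=5.0, scale=3.0, size=(189, 208))
# Placeholder target values playing the role of DFT redox potentials (V).
y = rng.normal(loc=-1.7, scale=0.3, size=189)

# Remove the mean and scale each feature to unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Shuffle and split 8:2; for 189 samples this gives 151 / 38.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)
```

The sketch follows the order given in the text (scale the whole data set, then split). Fitting the scaler on the training portion only is a common alternative when leakage from the test portion is a concern.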

Table 1

Representative Structures from Training-Set/Internal Test-Set^a

^a Mol IDs were assigned to identify derivatives from the corresponding data set.

External Test-Set

This data set was compiled from different reports studying various properties of phenazine derivatives.[63−67] Their redox potentials were computed using DFT and used as a target property during testing. We gathered a total of 30 phenazine derivatives. Derivatives containing five or more substituted rings were removed. Also, derivatives having drastically different neutral and anion structures were removed. In the end, 22 diverse phenazine derivatives with multiple types of functional groups remained in the external test-set. Table 2 shows some of the structures from this data set. It can be seen that this data set contains unique and different structures compared to the training-set.

Table 2

Representative Structures from the External Test-Set^a

^a Mol IDs were assigned to identify derivatives from the corresponding data set.

Multiple Functional Group Test-Sets

This data set contains two test-sets: (i) a two functional group test-set and (ii) a three functional group test-set. These test-sets were generated by randomly choosing the position and the type of the functional group from this list [−N(CH3)2, −NH2, −OH, −OCH3, −P(CH3)2, −SCH3, −SH, −CH3, −C6H5, −CH=CH2, −F, −Cl, −CHO, −COCH3, −CONH2, −COOCH3, −COOH, −CF3, −CN, and −NO2]. Twenty derivatives having two different types of functional groups per molecule were generated for the two functional group test-set. Similarly, 20 derivatives having three different types of functional groups per molecule were generated for the three functional group test-set. Their redox potentials were computed using DFT and used as a target property during testing. Five derivatives from each of the two and three functional group test-sets were removed to form an external validation set. Thus, the final size of each test-set was reduced from 20 to 15. In this report, the term “multiple” refers to derivatives containing more than one functional group of different types. Similarly, the terms “two functional groups” and “three functional groups” refer to the derivatives containing two different types of functional groups and three different types of functional groups per molecule, respectively. A few representative structures from these test-sets are shown in Table 3.
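The random generation step can be pictured with the short sketch below. It is illustrative only and rests on assumptions not stated in the source: the eight ring C–H positions of phenazine (standard numbering) are taken as the substitution sites, each chosen group type is attached exactly once, and symmetry-equivalent molecules are not merged. The function names are hypothetical.

```python
import random

# The 20 functional groups listed in the text.
GROUPS = ["N(CH3)2", "NH2", "OH", "OCH3", "P(CH3)2", "SCH3", "SH", "CH3",
          "C6H5", "CH=CH2", "F", "Cl", "CHO", "COCH3", "CONH2", "COOCH3",
          "COOH", "CF3", "CN", "NO2"]
# Assumption: substitutable ring positions of phenazine.
POSITIONS = [1, 2, 3, 4, 6, 7, 8, 9]


def random_derivative(n_groups, rng):
    """Return {position: group} with n_groups distinct groups on distinct positions."""
    groups = rng.sample(GROUPS, n_groups)
    positions = rng.sample(POSITIONS, n_groups)
    return dict(zip(positions, groups))


def generate_set(n_groups, size, seed=0):
    """Generate `size` unique substitution patterns with `n_groups` group types each."""
    rng = random.Random(seed)
    seen, result = set(), []
    while len(result) < size:
        mol = random_derivative(n_groups, rng)
        key = tuple(sorted(mol.items()))
        if key not in seen:  # skip exact duplicates
            seen.add(key)
            result.append(mol)
    return result


two_group_set = generate_set(2, 20)
three_group_set = generate_set(3, 20)
print(two_group_set[0], three_group_set[0])
```

Each dictionary would still have to be turned into a 3D structure and optimized with DFT before it could serve as a test molecule.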

Table 3

Representative Structures from Multiple Functional Group Test-Sets^a

^a Mol IDs were assigned to identify derivatives from the corresponding data set.

External Validation Set

An external validation set of 10 phenazine derivatives was compiled from two and three functional group test-sets. Five derivatives from two functional group test-set and five derivatives from three functional group test-set were selected. Their redox potentials were computed using DFT and used as a target property. This validation set does not come from the training-set. Therefore, it is termed as an external validation set. It was used for feature selection and hyperparameter optimization. External validation set improves generalization by transferring knowledge from the test-set to models through hyperparameters.

Hyperparameter Optimization

Hyperparameters of the models were optimized using the “GridSearchCV” class of the scikit-learn library.[62] During hyperparameter optimization, models were trained on the training-set and evaluated on the external validation set. Mean squared error (MSE) was used as an evaluation metric for hyperparameter optimization. The grid of hyperparameters for each model is given in Table S2 of Supporting Information. The parameter grid was adjusted manually.
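One way to make “GridSearchCV” train on the training-set and score on a fixed external validation set is to pass a “PredefinedSplit” as the cross-validation object. The sketch below shows that pattern on synthetic data. Whether the authors used this mechanism is not stated, and the parameter grid is hypothetical (the actual grids are in Table S2).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-ins: 151 training molecules and 10 external validation molecules.
X_train, y_train = rng.normal(size=(151, 20)), rng.normal(size=151)
X_val, y_val = rng.normal(size=(10, 20)), rng.normal(size=10)

# -1 marks rows that are only ever used for training; 0 marks the single validation fold.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
test_fold = np.r_[np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)]
split = PredefinedSplit(test_fold)

# Hypothetical grid, for illustration only.
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_mean_squared_error",  # MSE as the selection metric
    cv=split,
    refit=False,  # do not refit on training + validation rows combined
)
search.fit(X_all, y_all)

# Refit the chosen configuration on the training-set alone.
best_model = SVR(kernel="rbf", **search.best_params_).fit(X_train, y_train)
print(search.best_params_)
```

Setting `refit=False` keeps the validation molecules out of the final fit, which matches the description of training on the training-set only.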

ML Models

The following four ML models were investigated in this study. These models were chosen due to their ability to generalize from small data sets. Models were implemented with the scikit-learn Python library.[62] First, models were trained on the training-set containing all 208 features, followed by hyperparameter optimization. Then, the models were re-trained on different subsets of features to identify the set of features having the highest average performance on the external validation set. Once the optimum features were identified, hyperparameter optimization was performed with the selected features to improve the model performance further.

Automatic Relevance Determination Regression (ARDR)

This is a probabilistic model related to the sparse Bayesian learning (SBL) framework. It assumes axis-parallel, elliptical Gaussian distribution for each coefficient. The precision of each Gaussian distribution is drawn from the prior distribution (gamma distribution); therefore, it can lead to sparser coefficients. Thus, it is an effective tool to remove irrelevant features.[68,69]

Gaussian Process Regression (GP)

It is a nonparametric Bayesian model. The nonparametric Bayesian model provides the probability distribution of parameters over all possible functions that fit the data. The prior in a Gaussian process is specified on the function space. Gaussian process prior is a multivariate normal distribution whose mean is obtained from the data, and covariance is specified using the kernel function. The hyperparameters of the kernel are optimized during the training.[70,71] We used a combination of WhiteKernel and RBF kernel. WhiteKernel is used for specifying the noise level, and RBF kernel is a very popular kernel used in many algorithms.

Kernel Ridge Regression (KRR)

It is the extension of ridge regression with the kernel trick. In ridge regression, a linear model is learned with l2-norm regularization. Using the kernel trick, KRR learns a linear function in the high dimensional non-linear space without actually transforming the data.[72]

Support Vector Regression (SVR)

This model is the regression form of the support vector machine (SVM), a popular algorithm for classification tasks. Analogous to the SVM, SVR depends on a subset of the training data and ignores the points whose prediction is close to their true value. SVR also utilizes the kernel trick and learns a hyperplane in the high dimensional space.[73]
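For orientation, the four model families map onto scikit-learn estimators roughly as sketched below. The GP kernel (RBF plus WhiteKernel) follows the text; the kernels and hyperparameter values shown for KRR and SVR are assumptions, and the data are synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ARDRegression
from sklearn.svm import SVR

models = {
    "ARDR": ARDRegression(),
    # RBF models the smooth signal; WhiteKernel models the noise level.
    "GP": GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                   normalize_y=True, random_state=0),
    "KRR": KernelRidge(kernel="rbf", alpha=1.0),    # assumed kernel and alpha
    "SVR": SVR(kernel="rbf", C=1.0, epsilon=0.1),   # assumed kernel, C, epsilon
}

# Small synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X[:, 0] - 0.5 * X[:, 1] + 0.05 * rng.normal(size=60)

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X[:3])
    print(name, predictions[name])
```

All four estimators share the same `fit`/`predict` interface, which is what makes it straightforward to compare them on identical feature subsets.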

Evaluation Metrics

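The following metrics were used for evaluating the model performance. In the formulas below, $N$ denotes the number of data points, $\hat{y}_i$ denotes the predicted value of the $i$th sample, and $y_i$ denotes the corresponding true value.

Coefficient of determination (R²):

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}, \quad \text{where} \quad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$

Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

The use of the terms “Accuracy” and “Performance” in this report is contextual and refers to one or more of the metrics defined above.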
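The three metrics are simple enough to compute by hand. The sketch below uses the standard definitions of R², MSE, and MAE in plain Python; the example values are hypothetical and the function names are chosen for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical redox potentials in volts (true vs predicted).
y_true = [-1.9, -1.7, -1.5, -1.3]
y_pred = [-1.8, -1.7, -1.6, -1.3]
print(r2(y_true, y_pred), mse(y_true, y_pred), mae(y_true, y_pred))
```

Because R² is normalized by the spread of the true values, it can look much worse on a small, narrow test-set than MSE or MAE would suggest, which is worth keeping in mind when comparing sets of 10 to 38 molecules.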

Feature Selection

As the number of features obtained from the RDKit library exceeded the size of the training-set (208 features vs 151 samples), it was necessary to implement a feature selection strategy. It has been observed that a training-set containing more features than data points leads to overfitting.[30] Feature selection was implemented using the “SelectKBest” class of the scikit-learn library.[74] The parameter “k” of the “SelectKBest” class was obtained by evaluating the average performance of models on the external validation set at different values of “k.” First, models were trained on the training-set containing all features, followed by hyperparameter optimization. Then, the models were re-trained on the subsets of features selected using the “SelectKBest” class at different values of “k.” These values for “k” were tested: 50, 75, 100, 125, 150, and 208. The average model performance at different values of “k” on the external validation set is shown in Table 4. It can be seen that the models trained on 100 selected features show the highest average performance in terms of R². Therefore, these 100 features were selected for the subsequent analysis. The models trained on 100 selected features were further improved through hyperparameter optimization.
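The sweep over “k” can be sketched with “SelectKBest” as below. The scoring function is not named in the text, so `f_regression` is an assumption here, and the data are synthetic stand-ins with the same shapes as the training-set and external validation set.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Synthetic stand-ins: 151 training molecules, 10 validation molecules, 208 features.
X_train = rng.normal(size=(151, 208))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + 0.1 * rng.normal(size=151)
X_val = rng.normal(size=(10, 208))

selected = {}
for k in (50, 75, 100, 125, 150, 208):
    # Assumed score function: univariate F-test between each feature and the target.
    selector = SelectKBest(score_func=f_regression, k=k).fit(X_train, y_train)
    selected[k] = selector.get_support(indices=True)

# Apply one chosen selection to both sets before re-training the models.
selector_100 = SelectKBest(score_func=f_regression, k=100).fit(X_train, y_train)
X_train_100 = selector_100.transform(X_train)
X_val_100 = selector_100.transform(X_val)
print(X_train_100.shape, X_val_100.shape)
```

In the workflow described above, each value of “k” would be followed by re-training the four models and averaging their validation metrics; that loop is omitted here for brevity.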

Table 4

Average Model Performance on External Validation Set at Different Values of “k”

	values of “k”
performance metric	50	75	100	125	150	208
R²	0.45	0.42	0.57	0.55	0.54	0.54
MSE	0.02	0.02	0.02	0.02	0.02	0.02
MAE	0.12	0.12	0.10	0.10	0.10	0.10

Feature Importance Analysis

Feature importance analysis was performed using the technique known as permutation importance. In this technique, values of the feature to be assessed are randomly shuffled (permuted). Then, prediction accuracy is computed on the shuffled data set. Shuffling feature values is equivalent to replacing the feature with noise, thereby removing its information from the data set. Therefore, the model is expected to perform poorly on the shuffled data set if the feature is important. The degree of importance depends on the size of the change in accuracy. This technique does not re-train the model; therefore, a trained model is required. The permutation importance was computed using the “permutation_importance” function of the scikit-learn library and the training-set.[75] This procedure was repeated 100 times to obtain reliable estimates. The feature importance scores were rescaled between 0 and 1. The mean and standard deviation of the feature scores were reported. The mean feature score was used for the ranking of individual features. The terms “Feature” and “Descriptor” are used interchangeably in this report.
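A minimal sketch of this procedure with scikit-learn is shown below, using synthetic data and a kernel ridge model as a stand-in. The min-max rescaling is one plausible reading of “rescaled between 0 and 1” and is an assumption.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
# Synthetic data: feature 0 matters most, feature 1 a little, the rest not at all.
X = rng.normal(size=(151, 6))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=151)

# Permutation importance needs an already trained model.
model = KernelRidge(kernel="linear", alpha=1e-3).fit(X, y)

# Shuffle each feature 100 times; the default score is the estimator's R².
result = permutation_importance(model, X, y, n_repeats=100, random_state=0)

# Assumed rescaling: min-max to the interval [0, 1].
means = result.importances_mean
span = means.max() - means.min()
scaled_mean = (means - means.min()) / span
scaled_std = result.importances_std / span

ranking = np.argsort(scaled_mean)[::-1]  # most important feature first
print(ranking, scaled_mean.round(3))
```

Because the scores here are computed on the data the model was trained on, they describe what the fitted model relies on rather than what generalizes; computing them on held-out data is the usual cross-check.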

Results and Discussion

Test-Set Performance

We assessed the generalizability of the trained models (i.e., performance on the unseen data) using internal and external test-sets. Please refer to the Data Generation section for the preparation of internal and external test-sets. As the internal test-set comes from the same source, it is very similar to the training-set and contains derivatives with only one type of functional group per molecule. However, the external test-set is compiled from multiple sources; therefore, it has very diverse phenazine derivatives with different types of functional groups. It also contains functional groups and structures not present in the training-set (e.g., −NHPh, −Br, and extended conjugation). Figure 2 shows the performance on the internal test-set, and Figure 3 shows the performance on the external test-set. It can be seen that all models have excellent accuracy on the internal test-set (R² > 0.98) and high accuracy on the external test-set (R² > 0.74). The GP model achieved the highest R² of 0.89 on the external test-set. A deeper analysis, presented in the section on issues with the GP model, revealed that GP is not a stable model, whereas the relatively lower-performing models KRR (R² = 0.83) and SVR (R² = 0.85) are more stable. Therefore, one should be careful while using the highest-performing model, and the stability of the model should also be considered. The values of performance metrics on internal and external test-sets are shown in Table 5. Such a performance on the external test-set is surprising as models were trained on phenazine derivatives having only one type of functional group. These results show that ML models are capable of generalizing from a very small and simple data set.

Figure 2

Plots showing ML predictions on internal test-set (y-axis) vs DFT redox potentials (x-axis). Gray dashed line corresponds to the perfect predictions.

Figure 3

Plots showing ML predictions on external test-set (y-axis) vs DFT redox potentials (x-axis). Gray dashed line corresponds to the perfect predictions.

Table 5

Values of Performance Metrics on Internal and External Test-Sets^a

	Internal test-set			External test-set
Model name	R²	MSE	MAE	R²	MSE	MAE
ARDR	0.98	0.01	0.06	0.74	0.06	0.18
GP	0.99	0.01	0.05	0.89	0.03	0.11
KRR	0.98	0.01	0.05	0.83	0.04	0.14
SVR	0.98	0.01	0.07	0.85	0.03	0.13

^a Numbers were rounded to two decimal places.


Prediction on Multiple Functional Group Test-Sets

Next, we assessed the model performance on the phenazine derivatives substituted with different types of functional groups per molecule. These test-sets were generated randomly; please refer to the Multiple Functional Group Test-Sets section for the generation of this data set. Figures 4 and 5 show the performance on the derivatives containing two and three different functional groups, respectively. It can be seen that the models performed reasonably well (R² > 0.7) even though molecules used for the training had only one type of functional group per molecule. In particular, the GP model achieved the highest performance of R² = 0.82 on the two functional group test-set, whereas automatic relevance determination regression (ARDR) achieved the highest performance of R² = 0.82 on the three functional group test-set. A deeper analysis of GP and ARDR, presented in the sections on issues with these two models, suggests that GP and ARDR are not very reliable models. Although KRR and SVR have relatively lower performance, they are more reliable. Therefore, one should be careful while using a high-performing model, and the model’s reliability and stability should also be considered. Nevertheless, these results again show the surprising generalization power of ML models.

Figure 4

Plots showing ML predictions on two functional group test-set (y-axis) vs DFT redox potentials (x-axis). Gray dashed line corresponds to the perfect predictions.

Figure 5

Plots showing ML predictions on three functional group test-set (y-axis) vs DFT redox potentials (x-axis). Gray dashed line corresponds to the perfect predictions.

Furthermore, we added all 15 derivatives from the two functional group test-set to the training-set and re-trained the models on this new data set of 166 derivatives. The predictive performance of the models trained on this combined data set was assessed on the same data set of 15 derivatives containing three different types of functional groups. The results of this analysis are shown in Figure 6. It can be seen that the model performance has improved with the addition of more data in the training-set.

Figure 6

Plots showing ML predictions on three functional group test-set (y-axis) vs DFT redox potentials (x-axis). The combined data set (training-set + two functional group test-set) was used for the training. Gray dashed line corresponds to the perfect predictions.

We carried out feature importance analysis using permutation importance. Please refer to the Feature Importance Analysis section under Materials and Methods for the details on the technique. In order to understand how model performance changes with the number of descriptors, we re-trained the models on subsets of features and assessed their performance on the internal test-set. The top 50 features based on their permutation importance score were used. R² was used as a performance metric. The result of this analysis is shown in Figure 7. It can be seen that most of the models show a jump in R² and have R² > 0.9 around the top 10 features. The unusual behavior of the GP model is attributed to the instability of the model for a small number of features. The plots in Figure 8 show the histograms of the top 10 important features from each model. Although models show variation in feature importance, they all agree on the most important feature, that is, “PEOE_VSA1.” Interestingly, most of the features in ARDR have small weights as ARDR tries to prune the large number of irrelevant features, leading to a sparse model.[69,76] Five of the top 10 features (“MaxAbsPartialCharge,” “PEOE_VSA1,” “fr_ArN,” “fr_NH0,” and “fr_NH2”) are common to all models. Variations in the feature importance scores could be attributed to the difference in the internal structures of the models. Here, we discuss some of the common features from Figure 8.

Figure 7

R² vs number of descriptors. R² was computed using the internal test-set.

In this study, we identified a few issues with ARDR and GP. Despite high predictive performance, ARDR is not a reliable model as it places very high weight on one feature (i.e., “PEOE_VSA1”). Similarly, GP is not a reliable model as it becomes unstable when a small number of features is used. We encountered division-by-zero errors in the kernel function during the analysis with the GP model.

Figure 8

Top 10 features (y-axis) vs mean feature importance score (x-axis). Feature importance scores were rescaled between 0 and 1. Error bars represent standard deviation from 100 repetitions.

PEOE_VSA1

This is the sum of the approximate accessible van der Waals surface area (i.e., VSA in Å²) of the atoms having partial charge less than −0.30.[77−79] The partial charges are computed using the partial equalization of orbital electronegativities (PEOE) method developed by Gasteiger and Marsili in 1980. Please refer to the discussion of the MaxAbsPartialCharge descriptor for the PEOE method. Thus, this descriptor captures the information related to molecular size and the number of electron-donating functional groups.
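The definition above amounts to a thresholded sum over atoms. The plain-Python sketch below illustrates the arithmetic with hypothetical per-atom charges and surface-area contributions; in practice both quantities would come from RDKit, which exposes the finished descriptor as `Descriptors.PEOE_VSA1`.

```python
def peoe_vsa1(partial_charges, vsa_contributions, threshold=-0.30):
    """Sum the VSA contributions of atoms whose partial charge is below the threshold."""
    return sum(v for q, v in zip(partial_charges, vsa_contributions) if q < threshold)


# Hypothetical per-atom values (charge in e, VSA contribution in Å²), illustration only.
charges = [-0.35, -0.32, -0.10, 0.05, 0.20]
vsa = [5.0, 5.5, 6.0, 6.0, 1.5]

value = peoe_vsa1(charges, vsa)
print(value)  # only the first two atoms fall below -0.30
```

The value grows either when more atoms carry a sufficiently negative charge or when those atoms expose more surface, which is why the descriptor mixes size and electronic information.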

MaxAbsPartialCharge

This is the maximum value of the absolute Gasteiger partial charges present in the molecule. In 1980, Gasteiger and Marsili gave the procedure to calculate the partial charges in a molecule. That procedure is known as PEOE. In this method, the charge is transferred between bonded atoms until equilibrium. Gasteiger partial charges depend on the connectivity and orbital electronegativity, thus capturing the electron-donating and electron-withdrawing power of the atoms.[80] Electronegativity is essential information as electron-donating groups decrease the redox potential, and electron-withdrawing groups increase the redox potential.[27]

MinPartialCharge

This is the minimum value of the Gasteiger partial charges present in the molecule. Please refer to the discussion of MaxAbsPartialCharge descriptor for the properties of Gasteiger partial charges.

fr_NH0

It is the number of tertiary amines present in the molecule.

fr_ArN

It is the number of N functional groups attached to aromatic rings.

fr_NH2

It is the number of primary amines.

NHOHCount

It is the number of N–H and O–H bonds present in the molecule.

From the analysis in this section, we realized that there are some issues with the ARDR and GP models, which are outlined below. One should be very careful while using ARDR and GP models.

Issues with the ARDR Model

As ARDR is related to the SBL framework, it reduces the number of irrelevant features. Unfortunately, in this case, ARDR has put a lot of weight on only one feature, that is, “PEOE_VSA1” (Figure 8). Surprisingly, ARDR also achieves an R² of more than 0.95 with only two features (Figure 7). Although it has shown good performance on the data set used in this work, it may not work for a broader chemical space. This type of behavior reduces the reliability of the model.

Issues with the GP Model

From Figure , it can be seen that the model’s accuracy decreases as more features are added, and at around 10 features, there is a significant drop in performance. We also encountered divide-by-zero errors in the kernel function during this analysis. This shows that GP may not be a very stable model in this case.

Structure–Functional Relationship

“PEOE_VSA1” is the most important descriptor common to all models. It is computed by summing the approximate accessible VSA (in Å²) of the atoms having a partial charge less than −0.30.[77−79] Thus, the “PEOE_VSA1” descriptor captures information related to molecular size and the number of electron-donating functional groups present in the molecule. From Figure , we can see that the redox potential of phenazine derivatives decreases with increasing “PEOE_VSA1.” The Pearson correlation coefficient between “PEOE_VSA1” and redox potential is −0.69, supporting this observation. We observed that the value of “PEOE_VSA1” is higher for systems in which the negative partial charge is delocalized. A delocalized system contains more atoms with a negative partial charge than the corresponding localized system, so more atoms contribute to “PEOE_VSA1” in delocalized systems than in localized ones. The effect of delocalization of partial charge on “PEOE_VSA1” is shown in Figure with a few examples from the training-set. Thus, for designing better anolytes, it is suggested to increase the delocalization of negative partial charge in the phenazine derivatives.
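As an illustration of how a Pearson correlation coefficient like the one quoted above is obtained, here is a minimal sketch in plain Python. The descriptor values and potentials are hypothetical placeholders rather than the study's data, so the resulting coefficient will not equal −0.69.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical descriptor values (Å^2) and redox potentials (V), for illustration only.
peoe_vsa1 = [0.0, 5.0, 10.0, 15.0, 20.0]
redox_v = [-1.60, -1.75, -1.70, -1.90, -1.95]

r = pearson_r(peoe_vsa1, redox_v)
print(f"Pearson r = {r:.2f}")
```

A negative coefficient means that the potential tends to fall as the descriptor rises, which is the trend described for “PEOE_VSA1.”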

Figure 9

Redox potential vs “PEOE_VSA1.”

Figure 10

Examples from the training-set showing the effect of charge delocalization on “PEOE_VSA1.” Values of “PEOE_VSA1” and DFT redox potentials in volts are also shown. Mol IDs were assigned to identify derivatives from the corresponding data set.

The redox potential of a phenazine derivative depends on the type of functional group, the position of attachment, and the number of functional groups. Two types of functional groups have been investigated in this study: (i) electron-donating and (ii) electron-withdrawing. The redox potential of parent phenazine without any functional group is −1.74 V. When the redox potential of the derivative decreases (i.e., becomes less than −1.74 V) after the attachment of functional groups, it is called a negative shift. Similarly, if it increases, it is called a positive shift. The shift is quantified as the difference between the redox potential of the phenazine derivative and that of parent phenazine. After sorting phenazine derivatives based on the redox potential, it was observed that electron-donating groups show a negative shift, whereas electron-withdrawing groups show a positive shift. The redox potentials of phenazine derivatives were computed using the approach discussed in Section . Equation shows that functional groups stabilizing the anionic form of phenazine derivatives have high redox potential. In contrast, those that destabilize the anionic form have low redox potential. Therefore, electron-withdrawing groups show a positive shift as they stabilize the anionic form, and electron-donating groups show a negative shift as they destabilize the anionic form. A few examples showing positive and negative shifts with respect to parent phenazine are shown in Figure .
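The definition of the shift can be written as a short sketch. The parent potential of −1.74 V is taken from the text; the derivative potentials and the function names are hypothetical and serve only to illustrate the sign convention.

```python
PARENT_PHENAZINE_V = -1.74  # redox potential of parent phenazine, as stated in the text

def redox_shift(derivative_v, parent_v=PARENT_PHENAZINE_V):
    """Shift = derivative potential minus parent potential, in volts."""
    return derivative_v - parent_v

def classify_shift(shift_v, tol=1e-9):
    """Label a shift as 'negative', 'positive', or 'none' (within a small tolerance)."""
    if shift_v > tol:
        return "positive"
    if shift_v < -tol:
        return "negative"
    return "none"

# Hypothetical derivative potentials in volts, for illustration only.
for potential in (-1.85, -1.61):
    shift = redox_shift(potential)
    print(f"E = {potential:.2f} V, shift = {shift:+.2f} V, {classify_shift(shift)}")
```

Under this convention, a derivative more negative than −1.74 V, as expected for an electron-donating group, is reported with a negative shift.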

Figure 11

Examples showing positive and negative shifts with respect to parent phenazine. DFT redox potentials and shifts in volts are also shown. Mol IDs were assigned to identify derivatives from the corresponding data set.

In the case of derivatives with multiple functional groups, if all groups are of the same type, then the shift also corresponds to that type. For example, when the derivative contains only electron-donating groups, it shows a negative shift. Similarly, the shift is positive when the derivative contains only electron-withdrawing groups. A few examples having similar types of functional groups are shown in Figure .

Figure 12

Examples showing the effect of similar types of functional groups on the redox potential. DFT redox potentials and shifts in volts are also shown. Mol IDs were assigned to identify derivatives from the corresponding data set.

When derivatives contain more than one functional group and the groups differ in type, the direction of the shift is determined by the group showing the highest absolute shift in the corresponding single functional group derivative. For example, derivative A in Table contains −NH2, an electron-donating group with a shift of −0.11 V, and −Cl, an electron-withdrawing group with a shift of 0.13 V. The magnitude of the shift for −Cl is larger than that for −NH2; therefore, derivative A shows a positive shift of 0.03 V, consistent with this rule. A similar analysis is applicable to derivative B, which also shows a positive shift. Derivative C contains −N(CH3)2 and −CH3, two electron-donating groups, and −CO(NH2), an electron-withdrawing group. The shift of −N(CH3)2 is −0.24 V, which has the largest magnitude among all three groups. Therefore, derivative C shows a negative shift of −0.09 V. Derivative D contains −OCH3 and −C6H5, two electron-donating groups, and −CHO, one electron-withdrawing group. However, derivative D shows a positive shift, as the absolute shift of −CHO is larger than that of both electron-donating groups. Thus, the redox potential of phenazine derivatives containing multiple functional groups is determined by the relative strength of the electron-donating or electron-withdrawing power of the functional groups.
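The rule described above can be sketched as a small helper that picks the group with the largest absolute single-group shift and reports its sign. The −NH2 and −Cl shifts are the values quoted for derivative A; the function names are hypothetical, and the sketch predicts only the direction of the shift, not its magnitude.

```python
def dominant_group(single_group_shifts):
    """Return the group whose single-group shift has the largest magnitude.

    If two groups tie, the one listed first is returned.
    """
    return max(single_group_shifts, key=lambda g: abs(single_group_shifts[g]))

def predicted_shift_sign(single_group_shifts):
    """Predict the direction of the shift from the dominant group."""
    shift = single_group_shifts[dominant_group(single_group_shifts)]
    if shift > 0:
        return "positive"
    if shift < 0:
        return "negative"
    return "none"

# Single-group shifts (V) quoted in the text for derivative A.
derivative_a = {"NH2": -0.11, "Cl": 0.13}

print(dominant_group(derivative_a), predicted_shift_sign(derivative_a))
```

For derivative A the helper selects −Cl and predicts a positive shift, in line with the observed shift of 0.03 V.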

Table 6

Examples Showing the Effect of Absolute Values of Single Functional Group Shift on the Redox Potential of Derivatives Containing Different Types of Functional Groups^a

^a DFT redox potentials and shifts in volts are also shown. Mol IDs were assigned to identify derivatives from the corresponding data set.

The effect of position on the redox potential of single functional group derivatives has been studied by Mavrandonakis and co-workers.[27] They showed that the position does not have a significant effect for electron-withdrawing groups. However, electron-donating groups that are capable of intramolecular hydrogen bonding show a more negative shift when attached at position 2 than at position 1. The position numbers in phenazine derivatives are shown in Figure . They also investigated the effect of the number of functional groups on redox potential. It was shown that the addition of more electron-withdrawing groups shifts the redox potential continuously toward positive values. However, this effect is less significant for electron-donating groups. The difference between the phenazine derivatives with four and eight amino groups is very small (∼0.05 V), whereas the difference between the derivatives with four and eight cyano groups is ∼1.23 V.

Figure 13

Numbering of the positions in phenazine derivatives.

Identification of Promising Phenazine Derivatives for the Anolyte

In this section, we identify the top five promising candidates for the anolyte using the trained ML models. Models developed in this study are based on features that do not require electronic structure calculations. Therefore, these models could screen millions of molecules in a short amount of time. Experiments or DFT calculations could then be performed on the reduced number of molecules to identify the best redox-active molecules, saving computational and experimental costs. Using this hybrid DFT-ML approach, we have identified promising phenazine derivatives for the anolyte in RFBs. These candidates would provide a good starting point for experimentalists. Electron-donating molecules with negative redox potential are preferred candidates for the anolyte. As KRR and SVR are stable models, the predictions here are based on them. The values of redox potentials are averaged over 100 independent iterations of data splitting and model training. Table lists the top five phenazine derivatives from the external test-set with the most negative redox potentials obtained from DFT and the two ML models. Four out of five predictions from KRR and SVR match the DFT predictions. The top five promising candidates from the multiple functional group test-sets are shown in Tables S3–S5 of the Supporting Information.
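The ranking step described above, averaging each molecule's predicted potential over repeated runs and keeping the most negative values, can be sketched as follows. The Mol IDs, predictions, and function names are hypothetical; the actual screening used 100 iterations of KRR and SVR, which are not reproduced here.

```python
from statistics import mean

def top_anolyte_candidates(predictions_by_mol, n=5):
    """Average per-run predictions and return the n most negative (mol_id, mean) pairs."""
    averaged = {mol: mean(values) for mol, values in predictions_by_mol.items()}
    return sorted(averaged.items(), key=lambda item: item[1])[:n]

def overlap_count(ids_a, ids_b):
    """Number of molecule IDs shared by two candidate lists."""
    return len(set(ids_a) & set(ids_b))

# Hypothetical predicted redox potentials (V) from three runs per molecule.
ml_predictions = {
    "M1": [-2.10, -2.06, -2.08],
    "M2": [-1.95, -1.97, -1.93],
    "M3": [-2.20, -2.24, -2.22],
    "M4": [-1.80, -1.78, -1.82],
    "M5": [-2.01, -1.99, -2.00],
    "M6": [-1.70, -1.72, -1.71],
}

top_ml = top_anolyte_candidates(ml_predictions, n=5)
top_ml_ids = [mol for mol, _ in top_ml]
hypothetical_dft_ids = ["M3", "M1", "M5", "M2", "M6"]  # placeholder DFT ranking

print(top_ml_ids)
print(overlap_count(top_ml_ids, hypothetical_dft_ids))
```

Comparing the ML and DFT lists by overlap is one simple way to express a statement such as "four out of five predictions match."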

Table 7

Top Five Anolyte Candidates Predicted Using DFT, KRR, and SVR from the External Test-Set^a

^a SVR and KRR were trained on the phenazine derivatives containing a single type of functional group per derivative. Mol IDs and redox potentials predicted from DFT and ML models are shown below the respective candidates. Mol IDs were assigned to identify derivatives from the corresponding test-set. Derivatives are arranged in increasing order of redox potential. Redox potentials are given in volts.

Conclusions

In this study, four ML models were employed to predict the redox potentials of phenazine derivatives in DME using DFT. Models were trained on a small data set of 151 phenazine derivatives having only one type of functional group per molecule (20 unique functional groups). The trained models achieved high accuracies (R2 > 0.74) on internal and external test-sets containing diverse phenazine derivatives. We also showed that, despite being trained on derivatives with a single type of functional group, the models were able to predict the redox potentials of derivatives containing multiple and different types of functional groups with good accuracies (R2 > 0.7). Feature selection and hyperparameter optimization using the validation set were critical strategies for performance improvement. Feature selection removed the unnecessary and noisy features. Hyperparameter optimization using an external validation set helped improve the generalizability of the models. The addition of 15 derivatives from the two functional group test-sets to the training-set improved the accuracy on the three functional group test-sets. It was observed that the “PEOE_VSA1” descriptor was the most important molecular feature, as it contains information related to molecular size and the partial charges. Deeper analysis showed that one should not rely only on the model performance but also investigate the stability and reliability of the models. Through the structure–functional relationship, we observed that the redox potential of derivatives containing multiple functional groups is influenced by the functional group having either strong electron-donating or strong electron-withdrawing power. Models developed in this study are based on features that do not require electronic structure calculations. Therefore, these models could screen millions of molecules in a short amount of time. Experiments or DFT calculations could then be performed on the screened candidates to identify the best molecules, saving computational and experimental costs. Using this hybrid DFT-ML approach, we have identified promising phenazine derivatives for the anolyte in RFBs. These candidates would provide a good starting point for experimentalists. This study shows that it is possible to develop reasonably accurate ML models for complex quantities such as redox potential using small and simple data sets.
