
Predicting the emission wavelength of organic molecules using a combinatorial QSAR and machine learning approach.

Zong-Rong Ye¹, I-Shou Huang¹,², Yu-Te Chan¹,³, Zhong-Ji Li¹, Chen-Cheng Liao¹, Hao-Rong Tsai¹, Meng-Chi Hsieh¹, Chun-Chih Chang¹,⁴, Ming-Kang Tsai¹.

Abstract

Organic fluorescent molecules play critical roles in fluorescence inspection, biological probes, and labeling indicators. More than ten thousand organic fluorescent molecules were imported in this study, followed by a machine-learning-based approach for extracting the intrinsic structural characteristics that were found to correlate with the fluorescence emission. A systematic informatics procedure was introduced, consisting of descriptor cleaning, descriptor space reduction, and statistically meaningful regression, to build a broad and valid model for estimating the fluorescence emission wavelength. The least absolute shrinkage and selection operator (Lasso) regression coupled with the random forest model was finally reported as the numerical predictor, and it fulfilled the statistical criteria. Such an informatics model appeared to offer comparable predictive ability and to be complementary to the conventional time-dependent density functional theory method in emission wavelength prediction, at a fraction of the computational expense. This journal is © The Royal Society of Chemistry.


Year: 2020 PMID: 35517310 PMCID: PMC9054811 DOI: 10.1039/d0ra05014h

Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 4.036

Introduction

Fluorescent molecules are widely used in various applications. In biological science, fluorescent molecules serve as analytic and diagnostic tools for understanding cell biology owing to their photochemical sensitivity and specificity. Fluorescent molecules have been used for labeling target cells, RNAs, DNAs, and peptides, and for live-cell imaging.[1] Interesting phenomena such as aggregation-induced emission (AIE) were also observed using organic fluorescent dyes.[2] A single AIE organic molecule is non-emissive, and the light emission is enhanced when these molecules aggregate.[2] Additionally, fluorescent molecules can act as light-activated switches, being utilized in electronics and optical memory devices.[3] All these applications and properties are closely related to the interactions of different chemical bonding structures at different electronic states, which are conventionally represented schematically by the Jablonski diagram.[4]

Chemists have long been searching for fluorescent core structures, ranging from biological proteins and peptides to small organic molecules and synthetic oligomers or polymers.[5] One of the organic fluorescent core structures is coumarin, isolated from tonka beans in 1820.[6] In 2012, Chen et al. studied coumarin-related chromophores and discovered two isomers with inverted tetracyclic pyrazolo[3,4-b]pyridine structures and distinctive fluorescence emission wavelengths.[7] In addition to coumarin, various organic core structures have been introduced for optical applications, e.g. xanthene, cyanine, squaraine, naphthalene, oxadiazole, anthracene, pyrene, oxazine, acridine, arylmethine, and tetrapyrrole.[3] To build a new fluorescent molecule, chemists tend to explore an available core structure and modify its chemical bonding environment to improve the corresponding photophysical and photochemical properties.

From the computational perspective, chemists have also developed approaches to describe chemical properties, building upon their success in medicinal chemistry, i.e. quantitative structure–activity relationship (QSAR) and quantitative structure–property relationship (QSPR). QSAR and QSPR are designed for predicting complex physical, chemical, and biological properties of molecules from experimental or calculated fundamental characteristics.[4] Such an approach originated from an early toxicology study on the primary aliphatic alcohols and their water solubility in 1863.[8] Following the successes in predicting many physical and chemical properties of compounds, molecular descriptors have also been developed to characterize and classify structural patterns. Molecular descriptors encode molecular physicochemical properties, such as constitutional, structural, lipophilicity, electronic, geometrical, hydrophobic, solubility, quantum chemical, and topological descriptors. Another type of descriptor is the fingerprint. Fingerprints are binary (yes/no) descriptors indicating the presence of certain functional groups within the molecules.[9] With modern computing capability and capacity, chemists are able to assess the chemical space using large-scale molecular descriptors and fingerprints.

To analyze the complexity of chemical databases and build chemically intuitive mathematical models, machine learning plays a key role and has been applied in QSAR studies since the early 1980s. Machine-learning methods were used to derive logical and numerical rules from samples together with the relevant background knowledge.[10] King et al. described neural network (NN) and inductive logic programming (ILP) models in QSARs and compared multiple linear regression (MLR) with these machine learning methods on drug design problems.[11] The authors observed poorer statistical characteristics for the NN model and higher interpretive ability for the ILP model. Wang et al. reported an extreme learning machine neural network model to predict the electronic energies of 4,4-difluoro-4-bora-3a,4a-diaza-s-indacene (BODIPY) dyes.[12] Li et al. established a QSAR between overall power conversion efficiency and quantum mechanical descriptors using a cascaded support vector machine (SVM) model for 400 organic dye sensitizers.[13] Recently, 109 fluorescent proteins were analyzed using neural networks, decision trees (DT), random forests (RF), and SVM, where the RF algorithm outperformed the others.[14] The latest advances using deep neural networks for the development of data-driven continuous representations have demonstrated state-of-the-art performance in describing molecular structures and predicting properties for drug discovery applications.[15] Noh et al. introduced an inverse design pipeline based upon invertible image-based featurization to design new functional inorganic solid-state materials.[16] In addition to predicting structural functionality, Häse et al. extracted fundamental knowledge of the excited electron transfer properties of light-harvesting systems using artificial neural networks to facilitate the development of excitonic devices.[33]

Although the mathematical forms (i.e. data structures) for the optimal representation of molecules and materials have been actively explored by several pioneering studies,[17-22] molecular physicochemical phenomena are commonly interpreted in terms of stoichiometry, local valence chemical bonding, and the presence of functional groups. Therefore, a prediction model assessing the valence bonding patterns of the ground and excited electronic states across a large pool of organic fluorescent molecules is attempted in this study. A fast and accessible predictor could substantially boost high-throughput screening for the design of organic fluorescent molecules, to be followed by refinement with quantum mechanics (QM) characterization before entering the synthetic process. We therefore conducted the present study using a systematic, statistical approach to build a machine-learning QSAR model for emission wavelength prediction.

Methods

Sample generation

We extracted 11 460 experimentally synthesized fluorescent organic molecules from the Reaxys database[23] with emission wavelengths between 200 and 900 nm; the molecular weights ranged from 30 to 3203 g mol⁻¹, as shown in Fig. 1. More details of the dataset construction are provided in the ESI.† Solvatochromic effects arising from the use of different solvents in the experiments were not filtered out, in order to maintain dataset diversity. Subsequently, we fed the simplified molecular input line entry system (SMILES) files of these molecules to the descriptor generator.

Fig. 1

(a) Emission wavelength and (b) molecular weight distribution of the 11 460 fluorescent molecules.

Descriptor generation

We used the PaDEL package[24] to generate two-dimensional molecular descriptors and fingerprints for each fluorescent organic molecule, and the initial number of descriptors per molecule was 11 141. The complete list of the seventy categories of descriptors generated by PaDEL is summarized in Table S1.†

Results and discussion

Descriptor dimension reduction

We first applied variance threshold selection (VTS) to reduce the descriptor dimension, removing every descriptor with variance σ² < 0.01. Consequently, the descriptor dimension was reduced from 11 411 to 6208, and most of the removed descriptors were fingerprints. We subsequently conducted multiple linear regression (MLR) with the 6208 descriptors and denoted the result as the VTS-MLR model. Additionally, we built two more MLR-based models for comparison. The first, denoted as the VTSsel-MLR model, used the 4300 dominant descriptors (|coefficient| > 5) out of the 6208, as determined by the VTS-MLR model. The second, denoted as the VTSfp-MLR model, contained only the 5158 fingerprint descriptors of the VTS descriptor ensemble (the remaining 1050 features are 2D descriptors). The results predicted by the VTS series MLR models are compared with the experiments in Fig. 2. The R² values of these MLR models are summarized in Table 1, with all cases giving R² > 0.86. To examine overfitting, we divided the descriptor dimension into 10 equivalent partitions and conducted cross validation to calculate the mean of the 10-fold R² (denoted as Q²), as shown in Table 1 (see Table S2† for the R² of each partition). The Q² results indicate that all three MLR models lack predictive ability and do not generalize. We attribute the poor Q² results to the diversity of our collected dataset.
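As a minimal sketch of the variance threshold step described above, the following Python snippet drops descriptor columns whose variance falls below 0.01. The toy matrix and the function name `variance_threshold` are illustrative assumptions, and the use of the population variance is also an assumption, since the text does not state which estimator was used.

```python
from statistics import pvariance


def variance_threshold(rows, threshold=0.01):
    """Return the indices of descriptor columns with variance >= threshold."""
    n_cols = len(rows[0])
    keep = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        # Population variance is an assumption; the paper only gives the cutoff.
        if pvariance(column) >= threshold:
            keep.append(j)
    return keep


# Toy descriptor matrix: 4 molecules x 3 descriptors (made-up values).
X = [
    [1.0, 0.0, 3.2],
    [1.0, 1.0, 2.9],
    [1.0, 0.0, 4.1],
    [1.0, 1.0, 3.7],
]

kept = variance_threshold(X)
X_reduced = [[row[j] for j in kept] for row in X]
print(kept)       # the constant first column is dropped
print(X_reduced)
```

A constant column, such as a fingerprint bit that is set in every molecule, carries no information for regression, which is consistent with the observation that most removed descriptors were fingerprints.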

Fig. 2

The 2D histograms of the prediction vs. experiment comparison using the VTS series MLR models.

Table 1

The statistical properties of the VTS series MLR models

Models	# of descriptors	R²	AdjR²ᵃ	Q²
VTS-MLR	6208	0.92	0.66	−5143.72
VTSsel-MLR	4300	0.86	0.72	−49790.45
VTSfp-MLR	5158	0.89	0.62	−5.64 × 10¹⁵

ᵃ AdjR² denotes the adjusted R² with respect to the size of the descriptor ensemble.
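The strongly negative Q² values follow from the definition of the coefficient of determination, R² = 1 − SS_res/SS_tot, which has no lower bound when predictions on held-out data are worse than simply predicting the mean. The sketch below uses invented wavelengths (not taken from the dataset) and assumes that Q² is the plain average of the per-fold R² values, as stated in the text; the function names are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def q2(folds):
    """Mean of the per-fold R^2 values; folds is a list of (y_true, y_pred)."""
    scores = [r2_score(t, p) for t, p in folds]
    return sum(scores) / len(scores)


# Two hypothetical held-out folds (wavelengths in nm).
good_fold = ([400.0, 500.0, 600.0], [410.0, 490.0, 610.0])
bad_fold = ([400.0, 500.0, 600.0], [900.0, 100.0, 1200.0])

print(r2_score(*good_fold))       # close to 1
print(r2_score(*bad_fold))        # far below zero
print(q2([good_fold, bad_fold]))  # one bad fold dominates the mean
```

A single badly extrapolated fold is enough to pull the mean far below zero, which is one way an over-parameterized linear model can show a high training R² and a very negative Q².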

Descriptor dimension classification by K-means clustering

To gain better insight into the descriptor space, we conducted K-means clustering, an unsupervised machine learning method,[25] to classify the VTS series descriptor ensembles. We first applied principal component analysis (PCA) to convert the VTS (6208), VTSsel (4300), and VTSfp (5158) descriptor ensembles, respectively, into a 3-dimensional space. Subsequently, we used the three effective descriptors for K-means clustering[26] to identify the optimal number of sub-groups for categorizing the 11 460 fluorescent molecules. The predicted inertia and Silhouette scores for k = 2–100 group partitions are shown in Fig. 3 (see ESI† for the details of both scoring functions). The Silhouette score of the PCA-transformed VTS ensemble, labeled as PCA-VTS (Fig. 3b), is substantially different from its counterparts, while no apparent elbow point could be identified from the three inertia score curves. The highest Silhouette score at k = 15 suggests that the 11 460 molecules may be reasonably categorized into 15 groups using the PCA-VTS descriptor ensemble. In Fig. S1,† we also examined the non-PCA-transformed VTS ensemble, and the corresponding results suggested that the PCA transformation was not trivial for classifying these complicated descriptor ensembles.
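The two scoring functions can be illustrated with a small, self-contained sketch. It assumes the standard definitions (inertia as the sum of squared distances to the assigned center; Silhouette as the mean of (b − a)/max(a, b) per point), which may differ in detail from those given in the ESI. The six 2D points and the initial centers are invented for illustration.

```python
import math


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, centers, n_iter=20):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = list(centers)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign(points, centers), centers


def inertia(points, labels, centers):
    """Sum of squared distances of each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette(points, labels):
    """Mean Silhouette coefficient; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for lab in set(labels):
            others = [q for j, q in enumerate(points) if labels[j] == lab and j != i]
            if others:
                mean_dist[lab] = sum(math.dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist[own]
        b = min(d for lab, d in mean_dist.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Invented 2D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
print(labels, inertia(points, labels, centers), silhouette(points, labels))
```

Scanning a range of k values, as done for k = 2–100 in the text, would repeat this calculation for each k and compare the resulting scores.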

Fig. 3

The calculated inertia (a, c and e) and Silhouette (b, d and f) scores with respect to the k groups in the K-means clustering analysis using the PCA-transformed descriptors of the VTS series ensembles.

By introducing K-means clustering, we intended to identify the general, though subtle, structural characteristics for categorizing the collected fluorescent molecules. The 15 sub-groups of the 11 460 molecules obtained with the PCA-transformed VTS, VTSsel, and VTSfp ensembles are projected onto the ternary plots in Fig. 4 (see Fig. S2† for the corresponding 3-dimensional plots), where the data distribution of the VTS ensemble (Fig. 4a) appeared to be relatively distinguishable but not optimal.

Fig. 4

Three ternary plots of the 15 subgroups using the PCA-transformed descriptors of the VTS series ensembles. The 15 colors of dots denote the subgroup distributions along the three PCA-transformed axes.

Extracting the dominant descriptors of each sub-group by Lasso regression

Lasso regression[27] was applied to extract the important descriptors for each of the 15 subgroups of the VTS ensemble. According to the Lasso-predicted contributions, we reduced the VTS ensemble from 6208 to 480 descriptors, denoted as the Lasso ensemble, after collecting the filtered descriptors of each subgroup. The elbow of the inertia score curve of the Lasso ensemble could be identified at k = 7 (see Fig. 5a and ESI†), while the comparable Silhouette scores indicated acceptable grouping up to k = 13 (see Fig. 5b). The PCA-transformed ternary plots using k = 7 and 13 are shown in Fig. 5c and d, respectively (see Fig. S3† for the corresponding 3-dimensional plots). The colored dots appeared to be well separated in both cases, suggesting that the 480-descriptor (Lasso) ensemble could represent the overall structural characteristics well. We therefore conducted linear regression (LR) and nonlinear random forest (RF) regression[28] for the emission wavelength prediction using the Lasso ensemble. A statistical performance comparison of the VTS series and Lasso models is provided in Table S3† based upon R², AdjR², Akaike's information criterion (AIC),[29] p-value, and mean absolute error (MAE) against the experiments.
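To illustrate how Lasso performs descriptor selection, the sketch below fits a Lasso model by cyclic coordinate descent with soft thresholding and keeps the descriptors with non-zero coefficients. It assumes the common objective (1/2n)‖y − Xw‖² + α‖w‖₁ with centered data and no intercept; the solver, the value of α, and the toy data are illustrative assumptions and are not taken from the paper.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_cd(X, y, alpha, n_iter=100):
    """Cyclic coordinate descent for (1/2n)||y - Xw||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            if z == 0.0:  # an all-zero column cannot contribute
                continue
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Toy data: y depends strongly on descriptor 0 and only weakly on descriptor 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]  # equals 2.0 * x0 + 0.1 * x1

w = lasso_cd(X, y, alpha=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(w, selected)
```

The weak descriptor is driven exactly to zero while the strong one is shrunk but retained, which is the behavior that allows Lasso to act as a filter before a second regression stage.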

Fig. 5

(a and b) The inertia and Silhouette scores of the Lasso ensemble, respectively; (c and d) the visualization of the 7 and 13 groups in the PCA-transformed 3-dimensional plots of the Lasso ensemble, respectively.

Golbraikh and Tropsha[30] suggested a combination of statistical criteria for demonstrating statistically meaningful regression results, as shown in Table 2. The ideal values for these criteria are also summarized in Table 2, with k denoting the slope of the predicted over experimental values through the origin, and (R² − R0²)/R² indicating acceptable predictive ability of the regression if the corresponding value is <0.1. Both the Lasso-LR and Lasso-RF models gave reasonable predictability, with MAE of 40 and 24 nm, respectively, and the transferability of these models also appeared to be significantly better than that of the VTS series models (see Q² values in Tables 2 and S2†). Although the Lasso-RF model gave better predictability than its LR counterpart, the LR model still provided qualitative results in addition to its interpretability (Fig. 6).

Table 2

The statistical benchmark of Lasso-LR and Lasso-RF models

Criterion	R²ᵃ	R²_test (MAE)ᵃ	Q²	k (k′)ᵇ	R0² (R′0²)ᶜ	(R² − R0²)/R² ((R² − R′0²)/R²)ᶜ
Ideal values	>0.6	>0.6	>0.5	0.85 ≤ k ≤ 1.15 (0.85 ≤ k′ ≤ 1.15)	Close to R² (close to R²)	<0.1 (<0.1)
Lasso-LR	0.6632	0.5685 (44)	0.5800	0.9868 (0.9999)	0.6627 (0.4984)	0.0009 (0.2486)
Lasso-RF	0.9227	0.7004 (36)	0.6205	0.9919 (1.0025)	0.8565 (0.7933)	0.0717 (0.1402)

ᵃ Only 80% of the 11 460 samples were selected (randomly) as the training set for the Lasso-LR and Lasso-RF models. The remaining 20% of the samples were used as the testing set, with MAE (in nm) shown in parentheses.

ᵇ The value of k denotes the slope of the predicted over experimental data through the origin (intercept equal to zero), and k′ denotes the inverse case. The detailed information is summarized in the ESI.

ᶜ R0² denotes the correlation coefficient of the regression through the origin associated with k, and R′0² denotes the case of k′. See ESI for more details.

Fig. 6

The 2D histograms of the regression results of the Lasso-LR and Lasso-RF models. The legend shows the linear equation fitted to the predicted values.
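The criteria in Table 2 can be sketched as follows. The formulas are the commonly used Golbraikh–Tropsha definitions, adapted here under the assumption that k is the through-origin slope of predicted versus experimental values (as described in the text) and that R² is the squared Pearson correlation; the exact conventions used by the authors are given in the ESI and may differ. The three data points are invented.

```python
def golbraikh_tropsha(y_exp, y_pred):
    """Through-origin slopes and R0^2 terms (assumed standard definitions)."""
    n = len(y_exp)
    mean_exp = sum(y_exp) / n
    mean_pred = sum(y_pred) / n
    s_xy = sum(e * p for e, p in zip(y_exp, y_pred))

    k = s_xy / sum(e * e for e in y_exp)         # predicted = k * experimental
    k_prime = s_xy / sum(p * p for p in y_pred)  # experimental = k' * predicted

    ss_exp = sum((e - mean_exp) ** 2 for e in y_exp)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    cov = sum((e - mean_exp) * (p - mean_pred) for e, p in zip(y_exp, y_pred))
    r2 = cov ** 2 / (ss_exp * ss_pred)           # squared Pearson correlation

    # R0^2 terms: fit quality of the two regressions forced through the origin.
    r0_2 = 1 - sum((p - k * e) ** 2 for e, p in zip(y_exp, y_pred)) / ss_pred
    r0p_2 = 1 - sum((e - k_prime * p) ** 2 for e, p in zip(y_exp, y_pred)) / ss_exp

    return {
        "k": k, "k_prime": k_prime, "R2": r2, "R0_2": r0_2, "R0p_2": r0p_2,
        "ratio": (r2 - r0_2) / r2, "ratio_prime": (r2 - r0p_2) / r2,
    }


# Invented experimental and predicted emission wavelengths (nm).
result = golbraikh_tropsha([400.0, 500.0, 600.0], [410.0, 490.0, 610.0])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

As a consistency check on the tabulated values, (0.9227 − 0.8565)/0.9227 ≈ 0.0717 and (0.9227 − 0.7933)/0.9227 ≈ 0.1402 reproduce the last column of the Lasso-RF row.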

Comparison of informatics vs. quantum mechanics approaches

We selected 20 fluorescent compounds from Reaxys in addition to the original 11 460 samples, 5 compounds per 100 nm interval between 300 and 700 nm, for emission wavelength predictions using time-dependent density functional theory (TDDFT) calculations, one of the common QM methods for predicting emission wavelengths (see Table S4† for the corresponding SMILES). We employed the ωB97XD functional[31] and the 6-31+G(d) basis set under an implicit solvation model (PCM of ethanol) with the Gaussian 16 package[32] for the S1 state optimization. The emission wavelength was further calibrated by the vertical S0 to S1 excitation energy computed at the ωB97XD/6-311+G(d,p) level using the optimized S1 minimum structure. In Fig. 7 and Table S4,† the Lasso-RF model gives a reasonable R² value (R² = 0.655, MAE = 48 nm) in comparison with the selected DFT predictions (R² = 0.778, MAE = 60 nm) at significantly less computational expense. The Lasso-RF approach is consequently recommended for large-scale, high-throughput screening of the emission wavelength of organic fluorophores, being complementary to TDDFT calculations.
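For readers comparing computed energies with measured wavelengths, the conversion is λ (nm) = hc/E ≈ 1239.84/E (eV). The sketch below applies it to three hypothetical vertical emission energies and evaluates the MAE against hypothetical measurements; none of these numbers come from Table S4, and the function names are illustrative.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm


def ev_to_nm(energy_ev):
    """Convert a transition energy in eV to a wavelength in nm."""
    return HC_EV_NM / energy_ev


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical computed emission energies (eV) and measured wavelengths (nm).
energies_ev = [3.10, 2.48, 2.07]
measured_nm = [410.0, 520.0, 610.0]

predicted_nm = [ev_to_nm(e) for e in energies_ev]
print([round(x, 1) for x in predicted_nm])
print(round(mean_absolute_error(measured_nm, predicted_nm), 1))
```

Because wavelength is inversely proportional to energy, a fixed error in eV translates into a larger error in nm at longer wavelengths, which is worth keeping in mind when comparing MAE values across the 300–700 nm range.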

Fig. 7

The comparison of the predicted wavelengths (in nm) of the DFT (red) and Lasso-RF (blue) models. The linear fits to both sets of predictions are shown as dotted lines. The grey solid line denotes the ideal fit with slope = 1.

Important descriptors extracted from Lasso-RF model

We applied an ensemble method, random forest, in which the descriptor importance can be estimated by calculating the decrease in node impurity weighted by the probability of reaching the node. The node probability is calculated as the number of samples reaching the node divided by the total number of samples. The current Lasso-RF model is schematically shown in Fig. S4.† The dominant descriptors of the Lasso-RF model are summarized in Fig. 8, where the nTG12Ring and nTG12HeteroRing descriptors, describing the number of >12-membered homogeneous and heterogeneous rings, are the two leading contributions, followed by SubFPC287, which counts the number of conjugated double bonds, and nAtomP, which denotes the number of atoms in the largest π system. All four leading descriptors are consistent with the conventional QM viewpoint that a higher degree of conjugated π-bonding introduces smaller HOMO–LUMO gaps and leads to longer fluorescence emission wavelengths.
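The impurity-based importance described above can be worked through on a single hand-built regression tree. The sketch assumes the usual definition, in which each split contributes (N_node/N)·impurity − (N_left/N)·impurity_left − (N_right/N)·impurity_right to the descriptor it splits on, followed by normalization; the tree, its sample counts, and its impurities are invented, and the assignment of descriptor names to splits is purely illustrative.

```python
def accumulate(node, n_total, totals):
    """Add the weighted impurity decrease of every split to its descriptor."""
    if "feature" not in node:  # leaf node: no split, no contribution
        return
    left, right = node["left"], node["right"]
    decrease = (
        node["n"] * node["impurity"]
        - left["n"] * left["impurity"]
        - right["n"] * right["impurity"]
    ) / n_total
    totals[node["feature"]] = totals.get(node["feature"], 0.0) + decrease
    accumulate(left, n_total, totals)
    accumulate(right, n_total, totals)


def feature_importance(tree):
    """Normalized impurity-based importance for one tree."""
    totals = {}
    accumulate(tree, tree["n"], totals)
    norm = sum(totals.values())
    return {name: value / norm for name, value in totals.items()}


# Invented tree: n = samples reaching the node, impurity = variance of the target.
tree = {
    "feature": "nAtomP", "n": 10, "impurity": 100.0,
    "left": {
        "feature": "SubFPC287", "n": 6, "impurity": 20.0,
        "left": {"n": 3, "impurity": 5.0},
        "right": {"n": 3, "impurity": 5.0},
    },
    "right": {"n": 4, "impurity": 30.0},
}

importance = feature_importance(tree)
print(importance)
```

Here the root split contributes 100 − 12 − 12 = 76 and the second split contributes 12 − 1.5 − 1.5 = 9, giving normalized importances of 76/85 and 9/85. In a random forest, these per-tree values would typically be averaged over all trees.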

Fig. 8

Top 20 important descriptors predicted by the Lasso-RF model.

Conclusion

In this study, we imported more than ten thousand organic fluorophores as the sample dataset and conducted a clustering and machine learning approach using molecular descriptors to predict the emission wavelength. We demonstrated a systematic procedure for reducing the descriptor dimension guided by several statistical indicators. We finally arrived at the Lasso-RF model, which provides the numerical wavelength prediction and fulfills the statistical criteria. The model identified four descriptors related to conjugated π-bonding as the dominant contributors to the predicted emission wavelength. Such an informatics model appeared to offer comparable predictive ability and to be complementary to the conventional time-dependent density functional theory method in emission wavelength prediction, at a fraction of the computational expense.

Conflicts of interest

There are no conflicts to declare.

15 in total

1. Fast and accurate modeling of molecular atomization energies with machine learning.

Authors: Matthias Rupp; Alexandre Tkatchenko; Klaus-Robert Müller; O Anatole von Lilienfeld
Journal: Phys Rev Lett Date: 2012-01-31 Impact factor: 9.161

2. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints.

Authors: Chun Wei Yap
Journal: J Comput Chem Date: 2010-12-17 Impact factor: 3.376

3. Computation of surface orientation and structure of objects using grid coding.

Authors: Y F Wang; A Mitiche; J K Aggarwal
Journal: IEEE Trans Pattern Anal Mach Intell Date: 1987-01 Impact factor: 6.226

4. A comparison of classifiers for predicting the class color of fluorescent proteins.

Authors: Roger Sá da Silva; Luis Fernando Marins; Daniela Volcan Almeida; Karina Dos Santos Machado; Adriano V Werhli
Journal: Comput Biol Chem Date: 2019-07-09 Impact factor: 2.877

5. A Critical and Comparative Review of Fluorescent Tools for Live-Cell Imaging.

Authors: Elizabeth A Specht; Esther Braselmann; Amy E Palmer
Journal: Annu Rev Physiol Date: 2016-11-16 Impact factor: 19.318

6. Synthesis and properties of fluorescence dyes: tetracyclic pyrazolo[3,4-b]pyridine-based coumarin chromophores with intramolecular charge transfer character.

Authors: Jianhong Chen; Weimin Liu; Jingjin Ma; Haitao Xu; Jiasheng Wu; Xianglin Tang; Zhiyuan Fan; Pengfei Wang
Journal: J Org Chem Date: 2012-03-27 Impact factor: 4.354