Literature DB >> 35684394

Data-Driven and Multiscale Modeling of DNA-Templated Dye Aggregates.

Austin Biaggne1, Lawrence Spear1, German Barcenas1, Maia Ketteridge1, Young C Kim2, Joseph S Melinger3, William B Knowlton1,4, Bernard Yurke1,4, Lan Li1,5.   

Abstract

Dye aggregates are of interest for excitonic applications, including biomedical imaging, organic photovoltaics, and quantum information systems. Dyes with large transition dipole moments (μ) are necessary to optimize coupling within dye aggregates. Extinction coefficients (ε) can be used to determine the μ of dyes, and so dyes with a large ε (>150,000 M-1cm-1) should be engineered or identified. However, dye properties leading to a large ε are not fully understood, and low-throughput methods of dye screening, such as experimental measurements or density functional theory (DFT) calculations, can be time-consuming. In order to screen large datasets of molecules for desirable properties (i.e., large ε and μ), a computational workflow was established using machine learning (ML), DFT, time-dependent (TD-) DFT, and molecular dynamics (MD). ML models were developed through training and validation on a dataset of 8802 dyes using structural features. A Classifier was developed with an accuracy of 97% and a Regressor was constructed with an R2 of above 0.9, comparing between experiment and ML prediction. Using the Regressor, the ε values of over 18,000 dyes were predicted. The top 100 dyes were further screened using DFT and TD-DFT to identify 15 dyes with a μ relative to a reference dye, pentamethine indocyanine dye Cy5. Two benchmark MD simulations were performed on Cy5 and Cy5.5 dimers, and it was found that MD could accurately capture experimental results. The results of this study exhibit that our computational workflow for identifying dyes with a large μ for excitonic applications is effective and can be used as a tool to develop new dyes for excitonic applications.

Entities:  

Keywords:  DNA scaffolds; density functional theory; dye aggregates; exciton; extinction coefficient; machine learning; molecular dynamics; time-dependent density functional theory; transition dipole moment

Mesh:

Substances:

Year:  2022        PMID: 35684394      PMCID: PMC9182218          DOI: 10.3390/molecules27113456

Source DB:  PubMed          Journal:  Molecules        ISSN: 1420-3049            Impact factor:   4.927


1. Introduction

Organic molecules, which absorb and emit light, also known as dyes, are useful for many applications, such as biomedical imaging [1,2], organic photovoltaics [3,4], non-linear optics [5], and quantum information systems [6,7,8,9]. Key parameters that determine the performance of the dyes in those applications include the extinction coefficient () and transition dipole moment (), as well as aggregation ability. Thus, optimizing the key electronic (e.g., ) and molecular (e.g., aggregate) features is crucial for the desired applications of dye molecules. The interaction of dyes with light can be quantified via their extinction coefficient, ε. The value of , resulting from the absorption of light by the dye, can be measured using optical spectroscopy. From the measured value of , the transition dipole moment μ can be extracted [10,11,12]. This relationship not only allows for the measurement of the μ of dyes, but also helps select the dye candidates with optimal electronic properties for excitonic applications (i.e., large ) from numerous dyes. The value of strongly depends on the molecular structure of the dye. Some efforts have been made to augment the of dyes, such as adding donor or acceptor groups to the π-conjugation network [13], extending the π-conjugation network [14], and making the dye structures more planar [15,16,17]. However, the relationship between dye structure and remains unclear. In addition, it is time-consuming to conduct either experimental measurements or computational modeling to screen many dye candidates for desirable properties (e.g., high and μ). Another key feature of dyes for excitonic applications is dye aggregation. Dye aggregation has been observed in natural systems [18,19] as well as artificial systems [20,21]. Dye aggregates feature exciton delocalization, which facilitates energy transfer through the aggregate [22]. The dynamics of excitons residing on a dye aggregate can be described using the Frenkel Hamiltonian [23], where the exchange of excitons is largely dependent on the transition dipole coupling of the dyes [22,23,24]. The dipole coupling strength, exciton delocalization, and corresponding dynamics depend on the electronic properties of individual dyes, or monomers, as well as the orientations of the dyes in the aggregate [22,23,24,25,26]. One method of facilitating dye aggregation in a controlled and predictable manner is using DNA scaffolds. Dyes attached to DNA scaffolds, such as duplexes, Holliday junctions, and origami, have been shown to aggregate into dimers, trimers, and tetramers [27,28,29,30,31,32,33,34,35,36,37,38,39]. Characterization of the optical properties of the aggregates reveals that the dyes can adopt various ideal orientations [22,24,26,40]. One orientation, called an H-aggregate, occurs when the dyes are stacked, and results in a blue-shifted absorption spectrum. When the dyes are oriented head-to-tail, called a J-aggregate, the absorption spectrum is red-shifted. Another aggregate, termed oblique, occurs when the dyes are at 90° to one another, and results in Davydov splitting of the absorption spectrum. To maximize or fine-tune the coupling between dye molecules, the aggregation should be predictable and controlled. Computational studies of dyes aim to identify optimal candidates for excitonic applications. For example, density functional theory (DFT) and time-dependent (TD-) DFT can be used to screen the effects of functional groups on dye electronic properties [41,42,43,44]. Our prior studies indicated that functional group substitution can affect the solvation free energy , transition dipole moment , and absorption wavelength of a dye [45,46]. This effect is correlated with the empirically derived Hammett constant (), which demonstrates the electron-donating or electron-withdrawing strength of a substituent. The DFT and TD-DFT methods are applicable to dye monomers, but not to dye aggregates attached to DNA scaffolds due to structural size and complexity. An alternative method to further screen dyes with favorable electronic properties attached to DNA scaffolds is molecular dynamics (MD), which has been used to study dye–DNA interactions [47,48,49,50]. In a recent study, Mathur et al. used MD to study the orientations of cyanine dyes attached to DNA bundles and found that they were able to accurately capture dye dynamics and orientations using MD [47]. Nicoli et al. utilized MD to study the aggregation of Cy3 dimers attached to DNA duplexes, and found that MD could accurately capture the stacking of dyes leading to H-aggregation [50]. However, both DFT and MD are time-consuming for high-throughput screening of dye candidates with desired properties. Recently, machine learning (ML) has been shown to be a viable method of screening thousands of molecules to identify structure–property relationships based on both computational (e.g., DFT) [51] and experimental data [52,53]. The problem of searching through chemical-based datasets that contain labeled data for optimal molecules is a common task for pharmaceuticals and dye-sensitized solar cells, but there has not yet been work specifically targeting optimization of dye candidates for dye aggregate–DNA constructs. In particular, our group is interested in near-IR dye molecules exhibiting a large (>150,000 M−1cm−1) and hydrophobic properties. The same photophysical data used to create the chemical-based datasets for dyes have been of interest for organic photovoltaics, and there are several public datasets available [52,53,54,55]. ML techniques applied to chemical space exploration is a rich field with a variety of methods from which to choose. The methods can be hierarchical with the size of the dataset, spanning from well-established supervised learning to more complex artificial neural networks [56]. In this work, a systematic approach, combining ML, DFT, TD-DFT, and MD methods, was used to screen dye monomers from an expansive dataset and provide insight into dye aggregate–DNA duplex interactions. We first used ML to identify ideal dye candidates with high extinction coefficients (ε) from a dataset of around 18,000 molecules. Then, for the 100 ML-selected dye candidates with desirable structural features and high values of ε, DFT and TD-DFT calculations were performed to predict their ground and excited state properties. Finally, benchmark MD simulations were conducted to reveal the interactions between the selected dye dimers and the DNA duplexes.

2. Methods

2.1. Machine Learning

Classifier and Regressor models were trained to identify ideal dye candidates with high extinction coefficients (ε) based on dye structure features. The Classifier model could quickly classify the dyes with either high or low ε, where we set a threshold of 150,000 M−1cm−1 for strong exciton coupling in dye aggregates. As it learned from the Classifier model, the Regressor model could further estimate the values of ε for the dyes. Three data sources, including Deep4Chem [57], PhotoChem CAD 3 [54], and Dyomics GmbH [58] (8802 molecule datapoints in total), were used to train and validate the models. We utilized SMILES format for the molecule. We also utilized RDKit [59] to calculate 284 different features, such as the maximum carbon chain length and aromatic, amide, and ester group counts. The development dataset, containing 90% of data points, was used to train and test various model hyperparameters. Based on the accuracy for the Classifier and R2 for the Regressor, a model was selected for further analysis. We then used the validation dataset, containing 10% of data points, to validate the selected model’s effectiveness. The molecules with ε of above 800,000 M−1cm−1 were excluded because the ε values deviated too greatly from the threshold of 150,000 M−1cm−1. Figure 1 shows the dataset breakdown with high and low ε values. The three data sources all have a data imbalance, where the number of molecules with low ε values is larger than that of molecules with high ε values. The effect of imbalanced data is discussed in Section 3.1.
Figure 1

Dataset breakdown for molecules with high and low extinction coefficients ε. The threshold is 150,000 M−1cm−1.

2.2. Density Functional Theory

Density functional theory (DFT) and time-dependent (TD-) DFT calculations were performed to optimize dye structures in the ground state and calculate solvation energies and transition dipole moments. Similarly to our previous work [46], the dyes were optimized with the M06-2X [60] functional and 6-31+G(d,p) basis set, to a residual force of 4.5 × 10−4 Hartree/Bohr. The M06-2X functional with 6-31+G(d,p) basis set was validated in our prior studies of pristine and substituted cyanine and squaraine dyes [45,46] and has been used successfully for the calculations of the excited state properties of similar systems [61,62,63]. Frequency calculations were conducted to confirm the ground state structures were true minima. To determine , single point, vertical excited state calculations using the M06-2X functional were performed on the ground state structures to obtain transitions to the first 30 excited singlet states and identify the state with the largest oscillator strength. Calculations of the ground and excited state properties were conducted with implicit water solvation using the integral equation formalism polarizable continuum model (IEFPCM) [64,65], which was successfully used for the excited state property calculations of similar systems [66,67,68]. Excited state calculations were conducted assuming nonequilibrium solvent conditions. To approximate the relative hydrophobicity of the dyes, the partitioning coefficient between n-octanol and water, , was calculated according to [69,70] where and are the Gibbs free energy of solvation for a dye in n-octanol and water, respectively; ; and . A more positive value of means a molecule is more hydrophobic, and a more negative value means a molecule is more hydrophilic. In general, the Gibbs free energy of solvation for a molecule () is a measure of the amount of energy required to dissolve the dye in solvent, and was calculated according to [45,46,68,71] where is the total energy of the dye in implicit solvent and is the total energy of the dye in vacuum. Calculations for the solvation energy were conducted using the universal solvation model based on density (SMD) variation of IEFPCM [72], which was useful for predicting the solvation energies of organic molecules [73] and calculating the relative hydrophobicity of modified squaraine dyes [70]. All DFT and TD-DFT calculations were conducted using the Gaussian16 software package [74].

2.3. Molecular Dynamics

Molecular dynamics (MD) simulations were performed with the GROMACS 2020.3 software package [75]. Dye–DNA structures were built using the UCSF ChimeraX software [76] with the dyes initialized on the outside of the DNA backbone. The OL15 force-field [77] with non-bonded modifications [78] was used for DNA parameters and the generalized amber forcefield (GAFF) [79] was used for dye parameters. Atomic charges for the dyes were calculated using the HF/6-31G* theory level [80]. The 26-basepair dsDNA duplex sequence and dye locations from Huff et al. [29] was used. The dye–DNA structures were solvated in TIP3P water [81] in a truncated octahedron box with 1.2 nm between the dye–DNA structure and the box edge. Mg2+ ions were used to neutralize the system. Cannon et al. and Huff et al. showed, experimentally, that by adding excess MgCl2 to solutions containing DNA duplexes with two pentamethine indocyanine Cy5 dyes, DNA Holliday junctions could be formed [28,29]. Because of this, no excess MgCl2 was used in the MD simulations apart from replacing a necessary number of water molecules with Mg2+ to achieve neutral charge. Neighbor-searching was used with a cutoff of 1.2 nm. Van der Waals interactions were limited to 1.2 nm, and the particle mesh Ewald (PME) was used with a real-space coulomb cutoff of 1.2 nm. Bonds to hydrogen atoms were constrained using the LINCS algorithm [82]. A timestep of 2 fs was used. The initial systems were energy-minimized with the steepest descent method for 1000 steps. Then, to achieve a well-relaxed starting structure, two subsequent 10 ns equilibration steps were performed with harmonic constraints, the first with 1000 spring constants applied to non-hydrogen atoms, and the second with 100 spring constants applied to non-hydrogen atoms, keeping the number of atoms, volume, and temperature constant. A final 10 ns equilibration was performed with no restraints. Following equilibration, 1 production simulations were carried out, keeping the number of atoms, pressure, and temperature constant. The velocity-rescale thermostat [83] was used to maintain a constant temperature of 300 K with a coupling time of 0.1 ps, with the DNA–dye and solvent being coupled separately. The Parrinello–Rahman barostat [84] was used to keep the pressure at 1 atm with a coupling time of 1.0 ps. Coordinates were written every 10 ps and the first 100 ns of the production simulations were treated as equilibration periods, and so were not used for analysis. To determine the transition dipole coupling strength between two dyes, the orientation of the dyes with respect to one another and the transition dipole moments are needed. The dye–dye center-to-center distances and orientation factors () were determined every 10 ps. The values of were determined using [48] where is the transition dipole moment unit vector of dye or (taken along the long axis of the dye), and is the unit vector between the centers of dyes and . When or , the dyes are in a stacked oblique orientation or tail-to-tail oblique orientation, respectively. When or , the dyes are in a stacked (H-aggregate) or head-to-tail (J-aggregate) orientation, respectively. The exciton exchange energy (), which is a measure of the strength of the transition dipole coupling between two dyes, depends on the transition dipole moment , which is related to [10,11,12] and can be obtained experimentally or using TD-DFT. The of a dimer was approximated using the extended dipole model according to [85] where the values and correspond to either end of dye or along the dye’s long axis. The pre-factor term, , is defined as [85] where is the transition dipole moment magnitude (calculated using TD-DFT) of dyes or , is the vacuum permittivity constant, is the refractive index of water (1.33), and and are the lengths of dyes and , respectively (such that ).

3. Results

3.1. Dye Screening Using Machine Learning and Density Functional Theory

A Random Forest Classifier was trained on the development dataset utilizing a five k-fold to determine the best ε threshold; the max feature, which refers to the maximum number of features to consider; the max depth, which refers to the maximum depth of the tree; and the class weight for the model. Every model we created has a high accuracy of 97% or above. Figure 2a shows the accuracy of the Random Forest Classifier with various ε thresholds in comparison with a model (labeled Always Low ε) that always classifies a molecule with low ε. The Always Low ε model has a high accuracy of around 92% or above, indicating that the high accuracy of 97% or above for the Random Forest Classifier is reasonable. In addition, the accuracy of the Random Forest Classifier starts to converge at 150,000 M−1cm−1, which supports our selection of the threshold of 150,000 M−1cm−1. To address a concern of the effect of imbalanced data on the accuracy of the models, we also trained and validated the models with different datasets that included various percentages of molecules with low ε values. Figure 2b shows the accuracy vs. percentage of low ε in the dataset. The value of 0.5 represents the balanced dataset containing 50% molecules with high ε values and 50% molecules with low ε values. The number of molecules with low ε values gradually increases, so the dataset becomes imbalanced. However, the accuracy of the Random Forest Classifier remains at 97% or above. These results indicate that the effect of the imbalanced data we used to train and validate the models is negligible.
Figure 2

Accuracy comparison between Random Forest Classifier and a model that always classifies a molecule with low ε no matter what structural features the molecule has. They both were developed and validated with (a) all and (b) partial data.

Based on the performance of the Random Classifier model, we developed a Random Forest Regressor model to further predict the precise ε value of a molecule. We trained the Regressor to determine its best hyperparameters of class weight, max features, max depth, and criterion as we did with the Random Forest Classifier. Figure 3 compares the predicted and actual ε values for the development and validation datasets, which have 0.95 and 0.91 for R2 (i.e., the coefficient of determination), respectively. The datasets and codes developed in this study will be published online.
Figure 3

The extinction coefficients ε of the (a) development and (b) validation datasets predicted by a Random Forest Regressor in comparison with actual values from literature.

To identify dyes with high (which is indicative of high ) the optimized Regressor model was further applied to a dataset containing around 18,000 potential dye candidates, including the commercially available cyanine dyes such as Cy3, Cy5, Cy5.5, and Cy7, shown in Figure 4. These dyes are known to exhibit large (and thus, large and strong dye coupling) [86], and are of interest to our research group and collaborators [27]. Four modified Cy5 dyes with hydrophobic substituents were also considered in this study. These dyes, labeled as Cy5-Cl, Cy5-Peg, Cy5-hex, and Cy5-tBu, are also shown in Figure 4 and are hypothesized to exhibit stronger dye coupling due to being more hydrophobic compared to Cy5, which may result in shorter inter-dye distances [87]. Two other modified dyes, Cy5-CN and Cy5-NMe2, were considered since CN and NMe2 were shown to have a large effect on excess dipole moments (the difference in the dipole moments of the ground and excited states) of similar dyes [45,46]. The rest of the dataset consisted of dyes obtained from PubChem [88], including dyes with similar structures to cyanine, porphyrin, and methyl violet molecules. Those three classes of dyes were chosen for their prominent -conjugation [89,90,91], absorption in the visible light range [90,92,93], and excitonic applications [90,92,94,95,96].
Figure 4

Chemical structures of dyes used in the present study and their machine learning (ML)-predicted extinction coefficients () in units of 1000 M−1cm−1. Dyes 1–15 were selected using a combination of ML, density functional theory (DFT), and time-dependent (TD)-DFT.

While can be extracted from experimental absorption spectra [10,11,12], it is not possible to determine based on the peak alone. Because of this, TD-DFT was used to calculate , to compare with experimentally available . Figure 5 shows a comparison of ML-predicted and TD-DFT-calculated with commercially available Cy3, Cy5, Cy5.5, and Cy7 dyes, as well as modified Cy5 dyes for which experimentally measured values of are available. The values for the commercially available dyes were obtained from their respective commercial websites, including AAT Bioquest [97], Lumiprobe [98], Glen Research [99], and Interchim [100], and from Huff et al. [29]. Because multiple vendors advertise slightly different values, a range is given in Figure 5. Meares et al. synthesized Cy5-hex, Cy5-Peg, Cy5-tBu, and Cy5-Cl for incorporation into DNA and measured the of the dyes incorporated into DNA strands at their peak wavelengths [87]. The ranges of for Cy5, Cy5-hex, Cy5-Peg, Cy5-tBu, and Cy5-Cl are likely due to small differences in local environments when the dyes are attached to DNA sequences and relative purities [87]. In general, ML-predicted values agree with the trend of experimental obtained from literature. Notably, ML-predicted values for Cy5, Cy5.5, and Cy7 are within the experimental range indicated by the shaded region in Figure 5. Similarly, the ML-predicted trend agrees with the trend of TD-DFT-calculated values, which are assumed to be correlated (i.e., larger leads to larger ). The dyes that do not fall into the range of experimental consist of the Cy5 derivatives developed by Meares et al. [87]. The percent errors from the experiments include 45% for Cy5-hex, 49% for Cy5-Peg, 10% for Cy5-tBu, and 20% for Cy5-Cl. Such differences could be caused by solvent, DNA-dye interactions, and dye purities [87]. Furthermore, the specific functional groups for Cy5-hex and Cy5-Peg might not be well represented by the ML training dataset, which could lead to inaccuracies when predicting . In general, the Regressor is able to predict the overall trend of , which is necessary for the screening of numerous new dyes for .
Figure 5

ML-predicted extinction coefficients ε and TD-DFT-calculated transition dipole moments μ of 10 dye candidates of interest in comparison with the experimentally available ε values [29,87,97,98,99,100].

The Regressor model was also applied to the dataset obtained from PubChem [88] to identify additional potential dye candidates. The top 100 dyes that were predicted to have above 150,000 M−1cm−1 were then screened using DFT and TD-DFT to determine their values by calculating vertical excited state transitions to the lowest 30 excited states. Of the 100 dye candidates, the 15 dye candidates with desirable properties, such as absorption wavelength in the visible region, large -conjugated networks, and comparable to that of Cy5 (within 50%), are shown in Figure 5 and are labeled 1–15. Their corresponding ML-predicted and TD-DFT-calculated values are listed in Table 1.
Table 1

ML-Regressor predicted extinction coefficient () and TD-DFT-calculated transition dipole moments () of dyes shown in Figure 4. Cartesian vector components along x, y, and z axis are provided in Table S1.

DyeML-Predicted ε × 1000 M1cm1 TD-DFT μ, Debye
Cy312612.25
Cy522215.35
Cy5-CN13815.66
Cy5-NMe221615.83
Cy5-Cl21415.74
Cy5-hex11016.22
Cy5-Peg10716.19
Cy5-tBu22415.90
Cy5.520515.57
Cy723517.62
13079.08
22887.84
326520.25
42407.99
523520.17
622914.00
722710.66
822715.49
922614.68
1022316.19
1122214.00
1221811.26
132169.52
142129.34
1521010.48
Comparing the TD-DFT calculated values of for Cy3, Cy5.5, Cy7, and the Cy5 derivatives, Cy3 has the lowest . Disregarding Cy5-hex and Cy5-Peg, the values of and the ML-predicted generally follow the same trend, with Cy3 having the smallest and Cy7 having the largest. Comparing the 15 selected dyes from the Regressor model, all dyes are predicted to have an above 210,000 M−1cm−1. Dye 1 has the largest overall ML-predicted of 309,000 M−1cm−1, with a TD-DFT of 9.08 D. Dye 3 has the largest overall TD-DFT of 20.25 D and an ML-predicted of 265,000 M−1cm−1. Figure 6 shows the log(Po/w) values calculated with vacuum and implicit solvent DFT for the ML-selected dyes. A more positive value of log(Po/w) indicates a more hydrophobic dye, and a more negative value of log(Po/w) indicates a more hydrophilic dye. It is hypothesized that by increasing hydrophobicity, dyes may aggregate closer, thus improving coupling. This has been demonstrated in a set of squaraine dyes modified with hydrophobic substituents [70], and the values of log(Po/w) for the hydrophobic squaraine dyes are similar to those for the Cy5 derivatives. Comparing the Cy5 derivatives, Cy3, Cy5.5, and Cy7, Cy5-hex is the most hydrophobic and Cy3 is the least hydrophobic, closely followed by Cy5-CN. Most of the 15 dyes chosen from the ML Regressor model predictions exhibit hydrophobicity similar to that of Cy5. Three dyes are hydrophilic (dyes 2, 4, and 13), with dye 13 being the most hydrophilic. These three dyes also exhibit relatively low values compared to the rest of the dataset, indicating that they may not be suitable for excitonic applications that require close inter-dye separations and large transition dipole moment couplings. Conversely, dyes 7, 12, and 15 exhibit the most positive log(Po/w) values, meaning they are estimated to be the most hydrophobic. Furthermore, dyes 7, 12, and 15 have values within 25% of that for Cy5, making those dyes suitable for excitonic applications. Dyes 3 and 5 have log(Po/w) values slightly larger than that of Cy5, and values about 5 D larger than Cy5. Overall, based on our criteria—a large (indicating large ) and a large positive log(Po/w)—dyes 3 and 5 are the most promising candidates in the dataset for excitonic applications.
Figure 6

Partition coefficients of dyes in water versus n-octanol (log(Po/w)), calculated using Equation (1), where Gibbs free energies of solvation of dyes in implicit water and n-octanol solvents are provided in Table S2. A more positive log(Po/w) means a molecule is more hydrophobic, and a more negative log(Po/w) means a molecule is more hydrophilic. The labels 1–15 and dye names correspond to the dye structures in Table 1.

3.2. Molecular Dynamics Simulations of Dye Aggregate–DNA Duplex Interactions

To study the effects of DNA on the dye orientations, 1 MD simulations were performed with two dyes covalently bound to the backbone of DNA duplexes via dual phosphoramidite linkers. In our study, we started with commercially available Cy5 and Cy5.5 as reference dyes to guide other dye candidate selection. Our research group experimentally demonstrated that Cy5 can exhibit aggregation, strong absorption, and excitonic coupling when attached to DNA duplexes and Holliday Junctions [27,28,29]. Due to its similar structure to Cy5, Cy5.5 should exhibit similar properties [38]. The extra aryl groups on Cy5.5 extend the conjugation and add to the size of the molecule, which could affect dye-packing and make slightly larger than , as shown in Table 1. Furthermore, as shown in Figure 6, log(Po/w) for Cy5.5 is 40% larger than that for Cy5, indicating that Cy5.5 might pack closer than Cy5 when aggregated. Thus, we chose a Cy5 dimer and a Cy5.5 dimer for MD simulations as a benchmark for comparison with future simulations of other selected dyes. The simulations were performed in water at a 1 atm pressure and 300 K. Dye orientations were quantified using the orientation factor, , calculated using Equation (3). Dimer exciton exchange energies, , were quantified using Equation (4) with inputs from TD-DFT for the transition dipole moments, where and , as shown in Table 1. The vectors corresponding to were found to primarily reside on the long axis of the dyes, and thus, the values of and in Equation (4) were chosen as the centers of the terminal aryl groups of the dyes. The MD results for the Cy5 dimer attached to a DNA duplex are presented in Figure 7 and Figure 8. Figure 7a shows a heatmap plot of versus , and Figure 7b shows a heatmap plot of versus for the 900 ns of data collection. Based on the values shown in Figure 7a, there are two distinct dimer orientations, labeled “O1” and “O2”. The approximate regions corresponding to O1 and O2 are also shown in Figure 7b. As shown in Figure 7c, O1 corresponds to where the dyes are located outside of the duplex (i.e., non-intercalated). This orientation has a value ranging approximately from 0–1 (oblique and H-like aggregate) and an of approximately 2.5–3.0 nm, which results in a relatively low of less than 20 meV. Examining Figure 8, a shift in the orientation of the dyes occurs after around 150 ns of simulation, where the dyes re-orient and intercalate into the base-stack region of the DNA. This change in orientation results in a mostly head-to-tail (J-like) configuration with some obliqueness, corresponding to a of about 1.25–1.5 and an of about 0.9–1.5 nm, as indicated by O2 and represented in Figure 7d. This more closely spaced orientation results in a larger of roughly 40–80 meV. Since |J| was relatively stable after 200 ns, all dimer orientations beyond that time were averaged, yielding and nm. Thus, averaging over this period of time results in meV. The post-200 ns average values of and agree well with the experimentally derived values of nm and meV, as obtained by Huff et al. [29]. Furthermore, they determined the dimer to have a similar orientation, with a red-shift in the main absorption peak observed for the dimer relative to the monomer, indicating a mostly J-like orientation [29]. The meV larger value obtained from MD might be caused by a slight overestimation of using TD-DFT, which is shown to be D larger than the obtained from experimental measurements [29,46].
Figure 7

Heatmap plots of (a) orientation factor () and (b) exciton exchange energy () versus dye center-to-center distances () for the 900 ns Cy5 dimer–DNA duplex molecular dynamics (MD) trajectory. Snapshots of the structural configurations of (c) pre-intercalation (corresponding to region O1) and (d) post-intercalation (corresponding to region O2) that represent regions in the vs. heatmap. The DNA duplex is shown in blue and the Cy5 dyes and linkers are shown in orange.

Figure 8

Plots of (a) dye center-to-center distance (), (b) orientation factor (), and (c) exciton exchange energy () versus time for the 900 ns Cy5 dimer–DNA duplex MD trajectory. The first 100 ns of the simulation were treated as an equilibration and are therefore excluded.

Compared to the Cy5 dimer, the Cy5.5 dimer exhibits a similar trajectory. The Cy5.5 dyes were initialized, outside of the DNA duplex (i.e., non-intercalated). After about 400 ns, the dyes intercalated, reducing and increasing (more J-like) and . The two regions corresponding to the non-intercalated and intercalated states of the Cy5.5 dimer are labeled as “O1” and “O2” in Figure 9a,b, which are represented in the snapshots in Figure 9c,d. The O1 region of the Cy5.5 dimer has an that ranges from about 2.2–3.2 nm. However, for the Cy5.5 dimer has a range of roughly 0–1.5, which is larger than that for the Cy5 dimer. However, for the O1 region of the Cy5.5 dimer is between 0–15 meV, comparable to that of the Cy5 dimer. The ranges of and of the O2 region of Cy5.5 are roughly 0.8–1.7 nm and 1.2–1.75, respectively, similar to that of the Cy5 dimer. This range of orientations results in a range of about 30–60 meV, slightly smaller than that of the Cy5 dimer.
Figure 9

Heatmap plots of (a) orientation factor (), and (b) exciton exchange energy () versus dye center-to-center distances () for the 900 ns Cy5.5 dimer–DNA duplex MD trajectory. Snapshots of the structural configurations of (c) pre-intercalation (corresponding to region O1) and (d) post-intercalation (corresponding to region O2) that represent peaks in the vs. heatmap. The DNA duplex is shown in blue and the Cy5.5 dyes and linkers are shown in orange.

Averaging the dimer orientations past 400 ns (after which and are relatively stable, as shown in Figure 10) results in and nm. Similarly, averaging over this time period results in meV. Even though the two dimers exhibit similar orientations, the smaller for the Cy5.5 dimer might be caused by a small structure difference between Cy5.5 and Cy5. Cy5.5 is slightly larger and longer than Cy5 (by about 0.1 nm) due to the extra aryl groups at the two ends of the dye. Despite the smaller average and similar orientations, the larger dye length of Cy5.5 compared to that of Cy5 results in a smaller pre-factor term , leading to the smaller . In a similar study, it was found that Cy5.5 homodimers attached to transverse strands on a DNA Holliday junction exhibited closer dye distances and larger values than Cy5 homodimers [38]. A potential reason for this difference might be that the DNA Holliday junctions are more flexible than the duplexes, which allow dye orientations to promote larger J values.
Figure 10

Plots of (a) dye center-to-center distance (), (b) orientation factor (), and (c) exciton exchange energy () versus time for the 900 ns Cy5.5 dimer–DNA duplex MD trajectory. The first 100 ns of the simulation were treated as an equilibration and are therefore excluded.

4. Discussion

Based on the results shown in Figure 5, the developed Regressor is useful for the prediction of experimental trends. This was shown by predicting the of a dataset of over 18,000 molecules, from which 15 were identified using a combination of Regressor predictions and TD-DFT calculations. In general, according to desired criteria the developed ML model can quickly and accurately screen a large dataset of dyes for optimum for further study, which could save computation and experimental time. There are improvements that can be made to the developed ML models. Primarily, more data would improve the chemical space on which the ML models would be trained, such as more chemical structures, as well as the inclusion of the solvents used. This is highlighted in Figure 1, which shows a disproportion in large versus small values in the training set. Furthermore, the present model could benefit from optimization and feature engineering, which could further reduce the computation time and help to improve the model by identifying more features that may better describe dye properties. Other types of ML models such as Support Vector Machine (SVM) algorithms could also be explored, which could potentially lead to better predictions. Finally, an improvement could be implemented by including human knowledge in the workflow so as to identify desirable structural features; this could be useful in interpreting data, improving efficiency, and enhancing model performance. Examining Figure 5 and Table 1, the trends of the TD-DFT-calculated and the available experimentally determined mostly agree when considering the upper range of values. The deviations in compared to could be due to the solvent chosen for TD-DFT calculations (in our case, water) or the exchange–correlation functional used for the calculations. For example, our prior studies showed that the CAM-B3LYP functional produced slightly different values for similar dyes, however, the overall trend remained the same [46]. In the case of the Cy5 derivatives, the values might be affected by the specific DNA strand to which the dyes are attached, leading to the range in shown in Figure 5 for those dyes. This could also lead to some degree of disagreement with the TD-DFT calculations of (only considering free dyes in water) and the ML-predicted . Furthermore, in general, there is no strong correlation between the ML-predicted and TD-DFT for the 15 dyes shown in Table 1. This highlights the need for TD-DFT calculations as a validation for the initial dye screening using the ML Regressor, as the ML model is not 100% accurate. Studying the values of in Table 1, the main dye feature leading to a large is long conjugated chains, as is the case for dyes 3 and 5. The same trend is observed for Cy5 and Cy7, which have two and four more carbons in the conjugated chain compared to Cy3, respectively. Despite having similar structures, dye 10 has a ~1 D larger compared to dye 8, which might be caused by the Cl atom bonded to the polymethine chain in dye 8. Examining Figure 6, the Cy5 derivatives modified with hydrophobic groups exhibit an enhanced hydrophobicity (as exemplified by more positive values), and the trend of follows the same trend obtained by Meares et al. who used the Percepta Platform to predict based on chemical structure [87]. Since these dyes are derived from the Cy5 dye, which has been shown to form aggregates when attached to DNA scaffolds [27,28,29], they are promising candidates for further studies with an aim of enhancing . Examining the for the 15 ML-selected dyes, most exhibit values similar to Cy5 and its derivatives, with the notable exceptions of dyes 2, 4, 7, 12, 13, and 15. Dyes 2, 4, and 13 have a negative and are, therefore, expected to be hydrophilic. Dye 2 is structurally similar to dyes 1, 7, and 12, which have positive values. The hydrophilicity of dye 2 might be attributed to the Se atoms in the ring, which dyes 1, 7, and 12 do not contain. Dyes 4 and 13 contain S groups, which are known to increase hydrophilicity [101]. Dyes 7, 12, and 15 stand out as the most hydrophobic of the dyes tested and all belong to the same family of porphyrins, which are known to be hydrophobic [102,103]. Based on their relatively high and enhanced hydrophobicity, dyes 7, 12, and 15 (porphyrin-based dyes) are promising candidates for dye–DNA applications. Furthermore, it has been shown that porphyrin dyes are suitable for bonding to DNA and are able to intercalate between bases, which may be beneficial for dye aggregation in DNA [102]. Our MD simulations show that the Cy5 dimer and Cy5.5 dimer orient similarly when attached to the same positions on a DNA duplex. Comparing the simulation results for the Cy5 dimer to Huff et al., our predicted and are within 20% and 5%, respectively [29]. It should be noted that the results shown by Huff et al. were obtained using a program developed by our group based on the Kuhn–Renger–May (KRM) theory, which only considers absorbance and circular dichroism spectra to obtain dye orientations [29]. However, the KRM-based program does not provide information on the details of the DNA–dye orientations or the impact of the DNA on the dye orientation as our MD simulations do. Based on the MD results, one possible mechanism is dye intercalation into the DNA duplex from outside of the DNA backbone to form a J-dimer, which occurs for both Cy5 and Cy5.5 simulations. Our results highlight the effectiveness of MD simulations for predicting dye orientations and the excitonic properties of dyes bound to DNA duplexes. For future studies, we will continue applying MD to additional ML-selected dye aggregates in DNA scaffolds.

5. Conclusions

ML models were developed through training and validation on a set of 8802 molecules to predict based on dye structures. A Random Forest Classifier was developed with an accuracy of 97%, and based on the performance of the Classifier, a Random Forest Regressor was constructed. Comparing ML-predicted from the Regressor and experimental , the Regressor was found to have a maximum of 0.95. Using the Regressor, the values of molecules in a dataset of around 18,000 were predicted, and the top 100 dyes were used in TD-DFT calculations to calculate . Overall, 15 dyes were identified to have a relatively large comparable to Cy5. MD simulations were conducted on reference dyes (Cy5 and Cy5.5) to determine the dye dimer orientations and transition dipole moment couplings, . For Cy5, the MD simulations were able to predict dye orientations and within 20% of the experiment. The Cy5.5 dimer yielded similar results to the Cy5 dimer. The successful use of the combined ML and DFT/TD-DFT screening to identify dyes with a large and highlights the effectiveness of our workflow to screen numerous dyes for desired properties. The agreement of our MD simulations with the experiment show that we can accurately detect dye dimer properties when attached to DNA scaffolds, which is crucial to guide dye design and synthesis for excitonic applications.
  59 in total

1.  Choosing a Functional for Computing Absorption and Fluorescence Band Shapes with TD-DFT.

Authors:  Azzam Charaf-Eddin; Aurélien Planchat; Benedetta Mennucci; Carlo Adamo; Denis Jacquemin
Journal:  J Chem Theory Comput       Date:  2013-05-16       Impact factor: 6.006

Review 2.  Squaraine Dyes: Molecular Design for Different Applications and Remaining Challenges.

Authors:  Kristina Ilina; William M MacCuaig; Matthew Laramie; Jannatun N Jeouty; Lacey R McNally; Maged Henary
Journal:  Bioconjug Chem       Date:  2019-08-12       Impact factor: 4.774

3.  Large Davydov Splitting and Strong Fluorescence Suppression: An Investigation of Exciton Delocalization in DNA-Templated Holliday Junction Dye Aggregates.

Authors:  Brittany L Cannon; Lance K Patten; Donald L Kellis; Paul H Davis; Jeunghoon Lee; Elton Graugnard; Bernard Yurke; William B Knowlton
Journal:  J Phys Chem A       Date:  2018-02-19       Impact factor: 2.781

4.  Fluorescent J-aggregates of cyanine dyes: basic research and applications review.

Authors:  Julia L Bricks; Yuri L Slominskii; Ihor D Panas; Alexander P Demchenko
Journal:  Methods Appl Fluoresc       Date:  2017-12-21       Impact factor: 3.009

5.  Quantifying Planarity in the Design of Organic Electronic Materials.

Authors:  Yuxuan Che; Dmitrii F Perepichka
Journal:  Angew Chem Int Ed Engl       Date:  2020-11-12       Impact factor: 15.336

6.  Effect of substitution on the excited state photophysical and spectral properties of boron difluoride curcumin complex dye and their derivatives: A time dependent-DFT study.

Authors:  Karthick Selvam; Sivaraman Gandhi; Sailaja Krishnamurty; Gopu Gopalakrishnan
Journal:  J Photochem Photobiol B       Date:  2019-08-22       Impact factor: 6.252

7.  Optical Properties of Vibronically Coupled Cy3 Dimers on DNA Scaffolds.

Authors:  Paul D Cunningham; Young C Kim; Sebastián A Díaz; Susan Buckhout-White; Divita Mathur; Igor L Medintz; Joseph S Melinger
Journal:  J Phys Chem B       Date:  2018-05-08       Impact factor: 2.991

8.  Coherent Exciton Delocalization in a Two-State DNA-Templated Dye Aggregate System.

Authors:  Brittany L Cannon; Donald L Kellis; Lance K Patten; Paul H Davis; Jeunghoon Lee; Elton Graugnard; Bernard Yurke; William B Knowlton
Journal:  J Phys Chem A       Date:  2017-09-06       Impact factor: 2.781

9.  Rotaxane rings promote oblique packing and extended lifetimes in DNA-templated molecular dye aggregates.

Authors:  Matthew S Barclay; Simon K Roy; Jonathan S Huff; Olga A Mass; Daniel B Turner; Christopher K Wilson; Donald L Kellis; Ewald A Terpetschnig; Jeunghoon Lee; Paul H Davis; Bernard Yurke; William B Knowlton; Ryan D Pensack
Journal:  Commun Chem       Date:  2021-02-18

10.  DFT study of the effect of substituents on the absorption and emission spectra of Indigo.

Authors:  Francisco Cervantes-Navarro; Daniel Glossman-Mitnik
Journal:  Chem Cent J       Date:  2012-07-18       Impact factor: 4.215

View more
  1 in total

1.  Tunable Electronic Structure via DNA-Templated Heteroaggregates of Two Distinct Cyanine Dyes.

Authors:  Jonathan S Huff; Sebastián A Díaz; Matthew S Barclay; Azhad U Chowdhury; Matthew Chiriboga; Gregory A Ellis; Divita Mathur; Lance K Patten; Simon K Roy; Aaron Sup; Austin Biaggne; Brian S Rolczynski; Paul D Cunningham; Lan Li; Jeunghoon Lee; Paul H Davis; Bernard Yurke; William B Knowlton; Igor L Medintz; Daniel B Turner; Joseph S Melinger; Ryan D Pensack
Journal:  J Phys Chem C Nanomater Interfaces       Date:  2022-09-28       Impact factor: 4.177

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.