Rania M Hathout1, AbdelKader A Metwally1,2, Timothy J Woodman3, John G Hardy4,5. 1. Department of Pharmaceutics and Industrial Pharmacy, Faculty of Pharmacy, Ain Shams University, Cairo 11566, Egypt. 2. Department of Pharmaceutics, Faculty of Pharmacy, Health Sciences Center, Kuwait University, Kuwait 90805, Kuwait. 3. Department of Pharmacy and Pharmacology, University of Bath, Bath BA2 7AY, U.K. 4. Department of Chemistry, Lancaster University, Lancaster, Lancashire LA1 4YB, U.K. 5. Materials Science Institute, Lancaster University, Lancaster, Lancashire LA1 4YB, U.K.
Abstract
The delivery of drugs is a topic of intense research activity in both academia and industry with potential for positive economic, health, and societal impacts. The selection of the appropriate formulation (carrier and drug) with optimal delivery is a challenge investigated by researchers in academia and industry, in which millions of dollars are invested annually. Experiments involving different carriers and determination of their capacity for drug loading are very time-consuming and therefore expensive; consequently, approaches that employ computational/theoretical chemistry to speed have the potential to make hugely beneficial economic, environmental, and health impacts through savings in costs associated with chemicals (and their safe disposal) and time. Here, we report the use of computational tools (data mining of the available literature, principal component analysis, hierarchical clustering analysis, partial least squares regression, autocovariance calculations, molecular dynamics simulations, and molecular docking) to successfully predict drug loading into model drug delivery systems (gelatin nanospheres). We believe that this methodology has the potential to lead to significant change in drug formulation studies across the world.
The delivery of drugs is a topic of intense research activity in both academia and industry with potential for positive economic, health, and societal impacts. The selection of the appropriate formulation (carrier and drug) with optimal delivery is a challenge investigated by researchers in academia and industry, in which millions of dollars are invested annually. Experiments involving different carriers and determination of their capacity for drug loading are very time-consuming and therefore expensive; consequently, approaches that employ computational/theoretical chemistry to speed have the potential to make hugely beneficial economic, environmental, and health impacts through savings in costs associated with chemicals (and their safe disposal) and time. Here, we report the use of computational tools (data mining of the available literature, principal component analysis, hierarchical clustering analysis, partial least squares regression, autocovariance calculations, molecular dynamics simulations, and molecular docking) to successfully predict drug loading into model drug delivery systems (gelatin nanospheres). We believe that this methodology has the potential to lead to significant change in drug formulation studies across the world.
The global market for
drug delivery systems is a multibillion-dollar
industry, demand for which is growing in both developed and emerging
economies (in part, driven by aging societies and rapid urbanization).[1−9] Drug delivery systems deliver drugs at rates controlled by specific
features of the systems, particularly their chemical composition (e.g.,
inorganic/organic components, molecular weights of their constituents,
cross-linking density of polymers, etc.).[10−12]The selection
of the appropriate system (carrier and drug) to obtain
optimal delivery is a challenge investigated by researchers in academia
and industry, in which millions of dollars are invested annually.[13] Experiments involving different carriers and
determination of their capacity for drug loading are very time-consuming
and therefore expensive. Consequently, approaches that exploit multivariate
statistical methods, molecular simulations, docking methods, and mining
the data in the literature[14−19] have the potential to make hugely beneficial economic, environmental,
and health impacts through savings in costs associated with chemicals
(and their safe disposal) and time.Computational/theoretical
chemists/biochemists, biomedical/chemical
engineers, and pharmacists have developed a variety of methodologies
that can be applied to understand drug formulations. Principal component
analysis (PCA) and hierarchical clustering analysis (HCA) are considered
exploratory data analysis and unsupervised machine learning methods,
where these techniques extract patterns from the independent factors
(x-variables) only and irrelevant to the y-outcomes. Partial least squares (PLS) is a supervised
pattern recognition method correlating the inputs with outputs and
subsequently leads to the generation of a model.[20] This data mining approach (through a retrospective analysis)
combined with computer-aided analysis and simulation extracts knowledge
from complex variables and responses obtained from historical records.
The significant advantage of this approach is the possibility of uncovering
interactions and linear relationships that might not be easily detectable
with conventional experimental designs.[21] Although not yet fully explored in drug formulation/delivery, multivariate
statistical methods such as PCA and agglomerative HCA were previously
used to develop drug delivery formulations. For example, PCA was utilized
to generate a quantitative composition–permeability relationship
for microemulsion formulations used to deliver testosterone transdermally,
with a linear relationship between the lower-dimensionality data generated
from the main principal component and the permeability coefficients
of the different formulations.[22] PCA and
HCA were used to extract stable SMEDDS (self-microemulsifying drug
delivery systems) and SNEDDS (self-nanoemulsifying drug delivery systems)
formulations of lovastatin and glibenclamide, respectively,[23,24] and PCA and PLS analysis were used to assess the qualitative and
quantitative effects of different variables such as lipid/surfactant
type and their concentrations on parameters related to storage stability.[25] Furthermore, PLS was successfully employed to
predict the sizes and polydispersity index (PDI) for lipid nanocapsules
based on the quantitative mixture composition.[26]Here, we extend these exciting studies by combining
PCA, HCA, and
PLS with molecular dynamics and docking analysis[27] to give valuable insight into drug loading in a polymer
matrix. As a model polymer matrix, we use protein-based nanoparticulate
drug delivery systems (i.e., nanospheres composed of collagen-derived
gelatin). Gelatin is an abundant and inexpensive protein,[28] which is amphiphilic in nature due to its amino
acid contents (ca. 12% anionic glutamic and aspartic acid, ca. 13%
cationic lysine and arginine amino acids, and ca. 11% hydrophobic
leucine, isoleucine, methionine, and valine),[29] and gelatin-based matrices can in principle be used to deliver both
small molecules and macromolecules.[30−36] In this study, we focus on a selection of low-molecular-weight drugs
used in the clinic, as depicted in Figure .
Figure 1
Chemical structures of the substances studied
herein: (A) acyclovir,
(B) cryptolepine, (C) amphotericin B, (D) doxorubicin, (E) 5-fluorouracil
(5FU), (F) isoniazid, (G) resveratrol, (H) paclitaxel, (I) indomethacin,
and (J) curcumin.
Chemical structures of the substances studied
herein: (A) acyclovir,
(B) cryptolepine, (C) amphotericin B, (D) doxorubicin, (E) 5-fluorouracil
(5FU), (F) isoniazid, (G) resveratrol, (H) paclitaxel, (I) indomethacin,
and (J) curcumin.
Materials
and Methods
Data Set
The data set contained four
input variables (descriptors) and one output response (mass of drug
loaded per 100 mg gelatin nanospheres determined experimentally) for
different drugs. Data mining was performed through different databases
such as PubMed and Web of Science to obtain the output response for
10 drugs: acyclovir,[37] amphotericin B,[38] cryptolepine,[39] doxorubicin,[40] 5-fluorouracil (5FU),[41] isoniazid,[42] resveratrol,[43] curcumin,[17] paclitaxel,[44] and indomethacin.[45]
Calculation of Molecular Descriptors
The
drugs were analyzed using Bioclipse version 2.6 (Bioclipse project,
Uppsala University, Sweden).[39] The four
descriptors chosen were constitutional (molecular weight), electronic
(number of hydrogen bond donors and number of hydrogen bond acceptors),
and physicochemical (xLogP).
Hierarchical
Clustering Analysis (HCA)
The molecular descriptors generated
using Bioclipse version 2.6 were
subjected to hierarchical clustering analysis using JMP 7.0 (SAS,
Cary, NC, USA). Ward’s minimum variance method was adopted
to join the clusters and generate a dendrogram. Ward’s method
is considered an agglomerative hierarchical technique where the merging
in the dendrogram starts at the final clusters (leaves) and merging
occurs stepwise until it reaches the trunk. Ward’s minimum
variance criterion minimizes the total within-cluster variance. At
each step, the pair of clusters possessing the minimum between-cluster
distance is merged (i.e., the pair of clusters that leads to the minimum
increase in the total within-cluster variance after merging is selected).[45]
Principal Component Analysis
(PCA)
PCA was used to extract patterns using an exploratory
data analysis
method that deals with the variances in sample observations. PCA was
performed using JMP 7.0. Four principal components were calculated
by taking a linear combination of an eigenvector of the correlation
matrix built up from standardized original variables. The dimensionality
of the data was reduced by extracting two main principal components
possessing the two highest eigenvalues and plotting the data with
respect to these two new orthogonal axes.
Partial
Least Squares Analysis (PLS) for Model
Generation and Validation of the Model
PLS was used to study
correlations between the molecular descriptors and the output response.
PLS was performed using JMP 7.0 using four latent vectors. The PLS
generated model was validated by checking the differences between
the mean actual and predicted response values using t-test statistical analysis at P < 0.05 using
GraphPad Prism v.5.0 (GraphPad software Inc., San Diego, CA, USA)
and by performing a k-fold (5-fold) cross-validation
(leave-two-out) to check the predictability of the model and its ability
to navigate the experimental space. The value of Q2 (predicted R-squared) was calculated
as followswhere PRESS represents the predicted
residual
error sum of squares, while ISS stands for the total initial sum of
squares. Moreover, a predicted versus actual correlation was obtained.
Molecular Dynamics Simulations (MDS) of the
Gelatin Matrix
Molecular dynamics simulations (MDS) were
carried out using the GROMACS[46] v. 4.6.5
freeware (http://www.gromacs.org/). To prepare the gelatin system, 48 peptide molecules were constructed,
with 18 amino acids in each molecule. The primary sequence of the
peptides was AGPRGQ(Hyp)GPAGPDGQ(Hyp)GP. Six hypothetical probe molecules
(with a calculated molecular weight of 767.13) were added at random
positions to the system. The force-field parameters were obtained
from CgenFF[47] (https://cgenff.paramchem.org/). The system was energy minimized by the steepest descent method.
Molecular dynamics was subsequently carried out, with a time step
of 2 fs, full periodic boundary conditions, and a cutoff distance
of 1.2 nm for van der Waals and electrostatic interactions.[48] PME was chosen to handle long-range electrostatic
interactions. All bonds were constrained by the LINCS algorithm. The
MDS were carried out for 3 ns at 373 K and 1 bar using a v-rescale
thermostat and a Berendsen barostat, respectively.[49]
Drug Docking in Simulated
Gelatin Nanospheres
The chemical structures of the studied
drugs were drawn using ChemDraw
Ultra version 10 (Cambridgesoft, Waltham, MA, USA). The corresponding
“.mol2” files needed for docking experiments were obtained
using Chem3D Ultra version 10 (Cambridgesoft, Waltham, MA, USA) after
energy minimization using the MM2 force field of the same program.
Docking analysis was generated by Argus Lab version 4.0.1 (Mark Thompson
and Planaria Software LLC, Seattle, WA, USA). The hypothetical probe
molecules were utilized to construct corresponding binding sites on
the carrier (gelatin-probe), and the AScore was utilized for calculating
the scoring function. The size of the display box in the x, y, and z dimensions were 15 ×
15 × 15 Å as these dimensions were suitable to the size
of the docked molecules and ensured a central position for them inside
the gelatin matrix. Additionally, the genetic algorithm was used as
the docking engine with 150 maximum poses. The type of calculation
and ligand (as chosen using the software options) were Dock and Flexible,
respectively, and the binding energies (ΔG,
kcal/mol) reflecting the docking efficiencies were calculated.
Results
Table reports
the molecular descriptors (number of hydrogen bond donors, number
of hydrogen bond acceptors, xLogP, and molecular weight) for the investigated
drugs. The dendrogram classifying these drugs according to HCA using
Ward’s minimum variance method (an agglomerative type of analysis)
is displayed in Figure . Isoniazid and 5FU were clustered together according to their four
descriptors, Resveratrol and cryptolepine clustered together, whereas
doxorubicin, acyclovir, and amphotericin B constituted separate clusters.
Importantly, the loading pattern followed this classification (see Table ) where 5FU and isoniazid
scored the highest loading masses followed by acyclovir, which is
closest to the aforementioned drugs in the dendrogram. Cryptolepine
and resveratrol were very close, with doxorubicin near to them. Amphotericin
B had the lowest mass loaded into the nanospheres, which was clear
from its separate branch (furthest distance) in the dendrogram.
Table 1
Descriptors of the Drugs, Amounts
of Loaded Drug, and the Obtained Binding Energies from Docking of
the Drugs on a Simulated Gelatin Matrix
drug
xLogP
no. H-bond
donors
no. H-bond
acceptors
molecular
weight (g/mol)
actual amount
of drug loaded (mg/100 mg gelatin)
Lamarckian
genetic algorithm ΔG (kcal/mol)
acyclovir
–1.650
3
8
225.21
8.74
–3.94
amphotericin B
2.068
12
18
923.49
1.16
144.4
cryptolepine
2.180
0
2
233.30
2.00
–3.81
doxorubicin
–1.900
6
9
543.52
2.10
58.29
5-fluorouracil
–0.760
2
4
130.00
25.07
–4.19
isoniazid
–0.683
3
4
137.14
22.00
–4.16
resveratrol
2.050
3
3
228.24
1.96
–3.74
curcumin
1.95
2
6
368.13
3.50
–2.59
paclitaxel
6.15
4
14
853.33
0.52
173.5
indomethacin
3.78
1
4
338.14
1.91
–1.99
Figure 2
Hierarchical
clustering analysis (HCA) of the investigated drugs
with respect to four constitutional, electronic, and physicochemical
descriptors: number of hydrogen bond donors, number of hydrogen bond
acceptors, xLogP, and molecular weight.
Hierarchical
clustering analysis (HCA) of the investigated drugs
with respect to four constitutional, electronic, and physicochemical
descriptors: number of hydrogen bond donors, number of hydrogen bond
acceptors, xLogP, and molecular weight.A score plot of the drugs
with respect to their descriptors after
projecting the data into two main principal components is displayed
in Figure , where
principal component 1 and principal component 2 reflect 69.72 and
26.95% of the data variation, respectively (corresponding to 96.68%
of total variance; Figure , top right panel), and 5FU and isoniazid are clustered together
with acyclovir having the nearest score, and amphotericin B the furthest
score. Figure depicts
the loading plots of the two main principal components. It is obvious
that principal component 1 is mainly composed of the descriptors:
the molecular weight, the number of the H-bond donors, and the number
of the number of H-bond acceptors, while principal component 2 mainly
depends on the remaining descriptor, xLogP. These results confirm
the presentation of the four investigated variables in the two generated
principal components.
Figure 3
Principal component analysis (PCA) score plot of the investigated
drugs with respect to four constitutional, electronic, and physicochemical
descriptors: number of hydrogen bond donors, number of hydrogen bond
acceptors, xLogP, and molecular weight, displaying only two main combined
components. The upper panel depicts the scree plot revealing the percentage
variation of each extracted component (combined from the four descriptors).
Figure 4
Principal component analysis (PCA) loading plot of the
two main
principal components.
Principal component analysis (PCA) score plot of the investigated
drugs with respect to four constitutional, electronic, and physicochemical
descriptors: number of hydrogen bond donors, number of hydrogen bond
acceptors, xLogP, and molecular weight, displaying only two main combined
components. The upper panel depicts the scree plot revealing the percentage
variation of each extracted component (combined from the four descriptors).Principal component analysis (PCA) loading plot of the
two main
principal components.The relationship between
the obtained combined x-scores (combining the contribution
from the four x-variables viz. descriptors) and y-scores is displayed
in Figure , and the
screen plot (Figure , bottom right) depicts the contribution of each individual latent
factor to the combined x-scores with the first two
factors accounting for 96.64% of the obtained scores.
Figure 5
Partial least squares
regression analysis (PLS) of the investigated
drugs with four constitutional, electronic, and physicochemical descriptors:
number of hydrogen bond donors, number of hydrogen bond acceptors,
xLogP, and molecular weight as the x-factors and
the mass of loaded drug per 100 mg gelatin nanoparticles as the y-factor. The lower panel depicts the contribution of each
latent x-factor (combined factor) to the x-scores representing the combined x-dimension.
Partial least squares
regression analysis (PLS) of the investigated
drugs with four constitutional, electronic, and physicochemical descriptors:
number of hydrogen bond donors, number of hydrogen bond acceptors,
xLogP, and molecular weight as the x-factors and
the mass of loaded drug per 100 mg gelatin nanoparticles as the y-factor. The lower panel depicts the contribution of each
latent x-factor (combined factor) to the x-scores representing the combined x-dimension.It is noteworthy that the generated x- and y-scores represent the distances of the points
in space
of all the dimensions to the main vector summarizing the final dimension
(in the current case, there is a principal component or vector for
the x-dimension comprising all the descriptors and
another for the y-dimension representing the loaded
mass). Therefore, the aforementioned scores can be negative numbers.
Consequently, a generated model was developed, whereThe values and the signs of the coefficients of the x-factors in the equation were indicative of the importance
of increasing
the number H-bond acceptors in the drug chemical structure in the
presence of a balanced xLogP and low molecular weight to increase
the loading of the drug. The model was validated by performing t-test statistical analysis between the actual experimental
results for drug loading and the predicted drug loading using the
model where no significant difference was obtained between the means
at P < 0.05. The calculated Q2 or the predicted R-squared after 5-fold
cross-validation scored a value of 0.721 (a highly acceptable value).[50]Figure further demonstrates the predicted versus actual relationship,
where it is observed that most of the points are scattered around
the 45° line. Proximity of the points to this line usually indicates
the favorable similarity of the results. Accordingly, the developed
model can be exploited in predicting the loaded mass of any new physically
loaded or entrapped investigated drug molecule in a gelatin matrix
after projecting its structure to the aforementioned four descriptors
(Table ).
Figure 6
Predicted versus
actual drug loading in gelatin nanospheres.
Predicted versus
actual drug loading in gelatin nanospheres.
Discussion
In the HCA utilized and studied method (Ward’s
method),
the distance between two clusters is the ANOVA sum of squares between
the two clusters added up over all the variables. At each generation,
the within-cluster sum of squares is minimized over all partitions
obtainable by merging two clusters from the previous generation. The
sums of squares are usually easily interpreted when they are divided
by the total sum of squares to give the proportions of variance (squared
semipartial correlations). Ward’s method works under the assumptions
of spherical covariance matrices and the condition of equal sampling
probabilities. Distances between clusters in Ward’s method
are calculated according to the squared Euclidean distance. It is
considered very useful in joining clusters with a small number of
observations and is very accurate though sensitive to outliers.[46]PCA was used to confirm the hierarchical
clustering analysis results.
This type of multivariate analysis deals with the x-factors (descriptors) to reduce their dimensionality by projecting
the data into new orthogonal axes that display the directions (vectors)
of the highest variation. These results confirmed the HCA results
and correlate the x-factors (drug descriptors) with
the y-outputs (mass of drug loaded per 100 mg gelatin),
where clustered points (especially in the same quadrants) represents
high similarity between them regarding their projected descriptors.[19]Accordingly, a supervised learning tool
(PLS) was used to generate
an accurate and sensitive model that would correlate the x-factors with the y-outputs quantitatively. The
techniques implemented in the PLS platform work by extracting successive
linear combinations of the predictors, called factors (also called
components or latent vectors), which optimally address the combined
goals of explaining both response and predictor variation. In particular,
the method of PLS balances the two objectives and maximizes their
correlation.[20]The obtained results
can be explained by the fact that gelatin
is a protein carrier with a relatively balanced hydrophilic/hydrophobic
character displaying several hydrogen bond donor and acceptor groups
with a repetitive sequence of amino acids -Ala-Gly-Pro-Arg-Gly-Glu-4Hyp-Gly-Pro-
along its backbone.[51] This structure can
be transformed to some numerical values that are generated of each
amino acid. Among which are the highly condensed variables “z-scale
descriptors”[52] that are derived
from PCA analysis of several experimental and physicochemical properties
of the 20 natural amino acids: z1, z2, and z3, which represent the
amino acids hydrophobicity, steric properties, and polarity, respectively.
Additionally, they are useful in QSAR analysis of peptides where they
have proven effective in predicting different physiological activities.[53−55] Herein, we used an extended scale (including 67 more artificial
and derivatized amino acids)[56] due to the
presence of 4-hydroxyproline in the gelatin structure.In this
study, we expand the use of the first descriptor (z1) to
predict the drug loading properties of nanoparticles. The first scale
(z1) was chosen as it represents a lipophilicity scale that encompasses
several variables (amino acid descriptors) such as the thin layer
chromatography (TLC) variables, log P, nonpolar surface
area (Snp), and polar surface area (Spol) in combination with the
number of proton-accepting electrons in the side chain (HACCR).[57] In this scale, a large negative value of z1
corresponds to a lipophilic amino acid, while a large positive z1
value corresponds to a polar, hydrophilic amino acid. Therefore, the
gelatin typical structure amino acids (-Ala-Gly-Pro-Arg-Gly-Glu-4Hyp-Gly-Pro-)
can be represented by their z1 values as follows: (0.24), (2.05),
(−1.66), (3.52), (2.05), (3.11), (−0.24), (2.05), and
(−1.66). Furthermore, an overall topological description of
the repetitive sequence was accounted for by encoding the z1 descriptors
of each amino acid into one auto covariance variable[47] that was first introduced by Wold et al.[58] The autocovariance value (AC) was calculated as followswhere AC
represents autocovariances of the
same property (z-scale), i = 1,
2, 3,..., N is the number of amino acids, lag = 1,
2, 3, ... L (where L is the maximum
lag, which is the longest sequence used) and V is
the scale value.Therefore, the AC value for the gelatin typical
structure sequence
was calculated with lag 1 scoring a value approaching zero (0.028),
indicating a balanced hydrophobicity/hydrophilicity structure. In
light of the above, the high loading of 5FU and isoniazid can be ascribed
to their amphiphilic nature with LogP values approaching 0 and to
the presence of several hydrogen bond donors and acceptors groups
relative to their low molecular weight that is favorable in both diffusion
through and entrapment in a protein matrix like that of gelatin nanospheres.
Since there was a recorded deviation between the actual and the predicted
values regarding isoniazid and 5FU (may be attributed to their small
molecular weight that helps their nonstoichiometric physical entrapment
in the gelatin matrix), therefore, the results were further confirmed
by molecular dynamics and docking experiments, where the drugs were
docked on the gelatin matrix simulated structure. Figure shows the molecular simulation
of the gelatin nanosphere matrix. Interestingly, the best binding
energy values ΔG (−4.19 and −4.16
kcal/mol) corresponded to the highest loaded drugs 5FU and isoniazid,
respectively, followed by acyclovir (see Figure ). In the same context, amphotericin B scored
a highly positive ΔG value, which explains
its low loading values. The confirmation of the docking results with
their experimental counterparts can be attributed to the inclusive
scoring function of the Arguslab software. This scoring function is
based on the XScore calculated according to the following equation[59]where ΔGbind is the total calculated
binding energy, ΔGvdw is the binding
energy due to van der Waals forces,
ΔGhydrophobic is the binding energy
due to hydrophobic forces, ΔGH-bond is the binding energy due to H-bonding, ΔGH-bond (chg) is the binding energy due to
H-bonding due to charged molecules, ΔGdeformation is the energy due to rotational bonds and atoms
involved in torsions (rotors) that were frozen due to binding, and
finally, ΔG0 represents the regression-obtained
binding energy. As can be inferred, the equation terms encompass nearly
all the possible interactions that can occur between the drug and
its carrier that may lead to drug entrapment, which explains the high
correlation obtained between the real experimental values and the
docking results.
Figure 7
Molecular dynamics simulation of the gelatin nanosphere
matrix.
Figure 8
Drug loading versus the obtained binding energy
plot of the investigated
drugs after docking on a simulated gelatin matrix built up using molecular
dynamics simulation displaying an exponential relationship.
Molecular dynamics simulation of the gelatin nanosphere
matrix.Drug loading versus the obtained binding energy
plot of the investigated
drugs after docking on a simulated gelatin matrix built up using molecular
dynamics simulation displaying an exponential relationship.An exponential model was generated correlating
the actual experimental
molar masses of the loaded drugs and their corresponding docking binding
energies. This model was highly fitting with an obtained R-squared value of 0.95. This relationship can highly estimate the
molar masses of physically loaded drugs through docking the investigated
molecule on the simulated gelatin matrix. The only limitation of the
model was the number of the experimental studies that are involved
in it (10 studies), which we recommend to increase in further similar
studies.
Conclusions
The current study introduces
new approaches of interpreting and
predicting drugs loading on protein carriers, such as gelatin nanospheres.
These approaches comprise multivariate statistical methods such as
hierarchical clustering analysis, principal component analysis, partial
least squares regression, molecular dynamics, and docking. Moreover,
the utilization of the amino acids z-scales descriptors
represents a new and important asset in interpreting drug loading
in protein-based carriers. We believe that this methodology has the
potential to lead to significant change in drug formulation studies
across the world.