
Statistical Methods in the Study of Protein Binding and Its Relationship to Drug Bioavailability in Breast Milk.

Abstract

Protein binding (PB) is regarded as the factor that most severely limits drug distribution in the organism: it reduces the bioavailability of the drug, but it also minimizes the penetration of xenobiotics into the fetus or the body of a breastfed child. Therefore, PB is an important aspect to be analyzed and monitored in the design of new drug substances. In this paper, several statistical analyses were applied to find the relationship between protein binding and the amount of drug in breast milk and to select the molecular descriptors responsible for both pharmacokinetic phenomena. Along with descriptors related to the physicochemical properties of drugs, chromatographic descriptors from TLC and HPLC experiments were also used. Both methods used a modified stationary phase: bovine serum albumin (BSA) in TLC and human serum albumin (HSA) in HPLC. The chromatographic data proved useful in the protein binding study, and the normal-phase TLC and HPLC_HSA data were the most effective. Statistical analyses also confirmed the prognostic value of affinity chromatography data and of protein binding itself as the most important parameters in predicting drug excretion into breast milk.


Keywords: M/P ratio; affinity chromatography; breast milk; chromatographic descriptors; molecular descriptors; protein binding; statistical modeling


Year: 2022 PMID: 35684378 PMCID: PMC9182007 DOI: 10.3390/molecules27113441

Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.927

1. Introduction

Excretion of drugs into breast milk is an important aspect to be considered in the pharmacotherapy of breastfeeding women. Due to ethical considerations, in vivo studies are very rare and it is difficult to obtain the milk-to-plasma (M/P) ratio of many active pharmaceutical ingredients (APIs). A mathematical model capable of calculating M/P values from the available data would greatly facilitate the study of the bioavailability of new APIs. In previous articles [1,2], we compared statistical methods for studying drug excretion into breast milk using the M/P descriptor. Multiple linear regression (MLR) and random forest (RF) analyses were shown to be the most effective in describing this pharmacokinetic phenomenon when chromatographic data and physicochemical properties of the tested compounds were used. These analyses did not deviate from the known principles of bioavailability to breast milk and showed a close relationship between M/P and the level of drug–protein binding (PB), as well as the state of ionization of the API in the bloodstream. The papers also describe the most effective conditions for thin layer chromatography (TLC) as an analytical model for predicting the penetration of drugs into breast milk. According to these results, it can be assumed that drug–protein binding indices, together with chromatographic data, will make it possible to predict the level of drug distribution into breast milk. The main aim of this study is to provide supplementary analyses, which include: determination of physicochemical parameters related to drug–protein binding; a search for a mathematical model for PB and/or M/P prediction; and the use of affinity chromatography data as an index of pharmacokinetic properties. The goal of developing such a model is its further utility in predicting the PB of newly developed active pharmaceutical ingredients. Only easily available API properties are needed to use the model. It can facilitate the process of bringing a new drug into use and reduce expensive in vivo testing. In this study the following statistical methods were used: cluster analysis (CA), discriminant function analysis (DFA), principal component analysis (PCA), partial least squares (PLS) regression and random forest (RF) regression. All molecular descriptors used in this study are listed and described in Table 1.

Table 1

List of molecular and chromatographic descriptors used in statistical analyses.

Descriptor	Description	Reference/Database/Software
a/b/n code	acidic, basic or neutral character of the compound; describes the division into groups: a, b and n	CHEMBL database [3]
B1	calculation parameter B1, describes the bioavailability in the CNS and determines penetration through the blood-brain barrier: log bb = 0.139 + 0.152 log P	reference [4]
B2	calculation parameter B2, describes the bioavailability in the CNS and determines penetration through the blood-brain barrier: log bb = 0.547 − 0.016 PSA	reference [5]
B3	calculation parameter related to protein binding: log (bound fraction/unbound fraction) = 0.5 log P − 0.665	reference [6]
CNS+/−	ability to penetrate into the central nervous system (+ or −)	DrugBank database [7]
DM	dipole moment	HyperChem, Hypercube, Inc.
eH	energy of the highest occupied molecular orbital (HOMO)	HyperChem, Hypercube, Inc.
eH-eL	ionization capacity	HyperChem, Hypercube, Inc.
eL	energy of the lowest unoccupied molecular orbital (LUMO)	HyperChem, Hypercube, Inc.
HA	number of hydrogen bond acceptors	ACD/Labs
HD	number of hydrogen bond donors	ACD/Labs
log D	distribution coefficient	ACD/Labs
log M/P	logarithm of M/P
log MW	logarithm of MW
log P	partition coefficient	HyperChem, Hypercube, Inc.
log U/D	the ratio of neutral to ionized form; determines the degree of ionization	Calculated using: pK_a − pH for acids; pH − pK_a for bases
M/P	milk/plasma drug concentration ratio	references [8,9,10,11,12,13]
MW	molecular weight	HyperChem, Hypercube, Inc.
PB	percentage of plasma protein binding	DrugBank
PhCharge	the charge of the API under physiological conditions	DrugBank
pK_a	negative logarithm of the acid dissociation constant (K_a)	ACD/Labs
PSA	polar surface area	ACD/Labs
Sa	the surface area of the molecule	HyperChem, Hypercube, Inc.
V	the volume of the molecule	HyperChem, Hypercube, Inc.
NP; RP	R_f (retention factor) obtained from TLC in normal and reversed phase, using plates impregnated with bovine serum albumin (BSA)	TLC experiment
NP/C; RP/C	R_f from impregnated NP or RP plate/control R_f	TLC experiment
k_HSA	retention factor from HPLC using column with immobilized human serum albumin (HSA)	HPLC experiment
log k_HSA	logarithm of the retention coefficient obtained from HPLC_HSA	HPLC experiment
log k_IAM	logarithm of the retention coefficient obtained from HPLC_IAM (column with immobilized artificial membrane)	HPLC experiment
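
The calculated descriptors in Table 1 (B1, B2, B3 and log U/D) are simple closed-form expressions. As a minimal sketch, and not the authors' own code, they could be written in Python as follows. The function names are hypothetical, and the physiological pH of 7.4 used as the default for log U/D is an assumption made here for illustration (the source does not state the pH value).

```python
def b1_log_bb(log_p: float) -> float:
    """B1 from Table 1: log bb = 0.139 + 0.152 log P."""
    return 0.139 + 0.152 * log_p


def b2_log_bb(psa: float) -> float:
    """B2 from Table 1: log bb = 0.547 - 0.016 PSA."""
    return 0.547 - 0.016 * psa


def b3_log_bound_ratio(log_p: float) -> float:
    """B3 from Table 1: log(bound/unbound) = 0.5 log P - 0.665."""
    return 0.5 * log_p - 0.665


def bound_fraction_from_b3(b3: float) -> float:
    """Convert log(bound/unbound) into the bound fraction (0..1)."""
    ratio = 10.0 ** b3
    return ratio / (1.0 + ratio)


def log_u_d(pka: float, is_acid: bool, ph: float = 7.4) -> float:
    """log U/D from Table 1: pK_a - pH for acids, pH - pK_a for bases.

    The default pH of 7.4 is an assumption, not a value from the source.
    """
    return pka - ph if is_acid else ph - pka


# Hypothetical compound: log P = 3.0, PSA = 60, acidic with pK_a = 4.4
b3 = b3_log_bound_ratio(3.0)
print(b1_log_bb(3.0), b2_log_bb(60.0), b3, bound_fraction_from_b3(b3))
print(log_u_d(4.4, is_acid=True))
```

The conversion in `bound_fraction_from_b3` follows directly from the definition of B3 as the logarithm of the bound-to-unbound ratio; it is included only to show how B3 relates to a percentage-type PB value.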

2. Results

2.1. Correlation Analyses

The experiment investigated the usefulness of data from several chromatographic experiments (HPLC_HSA, NP TLC, RP TLC and, additionally, HPLC_IAM) in predicting drug binding to protein, and thus bioavailability to breast milk. A group of 165 APIs comprising acidic, basic and neutral drugs was analyzed. The best correlation with PB values was shown by the results of the HPLC_HSA and NP TLC experiments, in the form of log k and R_f values (HPLC_HSA: n = 165, R = 0.39; NP TLC: n = 162, R = 0.31). The relationship is directly proportional. These values refer to all compounds taken together. Better results were obtained for acidic drugs (R = 0.50 for NP TLC), despite the smaller number of cases (n = 34) (Table A1, Appendix A).

Table A1

Chromatographic data from TLC and HPLC experiments and their derivatives used in the analysis of analytical models.

Descriptor	n_abn	n_b	n_n	n_a	PB_abn *	PB_b *	PB_n *	PB_a *
NP	162	49	79	34	0.31	0.31	0.15	0.50
NP/C	162	49	79	34	0.00	−0.11	−0.02	0.50
NP/PSA	162	49	79	34	0.19	0.28	0.17	0.37
NP/B2	162	49	79	34	−0.10	0.02	0.18	−0.69
NP/log P	162	49	79	34	0.12	0.02	−0.20	−0.44
RP	162	49	79	34	0.01	−0.05	−0.20	0.17
RP/C	162	49	79	34	0.12	0.17	0.19	−0.10
RP/PSA	162	49	79	34	0.11	0.21	0.11	−0.03
RP/B2	162	49	79	34	−0.08	0.02	0.07	−0.44
RP/log P	162	49	79	34	0.12	0.09	0.18	0.16
log k_HSA	165	49	80	34	0.39	0.28	0.45	0.55
log k_HSA/B2	165	49	80	36	0.01	0.05	−0.04	0.08
log k_HSA/log P	165	49	80	36	−0.11	−0.04	−0.16	0.09
log k_HSA/PSA	165	49	80	36	0.16	0.11	0.25	0.51
log k_IAM	159	49	74	36	0.20	0.17	0.41	0.28
log k_IAM/PSA	159	49	74	36	−0.05	−0.04	0.07	−0.07
log k_IAM/log P	159	49	74	36	0.04	0.11	−0.05	−0.04
log k_IAM/B2	159	49	74	36	−0.03	−0.06	−0.04	−0.06

* correlation with chromatographic data.
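
The layout of Table A1 (one coefficient for the whole set, abn, and one for each of the groups a, b and n, with n varying because of missing values) can be illustrated with a short sketch. It assumes Pearson's R, which the source does not name explicitly, and the record layout, field names and numbers below are hypothetical.

```python
import math
from collections import defaultdict


def pearson_r(x, y):
    """Pearson correlation coefficient; NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def r_by_group(records, x_key, y_key, group_key="code"):
    """Return {group: (n, R)} for 'abn' (all cases) and each a/b/n group.

    Cases with a missing value in either variable are skipped, which is
    why n can differ from descriptor to descriptor.
    """
    pairs = defaultdict(list)
    for rec in records:
        x, y = rec.get(x_key), rec.get(y_key)
        if x is None or y is None:
            continue
        pairs["abn"].append((x, y))
        pairs[rec[group_key]].append((x, y))
    result = {}
    for group, xy in pairs.items():
        xs, ys = zip(*xy)
        r = pearson_r(xs, ys) if len(xy) >= 2 else float("nan")
        result[group] = (len(xy), r)
    return result


# Hypothetical records: a/b/n code, NP (R_f) and PB (%)
records = [
    {"code": "a", "NP": 0.10, "PB": 40.0},
    {"code": "a", "NP": 0.35, "PB": 90.0},
    {"code": "a", "NP": 0.20, "PB": 75.0},
    {"code": "b", "NP": 0.50, "PB": 30.0},
    {"code": "b", "NP": 0.15, "PB": 20.0},
    {"code": "n", "NP": None, "PB": 55.0},  # missing R_f, skipped
]
for group, (n, r) in r_by_group(records, "NP", "PB").items():
    print(group, n, round(r, 2))
```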

Then the effect of the molecular descriptors most frequently mentioned in connection with drug distribution into breast milk and protein binding was investigated. In all groups of APIs, molecular descriptors related to the hydro-lipophilic nature of drugs play a dominant role. The most important parameters are the partition coefficient and the distribution coefficient (log P and log D). The ability to form hydrogen bonds (HD, HA) also has a visible effect, and its correlation with PB is significant. The ratio of neutral to dissociated form (log U/D), the dissociation constant (pK_a), the ionization capacity of compounds (eH-eL) and the other electron descriptors, eL and eH, show no significance. The influence of hydrophobic parameters (Sa, V, MW) is visible only in the form of the surface area to volume ratio (Sa/V). This factor correlates inversely with PB in all groups of compounds (Table A2, Appendix A).

Table A2

Physicochemical parameters of APIs and their correlation with data on PB.

Descriptor	n_abn	n_b	n_n	n_a	PB_abn *	PB_b *	PB_n *	PB_a *
acid/base	166				−0.15
B1	129	34	66	29	0.28	0.36	0.48	0.13
B2	166	50	81	35	0.12	0.13	0.27	0.05
B3	166	50	81	35	0.13	0.11	0.21	0.05
log U/D	160	50	75	35	0.05	0.16	0.02	0.22
DM	160	47	79	34	−0.02	0.04	−0.04	−0.16
Sa/V	160	47	79	34	−0.29	−0.34	−0.32	−0.04
eH	160	47	79	34	0.05	0.13	−0.02	0.17
MW	162	48	79	35	−0.17	0.14	0.24	0.00
HD	166	50	81	35	−0.23	−0.07	−0.39	−0.23
HA	166	50	81	35	−0.14	−0.16	−0.23	−0.13
eL	160	47	79	35	0.03	0.14	0.00	−0.015
eH-eL	160	50	79	35	0.01	−0.08	−0.01	0.12
log P	160	49	79	35	0.31	0.10	0.34	0.41
log D	160	50	81	35	0.28	0.19	0.38	0.30
MW/V	160	47	79	35	0.03	0.18	0.09	0.03
PhCharge	165	50	80	35	−0.13	−0.05	0.06	−0.20
pKa	160	50	75	35	−0.05	−0.15	0.08	0.22
M/P	104	30	55	19	−0.29	−0.20	−0.35	0.11
CNS+/−	154	49	72	33	−0.18	−0.05	0.16	0.33

* correlation with physicochemical data.

2.2. Discriminant Function Analysis

All of the descriptors most strongly related to the variability of PB, which at the same time did not limit the number of cases studied, were introduced into the discriminant function analysis (DFA). All cases were tested using the a/b/n code. In the stepwise DFA, the discriminant variables included 9 out of 16 entered variables: PhCharge, B2, pK_a, M/P, log k_HSA, log k_IAM, NP, eL and log U/D (Table 2).

Table 2

Classification matrix for the model using discriminant variables: PhCharge, B2, pK_a, M/P, log k_HSA, log k_IAM, NP, eL, log U/D.

API Group	Correctly Classified Cases (%)	a (p = 0.17895)	n (p = 0.52632)	b (p = 0.29474)
a	100.00	17	0	0
n	96.00	0	48	2
b	92.86	0	2	26
all	95.80	17	50	28
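
As a worked check of Table 2, the percentages of correctly classified cases and the prior probabilities p can be recomputed from the counts alone. The sketch below assumes that rows are the observed groups and columns the predicted groups, and that the priors are proportional to the observed group sizes; both are assumptions about the table layout, and the variable and function names are hypothetical.

```python
# Counts from Table 2: counts[observed][predicted]
counts = {
    "a": {"a": 17, "n": 0, "b": 0},
    "n": {"a": 0, "n": 48, "b": 2},
    "b": {"a": 0, "n": 2, "b": 26},
}


def summarize(matrix):
    """Return per-group accuracy (%), overall accuracy (%) and priors."""
    sizes = {g: sum(row.values()) for g, row in matrix.items()}
    total = sum(sizes.values())
    correct = sum(matrix[g][g] for g in matrix)
    per_group = {g: 100.0 * matrix[g][g] / sizes[g] for g in matrix}
    priors = {g: sizes[g] / total for g in matrix}
    return per_group, 100.0 * correct / total, priors


per_group, overall, priors = summarize(counts)
for g in counts:
    print(g, round(per_group[g], 2), round(priors[g], 5))
print("all", round(overall, 2))
```

With these counts, 91 of the 95 cases lie on the diagonal, which gives an overall rate of about 95.8%.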

The first discriminant root, PC1, discriminates the groups of APIs the most (PC1 eigenvalue = 3.61). The variables PhCharge and pK_a contribute most to its value. The second root, PC2 (eigenvalue = 0.81), was shaped by the chromatographic descriptors and the ability to ionize (log U/D). The means of the canonical variables for PC1 are −3.52 for group a, 0.03 for group n and 2.08 for group b; therefore, PC1 most strongly discriminates between groups a and b. The means of the canonical variables for PC2 are −0.93 for group a, 0.86 for group n and −0.97 for group b. In this case, the centroids of groups a and b are almost equal, and the group of neutral compounds (n) is the one most clearly separated (Figure 1).

Figure 1

Discrimination between acidic (a), basic (b) and neutral (n) drugs. Scatter plot of canonical values for root 1 versus root 2. Discriminating variables: PhCharge, B2, pK_a, M/P, log k_HSA, log k_IAM, NP, eL, log U/D.

2.3. Principal Component Analysis

PCA was performed to determine the effect of the primary descriptors on the drug's ability to pass into breast milk. To better visualize the results of the analysis, the M/P values were converted into a scale of drug penetration into milk, M/P_code. The values of this indicator are in the range 1–4. Code 1 corresponds to drugs with an M/P value <0.40 (completely safe); code 2 corresponds to the range 0.40–0.80 (at the safety limit); code 3 to the range 0.81–1.20 (possibly over the safety limit); and code 4 to M/P >1.20 (dangerous). In the course of the analysis, the smallest number of principal components explaining the maximum share of the total variance in the group was first established. Five factors explain 100% of the variability in the levels of drug excretion into breast milk. The first two factors, PC1 and PC2 (principal components), are described by all the descriptors used, and together they explain 72% of the variability. The HPLC_HSA, HPLC_IAM, NP TLC and RP TLC chromatographic data are responsible for the first component, PC1 (43.26%), while the second component, PC2 (28.66%), is determined by the PB value. The projection of cases on the PC1 × PC2 plane is presented below (Figure 2):

Figure 2

Projection of cases onto the PC1 × PC2 plane.
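
The M/P_code scale defined in this section is a simple binning of the M/P ratio. A minimal sketch of that conversion is given below; the function names are hypothetical. The published ranges leave a small gap between 0.80 and 0.81, and this sketch assigns values in that gap to code 3, which is an assumption.

```python
def mp_code(mp: float) -> int:
    """Map an M/P ratio onto the 1-4 penetration scale of Section 2.3.

    1: M/P < 0.40       (completely safe)
    2: 0.40 to 0.80     (at the safety limit)
    3: 0.81 to 1.20     (possibly over the safety limit)
    4: M/P > 1.20       (dangerous)
    Values between 0.80 and 0.81 fall into code 3 (assumption).
    """
    if mp < 0:
        raise ValueError("M/P ratio cannot be negative")
    if mp < 0.40:
        return 1
    if mp <= 0.80:
        return 2
    if mp <= 1.20:
        return 3
    return 4


def is_lower_penetration(mp: float) -> bool:
    """True for codes 1-2, the group described as safe in the PCA plot."""
    return mp_code(mp) <= 2


# Hypothetical M/P values
for value in (0.05, 0.40, 0.80, 1.00, 2.50):
    print(value, mp_code(value), is_lower_penetration(value))
```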

In the graph of the projection of cases onto the PC plane, where the grouping variable is the scale of drug penetration into breast milk (M/P_code), it can be seen that the tested APIs can be divided into two groups (surrounded by a box in the graph). One group includes drugs with a lower level of penetration (M/P_code 1–2, safe), and the other group includes drugs with M/P_code 3–4 (dangerous). This division is not entirely sharp: it is based on the two factors that together explain 72% of the variability, and a few examples of misclassification are visible. The distinction between these groups is related to PC1. Drugs with a low M/P are located on the right side of the plot and are clearly related to the positive values of PC1. APIs easily excreted into milk are on the left side of the chart and have negative PC1 values. The share of variables in this component, determined by the PC1-variable correlation (factor loadings), reveals the parameters of the greatest importance for the investigated pharmacokinetic feature of drugs. They are: log k_HSA, log k_IAM, NP and RP. Thus, affinity chromatography, based on protein binding, can predict the bioavailability of an API into breast milk. The graph of the projection of variables onto the PC plane shows the relationship between each component and each variable. The graph shows the so-called unit circle, i.e., the maximum correlation of 1 between the variable and the factor. The closer a given variable is to the unit circle line, the greater its correlation with the observed phenomenon (Figure 3).

Figure 3

Projection of variables on the plane of factors PC1 × PC2.

2.4. Cluster Analysis

To emphasize the diagnostic value of the experiment and to determine how the values of the parameters governing the ability of drugs to penetrate into breast milk differ, cluster analysis (CA) was also performed. CA was conducted on the proposed M/P_code scale, using the k-means method. The means of the most important biological descriptors (CNS+/−, B1, PhCharge, acid/base, NP, RP, log k_HSA, log k_IAM and PB_code) were compared for groups M/P_code 1–4. All of these biological parameters varied between groups (see Figure 4). The M/P_code values range from 1 to 4, with a clear distinction between the relatively safe and unsafe groups. The physicochemical parameters PB, acid/base, HD, log P, eL and log D also show differentiation, but not in all cases. Unfortunately, the M/P_code values are too closely clustered here, which indicates a smaller influence of the tested properties on the observed feature (Figure 5). The descriptors log D and eL show the highest differentiation.

Figure 4

Mean descriptor values in M/P_code cluster analysis (k-means method) using biological and chromatographic descriptors.

Figure 5

Mean descriptor values in M/P_code cluster analysis (k-means method) using physicochemical descriptors.

The above analyses confirmed the importance of the parameters HA, log P, log D and eL. The parameters log D, HA and eL show the greatest differentiation. Unfortunately, the M/P_code values are poorly differentiated and do not correspond to the variability of the other descriptors.

2.5. Regression Methods

MLR failed to produce a reliable PB prediction model; therefore, an attempt was made to analyze protein binding by other regression methods. A total of 165 test compounds and 22–23 independent variables were used to perform partial least squares (PLS) and random forest (RF) regression. The variables used are listed for each model (Table 3 and Table 4). For the analyses, the 165 compounds were randomly divided into a training set, 70% of the total (TRAIN, n = 115 compounds), and a test set for external validation, 30% of the total (TEST, n = 50).
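
The random 70/30 division and the coefficient of determination used to score the models can be sketched as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: the function names and the seed are hypothetical. Following the figure captions, R² is computed from the model's direct predictions and Q² from cross-validated predictions, so the same formula serves for both.

```python
import math
import random


def split_train_test(n_cases: int, n_train: int, seed: int = 0):
    """Randomly split case indices into a training and a test set."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Pass fitted predictions to obtain R2, or cross-validated
    predictions to obtain Q2.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred) -> float:
    """Root-mean-square error (RMSECV or RMSEP, depending on the input)."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# 165 compounds divided into TRAIN (n = 115) and TEST (n = 50)
train_idx, test_idx = split_train_test(n_cases=165, n_train=115)
print(len(train_idx), len(test_idx))

# Hypothetical observed and predicted PB values (%)
observed = [95.0, 60.0, 20.0, 80.0]
predicted = [90.0, 65.0, 30.0, 70.0]
print(round(r_squared(observed, predicted), 3), round(rmse(observed, predicted), 3))
```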

Table 3

Twenty-three independent variables with NP TLC data used to create the RF and PLS model for PB.

No.	Independent Variable	No.	Independent Variable	No.	Independent Variable
1.	B3	9.	NP/B2	17.	eH
2.	PhCharge	10.	NP/log P	18.	eL
3.	acid/base	11.	MW	19.	eH-eL
4.	pKa	12.	log MW	20.	logD
5.	log U/D	13.	PSA	21.	Sa
6.	C	14.	HD	22.	V
7.	NP	15.	HA	23.	logP
8.	NP/C	16.	DM

Table 4

Twenty-two independent variables with HPLC_HSA data used to create the RF and PLS model for PB.

No.	Independent Variable	No.	Independent Variable	No.	Independent Variable
1.	B3	9.	log k_HSA/log P	17.	eL
2.	PhCharge	10.	MW	18.	eH-eL
3.	acid/base	11.	log MW	19.	log D
4.	pKa	12.	PSA	20.	Sa
5.	log U/D	13.	HD	21.	V
6.	k_HSA	14.	HA	22.	log P
7.	log k_HSA	15.	DM
8.	log k_HSA/B2	16.	eH

2.5.1. Partial Least Squares Regression

The PLS model using 23 independent variables, including NP TLC data (Table 3), showed low values of R² and Q², approximately 0.40, and even lower results in external validation, approximately 0.22–0.24 (Figure A1, Appendix A). Even lower values were obtained with the HPLC_HSA chromatographic data. This indicates that, as in the case of the breast milk prediction models, the PLS method is not an appropriate method for analyzing this type of data.

Figure A1

Actual versus predicted PB_abn values using PLS modelling and 23 molecular descriptors including NP TLC data. LVs = latent variables, RMSECV = root-mean-square error of cross-validation, RMSEP = root-mean-square error of prediction, R² train/test = coefficient of determination for train/test set models, Q² train/test = coefficient of determination for the cross-validated models.

2.5.2. Random Forest Regression

RF regression was performed with 150 random trees. NP TLC data were used first. The independent variables used for the analysis of all 165 cases (dependent variable: PB_abn) are listed in Table 3. The obtained model (Figure 6) showed satisfactory results, especially for the training set (n = 115): R²_train = 0.81; Q²_train = 0.73. The results of external validation using the test set (n_abn = 50) were lower: R²_test = 0.65; Q²_test = 0.56. The Monte Carlo permutation test (MCPT) gave an average Q²_test value of 0.56 (Appendix A, Figure A2), which is similar to that of the presented model. The influence of individual independent variables on the model is presented in Figure A3 (Appendix A). The order of the descriptors presented there is as shown in Table 3. The log D parameter shows the strongest influence on the model using NP TLC data.

Figure 6

Actual versus predicted PB_abn values using RF regression modelling of molecular descriptor set containing 23 variables. RMSECV = root-mean-square error of cross-validation, RMSEP = root-mean-square error of prediction, R² train/test = coefficient of determination for train/test set models, Q² train/test = coefficient of determination for the cross-validated models.

Figure A2

Monte Carlo permutation test (MCPT) showing Q² obtained from RF regression models developed on the test set; the number of repetitions was n = 100. The mean value of Q² was 0.5598 at the significance level p = 2.8196 × 10^−12.

Figure A3

Contribution of individual descriptors to the generation of the RF regression model for PBabn. The greatest influence is shown by the descriptor no. 20, i.e., log D.

The data from the HPLC_HSA experiment were then used for the RF regression (Table 4). The obtained model (Appendix A, Figure A4) again shows good results for the training set (n = 115): R²_train = 0.81; Q²_train = 0.78, but much lower parameters were obtained in external validation (n_abn = 50): R²_test = 0.57; Q²_test = 0.53. In the MCPT, the Q²_test value was low, amounting to 0.35 (Appendix A, Figure A5).

Figure A4

Actual versus predicted PB_abn values, using RF regression modelling of molecular descriptor set containing 22 variables along with HPLC_HSA data. RMSECV = root-mean-square error of cross-validation, RMSEP = root-mean-square error of prediction, R² train/test = coefficient of determination for train/test set models, Q² train/test = coefficient of determination for the cross-validated models.

Figure A5

Monte Carlo permutation test (MCPT) showing Q² obtained from RF regression models developed on the test set; the number of repetitions was n = 100. The mean value of Q² was 0.3456 at the significance level p = 0.4583.

Next, the individual groups of compounds were analyzed, either separately, (a), (b) and (n), or in combination, (an), (bn) and (ab). The results are shown in Table 5. Only the NP TLC data (Table 3) were used to construct these models, because they gave the best results for the complete set of compounds (n_abn = 165).

Table 5

Random forest regression results on individual drug combinations.

API Group	Train Set	Test Set
PB_a	n = 24 R² = 0.78; Q² = 0.62	n = 11 R² = 0.29; Q² = 0.11
PB_b	n = 35 R² = 0.88; Q² = 0.80	n = 15 R² = 0.33; Q² = 0.29
PB_n	n = 57 R² = 0.85; Q² = 0.81	n = 25 R² = 0.62; Q² = 0.59
PB_an	n = 82 R² = 0.82; Q² = 0.74	n = 35 R² = 0.60; Q² = 0.55
PB_bn	n = 92 R² = 0.85; Q² = 0.80	n = 40 R² = 0.44; Q² = 0.44
PB_ab	n = 59 R² = 0.80; Q² = 0.72	n = 26 R² = 0.38; Q² = 0.33

RF models for PB_a (n_a = 35) and PB_b (n_b = 50) gave poor results, especially in external validation, as did their combined group (n_ab = 85), for which the external validation results were in the range 0.3–0.4. The best results were obtained for the PB_n (n_n = 82) and PB_an (n_an = 117) groups, for which the R² and Q² values of the test sets ranged between 0.55 and 0.62 (Appendix A: Figure A6 and Figure A7). In both models, log D is the most important variable (Appendix A: Figure A8 and Figure A9).

Figure A6

Actual versus predicted PB_n values using RF regression modelling of molecular descriptor set containing 23 variables along with NP TLC data. RMSECV = root-mean-square error of cross-validation, RMSEP = root-mean-square error of prediction, R² train/test = coefficient of determination for train/test set models, Q² train/test = coefficient of determination for the cross-validated models.

Figure A7

Actual versus predicted PB_an values, using RF regression modelling of molecular descriptor set containing 23 variables along with NP TLC data.

Figure A8

Contribution of individual descriptors to the development of the RF regression model for PB_n. The greatest influence is shown by descriptor no. 19, i.e., log D; in addition, the molecular weight (MW) and molecular volume (V) are important.

Figure A9

Contribution of individual descriptors to the development of the RF regression model for PB_an. The greatest influence is shown by descriptor no. 19, i.e., log D; in this case, a greater share of chromatographic parameters can be seen (descriptors nos. 6–9).

3. Discussion

On the basis of the DFA, it was possible to determine the influence of the acidic, basic and neutral properties of APIs on their protein binding capacity and to decide whether the analysis of the pharmacotherapy of nursing mothers (M/P predictions) should be divided into groups a, n and b. The division into acidic, basic and neutral drugs is strongly related to the PB-related descriptors, so the use of groups a, b and n seems to bring value for further analysis. The low values of Wilks' lambda for both roots, PC1 and PC2 (0.11 and 0.54, respectively), confirm the value of the obtained results. While the DFA revealed a group of physicochemical and chromatographic parameters important for the bioavailability of drugs to milk, the use of CA emphasized the differentiation of their mean values in the M/P_code 1–4 groups. The above analyses confirmed the importance of the parameters HA, log P, log D and eL. The parameters log D, HA and eL show the greatest differentiation. Unfortunately, the M/P_code values are poorly differentiated and do not correspond to the variability of the other descriptors. Based on the PCA, it can be concluded that the drug–protein binding affinity chromatography data, in the form of the proposed analytical models, and protein binding itself, as the basis for the experimental design, are the most important parameters in predicting drug excretion into breast milk. The final step in this study was to construct a model capable of predicting the PB value, used as a trait strongly correlated with the bioavailability of drugs in breast milk. Unfortunately, it was not possible to obtain an MLR or PLS model for protein binding prediction that was reproducible for different groups. Models created by random forest regression show a significant relationship, visible in the scatter plots (Figure 6, Figure A4, Figure A6 and Figure A7). The influence of the distribution coefficient (log D) and of the chromatographic parameters from the NP TLC and HPLC_HSA experiments is also noticeable in each model. Unfortunately, the models do not show high predictive ability (mean Q²_test in the MCPT of 0.56 for the NP TLC model and 0.35 for the HPLC_HSA model). The best results using random forest regression were obtained for the entire set of compounds, PB_abn, and for the PB_n and PB_an groups. It is the acidic and neutral compounds that bind primarily to albumin, which constitutes the majority of plasma proteins, so the literature values of protein binding (PB) refer mainly to the binding of drugs to HSA.

4. Materials and Methods

4.1. Molecular Descriptors

All tested drugs are listed in Supplementary Materials, along with molecular descriptors. Active pharmaceutical ingredients were extracted from pharmaceutical formulations, purchased in a generally accessible pharmacy. The main criterion used in composing the drug set was the availability of protein binding values (PB) along with milk-to-plasma ratios for each API, as these were the main pharmacokinetic phenomena studied. The molecular descriptors selected for statistical analyses, which should have a significant effect on the penetration into breast milk and protein binding, are listed in Table 1. Some were taken from the literature, including M/P ratio obtained in vivo [8,9,10,11,12,13] or from online databases DrugBank [7] and CHEMBL [3]. Most of the physicochemical data were calculated in the following programs: HyperChem (HyperChem for Windows version 7.02, HyperCube Inc, Gainesville, FL, USA, 2002) and ACD/Labs (ACD/LabsTM Log D Suite 8.0, pKa dB 7.0, Advanced Chemistry Development Inc., Toronto, Canada, 2004). Chromatographic descriptors were obtained in experiments, thin layer chromatography in normal (NP TLC) and reversed mode (RP TLC). The stationary phase was modified with bovine serum albumin (BSA). TLC was the source of retention factor (Rf) values, denoted in statistical models as NP and RP. High performance liquid chromatography was performed using immobilized human serum albumin column (HPLCHSA) and immobilized artificial membrane (HPLCIAM). HPLC was the source of the log k values (logarithm of retention factor), log kHSA and log kIAM. The TLC and HPLC experiments are detailed in Appendix B.

4.2. Statistical Analyses

DFA, PCA and CA were performed in STATISTICA 13.1 (TIBCO Software Inc., Palo Alto, CA, USA). DFA is a classification analysis that determines which descriptors best define the assignment of individual cases to each of the predetermined groups. Wilks' lambda is a parameter used to evaluate the discriminant power of the entire model, i.e., of all the independent variables used; it takes values from 0 to 1, and the closer the value is to zero, the more discriminatory the model is. PCA is used to combine highly correlated variables into one new variable called a principal component (PC). The new factors are calculated by diagonalizing the correlation or covariance matrix. The choice of matrix depends on whether the original variables require standardization or only centering to mean values. In this way, a reduced number of new variables is generated that explains as much of the original variance as possible. The purpose of cluster analysis (CA) is to combine cases into groups so that the association between cases within the same group is as large as possible, and with cases from other groups as small as possible. The grouping method used in the presented studies was the k-means method, in which the means for each cluster and each dimension are examined, which allows an assessment of the extent to which the created clusters differ from each other. In the analysis of variance, the magnitude of the F statistic calculated for each dimension shows how well that dimension separates the individual clusters. In the best situation, very different means are obtained for most of the dimensions analyzed. PLS and RF regression were performed with MATLAB ver. 2019a (The MathWorks, Natick, MA, USA). The performance of the models was assessed by double cross-validation. The statistical significance was then evaluated using permutation testing.
In the PLS method, the matrix of independent variables is analyzed for latent variables (LVs) that best describe the covariance between X and Y. These transformed independent variables are then used in regression to predict the Y response. The RF method uses many decision trees which, based on the entered X variables, repeatedly "make a decision" about the predicted value of Y for each case, from which the mean value is then taken. In regression analyses, it is good practice to divide the set of cases into two sets, training and test, in order to perform external validation, which demonstrates the predictive capacity of the model. The training set accounts for approximately 70% of all collected cases and is used to build a regression equation (training model). The rest, i.e., about 30% of cases, are included in the test set on which the equation is validated. Cases are assigned to the training and test sets randomly. In order to check the stability of the model and exclude random effects, it is worth carrying out this division into two subsets and the construction of the equation several times. The Monte Carlo permutation test (MCPT) is used for this. For the training and test sets, RF regression was performed and RMSECV, Q² and R² were calculated. This procedure was repeated 100 times, with the training (70%) and test (30%) sets drawn anew each time by randomly splitting the original data matrix. A similar MCPT (100 permutations) was then performed on training and test sets derived from the permuted data matrix. Finally, the distributions of Q² in the original and permuted models were compared, and a one-way ANOVA was performed.
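
The repeated-split permutation procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: a one-variable least-squares line stands in for the RF model (a deliberate simplification), the one-way ANOVA step is omitted, and all function names and the synthetic data are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept (stand-in for RF)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def q2_external(x_tr, y_tr, x_te, y_te):
    """Fit on the training set, then score 1 - SS_res/SS_tot on the test set."""
    slope, intercept = fit_line(x_tr, y_tr)
    mean_te = statistics.fmean(y_te)
    ss_res = sum((t - (slope * a + intercept)) ** 2 for a, t in zip(x_te, y_te))
    ss_tot = sum((t - mean_te) ** 2 for t in y_te)
    return 1.0 - ss_res / ss_tot


def mcpt(x, y, n_rep=100, train_frac=0.7, seed=0):
    """Return Q2 lists for the original and the permuted response."""
    rng = random.Random(seed)
    n = len(x)
    n_train = int(train_frac * n)
    original, permuted = [], []
    for _ in range(n_rep):
        idx = list(range(n))
        rng.shuffle(idx)                      # new random 70/30 split
        tr, te = idx[:n_train], idx[n_train:]
        y_perm = list(y)
        rng.shuffle(y_perm)                   # break the X-Y relationship
        for target, store in ((y, original), (y_perm, permuted)):
            store.append(q2_external(
                [x[i] for i in tr], [target[i] for i in tr],
                [x[i] for i in te], [target[i] for i in te],
            ))
    return original, permuted


# Synthetic data: a linear signal plus Gaussian noise
noise = random.Random(1)
x_demo = [float(i) for i in range(60)]
y_demo = [2.0 * xi + noise.gauss(0.0, 5.0) for xi in x_demo]
q2_orig, q2_perm = mcpt(x_demo, y_demo, n_rep=100)
print(statistics.fmean(q2_orig), statistics.fmean(q2_perm))
```

If the original models carry real signal, their Q² distribution should sit well above that of the permuted models; when the two distributions overlap, as the p-value reported for the HPLC_HSA model suggests, the apparent predictive ability is hard to distinguish from chance.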

5. Conclusions

Positive results were obtained regarding the usefulness of chromatographic data in the study of protein binding and the penetration of drugs into breast milk. The presented statistical analyses showed a close relationship between HPLC and TLC analytical data (under set conditions) and the bioavailability of the drug into breast milk. The correlation of PB and the M/P ratio with these chromatographic data is high, including in the group of all cases (acidic, basic and neutral drugs) taken together. NP TLC and HPLC_HSA data were the most effective. There is also a greater correlation between PB and the chromatographic data in the group of acidic drugs (a), i.e., for specific binding to albumin. The PCA and DFA identified a group of physicochemical and chromatographic parameters important for the bioavailability of drugs in breast milk. The use of CA emphasized the differentiation of their mean values in groups M/P_code 1–4. NP TLC proved to be the most useful chromatographic method in statistical analyses. In the case of the HPLC_HSA data, the relatively large contribution of the column results to the RF model was notable. The second factor that emerges in almost all analyses is the high contribution of the log D parameter, i.e., lipophilicity associated with ionization.


Review 1. Computational approaches to the prediction of the blood-brain distribution.

Authors: Ulf Norinder; Markus Haeberlein
Journal: Adv Drug Deliv Rev Date: 2002-03-31 Impact factor: 15.470

2. QSAR treatment of drugs transfer into human breast milk.

Authors: Alan R Katritzky; Dimitar A Dobchev; Evrim Hür; Dan C Fara; Mati Karelson
Journal: Bioorg Med Chem Date: 2005-03-01 Impact factor: 3.641

3. Prediction of milk/plasma concentration ratios of drugs and environmental pollutants.

Authors: Michael H Abraham; Javier Gil-Lostes; Mohammad Fatemi
Journal: Eur J Med Chem Date: 2009-01-20 Impact factor: 6.514

4. Drug interactions. II. Binding of some pyrazolone and pyrazolidine derivatives to bovine serum albumin.

Authors: S Ozeki; K Tejima
Journal: Chem Pharm Bull (Tokyo) Date: 1974-06 Impact factor: 1.645

Review 5. Structure of serum albumin.

Authors: D C Carter; J X Ho
Journal: Adv Protein Chem Date: 1994

Review 6. Overview of Albumin and Its Purification Methods.

Authors: Ramin Raoufinia; Ali Mota; Neda Keyhanvar; Fatemeh Safari; Sara Shamekhi; Jalal Abdolalizadeh
Journal: Adv Pharm Bull Date: 2016-12-22

7. Spectroscopic investigations of the interactions of tramadol hydrochloride and 5-azacytidine drugs with human serum albumin and human hemoglobin proteins.

Authors: Sibel Tunç; Ahmet Cetinkaya; Osman Duman
Journal: J Photochem Photobiol B Date: 2013-01-31 Impact factor: 6.252