Deep learning for in vitro prediction of pharmaceutical formulations.

Yilong Yang¹,², Zhuyifan Ye¹, Yan Su¹, Qianqian Zhao¹, Xiaoshan Li², Defang Ouyang¹.

Abstract

Current pharmaceutical formulation development still relies strongly on the traditional trial-and-error methods of pharmaceutical scientists. This approach is laborious, time-consuming and costly. Recently, deep learning has been widely applied in many challenging domains because of its capability for automatic feature extraction. The aim of the present research is to apply deep learning methods to predict pharmaceutical formulations. In this paper, two types of dosage forms were chosen as model systems. Evaluation criteria suitable for pharmaceutics were applied to assess the performance of the models. Moreover, an automatic dataset selection algorithm was developed for selecting representative data as the validation and test datasets. Six machine learning methods were compared with deep learning. Results showed that the accuracies of both deep neural networks were 80% or higher and exceeded those of the other machine learning models, indicating good prediction of pharmaceutical formulations. In summary, deep learning employing an automatic data splitting algorithm and evaluation criteria suitable for pharmaceutical formulation data was developed for the prediction of pharmaceutical formulations for the first time. The cross-disciplinary integration of pharmaceutics and artificial intelligence may shift the paradigm of pharmaceutical research from experience-dependent studies to data-driven methodologies.

Keywords: ANNs, artificial neural networks; APIs, active pharmaceutical ingredients; Automatic dataset selection algorithm; DNNs, deep neural networks; Deep learning; ESs, expert systems; FDA, U.S. Food and Drug Administration; HPMC, hydroxypropyl methylene cellulose; MAE, mean absolute error; MD-FIS, the Maximum Dissimilarity algorithm with the small group filter and representative initial set selection; MLR, multiple linear regression; OFDF, oral fast disintegrating films; Oral fast disintegrating films; Oral sustained release matrix tablets; PLSR, partial least squared regression; Pharmaceutical formulation; QSAR, quantitative structure activity relationships; QbD, quality by design; RF, random forest; RMSE, root mean squared error; SRMT, sustained release matrix tablets; SVM, support vector machine; Small data; k-NN, k-nearest neighbors

Year: 2018 PMID: 30766789 PMCID: PMC6362259 DOI: 10.1016/j.apsb.2018.09.010

Source DB: PubMed Journal: Acta Pharm Sin B ISSN: 2211-3835 Impact factor: 11.413

Introduction

The pharmaceutical industry currently faces intense pressure to reduce healthcare costs while the number of new active pharmaceutical ingredients (APIs) is declining. The industry therefore needs more efficient and systematic approaches in both drug discovery and development. In drug discovery, scientists now widely use high-throughput screening, combinatorial chemistry, and computer-aided drug design to accelerate discovery and development. However, modern pharmaceutical formulation development still relies strongly on traditional trial-and-error approaches by pharmaceutical scientists. Such methods are laborious, time-consuming and expensive. Moreover, it is difficult to reach the optimum formulation through trial-and-error studies in the laboratory. Simplifying formulation development has become essential to formulation scientists. Thus, an efficient and systematic method for formulation development is needed to keep pace with the requirements of the pharmaceutical industry.

Machine learning has been one of the most exciting research areas in recent years. Machine learning can make data-driven predictions from existing experimental data, which provides a great opportunity for efficient formulation development [4, 5, 6, 7, 8, 9]. A well-designed machine learning method can greatly speed up development, optimize formulations, save costs, keep products consistent, and accumulate and preserve the specific knowledge and expertise of experts in a well-defined domain. Table 1 [5, 6, 10, 11, 12, 13, 14, 15, 16] summarizes recent progress of machine learning in formulation design. Expert systems (ESs) and artificial neural networks (ANNs) are two useful tools for formulation development [4, 5, 6, 8, 9]. An ES is an intelligent program able to accumulate and preserve the knowledge and experience of experts in a specific area (e.g., pharmaceutical formulations). However, it is difficult to distill the vague experience of pharmaceutical experts into the rules of an ES, and then to accurately predict the performance of a formulation. The ANN is the most popular machine learning tool in pharmaceutical formulation prediction. An ANN simulates the structure and functions of biological neural networks and can solve problems that are difficult to solve with standard expert systems. However, ANNs still need strong expert knowledge to design feature extractors in the prediction process. In addition, the formulation prediction accuracy of ANNs is relatively low because of the limited experimental data.

Table 1

Recent progress of machine learning in formulation design.

Machine learning techniques	Formulation	Ref.
Hybrid expert system with ANNs	Hard gelatin capsule formulations	10
Expert system (SeDeM Diagram)	Orally disintegrating tablets	11, 12
Expert system with ANNs	Osmotic pump tablets	5, 13
Ontology-based expert system	Immediate release tablets	14
ME_expert 2.0	Microemulsion formulations	6
Fuzzy logic-based expert system	Freeze-dried formulations	15
Cubist and Random Forest	Cyclodextrin formulations	16

Deep learning is an automatic general-purpose learning procedure which has been widely adopted in many domains of science, business, and government. Unlike other machine learning techniques that require domain expertise to design feature extractors, deep learning can serve as a feature extractor that automatically transforms low-level features into higher and more abstract levels. Furthermore, deep learning can be sensitive to particular minute variations while ignoring irrelevant ones, which allows these methods to reach higher accuracy than other machine learning methods. Convolutional neural networks have the advantages of local connections and weight sharing, which are inspired by visual neuroscience. Convolutional neural networks usually perform well in image, video, speech and audio processing [21, 22]. Recurrent neural networks can process sequences of different lengths and use the history information of the sequence. Recurrent neural networks have brought about breakthroughs in sequential data (e.g., text and speech) [23, 24]. Pharmaceutical formulation data include formulation compositions and manufacturing processes, which are neither image data nor sequential data. Therefore, the fully connected deep feed-forward network is a good choice for the prediction of pharmaceutical formulations. A recent study showed that deep neural networks (DNNs) outperform ANNs with one hidden layer in oral disintegrating tablet prediction. The Maximum Dissimilarity algorithm with the small group filter and representative initial set selection (MD-FIS) selected the representative validation set from the small and imbalanced oral disintegrating tablet data. However, more comparisons of deep learning with other machine learning methods are needed for predicting successful formulations.

In the past five years there have been increasing applications of deep learning in pharmaceutical research [26, 27, 28, 29]. The first such study (in 2013) compared deep learning with other machine learning approaches for predicting the water solubility of drugs. The results showed that deep learning performed better than the other approaches. Subsequently, more pharmaceutical applications of deep learning were reported. For example, a deep convolutional network was developed to predict the epoxidation reactivity of molecules in order to reduce drug toxicity. Deep learning was also applied successfully to predict drug-induced liver injury. Another study showed that deep learning outperformed other computational methods (naive Bayes, support vector machines, and random forests) in predicting toxicity in the 2016 Tox21 Data Challenge. Deep learning was also used in drug discovery [33, 34, 35]. DNNs made better predictions than other machine learning approaches on quantitative structure activity relationships (QSAR) data sets. Moreover, multitask deep learning and one-shot learning approaches were used in low-data drug discovery and performed better than single-task learning [34, 35]. In addition, applying deep learning to mine the growing datasets in drug discovery not only enables us to learn from the past but also to predict future drug repurposing [36, 37]. Recently, the performance of five machine learning models and four DNNs with 2, 3, 4, and 5 hidden layers was evaluated on 8 datasets. Further analysis was carried out using ranked normalized scores that included seven classic measurements of model performance. The final scores, ranked by metric and by dataset, indicated that the DNNs with five and four hidden layers outperformed the other machine learning approaches. Nearly all reports in the past 5 years suggested that deep learning offered better predictive performance than other machine learning methods.

In the present paper, deep learning was applied to predict successful pharmaceutical formulations by constructing regression models. One of the main difficulties in formulation prediction is the small dataset with an imbalanced input space caused by the limited experimental data. For better performance, a data splitting algorithm and evaluation criteria suitable for pharmaceutical formulation data were introduced. The DNNs were trained on the data of two types of pharmaceutical dosage forms: oral fast disintegrating films (OFDF) and oral sustained release matrix tablets (SRMT). Deep learning was compared with six other machine learning techniques. Compared with other machine learning methods, deep learning can find the intricate correlations between pharmaceutical formulations and in vitro characteristics, which shows wide prospects for the application of deep learning in pharmaceutical formulation prediction.

Methods

Pharmaceutical data

The pharmaceutical dataset includes 131 formulations of OFDF and 145 formulations of SRMT. The experimental data were extracted from Web of Science. Three search terms (oral fast-dissolving films, oral disintegrating films and orodispersible films) were used to search the literature on the development of OFDF formulations. The search strategy for hydroxypropyl methyl cellulose (HPMC)-based sustained release matrix tablet formulations was “HPMC” or “hydroxypropyl methylcellulose” or “hydroxypropylmethylcellulose” or “hydroxypropylmethyl cellulose” or “hypromellose” and “tablet” or “tablets”. The formulation data contain the types and contents of both drugs and excipients, the process parameters and the in vitro characteristics of the dosage forms. The characteristics chosen as prediction targets in this research were the disintegration time for OFDF and the cumulative dissolution profiles (2, 4, 6 and 8 h) for SRMT. Molecular descriptors were used to represent the properties of the APIs. Each drug was described by nine molecular descriptors: molecular weight, XlogP3, hydrogen bond donor count, hydrogen bond acceptor count, rotatable bond count, topological polar surface area, heavy atom count, complexity and logS. The excipient types were encoded as distinct numbers. The process parameters include weight, thickness, tensile strength, elongation, folding endurance and actual drug content for OFDF, and granulation process, diameter and hardness for SRMT.
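The encoding described above can be illustrated with a minimal sketch. The descriptor names follow the text; the function names, the layout of the feature vector (descriptors, then code/amount pairs, then process parameters) and all numeric values are hypothetical placeholders, not the authors' code or data.

```python
# Hypothetical sketch: turn one formulation record into a numeric feature vector.
DESCRIPTOR_NAMES = [
    "molecular_weight", "xlogp3", "h_bond_donor_count", "h_bond_acceptor_count",
    "rotatable_bond_count", "topological_polar_surface_area", "heavy_atom_count",
    "complexity", "logS",
]

def build_excipient_codes(excipient_names):
    """Map each distinct excipient name to an integer code (1-based, sorted order)."""
    return {name: i for i, name in enumerate(sorted(set(excipient_names)), start=1)}

def encode_formulation(api_descriptors, excipients, process_params, codes):
    """Concatenate API descriptors, (code, amount) pairs and process parameters."""
    features = [float(api_descriptors[name]) for name in DESCRIPTOR_NAMES]
    for name, amount in excipients:
        features.extend([float(codes[name]), float(amount)])
    features.extend(float(value) for value in process_params)
    return features

# Placeholder values only; they do not describe any real drug or formulation.
codes = build_excipient_codes(["HPMC", "placeholder_plasticizer"])
descriptors = dict(zip(DESCRIPTOR_NAMES, [300.0, 2.0, 1, 4, 5, 60.0, 20, 350.0, -3.0]))
vector = encode_formulation(
    descriptors,
    excipients=[("HPMC", 45.0), ("placeholder_plasticizer", 10.0)],
    process_params=[50.0, 0.1, 12.0],
    codes=codes,
)
print(len(vector))
```

Integer codes impose an arbitrary ordering on excipient types, so a one-hot encoding would be a reasonable alternative; the text only states that the types were encoded as numbers.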

Data splitting strategy

A three-dataset (training/validation/test datasets) splitting strategy was used. The training set is for training models, and the validation set is for tuning hyper-parameters to find the best model. The accuracy on the test set shows the prediction ability on unknown data. This strategy is widely adopted in machine learning. For each dosage form, the pharmaceutical data were split into three subsets: the validation set and the test set each include 20 formulations, and the rest of the data were used to train the models.

Hyperparameters of machine learning methods

Six machine learning methods were used to construct regression models for comparison with DNNs: multiple linear regression (MLR), partial least squares regression (PLSR), support vector machine (SVM), ANNs, random forest (RF) and k-nearest neighbors (k-NN). These regression models were trained using the scikit-learn package. For OFDF, the number of components in PLSR was set to 8. The ANNs contained 1 hidden layer with 80 hidden nodes. In RF, the maximum depth of the tree was set to 3. In k-NN, the number of neighbors was set to 5. For SRMT, 4 models were trained with each machine learning method, one for each of the 4 time points (2, 4, 6 and 8 h). The 4 models were developed using the same hyperparameters. In PLSR, the number of components was set to 10. The ANNs contained 1 hidden layer with 60 hidden nodes. In RF, the maximum depth of the tree was set to 5. In k-NN, the number of neighbors was set to 3.

Hyperparameters of deep neural networks

The DeepLearning4j machine learning framework (https://deeplearning4j.org/) was used to train the deep neural networks. For OFDF, a feed-forward neural network with 10 layers was trained for 900 epochs. This network contained 50 hidden neurons in each layer. For SRMT, a feed-forward neural network with 9 layers was trained for 2600 epochs. This network contained 30 hidden neurons in each layer. All networks used tanh as the activation function of the hidden layers and the sigmoid activation function for the last layer. The learning rate was 0.01. Batch gradient descent with a momentum of 0.8 was used as the optimization algorithm.
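To make the architecture concrete, the following is a minimal plain-Python sketch of the forward pass and of one momentum update. It is not the DeepLearning4j configuration used in the study. The input size (20), the weight initialization, the reading of "9 layers" as nine hidden layers plus a separate output layer, and the use of classical (non-Nesterov) momentum are all assumptions made for illustration.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer (assumed init)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, biases, activation):
    """Apply one fully connected layer followed by an activation function."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def build_network(n_inputs, n_hidden_layers, n_hidden, n_outputs, seed=0):
    rng = random.Random(seed)
    sizes = [n_inputs] + [n_hidden] * n_hidden_layers + [n_outputs]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(network, x):
    """tanh on every hidden layer, sigmoid on the last layer, as in the text."""
    for weights, biases in network[:-1]:
        x = dense(x, weights, biases, math.tanh)
    weights, biases = network[-1]
    return dense(x, weights, biases, sigmoid)

def momentum_step(params, grads, velocity, lr=0.01, momentum=0.8):
    """One classical-momentum update: v <- m*v - lr*g, then p <- p + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# SRMT-like shape: 30 neurons per hidden layer, 4 outputs (2, 4, 6 and 8 h).
net = build_network(n_inputs=20, n_hidden_layers=9, n_hidden=30, n_outputs=4)
outputs = forward(net, [0.5] * 20)
params, velocity = momentum_step([1.0], [2.0], [0.0])
```

Because the last layer is a sigmoid, every output lies strictly between 0 and 1, which suggests that the prediction targets were scaled into that range before training.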

Evaluation criteria

In machine learning, the correlation coefficient and the coefficient of determination are usually adopted as evaluation metrics for regression problems. The correlation coefficient indicates the strength of the linear relationship between two variables, and the coefficient of determination shows how well the predicted values agree with the real values. However, neither metric can properly evaluate the performance of pharmaceutical formulation prediction models. In pharmaceutics, a good model for predicting drug dissolution profiles should have less than 10% error. Thus, specific criteria suitable for pharmaceutics should be introduced to evaluate model performance.

Following the recommendation of the FDA (the U.S. Food and Drug Administration) to use the similarity factor f2 to evaluate the similarity of drug dissolution profiles, f2 was introduced to evaluate the performance of the models for predicting the cumulative drug release curves. A prediction is considered successful if f2 is greater than or equal to 50. The accuracy of the cumulative drug release curve prediction (Eq. (1)) is the percentage of successful predictions among all predictions:

$$\text{Accuracy} = \frac{\text{number of predictions with } f_2 \geq 50}{\text{number of all predictions}} \times 100\% \tag{1}$$

The European Pharmacopoeia stipulates that orodispersible tablets should disintegrate within 3 min (180 s). In our dataset, the disintegration time of OFDF ranges from 0 to 100 s. A prediction is usually considered successful if the error between the predicted time and the experimental time is not higher than 10 s. The accuracy of the disintegration time prediction (Eq. (2)) is the percentage of successful predictions among all predictions:

$$\text{Accuracy} = \frac{\text{number of predictions with } |\hat{y} - y| \leq 10\ \text{s}}{\text{number of all predictions}} \times 100\% \tag{2}$$

where $\hat{y}$ is the predicted value and $y$ is the experimental value.
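The two accuracy criteria can be sketched in a few lines of Python. The f2 expression used here is the standard similarity factor from FDA dissolution guidance, f2 = 50 * log10(100 / sqrt(1 + mean squared difference)), with release values in percent. The function names are illustrative, and the sketch assumes that experimental and predicted profiles are sampled at the same time points.

```python
import math

def f2_similarity(reference, test):
    """Similarity factor f2 between two dissolution profiles (values in %)."""
    if not reference or len(reference) != len(test):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_sq_diff = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return 50.0 * math.log10(100.0 / math.sqrt(1.0 + mean_sq_diff))

def dissolution_accuracy(experimental_profiles, predicted_profiles, threshold=50.0):
    """Eq. (1): percentage of predicted profiles with f2 >= threshold."""
    hits = sum(
        f2_similarity(e, p) >= threshold
        for e, p in zip(experimental_profiles, predicted_profiles)
    )
    return 100.0 * hits / len(experimental_profiles)

def disintegration_accuracy(experimental_s, predicted_s, tolerance_s=10.0):
    """Eq. (2): percentage of predictions within tolerance_s seconds."""
    hits = sum(abs(p - e) <= tolerance_s for e, p in zip(experimental_s, predicted_s))
    return 100.0 * hits / len(experimental_s)

# Hypothetical values, for illustration only.
identical = f2_similarity([20.0, 40.0, 60.0, 80.0], [20.0, 40.0, 60.0, 80.0])
shifted = f2_similarity([20.0, 40.0, 60.0, 80.0], [30.0, 50.0, 70.0, 90.0])
film_accuracy = disintegration_accuracy([10.0, 20.0, 30.0, 40.0], [15.0, 35.0, 30.0, 52.0])
```

A uniform 10-point offset at every time point gives an f2 just below 50, so the f2 threshold of 50 corresponds roughly to the "less than 10% error" requirement stated above.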

Results and discussion

Deep learning is a type of representation learning with multiple levels of transformation modules; it contains more parameters than other learning algorithms and requires more data for training. However, one of the main difficulties in pharmaceutical formulation prediction is the small dataset with an imbalanced input space caused by the limited experimental data. Each dosage form has only around 140 formulations. There are 13 APIs in the OFDF dataset and 29 APIs in the SRMT dataset, but nearly half of the APIs have fewer than four formulations. Therefore, selecting representative datasets for training and testing is very important for formulation prediction. In our research, specific evaluation criteria were introduced and several data splitting methods were investigated. Moreover, deep learning was compared with other machine learning techniques for formulation prediction.

Random data splitting

Thirty percent of the data were randomly selected as the validation set, with the remaining data used as the training set. This procedure was repeated 1000 times. However, the accuracy results differed widely between repetitions: the maximum variation in accuracy was more than 40%, and the average accuracy was less than 60%. In the whole dataset, nearly half of the APIs have fewer than four formulations. Therefore, the random data splitting algorithm has a nearly 50% probability of selecting these APIs with few formulations, which may lead to low prediction accuracy and high variation. In short, the random selection algorithm is not suitable for our research, and a new approach needs to be developed to select representative data.
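The repeated random hold-out procedure can be sketched as follows. This is an illustrative harness only: a 1-nearest-neighbour predictor on synthetic one-dimensional data stands in for the actual models and formulation data, and every name and number in it is an assumption.

```python
import math
import random

def nearest_neighbor_predict(train, x):
    """Predict with the target of the closest training point (1-NN, 1-D input)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def tolerance_accuracy(train, held_out, tolerance):
    """Percentage of held-out points predicted within the given tolerance."""
    hits = sum(
        abs(nearest_neighbor_predict(train, x) - y) <= tolerance for x, y in held_out
    )
    return 100.0 * hits / len(held_out)

def repeated_random_holdout(data, holdout_fraction=0.3, repeats=1000,
                            tolerance=10.0, seed=0):
    """Return (min, mean, max) accuracy over repeated random splits."""
    rng = random.Random(seed)
    n_holdout = max(1, round(holdout_fraction * len(data)))
    accuracies = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        held_out, train = shuffled[:n_holdout], shuffled[n_holdout:]
        accuracies.append(tolerance_accuracy(train, held_out, tolerance))
    return min(accuracies), sum(accuracies) / len(accuracies), max(accuracies)

# Synthetic data, purely illustrative (not formulation data).
_data_rng = random.Random(1)
data = [
    (i / 10.0, 50.0 + 40.0 * math.sin(i / 10.0) + _data_rng.gauss(0.0, 8.0))
    for i in range(60)
]
lowest, mean, highest = repeated_random_holdout(data)
print(f"min={lowest:.1f}%  mean={mean:.1f}%  max={highest:.1f}%")
```

The gap between the minimum and maximum accuracy over the repeats is the quantity that the text reports as exceeding 40% on the real data.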

Manual data splitting

In the manual dataset selection, formulation experts picked 20 representative formulations as the validation set for each dosage form. All prediction accuracies on both the training set and the validation set were greater than 90%. However, the manual selection method requires the domain knowledge of experts, is not suitable for large datasets, and may vary across experts. Therefore, a selection algorithm should be developed to select the validation set automatically.

Maximum dissimilarity for data splitting

Previous research showed that a rational selection algorithm can generate better statistical results for the validation set than random selection. Another study indicated that the maximum dissimilarity algorithm was able to select representative test data of compounds from chemical databases. The original maximum dissimilarity algorithm is available in the caret package for the R language. In our research, the maximum dissimilarity algorithm was also used to select the validation set. However, test results showed that the maximum dissimilarity algorithm did not work well on our data: the accuracies on the validation set were only 83.46% for OFDF and 78.85% for SRMT. Analysis of the splitting results showed that the maximum dissimilarity algorithm preferred to select data from a) the formulations in small API groups, b) boundary formulations, and c) formulations with extraordinary values. The likely reason is that the small API group data, the boundary data and the abnormal data have larger dissimilarity values than the other data. Moreover, the maximum dissimilarity algorithm uses a randomly generated initial set to compute the degree of dissimilarity, which makes the result highly variable and not robust on a small dataset. Therefore, the original maximum dissimilarity algorithm should be improved to select representative formulation data.

MD-FIS algorithm for data splitting

A new algorithm was developed in the R language for selecting the most representative data to validate the models. Fig. 1 shows the improved Maximum Dissimilarity algorithm with the small group filter and representative initial set selection (MD-FIS). The MD-FIS algorithm contains 3 steps. In step 1, the data pass through a filter that removes the small API group data. In step 2, the MD-FIS algorithm randomly generates 10,000 initial datasets, computes the similarity values between each initial dataset and the remaining data, and chooses the initial set with the highest similarity value as the final initial set. In step 3, the final initial set and the remaining dataset are used as the input to the dissimilarity algorithm with a new cost function. Unlike the original cost function, the new cost function includes not only the distances (originalDistance) between the candidate data and the initial set, but also the distances (subMeanDistance) between the candidate data and the remaining data in the same API group. A coefficient in the new cost function controls the relative proportion of these terms [equation not recoverable from the source text], and the maximum dissimilarity algorithm selects the data with the maximum cost. The new cost function prevents the selection of boundary data. The result was much better than that of the original maximum dissimilarity algorithm. The prediction accuracies were 95.57% for OFDF and 82.02% for SRMT on the validation set.
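The three steps can be sketched in Python as follows. This is not the authors' R implementation. In particular, the form of the cost (originalDistance minus alpha times subMeanDistance), the similarity score used in step 2, the use of the minimum distance to the already selected set, the group-size threshold and all default parameter values are assumptions chosen only to match the verbal description.

```python
import math
import random
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance(point, others):
    return sum(euclidean(point, o) for o in others) / len(others) if others else 0.0

def filter_small_groups(records, min_group_size):
    """Step 1: keep indices of records whose API group is large enough."""
    groups = defaultdict(list)
    for index, (api, _) in enumerate(records):
        groups[api].append(index)
    return [i for members in groups.values() if len(members) >= min_group_size
            for i in members]

def pick_initial_set(records, candidates, set_size, n_trials, rng):
    """Step 2: among random initial sets, keep the one most similar to the rest."""
    if len(candidates) <= set_size:
        raise ValueError("not enough candidates for the requested initial set")
    best, best_score = None, -math.inf
    for _ in range(n_trials):
        initial = rng.sample(candidates, set_size)
        rest = [i for i in candidates if i not in initial]
        init_points = [records[i][1] for i in initial]
        # Assumed similarity: negative mean distance from the rest to the set.
        score = -sum(mean_distance(records[i][1], init_points) for i in rest) / len(rest)
        if score > best_score:
            best, best_score = initial, score
    return best

def md_fis_select(records, n_select, min_group_size=4, set_size=3,
                  n_trials=200, alpha=0.5, seed=0):
    """Step 3: greedy maximum-dissimilarity selection with a penalised cost."""
    rng = random.Random(seed)
    candidates = filter_small_groups(records, min_group_size)
    selected = pick_initial_set(records, candidates, set_size, n_trials, rng)
    pool = [i for i in candidates if i not in selected]

    def cost(i):
        api, point = records[i]
        original = min(euclidean(point, records[j][1]) for j in selected)
        same_api = [records[j][1] for j in candidates
                    if j != i and records[j][0] == api]
        # ASSUMED form: penalise points far from their own API group (boundary data).
        return original - alpha * mean_distance(point, same_api)

    while len(selected) < n_select and pool:
        chosen = max(pool, key=cost)
        selected.append(chosen)
        pool.remove(chosen)
    return selected

# Synthetic records (API label, feature vector); group "C" is a small group.
records = ([("A", (float(i), 0.0)) for i in range(6)]
           + [("B", (float(i), 5.0)) for i in range(6)]
           + [("C", (20.0, 20.0)), ("C", (21.0, 20.0))])
selected = md_fis_select(records, n_select=5)
```

With the subtraction, a candidate that lies far from the other formulations of its own API scores lower, which is one way to obtain the stated effect of avoiding boundary data.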

Figure 1

The workflow of the MD-FIS algorithm.

Comparison of deep learning and conventional machine learning methods

In this study, MLR, PLSR, SVM, ANN, RF, k-NN and deep learning models were developed on the formulation data. Here, the data were split into three datasets by applying the MD-FIS algorithm twice, without the need for personal expertise. In the prediction of SRMT, 4 models were built for the 4 time points (2, 4, 6 and 8 h) with each of the machine learning methods. The final accuracies, root mean squared errors (RMSE) and mean absolute errors (MAE) are shown in Tables 2 and 3. In the prediction of OFDF, the accuracies of all the models based on linear or nonlinear conventional machine learning methods only reach around 70% on the OFDF validation and test sets. The accuracies of the MLR model are lower than those of the other conventional machine learning models on the OFDF validation and test sets. In the prediction of SRMT, the conventional machine learning models made predictions with accuracies ranging from 25% to 55% on the SRMT validation and test sets, which is far from satisfactory for formulation development. In summary, none of the six conventional machine learning methods achieved sufficient performance for OFDF and SRMT formulation prediction.
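For reference, the two error measures reported alongside accuracy can be computed as in the short sketch below; the function names and the sample values are illustrative.

```python
import math

def rmse(experimental, predicted):
    """Root mean squared error between paired experimental and predicted values."""
    squared = [(p - e) ** 2 for e, p in zip(experimental, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(experimental, predicted):
    """Mean absolute error between paired experimental and predicted values."""
    return sum(abs(p - e) for e, p in zip(experimental, predicted)) / len(experimental)

# Two predictions with errors of 3 and 4 units.
example_rmse = rmse([0.0, 0.0], [3.0, 4.0])
example_mae = mae([0.0, 0.0], [3.0, 4.0])
```

RMSE weights large errors more heavily than MAE, so the RMSE of a set of predictions is never smaller than its MAE.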

Table 2

Results of the conventional machine learning models and the deep neural network on the OFDF training, validation and test sets.

Machine learning technique	Training set			Validation set			Test set
	Accuracy (%)	RMSE	MAE	Accuracy (%)	RMSE	MAE	Accuracy (%)	RMSE	MAE
MLR	90.11	0.0671	0.0508	60.00	0.1311	0.0999	65.00	0.1778	0.1183
PLSR	76.92	0.0917	0.0705	70.00	0.1136	0.0835	70.00	0.0970	0.0705
SVM	79.12	0.1136	0.0711	70.00	0.1308	0.0959	75.00	0.1039	0.0795
ANN	74.73	0.1140	0.0809	70.00	0.1105	0.0846	70.00	0.0959	0.0772
RF	84.62	0.0775	0.0567	80.00	0.0917	0.0721	70.00	0.1068	0.0774
k-NN	80.22	0.0975	0.0649	75.00	0.1025	0.0727	75.00	0.0877	0.0608
DNN	97.80	0.0420	0.0307	80.00	0.0842	0.0705	80.00	0.0714	0.0565

Table 3

Results of the conventional machine learning models and the deep neural network on the SRMT training, validation, and test sets.

Machine learning technique	Training set			Validation set			Test set
	Accuracy (%)	RMSE	MAE	Accuracy (%)	RMSE	MAE	Accuracy (%)	RMSE	MAE
MLR	52.38	0.1356	0.1031	35.00	0.1212	0.1042	25.00	0.2182	0.1685
PLSR	55.24	0.1446	0.1066	55.00	0.1175	0.0961	45.00	0.1609	0.1203
SVM	60.95	0.1568	0.1013	50.00	0.1170	0.0960	45.00	0.1559	0.1147
ANN	57.14	0.1330	0.0998	50.00	0.1389	0.1137	50.00	0.1497	0.1124
RF	76.19	0.0975	0.0692	55.00	0.1308	0.1045	55.00	0.1170	0.0908
k-NN	64.76	0.1229	0.0825	45.00	0.1526	0.1264	40.00	0.1565	0.1306
DNN	99.05	0.0335	0.0237	80.00	0.0967	0.0660	80.00	0.0902	0.0673

Here, the training, validation and test sets for training the DNNs are the same as the datasets used for training the conventional machine learning models. A multi-label model was built for the 4 time points (2, 4, 6 and 8 h) using deep learning techniques. As shown in Tables 2 and 3, all prediction accuracies of the deep neural networks were 80% or higher, which could satisfy the requirements of formulation prediction. In both OFDF and SRMT predictions, deep learning obtained the highest accuracies on the training, validation and test sets (tied with RF at 80.00% on the OFDF validation set). Deep learning surpassed the conventional machine learning methods because its multiple hidden layers can transform low-level representations into higher-level features without manual feature engineering. In SRMT prediction, deep learning improved greatly over the other machine learning methods. The result indicates that deep learning can greatly improve model accuracy in multi-label formulation prediction, because deep learning can leverage the information shared among the multiple tasks. Fig. 2 shows the experimental and the deep learning predicted disintegration times of the formulations in the OFDF test set. Table 4 lists the f2 values between the experimental and the deep learning predicted cumulative drug release curves of the formulations in the SRMT test set. From these figures and tables, it is clear that the prediction performance of deep learning is satisfactory.

Figure 2

Comparing the experimental- and the deep learning-predicted disintegration time of the formulations in the OFDF test set.

Table 4

f2 values between the experimental and the deep learning predicted cumulative drug release curves of the formulations in the SRMT test set.

Formulation	f₂ value	Formulation	f₂ value
1	77.42	11	65.72
2	63.35	12	90.05
3	64.84	13	57.05
4	67.21	14	41.91
5	59.75	15	55.06
6	50.85	16	65.84
7	77.77	17	51.08
8	30.39	18	49.57
9	44.56	19	64.42
10	74.47	20	59.35
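As a worked check, the f2 values listed in Table 4 can be combined with the f2 ≥ 50 criterion to recompute the SRMT test-set accuracy; the variable names below are illustrative.

```python
# f2 values for formulations 1 to 20, copied from Table 4.
f2_values = [
    77.42, 63.35, 64.84, 67.21, 59.75, 50.85, 77.77, 30.39, 44.56, 74.47,
    65.72, 90.05, 57.05, 41.91, 55.06, 65.84, 51.08, 49.57, 64.42, 59.35,
]

# A prediction counts as successful when f2 is at least 50.
successful = sum(value >= 50.0 for value in f2_values)
test_accuracy = 100.0 * successful / len(f2_values)
print(successful, test_accuracy)
```

Sixteen of the 20 formulations meet the criterion (formulations 8, 9, 14 and 18 do not), which gives 80%, in agreement with the DNN test-set accuracy reported in Table 3.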

Figures 3, 4, 5 and 6 show the relationship between the experimental and the deep learning predicted results on the OFDF and SRMT training, validation and test sets. It can be seen from these figures that the experimental and the deep learning predicted values are very close.

Figure 3

Relationship between the experimental- and the deep learning-predicted values of the disintegration time on the OFDF training, validation and test sets. The dotted line indicates experimental values ±10 s.

Figure 4

Relationship between the experimental- and the deep learning-predicted values of the cumulative drug release percentages at 2, 4, 6, and 8 h on the SRMT training set. A is for the values at 2 h, B is for the values at 4 h, C is for the values at 6 h, D is for the values at 8 h.

Figure 5

Relationship between the experimental- and the deep learning-predicted values of the cumulative drug release percentages at 2, 4, 6, and 8 h on the SRMT validation set. A is for the values at 2 h, B is for the values at 4 h, C is for the values at 6 h, D is for the values at 8 h.

Figure 6

Relationship between the experimental- and the deep learning-predicted values of the cumulative drug release percentages at 2, 4, 6, and 8 h on the SRMT test set. A is for the values at 2 h, B is for the values at 4 h, C is for the values at 6 h, D is for the values at 8 h.

Multiple linear regression attempts to learn a linear combination of the input features that predicts the output. Multiple linear regression is simple and easy to model. Its weights and biases can be calculated directly by the least squares method. Multiple linear regression has better interpretability than nonlinear machine learning models because the weights indicate the importance of the input features in the prediction. However, multiple linear regression and partial least squares regression can only fit linear mappings, whereas the relationship between the formulation and the key in vitro characteristics is clearly complex and non-linear.

Random forest is an ensemble learning method. Ensemble learning methods, which combine multiple base learners, can generalize better than a single base learner. Random forest usually performs better than other ensemble learning models in many learning tasks. In random forest, the diversity of the base learners comes not only from sample perturbation but also from attribute perturbation, which increases the difference between the base learners and further improves the generalization ability. The support vector machine maps the samples from the original space to a higher dimensional feature space, so that the samples can be separated in that feature space.

However, the conventional machine learning methods rely heavily on feature extractors designed from subjective expert experience. Furthermore, the representational capacity of artificial neural networks grows with the number of hidden layers and hidden nodes. The larger the model capacity, the more complex the function the model can represent. Therefore, deep learning, which contains more hidden layers, can perform multiple levels of abstraction and feature extraction, and can accomplish more complex tasks with higher accuracy than shallow artificial neural networks.

Deep learning in formulation prediction

One main difficulty of formulation prediction is the lack of reliable and standard formulation data. The long experimental cycle and high cost of formulation development results in the small data set in this area. Moreover, current formulation experiments focus on a small number of model drugs, which lead to highly imbalanced data space and further raise the difficulty of formulation prediction for other drugs. This fundamental issue was reported in previous research, in which the data sets are too small or too noisy. To solve the issue, 10-fold cross validation was used for assessing the performance of the algorithm, which the R2 value only reached 0.67 or 0.6930., 44.. The prediction is even weaker in smaller data set because the small data set easily results in overfitting and poor generalizations. Previous suggestions were to increase the size of data sets, but there are very limited formulation data due to experimental limitations. Euclidean distance was used to estimate the domain of applicability of the trained model. Another study indicated that maximum dissimilarity algorithm was able to select representative testing data of compounds from chemical databases. However, both methods were good for the drug molecules, not for highly complex formulation data. Deep learning with usual data selection algorithms and evaluation criteria is difficult to accurately predict from the small amount of formulation data with imbalanced input space. Therefore, the MD-FIS algorithm is suitable for splitting formulation data with small sample size and imbalanced input space. In addition, two common evaluation metrics for regression problems (e.g., the correlation coefficient and coefficient of determination) cannot reflect the performance of pharmaceutical prediction models. Specific criteria need to be introduced for evaluating the model performance. Currently, only two deep learning approaches have been reported for formulation prediction25., 44.. 
In Zawbaa's research, 68 poly-lactide-co-glycolide formulations were used. The input vector contained 320 features and 745 release data points at specific time points. Initially, feature selection models were developed to reduce the number of input variables. Subsequently, seven machine learning approaches were compared, including cubist, RF, ANNs, multivariate adaptive regression splines, classification and regression trees, and hybrid systems of fuzzy logic and evolutionary computation. The final results showed that RF had the best performance, with a coefficient of determination of 0.692. Moreover, the model's predictions were suitable for only two-thirds of the formulations, and the remaining one-third had high error. The final model was strongly not recommended for the dissolution profiles of 4 of the 14 proteins. Formulation development involves high-dimensional factors, such as drug diversity, excipient types, drug/excipient ratios, dosage forms, manufacturing processes and multiple characteristics, which lead to highly complex data. In fact, deep learning can automatically learn high-level features from data without an explicit feature selection model, so combining feature extractors with a deep learning model is unnecessary and inappropriate. That research therefore still needs significant methodological improvement for formulation prediction.

Recently, deep learning with MD-FIS was applied to formulation prediction of oral fast disintegrating tablets. The results showed that deep learning predicted better than ANN because it could extract features automatically without hand-designed feature extractors. In addition, several studies have applied machine learning methods to formulation design, as summarized in Table 1. For example, an expert system was constructed to predict formulations of oral disintegrating tablets with 15 input parameters [12, 13]. Zhang et al. [5, 14] built an expert system with ANN for the formulation development of oral osmotic pump controlled release tablets. QSAR models were developed to predict the binding affinity of β-cyclodextrin and sulfobutylether-β-cyclodextrin complexation. However, these results indicated that conventional machine learning methods need feature extraction before the prediction models and did not show good prediction performance.
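The contrast drawn in this section between a single coefficient of determination and the share of formulations that are predicted acceptably can be illustrated with a small sketch. It computes R² alongside a tolerance-based success rate on made-up data. The function names, the toy values, and the tolerance of 5 units are all assumptions chosen for illustration; they are not the evaluation criteria or data of any study discussed here.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fraction_within_tolerance(y_true, y_pred, tol):
    """Share of predictions whose absolute error is at most `tol`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return hits / len(y_true)


# Hypothetical measured vs. predicted values (toy data, arbitrary units)
y_true = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
y_pred = [12.0, 18.0, 31.0, 55.0, 49.0, 45.0]

r2 = r_squared(y_true, y_pred)
ok = fraction_within_tolerance(y_true, y_pred, tol=5.0)
print(f"R^2 = {r2:.3f}, within tolerance = {ok:.2f}")
```

In this toy case four of the six predictions fall within the tolerance while two miss by a wide margin, a pattern that the single R² figure does not reveal on its own.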

Conclusions

In this paper, deep learning models were successfully developed to predict pharmaceutical formulations from small data sets. The good generalization performance of the models was demonstrated on external datasets. The proposed models predicted the key characteristics in regression problems more effectively than models trained by other machine learning methods, because deep learning can capture complex correlations in the data. Successful modern pharmaceutical development needs to incorporate quality by design (QbD) concepts throughout the drug development process. Machine learning methods could not only help to predict in vivo and in vitro characteristics from formulation and process data, but also assist in pharmaceutical experimental design and help to control product quality over the whole product cycle. Deep learning shows great potential in the implementation of QbD. We expect deep learning to significantly shorten the drug product development timeline and decrease material usage. Furthermore, the cross-disciplinary integration of pharmaceutics and artificial intelligence may shift the paradigm of pharmaceutical research from experience-dependent studies to data-driven methodologies. In the future, our laboratory will investigate other machine learning methods (e.g., transfer learning) for formulation prediction to achieve better performance.
