
Digital Pharmaceutical Sciences.

Abstract

Artificial intelligence (AI) and machine learning, in particular, have gained significant interest in many fields, including pharmaceutical sciences. The enormous growth of data from several sources, the recent advances in various analytical tools, and the continuous developments in machine learning algorithms have resulted in a rapid increase in new machine learning applications in different areas of pharmaceutical sciences. This review summarizes the past, present, and potential future impacts of machine learning technologies on different areas of pharmaceutical sciences, including drug design and discovery, preformulation, and formulation. The machine learning methods commonly used in pharmaceutical sciences are discussed, with a specific emphasis on artificial neural networks due to their capability to model the nonlinear relationships that are commonly encountered in pharmaceutical research. The use of AI and machine learning technologies for common day-to-day pharmaceutical needs, as well as industrial and regulatory insights, is reviewed. The potential of machine learning-based digital technologies beyond these traditional applications, aimed at developing more efficient, faster, and more economical solutions in pharmaceutical sciences, is also discussed.

Entities: Chemical

Keywords: artificial intelligence; artificial neural networks; machine learning; pharmaceutical industry; pharmaceutical sciences


Year: 2020 PMID： 32715351 PMCID： PMC7382958 DOI： 10.1208/s12249-020-01747-4

Source DB: PubMed Journal: AAPS PharmSciTech ISSN： 1530-9932 Impact factor: 3.246

INTRODUCTION: BIG DATA IN PHARMACEUTICAL SCIENCES

There has been a remarkable increase in the amount of data, including pharmaceutical data, that are generated each day. The term “big data” has gained increasing interest in various research areas. In addition, data-driven companies currently show how various industries are able to profit from the massive generation of data. Several definitions have been proposed for the term “big data.” One widely recognized definition is the “4 Vs” definition. It builds on the “3 Vs” first proposed by Douglas Laney, namely volume, velocity, and variety (1,2), which IBM later extended with a fourth “V” for veracity (3). However, the reported definitions of “big data” usually lack consistency and quantification. Because of its potential value, data has been described as the new oil (4,5). Textbooks and publications, social media, user-generated content, electronic health records, genomics, sensor networks, and many other types of data all form “big data” and contribute to its diversity and complexity. The remarkable increase in the amount of data can be attributed to advancements in data storage and innovative technologies (6). Almost 2.5 million new scientific papers are published annually (7). In addition, there were more than 15,000 PubMed-reported publications on “pharmaceutical sciences” in 2019 alone (8). Thus, “big data” in pharmaceutical sciences can be viewed as both a challenge and an opportunity. The evolution of artificial intelligence (AI), particularly machine learning technologies in which computers can “learn” and perform tasks, has improved the potential of using big data in pharmaceutical sciences. The scope of this review is specific to machine learning because, among all AI branches, machine learning is currently the most widely used AI technology in the field of pharmaceutical sciences.
Other AI fields, such as natural language processing (NLP), expert systems, and robotics, are becoming very popular in many healthcare settings, such as in the diagnosis of diseases, patient monitoring, and robotic surgeries (9,10). These methods, however, have not yet received as much attention as machine learning in pharmaceutical sciences settings. The aim of this review is to summarize the past, present, and potential future impacts of machine learning on different areas of pharmaceutical sciences, including drug design and discovery, preformulation, and formulation. This review covers different machine learning algorithms that are commonly implemented in different areas of pharmaceutical sciences, with a special emphasis on the use of artificial neural networks (ANNs). Notably, compared with other machine learning methods, ANNs have displayed superior performance in various pharmaceutical settings, as will be discussed in the following sections.

PRINCIPLES OF AI AND MACHINE LEARNING

Despite the long history of AI, as will be discussed below, there is still no standard definition of the term. However, the basic concept of AI is mimicking human intelligence using computer systems. The physiology and function of neurons in the brain inspired Warren McCulloch and Walter Pitts (1943) to propose a computational model of artificial neurons. Similar to biological neurons, artificial neurons are characterized by being “on” or “off” in response to sufficient stimulation from neighboring neurons (11). The term “artificial intelligence” was officially introduced by John McCarthy at the Dartmouth conference in the summer of 1956 (12). Since then, AI has had cycles of success as well as so-called “AI winters” (13). Recently, AI has significantly advanced and gained increasing interest in a wide range of fields, including healthcare (14), engineering (15), and transportation (16). This increased focus on AI applications has been fueled by the growing availability of big data in healthcare and the rapid advancement of numerous analytical techniques (10). Machine learning is a popular AI technique (Fig. 1) whereby computers adapt or modify their actions (e.g., making predictions) so that these actions become more accurate. Machine learning algorithms can be classified into two major categories: supervised learning and unsupervised learning (17). In supervised learning, the algorithm is given a set of training examples and generalizes from them to respond appropriately to new inputs. Training examples are input-output pairs that are provided in the dataset to be learned. Because the output data are known to be the correct responses (or correct answers), they are termed “targets.” The machine learning model ultimately aims to predict outputs that are as close as possible to the targets.
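The “on”/“off” behavior of an artificial neuron described above can be illustrated with a minimal sketch. The function name, the unit weights, and the thresholds below are assumptions chosen for illustration only; they are not taken from the cited work.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 ("on") if the weighted stimulation reaches the threshold, else 0 ("off")."""
    stimulation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if stimulation >= threshold else 0


# All four combinations of two binary inputs from "neighboring neurons".
input_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# With unit weights and a threshold of 2, both inputs must be active (logical AND).
and_outputs = []
for a, b in input_pairs:
    and_outputs.append(threshold_neuron([a, b], [1, 1], threshold=2))

# Lowering the threshold to 1 means a single active input is sufficient (logical OR).
or_outputs = []
for a, b in input_pairs:
    or_outputs.append(threshold_neuron([a, b], [1, 1], threshold=1))
```

Here `and_outputs` is `[0, 0, 0, 1]` and `or_outputs` is `[0, 1, 1, 1]`: the same unit changes behavior only through its threshold, which is the kind of quantity that learning algorithms adjust.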

Fig. 1

Schematic showing the relationship between AI, machine learning, and artificial neural networks (left), and a number of applications of artificial neural networks in pharmaceutical sciences (right)

Examples of supervised machine learning methods include regression analysis, support vector machines (SVMs), random forests (RF), and ANNs. Unsupervised learning, in which no correct responses (targets) are provided, is based on feature extraction methods (17), such as principal component analysis (PCA). Some methods that are mainly supervised, such as SVMs and ANNs, can also be used for unsupervised learning (18). Table I shows a comparison of several machine learning methods commonly used in pharmaceutical research. Linear regression, ANNs, K-nearest neighbor (KNN), SVM, decision tree (DT), and RF are common machine learning methods used in pharmaceutical sciences; PCA is an unsupervised dimensionality reduction technique that transforms unlabeled data to find a lower-dimensional set of axes (12). Although PCA may be considered a statistical technique for analyzing multidimensional data, it is usually incorporated as a preprocessing tool in machine learning (19).
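To make the idea of finding a lower-dimensional set of axes concrete, the following is a minimal sketch of PCA restricted to the first principal axis. It uses power iteration on the covariance matrix, which is one possible way to obtain that axis and is an assumption of this sketch rather than a method described in the review; the function names and the toy data are likewise hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the first principal axis of `rows` (a list of equal-length lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the mean-centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, axis):
    """Reduce each row to a single score along the given axis."""
    return [sum((r[j] - means[j]) * axis[j] for j in range(len(axis))) for r in rows]


# Toy two-variable data (made up for illustration) lying close to the line y = 2x.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
means, axis = first_principal_component(data)
scores = project(data, means, axis)  # one value per sample instead of two
```

No targets are used anywhere in this computation, which is what makes it unsupervised; the resulting scores could then serve as inputs to a supervised model.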

Table I

Comparison of Different Machine Learning Methods Commonly Used in Pharmaceutical Research* (12,17,25–33)

Machine learning method	Learning algorithm	Machine learning model	Learning problem	Dataset size	Advantages	Disadvantages/limitations
Linear regression	Supervised	Parametric	Regression	Varies (25)	Easy implementation	Applicable only for linear modeling
Artificial Neural Networks (ANNs)	Supervised and unsupervised	Parametric	Classification and regression	Large	Modeling complex nonlinear relationships	Overfitting/underfitting; relatively slow training time
K-Nearest Neighbor (KNN)	Supervised	Nonparametric	Classification and regression	Large (dependent on noise level in data)	Simple and easy to implement with a single pre-defined parameter (i.e., the number of nearest neighbors)	Intolerant of noise
Support vector machine (SVM)	Supervised and unsupervised	Nonparametric	Classification and regression	Small	Able to represent complex functions; offers resistance to overfitting	Relatively slow training time; high complexity of the model; long computing time
Decision tree (DT)	Supervised	Nonparametric	Classification and regression	Small	Easier than RF; can deal with noisy and missing data; fast	Unstable: its performance can be affected by slight variations in the training data
Random forest (RF)	Supervised	Nonparametric	Classification and regression	Large	Similar to decision trees, with the additional capability to overcome overfitting	Complex
Principal Component Analysis (PCA)	Unsupervised	Feature extraction	Classification and regression	Large	Reduces the dimensionality of multivariate data while maintaining the relevant information in the original dataset	Assumes a Gaussian distribution of the data, which might limit its use if the data are not normally distributed (e.g., gene expression data)
LightGBM**	Supervised	Nonparametric	Classification and regression	Large	Fast training speed; high efficiency and accuracy	Sensitive to overfitting

*Note that this table presents a general comparison of the different machine learning methods commonly used in drug research and development. Different methods may have relative variations compared with other methods

**LightGBM is an emerging machine learning method that has recently been implemented in pharmaceutical sciences

In addition, other machine learning methods are used in pharmaceutical sciences, such as the fuzzy logic algorithm. In this method, reasoning with logical expressions is used to describe membership in fuzzy sets (12). This method has the advantages of eliminating the need for expert knowledge regarding the system, accounting for noise in the data, and producing easily interpretable predictions (20). The fuzzy logic algorithm has provided good prediction models for analyzing gene expression data (20). Additionally, the genetic algorithm (GA) is a population-based method commonly used as an optimization technique. This algorithm also offers the advantage of modeling nonlinear relationships. In pharmaceutical research, GA is mainly used in quantitative structure-activity relationship (QSAR) studies as a feature selection tool (21,22). Recently, several novel machine learning applications have emerged in pharmaceutical settings using non-conventional machine learning techniques such as the light gradient boosting machine (LightGBM). This method offers several useful features compared with the classic machine learning methods, as shown in Table I. Another emerging machine learning technique is transfer learning.
Transfer learning is based on reusing a pre-trained model in order to build a new, improved model to address the intended target (23). In transfer learning, a relatively large dataset for the original model is an important determinant of optimum performance. Important recent progress in using transfer learning has been achieved in the field of pharmaceutical sciences (24), as will be discussed in a following section. Moreover, machine learning models can be classified into two categories: parametric and nonparametric models (Table I). Parametric models summarize data with a fixed number of parameters (regardless of the number of training examples), whereas in nonparametric models the number of parameters is not fixed and can grow with the number of training examples (12). Table I summarizes the common parametric and nonparametric machine learning methods encountered in different drug research and development studies. Note that each of these machine learning methods may have further subtypes, so a general comparison among them can be unfair. For example, although certain machine learning methods may require large datasets, an optimum dataset size is usually not defined. The reader is encouraged to refer to the cited references for details. Additionally, no machine learning method is generally considered superior to all others, and each problem (classification or regression) should be addressed individually.
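The distinction between parametric and nonparametric models can be sketched with two of the methods listed in Table I. The example below is illustrative only: the function names, the choice of k, and the toy data are assumptions, not values from any cited study.

```python
def fit_line(xs, ys):
    """Least-squares line: the data are always summarized by two parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def knn_predict(xs, ys, query, k=2):
    """K-nearest-neighbor regression: the 'model' is the stored training set itself."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k


# Toy training examples (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_line(xs, ys)      # parametric: two numbers, whatever len(xs) is
line_prediction = slope * 2.5 + intercept

knn_prediction = knn_predict(xs, ys, 2.5)  # nonparametric: needs every stored example
```

After fitting, the linear model can discard the training data and keep only `slope` and `intercept`, whereas the KNN prediction must consult the full training set, so what it stores grows with the number of training examples.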

Artificial Neural Networks

ANNs are biologically inspired computational models that mimic the brain’s ability to learn by example (Fig. 2). Our brains consist of billions of processing units called neurons. These neurons are interconnected through an enormous number of synapses that connect one neuron to another (34). A biological neuron consists of a cell body, which contains the nucleus and controls cell activities; dendrites, which are the fine threads that carry information to the cell; and an axon, which is a single long thread that transports information to the next cell (34).

Fig. 2

Schematic of a typical biological neuron (a), and an artificial neural network (b)

Similar to biological neurons, ANNs consist of artificial neurons or processing elements (PEs) that are connected via coefficients (weights) (34). A typical ANN (Fig. 2) consists of three main structural components: input, hidden, and output layers. The first layer of an ANN is the input layer, which corresponds to the dendrites of the biological neuron and transfers information to the next layer. The following layer is the hidden layer, which is the middle layer between the input layer and the output layer. The hidden layer connects these two layers through certain coefficients (weights). Each hidden layer consists of a number of neurons (also called nodes). The number of neurons in the hidden layer of an ANN is generally chosen by a trial-and-error approach (35). Although there is no definite number of neurons to be used, using too few neurons in the hidden layer may reduce the learning ability of the ANN, whereas using too many may result in memorization (overfitting) of the training data, ultimately decreasing the generalization ability of the ANN. Thus, the number of hidden neurons that gives the highest squared correlation coefficient (r²) and the lowest error (i.e., the minimum difference between observed and predicted values) should be selected for the optimal ANN. The final layer of an ANN is the output layer, which provides the outputs (the predictions of the targets). Moreover, by examining the magnitude of the ANN connection weights, ANNs can provide quantitative estimates of the relative importance of the input variables for the output in question (36). Figure 2 illustrates a schematic representation of a typical biological neuron (a) and an ANN (b).
The process of designing a neural network that can learn to ultimately solve a problem occurs through iterative use of examples with known answers (targets). This process is called learning or training. The learning/training process, as illustrated in Fig. 2, starts with receiving signals (inputs) from the input layer. These inputs are multiplied by connection weights and summed in the hidden layer. The results are then passed through a transfer (activation) function and sent to the output layer. Several activation functions are available, including identity, logistic, tanh, and exponential functions (17,37). The sigmoid function is a commonly used activation function in pharmaceutical research. During neural network learning, a process called “error back-propagation” is usually implemented (38). In back-propagation, the weights are adjusted to minimize the error between the calculated (predicted) output and the observed (target) output. ANNs are particularly powerful in modeling nonlinear relationships and can make highly accurate predictions due to their ability to analyze complex data, primarily based on generalization and pattern recognition (39,40). Nevertheless, some challenges can be encountered when using ANNs, such as becoming trapped in local minima, controlling noise, and overfitting/underfitting. To avoid local minima and control noise, a time-invariant noise algorithm (TINA) can be implemented (41). In addition, there are various ways to overcome overfitting/underfitting problems. For example, splitting the data into training and validation sets can reduce overfitting (42). Moreover, stopping the training process at the right point can also prevent both overfitting and underfitting (17). Figure 3 illustrates the optimum stopping point for ANN training.
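The training steps described above (weighted sums, a sigmoid activation, and weight adjustment by error back-propagation) can be sketched for a network with one hidden layer. This is a minimal illustration under stated assumptions, not the procedure of any cited study: the squared-error loss, the learning rate, the network size, the random seed, and the toy dataset are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Weight and sum the inputs in the hidden layer, then pass the result to the output."""
    # The constant 1.0 appended to each layer's input acts as a bias term.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x + [1.0]))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden + [1.0])))
    return hidden, output


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One back-propagation update for a single (input, target) training example."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output (squared-error loss times the sigmoid derivative).
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error back to each hidden neuron before any weight is changed.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j, h in enumerate(hidden + [1.0]):
        w_out[j] -= lr * delta_out * h
    for j, ws in enumerate(w_hidden):
        for i, xi in enumerate(x + [1.0]):
            ws[i] -= lr * delta_hidden[j] * xi


def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)


random.seed(0)
n_inputs, n_hidden = 2, 3
w_hidden = [[random.uniform(-1, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
w_out = [random.uniform(-1, 1) for _ in range(n_hidden + 1)]

# Toy input-target pairs (logical OR), standing in for real training examples.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

error_before = mean_squared_error(data, w_hidden, w_out)
for _ in range(2000):
    for x, t in data:
        train_step(x, t, w_hidden, w_out)
error_after = mean_squared_error(data, w_hidden, w_out)
```

Comparing `error_before` with `error_after` shows the effect of repeatedly adjusting the weights toward the targets. In practice the error would also be tracked on a separate validation set, since a falling training error alone does not rule out overfitting.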

Fig. 3

An illustration of the effect of overfitting/underfitting of the data on the training and validation error curves showing the optimum point where the training/learning process of ANNs should stop
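The stopping point illustrated in Fig. 3 can be located programmatically once the validation error has been recorded after each training epoch. The sketch below uses a "patience" rule, a common convention that is assumed here and not described in the review; the function name and the error values are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Training is considered stopped once the validation error has failed to
    improve for `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error keeps rising: further training overfits
    return best_epoch


# Hypothetical validation errors: falling at first (underfitting), then rising (overfitting).
val_errors = [0.90, 0.60, 0.45, 0.38, 0.35, 0.36, 0.39, 0.44, 0.50]
stop_at = early_stopping_epoch(val_errors, patience=3)
```

With these values `stop_at` is 4, the epoch at which the validation curve reaches its minimum. The weights saved at that epoch would be the ones retained.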

From ANNs to Deep Learning

Deep learning (DL) is a machine learning technique that is also a representation learning method (43). State-of-the-art DL methods build on recent advances in neural networks. The major difference between conventional ANNs and DL is that DL networks include larger numbers of hidden layers (usually more than three), and each layer comprises many more nodes. Therefore, DL uses multiple levels of representation that can ultimately learn very complex functions. Generally, DL requires very large training sets, which may limit its use. There are different types of neural network architectures in DL, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and fully connected feed-forward networks, which have been comprehensively discussed elsewhere (44). DL has become very popular and has gained interest in diverse areas of pharmaceutical research, such as pharmaceutical formulation development (45), drug discovery (46), and drug repurposing (47). The predictive and generalization performance of DL methods is generally better than that of other machine learning methods, such as SVMs and RFs (45). This can be attributed to improvements in algorithms and computers and to the availability of large datasets. Specific DL applications in pharmaceutical sciences are an interesting topic for future reviews.

APPLICATIONS OF MACHINE LEARNING IN PHARMACEUTICAL SCIENCES

Machine learning has been utilized in different pharmaceutical applications from the early stages of drug discovery to the late phases of drug development. The following sections present three major areas of pharmaceutical sciences that have witnessed a considerable use of ANNs together with a number of other machine learning methods. These studies can be categorized into drug design and discovery, preformulation, and formulation studies.

Machine Learning in Drug Design and Discovery

Drug discovery accounts for a significant share of the machine learning applications in pharmaceutical sciences, mainly due to the use of high-throughput screening, combinatorial chemistry, and computer-aided drug design (45,48). One of the early areas in which ANNs were applied is QSAR studies (49–51). The QSAR approach correlates the physicochemical properties of a compound with the corresponding chemical or biological activities (52,53). The most commonly used physicochemical properties in QSAR studies include molecular weight, partition coefficient (logP), and hydrogen bonding capacity. Because QSAR studies usually involve complex and nonlinear characteristics, ANNs were among the best available QSAR modeling tools. Additionally, due to their usefulness and success, the importance of neural networks has continued to grow in drug discovery with the rapid rise of QSAR studies based on ANNs (54). Table II summarizes several input-output data used to build various machine learning models in different QSAR studies.

Table II

Summarization of Input-Output Data Used to Build Various Pharmaceutical Machine Learning Models in Different QSAR Studies

Machine learning method*	Learning algorithm	Dataset size	Inputs/descriptors	Output(s)/purpose	Reference
ANN Linear regression	Supervised	30	5 molecular descriptors: - water–accessible surface area - polar surface area - maximal electrostatic potentials - ovality - hydrophobicity (logP)	Prediction of tumor specificity of chemotherapeutic agents	(55)
ANN SVM DT RF	Supervised	89	10 molecular descriptors related to: - hydrophobicity - electronic features - topological features - protein-inhibitor interactions	Prediction of activity of HIV inhibitors	(56)
RBFNN KNN SVM RF	Unsupervised and supervised for RBFNN model and the other models, respectively	Two datasets: 128 (phenol dataset) 105 (ROCK dataset)	- For phenol datasets: 6 molecular descriptors (related 2D and 3D descriptors such as log P). - For ROCK datasets: 6 molecular descriptors (related 2D descriptors such as ring count)	Prediction of the biological activity of various phenols and Rho kinase (ROCK) inhibitors.	(57)
ANN Linear partial least squares (linear statistical method)	Supervised	36	4 molecular descriptors: - minimum bond dissociation enthalpy - electron transfer enthalpy - proton affinity - hydration energy	Prediction of antioxidant activity of flavonoids	(58)
RF ANN	Supervised	91	166 molecular descriptors including: - structure - topology - molecular connectivity index - geometric descriptors	Prediction of the carcinogenicity of polycyclic aromatic hydrocarbons	(59)
ANN	Supervised	33	6 molecular descriptors (related 2D and 3D descriptors)	Prediction of anti-malarial activity of imidazolopiperazine compounds	(60)
ANN SVM	Supervised	639	341 molecular descriptors related to: - simple constitutional - topological indices - electrotopological state indices - charge-based - hydrogen-bonding descriptors	Prediction of nephrotoxicity of traditional Chinese medicines ingredients	(61)

*The top-ranked machine learning methods in each of these studies demonstrated better predictive ability than the other machine learning methods tested. ANN artificial neural network, SVM support vector machine, DT decision tree, RF random forest, KNN K-nearest neighbor, RBFNN radial basis function neural network


Machine Learning in Pharmaceutical Preformulation

Preformulation is the stage of drug development in which the physicochemical properties of a drug substance are assessed. Determining the physicochemical properties of a drug substance is very important because it governs various parameters, such as its solubility, stability, interaction with excipients, and ultimately, bioavailability (62). Determining the aqueous solubility of a new drug substance is an essential first step in preformulation. Any drug to be absorbed must possess a certain degree of water solubility. This is true for oral, parenteral, ophthalmic, topical, and other routes of administration. Various solubilization techniques are used to improve the aqueous solubilities of drug substances, such as using surfactant, complexation, salt formation, using hydrotropes, or forming cocrystals (36,63,64). The in silico prediction of the aqueous solubility of drug substances has gained significant interest using different computational approaches, such as molecular dynamics simulations (65) and machine learning techniques (36). For example, Damiati et al. (2017) developed a machine learning application using ANNs to predict the solubility enhancement effect of several hydrotrope molecules. The input data consisted of experimental data together with 10 physicochemical properties (used as descriptors) related to 10 hydrotrope molecules at different hydrotrope concentrations. The physicochemical properties included logP, melting point, and hydrogen bonding capacity. The developed ANN model was subsequently used to predict the solubility enhancement of another 16 potential hydrotrope molecules from an external dataset. The trained model was also able to identify new prospective hydrotropes for the drug molecule. 
In addition to providing accurate predictions, by determination of the connection weights, the developed ANN was able to provide a quantitative assessment of the relative importance of various physicochemical properties that are required for a good hydrotrope (36). The reported use of ANNs in the prediction of solubility enhancements for drug substances and their successful use in other solubility applications in various research areas (66,67) are encouraging for further exploration of their potential uses in more pharmaceutical preformulation research. Moreover, based on the pharmacokinetic profile of a drug substance, a suitable pharmaceutical formulation can be designed. As stated earlier, important progress has been achieved in utilizing the emerging machine learning technique of transfer learning in pharmaceutical settings. Ye et al. (2019) developed an integrated transfer learning and multitask learning approach for the prediction of four pharmacokinetic-related properties, namely oral bioavailability (BA), plasma protein binding rate (PPBR), apparent volume of distribution at steady-state (VDss), and elimination half-life (HL). Eight molecular descriptors have been used for 1104 approved drug molecules. Descriptors included molecular weight, hydrogen bond donor count, hydrogen bond acceptor count, rotatable bond count, topological polar surface area, heavy atom count, complexity, and covalently bonded unit count. The developed model showed good performance and generalization ability compared to other conventional machine learning techniques including partial least-squares regression (PLSR), SVM, ANNs, RF, and KNN (24). In preformulation studies, transfer learning is a promising machine learning approach for further exploration.

Machine Learning in Pharmaceutical Formulations

Another stage of drug development is the formulation of pure drug substances into drug products to be administered to patients. ANNs have gained significant interest in this area and have become the most popular machine learning tool in pharmaceutical formulation prediction (45). Table III summarizes numerous pharmaceutical studies performed in the past 20 years that utilized ANNs (as the only method used or as an approach that outperformed other machine learning methods) in the area of pharmaceutical formulation development. This table compares these studies with respect to several machine learning aspects, including the input-output data used, the amount of data (dataset size), the input variables, and the purpose(s). Notably, a large number of these studies have utilized ANNs for the development and optimization of formulations and the prediction of formulation- and process-related factors associated with different parameters, such as drug dissolution and release. Additionally, the optimization of formulations (including the optimization of ingredients and/or operating conditions) using machine learning tools, particularly ANNs, has provided considerable success and displayed great promise for future applications that usually require fast and efficient manufacturing.

Table III

Summarization of Input-Output Data Used to Build Various ANN Models in Different Pharmaceutical Formulation Studies

Dataset size	Inputs/variables	Output(s)	Purpose	Reference
125	19 variables related to: - the composition of the formulations - the processing conditions	- Time taken for 10% of the drug to be released - Time taken for 90% of the drug to be released	Prediction of the most important formulation and processing variables contributing to the in vitro dissolution of sustained-release (SR) minitablets	(70)
Two datasets: 154 (for synthetic samples) 169 (for pharmaceutical samples)	- 5 principal components for synthetic samples - 6 principal components for pharmaceutical samples	Concentrations of 3 vitamins in synthetic and pharmaceutical samples	Prediction of vitamins in synthetic and pharmaceutical samples	(71)
30	3 input variables: - acid concentration - acid solution to chitin ratio - reaction time	Percentage production yield of glucosamine	Prediction of glucosamine production yield from chitin under various reaction conditions	(72)
180	4 input variables related to different formula ingredients: - Methocel® K100M - xanthan gum - Carbopol® 974P - Surelease®	In vitro dissolution time profiles at six different sampling times	Development and optimization of sustained-release salbutamol sulfate formulation	(73)
300	5 input variables related to 5 active ingredients and excipients (three physical–chemical properties of active ingredients in addition to two formulation factors): - solubility - mean particle size - specific surface area - the weight ratios of microcrystalline cellulose - the weight ratios of magnesium stearate	Tablet tensile strength and disintegration time before and after accelerated test	Prediction of responses to differences in quantities of excipients and physical–chemical properties of active ingredients in tablets	(74)
327	6 input variables related to 14 active ingredients: - melting point - solubility - specific surface area - mean particle size - size distribution - contents of APIs	- Tablet tensile strength - Disintegration time	Prediction of the contribution of different physicochemical properties of APIs to tablet properties	(75)
15	3 formulation factors: - weight ratio of drug to lipid - the concentration of polymer - the concentration of surfactant	- Drug loading efficiency - Mean particle size	Optimization of controlled-release nanoparticle formulation	(76)
45	3 input variables: - chitosan (Cs) concentration - pentasodium tripolyphosphate (TPP) concentration - mass ratio of Cs and TPP	- Nanoparticle size - Percentage yield	Optimization of formulation parameters of chitosan-tripolyphosphate nanoparticles	(77)
43	7 input variables: - alginate percentage - concentration of CaCl₂ solution in the emulsion - percentage of Tween™ 85 in the emulsion - percentage of Tween™ 85 in the receptor bath - flow rates of alginate - flow rates of emulsion - frequency of vibration	- Shape - Oil content - Oil distribution	Optimization of encapsulation of active pharmaceutical ingredients (API) for efficient delivery of hydrophobic compounds	(78)
20	3 input variables: - the amounts of drug (pilocarpine hydrochloride) - the amounts of bile salt (sodium deoxycholate) - the amounts of water	Entrapment efficiency	Optimization of ocular formulation of flexible nano-liposomes containing pilocarpine hydrochloride	(79)
16	3 input variables: - amount of oil - amount of surfactant - amount of co-surfactant	Minimal globule size	Optimization of self-emulsifying drug delivery system	(80)
8	2 formulation variables: - ratio of carrier to coating - type of solubilizing agent	Amount of API released in 10 min and 30 min	Development of a new liquisolid formulation	(81)
160	NIR and Raman spectral data for each of the 160 intact tablets	Dissolution of the tablets	Prediction of the in vitro dissolution of pharmaceutical tablets	(82)
29	4 formulation and process variables: - microcrystalline cellulose concentration - sodium starch glycolate concentration - spheronization time - extrusion speed	- Drug release (at 15 min, 30 min, 45 min, and 60 min) - Aspect ratio - Yield	Prediction of the effects of formulation and process variables on drug release	(83)
144	Amino acid composition of each monoclonal antibody and different formulation conditions (i.e., pH and salt concentrations)	- Melting temperature - Aggregation onset temperature - Interaction parameter	Prediction of biophysical properties of therapeutic monoclonal antibodies	(84)
32	4 input variables: - concentration of shell material - concentration of core material - type of shell material - type of core material	- Tensile strength - Brittleness index	Prediction of powder compactability of tablets using core/shell technique	(85)
646	24 variables related to: - formulation (including molecular weight, melting point, hydrogen bonding for both drug and polymer) - experimental conditions (including temperature, relative humidity, and storage time)	Stability results	Prediction of the physical stability of solid dispersions	(86)

Summarization of Input-Output Data Used to Build Various ANN Models in Different Pharmaceutical Formulation Studies

Recently, non-traditional machine learning techniques have been used to develop in silico predictive models in pharmaceutical formulation. LightGBM has recently shown high predictive ability compared with conventional machine learning methods in pharmaceutical formulation research. Zhao et al. (2019) compared lightGBM, RF, and DL for the prediction of the complexation free energy between cyclodextrins (CDs) and guest molecules with a dataset consisting of 3000 data points. More than 30 descriptors related to the guest molecule, CD, and experimental conditions were used in designing the machine learning models. LightGBM showed better prediction performance than the other models, including RF and DL (33). Gao et al. (2020) also implemented the lightGBM method to predict the complexation performance of 341 drug/phospholipid complex formulations described by over 40 molecular descriptors related to the properties of the drugs, solvents, and experimental conditions. Compared with conventional machine learning techniques such as SVM and DT, the lightGBM model showed the best performance in predicting drug/phospholipid complexation (68). Also in 2020, He and co-workers used lightGBM to predict the particle size and polydispersity index (PDI) of nanocrystals prepared by different methods. The dataset consisted of 910 experimental size data points and 341 PDI data points obtained under various conditions, with descriptors related to the API, the stabilizer, and the nanocrystal preparation method. For both the size and PDI datasets, the prediction performance of lightGBM was better than that of several other machine learning methods, including deep neural networks (DNN), SVM, and DT (69). In all these studies, lightGBM proved to be a powerful and promising machine learning technique that can be further explored for various pharmaceutical applications, not only because it provides accurate predictions but also because it provides an informative assessment of the importance of the input descriptors.
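To make the idea of a boosted model that also reports descriptor importance more concrete, the following is a minimal sketch of gradient boosting with one-split regression trees (stumps) in plain Python. It is not LightGBM itself, which uses histogram-based, leaf-wise tree growth, and it does not reproduce any of the cited studies. The synthetic data, the three unnamed descriptors, and all function names are assumptions made only for illustration. The importance readout counts how often each descriptor is chosen for a split, which is one common way such models rank inputs.

```python
import random

def fit_stump(X, residuals):
    """Return the (feature, threshold, left_mean, right_mean) split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def fit_boosting(X, y, n_rounds=30, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        pred = [pi + learning_rate * (lm if row[j] <= t else rm)
                for row, pi in zip(X, pred)]
    return {"base": base, "lr": learning_rate, "stumps": stumps}

def predict(model, row):
    """Sum the base value and the shrunken contribution of every stump."""
    total = model["base"]
    for j, t, lm, rm in model["stumps"]:
        total += model["lr"] * (lm if row[j] <= t else rm)
    return total

def split_importance(model, n_features):
    """Count how many stumps split on each feature."""
    counts = [0] * n_features
    for j, _, _, _ in model["stumps"]:
        counts[j] += 1
    return counts

# Synthetic data (assumption): descriptor 0 matters most, 1 slightly, 2 not at all.
random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(60)]
y = [4.0 * x0 + 0.5 * x1 + random.gauss(0, 0.05) for x0, x1, _ in X]

model = fit_boosting(X, y)
importance = split_importance(model, 3)

mean_y = sum(y) / len(y)
baseline_mse = sum((yi - mean_y) ** 2 for yi in y) / len(y)
boosted_mse = sum((yi - predict(model, row)) ** 2 for row, yi in zip(X, y)) / len(y)
print("split counts per descriptor:", importance)
print("training MSE, mean-only vs boosted:", baseline_mse, boosted_mse)
```

In this toy setting the split counts concentrate on the descriptor that drives the response, which mirrors the kind of importance assessment described for the lightGBM studies. A real application would use a dedicated library and a held-out test set.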

CURRENT AND FUTURE PROSPECTS

Benefits, Risks, and Efforts

In applying AI and machine learning technologies to common day-to-day pharma needs, several aspects must be considered, including the benefits, risks, and efforts. The benefits of machine learning applications in pharmaceutical sciences are evident, both for classic machine learning tools such as ANNs and for newly emerging tools such as lightGBM. A substantial benefit of AI in pharmaceutical settings is that it accelerates advances across the entire spectrum of drug substance and drug product development by dramatically reducing the time spent on unnecessary attempts. This may not only improve outcomes in less time but also help to find more efficient solutions that sustain manufacturing efficiency and rapid throughput. In addition, depending upon the therapeutic class, the problem of high drug attrition rates (87) can be reduced. Thus, the high costs associated with drug research and development can be significantly reduced if part of the work is performed in silico using digitalized data, which reduces the need for extensive laboratory testing. For instance, a real pharmaceutical problem that demands substantial effort is the low aqueous solubility of drugs. It is estimated that approximately 90% of drug candidates in research and development pipelines are poorly water-soluble (88). Considering that only small quantities (< 50 mg) of a drug substance exist in early preformulation (62), determining the baseline solubility and subsequently the optimum solubilization technique for each drug substance may require extensive screening and laboratory work, as well as substantial resources. If well-trained and well-validated machine learning models are incorporated in such settings, only drug candidates that show positive results in silico need then undergo laboratory testing.
Thus, successful drug candidates can ultimately reach the intended patient in less time and with less material waste. An important advantage of machine learning is that the type of data places no restrictions on implementing machine learning algorithms. Different types of data, including binary classes, multiple classes, and continuous data, can all be modeled and analyzed by machine learning. Moreover, machine learning models may be used individually or in combination. Compared with traditional statistical models, a number of machine learning technologies (e.g., ANNs) offer the advantage of modeling the complex and nonlinear relationships that are frequently encountered in pharmaceutical sciences. Traditional models are usually used to draw inferences about relationships in the data, whereas machine learning models are designed to model complex relationships and ultimately produce accurate predictions. For example, the solubilization effect of hydrotropes is complex and nonlinear and does not follow a constant pattern (36). Traditional statistical tools would not be able to provide accurate predictions for the solubilizing effect of these systems, whereas machine learning models not only produced highly accurate predictions but also proved to be powerful tools that, through interrogation of the connection weights, give useful insights into the relative importance of the different input features in determining the outputs. In addition, the machine learning approach provided valuable insights that eventually led to the identification of new prospective solubilizing compounds (36). The quality of data is one of the challenges that must be considered when using AI and machine learning in the pharmaceutical sciences. Quality encompasses the consistency, reliability, accuracy, availability, and accessibility of the data. The dataset size should also be considered.
Small datasets can be modeled using simple machine learning methods; if the dataset is too large and complex to be modeled with simple methods, advanced ANN models based on the DL approach can offer a potential solution. Other challenges that must also be considered include the training/learning time, underfitting, and overfitting. The risk of applying unreliable machine learning models can therefore be mitigated if these challenges are appropriately considered and well-trained, well-validated machine learning models are carefully designed. Hence, digitalizing pharmaceutical data using AI may require domain experience and the ability to train algorithms; each machine learning method should be implemented “task specifically.”
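One common way to check for underfitting and overfitting is k-fold cross-validation, in which the training error is compared with the error on held-out folds. The sketch below uses a k-nearest-neighbor regressor written in plain Python as a stand-in model; the synthetic sine-shaped data, the noise level, and the three neighbor counts are assumptions chosen only to illustrate the behavior, not settings taken from any cited study.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error of a k-NN model built on `train`, evaluated on `test`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def cross_validate(data, k_neighbors, n_folds=5):
    """Return (mean training MSE, mean validation MSE) over n_folds folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    train_err, val_err = [], []
    for i in range(n_folds):
        val = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        train_err.append(mse(train, train, k_neighbors))
        val_err.append(mse(train, val, k_neighbors))
    return sum(train_err) / n_folds, sum(val_err) / n_folds

# Synthetic data (assumption): a nonlinear response with measurement noise.
random.seed(0)
data = []
for _ in range(200):
    x = random.uniform(0, 6)
    data.append((x, math.sin(x) + random.gauss(0, 0.3)))

# k=1 memorizes the training set, k=160 averages over the whole training fold.
results = {k: cross_validate(data, k) for k in (1, 9, 160)}
for k, (train_mse, val_mse) in results.items():
    print(f"k={k:3d}  training MSE={train_mse:.3f}  validation MSE={val_mse:.3f}")
```

A model that memorizes the data (one neighbor) shows zero training error but a clearly larger validation error, which is the signature of overfitting. A model that averages over every training point has similar, high errors on both sets, which indicates underfitting. An intermediate setting typically sits between the two, and this gap between training and validation error is what a well-validated model is checked against.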

AI and Pharma Industry

The pharmaceutical industry would greatly benefit from the use of AI and machine learning because of the wide range of applications discussed in this review. From proof of concept to product evaluation and marketing, AI can be applied to nearly every aspect of drug development. Given the long-standing figures of an average of $2.6 billion and over 10 years to develop a new medicine (89), investing in AI can substantially hasten and improve this process. Over the last 10 years, a remarkably growing number of pharmaceutical companies and startups have used AI in drug research and development. A number of pharma companies, such as Novartis and Pfizer with IBM Watson, have either collaborated with AI technology providers or acquired AI technologies (90). Mak and Pichika (2019) provided a comprehensive list of AI and pharmaceutical companies and the corresponding collaboration areas in drug development, such as drug repurposing, personalized medicine, and drug discovery (91). Other areas in which pharma companies have been actively investigating AI applications include process automation, robotic manufacturing, and targeted marketing (90). Investing in data management and AI power can sustain manufacturing efficiency and rapid throughput of data digitalization, which is powered by advancing algorithms as well as by the availability of diverse, complex, and large amounts of data. AI may, therefore, improve decision-making and eventually create new and better medicines. Nonetheless, it has been reported that AI has not yet influenced the pharma industry significantly because of several challenges suggested by Henstock (2019), including data management (e.g., managing the diversity and large amount of data), finding solutions for a large number of problems, insufficient skillsets, shifting towards alternatives to traditional scientific approaches, and lack of investment. To overcome these challenges, the author also suggested internal investment in data management and AI talent (90).
Mary and co-workers (2019) conducted a survey-based study to understand the adoption and effect of AI in pharmaceutical and biotechnology companies. Across 217 organizations, a number of important AI activities were identified, including the use of AI for patient selection and recruitment for clinical trials, in addition to identification of medicinal products data gathering. Major factors for not using AI technology were also identified, including a lack of skilled staff; safety, regulatory, and compliance concerns; and budget constraints (92).

Regulatory and Recommendation Insights

In terms of regulatory and recommendation insights, the Food and Drug Administration (FDA) recently published a discussion paper, “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) - Discussion Paper and Request for Feedback,” which discusses the current approach of subjecting software as a medical device driven by AI and machine learning to premarket review in order to ensure safety and effectiveness. The paper reviews several types of modifications that may have an impact on users of such software, including patients, healthcare professionals, and others. Examples include changes to the inputs, retraining with new data sets, and changes in the AI/ML architecture. To ensure safety and effectiveness across the lifecycle, from premarket development to postmarket performance, the FDA also proposed a total product lifecycle (TPLC) regulatory approach for AI/ML-based SaMD to enable evaluation and monitoring of a software product (93).

Beyond Traditional Applications

The growing success of machine learning technologies, particularly ANNs, in many pharmaceutical settings shows great potential for developing machine learning applications beyond the traditional ones. This trend has already begun in areas such as drug and gene delivery. Therapeutic agents are often transported into the cell using special transporter systems such as cell-penetrating peptides (CPPs). The efficiency of CPPs is usually investigated and screened through extensive laboratory work, but this screening has recently been performed successfully in silico using ANNs. The developed CPPs/ANN model provided highly accurate predictions and informative assessments for 13 different input features (94). Additionally, drug repurposing can also benefit greatly from these technologies (95). At present, although the first AI-designed drug has not yet reached the market, there is an ongoing race to find a treatment for the current COVID-19 pandemic. AI plays an important role in these efforts by identifying potential molecules that could be used as anti-COVID-19 drugs. For example, Benevolent AI (96,97) reported the use of machine learning to identify drugs for COVID-19 for which clinical trials are already underway.

CONCLUSIONS

Digitalizing pharmaceutical sciences is a very promising area in which numerous AI and machine learning technologies can be explored and effectively employed. The growing success of machine learning technologies in many pharmaceutical settings shows great potential for developing AI applications beyond the traditional ones. In practice, the choice of machine learning method depends on various factors, including the type of data and the size of the dataset, and can therefore be considered task-specific. With a sufficient amount of carefully curated data, building high-value applications using advancing AI algorithms may become a common practice that has the potential to solve many challenges in drug research and development. It is likely that AI will usher in a new era of digital pharmaceutical sciences with efficient, fast, and economical solutions.

62 in total

1. Application of artificial intelligent tools to modeling of glucosamine preparation from exoskeleton of shrimp.

Authors: Hadi Valizadeh; Mohammad Pourmahmood; Javid Shahbazi Mojarrad; Mahboob Nemati; Parvin Zakeri-Milani
Journal: Drug Dev Ind Pharm Date: 2009-04 Impact factor: 3.225

2. Creation of a tablet database containing several active ingredients and prediction of their pharmaceutical characteristics based on ensemble artificial neural networks.

Authors: Keisuke Takagaki; Hiroaki Arai; Kozo Takayama
Journal: J Pharm Sci Date: 2010-10 Impact factor: 3.534

3. Novel machine learning application for prediction of membrane insertion potential of cell-penetrating peptides.

Authors: Safa A Damiati; Ahmed L Alaofi; Prajnaparamita Dhar; Nabil A Alhakamy
Journal: Int J Pharm Date: 2019-06-21 Impact factor: 5.875

4. Can machine learning predict drug nanocrystals?

Authors: Yuan He; Zhuyifan Ye; Xinyang Liu; Zhengjie Wei; Fen Qiu; Hai-Feng Li; Ying Zheng; Defang Ouyang
Journal: J Control Release Date: 2020-03-29 Impact factor: 9.776

Review 5. Molecular design of flotation collectors: A recent progress.

Authors: Guangyi Liu; Xianglin Yang; Hong Zhong
Journal: Adv Colloid Interface Sci Date: 2017-05-10 Impact factor: 12.984

6. The application of machine learning algorithms in understanding the effect of core/shell technique on improving powder compactability.

Authors: Hao Lou; John I Chung; Y-H Kiang; Ling-Yun Xiao; Michael J Hageman
Journal: Int J Pharm Date: 2018-11-20 Impact factor: 5.875

7. Chitosan-tripolyphosphate nanoparticles: Optimization of formulation parameters for improving process yield at a novel pH using artificial neural networks.

Authors: Rania A Hashad; Rania A H Ishak; Sherif Fahmy; Samar Mansour; Ahmed S Geneidi
Journal: Int J Biol Macromol Date: 2016-01-16 Impact factor: 6.953

Review 8. Machine learning in chemoinformatics and drug discovery.

Authors: Yu-Chen Lo; Stefano E Rensi; Wen Torng; Russ B Altman
Journal: Drug Discov Today Date: 2018-05-08 Impact factor: 7.851

Review 9. Drug discovery and development: Role of basic biological research.

Authors: Richard C Mohs; Nigel H Greig
Journal: Alzheimers Dement (N Y) Date: 2017-11-11

10. Non-Linear Quantitative Structure⁻Activity Relationships Modelling, Mechanistic Study and In-Silico Design of Flavonoids as Potent Antioxidants.

Authors: Petar Žuvela; Jonathan David; Xin Yang; Dejian Huang; Ming Wah Wong
Journal: Int J Mol Sci Date: 2019-05-10 Impact factor: 5.923
