Abstract
Artificial intelligence (AI), and machine learning in particular, has gained significant interest in many fields, including pharmaceutical sciences. The enormous growth of data from several sources, the recent advances in various analytical tools, and the continuous developments in machine learning algorithms have resulted in a rapid increase in new machine learning applications in different areas of pharmaceutical sciences. This review summarizes the past, present, and potential future impacts of machine learning technologies on different areas of pharmaceutical sciences, including drug design and discovery, preformulation, and formulation. The machine learning methods commonly used in pharmaceutical sciences are discussed, with a specific emphasis on artificial neural networks due to their capability to model the nonlinear relationships that are commonly encountered in pharmaceutical research. AI and machine learning technologies in common day-to-day pharma needs, as well as industrial and regulatory insights, are reviewed. The potential, beyond these traditional uses, of implementing digital technologies that use machine learning to develop more efficient, faster, and more economical solutions in pharmaceutical sciences is also discussed.
Keywords: artificial intelligence; artificial neural networks; machine learning; pharmaceutical industry; pharmaceutical sciences
Year: 2020 PMID: 32715351 PMCID: PMC7382958 DOI: 10.1208/s12249-020-01747-4
Source DB: PubMed Journal: AAPS PharmSciTech ISSN: 1530-9932 Impact factor: 3.246
Fig. 1 Schematic showing the relationship between AI, machine learning, and artificial neural networks (left), and a number of applications of artificial neural networks in pharmaceutical sciences (right)
Comparison of Different Machine Learning Methods Commonly Used in Pharmaceutical Research* (12,17,25–33)
| Machine learning method | Learning algorithm | Machine learning model | Learning problem | Dataset size | Advantages | Disadvantages/limitations |
|---|---|---|---|---|---|---|
| Linear regression | Supervised | Parametric | Regression | Varies ( | Easy implementation | Applicable only for linear modeling |
| Artificial Neural Networks (ANNs) | Supervised and unsupervised | Parametric | Classification and regression | Large | Modeling complex nonlinear relationships | Overfitting/underfitting; relatively slow training time |
| K-nearest neighbors (KNN) | Supervised | Nonparametric | Classification and regression | Large (dependent on noise level in data) | Simple and easy to implement with single pre-defined parameter ( | Intolerant of noise |
| Support vector machine (SVM) | Supervised and unsupervised | Nonparametric | Classification and regression | Small | Able to represent complex functions; resistant to overfitting | Relatively slow training time; high model complexity; long computing time |
| Decision tree (DT) | Supervised | Nonparametric | Classification and regression | Small | Simpler than RF; can deal with noisy and missing data; fast | Unstable: performance can be affected by slight variations in the training data |
| Random forest (RF) | Supervised | Nonparametric | Classification and regression | Large | Similar to decision trees, with the added capability to overcome overfitting | Complex |
| Principal Component Analysis (PCA) | Unsupervised | Feature extraction | Classification and regression | Large | Reduces the dimensionality of multivariate data while maintaining the relevant information in the original dataset | Assumes a Gaussian distribution of the data, which may limit its use when data are not normally distributed, such as gene expression data |
| LightGBM** | Supervised | Nonparametric | Classification and regression | Large | Fast training speed; high efficiency and accuracy | Sensitive to overfitting |
*Note that this table presents a general comparison of the different machine learning methods commonly used in drug research and development. The relative standing of each method may vary from case to case.
**LightGBM is an emerging machine learning method that has recently been implemented in pharmaceutical sciences
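The table's note that linear regression is applicable only for linear modeling can be illustrated with a small sketch. The data, the helper names `fit_line` and `r_squared`, and the quadratic response are all assumptions made up for illustration and do not come from any of the cited studies. The sketch fits a straight line by ordinary least squares to a linear response and to a curved one, then compares the coefficient of determination.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data, not taken from any study cited here.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_ys = [2.0 * x + 1.0 for x in xs]  # linear response
curved_ys = [x ** 2 for x in xs]         # nonlinear (quadratic) response

r2_linear = r_squared(xs, linear_ys, *fit_line(xs, linear_ys))
r2_curved = r_squared(xs, curved_ys, *fit_line(xs, curved_ys))
print(f"R^2 on linear data: {r2_linear:.2f}")
print(f"R^2 on curved data: {r2_curved:.2f}")
```

On this toy example the straight line explains the linear response fully and the symmetric quadratic response not at all, which is the kind of gap that motivates nonlinear models such as ANNs.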
Fig. 2 Schematic of a typical biological neuron (a), and an artificial neural network (b)
Fig. 3 An illustration of the effect of overfitting/underfitting of the data on the training and validation error curves showing the optimum point where the training/learning process of ANNs should stop
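The stopping point described in the Fig. 3 caption is often located in practice with a patience-based early-stopping rule. The following is a minimal sketch under stated assumptions: the function name `early_stopping_epoch`, the `patience` parameter, and the error values are hypothetical and are not taken from the review. Training is treated as stopped once the validation error has failed to improve for `patience` consecutive epochs, and the epoch with the lowest validation error seen so far is reported as the optimum.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error) under a patience-based stopping rule.

    Iteration stops once the validation error has not improved for
    `patience` consecutive epochs; epochs are counted from 0.
    """
    best_epoch, best_error = 0, float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error keeps rising: likely overfitting
    return best_epoch, best_error


# Hypothetical curves: training error keeps falling while validation
# error reaches a minimum and then rises again.
train_errors = [0.90, 0.60, 0.42, 0.30, 0.22, 0.16, 0.12, 0.09, 0.07]
val_errors = [0.95, 0.68, 0.50, 0.41, 0.38, 0.40, 0.45, 0.52, 0.60]

best_epoch, best_error = early_stopping_epoch(val_errors, patience=3)
print(f"stop at epoch {best_epoch} with validation error {best_error}")
```

Stopping earlier than this point would leave the network underfitted, while continuing past it lowers only the training error.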
Summarization of Input-Output Data Used to Build Various Pharmaceutical Machine Learning Models in Different QSAR Studies
| Machine learning method* | Learning algorithm | Dataset size | Inputs/descriptors | Output(s)/purpose | Reference |
|---|---|---|---|---|---|
| ANN Linear regression | Supervised | 30 | 5 molecular descriptors: - water–accessible surface area - polar surface area - maximal electrostatic potentials - ovality - hydrophobicity (log | Prediction of tumor specificity of chemotherapeutic agents | ( |
| ANN SVM DT RF | Supervised | 89 | 10 molecular descriptors related to: - hydrophobicity - electronic features - topological features - protein-inhibitor interactions | Prediction of activity of HIV inhibitors | ( |
| RBFNN KNN SVM RF | Unsupervised and supervised for RBFNN model and the other models, respectively | Two datasets: 128 (Phenol dataset) 105 (ROCK dataset) | - For phenol datasets: 6 molecular descriptors (related 2D and 3D descriptors such as log P). - For ROCK datasets: 6 molecular descriptors (related 2D descriptors such as ring count) | Prediction of the biological activity of various phenols and Rho kinase (ROCK) inhibitors. | ( |
| ANN Linear partial least squares (linear statistical method) | Supervised | 36 | 4 molecular descriptors: - minimum bond dissociation enthalpy - electron transfer enthalpy - proton affinity - hydration energy | Prediction of antioxidant activity of flavonoids | ( |
| RF ANN | Supervised | 91 | 166 molecular descriptors including: - structure - topology - molecular connectivity index - geometric descriptors | Prediction of the carcinogenicity of polycyclic aromatic hydrocarbons | ( |
| ANN | Supervised | 33 | 6 molecular descriptors (related 2D and 3D descriptors) | Prediction of anti-malarial activity of imidazolopiperazine compounds | ( |
| ANN SVM | Supervised | 639 | 341 molecular descriptors related to: - simple constitutional - topological indices - electrotopological state indices - charge-based - hydrogen-bonding descriptors | Prediction of nephrotoxicity of traditional Chinese medicine ingredients | ( |
*The top-ranked machine learning methods in each of these studies demonstrated better predictive ability than the other machine learning methods tested. ANN artificial neural network, SVM support vector machine, DT decision tree, RF random forest, KNN K-nearest neighbor, RBFNN radial basis function neural network
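K-nearest neighbor (KNN), one of the methods listed in the footnote above, is simple enough to sketch in a few lines. The example below is illustrative only: the two descriptors, their values, the activity labels, and the function name `knn_predict` are hypothetical and are not drawn from any of the QSAR studies in the table. A query compound is assigned the majority label among its `k` closest training compounds in descriptor space.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical descriptors per compound: (log P, polar surface area).
# Real descriptors would normally be scaled first so that no single
# descriptor dominates the distance.
train_X = [
    (1.0, 20.0), (1.2, 25.0), (0.8, 22.0),
    (3.5, 90.0), (3.8, 85.0), (4.0, 95.0),
]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

print(knn_predict(train_X, train_y, (3.6, 88.0), k=3))
print(knn_predict(train_X, train_y, (1.1, 23.0), k=3))
```

Because each prediction depends directly on nearby training points, a few mislabeled or noisy compounds can change the vote, particularly for small `k`.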
Summarization of Input-Output Data Used to Build Various ANN Models in Different Pharmaceutical Formulation Studies
| Dataset size | Inputs/variables | Output(s) | Purpose | Reference |
|---|---|---|---|---|
| 125 | 19 variables related to: - the composition of the formulations - the processing conditions | - Time taken for 10% of the drug to be released - Time taken for 90% of the drug to be released | Prediction of the most important formulation and processing variables contributing to the | ( |
| Two datasets: 154 (for synthetic samples) 169 (for pharmaceutical samples) | - 5 principal components for synthetic samples - 6 principal components for pharmaceutical samples | Concentrations of 3 vitamins in synthetic and pharmaceutical samples | Prediction of vitamins in synthetic and pharmaceutical samples | ( |
| 30 | 3 input variables: - acid concentration - acid solution to chitin ratio - reaction time | Percentage production yield of glucosamine | Prediction of glucosamine production yield from chitin under various reaction conditions | ( |
| 180 | 4 input variables related to different formula ingredients: - Methocel® K100M - xanthan gum - Carbopol® 974P - Surelease® | | Development and optimization of sustained-release salbutamol sulfate formulation | ( |
| 300 | 5 input variables related to 5 active ingredients and excipients (three physical–chemical properties of active ingredients in addition to two formulation factors): - solubility - mean particle size - specific surface area - the weight ratios of microcrystalline cellulose - the weight ratios of magnesium stearate | Tablet tensile strength and disintegration time before and after accelerated test | Prediction of responses to differences in quantities of excipients and physical–chemical properties of active ingredients in tablets | ( |
| 327 | 6 input variables related to 14 active ingredients: - melting point - solubility - specific surface area - mean particle size - size distribution - contents of APIs | - Tablet tensile strength - Disintegration time | Prediction of the contribution of different physicochemical properties of APIs to tablet properties | ( |
| 15 | 3 formulation factors: - weight ratio of drug to lipid - the concentration of polymer - the concentration of surfactant | - Drug loading efficiency - Mean particle size | Optimization of controlled-release nanoparticle formulation | ( |
| 45 | 3 input variables: - chitosan (Cs) concentration - sodium tripolyphosphate (TPP) concentration - mass ratio of Cs and TPP | - Nanoparticle size - Percentage yield | Optimization of formulation parameters of chitosan-tripolyphosphate nanoparticles | ( |
| 43 | 7 input variables: - alginate percentage - concentration of CaCl2 solution in the emulsion - percentage of Tween™ 85 in the emulsion - percentage of Tween™ 85 in the receptor bath - flow rates of alginate - flow rates of emulsion - frequency of vibration | - Shape - Oil content - Oil distribution | Optimization of encapsulation of active pharmaceutical ingredients (API) for efficient delivery of hydrophobic compounds | ( |
| 20 | 3 input variables: - the amounts of drug (pilocarpine hydrochloride) - the amounts of bile salt (sodium deoxycholate) - the amounts of water | Entrapment efficiency | Optimization of ocular formulation of flexible nano-liposomes containing pilocarpine hydrochloride | ( |
| 16 | 3 input variables: - amount of oil - amount of surfactant - amount of co-surfactant | Minimal globule size | Optimization of self-emulsifying drug delivery system | ( |
| 8 | 2 formulation variables: - ratio of carrier to coating - type of solubilizing agent | Amount of API released in 10 min and 30 min | Development of a new liquisolid formulation | ( |
| 160 | 160 NIR and Raman spectral data of each of intact tablets | Dissolution of the tablets | Prediction of the | ( |
| 29 | 4 formulation and process variables: - microcrystalline cellulose concentration - sodium starch glycolate concentration - spheronization time - extrusion speed | - Drug release (at 15 min, 30 min, 45 min, and 60 min) - Aspect ratio - Yield | Prediction of the effects of formulation and process variables on drug release | ( |
| 144 | Amino acid composition of each monoclonal antibody and different formulation conditions ( | - Melting temperature - Aggregation onset temperature - Interaction parameter | Prediction of biophysical properties of therapeutic monoclonal antibodies | ( |
| 32 | 4 input variables: - concentration of shell material - concentration of core material - type of shell material - type of core material | - Tensile strength - Brittleness index | Prediction of powder compactability of tablets using core/shell technique | ( |
| 646 | 24 variables related to: - formulation (including molecular weight, melting point, hydrogen bonding for both drug and polymer) - experimental conditions (including temperature, relative humidity, and storage time) | Stability results | Prediction of the physical stability of solid dispersions | ( |