Jai Woo Lee, Miguel A Maria-Solano, Thi Ngoc Lan Vu, Sanghee Yoon, Sun Choi.
Abstract
There have been numerous advances in the development of computational and statistical methods and in the application of big data and artificial intelligence (AI) techniques for computer-aided drug design (CADD). Drug design is a costly and laborious process, given the biological complexity of diseases. To design and develop a new drug effectively and efficiently, CADD can apply cutting-edge techniques to overcome various limitations in the drug design field. Data pre-processing approaches, which clean the raw data for consistent and reproducible applications of big data and AI methods, are introduced. We include the current status of the applicability of big data and AI methods to drug design areas such as the identification of binding sites in target proteins, structure-based virtual screening (SBVS), and absorption, distribution, metabolism, excretion and toxicity (ADMET) property prediction. Data pre-processing and applications of big data and AI methods enable the accurate and comprehensive analysis of massive biomedical data and the development of predictive models in the field of drug design. Understanding and analyzing the biological, chemical, or pharmaceutical architectures of biomedical entities related to drug design will provide beneficial information in the biomedical big data era.
Keywords: artificial intelligence; big data; computer-aided drug design; statistics; structure-based drug discovery
Year: 2022 PMID: 35076690 PMCID: PMC9022974 DOI: 10.1042/BST20211240
Source DB: PubMed Journal: Biochem Soc Trans ISSN: 0300-5127 Impact factor: 4.919
Figure 1. Data pre-processing and modeling.
Data pre-processing steps include missing data imputation, outlier detection, and redundant feature elimination. After the input data are pre-processed, predictive modeling including unsupervised learning (clustering and dimensionality reduction) and supervised learning (regression and classification) can be utilized.
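The three pre-processing steps named in the caption can be illustrated with a minimal sketch. It is not the method used by the tools listed in this review: it swaps in simple statistical stand-ins (column-mean imputation, a median-based modified z-score for outliers, and a Pearson-correlation filter for redundant features). The function names, thresholds, and toy data are all assumptions made for illustration.

```python
from statistics import mean, median

def impute_missing(rows):
    """Replace None entries with the mean of the observed values in that column."""
    cols = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in cols]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def drop_outliers(rows, threshold=3.5):
    """Drop rows whose modified z-score exceeds the threshold in any column."""
    cols = list(zip(*rows))
    meds = [median(col) for col in cols]
    mads = [median(abs(v - m) for v in col) for col, m in zip(cols, meds)]

    def is_outlier(row):
        return any(mad > 0 and 0.6745 * abs(v - med) / mad > threshold
                   for v, med, mad in zip(row, meds, mads))

    return [row for row in rows if not is_outlier(row)]

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def drop_redundant_features(rows, max_corr=0.95):
    """Keep a column only if it is not highly correlated with a column already kept."""
    cols = list(zip(*rows))
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[k])) < max_corr for k in kept):
            kept.append(j)
    return [[row[j] for j in kept] for row in rows], kept

# Hypothetical toy matrix: rows are compounds, columns are numeric descriptors.
raw = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 3.0],
    [3.0, None, 4.0],   # missing value in the second descriptor
    [4.0, 8.0, 1.0],
    [5.0, 10.1, 2.0],
    [100.0, 6.0, 3.0],  # extreme value in the first descriptor
]

imputed = impute_missing(raw)
filtered = drop_outliers(imputed)
reduced, kept = drop_redundant_features(filtered)
```

In this toy run the extreme row is removed and the second descriptor is discarded because it is almost perfectly correlated with the first. The order of the steps is a design choice in this sketch, not something prescribed by the caption.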
Software programs for data pre-processing
| Application | Method | Software program | Link |
|---|---|---|---|
| Missing data imputation | Neural Networks (NN) | Alchemite [ | |
| Outlier detection | Neural Networks (NN) | Alchemite [ | |
| Redundant feature elimination | Random Forest (RF) | RGIFE [ | |
Supervised AI modeling applications in drug design
| Category | Name | Summary |
|---|---|---|
| Regression | Penalized Linear Regression | Penalized Linear Regression estimates significant interactions between features in an n-by-p data matrix and the continuous outcome [ |
| Regression | Partial Least Squares Regression (PLSR) | PLSR detects new significant features by combining the feature coordinates and extracts the optimal set of latent features by linearly combining them [ |
| Classification | Penalized Logistic Regression | Penalized Logistic Regression evaluates significant interactions between features in an n-by-p data matrix and the categorical outcome [ |
| Classification | Support Vector Machine (SVM) | SVM builds a multidimensional hyperplane that separates data values in one category from data values in other categories by maximizing the distance between the hyperplane and the nearest data values of each category [ |
| Classification | K-Nearest Neighbors (kNN) | kNN assigns a predicted category to an unknown sample based on the K closest data values in a training set [ |
| Classification | Naïve Bayesian Classifier (NBC) | NBC calculates the set of probabilities by counting the frequency of categories for the feature to be predicted in the data [ |
| Classification | Decision Tree (DT) | DT expands subtrees and leaves to obtain a node labeled with a predicted outcome category [ |
| Classification | Random Forest (RF) | RF, an ensemble of decision trees, efficiently analyzes high-dimensional data by merging the outcomes of the individual trees [ |
| Classification | Neural Networks (NN) | The NN algorithm sets input features in an input layer, implements weighted transformations over hidden layers, and evaluates the outcome on an output layer [ |
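
As a concrete instance of one row in the table above, the following minimal sketch implements the kNN rule as summarized there: an unknown sample receives the majority category of its K closest training samples. The function name `knn_predict`, the Euclidean distance, the two-descriptor toy data, and the "active"/"inactive" labels are assumptions for illustration and do not come from any tool cited in this review.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training samples closest to query."""
    # Sort training samples by Euclidean distance to the query and keep the k nearest.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two numeric descriptors per compound.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["active", "active", "active",
           "inactive", "inactive", "inactive"]

pred_near_actives = knn_predict(train_X, train_y, (1.1, 1.0))
pred_near_inactives = knn_predict(train_X, train_y, (3.0, 3.0))
```

With two categories, an odd K avoids tied votes. In practice the descriptors would usually be scaled first, since the distance is sensitive to the units of each feature.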
Unsupervised AI modeling applications in drug design
| Category | Name | Summary |
|---|---|---|
| Clustering | K-Means Clustering | K-Means Clustering partitions the input data values into K clusters representing categories [ |
| Clustering | Hierarchical Clustering (HC) | In HC, data values are partitioned into clusters of increasing hierarchy. The partitioning process ends when a single cluster containing all n data values is formed or when each of the n data values is assigned to its own cluster [ |
| Dimensionality Reduction | Principal Component Analysis (PCA) | PCA transforms the original features into principal components, which are uncorrelated with each other but contain information from the original data [ |
| Dimensionality Reduction | Linear Discriminant Analysis (LDA) | LDA builds a prediction model that classifies patterns in the data [ |
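
The K-Means entry in the table above can be sketched as follows. This is a minimal version of the standard Lloyd iteration (assign each point to its nearest centroid, then recompute each centroid as the mean of its members), assuming Euclidean distance and a naive initialization from the first K points. The function name `kmeans`, the iteration cap, and the toy points are hypothetical.

```python
from math import dist

def kmeans(points, k, n_iter=100):
    """Partition points into k clusters; returns (labels, centroids)."""
    # Naive deterministic initialization (an assumption of this sketch):
    # the first k points serve as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D descriptor values forming two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8),
          (5.1, 4.9), (0.8, 1.2), (4.9, 5.1)]
labels, centroids = kmeans(points, k=2)
```

The result depends on the initialization, so real applications typically run several random restarts and keep the partition with the lowest within-cluster distance.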
Software and tools for AI modeling applications in structure-based drug design
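| Application | Method | Software program | Link |
|---|---|---|---|
| Identification of binding sites in target proteins | NBC | ENRI [ | Source code: |
| Identification of binding sites in target proteins | DT | P2Rank [ | Source code: |
| Identification of binding sites in target proteins | NN | DeepSite [ | Web server: |
| Identification of binding sites in target proteins | NN | BiteNet [ | Data set: |
| Identification of binding sites in target proteins | K-Means Clustering | iFeature [ | Web server: |
| Identification of binding sites in target proteins | LDA | SpotOn [ | Web server: |
| Structure-based Virtual Screening (SBVS) | NN | DeepBSP [ | Source code: |
| Structure-based Virtual Screening (SBVS) | Penalized linear regression/Penalized logistic regression | SAnDReS [ | Source code: |
| Structure-based Virtual Screening (SBVS) | RF | RF-Score-v3 [ | Software: |
| Prediction of Pharmacokinetics (ADME) and Toxicity (T) | SVM | SwissADME [ | Web server: |
| Prediction of Pharmacokinetics (ADME) and Toxicity (T) | RF | admetSAR2.0 [ | Web server: |
| Prediction of Pharmacokinetics (ADME) and Toxicity (T) | kNN | vNN-ADMET [ | Web server: |
| Prediction of Pharmacokinetics (ADME) and Toxicity (T) | RF | AMPL [ | Source code: |
| Prediction of Pharmacokinetics (ADME) and Toxicity (T) | RF | ADMETlab [ | Web server: |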
Figure 2. Overview of the applications workflow and their interconnection.
On the left, the training data set sources and curation process are shown. The data type and the relevant information for the training process of each application are described in the horizontal arrows. On the right, the different neural network models built using the training data are represented. Application i uses protein ensembles as input and provides druggable conformations encompassing binding sites as output. Application ii is fed with the docked complexes between the druggable conformations and drug candidates as input, yielding hit compounds as output. Application iii takes the hit compounds as input to finally obtain the lead compounds.