Shrooq Alsenan, Isra Al-Turaiki, Alaaeldin Hafez.
Abstract
The blood-brain barrier plays a crucial role in regulating the passage of 98% of the compounds that enter the central nervous system (CNS). Compounds with high permeability must be identified to enable the synthesis of brain medications for the treatment of various brain diseases, such as Parkinson's, Alzheimer's, and brain tumors. Throughout the years, several models have been developed to solve this problem and have achieved acceptable accuracy scores in predicting compounds that penetrate the blood-brain barrier. However, predicting compounds with "low" permeability has been a challenging task. In this study, we present a deep learning (DL) classification model to predict blood-brain barrier permeability. The proposed model addresses the fundamental issues presented in former models: high dimensionality, class imbalances, and low specificity scores. We address these issues to enhance the high-dimensional, imbalanced dataset before developing the classification model: the imbalanced dataset is addressed using oversampling techniques and the high dimensionality using a non-linear dimensionality reduction technique known as kernel principal component analysis (KPCA). This technique transforms the high-dimensional dataset into a low-dimensional Euclidean space while retaining invaluable information. For the classification task, we developed an enhanced feed-forward deep learning model and a convolutional neural network model. In terms of specificity scores (i.e., predicting compounds with low permeability), the results obtained by the enhanced feed-forward deep learning model outperformed those obtained by other models in the literature that were developed using the same technique. In addition, the proposed convolutional neural network model surpassed models used in other studies in multiple accuracy measures, including overall accuracy and specificity. 
The proposed approach solves the problem, inevitably faced by earlier models, of low specificity resulting in a high false-positive rate. © 2021 Alsenan et al.
Keywords: Blood Brain Barrier (BBB) permeability; Chemoinformatics; Convolutional Neural Network (CNN); Quantitative Structure-Activity Relationships (QSAR)
Year: 2021 PMID: 34179448 PMCID: PMC8205267 DOI: 10.7717/peerj-cs.515
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Classification accuracy measures of studies reported in the literature.
| Model | Evaluation | ACC | Sens. | Spec. | AUC | MCC |
|---|---|---|---|---|---|---|
| SVM | 10-fold | 91.2 | 92.5 | 89.9 | 90.8 | – |
| Consensus MLP | 10-fold | 96.6 | 99 | 83.3 | 91.9 | – |
| DT | 5-fold | 92.22 | 0.58 | 0.96 | – | 0.55 |
| SVM | 5-fold | 83.7 | 88.6 | 75.0 | – | 0.65 |
| ANN | LGO | 73.0 | 68.0 | 80.0 | – | 0.79 |
| DT | 10-fold | 87.93 | 86.67 | 89.29 | – | – |
| SVM | 70/30 + 5-fold | 93.96 | 94.3 / 91.0 | 94.3 / 91.0 | – | 0.84 |
Notes.
ACC: overall accuracy. Sens.: sensitivity. Spec.: specificity. AUC: area under the curve. MCC: Matthews correlation coefficient.
Figure 1. The four phases of developing the BBB permeability model.
Figure 2. Dataset analysis.
Details of the training, testing, and external datasets used in this study.
| Sample size before SMOTE | Sample size after SMOTE | Training set | Testing set | External set | Descriptors before KPCA | Descriptors after KPCA |
|---|---|---|---|---|---|---|
| 1803 BBB+, 547 BBB− | 1803 BBB+, 1803 BBB− | 1,874 | 468 | 86 | 6,394 | 3,603 |
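The 547 BBB− compounds are oversampled until they match the 1,803 BBB+ compounds. A minimal numpy sketch of the SMOTE idea, synthesizing new minority points by interpolating between a sample and one of its nearest neighbours (the paper presumably used a standard SMOTE implementation; the toy data below is random and the 5-neighbour choice is an assumption):

```python
import numpy as np

def smote_oversample(X_min, n_new, rng=None):
    """Synthesize n_new minority samples, each interpolated between
    a random minority sample and one of its 5 nearest neighbours."""
    rng = np.random.default_rng(rng)
    X_new = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = X_min[rng.integers(len(X_min))]
        d = np.linalg.norm(X_min - a, axis=1)   # distance to all minority samples
        nn = X_min[np.argsort(d)[1:6]]          # 5 nearest neighbours (skip self)
        b = nn[rng.integers(len(nn))]
        X_new[i] = a + rng.random() * (b - a)   # random point on the segment a-b
    return X_new

# balance 547 BBB- against 1803 BBB+ as in the dataset table above
X_minority = np.random.default_rng(0).normal(size=(547, 10))
X_synth = smote_oversample(X_minority, 1803 - 547, rng=1)
print(X_synth.shape)  # (1256, 10)
```

Concatenating `X_synth` with the original 547 BBB− rows yields the balanced 1803/1803 split shown in the table.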
Figure 3. Block diagram of the FFDNN model.
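The FFDNN is a stack of dense layers ending in a sigmoid unit that outputs the probability of BBB+ permeability. As an illustrative forward-pass sketch in numpy (the hidden widths 512 and 128 are assumptions, not the authors' configuration; only the 3,603-dimensional post-KPCA input comes from the dataset table):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ffdnn_forward(x, params):
    """Dense -> ReLU hidden layers, then a sigmoid output giving P(BBB+)."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W_out, b_out = params[-1]
    return sigmoid(h @ W_out + b_out)

rng = np.random.default_rng(0)
dims = [3603, 512, 128, 1]   # input = descriptors after KPCA; widths illustrative
params = [(rng.normal(scale=0.01, size=(a, b)), np.zeros(b))
          for a, b in zip(dims[:-1], dims[1:])]
x = rng.normal(size=(4, 3603))       # a mini-batch of 4 compounds
p = ffdnn_forward(x, params)
print(p.shape)  # (4, 1)
```

A compound is classified BBB+ when its output probability exceeds 0.5.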
Figure 4. Convolutional layer.
Figure 5. Transforming network shape from 2D to 3D.
Figure 6. CNN architecture.
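The 2D-to-3D transformation of Figure 5 amounts to reshaping each compound's flat descriptor vector into an image-like grid with a channel axis, so convolutional filters can slide over it. A minimal numpy sketch (the 61×60 grid is an assumption chosen to hold 3,603 features, not the paper's exact shape):

```python
import numpy as np

def to_3d(X, h=61, w=60):
    """Reshape flat descriptor vectors (n, d) into an image-like
    tensor (n, h, w, 1), zero-padding when d < h * w."""
    n, d = X.shape
    assert d <= h * w, "grid too small for the descriptor vector"
    padded = np.zeros((n, h * w))
    padded[:, :d] = X                  # copy features, leave padding at zero
    return padded.reshape(n, h, w, 1)  # add a single channel axis

X = np.random.default_rng(0).normal(size=(8, 3603))  # 3,603 post-KPCA features
X3d = to_3d(X)
print(X3d.shape)  # (8, 61, 60, 1)
```

The resulting tensor is the shape a 2D convolutional input layer expects.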
SMOTE effect on the enhanced FFDNN model.
| Resampling | Training ACC | Training Sens. | Training Spec. | Test ACC | Test Sens. | Test Spec. | AUC | MCC | CI (95%) |
|---|---|---|---|---|---|---|---|---|---|
| No SMOTE | 99.78 | 99.93 | 99.33 | 91.06 | 93.04 | 83.35 | 92.00 | 73.68 | .057–.125 |
| SMOTE | 99.86 | 99.88 | 99.77 | 96.17 | 93.72 | 98.61 | 98.61 | 92.46 | .033–.081 |
| SMOTE | 99.89 | 99.79 | 100 | 96.20 | 93.51 | 98.89 | 98.73 | 92.54 | .032–.082 |
| SMOTE | 100 | 96.78 | 98.11 | 97.11 | 97.35 | 98.42 | 99.50 | 95.55 | .020–.072 |
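Every figure in the table derives from the confusion matrix; specificity is the recall on the BBB− class, which is why oversampling the minority class lifts it so sharply. A self-contained sketch of the score definitions (the counts below are illustrative, not the paper's confusion matrix):

```python
import math

def scores(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall on BBB+), specificity (recall on
    BBB-), and Matthews correlation coefficient from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return acc, sens, spec, mcc

# illustrative counts only
acc, sens, spec, mcc = scores(tp=350, tn=100, fp=10, fn=8)
print(round(spec, 3))  # 0.909
```

With few true BBB− test compounds, even a handful of false positives drags specificity down, which is the failure mode the paper targets.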
Figure 7. SMOTE oversampling technique.
(A) Class label transformation. (B) Synthesizing a new instance.
Figure 8. Dataset transformation with kernel PCA.
(A) Original dataset. (B) After kernel PCA.
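Figure 8's transformation reduces the 6,394 descriptors by an eigendecomposition of the centred kernel matrix rather than of the covariance matrix, capturing non-linear structure. A numpy sketch with an RBF kernel (the gamma value and toy data are illustrative; the paper's kernel settings may differ):

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=0.1):
    """Project X onto the leading principal components in RBF-kernel
    feature space via the centred kernel matrix."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # RBF kernel
    n = len(K)
    One = np.full((n, n), 1.0 / n)
    Kc = K - One @ K - K @ One + One @ K @ One   # double-centre the kernel
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]  # take the largest ones
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return Kc @ alphas                           # projected training points

X = np.random.default_rng(0).normal(size=(50, 6))
Z = kernel_pca(X, n_components=3)
print(Z.shape)  # (50, 3)
```

Because the eigendecomposition is over the n×n kernel matrix, the number of retained components is bounded by the sample count, consistent with the 3,603-dimensional output reported in the dataset table.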
Performance evaluation of the proposed DL models in comparison with ML methods and benchmark.
| Model | Training ACC | Training Sens. | Training Spec. | Test ACC | Test Sens. | Test Spec. | AUC | MCC | ACC-Ext |
|---|---|---|---|---|---|---|---|---|---|
| FFDNN | 100 | 96.78 | 98.11 | 97.11 | 97.35 | 98.42 | 97.7 | 95.55 | 96.5 |
| CNN | 100 | 98.76 | 99.87 | 97.76 | 94.50 | 98.31 | 99.71 | 92.85 | 97.0 |
| XGBoost | 98.67 | 96.23 | 92.34 | 94.32 | 92.34 | 95.66 | 93.22 | 83.44 | 92.00 |
| SVM | 99.32 | 98.30 | 95.62 | 95.94 | 95.30 | 96.62 | 93.3 | 93.92 | 93.90 |
| RF | 99.47 | 90.21 | 97.15 | 93.61 | 90.21 | 97.15 | 93.68 | 87.46 | 92.04 |
Notes.
ACC: overall accuracy. Sens.: sensitivity. Spec.: specificity. MCC: Matthews correlation coefficient. AUC: area under the curve. ACC-Ext: overall accuracy on the external dataset.
Figure 9. DL vs. ML models.
Figure 10. ROC plots for DL models.
(A) ROC Enhanced FFDNN. (B) ROC CNN.
Figure 11. ROC plots for ML models.
(A) ROC XGBoost. (B) ROC SVM. (C) ROC RF.