Hiroshi Sakiyama, Motohisa Fukuda, Takashi Okuno.
Abstract
The blood-brain barrier (BBB) controls the entry of chemicals from the blood into the brain. Since drugs that act on the brain need to penetrate the BBB, rapid and reliable prediction of BBB penetration (BBBP) is helpful for drug development. In this study, free-form and in-blood-form datasets were prepared by modifying the original BBBP dataset, and the effects of the data modification were investigated. For each dataset, molecular descriptors were generated and used for BBBP prediction by machine learning (ML). For ML, the dataset was split into training, validation, and test data by the scaffold split algorithm used by MoleculeNet. This algorithm creates an unbalanced split and makes prediction difficult; however, we chose it in order to evaluate the predictive performance for unknown compounds dissimilar to existing ones. The highest prediction score was obtained by the random forest model using 212 descriptors from the free-form dataset, and this score was higher than the best existing score obtained with the same split algorithm without any external database. Furthermore, using a deep neural network, a comparable result was obtained with only 11 descriptors from the free-form dataset, and the resulting descriptors suggested the importance of recognizing glucose-like characteristics in BBBP prediction.
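The scaffold split mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes each compound already has a scaffold string (in practice typically a Bemis-Murcko scaffold SMILES computed with a cheminformatics toolkit), and the function name `scaffold_split` and the 80/10/10 default fractions are illustrative assumptions. Compounds sharing a scaffold are kept together, and the largest scaffold groups are assigned to the training set first, which is why the test set tends to contain rare, dissimilar structures.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split compound indices so that no scaffold spans two subsets.

    `scaffolds` holds one precomputed scaffold string per compound
    (an assumption of this sketch; the scaffold computation is omitted).
    """
    # Group compound indices by their scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) > train_cutoff:
            # Training set is full: overflow goes to validation, then test.
            if len(train) + len(valid) + len(group) > valid_cutoff:
                test.extend(group)
            else:
                valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test
```

Because whole groups are moved at once, the realized subset sizes only approximate the requested fractions, and the class balance can differ noticeably between subsets.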
Keywords: blood-brain barrier penetration (BBBP); deep neural network (DNN); forward search; free-form dataset; in-blood-form dataset; machine learning (ML); molecular descriptor; random forest (RF)
Year: 2021 PMID: 34946509 PMCID: PMC8708321 DOI: 10.3390/molecules26247428
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Chemical forms of an example compound: a free form (A), a frequently used formal salt form (B), an actual salt form (C), and a protonated form in aqueous solution (D).
Figure 2. Comparison of ROC-AUC scores for the three datasets (intact, free-form, and in-blood-form) using a DNN model (a) and an RF model (b) with 200 molecular descriptors.
Figure 3. Importance of the top 50 molecular descriptors obtained by the RF method for the free-form dataset (a) and the in-blood-form dataset (b).
Nine descriptors with clear chemical meaning obtained by the random forest method.
| Molecular Descriptor | Meaning 1 |
|---|---|
| NumHeteroatoms | the number of heteroatoms |
| NOCount | the number of nitrogen and oxygen atoms |
| MolLogP | Wildman-Crippen LogP value [ |
| NHOHCount | the number of NH and OH bonds |
| NumHDonors | the number of hydrogen bond donors |
| fr_lactam | the number of β-lactams |
| NumHAcceptors | the number of hydrogen bond acceptors |
| fr_COO2 and fr_COO | the number of carboxylic acids |
| fr_Al_OH_noTert | the number of aliphatic hydroxyl groups excluding tertiary OH |
1 The description is based on the RDKit documentation [18].
Figure 4. Distribution of positive and negative BBBP properties in the free-form dataset with respect to the number of hydrogen bond donors (a), the number of hydrogen bond acceptors (b), the molecular weight (c), and the MolLogP value (d).
Set of molecular descriptors.
| Name of Descriptor Set | Molecular Descriptors |
|---|---|
| FreeV11 | NumHeteroatoms, NumHDonors, NHOHCount, NumHAcceptors, |
| FreeTV10 | NumHDonors, NumSaturatedHeterocycles, nO, NumAliphaticRings, MolWt, MolLogP, nN, fr_Al_OH, fr_SH, fr_ketone |
| BloodV9 | NumHeteroatoms, MaxAbsPartialCharge, NOCount, NumHDonors, |
| BloodTV11 | NOCount, MaxAbsPartialCharge, NumHDonors, NumAliphaticHeterocycles, nO, MolWt, NumHeteroatoms, qed, NumHAcceptors, HeavyAtomCount, nN |
| RDKit61 | MaxEStateIndex, MinEStateIndex, MinAbsEStateIndex, qed, MolWt, |
| RDKit200 | 200 RDKit descriptors [ |
| Large | RDKit200 + {nH, nC, nN, nO, nS, nP, nF, nCl, nBr, nI, nX} from Mordred [ |
| Large212 | RDKit200 + {nH, nB, nC, nN, nO, nS, nP, nF, nCl, nBr, nI, nX} from Mordred [ |
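The keywords mention a forward search, and the small descriptor sets in the table above are the kind of result such a search produces. This excerpt does not describe the exact procedure, so the following is only a generic sketch of greedy forward selection. The function name `forward_search` and the `score` callback (any function that maps a list of descriptor names to a number where higher is better, for example a validation ROC-AUC) are hypothetical.

```python
def forward_search(candidates, score, max_features=None):
    """Greedy forward selection over descriptor names.

    `score` is a caller-supplied function (hypothetical interface):
    it takes a list of descriptor names and returns a number,
    with higher values being better.
    """
    selected = []
    best = float("-inf")
    remaining = list(candidates)

    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining descriptor to the current selection.
        trials = [(score(selected + [name]), name) for name in remaining]
        top_score, top_name = max(trials, key=lambda t: t[0])

        if top_score <= best:
            break  # no candidate improves the score, so stop

        selected.append(top_name)
        remaining.remove(top_name)
        best = top_score

    return selected, best
```

Each round costs one model evaluation per remaining descriptor, so in practice the scoring function dominates the run time.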
Figure 5. ROC-AUC scores of the prediction for the free-form and in-blood-form datasets obtained by DNN and ensemble methods.
Figure 6. BBBP ratio with respect to the number of aliphatic heterocycles (n) and the number of aliphatic hydroxy groups excluding tertiary alcohol OH, showing BBBP positive (blue) and negative (pink) for n = 0 (a), 1 (b), 2 (c), 3 (d), 4 (e), and 5 (f).
Figure 7. Chemical structures of β-glucose (A), salicin (B), amikacin (C), and plicamycin (D).
Figure 8. ROC-AUC scores of the prediction for the free-form and in-blood-form datasets obtained by the RF method. The descriptor set is given in parentheses.
Top six single models.
| No | Method | Dataset | Descriptor Set | ROC-AUC(Training) | ROC-AUC(Validation) | ROC-AUC(Test) |
|---|---|---|---|---|---|---|
| 1 | RF | Free-form | Large212 | 0.999(0) | 0.963(0) | 0.773(0) |
| 2 | RF | In-blood-form | Large | 0.999(0) | 0.964(0) | 0.762(0) |
| 3 | DNN | Free-form | FreeV11 | 0.948(4) | 0.944(3) | 0.760(10) |
| 4 | RF | Free-form | RDKit200 | 0.999(0) | 0.966(0) | 0.757(0) |
| 5 | CB | Free-form | Large212 | 0.990(0) | 0.948(0) | 0.755(0) |
| 6 | DNN | In-blood-form | BloodTV11 | 0.934(14) | 0.923(17) | 0.755(13) |
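For readers less familiar with the metric in the table above, ROC-AUC can be read as the probability that a randomly chosen BBBP-positive compound receives a higher predicted score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a plain-Python illustration with a hypothetical function name, not the evaluation code used in the study, and it counts tied scores as one half.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positives against negatives.

    `labels` are 0/1 class labels and `scores` are predicted scores,
    where a higher score means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

The quadratic loop is fine for datasets of a few thousand compounds; library implementations use a sort-based method that gives the same value.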
Figure 9. ROC curves for RF:Free(Large212) (a), CB:Free(Large212) (b), DNN:Free(FreeV11) (c), and DNN:Blood(BloodTV11) (d).
Figure 10. ROC-AUC scores of the top six prediction results and ensemble results.
Figure 11. Distribution of ROC-AUC scores for RF:Free(Large212) with scaffold split.
Figure 12. Distributions of ROC-AUC scores for the test set of 31 compounds (a) and for the test set of 95 compounds (b).
List of the 93 removed items.
| Category | Removed Items |
|---|---|
| One of each two-identical-compound set | ‘63’ (=‘73’), ‘154’ (=‘129’), ‘312’ (=‘97’), ‘337’ (=‘29’), ‘384’ (=‘62’), ‘388’ (=‘96’), ‘394’ (=‘70’), ‘415’ (=‘3’), ‘422’ (=‘105’), ‘435’ (=‘13’), ‘453’ (=‘87’), ‘457’ (=‘52’), ‘468’ (=‘467’), ‘488’ (=‘46’), ‘489’ (=‘56’), ‘508’ (=‘50’), ‘533’ (=‘34’), ‘535’ (=‘75’), ‘562’ (=‘140’), ‘591’ (=‘72’), ‘593’ (=‘2’), ‘607’ (=‘490’), ‘616’ (=‘62’), ‘619’ (=‘567’), ‘644’ (=‘26’), ‘646’ (=‘60’), ‘649’ (=‘40’), ‘650’ (=‘58’), ‘651’ (=‘393’), ‘667’ (=‘4’), ‘668’ (=‘33’), ‘669’ (=‘93’), ‘670’ (=‘89’), ‘671’ (=‘69’), ‘672’ (=‘79’), ‘673’ (=‘23’), ‘690’ (=‘55’), ‘959’ (=‘450’), ‘966’ (=‘691’), ‘1073’ (=‘523’), ‘1085’ (=‘564’), ‘1086’ (=‘1741’), ‘1111’ (=‘555’), ‘1388’ (=‘296’), ‘1462’ (=‘139’), ‘1471’ (=‘585’), ‘1508’ (=‘191’), ‘1567’ (=‘344’), ‘1583’ (=‘654’), ‘1597’ (=‘219’), ‘1662’ (=‘437’), ‘1782’ (=‘45’), ‘1882’ (=‘700’), ‘1906’ (=‘15’), ‘1947’ (=‘101’), ‘1961’ (=‘142’), ‘1971’ (=‘664’), ‘1979’ (=‘189’), ‘2031’ (=‘245’), and ‘2045’ (=‘252’) |
| Two of each three-identical-compound set | ‘269’ (=‘83’ =‘569’), ‘435’ (=‘13’ =‘1161’), ‘534’ (=‘85’ =‘1541’), ‘565’ (=‘59’ =‘617’), ‘566’ (=‘49’ =‘618’), ‘569’ (=‘83’ =‘269’), ‘617’ (=‘59’ =‘565’), ‘618’ (=‘49’ =‘566’), ‘713’ (=‘315’ =‘1916’), ‘1161’ (=‘13’ =‘435’), ‘1541’ (=‘85’ =‘534’), and ‘1916’ (=‘315’ =‘713’) |
| Inconsistent pair | (‘1’ [1], ‘380’ [0]), (‘17’ [1], ‘552’ [0]), (‘53’ [1], ‘648’ [0]), (‘102’ [0], ‘1009’ [1]), (‘128’ [0], ‘1701’ [1]), (‘176’ [0], ‘1645’ [1]), (‘267’ [0], ‘1314’ [1]), (‘284’ [0], ‘1881’ [1]), (‘305’ [0], ‘1361’ [1]), (‘325’ [0], ‘1910’ [1]), (‘326’ [0], ‘1381’ [1]), and (‘571’ [0], ‘1338’ [1]) |
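The cleaning rules implied by the table above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure. It assumes identical compounds have already been detected and share a `structure_key` (for example a canonical structure string), that one representative of each consistent duplicate set is kept, and that every member of a set with conflicting BBBP labels is dropped. The function name and the record layout are hypothetical.

```python
from collections import defaultdict


def clean_duplicates(records):
    """Remove duplicate and label-inconsistent entries.

    `records` is a list of (item_id, structure_key, label) tuples
    (hypothetical layout). Returns (kept_ids, removed_ids).
    """
    # Collect all entries that describe the same structure.
    groups = defaultdict(list)
    for item_id, key, label in records:
        groups[key].append((item_id, label))

    kept, removed = [], []
    for members in groups.values():
        labels = {label for _, label in members}
        if len(labels) > 1:
            # Conflicting labels: assume every member is removed.
            removed.extend(item_id for item_id, _ in members)
        else:
            # Consistent duplicates: keep the first, remove the rest.
            kept.append(members[0][0])
            removed.extend(item_id for item_id, _ in members[1:])
    return kept, removed
```

Keeping the first occurrence is an arbitrary choice in this sketch; any member of a consistent set carries the same structure and label.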