Jaak Simm, Lina Humbeck, Adam Zalewski, Noe Sturm, Wouter Heyndrickx, Yves Moreau, Bernd Beck, Ansgar Schuffenhauer.
Abstract
With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is to obtain a realistic split of chemical structures (compounds) between training, validation and test sets, such that the performance on the test set is a meaningful indicator of the performance in a prospective application. This challenge is interesting and relevant in its own right, but becomes even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods which provide a splitting of a data set and are applicable in a federated privacy-preserving setting, namely: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, (c) scaffold-based binning (scaffold network). For the evaluation of these splitting methods we consider the following quality criteria (compared to random splitting): bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are: (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in the federated privacy-preserving setting.
Keywords: ChemFold; Cross-validation; Federated machine learning; Leader follower clustering; Locality-sensitive hashing; Scaffold network; Scaffold tree; Sphere exclusion clustering; Train-test-split
Year: 2021 PMID: 34876230 PMCID: PMC8650276 DOI: 10.1186/s13321-021-00576-2
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
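To make the first of the three splitting methods named in the abstract more concrete, the following is a minimal sketch of an LSH-style fold assignment. It assumes bit-sampling LSH over binary fingerprints represented as sets of on-bit indices. The function names, the sampled bit positions, and the use of SHA-256 to map a signature to a fold are illustrative assumptions and are not taken from the paper.

```python
import hashlib

def lsh_signature(on_bits, sampled_positions):
    """Bit-sampling LSH: read the fingerprint only at the sampled positions."""
    return tuple(1 if p in on_bits else 0 for p in sampled_positions)

def lsh_fold(on_bits, sampled_positions, n_folds=5):
    """Map the LSH signature to a fold index in a deterministic way."""
    signature = lsh_signature(on_bits, sampled_positions)
    digest = hashlib.sha256(bytes(signature)).digest()
    return int.from_bytes(digest[:8], "big") % n_folds

# Hypothetical bit positions that all partners would agree on in advance.
SAMPLED_POSITIONS = [3, 17, 42, 99, 256, 1023]

# Hypothetical fingerprints given as sets of on-bit indices.
cpd_a = {3, 42, 500, 1023}
cpd_b = {3, 42, 777, 1023}   # differs from cpd_a only outside the sampled bits
cpd_c = {17, 99, 256}

fold_a = lsh_fold(cpd_a, SAMPLED_POSITIONS)
fold_b = lsh_fold(cpd_b, SAMPLED_POSITIONS)
fold_c = lsh_fold(cpd_c, SAMPLED_POSITIONS)
```

In this sketch the fold of a compound depends only on its own fingerprint and the shared bit positions, so each party could compute the assignment locally without exchanging structures. Compounds that agree on all sampled bits, such as `cpd_a` and `cpd_b`, always land in the same fold.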
Fig. 1. Fold splitting in a sparse multitask setting. There are two approaches. a Fold splitting is done at the per-task level: for each task, the observations are assigned to the test fold independently. In this setup, some compounds are in the test fold for some assays and in the training set for others. b Fold splitting on the whole compound space, where a compound with all its measurements is assigned to one fold
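The difference between the two approaches in Fig. 1 can be sketched in a few lines. The example below uses a plain hash of a key as a stand-in for any fold assignment rule; the data, the helper name `stable_fold`, and the choice of five folds are hypothetical and only serve to show which key the fold depends on.

```python
import hashlib

def stable_fold(key, n_folds=5):
    """Deterministically map a string key to a fold index (stand-in rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds

# Hypothetical sparse multitask data: (compound_id, task_id, label).
measurements = [
    ("cpd1", "assayA", 1),
    ("cpd1", "assayB", 0),
    ("cpd2", "assayA", 0),
    ("cpd3", "assayB", 1),
    ("cpd3", "assayC", 1),
]

# (a) Per-task splitting: the fold depends on compound AND task, so one
#     compound may sit in different folds for different assays.
per_task = {(c, t): stable_fold(f"{c}|{t}") for c, t, _ in measurements}

# (b) Whole-compound splitting: the fold depends on the compound only, so
#     all measurements of a compound share one fold.
per_compound = {(c, t): stable_fold(c) for c, t, _ in measurements}

def folds_per_compound(assignment):
    """Collect the set of folds each compound appears in."""
    seen = {}
    for (compound, _task), fold in assignment.items():
        seen.setdefault(compound, set()).add(fold)
    return seen
```

With approach (b), `folds_per_compound(per_compound)` contains exactly one fold per compound by construction, whereas approach (a) gives no such guarantee.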
Fig. 2. Federated machine learning scheme in MELLODDY. a MELLODDY uses a feed-forward neural network, which is divided into a trunk and a head part. The trunk is jointly trained by all partners, using secure aggregation of parameter updates at each iteration. The head part is individual to each pharma partner, and thus allows each partner to model their own assays without mapping them to the other partners' assays. The common trunk enables transfer learning between the tasks of different partners. b The technical architecture of MELLODDY. Each partner maintains a node on a cloud platform. An additional model dispatcher node aggregates the trunk updates, but is prevented by the secure aggregation protocol from accessing these updates itself
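The caption does not say which secure aggregation protocol is used. As a generic illustration of how an aggregator can sum updates without seeing any single one, the sketch below uses additive pairwise masking over integers. This is an assumed textbook-style scheme, not the MELLODDY protocol, and all names and values are hypothetical.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value

def pairwise_masks(n_partners, length, seed=0):
    """Build one mask per partner such that all masks sum to zero (mod MODULUS).

    Assumption: a single seeded RNG stands in for the pairwise shared secrets
    that a real protocol would derive through key agreement between partners.
    """
    rng = random.Random(seed)
    masks = [[0] * length for _ in range(n_partners)]
    for i in range(n_partners):
        for j in range(i + 1, n_partners):
            shared = [rng.randrange(MODULUS) for _ in range(length)]
            for k in range(length):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS  # partner i adds
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS  # partner j subtracts
    return masks

def aggregate(masked_updates):
    """What the dispatcher computes: an element-wise sum of masked updates."""
    length = len(masked_updates[0])
    return [sum(u[k] for u in masked_updates) % MODULUS for k in range(length)]

# Hypothetical integer-encoded trunk updates of three partners.
updates = [[3, 10, 7], [1, 0, 5], [4, 2, 2]]
masks = pairwise_masks(n_partners=3, length=3)
masked = [
    [(u + m) % MODULUS for u, m in zip(update, mask)]
    for update, mask in zip(updates, masks)
]
total = aggregate(masked)  # the masks cancel, leaving the plain sum
```

Because every shared mask is added by one partner and subtracted by another, the masks cancel in the sum, and `total` equals the element-wise sum of the unmasked updates.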
Fig. 3. Scaffold-based fold splits. a The scaffold extraction procedure, illustrated on the example of flucloxacillin. b Examples of compounds from ChEMBL that are highly similar by Tanimoto similarity (Tc) of a fingerprint (ECFP6, folded to 32k) but are assigned to different scaffolds
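The Tanimoto similarity (Tc) mentioned in panel b can be sketched on fingerprints represented as sets of on-bit indices. The modulo-based folding to 32768 bits shown here is a common convention and is an assumption about what "32k folded" means; the fingerprint values are made up.

```python
N_BITS = 32768  # assumed meaning of "32k folded"

def fold_bits(feature_ids, n_bits=N_BITS):
    """Fold arbitrary integer feature ids into a fixed-size bit space."""
    return {f % n_bits for f in feature_ids}

def tanimoto(bits_a, bits_b):
    """Tc = |A intersect B| / |A union B| for two sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

# Hypothetical folded fingerprints of two similar compounds.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 9, 40}
tc = tanimoto(fp_a, fp_b)  # 3 shared bits out of 5 distinct bits
```

Two compounds can have a high Tc and still differ in their ring system, which is how pairs like those in panel b end up on different scaffolds.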
Hyperparameter grid used in the optimization
| Hyperparameter | Set of values |
|---|---|
| Hidden sizes | [1200], [1600], [2000], [3000], [1600, 1600] |
| Dropout | 0.4, 0.5, 0.6, 0.7 |
| Weight decay | 1E−5, 1E−6 |
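The grid above can be enumerated as the Cartesian product of the three value sets. The following sketch uses hypothetical dictionary keys and assumes the grid is searched exhaustively, which the table itself does not state.

```python
from itertools import product

# Values copied from the table; the key names are illustrative.
GRID = {
    "hidden_sizes": [[1200], [1600], [2000], [3000], [1600, 1600]],
    "dropout": [0.4, 0.5, 0.6, 0.7],
    "weight_decay": [1e-5, 1e-6],
}

# One dict per combination: 5 * 4 * 2 = 40 configurations in total.
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]
```

Each entry of `configs` is one candidate setting, for example `{"hidden_sizes": [1200], "dropout": 0.4, "weight_decay": 1e-05}` for the first one.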
Fig. 4. Distribution of compounds over different folds depending on the similarity of these compounds. Fraction of intra-fold pairs as a function of the Tc ECFP6 similarity of the pair, a for the public data set and b averaged over 4 pharma data sets (confidence intervals indicated as bars). In a, the decadic logarithm of the number of pairs (bold black line) is additionally given as a function of the Tc ECFP6 similarity of the pair
Label and data imbalance of different folding methods averaged over all tasks of four partners and the ChEMBL subset. Fraction below 05: fraction of tasks below five compounds in one or more folds, fraction label imbalance: fraction of tasks where the fold standard deviation of the fraction of actives was greater than 0.05
| Fold method | Task size bin lower limit | Fraction below 05 | Fraction label imbalance |
|---|---|---|---|
| LSH | 10 | 0.90 | 0.35 |
| | 100 | 0.29 | 0.37 |
| | 1000 | 0.08 | 0.11 |
| | 10000 | 0.03 | 0.00 |
| | 100000 | 0.00 | 0.00 |
| Sphere exclusion | 10 | 0.95 | 0.46 |
| | 100 | 0.37 | 0.49 |
| | 1000 | 0.11 | 0.24 |
| | 10000 | 0.04 | 0.00 |
| | 100000 | 0.05 | 0.00 |
| Scaffold network | 10 | 0.96 | 0.58 |
| | 100 | 0.46 | 0.64 |
| | 1000 | 0.10 | 0.29 |
| | 10000 | 0.04 | 0.08 |
| | 100000 | 0.08 | 0.12 |
| Random | 10 | 0.67 | 0.07 |
| | 100 | 0.18 | 0.05 |
| | 1000 | 0.05 | 0.00 |
| | 10000 | 0.00 | 0.00 |
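The two imbalance measures defined in the table caption can be computed per task and then averaged. The sketch below follows those definitions, with two assumptions that the caption does not settle: the population standard deviation is used, and empty folds are skipped when computing the fraction of actives. The task data are invented.

```python
from statistics import pstdev

def task_flags(fold_counts, min_size=5, max_std=0.05):
    """fold_counts: one (n_compounds, n_actives) pair per fold of a task."""
    # Flag 1: fewer than five compounds in one or more folds.
    too_small = any(n < min_size for n, _ in fold_counts)
    # Flag 2: the fraction of actives varies too much across folds.
    fractions = [actives / n for n, actives in fold_counts if n > 0]
    imbalanced = len(fractions) > 1 and pstdev(fractions) > max_std
    return too_small, imbalanced

def summarize(tasks):
    """Return (fraction below 5, fraction label imbalance) over all tasks."""
    flags = [task_flags(counts) for counts in tasks.values()]
    n_tasks = len(flags)
    return (
        sum(small for small, _ in flags) / n_tasks,
        sum(imb for _, imb in flags) / n_tasks,
    )

# Hypothetical tasks with three folds each.
tasks = {
    "t1": [(20, 10), (20, 10), (20, 10)],  # balanced, large enough
    "t2": [(3, 1), (30, 15), (30, 15)],    # one small fold, uneven actives
    "t3": [(10, 1), (10, 5), (10, 9)],     # large enough, very uneven actives
    "t4": [(50, 25), (50, 26), (50, 24)],  # nearly balanced
}
fraction_below_5, fraction_label_imbalance = summarize(tasks)
```

In the table, these fractions are additionally reported per task size bin; the same function could be applied to each bin separately.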
Differences in best hyperparameter selection for different folding methods on the public data set. The top 10 performing hyperparameter sets for the random fold splitting are given together with the respective rank of this setting for the other folding methods
| Hidden sizes | Dropout | Weight decay | Rank sphere exclusion | Rank scaffold network | Rank LSH |
|---|---|---|---|---|---|
| [2000] | 0.7 | 1E−6 | 2 | 1 | 8 |
| [2000] | 0.6 | 1E−6 | 8 | 6 | 5 |
| [1600] | 0.7 | 1E−6 | 3 | 2 | 2 |
| [1200] | 0.5 | 1E−6 | 11 | 13 | 7 |
| [1200] | 0.6 | 1E−6 | 9 | 10 | 4 |
| [3000] | 0.7 | 1E−6 | 5 | 4 | 6 |
| [1200] | 0.7 | 1E−6 | 1 | 3 | 1 |
| [1600] | 0.6 | 1E−6 | 6 | 9 | 3 |
| [3000] | 0.6 | 1E−6 | 7 | 8 | 11 |
| [1600] | 0.5 | 1E−6 | 4 | 11 | 10 |
Fig. 5. Performance difference of folding methods compared to a random folding. Performance difference measured as delta AUROC, averaged over at least three test folds (confidence intervals indicated as bars) and compared to a random folding, for four folding methods and all tasks of four partners as well as a ChEMBL subset
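A delta AUROC of the kind shown in Fig. 5 can be sketched as follows. AUROC is computed here through its pairwise-ranking definition, and the sign convention (folding method minus random folding) is an assumption, since the caption does not state it. The per-fold values are invented.

```python
from statistics import mean

def auroc(labels, scores):
    """AUROC as the probability that a random active outranks a random inactive."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one active and one inactive")
    # Ties between an active and an inactive count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def delta_auroc(method_fold_aurocs, random_fold_aurocs):
    """Mean test-fold AUROC of a folding method minus that of random folding."""
    return mean(method_fold_aurocs) - mean(random_fold_aurocs)

# Small worked example: 3 of the 4 active/inactive pairs are ranked correctly.
example_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Hypothetical per-fold AUROC values for one task.
delta = delta_auroc([0.70, 0.72, 0.74], [0.80, 0.79, 0.81])
```

Under this sign convention, a negative `delta` means the task looks harder under the folding method than under random folding.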