| Literature DB >> 30517077 |
Niclas Ståhl, Göran Falkman, Alexander Karlsson, Gunnar Mathiason, Jonas Boström.
Abstract
We present a flexible deep convolutional neural network method for the analysis of arbitrary-sized graph structures representing molecules. This method, which makes use of the Lipinski module of RDKit, an open-source cheminformatics toolkit, enables the incorporation of any global molecular information (such as molecular charge and molecular weight) as well as local information (such as atom hybridization and bond orders). In this paper, we show that this method significantly outperforms another recently proposed method based on deep convolutional neural networks on several of the studied datasets. Several best practices for training deep convolutional neural networks on chemical datasets are also highlighted within the article, such as how to select the information to be included in the model, how to prevent overfitting and how unbalanced classes in the data can be handled.
Keywords: Deep learning; Molecular property prediction; Side effects prediction; Unbalanced data
Year: 2018 PMID: 30517077 PMCID: PMC6798861 DOI: 10.1515/jib-2018-0065
Source DB: PubMed Journal: J Integr Bioinform ISSN: 1613-4516
Figure 1:A graphical representation of the proposed model. This figure shows an example where the model is applied to a molecule consisting of three atoms. The equations governing the inner workings of this model are described in equations (1) to (5).
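The convolution over a small molecular graph, as in the three-atom example of Figure 1, can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's equations (1)–(5): it uses a simple degree-normalised neighbourhood sum with ReLU and sum-pooling, and the feature values and weight matrix are placeholders, not trained parameters.

```python
import numpy as np

# Hypothetical 3-atom molecule: each row holds one atom's local features
# (e.g. one-hot hybridization, bond-order counts extracted with RDKit).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])            # shape: (n_atoms, n_features)

# Adjacency matrix encoding the bonds (atom 1 bonded to atoms 0 and 2).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

A_hat = A + np.eye(3)                 # self-loops: each atom keeps its own features
D_inv = np.diag(1.0 / A_hat.sum(1))  # normalise by node degree

W = np.full((2, 2), 0.5)             # placeholder "learnable" weight matrix

# One convolution step: aggregate neighbours, mix features, apply ReLU.
H = np.maximum(D_inv @ A_hat @ X @ W, 0.0)

# Sum-pooling over atoms yields a fixed-length vector for any molecule size,
# which is what lets the network handle arbitrary-sized graphs.
graph_vec = H.sum(axis=0)
```

Because the pooling step collapses the atom dimension, the same network weights apply unchanged to molecules with any number of atoms.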
Results for the SIDER dataset.
| SIDER | Training | Validation | Testing |
|---|---|---|---|
| Experimental set-up 1 | 0.622 | 0.554 | 0.595 |
| Experimental set-up 2 | 0.663 | 0.558 | 0.595 |
| Experimental set-up 3 | 0.666 | 0.560 | 0.602 |
| Experimental set-up 4 | 0.681 | 0.586 | 0.589 |
| Experimental set-up 5 | 0.693 | 0.589 | 0.605 |
| Experimental set-up 6 | 0.691 | 0.594 | 0.623 |
| Graph convolution [21] | 0.751 | 0.613 | 0.585 |
Results for the ClinTox dataset.
| ClinTox | Training | Validation | Testing |
|---|---|---|---|
| Experimental set-up 1 | 0.932 | 0.93 | 0.921 |
| Experimental set-up 2 | 0.931 | 0.929 | 0.927 |
| Experimental set-up 3 | 0.942 | 0.942 | 0.94 |
| Experimental set-up 4 | 0.932 | 0.932 | 0.933 |
| Experimental set-up 5 | 0.935 | 0.928 | 0.934 |
| Experimental set-up 6 | 0.933 | 0.929 | 0.931 |
| Graph convolution [21] | 0.962 | 0.920 | 0.807 |
Figure 2: The distribution of the achieved AUC-ROC values, averaged over all target variables, for each experimental set-up. The blue dashed line represents the result achieved by the GCNN presented by Wu et al. [21]. The left plot shows the results on the training data and the right plot the results on the test data.
Figure 3: The ROC curves for each target variable from models trained using the information described in experimental set-ups 5 and 6 for the SIDER dataset. The ROC curves for the five most unbalanced classes are highlighted in red, in order to show how the performance is affected by the weighting of the loss function. The top row shows results achieved using experimental set-up 5 and the bottom row results from experimental set-up 6. The left column shows the ROC curves for the training data and the right column for the test data. Note that the results for the training data are almost identical for the two set-ups, while the performance on the test data increased in set-up 6. This is due to a better separation of the unbalanced classes, which would otherwise bring down the mean AUC-ROC value.
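The loss weighting referred to in the caption can be sketched as follows. This is an illustrative assumption of one common scheme (binary cross-entropy with per-class positive weights derived from label frequency), not the paper's exact loss; the function name `weighted_bce` and the example labels are hypothetical.

```python
import numpy as np

def weighted_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy where each class's positive term is up-weighted
    in inverse proportion to its frequency, so rare labels (e.g. uncommon
    side effects) are not drowned out by the common ones."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    pos_freq = y_true.mean(axis=0)                        # fraction of positives per class
    w_pos = (1.0 - pos_freq) / np.maximum(pos_freq, eps)  # rarer positives get larger weight
    loss = -(w_pos * y_true * np.log(y_pred)
             + (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()

# Toy multi-label batch: 4 molecules, 2 target classes, each 25 % positive.
y_true = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.6, 0.1], [0.2, 0.1], [0.3, 0.2], [0.1, 0.7]])
loss = weighted_bce(y_true, y_pred)
```

With 25 % positives per class, each positive term is weighted by a factor of 3, pushing the model toward separating the rare class instead of optimising only the majority labels.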
Results for the TOX21 dataset.
| TOX21 | Training | Validation | Testing |
|---|---|---|---|
| Experimental set-up 1 | 0.745 | 0.696 | 0.675 |
| Experimental set-up 2 | 0.812 | 0.772 | 0.674 |
| Experimental set-up 3 | 0.82 | 0.801 | 0.735 |
| Experimental set-up 4 | 0.829 | 0.795 | 0.768 |
| Experimental set-up 5 | 0.867 | 0.847 | 0.839 |
| Experimental set-up 6 | 0.887 | 0.861 | 0.825 |
| Graph convolution [21] | 0.905 | 0.825 | 0.829 |