Afshine Amidi, Shervine Amidi, Dimitrios Vlachakis, Vasileios Megalooikonomou, Nikos Paragios, Evangelia I Zacharaki.
Abstract
Over the past decade, growing computational power and ever-rising data availability have made deep learning techniques increasingly popular, owing to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which has enabled the expansion of models that aim to predict enzymatic function from amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and is therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural network classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet.
Keywords: 3D convolutional neural networks; Deep learning; EnzyNet; Enzyme classification
Year: 2018 PMID: 29740518 PMCID: PMC5937476 DOI: 10.7717/peerj.4750
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Structure of the dataset.
| Split | EC1 | EC2 | EC3 | EC4 | EC5 | EC6 | Total |
|---|---|---|---|---|---|---|---|
| Training | 7,096 | 12,081 | 15,290 | 2,875 | 1,703 | 1,632 | 40,677 |
| Validation | 1,775 | 2,935 | 3,809 | 743 | 488 | 419 | 10,169 |
| Testing | 2,323 | 3,717 | 4,762 | 858 | 571 | 481 | 12,712 |
| Total | 11,194 | 18,733 | 23,861 | 4,476 | 2,762 | 2,532 | 63,558 |
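The counts in the dataset table make the split proportions and the class imbalance easy to quantify. The sketch below is illustrative only: the function names are hypothetical, and `inverse_frequency_weights` shows one common way to derive class weights from counts, which is an assumption and not necessarily the weighting used by the authors.

```python
# Per-split, per-class counts (EC1..EC6) copied from the dataset table.
COUNTS = {
    "Training": [7096, 12081, 15290, 2875, 1703, 1632],
    "Validation": [1775, 2935, 3809, 743, 488, 419],
    "Testing": [2323, 3717, 4762, 858, 571, 481],
}


def split_fractions(counts):
    """Fraction of all enzymes that falls into each split."""
    total = sum(sum(row) for row in counts.values())
    return {name: sum(row) / total for name, row in counts.items()}


def class_shares(counts):
    """Fraction of all enzymes that belongs to each EC class."""
    n_classes = len(next(iter(counts.values())))
    per_class = [sum(row[i] for row in counts.values()) for i in range(n_classes)]
    total = sum(per_class)
    return [c / total for c in per_class]


def inverse_frequency_weights(row):
    """Hypothetical class weights: w_c = N / (K * n_c).

    Assumption for illustration; rare classes receive larger weights.
    """
    n, k = sum(row), len(row)
    return [n / (k * c) for c in row]


if __name__ == "__main__":
    print(split_fractions(COUNTS))
    print(class_shares(COUNTS))
    print(inverse_frequency_weights(COUNTS["Training"]))
```

With weights of this form, the weighted sample count equals the original training-set size, so the overall loss scale stays comparable to the unweighted case.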
Figure 1. Illustration of the meaning of Rmax with respect to volume V.
Figure 2. Illustration of enzyme 2Q3Z for grid sizes l = 32 (A), l = 64 (B), and l = 96 (C).
Summary of the preprocessing steps applied to each enzyme at training time.
| Step | Operation |
|---|---|
| 1 | |
| 2 | |
| 3 | Extract coordinates of backbone atoms from its PDB file |
| 4 | |
| 5 | Interpolate consecutive backbone atoms by |
| 6 | |
| 7 | Center barycenter |
| 8 | Homothetic transformation of each point with center |
| 9 | |
| 10 | Principal component analysis (PCA) transformation |
| 11 | |
| 12 | |
| 13 | Flip coordinates with respect to the origin along |
| 14 | |
| 15 | Flip coordinates with respect to the origin along |
| 16 | |
| 17 | Flip coordinates with respect to the origin along |
| 18 | |
| 19 | Center barycenter |
| 20 | Transform coordinate points into binary voxels |
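The legible steps of the table above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (which is available in the linked repository): all function names are hypothetical, the number of interpolated points `p`, the homothety ratio, and the per-axis flip probability of 1/2 are assumptions, and parsing the PDB file is left out, so the input is taken to be an array of backbone coordinates.

```python
import numpy as np


def interpolate_backbone(coords, p=5):
    """Insert p evenly spaced points between consecutive atoms (p is assumed)."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) < 2 or p <= 0:
        return coords
    t = np.arange(p + 1) / (p + 1)  # 0, 1/(p+1), ..., p/(p+1)
    start, end = coords[:-1, None, :], coords[1:, None, :]
    segments = start + t[None, :, None] * (end - start)  # shape (n-1, p+1, 3)
    return np.vstack([segments.reshape(-1, 3), coords[-1:]])


def center(coords):
    """Move the barycenter to the origin."""
    return coords - coords.mean(axis=0)


def scale_to_grid(coords, r_max, grid_size=32):
    """Homothety about the origin; the ratio used here is an assumption."""
    return coords * ((grid_size // 2 - 1) / r_max)


def pca_align(coords):
    """Rotate points so their principal axes follow x, y, z (largest variance first)."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(coords, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return coords @ eigvecs[:, order]


def random_flips(coords, rng):
    """Flip each axis through the origin with probability 1/2 (assumed)."""
    return coords * rng.choice([-1.0, 1.0], size=3)


def voxelize(coords, grid_size=32):
    """Mark every voxel containing at least one point; drop points outside the grid."""
    grid = np.zeros((grid_size,) * 3, dtype=np.uint8)
    idx = np.floor(coords + grid_size / 2).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < grid_size), axis=1)]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid


def preprocess(backbone_coords, r_max, grid_size=32, p=5, rng=None):
    """Apply the steps in the order in which they appear in the table."""
    rng = np.random.default_rng() if rng is None else rng
    pts = interpolate_backbone(backbone_coords, p)
    pts = center(pts)
    pts = scale_to_grid(pts, r_max, grid_size)
    pts = pca_align(pts)
    pts = random_flips(pts, rng)
    pts = center(pts)  # second centering, as listed in the table
    return voxelize(pts, grid_size)
```

Calling `preprocess(coords, r_max=40.0)` on an `(n, 3)` coordinate array returns a binary `32 × 32 × 32` occupancy grid; passing a seeded `np.random.default_rng` makes the flips reproducible.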
Figure 3. Drawing of the architecture selected for our experiments.
Figure 4. Analysis of the distribution of the radii among training enzymes (A) and corresponding level of information conveyed (B).
Figure 5. Evolution of the network’s performance during the training process with uniform (A) and adjusted (B) weights.
Fivefold cross-validation results.
| Weights | Uniform | Adapted | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Decision | None | Probabilities | Classes | None | Probabilities | Classes | ||||||
| Augmentation | Flips | W. flips | Flips | W. flips | Flips | W. flips | Flips | W. flips | ||||
| Accuracy | 0.752 ± 0.004 | 0.771 ± 0.002 | 0.748 ± 0.002 | 0.765 ± 0.002 | 0.711 ± 0.006 | 0.719 ± 0.007 | 0.733 ± 0.004 | 0.710 ± 0.007 | 0.722 ± 0.008 | |||
| Precision | EC | 1 | 0.841 ± 0.019 | 0.900 ± 0.021 | 0.859 ± 0.026 | 0.884 ± 0.022 | 0.752 ± 0.026 | 0.786 ± 0.036 | 0.796 ± 0.035 | 0.717 ± 0.036 | 0.769 ± 0.029 | |
| 2 | 0.761 ± 0.021 | 0.753 ± 0.040 | 0.768 ± 0.035 | 0.725 ± 0.035 | 0.758 ± 0.032 | 0.784 ± 0.015 | 0.784 ± 0.024 | 0.760 ± 0.026 | 0.789 ± 0.025 | |||
| 3 | 0.740 ± 0.017 | 0.686 ± 0.026 | 0.706 ± 0.023 | 0.702 ± 0.025 | 0.706 ± 0.022 | 0.735 ± 0.030 | 0.750 ± 0.030 | 0.743 ± 0.030 | 0.752 ± 0.028 | |||
| 4 | 0.927 ± 0.018 | 0.970 ± 0.012 | 0.964 ± 0.007 | 0.696 ± 0.032 | 0.836 ± 0.042 | 0.833 ± 0.042 | 0.835 ± 0.036 | 0.792 ± 0.037 | ||||
| 5 | 0.888 ± 0.013 | 0.967 ± 0.012 | 0.971 ± 0.015 | 0.954 ± 0.010 | 0.462 ± 0.046 | 0.584 ± 0.070 | 0.589 ± 0.060 | 0.591 ± 0.074 | 0.550 ± 0.067 | |||
| 6 | 0.849 ± 0.037 | 0.971 ± 0.031 | 0.953 ± 0.038 | 0.324 ± 0.040 | 0.330 ± 0.052 | 0.347 ± 0.049 | 0.351 ± 0.049 | 0.339 ± 0.051 | ||||
| Macro | 0.834 ± 0.011 | 0.876 ± 0.007 | 0.867 ± 0.007 | 0.870 ± 0.009 | 0.632 ± 0.006 | 0.676 ± 0.008 | 0.685 ± 0.006 | 0.666 ± 0.010 | 0.665 ± 0.009 | |||
| Recall | EC | 1 | 0.686 ± 0.021 | 0.715 ± 0.018 | 0.706 ± 0.018 | 0.718 ± 0.018 | 0.746 ± 0.008 | 0.722 ± 0.018 | 0.745 ± 0.012 | 0.735 ± 0.014 | 0.739 ± 0.009 | |
| 2 | 0.784 ± 0.021 | 0.766 ± 0.033 | 0.783 ± 0.028 | 0.777 ± 0.026 | 0.657 ± 0.026 | 0.646 ± 0.040 | 0.665 ± 0.034 | 0.650 ± 0.035 | 0.656 ± 0.032 | |||
| 3 | 0.880 ± 0.015 | 0.907 ± 0.021 | 0.881 ± 0.023 | 0.900 ± 0.019 | 0.753 ± 0.018 | 0.790 ± 0.031 | 0.800 ± 0.029 | 0.766 ± 0.032 | 0.781 ± 0.028 | |||
| 4 | 0.645 ± 0.008 | 0.541 ± 0.006 | 0.583 ± 0.009 | 0.514 ± 0.007 | 0.583 ± 0.008 | 0.698 ± 0.019 | 0.711 ± 0.016 | 0.678 ± 0.015 | 0.708 ± 0.021 | |||
| 5 | 0.525 ± 0.016 | 0.416 ± 0.017 | 0.462 ± 0.012 | 0.402 ± 0.017 | 0.458 ± 0.009 | 0.663 ± 0.019 | 0.663 ± 0.021 | 0.643 ± 0.015 | 0.665 ± 0.019 | |||
| 6 | 0.382 ± 0.020 | 0.233 ± 0.022 | 0.283 ± 0.019 | 0.203 ± 0.017 | 0.284 ± 0.018 | 0.638 ± 0.037 | 0.664 ± 0.041 | 0.633 ± 0.036 | 0.655 ± 0.043 | |||
| Macro | 0.660 ± 0.003 | 0.592 ± 0.007 | 0.623 ± 0.005 | 0.582 ± 0.007 | 0.620 ± 0.005 | 0.695 ± 0.007 | 0.697 ± 0.009 | 0.684 ± 0.008 | 0.701 ± 0.010 | |||
| F1 | EC | 1 | 0.791 ± 0.007 | 0.778 ± 0.008 | 0.774 ± 0.006 | 0.792 ± 0.005 | 0.749 ± 0.011 | 0.752 ± 0.011 | 0.769 ± 0.013 | 0.725 ± 0.014 | 0.754 ± 0.010 | |
| 2 | 0.772 ± 0.007 | 0.758 ± 0.005 | 0.752 ± 0.007 | 0.766 ± 0.006 | 0.712 ± 0.010 | 0.707 ± 0.016 | 0.724 ± 0.013 | 0.700 ± 0.013 | 0.715 ± 0.013 | |||
| 3 | 0.781 ± 0.009 | 0.795 ± 0.008 | 0.781 ± 0.007 | 0.791 ± 0.007 | 0.763 ± 0.007 | 0.760 ± 0.006 | 0.773 ± 0.005 | 0.753 ± 0.008 | 0.766 ± 0.007 | |||
| 4 | 0.761 ± 0.007 | 0.695 ± 0.005 | 0.728 ± 0.007 | 0.672 ± 0.005 | 0.726 ± 0.005 | 0.706 ± 0.011 | 0.761 ± 0.015 | 0.748 ± 0.012 | 0.747 ± 0.016 | |||
| 5 | 0.582 ± 0.018 | 0.625 ± 0.013 | 0.569 ± 0.019 | 0.619 ± 0.010 | 0.543 ± 0.032 | 0.618 ± 0.037 | 0.624 ± 0.032 | 0.613 ± 0.043 | 0.600 ± 0.043 | |||
| 6 | 0.376 ± 0.029 | 0.438 ± 0.023 | 0.336 ± 0.023 | 0.438 ± 0.022 | 0.428 ± 0.031 | 0.437 ± 0.042 | 0.452 ± 0.034 | 0.449 ± 0.036 | 0.444 ± 0.041 | |||
| Macro | 0.706 ± 0.004 | 0.730 ± 0.003 | 0.696 ± 0.003 | 0.724 ± 0.002 | 0.662 ± 0.006 | 0.687 ± 0.007 | 0.697 ± 0.004 | 0.675 ± 0.007 | 0.683 ± 0.008 | |||
Note: Best performance for each indicator is shown in bold.
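For reference, the indicators reported in the table (accuracy, and per-class and macro precision, recall and F1) can be computed from a confusion matrix as sketched below. The function names are hypothetical, and the macro values are taken here as the unweighted mean over classes, which is the usual definition but an assumption about how the table was produced.

```python
def accuracy(confusion):
    """Share of correctly classified samples.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def per_class_metrics(confusion):
    """Return one (precision, recall, f1) tuple per class."""
    k = len(confusion)
    metrics = []
    for c in range(k):
        tp = confusion[c][c]
        predicted = sum(confusion[i][c] for i in range(k))  # column sum
        actual = sum(confusion[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall and F1 over classes (assumed definition)."""
    k = len(metrics)
    return tuple(sum(m[i] for m in metrics) / k for i in range(3))


if __name__ == "__main__":
    toy = [[8, 2], [1, 9]]  # toy two-class example, not data from the paper
    print(accuracy(toy))
    print(macro_average(per_class_metrics(toy)))
```

Because macro averages weight every class equally, they penalize poor results on the small classes more than accuracy does.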
Figure 6. Illustration of the hydropathy (A) and charge (B) attributes incorporated into the shape model for enzyme 2Q3Z.
Each attribute is illustrated in color for better visualization, but it actually corresponds to a single channel.