Ken Lin, Xiongwen Quan, Chen Jin, Zhuangwei Shi, Jinglong Yang.
Abstract
Background: Classification and annotation of enzyme proteins are fundamental to enzyme research on biological metabolism. Enzyme Commission (EC) numbers provide a standard for hierarchical enzyme class prediction, for which several computational methods have been proposed. However, most of these methods depend on prior distribution information, and none explicitly quantifies amino-acid-level relations or the possible contribution of sub-sequences.
Methods: In this study, we propose DAttProt, a double-scale attention enzyme class prediction model with high reusability and interpretability. DAttProt encodes sequences with self-supervised Transformer encoders during pre-training and gathers local features with multi-scale convolutions during fine-tuning. Specifically, a probabilistic double-scale attention weight matrix is designed to aggregate multi-scale features and positional prediction scores. Finally, a fully connected linear classifier performs the final inference from the aggregated features and prediction scores.
Results: On the DEEPre and ECPred datasets, DAttProt is competitive with the compared methods on level 0 and outperforms them on deeper task levels, reaching 0.788 accuracy on level 2 of DEEPre and 0.967 macro-F1 on level 1 of ECPred. Moreover, through a case study, we demonstrate that the double-scale attention matrix learns to discover and focus on the positions and scales of bio-functional sub-sequences in the protein.
Conclusion: DAttProt provides an effective and interpretable method for enzyme class prediction. It predicts enzyme protein classes accurately and, furthermore, discovers enzymatic functional sub-sequences such as protein motifs on both positional and spatial scales.
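The aggregation step described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the probabilistic weight matrix is a joint softmax over a grid of scores indexed by spatial scale and position, whereas the article derives its weights from feature agreement. All function names and the example scores are hypothetical.

```python
import math

def double_scale_weights(scores):
    """Joint softmax over a K x L score grid (K scales, L positions).

    Assumption: a plain softmax stands in for the article's attention weights.
    """
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def positional_marginal(weights):
    """Sum the weights over scales, giving one weight per position."""
    return [sum(col) for col in zip(*weights)]

def aggregate(weights, features):
    """Weighted sum of features[k][l] (each a d-dimensional list) over scales and positions."""
    dim = len(features[0][0])
    out = [0.0] * dim
    for w_row, f_row in zip(weights, features):
        for w, vec in zip(w_row, f_row):
            for i in range(dim):
                out[i] += w * vec[i]
    return out

# Hypothetical scores for K = 2 kernel sizes and L = 3 positions.
scores = [[0.0, 1.0, 0.0],
          [2.0, 0.0, 0.0]]
weights = double_scale_weights(scores)
marginal = positional_marginal(weights)

# Identical features everywhere, so the pooled vector must equal that feature.
features = [[[1.0, 2.0] for _ in range(3)] for _ in range(2)]
pooled = aggregate(weights, features)
```

Because the weights sum to one over both scales and positions, the largest entry points at a (scale, position) pair, and summing over scales gives a per-position weight vector of the kind plotted in the case-study figures.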
Keywords: double-scale attention; enzyme class prediction; feature agreement; multi-scale convolutions; self-attention
Year: 2022 PMID: 35432476 PMCID: PMC9012241 DOI: 10.3389/fgene.2022.885627
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1: DAttProt model overview. The sequence “MADE … Q” is an example input protein sequence. FC abbreviates a fully connected layer. Conv_k and Space_k denote the convolution with kernel size k and the spatial scale of size k, respectively. The figure also marks a weight vector on the positional scale and a weight matrix on both the spatial and positional scales.
Glossary of important symbols in this article.
| Symbol | Meaning | Symbol | Meaning |
|---|---|---|---|
| | Input protein sequence of tokens | | Dimensions of key, value features |
| | Length of input sequence | | Output of the Transformer encoders |
| | Primary sequence embedding of | | |
| | Dimension of an embedding vector | | Feature on spatial scale |
| | Number of classes | | |
| | Number of kernel sizes | | Center vector at position |
| | Number of heads in Transformer | | Agreement matrix |
| | Position | | Double-scale attention matrix |
| | Spatial scale or kernel size | | Mixed-scale encoder output |
| | Input of a Transformer encoder | | Positional prediction matrix |
Performance results on DEEPre dataset.
| Method | Variant | Level 0 acc. (var. × 10³) | Level 1 acc. (var. × 10³) | Level 2 acc. (var. × 10³) |
|---|---|---|---|---|
| “DEEPre” | | | 0.826 (0.017) | 0.436 (0.135) |
| UDSMProt | Forward | 0.867 (0.015) | 0.816 (0.020) | 0.753 (0.075) |
| UDSMProt | Backward | 0.861 (0.017) | 0.834 (0.022) | 0.739 (0.083) |
| UDSMProt | Bi-direction | 0.871 (0.010) | 0.845 (0.020) | 0.781 (0.066) |
| DAttProt | 3-layer | 0.858 (0.016) | 0.821 (0.019) | 0.736 (0.080) |
| DAttProt | 6-layer | 0.877 (0.018) | | |
Level 2 results are first computed as the average of the 6 sub-class classification results. All results are averages over 5 runs of 5-fold experiments, and bold values mark the best result at each task level (similarly hereinafter). acc: accuracy; var: variance.
Performance results on ECPred dataset.
| Method | Variant | Level 0 mac-P | Level 0 mac-R | Level 0 mac-F1 (var.) | Level 1 mac-P | Level 1 mac-R | Level 1 mac-F1 (var.) |
|---|---|---|---|---|---|---|---|
| “ECPred” | | 0.972 | 0.965 | 0.964 (0.008) | 0.970 | 0.948 | 0.963 (0.010) |
| UDSMProt | Forward | 0.967 | 0.958 | 0.955 (0.013) | 0.958 | 0.926 | 0.933 (0.026) |
| UDSMProt | Backward | 0.969 | 0.962 | 0.967 (0.015) | 0.957 | 0.933 | 0.935 (0.022) |
| UDSMProt | Bi-direction | 0.979 | | | 0.963 | 0.940 | 0.944 (0.025) |
| DAttProt | 3-layer | 0.962 | 0.934 | 0.944 (0.016) | 0.953 | 0.904 | 0.925 (0.023) |
| DAttProt | 6-layer | | 0.960 | 0.965 (0.012) | | | |
All results are averages over 5 experimental runs, and bold values mark the best result at each task level (similarly hereinafter). P: precision; R: recall; var: variance; mac-: macro-.
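For readers unfamiliar with the macro-averaged metrics in this table, the sketch below shows one common way to compute them. It assumes macro-F1 is the unweighted mean of per-class F1 scores; the record does not say which macro-F1 definition the article uses. The function name and the labels are illustrative only.

```python
def macro_scores(y_true, y_pred):
    """Return (macro-precision, macro-recall, macro-F1) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    # Every class counts equally, regardless of how many samples it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels for three classes (not data from the article).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
mac_p, mac_r, mac_f1 = macro_scores(y_true, y_pred)
```

Macro averaging weights each class equally, which matters for imbalanced enzyme classes where plain accuracy can hide poor performance on small classes.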
Detailed performance indexes of DAttProt.
| Task | Level | Branch | 3-layer acc | 3-layer P | 3-layer R | 3-layer F1 | 6-layer acc | 6-layer P | 6-layer R | 6-layer F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| DEEPre | 0 | Proteins | 0.85 | 0.85 | 0.79 | 0.80 | 0.87 | 0.86 | 0.80 | 0.81 |
| DEEPre | 1 | Enzymes | 0.82 | 0.70 | 0.64 | 0.59 | 0.85 | 0.84 | 0.80 | 0.80 |
| DEEPre | 2 | Oxidoreductases | 0.61 | 0.56 | 0.46 | 0.48 | 0.72 | 0.70 | 0.62 | 0.63 |
| DEEPre | 2 | Transferases | 0.70 | 0.56 | 0.51 | 0.53 | 0.76 | 0.74 | 0.71 | 0.72 |
| DEEPre | 2 | Hydrolases | 0.72 | 0.42 | 0.40 | 0.39 | 0.80 | 0.59 | 0.54 | 0.55 |
| DEEPre | 2 | Lyases | 0.78 | 0.61 | 0.50 | 0.53 | 0.81 | 0.76 | 0.73 | 0.74 |
| DEEPre | 2 | Isomerases | 0.81 | 0.86 | 0.77 | 0.79 | 0.83 | 0.88 | 0.78 | 0.80 |
| DEEPre | 2 | Ligases | 0.80 | 0.62 | 0.56 | 0.58 | 0.81 | 0.70 | 0.64 | 0.65 |
| ECPred | 0 | Proteins | 0.91 | 0.96 | 0.93 | 0.94 | 0.94 | 0.98 | 0.96 | 0.97 |
| ECPred | 1 | Enzymes | 0.92 | 0.95 | 0.90 | 0.93 | 0.96 | 0.98 | 0.95 | 0.97 |
p-values of the unpaired and one-tailed heteroscedastic Student’s t-test on 6-layer DAttProt over compared methods on the last task level of each dataset.
| Dataset | “DEEPre” | “ECPred” | UDSMProt Forward | UDSMProt Backward | UDSMProt Bi-direction | 3-layer DAttProt |
|---|---|---|---|---|---|---|
| DEEPre Lv.2 | 3.955e-57 | - | 6.114e-19 | 2.052e-24 | 1.387e-3 | 5.360e-26 |
| ECPred Lv.1 | - | 4.446e-2 | 3.710e-6 | 3.027e-6 | 4.296e-5 | 5.890e-7 |
For each method, the calculated p-values are based on 25 results for DEEPre and 5 results for ECPred. MeN denotes M × 10^N in scientific notation.
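An unpaired, heteroscedastic t-test is commonly known as Welch's t-test. The sketch below computes its statistic and the Welch-Satterthwaite degrees of freedom from summary statistics. As an illustration only, it plugs in the DEEPre level 2 means and variances reported for UDSMProt (bi-direction) and 3-layer DAttProt, a pairing that does not appear in the p-value table; it also assumes the reported variances are sample variances over the 25 results. The function name is hypothetical.

```python
import math

def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch's t statistic and degrees of freedom from means, variances, and sample sizes."""
    se_a = var_a / n_a  # squared standard error of mean A
    se_b = var_b / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Variances are listed as var × 10³ in the DEEPre table, so rescale by 1e-3.
t_stat, dof = welch_t(0.781, 0.066e-3, 25, 0.736, 0.080e-3, 25)
```

The one-tailed p-value is the upper-tail probability of a t distribution with `dof` degrees of freedom evaluated at `t_stat`; if SciPy is available, `scipy.stats.t.sf(t_stat, dof)` returns it.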
FIGURE 2: Examples of matched motifs of different sizes. Their spatial sizes vary from 3 to over 15. The plotted weights are the marginal weight vector on the positional scale and the weight vector of K-sized sub-sequences on the positional scale (similarly hereinafter).
FIGURE 3: Three matched motifs belonging to the PNPLA domain.
FIGURE 4: Matched sites and regions of AYR1_SCHPO. (A) Motif. (B) Amino acid involved in the enzymatic activity. (C) Binding site. (D) Extent of a nucleotide phosphate-binding region.
Statistical data of matched motifs in case study.
| Field | Count |
|---|---|
| Protein samples | 351 |
| Total motif matches | 483 |
| Matched motif types | 113 |
| Motif size in range | 370 |
| Motif size in range | 57 |
| Motif size in range | 50 |
For full data please refer to the Supplementary Table S2