Viola Volpato, Alessandro Adelfio, Gianluca Pollastri.
Abstract
We present a novel ab initio predictor of protein enzymatic class. The predictor classifies proteins, based solely on their sequences, into one of six classes taken from the Enzyme Commission (EC) classification scheme. It is trained on a large, curated database of over 6,000 non-redundant proteins which we have assembled in this work. The predictor is powered by an ensemble of N-to-1 Neural Networks, a novel architecture which we have recently developed. N-to-1 Neural Networks operate on the full sequence and not on predefined features. All motifs of a predefined length (31 residues in this work) are considered and are compressed by an N-to-1 Neural Network into a feature vector which is automatically determined during training. We test our predictor in 10-fold cross-validation and obtain state-of-the-art results, with 96% correct classification and 86% generalized correlation. All six classes are predicted with a specificity of at least 80% and false positive rates never exceeding 7%. We are currently investigating enhanced input encoding schemes which include structural information, and are analyzing trained networks to mine the motifs that are most informative for the prediction and hence likely to be functionally relevant.
Year: 2013 PMID: 23368876 PMCID: PMC3548677 DOI: 10.1186/1471-2105-14-S1-S11
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of sequences per class in the dataset.
| Class (Metazoa dataset) | Number of sequences |
|---|---|
| Oxidoreductase | 954 |
| Transferase | 2110 |
| Hydrolase | 2226 |
| Lyase | 208 |
| Isomerase | 136 |
| Ligase | 445 |
| Total | 6081 |
Figure 1. N-to-1 NN architecture for predicting enzymatic class. N copies of the network (only 3 are shown for simplicity) process all the overlapping motifs of a predefined length in a sequence. The vectorial outputs of these networks are added up, and the resulting feature vector f is input to a network that produces the enzymatic class prediction.
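The data flow in the caption can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not the authors' implementation: the one-hot input encoding, the zero-padding at the sequence ends, the layer sizes, the tanh and softmax choices, and all function names (`one_hot`, `extract_motifs`, `init_params`, `forward`) are assumptions made for the example. Only the motif length (31) and the number of classes (6) come from the text.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 31      # motif length given in the abstract
N_CLASSES = 6    # six top-level EC classes


def one_hot(sequence):
    """Encode a sequence as an (L, 20) matrix; unknown residues stay all-zero.

    Assumption: the paper uses richer input encodings; one-hot keeps the sketch short.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    x = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        if aa in index:
            x[pos, index[aa]] = 1.0
    return x


def extract_motifs(x, window=WINDOW):
    """Return one flattened window per residue, zero-padded at both ends (assumption)."""
    half = window // 2
    padded = np.pad(x, ((half, half), (0, 0)))
    return np.stack([padded[i:i + window].ravel() for i in range(x.shape[0])])


def init_params(rng, n_in, n_hidden=12, n_feat=8):
    """Random weights for illustration only; all sizes are hypothetical."""
    def layer(n_rows, n_cols):
        return rng.normal(scale=0.1, size=(n_rows, n_cols)), np.zeros(n_cols)
    return {
        "motif_hidden": layer(n_in, n_hidden),
        "motif_out": layer(n_hidden, n_feat),
        "top_hidden": layer(n_feat, n_hidden),
        "top_out": layer(n_hidden, N_CLASSES),
    }


def forward(sequence, params):
    """Map a sequence of any length to (feature vector f, class probabilities)."""
    m = extract_motifs(one_hot(sequence))          # (L, 31 * 20)
    w, b = params["motif_hidden"]
    h = np.tanh(m @ w + b)                         # same weights applied to every motif
    w, b = params["motif_out"]
    per_motif = np.tanh(h @ w + b)                 # (L, n_feat)
    f = per_motif.sum(axis=0)                      # outputs added up into f
    w, b = params["top_hidden"]
    top = np.tanh(f @ w + b)
    w, b = params["top_out"]
    logits = top @ w + b
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return f, e / e.sum()


rng = np.random.default_rng(0)
params = init_params(rng, n_in=WINDOW * len(AMINO_ACIDS))
f, probs = forward("MKTAYIAKQRQISFVKSHFSRQ", params)
print(f.shape, probs.round(3))
```

Because the per-motif outputs are summed, the feature vector has the same size whatever the sequence length, which is what allows a single fixed-size network to make the final class prediction.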
Results after training and testing with the two input coding schemes. Spec: specificity; Sens: sensitivity; MCC: Matthews correlation coefficient; FPR: false positive rate; GC: generalized correlation; Q: fraction of correct classifications.
| Class | MSA Spec | MSA Sens | MSA MCC | MSA FPR | MSA+SS Spec | MSA+SS Sens | MSA+SS MCC | MSA+SS FPR |
|---|---|---|---|---|---|---|---|---|
| Oxidoreductase | 0.90 | 0.89 | 0.87 | 0.02 | 0.88 | 0.88 | 0.86 | 0.02 |
| Transferase | 0.90 | 0.89 | 0.84 | 0.05 | 0.90 | 0.87 | 0.82 | 0.06 |
| Hydrolase | 0.89 | 0.92 | 0.84 | 0.07 | 0.87 | 0.91 | 0.82 | 0.08 |
| Lyase | 0.91 | 0.84 | 0.87 | 0.00 | 0.89 | 0.82 | 0.85 | 0.01 |
| Isomerase | 0.81 | 0.72 | 0.76 | 0.01 | 0.84 | 0.69 | 0.76 | 0.00 |
| Ligase | 0.89 | 0.85 | 0.86 | 0.01 | 0.87 | 0.84 | 0.85 | 0.01 |
| GC | 0.86 | | | | 0.84 | | | |
| Q | 0.96 | | | | 0.96 | | | |
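Per-class measures such as the specificity and false positive rate quoted in the abstract can be derived from a multi-class confusion matrix by treating each class one-versus-rest. The sketch below is a generic illustration, not the authors' evaluation code. It assumes that specificity means TP / (TP + FP), a convention common in this literature, and that sensitivity means TP / (TP + FN). The function name `per_class_metrics` and the 3-class confusion matrix are invented for the example. Q and GC are deliberately left out because their exact definitions are not given in this record.

```python
import numpy as np


def per_class_metrics(conf):
    """One-vs-rest metrics from a confusion matrix.

    conf[i, j] = number of proteins of true class i predicted as class j.
    Returns four arrays (one value per class): spec, sens, mcc, fpr.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp          # true class i, predicted elsewhere
    fp = conf.sum(axis=0) - tp          # predicted i, true class elsewhere
    tn = total - tp - fn - fp
    spec = tp / (tp + fp)               # assumed definition (precision-like)
    sens = tp / (tp + fn)
    fpr = fp / (fp + tn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return spec, sens, mcc, fpr


# Toy 3-class confusion matrix, invented purely for illustration.
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 1, 9]]
spec, sens, mcc, fpr = per_class_metrics(toy)
for name, values in [("Spec", spec), ("Sens", sens), ("MCC", mcc), ("FPR", fpr)]:
    print(name, values.round(2))
```

With strongly unbalanced classes, as in the dataset above, MCC and FPR are more informative than sensitivity alone, since a rare class can reach a low FPR while still missing many of its members.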
Results for MSA and ProtFun (from [13]) trained and tested in 10-fold cross-validation.
| Class | MSA Sens | MSA FPR | ProtFun Sens | ProtFun FPR |
|---|---|---|---|---|
| Oxidoreductase | 0.89 | 0.02 | 0.62 | 0.25 |
| Transferase | 0.89 | 0.05 | 0.65 | 0.20 |
| Hydrolase | 0.92 | 0.07 | 0.60 | 0.20 |
| Lyase | 0.84 | 0.00 | 0.71 | 0.15 |
| Isomerase | 0.72 | 0.01 | 0.75 | 0.17 |
| Ligase | 0.85 | 0.01 | 0.87 | 0.10 |