Alperen Dalkiran, Ahmet Sureyya Rifaioglu, Maria Jesus Martin, Rengul Cetin-Atalay, Volkan Atalay, Tunca Doğan.
Abstract
BACKGROUND: The automated prediction of the enzymatic functions of uncharacterized proteins is a crucial topic in bioinformatics. Although several methods and tools have been proposed to classify enzymes, most of these studies are limited to specific functional classes and levels of the Enzyme Commission (EC) number hierarchy. In addition, most previous methods incorporated only a single input feature type, which limits their applicability across the wide functional space. Here, we propose a novel enzymatic function prediction tool, ECPred, based on an ensemble of machine learning classifiers.
Keywords: Benchmark datasets; EC numbers; Function prediction; Machine learning; Protein sequence
Year: 2018 PMID: 30241466 PMCID: PMC6150975 DOI: 10.1186/s12859-018-2368-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Structure of an EC number classifier in ECPred
Fig. 2 Flowchart of ECPred together with the prediction route of an example query protein. At the Level 0–1 classification, the query protein (P) received a score higher than the class-specific positive cut-off value of main EC class 1.-.-.- (i.e., oxidoreductase); as a result, the query was directed only to the models for the subclasses of main class 1.-.-.-. At the subclass level (Level 2), P received a score above the cut-off for EC 1.1.-.- (i.e., acting on the CH-OH group of donors) and was directed further to its child sub-subclass EC numbers, where it received a score above the cut-off for EC 1.1.2.- (i.e., with a cytochrome as acceptor) at Level 3 and for EC 1.1.2.4 (i.e., D-lactate dehydrogenase - cytochrome) at the substrate level (Level 4), giving the final prediction of EC 1.1.2.4
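The top-down routing described in this caption can be illustrated with a minimal Python sketch. It is not the ECPred implementation: the function `predict_ec`, the toy hierarchy, the constant toy scorers and the cut-off of 0.5 are all hypothetical, and choosing the highest-scoring child when several pass their cut-offs is an assumption made for illustration. The separate enzyme versus non-enzyme decision at Level 0 is omitted.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: a scorer maps a protein sequence to a prediction score.
Scorer = Callable[[str], float]


def predict_ec(
    sequence: str,
    children: Dict[str, List[str]],
    scorers: Dict[str, Scorer],
    cutoffs: Dict[str, float],
    root: str = "root",
) -> Optional[str]:
    """Descend the EC hierarchy, moving to a child class only when its
    score exceeds that class's positive cut-off. Returns the deepest EC
    number reached, or None if no main class passes its cut-off."""
    current = root
    prediction = None
    while children.get(current):
        best_score, best_child = None, None
        for child in children[current]:
            score = scorers[child](sequence)
            # Assumption: among passing children, follow the top scorer.
            if score > cutoffs[child] and (best_score is None or score > best_score):
                best_score, best_child = score, child
        if best_child is None:
            break  # no child passed; stop at the current level
        prediction = best_child
        current = best_child
    return prediction


# Toy hierarchy mirroring the route in the caption (illustrative only).
TOY_CHILDREN = {
    "root": ["1.-.-.-", "2.-.-.-"],
    "1.-.-.-": ["1.1.-.-", "1.2.-.-"],
    "1.1.-.-": ["1.1.2.-"],
    "1.1.2.-": ["1.1.2.4"],
}
# Made-up scores; real scores would come from trained classifiers.
TOY_SCORES = {
    "1.-.-.-": 0.9, "2.-.-.-": 0.2,
    "1.1.-.-": 0.8, "1.2.-.-": 0.3,
    "1.1.2.-": 0.7, "1.1.2.4": 0.6,
}
TOY_SCORERS = {ec: (lambda seq, s=s: s) for ec, s in TOY_SCORES.items()}
TOY_CUTOFFS = {ec: 0.5 for ec in TOY_SCORES}

if __name__ == "__main__":
    print(predict_ec("MKT", TOY_CHILDREN, TOY_SCORERS, TOY_CUTOFFS))
```

With the toy scores, the query follows 1.-.-.-, 1.1.-.-, 1.1.2.- and ends at 1.1.2.4, as in the example route; if no main class passes its cut-off, the function returns `None`.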
Fig. 3 Positive and negative training dataset construction for EC class 1.1.-.-. Green indicates classes whose members are used in the positive training dataset, red indicates classes whose members are used in the negative training dataset, and grey indicates classes whose members are used in neither dataset
The number of proteins and UniRef50 clusters in the initial dataset for each main enzyme class and for non-enzymes
| EC main classes | # of proteins | # of UniRef50 clusters |
|---|---|---|
| Oxidoreductases | 36,577 | 8,242 |
| Transferases | 86,163 | 20,133 |
| Hydrolases | 59,551 | 16,018 |
| Lyases | 22,368 | 3,475 |
| Isomerases | 13,615 | 2,883 |
| Ligases | 29,233 | 4,429 |
| Non-Enzyme | 42,382 | 25,333 |
The number of proteins that were used in the training and validation of ECPred, for each main enzyme class
| EC main classes | Positive Training Dataset Size | Negative Training Dataset Size: Enzymes (a) | Negative Training Dataset Size: Non-enzymes | Positive Validation Dataset Size | Negative Validation Dataset Size |
|---|---|---|---|---|---|
| Oxidoreductases | 7,417 | 3,709 | 3,709 | 825 | 822 |
| Transferases | 18,119 | 9,060 | 9,060 | 2,014 | 2,012 |
| Hydrolases | 14,416 | 7,208 | 7,208 | 1,602 | 1,601 |
| Lyases | 3,127 | 1,564 | 1,564 | 348 | 344 |
| Isomerases | 2,549 | 1,275 | 1,275 | 284 | 282 |
| Ligases | 3,986 | 1,993 | 1,993 | 443 | 441 |
(a) Equal numbers of enzymes were selected from the other EC classes
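In the table above, each of the two negative training subsets (enzymes from other classes and non-enzymes) is half the size of the positive training set, rounded up. The short Python check below illustrates that arithmetic. The rounding rule is inferred from the reported numbers, not stated in the source, and the names `POSITIVE_TRAINING_SIZES`, `REPORTED_NEGATIVE_HALF` and `negative_split` are hypothetical.

```python
from typing import Dict, Tuple

# Positive training set sizes, copied from the table above.
POSITIVE_TRAINING_SIZES: Dict[str, int] = {
    "Oxidoreductases": 7417,
    "Transferases": 18119,
    "Hydrolases": 14416,
    "Lyases": 3127,
    "Isomerases": 2549,
    "Ligases": 3986,
}

# Reported size of each negative subset (enzymes and non-enzymes are equal).
REPORTED_NEGATIVE_HALF: Dict[str, int] = {
    "Oxidoreductases": 3709,
    "Transferases": 9060,
    "Hydrolases": 7208,
    "Lyases": 1564,
    "Isomerases": 1275,
    "Ligases": 1993,
}


def negative_split(n_positive: int) -> Tuple[int, int]:
    """Return (n_enzymes, n_non_enzymes) for the negative training set,
    assuming each subset is half the positive set, rounded up."""
    half = (n_positive + 1) // 2  # integer ceiling of n_positive / 2
    return half, half


if __name__ == "__main__":
    for ec_class, n_pos in POSITIVE_TRAINING_SIZES.items():
        enzymes, non_enzymes = negative_split(n_pos)
        matches = enzymes == REPORTED_NEGATIVE_HALF[ec_class]
        print(f"{ec_class}: {enzymes} + {non_enzymes} (matches table: {matches})")
```

Under this assumed rule, the total negative set is about the same size as the positive set for every main class, which keeps the two-class training problem roughly balanced.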
The performance results of the ECPred validation analysis
| EC Level | F1-score | Recall | Precision |
|---|---|---|---|
| Level 0 | 0.96 | 0.96 | 0.96 |
| Level 1 | 0.96 | 0.96 | 0.96 |
| Level 2 | 0.98 | 0.97 | 0.99 |
| Level 3 | 0.99 | 0.98 | 0.99 |
| Level 4 | 0.99 | 0.99 | 0.99 |
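The F1-score in these tables is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported recall and precision values as a consistency check. The names `f1_score`, `VALIDATION_ROWS` and `max_deviation` are hypothetical. Because the table values are rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit (Level 3 is such a case).

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, recall, precision) per level, copied from the table above.
VALIDATION_ROWS: Dict[str, Tuple[float, float, float]] = {
    "Level 0": (0.96, 0.96, 0.96),
    "Level 1": (0.96, 0.96, 0.96),
    "Level 2": (0.98, 0.97, 0.99),
    "Level 3": (0.99, 0.98, 0.99),
    "Level 4": (0.99, 0.99, 0.99),
}


def max_deviation(rows: Dict[str, Tuple[float, float, float]]) -> float:
    """Largest absolute gap between reported and recomputed F1."""
    return max(
        abs(reported - f1_score(precision, recall))
        for reported, recall, precision in rows.values()
    )


if __name__ == "__main__":
    for level, (reported, recall, precision) in VALIDATION_ROWS.items():
        recomputed = f1_score(precision, recall)
        print(f"{level}: reported {reported:.2f}, recomputed {recomputed:.4f}")
    print(f"max deviation: {max_deviation(VALIDATION_ROWS):.4f}")
```

The same function can be applied to any row of the comparison tables for which both recall and precision are available.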
Temporal hold-out test enzyme – non-enzyme (Level 0) prediction performance comparison
| Method | F1-score | Recall | Precision |
|---|---|---|---|
| ProtFun | 0.79 | 0.87 | 0.72 |
| EzyPred | 0.15 | 0.13 | 0.16 |
| EFICAz | 0.42 | 0.30 | 0.69 |
| DEEPre | 0.53 | 0.43 | 0.68 |
| ECPred-wne | 0.65 | 0.93 | 0.50 |
| ECPred | | | |
Temporal hold-out test EC main class (Level 1) prediction performance comparison
| Method | F1-score | Recall | Precision |
|---|---|---|---|
| ProtFun | 0.12 | 0.10 | 0.15 |
| EzyPred | 0.15 | 0.13 | 0.16 |
| EFICAz | 0.42 | 0.30 | |
| DEEPre | | 0.40 | 0.67 |
| ECPred-wne | 0.40 | | 0.34 |
| ECPred | 0.48 | | 0.54 |
Temporal hold-out test EC subclass (Level 2) prediction performance comparison
| Method | F1-score | Recall | Precision |
|---|---|---|---|
| EzyPred | 0.11 | 0.10 | 0.13 |
| EFICAz | 0.11 | 0.07 | 0.33 |
| DEEPre | 0.11 | | 0.07 |
| ECPred | | 0.20 | |
Temporal hold-out test EC sub-subclass (Level 3) prediction performance comparison
| Method | F1-score | Recall | Precision |
|---|---|---|---|
| DEEPre | 0.05 | 0.03 | 0.14 |
| ECPred | | | |
Performance comparison of the individual predictors and ECPred for enzyme – non-enzyme (Level 0) prediction
| Method | F1-score | Recall | Precision |
|---|---|---|---|
| SPMap | 0.82 | 0.90 | |
| BLAST- | 0.75 | 0.83 | 0.68 |
| Pepstats-SVM | | | 0.73 |
| ECPred | | | 0.73 |
Performance comparison of the individual predictors and ECPred for the main EC class (Level 1) prediction
| Method | F1-score | Recall | Precision |
|---|---|---|---|
| SPMap | 0.23 | 0.17 | 0.36 |
| BLAST- | 0.47 | | 0.52 |
| Pepstats-SVM | 0.26 | 0.20 | 0.35 |
| ECPred | | | |
No-Pfam test dataset enzyme – non-enzyme (Level 0) prediction performance comparison
| Methods | F1-score | Recall | Precision |
|---|---|---|---|
| EzyPred | 0.54 | 0.54 | 0.54 |
| EFICAz | 0.37 | 0.23 | |
| DEEPre | 0.60 | 0.40 | 0.85 |
| ECPred | | | 0.89 |
No-Pfam test dataset EC main class (Level 1) prediction performance comparison
| Methods | F1-score | Recall | Precision |
|---|---|---|---|
| EzyPred | 0.42 | 0.39 | 0.46 |
| EFICAz | 0.33 | 0.20 | |
| DEEPre | 0.52 | 0.38 | 0.82 |
| ECPred | | | 0.86 |
No-Pfam test dataset EC subclass (Level 2) prediction performance comparison
| Methods | F1-score | Recall | Precision |
|---|---|---|---|
| EzyPred | 0.30 | 0.26 | 0.36 |
| EFICAz | 0.33 | 0.20 | |
| DEEPre | 0.40 | 0.27 | 0.77 |
| ECPred | | | 0.82 |
No-Pfam test dataset EC sub-subclass (Level 3) prediction performance comparison
| Methods | F1-score | Recall | Precision |
|---|---|---|---|
| EFICAz | 0.33 | 0.20 | |
| DEEPre | 0.33 | 0.22 | 0.73 |
| ECPred | | | 0.81 |
No-Pfam test dataset EC substrate (Level 4) prediction performance comparison
| Methods | F1-score | Recall | Precision |
|---|---|---|---|
| EFICAz | 0.33 | 0.20 | |
| DEEPre | 0.33 | 0.22 | 0.73 |
| ECPred | | | 0.71 |