Yasser B Ruiz-Blanco, Guillermin Agüero-Chapin, Enrique García-Hernández, Orlando Álvarez, Agostinho Antunes, James Green.
Keywords: Alignment-free protein analysis; Enzyme; ProtDCal; Protein descriptors; Support vector machines; TI2BioP
Year: 2017 PMID: 28732462 PMCID: PMC5521120 DOI: 10.1186/s12859-017-1758-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic representation of the protein descriptor generation process of ProtDCal. The dashed drawings denote an alternative pathway in the feature generation, which leads to a different family of descriptors. The blue drawings indicate those families of descriptors derived purely from primary structure information
Fig. 2 Workflow for the calculation of the pseudo-fold 2D indices (spectral moments series) in TI2BioP. Both the Nandy and FCM representations of a graph are illustrated
Number of remaining features for each one of the protein descriptor families after applying several selection filters
| Set | Initial | Info. Gain | Redundancy | Best Subset |
|---|---|---|---|---|
| 0D | 3905 | 891 | 34 | 9 |
| 1D | 8705 | 1456 | 265 | 13 |
| 2D | 1883 | 1256 | 5 | 5 |
| 3D | 64313 | 8339 | 2456 | 26 |
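The effect of each filter is easier to compare across descriptor families as a fraction of the initial feature count. The following is a minimal sketch that uses only the counts tabulated above; the `retention` helper and the dictionary layout are illustrative assumptions, not part of ProtDCal or TI2BioP.

```python
# Feature counts copied from the table above, in column order:
# (initial, after info. gain filter, after redundancy filter, best subset)
counts = {
    "0D": (3905, 891, 34, 9),
    "1D": (8705, 1456, 265, 13),
    "2D": (1883, 1256, 5, 5),
    "3D": (64313, 8339, 2456, 26),
}


def retention(stages):
    """Return the fraction of the initial features surviving each later stage."""
    initial = stages[0]
    return [n / initial for n in stages[1:]]


for family, stages in counts.items():
    ig, redundancy, best = retention(stages)
    print(f"{family}: IG {ig:.1%}, redundancy {redundancy:.2%}, best subset {best:.3%}")
```

Normalizing by the initial count keeps the much larger 3D set from dominating a comparison based on raw numbers.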
Fig. 3 Distribution of the number of high-scoring sequence pairs according to bit-score value ranges. Each sequence pair is represented by the highest-scoring segment pair (HSP) in the local alignment. HSPs were obtained with BLAST using a permissive E-value cutoff of 10
Fig. 4 Information gain (IG) of the features of each protein descriptor family after redundancy reduction. Each point on a curve gives the number of descriptors of a given type (x-axis) whose IG value is higher than the corresponding value on the y-axis
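The curves described in this caption can be read as counts of descriptors above an IG threshold. The sketch below shows one way such a curve could be computed; the function name `survival_curve` and the IG values are hypothetical and are not taken from the paper.

```python
def survival_curve(ig_values, thresholds):
    """For each threshold, count the descriptors whose IG is strictly higher."""
    return [sum(1 for v in ig_values if v > t) for t in thresholds]


# Hypothetical IG values for one descriptor family (illustrative only)
example_ig = [0.02, 0.05, 0.05, 0.11, 0.20]
thresholds = [0.0, 0.05, 0.10, 0.20]

# One count per threshold; plotted as (count, threshold) pairs this gives
# a curve with the number of descriptors on x and the IG value on y.
curve = survival_curve(example_ig, thresholds)
print(list(zip(curve, thresholds)))
```

A family whose curve stays further to the right at a given IG value has more descriptors at or above that level of information.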
Comparison with published results, in 10-fold cross-validation, of SVM methods using the D&D dataset
| Kernel | Accuracy* (%) | Reference | Run time | Computer |
|---|---|---|---|---|
| PUK | 82.0 ± 0.3 | ProtDCal 3D model | 53 m 2 s | Intel Core i5-3210M 2.5 GHz with 8 GB of RAM |
| GraphK ShinglingWL | 81.54 ± 1.54 | [ | 3 h 1 m 7 s | Apple MacPro with 3.0GHz Intel 8-Core with 16GB RAM |
| GraphK WLmod | 80.31 | [ | 25 m 0 s | NA |
| Radial | 80.17 ± 1.24 | [ | NA | NA |
| GraphK WL | 79.78 ± 0.36 | [ | 11 m 0 s | Apple MacPro with 3.0GHz Intel 8-Core with 16GB RAM |
| GraphK WL | 79.00 ± 0.2 | [ | 6 m 42 s | 3.4GHz Intel core i7 processors |
| PUK | 78.8 ± 0.2 | ProtDCal 1D model | 3 m 42 s | Intel Core i5-3210M 2.5 GHz with 8 GB of RAM |
| GraphK WL | 78.29 | [ | 2 h 12 m 57 s | Mac OS X 10.5 with two 2.66 GHz Dual Core Intel Xeon processors, with 4 GB 667 MHz DDR2 memory |
| PUK | 77.58 | [ | 21 m 51 s | 2.5 GHz Intel 2-Core processor (i.e. i5-3210M) |
| GraphK LWL | 76.60 ± 0.6 | [ | 11 m 0 s | 16-core machine (Intel Xeon CPU E5-2665 @ 2.40 GHz and 96 GB of RAM) |
| GraphK SP | 75.87 | [ | 1 h 40 m 57 s | NA |
| GraphK PRW | 75.40 ± 0.6 | [ | NA | NA |
The runtimes reported for our models include both the time to compute the features and the time to build and assess the models using Weka 3.7.11
NA: not available
*For each of the listed references, the tabulated accuracy corresponds to the best performance on the D&D dataset as reported in that article
Runtimes and computational resources are also listed for the methods included in the comparison
All the referenced methods constitute 3D classifiers given that they use 3D–graphs to represent the protein structure
Success rates of sequence-based enzyme identification methods on the benchmark dataset made up of 30 formerly uncharacterized proteins from the S. oneidensis proteome
| Method | Number of correct predictions | Success rate (%) |
|---|---|---|
| ProtDCal-1D-model | 23 | 76.67 |
| EzyPred | 16 | 53.33 |
| EnzymeDetector | 27 | 90.00 |
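The success rates follow directly from the number of correct predictions over the 30 benchmark proteins. A short sketch of that arithmetic, with the counts taken from the table above (the `success_rate` helper is an illustrative name, not part of any of the listed tools):

```python
TOTAL_PROTEINS = 30  # size of the S. oneidensis benchmark set stated in the caption

# Correct predictions per method, copied from the table above
correct = {
    "ProtDCal-1D-model": 23,
    "EzyPred": 16,
    "EnzymeDetector": 27,
}


def success_rate(n_correct, n_total=TOTAL_PROTEINS):
    """Return the percentage of correct predictions, rounded to two decimals."""
    return round(100.0 * n_correct / n_total, 2)


rates = {method: success_rate(n) for method, n in correct.items()}
for method, rate in rates.items():
    print(f"{method}: {rate:.2f}%")
```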