Nancy Arana-Daniel, Alberto A Gallegos, Carlos López-Franco, Alma Y Alanís, Jacob Morales, Adriana López-Franco.
Abstract
With the increasing power of computers, the amount of data that can be processed in short periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition, and text classification. Most state-of-the-art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes a simple-to-implement approach based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture, and biofuels.
Keywords: evolutionary algorithms; kernel-adatron; large scale learning; machine learning; protein structure prediction; support vector machines
Year: 2016 PMID: 27980384 PMCID: PMC5140013 DOI: 10.4137/EBO.S40912
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. A binary dataset is composed of positively and negatively labeled values. Several hyperplanes may separate such a dataset optimally, but for purposes of generalization the hyperplane with the largest margin gives the best results.
Figure 2. Datasets that are not linearly separable may be separated by a hyperplane in a higher-dimensional space after applying the kernel trick.
Figure 3. Left: amino-acid sequence of a protein. Right: a representation of the protein's three-dimensional structure.
Kernel Adatron Algorithm (pseudocode listing).
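The mathematical expressions in the listing above were lost in extraction. For reference, a minimal sketch of the classical Kernel-Adatron update: each multiplier alpha[i] moves by eta * (1 - z_i), where z_i is the margin of point i, and is clipped to [0, C]. The parameter names `eta`, `epochs`, and `C` are illustrative, not the paper's.

```python
import numpy as np

def kernel_adatron(X, y, kernel, eta=0.1, epochs=100, C=10.0):
    """Sketch of the classical Kernel-Adatron update: alpha[i] moves by
    eta * (1 - z_i), where z_i is the margin of point i, clipped to
    [0, C] (soft margin)."""
    n = len(y)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            z = y[i] * np.sum(alpha * y * K[:, i])   # margin of point i
            alpha[i] = min(max(alpha[i] + eta * (1.0 - z), 0.0), C)
    return alpha

def ka_predict(X, y, alpha, kernel, x):
    """Sign of the kernel expansion around the training set."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))
    return 1 if s >= 0 else -1
```

With a linear kernel on linearly separable data, the multipliers converge toward a large-margin solution; only the kernel matrix is touched, which is what makes the rule attractive for kernelized large-scale training.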
Figure 4. The basic idea behind the algorithm described in this paper.
Artificial Bee Colony Algorithm (pseudocode listing: initialize; produce new solutions; calculate the probability values; produce new solutions from them; replace solutions).
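The listing's expressions were likewise lost. Below is a minimal, generic ABC sketch (not necessarily the paper's exact variant): employed and onlooker bees perturb food sources, and sources that fail to improve for `limit` trials are abandoned by scouts. It assumes a nonnegative objective so that fitness can be taken as 1 / (1 + f).

```python
import random

def abc_minimize(f, dim, bounds, n_food=10, limit=20, cycles=100, seed=0):
    """Sketch of Artificial Bee Colony minimization over a box.
    Assumes f(x) >= 0 so that fitness = 1 / (1 + f(x)) is valid."""
    rng = random.Random(seed)
    lo, hi = bounds

    def new_source():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def try_improve(i):
        # v_j = x_ij + phi * (x_ij - x_kj) along one random dimension j
        k = rng.choice([k for k in range(n_food) if k != i])
        j = rng.randrange(dim)
        v = foods[i][:]
        v[j] = min(max(v[j] + rng.uniform(-1, 1) * (foods[i][j] - foods[k][j]), lo), hi)
        fv = f(v)
        if fv < vals[i]:
            foods[i], vals[i], trials[i] = v, fv, 0
        else:
            trials[i] += 1

    foods = [new_source() for _ in range(n_food)]
    vals = [f(x) for x in foods]
    trials = [0] * n_food
    best_x, best_v = min(zip(foods, vals), key=lambda p: p[1])
    best_x = best_x[:]

    for _ in range(cycles):
        for i in range(n_food):                      # employed-bee phase
            try_improve(i)
        total = sum(1.0 / (1.0 + v) for v in vals)   # onlooker phase:
        for _ in range(n_food):                      # roulette selection
            r, acc, i = rng.uniform(0, total), 0.0, 0
            for i, v in enumerate(vals):
                acc += 1.0 / (1.0 + v)
                if acc >= r:
                    break
            try_improve(i)
        for i in range(n_food):                      # scout phase
            if trials[i] > limit:
                foods[i], trials[i] = new_source(), 0
                vals[i] = f(foods[i])
        i = min(range(n_food), key=lambda j: vals[j])
        if vals[i] < best_v:                         # track best-so-far
            best_x, best_v = foods[i][:], vals[i]
    return best_x, best_v
```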
Micro Artificial Bee Colony Algorithm (pseudocode listing: initialize; produce new solutions; calculate probability values; move the second-best solution; move the worst solution).
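Only the micro-ABC's distinguishing steps survive in this extract ("move second best solution", "move worst solution"). A sketch of one plausible reading follows: a very small population run like ABC's employed-bee phase, with the second-best source pulled toward the best and the worst source re-seeded at random each cycle. The halving move is an assumption, not the authors' formula.

```python
import random

def micro_abc(f, dim, bounds, n_food=4, cycles=60, seed=0):
    """Sketch of a micro-ABC (population of 3-5 food sources).
    The pull-toward-best move is an assumed interpretation of the
    'move second best solution' step in the original listing."""
    rng = random.Random(seed)
    lo, hi = bounds
    foods = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_food)]
    vals = [f(x) for x in foods]
    for _ in range(cycles):
        for i in range(n_food):              # ABC-style perturbation
            k = rng.choice([k for k in range(n_food) if k != i])
            j = rng.randrange(dim)
            v = foods[i][:]
            v[j] = min(max(v[j] + rng.uniform(-1, 1) * (foods[i][j] - foods[k][j]), lo), hi)
            if f(v) < vals[i]:
                foods[i], vals[i] = v, f(v)
        order = sorted(range(n_food), key=lambda i: vals[i])
        b, s, w = order[0], order[1], order[-1]
        foods[s] = [(p + q) / 2.0 for p, q in zip(foods[s], foods[b])]
        vals[s] = f(foods[s])                # move second best toward best
        foods[w] = [rng.uniform(lo, hi) for _ in range(dim)]
        vals[w] = f(foods[w])                # re-seed the worst source
    i = min(range(n_food), key=lambda j: vals[j])
    return foods[i], vals[i]
```

The tiny population keeps each fitness evaluation cycle cheap, which is the usual motivation for micro-populations in large-scale training.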
Differential Evolution Algorithm (pseudocode listing: initialize; for each individual, generate new candidate vectors).
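A minimal DE/rand/1/bin sketch of the differential evolution scheme the listing outlines: each target vector is combined with a mutant built from three distinct others, and the better of target and trial survives. The population size and the `F`, `CR` values are illustrative defaults.

```python
import random

def differential_evolution(f, dim, bounds, np_=15, F=0.7, CR=0.9,
                           gens=100, seed=0):
    """DE/rand/1/bin sketch: mutate with a scaled difference of two
    random vectors, cross binomially with the target, select greedily."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(np_)]
    vals = [f(x) for x in pop]
    for _ in range(gens):
        for i in range(np_):
            a, b, c = rng.sample([j for j in range(np_) if j != i], 3)
            jrand = rng.randrange(dim)       # guarantee one mutant gene
            trial = pop[i][:]
            for j in range(dim):
                if j == jrand or rng.random() < CR:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    trial[j] = min(max(v, lo), hi)
            fv = f(trial)
            if fv <= vals[i]:                # greedy selection
                pop[i], vals[i] = trial, fv
    i = min(range(np_), key=lambda j: vals[j])
    return pop[i], vals[i]
```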
Figure 5. A thread is a component of a process. Multiple threads can exist within the same process; they execute concurrently and share resources such as memory.
Fitness Function (pseudocode listing: initialize; generate a vector).
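The fitness expression itself did not survive extraction. Since the evolutionary algorithms here evolve candidate solutions for a kernel classifier, one plausible fitness is the training accuracy induced by a candidate multiplier vector; the exact form used by the authors is not recoverable from this extract, so the sketch below is an assumption.

```python
import numpy as np

def fitness(alpha, K, y):
    """Hypothetical fitness for a candidate multiplier vector `alpha`:
    the fraction of training points that the induced kernel classifier
    labels correctly.  K is the precomputed kernel matrix and y holds
    labels in {-1, +1}.  This accuracy-based form is an assumed stand-in
    for the paper's (unpreserved) fitness expression."""
    scores = K @ (alpha * y)                 # decision values for all points
    preds = np.where(scores >= 0, 1, -1)
    return float(np.mean(preds == y))
```

Precomputing K once and reusing it for every candidate keeps each fitness evaluation at a single matrix-vector product, which matters when the population is evaluated every generation.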
Brief description of large-scale datasets. Density denotes the average percentage of non-zero features of the data vectors.
| DATASET | DIMENSION | DENSITY |
|---|---|---|
| Astro-Ph | 99757 | 0.08% |
| Aut-Avn | 20707 | 0.23% |
| C11 | 47236 | 0.16% |
| CCAT | 47236 | 0.16% |
| RCV1 | 47236 | 0.18% |
| Real-Sim | 20958 | 0.23% |
| Worm | 804 | 25.00% |
Brief description of the ICOS PSP dataset.
| UNIFORM | Ω | DIMENSION | DENSITY |
|---|---|---|---|
| Length | 7 | 300 | 86.04% |
| Length | 8 | 340 | 86.98% |
| Length | 9 | 380 | 88.79% |
| Frequency | 7 | 300 | 87.24% |
| Frequency | 8 | 340 | 87.07% |
| Frequency | 9 | 380 | 89.17% |
Computational complexity of the algorithms.
| ALGORITHM | COMPLEXITY |
|---|---|
| KA | |
| SVM | |
| OCA | |
| SVM | |
| EA approaches | |
Results from the Aut-Avn dataset.
| ALGORITHM | TRAINING | GENERALIZATION | TRAINING TIME |
|---|---|---|---|
| | 97.23% | 94.98% | 0.0216 |
| ABC | 97.20% | 94.58% | 0.0725 |
| DE | 97.28% | 96.13% | |
| PSO | | | 0.0198 |
| KA | 97.34% | 94.95% | 12.5613 |
| SVM | 99.70% | 95.65% | 0.1380 |
| SVM | 98.52% | 96.03% | |
| OCA | 90.10% | | 0.0384 |
Cross-validation accuracy results for PSP uniform length subsets.
| ALGORITHM | Ω = 7 | Ω = 8 | Ω = 9 |
|---|---|---|---|
| | 73.55% | | |
| ABC | 72.35% | 69.85% | 72.58% |
| DE | 72.78% | 70.48% | 74.03% |
| PSO | 73.25% | 69.55% | |
| SVM | 72.00% | 72.60% | |
| OCA | 64.28% | 68.34% | 67.05% |
| SVM | 64.38% | 68.41% | 67.13% |
| KA | 73.15% | 69.40% | 72.88% |
Figure 6. ROC curves obtained from the large-scale datasets.
ROC curve areas obtained from the large-scale datasets.
| DATASET | AREA |
|---|---|
| Astro-Ph | 0.9762 |
| Aut-AVN | 0.9803 |
| C11 | 0.9160 |
| CCAT | 0.9292 |
| RCV1 | 0.9590 |
| Real-Sim | 0.9865 |
| Worm | 0.9891 |
ROC curve areas obtained from the ICOS PSP dataset.
| UNIFORM | Ω | AREA |
|---|---|---|
| Length | 7 | 0.8134 |
| Length | 8 | 0.8345 |
| Length | 9 | 0.8242 |
| Frequency | 7 | 0.8248 |
| Frequency | 8 | 0.8229 |
| Frequency | 9 | 0.8087 |
Figure 7. ROC curves obtained from the ICOS PSP dataset.
Results obtained from the Friedman test: sum of squares (SS), degrees of freedom (df), mean squares (MS), χ² value, and P-value.
| (A) Friedman test applied to the Astro-Ph, Aut-Avn, C11, CCAT, RCV1, Real-Sim and Worm datasets. | | | | | |
|---|---|---|---|---|---|
| SOURCE | SS | DF | MS | χ² | P-VALUE |
| Columns | 9.2857 | 3 | 3.0952 | 5.9091 | 0.1161 |
| Error | 23.7143 | 18 | 1.3175 | | |
| Total | 33 | 27 | | | |
Mean rank obtained from the Friedman test for each solver.
| (A) Mean rank from the Astro-Ph, Aut-Avn, C11, CCAT, RCV1, Real-Sim and Worm datasets. | ||||
|---|---|---|---|---|
| | SVM | SVM | | |
| Mean | 2.2857 | 1.8571 | 3.4286 | 2.4286 |
Results from the Real-Sim dataset.
| ALGORITHM | TRAINING | GENERALIZATION | TRAINING TIME |
|---|---|---|---|
| | 97.86% | 96.28% | 0.0336 |
| ABC | | | 0.0612 |
| DE | 98.21% | 96.51% | |
| PSO | 98.30% | 96.46% | 0.0311 |
| KA | 97.99% | 96.20% | 12.6350 |
| SVM | 99.67% | | 0.1510 |
| SVM | 98.81% | 97.28% | |
| OCA | 92.65% | | 0.0378 |
ROC curve significance levels obtained from the large-scale datasets.
| | AUT-AVN | C11 | CCAT | RCV1 | REAL-SIM | WORM |
|---|---|---|---|---|---|---|
| Astro-Ph | 0.9992 | 0.918 | 0.9275 | 0.9483 | 0.9817 | 0.9721 |
| Aut-AVN | | 0.9206 | 0.9303 | 0.9537 | 0.9851 | 0.9775 |
| C11 | | | 0.9867 | 0.9471 | 0.9108 | 0.906 |
| CCAT | | | | 0.9595 | 0.9194 | 0.914 |
| RCV1 | | | | | 0.9351 | 0.9246 |
| Real-Sim | | | | | | 0.9925 |
ROC curve significance levels obtained from the ICOS PSP dataset (UF: uniform frequency; UL: uniform length).
| | UF8 | UF9 | UL7 | UL8 | UL9 |
|---|---|---|---|---|---|
| UF7 | 0.9988 | 0.99 | 0.9934 | 0.9945 | 0.9996 |
| UF8 | | 0.9922 | 0.9951 | 0.9941 | 0.9993 |
| UF9 | | | 0.9975 | 0.9866 | 1.0099 |
| UL7 | | | | 0.9896 | 0.9943 |
| UL8 | | | | | 0.9947 |
Particle Swarm Optimization (pseudocode listing: initialize; select from the swarm; obtain velocity; update position).
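A minimal PSO sketch matching the listing's surviving steps (obtain velocity, update position): each particle is pulled toward its personal best and the swarm's global best. The inertia and acceleration coefficients `w`, `c1`, `c2` are illustrative defaults, not the paper's settings.

```python
import random

def pso_minimize(f, dim, bounds, n_particles=15, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """PSO sketch: velocity blends inertia, a pull toward the particle's
    personal best, and a pull toward the swarm's global best."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pval = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pval[i])
    gbest, gval = pbest[g][:], pval[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                pos[i][j] = min(max(pos[i][j] + vel[i][j], lo), hi)
            v = f(pos[i])
            if v < pval[i]:                  # update personal best
                pbest[i], pval[i] = pos[i][:], v
                if v < gval:                 # update global best
                    gbest, gval = pos[i][:], v
    return gbest, gval
```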
Results from the Astro-Ph dataset. The best global results are underlined, and the best results obtained by our approach are shown in bold.
| ALGORITHM | TRAINING | GENERALIZATION | TRAINING TIME |
|---|---|---|---|
| | 94.50% | 92.65% | 0.0243 |
| ABC | 93.63% | | 0.0650 |
| DE | 94.56% | | |
| PSO | 94.53% | 93.77% | 0.0212 |
| KA | 94.61% | 92.68% | 12.0500 |
| SVM | 99.27% | | 0.2430 |
| SVM | 95.82% | 93.85% | 0.0195 |
| OCA | 93.25% | | 0.0282 |
Results from the C11 dataset.
| ALGORITHM | TRAINING | GENERALIZATION | TRAINING TIME |
|---|---|---|---|
| | 85.42% | 81.33% | 0.0221 |
| ABC | 86.52% | 86.85% | 0.0129 |
| DE | | | |
| PSO | 86.55% | 86.44% | 0.0198 |
| KA | 87.92% | 83.80% | 11.5700 |
| SVM | 98.12% | | 0.0111 |
| SVM | 98.55% | | |
| OCA | 72.84% | | 0.0479 |
Results from the CCAT dataset.
| ALGORITHM | TRAINING | GENERALIZATION | TRAINING TIME |
|---|---|---|---|
| | 90.49% | 86.18% | 0.0287 |
| ABC | 91.11% | 86.78% | 0.0436 |
| DE | | | |
| PSO | 91.74% | 86.55% | 0.0387 |
| KA | 90.75% | 86.58% | 12.5626 |
| SVM | 98.71% | | 0.3220 |
| SVM | 88.13% | 84.08% | 0.0199 |
| OCA | 83.58% | | 0.0637 |
Results from the RCV1 dataset.
| ALGORITHM | TRAINING | GENERALIZATION | TRAINING TIME |
|---|---|---|---|
| | 92.78% | 91.03% | 0.0256 |
| ABC | 92.75% | 93.10% | 0.0488 |
| DE | 92.72% | 93.00% | |
| PSO | | | 0.0402 |
| KA | 92.96% | 91.28% | 12.5998 |
| SVM | 99.01% | | 0.2830 |
| SVM | 96.51% | 94.03% | 0.0118 |
| OCA | 88.15% | | 0.0704 |
Results from the Worm dataset.
| ALGORITHM | TRAINING | GENERALIZATION | TRAINING TIME |
|---|---|---|---|
| | 81.60% | 80.30% | 0.0268 |
| ABC | 79.10% | 77.77% | |
| DE | | | 0.0201 |
| PSO | 81.01% | 80.41% | 0.0275 |
| KA | 80.86% | 79.43% | 12.6125 |
| SVM | 97.79% | | 0.3150 |
| SVM | 99.86% | 93.80% | 0.0200 |
| OCA | 89.00% | | 0.0840 |
Training accuracy results for PSP uniform frequency subsets.
| ALGORITHM | Ω = 7 | Ω = 8 | Ω = 9 |
|---|---|---|---|
| | 74.23% | 74.88% | 73.83% |
| ABC | 74.24% | 75.36% | 74.40% |
| DE | 74.77% | 74.29% | |
| PSO | 75.35% | ||
| SVM | 86.98% | 87.95% | 88.40% |
| OCA | |||
| SVM | |||
| KA | 75.16% | 76.08% | 75.18% |
Training time results for PSP uniform frequency subsets.
| ALGORITHM | Ω = 7 | Ω = 8 | Ω = 9 |
|---|---|---|---|
| | 0.0297s | 0.0304s | 0.0224s |
| ABC | 0.0084s | 0.0060s | 0.0055s |
| DE | |||
| PSO | 0.0065s | 0.0056s | 0.0059s |
| SVM | 0.0210s | 0.0220s | 0.0200s |
| OCA | 0.1978s | 0.2039s | 0.3558s |
| SVM | 0.2910s | 0.1820s | 0.2900s |
| KA | 1.2851s | 1.2811s | 1.2811s |
Cross-validation accuracy results for PSP uniform frequency subsets.
| ALGORITHM | Ω = 7 | Ω = 8 | Ω = 9 |
|---|---|---|---|
| | 73.55% | 73.08% | 72.40% |
| ABC | 73.75% | 72.93% | |
| DE | 74.63% | 73.08% | |
| PSO | 74.05% | ||
| SVM | 73.33% | 70.87% | 70.68% |
| OCA | 66.82% | 65.28% | 64.12% |
| SVM | 66.90% | 65.37% | 64.22% |
| KA | 74.20% | | |
Training accuracy results for PSP uniform length subsets.
| ALGORITHM | Ω = 7 | Ω = 8 | Ω = 9 |
|---|---|---|---|
| | 74.60% | | |
| ABC | 73.70% | 70.95% | 74.07% |
| DE | 74.72% | 72.23% | 75.32% |
| PSO | 74.31% | 71.19% | |
| SVM | 88.05% | 91.45% | 88.80% |
| OCA | |||
| SVM | 99.95% | | |
| KA | 74.53% | 71.04% | 74.43% |
Training time results for PSP uniform length subsets.
| ALGORITHM | Ω = 7 | Ω = 8 | Ω = 9 |
|---|---|---|---|
| | 0.0338s | 0.0203s | 0.0101s |
| ABC | 0.0064s | 0.0058s | |
| DE | |||
| PSO | 0.0075s | 0.0066s | 0.0050s |
| SVM | 0.0200s | 0.0180s | 0.0250s |
| OCA | 0.1830s | 0.1600s | 0.1931s |
| SVM | 0.1950s | 0.1080s | 0.1730s |
| KA | 1.2963s | 1.2827s | 1.2690s |