
Support Vector Machines Trained with Evolutionary Algorithms Employing Kernel Adatron for Large Scale Classification of Protein Structures.

Nancy Arana-Daniel¹, Alberto A Gallegos¹, Carlos López-Franco¹, Alma Y Alanís¹, Jacob Morales¹, Adriana López-Franco¹.

Abstract

With the increasing power of computers, the amount of data that can be processed in small periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition and text classification, etc. Most state of the art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes an approach that is simple to implement based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture and biofuels.


Keywords: evolutionary algorithms; kernel-adatron; large scale learning; machine learning; protein structure prediction; support vector machines

Year: 2016 PMID: 27980384 PMCID: PMC5140013 DOI: 10.4137/EBO.S40912

Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625

Introduction

With the drastic increase in the amount of data to be processed in very short periods of time, new problems have appeared. Chromosome classification, spam filtering, choosing which advertisement to show to a person on a web page, recognition of human activities and protein structure prediction are a few applications that involve immense amounts of high-dimensional data.1,2 Sometimes the dimension and/or the number of data samples is so large that the dataset cannot be stored in a computer. This problem is addressed by large-scale classification learning, which aims to find a function that relates the data to their corresponding class labels for an amount of data that cannot be stored in a modern computer’s memory.3 The main constraint is the amount of time that an algorithm takes to obtain an accurate result, rather than the number of samples to process.4 A typical problem that support vector machines (SVMs) face with a large dataset is that learning algorithms are typically quadratic and require several scans of the dataset. Three common strategies can be used to reduce this practical complexity:3,4 (1) solving several smaller problems by working on subsets of the training data instead of the complete large dataset; (2) parallelizing the learning algorithm; and (3) designing a less complex algorithm that gives an approximate solution with equivalent or superior performance. This work presents a novel approach to solving large-scale learning problems by designing a less complex algorithm to train a large-scale SVM. Our approach combines Kernel-Adatron (KA) with several state-of-the-art evolutionary algorithms (EAs) to solve, principally, protein structure prediction (PSP) and other large-scale learning problems.5 The resulting algorithm works with small sub-problems, has low computational complexity and is easy to implement; in addition to providing accurate generalization results, the methodology is highly parallelizable.

Support vector machines

Since the SVM algorithm was first introduced by Vladimir Vapnik in 1995, it has been one of the most popular methods for classification because of its simple model, its use of kernel functions and the convexity of the function to optimize (it has only a global minimum).6 These characteristics make SVMs more appealing for classification problems with high precision requirements than other models such as the multilayer perceptron, radial basis function network, Hopfield network, etc.7–9 Many large-scale training algorithms have been proposed for SVMs, with the main idea of minimizing a regularized risk function R and maximizing the margin of separation between classes (Fig. 1) by solving Equation 1

$$w^{*} = \arg\min_{w} \; \tfrac{1}{2}\|w\|^{2} + C\,R(w) \qquad (1)$$

where w is a normal vector to the separating hyperplane, $\tfrac{1}{2}\|w\|^{2}$ is a quadratic regularization term and C > 0 is the fixed constant that scales the risk function.10–13 Equation 1 is called the primal formulation.14 By using Lagrange multipliers, the primal formulation can be presented in its dual form:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j K(X_i, X_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i Y_i = 0 \qquad (2)$$

where C is a fixed constant, $\{(X_i, Y_i)\}_{i=1}^{n}$ is a training set, α_i are Lagrange multipliers, K(X_i, X_j) is the value of the kernel matrix defined by the inner product ⟨X_i, X_j⟩ (when a linear kernel K is used) and Y_i ∈ {±1} is a class label.4
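As a small numerical illustration of the dual form, the following sketch evaluates the standard soft-margin dual objective with a linear kernel and checks the constraints on a two-point toy problem. The function names and the toy data are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Dual SVM objective for a linear kernel:
    sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    K = X @ X.T                    # linear kernel matrix
    ay = alpha * y
    return float(alpha.sum() - 0.5 * ay @ K @ ay)

def is_feasible(alpha, y, C, tol=1e-9):
    """Box constraint 0 <= alpha_i <= C and equality constraint sum_i alpha_i y_i = 0."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    return bool(in_box and abs(alpha @ y) <= tol)

# Toy data (assumption): one point per class on the x-axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

w = X.T @ (alpha * y)              # normal vector recovered from the multipliers
print(dual_objective(alpha, X, y), is_feasible(alpha, y, C=1.0), w)
```

For this toy problem the multipliers give w = (1, 0), so both points lie exactly on the margin and the dual value equals the primal value ½‖w‖² = 0.5.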

Figure 1

A binary dataset is composed of positively and negatively labeled samples. Several hyperplanes may separate the dataset, but the hyperplane with the largest margin gives the best generalization results.

The dual formulation has the same optimal values as the primal, but the main advantage of this representation is the use of the “kernel trick” (see Fig. 2). Since SVMs can only classify data in a linearly separable feature space, the role of the kernel function is to induce such a feature space by implicitly mapping the training data into a higher dimensional space where the data are linearly separable.14,15 There are two main approaches for large-scale SVM training algorithms: those that solve the primal SVM formulation, shown in Equation 1, by a gradient-based method (primal estimated sub-gradient solver for SVM, careful quasi-Newton stochastic gradient descent, forward looking sub-gradient, etc.) and those that solve the dual formulation of Equation 2 by quadratic programming (QP) methods (SVM for multivariate performance measure, library for large linear classification and bundle method for risk minimization, etc.).4,11,16,17 There are options that do not fall into these categories, such as the optimized cutting plane algorithm (OCA), which uses an improved cutting plane technique and is based on the work of SVM for multivariate performance measure (SVMperf) and bundle method for risk minimization. OCA converges quickly compared to methods like stochastic gradient descent and the primal estimated sub-gradient solver for SVM (Pegasos); it has shown good classification results and offers sublinear computational scaling.13 Nevertheless, the use of a QP solver to solve a linearly constrained problem (where each linear constraint is a cutting plane) makes it a complex approach to implement, even if the number of constraints is drastically lower than the data dimensionality. Gradient-based methods tend to be fast (especially those that use stochastic gradient descent) and have good generalization capabilities. However, they depend heavily on the step size to obtain a good speed of convergence. If the step size is not chosen carefully or lacks an adjustment criterion, convergence can be slow.4 The dual QP methods can handle kernels easily and can converge quickly when combined with other optimization techniques. The main disadvantage of these methods is the computational complexity of the quadratic programming solvers and the fact that they are more difficult to implement than a gradient descent method or an EA.4,18–21

Figure 2

Datasets that are not linearly separable may be separated by a hyperplane in higher dimensions after applying the kernel trick.
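The idea in Figure 2 can be illustrated with a one-dimensional toy dataset that no single threshold separates, but that becomes linearly separable after mapping each point to (x, x²). The feature map `phi`, the matching `kernel` function and the data are assumptions chosen for illustration only.

```python
import numpy as np

def phi(x):
    """Explicit feature map for 1-D inputs: x -> (x, x^2)."""
    return np.array([x, x * x])

def kernel(a, b):
    """Kernel equal to <phi(a), phi(b)>, computed without building phi."""
    return a * b + (a * b) ** 2

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, -1, -1, 1])          # outer points positive, inner points negative

# In the mapped space the hyperplane x^2 = 2.5 separates the two classes.
mapped = np.array([phi(v) for v in x])
pred = np.where(mapped[:, 1] > 2.5, 1, -1)
print(pred)
```

The kernel function returns the same inner products as the explicit map, which is what allows a dual solver to work in the higher dimensional space without ever constructing it.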

In recent years, several evolutionary computation-based training algorithms for SVMs have been proposed.22–25 These algorithms solve the dual formulation (Equation 2), tend to be easy to implement and have shown good results for small amounts of data. Their disadvantage is a computational complexity of O(n²) or higher, where n represents the number of training samples. Since the complete kernel matrix is needed on each iteration to calculate the fitness function, the time needed to process the data increases drastically as the number of training samples grows.

Evolutionary algorithms

EAs are global optimization methods that scale well to higher dimensional problems. They are robust with respect to noisy evaluation functions, and can be implemented and parallelized with relative ease.26 Even when premature convergence to a local extremum may occur, it has been proven that an algorithm that is “not quite good” or “poor” at optimization can be excellent at generalization for a large-scale learning task.4 This work presents a series of parallelized algorithms based on the KA algorithm as fitness function combined with Artificial Bee Colony (ABC), micro-Artificial Bee Colony (µABC), Differential Evolution (DE) and Particle Swarm Optimization (PSO), in order to solve the SVM learning problem. The EA algorithms combined with KA were chosen based on good results shown in other areas, their exploration and exploitation capabilities, and low computational complexity.27–33 Large-scale training algorithms for SVMs using EA is a promising field that has not been well explored. Although parallelization is a highly desirable approach to the large-scale classification problem, most large-scale SVM training algorithms do not take this into consideration to obtain better results in a shorter amount of time. This is in part because testing complex parallel applications to guarantee a correct behavior is challenging; in scenarios, such as where inherent data dependencies exist, a complex task cannot be partitioned because of sequential constraints, making parallelization less convenient.3,4 One of the main goals in parallelizing an EA is to reduce the search time. This is a very important aspect for some classes of problems with firm requirements on search time, such as in dynamic optimization problems and real-time planning.34

Protein structure prediction

A protein structure (PS) is the three-dimensional arrangement of atoms in a protein molecule.35 These structures arise because particular sequences of amino acids in polypeptide chains fold to generate, from linear chains, compact domains with specific 3D structures (Fig. 3). The folded domains can serve as modules for building up large assemblies such as virus particles or muscle fibers, or they can provide specific catalytic or binding sites, as found in enzymes or proteins that carry oxygen or regulate the function of DNA. PSP predicts the three-dimensional structure of a protein from its primary structure, its amino-acid sequence, in order to predict its folding and its secondary, tertiary and quaternary structure.36,37 This makes PSP an essential tool in proteomics, since the molecular function of a protein depends on its three-dimensional structure, which is often unknown.

Figure 3

Left: Amino-acid sequence of a protein. Right: A representation of a three-dimensional structure of a protein.

In the past 50 years there has been enormous growth in the available information regarding genomic sequences, to the point that the pace is difficult to follow. At present, more protein coding sequences are known than three-dimensional structures. Protein folding is a large-scale problem because 20 different amino acids can generate a very large number of combinations, and there are also many ways for different amino-acid sequences to generate similar structural domains in proteins.35 It has been suggested that many proteins contain enough information in their amino-acid sequences to determine their three-dimensional structure, which makes it possible to predict new three-dimensional structures from an amino-acid sequence, since it is known that sequence similarity does confer structural similarity.38,39 Furthermore, to understand the biological function of proteins it is necessary to deduce or predict the three-dimensional structure from the amino-acid sequence, since their functional properties depend upon their structures. If the predictions are accurate enough, the gap between the growing amount of sequence information and the corresponding structures can be diminished. PSP is, overall, an optimization problem where each amino acid can be characterized by several structural features. A good prediction of these features helps to obtain better models for the 3D PSP problem. These features can be predicted as classification/regression problems, where the goal is to determine the shape (known as fold) that a given amino-acid sequence will adopt. The problem can take two possible directions:40 the sequence may bear resemblance to an existing fold in some protein structure database, or it may adopt a new fold. If two sequences share evolutionary ancestry, they are called homologous and the structure for the query protein can be built by choosing the structure of the known homologous sequence as a template. 
If no template structure is found for the query protein, the structure must be built from scratch. Many methods have been developed to assign folds to a protein coding sequence.41 These methods can be divided into three groups: sequence-structure homology recognition methods, threading methods and machine-learning-based methods. Sequence-structure homology and threading methods align the target sequence onto known structural templates and calculate their sequence-structure compatibilities, using, for example, environment-specific substitution tables or pseudo-energy-based functions to determine whether a template can be a fold of a target sequence.42,43 Sequence-structure homology methods (like FUGUE and 3DPSSM) fail when two proteins are structurally similar but share little sequence homology.44,45 Threading methods (such as THREADER) depend on data derived from solved structures, but the number of proteins whose structure has been solved is much smaller than the number of proteins that have been sequenced.46 Machine-learning-based methods for protein fold recognition, like the approach presented in this paper, treat the problem as a fold classification problem, where a classifier is built using a dataset of feature sequences from proteins with a known structure. The classifier can then assign a structure-based label to an unknown protein (one that has not yet been solved). In recent years, a number of different SVM-based methods have been developed, producing better results than those obtained by pairwise sequence comparisons.40,43,47–49 These algorithms have made improvements in the detection of homologous structures with low levels of sequence similarity (remote homology detection). 
Most of the state of the art for PSP classification is not focused on large-scale data, and even if some approaches have shown good results in small-scale PSP classification, most use versions of SVM reliant on kernel functions or neural networks; these do not scale well as the dimension and/or the number of samples to classify grows.48,50–54 Because of this, some approaches select an optimized feature subset with a moderate number of samples to improve the generalization performance of the SVM instead of using the complete dataset. This reduces the amount of data to compute, making it more practical to process with the original SVM approach, but it is also more time consuming since the dataset needs to be selectively preprocessed.55 These methods might be good for small or medium amounts of data, but protein folding, because of its combinatorial nature, can generate an immense amount of data to process. This is where an algorithm especially designed for large-scale data is needed. Sequencing projects are fast at producing protein coding sequences, but only a small portion of protein coding sequences have experimentally solved 3D structures. This is due to expensive and time-consuming laboratory methods, such as X-ray crystallography and nuclear magnetic resonance (NMR).41 This problem is becoming more pressing as the number of known protein coding sequences expands as a result of genome and other sequencing projects.54 Because of this, tools that can predict PS rapidly and accurately, like the one presented in this paper, are needed. The full potential of genome projects will be realized only once we discover and understand the functions of these new proteins. This understanding will be facilitated by structural information for all or almost all proteins.

Methods

The kernel adatron algorithm

The Adaptive Perceptron algorithm (or Adatron) was first introduced by J. K. Anlauf and M. Biehl in 1989 for training linear classifiers.56 This algorithm was proposed as a method for calculating the largest margin classifier. The Adatron is used for online learning of perceptrons and guarantees convergence to an optimal solution, when one exists.57 In 1998, T. Friess et al. proposed the KA algorithm. Basically, the KA algorithm is an adaptation of the Adatron algorithm for classification with kernels in high-dimensional spaces.5 It combines the simplicity of implementation of the Adatron with an SVM’s capability of working in nonlinear feature spaces to construct a large margin hyperplane using online learning.15 An advantage of the KA algorithm is the use of gradient ascent instead of quadratic programming, which is easier to implement and significantly faster to calculate. To implement the KA algorithm, it is necessary to calculate the dot product w · X, where X is the set of training points and w denotes the normal vector to the hyperplane that divides the classes with a maximum margin (Fig. 1). Since the kernel K is related to the high-dimensional mapping φ(X) by

$$K(X_i, X_j) = \varphi(X_i) \cdot \varphi(X_j) \qquad (3)$$

and the normal vector w to the separating hyperplane can be expressed as

$$w = \sum_{j=1}^{n} \alpha_j Y_j \varphi(X_j) \qquad (4)$$

then, by using the linear kernel K, the dot product can be expressed as

$$z_i = w \cdot \varphi(X_i) = \sum_{j=1}^{n} \alpha_j Y_j K(X_j, X_i) \qquad (5)$$

To update the multipliers, a change in α_i is proposed and evaluated. The change is calculated as follows

$$\delta\alpha_i = \eta\,(1 - \gamma_i) \qquad (6)$$

$$\gamma_i = Y_i z_i \qquad (7)$$

where η is the step size and δα_i is the proposed change to α_i. If α_i + δα_i ≤ 0, the update would produce a non-positive multiplier, so α_i is set to 0. Otherwise, α_i ← α_i + δα_i. The bias b (Fig. 1) is obtained with Equation 8 from z_i^+, the values of z for the patterns with class label +1, and z_i^−, those for the patterns with class label −1. The pseudocode is described briefly in Algorithm 1.

Algorithm 1

Kernel Adatron Algorithm.

1:	Initialize α_i = 1.
2:	repeat
3:	For (X_i, Y_i) calculate z_i with Equation 5.
4:	Calculate γ_i with Equation 7.
5:	Calculate δα_i with Equation 6.
6:	if (α_i + δα_i) ≤ 0 then
7:	α_i = 0
8:	end if
9:	if (α_i + δα_i) > 0 then
10:	α_i = α_i + δα_i
11:	end if
12:	Calculate b with Equation 8
13:	until the stopping criterion is met.
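The loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes the standard Kernel-Adatron quantities z_i = Σ_j α_j Y_j K(X_j, X_i), γ_i = Y_i z_i and δα_i = η(1 − γ_i), a linear kernel, a fixed number of epochs as the stopping rule, and toy data. The bias step is left out because its sign convention depends on Equation 8; the margin estimate Θ = ½(min z⁺ − max z⁻) is returned instead.

```python
import numpy as np

def kernel_adatron(X, y, eta=0.1, epochs=200):
    """Kernel Adatron with a linear kernel; eta and epochs are illustrative choices."""
    K = X @ X.T                               # linear kernel matrix
    alpha = np.ones(len(y))                   # step 1: alpha_i = 1
    for _ in range(epochs):                   # fixed epoch count as stopping rule (assumption)
        for i in range(len(y)):
            z_i = np.sum(alpha * y * K[i])    # assumed form of Equation 5
            gamma_i = y[i] * z_i              # assumed form of Equation 7
            delta = eta * (1.0 - gamma_i)     # assumed form of Equation 6
            alpha[i] = max(0.0, alpha[i] + delta)   # steps 6-11: clip at zero
    z = K @ (alpha * y)
    theta = 0.5 * (z[y == 1].min() - z[y == -1].max())  # margin estimate
    return alpha, z, theta

# Toy data (assumption): two symmetric, linearly separable classes.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, z, theta = kernel_adatron(X, y)
print(alpha, z, theta)
```

The step size must stay small relative to the diagonal of the kernel matrix (here K_ii = 5) for the updates to settle; with η = 0.1 the margin estimate approaches 1 on this toy problem.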

Evolutionary computing is a subfield of artificial intelligence that includes a range of problem-solving techniques based on principles of biological evolution. The principles for using evolutionary processes to solve optimization problems originated in the 1950s.58 EAs are optimization methods that are part of evolutionary computing and apply models based on biological evolution. In an EA, a population of possible solutions is composed of individuals that can be compared according to their aptitude to improve the population; the most qualified candidates are those that obtain better results when evaluated by a fitness function. The population evolves through iterations in which a series of operations (reproduction, mutation, recombination or selection) are applied to its individuals; from these operations a new set of potentially better solutions is generated. The way the population evolves the possible solutions, and the way it chooses the new global best solutions, is inherent to each EA.59 A swarm intelligence algorithm is based on swarms that occur in nature; PSO and ABC are two prominent swarm algorithms. There is a debate on whether swarm intelligence-based algorithms are EAs or not, but since one of the inventors of PSO refers to it as an EA, and swarm intelligence algorithms are executed in the same general way as EAs, by evolving a population of candidate solutions that improves with each iteration, we consider swarm intelligence algorithms to be EAs.59,60 As mentioned before, the KA algorithm requires the α values to be adjusted through iterations. In this approach, the adjustment is made using an EA (Fig. 4). This type of algorithm was chosen as the optimization method because EAs are easy to implement and parallelize and have shown good results in diverse areas such as computer vision, image processing and path planning.27,28,30,61–63

Figure 4

The diagram explains the basic idea behind the algorithm described in this paper.

Artificial bee colony algorithm

The ABC algorithm was first introduced by Karaboga in 2005.64 This algorithm is based on honey bee foraging behavior. The bees are divided into three classes: employed bees, each of which has a food source; onlookers, which watch the dances of employed bees and choose food sources depending on those dances; and scouts, which are employed bees that abandon their food source to find a new one. Each food source is equivalent to a possible solution to the optimization problem and, as in nature, individuals are more likely to be attracted to sources with a larger amount of food (a better result obtained by the fitness function). Only one employed bee is assigned to each food source, and when it abandons its food source it becomes a scout. The number of onlooker bees is also equal to the number of solutions in the population. Initially, the ABC algorithm generates a random population P of n solutions. Each solution x_i ∈ P, also known as a food source, is a D-dimensional vector to be evaluated by a fitness function f(). The algorithm searches iteratively for better food sources based on the findings made by employed, onlooker and scout bees. First, the i-th employed bee generates a random modification in the j-th position of its corresponding food source x_i, producing a new potential food source v_i. The potential food source can be obtained by Equation 9

$$v_{ij} = x_{ij} + \phi_{ij}\,(x_{ij} - x_{kj}) \qquad (9)$$

where k ∈ {1, 2, …, n} is a randomly chosen index different from i and ϕ_ij is a uniformly distributed random number in [−1, 1]. If the amount of nectar of v_i (the value obtained by the fitness function) improves on that of the old source, the employed bee takes v_i as its new food source x_i. Otherwise, the food source x_i remains unchanged. Once the positions of the employed bees have been updated, the information is shared with the onlooker bees. Onlooker bees choose their food sources based on a probability p_i that is directly related to the amount of nectar. The value of p_i is obtained as follows

$$p_i = \frac{f_i}{\sum_{m=1}^{n} f_m} \qquad (10)$$

where f_i is the fitness value of the i-th food source. Food sources are chosen by a roulette wheel selection mechanism (the better the i-th solution, the higher its chances of being selected). A new potential food source v_i is calculated using Equation 9, where x_i is selected based on the roulette wheel selection result. As with employed bees, if the amount of nectar improves, v_i replaces x_i; otherwise, x_i remains unchanged. If a position x_i cannot be improved through a certain number of iterations, the i-th food source is abandoned. If this occurs, the scout bee replaces x_i with a new food source as follows

$$x_{ij} = lb_j + \mathrm{rand}(0, 1)\,(ub_j - lb_j) \qquad (11)$$

where rand(0, 1) is a uniformly distributed random number within [0, 1], and lb_j and ub_j are the lower and upper bounds of the j-th dimension, respectively. The pseudocode is briefly described in Algorithm 2.

Algorithm 2

Artificial Bee Colony Algorithm.

1:	Initialize x_i
2:	repeat
3:	Produce a new solution v_i for the employed phase with Equation 9.
4:	if f(v_i) < f(x_i) then
5:	x_i ← v_i.
6:	end if
7:	Calculate the probability values p_i with Equation 10 for the solution x_i.
8:	Produce a new solution v_i for the onlooker phase with Equation 9, selecting x_i based on p_i.
9:	if f(v_i) < f(x_i) then
10:	x_i ← v_i.
11:	end if
12:	if x_i is an abandoned solution for the scout phase then
13:	Replace x_i by using the Equation 11.
14:	end if
15:	until The stopping condition is met.
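The three phases of Algorithm 2 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: the function `abc_minimize`, its parameters (`limit`, `iters`, `seed`), the nectar mapping 1/(1 + f) used for the roulette wheel (which assumes f ≥ 0) and the sphere test function are all hypothetical choices.

```python
import numpy as np

def abc_minimize(f, lb, ub, n=10, limit=20, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    D = lb.size
    x = lb + rng.random((n, D)) * (ub - lb)        # random initial food sources
    cost = np.array([f(s) for s in x])
    trials = np.zeros(n, dtype=int)                # failed attempts per source

    def try_neighbor(i):
        k = rng.choice([m for m in range(n) if m != i])   # partner index k != i
        j = rng.integers(D)                               # single dimension j
        v = x[i].copy()
        v[j] = x[i, j] + rng.uniform(-1.0, 1.0) * (x[i, j] - x[k, j])
        v[j] = min(max(v[j], lb[j]), ub[j])               # keep inside the bounds
        c = f(v)
        if c < cost[i]:                                   # greedy selection
            x[i], cost[i], trials[i] = v, c, 0
        else:
            trials[i] += 1

    best_x, best_c = x[cost.argmin()].copy(), cost.min()
    for _ in range(iters):
        for i in range(n):                                # employed phase
            try_neighbor(i)
        nectar = 1.0 / (1.0 + cost)                       # assumes f >= 0
        for i in rng.choice(n, size=n, p=nectar / nectar.sum()):  # onlooker phase
            try_neighbor(i)
        if cost.min() < best_c:                           # memorize best so far
            best_x, best_c = x[cost.argmin()].copy(), cost.min()
        worn = trials.argmax()                            # scout phase
        if trials[worn] > limit:
            x[worn] = lb + rng.random(D) * (ub - lb)
            cost[worn], trials[worn] = f(x[worn]), 0
    return best_x, best_c

best_x, best_c = abc_minimize(lambda s: float(np.sum(s ** 2)), [-5, -5], [5, 5])
print(best_x, best_c)
```

The best source is memorized before the scout phase so that abandoning a stagnant source never discards the best solution found so far.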

µArtificial bee colony algorithm

The µABC algorithm was first introduced by Rajasekhar in 2012.29 This algorithm is a variant of the ABC algorithm with a small population (only 3 bees). The population of bees evolves through iterations and only the best bee is kept unaltered, whereas the rest of the bees are reinitialized with modifications based on the food source with the best fitness. After the employed and onlooker phases have been completed (in the same way as in the ABC algorithm), the population is ranked according to its fitness values. The bee with the best fitness remains at its food source, while the bee with the second best fitness is moved to a position near the best one in order to facilitate a local search. The bee with the worst position is reinitialized to a random position to avoid premature convergence. Unlike ABC, more than one variable of the food source is modified. For each parameter x_j, a uniformly distributed random number rand(0, 1) is generated and, if this number is less than the user-defined Frequency Control Rate (FCR) parameter, the variable x_j is modified according to Equation 12. The value of ϕ is a uniformly distributed random number maintained in the range [−RF, RF], where RF is the range factor. RF changes automatically during the search by tuning its value in accordance with Rechenberg’s 1/5 rule. This rule states that 1/5 of the total mutations in every t iterations should be successful mutations. According to the proportion of successes φ(t), the value of RF is adjusted according to Equation 13. The pseudocode is briefly described in Algorithm 3.

Algorithm 3

Micro Artificial Bee Colony Algorithm.

1:	Initialize x_i
2:	repeat
3:	Produce a new solution v_i for the employed phase with Equation 12.
4:	if f(v_i) < f(x_i) then
5:	x_i ← v_i.
6:	end if
7:	Calculate probability values p_i with Equation 10 for solution x_i.
8:	Produce a new solution v_i for the onlooker phase with Equation 12, selecting x_i based on p_i.
9:	if f(v_i) < f(x_i) then
10:	x_i ← v_i.
11:	end if
12:	Move second best solution x₂_b to a position very close to best solution x₁_b.
13:	Move worst solution x₃_b to a random position.
14:	until The stopping condition is met.
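The range-factor adaptation used by µABC follows Rechenberg’s 1/5 rule: widen the search when more than one fifth of recent mutations succeed, and narrow it when fewer do. The sketch below shows one commonly used form of that rule. The function name and the scaling constant c = 0.85 are assumptions, not values taken from the paper.

```python
def adapt_range_factor(rf, successes, trials, c=0.85):
    """Rechenberg-style 1/5 rule; c = 0.85 is a conventional choice (assumption)."""
    if 5 * successes > trials:      # success rate above 1/5: enlarge the range
        return rf / c
    if 5 * successes < trials:      # success rate below 1/5: shrink the range
        return rf * c
    return rf                       # exactly 1/5: leave the range unchanged

print(adapt_range_factor(1.0, 5, 10), adapt_range_factor(1.0, 1, 10))
```

Comparing `5 * successes` with `trials` keeps the test in integer arithmetic, which avoids floating-point surprises at the 1/5 boundary.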


Differential evolution

DE was first introduced by R. Storn and K. V. Price in 1995.65 In DE each individual x_i of the population is a D-dimensional vector that represents a candidate solution from a set of n solutions. Each individual, called a vector, is evaluated by a fitness function f() to define its strength as a solution. The fundamental idea behind DE is creating new candidate solutions based on other solutions that have been previously found. DE takes the difference vector between two randomly chosen individuals, x_2 and x_3, and adds a scaled version of this vector to a third individual, which is either a randomly chosen individual x_1 or the best individual x_best in the population. For the algorithm described in this paper, we used x_1 = x_best. This new individual is called a mutant vector v

$$v = x_1 + F\,(x_2 - x_3) \qquad (14)$$

where F is a user-defined scaling factor. This mutant vector v is later combined with x by crossover to create a candidate solution to be evaluated by an objective function. The crossover is implemented as follows

$$u_j = \begin{cases} v_j & \text{if } r_j \le CROV \text{ or } j = J \\ x_j & \text{otherwise} \end{cases} \qquad (15)$$

where u is the crossed vector, r_j is a random number in [0, 1], CROV ∈ [0, 1] is the user-defined constant crossover rate and J is a random integer ∈ [0, D] redefined on each iteration. The pseudocode is briefly described in Algorithm 4.

Algorithm 4

Differential Evolution Algorithm.

1:	Initialize F ∈ [0.4, 0.9], CROV and x_i
2:	repeat
3:	For each x_i choose three mutually distinct random integers r1, r2, r3 ∈ [1, n].
4:	Generate n mutant vectors with Equation 14.
5:	Generate n crossed vectors with Equation 15.
6:	if f(u_i) < f(x_i) then
7:	x_i ← u_i.
8:	end if
9:	until The stopping condition is met.
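Algorithm 4 can be sketched as below, using the best individual as the base vector as stated in the text. This is a minimal sketch under assumptions: the function `de_minimize`, the default values F = 0.6 and CROV = 0.9, the clipping to the bounds and the sphere test function are illustrative choices, not the settings of the paper.

```python
import numpy as np

def de_minimize(f, lb, ub, n=20, F=0.6, crov=0.9, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    D = lb.size
    x = lb + rng.random((n, D)) * (ub - lb)            # initial population
    cost = np.array([f(s) for s in x])
    for _ in range(iters):
        best = x[cost.argmin()].copy()                 # base vector x_1 = x_best
        for i in range(n):
            others = [m for m in range(n) if m != i]
            r2, r3 = rng.choice(others, size=2, replace=False)
            v = np.clip(best + F * (x[r2] - x[r3]), lb, ub)   # mutant vector
            J = rng.integers(D)                        # position that always crosses
            mask = rng.random(D) <= crov
            mask[J] = True
            u = np.where(mask, v, x[i])                # binomial crossover
            c = f(u)
            if c < cost[i]:                            # keep the better of u_i and x_i
                x[i], cost[i] = u, c
    return x[cost.argmin()].copy(), float(cost.min())

best_x, best_c = de_minimize(lambda s: float(np.sum(s ** 2)), [-5] * 3, [5] * 3)
print(best_x, best_c)
```

Forcing position J to come from the mutant guarantees that every crossed vector differs from its parent in at least one component.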


Particle swarm optimization

The PSO algorithm was first introduced by Kennedy and Eberhart in 1995.66 This algorithm exploits a population of potential solutions. The population of solutions is called a swarm and each individual from a swarm is called a particle. A swarm is defined as a set of n particles. Each particle i is represented as a D-dimensional position vector x_i, which is evaluated by a fitness function f(). Based on the results of the evaluation, it is easy to measure improvement in new particles compared to old ones. The particles are assumed to move within the search space iteratively. This is done by adjusting their position using a proper position shift, called velocity v_i. For each iteration t, the velocity changes by applying Equation 16 to each particle

$$v_i(t+1) = \omega\,v_i(t) + c_1 \varphi_1\,(P_i - x_i(t)) + c_2 \varphi_2\,(P_g - x_i(t)) \qquad (16)$$

where φ1 and φ2 are random variables uniformly distributed within [0, 1]; c1 and c2 are weighting factors, also called the cognitive and social parameters, respectively; and ω is called the inertia weight, which decreases linearly from ω_max to ω_min during iterations. P_i and P_g represent the best position visited by particle i and the best position visited by the swarm before the current iteration t, respectively. The position update is applied by Equation 17 based on the new velocity and the current position

$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (17)$$

To prevent the uncontrolled increase of the magnitude of the velocities (swarm explosion effect), it is often necessary to clamp the velocity at desirable levels, preventing particles from taking extremely large steps from their current positions.67 Although a maximum velocity threshold improves performance by controlling swarm explosions, without the inertia weight the swarm would not be able to concentrate its particles around the most promising solutions in the last phase of the optimization procedure.67
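The velocity and position updates, the linearly decreasing inertia weight and the velocity clamping described above can be sketched together. The function `pso_minimize` and all default values (c1 = c2 = 1.5, inertia from 0.9 to 0.4, a velocity limit of half the search range) are assumptions chosen for illustration, not the settings used in the paper.

```python
import numpy as np

def pso_minimize(f, lb, ub, n=20, iters=200, c1=1.5, c2=1.5,
                 w_max=0.9, w_min=0.4, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    D = lb.size
    x = lb + rng.random((n, D)) * (ub - lb)        # particle positions
    v = np.zeros((n, D))                           # particle velocities
    v_max = 0.5 * (ub - lb)                        # clamping level (assumption)
    p = x.copy()                                   # best position per particle
    p_cost = np.array([f(s) for s in x])
    g = p[p_cost.argmin()].copy()                  # best position of the swarm
    for t in range(iters):
        w = w_max - (w_max - w_min) * t / max(iters - 1, 1)   # linear decrease
        phi1, phi2 = rng.random((n, D)), rng.random((n, D))
        v = w * v + c1 * phi1 * (p - x) + c2 * phi2 * (g - x)  # velocity update
        v = np.clip(v, -v_max, v_max)              # avoid swarm explosion
        x = np.clip(x + v, lb, ub)                 # position update
        cost = np.array([f(s) for s in x])
        better = cost < p_cost
        p[better], p_cost[better] = x[better], cost[better]
        g = p[p_cost.argmin()].copy()
    return g, float(p_cost.min())

best_x, best_c = pso_minimize(lambda s: float(np.sum(s ** 2)), [-5, -5], [5, 5])
print(best_x, best_c)
```

A large inertia weight early on favors exploration, while the smaller values reached late in the run let the particles settle around the best positions found.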

Kernel adatron trained with evolutionary algorithms

The basic idea behind the proposed algorithms is to use a “divide and conquer” strategy, where each individual in the population of the EA (vector in DE, particle in PSO, food source in ABC and µABC) is seen as a sub-process, in this case a thread (Fig. 5), that will solve a part of the whole problem. Once each sub-process reaches a result, it is compared to the results of its peers to improve future results.

Figure 5

A thread is a component of a process. Multiple threads can exist within the same process; they are executed concurrently and share resources, such as memory.
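The per-individual evaluation can be sketched with a thread pool from the Python standard library. The helper `evaluate_population`, the worker count and the toy fitness function are hypothetical; the paper does not specify its threading implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_population(fitness, population, workers=4):
    """Evaluate every individual in its own task; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))

def sphere(individual):
    # Stand-in fitness function (assumption): sum of squares.
    return sum(t * t for t in individual)

population = [[1, 2], [0, 0], [3, 1]]
scores = evaluate_population(sphere, population)
best_index = min(range(len(scores)), key=scores.__getitem__)   # selection step
print(scores, best_index)
```

Only the selection step needs the results of all individuals at once, which matches the communication pattern described in the text. In CPython, threads speed up this pattern only when the fitness function releases the global interpreter lock, as many NumPy array operations do.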

DE, PSO, ABC and µABC are easily parallelized because each individual can be evaluated independently. The only phases in which the algorithms require communication between their individuals are those that involve mutation and the selection of the fittest individual. The process to obtain the kernel matrix can also be easily parallelized by dividing it into several subtasks. For this approach, a linear kernel is used (represented by the dot product ⟨X_i, X_j⟩), since it was the kernel that gave the best results. In each variant of the proposed algorithm, an individual x (particle, vector or bee) represents a D-dimensional vector composed of the multipliers α to be optimized over iterations by the EA. The fitness function f() to be used by the EA is described by Equation 18:

$$f = |1 - \Theta| \qquad (18)$$

where Θ is the margin between classes of the hyperplane, which can be estimated as follows:

$$\Theta = \tfrac{1}{2}\left(\min_i z_i^{+} - \max_i z_i^{-}\right) \qquad (19)$$

The values z_i can be obtained with Equation 5 and are divided into z_i^+ and z_i^− depending on their class label, +1 and −1, respectively. The KA algorithm has the implementational simplicity of the Adatron model and can find a solution very rapidly compared to traditional methods like kernel-perceptron and SVM.5 The algorithm comes with all the theoretical guarantees given by support vector theory for large margin classifiers, as well as the convergence properties studied in the statistical learning literature.68 However, although the algorithm uses only basic operations, it has a complexity of O(n²). Because of this, the algorithm has been modified so it can be trained using an EA with a computationally more attractive fitness function. The main problem of KA is calculating the z values. This results in an impractical fitness function, since it turns the linear computational complexity of the EA into quadratic. To solve this problem, it is proposed to use subsets of values to approximate a subset of z for evaluating a candidate solution, instead of calculating each exact value of z. Each subset is generated randomly and uses a much smaller fixed number of values (defined as n_vals in Algorithm 6) than the number of values contained in the kernel matrix. The fitness function is described in Algorithm 6.

Algorithm 6

Fitness Function.

1:	Initialize n_vals, z⁺_min = +∞, z⁻_max = −∞
2:	Generate a vector rvec with n_vals integer elements, where rvec_i ∈ [0, n_ts]
3:	for each element rvec_i of rvec do
4:	z_i = Σ_{j=1}^{n_vals} α_{rvec_j} Y_{rvec_j} K(X_{rvec_j}, X_{rvec_i})
5:	if z_i ∈ z⁺ and z_i < z⁺_min then
6:	z⁺_min = z_i
7:	else if z_i ∈ z⁻ and z_i > z⁻_max then
8:	z⁻_max = z_i
9:	end if
10:	end for
11:	Θ = ½ (z⁺_min − z⁻_max)
12:	return abs(1 − Θ)
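As an illustration only, Algorithm 6 can be sketched in Python as follows. The names `fitness`, `linear_kernel` and `rng` are hypothetical and are not taken from the authors' C++ implementation; the sketch assumes dense feature vectors, labels in {+1, −1} and indices drawn without replacement, none of which is specified in the text.

```python
import math
import random


def linear_kernel(a, b):
    """Dot product of two dense feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def fitness(alpha, X, Y, n_vals, rng):
    """Return |1 - margin|, with the margin estimated on a random subset.

    alpha: candidate multipliers (one per training sample)
    X, Y:  training vectors and their labels (+1 or -1)
    n_vals: number of samples drawn for this evaluation (n_vals <= len(X))
    rng:   a random.Random instance (assumption: sampling without replacement)
    """
    rvec = rng.sample(range(len(X)), n_vals)
    z_min_pos = math.inf    # smallest z among sampled points labelled +1
    z_max_neg = -math.inf   # largest z among sampled points labelled -1
    for i in rvec:
        # z_i is approximated with the sampled points only, not all of X.
        z_i = sum(alpha[j] * Y[j] * linear_kernel(X[j], X[i]) for j in rvec)
        if Y[i] == 1:
            z_min_pos = min(z_min_pos, z_i)
        else:
            z_max_neg = max(z_max_neg, z_i)
    theta = 0.5 * (z_min_pos - z_max_neg)
    return abs(1.0 - theta)
```

If the random subset happens to contain samples of only one class, the margin estimate in this sketch is infinite and so is the fitness; how the original implementation treats that case is not stated.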

The number of samples used by the fitness function, n_vals, need not increase drastically in this approach when the number of training samples in the dataset, n_ts, or the dimensionality of the problem increases. The value for n_vals was obtained from several tests, running PSO on each variant of the algorithm on several datasets and averaging the optimal number of samples needed by each approach. The resulting value was merely 400 data samples, which gave the best results in the tests made on the datasets mentioned in the Results and Discussion section. Since all the results were near 400 randomly selected samples, this number was taken as a constant for n_vals in all the tests, independently of n_ts. The fitness function complexity is O(1) if the kernel matrix K is computed beforehand, or O(d) for each value calculated by the fitness function otherwise, where d is the maximum number of non-zero features in any of the training samples.
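The precomputation of the kernel matrix and its division into subtasks can be sketched as below. This is a minimal illustration under assumptions: the function names and the row-block partitioning are hypothetical, and Python threads are used only to show the decomposition (in CPython they do not speed up pure-Python arithmetic, whereas the paper's implementation uses POSIX Threads in C++).

```python
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    """Linear kernel: dot product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def kernel_rows(X, start, stop):
    """Rows start..stop-1 of the matrix K[i][j] = <X[i], X[j]>."""
    return [[dot(X[i], x_j) for x_j in X] for i in range(start, stop)]


def kernel_matrix(X, n_threads=4):
    """Compute the full linear kernel matrix, one block of rows per task."""
    n = len(X)
    block = max(1, -(-n // n_threads))  # ceiling division, at least 1
    bounds = [(s, min(s + block, n)) for s in range(0, n, block)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns the blocks in submission order, so rows stay sorted.
        parts = pool.map(lambda b: kernel_rows(X, *b), bounds)
        return [row for part in parts for row in part]
```

Each block of rows depends only on the read-only data X, so the tasks need no communication until their results are concatenated.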

Interdisciplinary computing and complex biosystems protein structure prediction benchmarks repository

The Interdisciplinary Computing and Complex BioSystems Protein Structure Prediction (ICOS PSP) benchmarks repository contains datasets suitable for testing classification algorithms based on real data.69,70 The dataset is based on PSP, which aims to predict the three-dimensional structures of amino-acid chains from several structural features. The features are extracted by using a window of size Ω on the amino-acid chain to predict the Coordination Number (CN) of residue i from the information of its neighbors. Here, a residue i refers to a specific amino acid within the polymeric chain of a protein, and the CN is the number of residues from the same protein that are in contact with a given residue in the native state. Two residues are said to be in contact when the distance between them is below a certain threshold. The dataset is derived from a set of 1050 protein chains and approximately 260,000 amino acids (instances) selected using the PDB-REPRDB database.

In order to predict the real-valued CN using classification techniques, the continuous domain was mapped onto a finite set of categories. Two different criteria, uniform frequency and uniform length, were used to generate sets with two, three and five classes (or states), yielding balanced and imbalanced class distributions, respectively.71,72 Binning is the simplest method to discretize a continuous-valued attribute by creating a specified number of bins. The bins can be created by uniform frequency or uniform length. In both cases, the arity k determines the number of bins, each of which is associated with a distinct discrete value. For uniform length, the continuous range of a feature is evenly divided into intervals of equal length, and each interval represents a bin. In uniform frequency, an equal number of continuous values is placed in each bin.72 For this dataset, the bins are computed separately for each training set using all of its instances and afterwards applied to the corresponding test set.

To construct the datasets, a window size Ω ranging from 0 to 9 amino acids was used. The primary sequence of the protein and the CN definition of each amino acid were extracted from the PDB file. As in previous work,73 a standard bootstrapping technique was used, which is useful for the robust estimation of prediction accuracy and its error; that is, the dataset of 1050 protein chains was randomly divided into two groups: a training set of 950 chains and a test set of 100 chains. This division of the whole dataset was repeated 10 times, resulting in 10 pairs of training and test sets. Each training set contains more than 2 × 10⁵ residues. For this paper, only the subset divided into two states was used, since the approach is proposed for binary classification.
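The two binning criteria can be illustrated with a short sketch. The function names and the toy values are hypothetical, and the handling of values that fall exactly on a cut point is an assumption; the repository's own discretization code may differ.

```python
from bisect import bisect_right


def uniform_length_cuts(values, k):
    """k-1 cut points dividing [min, max] into k intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def uniform_frequency_cuts(values, k):
    """k-1 cut points placing roughly len(values)/k values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]


def assign_bin(value, cuts):
    """Index of the bin for value (a value equal to a cut goes to the upper bin)."""
    return bisect_right(cuts, value)


# Cut points are derived from the training values only ...
train = [0, 1, 2, 3, 4, 5, 6, 20]
length_cuts = uniform_length_cuts(train, 2)        # one cut at mid-range
frequency_cuts = uniform_frequency_cuts(train, 2)  # one cut at the median
# ... and then reused unchanged for the corresponding test values.
test_bins = [assign_bin(v, frequency_cuts) for v in [1, 7]]
```

On the toy values, the uniform-length cut leaves seven values in the lower bin and one in the upper bin, while the uniform-frequency cut places four values in each, which is why the two criteria lead to imbalanced and balanced class distributions.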

Results and Discussion

The data to classify was taken from the Interdisciplinary Computing and Complex BioSystems Protein Structure Prediction Benchmarks Repository and seven other datasets from diverse fields that are commonly used to test large-scale classifiers; the datasets are briefly described in Tables 1 and 2.

Table 1

Brief description of large-scale datasets. Density denotes the average percentage of non-zero features of the data vectors.

DATASET	DIMENSION	DENSITY
Astro-Ph	99757	0.08%
Aut-Avn	20707	0.23%
C11	47236	0.16%
CCAT	47236	0.16%
RCV1	47236	0.18%
Real-Sim	20958	0.23%
Worm	804	25.00%

Table 2

Brief description of the ICOS PSP dataset.

UNIFORM:	Ω	DIMENSION	DENSITY
Length	7	300	86.04%
	8	340	86.98%
	9	380	88.79%
Frequency	7	300	87.24%
	8	340	87.07%
	9	380	89.17%

From the PSP dataset, only the subsets discretized with uniform length and uniform frequency, with window sizes ranging from 7 to 9, were used for training and generalization because of their density and dimensionality. The Astro-Ph dataset is focused on classifying abstracts of scientific papers from the Physics ArXiv.74 The Aut-Avn and Real-Sim classification datasets come from a collection of UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos and real aviation. CCAT and C11 are obtained from the Reuters RCV1 collection and address the problem of separating corporate-related articles.3 The Worm dataset focuses on classifying worm RNA splices.13

The experiments were performed on an Intel® Core™ i7-3770 machine with 16 GB of RAM running the Fedora Linux 20 operating system. The code was written in C++ using POSIX Threads and Armadillo.75 For the implementation of the algorithms, the Armadillo random number generator was used, because the C++ random number generator was computationally more expensive and increased the execution time drastically.

For the experiments in this section, our approach is compared against OCA, SVM^light, SVM^perf and the original KA algorithm; the first three are large-scale SVM classifiers used in diverse fields.1,2 It should be taken into account that it is much easier to implement and parallelize EAs than to implement or parallelize the QP solvers used by OCA, SVM^light and SVM^perf.13,74,76 The work presented in this paper was developed and tested on a multi-core computer, but since the algorithm is easily parallelizable, it can be implemented to run on a computer cluster with fewer complications than a parallelized version of the previously mentioned algorithms for the same cluster.
It is expected that, by using this type of hardware, the training and evaluation time can be reduced, even when processing a considerably larger amount of data.

For the EA fitness function, a linear kernel was used in all the algorithms, since it gave the best results in the generalization tests. Several tests were made using a radial basis function kernel. In general, the results showed a slight increase in training accuracy (not sufficient to compete with the other approaches in the training phase), a slight decrease in generalization accuracy and an increase in processing time because of the extra operations needed to calculate the kernel. Because of this, only the results obtained with the linear kernel are shown.

Prior to the tests, a subset of 4000 training samples was randomly extracted from each dataset and normalized for binary classification training and cross-validation. Because of hardware limitations, the number of training samples used from each dataset is not large-scale, so that it could be stored in the computer’s memory. However, since the KA algorithm possesses the guarantees given by support vector theory and, as explained later in this section, scales well with increases in the amount of data and dimensionality, it can be used with a larger amount of data without problems.68 The dimensionality and density of the datasets can be seen in Tables 1 and 2.

The generalization accuracy was obtained by applying 10-fold cross-validation to each dataset. To test the training accuracy of each algorithm, the SVM was trained using 3600 training samples per run, which represents the n value for a dataset, and 400 samples were used for testing. The values used to train the SVM with each EA were obtained by running PSO on each variant of the algorithm to determine the optimal values. This is not to be confused with the PSO variant that uses KA to classify data.
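The 3600/400 split follows directly from applying 10-fold cross-validation to 4000 samples. A minimal sketch of such a split is shown below; the function name, the shuffling and the seed are assumptions for illustration, since the text does not describe how the folds were formed.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# With 4000 samples and k = 10, every run trains on 3600 and tests on 400.
splits = list(k_fold_indices(4000, 10))
```

Every sample is used for testing exactly once across the ten runs, and the reported generalization accuracy would then be the average over those runs.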
The following values were used by the EAs on the large-scale datasets. The µABC version used RF = 0.0001, C = 0.0001, FCR = 0.0001 and a maximum of 5 attempts before abandoning a food source. The ABC version used C = 2, ϕ values in the range [−2, 2], 5 food sources and a maximum of 9 attempts before abandoning a food source. The DE algorithm used C = 2.38958, F = 1.87016, CROV = 0.9 and 6 vectors. The PSO algorithm used v = 1.49684, w = 1.18472, w = 0.000511895, c1 = 1.03971, c2 = 1.48063, C = 6.74659 and 15 particles.

For the PSP dataset, the following values were used by the EAs. The µABC version used RF = 0.001, C = 0.0001, FCR = 0.001, a maximum of 5 attempts before abandoning a food source and a maximum of 25 iterations as stopping condition. The ABC version used C = 5, ϕ values in the range [−2, 2], 8 food sources, a maximum of 9 attempts before abandoning a food source and a maximum of 20 iterations as stopping condition. The DE algorithm used C = 2.65435, F = 0.719909 and CROV = 0.1, with 6 vectors and a maximum of 23 iterations as stopping condition. The PSO algorithm used v = 0.1, w = 0.0494229, w = 0.0001, c1 = 1.13755, c2 = 0.11384 and C = 3.5, with 10 particles and a maximum of 30 iterations as stopping condition.

The C value in an SVM has two main purposes: it functions as a constant that scales the risk function in the primal formulation in Equation 1, and it limits the values that any α can take in the dual formulation in Equation 2. In this paper, the value of C is used as in the dual formulation, to limit the values of α. A total of 200 iterations was used as the stopping condition by the EAs for the datasets described in Table 1, because all the algorithms trained with PSO returned values close to 200 iterations as the optimal value for the stopping condition, with 200 being the highest number of iterations.
As stated in the section The Kernel Adatron Algorithm, the KA algorithm has appealing advantages, such as the simplicity of implementation of Adatron and the capability of working in high-dimensional feature spaces to construct a large-margin hyperplane. The main concern when implementing the original KA approach is working with the kernel matrix, since its computational complexity is O(d * n²), where d is the maximum number of non-zero features in any data vector of the training subset and n is the number of training samples. Nevertheless, there are scenarios, such as that presented in Table 1, where the density of the data samples is low in most cases, so the number of operations needed to calculate the kernel matrix can be drastically reduced. On the other hand, independently of the density, if it is treated as a divide and conquer problem the computational complexity is reduced, in the worst-case scenario, to , where t is the number of threads. Methods like Sequential Minimal Optimization or chunking can be used to reduce the computational complexity, but these algorithms, in the worst-case scenario, scale to and , respectively, which makes them computationally expensive.77 The approach proposed in this paper always uses, per iteration, a subset of randomly chosen training samples with a much smaller fixed size that is independent of the number of training samples n in the dataset. Because of this, the complexity remains linear, O(d) (O(d/t) if it is parallelized), even if the dataset increases in size.

For all the experiments made using the datasets described in Table 1, a total of 60 randomly chosen training samples from a dataset were used every time the fitness function was called. For the PSP dataset, the number of samples used per fitness function call was 400, over three times more data than with the other datasets, but still a considerably small number of samples considering the density and the complete number of samples. These values were also obtained with PSO.
Since the approaches shown in this paper work with data subsets, some accuracy in the training phase is lost in exchange for better generalization capability in a small amount of time. For the approach shown in this paper, the computational complexity of the EAs is linear, O(n), where n is the number of individuals in the population of the EA, and O(d) for the fitness function, so the overall complexity of the algorithm is O(n * b) (Table 3). Compared to SVM^light and KA, whose computational complexity is equal to or higher than O(d * n_ts²), the approach shown in this paper is more appealing.4 Algorithms such as OCA and SVM^perf show a computational complexity of O(d * n), which makes this approach competitive by comparison.4,74

Table 3

Computational complexity of the algorithms.

ALGORITHM	COMPLEXITY
KA	O(d * n_ts²)
SVM^light	O(d * n_ts²)
OCA	O(d * n_ts)
SVM^perf	O(d * n_ts)
EA approaches	O(n * b)
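To give a feel for the sizes involved, the following back-of-the-envelope count compares the pairwise kernel evaluations needed for a whole training fold with those needed for one random subset. It is only an illustration using the sample counts quoted in the text (3600, 60 and 400); the helper name is hypothetical, and counting one kernel evaluation per pair is a simplifying assumption.

```python
def kernel_evaluations(n_samples):
    """Pairwise kernel evaluations among n_samples samples (each costs O(d))."""
    return n_samples * n_samples


full = kernel_evaluations(3600)        # whole training fold: 12,960,000
subset_60 = kernel_evaluations(60)     # one subset, Table 1 datasets: 3,600
subset_400 = kernel_evaluations(400)   # one subset, PSP dataset: 160,000

print(full // subset_60)    # 3600
print(full // subset_400)   # 81
```

Because the subset size is fixed, the per-call count stays the same when the number of training samples grows, whereas the full count grows quadratically.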

As shown in Tables 7 to 18, our approach gave results in the generalization and time tests (measured in seconds) that are competitive with or better than those of OCA, SVM^light and SVM^perf, though accuracy in the training phase is not the strongest point of the algorithm. Notably, in terms of training and generalization, our approach shows results similar to or better than those obtained by the original KA algorithm, but in a fraction of the time. The best global results shown in Tables 7 to 18 are underlined, and the best results obtained by our approach are written in bold letters.

Table 7

Results from the Aut-Avn dataset.

ALGORITHM	TRAINING	GENERALIZATION	TRAINING TIME
µABC	97.23%	94.98%	0.0216
ABC	97.20%	94.58%	0.0725
DE	97.28%	96.13%	0.0172
PSO	97.65%	96.95%	0.0198
KA	97.34%	94.95%	12.5613
SVM^light	99.70%	95.65%	0.1380
SVM^perf	98.52%	96.03%	0.0102
OCA	100.00%	90.10%	0.0384

Table 18

Cross-validation accuracy results for PSP uniform length subsets.

ALGORITHM	Ω = 7	Ω = 8	Ω = 9
µABC	74.03%	71.53%	73.55%
ABC	72.35%	69.85%	72.58%
DE	72.78%	70.48%	74.03%
PSO	73.25%	69.55%	75.05%
SVM^light	72.00%	72.34%	72.60%
OCA	64.28%	68.34%	67.05%
SVM^perf	64.38%	68.41%	67.13%
KA	73.15%	69.40%	72.88%

As can be seen from the ROC curves in Figures 6A to 6G and in Table 19, the generalization performances of the classifiers shown in this paper are very similar (the curves overlap each other), with excellent values for the area under the curve (AUC), ranging from 0.9160 to 0.9891. The ROC curves for the PSP dataset (Table 21 and Figs. 7A to 7F) gave good values for AUC, ranging from 0.8087 to 0.8345. Even though the generalization results obtained on the Worm dataset are not as good as those obtained on the other datasets, it gave the best AUC of all the ROC curves.
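For readers unfamiliar with the measure, the AUC can be computed directly from classifier scores as the fraction of positive/negative pairs that are ranked correctly. The sketch below is a generic illustration with hypothetical names and toy scores; it is not the procedure or software used to produce Tables 19 and 21.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 1/2).

    scores: real-valued classifier outputs, larger means more likely positive
    labels: +1 for the positive class, anything else for the negative class
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1])
```

An AUC of 1.0 corresponds to a perfect ranking and 0.5 to a random one, which is the scale on which the reported values should be read.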

Figure 6

ROC curves obtained from large-scale datasets.

Table 19

ROC curve areas obtained from large-scale datasets.

DATASET	AREA
Astro-Ph	0.9762
Aut-Avn	0.9803
C11	0.9160
CCAT	0.9292
RCV1	0.9590
Real-Sim	0.9865
Worm	0.9891

Table 21

ROC curve areas obtained from ICOS PSP dataset.

UNIFORM:	Ω	AREA
Length	7	0.8134
	8	0.8345
	9	0.8242
Frequency	7	0.8248
	8	0.8229
	9	0.8087

Figure 7

ROC curves obtained from the ICOS PSP dataset.

To detect differences between solvers across multiple test attempts, Matlab™’s implementation of the Friedman test was applied to the results of the four solvers that gave the best generalization results (DE, PSO, SVM^light and SVM^perf).78 For the test, α = 0.05 was used with 3 degrees of freedom, taking as null hypothesis H0 the statement that there is no difference between the classifiers, and as alternative hypothesis H1 the statement that there is a difference. According to the χ² table, if the χ² value is greater than 7.815, the null hypothesis is rejected. The results obtained from the tests were as follows.

Friedman test applied to the datasets shown in Table 1: χ² = 5.9, a value smaller than 7.815, with P-value = 0.1161, which is greater than 0.05. From the results shown in Table 4A, we can state that hypothesis H0 is supported. The Tukey test was used to determine which classifiers differ significantly from one another.79 From the test we obtained an honest significant difference of 2.95; when this value is compared to the results presented in Table 5A, it is easy to see that there is no statistically significant difference between the solvers, since the difference between each pair of means is less than this value.

Table 4

Results obtained from the Friedman test: sum of squares (SS), degrees of freedom (df), mean squares (MS), χ² value and P-value.

(A) Friedman test made to the Astro-Ph, Aut-Avn, C11, CCAT, RCV1, Real-Sim and Worm datasets.
SOURCE	SS	DF	MS	χ²	P-VALUE
Columns	9.2857	3	3.0952	5.9091	0.1161
Error	23.7143	18	1.3175
Total	33	27

Table 5

Mean rank obtained from the Friedman test for each solver.

(A) Mean rank from the Astro-Ph, Aut-Avn, C11, CCAT, RCV1, Real-Sim and Worm datasets.
	DE	PSO	SVM^light	SVM^perf
Mean	2.2857	1.8571	3.4286	2.4286
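The figures in Tables 4A and 5A can be checked against the generalization accuracies reported for DE, PSO, SVM^light and SVM^perf in Tables 6 to 12. The sketch below is not Matlab's implementation; it assumes the usual tie-corrected Friedman statistic with ranks assigned in ascending order of accuracy, and the helper names are hypothetical.

```python
import math


def average_ranks(row):
    """Ascending ranks (1 = smallest value); tied values share their mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        for m in range(i, j + 1):
            ranks[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def friedman(table):
    """Return (mean ranks per column, tie-corrected chi-square statistic)."""
    n, k = len(table), len(table[0])
    ranks = [average_ranks(row) for row in table]
    centre = (k + 1) / 2
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    ss_columns = n * sum((m - centre) ** 2 for m in mean_ranks)
    ss_total = sum((r - centre) ** 2 for row in ranks for r in row)
    return mean_ranks, n * (k - 1) * ss_columns / ss_total


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Generalization accuracy (%), columns: DE, PSO, SVM^light, SVM^perf.
accuracy = [
    [93.80, 93.77, 95.33, 93.85],  # Astro-Ph
    [96.13, 96.95, 95.65, 96.03],  # Aut-Avn
    [87.58, 86.44, 87.58, 87.58],  # C11
    [86.98, 86.55, 92.03, 84.08],  # CCAT
    [93.00, 94.61, 94.85, 94.03],  # RCV1
    [96.51, 96.46, 97.63, 97.28],  # Real-Sim
    [81.70, 80.41, 95.35, 93.80],  # Worm
]
mean_ranks, chi2 = friedman(accuracy)
p_value = chi2_sf_3df(chi2)
```

Under these assumptions the sketch gives mean ranks of about 2.2857, 1.8571, 3.4286 and 2.4286, χ² ≈ 5.909 and a P-value ≈ 0.1161, in agreement with Tables 4A and 5A.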

Friedman test applied to the PSP dataset: χ² = 13.4, a value greater than 7.815, with P-value = 0.0037, which is smaller than 0.05. From the results shown in Table 4B, we can state that hypothesis H0 is rejected. From the test we obtained an honest significant difference of 1.17; when this value is compared to the results presented in Table 5B, it is apparent that there is a statistically significant difference between SVM^perf and the rest of the solvers. This is easily noticed, since SVM^perf gave the worst results of the four solvers in the cross-validation tests for the PSP dataset.

Every possible pair of ROC curves obtained from the datasets shown in Table 1 was compared using MedCalc© to obtain their significance level. From the results shown in Table 20, it can be stated that hypothesis H0 is accepted in all cases. The same procedure was applied to the ICOS PSP dataset. The results presented in Table 22 also support the H0 hypothesis.

Table 11

Results from the Real-Sim dataset.

ALGORITHM	TRAINING	GENERALIZATION	TRAINING TIME
µABC	97.86%	96.28%	0.0336
ABC	98.25%	96.88%	0.0612
DE	98.21%	96.51%	0.0147
PSO	98.30%	96.46%	0.0311
KA	97.99%	96.20%	12.6350
SVM^light	99.67%	97.63%	0.1510
SVM^perf	98.81%	97.28%	0.0107
OCA	99.73%	92.65%	0.0378

Table 20

ROC curve significance level obtained from large-scale datasets.

	AUT-AVN	C11	CCAT	RCV1	REAL-SIM	WORM
Astro-Ph	0.9992	0.918	0.9275	0.9483	0.9817	0.9721
Aut-Avn		0.9206	0.9303	0.9537	0.9851	0.9775
C11			0.9867	0.9471	0.9108	0.906
CCAT				0.9595	0.9194	0.914
RCV1					0.9351	0.9246
Real-Sim						0.9925

Table 22

ROC curve significance level obtained from ICOS PSP dataset (where UF is uniform frequency and UL is uniform length).

	UF8	UF9	UL7	UL8	UL9
UF7	0.9988	0.99	0.9934	0.9945	0.9996
UF8		0.9922	0.9951	0.9941	0.9993
UF9			0.9975	0.9866	1.0099
UL7				0.9896	0.9943
UL8					0.9947

Conclusions

We developed a simple-to-implement method for classifying sparse, large-scale datasets using parallelism with four EAs. As can be seen in the results, the approach also works for classifying not-so-sparse data in very short amounts of time without increasing the complexity of the algorithm. Even though the approach did not give good results in the training phase, it gave good generalization results in competitive or smaller amounts of time compared with those obtained by algorithms such as KA, OCA, SVM^light and SVM^perf when classifying several datasets from different areas as well as PSP data. The simplicity of the EAs and of the training function makes the approach easier to implement and parallelize. From the Friedman test it can be concluded that there is no difference in terms of generalization between the approaches that use PSO and DE, compared to SVM. The Tukey test confirms that there is no statistically significant difference between the three algorithms, from which it can be concluded that they have the same generalization capabilities. The ROC curve comparisons also show that the performance of the algorithms ranges from good to excellent, since the area under the curve is greater than 0.8. These results, combined with the simplicity and linear complexity of the algorithms, are what make this approach appealing for large-scale classification problems. Comparing the four proposed EA variants, it is easy to notice that the DE version is the fastest and also has good generalization capability; future improvements of the method will focus on the DE approach. Future work includes a multiclass version of this approach, an implementation of the algorithm that can run on computer clusters, and improvements to the training accuracy of the algorithms.

Algorithm 5

Particle Swarm Optimization.

1:	Initialize c₁, c₂, v_i and x_i
2:	P_ibest ← x_i.
3:	Select P_gbest from the x_i.
4:	repeat
5:	Obtain velocity v_i with Equation 16.
6:	Update position x_i with Equation 17.
7:	if f(x_i) < f(P_ibest) then
8:	P_ibest ← x_i
9:	if f(P_ibest) < f(P_gbest) then
10:	P_gbest ← P_ibest
11:	end if
12:	end if
13:	until the stopping condition is met.
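A compact sketch of Algorithm 5 is given below. Equations 16 and 17 are not reproduced in this section, so the sketch assumes the standard inertia-weight velocity update and position update; the function name, the clipping of positions to [0, upper] (mirroring the role of C as a bound on α) and the toy objective are likewise assumptions for illustration.

```python
import random


def pso(f, dim, n_particles, n_iter, w, c1, c2, upper, seed=0):
    """Minimize f over [0, upper]^dim; returns (best position, best value)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(0.0, upper) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                 # P_ibest <- x_i
    pbest_f = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]   # select P_gbest
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Assumed velocity update: inertia + cognitive + social terms.
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                # Assumed position update, clipped to the allowed range.
                xs[i][d] = min(max(xs[i][d] + vs[i][d], 0.0), upper)
            fx = f(xs[i])
            if fx < pbest_f[i]:
                pbest[i], pbest_f[i] = xs[i][:], fx
                if fx < gbest_f:
                    gbest, gbest_f = xs[i][:], fx
    return gbest, gbest_f


def toy_objective(x):
    """Simple stand-in for the fitness function: squared distance to (1, 1)."""
    return sum((v - 1.0) ** 2 for v in x)


best, best_f = pso(toy_objective, dim=2, n_particles=15, n_iter=50,
                   w=0.7, c1=1.5, c2=1.5, upper=2.0)
```

In the proposed approach the objective would be the fitness function of Algorithm 6 and each particle a vector of multipliers; the parameter values used here are placeholders, not those reported in the paper.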

Table 6

Results from the Astro-Ph dataset. The best global results are underlined and the best results obtained by our approach are written in bold letters.

ALGORITHM	TRAINING	GENERALIZATION	TRAINING TIME
µABC	94.50%	92.65%	0.0243
ABC	94.58%	93.63%	0.0650
DE	94.56%	93.80%	0.0191
PSO	94.53%	93.77%	0.0212
KA	94.61%	92.68%	12.0500
SVM^light	99.27%	95.33%	0.2430
SVM^perf	95.82%	93.85%	0.0195
OCA	100.00%	93.25%	0.0282

Table 8

Results from the C11 dataset.

ALGORITHM	TRAINING	GENERALIZATION	TRAINING TIME
µABC	85.42%	81.33%	0.0221
ABC	86.52%	86.85%	0.0129
DE	87.01%	87.58%	0.0121
PSO	86.55%	86.44%	0.0198
KA	87.92%	83.80%	11.5700
SVM^light	98.12%	87.58%	0.0111
SVM^perf	98.55%	87.58%	0.0102
OCA	100.00%	72.84%	0.0479

Table 9

Results from the CCAT dataset.

ALGORITHM	TRAINING	GENERALIZATION	TRAINING TIME
µABC	90.49%	86.18%	0.0287
ABC	91.11%	86.78%	0.0436
DE	91.82%	86.98%	0.0187
PSO	91.74%	86.55%	0.0387
KA	90.75%	86.58%	12.5626
SVM^light	98.71%	92.03%	0.3220
SVM^perf	88.13%	84.08%	0.0199
OCA	99.55%	83.58%	0.0637

Table 10

Results from the RCV1 dataset.

ALGORITHM	TRAINING	GENERALIZATION	TRAINING TIME
µABC	92.78%	91.03%	0.0256
ABC	92.75%	93.10%	0.0488
DE	92.72%	93.00%	0.0154
PSO	93.42%	94.61%	0.0402
KA	92.96%	91.28%	12.5998
SVM^light	99.01%	94.85%	0.2830
SVM^perf	96.51%	94.03%	0.0118
OCA	100.00%	88.15%	0.0704

Table 12

Results from the Worm dataset.

ALGORITHM	TRAINING	GENERALIZATION	TRAINING TIME
µABC	81.60%	80.30%	0.0268
ABC	79.10%	77.77%	0.0178
DE	82.81%	81.70%	0.0201
PSO	81.01%	80.41%	0.0275
KA	80.86%	79.43%	12.6125
SVM^light	97.79%	95.35%	0.3150
SVM^perf	99.86%	93.80%	0.0200
OCA	100.00%	89.00%	0.0840

Table 13

Training accuracy results for PSP uniform frequency subsets.

ALGORITHM	Ω = 7	Ω = 8	Ω = 9
µABC	74.23%	74.88%	73.83%
ABC	74.24%	75.36%	74.40%
DE	74.77%	75.76%	74.29%
PSO	75.23%	75.35%	74.94%
SVM^light	86.98%	87.95%	88.40%
OCA	100.00%	100.00%	100.00%
SVM^perf	100.00%	100.00%	100.00%
KA	75.16%	76.08%	75.18%

Table 14

Training time results for PSP uniform frequency subsets.

ALGORITHM	Ω = 7	Ω = 8	Ω = 9
µABC	0.0297s	0.0304s	0.0224s
ABC	0.0084s	0.0060s	0.0055s
DE	0.0054s	0.0041s	0.0026s
PSO	0.0065s	0.0056s	0.0059s
SVM^light	0.0210s	0.0220s	0.0200s
OCA	0.1978s	0.2039s	0.3558s
SVM^perf	0.2910s	0.1820s	0.2900s
KA	1.2851s	1.2811s	1.2811s

Table 15

Cross-validation accuracy results for PSP uniform frequency subsets.

ALGORITHM	Ω = 7	Ω = 8	Ω = 9
µABC	73.55%	73.08%	72.40%
ABC	73.75%	74.80%	72.93%
DE	74.43%	74.63%	73.08%
PSO	74.43%	74.05%	73.58%
SVM^light	73.33%	70.87%	70.68%
OCA	66.82%	65.28%	64.12%
SVM^perf	66.90%	65.37%	64.22%
KA	74.20%	75.15%	73.80%

Table 16

Training accuracy results for PSP uniform length subsets.

ALGORITHM	Ω = 7	Ω = 8	Ω = 9
µABC	74.86%	73.33%	74.60%
ABC	73.70%	70.95%	74.07%
DE	74.72%	72.23%	75.32%
PSO	74.31%	71.19%	75.99%
SVM^light	88.05%	91.45%	88.80%
OCA	100.00%	100.00%	100.00%
SVM^perf	99.95%	100.00%	100.00%
KA	74.53%	71.04%	74.43%

Table 17

Training time results for PSP uniform length subsets.

ALGORITHM	Ω = 7	Ω = 8	Ω = 9
µABC	0.0338s	0.0203s	0.0101s
ABC	0.0064s	0.0058s	0.0024s
DE	0.0059s	0.0040s	0.0024s
PSO	0.0075s	0.0066s	0.0050s
SVM^light	0.0200s	0.0180s	0.0250s
OCA	0.1830s	0.1600s	0.1931s
SVM^perf	0.1950s	0.1080s	0.1730s
KA	1.2963s	1.2827s	1.2690s
