Group-based local adaptive deep multiple kernel learning with lp norm.

Shengbing Ren1, Fa Liu1, Weijia Zhou1, Xian Feng1, Chaudry Naeem Siddique1.   

Abstract

The deep multiple kernel learning (DMKL) method has attracted wide attention due to its better classification performance than shallow multiple kernel learning. However, existing DMKL methods struggle to find suitable global model parameters that improve classification accuracy across numerous datasets, and they do not take inter-class correlation and intra-class diversity into account. In this paper, we present a group-based local adaptive deep multiple kernel learning (GLDMKL) method with an lp norm. Our GLDMKL method divides samples into multiple groups using the multiple kernel k-means clustering algorithm, and the learning process in each well-grouped local space is exactly adaptive deep multiple kernel learning. Because our structure is adaptive, there is no fixed number of layers; the learning model in each group is trained independently, so the number of layers may differ between groups. In each local space, we adapt the model by optimizing the SVM model parameter α and the local kernel weight β in turn, and the local kernel weight changes the proportion of each base kernel in the combined kernel at each layer; the local kernel weight is constrained by the lp norm to avoid sparsity among the base kernels. The hyperparameters of the kernels are optimized by the grid search method. Experiments on the UCI and Caltech 256 datasets demonstrate that the proposed method achieves higher classification accuracy than other deep multiple kernel learning methods, especially on datasets with relatively complex data.

Year:  2020        PMID: 32941468      PMCID: PMC7498035          DOI: 10.1371/journal.pone.0238535

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Because different kernels have different characteristics and different parameter settings, kernel performance varies greatly across datasets, and there is no reliable way to construct or choose a suitable kernel. To address this problem, multiple kernel learning (MKL) methods that use combinations of kernels have been proposed [1-7]; they make full use of the characteristics of various kernels and adapt better to different datasets. In many cases, however, these combinations do not change the kernel structure, so choosing the right basic kernels to combine into a composite kernel remains a major issue. Multiple kernel learning has also been combined with deep learning [8, 9] to improve learning performance. Deep learning methods transform the input data through multiple nonlinear processing layers to construct new features [10], and such methods have made significant progress in image classification [11]. There are many related studies on deep multiple kernel learning (DMKL). Deep multiple kernel learning aims to learn a "deep" kernel machine [12] by exploring a combination of multiple kernels in a multilayer structure. Through multilevel mapping, the proposed multi-layer multiple kernel learning (MLMKL) framework adapts to a wider range of datasets than the MKL framework and finds the optimal kernel more efficiently. In [13], a combined kernel is formed from the base kernels at each layer by optimizing an estimate of the support vector machine leave-one-out error. That structure directly iterates over mutually weighted combined kernels, which yields too many weight parameters and makes parameter optimization cumbersome. In [14], a backpropagation MLMKL framework is proposed, which uses deep learning to iteratively learn the optimal combination of kernels.
Three deep kernel learning models for breast cancer classification have been proposed in [11] to avoid the overfitting risks of deep learning. However, these model structures have a fixed number of layers and lack flexibility. Moreover, their learning methods are global, which limits generalization ability under certain conditions. Because classical DMKL models have a fixed number of layers, they cannot adapt to a wide range of datasets; as a result, classification performance suffers, computing power is wasted, and more data are needed. Therefore, in [15], we proposed an adaptive deep multiple kernel learning framework that solves the problem of a fixed number of layers: the number of layers grows according to the actual dataset, and growth stops once the highest classification accuracy remains unchanged for several consecutive layers. However, adaptive deep multiple kernel learning requires too many layers to achieve the highest accuracy, which wastes time, and its classification performance is greatly affected by the choice of kernels, resulting in poor model stability. Classical DMKL is also limited to learning a global combination over the entire input space. Owing to the diversity of and correlation between samples, suitable kernels may vary from one local space to another. When the samples in a category exhibit high variation as well as correlation with the samples in other categories, global models find it difficult to cope with such complicated data and suffer degraded performance. We therefore introduce the local learning method [16] to solve this problem. Local learning can take inter-class correlation and intra-class diversity into account. Moreover, we can divide the dataset into several groups with a clustering algorithm to facilitate the classification and statistical analysis of subsequent models.
For datasets with very simple samples, the data can be treated as a single group so that DMKL can be carried out directly. The local learning method is thus adaptable to a wide range of datasets and can reduce model complexity and save training time. It is therefore feasible to apply local learning to DMKL to form group-based local deep multiple kernel learning, which improves the generalization performance of the DMKL model and achieves higher classification accuracy than classical DMKL. A further benefit is that it saves computing power and does not require too much data. In group-based local deep multiple kernel learning, samples with similar features are clustered into a group, so intra-class diversity can be represented by a set of groups, while inter-class correlation can be represented by the correlation among different groups. Group-based local deep multiple kernel learning can therefore adapt to a wide range of datasets, increasing model flexibility and saving computing power. Another advantage is that multiple classifiers can be trained separately, and the number of classifier model layers may differ between groups. In other words, each group is processed separately, which saves training and prediction time: we only need to know which group a new test sample falls into, test it in the local model of the corresponding group, and calculate the classification accuracy. This helps the method adapt to a variety of samples and highlights the flexibility of the model. Because a sparse constraint can lose useful kernels during MKL optimization [17], we apply an lp norm [18] constraint on the kernel weights and obtain non-sparse results to avoid losing useful kernels. The lp norm is thus used as a weight constraint so that the weights of useful kernels are increased and those kernels are not lost.
Conversely, the weight is reset to zero for useless or even counterproductive kernels. Adjusting the kernel sparsity of the multiple kernel combination in each layer in this way can also improve classification performance. To solve the above problems, this paper proposes a group-based local adaptive deep multiple kernel learning (GLDMKL) method with an lp norm. Unlike classical DMKL, our GLDMKL model is local and adaptive. Samples are clustered using the multiple kernel k-means clustering algorithm so that similar samples fall in the same group. Once the samples are divided into multiple groups, the DMKL process is performed in each local space. The number of layers may differ between groups, which highlights the flexibility of the model and saves training and testing time. In each group, at each layer, we perform an MKL process by weighting multiple kernels with different types and parameters to form a combined kernel, with the proportion of each basic kernel in the combined kernel determined by the local kernel weight. In local adaptive deep multiple kernel learning, the output value of the combined kernel in the previous layer is used as input to the combined kernel in the next layer; the actual input to the model, however, is still a sample. We can thus learn a deep kernel machine and stop increasing the number of layers once the highest classification accuracy remains unchanged for several consecutive layers. Moreover, our model sets an initial value for each candidate kernel hyperparameter and adjusts it with the grid search method [19], avoiding the trouble of manually selecting kernel hyperparameters before the learning process. This learning method also constrains the weights with the lp norm, controlling the sparsity of the kernels to avoid losing useful kernels.
For useless kernels, the weight can be reset to zero; the resulting non-sparse multiple kernel combinations in each layer can improve generalization. The main contributions of this paper are summarized as follows: (1) A group-based local adaptive deep multiple kernel learning architecture is proposed. The GLDMKL architecture consists of two parts: multiple kernel k-means clustering and local adaptive deep multiple kernel learning. The number of layers grows with the learning process; in each group, learning is carried out independently, so the number of layers may differ between groups, making the model more adaptable to data of different dimensions and sizes. (2) A GLDMKL model learning algorithm matching this architecture is designed. The algorithm uses deep kernel learning to build a local deep multiple kernel learning model layer by layer, and the SVM model parameters and the local kernel weights of the corresponding groups are optimized in turn to fit the model. The hyperparameters of the basic kernels are adjusted by the grid search method, and growth of the model layers stops when the highest classification accuracy remains unchanged over consecutive layers. (3) A weight constraint with the lp norm is proposed. To control the sparsity of the kernels and avoid losing useful basic kernels, the weights of useful basic kernels are increased. (4) Experiments on the UCI and Caltech 256 datasets show that our GLDMKL approach can handle complex data, and that compared with classical DMKL methods it achieves higher classification accuracy and better generalization. The rest of this paper is organized as follows: Section Related works provides a brief overview of the relevant background. Section Our approach describes the group-based local adaptive deep multiple kernel learning method with the lp norm.
Section Experiments describes the experiments, Section Validation illustrates the validation of the model, and Section Conclusion summarizes the paper and outlines future work.

Related works

Deep multiple kernel learning

Deep multiple kernel learning (DMKL) [12–15, 20–24] is a hot research topic inspired by deep learning in recent years. It explores combinations of multiple kernels in a multi-layer architecture and has achieved success on various datasets, so DMKL can be used in many real-world situations. In [12], Zhuang et al. propose a two-layer multiple kernel learning (2LMKL) method and two efficient algorithms for classification tasks. It aims to learn "deep" kernel machines by exploring combinations of multiple kernels in a multi-layer structure. With multi-layer mapping, the proposed 2LMKL framework provides greater flexibility than conventional MKL for finding the best combined kernels faster. Zhuang et al. also show that the number of basic kernels has a certain effect on classification performance, realized by iteratively updating the parameters of the basic kernels. However, the structure has only two layers, which cannot adapt to the requirements of a wide range of datasets, and the model is global. In [13], a combined kernel is formed from the basic kernels at each layer by optimizing an estimate of the support vector machine leave-one-out error. Only a few basic kernels are required to continuously improve performance at each layer. However, the structure is a directly weighted iteration over different kinds of combined kernels, resulting in too many weighting parameters; optimizing these parameters is cumbersome, and the number of model layers is fixed. Moreover, the model is global, taking neither intra-class diversity nor inter-class correlation into account. In [14], a new backpropagation MLMKL framework is described, which optimizes the network through an adaptive backpropagation algorithm; Rebai et al. use the gradient ascent method instead of the dual objective function. The deep architecture has a fixed number of layers and cannot adapt to a wide range of datasets.
It is also a global model that takes neither intra-class diversity nor inter-class correlation into account. In [15], we propose an adaptive deep multiple kernel learning (SA-DMKL) method. It optimizes the model parameters of each kernel with the grid search method, evaluates each basic kernel using a generalization bound based on Rademacher chaos complexity, and removes those that exceed the bound. The output regression values of the SVM classifiers built on the remaining kernels are used to construct a new feature space whose dimension equals the number of remaining kernels; these new sample features become the input of the kernels in the next layer. An SVM classifier is used to train each candidate kernel, and at each layer the kernel-based SVM classifier classifies the test data to obtain the classification accuracy. Growth of the layers terminates when the highest classification accuracy remains unchanged over several successive layers. But this model is also global and takes neither intra-class diversity nor inter-class correlation into account. DMKL generalizes well when the candidate kernels and parameters are tuned to a very appropriate level, but that level is hard to achieve: many hyperparameters must be set and are difficult to adjust. At the same time, existing DMKL architectures are relatively simple. The combined kernel in each layer consists of the same set of basic kernels, the output of the combined kernel in the previous layer is the input of all the basic kernels in the next layer, and the number of layers is fixed. Improper kernel selection and a model structure with a fixed number of layers lead to insufficient adaptability to the sample datasets, which hurts model performance. We therefore proposed an adaptive DMKL architecture [15] to solve the problem of a fixed number of layers.
The growth of layers is limited by a cutoff condition, and the number of model layers can change with different training datasets. However, these learning methods still adopt a uniform similarity measure over the whole input space: when the samples of a category exhibit high variation as well as correlation with other categories, they find it difficult to cope with such complex data.

Group-based local learning

Local learning [16, 25–28] divides the whole problem into several small problems and then learns each separately. It only needs to find a local optimum, which is more convenient and efficient than global learning. We apply group-based local learning to DMKL instead of global learning. A description of group-based local learning and global learning follows, and a comparison of the two architectures is shown in Fig 1.
Fig 1

Comparison of two multiple kernel learning methods.

The original sample dataset is denoted Data, the number of basic kernels is m, and {k1, k2, …, km} are the basic kernels, each with its own weight. The difference between the two methods is as follows: group-based local learning divides the sample dataset into g groups {G1, G2, …, Gg}, each containing several samples, and performs MKL in each group, so the total number of weights is g*m; global learning performs classical MKL on the whole sample dataset, so the total number of weights is m. The benefits of group-based local learning are these: choosing a proper kernel is very difficult, as it involves selecting the hyperparameters of the basic kernels, the variety of basic kernels, and the weights of the basic kernels, and with so many parameters the best combination cannot be selected quickly. Group-based local learning is also based on multiple kernel learning, but there is no strict need to select the most appropriate kernels; local learning via clustering eases the computational pressure of choosing the right kernels. Another advantage is that it takes inter-class correlation and intra-class diversity into account and can deal with complex data. Group-based local learning is therefore a very desirable method.
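The weight-count comparison above can be made concrete with a minimal sketch (illustrative only, not code from the paper; all names and values are made up): global MKL keeps one weight per basic kernel, while group-based local learning keeps one weight per kernel per group.

```python
import numpy as np

# With m basic kernels, global MKL learns a single length-m weight vector,
# while group-based local learning learns one length-m weight vector per
# group, i.e. g*m weights in total.
m, g = 4, 3                               # basic kernels, groups

global_weights = np.full(m, 1.0 / m)      # one shared weight vector
local_weights = np.full((g, m), 1.0 / m)  # one weight vector per group

print(global_weights.size)  # 4  (m weights in global learning)
print(local_weights.size)   # 12 (g*m weights in group-based local learning)
```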

Lp norm

A norm [29] is a reinforced notion of distance that, by definition, adds scalar multiplication to the notion of distance; for intuition, a norm can sometimes be thought of as a distance. In mathematics, norms include vector norms and matrix norms. A vector norm represents the size of a vector in a vector space, and a matrix norm represents the size of the change caused by a matrix. Informally, vectors in a vector space have a magnitude, and the norm measures that magnitude; different norms measure it in different ways, just as both meters and feet can be used to measure distances. By computing AX = B, the vector X is changed to B, and the matrix norm measures the magnitude of this change. The lp norm [30-32] is defined as Eq (1), ||x||p = (Σi |xi|^p)^(1/p), where p takes values in the range [0, +∞). When p = 0, the lp norm is the l0 norm. The l0 norm is not a true norm; it mainly measures the number of non-zero elements in a vector. The l1 norm has many names, such as the Manhattan distance and the smallest absolute error; it measures the difference between two vectors, as in the Sum of Absolute Differences. The l2 norm is the most common: the most widely used metric distance, the Euclidean distance, is the l2 norm, and l2 can also measure the difference between vectors, as in the Sum of Squared Differences. When p = ∞, the lp norm is the l∞ norm, which mainly measures the maximum absolute value of the vector elements. In conclusion, the lp norm is a commonly used regularization term: the l2 norm ‖ω‖2 tends to balance the components of ω as much as possible, i.e. the non-zero components are as dense as possible, while the l0 norm ‖ω‖0 and l1 norm ‖ω‖1 tend to make ω as sparse as possible, i.e. the number of non-zero components is as small as possible.
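The lp norms discussed above can be computed directly from the definition; the following minimal sketch (not from the paper) covers the l0, l1, l2, and l∞ cases for a small example vector.

```python
import numpy as np

def lp_norm(w, p):
    """l_p norm of w: (sum_i |w_i|^p)^(1/p); the l0 'norm' counts non-zeros."""
    w = np.asarray(w, dtype=float)
    if p == 0:               # l0: number of non-zero elements (not a true norm)
        return float(np.count_nonzero(w))
    if np.isinf(p):          # l_inf: maximum absolute value of the elements
        return float(np.max(np.abs(w)))
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

w = [3.0, -4.0, 0.0]
print(lp_norm(w, 0))       # 2.0 (two non-zero components)
print(lp_norm(w, 1))       # 7.0 (Manhattan / sum of absolute values)
print(lp_norm(w, 2))       # 5.0 (Euclidean)
print(lp_norm(w, np.inf))  # 4.0 (largest absolute component)
```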

The sparsity of the kernel

Sparsity-regularized multiple kernel learning has been proposed in [33-36]. Dong et al. propose a simple multiple kernel learning framework for complicated data modeling, where randomized multi-scale Gaussian kernels are employed as base kernels and an l1-norm regularizer is integrated as a sparsity constraint on the solution. Sparsity refers to the proportion of non-zero elements: if non-zero elements predominate, the vector is dense; if zero elements predominate, it is sparse. We introduce the concept of the sparsity of the kernel, which refers to the number of kernels used. Because of a sparse constraint, useful kernels may be lost during multiple kernel learning optimization, so the lp norm is adopted to retain them. The sparsity constraint is implemented by changing the weights: the lp norm is used as a weight constraint so that the weights of useful kernels are increased and those kernels are not lost, while, conversely, the weight of a useless or even counterproductive kernel is reset to zero.

Our approach

Architecture

In local deep multiple kernel learning, multiple kernels are combined in each local space so that the advantages of each kernel are exploited. The MKL architecture diagram is shown in Fig 2.
Fig 2

Multiple kernel learning.

The number of basic kernels is m and {k1, k2, …, km} are the basic kernels. Each kernel has a local kernel weight, respectively {β1, β2, …, βm}, and K is the combined kernel. Our model is built from several MKL components, and MKL is its core element. Since samples consist of multiple features, we propose a multiple kernel k-means clustering method to make the clustering results more reliable. Samples are divided into g groups by clustering, and within each group they lie closer together. Following the localization idea, we cluster samples into groups before the first layer of the network and then optimize the local kernel weights in each group. The purpose of grouping is to make full use of the feature similarity and diversity among samples, making the learning method applicable to a wider range of sample datasets. We therefore use a group-based local deep multiple kernel learning method. In our GLDMKL model, the output of the previous layer is used as the input of the next layer to construct a DMKL network, and a DMKL process is performed in the local space of each group. Our local deep multiple kernel learning has an adaptive structure based on the actual data: in the local space of each group, the number of layers may differ, and layer growth stops when the highest classification accuracy over several successive layers is unchanged. This prevents the model from growing endlessly, wasting time and storage space, and effectively reduces model complexity. Our model sets an initial value for each candidate kernel hyperparameter and adjusts it with a grid search method, avoiding manual selection of kernel hyperparameters before the learning process. The kernel weight parameters are initialized by the gate function to randomly chosen, relatively small values, and are adjusted under the lp norm constraint to obtain non-sparse results that avoid losing useful kernels.
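The combined kernel of Fig 2, K = β1 k1 + … + βm km, can be sketched as follows (a minimal illustration, not the paper's implementation; the RBF hyperparameters and weights are made-up values):

```python
import numpy as np

def rbf(X, Y, gamma):
    """RBF basic kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def combined_kernel(X, Y, gammas, betas):
    """Weighted sum of basic kernels, K = sum_m beta_m * k_m (as in Fig 2)."""
    return sum(b * rbf(X, Y, g) for b, g in zip(betas, gammas))

X = np.random.RandomState(0).randn(5, 3)  # 5 toy samples with 3 features
gammas = [0.1, 1.0, 10.0]                 # illustrative RBF hyperparameters
betas = [0.5, 0.3, 0.2]                   # illustrative local kernel weights
K = combined_kernel(X, X, gammas, betas)
print(K.shape)                            # (5, 5)
print(np.allclose(np.diag(K), 1.0))       # True: each RBF has k(x, x) = 1
```

Because the weights sum to one and every RBF kernel equals 1 on the diagonal, the combined kernel also has a unit diagonal, which makes the weighting easy to check.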
Weights adjust the proportion of the basic kernels, and we reset the weights of useless kernels to zero. If the weight parameter settings are not appropriate, our model learning algorithm can adjust the combined kernel structure of the next layer by changing the kernel weights. Our GLDMKL architecture is shown in Fig 3. Before the learning process, the multiple kernel k-means clustering algorithm clusters the training data Data into g groups {D1, D2, …, Dg}; the final numbers of layers in the groups are respectively {L1, L2, …, Lg}. The number of basic kernels is m, {k1, k2, …, km} are the basic kernels, and each kernel has a local kernel weight, respectively {β1, β2, …, βm}, as shown in detail in Fig 2. K(L) denotes the combined kernel at layer L in group g. The g groups correspond to g SVM classifiers, and there are likewise g output values.
Fig 3

GLDMKL architecture.

In our GLDMKL architecture, closer samples are assigned to the same group. In the local space of each group, an SVM classifier with a multi-layer structure is trained separately. The combined kernel in each layer is a weighted sum of several basic kernels; in each local space, the output value of the weighted sum of basic kernels in the previous layer is used as the input of the combined kernel in the next layer, while the input of the actual learning process is still a sample. The multi-layer MKL forms an SVM, and samples are classified at the same time. As the model layers grow, the SVM classifier is updated; the cutoff condition for layer growth is that the highest classification accuracy remains unchanged over several consecutive layers, which yields the final SVM classifier model. During testing, we use the clustering algorithm to determine which group a sample belongs to, make the classification prediction with the trained classifier model of the corresponding group, and calculate the classification accuracy at each layer. The classical DMKL model has a fixed number of layers and obvious limitations: it takes neither intra-class diversity nor inter-class correlation into account, and its flexibility is poor. The adaptive number of layers not only increases model flexibility but also improves classification accuracy, and since different datasets yield different numbers of layers, training and prediction time are reduced. The key to the adaptive layer is the cutoff condition: as long as the highest classification accuracy remains unchanged over several layers, layer growth stops. We therefore adopt a group-based local adaptive deep multiple kernel learning method.
Because sparse constraints can lose useful kernels, we use lp norm constraints on the kernels and obtain non-sparse results to avoid losing them. We therefore propose a group-based local adaptive deep multiple kernel learning architecture with the lp norm to solve these problems.

Clustering

In our GLDMKL method, a clustering process precedes the training of the SVM classifiers, so we need to design an effective clustering algorithm for GLDMKL. Since the training samples are represented by multiple features, traditional clustering algorithms cannot cluster a wide variety of samples accurately. In this section, we design a multiple kernel k-means clustering algorithm. Fig 4 shows a complete flow chart of multiple kernel k-means clustering.
Fig 4

Multiple kernel k-means clustering.

We weight the sum of m RBF kernels to form a combined kernel whose entries become the elements of the sample distance matrix, and the k-means clustering algorithm is used to cluster the input samples. The weight of each RBF kernel is obtained by centered kernel-target alignment (CKTA) [37], a kernel alignment method that performs well in evaluating kernels. The centered RBF kernel matrix is given by Eq (2), where 1 denotes the vector with all entries equal to one, I denotes the identity matrix, N is the total number of samples, k_i is the i-th RBF kernel matrix, and k_i^c is the centered i-th RBF kernel matrix. We then calculate the weight of each RBF kernel matrix as in Eq (3), where 〈⋅, ⋅〉 denotes the Frobenius product, 〈A, B〉 = Tr[A^T B], and y is the vector of {−1, +1} labels of the samples. The m RBF kernel matrices are combined into a single matrix, as shown by Eq (5). The combined matrix K is taken as the distance matrix between samples, and k-means clustering is then run on K. The clustering error can be calculated by Eq (6), where G is the number of groups, C_g denotes the clustering center of group g, and k(x_i, x_j) is the distance-matrix entry between samples x_i and x_j. Finally, the clustering result for sample x_i is calculated by Eq (8). In summary, the multiple kernel k-means clustering algorithm proceeds as follows: 1) Start with CKTA for kernel alignment and calculate the kernel weights η. 2) Use m RBF kernels with different parameters. 3) Take the weighted multiple kernel combination between sample pairs as the elements of the distance matrix; then apply the conventional k-means clustering algorithm according to the distance differences between sample pairs. 4) Given G groups as the initial condition, equivalent to G clustering centers, run multiple iterations until the samples are clustered into G groups, so that samples in each group lie closer together.
5) The well-grouped samples are used as input to the subsequent learning process.
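The weighting step above can be sketched as follows (a hedged illustration, not the paper's code: the weight formula follows standard centered kernel-target alignment, while the data, labels, and RBF hyperparameters are made up, and the final k-means step on the combined matrix is omitted):

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix: Kc = (I - 11^T/N) K (I - 11^T/N)."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def ckta_weight(K, y):
    """Centered kernel-target alignment of K with the label kernel y y^T."""
    Kc = center_kernel(K)
    Ky = np.outer(y, y)                       # ideal target kernel
    num = np.sum(Kc * Ky)                     # Frobenius product <Kc, yy^T>
    return num / (np.linalg.norm(Kc) * np.linalg.norm(Ky))

# Made-up data: two well-separated blobs with +/-1 labels.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 5.0])
y = np.array([1] * 10 + [-1] * 10)

gammas = [0.01, 0.1, 1.0]                     # illustrative RBF hyperparameters
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
kernels = [np.exp(-g * d2) for g in gammas]

eta = np.array([ckta_weight(K, y) for K in kernels])
eta = eta / eta.sum()                         # normalized kernel weights
K_combined = sum(e * K for e, K in zip(eta, kernels))
print(K_combined.shape)                       # (20, 20): k-means then runs on this
```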

GLDMKL

Our model is inspired by soft-clustering-based local multiple kernel learning [38], but our model addresses multi-layer learning problems. To make it easier to understand, we briefly describe the process of training and testing samples at layer l in GLDMKL, as shown in Fig 5.
Fig 5

The training and testing process in GLDMKL.

Definition 1. Suppose we have N training samples, with training dataset D = {(x_i, y_i) | i = 1, 2, …, N}, where x_i represents the i-th training sample and y_i ∈ {−1, +1} is its label; x_i can be thought of as a vector of d features. In our GLDMKL, a clustering process precedes classification, so the discriminant function f^(l)(x) at layer l is defined as Eq (9), where ω and b are the model parameters and β_{c(x),m}^(l) represents the weight of the m-th kernel of the group c(x) in which sample x is located at layer l. Definition 2. By modifying the original SVM classifier with this new discriminant function f^(l)(x), the training process is implemented by solving the optimization problem Eq (10), where C is the penalty factor, ξ_i are the slack variables, and p denotes the lp norm used to constrain the weights. Definition 3. As in the original SVM, the Lagrangian multiplier method is used to solve the dual of Eq (10). We first fix the kernel weights β and minimize Eq (10); the Lagrangian objective function is given by Eq (11), where α_i ≥ 0 and γ_i ≥ 0 are Lagrangian multipliers. Setting the partial derivatives of L with respect to ω, b, and ξ to zero gives Eqs (12)–(14). Using Eqs (12)–(14) to eliminate ω, b, and ξ, we obtain the dual of the original optimization problem Eq (10) as Theorem 1. We can thus rewrite the discriminant function Eq (9) as Eq (16), where the sign function classifies f^(l)(x) into {−1, +1}. Definition 4. The objective Eq (15) is minimized with the kernel weights β fixed. If the kernel weights β are added to the dual optimization problem, it becomes a max-min problem, Eq (18). J is a multi-objective function of the coefficients α and the local kernel weights β. When β is fixed, minimizing J means minimizing the global classification error and maximizing the margin.
When α is fixed, maximizing J means maximizing sample similarity within each group while minimizing sample similarity between groups. As in canonical MKL, we alternately optimize α and β to solve the max-min problem. In the first phase, we fix β and optimize α; this problem is a canonical SVM with a specific combined kernel that can be solved via Theorem 1. In the second phase, we fix α and optimize β, so J can be rewritten as Eq (19), where α* is the optimized α and β_{g,m}^(l) is the local kernel weight of the m-th kernel of group g in which sample x is located at layer l. Note that the solution of J(β) in Eq (19) is independent of the latter term, so the problem is equivalent to solving Eq (21), where β_{g,m}^(l) represents the weight of the m-th basic kernel in group g at layer l and the remaining term is shorthand for the dual formula that optimizes α. This is a quadratic non-convex problem, and solving such a quadratic programming problem requires expensive computation. Inspired by [16, 39], we use a gated model to represent β. Definition 5. The gate function is designed as shown in Eq (22), where A_{g,m}^(l) is the kernel alignment [40] of the m-th kernel for sample group g at layer l, and a and b are the parameters of the gate function, so β can be calculated by Eq (23). The gate function transforms the non-convex problem into a convex one, so an optimum can be found by the gradient ascent method and our method is guaranteed to converge. As Eq (22) shows, the lp norm is used to constrain β; we can tune p to change the sparseness of the kernels according to the dataset, thereby changing the number of kernels used so that useful kernels are fully exploited. After the gate function is used to represent the local kernel weights, J(β) becomes a convex function of a and b, so we can optimize a and b by gradient ascent to maximize J(β).
The partial derivatives of J(β) with respect to a and b are given by Eqs (25) and (26), where the cases m′ = m and m′ ≠ m yield different terms. We update a and b with the gradient ascent method and then update β from a and b as shown in Eqs (27) and (28), where λ and μ are the step sizes, which can be obtained by a line search as in [41] or fixed to a small constant. In this way, α and β are optimized alternately until the termination criterion is met. We use the duality gap as the termination criterion, written as Eq (29), where ε is the preset tolerance threshold.
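The alternating update of the gate parameters can be sketched as follows. Since Eqs (22)–(28) are not reproduced in this extract, the gate form below is an assumption: a sigmoid gate σ(a·A_m + b) over the kernel alignments, normalized under the lp-norm constraint, with finite-difference gradients standing in for the analytic gradients of Eqs (25) and (26).

```python
import numpy as np

def gate_weights(alignments, a, b, p=2.0):
    """Local kernel weights from kernel alignments through a gate.

    Assumed form (not the paper's exact Eq (22)): a sigmoid gate
    sigma(a*A_m + b) whose outputs are normalized so that
    ||beta||_p = 1, matching the role the text gives the gate:
    smooth in (a, b) and lp-constrained to avoid kernel sparsity.
    """
    g = 1.0 / (1.0 + np.exp(-(a * alignments + b)))   # gate activations
    return g / np.sum(g ** p) ** (1.0 / p)            # lp normalization

def ascend(J, a, b, lr=0.1, steps=50, eps=1e-5):
    """Maximize J(a, b) by gradient ascent; finite-difference gradients
    stand in for the analytic partial derivatives of Eqs (25)-(26)."""
    for _ in range(steps):
        da = (J(a + eps, b) - J(a - eps, b)) / (2 * eps)
        db = (J(a, b + eps) - J(a, b - eps)) / (2 * eps)
        a, b = a + lr * da, b + lr * db
    return a, b
```

In the full method, `J` would be the dual objective J(β) of Eq (21) evaluated at `gate_weights(alignments, a, b, p)`; the sketch only shows the update machinery.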

GLDMKL learning algorithm

Given a set of training data D = {(x_i, y_i) | i = 1, 2, …, n}, where x_i is the sample feature vector and y_i ∈ {−1, +1} is the sample class label, our goal is to train the deep multiple kernel networks and the SVM classifiers from the labeled training data. After the training samples are grouped by multiple kernel k-means clustering, the original local combined kernel of the first layer takes the form of Eq (30), where one factor represents the weight of the m-th kernel of the group c(x) containing sample x in the first layer, and the other denotes the m-th kernel between samples x_i and x_j in the first layer. To express the relationship between the combined kernels in deep multiple kernel learning more easily, we simplify the local combined kernel of group g in the first layer into the form of Eq (31). The next step is to relate the combined kernel of one layer to that of the next; before this, we must understand the principle of deep kernel learning.

Definition 6 The following is the principle formula of deep kernel learning, where x_i and x_j are input feature vectors, Φ(·) is a feature mapping function applied L times, and K(·,·) is the final-layer kernel, i.e., the combined kernel of the SVM classifier when the cutoff condition is reached. From Definition 6, the classifier models of the 1- to L-layer structures are used for classification, the errors are calculated, and the accuracies are obtained. Once the accuracy remains unchanged over several consecutive layers, the growth of layers can be stopped.

Definition 7 The following is the derivation Eq (33) for deep multiple kernel learning, where K(·,·) is the output value of the combined kernel in layer l. Our architecture is based on the theory of local learning and performs GLDMKL: the output value of the combined kernel in layer l − 1 is used as the input to the combined kernel in layer l, and the local deep multiple kernel learning derivation equation in group g is given by Eq (34).
The terms of Eq (34) are, respectively, the output value of the combined kernel of group g in layer l, the weight of the m-th basic kernel of group g in layer l, and the m-th basic kernel of group g in layer l. The final decision function of the proposed framework in group g in layer l is defined as Eq (35). Adaptive deep multiple kernel learning is performed in each group, so a decision function must be evaluated in each layer to predict the classification accuracy. Note that the final layer of the decision function may differ between groups because prediction is performed independently in each group. In layer l − 1, the ideas of Definition 1 through Definition 5 of GLDMKL are applied, optimizing the support vector parameter α and the local kernel weight β in turn. The gradient ascent method is used to update the weights, and the fixed weights are then used to compute the support vector parameters until the cutoff condition is satisfied, yielding the classification accuracy of the local SVM classifier in each group. The output value of the combined kernel in layer l − 1 is then used as the input to the combined kernel in layer l, and the GLDMKL learning procedure is repeated; the classification accuracy is calculated in each layer at the same time. If the highest classification accuracy remains unchanged over successive layers, local deep multiple kernel learning stops immediately. To obtain coefficients that minimize the true risk of the decision function, gradient ascent is used to optimize the local kernel weights. In the local space in layer l, J(β) from Eq (21) is used, and we maximize J(β) by an iterative gradient-ascent process. First, we calculate the gradient of each weight in each local space in each layer as in Eq (36). Then, gradient ascent is used to update all local kernel weights of local deep multiple kernel learning in each layer as in Eq (37).
Here η is the step size, so we only need ∇J(β) to compute the local kernel weights, and ∇J(β) is obtained from the gate function in Definition 5; the specific operations are given in Eqs (25)–(28). In the model learning algorithm, we apply the alternating optimization used in GLDMKL to learn the decision function coefficient α and all local kernel weights. This is done as follows: 1) fix β and solve for α with the standard method; 2) fix α and use gradient ascent to solve for β until the optimization cutoff is met. The classification accuracy is evaluated in each layer to determine whether the growth of model layers should stop: if the highest classification accuracy does not change over a fixed number of layers, layer growth stops. The entire process of our model learning algorithm is described in Algorithm 1. In Algorithm 1, a layer of the GLDMKL architecture is constructed by one iteration of steps 5 to 15; i denotes the current number of layers, and j denotes the number of layers over which the maximum accuracy A has remained unchanged, serving as the condition for stopping layer growth. Steps 7 to 10 perform local multiple kernel learning with the lp norm in the current layer l, optimizing α and β in turn until the cutoff condition is reached. At the same time, the classification accuracy A of the SVM classifier in each group is calculated, and the highest classification accuracy A is updated. If the best accuracy does not change over the fixed number of layers, layer growth stops; and if the number of iterations exceeds the preset maximum, the iteration also stops.
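The layer-growth stopping rule just described can be sketched as a small helper. The function and its parameter names are illustrative, not from the paper; it implements the same rule: stop once the running best accuracy has not improved for a fixed number of consecutive layers (the flag j) or the layer cap L is reached.

```python
def should_stop(acc_history, patience=3, max_layers=20):
    """Stopping rule for layer growth (a sketch): `acc_history` holds the
    per-layer classification accuracies; return True when the best
    accuracy has been stale for `patience` consecutive layers or the
    maximum layer count has been reached."""
    if len(acc_history) >= max_layers:
        return True
    best, stale = 0.0, 0
    for acc in acc_history:
        if acc > best:
            best, stale = acc, 0   # a new best accuracy resets the flag
        else:
            stale += 1             # best accuracy unchanged this layer
    return stale >= patience
```

The defaults mirror the experimental setup reported later (patience of 3 layers, at most 20 layers).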
Algorithm 1 Group-based local adaptive deep multiple kernel learning algorithm
Input:
  D: dataset
  m: number of candidate kernels
  k: initial parameters of each kernel
  g: number of groups
  l: maximum number of layers in which the best accuracy does not change
  L: maximum number of iteration layers
Output:
  Final classification model M
 1: Randomly select 50 percent of the samples from the entire dataset D as training samples;
 2: Divide the training samples into g groups with clustering;
 3: Initialize best accuracy A = 0;
 4: Initialize current iteration i = 0 and flag j = 0;
 5: repeat
 6:   Use the grid search method to adjust the initial parameters k;
 7:   repeat
 8:     Initialize gate model parameters a, b with small random numbers;
 9:     Alternately optimize α and β with the GLMKL algorithm in layer l;
10:   until the termination criterion of Eq (29) is met;
11:   Update the best accuracy A;
12:   if A does not change then j++;
13:   Use the output of the combined kernel in the previous layer as the input of the combined kernel in the next layer;
14:   i++;
15: until (i >= L or j >= l)
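The layer-wise kernel composition that Algorithm 1 iterates (step 13, corresponding to Eq (34)) can be sketched as follows. Applying base kernels element-wise to the previous layer's kernel values is an assumption of this sketch, as are the example kernels; the paper's exact base-kernel forms follow Table 1.

```python
import numpy as np

def combined_kernel(K_prev, betas, base_kernels):
    """One GLDMKL layer for a single group: the previous layer's combined
    kernel matrix is passed through each base kernel and the results are
    mixed by the local kernel weights (the structure of Eq (34))."""
    return sum(b * k(K_prev) for b, k in zip(betas, base_kernels))

# Hypothetical two-kernel layer: a degree-1 polynomial and a tanh kernel
# applied to the kernel values, mixed with weights 0.6 / 0.4.
base = [lambda K: 1.2 * K + 2.1,             # polynomial, n = 1
        lambda K: np.tanh(1.2 * K + 2.1)]    # tanh (sigmoid) kernel
K0 = np.array([[1.0, 0.5],
               [0.5, 1.0]])                  # first-layer combined kernel
K1 = combined_kernel(K0, [0.6, 0.4], base)   # second-layer combined kernel
```

Stacking this step L times, with the weights of each layer re-optimized as in Algorithm 1, yields the deep combined kernel fed to the final SVM.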

Experiments

Setup

The main implementation of GLDMKL is written in Python, and related algorithms call library functions. All experiments were run on a PC with a 2.2 GHz Intel Core i5-5200U CPU, 12 GB of RAM, and the Windows 7 operating system, together with a GTX 1080 GPU server. Among kernel settings, RBF and polynomial kernels are the most commonly used functions for MKL methods. In our work, we have selected four basic kernels plus the arc-cosine kernel. The detailed kernel parameter settings are shown in Table 1.
Table 1

Kernel parameters setting.

Kernel            Equation                                                              Parameters
Polynomial        k(xi, xj) = (γ·⟨xi, xj⟩ + c)^n                                        γ = 1.2, c = 2.1, n = 1
Laplacian         k(xi, xj) = exp(−‖xi − xj‖ / σ)                                       σ = 1.2
Tanh              k(xi, xj) = tanh(β·⟨xi, xj⟩ + θ)                                      β = 1.2, θ = 2.1
RBF               k(xi, xj) = exp(−‖xi − xj‖² / (2σ²))                                  σ = 0.9
Arc-cosine [42]   k_n(xi, xj) = (1/π)·‖xi‖^n·‖xj‖^n·J_n(θ), where                       n = 0
                  J_n(θ) = (−1)^n (sin θ)^(2n+1) (∂/(sin θ ∂θ))^n ((π − θ)/sin θ)
                  and θ = cos⁻¹(⟨xi, xj⟩ / (‖xi‖·‖xj‖))
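The first four base kernels of Table 1, with their listed default parameters, can be sketched directly (the arc-cosine kernel is omitted here for brevity; the Laplacian is written with the Euclidean norm, an assumption since Table 1 leaves the norm implicit):

```python
import numpy as np

def sq_dists(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    s = np.sum(X ** 2, axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)

def poly_kernel(X, gamma=1.2, c=2.1, n=1):
    """Polynomial kernel with the Table 1 defaults."""
    return (gamma * X @ X.T + c) ** n

def laplacian_kernel(X, sigma=1.2):
    """Laplacian kernel; Euclidean norm assumed."""
    return np.exp(-np.sqrt(sq_dists(X)) / sigma)

def tanh_kernel(X, beta=1.2, theta=2.1):
    """Tanh (sigmoid) kernel with the Table 1 defaults."""
    return np.tanh(beta * X @ X.T + theta)

def rbf_kernel(X, sigma=0.9):
    """RBF kernel with the Table 1 default sigma."""
    return np.exp(-sq_dists(X) / (2.0 * sigma ** 2))
```

Each function returns the full n × n Gram matrix over the rows of `X`, which is the form consumed by the combined-kernel construction of Eq (30).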
The five classical methods used in the comparison experiments are shown in Table 2.
Table 2

Five classical comparison methods.

Methods    Detail                                                                          Year
2LMKL      Two-layer multiple kernel learning algorithm proposed by Zhuang [12]            2011
DMKL       Deep multiple kernel learning algorithm proposed by Strobl [13]                 2013
MLMKL      Backpropagation multi-layer multiple kernel learning proposed by Rebai [14]     2016
SA-DMKL    Adaptive deep multiple kernel learning algorithm [15]                           2019
DWS-MKL    Depth-width-scaling multiple kernel learning algorithm [20]                     2020
To simplify the experiments, we initialize the maximum iteration number maxIter in parameter optimization to 100, set the maximum number of model layers L to 20, and set the maximum number of layers l over which the best accuracy may remain unchanged to 3.

Data

UCI datasets

We perform a set of extended experiments to evaluate the performance of the proposed GLDMKL algorithm in the classification task with small sample sizes and low dimensions. Several algorithms have been tested on six real-world datasets: Liver, Breast, Sonar, Australian, German, Monk. Table 3 gives a detailed description of the usage datasets.
Table 3

Selected datasets in UCI.

Datasets     Dimensions   Samples
Liver        6            345
Breast       9            286
Sonar        60           208
Australian   14           690
German       20           1000
Monk         6            432
Among them, Liver and German are datasets with relatively complex data. Sonar and Australian are datasets with relatively simple data. Breast and Monk are datasets with very simple data.

Caltech-256 datasets

Caltech-256 is an image object recognition dataset containing 30,608 images in 256 object categories, each of which has at least 80 images. We select the Caltech-256 dataset to evaluate the performance of our GLDMKL approach in classification tasks with large sample sizes and high dimensions. In our experiments, we randomly select five categories with similar shapes: bowling-ball, car-tire, desk-globe, roulette-wheel, and sunflower-101. Before training on the image datasets, feature extraction is required; the FFT is used as the descriptor of the image dataset. Irrelevant image features are removed, and the remaining features are then lightly processed to form the final feature vectors.

Data pretreatment

First, each dataset is preprocessed. Samples are normalized so that the feature values lie in the range 0 to 1, preventing numerical overflow during the experiments. Next, we shuffle the samples and divide them into two halves: (1) 50% of the examples are used for training (establishing a deep multiple kernel model with the best parameters), and (2) the remaining 50% are used as test data (evaluating the performance of the resulting model). Finally, the six UCI datasets and the five image datasets are used to train our model, and the classification model is tested with the same test datasets. To ensure the reliability of the results, we run each dataset ten times and take the best classification result.
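The preprocessing above can be sketched in a few lines (the function name and seed handling are illustrative; the paper repeats the whole procedure ten times and keeps the best result):

```python
import numpy as np

def preprocess(X, y, seed=0):
    """Min-max scale every feature to [0, 1], shuffle, and split the
    samples 50/50 into training and test halves, as described above."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)     # guard constant features
    Xn = (X - lo) / span
    idx = np.random.default_rng(seed).permutation(len(X))
    half = len(X) // 2
    tr, te = idx[:half], idx[half:]
    return Xn[tr], y[tr], Xn[te], y[te]
```

Scaling uses the statistics of the full dataset here, matching the description; a stricter protocol would scale with training-set statistics only.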

Metrics

Common classification performance metrics include accuracy, generalization ability, and training time. Generalization ability is hard to quantify directly, so it is not used. Because the experimental environments differ between methods, training time would not fairly reflect the merits of our method, so it is not used either. Classification accuracy, by contrast, directly reflects the classification performance of a method and is unaffected by the experimental environment, so it is adopted as the performance indicator for comparison with other methods. According to Eq (38), the learning performance of the GLDMKL method is evaluated by the test accuracy, defined as the ratio of the number of correctly classified samples to the total number of samples, where TP is the number of true positives, TN is the number of true negatives, and N is the total number of samples in the test dataset.
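Eq (38) amounts to the following one-liner for labels in {−1, +1}:

```python
def accuracy(y_true, y_pred):
    """Eq (38): (TP + TN) / N, i.e. the fraction of test samples whose
    predicted label in {-1, +1} matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```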

The experimental results in UCI datasets

We have successfully completed three experiments. The first is to compare the classical DMKL method with our GLDMKL method in classification performance. The second experiment shows the effect of the number of clustering groups in our GLDMKL method on classification performance. The last experiment shows the effect of layers on the classification performance.

Comparison of classification performance

The purpose of this experiment is to evaluate the performance of the DMKL algorithm using our GLDMKL method and other classical DMKL methods on the UCI dataset. Therefore, we evaluate the following algorithms: 2LMKL, DMKL, MLMKL, SA-DMKL, DWS-MKL and our GLDMKL. Table 4 shows the detailed results of the classification performance for the different algorithms. Among them, data with the highest classification accuracy for the same dataset is highlighted in bold.
Table 4

Comparison of best classification performance(%).

Datasets     2LMKL   DMKL    MLMKL   SA-DMKL   DWS-MKL   GLDMKL
Liver        63.43   69.01   71.80   75.65     74.85     80.00
Breast       96.53   96.59   97.21   91.92     97.08     99.99
Sonar        83.75   83.94   83.84   89.42     84.79     99.99
Australian   82.11   84.40   85.42   82.03     85.51     99.99
German       72.22   72.02   75.06   78.50     73.60     80.52
Monk         96.26   96.62   96.89   97.55     99.07     99.99
By comparing the results of SA-DMKL and DWS-MKL with the other DMKL methods (2LMKL, DMKL and MLMKL), we find that SA-DMKL and DWS-MKL perform better than the other DMKL methods. For example, SA-DMKL is superior to the other algorithms on three datasets (Liver, Sonar, and German), and DWS-MKL is superior on three datasets (Breast, Australian, and Monk). This shows that an adaptive layer structure can achieve higher performance than a fixed structure. It can be seen from Table 4 that our GLDMKL method has better classification performance on all six datasets than the other methods. Moreover, the classification accuracy on some datasets reaches 99.99%, which shows that the idea of our group-based local adaptive deep multiple kernel learning is feasible and its effect is significant. Among the other methods, 2LMKL shows the worst performance on the Liver, Sonar, and Monk datasets, while SA-DMKL shows the worst performance on the Breast dataset (very simple data) and the Australian dataset (relatively simple data). These results show that our GLDMKL method is superior to the classical DMKL methods and can be widely adapted to a variety of datasets.

The effect of group numbers on classification performance

In this experiment, we explore the effect of the number of groups on the classification performance of our method. To simplify the experiment, we cluster into 1 group, 2 groups, 5 groups, 7 groups, and 10 groups for comparison. As can be seen from Fig 6, for most datasets the classification performance improves as the number of groups increases. The improvement from multi-grouping is most obvious for the Monk dataset, indicating that datasets with very simple data benefit particularly, which is consistent with the actual situation.
Fig 6

Performance comparison(%) of different groups.

For the German dataset, with its relatively complex data, the classification accuracy increases as the number of groups increases. For the Sonar dataset, the classification accuracy is highest when the number of groups is 5 or 7; increasing the number of groups further greatly reduces it, indicating that this dataset is not suited to too many groups. So grouping into 5 or 7 groups is the best choice for the Sonar dataset, which has relatively simple data.

The effect of layer numbers on classification performance

In this experiment, we evaluate the classification performance of different DMKL methods at each layer (DMKL, MLMKL, SA-DMKL, DWS-MKL and our GLDMKL) to explore the effect of layers on classification performance. The DMKL, MLMKL, DWS-MKL and GLDMKL methods are all trained up to three layers to analyze the effectiveness of multi-layer structures. Since previous work reports experimental results for three layers only, we also report three layers to facilitate the comparison. The SA-DMKL method does not report a per-layer comparison, so its number of layers is denoted *-layer. Table 5 shows the detailed classification results of the different methods; for each dataset, the highest classification accuracy is highlighted in bold.
Table 5

Comparison of classification performance(%) at each layer.

Dataset      DMKL                      MLMKL                     SA-DMKL   DWS-MKL                   GLDMKL
             1-layer  2-layer  3-layer  1-layer  2-layer  3-layer  *-layer   1-layer  2-layer  3-layer  1-layer  2-layer  3-layer
Liver        69.53    69.53    69.01    69.53    70.17    71.80    75.65     74.85    72.51    72.51    74.00    80.00    76.92
Breast       94.92    96.56    96.59    96.89    96.83    97.21    91.92     96.78    97.08    97.08    99.99    92.59    72.73
Sonar        82.98    84.13    83.94    83.94    84.23    83.84    89.42     84.76    84.76    84.76    99.99    99.99    74.04
Australian   83.56    84.26    84.40    85.13    85.56    85.42    82.03     84.35    84.64    85.51    99.99    99.99    99.99
German       70.56    71.48    72.02    73.80    74.56    75.06    78.50     72.00    72.00    73.60    80.52    74.25    79.01
Monk         96.25    96.52    96.62    96.89    96.62    96.89    97.55     99.07    99.07    99.07    99.99    99.99    99.54
First, by comparing the results of the DMKL and MLMKL methods, we find that the multi-layer structure of DMKL and MLMKL can slightly improve the classification accuracy, but the gain is not obvious, and sometimes the accuracy even decreases. For the DMKL method, the classification accuracy decreases as the number of layers increases on the Liver dataset, which has relatively complex data. Nevertheless, we note that increasing the number of layers still has a certain positive effect on classification accuracy. Secondly, as can be seen from Table 5, the accuracy generally increases with the number of layers in the DMKL and MLMKL methods, and our GLDMKL method improves the classification accuracy more markedly than these two methods. For the Australian dataset, with its relatively simple data, our method attains the highest classification accuracy at every layer; the accuracy is also improved noticeably on the Liver and German datasets, which have relatively complex data. Finally, comparing our GLDMKL method with the other classical DMKL methods, our method can complete the classification task in a shorter time. For the Breast and German datasets, it achieves the best classification accuracy in the first layer, and the improvement is very obvious. Our GLDMKL method thus shortens the number of layers needed, which greatly reduces training time and reaches the best classification effect faster. Perhaps because these datasets are quite simple, our method can be both faster and more accurate in the classification task.

The experimental results in Caltech-256 datasets

We have completed five experiments, one for each class used as the positive class: bowling-ball, car-tire, desk-globe, roulette-wheel, and sunflower-101. Each experiment measures the classification performance at each layer. To simplify the experiments, we again cluster into 1 group, 2 groups, 5 groups, 7 groups, and 10 groups for comparison, and to describe the effect of layers we report the classification accuracy of the first five layers. Tables 6–10 show the detailed results. The highest classification accuracy is highlighted in bold, and accuracy values that are invalid because the cutoff condition has already been reached are underlined.
Table 6

Classification performance(%) at each layer where the bowling-ball class is positive.

Layers    1 group   2 groups   5 groups   7 groups   10 groups
1-layer   87.22     86.55      90.00      90.00      80.00
2-layer   87.51     86.55      86.97      90.00      61.76
3-layer   87.01     86.63      86.97      90.00      85.29
4-layer   89.05     90.00      69.27      86.67      63.73
5-layer   89.79     90.00      84.90      87.14      75.49
Table 10

Classification performance(%) at each layer where the sunflower-101 class is positive.

Layers    1 group   2 groups   5 groups   7 groups   10 groups
1-layer   87.57     87.11      90.00      88.57      90.00
2-layer   85.56     81.48      90.00      88.57      90.00
3-layer   85.56     85.59      90.00      84.29      90.00
4-layer   85.56     85.59      90.00      84.29      90.00
5-layer   85.56     85.59      90.00      84.29      90.00
In the first experiment, the bowling-ball class is used as the positive class. The highest classification accuracy of 90.00% is achieved with 5 groups and with 7 groups, and is maintained over the first three layers, showing that with 5 or 7 groups the bowling-ball class can quickly be distinguished from the other classes. With 2 groups, as the number of layers increases, the classification accuracy reaches its maximum at the fourth layer and then remains unchanged. It can be concluded that more layers are needed to reach the highest classification accuracy when the number of groups is small; and when the number of groups is large, the classification accuracy may decline, indicating that 5 and 7 groups are suitable when the bowling-ball class is positive. In the second experiment, the car-tire class is used as the positive class. The highest classification accuracy of 89.99% is achieved with 10 groups. With 7 groups, the classification accuracy does not change with the number of layers and remains at 88.49%. Only three layers are needed to reach the cutoff condition, indicating that the car-tire class can quickly be distinguished from the other classes. As the number of groups increases, the highest classification accuracy per grouping tends to increase, so it is easier to achieve the highest classification accuracy when the number of groups is relatively large. In the third experiment, the desk-globe class is used as the positive class. The highest classification accuracy of 90.00% is achieved with 2 groups and with 7 groups. With 2 groups, the highest classification accuracy is reached at the fourth layer and then decreases; with 7 groups it is reached at the fifth layer; with 5 groups the classification accuracy remains relatively high at every layer.
With 10 groups, the classification accuracy tends to increase with the number of layers but has not reached its highest value by the fifth layer. It can be concluded that more layers are needed to reach the highest classification accuracy when the number of groups is large. In the fourth experiment, the roulette-wheel class is used as the positive class. The highest classification accuracy of 89.88% is achieved with 2 groups and with 5 groups. With 5 groups, the highest classification accuracy over the first three layers is always 89.88%, so the growth of layers can be stopped. With 2 groups, the highest classification accuracy is reached at the third layer and remains unchanged for three successive layers, meeting the cutoff conditions. When the number of groups is large, only three layers are needed to reach the cutoff condition, indicating that grouping accelerates the classification process but does not necessarily yield the highest classification accuracy. In the last experiment, the sunflower-101 class is used as the positive class. The highest classification accuracy of 90.00% is achieved with 5 groups and with 10 groups, and in both cases the classification accuracy at every layer is 90.00%, so only three layers are needed to reach the cutoff condition. It can be seen that 5 and 10 groups are suitable when the sunflower-101 class is positive.

Discussion

We have carried out three experiments on the UCI datasets. The first shows that the classification performance of our GLDMKL method is better than that of other classical DMKL methods. The second shows that the number of groups has a clear effect on most datasets, and the effect differs across datasets. The third shows that our GLDMKL method can reduce the number of model layers and thus the training and prediction time. We have also carried out five experiments on the Caltech-256 dataset. When the bowling-ball, desk-globe or sunflower-101 class is used as the positive class, the highest classification accuracy is 90.00%; when the car-tire class is positive, it is 89.99%; and when the roulette-wheel class is positive, it is 89.88%. In conclusion, different classification tasks yield different highest classification accuracies.

Validation

Our GLDMKL learning method is superior to other classical DMKL methods in classification accuracy, but certain potential problems remain. First, both training and testing require clustering the samples into multiple groups, and a test sample may not be assigned to the appropriate group model, which can lower the classification accuracy; the testing process must therefore be repeated to obtain the best results. Moreover, group membership is inherently probabilistic: a sample does not necessarily belong to the group it is assigned to. Probabilistic grouping, which better matches the actual situation, could therefore be added to our model, and it is the next major issue we need to address. Furthermore, if the lp norm is not handled properly, it will increase the sparseness of the kernels and likewise lower the classification accuracy. The stability of our model also needs to be considered, for example by analyzing the mean and standard deviation of the classification accuracy. Finally, if the samples are very simple, we can simply treat them as a single group, which has the same effect as the classical DMKL methods.

Conclusion

This paper proposes a new group-based local adaptive deep multiple kernel learning (GLDMKL) method with the lp norm. Our GLDMKL architecture consists of two parts: multiple kernel k-means clustering and local adaptive deep multiple kernel learning. The number of layers is not fixed and grows adaptively according to the actual dataset. Our model learning algorithm utilizes deep kernel learning to build a local deep multiple kernel learning model layer by layer. In the algorithm, samples are divided into groups by the multiple kernel k-means clustering algorithm, and the SVM model parameters and the local kernel weights of the corresponding groups are optimized in turn to fit the model. The hyperparameters of the basic kernels are adjusted by the grid search method. According to the local kernel weights, the proportion of basic kernels in the combined kernel at each layer is changed, and a weight constraint with the lp norm is proposed, so the local kernel weights are adjusted under the lp norm, further controlling the sparseness of the kernels. The experimental results show that our GLDMKL method can test samples in the corresponding grouping model and achieves better performance than other classical DMKL methods on a wide range of datasets. In future work, we plan to integrate more learning technologies into our GLDMKL method, such as localization optimization, changes in distance definitions, deep kernel model optimization, and changes in data dimensions. Our GLDMKL approach will also be implemented in embedded systems.
Please submit your revised manuscript by Aug 20 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Robertas Damasevicius Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. 
The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service. Whilst you may use any professional scientific editing service of your choice, PLOS has partnered with both American Journal Experts (AJE) and Editage to provide discounted services to PLOS authors. Both organizations have experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. To take advantage of our partnership with AJE, visit the AJE website (http://learn.aje.com/plos/) for a 15% discount off AJE services. To take advantage of our partnership with Editage, visit the Editage website (www.editage.com) and enter referral code PLOSEDIT for a 15% discount off Editage services.  If the PLOS editorial team finds any language issues in text that either AJE or Editage has edited, the service provider will re-edit the text for free. Upon resubmission, please provide the following: The name of the colleague or the details of the professional service that edited your manuscript A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file) A clean copy of the edited manuscript (uploaded as the new *manuscript* file) 3. We note that Figures 6, 7, 8, 9, 10 in your submission contain copyrighted images. 
All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: 3.1.         You may seek permission from the original copyright holder of Figures 6, 7, 8, 9, 10 to publish the content specifically under the CC BY 4.0 license. We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text: “I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.” Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission. In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].” 3.2.    
If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license, or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

4. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors used a group-based local adaptive approach to enhance the existing deep multiple kernel learning (DMKL) method and improve classification accuracy. Four basic kernels, the arc-cosine kernel, UCI datasets with small sample sizes and low dimensions, and the Caltech-256 dataset with large sample sizes and high dimensions have been utilized to evaluate the performance of the GLDMKL approach. The authors provide several analyses and comparisons; therefore, I recommend the manuscript be published in the journal.

Reviewer #2: The authors proposed new algorithms called DMKL.
In general the paper is well written but very poorly formatted. Moreover, the following issues must be addressed:

1) The quality of images must be improved.
2) The quality of charts must be improved.
3) The bibliography must be changed. Use mainly papers from the last 3-4 years to show the current state of knowledge.
4) More explanation is needed for what exactly is the novelty of this paper -- it must be more underlined in the abstract and related works.
5) The Background section must be merged into Related Works.
6) Some theoretical analysis of the convergence of your solution is needed.
7) More experiments must be added.
8) More comparisons with the latest solutions are needed.
9) More statistical analysis should also be added and discussed.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: dalia yousri
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool.
If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

13 Aug 2020

Dear Reviewers,

Thank you for your review of our manuscript (ID: PONE-D-20-09329), titled “Group-based local adaptive deep multiple kernel learning with lp norm”. We appreciate your concerns and suggestions, and have revised our manuscript accordingly. The following are our responses to your comments.

Point 1: The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Response 1: We have supplemented and refined the experimental data, added corresponding analysis, and further explained the ambiguities. All the modifications are marked in yellow in the paper.

Point 2: Has the statistical analysis been performed appropriately and rigorously?

Response 2: We thank the reviewers for affirming that there is detailed statistical analysis in our paper. In the experiments on the UCI datasets, we have discussed the comparison of classification performance, the effect of the number of groups on classification performance, and the effect of the number of layers on classification performance. In the experiments on the Caltech-256 dataset, we randomly selected five classes with similar shapes: bowling-ball, car-tire, desk-globe, roulette-wheel, and sunflower-101. When each of these five classes is used as the positive class, we have analyzed the classification accuracy under different numbers of layers and groups. Special thanks for your good suggestions.
Point 3: The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Response 3: The data used in our experiments have been provided as part of the paper. There are no restrictions on publicly sharing the data. The URL is: https://github.com/fage8/GLDMKL.git.

Point 4: PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Response 4: The language usage, spelling, and grammar have been completely revised in the paper. All the modifications are marked in yellow in the paper. For example, "lp norm" is modified to "the lp norm", and "classical DMKL" is modified to "the classical DMKL".

Point 5: The authors provide several analyses and comparisons; therefore, I recommend the manuscript be published in the journal.

Response 5: Thank you for recommending our paper. We have made further modifications: we supplemented and refined the experimental data, added corresponding analysis, and further explained the ambiguities.

Point 6: The quality of images must be improved.

Response 6: Figures 1-5 and Figure 11 have been modified, and some key elements in the figures have been modified and given color labels. Figures 6-10 have been removed from the paper.

Point 7: The quality of charts must be improved.
Response 7: All tables in the paper have been changed to three-line tables, and their captions and header elements are marked in bold.

Point 8: The bibliography must be changed. Use mainly papers from the last 3-4 years to show the current state of knowledge.

Response 8: Thank you for your excellent suggestion. We have added papers from the last 3-4 years as references. All the modifications are marked in yellow in the paper.

Point 9: More explanation is needed for what exactly is the novelty of this paper -- it must be more underlined in the abstract and related works.

Response 9: The novelties of this paper are now underlined in the abstract and related works. They are the following: our GLDMKL method divides samples into multiple groups according to the multiple kernel k-means clustering algorithm; the learning process in each well-grouped local space is exactly adaptive deep multiple kernel learning; and our structure is adaptive, so there is no fixed number of layers. The learning model in each group is trained independently, so the number of layers of the learning model may be different, which highlights the flexibility of the model. Our model is more adaptable to data of different dimensions and sizes. All the modifications are marked in yellow in the paper.

Point 10: The Background section must be merged into Related Works.

Response 10: Special thanks for your good suggestion. The Background section has been merged into Related Works. Some descriptions of DMKL have been integrated and some new references have been added.

Point 11: Some theoretical analysis of the convergence of your solution is needed.

Response 11: We have added some theoretical analysis of the convergence of our method. In our method, we transform the non-convex problem into a convex problem through a gate function. In this way, a local optimum is guaranteed to be found by the gradient ascent method, so our method must converge.

Point 12: More experiments must be added.
Response 12: More experiments have been added to our paper. The latest method, DWS-MKL (2020), has been added to the classification accuracy comparison experiments. Specifically, these added experiments include the comparison of best classification performance and the comparison of classification performance at each layer.

Point 13: More comparisons with the latest solutions are needed.

Response 13: The latest method, DWS-MKL (2020), has been added to the classification accuracy comparison experiment. Specifically, these comparisons include the comparison of best classification performance and the comparison of classification performance at each layer. In the comparison experiments, the classification performance of our GLDMKL method is better than that of the DWS-MKL method on various datasets.

Point 14: More statistical analysis should also be added and discussed.

Response 14: We have discussed the effect of the number of groups and the effect of the number of layers. Some further statistical analysis is noted as future work in the Validation section: for example, probabilistic grouping can be added to our model, and the influence of the lp norm and the stability of the model can be discussed. More reasons and modifications have been noted in the article; all the modifications are marked in yellow in the paper. Special thanks for your good comments.

Sincerely yours,
Shengbing Ren, Fa Liu, Weijia Zhou, Xian Feng and Chaudry Naeem Siddique

Submitted filename: renamed_f587d.docx

19 Aug 2020

Group-based local adaptive deep multiple kernel learning with lp norm

PONE-D-20-09329R1

Dear Dr. Ren,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments.
When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Robertas Damasevicius
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes.
The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: All the comments have been addressed. I don't have any additional comments; therefore, the manuscript should be accepted in the journal.

Reviewer #2: The paper can be accepted. All comments have been addressed and the quality of the paper is ok for PLOS ONE.
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

8 Sep 2020

PONE-D-20-09329R1

Group-based local adaptive deep multiple kernel learning with lp norm

Dear Dr. Ren:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Professor Robertas Damasevicius
Academic Editor
PLOS ONE
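The grouping step discussed in Response 9 above (dividing samples into local groups via multiple kernel k-means clustering) can be illustrated with a minimal sketch. Everything specific below is an illustrative assumption, not the paper's actual implementation: uniform weights over three RBF base kernels, a simple deterministic initialization, and the standard kernel k-means distance in the induced feature space.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Gram matrix of the Gaussian (RBF) kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def multiple_kernel_kmeans(X, n_clusters, gammas=(0.1, 1.0, 10.0), n_iter=20):
    """Kernel k-means on a uniformly weighted sum of RBF base kernels (toy sketch)."""
    n = X.shape[0]
    beta = np.full(len(gammas), 1.0 / len(gammas))  # uniform kernel weights (assumption)
    K = sum(b * rbf_kernel(X, g) for b, g in zip(beta, gammas))
    labels = np.arange(n) % n_clusters              # simple deterministic initialization
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.empty((n, n_clusters))
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl
            dist[:, c] = diag - 2.0 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

On two well-separated point clouds this assigns each cloud to its own group, after which each local group would be trained independently as described in the abstract.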
Table 7

Classification performance (%) at each layer where the car-tire class is positive.

Layers     1 group   2 groups   5 groups   7 groups   10 groups
1-layer    89.01     87.70      87.88      88.49      89.99
2-layer    87.01     80.80      82.00      88.49      71.14
3-layer    86.52     80.80      82.00      88.49      71.14
4-layer    87.51     80.80      82.00      88.49      89.99
5-layer    88.51     80.80      82.00      88.49      89.99
Table 8

Classification performance (%) at each layer where the desk-globe class is positive.

Layers     1 group   2 groups   5 groups   7 groups   10 groups
1-layer    89.36     70.00      88.63      85.63      84.12
2-layer    55.81     71.63      89.32      85.63      84.12
3-layer    83.97     71.63      89.32      85.00      84.05
4-layer    37.74     90.00      89.32      85.00      84.05
5-layer    55.92     85.43      87.40      90.00      86.43
Table 9

Classification performance (%) at each layer where the roulette-wheel class is positive.

Layers     1 group   2 groups   5 groups   7 groups   10 groups
1-layer    88.52     79.50      89.88      80.00      88.86
2-layer    89.37     88.02      88.13      72.36      88.86
3-layer    89.37     89.88      88.13      72.36      80.68
4-layer    89.37     83.30      88.50      72.36      80.68
5-layer    89.37     89.88      88.50      72.36      80.68
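The accuracies in Tables 7-9 come from alternately optimizing the SVM parameter α and the local kernel weights β under an lp-norm constraint, which keeps all base kernels active rather than sparse. As a hedged illustration only, the snippet below sketches the closed-form lp-norm weight update common in lp-MKL methods (β_m ∝ s_m^{1/(p+1)} for per-kernel scores s_m, normalized to unit lp norm); the function names and scores are illustrative assumptions, and the paper's exact update may differ.

```python
import numpy as np

def lp_kernel_weights(scores, p=2.0):
    # scores[m] plays the role of ||w_m||^2 for base kernel m (illustrative input).
    s = np.asarray(scores, dtype=float)
    beta = s ** (1.0 / (p + 1.0))
    # Normalize so that ||beta||_p = 1; for p > 1 no positive score is driven to zero.
    beta /= np.sum(s ** (p / (p + 1.0))) ** (1.0 / p)
    return beta

def combine_kernels(kernels, beta):
    # Weighted sum of base kernel Gram matrices for one layer of the deep structure.
    return sum(b * K for b, K in zip(beta, kernels))
```

For p > 1 every base kernel with a positive score keeps a strictly positive weight, which matches the stated goal of constraining β with the lp norm to avoid sparsity over the base kernels.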
  8 in total

1.  Nonlinear Deep Kernel Learning for Image Annotation.

Authors:  Mingyuan Jiu; Hichem Sahbi
Journal:  IEEE Trans Image Process       Date:  2017-02-08       Impact factor: 10.856

2.  Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis.

Authors:  Qiaolin Ye; Liyong Fu; Zhao Zhang; Henghao Zhao; Meem Naiem
Journal:  Neural Netw       Date:  2018-06-15

3.  Optimizing Kernel Machines Using Deep Learning.

Authors:  Huan Song; Jayaraman J Thiagarajan; Prasanna Sattigeri; Andreas Spanias
Journal:  IEEE Trans Neural Netw Learn Syst       Date:  2018-03-06       Impact factor: 10.451

4.  Matrix-Regularized Multiple Kernel Learning via (r,p) Norms.

Authors:  Yina Han; Yixin Yang; Xuelong Li; Qingyu Liu; Yuanliang Ma
Journal:  IEEE Trans Neural Netw Learn Syst       Date:  2018-01-15       Impact factor: 10.451

5.  Unsupervised multiple kernel learning for heterogeneous data integration.

Authors:  Jérôme Mariette; Nathalie Villa-Vialaneix
Journal:  Bioinformatics       Date:  2018-03-15       Impact factor: 6.937

6.  Structured sparsity regularized multiple kernel learning for Alzheimer's disease diagnosis.

Authors:  Jialin Peng; Xiaofeng Zhu; Ye Wang; Le An; Dinggang Shen
Journal:  Pattern Recognit       Date:  2018-11-24       Impact factor: 7.740

7.  Deep Learning in Medical Image Analysis (Review).

Authors:  Dinggang Shen; Guorong Wu; Heung-Il Suk
Journal:  Annu Rev Biomed Eng       Date:  2017-03-09       Impact factor: 9.590

8.  Double Sparsity Kernel Learning with Automatic Variable Selection and Data Extraction.

Authors:  Jingxiang Chen; Chong Zhang; Michael R Kosorok; Yufeng Liu
Journal:  Stat Interface       Date:  2018       Impact factor: 0.582

