
Stochastic margin-based structure learning of Bayesian network classifiers.

Franz Pernkopf, Michael Wohlmayr

Abstract

The margin criterion for parameter learning in graphical models has gained significant attention in recent years. We use the maximum margin score for discriminatively optimizing the structure of Bayesian network classifiers. Furthermore, greedy hill-climbing and simulated annealing search heuristics are applied to determine the classifier structures. In the experiments, we demonstrate the advantages of maximum margin optimized Bayesian network structures in terms of classification performance compared to traditionally used discriminative structure learning methods. Stochastic simulated annealing requires fewer score evaluations than greedy heuristics. Additionally, we compare generative and discriminative parameter learning on both generatively and discriminatively structured Bayesian network classifiers. Margin-optimized Bayesian network classifiers achieve similar classification performance as support vector machines. Moreover, missing feature values during classification can be handled by discriminatively optimized Bayesian network classifiers, a case where purely discriminative classifiers usually require mechanisms to complete unknown feature values in the data first.


Keywords:  Bayesian network classifier; Discriminative learning; Maximum margin learning; Structure learning

Year:  2013        PMID: 24511159      PMCID: PMC3914412          DOI: 10.1016/j.patcog.2012.08.007

Source DB:  PubMed          Journal:  Pattern Recognit        ISSN: 0031-3203            Impact factor:   7.740


Introduction

Generative probabilistic classifiers optimize the joint probability distribution P(C, X) of the features X and the corresponding class labels C using maximum likelihood (ML) estimation. The class label is usually predicted using the maximum a posteriori estimate of the class posteriors obtained by applying Bayes rule. Discriminative probabilistic classifiers such as logistic regression model the class posterior P(C | X) directly. Discriminative classifiers may lead to better classification performance, particularly when the class conditional distributions poorly approximate the true distribution [1]. Basically, in Bayesian network classifiers both parameters and structure can be learned either generatively or discriminatively [2]. Discriminative learning requires objective functions, such as the classification rate (CR), the conditional log-likelihood (CL), or the margin (as we propose to use in this paper), that optimize the model for a particular inference scenario, e.g. a classification task. We are particularly interested in learning the discriminative structure of a generative Bayesian network classifier whose joint distribution P(C, X) factorizes according to the network structure. Learning the graph structure of a Bayesian network is hard: optimally learning various forms of constrained Bayesian network structures is NP-hard [3] even in the generative sense. Recently, approaches for finding the globally optimal generative Bayesian network structure have been proposed. These methods are based on dynamic programming [4,5], branch-and-bound techniques [6,7], or search over various variable orderings [8]. Experiments with these exact methods are restricted to networks with a limited number of variable nodes. Alternatively, approximate methods such as stochastic search or greedy heuristics are used, which can handle cases with many more variables. Discriminative structure learning is no less difficult because of the non-decomposability of the scores.
Discriminative structure learning methods – relevant for learning Bayesian network classifiers – are usually approximate methods based on local search heuristics. In [9], a greedy hill-climbing heuristic is used to learn a classifier structure using the CR score. Particularly, at each iteration one edge is added to the structure which complies with the restrictions of the network topology and the acyclicity constraints of a Bayesian network. In a similar algorithm, the CL has been applied for discriminative structure learning [10]. Recently, we introduced a computationally efficient order-based greedy search heuristic for finding discriminative structures [2]. Our order-based structure learning is based on observations in [11] and shows similarities to the K2 heuristic [12]. However, we proposed a discriminative scoring metric (i.e. CR) and suggested approaches for establishing the variable ordering based on conditional mutual information [13]. One of the most successful discriminative classifiers, namely the support vector machine (SVM), finds a decision boundary which maximizes the margin between samples of distinct classes, resulting in good generalization properties [14] of the classifier. Recently, the margin criterion has been applied to learn the parameters of probabilistic models. Taskar et al. [15] observed that undirected graphical models can be efficiently trained to maximize the margin. More recently, Guo et al. [16] introduced margin maximization for parameter learning of Bayesian networks based on convex relaxation. We proposed a conjugate gradient algorithm for maximum margin optimization of the parameters and showed its advantages with respect to computational requirements [17]. Further generative and discriminative parameter learning methods for Bayesian network classifiers are summarized in [2,17] and references therein. In this paper, we use the maximum margin (MM) criterion for discriminative structure learning.
We use greedy hill-climbing (HC) and stochastic search heuristics such as simulated annealing (SA) [18,19] for learning discriminative classifier structures. SA is less prone to getting stuck in local optima. We empirically evaluate our margin-based discriminative structure learning heuristics on two handwritten digit recognition tasks, one spam e-mail data set, and one remote sensing data set. We use naive Bayes (NB) as well as generatively and discriminatively optimized tree augmented naive Bayes (TAN) [20] structures. Furthermore, we experimentally compare both discriminative and generative parameter learning on both discriminatively and generatively structured Bayesian network classifiers. Maximum margin structure learning outperforms recently proposed generative and discriminative structure learning approaches. SA heuristics mostly lead to better performing structures with fewer score evaluations (CR or MM) compared to HC methods. Discriminative parameter learning produces a significantly better classification performance than ML parameter learning on the same classifier structure. This is especially valid for cases where the structure of the underlying model is not optimized for classification [21]. We introduced the MM score for structure learning in [22] using the HC heuristic. The benefit of the MM score over other discriminative scores (i.e. CR) remained open in [22], since the HC heuristic might get trapped in locally optimal solutions. This makes the reported performance gain of the MM score during structure learning ambiguous: either MM is useful, or the HC heuristic using CR gets stuck in low-performing locally optimal solutions. For this reason we use SA, which partially alleviates this problem. Recently, we also used the MM score for exact structure learning of Bayesian network classifiers [23]. This method is capable of finding the globally optimal solution.
It is based on branch-and-bound techniques within a linear programming framework which offers the advantage of an any-time solution, i.e. premature termination of the algorithm returns the currently best solution together with a worst-case sub-optimality bound. Empirically it is shown that MM optimized structures compete with SVMs and outperform generatively learned network structures. Unfortunately, experiments are limited to rather small-scale data sets from the UCI repository [24]. To overcome these limitations, we use approximate methods for structure learning in this paper. The paper is organized as follows: In Section 2, we introduce Bayesian network classifiers as well as NB and TAN structures. In Section 3, we present the non-decomposable discriminative scores CL, CR, and MM. Additionally, we discuss techniques for making the determination of these discriminative scores computationally competitive. Section 4 introduces different structure learning heuristics. Particular focus is on SA which is rarely used for discriminative learning of Bayesian network structures. In Section 5, we present experimental results. Section 6 concludes the paper.

Bayesian network classifiers

A Bayesian network B = (G, Θ) [25] is a directed acyclic graph G consisting of a set of nodes and a set of directed edges connecting the nodes, where an edge Z_j → Z_i is directed from Z_j to Z_i. This graph represents factorization properties of the distribution of a set of random variables Z = {Z_1, …, Z_{N+1}}. The variables in Z have values denoted by lower case letters z = {z_1, …, z_{N+1}}. We use boldface capital letters, e.g. Z, to denote a set of random variables and correspondingly boldface lower case letters, e.g. z, to denote a set of instantiations (values). Without loss of generality, in Bayesian network classifiers the random variable Z_1 represents the class variable C ∈ {1, …, |C|}, where |C| represents the number of classes, and X = {X_1, …, X_N} = {Z_2, …, Z_{N+1}} denotes the set of random variables representing the N attributes of the classifier. In a Bayesian network each node is independent of its non-descendants given its parents. The set of parameters which quantify the network is represented by Θ. Each random variable Z_j is represented as a local conditional probability distribution P(Z_j | Pa(Z_j)) given its parents Pa(Z_j). We use θ^j_{i|h} to denote a specific conditional probability table entry (assuming discrete variables): the probability that variable Z_j takes on its i-th value assignment given that its parents take their h-th assignment, i.e. θ^j_{i|h} = P(Z_j = i | Pa(Z_j) = h). The training data consists of M independent and identically distributed samples S = {(c_m, x_m)}, m = 1, …, M. The joint probability distribution of a Bayesian network factorizes according to the graph structure and is given for a sample (c, x) as

P(c, x) = P(c) ∏_{i=1..N} P(x_i | Pa(X_i)),   (1)

where C ∈ Pa(X_i) for the structures considered here. The class labels are predicted according to

c* = argmax_c P(c | x) = argmax_c P(c, x),

where the last equality follows from neglecting the normalization P(x) in Bayes rule. In this work, we restrict ourselves to NB and TAN structures. The NB network assumes that all the attributes are conditionally independent given the class label. As reported in [20], the performance of the NB classifier is surprisingly good even if the conditional independence assumption between attributes is unrealistic or even wrong for most of the data. Friedman et al. [20] introduced the TAN classifier, which is based on structural augmentations of the NB network. In order to relax the conditional independence properties of NB, each attribute may have at most one other attribute as an additional parent. This means that the tree-width of the attribute-induced sub-graph is unity, i.e. we have to learn a 1-tree over the attributes. A TAN classifier structure is shown in Fig. 1. In [2], we noticed that k-trees over the features – k indicates the tree-width – often do not improve classification performance significantly without regularization. Therefore, we limit the experiments to NB and TAN structures.
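The factorization and prediction rule can be illustrated with a minimal sketch for discrete variables (the data layout and names are our own; conditional probability tables are stored as dictionaries of log-probabilities):

```python
def log_joint(c, x, log_prior, log_cpts, parents):
    """log P(c, x) for an NB/TAN-style structure.

    log_prior[c] holds log P(C=c); parents[i] is the index of the
    attribute parent of X_i (None if X_i is conditioned on C only);
    log_cpts[i] maps (x_i, c) or (x_i, x_parent, c) to log-probabilities.
    """
    lp = log_prior[c]
    for i, xi in enumerate(x):
        pa = parents[i]
        key = (xi, c) if pa is None else (xi, x[pa], c)
        lp += log_cpts[i][key]
    return lp

def predict(x, log_prior, log_cpts, parents):
    # argmax_c P(c | x) = argmax_c P(c, x): the normalization P(x) cancels
    return max(range(len(log_prior)),
               key=lambda c: log_joint(c, x, log_prior, log_cpts, parents))
```

With `parents = [None] * N` this is the NB classifier; a TAN structure additionally sets one attribute parent per feature.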
Fig. 1

An example of a TAN classifier structure.

Discriminative scores for structure learning

In this section, we first summarize traditionally used discriminative scores such as CR and CL. Then, we introduce our maximum margin (MM) score. Finally, we provide some techniques to make the computation of discriminative scores computationally competitive. For the sake of brevity, in the following we write only instantiations of the random variables.

Traditional scores: CR and CL

There are two score functions we consider: the CR [9,27,2],

CR(B | S) = (1/M) ∑_{m=1..M} 𝟙(c_m = argmax_c P(c | x_m)),

and the CL [10],

CL(B | S) = ∑_{m=1..M} log P(c_m | x_m).

The symbol 𝟙(i = j) denotes the indicator function (i.e. it equals 1 if the Boolean expression i = j is true and 0 otherwise). CR is tightly connected to CL under the 0/1-loss function (other smooth convex upper-bound surrogates might be employed [28]). While either the CL or the CR can be used as a score for structure learning, in this work we restrict our experiments to CR, as empirical results show it performs better.
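Both scores can be computed from the per-sample log joint probabilities log P(c, x_m); a minimal sketch with our own function names:

```python
import math

def cr_score(log_joints, labels):
    """Classification rate: fraction of samples whose true label attains
    the maximal joint, CR = (1/M) sum_m 1(c_m = argmax_c P(c, x_m))."""
    hits = sum(1 for row, cm in zip(log_joints, labels)
               if cm == max(range(len(row)), key=row.__getitem__))
    return hits / len(labels)

def cl_score(log_joints, labels):
    """Conditional log-likelihood: sum_m log P(c_m | x_m) =
    sum_m [log P(c_m, x_m) - log sum_c P(c, x_m)]."""
    total = 0.0
    for row, cm in zip(log_joints, labels):
        log_norm = math.log(sum(math.exp(lp) for lp in row))
        total += row[cm] - log_norm
    return total
```

Here `log_joints[m][c]` holds log P(c, x_m) for sample m; a numerically safer variant would use a log-sum-exp with the row maximum subtracted.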

Maximum margin score

The multi-class margin [16] of sample m can be expressed as

d_m = P(c_m, x_m) / max_{c ≠ c_m} P(c, x_m).

If d_m > 1, then sample m is correctly classified, and vice versa. The magnitude of d_m is related to the confidence of the classifier in its decision. Taking the logarithm, we obtain

d̃_m = log P(c_m, x_m) − max_{c ≠ c_m} log P(c, x_m).

Usually, the maximum margin approach maximizes the margin of the sample with the smallest margin for a separable classification problem [29], i.e. the objective is max_B min_m d̃_m. For non-separable problems, we relax this by introducing a soft margin, i.e. we focus on samples with d̃_m close to zero. For this purpose, we consider the hinge function

MM(B | S) = ∑_{m=1..M} min(λ, d̃_m),

where the scaling parameter λ > 0 controls the margin with respect to the loss function and is set by cross-validation. This means that the emphasis is on samples with d̃_m < λ, i.e. samples with a large positive margin are considered as a constant factor during the optimization. We use MM(B | S) as the score for discriminative structure learning. Note that the CR and MM scores are determined from a classifier trained and tested on different data. For the sake of simplicity we do not denote this explicitly.
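Given precomputed log joints, the soft margin score can be computed as in this sketch (`mm_score` is our own name; `lam` plays the role of the cross-validated scaling parameter):

```python
def mm_score(log_joints, labels, lam=1.0):
    """Soft maximum-margin score: sum_m min(lam, d_m), where
    d_m = log P(c_m, x_m) - max_{c != c_m} log P(c, x_m).
    log_joints[m][c] holds log P(c, x_m); lam caps the contribution of
    samples that are already classified with large confidence."""
    score = 0.0
    for row, cm in zip(log_joints, labels):
        competitor = max(lp for c, lp in enumerate(row) if c != cm)
        d = row[cm] - competitor
        score += min(lam, d)
    return score
```

Misclassified samples contribute their (negative) margin, while confidently classified ones contribute only the constant `lam`, so the search pressure concentrates near the decision boundary.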

Computational considerations

All discriminative scores, i.e. CR, CL, and MM, are based on the joint probability distribution P(C, X). The computational complexity of one score evaluation is essentially the cost of determining P(c, x_m) for all classes c and samples m. Hence, the computational complexity of CR, CL, and MM is basically the same. However, the CR computation can be accelerated by the following techniques, where techniques 2, 3, and 4 apply equally to computing the margin score:

1. The data samples are reordered during structure learning so that misclassified samples from previous evaluations are classified first. The classification is terminated as soon as the performance drops below the currently best network score [30].

2. During structure learning the parameters are set to the ML values. When learning the structure, we only have to update the parameters of those nodes whose set of parents changes. This observation can also be used for computing the joint probability during classification: we can cache the joint probability and exchange only the probabilities of those nodes whose set of parents changed to obtain the new joint probability [9].

3. Since we are restricted to 1-trees, each attribute can have at most one of the other attributes as parent. This means that we can compute the ML parameter estimates beforehand, which prevents redundant parameter learning at later stages of the search. In fact, after the selection of the first edge during HC search, we have already determined all ML estimates (see also Section 4.2).

4. Given sufficient memory, a multi-way table of order four can be assembled at the beginning, which makes it possible to determine the joint probability of any structure for class c and sample m: the ML parameter estimate for each variable X_i and each possible conditioning attribute X_j, evaluated at the values of sample m and class c, constitutes the element at index (m, c, i, j) of the table. The log joint probability for class c and sample m can then be obtained by summing, over all attributes X_i, the table entries with the conditioning parent given by the structure. This multi-way table enables quick computation of the CR for all possible TAN structures.

Together, these techniques lead to a tremendous computational speedup.
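The multi-way-table technique can be sketched as follows, assuming the table is stored as nested dicts T[m][c][i][pa] (our own layout):

```python
def log_joint_from_table(T, m, c, parents, log_prior):
    """Log joint of sample m under class c for an arbitrary 1-tree using a
    precomputed table: T[m][c][i][pa] holds log P(x_i^(m) | x_pa^(m), c),
    with pa = -1 meaning X_i is conditioned on the class only.  Any TAN
    candidate can then be scored by a cheap sum instead of re-estimating
    its parameters."""
    return log_prior[c] + sum(T[m][c][i][parents[i]]
                              for i in range(len(parents)))
```

Because all ML estimates for every (child, candidate parent) pair are computed once up front, evaluating a new candidate structure reduces to table lookups and additions.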

Structure learning heuristics

This section provides three structure learning heuristics used in the experiments. Note that the parameters during structure learning are optimized generatively using maximum likelihood estimation [25].

Generative structure learning

The conditional mutual information (CMI) [13] between attributes X_i and X_j given the class variable C is determined as

CMI(X_i; X_j | C) = E[ log ( P(x_i, x_j | c) / (P(x_i | c) P(x_j | c)) ) ],

where E denotes the expectation over P(x_i, x_j, c). This measures the information between X_i and X_j in the context of C. Friedman et al. [20] give an algorithm for constructing a TAN network using this measure. First, the pairwise CMI for all attribute pairs is computed. Then, an undirected 1-tree is built using the maximum weighted spanning tree algorithm [25], where each edge connecting X_i and X_j is weighted by CMI(X_i; X_j | C). The undirected 1-tree is transformed into a directed tree by selecting a root variable and directing all edges away from this root. Finally, the class node C and the edges from C to all attributes are added.
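This construction can be sketched for discrete data as follows (a minimal illustration; `cmi` and `tan_tree` are our own names, and the class edges C → X_i are left implicit):

```python
import math
from itertools import combinations
from collections import Counter

def cmi(xi, xj, c):
    """Empirical conditional mutual information
    I(X_i; X_j | C) = sum_{x_i,x_j,c} P(x_i,x_j,c) *
                      log [ P(x_i,x_j|c) / (P(x_i|c) P(x_j|c)) ]."""
    M = len(c)
    pxyc, pxc = Counter(zip(xi, xj, c)), Counter(zip(xi, c))
    pyc, pc = Counter(zip(xj, c)), Counter(c)
    return sum((n / M) * math.log(n * pc[k] / (pxc[a, k] * pyc[b, k]))
               for (a, b, k), n in pxyc.items())

def tan_tree(X, c):
    """Undirected 1-tree over the attributes by maximum weighted spanning
    tree with CMI edge weights (Prim's algorithm), directed away from
    attribute 0 as root; returns the parent index of each attribute."""
    N = len(X)
    w = {}
    for i, j in combinations(range(N), 2):
        w[i, j] = w[j, i] = cmi(X[i], X[j], c)
    parents = [None] * N
    in_tree, out = {0}, set(range(1, N))
    while out:
        i, j = max(((a, b) for a in in_tree for b in out), key=lambda e: w[e])
        parents[j] = i
        in_tree.add(j)
        out.remove(j)
    return parents
```

`X` is a list of attribute columns; the strongest conditional dependencies end up as parent-child edges in the tree.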

Discriminative structure learning: greedy hill-climbing (HC)

A Bayesian network is initialized to NB, and at each iteration we add the edge that, while maintaining a partial 1-tree, gives the largest improvement of the scoring function. Basically, all discriminative scoring functions can be considered, i.e. CR, CL, and MM. Structure learning is performed until we obtain a 1-tree over the attributes, i.e. we add N − 1 edges. This approach is computationally expensive: each time an edge is added, the scores of all candidate edges into attributes without a conditioning attribute that comply with the acyclicity requirements need to be re-evaluated, because the discriminative scoring functions we employ are non-decomposable. Overall, learning a 1-tree structure requires on the order of N^3 score evaluations. In our experiments, we consider either the CR or the margin as objective.
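A sketch of this greedy procedure, with the score function left abstract (in the paper it would be CR or MM evaluated via cross-validation; all names here are our own):

```python
def _creates_cycle(parents, i, j):
    # adding j -> i closes a cycle iff i is already an ancestor of j
    while j is not None:
        if j == i:
            return True
        j = parents[j]
    return False

def greedy_hill_climbing(n_attrs, score):
    """Greedy HC sketch: start from NB (no augmenting edges) and add, one
    per iteration, the edge j -> i that maximizes the non-decomposable
    score while keeping a partial 1-tree; N - 1 edges are added in total."""
    parents = [None] * n_attrs
    for _ in range(n_attrs - 1):
        best, best_score = None, float("-inf")
        for i in range(n_attrs):
            if parents[i] is not None:
                continue                      # at most one attribute parent
            for j in range(n_attrs):
                if i == j or _creates_cycle(parents, i, j):
                    continue
                parents[i] = j                # tentatively add the edge
                s = score(parents)
                if s > best_score:
                    best, best_score = (i, j), s
                parents[i] = None
        i, j = best
        parents[i] = j
    return parents
```

Each of the N − 1 iterations evaluates O(N^2) candidate edges, which gives the O(N^3) score-evaluation count mentioned above.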

Discriminative structure learning: simulated annealing (SA)

The main advantage of SA compared to HC is that this heuristic is capable of escaping from local optima, although SA is not guaranteed to find globally optimal solutions. In the context of discriminative structure learning, we empirically show in Section 5 that SA is beneficial in terms of finding well-performing structures at low computational cost. SA for learning discriminative TAN structures is summarized in Algorithm 1. The basic principle of SA is that an additional temperature parameter T enables the acceptance of solutions worse than the currently best one with a certain probability. Parameter T is decreased periodically after either the maximum number of trials at one temperature (MaxTry) or the maximum number of successful network changes (MaxSuccess) is reached. The resulting cooling schedule makes the acceptance of lower-scoring solutions less likely at later stages of the search, i.e. towards the end of the optimization the values of T are small and SA accepts almost only score improvements, similarly to greedy hill-climbing. For learning MM structures, the CR score is replaced by MM(B|S). We initialize the network to NB (the edges of NB are not included in the edge set E). The constant D implements the cooling schedule (T ← D·T) and r denotes a scaling constant. Various values of D have been tested in the experiments. The values for MaxConsecutiveRejection, MaxTry, and MaxSuccess depend on the number of features N. The function GenerateNeighboringTree generates a new neighboring tree by randomly changing one edge in the 1-tree: first, one variable X_i is selected randomly; for X_i we randomly select a new parent variable X_j. The edge X_j → X_i is added/replaced if the acyclicity constraints are not violated.
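Algorithm 1 can be sketched in Python roughly as follows (the parameter values, function names, and the extra step cap are our own illustrative choices, not the paper's settings):

```python
import math
import random

def neighbor(parents, rng):
    """Randomly change one edge of the partial 1-tree: pick X_i and a new
    parent X_j, rejecting proposals that would close a directed cycle."""
    while True:
        cand = list(parents)
        i = rng.randrange(len(cand))
        j = rng.randrange(len(cand))
        cand[i] = None if i == j else j       # i == j removes the edge
        k = cand[i]
        while k is not None and k != i:       # walk up the ancestor chain
            k = cand[k]
        if k is None:                         # no cycle through i
            return cand

def simulated_annealing(n_attrs, score, D=0.8, r=0.001, max_try=50,
                        max_success=20, max_reject=30, max_steps=5000,
                        seed=0):
    """SA sketch in the spirit of Algorithm 1: worse neighbors are accepted
    with probability exp((s - s_old)/(r*T)) and T is cooled by T <- D*T.
    `score` plays the role of CR(B|S) or MM(B|S)."""
    rng = random.Random(seed)
    parents = [None] * n_attrs                # start from the NB structure
    s_old = score(parents)
    T, reject, steps = 1.0, 0, 0
    while reject < max_reject and steps < max_steps:
        i, success = 0, 0
        while success < max_success and i < max_try:
            i += 1
            steps += 1
            cand = neighbor(parents, rng)
            s = score(cand)
            if s - s_old > 0:                 # improvement: always accept
                reject, success = 0, success + 1
                parents, s_old = cand, s
            elif rng.random() < math.exp(min(0.0, (s - s_old) / (r * T))):
                success += 1                  # occasionally accept worse
                parents, s_old = cand, s
            else:
                reject += 1
        T *= D                                # cooling schedule
    return parents
```

As T shrinks, the acceptance probability for worse structures vanishes and the search degenerates to hill-climbing, matching the behavior described above.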

Experiments

We present results for two handwritten digit recognition tasks, spam e-mail classification, and a remote sensing application. In the following, we provide details about the data sets:

MNIST data: The MNIST data [31] contains 60 000 handwritten digit samples for training and 10 000 for testing. We down-sample the gray-level images by a factor of two, which results in a resolution of 14×14 pixels, i.e. 196 features.

USPS data: This data set contains 11 000 uniformly distributed handwritten digit images from zip codes of mail envelopes. The data set is split into 8000 images for training and 3000 for testing. Each digit is represented as a 16×16 grayscale image, where each pixel is considered a feature.

Spambase data: This data set [24] considers the classification of e-mails as either spam or not spam. Most of the 57 attributes indicate whether a particular word or character occurs frequently in the e-mail. The data set contains 2301 and 2300 samples for training and testing, respectively.

DCMall data: We use a hyperspectral remote sensing image of the Washington DC Mall area containing 191 spectral bands with a spectral width of 5–10 nm. As ground reference, a classification performed at Purdue University was used, containing seven classes: roof, road, grass, trees, trail, water, and shadow. The image contains 1280×307 hyperspectral pixels, i.e. 392 960 samples. We arbitrarily choose 5000 samples of each class to learn the classifier.

For structure learning we use the algorithms introduced in Section 4. We empirically compare deterministic greedy HC versus stochastic SA using either the CR or the MM score for discriminative structure learning. In particular, we apply the following approaches for learning TAN structures:

TAN–ST–CMI: Generative TAN structure learning using the spanning tree (ST) algorithm and the CMI score [20].
TAN–HC–CR: Discriminative TAN structure learning using greedy hill-climbing and the CR score [9,2].
TAN–HC–MM: Discriminative TAN structure learning using greedy hill-climbing and the MM score (introduced in this paper).
TAN–SA–CR: Discriminative stochastic TAN structure learning based on simulated annealing maximizing the CR score (introduced in this paper).
TAN–SA–MM: Discriminative stochastic TAN structure learning based on simulated annealing maximizing the MM score (introduced in this paper).

The CR and MM scores are determined by 5-fold cross-validation on the training data. Zero probabilities in the conditional probability tables are replaced with small values. The various learning algorithms use the same data set partitioning. We use ML parameter learning during discriminative structure learning. We recently developed an MM parameter learning method for Bayesian network classifiers [17]. We are interested in whether MM parameter learning further improves the classification performance once the structure has been discriminatively optimized, i.e. we empirically compare both discriminative and generative parameter learning on both discriminatively and generatively structured Bayesian network classifiers. The two parameter learning methods are abbreviated as ML and MM. Since there is an abundance of combinations of algorithms, a simple naming scheme is introduced: we use a scheme A–B–C–D where "A" (if given) refers to either NB or TAN (1-tree), "B" and "C" (if given) refer to the structure learning heuristic and score, respectively, and "D" (if given) refers to the parameter training method of the final resultant model structure, i.e. either ML or MM.

Results—discriminative structure learning

Table 1 shows the classification performance of all combinations of structure learning methods using ML parameter learning.
Table 1

Classification results in (%) with standard deviation using ML parameter learning. Best structure learning results are emphasized using bold font.

Classifier structure | MNIST      | USPS       | Spambase   | DCMall
NB–ML                | 83.73±0.37 | 87.10±0.61 | 89.87±0.63 | 81.07±0.06
TAN–ST–CMI–ML        | 91.28±0.28 | 91.90±0.50 | 92.86±0.54 | 85.63±0.06
TAN–HC–CR–ML         | 92.63±0.26 | 93.07±0.46 | 92.08±0.56 | 87.58±0.06
TAN–HC–MM–ML         | 93.15±0.25 | 95.40±0.38 | 93.43±0.52 | 88.61±0.05
TAN–SA–CR–ML         | 92.66±0.26 | 94.07±0.43 | 93.52±0.51 | 88.69±0.05
TAN–SA–MM–ML         | 93.30±0.25 | 95.46±0.38 | 93.26±0.52 | 88.67±0.05
The best discriminatively optimized structures significantly outperform generatively learned structures, i.e. TAN–ST–CMI, and NB structures. For Spambase, TAN–HC–CR–ML performs worse than generative structure learning; in this case, the greedy HC heuristic gets stuck in locally optimal solutions with poor performance. Comparing the classification rates achieved with the discriminative scores in Fig. 2 (also Table 1), we observe that the MM score leads to better or equally performing networks for all data sets. Since SA is a stochastic optimization method, we perform 10 independent structure learning runs for each data set. While Table 1 reports the result of the best run, Fig. 2 uses a box plot (median, 25% and 75% quantiles, and total range (whiskers)) to summarize all runs. SA mostly achieves a better classification performance than greedy HC search using the same scoring measure. SA is able to escape from locally optimal solutions due to the cooling schedule (see Section 4.3), while HC has no mechanism for accepting lower-performing structures, i.e. SA has a more flexible treatment of local optima. However, SA does not guarantee finding global optima either.
Fig. 2

Classification performance of discriminatively structured Bayesian network classifier using either the CR or MM score in the HC and SA heuristics for structure learning. Horizontal red lines of the box plot corresponds to the median over 10 runs, the boxes represent the 25% and 75% quantiles, and the whiskers visualize the total range. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Furthermore, in Fig. 3 we compare the number of score evaluations used to learn the discriminative structures for each of the data sets. SA requires roughly one order of magnitude fewer score evaluations for data sets with a large number of features N, i.e. for MNIST, USPS, and DCMall. As noted in Section 4.2, the HC heuristic for learning TAN structures requires on the order of N^3 score evaluations. For SA, the number of score evaluations depends on the parameters of the algorithm, e.g. the cooling schedule. While all the SA results in Fig. 3 have been determined for D=0.8, Fig. 4 shows the influence of the cooling schedule of Algorithm 1. In particular, we compare the classification performance over the number of score evaluations for several values of D using the USPS data. Ten SA structure optimizations have been performed for each value of D. These curves reveal that a faster schedule (i.e. a lower value of D) might even reduce the number of score evaluations at similar classification performance. The score evaluation costs for CR and MM are essentially the same; details are given in Section 3.3.
Fig. 3

Number of score evaluations used to learn discriminatively structured Bayesian network classifiers. Horizontal red lines of the box plot corresponds to the median over 10 runs, the boxes represent the 25% and 75% quantiles, and the whiskers visualize the total range. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Fig. 4

Influence of the annealing schedule on the USPS data: SA–CR (left column) and SA–MM (right column). For each value of D, 10 stochastic structure learning runs have been performed.

Results—discriminative parameter learning

Table 2 shows the classification performance of all combinations of structure learning methods using MM parameter learning.
Table 2

Classification results in (%) with standard deviation using MM parameter learning. Best MM parameter learning results are emphasized using bold font.

Classifier structure | MNIST      | USPS       | Spambase   | DCMall
NB–MM                | 91.73±0.28 | 94.80±0.41 | 94.39±0.48 | 89.23±0.05
TAN–ST–CMI–MM        | 94.13±0.24 | 95.50±0.38 | 93.91±0.50 | 89.22±0.05
TAN–HC–CR–MM         | 94.83±0.22 | 96.27±0.35 | 93.56±0.51 | 90.29±0.05
TAN–HC–MM–MM         | 95.20±0.21 | 96.23±0.35 | 94.21±0.49 | 90.29±0.05
TAN–SA–CR–MM         | 94.82±0.22 | 96.10±0.35 | 94.26±0.49 | 89.96±0.05
TAN–SA–MM–MM         | 95.25±0.21 | 96.40±0.35 | 94.47±0.48 | 90.28±0.05
SVM                  | 96.40±0.19 | 97.86±0.26 | 94.57±0.47 | 88.98±0.05
                     | (C=1, σ=0.005) | (C=1, σ=0.05) | (C=1, σ=0.01) | (C=1, σ=0.05)
Discriminative parameter learning produces a better classification performance than ML parameter learning (see Table 1) on the same classifier structures. This is especially valid for cases where the structure of the underlying model is not optimized for classification [21], i.e. NB and TAN–CMI. SVMs are slightly better than our discriminative Bayesian network classifiers on three data sets, i.e. MNIST, USPS, and Spambase. In Table 3, we compare the model complexity, i.e. the number of parameters, of SVMs, the NB classifier, and the best performing TAN Bayesian network classifier. The table reveals that in the cases where the Bayesian network achieves slightly inferior performance, the model is dramatically smaller: the best Bayesian network uses one to two orders of magnitude fewer parameters than the SVM for MNIST, USPS, and Spambase. For DCMall the Bayesian network outperforms SVMs by about 1.3%, but the model has roughly twice as many parameters as the SVM. It is a well-known fact that the number of support vectors in classical SVMs increases linearly with the number of training samples [32]. In contrast, the structure of Bayesian network classifiers naturally limits the number of parameters. Furthermore, the Bayesian network structures used are probabilistic generative models. They might be preferred since it is easy to work with missing features, domain knowledge can be directly incorporated into the graph structure, and it is easy to work with structured data.
Table 3

Model complexity for NB, best TAN Bayesian network (BN), and SVM.

Data     | N   | # of SVs | # SVM parameters | # NB BN parameters | # TAN BN parameters
MNIST    | 196 | 17 201   | 3 371 396        | 5159               | 31 399
USPS     | 256 | 3837     | 982 272          | 6089               | 22 139
Spambase | 57  | 517      | 29 469           | 161                | 517
DCMall   | 191 | 11 934   | 2 279 394        | 61 228             | 4 804 505
In the following, we verify that a discriminatively optimized generative model still offers its advantages in the missing feature case. Our MM parameter learning keeps the sum-to-one constraint of the probability distributions [17]. Therefore, similarly to the generatively optimized model, we can simply sum over the missing feature values. The interpretation of marginalizing over missing features is delicate, since the discriminatively optimized parameters might not have anything in common with consistently estimated probabilities (such as, e.g., maximum likelihood estimates) [33]. However, at least empirically there is strong support for using the marginal distribution P(C, X_o), where X_o is the subset of observed features in X. This is tractable if the complexity class of the model is limited (e.g. a 1-tree) and the variable order in the summation is chosen appropriately. A discriminative classifier, however, is inherently conditional, and the posterior over a feature subset cannot be obtained from P(C | X) alone. This problem also applies to SVMs, logistic regression, and multi-layered perceptrons. In Fig. 5, we report classification results for NB–ML and NB–MM assuming missing features using the DCMall data. The x-axis denotes the number of missing features. We average the performance over 100 classifications of the test data with randomly missing features. Standard deviation bars indicate that the resulting differences are significant for a moderate number of missing features. Discriminatively parametrized NB classifiers outperform NB–ML for up to 150 missing features. Additionally, we present results for SVMs where missing features are replaced with the average over all training samples. This mean value imputation heavily degrades the classification performance of SVMs in case of missing features. Handling missing features with NB structures is easy since we can simply neglect the conditional probability of the missing feature in Eq. (1), i.e. the joint probability is the product over the available features only.
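For NB this marginalization is particularly simple: summing a missing feature out of P(c) ∏_i P(x_i | c) removes its factor, since ∑_{x_i} P(x_i | c) = 1. A minimal sketch (our own names; `None` marks a missing value):

```python
def nb_log_joint_observed(c, x, log_prior, log_cpts):
    """NB log joint over the observed features only: marginalizing a
    missing X_i out of P(c) * prod_i P(x_i | c) simply drops its factor
    from Eq. (1)."""
    lp = log_prior[c]
    for i, xi in enumerate(x):
        if xi is not None:                    # skip missing features
            lp += log_cpts[i][(xi, c)]
    return lp

def nb_predict_missing(x, log_prior, log_cpts):
    return max(range(len(log_prior)),
               key=lambda c: nb_log_joint_observed(c, x, log_prior, log_cpts))
```

Prediction thus degrades gracefully: only the observed features vote, with no imputation step.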
Fig. 5

DCMall data: Classification results for NB–ML, NB–MM, and SVMs (using mean value imputation) assuming missing features.

Conclusions

In this paper, we proposed to use the maximum margin score for learning discriminative Bayesian network classifier structures. Furthermore, we replaced traditional greedy hill-climbing heuristics for discriminative structure optimization with stochastic simulated annealing methods. Simulated annealing offers mechanisms to escape local optima. The main results of this work are: (i) Maximum margin structure learning mostly outperforms other generative and discriminative structure learning methods. (ii) Stochastic optimization leads in most cases to better performing Bayesian network structures and requires fewer score evaluations than greedy hill-climbing. (iii) We also provide results for discriminative parameter learning on top of generatively or discriminatively optimized structures. Margin-optimized Bayesian networks perform on par with SVMs in terms of classification rate; however, Bayesian network classifiers can directly deal with missing features, a case where discriminative classifiers usually require feature imputation techniques.
Algorithm 1. Simulated annealing for discriminative structure learning.

Input: X_1, …, X_N, C, S
Output: Set of edges E for the TAN network
Initialization: T ← 1, i ← 0, RejectCounter ← 0, SuccessCounter ← 0
CR_old ← CR(B|S) for the NB classifier
E_old ← ∅
while RejectCounter < MaxConsecutiveRejection do
    while (SuccessCounter < MaxSuccess) and (i < MaxTry) do
        i ← i + 1
        E ← GenerateNeighboringTree(E_old)
        CR ← CR(B|S)                      // evaluate current network
        if CR − CR_old > 0 then
            RejectCounter ← 0
            SuccessCounter ← SuccessCounter + 1
            CR_old ← CR; E_old ← E
        else if random[0, 1) < exp((CR − CR_old) / (r·T)) then
            SuccessCounter ← SuccessCounter + 1
            CR_old ← CR; E_old ← E
        else
            RejectCounter ← RejectCounter + 1
        end
    end
    T ← D·T; i ← 0; SuccessCounter ← 0
end
