Literature DB >> 33267154

M-ary Rank Classifier Combination: A Binary Linear Programming Problem.

Abstract

The goal of classifier combination can be briefly stated as combining the decisions of individual classifiers to obtain a better classifier. In this paper, we propose a method based on the combination of weak rank classifiers because rankings contain more information than unique choices for a many-class problem. The problem of combining the decisions of more than one classifier with raw outputs in the form of candidate class rankings is considered and formulated as a general discrete optimization problem with an objective function based on the distance between the data and the consensus decision. This formulation uses certain performance statistics about the joint behavior of the ensemble of classifiers. Assuming that each classifier produces a ranking list of classes, an initial approach leads to a binary linear programming problem with a simple and global optimum solution. The consensus function can be considered as a mapping from a set of individual rankings to a combined ranking, leading to the most relevant decision. We also propose an information measure that quantifies the degree of consensus between the classifiers to assess the strength of the combination rule that is used. It is easy to implement and does not require any training. The main conclusion is that the classification rate is strongly improved by combining rank classifiers globally. The proposed algorithm is tested on real cytology image data to detect cervical cancer.

Entities: Chemical Disease Gene Species

Keywords: HPV; aggregation; binary linear programming; cervical cancer; classifier combination; data fusion; independence; mutual information; plurality voting; rank; total order

Year: 2019 PMID： 33267154 PMCID： PMC7514928 DOI： 10.3390/e21050440

Source DB: PubMed Journal: Entropy (Basel) ISSN： 1099-4300 Impact factor: 2.524

1. Introduction

Using a single classifier has shown limitations in achieving satisfactory recognition performance, and this leads us to use multiple classifiers, which is now a common practice in machine learning. Classifier combination has been studied in many disciplines such as the social sciences, sensor fusion, pattern recognition, etc. Schapire [1] proved that a strong classifier can be generated by combining weak classifiers. It has been accepted as an effective method to improve classification performances. Many examples of ensemble classifier systems can be found in process engineering or medicine. For a survey of the issues and approaches on classifier combination, readers are referred to Woźniak [2] and Oza and Turner [3]. The same type of approach has also been used, for instance, in remote sensing domains (e.g., for land cover mapping with Landsat Multispectral Scanner, elevation) [4], computer security [5], financial risks [6], proteomics [7]. Classifiers can provide as their final decision only a single class, a ranked list of all the classes, or a score associated with each class as a measure of confidence for the class. In this paper, we focus only on rank-values to perform combination. Rank data are useful when data can not be easily reduced to numbers, such as data that are related to concepts, opinions, feelings, values, and behaviors of people in a social context, genes, characters, etc. Ranking also has the advantage of removing scale effects while permitting ranking patterns to be compared. But rank-ordering has also its disadvantages: it is difficult to combine data from different rankings, and the information contained in the data is limited [8]. After learning, each classifier of the ensemble has output its own results. Several fusion strategies have been proposed in the literature to combine classifiers at the rank level [9,10]. Among them, one of the most common techniques is certainly the linear combination of the classifier outputs [11,12]. The voting principle is the simplest method of combination, where the top candidate from each classifier constitutes a single vote. The final decisions can be made by majority rule (over half of the votes) [13], plurality (maximum number of votes) [14], weighted sum of significance [15], or other variants. The method of Borda count [16], which sums up the rank values of classifiers, can be considered as a generalization of the voting principle. The Bayesian approach estimates the class posterior probabilities conditioned on classifier decisions by approximating various probability densities [17]. Although decision theory itself does not assume classifiers are independent, this assumption is almost always adopted in practical implementation to reduce the exponential complexity of probability estimation. In summary, classifier combination is an ensemble method that classifies new data by taking a weighted vote of the predictions of a set of classifiers [18]. This is originally a Bayesian averaging, but more recent algorithms include boosting, bagging, random forests, and variants [19,20,21]. Note that Dempster–Shafer formalism for aggregating beliefs based on uncertainty reasoning lends itself to a more flexible model used to combine multiple pieces of evidence and capable of taking uncertainty and ignorance into account [22]. Finally, a rank classifier provides an ordered list of classes associating each class with a rank integer that indicates its importance in the list. The output of a classifier is therefore a vector of ranks attributed to the K classes. An ensemble of classifiers might be a better choice than a single classifier because of the variability of the ensemble errors, which is such that the consensus performance is significantly better than the best individual in the ensemble [23]. This analysis is certainly true when the classifiers of the ensemble “see” different training patterns, and it can be effective even when the classifiers all share the same training set. In a computerized tomography problem to illustrate how the ensemble consensus outperformed the best individuals, Anthimopoulos observed that the marginal benefit obtained by increasing the ensemble size is usually low due to correlation among errors: most classifiers will get the right answer on easy inputs, while many classifiers will make mistakes on difficult inputs [24]. In addition, running several searches and combining the solutions produces a better approximation than many learning techniques that use local searches to converge toward a solution, with the risk of staying stacked in local optima (which may not be true in the case of deep learning classifiers, since Kawaguchi has shown that every local minimum is a global minimum [25]). Thus, we might not be capable of producing the optimal classifier using a training set and a given classifier architecture, compared to a set of several classifiers. Since the number of classifiers can be very high (in the thousands), it is difficult to “understand” the classifier ensemble decision characteristics. Although general performances are often improved when classifiers are combined, it becomes computationally costly to combine well-trained classifiers [26]. Most of the time, it is believed that the combination of independent classifiers will provide greater performance improvement [27], while combiner decisions could be biased toward duplicated outputs. However, this belief stems from the difficulty of using a dependence assumption. In fact, in practical situations, classifier independence is difficult to assess. How do multiple rank classifiers improve separation performances when individual classification performances are slightly better than random decision making? And what is “classifier independence” ? This term raises several issues that we will address in Section 4, where we come back to the theory of rank aggregation and propose an algorithm to combine classifiers. The main properties of the classifier are discussed. Section 2 exposes the general framework and the notations used. A classifier ensemble dependence measure is then proposed to evaluate the conditional mutual information in Section 5. Experimental results are presented in Section 6 for the detection of cervical cancer. Finally, Section 7 gives conclusions on rank classifier combination and further investigations are discussed. Notations Set and regions are indicated by double-trace uppercase letters such as , vectors with bold lowercase such as , and matrices with uppercase bold letters such as . The elements of a matrix are indexed by the row index i and the column index j. Lowercase letters refer to individual elements in a vector whose position in the vector is indicated by the last subscript. Therefore, refers to the jth element of vector . is the a priori probability of the random value X belonging to class , K being the number of classes. M is the number of classifiers used for combination. denotes the cardinality of set . denotes the transpose operator.

2. Problem Statement and Model

We consider a classification dataset with n observations obtained from a physical signal, or synonymously, explanatory variables, objects, instances, cases, patterns, t-uples, etc. where each belongs to class . The vector lies in an attribute space and each component is a numerical or nominal categorical attribute, also named feature, variable, dimension, component, field, etc. The output of the M classifiers are represented by a K-dimensional vector : each component is a certain value associated with class given by . Depending on the nature of the classifier , can be a rank value that reflects a complete or partial ordering of all classes, or a value in corresponding to the predicted class assigned to 1 and the others to zero, or a score, e.g., a discriminant value, associated with each class , which serves as a confidence measure for the class to be the true class. The latter can easily be converted into the two former. Therefore, each classifier defines a mapping function from the image domain to a K-dimensional vector space defined over a set of values . The general framework is illustrated in Figure 1.

Figure 1

General framework for classifier combination. The classifier produces output vector . Finally, from the combination function produces a final decision vector .

In this paper, is a rank value that reflects a complete or partial ordering of the classes. The objective is to design an optimal combination function that takes all the as input and produces as an output the decision vector , where is the rank associated with the decision on class , that is, . Thus, we seek as a discriminant function defined over . In the following, it is assumed that (i) classifiers have equal individual performance (ii) classifiers are treated as “black boxes”. Hence, the combination operator applies only on the real space vectors .

3. Conditional Independence Properties

The term “classifier independence” has been used in an intuitive manner, but what is classifier independence? Formally, two classifiers and are said to be independent if with and being the decision values of and . The idea is illustrated in the following example. (Independent classifiers). Consider a binary classification problem (with equiprobable classes Then the total probability rule helps to find the probability of the outputs: The two classifiers are independent if the joint probability And similarly, In Equations (5)–(6), and do not appear anymore. The value of should be , independently of the classifier performances. This is possible only if . Thus, the ensemble performance does not depend of the performance of the individuals. In other words, independent classifiers in the sense of definition (2) are random classifiers (recognition rate of 50%)! Suppose now that classifiers are very efficient and that which is far from the value of Example (1) suggests that interesting classifiers (non-random!) cannot be independent in the sense of Equation (2). Making the assumption that decision vectors are conditionally independent given , the discriminant function maximizes the posterior probability , which can be point estimated from the entries of the M -confusion matrices, as given, for instance, in Table 1.

Table 1

Confusion matrix of a classifier used to estimate in the Bayesian approach. denotes the classifier decision on class being ranked jth.

		Predicted Classes
		R1	…	Rj	…	RK
True classes	C1	n11	⋯	n1j	⋯	n1K	n1·
	⋮	⋮	⋱	⋮		⋮	⋮
	Cj	nj1	⋯	njj	⋯	njK	nj·
	⋮	⋮		⋮		⋮	⋮
	CK	nK1	⋯	nKj	⋯	nKK	nK·
		n·1	⋯	n·j	⋯	n·K

Let be the indicatrix function for which if the rank of the class is less than the alternative class , and 0 otherwise. Then in Table 1, and the line and column marginals are respectively defined by and . If class is the kth choice for classifier , then . (Conditional independent classifiers). Consider once again the binary classification case introduced in Example (1) and assume that the classifiers are very efficient: We conclude that two classifiers can be conditionally independent even if they are very efficient. Equation ( or conditionally dependent. Therefore, conditional independence can be seen as a necessary condition for classifier combination. But the direct use of the confusion matrix as a criterion to derive the optimal combination rule is not feasible since the true classes are unknown.

4. Rank Class Combination Problem

4.1. Rank-Order Statistic Model

A rank classifier gives an ordered list of classes associating each class with a integer that indicates its importance in the list; in the case of K classes, it is an integer . The output of a classifier is a vector of ranks attributed to K classes: and is the rank assigned to class by the classifier . By convention, the smaller the rank assigned to a class, the more likely it is. In other words, if judges more likely than . The vector is therefore a permutation of the first K integers. The matrix represents the total order ranking of the K classes attributed by the M classifiers, i.e., [28]. In the following, for ease of writing, we will denote . Then where is the set of ranks assigned to class by the M classifiers. The solution of a rank class combination problem is a total order ranking (TOR) , given by a virtual classifier minimizing the disagreement of opinions between the M classifiers. The optimization problem is defined as follows: where is the rank distribution on the K classes proposed by the classifier , is the symmetric group of the permutations [29], and is a metric on . Solving Equation (11) is difficult due to the constraint . In the following subsections, the search for conducts to a linear optimization program with an exact solution that depends on the metric used, i.e., the disagreement distance or the Condorcet distance. The choice of these metrics is motivated by a range of properties: (i) both have an intuitive and plausible interpretation as a number of pairwise choices, (ii) they provide the best possible description of the process of ranking classes as performed by a human, (iii) both have a number of appealing mathematical properties such as counting rather than measuring and providing a very good concordance indicator [30,31].

4.2. Total Order Ranking with Disagreement Distance

The disagreement between the rankings from classifiers and is measured by . The kth permutation can be represented by a permutation matrix , with if class i is positioned in place j and 0 otherwise (see Figure 2). Therefore, the constraint in Equation (11) imposes . Let with tensor Einstein notation. Equation (11) can then be rewritten: where denotes the rank of the ith candidate in the unknown ranking . As can be represented by its permutation matrix , it comes from the rewriting of in Equation (12): which is equivalent to:

Figure 2

Permutation matrix put for the ranking of classifier .

Taking into account the summation on j and the fact that only takes the value 1 once (and 0 elsewhere), only corresponding to the value j for which is considered. Then Let us define by the cost of attributing the alternative i in position j. is also the number of classifiers that don’t position the alternative i in place j. is equivalent to , where is the number of classifiers who do position the alternative i in place j. Given that because , we obtain and then In Equation (18), considering that is the number of classifiers that position class in place j, the linear objective function associated with Equation (12) is finally formulated as constrained by . The form to be minimized in Equation (19) recodes the classifier combination rule, which is reduced to solve an NP-hard binary linear programming problem (see [32] for some resolution strategies).

4.3. Total Order Ranking with Condorcet Distance

To define this distance, we define a new set of matrices , where is put for the indicator matrix of classifier with the convention if the rank of class is less than that of class and 0 otherwise (see Figure 3).

Figure 3

Condorcet matrices.

Using the tables as in Section 4.2 which can be simplified as follows in the case of total order: As or 1, the consensus function associated with the Condorcet distance is given by Let be the total number of classifiers preferring class to . Defining as a matrix summing the M matrices associated with the rankings of the classifier allows us to rewrite as As defines a total order, and . Let . Then is constant and is Finally, the search for an optimal rank classifier combination conducts to the following binary linear program: From a machine learning perspective, solving Equations (19) and (25) provides deterministic matrix solutions and , respectively, from which is easily reconstructed, but these solutions are not necessarily identical [28]. (Classifier ensemble aggregation rule). The problem selected to illustrate our theory is that of combining four classifiers for recognizing handwritten digits 0 to 9. Binary images from the MNIST database are used [ The two rankings are concordant except for the predictions for digits 5 and 6.

5. Classifier Ensemble Information Measure

Since sgn, then If , then from Equation (26), In summary, , which means that is more uncertain than and could be preferred for a classifier ensemble agreement. The question is, how precisely can we measure this voting conjunction? Section 4.3 introduced a matrix representation of the information. By summing for all the tables , one obtains the matrix defined previously. If we arrange the classifiers according to a permutation order , can be represented from matrix obtained by the permutation of rows and columns. The objective function to minimized is given in the general case by: and as follows in the case of total orders: A measure of classifier ensemble agreement is a coefficient between 0 and 1 measuring the intensity of the link between the set of classifier votes. The closer its value is to 1, the more the opinions of the classifiers are in agreement. Conversely, the closer their value is to 0, the greater the disagreement between the votes. Here, we give the coefficients of concordance for the two metrics. (Conjunction coefficient interval for the disagreement metric). Let

5.1. Disagreement Distance

See Appendix A for the proof.

5.2. Condorcet Distance

If M classifiers vote on K classes with pairing order comparison matrices , the sum of which makes it possible to obtain with , as defined in Section 4.3, the conjunction coefficient is defined as (Conjunction coefficient interval for the Condorcet metric). Let See Appendix B for the proof.

6. Experiments

6.1. The Detection of Cervical Cancer

Many studies have shown evidence that cervical cancer may be imputed to a subset of DNA viruses called human papillomavirus (HPV) (referred to as risky patients)that infect cutaneous and mucosal epithelia, and in which acute infection causes benign cutaneous lesions [34,35]. Some of these viruses infect the genital tract and cause malignant tumors, which are most commonly located in the cervix. Even though most of these infections are controlled by the immune system, some remain persistent and are ascribed to different types of cancers and particularly, to cervical cancer. In 2016, cervical cancer represented the 12th most lethal female cancer in the European Union, accounting for 13500 deaths a year and 30400 new cases a year. Therefore, cervical cancer screening still continues to play a critical role in the control of cervical cancer. However, the screening of a smear is nowadays mostly made manually: a pathologist inspects each cell of a smear with a microscope to check if it is atypical or not. Consequently, human error is always possible, and in particular, mistakenly diagnosing atypical cells as normal. This situation can occur because of the practitioner’s fatigue or a lack of experience or concentration. In addition, diagnosis is also linked to the preparation of cells, and in some situations, atypical cells can be partially hidden by others, which makes their interpretation or classification difficult. In addition, the presence of atypical cells in the entire studied population is very uncommon (up to 1‰) which makes the detection task even more difficult. Therefore, an error is easily possible. This could have irreversible effects on the evolution of the cancer and can impact treatment. The introduction of an automatic procedure, able to point out the pathological cells, would both help the practitioner in his diagnosis and improve or strengthen it. Depending on the morphology of the nuclei of the cells, the diagnosis varies: if a nucleus is considered normal and all of the cells removed have the same diagnosis, then the cervix is considered normal. On the other hand, if a nucleus is considered abnormal, the diagnosis is not automatically associated with a risky smear. We propose to test our classifier combination strategy to cluster cells into three different classes (normal cells, atypical cells, and debris) using a certain number of classifiers.

6.2. The Dataset

The cytological dataset is constituted of smear images from 14 different women. They generally comprise more than one hundred cells characterized by 42 morphological or textural variables. Nine showed a negative hpv test and the other five, a positive test. In addition, few observations were labeled by an expert who pointed out some atypical cells and noisy objects. The dataset is presented in detail in Table 3. Among the most recurrent patterns of abnormal cells are nucleus regularity or a swollen aspect, nucleus size, important optical density, number of nucleoli, high core/cytoplasm ratio, ratio of minimum/maximum width of the nucleus, etc.

Table 3

Dataset characteristics.

HPV Test	Total Number of Cells	Number (or %) of
HPV Test	Total Number of Cells	– Debris – – Cancer –
positive	405	78	(0.19)	49	(0.12)
positive	114	19	(0.17)	8	(0.07)
positive	206	31	(0.15)	13	(0.06)
positive	448	30	(0.06)	2	(0.004)
positive	519	70	(0.13)	33	(0.06)
negative	137	13	(0.09)	–	–
negative	76	5	(0.06)	–	–
negative	211	84	(0.39)	–	–
negative	251	31	(0.12)	–	–
negative	251	52	(0.20)	–	–
negative	257	40	(0.15)	–	–
negative	223	24	(0.11)	–	–
negative	691	155	(0.22)	–	–
negative	67	23	(0.24)	–	–
Total	3857	655	(0.17)	105	(0.02)

The images were colored with Papanicolaou stain, which is the most widely used reference color for the screening of cervical cancers; it makes it possible to distinguish the different nuclei, which are colored in blue, the mother cells in dark purple to black, and the keratinized and squamous epithelium. The images were then segmented into thumbnail images of pixels which correspond a priori to objects. Most of the time, these objects are nuclei, but they may sometimes be non-identified objects that we call “noise”. Indeed, they can correspond, for example, to a poor segmentation, a superimposed nuclei, etc. A few observations were labeled by an expert who pointed out some atypical cells and noisy objects. The fact that a nucleus has one of these characteristics does not always imply its malignancy. In fact, a cell can have a singular morphology but not be infected, and others may present abnormalities that correspond to pre-cancerous lesions such as dysplastic cells and in situ carcinomas or to cancerous cells. Figure 4a shows a cluster of abnormal cells (with large nuclei) that are not yet cancerous, because of their low density, unlike Figure 4b, where one can observe a set of abnormal cells with dense nuclei.

Figure 4

Images of cervical cells colored with Papanicolaou stain. (a) Clumps of abnormal cells with large nuclei. (b) Abnormal cells with dense nuclei.

Table 4 summarizes the characteristics of the dataset. First, the observed data come from samples of 14 different smears, which supposes the existence of inter-individual variability (confirmed by tests of variance between the hpv negatives, the hpv positives, or between the two types of population; the 5% risk threshold tests rejected the assumption of equality of means for all variables). However, it is possible that this variability is simply relative to the studied dataset, in the sense that the study was done on a small number of smear samples. This assumption remains to be verified on larger databases. It can also be noted in Table 3 that the known population of “abnormal” cells remains very low in proportion to the other classes, and, in contrast, the recognized “default/waste” class represents more than 15% of the data. The low proportion of the target class and the heterogeneity of the debris present obstacles for clustering. This means that, among the cells belonging to risky patient smears, there exists a non-null risk that some nuclei are atypical. Iin practice, this proportion is usually very low (0.1% to 5%).

Table 4

Overview of the studied dataset.

	No. of Patients	No. of Nuclei	No./Yype of Data
control patients	9	2165	427/noisy objects
risky patients	5	1692	105/atypical nuclei
	–	–	228 / noisy objects
Total	14	3857	760 objects

From this image segmentation, morphological and photometric features are extracted and computed. In total, the studied dataset has 3857 cell samples belonging to 14 different smears and consists of 42 variables: variables 1 to 19 represent morphological variables, and the rest corresponds to textural and photometric characters. The channel of treatments from the smear image to the dataset is reported in Figure 5.

Figure 5

Overview of the processing chain.

Each smear was pre-processed according to a standardized protocol: cell collection, spreading a thin layer on slides, and the staining of these slides. Each slide was then scanned, segmented cell by cell, and finally, underwent an extraction of 42 morphological and textural characteristics.

6.3. Experimental Protocol

Two-layer multilayer perceptrons (MLPs) were chosen as classifiers to produce the desired outputs, which were ordered to produce the ranks. Each multilayer perceptron (MLP) contains 42 input units, 10 hidden units, and 3 output units. Training was achieved using a learning rate of 0.1 and a momentum of 0.9 for two epochs on the training set. We deliberately trained the MLPs without optimization of a validation set. It is important to stress that the training set for the classifier was not the same set as the test set, the ensure that the experiments would be unbiased. The best results obtained for an mlp were a classification error rate of and a false positive rate (FPR) (or false alarm ratio) of . The fpr is the number of false positives divided by the total number of negatives N, i.e., . The false negative rate (FNR) is the number of false negatives divided by the number of real positive cases in the data, i.e., . In practice, this is a test result that indicates that a condition does not hold, while in fact it does. In order to assess the efficiency of the rank classifier combination algorithms, error rates were computed from a certain percentage of nuclei whose labels were known. This represented 70% of the observations in a subsample, as we took into account the 20 labeled atypical nuclei randomly selected, and we also assumed that those coming from control patients (120) were all normal nuclei. We proceeded in the same manner to compute the fpr which stands for the percentage of actual atypical nuclei mis-classified. In Table 5 and in Figure 6a, we report the classification error rate computed from the data with known labels and its corresponding fpr, for the two procedures. First of all, we can observe that the Condorcet combination rule shows the best performances in terms of classification error rate and fpr (see also Figure 6c). Indeed, only 4.28% of cells are mis-classified, whereas the disagreement combination rule has a mis-classification rate of 4.84% in the best case, with 765 classifiers. The main conclusion is that the success ratio is strongly improved when combining classifiers. However, it is disappointing to see that the Condorcet algorithm results in a significant number of false negatives (pathological cells classified as normal ones); the fnr also remains relatively high, around 10%, in many simulations. Indeed, the classification risk is not symmetric here: the detection of pathological cells activates the decision for treatment, and their absence implies an absence of treatment.

Table 5

Classification results with disagreement and Condorcet combination rules using a set of M classifiers (with ).

Disagreement Distance					Condorcet Distance
Id	M	Error Rate	FPR	FNR	IC	M	Error Rate	FPR	FNR
0.873	1321	0.0800	0.0934	0.0644	0.901	1321	0.0820	0.0870	0.0777
0.901	1073	0.0544	0.0491	0.0576	0.906	1073	0.0572	0.0566	0.0561
0.866	907	0.0560	0.0553	0.0562	0.966	907	0.0428	0.0529	0.0313
0.895	845	0.0524	0.0593	0.0464	0.920	845	0.0508	0.0532	0.0465
0.870	765	0.0484	0.0502	0.0500	0.822	765	0.0516	0.0555	0.0493
0.800	538	0.0636	0.0675	0.0600	0.817	538	0.0664	0.0718	0.0592
0.792	302	0.0728	0.0766	0.0710	0.744	302	0.1020	0.0923	0.1126
0.781	205	0.0896	0.0906	0.0864	0.757	205	0.1472	0.1015	0.1968
0.759	120	0.1292	0.1157	0.1439	0.776	120	0.1472	0.1118	0.1846
0.697	95	0.1424	0.1023	0.1809	0.660	95	0.1424	0.0985	0.1888
0.706	66	0.1392	0.1098	0.1667	0.689	66	0.1548	0.1055	0.2090
0.739	49	0.1328	0.0840	0.1854	0.672	49	0.1568	0.1117	0.2067
0.694	38	0.1460	0.1152	0.1770	0.516	38	0.1772	0.1404	0.2233
0.643	19	0.1540	0.1294	0.1801	0.484	19	0.1696	0.1323	0.2032
0.496	13	0.1644	0.1127	0.2224	0.455	13	0.1728	0.1261	0.2248
0.561	4	0.1580	0.1238	0.1962	0.477	4	0.1668	0.1234	0.2130

Figure 6

Graphic representations of the classification results for disagreement (blue) and Condorcet (red) distances.

We compared the clustering partition obtained by the three competitors: sparse k-means (SkM) proposed by Witten and Tibshirani [36,37], general sparse multi-class linear discriminant analysis (GSM-LDA) [38], and sparse EM (sEM)by Zhong et al. [39]. First, we can observe that among these algorithms, the sEM shows the best performance in terms of clustering accuracy. Only 9% of observations are mis-classified, on average, whereas the GSM-LDA algorithm has a mis-classification rate of 15.9%, and the SkM algorithm mis-classifies 19.2% of nuclei. However, the sparse approaches provide a better clustering results from a medical point of view since the results can be interpreted conversely to the LDA-type algorithm, for which the fitted discriminative axis is a linear combination of the original variables. Therefore, SkM and sEM provide information which can be interpreted to better understand both the data and the phenomenon. The rank classifier combination provides the best classification results. We can observe that the global clustering error rates are considerably reduced (Table 6). Indeed, the best error rate reaches 4.28% with 907 classifiers and a conjunction coefficient of 96.6%.

Table 6

Results obtained for the sparse k-means (SkM), general sparse multi-class linear discriminant analysis (GSM-LDA), and sparse EM (sEM) algorithms: Average and standard error of clustering error rate, false positive rate fpr, and false negative rate fnr on 20 simulations.

Algorithm	Error Rate	FPR	FNR
skm [36]	0.192±0.016	0.205±0.044	0.165±0.084
gsm [38]	0.159±0.022	0.133±0.050	0.118±0.099
sem [39]	0.090±0.047	0.077±0.022	0.062±0.061

7. Conclusions and Future Research

In this paper, we show that an exact optimal combination rule for a rank classifier ensemble can be computed as the solution to a binary linear programming problem. This rule can be seen as a total order ranking attributed to K classes by a virtual voter resuming the points of view of M voters. One could also stand the dual problem of the previous one, i.e., is there a distribution of marks or values that could have been attributed to a virtual class C by the m voters? The first problem is related to the idea of aggregating points of view, the second with the idea of summarizing profiles. We compared disagreement and Condorcet metrics, making it possible to quantify the consensus between the classifiers with a conjunction coefficient. The optimal rankings are not the same, i.e., the solution depends of the metric used. But they have shown their efficiency, in addition to the appealing property of being deterministic algorithms: they improve the classification results and ease the interpretation and the understanding of the results. Another point worth mentioning is the theoretical capability of handling the reject option. A weak point of this technique is that it treats all classifiers equally and does not take into account individual classifier capabilities. This disadvantage can be reduced to a certain degree by applying weights. The weights can be different for every classifier, which in turn requires additional training. This idea deserves to be further explored. The role of variable selection appears to be significant, as it enables the improvement of both the clustering partition and the modeling of the atypical cells in the cancer detection smear (see Figure 5). In the future, we propose including a rule to rank the selected features and to investigate how the number and nature of classifiers influence the results of the rank classifier combination.

Table 2

Proposed rank classifier combination using disagreement and Condorcet distances.

Digits	Classifier Ranks				Proposed Rank
	K1	K2	K3	K4	Disag.	Condorcet
0	1	4	3	10	3	3
1	2	2	1	2	2	2
2	3	1	2	1	1	1
3	4	6	4	3	4	4
4	5	5	7	5	5	6
5	6	3	6	4	6	5
6	7	8	5	6	7	7
7	8	7	8	9	8	8
8	9	10	10	8	10	10
9	10	9	9	7	9	9

7 in total

7. Double staining cytologic samples with quantitative Feulgen-thionin and anti-Ki-67 immunocytochemistry as a method of distinguishing cells with abnormal DNA content from normal cycling cells.

Authors: Gerald Li; Martial Guillaud; Michele Follen; Calum MacAulay
Journal: Anal Quant Cytopathol Histpathol Date: 2012-10