
Evolutionary approach to violating group anonymity using third-party data.

Dan Tavrov, Oleg Chertov.

Abstract

In the era of Big Data, it is almost impossible to completely restrict access to primary non-aggregated statistical data. However, the risk of violating the privacy of individual respondents and groups of respondents by analyzing primary data has not been reduced. There is a need to develop subtler methods of data protection to come to grips with these challenges. In some cases, individual and group privacy can be easily violated, because the primary data contain attributes that uniquely identify individuals and groups thereof. Removing such attributes from the dataset is a crude solution and does not guarantee complete privacy. In the field of providing individual data anonymity, this problem has been widely recognized, and various methods have been proposed to solve it. In the current work, we demonstrate that it is possible to violate group anonymity as well, even if those attributes that uniquely identify the group are removed. As it turns out, it is possible to use third-party data to build a fuzzy model of a group. Typically, such a model comes in the form of a set of fuzzy rules, which can be used to determine membership grades of respondents in the group with a level of certainty sufficient to violate group anonymity. In this work, we introduce an evolutionary computing based method to build such a model. We also discuss a memetic approach to protecting the data from group anonymity violation in this case.


Keywords:  Fuzzy inference; Group anonymity; Memetic algorithm; Microfile; Privacy-preserving data publishing; Subgroup discovery

Year:  2016        PMID: 26844025      PMCID: PMC4728171          DOI: 10.1186/s40064-016-1692-9

Source DB:  PubMed          Journal:  Springerplus        ISSN: 2193-1801


Background

A son of Dmitrii Mendeleyev, the world-renowned chemist and creator of the periodic table, recalls (Tishchenko and Mladientsev 1993, pp. 353–354) an interesting fact. In 1890, his father came up with a formula for the smokeless pirokollody gunpowder (Gordin 2003), which at the time was closely guarded by French manufacturers. As it turned out, Mendeleyev’s findings were based on analyzing public statistical data from a railroad company’s annual report on freight traffic. A separate branch line supplied the gunpowder factory, and the annual statistics provided all the information necessary to easily retrieve the gunpowder composition ratios. One hundred and twenty-five years later, in the era of Big Data, various statistical data are publicly available, and the task of ensuring that security-sensitive information does not leak out has become much more challenging. A modern person lives and works in a society oriented toward collecting and storing data on each and every individual. Statistics services do that (census forms), tax services do that (tax declarations), medical facilities do that (patients’ medical records), law enforcement agencies do that (personal IDs), employers do that (CVs), retail stores do that (personal discount cards), security services do that (security camera footage), and so on and so forth. Problems of preserving privacy in such data are widely discussed within the field of privacy-preserving data publishing (Fung et al. 2010; Wong and Fu 2010). To a great extent, appropriate protection implies removing identifiers (passport data, full name etc.), distorting the data (e.g., values of certain characteristics are swapped between respondents or perturbed with noise), or suppressing them (e.g., data on elderly people are grouped into a category of senior citizens). At the same time, problems of protecting group distributions for certain categories of respondents remain unsolved.
Let us consider a case when abnormal concentration of nuclear physicists on a specific territory reveals the site of a secret nuclear research facility. Of course, removing such attributes as Occupation or Industry seems to be a first choice. However, the risk of privacy violation remains high if there is information about where respondents pursued their higher education (e.g., National Institute for Nuclear Science and Technology for academic training in atomic energetics is situated in Saclay commune, France), or about where they lived (for instance, Dubna, Russian Federation, is a home to Joint Institute for Nuclear Research). Therefore, the task of protecting distributions for a certain group of respondents (which can be persons, households, enterprises etc.) with minimal distortion of primary statistical data is a pressing one. There are numerous practical cases when we do not have attributes at our disposal that classify a respondent as belonging to a certain group (either because they were deliberately removed by the data publisher, or because they were not present in the first place). However, we can try to restore group distributions by analyzing publicly available data such as statistical surveys, polls etc. (Chertov and Tavrov 2015). Using expert judgments about these data, we can build a fuzzy model of a group in a form of a fuzzy inference system (FIS) that, for each respondent, gives her membership grade in the group under consideration. A distribution constructed this way can violate group anonymity as discussed above. Expert judgments often are not a reliable source of fuzzy rules that constitute the main part of any FIS. Sometimes, it is hard even to properly identify attributes necessary to include into a model of a group, let alone determine particular fuzzy rules. In this work, we propose an evolutionary based method of building the fuzzy model using third-party data. 
We also describe a memetic algorithm for solving the task of anonymizing the obtained distribution. This algorithm seeks minimal distortion in the microfile, and at the same time ensures that group anonymity cannot be violated.

Related work

Data anonymity

Anonymity of a subject means (Pfitzmann and Hansen 2010) that it is not identifiable (uniquely characterized) within a set of subjects. Two kinds of anonymity can be distinguished: individual anonymity means that a single respondent is unidentifiable within a given dataset; group anonymity means that information about a group of respondents cannot be used to violate sensitive features of appropriate distributions. Methods for providing individual anonymity are discussed in the field of privacy-preserving data publishing (Fung et al. 2010; Wong and Fu 2010). Plenty of methods have been proposed over the years, among them randomization (Evfimievski 2002), microaggregation (Domingo-Ferrer and Mateo-Sanz 2002), data swapping (Fienberg and McIntyre 2005), and differential privacy (Dwork 2006). A comprehensive overview of recent developments in the field can be found in Sowmyarani and Srinivasan (2012) and Rashid and Yasin (2015). The problem of violating group anonymity, i.e., the anonymity not of individual respondents but of groups thereof, was first introduced in the context of providing group anonymity in Chertov and Tavrov (2010). It was shown that group anonymity can be violated by analyzing outliers of a so-called quantity signal $q = (q_1, q_2, \ldots, q_{l_p})$, where each $q_k$, $k = 1, \ldots, l_p$, stands for the number of respondents belonging to a given group (e.g., a group of military personnel, or a group of nuclear scientists) in a given submicrofile, and $l_p$ is the total number of submicrofiles. A submicrofile is a subset of microfile records sharing the same property, such as region of work. In Chertov and Tavrov (2010), it was argued that outliers in a quantity signal that corresponds to the regional distribution of military personnel can be used to disclose locations of (potentially classified) military bases.
In Chertov and Tavrov (2012), the concept of a quantity signal was taken further by introducing a concentration signal $c = (c_1, c_2, \ldots, c_{l_p})$, where each $c_k$, $k = 1, \ldots, l_p$, is obtained by dividing the corresponding quantity signal element $q_k$ by the total number of records in the corresponding submicrofile. The concentration signal can be used to violate anonymity of groups when absolute numbers of respondents are not sufficient. For instance, as was argued in Chertov and Tavrov (2012) using scientists as an example, extreme ratios of scientists working in a given region could potentially give away the location of a classified research center. In general, group anonymity can be violated by analyzing such sensitive properties of quantity and concentration signals as (Chertov 2010, p. 77) outliers (almost always a sensitive feature of any distribution), certain statistical features and trends (especially when the quantity signal represents an ordered sequence of numbers), cycles or periods (especially when the quantity signal represents a time series), or the frequency spectrum. In certain practical applications, when the groups are defined in terms of specific attributes (such as a group of military personnel, which is defined by a special attribute uniquely identifying a respondent as enlisted military personnel), it is possible to protect group anonymity by removing this attribute from the original dataset before publishing. Besides being a crude solution, this approach fails in a number of cases in which it is possible to build an approximation of a group, i.e., to define a set of records in the dataset whose quantity or concentration signal is sufficiently similar to the original one to violate anonymity of the group in question. Taking into consideration the uncertain and imprecise nature of statistical datasets, it was proposed in Chertov and Tavrov (2015) to violate group anonymity with the help of a fuzzy model of a group.
In Chertov and Tavrov (2014), a method for providing group anonymity based on memetic computing was proposed. This method enables us to modify the quantity (or concentration) signal in order to mask its outliers, and at the same time tries to minimize distortion introduced in the dataset. In Tavrov (2015), this algorithm was adapted to work with the fuzzy models proposed in Chertov and Tavrov (2015). In the next subsection, we will briefly review the concept of fuzzy inference, which is necessary for discussing fuzzy models of groups of respondents.

Fuzzy inference

The concept of a fuzzy set was first introduced in Zadeh (1965). A fuzzy set $A$ in a universal set $X$ is a class in which a point may have a grade of membership in the interval $[0, 1]$. Each fuzzy set $A$ is characterized by a membership function $\mu_A(x)$, which associates with each $x \in X$ a real number in the interval $[0, 1]$ considered as the “grade of membership” of $x$ in $A$. Fuzzy sets constitute a core of linguistic variables (Zadeh 1975). An ordinary variable is characterized by a triple $(X, U, R(X; u))$, in which $X$ is the name of the variable, $U$ is the universe of discourse, $u$ is a generic name for the elements of $U$, and $R(X; u)$ is a subset of $U$, which represents a restriction on the values of $u$ imposed by $X$. A fuzzy variable differs from the ordinary one in that $R$ is a fuzzy subset of $U$, which represents a fuzzy restriction on the values of $u$ imposed by $X$. A linguistic variable differs from an ordinary numerical variable in that its values are not numbers but words or sentences in a natural or artificial language. It is formally characterized by a quintuple $(\mathcal{X}, T(\mathcal{X}), U, G, M)$, in which $\mathcal{X}$ is the name of the variable; $T(\mathcal{X})$ denotes the term-set of $\mathcal{X}$, i.e., the set of names of linguistic values of $\mathcal{X}$, with each value being a fuzzy variable denoted generically by $X$ and ranging over a universe of discourse $U$, which is associated with the base variable $u$; $G$ is a syntactic rule for generating the names, $X$, of values of $\mathcal{X}$; and $M$ is a semantic rule for associating with each $X$ its meaning, $M(X)$, which is a fuzzy subset of $U$. The meaning, $M(X)$, of a term $X$ is defined to be the restriction, $R(X)$, on the base variable $u$, which is imposed by the fuzzy variable named $X$. For example, we can consider a linguistic variable named Number, which is associated with the finite term-set $T(\textit{Number}) = \textit{few} + \textit{several} + \textit{many}$, where $+$ denotes union, and in which each term represents a restriction on the values of $u$ in the universe of discourse $U = 1 + 2 + \cdots + 10$. Linguistic variables can be used to formalize knowledge in the form of fuzzy propositions.
While each classical proposition (i.e., a sentence in some language) is required to be either true or false, the truth of fuzzy propositions is a matter of degree. The canonical form of the fuzzy proposition, $p$, is expressed (Klir and Yuan 1995) by the sentence
$$p\colon\ \mathcal{V} \text{ is } F, \tag{1}$$
where $\mathcal{V}$ is a linguistic variable with the base variable $v$ defined on some universal set $V$, and $F$ is a fuzzy set on $V$ that represents a fuzzy predicate. Given a particular value of $v$, this value belongs to $F$ with membership grade $\mu_F(v)$. This membership grade is then interpreted as the degree of truth, $T(p) = \mu_F(v)$, of proposition $p$. Of particular interest for the task of building fuzzy models of groups are conditional propositions (fuzzy rules), expressed by the canonical form (Klir and Yuan 1995)
$$p\colon\ \text{if } \mathcal{X} \text{ is } A, \text{ then } \mathcal{Y} \text{ is } B, \tag{2}$$
where $\mathcal{X}$ and $\mathcal{Y}$ are linguistic variables with the base variables $x$ and $y$ whose values are in sets $X$ and $Y$, respectively; $A$ and $B$ are fuzzy sets on $X$ and $Y$, respectively. Antecedents (left parts) of fuzzy rules can contain more than one linguistic variable:
$$p\colon\ \text{if } \mathcal{X}_1 \text{ is } A_1 \text{ and } \mathcal{X}_2 \text{ is } A_2 \text{ and } \ldots \text{ and } \mathcal{X}_n \text{ is } A_n, \text{ then } \mathcal{Y} \text{ is } B, \tag{3}$$
where the logical connective and can be interpreted as a proper fuzzy intersection (Zadeh 1965). In Chertov and Tavrov (2015), an expert-based procedure was proposed for building a fuzzy model of a given group to be protected in the form of a fuzzy inference system (Klir and Yuan 1995), i.e., a system that employs expert knowledge in the form of fuzzy rules for making inferences. Such a fuzzy model can then be thought of as a fuzzy classifier that assigns to a given respondent a certain grade of membership in the group. One of the biggest challenges in creating a fuzzy model of a group is coming up with a comprehensive and complete set of rules. When the number of input variables is relatively large, the total number of consistent fuzzy rules can grow beyond the point at which it is all but impossible to formalize them using subjective expert knowledge. In some cases, the problem is not only that of defining proper fuzzy rules, but also of deciding which variables to account for in the antecedents.
For instance, in the case of building a fuzzy model of a group of military personnel, the choice needs to be made as to what microfile attributes need to be considered to make an accurate classification of a given respondent as a military person. In many practical tasks, there is no way of knowing this beforehand, so appropriate efficient search algorithms should be applied, such as evolutionary algorithms.
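The degree of truth of a fuzzy rule antecedent can be illustrated with a minimal sketch. The triangular membership functions, the attribute names, and all numeric breakpoints below are hypothetical and chosen for illustration only; the arithmetic product stands in for the fuzzy intersection.

```python
def triangular(a, b, c):
    """Return a triangular membership function with feet a and c and peak b (a < b < c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# Hypothetical linguistic values of two hypothetical base variables.
age_young = triangular(15, 25, 40)
income_low = triangular(0, 10_000, 30_000)

def antecedent_truth(age, income):
    """Degree of truth of 'Age is young and Income is low', with 'and' taken as product."""
    return age_young(age) * income_low(income)
```

A respondent aged 20 with an income of 20,000 satisfies each proposition to degree 0.5, so the conjunctive antecedent holds to degree 0.25.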

Evolutionary approach to building fuzzy rules

Evolutionary algorithms are heuristic generate-and-test algorithms that mimic biological evolution by natural selection (Eiben and Smith 2015, p. 5). The task of creating a fuzzy rule set that enables us to violate group anonymity is a complex one, therefore utilizing evolutionary algorithms is a suitable approach to solving this problem. Historically, application of evolutionary and, in particular, genetic algorithms to evolving rule-based systems was first proposed in Holland (1976) in the context of learning classifier systems. Such systems were described (Eiben and Smith 2015, p. 108) as a framework for studying learning in condition:action rule based systems, using genetic algorithms as the method for the discovery of new rules. Over the years, evolutionary algorithms have been proposed for evolving fuzzy rules as well. For instance, in Ishibuchi et al. (1995, 1999), there was proposed an evolutionary algorithm for evolving fuzzy classifiers, i.e., rule based systems with fuzzy rules for solving classification tasks. In such systems, consequents (right parts) of the rules in the form (3) are labels of classes of interest rather than linguistic variables. The task of evolving fuzzy rules for violating group anonymity can be viewed as a task of subgroup discovery, which is defined (Wrobel 1997) as the task of finding interesting subgroups in a population of individuals, where interestingness is defined as distributional unusualness with respect to a certain property of interest. Subgroup discovery represents (Jesus et al. 2007) a form of supervised inductive learning, in which, given a set of data and a property of interest to the user, an attempt is made to locate subgroups that are statistically most interesting for the user. Since the subgroups discovered in data need to be of a more explanatory nature (interpretability of the extracted knowledge for the final user is a crucial aspect), a fuzzy approach (Jesus et al. 
2007) for a subgroup discovery process, which considers linguistic variables in descriptive fuzzy rules, is a good approach to take. It is important to make a distinction between subgroup discovery and the task of classification, because Carmona et al. (2014) subgroup discovery attempts to describe knowledge by data while a classifier attempts to predict the target value for new data to incorporate in the model. In the context of a fuzzy model of a group of respondents, whose anonymity needs to be violated, we are more interested in the classification side. However, many ideas from the field of subgroup discovery can provide useful insight, as will be shown in the paper. An overview of recent developments in the field of subgroup discovery can be found in Atzmueller (2015). Evolutionary algorithms for subgroup discovery are discussed in Carmona et al. (2014). In general, there can be distinguished two approaches to evolving rule-based systems: Michigan approach (Valenzuela-Rendón 1991) and Pittsburgh approach (Smith 1980). In the first case, each individual in the evolutionary algorithm population corresponds to a single rule. In the second case, each individual is a complete model, i.e., the whole set of rules. In the extraction of rules for the subgroup discovery task, the Michigan approach is more suited because (Jesus et al. 2007) the objective is to find a reduced set of rules, in which the quality of each rule is evaluated independently of the rest, and it is not necessary to evaluate jointly the set of rules. Moreover, the computation load of the Pittsburgh approach is typically much higher (Ishibuchi et al. 1999, p. 616). Rules used for describing a subgroup differ in their ability to describe an interesting subgroup, which is measured by a certain quality measure. In general, quality measures can be grouped (Freitas 1999) into objective and subjective measures. 
Since subjective measures involve experts for evaluating rules, we will focus only on objective measures, which are data-driven and do not involve expert judgment. A comprehensive overview of quality measures can be found in Lavrač et al. (1999). However, for the task of violating anonymity of a group of respondents with the help of fuzzy rules in terms of disclosing outliers in the quantity signal, the quality measures described in the literature are not suitable. We are interested in the cumulative classification properties of fuzzy rules. In other words, we tolerate a certain degree of misclassification, as long as the outliers in the quantity signal obtained with the help of the fuzzy rules correspond to the ones in the original quantity signal. In this work, we propose a novel quality measure that takes this into account. We also propose a version of an evolutionary algorithm for building a fuzzy model of a group as a set of fuzzy rules, which differs from the ones described in the literature in the quality measure used for evaluating fuzzy rules. The fuzzy model evolved using such an algorithm can be used for violating group anonymity in terms of disclosing outliers in the quantity signal.
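To make the distinction between the Michigan and Pittsburgh approaches concrete, the following sketch shows one possible Michigan-style encoding, in which an individual is a single rule. The integer encoding, the use of index 0 as a value that discards an attribute, and the operators below are illustrative assumptions and not necessarily the representation used later in this work.

```python
import random

DONT_CARE = 0  # assumed convention: term index 0 means the attribute is ignored

def random_rule(terms_per_variable, rng):
    """One Michigan-style individual: a term index (or DONT_CARE) per linguistic variable."""
    return [rng.randint(0, n) for n in terms_per_variable]

def mutate(rule, terms_per_variable, rate, rng):
    """Resample each gene independently with probability `rate`."""
    return [
        rng.randint(0, n) if rng.random() < rate else gene
        for gene, n in zip(rule, terms_per_variable)
    ]

def one_point_crossover(a, b, rng):
    """Swap the tails of two rules of equal length (at least two genes)."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```

Under the Pittsburgh approach, an individual would instead be a whole list of such rules, and fitness would be assigned to the list rather than to each rule.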

Group anonymity basics

To set a stage for discussing the fuzzy model of a group, we will first introduce some basic notation.

General group anonymity definitions

Let us define microdata as the data about certain respondents presented in the form of a depersonalized microfile $M$ (i.e., a microfile without identifiers). Each record $r_i$, $i = 1, \ldots, \rho$, in this microfile contains values of several attributes $w_j$, $j = 1, \ldots, \eta$. Let us denote by $W_j$ the set of all the values of $w_j$. There are two types of attributes of the microfile necessary to define a group. Let $w_{v_j}$, $j = 1, \ldots, l$, denote vital microfile attributes. These attributes represent those characteristics of records that enable us to determine whether they belong to a group or not. Let us define a vital value combination $V$ as an element of the Cartesian product $W_{v_1} \times W_{v_2} \times \cdots \times W_{v_l}$. Let us denote a set of vital value combinations by $\mathbf{V} = \{V_1, \ldots, V_{l_v}\}$. We will call records whose attribute values belong to $\mathbf{V}$ vital records. We will denote vital records by $r_i^{(V)}$, $i = 1, \ldots, \rho^{(V)}$. Let $w_p$, $p \neq v_j$ for all $j = 1, \ldots, l$, denote a parameter microfile attribute. This attribute determines the values over which we should take the distribution of a group defined by the vital attributes. A parameter value $P$ can be defined as a value of the parameter attribute, i.e., $P \in W_p$. Let us denote a set of parameter values by $\mathbf{P} = \{P_1, \ldots, P_{l_p}\}$. By their nature, parameter values enable us to divide $M$ into several submicrofiles. Each submicrofile $M_k$ contains $\rho_k$ records, $k = 1, \ldots, l_p$. All the records in a certain submicrofile $M_k$ share the same parameter value $P_k$. A word of caution is in order. Throughout this paper, we will assume that if $M$ contains several attributes that can be concatenated to form a single parameter attribute, they will be concatenated. We will call all the other attributes $w_{b_j}$, $j = 1, \ldots, t$, $b_j \neq v_i$, $b_j \neq p$, basic attributes. Obviously, $t = \eta - l - 1$. The group of records $G(\mathbf{V}, \mathbf{P})$, whose distribution needs to be masked when providing group anonymity, can be determined by the values of the vital and parameter attributes. We will denote the distribution of $G$, whose sensitive features need to be protected, by $\Omega(M, G)$. In consistency with existing literature, we will call this distribution the goal representation of a group.
Throughout this paper, we will limit ourselves to a particular goal representation most widely used in practice called the quantity signal. This signal is denoted by $q = (q_1, q_2, \ldots, q_{l_p})$, where each $q_k$, $k = 1, \ldots, l_p$, stands for the number of records in $M_k$ that belong to $G$, i.e., whose vital attribute values belong to $\mathbf{V}$.
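The definitions above can be illustrated with a short sketch that computes a quantity signal, together with the concentration signal mentioned in the related work, from a toy microfile. Records are modeled as dictionaries; the function and argument names are hypothetical and serve only to mirror the vital and parameter attributes.

```python
from collections import Counter

def quantity_signal(records, vital_attrs, vital_combinations, param_attr, param_values):
    """Count, per parameter value, the records whose vital values form a vital combination."""
    counts = Counter(
        r[param_attr]
        for r in records
        if tuple(r[a] for a in vital_attrs) in vital_combinations
    )
    return [counts[p] for p in param_values]

def concentration_signal(records, quantity, param_attr, param_values):
    """Divide each quantity by the size of the corresponding submicrofile."""
    totals = Counter(r[param_attr] for r in records)
    return [q / totals[p] if totals[p] else 0.0 for q, p in zip(quantity, param_values)]
```

With a single vital attribute such as occupation and region as the parameter attribute, each element of the result is the number (or share) of group members in one region.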

Quantity signal and its sensitive features

As pointed out before, when providing group anonymity, it is necessary to protect sensitive features of the goal representation under consideration. In this work, we will consider such sensitive features of a quantity signal as its outliers. Outliers of a quantity signal might attract attention to parameter submicrofiles that are supposed to be indistinguishable (sites of military bases, classified research centers etc.). By outliers of a quantity signal, we will understand its values that are statistically inconsistent with the rest of the signal. Several approaches to determining outliers in a given dataset have been proposed. According to the American National Standard of the American Society of Mechanical Engineers ASME PTC 19.1 (ASME 2013, p. 78), two tests are in common usage, the Thompson $\tau$ Technique (Thompson 1935) and the Grubbs Method (Grubbs 1969). In this work, we propose to use the Modified Thompson $\tau$ Technique (MTTT) as the method recommended by ASME (2013, p. 79) for identifying suspected outliers. This method is based on the Student’s t-distribution (Student 1908), which is most applicable in situations when the sample size is small, which is typically the case with quantity signals. Let the values of the quantity signal $q$ be arranged in increasing order. To determine outliers in this signal, one needs to carry out the following steps:
1. Calculate the sample mean and the sample standard deviation:
$$\bar{q} = \frac{1}{l_p} \sum_{i=1}^{l_p} q_i, \qquad S = \sqrt{\frac{1}{l_p - 1} \sum_{i=1}^{l_p} \left(q_i - \bar{q}\right)^2}, \tag{4}$$
where $l_p$ is the number of elements in $q$.
2. For each signal value $q_i$, $i = 1, \ldots, l_p$, calculate the absolute value of its deviation from $\bar{q}$ as
$$\delta_i = \left| q_i - \bar{q} \right|. \tag{5}$$
3. Calculate $\tau$ according to
$$\tau = \frac{t_{\alpha/2} \, (l_p - 1)}{\sqrt{l_p} \, \sqrt{l_p - 2 + t_{\alpha/2}^2}}, \tag{6}$$
where $t_{\alpha/2}$ is the critical Student’s t value (Student 1908) based on significance level $\alpha$ and $l_p - 2$ degrees of freedom.
4. If there is such $i$ that $\delta_i > \tau S$, then $q_i$ is an outlier. In this case, we need to remove $q_i$ from the signal and return to step 1. If $\delta_i \le \tau S$ for all $i$, the algorithm stops.
Statistical characteristics (4) are not robust to the presence of outliers in a signal, so other characteristics have been proposed (Lanzante 1996): the median, which can be interpreted as the “middle” value of a signal and, for signal values arranged in increasing order, is estimated by
$$\tilde{q} = \begin{cases} q_{(l_p + 1)/2}, & l_p \text{ is odd}, \\ \frac{1}{2}\left(q_{l_p/2} + q_{l_p/2 + 1}\right), & l_p \text{ is even}, \end{cases} \tag{7}$$
and the pseudo-standard deviation, which can be defined based on the interquartile range (IQR):
$$S_{ps} = \frac{\mathrm{IQR}}{1.349} = \frac{q_{up} - q_{low}}{1.349}, \tag{8}$$
where $q_{up}$ ($q_{low}$) is the upper (lower) quartile. If $l_p$ is even, the upper (lower) quartile is the median of the largest (smallest) $l_p/2$ observations. If $l_p$ is odd, the upper (lower) quartile is the median of the largest (smallest) $(l_p + 1)/2$ observations. In this work, we will use the MTTT as described above, where estimates (7) and (8) are used in place of estimates (4). A set of outliers yielded by MTTT typically contains signal elements that would not be considered outliers by an expert. Moreover, in some practical cases, not all outliers need to be masked. For example, when there is a well-known military base associated with a particular signal element, masking the corresponding outlier will distort the data and make it obvious that the primary data have been tampered with. Therefore, in the context of providing group anonymity, it is necessary for an expert to revise the set of outliers as determined by MTTT. Let us denote by $I$ the set of indexes of $q$ that correspond to outliers yielded by MTTT. Let us denote by $I^* \subseteq I$ the subset of indexes of $q$ obtained by excluding from $I$ those indexes which an expert considers not important for the task at hand. For brevity, we will also denote by $\bar{I}^*$ the relative complement of $I^*$ with respect to $\{1, \ldots, l_p\}$.
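The procedure with robust estimates can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: the significance level is fixed at 0.05 and the critical values are hard-coded from a standard two-sided Student's t table (so signals of at most 22 elements are supported), only the value with the largest deviation is tested on each pass, and the names `robust_estimates` and `mttt_outliers` are hypothetical.

```python
import math
import statistics

# Two-sided critical Student's t values for significance level 0.05,
# keyed by degrees of freedom (standard table values).
T_CRIT_05 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
    11: 2.201, 12: 2.179, 13: 2.160, 14: 2.145, 15: 2.131,
    16: 2.120, 17: 2.110, 18: 2.101, 19: 2.093, 20: 2.086,
}

def robust_estimates(values):
    """Return the median and the pseudo-standard deviation IQR / 1.349."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2 if n % 2 == 0 else (n + 1) // 2
    iqr = statistics.median(xs[-half:]) - statistics.median(xs[:half])
    return statistics.median(xs), iqr / 1.349

def mttt_outliers(signal):
    """Return sorted indexes of the signal values flagged as outliers."""
    if len(signal) - 2 > max(T_CRIT_05):
        raise ValueError("signal too long for the hard-coded t table")
    remaining = dict(enumerate(signal))
    outliers = []
    while len(remaining) >= 3:
        n = len(remaining)
        center, spread = robust_estimates(remaining.values())
        t = T_CRIT_05[n - 2]  # n - 2 degrees of freedom
        tau = t * (n - 1) / (math.sqrt(n) * math.sqrt(n - 2 + t * t))
        # Test the value that deviates most from the median.
        idx = max(remaining, key=lambda i: abs(remaining[i] - center))
        if abs(remaining[idx] - center) <= tau * spread:
            break
        outliers.append(idx)
        del remaining[idx]
    return sorted(outliers)
```

For the signal `[10, 12, 11, 13, 12, 11, 95, 10, 12, 11]`, the sketch flags index 6 (the value 95) and, on the second pass, index 3 (the value 13). The second flag is the kind of result an expert would likely discard when revising the outlier set.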

The task of providing group anonymity

To solve the task of providing group anonymity (TPGA), we need to modify the original microfile $M$ in order to obtain a new, protected one, $M^*$. Such modification needs to meet three conditions (Chertov and Pilipyuk 2011, p. 339): disclosure risk is low or at least adequate to the importance of the information being protected; both original and protected microfile data, when analyzed, yield sufficiently similar results; the cost of transforming the data is acceptable. In this paper, by the TPGA, we will understand the task of modifying the microfile in such a way that it is no longer possible to determine outliers in the quantity signal, while introducing as little distortion as possible in the process. The easiest “solution” to the TPGA is to recode vital values or remove some of the vital attributes, so that it is impossible to restore the original quantity signal. However, this approach satisfies only one of the three conditions stated above, namely, it is easy to carry out. At the same time, this simplistic approach only gives an impression of reducing the disclosure risk. As we will demonstrate later, if an adversary has access to appropriate third-party data, sensitive features of the group distribution can still be disclosed under certain conditions. Therefore, even if we choose to remove the vital attributes (or otherwise modify them), we will still need to perform additional microfile modifications in order to properly protect anonymity of a given group.

Auxiliary microfiles

Let us further on assume that all the vital attributes are removed from . Let us denote by the harmonized version of , which can be obtained from by means of two basic transformations: attributes are replaced by a single harmonized attribute; several values of the attribute , , are replaced by a single value of the harmonized attribute, which may or may not be equal to any of the values in . Let us denote by the auxiliary microfile with records denoted by , which has the following properties: records in and in are drawn from sufficiently similar distributions; contains auxiliary vital attributes that have the same values and interpretation as the vital attributes in . Auxiliary vital attributes can be used to determine auxiliary vital records, whose total number is . In addition, vital and auxiliary vital records (as well as the records that are not vital or auxiliary vital, respectively) are drawn from sufficiently similar distributions; and can be transformed into their harmonized versions and , so that their basic attributes are identical both in terms of values and their interpretation. More precisely, and contain harmonized basic attributes , ; value combinations of attributes , , can be used to determine membership grades of each record , , in a group G, whose anonymity needs to be violated; an adversary has access to . It is worth noting that it is not required to harmonize parameter attribute in the original microfile or its analogy in the auxiliary one. Throughout this paper, we will without loss of generality assume that and remain intact during the harmonization process. If the conditions given above are met, it is possible to build a set of fuzzy rules to determine membership grades , , of each record in a group. This set of rules can be interpreted as a fuzzy model of the group whose anonymity needs to be violated. 
This model enables us to construct an auxiliary quantity signal, where , , are defined bywhere is the parameter submicrofile of , whose records share the same parameter value ; is the group membership threshold used to cut off records that don’t belong to G with a sufficiently high grade. Throughout this paper, we will use . The auxiliary quantity signal doesn’t have to be close in a numerical sense to the original quantity signal —it is only required that outliers in correspond to those ones in .

Fuzzy rules in a fuzzy model of a group

In order to construct the auxiliary quantity signal as defined by (9), we need to calculate membership grades of each microfile record , . In general, this can be done using appropriate fuzzy rules. For the case of a fuzzy model of a group, such fuzzy rules can be presented in the following form:where , , denotes the fuzzy rule, denotes the value of the linguistic variable used in the fuzzy rule, G denotes the class of records that belong to a group. Each linguistic variable in the fuzzy rules, , corresponds to the attribute , , in the harmonized microfile ( or ). It has several values , , with their membership functions denoted by . In addition, each linguistic variable by default has a value with the membership function . If is present in a fuzzy rule , it means that the actual value of attribute is discarded. As pointed out in Ishibuchi et al. (1999), in this way we can obtain fuzzy rules of different generalization capacity. For each linguistic variable, we can define a range of acceptable values of a corresponding base variable. All the records from and , whose values of attributes lie outside the specified ranges, , need to be removed. In order not to complicate the notation, we will further on assume that and denote microfiles that contain only those records, whose attribute values lie inside corresponding ranges, unless specified otherwise. Similarly, we will further on assume that values and denote the total number of records in and , respectively, where and denote either original microfiles or microfiles with records whose attribute values belong to specified ranges, depending on the context. In what follows, we will make use of notation accepted in the subgroup discovery field. Let us define the antecedent part compatibility (Jesus et al. 2007) as the degree of compatibility between a record and the antecedent part of aswhere is the membership function of the fuzzy set , denotes a proper fuzzy intersection. 
Throughout this paper, we will use arithmetic product as the fuzzy intersection. To account for the group membership threshold alpha introduced in (9), we will further on use the following modification of (11): Then, we can say thatwhere denotes fuzzy union (Zadeh 1965). In this work, we will use maximum function as the fuzzy union. We say that a record verifies the antecedent part of if , and that it is covered by if additionally . In the context of violating group anonymity in terms of disclosing outliers in the auxiliary quantity signal, we are interesting in cumulative classification properties of the fuzzy rules. In other words, we allow ourselves for a certain degree of misclassifications, as long as outliers in the auxiliary quantity signal obtained with the help of the fuzzy rules correspond to the ones in the original quantity signal. Therefore, we need to introduce quality measures that are different from the ones described in the literature: a fuzzy rule should have reasonable discriminative capability: which means that rule classifies as belonging to the group G a disproportionally bigger number of auxiliary vital records than auxiliary records in general. We will introduce a discriminative factor defined by a fuzzy rule should have reasonable relative confidence: which means that incorrectly classifies no more than records as belonging to G, where will be called the relative confidence threshold. We will introduce the relative confidence factor defined by It can be recognized that the minuend from (14) is a fuzzy version of a well-known quality measure called support, and the subtrahend is a fuzzy version of another quality measure called coverage (Lavrač et al. 2004). Support considers the number of examples satisfying both the antecedent and the consequent parts of the rule, whereas coverage measures the percentage of examples covered on average by one rule. 
It can also be recognized that (16) resembles the quality measure called confidence introduced in Jesus et al. (2007). However, our version differs in the denominator. Classically, the division is performed over the sum of the degrees of membership of all the records that verify the antecedent part of the rule, whereas in our version we consider only those records that verify the antecedent part of the rule and don't belong to G. In our view, this makes the quality measure easier to interpret, because it shows directly how many respondents the rule classifies incorrectly, in relative terms. In a fuzzy model of a group, each rule needs to have quality measures that satisfy both of the requirements above. In this case, we will reduce misclassifications, and thereby obtain a more suitable auxiliary quantity signal. The auxiliary quantity signal contains all the information necessary to violate group anonymity. On the other hand, to protect group anonymity, we need to use a signal that consists of crisp values representing numbers of respondents, not fuzzy degrees. Let us introduce a crisp auxiliary quantity signal (17). Its values correspond to quantities of records in a corresponding microfile, which are assigned a membership grade greater than . We will make use of the signal defined in this way when we discuss the method for protecting group anonymity in one of the subsequent sections. As mentioned earlier, due to complicated interrelations between different rules in the rule base, it is virtually impossible to construct the rule base from scratch using only expert knowledge. In the sections to follow, we will present an appropriately tailored evolutionary algorithm for solving this task.
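The way membership grades follow from a rule base can be illustrated with a minimal Python sketch. It assumes that a rule is a tuple of fuzzy value indexes, that index 0 stands for the discarded ("don't care") value with membership grade 1, and that the threshold modification simply zeroes grades below alpha. The function names and the `membership` data layout are hypothetical and are not taken from the paper.

```python
from math import prod

def antecedent_compatibility(record, rule, membership):
    """Product (fuzzy intersection) of the grades of all antecedent conditions.

    membership[j][k] is a callable giving the grade of the k-th fuzzy value
    of the j-th linguistic variable (hypothetical layout).
    """
    grades = []
    for j, value_index in enumerate(rule):
        if value_index == 0:
            continue  # assumed "don't care" value: grade 1, no effect on product
        grades.append(membership[j][value_index](record[j]))
    return prod(grades)  # the empty product equals 1

def thresholded(grade, alpha):
    # Assumed form of the threshold modification: grades below alpha are zeroed.
    return grade if grade >= alpha else 0.0

def membership_grade(record, rules, membership, alpha):
    """Fuzzy union (maximum) of the thresholded compatibilities over all rules."""
    return max(
        (thresholded(antecedent_compatibility(record, r, membership), alpha)
         for r in rules),
        default=0.0,
    )
```

Summing or counting the resulting grades per parameter value would then give a fuzzy or crisp auxiliary quantity signal, respectively.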

Adequacy of the fuzzy model of a group

In this section, we will briefly discuss possible tests for evaluating adequacy of the fuzzy model of the group described above. By adequacy of the fuzzy model we will mean its ability to correctly determine outliers in the quantity signal, i.e., how similar the outliers in the original and auxiliary quantity signals are. It therefore seems natural to evaluate model adequacy using tests designed to evaluate accuracy of classifiers. Let X be the multidimensional pattern space under investigation, each element of which belongs to one of two classes, and consider the unknown joint distribution over patterns and classes. Let us be given a classifier that maps each pattern to a certain class. The classifier error is the expected rate of misclassification, where E is the expectation operator. Since in practical cases X is typically a set of finite size, the error can only be estimated. Let S be a set of m pairs drawn from the joint distribution. Let us introduce the confusion matrix (18) (Olivetti et al. 2012), where TP (true positive) is the number of patterns from S that belong to the positive class and are assigned to it by the classifier; FP (false positive) is the number of patterns from S that belong to the negative class but are assigned to the positive one; FN (false negative) is the number of patterns from S that belong to the positive class but are assigned to the negative one; TN (true negative) is the number of patterns from S that belong to the negative class and are assigned to it. The sum of values of (18) is m. Let us denote by e the number of incorrectly classified patterns. Then, e is the sum of FP and FN. The prediction accuracy (19) is defined as the fraction of correctly classified patterns, (m − e)/m. When the number of patterns per class is not equal, a setting is called unbalanced. As was shown in Olivetti et al. (2012), test (19) is not suitable for unbalanced data. One of the tests suitable for unbalanced data is Youden's J statistic (Youden 1950), which explicitly captures the type I and type II errors. In Olivetti et al. (2015), a Bayesian test of statistical independence was proposed between the results given by the classifier, on the one hand, and the true distribution, on the other hand.
This test also takes into account the unbalanced nature of the data and the size of the data set. Let us denote by H0 the hypothesis that the results given by the classifier are statistically independent of the true distribution. Let us also denote by H1 the hypothesis that such results are statistically dependent. Then, let us denote by B the Bayes factor (21) that measures the evidence of the data in favor of H1 with respect to H0; its definition involves non-negative integer parameters. The test MB for evaluating the classifier is calculated from (21). Guidelines for the interpretation of this test are given in Table 1 (Kass and Raftery 1995).
Table 1

Guidelines for the interpretation of MB in terms of the strength of evidence in favor of H1 against H0

MB | <0 | 0–1 | 1–3 | 3–5 | >5
H1 strength | Negative | Bare mention | Positive | Strong | Decisive
In the context of evaluating the adequacy of the fuzzy model of a given group, the pattern space has to be taken as the set of parameter values. The positive class contains those parameter values that correspond to outliers in the original quantity signal; the negative class contains all the other parameter values. The auxiliary quantity signal can differ from the original one in two ways:
some of the outliers in the original quantity signal don't have a correspondence in the auxiliary one, i.e., we cannot violate anonymity of some of the outliers (type II errors). We will call such outliers undisclosed outliers;
some of the outliers in the auxiliary quantity signal don't have a correspondence in the original one, i.e., the fuzzy rules introduce additional outliers not supported by real data (type I errors). We will call such outliers false outliers.
Taking into consideration the notation introduced earlier, elements of the confusion matrix (18) can be defined as follows: TP is the number of outliers of the original quantity signal that are also outliers of the auxiliary one; FP is the number of false outliers; FN is the number of undisclosed outliers; TN is the number of the remaining parameter values.
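The adequacy tests discussed in this section can be sketched in Python as follows. The sketch assumes that outliers are given as sets of parameter values and that both sets are subsets of the full set of parameter values; the function names are hypothetical. The accuracy and Youden's J expressions are the standard textbook ones; the Bayesian test is not reproduced.

```python
def confusion_from_outliers(parameter_values, original_outliers, auxiliary_outliers):
    """Return (TP, FP, FN, TN) for outliers of the auxiliary signal
    judged against outliers of the original signal."""
    original, auxiliary = set(original_outliers), set(auxiliary_outliers)
    tp = len(original & auxiliary)   # disclosed outliers
    fn = len(original - auxiliary)   # undisclosed outliers (type II errors)
    fp = len(auxiliary - original)   # false outliers (type I errors)
    tn = len(set(parameter_values)) - tp - fn - fp
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified parameter values."""
    return (tp + tn) / (tp + fp + fn + tn)

def youden_j(tp, fp, fn, tn):
    """Sensitivity + specificity - 1; undefined if either class is empty."""
    return tp / (tp + fn) + tn / (tn + fp) - 1
```

Because outliers are rare by nature, the setting is strongly unbalanced, which is why accuracy alone can look high even when most outliers remain undisclosed.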

General approach to applying fuzzy rules to violating group anonymity

In general, to violate anonymity of a certain group G in a microfile in terms of disclosing outliers in its quantity signal, we need to proceed along the following steps:
1. Harmonization
(a) Choose a microfile and determine a group G of records whose distribution should be disclosed.
(b) Choose an auxiliary microfile that satisfies all the conditions given earlier.
(c) Perform harmonization of the two microfiles and obtain harmonized microfiles that have identical attributes with two exceptions: parameter attributes in both harmonized microfiles may not be identical, and the harmonized auxiliary microfile contains auxiliary vital attributes, whereas the harmonized original microfile has vital attributes removed.
2. Input Variables Identification
(a) For each linguistic variable corresponding to a basic harmonized attribute, define a range of acceptable values of its base variable.
(b) Remove from both harmonized microfiles the records whose attribute values lie outside the specified ranges.
(c) Use expert judgment to determine the fuzzy values for each linguistic variable, defined by appropriate membership functions.
3. Evolution
(a) Use the evolutionary algorithm to evolve fuzzy rules for violating anonymity of G in the harmonized original microfile based on the data from the harmonized auxiliary microfile.
(b) To reduce the number of undisclosed and false outliers, select only those rules R for which the required conditions on the discriminative factor and the relative confidence factor hold, and whose support is greater than a predefined value.
(c) To reduce computational overhead, remove rules that are more specific versions of other rules in the set, i.e., for each pair of rules, if one of them is a more specific version of the other, remove the more specific one.
(d) Using the fuzzy rules obtained, assign membership grades to all the records in the harmonized original microfile, uniting the results in the fuzzy sense.
4. Disclosing Outliers
Construct the auxiliary quantity signal (9) and determine outliers in it.
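The pruning of more specific rules mentioned in the Evolution step can be sketched as follows. This is only an illustration under an assumption: rules are taken to be tuples of fuzzy value indexes in which 0 means that the attribute is discarded, so that one rule is a more specific version of another if it agrees with it on every attribute the other rule actually uses. The function names are hypothetical.

```python
def is_more_specific(rule_a, rule_b):
    """True if rule_a is a strictly more specific version of rule_b.

    Index 0 is assumed to mean that the attribute is not used by the rule.
    """
    if rule_a == rule_b:
        return False
    return all(b == 0 or a == b for a, b in zip(rule_a, rule_b))

def prune_specific(rules):
    """Drop duplicates and every rule that specializes another rule in the set."""
    unique = list(dict.fromkeys(rules))  # keeps the first occurrence, in order
    return [r for r in unique
            if not any(is_more_specific(r, other) for other in unique)]
```

Since membership grades are united with the maximum and the product of grades cannot exceed any of its factors, a more specific rule never yields a higher grade than its generalization, which is presumably why dropping it only saves computation.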

Evolutionary algorithm for building the fuzzy model of a group

Outline of the evolutionary algorithm

In the proposed algorithm, whose outline corresponds to the outline presented in Ishibuchi et al. (1995), we perform evolution only at the level of fuzzy rules. This means that we do not perform any fine-tuning of membership functions of input variables. We choose this approach to keep the fuzzy rules in the system comprehensible for humans. The outline of the algorithm is as follows:
1. Randomly generate the initial population of individuals.
2. Calculate values of the fitness function for each individual.
3. Check the termination condition: if it is satisfied, stop; continue otherwise.
4. Select pairs of individuals and put them into the set of parents.
5. Recombine pairs of individuals from the set of parents with a recombination operator. Put the offspring into the set of offspring.
6. Mutate individuals from the set of offspring with a mutation operator.
7. Replace individuals from the current population that have the lowest fitness values with the mutated offspring.
8. Go to step 3.

Representation and fitness function

In this work, we treat each individual as a single rule in the fuzzy rule set being evolved. That is, the whole population constitutes the whole fuzzy rule set, in full concordance with the Michigan approach. We propose to represent each rule as a vector of integer values, where each element is the index of the fuzzy value of the corresponding linguistic variable. Availability of the index that stands for a discarded attribute value enables us to evolve rules that don't take into account values of the corresponding attribute. In other words, the evolutionary process can lead to obtaining more generalized rules. In this work, we evaluate the fitness of each individual in terms of its quality measures introduced earlier.
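To make the integer representation concrete, the following Python sketch decodes a rule vector into a readable rule. It borrows the fuzzy value names from the practical example given later in the paper and assumes that index k refers to the k-th listed value of a variable and that 0 marks a discarded attribute; this ordering, as well as the names `FUZZY_VALUES` and `decode`, are assumptions made for illustration.

```python
# Fuzzy values per linguistic variable, in the order of the practical example.
FUZZY_VALUES = [
    ("Age", ["Young", "Middle Aged 1", "Middle Aged 2", "Middle Aged 3", "Old"]),
    ("Educational attainment", ["Low", "High"]),
    ("Sex", ["Male", "Female"]),
    ("Race", ["White", "Black"]),
    ("Usual hours worked per week", ["Low", "Medium", "High"]),
    ("Hispanic origin", ["No", "Yes"]),
    ("Marital status", ["Married", "Not married"]),
    ("Means of transportation to work", ["Car", "Public", "Walked"]),
    ("Time of departure for work", ["Night", "Morning", "Day"]),
    ("Travel time to work", ["Little", "Medium", "Much"]),
    ("Weeks worked last year", ["Abnormal", "Normal"]),
    ("Total personal income", ["Low", "Medium", "High"]),
]

def decode(rule):
    """Turn an integer rule vector into an IF-THEN string (0 = attribute unused)."""
    conditions = [f"{name} is {values[k - 1]}"
                  for (name, values), k in zip(FUZZY_VALUES, rule) if k != 0]
    antecedent = " AND ".join(conditions) if conditions else "anything"
    return f"IF {antecedent} THEN G"
```

Under these assumptions, a vector such as (1, 0, 0, 0, 3, 0, 0, 3, 1, 0, 0, 0) would read as a rule about young respondents who work long hours, walk to work, and leave for work at night.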

Other algorithm parameters

The recombination operator should be a proper recombination operator for integer representation, applied with a high probability to two individuals and yielding two offspring individuals. The mutation operator should be a proper mutation operator for integer representation, applied with a low probability to a single individual R and yielding the mutated one. In this paper, we will use uniform crossover (Syswerda 1989) as the recombination operator and random resetting mutation (Eiben and Smith 2015, p. 43) as the mutation operator. We will also choose the following algorithm parameters:
we will choose tournament selection (Brindle 1981) as an efficient and easy to implement selection operator, with the tournament size 10;
we will create initial populations by randomly generating the value of each fuzzy rule element from a uniform distribution over its admissible indexes;
we will choose the number of generations N as a termination condition, i.e., we will terminate the algorithm after having obtained N consecutive populations.
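The algorithm outline together with the operators chosen above can be sketched in Python as follows. Tournament size 10, uniform crossover, random resetting mutation, uniform initialization, and a fixed number of generations come from the text; everything else is an assumption: the probabilities `p_cross` and `p_mut` are placeholder values, `n_values[j]` is taken to be the largest admissible index of the j-th rule element (with 0 admissible as well), and `fitness_fn` is supplied by the caller.

```python
import random

def tournament(population, fitness, size, rng):
    """Return the fittest of `size` individuals sampled without replacement."""
    contestants = rng.sample(range(len(population)), size)
    return population[max(contestants, key=lambda i: fitness[i])]

def uniform_crossover(a, b, rng):
    """Swap each gene between the two parents with probability 0.5."""
    child1, child2 = list(a), list(b)
    for j in range(len(a)):
        if rng.random() < 0.5:
            child1[j], child2[j] = child2[j], child1[j]
    return tuple(child1), tuple(child2)

def random_reset(rule, n_values, p_mut, rng):
    """Redraw each gene uniformly from 0..n_values[j] with probability p_mut."""
    return tuple(rng.randint(0, n_values[j]) if rng.random() < p_mut else v
                 for j, v in enumerate(rule))

def evolve(fitness_fn, n_values, pop_size, n_pairs, generations,
           p_cross=0.9, p_mut=0.05, tournament_size=10, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, n) for n in n_values)
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [fitness_fn(r) for r in population]
        offspring = []
        for _ in range(n_pairs):
            a = tournament(population, fitness, tournament_size, rng)
            b = tournament(population, fitness, tournament_size, rng)
            if rng.random() < p_cross:
                a, b = uniform_crossover(a, b, rng)
            offspring.append(random_reset(a, n_values, p_mut, rng))
            offspring.append(random_reset(b, n_values, p_mut, rng))
        # Replace the least fit individuals with the mutated offspring.
        worst_first = sorted(range(pop_size), key=lambda i: fitness[i])
        for index, child in zip(worst_first, offspring):
            population[index] = child
    return population
```

In the Michigan setting the returned population is itself the candidate rule set, so rules would typically be collected across generations and filtered by their quality measures afterwards.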

Memetic algorithm for protecting group distributions

General information

In previous sections, we have shown that the TPGA is a pressing one, and group distributions need to be protected even when vital attributes are removed from the microfile. In this section, we will discuss the memetic algorithm (MA) for solving the task of providing group anonymity. This algorithm was introduced in Chertov and Tavrov (2014), and we will heavily rely on that publication when presenting the algorithm here. We will assume that the data publisher decides to remove vital attributes from the microfile. As pointed out before, to provide group anonymity, we need to mask outliers in an auxiliary quantity signal obtained using appropriate fuzzy rules. The general outline of a single-stage approach to solving the TPGA is as follows:
1. Prepare a (depersonalized) microfile representing data to be anonymized.
2. Define k groups of respondents whose quantity signals need to be masked.
3. For each i from 1 to k:
(a) Build the quantity signal for the i-th group.
(b) Obtain fuzzy models of the i-th group using the evolutionary algorithm.
(c) Build the auxiliary quantity signal for the i-th group using the obtained fuzzy models, and the corresponding crisp auxiliary quantity signal.
(d) Compare the two signals and determine whether there is risk of violating group anonymity in terms of disclosing their outliers.
(e) If there is such risk, define the modifying transformation, obtain the modified crisp auxiliary quantity signal, and hence the modified microfile.
4. Prepare the modified microfile for publishing.
In order to modify the auxiliary quantity signal for a given group in a given microfile, we need to physically alter some of the values in the microfile, more precisely, alter parameter values for certain records. To preserve the number of records with a particular parameter value, the records have to be altered in pairs, which can be interpreted as swapping the records between submicrofiles. One record needs to belong to the fuzzy model of a group, and another needs not to.
As mentioned before, to solve the TPGA means not only to modify the auxiliary quantity signal, but also to introduce as little distortion into the microfile as possible. To this end, the records being swapped have to be close to each other in some sense. In this work, we will apply the influential metric (Chertov 2010) to determine the degree of similarity between two microfile records. This metric is defined in terms of so-called influential attributes, i.e., those whose distribution is important for further research using microfile data. In this work, we will assume that influential attributes are the same as the basic harmonized attributes. The influential metric (25) is defined in terms of the ordinal basic attributes, the categorical basic attributes, an operator that takes one value if two attribute values fall into one category and another value otherwise, and non-negative weighting coefficients (the bigger the coefficient, the more important the attribute is for the research). Preserving data utility from the minimal data distortion point of view is a task of high complexity and dimensionality; therefore, it is a good idea to use MAs (Moscato 1989) to solve the TPGA. MAs are typically implemented as evolutionary algorithms with local search procedures (Eiben and Smith 2015, p. 173). New applications of MAs to solving complex optimization tasks can be found in Kumar et al. (2014).
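A distance of this kind can be sketched in Python as shown below. The exact expression (25) is not reproduced here, so the concrete form is an assumption: ordinal attributes contribute a weighted squared relative difference, and categorical attributes contribute a weighted squared indicator that takes one value when the two values coincide and another otherwise. Equality of values is used as a stand-in for "falling into one category", and all names and default values are hypothetical.

```python
def influential_metric(r1, r2, ordinal, categorical, chi_same=0.0, chi_diff=1.0):
    """Weighted dissimilarity of two records (assumed form, see lead-in).

    ordinal / categorical map attribute names to non-negative weights.
    """
    total = 0.0
    for attr, weight in ordinal.items():
        a, b = r1[attr], r2[attr]
        if a + b != 0:  # skip the degenerate 0/0 case
            total += weight * ((a - b) / (a + b)) ** 2
    for attr, weight in categorical.items():
        chi = chi_same if r1[attr] == r2[attr] else chi_diff
        total += weight * chi ** 2
    return total

def closest_index(record, candidates, ordinal, categorical):
    """Index of the candidate record that is closest to `record`."""
    return min(range(len(candidates)),
               key=lambda i: influential_metric(record, candidates[i],
                                                ordinal, categorical))
```

A helper like `closest_index` is the kind of primitive a local search step needs when it looks for the least distorting swap partner for a given record.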

Outline of the algorithm

An outline of a memetic algorithm for modifying the microfile in order to protect outliers in the corresponding quantity signal is as follows:
1. Create population P of individuals, and apply the local search operator S to them.
2. Calculate the fitness function for each individual.
3. Check the termination condition. If it holds, stop; otherwise, go to 4.
4. Select pairs of parents.
5. Apply the recombination operator R to each parent pair.
6. Apply the mutation operator M to each of the offspring.
7. Put the offspring into the offspring set.
8. Apply the local search operator S to each individual of the offspring set.
9. Calculate the fitness function for each individual of the offspring set.
10. Select individuals from the offspring set, and put them into P in place of current ones.
11. Go to 3.
In the algorithm outline above, we made use of several symbols introduced earlier, but with a different meaning. We hope that the meaning of each symbol in each particular case is clear from the context. Each individual is a matrix U with Q rows and four columns with the following elements:
The first column contains indexes of submicrofiles to remove vital records from. The user has to define the set of such submicrofiles.
The third column contains indexes of submicrofiles to add vital records to. The user has to define the set of such submicrofiles.
The second column contains indexes of the records to be removed from the submicrofiles given in the first column.
The fourth column contains indexes of the records from the submicrofiles given in the third column that are to be swapped with the ones defined by the second column.
By its nature, each individual U uniquely defines the modified quantity signal, and also determines the particular way of obtaining it, because each row in U defines a particular pair of respondents to be swapped. Thereby, each U defines a complete solution to the TPGA at hand. Two restrictions are imposed on each individual U:
a submicrofile index i can occur in the first column of U not more than a prescribed number of times;
each pair of a submicrofile index and a record index cannot occur in U more than once.
These restrictions cannot be violated throughout the algorithm run.
In this work, we propose to use the fitness function as the product (26) of three terms: the first one gives an estimate of the solution quality in terms of minimizing microfile distortion, the second one gives an estimate of the solution quality in terms of protecting outliers in the quantity signal, and the third one is a penalty term against obtaining individuals with too many rows. The expression we propose for the first term of (26) involves the greatest possible value of the cumulative influential metric (25) and an operator yielding a given record of a given submicrofile. Other terms of the fitness function can be chosen depending on the TPGA at hand. In this work, we use the following recombination operator. It generates two random crossover points, splits each parent at appropriate points, exchanges the tails between them, and thus creates the offspring. This operator has to be applied with a high probability. We also use a mutation operator that is a superposition of the following operators:
a swap mutation operator (Syswerda 1991) applied with a small probability to the first column of U. The corresponding pairs need to be preserved;
a swap mutation operator applied with a small probability to the third column of U. The corresponding pairs need to be preserved;
a random resetting mutation operator (Eiben and Smith 2015, p. 43) applied with a small probability to the second column of U;
a random resetting mutation operator applied with a small probability to the fourth column of U.
In this work, we use the following local search memetic operator:
1. Carry out steps 2–4 .
2. Generate a uniformly distributed number .
3. If , assign to the index of a record from closest to the record defined by from in terms of (25). Otherwise, assign to the index of a record from closest to the record defined by from in terms of (25).
4. Go to step 2.
Other MA components, such as selection, initialization, termination, population size etc., should be chosen individually for each TPGA to be solved.
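How an individual U determines the modified quantity signal can be illustrated with a short Python sketch. It assumes zero-based submicrofile indexes and treats each row as moving one group record out of the submicrofile in the first column and into the one in the third column. The per-submicrofile bound `max_removals` is a hypothetical stand-in for the limit mentioned in the first restriction, and the function names are not taken from the paper.

```python
from collections import Counter

def modified_signal(signal, U):
    """Apply every swap in U to a crisp quantity signal and return the result."""
    out = list(signal)
    for source, _source_record, target, _target_record in U:
        out[source] -= 1   # a group record leaves this submicrofile
        out[target] += 1   # and is swapped into this one
    return out

def respects_restrictions(U, max_removals):
    """Check both restrictions on an individual (max_removals is hypothetical)."""
    removals = Counter(row[0] for row in U)
    if any(count > max_removals[i] for i, count in removals.items()):
        return False
    source_pairs = [(row[0], row[1]) for row in U]
    target_pairs = [(row[2], row[3]) for row in U]
    return (len(set(source_pairs)) == len(source_pairs)
            and len(set(target_pairs)) == len(target_pairs))
```

Because records are swapped in pairs, the total of the signal is preserved; only its distribution over parameter values changes.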

Results

Problem definition and microfile harmonization

To illustrate the ideas developed in this work, we set the task of violating anonymity of a group of regionally distributed military personnel in the U.S. Outliers in quantity signals representing such a distribution might point to sites of military facilities, some of which might potentially be classified. We chose the 1 % sample microfile of the American Community Survey (ACS) conducted in 2013, available from the IPUMS-International Project (Ruggles et al. 2010), as the microfile in which we would like to violate group anonymity. This microfile contains records. The microfile contains attributes Place of work: state, 1980 onward and Place of work: PUMA, 2000 onward (where PUMA stands for Public Use Microdata Area), that, if concatenated, give a unique code of a PUMA where a respondent works. We decided to replace these attributes with a single one called Place of work by concatenating the values of the attributes for each microfile record. The newly obtained attribute plays the role of the parameter attribute for our task. The microfile also contains the vital attribute Occupation, SOC classification (where SOC stands for the 2010 Standard Occupational Classification system), which enables us to uniquely identify all the military personnel in the microfile. We chose the 5 % sample microfile of the 2000 U.S. Census, also available from the IPUMS-International Project (Ruggles et al. 2010), as the auxiliary microfile. This microfile contains records. Since this microfile also contains attributes Place of work: state, 1980 onward and Place of work: PUMA, 2000 onward, we decided to replace them with the Place of work attribute in the same way as described above.
This auxiliary microfile satisfies all the necessary requirements:
records in the original and auxiliary microfiles are drawn from sufficiently similar distributions under the assumption that demographics of respondents in both microfiles haven't changed much over 13 years;
the auxiliary microfile contains an auxiliary vital attribute Occupation, SOC classification, identical to the vital attribute in the original microfile in terms of military occupations. Vital records in the original microfile and auxiliary vital ones in the auxiliary microfile are drawn from sufficiently similar distributions under the assumption that demographics of military personnel haven't changed much over 13 years. There are auxiliary vital records in ;
both microfiles contain almost identical attributes, with the exception of several technical ones.
In our example, we performed the following harmonization: we replaced the Occupation, SOC classification attribute in both microfiles with a new one, Military Personnel, which has only two values, 0 and 1. The value 1 was assigned only to those records that had one of the values of attribute Occupation, SOC classification presented in Table 2;
Table 2

Values of the Occupation, SOC classification attribute that correspond to the value 1 of the harmonized attribute Military Personnel

Attribute value | Interpretation
551010 | Military Officer Special and Tactical Operations Leaders
552010 | First-Line Enlisted Military Supervisors
553010 | Military Enlisted Tactical Operations and Air/Weapons Specialists and Crew Members
559830 | Military, Rank Not Specified
we removed all attributes from both microfiles except for Military Personnel, Place of work, and basic harmonized attributes, which we consider to be useful for building a fuzzy model of a group. Information about each basic harmonized attribute is given in Table 3, where C stands for a categorical attribute, and O stands for an ordinal one.
Table 3

Basic harmonized attributes used in the practical example

Index | Name | Type | Values
b1 | Age | O | 000: Less than 1 year old; 1…130: 1 to 130 years; 135: 135
b2 | Educational attainment [general version] | C | 00: N/A or no schooling; 01: Nursery school to grade 4; 02: Grade 5, 6, 7, or 8; 03: Grade 9; 04: Grade 10; 05: Grade 11; 06: Grade 12; 07: 1 year of college; 08: 2 years of college; 09: 3 years of college; 10: 4 years of college; 11: 5+ years of college
b3 | Sex | C | 1: Male; 2: Female
b4 | Race [general version] | C | 1: White; 2: Black/Negro; 3: American Indian or Alaska Native; 4: Chinese; 5: Japanese; 6: Other Asian or Pacific Islander; 7: Other race, nec; 8: Two major races; 9: Three or more major races
b5 | Usual hours worked per week | O | 00: N/A; 01…98: 1 to 98 hours worked per week; 99: 99 (Topcode)
b6 | Hispanic origin [general version] | C | 0: Not Hispanic; 1: Mexican; 2: Puerto Rican; 3: Cuban; 4: Other; 9: Not Reported
b7 | Marital status | C | 1: Married, spouse present; 2: Married, spouse absent; 3: Separated; 4: Divorced; 5: Widowed; 6: Never married/single
b8 | Means of transportation to work | C | 00: N/A; 10: Auto, truck, or van; 11: Auto; 12: Driver; 13: Passenger; 14: Truck; 15: Van; 20: Motorcycle; 30: Bus or streetcar; 31: Bus or trolley bus; 32: Streetcar or trolley car; 33: Subway or elevated; 34: Railroad; 35: Taxicab; 36: Ferryboat; 40: Bicycle; 50: Walked only; 60: Other; 70: Worked at home
b9 | Time of departure for work | O | 0000: N/A; other values report the time usually leaving for work last week (12:01 a.m. is coded as 0001, and 11:59 p.m. is coded as 2359)
b10 | Travel time to work | O | 000: N/A; other values are amounts of time, in minutes, it took to get to work last week
b11 | Weeks worked last year, intervalled | C | 0: N/A; 1: 1–13 weeks; 2: 14–26 weeks; 3: 27–39 weeks; 4: 40–47 weeks; 5: 48–49 weeks; 6: 50–52 weeks
b12 | Total personal income | O | A 7-digit numeric code reporting each respondent's total pre-tax personal income or losses from all sources for the previous year
b13 | Speaks English | C | 0: N/A (Blank); 1: Does not speak English; 2: Yes, speaks English...; 3: Yes, speaks only English; 4: Yes, speaks very well; 5: Yes, speaks well; 6: Yes, but not well; 7: Unknown; 8: Illegible
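The harmonization described in this section can be sketched for a single record as follows. The field names (`pw_state`, `pw_puma`, `occ_soc`) and the zero-padding widths used when concatenating the place-of-work codes are assumptions made for illustration; the military occupation codes are those of Table 2, written without thousands separators.

```python
# Occupation codes that map to Military Personnel = 1 (Table 2).
MILITARY_SOC_CODES = {"551010", "552010", "553010", "559830"}

def harmonize(record):
    """Return a harmonized copy of a record given as a dict.

    Hypothetical input fields: pw_state (int), pw_puma (int), occ_soc (str).
    """
    out = dict(record)
    state = out.pop("pw_state")
    puma = out.pop("pw_puma")
    soc = out.pop("occ_soc")
    # Concatenate state and PUMA codes into one parameter attribute
    # (the widths of 2 and 5 digits are an assumption).
    out["place_of_work"] = f"{state:02d}{puma:05d}"
    out["military_personnel"] = 1 if soc in MILITARY_SOC_CODES else 0
    return out
```

For the microfile whose group anonymity is to be violated, the `military_personnel` field would subsequently be dropped, since vital attributes are assumed to be removed by the data publisher.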

Input variables identification

In this section, we will discuss linguistic variables corresponding to basic harmonized attributes. Each linguistic variable bears the name of the corresponding attribute. Ranges of acceptable values of base variables for each linguistic variable are given in Table 4.
Table 4

Ranges of acceptable values for each linguistic variable in the practical example

Name of Lj | l(Lj) | u(Lj)
Age | 18 | 45
Educational attainment [general version] | 1 | 11
Sex | 1 | 2
Race [general version] | 1 | 2
Usual hours worked per week | 0 | 100
Hispanic origin [general version] | 0 | 9
Marital status | 1 | 6
Means of transportation to work | 0 | 70
Time of departure for work | 1 | 2359
Travel time to work | 1 | 119
Weeks worked last year, intervalled | 1 | 6
Total personal income | 0 | 200,000
Speaks English | 2 | 5
After having removed all the records whose basic harmonized attribute values don't belong to the specified ranges, we obtained the reduced microfiles used in what follows. Let us introduce several generic membership functions of one argument x and several parameters. Then, the fuzzy values of all linguistic variables, each defined by its own membership function, are as follows:
Age has 5 fuzzy values: Young, Middle Aged 1, Middle Aged 2, Middle Aged 3, Old.
Educational attainment has 2 fuzzy values: Low, High.
Sex has 2 fuzzy values: Male, Female.
Race has 2 fuzzy values: White, Black.
Usual hours worked per week has 3 fuzzy values: Low, Medium, High.
Hispanic origin has 2 fuzzy values: No, Yes.
Marital status has 2 fuzzy values: Married, Not married.
Means of transportation to work has 3 fuzzy values: Car, Public, Walked.
Time of departure for work has 3 fuzzy values: Night, Morning, Day.
Travel time to work has 3 fuzzy values: Little, Medium, Much.
Weeks worked last year has 2 fuzzy values: Abnormal, Normal.
Total personal income has 3 fuzzy values: Low, Medium, High.
We decided not to define values for the variable Speaks English. Its range of acceptable values was used to remove unacceptable records from the microfiles, but the attribute itself was not involved in the fuzzy rules evolved using the evolutionary algorithm.
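Removing records whose attribute values fall outside the acceptable ranges can be sketched in Python as follows. The bounds are read from Table 4; treating them as inclusive, keying records by attribute name, and the names `RANGES`, `within_ranges`, and `filter_records` are assumptions made for this illustration.

```python
# Lower and upper bounds of the base variables, as read from Table 4.
RANGES = {
    "Age": (18, 45),
    "Educational attainment": (1, 11),
    "Sex": (1, 2),
    "Race": (1, 2),
    "Usual hours worked per week": (0, 100),
    "Hispanic origin": (0, 9),
    "Marital status": (1, 6),
    "Means of transportation to work": (0, 70),
    "Time of departure for work": (1, 2359),
    "Travel time to work": (1, 119),
    "Weeks worked last year": (1, 6),
    "Total personal income": (0, 200_000),
    "Speaks English": (2, 5),
}

def within_ranges(record, ranges=RANGES):
    """True if every listed attribute lies inside its (assumed inclusive) range."""
    return all(low <= record[name] <= high
               for name, (low, high) in ranges.items())

def filter_records(records, ranges=RANGES):
    """Keep only the records whose attribute values are all acceptable."""
    return [r for r in records if within_ranges(r, ranges)]
```

Note that Speaks English takes part in this filtering even though it is not used in the fuzzy rules themselves.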

Generating fuzzy rules by the evolutionary algorithm

In order to evolve fuzzy rules to obtain the auxiliary quantity signal for the practical example, we applied the evolutionary algorithm with the following parameters: the population size was fixed at 100; on each iteration, we replaced worst fit individuals with the ones newly obtained by applying recombination and mutation operators; we applied the recombination operator with the probability , and the mutation operator with probability ; we performed 10 separate runs of the evolutionary algorithm, each of which lasted for generations. Of all the fuzzy rules obtained in all generations, we selected the fuzzy rules whose RCF was greater than and support was greater than . After that, we removed those rules that are more specific versions of the more general ones in the set, as described previously. Table 5 presents all of the resultant rules. For each fuzzy rule from the rule base, we specify its discriminative factor, relative confidence factor, and support. We present all the numerical values rounded to three decimal places, although the calculations were carried out with a much higher precision.
Table 5

Fuzzy rules used in the example

| R | DF | RCF | Support |
|---|---|---|---|
| (1, 0, 0, 0, 0, 0, 2, 3, 1, 1, 0, 2) | 0.032 | 0.755 | 0.032 |
| (1, 0, 0, 0, 3, 0, 0, 3, 1, 0, 0, 0) | 0.031 | 0.787 | 0.031 |
| (1, 0, 0, 0, 3, 1, 0, 3, 0, 0, 2, 1) | 0.012 | 0.801 | 0.012 |
| (1, 0, 0, 1, 0, 1, 0, 3, 1, 1, 1, 2) | 0.010 | 0.781 | 0.010 |
| (1, 0, 0, 1, 3, 0, 0, 3, 0, 0, 2, 1) | 0.012 | 0.851 | 0.012 |
| (1, 0, 1, 0, 0, 0, 0, 3, 1, 1, 0, 2) | 0.034 | 0.840 | 0.034 |
| (1, 0, 1, 0, 0, 0, 2, 3, 1, 1, 2, 0) | 0.025 | 0.765 | 0.025 |
| (1, 0, 1, 0, 3, 0, 2, 3, 2, 0, 0, 1) | 0.018 | 0.931 | 0.018 |
| (1, 0, 1, 0, 3, 1, 0, 3, 2, 0, 0, 1) | 0.017 | 0.915 | 0.018 |
| (1, 0, 1, 1, 0, 0, 0, 3, 1, 1, 2, 0) | 0.025 | 0.754 | 0.026 |
| (1, 0, 1, 1, 0, 0, 2, 3, 1, 0, 0, 2) | 0.032 | 0.751 | 0.032 |
| (1, 1, 0, 0, 3, 0, 2, 3, 2, 1, 0, 1) | 0.018 | 0.951 | 0.018 |
| (1, 1, 0, 0, 3, 1, 0, 3, 2, 0, 0, 1) | 0.019 | 0.767 | 0.019 |
| (1, 1, 1, 0, 3, 0, 0, 3, 2, 0, 2, 1) | 0.009 | 1.876 | 0.009 |
| (1, 1, 1, 0, 3, 0, 0, 3, 2, 1, 1, 1) | 0.008 | 0.761 | 0.009 |
| (1, 1, 1, 0, 3, 0, 2, 3, 0, 1, 2, 1) | 0.010 | 1.325 | 0.010 |
| (1, 1, 1, 0, 3, 0, 2, 3, 2, 1, 2, 0) | 0.026 | 0.767 | 0.026 |
| (1, 2, 1, 0, 0, 0, 0, 3, 1, 0, 0, 2) | 0.002 | 0.914 | 0.002 |
As we can see, all of these rules share one common characteristic: their value of variable is Walked, which means that all the respondents considered by the fuzzy rules as military personnel walked to work rather than using a car or other means of transportation. Judging from the values of the other variables, we can conclude that these respondents are typically young males with medium yearly income.
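The observation about the shared condition can be checked mechanically against the rule vectors of Table 5. The sketch below is illustrative only: the function name is hypothetical, and it assumes that code 0 means the variable is not used in a rule and that vector positions follow the order in which the linguistic variables and their fuzzy values were listed (so that the third value at position 8 is Walked and the first value at position 1 is Young).

```python
# Rule antecedents copied from Table 5
RULES = [
    (1, 0, 0, 0, 0, 0, 2, 3, 1, 1, 0, 2),
    (1, 0, 0, 0, 3, 0, 0, 3, 1, 0, 0, 0),
    (1, 0, 0, 0, 3, 1, 0, 3, 0, 0, 2, 1),
    (1, 0, 0, 1, 0, 1, 0, 3, 1, 1, 1, 2),
    (1, 0, 0, 1, 3, 0, 0, 3, 0, 0, 2, 1),
    (1, 0, 1, 0, 0, 0, 0, 3, 1, 1, 0, 2),
    (1, 0, 1, 0, 0, 0, 2, 3, 1, 1, 2, 0),
    (1, 0, 1, 0, 3, 0, 2, 3, 2, 0, 0, 1),
    (1, 0, 1, 0, 3, 1, 0, 3, 2, 0, 0, 1),
    (1, 0, 1, 1, 0, 0, 0, 3, 1, 1, 2, 0),
    (1, 0, 1, 1, 0, 0, 2, 3, 1, 0, 0, 2),
    (1, 1, 0, 0, 3, 0, 2, 3, 2, 1, 0, 1),
    (1, 1, 0, 0, 3, 1, 0, 3, 2, 0, 0, 1),
    (1, 1, 1, 0, 3, 0, 0, 3, 2, 0, 2, 1),
    (1, 1, 1, 0, 3, 0, 0, 3, 2, 1, 1, 1),
    (1, 1, 1, 0, 3, 0, 2, 3, 0, 1, 2, 1),
    (1, 1, 1, 0, 3, 0, 2, 3, 2, 1, 2, 0),
    (1, 2, 1, 0, 0, 0, 0, 3, 1, 0, 0, 2),
]

# Fuzzy values of two positions, in the order the text lists them (assumed mapping)
TERMS = {
    1: ("Young", "Middle Aged 1", "Middle Aged 2", "Middle Aged 3", "Old"),
    8: ("Car", "Public", "Walked"),
}


def shared_conditions(rules):
    """Return {position: code} for 1-based positions with the same non-zero code in every rule."""
    shared = {}
    for pos, column in enumerate(zip(*rules), start=1):
        codes = set(column)
        if len(codes) == 1 and 0 not in codes:
            shared[pos] = column[0]
    return shared


shared = shared_conditions(RULES)
for pos, code in shared.items():
    label = TERMS[pos][code - 1] if pos in TERMS else "?"
    print(f"position {pos}: code {code} ({label})")
```

Under these assumptions, position 8 carries code 3 (Walked) in every rule. In the table as reproduced here, position 1 is also identical across all rules (code 1), which agrees with the remark that the respondents are typically young.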

Disclosing outliers in the group distribution using evolved fuzzy rules

To demonstrate how the evolved fuzzy rules can be used to disclose outliers in the quantity signal, we will first apply them to the auxiliary microfile, and then proceed to disclosing outliers in quantity signals obtained for the main microfile. Since it would be impractical to analyze the auxiliary quantity signal constructed for all the PUMAs as a whole (there were 1238 different PUMAs circa 2000 in the U.S.), we will present the results state by state. Let us consider, for illustration purposes, the state of New York. Fig. 1 presents both the quantity signal (solid line) and the auxiliary quantity signal (dashed line). Values along the x axis stand for the PUMAs of the state of New York. The list of PUMAs can be found on the IPUMS-International website (PUMAs and Super-PUMAs 2000). The values along the y axis stand for:
Fig. 1

Quantity signal (solid line) and auxiliary quantity signal (dashed line) obtained for the state of New York by applying the fuzzy rules from the example to the 2000 U.S. census microfile

in the case of the quantity signal, the number of military personnel working in the corresponding PUMA;
in the case of the auxiliary quantity signal, the sum of all membership grades assigned to the respondents in the corresponding PUMA by the evolved fuzzy rules.

Applying MTTT with to the quantity signal, we can obtain the following index set: Analysis of the Report of the Deputy Under Secretary of Defense (2000) permits us to conclude that most of the indexes obtained by MTTT do not correspond to sites of military bases. Further on, we will assume that , because:

the outlier in PUMA 5 corresponds to Fort Drum (Deputy Under Secretary of Defense 2000, p. ARMY-9);
the outlier in PUMA 42 corresponds to West Point Military Reservation (Deputy Under Secretary of Defense 2000, p. ARMY-10).

Applying MTTT with to the auxiliary quantity signal, we can obtain the following index set: Taking into account the previous discussion, we can assume that . Equality indicates that the sites of military facilities can be easily disclosed even if the vital attributes are removed from the microfile.

In a similar fashion, we can analyze all the other states and determine undisclosed and false outliers. The overall figures are given in Table 6. We included in the table only those states where the number of working military personnel exceeds 0.5 % of all the military personnel in the original harmonized auxiliary microfile , i.e., the value .
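The comparison of the two index sets can be sketched as plain set operations. The function name and the dictionary keys below are hypothetical, and the reading of "undisclosed" (a real outlier missing from the auxiliary signal) and "false" (an auxiliary outlier with no real counterpart) is an assumption that is consistent with the column arithmetic of Table 6.

```python
def compare_outliers(true_outliers, auxiliary_outliers):
    """Split outlier indexes into disclosed, undisclosed, and false ones."""
    true_set, aux_set = set(true_outliers), set(auxiliary_outliers)
    return {
        "disclosed": true_set & aux_set,    # real outliers also visible in the auxiliary signal
        "undisclosed": true_set - aux_set,  # real outliers the fuzzy rules failed to reveal
        "false": aux_set - true_set,        # auxiliary outliers with no real counterpart
    }


# New York, 2000 census: PUMAs 5 and 42 in both signals, as discussed in the text
new_york = compare_outliers({5, 42}, {5, 42})
print(new_york)

# Hypothetical index sets illustrating the other two categories
toy = compare_outliers({1, 2, 3}, {2, 3, 9})
print(toy)
```

For New York, both differences are empty, which corresponds to the row "2, 0, 2, 0" in Table 6.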
Table 6

Results of applying the evolved fuzzy rules to the 2000 census microfile

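| State | Number of outliers in the quantity signal | Number of undisclosed outliers | Number of outliers in the auxiliary quantity signal | Number of false outliers |
|---|---|---|---|---|
| Alabama | 4 | 3 | 1 | 0 |
| Alaska | 2 | 0 | 2 | 0 |
| Arizona | 4 | 1 | 3 | 0 |
| California | 4 | 0 | 4 | 0 |
| Colorado | 2 | 0 | 2 | 0 |
| Connecticut | 1 | 0 | 1 | 0 |
| Florida | 7 | 4 | 4 | 1 |
| Georgia | 5 | 1 | 5 | 1 |
| Hawaii | 1 | 0 | 1 | 0 |
| Illinois | 2 | 1 | 1 | 0 |
| Kansas | 3 | 2 | 1 | 0 |
| Kentucky | 2 | 0 | 2 | 0 |
| Louisiana | 4 | 2 | 2 | 0 |
| Maryland | 2 | 1 | 1 | 0 |
| Mississippi | 2 | 1 | 1 | 0 |
| Missouri | 2 | 1 | 1 | 0 |
| New Jersey | 3 | 1 | 2 | 0 |
| New York | 2 | 0 | 2 | 0 |
| North Carolina | 4 | 2 | 2 | 0 |
| Ohio | 4 | 3 | 1 | 0 |
| Oklahoma | 3 | 2 | 1 | 0 |
| Pennsylvania | 4 | 2 | 4 | 2 |
| Rhode Island | 1 | 0 | 1 | 0 |
| South Carolina | 6 | 1 | 5 | 0 |
| Tennessee | 3 | 3 | 0 | 0 |
| Texas | 7 | 2 | 5 | 0 |
| Virginia | 9 | 3 | 6 | 0 |
| Washington | 5 | 2 | 3 | 0 |
| Total | 98 | 38 | 64 | 4 |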
The confusion matrix (18) for this example is The tests (19), (20), and (22) based on the values of are as follows: , , . These figures indicate the high effectiveness of the evolved fuzzy rules in disclosing sensitive data features.

Let us now discuss the results of applying the evolved fuzzy rules to the original microfile . Fig. 2 presents both the quantity signal (solid line) and the auxiliary quantity signal (dashed line) for the state of New York. Values along the x axis stand for the PUMAs of the state of New York. The list of PUMAs circa 2013 can be found on the IPUMS-International website (PUMAs 2010).
Fig. 2

Quantity signal (solid line) and auxiliary quantity signal (dashed line) obtained for the state of New York by applying the fuzzy rules from the example to the 2013 U.S. ACS microfile

Applying MTTT with to the quantity signal, we can obtain the following index set: As pointed out before, analysis of (Deputy Under Secretary of Defense 2000) permits us to conclude that most of the indexes obtained by MTTT do not correspond to sites of military bases. Further on, we will assume that , because:

the outlier in PUMA 5 corresponds to Fort Drum (Deputy Under Secretary of Defense 2000, p. ARMY-9);
the outlier in PUMA 42 corresponds to West Point Military Reservation (Deputy Under Secretary of Defense 2000, p. ARMY-10).

Applying MTTT with to the auxiliary quantity signal, we can obtain the following index set: Taking into account the previous discussion, we can assume that . That is, both outliers are clearly visible in the auxiliary quantity signal as well.

Analogous results for other states are given in Table 7. We once again included in the table only those states where the number of working military personnel exceeds 0.5 % of all the military personnel in the original harmonized microfile , i.e., the value .
Table 7

Results of applying the evolved fuzzy rules to the 2013 ACS microfile

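| State | Number of outliers in the quantity signal | Number of undisclosed outliers | Number of outliers in the auxiliary quantity signal | Number of false outliers |
|---|---|---|---|---|
| Alabama | 2 | 2 | 1 | 1 |
| Alaska | 2 | 0 | 2 | 0 |
| Arizona | 4 | 1 | 4 | 1 |
| California | 3 | 1 | 2 | 0 |
| Colorado | 2 | 0 | 2 | 0 |
| Connecticut | 1 | 0 | 2 | 1 |
| Florida | 7 | 5 | 3 | 1 |
| Georgia | 7 | 3 | 4 | 0 |
| Hawaii | 1 | 0 | 1 | 0 |
| Illinois | 2 | 1 | 2 | 1 |
| Kansas | 2 | 2 | 0 | 0 |
| Kentucky | 2 | 1 | 1 | 0 |
| Louisiana | 4 | 4 | 0 | 0 |
| Maryland | 3 | 2 | 1 | 0 |
| Mississippi | 1 | 0 | 1 | 0 |
| Missouri | 2 | 2 | 0 | 0 |
| Nevada | 1 | 0 | 1 | 0 |
| New Jersey | 2 | 2 | 0 | 0 |
| New Mexico | 2 | 2 | 0 | 0 |
| New York | 2 | 0 | 2 | 0 |
| North Carolina | 3 | 1 | 2 | 0 |
| Ohio | 2 | 1 | 3 | 2 |
| Oklahoma | 3 | 2 | 1 | 0 |
| South Carolina | 4 | 1 | 3 | 0 |
| Texas | 6 | 1 | 5 | 0 |
| Virginia | 7 | 4 | 4 | 1 |
| Washington | 4 | 1 | 3 | 0 |
| Total | 81 | 39 | 50 | 8 |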
The confusion matrix (18) for this example is The tests (19), (20), and (22) based on the values of are as follows: , , . The values of all the tests are lower than their counterparts calculated for the 2000 census data. This is because the fuzzy rules were evolved using the 2000 census data. Nevertheless, the presented values indicate the high effectiveness of the evolved fuzzy rules and their good generalization abilities.
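As a rough cross-check of the comparison between the two data sets, the totals of Tables 6 and 7 can be turned into simple detection counts. The sketch below is an illustration only: the function name is hypothetical, and recall and precision are generic measures chosen here for convenience, not necessarily the tests (19), (20), and (22) used in the paper (a measure such as Youden's index would also need the number of true negatives, which the table totals do not provide).

```python
def summarize(true_outliers, undisclosed, auxiliary_outliers, false_outliers):
    """Derive detection counts and two generic ratios from one 'Total' table row."""
    tp = true_outliers - undisclosed            # real outliers that were disclosed
    if tp != auxiliary_outliers - false_outliers:
        raise ValueError("table totals are inconsistent")
    return {
        "tp": tp,
        "fn": undisclosed,
        "fp": false_outliers,
        "recall": tp / true_outliers,           # share of real outliers disclosed
        "precision": tp / auxiliary_outliers,   # share of auxiliary outliers that are real
    }


census_2000 = summarize(98, 38, 64, 4)  # "Total" row of Table 6
acs_2013 = summarize(81, 39, 50, 8)     # "Total" row of Table 7

for name, row in (("2000 census", census_2000), ("2013 ACS", acs_2013)):
    print(f"{name}: TP={row['tp']} FN={row['fn']} FP={row['fp']} "
          f"recall={row['recall']:.3f} precision={row['precision']:.3f}")
```

Both ratios come out lower for the 2013 ACS totals than for the 2000 census totals, which is in line with the statement that the rules perform somewhat worse on data they were not evolved on.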

Results of protecting group distributions using memetic algorithm

To illustrate the application of the MA for protecting outliers in the auxiliary quantity signal for the task discussed above, we will limit ourselves to the state of New York. There is a total of 91,398 respondents that work in this state. The auxiliary quantity signal corresponding to this state is shown in Fig. 2 (dashed line), and the corresponding crisp auxiliary quantity signal is shown in Fig. 3 (solid line).
Fig. 3

Initial (solid line) and modified (dashed line) auxiliary quantity signals for the state of New York (2013 U.S. ACS microfile)

As discussed earlier, to mask the outliers in the signal, we need to reduce the values of the and the signal elements. We can achieve this by imposing fuzzy restrictions that lead the evolutionary process toward signals whose and signal values are not greater than 2. This leads to the following fitness function: where ; , , is the basic attribute; returns the value of the attribute of the record in ; is a function defined as Each row in (28) corresponds to a single part of the fitness function (26). To simplify matters, we considered all the basic attributes to be categorical, with the following parameters of (25): , , . The metric (25) defined this way shows the number of attribute values that need to be physically altered during one swap of the records between the submicrofiles.

We decided to apply tournament selection (Brindle 1981) as an efficient and easy-to-implement selection operator, with tournament size 5. Other algorithm parameters were chosen as follows: , , , , . We terminated the algorithm after having obtained 1000 consecutive populations. The population was initialized by randomly generating matrices with different numbers of rows. Elements of the first column were generated with probabilities proportional to the values of the corresponding elements of . Elements of the third column were generated with probabilities proportional to the total numbers of records in the corresponding submicrofiles. During the MA run, we applied linear fitness scaling in the form presented in Goldberg (1989, p. 79) to prevent premature convergence. We also multiplied the mutation probabilities by a factor of 10 whenever the standard deviation of the population fitness values dropped below 0.03. We performed 10 runs of the MA.
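The three generic ingredients mentioned above (tournament selection of size 5, linear fitness scaling, and the mutation boost triggered by low fitness diversity) can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function names, the scaling multiplier `c_mult = 2.0`, the use of the population standard deviation, the cap of the boosted probability at 1, and the assumption of a non-negative fitness that is maximized are all choices made here for the sake of the example.

```python
import random
import statistics


def linear_scale(fitness, c_mult=2.0):
    """Linear scaling f' = a*f + b that keeps the mean fitness unchanged."""
    f_avg = statistics.fmean(fitness)
    f_max, f_min = max(fitness), min(fitness)
    if f_max == f_min:
        return list(fitness)  # nothing to scale
    if f_min > (c_mult * f_avg - f_max) / (c_mult - 1.0):
        # Usual case: the best individual receives c_mult times the mean
        delta = f_max - f_avg
        a = (c_mult - 1.0) * f_avg / delta
        b = f_avg * (f_max - c_mult * f_avg) / delta
    else:
        # Otherwise stretch as far as possible while mapping the worst value to 0
        delta = f_avg - f_min
        a = f_avg / delta
        b = -f_min * f_avg / delta
    return [a * f + b for f in fitness]


def tournament_select(population, fitness, k=5, rng=random):
    """Draw k distinct contenders at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitness[i])]


def adapt_mutation(p_mut, fitness, threshold=0.03, factor=10.0):
    """Boost the mutation probability when the fitness spread falls below the threshold."""
    if statistics.pstdev(fitness) < threshold:
        return min(1.0, p_mut * factor)
    return p_mut


raw = [4.0, 5.0, 6.0, 9.0]
scaled = linear_scale(raw)
parent = tournament_select(["a", "b", "c", "d", "e"], [1, 5, 3, 2, 4],
                           k=5, rng=random.Random(0))
print(scaled, parent, adapt_mutation(0.01, [1.0, 1.0, 1.01]))
```

Scaling keeps selection pressure moderate early in a run and restores it late in a run, when raw fitness values cluster; the mutation boost serves the same purpose from the variation side.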
Among the 1000 individuals obtained in the last generations of each run, 983 correspond to valid solutions of the TPGA in terms of masking outliers in the auxiliary quantity signal. Fig. 3 (dashed line) presents the solution with the lowest cumulative influential metric (25), namely, 53. This solution is valid because applying MTTT to it yields . Since , we can conclude that the memetic algorithm managed to successfully modify the auxiliary quantity signal by creating new outliers in the and signal elements and eliminating the real ones. The mean cumulative metric (25) over all solutions that can similarly be viewed as valid is 62.518, i.e., we need to alter only  % of microfile attribute values in order to provide group anonymity.

Conclusions

In this work, we demonstrated that removing vital attributes from the microfile does not by itself guarantee group anonymity. Using an appropriately tailored evolutionary algorithm, it is possible to build a fuzzy model of a group in the form of fuzzy rules that can violate group anonymity. We also discussed how memetic algorithms can be used to provide group anonymity in a microfile at the cost of introducing only a small amount of distortion into the microdata. Much work remains to be done. Directions for future research include enhancing the classification accuracy of the fuzzy rules and improving the efficiency of the memetic algorithm by choosing appropriate operators.