Byungduk Jeong1, Wonjoon Lee1, Deok-Soo Kim2, Hayong Shin1. 1. Department of Industrial & Systems Engineering, KAIST (Korea Advanced Institute of Science and Technology), Daejeon, South Korea. 2. Department of Mechanical Engineering, Hanyang University, Seoul, South Korea.
Abstract
Generating synthetic baseline populations is a fundamental step of agent-based modeling and simulation, which is growing fast in a wide range of socio-economic areas including transportation planning research. Traditionally, in many commercial and non-commercial microsimulation systems, the iterative proportional fitting (IPF) procedure has been used for creating the joint distribution of individuals when combining a reference joint distribution with target marginal distributions. Although IPF is simple, computationally efficient, and rigorously founded, it is unclear whether IPF sufficiently preserves the dependence structure of the reference joint table when fitting it to target margins. In this paper, a novel method based on the copula concept is proposed as an alternative approach to the problem that IPF solves. Dependency characteristic measures were computed for the results of the proposed method and of IPF, and the two were compared. In most test cases, the proposed method outperformed IPF in preserving the dependence structure of the reference joint distribution.
Large-scale micro-simulations using agent-based models have gained wide popularity in recent years in various fields of socio-economic study [1], including transportation planning [2] and land use [3]. Generating synthetic baseline populations is a key step in agent-based modeling and simulation. An agent in a microsimulation is described by a set of attributes such as age, income, residence type/region, and so on. These attributes are usually dependent on each other. Hence, synthetic population generation can be considered as creating a set of agents with attributes drawn from a joint distribution. However, a difficult element of synthetic population generation is obtaining a relevant data set. As described in [4], two types of data sets are typically available: disaggregated census data in the form of PUMS (Public Use Micro Samples) and aggregated data in the form of summary tables in census reports. PUMS contains individual samples of small size (typically less than 5% of the population), which can be used to infer the joint distribution of the attributes. On the other hand, aggregated data in summary tables exhibit marginal distributions of attributes specific to each analysis zone of interest. Because samples in PUMS data are drawn from a larger area (such as a state or the whole nation) than the target zones, the reference joint distribution from PUMS is often inconsistent with the marginal distributions from the aggregate data of each zone. Beckman et al. [5] proposed combining the disaggregated data with the aggregated data using the IPF (Iterative Proportional Fitting) procedure. The primary idea of IPF is to maintain the dependence structure of the disaggregated data while altering the joint distribution to fit the marginal distributions of the attributes from the aggregated data. (We will use the terms 'marginal distribution' and 'margin' interchangeably.) Since its introduction by Beckman et al., much research has followed this path: see [4,6].
IPF, which is briefly reviewed in a later section, is a very efficient and powerful technique for constructing a joint distribution table from a reference joint distribution and target margins. (In the population synthesis area, the term 'contingency table' is often used to refer to a table with the frequency of population in each cell. Since a contingency table can easily be converted to a distribution table (a probability mass function of discrete random variables), we will use the concept of a distribution table instead.) Although the IPF procedure is very popular in a variety of applications including synthetic population generation, it has some limitations as well. In this paper, we propose a novel approach based on copula theory for the same problem of constructing a joint distribution, in place of IPF. It should be noted that the proposed approach can deal only with ordinal variables, not categorical ones. Recently, some research papers have used copula theory for microsimulation of traffic behavior (e.g., see [7,8]), in which the copula was used in different contexts. Kao et al. [9] also proposed a copula-based approach to synthesizing households in order to preserve the dependence structure. However, they combine target margins using a Gaussian copula, whose covariance matrix is determined from the reference joint distribution (possibly represented by samples). A limitation of this approach is that some dependency information is lost because of the intermediate Gaussian copula. A similar approach can be found in [10], which utilizes a copula for representing the temporal dependence structure among time series of stream flow in a geographic region. These works use well-known copula functions such as the Gaussian or Gumbel copula, whose parameters are then chosen to fit the data. In this paper, on the other hand, we propose to directly use the empirical copula, as explained in the "Copula based approach to joint fitting problem" section.
Problem Description
Though there are many research issues in synthetic population generation, such as the household–individual hierarchy and aggregation data inconsistency, the problem addressed in this paper is the construction of a joint distribution from a given reference joint distribution and target margins. Although the approach is applicable to multi-dimensional distributions without significant modification, the description in this paper is confined to the two-dimensional setting for simplicity. For most of this paper, we will assume that the target margins are discrete distributions; the procedure is then extended to continuous variables in the "Distribution view of CBJF and extensions" section. In order to formally describe the problem, the following notations are introduced. Let (X,Y) be a pair of discrete random variables that represent the attributes of the reference population. X and Y can have values from {x1,x2,…,xm} and {y1,y2,…,yn}, respectively. We assume that X and Y are ordinal or interval variables, whose possible values have a natural ordering. (For the variable types, readers are referred to [11].) The possibility of relaxing this assumption to handle categorical variables will be discussed in the conclusion section as a further research topic. Let (X̃,Ỹ) be a pair of random variables representing the attributes of the target population. For simplicity, we assume that X̃ and Ỹ have the same values as X and Y, respectively. Note that this assumption can easily be removed in the proposed approach by introducing a mapping between them. We will use the following notations, illustrated in Fig 1.
Fig 1
Joint fitting problem.
a = [aij]: an m×n matrix denoting the reference joint distribution of (X,Y), i.e. aij = P[X = xi, Y = yj]
r = [ri]: target marginal distribution of X̃, i.e. ri = P[X̃ = xi]
c = [cj]: target marginal distribution of Ỹ, i.e. cj = P[Ỹ = yj]
b = [bij]: an m×n matrix denoting the target joint distribution of (X̃,Ỹ), i.e. bij = P[X̃ = xi, Ỹ = yj]

The reference joint distribution a can either be given directly or obtained from the detailed disaggregated census data available in the form of PUMS (by counting the samples in each cell), while the target margins r and c are obtained from the aggregated data. From the input data {a,r,c}, the goal is to find the target joint distribution b inheriting the dependence structure from a while fitting it to the margins r and c. We call this problem a joint fitting problem. By definition, a, r, c, and b are probabilities, meaning that they are nonnegative and sum to one. The following symbols are used to denote marginal summations:

ai,+ = Σj aij: row margin of a, i.e. ai,+ = P[X = xi]
a+,j = Σi aij: column margin of a, i.e. a+,j = P[Y = yj]
bi,+ = Σj bij: row margin of b
b+,j = Σi bij: column margin of b

Using these symbols, the constraints of the problem are bi,+ = ri and b+,j = cj. The goal of "preserving the dependence structure" in a may appear ambiguous; a quantitative measure of this goal is defined differently in each of the methods described in the following sections.
Brief Overview of IPF (Iterative Proportional Fitting)
IPF (Iterative Proportional Fitting) is a concise and efficient procedure for solving the joint fitting problem described in the previous section. IPF has many names, including the RAS algorithm, matrix raking, matrix scaling, bi-proportional fitting, and so on. Since its introduction by Deming & Stephan [12], the properties of IPF have been studied thoroughly, and it has been used widely in various fields including synthetic population generation. Although there are some variations of IPF, its essence can be described by the following algorithm (Algorithm 1):
Algorithm 1 (IPF)
b ← a (Initialization)
While (convergence criterion is not met)
    bij ← bij · ri / bi,+ for all i, j (fit row margins to r)
    bij ← bij · cj / b+,j for all i, j (fit column margins to c)

A common choice of convergence criterion is the maximum deviation ε from the given margins r and c:

ε = max(maxi |bi,+ − ri|, maxj |b+,j − cj|)

Ireland & Kullback [13] proved that if the IPF procedure converges to a distribution table under the given constraints on the margins, then the resulting table minimizes the relative entropy (called the 'discrimination information' in their paper), as defined below:

I(b : a) = Σi Σj bij log(bij / aij)

Wong [14] investigated the reliability of IPF for use in geographical studies. Beckman et al. [5] proposed using IPF to combine disaggregated data with aggregated margins from different data sources. Following Beckman's lead, much literature has been produced on adopting IPF for population generation, as summarized in [4]. Despite its popularity and rigorous mathematical analysis, IPF has some limitations, as follows:

Convergence problem: Although this is rare in practical applications, the IPF procedure may not converge. Pukelsheim & Simeone [15] showed conditions under which IPF may fail to converge (i.e. when a can be permuted to a block diagonal structure in the 2-dimensional case, as exemplified below). Though IPF converges very fast in general, as explained in [16], it may take many iterations to reach the desired level of fitting accuracy when there are cells with non-zero initial probability that are eventually wiped out to zero.

Zero margin / zero cell problem: If any ai,+ (or a+,j) is zero while ri (or cj, respectively) is not, the IPF procedure fails. This is called the zero margin problem. Furthermore, the zero cells in a never obtain positive probability mass during the IPF procedure; that is, if aij = 0, then bij = 0 always. A zero cell is not a problem in the computational sense, but it poses a semantic problem because a zero cell can be the result of underrepresentation caused by limited sample size.
As a work-around, zero cells (and hence zero margins) are replaced with a very small number at the initialization step.

Dependence structure preservation: Most importantly, it must be seriously considered whether IPF preserves the dependence structure of the reference table a well. Since there is no strict relationship between minimizing the discrimination information and preserving the dependence structure, minimum discrimination information does not guarantee the most similar dependence structure. Although IPF converges to the solution table b that minimizes the relative entropy subject to the marginal constraints, this does not necessarily mean that the dependence structure captured in a is best transferred to b. In this paper, a new method is proposed to replace IPF, and the methods are compared in terms of dependence measures that capture the strength of relationships between variables.

Tables 1 and 2 show a small example joint fitting problem and a solution obtained by the IPF procedure. The inputs are the reference joint matrix a and the target margins r, c in Table 1. With the convergence tolerance ε = 10−5, the output of the IPF procedure, obtained after more than 300 iterations, is shown in Table 2. If a5,3 is changed to 0, the non-zero cells in the a matrix are separated into two groups: an upper-left group and a lower-right group. In this case, the IPF procedure does not converge, because probability mass in one group cannot be transferred to the other group due to the barricade of zero cells. If we replace zero cells with a very small number (10−20), IPF converges after 1000 iterations.
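The core of Algorithm 1, together with the small-constant zero-cell work-around just described, can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' implementation; the function name is ours, and the 10−20 replacement constant follows the text:

```python
import numpy as np

def ipf(a, r, c, tol=1e-5, max_iter=10000):
    """Iterative proportional fitting: alternately rescale rows and
    columns of b until both margins match r and c within tol."""
    b = a.astype(float).copy()
    b[b == 0] = 1e-20                        # zero-cell work-around
    for _ in range(max_iter):
        b *= (r / b.sum(axis=1))[:, None]    # fit row margins to r
        b *= (c / b.sum(axis=0))[None, :]    # fit column margins to c
        eps = max(np.abs(b.sum(axis=1) - r).max(),
                  np.abs(b.sum(axis=0) - c).max())
        if eps < tol:                        # maximum margin deviation
            return b
    return b
```

Applied to the input of Table 1, the first cell of the output is forced to r1 = 0.07 (the first row of a has a single non-zero cell), matching the Table 2 result.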
Table 1
Input matrix a and target margins r, c.
Ref      y1     y2     y3     y4     y5     [ai,+]   [ri]
x1       0.04   0      0      0      0      0.04     0.07
x2       0.08   0.04   0      0      0      0.12     0.13
x3       0      0.12   0.08   0      0      0.20     0.15
x4       0      0      0.16   0      0.04   0.20     0.25
x5       0      0      0.04   0.2    0      0.24     0.27
x6       0      0      0.04   0.04   0.04   0.12     0.07
x7       0      0      0.04   0.04   0      0.08     0.06
[a+,j]   0.12   0.16   0.36   0.28   0.08   1        1
[cj]     0.16   0.17   0.30   0.25   0.12   1
Table 2
IPF output bIPF.
IPF    y1     y2     y3     y4     y5
x1     0.07   0      0      0      0
x2     0.09   0.04   0      0      0
x3     0      0.13   0.02   0      0
x4     0      0      0.167  0      0.083
x5     0      0      0.059  0.211  0
x6     0      0      0.019  0.014  0.037
x7     0      0      0.035  0.025  0
Copula Based Approach to Joint Fitting Problem
The term copula, first coined by Sklar in [17] from the Latin word copulare meaning "to connect or link", denotes a popular tool for modeling dependence between random variables. This section begins by reviewing the foundations of copula theory. Further details on copulas can be found in [18]. A copula C(u,v) is a joint cumulative distribution function whose margins are uniform (0,1) distributions, satisfying the following properties:

(a) Uniform (0,1) margins: C(u,1) = u, C(1,v) = v, and C(u,0) = C(0,v) = 0 for all u, v ∈ [0,1]
(b) Monotonically increasing: C(u,v) is non-decreasing in each argument
(c) Rectangle inequality (non-negative probability for [u,u + du] × [v,v + dv]): C(u + du, v + dv) − C(u + du, v) − C(u, v + dv) + C(u,v) ≥ 0

A copula can also be obtained from a joint distribution. Let FXY(x,y) = P[X ≤ x, Y ≤ y] denote the joint cumulative distribution function of (X,Y). Furthermore, let FX(x) = P[X ≤ x] and FY(y) = P[Y ≤ y] be the marginal cumulative distribution functions of X and Y, respectively. The copula of FXY(x,y) is defined by a function C(u,v) that satisfies the following:

C(FX(x), FY(y)) = FXY(x,y) for all (x,y)

Sklar's theorem states that such a function C(u,v) exists, and if X and Y are continuous, C(u,v) is uniquely determined. That is, by letting u = FX(x) and v = FY(y), the copula function associated with FXY(x,y) can be obtained as follows:

C(u,v) = FXY(FX−1(u), FY−1(v))

However, if the variables are not continuous (like X and Y in this paper), then the copula C is not unique; in this case, the values of the copula are uniquely determined only at the points (FX(x), FY(y)), and a copula C for which the properties (a)-(c) above hold can be obtained by interpolating the values at these points [19]. The proof for the general n-dimensional case is outlined in [20].
When we speak of the copula of variables X and Y, we will mean the copula obtained via bilinear interpolation, if one or both of the random variables are not continuous. Using the bilinear interpolation (the detailed procedure is explained in the last part of this chapter), the copula function C(u,v) is obtained from the joint cumulative distribution function by removing the marginal information; that is, C(u,v) extracts the dependence structure of the joint distribution without the information specific to the margins. This forms the perfect basis for applying the copula to the joint fitting problem: the copula function derived from the reference joint distribution of (X,Y) can be combined with the new target margins of X̃ and Ỹ to form the target joint distribution of (X̃,Ỹ). This is further elaborated with basic mathematical operations.

Let FXY(x,y) denote the joint cumulative distribution function of (X,Y), and let FX(x) and FY(y) denote the marginal distributions of X and Y, respectively. Let C(u,v) be the copula function obtained via bilinear interpolation. If we set U = FX(X) and V = FY(Y), then C(u,v) is the cumulative distribution function of (U,V). Now, the goal is to find the joint distribution of (X̃,Ỹ), whose marginal cumulative distributions are FX̃(x) and FỸ(y). We would like to make the dependence structure of the joint distribution of (X̃,Ỹ) as close to that of (X,Y) as possible. For this reason, the same copula C(u,v) is applied to the margins FX̃ and FỸ in order to achieve this goal. Let Ũ = FX̃(X̃) and Ṽ = FỸ(Ỹ). Then, C(u,v) is used as the cumulative distribution function of (Ũ,Ṽ). The following symbols are introduced for denoting cumulative distributions:

A = [Aij]: cumulative distribution of (X,Y), i.e. Aij = P[X ≤ xi, Y ≤ yj]
u = [ui]: cumulative distribution of X, i.e. ui = P[X ≤ xi] = FX(xi)
v = [vj]: cumulative distribution of Y, i.e. vj = P[Y ≤ yj] = FY(yj)
B = [Bkl]: cumulative distribution of (X̃,Ỹ), i.e. Bkl = P[X̃ ≤ xk, Ỹ ≤ yl]
ũ = [ũk]: cumulative distribution of X̃, i.e. ũk = P[X̃ ≤ xk] = FX̃(xk)
ṽ = [ṽl]: cumulative distribution of Ỹ, i.e. ṽl = P[Ỹ ≤ yl] = FỸ(yl)
A0,j = Ai,0 = B0,l = Bk,0 = u0 = v0 = ũ0 = ṽ0 = 0: for notational convenience

Because ũk = FX̃(xk) and ṽl = FỸ(yl), we get Bkl = C(ũk, ṽl).
Therefore:

bkl = Bkl − Bk−1,l − Bk,l−1 + Bk−1,l−1

Hence, the target joint distribution b can be computed if we have the copula function C(u,v) obtained from a. Because Aij = P[X ≤ xi, Y ≤ yj] = P[U ≤ FX(xi), V ≤ FY(yj)] = P[U ≤ ui, V ≤ vj] = C(ui,vj), the values of C(u,v) are uniquely defined only at the grid points (ui,vj). Since (ũk,ṽl) does not necessarily coincide with a grid point (ui,vj), C(ũk,ṽl) must be estimated from adjacent known grid points. For a given (u,v), the cell [ui−1,ui] × [vj−1,vj] containing (u,v) can be found. The local parameters s, t ∈ [0,1] are defined as shown in Eq (7) and Fig 2:

s = (u − ui−1) / (ui − ui−1), t = (v − vj−1) / (vj − vj−1)    (7)
Fig 2
Bilinear interpolation.
Then, C(u,v) is computed using bilinear interpolation as below:

C(u,v) = (1 − s)(1 − t)Ai−1,j−1 + (1 − s)t Ai−1,j + s(1 − t)Ai,j−1 + st Ai,j    (8)

With C(u,v) as defined in Eq (8), the following lemmas can easily be proved. (Proofs can be found in S1 Text.)

Lemma 1. C(u,v) is at least C0 continuous at all locations in [0,1] × [0,1].
Lemma 2. C(u,v) satisfies the properties (a)~(c) of a copula function.

Hence, C(u,v) qualifies as a copula for the reference joint distribution a. The following algorithm (Algorithm 2), which we call copula based joint fitting (CBJF), is presented based on this copula.
Algorithm 2 (CBJF)
Step 0 (Initialize)
    A0,j = Ai,0 = B0,l = Bk,0 = 0; u0 = v0 = ũ0 = ṽ0 = 0
Step 1 (Compute [Aij])
    Aij = aij + Ai−1,j + Ai,j−1 − Ai−1,j−1 for i = 1,…,m and j = 1,…,n
Step 2 (Compute [ui], [vj], [ũk], [ṽl])
    ui = ai,+ + ui−1 for i = 1,…,m
    vj = a+,j + vj−1 for j = 1,…,n
    ũk = rk + ũk−1 for k = 1,…,m
    ṽl = cl + ṽl−1 for l = 1,…,n
Step 3 (Compute [Bkl])
    For each k = 1,…,m and l = 1,…,n
        Find the cell (i,j) such that ui−1 ≤ ũk ≤ ui and vj−1 ≤ ṽl ≤ vj
        s = (ũk − ui−1)/(ui − ui−1), t = (ṽl − vj−1)/(vj − vj−1)
        Bkl = (1 − s)(1 − t)Ai−1,j−1 + (1 − s)t Ai−1,j + s(1 − t)Ai,j−1 + st Ai,j
Step 4 (Compute [bkl])
    For each k = 1,…,m and l = 1,…,n
        bkl = Bkl − Bk−1,l − Bk,l−1 + Bk−1,l−1

Shown in Table 3 is the output of Algorithm 2 (CBJF) applied to the input condition in Table 1. It can be noted that, unlike in the IPF result shown in Table 2, some zero cells of the a matrix receive probability mass under CBJF. This can be interpreted as diffusion of probability through the lens of the copula upon the request of the target margins.
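Algorithm 2 translates almost line for line into NumPy. The following is our own sketch (function name ours), using cumulative sums for Steps 1–2 and searchsorted to locate the cell (i,j) in Step 3; it assumes strictly increasing u and v, i.e. no zero reference margins:

```python
import numpy as np

def cbjf(a, r, c):
    """Copula-based joint fitting (Algorithm 2 sketch): build the
    bilinear-interpolation copula of a and evaluate it on the
    cumulative target margins."""
    m, n = a.shape
    # Step 1: cumulative joint distribution A, padded with zeros
    A = np.zeros((m + 1, n + 1))
    A[1:, 1:] = np.cumsum(np.cumsum(a, axis=0), axis=1)
    # Step 2: cumulative reference and target margins
    u = np.concatenate(([0.0], np.cumsum(a.sum(axis=1))))
    v = np.concatenate(([0.0], np.cumsum(a.sum(axis=0))))
    ut = np.concatenate(([0.0], np.cumsum(r)))
    vt = np.concatenate(([0.0], np.cumsum(c)))
    # Step 3: evaluate the interpolated copula at each (ut[k], vt[l])
    B = np.zeros((m + 1, n + 1))
    for k in range(1, m + 1):
        i = min(max(np.searchsorted(u, ut[k]), 1), m)
        s = (ut[k] - u[i - 1]) / (u[i] - u[i - 1])
        for l in range(1, n + 1):
            j = min(max(np.searchsorted(v, vt[l]), 1), n)
            t = (vt[l] - v[j - 1]) / (v[j] - v[j - 1])
            B[k, l] = ((1 - s) * (1 - t) * A[i - 1, j - 1]
                       + (1 - s) * t * A[i - 1, j]
                       + s * (1 - t) * A[i, j - 1]
                       + s * t * A[i, j])
    # Step 4: difference B back into the probability table b
    return B[1:, 1:] - B[:-1, 1:] - B[1:, :-1] + B[:-1, :-1]
```

On the Table 1 input this yields b1,1 = 0.0625, which rounds to the 0.063 shown in Table 3, and the output margins match r and c up to floating-point error (a consequence of the uniform-margin property of the copula).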
Table 3
CBJF output bCBJF.
CBJF   y1     y2     y3     y4     y5
x1     0.063  0.007  0      0      0
x2     0.073  0.043  0.013  0.001  0
x3     0.022  0.076  0.050  0.002  0
x4     0.002  0.028  0.142  0.033  0.045
x5     0      0.008  0.047  0.164  0.051
x6     0      0.004  0.022  0.024  0.020
x7     0      0.004  0.025  0.027  0.004
Distribution View of CBJF and Extensions
The computational complexity of Algorithm 2 (CBJF) above is O(T), where T = mn is the total number of cells in the joint distribution table. Though it is a very efficient algorithm with complexity linear in the input/output size, the total number of cells T grows exponentially as the dimension of the target margins increases, which can easily lead to prohibitively large matrices for storage and computation. However, in many practical high-dimensional distributions, the non-zero cells are very sparse, and this sparsity is the key to handling high-dimensional cases. In order to exploit the sparsity of the joint distribution table, we need to re-organize Algorithm 2.

We will start with an interpretation of Algorithm 2 (CBJF). In Step 2 of Algorithm 2, [ui] and [vj] define a partition G on the [0,1] × [0,1] space, shown as solid (blue) lines in Fig 3. [aij] is the probability mass assigned to each cell Gij of the partition. The target margins r and c overlay a new partition G̃ defined by [ũk] and [ṽl], shown as dashed (red) lines in Fig 3. Step 4 of Algorithm 2 can be interpreted as a "collection view". In other words, the probability mass bkl for a cell G̃kl in the new partition is computed by collecting the probability mass of the cells in G overlapping with G̃kl. For example, the cell G̃2,2 overlaps with G2,2, G2,3, G3,2, and G3,3. So, b2,2 is computed by collecting probability mass from a2,2, a2,3, a3,2, and a3,3 in proportion to the ratio of the overlapping area, which is achieved by the bilinear-interpolation copula function in Eq (8). In this view, every cell in the partition G̃ must be visited and computed, since the algorithm does not know in advance whether the visit will result in a zero cell or not.
Fig 3
Collection view of CBJF (Partition lines are for the example in Table 1).
The same algorithm can be viewed the other way around, which we call the "distribution view". Instead of collecting probability from the overlapping cells, we can visit each non-zero cell Gij and distribute its probability mass aij to the overlapping cells in G̃, again in proportion to the ratio of the overlapping area. Fig 4 shows that a2,2 is distributed to b1,1, b1,2, b2,1, and b2,2, where b acts as an accumulator of the probability mass incoming from each non-zero cell of G. A sparse matrix representation can be used to store a and b. This algorithm (Algorithm 3) is shown below.
Fig 4
Distribution view of CBJF.
Algorithm 3 (CBJF-Distribution)
Step 0 (Initialize)
    Define partition G by computing [ui], [vj]
    Define partition G̃ by computing [ũk], [ṽl]
    [bkl] is an empty zero matrix
Step 1 (Distribute [aij])
    For each non-zero aij
        For each overlapping cell G̃kl (with overlapping area ratio pkl)
            bkl = bkl + pkl · aij

Algorithm 3 (CBJF-Distribution) is simpler, faster, and storage-efficient, and hence it can be applied to the high-dimensional case. On top of these benefits, it enables further extensions. Let us first look into an extension to resampling-based population generation, which is very common. In the case that disaggregated micro-samples are given for the reference joint distribution, let s(k) = (x(k), y(k), z(k)) denote the k-th survey record in PUMS, which we will call a PUMS entry hereafter in order to reduce confusion in the use of the term "sample". Here x(k) and y(k) are attributes whose target margins are given, while z(k) is a vector of additional attributes. Let P = {s(k); k = 1,…,N} denote the PUMS set. Once the target joint distribution b is obtained, it can be used directly for generating agents with attributes (x,y) drawn from b. However, we cannot set the additional attributes z in this way. This is the motivation of resampling-based population generation, where the synthetic population is generated by resampling P. In this case, a selection probability w(k) is assigned to each PUMS entry s(k), so that resampling selects s(k) with probability w(k). The goal of simple resampling is to compute w(k) so as to meet the target margin requirement. Simple resampling can be efficiently combined with the IPF procedure; however, in this case, the zero cell problem still persists, i.e. we cannot generate any population in a zero cell where no PUMS entry exists. This can be solved by viewing Algorithm 3 (CBJF-Distribution) at the granularity of a PUMS entry. Algorithm 3 (CBJF-Distribution) allocates the probability mass of aij to overlapping cells in G̃. By the same token, we can think of a PUMS entry s(k) as a cell having probability mass 1/N. (When there is more than one PUMS entry in a cell, we can consider them as multiple layers overlaid on the same cell, each of which has probability mass 1/N.) For each s(k), we can find the overlapping cells G̃kl, and a copy of s(k) is added to each such cell with the selection probability determined in proportion to the overlapping area ratio. Note that the (x,y) attributes of s(k) are to be replaced by those of the cell. In this way, s(k) is duplicated into multiple copies whose weights sum to 1/N, and each copy has different values of the (x,y) attributes while retaining the same z attributes. Details are given in Algorithm 4 (CBJF-Resampling). At the end of the algorithm, we get a set Q = {(s(k),w(k)); k = 1,…,M}, which can be used as the source of resampling for synthetic population generation.
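The distribution view of Algorithm 3 can be sketched as follows (our own sketch; the full scans over k and l could be narrowed with searchsorted in a production version). Each non-zero reference cell spreads its mass uniformly over its rectangle in (u,v) space, and its overlap with each target rectangle determines the allocation; the per-entry weighting of Algorithm 4 follows the same overlap logic:

```python
import numpy as np

def cbjf_distribution(a, r, c):
    """Distribution view of CBJF (Algorithm 3): visit each non-zero
    cell of a and distribute its mass to the overlapping target cells
    in proportion to the overlapping area in (u,v) space."""
    u = np.concatenate(([0.0], np.cumsum(a.sum(axis=1))))
    v = np.concatenate(([0.0], np.cumsum(a.sum(axis=0))))
    ut = np.concatenate(([0.0], np.cumsum(r)))
    vt = np.concatenate(([0.0], np.cumsum(c)))
    b = np.zeros((len(r), len(c)))
    for i, j in np.argwhere(a > 0):
        area = (u[i + 1] - u[i]) * (v[j + 1] - v[j])  # area of cell G_ij
        for k in range(len(r)):
            du = min(u[i + 1], ut[k + 1]) - max(u[i], ut[k])
            if du <= 0:
                continue  # no overlap in the u direction
            for l in range(len(c)):
                dv = min(v[j + 1], vt[l + 1]) - max(v[j], vt[l])
                if dv > 0:
                    b[k, l] += a[i, j] * du * dv / area
    return b
```

On the Table 1 input this produces the same table as the collection view, since spreading aij uniformly over Gij is exactly the density whose CDF is the bilinear-interpolation copula.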
Algorithm 4 (CBJF-Resampling)
Step 0 (Initialize)
    Define partition G by computing [ui], [vj]
    Define partition G̃ by computing [ũk], [ṽl]
    Q = an empty set
Step 1 (Distribute PUMS entries)
    For each PUMS entry s(k) = (x(k), y(k), z(k)) ∈ P
        Find Gij (the cell to which s(k) belongs)
        For each overlapping cell G̃k′l′ (with overlapping area ratio p)
            s̃ = (xk′, yl′, z(k)) and w = p/N
            Add (s̃, w) to Q

The last extension is about handling continuous variables. For the continuous-variable case, it is natural to consider the reference joint to be given as the PUMS set P = {s(k); k = 1,…,N} as above, while the target margins are given in functional form FX̃(x) and FỸ(y). We also assume that their inverses FX̃−1 and FỸ−1 are available. In fact, the continuous case is much simpler, at least conceptually, due to the fact that (X,Y) and (X̃,Ỹ) share the same copula [18]. For each PUMS entry s(k) = (x(k), y(k), z(k)), the corresponding point (x̃(k), ỹ(k)) can be computed as shown in Fig 5: x̃(k) = FX̃−1(u(k)) and ỹ(k) = FỸ−1(v(k)), where u(k) = FX(x(k)) and v(k) = FY(y(k)). Since the reference joint distribution is given as the PUMS set P, u(k) = i(k)/N and v(k) = j(k)/N, where i(k) (or j(k)) is the number of PUMS entries whose x (or y) attribute is no greater than x(k) (or y(k)). The computation of u(k) (or v(k)) can be done more efficiently by sorting {x(k)} (or {y(k)}). This procedure is described in Algorithm 5 (CBJF-Continuous) below. The overall computational complexity of CBJF-Continuous is O(N log N), where N is the size of the PUMS set. The new sample set Q = {s̃(k); k = 1,…,N} is the source of resampling for synthetic population generation. In the hybrid case where continuous and discrete variables are mixed, the above extensions can easily be combined; details are left to the readers.
Fig 5
Mapping x(k) to x̃(k).
Algorithm 5 (CBJF-Continuous)
Step 0 (Initialize)
    Sort {x(k)} and {y(k)}
    Q is an empty set
Step 1 (Transform PUMS set)
    For each PUMS entry s(k) = (x(k), y(k), z(k)) ∈ P
        Obtain i(k) and j(k) from the sorted sets
        u(k) = i(k)/N and v(k) = j(k)/N
        x̃(k) = FX̃−1(u(k)) and ỹ(k) = FỸ−1(v(k))
        Add s̃(k) = (x̃(k), ỹ(k), z(k)) to Q
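Algorithm 5 amounts to a rank transform followed by the inverse target CDFs. Below is our own sketch under the assumptions of the text; the inverse target margins are assumed to be available as callables (argument names ours), and ties in the sample are ignored:

```python
import numpy as np

def cbjf_continuous(x, y, inv_Fx_target, inv_Fy_target):
    """CBJF for continuous variables (Algorithm 5 sketch): map each
    PUMS entry through its empirical copula coordinates (u, v) into
    the target-margin space; additional attributes ride along."""
    N = len(x)
    # u_k = i_k / N, where i_k is the rank of x_k (likewise v_k for y_k)
    u = (np.argsort(np.argsort(x)) + 1) / N
    v = (np.argsort(np.argsort(y)) + 1) / N
    return inv_Fx_target(u), inv_Fy_target(v)
```

Because the mapping is monotone in each coordinate, the ordering of the samples, and hence rank correlations such as Spearman's rho and Kendall's tau, is preserved exactly.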
Numerical Experiments and Comparison
For comparison purposes, the joint fitting problem was also formulated as the following weighted least squares problem, which is an instance of a quadratic programming (QP Formulation) problem:
QP Formulation
Minimize Σi Σj wij (bij − aij)²

Subject to
    bi,+ = ri for i = 1,…,m
    b+,j = cj for j = 1,…,n
    bij ≥ 0

where wij is the weight assigned to cell (i,j), set to a sufficiently large number M for cells with aij = 0. Note that the objective function in the QP can be replaced with a weighted L1-norm:

Minimize Σi Σj wij |bij − aij|

Although the L1-norm is not a linear function, a standard technique can be applied to convert the optimization problem to a linear program (LP) (see [21], for example), resulting in the following LP formulation, where the new variables b+ij (or b−ij) represent the positive (or negative) part of bij − aij. (The solution of this LP formulation was also computed for the test cases listed in this section; however, the LP results were discarded because the output quality was significantly inferior to the other outputs. We leave the formulation here for documentation purposes.)
LP Formulation
Minimize Σi Σj wij (b+ij + b−ij)

Subject to
    bi,+ = ri for all i
    b+,j = cj for all j
    bij = aij + b+ij − b−ij, b+ij ≥ 0, b−ij ≥ 0

where wij is the weight assigned to cell (i,j), set to a sufficiently large number M for cells with aij = 0.

Some quantitative measures are now introduced in order to compare the results in terms of preserving the dependence structure. The most popular measure of dependence is Pearson's product-moment correlation coefficient, defined as:

ρ = Cov(X,Y) / (σX σY) = (E[XY] − E[X]E[Y]) / (σX σY)

Although it is simple and familiar, Pearson's correlation coefficient measures only the linear dependence between X and Y. An alternative to Pearson's correlation is a rank correlation, such as Spearman's rho or Kendall's tau. Spearman's rho is the Pearson correlation between the two uniform random variables FX(X) and FY(Y):

ρS = Corr(FX(X), FY(Y))

For Kendall's tau, consider two independent samples (X1,Y1) and (X2,Y2) with the same joint distribution as (X,Y). (X1,Y1) and (X2,Y2) are concordant if (X1 − X2)(Y1 − Y2) > 0, and discordant if (X1 − X2)(Y1 − Y2) < 0. Then, Kendall's tau is defined as follows:

τ = P[(X1 − X2)(Y1 − Y2) > 0] − P[(X1 − X2)(Y1 − Y2) < 0]

For general types of dependence, the maximal information coefficient (MIC) [22] is also used. MIC is a measure of dependence which captures a wide range of (either functional or non-functional) associations between variables. In the case that a functional relationship exists, MIC provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. (10,000 sample points were drawn from each joint distribution and used to calculate their MIC values.)

In order to test the effectiveness of the proposed method, the following five types of reference joint distribution were chosen (shown in Fig 6):
(a) Bivariate normal, (b) Bimodal, (c) Lower tail dependence, (d) U-shape, (e) Circle.

Normal (Fig 6(A)): bivariate normal with mean μ = (0.0, 0.0), standard deviation σ = 1, and correlation ρ = 0.7
Bimodal (Fig 6(B)): mixture of two bivariate Gaussians, with μ1 = (0.0, 0.0), σ1 = 1, ρ1 = 0.7 and μ2 = (3.0, 1.0), σ2 = 1, ρ2 = −0.5
Tail dependent (Fig 6(C)): a joint distribution showing strong tail dependence when X is low
U-shape (Fig 6(D)): U-shaped distribution whose correlation coefficients are quite small
Circle (Fig 6(E)): circular joint distribution

Then, in order to obtain target margins, the following eight types of marginal modification operators were applied to the margins of each reference joint distribution (see Fig 7):
(a) Original marginal distribution, (b) Skew left, (c) Skew right, (d) Uniform, (e) Fat tail, (f) Thin tail, (g) Perturbation.

Skew: a left or right skew applied to the column margin and the row margin. There are 4 different combinations of skewed margins (LL, RR, LR, RL). For example, "Skew LR" indicates that the column margin [cj] is skewed left and the row margin [ri] is skewed right.
Uniform: uniform target margins
Fat tail: makes the margin tails fatter
Thin tail: makes the margin tails thinner
Perturb: adds ±10% noise to each bin of the reference margins

Combining the five reference joint types with the eight marginal modification operators, a total of 40 combinations were used as the test set. The test set is not meant to be comprehensive; rather, some typical cases were selected in order to examine the effectiveness of the proposed methods in terms of dependency measures. Each margin was discretized into 100 bins, resulting in 100×100 cells in a joint table. For each combination of reference joint distribution and target margins, the IPF procedure, the QP method, and the CBJF approach were applied in order to obtain the target joint distribution.

Fig 8 shows the output distributions for one of the 40 test cases: the normal joint distribution with the fat tail operator. Fig 8(D), 8(E), 8(G) and 8(H) are scatter plots of 10,000 sample points (for graphical presentation purposes) from the reference distribution, the IPF result, the QP result, and the CBJF result, respectively. While the CBJF result appears to retain the dependence structure of the reference joint, the IPF and QP results appear more linearly correlated. This is one of many cases in which the output of the CBJF approach outperforms the other methods. (The output plots for all 40 cases can be found in S1 Fig. Another numerical experiment with various changes to the marginal distributions can be found in S2 Fig.)
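For reproducibility, the moment-based measures can be computed directly from a joint distribution table rather than from drawn samples. A sketch for Pearson's correlation (function name and default index-valued attributes are ours; Spearman's rho is the same computation applied to rank-transformed values):

```python
import numpy as np

def pearson_from_table(b, x_vals=None, y_vals=None):
    """Pearson correlation of a discrete joint distribution table b;
    cell indices serve as default attribute values."""
    m, n = b.shape
    x = np.arange(m, dtype=float) if x_vals is None else np.asarray(x_vals, float)
    y = np.arange(n, dtype=float) if y_vals is None else np.asarray(y_vals, float)
    px, py = b.sum(axis=1), b.sum(axis=0)   # marginal pmfs
    ex, ey = px @ x, py @ y                 # E[X], E[Y]
    cov = x @ b @ y - ex * ey               # E[XY] - E[X]E[Y]
    sx = np.sqrt(px @ x**2 - ex**2)         # standard deviation of X
    sy = np.sqrt(py @ y**2 - ey**2)         # standard deviation of Y
    return cov / (sx * sy)
```

A perfectly diagonal table gives ρ = 1, and a product of margins (independence) gives ρ = 0, which makes the function easy to sanity-check before applying it to the fitted tables.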
Fig 8
Output results for normal joint distribution and fat tail target margins.
IPF produces the smallest relative entropy (Eq (2)). However, relative entropy does not convey the same information as the level of dependence structure preservation, which can be measured by correlation coefficients. For a quantitative comparison of dependence structure preservation, the dependency measures stated in the previous section were computed: Pearson's correlation, Spearman's rho, Kendall's tau, and the MIC. The results are shown in Tables 4–7 (rows marked (a)), respectively. For each dependency measure, Tables 4–7 (rows marked (b)) show the deviation of the measure of each method's output joint table from that of the original reference joint table; a smaller deviation implies better preservation of the dependence structure. The comparison results are summarized as follows:
- The perturbation operator (the second-to-last columns of Tables 4–7) does not change the target margins significantly from the reference margins. For this small change, all three methods work well, showing almost no difference in the dependence measures.
- For the larger modifications, the copula-based approach was superior to the other methods in almost all combinations. This indicates that the proposed method preserves the dependence structure of the reference joint distribution, while the other methods (IPF and QP) often fail to maintain it when the target margins differ significantly from those of the reference.

(In Tables 4–7, rows marked (a) report the observed measure, e.g. ρ(Observed), and rows marked (b) the absolute deviation from the reference, e.g. |ρ(Reference) − ρ(Observed)|; likewise for Spearman’s s, Kendall’s τ, and the MIC.)

Since not only the type of marginal modification operator but also the amount of marginal change affects the result, the skew, fat tail, and thin tail operators were applied to each of the reference joint types at various levels of marginal change. For each test combination, the MIC is calculated to see how the marginal change affects the dependence structure. Marginal change is controlled by the marginal variation, which we define as the sum of the total variation distance of the row margins and that of the column margins, where the total variation distance between distributions P and Q is

TV(P, Q) = (1/2) Σ_i |P(i) − Q(i)|.

From the results shown in Figs 9–13, we can see that all three methods maintain the reference’s dependence structure when the marginal variation is very small. However, as the variation grows, IPF and QP fail to preserve the dependence structure of the reference joint distribution as measured by the MIC, whereas the MIC of the CBJF output remains almost unchanged as the level of marginal variation increases.
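The marginal variation defined above is straightforward to compute; a short sketch (the helper names are ours):

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance: TV(P, Q) = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def marginal_variation(ref_joint, row_target, col_target):
    """Marginal variation: TV of the row margins plus TV of the
    column margins, used to control the amount of marginal change."""
    return (total_variation(ref_joint.sum(axis=1), row_target)
            + total_variation(ref_joint.sum(axis=0), col_target))
```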
Fig 9
Results from the normal joint distribution with skew, fat tail and thin tail operators.
Fig 13
Results from the circle joint distribution with skew, fat tail and thin tail operators.
Finally, we applied the proposed method to generating a synthetic patient population for simulating the emergency department of a large hospital in Korea. Reference joint samples were obtained from more than 100K patient visit records during the year 2013. Since the purpose of the simulation analysis is to evaluate the layout, patient admission policy, and staffing of the emergency department, we need to generate a synthetic patient population that reflects the forecasted change in the marginal distributions of patient age and disease severity. As reported in [23], South Korea is one of the most rapidly aging societies. The ESI (Emergency Severity Index) is a 5-level triage system that classifies patients from level 1 (most urgent) to level 5 (least urgent) by both acuity and resource needs [24]. In this simulation study, the hospital wanted to analyze the capability of its emergency department in various situations, and hence the distribution of ESI was varied. Fig 14 shows the reference patient population in 2013 and the synthetic populations generated by IPF, QP, and CBJF. The target marginal distribution of age reflects the forecast of Statistics Korea [23], in which the portion of elderly people (60+) is predicted to reach 42% by 2025 (from 31% in the 2013 reference population). In order to see the effect of reducing the patient concentration at ESI level 3, the distribution of ESI was changed from (2%, 11%, 78%, 6%, 3%) to (8%, 12%, 45%, 20%, 15%). As shown in Table 8, CBJF outperforms IPF and QP in preserving all dependency measures used in this study.
Fig 14
Synthetic population for ED (emergency department) simulation.
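The essence of a copula-based refit can be illustrated on raw samples: keep the empirical ranks of the reference pairs (the copula) and push them through the quantile functions of the new target margins. The following is only a sketch in the spirit of CBJF, not the exact algorithm of this paper; `copula_refit` and its arguments are illustrative names:

```python
import numpy as np

def copula_refit(x, y, x_target, y_target):
    """Keep the rank dependence of (x, y); replace each margin by
    mapping normalized ranks through the target quantile function.
    x_target / y_target: samples drawn from the desired target margins."""
    n = len(x)
    # pseudo-observations of the empirical copula, scaled into (0, 1)
    u = (np.argsort(np.argsort(x)) + 0.5) / n
    v = (np.argsort(np.argsort(y)) + 0.5) / n
    # inverse-CDF (quantile) map onto the new margins
    return np.quantile(x_target, u), np.quantile(y_target, v)
```

Because only the ranks of the reference samples are used, rank-based measures such as Spearman’s rho and Kendall’s tau are invariant under this transformation, which is why the dependence structure survives the change of margins.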
Table 8
Dependency measures for synthetic ED patient populations.
Dependency measure            Reference    IPF       QP        Copula
Pearson     Observed          -0.171       -0.260    -0.291    -0.144
            Abs. Dev.                       0.089     0.120     0.027
Spearman    Observed          -0.165       -0.294    -0.314    -0.146
            Abs. Dev.                       0.129     0.149     0.019
Kendall     Observed          -0.089       -0.179    -0.203    -0.086
            Abs. Dev.                       0.090     0.114     0.003
MIC         Observed           0.031        0.098     0.209     0.019
            Abs. Dev.                       0.067     0.178     0.012
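The “Observed” rows of Table 8 are standard statistics; given paired samples, the three correlation-type measures can be computed along the following lines. This is a NumPy-only sketch for small samples; in practice `scipy.stats.pearsonr`, `spearmanr`, and `kendalltau` provide the same quantities with full tie handling:

```python
import numpy as np

def _ranks(x):
    """Average ranks, with ties sharing their mean rank (for Spearman)."""
    x = np.asarray(x, float)
    order = np.argsort(x, kind="mergesort")
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):            # average the ranks of tied values
        m = x == v
        r[m] = r[m].mean()
    return r

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    # Spearman's rho is Pearson's correlation applied to the ranks
    return pearson(_ranks(x), _ranks(y))

def kendall(x, y):
    """Kendall's tau-a: pairwise concordance; O(n^2) but fine for a sketch."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s = sum(np.sign(x[i] - x[i + 1:]) @ np.sign(y[i] - y[i + 1:])
            for i in range(n - 1))
    return 2.0 * s / (n * (n - 1))
```

The “Abs. Dev.” rows are then simply |measure(reference) − measure(output)| for each method. (The MIC is more involved; see the MINE statistics of Reshef et al.)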
Summary and Concluding Remarks
The joint fitting problem turned out to be a natural application area for the copula concept, whose purpose is to preserve the dependence structure of the reference distribution. In this paper, a novel method based on the copula concept, called CBJF, was proposed. Although IPF has long been used in a wide range of applications and studied with mathematical rigor, it is not a silver bullet for joint fitting problems, including synthetic population generation. The numerical tests showed that CBJF is superior to the IPF and QP methods in most cases in terms of dependence-structure preservation. Furthermore, CBJF is computationally efficient, since it does not require an iterative procedure. Its robustness is another significant advantage: CBJF does not suffer from convergence problems or zero cells.

A disadvantage of CBJF is that it requires caution when X and/or Y are categorical variables, such as gender or ethnic group. In such cases, the CBJF result is affected by the ordering of the rows (or columns) when there is no definite natural ordering of the attribute values. In its present form, CBJF can be applied when the attribute values have a natural ordering, such as age, annual income, number of vehicles, or location coordinates. Handling categorical variables is a topic requiring further research. When categorical target variables are mixed with ordinal target variables, IPF may be applied first to the categorical variables and CBJF to the remaining ordinal variables.

In many agent-based simulation applications, micro-samples (such as PUMS) from the reference population may not be available, mainly because of survey costs or privacy issues. There is recent research on generating synthetic populations without micro-samples [25–27]. We believe the concept of CBJF can be applied even when micro-samples are not available; however, the details depend on the available information and require further research.
S1 Fig. Experiments on 40 test cases. (PDF)
S2 Fig. Experiments on various changes to marginal distributions. (PDF)
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC. Detecting novel associations in large data sets. Science. 2011;334(6062):1518–1524.