Santiago Gómez-Guerrero, Inocencio Ortiz, Gustavo Sosa-Cabrera, Miguel García-Torres, Christian E Schaerer.
Abstract
Interaction between variables is often found in statistical models, and it is usually expressed in the model as an additional term when the variables are numeric. However, when the variables are categorical (also known as nominal or qualitative) or mixed numerical-categorical, defining, detecting, and measuring interactions is not a simple task. In this work, based on an entropy-based correlation measure for n nominal variables, the Multivariate Symmetrical Uncertainty (MSU), we propose a formal and broader definition of interaction among variables. Two series of experiments are presented. In the first series, we observe that datasets in which some record types or combinations of categories are absent, forming patterns of records, often display interactions among their attributes. In the second series, the interaction/non-interaction behavior of a regression model (built entirely on continuous variables) is successfully replicated under a discretized version of the dataset, which shows an interaction-wise correspondence between the continuous and the discretized versions of the dataset. Hence, we demonstrate that the proposed definition of interaction enabled by the MSU is a valuable tool for detecting and measuring interactions within linear and non-linear models.
Keywords: categorical data; gain in multiple correlation; interaction; intrinsic interaction; multivariable correlation; multivariate symmetrical uncertainty; patterned data
Year: 2021 PMID: 35052090 PMCID: PMC8774864 DOI: 10.3390/e24010064
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
MSU values of the 3-way XOR, with a minimum of 0.5 and a maximum of 0.75. Here C = A ⨁ B, where ⨁ represents the XOR operation.
| A | B | C | 3-Way Collective | P | 3-Way ABC | 1-Way A | 1-Way B | 1-Way C |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 000 | 0.25 | −0.5 | | | |
| 0 | 1 | 1 | 011 | 0.25 | −0.5 | −0.5 | −0.5 | −0.5 |
| 1 | 0 | 1 | 101 | 0.25 | −0.5 | | | |
| 1 | 1 | 0 | 110 | 0.25 | −0.5 | −0.5 | −0.5 | −0.5 |
| Entropy: | | | | | 2 | 1 | 1 | 1 |
| MSU: | | | | | 0.5 | | | |

| A | B | C | 3-Way Collective | P | 3-Way ABC | 1-Way A | 1-Way B | 1-Way C |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 000 | 0.25 | −0.5 | | | |
| 0 | 1 | 1 | 011 | 1.00 × 10 | −2.66 × 10 | −0.5 | −0.31 | −5.30 × 10 |
| 1 | 0 | 1 | 101 | 1.00 × 10 | −2.66 × 10 | | | |
| 1 | 1 | 0 | 110 | 0.75 | −0.311 | −0.311 | −0.5 | 0. |
| Entropy: | | | | | 0.811 | 0.811 | 0.811 | 5.30 × 10 |
| MSU: | | | | | 0.75 | | | |
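The two MSU values in this table can be checked numerically. The sketch below assumes the usual form of the measure, MSU = n/(n−1) · (1 − H(X₁,…,Xₙ) / Σ H(Xᵢ)), which is not stated in this excerpt but is consistent with the tabulated entropies, and it reads the 3-Way and 1-Way entries as the terms P log₂ P. The function names are illustrative, and `eps` is a hypothetical stand-in for the small probability whose exponent is not legible in the table.

```python
from collections import defaultdict
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def msu(joint):
    """MSU over all attributes of `joint`, a dict {record tuple: weight}.

    Assumed formula: n/(n-1) * (1 - H(joint) / sum of marginal entropies).
    """
    total = sum(joint.values())
    n = len(next(iter(joint)))
    h_joint = entropy(w / total for w in joint.values())
    h_marginals = 0.0
    for i in range(n):
        marginal = defaultdict(float)
        for record, w in joint.items():
            marginal[record[i]] += w / total
        h_marginals += entropy(marginal.values())
    return n / (n - 1) * (1 - h_joint / h_marginals)


# Equal likelihoods over the four XOR records (A, B, C = A xor B).
uniform = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

# Skewed case: two records are almost never observed.
eps = 1e-80  # hypothetical value, not taken from the table
skewed = {(0, 0, 0): 0.25, (0, 1, 1): eps, (1, 0, 1): eps, (1, 1, 0): 0.75}

print(f"uniform XOR: {msu(uniform):.3f}")
print(f"skewed XOR:  {msu(skewed):.3f}")
```

With equal likelihoods the joint entropy is 2 bits against a marginal sum of 3 bits, which gives 3/2 · (1 − 2/3) = 0.5, the minimum reported in the caption.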
MSU values of the 4-way XOR, with a minimum of 1/3 and a maximum of 0.746. Here D = A ⨁ B ⨁ C.
| A | B | C | D | 4-Way Collective | P | 4-Way ABCD | 1-Way A | 1-Way B | 1-Way C | 1-Way D |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0000 | 0.125 | −0.375 | | | | |
| 0 | 0 | 1 | 1 | 0011 | 0.125 | −0.375 | | | | |
| 0 | 1 | 0 | 1 | 0101 | 0.125 | −0.375 | | | | |
| 0 | 1 | 1 | 0 | 0110 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.5 |
| 1 | 0 | 0 | 1 | 1001 | 0.125 | −0.375 | | | | |
| 1 | 0 | 1 | 0 | 1010 | 0.125 | −0.375 | | | | |
| 1 | 1 | 0 | 0 | 1100 | 0.125 | −0.375 | | | | |
| 1 | 1 | 1 | 1 | 1111 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.5 |
| Entropy: | | | | | | 3 | 1 | 1 | 1 | 1 |
| MSU: | | | | | | 0.333 | | | | |

| A | B | C | D | 4-Way Collective | P | 4-Way ABCD | 1-Way A | 1-Way B | 1-Way C | 1-Way D |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0000 | 1.000 | 0.000 | | | | |
| 0 | 0 | 1 | 1 | 0011 | 1.00 × 10 | −2.66 × 10 | | | | |
| 0 | 1 | 0 | 1 | 0101 | 1.00 × 10 | −2.66 × 10 | | | | |
| 0 | 1 | 1 | 0 | 0110 | 1.00 × 10 | −2.66 × 10 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 0 | 1 | 1001 | 1.00 × 10 | −2.66 × 10 | | | | |
| 1 | 0 | 1 | 0 | 1010 | 1.00 × 10 | −2.66 × 10 | | | | |
| 1 | 1 | 0 | 0 | 1100 | 1.00 × 10 | −2.66 × 10 | | | | |
| 1 | 1 | 1 | 1 | 1111 | 1.00 × 10 | −2.66 × 10 | −1.06 × 10 | −1.06 × 10 | −1.06 × 10 | −1.06 × 10 |
| Entropy: | | | | | | −1.86 × 10 | −1.06 × 10 | −1.06 × 10 | −1.06 × 10 | −1.06 × 10 |
| MSU: | | | | | | 0.746 | | | | |
MSU values of the 4-way AND, with a minimum of 0.2045 and a maximum of 1. Here D = A ∧ B ∧ C.
| A | B | C | D | 4-Way Collective | P | 4-Way ABCD | 1-Way A | 1-Way B | 1-Way C | 1-Way D |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0000 | 0.125 | −0.375 | | | | |
| 0 | 0 | 1 | 1 | 0011 | 0.125 | −0.375 | | | | |
| 0 | 1 | 0 | 1 | 0101 | 0.125 | −0.375 | | | | |
| 0 | 1 | 1 | 0 | 0110 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.169 |
| 1 | 0 | 0 | 1 | 1001 | 0.125 | −0.375 | | | | |
| 1 | 0 | 1 | 0 | 1010 | 0.125 | −0.375 | | | | |
| 1 | 1 | 0 | 0 | 1100 | 0.125 | −0.375 | | | | |
| 1 | 1 | 1 | 1 | 1111 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.375 |
| Entropy: | | | | | | 3 | 1 | 1 | 1 | 0.544 |
| MSU: | | | | | | 0.205 | | | | |
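The equal-likelihood minima of the two 4-way patterns can be compared side by side. This is a minimal sketch under two assumptions that are not spelled out in this excerpt: that the measure has the form MSU = n/(n−1) · (1 − H(X₁,…,Xₙ) / Σ H(Xᵢ)), and that the fourth attribute is D = A ⨁ B ⨁ C for XOR and D = A ∧ B ∧ C for AND, with the eight (A, B, C) combinations equally likely. The helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def msu(joint):
    """Assumed formula: n/(n-1) * (1 - H(joint) / sum of marginal entropies)."""
    total = sum(joint.values())
    n = len(next(iter(joint)))
    h_joint = entropy(w / total for w in joint.values())
    h_marginals = 0.0
    for i in range(n):
        marginal = defaultdict(float)
        for record, w in joint.items():
            marginal[record[i]] += w / total
        h_marginals += entropy(marginal.values())
    return n / (n - 1) * (1 - h_joint / h_marginals)


def uniform_pattern(rule):
    """Eight equally likely records (A, B, C, D) with D = rule(A, B, C)."""
    return {(a, b, c, rule(a, b, c)): 1 / 8
            for a, b, c in product((0, 1), repeat=3)}


xor4 = uniform_pattern(lambda a, b, c: a ^ b ^ c)  # assumed XOR rule
and4 = uniform_pattern(lambda a, b, c: a & b & c)  # assumed AND rule

print(f"4-way XOR, equal likelihoods: {msu(xor4):.4f}")
print(f"4-way AND, equal likelihoods: {msu(and4):.4f}")
```

Under the AND rule only one record in eight has D = 1, so the entropy of D is about 0.544 bits rather than 1 bit; that smaller marginal sum is what lowers the AND value relative to XOR.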
Comparative behavior of MSU for some patterns.
| Name | n | c | k | Probability Distribution | Partial MSU Values | Global MSU |
|---|---|---|---|---|---|---|
| XOR | 3 | 2 | 4 | Equal likelihoods | MSU(AC) = 0 | MSU(ABC) = 0.5 |
| | | | | | MSU(BC) = 0 | |
| | 3 | 2 | 4 | 0.25; 1.00 × | MSU(AC) = 0 | MSU(ABC) = 0.75 |
| | | | | | MSU(BC) = 0 | |
| XOR | 4 | 2 | 8 | Equal likelihoods | MSU(AD) = 0 | MSU(ABCD) = 0.333 |
| | | | | | MSU(BD) = 0 | |
| | | | | | MSU(CD) = 0 | |
| | 4 | 2 | 8 | 1; 1.00 × | MSU(AD) = 0.371 | MSU(ABCD) = 0.746 |
| | | | | | MSU(BD) = 0.371 | |
| | | | | | MSU(CD) = 0.371 | |
| AND | 3 | 2 | 4 | Equal likelihoods | MSU(AC) = 0.258 | MSU(ABC) = 0.433 |
| | | | | | MSU(BC) = 0.258 | |
| | 3 | 2 | 4 | 0.25; 1.00 × | MSU(AC) = 0.75 | MSU(ABC) = 1 |
| | | | | | MSU(BC) = 0.75 | |
| AND | 4 | 2 | 8 | Equal likelihoods | MSU(AD) = 0.179 | MSU(ABCD) = 0.205 |
| | | | | | MSU(BD) = 0.179 | |
| | | | | | MSU(CD) = 0.179 | |
| | 4 | 2 | 8 | 0.2; 1.00 × | MSU(AD) = 1 | MSU(ABCD) = 1 |
| | | | | | MSU(BD) = 1 | |
| | | | | | MSU(CD) = 1 | |
| OR | 3 | 2 | 4 | 1.00 × | MSU(AC) = 0 | MSU(ABC) = 0 |
| | | | | | MSU(BC) = 0.654 | |
| | 3 | 2 | 4 | Equal likelihoods | MSU(AC) = 0.344 | MSU(ABC) = 0.433 |
| | | | | | MSU(BC) = 0.344 | |
| | 3 | 2 | 4 | 0.4; 1.00 × | MSU(AC) = 1 | MSU(ABC) = 1 |
| | | | | | MSU(BC) = 1 | |
| OR | 4 | 2 | 8 | 1.00 × | MSU(AD) = 0 | MSU(ABCD) = 0.005 |
| | | | | 0.009; 0.01; 0.125; | MSU(BD) = 0 | |
| | | | | 0.125; 0.729 | MSU(CD) = 0 | |
| | 4 | 2 | 8 | Equal likelihoods | MSU(AD) = 0.179 | MSU(ABCD) = 0.205 |
| | | | | | MSU(BD) = 0.179 | |
| | | | | | MSU(CD) = 0.179 | |
| | 4 | 2 | 8 | 0.2; 1.00 × | MSU(AD) = 1 | MSU(ABCD) = 1 |
| | | | | | MSU(BD) = 1 | |
| | | | | | MSU(CD) = 1 | |
| | 3 | 2 | 4 | 1.00 × | MSU(AC) = 0 | MSU(ABC) = 0 |
| | | | | | MSU(BC) = 0.654 | |
| | 3 | 2 | 4 | 1.00 × | MSU(AC) = 0 | MSU(ABC) = 0.75 |
| | | | | | MSU(BC) = 1 | |
n = Number of attributes; c = Cardinality of each attribute (all of them equal c); k = Number of record configurations in sample.
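The XOR rows with equal likelihoods show a dependence that is invisible to every pair of attributes but visible to the full set. The sketch below computes partial MSU values by marginalizing the joint distribution onto a subset of columns. It assumes the formula MSU = n/(n−1) · (1 − H(X₁,…,Xₙ) / Σ H(Xᵢ)) applied to the selected attributes; the function signature and column indices are illustrative choices, not part of the source.

```python
from collections import defaultdict
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def msu(joint, cols):
    """MSU of the attributes at positions `cols` (assumed formula).

    `joint` maps full record tuples to weights; the records are first
    marginalized onto the chosen columns.
    """
    total = sum(joint.values())
    sub = defaultdict(float)
    for record, w in joint.items():
        sub[tuple(record[i] for i in cols)] += w / total
    h_joint = entropy(sub.values())
    h_marginals = 0.0
    for j in range(len(cols)):
        marginal = defaultdict(float)
        for key, p in sub.items():
            marginal[key[j]] += p
        h_marginals += entropy(marginal.values())
    n = len(cols)
    return n / (n - 1) * (1 - h_joint / h_marginals)


A, B, C = 0, 1, 2
xor3 = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

print("MSU(AC)  =", round(msu(xor3, (A, C)), 3))
print("MSU(BC)  =", round(msu(xor3, (B, C)), 3))
print("MSU(ABC) =", round(msu(xor3, (A, B, C)), 3))
```

Each pair of attributes is uniformly distributed over its four combinations, so the pairwise values vanish while the three attributes together still reach 0.5.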
Original Body Fat Data.
| | st.c | mc.c | bf |
|---|---|---|---|
| 1 | −5.805 | 1.48 | 11.9 |
| 2 | −0.605 | 0.58 | 22.8 |
| 3 | 5.395 | 9.38 | 18.7 |
| 4 | 4.495 | 3.48 | 20.1 |
| 5 | −6.205 | 3.28 | 12.9 |
| 6 | 0.295 | −3.92 | 21.7 |
| 7 | 6.095 | −0.02 | 27.1 |
| 8 | 2.595 | 2.98 | 25.4 |
| 9 | −3.205 | −4.42 | 21.3 |
| 10 | 0.195 | −2.82 | 19.3 |
| 11 | 5.795 | 2.38 | 25.4 |
| 12 | 5.095 | 0.68 | 27.2 |
| 13 | −6.605 | −4.62 | 11.7 |
| 14 | −5.605 | 0.98 | 17.8 |
| 15 | −10.705 | −6.32 | 12.8 |
| 16 | 4.195 | 2.48 | 23.9 |
| 17 | 2.395 | −1.92 | 22.6 |
| 18 | 4.895 | −3.02 | 25.4 |
| 19 | −2.605 | −0.52 | 14.8 |
| 20 | −0.105 | −0.12 | 21.1 |
Original Body Fat Data discretized.
| | dst | dmc | dbf |
|---|---|---|---|
| 1 | low | high | low |
| 2 | med | med | high |
| 3 | high | high | low |
| 4 | high | high | med |
| 5 | low | high | low |
| 6 | med | low | med |
| 7 | high | med | high |
| 8 | med | high | high |
| 9 | low | low | med |
| 10 | med | low | med |
| 11 | high | high | high |
| 12 | high | med | high |
| 13 | low | low | low |
| 14 | low | med | low |
| 15 | low | low | low |
| 16 | high | high | high |
| 17 | med | low | med |
| 18 | high | low | high |
| 19 | low | med | low |
| 20 | med | med | med |
Pattern 1 from body fat regression and empirical finding of its lowest MSU value.
| dst | dmc | dbf | P | P log₂ P | 1-Way dst | 1-Way dmc | 1-Way dbf |
|---|---|---|---|---|---|---|---|
| low | low | low | 0.027 | −0.141 | −0.302 | −0.360 | −0.390 |
| low | low | med | 0.027 | −0.141 | | | |
| low | med | low | 0.008 | −0.054 | | | |
| low | high | low | 0.023 | −0.126 | | | |
| med | low | med | 0.015 | −0.093 | −0.228 | −0.194 | −0.530 |
| med | med | med | 0.008 | −0.054 | | | |
| med | high | high | 0.023 | −0.126 | | | |
| high | low | high | 0.046 | −0.205 | −0.186 | −0.209 | −0.507 |
| high | med | high | 0.019 | −0.110 | | | |
| high | high | low | 0.077 | −0.285 | | | |
| high | high | med | 0.332 | −0.528 | | | |
| high | high | high | 0.386 | −0.530 | | | |
| Entropy: | | | | 2.448 | 0.716 | 0.763 | 1.428 |
| MSU: | | | | 0.237 | | | |
Figure 2. Moving a few body fat data points to produce an interaction: on a graph of bf as a function of the product st.c × mc.c, six points were moved to induce interaction in the linear regression.
Modified Body Fat Data with Interaction.
| | st.c | mc.c | bf.mod |
|---|---|---|---|
| 1 | −5.805 | 1.48 | 11.9 |
| 2 | −0.605 | 0.58 | 22.8 |
| 3 | 5.395 | 9.38 | 31 |
| 4 | 4.495 | 3.48 | 20.1 |
| 5 | −6.205 | 3.28 | 12.9 |
| 6 | 0.295 | −3.92 | 21.7 |
| 7 | 6.095 | −0.02 | 24 |
| 8 | 2.595 | 2.98 | 25.4 |
| 9 | −3.205 | −4.42 | 21.3 |
| 10 | 0.195 | −2.82 | 19.3 |
| 11 | 5.795 | 2.38 | 25.4 |
| 12 | 5.095 | 0.68 | 22 |
| 13 | −6.605 | −4.62 | 28 |
| 14 | −5.605 | 0.98 | 17.8 |
| 15 | −10.705 | −6.32 | 32 |
| 16 | 4.195 | 2.48 | 23.9 |
| 17 | 2.395 | −1.92 | 22.6 |
| 18 | 4.895 | −3.02 | 17 |
| 19 | −2.605 | −0.52 | 14.8 |
| 20 | −0.105 | −0.12 | 21.1 |
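One way to see the effect of the six moved points is to fit the regression with and without a product term on both versions of the data. The sketch below assumes the models are bf ~ st.c + mc.c and bf ~ st.c + mc.c + st.c·mc.c; this excerpt does not give the exact model specification, so both the design and the small pure-Python least-squares solver (`ols`, `r_squared`) are illustrative.

```python
st = [-5.805, -0.605, 5.395, 4.495, -6.205, 0.295, 6.095, 2.595, -3.205, 0.195,
      5.795, 5.095, -6.605, -5.605, -10.705, 4.195, 2.395, 4.895, -2.605, -0.105]
mc = [1.48, 0.58, 9.38, 3.48, 3.28, -3.92, -0.02, 2.98, -4.42, -2.82,
      2.38, 0.68, -4.62, 0.98, -6.32, 2.48, -1.92, -3.02, -0.52, -0.12]
bf = [11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
      25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1]
bf_mod = [11.9, 22.8, 31, 20.1, 12.9, 21.7, 24, 25.4, 21.3, 19.3,
          25.4, 22, 28, 17.8, 32, 23.9, 22.6, 17, 14.8, 21.1]


def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(X, y, beta):
    """Coefficient of determination of the fitted linear model."""
    mean = sum(y) / len(y)
    ss_tot = sum((v - mean) ** 2 for v in y)
    ss_res = sum((v - sum(b * x for b, x in zip(beta, row))) ** 2
                 for row, v in zip(X, y))
    return 1 - ss_res / ss_tot


additive = [[1.0, s, m] for s, m in zip(st, mc)]
with_product = [[1.0, s, m, s * m] for s, m in zip(st, mc)]

results = {}
for name, y in (("bf", bf), ("bf.mod", bf_mod)):
    b_add = ols(additive, y)
    b_int = ols(with_product, y)
    results[name] = (r_squared(additive, y, b_add),
                     r_squared(with_product, y, b_int),
                     b_int[3])
    print(f"{name}: R2 additive={results[name][0]:.3f}, "
          f"R2 with product={results[name][1]:.3f}, "
          f"product coefficient={results[name][2]:.4f}")
```

Comparing the gain in R² from adding the product term on `bf` and on `bf.mod` gives a continuous-data counterpart to the MSU-based comparison of the two discretized patterns.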
Modified Body Fat Data discretized. Superscript symbol o denotes recategorized data because of modified cutoff values. Superscript symbol * denotes underlying numerical value modified to produce interaction.
| | dst | dmc | dbf |
|---|---|---|---|
| 1 | low | high | low |
| 2 | med | med | med o |
| 3 | high | high | high * |
| 4 | high | high | low o |
| 5 | low | high | low |
| 6 | med | low | med |
| 7 | high | med | high * |
| 8 | med | high | high |
| 9 | low | low | med |
| 10 | med | low | low o |
| 11 | high | high | high |
| 12 | high | med | med * |
| 13 | low | low | high * |
| 14 | low | med | low |
| 15 | low | low | high * |
| 16 | high | high | high |
| 17 | med | low | med |
| 18 | high | low | low * |
| 19 | low | med | low |
| 20 | med | med | med |
Pattern 2 from body fat regression and empirical finding of its lowest MSU value.
| dst | dmc | dbf | P | P log₂ P | 1-Way dst | 1-Way dmc | 1-Way dbf |
|---|---|---|---|---|---|---|---|
| low | low | med | 0.04 | −0.185 | −0.523 | −0.521 | −0.468 |
| low | low | high | 0.06 | −0.244 | | | |
| low | med | low | 0.08 | −0.292 | | | |
| low | high | low | 0.13 | −0.383 | | | |
| med | low | low | 0.06 | −0.244 | −0.435 | −0.494 | −0.423 |
| med | low | med | 0.03 | −0.152 | | | |
| med | med | med | 0.03 | −0.152 | | | |
| med | high | high | 0.05 | −0.216 | | | |
| high | low | low | 0.11 | −0.350 | −0.491 | −0.515 | −0.514 |
| high | med | med | 0.06 | −0.244 | | | |
| high | med | high | 0.07 | −0.269 | | | |
| high | high | low | 0.18 | −0.445 | | | |
| high | high | high | 0.1 | −0.332 | | | |
| Entropy: | | | | 3.506 | 1.449 | 1.530 | 1.406 |
| MSU: | | | | 0.301 | | | |
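The Entropy and MSU rows of the Pattern 2 table can be recomputed from its thirteen record probabilities. The sketch assumes the formula MSU = n/(n−1) · (1 − H(X₁,…,Xₙ) / Σ H(Xᵢ)), which is not stated in this excerpt, and takes the column order of the records to be (dst, dmc, dbf); the variable and function names are illustrative.

```python
from collections import defaultdict
from math import log2

# Record probabilities as listed in the Pattern 2 table, keyed by (dst, dmc, dbf).
pattern2 = {
    ("low", "low", "med"): 0.04, ("low", "low", "high"): 0.06,
    ("low", "med", "low"): 0.08, ("low", "high", "low"): 0.13,
    ("med", "low", "low"): 0.06, ("med", "low", "med"): 0.03,
    ("med", "med", "med"): 0.03, ("med", "high", "high"): 0.05,
    ("high", "low", "low"): 0.11, ("high", "med", "med"): 0.06,
    ("high", "med", "high"): 0.07, ("high", "high", "low"): 0.18,
    ("high", "high", "high"): 0.10,
}


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def marginal_entropies(joint):
    """Entropy of each attribute, obtained by summing out the others."""
    n = len(next(iter(joint)))
    result = []
    for i in range(n):
        marginal = defaultdict(float)
        for record, p in joint.items():
            marginal[record[i]] += p
        result.append(entropy(marginal.values()))
    return result


h_joint = entropy(pattern2.values())
h_dst, h_dmc, h_dbf = marginal_entropies(pattern2)
n = 3
# Assumed MSU formula: n/(n-1) * (1 - H(joint) / sum of marginal entropies).
msu_value = n / (n - 1) * (1 - h_joint / (h_dst + h_dmc + h_dbf))

print(f"H(joint)={h_joint:.3f}  H(dst)={h_dst:.3f}  "
      f"H(dmc)={h_dmc:.3f}  H(dbf)={h_dbf:.3f}  MSU={msu_value:.3f}")
```

Each 1-Way entry in the table is one term P log₂ P of the corresponding marginal distribution, so the three terms in a column sum (in absolute value) to that attribute's entropy.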
Comparative behavior of MSU and interaction for two discretized patterns.
| Name | n | c | k | Record Frequencies | Partial MSU Values | Global MSU | Interaction |
|---|---|---|---|---|---|---|---|
| Pattern 1 | 3 | 3 | 13 | 7, 7, 2, 6, 4, 2, 2, 6, 12, 5, 20, 86, 100 | MSU(dst, dbf) = 0.142 | MSU(dst, dmc, dbf) = 0.237 | 0.095 |
| | | | | | MSU(dmc, dbf) = 0.012 | | |
| | 3 | 3 | 13 | 2, 1, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 2 (original observations) | MSU(dst, dbf) = 0.441 | MSU(dst, dmc, dbf) = 0.367 | |
| | | | | | MSU(dmc, dbf) = 0.097 | | |
| | 3 | 3 | 13 | Equal frequencies | MSU(dst, dbf) = 0.312 | MSU(dst, dmc, dbf) = 0.326 | 0.014 |
| | | | | | MSU(dmc, dbf) = 0.043 | | |
| Pattern 2 | 3 | 3 | 13 | 4, 6, 8, 13, 6, 3, 3, 5, 11, 6, 7, 18, 10 | MSU(dst, dbf) = 0.037 | MSU(dst, dmc, dbf) = 0.301 | 0.176 |
| | | | | | MSU(dmc, dbf) = 0.124 | | |
| | 3 | 3 | 13 | 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 3 (original observations) | MSU(dst, dbf) = 0.152 | MSU(dst, dmc, dbf) = 0.367 | 0.206 |
| | | | | | MSU(dmc, dbf) = 0.161 | | |
| | 3 | 3 | 13 | Equal frequencies | MSU(dst, dbf) = 0.043 | MSU(dst, dmc, dbf) = 0.326 | 0.186 |
| | | | | | MSU(dmc, dbf) = 0.141 | | |
n = Number of attributes; c = Cardinality of each attribute (all of them equal c); k = Number of record configurations in sample.
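The Interaction column appears consistent with subtracting the largest listed partial MSU from the global MSU. That reading is inferred from the numbers in the table and is not quoted from a definition in this excerpt, so the sketch below should be taken as an assumption, as should the MSU formula n/(n−1) · (1 − H(X₁,…,Xₙ) / Σ H(Xᵢ)). It uses the first row of Pattern 2 record frequencies, matched in order to the thirteen Pattern 2 records (dst, dmc, dbf).

```python
from collections import defaultdict
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def msu(joint, cols):
    """MSU of the attributes at positions `cols` (assumed formula)."""
    total = sum(joint.values())
    sub = defaultdict(float)
    for record, w in joint.items():
        sub[tuple(record[i] for i in cols)] += w / total
    h_joint = entropy(sub.values())
    h_marginals = 0.0
    for j in range(len(cols)):
        marginal = defaultdict(float)
        for key, p in sub.items():
            marginal[key[j]] += p
        h_marginals += entropy(marginal.values())
    n = len(cols)
    return n / (n - 1) * (1 - h_joint / h_marginals)


# Pattern 2 records with their frequencies (first Pattern 2 row of the table).
pattern2_counts = {
    ("low", "low", "med"): 4, ("low", "low", "high"): 6,
    ("low", "med", "low"): 8, ("low", "high", "low"): 13,
    ("med", "low", "low"): 6, ("med", "low", "med"): 3,
    ("med", "med", "med"): 3, ("med", "high", "high"): 5,
    ("high", "low", "low"): 11, ("high", "med", "med"): 6,
    ("high", "med", "high"): 7, ("high", "high", "low"): 18,
    ("high", "high", "high"): 10,
}

DST, DMC, DBF = 0, 1, 2
global_msu = msu(pattern2_counts, (DST, DMC, DBF))
partial_dst_dbf = msu(pattern2_counts, (DST, DBF))
partial_dmc_dbf = msu(pattern2_counts, (DMC, DBF))

# Assumed reading of the table: interaction = global MSU - largest partial MSU.
interaction = global_msu - max(partial_dst_dbf, partial_dmc_dbf)

print(f"global={global_msu:.3f}  MSU(dst,dbf)={partial_dst_dbf:.3f}  "
      f"MSU(dmc,dbf)={partial_dmc_dbf:.3f}  interaction={interaction:.3f}")
```

Under this reading, a pattern has a large interaction when the three attributes together are much more strongly associated than the best of the listed pairs.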